GFO

Guide

How GFO works

GFO scores predictions made in consultancy reports against the evidence available today. It does not replace judgment; it shows whether the call was credible and on track.

The workflow

  1. Submit a report. Paste text, drop a URL, or upload a PDF. GFO detects the firm, title, and year automatically.
  2. Confirm predictions. Claude extracts candidate claims. You keep, edit, or remove the ones worth scoring.
  3. Evaluate. For each prediction, Claude searches the web, gathers sources, and returns a credibility verdict.
  4. Share. Each report and each prediction has a public URL. Print, email, or keep private.

The three sub-scores

Each completed verdict produces three scores, each on a 0 to 100 scale. Higher means more credible.

Specificity
How falsifiable was the claim? 100 means a precise, dated, quantitative prediction. 50 means directional or vaguely timed. 0 means an unfalsifiable platitude that cannot be checked.
Accuracy
Did the predicted thing happen by today? 100 means clearly yes, 0 means clearly no, 50 means mixed or partial. This is the most heavily weighted of the three.
Calibration
Were the magnitude and timing right? 100 means right number and right date. 0 means off by an order of magnitude or many years.

The composite Credibility score is a weighted average. Default weights are 20 percent Specificity, 60 percent Accuracy, 20 percent Calibration. Admins can change the weights.
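As a minimal sketch of that weighted average, assuming the weights are expressed as percentages and the result stays on the 0 to 100 scale. The function name, parameter names, and the normalization by total weight are illustrative assumptions, not GFO's actual code.

```python
# Default weights from the guide, as percentages (hypothetical names).
DEFAULT_WEIGHTS = {"specificity": 20, "accuracy": 60, "calibration": 20}


def credibility(specificity, accuracy, calibration, weights=None):
    """Return the weighted average of the three sub-scores (0 to 100)."""
    weights = weights or DEFAULT_WEIGHTS
    scores = {
        "specificity": specificity,
        "accuracy": accuracy,
        "calibration": calibration,
    }
    for name, value in scores.items():
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Assumption: dividing by the total keeps the result on the 0 to 100
    # scale even if admin-set weights do not sum to exactly 100.
    return sum(weights[name] * scores[name] for name in scores) / total_weight


# Example: Specificity 80, Accuracy 50, Calibration 60 with default weights.
print(credibility(80, 50, 60))  # (20*80 + 60*50 + 20*60) / 100 = 58.0
```

With the default weights, a 10-point change in Accuracy moves the composite by 6 points, while the same change in Specificity or Calibration moves it by 2.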

Edge cases

Too early to judge
The prediction targets a date in the future, with no evidence yet to assess progress.
Unfalsifiable
The claim is too vague to score, like “AI will change everything”.
Insufficient evidence
Web search could not find enough credible sources to evaluate the claim.
Evaluation failed
The model could not produce a valid verdict on this run. Re-evaluate to try again.
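One way to picture these outcomes is as a small set of states next to the normal scored case. The sketch below is illustrative only: every name is hypothetical, and it assumes that edge-case outcomes carry no sub-scores, since the guide says only that completed verdicts produce them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SCORED = "scored"
    TOO_EARLY = "too early to judge"
    UNFALSIFIABLE = "unfalsifiable"
    INSUFFICIENT_EVIDENCE = "insufficient evidence"
    EVALUATION_FAILED = "evaluation failed"


@dataclass
class Verdict:
    outcome: Outcome
    # Sub-scores are assumed to be present only for SCORED verdicts.
    specificity: Optional[int] = None
    accuracy: Optional[int] = None
    calibration: Optional[int] = None

    def has_scores(self) -> bool:
        return self.outcome is Outcome.SCORED

    def should_retry(self) -> bool:
        # Only a failed run is described as worth re-evaluating right away.
        return self.outcome is Outcome.EVALUATION_FAILED


failed = Verdict(Outcome.EVALUATION_FAILED)
print(failed.has_scores(), failed.should_retry())  # False True
```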

Why scores can shift

Re-evaluating the same prediction can produce slightly different scores, for two reasons. First, web search returns different sources as the web changes. Second, large language models have some inherent sampling variance even at low temperature. Treat a single verdict as the best snapshot today, not as an absolute fact. Re-evaluating across different reference dates is the more interesting signal, because it shows the trajectory.
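To get a feel for how much a score moves between runs, the results of several re-evaluations can be summarized by their mean and range. This is a rough sketch, not a GFO feature: the function name is hypothetical and the scores in the example are made up for illustration.

```python
from statistics import mean


def summarize_runs(scores):
    """Summarize Credibility scores from repeated evaluations."""
    if not scores:
        raise ValueError("need at least one score")
    low, high = min(scores), max(scores)
    return {
        "mean": mean(scores),
        "low": low,
        "high": high,
        "spread": high - low,  # how far apart the runs landed
    }


# Hypothetical scores from three re-evaluations of one prediction.
summary = summarize_runs([58, 61, 56])
print(summary["low"], summary["high"], summary["spread"])  # 56 61 5
```

A small spread suggests the verdict is stable on today's evidence. A large one is a cue to read the sources behind each run before relying on any single number.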

Privacy and access

GFO is invite-only. Each submission you make is private by default, but on the submission page you can make it public so that anyone with the URL can view the verdicts. The admin controls the list of users with access.