Velari Research · May 2026

Introducing the European Screening Benchmark.

Velari helps investors find companies that commercial databases miss: the mid-market private firms that don't surface in standard screens. That only works if a research pipeline can read those companies as accurately as a human analyst would. The benchmark measures how close we are.

Velari is more accurate and more consistent than every frontier baseline tested: 94.4% per-criterion accuracy versus 87.5% (GPT-5.5) and 82.7% (Claude Opus 4.7).

The benchmark is a 21-criterion investment thesis applied to a 50-company sample, giving 1,050 ground-truth judgements, each hand-labelled by two independent annotators.

Baselines run one company per context, with maximum reasoning and native web search: the most favourable single-pass configuration. In real screening workflows analysts typically query frontier LLMs with many companies in a single conversation, where long-context and shared-context anchoring effects can degrade per-item accuracy. The gap reported here is therefore likely a conservative estimate of the real-world gap.

[Figure: Per-criterion accuracy, n = 1,050 (50 companies × 21 criteria). Velari research pipeline: 94.4%. GPT-5.5 (reasoning effort xhigh, OpenAI web_search): 87.5%. Claude Opus 4.7 (reasoning effort xhigh, Anthropic web_search): 82.7%. Models tested May 2026; re-run on each frontier release.]

Benchmark construction

Thesis. The benchmark uses a real 21-criterion pan-European investment thesis from an active partner fund, a screening rubric of the kind several of our partner funds currently apply to long-tail mid-market opportunities. It decomposes into a business-model group (5 inclusion criteria OR'd, 7 exclusion criteria AND-of-negations) and an ownership group (3 inclusion, 6 exclusion), combined with logical AND. Most criteria are not filterable by commercial databases: for example, “the company is primarily a software reseller or system integrator” is a judgement on the company's business model from public content, and “the company is majority founder- or family-owned and operated” requires registry research and intent inference on long-tail private firms. Full criterion text is included in the harness.
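The combination logic described above can be sketched in a few lines. This is a minimal illustration, not the harness code: the names `CriterionGroup` and `thesis_verdict` are hypothetical, and it assumes the ownership group combines its inclusion and exclusion criteria the same way the business-model group does (any inclusion satisfied, no exclusion satisfied).

```python
from dataclasses import dataclass


@dataclass
class CriterionGroup:
    """Binary per-criterion judgements for one group of the thesis (hypothetical type)."""
    inclusion: list[bool]  # True means the inclusion criterion is satisfied
    exclusion: list[bool]  # True means the exclusion criterion is satisfied

    def passes(self) -> bool:
        # Inclusion criteria are OR'd; exclusion criteria are an AND of negations,
        # which is the same as "no exclusion criterion is satisfied".
        return any(self.inclusion) and not any(self.exclusion)


def thesis_verdict(business_model: CriterionGroup, ownership: CriterionGroup) -> bool:
    # The two groups are combined with logical AND.
    return business_model.passes() and ownership.passes()


# Hypothetical company: one business-model inclusion hit, no exclusions triggered.
bm = CriterionGroup(inclusion=[False, True, False, False, False], exclusion=[False] * 7)
own = CriterionGroup(inclusion=[True, False, False], exclusion=[False] * 6)
print(thesis_verdict(bm, own))
```

Because the exclusion side is an AND of negations, a single wrongly satisfied exclusion criterion is enough to flip the verdict for the whole company.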

Companies. The 50 companies are drawn from a real screening workflow conducted by our partner. Each is a private-market, long-tail company; none are widely-covered public-market or unicorn names. We deliberately kept the sample as it arrived from the workflow, with no re-balancing for difficulty, so that the resulting label distribution reflects what a real screening pass looks like rather than a deliberately-curated stress test.

Ground truth. Each (company, criterion) pair was independently labelled by two annotators against canonical sources: official company website, registry filings (Companies House, equivalent national registries), and recent funding-round disclosures. Disagreements were adjudicated by a third reviewer with access to both annotators' evidence trails. Ground-truth labels are binary: each pair resolves to either satisfied or not satisfied; annotators were required to commit to a verdict from public evidence, with ambiguous cases resolved in adjudication. The full label distribution is included in the harness. This yields 50 × 21 = 1,050 ground-truth labels, and every system is evaluated on exactly the same set.

Methodology

Task. For each (company, criterion) pair, the system returns one of satisfied, not satisfied, or unknown. Each system receives only the company name and website URL and must retrieve all evidence itself from the public web at inference time; systems are free to issue any number of web searches and to read any pages they retrieve. No system sees ground-truth labels or pre-computed evidence.

Every criterion is presented as a positive statement (e.g. “the company is primarily a software reseller or system integrator”) and judged in isolation, so every per-criterion call is the same structural task regardless of whether the criterion contributes to the verdict as an inclusion or an exclusion. The tree structure that combines them is applied after per-criterion judgement and is never visible to the evaluator.

Baselines. Claude Opus 4.7 and GPT-5.5, the two strongest reasoning models available at time of writing, each run on the same prompt, thesis, and structured output schema as Velari, with no Velari-specific scaffolding:

Maximum reasoning effort. Both baselines run at the highest effort setting their provider exposes: effort=xhigh for Claude Opus 4.7, effort=xhigh for GPT-5.5.
Native web search. Anthropic's web_search for Claude, OpenAI's web_search for GPT-5.5: the configuration each provider has optimised against.

Scoring. Per-criterion accuracy is the fraction of all 1,050 calls where the system's prediction matches the adjudicated ground-truth label. Unknown predictions count as wrong regardless of the ground-truth label, so a system that abstained on every call would score zero. Confidence intervals are 95% Wilson score intervals; pairwise comparisons use McNemar's paired test on the 1,050 matched pairs, which controls for per-item difficulty by testing only discordant pairs.
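As a rough sketch of the scoring rule, the snippet below computes per-criterion accuracy with unknown counted as wrong, plus a 95% Wilson score interval. The function names and label strings are illustrative assumptions rather than the harness API, and the counts of 991 and 919 correct calls are back-calculated from the reported 94.4% and 87.5% of 1,050, not taken from the harness.

```python
import math


def per_criterion_accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of calls whose prediction equals the binary ground-truth label.

    Ground truth is only ever "satisfied" or "not_satisfied", so an "unknown"
    prediction can never match and is automatically counted as wrong.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Counts inferred from the reported percentages (assumption, see lead-in).
velari_ci = wilson_interval(991, 1050)
gpt_ci = wilson_interval(919, 1050)
print(velari_ci, gpt_ci)
```

Under these assumed counts the lower end of the Velari interval sits above the upper end of the GPT-5.5 interval, which is what non-overlapping intervals means here.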

Results

Headline per-criterion accuracies are shown in the chart at the top of this page. The 95% confidence intervals do not overlap between Velari and either baseline.

Accuracy by criterion category

We break per-criterion accuracy down by the four sub-categories of the thesis tree (inclusion or exclusion, business-model or ownership). Differences between buckets reflect properties of the criterion text itself: chiefly how recency-sensitive the underlying facts are, and how readily evidence appears in public sources.

[Figure: Per-criterion accuracy by category, scale 0–100%, with 95% Wilson CIs, for Velari, GPT-5.5, and Claude Opus 4.7 across BM inclusion (n = 250), BM exclusion (n = 350), ownership inclusion (n = 150), and ownership exclusion (n = 300). The plotted values are those in the table below.]

Category	n	Velari	GPT-5.5	Claude Opus 4.7
Business model, inclusion (5 criteria)	250	90.4%	81.2%	87.2%
Business model, exclusion (7 criteria)	350	96.0%	90.0%	81.7%
Ownership, inclusion (3 criteria)	150	91.3%	82.7%	68.7%
Ownership, exclusion (6 criteria)	300	97.3%	92.3%	87.0%

Per-criterion accuracy by category. n is the number of (company, criterion) calls in that category.
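As a consistency check, the category figures should pool back to the headline accuracies when weighted by n. The sketch below does that arithmetic on the rounded percentages from the table; the dictionary layout and the `pooled_accuracy` helper are illustrative only.

```python
# (n, Velari %, GPT-5.5 %, Claude Opus 4.7 %) per category, copied from the table.
CATEGORIES = {
    "Business model, inclusion": (250, 90.4, 81.2, 87.2),
    "Business model, exclusion": (350, 96.0, 90.0, 81.7),
    "Ownership, inclusion": (150, 91.3, 82.7, 68.7),
    "Ownership, exclusion": (300, 97.3, 92.3, 87.0),
}


def pooled_accuracy(column: int) -> float:
    """n-weighted mean of one system's category accuracies, rounded to one decimal."""
    total_n = sum(row[0] for row in CATEGORIES.values())
    weighted = sum(row[0] * row[column] for row in CATEGORIES.values())
    return round(weighted / total_n, 1)


for name, column in [("Velari", 1), ("GPT-5.5", 2), ("Claude Opus 4.7", 3)]:
    print(name, pooled_accuracy(column))
```

The pooled values come out at 94.4, 87.5, and 82.7, matching the headline per-criterion accuracies.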

Per-company accuracy

Aggregate per-criterion accuracy can hide an uneven distribution across companies. The chart below shows the per-company error distribution; the table beneath it shows the share of companies meeting each level of correctness. Velari leads at every level.

[Figure: Decision-level error distribution, as a fraction of the 50 companies (all 21 correct / off by 1 / off by 2+). Velari: 44% / 24% / 32%, mean 1.18 wrong per company. GPT-5.5: 10% / 24% / 66%, mean 2.62 wrong per company. Claude Opus 4.7: 16% / 18% / 66%, mean 3.64 wrong per company.]

A single load-bearing criterion error flips a screening verdict, so strict accuracy is the practical lower bound on decision quality.

Companies with	Velari	GPT-5.5	Claude Opus 4.7
21 of 21 correct	44%	10%	16%
20 of 21 or better	68%	34%	34%
19 of 21 or better	80%	54%	46%
18 of 21 or better	92%	76%	58%

Share of the 50 companies meeting each level of per-company correctness.
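The thresholds in this table can be computed from a per-company count of wrong criterion calls. The sketch below assumes such a list is available; `share_meeting_threshold` is a hypothetical helper and the error counts are made-up illustration data, not benchmark results.

```python
def share_meeting_threshold(
    errors_per_company: list[int], min_correct: int, n_criteria: int = 21
) -> float:
    """Share of companies with at least `min_correct` of `n_criteria` calls correct."""
    meeting = sum(1 for errors in errors_per_company if n_criteria - errors >= min_correct)
    return meeting / len(errors_per_company)


# Hypothetical error counts for five companies (illustration only).
example_errors = [0, 0, 1, 2, 5]

for min_correct in (21, 20, 19, 18):
    share = share_meeting_threshold(example_errors, min_correct)
    print(f"{min_correct} of 21 or better: {share:.0%}")
```

The top row (21 of 21) is the strict per-company accuracy; each lower row is cumulative, so the shares can only stay flat or rise as the threshold relaxes.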

Error structure

System	Precision (satisfied)	Recall (satisfied)	Abstention rate
Velari	84.5%	92.0%	0.6%
GPT-5.5	67.5%	85.8%	1.1%
Claude Opus 4.7	71.9%	75.1%	8.9%

Derived from per-system 2×3 confusion matrices.
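One plausible way to derive these three metrics from a 2×3 confusion matrix (two ground-truth classes by three possible predictions) is sketched below. The exact definitions used in the harness are not spelled out here, so treat these as assumptions: in particular, this sketch counts an unknown on a truly satisfied pair as a recall miss. The matrix values are hypothetical.

```python
# Keys are (ground truth, prediction). Counts are hypothetical illustration data.
confusion = {
    ("satisfied", "satisfied"): 8,
    ("satisfied", "not_satisfied"): 1,
    ("satisfied", "unknown"): 1,
    ("not_satisfied", "satisfied"): 2,
    ("not_satisfied", "not_satisfied"): 27,
    ("not_satisfied", "unknown"): 1,
}


def error_structure(matrix: dict[tuple[str, str], int]) -> dict[str, float]:
    """Precision and recall on the satisfied class, plus the abstention rate."""
    total = sum(matrix.values())
    true_positive = matrix[("satisfied", "satisfied")]
    predicted_satisfied = sum(n for (_, pred), n in matrix.items() if pred == "satisfied")
    actually_satisfied = sum(n for (truth, _), n in matrix.items() if truth == "satisfied")
    abstained = sum(n for (_, pred), n in matrix.items() if pred == "unknown")
    return {
        "precision": true_positive / predicted_satisfied,
        "recall": true_positive / actually_satisfied,  # unknowns count as misses (assumption)
        "abstention_rate": abstained / total,
    }


metrics = error_structure(confusion)
print(metrics)
```

Read this way, a high abstention rate lowers recall without affecting precision, since abstentions never enter the predicted-satisfied count.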

Statistical significance

McNemar's paired test treats each of the 1,050 (company, criterion) calls as a matched observation across systems and tests only discordant pairs: calls where one system is correct and the other wrong. This controls for per-item difficulty: easy criteria that both systems get right contribute nothing to the test.

Baseline	Velari wins	Baseline wins	χ²	p
GPT-5.5	94 (81.0%)	22 (19.0%)	43.5	<0.0001
Claude Opus 4.7	159 (81.5%)	36 (18.5%)	76.3	<0.0001

Continuity-corrected McNemar's test on the 1,050 matched (company, criterion) pairs. Each cell shows the discordant-pair count and its share of total discordant pairs for that comparison.
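The χ² values in the table can be reproduced from the discordant-pair counts alone. The sketch below uses the standard continuity-corrected McNemar statistic, (|b − c| − 1)² / (b + c), and the chi-squared survival function for one degree of freedom; the function name is illustrative and this is not the harness implementation.

```python
import math


def mcnemar_continuity_corrected(b: int, c: int) -> tuple[float, float]:
    """Continuity-corrected McNemar test on discordant-pair counts b and c.

    Returns (chi-squared statistic, p-value). For one degree of freedom the
    chi-squared survival function equals erfc(sqrt(x / 2)).
    """
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Discordant pairs from the table: (Velari wins, baseline wins).
for baseline, (b, c) in {"GPT-5.5": (94, 22), "Claude Opus 4.7": (159, 36)}.items():
    chi2, p = mcnemar_continuity_corrected(b, c)
    print(f"{baseline}: chi2 = {chi2:.1f}, p < 0.0001 is {p < 0.0001}")
```

Concordant pairs (both systems right, or both wrong) never enter the statistic, which is why only the two discordant counts are needed.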

Discussion

Two properties of this task should make it structurally hard for single-pass reasoning models. First, ownership criteria depend on recency-sensitive evidence (recent filings, funding announcements, acquisitions) that sits outside any pre-training cutoff and must be retrieved at inference time. Second, the 50 companies are private mid-market firms drawn from a real long-tail screening workflow, so a model's pre-training priors about who owns or operates each one are unreliable. Both properties push correctness towards source-grounded retrieval and away from memorisation. We expect the gap to narrow as frontier models acquire stronger multi-step tool-use capabilities, and we will re-run this benchmark on new frontier releases as part of the harness.

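For a fund running a single-thesis screen across a long-tail funnel of even 1,000 companies, the gap compounds. If the per-company error rates measured here carried over (a mean of 1.18 wrong criterion calls per company for Velari, against 2.62 for GPT-5.5 and 3.64 for Claude Opus 4.7), that would be roughly 1,180 wrong calls against 2,620 and 3,640. Closing the gap appears to be a retrieval and decomposition problem more than a reasoning-strength problem.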

Limitations

The benchmark consists of 50 companies and one investment thesis. That is enough to produce non-overlapping 95% confidence intervals on the headline accuracies and statistically significant pairwise comparisons, but it is a starting point rather than a final measurement. The sample size reflects the cost of hand-labelling ground truth, not a constraint of the Velari pipeline: each company is processed independently, so the pipeline scales horizontally to thousands of companies in parallel. We plan to expand the dataset across additional theses and additional companies in subsequent releases, and we invite partner funds interested in contributing theses to get in touch.

Several thesis criteria carry inherent definitional ambiguity (for example, “primarily recurring revenue” and “significant offering of proprietary software”), where two reasonable annotators can land on different sides of a borderline company. Adjudication held a single threshold across the 50 companies, but we expect a meaningful share of the residual disagreement that every system still incurs to sit on these borderline calls rather than on factual misses. We do not yet have a measurement of that irreducible floor; quantifying it would require a second independent annotation pass and is on the roadmap for the next release.

Conclusions

On a real investment thesis applied to 50 long-tail private companies, Velari outperforms the strongest single-pass frontier baselines at every level of correctness: 94.4% per-criterion accuracy versus 87.5% (GPT-5.5) and 82.7% (Claude Opus 4.7), and 44% strict per-company accuracy versus 10% and 16%. The per-criterion gap to each baseline is significant at p < 0.0001 by McNemar's paired test, with non-overlapping 95% confidence intervals on the headline accuracies. The advantage holds even with both baselines run at maximum reasoning effort and given native web search.

These results suggest that the bottleneck on AI-assisted screening of long-tail private markets is not raw reasoning capability, but retrieval and decomposition. Velari narrows that gap from the same starting inputs the frontier baselines receive: a company name, a website URL, and the public web.

Release

The evaluation harness is released publicly. It contains the scoring code, the structured-output schemas, the exact prompts used for each baseline, and a reproducible runner so that anyone can re-evaluate any system (Velari, frontier reasoning models, or any future pipeline) against the same labels.

The dataset (thesis text, 50-company sample, and 1,050 ground-truth labels with evidence trails) is available on request to research groups and partner funds. Email research@velarihq.com with a one-line description of your use case and we will share access.
