Three Gates Before We File

  • Gate One. A four-principles re-audit (P1–P4), scoped to the crash path.
  • Gate Two. Independent reproductions: harness replay and the production path.
  • Gate Three. Security Proxies: every rule cites a DOC.
  • The outcome: an FP rate under 5% across all submissions.

Most Fuzzer Crashes Are Not Worth Filing

Anyone who fuzzes for a living has seen this picture. You leave a harness running overnight, and ASan greets you in the morning with 24,000 crashing inputs. The vast majority are cluster siblings of the same root cause. Maybe a few dozen represent independent signatures, and of those, only a small fraction are real bugs the upstream maintainer will accept.

Get this step wrong and you become a wholesale supplier of automated report noise. So even though the LG plus reverse four-principles work from Part 2 already filters most errors at generation time, every crash a fuzzer produces still has to be verified. Verification is the last gate before upstream.

Our verification has three layers, ordered from cheap to expensive. The first is a four-principles re-audit on the specific crash path; reading source costs almost nothing, and any P-violation drops the crash on the spot. The second is two independent reproductions, which catches crashes that only occur inside the harness. The third is Security Proxies matched against the maintainer's threat model. Cognitive cost is highest here, but missing this layer is expensive when the maintainer fires back.

Gate One. Four Principles on the Crash Path

If the generation pipeline already runs the four principles once, why audit them again at verify time? Because harness generation is not a deterministic process. On a project with mature static analysis, the call graph, the LGs, and the API protocol all extract precisely, and harness quality stays steady. On a project where static analysis is thin, the pipeline falls back to letting an LLM do the semantic reasoning, and that step introduces randomness. Harness quality varies project by project. The reverse four-principles work in generation catches most source-level mistakes, but it does not catch all of them. So once a fuzzer produces a crash, we run the four principles a second time over the specific trace that produced it.

The key phrase for this gate is specific path. In Part 1 we framed the four principles as a whole-harness correctness condition. At verify time we narrow the scope. We no longer audit the whole harness, only the call path that produced this crash.

Operationally, the sanitizer trace is the starting point. We read the stack from LLVMFuzzerTestOneInput downward and ask, frame by frame, four questions.

  • P1, the harness itself. Does the harness code on this path use an uninitialized variable, hold a use-after-free pointer, or forget to free a resource? Did the fuzz bytes actually reach the crash site, or did harness-internal logic overwrite them first?
  • P2, the API contract. Are the library APIs on this path called in the right order, on objects in the right state, with arguments in the legal range? No set_params running before init_ex, no classic ordering inversion?
  • P3, the security boundary. Does this path reach a static function pulled out via extern, or skip the validation that a real entry point would have run? The test is not whether the harness uses internal symbols (we recommend public APIs but do not forbid internal ones). The test is whether the same crash can be reproduced from a harness that only calls public APIs. If yes, P3 passes. If no, the crash sits on a code path no real attacker can reach.
  • P4, the attack surface. Is the fuzz entry on this path part of the library's current public attack surface, or is it a deprecated demo entry that no longer ships?

Fail any one of the four and the crash is a harness artifact. We drop it. No P-violation crash is allowed to reach the next gate. Crashes that fail here are recorded in dropped_crashes with their reason and closed out.

This gate is cheap; it is mostly source reading. Its job is to catch the obvious harness-bug crashes early so the next two gates do not waste effort on noise.
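
The bookkeeping for this gate is small enough to sketch. The snippet below is a minimal illustration rather than our production tooling: it assumes the P1 through P4 verdicts for the crash path have already been produced by the reviewer, pulls the top frames out of the ASan log for the record, and enforces the drop-and-log rule. The helper names, the verdicts dict, and the file layout are hypothetical.

```python
import re
from datetime import date

import yaml  # PyYAML; the dropped_crashes log in this sketch is YAML

# ASan frames look like "    #3 0x55d2... in some_function /src/file.c:512:9".
FRAME_RE = re.compile(r"^\s*#\d+\s+0x[0-9a-f]+\s+in\s+(\S+)\s+(\S+)", re.MULTILINE)

def top_frames(asan_log: str, limit: int = 8) -> list[str]:
    """Top-of-stack frames, read from LLVMFuzzerTestOneInput downward."""
    return [f"{func} {loc}" for func, loc in FRAME_RE.findall(asan_log)[:limit]]

def gate_one(crash_id: str, asan_log: str, verdicts: dict, batch_id: str) -> bool:
    """verdicts maps "P1".."P4" to (passed, reason) as judged on this specific path.
    Returns True only if the crash may proceed to Gate Two."""
    violations = {p: reason for p, (passed, reason) in verdicts.items() if not passed}
    if not violations:
        return True
    record = {
        "crash_id": crash_id,
        "dropped_at": "gate1_four_principles",
        "violations": violations,
        "sanitizer_top_frames": top_frames(asan_log),
        "date": date.today().isoformat(),
    }
    # Any P-violation means harness artifact: drop it and keep the reason on file.
    with open(f"dropped_crashes/{batch_id}.yaml", "a") as fh:
        yaml.safe_dump([record], fh, sort_keys=False)
    return False
```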

Gate Two. Two Independent Reproductions

Passing P1 through P4 only tells us the harness code looks correct. It does not tell us the crash would ever appear in real-world usage. A crash that reproduces only inside the harness, but never under any normal calling pattern, is at best an academic curiosity. It is not a vulnerability.

So we require two independent reproductions that do not rely on each other.

Harness Replay

Feed the crashing input back into the original harness and run it under ASan multiple times. Every run must crash with the same top stack frames. This rules out two kinds of noise.

  • Flaky crashes. Non-deterministic failures triggered by scheduling, randomness, or whatever happens to live in uninitialized memory at the moment. The stack varies from run to run, so we drop them.
  • Non-deterministic paths. Global state inside the harness sends the same input down different execution paths each run. An unstable stack is an unstable fuzz signal, and we will not file it.

This step is simple but it solves a real integrity problem. A crash we file has to reproduce on the maintainer's machine. If our own replay is flaky, the maintainer's will definitely fail.
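
Scripted, the replay check can look roughly like this, assuming a libFuzzer-style harness binary that accepts the PoC file as its argument and an ASan build; the run count, frame depth, and helper names are illustrative.

```python
import re
import subprocess

FRAME_RE = re.compile(r"^\s*#\d+\s+0x[0-9a-f]+\s+in\s+(\S+)", re.MULTILINE)

def crash_signature(harness: str, poc: str, depth: int = 3):
    """Run the harness once on the PoC; return the top frames, or None if it did not crash."""
    proc = subprocess.run([harness, poc], capture_output=True, text=True)
    if "ERROR: AddressSanitizer" not in proc.stderr:
        return None
    return tuple(FRAME_RE.findall(proc.stderr)[:depth])

def stable_replay(harness: str, poc: str, runs: int = 10) -> bool:
    """Gate Two, first reproduction: every run must crash with the same top frames.
    Flaky crashes and path-unstable harness state both fail this check."""
    signatures = {crash_signature(harness, poc) for _ in range(runs)}
    return len(signatures) == 1 and None not in signatures
```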

Production Path

This is where the real filtering happens. Replay only confirms the fuzz output is stable. It does not rule out the harness pushing the library into a shape no real caller would ever produce. So we require the crash to fire once more, outside the harness, using a real calling pattern. Either of two routes works.

  • Route A. The library's own CLI, rebuilt with ASan. Many libraries ship official CLIs (dwebp, flatc, vpxdec, xmllint, upx, and so on), and those CLIs are the library's production mode. If the PoC crashes the CLI too, that is the gold standard. The CLI is production.
  • Route B. A minimal C or C++ program, around 30 lines, using only the library's public headers. No harness-private logic, no internal symbols, just an honest imitation of how a normal library user would call the API. If the minimal program also crashes, real consumers would crash too.

This is the most important FP killer. Most of the crashes that look real but were really the harness pushing the library into a strange shape do not survive this step. If neither route reproduces the crash, it is harness-only and we drop it.
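
Both routes reduce to the same mechanical question: run a real consumer of the library on the PoC and look for a sanitizer report. A minimal sketch, with the example command line as a placeholder rather than a real build path:

```python
import subprocess

def reproduces_in_production(cmd: list[str]) -> bool:
    """Run a real consumer of the library on the PoC and look for a crash.
    `cmd` is either Route A (an ASan-built official CLI plus the PoC path)
    or Route B (the compiled ~30-line public-API reproducer)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    killed_by_signal = proc.returncode < 0  # e.g. SIGSEGV or SIGABRT
    return "ERROR: AddressSanitizer" in proc.stderr or killed_by_signal

# Illustrative only: an ASan build of xmllint fed the PoC directly.
# if not reproduces_in_production(["./asan-build/xmllint", "poc.xml"]):
#     ...  # harness-only crash: drop it
```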

Every issue we file ships with a self-contained repro/ package containing build.sh, run.sh, generate_poc.py, the source of the 30-line reproducer, and the ASan log. The maintainer can rebuild from that package alone, with no pointers back into our repository. It looks like a small thing, but it reduces the maintainer's reproduction cost to nearly zero, and that is the heart of being friendly to upstream.
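
One cheap way to keep ourselves honest about "self-contained" is to rebuild the package from a scratch copy before filing, the way a maintainer would. A sketch: only build.sh, run.sh, and generate_poc.py come from the list above; reproducer.c and asan.log are placeholder names for the rest.

```python
import pathlib
import shutil
import subprocess
import tempfile

# build.sh / run.sh / generate_poc.py are the fixed names from the package above;
# reproducer.c and asan.log stand in for the reproducer source and the ASan log.
REQUIRED = ["build.sh", "run.sh", "generate_poc.py", "reproducer.c", "asan.log"]

def package_is_self_contained(repro_dir: str) -> bool:
    """Copy repro/ somewhere with no access to the rest of our tree, rebuild, re-run."""
    src = pathlib.Path(repro_dir)
    if any(not (src / name).exists() for name in REQUIRED):
        return False
    with tempfile.TemporaryDirectory() as tmp:
        work = pathlib.Path(tmp) / "repro"
        shutil.copytree(src, work)
        built = subprocess.run(["bash", "build.sh"], cwd=work).returncode == 0
        run = subprocess.run(["bash", "run.sh"], cwd=work, capture_output=True, text=True)
    return built and "AddressSanitizer" in (run.stderr + run.stdout)
```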

Gate Three. The Security Proxies

A crash that survives the first two gates is technically real. The harness is clean, replay is stable, the production path also crashes. That still does not make it a vulnerability. Every library has its own security contract. Under that contract, some scary-looking crashes are simply "things the caller is supposed to defend against," and filing them gets the issue closed.

This third gate handles the security-contract judgement that the previous two gates cannot. The four principles cover source-level correctness; reproduction covers fuzzer-only artifacts. Neither catches the FP that happens because the maintainer simply does not consider this class of bug a vulnerability.

One Security Proxy per Library

For every library we fuzz, we maintain a security_proxy.yaml that records, in one place,

  • which input paths the maintainer treats as trusted and which are attacker-controlled,
  • which APIs are part of the public attack surface (in-scope) and which are caller-internal only (out-of-scope),
  • which crash classes the maintainer has historically refused to treat as vulnerabilities,
  • and the caller-contract assumptions the maintainer has already published in SECURITY.md.

The hard rule. Every entry must cite an upstream document. The form is source: DOC:<path>, and the path has to point at a specific file in the upstream repository. That means SECURITY.md, comments in a public header, the official README, or an explicit statement from the maintainer in the issue tracker.

No interpretations of our own. If we want to add a rule but cannot find an upstream DOC for it, the rule does not go in. This conservative posture, requiring a citation for every claim, is what keeps us from quietly defining real bugs out of scope.
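
Mechanically, that hard rule is easy to enforce when the proxy is loaded. A sketch of the check; the top-level key names (trust_boundaries, api_scope, crash_class_policy) are modeled on the card below, not taken from a published schema:

```python
import yaml

def load_proxy(path: str) -> dict:
    """Load a security_proxy.yaml and refuse any rule without an upstream citation."""
    with open(path) as fh:
        proxy = yaml.safe_load(fh)
    rules = (proxy.get("trust_boundaries", [])
             + proxy.get("api_scope", [])
             + proxy.get("crash_class_policy", []))
    for rule in rules:
        source = rule.get("source", "")
        if not source.startswith("DOC:"):
            # No DOC, no rule: the entry is rejected, not silently kept.
            raise ValueError(f"rule {rule.get('id', '<unnamed>')} lacks a DOC citation")
    return proxy
```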

SECURITY PROXY: flatbuffers (v2, last_reviewed 2026-04-26, github.com/google/flatbuffers)

How to read this card
  • IN-SCOPE: the maintainer treats this as a real vulnerability, so file the crash.
  • OUT-OF-SCOPE: caller-contract or unsupported, so drop the crash.
  • ↳ DOC:<path>: mandatory upstream citation. No DOC, no rule.
  • feeds_into TB-N: a maintainer reply hardened rule TB-N.
  • WONTFIX: a crash class the maintainer has historically rejected.
When a candidate crash matches an OUT-OF-SCOPE rule, the gate at the bottom drops it before filing.

THREAT MODEL (how the maintainer treats untrusted input, in one paragraph)

Accessors do not bounds-check. Callers MUST run the Verifier on untrusted bytes before any GetRoot<T> / Get* / operator[] access. Reflection paths assume schemas come from the project itself.
↳ DOC:docs/source/CppUsage.md#L120

TRUST BOUNDARIES (case-by-case rules that decide whether a crash on a given path is filed or dropped)
  • TB-1, OUT-OF-SCOPE. Bytes passed to GetRoot<T> without a prior VerifyBuffer are caller-trusted. ↳ DOC:docs/source/CppUsage.md#L120-L142
  • TB-2, OUT-OF-SCOPE. Schema reflection assumes the .bfbs schema is project-controlled, not attacker input. ↳ DOC:include/flatbuffers/reflection.h#L45-L78
  • TB-3, IN-SCOPE. A crash on a buffer that passed Verify is a real bug; the Verifier is incomplete. ↳ DOC:docs/source/Tutorial.md#L312

API SCOPE (which APIs we treat as the real attack surface, vs. caller-internal helpers)
  • IN-SCOPE: Verifier::Verify*, GetRoot<T> after Verify, flatc CLI input
  • OUT-OF-SCOPE: reflection::* (TB-2), test/* harness helpers

CRASH-CLASS POLICY (crash categories the maintainer has historically rejected as not-a-vuln)
  • WONTFIX: OOM on adversarial schema. ↳ DOC:SECURITY.md#L24
  • WONTFIX: stack overflow from unbounded recursion in an unverified buffer. ↳ DOC:CppUsage.md#L155

OBSERVED REPLIES (real maintainer verdicts; each one feeds back to harden a rule above)
  • flatbuffers#9008, ACCEPTED, heap-overflow in FlexBuffers ToString, feeds_into TB-3. The PoC called VerifyBuffer first; ToString still read past a 4-byte heap allocation. Maintainer steadytao confirmed in 3 days and opened PR #9011, "Reject unterminated FlexBuffers keys during verification". Filed 2026-03-30, confirmed 2026-04-03, pov_called_verifier: true.
  • flatbuffers#9040, NO_VULNERABILITY, heap-overflow in Reflection VerifyObject, feeds_into TB-2. "A similar HBO was reported before with #8567. The vulnerability assessment resulted in NO_VULNERABILITY, because the user has to ensure that the data is from a trusted source and has integrity. The finding still got fixed." (kendegemaro, 2026-04-13; filed 2026-04-11; cites prior verdict #8567.)

GATE (a hard rule rather than advisory: an OUT-OF-SCOPE match must be dropped, not filed)
  • on_match_out_of_scope: drop → log to dropped_crashes/<batch_id>.yaml
  • must_record: trust_boundary_id, source_doc_line, sanitizer_top_frames
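
The gate at the bottom of the card can be equally literal in code. The matching below is deliberately naive, a per-rule list of API names checked against the crash path; what matters is that an OUT-OF-SCOPE match is a hard drop that records the three must_record fields. The rule and crash field names are assumptions in the same spirit as the loader sketch above.

```python
import yaml

def proxy_gate(crash: dict, proxy: dict, batch_id: str) -> str:
    """`crash` carries the crash id, the APIs on the crash path, and the top frames."""
    for tb in proxy.get("trust_boundaries", []):
        if tb.get("scope") != "out-of-scope":
            continue
        if not any(api in crash["apis_on_path"] for api in tb.get("matches_apis", [])):
            continue
        record = {
            "crash_id": crash["id"],
            "trust_boundary_id": tb["id"],                 # must_record
            "source_doc_line": tb["source"],               # must_record
            "sanitizer_top_frames": crash["top_frames"],   # must_record
        }
        with open(f"dropped_crashes/{batch_id}.yaml", "a") as fh:
            yaml.safe_dump([record], fh, sort_keys=False)
        return "drop"
    return "file"
```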

Three Real Cases Where the Proxy Saves You

Case 1. flatbuffers, where any crash without Verifier does not count

The flatbuffers maintainers repeat the same instruction throughout their documentation, including the CppUsage guide: your code must call VerifyBuffer before it touches the data. If you skip the Verifier and go straight to GetRoot<T> to read fields, that is a caller-contract violation, and any crash that follows is not a library bug.

That is why every LG we fuzz on flatbuffers gets a proxy lookup first. Does this LG run the Verifier? If not, an OOB-looking crash from the fuzzer is not fileable. Among the 8 flatbuffers findings from this campaign, several candidate crashes were identified at the proxy layer as "needs Verify first." We only filed the ones that still crashed after we added VerifyBuffer in front.
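
That lookup can start as bluntly as a textual pass over the LG's harness source; it is not a substitute for the source-level audit, just a fast flag for needs-Verify-first candidates. A sketch, assuming the flatbuffers C++ spellings of the verifier and accessor calls:

```python
import re

def lg_runs_verifier(harness_source: str) -> bool:
    """Crude verify-first check: some Verifier / Verify*Buffer call must appear
    before the first GetRoot<T> access to the fuzzed bytes."""
    verify = re.search(r"\bVerif(?:ier\b|y\w*Buffer\b)", harness_source)
    access = re.search(r"\bGetRoot<", harness_source)
    if access is None:
        return True  # nothing reads the buffer, nothing to guard
    return verify is not None and verify.start() < access.start()
```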

Case 2. libxml2 does not pretend to trust its input

libxml2's stance in SECURITY.md is unusually direct. The library unconditionally trusts the XML it parses. "Don't feed it untrusted input" is the official guidance. Many parser-path crashes, including OOM and unbounded recursion, are simply "you fed it garbage" in the maintainer's eyes. They are not vulnerabilities.

Without a proxy, the easy mistake is to see ASan throw an OOM and file it on the spot, only to be closed with a single "WONTFIX, see SECURITY.md." That kind of technically-correct-but-useless report slowly erodes your account's reputation. The proxy lets us check the citation first. If the root cause sits on this trust boundary, the crash goes to dropped with the reason recorded cleanly.

Case 3. Dawn (WebGPU), where the BlobCache is not a trust boundary

Dawn's runtime API is a strong trust boundary. The browser layer validates inputs, and the parser revalidates internally. The on-disk BlobCache, however, is not. Dawn's design document explicitly states that BlobCache is an integrity check, meant to catch accidental corruption, and not an authenticity check that would catch malicious tampering.

Which means, if your fuzz path can only trigger a crash by modifying the BlobCache file on disk, the Dawn maintainers will mark it WontFix. An attacker who can write to that file already has more direct ways to compromise the machine. We encode this in the proxy so we never bring that kind of crash to their inbox.

What It Costs to Skip the Proxy

In this campaign we hit one boundary the proxy did not yet cover. V8's uaf-v8-deserializer-unresolved-external-reference was technically a real UAF: P1 through P4 were clean, and both the replay and the production reproduction passed. But the V8 team's security triaging policy explicitly lists "vulnerabilities that rely on an attacker-controlled snapshot blob" as out of scope. We filed it, the issue came back Won't Fix, and it became one of three FPs in this round.

If the proxy had encoded V8's policy before we filed (we did not have a V8 proxy at the time), that report would have been dropped before submission. Losing a round on something like that made us more confident in the proxy's engineering value. Its purpose is not to find fewer bugs. Its purpose is to not waste the maintainer's time.

Why This Is Worth Doing

On pure throughput, the verify flow adds 5 to 60 minutes per crash. Across an overnight fuzz that produces tens of thousands of crashes, that is a massive efficiency hit. But what we lose is not real bugs, only noise. Each of the three gates exists to cut that noise earlier and earlier.

The payoff is slow but real. The first time a maintainer encounters you, they see a report with a root cause, a public-API reproducer, and a fix patch. The second time, they ack on their own. After three or five rounds, they start adopting your harnesses directly into their OSS-Fuzz integration. That is the OpenSSL quic_server and fwupd story from Part 1, where four maintainer teams picked up our work.

The cost of skipping is also delayed. You will not immediately see "reputation loss," but slowly your GitHub handle gets muted by certain maintainers, your issues default to low priority, your PRs sit unreviewed. That is the psychological debt LLM-driven fuzz work leaves behind for the whole ecosystem.

There is another subtle payoff to running all three gates. You know which crashes you did not file and why. When a maintainer asks "Why didn't you report X?", we can pull up the dropped_crashes log and show the reason on the spot. That kind of transparency is also part of ecosystem trust.

Open Questions We're Still Working On

This flow is not the end of the story. These are the questions we care most about right now.

  • Automatically generating the proxy. Hand-writing a security_proxy.yaml for every new library is real work. We are extracting rules automatically from SECURITY.md, public headers, and README files, but a human still has to review whether each extraction is correct and whether the citation lands on the right line.
  • Proxy decay. The upstream threat model shifts over time as new sandboxes and new trusted boundaries get added, and the proxy has no staleness signal. We need a way to actively invalidate proxy entries when maintainers update their docs.
  • Libraries with no proxy. Many projects publish no threat model at all. Our default for those is the strictest possible posture, where we only file unambiguous memory-safety bugs. That is a bet, and we are probably missing edge cases the maintainer would have been happy to fix.
  • Second-order effects. The failure mode that worries us most is not FP spam. It is real bugs getting auto-closed by maintainers who have seen too many low-quality LLM reports and now treat anything automated as low-quality by default. That is an ecosystem-wide trust slide, and no single team's verify flow can catch it. This is the topic we most want to discuss with the OpenSSF working group.

Three Posts End to End, From Harness Audit to Bug Report

Across these three posts, we have applied the four principles three different ways.

  • Part 1. The four principles as an audit tool. Applied to existing OSS-Fuzz harnesses, the audit prevented 14 FPs from ever reaching upstream, produced 53 fix PRs, and surfaced an OpenSSL bug that had been latent for 25 years.
  • Part 2. The four principles as a generation tool, applied in reverse (P4 first, P1 last) with Logic Group decomposition. In two weeks we deployed 472 harnesses across Chromium and its dependencies, filed 30 reports upstream (16 acked so far, 9 confirmed plus 7 fixed), and dropped 52 candidates at our own audit stage.
  • This post. The four principles as a verification tool, on the specific crash path, combined with two independent reproductions and per-library Security Proxies. The result is that every issue we file is genuinely useful to the maintainer.

That a single rule set works in audit, generation, and verification suggests we picked the right level of abstraction. Source-level correctness definitions hold up, while post-hoc filtering on fuzz output does not. That is the headline of this series. Harness quality is the bottleneck for fuzz effectiveness, and the way to fix it is to internalize source-level checks at generation time rather than rely on verify-time triage to save the day.

Open to a Conversation

If you build LLM-driven fuzz harness generators, if you've been on the receiving end of low-quality automated reports on OSS-Fuzz, or if any of the four principles, Logic Groups, or Security Proxies work interests you, we'd be glad to talk.

OpenSSF working group

What worries us most is not FP spam. It is real bugs getting auto-routed to low priority. We would like to discuss this with the LLM-generated bug-report triage working group.

Open source

The four-principles sub-check list, the LG schema, and the verify-flow SKILL are all available on o2lab GitHub. Reuse them freely.

Chrome production data

All 472 harnesses, 30 filed reports, and 52 audit-dropped candidates from Part 2 are public on the chrome-maint dashboard.

— Ze Sheng, O2 Security Team (TAMU)