The question everyone skipped
"I want an LLM to generate fuzz harnesses for me" — this goal has been getting heavy attention from both academia and industry over the past couple of years. But look one layer deeper and it rests on a premise almost everyone skips over: what makes a harness "good"? Without that definition, the only way to score what an LLM produces is "does it build" and "does it survive 60 seconds without crashing" — both necessary, neither sufficient. Prior work has measured FP rates as high as 94%.
Open-source projects on GitHub are increasingly being audited by AI tools at scale. Fuzzing has an outsized share of that attention because it ships with its own oracle in the sanitizer. A crash report with a sanitizer trace looks like ground truth, which makes fuzzing the default vehicle for AI-driven bug discovery and reproduction. The trouble shows up one layer down. When the harness itself has bugs, the fuzzer still produces a convincing-looking report. There is a sanitizer trace, a plausible call stack, the whole package. But the trace points at a problem no real caller can reproduce. The result is corrupted signal. Maintainers end up spending their attention separating real library bugs from harness-induced noise, and the "automation" and "scale" that AI research likes to claim get quietly shifted downstream as triage burden.
That reduces the core question to a specification problem. Write down a source-level, checkable definition of what counts as a "good" fuzz harness.
Our answer is the Four Principles. The next section walks through them.
A definition by itself isn't enough. We also need to know whether those criteria actually capture "good", and the test we use for that is simple. Apply the criteria to a large sample of production harnesses, and check whether the ones the criteria flag are also the ones real maintainers want fixed. If maintainers consistently accept the resulting repair PRs, the criteria are tracking something real. If they shrug those PRs off, the criteria are noise.
So identifying bad harnesses isn't a downstream application of the Four Principles. It's the validation experiment for them.
We ran that experiment on OSS-Fuzz across 70 projects and 586 production harnesses. Here is what came back.
- 53 violations identified. Each became a fix PR filed upstream.
- 45 of those 53 (85%) were verified or merged by maintainers. 35 have already landed; the other 10 were acknowledged as real problems with the PR still in queue.
- 14 of the 53 would have produced false-positive ASan or LSan crashes if left in place. The audit caught all 14 before they ever reached upstream.
- Two latent library bugs surfaced once the broken harnesses were fixed, including a 25-year-old OpenSSL DES stack-buffer over-read.
The rest of this post follows that order. First the criteria themselves, then what the audit found in production, then the two latent bugs the audit unmasked.
The Four Principles: a source-level definition of a "correct" harness
To be accepted, a harness must pass four source-level checks at the same time. We call them the Four Principles (P1–P4). Each one targets one of the most common failure modes when an LLM (or a human) writes a harness.
P1 — Logic Correctness: no bugs in the harness itself
This is the simplest of the four principles and the one most often violated. A harness is a C/C++ program, and it makes the same mistakes any C/C++ program makes.
- Variables initialized before use, pointers null-checked before deref.
- Every error path releases its resources. No leaks, no double-frees, no use-after-free inside the harness.
- Fuzz bytes actually reach the library API arguments. The harness is not a no-op, and it does not silently swap the fuzz input for hard-coded constants.
- No static or global state leaks across `LLVMFuzzerTestOneInput` invocations.
- The buffer sizes passed to API calls match the buffer sizes declared in the harness.
A P1 violation means the root cause is the harness itself, but the sanitizer trace will not necessarily stop inside the harness. The more common pattern is the harness pushing bad state into the library. Examples include oversized buffers, NULL pointers, uninitialized structs, and freed objects. The library code is what trips ASan or LSan, the trace looks like a library bug, and the root cause is still the caller violating the contract. That is the most confusing kind of P1 false positive. Wherever the trace lands, if P1 doesn't pass, the report is noise.
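To make that pattern concrete, here is a minimal sketch of a P1-violating harness. It uses a hypothetical `target_decode(buf, len)` API (not from any real library) that reads exactly `len` bytes; the harness lies about the buffer length, so the sanitizer fires inside the library even though the defect is the call site.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical library API: reads exactly `len` bytes from `buf`. */
int target_decode(const uint8_t *buf, size_t len);

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  if (size == 0) return 0;

  uint8_t *buf = (uint8_t *)malloc(size);
  if (!buf) return 0;
  memcpy(buf, data, size);

  /* P1 violation: the declared length does not match the allocation.
     target_decode reads past the heap buffer, so ASan reports a
     heap-buffer-overflow deep inside the library's read loop, but the
     root cause is this call site in the harness. */
  target_decode(buf, size + 8);

  free(buf);
  return 0;
}
```

The resulting trace shows library frames on top; only a source-level read of the harness reveals that the contract was broken by the caller.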
P2 — API Protocol Compliance: the harness must obey the library's contract
Every library API comes with its own usage contract. The contract covers when to init, when to destroy, what parameter ranges are valid, which return values must be checked, and which calls must be paired. LLM-generated harnesses violate these contracts a lot, because the LLM cannot infer a full state machine from a single function signature. The eight P2 sub-checks are listed in the P1.1–P2.8 checklist below.
Here is a real example. The libyaml_emitter_fuzzer on OSS-Fuzz is meant to do a round-trip check. Parse the fuzz input into a sequence of YAML events, push them through the emitter to get a serialized output, then parse that output back and confirm the event sequence matches. The simplified main control flow looks like this.
```c
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  yaml_parser_t parser;
  yaml_emitter_t emitter;
  yaml_event_t event;
  yaml_event_t events[MAX_EVENTS];
  size_t event_number = 0;

  yaml_parser_initialize(&parser);
  yaml_parser_set_input_string(&parser, data, size);
  yaml_emitter_initialize(&emitter);
  yaml_output_buffer_t out = { NULL, 0, 1000 };
  yaml_emitter_set_output(&emitter, yaml_write_handler, &out);

  /* ---- Loop 1: parse, save event, emit ---- */
  bool done = false;
  while (!done) {
    if (!yaml_parser_parse(&parser, &event)) goto delete_parser;
    done = (event.type == YAML_STREAM_END_EVENT);
    if (copy_event(&events[event_number++], &event)) { /* BUG: copy_event returns */
      yaml_event_delete(&event);                       /* 1 = success, but the    */
      goto delete_parser;                              /* harness reads 1 as failure */
    }
    yaml_emitter_emit(&emitter, &event); /* never actually reached */
  }

  /* ---- Loop 2: re-parse the output and compare events ---- */
  /* (entire loop is dead code: Loop 1 always exits at BUG)   */
  yaml_parser_set_input_string(&parser, out.buf, out.size);
  int count = 0;
  done = false;
  while (!done) {
    if (!yaml_parser_parse(&parser, &event)) break;
    if (!events_equal(events + count, &event)) break;
    done = (event.type == YAML_STREAM_END_EVENT);
    yaml_event_delete(&event);
    count++;
  }

delete_parser:
  yaml_parser_delete(&parser);
  yaml_emitter_delete(&emitter);
  for (size_t k = 0; k < event_number; k++) yaml_event_delete(events + k);
  free(out.buf);
  return 0;
}
```
What's wrong: the `copy_event` check marked BUG in the listing inverts the test. The helper `copy_event()` follows the libyaml convention of returning 1 on success and 0 on failure. The harness reads 1 as failure, so every successful copy jumps straight into the error-cleanup `goto`.
Why it's wrong: for any valid YAML input the harness short-circuits on the very first event. The `yaml_emitter_emit` call inside Loop 1 never runs, and all of Loop 2 (the re-parse plus the `events_equal` consistency check) is unreachable. Roughly 50% of the harness logic had never been executed. The harness still compiles, still runs without crashing, and coverage still climbs (in the 50% that does run), which is why it sat on OSS-Fuzz for years without anyone noticing.
The fix is one character: add a `!`, so `if (copy_event(...))` becomes `if (!copy_event(...))`. After the patch landed, edge coverage jumped from 166 to 2479, a +1393% gain (about 15×). This is a textbook P2.4 (Return value handling) violation, with the harness mistaking success for failure. The full fix PR is at google/oss-fuzz#15096, merged.
A P2 violation means the crash is API misuse, not a library vulnerability. But P2 has a stealthier failure mode than filing FPs, and the example above is exactly it. The harness compiles, runs, and gains coverage, but a wrong check has silently cut the real target path out of execution. The fuzzer looks busy while none of the sanitizer-relevant code in the actual target ever runs. You aren't failing to find bugs; you aren't testing in the first place. The sanitizer reports no crash, the fuzz cycle produces zero signal, and that silence is worse than an FP.
P3 — Security Boundary Respect: prefer public APIs
A real attacker can only enter the library through its public exported API. We encourage the harness to respect that same boundary. Pulling a static function out via extern, or feeding bytes straight into an internal chunk handler past the public-entry validation, drives the fuzzer into code that no external caller can ever reach. Whatever crash that produces is hard to map back to a real threat model.
P3 is not an absolute ban on internal symbols. If you are 100% certain the internal use is legitimate, for example the internal state machine's input semantics are equivalent to some public API's intermediate state, or if you can prove the same crash reproduces through a purely-public-API harness, the harness is allowed. The acceptance criterion is reproducibility from the public side. If a public-API-only harness can reach the same crash, P3 passes. If not, P3 fails.
- Encouraged. Call functions declared in public headers, and do not skip the public entry-point validation. That validation is part of what the library ships.
- Use internals carefully. If a harness needs `static` functions, friend access, or internal chunk handlers, the PR has to spell out the justification and attach evidence that the same behaviour reproduces from the public side.
- Counter-example. Calling `png_handle_IHDR` directly to trigger a crash where no actual PNG file can produce the corresponding internal state. That is a P3 failure, because the public side cannot reproduce it.
P3 is a judgement call, not a mechanical check. Questions like "is this static function fair to use" or "can the same crash be reached from the public side" need project context. A senior reviewer who has read the library's SECURITY.md, written similar fuzzers, and knows the project history can answer them confidently. A third-party general-purpose LLM with no project context cannot. The LLM has not read the SECURITY.md, gives a different rationale each run, and tends to reject any internal-looking symbol on sight. So P3 (and P4 below) is not something our pipeline lets an LLM decide alone. The LLM proposes candidates, the static call graph supplies reachability, every citation has to land on file:line, and a senior reviewer signs off on borderline cases. The LLM is a tool here, not the judge.
A P3 violation means the crash sits on a code path no one can reach from the public side. Even if the crash itself is real, the fix is hard to land. Upstream will usually close it as NOT_REPRODUCIBLE_FROM_PUBLIC_API.
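Schematically, the boundary looks like the sketch below. The names are hypothetical (`lib_parse` as the exported entry point, `lib_handle_chunk` as the internal handler it normally guards); this is an illustration, not any real library's API.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical public API, declared in the library's installed headers. */
int lib_parse(const uint8_t *data, size_t size);

/* Hypothetical internal handler, normally reached only after lib_parse
   has validated the framing. Re-declaring it here just to call it
   directly is the P3 anti-pattern. */
extern int lib_handle_chunk(const uint8_t *chunk, size_t size);

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  /* P3 failure: drives internal state no external caller can construct,
     so any crash it finds is hard to map back to a real threat model. */
  /* lib_handle_chunk(data, size); */

  /* P3 pass: enter through the exported API, validation included. */
  lib_parse(data, size);
  return 0;
}
```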
P4 — Entry Point Adequacy: the target has to sit on a real attack surface
P4 answers a different question. Is this fuzz entry point worth fuzzing in the first place? A harness that pipes every fuzz byte into a logging helper or a trivial accessor will pass P1 through P3 cleanly and still waste every CPU second it gets.
- The entry point has to actually reach the target core implementation, not bail out early on validation.
- The reachable closure from the entry point has to contain memory-safety-relevant operations (`memcpy`, `malloc`, `free`, `strcpy`, `sprintf`, etc.). Otherwise even a successful exploit would be harmless.
- The target sits on the library's current public attack surface, not on some deprecated demo function.
A P4 violation means the fuzzer can run forever and still find nothing useful.
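A deliberately bad example makes the point. The sketch below uses hypothetical names (`lib_version_string`, `lib_parse`); the harness can pass P1 through P3 cleanly and still be worthless, because nothing in its reachable closure can ever corrupt memory.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical APIs: a trivial accessor and the real parsing surface. */
const char *lib_version_string(void);
int lib_parse(const uint8_t *data, size_t size);

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  /* P4 violation: builds, runs, never crashes, and never can. The
     reachable closure holds no memory-safety-relevant operations, and
     the fuzz bytes only pick which character to compare. */
  const char *v = lib_version_string();
  if (size > 0 && v != NULL && (uint8_t)v[0] == data[0]) {
    /* nothing security-relevant ever happens here */
  }

  /* A P4-adequate harness would instead target the parsing surface:
     lib_parse(data, size); */
  return 0;
}
```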
P1.1–P2.8: the 16-item checklist
To make P1 and P2 checkable as a matter of engineering rather than judgement, we split each into eight concrete sub-checks. The same checklist drives all three stages (audit, generation, verify), so verdicts from any one stage line up with the others.
P1 — Logic Correctness · 8 sub-checks
| ID | Sub-check | What it requires |
|---|---|---|
| P1.1 | Resource leaks | Every exit path releases all alloc / fd / lock / handle |
| P1.2 | Use-after-free | No read or pass of a pointer after free |
| P1.3 | Stale state | No static / global state carried across LLVMFuzzerTestOneInput invocations |
| P1.4 | Input flow | Fuzz bytes actually reach the target API arguments (call not driven by constants) |
| P1.5 | Buffer safety | Bounded fuzz-buffer reads, null-terminated C strings, length-checked indexing |
| P1.6 | Size checks | Early return when input is below the API minimum |
| P1.7 | Undefined behaviour | No OOB / null-deref / signed overflow / unaligned cast |
| P1.8 | No reimplementation | Calls real library code, not an inline local copy |
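For reference, here is a minimal harness skeleton that satisfies the P1 sub-checks most often violated. It is written against a hypothetical `target_parse_string` API that expects a NUL-terminated string; the shape is generic and the names are not from any real project.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical library API that expects a NUL-terminated string. */
int target_parse_string(const char *text);

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  if (size < 2) return 0;                 /* P1.6: respect the API minimum      */

  char *text = (char *)malloc(size + 1);  /* P1.5: bounded copy, room for NUL   */
  if (!text) return 0;
  memcpy(text, data, size);
  text[size] = '\0';                      /* P1.5: properly terminated C string */

  target_parse_string(text);              /* P1.4: fuzz bytes reach the target  */

  free(text);                             /* P1.1: released on every exit path  */
  return 0;                               /* P1.3: no state survives the call   */
}
```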
P2 — API Protocol Compliance · 8 sub-checks
| ID | Sub-check | What it requires |
|---|---|---|
| P2.1 | Init sequence | All required predecessors called in order before the target |
| P2.2 | Parameter construction | Each parameter has the right type, range, owner, and lifetime |
| P2.3 | Object lifecycle | Opaque objects go through create → configure → use → destroy |
| P2.4 | Return value handling | Return codes checked, error branches exercised, gated output not reused after failure |
| P2.5 | Cleanup sequence | Resources released in API-documented order on every exit path |
| P2.6 | API existence | Every called function is exported in the pinned build of the library |
| P2.7 | Co-call constraints | Paired APIs co-occur; mutually exclusive APIs do not |
| P2.8 | Prerequisite state | External state (fd / socket / env / thread) set up before the target call and torn down after |
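The P2 rows map onto one recurring shape: follow the documented lifecycle and check every return. Below is a sketch against a hypothetical opaque-object API (`foo_new` / `foo_set_input` / `foo_run` / `foo_free`, assuming the same 1-on-success convention as the libyaml example above); the names stand in for whatever the library actually exports.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical opaque-object API with a documented lifecycle
   (1 = success, 0 = failure). */
typedef struct foo_ctx foo_ctx;
foo_ctx *foo_new(void);                                          /* create    */
int      foo_set_input(foo_ctx *c, const uint8_t *d, size_t n);  /* configure */
int      foo_run(foo_ctx *c);                                    /* use       */
void     foo_free(foo_ctx *c);                                   /* destroy   */

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  foo_ctx *ctx = foo_new();                /* P2.1/P2.3: create before use       */
  if (!ctx) return 0;                      /* P2.4: check, don't assume          */

  if (foo_set_input(ctx, data, size) &&    /* P2.2: fuzz bytes as parameters     */
      foo_run(ctx)) {                      /* P2.4: run only after configure ok  */
    /* success path: outputs, if any, are read only here */
  }

  foo_free(ctx);                           /* P2.5/P2.7: paired destroy on every path */
  return 0;
}
```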
Implementation details (which sub-checks use LLM-read source with file:line citations, which use GDB breakpoints on the target API, which read ASan / LSan sanitizer traces) and how prober inputs differ across the three stages are covered in the next post on the reverse pipeline. This section just lays down the criteria themselves.
The audit: 70 projects, 586 harnesses
Our actual goal is to use the four principles to generate correct harnesses. Before trusting that, we need to prove something more basic. Can the principles reliably identify harness defects, and can they drive a fix that holds up upstream? This section is that proof experiment.
The experiment design is straightforward. We treat each existing production harness as if it were a "first-round LLM output". The four principles inspect it, the findings get fed back to the LLM to refine, and the loop iterates until the harness passes. That audit-and-refine cycle is exactly the middle of the generation pipeline, isolated and run on real code. If it surfaces defects reliably on production harnesses and produces fixes that maintainers actually accept, it has earned its place inside the full pipeline.
The baseline is 70 C/C++ projects on OSS-Fuzz. Most of these harnesses were written by the project maintainers or senior contributors, reviewed upstream, and have been running on ClusterFuzz for years. From those 70 projects we pulled 586 production harnesses. The procedure:
- Static pass. An LLM agent reads the harness source plus the library headers and walks the full 16 sub-checks across P1.1–P1.8 and P2.1–P2.8. Note the audit only mechanizes P1 and P2; P3 and P4 are judgement calls that a senior reviewer handles separately. Every claim must cite a `file:line`.
- Dynamic pass. Adversarial probing builds inputs targeting each suspect point, then runs 60 seconds of libFuzzer with ASan and LSan to confirm the violation actually fires.
- Repair and validate. For each violation we submit a fixed version and run the same 60 seconds to confirm coverage does not regress. That gives the upstream maintainer a direct reason to merge.
- Average wall-clock cost is about 10 minutes per harness with 10 parallel workers, for a total of roughly $720.
Now the results.
53 violations across 28 projects
We found 53 P1/P2 violations across 586 harnesses, or 9.0%, spread across 28 projects. The remaining 42 projects produced no violations in this run. One caveat is worth flagging. The four principles are a checklist, not a decision procedure. "Found nothing" does not prove a project is clean. It only proves that none of the 16 sub-checks tripped on those particular harnesses in this audit.
All 53 went upstream as PRs or issues; as of this writing, 45 are verified or merged by maintainers (35 landed, 10 acknowledged with the PR still in queue).
The headline: two real upstream bugs that broken harnesses had been hiding
The most interesting result of this audit is not the count of violations. It is the two real upstream bugs that broken harnesses had been masking for years. Repairing a P2 violation occasionally re-exposes the library's real behaviour to the sanitizer. That happened twice in this run.
Case 1: OpenSSL DES stack-buffer over-read, latent 25+ years
One OpenSSL harness had a P2 violation in its API call ordering that left the entire cipher path as dead code in the harness execution trace. Once we fixed the order, the fuzzer threw a stack-buffer over-read inside the OpenSSL DES implementation within seconds.
The bug itself has been sitting in OpenSSL's DES module for over 25 years, older than the name "OpenSSL". It went unnoticed not because it was hard to find but because every harness that ever walked that path was tripped by the same P2 mistake and never actually entered the function. One call-order error kept a memory-safety bug on a public API hidden for a quarter century.
Upstream issue: openssl/openssl#30284. Fix merged, issue closed.
Case 2: tidy-html5 memory leak, one missing cleanup call
A tidy-html5 harness was missing a required cleanup call, violating P2's paired-API constraint. Once we added the call, the harness immediately tripped LeakSanitizer on the path the missing cleanup had been hiding. The leak lives inside tidy-html5's own lexer code.
The root cause matches the OpenSSL case exactly. A wrong harness usage was masking the library's real behaviour. Treat these two as the canonical examples of P2-driven bug discovery. Fixing the harness is not just about killing false positives; it is about returning the sanitizer's view of the library to ground truth.
Upstream issue: htacg/tidy-html5#1177.
14 false positives that never reached upstream
Of the 53 violations, 14 would produce false-positive ASan or LSan crashes if left in place. If we had skipped the audit and run those harnesses through fuzzing directly, all 14 would have surfaced convincing-looking crashes, gotten packaged into 14 bug reports, and dumped on the maintainers.
Not one of the 14 is a real library bug. Each root cause sits inside the harness itself: a dict never freed, an object lifecycle mis-staged, an internal chunk handler driven directly. Every one of them would have cost a maintainer time reading the stack trace, rebuilding a reproducer, walking the source, and eventually closing it NO_VULNERABILITY. And it would leave a "low-quality submitter" stamp on our handle.
The four principles caught all 14 before fuzzing ever ran. They were not crashes-then-discarded-by-verify. They were source-level violations identified, fixed, and merged upstream as PRs without any of them ever firing a sanitizer in production. Involved projects include:
- lcms. Both `dict` and `transform_ext` harnesses had P2 violations producing FP crashes. Clean after the fix.
- libarchive / linkify. P1 resource-lifecycle issue.
- libpng / readapi. P2 call-order mistake.
- njs / script. P1 static state not reset between runs.
- openvpn. Both `packet_id` and `verify_cert` had P1 violations producing fake leaks.
- wamr / mutator. P2 protocol violation.
This is the biggest lever the audit has. Stop the false positives from leaving the building. You never have to repair a maintainer's trust if you never wasted their time to begin with.
Side effect: coverage gains were not the goal, but they were large
One thing to be clear about. The audit's goal was not coverage. The goal was harness correctness. But in many cases, fixing a P1 or P2 violation also revives dead code, and coverage climbs as a side effect. We attach the coverage delta to each repair PR as a no-regression sanity check for the maintainer. A few representative numbers.
- `opencv/filestorage` — +986% edges (the original harness was almost entirely dead code)
- `libyaml/emitter` — ≈15× edges (166 → 2479)
- `openssl/quic_server` — +65.7%
- `bzip2/bzip2_fd` — +52.4%
- `tidy-html5/general` — +33.5%
- `boost/filesystem` — +21.9%
Large jumps usually map to a harness that was almost entirely dead code. Small ones usually map to a harness that was already running but carried a non-blocking P1 or P2 violation (jq/parse_stream is the canonical small-jump case). Both kinds are harness-quality problems; they just present differently.
Why this matters more in the LLM era
When harnesses were hand-written, quality issues got caught in PR review by other humans. In the LLM era a single overnight run can churn out hundreds of harnesses. Without a source-level quality gate up front, automation translates "speed" into "false-positive throughput".
The four principles are not LLM-specific. They are a source-level definition of harness correctness, equally applicable to auditing human-written harnesses (the 586 in this post) and to self-checking LLM-generated ones. We use the same four principles inside our generation pipeline too. The next post covers that.
If you are working on LLM-driven fuzz harness generation today, our suggestions:
- Do not treat "compiles and survives 60 seconds of fuzzing" as a quality signal. Both are necessary, neither is sufficient.
- After generation, run the P1.1–P1.8 and P2.1–P2.8 sub-checks. Send violations back to the LLM for a rewrite. Do not park them in a fuzz queue.
- Apply P3 and P4 before generation, at target selection. Choosing the wrong entry and then writing code is wasted work.
- Every PR upstream needs the root cause, the fix, and a no-regression coverage check. A bare sanitizer trace is not enough.
The pipeline cost is bounded. 586 harnesses, ~10 minutes each at 10 parallel workers, about $720 total. In return: 14 false positives that never become noise reports, 45 maintainer-verified-or-merged PRs (35 already landed), and two real upstream bugs that broken harnesses had been masking for years (including the 25-year-old OpenSSL DES one). The trade is positive either way you score it, in engineering cost or in ecosystem trust.
Next: turning the principles around
This post used the four principles for audit. Catch an existing harness, find the violations, ship a fix. The next post flips the direction. Use the principles for generation. Given a fresh library, walk P4 down to P1 (start with the attack surface, end at the code), combine that with Logic Group as the semantic-unit slicer, then put every harness through Stage-4 adversarial validation before fuzzing. In two weeks: 472 harnesses across Chromium and its upstream dependencies, 30 vulnerabilities filed upstream (16 acked so far — 9 confirmed + 7 fixed, 3 returned as FP, 3 newly filed), and 52 candidates dropped during our own audit before any could reach a maintainer.
— Ze Sheng · Team FuzzingBrain