Two-Week Numbers

  • 472 harnesses: generated, validated, and fuzzed in two weeks
  • 58 libraries: Chromium and its upstream dependencies
  • 30 filed upstream: 16 acked (9 confirmed + 7 fixed) · 3 maintainer-FP · 11 awaiting review
  • 52 verify-dropped: flaky · duplicate · harness-side · blocked before filing

Are the Four Principles Just an Audit Tool?

Part 1 used the four principles (P1 Logic Correctness, P2 API Protocol Compliance, P3 Security Boundary Respect, P4 Entry Point Adequacy) as an audit tool. The harness was already running on OSS-Fuzz. We read it back and checked whether it was correct. The result was 53 fix PRs across 586 harnesses, 14 false positives prevented, and one OpenSSL stack over-read latent for 25 years. The source-level definition of the principles and the 16 sub-checks (P1.1 through P1.8 and P2.1 through P2.8) all live in that post and are not repeated here.

But the same four rules can also be used before the harness exists, to internalize "what counts as a good harness" inside the generation pipeline rather than catch it post-hoc. That is what this post is about. Flip the four principles around (P4 → P3 → P2 → P1), pair them with a tool that slices a library into independently fuzzable semantic units called Logic Groups, and in two weeks generate, deploy, and fuzz 472 harnesses across Chromium and its upstream dependencies.

Logic Group: A Library's Fuzzable Semantic Unit

P4 tells you to pick the attack surface, but "attack surface" is too coarse. A real C/C++ library exports hundreds to thousands of public symbols, so writing one harness per function is both inefficient and a way to burn CPU on uninteresting wrappers.

We introduce Logic Group (LG), the smallest semantic unit a library is sliced into. Each LG is defined by four fields:

  • name: the semantic label, e.g. "parse PNG image"
  • E: the entry set, the public APIs that take fuzz bytes
  • C: the core set, the implementation functions reachable from E
  • desc: the surface description, why this LG is worth fuzzing

One LG maps to one harness. LGs do not overlap. If two LGs share most of their core set we merge them, and if a candidate LG covers no memory-safety-relevant code we drop it. A typical library breaks into 5 to 10 LGs, depending on the topology of the public API surface.
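
To make the four fields concrete, here is a minimal sketch of one LG spec as a plain struct. The LogicGroup type, its layout, and the function names in the example instance are illustrative placeholders, not the pipeline's actual schema; only the "parse PNG image" label comes from the field list above.

#include <string>
#include <vector>

// Illustrative only: one possible in-memory shape for a Logic Group.
struct LogicGroup {
  std::string name;                // semantic label
  std::vector<std::string> entry;  // E: public APIs that take fuzz bytes
  std::vector<std::string> core;   // C: implementation functions reachable from E
  std::string desc;                // why this LG is worth fuzzing
};

// Example instance for the "parse PNG image" label. The symbol names are
// placeholders, not a claim about any specific library's call graph.
const LogicGroup kPngParse{
    "parse PNG image",
    {"png_read_info", "png_read_image"},
    {"png_handle_IHDR", "png_handle_IDAT", "png_inflate"},
    "Decodes attacker-controlled PNG bytes through the public read API"};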

A concrete example. openscreen is Chromium's Cast streaming implementation. Some of the LGs we sliced out:

  • openscreen_receiver_message_json_parse — parse the RECEIVER_MESSAGE JSON body
  • openscreen_sender_message_json_parse — parse the SENDER_MESSAGE JSON body
  • openscreen_answer_messages_deep_substructs — parse the nested substructures inside ANSWER (AudioConstraints / VideoConstraints / ...)
  • openscreen_offer_messages_round_trip — OFFER parse-serialize round-trip
  • ...

Each LG is its own fuzz entry point and its own harness. The benefit shows up later in the results, where the same root-cause jsoncpp-misuse bug fires independently in three different LGs, and each LG produces a clean, non-entangled report for upstream.

Four Stages of the Reverse Pipeline

Part 1's audit only ran the mechanical part, the 16 sub-checks under P1.1 to P1.8 and P2.1 to P2.8. P3 and P4 stayed out of the static pass. The harnesses had been on production OSS-Fuzz for years and the entry points had been picked by upstream reviewers, so we sanity-checked P3 and P4 manually only on borderline cases. P3 and P4 are judgement calls, not mechanical checks. That is a luxury you only get when you are catching an existing harness.

Generation does not get that luxury. The entry point is something we are choosing on the spot, with no upstream reviewer to pre-vet P4 (does the target sit on a real attack surface) or P3 (is the crash reachable from the public side). Skip those two and let an LLM write the code first, and a 60-second fuzz run will eventually report that the entry was wrong, leaving the rest of the work to be thrown away. So reversing is not a cosmetic reorder. It adds the two checks that audit could skip but generation cannot, and puts them ahead of everything else.

The reverse principles cash out into four engineering stages. Each stage's output is the next stage's input. Any stage that fails sends the work back to the LLM rather than forward.

Stage 1, Logic Group Discovery (covers P4 and P3)

An LLM agent reads project source plus public headers, enumerates 5 to 10 candidate LGs, then a static call graph computes a danger score for each.

danger(LG) = Σ_{g ∈ C} unsafe_ops(g) / depth(E → g)

unsafe_ops covers memcpy, malloc, free, strcpy, sprintf, and similar primitives. The depth discount 1/d keeps the ranking focused on attack surface that is directly reachable from a public entry, so deep-internal detail does not drown the score. Every candidate whose danger score clears a non-trivial threshold advances to Stage 2. In practice, that leaves about 8 LGs per library, once the wrappers and trivial accessors get pruned. P3 (reachability must originate from a public API) is verified at the same step.
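
As a sketch of how that score could be computed from the static call graph, assuming each core function already carries an unsafe-operation count and a shortest-path depth from the entry set (the FunctionFacts and DangerScore names are hypothetical):

#include <string>
#include <unordered_map>

// Hypothetical per-function facts extracted from the static call graph.
struct FunctionFacts {
  int unsafe_ops;    // memcpy/malloc/free/strcpy/sprintf/... call sites in g
  int depth_from_E;  // shortest call-chain length from any entry in E (>= 1)
};

// danger(LG): sum over g in C of unsafe_ops(g) / depth(E -> g). The 1/depth
// discount keeps shallow, directly reachable unsafe code dominant.
double DangerScore(const std::unordered_map<std::string, FunctionFacts>& core_set) {
  double score = 0.0;
  for (const auto& kv : core_set) {
    const FunctionFacts& f = kv.second;
    if (f.depth_from_E > 0) {
      score += static_cast<double>(f.unsafe_ops) / f.depth_from_E;
    }
  }
  return score;
}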

One engineering boundary worth calling out here. P3 and P4 are never delegated to an LLM as the sole decider. Whether an internal usage is legitimate, and whether an entry sits on the real attack surface, are project-context-heavy judgements. A senior reviewer who has read the project's SECURITY.md and written similar harnesses can answer them. A third-party general-purpose LLM with no project context cannot. So Stage 1 runs as a hybrid. The LLM enumerates candidates, the static call graph supplies reachability, and a senior reviewer signs off on borderline cases. The LLM is a tool here, not the judge. That boundary is what makes 472 harnesses humanly accountable for quality.

Stage 2, API Protocol Research (covers P2)

Once an LG is picked, a second agent reads the headers, comments, and key call-sites for every entry function and core function, then writes a protocol document covering:

  • Call-order constraints (init before use before destroy).
  • Object lifetimes (who allocates, who frees, whether reuse across calls is allowed).
  • Parameter ranges (the valid interval of an int, the legal value set of an enum).
  • Return-value semantics (NULL on failure, negative error code, out-param).
  • Cleanup APIs that must be paired with the entry call.

Every line in this document must cite a file:line. No improvisation. This is the crux of why P2 belongs before the harness is written. Once the harness exists, the LLM is anchored to the code it just wrote and no longer rewrites the protocol cleanly. Research the contract first, write the code second.
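
One way to picture the Stage 2 output is as a flat list of cited facts, one per constraint. A hypothetical shape for a single entry (the ProtocolFact name and its fields are illustrative, not the pipeline's real schema):

#include <string>

// Hypothetical shape for one Stage 2 protocol fact. Every constraint the
// harness will rely on carries the file:line evidence that backs it.
struct ProtocolFact {
  std::string kind;       // "call-order", "lifetime", "range", "return", "cleanup"
  std::string statement;  // e.g. "the config object must not be reused after destroy"
  std::string evidence;   // file:line citation into the target library's source
};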

Stage 3, Static-Driven Build (covers P1 and P2)

The harness-generator agent takes the Stage 2 document and writes the harness in one shot, full source rather than line-by-line trial. Then a bounded build-fix loop. Compile fails, the error feeds back to the LLM, it patches, recompile, repeat until success or the iteration cap fires.

Unlike the "give the LLM a skeleton and let it fill blanks" pattern, we have the LLM produce the entire LLVMFuzzerTestOneInput together with its setup and teardown in a single pass. That keeps reasoning across functions intact, and LLMs do far better on long-span tasks like this than on per-function wrapper generation.

Stage 4, Adversarial Validation (covers P1 and P2)

The last gate is adversarial probing. The system actively constructs adversarial inputs against the freshly generated harness and runs two probe types.

  • Reach probe. Set a GDB breakpoint on the target API and run the harness. If the breakpoint never fires, fuzz bytes are not flowing to the target and P1.4 (input flow) has failed.
  • Run probe. libFuzzer with ASan and LSan, short duration. Any crash is immediately routed through the P1.1 to P1.8 and P2.1 to P2.8 sub-checks for triage. If the crash classifies as a harness-side violation, the harness goes back to the LLM. Only crashes that classify as real library behavior become bug candidates.

The first probe asserts the harness is exercising the library. The second asserts the harness is not the bug. Only when both pass does the harness move to ready-for-discovery.

What Two Weeks of 472 Harnesses Across 58 Libraries Produced

We started running this pipeline against Chromium and its upstream dependencies in mid-April 2026. About two weeks later, the count of harnesses generated, built, and put through adversarial validation was 472, spread across 58 libraries. The pipeline produced findings in two distinct buckets:

  • 30 filed upstream (the headline number). Maintainer-side state breaks down as 16 acknowledged (9 confirmed and 7 fixed), 3 returned as FP, and 11 still awaiting maintainer review. Of the 11, 3 were just filed and 8 are older submissions still under triage.
  • 52 dropped by our pre-filing verify gate. These are crashes that fuzzing produced and that verify caught before they could become an upstream issue. Reasons split across flaky reproductions, duplicates of an already-filed root cause, and harness-side bugs that slipped past Stage-4 validation. None of these became upstream noise, which is the point.

The 30 here is not a raw findings count. It is the count that has crossed a maintainer's eyes. The mix spans memory-safety crashes on public codec and parser entries, DoS via uncontrolled allocation, and reachable-assertion aborts. We don't pre-classify severity, since that's the maintainer's call.

By library, the busiest targets were:

  • flatbuffers: 8 findings
  • libvpx: 6 findings
  • openscreen: 4 findings
  • libwebp / icu / libaom / spirv-tools: 3 findings each

Other libraries with at least one finding include crashpad, dawn, sandbox, pdfium, leveldb, quiche, dav1d, zstd, zxing-cpp, libpng, hunspell, net, sqlite, v8, chromium, and skia. All numbers are pulled live from the chrome-maint dashboard.

Three Representative Cases

Case 1: openscreen — Three Sister jsoncpp DoS Bugs, One CL Fixes All

The LG example earlier was openscreen on purpose. Three of its LGs (receiver_message, sender_message, answer_messages) independently produced morphologically similar crashes within the same week. In each case, jsoncpp's Json::Value::operator[] was being indexed by a string key on a non-object Value, hitting JSON_FAIL_MESSAGE's abort().

The root cause is one shared pattern. Each openscreen parser writes:

// value is the Json::Value parsed from the incoming message body.
if (!value) {
  return Error(Error::Code::kJsonParseError, "Invalid message body");
}
// ... value[kResult] / value[kSequenceNumber] ...

But operator! only returns true for nullValue. Any arrayValue, stringValue, numeric, or boolean value walks past the guard and into the next line's operator[](string key), whose precondition requires an object (or null) Value and fails for anything else. With JSON_USE_EXCEPTION=0 and NDEBUG set, the assert is compiled out but the trailing abort() remains, and the Cast sender process dies on the spot.

Because we sliced these three parsers into three independent LGs, each crash became its own issue. Upstream (jo@google.com, dx@google.com) accepted all three the next day and merged a single CL, "[cast_streaming][security] Prevent JSON parsing aborts with non-objects", which simultaneously promotes if (!value) to if (!value.isObject()) in answer_messages.cc, receiver_message.cc, and sender_message.cc.
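
Sketched against the snippet above, the guard after the fix is a type check rather than a null check, per the CL description:

// Guard shape after the fix: reject anything that is not a JSON object,
// instead of rejecting only nullValue.
if (!value.isObject()) {
  return Error(Error::Code::kJsonParseError, "Invalid message body");
}
// ... value[kResult] / value[kSequenceNumber] are now safe to index by key ...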

LG slicing is what made the same root cause fire independently from three entries, and ironically what made the upstream merge easier rather than harder. Without the split, we would likely have noticed only one of the three. The other two would still be live.

Case 2: libaom AV1 Encoder Heap Overflow — A Real Bug on a Real Public API

libaom/heap-overflow-av1-restore-layer-context is one of the cleanest end-to-end findings in this batch. The LG is "AV1 SVC encoder mid-stream reconfig", with the entry set being the public aom_codec_enc_init, aom_codec_encode, and aom_codec_enc_config_set.

P3 cleared the entry as public during LG discovery. P2 protocol research established which parameters to aom_codec_enc_config_set are legally mutable mid-stream. P1 produced the harness, which threw a heap-buffer-overflow within 60 seconds. What the maintainer received was a root cause, a public-API repro, and a minimal fix suggestion, not a dump containing harness-private structures. This finding is now fixed, closing the full LG discovery → harness → crash → upstream merge cycle.
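
The public-API sequence the LG exercises has roughly this shape: init, encode, a mid-stream aom_codec_enc_config_set, encode again. Below is an illustrative skeleton under that assumption, not the filed reproducer; error handling and the fuzz-driven parameter choices that actually trigger the overflow are omitted.

#include <aom/aom_encoder.h>
#include <aom/aom_image.h>
#include <aom/aomcx.h>

// Illustrative call-order skeleton only. The concrete dimensions and the
// legality of mutating them mid-stream are exactly what Stage 2 pins down.
void MidStreamReconfigSkeleton(unsigned int new_w, unsigned int new_h) {
  aom_codec_iface_t* iface = aom_codec_av1_cx();

  aom_codec_enc_cfg_t cfg;
  aom_codec_enc_config_default(iface, &cfg, AOM_USAGE_REALTIME);
  cfg.g_w = 64;
  cfg.g_h = 64;

  aom_codec_ctx_t ctx;
  aom_codec_enc_init(&ctx, iface, &cfg, 0);

  aom_image_t img;
  aom_img_alloc(&img, AOM_IMG_FMT_I420, cfg.g_w, cfg.g_h, 16);
  aom_codec_encode(&ctx, &img, /*pts=*/0, /*duration=*/1, /*flags=*/0);

  // Mid-stream reconfig on the same encoder context.
  cfg.g_w = new_w;
  cfg.g_h = new_h;
  aom_codec_enc_config_set(&ctx, &cfg);
  aom_codec_encode(&ctx, &img, /*pts=*/1, /*duration=*/1, /*flags=*/0);

  aom_img_free(&img);
  aom_codec_destroy(&ctx);
}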

Case 3: libwebp NULL Pointer Dereference — Same-day Confirm, 2-day Fix

In libwebp/null-deref-WebPMuxAssemble, the public API WebPMuxAssemble dereferences an unchecked internal pointer on the mux-edit path. The LG entry is WebPMux*, and the core set covers the entire chunk-assembly chain in muxedit.c. The fuzzer fired the crash on its first night. After the verify pipeline cleared it, we filed it as webmproject#497882857. The libwebp team at Google confirmed it the same day and merged the fix two days later. From issue to merged code in under 48 hours, the strongest endorsement we have for the reverse-pipeline claim that "every finding is a real memory bug on a real public API".
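
For reference, the mux-edit-path LG shape looks roughly like the following: build a WebPMux from fuzz bytes, assemble it, release everything. This is an illustrative sketch using the public mux API, not the harness that found the bug, and whether this exact sequence reaches the crashing path is not claimed.

#include <cstddef>
#include <cstdint>

#include "webp/mux.h"

// Illustrative mux-edit-path harness shape for the WebPMux* entry set.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  WebPMux* mux = WebPMuxNew();
  if (mux == nullptr) return 0;

  // Feed the raw fuzz bytes in as a metadata chunk on the edit path.
  WebPData chunk = {data, size};
  WebPMuxSetChunk(mux, "EXIF", &chunk, /*copy_data=*/1);

  WebPData assembled = {nullptr, 0};
  WebPMuxAssemble(mux, &assembled);  // the public entry under test

  WebPDataClear(&assembled);
  WebPMuxDelete(mux);
  return 0;
}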

How 52 Were Stopped Before Filing — and Why 3 Still Came Back FP

The clean filing record (30 filed, 16 acked so far, 3 returned as FP) did not happen because verify (covered in Part 3) is heroic on its own. It happened because the pre-filing verify gate caught 52 crash candidates that would otherwise have gone to maintainers as bug reports. Stage 4 does adjacent work earlier in the pipeline. It catches harness-side problems before the harness fuzzes for real, so most harness-induced crashes never become candidate findings in the first place. The 52 are what verify still found after that. They split into three categories.

  • Flaky reproductions. The crash fired once during fuzzing but does not reproduce reliably when the verify replay re-runs the saved input against the same binary. Usually a race or environment-dependent timing pattern. The defect may be real, but a maintainer cannot reliably reproduce it from what we would file, so filing it is a guaranteed close-as-NOT_REPRODUCIBLE.
  • Duplicates. A different LG already produced an issue with the same root-cause stack. We file once per root cause. Subsequent LGs hitting the same site reference the existing issue and drop the new finding from the file queue.
  • Harness-side bugs that slipped past Stage 4. Verify re-runs the P1.1 to P1.8 and P2.1 to P2.8 sub-checks against the specific crash trace and sometimes catches a violation that Stage-4 adversarial probes did not surface (the bad input only shows up in long-form fuzzing). The harness goes back to the LLM with the failing sub-check ID rather than the crash going to the maintainer.

All 52 of these were caught before a maintainer ever saw them. That is the leverage of having a verify gate in the pipeline rather than relying on a maintainer to filter our noise, a job that is theirs to do for legitimate reports rather than ours to dump on them.

What about the 3 that did come back FP after filing? Those slipped past the source-level checks because the failure mode is not source-level. It is a mismatch between the library's threat model and our reading of it. The canonical example is V8's uaf-v8-deserializer-unresolved-external-reference. The entry is the public Isolate::New(deserialize_params), P3 passed (genuinely public API), P4 passed (genuinely on the attack surface, reachable to unsafe ops), and the bug is technically a UAF. But V8's own security triaging policy explicitly excludes attacker-controlled snapshot blobs from the V8 threat model. The crash is real, the threat model says "out of scope", and that is the FP.

"Technically real but threat-model-invalid" is the one failure mode the four principles cannot catch. It lives at the cross-product threat-contract layer rather than the source layer. That is exactly what the next post's Security Proxies are built for. Encode each library's threat model explicitly, run one more gate before filing, push the 3-FP number toward zero in future batches.

Why "Two Weeks, 472 Harnesses" Is the Number That Matters

Neither half of the headline is impressive on its own. Two weeks is not particularly fast, and 472 is not particularly large. The two together are the number that matters, because this is throughput under quality control. Every one of the 472 cleared P1 through P4 at the source level, mapped to a single meaningful LG, and passed both reach-probe and run-probe before being fuzzed. That is the throughput definition LLM-driven harness generation should be working toward, rather than "5,000 harnesses generated overnight" of the garbage-in, garbage-out variety. The point is throughput where every unit can be filed upstream on its own merits.

Versus a hand-driven workflow:

  • A hand-written harness averages 1 to 2 days. Pick the entry, read the protocol, write the code, validate. 472 by hand is roughly 1.5 person-years.
  • Our pipeline averages roughly 30 minutes per LG end-to-end (LG discovery 5 min, protocol 5 min, build/fix 10 min, adversarial validation 10 min), and runs across many concurrent workers. Two weeks total.
  • Source-level FP rate is essentially zero. The 3 maintainer-FP findings all classify as threat-model rather than source-level mistakes. The 16 P1 and P2 sub-checks, applied at Stage 4 and again at verify, absorbed every source-level harness mistake before it could reach a maintainer.

Anyone who has done fuzz work knows the deliverable is bugs, not harnesses. But you cannot get trustworthy bugs out of an untrustworthy harness. Spending the LLM's capacity on producing high-quality harnesses, not on producing raw bug reports, is the thesis of this work.

Next: After the Crash

Everything above is "filter errors at the harness stage". But even a P1-through-P4-clean harness produces a flood of crashes when fuzzed for real, tens of thousands per night. The next post covers the verify pipeline. The four principles applied once more to the specific crash trace, two-step independent reproduction, and Security Proxies that align the report with the maintainer's threat model. That is the gate that pushes the live filing FP rate toward zero.

Blog 3 — After the Crash

Four principles × two-step reproduction × Security Proxies. Picking the few worth filing out of the 24,000 crashes a fuzz batch produces overnight.

Blog 1 — Where the Four Principles Came From

If you are joining mid-series, start with Part 1, which has the full source-level definition of the four principles and the audit run on 586 production OSS-Fuzz harnesses.

Chrome Data Open

Metadata for all 472 harnesses (LG specs, harness sources, fuzz_runs, findings) is published on the chrome-maint dashboard.

— Ze Sheng, O2 Security Team (TAMU)