LATENTFORCE Tech Report · June 3, 2026
Launch Technical Report

How LatentGraph changes the math on coding-agent tasks.

Three independent studies of what a structured codebase graph does for AI coding agents: four agents, ten open-source repositories, three production-grade migrations, and fifteen verified security bugs.

Up to +50%
More tasks resolved on the four coding agents we tested, on 14 hard SWE-bench Pro tasks
99.8%
Edge precision on the five Python repositories (98.1% on the five JS/TS repositories), ahead of every comparable tool we tested
56%
Cheaper, ~2× faster migrating 1.8M production lines across three real codebases
~1.5×
More dependency edges than comparable tools, across 10 OSS repos
01 Background

Why we built LatentGraph.

AI coding agents are powerful, but every task still starts from zero: files re-read, imports re-followed, architecture re-guessed. LatentGraph gives the agent a persistent, structured map of the codebase so it stops guessing.

Every modern coding agent (Claude Code, Cursor, Codex, OpenCode, Copilot) runs the same loop: read the prompt, scan the codebase, write code. The agents themselves are good. The loop is broken in one specific place: every task starts from zero. The agent reads twenty files, follows imports, guesses what is coupled to what, and discards everything when the session ends. Token bills inflate as repositories grow. The agent invents abstractions that do not match the team's, misses runtime couplings static analysis cannot see, and breaks invariants nobody wrote down. The model is not the bottleneck. Context is.

What the graph contains

When you point LatentGraph at a repository, it builds a hierarchical Context Graph: files clustered into modules, modules into an app-level map, with three layers of edges between the nodes. Explicit dependencies come from static analysis: every import, every call chain, every cross-module reference. Implicit dependencies are the runtime couplings static analysis cannot see: two services sharing a Redis channel, two modules reading the same database table, an event-bus producer and consumer, a config key gating behaviour somewhere else. These are the hardest edges to find and the most expensive to miss; they are where most agent failures hide. Engineering intent is mined from your pull-request history: invariants ("auth tokens must be refreshed before expiry, grounded in PR #847"), design decisions ("we chose JWT over session cookies, see PR #312"), and recurring change playbooks. Each node carries auto-generated documentation. Each edge carries a description and the source that grounded it.
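One way to picture the structure described above is as a small set of data classes. This is a minimal sketch under our own assumptions, not LatentGraph's actual schema: the class names, field names, and the two example files are hypothetical and chosen only to mirror the prose (a node hierarchy, three edge layers, and a grounding source on every edge).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class EdgeLayer(Enum):
    EXPLICIT = "explicit"  # imports, call chains, cross-module references
    IMPLICIT = "implicit"  # runtime couplings: shared channel, table, config key
    INTENT = "intent"      # invariants and decisions mined from PR history


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    layer: EdgeLayer
    description: str
    grounded_in: str  # evidence for the edge, e.g. a file path or a PR reference


@dataclass
class Node:
    name: str
    kind: str  # "file", "module", or "app"
    doc: str = ""  # auto-generated documentation
    children: List["Node"] = field(default_factory=list)


@dataclass
class ContextGraph:
    root: Node
    edges: List[Edge] = field(default_factory=list)

    def edges_for(self, name: str, layer: Optional[EdgeLayer] = None) -> List[Edge]:
        """Return every edge touching `name`, optionally limited to one layer."""
        return [
            e for e in self.edges
            if name in (e.source, e.target) and (layer is None or e.layer is layer)
        ]


# Hypothetical two-file module; paths and descriptions are illustrative only.
publisher = Node("services/publisher.py", "file")
worker = Node("services/worker.py", "file")
graph = ContextGraph(
    root=Node("app", "app", children=[Node("services", "module", children=[publisher, worker])]),
    edges=[
        Edge("services/worker.py", "services/publisher.py", EdgeLayer.EXPLICIT,
             "worker imports a helper from publisher", "services/worker.py"),
        Edge("services/publisher.py", "services/worker.py", EdgeLayer.IMPLICIT,
             "both use the same Redis channel", "channel name literal in both files"),
    ],
)

implicit = graph.edges_for("services/worker.py", EdgeLayer.IMPLICIT)
print([e.description for e in implicit])
```

Keeping the layer and the grounding source on the edge itself means a consumer can filter by layer and still see why each edge exists.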

How it stays current

A graph that is right on Monday and stale by Friday is a snapshot, not a knowledge layer. LatentGraph stays current in three ways.

  • Incremental indexing. lgraph update re-indexes only what changed since the last commit; new PRs feed back into the invariant set automatically.
  • Agents write back. When an agent discovers something during a coding task (a Redis channel that wasn't on the graph, an invariant nobody had written down), it calls update_graph to propose the edge.
  • Owners approve. Every proposed edit, from agents or humans, lands in a Control Panel queue with a side-by-side diff. Approve, reject, or revise. Nothing modifies the graph silently.

Three months in, switching off LatentGraph means switching off three months of compounded team intelligence: discovered, curated, and persisted.
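The propose-then-approve flow described above can be sketched as a small review queue. This is an illustration of the idea only: the class names, fields, and file paths are hypothetical, and nothing here reflects how the Control Panel is actually implemented. The point it shows is that a proposal never changes the canonical edge list until a reviewer approves it.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Proposal:
    author: str  # "agent" or the name of a human contributor
    source: str
    target: str
    description: str
    status: Status = Status.PENDING


@dataclass
class ReviewQueue:
    graph_edges: List[Tuple[str, str, str]] = field(default_factory=list)
    proposals: List[Proposal] = field(default_factory=list)

    def propose(self, proposal: Proposal) -> Proposal:
        """Queue an edit; the canonical graph is not touched here."""
        self.proposals.append(proposal)
        return proposal

    def _close(self, proposal: Proposal, status: Status) -> None:
        if proposal.status is not Status.PENDING:
            raise ValueError("proposal was already reviewed")
        proposal.status = status

    def approve(self, proposal: Proposal) -> None:
        """Only an approval writes the edge into the canonical graph."""
        self._close(proposal, Status.APPROVED)
        self.graph_edges.append((proposal.source, proposal.target, proposal.description))

    def reject(self, proposal: Proposal) -> None:
        self._close(proposal, Status.REJECTED)


queue = ReviewQueue()
edge = queue.propose(Proposal("agent", "jobs/enqueue.py", "jobs/consume.py",
                              "both use the same Redis channel"))
edges_before_review = list(queue.graph_edges)  # still empty at this point
queue.approve(edge)
print(edges_before_review, queue.graph_edges)
```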

How any agent can query it

The graph is exposed through nine MCP tools any coding agent can call. get_dependencies(file) returns every edge in and out, explicit and implicit. get_call_chain(symbol) walks callers and callees transitively. get_pr_insights(module) surfaces invariants and design decisions from history. ask_codebase(query) answers a cross-cutting semantic question over the whole graph. update_graph(...) is the write tool. One MCP call typically replaces twenty file reads. The protocol is standard: lgraph add claude-code, lgraph add cursor, lgraph add codex, and so on. Any agent that speaks MCP can use the same graph.
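For readers unfamiliar with MCP, a tool invocation travels as a JSON-RPC 2.0 request whose method is tools/call. The sketch below builds such a request for get_dependencies. The envelope follows the MCP specification; the argument key file is our assumption, taken from the signature quoted above, and the file path is hypothetical. The shape of the response is not shown because this report does not specify it.

```python
import json
from itertools import count

_request_ids = count(1)


def tool_call(name: str, **arguments) -> str:
    """Serialise an MCP tools/call request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request)


# "file" as the argument key and the path itself are assumptions for illustration.
message = tool_call("get_dependencies", file="src/auth/session.ts")
print(message)
```

In practice the agent's MCP client builds and sends these messages; the sketch only makes the wire format concrete.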

Install LatentGraph and connect it to your coding agent:

$ npm install -g @latentforce/latentgraph
$ lgraph init
$ lgraph add claude-code

Full product documentation at latentgraph.latentforce.ai.

The rest of this report is the empirical evidence that this design actually works.
02 Coding-agent benchmark

Does it actually help coding agents?

On 14 hard SWE-bench Pro tasks across repositories averaging over a million lines, LatentGraph improved resolution rate on every one of the four coding agents we tested.

The benchmark

We ran 14 hard tasks drawn from the SWE-bench Pro corpus, across two large open-source codebases: protonmail/webclients and ansible/ansible. Both repositories average well over a million lines of source code. We picked hard tasks on large codebases deliberately. On small projects, modern agents already have enough room in their context window to solve most things; the differential effect of a graph layer is hard to measure. Hard tasks on million-line codebases are where the navigation cost dominates.

Four coding agents ran the same fourteen tasks twice. Without MCP the agent was given the prompt and a working tree, nothing else. With LatentGraph the same agent had the LatentGraph MCP server attached and could query the graph. Same model behind each agent. Same prompt. Same environment. The only variable was whether the agent could query the graph.

Three metrics matter. Resolved is the number of tasks the agent's patch made the target tests pass. Total cost is the API spend across all 14 tasks. $ / resolved measures cost per successful patch: spend on failed attempts is sunk, so this is the operationally meaningful efficiency number.

[Figure 1: grouped bar chart. Axis: tasks resolved (of 14). Series: Without MCP and With LatentGraph. Claude Code 8/14 vs 10/14; Cursor 6/14 vs 9/14; OpenCode 8/14 vs 10/14; Codex 9/14 vs 10/14.]
Figure 1. Tasks resolved by each coding agent, with and without LatentGraph. Same model behind each agent, same prompt, same 14-task SWE-bench Pro benchmark on protonmail/webclients and ansible/ansible. Every agent we tested resolved more tasks when it had the graph.
Table 1. Full benchmark results, 4 coding agents × 2 conditions, 14 hard SWE-bench Pro tasks.
| Agent | Condition | Resolved | Total cost | Total time | $ / resolved |
| --- | --- | --- | --- | --- | --- |
| Claude Code | Without MCP | 8/14 | $31.45 | 148.5 min | $3.93 |
| Claude Code | With LatentGraph | 10/14 | $13.31 | 83.7 min | $1.33 |
| Cursor (Composer-2.5) | Without MCP | 6/14 | $8.29 | 87.4 min | $1.38 |
| Cursor (Composer-2.5) | With LatentGraph | 9/14 | $7.37 | 78.7 min | $0.82 |
| OpenCode | Without MCP | 8/14 | $15.61 | 152.7 min | $1.95 |
| OpenCode | With LatentGraph | 10/14 | $20.52 | 133.5 min | $2.05 |
| Codex | Without MCP | 9/14 | $88.61 | 43.5 min | $9.85 |
| Codex | With LatentGraph | 10/14 | $87.87 | 45.3 min | $8.79 |

Reading the per-agent results

Claude Code shows the biggest swing on cost and time. Resolution climbs from 8/14 to 10/14, total cost drops 58% ($31.45 → $13.31), total time drops 44% (148.5 → 83.7 min). Cost per resolved task falls from $3.93 to $1.33: almost 3× more efficient at producing a correct patch. Same model, same prompt; the graph stops the agent from re-deriving context every task.

Cursor (Composer-2.5) shows the largest relative lift on resolution: 6/14 to 9/14, a 50% increase. Total cost drops 11% ($8.29 → $7.37) and total time drops 10% (87.4 → 78.7 min). Cost per resolved task falls 41% ($1.38 → $0.82). Cursor is already the cheapest agent in the suite per task, so absolute deltas are small, but proportionally every metric improves.

OpenCode is the mixed case. Resolution improves (8/14 → 10/14), total time drops 13% (152.7 → 133.5 min), but total cost rises 31% ($15.61 → $20.52). Per resolved task, cost is roughly flat ($1.95 → $2.05). The graph helps OpenCode solve harder tasks; the agent spends a little more on each successful patch and gets two extra resolutions in return. The net trade is unambiguous on outcomes, mixed on spend.

Codex is the smallest swing in the suite. Codex starts as the strongest baseline at 9/14 resolved, with the highest spend per task in the suite ($88.61 total, $9.85 per resolved). Adding LatentGraph nets +1 resolved task with total cost and total time effectively flat ($88.61 → $87.87 and 43.5 → 45.3 min, within run-to-run noise). Cost per resolved task improves 11% ($9.85 → $8.79): the same spend produces one more resolution. Not the dramatic swing of Claude Code or Cursor, but a real improvement in resolution and in cost per resolved task.
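The per-agent figures quoted in this subsection can be re-derived from Table 1 with two small formulas: cost per resolved task is total cost divided by tasks resolved, and each percentage is the relative change between the two conditions. The snippet below is a convenience for readers who want to check the arithmetic; the numbers are copied from Table 1 and the function names are ours.

```python
# (tasks resolved, total cost in USD, total time in minutes), copied from Table 1.
TABLE_1 = {
    "Claude Code": {"without": (8, 31.45, 148.5), "with": (10, 13.31, 83.7)},
    "Cursor": {"without": (6, 8.29, 87.4), "with": (9, 7.37, 78.7)},
    "OpenCode": {"without": (8, 15.61, 152.7), "with": (10, 20.52, 133.5)},
    "Codex": {"without": (9, 88.61, 43.5), "with": (10, 87.87, 45.3)},
}


def cost_per_resolved(resolved: int, total_cost: float) -> float:
    """Spend per successful patch; failed attempts count as sunk cost."""
    return total_cost / resolved


def pct_change(before: float, after: float) -> float:
    """Relative change in percent; negative means the value went down."""
    return (after - before) / before * 100


for agent, runs in TABLE_1.items():
    (r0, c0, t0), (r1, c1, t1) = runs["without"], runs["with"]
    print(
        f"{agent:12s} "
        f"$/resolved {cost_per_resolved(r0, c0):.2f} -> {cost_per_resolved(r1, c1):.2f}  "
        f"cost {pct_change(c0, c1):+.0f}%  time {pct_change(t0, t1):+.0f}%  "
        f"resolved {pct_change(r0, r1):+.0f}%"
    )
```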

What the agents did with the graph

The aggregate numbers are abstract until you see what changed inside a single task. The three case studies below are real tasks from the benchmark above: same model (Cursor Composer-2.5), same prompt, same commit. Each one is a task the baseline failed to ship any patch for, and the LatentGraph-augmented agent shipped a working one. The difference was not in the model. It was in how the agent navigated 1.3 million lines.

Case study 1 · Cursor

Subscribe to Calendar: capping URL length and centralising a mock

protonmail/webclients · commit 94dc494b · 1.1M LOC
Without MCP · Failed
0-byte patch
$0.92 · 8.3 min · 98 tool calls
  • grep: 30
  • read: 26
  • edit: 22
  • glob: 10
  • shell: 9
With LatentGraph · Passed
319-line patch
$0.61 · 6.0 min · 15 tool calls
  • get_symbol: 6
  • ask_codebase: 4
  • get_file: 2
  • get_pr_insights: 1
  • get_call_chain: 1
  • get_dependencies: 1
Why baseline failed: the agent thrashed across the monorepo looking for where MAX_LENGTHS_API was declared, which jest.setup.js was authoritative, and how SubscribeCalendarModal wired to its hooks. The 22 edit calls never produced a saved diff.

What LatentGraph did: get_symbol pinpointed the constant and its declaration site. get_dependencies enumerated every ResizeObserver import site in one round-trip. The agent spent its tokens writing the fix instead of finding the place to fix.

The mechanism: navigation, not intelligence

The aggregate numbers and the case studies converge on the same mechanism. In a million-line repository, finding every caller of a function by text search is O(N) over the whole codebase; doing it through a call-graph lookup costs one index lookup plus the size of the answer, independent of repository size. Finding every place a constant should be extended has the same shape. Finding every consumer of an enum has the same shape. The baseline agent's failures across the three cases above were not failures of model intelligence. They were failures of navigation. The graph collapses the navigation step, and the model spends its tokens on writing the fix instead of finding the place to fix.
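The difference between the two navigation strategies is easy to see in miniature. The sketch below contrasts a text scan, which touches every file and also matches comments, with a prebuilt reverse index from callee to callers. The file names, the symbol compute_tax, and the edges are hypothetical, and the index is a plain dictionary rather than anything LatentGraph-specific.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical four-file repository.
SOURCES = {
    "billing/invoice.py": "total = compute_tax(order)",
    "billing/report.py": "# TODO: stop calling compute_tax here",
    "shop/cart.py": "tax = compute_tax(cart)",
    "shop/ui.py": "render(cart)",
}

# Hypothetical call edges (calling file, called symbol), as a static pass might emit them.
CALL_EDGES = [
    ("billing/invoice.py", "compute_tax"),
    ("shop/cart.py", "compute_tax"),
    ("shop/ui.py", "render"),
]


def callers_by_scan(sources: Dict[str, str], symbol: str) -> List[str]:
    """Text search: reads every file, and matches comments and strings too."""
    return sorted(path for path, text in sources.items() if symbol in text)


def build_caller_index(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Invert the call edges once so later queries are dictionary lookups."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for caller, callee in edges:
        index[callee].add(caller)
    return index


index = build_caller_index(CALL_EDGES)
scan_hits = callers_by_scan(SOURCES, "compute_tax")
index_hits = sorted(index["compute_tax"])
print(scan_hits)
print(index_hits)
```

The scan reports billing/report.py because of a comment, while the index returns only the two real call sites; the cost of the index is paid once, when it is built.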

Section 3 turns to the prerequisite: how good is the graph itself.

03 Edge benchmark

Why the agents win: the graph is state-of-the-art.

Across ten OSS repositories (five Python, five JS/TS), LatentGraph captures more real dependency edges than the closest comparable tools, at higher precision.

Why edge quality matters

Section 2 showed that coding agents do better with the graph. This section turns to the prerequisite: how good is the graph itself. Every MCP call returns context grounded in graph edges. If edges are wrong, the agent ingests wrong context. If edges are missing, the agent misses dependencies. Precision (what fraction of captured edges are real) and coverage (how many real edges are captured at all) are the two axes that determine whether the graph is worth querying.

The benchmark setup

We compared LatentGraph's dependency-edge extraction against the two closest publicly available static-analysis tools: GitNexus and Repowise. Ten OSS repositories were used: five Python (fastapi, keras, openai-python, sqlmap, youtube-dl) and five JavaScript / TypeScript (Chart.js, dayjs, Sortable, reveal.js, fabric.js). Each tool's output was normalised to the same file universe by applying LatentGraph's collector skip-list uniformly across all three tools, so coverage and overlap comparisons operate on identical inputs.

Precision was verified two ways. For Python, an AST-based verifier confirmed every emitted edge against the actual import statements in every file: no sampling, full-corpus check. For JS/TS, where multi-mechanism resolution (calls, extends, accesses, method-overrides, dynamic imports) makes static verification harder, we sampled up to 100 edges per Venn region per tool pair and verified each with Claude Sonnet against the full file source. Precision within a language is directly comparable; across languages it is only indicative, since the verification methods differ.
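For the Python side, the idea of checking an emitted edge against the actual import statements can be sketched with the standard-library ast module. This is a simplified illustration under our own assumptions, not the verifier used in the benchmark: it handles only absolute imports, skips relative ones, and treats an edge as a pair of source text and dotted module name.

```python
import ast
from typing import Set


def imported_modules(source: str) -> Set[str]:
    """Collect the dotted module names named by absolute imports in `source`."""
    modules: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module)
    return modules


def verify_edge(source: str, target_module: str) -> bool:
    """True if the file's own import statements name `target_module`."""
    return target_module in imported_modules(source)


# Hypothetical file content used only to exercise the check.
example = "import os\nfrom keras.layers import Dense\nfrom . import utils\n"
print(sorted(imported_modules(example)))
print(verify_edge(example, "keras.layers"), verify_edge(example, "numpy"))
```

A production verifier would also need to resolve relative imports and map module names back to files, which this sketch leaves out.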

The aggregate result

LatentGraph wins on precision on both language sets. The gap is modest on Python (where all three tools are in the high 90s) and decisive on JS/TS (where the next-best tool drops more than 11 percentage points behind).

[Figure 2: horizontal bar chart of precision, axis from 60% to 100%. Python, 5 repos: LatentGraph 99.84%, GitNexus 98.55%, Repowise 96.29%. JS/TS, 5 repos: LatentGraph 98.1%, GitNexus 86.4%, Repowise 74.9%.]
Figure 2. Aggregate precision per dependency-analysis tool on Python (5 repos, full-corpus AST verification) and JS/TS (5 repos, sampled LLM verification). LatentGraph is the only tool above 98% in both language sets.
Table 2. Aggregate coverage and precision per tool, per language.
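| Tool | Python (5 repos): edges | Python: TPs | Python: precision | JS/TS (5 repos): file pairs | JS/TS: mean precision |
| --- | --- | --- | --- | --- | --- |
| LatentGraph | 20,882 | 20,848 | 99.84% | 2,688 | 98.1% |
| GitNexus | 16,177 | 15,942 | 98.55% | 3,356 | 86.4% |
| Repowise | 16,851 | 16,226 | 96.29% | 1,513 | 74.9% |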

Two coverage notes worth surfacing. On Python, LatentGraph captures the most edges total (20,882): about 24% more than Repowise and 29% more than GitNexus, driven mainly by deep traversal of large packages like keras. On JS/TS, GitNexus indexes the most file pairs (3,356), but most of that excess is documentation, scripts, and config files that the other tools intentionally skip. The volume difference is real; whether it counts as useful coverage depends on the use case.
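The Python columns of Table 2 are related by simple ratios: precision is true positives divided by emitted edges, and the coverage comparison is the ratio of edge counts between tools. The snippet below reproduces those figures from the table; the counts are copied from Table 2 and the function names are ours.

```python
# (emitted edges, true positives) for the five Python repositories, from Table 2.
PYTHON_RESULTS = {
    "LatentGraph": (20_882, 20_848),
    "GitNexus": (16_177, 15_942),
    "Repowise": (16_851, 16_226),
}


def precision_pct(edges: int, true_positives: int) -> float:
    """Share of emitted edges that were verified as real, in percent."""
    return true_positives / edges * 100


def extra_edges_pct(edges: int, other_edges: int) -> float:
    """How many more edges one tool emits than another, in percent."""
    return (edges / other_edges - 1) * 100


lg_edges = PYTHON_RESULTS["LatentGraph"][0]
for tool, (edges, tps) in PYTHON_RESULTS.items():
    line = f"{tool:12s} precision {precision_pct(edges, tps):.2f}%"
    if tool != "LatentGraph":
        line += f"  LatentGraph emits {extra_edges_pct(lg_edges, edges):.0f}% more edges"
    print(line)
```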

Two findings that travel

Beyond the headline result, two cross-validation findings hold across every repo and every tool pair in the benchmark.

The common-bucket invariant. When any two of these three tools agree on a file edge, that edge is real: 100% precision across every repository and every tool pair in the JS/TS sampled verification, and zero common false positives between LatentGraph and GitNexus across the entire Python subset. Two independent dependency analysers agreeing on an edge is by itself a strong signal of correctness, usable directly as production ground truth.

LatentGraph-only edges are signal, not noise. Edges that only LatentGraph emits remain highly precise: 98.6% vs Repowise and 92.0% vs GitNexus in the JS/TS verification. The extra coverage is not over-emission. On keras alone, LatentGraph finds 5,731 unique true positives that neither other tool surfaces: intra-package edges deep inside nested inheritance hierarchies that text-search-based call-graph traversal does not reach.
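Both findings rest on splitting two tools' outputs into Venn regions (edges both emit, edges only one emits) and measuring precision per region. A minimal sketch of that bookkeeping follows. The edge sets and the ground-truth set are hypothetical; in the benchmark the verdict on each edge came from the AST check or the sampled review described earlier, not from a lookup table.

```python
import random
from typing import Callable, Dict, List, Optional, Set, Tuple

Edge = Tuple[str, str]  # (source file, target file)


def venn_regions(a: Set[Edge], b: Set[Edge]) -> Dict[str, Set[Edge]]:
    """Split two tools' edge sets into the three regions of their Venn diagram."""
    return {"common": a & b, "only_a": a - b, "only_b": b - a}


def sample_region(edges: Set[Edge], limit: int = 100, seed: int = 0) -> List[Edge]:
    """Take at most `limit` edges from a region, reproducibly."""
    ordered = sorted(edges)
    if len(ordered) <= limit:
        return ordered
    return random.Random(seed).sample(ordered, limit)


def region_precision(sample: List[Edge], is_real: Callable[[Edge], bool]) -> Optional[float]:
    """Fraction of sampled edges the verifier accepts; None for an empty region."""
    if not sample:
        return None
    return sum(1 for edge in sample if is_real(edge)) / len(sample)


# Hypothetical outputs of two tools on a toy repository.
tool_a = {("a.js", "b.js"), ("a.js", "c.js"), ("b.js", "d.js")}
tool_b = {("a.js", "b.js"), ("b.js", "d.js"), ("c.js", "README.md")}
ground_truth = {("a.js", "b.js"), ("a.js", "c.js"), ("b.js", "d.js")}

regions = venn_regions(tool_a, tool_b)
for name, edges in regions.items():
    sample = sample_region(edges)
    print(name, len(edges), region_precision(sample, ground_truth.__contains__))
```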

What the other tools do not attempt: implicit dependencies

Everything above is about explicit dependencies: imports, calls, class extensions, references that a static analyser can see in the source. This is the comparison the three tools support.

In production, the LatentGraph context graph also surfaces implicit dependencies: the runtime couplings static analysis cannot see, where one service publishes to a Redis channel another service consumes, or two modules read the same database table, or a configuration key gates behaviour somewhere else. Neither GitNexus nor Repowise attempts to mine these. They do not appear in any of the tables above because there is nothing to compare against. Section 4 walks through implicit dependencies in detail, plus two more categories of edge that static analysis does not produce: design intent mined from pull-request history, and knowledge agents add to the graph as they work.

Want to try this on a codebase you recognise? latentgraph.latentforce.ai/showcase hosts the live Engineering Brain for the same OSS repositories benchmarked here. Click any repo and explore the actual graph the agent sees.

Honest limits

The benchmark above is the strongest statement we can make today, but it is bounded. The known limits are listed below.

Section 4 picks up the implicit-dependency claim. Section 5 turns to what high-quality graph context enables downstream: migrations and security audits at production scale.

04 Beyond static analysis

Beyond static analysis: the invisible spectrum.

Three categories of edge static analysis does not produce: implicit runtime couplings between files that do not import each other, architectural intent mined from pull-request history, and knowledge the agents themselves write back to the graph as they work.

Sections 2 and 3 measured what the graph does for coding agents and how good the underlying edges are. This section turns to what the graph contains that no static analyser produces at all. Three categories live in this invisible spectrum: implicit dependencies (runtime couplings between files that do not import each other), PR insights (design intent and architectural invariants the code does not declare), and agent write-back (knowledge the agents add to the graph as they discover it during coding tasks). The examples below come from public OSS codebases. None is reachable by grep, IDE navigation, call hierarchy, or standard dependency analysis.

§4.1 Implicit dependencies

An implicit dependency is a runtime coupling that no import edge declares. Two files that share a Redis channel. Two services that read the same database table. An event-bus producer and consumer whose only connection is a topic name. Two callers that depend on the shape of an object a third file produces, not on the function that produces it. Static analysers stop at import, call, and reference edges. The examples below stop somewhere else.

Featured: axios · six callers, one invisible shape contract

In axios, every final HTTP-request configuration object is constructed by lib/core/mergeConfig.js using Object.create(null). This is a security requirement: a prototype-less object blocks prototype-pollution attacks from leaking attacker-controlled adapter, proxy, or validateStatus overrides into a request. But six internal callers (and an unknown number of user-written interceptors) call config.hasOwnProperty(key) at runtime, and a prototype-less object has no hasOwnProperty. To avoid a runtime TypeError, mergeConfig manually grafts a non-enumerable hasOwnProperty back onto every config object it produces.

The result is an invisible contract. Six internal callers, Axios._request, Axios.getUri, createInstance, resolveConfig, lib/adapters/xhr.js, and lib/adapters/fetch.js, plus every user-written interceptor, depend on the joint shape: null prototype plus grafted method. Nothing about import mergeConfig declares this dependency. The dependency is on what the returned object looks like, not on the function that produces it.

[Figure 3: diagram, "The hidden shape contract", axios v1.x: contract by shape, not by import. lib/core/mergeConfig.js produces config via Object.create(null) plus a manual hasOwnProperty graft. An implicit shape contract that no import expresses links it to: Axios._request (reads config via own()), Axios.getUri (reads config via own()), createInstance (consumes shape), resolveConfig (depends on merge strategies), adapters/xhr.js (own-property guards), adapters/fetch.js (own-property guards), and user interceptors (call config.hasOwnProperty(), which works only via the manual graft).]
Figure 3. The hidden shape contract in axios. mergeConfig.js produces every config object with a null prototype plus a manually grafted hasOwnProperty. Six internal callers and every user-written interceptor silently depend on the joint shape. Grounded in PRs #10764, #10779, #10833, #10892.

Why static analysis cannot see this. A static analyser sees that six files call mergeConfig. It does not see that all six callers depend on a non-standard prototype shape and a grafted method that is not part of any TypeScript type, JSDoc annotation, or comment in the source. The contract is reconstructable only from PR history and from the manual graft itself.

Failure mode. A developer adds a new config-construction path using {...defaults, ...userConfig} spread instead of calling mergeConfig. The resulting object has a normal prototype. Existing callers continue to work. Prototype pollution becomes reachable again, silently routing requests to attacker-controlled endpoints via injected adapter or proxy overrides. The build passes. The type checker passes. The tests pass against benign input.

§4.2 PR insights

PR insights are the design decisions and architectural invariants that produced the code, mined from the project's pull-request history. They include rejected approaches, escalating security fixes, change playbooks, and constraints the code obeys but does not declare. A grep tells you what the code does. A PR insight tells you what the code is not allowed to do.

Featured: requests · the legacy method that silently reinstates a CVE

In May 2024, requests shipped a fix for CVE-2024-35195: under specific conditions, a TLS connection established with verify=False could be reused for a subsequent request that required verify=True, completely bypassing certificate verification. PRs #6655, #6662, and #6710 introduced a new method, HTTPAdapter.get_connection_with_tls_context, that folds verify and cert into the connection-pool key via a helper called build_connection_pool_key_attributes. Connections with different security parameters are no longer pooled together. CVE closed.

The fix preserved the legacy method HTTPAdapter.get_connection for backwards compatibility. The CVE-safe path lives in the new method and its callee build_connection_pool_key_attributes. The legacy method is never instrumented. Any HTTPAdapter subclass, existing or new, that overrides only the legacy get_connection instead of the new get_connection_with_tls_context silently bypasses the fix. The vulnerability returns. No runtime error, no test failure, no static-analysis warning.

[Figure 4: diagram, "CVE-2024-35195 · The subclass trap", requests, src/requests/adapters.py. In class HTTPAdapter, get_connection(...) is labelled legacy and CVE-vulnerable (does not call build_connection_pool_key_attributes); get_connection_with_tls_context(...) is labelled CVE-safe (PR #6710, folds verify and cert into the pool key; the CVE fix lives here). A subclass MyCustomAdapter(HTTPAdapter) that overrides get_connection is labelled as silently reinstating the CVE. The two methods look like normal API surface; nothing in the source declares that the new method is canonical for security.]
Figure 4. The CVE-2024-35195 subclass trap. Reading the file shows two parallel methods. The graph reveals that build_connection_pool_key_attributes, the actual CVE fix, is reachable only via the new method, and that any subclass override of the legacy method silently bypasses the security check. Grounded in PRs #6655, #6662, #6710, #6716.

Why static analysis cannot see this. Reading the file shows two parallel methods that look like normal API surface. Grep for build_connection_pool_key_attributes finds it but does not reveal its relationship to the CVE. Find-references on get_connection finds direct callers, but a subclass override is not a caller. The constraint that the new method is canonical for security exists only in PR history. get_call_chain traverses a six-hop path from Session.get through Session.send and HTTPAdapter.send to get_connection_with_tls_context, and confirms that build_connection_pool_key_attributes is reachable through the new method only.
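Once the constraint is known, the risky subclass pattern is mechanical to flag. The sketch below checks whether a class redefines the legacy method without also redefining the new one. It uses a stand-in HTTPAdapter with the two method names from the text so that it runs without any dependency; it is not the requests class, the method bodies are placeholders, and it makes no claim about how any particular requests version dispatches between the two methods.

```python
LEGACY = "get_connection"
CANONICAL = "get_connection_with_tls_context"


class HTTPAdapter:
    """Stand-in carrying the two method names from the text; not the real class."""

    def get_connection(self, url, proxies=None):
        return "legacy path"

    def get_connection_with_tls_context(self, request, verify, proxies=None, cert=None):
        return "TLS-aware path"


def overrides_only_legacy(cls, base=HTTPAdapter):
    """True if `cls` redefines LEGACY below `base` without redefining CANONICAL."""
    if not issubclass(cls, base):
        raise TypeError(f"{cls.__name__} does not derive from {base.__name__}")
    redefined = set()
    for klass in cls.__mro__:
        if klass is base:
            break
        redefined.update(vars(klass))
    return LEGACY in redefined and CANONICAL not in redefined


class MyCustomAdapter(HTTPAdapter):
    def get_connection(self, url, proxies=None):  # overrides the legacy method only
        return "custom pool"


class MigratedAdapter(HTTPAdapter):
    def get_connection_with_tls_context(self, request, verify, proxies=None, cert=None):
        return "custom TLS-aware pool"


for adapter in (MyCustomAdapter, MigratedAdapter):
    print(adapter.__name__, overrides_only_legacy(adapter))
```

The check itself is trivial; what it cannot supply is the knowledge that this pair of methods matters, which is the part the PR history carries.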

§4.3 Agent write-back: code-grounded memory that compounds

The graph is not only something agents query. It is also something agents extend. The MCP write tool update_graph lets any MCP-speaking agent propose new edges, invariants, or design decisions as it discovers them during a coding task. An agent that traces a Redis channel writer to its consumer can record the implicit edge. An agent that surfaces a constraint by failing a test can record the invariant. Every proposed edit lands in a Control Panel queue that owners review before anything is added to the canonical graph, so the artefact remains trustworthy.

The effect compounds. Three months in, the graph holds knowledge mined from PR history (continuous), invariants annotated by engineers (curated), and edges discovered by agents during real coding sessions (incremental). The graph becomes a code-grounded agent memory for the team: a curated artefact that captures both the codebase as it is and the knowledge the team plus its agents have accumulated about how it works. Switching off LatentGraph three months in means switching off three months of compounded team intelligence.

A separate report on the write side and the compounding loop is in preparation.

Section 5 turns to what these capabilities enable downstream at production scale: a code-generation migration study across roughly 1.8 million lines, and a source-code review and security audit study that surfaced fifteen Critical or High verified bugs across three AI-infrastructure repositories.

05 Downstream use cases

What this enables downstream.

Once an agent can query the graph, the tasks where context dominates outcomes start working: migrations, refactors, code reviews, security audits.

Sections 2, 3, and 4 measured and demonstrated the graph itself. Agents do better with it (S2). The underlying edges are accurate and comprehensive (S3). The categories of edge static analysis does not produce (implicit runtime couplings between files that do not import each other, design intent mined from PR history, knowledge agents write back to the graph themselves) are real and reproducible across public OSS code (S4). Section 5 turns to two production-grade case studies where these properties were exercised at scale: a 1.8M-line code-generation migration study, and an exhaustive source-code review and security audit of three real AI-infrastructure repositories. Both ran the same experimental design as Section 2: same model, same prompt, the only variable was whether the agent could query the graph.

§5.1 Migrations

We ran three large-scale migrations through Claude Code. BlazingMQ, Bloomberg's distributed message queue (405K lines of C++), translated to idiomatic Rust. PeerTube, a federated video platform (370K lines of Angular), ported to React with TypeScript. PowerToys, Microsoft's Windows productivity suite (324K lines of C# / WPF), re-implemented as an Electron / React / TypeScript desktop application. Roughly 1.8 million lines of source total. Same Claude Code, same exhaustive migration prompt, run twice per project. The only difference: the LGraph-MCP runs had one additional sentence appended to the prompt, "Use LGraph MCPs wisely."

Two evaluation dimensions: efficiency (cost, wall-clock time, lines produced, token and cache efficiency) and fidelity (structural mapping vs the original, feature completeness, functional integration, build state). React and Electron outputs were actually built in the analysis environment (npm run build); the Rust port was structurally compared and grep-verified against the original C++ source. Fidelity was scored out of 100 by a Claude Code analyst comparing each migrated codebase against its original.

Scope note. These are first-pass code-generation runs, not production-ready end-to-end migrations. Real migrations at this scale need manual integration, debugging, and iteration well beyond a single agent session; none of these three outputs would ship as-is. The question this study answers is narrower: given the same Claude Code, same prompt, and same source codebase, how much better is the first-pass generated code when the agent can query a context graph? A full end-to-end migration study is in the pipeline and will be published separately.

On every dimension and every project, the LGraph runs won. Total cost dropped 56% ($93.00 → $41.19). Total wall-clock time dropped from 7 hours to 3h 49m. Average fidelity rose from 59 / 100 to 69 / 100. The LGraph runs produced 44,246 fewer lines of code while scoring higher on completeness: those extra lines in the baseline runs were churn, not value.

[Figure 5: grouped bar chart of fidelity score, axis from 0 to 100, Without MCP vs With LatentGraph. BlazingMQ (C++ → Rust): 65 vs 76. PeerTube (Angular → React): 61 vs 74. PowerToys (C#/WPF → Electron): 52 vs 58.]
Figure 5. Fidelity scores out of 100 for each of the three migrations, with and without LatentGraph. Scored by a Claude Code analyst against four criteria (structural mapping, feature completeness, functional integration, build state). LatentGraph wins all three projects.
Table 3. Efficiency overview, three migrations, ~1.8M lines of source.
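| Migration | Condition | Cost | Time | Lines added | Fidelity |
| --- | --- | --- | --- | --- | --- |
| BlazingMQ | Without MCP | $9.35 | 33m 47s | 10,321 | 65 |
| BlazingMQ | With LatentGraph | $6.65 | 30m 39s | 8,072 | 76 |
| PeerTube | Without MCP | $35.46 | 2h 36m 32s | 37,053 | 61 |
| PeerTube | With LatentGraph | $12.87 | 1h 18m 34s | 15,122 | 74 |
| PowerToys | Without MCP | $48.19 | 3h 49m 54s | 52,550 | 52 |
| PowerToys | With LatentGraph | $21.67 | 1h 59m 58s | 32,484 | 58 |
| Totals / avg | Without MCP | $93.00 | 7h 00m | 99,924 | 59 |
| Totals / avg | With LatentGraph | $41.19 | 3h 49m | 55,678 | 69 |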

The mechanism is the one Section 2 measured at smaller scale, now amplified across hundreds of thousands of lines. The LatentGraph MCP lets the migration agent resume exactly where it left off without re-reading every file. From the consolidated run: 43% fewer Sonnet output tokens (941K vs 1.66M), 59% fewer Sonnet cache reads (59.4M vs 143.5M), and roughly 400× less Haiku orchestration overhead (~57 vs ~23,000 tokens). The model spends its tokens writing the migration; the baseline runs spend them re-deriving what the LGraph-augmented agent already discovered.
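The totals row of Table 3 can be re-derived from the per-project rows: costs and lines are summed, wall-clock times are summed after conversion to seconds, and fidelity is averaged. The snippet below does that arithmetic; the values are copied from Table 3 and the helper names are ours.

```python
import re

# (cost in USD, wall-clock time, lines added, fidelity), copied from Table 3.
TABLE_3 = {
    "BlazingMQ": {"without": (9.35, "33m 47s", 10_321, 65), "with": (6.65, "30m 39s", 8_072, 76)},
    "PeerTube": {"without": (35.46, "2h 36m 32s", 37_053, 61), "with": (12.87, "1h 18m 34s", 15_122, 74)},
    "PowerToys": {"without": (48.19, "3h 49m 54s", 52_550, 52), "with": (21.67, "1h 59m 58s", 32_484, 58)},
}

UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def to_seconds(text: str) -> int:
    """Convert a duration such as '2h 36m 32s' to seconds."""
    return sum(int(n) * UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)([hms])", text))


def totals(condition: str):
    """Sum cost, time, and lines across projects; average the fidelity score."""
    rows = [project[condition] for project in TABLE_3.values()]
    cost = round(sum(r[0] for r in rows), 2)
    seconds = sum(to_seconds(r[1]) for r in rows)
    lines = sum(r[2] for r in rows)
    fidelity = sum(r[3] for r in rows) / len(rows)
    return cost, seconds, lines, fidelity


for condition in ("without", "with"):
    cost, seconds, lines, fidelity = totals(condition)
    hours, minutes = seconds // 3600, seconds % 3600 // 60
    print(f"{condition:8s} ${cost:.2f}  {hours}h {minutes:02d}m  {lines:,} lines  fidelity {fidelity:.0f}")

saving_pct = (1 - totals("with")[0] / totals("without")[0]) * 100
print(f"cost saving {saving_pct:.0f}%")
```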

One concrete fidelity example. On PowerToys, the LGraph-augmented run wired all 10 renderer modules to IPC and verified every OS operation (resize, color-pick, rename, hotkeys). The baseline run wired 16 of 33 renderer modules (48%) and left most utility implementations as self-labelled stubs. The difference was not in how Claude Code writes code: both runs used the same Claude Code, the same model. It was in whether the agent could enumerate the IPC handlers in one MCP call (LatentGraph) or by file traversal across 716K lines (baseline).

§5.2 Source code review and security audit

We ran exhaustive source-code reviews on three real public AI-infrastructure codebases, anchored by an identical security-audit prompt. agentmemory (TypeScript, a persistent-memory layer for AI coding agents that captures every prompt, tool call, and observation). hermes-webui (Python + JavaScript, an agent orchestration UI with a built-in terminal feature so the agent can run shell commands). openfang (Rust, a 14-crate "Agent Operating System" covering agent spawning, multi-channel messaging, a peer-to-peer agent-to-agent protocol, and a credential vault). Same model (Claude Sonnet 4.6, with Claude Haiku 4.5 for orchestration). Same exhaustive security-audit prompt. The only variable: whether the auditor had the LatentGraph MCP server attached.

The LGraph-augmented audit produced 15 Critical or High verified bugs that the plain audit, on the same prompt and the same model, did not produce: 7 Critical and 8 High, evenly distributed across the three repos (5 bugs in each). False positives fell 3.6× under the LGraph audit (5, against 18 for the plain audit, in iteration 1). On agentmemory, where the deltas are sharpest, the LGraph audit cost 52% less ($6.98 vs $14.40) and finished in about half the time (35 min vs 1h 12m).

[Figure 6: stacked bar chart of LGraph-only verified bugs by severity, axis from 0 to 5. agentmemory: 3 Critical, 2 High. hermes-webui: 2 Critical, 3 High. openfang: 2 Critical, 3 High.]
Figure 6. 15 LGraph-only verified bugs across three AI-infrastructure repositories, split by severity. Every bug exists at the cited file:line in the cited commit, every exploit is plausible, and none was flagged by the plain audit on the same prompt.

After verifying all 15 bugs against the actual source code at the cited commit and line, three patterns explain why the plain audits missed them.

Multi-file attack chains (6 of 15). Real-world security bugs span modules. One example from agentmemory: a five-step persistent prompt-injection chain that runs from the prompt-capture hook at src/hooks/prompt-submit.ts, through a shared KV store, into the prompt builders at src/prompts/{compression,consolidation,summary}.ts, and out to the LLM as "trusted" context. None of those files imports any of the others; the chain is connected only by runtime data flow through the KV store. Another, from openfang, runs routes::spawn_agent → kernel.spawn_agent → spawn_agent_with_parent(None) and skips validate_capability_inheritance; that chain spans three crates. A file-by-file scanner must spot a vulnerability inside every file along the chain. The graph traces it in one walk.

Systemic patterns (4 of 15). A single root cause replicated across many files. agentmemory's prompt-injection idiom appears in every prompt builder (Bug 4 actually spans compression.ts, consolidation.ts, and summary.ts). openfang's AuthManager.authorize() is defined once in crates/openfang-kernel/src/auth.rs:158 and called from zero HTTP route handlers; the role system is fully implemented, fully documented, and entirely disconnected from the routes that need it. Plain audits catch one instance and miss the category. The graph queries the category and returns the full list.
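
"Querying the category" can be sketched as a reachability check over call edges: list every route handler with no call path to the authorization guard. AuthManager.authorize, routes::spawn_agent, kernel.spawn_agent, and spawn_agent_with_parent are named in this report; the two hypothetical_* handlers and the dict-of-sets representation are placeholders for illustration, not openfang's real routes or LatentGraph's query API.

```python
# Toy call graph: caller -> set of callees. The "hypothetical_*" names
# and the dict-of-sets layout are illustrative assumptions.
CALLS = {
    "routes::spawn_agent": {"kernel.spawn_agent"},
    "kernel.spawn_agent": {"spawn_agent_with_parent"},
    "routes::hypothetical_list": {"kernel.hypothetical_list"},
    "routes::hypothetical_kill": set(),
}
ROUTE_HANDLERS = [name for name in CALLS if name.startswith("routes::")]
GUARD = "AuthManager.authorize"


def reaches(calls, start, target):
    """Return True if `target` is reachable from `start` via call edges."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(calls.get(node, ()))
    return False


# The category query: every route handler with no call path to the guard.
unguarded = sorted(h for h in ROUTE_HANDLERS if not reaches(CALLS, h, GUARD))
print(unguarded)
```

In this toy graph, as in the openfang finding, no handler reaches the guard, so the query returns the full handler list instead of a single instance.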

Severity-via-context (5 of 15). A bug looks individually minor until you see who reads it downstream. agentmemory's cwd parameter in src/functions/branch-aware.ts:14 looks like a path-traversal hygiene issue, until the graph traces it into execFile("git", ...) and reveals that git honors hooks from .git/config in the supplied directory. The hermes-webui terminal SSE looks like an information disclosure until the graph shows what flows through PTYs (sudo prompts, SSH keys typed at agent prompts). openfang's approve endpoint looks like one over-broad route until the graph reveals it backstops every shell_exec call. Severity grading without architectural context systematically under-grades systemic issues.

The pattern across all 15 bugs: the vulnerable code looks correct in isolation. The bug only becomes visible when you can see the relationship between the lines, which is what a graph of the codebase provides and what plain grep-and-read cannot.

Sections 2 through 5 examined the same hypothesis at multiple scales: hard tasks (S2), edge benchmarks (S3), a tour of the categories of edge static analysis does not produce (S4), and 1.8 million lines of migration plus 15 verified bugs (S5). The mechanism is the same in every case: on tasks where the agent must reason about how the system fits together, structured graph access turns previously intractable navigation into an MCP call. Section 6 names the limits these studies have not yet addressed.

06 Limits

Limits.

What these studies do not yet show, and the conditions under which LatentGraph does not help today.

The studies in this report establish a clear signal inside a deliberately bounded scope: hard tasks on large codebases, four coding agents, two languages, a curated set of real codebases. The signal is consistent and the numbers are traceable. The report makes no claim of universality. Each of the bodies of evidence above has named boundaries, listed here in the same place as the wins so the wins are read inside them, not around them.

Sample size on the coding-agent benchmark. 14 hard SWE-bench Pro tasks is a small N. We chose breadth across four coding agents and consistency across two million-line codebases (where context cost dominates) over running many tasks on a single agent or a single repo. The benchmark does not claim to generalise to small projects, where the model already has enough room to solve most tasks without a graph layer.

OpenCode integration is still early-stage. On cost per resolved task (the operationally meaningful efficiency number), LatentGraph lowers cost substantially for three of the four agents: Claude Code by 66% ($3.93 → $1.33), Cursor by 41% ($1.38 → $0.82), and Codex by 11% ($9.85 → $8.79). OpenCode is essentially flat, rising about 5% ($1.95 → $2.05). The OpenCode MCP integration is the most recent of the four and the least mature on our end. These are first-cut results, and we expect the OpenCode numbers to improve as the integration matures.
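
The percentages follow directly from the dollar figures quoted above. A short sketch of the arithmetic, using only those figures (the function and variable names are ours, for illustration):

```python
# (without graph, with graph) cost per resolved task in USD, as quoted above.
COST_PER_RESOLVED = {
    "Claude Code": (3.93, 1.33),
    "Cursor": (1.38, 0.82),
    "Codex": (9.85, 8.79),
    "OpenCode": (1.95, 2.05),
}


def pct_change(before, after):
    """Signed percentage change; negative means cheaper with the graph."""
    return (after - before) / before * 100


changes = {
    agent: pct_change(before, after)
    for agent, (before, after) in COST_PER_RESOLVED.items()
}

for agent, change in changes.items():
    print(f"{agent}: {change:+.0f}%")
```

Rounded to whole percentages, this gives -66% for Claude Code, -41% for Cursor, -11% for Codex, and +5% for OpenCode.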

Specific known gaps in the explicit-edge benchmark. The fastapi __init__.py re-export gap costs LatentGraph 1,227 test-import edges that both Repowise and GitNexus find. The fabric.js bundle file (fabric.full.js) produces the only sub-95% precision result in the JS/TS set (91.8%). The keras _tf_keras shim accounts for 26 of LatentGraph's 34 Python false positives. Each one is named, traced, and targeted for a fix in upcoming releases.

JS / TS precision verification has a statistical floor. The verifier is Claude Sonnet, sampled at 100 edges per Venn region per tool pair. Prior calibration runs put spot-check reliability at ≥ 95%; the remaining roughly 5% is the uncertainty floor on every JS / TS precision number in Section 3.
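
Separately from verifier reliability, a 100-edge sample carries its own sampling error. The sketch below gives a feel for its size using a Wilson score interval. The Wilson interval is one common choice picked here for illustration; the report does not state which interval, if any, the verification pipeline computes, and the 95-of-100 input is a made-up example, not a measured result.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical input: 95 of 100 sampled edges judged correct.
low, high = wilson_interval(95, 100)
print(f"observed 0.950, interval {low:.3f} to {high:.3f}")
```

At this sample size the interval spans several percentage points on either side of the observed value, which is why small differences between sampled precision figures should be read cautiously.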

The migration study is first-pass code generation. Restated from the Section 5.1 scope note: the three migrations are agent-driven first-pass conversions, not production-ready deployments. None of the three outputs would ship as-is. The question the study answers is whether the first-pass generated code is better with the graph than without; the broader question of end-to-end agent-driven migration is the subject of a separate study now in progress.

Each limit above bounds the scope of a specific result, not the underlying mechanism. The signal inside the scope is consistent across four agents, ten repositories, three migrations, and fifteen verified bugs, and every number in the report is traceable to its source.

07 Conclusion

Try it.

One install. One MCP plug-in. Your engineering brain, built.

This report examines the same hypothesis at multiple scales. On 14 hard SWE-bench Pro tasks (Section 2), every agent we tested resolved more tasks with LatentGraph attached, and cost per resolved task dropped substantially for three of the four. On 10 OSS repositories (Section 3), LatentGraph captures more real dependency edges at higher precision than the closest comparable tools. On a tour through public OSS codebases (Section 4), the graph surfaced implicit-dependency contracts and PR-mined design intent that no static analyser produces. On 1.8 million lines of code-generation migrations and 15 verified security bugs across three AI-infrastructure repositories (Section 5), the same mechanism scales: structured graph access turns previously intractable navigation into a single MCP call.

LatentGraph ships the map today. Works with any MCP-speaking coding agent: Claude Code, Cursor, Codex, OpenCode, Copilot, or any future client that speaks the protocol.

$ npm install -g @latentforce/latentgraph
$ lgraph init
$ lgraph add claude-code

Full product documentation at latentgraph.latentforce.ai. To explore a live Engineering Brain on the same OSS repositories benchmarked in Section 3, with no install required, visit latentgraph.latentforce.ai/showcase.

Questions, or want to talk to the team directly? Reach us at latentforce.ai/contact. We respond very quickly.

Codebase today. PRDs, design docs, incident retros tomorrow.