How LatentGraph changes the math on coding-agent tasks.
Three independent studies of what a structured codebase graph does for AI coding agents: four agents, ten open-source repositories, three production-grade migrations, and fifteen verified security bugs.
Why we built LatentGraph.
AI coding agents are powerful, but every task still starts from zero: files re-read, imports re-followed, architecture re-guessed. LatentGraph gives the agent a persistent, structured map of the codebase so it stops guessing.
Every modern coding agent (Claude Code, Cursor, Codex, OpenCode, Copilot) runs the same loop: read the prompt, scan the codebase, write code. The agents themselves are good. The loop is broken in one specific place: every task starts from zero. The agent reads twenty files, follows imports, guesses what is coupled to what, and discards everything when the session ends. Token bills inflate as repositories grow. The agent invents abstractions that do not match the team's, misses runtime couplings static analysis cannot see, and breaks invariants nobody wrote down. The model is not the bottleneck. Context is.
What the graph contains
When you point LatentGraph at a repository, it builds a hierarchical Context Graph: files clustered into modules, modules into an app-level map, with three layers of edges between the nodes. Explicit dependencies come from static analysis: every import, every call chain, every cross-module reference. Implicit dependencies are the runtime couplings static analysis cannot see: two services sharing a Redis channel, two modules reading the same database table, an event-bus producer and consumer, a config key gating behaviour somewhere else. These are the hardest edges to find and the most expensive to miss; they are where most agent failures hide. Engineering intent is mined from your pull-request history: invariants ("auth tokens must be refreshed before expiry, grounded in PR #847"), design decisions ("we chose JWT over session cookies, see PR #312"), and recurring change playbooks. Each node carries auto-generated documentation. Each edge carries a description and the source that grounded it.
How it stays current
A graph that is right on Monday and stale by Friday is a snapshot, not
a knowledge layer. LatentGraph stays current in three ways.
- Incremental indexing. lgraph update re-indexes only what changed since the last commit; new PRs feed back into the invariant set automatically.
- Agents write back. When an agent discovers something during a coding task (a Redis channel that wasn't on the graph, an invariant nobody had written down), it calls update_graph to propose the edge.
- Owners approve. Every proposed edit, from agents or humans, lands in a Control Panel queue with a side-by-side diff. Approve, reject, or revise. Nothing modifies the graph silently.
Three months in, switching off LatentGraph means switching off three
months of compounded team intelligence: discovered, curated, and persisted.
How any agent can query it
The graph is exposed through nine MCP tools any coding
agent can call. get_dependencies(file) returns every edge
in and out, explicit and implicit. get_call_chain(symbol)
walks callers and callees transitively.
get_pr_insights(module) surfaces invariants and design
decisions from history. ask_codebase(query) answers a
cross-cutting semantic question over the whole graph.
update_graph(...) is the write tool. One MCP call
typically replaces twenty file reads. The protocol is standard:
lgraph add claude-code, lgraph add cursor,
lgraph add codex, and so on. Any agent that speaks MCP
can use the same graph.
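As a rough illustration of what one such call looks like on the wire: MCP is built on JSON-RPC 2.0, and a client invokes a tool with a tools/call request. The sketch below builds that request for get_dependencies. The argument key "file" is an assumption taken from the get_dependencies(file) signature above, and the file path is made up; the real argument schema should be read from the server's tool listing.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry a unique id per call


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Assumption: the key "file" mirrors get_dependencies(file) in the text;
# the path below is hypothetical.
request = build_tool_call("get_dependencies", {"file": "src/auth/tokens.py"})
print(json.dumps(request, indent=2))
```

In practice the agent's MCP client builds and sends this message; the point of the sketch is only that a dependency lookup is one request and one response.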
To install LatentGraph and connect it to your coding agent, see the full product documentation at latentgraph.latentforce.ai.
Does it actually help coding agents?
On 14 hard SWE-bench Pro tasks across repositories averaging over a million lines, LatentGraph improved resolution rate on every one of the four coding agents we tested.
The benchmark
We ran 14 hard tasks drawn from the SWE-bench Pro corpus, across two
large open-source codebases: protonmail/webclients and
ansible/ansible. Both repositories average well over a
million lines of source code. We picked hard tasks on large codebases
deliberately. On small projects, modern agents already have enough room
in their context window to solve most things; the differential effect
of a graph layer is hard to measure. Hard tasks on million-line
codebases are where the navigation cost dominates.
Four coding agents ran the same fourteen tasks twice. Without MCP the agent was given the prompt and a working tree, nothing else. With LatentGraph the same agent had the LatentGraph MCP server attached and could query the graph. Same model behind each agent. Same prompt. Same environment. The only variable was whether the agent could query the graph.
These metrics matter. Resolved is the number of tasks the agent's patch made the target tests pass. Total cost is the API spend across all 14 tasks. $ / resolved measures cost per successful patch: spend on failed attempts is sunk, so this is the operationally meaningful efficiency number.
Every agent we tested resolved more tasks when it had the graph.
| Agent | Condition | Resolved | Total cost | Total time | $ / resolved |
|---|---|---|---|---|---|
| Claude Code | Without MCP | 8/14 | $31.45 | 148.5 min | $3.93 |
| Claude Code | With LatentGraph | 10/14 | $13.31 | 83.7 min | $1.33 |
| Cursor (Composer-2.5) | Without MCP | 6/14 | $8.29 | 87.4 min | $1.38 |
| Cursor (Composer-2.5) | With LatentGraph | 9/14 | $7.37 | 78.7 min | $0.82 |
| OpenCode | Without MCP | 8/14 | $15.61 | 152.7 min | $1.95 |
| OpenCode | With LatentGraph | 10/14 | $20.52 | 133.5 min | $2.05 |
| Codex | Without MCP | 9/14 | $88.61 | 43.5 min | $9.85 |
| Codex | With LatentGraph | 10/14 | $87.87 | 45.3 min | $8.79 |
Reading the per-agent results
Claude Code shows the biggest swing on cost and time.
Resolution climbs from 8/14 to 10/14, total cost drops 58%
($31.45 → $13.31), total time drops 44%
(148.5 → 83.7 min). Cost per resolved task falls from
$3.93 to $1.33: almost 3× more efficient at producing a correct patch.
Same model, same prompt; the graph stops the agent from re-deriving
context every task.
Cursor (Composer-2.5) shows the largest
relative lift on resolution: 6/14 to 9/14, a 50% increase.
Total cost drops 11% ($8.29 → $7.37) and total time
drops 10% (87.4 → 78.7 min). Cost per resolved task
falls 41% ($1.38 → $0.82). Cursor is already the
cheapest agent in the suite per task, so absolute deltas are
small, but proportionally every metric improves.
OpenCode is the mixed case. Resolution improves
(8/14 → 10/14), total time drops 13% (152.7 → 133.5 min),
but total cost rises 31% ($15.61 → $20.52). Per resolved
task, cost is roughly flat ($1.95 → $2.05). The graph
helps OpenCode solve harder tasks; the agent spends a little more on
each successful patch and gets two extra resolutions in return. The
net trade is unambiguous on outcomes, mixed on spend.
Codex shows the smallest swing in the suite. Codex
starts as the strongest baseline at 9/14 resolved, with the highest
spend per task in the suite ($88.61 total, $9.85 per resolved).
Adding LatentGraph nets +1 resolved task with total cost and total
time effectively flat (within run-to-run noise). Cost per resolved
task improves 11% ($9.85 → $8.79): the same spend
produces one more resolution. Not the dramatic swing of Claude
Code or Cursor, but a real improvement in resolution and in cost per resolved task.
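The per-agent figures above can be re-derived from the results table. The sketch below copies the resolved counts and total costs from that table and recomputes cost per resolved task and the relative changes; the function and variable names are illustrative only. Small differences from the quoted percentages can appear because the code works from unrounded per-task costs.

```python
# (resolved tasks, total cost in USD) per agent, copied from the results table.
RESULTS = {
    "Claude Code": {"without": (8, 31.45), "with": (10, 13.31)},
    "Cursor": {"without": (6, 8.29), "with": (9, 7.37)},
    "OpenCode": {"without": (8, 15.61), "with": (10, 20.52)},
    "Codex": {"without": (9, 88.61), "with": (10, 87.87)},
}


def cost_per_resolved(resolved: int, total_cost: float) -> float:
    """Total spend divided by the number of tasks actually resolved."""
    return total_cost / resolved


def percent_change(before: float, after: float) -> float:
    """Signed change relative to the baseline, in percent."""
    return (after - before) / before * 100


for agent, runs in RESULTS.items():
    base = cost_per_resolved(*runs["without"])
    graph = cost_per_resolved(*runs["with"])
    total = percent_change(runs["without"][1], runs["with"][1])
    print(f"{agent}: ${base:.2f} -> ${graph:.2f} per resolved "
          f"({percent_change(base, graph):+.0f}%), total cost {total:+.0f}%")
```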
What the agents did with the graph
The aggregate numbers are abstract until you see what changed inside a single task. The three case studies below are real tasks from the benchmark above: same model (Cursor Composer-2.5), same prompt, same commit. Each one is a task the baseline failed to ship any patch for, and the LatentGraph-augmented agent shipped a working one. The difference was not in the model. It was in how the agent navigated 1.3 million lines.
Subscribe to Calendar: capping URL length and centralising a mock
Tool calls recorded for this task (counts from the original figure):
- grep: 30
- read: 26
- edit: 22
- glob: 10
- shell: 9
- get_symbol: 6
- ask_codebase: 4
- get_file: 2
- get_pr_insights: 1
- get_call_chain: 1
- get_dependencies: 1
What the baseline did: the agent spent its calls working out where
MAX_LENGTHS_API was declared,
which jest.setup.js was authoritative, and how
SubscribeCalendarModal wired to its hooks. The 22 edit
calls never produced a saved diff.
What LatentGraph did:
get_symbol pinpointed the constant and its declaration
site. get_dependencies enumerated every
ResizeObserver import site in one round-trip. The agent
spent its tokens writing the fix instead of finding the place to fix.
The mechanism: navigation, not intelligence
The aggregate numbers and the case studies converge on the same mechanism. In a million-line repository, finding every caller of a function by text search is O(N) over the whole codebase; doing it through a call-graph lookup is a single index query whose cost depends on the number of callers, not on the size of the codebase. Finding every place a constant should be extended has the same shape. Finding every consumer of an enum has the same shape. The baseline agent's failures across the three cases above were not failures of model intelligence. They were failures of navigation. The graph collapses the navigation step, and the model spends its tokens on writing the fix instead of finding the place to fix.
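A small sketch of that difference, using a toy in-memory repository. The file contents, the edge list, and the function names are all hypothetical; this is not LatentGraph's index format, only the general shape of a reverse call index compared with a text scan.

```python
from collections import defaultdict

# Toy "repository": file name -> source text (hypothetical contents).
FILES = {
    "auth.py": "def refresh_token():\n    return sign()\n",
    "api.py": "def handler():\n    return refresh_token()\n",
    "jobs.py": "def nightly():\n    refresh_token()\n    cleanup()\n",
    "util.py": "def cleanup():\n    pass\n",
}

# (caller, callee) pairs, as a static analyser might extract them once.
CALL_EDGES = [
    ("api.py::handler", "refresh_token"),
    ("jobs.py::nightly", "refresh_token"),
    ("jobs.py::nightly", "cleanup"),
    ("auth.py::refresh_token", "sign"),
]


def files_mentioning(symbol: str) -> list[str]:
    """Text search: scans every file on every query, and also matches
    the definition itself, not only the callers."""
    return sorted(name for name, src in FILES.items() if f"{symbol}(" in src)


def build_reverse_index(edges):
    """Map each callee to the set of its callers (built once, reused)."""
    index = defaultdict(set)
    for caller, callee in edges:
        index[callee].add(caller)
    return index


REVERSE_INDEX = build_reverse_index(CALL_EDGES)


def callers_by_index(symbol: str) -> list[str]:
    """Index lookup: cost depends on the number of callers only."""
    return sorted(REVERSE_INDEX.get(symbol, ()))


print(files_mentioning("refresh_token"))
print(callers_by_index("refresh_token"))
```

The text scan returns three files, including the one that only defines the function; the index returns the two actual callers without touching file contents.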
Section 3 turns to the prerequisite: how good is the graph itself.
Why the agents win: the graph is state-of-the-art.
Across ten OSS repositories (five Python, five JS/TS), LatentGraph captures more real dependency edges than the closest comparable tools, at higher precision.
Why edge quality matters
Section 2 showed that coding agents do better with the graph. This section turns to the prerequisite: how good is the graph itself. Every MCP call returns context grounded in graph edges. If edges are wrong, the agent ingests wrong context. If edges are missing, the agent misses dependencies. Precision (what fraction of captured edges are real) and coverage (how many real edges are captured at all) are the two axes that determine whether the graph is worth querying.
The benchmark setup
We compared LatentGraph's dependency-edge extraction against the two
closest publicly available static-analysis tools: GitNexus
and Repowise. Ten OSS repositories were used: five
Python (fastapi, keras, openai-python,
sqlmap, youtube-dl) and five JavaScript /
TypeScript (Chart.js, dayjs,
Sortable, reveal.js, fabric.js).
Each tool's output was normalised to the same file universe by
applying LatentGraph's collector skip-list uniformly across all three
tools, so coverage and overlap comparisons operate on identical inputs.
Precision was verified two ways. For Python, an AST-based verifier confirmed every emitted edge against the actual import statements in every file: no sampling, full-corpus check. For JS/TS, where multi-mechanism resolution (calls, extends, accesses, method-overrides, dynamic imports) makes static verification harder, we sampled up to 100 edges per Venn region per tool pair and verified each with Claude Sonnet against the full file source. Precision within a language is directly comparable; across languages it is only indicative, since the verification methods differ.
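To make the Python check concrete, here is a minimal sketch of an AST-based edge check built on the standard-library ast module. It is an assumption about the general approach, not the verifier used in the benchmark: it only collects imported module names and tests whether a claimed target is among them, and it does not resolve relative imports or map module names to files.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Collect every module name that the source imports."""
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Relative imports are not resolved in this sketch:
            # "from . import x" has no module name and is skipped.
            modules.add(node.module)
    return modules


def edge_is_grounded(source: str, target_module: str) -> bool:
    """True if the file contains an import of target_module."""
    return target_module in imported_modules(source)


# Hypothetical file contents for illustration.
SOURCE = "import os\nfrom pkg.models import User\nfrom . import helpers\n"
print(edge_is_grounded(SOURCE, "pkg.models"))
print(edge_is_grounded(SOURCE, "pkg.views"))
```

An emitted edge that fails this kind of check would count as a false positive; running it over every emitted edge gives a full-corpus precision figure rather than a sampled one.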
The aggregate result
LatentGraph wins on precision on both language sets. The gap is modest on Python (where all three tools are in the high 90s) and decisive on JS/TS (where the next-best tool drops more than 11 percentage points behind).
| Tool | Python edges (5 repos) | Python TPs | Python precision | JS / TS file pairs (5 repos) | JS / TS mean precision |
|---|---|---|---|---|---|
| LatentGraph | 20,882 | 20,848 | 99.84% | 2,688 | 98.1% |
| GitNexus | 16,177 | 15,942 | 98.55% | 3,356 | 86.4% |
| Repowise | 16,851 | 16,226 | 96.29% | 1,513 | 74.9% |
Two coverage notes worth surfacing. On Python, LatentGraph captures
the most edges total (20,882): about 24% more than Repowise and 29%
more than GitNexus, driven mainly by deep traversal of large packages
like keras. On JS/TS, GitNexus indexes the most file
pairs (3,356), but most of that excess is documentation, scripts, and
config files that the other tools intentionally skip. The volume
difference is real; whether it counts as useful coverage depends on
the use case.
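The Python precision and coverage figures follow directly from the edge and true-positive counts in the table above. A short sketch of the arithmetic, with the counts copied from the table and the helper names chosen for illustration:

```python
# (edges emitted, true positives) on the five Python repositories,
# copied from the table above.
PYTHON_RESULTS = {
    "LatentGraph": (20_882, 20_848),
    "GitNexus": (16_177, 15_942),
    "Repowise": (16_851, 16_226),
}


def precision(edges: int, true_positives: int) -> float:
    """Fraction of emitted edges that were verified as real."""
    return true_positives / edges


def extra_edges_pct(ours: int, theirs: int) -> float:
    """How many more edges `ours` emits than `theirs`, in percent."""
    return (ours / theirs - 1) * 100


for tool, (edges, tps) in PYTHON_RESULTS.items():
    print(f"{tool}: precision {precision(edges, tps):.2%}, "
          f"false positives {edges - tps}")

lg_edges = PYTHON_RESULTS["LatentGraph"][0]
for other in ("Repowise", "GitNexus"):
    pct = extra_edges_pct(lg_edges, PYTHON_RESULTS[other][0])
    print(f"LatentGraph emits {pct:.0f}% more edges than {other}")
```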
Two findings that travel
Beyond the headline result, two cross-validation findings hold across every repo and every tool pair in the benchmark.
The common-bucket invariant. When any two of these three tools agree on a file edge, that edge is real: 100% precision across every repository and every tool pair in the JS/TS sampled verification, and zero common false positives between LatentGraph and GitNexus across the entire Python subset. Two independent dependency analysers agreeing on an edge is by itself a strong signal of correctness, usable directly as production ground truth.
LatentGraph-only edges are signal, not noise. Edges
that only LatentGraph emits remain highly precise: 98.6% vs Repowise
and 92.0% vs GitNexus in the JS/TS verification. The extra coverage
is not over-emission. On keras alone, LatentGraph finds
5,731 unique true positives that neither other tool surfaces:
intra-package edges deep inside nested inheritance hierarchies that
text-search-based call-graph traversal does not reach.
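The cross-validation above rests on splitting two tools' edge sets into Venn regions and sampling each region for verification. A minimal sketch of that bookkeeping, assuming edges are (source file, target file) pairs; the edge sets, function names, and seed are hypothetical, and the verification step itself is left out.

```python
import random

Edge = tuple[str, str]  # (source file, target file)


def venn_regions(a: set[Edge], b: set[Edge]) -> dict[str, set[Edge]]:
    """Split two tools' edge sets into common, only-A, and only-B buckets."""
    return {"common": a & b, "only_a": a - b, "only_b": b - a}


def sample_region(edges: set[Edge], cap: int = 100, seed: int = 0) -> list[Edge]:
    """Draw up to `cap` edges from one region for verification."""
    ordered = sorted(edges)  # sort first so the sample is reproducible
    rng = random.Random(seed)
    return rng.sample(ordered, min(cap, len(ordered)))


# Hypothetical edge sets for two tools on the same file universe.
tool_a = {("a.js", "b.js"), ("a.js", "c.js"), ("b.js", "d.js")}
tool_b = {("a.js", "b.js"), ("c.js", "d.js")}

regions = venn_regions(tool_a, tool_b)
for name, edges in regions.items():
    print(name, sample_region(edges))
```

Precision per region is then the verified fraction of each sample, which is how a "common bucket" figure and an "only this tool" figure can be reported separately.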
What the other tools do not attempt: implicit dependencies
Everything above is about explicit dependencies: imports, calls, class extensions, references that a static analyser can see in the source. This is the comparison the three tools support.
In production, the LatentGraph context graph also surfaces implicit dependencies: the runtime couplings static analysis cannot see, where one service publishes to a Redis channel another service consumes, or two modules read the same database table, or a configuration key gates behaviour somewhere else. Neither GitNexus nor Repowise attempts to mine these. They do not appear in any of the tables above because there is nothing to compare against. Section 4 walks through implicit dependencies in detail, plus two more categories of edge that static analysis does not produce: design intent mined from pull-request history, and knowledge agents add to the graph as they work.
Honest limits
The benchmark above is the strongest statement we can make today, but it is bounded. The known limits are listed below.
- The fastapi __init__.py re-export gap. On fastapi, LatentGraph produces 959 edges to Repowise's 2,627. The gap is almost entirely test-file imports of framework symbols (FastAPI, Depends, HTTPException): LatentGraph's symbol resolver marks them as external when they are actually re-exported through the package's top-level __init__.py. 1,227 real edges in fastapi alone fall in this gap. Fix is in flight; the gap is the largest known issue in the benchmark.
- keras _tf_keras shim FPs. 26 of LatentGraph's 34 total Python false positives come from generated re-export files inside keras/_tf_keras/: a compatibility stub layer that does not represent real source dependencies. Most of LatentGraph's remaining error budget is concentrated in one specific structural pattern.
- fabric.js bundle false positives. The only repo where LatentGraph drops below 95% is fabric.js (91.8%), caused by same-file edges inside fabric.full.js: an in-repo dist artefact that the standard skip-list does not catch.
- Sample size. Ten repositories is a small N. We chose breadth across two languages and a mix of sizes; it does not claim to generalise to monorepos, CJS-heavy codebases, or closed-source enterprise corpora. Anything broader would need a larger study.
- JS / TS verifier floor. Claude-Sonnet-as-judge is itself probabilistic. Prior calibration runs put spot-check reliability at ≥ 95%; the remaining ~5% is a statistical floor on all JS/TS precision numbers above.
Section 4 picks up the implicit-dependency claim. Section 5 turns to what high-quality graph context enables downstream: migrations and security audits at production scale.
Beyond static analysis: the invisible spectrum.
Three categories of edge static analysis does not produce: implicit runtime couplings between files that do not import each other, architectural intent mined from pull-request history, and knowledge the agents themselves write back to the graph as they work.
Sections 2 and 3 measured what the graph does for coding agents and how good the underlying edges are. This section turns to what the graph contains that no static analyser produces at all. Three categories live in this invisible spectrum: implicit dependencies (runtime couplings between files that do not import each other), PR insights (design intent and architectural invariants the code does not declare), and agent write-back (knowledge the agents add to the graph as they discover it during coding tasks). The examples below come from public OSS codebases. None is reachable by grep, IDE navigation, call hierarchy, or standard dependency analysis.
§4.1 Implicit dependencies
An implicit dependency is a runtime coupling that no import edge declares. Two files that share a Redis channel. Two services that read the same database table. An event-bus producer and consumer whose only connection is a topic name. Two callers that depend on the shape of an object a third file produces, not on the function that produces it. Static analysers stop at import, call, and reference edges. The examples below stop somewhere else.
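To make the idea concrete, the sketch below pairs up files that touch the same pub/sub channel name even though neither imports the other. It is a deliberately naive heuristic under stated assumptions (channel names appear as string literals in publish/subscribe calls; file contents are invented) and is not a description of how LatentGraph mines implicit edges.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical sources: no file here imports any of the others.
FILES = {
    "billing/publisher.py": 'redis.publish("orders.created", payload)\n',
    "shipping/worker.py": 'pubsub.subscribe("orders.created")\n',
    "search/indexer.py": 'pubsub.subscribe("catalog.updated")\n',
}

# Assumption: channels are string literals passed to publish/subscribe.
CHANNEL_CALL = re.compile(r'\.(?:publish|subscribe)\(\s*["\']([^"\']+)["\']')


def implicit_edges(files: dict[str, str]) -> list[tuple[str, str, str]]:
    """Return (file, file, channel) for files sharing a channel name."""
    users = defaultdict(set)
    for path, source in files.items():
        for channel in CHANNEL_CALL.findall(source):
            users[channel].add(path)
    edges = []
    for channel, paths in sorted(users.items()):
        for left, right in combinations(sorted(paths), 2):
            edges.append((left, right, channel))
    return edges


print(implicit_edges(FILES))
```

An import-based analyser would report no relationship between the billing and shipping files; the only link is the shared topic name.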
Featured: axios · six callers, one invisible shape contract
In axios, every final HTTP-request configuration object
is constructed by lib/core/mergeConfig.js using
Object.create(null). This is a security requirement: a
prototype-less object blocks prototype-pollution attacks from
leaking attacker-controlled adapter, proxy,
or validateStatus overrides into a request. But six
internal callers (and an unknown number of user-written interceptors)
call config.hasOwnProperty(key) at runtime, and a
prototype-less object has no hasOwnProperty. To avoid a
runtime TypeError, mergeConfig manually
grafts a non-enumerable hasOwnProperty back onto every
config object it produces.
The result is an invisible contract. Six internal callers,
Axios._request, Axios.getUri,
createInstance, resolveConfig,
lib/adapters/xhr.js, and
lib/adapters/fetch.js, plus every user-written
interceptor, depend on the joint shape: null prototype plus grafted
method. Nothing about import mergeConfig declares this
dependency. The dependency is on what the returned object
looks like, not on the function that produces it.
mergeConfig.js produces every config object with a
null prototype plus a manually grafted hasOwnProperty.
Six internal callers and every user-written interceptor silently
depend on the joint shape. Grounded in PRs #10764, #10779,
#10833, #10892.
Why static analysis cannot see this. A static
analyser sees that six files call mergeConfig. It does
not see that all six callers depend on a non-standard prototype
shape and a grafted method that is not part of any TypeScript type,
JSDoc annotation, or comment in the source. The contract is
reconstructable only from PR history and from the manual graft
itself.
Failure mode. A developer adds a new
config-construction path using {...defaults, ...userConfig}
spread instead of calling mergeConfig. The resulting
object has a normal prototype. Existing callers continue to work.
Prototype pollution becomes reachable again, silently routing
requests to attacker-controlled endpoints via injected
adapter or proxy overrides. The build
passes. The type checker passes. The tests pass against benign
input.
§4.2 PR insights
PR insights are the design decisions and architectural invariants that produced the code, mined from the project's pull-request history. They include rejected approaches, escalating security fixes, change playbooks, and constraints the code obeys but does not declare. A grep tells you what the code does. A PR insight tells you what the code is not allowed to do.
Featured: requests · the legacy method that silently reinstates a CVE
In May 2024, requests shipped a fix for
CVE-2024-35195: under specific conditions, a TLS
connection established with verify=False could be
reused for a subsequent request that required verify=True,
completely bypassing certificate verification. PRs #6655, #6662,
and #6710 introduced a new method,
HTTPAdapter.get_connection_with_tls_context, that
folds verify and cert into the
connection-pool key via a helper called
build_connection_pool_key_attributes. Connections with
different security parameters are no longer pooled together. CVE
closed.
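The effect of folding verify and cert into the pool key can be shown with a toy model. The class below is hypothetical and is not the requests or urllib3 API; it only contrasts a pool keyed by host alone with one keyed by host plus TLS settings.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPool:
    """Hypothetical connection pool; not the requests or urllib3 API."""
    include_tls_in_key: bool
    connections: dict = field(default_factory=dict)

    def _key(self, host, verify, cert):
        if self.include_tls_in_key:
            return (host, verify, cert)
        return (host,)  # TLS settings ignored when choosing a connection

    def get(self, host, verify=True, cert=None):
        """Return a pooled connection, creating it on first use."""
        key = self._key(host, verify, cert)
        if key not in self.connections:
            # A connection remembers the TLS settings it was opened with.
            self.connections[key] = {"host": host, "verified": verify}
        return self.connections[key]


for fixed in (False, True):
    pool = ToyPool(include_tls_in_key=fixed)
    pool.get("example.org", verify=False)        # unverified connection first
    conn = pool.get("example.org", verify=True)  # caller now wants verification
    print(f"TLS settings in key={fixed}: verified={conn['verified']}")
```

With the host-only key, the second caller is handed the unverified connection; with TLS settings in the key, it gets a separate, verified one.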
The fix preserved the legacy method
HTTPAdapter.get_connection for backwards compatibility.
The CVE-safe path lives in the new method and its callee
build_connection_pool_key_attributes. The legacy method
is never instrumented. Any HTTPAdapter subclass,
existing or new, that overrides only the legacy
get_connection instead of the new
get_connection_with_tls_context silently bypasses the
fix. The vulnerability returns. No runtime error, no test failure,
no static-analysis warning.
build_connection_pool_key_attributes, the
actual CVE fix, is reachable only via the new method, and
any subclass override of the legacy method silently bypasses the
security check. Grounded in PRs #6655, #6662, #6710, #6716.
Why static analysis cannot see this. Reading the
file shows two parallel methods that look like normal API surface.
Grep for build_connection_pool_key_attributes finds
it but does not reveal its relationship to the CVE. Find-references
on get_connection finds direct callers, but a subclass
override is not a caller. The constraint that the new method is
canonical for security exists only in PR history.
get_call_chain traverses a six-hop path from
Session.get through Session.send and
HTTPAdapter.send to
get_connection_with_tls_context, and confirms that
build_connection_pool_key_attributes is reachable
through the new method only.
§4.3 Agent write-back: code-grounded memory that compounds
The graph is not only something agents query. It is also something
agents extend. The MCP write tool update_graph lets
any MCP-speaking agent propose new edges, invariants, or design
decisions as it discovers them during a coding task. An agent that
traces a Redis channel writer to its consumer can record the
implicit edge. An agent that surfaces a constraint by failing a
test can record the invariant. Every proposed edit lands in a
Control Panel queue that owners review before anything is added
to the canonical graph, so the artefact remains trustworthy.
The effect compounds. Three months in, the graph holds knowledge mined from PR history (continuous), invariants annotated by engineers (curated), and edges discovered by agents during real coding sessions (incremental). The graph becomes a code-grounded agent memory for the team: a curated artefact that captures both the codebase as it is and the knowledge the team plus its agents have accumulated about how it works. Switching off LatentGraph three months in means switching off three months of compounded team intelligence.
A separate report on the write side and the compounding loop is in preparation.
Section 5 turns to what these capabilities enable downstream at production scale: a code-generation migration study across roughly 1.8 million lines, and a source-code review and security audit study that surfaced fifteen Critical or High verified bugs across three AI-infrastructure repositories.
What this enables downstream.
Once an agent can query the graph, the tasks where context dominates outcomes start working: migrations, refactors, code reviews, security audits.
Sections 2, 3, and 4 measured and demonstrated the graph itself. Agents do better with it (S2). The underlying edges are accurate and comprehensive (S3). The categories of edge static analysis does not produce (implicit runtime couplings between files that do not import each other, design intent mined from PR history, knowledge agents write back to the graph themselves) are real and reproducible across public OSS code (S4). Section 5 turns to two production-grade case studies where these properties were exercised at scale: a 1.8M-line code-generation migration study, and an exhaustive source-code review and security audit of three real AI-infrastructure repositories. Both ran the same experimental design as Section 2: same model, same prompt, the only variable was whether the agent could query the graph.
5.1 Migrations
We ran three large-scale migrations through Claude Code. BlazingMQ, Bloomberg's distributed message queue (405K lines of C++), translated to idiomatic Rust. PeerTube, a federated video platform (370K lines of Angular), ported to React with TypeScript. PowerToys, Microsoft's Windows productivity suite (324K lines of C# / WPF), re-implemented as an Electron / React / TypeScript desktop application. Roughly 1.8 million lines of source total. Same Claude Code, same exhaustive migration prompt, run twice per project. The only difference: the LGraph-MCP runs had one additional sentence appended to the prompt, "Use LGraph MCPs wisely."
Two evaluation dimensions: efficiency (cost,
wall-clock time, lines produced, token and cache efficiency) and
fidelity (structural mapping vs the original,
feature completeness, functional integration, build state). React
and Electron outputs were actually built in the analysis environment
(npm run build); the Rust port was structurally compared
and grep-verified against the original C++ source. Fidelity was
scored out of 100 by a Claude Code analyst comparing each migrated
codebase against its original.
On every dimension and every project, the LGraph runs won. Total cost
dropped 56% ($93.00 → $41.19). Total wall-clock time
dropped from 7 hours to 3h 49m. Average fidelity rose from 59 / 100 to
69 / 100. The LGraph runs produced 44,246 fewer lines of code while
scoring higher on completeness: those extra lines in the baseline
runs were churn, not value.
| Migration | Condition | Cost | Time | Lines added | Fidelity |
|---|---|---|---|---|---|
| BlazingMQ | Without MCP | $9.35 | 33m 47s | 10,321 | 65 |
| BlazingMQ | With LatentGraph | $6.65 | 30m 39s | 8,072 | 76 |
| PeerTube | Without MCP | $35.46 | 2h 36m 32s | 37,053 | 61 |
| PeerTube | With LatentGraph | $12.87 | 1h 18m 34s | 15,122 | 74 |
| PowerToys | Without MCP | $48.19 | 3h 49m 54s | 52,550 | 52 |
| PowerToys | With LatentGraph | $21.67 | 1h 59m 58s | 32,484 | 58 |
| Totals / avg | Without MCP | $93.00 | 7h 00m | 99,924 | 59 |
| Totals / avg | With LatentGraph | $41.19 | 3h 49m | 55,678 | 69 |
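The totals row can be re-derived from the per-project rows. The sketch below copies cost, wall-clock time, and lines added from the table and sums them per condition; the variable and function names are illustrative.

```python
from datetime import timedelta

# (cost in USD, wall-clock time, lines added), copied from the table above.
RUNS = {
    "without": [
        (9.35, timedelta(minutes=33, seconds=47), 10_321),
        (35.46, timedelta(hours=2, minutes=36, seconds=32), 37_053),
        (48.19, timedelta(hours=3, minutes=49, seconds=54), 52_550),
    ],
    "with": [
        (6.65, timedelta(minutes=30, seconds=39), 8_072),
        (12.87, timedelta(hours=1, minutes=18, seconds=34), 15_122),
        (21.67, timedelta(hours=1, minutes=59, seconds=58), 32_484),
    ],
}


def totals(rows):
    """Sum cost, time, and lines across the three migrations."""
    cost = round(sum(r[0] for r in rows), 2)
    time = sum((r[1] for r in rows), timedelta())
    lines = sum(r[2] for r in rows)
    return cost, time, lines


base, graph = totals(RUNS["without"]), totals(RUNS["with"])
print("cost saved:", f"{(base[0] - graph[0]) / base[0]:.0%}")
print("time:", base[1], "->", graph[1])
print("fewer lines:", base[2] - graph[2])
```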
The mechanism is the one Section 2 measured at smaller scale, now amplified across hundreds of thousands of lines. The LatentGraph MCP lets the migration agent resume exactly where it left off without re-reading every file. From the consolidated run: 43% fewer Sonnet output tokens (941K vs 1.66M), 59% fewer Sonnet cache reads (59.4M vs 143.5M), and roughly 400× less Haiku orchestration overhead (~57 vs ~23,000 tokens). The model spends its tokens writing the migration; the baseline runs spend them re-deriving what the LGraph-augmented agent already discovered.
One concrete fidelity example. On PowerToys, the LGraph-augmented run wired all 10 renderer modules to IPC and verified every OS operation (resize, color-pick, rename, hotkeys). The baseline run wired 16 of 33 renderer modules (48%) and left most utility implementations as self-labelled stubs. The difference was not in how Claude Code writes code: both runs used the same Claude Code, the same model. It was in whether the agent could enumerate the IPC handlers in one MCP call (LatentGraph) or by file traversal across 716K lines (baseline).
5.2 Source code review and security audit
We ran exhaustive source-code reviews on three real public AI-infrastructure codebases, anchored by an identical security-audit prompt. agentmemory (TypeScript, a persistent-memory layer for AI coding agents that captures every prompt, tool call, and observation). hermes-webui (Python + JavaScript, an agent orchestration UI with a built-in terminal feature so the agent can run shell commands). openfang (Rust, a 14-crate "Agent Operating System" covering agent spawning, multi-channel messaging, a peer-to-peer agent-to-agent protocol, and a credential vault). Same model (Claude Sonnet 4.6, with Claude Haiku 4.5 for orchestration). Same exhaustive security-audit prompt. The only variable: whether the auditor had the LatentGraph MCP server attached.
The LGraph-augmented audit produced 15 Critical or High verified bugs that the plain audit, on the same prompt and the same model, did not produce: 7 Critical and 8 High, evenly distributed across the three repos (5 bugs in each). False positives were 3.6× lower under the LGraph audit than under the plain audit (5 vs 18 in iteration 1). On agentmemory, where the deltas are sharpest, the LGraph audit cost 52% less ($6.98 vs $14.40) and ran 50% faster (35 min vs 1h 12m).
After verifying all 15 bugs against the actual source code at the cited commit and line, three patterns explain why the plain audits missed them.
Multi-file attack chains (6 of 15). Real-world
security bugs span modules. One example from agentmemory: a
five-step persistent prompt-injection chain that runs from the
prompt-capture hook at src/hooks/prompt-submit.ts,
through a shared KV store, into the prompt builders at
src/prompts/{compression,consolidation,summary}.ts,
and out to the LLM as "trusted" context. None of those files
imports any of the others; the chain is connected only by runtime
data flow through the KV store. Another, from openfang, runs
routes::spawn_agent → kernel.spawn_agent → spawn_agent_with_parent(None)
and skips validate_capability_inheritance; that chain
spans three crates. A file-by-file scanner must spot a vulnerability
inside every file along the chain. The graph traces it in one walk.
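A sketch of what "one walk" means in practice: once runtime data-flow edges sit in the same graph as import edges, the chain is an ordinary path search. The file names below come from the agentmemory example above; the "kv-store" and "llm" nodes, the edge kinds, and the function name are illustrative assumptions, not the graph's actual schema.

```python
from collections import deque

# Hypothetical edges: (source, target, kind). None of these is an import.
EDGES = [
    ("src/hooks/prompt-submit.ts", "kv-store", "writes"),
    ("kv-store", "src/prompts/compression.ts", "read_by"),
    ("kv-store", "src/prompts/consolidation.ts", "read_by"),
    ("kv-store", "src/prompts/summary.ts", "read_by"),
    ("src/prompts/summary.ts", "llm", "sends_to"),
]


def find_path(edges, start, goal):
    """Breadth-first search; returns the shortest node path or None."""
    graph = {}
    for source, target, _kind in edges:
        graph.setdefault(source, []).append(target)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(find_path(EDGES, "src/hooks/prompt-submit.ts", "llm"))
# With import edges only (none exist between these files), no path is found.
print(find_path([], "src/hooks/prompt-submit.ts", "llm"))
```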
Systemic patterns (4 of 15). A single root cause
replicated across many files. agentmemory's prompt-injection idiom
appears in every prompt builder (Bug 4 actually spans
compression.ts, consolidation.ts, and
summary.ts). openfang's AuthManager.authorize()
is defined once in crates/openfang-kernel/src/auth.rs:158
and called from zero HTTP route handlers; the role system is fully
implemented, fully documented, and entirely disconnected from the
routes that need it. Plain audits catch one instance and miss the
category. The graph queries the category and returns the full list.
Severity-via-context (5 of 15). A bug looks
individually minor until you see who reads it downstream.
agentmemory's cwd parameter in
src/functions/branch-aware.ts:14 looks like a
path-traversal hygiene issue, until the graph traces it into
execFile("git", ...) and reveals that git honors hooks
from .git/config in the supplied directory. The
hermes-webui terminal SSE looks like an information disclosure until
the graph shows what flows through PTYs (sudo prompts, SSH keys
typed at agent prompts). openfang's approve endpoint looks like one
over-broad route until the graph reveals it backstops every
shell_exec call. Severity grading without architectural
context systematically under-grades systemic issues.
The pattern across all 15 bugs: the vulnerable code looks correct in isolation. The bug only becomes visible when you can see the relationship between the lines, which is what a graph of the codebase provides and what plain grep-and-read cannot.
Sections 2 through 5 examined the same hypothesis at multiple scales: hard tasks (S2), edge benchmarks (S3), a tour of the categories of edge static analysis does not produce (S4), and 1.8 million lines of migration plus 15 verified bugs (S5). The mechanism in every case is the same: tasks where the agent must reason about how the system fits together, and where structured graph access turns previously intractable navigation into an MCP call. Section 6 names the limits these studies have not yet addressed.
Limits.
What these studies do not yet show, and the conditions under which LatentGraph does not help today.
The studies in this report establish a clear signal inside a deliberately bounded scope: hard tasks on large codebases, four coding agents, two languages, a curated set of real codebases. The signal is consistent and the numbers are traceable. The report makes no claim of universality. Each of the bodies of evidence above has named boundaries, listed here in the same place as the wins so the wins are read inside them, not around them.
Sample size on the coding-agent benchmark. 14 hard SWE-bench Pro tasks is a small N. We chose breadth across four coding agents and consistency across two million-line codebases (where context cost dominates) over running many tasks on a single agent or a single repo. The benchmark does not claim to generalise to small projects, where the model already has enough room to solve most tasks without a graph layer.
OpenCode integration is still early-stage. On
cost per resolved task (the operationally meaningful efficiency
number), LatentGraph improves three of the four agents
substantially: Claude Code 66% ($3.93 → $1.33),
Cursor 41% ($1.38 → $0.82), and Codex 11%
($9.85 → $8.79). OpenCode is essentially flat
($1.95 → $2.05). The OpenCode MCP integration is the
most recent of the four and the least mature on our end. These
are first-cut results, and we expect the OpenCode numbers to
improve as the integration matures.
Specific known gaps in the explicit-edge benchmark.
The fastapi __init__.py re-export gap costs LatentGraph
1,227 test-import edges that both Repowise and GitNexus find. The
fabric.js bundle file (fabric.full.js) produces the
only sub-95% precision result in the JS/TS set (91.8%). The keras
_tf_keras shim accounts for 26 of LatentGraph's 34
Python false positives. Each one is named, traced, and targeted
for fix in upcoming releases.
JS / TS precision verification has a statistical floor. The verifier is Claude Sonnet, sampled at up to 100 edges per Venn region per tool pair. Prior calibration runs put spot-check reliability at ≥ 95%; the remaining roughly 5% is the floor on every JS / TS precision number in Section 3.
The migration study is first-pass code generation. Restated from the Section 5.1 scope note: the three migrations are agent-driven first-pass conversions, not production-ready deployments. None of the three outputs would ship as-is. The question the study answers is whether the first-pass generated code is better with the graph than without; the broader question of end-to-end agent-driven migration is the subject of a separate study now in progress.
Each limit above bounds the scope of a specific result, not the underlying mechanism. The signal inside the scope is consistent across four agents, ten repositories, three migrations, and fifteen verified bugs, and every number in the report is traceable to its source.
Try it.
One install. One MCP plug-in. Your engineering brain, built.
This report examines the same hypothesis at multiple scales. On 14 hard SWE-bench Pro tasks (Section 2), every agent we tested resolved more tasks with LatentGraph attached, and cost per resolved task dropped substantially for three of the four. On 10 OSS repositories (Section 3), LatentGraph captures more real dependency edges at higher precision than the closest comparable tools. On a tour through public OSS codebases (Section 4), the graph surfaced implicit-dependency contracts and PR-mined design intent that no static analyser produces. On 1.8 million lines of code-generation migrations and 15 verified security bugs across three AI-infrastructure repositories (Section 5), the same mechanism scales: structured graph access turns previously intractable navigation into a single MCP call.
LatentGraph ships the map today. Works with any MCP-speaking coding agent: Claude Code, Cursor, Codex, OpenCode, Copilot, or any future client that speaks the protocol.
Full product documentation at latentgraph.latentforce.ai. To explore a live Engineering Brain on the same OSS repositories benchmarked in Section 3, with no install required, visit latentgraph.latentforce.ai/showcase.
Questions, or want to talk to the team directly? Reach us at latentforce.ai/contact. We respond very quickly.