
The Validation Crisis: Surfacing the Context Gap in Review

An empirical study of review friction across 10,303 pull requests, and the invisible limits of codebase-only tooling.

Executive Summary

  1. The Bottleneck Relocated: As AI code generation drives down the marginal cost of producing diffs, the bottleneck in software delivery has shifted from writing code to validating it.
  2. De-biasing the Open-Source Noise: The raw open-source baseline puts 66.6% of identified knowledge gaps beyond the codebase, but this figure is inflated by gatekeeping aimed at drive-by contributors (such as CLA/DCO sign-off).
  3. The Enterprise Proxy: Restricting the analysis to repeat/insider contributors (proxies for internal employees) strips away most of these open-source administrative barriers. For these onboarded developers, process/policy friction falls from 54.1% of identified gaps to 20.3%. Yet 41.9% of identified insider friction still lies beyond the codebase: the residual process/policy share plus environment/platform limits, product decisions, and external dependencies. Code-aware tools remain blind to roughly two-fifths of this friction.

1. The Bottleneck Moved

The marginal cost of generating a plausible diff has fallen to near zero. Developer output has climbed, but human review capacity remains finite. The symptoms of this mismatch are growing review queues and creeping rework, as agents and developers generate code that compiles but ignores unwritten architectural boundaries, environment details, or team policies.

Review discussions are the exact place where missing context surfaces. If context is the binding constraint of modern software delivery, review discussions should show it. This study measures that bottleneck.


2. What We Measured

  • Corpus: 10,303 pull requests scraped across 52 active open-source repositories (Jan 2024 – Jun 2026). 7,924 (76.9%) drew reviewer comments and form our commented classification cohort.
  • Classification: Labeled once by gemma-4-31b (temperature 0, JSON output) along three axes:
    • Reviewer stance: blocking, non-blocking, question, nitpick, praise, or none. A friction PR is any PR whose dominant stance is blocking, non-blocking, or question.
    • Knowledge gap: The missing information revealed by the discussion (defined in §3).
    • Closure signal: A text-inferred reason a thread stalled (duplicate, CI failure, policy violation).
  • Rigor (Validation):
    • Evidence Grounding: 99.9% of substantive labels are backed by a verbatim quote from the PR thread (1,251/1,252).
    • Predictive Validity: In a model-blind check, text-inferred closure reasons strongly predict actual merge status (duplicate: 0% merged, policy violation: 0.3% merged, CI failure: 5.0% merged, against a 74% baseline).
    • Manual Audit: A stratified manual audit of 27 sample PRs across all taxonomy categories showed 100% derived class agreement and 100% precision and recall for the load-bearing 'beyond-codebase' class, with stance accuracy at 96.3% (26/27).
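
The friction rule and the evidence-grounding check described above can be sketched in a few lines. This is a minimal illustration, not the study's pipeline: the `ThreadLabel` record and its field names are hypothetical, and only the stance vocabulary and the friction definition come from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Stance vocabulary as listed in the text.
STANCES = {"blocking", "non-blocking", "question", "nitpick", "praise", "none"}
# Per the text, a PR counts as friction when its dominant stance is one of these.
FRICTION_STANCES = {"blocking", "non-blocking", "question"}


@dataclass(frozen=True)
class ThreadLabel:
    """Hypothetical record for one labeled PR (field names are assumptions)."""
    stance: str                    # dominant reviewer stance
    knowledge_gap: Optional[str]   # e.g. "process & policy"; None if no gap identified
    closure_signal: Optional[str]  # e.g. "duplicate", "CI failure", "policy violation"
    evidence_quote: Optional[str]  # quote offered as backing for the label

    def __post_init__(self) -> None:
        if self.stance not in STANCES:
            raise ValueError(f"unknown stance: {self.stance!r}")


def is_friction(label: ThreadLabel) -> bool:
    """Apply the friction definition: blocking, non-blocking, or question."""
    return label.stance in FRICTION_STANCES


def is_grounded(label: ThreadLabel, thread_text: str) -> bool:
    """Evidence grounding: the quote must appear verbatim in the PR thread."""
    return bool(label.evidence_quote) and label.evidence_quote in thread_text
```

Keeping the grounding check as a plain substring test means a paraphrased quote fails, which is the stricter reading of "verbatim".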

Figure 1: Review-friction taxonomy. Reviewer stance (a) and knowledge-gap composition (b) across the corpus.


3. Most Friction is Not About the Code

We categorize knowledge gaps into two classes:

  • Codebase-internal: Missing knowledge of the repository's internal architecture, local APIs, or conventions. Knowable by reading the code.
  • Beyond-codebase: Missing knowledge of contribution process & policy (commit formats, CLA, branch rules), product & business decisions, environment & platform (CI, runtime, build), and external dependencies. Not inferable from the repository.

Among 1,542 friction PRs, the classifier identifies a specific gap in 60.6%. Of those identified gaps:

| Knowledge-gap class | Share of identified gaps |
| --- | --- |
| Beyond-codebase | 66.6% |
| Codebase-internal | 33.4% |

Within the beyond-codebase class, the dominant subtype is process & policy, which alone accounts for 50.0% of all identified gaps, three-quarters of the beyond-codebase class.

Note: De-biasing Open-Source vs. Enterprise Friction. In a raw open-source corpus, "process & policy" is heavily inflated by administrative gates (e.g., Contributor License Agreements, Developer Certificates of Origin, pull request templates). In Section 7, we strip this administrative noise by isolating core repo insiders to extract a clean enterprise proxy.

Figure 3: The context gap. Knowledge-gap composition within friction PRs. Roughly two-thirds of identified gaps are context that cannot be read off the repository.

Furthermore, resolving beyond-codebase friction is structurally harder: it resolves less cleanly and draws significantly more blocking pushback than codebase-internal friction.

Figure 3b: How context-gap friction resolves. Friction whose gap is beyond the codebase resolves less cleanly and draws more blocking reviews.


4. The Limits of Codebase-Only Tooling

Code-aware tools (linters, type-checkers, and diff-aware AI review bots) operate exclusively on the changed lines and the surrounding repository. In the raw corpus, that scope covers only the roughly one-third of identified gaps (33.4%) that are codebase-internal.

This is not a limitation of model intelligence, but a boundary condition of context. A pure-diff review bot cannot judge whether a change fits product decisions made in a Slack thread, or whether a deployment environment requires a specific backcompat shim. Surpassing this ceiling requires feeding the tool the organizational context that lives outside the code.


5. It's Not an AI Problem, It's Organizational

A natural objection is that beyond-codebase friction is an artifact of AI assistants failing to understand human context. To avoid selection bias, we restrict AI detection to author-side signals (login, title, and first-person disclosure in the PR body). The data shows otherwise:

| Group | N | Merge % | Friction % | Blocking % | Beyond-code gap % |
| --- | --- | --- | --- | --- | --- |
| Non-AI human | 8,080 | 62.3 | 16.8 | 11.3 | 9.1 |
| AI-assisted | 631 | 74.2 | 18.4 | 8.4 | 7.4 |
| Bot | 1,592 | 80.4 | 4.3 | 2.1 | 1.4 |

  • Friction is author-agnostic: Human and AI-assisted PRs show statistically indistinguishable friction rates (16.8% vs. 18.4%; $\chi^2$ test, $p = 0.30$).
  • Beyond-codebase friction aligns: It affects human and AI-assisted authors at statistically indistinguishable rates (9.1% vs. 7.4%; $\chi^2$ test, $p = 0.18$).
  • The Bot Baseline: Programmatic bots reach a near-zero beyond-codebase gap rate (1.4%), consistent with behavior that conforms to policy by construction. We read this as an approximate floor for automated context provisioning.
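
The friction-rate comparison can be approximately re-derived from the table. The sketch below rests on two assumptions that the text does not confirm: the counts are back-calculated from the rounded percentages (16.8% of 8,080 and 18.4% of 631), and the test is a plain Pearson chi-squared without continuity correction.

```python
import math


def chi2_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-squared (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the chi-squared survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Counts back-calculated from rounded percentages (assumption, not source data).
human_total, ai_total = 8080, 631
human_friction = round(0.168 * human_total)
ai_friction = round(0.184 * ai_total)

stat, p = chi2_2x2(
    human_friction, human_total - human_friction,
    ai_friction, ai_total - ai_friction,
)
print(f"chi2 = {stat:.2f}, p = {p:.2f}")
```

With these back-calculated counts the p-value lands near 0.3, in line with the reported figure. Exact agreement is not expected because the inputs are rounded.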

Figure 2: Friction by author group. Review friction, blocking, and merge rates across human, AI-assisted, and bot author cohorts.


6. The Cost of Friction

Friction rates have remained stable, but submission volumes have risen. The resulting validation bottleneck shifts maintainer effort from code verification to reverse-engineering author intent.

This cost is highly concentrated. A blocking review carries a median time-to-merge of 48.5 hours, roughly 7.5× the median for a clean merge. In calendar time, one blocked PR costs about as much as seven and a half clean ones, which compounds the bottleneck as generation volume climbs.

Figure 5: The cost of friction. Median time-to-merge across different reviewer feedback stances.


7. De-biasing the Signal: The True Enterprise Proxy

Open-source datasets are heavily biased by drive-by contributors who face low-trust administrative gates (e.g., signing CLAs, conforming to PR templates, or selecting labels). An internal enterprise team does not face these hurdles.

To extract a clean signal for enterprise workflows, we restrict our human cohort to repeat/insider contributors (authors with a GitHub author_association of MEMBER, OWNER, or COLLABORATOR, the closest proxy for full-time employees). This partitioning yields two crucial findings (see Figure 7):

  • Process Gatekeeping Evaporates: As expected, process & policy friction collapses from 54.1% of identified gaps for outsiders to just 20.3% for insiders (who already know and follow the rules). The majority of insider friction shifts to codebase-internal architectural complexity (58.1%).
  • The Context Gap Remains at 42%: Crucially, even with most administrative OSS gates removed, 41.9% of identified insider friction still stems from beyond-codebase context (95% Wilson CI: [31.3%, 53.3%]; insider vs. outsider difference: $\chi^2$ test, $p = 5.83 \times 10^{-6}$). Beyond the residual process & policy share (20.3%), this friction comes from environment/platform limits (13.5%), product decisions (5.4%), and external dependencies (2.7%).
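
The Wilson score interval quoted above can be checked with a short calculation. The sketch assumes 31 beyond-codebase gaps out of 74 identified insider gaps. These counts are not stated in the text and were chosen only because they reproduce the reported 41.9% share.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts (assumption): 31 of 74 identified insider gaps.
low, high = wilson_interval(31, 74)
print(f"share = {31 / 74:.1%}, 95% CI = [{low:.1%}, {high:.1%}]")
```

Under these assumed counts the interval comes out at roughly [31.3%, 53.3%], which matches the reported bounds. The width of the interval reflects how small the insider friction sample is.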

This de-biased result is our enterprise proxy: even for onboarded, internal team members, roughly two-fifths of identified review friction lives outside the repository's code.

Figure 7: Enterprise proxy cut. Beyond-codebase share (a) and detailed knowledge-gap subcomposition (b) for human repeat/insider contributors vs. outsiders. Once OSS process gates are stripped, insiders still draw 41.9% of their friction from beyond-codebase context.
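
The insider/outsider cut in this section can be sketched as a simple partition on GitHub's `author_association` field. The record schema (`author_association` and `gap` keys), the gap label strings, and the toy data below are illustrative assumptions; only the three insider association values and the four beyond-codebase subtypes come from the text.

```python
from collections import Counter

# Insider proxy as defined in the text.
INSIDER_ASSOCIATIONS = {"MEMBER", "OWNER", "COLLABORATOR"}
# Beyond-codebase subtypes from Section 3 (label strings are assumptions).
BEYOND_CODEBASE = {
    "process & policy",
    "product & business",
    "environment & platform",
    "external dependencies",
}


def cohort(author_association: str) -> str:
    """Map a GitHub author_association value to 'insider' or 'outsider'."""
    return "insider" if author_association.upper() in INSIDER_ASSOCIATIONS else "outsider"


def beyond_share_by_cohort(prs: list[dict]) -> dict[str, float]:
    """Share of identified gaps that are beyond-codebase, per cohort.

    PRs with no identified gap (gap is None) are excluded from the denominator.
    """
    beyond: Counter = Counter()
    total: Counter = Counter()
    for pr in prs:
        gap = pr.get("gap")
        if gap is None:
            continue
        group = cohort(pr["author_association"])
        total[group] += 1
        beyond[group] += gap in BEYOND_CODEBASE
    return {group: beyond[group] / total[group] for group in total}


# Toy records for illustration only, not study data.
toy = [
    {"author_association": "MEMBER", "gap": "codebase-internal"},
    {"author_association": "MEMBER", "gap": "environment & platform"},
    {"author_association": "NONE", "gap": "process & policy"},
    {"author_association": "CONTRIBUTOR", "gap": None},
]
print(beyond_share_by_cohort(toy))
```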


8. Surfacing the Context Layer

Overcoming the validation crisis requires three structural shifts in tooling and repository design:

  1. Codify Process and Policy: Convert prose contribution guidelines and unwritten commit, branch, or release rules into machine-enforced pre-review gates. Contributors and agents should learn policies from failing checks, not human reviews.
  2. Surface Just-in-Time Context: Feed review tools and coding agents dynamic context (Architecture Decision Records, RFCs, issue scope) at review time. Open protocols like the Model Context Protocol (MCP) enable agents to query external documentation instead of guessing conventions.
  3. Establish an Organizational Context Layer: Treat process, policy, product, and environment knowledge as first-class infrastructure. Serving this context programmatically to both humans and agents is the only way to scale validation as code generation volume climbs.
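
As one illustration of the first recommendation, a contribution rule can be expressed as a check that fails before a human reviewer is involved. The sketch below is hypothetical: the two rules (a Conventional-Commits-style subject line and a DCO `Signed-off-by:` trailer) stand in for whatever a project's guidelines actually require, and `check_commit_message` is an invented helper rather than part of any existing tool.

```python
import re

# Assumed policy: subject like "fix(parser): handle empty input", at most 72 characters.
SUBJECT_RE = re.compile(r"^(feat|fix|docs|refactor|test|chore)(\([\w\-]+\))?: \S.*$")
# DCO sign-off trailer of the kind added by `git commit -s`.
SIGNOFF_RE = re.compile(r"^Signed-off-by: .+ <[^<>@\s]+@[^<>\s]+>$", re.MULTILINE)


def check_commit_message(message: str) -> list[str]:
    """Return a list of policy violations; an empty list means the gate passes."""
    violations = []
    lines = message.strip().splitlines()
    subject = lines[0] if lines else ""
    if not SUBJECT_RE.match(subject):
        violations.append("subject must look like 'type(scope): summary'")
    if len(subject) > 72:
        violations.append("subject exceeds 72 characters")
    if not SIGNOFF_RE.search(message):
        violations.append("missing 'Signed-off-by:' trailer (DCO)")
    return violations


if __name__ == "__main__":
    msg = "fix(ci): pin runner image\n\nSigned-off-by: Ada Example <ada@example.com>\n"
    print(check_commit_message(msg) or "ok")
```

Returning the violations as messages, rather than a bare pass/fail, lets the failing check itself teach the rule to a contributor or agent.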

Methodology & Limitations (Appendix)

Statistics: Categorical comparisons use Pearson $\chi^2$ with Cramér's V and risk differences; proportions carry 95% Wilson score intervals; skewed continuous distributions (time-to-merge) use Mann–Whitney U with bootstrap 95% CIs.

Limitations:

  1. AI detection is a lower bound: Author-side detection misses untraced inline completions, trading recall for precision.
  2. Classification recall is the residual risk: We verified precision via a manual audit of a 27-PR stratified sample, finding 100% derived class agreement. A sensitivity simulation confirms that our insider/outsider beyond-share findings remain statistically significant ($p < 0.05$) in >99% of trials even under a hypothetical 10% classification error rate, and in >91% of trials under a 15% error rate.
  3. Open-source vs. enterprise: In companies, much context is resolved off-thread, so the thread-visible gap likely undercounts the true enterprise context gap. The insider-contributor cut supports the structural finding with high significance ($p = 5.83 \times 10^{-6}$, Cramér's V = 0.1604, odds ratio = 0.33).

AI tools assisted with data aggregation, initial analysis, and editorial drafting.