
Traceability from signal to code

Formalizing the provenance chain from product signal to merged pull request, and measuring traceability coverage across existing agent frameworks.

December 2024 · 14 min read

The provenance problem

When a human developer ships code, the chain of reasoning is implicit: a product manager describes a feature, a developer interprets it, writes code, and opens a PR. The developer can explain why they made each decision when asked. When an AI agent ships code autonomously, this implicit chain breaks. The agent processes a signal, makes dozens of micro-decisions, and produces a diff. Unless those decisions are explicitly recorded, the PR is a black box.

This is not just an accountability concern: it is a safety concern. If you cannot trace why a particular line of code was written, you cannot confidently assess whether it is correct. You cannot determine whether the agent misinterpreted the original signal, hallucinated a requirement, or silently changed scope. Traceability is a prerequisite for trust.

The provenance chain

We define a formal provenance chain with six links, each producing an auditable artifact:

Signal → PRD → Task → Branch → PR → Merge. The Signal is the originating product intent (a user request, a metric alert, a roadmap item). The PRD is the product requirements document the agent generates from the signal. Tasks are the decomposed implementation steps. The Branch contains the code changes. The PR packages the changes for review. The Merge is the final integration into the main branch.

{
  "trace_id": "tr_29f8a1",
  "signal": {
    "source": "linear",
    "id": "LIN-1847",
    "title": "Add webhook retry with exponential backoff"
  },
  "prd": {
    "id": "prd_a3c2",
    "requirements": [
      "Retry failed webhooks up to 5 times",
      "Use exponential backoff with jitter",
      "Dead-letter queue after max retries"
    ]
  },
  "tasks": [
    { "id": "task_01", "description": "Implement retry logic" },
    { "id": "task_02", "description": "Add DLQ table and migration" },
    { "id": "task_03", "description": "Write integration tests" }
  ],
  "branch": "feat/webhook-retry-tr_29f8a1",
  "pr": { "number": 847, "status": "open" }
}
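As a minimal sketch, a record like the one above could be loaded into a typed structure and checked for missing links. The class and method names here are hypothetical. Two conventions are assumptions inferred from the example rather than stated rules: that the branch name ends with the trace ID, and that a PR status of "merged" marks the final link.

```python
import json
from dataclasses import dataclass

EXAMPLE = """
{
  "trace_id": "tr_29f8a1",
  "signal": {"source": "linear", "id": "LIN-1847",
             "title": "Add webhook retry with exponential backoff"},
  "prd": {"id": "prd_a3c2", "requirements": [
      "Retry failed webhooks up to 5 times",
      "Use exponential backoff with jitter",
      "Dead-letter queue after max retries"]},
  "tasks": [
    {"id": "task_01", "description": "Implement retry logic"},
    {"id": "task_02", "description": "Add DLQ table and migration"},
    {"id": "task_03", "description": "Write integration tests"}],
  "branch": "feat/webhook-retry-tr_29f8a1",
  "pr": {"number": 847, "status": "open"}
}
"""


@dataclass(frozen=True)
class Task:
    id: str
    description: str


@dataclass(frozen=True)
class TraceRecord:
    trace_id: str
    signal_id: str
    requirements: tuple
    tasks: tuple
    branch: str
    pr_number: int
    pr_status: str

    @classmethod
    def from_json(cls, raw: str) -> "TraceRecord":
        """Parse a trace record shaped like the example above."""
        data = json.loads(raw)
        return cls(
            trace_id=data["trace_id"],
            signal_id=data["signal"]["id"],
            requirements=tuple(data["prd"]["requirements"]),
            tasks=tuple(Task(t["id"], t["description"]) for t in data["tasks"]),
            branch=data["branch"],
            pr_number=data["pr"]["number"],
            pr_status=data["pr"]["status"],
        )

    def missing_links(self) -> list:
        """Return the chain links that are empty or inconsistent."""
        problems = []
        if not self.signal_id:
            problems.append("signal")
        if not self.requirements:
            problems.append("prd")
        if not self.tasks:
            problems.append("task")
        # Assumed convention: the branch name ends with the trace ID.
        if not self.branch.endswith(self.trace_id):
            problems.append("branch")
        # Assumed convention: a status of "merged" marks the final link.
        if self.pr_status != "merged":
            problems.append("merge")
        return problems


record = TraceRecord.from_json(EXAMPLE)
print(record.missing_links())
```

For the example record, only the merge link is reported as missing, because the PR is still open.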

Formal traceability model

We define traceability coverage as the fraction of code changes in a PR that can be traced, through a task, back to at least one requirement in the PRD, which in turn traces to the original signal. A fully traceable PR has coverage of 1.0; a PR in which the agent added unrequested changes has coverage below 1.0.

$$\text{Traceability Coverage} = \frac{|\{\text{changes with valid trace}\}|}{|\{\text{all changes in PR}\}|}$$

Valid trace: $\text{change} \to \text{task} \to \text{requirement} \to \text{signal}$

Measuring this requires the agent to annotate each code change with a task ID, and each task with a requirement ID. The traceability system then verifies the chain programmatically. Any change without a valid chain is flagged for human review.
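The check described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual verifier. The `Change` type, the file-level granularity, the `req_*` identifiers, and the file paths are all hypothetical, and treating an empty PR as fully covered is a convention chosen for the sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Change:
    path: str                # file touched by the change (assumed granularity)
    task_id: Optional[str]   # annotation written by the agent, or None if absent


def has_valid_trace(
    change: Change,
    task_to_requirement: Dict[str, str],
    requirement_to_signal: Dict[str, str],
) -> bool:
    """True if change -> task -> requirement -> signal resolves end to end."""
    requirement = task_to_requirement.get(change.task_id)
    if requirement is None:
        return False
    return requirement_to_signal.get(requirement) is not None


def traceability_coverage(
    changes: List[Change],
    task_to_requirement: Dict[str, str],
    requirement_to_signal: Dict[str, str],
) -> Tuple[float, List[Change]]:
    """Return (coverage, changes flagged for human review)."""
    flagged = [
        c for c in changes
        if not has_valid_trace(c, task_to_requirement, requirement_to_signal)
    ]
    if not changes:
        # Assumed convention: an empty PR is vacuously fully traceable.
        return 1.0, flagged
    return (len(changes) - len(flagged)) / len(changes), flagged


# Hypothetical data loosely based on the webhook-retry example.
task_to_requirement = {"task_01": "req_1", "task_02": "req_3", "task_03": "req_1"}
requirement_to_signal = {"req_1": "LIN-1847", "req_2": "LIN-1847", "req_3": "LIN-1847"}
changes = [
    Change("webhooks/retry.py", "task_01"),
    Change("migrations/add_dlq.sql", "task_02"),
    Change("README.md", None),          # no annotation at all
    Change("utils/log.py", "task_99"),  # annotation points at an unknown task
]

coverage, flagged = traceability_coverage(
    changes, task_to_requirement, requirement_to_signal
)
print(coverage, [c.path for c in flagged])
```

Here two of the four changes resolve to a signal, so coverage is 0.5, and the two unresolved changes are returned for review.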

Coverage comparison across tools

We evaluated traceability coverage across four agent frameworks using 30 multi-file edit tasks. For each framework we measured: (1) whether the tool records any provenance data, (2) whether changes trace to specific requirements, and (3) whether the full chain back to the original signal is maintained.

Framework	Records Provenance	Change → Requirement	Full Chain Coverage
Copilot Workspace	Partial	0.45	0.00
Devin	Yes (logs)	0.62	0.15
SWE-agent	No	0.00	0.00
PMOS (Beaverise)	Yes (structured)	0.94	0.91

Most existing tools either do not record provenance at all, or record unstructured logs that cannot be programmatically verified. PMOS achieves 0.91 full-chain coverage by requiring the agent to annotate every code change with a task reference at generation time, and by structurally linking each task to a PRD requirement and each requirement to the originating signal.

Why 0.91 and not 1.0

The remaining 9% of changes, those without a valid trace, fall into three categories: (1) formatting and linting auto-fixes, (2) dependency updates triggered by new imports, and (3) test scaffolding that does not map directly to a single requirement. These gaps are systematic rather than random, and we are working on coverage rules to handle them.

Implications for autonomous safety

Traceability is the foundation of safe autonomous software engineering. Without it, code review becomes guesswork: the reviewer must reverse-engineer the agent's reasoning from the diff alone. With traceability, the reviewer can verify that every change serves a stated purpose, that no requirements were dropped, and that no unrequested changes were introduced.
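The reviewer-side checks can be sketched as a small report that lists requirements no change traces to and changes that trace to no requirement. The function name, the `req_*` identifiers, and the file paths are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Optional


def review_report(
    requirement_ids: List[str],
    task_to_requirement: Dict[str, str],
    change_task_ids: Dict[str, Optional[str]],
) -> Dict[str, List[str]]:
    """Summarize dropped requirements and unrequested changes for a PR.

    change_task_ids maps each changed path to its task annotation (or None).
    """
    known = set(requirement_ids)
    covered = set()
    unrequested = []
    for path, task_id in change_task_ids.items():
        requirement = task_to_requirement.get(task_id)
        if requirement in known:
            covered.add(requirement)
        else:
            # No annotation, unknown task, or task pointing at no requirement.
            unrequested.append(path)
    dropped = [r for r in requirement_ids if r not in covered]
    return {"dropped_requirements": dropped, "unrequested_changes": unrequested}


report = review_report(
    requirement_ids=["req_1", "req_2", "req_3"],
    task_to_requirement={"task_01": "req_1", "task_02": "req_3"},
    change_task_ids={
        "webhooks/retry.py": "task_01",
        "migrations/add_dlq.sql": "task_02",
        "README.md": None,
    },
)
print(report)
```

In this example the report flags `req_2` as dropped, since no change traces to it, and `README.md` as unrequested, which are the two places a reviewer would look first.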

An autonomous agent that cannot explain why it wrote a line of code is no more trustworthy than a contractor who cannot explain why they billed for it.

Conclusion

Traceability from signal to code is not optional overhead: it is the mechanism that makes autonomous software engineering auditable and safe. By formalizing the provenance chain and measuring coverage, we can quantify how much of an agent's output is accountable and direct review effort to the parts that are not. This is how we build trust in systems that write code on our behalf.

References

[1] Gotel, O. and Finkelstein, A. "An Analysis of the Requirements Traceability Problem." RE 1994.
[2] Yang, J. et al. "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering." 2024.
[3] Cognition Labs. "Devin: The First AI Software Engineer." 2024.
[4] GitHub. "Copilot Workspace: Technical Preview." 2024.
[5] Mäder, P. and Egyed, A. "Do Developers Benefit from Requirements Traceability When Evolving and Maintaining a Software System?" ESE 2015.