
Human-in-the-loop safety for autonomous software

A five-level autonomy spectrum for software agents, with approval gates, risk scoring, and rollback protocols at each level.

November 2024 · 11 min read

The autonomy spectrum

Self-driving cars have SAE levels 0-5. Autonomous software agents need an equivalent framework. Without it, the industry oscillates between two extremes: fully manual coding assistants (Copilot-style autocomplete) and fully autonomous agents that ship code without human review. Neither extreme is appropriate for production software.

We propose five levels of autonomy for software agents, each with defined approval gates and human oversight requirements:

| Level | Name | Description | Human Role |
| --- | --- | --- | --- |
| L0 | Manual | Human writes all code; AI provides suggestions | Author |
| L1 | Assisted | AI writes code blocks; human reviews every change | Editor |
| L2 | Supervised | AI executes defined tasks; human approves PRs | Reviewer |
| L3 | Monitored | AI executes and merges low-risk changes; human reviews high-risk | Auditor |
| L4 | Autonomous | AI operates independently with oversight dashboards | Governor |

Most production teams today operate at L0-L1. PMOS is designed to operate at L2-L3, with a clear path to L4 for well-defined, low-risk task categories. The key insight is that autonomy level should vary by task, not by system. The same agent can operate at L3 for routine bug fixes and L1 for security-sensitive changes.
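A per-task policy can be sketched as a simple lookup. The category names and level assignments below are illustrative, not the PMOS configuration:

```python
# Hypothetical per-task policy: autonomy is a property of the task, not
# the system. Category names and level assignments are illustrative.
from enum import IntEnum

class AutonomyLevel(IntEnum):
    MANUAL = 0      # L0: human authors all code
    ASSISTED = 1    # L1: human reviews every change
    SUPERVISED = 2  # L2: human approves PRs
    MONITORED = 3   # L3: human reviews only high-risk changes
    AUTONOMOUS = 4  # L4: human governs via dashboards

TASK_POLICY = {
    "routine_bug_fix": AutonomyLevel.MONITORED,
    "dependency_bump": AutonomyLevel.SUPERVISED,
    "auth_change": AutonomyLevel.ASSISTED,
}

def level_for(task_category):
    # Unknown categories default to the conservative assisted mode.
    return TASK_POLICY.get(task_category, AutonomyLevel.ASSISTED)
```

Defaulting unknown categories to a conservative level means new kinds of work start under close review and earn autonomy later.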

Approval gate design

At each autonomy level, we define approval gates: checkpoints where the system must obtain human authorization before proceeding. Gates are placed at three points in the execution pipeline: (1) after planning (before code is written), (2) after implementation (before PR is opened), and (3) after review (before merge).

At L2 (Supervised), all three gates require explicit human approval. At L3 (Monitored), gate behavior depends on the risk score of the task: low-risk tasks skip the planning gate and auto-merge after passing CI; high-risk tasks require all three gates. At L4, only anomaly-triggered gates remain active.
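The gate rules above can be sketched as a small selection function. The risk threshold and the treatment of CI as a purely automated check (not a human gate) are assumptions:

```python
# Illustrative sketch: which approval gates are active for a task,
# given its autonomy level and risk score.
PLANNING, IMPLEMENTATION, REVIEW = "planning", "implementation", "review"
ALL_GATES = (PLANNING, IMPLEMENTATION, REVIEW)

def active_gates(level, risk, anomaly=False, low_risk_threshold=0.3):
    if level <= 2:
        # L2 and below: every gate requires explicit human approval.
        return list(ALL_GATES)
    if level == 3:
        # L3: low-risk tasks skip the gates and auto-merge after CI;
        # high-risk tasks keep all three gates.
        return [] if risk < low_risk_threshold else list(ALL_GATES)
    # L4: only anomaly-triggered gates remain active.
    return list(ALL_GATES) if anomaly else []
```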

Risk scoring

The risk score determines which approval gates are active for a given task. We compute risk as a weighted combination of four factors: blast radius (how many files and services are affected), sensitivity (whether the change touches auth, payments, or user data), reversibility (whether the change can be rolled back), and novelty (whether similar changes have been successfully completed before).

Risk(task) = w₁ · BlastRadius + w₂ · Sensitivity + w₃ · (1 - Reversibility) + w₄ · Novelty

where w₁ = 0.3, w₂ = 0.35, w₃ = 0.2, w₄ = 0.15

Risk ∈ [0, 1], threshold for auto-merge: Risk < 0.3

A task that modifies a single utility function with existing test coverage scores low (~0.1). A task that changes the authentication middleware across multiple services scores high (~0.8). The threshold for each gate is configurable per team: conservative teams set lower thresholds; teams with mature CI pipelines can afford higher ones.
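The formula translates directly to code. Factor values are assumed to be normalized to [0, 1] upstream; the example inputs for the utility-function change are illustrative:

```python
# Risk score per the weighted formula above. Inputs are assumed to be
# normalized to [0, 1] before scoring.
WEIGHTS = {"blast_radius": 0.30, "sensitivity": 0.35,
           "irreversibility": 0.20, "novelty": 0.15}

def risk_score(blast_radius, sensitivity, reversibility, novelty):
    # Reversibility reduces risk, so it enters as (1 - Reversibility).
    return round(WEIGHTS["blast_radius"] * blast_radius
                 + WEIGHTS["sensitivity"] * sensitivity
                 + WEIGHTS["irreversibility"] * (1 - reversibility)
                 + WEIGHTS["novelty"] * novelty, 3)

def auto_merge_allowed(risk, threshold=0.3):
    # Threshold is configurable per team; 0.3 is the default above.
    return risk < threshold

# A well-tested single-function change: small blast radius, not
# sensitive, easily reverted, similar changes seen before.
print(risk_score(0.1, 0.0, 0.9, 0.2))  # → 0.08
```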

Constraint specification

Beyond risk scoring, teams define static constraints that override dynamic scoring. Constraints are rules that the agent must always follow, regardless of the risk score. Examples include:

# .pmos/constraints.yaml
constraints:
  - scope: "src/auth/**"
    rule: "always_require_approval"
    gate: "all"
    reason: "Authentication is security-critical"

  - scope: "*.migration.*"
    rule: "always_require_approval"
    gate: "pre_merge"
    reason: "Database migrations are irreversible"

  - scope: "package.json"
    rule: "require_approval_if"
    condition: "dependency_removed OR major_version_bump"
    gate: "pre_implementation"

Constraints are version-controlled alongside the codebase and enforced at the system level. The agent cannot bypass a constraint: it is a hard boundary, not a suggestion.
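Enforcement reduces to matching a task's changed file paths against the constraint scopes. This sketch uses glob matching via Python's `fnmatch`; the exact matching semantics PMOS uses are an assumption:

```python
# Sketch of constraint matching: glob scopes from .pmos/constraints.yaml
# checked against a task's changed file paths. Field names follow the
# example above.
from fnmatch import fnmatch

CONSTRAINTS = [
    {"scope": "src/auth/**", "rule": "always_require_approval", "gate": "all"},
    {"scope": "*.migration.*", "rule": "always_require_approval", "gate": "pre_merge"},
]

def matching_constraints(changed_paths, constraints=CONSTRAINTS):
    # fnmatch's '*' already crosses '/', so it approximates the
    # recursive '**' used in the config.
    return [c for c in constraints
            if any(fnmatch(path, c["scope"]) for path in changed_paths)]
```

A change under `src/auth/` matches the first rule and forces human approval at every gate, whatever its risk score.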

Rollback protocols

Every change made by an autonomous agent must be reversible. PMOS implements rollback at three levels: (1) git-level revert for code changes, (2) migration rollback for database changes, and (3) feature-flag disabling for shipped features. When a merged change is flagged as problematic (by monitoring alerts, user reports, or human review), the system can automatically initiate a rollback without human intervention, then notify the team.

Rollback speed is critical. Our target is < 60 seconds from detection to revert for git-level changes, and < 5 minutes for migration rollbacks. Feature flag changes propagate in < 10 seconds. These timelines assume the agent pre-computes rollback plans at merge time, so the revert path is already validated.
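A pre-computed rollback plan can be as simple as a validated command list built at merge time. The commands and field names below are illustrative, not the PMOS implementation:

```python
# Hedged sketch: build the revert path at merge time so it already
# exists, validated, when a monitoring alert fires.
from dataclasses import dataclass
from typing import List

@dataclass
class RollbackPlan:
    merge_sha: str
    steps: List[str]       # ordered commands, dry-run checked at merge
    validated: bool = False

def plan_git_revert(merge_sha: str, branch: str = "main") -> RollbackPlan:
    # Reverting a merge commit requires choosing a parent (-m 1).
    return RollbackPlan(
        merge_sha=merge_sha,
        steps=[f"git revert --no-edit -m 1 {merge_sha}",
               f"git push origin {branch}"],
    )
```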

The human as governor

At L4, the human role shifts from reviewer to governor: setting policies, defining constraints, monitoring dashboards, and intervening only when the system flags anomalies. This is not a reduction in human responsibility: it is a change in the nature of that responsibility. Instead of reviewing every line of code, the human ensures that the system's policies are correct, its risk model is calibrated, and its constraints are comprehensive.

The goal is not to remove humans from the loop. The goal is to move humans to the right part of the loop: governance, not line-by-line review.

Conclusion

Safe autonomous software engineering requires a graduated approach: not a binary choice between manual and autonomous, but a spectrum with clearly defined levels, approval gates, risk models, and rollback mechanisms. By making the autonomy level explicit and task-dependent, teams can adopt AI agents incrementally, starting with low-risk automation and expanding as trust is earned through measured performance.
