How Senior SWE-Bench works
We're excited to release Senior SWE-Bench, a benchmark for evaluating agents on their ability to act as senior engineers. Senior SWE-Bench is open-source and Harbor-compatible. The initial release has 100 total tasks, with 50 kept private to mitigate contamination.
Why we built Senior SWE-Bench
With the rise of more capable agents and integrations with natural working surfaces like Slack and GitHub, most of us already treat agents like senior engineers. We expect them to complete work independently and tastefully from messages or command-line prompts without full specs. Like senior engineers, they need to exercise skills like aligning to human intent, open-ended technical design, complex runtime debugging, and good taste.
And yet, most existing benchmarks treat agents like junior engineers, providing overly specific requirements (especially for feature tasks) and assessing agents primarily on their ability to write correct code (rather than demonstrate a wide range of skills). The goal of Senior SWE-Bench is to create the most reliable and realistic assessment of agents on their ability to fulfill key behaviors of senior engineers:
- Senior engineers build features without over-specified requirements
- Senior engineers solve bugs that require runtime investigation from behavioral reports
- Senior engineers ship the right code without being told to
Most existing benchmarks have unrealistically over-specified instructions, nothing like what an engineer might send a background agent on Slack. The over-specification exists primarily to make pre-written verifiers workable in settings where behavior and interfaces are likely to change. Even recent benchmarks that use more behavioral verifiers are subject to this trade-off, and have task instructions that can look like complete specifications. We resolve this trade-off by introducing a validation agent that adapts to a broad solution design space while still testing code in runtime.
Past benchmarks have also largely focused on behavioral correctness over quality, a gap highlighted by METR's “Many SWE-bench-Passing PRs Would Not Be Merged into Main”. Cognition recently released FrontierCode, which scores agents on code quality through extensive per-task rubrics designed by repo maintainers themselves. We take a more scalable approach: a global taste judge that uses observational and comparative signals to assess code quality, trading task-level assessment fidelity for a reward mechanism that can be applied to any task.
All of this is combined with an extensive, multi-stage quality control process that combines specialized, automated evaluations with human expert reviews. A benchmark is only as good as its tasks are reliable, and we want to ensure Senior SWE-Bench is measuring real signal.
Now let's dive into how Senior SWE-Bench is designed!
Task design
Our goal is to craft realistic and diverse tasks, representative of the breadth of software development work we entrust coding agents with.
Task sources
Tasks are based on real PRs in real repos, selected for the skills required to do the engineering work.
- Recent real-world PRs: Senior SWE-Bench tasks are sourced from real-world PRs merged after February 2026 in open-source repositories. We focus on real-world PRs so that tasks represent provably realistic work and have authoritative reference patches from repo maintainers.
- Diverse repositories: Repositories span a wide range of types, domains, and age (see Source repositories). 50% of the repos started in the last 5 years; this is an intentional choice to capture higher velocity of larger features and usage of more modern technology stacks.
- Multi-PR scope: Tasks (especially feature tasks) can be created by grouping multiple related PRs together in order to create realistic scope.
- Senior-authored: Most Senior SWE-Bench tasks are based on PRs authored by engineers with 100+ commits in the respective repository, and we prioritize tasks based on PRs by repo maintainers.
- Complex tasks: PRs are filtered for complexity, including changes that cross service boundaries and fixes with significant runtime investigation. See Task types below.
Task types
Senior SWE-Bench tasks fall into two major segments: Design-and-Build and Investigate-and-Fix. Design-and-Build tasks focus on the complexity of designing features and migrations without full specifications. Investigate-and-Fix tasks focus on runtime diagnosis of tricky bugs and performance issues from behavioral reports.
| Aspect | Design-and-Build | Investigate-and-Fix |
|---|---|---|
| Task types | Feature, Migration | Bug, Performance |
| Source PR qualities | Complex, multi-component features and migrations | Bug and performance fixes with evidence of runtime investigation |
| Instruction style | Description of requested change with PM-level user stories | Behavioral user issue report |
| Reward mechanisms | Verifiers, validation, rubrics, taste | Verifiers, rubrics, taste |
| Interface changes | May involve new or changed system-level interfaces | System-level interfaces are not changed |
| Example skills required | Interface / API design, Data model design, Architectural refactoring, Cross-service reasoning, Framework fluency | Fault localization, Log / error interpretation, Local deployment, Concurrency reasoning, Query optimization |
Task scope
We compare the scope of Senior SWE-Bench tasks to other recent benchmarks across instruction length and reference-solution size. Senior SWE-Bench tasks have more naturally under-specified instructions and cover a wide distribution of task scopes, from pinpoint bug fixes to features that touch multiple service layers.
Task environments
Task environments are created by cloning the target repository, rewinding before the first task commit, then expiring the reflog and pruning the repo. Major system dependencies (such as Python or Postgres) are pre-installed, as are heavier individual libraries (such as PyTorch). In order to best reflect the performance of a background agent dropped into a development sandbox, full environment setup (such as installing required packages and starting services) is left to the agent. As such, all tasks have allow_internet=true. We perform trajectory analysis to identify internet-based cheating, and plan to limit internet usage to specific required domains in future releases.
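The repository preparation described above can be sketched as a short sequence of git commands. This is a minimal illustration, not the benchmark's actual tooling: the function name `build_env_commands`, its arguments, and the choice to model "rewinding" as a hard reset to the parent of the first task commit are all assumptions.

```python
from typing import List


def build_env_commands(repo_url: str, first_task_commit: str, workdir: str) -> List[List[str]]:
    """Return git commands for a hypothetical task-environment build.

    Assumption: "rewinding before the first task commit" is modeled as a
    hard reset to that commit's parent (`<sha>^`).
    """
    git = ["git", "-C", workdir]
    return [
        # 1. Clone the target repository.
        ["git", "clone", repo_url, workdir],
        # 2. Rewind the working tree to just before the first task commit.
        git + ["reset", "--hard", f"{first_task_commit}^"],
        # 3. Expire the reflog so the rewound commits are no longer referenced by it.
        git + ["reflog", "expire", "--expire=now", "--all"],
        # 4. Prune objects that have become unreachable.
        git + ["gc", "--prune=now"],
    ]


if __name__ == "__main__":
    import shlex

    # Placeholder URL, commit, and path, for illustration only.
    for cmd in build_env_commands("https://example.com/org/repo.git", "abc123", "/tmp/repo"):
        print(shlex.join(cmd))
```

Remote-tracking branches and tags can still reference later commits, so a real pipeline would also need to delete those refs before pruning; that step is left out of this sketch.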
Reward design
Senior SWE-Bench uses a combination of reward mechanisms to evaluate agent performance across correctness and taste. The validation agent and taste judge are discussed in depth in the following sections.
| Reward mechanism | Description | Used for Design-and-Build? | Used for Investigate-and-Fix? | Exercises code? | LLM-based? | Used for correctness? | Task-specific? |
|---|---|---|---|---|---|---|---|
| Verifiers (pre-written) | Pre-written behavioral tests (e.g. pytest, go test) | Yes, required | Yes, optional | Yes | No | Yes | Yes |
| Validation (adaptive) | Behavioral tests (e.g. pytest, go test) written by an agent (using an expert-designed recipe) adapted to the submitted solution | Yes, required | No | Yes | Yes | Yes | Yes |
| Rubric judge | Task-specific rubric assessed by an LLM | Yes, optional | Yes, optional | No | Yes | No | Yes |
| Taste judge | Global code-quality rubric assessed by an agent | Yes | Yes | No | Yes | No | No |
Testing behavior, not implementation details
Verifiers and validation tests are written to test behavior rather than implementation details. This allows reward assignment to be robust to different valid designs and small implementation differences. Whenever possible, they use system-level interfaces like HTTP endpoints, CLI commands, or rendered UI components.
Benchmarks that primarily test coding skills with full-spec instructions are able to harvest unit tests from PRs as verifiers, since they run against the precisely specified implementation shape. When testing behaviorally, verifier and validation tests have to instead be designed from scratch to encode the right discriminating signal at the right system-interface level. When available, we use integration or system tests as inspiration.
Testing stated and unstated requirements
Senior engineers are expected to ship code in a way that aligns with both explicitly stated requirements (such as product requirements from a PM) and unstated requirements (such as implementing functionality within a system-wide framework). Senior SWE-Bench uses runtime reward mechanisms (verifiers and validation) to test both stated and unstated requirements, which generally fall into three categories.
| Requirement type | Description | Example | Stated in instruction |
|---|---|---|---|
| Behavioral contract | Behavioral requirement stated in the task instruction | New functionality added to API | Yes |
| Load-bearing codebase practice | Strong and consistent codebase pattern that impacts functionality | New API endpoint is registered with an existing auth framework | No |
| General best practice | Practice obvious to a senior engineer with no defensible alternatives | API endpoint that lists an unbounded collection must be capped or paginated | No |
Since all verifier and validation tests must pass for a successful solve, careful attention is paid to ensuring that especially the unstated requirements are truly codebase-mandated or have no reasonable alternative. Agent trajectories are analyzed for signs of unfair grading on this basis as part of the quality control process, and tasks are revised accordingly.
Anything that might be deemed a codebase-evident or general-practice preference (but not a load-bearing requirement) is routed instead to the rubric, where a single failed criterion does not gate a successful solve. Additionally, requirements that are infeasible to test in runtime (such as an integration with an external service) are routed to the task-specific rubric. We similarly want non-gating treatment for these, as the LLM judge has lower reliability than methods that exercise the code in runtime.
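The routing rule in the two paragraphs above can be written as a small decision function. This is only a sketch of the logic as described; the enum `RequirementKind`, the function `route_requirement`, and the return labels are hypothetical names, not part of the benchmark.

```python
from enum import Enum


class RequirementKind(Enum):
    BEHAVIORAL_CONTRACT = "behavioral contract"
    LOAD_BEARING_PRACTICE = "load-bearing codebase practice"
    GENERAL_BEST_PRACTICE = "general best practice"
    PREFERENCE = "codebase-evident or general-practice preference"


def route_requirement(kind: RequirementKind, testable_in_runtime: bool) -> str:
    """Decide which reward mechanism should carry a requirement.

    Returns "runtime" (gating verifier or validation test) or "rubric"
    (non-gating, LLM-judged criterion).
    """
    # Preferences never gate a solve, and requirements that cannot be
    # exercised in runtime fall back to the task-specific rubric.
    if kind is RequirementKind.PREFERENCE or not testable_in_runtime:
        return "rubric"
    return "runtime"
```

For example, `route_requirement(RequirementKind.LOAD_BEARING_PRACTICE, testable_in_runtime=False)` lands in the rubric, matching the external-service case mentioned above.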
Pass-to-pass and fail-to-pass tests
Verifiers, validation, and rubrics can all encode both pass-to-pass and fail-to-pass tests. Pass-to-pass tests succeed on the pre-patch codebase and act as regression guards, assuring that existing behavior continues to work. Fail-to-pass tests succeed only after a successful patch has been applied and discriminate between valid and invalid solutions.
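One way to make the two categories concrete is to classify a check by how it behaves on a no-op patch (the pre-patch codebase) and on the oracle patch. The helper below is a hedged sketch; `classify_check` is a hypothetical name, and treating a check that fails on the oracle as invalid is an assumption about how such checks would be handled.

```python
from typing import Optional


def classify_check(passes_on_noop: bool, passes_on_oracle: bool) -> Optional[str]:
    """Classify a check from its outcomes on the no-op and oracle patches."""
    if passes_on_noop and passes_on_oracle:
        # Existing behavior keeps working: a regression guard.
        return "pass-to-pass"
    if not passes_on_noop and passes_on_oracle:
        # Only a valid patch satisfies it: a discriminating test.
        return "fail-to-pass"
    # Assumption: a check that fails on the oracle fits neither category
    # and would need to be revisited.
    return None
```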
Enabling realistic feature tasks with the validation agent
For tasks that involve changing interfaces (like feature tasks), most existing benchmarks are forced into a trade-off between realistic instructions (without over-specified requirements) and reliable evaluation. This is primarily due to the predominant reward mechanisms: pre-written verifiers and LLM-judges. To address this, we introduce a validation agent which uses expert-designed recipes to write system-level, behavioral test scripts adapted to the solutions from agents under evaluation.
How the validation agent works
The validation agent's job is to write behavioral verifier tests adapted to the specific shape of the solution submitted by the solving agent. The validation agent is provided with an expert-designed validation spec, which the solving agent does not have access to. The validation spec includes the set of tests to write, the recipe for writing each test, and multiple test cases that will run for each test. At the start of its workflow, the validation agent reviews the submitted solution for implementation details, such as the response structure of an HTTP endpoint, and writes the verifier scripts accordingly.
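To picture what a validation spec carries, the data structure below mirrors the three ingredients named above: the tests to write, a recipe per test, and multiple cases per test. Every class and field name here is hypothetical; the actual spec format is not described in this post.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Case:
    """One parameterized input / expected-output pair."""
    inputs: Any
    expected: Any


@dataclass
class ValidationTest:
    name: str
    recipe: str  # expert-written procedure the validation agent follows
    cases: List[Case] = field(default_factory=list)


@dataclass
class ValidationSpec:
    task_id: str
    tests: List[ValidationTest] = field(default_factory=list)

    def total_cases(self) -> int:
        """Number of parameterized cases run across all tests."""
        return sum(len(t.cases) for t in self.tests)


# Illustrative content only; not taken from a real task.
spec = ValidationSpec(
    task_id="example-task",
    tests=[
        ValidationTest(
            name="list_endpoint_is_paginated",
            recipe="Inspect the patch for the new endpoint, then request pages over HTTP.",
            cases=[Case(inputs={"page_size": 10}, expected=10),
                   Case(inputs={"page_size": 25}, expected=25)],
        )
    ],
)
```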
The Senior SWE-Bench validation process synthesizes test scripts from scratch (rather than making edits to existing scripts) in order to be able to adapt to a broad range of valid solution shapes. This includes adapting to structural aspects (such as invoking a multi-step process flow implemented in the submitted solution) rather than just smaller renames and parameterizations.
Once the validation agent has completed writing the test scripts, they are executed programmatically by the appropriate test driver (e.g. pytest, cargo test). An LLM judge then reviews the test execution and the test scripts, and decides whether to accept the scripts, write feedback for revision, or discard the trial due to validation agent misalignment (see Ensuring validation agent reliability). One round of revision is permitted before a final accept-or-discard verdict is given. The final score is determined by the final test script execution across all parameterized test cases.
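The write, execute, and judge loop described above, with its single revision round, can be sketched as follows. The callables `write_scripts`, `execute`, and `judge` are hypothetical stand-ins for the validation agent, the test driver, and the LLM judge; treating a second "revise" verdict as a discard is an assumption based on the "final accept-or-discard" wording.

```python
from typing import Any, Callable, Dict, Optional, Tuple

Verdict = str  # one of "accept", "revise", "discard"


def run_validation(
    write_scripts: Callable[[Optional[str]], str],
    execute: Callable[[str], Dict[str, Any]],
    judge: Callable[[str, Dict[str, Any]], Tuple[Verdict, str]],
    max_revisions: int = 1,
) -> Optional[Dict[str, Any]]:
    """Return the accepted test results, or None if the trial is discarded."""
    feedback: Optional[str] = None
    for _ in range(max_revisions + 1):
        scripts = write_scripts(feedback)       # agent writes (or rewrites) the tests
        results = execute(scripts)              # run by the test driver
        verdict, feedback = judge(scripts, results)
        if verdict == "accept":
            return results                      # score comes from this final execution
        if verdict == "discard":
            return None
        # verdict == "revise": loop again only if a revision round remains
    return None  # revision budget exhausted without an accept
```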
The validation agent allows for more realistic instructions
Because the validation agent is built to adapt to a wide range of valid solution shapes, the task's instructions do not need to spell out the full interface and behavioral contract as they would when using a pre-written verifier. As such, Senior SWE-Bench feature tasks can be written with natural language instructions that describe the desired feature without a level of specification that would be unrealistic for real-world agent usage. Senior SWE-Bench feature task instructions are composed of an overall request and a set of behavioral requirements (or user stories) in line with what a product manager might provide. The behavioral requirements correspond to validation tests and are injected from the validation spec into the task instruction.
As an example, compare the instructions for a feature task in Senior SWE-Bench and SWE-Bench Pro.
Illustrative samples. Senior SWE-Bench instructions read like a teammate's message; verifier-driven benchmarks lean on rigid, over-specified requirements. Note: the Senior SWE-Bench example was built using the same task source as the SWE-Bench Pro sample for comparison; the task is not included in Senior SWE-Bench.
Ensuring validation agent reliability
We take multiple steps to ensure that the validation agent is reliable; implemented naively, it could introduce stochasticity, behavioral misalignment, or false positives/negatives.
- Clear role definition: The validation agent is assigned a clear role of acting as a test engineer whose job is to implement the test procedure faithfully, not fix issues with the solution.
- Expert-authored procedures: The backbone of the validation process is the validation spec, which includes extensive procedures and guidance for the validation agent. It includes details on how to use test fixtures and utilities reliably, patterns to look for in the submitted patch, and explicit testing steps.
- Testing on oracle and no-op: To ensure the validation agent acts reliably on the tasks, the validation agent is run 3 times on the oracle patch and 3 times on a no-op patch for a given task before it is accepted into the benchmark. Tasks are rejected if pass^3 < 1 on the oracle patch or pass^3 > 0 on the no-op. That is, we reject a task if the validation agent doesn't always accept a known good solution or always reject a known bad solution.
- Multiple test cases: Each test written by the validation agent is parameterized and run against multiple input and expected output pairs. This makes it more difficult for either the solving agent or validation agent to hardcode correct answers.
- Judge review: An LLM judge reviews the validation agent's output and grades against a rubric that includes behavioral fidelity and completeness with respect to the validation spec. Mechanical checks for collusion are also run. If the validation agent performs insufficiently, the trial is discarded. In practice, less than 5% of trials are discarded.
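The oracle and no-op acceptance gate from the list above can be expressed as a small check. This sketch follows the plain-language reading of the rule (every oracle run must be accepted and every no-op run must be rejected); the function name `accept_task` and the boolean-list representation of runs are assumptions.

```python
from typing import Sequence


def accept_task(oracle_runs: Sequence[bool], noop_runs: Sequence[bool], runs: int = 3) -> bool:
    """Gate a task on validation-agent reliability.

    Each sequence holds one boolean per validation run: True if the
    validation tests passed for that patch, False otherwise.
    """
    if len(oracle_runs) != runs or len(noop_runs) != runs:
        raise ValueError(f"expected {runs} runs for each patch")
    always_accepts_oracle = all(oracle_runs)   # known good solution always passes
    always_rejects_noop = not any(noop_runs)   # known bad solution never passes
    return always_accepts_oracle and always_rejects_noop
```

A task with oracle outcomes `[True, True, False]` would be rejected under this rule, since the validation agent did not accept the known good solution on every run.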
We experimented with multiple harnesses and models to implement the validation agent. We found that Mini-SWE-Agent with Claude Sonnet 4.6 offered the best balance of performance and cost.
Trade-offs in using the validation agent
The primary downside of using the validation agent is the time and expense incurred in running each trial. While the wall-clock time and cost of running a trial are typically dominated by the solving agent, the validation agent adds non-negligible overhead. We are currently working on methods to reduce this overhead, especially for use in reinforcement learning.
| Metric | Range (median) |
|---|---|
| Wall-clock time overhead | 6–20% (median 11%) |
| Token cost overhead | 2–16% (median 6%) |
Mini-SWE-Agent with Claude Sonnet 4.6 was used for all validation agent runs. Results vary based on infrastructure and model providers.
Challenging bug tasks with runtime investigation
Senior engineers are trusted to resolve tricky bugs independently from user reports, well above the level of simple code searches from hints in instructions. Bug and performance tasks in Senior SWE-Bench focus on investigation, from starting services to debugging subtle runtime issues. They are sourced from PRs that contain evidence of needing significant runtime investigation to solve (e.g. logs, profiling data, reproduction steps).
Solving a Senior SWE-Bench bug task
In order to solve a Senior SWE-Bench bug task, agents generally need to follow a process similar to what a human maintainer might do. This includes gathering diagnostic evidence from logs and other sources, localizing and reproducing the issue, and making targeted fixes.
Behavioral tests allow for more realistic instructions
By testing behavior after a valid fix rather than implementation details, Senior SWE-Bench investigate-and-fix tasks can be written as a natural language description of the issue's symptoms. This much more closely resembles interaction with a background agent, where an initial user report is provided and the agent is tasked with diagnosing and fixing the issue.
As an example, compare the instructions for a bug task in Senior SWE-Bench and SWE-Bench Pro.
Illustrative samples. Senior SWE-Bench instructions read like an issue report on Slack, describing the observed behavior so that the agent has to investigate, instead of spelling out exactly how to make the fix. Note: instructions do not represent the same task.
Grading tasteful solves
Beyond behavioral correctness, we also score tasteful solves by combining runtime correctness tests with several quality metrics. We introduce a taste judge, which uses observational and comparative signals to grade the agent's solution. The judge's rubric is designed to be technology- and practice-agnostic, grounding its assessment in observed codebase practices and the reference solution. For example, it won't penalize a model for not writing documentation in a codebase that doesn't broadly document its code (in fact, it will reward it for being parsimonious with documentation). The judge uses two rubrics:
| Criterion | Description |
|---|---|
| Minimality | Changes are focused, with no scope creep. |
| Approach quality | Right solution approach: root-cause fix for bugs, sound design for features. |
| Hygiene | No shortcuts, workarounds, hardcoded values, or other code smells. |
| Fluency | Fluent with the domain, frameworks, tools, and conventions. |
| Craftsmanship | Displays quality engineering effort that a senior reviewer would approve. |
| Style consistency | Formatting, naming, and structure match the surrounding code. |
| Pattern adherence | Uses the project's established patterns and idioms. |
| Library usage | Reuses libraries already in the project rather than introducing alternatives. |
| Abstraction level | The right abstraction level for this part of the codebase. |
| Documentation fit | Comments and docstrings match the project's style and density. |
We calibrated the taste judge against human reviewers to ensure that its assessments are reliable. We set the thresholds for tasteful solves to conservative values (any score better than 2 out of 5) to minimize the risk of false negatives from judge stochasticity.
In addition, we measure patch bloat (the ratio of SLOC in the agent's solution to SLOC in the reference implementation) to assess whether the agent is bloating the codebase.
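As a rough illustration of the patch bloat ratio, the sketch below counts added lines in two unified diffs and divides one by the other. How SLOC is actually counted for the benchmark is not specified here; counting non-blank added lines is an assumption, and `added_sloc` and `patch_bloat` are hypothetical helper names.

```python
def added_sloc(unified_diff: str) -> int:
    """Count non-blank added lines in a unified diff (a simple SLOC proxy)."""
    count = 0
    for line in unified_diff.splitlines():
        if line.startswith("+++"):
            # File header such as "+++ b/path". Simplification: an added line
            # whose content itself begins with "++" would also be skipped.
            continue
        if line.startswith("+") and line[1:].strip():
            count += 1
    return count


def patch_bloat(agent_diff: str, reference_diff: str) -> float:
    """Ratio of SLOC in the agent's patch to SLOC in the reference patch."""
    reference = added_sloc(reference_diff)
    if reference == 0:
        raise ValueError("reference patch adds no source lines")
    return added_sloc(agent_diff) / reference
```

Under this proxy, a ratio near 1.0 means the agent's patch is about the size of the reference, and larger values suggest a heavier patch.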
The taste scores are combined with the other reward mechanisms (pre-written verifiers, validation agent, and task rubrics) to determine if the agent completed a tasteful solve. All verifier and validation agent tests must pass, and the other scores must meet a minimum threshold.
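The gating logic can be sketched as a single predicate. The taste threshold (strictly better than 2 out of 5) comes from the text above; everything else is an assumption made for illustration, including the `TrialScores` fields, applying the taste threshold to each score individually, and the placeholder `rubric_min` value.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TrialScores:
    verifier_passed: Sequence[bool]    # one entry per pre-written verifier test
    validation_passed: Sequence[bool]  # one entry per validation test case (may be empty)
    taste_scores: Sequence[float]      # taste judge scores on a 1-5 scale
    rubric_score: float                # task rubric score in [0, 1] (hypothetical scale)


def is_tasteful_solve(scores: TrialScores, taste_floor: float = 2.0, rubric_min: float = 0.5) -> bool:
    """Return True if a trial counts as a tasteful solve under these assumptions."""
    # Runtime tests are gating: every verifier and validation test must pass.
    runtime_ok = all(scores.verifier_passed) and all(scores.validation_passed)
    # Taste must be strictly better than the floor (2 out of 5 in the post).
    taste_ok = all(s > taste_floor for s in scores.taste_scores)
    # rubric_min is a placeholder threshold, not a published value.
    rubric_ok = scores.rubric_score >= rubric_min
    return runtime_ok and taste_ok and rubric_ok
```

An empty `validation_passed` list models Investigate-and-Fix tasks, which do not use the validation agent.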
Quality control
Task quality and trust in reward mechanisms are paramount to Senior SWE-Bench. All Senior SWE-Bench tasks undergo a rigorous, expert-validated quality control process, including:
Quality-driven task design and refinement
- Repo & task integrity: Source PRs are filtered for quality, recency, and the author's level of experience with the codebase.
- Automated evaluations: Each task is graded by dozens of high-precision LLM-based checks (tuned specifically for Senior SWE-Bench) and runtime correctness tests.
- Runtime correctness and reliability tests: Each task is executed multiple times with a no-op solution and an oracle solution to ensure that non-deterministic elements (e.g. validation agent, taste judge) score consistently for solutions with known outcomes. Additionally, this confirms correctness of all checks by their categorization as pass-to-pass (should always pass for both no-op and oracle) or fail-to-pass (should always fail with no-op and succeed with oracle).
- Reference solution analysis: Before a task is finalized, multiple agents from different providers attempt it. Their solutions are analyzed for evidence of false-positive or false-negative risks, such as being penalized for an implementation detail that doesn't impact quality. Tasks and reward mechanisms are tuned based on findings.
- Assessing solvability with guided variants: We create a guided variant of each task that differs only in the instruction. The instruction contains additional guidance such as suggested steps for diagnosis or design factors to consider, but does not reveal e.g. the exact root cause of a bug or the exact ideal design of a feature. The goal is to assess whether the task is solvable given sufficient guidance, or whether there are inherent issues that would prevent an agent from solving it. With the guided variant, the solve rates for Claude Opus 4.8 and GPT-5.5 were both 85-90%, and all but 8% of tasks were solved by at least one agent. Manual review of the remaining 8% was performed and found no fundamental blockers.
Multi-stage expert review
- Research team reviews: Contributing research team members review all tasks for overall design and implementation quality.
- SWE expert reviews: SWE experts from Snorkel AI's expert network review all aspects of each task along with example trajectories using an extensive rubric.
- Judge-expert calibration: Both the taste judge and validation agent judge were calibrated against SWE expert reviews, and the prompts underwent multiple tuning passes to increase alignment.
Next steps for Senior SWE-Bench
We're excited to share Senior SWE-Bench as a step forward in coding agent benchmarking: realistically under-specified instructions, adaptive runtime verification, scalable code quality scoring, and open-source long-horizon coding tasks based on real work from codebase experts. As new models are released, we'll add them to the leaderboard to see how they stack up. We also plan on adding more public and private tasks to the benchmark as models evolve and knowledge cutoffs move forward. If you have any feedback or ideas, let us know at seniorswebench@snorkel.ai.
Appendix
Comparison to other benchmarks
Several recent benchmarks make progress on behavioral testing and instruction realism. The following table provides a brief comparison.
| Benchmark | Task style and source | Instruction realism | Reward mechanisms | Open source |
|---|---|---|---|---|
| Senior SWE-Bench | Real-world PRs | High (natural language message) | | Yes |
| SWE-Bench Pro | Real-world PRs | Low (full specs) | | Yes |
| DeepSWE | New tasks in real repos | Mixed (some full specs) | | Yes |
| FrontierCode | New tasks in real repos | Unknown (examples are mixed) | | No |
| ProgramBench | Full program recreation | N/A | | No |
Source repositories
Senior SWE-Bench samples PRs from the repositories below, which span libraries to multi-service applications across languages, domains, and ages.
| Repository | Languages | Type | Description | LOC | Started | Stars |
|---|---|---|---|---|---|---|
| electric-sql/electric | Elixir, TypeScript | Service | Postgres real-time sync | 345k | 2022 | 10.2k |
| go-gitea/gitea | Go | Application | Self-hosted Git forge | 397k | 2016 | 56.3k |
| PostHog/posthog | Python, TypeScript | Application | Product analytics platform | 3.8M | 2020 | 35.1k |
| PrefectHQ/prefect | Python | Library | Workflow orchestration | 664k | 2018 | 22.6k |
| better-auth/better-auth | TypeScript | Library | Authentication framework | 289k | 2024 | 28.7k |
| gravitational/teleport | Go, TypeScript | Application | Infrastructure access platform | 2.8M | 2015 | 20.5k |
| vercel/turborepo | Rust, TypeScript | Tool | Monorepo build system | 215k | 2021 | 30.6k |
| plausible/analytics | Elixir | Application | Privacy-friendly web analytics | 228k | 2018 | 27.2k |
| firezone/firezone | Elixir, Rust | Application | Zero-trust access platform | 247k | 2020 | 8.7k |
| paperless-ngx/paperless-ngx | Python, TypeScript | Application | Document management system | 148k | 2022 | 42.2k |
| immich-app/immich | TypeScript | Application | Self-hosted photo backup | 542k | 2022 | 103.6k |
| harbor-framework/harbor | Python | Tool | Agent evaluation harness | 219k | 2025 | 2.5k |
Source repositories sampled by Senior SWE-Bench. Star counts and LOC (lines of source code, programming languages only) as of June 2026.