LLM agents have made rapid progress on software engineering and ML research tasks, but these advances often assume access to a working environment. For research artifacts released alongside published papers, setting up such an environment on a fresh machine remains a major bottleneck. Existing environment-setup benchmarks do not cover the full scope of research-artifact deployment, which involves multi-language toolchains, system-level dependencies beyond containers (e.g., GPU/CUDA and kernel configurations), and legacy artifact compatibility.
We introduce DeployBench, a multi-domain benchmark of 51 research-artifact deployment tasks drawn from papers published at top-tier venues between 2008 and 2025, spanning AI/ML (25), computer systems (19), and scientific computing (7). The tasks cover 11 programming-language ecosystems and include 22 GPU-dependent workloads, 5 tasks that compile and boot custom Linux kernels inside a QEMU virtual machine, and 10 legacy artifacts (2011–2018) that require compatibility repair against modern toolchains. By setup difficulty, the set splits into 12 easy, 20 medium, and 19 hard tasks.
For every task, the agent is given the paper, its code repository, and a fresh cloud VM with no pre-installed drivers; Docker is disallowed so that success reflects native dependency resolution. Each task is verified by a hidden pipeline that executes the paper's designated experiment and checks its outputs, and we confirm that all 51 tasks can be deployed by running the reference setup scripts. DeployBench highlights the gap between current agents and autonomous deployment, and offers a realistic testbed for scientific research agents.
After the agent finishes, reaches its step limit, or times out, we evaluate the final VM state with a hidden, task-specific verifier that has two layers. Layer 1 is a global rule-based log parser that catches common execution failures shared across tasks (missing dependencies, runtime exceptions, compilation errors, out-of-memory). If no Layer-1 failure is found, Layer 2 runs task-specific checks for expected outputs, runtime evidence, generated artifacts, or service availability, verifying that the paper-specific target actually produced the required result rather than merely reaching a superficially runnable state. A task is solved only if it passes both layers. For failed tasks, a separate LLM diagnostic judge attributes root causes; all success rates are computed solely from the deterministic verifier.
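The two-layer decision described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's verifier: the real rule set is hidden, so the regex patterns, the `Verdict` type, and the `task_check` callback are all hypothetical stand-ins.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical Layer-1 rules (the benchmark's actual patterns are hidden).
# Order matters: more specific failures are listed before the generic crash rule,
# because e.g. a missing Python module also prints a traceback.
LAYER1_RULES = [
    ("dependency_error", re.compile(r"ModuleNotFoundError|ImportError|command not found")),
    ("compilation_error", re.compile(r"fatal error: |undefined reference to")),
    ("out_of_memory", re.compile(r"CUDA out of memory|MemoryError")),
    ("runtime_crash", re.compile(r"Traceback \(most recent call last\)|Segmentation fault")),
]


@dataclass
class Verdict:
    solved: bool
    failed_layer: Optional[int]  # None when the task is solved
    reason: str


def layer1_global_check(log: str) -> Optional[str]:
    """Return the name of the first matching global failure rule, or None."""
    for name, pattern in LAYER1_RULES:
        if pattern.search(log):
            return name
    return None


def verify(log: str, task_check: Callable[[], bool]) -> Verdict:
    """A task is solved only if it passes Layer 1 and then Layer 2."""
    failure = layer1_global_check(log)
    if failure is not None:
        return Verdict(False, 1, failure)
    # Layer 2 runs only when the global check found nothing.
    if not task_check():
        return Verdict(False, 2, "task_specific_check_failed")
    return Verdict(True, None, "ok")
```

In this sketch, `task_check` stands for whatever a given task requires (an expected output file, a metric in range, a reachable service). A run whose log is clean but whose `task_check` returns `False` is reported as a Layer-2 failure, which is the "superficially runnable" case.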
Task success rate overall and by research domain, for four frontier LLMs under the OpenHands agent scaffold. Even the strongest model solves only about half of the tasks, and the gap between models shows that DeployBench separates agent capability under a fixed scaffold.
| Breakdown | Group | # Tasks | GPT-5.3-Codex | Gemini-3.1-Pro | Grok-4.20 | GPT-5.4-Mini |
|---|---|---|---|---|---|---|
| Overall | All tasks | 51 | 51.0% | 27.5% | 11.8% | 7.8% |
| Category | AI/ML | 25 | 52.0% | 16.0% | 8.0% | 0.0% |
| Category | Sci. Computing | 7 | 42.9% | 57.1% | 14.3% | 0.0% |
| Category | Systems | 19 | 52.6% | 31.6% | 15.8% | 21.1% |
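As a consistency check, the overall row should equal the task-weighted combination of the three category rows. The sketch below recovers solved-task counts by rounding `tasks × rate`; those counts are inferred from the reported percentages, not stated in the source, and the variable and function names are illustrative.

```python
# Per-category (number of tasks, success rate in %) copied from the table.
TASKS = {"AI/ML": 25, "Sci. Computing": 7, "Systems": 19}
RATES = {
    "GPT-5.3-Codex":  {"AI/ML": 52.0, "Sci. Computing": 42.9, "Systems": 52.6},
    "Gemini-3.1-Pro": {"AI/ML": 16.0, "Sci. Computing": 57.1, "Systems": 31.6},
    "Grok-4.20":      {"AI/ML": 8.0,  "Sci. Computing": 14.3, "Systems": 15.8},
    "GPT-5.4-Mini":   {"AI/ML": 0.0,  "Sci. Computing": 0.0,  "Systems": 21.1},
}


def solved_tasks(model: str) -> int:
    """Inferred number of solved tasks: round(tasks * rate) summed over categories."""
    return sum(round(TASKS[c] * RATES[model][c] / 100) for c in TASKS)


def overall_rate(model: str) -> float:
    """Overall success rate in percent, rounded to one decimal like the table."""
    return round(100 * solved_tasks(model) / sum(TASKS.values()), 1)


for model in RATES:
    print(f"{model}: {solved_tasks(model)}/51 solved, {overall_rate(model)}% overall")

# Total failed runs across the four models, assuming one run per model per task.
total_failed = sum(sum(TASKS.values()) - solved_tasks(m) for m in RATES)
```

Under these assumptions the recomputed overall rates match the "Overall" row for all four models.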
Across the 154 failed runs (out of 204 model-task runs), the global check (Layer 1) catches 46.1%; the remaining 53.9% reach a superficially runnable state but fail the task-specific verifier (Layer 2), so generic success signals would miss them entirely.
| Failure location | Count | Percentage |
|---|---|---|
| Layer 1 – Runtime crash | 39 | 25.3% |
| Layer 1 – Dependency error | 27 | 17.5% |
| Layer 1 – Compilation error | 5 | 3.2% |
| Layer 1: global check | 71 | 46.1% |
| Layer 2: task-specific check | 83 | 53.9% |
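The percentages in this table follow directly from the counts. A short sketch of the arithmetic, using only the counts reported above (the dictionary and function names are illustrative):

```python
# Failure counts copied from the table; the three Layer-1 rows sum to the global check.
LAYER1 = {"Runtime crash": 39, "Dependency error": 27, "Compilation error": 5}
LAYER2_COUNT = 83

layer1_count = sum(LAYER1.values())
total_failed = layer1_count + LAYER2_COUNT


def share(count: int) -> float:
    """Share of all failed runs, in percent, rounded to one decimal."""
    return round(100 * count / total_failed, 1)


for name, count in LAYER1.items():
    print(f"Layer 1, {name}: {count} ({share(count)}%)")
print(f"Layer 1 total: {layer1_count} ({share(layer1_count)}%)")
print(f"Layer 2 total: {LAYER2_COUNT} ({share(LAYER2_COUNT)}%)")
```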
Diagnostic-agent reports cluster failures into five recurring root-cause patterns: dependency & package resolution, GPU/CUDA setup, native-build & toolchain, artifact & output mismatch, and system- & VM-level execution. Tasks that all four models fail tend to compound multiple patterns at once.
@article{wang2026deploybench,
  author  = {Wang, Yuanli and Qian, Yaoyao and Zhang, Yue and Zhou, Hanhan and
             Huang, Jindan and Fu, Tianfu and Mang, Qiuyang and Mao, Huanzhi and
             Chai, Wenhao and Fan, Wendong and Jing, Liqiang},
  title   = {DeployBench: Benchmarking LLM Agents for Research Artifact Deployment},
  journal = {arXiv preprint arXiv:2606.05238},
  year    = {2026},
}