DeployBench: Benchmarking LLM Agents for Research Artifact Deployment

LLM agents have made rapid progress on software engineering and ML research tasks, but these advances often assume access to a working environment. For research artifacts released alongside published papers, setting up such an environment on a fresh machine remains a major bottleneck. Existing environment-setup benchmarks do not cover the full scope of research-artifact deployment, which involves multi-language toolchains, system-level dependencies beyond containers (e.g., GPU/CUDA and kernel configurations), and compatibility repair for legacy artifacts.

We introduce DeployBench, a multi-domain benchmark of 51 research-artifact deployment tasks drawn from papers published at top-tier venues between 2008 and 2025, spanning AI/ML (25), computer systems (19), and scientific computing (7). The tasks cover 11 programming-language ecosystems and include 22 GPU-dependent workloads, 5 tasks that compile and boot custom Linux kernels inside a QEMU virtual machine, and 10 legacy artifacts (2011–2018) that require compatibility repair against modern toolchains. By setup difficulty, the set splits into 12 easy, 20 medium, and 19 hard tasks.

For every task, the agent is given the paper, its code repository, and a fresh cloud VM with no pre-installed drivers; Docker is disallowed so that success reflects native dependency resolution. Each task is verified by a hidden pipeline that executes the paper's designated experiment and checks its outputs, and we confirm that all 51 tasks can be deployed by running the reference setup scripts. DeployBench highlights the gap between current agents and autonomous deployment, and offers a realistic testbed for scientific research agents.

DeployBench overview: a paper artifact (paper + code repo) becomes a DeployBench task; a terminal agent sets up the environment on a clean Linux VM, and a hidden, task-specific verifier script then checks the final VM state.

Evaluation Pipeline

After the agent finishes, reaches its step limit, or times out, we evaluate the final VM state with a hidden, task-specific verifier that has two layers. Layer 1 is a global rule-based log parser that catches common execution failures shared across tasks (missing dependencies, runtime exceptions, compilation errors, and out-of-memory errors). If no Layer-1 failure is found, Layer 2 runs task-specific checks for expected outputs, runtime evidence, generated artifacts, or service availability, verifying that the paper-specific target actually produced the required result rather than merely reaching a superficially runnable state. A task is solved only if it passes both layers. For failed tasks, a separate LLM diagnostic judge attributes root causes, but all success rates are computed solely from the deterministic verifier.
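The two-layer decision described above can be sketched roughly as follows. This is only an illustration of the control flow, not the benchmark's verifier: the regular expressions, the `Verdict` type, and the `task_check` callback are all hypothetical, since the actual Layer-1 rules and Layer-2 checks are hidden.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical Layer-1 rules; the real parser's patterns are not published here.
# Order matters: a missing-module log usually also contains a traceback,
# so the more specific dependency rule is tried before the generic one.
LAYER1_PATTERNS = {
    "dependency_error": re.compile(r"ModuleNotFoundError|ImportError|command not found"),
    "compilation_error": re.compile(r"fatal error:|undefined reference to"),
    "out_of_memory": re.compile(r"CUDA out of memory|MemoryError"),
    "runtime_exception": re.compile(r"Traceback \(most recent call last\)|Segmentation fault"),
}


@dataclass
class Verdict:
    solved: bool
    layer: Optional[int]  # layer that rejected the run, or None if solved
    reason: str


def layer1_scan(log: str) -> Optional[str]:
    """Return the first generic failure label found in the log, if any."""
    for label, pattern in LAYER1_PATTERNS.items():
        if pattern.search(log):
            return label
    return None


def verify(log: str, task_check: Callable[[], bool]) -> Verdict:
    """A run is solved only if it passes Layer 1 and then Layer 2."""
    failure = layer1_scan(log)
    if failure is not None:
        return Verdict(False, 1, failure)
    if not task_check():  # task-specific check, e.g. an expected output exists
        return Verdict(False, 2, "task_specific_check_failed")
    return Verdict(True, None, "passed")
```

A log with no generic error signature still fails if `task_check` returns `False`, which is the case that generic success signals would miss.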

Key Findings

Even the strongest agent deploys only about half of the artifacts. Pass rates span 7.8%–51.0%: GPT-5.3-Codex 51.0%, Gemini-3.1-Pro 27.5%, Grok-4.20 11.8%, and GPT-5.4-Mini 7.8%. End-to-end deployment from raw infrastructure remains hard.
Failure is mostly a completion-judgment problem. 97 of 154 failed runs (63.0%) are agent-declared self-stops: the agent decides it is done after validating a weaker or different target than the paper-specific task actually requires.
Generic checks miss most failures. A global rule-based parser catches only 46.1% of failures; the remaining 53.9% pass generic checks and are caught only by per-task verification, motivating DeployBench's two-layer verifier.
More tokens don't buy success. Gemini-3.1-Pro burns the most tokens (2.55M/task on average) yet ranks second, and within every model, failed runs consume more tokens than solved ones.
Legacy code rots. Legacy repositories (last updated before 2020) need compatibility repair to deploy on a modern OS and hardware far more often than recent ones: repair was required in 5 of 10 legacy tasks (50%) versus only 3 of 41 recent ones (about 7%), and every model scores lower on legacy tasks than on recent ones.

Results

Task success rate overall and by research domain, for four frontier LLMs under the OpenHands agent scaffold. Even the strongest model solves only about half of the tasks, and the gap between models shows that DeployBench separates agent capability under a fixed scaffold.

Breakdown	Group	# Tasks	GPT-5.3-Codex	Gemini-3.1-Pro	Grok-4.20	GPT-5.4-Mini
Overall	All tasks	51	51.0%	27.5%	11.8%	7.8%
Category	AI/ML	25	52.0%	16.0%	8.0%	0.0%
Category	Sci. Computing	7	42.9%	57.1%	14.3%	0.0%
Category	Systems	19	52.6%	31.6%	15.8%	21.1%
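As a sanity check on the table, the percentages can be converted back into solved-task counts. The sketch below assumes each rate is simply solved tasks divided by tasks in the group, rounded to one decimal; the counts it produces are inferred from the table rather than reported in the text, and the function name is illustrative.

```python
# Task counts per domain and pass rates (in percent), copied from the table above.
TASKS = {"AI/ML": 25, "Sci. Computing": 7, "Systems": 19}
RATES = {
    "GPT-5.3-Codex": {"Overall": 51.0, "AI/ML": 52.0, "Sci. Computing": 42.9, "Systems": 52.6},
    "Gemini-3.1-Pro": {"Overall": 27.5, "AI/ML": 16.0, "Sci. Computing": 57.1, "Systems": 31.6},
    "Grok-4.20": {"Overall": 11.8, "AI/ML": 8.0, "Sci. Computing": 14.3, "Systems": 15.8},
    "GPT-5.4-Mini": {"Overall": 7.8, "AI/ML": 0.0, "Sci. Computing": 0.0, "Systems": 21.1},
}


def reconstruct(rates, tasks_per_domain):
    """Infer solved-task counts from rounded percentages (an assumption)."""
    n_all = sum(tasks_per_domain.values())
    out = {}
    for model, r in rates.items():
        by_domain = {d: round(r[d] * n / 100) for d, n in tasks_per_domain.items()}
        overall = round(r["Overall"] * n_all / 100)
        out[model] = {
            "overall": overall,
            "by_domain": by_domain,
            # True when the per-domain counts add up to the overall count.
            "consistent": overall == sum(by_domain.values()),
        }
    return out


counts = reconstruct(RATES, TASKS)
n_tasks = sum(TASKS.values())
failed_runs = sum(n_tasks - c["overall"] for c in counts.values())

for model, c in counts.items():
    print(model, c["overall"], c["by_domain"], c["consistent"])
print("failed runs:", failed_runs)
```

Under these assumptions the inferred overall counts are 26, 14, 6, and 4 solved tasks, the per-domain counts add up to them for every model, and the failed runs total 154, which agrees with the figure used in the failure analysis.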

Where do they fail?

Across the 154 failed runs, the global check (Layer 1) catches 46.1%; the remaining 53.9% reach a superficially runnable state but fail the task-specific verifier (Layer 2) — they would be missed entirely by generic success signals.

Failure location	Count	Percentage
Layer 1: runtime crash	39	25.3%
Layer 1: dependency error	27	17.5%
Layer 1: compilation error	5	3.2%
Layer 1 total (global check)	71	46.1%
Layer 2 total (task-specific check)	83	53.9%
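The percentages in this table follow directly from the counts. The short sketch below recomputes them over the 154 failed runs; the dictionary keys and the `shares` helper are illustrative names, not part of the benchmark.

```python
# Failure counts copied from the table above (154 failed runs in total).
FAILURES = {
    "layer1_runtime_crash": 39,
    "layer1_dependency_error": 27,
    "layer1_compilation_error": 5,
    "layer2_task_specific": 83,
}


def shares(counts):
    """Percentage of all failed runs per category, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


per_category = shares(FAILURES)

# Layer-1 subtotal computed from the raw counts, not from rounded shares.
layer1_count = sum(n for name, n in FAILURES.items() if name.startswith("layer1_"))
layer1_share = round(100 * layer1_count / sum(FAILURES.values()), 1)

print(per_category)
print("Layer 1 subtotal:", layer1_count, layer1_share)
```

The three Layer-1 rows sum to 46.0% when their rounded values are added, while the subtotal computed from the counts (71 of 154) rounds to 46.1%; the difference is a rounding effect only.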

Reports from the LLM diagnostic judge cluster failures into five recurring root-cause patterns: dependency & package resolution, GPU/CUDA setup, native-build & toolchain, artifact & output mismatch, and system- & VM-level execution. Tasks that all four models fail tend to compound multiple patterns at once.

BibTeX

@article{wang2026deploybench,
  author    = {Wang, Yuanli and Qian, Yaoyao and Zhang, Yue and Zhou, Hanhan and
               Huang, Jindan and Fu, Tianfu and Mang, Qiuyang and Mao, Huanzhi and
               Chai, Wenhao and Fan, Wendong and Jing, Liqiang},
  title     = {DeployBench: Benchmarking LLM Agents for Research Artifact Deployment},
  journal   = {arXiv preprint arXiv:2606.05238},
  year      = {2026},
}