SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Research Background

Core Idea


| Field | Type | Description | Length / Notes |
| --- | --- | --- | --- |
| repo | string | Repository identifier | one of 11 repository classes |
| instance_id | string | Unique identifier for each instance | 65–120 characters |
| base_commit | string | Git commit hash of the base version | 40 characters |
| patch | string | The golden code patch / diff | 1.44k – 180k characters |
| test_patch | string | Test cases related to the patch | 325 – 322k characters |
| problem_statement | string | Description of the issue being addressed | 419 – 8.04k characters |
| requirements | string | Project requirements or dependencies | 124 – 6.7k characters (nullable) |
| interface | string | API or interface specifications | 1 – 12.2k characters (nullable) |
| repo_language | string | Programming language of the repository | one of 4 language classes |
| fail_to_pass | string | Test cases that should pass after patch application | 10 – 155k characters |
| pass_to_pass | string | Test cases that should continue passing | 2 – 532k characters |
| issue_specificity | string | Specificity of the issue | 12 – 77 characters |
| issue_categories | string | Categories or tags for the issue type | |
| before_repo_set_cmd | string | Repo setup command for testing | |
| selected_test_files_to_run | string | Files selected for testing | |
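
The `fail_to_pass` and `pass_to_pass` fields are typed as strings even though they describe sets of test cases, so a consumer has to decode them before use. The sketch below is one possible way to do that, under two assumptions that the table does not confirm: that each field holds a serialized list of test names (as JSON or as a Python literal; the 2-character minimum of `pass_to_pass` is at least consistent with an empty list `[]`), and that an instance counts as resolved only when every test in both lists passes. The helper names `parse_test_list` and `is_resolved` are hypothetical.

```python
import ast
import json
from typing import List, Set


def parse_test_list(raw: str) -> List[str]:
    """Decode a serialized list of test names (the encoding is an assumption)."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to a Python-literal list such as "['a', 'b']".
        value = ast.literal_eval(raw)
    if not isinstance(value, list):
        raise ValueError(f"expected a list, got {type(value).__name__}")
    return [str(item) for item in value]


def is_resolved(instance: dict, passed_tests: Set[str]) -> bool:
    """Assumed criterion: all fail_to_pass and pass_to_pass tests must pass."""
    required = (parse_test_list(instance["fail_to_pass"])
                + parse_test_list(instance["pass_to_pass"]))
    return all(name in passed_tests for name in required)
```

For example, with a hand-made instance `{"fail_to_pass": '["t1", "t2"]', "pass_to_pass": "['t3']"}`, `is_resolved` returns `True` only when `t1`, `t2`, and `t3` are all in the set of passed tests.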

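Note: Because SWE-Bench Pro tasks are considerably more complex, the authors provide the `requirements` and `interface` fields as additional information to resolve ambiguity.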
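
To illustrate how these two optional fields might be used, here is a minimal sketch that assembles a task description from `problem_statement`, `requirements`, and `interface`, skipping whichever of the nullable fields is missing. The function name `build_task_prompt` and the section layout are hypothetical and are not taken from the paper or its released code.

```python
from typing import Optional


def build_task_prompt(problem_statement: str,
                      requirements: Optional[str] = None,
                      interface: Optional[str] = None) -> str:
    """Join the available fields into one prompt (layout is an assumption)."""
    sections = [
        ("Problem statement", problem_statement),
        ("Requirements", requirements),
        ("Interface", interface),
    ]
    parts = []
    for title, body in sections:
        # requirements and interface are nullable, so skip empty entries.
        if body is None or not body.strip():
            continue
        parts.append(f"{title}:\n{body.strip()}")
    return "\n\n".join(parts)
```

Calling `build_task_prompt("Fix crash", None, "def f(x): ...")` yields a prompt with only the problem statement and interface sections, so an instance without `requirements` still produces a well-formed task description.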

References

  1. Deng X., Da J., Pan E., He Y. Y., Ide C., Garg K., Lauffer N., Park A., Pasari N., Rane C., Sampath K., Krishnan M., Kundurthy S., Hendryx S., Wang Z., Bharadwaj V., Holm J., Aluri R., Zhang B., Liu N., and Kenstler B. SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? arXiv, 2025.