SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Preliminaries

Core Idea


| Field | Type | Description | Length / Notes |
| --- | --- | --- | --- |
| repo | string | Repository identifier | One of 11 repository classes |
| instance_id | string | Unique identifier for each instance | 65–120 characters |
| base_commit | string | Git commit hash of the base version | 40 characters |
| patch | string | The golden code patch / diff | 1.44k – 180k characters |
| test_patch | string | Test cases related to the patch | 325 – 322k characters |
| problem_statement | string | Description of the issue being addressed | 419 – 8.04k characters |
| requirements | string | Project requirements or dependencies | 124 – 6.7k characters (nullable) |
| interface | string | API or interface specifications | 1 – 12.2k characters (nullable) |
| repo_language | string | Programming language of the repository | One of 4 language classes |
| fail_to_pass | string | Test cases that should pass after patch application | 10 – 155k characters |
| pass_to_pass | string | Test cases that should continue passing | 2 – 532k characters |
| issue_specificity | string | Specificity of the issue | 12 – 77 characters |
| issue_categories | string | Categories or tags for the issue type | |
| before_repo_set_cmd | string | Repo setup command for testing | |
| selected_test_files_to_run | string | Files selected for testing | |
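
The roles of fail_to_pass and pass_to_pass in the table can be illustrated with a minimal sketch of a resolution check. It assumes both fields have already been parsed from their string form into lists of test names (the serialization format is not specified here), and that the test run yields a set of passed test names. The function name `is_resolved` and its parameters are hypothetical, not part of the benchmark's tooling.

```python
from typing import Iterable


def is_resolved(
    passed_tests: Iterable[str],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """Return True if a candidate patch fixes the issue without regressions.

    Assumption: every argument is an iterable of test names that has
    already been parsed out of the dataset's string fields.
    """
    passed = set(passed_tests)
    # Tests that failed before the patch must pass now.
    fixed = all(test in passed for test in fail_to_pass)
    # Tests that passed before the patch must still pass.
    unbroken = all(test in passed for test in pass_to_pass)
    return fixed and unbroken


if __name__ == "__main__":
    # Hypothetical test names, for illustration only.
    print(is_resolved({"test_a", "test_b"}, ["test_a"], ["test_b"]))
    print(is_resolved({"test_a"}, ["test_a"], ["test_b"]))
```

Under these assumptions, an instance counts as solved only when both conditions hold, so a patch that fixes the target tests but breaks an existing one is rejected.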

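Note: Because SWE-Bench Pro tasks are considerably more complex, the authors provide the requirements and interface fields as additional information to resolve ambiguity in the task.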
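
As a rough sketch of how these extra fields might be used, the snippet below assembles a task prompt from one dataset record, appending requirements and interface only when they are present, since the schema marks both as nullable. The function name `build_task_prompt`, the section titles, and the sample record are hypothetical and only illustrate the idea; this is not the authors' prompt format.

```python
from typing import Any, Mapping


def build_task_prompt(instance: Mapping[str, Any]) -> str:
    """Combine the issue text with the optional disambiguating fields.

    Assumption: `instance` is a single record with the field names
    listed in the schema; `requirements` and `interface` may be None.
    """
    sections = [("Problem statement", instance["problem_statement"])]
    for title, key in (("Requirements", "requirements"), ("Interface", "interface")):
        value = instance.get(key)
        if value:  # skip missing, None, or empty values
            sections.append((title, value))
    return "\n\n".join(f"{title}:\n{body.strip()}" for title, body in sections)


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    example = {
        "problem_statement": "Search results ignore the selected locale.",
        "requirements": "Results must be filtered by the active locale.",
        "interface": None,
    }
    print(build_task_prompt(example))
```

Skipping empty fields keeps the prompt free of blank sections for instances where no extra specification was provided.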

References

  1. Deng X., Da J., Pan E., He Y. Y., Ide C., Garg K., Lauffer N., Park A., Pasari N., Rane C., Sampath K., Krishnan M., Kundurthy S., Hendryx S., Wang Z., Bharadwaj V., Holm J., Aluri R., Zhang B., Liu N., and Kenstler B. SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? arXiv, 2025. [PDF] [Code]