IDE ARENA: EVALUATING AI AGENTS ON SOFTWARE ENGINEERING TASKS

Your IDE can write code. Can it think like a developer?

IDE ARENA LEADERBOARD

[Leaderboard chart: agent scores on a 0 to 50 scale for Cursor, Windsurf, and Kiro]

ABOUT

IDE Arena is the first comprehensive benchmark designed to evaluate AI agents in the environment where developers actually use them: inside the IDE.

We don't just test code generation. We test the complete workflow:
  • Can the agent explore and understand an unfamiliar codebase?
  • Does it choose the right tools at the right time?
  • Can it reason across 10+ interdependent files?
  • Does it write code that actually passes tests?

Every task in IDE Arena mirrors real software engineering work: feature implementation, bug fixing, refactoring, and performance optimization, all across production-grade stacks like FastAPI, Django, Flask, and MERN.
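
To make that concrete, here is a minimal sketch of what a task manifest and its pass/fail grader could look like. Everything in it is hypothetical: the field names, the grade helper, and the example repository are not the actual IDE Arena schema. The only assumption taken from the text above is that a task pairs a category, a stack, and a test suite whose result decides success.

  # Hypothetical sketch only; not the real IDE Arena task schema.
  from dataclasses import dataclass
  import subprocess

  @dataclass
  class Task:
      repo: str      # repository the agent works inside
      category: str  # "feature" | "bugfix" | "refactor" | "perf"
      stack: str     # e.g. "fastapi", "django", "flask", "mern"
      prompt: str    # natural-language description handed to the agent
      test_cmd: str  # command whose exit code decides pass/fail

  def grade(task: Task, workdir: str) -> bool:
      """Pass iff the task's test suite exits 0 in the agent-modified checkout."""
      result = subprocess.run(task.test_cmd, shell=True, cwd=workdir)
      return result.returncode == 0

  example = Task(
      repo="https://github.com/example/todo-api",  # placeholder repository
      category="bugfix",
      stack="fastapi",
      prompt="Fix the off-by-one error in paginated /items responses.",
      test_cmd="pytest tests/test_pagination.py -q",
  )

Grading on the test suite's exit code, rather than on diff similarity, is what lets a benchmark like this reward code that actually works instead of code that merely looks right.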

ACCESS THE FULL DATASET
