ABAP Code Generation Benchmark

ABAP LLM Model Leaderboard

This dashboard benchmarks LLMs specifically on SAP ABAP code generation across 180 tasks, with 10 repetitions per task and up to 5 feedback iterations. The baseline methodology comes from the original paper: Benchmarking Large Language Models for ABAP Code Generation (2601.15188).

Read the blog posts for background, cost analysis, and practical recommendations: Benchmarking LLMs for ABAP: Why ABAP-1 Isn't a Code Generator (Yet) and SAP's ABAP-1 Loses Every ABAP Benchmark, Even "Explaining".

ABAP Code Generation — Main Comparison Table

Only fully evaluated models are shown (all 6 rounds, R0–R5, tested). Sort by any column and filter by model name.

  • Code Gen (after R5): % of the 1,800 runs passing after up to 5 feedback rounds (the headline metric).
  • Code Gen (1st attempt): the same, but on the very first try with no feedback.
  • Understanding (after R5): score on the separate ABAP understanding benchmark (180 tasks × 3 reps).
  • AUC R0–R5: area under the cumulative-success curve; higher means faster improvement with feedback.
  • R0 Code Compiles (%): % of first attempts where the code compiled and activated in SAP, regardless of test outcome.
  • pass@5: probability that at least 1 of 5 randomly drawn runs passes (the standard HumanEval metric; see the estimator sketch after this table).
  • Tasks Solved (≥1/10): % of the 180 tasks where at least one run eventually succeeded.
  • Tasks Solved (10/10): % of the 180 tasks where every run succeeded (maximum consistency).
Sortable main comparison table of evaluated ABAP LLM models.
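
The pass@5 and the two Tasks Solved columns can all be derived from per-task pass counts. A minimal sketch in Python, using the unbiased pass@k estimator from the HumanEval paper (the pass counts below are hypothetical):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (HumanEval): probability that at
    least one of k runs drawn without replacement from n total runs,
    of which c passed, is a passing run."""
    if n - c < k:
        return 1.0  # every draw of k runs must contain a passing one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical pass counts out of 10 repetitions for five tasks.
passes_per_task = [10, 7, 0, 3, 10]
n_tasks = len(passes_per_task)

pass_at_5  = sum(pass_at_k(10, c, 5) for c in passes_per_task) / n_tasks
solved_any = sum(c >= 1 for c in passes_per_task) / n_tasks   # Tasks Solved (>=1/10)
solved_all = sum(c == 10 for c in passes_per_task) / n_tasks  # Tasks Solved (10/10)
```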

ABAP Code Understanding Benchmark

A separate track testing how well models understand existing ABAP code and unit tests, rather than generate code. Each model answered 180 structured questions (3 repetitions, up to 6 rounds in total: R0 plus 5 feedback rounds). Answers are scored on 6 objective fields extracted from the canonical code and unit tests.

  • R0 (1st attempt): % of runs scored correctly on the first try, with no feedback given.
  • R1–R5 (+N feedback): cumulative % correct after N rounds of automated correction feedback.
  • AUC R0–R5: area under the cumulative-success curve; higher means faster improvement with feedback (a computation sketch follows this table).
Understanding benchmark cumulative success rates by feedback round (R0 to R5).
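
The exact AUC convention is not stated on the dashboard; a minimal sketch, assuming a trapezoidal area under the cumulative-success curve over rounds R0–R5, normalized so a model at 100% from R0 scores 1.0:

```python
def auc_r0_r5(cumulative: list[float]) -> float:
    """Trapezoidal area under the cumulative-success curve at rounds
    R0..R5, normalized by the maximum possible area. Both the
    trapezoid rule and this normalization are assumptions about the
    dashboard's convention."""
    span = len(cumulative) - 1  # 5 round-to-round intervals
    area = sum((cumulative[i] + cumulative[i + 1]) / 2 for i in range(span))
    return area / span

# Hypothetical cumulative pass rates after R0..R5.
print(auc_r0_r5([0.40, 0.55, 0.62, 0.66, 0.68, 0.69]))  # ~0.611
```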

Methodology Notes

What Is Reused From The Paper

  • Success progression by feedback round (R0 to R5).
  • Success rates by benchmark task category.
  • Error-stage perspective based on SAP validation order.

What Is Extended Here

  • Sortable developer-facing leaderboard table.
  • pass@5, AUC over rounds, and prompt consistency metrics.
  • Completeness signal (`Max rounds tested`) for fair comparison.

Cumulative Success By Feedback Round

Cumulative % of the 1,800 runs passing up to and including each round. R0 = first attempt with no feedback · R1–R5 = after receiving SAP compiler and unit-test errors. A minimal sketch of this computation follows the chart.

Cumulative success rates by feedback round (R0 to R5).
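
The cumulative curve can be built from per-run outcomes; a minimal sketch, where the data and the `first_pass_round` encoding are hypothetical:

```python
from collections import Counter

# Round at which each run first passed; None = never passed (hypothetical).
first_pass_round = [0, 0, 1, None, 2, 0, 5, None, 1, 3]

counts = Counter(r for r in first_pass_round if r is not None)
cumulative, passed = [], 0
for rnd in range(6):  # R0..R5
    passed += counts[rnd]
    cumulative.append(passed / len(first_pass_round))

print(cumulative)  # [0.3, 0.5, 0.6, 0.7, 0.7, 0.8]
```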

Success By Task Category

Final success rate after R5, broken down by the five benchmark task categories.

Final success rates by benchmark task category.
