ABAP Code Generation Benchmark

ABAP LLM Model Leaderboard

This dashboard benchmarks LLMs specifically on SAP ABAP code generation across 180 tasks, with 10 repetitions per task and up to 5 feedback iterations. The baseline methodology comes from the original paper: Benchmarking Large Language Models for ABAP Code Generation (2601.15188).

Read the blog posts for background, cost analysis, and practical recommendations: Benchmarking LLMs for ABAP: Why ABAP-1 Isn't a Code Generator (Yet) and SAP's ABAP-1 Loses Every ABAP Benchmark, Even "Explaining".

ABAP Code Generation — Main Comparison Table

Only fully evaluated models are shown (all 6 rounds, R0–R5, tested). Sort by any column and filter by model name.

  • Code Gen (after R5): % of the 1,800 runs passing after up to 5 feedback rounds (the headline metric).
  • Code Gen (1st attempt): the same, but on the very first try with no feedback.
  • Understanding (after R5): score on the separate ABAP understanding benchmark (180 tasks × 3 reps).
  • AUC R0–R5: area under the cumulative-success curve; higher means faster improvement with feedback.
  • R0 Code Compiles (%): % of first attempts where the code compiled and activated in SAP, regardless of test outcome.
  • pass@5: probability that at least 1 of 5 randomly drawn runs passes (the standard HumanEval metric; see the estimator sketch after this table).
  • Tasks Solved (≥1/10): % of the 180 tasks where at least one run eventually succeeded.
  • Tasks Solved (10/10): % of the 180 tasks where every run succeeded (maximum consistency).
Sortable main comparison table of evaluated ABAP LLM models.
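
The pass@5 and the two Tasks Solved columns can all be derived from per-task pass counts. A minimal sketch in Python, using the unbiased pass@k estimator from the HumanEval paper (the pass counts below are hypothetical):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (HumanEval): probability that at
    least one of k runs drawn without replacement from n total runs,
    of which c passed, is a passing run."""
    if n - c < k:
        return 1.0  # every draw of k runs must contain a passing one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical pass counts out of 10 repetitions for five tasks.
passes_per_task = [10, 7, 0, 3, 10]
n_tasks = len(passes_per_task)

pass_at_5  = sum(pass_at_k(10, c, 5) for c in passes_per_task) / n_tasks
solved_any = sum(c >= 1 for c in passes_per_task) / n_tasks   # Tasks Solved (>=1/10)
solved_all = sum(c == 10 for c in passes_per_task) / n_tasks  # Tasks Solved (10/10)
```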

ABAP Code Understanding Benchmark

A separate track testing how well models understand existing ABAP code and unit tests, rather than generate code. Each model answered 180 structured questions (3 repetitions, up to 6 rounds in total: R0 plus 5 feedback rounds). Answers are scored on 6 objective fields extracted from the canonical code and unit tests.

  • R0 (1st attempt): % of runs scored correctly on the first try, with no feedback given.
  • R1–R5 (+N feedback): cumulative % correct after N rounds of automated correction feedback.
  • AUC R0–R5: area under the cumulative-success curve; higher means faster improvement with feedback (a computation sketch follows this table).
Understanding benchmark cumulative success rates by feedback round (R0 to R5).
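
The exact AUC convention is not stated on the dashboard; a minimal sketch, assuming a trapezoidal area under the cumulative-success curve over rounds R0–R5, normalized so a model at 100% from R0 scores 1.0:

```python
def auc_r0_r5(cumulative: list[float]) -> float:
    """Trapezoidal area under the cumulative-success curve at rounds
    R0..R5, normalized by the maximum possible area. Both the
    trapezoid rule and this normalization are assumptions about the
    dashboard's convention."""
    span = len(cumulative) - 1  # 5 round-to-round intervals
    area = sum((cumulative[i] + cumulative[i + 1]) / 2 for i in range(span))
    return area / span

# Hypothetical cumulative pass rates after R0..R5.
print(auc_r0_r5([0.40, 0.55, 0.62, 0.66, 0.68, 0.69]))  # ~0.611
```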

Methodology Notes

What Is Reused From The Paper

  • Success progression by feedback round (R0 to R5).
  • Success rates by benchmark task category.
  • Error-stage perspective based on SAP validation order.

What Is Extended Here

  • Sortable developer-facing leaderboard table.
  • pass@5, AUC over rounds, and prompt consistency metrics.
  • Completeness signal (`Max rounds tested`) for fair comparison.

Cumulative Success By Feedback Round

Cumulative % of the 1,800 runs passing up to and including each round. R0 = first attempt with no feedback · R1–R5 = after receiving SAP compiler and unit-test errors. A minimal sketch of this computation follows the chart.

Cumulative success rates by feedback round (R0 to R5).
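
The cumulative curve can be built from per-run outcomes; a minimal sketch, where the data and the `first_pass_round` encoding are hypothetical:

```python
from collections import Counter

# Round at which each run first passed; None = never passed (hypothetical).
first_pass_round = [0, 0, 1, None, 2, 0, 5, None, 1, 3]

counts = Counter(r for r in first_pass_round if r is not None)
cumulative, passed = [], 0
for rnd in range(6):  # R0..R5
    passed += counts[rnd]
    cumulative.append(passed / len(first_pass_round))

print(cumulative)  # [0.3, 0.5, 0.6, 0.7, 0.7, 0.8]
```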

Success By Task Category

Final success rate after R5, broken down by the five benchmark task categories.

Final success rates by benchmark task category.
