What Is Reused From The Paper
- Success progression by feedback round (R0 to R5).
- Success rates by benchmark task category.
- Error-stage perspective based on SAP validation order.
ABAP Code Generation Benchmark
SAP ABAP LLM MODEL TESTING
This dashboard benchmarks LLMs specifically on SAP ABAP code generation across 180 tasks with 10 repetitions per task and up to 5 feedback iterations. The baseline methodology follows the original paper, Benchmarking Large Language Models for ABAP Code Generation (2601.15188).
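The retry protocol above (one initial attempt plus up to 5 feedback rounds) can be sketched roughly as follows. This is an illustrative outline only, not the paper's implementation; `generate_code` and `run_sap_checks` are hypothetical placeholders for the model call and the SAP compiler/unit-test validation.

```python
# Hypothetical sketch of one benchmark run: a task gets an initial attempt
# (R0) and up to 5 retries (R1-R5), each retry fed the previous errors.
MAX_FEEDBACK_ROUNDS = 5

def run_task(task, generate_code, run_sap_checks):
    """Return the round (0-5) at which the run first passed, or None."""
    feedback = None
    for round_no in range(MAX_FEEDBACK_ROUNDS + 1):  # R0..R5
        code = generate_code(task, feedback)          # model call (placeholder)
        passed, feedback = run_sap_checks(code)       # compiler + unit tests (placeholder)
        if passed:
            return round_no
    return None  # never passed within the feedback budget
```

A run's result is thus a single number (the first passing round) or `None`, which is what the round-by-round charts aggregate.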
Read the blog posts for background, cost analysis, and practical recommendations: Benchmarking LLMs for ABAP: Why ABAP-1 Isn't a Code Generator (Yet) and SAP's ABAP-1 Loses Every ABAP Benchmark, Even "Explaining".
Only fully evaluated models are shown (all 6 rounds tested). Sort by any column and filter by model name.
A separate track that tests how well models understand existing ABAP code and unit tests, not generation. Each model answered 180 structured questions (3 repetitions, up to 6 feedback rounds), scored on 6 objective fields extracted from the canonical code and unit tests.
Cumulative % of 1,800 runs passing up to and including each round. R0 = first attempt with no feedback · R1–R5 = after receiving SAP compiler & unit test errors.
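The cumulative metric described above can be computed like this. A minimal sketch under an assumed data shape (one "first passing round" value per run, `None` for runs that never passed); the dashboard's actual code may differ.

```python
# Cumulative pass rate per round: the % of runs that passed at or before
# each round R0..R5. Input: first passing round per run, or None.
def cumulative_pass_rates(first_pass_rounds, max_round=5):
    total = len(first_pass_rounds)
    rates = []
    passed = 0
    for r in range(max_round + 1):
        passed += sum(1 for fp in first_pass_rounds if fp == r)
        rates.append(100.0 * passed / total)
    return rates  # rates[0] = R0 success, rates[max_round] = final success
```

For example, `cumulative_pass_rates([0, 0, 1, None])` yields `[50.0, 75.0, 75.0, 75.0, 75.0, 75.0]`: half the runs pass on the first attempt, one more after a single feedback round, and one never passes.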
Final success rate after R5, broken down by the five benchmark task categories.
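The per-category breakdown is a straightforward grouped aggregation. A sketch assuming each run is reduced to a `(category, passed)` pair after R5; the field names are illustrative, not the dashboard's schema.

```python
# Final success rate per task category: % of runs in each category
# that passed by the end of R5.
from collections import defaultdict

def success_by_category(runs):
    """runs: iterable of (category, passed) pairs -> {category: pass %}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in runs:
        totals[category] += 1
        if passed:
            passes[category] += 1
    return {c: 100.0 * passes[c] / totals[c] for c in totals}
```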