ABAP Code Generation Benchmark

ABAP LLM Model Leaderboard

This dashboard benchmarks LLMs specifically on SAP ABAP code generation across 180 tasks, with 10 repetitions per task and up to 5 feedback iterations per repetition. The baseline methodology follows the original paper, Benchmarking Large Language Models for ABAP Code Generation (2601.15188).
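To make the setup concrete, here is a minimal sketch of the evaluation loop this implies; `generate_abap` and `run_sap_checks` are hypothetical stand-ins for the model call and the SAP-side validation, not names from the benchmark code.

```python
# Sketch of the benchmark loop: each task is attempted 10 times, and each
# attempt gets the initial round R0 plus up to 5 feedback rounds (R1-R5).
N_REPETITIONS = 10
MAX_FEEDBACK_ROUNDS = 5

def evaluate_task(task, generate_abap, run_sap_checks):
    """Return, per repetition, the first round (0-5) that passed, or None."""
    first_pass_rounds = []
    for _ in range(N_REPETITIONS):
        feedback = None
        passed_at = None
        for round_idx in range(MAX_FEEDBACK_ROUNDS + 1):  # R0 .. R5
            code = generate_abap(task, feedback)   # model call (assumed interface)
            result = run_sap_checks(code)          # SAP validation (assumed interface)
            if result["success"]:
                passed_at = round_idx
                break
            feedback = result["error_message"]     # error fed back into the next round
        first_pass_rounds.append(passed_at)
    return first_pass_rounds
```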

Main Comparison Table

Only fully evaluated models are shown (Max rounds tested = 6, i.e. the initial attempt R0 plus all 5 feedback rounds). The table can be sorted by any column and filtered by model name.
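A pandas sketch of the completeness filter and sorting; the file name and column names are assumptions, not the dashboard's actual schema.

```python
import pandas as pd

# Illustrative only: "leaderboard.csv" and these column names are assumed.
leaderboard = pd.read_csv("leaderboard.csv")

# Keep only fully evaluated models, i.e. all rounds R0-R5 were run.
complete = leaderboard[leaderboard["max_rounds_tested"] == 6]

# Sort by any column and filter by model name, e.g. rank matching models
# by pass@5, descending.
matching = complete[complete["model"].str.contains("gpt", case=False)]
print(matching.sort_values("pass_at_5", ascending=False).to_string(index=False))
```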

Methodology Notes

What Is Reused From The Paper

  • Success progression by feedback round (R0 to R5); a cumulative-curve sketch follows this list.
  • Success rates by benchmark task category.
  • Error-stage perspective based on SAP validation order.
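One way to read the round progression as a single cumulative curve, assuming `first_pass_rounds` collects, for every (task, repetition) pair, the first round that passed (or `None` if all six rounds failed); this is a sketch, not necessarily the paper's exact computation.

```python
def cumulative_success_by_round(first_pass_rounds, max_round=5):
    """Cumulative success rate per round: a repetition counts as solved at
    round r if its first passing round is <= r."""
    total = len(first_pass_rounds)
    curve = []
    for r in range(max_round + 1):
        solved = sum(1 for fp in first_pass_rounds if fp is not None and fp <= r)
        curve.append(solved / total)
    return curve

# Illustrative values only: two repetitions pass at R0, one at R2, one at R5.
print(cumulative_success_by_round([0, None, 2, 0, 5, None]))
# -> [0.333..., 0.333..., 0.5, 0.5, 0.5, 0.666...]
```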

What Is Extended Here

  • Sortable developer-facing leaderboard table.
  • pass@5, AUC over rounds, and prompt consistency metrics (sketched after this list).
  • Completeness signal (`Max rounds tested`) for fair comparison.
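The extended metrics could be computed as below. The pass@5 formula is the standard unbiased pass@k estimator (with n = 10 repetitions per task); the AUC and consistency definitions are assumed readings, not necessarily the dashboard's exact ones.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: chance that at least one of k samples,
    drawn from n repetitions of which c are correct, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def auc_over_rounds(curve) -> float:
    """Normalized trapezoidal area under the cumulative R0-R5 success curve
    (assumed definition of 'AUC over rounds')."""
    steps = len(curve) - 1
    return sum((curve[i] + curve[i + 1]) / 2 for i in range(steps)) / steps

def prompt_consistency(per_repetition_success) -> float:
    """Share of a task's repetitions agreeing with the majority outcome;
    one possible reading of 'prompt consistency'."""
    passes = sum(per_repetition_success)
    return max(passes, len(per_repetition_success) - passes) / len(per_repetition_success)

print(pass_at_k(n=10, c=4, k=5))                            # ~0.976
print(auc_over_rounds([0.3, 0.4, 0.5, 0.5, 0.55, 0.6]))     # 0.48
print(prompt_consistency([1, 1, 0, 1, 1, 1, 0, 1, 1, 1]))   # 0.8
```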

Cumulative Success By Feedback Round

Success By Task Category

Plots