Discussion about this post

Nicolò Bagarin - 404_NOT_FOUND

At the time of the question, the top four models on LiveCodeBench Pro Hard were the only ones to solve any problem in the 15-problem subset, and each solved exactly one. They did not all solve the same problem, though: collectively they solved three distinct problems, for a combined score of 3/15 (20%).

GPT-5.2 solving 5 is certainly a significant improvement, but it added just 2 more to that collective total. This suggests that the relative improvement in overall AI capabilities is not as large as one would guess from individual benchmark performance alone.
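To make the arithmetic concrete, here is a minimal sketch of the comparison between the best individual score and the collective (union) score. The model names and problem IDs are hypothetical placeholders, not the actual benchmark results. Only the counts (four models with one solve each, three distinct problems, then a model solving five that adds two new ones) are taken from the comment above.

```python
# Hypothetical illustration: model names and problem IDs are made up.
# Only the counts mirror the figures quoted in the comment.

TOTAL_PROBLEMS = 15


def collective_solved(solved_by_model):
    """Return the set of problems solved by at least one model."""
    solved = set()
    for problems in solved_by_model.values():
        solved |= set(problems)
    return solved


# Four models, one solved problem each, but only three distinct problems.
before = {
    "model_a": {"P1"},
    "model_b": {"P2"},
    "model_c": {"P3"},
    "model_d": {"P1"},
}

# Assumption: the new model solves the three earlier problems plus two new ones,
# which is what "solved 5, added just 2 more" implies.
after = dict(before, new_model={"P1", "P2", "P3", "P4", "P5"})

best_before = max(len(p) for p in before.values())
best_after = max(len(p) for p in after.values())
union_before = collective_solved(before)
union_after = collective_solved(after)

print(f"best individual: {best_before}/{TOTAL_PROBLEMS} -> {best_after}/{TOTAL_PROBLEMS}")
print(f"collective:      {len(union_before)}/{TOTAL_PROBLEMS} -> {len(union_after)}/{TOTAL_PROBLEMS}")
```

Under these assumptions the best individual score goes from 1 to 5, while the collective score only goes from 3 to 5, which is the gap the comment points at.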
