Lemmit.Online bot@lemmit.online to Singularity@lemmit.online · English · 1 month ago

Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad

arxiv.org

Cross-posted to: techtakes@awful.systems

Recent math benchmarks for large language models (LLMs) such as MathArena indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, o3-mini, achieving scores comparable to top human competitors. However, these benchmarks evaluate models solely based on final numerical answers, neglecting rigorous reasoning and proof generation which are essential for real-world mathematical tasks. To address this, we introduce the first comprehensive evaluation of full-solution reasoning for challenging mathematical problems. Using expert human annotators, we evaluated several state-of-the-art reasoning models on the six problems from the 2025 USAMO within hours of their release. Our results reveal that all tested models struggled significantly, achieving less than 5% on average. Through detailed analysis of reasoning traces, we identify the most common failure modes and find several unwanted artifacts arising from the optimization strategies employed during model training. Overall, our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks, highlighting the need for substantial improvements in reasoning and proof generation capabilities.
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/tridentgum on 2025-03-31 20:13:14+00:00.
