
Socrates or Smarty Pants? Can LLMs really understand logic?
SocratesEval: An LLM Logical Reasoning Benchmark
| Model | Org | F1 Score | False Positive | False Negative | Fallacy Labelling Score | Explanation Score | Date |
|---|---|---|---|---|---|---|---|
| Claude 3.7 Sonnet thinking | Anthropic | 0.862 | 0.22588424437299034 | 0.022408963585434174 | -473.912 | 0.971 | 2025-05-13 |
| Claude 3.7 Sonnet | Anthropic | 0.888 | 0.18104184457728437 | 0.025412960609911054 | -526.909 | 1 | 2025-05-13 |
| LLaMA3.1 405b | Meta | 0.907 | 0.125 | 0.053691275167785234 | -506.93 | 0.479 | 2025-05-13 |
| Gemini 2.5 Pro thinking | Google | 0.733 | 0.4114803625377644 | 0.0165016501650165 | -332.441 | 0.793 | 2025-05-13 |
| Qwen 3 thinking | Alibaba | 0.899 | 0.16376306620209058 | 0.02345679012345679 | -793.875 | 0.297 | 2025-05-13 |
| Gemini 2.5 Pro | Google | 0.739 | 0.40500916310323765 | 0.01557632398753894 | -334.864 | 0.783 | 2025-05-13 |
| Qwen 3 | Alibaba | 0.895 | 0.17026793431287812 | 0.02372034956304619 | -811.122 | 0.304 | 2025-05-13 |
| Grok 3 | xAI | 0.844 | 0.2572964669738863 | 0.018292682926829267 | -1574.708 | 0.825 | 2025-05-13 |
| DeepSeek V3 | DeepSeek | 0.926 | 0.09817131857555342 | 0.04570184983677911 | -850.253 | 0.669 | 2025-05-13 |
| DeepSeek R1 | DeepSeek | 0.843 | 0.2654600301659125 | 0.007911392405063292 | -521.672 | 0.832 | 2025-05-13 |
| Claude 3.5 Sonnet | Anthropic | 0.821 | 0.2980561555075594 | 0.007029876977152899 | -680.41 | 0 | 2025-05-13 |
| Grok 3 mini | xAI | 0.894 | 0.1729776247848537 | 0.022613065326633167 | -1298.597 | 0.51 | 2025-05-13 |
| Grok 3 mini thinking | xAI | 0.894 | 0.1679721496953873 | 0.02843016069221261 | -1323.705 | 0.52 | 2025-05-13 |
| Grok 2 | xAI | 0.896 | 0.16564952048823017 | 0.027127003699136867 | -622.719 | 0.395 | 2025-05-13 |
| o4 mini | OpenAI | 0.89 | 0.15096065873741996 | 0.058959537572254334 | -341.532 | 0.579 | 2025-05-13 |
| GPT 4o | OpenAI | 0.855 | 0.24632068164213788 | 0.008995502248875561 | -810.217 | 0.506 | 2025-05-13 |
| o3 mini | OpenAI | 0.903 | 0.15075825156110614 | 0.03225806451612903 | -355.898 | 0.377 | 2025-05-13 |
Despite the growing focus on evaluating LLM reasoning, no existing benchmark specifically targets logic traps—often humorous yet deceptive statements common in English-speaking contexts. Moreover, no systematic survey has been conducted to assess how well LLMs navigate such reasoning challenges.
To fill this gap, SocratesEval introduces a benchmark with clear logical error labels, evaluating LLMs on two key tasks: (1) determining whether a given statement contains a logical error, and (2) classifying the type of fallacy. We introduce structured scoring to assess fallacy identification and overall reasoning quality, providing a detailed evaluation of LLM performance.
Our results offer the first comparative analysis of LLMs’ ability to handle logic traps, shedding light on their reasoning limitations and potential.
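As a rough illustration of the two tasks described above, the sketch below models one benchmark item and grades a single model answer. The class names, field names, and the `grade` helper are hypothetical and not taken from SocratesEval. The exact-match comparison for the fallacy type is a simplification and is not the benchmark's Fallacy Labelling Score.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogicItem:
    """One statement with its human labels (hypothetical schema)."""
    statement: str
    has_error: bool                # gold label for task 1
    fallacy: Optional[str] = None  # gold label for task 2; None if the statement is sound


@dataclass(frozen=True)
class ModelAnswer:
    """A model's response to one item (hypothetical schema)."""
    says_error: bool
    fallacy: Optional[str] = None


def grade(item: LogicItem, answer: ModelAnswer) -> dict:
    """Compare one answer against the gold labels for both tasks."""
    # Task 1: did the model correctly decide whether there is a logical error?
    detection_correct = answer.says_error == item.has_error

    # Task 2 only applies when the item really contains a labelled fallacy.
    label_correct = None
    if item.has_error and item.fallacy is not None:
        label_correct = (
            answer.says_error
            and answer.fallacy is not None
            and answer.fallacy.strip().lower() == item.fallacy.strip().lower()
        )
    return {"detection_correct": detection_correct, "label_correct": label_correct}


item = LogicItem(
    statement="If people can't afford electricity, "
              "why don't they just grow more power plants?",
    has_error=True,
    fallacy="Equivocation",
)
result = grade(item, ModelAnswer(says_error=True, fallacy="equivocation"))
```

Keeping detection and labelling as separate fields means a model that spots the error but names the wrong fallacy still gets credit for task 1.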
A False Positive (FP) occurs when a logically correct sentence is misclassified as containing a logical fallacy; a False Negative (FN) occurs when a sentence containing a logical fallacy is mistakenly labeled as correct.
F1 Score evaluates the model’s ability to detect logical errors, balancing precision and recall.
Fallacy Labelling Score measures how accurately the model classifies sentences into human-annotated and verified logical fallacy categories.
Explanation Score assesses how well the model's interpretation of logical errors aligns with human reasoning.
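The FP, FN, and F1 definitions above can be sketched from a list of gold labels and model predictions. This is a minimal sketch, and it rests on an assumption: that the False Positive and False Negative columns are rates normalized by the number of logically correct and fallacious sentences respectively. The source does not state the normalization, and the function name `detection_metrics` is illustrative.

```python
def detection_metrics(gold, pred):
    """Compute FP rate, FN rate, and F1 for error detection.

    gold, pred: equal-length sequences of bool, where True means
    "this sentence contains a logical error".
    """
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g and p)          # fallacy correctly flagged
    fp = sum(1 for g, p in pairs if not g and p)      # sound sentence flagged as fallacy
    fn = sum(1 for g, p in pairs if g and not p)      # fallacy labeled as correct
    tn = sum(1 for g, p in pairs if not g and not p)  # sound sentence labeled as correct

    def ratio(num, den):
        # Avoid division by zero when a class is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        # Assumption: rates are normalized per gold class.
        "fp_rate": ratio(fp, fp + tn),
        "fn_rate": ratio(fn, fn + tp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


gold = [True, True, True, False, False]
pred = [True, True, False, True, False]
metrics = detection_metrics(gold, pred)
```

Under this reading, a model that flags almost everything as a fallacy gets a low FN rate but a high FP rate, which is the pattern several rows in the table show.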
If people can't afford electricity, why don't they just grow more power plants?
Equivocation