Socrates or Smarty Pants? Can LLMs really understand logic?

Leaderboard

Model	Org	F1 Score	False Positive	False Negative	Fallacy Labelling Score	Explanation Score	Date
Claude 3.7 Sonnet thinking	Anthropic	0.862	0.22588424437299034	0.022408963585434174	-473.912	0.971	2025-05-13
Claude 3.7 Sonnet	Anthropic	0.888	0.18104184457728437	0.025412960609911054	-526.909	1	2025-05-13
LLaMA3.1 405b	Meta	0.907	0.125	0.053691275167785234	-506.93	0.479	2025-05-13
Gemini 2.5 Pro thinking	Google	0.733	0.4114803625377644	0.0165016501650165	-332.441	0.793	2025-05-13
Qwen 3 thinking	Alibaba	0.899	0.16376306620209058	0.02345679012345679	-793.875	0.297	2025-05-13
Gemini 2.5 Pro	Google	0.739	0.40500916310323765	0.01557632398753894	-334.864	0.783	2025-05-13
Qwen 3	Alibaba	0.895	0.17026793431287812	0.02372034956304619	-811.122	0.304	2025-05-13
Grok 3	xAI	0.844	0.2572964669738863	0.018292682926829267	-1574.708	0.825	2025-05-13
DeepSeek V3	DeepSeek	0.926	0.09817131857555342	0.04570184983677911	-850.253	0.669	2025-05-13
DeepSeek R1	DeepSeek	0.843	0.2654600301659125	0.007911392405063292	-521.672	0.832	2025-05-13
Claude 3.5 Sonnet	Anthropic	0.821	0.2980561555075594	0.007029876977152899	-680.41	0	2025-05-13
Grok 3 mini	xAI	0.894	0.1729776247848537	0.022613065326633167	-1298.597	0.51	2025-05-13
Grok 3 mini thinking	xAI	0.894	0.1679721496953873	0.02843016069221261	-1323.705	0.52	2025-05-13
Grok 2	xAI	0.896	0.16564952048823017	0.027127003699136867	-622.719	0.395	2025-05-13
o4 mini	OpenAI	0.89	0.15096065873741996	0.058959537572254334	-341.532	0.579	2025-05-13
GPT 4o	OpenAI	0.855	0.24632068164213788	0.008995502248875561	-810.217	0.506	2025-05-13
o3 mini	OpenAI	0.903	0.15075825156110614	0.03225806451612903	-355.898	0.377	2025-05-13

Introduction

Despite the growing focus on evaluating LLM reasoning, no existing benchmark specifically targets logic traps: statements, often humorous yet deceptive, that are common in English-speaking contexts. Nor has any systematic survey assessed how well LLMs navigate such reasoning challenges.

To fill this gap, SocratesEval introduces a benchmark with clear logical error labels, evaluating LLMs on two key tasks: (1) determining whether a given statement contains a logical error, and (2) classifying the type of fallacy. We introduce structured scoring to assess fallacy identification and overall reasoning quality, providing a detailed evaluation of LLM performance.

Our results offer the first comparative analysis of LLMs’ ability to handle logic traps, shedding light on their reasoning limitations and potential.

Metrics

A False Positive (FP) occurs when a logically correct sentence is misclassified as containing a logical fallacy, while a False Negative (FN) occurs when a sentence containing a logical fallacy is mistakenly labeled as correct.

F1 Score evaluates the model’s ability to distinguish logical errors, balancing precision and recall.
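As a rough illustration of how FP, FN, and F1 relate, the sketch below tallies a confusion matrix from (gold, predicted) pairs and derives F1 from precision and recall. It assumes that "contains a logical fallacy" is the positive class. The names `Counts`, `tally`, and `f1_score` are hypothetical, and this is not necessarily how the leaderboard values were computed.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Counts:
    """Confusion-matrix counts; 'positive' means 'contains a fallacy' (an assumption)."""
    tp: int = 0  # fallacious sentence flagged as fallacious
    fp: int = 0  # logically correct sentence flagged as fallacious
    tn: int = 0  # logically correct sentence labeled correct
    fn: int = 0  # fallacious sentence labeled correct


def tally(pairs: Iterable[Tuple[bool, bool]]) -> Counts:
    """Count outcomes from (gold_has_fallacy, predicted_has_fallacy) pairs."""
    c = Counts()
    for gold, pred in pairs:
        if gold and pred:
            c.tp += 1
        elif not gold and pred:
            c.fp += 1
        elif not gold and not pred:
            c.tn += 1
        else:
            c.fn += 1
    return c


def f1_score(c: Counts) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when undefined."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this convention, a model that flags nearly everything as fallacious keeps FN low but pays for it in FP, which lowers precision and therefore F1.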

Fallacy Labelling Score measures how accurately the model classifies sentences into human-annotated and verified logical fallacy categories.

Explanation Score assesses how well the model's interpretation of logical errors aligns with human reasoning.

Radar Chart: Top F1 Model from Each Family

Peek into Our Dataset

Sentence: If people can't afford electricity, why don't they just grow more power plants?

Fallacy label: Equivocation