Socrates or Smarty Pants? Can LLMs really understand logic?

SocratesEval: An LLM Logic Reasoning Benchmark

| Model | Org | F1 Score | False Positive | False Negative | Fallacy Labelling Score | Explanation Score | Date |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude 3.7 Sonnet thinking | Anthropic | 0.862 | 0.22588424437299034 | 0.022408963585434174 | -473.912 | 0.971 | 2025-05-13 |
| Claude 3.7 Sonnet | Anthropic | 0.888 | 0.18104184457728437 | 0.025412960609911054 | -526.909 | 1 | 2025-05-13 |
| LLaMA3.1 405b | Meta | 0.907 | 0.125 | 0.053691275167785234 | -506.93 | 0.479 | 2025-05-13 |
| Gemini 2.5 Pro thinking | Google | 0.733 | 0.4114803625377644 | 0.0165016501650165 | -332.441 | 0.793 | 2025-05-13 |
| Qwen 3 thinking | Alibaba | 0.899 | 0.16376306620209058 | 0.02345679012345679 | -793.875 | 0.297 | 2025-05-13 |
| Gemini 2.5 Pro | Google | 0.739 | 0.40500916310323765 | 0.01557632398753894 | -334.864 | 0.783 | 2025-05-13 |
| Qwen 3 | Alibaba | 0.895 | 0.17026793431287812 | 0.02372034956304619 | -811.122 | 0.304 | 2025-05-13 |
| Grok 3 | xAI | 0.844 | 0.2572964669738863 | 0.018292682926829267 | -1574.708 | 0.825 | 2025-05-13 |
| DeepSeek V3 | DeepSeek | 0.926 | 0.09817131857555342 | 0.04570184983677911 | -850.253 | 0.669 | 2025-05-13 |
| DeepSeek R1 | DeepSeek | 0.843 | 0.2654600301659125 | 0.007911392405063292 | -521.672 | 0.832 | 2025-05-13 |
| Claude 3.5 Sonnet | Anthropic | 0.821 | 0.2980561555075594 | 0.007029876977152899 | -680.41 | 0 | 2025-05-13 |
| Grok 3 mini | xAI | 0.894 | 0.1729776247848537 | 0.022613065326633167 | -1298.597 | 0.51 | 2025-05-13 |
| Grok 3 mini thinking | xAI | 0.894 | 0.1679721496953873 | 0.02843016069221261 | -1323.705 | 0.52 | 2025-05-13 |
| Grok 2 | Grok | 0.896 | 0.16564952048823017 | 0.027127003699136867 | -622.719 | 0.395 | 2025-05-13 |
| o4 mini | OpenAI | 0.89 | 0.15096065873741996 | 0.058959537572254334 | -341.532 | 0.579 | 2025-05-13 |
| GPT 4o | OpenAI | 0.855 | 0.24632068164213788 | 0.008995502248875561 | -810.217 | 0.506 | 2025-05-13 |
| o3 mini | OpenAI | 0.903 | 0.15075825156110614 | 0.03225806451612903 | -355.898 | 0.377 | 2025-05-13 |

Despite the growing focus on evaluating LLM reasoning, no existing benchmark specifically targets logic traps: statements, often humorous yet deceptive, that are common in English-speaking contexts. Nor has any systematic survey assessed how well LLMs navigate such reasoning challenges.

To fill this gap, SocratesEval introduces a benchmark with clear logical error labels, evaluating LLMs on two key tasks: (1) determining whether a given statement contains a logical error, and (2) classifying the type of fallacy. We introduce structured scoring to assess fallacy identification and overall reasoning quality, providing a detailed evaluation of LLM performance.

Our results offer the first comparative analysis of LLMs’ ability to handle logic traps, shedding light on their reasoning limitations and potential.

A False Positive (FP) occurs when a logically correct sentence is misclassified as containing a logical fallacy, while a False Negative (FN) occurs when a sentence containing a logical fallacy is mistakenly labeled as correct.

F1 Score evaluates the model’s ability to distinguish logical errors, balancing precision and recall.
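
The detection metrics above can be sketched in a few lines of Python. This is an illustrative sketch rather than the benchmark's actual scoring code. It assumes that "contains a logical error" is the positive class, and that the False Positive and False Negative columns are rates (FP / (FP + TN) and FN / (FN + TP)). The names `DetectionMetrics` and `detection_metrics` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DetectionMetrics:
    false_positive_rate: float  # correct sentences flagged as fallacious
    false_negative_rate: float  # fallacious sentences labeled as correct
    f1: float                   # harmonic mean of precision and recall


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(gold: Sequence[bool], predicted: Sequence[bool]) -> DetectionMetrics:
    """Compute FP rate, FN rate, and F1 for the error-detection task.

    Assumption: True means "the sentence contains a logical error".
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    tn = sum(1 for g, p in pairs if not g and not p)

    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    f1 = _ratio(2 * precision * recall, precision + recall)

    return DetectionMetrics(
        false_positive_rate=_ratio(fp, fp + tn),
        false_negative_rate=_ratio(fn, fn + tp),
        f1=f1,
    )
```

Under these assumptions, a model that flags almost every sentence as fallacious would show a low False Negative rate but a high False Positive rate, which is why the table reports both alongside F1.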

Fallacy Labelling Score measures how accurately the model classifies sentences into human-annotated and verified logical fallacy categories.

Explanation Score assesses how well the model's interpretation of logical errors aligns with human reasoning.

If people can't afford electricity, why don't they just grow more power plants?

Equivocation