arxiv:2509.25868

ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations

Published on Sep 30, 2025

Authors:

Abstract

A benchmark called ReFACT evaluates the detection, localization, and correction of scientific confabulation in LLMs, revealing their limited accuracy.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Large Language Models (LLMs) frequently confabulate scientific facts, severely undermining their trustworthiness. Addressing this challenge requires benchmarks that go beyond binary factuality and enable fine-grained evaluation. We introduce ReFACT (Reddit False And Correct Texts), a benchmark of 1,001 expert-annotated question-answer pairs spanning diverse scientific domains for the detection of scientific confabulation. Each instance includes both a scientifically correct answer and a non-factual counterpart annotated with precise error spans and error types. ReFACT enables multi-stage evaluation: (1) confabulation detection, (2) fine-grained error localization, and (3) correction. We benchmark 9 state-of-the-art LLMs, revealing limited performance (about 50 percent accuracy). Even top models such as GPT-4o fail to distinguish factual from confabulated scientific answers, raising concerns about the reliability of LLM-as-judge evaluation paradigms. Our findings highlight the need for fine-grained, human-validated benchmarks to detect and correct scientific confabulation in domain-specific contexts. The dataset is available at: https://github.com/ddz5431/ReFACT

View arXiv page View PDF GitHub 1 auto Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2509.25868

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2509.25868 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2509.25868 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.