🤔 ReTabAD: A Benchmark for Restoring Semantic Context in Tabular Anomaly Detection

We introduce ReTabAD, the first context-aware tabular anomaly detection benchmark, which provides semantically enriched datasets and a zero-shot LLM framework.

🎯 Overview

[Figure: ReTabAD overview]

In tabular anomaly detection (AD), textual semantics often carry critical signals, as the definition of an anomaly is closely tied to domain-specific context. However, existing benchmarks provide only raw data points without semantic context, overlooking rich textual metadata such as feature descriptions and domain knowledge that experts rely on in practice. This limitation restricts research flexibility and prevents models from fully leveraging domain knowledge for detection.

ReTabAD addresses this gap by Restoring textual semantics to enable context-aware Tabular AD research. We provide (1) 20 carefully curated tabular datasets enriched with structured textual metadata, together with implementations of state-of-the-art AD algorithms (classical, deep learning, and LLM-based approaches), and (2) a zero-shot LLM framework that leverages semantic context without task-specific training, establishing a strong baseline for future research.

Furthermore, this work provides insights into the role and utility of textual metadata in AD through experiments and analysis. Results show that semantic context improves detection performance and enhances interpretability by supporting domain-aware reasoning. These findings establish ReTabAD as a benchmark for systematic exploration of context-aware AD.


✨ Key Features

📚 Semantically-Rich Tabular AD Benchmark

Tabular data paired with comprehensive JSON text metadata containing column descriptions, logical types, and characterizations of normal data.
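
To make the idea of JSON text metadata concrete, here is a minimal sketch of what such a file might contain and how it could be loaded and checked. The field names (`columns`, `logical_type`, `description`, `normal_description`) and the example dataset are assumptions made for illustration; the benchmark's actual schema may differ.

```python
import json

# Hypothetical metadata; field names and values are illustrative assumptions,
# not the benchmark's actual schema.
METADATA_JSON = """
{
  "dataset": "example",
  "normal_description": "Placeholder text describing what normal records look like.",
  "columns": [
    {"name": "temperature", "logical_type": "numerical",
     "description": "Sensor temperature reading."},
    {"name": "machine_type", "logical_type": "categorical",
     "description": "Type of machine that produced the record."}
  ]
}
"""

REQUIRED_COLUMN_FIELDS = {"name", "logical_type", "description"}


def load_metadata(text):
    """Parse metadata JSON and verify each column entry has the assumed fields."""
    meta = json.loads(text)
    for col in meta["columns"]:
        missing = REQUIRED_COLUMN_FIELDS - col.keys()
        if missing:
            raise ValueError(f"column entry missing fields: {sorted(missing)}")
    return meta


def summarize_columns(meta):
    """Return one 'name [logical_type]: description' line per column."""
    return [
        f'{c["name"]} [{c["logical_type"]}]: {c["description"]}'
        for c in meta["columns"]
    ]


meta = load_metadata(METADATA_JSON)
for line in summarize_columns(meta):
    print(line)
```

Validating the metadata at load time keeps later steps (such as turning rows into text) from failing on a half-described column.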

💡 Supports SOTA Algorithms

Unified pipeline enabling fair comparisons across traditional ML, deep learning, and modern LLM approaches.
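
A unified pipeline usually means every detector is driven through the same interface and scored with the same metric. The sketch below illustrates that idea only; the `fit`/`score` interface, the toy `ZScoreDetector`, and the choice of AUROC as the metric are assumptions for illustration and are not taken from the ReTabAD code.

```python
from statistics import mean, pstdev


class ZScoreDetector:
    """Toy stand-in detector: score is the largest absolute z-score in a row."""

    def fit(self, X):
        cols = list(zip(*X))
        self.means = [mean(c) for c in cols]
        # Fall back to 1.0 for constant columns to avoid division by zero.
        self.stds = [pstdev(c) or 1.0 for c in cols]
        return self

    def score(self, X):
        return [
            max(abs(v - m) / s for v, m, s in zip(row, self.means, self.stds))
            for row in X
        ]


def auroc(labels, scores):
    """Probability that a random anomaly outscores a random normal point.

    Ties count as 0.5. Labels are 1 for anomalies and 0 for normal points.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(detector, X_train, X_test, y_test):
    """Fit on training data, score the test data, and report AUROC."""
    return auroc(y_test, detector.fit(X_train).score(X_test))


X_train = [[1.0, 10.0], [1.2, 11.0], [0.9, 9.5], [1.1, 10.5]]
X_test = [[1.0, 10.0], [5.0, 10.0], [1.1, 10.2]]
y_test = [0, 1, 0]
print(evaluate(ZScoreDetector(), X_train, X_test, y_test))
```

Any detector exposing the same two methods, whether classical, deep, or LLM-based, could be passed to `evaluate` unchanged, which is what makes the comparison like-for-like.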

🚀 LLM Potential

Demonstrates substantial performance improvements when models can leverage semantic information.
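
One way to picture how semantic information could reach an LLM in a zero-shot setting is to compare a prompt built from raw values with one that also carries column descriptions and a description of normal data. The function below is a hypothetical sketch: the prompt wording, the `build_prompt` name, and its parameters are assumptions, and no LLM is called.

```python
def build_prompt(row, column_descriptions=None, normal_description=None):
    """Build a zero-shot prompt for one record.

    `column_descriptions` maps column names to text; `normal_description`
    describes typical normal data. Both are optional, so the same function
    yields either a context-free or a context-aware prompt.
    """
    parts = ["You are assessing whether a tabular record is anomalous."]
    if normal_description:
        parts.append(f"Normal data: {normal_description}")
    parts.append("Record:")
    for name, value in row.items():
        desc = (column_descriptions or {}).get(name)
        suffix = f" ({desc})" if desc else ""
        parts.append(f"- {name}{suffix}: {value}")
    parts.append("Answer with an anomaly score between 0 and 1.")
    return "\n".join(parts)


row = {"temperature": 98.5}  # illustrative record, not benchmark data
print(build_prompt(row))
print()
print(build_prompt(
    row,
    column_descriptions={"temperature": "Sensor reading in Celsius"},
    normal_description="Readings stay near 60.",
))
```

The second prompt gives the model something to reason against: without the description of normal data, 98.5 is just a number.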


🔬 Why ReTabAD?

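Traditional tabular AD benchmarks exhibit a fundamental disconnect from industrial practice: they provide only raw data points and omit the feature descriptions and domain knowledge that experts rely on when judging anomalies.

ReTabAD addresses this by restoring semantic context and enabling context-aware AD research.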


📊 Benchmark Statistics

ReTabAD includes 20 diverse datasets spanning multiple domains:

| Metric | Range |
| --- | --- |
| Datasets | 20 real-world scenarios |
| Data Points | 159 to 50,000 per dataset |
| Features | 6 to 42 columns |
| Anomaly Ratio | 0.38% to 33.29% |
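
For readers who want to check figures like these on their own data, the sketch below computes the same three per-dataset quantities. It assumes each record is a dict with a binary `label` key (1 for anomaly) and that the label is not counted as a feature; both are assumptions, as is the toy data.

```python
def dataset_stats(rows, label_key="label"):
    """Return data point count, feature count, and anomaly ratio in percent."""
    if not rows:
        raise ValueError("dataset is empty")
    n_points = len(rows)
    # Every key except the label is treated as a feature (assumption).
    n_features = len(rows[0]) - 1
    n_anomalies = sum(1 for r in rows if r[label_key] == 1)
    return {
        "data_points": n_points,
        "features": n_features,
        "anomaly_ratio_pct": round(100 * n_anomalies / n_points, 2),
    }


# Toy data for illustration only: 8 records, 2 features, 1 anomaly.
rows = [{"f1": i, "f2": i * 2, "label": 0} for i in range(7)]
rows.append({"f1": 100, "f2": -5, "label": 1})
print(dataset_stats(rows))
```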

🚀 Quick Start

```bash
# Clone the repository
git clone https://github.com/yoonsanghyu/ReTabAD.git
cd ReTabAD

# Build Docker image
docker build -t retabad:1.0.0 .

# Run experiment
python run_default.py --data_name wine --model_name OCSVM --cfg_file configs/default/pyod/OCSVM.yaml
```

See Usage for detailed instructions.


📰 News