A Multi-Granular Tabular Representation Learning Benchmark
Tabular encoders span many paradigms — different inputs, training objectives, and output heads — so models are hard to compare even when they operate on similar tabular signals. TRL-Bench unifies them at the level of the representation: each model exports row-, column-, or table-embeddings through one shared wrapper, and lightweight shared heads probe them across 20 encoders, 16 tasks, and 87 datasets in three suites.
Datasets: TRL-CTbench · TRL-Rbench · TRL-DLTE
Tabular encoders are usually evaluated inside task-specific end-to-end pipelines, so a strong result may come from the wrapped predictor, training budget, and task-specific adaptation as much as from the encoder itself. TRL-Bench asks a comparability question: under one shared protocol over the exported representations, how do heterogeneous tabular encoders actually differ?
In encode-once, reuse-many settings, tables are embedded once and reused across tasks and large multi-table corpora such as data lakes. The representation — not the task-specific wrapper — becomes the object of evaluation.
Each model exports frozen row-, column-, or table-embeddings; shared lightweight heads (training-free, learned probes, or query-conditioned) read them out. Comparisons reflect the exported embedding under common readouts, not the choice of downstream predictor.
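As a rough illustration of a training-free readout over frozen embeddings, the sketch below uses a nearest-centroid head with cosine similarity. This is one plausible head, not necessarily the one TRL-Bench uses, and the names `fit_centroids` and `predict` are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fit_centroids(embeddings, labels):
    """Average the frozen embeddings of each class; the encoder itself is never updated."""
    sums = {}
    counts = defaultdict(int)
    for emb, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(emb)
        sums[label] = [s + x for s, x in zip(sums[label], emb)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, embedding):
    """Assign the class whose centroid is most cosine-similar to the embedding."""
    return max(centroids, key=lambda label: cosine(centroids[label], embedding))


# Toy frozen row embeddings (made-up values, for illustration only).
train = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centroids = fit_centroids(train, ["a", "a", "b", "b"])
print(predict(centroids, [0.8, 0.2]))
```

Because the head has no trainable parameters, any difference between two encoders under this readout comes from the exported embeddings alone.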
Many of the wrapped models were never built to emit embeddings, including in-context predictors (TabPFN, TabICL), table-QA/parsing models (TAPAS, TAPEX, TaBERT), and self-supervised learners (SAINT, SCARF, SubTab, VIME). One shared wrapper turns each into a reusable tabular embedding model.
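To make the idea of a shared wrapper concrete, here is a minimal sketch of what a common embedding interface could look like. The `TabularEmbedder` protocol, its method names, the dictionary table format, and the toy `HashingEmbedder` are all assumptions made for illustration; they are not the TRL-Bench API.

```python
import zlib
from typing import Protocol


class TabularEmbedder(Protocol):
    """Hypothetical shared interface: every wrapped model exposes the same three exports."""

    dim: int

    def embed_rows(self, table: dict) -> list[list[float]]: ...
    def embed_columns(self, table: dict) -> list[list[float]]: ...
    def embed_table(self, table: dict) -> list[float]: ...


class HashingEmbedder:
    """Toy stand-in for a real encoder: hashed bag-of-cells vectors."""

    def __init__(self, dim: int = 8):
        self.dim = dim

    def _embed_tokens(self, tokens):
        vec = [0.0] * self.dim
        for tok in tokens:
            # crc32 is deterministic across runs, unlike Python's built-in hash().
            vec[zlib.crc32(str(tok).encode("utf-8")) % self.dim] += 1.0
        return vec

    def embed_rows(self, table):
        return [self._embed_tokens(row) for row in table["rows"]]

    def embed_columns(self, table):
        columns = zip(*table["rows"])  # transpose rows into columns
        return [self._embed_tokens([name, *col]) for name, col in zip(table["header"], columns)]

    def embed_table(self, table):
        cells = [cell for row in table["rows"] for cell in row]
        return self._embed_tokens(list(table["header"]) + cells)


# Assumed table format: a header list plus a list of equally long rows.
table = {"header": ["city", "pop"], "rows": [["Oslo", "700k"], ["Rome", "2.8M"], ["Lima", "10M"]]}
encoder: TabularEmbedder = HashingEmbedder(dim=8)
print(len(encoder.embed_rows(table)), len(encoder.embed_columns(table)), len(encoder.embed_table(table)))
```

Downstream heads written against such an interface never need to know whether the underlying model was an in-context predictor, a table-QA model, or a self-supervised learner.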
TRL-Bench treats retrieval, schema alignment, linkage, prediction, and grounding as atomic capabilities, and measures them at the granularities where embeddings are actually reused — columns and tables, rows, and their composition.
Column & table transfer · 13 tasks
Eight column-level and five table-level tasks across schema understanding, joinability, unionability, and grounding (table QA, retrieval & subset). Twenty standardized column/table datasets; evaluated on the 10 of 20 models that natively expose column or table embeddings.
Leader: Starmie — 0.662 MAP Union Search, 0.764 R@GT Schema Matching.
TRL-CTbench dataset →
Row transfer · prediction + linkage
Within-table row prediction over 50 OpenML-derived tables with 123 hand-verified targets (77 classification, 46 regression), and cross-table record linkage over 16 datasets, split into Clean and Robust linkage. Tests whether one row embedding transfers both within and across tables.
Leader: TabICL — 0.816 AUROC row prediction; BERT/GTE lead Clean/Robust linkage.
TRL-Rbench dataset →
Compositional · all three granularities
Multi-stage Data-Lake Table Enrichment over a 47,772-table lake built from 1,379 parent tables. Retrieve with table embeddings, align columns with column embeddings, match rows with row embeddings — a single composition test for the whole benchmark.
Best pipeline: TUTA/GTE/GTE — 0.229 UJ-H; capability-matched hybrids beat monoliths.
TRL-DLTE dataset →
Curated assets. (a) TRL-Rbench row-prediction curation: 158 candidate tables pass rule screening, degeneracy audit, and human review with label repair into 50 tables with 123 targets. (b) TRL-DLTE lake assembly: 1,379 TabFact/WTQ parents are fragmented at four noise tiers; 11,032 targets are embedded alongside 36,740 CKAN distractors in a 47,772-table lake (parent-disjoint splits).
Thirteen tasks — eight column-level, five table-level — probe column- and table-embeddings across four capabilities, on the 10 of 20 models that natively expose them. No pretraining recipe leads everywhere: family-level rank shifts from surface-text tasks to cross-table geometry.
Column type prediction & relation extraction. Generic text encoders lead — BERT takes the best Schema family rank (NR 0.000 ↓) where headers and short cell strings carry most of the signal.
Join search across tables. GTE leads Join Search — its retrieval-contrastive pretraining matches the cross-table cosine setup better than generic schema encoders.
Union search & schema matching — cross-table alignment geometry. Starmie's contrastive objective wins both: 0.662 MAP (Union Search) and 0.764 R@GT (Schema Matching).
Table QA, retrieval & subset. The text-table family leads: TaBERT has the best Grounding rank (NR 0.198 ↓), and task wins split between TURL (Table QA, 0.277 Acc) and GTE (Table Retrieval, 0.476 MRR).
Specialization gap. Generic-text rank degrades from Schema → Grounding (BERT NR 0.000 → 0.397), while specialists win exactly the Union and Grounding tasks that reward structure — no recipe behaves like a universal representation.
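The retrieval-style scores quoted above (MAP for Union Search, MRR for Table Retrieval) are standard ranking metrics. The sketch below shows their textbook definitions. It assumes each ranked list contains unique ids and that average precision is normalized by the number of relevant items; whether TRL-Bench applies a rank cutoff or a different normalization is not stated here.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_over_queries(metric, runs):
    """Average a per-query metric over (ranked_ids, relevant_ids) pairs: MAP or MRR."""
    if not runs:
        return 0.0
    return sum(metric(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Toy queries with made-up ids.
runs = [(["a", "x", "b"], {"a", "b"}), (["x", "a"], {"a"})]
print(mean_over_queries(average_precision, runs), mean_over_queries(reciprocal_rank, runs))
```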
One row embedding, two transfer regimes — within-table prediction and cross-table linkage. They reward different training scopes, so no encoder leads both.
Within-table · intra-table transfer
Prediction over 50 OpenML tables with 123 targets (77 classification, 46 regression). Prior-based TabICL leads — 0.816 AUROC (Macro-F1 0.671) — layering target-table adaptation on a meta-pretrained prior.
Cross-table · inter-table transfer
Identity resolution over 16 datasets, split into Clean and Robust. Transfer encoders win: BERT on Clean Linkage (0.418 F1), GTE on Robust Linkage (NR 0.048 ↓), with TransTab second on Robust.
Transfer-scope gap. Target-table self-supervision fits locally (strong on prediction, weak on linkage); shared transfer encoders give comparable row spaces (strong on linkage, mid-pack on prediction). The same row-matching capability resurfaces as TRL-DLTE Stage 3 — the two agree at |ρ| = 0.80.
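For intuition on the cross-table side, the sketch below links rows from two tables by cosine similarity of their row embeddings and scores the result with F1. The best-match-above-threshold rule, the threshold value, and the function names are assumptions for illustration; the benchmark's actual linkage protocol may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def link_rows(left, right, threshold):
    """Link each left row to its most similar right row if the similarity clears the threshold."""
    links = set()
    if not right:
        return links
    for i, u in enumerate(left):
        j, score = max(((j, cosine(u, v)) for j, v in enumerate(right)), key=lambda pair: pair[1])
        if score >= threshold:
            links.add((i, j))
    return links


def linkage_f1(predicted, gold):
    """F1 over predicted (left_index, right_index) pairs against gold pairs."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy row embeddings from two tables (made-up values).
left = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
right = [[0.9, 0.1], [0.1, 0.9]]
predicted = link_rows(left, right, threshold=0.9)
print(sorted(predicted), linkage_f1(predicted, {(0, 0), (1, 1), (2, 0)}))
```

This only works when both tables are embedded into a comparable space, which is the property the transfer-scope gap is about.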
From a complete parent table we remove a block of rows (the union target) and a block of columns (the join target); the remainder is a seed query. Given only the seed and a data lake, a pipeline must recover both targets — composing table, column, and row embeddings across three stages.
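The query construction described above can be sketched as follows. Several details are assumptions, not taken from the source: the held-out blocks are the trailing rows and trailing columns, the join target keeps one key column (`key_col`, hypothetical) so it can be re-joined to the seed, and the cells where held-out rows meet held-out columns are simply dropped. The noise tiers applied to fragments are not modeled.

```python
def make_enrichment_query(header, rows, n_union_rows, n_join_cols, key_col=0):
    """Split a parent table into a seed query, a union target, and a join target."""
    keep_rows = len(rows) - n_union_rows
    keep_cols = len(header) - n_join_cols
    if keep_rows <= 0 or keep_cols <= key_col:
        raise ValueError("held-out blocks must leave a non-empty seed that contains the key column")

    # Seed: what the pipeline is given, i.e. the parent minus both held-out blocks.
    seed = {
        "header": header[:keep_cols],
        "rows": [row[:keep_cols] for row in rows[:keep_rows]],
    }
    # Union target: the removed rows, restricted to the seed's schema.
    union_target = {
        "header": header[:keep_cols],
        "rows": [row[:keep_cols] for row in rows[keep_rows:]],
    }
    # Join target: the removed columns for the seed's rows, plus the assumed key column.
    join_target = {
        "header": [header[key_col]] + header[keep_cols:],
        "rows": [[row[key_col]] + row[keep_cols:] for row in rows[:keep_rows]],
    }
    return seed, union_target, join_target


# Toy parent table (made-up content).
header = ["id", "name", "city", "pop"]
rows = [["1", "Ann", "Oslo", "700k"], ["2", "Bo", "Rome", "2.8M"],
        ["3", "Cy", "Lima", "10M"], ["4", "Di", "Kyiv", "2.9M"]]
seed, union_target, join_target = make_enrichment_query(header, rows, n_union_rows=1, n_join_cols=2)
print(seed["header"], union_target["rows"], join_target["header"])
```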
Retrieve candidate tables from the 47,772-table lake using table embeddings (target recall@100), against 36,740 CKAN distractors.
Align columns and predict union / join / none using column embeddings, with thresholds calibrated on the development split.
Match rows across tables and merge content using row embeddings, recovering the removed cells end-to-end.
End-to-end union/join recall via UJ-H (their per-query harmonic mean), with Cell F1 as a complementary cell-recovery diagnostic.
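Two pieces of the stages above lend themselves to a short sketch: target recall@k for Stage 1, and picking a decision threshold on the development split for Stage 2. The sketch assumes the threshold is chosen to maximize F1 on dev and reduces the union / join / none decision to a single binary match score; both are simplifications, and the function names are hypothetical.

```python
def recall_at_k(ranked_ids, target_ids, k=100):
    """Fraction of target tables that appear among the top-k retrieved tables."""
    targets = set(target_ids)
    if not targets:
        return 0.0
    return len(set(ranked_ids[:k]) & targets) / len(targets)


def f1_at(threshold, scores, labels):
    """F1 of the rule 'predict match when score >= threshold'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def calibrate_threshold(dev_scores, dev_labels):
    """Pick the dev-set score that maximizes F1 when used as the cut-off (assumed criterion)."""
    candidates = sorted(set(dev_scores))
    return max(candidates, key=lambda t: f1_at(t, dev_scores, dev_labels))


# Toy dev-split column-alignment scores and gold match labels (made-up values).
dev_scores = [0.9, 0.8, 0.4, 0.3]
dev_labels = [True, True, False, False]
threshold = calibrate_threshold(dev_scores, dev_labels)
print(threshold, recall_at_k(["t3", "t1", "t9"], {"t1", "t2"}, k=2))
```

The threshold is fixed once on dev and then applied unchanged to test queries, so test data never influences the decision rule.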
End-to-end DLTE scores (UJ-H, higher is better). Headline pipelines are selected on the development split and evaluated once on test, which avoids the selection bias of picking the best of the 1,120 test evaluations.
| Pipeline (Stage 1 / 2 / 3) | Composition | Selection | UJ-H ↑ |
|---|---|---|---|
| Starmie / GTE / GTE | Hybrid | Test rank-1 | 0.253 |
| Starmie / BERT / TransTab | Hybrid | Dev marginals | 0.231 |
| TUTA / GTE / GTE | Hybrid | Dev-selected (headline) | 0.229 |
| BERT / BERT / BERT | Monolithic | Best monolith | 0.139 |
| Starmie / TABBIE / TransTab | Hybrid | Test marginals | 0.134 |
UJ-H is the per-query harmonic mean of union recall and join recall.
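A minimal sketch of the UJ-H computation, assuming the per-query harmonic means are aggregated by a plain average over queries (the aggregation step is an assumption; the per-query harmonic mean is from the text). The function names are hypothetical.

```python
def harmonic_mean(union_recall, join_recall):
    """Harmonic mean of two recalls; 0.0 when both are zero."""
    total = union_recall + join_recall
    return 2 * union_recall * join_recall / total if total > 0 else 0.0


def uj_h(per_query):
    """Average the per-query harmonic mean over (union_recall, join_recall) pairs."""
    if not per_query:
        return 0.0
    return sum(harmonic_mean(u, j) for u, j in per_query) / len(per_query)


# Two toy queries: each recovers only one of the two targets.
queries = [(1.0, 0.0), (0.0, 1.0)]
print(uj_h(queries))             # per-query harmonic mean, then average
print(harmonic_mean(0.5, 0.5))   # harmonic mean of the averaged recalls, for contrast
```

Taking the harmonic mean per query means a pipeline only scores on queries where it recovers both targets. In the toy example the per-query score is 0.0, while the harmonic mean of the averaged recalls would be 0.5.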
Composition gap. Stacking per-stage winners is not the best pipeline: the dev-selected hybrid TUTA/GTE/GTE (0.229 UJ-H) beats the best monolith BERT/BERT/BERT (0.139) by 0.090, while assembling each stage's marginal leader reaches only 0.134; the unrestricted test maximum is 0.253 (Starmie/GTE/GTE). Granularities interact rather than stack — capability-matched hybrids win.
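The difference between stacking per-stage winners and selecting a pipeline jointly can be illustrated with a small sketch. The dev scores below are made-up toy numbers, not benchmark results, and "marginal leader" is interpreted here as the model with the best dev score averaged over all choices at the other stages, which is an assumption about how the marginals are defined.

```python
from statistics import mean


def select_on_dev(dev_scores):
    """Joint selection: the single pipeline with the best dev score."""
    return max(dev_scores, key=dev_scores.get)


def marginal_leaders(dev_scores):
    """Per-stage selection: at each stage, the model with the best average dev score."""
    n_stages = len(next(iter(dev_scores)))
    picks = []
    for stage in range(n_stages):
        by_model = {}
        for pipeline, score in dev_scores.items():
            by_model.setdefault(pipeline[stage], []).append(score)
        picks.append(max(by_model, key=lambda model: mean(by_model[model])))
    return tuple(picks)


# Hypothetical dev scores for every (Stage 1, Stage 2, Stage 3) combination of two toy models.
dev_scores = {
    ("A", "A", "A"): 0.10, ("A", "A", "B"): 0.10,
    ("A", "B", "A"): 0.10, ("A", "B", "B"): 0.30,
    ("B", "A", "A"): 0.20, ("B", "A", "B"): 0.05,
    ("B", "B", "A"): 0.20, ("B", "B", "B"): 0.05,
}
joint = select_on_dev(dev_scores)
stacked = marginal_leaders(dev_scores)
print(joint, dev_scores[joint])
print(stacked, dev_scores[stacked])
```

In this toy grid the stacked pipeline of marginal leaders scores well below the jointly selected one, because the Stage 3 model that looks best on average is not the one that works best with the chosen Stage 1 and Stage 2 models.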
Figure: fly-through of the 1,120 TRL-DLTE pipelines, covering every Stage-1 × Stage-2 × Stage-3 combination and colored by end-to-end UJ-H.
Once downstream conditions are standardized, encoder quality is capability-specific rather than captured by a single leaderboard — and end-to-end quality depends on how capabilities compose, not on per-stage rank in isolation.
A capability profile of representative encoders — the shape, not any single score, is the point.
No model spans every capability. Different encoder families peak on different capability axes — surface-text tasks reward generic text encoders, while cross-table alignment and matching reward structure-aware specialists.
Read jointly, the three suites expose three structural gaps that single-paradigm evaluations cannot isolate: a specialization gap in model choice (CTbench), a transfer-scope gap between intra- and cross-table transfer (Rbench), and a composition gap where granularities interact rather than stack (DLTE).
If you find TRL-Bench useful for your research, please cite our work.
@article{pang2026trl,
title = {{TRL}-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders},
author = {Pang, Wei and Jian, Xiangru and Li, Hehan and Yu, Zhixuan and Xue, Alex and Li, Jinyang and Dong, Zhengyuan and Zhao, Xinjian and Xu, Hao and Zhang, Chao and Cheng, Reynold and {\"O}zsu, M. Tamer and Yu, Tianshu},
journal = {arXiv preprint arXiv:2606.09323},
year = {2026}
}
TRL-Bench spans three suites — TRL-CTbench (column/table), TRL-Rbench (row), and TRL-DLTE (compositional data-lake table enrichment).