A Multi-Granular Tabular Representation Learning Benchmark

TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

Wei Pang¹,*, Xiangru Jian²,*, Hehan Li¹,*, Zhixuan Yu¹,*, Alex Xue²,*, Jinyang Li³, Zhengyuan Dong², Xinjian Zhao¹, Hao Xu⁴, Chao Zhang⁵, Reynold Cheng³, M. Tamer Özsu², Tianshu Yu¹,†

¹The Chinese University of Hong Kong, Shenzhen ²University of Waterloo ³The University of Hong Kong ⁴The University of Sydney ⁵Université Lyon 1

* Core contributors. † Corresponding author: yutianshu@cuhk.edu.cn

Tabular encoders span many paradigms — different inputs, training objectives, and output heads — so models are hard to compare even when they operate on similar tabular signals. TRL-Bench unifies them at the level of the representation: each model exports row-, column-, or table-embeddings through one shared wrapper, and lightweight shared heads probe them across 20 encoders, 16 tasks, and 87 datasets in three suites.


Datasets: TRL-CTbench · TRL-Rbench · TRL-DLTE

TRL-Bench at a glance: each model is processed once through its supported wrapper to export row, column, and table embeddings; shared lightweight modules then evaluate those embeddings across TRL-CTbench, TRL-Rbench, and TRL-DLTE

Overview

Tabular encoders are usually evaluated inside task-specific end-to-end pipelines, so a strong result may come from the downstream predictor, the training budget, and task-specific adaptation as much as from the encoder itself. TRL-Bench asks a comparability question: under one shared protocol over the exported representations, how do heterogeneous tabular encoders actually differ?

20 encoders, across every paradigm

16 tasks, over three suites

87 datasets: SATO, SOTAB, Spider, OpenML…

3 suites: row · column · table

47,772 lake tables in the TRL-DLTE enrichment lake

1,120 DLTE pipelines: 10 × 8 × 14 stage combinations

Encode once, reuse many

In encode-once, reuse-many settings, tables are embedded once and reused across tasks and large multi-table corpora such as data lakes. The representation — not the task-specific wrapper — becomes the object of evaluation.

Representation-level protocol

Each model exports frozen row-, column-, or table-embeddings; shared lightweight heads (training-free, learned probes, or query-conditioned) read them out. Comparisons reflect the exported embedding under common readouts, not the choice of downstream predictor.
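To make the idea of a training-free readout concrete, here is a minimal sketch of a nearest-centroid probe over frozen embeddings. It is an illustration only: the function names and the toy vectors are assumptions, and the actual TRL-Bench heads may be defined differently.

```python
import math
from collections import defaultdict


def centroid_probe_fit(embeddings, labels):
    """Training-free readout: one mean vector per class over frozen embeddings."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def centroid_probe_predict(centroids, vec):
    """Predict the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


# Toy frozen row embeddings (made-up numbers, not benchmark data).
train_emb = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.8, 1.1]]
train_lab = ["a", "a", "b", "b"]

centroids = centroid_probe_fit(train_emb, train_lab)
prediction = centroid_probe_predict(centroids, [0.9, 1.0])
```

Because the encoder is never updated, any difference between two models under this readout comes from the embeddings themselves.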

A benchmark and a library

Many wrapped models were never built to emit embeddings — in-context predictors (TabPFN, TabICL), table-QA/parsing models (TAPAS, TAPEX, TaBERT), and self-supervised learners (SAINT, SCARF, SubTab, VIME). One shared wrapper turns each into a reusable tabular embedding model.
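One way to picture such a shared wrapper is as a small interface that every model implements. The sketch below is hypothetical: `TabularEmbedder`, its method names, and the toy `HashingEmbedder` are assumptions made for illustration and are not the TRL-Bench library API.

```python
import hashlib
from typing import List, Protocol, Sequence

Table = Sequence[Sequence[str]]  # rows of cell strings; the first row is the header


class TabularEmbedder(Protocol):
    """Hypothetical interface; the real TRL-Bench wrapper may differ."""

    def embed_rows(self, table: Table) -> List[List[float]]: ...
    def embed_columns(self, table: Table) -> List[List[float]]: ...
    def embed_table(self, table: Table) -> List[float]: ...


def _cell_vector(cell: str, dim: int) -> List[float]:
    # Deterministic stand-in for a learned encoder: bytes of a SHA-256 digest.
    digest = hashlib.sha256(cell.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]


def _mean(vectors: List[List[float]]) -> List[float]:
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


class HashingEmbedder:
    """Toy model that satisfies the interface (dim must be at most 32)."""

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def embed_rows(self, table: Table) -> List[List[float]]:
        # One vector per data row; the header row is skipped.
        return [_mean([_cell_vector(c, self.dim) for c in row]) for row in table[1:]]

    def embed_columns(self, table: Table) -> List[List[float]]:
        # One vector per column, pooling the header and all cells.
        return [_mean([_cell_vector(c, self.dim) for c in col]) for col in zip(*table)]

    def embed_table(self, table: Table) -> List[float]:
        return _mean(self.embed_columns(table))


toy = [["id", "city"], ["1", "Paris"], ["2", "Rome"]]
encoder = HashingEmbedder(dim=4)
row_vecs = encoder.embed_rows(toy)
col_vecs = encoder.embed_columns(toy)
table_vec = encoder.embed_table(toy)
```

Downstream heads then depend only on the three `embed_*` methods, not on how a given model produces its vectors.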

Three Suites, Three Granularities

TRL-Bench treats retrieval, schema alignment, linkage, prediction, and grounding as atomic capabilities, and measures them at the granularities where embeddings are actually reused — columns and tables, rows, and their composition.

TRL-CTbench

Column & table transfer · 13 tasks

Eight column-level and five table-level tasks across schema understanding, joinability, unionability, and grounding (table QA, retrieval & subset). Twenty standardized column/table datasets; evaluated on the 10 of 20 models that natively expose column or table embeddings.

Leader: Starmie — 0.662 MAP Union Search, 0.764 R@GT Schema Matching.


TRL-Rbench

Row transfer · prediction + linkage

Within-table row prediction over 50 OpenML-derived tables with 123 hand-verified targets (77 classification, 46 regression), and cross-table record linkage over 16 datasets, split into Clean and Robust linkage. Tests whether one row embedding transfers both within and across tables.

Leader: TabICL — 0.816 AUROC row prediction; BERT/GTE lead Clean/Robust linkage.


TRL-DLTE

Compositional · all three granularities

Multi-stage Data-Lake Table Enrichment over a 47,772-table lake built from 1,379 parent tables. Retrieve with table embeddings, align columns with column embeddings, match rows with row embeddings — a single composition test for the whole benchmark.

Best pipeline: TUTA/GTE/GTE — 0.229 UJ-H; capability-matched hybrids beat monoliths.



Curated assets. (a) TRL-Rbench row-prediction curation: 158 candidate tables pass rule screening, degeneracy audit, and human review with label repair into 50 tables with 123 targets. (b) TRL-DLTE lake assembly: 1,379 TabFact/WTQ parents are fragmented at four noise tiers; 11,032 targets are embedded alongside 36,740 CKAN distractors in a 47,772-table lake (parent-disjoint splits).

Inside TRL-CTbench: Column & Table Transfer

Thirteen tasks — eight column-level, five table-level — probe column- and table-embeddings across four capabilities, on the 10 of 20 models that natively expose them. No pretraining recipe leads everywhere: family-level rank shifts from surface-text tasks to cross-table geometry.

Schema understanding

Column type prediction & relation extraction. Generic text encoders lead — BERT takes the best Schema family rank (NR 0.000 ↓) where headers and short cell strings carry most of the signal.

Joinability

Join search across tables. GTE leads Join Search — its retrieval-contrastive pretraining matches the cross-table cosine setup better than generic schema encoders.
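The cross-table cosine setup mentioned here can be sketched as a plain top-k search over column embeddings. The column names and vectors below are invented for illustration; the benchmark's actual search procedure may add indexing or filtering steps.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, candidates, k):
    """Return the k candidate names most similar to the query embedding."""
    ranked = sorted(candidates, key=lambda name: cosine(query_vec, candidates[name]), reverse=True)
    return ranked[:k]


# Hypothetical column embeddings from a lake of candidate join columns.
lake = {
    "orders.customer_id": [0.9, 0.1],
    "products.sku": [0.0, 1.0],
    "customers.id": [1.0, 0.0],
}
joinable = top_k([1.0, 0.0], lake, k=2)
```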

Unionability

Union search & schema matching — cross-table alignment geometry. Starmie's contrastive objective wins both: 0.662 MAP (Union Search) and 0.764 R@GT (Schema Matching).
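For readers unfamiliar with the MAP figure quoted for Union Search, the following sketch shows the standard mean-average-precision computation over ranked retrieval lists. The table identifiers are placeholders, and the benchmark may apply a rank cutoff that is not modelled here.

```python
def average_precision(ranked, relevant):
    """Mean of precision@rank taken at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(queries):
    """queries: list of (ranked_list, relevant_set) pairs, one per query table."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


# Two toy queries with placeholder table ids.
queries = [
    (["t3", "t1", "t7", "t2"], {"t1", "t2"}),  # hits at ranks 2 and 4 -> AP 0.5
    (["t5", "t9"], {"t5"}),                    # hit at rank 1 -> AP 1.0
]
map_score = mean_average_precision(queries)
```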

Grounding

Table QA, retrieval & subset. The text-table family leads — TaBERT best Grounding rank (NR 0.198 ↓); task wins split between TURL (Table QA, 0.277 Acc) and GTE (Table Retrieval, 0.476 MRR).
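The MRR reported for Table Retrieval is the mean reciprocal rank of the first correct table per query. A minimal sketch, using placeholder ids and assuming a query with no correct table in its list contributes zero:

```python
def reciprocal_rank(ranked, gold):
    """1 / rank of the first gold item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results):
    """results: list of (ranked_list, gold_set) pairs, one per question."""
    return sum(reciprocal_rank(r, g) for r, g in results) / len(results)


results = [
    (["a", "b", "c"], {"b"}),  # first hit at rank 2
    (["x", "y"], {"x"}),       # first hit at rank 1
]
mrr = mean_reciprocal_rank(results)
```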

Specialization gap. Generic-text rank degrades from Schema → Grounding (BERT NR 0.000 → 0.397), while specialists win exactly the Union and Grounding tasks that reward structure — no recipe behaves like a universal representation.

Inside TRL-Rbench: Row Transfer

One row embedding, two transfer regimes — within-table prediction and cross-table linkage. They reward different training scopes, so no encoder leads both.

Row Prediction

Within-table · intra-table transfer

Prediction over 50 OpenML tables with 123 targets (77 classification, 46 regression). Prior-based TabICL leads — 0.816 AUROC (Macro-F1 0.671) — layering target-table adaptation on a meta-pretrained prior.
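As a reminder of what the Macro-F1 figure measures, the sketch below computes per-class F1 and averages it with equal weight per class. Labels and predictions are toy values, not benchmark outputs.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in truth or prediction."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
score = macro_f1(y_true, y_pred)  # class a: 2/3, class b: 4/5
```

Unlike accuracy, this average does not let a frequent class mask poor performance on a rare one.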

Record Linkage

Cross-table · inter-table transfer

Identity resolution over 16 datasets, split into Clean and Robust. Transfer encoders win: BERT on Clean Linkage (0.418 F1), GTE on Robust Linkage (NR 0.048 ↓), with TransTab second on Robust.

Transfer-scope gap. Target-table self-supervision fits locally (strong on prediction, weak on linkage); shared transfer encoders give comparable row spaces (strong on linkage, mid-pack on prediction). The same row-matching capability resurfaces as TRL-DLTE Stage 3 — the two agree at |ρ| = 0.80.

Inside TRL-DLTE: Compositional Enrichment

From a complete parent table we remove a block of rows (the union target) and a block of columns (the join target); the remainder is a seed query. Given only the seed and a data lake, a pipeline must recover both targets — composing table, column, and row embeddings across three stages.
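The fragmentation step can be sketched on a toy parent table. This is a simplified illustration under stated assumptions: the union target is taken to keep the seed's columns, the join target is taken to carry a key column so it can be joined back, and the four noise tiers are not modelled. The function name and arguments are hypothetical.

```python
def fragment(header, rows, n_union_rows, join_cols, key_col):
    """Split a parent table into (seed, union_target, join_target)."""
    keep_cols = [c for c in header if c not in join_cols]
    split = len(rows) - n_union_rows
    seed_rows, union_rows = rows[:split], rows[split:]

    def project(row, cols):
        return [row[header.index(c)] for c in cols]

    seed = [keep_cols] + [project(r, keep_cols) for r in seed_rows]
    # Removed block of rows: same schema as the seed, different rows.
    union_target = [keep_cols] + [project(r, keep_cols) for r in union_rows]
    # Removed block of columns: same rows as the seed, plus the assumed key.
    join_header = [key_col] + join_cols
    join_target = [join_header] + [project(r, join_header) for r in seed_rows]
    return seed, union_target, join_target


header = ["id", "city", "pop"]
rows = [[1, "A", 10], [2, "B", 20], [3, "C", 30]]
seed, union_target, join_target = fragment(header, rows, 1, ["pop"], "id")
```

A pipeline that recovers both targets can append the union rows and join the missing column back onto the seed.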

Table Retrieval

Retrieve candidate tables from the 47,772-table lake using table embeddings (target recall@100), against 36,740 CKAN distractors.
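Target recall@100 asks what fraction of a query's true targets appear among the top 100 retrieved tables. A minimal sketch with placeholder table ids:

```python
def recall_at_k(ranked, targets, k=100):
    """Fraction of target tables found in the top-k of the ranked list."""
    if not targets:
        return 0.0
    top = set(ranked[:k])
    return sum(1 for t in targets if t in top) / len(targets)


# Toy ranking of 200 lake tables; one target is inside the top 100, one is not.
ranked = [f"t{i}" for i in range(200)]
targets = {"t5", "t150"}
r_at_100 = recall_at_k(ranked, targets, k=100)
```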

Column Alignment

Align columns and predict union / join / none using column embeddings, with thresholds calibrated on the development split.
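Threshold calibration on a development split can be sketched as a search for the similarity cutoff that maximizes F1 on labelled dev pairs. This is one plausible procedure, not necessarily the one TRL-DLTE uses, and it covers only a single match/no-match threshold rather than the full union / join / none decision.

```python
def f1_at(threshold, scores, labels):
    """F1 of the rule 'score >= threshold means match' on labelled pairs."""
    tp = sum(s >= threshold and y for s, y in zip(scores, labels))
    fp = sum(s >= threshold and not y for s, y in zip(scores, labels))
    fn = sum(s < threshold and y for s, y in zip(scores, labels))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def calibrate_threshold(dev_scores, dev_labels):
    """Pick the observed score that gives the best dev F1 (lowest such score on ties)."""
    candidates = sorted(set(dev_scores))
    return max(candidates, key=lambda t: f1_at(t, dev_scores, dev_labels))


# Toy dev pairs: column-pair cosine scores and whether the pair truly aligns.
dev_scores = [0.9, 0.8, 0.4, 0.3]
dev_labels = [True, True, False, False]
threshold = calibrate_threshold(dev_scores, dev_labels)


def decide(score):
    """Apply the frozen dev threshold to an unseen column pair."""
    return "match" if score >= threshold else "none"
```

Fixing the threshold on dev and reusing it unchanged on test keeps the test split free of tuning.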

Row Matching & Enrichment

Match rows across tables and merge content using row embeddings, recovering the removed cells end-to-end.


Scoring — UJ-H

End-to-end union/join recall via UJ-H (their per-query harmonic mean), with Cell F1 as a complementary cell-recovery diagnostic.
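UJ-H can be sketched directly from its definition as a per-query harmonic mean of union recall and join recall. Averaging the per-query values into one score is an assumption here, as are the toy recall pairs.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative values; 0.0 when both are zero."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0


def uj_h(per_query):
    """per_query: list of (union_recall, join_recall) pairs, one per seed query."""
    return sum(harmonic_mean(u, j) for u, j in per_query) / len(per_query)


# Toy recalls: a perfect query, a query that misses the join, a partial one.
per_query = [(1.0, 1.0), (0.5, 0.0), (0.6, 0.3)]
score = uj_h(per_query)  # per-query values: 1.0, 0.0, 0.4
```

The harmonic mean is zero whenever either recall is zero, so a pipeline cannot score well by recovering only one of the two targets.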

Representative TRL-DLTE pipelines

End-to-end DLTE scores (UJ-H, higher is better). Headline pipelines are selected on the development split and evaluated once on test to avoid selection bias over 1,120 test evaluations.

Pipeline (Stage 1 / 2 / 3)	Composition	Selection	UJ-H ↑
Starmie / GTE / GTE	Hybrid	Test rank-1	0.253
Starmie / BERT / TransTab	Hybrid	Dev marginals	0.231
TUTA / GTE / GTE	Hybrid	Dev-selected (headline)	0.229
BERT / BERT / BERT	Monolithic	Best monolith	0.139
Starmie / TABBIE / TransTab	Hybrid	Test marginals	0.134

UJ-H is the harmonic mean of union- and join-recall.

Composition gap. Stacking per-stage winners does not yield the best pipeline: the dev-selected hybrid TUTA/GTE/GTE (0.229 UJ-H) beats the best monolith BERT/BERT/BERT (0.139) by 0.090, while assembling each stage's test-marginal leader (Starmie/TABBIE/TransTab) reaches only 0.134; the unrestricted test maximum is 0.253 (Starmie/GTE/GTE). Granularities interact rather than stack, so capability-matched hybrids win.
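The select-on-dev, report-on-test discipline behind the headline pipeline can be sketched as follows. The stage sizes 10, 8, and 14 come from the pipeline grid described on this page; the encoder names and all scores are placeholders generated by a formula, not benchmark results.

```python
from itertools import product

# Placeholder encoder names for the three stages (10 x 8 x 14 = 1,120 pipelines).
stage1 = [f"table_enc_{i}" for i in range(10)]
stage2 = [f"column_enc_{i}" for i in range(8)]
stage3 = [f"row_enc_{i}" for i in range(14)]
pipelines = list(product(stage1, stage2, stage3))


def select_on_dev(dev_scores):
    """Return the pipeline with the highest development-split score."""
    return max(dev_scores, key=dev_scores.get)


# Synthetic scores standing in for real UJ-H values.
dev_scores = {p: ((i * 37) % 101) / 100 for i, p in enumerate(pipelines)}
test_scores = {p: ((i * 53) % 97) / 100 for i, p in enumerate(pipelines)}

chosen = select_on_dev(dev_scores)  # the only step that looks at dev scores
headline = test_scores[chosen]      # test is read once, for the chosen pipeline
test_max = max(test_scores.values())  # unrestricted test maximum, for reference only
```

The headline number can fall below the unrestricted test maximum, which is the price of not selecting on the test split.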


Key Findings

Once downstream conditions are standardized, encoder quality is capability-specific rather than captured by a single leaderboard — and end-to-end quality depends on how capabilities compose, not on per-stage rank in isolation.

20 encoders: no single universal winner

0.139 → 0.229 UJ-H: hybrids beat the best monolith

|ρ| = 0.80: row identity transfers across suites (p = 6.3e-4)

Where each encoder family peaks

A capability profile of representative encoders — the shape, not any single score, is the point.

Radar chart in which different tabular encoder families spike on different capability axes, illustrating that no single model dominates across capabilities

No model spans every capability. Different encoder families peak on different capability axes — surface-text tasks reward generic text encoders, while cross-table alignment and matching reward structure-aware specialists.

Read jointly, the three suites expose three structural gaps that single-paradigm evaluations cannot isolate: a specialization gap in model choice (CTbench), a transfer-scope gap between intra- and cross-table transfer (Rbench), and a composition gap where granularities interact rather than stack (DLTE).

Citation

If you find TRL-Bench useful for your research, please cite our work.

@article{pang2026trl,
  title   = {{TRL}-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders},
  author  = {Pang, Wei and Jian, Xiangru and Li, Hehan and Yu, Zhixuan and Xue, Alex and Li, Jinyang and Dong, Zhengyuan and Zhao, Xinjian and Xu, Hao and Zhang, Chao and Cheng, Reynold and {\"O}zsu, M. Tamer and Yu, Tianshu},
  journal = {arXiv preprint arXiv:2606.09323},
  year    = {2026}
}

TRL-Bench spans three suites — TRL-CTbench (column/table), TRL-Rbench (row), and TRL-DLTE (compositional data-lake table enrichment).