Google AI Unveils TabFM for Zero-Shot Tabular Predictions
Google Research has introduced TabFM, a foundation model built specifically for tabular data. It handles both classification and regression without dataset-specific training, a significant departure from traditional machine learning approaches. Each prediction comes from a single forward pass, which reframes tabular prediction as an in-context learning problem. The model is available on Hugging Face and GitHub for researchers and developers.
TabFM's core innovation is predicting on unseen tables without any training, tuning, or feature engineering: it processes the entire dataset as a single prompt. The architecture is a hybrid that combines TabPFN-style row and column attention with TabICL-style in-context learning. It was trained on hundreds of millions of synthetic datasets generated from structural causal models. Google BigQuery is slated to integrate TabFM soon, offering it through an AI.PREDICT SQL command.
Tabular data is fundamental to enterprise data infrastructure, underpinning critical tasks such as customer churn analysis and financial fraud detection. Historically, tree-based methods like XGBoost, AdaBoost, and random forests have been the dominant tools for structured data. These methods often need extensive hyperparameter optimization and feature engineering for each new dataset, which can consume significant data scientist time. TabFM aims to relieve this bottleneck by bringing the zero-shot paradigm popularized by large language models to tabular data. Just as LLMs learn new tasks from in-context examples without weight updates, TabFM generates predictions on unseen tables in a single pass.
Unlike traditional models that update parameters based on a dataset's specific distribution, TabFM bypasses this step by treating the entire dataset, including training examples and target testing rows, as a unified prompt.
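The idea of treating labeled rows and query rows as one prompt can be illustrated with a small sketch. This is not TabFM's architecture or API: it is a toy stand-in that uses distance-based softmax attention over the context rows, and the function name `in_context_predict`, the `temperature` parameter, and the sample table are all hypothetical. What it shares with the approach described above is that nothing is fitted; swapping in a different table changes the predictions without any training step.

```python
import math


def in_context_predict(context_rows, context_labels, query_rows, temperature=1.0):
    """Toy in-context classifier (illustrative only, not TabFM).

    Each query row attends over the labeled context rows; the predicted
    class is the one that collects the most attention mass. No parameters
    are learned or updated at any point.
    """
    classes = sorted(set(context_labels))
    predictions = []
    for query in query_rows:
        # Attention logits: negative squared Euclidean distance to each context row.
        logits = [
            -sum((q - c) ** 2 for q, c in zip(query, row)) / temperature
            for row in context_rows
        ]
        # Numerically stable softmax over the context rows.
        peak = max(logits)
        weights = [math.exp(logit - peak) for logit in logits]
        total = sum(weights)
        # Class score = total attention placed on context rows with that label.
        scores = {label: 0.0 for label in classes}
        for weight, label in zip(weights, context_labels):
            scores[label] += weight / total
        predictions.append(max(classes, key=lambda label: scores[label]))
    return predictions


# Hypothetical churn-style table: labeled rows act as the "prompt".
context_rows = [(1.0, 0.9), (0.8, 1.1), (5.0, 5.2), (5.1, 4.8)]
context_labels = ["stay", "stay", "churn", "churn"]
query_rows = [(0.9, 1.0), (5.2, 5.0)]

predictions = in_context_predict(context_rows, context_labels, query_rows)
print(predictions)  # ['stay', 'churn']
```

A real tabular foundation model replaces the fixed distance function with learned attention layers, but the calling pattern is the same: labeled rows go in alongside the rows to predict, and the answer comes back from one pass.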
Original source: MarkTechPost.