Channel: Microsoft Research Lab - Asia Articles

Towards industrial foundation models: Integrating large language models with industrial data intelligence


Although large language models (LLMs) excel in language-focused tasks like news writing, document summarization, customer service, and virtual assistants, they face challenges in learning and inference on numeric and structured industrial data, such as tabular data and time series. To address these issues, researchers from Microsoft Research Asia have proposed a new approach to building industrial foundation models (IFMs) and demonstrated the feasibility and significant potential of cross-domain, universal in-context learning on tabular data.

They designed a Generative Tabular Learning (GTL) framework that integrates multi-industry zero-shot and few-shot learning capabilities into LLMs. This approach allows the models to adapt and generalize to new fields, new data, and new tasks more effectively, responding flexibly to diverse data science tasks. This technical paradigm has been officially open-sourced to promote the broader use of data science across different sectors and make advanced data intelligence accessible to everyone.

Unlocking the vast potential of industrial data

Researchers have identified untapped potential for LLMs in industrial data. Gathered from a wide range of industries and sectors, this data is stored in unique formats, such as tabular data for relational structures, time-series data for temporal patterns, and graph data for complex interconnections. These specialized formats contain valuable insights that are hard to capture in human language, making them largely absent from existing LLMs.

More importantly, industrial data and the intelligence it carries are foundational for critical applications across various domains. In energy storage, identifying patterns in battery cycling data can accelerate material screening during manufacturing, optimize charge-discharge protocols during usage, and guide value pricing during recycling. In commerce, historical sales and demand data can help forecast future demand and set pricing strategies. Importantly, this kind of intelligence is drawn not only from numerical and structured information but also from task-specific methodologies and domain-specific expertise.

To advance data intelligence applications across industries, researchers at Microsoft Research Asia propose the development of industrial foundation models (IFMs). Their approach involves post-training LLMs on industrial data science tasks, embedding specialized knowledge unique to various sectors and enhancing in-context data learning capabilities. This method aims to create IFMs capable of excelling in diverse data science tasks across industries, extracting data knowledge that spans tasks and domains, and performing predictive and logical reasoning tailored to industrial needs.

Building IFMs with tabular data

To implement the proposed approach, researchers have focused on developing IFMs using tabular data—one of the most common data formats, typically stored in relational databases and widely used across various domains.

The process begins with collecting diverse tabular datasets from various domains and converting them into an instruction-oriented format. This conversion accommodates diverse data schemas, including variations in feature semantics and numerical interpretations, and supports any combination of numerical and categorical features. Additionally, the approach incorporates data samples along with optional metadata, enables both regression and classification tasks, and is designed to handle zero-shot and in-context learning scenarios effectively.
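To make this conversion concrete, the sketch below shows one plausible way a single tabular sample with mixed numerical and categorical features could be serialized into an instruction-oriented prompt. The template, field names, and task wording here are hypothetical illustrations, not the exact format used in the GTL codebase:

```python
def serialize_sample(features, target_name, task_desc, label=None):
    """Render one tabular sample as an instruction-style text prompt.

    features: dict mapping feature name -> value (numeric or categorical).
    label: if given, appended as the answer (training sample); if None,
    the prompt ends at the answer slot (inference-time query).
    """
    lines = [f"Task: {task_desc}"]
    for name, value in features.items():
        lines.append(f"{name}: {value}")
    # The label slot: filled for training examples, left open for queries.
    lines.append(f"{target_name}:" + (f" {label}" if label is not None else ""))
    return "\n".join(lines)

# Hypothetical example with both numerical and categorical features.
prompt = serialize_sample(
    features={"age": 41, "occupation": "engineer", "hours-per-week": 50},
    target_name="income>50K",
    task_desc="Predict whether annual income exceeds 50K.",
    label="yes",
)
print(prompt)
```

Because the serialization is plain text, the same template accommodates arbitrary schemas: new features simply become new lines, and regression tasks differ from classification only in the value written into the label slot.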

However, integrating the language processing capabilities of LLMs with tabular data learning poses significant challenges. A key issue is that LLMs are generally pretrained on natural language data, leaving them less equipped to handle the nuanced and structured nature of tabular data. Additionally, they often lack the domain-specific knowledge essential for effective learning and inference in tabular formats.

To address these challenges, researchers have introduced a post-training approach called generative tabular learning (GTL). This process enhances LLMs by integrating data knowledge and statistical learning capabilities through an autoregressive generative modeling technique applied to feature and label tokens.
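In spirit, this objective is standard next-token prediction applied to the serialized sample, so that both feature tokens and label tokens contribute to the autoregressive loss. The toy sketch below illustrates the quantity being minimized; the uniform "model" and the four-token vocabulary are purely illustrative, not GTL's actual tokenizer or training code:

```python
import math

def autoregressive_nll(tokens, next_token_prob):
    """Negative log-likelihood of a token sequence under a next-token
    model: sum over positions t >= 1 of -log p(x_t | x_<t)."""
    nll = 0.0
    for t in range(1, len(tokens)):
        p = next_token_prob(tokens[:t], tokens[t])
        nll += -math.log(p)
    return nll

# Toy serialized sample: feature tokens followed by label tokens.
tokens = ["age:", "41", "label:", "yes"]

# Toy model assigning uniform probability over a 4-token vocabulary.
uniform = lambda prefix, tok: 1.0 / 4

loss = autoregressive_nll(tokens, uniform)  # 3 predicted positions, each -log(1/4)
```

During GTL post-training, minimizing this loss over many domains' serialized tables is what pushes the LLM to internalize statistical relationships between features and labels, rather than only linguistic patterns.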

Following this post-training stage, a GTL-enhanced LLM can be directly applied to new industrial data schemas and tasks by adjusting instruction prompts, eliminating the need for complex parameter tuning. The enhanced model can also generalize across diverse domains, data patterns, and tasks, marking a significant step forward in applying LLMs to industrial contexts. Figure 1 illustrates the end-to-end process.
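Applying the model to a new schema then amounts to prompt assembly: prepend a handful of labeled examples (few-shot) or none at all (zero-shot) before the unlabeled query. The helper below sketches that assembly step under an assumed rendering format; the feature names and labels are invented for illustration:

```python
def build_icl_prompt(examples, query, render):
    """Assemble a few-shot in-context prompt: labeled examples
    followed by the unlabeled query sample."""
    parts = [render(features, label) for features, label in examples]
    parts.append(render(query, None))  # query's label slot left open
    return "\n\n".join(parts)

def render(features, label):
    """Hypothetical one-line rendering: 'k=v; ... -> label'."""
    body = "; ".join(f"{k}={v}" for k, v in features.items())
    return body + " -> " + (str(label) if label is not None else "")

# Invented battery-health example: two labeled samples plus one query.
examples = [
    ({"cycles": 120, "temp": 25}, "healthy"),
    ({"cycles": 900, "temp": 45}, "degraded"),
]
prompt = build_icl_prompt(examples, {"cycles": 300, "temp": 30}, render)
print(prompt)
```

Switching between zero-shot and few-shot inference is just a matter of how many examples are passed in; no parameters of the GTL-enhanced model change.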

Figure 1. The pipeline for building IFMs on tabular data

GTL significantly boosts LLaMA’s comprehension of tabular data

To evaluate the effectiveness of GTL, researchers compiled over 400 datasets and applied rigorous deduplication filtering, resulting in 384 datasets. Of these, 44 were reserved for evaluation, while the remaining datasets were used to generate over 1,000 distinct prediction tasks for GTL. The study used Meta’s LLaMA 13B model as the base LLM and compared its performance against both open-source and proprietary LLMs, as well as traditional tabular learning algorithms.

Figure 2 summarizes the comparison between GTL-enhanced LLaMA and other baseline models. The results reveal that GTL significantly improves LLaMA’s ability to interpret tabular data. Notably, the GTL-enhanced LLaMA, despite its smaller size, achieves performance that is competitive with—and in some cases superior to—GPT-4. It is worth noting, however, that assessing data contamination risks for GPT-4 is challenging, as its training data includes publicly available web content, which could provide it with an unintended advantage. Additionally, the GTL-enhanced LLaMA shows remarkable in-context learning capabilities, outperforming traditional tabular learning methods in few-shot scenarios.

Figure 2. A summary of the experimental results

The researchers also conducted a preliminary study to investigate scaling laws for GTL. The findings, illustrated in Figure 3, reveal that both dataset diversity and model size drive performance improvements on holdout datasets in a power-law manner. These results highlight the potential of IFMs to generalize effectively across a wide range of tasks and domains, making advanced data intelligence tools accessible even to industries with limited data resources.
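A power-law relationship of this kind can be checked by fitting a straight line in log-log space, since y = c · x^alpha implies log y = log c + alpha · log x. The sketch below shows the procedure on synthetic points; the numbers are fabricated for illustration and are not the paper's measurements:

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = c * x**alpha via linear regression
    on (log x, log y); returns (alpha, c)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    alpha = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) \
        / sum((a - mx) ** 2 for a in lx)
    c = math.exp(my - alpha * mx)
    return alpha, c

# Synthetic points lying exactly on y = 2 * x**-0.5
# (e.g., holdout error shrinking as dataset diversity grows).
xs = [1, 4, 16, 64]
ys = [2 * x ** -0.5 for x in xs]
alpha, c = fit_power_law(xs, ys)  # recovers alpha = -0.5, c = 2
```

On real measurements the points scatter around the fitted line, and the exponent alpha quantifies how quickly performance improves with scale.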

Figure 3. A preliminary study of scaling laws

Broadening the horizons of IFMs

GTL has paved the way for conversational tabular data deep learning, enabling users to perform analysis, prediction, reasoning, and decision-making through direct interactions with the model. By integrating GTL with language models, the system not only generates predictive results but also provides explanations, enhancing the interpretability of tabular data learning. Inspired by the potential of this paradigm, researchers at Microsoft Research Asia envision promising directions for foundational industry models.

The first direction involves scaling this approach across multiple dimensions, such as expanding the variety and size of datasets, increasing model sizes, extending context lengths, and incorporating diverse data formats like time-series and graph data. These enhancements will allow IFMs to handle a broader range of tasks and domains with greater precision and adaptability. Additionally, integrating industrial data knowledge with advances in the modern LLM ecosystem—such as tool usage, intelligent agents, and interactive applications—can further increase IFMs’ capabilities. This synergy can lead to more robust, versatile models that seamlessly blend industrial data intelligence with the sophisticated functionalities of contemporary LLMs.

The second direction emphasizes the transformative potential of IFMs in industrial data intelligence. This shift calls for a reimagining of user interfaces and toolchains for data science, paving the way for innovative products and services like data science copilots. These copilots can assist domain experts by providing advanced data analysis and predictive capabilities without requiring deep technical expertise, democratizing access to cutting-edge data science tools. Moreover, IFMs can serve as powerful decision-making tools for business leaders and industry practitioners. By delivering comprehensive insights and personalized analytics, IFMs can help companies make more informed strategic decisions, optimize operations, and identify new opportunities for growth and innovation.

The development of IFMs marks a significant step in merging LLM capabilities with industrial data knowledge. By continuing to scale, innovate, and design user-centric tools, researchers can make advanced data intelligence accessible and unlock innovation across various industries. This vision aligns with researchers’ original goal of leveraging LLMs to complete instruction-centric tasks, extract cross-sector knowledge, and perform industrial predictive and logical reasoning. Moving forward, researchers remain committed to pushing the boundaries of what’s possible, helping unlock new potential for industries worldwide.

The post Towards industrial foundation models: Integrating large language models with industrial data intelligence appeared first on Microsoft Research.

