Large language models (LLMs), led by GPT and followed by numerous other models, have demonstrated strong capabilities in many areas, from language processing tasks such as text generation and document summarization to coding, reasoning, and planning. LLMs have fueled a new wave of the AI revolution that will have a deep impact on every aspect of society.
While the theory of computation has been developed over many decades, and the theory of machine learning and deep learning has been extensively studied, the theoretical foundation of LLMs is still a wide-open and largely unexplored territory.
Microsoft Research Asia would like to invest in a collaborative effort to address this fundamental question facing LLMs and AI in general. We believe that building a theoretical foundation for LLMs is crucial to the safe and sustainable development of advanced AI technologies for the benefit of the entire society. If you are an aspiring researcher with a zeal for exploring LLMs, we invite you to apply to the Microsoft Research Asia StarTrack Scholars Program. Applications are now open for the 2025 program. For more details and to submit your application, visit our official website: Microsoft Research Asia StarTrack Scholars Program – Microsoft Research
Research topics and directions: Bridging the cognitive gap
The underlying mechanism of all LLMs can be explained simply as next-word prediction: predicting the next word to be generated as accurately as possible based on a large amount of training text data. There seems to be a huge cognitive gap between the seemingly simple next-word-prediction mechanism and the amazingly intelligent capabilities of language processing, knowledge retrieval, theory of mind, and planning. How can we bridge this gap and connect next-word prediction with intelligent capabilities? Are there fundamental limitations of LLMs based on next-word prediction? A principled understanding of these crucial questions is still lacking.
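To make the next-word-prediction mechanism concrete, here is a minimal sketch in Python. It is a toy bigram counter, not a Transformer and not any model discussed in this post; the function names and the tiny corpus are hypothetical and chosen only for illustration. It shows the two ingredients described above: learning next-word statistics from training text, and generating autoregressively by feeding each prediction back in as input.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent successor of `word`, or None if it was never seen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]


def generate(counts, start, max_words=5):
    """Autoregressive generation: each predicted word becomes the next input."""
    out = [start]
    for _ in range(max_words):
        nxt = predict_next(counts, out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return " ".join(out)


# Hypothetical toy corpus, for illustration only.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
model = train_bigram(corpus)
print(predict_next(model, "the"))
print(generate(model, "dog", max_words=3))
```

A model like this can only reproduce local word statistics, which is one way to see how large the gap is between the prediction objective and the capabilities observed in LLMs.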
Although LLMs are based on deep learning techniques, their autoregressive nature of learning and their crucial dependency on large training data, large model size, and large training-time and test-time compute make LLMs distinct from existing computation models. Existing computational and statistical learning results cannot fully explain the intelligent capabilities that emerge from these LLMs.
Therefore, the team believes that innovative approaches and paradigms are needed to understand and explain the power of LLMs. Together, they may build a new foundation for LLMs and for AI in general. These novel approaches may come from a combination of the following research directions:
Understanding and enhancing the intelligent capabilities of reasoning and planning
LLMs based on autoregressive learning possess versatile capabilities, from language manipulation and knowledge retrieval to planning and reasoning [1]. While the language and knowledge retrieval capabilities may be attributed to vast amounts of training data and their statistical training and compression for next-word prediction [2,3,4], it is still quite unclear where the planning and reasoning capabilities of LLMs come from. Some empirical evidence shows that LLMs have limited planning and reasoning capabilities, even in the case of the most advanced reasoning engine, OpenAI’s o1 system [5,6], but the reasoning capability can be greatly enhanced in specific domains if carefully curated synthetic training data are applied or external reasoning and verification engines are integrated [7].
The research team at the Theory Center of Microsoft Research Asia, together with academic collaborators, has applied theoretical analysis to unveil some internal mechanisms of the Transformer architecture that enable the core path-planning task, and has found certain limitations of the Transformer in terms of path planning [8]. However, a deeper understanding is required to enhance the planning and reasoning capabilities of LLMs.
Theoretical model and explanation of the scaling law and emergent behavior
It is now well known that the scaling law and emergent behaviors are instrumental to the enormous success of LLMs. However, why and under what conditions the Transformer architecture with the autoregressive learning mechanism gives rise to the scaling law and emergent behaviors is still a mystery. Some preliminary theoretical studies have attempted to explain these phenomena [9,10], but much more systematic modeling and investigation are needed to understand these crucial aspects of LLMs.
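For readers less familiar with the term, empirical scaling laws are commonly summarized as an approximately power-law relationship between test loss and a resource such as model size. The sketch below assumes the simple form loss ≈ a · size^(−alpha) and shows how such an exponent can be estimated by a straight-line fit in log-log space. The functional form, the function name, and all numbers are assumptions for illustration; the data are synthetic and are not measurements from any real model.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss ~ a * size**(-alpha) by ordinary least squares on log-log data."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # In log space: log(loss) = log(a) - alpha * log(size).
    return math.exp(intercept), -slope


# Synthetic data generated from an assumed power law (not real measurements).
true_a, true_alpha = 50.0, 0.3
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [true_a * s ** (-true_alpha) for s in sizes]

a, alpha = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

Fitting a curve like this describes the phenomenon but does not explain it, which is exactly the open question raised above.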
A new complexity theory for intelligence
Computer science as a discipline is built on the theoretical foundation of computation, in particular complexity theory and the theory of algorithms. However, classical computational complexity theory may not be applicable in the AI era, since it measures the complexity and efficiency of computation, but not the “intelligence” of computation. We may need a new paradigm and a new kind of complexity theory that measures the intelligence of computation models.
For example, given the same training data, why can some models learn and extract deep insights and results from the data while other models only provide superficial analysis and predictions? Why are some models intrinsically better than others? Is there an intelligence reduction paradigm that we can use to measure the intelligence power of different models, like algorithmic reduction to measure the complexity of different computation tasks? The advent of LLMs and other AI models may invite a completely new complexity paradigm addressing the power of intelligent computation.
Theory of in-context learning
In-context learning (ICL) involves generalization across task spaces and intuitively relates to certain data-selection mechanisms induced by modeling biases (such as attention and gating). An idealized yet powerful mechanism known as the induction head can be used to understand how Transformers implement ICL. Wang et al. [12] provided theoretical characterizations of the approximation errors and optimization dynamics of shallow Transformers applied to induction-head targets, but the mystery remains for LLMs with larger and deeper Transformer backbones.
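The idealized induction-head behavior can be stated as a simple rule: given a context of the form [..., A, B, ..., A], predict B. The Python sketch below implements that rule directly as a lookup. It is only an illustration of the target behavior; it is not the Transformer construction analyzed in [12], a real model realizes the rule softly through attention weights, and the function name is hypothetical.

```python
def induction_head_predict(tokens):
    """Idealized induction-head rule: for a context [..., A, B, ..., A], return B.

    Finds the most recent earlier occurrence of the last token and copies
    the token that followed it. Returns None if there is no earlier occurrence.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the final position itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


print(induction_head_predict(["A", "B", "C", "A"]))
print(induction_head_predict(["x", "y", "x", "z", "x"]))
```

The rule uses no trained parameters: the "task" is read entirely from the context, which is why it serves as a clean test bed for studying ICL.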
Efficiency improvement of LLMs
The training of LLMs involves an enormous number of parameters and requires web-scale datasets, leading to increasing overheads in computation, memory, energy, and infrastructure. This calls for more efficient approaches to modern machine learning.
Efficient learning involves several fundamental aspects, including modeling, data, and optimization, as well as their interactions [11]. When centering on data (i.e., data-efficient AI), potential research directions may include data-dependent modeling (model-data), data-centric sampling (data-data), and data-related training (optimization-data). Specifically, one can study, respectively, inductive biases and redundancy, the sampling of data mixtures, and data-aware dynamics, working towards theoretically grounded (or at least principled) efficiency in data-centric AI and LLMs.
New model architecture and learning algorithms for next-generation AI
While current LLMs exhibit near-human intelligence, recent research reveals that they struggle with many tasks, such as computation and reasoning. The reasons for these limitations are not fully understood, but they are often attributed to the underlying model architectures and training algorithms. Therefore, identifying new model architectures and training methodologies for next-generation AI is a crucial area of research, with the potential to address the shortcomings of current models.
Collaborative exploration: Discover more possibilities of LLMs
The team at the Theory Center of Microsoft Research Asia is focusing on the theoretical foundation of LLMs, aiming to bridge the cognitive gap within LLMs, understand and improve the performance of LLMs, and build new theories and algorithms for the next generation of AI.
This is undoubtedly a challenging and arduous road of exploration that requires the participation of many outstanding scholars. The possibilities of LLMs can be explored through extensive cooperation and collective wisdom.
In addressing these challenging issues, the Microsoft Research Asia StarTrack Scholars Program advocates an open attitude, encouraging dialogue and joint experimentation with researchers from various disciplines to discover viable solutions. Visit our official website to learn more: Microsoft Research Asia StarTrack Scholars Program – Microsoft Research
[1] S. Bubeck, et al. Sparks of artificial general intelligence: Early experiments with GPT-4. Preprint arXiv:2303.12712, 2023.
[2] Z. Allen-Zhu and Y. Li. Physics of language models: Part 1, context-free grammar. Preprint arXiv:2305.13673, 2023.
[3] Z. Allen-Zhu and Y. Li. Physics of language models: Part 3.1, knowledge storage and extraction. Preprint arXiv:2309.14316, 2023.
[4] Z. Allen-Zhu and Y. Li. Physics of language models: Part 3.2, knowledge manipulation. Preprint arXiv:2309.14402, 2023.
[5] K. Valmeekam, M. Marquez, S. Sreedharan, and S. Kambhampati. On the planning abilities of large language models: A critical investigation. Advances in Neural Information Processing Systems, 2023.
[6] K. Valmeekam, K. Stechly, and S. Kambhampati. LLMs still can’t plan; Can LRMs? A preliminary evaluation of OpenAI’s o1 on PlanBench. Preprint arXiv:2409.13373, 2024.
[7] T. H. Trinh, Y. Wu, Q. V. Le, H. He, and T. Luong. Solving Olympiad geometry without human demonstrations. Nature, 625(7995):476–482, 2024.
[8] S. Wang, Y. Shen, S. Feng, H. Sun, S.-H. Teng, and W. Chen. ALPINE: Unveiling the planning capability of autoregressive learning in language models. Advances in Neural Information Processing Systems, 2024.
[9] S. Arora and A. Goyal. A theory for emergence of complex skills in language models. Preprint arXiv:2307.15936, 2023.
[10] K. Lyu, J. Jin, Z. Li, S. S. Du, J. Lee, and W. Hu. Dichotomy of early and late phase implicit biases can provably induce grokking. International Conference on Learning Representations, 2024.
[11] L. Shen, Y. Sun, Z. Yu, L. Ding, X. Tian, and D. Tao. On efficient training of large-scale deep learning models: A literature review. Preprint arXiv:2304.03589, 2023.
[12] M. Wang, R. Yu, W. E, and L. Wu. How transformers implement induction heads: Approximation and optimization analysis. Preprint arXiv:2410.11474, 2024.
Theme Team:
Wei Chen, Principal Researcher, Microsoft Research Asia
Siwei Wang, Senior Researcher, Microsoft Research Asia
Yifei Shen, Senior Researcher, Microsoft Research Asia
Zhong Li, Researcher, Microsoft Research Asia
If you have any questions, please email Ms. Yanxuan Wu, program manager of the Microsoft Research Asia StarTrack Scholars Program, at v-yanxuanwu@microsoft.com
The post Theoretical foundations of large language models: Microsoft Research Asia StarTrack Scholars 2025 enhancing the power of LLMs appeared first on Microsoft Research.