Microsoft Research Asia is pioneering innovations in Media Foundation to advance AI’s ability to process real-world media. As one of the key focuses of the 2025 StarTrack Scholars Program, this research aims to provide new insights into multimodal large models.
“We aspire for AI to acquire knowledge and intelligence from various media sources in the real world,” said Yan Lu, Partner Research Manager. “To achieve this goal, we must transform the complex and noisy real world into abstract representations capable of capturing essential information and dynamics. The exploration of Media Foundation serves as a new breakthrough in the synergy of multimedia and AI, offering novel perspectives for the research on multimodal large models.”
If you are an aspiring researcher with a zeal for Media Foundation research, we invite you to apply to the Microsoft Research Asia StarTrack Scholars Program. Applications are now open for the 2025 program. For more details and to submit your application, visit our official website: Microsoft Research Asia StarTrack Scholars Program – Microsoft Research
Bridging the semantic gap between AI and the real world
Since the term “artificial intelligence” was coined at the Dartmouth Conference in 1956, technological advances have significantly propelled AI forward. The rise of large language models (LLMs) such as ChatGPT and GPT-4, along with generative models such as DALL-E, has demonstrated impressive progress in natural language, audio, and image understanding. However, despite these advancements, AI still lags behind human cognitive abilities. Humans can interpret and abstract complex phenomena from the physical world, conveyed through videos, sounds, and text, into lasting, cumulative information, whereas multimodal AI models are still maturing in their ability to handle such universal tasks.
The goal is for AI to learn from real-world data, but the challenge remains in bridging the gap between the complex, noisy real world and the abstract semantic space where AI operates. Yan Lu and colleagues at Microsoft Research Asia propose constructing a comprehensive Media Foundation framework, starting with the neural codec, to address this challenge. This framework aims to extract media content representations, enabling AI to understand real-world semantics and bridging the gap between reality and abstract semantics for multimodal AI research.
Humans are exceptional learners due to their ability to interact with the physical world through various senses. The aspiration is to enable AI to learn from rich real-world data. While most AI models rely on LLMs that use abstract text expressions, there is a need for efficient methods to transform complex signals from video and audio into abstract representations capturing real-world essence and dynamics.
The Media Foundation framework consists of two components: online media tokenization and an offline foundation model. The online component converts multimedia information into compact, abstract semantic representations through which AI can interact with the real world, while the offline component uses the extracted media tokens to learn and predict real-world dynamics. Efficient, near-lossless compression of real-world dynamics is crucial for AI intelligence, whether learning from text, audio, or video signals.
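To make the division of labor concrete, here is a minimal, purely illustrative Python sketch of how an online tokenizer and an offline dynamics model could be wired together. All class names, dimensions, and the choice of vector quantization and a Transformer backbone are our own assumptions for illustration, not details of the team’s actual system.

```python
import torch
import torch.nn as nn


class MediaTokenizer(nn.Module):
    """Hypothetical online component: maps raw media features to compact
    discrete tokens via an encoder plus a learned codebook (vector quantization)."""

    def __init__(self, in_dim=1024, latent_dim=256, codebook_size=8192):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim))
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, in_dim))

    def tokenize(self, x):                                   # x: (T, in_dim) media features
        z = self.encoder(x)                                  # (T, latent_dim)
        dists = torch.cdist(z, self.codebook.weight)         # distance to every codeword
        return dists.argmin(dim=-1)                          # (T,) discrete token ids

    def reconstruct(self, tokens):
        return self.decoder(self.codebook(tokens))           # approximate media signal


class DynamicsModel(nn.Module):
    """Hypothetical offline component: autoregressively predicts the next media
    token, i.e. learns real-world dynamics in the abstract token space."""

    def __init__(self, codebook_size=8192, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(codebook_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, codebook_size)

    def forward(self, tokens):                               # tokens: (B, T) ids from the tokenizer
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(self.embed(tokens), mask=causal)   # only past tokens are visible
        return self.head(h)                                  # logits over the next token per position
```

In this sketch the tokenizer plays the “online” role of turning noisy signals into abstract tokens, and the dynamics model plays the “offline” role of learning how those tokens evolve over time.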
Neural codec: Constructing abstract representations for multimedia
The Media Foundation framework is designed to transform diverse media signals into compact, semantic representations, creating an abstract depiction of the real world and its dynamics. This framework consists of two main components: online media tokenization and offline foundation models, both powered by the neural codec. The development plan unfolds in three phases: first, training initial encoder and decoder models for each media modality; second, building and optimizing foundation models and encoders/decoders for each modality; and third, learning cross-modal correlations to construct the final multimodal foundation model.
This dynamic representation, combined with the multimodal foundation model, offers a fresh perspective for studying multimodal AI. Given that abstract semantic representations are more concise than complex and noisy video and audio signals, a key challenge is whether Media Foundation can compress real-world dynamics efficiently with minimal loss. This challenge has driven the team to create a new neural codec framework dedicated to forming abstract representations for video, audio, and their dynamics.
Efficient neural audio/video codec development: Paving the way for innovative applications
In recent years, Yan Lu and his team have made significant strides in developing efficient neural audio/video codecs. By leveraging deep learning, they have moved beyond traditional codec architectures, achieving lower computational costs and superior performance. In neural audio codec development, they have compressed high-quality speech signals to bitrates as low as 256 bps by learning disentangled semantic representations through an information bottleneck. This advancement is important not only for multimedia technology but also for tasks such as voice conversion, speech-to-speech translation, and emotion recognition. The team is currently extending the audio codec to handle general sound with improved intelligence.
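As a rough illustration of the disentanglement idea (not the team’s actual codec), the sketch below separates a heavily quantized content stream from a single utterance-level timbre embedding; the narrow, low-rate content codes act as the information bottleneck. The model structure, dimensions, and token rate are assumptions made purely for the arithmetic.

```python
import torch
import torch.nn as nn


class LowBitrateSpeechCodec(nn.Module):
    """Illustrative disentangled speech codec. A narrow, hard-quantized content
    stream forms an information bottleneck, while one utterance-level embedding
    carries timbre/speaker information that never enters the bitstream per frame."""

    def __init__(self, n_mels=80, codebook_size=256, content_dim=64, speaker_dim=128):
        super().__init__()
        self.content_enc = nn.GRU(n_mels, content_dim, batch_first=True)
        self.codebook = nn.Embedding(codebook_size, content_dim)   # 8 bits per content token
        self.speaker_enc = nn.GRU(n_mels, speaker_dim, batch_first=True)
        self.decoder = nn.GRU(content_dim + speaker_dim, n_mels, batch_first=True)

    def forward(self, mels):                                # mels: (B, T, n_mels)
        content, _ = self.content_enc(mels)                 # (B, T, content_dim)
        flat = content.reshape(-1, content.size(-1))
        # Bottleneck: snap every frame to its nearest codeword (a real codec would
        # train through this step with a straight-through estimator).
        ids = torch.cdist(flat, self.codebook.weight).argmin(-1)
        quantized = self.codebook(ids).view_as(content)
        _, spk = self.speaker_enc(mels)                      # (1, B, speaker_dim) global summary
        spk = spk[-1].unsqueeze(1).expand(-1, mels.size(1), -1)
        recon, _ = self.decoder(torch.cat([quantized, spk], dim=-1))
        return recon, ids.view(content.shape[:2])


# Back-of-the-envelope bitrate (illustrative): if the content stream is temporally
# downsampled to 32 tokens per second and each token costs 8 bits (256 codewords),
# the content bitrate is 32 * 8 = 256 bits per second.
```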
They have also developed the DCVC-FM (Deep Contextual Video Compression – Feature Modulation) neural video codec. This codec replaces the traditional rule-based integration of separate modules and algorithms with an end-to-end deep learning process. This innovation significantly improves the video compression ratio, outperforming existing codecs. As part of the comprehensive Media Foundation initiative, the team is making substantial modifications to the DCVC-FM codec to further enhance its capabilities.
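The following toy sketch illustrates the general idea of contextual (conditional) video coding, in which the current frame is encoded and decoded conditioned on features extracted from the previously reconstructed frame, so that only new information has to be represented. It is a conceptual illustration only, not the published DCVC-FM architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionalFrameCoder(nn.Module):
    """Toy contextual video coder: encoder and decoder are both conditioned on
    features from the previously reconstructed frame, so the latent mainly has
    to carry what is new in the current frame."""

    def __init__(self, channels=64):
        super().__init__()
        self.context = nn.Conv2d(3, channels, 3, padding=1)              # features of previous frame
        self.encoder = nn.Conv2d(3 + channels, channels, 3, stride=2, padding=1)
        self.decoder = nn.ConvTranspose2d(2 * channels, 3, 4, stride=2, padding=1)

    def forward(self, frame, prev_recon):                                # (B, 3, H, W) each
        ctx = self.context(prev_recon)
        # In a real codec this latent would be quantized and entropy-coded.
        latent = self.encoder(torch.cat([frame, ctx], dim=1))
        ctx_small = F.avg_pool2d(ctx, 2)                                 # match the latent resolution
        recon = self.decoder(torch.cat([latent, ctx_small], dim=1))
        return recon, latent
```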
Exploring new possibilities of AI development with Media Foundation
The newly developed neural codec revolutionizes how diverse types of information are modeled in latent space, achieving higher compression ratios. For multimodal large models, this enables the conversion of visual, language, and audio information into neural representations. Unlike natural language, which is an inherently sequential human abstraction, these multimedia representations more directly reflect the structure and dynamics of the physical world and support a broader array of applications. The team’s work demonstrates the potential for building a new Media Foundation based on video and audio data, offering a fresh perspective for AI development.
While natural language has been effective for constructing AI, constantly converting complex multimedia signals into text can be cumbersome and limiting. A Media Foundation based on neural codecs could offer a more efficient approach. Although the paths for developing multimodal large models through Media Foundation and natural language models differ, both are vital for AI’s advancement. If AI-learned multimedia representations are viewed as a parallel “language” to natural language, then large multimodal models can be considered “large multimedia language models.”
The neural codec is poised to play a critical role in evolving the Media Foundation, with media foundation models and large language models jointly shaping the future of multimodal AI. The team will continue exploring various modeling approaches in latent space for multimedia data using neural codecs, with Media Foundation as their guiding principle, opening up numerous possibilities.
As they construct a comprehensive Media Foundation framework, they invite brilliant young minds passionate about transforming the future of AI to join in this endeavor.
Generative AI for HCI: Next generation of intelligent and user-centric interfaces
Generative AI is transforming how we interact with technology, offering new opportunities in user interfaces (UI) and Human-Computer Interaction (HCI). It enhances user experiences, streamlines design processes, and enables more intuitive and adaptive interfaces. By leveraging AI, we can automate UI creation, personalize interactions, and develop systems that evolve with user behavior, boosting efficiency and fostering innovation in HCI.
Generative AI can be applied in various aspects of UI and HCI, including:
- Automated UI design: AI can generate UI components and layouts tailored to user preferences and behavior, reducing manual design time and effort (see the sketch after this list).
- Personalized user experiences: AI algorithms analyze user data to create interfaces that adapt to individual needs and preferences.
- Adaptive systems: AI-driven systems learn from user interactions, continuously improving performance for a seamless, intuitive experience.
- Enhanced accessibility: AI can automatically adjust interfaces to better serve users with disabilities, creating more accessible environments.
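As one concrete, intentionally simplified example of the automated UI design item above, the sketch below asks a generative model for a JSON layout adapted to a user profile. The `generate` callable stands in for whatever model API is used; the prompt, profile fields, and fallback behavior are all illustrative assumptions rather than part of any existing system.

```python
import json


def build_ui_prompt(user_profile: dict) -> str:
    """Compose a prompt asking a generative model for a UI layout tailored to one user."""
    return (
        "You are a UI designer. Produce a JSON layout with a list of components "
        "(type, label, position) for a dashboard, adapted to this user profile:\n"
        + json.dumps(user_profile, indent=2)
        + "\nRespond with JSON only."
    )


def generate_layout(user_profile: dict, generate) -> dict:
    """`generate` is any text-completion callable (hosted API or local model);
    it is deliberately left abstract here."""
    raw = generate(build_ui_prompt(user_profile))
    try:
        return json.loads(raw)                 # the layout a front end would render
    except json.JSONDecodeError:
        return {"components": []}              # fall back to an empty layout on malformed output


# Example: a profile suggesting larger text and fewer, simpler components.
profile = {"vision": "low", "device": "tablet", "frequent_tasks": ["email", "calendar"]}
# layout = generate_layout(profile, generate=my_model_call)  # my_model_call is hypothetical
```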
Our team is committed to pioneering research in generative AI for UI and HCI, exploring AI’s potential to transform design and interaction, and pushing the boundaries of what is possible. We invite scholars with diverse expertise to join us, share insights, and contribute to advancing this exciting field. Together, we can pave the way for the next generation of intelligent, user-centric interfaces.
Insights and innovations: StarTrack Scholars’ experiences
The essence and impact of the Microsoft Research Asia StarTrack Scholars Program are vividly captured through the firsthand experiences of StarTrack Scholars:

Jin Gao, professor at the State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences (CASIA), reflects on his visit:
“The experience at MSRA also encouraged me to dive deep into closing the gap between real-world applications and innovative AI modeling methods, which shaped my research path over the five years following my visit.
It is my immense pleasure to share my experience as a StarTrack Scholar in Yan Lu’s research group at MSRA from October 2019 to March 2020. During this period, I enjoyed a pleasant research journey, collaborating with Yan Lu and exploring fundamental online learning mechanisms for processing streaming data using deep learning techniques.
Our innovative findings on improving the SGD algorithm brought significant benefits in reducing the risk of over-fitting caused by catastrophic forgetting during excessive online learning. This work was accepted at CVPR 2020 as an oral presentation.
During my time at MSRA, my colleagues and I also had fruitful exchanges of views on advancing AI’s capability to analyze and interpret data from the real world. They are so smart, so willing to help, and, most importantly, always full of passion, which left me with many memorable moments.
Influenced by my experience of working and studying at MSRA, I have led more than 10 research projects sponsored by both government and industry, including the Key Program of the Joint Fund of the National Natural Science Foundation of China. I was also selected for several talent programs, such as the Outstanding Youth Science Foundation project, the Beijing Municipal Natural Science Fund for Distinguished Young Scholars, and the Youth Innovation Promotion Association, CAS.”

Qi Mao, assistant professor in the School of Information and Communication Engineering at the Communication University of China, shares:
“Engaging with leading experts refined my research skills and expanded my perspectives. The supportive and dynamic culture at MSRA encouraged critical and creative thinking, proving invaluable to my ongoing research efforts.
I am excited to share my enriching journey at Microsoft Research Asia (MSRA), a pivotal chapter in my academic and professional development. As a StarTrack Scholar in Yan Lu’s research group, I led the project titled “Unifying Generation, Understanding, Compression in One-framework from the Multi-Modal Large Models Perspective.”
Our goal was to revolutionize image compression by integrating multi-modal large foundation models, facilitating compression, generation, and understanding all within a single framework. This project not only marked significant milestones in my research career but is also poised to yield notable publications in top-tier conferences and journals.
My collaboration with MSRA’s esteemed researchers significantly broadened my expertise in AI and compression technologies, enabling me to tackle highly challenging and impactful projects.
Reflecting on my personal growth, MSRA is more than a workplace; it is a vibrant community. Working alongside brilliant minds, I encountered new learning experiences and challenging projects daily. Memorable moments ranged from intensive brainstorming sessions to the camaraderie of coffee breaks, with a standout memory of lively project discussions over lunch with fellow researchers, which highlighted the supportive and vibrant office culture at MSRA.
As I look back on my journey, I am grateful for the connections I have made and the knowledge I’ve gained. My experience at MSRA has fueled my passion for research and innovation, and I am eager to carry these insights forward into my future endeavors.”
The team will continue to explore various modeling approaches in the latent space for multimedia information using neural codecs, with Media Foundation serving as their guiding principle and opening up myriad possibilities. As we navigate the intricacies of constructing a comprehensive Media Foundation framework, we extend a call to scholars passionate about reshaping the future of artificial intelligence.
The Microsoft Research Asia StarTrack Scholars Program advocates an open attitude, encouraging dialogue and joint experimentation with researchers from various disciplines to discover viable solutions. Visit our official website to learn more: Microsoft Research Asia StarTrack Scholars Program – Microsoft Research
Theme Team
Yan Lu (Engaging Lead), Partner Research Manager, Microsoft Research Asia
Xiulian Peng, Principal Research Manager, Microsoft Research Asia
Shujie Liu, Principal Research Manager, Microsoft Research Asia
Bin Li, Principal Researcher, Microsoft Research Asia
If you have any questions, please email Ms. Yanxuan Wu, program manager of the Microsoft Research Asia StarTrack Scholars Program, at v-yanxuanwu@microsoft.com