
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks

By Adam Fourney, Principal Researcher; Gagan Bansal, Senior Researcher; Hussein Mozannar, Senior Researcher; Victor Dibia, Principal Research Software Engineer; Saleema Amershi, Partner Research Manager

Contributors: Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang (Eric) Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, Saleema Amershi

We are introducing Magentic-One, our new generalist multi-agent system for solving open-ended web and file-based tasks across a variety of domains. Magentic-One represents a significant step towards developing agents that can complete tasks that people encounter in their work and personal lives. We are also releasing an open-source implementation of Magentic-One on Microsoft AutoGen, our popular open-source framework for developing multi-agent applications.

The future of AI is agentic. AI systems are evolving from having conversations to getting things done—this is where we expect much of AI’s value to shine. It’s the difference between generative AI recommending dinner options and agentic assistants that can autonomously place your order and arrange delivery. It’s the shift from summarizing research papers to actively searching for and organizing relevant studies in a comprehensive literature review.

Modern AI agents, capable of perceiving, reasoning, and acting on our behalf, are demonstrating remarkable performance in areas such as software engineering, data analysis, scientific research, and web navigation. Still, to fully realize the long-held vision of agentic systems that can enhance our productivity and transform our lives, we need advances in generalist agentic systems. These systems must reliably complete complex, multi-step tasks across a wide range of scenarios people encounter in their daily lives.

Introducing Magentic-One, a high-performing generalist agentic system designed to solve such tasks. Magentic-One employs a multi-agent architecture where a lead agent, the Orchestrator, directs four other agents to solve tasks. The Orchestrator plans, tracks progress, and re-plans to recover from errors, while directing specialized agents to perform tasks like operating a web browser, navigating local files, or writing and executing Python code.

Magentic-One achieves performance that is statistically competitive with the state of the art on multiple challenging agentic benchmarks, without requiring modifications to its core capabilities or architecture. Implemented using AutoGen, our popular open-source multi-agent framework, Magentic-One benefits from the modular and flexible multi-agent paradigm. This approach offers numerous advantages over monolithic single-agent systems. For example, encapsulating distinct skills in separate agents simplifies development and reuse, much like object-oriented programming. Magentic-One’s plug-and-play design further supports easy adaptation and extensibility by enabling agents to be added or removed without altering other agents or the overall architecture, unlike single-agent systems that often struggle with constrained and inflexible workflows.

We’re making Magentic-One open-source for researchers and developers. While Magentic-One shows strong generalist capabilities, it’s still far from human-level performance and can make mistakes. Moreover, as agentic systems grow more powerful, their risks—like taking undesirable actions or enabling malicious use-cases—can also increase. While we’re still in the early days of modern agentic AI, we’re inviting the community to help tackle these open challenges and ensure our future agentic systems are both helpful and safe. To this end, we’re also releasing AutoGenBench, an agentic evaluation tool with built-in controls for repetition and isolation to rigorously test agentic benchmarks and tasks while minimizing undesirable side-effects.

How it Works

Magentic-One features an Orchestrator agent that implements two loops: an outer loop and an inner loop. The outer loop (lighter background with solid arrows) manages the task ledger (containing facts, guesses, and plan) and the inner loop (darker background with dotted arrows) manages the progress ledger (containing current progress, task assignment to agents).

Magentic-One is based on a multi-agent architecture where a lead Orchestrator agent is responsible for high-level planning, directing the other agents, and tracking task progress. The Orchestrator begins by creating a plan to tackle the task, gathering needed facts and educated guesses in a Task Ledger that it maintains. At each step of its plan, the Orchestrator creates a Progress Ledger in which it self-reflects on task progress and checks whether the task is complete. If the task is not yet complete, it assigns one of Magentic-One’s other agents a subtask to complete. After the assigned agent completes its subtask, the Orchestrator updates the Progress Ledger and continues in this way until the task is complete. If the Orchestrator finds that progress is not being made for enough steps, it can update the Task Ledger and create a new plan. As illustrated in the figure above, the Orchestrator’s work is thus divided into an outer loop, where it updates the Task Ledger, and an inner loop, where it updates the Progress Ledger.
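To make this control flow concrete, here is a minimal Python sketch of the two-loop structure described above. The names used (TaskLedger, ProgressLedger, make_plan, reflect, and the agents mapping) are illustrative placeholders, not the actual AutoGen implementation.

```python
from dataclasses import dataclass, field


@dataclass
class TaskLedger:
    """Outer-loop state: collected facts, educated guesses, and the current plan."""
    facts: list = field(default_factory=list)
    guesses: list = field(default_factory=list)
    plan: list = field(default_factory=list)


@dataclass
class ProgressLedger:
    """Inner-loop state: self-reflection on progress and the next assignment."""
    task_complete: bool = False
    progress_being_made: bool = True
    next_agent: str = ""
    next_subtask: str = ""


def orchestrate(task, agents, make_plan, reflect, max_stalls=3):
    """Illustrative two-loop control flow: the outer loop (re-)plans via the
    Task Ledger; the inner loop assigns subtasks via the Progress Ledger."""
    task_ledger = make_plan(task, previous=None)            # outer loop: gather facts, guesses, plan
    while True:
        stalls = 0
        while True:                                          # inner loop
            progress = reflect(task, task_ledger)            # update the Progress Ledger
            if progress.task_complete:
                return task_ledger
            if not progress.progress_being_made:
                stalls += 1
                if stalls >= max_stalls:
                    break                                    # stalled too long: revise the plan
            agents[progress.next_agent].run(progress.next_subtask)
        task_ledger = make_plan(task, previous=task_ledger)  # outer loop: update facts and re-plan
```

In this sketch, the inner loop keeps assigning subtasks until the reflection step reports completion, and repeated stalls hand control back to the outer loop for re-planning.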

Overall, Magentic-One consists of the following agents:

  • Orchestrator: the lead agent responsible for task decomposition and planning, directing other agents in executing subtasks, tracking overall progress, and taking corrective actions as needed
  • WebSurfer: This is an LLM-based agent that is proficient in commanding and managing the state of a Chromium-based web browser. With each incoming request, the WebSurfer performs an action in the browser and then reports on the new state of the web page. The action space of the WebSurfer includes navigation (e.g., visiting a URL or performing a web search), web page actions (e.g., clicking and typing), and reading actions (e.g., summarizing or answering questions). The WebSurfer relies on the browser’s accessibility tree and on set-of-marks prompting to perform its actions.
  • FileSurfer: This is an LLM-based agent that commands a markdown-based file preview application to read local files of most types. The FileSurfer can also perform common navigation tasks such as listing the contents of directories and navigating a folder structure.
  • Coder: This is an LLM-based agent specialized through its system prompt for writing code, analyzing information collected from the other agents, or creating new artifacts.
  • ComputerTerminal: Finally, ComputerTerminal provides the team with access to a console shell where the Coder’s programs can be executed, and where new programming libraries can be installed.

Together, Magentic-One’s agents provide the Orchestrator with the tools and capabilities that it needs to solve a broad variety of open-ended problems, as well as the ability to autonomously adapt to, and act in, dynamic and ever-changing web and file-system environments.
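The plug-and-play design can be pictured as a simple registry of named agents that the Orchestrator dispatches to. The sketch below is purely illustrative, assuming a minimal run(subtask) interface, and does not reflect the actual AutoGen interfaces.

```python
from typing import Protocol


class Agent(Protocol):
    def run(self, subtask: str) -> str: ...


class EchoAgent:
    """Stand-in agent used only to illustrate the shared interface."""
    def __init__(self, name: str):
        self.name = name

    def run(self, subtask: str) -> str:
        return f"[{self.name}] handled: {subtask}"


# Agents are registered by name; adding or removing one does not
# require changes to the Orchestrator or to the other agents.
team: dict[str, Agent] = {
    "WebSurfer": EchoAgent("WebSurfer"),
    "FileSurfer": EchoAgent("FileSurfer"),
    "Coder": EchoAgent("Coder"),
    "ComputerTerminal": EchoAgent("ComputerTerminal"),
}

# A new specialist can be plugged in without touching existing agents
# (hypothetical additional agent, for illustration only):
team["Spreadsheet"] = EchoAgent("Spreadsheet")

print(team["WebSurfer"].run("search for the AutoGen repository"))
```

Because the Orchestrator only depends on this shared interface, swapping the model behind one agent or registering a new specialist leaves the rest of the team untouched.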

While the default multimodal LLM we use for all agents is GPT-4o, Magentic-One is model agnostic and can incorporate heterogeneous models to support different capabilities or meet different cost requirements when getting tasks done. For example, it can use different LLMs and SLMs, and their specialized versions, to power different agents. We recommend a strong reasoning model, such as GPT-4o, for the Orchestrator agent. In a different configuration of Magentic-One, we also experiment with using OpenAI o1-preview for the outer loop of the Orchestrator and for the Coder, while other agents continue to use GPT-4o.
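As a rough illustration of this flexibility, a per-agent model assignment might look like the mapping below. It mirrors the o1-preview configuration described above, but the structure itself is hypothetical and not the actual AutoGen configuration format.

```python
# Illustrative per-agent model assignment (not the actual AutoGen config format).
# Mirrors the configuration described above: o1-preview for the Orchestrator's
# outer loop and the Coder, GPT-4o everywhere else.
model_config = {
    "Orchestrator(outer_loop)": "o1-preview",
    "Orchestrator(inner_loop)": "gpt-4o",
    "WebSurfer": "gpt-4o",
    "FileSurfer": "gpt-4o",
    "Coder": "o1-preview",
    "ComputerTerminal": None,  # executes code; no LLM required
}
```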

Evaluation

To rigorously evaluate Magentic-One’s performance, we introduce AutoGenBench, an open-source standalone tool for running agentic benchmarks that allows repetition and isolation, e.g., to control for variance of stochastic LLM calls and side-effects of agents taking actions in the world. AutoGenBench facilitates agentic evaluation and allows adding new benchmarks. Using AutoGenBench, we can evaluate Magentic-One on a variety of benchmarks. Our criterion for selecting benchmarks is that they should involve complex multi-step tasks, with at least some steps requiring planning and tool use, including using web browsers to act on real or simulated webpages. We consider three benchmarks in this work that satisfy this criterion: GAIA, AssistantBench, and WebArena.
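The repetition-and-isolation idea behind AutoGenBench can be sketched as follows: each trial of a task runs in its own fresh scratch directory, and outcomes are aggregated across repetitions. This is a conceptual sketch only, not AutoGenBench’s actual interface; run_trials and the stand-in task are hypothetical.

```python
import statistics
import tempfile
from pathlib import Path


def run_trials(task_fn, repetitions: int = 10) -> float:
    """Run a task several times, each in an isolated scratch directory,
    and return the mean success rate across repetitions."""
    outcomes = []
    for _ in range(repetitions):
        with tempfile.TemporaryDirectory() as scratch:       # isolation: fresh workspace per trial
            outcomes.append(1.0 if task_fn(Path(scratch)) else 0.0)
    return statistics.mean(outcomes)                          # repetition: average over trials


if __name__ == "__main__":
    # Trivial stand-in task: "succeed" if a file can be written to the workspace.
    rate = run_trials(lambda workdir: (workdir / "out.txt").write_text("done") > 0)
    print(f"success rate: {rate:.2f}")
```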

In the figure below we show the performance of Magentic-One on the three benchmarks and compare it with GPT-4 operating on its own, as well as with each benchmark’s highest-performing open-source baseline and its highest-performing non-open-source, benchmark-specific baseline, according to the public leaderboards as of October 21, 2024. Magentic-One (GPT-4o, o1) achieves statistically comparable performance to previous SOTA methods on both GAIA and AssistantBench, and competitive performance on WebArena. Note that GAIA and AssistantBench have a hidden test set while WebArena does not, and thus WebArena results are self-reported. Together, these results establish Magentic-One as a strong generalist agentic system for completing complex tasks.

Evaluation results of Magentic-One on the GAIA, AssistantBench, and WebArena benchmarks. Error bars indicate 95% confidence intervals. Note that WebArena results are self-reported.
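For reference, error bars like these are commonly derived from the binomial distribution of task success rates. The snippet below shows one standard way to compute such a 95% interval using a normal approximation; the exact method used for the figure is not specified here, so treat this as an assumption.

```python
import math


def binomial_ci95(successes: int, trials: int) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for a task success rate."""
    p = successes / trials
    half_width = 1.96 * math.sqrt(p * (1 - p) / trials)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# e.g., 60 successful tasks out of 160 benchmark tasks:
print(binomial_ci95(60, 160))  # roughly (0.30, 0.45)
```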

Risks & Mitigations

Agentic systems like Magentic-One represent a phase transition in the opportunities and risks of having AI systems in the world. Magentic-One interacts with a digital world designed for, and inhabited by, humans. It can take actions, change the state of the world, and cause consequences that might be irreversible. This carries inherent and undeniable risks, and we observed examples of emerging risks during our testing. For example, during development, a misconfiguration prevented the agents from successfully logging in to a particular WebArena website. The agents attempted to log in to that website until the repeated attempts caused the account to be temporarily suspended. The agents then attempted to reset the account’s password. More worryingly, in a handful of cases, and until prompted otherwise, the agents attempted to recruit other humans for help (e.g., by posting to social media, emailing textbook authors, or, in one case, drafting a freedom of information request to a government entity). In each of these cases, the agents failed because they did not have access to the requisite tools or accounts, and/or were stopped by human observers.

In accordance with Microsoft’s commitments to Responsible AI, we worked to identify, measure, and mitigate potential risks of Magentic-One prior to deployment. In particular, we performed red-teaming exercises for potential harmful content, jailbreak, and prompt injection attacks, finding no increased risk from our design. In addition, we provide cautionary notices on how to use Magentic-One safely, along with guidance, examples, and appropriate defaults for using it in ways that minimize risk, including how to bring humans into the loop for monitoring and oversight, and how to ensure that all examples involving code execution, as well as our evaluation and benchmarking tools, run in sandboxed Docker containers.

We recommend using Magentic-One with models that have strong alignment, with pre- and post-generation filtering, and with close monitoring of logs during and after execution. Indeed, in our own use, we follow the principles of least privilege and maximum oversight. We acknowledge that minimizing the potential risks of agentic AI will require new techniques, and that much research is still needed both to understand these emerging risks and to develop mitigations. We will continue sharing our learnings with the community and continue evolving Magentic-One in line with the latest safety research.

We see valuable new directions for agentic, safety, and Responsible AI research. In terms of anticipating new risks from agentic systems, it is possible that agents will be subject to the same phishing, social engineering, and misinformation attacks that target human web surfers when they are acting on the public web. On cross-cutting mitigations, we anticipate an important direction will be equipping agents with an understanding of which actions are easily reversible, which are reversible with some effort, and which cannot be undone. For example, deleting files, sending emails, and filing forms are unlikely to be easily reversed. When faced with a high-cost or irreversible action, systems should be designed to pause and to seek human input.
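One way to picture such a mitigation is to classify proposed actions by how easily they can be undone and to require human confirmation before anything costly or irreversible proceeds. The categories, action names, and approve helper below are hypothetical illustrations, not part of Magentic-One.

```python
from enum import Enum


class Reversibility(Enum):
    EASY = "easily reversible"
    COSTLY = "reversible with effort"
    IRREVERSIBLE = "cannot be undone"


# Hypothetical classification of a few action types discussed above.
ACTION_RISK = {
    "scroll_page": Reversibility.EASY,
    "click_link": Reversibility.EASY,
    "install_package": Reversibility.COSTLY,
    "delete_file": Reversibility.IRREVERSIBLE,
    "send_email": Reversibility.IRREVERSIBLE,
    "submit_form": Reversibility.IRREVERSIBLE,
}


def approve(action: str, ask_human) -> bool:
    """Pause and seek human input before any high-cost or irreversible action."""
    risk = ACTION_RISK.get(action, Reversibility.IRREVERSIBLE)  # unknown actions: most cautious
    if risk is Reversibility.EASY:
        return True
    return ask_human(f"Agent wants to '{action}' ({risk.value}). Allow? [y/N] ")


if __name__ == "__main__":
    # Example: require typed confirmation at the console before proceeding.
    allowed = approve("send_email", lambda prompt: input(prompt).strip().lower() == "y")
    print("proceeding" if allowed else "blocked")
```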

To conclude, in this work we introduced Magentic-One, a generalist multi-agent system that represents a significant development toward agentic systems capable of solving open-ended tasks.

For further information, results, and discussion, please see our technical report.
