
What Is the Research Stack Powering the World’s Most Experienced Autonomous Driver?

Read Time | 7 minutes


At the MIT Mobility Forum on November 21, 2025, I interviewed Dragomir Anguelov (Drago), Vice President of Engineering at Waymo, to explore the research stack behind the world’s most experienced autonomous driving system.

As the lead of the AI Foundations team, Drago oversees the development of large-scale models that serve as “teachers” for onboard systems and simulators. Our conversation pulled back the curtain on how a fleet serving over 250,000 paid trips per week navigates the complexities of the real world.

We discussed the following areas:

  1. Highway Operations: The qualitatively different challenges of high-speed autonomy.
  2. Next-Generation AI Architectures: Moving beyond the dichotomy of “End-to-End vs. Modular” designs.
  3. Foundation Models: Integrating world knowledge and 3D spatial reasoning.
  4. Scaling Laws: Why motion forecasting models are 50 times smaller than language models.
  5. The Long Tail: Using the Waymo Open Dataset (WOD) to solve rare edge cases and social interactions.
  6. Safety Benchmarking: Rigorously measuring performance against human drivers to establish a “safety mandate”.

Part 1: Waymo’s launch of highway operations represents a massive technical milestone

While there was an “initial belief” in the industry that freeways might be easier than dense urban environments because “nothing particularly interesting happens” for long stretches, Drago clarified that “freeways present a qualitatively different challenge.”

The primary challenge is speed. In downtown San Francisco, a vehicle might move at 10 miles per hour and can simply stop if it encounters a problem; a highway offers no such escape hatch. Drago noted:

“When it’s 70 miles per hour, it’s not like you hit the brakes and stop in the middle of the freeway, that’s already not safe”.

The Requirement for “Degraded Condition” Driving: Because an AV cannot stop abruptly on a high-speed road, the system must be engineered to handle hardware and system failures while remaining in motion. This requires a “robust safety case, spanning all the way from the hardware and the system to the software and the validation”. Specific challenges include:

  • System Failures: The AV must be prepared for “power steering going off,” “brake” issues, or “compute failure”.
  • Safe Harbor Navigation: In the event of such a failure, the vehicle must be able to “keep driving, potentially for a mile, until you hit the shoulder, even in degraded condition”.
  • Validation: Every one of these rare, high-stakes scenarios must be rigorously tested and validated before public deployment.

Drago emphasized that Waymo does not rely on real-time human takeover for highway driving because of latency and reaction time. At high speeds, the need for “quick reactions” makes human remote driving “very challenging,” which is why Waymo focuses on being “as maximally autonomous as possible”.

One point of clarity: Waymo’s vehicles are fully autonomous. While they have tele-operations support, these “minders” do not drive the car. They may provide high-level guidance in rare, “out of a jam” situations, but they do not intervene in real-time, especially at high speeds where latency would make human backup dangerous. The role of these centers is expected to decline as the technology continues to mature.

Part 2: Next-Generation AI Architectures: Moving beyond the dichotomy of “End-to-End vs. Modular” designs

“I don’t think it’s a dichotomy,” Drago explains, noting that “you use each technique for what it is useful for.”

The primary appeal of an end-to-end approach is the power of rich representations

Instead of engineers manually hand-coding every intermediate representation of the world, “end-to-end is the ability to train from outputs to inputs and learn rich representations automatically of the environment”. This approach is significantly more powerful than traditional hand-engineered methods and has become the industry standard for advanced AI models.

The Gradient Problem: Billions to Two

Despite its power, a pure “black box” end-to-end system faces a massive technical hurdle: the dimensionality gap.

An autonomous vehicle’s input space is “extremely high dimensional and complex.” A dozen cameras and several LiDARs and radars produce “billions to tens of billions of readings per second” over a 10- to 30-second context. In contrast, the required output is incredibly small—usually just an XY trajectory or steering and throttle commands: “mostly two numbers”. As Drago asked, “How much gradient can you push?” Mapping those billions of input readings to just two output numbers leaves very little gradient signal for the model to learn from effectively.
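To make the asymmetry concrete, here is a back-of-the-envelope sketch. Every sensor count and rate below is an illustrative assumption, not a published Waymo specification; the point is only the orders of magnitude.

```python
# Back-of-the-envelope dimensionality gap. All counts below are
# illustrative assumptions, not Waymo's published sensor specs.
CAMERAS = 12             # ~a dozen cameras
CAM_PIXELS = 2_000_000   # ~2 MP each
CAM_HZ = 10              # frames per second
LIDARS = 4               # several lidars
LIDAR_POINTS = 200_000   # points per sweep
LIDAR_HZ = 10            # sweeps per second
CONTEXT_SECONDS = 30     # upper end of a 10- to 30-second context

inputs_per_second = (CAMERAS * CAM_PIXELS * CAM_HZ
                     + LIDARS * LIDAR_POINTS * LIDAR_HZ)
context_readings = inputs_per_second * CONTEXT_SECONDS

OUTPUTS = 2  # "mostly two numbers": an (x, y) waypoint per step

print(f"readings over the context window: {context_readings:,}")
print(f"supervised outputs per step:      {OUTPUTS}")
```

With these assumed numbers, billions of input readings supervise a pair of output numbers per step, which is exactly the gap the “how much gradient can you push?” question points at.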

Why Structure Matters for Safety

For safety-critical applications like driving at scale, we cannot rely on an uninterpretable black box: “if it’s just a black box, it’s problematic”. By maintaining structured modules within an end-to-end framework, the system gains three vital capabilities:

  1. Reasoning and Reliability: Structured data helps the model perform complex reasoning and, crucially, “not hallucinate”.
  2. Introspection: A modular structure allows engineers to understand why a car made a specific move: “the model needs to be able to introspect its own understanding” to identify when it is performing well or when it is struggling.
  3. Controllability: Having rich, intermediate representations provides the “controllability” and “fixability” needed to refine the driver’s behavior without breaking the entire system.

Ultimately, Waymo’s strategy is a structured end-to-end approach. This architecture captures the efficiency of automatic learning while retaining the “richer representations of the environment” necessary to build a driver that is both intelligent and transparently safe.

Part 3: Integrating Foundation Models and “World Knowledge”

Waymo is actively researching how to incorporate foundation models into the driving stack. The primary appeal of vision-language models is the ability to “pre-train on internet-scale knowledge and then fine-tune to specific tasks,” allowing the system to leverage world knowledge without having to observe every single situation firsthand.

While models like Google’s Gemini are impressive, a “vanilla” vision-language model is not sufficient for the specific demands of driving. Autonomous vehicles require “Waymo-specific learnings” to better handle “3D spatial reasoning, multi-sensorial inputs, and longer context, memory.”

On “Uplifting” to 3D Euclidean Space

A major technical challenge is that most vision-language models are learned from the internet in 2D and “do not directly model or are concerned about this Euclidean 3D space”.

For an autonomous vehicle, representing the environment in a “bird’s eye view” with “Euclidean coordinates centered around the vehicle” makes both reasoning and modeling substantially easier.

Drago’s team specifically studies how to “effectively uplift a 2D learned representation to 3D Euclidean space where the car actually operates.”
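One common way to perform such an uplift in the open literature (not necessarily Waymo’s method) is to inverse-project the centers of a vehicle-centered BEV grid onto the image plane and sample the 2D features there. A minimal sketch, assuming a pinhole camera with known intrinsics `K`, extrinsics `T_cam_from_vehicle`, and a flat ground plane:

```python
import numpy as np

def uplift_to_bev(feat_2d, K, T_cam_from_vehicle,
                  bev_range=50.0, bev_cells=100, ground_z=0.0):
    """Lift a 2D image feature map into a vehicle-centered bird's-eye-view grid.

    A minimal sketch of one common approach (inverse projection of BEV cell
    centers onto the image plane); not Waymo's actual method.
    feat_2d: (H, W, C) image features; K: 3x3 intrinsics;
    T_cam_from_vehicle: 4x4 transform from vehicle frame to camera frame.
    """
    H, W, C = feat_2d.shape
    bev = np.zeros((bev_cells, bev_cells, C), dtype=feat_2d.dtype)
    # Cell centers in the vehicle frame, on an assumed flat ground plane.
    xs = np.linspace(-bev_range, bev_range, bev_cells)
    ys = np.linspace(-bev_range, bev_range, bev_cells)
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            p_veh = np.array([x, y, ground_z, 1.0])
            p_cam = T_cam_from_vehicle @ p_veh
            if p_cam[2] <= 0.1:              # behind or too close to the camera
                continue
            uvw = K @ p_cam[:3]              # pinhole projection
            u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
            ui, vi = int(round(u)), int(round(v))
            if 0 <= ui < W and 0 <= vi < H:
                bev[i, j] = feat_2d[vi, ui]  # nearest-neighbor sampling
    return bev
```

Real systems typically learn this lifting (e.g., with depth distributions or attention) rather than assuming flat ground, but the geometry above is the core idea.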

On World Modeling and DeepMind’s Genie 3

Drago identified “world modeling” as a significant trend, specifically the causal variant of video prediction. He highlighted “Genie 3 by DeepMind,” describing it as a model that “learns to essentially dream future video of scenes… with controls which can be language or potentially even steering controls”.

These models are highly valuable because they capture “visual-temporal knowledge” that allows the AV to predict how a scene might evolve over 10 to 30 seconds, knowledge that can then be used to teach onboard models or simulators.

This predictive capability is essential for managing “safety-critical… multi-agent interactions” with drivers, pedestrians, and cyclists when navigating “chaotic airport scenes” or complex intersections where the behavior of other agents is uncertain.

Part 4: The Scaling Laws of Motion

When it comes to scaling, motion forecasting follows different rules than natural language. Waymo demonstrated that an optimal autonomous vehicle model is approximately 50 times smaller than an optimal language model. Drago explained that while human language is infinitely rich, the “motion vocabulary is small”. Drivers and pedestrians have a limited set of possible actions they can take.

Waymo’s MotionLM approach treats motion prediction as a “conversation” where all actors in a scene speak to each other simultaneously through their movements. This multi-agent modeling is critical for solving the negotiation challenges inherent in driving, such as merging into traffic or yielding to a cyclist.
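A toy sketch of what a small “motion vocabulary” looks like: per-step (dx, dy) displacements quantized into coarse bins. The ranges and resolution here are illustrative assumptions, not MotionLM’s actual tokenization, but they show why such a vocabulary stays tiny compared to the ~100k-token vocabularies of language models:

```python
import itertools

# Toy discrete motion vocabulary: quantize per-step (dx, dy) displacements
# into a small grid of bins. Ranges and resolution are illustrative
# assumptions, not MotionLM's published tokenization.
DELTAS = [round(d * 0.5, 1) for d in range(-8, 9)]  # -4.0 .. 4.0 m per step
VOCAB = list(itertools.product(DELTAS, DELTAS))     # all (dx, dy) tokens
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(trajectory):
    """Map a sequence of (x, y) waypoints to motion-token ids."""
    ids = []
    for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:]):
        dx = min(DELTAS, key=lambda d: abs(d - (x1 - x0)))  # nearest bin
        dy = min(DELTAS, key=lambda d: abs(d - (y1 - y0)))
        ids.append(TOKEN_ID[(dx, dy)])
    return ids

print(len(VOCAB))  # 289 tokens in this toy vocabulary
```

In a multi-agent model, every actor emits one such token per timestep, so the “conversation” is a sequence of cheap, small-vocabulary tokens rather than raw continuous trajectories.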

Major breakthrough: Open-loop improvements translate to closed-loop improvements

In traditional robotics, training a model in an “open-loop” fashion does not always result in a better driver. The primary obstacle is “covariate shift”.

In a “closed-loop” environment, the vehicle must drive based on its own policy over a duration. “As you accumulate errors in your predictions, you can increasingly get out of domain for the model”. This causes the machine learning model to perform “worse and worse, and divergent, potentially… behaving suboptimally”.

For many architectures, “open loop often does not translate to closed-loop performance” and can actually make closed-loop results worse.
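A toy simulation illustrates why. Assume a hypothetical imitation policy whose per-step error is small on in-distribution states but grows once the state drifts outside the training distribution; the threshold and noise scales are arbitrary choices for illustration:

```python
import random

def policy_error(state):
    """Per-step prediction error of a hypothetical imitation policy.

    Error is small near the training distribution (|state| <= 1) and
    grows once the state drifts out of domain, modeling covariate shift.
    """
    scale = 0.05 if abs(state) <= 1.0 else 0.05 * (1.0 + abs(state))
    return random.gauss(0.0, scale)

def max_deviation(steps=500, closed_loop=True):
    """Largest deviation from the expert trajectory (taken to be 0)."""
    state, worst = 0.0, 0.0
    for _ in range(steps):
        if closed_loop:
            state += policy_error(state)  # errors feed back into the state
        else:
            state = policy_error(0.0)     # each step restarts from ground truth
        worst = max(worst, abs(state))
    return worst

random.seed(0)
open_dev = max_deviation(closed_loop=False)
closed_dev = max_deviation(closed_loop=True)
print(f"open-loop max deviation:   {open_dev:.3f}")
print(f"closed-loop max deviation: {closed_dev:.3f}")
```

In open loop, each prediction is scored from a ground-truth state, so errors never compound; in closed loop, the policy’s own errors become its next inputs, and deviations grow until the model is “out of domain”.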

A major breakthrough with Waymo’s MotionLM: “what we found is actually open-loop improvements translate to closed-loop improvements. This is not a given, but it’s a really good result.”

This property is significant because it allows Waymo to harness their massive repository of human driving data. While most robotics domains lack sufficient data, autonomous driving has “no shortage of demonstration” from both expert drivers and observed human behavior.

Because their architecture ensures that gains in imitation training (open-loop) hold up during actual driving (closed-loop), Waymo can scale their models with the confidence that more data will directly result in a safer, more capable driver.

Part 5: Tackling the Long Tail and Social Intelligence

The hardest part of the “long tail” is not just detecting objects, but understanding social intelligence and predicting human intent in a complex, multi-agent environment.

Waymo has spent years training its driver to understand the gestures of police officers, the gaze of pedestrians, and the skeletal poses of cyclists to predict their intent. This level of “situational awareness” is critical for navigating dense urban environments where pedestrians, cyclists, and drivers are in constant negotiation.

The system is trained to understand the “gestures of personnel that’s directing traffic, of responders, and of firefighters”. This social intelligence extends to construction zones, where human workers may use a mix of hand signals and physical signs.

Handling these cases often requires navigating a hierarchy of rules. For example, a car must know when it is appropriate to “violate” a standard rule, such as crossing a double yellow line to bypass a construction zone when directed by a human.

Because there is no single formula for these interactions, Waymo uses imitation learning from expert drivers to teach the model how to prioritize these social cues over static road rules.

This situational awareness is tested through the Waymo Open Dataset (WOD), which now includes planning-like challenges and thousands of rare, semantically complex scenarios.

Part 6: The Mandate of Safety and Global Scaling

Waymo’s strategy for expansion is a methodical “carving up” of the Operational Design Domain (ODD). They started in suburbs (Phoenix), moved to dense urban environments (San Francisco), and are now mastering freeways.

Why are expansions primarily in southern states? “That’s because they don’t have snow.” However, Waymo has been testing in snowy conditions for years, from Lake Tahoe to Minnesota, and Detroit is a crucial part of Waymo’s strategy to “cover the full space of needs”.

The technology is also highly adaptable internationally. Waymo is seeing “positive transfer” between different vehicle types and geographies. Waymo plans to launch fully driverless operations in London next year, a testament to the system’s ability to generalize from right-side to left-side driving with minimal bespoke local modifications.

Ultimately, this scaling is driven by a “safety mandate”.

With over 100 million fully autonomous miles driven, Waymo’s data shows they are at least 5 times safer than human drivers regarding collisions and 12 times safer regarding impacts on pedestrians and bicyclists.

With 40,000 traffic fatalities annually in the U.S., Drago views these safety numbers as the motivation to bring this technology to as many people as possible.

The full video is available at https://www.mmi.mit.edu/forum.

More articles on AI, Mobility and Cities at https://zhaojinhua.com/newsletter/

Subscribe to the Newsletter

Join my newsletter to understand what actually works, what doesn’t, and what might come next.

