Humanoid Robot AI Stacks Compared: FSD, Helix, Gemini
Humanoid Inc Research
The AI brain is the most-searched spec in humanoid robotics. Tesla runs FSD adapted for robots. Figure built Helix from scratch. Google is bringing Gemini into the physical world. Here's how the stacks actually compare — and why the architecture battle matters.
Ask anyone following humanoid robotics what matters most, and they'll say "the AI." It's the right instinct. Mechanical design is converging — everyone is building roughly the same 28-DOF biped with rotary actuators and stereo cameras. The differentiator is the software that turns sensor data into action.
Three AI architectures dominate the conversation in mid-2026: Tesla's FSD transfer approach, Figure AI's Helix platform, and Google DeepMind's Gemini Robotics. They represent fundamentally different philosophies about how to build an AI brain for a physical robot.
The Architectures
Tesla: FSD Robotics (Transfer Learning)
Tesla's approach is unique: take a neural network trained on billions of miles of driving data and retrain it for robot control. The thesis is that driving is a robotics problem — perception, planning, and control in a dynamic environment — and the FSD stack has already solved the hardest parts.
How it works:
- The FSD neural network processes camera inputs into a vector-space representation of the environment
- For Optimus, the perception pipeline remains largely the same — cameras see the factory floor instead of the road
- The planning and control layers are retrained for manipulation tasks (grasping, placing, assembly) instead of driving tasks (steering, acceleration, lane changes)
- Training data comes from Optimus units operating in Tesla factories, with human teleoperation providing demonstration data
Architecture: End-to-end neural network. Camera pixels in, joint commands out. No explicit modularity between perception, planning, and control — the network learns the mapping directly.
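To make "camera pixels in, joint commands out" concrete, here is a minimal sketch of an end-to-end policy in plain Python. It is not Tesla's network: the class name `EndToEndPolicy`, the layer sizes, the 8x8 toy frame, and the random weights are all assumptions for illustration, and `NUM_JOINTS = 28` simply reuses the 28-DOF figure quoted earlier. The point is structural: there is no named perception, planning, or control stage between input and output.

```python
import math
import random

NUM_JOINTS = 28  # assumption: reuses the "28-DOF biped" figure from the article


def make_layer(n_in, n_out, rng):
    """Build an n_out x n_in weight matrix; random values stand in for trained weights."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]


def forward(layer, x, activation):
    """Apply one dense layer (no bias) followed by an elementwise activation."""
    return [activation(sum(w * v for w, v in zip(row, x))) for row in layer]


class EndToEndPolicy:
    """Hypothetical pixels-to-joints policy with no explicit intermediate modules."""

    def __init__(self, n_pixels, hidden=16, seed=0):
        rng = random.Random(seed)
        self.l1 = make_layer(n_pixels, hidden, rng)
        self.l2 = make_layer(hidden, NUM_JOINTS, rng)

    def act(self, pixels):
        h = forward(self.l1, pixels, lambda z: max(0.0, z))  # ReLU hidden layer
        return forward(self.l2, h, math.tanh)  # normalized commands in [-1, 1]


policy = EndToEndPolicy(n_pixels=64)
frame = [i / 63 for i in range(64)]  # toy 8x8 grayscale frame, values in [0, 1]
commands = policy.act(frame)
print(len(commands))
```

A production system would use a far larger vision backbone and real trained weights, but the interface stays the same shape: one call from raw pixels to one command per joint.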
Key advantage: Scale of training data. Tesla has 1,000+ Optimus units generating real-world interaction data, plus the ability to generate synthetic data from factory digital twins. No other humanoid program has this volume of training data.
Key risk: Driving and manipulation are different domains. A car navigates an essentially 2D road surface under well-defined rules. A robot manipulates objects in 3D space, where contact forces and an open-ended variety of objects create far more edge cases. The FSD transfer might work for navigation but struggle with dexterity.
Figure AI: Helix (Neural-First Architecture)
Figure AI's Helix is the most talked-about humanoid-specific AI architecture. The key claim: Helix replaced 109,000 lines of hand-coded C++ with a single neural prior — a learned representation that generates locomotion and manipulation behavior without explicit programming.
How it works (Helix 02, current version):
Three-layer architecture:
- Neural Prior (S1): An 80M-parameter "world model" trained on Figure's deployment data. This isn't a language model; it's a visuomotor model that learns the physics of how Figure's body interacts with objects and environments. It handles low-level control: joint torques, balance, gait.
- Semantic Planner (S2): A vision-language model (VLM) that interprets high-level instructions ("pick up the red bin and place it on the conveyor") into a sequence of sub-goals. Each sub-goal is a target state for the neural prior to achieve.
- Transformer Bridge: A cross-attention layer that maps semantic sub-goals to neural prior latent space. This is the critical integration point: it translates "pick up the red bin" into "move end effector to position X with grasp configuration Y."
Key insight: The neural prior handles everything below the level of conscious planning. It doesn't "think" about how to walk — it just walks, the way you don't think about which muscles to contract when you stand up. The S2 planner only engages when the task requires reasoning.
Architecture: Hybrid — learned low-level control (S1) + VLM planning (S2) + transformer bridge.
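The division of labor described above can be sketched as a two-rate loop: a slow planner emits sub-goals, a bridge converts each one into a target for the controller, and a fast controller ticks many times per sub-goal. This is a toy illustration, not Figure's code. The functions `semantic_planner`, `bridge`, and `neural_prior_step`, the one-dimensional "end effector," the hard-coded sub-goals, and the tick count are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class SubGoal:
    name: str
    target: float  # toy 1-D end-effector target position


def semantic_planner(instruction):
    """Stand-in for S2: returns ordered sub-goals (hard-coded; the instruction is ignored)."""
    return [SubGoal("reach", 1.0), SubGoal("lift", 0.5)]


def bridge(sub_goal):
    """Stand-in for the bridge: maps a sub-goal to a 'latent' the controller consumes."""
    return {"target": sub_goal.target}


def neural_prior_step(position, latent, gain=0.2):
    """Stand-in for S1: one fast control tick that moves toward the latent target."""
    return position + gain * (latent["target"] - position)


def run(instruction, ticks_per_subgoal=50):
    position, trace = 0.0, []
    for goal in semantic_planner(instruction):  # slow loop: once per sub-goal
        latent = bridge(goal)
        for _ in range(ticks_per_subgoal):  # fast loop: many ticks per sub-goal
            position = neural_prior_step(position, latent)
        trace.append((goal.name, round(position, 3)))
    return trace


result = run("pick up the red bin and place it on the conveyor")
print(result)  # [('reach', 1.0), ('lift', 0.5)]
```

Note that the planner is consulted only at sub-goal boundaries, while the controller runs on every tick. That is the property the "Key insight" paragraph describes: the fast loop never waits on reasoning.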
Key advantage: Deployment-proven. The BMW Spartanburg deployment ran on this architecture. 90,000+ parts loaded, 1,250+ operational hours. The neural prior approach also means Figure can add new tasks by updating the semantic planner without retraining the entire control stack.
Key risk: The S1/S2 split is elegant but adds latency. The transformer bridge introduces an inference step between "decide what to do" and "do it." For real-time tasks requiring high-frequency control loops, this could be a bottleneck.
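A quick back-of-the-envelope calculation shows why an extra inference step matters. The numbers below are hypothetical (Figure has not published them here): a 200 Hz control loop and a 12 ms bridge inference call. The helper names `control_period_ms` and `ticks_stale` are made up for this sketch.

```python
import math


def control_period_ms(rate_hz):
    """Time between control ticks, in milliseconds."""
    return 1000.0 / rate_hz


def ticks_stale(inference_ms, rate_hz):
    """Number of control ticks that elapse while one inference call is in flight."""
    return math.ceil(inference_ms / control_period_ms(rate_hz))


period = control_period_ms(200)  # hypothetical 200 Hz loop
stale = ticks_stale(inference_ms=12.0, rate_hz=200)  # hypothetical 12 ms inference
print(period, stale)  # 5.0 3
```

Under these assumed figures, the controller acts on a sub-goal that is up to three ticks old. Whether that matters depends on the task: it is harmless for slow pick-and-place and potentially significant for fast contact-rich motion.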
Google DeepMind: Gemini Robotics
Google's entry into embodied AI leverages its most powerful asset: Gemini, the multimodal foundation model that competes with GPT-5. Gemini Robotics is an adaptation of Gemini for physical robot control.
How it works:
- Gemini processes visual input (robot cameras), language input (instructions), and proprioceptive input (joint positions, forces) as a unified multimodal stream
- Output is an action sequence: joint trajectories, gripper commands, and navigation targets
- Fine-tuned on a mixture of real-world robot data (from Google's internal robotics fleet and partner deployments) and synthetic data (simulation + video understanding)
Architecture: Single unified model. Gemini processes everything through the same transformer stack. No explicit separation between "thinking" and "doing." The model's native multimodality — it already understands images, text, video, and code — extends to robot actions.
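One way a single transformer can "extend to robot actions" is to treat continuous joint commands as discrete tokens, the same way it treats words. The sketch below shows that idea only. Whether Gemini Robotics represents actions this way is an assumption, not something the source states, and the bin count, value range, and function names `encode` and `decode` are illustrative.

```python
NUM_BINS = 256  # assumption: size of the action vocabulary per joint


def encode(value, low=-1.0, high=1.0):
    """Map a continuous command to an integer token by uniform binning."""
    clipped = min(max(value, low), high)
    token = int((clipped - low) / (high - low) * NUM_BINS)
    return min(token, NUM_BINS - 1)  # keep the upper edge inside the last bin


def decode(token, low=-1.0, high=1.0):
    """Map a token back to the center of its bin."""
    return low + (token + 0.5) * (high - low) / NUM_BINS


values = [-1.0, 0.0, 0.25, 1.0]
tokens = [encode(v) for v in values]
recovered = [decode(t) for t in tokens]
print(tokens)  # [0, 128, 160, 255]
```

The round trip is lossy by at most half a bin width (here 1/256 of the normalized range), which is the price of letting one model emit text and actions from the same output head.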
Key advantage: Gemini's scale and generality. The same model that can analyze a research paper, write code, and understand a video can also control a robot. This means Gemini Robotics inherits Gemini's world knowledge — it understands what a "cup" is, what it's for, and how humans interact with it, before it ever touches one.
Key risk: Deployment data. Google has the best foundation model but the least real-world humanoid deployment data among the three. Tesla has 1,000+ Optimus units. Figure has BMW deployment data. Google has lab robots and simulation — the sim-to-real gap is still significant for manipulation.
Head-to-Head Comparison
| | Tesla FSD Robotics | Figure Helix | Google Gemini Robotics |
|---|---|---|---|
| Architecture | End-to-end neural | Hybrid (S1 prior + S2 VLM) | Unified multimodal |
| Training data | 1,000+ Optimus units + FSD miles | BMW deployment + lab data | Lab robots + sim + web-scale |
| Key strength | Scale of real-world data | Proven factory deployment | General world knowledge |
| Key weakness | Driving-to-manipulation transfer gap | S1/S2 latency overhead | Sim-to-real gap |
| Deployment maturity | Internal factory use | Commercial pilot (BMW) | Research/lab stage |
| Model size | Not disclosed (likely 1-10B params) | S1: 80M params; S2: GPT-4-class VLM | Gemini-scale (100B+ params) |
| Inference hardware | Custom (FSD chip derivative) | Onboard GPU + cloud offload | Cloud-primary (TPU) |
| Real-time capable | Yes (onboard) | Yes (S1 onboard, S2 offload) | Latency-dependent on cloud |
The Architecture Battle
These three approaches map to a deeper debate in embodied AI: modular vs. monolithic.
The modular argument (Figure): Different parts of robot control have different requirements. Low-level motor control needs millisecond-scale latency and deterministic behavior. High-level planning needs reasoning and common sense. Trying to serve both from one model means compromising on both. Keep them separate with a clean interface.
The monolithic argument (Tesla, Google): Every modular boundary is a source of error and latency. The planner's "pick up the box" sub-goal might not map cleanly to the controller's grasp manifold. A unified model that learns pixels-to-joints end-to-end can discover control strategies that modular systems can't represent. And as models scale, the performance gap between modular and monolithic narrows.
The data argument (all three): Architecture debates aside, the winner will be determined by training data. Tesla has the most real-world robot data. Figure has the most deployment-proven architecture. Google has the most capable base model. Whoever combines all three first wins.
What to Watch
Figure's next deployment announcement. The BMW pilot proved the architecture works in one factory. Figure's next customer deployment will show whether Helix generalizes to new environments, new tasks, and new part geometries — or whether BMW was a special case.
Tesla's manipulation metrics. Tesla talks about Optimus in terms of units and factories. What's missing: detailed task-completion rates, cycle times, and reliability data for specific manipulation tasks. When Tesla starts publishing these numbers, it means the system is ready for external scrutiny.
Gemini Robotics' sim-to-real results. Google has a history of building impressive research demos that don't ship. If Gemini Robotics can demonstrate reliable real-world manipulation without the sim-to-real gap causing failures, it changes the competitive landscape overnight.
Commoditization risk. If the AI stack becomes commoditized — if a capable visuomotor model becomes available as open source or a cheap API — then the architecture debate becomes academic. The winners will be the companies with the best hardware economics and deployment relationships, not the best AI. Unitree at $16K doesn't care whether the brain is FSD, Helix, or Gemini.
Bottom Line
The AI brain competition is the most important technical battle in humanoid robotics. Figure has the deployment-proven architecture. Tesla has the data. Google has the model. The company that combines all three — a capable architecture, massive real-world training data, and a general-purpose base model — will have an insurmountable lead.
Right now, nobody has all three. The race is on.