Get ready to rethink what robots can do—Xiaomi, the tech giant known for smartphones and smart home gadgets, is now diving headfirst into robotics with its groundbreaking Xiaomi-Robotics-0, a first-generation large-scale robot model. But here’s where it gets controversial: Xiaomi claims this isn’t just another robot—it’s a leap toward what they call “physical intelligence,” blending visual understanding, language comprehension, and real-time action execution into one seamless system. Is this the future of robotics, or just another overhyped tech promise?
At its core, Xiaomi-Robotics-0 is an open-source Vision-Language-Action (VLA) model packed with 4.7 billion parameters. According to Xiaomi, it already outperforms dozens of existing models in both simulation benchmarks and real-world tests. But what does that mean? Imagine a robot that doesn't just follow commands but truly understands them, even vague ones like "fold the towel." That's the promise of Robotics-0, designed to close the loop between perception, decision-making, and execution. And this is the part most people miss: Xiaomi says it achieved this by balancing broad understanding with precise motor control, a challenge that has stumped many in the field.
Here's how it works: the model is built on a Mixture-of-Transformers (MoT) architecture that splits responsibilities between two components. First is the Visual Language Model (VLM), the robot's "brain." It interprets human instructions, reasons about spatial relationships from high-resolution visuals, and handles tasks like object detection and logical reasoning. Second is the Action Expert, powered by a multi-layer Diffusion Transformer (DiT). Rather than emitting one action at a time, it generates "Action Chunks," smooth sequences of movements, using flow-matching techniques. But does this approach really solve the long-standing problem of robots losing their understanding once they learn to move? Xiaomi claims it does, by co-training the model on both multimodal and action data so it can reason and act at the same time.
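To make that split concrete, here is a minimal, hypothetical sketch of a VLM backbone feeding a DiT-style action expert that samples an action chunk by integrating a learned flow from noise. All module names, dimensions (14-dimensional actions, 16-step chunks), and layer counts are illustrative assumptions, not Xiaomi's actual architecture.

```python
# Minimal architectural sketch: a VLM "brain" conditioning a DiT action expert.
# Everything here (module names, 14-dim actions, 16-step chunks, layer sizes)
# is an illustrative assumption, not Xiaomi's published design.
import torch
import torch.nn as nn


class VisionLanguageBackbone(nn.Module):
    """Stand-in for the VLM: fuses image and instruction tokens into one context."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.img_proj = nn.Linear(768, dim)   # toy vision features -> shared width
        self.txt_proj = nn.Linear(512, dim)   # toy text features   -> shared width
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, img_feats, txt_feats):
        tokens = torch.cat([self.img_proj(img_feats), self.txt_proj(txt_feats)], dim=1)
        return self.encoder(tokens)           # multimodal context for the action expert


class ActionDiT(nn.Module):
    """Stand-in for the diffusion-transformer action expert. It predicts a velocity
    field that carries noisy action chunks toward clean ones (the flow-matching view)."""

    def __init__(self, dim: int = 256, act_dim: int = 14, chunk_len: int = 16):
        super().__init__()
        self.act_in = nn.Linear(act_dim, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.act_out = nn.Linear(dim, act_dim)
        self.chunk_len, self.act_dim = chunk_len, act_dim

    def forward(self, noisy_chunk, context):
        h = self.decoder(self.act_in(noisy_chunk), context)
        return self.act_out(h)                # predicted velocity for each timestep

    @torch.no_grad()
    def sample(self, context, steps: int = 10):
        # Integrate the learned flow from pure noise to an action chunk (Euler steps).
        x = torch.randn(context.size(0), self.chunk_len, self.act_dim)
        for _ in range(steps):
            x = x + self.forward(x, context) / steps
        return x                              # a smooth "Action Chunk" to execute


vlm, expert = VisionLanguageBackbone(), ActionDiT()
ctx = vlm(torch.randn(1, 32, 768), torch.randn(1, 8, 512))  # fake image/text tokens
chunk = expert.sample(ctx)                                   # shape (1, 16, 14)
```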
Training this beast isn’t simple. It happens in stages: First, an “Action Proposal” mechanism aligns the VLM’s visual understanding with action execution. Then, the VLM is frozen, and the DiT is trained separately to generate precise action sequences from noise. Xiaomi also tackles inference latency—those awkward pauses between a robot’s thought and action—by implementing asynchronous inference, keeping movements continuous even when the model takes its time. To ensure stability, they use a “Clean Action Prefix” technique, feeding previous actions back into the model for jitter-free motion. Plus, a Λ-shaped attention mask keeps the robot focused on current visual input, making it more responsive to sudden changes.
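To illustrate the asynchronous idea, here is a small sketch of a planner thread that keeps refilling a queue of actions while a fixed-rate control thread drains it, with the most recently executed actions fed back as a "clean prefix." The helpers (model.predict_chunk, camera.read, robot.execute_step) and the prefix length are hypothetical stand-ins, not Xiaomi's API.

```python
# Sketch of asynchronous inference with a clean action prefix.
# model, camera, and robot are hypothetical objects; only the threading
# pattern is the point here.
import queue
import threading
import time

import numpy as np

action_queue: "queue.Queue[np.ndarray]" = queue.Queue()
PREFIX_LEN = 4  # how many already-executed actions are fed back as the clean prefix


def planner_loop(model, camera, instruction):
    """Background thread: keeps producing action chunks while the arm keeps moving."""
    executed_prefix = np.zeros((PREFIX_LEN, 14))  # last actions actually sent to the robot
    while True:
        obs = camera.read()
        # Conditioning on real, noise-free history lets consecutive chunks
        # stitch together without jitter.
        chunk = model.predict_chunk(obs, instruction, prefix=executed_prefix)
        for step in chunk:
            action_queue.put(step)
        executed_prefix = chunk[-PREFIX_LEN:]


def control_loop(robot, hz: int = 30):
    """Real-time thread: executes at a fixed rate even if the planner lags behind."""
    period = 1.0 / hz
    last = None
    while True:
        try:
            last = action_queue.get(timeout=period)
        except queue.Empty:
            pass                      # planner is slow: hold or repeat the last action
        if last is not None:
            robot.execute_step(last)
        time.sleep(period)


# threading.Thread(target=planner_loop, args=(model, camera, "fold the towel"),
#                  daemon=True).start()
# control_loop(robot)
```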
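As for the Λ-shaped mask, one plausible reading (borrowed from the common "attend to the prompt prefix plus a recent window" pattern) looks like the toy construction below; the exact shape Xiaomi uses may differ.

```python
# Toy Λ-shaped attention mask: every query keeps the instruction prefix plus only
# the most recent tokens, dropping stale history. An interpretation, not Xiaomi's recipe.
import torch


def lambda_mask(seq_len: int, prefix_len: int, window: int) -> torch.Tensor:
    """Returns a boolean mask where True means attention is allowed."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    mask[:, :prefix_len] = True                  # everyone sees the instruction prefix
    for q in range(seq_len):
        lo = max(prefix_len, q - window + 1)
        mask[q, lo:q + 1] = True                 # plus a causal window of recent tokens
    return mask


print(lambda_mask(seq_len=8, prefix_len=2, window=3).int())
```

The effect is that the latest visual tokens always sit inside the attention window, which is one way a model can stay responsive to sudden changes in the scene.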
In benchmark tests, Xiaomi-Robotics-0 reportedly outperformed 30 other models in simulations like LIBERO, CALVIN, and SimplerEnv. Even more impressive? Xiaomi deployed it on a dual-arm robot platform, where it handled long-horizon tasks like folding towels and disassembling blocks with steady hand-eye coordination. Unlike earlier VLA systems that sacrificed reasoning for action, Robotics-0 retains strong visual and language capabilities, especially in tasks blending perception and physical interaction. Is this the robot revolution we’ve been waiting for, or just another step in a long journey?
What do you think? Is Xiaomi’s Robotics-0 a game-changer, or just another tech experiment? Let us know in the comments! (Source: Xiaomi Robotics-0)