To be effective, domestic and personal care robots must handle a wide variety of tasks, such as stacking and unstacking, fetching and returning items, emptying and refilling containers, screwing and unscrewing, switching devices on and off, opening and closing doors, and folding and putting away laundry. The most popular home robot today, the robot vacuum cleaner, is designed to avoid everything except dirt.
The rapid advances in AI have been driven by the vast amounts of data available for training models. Training robots is far harder, however, because it requires extensive video-based data of physical interactions within the complex, unpredictable environment of a home.
A Breakthrough Approach: Teaching AI to Think in Actions
DeepMind researchers describe their breakthrough solution as “simple and surprisingly effective.” They began with powerful vision-language models (VLMs) such as PaLI-X and PaLM-E, which take vision and language as input and produce free-form text output, and are typically used for tasks such as object classification and image captioning. Instead of having the VLMs produce text, the researchers trained them on “robotic trajectories” (i.e., records of how a robotic limb moves): each action is tokenized into text tokens, and those tokens are combined with a robotic instruction and a camera observation into a multimodal sentence, so the model learns to respond to an instruction paired with an image by producing the corresponding action. By getting the AI to think directly in actions rather than words, the model can produce commands that follow robotic policies: for example, move the robotic arm in a linear, joint, or arc motion along the x-axis, y-axis, or z-axis.
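To make the idea concrete, here is a minimal Python sketch of this kind of action tokenization. The bin count, workspace size, and token format are assumptions for illustration only; they are not the exact scheme used in DeepMind's models.

```python
# Illustrative sketch of action tokenization (not DeepMind's exact scheme).
# A continuous end-effector action is discretized into integer bins, and each
# bin is written out as a plain text token so a vision-language model can
# predict it the same way it predicts words.

NUM_BINS = 256          # assumed resolution of the discretization
WORKSPACE_MM = 1000.0   # assumed workspace extent per axis, in millimetres


def discretize(value_mm: float) -> int:
    """Map a continuous coordinate in [0, WORKSPACE_MM] to an integer bin."""
    clipped = max(0.0, min(WORKSPACE_MM, value_mm))
    return round(clipped / WORKSPACE_MM * (NUM_BINS - 1))


def action_to_tokens(dx: float, dy: float, dz: float, gripper_open: bool) -> str:
    """Render a robot action as a short 'sentence' of text tokens."""
    return " ".join([
        str(discretize(dx)),
        str(discretize(dy)),
        str(discretize(dz)),
        "1" if gripper_open else "0",
    ])


# A multimodal training example pairs an instruction (and a camera image,
# omitted here) with the tokenized action as the target text.
instruction = "pick up the green circle"
target = action_to_tokens(400.0, 200.0, 50.0, gripper_open=False)
print(f"Q: what action should the robot take to {instruction}? A: {target}")
```

Because the action is now just a short string, it can live in the same output vocabulary the model already uses for ordinary text.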
The DeepMind researchers discovered that their AI-powered robot:
“[exhibits]… a range of remarkable capabilities, combining the physical motions learned from the robot data with the ability to interpret images and text learned from web data into a single model… [w]hile the model’s physical skills are still limited to the distribution of skills seen in the robot data, the model acquires the ability to deploy those skills in new ways by interpreting images and language commands using knowledge gleaned from the web.”
By transferring knowledge from the web, the VLM-powered robot exhibited new capabilities beyond those demonstrated in the robot data, which the researchers termed “emergent” in the sense that they arise from transferring Internet-scale pretraining. For example, a conventional robot would have to be instructed to move an object from Point A (e.g., x=0, y=0) to Point B (e.g., x=400, y=200), whereas the VLM-powered robot can simply be told to move the green circle to the yellow hexagon: it identifies the objects using what it learned from its internet-scale training data and performs the task accordingly.
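The difference between the two command styles can be sketched as follows. Both functions below are hypothetical stand-ins: `move_to` represents a conventional coordinate-level command, and `ground_object` represents the VLM's ability to locate “the green circle” or “the yellow hexagon” in a camera image using knowledge picked up from web-scale pretraining.

```python
# Hedged sketch of the two command styles described above.

from typing import Tuple


def move_to(x: float, y: float) -> None:
    """Low-level command a conventional robot expects (coordinates in mm)."""
    print(f"moving end effector to x={x}, y={y}")


def ground_object(description: str) -> Tuple[float, float]:
    """Hypothetical VLM call: map an object description to workspace coordinates."""
    detections = {"green circle": (0.0, 0.0), "yellow hexagon": (400.0, 200.0)}
    return detections[description]


# Conventional robot: the operator supplies the numbers.
move_to(400.0, 200.0)

# VLM-powered robot: the operator supplies language; the model supplies the numbers.
source = ground_object("green circle")
target = ground_object("yellow hexagon")
print(f"pick at {source}, place at {target}")
```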
Beyond AI: Bridging Digital Intelligence and Physical Actions
With significant advancements in AI, it seems logical to equip Large Language Models (LLMs) with robotic arms and legs. According to Google DeepMind researchers:
“High-capacity models pretrained on broad web-scale datasets provide a powerful platform for various downstream tasks. Large language models enable fluent text generation, emergent problem-solving, and creative prose and code generation, while vision-language models enable open-vocabulary visual recognition and complex inferences about object-agent interactions in images. These capabilities would be extremely useful for generalist robots performing various tasks in real-world environments.”
However, realizing this vision is more challenging than it might seem. The core issue is that AI models and robots operate in different realms. Robots interact with their physical environment using Cartesian coordinates (x, y, and z axes), where movements are defined in 3D space. For instance, instructing a robot to move to “x=100, y=0” means moving straight ahead 100 mm along the x-axis. Each robot has a base Cartesian coordinate system, plus additional systems, known as frames, for each moving part or tool. Within each frame, the robot has three primary motion types: linear (A to B in a straight line), joint (A to B via whatever non-linear path the joints take), and arc (moving around a fixed point at a constant radius).
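A small sketch illustrates these ideas; the `ArmController` class and its method names are hypothetical stand-ins for a vendor-specific motion API, not a real library.

```python
# Hypothetical controller sketch of the Cartesian commands described above.

from dataclasses import dataclass


@dataclass
class Pose:
    x: float  # mm along the x-axis of the active frame
    y: float  # mm along the y-axis
    z: float  # mm along the z-axis


class ArmController:
    def __init__(self, frame: str = "base"):
        self.frame = frame  # e.g. "base", "tool", or another named frame

    def move_linear(self, target: Pose) -> None:
        """A to B in a straight line within the active frame."""
        print(f"[{self.frame}] linear move to {target}")

    def move_joint(self, target: Pose) -> None:
        """A to B along whatever (non-linear) path the joints take."""
        print(f"[{self.frame}] joint move to {target}")

    def move_arc(self, via: Pose, target: Pose) -> None:
        """Sweep through a via point at constant radius to the target."""
        print(f"[{self.frame}] arc move via {via} to {target}")


arm = ArmController(frame="base")
arm.move_linear(Pose(x=100.0, y=0.0, z=0.0))  # "straight ahead 100 mm"
```

Switching the active frame (say, from the base frame to a tool frame) changes which coordinate system the same numbers are interpreted in.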
Conversely, LLMs operate in the digital realm, processing and generating text based on semantics, labels, and textual prompts, without perceiving or interacting with the physical world. So while LLMs produce outputs in words, robots require inputs expressed as coordinates along the x, y, and z axes.
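In practice, bridging the two realms means parsing the model's text output back into numbers a controller can execute. The sketch below assumes the same illustrative token format as the earlier tokenization example, not any particular model's actual output format.

```python
# Minimal sketch of the bridge described above: the model emits text tokens,
# and a thin parsing layer turns them back into the x, y, z values a robot
# controller actually consumes.

NUM_BINS = 256          # must match the (assumed) tokenization scheme
WORKSPACE_MM = 1000.0   # assumed workspace extent per axis, in millimetres


def tokens_to_action(token_string: str) -> dict:
    """Convert a predicted token string like '102 51 13 0' back into mm plus a gripper flag."""
    dx_bin, dy_bin, dz_bin, gripper = token_string.split()
    to_mm = lambda b: int(b) / (NUM_BINS - 1) * WORKSPACE_MM
    return {
        "x_mm": to_mm(dx_bin),
        "y_mm": to_mm(dy_bin),
        "z_mm": to_mm(dz_bin),
        "gripper_open": gripper == "1",
    }


print(tokens_to_action("102 51 13 0"))
```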
The Future of Domestic Robotics
While significant progress has been made, many challenges remain before generalist home robots become a reality. Ensuring safety, reliability, and efficiency in unpredictable environments will require further advances in both AI and robotics. Nevertheless, the potential benefits are immense, ranging from increased convenience in everyday tasks to enhanced care for the elderly and disabled.
As research continues, we can look forward to a future where robots are not just intelligent but also adept at navigating and interacting with the physical world, making our lives easier and more productive.
By integrating the latest advancements in AI and robotics, researchers are steadily working towards creating versatile and capable home robots that can perform a wide range of tasks, bringing the promise of futuristic home automation closer to reality.