RT-2: DeepMind's new model is launching!
This is just a glimpse of what Robots 2.0 could look like in the near future. RT-2 combines web data with vision and turns them into actions. In the future, it could draw on everything published on the web, combine it with what it sees, and then act.
Robotic Transformer 2 (RT-2) is a novel vision-language-action (VLA) model that learns from both web and robotics data and translates this knowledge into generalized instructions for robotic control.
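To make that translation concrete, the RT-2 paper represents robot actions as strings of discrete tokens, so the same decoder that emits words can also emit actions. The sketch below only illustrates that idea and is not DeepMind's code: the function names, action bounds, and 7-dimensional end-effector action are assumptions, though the 256-bin discretization follows the paper.

```python
import numpy as np

# Illustrative sketch only (not DeepMind's code): robot actions are expressed
# as strings of discrete tokens so a vision-language model's text decoder can
# emit them the same way it emits words.

NUM_BINS = 256  # the RT-2 paper discretises each action dimension into 256 bins

def action_to_tokens(action, low, high, num_bins=NUM_BINS):
    """Map a continuous action vector to a space-separated string of bin indices."""
    action = np.clip(action, low, high)
    bins = np.round((action - low) / (high - low) * (num_bins - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def tokens_to_action(token_str, low, high, num_bins=NUM_BINS):
    """Invert the mapping: decode bin indices back into a continuous action."""
    bins = np.array([int(t) for t in token_str.split()])
    return low + bins / (num_bins - 1) * (high - low)

# Hypothetical 7-dimensional end-effector action:
# (dx, dy, dz, droll, dpitch, dyaw, gripper)
low = np.array([-0.1, -0.1, -0.1, -np.pi, -np.pi, -np.pi, 0.0])
high = np.array([0.1, 0.1, 0.1, np.pi, np.pi, np.pi, 1.0])
action = np.array([0.02, -0.05, 0.0, 0.1, 0.0, -0.2, 1.0])

tokens = action_to_tokens(action, low, high)      # a short string of integers
recovered = tokens_to_action(tokens, low, high)   # close to the original action
print(tokens)
print(recovered)
```

Because actions live in the same token space as language, web-scale pre-training and robot fine-tuning can share a single output head, which is what lets knowledge from the web transfer into robotic control.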
High-capacity vision-language models (VLMs) are trained on web-scale datasets, making these systems remarkably good at recognizing visual or language patterns and operating across different languages. But for robots to achieve a similar level of competency, they would need to collect robot data, first-hand, across every object, environment, task, and situation.
RT-2 shows improved generalization capabilities and semantic and visual understanding beyond the robotic data it was exposed to. This includes interpreting new commands and responding to user commands by performing rudimentary reasoning, such as reasoning about object categories or high-level descriptions.
DeepMind also shows that incorporating chain-of-thought reasoning allows RT-2 to perform multi-stage semantic reasoning, like deciding which object could be used as an improvised hammer (a rock), or which type of drink is best for a tired person (an energy drink).
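In the chain-of-thought variant, the model writes out a short natural-language plan before emitting its action tokens. The snippet below is a hypothetical sketch of what such a combined target string might look like; the exact prompt format and the build_cot_target helper are illustrative assumptions, not the actual RT-2 format.

```python
# Hypothetical sketch of the chain-of-thought idea: the model first states a
# brief plan in natural language, then the discretised action tokens.

def build_cot_target(plan: str, action_tokens: str) -> str:
    """Combine a short reasoning step ("Plan") with discretised action tokens."""
    return f"Plan: {plan}. Action: {action_tokens}"

print(build_cot_target(
    plan="pick the rock, it can serve as an improvised hammer",
    action_tokens="132 114 128 25 156 119 255",  # made-up bin indices
))
# Plan: pick the rock, it can serve as an improvised hammer. Action: 132 114 128 25 156 119 255
```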