FEATURE

real robot data to generate synthetic trajectories and compressing them into compact action tokens, the company cut development time for its GR00T N1.5 model from three months to just 36 hours. For the report's authors, this "real-to-real" synthetic workflow signals a step-change in training efficiency.
Equally important is progress in imitation learning and teleoperation. Historically, capturing full-body human motion required expensive marker systems and complex setups. Now, reinforcement learning systems can infer human pose from vision alone and map it to humanoid motion in real time. Forrester cites the H2O (human-to-humanoid) framework, which uses only RGB camera input to deliver real-time whole-body teleoperation. The implication, the analysts note, is profound:
demonstration capture is no longer a bottleneck reserved for elite labs. These advances are transforming humanoid development from a craft built on hand-tuned behaviors into a model-driven engineering discipline.
Generative AI is also reshaping perception. Multimodal foundation models fuse video, images, audio and text into unified architectures that help robots interpret intent and context. Vendors are investing heavily in world models that predict future frames, anticipate object motion and support long-horizon planning. Among those named in the report are Meta with V-JEPA 2; Google with Gemini Robotics; Alibaba with Qwen3-VL; and UBTECH Robotics with its multimodal inference model.
Forrester's analysts emphasize that richer scene understanding is not cosmetic. It enables more reliable task execution across varied environments and instruction styles – a
INTELLIGENT CIO NORTH AMERICA