Zero-shot generalization across varied robots, tasks, and environments remains a significant challenge in robotic manipulation. Policy code generation methods use executable code to bridge high-level task descriptions and low-level action sequences, leveraging the generalization capabilities of Large Language Models (LLMs) and atomic skill libraries. In this work, we propose Robotic Programmer (RoboPro), a robotic foundation model that perceives visual information and follows free-form instructions to perform robotic manipulation via policy code in a zero-shot manner. To address the low efficiency and high cost of collecting runtime code data for robotic tasks, we propose Video2Code, which synthesizes executable code from extensive in-the-wild videos with an off-the-shelf vision-language model and a code-domain large language model. Extensive experiments show that RoboPro achieves state-of-the-art zero-shot performance on robotic manipulation in both simulated and real-world environments. Specifically, the zero-shot success rate of RoboPro on RLBench surpasses that of the state-of-the-art model GPT-4o by 11.6%, which is even comparable to a strong supervised training baseline. Furthermore, RoboPro is robust to different robotic configurations and demonstrates emergent ability on unseen skills in complex tasks.
RoboPro is a policy code generation model for robotics that follows free-form language instructions and grounds them in visual observations in a zero-shot manner. The design of RoboPro introduces a unified architecture that seamlessly integrates visual perception, instruction following, and code generation by leveraging an end-to-end vision-language model (VLM). This unified pipeline eliminates the potential loss of critical information during intermediate steps and improves computational efficiency during inference. However, training such a VLM to perceive environments, follow instructions, and generate executable code inevitably requires a large amount of diverse and well-aligned robot-centric multimodal runtime code data, which poses a significant challenge.
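As a rough illustration of this unified pipeline (not the released implementation), the sketch below assumes a hypothetical `RoboPro` wrapper that takes an RGB observation and a free-form instruction, generates runtime policy code in a single forward pass of the VLM, and executes it against an atomic skill library. All names here (`RoboPro`, `SkillLibrary`, `generate_code`, `act`) are illustrative placeholders.

```python
# Minimal sketch of the unified perception-to-code pipeline (hypothetical API).
# The VLM consumes the observation and instruction directly and emits runtime
# policy code, which is then executed against an atomic skill library.

class SkillLibrary:
    """Placeholder for atomic skills exposed to generated policy code."""
    def move_to(self, target): ...
    def grasp(self, obj): ...
    def release(self): ...


class RoboPro:
    def __init__(self, vlm):
        self.vlm = vlm  # end-to-end VLM: image + text -> code string

    def generate_code(self, image, instruction: str) -> str:
        # Single forward pass: no intermediate captioning or detection stage.
        return self.vlm.generate(image=image, prompt=instruction)

    def act(self, image, instruction: str, skills: SkillLibrary):
        code = self.generate_code(image, instruction)
        # Run the generated runtime code with the skill library in scope.
        exec(code, {"skills": skills})
```

Because perception and code generation happen inside one model, no hand-written captioner or detector sits between the observation and the generated program.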
RoboPro adopts the LLaVA architecture, with a vision encoder and a pre-trained LLM connected by a two-layer MLP adaptor. We directly concatenate the visual and text tokens and feed them into the LLM, which is trained to generate runtime code conditioned on the visual inputs and the task description.
RoboPro adopts SigLIP-L as the vision encoder, which yields stronger performance on general visual reasoning tasks. For the base LLM, we select CodeQwen1.5, a code-domain LLM with state-of-the-art performance among open-source code models.
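A minimal sketch of this LLaVA-style assembly is shown below, under the assumption that the vision encoder returns patch-level features and the LLM accepts input embeddings in the Hugging Face style (`inputs_embeds`). The module names, hidden sizes, and exact wiring of the two-layer MLP adaptor are illustrative stand-ins for the actual SigLIP-L and CodeQwen1.5 components, not the released training code.

```python
import torch
import torch.nn as nn


class MLPAdaptor(nn.Module):
    """Two-layer MLP projecting vision features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_feats)


class RoboProModel(nn.Module):
    """LLaVA-style assembly: vision encoder -> MLP adaptor -> code LLM."""
    def __init__(self, vision_encoder, llm, vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a SigLIP-L backbone
        self.adaptor = MLPAdaptor(vision_dim, llm_dim)
        self.llm = llm                        # e.g. a CodeQwen1.5 decoder

    def forward(self, pixel_values, text_embeds, attention_mask=None):
        # Encode the image into patch tokens and project them to the LLM space.
        vision_tokens = self.adaptor(self.vision_encoder(pixel_values))
        # Directly concatenate visual tokens with the text token embeddings.
        inputs_embeds = torch.cat([vision_tokens, text_embeds], dim=1)
        # attention_mask, if given, must cover the concatenated sequence.
        return self.llm(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
```

The adaptor is the only component introduced between the two pre-trained models; everything else is the unmodified vision encoder and code LLM.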
The training procedure of RoboPro consists of three stages: visual alignment, pre-training, and supervised fine-tuning (SFT). For the SFT stage, the 115k runtime code samples generated by Video2Code are used. To avoid overfitting and to enhance visual reasoning ability, a general vision-language fine-tuning dataset is also mixed in during SFT. As a result, RoboPro learns to follow free-form language instructions and perceive visual information to generate executable policy code for robotic manipulation.
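The three-stage recipe can be summarized as a configuration sketch like the one below. The stage names and the SFT mixture follow the text; the data descriptions for the first two stages and the trainable-module choices are assumptions (typical of LLaVA-style recipes), not the paper's exact hyperparameters.

```python
# Illustrative three-stage training schedule. Stage names and the SFT data
# mixture follow the text; other fields are assumptions for clarity.

TRAINING_STAGES = [
    {
        "name": "visual_alignment",
        "data": ["image-caption alignment pairs"],   # assumed
        "trainable": ["mlp_adaptor"],                # typical LLaVA-style choice
    },
    {
        "name": "pre_training",
        "data": ["general multimodal corpus"],       # assumed
        "trainable": ["mlp_adaptor", "llm"],
    },
    {
        "name": "supervised_fine_tuning",
        "data": [
            "Video2Code runtime code (115k samples)",
            "general vision-language SFT data",      # mixed in against overfitting
        ],
        "trainable": ["mlp_adaptor", "llm"],
    },
]


def describe_schedule(stages=TRAINING_STAGES):
    for stage in stages:
        print(f"{stage['name']}: train {stage['trainable']} on {stage['data']}")
```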