
NVIDIA AI Presents ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

Embodied AI agents are increasingly being called upon to interpret complex, multimodal instructions and act robustly in dynamic environments. ThinkAct, presented by researchers from NVIDIA and National Taiwan University, offers a breakthrough for vision-language-action (VLA) reasoning, introducing reinforced visual latent planning to…

Read More
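For readers curious what "reinforced visual latent planning" might look like mechanically, here is a deliberately simplified Python sketch: an action policy is conditioned on a latent plan vector and updated with a reward-weighted (REINFORCE-style) objective. Every module name, shape, and the placeholder reward below are illustrative assumptions, not ThinkAct's actual architecture or training recipe.

```python
# Toy illustration of conditioning an action policy on a latent plan and
# updating it with a reward-weighted (REINFORCE-style) objective.
# Shapes, modules, and the reward are illustrative, not ThinkAct's design.
import torch
import torch.nn as nn

class LatentPlanner(nn.Module):
    """Maps fused vision-language features to a latent plan vector."""
    def __init__(self, obs_dim=512, plan_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, plan_dim))
    def forward(self, obs_feat):
        return self.net(obs_feat)

class ActionPolicy(nn.Module):
    """Gaussian policy over continuous actions, conditioned on obs + plan."""
    def __init__(self, obs_dim=512, plan_dim=64, act_dim=7):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(obs_dim + plan_dim, 256), nn.ReLU(),
                                nn.Linear(256, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))
    def forward(self, obs_feat, plan):
        mu = self.mu(torch.cat([obs_feat, plan], dim=-1))
        return torch.distributions.Normal(mu, self.log_std.exp())

planner, policy = LatentPlanner(), ActionPolicy()
opt = torch.optim.Adam(list(planner.parameters()) + list(policy.parameters()), lr=1e-4)

obs_feat = torch.randn(32, 512)          # stand-in for fused vision-language features
dist = policy(obs_feat, planner(obs_feat))
action = dist.sample()
reward = -action.pow(2).mean(dim=-1)     # placeholder reward signal
loss = -(dist.log_prob(action).sum(-1) * reward).mean()  # reward-weighted log-likelihood
opt.zero_grad(); loss.backward(); opt.step()
```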

URBAN-SIM: Advancing Autonomous Micromobility with Scalable Urban Simulation

Micromobility solutions—such as delivery robots, mobility scooters, and electric wheelchairs—are rapidly transforming short-distance urban travel. Despite their growing popularity as flexible, eco-friendly transport alternatives, most micromobility devices still rely heavily on human control. This dependence limits operational efficiency and raises safety concerns, especially in complex, crowded city environments filled with dynamic obstacles like pedestrians and…

Read More

Meta AI Releases V-JEPA 2: Open-Source Self-Supervised World Models for Understanding, Prediction, and Planning

Meta AI has introduced V-JEPA 2, a scalable open-source world model designed to learn from video at internet scale and enable robust visual understanding, future state prediction, and zero-shot planning. Building upon the joint-embedding predictive architecture (JEPA), V-JEPA 2 demonstrates how self-supervised learning from passive internet video, combined with minimal robot interaction data, can yield…

Read More
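As a rough illustration of the joint-embedding predictive idea that V-JEPA 2 builds on: a context encoder embeds visible patches, a predictor estimates the embeddings of masked patches, and the regression target comes from an exponential-moving-average target encoder, so the loss lives entirely in latent space rather than pixel space. The sketch below is a simplified, self-contained rendering of that pattern; the dimensions and MLP modules are assumptions, not the V-JEPA 2 implementation.

```python
# Minimal joint-embedding predictive sketch: predict latent embeddings of
# masked tokens from visible ones, with an EMA target encoder (no pixel loss).
# A simplified illustration of the JEPA pattern, not the V-JEPA 2 code.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, n_tokens, batch = 128, 16, 8

encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
target_encoder = copy.deepcopy(encoder)          # updated by EMA, not by gradients
for p in target_encoder.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

tokens = torch.randn(batch, n_tokens, dim)       # stand-in for video patch features
mask = torch.rand(batch, n_tokens) < 0.5         # which tokens the predictor must infer

visible = tokens * (~mask).unsqueeze(-1)         # hide masked patches from the context
pred = predictor(encoder(visible))               # predict every token's latent
with torch.no_grad():
    target = target_encoder(tokens)              # latent targets from the EMA encoder

loss = F.mse_loss(pred[mask], target[mask])      # regression only on masked positions
opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():                            # EMA update of the target encoder
    for p_t, p in zip(target_encoder.parameters(), encoder.parameters()):
        p_t.mul_(0.99).add_(p, alpha=0.01)
```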

EmbodiedGen: A Scalable 3D World Generator for Realistic Embodied AI Simulations

The Challenge of Scaling 3D Environments in Embodied AI: Creating realistic and accurately scaled 3D environments is essential for training and evaluating embodied AI. However, current methods still rely on manually designed 3D graphics, which are costly and lack realism, thereby limiting scalability and generalization. Unlike internet-scale data used in models like GPT and CLIP,…

Read More

Google DeepMind Releases Gemini Robotics On-Device: Local AI Model for Real-Time Robotic Dexterity

Google DeepMind has unveiled Gemini Robotics On-Device, a compact, local version of its powerful vision-language-action (VLA) model, bringing advanced robotic intelligence directly onto devices. This marks a key step forward in the field of embodied AI by eliminating the need for continuous cloud connectivity while maintaining the flexibility, generality, and high precision associated with the…

Read More

UC San Diego Researchers Introduced Dex1B: A Billion-Scale Dataset for Dexterous Hand Manipulation in Robotics

Challenges in Dexterous Hand Manipulation Data Collection: Creating large-scale data for dexterous hand manipulation remains a major challenge in robotics. Although hands offer greater flexibility and richer manipulation potential than simpler tools such as grippers, their complexity makes them difficult to control effectively. Many in the field have questioned whether dexterous hands are worth the…

Read More

π0 Released and Open Sourced: A General-Purpose Robotic Foundation Model that could be Fine-Tuned to a Diverse Range of Tasks

Robots typically struggle to adapt to different tasks and environments. General-purpose robotic foundation models are designed to address this problem: they can be fine-tuned for a wide range of robotic tasks. However, maintaining consistent, shared open resources across diverse platforms remains challenging, and success in real-world environments is far from…

Read More

Researchers from Stanford and Cornell Introduce APRICOT: A Novel AI Approach that Merges LLM-based Bayesian Active Preference Learning with Constraint-Aware Task Planning

In the rapidly evolving field of household robotics, a significant challenge has emerged in executing personalized organizational tasks, such as arranging groceries in a refrigerator. These tasks require robots to balance user preferences with physical constraints while avoiding collisions and maintaining stability. While Large Language Models (LLMs) enable natural language communication of user preferences, this…

Read More
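The Bayesian active preference-learning component can be pictured as maintaining a belief over candidate user preferences and asking the question whose answer is expected to reduce uncertainty the most. The Python sketch below shows that loop over a toy discrete hypothesis space; the hypotheses, questions, and entropy-based selection rule are illustrative assumptions, not APRICOT's actual formulation (which generates candidate preferences with an LLM and couples them to a constraint-aware planner).

```python
# Toy Bayesian active preference learning over a discrete hypothesis space:
# keep a posterior over candidate preferences, ask the binary question with
# the highest expected information gain, and update on the answer.
# Illustrative only; APRICOT's candidates and queries come from an LLM.
import numpy as np

hypotheses = ["dairy on top shelf", "dairy in door", "dairy on bottom shelf"]
posterior = np.full(len(hypotheses), 1.0 / len(hypotheses))

# queries[q][h] = probability the user answers "yes" to question q under hypothesis h
queries = {
    "Should milk go on the top shelf?": np.array([0.9, 0.1, 0.1]),
    "Should milk go in the door?":      np.array([0.1, 0.9, 0.1]),
    "Do you care where the milk goes?": np.array([0.9, 0.9, 0.9]),  # uninformative
}

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def expected_posterior_entropy(prior, p_yes_given_h):
    p_yes = (prior * p_yes_given_h).sum()
    post_yes = prior * p_yes_given_h / p_yes
    post_no = prior * (1 - p_yes_given_h) / (1 - p_yes)
    return p_yes * entropy(post_yes) + (1 - p_yes) * entropy(post_no)

# Pick the question expected to shrink uncertainty the most, then do a Bayes update.
best_q = min(queries, key=lambda q: expected_posterior_entropy(posterior, queries[q]))
print("ask:", best_q)

answer_is_yes = True                       # pretend the user said "yes"
likelihood = queries[best_q] if answer_is_yes else 1 - queries[best_q]
posterior = posterior * likelihood
posterior /= posterior.sum()
print({h: round(float(p), 3) for h, p in zip(hypotheses, posterior)})
```

Once the posterior is confident enough, the inferred preference would be handed to the task planner, which also has to respect physical constraints such as collision avoidance and stability.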

Google DeepMind Researchers Propose RT-Affordance: A Hierarchical Method that Uses Affordances as an Intermediate Representation for Policies

In recent years, there has been significant development in the field of large pre-trained models for learning robot policies. The term “policy representation” here refers to the different ways of interfacing with the decision-making mechanisms of robots, which can potentially facilitate generalization to new tasks and environments. Vision-language-action (VLA) models are pre-trained with large-scale robot…

Read More
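The hierarchical idea of using an affordance as an intermediate representation can be sketched as two stages: a high-level model maps the image and instruction to an affordance (say, a grasp pose), and a low-level policy conditions on that affordance rather than on the raw language. The modules, shapes, and affordance format below are illustrative assumptions, not RT-Affordance's architecture.

```python
# Two-stage sketch of using an affordance as the interface between a
# high-level vision-language model and a low-level control policy.
# Module names, shapes, and the affordance format are assumptions.
import torch
import torch.nn as nn

class AffordanceModel(nn.Module):
    """High level: image + instruction features -> affordance (e.g., a 6-DoF grasp pose)."""
    def __init__(self, img_dim=512, txt_dim=256, aff_dim=6):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(img_dim + txt_dim, 256), nn.ReLU(),
                                 nn.Linear(256, aff_dim))
    def forward(self, img_feat, txt_feat):
        return self.net(torch.cat([img_feat, txt_feat], dim=-1))

class AffordanceConditionedPolicy(nn.Module):
    """Low level: image features + affordance -> action; it never sees the language."""
    def __init__(self, img_dim=512, aff_dim=6, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(img_dim + aff_dim, 256), nn.ReLU(),
                                 nn.Linear(256, act_dim))
    def forward(self, img_feat, affordance):
        return self.net(torch.cat([img_feat, affordance], dim=-1))

high, low = AffordanceModel(), AffordanceConditionedPolicy()
img_feat = torch.randn(1, 512)    # stand-in for encoded camera image
txt_feat = torch.randn(1, 256)    # stand-in for encoded instruction ("pick up the mug")

affordance = high(img_feat, txt_feat)     # intermediate representation
action = low(img_feat, affordance)        # action conditioned on the affordance
print(affordance.shape, action.shape)
```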

Latent Action Pretraining for General Action models (LAPA): An Unsupervised Method for Pretraining Vision-Language-Action (VLA) Models without Ground-Truth Robot Action Labels

Vision-Language-Action (VLA) models for robotics are trained by combining large language models with vision encoders and then fine-tuning them on various robot datasets; this allows generalization to new instructions, unseen objects, and distribution shifts. However, collecting real-world robot datasets mostly requires human control, which makes scaling difficult. On the other hand, Internet video data offers…

Read More
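One way to picture latent action pretraining: learn a small discrete vocabulary of "latent actions" from pairs of consecutive video frames with no robot labels, pretrain the VLA to predict those latent tokens, and only later map them to real actions with a small amount of robot data. The quantizer below is a heavily simplified stand-in for that first step; the modules, dimensions, and nearest-neighbor quantization are assumptions, not the LAPA implementation.

```python
# Simplified sketch of learning discrete "latent actions" from consecutive
# video frames without robot action labels (a VQ-style stand-in for the first
# stage of latent action pretraining). Dimensions and modules are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

frame_dim, latent_dim, n_codes = 256, 32, 8

encode_transition = nn.Sequential(nn.Linear(2 * frame_dim, 128), nn.ReLU(),
                                  nn.Linear(128, latent_dim))
decode_next = nn.Sequential(nn.Linear(frame_dim + latent_dim, 128), nn.ReLU(),
                            nn.Linear(128, frame_dim))
codebook = nn.Parameter(torch.randn(n_codes, latent_dim))  # discrete latent actions

opt = torch.optim.Adam(list(encode_transition.parameters()) +
                       list(decode_next.parameters()) + [codebook], lr=1e-3)

frame_t = torch.randn(16, frame_dim)       # stand-in for encoded frame at time t
frame_t1 = torch.randn(16, frame_dim)      # stand-in for encoded frame at time t+1

z = encode_transition(torch.cat([frame_t, frame_t1], dim=-1))
idx = torch.cdist(z, codebook).argmin(dim=-1)              # nearest code = latent action id
z_q = codebook[idx]
z_st = z + (z_q - z).detach()                              # straight-through estimator

recon = decode_next(torch.cat([frame_t, z_st], dim=-1))    # predict the next frame from the latent action
loss = (F.mse_loss(recon, frame_t1)                        # the latent action must explain the change
        + F.mse_loss(z_q, z.detach())                      # pull codes toward encoder outputs
        + 0.25 * F.mse_loss(z, z_q.detach()))              # commitment term
opt.zero_grad(); loss.backward(); opt.step()
print(idx[:5])                                             # pseudo-labels a VLA could be pretrained on
```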