Two look-alike floor robots hum to life in a cavernous fulfillment center, pallets queued like chess pieces. One moves briskly through pick-and-place tasks, rerouting smoothly and handling edge cases; the other wobbles, misreads a shelf, and calls for remote help by its third task. The hardware is nearly identical, and the startups behind them use models with a similar number of parameters. What isn’t the same is the flood of labeled trajectories, sim variations, and real-world feedback that trained the first machine, and the data drought that left the second guessing.

We learn from experience, and so do robots; that is why data strategy is emerging as the new competitive battleground in generative robotics. Those who build the richest, most agile data pipelines will set the pace in this field. Startups are anchoring on data collected from proprietary robot logs, high-fidelity simulation, and hardware-linked data streams. The message is clear: invest in your data pipelines now or risk being left behind.
Generative robotics – robots powered by large foundation models that can learn a variety of tasks – is advancing rapidly in 2025. Unlike text or images, there is no unlimited online repository for robot actions; as Skild AI’s CEO notes, “unlike language or vision, there is no data for robotics on the internet”[1].
Companies therefore need massive, diverse datasets, and they are devising creative data strategies to feed their models. This article surveys key data milestones and methods from January–August 2025, illustrating how data pipelines have become a strategic imperative. We outline the approaches of leading startups, the types of data fueling generative robots, the stages of the data lifecycle, emerging open-source efforts, and areas of opportunity. The trend is unmistakable: the scale and quality of your robot’s training data will determine your competitive edge.
Here’s what the industry has been up to (Jan–Aug 2025):
Physical Intelligence (PI) open-sourced its π₀ generalist robot policy in early 2025 (https://www.physicalintelligence.company/blog/openpi), then introduced an improved π0.5 model by April that demonstrated significantly better generalization to unseen environments. PI achieved this by co-training on more diverse multi-robot data, expanding its dataset of teleoperated demonstrations by roughly 15% compared to π₀ (https://www.physicalintelligence.company/blog/pi05) (on the order of hundreds of hours of robot learning data).
Skild AI (backed by Amazon and SoftBank) unveiled its “Skild Brain” foundation model in July 2025. The system is trained on millions of simulated episodes and human-action video frames to mimic human-like robotic skills. Skild’s robots continuously feed their real-world experience back into this shared “brain,” tackling the data scarcity in robotics by growing a collective dataset with each deployment. (https://www.reuters.com/business/media-telecom/amazon-backed-skild-ai-unveils-general-purpose-ai-model-multi-purpose-robots-2025-07-29/)
Genesis AI emerged from stealth in mid-2025 with a $105M funding round co-led by Eclipse and Khosla to build a universal Robotics Foundation Model (RFM). The company is developing a full-stack data engine that unifies high-fidelity physics simulation and large-scale real-robot data collection, aiming to gather the field’s largest-scale, most diverse training dataset for general-purpose robots. (TechCrunch)
Figure AI collects a high-quality, multi-robot, multi-operator dataset of diverse teleoperated behaviors, totaling roughly 500 hours. A vision-language model then auto-labels each segmented video clip with a hindsight natural-language instruction. Figure’s model, Helix, also includes a synthetic pretraining step that provides robust, multi-scale visual features which transfer to on-robot perception, jump-starting learning and reducing reliance on real-world labeling. (Figure)
1X Technologies (maker of the EVE & NEO humanoids, backed by OpenAI) rolled out its “Redwood” AI model in mid-2025 to enable autonomous home assistance by its NEO robot. Redwood is a 160M-parameter vision-language-action transformer trained on a massive trove of real robot logs gathered from 1X’s robots in offices and homes (1X). This multi-terabyte dataset (including both successful and failed experiences) powers Redwood’s continual learning, allowing NEO to handle tasks like laundry, dish cleaning, door-opening, and navigation in unstructured home environments.
SmolVLA (Hugging Face’s compact open-source VLA model) was introduced in June 2025 as an efficient alternative to proprietary robotic brains. SmolVLA-1.0 (450M parameters) was trained entirely on community-contributed “LeRobot” datasets: about 10 million frames of diverse real robot demonstrations drawn from 487 different datasets. Rather than brute-force scale, SmolVLA emphasizes data diversity (varied tasks, camera angles, and robot embodiments) to achieve strong generalist performance. Despite its smaller size, it matched or outperformed larger closed models on simulation benchmarks and real-world tasks, validating the open-data approach. (Hugging Face)
Each of these companies recognizes that owning a proprietary data pipeline is as critical as the model architecture itself. The strategies differ – from teleoperation to simulation to crowd-sourced videos – but the end goal is the same: amass a data moat that competitors can’t easily replicate. As Skild’s CEO Pathak summarizes, “you cannot just scrape the internet for robot data” (Reuters) – so those who build their own rich datasets will lead in generative robotics.
How can the different data types be used?
Let’s first look at how the different data strategies are applied, to understand where the opportunities lie.
Generally, a training pipeline follows three stages: pre-training on broad, diverse data; fine-tuning on smaller, high-quality datasets; and alignment with human feedback once deployed.

Let’s dig into each phase.
Pre-Training: Building a generalist “robot brain” starts with bootstrapping on diverse pre-training data – often by leveraging existing AI backbones and massive simulated or internet-derived datasets.
For example, some of the earliest generative robotics vision-language-action models (VLAs), such as DeepMind’s RT-2, explicitly integrated pre-trained vision-language foundation models: it “adapt[s] Pathways Language and Image model (PaLI-X) and Pathways Language model Embodied (PaLM-E) to act as the backbones of RT-2”. (https://deepmind.google/discover/blog/rt-2-new-model-translates-vision-and-language-into-action/)
Similarly, Physical Intelligence’s π₀ model begins with a 3 billion-parameter vision-language model that was pre-trained on internet images and text, letting π₀ “inherit semantic knowledge and visual understanding from Internet-scale pretraining”. This VLM backbone is then adapted to output robot actions, giving the robot policy a built-in sense of objects and concepts from the web. At the same time, large-scale robot-specific data is compiled to teach the model about embodiment and physics. π₀ was “trained on the largest robot interaction dataset to date… including both open-source data and a large, diverse dataset of dexterous tasks [collected] across 8 distinct robots”. (https://www.physicalintelligence.company/blog/pi0)
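To make this "VLM backbone plus action head" pattern concrete, here is a minimal, hypothetical PyTorch sketch. The StubVLMBackbone, the dimensions, and the mean-pooled action head are illustrative assumptions, not any company's actual architecture; a real system would load pretrained weights and use a far more capable action decoder.

```python
# Minimal sketch (assumptions, not PI's or DeepMind's code): adapt a pretrained
# vision-language backbone into a vision-language-action (VLA) policy by adding
# an action head on top of its fused embeddings.
import torch
import torch.nn as nn

class StubVLMBackbone(nn.Module):
    """Stand-in for a pretrained VLM; in practice you would load real weights.
    Maps (images, instruction tokens) to a fused sequence of embeddings."""
    def __init__(self, embed_dim: int = 512, vocab_size: int = 32_000):
        super().__init__()
        self.vision = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=8), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, embed_dim),
        )
        self.text = nn.Embedding(vocab_size, embed_dim)

    def forward(self, images: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        img_tok = self.vision(images).unsqueeze(1)       # (B, 1, D)
        txt_tok = self.text(token_ids)                   # (B, T, D)
        return torch.cat([img_tok, txt_tok], dim=1)      # fused (B, T+1, D)

class VLAPolicy(nn.Module):
    """VLM backbone plus a small action head that outputs a chunk of joint commands."""
    def __init__(self, embed_dim: int = 512, action_dim: int = 7, horizon: int = 16):
        super().__init__()
        self.backbone = StubVLMBackbone(embed_dim)
        self.action_head = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.GELU(),
            nn.Linear(256, action_dim * horizon),
        )
        self.action_dim, self.horizon = action_dim, horizon

    def forward(self, images, token_ids):
        feats = self.backbone(images, token_ids).mean(dim=1)   # pool the sequence
        return self.action_head(feats).view(-1, self.horizon, self.action_dim)

# One behavior-cloning step on a dummy (image, instruction, action-chunk) batch.
policy = VLAPolicy()
opt = torch.optim.AdamW(policy.parameters(), lr=1e-4)
images = torch.rand(8, 3, 224, 224)
token_ids = torch.randint(0, 32_000, (8, 12))
expert_actions = torch.rand(8, 16, 7)
loss = nn.functional.mse_loss(policy(images, token_ids), expert_actions)
loss.backward()
opt.step()
```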
Startups like Skild AI and Genesis AI have taken a slightly different route – emphasizing simulated data to reach internet scale. Skild’s founders note that achieving true “omni-bodied” intelligence may require “trillions of examples – and there is no way just real world data can achieve this”. (https://www.skild.ai/blogs/building-the-general-purpose-robotic-brain)
Fine-Tuning: Once a foundation model has broad capabilities, it is adapted to specific tasks or hardware via focused fine-tuning on high-quality data. This stage uses smaller, carefully curated datasets (often expert demonstrations or teleoperated trajectories) to sharpen the model’s performance on target skills.
Physical Intelligence describes this as “post-training with high-quality data for a challenging task”, analogous to how a large language model is fine-tuned for a domain. One showcase was laundry folding – π₀ was adapted using a set of expert teleoperated runs of folding clothes and stacking them. After this, π₀ could reliably fold garments of various types, even recovering gracefully if a human moved the clothes mid-task. (https://www.physicalintelligence.company/blog/pi0)
The amount of data needed for such fine-tuning is surprisingly modest: the earlier ALOHA research (a precursor to π₀) found that “with co-training, [they] achieve over 80% success on tasks with only 50 human demonstrations per task,” and using those 50 demos alongside broader data nearly doubled success rates on hard tasks. (https://www.physicalintelligence.company/blog/pi0)
Figure AI reported a similar data efficiency for its “Helix” model – it was trained on “~500 hours in total” of diverse teleoperated humanoid behaviors, a fraction of what prior systems needed. Those 500 hours (collected by human operators controlling Figure’s robots in labs) included an array of household activities; during training an auto-labeling VLM generated a text description for each demo, so Helix effectively learned to match “natural language-conditioned” commands to actions. With that relatively small but rich dataset, Helix’s single policy network can zero-shot manipulate “thousands of items [the robots] have never encountered before” just by being told what to do. (https://www.figure.ai/news/helix)
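Figure's hindsight auto-labeling step can be illustrated with a small, hypothetical sketch: a VLM is shown a segmented demo clip and asked what instruction would have produced that behavior, and the answer becomes the language conditioning for training. The DemoClip layout, query_vlm function, and prompt wording below are stand-ins, not Figure's actual pipeline.

```python
# Minimal sketch: hindsight-label segmented demo clips with natural-language
# instructions by asking a vision-language model to describe the completed task.
from dataclasses import dataclass
from typing import List

@dataclass
class DemoClip:
    clip_id: str
    frames: List[str]          # paths to sampled video frames
    actions_path: str          # path to the synced action trajectory

def query_vlm(frame_paths: List[str], prompt: str) -> str:
    """Hypothetical VLM call; plug in whatever captioning model you actually use."""
    raise NotImplementedError

def hindsight_label(clip: DemoClip) -> dict:
    # Ask, in hindsight, for the command that *would have* produced this behavior.
    prompt = (
        "These frames show a robot completing a manipulation task. "
        "Write the one-sentence command a person would have given to get this result."
    )
    instruction = query_vlm(clip.frames, prompt)
    return {"clip_id": clip.clip_id,
            "instruction": instruction,
            "actions": clip.actions_path}

# Usage: map hindsight_label over a directory of segmented clips to build
# (instruction, action) training pairs without manual annotation.
```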
Another approach to fine-tuning is community-sourced data: SmolVLA, the open-source VLA model from Hugging Face’s LeRobot project, relied entirely on crowdsourced robot demos. The SmolVLA team curated ~30,000 teleoperation episodes from hobbyist contributors (the LeRobot community) and cleaned and standardized them intensively. The result was a fine-tuning dataset that, while smaller than proprietary datasets, was high-quality enough to train a useful multi-skill policy. (https://huggingface.co/blog/smolvla)
In summary, fine-tuning is where quality beats quantity: a few dozen to a few hundred well-crafted demonstrations per task can align a generalist model to expert-level performance on specific real-world problems.
It’s also the stage to inject proprietary or unique data – for example, a healthcare robotics firm might fine-tune a foundation model on its private surgical robot demonstrations to attain superhuman precision in that niche. Fine-tuning thus customizes the foundation model, turning broad competency into expert skill on whichever tasks matter most.
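To make the fine-tuning recipe concrete, here is a minimal sketch of the co-training idea mentioned above: mixing a small set of task-specific expert demos with the broader pre-training corpus so that a few dozen demonstrations go a long way. The EpisodeDataset class, the 50/50 mixing ratio, and the tensor layout are assumptions for illustration, not a published recipe.

```python
# Minimal sketch: co-train on ~50 task-specific demos mixed with broad data.
import random
import torch
from torch.utils.data import Dataset

class EpisodeDataset(Dataset):
    """Stand-in for demo storage: a list of (observation, action) tensor pairs."""
    def __init__(self, episodes):
        self.episodes = episodes
    def __len__(self):
        return len(self.episodes)
    def __getitem__(self, i):
        return self.episodes[i]

def cotraining_batches(task_demos: Dataset, broad_data: Dataset,
                       task_fraction: float = 0.5, batch_size: int = 32):
    """Yield batches where `task_fraction` of samples come from the small
    task-specific demo set and the rest from the broad pre-training mixture,
    so the new skill is learned without erasing existing competencies."""
    n_task = int(batch_size * task_fraction)
    while True:
        picks = ([task_demos[random.randrange(len(task_demos))] for _ in range(n_task)]
                 + [broad_data[random.randrange(len(broad_data))] for _ in range(batch_size - n_task)])
        obs = torch.stack([o for o, _ in picks])
        act = torch.stack([a for _, a in picks])
        yield obs, act

# Usage: feed these batches to the same behavior-cloning loss used in pre-training.
```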
Alignment & Human Feedback: The final critical stage is aligning the model’s behavior with human values, safety constraints, and user preferences. This goes beyond completing tasks successfully – it’s about doing them safely and in the ways users actually want.
1X Technologies treats each of its humanoid robots in the field as a continual data source. As 1X’s AI lead explained, “each deployed NEO robot acts as a real-time data factory, continuously collecting manipulation data from real-world interactions, which 1X uses to refine and enhance [its model]’s capabilities.” (https://www.xmaquina.io/blog/the-story-of-1x-technologies)
Thanks to this loop, 1X’s Redwood VLA model had logged over “10,000 hours of data” from home trials as of 2025. Every time a NEO encounters a new situation or edge case, that experience (success or failure) is fed back into training – effectively letting the robots learn from their mistakes and from user corrections. (https://www.1x.tech/discover/redwood-ai)
Skild AI reports a similar dynamic: “robots deployed by customers feed data back into Skild Brain to sharpen its skills, creating the same ‘shared brain’” across all units. In practice, if one Skild-powered robot in a warehouse learns to handle a new type of package (or receives a human correction for a fumble), that data goes into the next model update and propagates to all other robots (https://www.reuters.com/business/media-telecom/amazon-backed-skild-ai-unveils-general-purpose-ai-model-multi-purpose-robots-2025-07-29).
In summary, alignment and feedback mechanisms ensure that a robotics foundation model doesn’t remain a static bundle of skills – it continually evolves based on human guidance and real-world experience. By combining fine-tuned expertise with ongoing human preference learning and safety checks, the leading teams close the loop from data collection to deployment and back again.
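A minimal sketch of what such a data flywheel might look like under the hood, assuming a simple episode schema: each deployed robot uploads episodes tagged with an outcome and any human correction, and the next training mixture upweights exactly the cases the fleet struggled with. The FleetEpisode fields, triage rules, and 3x/2x upweighting are invented for illustration and are not 1X's or Skild's internal systems.

```python
# Minimal sketch of a deployment data flywheel (assumed design).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FleetEpisode:
    robot_id: str
    task: str
    success: bool
    human_correction: Optional[str] = None   # e.g. path to a teleop takeover segment
    log_path: str = ""

@dataclass
class TrainingMixture:
    successes: List[FleetEpisode] = field(default_factory=list)
    failures: List[FleetEpisode] = field(default_factory=list)
    corrections: List[FleetEpisode] = field(default_factory=list)

    def ingest(self, ep: FleetEpisode) -> None:
        # Triage each uploaded episode by the strongest learning signal it carries.
        if ep.human_correction is not None:
            self.corrections.append(ep)       # explicit human feedback
        elif ep.success:
            self.successes.append(ep)
        else:
            self.failures.append(ep)

    def next_update_manifest(self) -> List[FleetEpisode]:
        # Upweight corrections and failures so the next model version targets
        # exactly what the fleet struggled with, then propagate it to all robots.
        return self.corrections * 3 + self.failures * 2 + self.successes
```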
Gaps & Opportunities
Even as data volume explodes, there are clear gaps in the current landscape. For founders and investors, these gaps represent opportunities to differentiate. Here are some pressing needs and ideas in generative robotics data:
Vertical-Specific Pipelines: Most current datasets focus on household and generic manipulation tasks. Huge opportunities exist in verticals like agriculture (AgTech) and healthcare (MedTech) for tailored robot data.
For example, an AgriRobot dataset might consist of thousands of hours of crop inspection, fruit picking, and tractor navigation data – much of which could be synthetically generated with physics engines plus a smaller real subset from farms.
Similarly in healthcare, data on robot-assisted procedures or eldercare tasks (with proper privacy safeguards) could unlock specialized robot nurses or aides. Startups that become the data hub for a niche (e.g. the “Bloomberg of farming robots” or “ImageNet of surgical robots”) could then leverage that position to train superior models for those domains.
Advanced Annotation & Sensing: A relatively underdeveloped area is rich annotation of robot data. Many datasets have images and actions, but lack labeled information on forces, tactile events, or high-level outcomes (success/failure).
Developing pipelines for force/torque sensing data and tactile event labeling could greatly benefit tasks like insertion, assembly, or any contact-heavy manipulation. Additionally, semantic annotations (identifying objects, describing scenes) done in a robot-aware way would add value.
There’s room for startups offering annotation-as-a-service for robotics – e.g. services that label robot camera footage with physical events (slip, grip, collision) or user satisfaction scores. High-quality labeled data could be used to supervise model components that today learn only implicitly.
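As one concrete example of what robot-aware annotation could look like, the sketch below turns a force/torque stream into labeled contact events using simple thresholds. The thresholds, event names, and (T, 6) wrench layout are assumptions; a production annotator would use learned detectors and human review rather than hand-tuned cutoffs.

```python
# Minimal sketch: label contact events in an episode from its force/torque stream,
# so contact-heavy skills can be supervised explicitly rather than learned implicitly.
import numpy as np

def label_contact_events(ft: np.ndarray, dt: float,
                         contact_thresh: float = 5.0,
                         impact_thresh: float = 40.0) -> list[dict]:
    """ft: (T, 6) wrench samples [Fx, Fy, Fz, Tx, Ty, Tz] in N / Nm; dt in seconds."""
    force_mag = np.linalg.norm(ft[:, :3], axis=1)
    dforce = np.abs(np.diff(force_mag, prepend=force_mag[0])) / dt
    events = []
    in_contact = False
    for t, (f, df) in enumerate(zip(force_mag, dforce)):
        if not in_contact and f > contact_thresh:
            events.append({"t": t * dt, "event": "contact_start"})
            in_contact = True
        elif in_contact and f < contact_thresh:
            events.append({"t": t * dt, "event": "contact_end"})
            in_contact = False
        if df > impact_thresh:   # sharp transient: possible collision or slip
            events.append({"t": t * dt, "event": "impact_or_slip"})
    return events

# Usage: events = label_contact_events(np.load("episode_ft.npy"), dt=0.01)
```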
Hardware-Linked Data Incentives: As seen with Hugging Face’s LeRobot and K-Scale’s low-cost robot, there is a strong case for putting more low-cost, data-collecting robots into the field. Robot companies might adopt a “razors and blades” model: sell robots at or below cost in order to gain exclusive data streams.
This is an opportunity for creative business models – e.g., a home robot vacuum that is ultra-cheap if the owner opts into data sharing (much like early Alexa devices were sold near cost to gather voice data). We might also see consortiums of companies sponsoring robot deployments in public infrastructure (stores, hotels, warehouses) in exchange for the operational data. Scale and diversity of environments are key: getting robots into many homes and businesses diversifies data more than any simulation can.
Coverage & Gap Diagnostics: Most teams cannot see what their data truly covers (environments, tasks, objects, lighting, camera angles) or where their models fail and are uncertain. An internal “data observability” tool could identify those gaps and propose targeted “collect-next” recipes: specific scenarios, objects, and contact conditions that would most improve success, so that data spend flows to the highest-ROI gaps.
This turns undirected data collection into a measurable uplift loop: identify blind spots, collect focused episodes (real + sim), and track the performance gain per 1k samples. For teams, it becomes a practical moat: faster iteration, clearer data ROI, and a living coverage map that guides both product and deployment decisions.
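Here is a minimal sketch of what such a data observability tool could compute, assuming episodes are logged with task, environment, object, and success fields. The bucketing scheme and priority heuristic are illustrative, not a reference implementation.

```python
# Minimal sketch of a coverage/gap diagnostic: bucket episodes by
# (task, environment, object), compute per-bucket success rates, and rank
# "collect-next" recipes so thin, failing buckets float to the top.
from collections import Counter, defaultdict

def coverage_report(episodes, min_episodes=50, target_success=0.9):
    """episodes: iterable of dicts with keys task, environment, object, success."""
    counts = Counter()
    successes = defaultdict(int)
    for ep in episodes:
        key = (ep["task"], ep["environment"], ep["object"])
        counts[key] += 1
        successes[key] += int(ep["success"])

    recipes = []
    for key, n in counts.items():
        rate = successes[key] / n
        # Priority grows with how far below the success target we are and how
        # thin the data is for this bucket.
        priority = max(0.0, target_success - rate) + max(0, min_episodes - n) / min_episodes
        recipes.append({"bucket": key, "episodes": n,
                        "success_rate": round(rate, 2), "priority": round(priority, 2)})
    return sorted(recipes, key=lambda r: r["priority"], reverse=True)

# Usage: the top entries of coverage_report(logged_episodes) become next week's
# targeted collection plan (real teleop sessions plus matching sim variations).
```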
Human Video Bootstrapping: Finally, there is the immense repository of human action video (e.g. YouTube, TikTok) as a potential data source for robots. Research projects have already mined YouTube for instructional videos and human demonstrations to help train robot policies.
While human video won’t replace robot-collected data, it can fill in knowledge gaps (e.g. how humans manipulate flexible objects) if used judiciously. The key is filtering and adapting it properly – an opportunity for tools that turn unstructured video into robot-usable training snippets.
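As a sketch of what such a tool might do, the snippet below keeps only video segments with a detected hand-object interaction and low camera motion. The detect_hand_object and camera_motion functions are hypothetical stand-ins for real perception models, and the thresholds are arbitrary.

```python
# Minimal sketch: turn unstructured human video into candidate training snippets
# by filtering for usable manipulation segments.
from typing import List, Tuple

def detect_hand_object(frame) -> bool:
    """Stand-in for a hand-object interaction detector."""
    raise NotImplementedError

def camera_motion(prev_frame, frame) -> float:
    """Stand-in for an ego-motion estimate (e.g. mean optical-flow magnitude)."""
    raise NotImplementedError

def extract_snippets(frames: List, fps: float,
                     min_len_s: float = 2.0, max_motion: float = 5.0) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of segments worth converting to robot data."""
    snippets, start = [], None
    for i in range(1, len(frames)):
        usable = (detect_hand_object(frames[i])
                  and camera_motion(frames[i - 1], frames[i]) < max_motion)
        if usable and start is None:
            start = i
        elif not usable and start is not None:
            if (i - start) / fps >= min_len_s:
                snippets.append((start, i))
            start = None
    if start is not None and (len(frames) - start) / fps >= min_len_s:
        snippets.append((start, len(frames)))   # close a segment that runs to the end
    return snippets
```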
In generative AI for robotics, models may get the spotlight – but data is the real differentiator behind the scenes. Building robust, generalizable robot intelligence requires far more data, from far more sources, than earlier AI waves. As we’ve seen, the companies and communities leading the charge are those treating data pipelines as a core product: investing in simulation, instrumenting real-world usage, tapping into human demonstrators, and sharing across ecosystems.
For founders and funders in this space, the takeaway is straightforward: make data strategy a first-class priority. The winners of the next decade of robotics will be defined not just by the smartest robots, but by the richness and coverage of data that allowed their robots to robustly succeed in their deployments.