**Jim Fan:** Thank you. So, it was a summer day in 2016. Right in this office where we're sitting now, there was a guy in a shiny leather jacket, you know, big biceps, hauling a huge metal tray. And on this large piece of metal he wrote: "To Elon and the OpenAI team, to the future of computing and humanity, I present you the world's first DGX-1." That was the first time I met Jensen. And as any good intern would do, I rushed to get in line and sign my name on it. Can you spot it, my name? It's here. And can you spot another one? That's Andrej, right there. So, Andrej, we're headed for the Computer History Museum. I feel like a dinosaur. You know, back then I had no clue what I was signing up for. And no one can describe what happened next better than Ilya himself: if you believe in deep learning, deep learning will believe in you. And oh boy, did deep learning believe in all of us, big time.

Three step functions, six years. That's all it took to bring us here today. The first jump: GPT-3, pre-training. Next-token prediction is really about learning the rules of grammar, the shape of language. It's about simulating how thoughts, code, and strings in general should unfold. 2022: InstructGPT, supervised fine-tuning, aligning the simulation to do useful work. Then o1: reasoning, using reinforcement learning to surpass imitation learning. And finally, auto research, accelerating the whole loop beyond what's humanly possible.

So, as Andrej said, all the labs are in the final boss fight. For LLMs, they're in the thick of the end game. And honestly, I'm very jealous. Look at how happy Andrej was, big smile on his face. The LLM folks are having the party of their lifetime. They're speedrunning AGI on mythical creatures literally called "methos." So why can't robotics get a piece of the fun?
So, as any self-respecting scientist would do, I copied the homework and gave it a new name. I call it "the great parallel." Instead of simulating strings, can we simulate the next physical world state? Then we can align through action fine-tuning, onto the thin slice of that simulation that matters for real robots. And we let reinforcement learning carry the last mile. That's it. The great parallel: copying the LLM playbook. If you can't beat them, join them. So please join me for a new episode: Robotics, the End Game. Sorry, I just couldn't resist. Nano Banana is too good. Thanks, Demis.
So, how do we play the end game? It boils down to two things: model strategy and data strategy. Let's look at the model first. The last three years were dominated by VLAs, vision-language-action models; models like Pi and Groot fall into this category. We assume the pre-training is done by the VLA, and we simply graft an action head on top (a minimal sketch follows below). But really, if you think about these models, they're "LVAs," because the vast majority of the parameters are dedicated to language. So language is the first-class citizen, followed by vision and then action. By design, VLAs are great at encoding knowledge and nouns, but not so great at physics and verbs. They're top-heavy in the wrong places, so to speak. This is my favorite example from the original VLA paper: move the Coke can next to a picture of Taylor Swift. Yes, it has never seen Taylor Swift before. Yes, it generalizes. But this isn't quite the pre-training ability we're looking for.
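To make "graft an action head on top" concrete, here is a minimal sketch of the recipe, assuming a pretrained vision-language backbone; the class, signatures, and dimensions are illustrative, not any lab's actual architecture:

```python
import torch
import torch.nn as nn

class VLAPolicy(nn.Module):
    """Hypothetical VLA: large VLM backbone + small grafted action head."""

    def __init__(self, backbone: nn.Module, hidden_dim: int = 4096,
                 action_dim: int = 22, chunk: int = 16):
        super().__init__()
        self.backbone = backbone  # billions of language-heavy parameters
        # The graft: a tiny MLP mapping the VLM's final hidden state to a
        # chunk of continuous motor actions (only millions of parameters).
        self.action_head = nn.Sequential(
            nn.Linear(hidden_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, action_dim * chunk),
        )
        self.action_dim, self.chunk = action_dim, chunk

    def forward(self, images: torch.Tensor, text_tokens: torch.Tensor):
        h = self.backbone(images, text_tokens)  # assumed to return (B, hidden_dim)
        a = self.action_head(h)
        return a.view(-1, self.chunk, self.action_dim)
```

The parameter imbalance is visible in the sketch itself: nearly all capacity sits in the language-heavy backbone, which is exactly the "LVA" critique above.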
So what's the second pre-training paradigm? I always thought it would be something glorious. Unfortunately, it turns out to be what we call "AI video slop." You know, I could watch these cats playing banjo on a security cam all day. It's peak internet. But really, look at this. No one can take this seriously,
*[laughter]*
until we realize that these video models are learning to internally simulate the next world state. These are some rollouts from VEO-3. You can see the model picks up gravity, buoyancy, lighting, reflection, refraction, all by itself. None of this is coded in. Physics emerges from predicting the next blob of pixels at scale. Even visual planning emerges. Look at how VEO-3 solves these mazes: it solves them by running the simulation forward in pixel space. Pay attention to the lower right corner here. This is my favorite example. Let's watch; blink and you'll miss how VEO-3 solves this one.
It's super smart. You know, VEO-3 figured out that if you're not looking, geometry is optional. I call this "physics slop."
So, how do we make these world models useful? Well, we do action fine-tuning. We take the superposition of all possible future states and collapse it onto the thin slice that matters for real robots. Introducing Dream Zero, a new type of policy model that dreams a couple of seconds into the future and acts accordingly. You know, motor actions are high-dimensional continuous signals, so they look just like pixels. We can render the actions at the same time as we render the video. Dream Zero jointly decodes the next world state and the next action (see the sketch after this paragraph). As a result, it can zero-shot solve tasks and verbs it has never seen in training. And as the robot executes, we can visualize what it's dreaming about. The correlation is very tight: if the video prediction is right, the action works; if the video hallucinates, the action fails. So vision and action are now both first-class citizens.

We've had a lot of fun with Dream Zero. We just roll the robot around the lab and type random things into the prompt box. Of course, Dream Zero won't do all of these tasks with 100% robustness, but it's kind of like GPT-2: in every case, it's trying to get the shape of the motion right. So Dream Zero is our first step toward open-ended, open-vocabulary prompting for robotics. We call this new type of model World Action Models, or WAMs for short. So let's take a moment of silence for our dear friend the VLA. They served us well. Rest in peace. Long live World Action Models.
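Before moving on to data, here is what "jointly decoding" the world and the action might look like, as a minimal sketch: one autoregressive trunk over an interleaved stream of video tokens and discretized action tokens, with two output heads. All names, vocabulary sizes, and shapes are hypothetical; the talk does not specify Dream Zero's architecture.

```python
import torch
import torch.nn as nn

class WorldActionModel(nn.Module):
    """Sketch of a WAM: one trunk predicts both the next frame and next action."""

    def __init__(self, vocab_video: int = 8192, vocab_action: int = 1024,
                 d: int = 1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_video + vocab_action, d)
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True),
            num_layers=12,
        )
        # Two heads over one shared world representation, so vision and
        # action are literally first-class citizens of the same model.
        self.video_head = nn.Linear(d, vocab_video)
        self.action_head = nn.Linear(d, vocab_action)

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, T) interleaved video/action stream; the causal mask
        # needed for autoregressive decoding is omitted for brevity.
        h = self.trunk(self.embed(tokens))
        return self.video_head(h), self.action_head(h)
```

If the video head hallucinates, the shared representation is wrong, and the action head fails with it, matching the tight video-action correlation described above.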
Next, data strategy. This is Nvidia's chief scientist, Bill Dally, doing teleoperation in our lab. Given his salary, I'd say this is by far the most expensive teleop trajectory ever collected in our dataset. The past three years have been dominated by teleoperation. It's been the golden era: VR headsets, extremely optimized streaming latency, and these complex rigs that look like medieval torture devices. You know, so much investment from industry, so much pain and suffering. And yet teleop is upper-bounded at 24 hours per robot per day; that's the fundamental physical limit. And honestly, who am I kidding? It's more like 3 hours per robot per day, and only when the robot gods are merciful, because these robots throw tantrums all the time.
So, how do we do better? Well, how about this: you just wear the robot hand on your own hand. This is called UMI, the Universal Manipulation Interface, and it's a deceptively simple idea. You wear the robot's actuator on your hand and collect the data directly, the human way, while taking the rest of the robot body out of the loop. I'd say UMI is perhaps one of the greatest papers ever written on robotics data, and it spawned two unicorn startups. On the left is Genesis, which improved the design so you can wear the gripper like this. On the right is Sunday, which made these three-finger data gloves.
Last year, we took it one step further. We designed this exoskeleton with a one-to-one mapping to a five-finger dexterous robot hand; we call it DexUMI. Let's look at it in action. On the left, the human collecting data directly, always the fastest. On the right, look at how difficult teleop is. The operator, one of our most skilled PhDs here, has to align everything very carefully, right? And it's super slow, and the success rate is low too. In the middle, you just put on the exoskeleton and collect data directly. Then we train a robot policy on that data. So what you're seeing is a fully autonomous robot running a policy trained on zero teleoperation data. We've broken the curse of 24 hours per robot per day. And look how happy these robots are, now that they no longer need to be in the loop for data collection.
So is this the answer? Have we solved scaling for robotics? Anyone here drive a Tesla or a Waymo? Anyone? Right. You know, when you drive, you're actually contributing to the biggest physical data flywheel there is. And the beauty is that in FSD mode you don't even feel it, because the data upload is a background process. But wearing these UMIs or data wearables is still cumbersome, right? It's intrusive. It's not as seamless as just driving to work. So we need an FSD equivalent. Data collection needs to get out of the way and fade into the background, so we can capture the full glory of human dexterity, across every industry, across every form of economically valuable labor.
So we're going all in on human egocentric video, annotated in detail with things like hand-position tracking and dense language labels. Introducing Ego-Scale, where 99.9% of the training effort is based on human egocentric video. The result is an end-to-end policy that maps directly from camera pixels to a 22-degree-of-freedom, highly dexterous robot hand. What you see here is fully autonomous. We pretrained Ego-Scale on 21,000 hours of in-the-wild egocentric human data, with zero robot data whatsoever. During pretraining, we predict hand joints and wrist poses. Then, for action fine-tuning, we collected only 50 hours of high-precision motion-capture glove data and 4 hours of teleop. Four hours of teleop. That's less than 0.1% of the training mix. With this, Ego-Scale generalizes to very dexterous tasks like sorting cards or handling a syringe, right? Or transferring liquid. You know, someday we might have robot nurses at home; might as well try. And for tasks like these, a single demonstration at test time is enough to learn a different shirt-folding strategy.
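The two-stage recipe just described can be sketched in a few lines, under stated assumptions: the encoder, dimensions, and head layouts below are hypothetical stand-ins, not Ego-Scale's actual design.

```python
import torch.nn as nn

class EgoPolicy(nn.Module):
    """Sketch: pretrain on human hand poses, then fine-tune to robot actions."""

    def __init__(self, encoder: nn.Module, d: int = 768):
        super().__init__()
        self.encoder = encoder                     # egocentric video encoder
        # Stage 1 (99.9% of training): regress human hand joints + wrist pose
        # from in-the-wild egocentric frames; no robot data involved.
        self.hand_head = nn.Linear(d, 21 * 3 + 6)  # 21 joints (xyz) + wrist pose
        # Stage 2 (<0.1%): swap in a head for the 22-DoF robot hand and
        # fine-tune on the small mocap-glove and teleop set.
        self.robot_head = nn.Linear(d, 22)

    def forward(self, frames, stage: int = 1):
        h = self.encoder(frames)                   # assumed to return (B, d)
        return self.hand_head(h) if stage == 1 else self.robot_head(h)
```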
Perhaps the most fascinating finding in the paper is that we discovered a neural scaling law for dexterity: a very clean relationship between the number of pretraining hours and the optimal validation loss. In fact, it's a clean log-linear mathematical equation, arriving six years after the original neural scaling laws for language models.
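The talk doesn't give the fitted constants, but a log-linear law of the kind described would take roughly this form, with $H$ the egocentric pretraining hours, $L^*$ the optimal validation loss, and $a, b > 0$ fitted constants:

$$L^*(H) = a - b \log H$$

In other words, each multiplicative increase in pretraining hours buys a constant additive drop in loss: predictable returns to scale, echoing the original language-model scaling laws.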
If we put all of these data strategies on one chart, with alignment to the robot hardware on the X axis and scalability on the Y axis, this is what it looks like. Teleop: the least scalable. Data wearables: you can get to hundreds of thousands of hours. Egocentric video: if we can spin up the FSD flywheel, easily ten million hours within a year. And if we draw a line here, everything to the left of that line is a new paradigm: sensorized human data. Let me make a few predictions. In the next year or two, we'll see teleop drop and drop, to an almost negligible amount. Then there will be an ensemble of data wearables, custom-designed for different hardware and use cases. And finally, the staple diet of robotics will be egocentric video.
So, a moment of silence for our dear friend teleop. You served us well. Rest in peace. Long live sensorized human data.
Are we done with data strategy yet? Did you notice I drew two rings around it? What's the outer ring? All the frontier LLM labs are now spending significant budgets acquiring millions of coding environments for reinforcement learning. Robotics is the same: we urgently need to scale up environments. Of course, you can always do reinforcement learning directly on the real robot. In our lab, we use RL to push certain tasks to nearly 100% success rates, so you can run continuous execution for hours on end. You know, it's kind of therapeutic to watch these robots assembling GPUs all by themselves. Or, as a wise man would say: good boy, this task has been approved by my boss. But we can't get to a million environments that way, because it would take a million robots. So we need a better way.
Here, say you take a photo with an iPhone and pass it through a 3D world-scan pipeline that extracts all the objects and automatically re-synthesizes them inside a classical physics simulator. After the scan, all of these objects are actually interactive. Then you can augment infinitely in simulation, with variations we call "digital cousins." The iPhone basically becomes a pocket world scanner, in a process we call real-to-sim-to-real. This gives us a scalable way to port the physical world into the digital world.
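The loop just described can be sketched end to end; the stub functions below stand in for pipeline stages and are not real library calls:

```python
def scan_world(photo: str) -> dict:
    # Stage 1: a 3D world-scan pipeline extracts interactive objects
    # (meshes, poses) from a single iPhone photo.
    return {"objects": ["mug", "tray"], "source": photo}

def digital_cousins(scene: dict, n: int) -> list[dict]:
    # Stage 2: augment the scanned scene with n randomized variants of
    # textures, shapes, and layouts ("digital cousins").
    return [{**scene, "variant": i} for i in range(n)]

def build_sim(env: dict) -> str:
    # Stage 3: instantiate each variant in a classical physics simulator,
    # yielding one RL training environment per cousin.
    return f"sim<{env['variant']}>"

scene = scan_world("kitchen.jpg")                          # real -> sim
envs = [build_sim(e) for e in digital_cousins(scene, 1000)]
print(len(envs), "training environments from one photo")   # sim -> real via RL
```

But this method still relies on a classical graphics engine. Can we do better?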
Introducing Dream Dojo. It's our spin on video world models, turning them into full-fledged neural simulators. Dream Dojo takes continuous action signals as input and outputs the next RGB frames and sensor states in real time. Not a single pixel you see here is real. Dream Dojo captures and learns the mechanics of different robots through a purely data-driven approach. There is no physics equation and no graphics engine anywhere in this process.
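An interface for such a neural simulator might borrow the familiar env.step() convention. This is a minimal sketch, in which world_model and its predict() method are hypothetical stand-ins for an action-conditioned video world model:

```python
class NeuralSim:
    """Sketch of a simulator backed by a video world model, not physics."""

    def __init__(self, world_model, init_frame):
        self.model = world_model
        self.history = [init_frame]  # frame context the model conditions on

    def step(self, action):
        # One simulator tick: the model dreams the next RGB frame and the
        # accompanying sensor state directly from data; no physics equations
        # or graphics engine are involved.
        frame, sensors = self.model.predict(self.history, action)
        self.history.append(frame)
        return frame, sensors
```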
So the new post-training paradigm for robotics is a massively parallel RL system that runs on a handful of real robot stations, a bunch of graphics cores running world scans, and heavy inference compute running world models. Or, as the equation goes: compute now equals environments, which now equal data. Or, as a wise man would say: the more you buy, the more you save. This message has been approved by my boss.
So that's it. Putting it all together: the great parallel that robotics will follow. And it's happening as we speak; we're looking at the beginning of the end game. Do you guys play the video game Civilization? It's still my favorite. I like to think of my research as unlocking achievements on this civilizational technology tree.
There are three more achievements for robotics to unlock, and then we're done. I can retire, and I can't wait. The first is passing the physical Turing test: across a wide range of activities, you can't tell the difference between a human doing a task and a robot doing it. Maybe not drunk humans, but you know. The physical Turing test is about unit energy in, unit labor out. And just judging by the sexy pose of this robot, I think we have our work cut out for us. Maybe two to three years away.
Next, the physical API. You have a whole fleet of robots, and they can be configured just like any other software, with APIs and command lines, orchestrated someday by Opus 9.0. If we have this physical API, we can realize lights-out factories. Those are essentially printers of atoms: they take a design in a markdown file as input and output fully assembled products, completely autonomously. Or wet labs that automate scientific discovery in chemistry, biology, and medicine.
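To make the speculation concrete, a "physical API" call might read like this; the Fleet client below is entirely invented for illustration, since no such SDK exists:

```python
class Fleet:
    """Imagined client for a scriptable robot fleet; nothing here is real."""

    def __init__(self, site: str):
        self.site = site

    def submit(self, task: str, spec: str, robots: int) -> dict:
        # In a lights-out factory, this would dispatch work to the fleet.
        print(f"[{self.site}] {robots} robots -> {task} per {spec}")
        return {"status": "queued"}

fleet = Fleet("factory-7")
job = fleet.submit(task="assemble",
                   spec="designs/widget-v2.md",  # design as markdown, per the talk
                   robots=32)
```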
And the final stop: physical auto research, when robots start to design, improve, and build the next iteration of themselves, far beyond what's humanly possible.
You might ask: is this too science fiction? Will we see it in our lifetime? It took the AI community 14 years to go from the first forward pass of AlexNet in 2012, a model that could barely tell a cat from a dog, to AI Ascent today, in 2026, where we're talking about agentic auto research. So how about we just add another 14 years? 2026 sits exactly in the middle of 2012 and 2040. And technology doesn't advance linearly; it advances exponentially. So I can say with 95% certainty that we'll reach the end of the end game, the end of the technology tree, by 2040. And we'll all still be young. If you believe in robotics, robotics…
To all of us sitting here: I think our generation was born too late to explore the Earth, and too early to explore the stars, but we were born just in time to solve…
Thank you.
*[applause]*