The Godmother of AI on jobs, robots & why world models are next | Dr. Fei-Fei Li
Overview
AI's "godmother" Fei-Fei Li discusses how ImageNet gave rise to modern AI, why AGI is a marketing term, World Labs' launch of the Marble world-model product, and Marble's applications in VFX and robotics simulation.
Key insights
- Fei-Fei Li calls herself a humanist, not a utopian: she does not deny AI's impact on jobs and society, but believes technology has been a net positive for human civilization, provided every participant "acts like a responsible individual." What she tells every graduating class at Stanford: "Your field is called artificial intelligence, but there is nothing artificial about it."
- ImageNet's core insight was that the missing ingredient was data, not models: in 2006-2007, Fei-Fei Li and her students made a decision that seemed almost crazy at the time, curating the internet's image data with a handful of graduate students and ultimately assembling 15 million images across 22,000 concept categories. In 2012, Geoffrey Hinton's team won the ImageNet challenge using that data, two Nvidia GPUs, and a neural network, and modern AI was born. The "golden trio" of big data, neural networks, and GPUs remains the technical core of products like ChatGPT.
- She considers AGI "a marketing term rather than a scientific term": AI cannot do tasks a young child can (such as watching a video and counting the chairs in a room), let alone make creative leaps like Newton deriving the laws of mechanics from celestial motion. Even given all modern observational data, AI could not derive the 17th-century equations of motion.
- World Labs launched Marble, the world's first "prompt-to-worlds" product: a sentence or an image generates a navigable 3D world. In a virtual-production test with Sony, production time shrank 40x. It is already in use for VFX, game development, robotics simulation, and psychology research.
- The Bitter Lesson is far from validated in robotics: a language model's training data (text) and output (text) are perfectly aligned, while a robot's training data (video) and required output (actions in a 3D world) are fundamentally mismatched. Self-driving took 20 years from Stanford's prototype winning the DARPA challenge in 2005 to Waymo on the road today, and a self-driving car is still a simple robot that "drives on 2D surfaces with the goal of touching nothing."
- The thread running through the conversation: from ImageNet to world models to Marble, every step of Fei-Fei Li's career reflects the same judgment: AI's most underrated bottleneck is not algorithms but data and perception; language is only part of intelligence, and spatial understanding is the key bridge between AI and the physical world.
A humanist optimist: technology cuts both ways, and people are the key
Key point: Fei-Fei Li does not deny AI's risks, but believes technology is a net positive for civilization, provided every participant takes responsibility.
- She explicitly distinguishes herself from utopians: "I'm not a utopian. It's not that I think AI will have no impact on jobs or people. In fact, I'm a humanist." Her optimism rests on the history of human civilization: from the invention of writing onward, humans have continuously invented tools and improved their lives.
- A widely quoted line from her congressional testimony: "There is nothing artificial about AI. It's inspired by people, created by people, and, most importantly, it impacts people." Lenny calls it the best line he has ever heard about AI.
- Her parting words to every Stanford graduating class have been the same for twenty years: "Your field is called artificial intelligence, but there is nothing artificial about it."
- Asked "what do we need to get right," her answer is surprisingly plain: everyone should act like a responsible individual, whether you are building AI, deploying it, or using it. "That's what we teach our kids, and it's what we should do as adults."
"It's not like I think AI will have no impact on jobs or people. In fact, I'm a humanist. I believe that whatever AI does currently or in the future is up to us." —— Fei-Fei Li
A brief history of AI: from the Dartmouth Workshop to ChatGPT, three generations over 70 years
Key point: the field of AI is 70 years old and has moved through three stages, from logic systems to expert systems to machine learning; Fei-Fei Li belongs to the "first generation of machine learning researchers."
- AI formally began at the 1956 Dartmouth Workshop, where John McCarthy, who later joined the Stanford faculty, coined the term "artificial intelligence."
- The 1950s through the 1980s were the early exploratory period: logic systems, expert systems, and early neural networks.
- The roughly 20 years from the late 1980s to the early 2000s were the birth of machine learning, the "marriage" of computer programming and statistical learning. The core breakthrough was recognizing that purely rule-based systems could not cover the breadth of cognition, so machines had to learn patterns.
- Fei-Fei Li began her AI PhD at Caltech in 2000, squarely in that first machine-learning generation. She chose visual intelligence as her entry point because "humans are deeply visual animals."
- A surprising fact: as late as 2015-2016, some tech companies still avoided the word "AI," unsure whether it was a dirty word, while Fei-Fei Li was encouraging everyone to use it. Companies only began calling themselves "AI companies" around 2017, less than a decade ago.
"I was actually encouraging everybody to use the word AI because to me that is one of the most audacious questions humanity has ever asked." —— Fei-Fei Li
The birth of ImageNet: a nearly crazy bet on big data
Key point: Fei-Fei Li's core insight was that AI lacked not better models but enough data. She curated the internet's image data with a handful of graduate students.
- While studying various mathematical models (neural networks, Bayesian networks, and others) with her students, she hit a pain point: the models had no data to train on. She drew inspiration from human learning and evolution: humans learn from vast amounts of experience, and evolution itself is a big-data learning process.
- In 2006-2007 they made a "very ambitious" decision: to capture the entire internet's image data on objects. "The internet was much smaller then, so the ambition at least wasn't too crazy. But thinking a couple of graduate students and a professor could do it now looks totally delusional."
- The result: 15 million carefully curated images, a taxonomy of 22,000 concepts built on the linguists' WordNet, all open-sourced to the research community, plus an annual ImageNet challenge.
- Fei-Fei Li takes satisfaction in the after-the-fact validation: Scale AI founder Alexandr Wang emailed her early on to say ImageNet inspired Scale's creation. Lenny notes that today's fastest-growing data-labeling companies (Mercor, Surge, Scale) are essentially doing the same thing: supplying labs with more labeled data.
The 2012 AlexNet moment: modern AI's golden trio
Key point: in 2012, Geoffrey Hinton's University of Toronto team broke through using ImageNet data, two GPUs, and a neural network; those three ingredients remain the technical core of products like ChatGPT.
- In 2012, a University of Toronto team led by Professor Geoffrey Hinton entered the ImageNet challenge and, using the ImageNet dataset, two Nvidia GPUs, and a neural network algorithm, made enormous progress on object recognition. This is widely regarded as the birth of deep learning and of modern AI.
- Lenny is struck by the "two GPUs" detail: they were ordinary gaming GPUs at the time, whereas models today are trained on hundreds of thousands of far more powerful GPUs.
- From 2012 to ChatGPT, the recipe is essentially unchanged: the data went from images to internet-scale text, the architectures grew more complex but are still neural networks, and the GPU counts exploded but they are still GPUs. Fei-Fei Li: "If you look at the ingredients of ChatGPT, it's still these three things."
"A group of Toronto researchers led by Professor Geoff Hinton participated in the ImageNet challenge, used the ImageNet big data and two GPUs from Nvidia, and successfully created the first neural network algorithm that made huge progress towards solving object recognition." —— Fei-Fei Li
AGI is a marketing term: AI can't even count chairs
Key point: Fei-Fei Li sees AGI as more marketing than science; AI remains far behind in spatial understanding, creative reasoning, and emotional intelligence.
- She says flatly, "I don't know if anyone has ever defined AGI," with definitions ranging from "machine superpowers" to "machines as economically viable agents in society that can earn a wage and support themselves." As a scientist, she refuses to get drawn into the AI-versus-AGI definition fight.
- Three examples of AI's limits:
  - Spatial understanding: show AI videos of a few offices and ask it to count the chairs; a young child or elementary schooler can do it, AI cannot.
  - Creative reasoning: Newton derived the equations of motion from observing celestial bodies; even given data from every modern instrument, AI could not derive those 17th-century equations. Demis Hassabis (DeepMind) proposes a similar test: give a model all pre-20th-century information and see whether it can make Einstein's breakthroughs. The answer is nowhere close.
  - Emotional intelligence: a student walks into a teacher's office to talk about motivation, confusion, and passion; however capable today's conversational AI is, it does not reach that level of emotional-cognitive understanding.
"I don't know if anyone has ever defined AGI... I feel AGI is more a marketing term than a scientific term." —— Fei-Fei Li
World models and spatial intelligence: language is only part of intelligence
Key point: human intelligence is not just language; spatial understanding, object interaction, and situational awareness matter just as much. World models are the linchpin connecting language models to the physical world, including robots.
- Fei-Fei Li illustrates spatial intelligence with a first-responder scene: firefighters and medics organizing a rescue in a chaotic setting act largely on instant understanding of objects, space, and situation; "language cannot help you put out the fire."
- She draws a striking analogy from the discovery of DNA: Rosalind Franklin's X-ray diffraction photo was 2D, yet Watson and Crick deduced the highly three-dimensional double helix from it, which requires 3D spatial reasoning; "you cannot think in 2D and deduce that structure." Even in scientific discovery, spatial intelligence is irreplaceable.
- She first laid out spatial intelligence and world models systematically in her 2024 TED Talk. The seed was planted early in the large-language-model era, before the public had noticed its power, when she had long conversations with Stanford NLP colleagues Percy Liang and Chris Manning about the future of language models; Stanford HAI was also the first to establish a foundation-model research center. But she kept thinking: there is so much to push forward beyond language.
- A world model is not just "describe a scene, generate a world": it is infrastructure that lets anyone create a 3D world from text or images and then navigate, interact, and reason within it. If the consumer is a robot, a world model can help it plan a path or tidy a kitchen.
"So much of our intelligence is built upon visual, perceptual, spatial understanding, not just language per se." —— Fei-Fei Li
Robotics' Bitter Lesson: data mismatch plus physical complexity
Key point: Richard Sutton's "Bitter Lesson" (general methods plus massive data and compute win out) is far from validated in robotics, which faces both hard-to-get data and the complexity of physical systems.
- Ben Horowitz suggested Lenny ask Fei-Fei Li why the Bitter Lesson alone cannot solve robotics. Her reply: the ImageNet paper predates the Bitter Lesson essay. "To me it wasn't a bitter lesson, it was a sweet lesson, because I built ImageNet precisely out of belief in big data."
- The data-mismatch problem: language models enjoy a "perfect setup" where the training data is text and the output is text, perfectly aligned. Robots must output actions in a 3D world, but the training data (web videos) contains no actions. "It's like fitting a square peg in a round hole." Current workarounds include teleoperation data and synthetic data, but the problem is far from solved.
- Physical-system complexity: robots are closer to self-driving cars than to language models; they need not just a brain but a physical body and application scenarios. From Sebastian Thrun's Stanford car winning the DARPA challenge in 2005 to Waymo operating on San Francisco streets today took a full 20 years, and a self-driving car is just "a metal box running on 2D surfaces with the goal of touching nothing." Robots are "3D things running in a 3D world whose goal is to touch things."
"Self-driving cars are much simpler robots. They're just metal boxes running on 2D surfaces, and the goal is not to touch anything. Robots are 3D things running in a 3D world, and the goal is to touch things." —— Fei-Fei Li
World Labs and Marble: the world's first large world-model product
Key point: World Labs, a team of about 30, spent a bit over a year building the world's first generative model that produces true 3D worlds; the product, Marble, supports "prompt to worlds."
- World Labs was founded about 18 months ago by Fei-Fei Li and three co-founders, Justin Johnson, Christoph Lassner, and Ben Mildenhall, all from AI, computer graphics, and computer vision research. Investors include Andreessen Horowitz (Ben Horowitz is a longtime acquaintance of Fei-Fei Li's).
- Marble is World Labs' first product, built on a frontier model they developed from scratch. A sentence or one or more images generates a navigable 3D world, with VR headset support ("put on goggles and you can walk around inside").
- A telling product detail: Marble enters a world by showing a field of dots before rendering full textures. This is not a property of the model itself but a transition animation the engineering team designed deliberately. Many users report the effect is delightful; Lenny says it evokes The Matrix. Fei-Fei Li calls it a "researcher learns a product lesson" moment: UX design, not just the hardcore model, creates user delight.
Marble in use: 40x faster VFX, psychology research, and other surprises
Key point: Marble is already being used in virtual production, games, robotics simulation, and psychology research; in VFX it cut production time 40x.
- Virtual production / VFX: World Labs partnered with Sony to shoot the launch video with Marble. Technical directors and artists reported a 40x reduction in production time, possibly more ("I had to because we only had one month to work on this project"). The workflow aligns cameras to the Marble-generated 3D world and has actors perform within it.
- Game development: users are already exporting meshes from Marble scenes for VR games and other game projects.
- Robotics simulation: as a robotics researcher herself, Fei-Fei Li knows the pain point: training robots requires large volumes of diverse synthetic data (different environments, different objects), and hand-building every scene asset is too slow. Marble can generate these simulation environments nearly in real time.
- Psychology research (a surprise use case): a psychology team approached World Labs wanting Marble to generate immersive scenes with different characteristics (a messy versus a tidy room) for psychiatric patients, to study how the brain responds to different environments; traditional methods cost enormous time and budget. Lenny connects this to exposure therapy for fear of heights, spiders, and the like; Fei-Fei Li says a friend had called the night before asking whether Marble could be used to treat acrophobia.
The essential difference from video models: true 3D spatial intelligence
Key point: World Labs is not generating 2D video but true 3D worlds that users can navigate, interact with, and export as meshes, which is fundamentally different from video-generation models such as Veo 3.
- Fei-Fei Li explains the nature of vision with Plato's allegory of the cave: a prisoner bound to a chair sees only projections on the wall and must infer the real 3D world behind them. "Spatial intelligence is the ability to understand the 3D and even 4D world from 2D."
- Video models output flat 2D video that users can only watch passively. Marble generates worlds with 3D structure: users can move the camera freely, export video along whatever shot trajectory a director wants, and even export meshes into other tools.
- A few weeks earlier, World Labs also released the world's first demo of real-time video generation on a single H100 GPU, but that is only part of the technology; the core differentiation is 3D.
The founder's journey: intellectual fearlessness and the shock of the talent war
Key point: Fei-Fei Li's career choices follow one pattern: follow your passion, follow great people, and don't overthink failure. Still, the intensity of the AI talent war exceeded her expectations.
- At 19 she ran a dry-cleaning business in the US (an immigrant family venture), then took the academic path. On the verge of tenure at Princeton, she jumped to Stanford and restarted the clock, because "the people at Stanford and the Silicon Valley ecosystem were just incredible." She later became the first female director of SAIL (the Stanford AI Lab) while still a relatively young professor, then went to Google Cloud AI simply to work with people like Jeff Dean, Geoffrey Hinton, and Demis Hassabis.
- Her core advice is "intellectual fearlessness": creating something new means nobody has done it before, so you must allow yourself to be fearless and bold. "I don't overthink all the possible things that can go wrong, because that's too many."
- Her advice to young AI talent: don't over-optimize every dimension of salary, title, company valuation, and tech stack. "I find myself constantly going into mentor mode, watching a brilliant young person agonize over every little detail of a job decision." Focus on three things instead: Where is your passion? Do you believe in the mission? Do you trust the team?
- What she never anticipated was the ferocity of the AI talent war: when World Labs was founded there were no eye-popping stories yet about what certain top researchers commanded; now those numbers leave her anxious on a regular basis.
"I don't overthink of all possible things that can go wrong because that's too many." —— Fei-Fei Li
Stanford HAI: from the academic ivory tower to Capitol Hill
Key point: her time at Google convinced Fei-Fei Li that AI is a "civilizational technology" needing a human-centered governance framework. Stanford HAI has since grown into the largest AI research institute of its kind, involving hundreds of professors across eight schools.
- Her 2018 experience at Google led to a key judgment: AI would be a civilizational technology and needed a human-centered framework for its development. She laid out the idea in The New York Times and co-founded HAI with professors including John Hennessy, James Landay, and Chris Manning.
- Six or seven years on, HAI spans all eight Stanford schools, from medicine to education to sustainability to business to engineering to the humanities to law, with research across the digital economy, legal studies, political science, drug discovery, and new algorithms beyond Transformers.
- Concrete policy work: a congressional AI boot camp, the annual AI Index report, helping advance the National AI Research Cloud legislation (passed during the first Trump administration), and engagement in state-level AI regulation debates.
- A core finding behind founding HAI: at the time there was almost no dialogue between Silicon Valley and Washington, DC. "Given how consequential this technology is, we need to bring everyone to the table."
Everyone has a role in AI: from artists to farmers to nurses
Key point: the question Fei-Fei Li hears most on her travels around the world is "what is my role in AI." Her answer: everyone has one, and the bottom line is that human dignity and agency must never be taken away by technology.
- She criticizes Silicon Valley's habits of speech: "We casually throw around terms like 'infinite productivity,' 'infinite leisure time,' 'infinite compute,' but in the end AI is about people."
- Her advice for different audiences:
  - Young artists: embrace AI as a tool, "because the way you tell stories is unique and the world still needs it," while learning to use the best tools to tell your unique story.
  - A farmer near retirement: AI is still relevant to you because you are a citizen; you can take part in your community's decisions about how AI is used, and you should have a voice.
  - Nurses: Fei-Fei Li has done extensive healthcare AI research throughout her career, because nurses are overworked and fatigued while an aging society needs more care. AI can help via smart cameras that surface more information and via robotic assistance.
- Her bottom line: "No technology should take away human dignity. Human dignity and agency should be at the heart of the development, deployment, and governance of every technology."
"No technology should take away human dignity. Human dignity and agency should be at the heart of the development, the deployment, as well as the governance of every technology." —— Fei-Fei Li
Appendix: key people, organizations, products, and figures
| Item | Details |
|------|------|
| Dr. Fei-Fei Li | Stanford professor; creator of ImageNet; co-founder of World Labs; former Chief Scientist of AI at Google Cloud; co-founder of Stanford HAI |
| World Labs | Spatial-intelligence company founded by Fei-Fei Li; team of about 30; founded roughly 18 months ago |
| Marble | World Labs' first product; the world's first "prompt-to-worlds" 3D world generation application |
| Justin Johnson / Christoph Lassner / Ben Mildenhall | World Labs co-founders, all from AI / computer graphics / computer vision research |
| ImageNet | Started 2006-2007; 15 million images; 22,000 concept categories; catalyzed modern AI |
| Geoffrey Hinton | University of Toronto professor; led the 2012 deep-learning breakthrough using ImageNet and two GPUs |
| Alexandr Wang | Scale AI founder; wrote to Fei-Fei Li early on that ImageNet inspired Scale |
| Sebastian Thrun | Stanford professor; won the DARPA self-driving challenge in 2005 |
| Stanford HAI | Founded 2018; the largest human-centered AI research institute, spanning all eight Stanford schools |
| John McCarthy | Coined the term "artificial intelligence" at the 1956 Dartmouth Workshop |
| Percy Liang / Chris Manning | Stanford NLP professors who discussed the future of language models with Fei-Fei Li |
| Richard Sutton | Author of the "Bitter Lesson" essay; Turing Award winner; reinforcement-learning pioneer |
| Ben Horowitz | a16z co-founder; World Labs investor |
| VFX acceleration | 40x reduction in production time in the Sony collaboration test |
| Human brain power draw | About 20 watts, less than any lightbulb in the room |
| Self-driving timeline | 20 years from the 2005 DARPA prototype to Waymo on the road today |
| Real-time video on one H100 | World Labs' demo, the world's first real-time video generation on a single GPU |
You don't think it's going to take all our jobs; you don't think it's going to kill us. So I thought it'd be fun to start there: what's your perspective on how AI is going to impact humanity over time?
And if we're not doing the right thing as a species, as a society, as communities, as individuals, we can screw this up as well.
How many hours do we have?
So I thought it'd be really interesting to hear from you a brief history: what the world was like before ImageNet, the work you did to create ImageNet, why that was so important, and then what happened after.
Alan Turing was ahead of his time in the 40s, daring humanity with the question: can there be thinking machines? He had a specific way of testing this concept of a thinking machine, which is a conversational chatbot, and by his standard we now have thinking machines. But that was more of an anecdotal inspiration. The field really began in the 50s, when computer scientists came together and looked at how we could use computer programs and algorithms to build programs that can do things that had only been possible for human cognition.
That was the beginning. At the Dartmouth Workshop in 1956, Professor John McCarthy, who later came to Stanford, coined the term artificial intelligence. The 50s, 60s, 70s, and 80s were the early days of AI exploration: we had logic systems, we had expert systems, and we also had early exploration of neural networks. Then came the late 80s, the 90s, and the very beginning of the 21st century. That stretch of about 20 years is actually the beginning of machine learning. It's the marriage between computer programming and statistical learning, and that marriage brought a very critical concept into AI: a purely rule-based program is not going to account for the vast amount of cognitive capability we imagine computers can have. So we have to use machines to learn the patterns. Once machines can learn patterns, they have a hope of doing more things.
For example, if you give it three cats, the hope is not just for the machine to recognize those three cats. The hope is that the machine can recognize the fourth cat, the fifth cat, the sixth cat, and all the other cats. That's a learning ability that is fundamental to humans and many animals, and we as a field realized we needed machine learning. So that was up until the beginning of the 21st century. I entered the field of AI literally in the year 2000; that's when my PhD began at Caltech. I was one of the first-generation machine learning researchers, and we were already studying this concept of machine learning, especially neural networks. I remember one of my first courses at Caltech was called Neural Networks, but it was very painful. It was still smack in the middle of the so-called AI winter, meaning the public didn't pay much attention to the field.
There wasn't that much funding, but there were a lot of ideas flowing around. I think two things happened that brought my own career so close to the birth of modern AI. The first is that I chose to look at artificial intelligence through the lens of visual intelligence, because humans are deeply visual animals. We can talk a little more about this later, but so much of our intelligence is built upon visual, perceptual, spatial understanding, not just language per se. I think they're complementary. So I chose visual intelligence, and through my PhD and my early professor years, my students and I were very committed to a north-star problem: solving object recognition, because it's a building block for the perceptual world. We go around the world interpreting, reasoning, and interacting with it more or less at the object level. We don't interact with the world at the molecular level.
We sometimes do, but rarely. For example, if you want to lift a teapot, you don't say, okay, this teapot is made of a hundred pieces of porcelain, let me work on these hundred pieces; you look at it as one object and interact with it. So objects are really important, and I was among the first researchers to identify this as a north-star problem. But I think what happened is that, as a student of AI and then a researcher of AI, I was working on all kinds of mathematical models, including neural networks, Bayesian networks, many, many models, and there was one singular pain point: these models didn't have data to be trained on. As a field we were so focused on the models, but it dawned on me that human learning, as well as evolution, is actually a big-data learning process.
Humans learn with so much experience, constantly, and with evolution, if you look across time, animals evolve just by experiencing the world. So my students and I conjectured that a critically overlooked ingredient of bringing AI to life is big data, and we began the ImageNet project in 2006-2007. We were very ambitious: we wanted to get the entire internet's image data on objects. Now, granted, the internet was a lot smaller than today, so I felt that ambition was at least not too crazy. Now it looks totally delusional to think a couple of graduate students and a professor could do this. But that's what we did. We carefully curated 15 million images from the internet and created a taxonomy of 22,000 concepts, borrowing other researchers' work, like the linguists' work on WordNet, which is a particular way of organizing words into a dictionary. We combined that into ImageNet and open-sourced it to the research community.
We held an annual ImageNet challenge to encourage everybody to participate, and we continued to do our own research. But 2012 was the moment many people consider the beginning of deep learning, the birth of modern AI, because a group of Toronto researchers led by Professor Geoff Hinton participated in the ImageNet challenge, used the ImageNet big data and two GPUs from Nvidia, and successfully created the first neural network algorithm that, while it didn't totally solve it, made huge progress towards solving object recognition. That combination of the trio of technologies, big data, neural networks, and GPUs, was the golden recipe for modern AI. Then fast-forward to the public moment of AI, which is the ChatGPT moment: if you look at the ingredients that brought ChatGPT to the world, technically it still uses these three ingredients.
Now it's internet-scale data, mostly text; it's a much more complex neural network architecture than in 2012, but it's still a neural network; and a lot more GPUs, but they're still GPUs. So these three ingredients are still at the core of modern AI.
As you said, this continues to be, in a large way, the way models get smarter. Some of the fastest-growing companies in the world right now, and I've had most of them on the podcast, Mercor and Surge and Scale, do exactly this: they keep giving labs more and more labeled data on the things they're most excited about.
That was the changing point: some people started calling it AI. But if you look at Silicon Valley tech companies and trace their marketing terms, I think it was only around 2017 that companies started calling themselves AI companies.
Oh man, okay. Is there anything else around that early history that you think people don't know but is important, before we chat about where things are going and the work that you're doing?
I wonder, if Alan Turing were around today and you asked him to contrast AI versus AGI, he might just shrug and say: well, I asked the same question back in the 1940s. So I don't want to go down a rabbit hole of defining AI versus AGI. I feel AGI is more a marketing term than a scientific term. As a scientist and technologist, AI is my north star; it's my field's north star. I'm happy for people to call it whatever name they want.
There's just so much AI today cannot do, let alone thinking about how someone like Isaac Newton looked at the movements of the celestial bodies and derived a set of equations that governs the movement of all bodies. That level of creativity, extrapolation, and abstraction, we have no way of enabling AI to do today. And then look at emotional intelligence. A student coming into a teacher's office to have a conversation about motivation, passion, what to learn, what problem is really bothering you: as powerful as today's conversational bots are, you don't get that level of emotional, cognitive intelligence from today's AI. So there's a lot we can do better, and I do not believe we're done innovating.
Okay, so let's talk about world models. To me this is another really amazing example of you being ahead of where people end up. You were way ahead on the idea that we just need a lot of clean data for AI and neural networks to learn, and you've been talking about this idea of world models for a long time. You started a company to build them: there are language models, and this is a different thing, a world model; we'll talk about what that is. And now, as I was preparing for this, Elon is talking about world models, Jensen is talking about world models, I know Google's working on this stuff. You've been at this for a long time, and you just launched something right before this podcast airs. So: what is a world model, and why is it so important?
I was, and still am, co-director of Stanford's Human-Centered AI Institute, but at that time I was full-time. I remember the public was not yet aware of the power of the large language model, but as researchers we were seeing it; we were seeing the future. I had pretty long conversations with my natural language processing colleagues, like Percy Liang and Chris Manning, about how critical this technology was going to be. Stanford's Human-Centered AI Institute, HAI, was the first to establish a full research center on foundation models, and Percy Liang and many researchers led the first academic paper on foundation models. So it was very inspiring for me.
Of course, I come from the world of visual intelligence, and I was thinking there's so much we can push forward beyond language, because humans have used our sense of spatial intelligence and world understanding to do so many things, and they are beyond language. Think about a very chaotic first-responder scene, whether it's a fire, a traffic accident, or a natural disaster. If you immerse yourself in those scenes and think about how people organize themselves to rescue people, to stop further disaster, to put down fires: a lot of that is movement, is spontaneous understanding of objects, of worlds, situational awareness. Language is part of that, but in a lot of those situations language cannot get you to put down the fire. So what is that?
I was thinking about this a lot, and in the meantime I was doing a lot of robotics research, and it dawned on me that the linchpin connecting the intelligence beyond language, connecting embodied AI, which is robotics, connecting visual intelligence, is this sense of spatial intelligence, of understanding the world. In 2024 I gave a TED talk about spatial intelligence and world models, formulating this idea based on my robotics and computer vision research. And one thing that was really clear to me is that I really wanted to work with the brightest technologists and move as fast as possible to bring this technology to life. That's when we founded this company called World Labs. You can see the word "world" is in the name of our company, because we believe so much in world modeling and spatial intelligence.
Just like infinitely playable games that you invent out of your head. And then creativity: being fun, having fun, being creative, thinking of wild new worlds and environments.
And also design: humans design everything from machines to buildings to homes. And also scientific discovery. I like to use the example of the discovery of the structure of DNA. One of the most important pieces in DNA's discovery history is the X-ray diffraction photo captured by Rosalind Franklin. It was a flat 2D photo of a structure that looks like a cross with diffraction patterns; you can Google those photos. But from that 2D flat photo, two important humans, James Watson and Francis Crick, with additional information, were able to reason in 3D space and deduce the highly three-dimensional double-helix structure of DNA. That structure cannot possibly be 2D. You cannot think in 2D and deduce that structure; you have to think in 3D, using human spatial intelligence. So even in scientific discovery, spatial intelligence, or AI-assisted spatial intelligence, is critical.
And it's oftentimes the things that just look like, okay, this is cool, this is fun to play with, that end up changing the world the most.
I reached out to Ben Horowitz, who loves what you're doing, a big fan of yours. They're investors, I believe.
研究的成熟度远不如语言模型。很多人还在尝试不同的算法,其中一些确实由大数据驱动。所以我认为大数据会继续在机器人领域发挥作用。但机器人面临几个困难。第一,数据要难获取得多。你可能说网络上有视频数据——确实,最新的机器人研究在使用网络视频,我认为网络视频确实有用。但想想是什么让语言模型成功的。作为一个做计算机视觉、空间智能和机器人的人,我非常嫉妒做语言的同事,因为他们有一个完美的设定:训练数据是文字(最终是token),然后模型输出的也是文字。
输入和输出之间有完美的对齐——你希望得到的东西(我们称之为目标函数)和你的训练数据看起来一样。但机器人不同。即使空间智能也不同。你希望从机器人那里得到的是动作,但你的训练数据缺少3D世界中的动作。而这恰恰是机器人需要做的——在3D世界中执行动作。所以你必须找到不同的方法来解决——就像把方形塞进圆孔一样。我们有的是大量的网络视频,所以我们必须开始讨论补充数据,比如遥操作数据或合成数据,这样机器人才能在"苦涩的教训"假说——即大量数据——下被训练。
我认为仍然有希望,因为我们在世界建模方面所做的工作将真正为机器人解锁大量信息。但我们必须谨慎,因为我们还处于早期阶段,苦涩的教训还有待验证,因为我们还没有完全搞清楚数据问题。关于机器人领域苦涩的教训的另一个方面,我认为我们必须非常现实:与语言模型甚至空间模型相比,机器人是物理系统。所以机器人更接近自动驾驶汽车而不是大语言模型。认识到这一点非常重要。这意味着要让机器人工作,我们不仅需要大脑,还需要物理身体,还需要应用场景。看看自动驾驶汽车的历史——我的同事Sebastian Thrun带领Stanford的车在2005年或2006年赢得了第一个DARPA挑战赛。
从那个能在内华达沙漠中行驶130英里的自动驾驶原型车到今天旧金山街头的Waymo,已经过去了20年,而且我们还没完成,还有很多问题。这是一段20年的旅程。而自动驾驶汽车是简单得多的机器人——它们只是在2D路面上行驶的金属盒子,目标是不碰任何东西。机器人是在3D世界中运作的3D物体,目标是碰东西。所以这段旅程会涉及很多方面。当然有人会说,自动驾驶早期的算法是深度学习之前的,深度学习在加速"大脑"的发展——我认为这是对的。这就是为什么我做机器人研究,为什么我做空间智能,我为此感到兴奋。但与此同时,汽车工业非常成熟,产品化还涉及成熟的用例、供应链和硬件。所以我觉得这是一个非常有意思的时期来研究这些问题。
但Ben说得对,我们可能仍然要经历一些苦涩的教训。
It's not the the research is not nearly as mature as say language models. So many people are still um experimenting with different algorithms and some of those algorithms are driven by big data. So I do think big data will continue to play a role in robotics and um but what is hard for robotics there are a couple of things one is that it's harder to get data it's a lot harder to get data you can say well there is web data this is where the latest robotics research is using web videos and I think web videos do do play a role but if you Think about what made language model work. A very as someone who does computer vision and and spatial intelligence and robotics, I'm very jealous of my colleagues in um in language because they had this perfect setup where their training data are in words eventually tokens and then they produce a model that outputs words.
So you have this perfect alignment between what you hope to get, which we call the objective function, and what your training data looks like. But robotics is different, and even spatial intelligence is different. You hope to get actions out of robots, actions in 3D worlds, but your training data lacks actions in 3D worlds. So you have to find ways to fit a square peg into a round hole: what we have is tons of web videos, so we have to start supplementing with data such as teleoperation data or synthetic data, so that robots can be trained under the bitter-lesson hypothesis of large amounts of data.
I think there's still hope, because even what we are doing in world modeling will unlock a lot of this information for robots. But we have to be careful: we're in the early days, and the bitter lesson is still to be tested, because we haven't fully figured out the data. Another part of the bitter lesson of robotics that we should be realistic about is that, compared to language models or even spatial models, robots are physical systems. Robots are closer to self-driving cars than to a large language model, and that's very important to recognize. It means that for robots to work, we not only need brains, we also need the physical body and the application scenarios. If you look at the history of the self-driving car, my colleague Sebastian Thrun took Stanford's car to win the first DARPA Grand Challenge in 2005.
It's been 20 years from that prototype driving 130 miles through the Nevada desert to today's Waymo on the streets of San Francisco, and we're not even done yet. There's still a lot to do. So that's a 20-year journey, and self-driving cars are much simpler robots: they're metal boxes running on 2D surfaces, and the goal is not to touch anything. A robot is a 3D thing running in a 3D world, and the goal is to touch things. Of course, one could say the early self-driving-car algorithms were from the pre-deep-learning era, and deep learning is accelerating the brains. I think that's true; that's why I'm in robotics and spatial intelligence, and I'm excited by it. But in the meantime, the car industry is very mature, and productizing also involves mature use cases, supply chains, and hardware. So I think it's a very interesting time to work on these problems.
But it's true, Ben is right: we might still be subject to a number of bitter lessons.
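The data-alignment argument above can be made concrete with a minimal sketch (all class and field names here are hypothetical, purely for illustration): a language model's training pairs and its deployment outputs live in the same space (text), while a robot's most abundant data (web video) lacks the action labels a policy must produce, which is why teleoperation or synthetic data has to fill the gap.

```python
from dataclasses import dataclass
from typing import List, Optional

# Language-model example: input and target are both text, so the
# training data and the deployed objective are perfectly aligned.
@dataclass
class LMExample:
    prompt: str
    target: str  # same modality as what the model must emit

# Robotics example: web video carries pixels at scale but no actions;
# the action labels the policy needs must come from elsewhere
# (teleoperation, synthetic simulation, etc.).
@dataclass
class RobotExample:
    video_frames: List[bytes]               # plentiful (web-scale)
    actions: Optional[List[float]] = None   # usually missing

web_clip = RobotExample(video_frames=[b"frame0", b"frame1"])
teleop_clip = RobotExample(video_frames=[b"frame0"], actions=[0.1, -0.3])

# Only examples carrying action labels can directly supervise a policy.
dataset = [web_clip, teleop_clip]
usable = [ex for ex in dataset if ex.actions is not None]
print(len(usable))  # → 1: most web data needs supplementing before use
```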
We have an incredible founding team of technologists from incredible teams. About a month or two ago, we saw for the first time that we could prompt with a sentence, an image, or multiple images and create worlds we can navigate in. If you put on goggles, which we give you the option to do, you can even walk around inside. Even though we had been building this for quite a while, seeing it for the first time was still awe-inspiring, and we wanted to get it into the hands of people who need it. We know that so many creators, designers, people working on robotic simulation, people thinking about different use cases of navigable, interactable, immersive worlds, and game developers will find it useful. So we developed Marble as a first step.
It's still very early, but it's the world's first model doing this, and the world's first product that lets people just, as we call it, prompt to worlds.
That's so cool to hear, because this is where, as a researcher, I'm learning. The dots that lead you into the world are an intentional visualization feature; they are not part of the model. The model just generates the world. We were trying to find a way to guide people into it, and a number of engineers worked on different versions before we converged on the dots. You're not the only one who has told us how delightful that experience is, and it was really satisfying for us to hear that this intentional visualization feature, not the big hardcore model itself, has delighted our users.
Yes, probably even more, because we only had one month to work on this project and there were so many things they were trying to shoot. Using Marble really significantly accelerated virtual production for VFX and movies. That's one use case. We're already seeing users take a Marble scene, export the mesh, and put it into games, whether VR games or other fun games they've developed. We also showed an example of robotic simulation, because I'm still a researcher working on robot training myself. One of the biggest pain points is creating synthetic data for training robots, and that synthetic data needs to be very diverse: it needs to come from different environments with different objects to manipulate. One path is to ask computers to simulate it.
Otherwise, humans have to build every single asset for robots by hand, and that would just take far too long. So we already have researchers reaching out who want to use Marble to create those simulation environments. We've also had unexpected user outreach about how people want to use Marble. For example, a psychology team called us wanting to use Marble for psychology research. It turns out that for some of the psychiatric patients they study, they need to understand how the brain responds to immersive scenes with different features, for example messy scenes versus clean scenes. It's very hard for researchers to get their hands on these kinds of immersive scenes, and creating them themselves would take too long and too much budget. Marble is an almost instantaneous way of getting many of these experimental environments into their hands.
So we're seeing multiple use cases at this point, but the VFX people, game developers, simulation developers, and designers are very excited.
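The "diverse synthetic environments" idea described above can be sketched very roughly as sampling many randomized scene configurations instead of hand-building each asset. This is a generic illustration of scene randomization for robot training; none of the names below are World Labs or Marble APIs, and the room and object lists are invented for the example.

```python
import random

# Illustrative vocabularies for randomized training scenes.
ROOM_TYPES = ["kitchen", "office", "warehouse"]
OBJECTS = ["mug", "box", "bottle", "book"]

def sample_scene(rng: random.Random) -> dict:
    """Randomize the layout and manipulable objects for one scene."""
    return {
        "room": rng.choice(ROOM_TYPES),
        "objects": [
            {
                "name": rng.choice(OBJECTS),
                # Randomized pose: x, y on the floor plane, yaw in degrees.
                "pose": (rng.uniform(0, 5), rng.uniform(0, 5), rng.uniform(0, 360)),
            }
            for _ in range(rng.randint(1, 4))
        ],
    }

# Generate a large, varied batch of environments almost for free,
# versus hand-authoring each one.
rng = random.Random(0)
scenes = [sample_scene(rng) for _ in range(1000)]

# Diversity check: every room type appears somewhere in the batch.
print(len({s["room"] for s in scenes}))  # → 3
```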
So spatial intelligence, to me, is deeper than creating a flat 2D world. Spatial intelligence is the ability to create, reason about, interact with, and make sense of a deeply spatial world, whether it's 2D, 3D, or 4D, including dynamics and all of that. World Labs is focused on that. Of course, the ability to create videos per se could be part of it; in fact, just a couple of weeks ago we rolled out the world's first demoable real-time video generation on a single H100 GPU, so our technology includes that. But Marble is very different, because we really want creators, designers, and developers to have in their hands a model that can give them worlds with 3D structure, so they can use it for their work. That's why Marble is so different.
Well, congrats on the launch. I know this is a huge milestone, and I know it took a ton of work, so I just want to say congrats to you and your team.
Let me ask about your founder journey for a moment. You're a founder of this company. You started it how many years ago, two, three years ago?
And then I founded Google Cloud AI, and then I founded an institute at Stanford, but those are different beasts. I did feel a little more prepared for the grinding journey of being a founder, compared to maybe the 20-year-old founders. But I'm still surprised, and it puts me into paranoia sometimes, by how intensely competitive the AI landscape is, from the model and the technology itself to talent. When I founded the company, we did not yet have these incredible stories of how much certain talent would cost. These are things that continue to surprise me and that I have to be very alert about.
Yeah, you mentioned something I want to come back to. If you look over the course of your career, you were at all of the major gatherings of people that led to so many of the breakthroughs happening today. Obviously we talked about ImageNet, and also SAIL at Stanford, where a lot of the work happened, and Google Cloud, where a lot of the breakthroughs happened. What brought you to those places? For people figuring out how to advance in their careers and be at the center of the future, is there a throughline in what pulled you from place to place and into those groups that might be helpful for people to hear?
For example, when I came to Stanford, in the world of academia I was very close to this thing called tenure, to having the job forever, at Princeton. But I chose to come to Stanford because, as much as I love Princeton, my alma mater, at that moment the people at Stanford and the Silicon Valley ecosystem were so amazing that I was okay taking the risk of restarting my tenure clock. Later I became the first female director of SAIL, the Stanford AI Lab, when I was, relatively speaking, a very young faculty member, and I wanted to do it because I cared about that community. I didn't spend too much time thinking about all the failure cases. Obviously I was lucky that more senior faculty supported me, but I just wanted to make a difference. Going to Google was similar.
I wanted to work with people like Jeff Dean and Geoff Hinton, all these incredible people. It's the same with World Labs: I have this passion, and I believe that people with the same mission can do incredible things. That's how it has guided me through life. I don't overthink all the possible things that can go wrong, because there are too many.
Just focus on the impact you can make and the kind of work and team you can work with.
What are you doing there? I know this is something you still do on the side.
It was a very important decision for me, because I could have stayed in industry. But my time at Google taught me one thing: AI is going to be a civilizational technology. It dawned on me how important this is to humanity, to the point that I wrote a piece in The New York Times in 2018 about the need for a guiding framework to develop and apply AI, and that framework has to be anchored in human benevolence and human-centeredness. I felt that Stanford, one of the world's top universities, in the heart of the Silicon Valley that gave birth to important companies from Nvidia to Google, should be a thought leader in creating this human-centered AI framework and should embody it in our research, education, policy, and ecosystem work.
So I founded HAI. Fast-forward six or seven years, and it has become the world's largest AI institute doing human-centered research, education, ecosystem outreach, and policy impact. It involves hundreds of faculty across all eight schools at Stanford, from medicine to education to sustainability to business to engineering to the humanities to law. We support researchers, especially in interdisciplinary areas, from the digital economy to legal studies to political science to the discovery of new drugs to new algorithms beyond transformers. We also put a very strong focus on policy, because when we started HAI I realized that Silicon Valley did not talk to Washington, DC, or Brussels, or other parts of the world, and given how important this technology is, we need to bring everybody on board.
So we created multiple programs, from a congressional boot camp to the AI Index report to policy briefings, and we participated in policymaking, including advocating for a national AI research cloud bill that was passed in the first Trump administration and participating in state-level AI regulatory discussions. There's a lot we did, and I continue to be one of the leaders, even though I'm much less involved operationally, because I care not only that we create this technology but that we use it in the right way.
But no technology should take away human dignity. Human dignity and agency should be at the heart of the development, the deployment, and the governance of every technology. So if you are a young artist and your passion is storytelling, embrace AI as a tool. In fact, embrace Marble; I hope it becomes a tool for you. The way you tell your story is unique, and the world still needs it. How you tell your story, how you use the most incredible tools to tell it in the most unique way, matters, and that voice needs to be heard. If you're a farmer near retirement, AI still matters to you, because you're a citizen. You can participate in your community, and you should have a voice in how AI is used and applied. And I encourage you to use AI to make life easier for you.
If you're a nurse, I hope you know that, at least in my career, I have worked a great deal in healthcare AI research, because I feel our healthcare workers should be greatly augmented and helped by AI technology, whether through smart cameras that feed them more information or through robotic assistance. Our nurses are overworked and over-fatigued, and as our society ages we need more help taking care of people. AI can play that role. So I just want to say it's so important that even a technologist like me is sincere about this: everybody has a role in AI.