**Lex Fridman:** 接下来是一场关于人工智能前沿进展的对话,包括过去一年 AI 领域一些令人兴奋的技术突破和发展,以及我们认为未来一年可能会发生的有趣事情。有时候讨论会非常技术化,但我们也尽量确保对圈外人来说依然易于理解,同时又不会刻意降低深度。能和 AI 社区中我最欣赏的两位朋友——Sebastian Raschka 和 Nathan Lambert——一起做这样的节目,是莫大的荣幸和快乐。他们都是广受尊敬的机器学习研究者和工程师,同时也是出色的传播者、教育者、作者和 X 上的活跃博主。Sebastian 写了两本我强烈推荐给初学者和专家的书。第一本是《从零构建大语言模型》(Build a Large Language Model from Scratch),第二本是《从零构建推理模型》(Build a Reasoning Model from Scratch)。我真心相信,在机器学习领域,学习和理解事物最好的方式就是自己从头构建。Nathan 是 Allen Institute for AI 的 post-training 负责人,也是关于 RLHF(基于人类反馈的强化学习)权威著作的作者。他们两位都有很棒的 X 账号和 Substack。Sebastian 在 YouTube 上有课程,Nathan 有播客。大家一定要关注。这是 Lex Fridman 播客。如需支持,请查看描述中的赞助商信息,那里也有联系我、提问和反馈的链接。现在,亲爱的朋友们,让我们请出 Sebastian Raschka 和 Nathan Lambert。我觉得审视这一切的一个有用视角是所谓的"DeepSeek 时刻"。大约一年前的 2025 年 1 月,中国的开放权重公司 DeepSeek 发布了 DeepSeek R1。我认为可以公正地说,它以据称远低于其他模型的算力和成本,达到了接近最先进水平的性能,震惊了所有人。从那时到现在,AI 竞争变得疯狂,无论在研究层面还是产品层面,都在加速推进。今天我们就来讨论这一切。也许我们先从一些"辣"的问题开始——在国际层面上,谁在赢?你会说是中国的那些公司还是美国的那些公司?Sebastian、Nathan,很高兴见到你们。那么 Sebastian,你觉得谁在赢?
**Lex Fridman:** - The following is a conversation all about the state-of-the-art in artificial intelligence, including some of the exciting technical breakthroughs an d developments in AI that happened over the past year, and some of the interesting things we think might happen this upcoming year. At times, it does get super technical, but we do try to make sure that it remains accessible to folks outside the field without ever dumbing it down. It is a great hono r and pleasure to be able to do this kind of episode with two of my favorite people in the AI community, Sebastian Raschka and Nathan Lambert. They ar e both widely respected machine learning researchers and engineers who also happen to be great communicators, educators, writers, and X posters. Sebas tian is the author of two books I highly recommend for beginners and experts alike. First is Build a Large Language Model from Scratch and Build a Rea soning Model from Scratch. I truly believe in the machine learning world, the best way to learn and understand something is to build it yourself from scratch. Nathan is the post-training lead at the Allen Institute for AI, author of the definitive book on Reinforcement Learning from Human Feedback. Both of them have great X accounts, great Substacks. Sebastian has courses on YouTube, Nathan has a podcast.
**Sebastian Raschka:** "赢"是一个非常宽泛的词。我会说,你提到了"DeepSeek 时刻",我认为 DeepSeek 赢得了做开放权重模型那些人的心,因为他们把模型以开放的形式分享出来。我觉得"赢"有多个时间尺度——有今天、明年和十年后。有一件事我可以确定的是,在 2026 年的今天,不会有任何一家公司拥有其他公司无法获取的技术。主要原因是研究人员频繁跳槽、换实验室,他们在流动。我不认为在技术获取方面会出现一个明确的赢家。不过我确实认为,差异化因素将是预算和硬件限制。想法不会是独占的,但实现这些想法所需的资源会成为关键。我目前看不到赢家通吃的局面。
**Sebastian Raschka:** And everyone should absolutely follow all of those. those. This is the Lex Fridman podcast. To support it, please check out our sponsors in the description, where you can also find links to c ontact me, ask questions, get feedback, and so on. And now, dear friends, here's Sebastian Raschka and Nathan Lambert. So I think one useful lens to l ook at all this through is the so-called DeepSeek moment.
**Lex Fridman:** Nathan,你怎么看?
**Lex Fridman:** This happened about a year ago in January 2025, when the open-weight Chinese company DeepSee k released DeepSeek R1, that I think it's fair to say surprised everyone with near-state-of-the-art performance, with allegedly much less compute for much cheaper. And from then to today, the AI competition has gotten insane, both on the research and product level. It's just been accelerating. discu ss all of this today, and maybe let's start with some spicy questions if we can. Who's winning at the international level? Would you say it's the set of companies in China or the set of companies in the United States? And Sebastian, Nathan, it's good to see you guys. guys.
**Nathan Lambert:** 你能看到各实验室在他们试图做的事情上投入了不同的精力。我觉得为了标定我们录制的时间点——围绕 Anthropic 的 Claude Opus 4.5 模型的热度简直疯了。我用过它,最近几周还用它搭了一些东西。它几乎已经到了有点像 meme(梗)的程度了,就这种热度。有趣的是,这完全是自发的。然后如果我们往前推几个月,可以看到发布日期和笔记——Google 发布了 Gemini 3,那次发布的营销效果和"哇"的感觉非常强。但在 11 月底,Claude Opus 4.5 发布了,热度一直在增长,而 Gemini 3 是在这之前发布的。感觉人们不怎么讨论它了,尽管它刚出来的时候所有人都在说"这是 Gemini 利用 Google 结构性优势的时刻"。Gemini 3 是一个非常棒的模型,我现在还在用。只是差异化程度变低了。我同意 Sebastian 说的——你说的那些关于想法空间非常流动的观点——但在文化上,Anthropic 以重注押码代码而闻名,就是 Claude Code 这个方向,目前正在为他们带来回报。所以我觉得即使想法流通得很自由,很多东西其实受制于人力投入和组织文化,而 Anthropic 至少看起来是最不混乱的。如果他们能保持一段时间,这就是优势。但在另一边,中国有很多"不祥的"技术,那里的实验室远不止 DeepSeek。DeepSeek 在中国掀起了一场运动,我觉得有点类似于 ChatGPT 在美国掀起的运动——那时所有东西都有了聊天机器人。现在中国有大量科技公司在发布非常强的前沿开放权重模型,以至于我会说 DeepSeek 在中国作为顶尖开放模型制造者的王冠正在动摇。像 Z.ai 的 GLM 模型、MiniMax 的模型、Kimi Moonshot,尤其是最近几个月表现更为亮眼。新的 DeepSeek 模型仍然很强,但这可能——回头看会是一个重要的叙事节点——2025 年 DeepSeek 来了,它提供了一个平台,让更多中国公司发布了这些优秀模型,催生了一种新的运作方式。所以这些中国公司的模型是开放权重的。取决于美国公司的商业模式走向,它们可能面临风险。但目前在美国,很多人在为 AI 软件付费,而从历史上看,在中国和世界其他地方,人们不太愿意为软件付很多钱。
**Nathan Lambert:** So Sebastian, who do you t hink is winning? - Winning is a very broad term. I would say you mentioned the DeepSeek moment, and I think DeepSeek is winning the hearts of the peop le who work on open-weight models because they share these as open models. Winning, I think, has multiple timescales to it. We have today, we have nex t year, we have in 10 years.
**Lex Fridman:** 所以像 DeepSeek 这样的模型因为开放权重而深受人们喜爱。你觉得中国公司还会继续发布开放权重模型多久?
**Lex Fridman:** One thing I know for sure is that I don't think nowadays, in 2026, that there will be any company that has access to tech nology that no other company has access to. That is mainly because researchers are frequently changing jobs and labs. They rotate. I don't think there will be a clear winner in terms of technology access.
**Nathan Lambert:** 我会说还会持续几年。我认为,就像在美国一样,开放权重并没有一个清晰的商业模式。我一直在写关于开放模型的文章,这些中国公司也意识到了这一点。我收到过一些公司的主动联系。他们很聪明,意识到了同样的限制:很多美国顶级科技公司和其他 IT 公司出于安全考虑不会向中国公司购买 API 订阅。这在科技行业是个长期习惯。这些公司的人就把开放权重模型视为一种施加影响力、分享美国巨大且不断增长的 AI 支出市场的方式。他们对此非常务实,而且这对他们正在起作用。我认为政府会看到这在国际上正在建立大量的影响力——在技术采用方面——所以会有很多激励来维持这种势头。但构建这些模型和做研究非常昂贵,所以在某个时刻,我预期会出现整合。但我不认为 2026 年就会发生,2026 年的开放模型构建者会比 2025 年更多,而其中很多值得关注的将来自中国。
**Nathan Lambert:** However, I do think there will be, The differentiating factor will be budget and hardware const raints. I don't think the ideas will be proprietary, but rather the resources needed to implement them. I don't see currently a winner-take-all scenar io. I can't see that.
**Sebastian Raschka:** 你提到 DeepSeek 正在失去王冠。我确实认为在某种程度上是这样,但我们也需要考虑到,他们仍然——我会说——略微领先。其他公司——不是说 DeepSeek 变差了,而是其他公司在使用 DeepSeek 的想法。比如你提到 Kimi——同样的架构,他们在训练。然后又有这种"蛙跳式"竞争,他们可能在某个时间点稍微好一些,因为他们有更新的模型。我觉得这又回到了一个事实:不会有一个明确的赢家。只会是这样:一个人发布了什么,另一个跟上来,最新的模型可能永远是最好的模型。
**Sebastian Raschka:** At the moment. - Nathan, what do you think? - You see the labs put different energy into what they're trying to do, and I think to demarcate the point in time when we're recording this, the hype over Anthropic's Claude Opus 4.5 model has been absolutely insane, which is just... I mean, I've used it and built stuff in the last few weeks, and it's... it's almost gotten to the point where it feels like a bit of a meme in terms of the hype. And it's kind of funny because this is very organic, and then if we go back a few months ago, we can see the release date and the notes, as Gemini 3 from Google got released, and it seemed like the marketing and just, like, wow factor of that release was super high. But then at the end of November, Claude Opus 4.5 was released and the hype has been growing, but Gemini 3 was before this. And it kind of feels like people don't really t alk about it as much, even though when it came out, everybody was like, this is Gemini's moment to retake Google's structural advantages in AI. And Ge mini 3 is a fantastic model, and I still use it. It's just kind of differentiation is lower. And I agree with Sebastian; what you're saying with all t hese, the idea space is very fluid, but culturally Anthropic is known for betting very hard on code, which is the Claude Code thing, is working out fo r them right now. So I think that even if the ideas flow pretty freely, so much of this is bottlenecked by human effort and the culture of organizatio ns, where Anthropic seems to at least be presenting as the least chaotic. It's a bit of an advantage, if they can keep doing that for a while. But on the other side of things, there's a lot of ominous technology from China where there's way more labs than DeepSeek.
**Nathan Lambert:** 是的。我们还会看到中国公司有不同的激励机制。比如 DeepSeek 非常神秘,而一些初创公司——像 MiniMax 和 Z.ai——这两家实际上已经提交了 IPO 文件,他们试图赢得西方的关注度,做了很多外联工作。所以我不知道这些激励机制是否会改变模型开发,因为 DeepSeek 众所周知是由对冲基金 Highflyer Capital 打造的,我们不确切知道他们用模型做什么或者是否在意这些。
**Nathan Lambert:** So DeepSeek kicked off a movement within China, I say kind of similar to how ChatGPT kicked off a movement in the US where everything had a chatbot. There's now tons of tech companies in China that are releasing very strong frontier open-weight models, to the point where I would say that DeepSeek is kind of losing its crown as the p reeminent open model maker in China, and the likes of Z.ai with their GLM models, Minimax's models, Kimi Moonshot, especially in the last few months, has shown more brightly. The new DeepSeek models are still very strong, but that's kind of a... it could look back as a big narrative point where in 2 025 DeepSeek came and it provided this platform for way more Chinese companies that are releasing these fantastic models to kind of have this new type of operation. So these models from these Chinese companies are open-weights, and depending on this trajectory of business models that these American companies are doing, they could be at risk.
**Sebastian Raschka:** 他们在沟通方面是神秘的,但在描述模型工作原理的技术报告方面并不神秘。他们在那方面仍然是开放的。而且我们也应该说,关于 Claude Opus 4.5 的热度——有一层是"X 回音室"里的宠儿、Twitter 回音室的宠儿,还有实际使用该模型的人数。我觉得公平地说,ChatGPT 和 Gemini 聚焦于想在日常生活中解决问题的广大用户群体,而那个用户群体是巨大的。所以围绕编程的热度可能并不代表实际使用情况。
**Sebastian Raschka:** But currently, a lot of people are paying for AI software in the US, and historically in China and other p arts of the world, people don't pay a lot for software. - So some of these models like DeepSeek have the love of the people because they are open-weig ht. How long do you think the Chinese companies keep releasing open-weight models? - I would say for a few years. I think that, like in the US, there' s not a clear business model for it. I have been writing about open models for a while, and these Chinese companies have realized it.
**Nathan Lambert:** 我还想说,很多使用习惯确实像你说的那样——品牌认知度、知名度什么的——也有肌肉记忆的因素,你知道的,ChatGPT 已经存在很长时间了,人们习惯了用它,这几乎形成了一个飞轮:他们推荐给其他用户之类的。还有一个有趣的点是 LLM 的定制化。比如 ChatGPT 有记忆功能,对吧?所以你可能有一个订阅,你用它处理个人事务,但你不一定想在工作中用同一个——因为这是私人和工作之间的界限。如果你在一家公司,公司可能不允许,或者你自己不想这样。我觉得这也是一个有趣的点——你可能会有多个订阅。一个纯粹用于代码,没有你的个人照片或业余项目,纯工作用途。另一个是你的个人用途。所以我觉得未来也是多个并存的。
**Nathan Lambert:** So I get inbound from some of them. And they're smart and realize the same constraints: a lot of top US tech companies and other IT companies won't pay for an API sub scription to Chinese companies for security concerns. This has been a long-standing habit in tech, and the people at these companies then see open wei ght models as an ability to influence and take part of a huge growing AI expenditure market in the US. And they're very realistic about this, and it's working for them.
**Lex Fridman:** 你觉得哪个模型赢了 2025 年?哪个模型会赢 2026 年?
**Lex Fridman:** I think that the government will see that that is building a lot of influence internationally in terms of uptake of the technology, so there's going to be a lot of incentives to keep it going. But building these models and doing the research is very expensive, so at some point, I expect consolidation. But I don't expect that to be a story of 2026, where there will be more open model builders throughout 2026 than there were in 2 025. And a lot of the notable ones will be in China. - You were going to say something? - Yes.
**Nathan Lambert:** 我觉得在消费级聊天机器人的语境下,问题是:你愿不愿意押注 Gemini 而不是 ChatGPT?我直觉上觉得这是一个有点冒险的赌注,因为 OpenAI 是在位者,而在科技行业,这有太多好处。我觉得如果看 2025 年的势头,是在 Gemini 一边,但他们的起点太低了。RIP Bard,以及那些早期的尝试。他们能在组织混乱中坚持下来,做到这一步,值得巨大的肯定。但也很难跟 OpenAI 对赌,因为他们总是看起来很混乱,但非常擅长最终落地。我个人对 GPT-5 的评价很复杂,但它的 high-line 功能本质上是一个路由器——大多数用户不再消耗那么多 GPU 算力——这肯定给他们省了很多钱。所以我觉得很难把我个人喜欢模型的什么,和真正能成为大众差异化因素的东西区分开来。
**Nathan Lambert:** You mentioned DeepSeek losing its crown. I do think to some extent, yes, but we also have to consider though, they are still, I would say, slightly ahead. And the other ones—it's not that DeepSeek got wors e, it's just that the other ones are using the ideas from DeepSeek. For example, you mentioned Kimi—same architecture, they're training it.
**Lex Fridman:** 你对 2026 年怎么看?谁会赢?
**Lex Fridman:** And then a gain, we have this leapfrogging where they might be at some point in time a bit better because they have the more recent model. And I think this comes back to the fact that there won't be a clear winner. It will just be like that: one person releases something, the other one comes in, and the most r ecent model is probably always the best model. - Yeah. We'll also see the Chinese companies have different incentives.
**Nathan Lambert:** 我说一下,虽然有风险。我认为 Gemini 会继续在 ChatGPT 面前取得进展。我觉得 Google 的规模——当两者都在如此极端的规模下运营时——Google 有能力更好地把研究和产品分开,而你总是听到 OpenAI 在运营上很混乱,追逐高影响力的东西,这是一种非常创业公司的文化。然后在软件和企业端,我认为 Anthropic 会继续取得成功,他们一次又一次地被设置在有利位置上。当然 Google Cloud 有很多产品,但我认为 Gemini 这个品牌名对他们很重要。Google Cloud 会继续做得好,但那是一个更复杂的生态系统故事,因为那是在跟 Azure 和 AWS 竞争,而不是在模型提供商层面。
**Nathan Lambert:** Like, DeepSeek is very secretiv e, whereas some of these startups are like the MiniMaxs and Z.ais of the world. Those two literally have filed IPO paperwork, and they're trying to ge t Western mindshare and do a lot of outreach there. So I don't know if these incentives will change the model development, because DeepSeek famously i s built by a hedge fund, Highflyer Capital, and we don't know exactly what they use the models for or if they care about this. - They're secretive in terms of communication; they're not secretive in terms of the technical reports that describe how their models work. They're still open on that front. And we should also say, on the Claude Opus 4.5 hype, there's the layer of something being the darling of the X echo chamber, on the Twitter echo cham ber, and the actual amount of people that are using the model. I think it's probably fair to say that ChatGPT and Gemini are focused on the broad user base that just want to solve problems in their daily lives, and that user base is gigantic.
**Lex Fridman:** 所以在基础设施方面,你认为 TPU 给了 Google 优势?
**Lex Fridman:** So the hype about the coding may not be representative of the actual use. - I would say also a lot of the usage patterns are, like you said, name recognition, brand and stuff, but also muscle memory almost, where, you know, ChatGPT has been around for a long time. People just got used to using it, and it's almost like a flywheel: they recommend it to othe r users and that stuff. One interesting point is also the customization of LLMs. For example, ChatGPT has a memory feature, right?
**Nathan Lambert:** 主要是因为 NVIDIA 芯片的利润率高得离谱,而 Google 可以从上到下开发一切来适配自己的技术栈,不必支付这个利润差。而且他们在建设数据中心方面有先发优势。所以所有这些既有长交付周期又有很高成本利润率的东西,Google 有一种历史性优势。如果要出现新范式,最有可能来自 OpenAI,他们的研究部门一次又一次展示了落地新研究想法或产品的能力。比如 Deep Research、Sora、o1 推理模型——所有这些定义性的东西都来自 OpenAI,这必须是他们作为一个组织的顶级特质之一。所以很难赌 OpenAI 会输,但我认为今年很多事情将围绕规模和优化可以被描述为"唾手可得的成果"的模型改进。
**Nathan Lambert:** And so you may have a subscription and you use it for personal stuff, but I don't know if you want to use that same thing at work. Because it's a boundary between privat e and work. If you're working at a company, they might not allow that or you may not want that. And I think that's also an interesting point where you might have multiple subscriptions.
**Sebastian Raschka:** 而且显然,智能和速度之间存在权衡。这就是 ChatGPT-5 试图在幕后解决的。就是——大众真的想要智能,还是想要速度?
**Sebastian Raschka:** One is just clean code. It has nothing of your personal images or hobby projects in there. It's just like the work thing. And then the other one is your personal thing. So I think that's also something where there are two different use cases, and it doesn't mean y ou only have to have one. I think the future is also multiple ones. - What model do you think won 2025, and what model do you think is going to win '2 6? - I think in the context of consumer chatbots, it's a question of: are you willing to bet on Gemini over ChatGPT?
**Nathan Lambert:** 我觉得有一个好的多样化选择,或者说有一个切换选项,就很好了。
**Nathan Lambert:** Which I would say, in my gut, fee ls like a bit of a risky bet because OpenAI has been the incumbent, and there are so many benefits to that in tech. I think the momentum, if you look at 2025, was on Gemini's side, but they were starting from such a low point. And RIP Bard and these earlier attempts at getting started. Huge credit t o them for powering through the organizational chaos to make that happen.
**Sebastian Raschka:** 我的意思是,就个人使用而言,大多数时候我查东西的时候,我用 ChatGPT 快速问一个问题,快速得到信息。日常任务大多用快速模型。现在自动模式已经很好了,你不用特意选思考模式还是非思考模式。不过有时候我也想用 Pro 模式。我经常做的是,写完一些东西后放进 ChatGPT 里说:"嘿,做一个非常彻底的检查。我的引用都对吗?我的想法都正确吗?有没有格式错误?图编号有没有搞错?"之类的。这个我不需要立刻得到结果。我把东西写完,可能去吃个晚饭,让它跑着,回来再过一遍。我觉得有这个选项很重要。如果每次查询都要等 30 分钟或者哪怕 10 分钟,我会疯的。
**Sebastian Raschka:** But also it's hard to bet against OpenAI because they always come off as so chaotic, but they're very good at landing things. And I think, personally, I have very mixed reviews of GPT-5, but it must have saved them so much mon ey with the high-line feature being a router where most users are no longer charging their GPU costs as much. So I think it's very hard to dissociate the things that I like out of models versus the things that are going to actually be a general public differentiator. - What do you think about 2026? Who's going to win? - I'll say something, even though it's risky. I think Gemini will continue to make progress on ChatGPT.
**Nathan Lambert:** 那是我。我坐在这边听到你用路由器和非思考模型就崩溃了。我就像"你怎么能忍受?"这就是我的反应。我一直重度使用 ChatGPT。我从来没碰过 ChatGPT-5 的非思考模式。我觉得它的语气,以及它出错的倾向——错误概率更高。有一部分原因要追溯到 OpenAI 发布 o3 的时候,那是第一个做深度搜索、找到很多来源并整合起来的模型。我被那个给养成了习惯。所以我在做任何信息查询——不管是论文还是某段代码参考——都只用 GPT-5.2 Thinking 或 Pro。我经常同时开五个 Pro 查询,每个找一篇特定论文或一个等式的反馈什么的。
**Nathan Lambert:** I think Google's scale, wh en both of these are operating at such extreme scales—and Google has the ability to separate research and product a bit better, whereas you hear so mu ch about OpenAI being chaotic operationally and chasing the high-impact thing, which is a very startup culture. And then on the software and enterpris e side, I think Anthropic will have continued success, as they've again and again been set up for that. And obviously Google Cloud has a lot of offeri ngs, but I think this kind of Gemini name brand is important for them to build. Google Cloud will continue to do well, but that's a more complex thing to explain in the ecosystem, because that's competing with the likes of Azure and AWS rather than on the model provider side. - So in infrastructure, you think TPU is giving an advantage? - Largely because the margin on NVIDIA chips is insane, and Google can develop everything from top to bottom to fit their stack and not have to pay this margin. And they've had a head start in building data centers. So all of these things that have both high le ad times and very hard margins on high costs, Google has a just kind of historical advantage there. And if there's going to be a new paradigm, it's mo st likely to come from OpenAI where their research division again and again has shown this ability to land a new research idea or a product. Like Deep Research, Sora, o1 thinking models—all these definitional things have come from OpenAI, and that's got to be one of their top traits as an organizati on. So it's kind of hard to bet against that, but I think a lot of this year will be about scale and optimizing what could be described as low-hanging fruit in models. - And clearly there's a trade-off between intelligence and speed. This is what ChatGPT-5 was trying to solve behind the scenes.
**Sebastian Raschka:** 我有一个好玩的例子——我需要在录播客出发前尽快得到答案。家里有一个本地 GPU,我想跑一个长的 RL 实验。通常不在家的时候我也会拔掉一些插头——你不在家的时候不想让东西一直插着。然后我不小心把 GPU 的电源拔了。我老婆已经在车里了,然后就——"哦,糟糕。"基本上我需要尽快得到一个 Bash 脚本来跑不同的实验和评估。这些东西我知道怎么做,我学过 Bash 终端,但那个时刻我就是需要 10 秒钟给我命令。
**Sebastian Raschka:** It's like, do people actually want intelligence, the broad public, or do they want speed? - I think it's a nice variety, or the option to have a toggle th ere. I mean, for my personal usage, most of the time when I look something up, I use ChatGPT to ask a quick question, get the information I wanted fas t. For most daily tasks, I use the quick model. Nowadays, I think the auto mode is pretty good where you don't have to specifically say thinking or no n-thinking.
**Lex Fridman:** 这个场景太搞笑了——老婆在车里等着……你得跑过去拔 GPU。你得生成一个 Bash 脚本。这听起来像一部电影——碟中谍。
**Lex Fridman:** Then again, I also sometimes want the pro mode. Very often what I do is, when I have something written, I put it into ChatGPT and say, "He y, do a very thorough check. Are all my references correct? Are all my thoughts correct?
**Sebastian Raschka:** 所以我用了非思考的最快模型。它给了我那个把不同脚本串在一起的 Bash 命令,然后有个 tee 的东西把输出导到日志文件。当时我就是着急,脑子里想不起来这些,虽然我自己也能想出来。
**Sebastian Raschka:** Did I make any formatting mistakes and are the figure numbers wrong?" Or something like that. And I don't need that right away. I finish my stuff, maybe have dinner, let it run, come back and go through this. I think this is where it's important to have this option.
**Nathan Lambert:** 我用 Gemini 干这种事。所以我所有的信息搜索用思考模式,然后快速的东西或者有时候可以用 Google 搜到的东西用 Gemini——它擅长解释东西,我相信它有这种背景知识,而且很简单。Gemini app 也变好了很多。适合做这类事情。然后代码和任何哲学性的讨论,我用 Claude Opus 4.5。也永远是扩展思考模式。扩展思考和推理时间扩展就是让模型稍微更聪明的一种方式。我会永远倾向于这边,因为进步非常快,你不知道什么时候会解锁一个新的用例。然后有时候用 Grok 来获取实时信息,或者在 AI Twitter 上找我知道自己看到过但需要翻出来的东西。虽然 Grok 4 出来的时候——Grok 4 SuperGrok Heavy,就是那个 Pro 版本——其实非常好,我挺印象深刻的,但后来就——肌肉记忆嘛——ChatGPT app 一直开着就忘了。所以我用很多不同的东西。
**Nathan Lambert:** I would go crazy if for each query I would have to wait 30 minutes or 10 minutes even. - That' s me. I'm sitting over here losing my mind that you use the router and the non-thinking model. I'm like, "How do you live with that?" That's like my r eaction. I've been heavily on ChatGPT for a while.
**Sebastian Raschka:** 是的。我其实确实用 Grok 4 Heavy 做调试。对于其他模型解决不了的那种硬核调试,我觉得它是最好的。而且……有趣的是你说 ChatGPT 是最好的界面。对我来说,出于同样的原因——但这可能只是惯性——Gemini 对我来说是更好的界面。我觉得因为我爱上了它的"大海捞针"能力。如果我放进去一个上下文很多的东西,但我要找非常具体的信息,确保它全部追踪到,我发现——至少对我来说——Gemini 是最好的。所以有趣的是,有些模型,如果在某一天因为某个特定功能、某个特定查询、那个 prompt 赢得了你的心,你就会觉得"这个模型更好",然后你就会用它一阵子,直到它做了什么蠢事。有一个阈值效应。某个聪明的事让你爱上了它,然后它做了个蠢事,你就说"算了,我去试试 Claude 或 ChatGPT。"之类的。
**Sebastian Raschka:** I never touched ChatGPT-5 non-thinking. I find its tone and then its propensity for errors—it has a higher likelihood of errors. Some of this is from back when OpenAI released o3, which was the first model to do this deep search and find many source s and integrate them for you. I became habituated with that.
**Lex Fridman:** 就是这样:你用它直到它出问题,出了问题你就换 LLM。我觉得这跟我们使用任何东西一样——最爱的文本编辑器、操作系统或浏览器。我是说,有很多选择:Safari、Firefox、Chrome。它们相对类似,但有些边缘情况、你想要的扩展,然后你就换了。但我不觉得有人会把同样的内容输入不同浏览器去比较。你只在出问题的时候才会那样做。所以这是个好观点——用到出问题,然后去探索其他选项。
**Lex Fridman:** So I will only use GPT-5.2 Thinking or Pro when I'm finding any sort of information query for work, whether that's a paper or some code reference that I found. And I will regularly have like five Pro queries going simultaneously, each look ing for one specific paper or feedback on an equation or something. - I have a fun example where I needed the answer as fast as possible for this podc ast before I was going on the trip. like a local GPU running at home and I wanted to run a long RL experiment. And usually I also unplug things becaus e you never know if you're not at home, you don't want things plugged in. And I accidentally unplugged the GPU.
**Nathan Lambert:** 关于长上下文——我以前也是 Gemini 用户——但 GPT-5.2 的发布博客有疯狂的长上下文得分。人们都在说"他们是不是搞了什么算法上的改变?"从 30% 跳到 70%,就一次小模型更新。很难跟踪所有这些东西,但现在我对 GPT-5.2 的长上下文更看好了。就是一个"我到底怎么去测试这个"的无尽战斗。
**Nathan Lambert:** My wife was already in the car and it' s like, "Oh dang." Then basically I wanted as fast as possible a Bash script that runs my different experiments and the evaluation. And it's something I know, I learned how to use the Bash interface or Bash terminal, but in that moment I just needed like 10 seconds, give me the command. - This is a hilarious situation but yeah, so what did you use? - So I did the non-thinking fastest model. It gave me the Bash command to chain different scripts t o each other and then the thing is like you have the tee thing where you want to route this to a log file. Top of my head I was just like in a hurry, I could have thought about it myself. - By the way I don't know if there's a representative case, wife waiting in the car- ... you have to run, you kn ow, unplug the GPU. You have to generate a Bash script. This sounds like a movie, like- Mission Impossible. - I use Gemini for that. So I use thinking for all the information stuff and then Gemini for fast things or stuff that I could sometimes Google, which is like it's good at explaining things an d I trust that it has this kind of background of knowledge and it's simple. And the Gemini app has gotten a lot better and- It's good for those sorts of things. And then for code and any sort of philosophical discussion, I use Claude Opus 4.5.
**Lex Fridman:** 有趣的是我们都没有从使用角度谈到中国模型。这说明什么?是中国模型不够好,还是我们只是偏见太重、太以美国为中心了?
**Lex Fridman:** Also always with extended thinking. Extended thinking an d inference time scaling is just a way to make the models marginally smarter. And I will always err on that side when the progress is very high becaus e you don't know when that'll unlock a new use case. And then sometimes use Grok for real-time information or finding something on AI Twitter that I k new I saw and I need to dig up and I just fixated on.
**Sebastian Raschka:** 我觉得目前模型和平台之间存在差距。开放模型更多是以开放权重闻名,而不是平台。
**Sebastian Raschka:** Although when Grok 4 came out, the Grok 4 SuperGrok Heavy, which was like their pro variant was actually very good and I was pretty impressed with it, and then it just kind of like muscle memory lost track of it with having the ChatGPT app open. So I use many different things. - Yeah. I actually do use Grok 4 Heavy for debugging. For like hardcore debugging that the other ones can't solve, I f ind that it's the best at.
**Nathan Lambert:** 很多公司会以非常低的成本卖给你开放模型的推理服务。用 OpenRouter 很容易看多模型的东西。你可以在 Perplexity 上跑 DeepSeek。坐在这里,我们都在说"我们持续使用 OpenAI GPT-5 Pro"。我们都愿意为边际智能增益付费。美国的这些模型在输出方面更好。我觉得问题是,今年以及未来几年它们能保持更好吗?只要它们更好,我就愿意付费。还有分析显示,中国模型的服务方式——你可以说这是因为出口管制——是每个副本用的 GPU 更少,这导致它们更慢、有不同的错误。如果速度和智能对你作为美国用户都有利,很多用户会选择这些。我觉得这会刺激中国公司想要以其他方式竞争——要么免费,要么成本大幅降低,或者在产品形态上激发创造力,这对生态系统是好事。但简单地说:美国模型目前更好,我们就用它们。我试过那些开放模型,感觉是"有趣,但我不会回头。"
**Nathan Lambert:** And... it's interesting 'cause you say ChatGPT is the best interface. For me, for that same reason, but this could be just momentum- Gemini is the better interface for me. I think because I fell in love with their best needle in the haystack. If I ever put something that h as a lot of context but I'm looking for very specific kinds of information to make sure it tracks all of it, I find at least that Gemini for me has be en the best.
**Lex Fridman:** 我们还没怎么提到编程。这是另一个很多人深切关心的用例。我基本上一半一半地用 Cursor 和 Claude Code,因为它们——是根本不同的体验,但都有用。你们呢……你们写不少代码,所以你们用什么?目前的氛围如何?
**Lex Fridman:** So it's funny with some of these models, if they win your heart over- for one particular feature on one particular day, for that particul ar query, that prompt, you're like, "This model's better." And so you'll just stick with it for a bit until it does something really dumb. There's lik e a threshold effect. Some smart thing and then you fall in love with it and then it does some dumb thing and you're like, "You know what? I'm gonna s witch and try Claude or ChatGPT." And all that kind of stuff. - This is exactly it: you use it until it breaks, until you have a problem, and then you change the LLM.
**Sebastian Raschka:** 我用 VS Code 的 Codeium 插件。你知道的,很方便。就像一个插件,然后它是一个可以访问你仓库的聊天界面。我知道 Claude Code 有点不同——它更 agentic(自主性更强)。它触碰更多东西。它替你做整个项目。我还没到对此感到舒适的阶段,因为可能我是控制狂,但我仍然想看看发生了什么。Codeium 现在对我来说是一个甜蜜点——它在帮助我,但没有完全接管。
**Sebastian Raschka:** And I think it's the same as how we use anything, like our favorite text editor, operating systems, or the browser. I mean, there are many options: Safari, Firefox, Chrome. They're relatively similar, but then there are edge cases, extensions you want, and then you switch. But I don 't think anyone types the same thing into different browsers and compares them.
**Lex Fridman:** 我应该提一下,我使用 Claude Code 的原因之一是为了培养用英语编程的技能。这种体验从根本上是不同的。相比于微观管理代码生成过程的细节,看 diff——如果你用 Cursor 作为 IDE 的话——修改、调整,深度阅读和理解代码的过程……相比之下,Claude Code 是在这个设计空间里思考,在宏观层面引导它,我觉得这是思考编程过程的另一种方式。还有,我们应该说 Claude Code 似乎以某种方式更好地利用了 Claude Opus 4.5。
**Lex Fridman:** You only do that when something breaks. So that's a good point. You us e it until it breaks, then you explore other options. - On the long context thing, I was also a Gemini user, but the GPT-5.2 release blog had crazy lo ng context scores. People were like, "Did they just figure out some algorithmic change?" It went from 30% to 70% in this minor model update.
**Nathan Lambert:** 这是人们可以做的一个很好的并排对比。你可以同时打开 Claude Code、Cursor 和 VS Code,在所有这些上选择相同的模型,然后问问题。非常有趣——Claude Code 在那个领域好得多,令人惊叹。
**Nathan Lambert:** It's very hard to keep track of all of these things, but now I look more favorably at GPT-5.2's long context. So it's just like, "How do I actually get to test ing this?" It's a never-ending battle. - Well, it's interesting that none of us talked about the Chinese models from a usage perspective. What does th at say? Does it mean the Chinese models are not as good, or are we just very biased and US-focused? - I think currently there's a discrepancy between the model and the platform.
**Lex Fridman:** 好的。我们应该说你们两位在多个方面都是真正厉害的:研究者、程序员、教育者、Twitter 博主。在书这方面也是。Nathan 很快应该会有一本 RLHF 的书出来。
**Lex Fridman:** The open models are more known for the open weights, not the platform yet. known for the open weights, not their platform yet. - Many companies will sell you open-model inference at a very low cost. With OpenRouter, it's easy to look at multi-model things. You can run Dee pSeek on Perplexity. Sitting here, we're like, "We use OpenAI GPT-5 Pro consistently." We're all willing to pay for the marginal intelligence gain.
**Nathan Lambert:** 现在可以预订了,还有一个完整的数字预印版。我只是在把它做得更漂亮、更有条理,准备出实体版——这也是我做这件事的很大原因,因为在我们的生活如此数字化的时代,创造出你认为在物理形式上是卓越的东西,是一种乐趣。
**Nathan Lambert:** Th ese models from the US are better in terms of the outputs. I think the question is, will they stay better for this year and for years to come? As long as they're better, I'm gonna pay for them. There's also analysis showing that the way the Chinese models are served—you could argue this is due to ex port controls— is that they use fewer GPUs per replica, which makes them slower and have different errors.
**Lex Fridman:** 我在这里用一下 Perplexity——Sebastian Raschka 是一位机器学习研究者和作者,因多本有影响力的书籍而闻名。我想提到的几本——也是我强烈推荐的——《从零构建大语言模型》和新书《从零构建推理模型》。所以我对此非常兴奋。从零构建东西是最强大的学习方式之一。
**Lex Fridman:** If speed and intelligence are in your favor as a user, in the US, a lot of users will go for this. And I think that will spur these Chinese companies to want to compete in other ways, whether i t's free or substantially lower costs, or it'll breed creativity in terms of offerings, which is good for the ecosystem. But the simple thing is: the US models are currently better, and we use them. I tried these other open models, and I'm like, "Fun, but I don't go back." models, and I'm like, "Fun , but not gonna...
**Sebastian Raschka:** 说实话,从零构建一个 LLM 非常有趣。要学的东西也很多。就像你说的,这可能是真正理解某样东西如何运作的最佳方式。因为你可以看图示,但图示可能有错误。你可以看概念和解释,但你可能理解错。但如果有代码,而且代码能跑,你就知道它是对的。我的意思是,不会有误解。它是精确的。否则它就跑不起来。我觉得这就是编程的美——它不撒谎。它基本上就是数学。所以即使是数学,我觉得书里可能有你永远不会注意到的错误。因为你读书的时候并不会去运行那个数学推导,你无法验证。而代码好的地方就是你可以验证。
**Sebastian Raschka:** I don't go back to it." - We didn't really mention programming. That's another use case that a lot of people deeply care about. I u se basically half-and-half Cursor and Claude Code, because they're... I fundamentally different experiences and both are useful.
**Lex Fridman:** 是的,我同意你关于《从零构建 LLM》这本书的看法。屏蔽其他一切——互联网什么的——只专注于书本,很好。不过你知道,我读了好几本历史书——总感觉没那么孤独。确实更有乐趣。比如在编程方面,我觉得跟 LLM 一起编程真的更有趣。我也觉得跟 LLM 一起阅读真的更有趣。但你说得对,干扰应该最小化。所以你用 LLM 基本上来丰富体验,也许增加更多上下文。我就觉得用 LLM 时我的"顿悟时刻"出现的频率真的很高。
**Lex Fridman:** What do you guys... Y ou program quite a bit, so what do you use? What's the current vibe? - So, I use the Codeium plugin for VS Code. You know, it's very convenient.
**Sebastian Raschka:** 百分之百。我也想纠正一下自己:我不是在建议不使用 LLM。我建议做多轮。比如第一轮纯离线、专注模式,然后之后……我也会做笔记,但我——我试着抵制立刻去查的冲动。我做第二轮。这样更有结构。有时候答案在后面的章节里就有,但有时候让它沉淀下来、自己想想也有帮助。其他人有不同的偏好。我强烈推荐读书时使用 LLM。对我来说,它不是第一步,而是第二轮。
**Sebastian Raschka:** It's just like a plugin, and then it's a chat interface that has access to your repository. I know that Claude Code is, I think, a bit different. It is a b it more agentic. It touches more things.
**Lex Fridman:** 我的建议恰好相反。我喜欢一开始就用 LLM 来铺设我即将进入的整个世界的全貌。但我尽量避免从 LLM 点出去进入 Twitter 和博客的世界,因为那样你就掉进了兔子洞。你在读某人的观点。某个话题在吵架,突然你就到了互联网和 Reddit 的领域。但如果你纯粹让 LLM 给你这件事为什么重要、大图景的想法是什么的上下文……有时候书在这方面做得好,但不总是。
**Lex Fridman:** It does the whole project for you. I'm not quite there yet where I'm comfortable with that because maybe I'm a control freak, but I still would like to see a bit what's going on. And Codeium is kind of, right now, for me, the sweet spot where it is helping me , but it is not taking completely over. - I should mention, one of the reasons I do use Claude Code is to build the skill of programming with English. I mean, the experience is fundamentally different.
**Nathan Lambert:** 这就是为什么我喜欢 ChatGPT app——因为它给了 AI 在你电脑上一个"家",你可以专注于它,而不是成为我那堆互联网选项中的又一个标签页。我觉得 Claude Code 在让这件事成为一种乐趣方面做得很好——作为产品设计,它是一个界面,你的 AI 从那里出发去做事。它和 Codex 之间有一种无形的差别——感觉温暖而有吸引力。而 OpenAI 的 Codex 常常可以做得一样好,但就是——感觉边边角角有点粗糙。Claude Code 让从零构建东西变得有趣——你相信它会做出什么来。当然这对网站、刷新工具之类的东西很好——我就这么用它——或者数据分析。在我的博客上,我们抓取 Hugging Face 的数据,所以我们保存着每个数据集和模型的下载量随时间的变化。Claude 就说"好的,我搞定了那些数据,没问题。"我就想"这要是我自己做得花好几天。"然后我有足够的情境意识来判断"好,这些趋势显然合理。"你可以检查。但这就是一个很棒的界面——你可以有一个中间人,不必亲自做那种底层的苦活来维护不同的网页项目。
**Nathan Lambert:** You're... As opposed to micromanaging the details of the process of the generation of the code, an d looking at the diff, which you can in Cursor if that's the IDE you use, and in changing, altering. Looking and reading the code and understanding th e code deeply as you progress, versus just thinking in this design space and just guiding it at this macro level, which I think is another way of thin king about the programming process. Also, we should say that Claude Code just seems to be somehow a better utilization of Claude Opus 4.5. - It's a go od side-by-side for people to do.
**Lex Fridman:** 好。那我们刚才谈了很多闭源权重模型。来谈谈开放模型吧。跟我说说开放 LLM 模型的格局。哪些有趣?哪些让你印象深刻,为什么?我们已经提到了 DeepSeek R1。
**Lex Fridman:** You can have Claude Code open, you can have Cursor open, you can have VS Code open, and you can select the same mode ls on all of them— ...and ask questions, and it's very interesting. Claude Code is way better in that domain. It's remarkable. - All right, we should say that both of you are legit on multiple fronts: researchers, programmers, educators, Tweeters. And on the book front, too. So Nathan, at some point soon, hopefully has an RLHF book coming out. - It's available for preorder, and there's a full digital preprint. I'm just making it pretty and better organized for the physical thing, which is a lot of why I do it, because it's fun to create things that you think are excellent in the physical form when so much of our life is digital. - I should say, going to Perplexity here, Sebastian Raschka is a machine learning researcher and author known for several influential books. A couple of them that I wanted to mention—which is a book I highly recommend—Build a Large Language Model from Scratch, an d the new one, Build a Reasoning Model from Scratch. So, I'm really excited about that. Building stuff from scratch is one of the most powerful ways o f learning. - Honestly, building an LLM from scratch is a lot of fun.
**Nathan Lambert:** 你想看看我们不看笔记能报出多少个?
**Nathan Lambert:** It's also a lot to learn. And like you said, it's probably the best way to learn how something really works, 'cause you can look at figures, but figures can have mistakes. You can look at concepts and explanations, but you might m isunderstand them. But if there is code, and the code works, you know it's correct.
**Lex Fridman:** 好啊。
**Lex Fridman:** I mean, there's no misunderstanding. It's precise. Otherwise, it w ouldn't work. And I think that's the beauty behind coding.
**Nathan Lambert:** DeepSeek、Kimi、MiniMax、Z.ai、Moonshot——我们全是中国公司。
**Nathan Lambert:** It doesn't lie. It's math, basically. So, even though with math, I think you can have mista kes in a book you would never notice. Because you are not running the math when you are reading the book, you can't verify this.
**Sebastian Raschka:** 加上 Mistral AI、Gemma……gpt-oss,OpenAI 的开放权重模型。其实 NVIDIA 有一个很酷的——Nemotron 3。年底有很多东西。Qwen 可能是那个——
**Sebastian Raschka:** And with code, what's nice is you can verify it. - Yeah, I agree with you about the Build an LLM from Scratch book. It's nice to tune out everything else, the internet and so on, and just focus on the book. But, you know, I read several history books. It's just less lonely somehow.
**Nathan Lambert:** 哦对。Qwen 是我本来要说的那个显而易见的名字。中国的至少能列出 10 个,西方的也至少 10 个。我觉得 OpenAI 发布了他们自 GPT-2 以来的第一个开放模型。我在写 OpenAI 开放模型发布的文章时,他们说"别忘了 GPT-2"——我觉得很有趣,因为那真是完全不同的时代。但 gpt-oss-120b 确实是一个非常强的模型,在某些方面做得比其他模型好。出于私心,我来推广一些西方公司——美国和欧洲的——拥有完全开放模型的。我在 Allen Institute for AI 工作,我们一直在构建 OLMo,发布数据和代码。现在终于有了真正的竞争——有其他人也在尝试发布一切,让别人可以训练这些模型。有 Institute for Foundation Models/LM360,他们有各种类型的 K2 模型。Apertus 是一个瑞士的研究联盟。Hugging Face 有 SmolLM,非常受欢迎。NVIDIA 的 Nemotron 3 也开始发布数据了。还有 Stanford 的 Martini 社区项目,基本上是建了一个流水线,人们可以开 GitHub issue、实现一个新想法,然后在一个稳定的语言模型技术栈上运行。这个领域——这个清单——在 2024 年要短得多。所以让更多人参与进来去理解语言模型,是一件好事。中国在这方面没有类似的东西。我说的时候顺便提一下——中国的开放语言模型往往大得多,作为 MoE 模型有更高的峰值性能。而我们喜欢的很多东西——不管是 Gemma 还是 Nemotron——往往是美国这边的小模型,不过这正在改变。Mistral Large 3 在 12 月发布了——一个巨大的 MoE 模型,架构和 DeepSeek 非常相似。然后一个初创公司 RCAI——还有 Nemotron 和 NVIDIA 预告了远超 1000 亿参数的 MoE 模型——大约 4000 亿参数的范围——预计在 2026 年第一季度。所以我觉得今年在人们使用中国和美国开放模型方面的这种平衡会发生变化,我个人对此非常期待。
**Nathan Lambert:** It's really more fun. Like for example , on the programming front, I think it's genuinely more fun to program with an LLM. And I think it's genuinely more fun to read with an LLM. But you'r e right.
**Lex Fridman:** 首先,能列出这么多名字,太牛了。你有没有提到 LLaMA?
**Lex Fridman:** That distraction should be minimized. So you use the LLM to basically enrich the experience, maybe add more context. I just find the rate of aha moments for me is really high with LLMs. - 100%. I also want to correct myself: I'm not suggesting not to use LLMs.
**Nathan Lambert:** 没有。
**Nathan Lambert:** I suggest doing it in multiple passes. Like, one pass just offline, focus mode, and then after that... I mean, I also take notes, but I, I try to resist the urge to immediately loo k things up. I do a second pass.
**Lex Fridman:** 我觉得……
**Lex Fridman:** It's just more structured this way. Sometimes things are answered in the chapter, but sometimes also it just helps to let it sink in and think about it. Other people have different preferences. I highly recommend using LLMs when reading books.
**Sebastian Raschka:** RIP。
**Sebastian Raschka:** For me, it's not the fi rst thing to do; it's the second pass. - My recommendation is the opposite. I like to use the LLM at the beginning to lay out the full context of what is this world that I'm now stepping into? But I try to avoid clicking out of the LLM into the world of Twitter and blogs, because then you're down th is rabbit hole. You're reading somebody's opinion.
**Nathan Lambert:** 不是故意的。
**Nathan Lambert:** There's a flame war about a particular topic and all of a sudden you're in the realm of the interne t and Reddit and so on. But if you're purely letting the LLM give you the context of why this matters, what are the big picture ideas... sometimes boo ks are good at doing that, but not always. - This is why I like the ChatGPT app, because it gives the AI a home on your computer where you can focus o n it, rather than just being another tab in my mess of internet options. And I think Claude Code does a good job of making that a joy, where it seems very engaging as a product design to be an interface that your AI will then go out into the world. It's something that is intangible between it and Co dex; it just feels warm and engaging, where Codex can often be as good from OpenAI, but it just, feel a little bit rough around the edges.
**Lex Fridman:** RIP LLaMA。好的。你能提到一些特别突出的有趣模型吗?你提到 Qwen 3 显然是一个亮点。
**Lex Fridman:** Whereas Cla ude Code makes it fun to build things from scratch, where you just trust that it'll make something. Obviously this is good for websites and kind of re freshing tooling and stuff like this, which I use it for, or data analysis. For my On my blog, we scrape Hugging Face so we keep download numbers for every dataset and model. over time, so we have them. And Claude was just like, "Yeah, I've made use of that data, no problem." And I was like, "That w ould've taken me days." And then I have enough situational awareness to be like, "Okay, these trends obviously make sense." You can check things.
**Sebastian Raschka:** 我会说这一年几乎被 DeepSeek V3 和 R1 包书了。然后在另一头,12 月有 DeepSeek-V3.2。我喜欢它们的原因是它们在架构上总有一些其他模型没有的有趣调整。但如果你想选熟悉但性能真的好的,Qwen 3 和——就像 Nathan 说的——gpt-oss-120b。我觉得它有趣的地方在于,它算是第一个真正以工具使用(tool use)为核心来训练的公开或开放权重模型,我确实认为这是一种范式转变,而且生态系统还没完全准备好。所谓工具使用,我是说 LLM 能做网络搜索或调用 Python 解释器。我确实认为它是一个突破,因为这是一个巨大的解锁。LLM 最常见的抱怨之一就是幻觉(hallucination),对吧?在我看来,解决幻觉最好的方法之一就是不要总是试图记住信息或编造。对于数学,为什么不用计算器或 Python?如果我问 LLM"1998 年足球世界杯谁赢了?",它不用试着去记忆,而是可以去搜索。我觉得现在大多还是 Google 搜索。所以 ChatGPT 和 gpt-oss-120b 会做一个工具调用到 Google,也许找到 FIFA 网站。找到——哦,是法国。它会可靠地给你那个信息,而不是试图记住。所以我觉得这是一个巨大的解锁,目前在开源、开放权重生态系统中还没有被充分利用。很多人不用工具调用模式,我认为首先是信任问题。你不想在自己的电脑上跑这个——它可以访问工具,可能擦掉你的硬盘什么的。所以你可能想把它容器化。但我确实认为,对未来几年来说,拥有这种能力是一个非常重要的步骤。
**Sebastian Raschka:** But that's just a wonderful interface where you can have an intermediary and not have to do the kind of awful low-level work that you would have to do to maintain different web projects. - All right. So we just talked about a bunch of the closed-weight models. Let's talk about the open ones. Tell me abo ut the landscape of open LLM models. Which are interesting? Which stand out to you and why? We already mentioned DeepSeek R1. - Do you wanna see how m any we can name off the top of our head? - Yeah, without looking at notes. - DeepSeek, Kimi, MiniMax, Z.ai, Moonshot. We're just going Chinese. - Let' s throw in Mistral AI, Gemma... ...gpt-oss, the open weight model by OpenAI. Actually, NVIDIA had a really cool one, Nemotron 3. There, there's a lot of stuff especially at the end of the year. Qwen might be the one— - Oh, yeah.
**Lex Fridman:** 先补几个快速的点。首先,谢谢你定义了"工具使用"是什么意思。我觉得这在我们讨论的概念中是一个很好的做法。即使是像 MoE 这样比较成熟的概念,你也得说它的意思是混合专家模型(Mixture of Experts),而且你得让人们对它建立直觉——它是什么意思、实际上怎么用、有哪些不同的变体。那么,有这么多开放模型涌现,你的直觉是什么?这意味着什么?
**Lex Fridman:** Qwen was the obvious name I was gonna say. You can get at least 10 Chin ese and at least 10 Western. I think that OpenAI released their first open model— ...since GPT-2. When I was writing about OpenAI's open model release , they were like, "Don't forget about GPT-2," which I thought was really funny 'cause it's just such a different time.
**Nathan Lambert:** 如果你在发布一个开放模型,你最首要的目标就是让人们使用它。之后才是透明度和信任之类的。我觉得看中国的话,最大的原因是他们想让全世界的人使用这些模型。如果你看美国以外的地方,很多人不会为软件付费,但他们可能有计算资源,你可以把模型放上去跑。也可能有些数据你不想发送到云端。所以第一要务是让那些没有模型就无法使用 AI 的人用上 AI——用你的 AI。
**Nathan Lambert:** But gpt-oss-120b is actually a very strong model and does some things that other models don't do very well. Selfishly, I'll promote a bunch of Western companies in the US and Europe that have these fully open models. I work at the Allen Institute for AI, where we've been building OLMo, which releases data and code. And now we hav e actual competition for people that are trying to release everything so that others can train these models.
**Lex Fridman:** 我们应该明确说明——我们一直在讨论中国模型和开放权重模型。很多时候,它们的运行方式是本地的。所以不是说你在把数据发到中国或者发到硅谷的谁那里去。
**Lex Fridman:** There's the Institute for Foundation Mode ls/LM360, which has had their K2 models of various types. Apertus is a Swiss research consortium. Hugging Face has SmolLM, which is very popular. And NVIDIA's Nemotron 3 has started releasing data as well.
**Nathan Lambert:** 很多美国初创公司靠托管这些中国模型来赚钱,卖给客户——这叫卖 token,意思是有人调用模型来完成某项工作。我觉得另一个原因——对于像 OpenAI 这样的美国公司来说——他们真的太缺 GPU 了。他们已经到了 GPU 的极限。每次发布,他们总在说"我们的 GPU 吃不消了"。我记得在 gpt-oss-120b 的某次发布会上,Sam Altman 说"哦,我们发布这个是因为我们可以用你们的 GPU,不用我们自己的。而且 OpenAI 还是能获得分发。"这是另一个非常现实的考量,因为这对他们几乎零成本。
**Nathan Lambert:** And then Stanford's Martini Community Project, which is kind of making it so there's a pipelin e for people to open a GitHub issue and implement a new idea and then have it run in a stable language modeling stack. This space, that list was way s maller in 2024— ...so I think it was just AI2. So it's a great thing for more people to get involved and to understand language models, which doesn't really have a Chinese analog. While I'm talking, I'll say that the Chinese open language models tend to be much bigger, and that gives them higher pea k performance as MoEs, where a lot of these things that we like a lot, whether it was Gemma and Nemotron, have tended to be smaller models from the US , which is starting to change from the US and Europe.
**Sebastian Raschka:** 对于用户来说,也有一些人就像使用 ChatGPT 一样在本地使用模型。但我觉得对公司来说,拥有这些模型也是一个巨大的解锁,因为你可以定制它们、训练它们、添加 post-training、加入更多数据。比如专门化为法律、医疗模型之类的。你提到了 Llama——来自中国的开放权重模型的吸引力在于,它们的许可证甚至更友好。我觉得它们就是不受限制的开源许可证。如果我们用 Llama 或 Gemma 之类的,是有一些附加条件的。我记得好像是用户数量有个上限,超过了你就得向 Meta 报告你的财务状况之类的。虽然它是一个免费模型,但有附加条件。而人们确实喜欢没有附加条件的东西。所以我认为这也是——除了性能之外——中国的开放权重模型如此受欢迎的原因之一,因为你可以直接用。没有什么陷阱。
**Sebastian Raschka:** Mistral Large 3 came out, which was a giant MoE model, very similar to DeepSeek architecture in December. And then a startup, RCAI, and both Nemotron and NVIDIA have teased MoE models way bigger than 100 billion parameters- like this 400 billion parameter range coming in this Q1 2026 timeline. So I think this kind of balance is set to change this year in terms of what people are using the Chin ese versus US open models for, which I'm personally going to be very excited to watch. - First of all, huge props for being able to name so many of th ese. Did you actually name LLaMA? - No. - I feel like ... - RIP. - This was not on purpose. - RIP LLaMA.
**Nathan Lambert:** 生态系统在这方面已经变好了,但主要是因为这些新的提供者提供了如此开放的许可证。有趣的是你刚才打开 Perplexity 说"Kimi K2 Thinking——在美国托管。"这就是一个精确的……我从没见过这个,但这正是我们讨论的精确例子——人们对此敏感。Kimi K2 Thinking 和 Kimi K2 是很受欢迎的模型。人们说它在创意写作方面非常好,在一些软件方面也是。就是这些不同模型的小特点让人们喜欢。
**Nathan Lambert:** All right. Can you mention some interesting m odels that stand out? You mentioned Qwen 3 is obviously a standout. - So I would say the year's almost bookended by both DeepSeek V3 and R1. And then on the other hand, in December, DeepSeek-V3.2.
**Lex Fridman:** 这些模型中有哪些探索了什么有趣的想法,你能说说其中对你特别有趣的?
**Lex Fridman:** Because what I like about those is they always have an interesting architecture tweak that others don't have. But otherwise, if you want to go with the familiar but really good performance, Qwen 3 and, like Nathan said, also gpt-oss-120b. And I think wh at's interesting about it is it's kind of like the first public or open weight model that was really trained with tool use in mind, which I do think i s kind of a paradigm shift where the ecosystem was not quite ready for it. By tool use, I mean that the LLM is able to do a web search or to call a Py thon interpreter.
**Sebastian Raschka:** 也许我们可以按时间顺序来。如果只看 2025 年——当然有 DeepSeek R1,2025 年 1 月出来的。不过它是基于 DeepSeek-V3 的,那个是在 2024 年 12 月发布的。在架构方面有多个有趣的东西。令人着迷的是……这也是我做从零编码项目的方式——你仍然可以从 GPT-2 开始,然后往上面加东西,把它变成另一个模型。所以它们都还是同一条血脉上的。这些模型之间的关系非常密切。但脱口而出的话——DeepSeek 独特的地方是混合专家模型(Mixture of Experts)。不是说 MoE 是他们发明的——也许我们可以多谈谈 MoE 的含义——但先列一下这些东西,然后再深入。Mixture of Experts,然后他们还有多头潜在注意力(Multi-head Latent Attention),这是对注意力机制的一个调整。我会说这是这些开放权重模型之间主要的区分因素——对注意力机制的不同调整,使推理或 KV 缓存大小……我们等下也可以定义 KV 缓存……使长上下文更经济化,缩小 KV 缓存大小。那我们能做什么调整?大多数都集中在注意力机制上。DeepSeek 有 Multi-head Latent Attention。有分组查询注意力(Group Query Attention),仍然非常流行——不是任何这些模型发明的,已经有好几年了。但那是另一个选择。滑动窗口注意力(Sliding window attention)——我记得 OLMo 3 用了它。所以有这些不同的调整使模型各不相同。但除此之外,我曾经把它们全放在一篇文章里做了比较。它们——出乎意料地——非常相似。只是中间 transformer 块的重复次数不同,还有一些人们调节的小旋钮。但好的地方是——不管怎样都能工作。你可以调整东西。你可以移动归一化层的位置来获得一些性能提升。OLMo 在消融研究(ablation studies)方面一直做得很好——展示移动某些东西实际上对模型有什么影响。消融研究就是——让它变好了还是变差了?但实现一个 transformer 有太多太多方式,而且都能让它正常工作。目前仍然流行的大想法是 Mixture of Experts、多头潜在注意力、滑动窗口注意力、分组查询注意力。然后到年底,我们看到了一个焦点——让注意力机制在推理 token 预测时线性扩展。比如有 Qwen2-VL,加了一个门控 delta 网络。这有点受状态空间模型(State space models)启发——你有一个固定的状态不断更新。但它本质上让注意力更便宜,或者用一个更便宜的操作替代了注意力。
**Sebastian Raschka:** And I do think it's a standout because it's a huge unlock. Because one of the most common complaints about LLMs are, for example, ha llucinations, right? And so, in my opinion, one of the best ways to solve hallucinations is to not try to always remember information or make things u p. For math, why not use a calculator app or Python?
**Lex Fridman:** 在这里退一步谈谈 transformer 架构大的全貌也许有用。
**Lex Fridman:** If I ask the LLM, "Who won the soccer World Cup in 1998?" instead of just trying to memorize, it could go do a search. I think mostly it's still a Google search. So ChatGPT and gpt-oss-120b, they would do a tool call to Google, maybe find the FIFA website. Find, okay, it was France.
**Sebastian Raschka:** 好,也许我们应该从 GPT-2 架构开始。这个 transformer 源自"Attention Is All You Need"论文。"Attention Is All You Need"论文有一个 transformer 架构,分为两部分——编码器(encoder)和解码器(decoder)。GPT 只专注于解码器部分。它本质上仍然是一个神经网络,内部有这个注意力机制。你一次预测一个 token。你把它通过一个 embedding 层。有 transformer 块。Transformer 块里有注意力模块和一个全连接层。中间有一些归一化层。但它本质上就是带有注意力机制的神经网络层。从 GPT-2 到 gpt-oss-120b,有一些变化——比如 Mixture of Experts 层。不是 gpt-oss-120b 发明的,有好几年了。但它本质上是一个调整——让模型更大而不在每次前向传播中消耗更多计算。有一个全连接层——如果听众熟悉多层感知器的话——你可以把它想象成 transformer 内部的一个小的多层感知器、一个全连接的神经网络层。它非常昂贵,因为是全连接的。如果你有一千个输入和一千个输出,那就是一百万个连接。这是 transformer 中非常昂贵的一部分。然后想法就是把它扩展成多个前馈网络。所以不再只有一个——假设你有 256 个——但那样会贵得多,因为现在你有 256 个。但你不会同时使用所有的。所以你现在有一个路由器(router)说"好,基于这个输入 token,使用这个全连接网络会很有用。"在这个语境中它叫做"专家"(expert)。所以 Mixture of Experts 的意思是你有多个专家。取决于你的输入是什么——比如更偏数学——它会使用不同的专家,相比于——比如说——把英文翻译成西班牙语。它可能会咨询不同的专家。并不是那么界限分明——不能说"好,这个只是数学专家,那个是西班牙语专家。"它更模糊一些。但这个想法本质上就是你把更多知识打包进网络,但不是所有知识都在所有时候被使用。那会非常浪费。所以在 token 生成过程中,你更有选择性。有一个路由器选择哪些 token 应该去哪个专家。这增加了更多复杂性。训练更难。有很多可能出错的地方——比如坍缩(collapse)之类的。所以我觉得这也是为什么 OLMo 3 仍然使用密集(dense)架构……我是说,有 OLMo 的 MoE 模型,但密集模型——"密集"的意思是……好,这又是术语。密集和稀疏之间有一个区分。Mixture of Experts 被认为是稀疏的,因为我们有很多专家,但只有少数是活跃的——所以叫稀疏。密集则相反——你只有一个全连接模块,而且它总是被使用。
**Sebastian Raschka:** It would get you that information reliably instead of just trying to memorize it. So I think it's a huge unlock w hich right now is not fully utilized yet by the open-source, open-weight ecosystem. A lot of people don't use tool call modes because I think, first, it's a trust thing. You don't want to run this on your computer where it has access to tools, could wipe your hard drive or whatever.
**Lex Fridman:** 也许这里是讨论 KV 缓存的好时机。但在那之前,甚至再退一步——从根本上说,从 GPT-2 到今天,实现了多少新想法?这些架构真的有多大不同?
**Lex Fridman:** So you want to m aybe containerize that. But I do think that is like a really important step for the upcoming years to have this ability. - So a few quick things. Firs t of all, thank you for defining what you mean by tool use. I think that's a great thing to do in general for the concepts we're talking about.
**Sebastian Raschka:** 拿 Mixture of Experts 来说。gpt-oss-120b 的注意力机制用的是分组查询注意力(Group Query Attention)。所以这是从多头注意力(Multi-Head Attention)到分组查询注意力的一个微调。我们也有……我觉得他们把 LayerNorm 换成了 RMSNorm,但那只是一种不同的归一化,不是什么大变化。就是个微调。非线性激活函数——熟悉深度神经网络的人会知道——就跟把 sigmoid 换成 ReLU 一样。没有从根本上改变网络,只是一个小调整。差不多就这些了。真的没有根本性的不同。还是同一个架构。你可以通过加上这些变化从一个变成另一个。
**Sebastian Raschka:** Even t hings as sort of well-established as MoEs. You have to say that means mixture of experts, and you kind of have to build up an intuition for people wha t that means, how it's actually utilized, what are the different flavors. So what does it mean that there's just such an explosion of open models? Wha t's your intuition? - If you're releasing an open model, you want people to use it, is the first and foremost thing.
**Lex Fridman:** 从根本上说还是同一个架构。
**Lex Fridman:** And then after that comes things like transparency and trust. I think when you look at China, the biggest reason is that they want people around the world to use these models, and I t hink a lot of people will not. If you look outside of the US, a lot of people will not pay for software, but they might have computing resources where you can put a model on it and run it. I think there can also be data that you don't want to send to the cloud.
**Sebastian Raschka:** 是的。比如你刚才提到我的书。书里是 GPT-2 模型,因为它简单,而且很小——大约 1.24 亿参数。但在附赠材料里,我有从零构建 OLMo、从零构建 Gemini 3,还有其他类型的从零模型。我总是从我的 GPT-2 模型开始,然后调整——加上不同的组件,你就从一个变成了另一个。这有点像一个谱系。
**Sebastian Raschka:** So the number one thing is getting peo ple to use models, use AI, or use your AI that might not be able to do it without having access to the model. - I guess we should state explicitly, so we've been talking about these Chinese models and open weight models. Oftentimes, the way they're run is locally. So it's not like you're sending you r data to China or to whoever developed Silicon Valley, or whoever developed the model. - A lot of American startups make money by hosting- ...these m odels from China and selling them. It's called selling tokens, which means somebody will call the model to do some piece of work.
**Lex Fridman:** 你能帮人们建立一个直觉吗?因为你退一步看,AI 世界有如此多的快速进步。同时从根本上说,架构并没有变。那么所有这些动荡、进步到底发生在哪里?增益在哪里?
**Lex Fridman:** I think the other re ason is for US companies like OpenAI. They are so GPU deprived. They're at the limits of the GPUs. Whenever they make a release, they're always talkin g about like, "Our GPUs are hurting." And I think during one of these gpt-oss-120b release sessions, Sam Altman said, "Oh, we're releasing this becaus e we can use your GPUs.
**Sebastian Raschka:** 开发或训练网络有不同的阶段。有预训练(pre-training)。以前 GPT-2 的时候只有预训练。现在你有预训练、中期训练(mid-training)和后训练(post-training)。我认为我们现在处于后训练的重点阶段。预训练如果你用更好、更高质量的数据去扩展,仍然能给你带来优势。但然后我们有 GPT-2 时代没有的能力解锁。比如 ChatGPT 基本上就是一个 GPT-3 模型。GPT-3 在架构上和 GPT-2 是一样的。新的东西是添加了监督微调(supervised fine-tuning)和基于人类反馈的强化学习(RLHF)。所以更多是在算法层面而非架构层面。
**Sebastian Raschka:** We don't have to use our GPUs, and OpenAI can still get distribution out of this," which is another very real thing, because i t doesn't cost them anything. - And for the user, I think also, there are users who just use the model locally how they would use ChatGPT. But also fo r companies I think it's a huge unlock to have these models because you can customize them, you can train them, you can add post-training, add more da ta. Like, specialize them into, let's say, law, medical models, whatever you have. And the appeal, you mentioned Llama, the appeal of the open-weight models from China is that the open-weight models' licenses are even friendlier.
**Nathan Lambert:** 我还想说,系统也变化了很多。如果你听 NVIDIA 的发布会,他们会谈到"你现在可以做 FP8,可以做 FP4。"发生的事情是,这些实验室在弄清如何利用更多算力投入到一个模型里——这让他们训练更快、放入更多数据。然后你可以更快地找到更好的配置。所以你可以看——本质上——"每 GPU 每秒 token 数"作为大规模训练时看的指标。你可以从每秒 10k 提升到 13k,只要开启 FP8 训练——意思是每个参数使用更少的内存。通过保存更少的信息,你做更少的通信,训练更快。所以所有这些系统层面的东西支撑着更快的数据和算法实验。这是一个不断循环的过程——当你看架构完全一样的时候很难描述,但用来训练这些模型的代码库已经截然不同了。而且——GPU 也不同了——你现在训练 gpt-oss-20b 的挂钟时间可能比当年训练 GPT-2 快得多。
**Nathan Lambert:** I think they are just unrestricted open source licenses where if we us e something like Llama or Gemma, there are some strings attached. I think it's like an upper limit in terms of how many users you have. And then if yo u exceed, I don't know, so and so many million users, you have to report your financial situation to, let's say, Meta or something like that. And I th ink while it is a free model, there are strings attached, and people do like things where strings are not attached.
**Sebastian Raschka:** 对。就像你说的,他们在 Mixture of Experts 中有 FP4 优化,可以获得更多吞吐量。但我确实认为,在速度方面这是对的,但它不会给模型带来新能力。只是——我们能把计算做到多粗糙而不影响模型性能?但我确实认为——有替代方案正在出现。文本扩散模型(text diffusion models),一个完全不同的范式。虽然文本扩散模型可能也使用 transformer 架构,但它不是自回归(autoregressive)transformer。还有 Mamba 模型——那是状态空间模型(state space model)。但它们有权衡,目前还没有任何东西取代自回归 transformer 成为最先进的模型。要做最先进的,你还是会选自回归 transformer。但现在有一些替代方案用于更便宜的那一端——做了一些妥协的替代方案。不再只有一种架构了。有小的正在涌现。但如果我们谈最先进的——基本上仍然是 transformer 架构、自回归的、本质上源自 GPT-2。
**Sebastian Raschka:** So I think that's also one of the reasons, besides performance, why the open-weight models from China are so popular, because you can just use them. There's no catch in that sense. - T he ecosystem has gotten better on that front, but mostly downstream of these new providers providing such open licenses. That was funny when you pulle d up Perplexity and said, "Kimi K2 Thinking hosted in the US." Which is just like an exact... I've never seen this, but it's an exact example of what we're talking about where people are sensitive to this.
**Lex Fridman:** 我想这里的大问题是——我们已经谈了很多关于预训练背后架构的内容——扩展定律(scaling laws)在预训练、后训练、推理、上下文大小、数据和合成数据方面是否仍然强劲?
**Lex Fridman:** But Kimi K2 Thinking and Kimi K2 is a model that is very popular. People say that has very goo d creative writing and also in doing some software things. So it's just these little quirks that people pick up on with different models that they lik e. - What are some interesting ideas that some of these models have explored that you can speak to, that are particularly interesting to you? - Maybe we can go chronologically. I mean, there was, of course, DeepSeek.
**Nathan Lambert:** 我想先从扩展定律的技术定义开始——它为所有这些提供了基础。扩展定律是一种幂律关系——你可以把 x 轴想象成算力和数据的组合——它们有点相似——然后 y 轴是在下一个 token 的留存预测准确率。我们说模型是自回归的。就是说如果你保留一组模型没见过的文本,训练的时候它会有多准确?扩展定律的概念出现在人们发现这是一个非常可预测的关系的时候。我认为这个技术定义还在继续成立。然后问题变成——用户从中得到了什么?那就有了更多类型的扩展——OpenAI 的 o1 因为引入推理时间扩展(inference time scaling)而出名。而且不太出名的是——它还展示了你可以扩展强化学习训练,得到这种 log x 轴然后 y 轴线性增长的效果。所以现在有三个轴——传统扩展定律谈的是预训练:你的模型有多大、数据集有多大。然后扩展强化学习——就是你能做多长时间的试错学习——我们后面会更多定义这个。然后是推理时间算力——就是让模型在一个具体问题上生成更多 token。所以我总体上还是乐观的——它们确实都仍然有效——但唾手可得的果实大多已经被摘了,特别是去年在可验证奖励的强化学习(RLVR)上,然后是推理时间扩展——这就是为什么这些模型用起来感觉如此不同。以前你会立刻得到第一个 token。现在它们会花几秒、几分钟甚至几小时生成这些隐藏的思考,然后才给你答案的第一个词。这一切都与推理时间扩展有关——这是模型能力变化方式上的一种奇妙的阶跃函数。它们基本上使这种工具使用和更好的软件工程成为可能——我们刚才谈到的。当我们说"使之成为可能"时,几乎完全是因为 RLVR 训练让模型很容易就学会了这些技能。让模型去学习——如果你看推理过程中模型生成大量 token 的时候——它经常在做的是:试一个工具,看返回了什么。试另一个 API,看返回了什么,看是否解决了问题。所以模型在训练的时候非常快地学会了这样做。最终,这给了一种通用基础——模型可以非常漂亮地在你的仓库中使用 CLI 命令、替你处理 Git、移动文件、组织东西或搜索更多信息。如果一年前我们坐在这里,这些是我们不太会想到模型能做的事情。这就是今年发生的事情,彻底改变了我们对使用 AI 的看法——这种进化——解锁了如此多的价值。但现在——不清楚下一个解锁类似东西的途径是什么。我想……我们后面会谈到持续学习(continual learning),但围绕 AI 的某些领域有很多话题,但没人知道下一个阶跃函数真正何时到来。
**Nathan Lambert:** DeepSeek R1 that came out in January of 2025, if we just focus on 2025. However, th is was based on DeepSeek-V3, which came out the year before in December 2024. There are multiple things on the architecture side. What is fascinating is...
**Lex Fridman:** 你刚才其实说了相当多的东西,而且快速地说了一些深刻的内容。值得拆开来说一说。你说你基本上对每种形式的扩展都乐观。那我们能不能从最初开始?预训练——我们是不是在暗示预训练扩展的唾手可得的果实已经被摘完了?预训练是否已经达到了平台期,还是你对预训练仍然乐观?
**Lex Fridman:** I mean, that's what I do with my from-scratch coding projects. You can still start with GPT-2, and you can add things to that model to make it i nto this other model. So it's all still kind of like the same lineage. It is a very close relationship between those.
**Nathan Lambert:** 预训练已经变得极其昂贵了。我觉得要扩展预训练,同时也意味着你要给用户提供一个非常大的模型。我认为已经粗略确定——像 GPT-4 和类似模型在最大规模时大约有一万亿参数。有很多传闻说它们实际上已经变小了,因为训练变得更高效了。你希望模型更小,因为这样你的服务成本会相应降低。这些模型的训练成本相对于服务成本——服务数亿用户——其实很低。我记得 DeepSeek 有一个著名的数字——按云端市场价大约 500 万美元用于预训练。在 OLMo 3 的论文 2.4 节中,我们详细记录了 GPU 集群用于训练的时间——包括工程问题、多次种子运行——大约 200 万美元租用集群来处理训练模型的所有头疼事。所以这些模型——很多人都能拿到 100 万到 1000 万美元来训练一个模型——但持续服务数百万用户的成本才是真正的数十亿美元的算力。你可以看到,租用 1000 个 GPU 一天就要 10 万美元。这些公司可能有数百万个 GPU。你能看到这些东西光放在那里就要花多少钱。这是一件大事。然后就是——如果扩展确实给了你更好的模型,从财务上值不值?我觉得随着 AI 解决更多引人注目的任务,我们会慢慢推进——比如 Claude Opus 4.5 让 Claude Code 能真正用于各种事情。我——我启动了一个叫 ATOM 的项目——美国真正开放模型(American Truly Open Models)——在 7 月份。那时候真的是一个纯 vibe coding 的网站。我的工作是做各种图表。然后最近几周我回去刷新它——Claude Opus 4.5 对比当时的那个模型——简直碾压了 6 月和 7 月建站时遇到的所有问题。可能是更大的模型吧。有很多因素,但进步还在继续。
**Nathan Lambert:** But top of my head, DeepSeek—wha t was unique there is the Mixture of Experts. Not that they were inventing Mixture of Experts—we can maybe talk a bit more about what Mixture of Exper ts means—but just to list these things first before we dive into detail. Mixture of Experts, but then they also had Multi-head Latent Attention, which is a tweak to the attention mechanism, where this was, I would say, the main distinguishing factor between these open-weight models. Different tweaks to make inference or KV cache size...
**Lex Fridman:** 所以你在说的是扩展定律 y 轴的微妙之处——体验到的方式与基准测试上的实际智能可能不同。但仍然,你对预训练的直觉是——如果你扩展算力规模,模型会变得更好吗?不管财务上是否可行,单纯从定律的角度,你认为模型会变得更聪明吗?
**Lex Fridman:** We can also define KV cache in a few moments, but to kind of make it more economical to have long context, to s hrink the KV cache size. So what are tweaks that we can do? And most of them focused on the attention mechanism. There is Multi-head Latent Attention in DeepSeek.
**Nathan Lambert:** 是的。而且我觉得……有时候 AI 公司的领导人说这话听起来近乎失望,但他们会说"扩展定律已经在 13 个数量级的算力上成立了,它为什么会停?"所以我认为从根本上说,它不太可能停止——只是最终我们可能连测试更大规模都做不到,因为更多算力带来的所有问题。我觉得有很多讨论说 2026 年是超大规模 Blackwell 算力集群——吉瓦级的超级计算设施——上线的一年。这些都是在 2022 年和 2023 年签订的电力和数据中心合同。所以在 ChatGPT 之前或之后不久。花了两到三年的前置时间来建造这些更大的集群来训练模型。虽然显然有巨大的兴趣去建更多的数据中心。这就是人们在说的关键点:这些新集群即将上线。实验室将有更多训练用算力。他们会利用这些。但这不是理所当然的。我看到了这么多进步,我预期会有,我预期模型会稍大一些,我预期……我会说——我们今年会看到 2000 美元的订阅。我们已经看到了 200 美元的订阅。那可能再 10 倍。而这些都是更大的模型带来的——只是多了一点点尖端能力。
**Nathan Lambert:** There is Group Query Attention, which is still very popular. It's not invented by any of those models. It goes back a few years. But that would be the other option.
**Lex Fridman:** 据报道 xAI 将在 2026 年初达到 1 吉瓦规模,年底达到 2 吉瓦。你觉得他们会在扩展定律的背景下如何利用这些?大量用于推理?大量用于训练?
**Lex Fridman:** Sliding window attention—I think OLMo 3 uses it, if I remember correctly. So there are these different tweaks that make th e models different. Otherwise, I put them all together in an article once where I just compared them. They are very, surprisingly similar.
**Nathan Lambert:** 最终什么都有。我觉得训练模型时所有的决策都要回到预训练。如果你要在模型上扩展 RL,你仍然需要确定一个能支持这个的架构。我们谈到了其他架构——使用不同类型的注意力、MoE 模型。MoE 模型的稀疏特性使生成效率更高——这成为后训练的一大部分——你需要准备好架构才能真正扩展算力。我仍然认为大部分算力在预训练上。因为你仍然可以让模型更好——你仍然想重新审视这个。你仍然想要能得到的最好的基础模型。再过几年那会饱和,RL 算力就会用得更久。
**Nathan Lambert:** It's just d ifferent numbers in terms of how many repetitions of the transformer block you have in the center. And, like, just little knobs that people tune. But what's so nice about it is it works no matter what. You can tweak things.
**Lex Fridman:** 有没有人不同意你的看法,说预训练已死?全部应该是扩展推理、扩展后训练、扩展上下文、持续学习、扩展数据、合成数据?
**Lex Fridman:** You can move the normalization layers around to get some performance gains. And OLMo is always very good in ablation studies, showing what it actually does to the model if you move something around. Ablation studies: does it m ake it better or worse? But there are so many, let's say, ways you can implement a transformer and make it still work.
**Nathan Lambert:** 人们在氛围上那样说、那样描述,但我觉得实际发生的并不是这样。
**Nathan Lambert:** The big ideas that are still pr evalent is Mixture of Experts, multi-head latent attention, sliding window attention, group query attention. And then at the end of the year, we saw a focus on making the attention mechanism scale linearly with inference token prediction. So there was Qwen2-VL, for example, which added a gated delta net. It's kind of inspired by State space models, where you have a fixed state that you keep updating.
**Sebastian Raschka:** 就是一般性的氛围——人们说这玩意死了。
**Sebastian Raschka:** But it makes essentially this attention cheape r, or it replaces attention with a cheaper operation. - And it may be useful to step back and talk about transformer architecture in general. - Yeah, so maybe we should start with the GPT-2 architecture. The transformer that was derived from the "Attention Is All You Need" paper. The "Attention Is A ll You Need" paper had a transformer architecture that had two parts, an encoder and a decoder. And GPT went just focusing in on the decoder part.
**Nathan Lambert:** 兴奋点在别处。RL 的唾手可得的果实在别处。比如我们 11 月发布了模型……每个公司都有截止日期。我们的截止日期是 11 月 20 号,为此我们跑了五天——相比 2024 年这是很长的时间——对一个 300 亿参数的模型做后训练。这不是大模型。然后在 12 月,我们又发布了一次——我们让 RL 又跑了三个半星期,模型明显变好了,所以我们就发布了。这是一个——你要分配给某个东西——作为全年的巅峰。所以就像——
**Nathan Lambert:** It is essentially still a neural network and it has this attention mechanism inside. And you predict one token at a time. You pass it through an embeddin g layer. There's the transformer block.
**Sebastian Raschka:** 推理是——
**Sebastian Raschka:** The transformer block has attention modules and a fully connected layer. And there are some normalization laye rs in between. But it's essentially neural network layers with this attention mechanism. So coming from GPT-2 when we move on to gpt-oss-120b, there i s, for example, the Mixture of Experts layer.
**Nathan Lambert:** 训练模型时有这些决策——它们就是……你不能让它永远跑。你得不断引入研究人员的改进。所以你重新做预训练,做一个月的后训练,但然后你需要把它给用户。你需要做安全测试。所以就是……我觉得有很多机制在强化这个更新模型的循环。事情在改善。你得到一个新的算力集群让你更稳定或更快地做事。就像你听到很多关于 Blackwell 有部署问题的——在 AI2,我们大多数预训练模型是在 1000 到 2000 个 GPU 上。但当你在 10000 或 100000 个 GPU 上预训练时,你会遇到非常不同的故障。GPU 会以奇怪的方式坏掉。在 10 万 GPU 的运行中,几乎可以保证有一个 GPU 是停机的。你的训练代码必须处理那种冗余——这是一个非常不同的问题。而我们在做后训练的时候用的集群——或者对于学习 ML 的人来说——他们在训练最大模型时面对的是纯粹的大规模分布式问题,非常不同。但这在某种程度上不同于——那是一个系统问题——为了使扩展定律成立,特别是在预训练阶段。你需要所有这些 GPU 同时工作。当我们转向 RL 时,它实际上更适合异构计算,因为你有多个模型副本。给大家做一个语言模型强化学习的入门——你做的是有两组 GPU。一组你可以叫"演员"(actor),一组你叫"学习者"(learner)。学习者是实际发生强化学习更新的地方。这些传统上是策略梯度算法——近端策略优化(Proximal Policy Optimization,PPO)和分组相对策略优化(Group Relative Policy Optimization,GRPO)是两个主流类别。另一边你有演员,它们在生成补全(completion),这些补全就是你要评分的。强化学习就是关于优化奖励的。实际上,你可以有很多不同的演员在世界各地做不同类型的问题,然后把结果发回这个高度网络化的计算集群做实际学习——在那里你取梯度。你需要有一个紧密连接的网络来做不同类型的并行化,把模型分布出去做高效训练。每种不同类型的训练和服务都有这些扩展上的考量。我们谈了预训练和 RL,然后推理时间扩展——你怎么给 1 亿用户提供一个思考一小时的模型?我不知道答案,但我知道那是一个很难的问题。为了给人们这种智能,有所有这些系统问题——我们需要更多算力、更稳定的算力来做到这些。
**Nathan Lambert:** It's not invented by gpt-oss-120b. It's a few years old. But it is essentially a tweak to make the model larger without consuming more compute in each forward pass. So there is this fully connected layer, and if listeners are familiar with multi-layer pe rceptrons, you can think of a mini multi-layer perceptron, a fully connected neural network layer inside the transformer.
**Lex Fridman:** 但你对所有这些类型的扩展都乐观——是我听到的意思。推理、推理能力、甚至预训练?
**Lex Fridman:** And it's very expensive, bec ause it's fully connected. If you have a thousand inputs and a thousand outputs, that's like one million connections. And it's a very expensive part i n this transformer. And the idea is to kind of expand that into multiple feedforward networks.
**Sebastian Raschka:** 是的。这里有一大堆东西要说,但基本上有两个……旋钮就是训练和推理扩展,你可以从中获益。在一个假设我们有无限算力的世界里,你什么都想做。所以你有训练,有推理扩展——训练是一个层级:预训练、中期训练、后训练。改变模型大小、更多训练数据、训练更大的模型给你模型中更多知识。然后模型——假设——有了一个更好的基础模型。以前我们叫它基础模型(foundation model)——现在仍然这么叫——它解锁了……但你不会说——在预训练期间或预训练之后就能让模型解决你最复杂的任务。你还有其他解锁阶段——中期训练或者后训练中的 RL——解锁模型在预训练中获得的知识所蕴含的能力。我觉得当然,如果你做更多预训练,你会得到一个更好的基础模型,以后可以解锁更多。但就像 Nathan 说的,它就是太贵了。我们没有无限算力,所以你得决定——我要把算力花在让模型更大上吗?这是一个权衡。在理想世界里,你什么都想做。我觉得在这个意义上,扩展仍然活得好好的。你仍然能得到更好的模型。但就像我们看到的 Claude Opus 4.5 一样——就是不值得。因为你可以用其他技术在当前这个时刻解锁更多性能——特别是推理时间扩展。那是今年最大的增益之一——o1——它让一个更小的模型走得比预训练一个更大的模型(像 Claude Opus 4.5 那样)更远。所以我不会说预训练扩展已死——只是目前有更有吸引力的扩展方式。但在某个时刻,你仍然会想在预训练上取得一些进步。另一个需要考虑的是你想把钱花在哪里。如果你把更多钱花在预训练上,那是一个固定成本。你训练了模型,它就永远有这个能力。你可以一直用。而推理时间扩展——你在训练时不花钱,你在后面每次查询时花钱。然后就是算数——我的模型在市场上能待多久?如果半年后我就换掉它?也许不值得多花 500 万、1000 万、1 亿美元在更长的训练上。也许我就做更多推理扩展,从那里获得性能。也许它在用户查询方面花了我 200 万美元。这变成了一个关于你有多少用户的数学题。我觉得这也是 ChatGPT 所处位置的有趣之处。我认为他们有很多用户——需要更便宜一些——所以他们有 GPT-5 这个稍小的模型。其他公司有……比方说,如果你的客户有不同的权衡。比如数学奥林匹克或某些数学问题——ChatGPT 或者他们有一个专有模型——我很确定那基本上是一个多微调了一点的模型,但大部分是在推理时间扩展上达到某些任务的巅峰性能。不是一直都需要那个。但总之,长话短说,我确实认为所有这些——预训练、中期训练、后训练、推理扩展——都仍然是你想做的事情。只是在当下、这一年,找到给你最大性价比的正确比例。
**Sebastian Raschka:** So instead of having one, let's say you have 256, but i t would make it way more expensive, because now you have 256, but you don't use all of them at the same time. So you now have a router that says, "Oka y, based on this input token, it would be useful to use this fully connected network." And in that context, it's called an expert. So a Mixture of Exp erts means you have multiple experts. And depending on what your input is, let's say it's more math-heavy, it would use different experts, compared to , let's say, translating input text from English to Spanish.
**Lex Fridman:** 我觉得这里是定义预训练、中期训练和后训练的好时机。
**Lex Fridman:** It would maybe consult different experts. It's not quite clear, I mean, not as clear-cut to say, "Okay, this is only an expert for math and for Spanish." It's a bit more fuzzy. But the idea is essentially that you pack more knowledge into the network, but not all the knowledge is used all the time. That would be very wasteful.
**Sebastian Raschka:** 预训练是经典的训练——一次一个 token 的下一 token 预测。你有一个大的数据语料库。Nathan 在这方面可能也有很有趣的见解——因为 OLMo 3 论文的很大一部分聚焦在正确的数据混合上。所以预训练本质上就是在大量互联网数据、书籍、论文等等上做交叉熵损失、做下一 token 预测训练。这些年有一些变化——人们以前是什么都往里扔。现在不只是原始数据了,还有合成数据——人们会把某些东西改写。合成数据不一定意味着纯 AI 生成的数据。也可以是从一篇文章——一篇维基百科文章——拿来然后改写成问答格式或总结、换种说法,用这种方式做出更好的数据。因为我也像人类那样想——如果一个人读了一本书,相比一个杂乱的——没有冒犯——Reddit 帖子,我确实认为你学到的——
**Sebastian Raschka:** So, during the token generation, you are more selective. The re's a router that selects which tokens should go to which expert. It adds more complexity. It's harder to train. There's a lot that can go wrong, lik e collapse and everything. So I think that's why OLMo 3 still uses dense...
**Nathan Lambert:** 这个帖子要上 Reddit 了,Sebastian。
**Nathan Lambert:** I mean, you have OLMo models with Mixture of Experts, but dense models, wh ere dense means... So also, it's jargon. There's a distinction between dense and sparse. So Mixture of Experts is considered sparse, because we have a lot of experts, but only a few of them are active.
**Sebastian Raschka:** 有些 Reddit 数据是非常珍贵的,对训练非常好。你只需要过滤它。
**Sebastian Raschka:** So that's called sparse. And then dense would be the opposite, where you only have one fully conne cted module, and it's always utilized. - So maybe this is a good place to also talk about KV cache. But actually, before that, even zooming out, like fundamentally, how many new ideas have been implemented from GPT-2 to today? Like, how different really are these architectures? - Take the Mixture of Experts.
**Sebastian Raschka:** 我觉得这就是要点。如果有人把它拿过来,用更简洁、更结构化的方式改写,我认为那就是更高质量的数据,能让 LLM 更快达到目标。最终你得到的 LLM 是一样的,但训练更快——因为如果语法和标点已经正确,它就直接学到了正确的方式,不用先从杂乱的来源获取信息然后再学怎么纠正。所以我觉得这就是预训练演变的方式以及为什么扩展仍然有效。不仅仅是数据量的问题,也是让数据为你更好服务的那些技巧。然后中期训练——它以前就叫预训练。我觉得之所以叫中期训练,是因为只有预训练和后训练但中间没有东西——听起来有点奇怪。中期训练通常和预训练类似,但更专业化一些。同样的算法,但你会聚焦——比如——长上下文文档。之所以不在预训练中做,是因为你没有那么多长上下文文档。我们有一个专门的阶段。LLM 的一个问题仍然是——它是一个神经网络。它有灾难性遗忘(catastrophic forgetting)的问题。你教它一些东西,它忘记其他东西。不是 100% 遗忘,但就像"没有免费午餐"。对人类也一样。如果你问我 10 年前学的数学,我得重新看看。
**Sebastian Raschka:** The attention mechanism in gpt-oss-120b, that would be the Group Query Attention mechanism. So it's a slight tweak from Multi-Head Attention to Group Query Attention. So that we have too... I think they replaced LayerNorm by RMSNorm, but it's just like a different normalization there and n ot a big change.
**Nathan Lambert:** Nathan 之前实际上在说他吸收了太多内容以至于有灾难性遗忘的问题。
**Nathan Lambert:** It's just like a tweak. The nonlinear activation function— people familiar with deep neural networks, I mean, it's the same as changi ng sigmoid with ReLU. It's not changing the network fundamentally. It's just a little tweak.
**Nathan Lambert:** 是的,我试图学太多关于 AI 的东西——就像——我在学预训练的并行化。然后我就觉得"我丢了什么东西,但我不知道是什么。"
**Nathan Lambert:** And that's about it, I would say. It's not really fundame ntally that different. It's still the same architecture. So you can go from one into the other by just adding these changes basically. - It fundamenta lly is still the same architecture. - Yep.
**Sebastian Raschka:** 我不想把 LLM 拟人化,但这跟人类学习的方式有相似之处。量不总是更好的,因为你得有选择性。中期训练就是在质量内容上做选择。所以 LLM 最后看到的是高质量的东西。然后后训练就是所有的微调——监督微调、DPO、RLVR(可验证奖励的强化学习)、RLHF 等等。就是精炼阶段。这也很有趣——这是成本的问题。你在预训练上花了很多钱。RL 少一些。用 RL,你并不真正教它知识。更像是解锁知识——更像是技能学习——用预训练中的知识来解决问题。今年或者说 2025 年确实有三篇论文关于用 RL 做预训练。但我不认为有人在生产中这样做。
**Sebastian Raschka:** For example, you mentioned my book earlier. That's a GPT-2 model in the book because it's simple and it's v ery small, so 124 million parameters approximately. But in the bonus materials, I do have OLMo from scratch, Gemini 3 from scratch, and other types of from-scratch models. And I always start it with my GPT-2 model and just tweak the—well, add different components and you get from one to the other.
**Nathan Lambert:** 目前还是玩具级的例子。
**Nathan Lambert:** I t's kind of like a lineage in a sense. - Can you build up an intuition for people? Because when you zoom out, you look at it, there's so much rapid ad vancement in the AI world. And at the same time, fundamentally the architectures have not changed. So where is all the turbulence, the turmoil of the advancement happening?
**Sebastian Raschka:** 对,玩具级的例子。但泛化一下——RL 后训练更像是技能解锁,而预训练更像是吸收知识。
**Sebastian Raschka:** Where are the gains to be had? - So there are different stages where you develop the network or train the network. You have the pre-training. Now back in the day, it was just pre-training with GPT-2. Now you have pre-training, mid-training, and post-training.
**Nathan Lambert:** 有几件事可能有帮助。很多人认为合成数据对训练模型不好。你提到 DeepSeek 几乎做了……OCR,就是光学字符识别(Optical Character Recognition)。很多实验室都做了。AI2 有一个,Meta 有多个。之所以每个实验室都有,是因为网上有大量 PDF 和其他数字文档,它们的格式不容易编码成文本。所以你用 Almost-OCR、DeepSeek OCR 或者我们叫的 Almost-OCR 来提取数万亿 token 的候选预训练数据。预训练数据集大小以万亿 token 计。研究者的小模型大概是 5 到 10 万亿。Qwen 记录的高达 50 万亿,还有传言说闭源实验室能到 100 万亿 token。获取这些潜在数据——他们有一个非常大的漏斗——你实际训练用的数据只是其中很小一部分。这种字符识别数据会被描述为实验室预训练的合成数据。然后还有一个事实——现在 ChatGPT 能给出很棒的回答,你可以在那些最好的回答上训练——那也是合成数据。它和早期 ChatGPT 充满幻觉数据的时代非常不同——那时候人们对合成数据形成了固有印象。
**Nathan Lambert:** So I think right now we are in the post-training focus stage. Pre-training still gives you advantages if you scale it up with better, higher quality data. But then we have capability unlocks that were not there with GPT-2, for For example, ChatGPT is basically a GPT-3 model. And GPT-3 is the same as GPT-2 in terms o f architecture.
**Sebastian Raschka:** 一个有趣的问题——如果我没记错的话,OLMo 3 用的数据比某些其他开放权重模型少——甚至可能比 OLMo 2 都少。但你们还是得到了更好的性能。这可能就是数据质量如何帮助的一个例子。
**Sebastian Raschka:** What was new was adding supervised fine-tuning and reinforcement learning with human feedback. So it's more on the algorithmic side th an the architecture. - I would say that the systems also change a lot. If you listen to NVIDIA's announcements, they talk about things like, "You now do FP8, you can now do FP4." What's happening is these labs are figuring out how to utilize more compute to put it into one model, which lets them tra in faster and put more data in. And then you can find better configurations faster by doing this. So you can look at, essentially, tokens per second p er GPU as a metric that you look at when you're doing large-scale training. You can go from 10k to 13k by turning on FP8 training, which means you're using less memory per parameter in the model. By saving less information, you do less communication and train faster. So all of these system things un derpin way faster experimentation on data and algorithms. It's a loop that keeps going where it's hard to describe when you look at architectures and they're exactly the same, but the code base used to train these models is vastly different- -and you could probably... the GPUs are different but you probably train gpt-oss-20b way faster in wall-clock time than GPT-2 was trained at the time. - Yeah.
**Nathan Lambert:** 这主要归结于数据质量。我觉得如果我们有更多算力,我们会训练更长时间。我觉得我们最终会认为那是我们想做的事情。而且特别是大模型——你需要更多算力。因为我们谈到了更多参数、更多知识。本质上有一个比例——大模型可以从数据中吸收更多——然后你从中获益更多。在你脑海中画一个对数图——小模型如果你以海量 token 为度量会更快趋于平缓,大模型需要更多。但目前我们在 AI2 训练的模型没那么大,获取最高质量的数据是自然的出发点。
**Nathan Lambert:** Like you said, they had, for example, in Mixture of Experts this FP4 optimization where you get more throughput. But I do think, for speed this is true, but it doesn't give the model new capabilities . It's just: how much can we make the computation coarser without suffering in terms of model performance degradation? But I do think- I mean, there a re alternatives popping up to the transformer.
**Lex Fridman:** 关于数据质量这个话题,有什么要说的吗?还有什么唾手可得的果实可以提升质量?
**Lex Fridman:** Text diffusion models, a completely different paradigm. And there is also... I mean, although text diff usion models might use transformer architectures, it's not an autoregressive transformer. And also Mamba models.
**Nathan Lambert:** 就像不断地转动曲柄。从历史上看——在开放领域——一直有一个被公认为最佳的预训练数据集,它在不同持有者之间轮换——看谁有最新的或最好的最新成果。比如 AI2 的 Dolma 在最初的 OLMo 时就很早了。Hugging Face 有 FineWeb。还有 DCLM 项目——Data Comp Language Model 的缩写。之前有 Data Comp 用于其他 ML 项目——他们有一个非常强的数据集。其中很大一部分是——互联网正变得越来越封闭,所以我们有 Common Crawl——数百万亿 token——然后你去过滤它。看起来像科学工作——你在训练分类器、根据你如何把这个数据集修剪成最高质量的东西和适合你任务的东西来做决策。以前语言模型更多地在知识和对话方面被测试,但现在人们期望它们做数学和编程。要训练一个推理模型,你需要重新混合你的整个数据集。有很多很棒的科学方法——你可以拿你巨大的数据集,从不同来源——比如 GitHub、Stack Exchange、Reddit、Wikipedia——抽取很小的样本,在每种混合上训练小模型,测量在你的评估上的表现。你可以做基本的线性回归——"这是你的最优数据集。"但如果你的评估标准变了,你的数据集变化也很大。所以 OLMo 3 的很多工作是引入新的推理数据源——让数学和代码更好——然后做这个混合程序,它给你答案。我觉得今年各实验室都在这样做——有新的热门方向,不管是编程环境还是网页导航——你需要引入新数据,改变整个预训练,这样你的后训练才能更好地工作。而这就是不断演变的过程——重新确定他们对模型关心什么。
**Nathan Lambert:** It's a state space model. But they do have trade-offs, and nothing has yet replaced the autoregressive transformer as the state-of-the-art model. For state-of-the-art, you would still go with that, but there are now alternatives for the cheaper end—alternatives that are kind of making compromises. It's not just one architecture anymore .
**Lex Fridman:** 有没有什么有趣的轶事——哪些数据来源质量特别高、出乎意料的?你提到 Reddit 有时候也是来源。
**Lex Fridman:** There are little ones coming up. But if we talk about the state-of-the-art, it's pretty much still the transformer architecture, autoregressive, der ived from GPT-2 essentially. - I guess the big question here is, we talked quite a bit about the architecture behind the pre-training. Are the scaling laws holding strong across pre-training, post-training, inference, context size, data, and synthetic data? - I'd like to start with the technical def inition of a scaling law- -which informs all of this. The scaling law is the power law relationship between...
**Nathan Lambert:** Reddit 非常有用。我觉得 PDF 绝对是一个。
**Nathan Lambert:** You can think of the x-axis, so kind of what you are scaling as a combination of compute and data, which are kind of similar, and then the y-axis is like the held-out prediction accuracy ov er next tokens. We talked about models being autoregressive. It's like if you keep a set of text that the model has not seen, how accurate will it get when you train? And the idea of scaling laws came when people figured out that that was a very predictable relationship.
**Sebastian Raschka:** 哦,特别是 arXiv。
**Sebastian Raschka:** And I think that that techni cal term is continuing, and then the question is, what do users get out of it? Then there are more types of scaling where, OpenAI's o1 was famous for introducing inference time scaling. And I think less famously for also showing that you can scale reinforcement learning training and get kind of this log x-axis and then a linear increase in performance on y-axis. So there's kind of these three axes now where the traditional scaling laws are talked about for pre-training, which is how big your model is and how big your dataset is, and then scaling reinforcement learning, which is like how long c an you do this trial and error learning that we'll talk about.
**Nathan Lambert:** 是的。AI2 运营 Semantic Scholar 已经很长时间了——你可以说它是 Google Scholar 的竞争对手,有更多功能。为此,AI2 找到并抓取了很多公开可访问论文的 PDF——那些可能不在某个出版商的付费围墙花园后面的。真正开放的科学 PDF。如果你坐在所有这些数据上并处理它,你可以从中获取价值。我觉得很多这类工作——前沿实验室早得多就做了。你只需要一个相当有经验的研究者,了解什么会改变模型——他们引入数据、清理数据,这是大量劳动。当前沿实验室扩大研究人员规模时,更多精力投入到数据上。如果你加入一个前沿实验室想要产生影响,最好的方式就是找到更好的新数据。然后那些花哨、光鲜的算法层面的东西——比如弄清怎么做 o1——那是科学家最性感的想法。就像"哦,我搞清了怎么扩展 RL。"有一个团队做了这个。但大多数贡献是——
**Nathan Lambert:** We'll define more of this, and then this inference time compute, which is just letting the model generate more tokens on a specific problem. So I'm kind of bullish, but they're all really still working, but the low-hanging fruit has most ly been taken, especially in the last year on reinforcement learning with verifiable rewards, which is this RLVR, and then inference time scaling, whi ch is just why these models feel so different to use, where previously you would get that first token immediately. And now they'll go off for seconds, minutes, or even hours, generating these hidden thoughts before giving you the first word of your answer. And that's all about this inference time sc aling, which is such a wonderful kind of step function in terms of how the models change abilities.
**Sebastian Raschka:** 在数据集方面。
**Sebastian Raschka:** They kind of enabled this tool use stuff and enabl ed this much better software engineering that we were talking about. And this, when we say enabled, is almost entirely downstream of the fact that thi s reinforcement learning with verifiable rewards training just kind of let the models pick up these skills very easily. So let the models learn, so if you look at the reasoning process when the models are generating a lot of tokens, what it'll often be doing is: it tries a tool, it looks at what it gets back. It tries another API, it sees what it gets back and if it solves the problem.
**Nathan Lambert:** "我要让数据更好"或者"我要让基础设施更好,这样团队里的每个人跑实验能快 5%。"
**Nathan Lambert:** So the models, when you're training them, very quickly learn to do this. And then at the end of the day, that gives this kind of general foundation where the model can use CLI commands very nicely in your repo a nd handle Git for you and move things around and organize things or search to find more information, which if we were sitting in these chairs a year a go is something that we didn't really think of the models doing. So this is just kind of something that has happened this year and has totally transfo rmed how has totally transformed how we think of using AI which evolution and just unlocks so much value. But it's like, just so- pr- unlocks so much value.
**Sebastian Raschka:** 同时我觉得这也是最被严密守护的秘密之一——你的训练数据是什么——出于法律原因。所以也有很多工作投入到隐藏你的训练数据是什么上面。比如训练模型不要泄露数据来源——因为你有法律上的理由。
**Sebastian Raschka:** But it's- it's like, it's not clear what the next avenue will be in terms of unlocking stuff like this. I think there's... we'll get to continu al learning later, but there's a lot of buzz around certain areas of AI, but no one knows when the next step function will really come. - So you've ac tually said quite a lot of things there, and said profound things quickly. It would be nice to unpack them a little bit. You say you're bullish basica lly on every version of scaling.
**Nathan Lambert:** 另一个要完整说明的是——有些人试图只在已授权的数据上训练。而 Common Crawl 是对整个互联网的抓取。如果我托管了多个网站,我很乐意让它们被用来训练语言模型,但我并没有明确授权这种使用方式。因此 Common Crawl 在很大程度上是未授权的——意味着你的同意并没有被真正给予。还有另一个想法——你可以只在被明确授权的数据上训练语言模型——这样就有了管理合同——我不确定 Apertus 是版权还是许可方面的。我知道他们这样做的原因是为了符合欧盟合规要求——他们想确保模型符合某项检查。
**Nathan Lambert:** So can we just even start at the beginning? Pre-training, are we kind of implying that the low- hanging fruit on pre- training scaling has been picked? Has pre-training hit a plateau, or is even pre-training still something you're bullish on? - Pre-training has gotten extremely expensive. I think to scale up pre-training, it's also implying that you're gonna serve a very large model to the users.
**Sebastian Raschka:** 在这个话题上,授权方面还有一些区分。有些人只是购买了许可。比如他们买了一本 Amazon Kindle 书或者一本 Manning 出版社的书,然后用来训练。这是一个灰色地带——因为你为内容付了费,你可能想在上面训练。但也有限制说即便那样也不应该被允许。所以这就变得有点模糊。是的,我觉得这目前仍然是一个热门话题。像 OpenAI 这样的大公司会去找私人公司要它们的专有数据。而私人公司越来越——保护自己的数据,因为他们知道"这将是我几年后的护城河。"我确实认为这是一个有趣的问题——如果 LLM 变得更加商品化,而且我认为很多人在学习 LLM——会有更多人有能力训练 LLM。当然有基础设施的挑战。但如果你想想大型行业——制药、法律、金融——我确实认为在某个时刻,他们会从前沿实验室挖人来在专有数据上构建他们的内部模型。那将是——又一次——预训练的解锁——目前还不存在的。因为即使你想这样做,你也拿不到那些数据。你大多数时候无法获取临床试验数据之类的东西。所以我确实认为,扩展在这个意义上可能仍然非常活跃——如果你也看特定领域的应用。因为我们现在——在今年——还只是在看通用 LLM——ChatGPT、Anthropic 什么的。它们只是通用的,我觉得它们甚至还没有触及 LLM 真正为特定任务专门训练和设计后能做什么的表面。
**Sebastian Raschka:** So I think that it 's been loosely established the likes of GPT-4 and similar models were around one trillion parameters at the biggest size. There's a lot of rumors tha t they've actually gotten smaller as training has gotten more efficient. You want to make the model smaller because then your costs of serving go down proportionately. These models, the cost of training them is really low relative to the cost of serving them to hundreds of millions of users.
**Nathan Lambert:** 我觉得在数据方面——这是 2025 年发生的一件大事,但我们完全忘记了——是 Anthropic 在法庭上败诉,欠了作者 15 亿美元。Anthropic——我想——买了数千本书并扫描了它们,在法律上被允许了——因为他们买了这些书——这正在走法律程序。然后另一方面,他们也盗版下载了一些书。我觉得盗版那条路线是法院判定他们有责任向作者支付这数十亿美元的原因。这就是如此令人震惊的诉讼——来了又走了。那么多钱——来自 VC 生态系统。
**Nathan Lambert:** I think DeepSeek had this famous number of about five million dollars for pre-training at cloud market rates. In OLMo 3, section 2.4 in the paper, we just de tailed how long we had the GPU clusters sitting around for training which includes engineering issues, multiple seeds, and it was like about two milli on dollars to rent the cluster to deal with all the headaches of training a model. So these models are pretty— like, a lot of people could get one to 10 million dollars to train a model, but the recurring costs of serving millions of users is really billions of dollars of compute. I think that you c an look at a thousand GPU rental you can pay 100 grand a day for.
**Lex Fridman:** 这些法庭案件将定义人类文明的未来。因为很明显数据驱动了很多这些东西,而且有这种非常复杂的人类张力……你们都是作者,可以产生共情。在某种程度上——你倾注了心血和汗水在你的写作中。有人拿你的数据训练而不给你任何荣誉——确实有一种被偷窃的感觉。
**Lex Fridman:** And these companies could have millions of GPUs. Like you can look at how much these things cost to sit around. So that's kind of a big thing, and then it's like, if scaling is actually giving you a better model, is it gonna be financ ially worth it? And I think we'll slowly push it out as AI solves more compelling tasks, so like the likes of Claude Opus 4.5, making Claude Code just work for things.
**Sebastian Raschka:** 就像 Nathan 说的,这也有两层。有人可能买了书然后在上面训练——这可以争论公不公平——但还有直接使用盗版书的公司——甚至没有补偿作者。我觉得这才是人们特别生气的地方。
**Sebastian Raschka:** I— I launched this project called the ATOM project, which is American Truly Open Models in July, and that was like a true vibe coded website, and like, I have a job to make plots and stuff. And then I came back to refresh it in the last few weeks and it's like Claude Opus 4.5 versu s whatever model at the time was like, just crushed all the issues that it had from building in June and July and like, it might be a bigger model. Th ere's a lot of things that go into this, but there's still progress coming. - So what you're speaking to is the nuance of the y-axis of the scaling la ws—the way it's experienced versus on a benchmark, the actual intelligence might be different. But still, your intuition about pre-training, if you sc ale the size of compute, will the models get better?
**Lex Fridman:** 是的,但必须有某种补偿机制。这有点像朝 Spotify 流媒体那个方向发展的——最初对音乐做的。你知道,补偿长什么样?你必须定义那些模式。你必须想清楚所有这些。还有一件我觉得人们普遍好奇的事——随着 LLM 被越来越多使用,如果你看 arXiv 甚至 GitHub——越来越多的数据是 LLM 生成的。在那样的世界里你怎么办?这个问题有多大?
**Lex Fridman:** Not whether it's financially viable but just from the law aspect of it, do you think the models w ill get smarter? - Yeah. And I think that there's... And this sometimes comes off as almost like disillusionment from people, leadership at AI compani es saying this, but they're like, "It's held for 13 orders of magnitude of compute, why would it ever end?" So I think fundamentally it is pretty unli kely to stop, it's just eventually we're not even gonna be able to test the bigger scales because of all the problems that come with more compute. I t hink that there's a lot of talk on how 2026 is a year when very large Blackwell compute clusters, like gigawatt-scale facilities at hyperscalers, are coming online.
**Nathan Lambert:** 最大的问题是基础设施和系统,但从 AI 的角度来看,这基本上是不可避免的。
**Nathan Lambert:** These were all contracts for power and data centers that were signed and sought out in 2022 and 2023. So before or right after ChatGPT. It took this two-to-three-year lead time to build these bigger clusters to train the models. While there's obviously immense interest in building eve n more data centers than that.
**Lex Fridman:** 所以基本上就是由人类策划的 LLM 生成数据,对吧?
**Lex Fridman:** So that is the crux that people are saying: these new clusters are coming. The labs are gonna have more compute for tra ining. They're going to utilize this, but it's not a given. I've seen so much progress that I expect it, and I expect a little bit bigger models, and I expect...
**Nathan Lambert:** 是的。我觉得很多开源贡献者确实在倦怠。如果你有一个受欢迎的开源仓库,有人就说"哦我想做开源 AI,对我的职业有好处"——然后他们就 vibe code 一些东西扔进来。你可能会得到更多这样的——
**Nathan Lambert:** I would say it's more like we'll see a $2,000 subscription this year. We've seen $200 subscriptions. That could 10X again, and these are t he kind of things that could come, and they're all downstream of this bigger model that offers just a little bit more cutting edge. - So, you know, it 's reported that xAI is gonna hit that one-gigawatt scale early '26, and a full two gigawatts by year end. How do you think they'll utilize that in th e context of scaling laws?
**Sebastian Raschka:** 我有一个案例。我有一个叫 MLxtend 的仓库,大约 10 年前作为学生开发的。它对某些算法来说仍然是一个相当受欢迎的库——我觉得特别是频繁数据挖掘方面。最近有两三个人在很短的时间内提交了大量的 PR。我确实认为 LLM 参与了提交这些 PR。作为维护者,有两件事。首先我有点不堪重负。我没时间读完——特别是作为一个老库,这不是我的优先事项。但同时我也某种程度上感激——因为我觉得人们忘记的一件事是——不只是使用 LLM。仍然有一个人类层在验证东西。在某种意义上这也是数据标注的方式,对吧?最昂贵的事情之一就是为 RLHF 阶段获取人类标注数据。这有点像那样——经过多个阶段——然后你实际上得到了更高质量的数据。所以我某种程度上不介意。它可能让人感到不堪重负,但我确实认为其中也有价值。
**Sebastian Raschka:** Is a lot of that inference? Is a lot of that training? - It ends up being all of the above. So I think that all of your dec isions when you're training a model come back to pre-training. So if you're going to scale RL on a model, you still need to decide on your architectur e that enables this.
**Lex Fridman:** 感觉原始的 LLM 生成数据和有人类参与验证的 LLM 生成数据之间存在根本区别——即使那个验证只涉及很小比例的代码行。
**Lex Fridman:** We were talking about other architectures and using different types of attention, or a mixture of experts models. The sparse natu re of MoE models makes it much more efficient to do generation, which becomes a big part of post-training, and you need to have your architecture read y so that you can actually scale up this compute. I still think most of the compute is going in at pre-training. Because you can still make a model be tter, you still want to go and revisit this.
**Sebastian Raschka:** 我觉得这适用于任何事情——人们有时也会想"哦我可以直接用 LLM 学 XYZ"——这确实可以。但可能有一个专家——可能用了 LLM 来写特定代码——这里面有人类的劳动——让它变得好看、扔掉不太好的部分——给你预消化了——这为你节省时间。我觉得这就是增值所在——有人在过滤东西,甚至在正确地使用 LLM。这仍然是你免费获得的劳动。比如如果你读一篇 Substack 文章,我也许可以让 LLM 给我关于那篇文章的意见,但我甚至不知道该问什么。我觉得读那篇文章相比我直接去 LLM 仍然有价值——因为你是专家。你选择什么知识是真正到位的、应该被包含的,你给我这个非常……精炼的总结。这是巨大的增值——因为现在我不用浪费三到五个小时自己去过一遍,可能还得到一些错误信息什么的。所以我觉得——即使有 LLM 可以帮你节省时间——写作者的未来仍然在这里。
**Sebastian Raschka:** You still want the best base model you can. And in a few years that'll saturate and the RL compute will j ust go longer. - Are there people who disagree with you and say pre-training is dead? It's all about scaling inference, scaling post-training, scaling context, continual learning, scaling data, synthetic data? - People vibe that way and describe it in that way, but I think it's not the practice that is happening. - It's just the general vibe of people saying this thing is dead- - The excitement is elsewhere. So the low-hanging fruit- ...in RL is elsewhere.
**Lex Fridman:** 观察总结和原始内容之间的差异挺有意思的。即使是一页长总结对一页长内容。有趣的是看 LLM 总结如何把"棱角"磨掉了。它从东西中移除了什么信号?
**Lex Fridman:** For example, we released our model in November... Every company has deadlines. Our deadline was November 20th, and for that, our run was fi ve days, which compared to 2024 is a very long time to just be doing post- training at a model of 30 billion parameters. It's not a big model.
**Nathan Lambert:** 我经常谈的是"声音"(voice)。
**Nathan Lambert:** And the n in December, we had another release, where we let the RL run for another three and a half weeks, and the model got notably better, so we released it . And that's a to just allocate to something that is going to be your peak- ...for the year. So it's like- - The reasoning is- - There's these types o f decisions when training a model where they just... They can't leave it forever.
**Lex Fridman:** 声音?嗯……我很想听你说的声音是什么意思。但有时候确实是洞察。通过移除一个洞察,你改变了那个东西的含义。所以我一直很失望 LLM 在真正抓住核心洞察方面有多差——而一个好的总结恰恰应该做到这一点。即使我有那些极其精心设计的 prompt——真正试图挖掘洞察——它仍然不太到位。这是一个关于人类知识和智慧是什么、洞察力意味着什么的深层哲学问题。但你说的声音是什么意思?
**Lex Fridman:** You have to keep pulling in the improvements from researchers. So yo u redo pre-training, you'll do this post-training for a month, but then you need to give it to your users. You need to do safety testing. So it's just ...
**Nathan Lambert:** 当我写作的时候,我觉得我很大程度上在做的是——把你作为研究者的想法——非常原始的——一个研究者试图在理解的前沿封装一个想法——他们试图把一种"感觉"转化为文字。我在我的写作中也试图这样做——这让它显得原始但同时信息密度很高——以一种某些人能领会、某些人不能的方式。这就是研究的本质。语言模型不擅长这个。它们都是用 RLHF 训练的——从很多人那里收集反馈,然后平均化模型的行为。我觉得当存在这种过滤器时,模型很难做到非常犀利。对 RLHF 研究者来说这是一个很美妙的基本问题。RLHF 在让模型变好方面提供了巨大的效用——但同时问题的表述中有一个……结——你过不去的。这些语言模型在它们的深层表达中没有这种先验。我不觉得这是不可能的。有模型真正震撼人的故事。比如——我很想试过 Bing Sydney。它是不是有更多"声音"?因为它经常失控——这在历史上显然是一个可怕的方式——比如告诉一个记者离开他老婆——那是一个可能推向大众使用的疯狂模型。但那有点像一个权衡——RLHF 过程在某些方面是不是在增加限制?
**Nathan Lambert:** I think there's a lot in place that reinforces this cycle of updating the models. Things improve. You get a new compute cluster that lets you do s omething more stably or faster. It's like you hear a lot about Blackwell having rollout issues, where at AI2, most of the models we're pre-training ar e on 1,000 to 2,000 GPUs.
**Lex Fridman:** 作为这些前沿实验室和公司之一,那是一个很可怕的处境——因为数百万人在使用它们。
**Lex Fridman:** But when pre-training on 10,000 or 100,000 GPUs, you hit very different failures. GPUs break in weird ways, and on a 100,000 GPU run, you're pretty much guaranteed to have one GPU that is down. Your training code must handle that redundancy, which is a very different proble m. Whereas what we're doing, like playing with post-training on a cluster, or for people learning ML, what they're battling to train these biggest mod els is just- ...mass distributed scale, and it's very different.
**Nathan Lambert:** 去年 GPT-4o 被撤回时有很多反弹。我个人从没用过那个模型,但我和 OpenAI 的人聊过——他们会收到用户的邮件——用户可能在半夜检测到部署中的微妙差异。然后他们找到这些员工的邮箱发邮件说"我的朋友变了。"他们找到这些员工的邮箱发这些东西——因为他们对那组特定的模型权重和配置如此依恋。我们在 TikTok 上也看到这个。你打开它——我不用 TikTok——但据说五分钟之内算法就抓住你了。锁定了。那些也是语言模型在做推荐。我觉得有些方法可以做到这一点——五分钟的聊天之内,模型就完全懂你了。那是人们还没准备好面对的。比如——别把那个给孩子。至少在我们知道发生了什么之前。
**Nathan Lambert:** But that's somewhat different than- That's a systems problem- ...in order to enable s caling laws, especially at pre-training. You need all these GPUs at once. When we shift to RL, it actually lends itself to heterogeneous compute becau se you have many copies of the model. To do a primer for language model reinforcement learning, what you're doing is having two sets of GPUs.
**Lex Fridman:** 但也会有这种机制……随着 LLM 被越来越多使用……不幸的是,人类境况的本质是——人们会自杀。而记者会做的是广泛报道自杀的人。他们很可能会把它和 LLM 联系起来——因为他们有对话数据。如果你在生活中真的挣扎,如果你抑郁,如果你在想自杀,你很可能会和 LLM 谈论这些。记者会说"嗯,这个人自杀是因为 LLM 。"这会导致这些公司——因为法律问题等等——越来越多地把 LLM 的棱角磨掉。所以它会尽可能地通用化。在这个领域运营如此困难——因为你不想让 LLM 在那个层面上对人造成伤害——但同时这也是人类体验的本质——拥有丰富的对话、充实的对话、挑战你的对话、让你成长的对话。你需要那种锋芒。这对 RLHF 前线的 AI 研究者来说是极其困难的事情——因为你在处理人类的处境。
**Lex Fridman:** One you can call the actor, and one you call the learner. The learner is where your actual reinforcement learning updates happen. These are traditionally poli cy gradient algorithms. Proximal Policy Optimization, PPO, and Group Relative Policy Optimization, GRPO, are the two popular classes.
**Nathan Lambert:** 这些公司的很多研究者是非常有良好动机的——Anthropic 和 OpenAI 在文化上确实想为世界做好事。这是如此——我就像"唉,我不想做这个"——因为一方面很多人把 AI 看作健康盟友——可以保密地谈论健康的对象——但另一方面它一路延伸到——谈论心理健康——某个人因此走向了悬崖边——这让人心碎——但其他人可能被拯救了。作为研究者,就像——我不想训练图像生成模型然后开放发布——因为我不想让某人在笔记本电脑上有一个可以伤害他人的工具。我的公司没有基础设施来安全地做这个。但有很多这样的领域——只是需要以复杂性和信念来接近的人。这就是一个很难的问题。
**Nathan Lambert:** And on the other side you have actors which are generating completions, and these completions are what you're going to grade. Reinforcement learning is all about opti mizing reward. In practice, you can have a lot of different actors in different parts of the world doing different types of problems, and then you sen d it back to this highly networked compute cluster to do this actual learning where you take the gradients. You need to have a tightly meshed network to do different types of parallelism and spread out your model for efficient training.
**Lex Fridman:** 但同时,我们作为社会、作为这些技术的用户,需要确保我们在就此进行复杂的对话——而不只是恐吓——说大科技公司在伤害人类或偷你的数据。事情比那复杂得多。你说得对。这些公司里有大量的人——其中很多我认识——他们真心关注帮助人们。他们在考虑来自世界各地——不只是硅谷——美国和世界各地的人的完整人类体验和需求。为所有这些不同种类的人——跨越不同年龄组、文化和心理状态——设计这一个系统,真的极其困难。
**Lex Fridman:** Every different type of training and serving has these consider ations to scale. We talked about pre-training and RL, and then inference time scaling- how do you serve a model that's thinking for an hour to 100 mil lion users? I don't know about that, but I know that's a hard problem. In order to give people this intelligence, there's all these systems problems, and we need more compute and you need more stable compute to do it." - But you're bullish on all of these kinds of scaling is what I'm hearing.
**Nathan Lambert:** 我希望 AI 的时机和大科技公司与普通人的关系能不同。大科技公司的声誉已经很低了,而 AI 如此昂贵——不可避免地会成为大科技的事情。需要这么多资源。人们说美国在"把经济赌在 AI 上"——投入如此大的建设。这些事情同时交织在一起,使得沟通环境如此困难。对我来说——去跟更多世界上讨厌大科技公司、把 AI 看作其延续的人交谈——会很有价值。
**Nathan Lambert:** On the inference, on the reasoning, even on the pre-training? - Yeah, so that's a big can of worms here, but there are basically two... The knobs are the tr aining and the inference scaling where you can get gains. In a world where we had, let's say, infinite compute resources, you want to do all of them. So you have training, you have inference scaling, and training is like a hierarchy: it's pre-training, mid-training, post-training.
**Lex Fridman:** 你推荐的一件事,你谈到的一剂解药,是在这个系统中找到"能动性"(agency)。不是以一种无力的方式坐在那里消费 AI 快速接管互联网的垃圾内容。通过使用 AI 去构建东西来找到能动性——构建应用、构建……一方面,这确实帮助你建立直觉;二方面,它赋予你力量——因为你能理解它是怎么工作的、弱点在哪里。它让你的声音有力量去说"这是技术的一个坏用途,这是一个好用途。"你更深入地融入系统中,所以你能更好地理解它、作为消费者更好地引导它。
**Lex Fridman:** Changing the model size, more training data, training a bigger model gives you more knowledge in the model. Then the model, let's say, has a better base model. Back in the day, or still, we call it a foundation model, and it unlocks... But you don't, let's say, have the model be able to solve your most complex tasks during pre-training or after pre-training.
**Sebastian Raschka:** 我觉得你提出的关于能动性的点很好。与其忽视它说"好吧,我不打算用它"——我觉得从长期来看更健康的态度是说"好吧,它在那里。我不可能把它收回去。我怎么最好地利用它?它怎么帮助我提升自己?"不过我担心的一点是——如果你完全用它来做你喜欢做的事情——你喜欢做的事情就不在了。这可能导致倦怠。比如如果我用 LLM 帮我做所有的编程——那就没有编程了。我只是在管理一个替我编程的东西。两年后——假设我每天八小时就这样——让一个东西替我写代码——我还觉得充实吗?这会不会伤害我——让我对工作不再兴奋?我是否还会为构建什么而自豪?
**Sebastian Raschka:** You still have these other unlock phases where you have mid-training or, for example, post-training with RL that unlocks capabilities that the model has in terms of knowledge in the pre-training. And I think, sure, if you do more pre-training, you get a bet ter base model that you can unlock later. But like Nathan said, it just becomes too expensive. We don't have infinite compute, so you have to decide, do I want to spend that compute more on making the model larger?
**Lex Fridman:** 关于享受这个话题,挺有趣的。我们顺便插一下——有一个最近的调查,大约 791 名专业开发者——意思是 10 年以上经验。
**Lex Fridman:** It's like a trade-off. In an ideal world, you want to do all of them. And I think in that sense, scaling is still pretty much alive. You would still get a better model, but like we saw with Claude Opus 4.5, it's just not worth it.
**Sebastian Raschka:** 很长时间了。在这个时代作为初级开发者?
**Sebastian Raschka:** Beca use you can unlock more performance with other techniques at that current moment, especially if you look at inference scaling. That's one of the bigge st gains this year with o1, where it took a smaller model further than pre-training a larger model like Claude Opus 4.5. So I wouldn't say pre-trainin g scaling is dead, it's just that there are other more attractive ways to scale right now. But at some point, you will still want to make some progres s on the pre-training.
**Lex Fridman:** 是的,在这个时代。所以有很多方面都很出人意料。他们按初级和高级开发者做了分类。它显示了——初级和高级开发者都在生产代码中使用 AI 生成的代码。这是他们发布的代码。25%——大多数人使用大约 50% 或更多。有趣的是,在"你发布的代码超过 50% 是 AI 生成的"这个类别中——高级开发者更可能这样做。但你不希望 AI 夺走你热爱的事情。我觉得这跟我接下来要说的结果吻合——大约 80% 的人觉得用 AI 作为工作的一部分——要么稍微更有乐趣,要么明显更有乐趣。
**Lex Fridman:** The thing also to consider is where you want to spend your money. If you spend it more on the pre-training, it's like a fixed c ost. You train the model, and then it has this capability forever. You can always use it.
**Sebastian Raschka:** 我觉得取决于任务。从我个人的使用来看——比如我有一个网站,有时候要调整一些东西。我个人不喜欢这个。所以如果 AI 能帮我在网站上实现什么——我完全赞成。太棒了。但同时,当我解决一个复杂问题——如果有一个 bug,我追踪这个 bug,然后找到它——那是世界上最好的感觉。但现在如果你甚至不去想这个 bug——你直接去 LLM——你就永远不会有那种感觉。但可能有中间地带——你自己试了,找不到,你用 LLM——然后你不会因此沮丧,因为它帮了你,你继续做你喜欢的事情。看这些统计——我觉得没有考虑到的是——它在所有不同场景中取了平均。我们不知道是核心任务还是人们本来就不会喜欢的琐碎事务。AI 确实非常擅长做那些需要很多工作的琐碎事情。比如我老婆前几天——她有一个播客做读书讨论——一个读书俱乐部——她在把节目笔记从 Spotify 转到 YouTube——然后链接坏了。有些节目里有 100 个链接。一个一个手动修复太痛苦了。所以我建议"嘿,试试 ChatGPT。"我们把文本复制进 ChatGPT——它修好了。本来要花两小时——它让这种工作无缝了很多。我觉得每个人都有一个 AI 对于某些本来会很无聊、很琐碎的事情很有用的用例。
**Sebastian Raschka:** With inference scaling, you don't spend money during trainin g, you spend money later per query, and then it's also like math. How long is my model gonna be on the market if I replace it in half a year? Maybe it 's not worth spending $5 million, $10 million, $100 million on training it longer. Maybe I will just do more inference scaling and get performance the re. It maybe costs me $2 million in terms of user queries. It becomes a question of how many users you have and doing the math, and I think that's als o where it's interesting where ChatGPT is in a position.
**Lex Fridman:** 对我来说——既然我们在谈编程,你提到了调试——对我来说享受的来源是——我有一个朋友,一个结对编程的伙伴。没那么孤独了。你把调试描述得像是一种极大的快乐。不——我会说调试更像是在沙漠里走了好几天之后的一口水。你跳过了整个在沙漠中受苦的部分。有时候有一个朋友——虽然不一定能找到 bug——但可以给你一些关于代码的直觉——一起穿越沙漠找到那口水——至少对我来说——也许这说明了编程体验的孤独——这是一种快乐的来源。
**Lex Fridman:** I think they have a lot of users where they need to go a bit cheaper, where they have that GP T-5 model that is a bit smaller. Other companies that have... Let's say, if your customers have other trade-offs. For example, there was also the Math Olympiad or some of these math problems where ChatGPT or they had a proprietary model, and I'm pretty sure it's just like a model that has been fine- tuned a little bit more, but most of it was during inference scaling to achieve peak performance in certain tasks. need that all the time.
**Sebastian Raschka:** 这也许和延迟满足有关。我是那种——即使小时候——我更喜欢圣诞礼物的"期待"——比实际收到礼物更开心。也许食物也一样——饿的时候食物更好吃。你说得对——调试并不总是好的。经常很沮丧。但如果你能解决它——那太好了。但有一个甜蜜的最佳点——如果太难了——那就是浪费时间。但我觉得另一个挑战是——人们将如何学习?我们看了那个图表——更资深的开发者比初级开发者发布更多 AI 生成的代码。直觉上你会认为应该是初级开发者——因为他们还不知道怎么做——所以用 AI。这可能意味着 AI 还不够好——但也可能意味着专家更擅长使用它。社会未来的一个问题将是:如果你从不自己尝试——你怎么成为专家?我一直是通过自己尝试来学习的。如果你先自己试——你学得更好。如果 LLM 一直在——你真的会去努力挣扎吗?挣扎是不舒服的。但如果你用 LLM 做一切——在某个时刻你永远不会真正迈出下一步。所以我觉得有个最佳甜蜜点——也许诀窍是——你每天专门抽两个小时离线学习——剩余时间用 LLM。但我觉得人们也要继续投资于自己——不要什么都靠 LLM。
**Sebastian Raschka:** But yeah, l ong story short, I do think all of these pre-training, mid-training, post-training, inference scaling, they are all still things you want to do. It's just finding—at the moment, in this year, it's finding the right ratio that gives you the best bang for the buck, basically. - I think this might be a good place to define pre-training, mid-training, and post-training. - So, pre-training is the classic training one next token prediction at a time. Y ou have a big corpus of data. And Nathan probably also has very interesting insights there because of OLMo 3.
**Lex Fridman:** 是的。作为一个文明,我们每个人都需要找到那个最佳甜蜜点。好了,让我们进入后训练。后训练中有哪些有趣的想法?
**Lex Fridman:** A big portion of the paper focuses on th e right data mix. So, pre-training is essentially just, you know, training cross entropy loss, training on next token prediction on a vast corpus of i nternet data, books, papers and so forth. It has changed a little bit over the years in the sense people used to throw in everything they can. Now, it 's not just raw data.
**Nathan Lambert:** 2025 年最大的是 RLVR。你可以在那里扩展训练——做大量的迭代式"生成-评分"循环——让模型学到工具使用和软件方面的有趣行为。然后那个训练也非常好地支持了推理时间扩展。这种 RL 训练支持推理时间扩展。这从根本上改变了人们处理后训练的方式。
**Nathan Lambert:** It's also synthetic data where people, let's say, rephrase certain things. So synthetic data doesn't necessarily mean purely AI- made data. It's also taking something from an article, a Wikipedia article, and then rephrasing it as a Q&A question or summarizing it, rewording it, and making better data that way. Because I think of it also like with humans.
**Lex Fridman:** 你能描述一下 RLVR 吗?由 DeepSeek R1 推广的。
**Lex Fridman:** If someone, let's say, reads a book compared to a messy—no offense, but like—Reddit post or something like that, I do think you learn— - There's going to be a post about this, Sebastian. - Some Reddit data is very coveted and excellent for training. You just have to filter it. - And I think that's the idea. I think it's like if someone took that and rephrased it in a, l et's say, more concise and structured way, I think it's higher quality data that gets the LLM there faster. You get the same LLM out of it at the end, but it trains faster because if the grammar and the punctuation are correct, it already learns the correct way, versus getting information from a mes sy source and then learning later how to correct that.
**Nathan Lambert:** 有趣的事实:我当时在提出 RLVR 这个术语的团队里——来自我们的 Tulu 3 工作——在 DeepSeek 之前。我们不会太多邀功——但学术界的好处之一是能命名和影响话语——因为闭源实验室只能说那么多。然后 DeepSeek 是做了训练突破的人——他们扩展了强化学习。你让模型生成答案,然后评分——如果正确的话——那个准确率就是你强化学习的奖励。经典的强化学习是一个智能体在环境中行动——环境给它一个状态和奖励——你试图最大化奖励。在语言模型的情况下——奖励通常是在一组可验证任务上的准确率——不管是数学问题还是编程任务。核心思想是——你找到更多可验证问题——让模型尝试很多次——同时做 RL 梯度更新。基础设施从 RLHF 演化而来——在那个时代它们试图优化的分数是一个学习到的人类偏好奖励模型。所以你改变了问题领域——让优化可以在更大的规模上进行——启动了重大变化。
**Nathan Lambert:** So, I think that is how pre-training evolved and why scaling still works. It's not just about t he amount of data, it's also the tricks to make that data better for you, in a sense. And then mid-training is... I mean, it used to be called pre-tra ining.
**Lex Fridman:** RLVR 适用于哪些领域?
**Lex Fridman:** I think it's called mid-training because it was awkward to have pre-training and post-training but nothing in the middle, right? It sounds a bi t weird. You have pre-training and post-training, but what's the actual training? So, the mid-training is usually similar to pre-training, but it's a bit more specialized.
**Nathan Lambert:** 数学和代码是最有名的。然后有很多工作在"评分标准"(rubrics)上——和"LLM 作为裁判"有关。对于每个问题——我会让一个 LLM 问"好答案应该是什么样的?"然后反复尝试——根据评分标准打分。这不一定像数学和代码那样可验证——但他们试图把这套方法推到更开放的领域。
**Nathan Lambert:** It's the same algorithm, but what you do is you focus, for example, on long-context documents. The reason you don't do that duri ng pre-training is because you don't have that many long context documents. We have a specific phase. And one problem of LLMs is still that it's a neu ral network.
**Sebastian Raschka:** 我觉得那叫基于 AI 反馈的强化学习,对吧?
**Sebastian Raschka:** It has the problem of catastrophic forgetting. So, you teach it something, it forgets other things. And you wanna... I mean, it's not 100 % forgetting, but it's like "no free lunch." It's the same with humans.
**Nathan Lambert:** 那是更早的术语——出自 Anthropic 的 Constitutional AI 论文。
**Nathan Lambert:** If you ask me some math I learned 10 years ago, I would have to look at it aga in. - Nathan was actually saying that he's consuming so much content that there's a catastrophic forgetting issue. - Yeah, I'm trying to learn so much about AI, and it's like I was learning about pre-training parallelism. I'm like, "I lost something and I don't know what it was." - I don't want to a nthropomorphize LLMs, but it's the same kind of thing in how humans learn. I mean, quantity is not always better because you have to be selective. And mid-training is being selective in terms of quality content at the end.
**Sebastian Raschka:** 回到 RLVR——有趣的地方在于你给 LLM 一个数学问题——你知道正确答案——你让 LLM 自己想办法。你不太约束它。你只给问题和答案。然后 LLM 的任务是得出正确答案。美妙的地方是——LLM 会做一步步的描述——就像学生或数学家推导解答一样。这帮助模型提高准确率。然后推理扩展——模型会使用更多 token。在 DeepSeek R1 论文中——训练时间越长——回应越长。对简单任务来说变贵了——但这些解释帮助了准确率。也有论文显示——模型解释的内容不一定要正确——但不知道为什么仍然有帮助。我不想拟人化 LLM——但有点像人类——如果有一个复杂的数学问题——你通常有一张草稿纸——一步步来。模型也自我修正——那就是 DeepSeek R1 论文中的"顿悟时刻"——模型自己认识到犯了错误说"啊,我做错了,让我再试一次。"太酷了——只是给它正确答案让它自己搞清楚——它做了人类会做的事情。虽然 LLM 不像人类那样思考——但这种有趣的巧合……对我们人类来说——看到这些步骤很好——建立信任——也能学习和复查。
**Sebastian Raschka:** So the last thing the LLM has seen is the quality stuff. And then post-traini ng is all the fine-tuning, supervised fine-tuning, DPO, Reinforcement Learning with Verifiable Rewards (RLVR), with human feedback, and so forth. So t he refinement stages. And it's also interesting, it's a cost thing.
**Nathan Lambert:** 我觉得那些顿悟时刻有点"假"——因为在预训练中你已经看过了整个互联网。RLVR 非常擅长放大这些行为——因为它们对让模型更长时间思考、检查自己的工作非常有用。
**Nathan Lambert:** You spend a lot of money on pre-training right now. RL a bit less. With RL, you do n't really teach it knowledge. It's more like unlocking the knowledge; it's more like a skill learning, like how to solve problems with the knowledge that it has from pre-training.
**Sebastian Raschka:** 我也给一个实操例子。我用 RLVR 在 MATH-500 上训练 Qwen 3 基础模型。基础模型准确率约 15%。只用了 50 步——几分钟——从 15% 到了 50%。你不能告诉我它在这里面学到了数学基础知识——
**Sebastian Raschka:** There are actually three papers this year, or last year, 2025, on RL for pre-training. But I don't think anyone does th at in production. - Toy examples for now. - Toy examples, right? But to generalize, RL post-training is more like the skill unlock, where pre-training is like soaking up the knowledge. - A few things that could be helpful. A lot of people think of synthetic data as being bad for training the models.
**Nathan Lambert:** Qwen 的例子很奇怪——有关于 Qwen 数据污染的论文。他们训练的问题几乎和 MATH 数据集一模一样。
**Nathan Lambert:** You mentioned how DeepSeek got almost... OCR, which is Optical Character Recognition. A lot of labs did it. Ai2 had one, Meta had multiple. And the r eason each of these labs has these is because there are vast amounts of PDFs and other digital documents on the web that aren't in formats that are en coded with text easily. So you use these Almost-OCR, DeepSeek OCR, or what we called our Almost-OCR, to extract trillions of tokens of candidate data for pre-training. Pre-training dataset size is measured in trillions of tokens. Smaller models from researchers can be something like five to 10 trill ion. researchers can be something like five to 10 trillion. Um, Qwen is documented going up to 50 trillion, and there are rumors that these closed lab s can go to 100 trillion tokens.
**Sebastian Raschka:** 对。RL 没有教模型新的数学知识。你不可能在 50 步里做到。知识已经在预训练中了——你只是在解锁它。
**Sebastian Raschka:** Getting this potential data to put in—they have a very big funnel, and the data you actually train on is a small perc entage of this. This character recognition data would be described as synthetic data for pre-training in a lab. And then there's also the fact that Ch atGPT now gives wonderful answers, and you can train on those best answers, and that's synthetic data. It's very different than early ChatGPT with lot s of hallucination data. when people became grounded in synthetic data. - One interesting question is, if I recall correctly, OLMo 3 was trained with less data than specifically some other open-weight models, maybe even OLMo 2.
**Nathan Lambert:** 我仍然不同意这个前提——有很多奇怪的复杂性。如果你换了数字但保持文字不变——Qwen 不用工具就会产生一个非常高精度的小数答案——意味着在某个时刻它见过几乎和测试集一样的问题。这在研究社区一直是大争论——你能相信多少?我觉得这就是 RLVR 被认为"只是关于格式"的声誉由来。但这里有很多复杂性——不是受控实验——我们不知道。
**Nathan Lambert:** But you still got better performance, and that might be one example of h ow the data helped. - It's mostly down to data quality. I think if we had more compute, we would train for longer. I think we'd ultimately see that as something we would want to do. And especially with big models, you need more compute, because we talked about having more parameters and we talked ab out knowledge.
**Sebastian Raschka:** 但如果不是这样——蒸馏不就不能工作了吗?这是最大的问题——因为我们不知道数据里有什么。除非你有新数据集——真的不可能。同样——像 MMLU——一个多选基准——如果你稍微改一下格式——模型准确率会有很大差异。
**Sebastian Raschka:** Essentially, there's a ratio where big models can absorb more from data, and then you get more benefit out of this. Any logarithmic gra ph in your mind is like a small model will level off sooner if you're measuring tons of tokens, and bigger models need more. But mostly, we aren't tra ining that big of models right now at AI2, and getting the highest quality data we can is the natural starting point. - Is there something to be said about the topic of data quality? Is there some low-hanging fruit there still where the quality could be improved? - It's like turning the crank.
**Nathan Lambert:** 那可能是模型层面的问题。
**Nathan Lambert:** Histo rically, in the open, there's been a canonical best pre-training dataset that has moved around between who has the most recent one or the best recent effort. Like AI2's Dolma was very early with the first OLMo, and Hugging Face had FineWeb. And there's a DCLM project, which has been kind of like a, which stands for Data Comp Language Model. There's been Data Comp for other machine learning projects, and they had a very strong dataset.
**Sebastian Raschka:** 这甚至不是 LLM 开发者的恶意。它在某个时刻见过什么东西。公平评估 LLM 的唯一方式是用一个在 LLM 部署后截止日期之后的新基准。
**Sebastian Raschka:** And a lot o f it is the internet is becoming fairly closed off, so we have Common Crawl, which is hundreds of trillions of tokens, and you filter it. It looks lik e scientific work where you're training classifiers and making decisions based on how you prune down this dataset into the highest quality stuff and t he stuff that suits your tasks. Previously, language models were tested a lot more on knowledge and conversational things, but now they're expected to do math and code. To train a reasoning model, you need to remix your whole dataset.
**Lex Fridman:** 我们能不能列出后训练的完整配方?你提到 RLVR 很有效。RLHF 仍然重要。还有什么?
**Lex Fridman:** And there's a lot of wonderful scientific methods here where you can, you can take your gigantic dataset, sample really tiny things from different sources, such as GitHub, Stack Exchange, Reddit, Wikipedia. You can sample small things from them, and train small models on each of these mixes and measure their performance on your evaluations. You can just do basic linear regression, and it's like, "Here's your optimal dataset." But if your evaluations change, your dataset changes a lot. So a lot of OLMo 3 was ne w sources for reasoning to be better at math and code, and then you do this mixing procedure and it gives you the answer.
**Nathan Lambert:** 你可以按顺序来看。从中期训练开始——据传让 o1 成为可能的是非常精心的数据策划——提供推理轨迹。在中期训练中你需要类似的数据——这样后训练中的 RLVR 才能学习。然后你弄清给模型什么问题、能训练多长。随着模型变好——某些问题就100%解对——没信号了。如果看 GRPO 方程——奖励是基于一个补全相对于同一问题其他答案有多好。所有答案一样就没信号。所以他们在找更难的问题——科学领域、更难的软件问题。RLHF 仍然是画龙点睛——改善风格或语气让模型更有用。这就是让 ChatGPT 如此神奇的东西。总结——中期训练给模型技能。RLVR 让模型多次尝试做试错学习。RLHF 完善模型让它易于使用。
**Nathan Lambert:** I think that's happened at l abs this year; there's new hot things, whether it's coding environments or web navigation, and you need to bring in new data, change your whole pre-tr aining so that your post-training can work better. And that's like the constant evolution and the redetermining of what they care about for their mode ls. - Are there fun anecdotes of what sources of data are particularly high quality that we wouldn't expect? You mentioned Reddit sometimes can be a s ource. - Reddit was very useful. I think PDFs is definitely one. - Oh, especially arXiv. - Yeah, so AI2 has run Semantic Scholar for a long time, whic h is what you can say is a competitor to Google Scholar with a lot more features.
**Lex Fridman:** RLVR 需要多少算力?
**Lex Fridman:** And to do this, AI2 has found and scraped a lot of PDFs for openly a ccessible papers that might not be behind the closed walled garden of a certain publisher. So, truly open scientific PDFs. And if you sit on all of th ese and you process it, you can get value out of it. And I think that like, a lot of that style of work has been done by the frontier labs did much ea rlier.
**Nathan Lambert:** 只会越来越多。Ilya Sutskever 说过他们在预训练和后训练上使用类似的算力。预训练是计算密集型的。RL 因为在生成答案——变得更内存密集。在预训练中所有 GPU 互相通信——效率极高。而 RL 有很多移动部分——生成 10 万 token 可能很长时间。在 GPU 小时方面——RL 运行可能接近预训练的天数——但不会同时用那么多 GPU。有经验法则——不想让预训练超过一个月——因为会灾难性失败。GPT-4 是终极的"一把梭"运行——三个月——所有人震惊它成功了。现在人们更谨慎了。
**Nathan Lambert:** You just need to have a pretty skilled researcher that understands how things change models; they bring it in, clean it, and it's a lot of labo r. When frontier labs scale researchers, a lot more goes into data. If you join a frontier lab and you want to have impact, the best way to do it is j ust find new data that's better. And then, the fancy, glamorous algorithmic things like figuring out how to make o1 is like the sexiest thought of a s cientist.
**Sebastian Raschka:** RLVR 更不受限——可以训练多长仍然获益。而 RLHF 是偏好调优——达到一定程度就没意义继续了。我最喜欢的例子——如果亲戚问买什么笔记本——我给解释——他们可能优先电池和存储——我们可能优先内存和算力。两个答案都对——但不同人需要不同答案。偏好调优试图取平均——到了某个点你学到了那个平均。没理由继续。而 RLVR 让模型解决越来越难的问题。现在是 RLVR 1.0——简单的问答——不对中间做任何事。有论文关于过程奖励模型——给解释过程打分。我觉得那将是 RLVR 2.0。还有 DeepSeek Math-V2 论文——有有趣的推理扩展——开发了自我评分的模型。然后 RLVR 会向其他领域拓展。
**Sebastian Raschka:** It's like, "Oh, I figured out how to scale RL." There's a group that did that, but most of the contribution is like— - On the dataset - ..." I'm gonna make the data better," or, "I'm gonna make the infrastructure better so everyone on my team can run experiments 5% faster." - At the same ti me, I think it's also one of the closest guarded secrets, what your training data is, for legal reasons. And so there's also, I think, a lot of work t hat goes into hiding what your training data was essentially. Like training the model to not give away the sources because you have legal reasons. - T he other thing, to be complete, is that some people are trying to train on only licensed data, whereas Common Crawl is a scrape of the whole internet. So if I host multiple websites, I'm happy to have them train language models, but I'm not explicitly licensing what governs it.
**Nathan Lambert:** 人们兴奋的是价值函数——和过程奖励模型类似。过程奖励模型为推理的每个中间步骤打分——价值函数为每个 token 赋值。两者在语言建模时代基本未被证明。人们对价值函数更乐观了——过程奖励模型之前被尝试过很多但磕得头破血流。简单总结——你不想做太多 RLHF——信号不扩展。o1 有一个扩展图——对数增加训练算力——评估线性增长。已被多次复现。但 RLHF 没有这样的扩展定律。RLHF 的开创性扩展论文是"奖励模型过度优化的扩展定律"。所以 RLVR 和 RLHF 之间有一条大界线。做最好的 RLHF 可能不需要额外 100 倍算力——但做最好的 RLVR 你需要。Meta 有一篇关于"扩展语言模型强化学习的艺术"的论文——他们的增量实验需要 10000 V100 小时——每个实验数千到数万美元——普通学术界承受不了。
**Nathan Lambert:** And therefore, Common Crawl is largely unlicensed, which means that your consent really hasn't been provided for how to use the data. There's another idea where you can tr ain language models only on data that has been licensed explicitly, so that the kind of governing contract is provided, and I'm not sure if Apertus is the copyright thing or the license thing. I know that the reason that they did it was for an EU compliance thing, where they wanted to make sure that their model fit one of those checks. - On that note, there's also the distinction in licensing. Some people just purchase the license.
**Nathan Lambert:** 我希望 AI 出现的时机跟大科技公司与普通人的关系处于不同的阶段。大科技公司的声誉那时已经跌到谷底,而 AI 又那么烧钱,它注定是一个大科技公司的事。需要那么多资源,人们说美国是在"把国运押在 AI 上"来搞这个基建。这两件事同时交织在一起,沟通环境变得非常困难。我应该多去跟那些讨厌大科技公司、把 AI 视为大科技延续的人聊聊。
**Nathan Lambert:** Let's say they buy an Amazon Kindle book, or a Manning book, and then use that in training. That is a gray zone 'cause you paid for the content and you might want t o train on it. But then there are also restrictions where even that shouldn't be allowed. And so that is where it gets a bit fuzzy.
**Lex Fridman:** 你推荐的一个解药是——在这个系统中找到能动性(agency)。不是无力地坐在那里消费 AI 内容垃圾,看着它迅速占领互联网。而是通过使用 AI 来构建东西——构建 app、构建各种东西——来找到能动性。第一,这确实帮你建立直觉;第二,这赋予你力量,因为你能理解它怎么运作、弱点在哪里。它让你有话语权去说"这是技术的坏用法,那是好用法。"而且你更深入地接入了这个系统,所以你能更好地理解它,作为消费者也能更好地引导它。
**Lex Fridman:** And yeah, I think that is right now still a hot topic. Big companies like OpenAI approached private companies for their proprietary data and private companies, they bec ome more and more, let's say, protective of their data because they know, "Okay, this is going to be my moat in a few years." And I do think that's li ke the interesting question, where if LLMs become more commoditized, and I think a lot of people learn about LLMs, there will be a lot more people abl e to train LLMs. Of course, there are infrastructure challenges. But if you think of big industries like pharmaceutical industries, law, finance indus tries, I do think they, at some point, will hire people from other frontier labs to build their in-house models on their proprietary data, which will be then, again, another unlock with pre-training that is currently not there.
**Sebastian Raschka:** 我觉得你提的"能动性"这一点很好。与其忽视它说"好吧,我不用了",我觉得从长远来看更健康的做法是说"好吧,它已经出来了。我收不回去了。"怎么最好地利用它,怎么让它帮我提升自己?但我在这里担心的一件事是,如果你把它完全用于你热爱做的事情,那个你热爱的事情就不复存在了。这可能导致倦怠。比如说,如果我用 LLM 帮我做所有的编程,那就没有编程了。我只是在管理一个替我编程的东西。两年后,假设我每天八小时就干这个——让某个东西替我编程——我还会觉得充实吗?这会不会伤害我对工作的兴奋感?我还会为构建什么东西感到自豪吗?
**Sebastian Raschka:** Because even if you wanted to, you can't get that data. You can't get ac cess to clinical trials most of the time and these types of things. So, I do think scaling, in that sense, might be still pretty much alive if you als o look in domain-specific applications, because we are still right now, in this year, just looking at general purpose LLMs on, on ChatGPT, Anthropic a nd so forth. They are just general purpose, they're not even, I think, scratching the surface of what an LLM can do if it is really specifically train ed and designed for a specific task. - I think on the data thing—this is one of the things that happened in 2025, and we totally forget it—is Anthropi c lost in court and owed $1.5 billion to authors.
**Lex Fridman:** 说到享受这个话题,很有趣。我们应该在这里提一下,最近有一项调查,大约 791 名专业开发者,就是有 10 年以上经验的。
**Lex Fridman:** Anthropic, I think, bought thousands of books and scanned them and was cleared legally for that beca use they bought the books, and that is kind of going through the system. And then the other side, they also torrented some books, and I think this tor renting was the path where the court said that they were then culpable to pay these billions of dollars to authors, which is just such a mind-boggling lawsuit that kind of just came and went. That is so much money- ... from the VC ecosystem. - These are court cases that will define the future of hum an civilization, because it's clear that data drives a lot of this, and there's this very complicated human tension of... I mean, you can empathize.
**Sebastian Raschka:** 那可是很长时间了。作为初级开发者的话?
**Sebastian Raschka:** Y ou're both authors. And there's some degree to which, I mean, you put your heart and soul and your sweat and tears into the writing that you do. It fe els a little bit like theft for somebody to train your data without giving you credit. - And there are, like Nathan said, also two layers to it. Someo ne might buy the book and then train on it, which could be argued fair or not fair, but then there are the straight-up companies who use pirated books where it's not even compensating the author.
**Nathan Lambert:** 对,在现在这个年代。而且有很多方面令人惊讶。他们按初级和高级开发者做了细分。但基本上就是——初级和高级开发者都在他们发布的代码中使用 AI 生成的代码。这不只是好玩或者中间学习用的。这是他们发布的代码。25%——大多数人使用 50% 或更多。有趣的是,在"超过 50% 的发布代码是 AI 生成"这个类别中,高级开发者更有可能这样做。但你不想让 AI 夺走你热爱的东西。我觉得这跟我的经验一致——我接下来要说的这些数据。总体来看,大约 80% 的人觉得使用 AI 作为工作的一部分要么"更享受一些"要么"明显更享受"。
**Nathan Lambert:** That is, I think, where people got a bit angry about it specifically, I would say. - Yeah, but there has to be some kind of compensation scheme. This is like moving towards- ... towards something like Spotify streaming did- ... originally for music. You know, what does that- ... compensation look like? You have to define those kinds of models.
**Sebastian Raschka:** 我觉得这取决于任务。从我个人使用来说,比如我有一个网站,偶尔需要调整一些东西。我个人不喜欢这种事。所以在这个意义上,如果 AI 能帮我在网站上实现什么,我完全支持。太好了。但同时,当我解决一个复杂问题——比如有个 bug,我去追这个 bug,找到了——那是世界上最棒的感觉。你感觉太好了。但如果你根本不思考这个 bug,直接去找 LLM,那你就永远没有这种感觉了,对吧?但也有个中间地带:你自己先试,找不到,然后用 LLM,这样你不会太沮丧,因为它帮了你,然后你继续做你享受的事情。所以看这些统计数据,我觉得没有被考虑进去的是——它是在所有不同场景上取的平均值。我们不知道这是核心任务还是那些人们本来就不享受的琐碎事情。所以在某种意义上,AI 对于做那些费力的琐碎事情真的很棒。比如,我老婆前几天——她有一个讨论书的播客,一个读书会——她在把节目笔记从 Spotify 转到 YouTube,然后链接不知怎么就断了。有些集因为讲了太多书,有大概 100 个链接,一个一个手动修复会非常痛苦。所以我建议说"嘿,我们试试 ChatGPT。"我们把文字复制到 ChatGPT 里,它修好了。本来要花两个小时一个链接一个链接地修,它让这类工作变得无比顺畅。我觉得每个人都有一个 AI 在某些事情上很有用的用例——那些本来会非常无聊、非常琐碎的事情。
**Sebastian Raschka:** You have to think through all of that. One other thing I t hink people are generally curious about, I'd love to get your thoughts, as LLMs are used more and more. If you look at even arXiv, but GitHub, more an d more of the data is generated by LLMs. What do you do in that kind of world? How big of a problem is that? - Largest problem's the infrastructure an d systems, but from an AI point of view, it's kind of inevitable. - So it's basically LLM-generated data that's curated by humans essentially, right? - Yes, and I think that a lot of open source contributors are legitimately burning out.
**Lex Fridman:** 对我来说,既然我们在说编程,你提到了调试……我的快乐源泉,更多是在 Cursor 那边而不是 Claude Code——是我有一个朋友,有一个结对编程的伙伴。没那么孤独了。你把调试描述成这种巨大的快乐。不是的,我觉得调试就像你在沙漠里走了好几天之后的一口水。你跳过了你在沙漠里受苦的整个部分。有时候有个朋友虽然也找不到 bug,但能给你一些关于代码的直觉,你们一起穿越沙漠找到那口水,挺好的。至少对我来说——也许这说明了编程体验的孤独——这是一种快乐的来源。
**Lex Fridman:** If you have a popular open source repo, somebody's like, "Oh, I want to do open source AI. It's good for my career," and they just vibe- -code something and they throw it in. You might get more of this- - I have a- - - than I do. - Yeah, so I have actually a case study here. I have a repository called MLxtend that I developed as a student around 10 years ago, and it is a reasonably popular library still for certain algorithms, I think especially like frequent data mining stuff.
**Sebastian Raschka:** 这也许跟延迟满足有关。我是那种——即使小时候——我更喜欢圣诞礼物的想法——拥有它们、得到它们——比实际收到礼物的那一刻更让我兴奋。我会期待得到礼物的那天,但到了就结束了,我会失望。也许食物也一样。我觉得你真正饿的时候食物更好吃。你说得对,调试不是总那么好。它经常令人沮丧,但如果你能解决它,那就太棒了。不过有一个甜蜜的 Goldilocks 区间——如果太难了,那就是浪费时间。但我觉得另一个挑战是人们怎么学习。我们看了图表,发现更资深的开发者比初级的发布更多 AI 生成代码。这非常有趣,因为直觉上你会觉得应该是初级开发者——因为他们还不知道怎么做,所以用 AI 来做。这可能意味着 AI 还不够好到能完成那个任务,但也可能意味着专家更擅长使用它。他们知道怎么用,知道怎么审查代码,然后更信任这些代码。未来社会面临的一个问题是:如果你从不自己动手尝试,你怎么成为专家?我一直是通过自己尝试来学习的。如果你看数学教科书和答案——你能学到东西,但如果你先自己试了再看答案,你学得更好。因为你知道怎么把它放到你的心智框架里。如果 LLM 一直在那里,你真的会去经历那种挣扎吗?你愿意挣扎吗?挣扎不好受,对吧?但如果你用 LLM 做所有事情,到某个时候你就永远不会真正迈出下一步,然后你可能永远得不到那种作为专家使用 LLM 时才能获得的突破。所以我觉得有一个 Goldilocks 甜蜜点——也许诀窍是你每天留出专门的离线时间学习两小时,剩下的时间用 LLM。但我觉得人们还是要投资于自身——以我看来——不要什么都交给 LLM。
**Sebastian Raschka:** And there were recently two o r three people who submitted a lot of PRs in a very short amount of time. I do think LLMs have been involved in submitting these PRs. Me, as the maint ainer, there are two things. First, I'm a bit overwhelmed. I don't have time to read through it because, especially as an older library, that is not a priority for me. At the same time, I kind of also appreciate it because I think something people forget is it's not just using the LLM.
**Lex Fridman:** 是的,作为一个文明,我们每个人都得找到那个 Goldilocks 区间。在编程的语境下也一样。那,我们之前有过精彩的讨论,从预训练和中期训练开始。让我们进入后训练。后训练有很多有趣的东西。那么,后训练中有哪些有趣的想法?
**Lex Fridman:** There's still a human layer that verifies something, and that is in a sense also how data is labeled, right? One of the most expensive things is getting labeled da ta for RL from human feedback phases. And this is kind of like that, where it goes through phases, and then you actually get higher quality data out o f it. And so I don't mind it in a sense.
**Nathan Lambert:** 2025 年最大的一个是学习用可验证奖励的强化学习(RLVR)。你可以扩展那里的训练,这意味着做大量这种迭代的"生成-评分"循环,这让模型在工具使用和软件方面学到有趣的行为。这可以是搜索、自己运行命令并看到输出,然后这种训练也非常好地支持推理时间扩展。结果表明这个范式之间有非常漂亮的关联——这种 RL 训练使推理时间扩展成为可能。虽然推理时间扩展本可以通过其他方式被发现。所以是一种完美风暴——模型发生了巨大变化,训练方式是其中的主要因素。这彻底改变了人们对后训练的理解。
**Nathan Lambert:** It can feel overwhelming, but I do think there is also value in it. - It feels like there's a fundamental dif ference between raw LLM-generated data and LLM-generated data with a human in the loop that does some kind of verification, even if that verification is a small percent of the lines of code. - I think this goes with anything where people think also sometimes, "Oh, yeah. I can just use an LLM to lear n about XYZ," which is true. You can, but there might be a person who is an expert who might have used an LLM to write specific code. There is kind of like this human work that went into it to make it nice, throwing out the not-so-nice parts to kind of pre-digest it for you, and that saves you time.
**Lex Fridman:** 你能描述一下 RLVR 吗?由 DeepSeek R1 推广开来的。它是怎么工作的?
**Lex Fridman:** I think that's the value-add, where you have someone filtering things or even using the LLMs correctly. This is still labor that you get for free. Fo r example, if you read a Substack article, I could maybe ask an LLM to give me opinions on that, but I wouldn't even know what to ask. And I think the re is still value in reading that article compared to me going to the LLM because you are the expert.
**Nathan Lambert:** 好的。有个趣事:我就在提出 RLVR 这个术语的团队里,来自我们 Tulu 3 的工作,在 DeepSeek 之前。我们不会过多邀功说自己是推广规模化 RL 的人,但作为学术界能得到的有趣东西——顺便说一句——是命名和影响话语的能力,因为闭源实验室能说的有限。你作为学术界能做的一件事是——你可能没有算力来训练模型,但你可以用一种方式来框架问题,最终形成一个社区围绕 RLVR 这个术语聚集起来的描述,这非常有趣。然后 DeepSeek 是做了训练突破的人——他们扩展了强化学习。你让模型生成答案,然后评分补全是否正确,准确率就是你强化学习的奖励。经典的强化学习是一个智能体在环境中行动,环境给它状态和奖励反馈,你试图最大化奖励。对于语言模型来说,奖励通常是在一组可验证任务上的准确率——无论是数学题还是编程任务。到了像事实性领域这种东西就变得模糊了。从某些角度来说那也是可验证的,或者指令上的约束,比如"只用以 A 开头的单词回答。"所有这些东西在某种意义上都是可验证的。核心思想是你找到大量可验证的问题,让模型多次尝试,同时进行这些 RL 梯度更新。基础设施是从基于人类反馈的强化学习(RLHF)进化而来的——在那个时代,它们试图优化的分数是一个学习到的人类偏好奖励模型。所以你改变了问题领域,这让优化能达到更大的规模,从根本上启动了模型能力和人们使用方式的重大变化。
**Nathan Lambert:** You select what knowledge is actually spot on, s hould be included, and you give me this very... this executive summary. And this is a huge value-add because now I don't have to waste three to five h ours to go through this myself, maybe get some incorrect information and so on. And so I think that's also where the future still is for writers even though there are LLMs that... Can kind of save you time. - It's kinda fascinating to watch.
**Lex Fridman:** RLVR 适用于什么类型的领域?
**Lex Fridman:** I'm sure you guys do this, but for me, I look at the diffe rence between a summary and the original content. Even if it's a page-long summary of a page-long content, it's interesting to see how the LLM-based s ummary takes the edge off. What is the signal it removes from the thing? - The voice is what I talk about a lot. - Voice? Well, voice...
**Nathan Lambert:** 数学和代码是最出名的。然后有很多关于"评分标准"(rubrics)的工作,这跟人们可能听过的一个词有关:LLM-as-a-judge(LLM 做评委)。对数据集中的每个问题,我会有一个 LLM 来问"这个问题的好答案应该是什么样的?"然后你可以反复尝试这个问题,根据评分标准打分。这不像数学和代码领域那样严格可验证,但这个评分标准的想法以及其他可能更模糊的科学问题是目前关注的焦点——他们试图把这套方法推广到更开放的领域,让模型学到更多。
**Nathan Lambert:** I would love to hear what you mean by voice, but sometimes there's literally insights. By removing an insight, you're changing the meaning of the thing. So I'm con tinuously disappointed how bad LLMs are at really getting to the core insights, which is what a great summary does. Yet even if I have these extremely elaborate prompts where I'm really trying to dig for the insights, it's still not quite there, which...
**Sebastian Raschka:** 我觉得那叫基于 AI 反馈的强化学习(reinforcement learning with AI feedback),对吧?
**Sebastian Raschka:** I mean, that's a whole deep philosophical que stion about what is human knowledge and wisdom and what does it mean to be insightful and so on. But when you talk about the voice, what do you mean? - When I write, I think a lot of what I'm trying to do is take what you think as a researcher, which is very raw. A researcher is trying to encapsulat e an idea at the frontier of their understanding and they're trying to put what is a feeling into words. I try to do this in my writing, which makes i t come across as raw but also high-information in a way that some people will get it and some won't. And that's the nature of research. And language m odels don't do this well. They're all trained with Reinforcement Learning from Human Feedback, which takes feedback from many people and averages how the model behaves from this.
**Nathan Lambert:** 那是更早期的术语,在 Anthropic 的 Constitutional AI 论文中提出的。很多东西都是周期性的。
**Nathan Lambert:** And I think it's going to be hard for a model to be very incisive when there's that sort of filter. This is a wonderful f undamental problem for researchers in RLHF. This provides so much utility in making the models better, but also the problem formulation is kind of... there's this knot in it that you can't get past. These language models don't have this prior in their deep expression they're trying to get at.
**Sebastian Raschka:** 另外,回到 RLVR 退一步来说。我觉得这里有趣的是——你给 LLM 一个数学问题,你知道正确答案,然后你让 LLM 自己去搞清楚怎么做。它怎么做——你不怎么限制它。有一些约束比如"用同一种语言,不要在西班牙语和英语之间切换。"但基本上你是放手的。你只给问题和答案,然后 LLM 的任务是得出正确答案。但这里美妙的是在实践中发生了什么:LLM 会做一个逐步的描述——就像一个学生或数学家推导解法一样。它会用这些步骤,这帮助模型提升自己的准确率。然后,就像你说的,推理扩展。推理扩展大致意味着在推理时花更多计算。这里的推理扩展是模型会使用更多 token。在 DeepSeek R1 论文中,他们展示了训练时间越长,回复就越长。随时间增长。使用更多 token,所以变得更贵。对简单任务来说变贵了,但这些解释有助于准确率。也有论文显示模型解释的内容不一定必须是正确的,或者可能跟答案无关,但不知道为什么,模型去解释这件事本身仍然对它有帮助。我觉得这也是——我不想把这些 LLM 拟人化——但有点像我们人类的运作方式。如果课堂上有一个复杂的数学问题,你通常会有一张草稿纸,一步一步做。划掉一些东西。模型也会自我纠正,我觉得这就是 DeepSeek R1 论文中的"顿悟时刻"(aha moment)。他们叫它顿悟时刻,因为模型自己认识到它犯了一个错误,然后说"啊,我做错了,让我重试。"我觉得这太酷了——仅仅是给它正确答案、让它自己去想怎么做——它竟然在某种意义上做了人类会做的事情。虽然 LLM 的思考方式跟人类不同,但这是一个有趣的巧合,而且……还有一个好的副作用是——对我们人类来说,看到这些步骤很好。它建立信任,同时我们可以学习和复查。
**Sebastian Raschka:** I don' t think it's impossible. There are stories of models that really shock people. Like, I think of... I would love to have tried Bing Sydney.
**Nathan Lambert:** 这里面有很多东西。我觉得今年有很多辩论——关于语言模型中这些……我认为那些"顿悟时刻"某种程度上是假的,因为在预训练中你基本上看过了整个互联网,所以你肯定见过人们解释自己工作的内容,甚至是口头的,比如数学讲座的转录稿。"你试试这个,哦,我搞砸了。"RLVR 非常擅长做的是放大这些行为,因为它们对于让模型思考更长时间和检查自己的工作非常有用。我同意这是非常美妙的——这种训练让模型学会以一种对最终答案非常有帮助的方式来放大这些行为。
**Nathan Lambert:** Did that ha ve more voice? Because it would so often go off the rails, which is historically obviously a scary way—like telling a reporter to leave his wife—is a crazy model to potentially put in general adoption. But that's kind of like a trade-off; is this RLHF process in some ways adding limitations? - That' s a terrifying place to be as one of these frontier labs and companies, because millions of people are using them. - There was a lot of backlash last year with GPT-4o getting removed, and I've personally never used the model, but I've talked to people at OpenAI where they get emails from users that might be detecting subtle differences in the deployments in the middle of the night. And they email them like, "My friend is different." And they find these employees' emails and send them things because they are so attached to this set of model weights and configuration that is deployed to the user s.
**Sebastian Raschka:** 我还能给你一个实操的例子。我在用 Qwen 3 base model 做 RLVR,在 MATH-500 上。基础模型准确率大约 15%。只用了 50 步——几分钟的 RLVR——模型从 15% 涨到了 50%。而且那个模型……你不能说它在那几步里学到了什么根本性的数学知识——
**Sebastian Raschka:** We see this with TikTok. You open it... I don't use TikTok, but supposedly in five minutes the algorithm gets you. It's locked in.
**Nathan Lambert:** Qwen 这个例子比较奇怪,因为今年有两篇论文——其中一篇我参与了——关于 Qwen 的数据污染。特别是他们在一个特殊的中期训练阶段训练了大量内容——我们应该花一分钟谈谈这个——因为它很奇怪,他们训练了几乎跟 MATH 完全一样的问题。
**Nathan Lambert:** And those are la nguage models doing recommendations. I think there are ways you can do this. Within five minutes of chatting with it, the model just gets you. And tha t is something that people aren't really ready for.
**Sebastian Raschka:** 没错。所以你可以看到基本上 RL 并不是在教模型任何关于数学的新知识。你不可能在 50 步里做到那个。所以知识已经在那里了,在预训练里面,你只是在解锁它。
**Sebastian Raschka:** Like, don't give that to kids. At least until we know what's happening. - But there's also going t o be this mechanism... What's going to happen with these LLMs as they're used more and more... Unfortunately, the nature of the human condition is suc h that people commit suicide.
**Nathan Lambert:** 我仍然不同意这个前提,因为有很多奇怪的复杂性你证明不了。有一个指向"奇怪"的证据是——如果你拿 Qwen 3 所谓的基础模型,然后你……你可以 Google 搜"math dataset, Hugging Face",然后拿一个问题。你把它输入 Qwen 3 base 的时候……这些数学问题都有文字描述,比如"Alice 有五个苹果,拿走一个……给了三个给谁谁谁"——这些文字题。为什么人们对这些基于 Qwen 的模型有怀疑——如果你改变数字但保留文字——Qwen 会在不使用工具的情况下产出一个精度非常高的十进制答案。这意味着在某个时间点,它被展示了几乎跟测试集一样的问题,它用工具得到了一个高精度答案,但一个没有工具的语言模型永远不可能做到这个。所以在研究社区一直有这个大辩论:这些在 Qwen 上训练、特别在这个数学基准上测量的强化学习论文——已经有多篇论文讨论过污染——到底有多少你能信?我觉得这就是为什么 RLVR 获得了"只是格式化"这个名声——因为你能那么快获得这些增益,所以它一定已经在模型里了。但这里面有很多复杂性……不是真正的对照实验,所以我们并不真的知道。
**Nathan Lambert:** And so what journalists will do is they will report extensively on the people who commit suicide. And they would very li kely link it to the LLMs because they have that data about the conversations. If you're really struggling in your life, if you're depressed, if you're thinking about suicide, you're going to probably talk to LLMs about it. And so what journalists will do is say, "Well, the suicide was committed beca use of the LLM." And that's going to lead to the companies, because of legal issues and so on, more and more taking the edge off of the LLM.
**Sebastian Raschka:** 但如果那不是真的,我会说蒸馏(distillation)就不会起作用了,对吧?我的意思是,蒸馏在一定程度上确实能工作。但这确实是最大的问题——我研究这个污染问题——因为我们不知道数据里有什么。除非你有一个全新的数据集,否则真的不可能知道。同样的,你提到的那个数学数据集——有问题、答案和解释——但即使是更简单的东西像 MMLU,一个多选基准。如果你只是稍微改变格式——比如用句号代替括号什么的——模型准确率会大幅不同。
**Sebastian Raschka:** So it's g oing to be as generic as possible. It's so difficult to operate in this space because you don't want an LLM to cause harm to humans at that level, but also this is the nature of the human experience, is to have a rich conversation, a fulfilling conversation, one that challenges you from which you gr ow. You need that edge. And that's something extremely difficult for AI researchers on the RLHF front to actually have to solve, because you're dealin g with the human condition. - A lot of researchers at these companies are so well-motivated, and definitely Anthropic and OpenAI culturally want to do good for the world.
**Nathan Lambert:** 我觉得那可能是一个模型层面的问题,而不是一般性的问题。
**Nathan Lambert:** And it's such a—I'm like, "Ooh, I don't wanna work on this," because on the one hand, a lot of people see AI as a health ally, as somebody they can talk to about their health confidentially, but then it bleeds all the way into this, like talking about mental health, where it's h eartbreaking that this will be the thing where somebody goes over the edge, but other people might be saved. And I'm like, "I don't..." As a researche r, it's like, I don't want to train image generation models and release them openly because I don't want to enable somebody to have a tool on their la ptop that can harm other people. I don't have the infrastructure in my company to do that safely. But there's a lot of areas like this where it just n eeds people that will approach it with complexity and conviction.
**Sebastian Raschka:** 这甚至不是 LLM 开发者故意的——不是说"嘿,我们要在那个基准上作弊。"只是它在某个时候见过了某些东西。我觉得评估 LLM 唯一公平的方式是——用一个在 LLM 部署后截止日期之后的新基准。
**Sebastian Raschka:** It's just such a hard problem. - But also, we as a society, as users of these techno logies, need to make sure that we're having the complicated conversation about it versus just fearmongering— that Big Tech is causing harm to humans o r stealing your data. It's more complicated than that. And you're right. There's a very large number of people inside these companies, many of whom I know, who deeply care about helping people.
**Lex Fridman:** 我们能不能列出后训练的完整配方——所有涉及的组成部分?你提到 RLVR 是一个非常令人兴奋、有效的东西。也许我们应该展开说说。RLHF 仍然扮演着非常重要的角色。后训练中还有什么其他想法?
**Lex Fridman:** They are considering the full human experience of people from across the world, not just Silicon Valley. P eople across the United States and the world, what their needs are. It's really difficult to design this one system that is able to help all these dif ferent kinds of people across different age groups, cultures, and mental states. - I wish that the timing of AI was different relative to the relation ship of Big Tech to the average person. Big Tech's reputation was so low, and with how AI is so expensive, it's inevitably going to be a Big Tech thin g.
**Nathan Lambert:** 我希望 AI 出现的时机跟大科技公司与普通人的关系处于不同的阶段。大科技公司的声誉那时已经跌到谷底,而 AI 又那么烧钱,它注定是一个大科技公司的事。需要那么多资源,人们说美国是在"把国运押在 AI 上"来搞这个基建。这两件事同时交织在一起,沟通环境变得非常困难。我应该多去跟那些讨厌大科技公司、把 AI 视为大科技延续的人聊聊。
**Nathan Lambert:** It takes so many resources, and people say the US is, "betting the economy on AI" with this build-out. To have these be intertwined at the same tim e makes for such a hard communication environment. It would be good for me to go talk to more people in the world who hate Big Tech and see AI as a co ntinuation of this. - And one of the things you recommend, one of the antidotes that you talk about, is to find agency in this system. As opposed to s itting back in a powerless way and consuming the AI slop as it rapidly takes over the internet.
**Lex Fridman:** 你推荐的一个解药是——在这个系统中找到能动性(agency)。不是无力地坐在那里消费 AI 内容垃圾,看着它迅速占领互联网。而是通过使用 AI 来构建东西——构建 app、构建各种东西——来找到能动性。第一,这确实帮你建立直觉;第二,这赋予你力量,因为你能理解它怎么运作、弱点在哪里。它让你有话语权去说"这是技术的坏用法,那是好用法。"而且你更深入地接入了这个系统,所以你能更好地理解它,作为消费者也能更好地引导它。
**Lex Fridman:** Find agency by using AI to build stuff, build apps, bu ild... One, that actually helps you build intuition, but two, it's empowering because you can understand how it works, what the weaknesses are. It giv es your voice power to say, "This is a bad use of the technology, and this is a good use." And you're more plugged into the system then, so you can un derstand it better and you can steer it better as a consumer. - I think that's a good point you brought up about agency. Instead of ignoring it and sa ying, "Okay, I'm not going to use it," I think it's probably long-term healthier to say, "Okay, it's out there.
**Sebastian Raschka:** 我觉得你提的"能动性"这一点很好。与其忽视它说"好吧,我不用了",我觉得从长远来看更健康的做法是说"好吧,它已经出来了。我收不回去了。"怎么最好地利用它,怎么让它帮我提升自己?但我在这里担心的一件事是,如果你把它完全用于你热爱做的事情,那个你热爱的事情就不复存在了。这可能导致倦怠。比如说,如果我用 LLM 帮我做所有的编程,那就没有编程了。我只是在管理一个替我编程的东西。两年后,假设我每天八小时就干这个——让某个东西替我编程——我还会觉得充实吗?这会不会伤害我对工作的兴奋感?我还会为构建什么东西感到自豪吗?
**Sebastian Raschka:** I can't put it back." when they came o ut. How do I make best use of it, and how does it help me to up-level myself? The one thing I worry about here, though, is, if you just fully use it f or something you love to do, the thing you love to do is no longer there. And that could potentially, I feel, lead to burnout. For example, if I use a n LLM to do all my coding for me, now there's no coding.
**Lex Fridman:** 说到享受这个话题,很有趣。我们应该在这里提一下,最近有一项调查,大约 791 名专业开发者,就是有 10 年以上经验的。
**Lex Fridman:** I'm just managing something that is coding for me. Two years later, let's say, if I just do t hat eight hours a day, having something code for me, do I feel fulfilled still? Is this hurting me in terms of being excited about my job, excited abo ut what I'm doing? Am I still proud to build something? - On that topic of enjoyment, it's quite interesting. We should just throw this in there, that there's this recent survey of about 791 professional developers, meaning 10-plus years of experience. - That's a long time.
**Sebastian Raschka:** 那可是很长时间了。作为初级开发者的话?
**Sebastian Raschka:** As a junior developer? - Yeah, in this day and age. So, there's also many fronts that are surprising. They break it down by junior and senior developers. But, I mean, it just shows that both junior and senior developers use AI-generated code in code they ship.
**Nathan Lambert:** 对,在现在这个年代。而且有很多方面令人惊讶。他们按初级和高级开发者做了细分。但基本上就是——初级和高级开发者都在他们发布的代码中使用 AI 生成的代码。这不只是好玩或者中间学习用的。这是他们发布的代码。25%——大多数人使用 50% 或更多。有趣的是,在"超过 50% 的发布代码是 AI 生成"这个类别中,高级开发者更有可能这样做。但你不想让 AI 夺走你热爱的东西。我觉得这跟我的经验一致——我接下来要说的这些数据。总体来看,大约 80% 的人觉得使用 AI 作为工作的一部分要么"更享受一些"要么"明显更享受"。
**Nathan Lambert:** So this is not just for fun or intermediate learning things. Thi s is code they ship. 25%—like, most of them use around 50% or more. And what's interesting is, for the category of over 50% of your code that you ship is AI-generated, senior developers are much more likely to do so. But you don't want AI to take away the thing you love. I think this speaks to my ex perience, these results I'm about to say. Together, about 80% of people find it either somewhat more enjoyable or significantly more enjoyable to use AI as part of the work. - I think it depends on the task.
**Sebastian Raschka:** 我觉得这取决于任务。从我个人使用来说,比如我有一个网站,偶尔需要调整一些东西。我个人不喜欢这种事。所以在这个意义上,如果 AI 能帮我在网站上实现什么,我完全支持。太好了。但同时,当我解决一个复杂问题——比如有个 bug,我去追这个 bug,找到了——那是世界上最棒的感觉。你感觉太好了。但如果你根本不思考这个 bug,直接去找 LLM,那你就永远没有这种感觉了,对吧?但也有个中间地带:你自己先试,找不到,然后用 LLM,这样你不会太沮丧,因为它帮了你,然后你继续做你享受的事情。所以看这些统计数据,我觉得没有被考虑进去的是——它是在所有不同场景上取的平均值。我们不知道这是核心任务还是那些人们本来就不享受的琐碎事情。所以在某种意义上,AI 对于做那些费力的琐碎事情真的很棒。比如,我老婆前几天——她有一个讨论书的播客,一个读书会——她在把节目笔记从 Spotify 转到 YouTube,然后链接不知怎么就断了。有些集因为讲了太多书,有大概 100 个链接,一个一个手动修复会非常痛苦。所以我建议说"嘿,我们试试 ChatGPT。"我们把文字复制到 ChatGPT 里,它修好了。本来要花两个小时一个链接一个链接地修,它让这类工作变得无比顺畅。我觉得每个人都有一个 AI 在某些事情上很有用的用例——那些本来会非常无聊、非常琐碎的事情。
**Sebastian Raschka:** From my personal usage, for example, I have a website where I sometimes tweak things. I pers onally don't enjoy this. So in that sense, if the AI can help me to implement something on my website, I'm all for it. It's great. But then, at the sa me time, when I solve a complex problem— well, if there's a bug, and I hunt this bug, and I find it, it's the best feeling in the world. You feel grea t. But now, if you don't even think about the bug, you just go directly to the LLM, well, you never have this kind of feeling, right?
**Lex Fridman:** 对我来说,既然我们在说编程,你提到了调试……我的快乐源泉,更多是在 Cursor 那边而不是 Claude Code——是我有一个朋友,有一个结对编程的伙伴。没那么孤独了。你把调试描述成这种巨大的快乐。不是的,我觉得调试就像你在沙漠里走了好几天之后的一口水。你跳过了你在沙漠里受苦的整个部分。有时候有个朋友虽然也找不到 bug,但能给你一些关于代码的直觉,你们一起穿越沙漠找到那口水,挺好的。至少对我来说——也许这说明了编程体验的孤独——这是一种快乐的来源。
**Lex Fridman:** But then there c ould be the middle ground where, well, you try yourself, you can't find it, you use the LLM, and then you don't get frustrated because it helps you an d you move on to something that you enjoy. And so, looking at these statistics, I think what is not factored in is that it's averaging over all the di fferent scenarios. We don't know if it's for the core task or if it's for something mundane that people would not have enjoyed otherwise. So, in a sen se, AI is really great for doing mundane things that take a lot of work. For example, my wife the other day—she has a podcast for book discussions, a book club, and she was transferring the show notes from Spotify to YouTube, and then the links somehow broke. And she had in some episodes, because it is so many books, like 100 links, and it would have been really painful to go in there and fix each link manually. So I suggested, "Hey, let's try Ch atGPT." We copied the text into ChatGPT, and it fixed them. Instead of two hours going from link to link fixing that, it made that type of work much m ore seamless. I think everyone has a use case where AI is useful for something that would be really boring, really mundane. - For me personally, since we're talking about coding, and you mentioned debugging... the source of enjoyment for me, more on the Cursor side than Claude Code, is that I have a friend, I have a pair programmer.
**Sebastian Raschka:** 这也许跟延迟满足有关。我是那种——即使小时候——我更喜欢圣诞礼物的想法——拥有它们、得到它们——比实际收到礼物的那一刻更让我兴奋。我会期待得到礼物的那天,但到了就结束了,我会失望。也许食物也一样。我觉得你真正饿的时候食物更好吃。你说得对,调试不是总那么好。它经常令人沮丧,但如果你能解决它,那就太棒了。不过有一个甜蜜的 Goldilocks 区间——如果太难了,那就是浪费时间。但我觉得另一个挑战是人们怎么学习。我们看了图表,发现更资深的开发者比初级的发布更多 AI 生成代码。这非常有趣,因为直觉上你会觉得应该是初级开发者——因为他们还不知道怎么做,所以用 AI 来做。这可能意味着 AI 还不够好到能完成那个任务,但也可能意味着专家更擅长使用它。他们知道怎么用,知道怎么审查代码,然后更信任这些代码。未来社会面临的一个问题是:如果你从不自己动手尝试,你怎么成为专家?我一直是通过自己尝试来学习的。如果你看数学教科书和答案——你能学到东西,但如果你先自己试了再看答案,你学得更好。因为你知道怎么把它放到你的心智框架里。如果 LLM 一直在那里,你真的会去经历那种挣扎吗?你愿意挣扎吗?挣扎不好受,对吧?但如果你用 LLM 做所有事情,到某个时候你就永远不会真正迈出下一步,然后你可能永远得不到那种作为专家使用 LLM 时才能获得的突破。所以我觉得有一个 Goldilocks 甜蜜点——也许诀窍是你每天留出专门的离线时间学习两小时,剩下的时间用 LLM。但我觉得人们还是要投资于自身——以我看来——不要什么都交给 LLM。
**Sebastian Raschka:** It's less lonely. You made debugging sound like this great joy. No, I would say debugging is like a drink of water after you've been going through a desert for days. You skip the whole desert part where you're suffering.
**Lex Fridman:** 是的,作为一个文明,我们每个人都得找到那个 Goldilocks 区间。在编程的语境下也一样。那,我们之前有过精彩的讨论,从预训练和中期训练开始。让我们进入后训练。后训练有很多有趣的东西。那么,后训练中有哪些有趣的想法?
**Lex Fridman:** Sometimes it's nice to have a friend who can 't really find the bug, but can give you some intuition about the code, and together you're going through the desert and finding that drink of water. At least for me, maybe it speaks to the loneliness of the programming experience. That is a source of joy. - It's maybe also related to delayed gratif ication. I'm a person who, even as a kid, I liked the idea of Christmas presents—having them, getting them—better than actually receiving the presents . I would look forward to the day I get the presents, but then it's over and I'm disappointed. And maybe it's the same with food. I think food tastes better when you're really hungry. You're right, with debugging, it is not always great. It's often frustrating, but then if you can solve it, then it' s great. But there's a sweet Goldilocks zone; if it's too hard, then it's just wasting your time.
**Nathan Lambert:** 2025 年最大的一个是学习用可验证奖励的强化学习(RLVR)。你可以扩展那里的训练,这意味着做大量这种迭代的"生成-评分"循环,这让模型在工具使用和软件方面学到有趣的行为。这可以是搜索、自己运行命令并看到输出,然后这种训练也非常好地支持推理时间扩展。结果表明这个范式之间有非常漂亮的关联——这种 RL 训练使推理时间扩展成为可能。虽然推理时间扩展本可以通过其他方式被发现。所以是一种完美风暴——模型发生了巨大变化,训练方式是其中的主要因素。这彻底改变了人们对后训练的理解。
**Nathan Lambert:** But I think another challenge is how will people lea rn? We looked at the chart and saw that more senior developers are shipping more AI-generated code than the junior ones. It's very interesting, becaus e intuitively you would think it's the junior developers because they don't know how to do the thing yet, and so they use AI to do that thing. It coul d mean the AI is not good enough yet to solve that task, but it could also mean experts are more effective at using it.
**Lex Fridman:** 你能描述一下 RLVR 吗?由 DeepSeek R1 推广开来的。它是怎么工作的?
**Lex Fridman:** They know how to use it better , review the code, and then they trust the code more. One issue for society in the future will be: how do you become an expert if you never try to do the thing yourself? One way I always learned is by trying things myself. If you look at math textbooks and the solutions, you learn something, but you learn actually better if you try first.
**Nathan Lambert:** 好的。有个趣事:我就在提出 RLVR 这个术语的团队里,来自我们 Tulu 3 的工作,在 DeepSeek 之前。我们不会过多邀功说自己是推广规模化 RL 的人,但作为学术界能得到的有趣东西——顺便说一句——是命名和影响话语的能力,因为闭源实验室能说的有限。你作为学术界能做的一件事是——你可能没有算力来训练模型,但你可以用一种方式来框架问题,最终形成一个社区围绕 RLVR 这个术语聚集起来的描述,这非常有趣。然后 DeepSeek 是做了训练突破的人——他们扩展了强化学习。你让模型生成答案,然后评分补全是否正确,准确率就是你强化学习的奖励。经典的强化学习是一个智能体在环境中行动,环境给它状态和奖励反馈,你试图最大化奖励。对于语言模型来说,奖励通常是在一组可验证任务上的准确率——无论是数学题还是编程任务。到了像事实性领域这种东西就变得模糊了。从某些角度来说那也是可验证的,或者指令上的约束,比如"只用以 A 开头的单词回答。"所有这些东西在某种意义上都是可验证的。核心思想是你找到大量可验证的问题,让模型多次尝试,同时进行这些 RL 梯度更新。基础设施是从基于人类反馈的强化学习(RLHF)进化而来的——在那个时代,它们试图优化的分数是一个学习到的人类偏好奖励模型。所以你改变了问题领域,这让优化能达到更大的规模,从根本上启动了模型能力和人们使用方式的重大变化。
**Nathan Lambert:** Then you appreciate the solution differently because you know how to put it into your mental framework. If LL Ms are here all the time, would you actually go to the length of struggling? Would you be willing to struggle? Struggle is not nice, right?
**Lex Fridman:** RLVR 适用于什么类型的领域?
**Lex Fridman:** But if you use the LLM to do everything, at some point you will never really take the next step, and then you will maybe not get that unlock that you would get as an expert using an LLM. So, I think there's like a Goldilocks sweet spot where maybe the trick here is you make dedicated offline time where you st udy two hours a day, and the rest of the day use LLMs. But I think it's important also for people to still invest in themselves, in my opinion, to not just LLM everything. - Yeah, as a civilization, we each individually have to find that Goldilocks zone. And in the programming context as developers.
**Nathan Lambert:** 数学和代码是最出名的。然后有很多关于"评分标准"(rubrics)的工作,这跟人们可能听过的一个词有关:LLM-as-a-judge(LLM 做评委)。对数据集中的每个问题,我会有一个 LLM 来问"这个问题的好答案应该是什么样的?"然后你可以反复尝试这个问题,根据评分标准打分。这不像数学和代码领域那样严格可验证,但这个评分标准的想法以及其他可能更模糊的科学问题是目前关注的焦点——他们试图把这套方法推广到更开放的领域,让模型学到更多。
**Nathan Lambert:** Now, we've had this fascinating conversation that started with pre-training and mid-training. Let's get to post-training. A lot of fun stuff in post- training. So, what are some of the interesting ideas in post-training? - The biggest one from 2025 is learning this reinforcement learning with verifi able rewards.
**Sebastian Raschka:** 我觉得那叫基于 AI 反馈的强化学习(reinforcement learning with AI feedback),对吧?
**Sebastian Raschka:** You can scale up the training there, which means doing a lot of this kind of iterative generate-grade loop, and that lets the models lea rn both interesting behaviors on the tool use and software side. This could be searching, running commands on their own and seeing the outputs, and th en also that training enables this inference time scaling very nicely. And it just turned out that this paradigm was very nicely linked, where this ki nd of RL training enables inference time scaling. But inference time scaling could have been found in different ways. So, it was kind of this perfect storm where the models change a lot, and the way that they're trained is a major factor in doing so. And this has changed how people approach post-tra ining dramatically. - Can you describe RLVR, popularized by DeepSeek R1? Can you describe how it works? - Yeah. Fun fact: I was on the team that came up with the term RLVR, which is from our Tulu 3 work before DeepSeek. We don't take a lot of credit for being the people to popularize the scaling RL, but as fun as what academics get, as an aside, is the ability to name and influence the discourse, because the closed labs can only say so much. That one of the things you can do as an academic is you might not have the compute to train the model, but you can frame things in a way that ends up bein g described as a community coming together around this RLVR term, which is very fun. And then DeepSeek are the people that did the training breakthrou gh, which is, they scaled the reinforcement learning.
**Nathan Lambert:** 那是更早期的术语,在 Anthropic 的 Constitutional AI 论文中提出的。很多东西都是周期性的。
**Nathan Lambert:** You have the model generate answers and then grade the completion if it was right, and then that accuracy is your reward for reinforcement learning. So reinforcement learning is classically an agent that acts in an environment, and the environmen t gives it a state and a reward back, and you try to maximize this reward. In the case of language models, the reward is normally accuracy on a set of verifiable tasks, whether it's math problems or coding tasks. And it starts to get blurry with things like factual domains. That is also, in some way s, verifiable, or constraints on your instruction, like respond only with words that start with A." All of these things are verifiable in some way, an d the core idea of this is you find a lot more of these problems that are verifiable and you let the model try it many times while taking these RL ste ps, these RL gradient updates. The infrastructure evolved from reinforcement learning from human feedback, where in that era the score they were tryin g to optimize was a learned reward model of human preferences. So you kind of changed the problem domains and that let the optimization go on to much bigger scales, which kind of kickstarted a major change in what the models can do and how people use them. - What kind of domains is RLVR amenable to? - Math and code are the famous ones, and then there's a lot of work on what is called the rubrics, which is related to a word people might have heard : LLM-as-a-judge. For each problem, I'll have a set of problems in my dataset. I will then have an LLM and ask it, "What would a good answer to this p roblem look like?" And then you could try the problem over and over again and assign a score based on this rubric. That's not necessarily verifiable l ike math and code domains, but this rubrics idea and other scientific problems that might be a little bit more vague is where the attention is, where they're trying to push this set of methods into these kind of more open-ended domains so the models can learn a lot more. - I think that's called rein forcement learning with AI feedback, right? - That's the older term for it coined in Anthropic's Constitutional AI paper.
**Sebastian Raschka:** 另外,回到 RLVR 退一步来说。我觉得这里有趣的是——你给 LLM 一个数学问题,你知道正确答案,然后你让 LLM 自己去搞清楚怎么做。它怎么做——你不怎么限制它。有一些约束比如"用同一种语言,不要在西班牙语和英语之间切换。"但基本上你是放手的。你只给问题和答案,然后 LLM 的任务是得出正确答案。但这里美妙的是在实践中发生了什么:LLM 会做一个逐步的描述——就像一个学生或数学家推导解法一样。它会用这些步骤,这帮助模型提升自己的准确率。然后,就像你说的,推理扩展。推理扩展大致意味着在推理时花更多计算。这里的推理扩展是模型会使用更多 token。在 DeepSeek R1 论文中,他们展示了训练时间越长,回复就越长。随时间增长。使用更多 token,所以变得更贵。对简单任务来说变贵了,但这些解释有助于准确率。也有论文显示模型解释的内容不一定必须是正确的,或者可能跟答案无关,但不知道为什么,模型去解释这件事本身仍然对它有帮助。我觉得这也是——我不想把这些 LLM 拟人化——但有点像我们人类的运作方式。如果课堂上有一个复杂的数学问题,你通常会有一张草稿纸,一步一步做。划掉一些东西。模型也会自我纠正,我觉得这就是 DeepSeek R1 论文中的"顿悟时刻"(aha moment)。他们叫它顿悟时刻,因为模型自己认识到它犯了一个错误,然后说"啊,我做错了,让我重试。"我觉得这太酷了——仅仅是给它正确答案、让它自己去想怎么做——它竟然在某种意义上做了人类会做的事情。虽然 LLM 的思考方式跟人类不同,但这是一个有趣的巧合,而且……还有一个好的副作用是——对我们人类来说,看到这些步骤很好。它建立信任,同时我们可以学习和复查。
**Sebastian Raschka:** It's like a lot of these thi ngs come in cycles. - Also, just one step back for RLVR. I think the interesting thing here is that you ask the LLM a, let's say, math question, and t hen you know the correct answer, and you let the LLM, as you said, figure it out. How it does it—you don't constrain it much. There are some constrain ts like "use the same language, don't switch between Spanish and English." But let's say you're pretty much hands-off. You only give the question and the answer, and then the LLM has the task to arrive at the right answer, but the beautiful thing here is what happens in practice: the LLM will do a s tep-by-step description, like as a student or as a mathematician would derive the solution. It will use those steps, and that helps the model to impro ve its own accuracy. And then, like you said, the inference scaling. Inference scaling loosely means spending more compute during inference, and here the inference scaling is that the model would use more tokens. In the DeepSeek R1 paper, they showed the longer they train the model, the longer the r esponses are.
**Nathan Lambert:** 这里面有很多东西。我觉得今年有很多辩论——关于语言模型中这些……我认为那些"顿悟时刻"某种程度上是假的,因为在预训练中你基本上看过了整个互联网,所以你肯定见过人们解释自己工作的内容,甚至是口头的,比如数学讲座的转录稿。"你试试这个,哦,我搞砸了。"RLVR 非常擅长做的是放大这些行为,因为它们对于让模型思考更长时间和检查自己的工作非常有用。我同意这是非常美妙的——这种训练让模型学会以一种对最终答案非常有帮助的方式来放大这些行为。
**Nathan Lambert:** They grow over time. They use more tokens, so it becomes more expensive. It becomes expensive for simple tasks, but these explanations h elp accuracy. There are also papers showing what the model explains does not necessarily have to be correct or maybe it's unrelated to the answer, but for some reason, it still helps the model that it is explaining.
**Sebastian Raschka:** 我还能给你一个实操的例子。我在用 Qwen 3 base model 做 RLVR,在 MATH-500 上。基础模型准确率大约 15%。只用了 50 步——几分钟的 RLVR——模型从 15% 涨到了 50%。而且那个模型……你不能说它在那几步里学到了什么根本性的数学知识——
**Sebastian Raschka:** And I think it's also—again, I don't want to anthropomorphize these LLMs, but it's k ind of like how we humans operate. If there's a complex math problem in a math class, class, you usually have a note paper and you do it step by step. You cross out things. And the model also self-corrects, and that was, I think, the aha moment in the DeepSeek R1 paper.
**Nathan Lambert:** Qwen 这个例子比较奇怪,因为今年有两篇论文——其中一篇我参与了——关于 Qwen 的数据污染。特别是他们在一个特殊的中期训练阶段训练了大量内容——我们应该花一分钟谈谈这个——因为它很奇怪,他们训练了几乎跟 MATH 完全一样的问题。
**Nathan Lambert:** They called it the aha moment because the model itself recognized it made a mistake and then said, "Ah, I did something wrong, let me try again." And I think that's just so cool t hat this falls out of just giving it the correct answer and having it figure out how to do it, that it kind of does in a sense what a human would do. Although LLMs don't think like humans, it's kind of like an interesting coincidence and it... And the other nice side effect is it's great for us huma ns often to see these steps. It builds trust, but also we us humans to see these steps. It builds trust, but also we learn and can double check things . - There's a lot in here. I think some of the debate... There's been a lot of debate this year on if the language models like these... I think the ah a moments are kind of fake because in pre-training you essentially have seen the whole internet. so you have definitely seen people explaining their w ork, even verbally, like a transcript of a math lecture.
**Sebastian Raschka:** 没错。所以你可以看到基本上 RL 并不是在教模型任何关于数学的新知识。你不可能在 50 步里做到那个。所以知识已经在那里了,在预训练里面,你只是在解锁它。
**Sebastian Raschka:** "You try this, oh, I messed this up." And what RLVR is very good at doing is amplifying these behaviors because they're very useful in enabling the model to think longer and to check its work. And I agree that it is very beautiful that this tr aining kind of... The model learns to amplify this in a way that is so useful for the final answers being better. - I can give you also a hands-on exa mple. I was training the Qwen 3 base model with RLVR on MATH-500. The base model had an accuracy of about 15%. Just 50 steps, like in a few minutes wi th RLVR, the model went from 15% to 50% accuracy.
**Nathan Lambert:** 我仍然不同意这个前提,因为有很多奇怪的复杂性你证明不了。有一个指向"奇怪"的证据是——如果你拿 Qwen 3 所谓的基础模型,然后你……你可以 Google 搜"math dataset, Hugging Face",然后拿一个问题。你把它输入 Qwen 3 base 的时候……这些数学问题都有文字描述,比如"Alice 有五个苹果,拿走一个……给了三个给谁谁谁"——这些文字题。为什么人们对这些基于 Qwen 的模型有怀疑——如果你改变数字但保留文字——Qwen 会在不使用工具的情况下产出一个精度非常高的十进制答案。这意味着在某个时间点,它被展示了几乎跟测试集一样的问题,它用工具得到了一个高精度答案,但一个没有工具的语言模型永远不可能做到这个。所以在研究社区一直有这个大辩论:这些在 Qwen 上训练、特别在这个数学基准上测量的强化学习论文——已经有多篇论文讨论过污染——到底有多少你能信?我觉得这就是为什么 RLVR 获得了"只是格式化"这个名声——因为你能那么快获得这些增益,所以它一定已经在模型里了。但这里面有很多复杂性……不是真正的对照实验,所以我们并不真的知道。
**Nathan Lambert:** And the model... You can't tell me it's learning anything fundamentally about math in - The Qwen exa mple is weird because there've been two papers this year, one of which I was on, about data contamination in Qwen and specifically that they train on a lot of this special mid-training phase that we should take a minute on, because it's weird because they train on problems that are almost identical to MATH. - Exactly. And so you can see that basically the RL, it's not teaching the model any new knowledge about math. You can't do that in 50 steps. So the knowledge is already there, in the pre-training, you're just unlocking it. - I still disagree with the premise because there's a lot of weird complexities that you can't prove because one of the things that points to weirdness is that if you take the Qwen 3 so-called base model and you... Yo u could Google like "math dataset, Hugging Face", and you could take a problem and what you do if you put it into Qwen 3 base... All these math proble ms have words, so it'd be like "Alice has five apples and takes one... and gives three to whoever," and there are these word problems. With these Qwen -based models, why people are suspicious of them is if you change the numbers but keep the words- Qwen will produce, without tools, will produce a ver y high accuracy decimal representation of the answer, which means there's some... At some time, it was shown problems that were almost identical to th e test set, and it was using tools to get a very high precision answer, but a language model without tools will never actually have this. So it's kind of been this big debate in the research community: how much of these reinforcement learning papers that are training on Qwen and measuring specifical ly on this math benchmark, where there's been multiple papers talking about contamination, is like, how much can you believe them? And I think this is what caused the reputation of RLVR being about formatting, because you can get these gains so quickly, therefore it must already be in the model.
**Sebastian Raschka:** 但如果那不是真的,我会说蒸馏(distillation)就不会起作用了,对吧?我的意思是,蒸馏在一定程度上确实能工作。但这确实是最大的问题——我研究这个污染问题——因为我们不知道数据里有什么。除非你有一个全新的数据集,否则真的不可能知道。同样的,你提到的那个数学数据集——有问题、答案和解释——但即使是更简单的东西像 MMLU,一个多选基准。如果你只是稍微改变格式——比如用句号代替括号什么的——模型准确率会大幅不同。
**Sebastian Raschka:** But there's a lot of complexity here that we... It's not really like controlled experimentation, so we don't really know. - But if it weren't true, I wou ld say distillation wouldn't work, right? I mean, distillation can work to some extent, but the thing is that is, I think, the biggest problem, and I research this contamination because we don't know what's in the data. Unless you have a new dataset, it is really impossible. And the same, you mentio ned the math dataset, where you have a question and then answer and an explanation is given, but then also even something simpler like MMLU, which is a multiple-choice benchmark.
**Nathan Lambert:** 我觉得那可能是一个模型层面的问题,而不是一般性的问题。
**Nathan Lambert:** If you just change the format slightly, like, I don't know, if you use a dot instead of a parenthesis or something like t hat, the model accuracy will vastly differ. - I think that that could be like a model issue rather than a general issue. - It's not even malicious by the developers of the LLM, like, "Hey, we want to cheat at that benchmark." It has seen something at some point. I think the only fair way to evaluate an LLM is to have a new benchmark that is after the cutoff date when the LLM was deployed. - Can we lay out what would be the recipe of all the thing s that go into post-training? And you mentioned RLVR was a really exciting, effective thing. Maybe we should elaborate.
**Sebastian Raschka:** 这甚至不是 LLM 开发者故意的——不是说"嘿,我们要在那个基准上作弊。"只是它在某个时候见过了某些东西。我觉得评估 LLM 唯一公平的方式是——用一个在 LLM 部署后截止日期之后的新基准。
**Sebastian Raschka:** RLHF still has a really import ant component to play. What kind of other ideas are there on post-training? - I think you can kind of take this in order. I think you could view it as what made o1, which is this first reasoning model, possible, or what will the latest model be? And they actually...
**Lex Fridman:** 我们能不能列出后训练的完整配方——所有涉及的组成部分?你提到 RLVR 是一个非常令人兴奋、有效的东西。也许我们应该展开说说。RLHF 仍然扮演着非常重要的角色。后训练中还有什么其他想法?
**Lex Fridman:** You're going to have similar inte rventions at these, where you start with mid-training, and the thing that is rumored to enable o1 and similar models is really careful data curation, where you're providing a broad set of what is called reasoning traces, which is just the model generating words in a forward process that is reflectin g, like breaking down a problem into intermediate steps and trying to solve them. So at mid-training, you need to have data that is similar to this to make it so that when you move into post-training, primarily with these verifiable rewards, it can learn. And then what is happening today is you're f iguring out which problems to give the model and how out which problems to give the model and how long you can train it for and how much inference you can enable the model to use when solving these verifiable problems. So as models get better, certain problems models get better, certain problems are no longer... The model will solve them 100% of the time, and therefore there's very little signal in this. If we look at the GRPO equation, this one is famous for this because essentially the reward given to the agent is based on how good a given action—an action is a completion—is relative to the other answers to that same problem. So if all the problems get the same answer, there's no signal in these types of algorithms. So what they're doing is they're finding harder problems, which is why you hear about things like scientific domains, where it's so hard to get anything right. If you have a lab or something, it just generates so many tokens or much harder software problems. So the frontier models are all pushing into these harder domain s when they can train on more problems and the model will learn more skills at once. The RLHF link to this is that RLHF has been and still is kind of like the finishing touch on the models, where it makes the models more useful by improving the organization or style or tone.
**Lex Fridman:** 我想问一下——我们能不能岔开一下,谈谈教育和学习。如果你是一个正在听这期节目的聪明人——对编程和 AI 感兴趣——我猜从零构建什么东西是一个好的起点。那你推荐人们做什么?
**Lex Fridman:** There are different thin gs that resonate with different audiences, like some people like a really quirky model and RLHF could be good at enabling that personality, and some p eople hate this markdown bulleted list thing that the models do, but it's actually really good for quickly parsing information. In RLHF, this human fe edback stage is really great for putting this into the model at the end of the day. It's what made ChatGPT so magical for people. And that use has act ually remained fairly stable.
**Sebastian Raschka:** 我个人会从——就像你说的——从零实现一个你能在自己电脑上运行的简单模型开始。目标不是拥有一个日常使用的东西。它不会取代 ChatGPT。目标是看清 LLM 里到底有什么、预训练在你自己的电脑上是怎么工作的。然后你学到预训练、监督微调和注意力机制。你对事物的运作有了扎实的理解。但到了某个点你会达到极限——因为小模型只能做这么多。学习大规模 LLM 的问题在于——让模型更大是指数级复杂的——因为你得把参数分片到多个 GPU 上。书的目标基本上是理解 LLM 是怎么工作的。它不会是生产级的 LLM——但一旦你有了那个——你就能理解生产级的 LLM。
**Sebastian Raschka:** This formatting can also help the models get better at math problems, for example. So it's like the border between style and formatting, and like the method that you use to answer a problem is actually all very closely linked in terms of when you're training these model s, which is why RLHF can still make a model better at math, but these verifiable domains are a much more direct process to doing this because it makes more sense with the problem formulation, which is why it ends up all forming together. But to summarize, it's like mid-training is give the model the skills it needs to then learn. RL with verifiable rewards is letting the model try a lot of times, so put a lot of compute into trial-and-error learn ing across hard problems.
**Lex Fridman:** 所以你总是试图构建一个能放在一个 GPU 上的 LLM?
**Lex Fridman:** And then RLHF would be like finishing the model, making it easy to use and kind of just rounding the model out. - Can you co mment on the amount of compute required for RLVR? - It's only gone up and up. I think Ilya Sutskever was famous for saying they use a similar amount o f compute for pre-training and post-training. Back to the scaling discussion, they involve very different hardware for scaling. Pre-training is very c ompute-bound, which is like this FLOPs discussion, which is just how many matrix multiplications can you get through at once.
**Sebastian Raschka:** 对。大多数都可以。美妙的地方是——你可以自我验证。当你从零编写这些的时候——你可以拿 Hugging Face Transformers 库里的一个现有模型。这个库很棒——但如果你想学 LLM——它不是最好的起点——因为代码太复杂了。
**Sebastian Raschka:** And because with RL you' re generating these answers, you're trying the model in real-world environments, it ends up being much more memory-bound because you're generating lon g sequences and the attention mechanisms have this behavior where you get a quadratic increase in memory as you're getting to longer sequences. So the compute becomes very different. In pre-training we would talk about a model—if we go back to like the Biden administration executive order, it's like 10 to the 25th FLOPs to train a model. If you're using FLOPs in post-training, it's a lot weirder because the reality is just like: how many hours ar e you allocating?
**Nathan Lambert:** 它最初是一个微调库——然后发展成了所有模型架构的标准表示。Hugging Face 是获取模型的默认去处——Transformers 是那个软件。
**Nathan Lambert:** How many GPUs for? And I think in terms of time, the RL compute is getting much closer because you just can't put it all into one sy stem. Pre-training is so computationally dense where all the GPUs are talking to each other and it's extremely efficient, whereas RL has all these mov ing parts and can take a long time to generate a sequence of 100,000 tokens. If you think about GPT-5.2 Pro taking an hour, it's like, what if your tr aining run has a sample for an hour and you have to make sure that's handled efficiently?
**Sebastian Raschka:** 所有有开放权重模型的前沿实验室都有 Transformers 版本。那是标准权重格式。但即使 Transformers 也不在生产中使用。人们用 SGLang 或 vLLM。
**Sebastian Raschka:** So I think in GPU hours or just wall-clock hours, the RL run s are probably approaching the same number of days as pre-training, but they probably aren't using as many GPUs at the same time. There are rules of t humb where in labs you don't want your pre-training runs to last more than a month because they fail catastrophically. And if you are planning a huge cluster to be held for two months and then it fails on day 50, the opportunity costs are just so big. So people don't want to put all their eggs in on e basket.
**Nathan Lambert:** Transformers 库有大约 400 个模型。
**Nathan Lambert:** GPT-4 was the ultimate YOLO run, and nobody ever wanted to do it before, where it took three months to train and everybody was shocked that it worked. I think people are a little bit more cautious and incremental now. - So RLVR is more, let's say, unlimited in how much you can train or sti ll get benefit, where RLHF, because it's a preference tuning, you reach a certain point where it doesn't really make sense to spend more RL budget on that. So just a step back with preference tuning: there are multiple people that can give multiple, let's say, explanations for the same thing and the y can both be correct, but at some point you learn a certain style and it doesn't make sense to iterate on it. My favorite example is: if relatives as k me what laptop they should buy, I give them an explanation or ask, "What is your use case?" They, for example, prioritize battery life and storage.
**Sebastian Raschka:** 所以它的代码库巨大。理解你想理解的部分就像大海捞针。但你有一个能工作的实现——所以你可以反向推导。我推荐——比如我想理解 OLMo——我会看权重、配置文件——你可以看到他们用了什么。然后从 GPT-2 模型开始加这些东西。很酷的是你可以加载预训练权重看能不能工作——作为可验证奖励来确保架构正确。OLMo 3 的挑战是 RoPE 的 YaRN 扩展——我没法完全匹配。在挣扎中——你理解了很多。最后你可以做单元测试——和参考实现对比。我觉得这是最好的学习方式——逆向工程。
**Sebastian Raschka:** Other people like us, for example, we would prioritize RAM and compute. Both answers are correct, but different people require different answers. With preference tuning, you're trying to average somehow. You are asking the data labelers to give you, not the right, but the preferred answer and then y ou train on that.
**Nathan Lambert:** 我觉得这是每个想进入 AI 的人今天都应该做的。这就是为什么我喜欢你的书。我从 RL 和机器人领域来到语言模型——从没花时间学所有基础。这个 transformer 架构如此基础——人们需要做这个。我觉得很多人被"如何产生影响或找到职业道路"给压倒了。但这个领域移动得太快——很多时候最好的人不会完全解决一个问题——因为有更大的容易问题——他们继续走了。做完基础之后——深入某个细分方向对人是好的。在线匿名人士和前沿研究者之间的距离——没有人知道那些匿名账号是谁。"我不理解这个,继续深挖"——非常有用。有很多研究领域只有三篇论文需要读——然后作者可能会回你邮件。我对"角色训练"变得很感兴趣——怎么让模型有趣或讽刺。牛津的一个学生联系了我——我指导了他——那篇论文现在存在了。世界上大概只有两三个人对此感兴趣。如果你坚持在一个领域——有很多有趣的东西可学。
**Nathan Lambert:** But at some point you learn that average preferred answer. And there's no reason to keep training longer on it because it's just a s tyle, whereas with RLVR, you let the model solve more and more complex, difficult problems. So I think it makes more sense to allocate more budget lon g-term to RLVR. Also, right now we are in an RLVR 1.0 blend where it's still that simple thing where we have a question and answer, but we don't do an ything with the stuff in between.
**Sebastian Raschka:** 是的,你不能什么都做——会倦怠。我很久没跟上计算机视觉了——只专注于 LLM。回到你的书——这真的是一本很棒的书。如果你想学 RLHF——我不会建议去读 RLHF 论文——因为你会花两年——
**Sebastian Raschka:** There were multiple research papers, also by Google for example, on process reward models that also give scores for the explanation—how correct is the explanation. And I think that will be the next thing, let's say RLVR 2.0 for this year, focusing in between questio n and answer, like how to leverage that information, the explanation, to help it get better accuracy. So that's one angle. And there was a DeepSeek Ma th-V2 paper where they also had interesting inference scaling there where, first, they had developed models that grade themselves, a separate model.
**Nathan Lambert:** 有些还互相矛盾。我编辑完这本书——没有哪一章我需要说"X 篇论文说一件事,Y 篇论文说另一件事。"
**Nathan Lambert:** A nd I think that will be one aspect. And the other, like Nathan mentioned, it will be for RLVR branching into other domains. - The place where people a re excited are value functions, which is pretty similar. So process reward models are kind of like... Process reward models assign how good something is at each intermediate step in a reasoning process, where value functions apply value to every token the language model generates.
**Lex Fridman:** 快过一下目录——后训练大图中我们可能遗漏了哪些想法?有问题设定、训练概述、偏好、偏好数据和优化工具、奖励建模、正则化、指令调优、拒绝采样和强化学习。然后 Constitutional AI 和 AI 反馈、推理和推理时间扩展、工具使用和函数调用、合成数据和蒸馏、评估——然后是开放问题——过度优化、风格和信息——产品 UX、角色和后训练。有哪些值得一提的?你提到了角色训练。
**Lex Fridman:** Both of these have been largely unproven in the language modeling and reasoning model era. People are more optimistic about value functions for whatever reason now. I t hink process reward models were tried a lot more in this pre-o1, pre-reasoning model era, and a lot of people had a lot of headaches with them. So I t hink a lot of it is human nature...
**Nathan Lambert:** 角色训练有趣在于相关研究太少了。本质上是——你如何改变数据和决策来让模型成为你想要的样子?OpenAI 有"模型规格"——内部指南——发布给开发者。你可以知道什么是训练的失败——相对于什么是他们想做的。RL 章节是人们想要的——因为大家都听说 RLVR。RLHF 的核心是偏好有多混乱——RL 的设置假设偏好可以被量化。我觉得这和 Von Neumann-Morgenstern 效用定理有关。量化偏好是人类设计出的问题——为了让偏好可以被研究。但存在根本性争论。社会选择理论是经济学中关于如何聚合偏好的子领域。我保持着一个推理模型技术报告清单——在第 14 章有一张巨大的表。著名的 DPO 论文——附录中的推导跳过了数学步骤。我重做了推导——然后用语言模型做——它们就说"这是 log 技巧。"我觉得阅读中的挣扎对学习是好的。
**Nathan Lambert:** Value models have a very deep history in reinforcement learning. They're one of the first things core to deep rein forcement learning existing, is training value models. So right now people are excited about trying value models, but there's very little proof. And t here are negative examples in trying to scale up process reward models. These things don't always hold in the future. We came to this discussion by ta lking about scaling. The simple way to summarize what you're saying is you don't want to do too much RLHF, where the signal doesn't scale. People have worked on RLHF for language models for years, especially with intense interest after ChatGPT. And the first release of a reasoning model trained with RLVR, OpenAI's o1, had a scaling plot where if you increase training compute logarithmically, you get a linear increase in evaluations.
**Lex Fridman:** 你们都多次提到了"挣扎"。如果你在过程中不挣扎——你就没有完全遵循正确的学习过程。
**Lex Fridman:** This has been reproduced multiple times. DeepSeek had a plot like this. But there's no scaling law for RLHF where if you log-increase the compute, you get performa nce. In fact, the seminal scaling paper for RLHF is scaling laws for reward model over-optimization.
**Nathan Lambert:** 有些提供商在做教育用的模型——设计成不一次给出所有信息。训练模型来做这个将是很好的贡献。我们有可能在 AI2 做这个。
**Nathan Lambert:** So that's a big line to draw with RLVR and the me thods we have now. In the future, they will follow this scaling paradigm: where you can let the best runs run for an extra 10x and you get performance , but you can't do this with RLHF. And that is just going to be field-defining in how people approach them. While I'm a shill for people to academical ly do RLHF, to do the best RLHF you might not need the extra 10 or 100x of compute, but to do the best RLVR you do.
**Sebastian Raschka:** 有道理。前几天我玩电子游戏被卡住了——我用了 LLM。但我说"请不要剧透。接下来我该做什么?"你可以对数学做同样的事——"不要给完整解答,但有什么我可以尝试的?"但这需要自律。我们可以开发教育 LLM——但其他 LLM 还在——仍然有诱惑。
**Sebastian Raschka:** I think there's a seminal paper fr om a Meta internship. It's called something like "The Art of Scaling Reinforcement Learning with Language Models." What they describe as a framework i s Scale-RL. Their incremental experiment was like 10,000 V100 hours, which is like thousands or tens of thousands of dollars per experiment. They do a lot of them, and This cost is not accessible to the average academic, which is a hard equilibrium where it's trying to figure out how to learn from e ach community. - I was wondering if we could take a bit of a tangent and talk about education and learning. If you're someone listening to this who's a smart person interested in programming and AI, I presume building something from scratch is a good beginning. So can you take me through what you wo uld recommend people do? - I would personally start, like you said, implementing a simple model from scratch that you can run on your computer. The go al is not, when you build a model from scratch, to have something for every day use. It's not going to be your personal assistant replacing an existin g open-weight model or ChatGPT. It's to see what exactly goes into the LLM, what comes out, and how the pre-training works on your own computer, prefe rably.
**Nathan Lambert:** 有一个短暂的 10 年窗口——所有考试可以数字化——现在 AI 之后又得用蓝皮本和口试了——因为作弊太容易了。很有趣。
**Nathan Lambert:** Then you learn about pre-training, supervised fine-tuning, and the attention mechanism. You get a solid understanding of how things work, but a t some point you reach a limit, because small models can only do so much. The problem with learning about LLMs at scale is that it's exponentially mor e complex to make a larger model, because the model isn't just larger—you have to shard your parameters across multiple GPUs. Even for the KV cache, t here are multiple ways to implement it.
**Lex Fridman:** 你提到角色训练——需要多少算力?有没有不需要太多算力的研究方向?
**Lex Fridman:** One is just to understand how it works, just to grow the cache. You grow it step-by-step by, let's say, concat enating lists, but then that wouldn't be optimal on GPUs. You would pre-allocate a tensor and then fill it in. But that adds another 20 or 30 lines of code.
**Nathan Lambert:** 角色训练基于用 LoRA 微调约 70 亿参数模型——只微调一小部分权重。
**Nathan Lambert:** And for each thing, you add so much code. The goal with the book is basically to understand how the LLM works. It's not going to be a productio n-level LLM, but once you have that, you can understand the production-level LLM. - So you're trying to always build an LLM that's going to fit on one GPU? - Yes. Most of them do.
**Lex Fridman:** 但可行。
**Lex Fridman:** I have some bonus materials on some MoE models. One or two of them may require multiple GPUs, but the goal is to have it on one GPU. And the beautiful thing is, you can self-verify. It's almost like RLVR.
**Nathan Lambert:** 不是每个学者都可行。一些学者只能做推理——用模型获取补全然后分析。这很适合评估——创造模型失败的代表性问题。做评估的研究者的最高目标是前沿实验室采纳你的评估。如果你从小大学找到 Claude 不擅长的东西——然后下一个 Claude 提到了它——那就是你的职业火箭。以最少算力获得最大影响——深入一个很窄的方向——想八个月后模型会在哪里挣扎。
**Nathan Lambert:** When you code these from scratch, you can take an existing model from the Hugging Face Transformers library. The library is great, but if you want to learn about LLMs, it's not the best place to start because the co de is so complex to fit so many use cases. Because people use it in production, it has to be really sophisticated, really intertwined, and hard to rea d. It's not linear. - It started as a fine-tuning library, and then it grew to be the standard representation of every model architecture.
**Lex Fridman:** 但开发全新的想法呢?
**Lex Fridman:** Hugging Fac e is the default place to get a model, and Transformers is the software. It enables it so people can easily load a model and do something basic with i t. - And all frontier labs that have open-weight models have a Transformers version of it, like from DeepSeek to gpt-oss-120b. That's the canonical we ight format you can load. But even even Transformers, the library, is not used in production.
**Nathan Lambert:** 这是权衡。如果你在读博——你也可以走更长期的方向。我自己很务实。OpenAI 平均年薪超过 100 万美元股票。进入 AI 实验室是改变人生的。如果你专注——在语言模型领域仍然有很多向上流动性。但从研究角度——成为下一个 Yann LeCun——不是通过关注语言模型开发。
**Nathan Lambert:** People use SGLang or vLLM, and it adds another layer of complexity. - We should say that the Transformers library has like 400 models. - So it's the one library that tries to implement a lot of LLMs, and so you have a huge codebase, basically. It's huge. It's like, I don't know, maybe millions— - That's crazy. - hundreds of thousands of lines of code. Un derstanding the part you want to understand is finding the needle in the haystack. But what's beautiful is you have a working implementation, so you c an work backwards. What I would recommend doing, or what I also do, is if I want to understand, for example, how OLMo is implemented, I would look at the weights in the model hub, the config file, and then you can see, "Oh, they used so many layers. They use, let's say, Group Query Attention or Mult i-Head Attention in that case." Then you see all the components in a human-readable, 100-line config file.
**Sebastian Raschka:** 我觉得变化并没有那么大。酷的东西一直在工业界开发——而且是封闭的。区别是你的偏好——你喜欢发表还是在封闭实验室。薪酬当然是另一个。现在还有第三个选项——创业。取决于你在哪里更舒服。而且没什么是永远的。
**Sebastian Raschka:** And then you start, let's say, with your GP T-2 model and add these things. The cool thing here is you can then load the pretrained weights and see if they work in your model. You want to match the same output that you get with a Transformer model, and then you can use that, basically as a verifiable reward to make your architecture correct. Sometimes it takes me a day.
**Lex Fridman:** 我觉得做教授的朋友看起来比在前沿实验室工作的人更快乐。那里有踏实感。前沿实验室确实是 996。
**Lex Fridman:** With OLMo 3, the challenge was RoPE for the position embeddings. They had a YaRN extension and there was some custom scal ing there, and I couldn't quite match these things. In this struggle, you kind of understand things. At the end, you know you have it correct because you can unit test it.
**Lex Fridman:** 996——早上 9 点到晚上 9 点——一周六天。这基本上是硅谷 AI 公司的标准?
**Lex Fridman:** You can check against the reference implementation. I think that's one of the best ways to learn, really. To basically reverse-e ngineer something. - I think that is something everyone interested in getting into AI today should do. That's why I liked your book.
**Nathan Lambert:** 是的,有那种趋势。几乎翻转了——我在学术界时就是那种感觉。作为教授——写基金、教课、做研究——三份工作合一。现在相比之下教授压力更小——因为——
**Nathan Lambert:** I came to languag e models from the RL and robotics field. I had never taken the time to just learn all the fundamentals. This transformer architecture is so fundamenta l, just as deep learning was in the past, and people need to do this. I think where a lot of people get overwhelmed is, "How do I apply this to have i mpact or find a career path?" Because language models make this fundamental stuff so accessible, and people with motivation will learn it.
**Nathan Lambert:** 我觉得他们工作量也很大。只是非常充实——通过和学生工作——有持续的指导——非常有回报。
**Nathan Lambert:** Then it's l ike, "How do I get cycles on goal to contribute to research?" I'm actually fairly optimistic because the field moves so fast that a lot of times the b est people don't fully solve a problem because there's a bigger problem to solve that's very low-hanging fruit, so they move on. I think that a lot of what I was trying to do in this RLHF book is take post-training techniques and describe how people think about them influencing the model and what pe ople are doing. Then it's remarkable how many things I just think people stop studying or don't pursue. I think people trying to go narrow after doing the fundamentals is good, and then reading the relevant papers and being engaged in the ecosystem.
**Sebastian Raschka:** 在初创公司是那种压力——你必须成功。很难——因为你得不断交付。我待过初创公司——经历不错——但不知道能不能永远这样。这些模型在蛙跳式超越——不断试图比竞争对手再迈一步。现在很残酷。
**Sebastian Raschka:** It's like you actually... actually... The proximit y that random people online have to the leading researchers—no one knows who all the... The anonymous accounts on X and ML are very popular, and no on e knows who all these people are. It could just be random people that study this stuff deeply, especially with the AI tools.
**Nathan Lambert:** 这种蛙跳和多个参与者——是语言模型进步被低估的驱动力。竞争深深植根于人们心中。Anthropic 以文化上深度投入和有组织而闻名——每个人似乎都很一致。有这种竞争动态会让你更努力——但以人力资本为代价——你只能持续这么久——人们在倦怠。我写过关于倦怠的帖子。Patrick McGee 的《Apple in China》谈到 Apple 工程师为供应链工作多么辛苦——他说"人们因此死亡。"这就是以人的代价创造进步的完美环境——996——人们确实在拼命干。
**Nathan Lambert:** To just be like, "I don't understand this, keep digging into it," is a very useful thing. But there's a lot of research areas that maybe have three papers that you need to rea d, and then one of the authors will probably email you back. But you have to put in a lot of effort into these emails to understand the field. I think it would be for a newcomer easily weeks of work to feel like they can truly grasp what is a very narrow area, but I think going narrow after you have the fundamentals will be very useful to people because I've become very interested in character training, which is how you make the model funny or sa rcastic or serious, and what do you do to the data to do this?
**Sebastian Raschka:** 我也读了那本书。他们有暗号——如果有人需要回家挽救婚姻。太疯狂了。但我不觉得他们被迫工作——是对产品的热情。我有时候也过度工作——不健康——有过背部和颈椎问题。但没人强迫——是因为想工作。
**Sebastian Raschka:** A student at Oxford reached out to me and was like, "Hey, I'm interested in this," and I advised him. And that paper now exists. There's like two or three people in the world that were very interested in this. He's a PhD student, which g ives him an advantage, but for me, that was a topic I was waiting for someone to be like, "Hey, I have time to spend cycles on this." I'm sure there's a lot more very narrow things where you're just like, "It doesn't make sense that there was no answer to this." I think it's just there's so much inf ormation coming that people are like, "I can't grab onto any of these," but if you just stick in an area, I think there's a lot of interesting things to learn. - Yeah, I think you can't try to do it all because it would be very overwhelming and you would burn out.
**Lex Fridman:** OpenAI 和 Anthropic 就是这样。但也有一种狂热氛围在增长——与扩展定律一致——世界将在数周内被改变——你想处于中心。硅谷是一种回音室和泡沫。泡沫实际上可以是有用和有效的——可以是 Steve Jobs 的现实扭曲力场——互相说服突破即将到来——从而让突破真的到来。
**Lex Fridman:** For me, for example, I haven't kept up with computer vision in a long time; I just focused on LLMs. But coming back to your book, I think this is a really great book and a really good b ang for the buck because if you want to learn about RLHF, I wouldn't go out there and read RLHF papers because you would be spending two years— - Some of them contradict. I've just edited the book, and there's no chapter where I had to be like, "X papers say one thing and Y papers say another, and w e'll see what comes out to be true." - Just to go through the table of contents, what are some ideas we might have missed in the bigger picture of pos t-training? First of all, you did the problem setup, training overview, what are preferences, preference data, and the optimization tools, reward mode ling, regularization, instruction tuning, rejection sampling, and reinforcement learning.
**Sebastian Raschka:** Byrne Hobart 对泡沫做了分类——金融泡沫(坏的)和建设泡沫(推动建造)。AI 在后者里——但我担心它转变为金融泡沫。
**Sebastian Raschka:** Then, Constitutional AI and AI feedback, reasoning, and infe rence-time scaling to use in function calling, synthetic data and distillation, evaluation, and then an open questions section, over-optimization, sty le and information, and then product UX, character and post-training. What are some ideas worth mentioning that connect both the educational and the r esearch components? You mentioned character training, which is pretty interesting. - Character training is interesting because there's so little on it . We talked about how people engage with these models.
**Lex Fridman:** 在想法空间里——你在做现实扭曲力场——如果你在 996 工作同时偏离现实太远——你可能错过人类体验的基本方面。硅谷是一个非常特定的地理区域。你可能不理解中西部视角。你们以特定方式互相交谈和说服——那可能带来麻烦。你就是一个年轻人试图决定用生命做什么。
**Lex Fridman:** We feel good using them because they're positive, but that can go too far; it can be too positi ve. And it's like, essentially, it's: How do you change your data and decision-making to make it exactly what you want? And like, OpenAI has this thin g called a model spec, which is essentially their internal guideline for what they want the model to do, and they publish this to developers. So, esse ntially, you can know what is a failure of OpenAI's training—where they have the intentions and they haven't met them yet— versus what is something th at they actually wanted to do and that you don't like.
**Nathan Lambert:** 旧金山 AI 梗已经到了"永久下层阶级"的程度——2025 年下半年是建立持久价值的唯一窗口。那是泡沫走太远的例子。但我仍然觉得——想在 AI 产生影响的年轻人——身在旧金山最有可能。但有权衡。
**Nathan Lambert:** And that transparency is very nice, but all the methods for curating these documents and how ea sy it is to follow them is not very well known. I think the way the book is designed is that the RL chapter is obviously what people want because ever ybody hears about it with RLVR, and it's the same algorithms and the same math, but you can use it in very different documents. So I think the core of RLHF is like how messy preferences are. It's essentially a rehash of a paper I wrote years ago, but this is essentially the chapter that'll tell you why RLHF is never ever fully solvable because, the way that even RL is set up, it assumes that preferences can be quantified and that multiple prefere nces can be reduced to single values.
**Lex Fridman:** 旧金山很棒——但确实有泡沫。进去——但也要走出来。读历史书、读文学、去世界其他地方看看。Twitter 和 Substack 不是整个世界。
**Lex Fridman:** And I think it relates in the economics literature to the Von Neumann-Morgenstern utility theorem, and that is t he chapter where all of that philosophical, economic, and psychological context tells you what gets compressed into doing RLHF. So it's like you have all of this and then later in the book it's like: You use this RL map to make the number go up. And I think that's why it'll be very rewarding for peo ple to do research on, because quantifying preferences is something that humans have designed a problem in order to make preferences studyable. But th ere's kind of fundamental debates, like, an example is in a language model response you have different things you care about, like accuracy or style.
**Nathan Lambert:** 我推荐《Season of the Witch》——旧金山从 1960 到 1985 年的历史——嬉皮运动、同性恋群体接管城市、HIV/AIDS 危机。那是如此近代——动荡和伤痛——也有爱。没有人知道这些。很棒的书。
**Nathan Lambert:** And when you're collecting the data, they all get compressed into: "I like this more than another." And that is happening, and there's a lot of resear ch in other areas of the world that go into how you should actually do this. I think social choice theory is the subfield of economics around how you should aggregate preferences. And I went to a workshop that published a white paper on: "How can you think about using social choice theory for RLHF?" So I mostly would want people that get excited about the math to come and find things where they could stumble into this broader context. I think the re's a fun thing: I just keep a list of all the tech reports of reasoning models I like.
**Nathan Lambert:** 我觉得你可以按顺序来看。你可以把它理解为——是什么让 o1 这个第一个推理模型成为可能,或者最新的模型是什么?它们其实……你会看到类似的干预措施。你从中期训练开始——据传让 o1 和类似模型成为可能的是非常精心的数据策划,你提供一组广泛的所谓推理轨迹(reasoning traces),就是模型在前向过程中生成的文字——反思、把问题分解成中间步骤并试图解决。所以在中期训练阶段,你需要有与此类似的数据,这样当你进入后训练——主要是用这些可验证奖励——的时候,它才能学习。然后现在正在发生的事情是——你在弄清楚给模型什么问题、可以训练多长时间、以及在解决这些可验证问题时让模型使用多少推理。所以随着模型变得更好,某些问题就不再……模型 100% 的时间都能解对它们,因此从中获得的信号就非常少。如果我们看 GRPO 方程——这个因此而出名——本质上是给智能体的奖励是基于一个给定动作——一个动作就是一个补全——相对于同一个问题的其他答案有多好。所以如果所有问题都得到相同的答案,这类算法中就没有信号了。所以他们在做的是找更难的问题——这就是为什么你会听到科学领域的东西——在那里很难做对任何事。如果你有一个实验室或什么的,它会生成非常多 token,或者是更难的软件问题。所以前沿模型都在推向这些更难的领域——在那里它们可以在更多问题上训练,模型可以同时学到更多技能。RLHF 跟这个的关联是——RLHF 一直是、现在仍然是模型的最后润色,让模型更有用——通过改善组织方式、风格或语气。不同的东西跟不同的受众共鸣——有些人喜欢非常古怪的模型,RLHF 能很好地启用那种个性;有些人讨厌模型做的那种 markdown 列表,但其实那对快速解析信息非常有效。在 RLHF——人类反馈阶段——对于在最后把这些放入模型非常棒。这就是让 ChatGPT 对人们如此神奇的原因。这种用途其实一直相当稳定。这种格式化也能帮助模型在数学问题上做得更好。所以风格和格式与你用来回答问题的方法之间的界限其实紧密相连——在训练这些模型的时候——这就是为什么 RLHF 仍然能让模型在数学上更好,但这些可验证领域是一个更直接的过程,因为它跟问题表述更契合——所以最终它们都融合在一起。总结一下就是:中期训练是给模型它需要的技能来学习。用可验证奖励的 RL 是让模型多次尝试——把大量算力投入到在困难问题上的试错学习中。然后 RLHF 就像是做最后的修饰,让模型易于使用、把模型打磨圆润。
**Nathan Lambert:** So in Chapter 14, where there's a short summary of RLVR, ther e's just a gigantic table where I list every single reasoning model that I like. I think in education, a lot of it needs to be like, at this point, wh at I like, because the language models are so good at the math. For example, the famous paper, Direct Preference Optimization, which is a much simpler way of solving the problem than RL. The derivations in the appendix skip steps of math.
**Lex Fridman:** 你能评论一下 RLVR 需要多少算力吗?
**Lex Fridman:** And for this book, I redid the derivations and I'm like, "Wha t the heck is this log trick that they use to change the math?" But doing it with language models, they're like, "This is the log trick." And I'm like , "I don't know if I like this, that the math is so commoditized." I think some of the struggle in reading this appendix- ...and following the math is good for learning. - Yeah, we're returning to this often on the topic of education. You both have brought up the word "struggle" quite a bit. So ther e is value. If you're not struggling as part of this process, you're not fully following the proper process for learning. proper process for learning, I suppose. - Some providers are working on models for education designed to not give- actually, I haven't used them, but I'd guess they're designed t o not give all the information at once.
**Nathan Lambert:** 只会越来越多。我记得 Ilya Sutskever 有句名言——他们在预训练和后训练上使用差不多的算力。回到扩展讨论——它们涉及非常不同的硬件来扩展。预训练是非常计算密集型的——这就是 FLOP 的讨论——就是你一次能做多少矩阵乘法。因为 RL 你在生成答案、在真实世界环境中尝试模型——它最终变得更加内存密集型,因为你在生成长序列,注意力机制有这种行为——随着序列变长,内存呈二次方增长。所以计算变得非常不同。预训练的时候我们会说一个模型——如果我们回到 Biden 政府的行政命令——是 10 的 25 次方 FLOP 来训练一个模型。如果你在后训练中使用 FLOP 来衡量,就很奇怪了,因为现实就是:你分配了多少小时?多少 GPU?我觉得在时间上,RL 的算力在越来越接近,因为你就是没法把它全塞进一个系统。预训练的计算非常密集——所有 GPU 彼此通信,效率极高。而 RL 有所有这些移动部件,生成一个 10 万 token 的序列可能要很长时间。如果你想想 GPT-5.2 Pro 需要一个小时——就像,如果你的训练运行有一个采样要花一小时,你得确保这被高效处理。所以我觉得在 GPU 小时或纯挂钟时间上,RL 运行可能正在接近跟预训练一样的天数,但它们同时使用的 GPU 可能没那么多。有一些经验法则——在实验室里你不想让预训练运行超过一个月,因为它们会灾难性地失败。如果你计划让一个巨大的集群占用两个月然后在第 50 天失败了,机会成本太大了。所以人们不想把所有鸡蛋放在一个篮子里。GPT-4 是终极的"放手一搏"运行——之前没人想这么做——训练了三个月,所有人都震惊它竟然成功了。我觉得现在人们更谨慎、更增量了。
**Nathan Lambert:** And make people work for it. Training models to do this would be a wonderful contribution. Where, like all of the stuff in the book, you had to reevaluate every decision. decision for it- It's a great example. There's a chance we work on it at Ai2, which I tho ught would be so fun. - It makes sense.
**Sebastian Raschka:** 所以 RLVR 更像是在训练量上不受限——你仍然能从更多训练中获益。而 RLHF 因为是偏好调优,你到了某个点之后再花更多 RL 预算就没什么意义了。退一步说偏好调优:可以有多个人对同一件事给出多种解释,而且都可以是正确的,但到某个时候你学到了一种风格,再迭代就没意义了。我最喜欢的例子是:如果亲戚问我该买什么笔记本电脑,我会解释或者问"你的使用场景是什么?"他们比如优先考虑电池续航和存储空间。而像我们这样的人会优先考虑内存和算力。两个答案都是对的,但不同的人需要不同的答案。做偏好调优的时候,你试图取某种平均值。你要求标注者给你的不是正确答案,而是偏好答案,然后你在上面训练。但到某个时候你学到了那个平均偏好答案,没有理由继续训练下去,因为那只是一种风格。而 RLVR 你让模型去解决越来越复杂、困难的问题。所以我觉得长期来看,把更多预算分配给 RLVR 更合理。另外,现在我们还处在 RLVR 1.0 阶段——还是那种简单的形式,有一个问题和答案,但我们对中间的东西不做任何处理。有多篇研究论文——也有 Google 的——关于过程奖励模型(process reward models),也给解释过程打分——解释有多正确。我觉得那会是下一个东西——今年的 RLVR 2.0——聚焦于问题和答案之间的部分,怎么利用那些信息、那些解释来帮助模型获得更好的准确率。那是一个方向。还有一篇 DeepSeek Math-V2 论文,他们在推理扩展方面也有有趣的做法——首先,他们开发了自我评分的模型,一个独立的模型。我觉得那会是一个方面。另一个——就像 Nathan 提到的——RLVR 将分支到其他领域。
**Sebastian Raschka:** I did something like that the other day for video games. Sometimes for pastime I play video games, like I like - Video games with puzzles, like Zelda and Metroid. And there's this new game where I really got stuck and was okay with it. I don't want to struggle for two days, so I used an LLM.
**Nathan Lambert:** 人们兴奋的一个方向是价值函数(value functions),这跟它很相似。过程奖励模型在推理过程的每个中间步骤评估某个东西有多好,而价值函数对语言模型生成的每个 token 赋予价值。这两者在语言建模和推理模型时代基本上都还没有被证明有效。不知道什么原因,人们现在对价值函数更乐观了。我觉得过程奖励模型在 o1 之前、推理模型之前的时代被尝试得更多,很多人在上面遇到了很多头疼的问题。所以我觉得很大程度上是人的天性……价值模型在强化学习中有非常深的历史。它们是深度强化学习存在的最核心的东西之一。所以现在人们对尝试价值模型很兴奋,但几乎没有证据。而且在尝试扩大过程奖励模型方面有负面案例。这些东西在未来不一定成立。我们通过讨论扩展来到了这里。简单总结你说的就是——你不想做太多 RLHF,因为信号不会扩展。人们在 ChatGPT 之后对语言模型的 RLHF 工作了好几年,投入了巨大的兴趣。第一个用 RLVR 训练的推理模型——OpenAI 的 o1——有一个扩展图,如果你对数增加训练算力,你在评测上得到线性增长。这已经被多次复现。DeepSeek 也有类似的图。但 RLHF 没有扩展定律——你对数增加算力并不能得到性能提升。事实上,RLHF 的开创性扩展论文叫"奖励模型过度优化的扩展定律"。所以这是 RLVR 和我们现在拥有的方法之间一条很大的分界线。在未来,它们会遵循这个扩展范式——你可以让最好的运行多跑 10 倍,你就能获得性能。但你不能用 RLHF 这样做。这将定义整个领域——人们如何对待它们。虽然我是学术上推崇大家做 RLHF 的人——要做最好的 RLHF 可能不需要额外的 10 倍或 100 倍算力,但要做最好的 RLVR 你需要。我觉得有一篇来自 Meta 实习项目的里程碑论文。叫什么"用语言模型扩展强化学习的艺术"之类的。他们描述的框架叫 Scale-RL。他们的增量实验大约是 10000 个 V100 小时——就是每个实验几千到几万美元。他们做了很多实验。这个成本对普通学术界来说是不可及的,这是一个困难的均衡——试图弄清楚如何从不同社区学习。
**Nathan Lambert:** But then you say, "Hey, please don't add spoilers. Just, you know, I'm here and there. What do I have to do next?" You can do the same thing for math where you say, "Okay, I'm stuck at this point. Don't give me the full solution, but what is something I could try?" Wh ere you carefully probe it.
**Lex Fridman:** 我想我们能不能岔开一下,聊聊教育和学习。如果你是一个听节目的聪明人,对编程和 AI 感兴趣,我想从零构建东西应该是一个好的开始。那你能带大家走一遍你推荐人们怎么做吗?
**Lex Fridman:** But the problem here is I think it requires discipline. Many people enjoy math, but there are also a lot of people who nee d to do it for their homework, and then it's like a shortcut. We could develop an educational LLM, but other LLMs are still there, and there's still a temptation to use the other LLMs. - I think many people in college understand the stuff they're passionate about- about- ...they're self-aware and th ey understand it shouldn't be easy. I think we just have to develop a good taste- ...talk about research taste, school taste about stuff that you shou ld be struggling on- ...and stuff you shouldn't be.
**Sebastian Raschka:** 我个人会建议像你说的那样——从零实现一个简单的模型,在你自己的电脑上运行。目标不是——当你从零构建模型的时候——做出一个日常使用的东西。它不会成为你的个人助手,替代已有的开放权重模型或 ChatGPT。目的是看看 LLM 里面到底有什么、出来什么、预训练怎么运作——最好在你自己的电脑上。然后你学到预训练、监督微调和注意力机制。你对事情怎么运作有了扎实的理解。但到某个时候你会到达一个极限,因为小模型能做的有限。学习大规模 LLM 的问题是——要做一个更大的模型,复杂度呈指数增长——因为模型不只是更大——你得把参数分片到多个 GPU 上。即使是 KV 缓存,也有多种实现方式。一种是只是理解它怎么运作——不断增长缓存。你一步一步增长——比方说通过拼接列表——但那在 GPU 上不是最优的。你应该预分配一个张量然后填进去。但那又多了 20 到 30 行代码。每加一个东西,都多出那么多代码。书的目标基本上是理解 LLM 怎么工作。它不会是一个生产级的 LLM,但一旦你有了这个理解,你就能理解生产级的 LLM。
**Sebastian Raschka:** It's tricky, because you don't have good long-term vision sometimes you don't have good long-term vision about what would be actually useful to you in your career. But you have to develop that taste, yeah. - I was talking to my fiancee or friends a bout this, there's this brief 10-year window where all of the homework and all the exams could be digital. Before that, everybody had to do all the ex ams in blue books because there was no other way. And now after AI, everyone's going to need to be in blue books and oral exams because everyone could cheat so easily.
**Lex Fridman:** 所以你总是在构建一个能装在一个 GPU 上的 LLM?
**Lex Fridman:** It's like this brief generation that had a different education system where everything could be digital, but you still couldn't chea t. And now it's just going back. It's just very funny. - You mention character training. Just zooming out on a more general topic, for that project ho w much compute was required? And in general, to contribute as a researcher, are there places where not too much compute is required where you can actu ally contribute as an individual researcher? - For the character training thing, I think this research is built on fine-tuning about 7 billion paramet er models with LoRA, which is essentially only fine-tuning a small subset of the weights of the model. I don't know exactly how many GPU hours that wo uld take. - But it's doable. - Not doable for every academic.
**Sebastian Raschka:** 对。大多数都能。我有一些附赠材料关于 MoE 模型。一两个可能需要多个 GPU,但目标是装在一个 GPU 上。而且美妙的是你可以自我验证。这几乎就像 RLVR。当你从零编写这些东西的时候,你可以拿 Hugging Face Transformers 库中的一个现有模型。那个库很棒,但如果你想学 LLM,它不是最好的起点,因为代码太复杂了——要适应这么多用例。因为人们在生产中使用它,它必须非常精密、非常交织、难以阅读。它不是线性的。
**Sebastian Raschka:** The situation for some academics is so dire that the only work you can do is doing infer ence where you have closed models or open models and you get completions from them and you can look at them and understand the models. And that's very well-suited to evaluation, where you want to be the best at creating representative problems that the models fail on or show certain abilities, which I think that you can break through with this. I think that the top-end goal for a researcher working on evaluation, if you want to have career moment um, is that Frontier Labs pick up your evaluation. You don't need to have every project do this.
**Nathan Lambert:** 它一开始是一个微调库,然后成长为每个模型架构的标准表示。Hugging Face 是获取模型的默认地方,Transformers 是软件。它让人们可以轻松加载一个模型并做一些基本操作。
**Nathan Lambert:** But if you go from a small university with no compute and find something that Claude struggles with, and then the next Claude model has it in the blog post, there's your career rocket ship. I think that' s hard, but if you want to scope the maximum possible impact with minimum compute, it's something like that, which is just get very narrow and it take s learning of where the models are going. So you need to build a tool that tests where Claude 4.5 will fail. If I'm going to start a research project, I need to think where the models in eight months are going to be struggling. - But what about developing totally novel ideas? - This is a trade-off.
**Sebastian Raschka:** 而且所有有开放权重模型的前沿实验室都有一个 Transformers 版本——从 DeepSeek 到 gpt-oss-120b。那是你能加载的规范权重格式。但即使 Transformers 这个库本身也不在生产中使用。人们用 SGLang 或 vLLM,那又增加了一层复杂度。
**Sebastian Raschka:** I think that if you're doing a PhD, you could also be like, "It's too risky to work in language models. I'm going way longer term," which is like what is— what is the thing that's going to define language model development in 10 years? I end up being a person that's pretty practical. I mean, I went to my PhD where it was like, "I got into Berkeley.
**Nathan Lambert:** 我们应该说 Transformers 库有大约 400 个模型。
**Nathan Lambert:** Worst case, I get a master's, and then I go work in tech." I'm very practical about it, so I'm like the life afforded to people to work at these AI companies, the amount of... OpenAI's average compensation is over a million dollars in stock a year p er employee. For any normal person in the US, to get into this AI lab is transformative for your life. So I'm pretty practical about it. there's still a lot of upward mobility working in language models if you're focused.
**Sebastian Raschka:** 所以它是一个试图实现大量 LLM 的库,所以你有一个庞大的代码库。它很大。可能有百万——
**Sebastian Raschka:** And look at these jobs. But from a research perspective, the transformative im pact in these academic awards... to be the next Yann LeCun is from not working on— not caring about language model development very much. - It's a big financial sacrifice in that case. - So I work with some awesome students, and they're like, "Should I go work at an AI lab?" And I'm like, "You're ge tting a PhD at a top school. Are you gonna leave to go to a lab?" I don't know. If you go work at a top lab, I don't blame you.
**Nathan Lambert:** 那真是疯了。
**Nathan Lambert:** Don't go work at some random startup that might go to zero. But if you're going to OpenAI, I'm like, "It could be worth leaving a PhD for." - Let's more rigorously think th rough this. So where would you give a recommendation for people to do a research contribution? So the options are academia: get a PhD.
**Sebastian Raschka:** ——几十万行代码。要理解你想理解的部分,就是大海捞针。但美妙的是你有一个可工作的实现,所以你可以反向工程。我推荐做的——我也这么做——是如果我想理解比如 OLMo 怎么实现的,我会去看模型 hub 上的权重,看配置文件,然后你能看到"哦,他们用了多少层。他们用了分组查询注意力还是多头注意力。"你在一个人类可读的、100 行的配置文件中看到所有组件。然后你从你的 GPT-2 模型开始,加上这些东西。这里很酷的是——你可以加载预训练权重,看它们在你的模型中是否工作。你要让输出跟 Transformer 模型的输出一样——然后你可以把那个作为可验证的奖励来确认你的架构是正确的。有时候要花我一天。OLMo 3 的时候,挑战是 RoPE 位置编码。他们有一个 YaRN 扩展,有一些自定义缩放,我没能完全匹配。在这种挣扎中,你真的理解了东西。最后你知道你做对了,因为你可以做单元测试。你可以跟参考实现比对。我觉得这是最好的学习方式之一——基本上就是逆向工程。
**Sebastian Raschka:** Spend five year s publishing. Compute resources are constrained. There's— there's research labs that are more focused on open-weight models, and working there. Or clo sed frontier research labs.
**Nathan Lambert:** 我觉得这是今天每个对 AI 感兴趣的人都应该做的。这就是为什么我喜欢你的书。我是从 RL 和机器人领域来到语言模型的。我从来没花时间去学所有基础。这个 transformer 架构是如此基础——就像过去深度学习一样——人们需要去做这个。我觉得很多人被压倒的地方是"我怎么把这个应用到有影响力的工作或找到职业路径?"因为语言模型让这些基础性的东西非常容易获取,有动力的人会去学。然后就是"我怎么获得研究的周期来做贡献?"我其实挺乐观的,因为这个领域发展太快了,很多时候最好的人没有完全解决一个问题——因为有一个更大的、很容易摘到的问题要解决——所以他们就继续前进了。我觉得我在 RLHF 那本书里试图做的很多事情是——拿后训练技术,描述人们怎么看待它们影响模型、人们在做什么。然后令人惊讶的是有多少事情我觉得人们就不再研究或不去追了。我觉得在做完基础之后去做窄而深的研究是好的——然后阅读相关论文、参与生态系统。实际上……那些在线上的随机人跟顶级研究者之间的距离——没人知道所有那些……X 和 ML 上的匿名账号非常受欢迎,没人知道这些人都是谁。可能就是一些深入研究这些东西的普通人——特别是有了 AI 工具之后。那种"我不理解这个,继续挖"的态度是非常有用的。但有很多研究领域可能只有三篇你需要读的论文,然后其中一个作者可能会回你邮件。但你得在这些邮件中投入大量努力来理解这个领域。我觉得对于新人来说,轻松也得花几周工作才能真正掌握一个非常窄的领域。但我认为在有了基础之后去做窄而深是对人们非常有用的。因为我最近对"性格训练"(character training)非常感兴趣——就是你怎么让模型变得幽默或讽刺或严肃——你怎么处理数据来做到这个?牛津一个学生联系了我说"嘿,我对这个感兴趣",我指导了他。那篇论文现在已经存在了。世界上可能只有两三个人对这个非常感兴趣。他是个博士生,这给了他优势,但对我来说,那是一个我一直在等有人说"嘿,我有时间投入到这上面"的话题。我相信还有很多非常窄的东西,你就觉得"这没有答案居然说不通。"我觉得只是信息太多了,人们觉得"我抓不住任何一个"——但如果你在一个领域扎下去,我觉得有很多有趣的东西可以学。
**Nathan Lambert:** So OpenAI, Anthropic, xAI, and so on. - The two gradients are: the more closed, the more money you tend to get, but you al so get less credit. In terms of building a portfolio of things that you've done, it's very clear what you have done as an academic. Versus if you are going to trade this fairly reasonable progression for being a cog in the machine, which could also be very fun. So I think it's very different career paths.
**Sebastian Raschka:** 是的。我觉得你不能试图全部做完,因为那会非常累人,你会倦怠。对我来说,比如我已经很久没跟上计算机视觉了——我只专注于 LLM。但说回你的书——我觉得这真的是一本很棒的书,性价比很高。因为如果你想学 RLHF,我不会建议你去读 RLHF 论文,因为那得花你两年——
**Sebastian Raschka:** But the opportunity cost for being a researcher is very high because PhD students are paid essentially nothing. So it ends up rewarding people that have a fairly stable safety net, and they realize that they can operate in the long term, wanting to do very interesting work and get a very inte resting job. So it is a privileged position to be like, "I'm gonna see out my PhD and figure it out after because I want to do this." At the same time , the academic ecosystem is getting bombarded by funding getting cut and stuff. So there's just so many different trade-offs where I understand plenty of people that are like, "I don't enjoy it.
**Nathan Lambert:** 有些还互相矛盾。我刚编辑完这本书——没有任何一章我需要写"X 论文说了一件事,Y 论文说了另一件事,我们看看最终哪个是对的。"
**Nathan Lambert:** I can't deal with this funding search. My grant got cut for no reason by the government," or, "I don't kn ow what's gonna happen." So I think there's a lot of uncertainty and trade-offs that, in my opinion, favor just taking the well-paying job with meanin gful impact. It's not like you're getting paid to sit around at OpenAI. You're building the cutting edge of things that are— changing millions of peop le's relationship to tech. - But publication-wise, they're being more secretive, increasingly so.
**Lex Fridman:** 让我过一下目录,我们在后训练的大图景中可能遗漏了什么想法?首先你做了问题设置、训练概述、什么是偏好、偏好数据、优化工具、奖励建模、正则化、指令调优、拒绝采样、强化学习。然后是 Constitutional AI 和 AI 反馈、推理与推理时间扩展在函数调用中的使用、合成数据和蒸馏、评估,然后是一个开放问题部分——过度优化、风格和信息、产品体验、性格和后训练。有哪些值得提到的想法同时连接了教育和研究的部分?你提到了性格训练,这挺有趣的。
**Lex Fridman:** So you're publishing less and less. And so you are h aving a positive impact at scale, but you're a cog in the machine. - I think it honestly hasn't changed that much. I have been in academia. I'm not in academia anymore. wouldn't want to miss my time in academia.
**Nathan Lambert:** 性格训练之所以有趣,是因为关于它的研究太少了。我们谈过人们怎么跟这些模型互动。我们使用它们感觉很好,因为它们很正面,但那可能走得太远——太正面了。本质上就是:你怎么改变数据和决策来让模型变成你想要的样子?OpenAI 有一个叫 model spec 的东西——基本上是他们对模型行为的内部指南——他们把这个发布给开发者。所以本质上你可以知道——什么是 OpenAI 训练的失败——他们有意图但还没达到——相对于他们实际想做但你不喜欢的事情。这种透明性很好,但所有关于策划这些文件以及遵循它们有多容易的方法都不太为人所知。我觉得这本书的设计是——RL 那章显然是人们想要的,因为大家都听说了 RLVR——用的是同样的算法和数学,但你可以在非常不同的文档中使用它。所以我觉得 RLHF 的核心是关于偏好有多混乱。这本质上是我多年前写的一篇论文的重新阐述——但这基本上是那个会告诉你为什么 RLHF 永远不可能被完全解决的章节。因为即使 RL 的设置方式,也假设偏好可以被量化,多个偏好可以被缩减为单一值。我觉得这跟经济学文献中的 Von Neumann-Morgenstern 效用定理有关。那个章节是所有哲学、经济学和心理学背景告诉你什么被压缩进了 RLHF 的过程。所以就像——你有所有这些背景——然后书的后面就是:你用这个 RL 映射来让数字变大。我觉得这就是为什么人们做研究会觉得很有收获——因为量化偏好是人类设计出来的一个问题框架,为了让偏好变得可研究。但存在一些根本性的辩论——比如,在一个语言模型的回复中你有不同在意的东西,像准确率或风格。当你收集数据的时候,它们都被压缩成"我更喜欢这个而不是那个。"这就是正在发生的事,而且世界上其他领域有很多研究探讨你实际应该怎么做。我觉得社会选择理论(social choice theory)是经济学中关于如何汇总偏好的子领域。我去过一个研讨会,他们发表了一份白皮书:"你怎么用社会选择理论来思考 RLHF?"所以我最想要的是——对数学感兴趣的人来找到这些领域,偶然进入这个更广阔的背景。我觉得有一个有趣的事:我就是保留了一个我喜欢的推理模型技术报告的清单。所以在第 14 章——有一个关于 RLVR 的简短总结——那里有一个巨大的表格,列出了我喜欢的每一个推理模型。我觉得在教育方面,到了这个阶段,很多东西需要是"我喜欢什么"——因为语言模型在数学方面已经太强了。比如那篇著名的论文 Direct Preference Optimization(DPO)——一种比 RL 简单得多的解决方式。附录中的推导跳过了数学步骤。对于这本书,我重新做了推导——然后我就说"这个 log 技巧是什么鬼?他们用来改变数学的?"但用语言模型做的时候,它们会说"这是 log 技巧。"然后我就想"我不确定我喜不喜欢这个——数学被商品化到这个程度。"我觉得在读这个附录和跟着数学推导中的一些挣扎——对学习是有好处的。
**Nathan Lambert:** But what I wanted to say before I get to that is that I think it hasn't changed that muc h. I was working in computational biology, using AI or machine learning methods with collaborators, and a lot of people went from academia directly to Google. And I think it's the same. Back then, professors were sad that their students went into industry because they couldn't carry on their legacy.
**Lex Fridman:** 是的,我们在教育话题上反复回到这个。你们两位都多次提到了"挣扎"这个词。所以挣扎是有价值的。如果你在这个过程中不挣扎,你就没有完全遵循正确的学习流程。
**Lex Fridman:** I think it's the same. It hasn't changed that much. The only thing that has changed is the scale. Cool stuff was always developed in industry that wa s closed.
**Nathan Lambert:** 一些提供商在开发教育用的模型——设计成不会一次给出所有信息。我其实没用过,但我猜它们是设计来让人们自己去努力获取信息的。训练模型来做到这一点将是一个很棒的贡献。就像书里的所有东西一样——你得重新评估每一个决定。这是一个很好的例子。我们在 Ai2 可能会做这个,我觉得会很有趣。
**Nathan Lambert:** You couldn't talk about it. And I think the difference now is your preference. Do you like to publish your work, or are you more in a closed lab? That's one difference.
**Sebastian Raschka:** 说得通。我前几天用电子游戏做了类似的事。有时候消遣我会玩电子游戏——我喜欢那种有谜题的游戏——像塞尔达和银河战士。有一个新游戏我真的卡住了,我能接受这个。我不想挣扎两天,所以我用了 LLM。但你会说"嘿,请不要剧透。我现在在这里、在那里。下一步我该做什么?"数学也可以这样做——你说"好,我卡在这个点了。别给我完整答案,给我一个我可以尝试的方向。"你小心地试探它。但问题是我觉得这需要自律。很多人喜欢数学,但也有很多人只是为了作业才做——那就成了捷径。我们可以开发一个教育 LLM,但其他 LLM 还在那里,总有使用其他 LLM 的诱惑。
**Sebastian Raschka:** The compensation, of course, is another, but it's always been like that. It depends on where you feel comfortable. And no thing is forever. Right now, there's a third option, which is launching a startup.
**Nathan Lambert:** 我觉得很多大学生对自己热爱的东西是有自觉的——他们知道那不应该太容易。我觉得我们只需要培养好的品味——谈研究品味、学校品味——关于什么东西你应该去挣扎,什么不应该。这很难,因为你有时候没有对什么在职业生涯中真正有用的长远眼光。但你得培养那种品味。
**Nathan Lambert:** A lot of people are doing that. It's a very risky move, but it can be a high-risk, high-reward situation, whereas joining an industry lab is pretty safe. You also have upward mobility. I think once you've been at an i ndustry lab, it's easier to find future jobs.
**Sebastian Raschka:** 我跟我未婚妻或朋友们聊这个——有一个短暂的十年窗口,所有作业和考试都可以是数字化的。在那之前,每个人都必须在蓝皮本上做考试,因为没有其他方式。现在在 AI 之后,大家又得回到蓝皮本和口试了——因为每个人都能那么容易地作弊。就像那短暂的一代人有一个不同的教育系统——一切可以数字化,但你仍然不能作弊。现在又回去了。太好笑了。
**Sebastian Raschka:** But then again, how much do you enjoy the team and working on proprietary things versus how much you lik e publishing work? I mean, publishing is stressful. Acceptance rates at conferences can be arbitrary and very frustrating, but it's high reward if you have a paper published. You feel good because your name is on there.
**Lex Fridman:** 你提到了性格训练。放大到更一般的话题,那个项目需要多少算力?总的来说,作为一个研究者要做贡献,有没有不需要太多算力的地方,你可以作为个人研究者做出贡献?
**Lex Fridman:** It's a high accomplishment. - I feel like my friends who are professors seem hap pier than those who work at a frontier lab, to be honest. There's a grounding there. The frontier labs definitely do this 9-9-6, which is shorthand fo r working all the time. - Can you describe 9-9-6? It's a culture invented, I believe, in China and adopted in Silicon Valley.
**Nathan Lambert:** 性格训练这个研究建立在用 LoRA 微调大约 70 亿参数模型的基础上——LoRA 本质上只微调模型权重的一小部分。我不确切知道需要多少 GPU 小时。
**Nathan Lambert:** What is 9-9-6? It's 9:00 AM to 9:00 PM, - Six days a week. - six days a week. What is that, 72 hours? Okay.
**Sebastian Raschka:** 但是可以做到的。
**Sebastian Raschka:** So, is this basically the standard in AI companies in Silicon Vall ey? This kind of grind mindset. - Yeah, I mean, maybe not exactly like that, but I think there is a trend towards it. And it's interesting. I think it almost flipped because when I was in in academia, I felt like that.
**Nathan Lambert:** 不是每个学术界的人都做得到。有些学术界的人的处境非常艰难——他们唯一能做的工作就是做推理——用闭源模型或开源模型获取补全(completion),然后看这些补全并理解模型。这非常适合评测——你想成为最擅长创造那些让模型失败或展示某些能力的代表性问题的人。我觉得你可以通过这个取得突破。我觉得对于做评测的研究者来说——如果你想有职业动力——终极目标是前沿实验室采用你的评测。不是每个项目都需要做到这个,但如果你从一个没有算力的小型大学出发,找到了 Claude 难以解决的问题,然后下一个 Claude 模型在博客里提到了它——这就是你的职业火箭。我觉得这很难,但如果你想要用最少算力获得最大可能影响——就是这样——做得非常窄,需要了解模型的发展方向。所以你需要构建一个测试 Claude 4.5 会在哪里失败的工具。如果我要开一个研究项目,我需要思考八个月后的模型会在哪里挣扎。
**Nathan Lambert:** As a professor, you write grants, you teach, and you do research. It's like three jobs in one, and it's more than a full-time job if you want to be successful. successful. And I feel like now, like Nathan just said, the professors, in comparison to a lab, I think they have less pressure or workload than at a frontier lab because— - I think they work a lot. They're just so fulfil led.
**Lex Fridman:** 但如果是开发全新的想法呢?
**Lex Fridman:** By working with students— and having a constant runway of mentorship and a mission that is very people-oriented, I think in a era when things are moving very fast and are very chaotic, it's very rewarding to people. - Yeah, and I think at a startup, it's this pressure. It's like you have to mak e it. And it's really important that people put in the time, but it is really hard because you have to deliver constantly, and I've been at a startup. I had a good time, but I don't know if I could do it forever.
**Nathan Lambert:** 这是一个权衡。我觉得如果你在读博,你也可以说"在语言模型上工作风险太高了。我要走更长远的路"——就是什么会定义十年后的语言模型发展?我最终是一个挺务实的人。我读博的时候就想——"我进了 Berkeley。最坏的情况我拿个硕士,然后去科技公司工作。"我对此非常务实。所以我觉得——这些 AI 公司能给人带来的生活条件、那个金额……OpenAI 平均每个员工的年薪——包含股票——超过一百万美元。对于美国的任何普通人来说,进入这些 AI 实验室对你的生活是变革性的。所以我挺务实的。做语言模型方面仍然有大量向上流动的空间——如果你专注的话。去看看这些工作。但从研究角度来说,那种变革性影响和学术奖项——成为下一个 Yann LeCun——是来自不太关心语言模型发展的人。
**Nathan Lambert:** It's an interesting pace and it's exactly like we talked about in the beginning. These models are leapfrogging each other, and they are just constantly trying to take the next step compared to their competitors. It's just ruthless right now. - I think this leapfrogging nature and having multiple players is actually an underrated driver of language modeling progress where competition i s so deeply ingrained in people, and these companies have intentionally created very strong cultures. Like, Anthropic is known to be so culturally, li ke, deeply committed and organized.
**Sebastian Raschka:** 在那种情况下是一个巨大的财务牺牲。
**Sebastian Raschka:** I mean, we hear so little from them, and everybody at Anthropic seems very aligned. And it's like being in a cultu re that is super tight and having this competitive dynamic is a thing that's gonna make you work hard and create things that are better. But that come s at the cost of human capital, which is like you can only do this for so long, and people are definitely burning out. I wrote a post on burnout as I' ve tread in and out of this myself, especially trying to be a manager, full-mode training.
**Nathan Lambert:** 所以我跟一些很棒的学生合作,他们问"我应该去 AI 实验室工作吗?"我说"你在顶级学校读博。你要退学去实验室?"我不知道。如果你去顶级实验室工作,我不怪你。别去某个可能归零的随机创业公司。但如果你要去 OpenAI——我会说"为此退学可能值得。"
**Nathan Lambert:** It's a crazy job doing this. The book Apple in China by Pat rick McGee, he talked about how hard the Apple engineers worked to set up the supply chains in China, and he was like, they had "saving marriage" prog rams, and he told in a podcast, he was like, "People died from this level of working hard." So I think it's just like it's a perfect environment for c reating progress based on human expense, and there's gonna be a lot of... the human expense is the 996 that we started this with, which is like— ... p eople do really grind. - I also read this book. I think they had a code word for if someone had to go home to spend time with their family to save the marriage, and it's crazy. Then the colleagues said, "Okay, this is like red alert for this situation. We have to let that person go home this weekend ." But at the same time, I don't think they were forced to work. They were so passionate about the product, I guess, that you get into that mindset. A nd I had that sometimes as an academic, but also as an independent person, I have that sometimes. I overwork, and it's unhealthy. I had back issues, I had neck issues, because I did not take the breaks that I maybe should have taken. But no one forced me to; it's because I wanted to work, because it 's exciting stuff. - That's what OpenAI and Anthropic are like.
**Lex Fridman:** 让我们更严格地想一下这个问题。你推荐人们在哪里做研究贡献?选项有学术界:读博。花五年发论文。算力有限。还有更专注于开放权重模型的研究实验室,在那里工作。或者闭源前沿研究实验室——OpenAI、Anthropic、xAI 等等。
**Lex Fridman:** They want to do this work. - Yeah, but there's also a feeling of fervor that's buildin g, especially in Silicon Valley, aligned with the scaling laws idea, where there's this hype where the world will be transformed in a scale of weeks a nd you want to be at the center of it. And then, you know, I have this great fortune of having conversations with a wide variety of human beings, and from there I get to see all these bubbles and echo chambers across the world. It's fascinating to see how we humans form them. And I think it's fair t o say that Silicon Valley is a kind of echo chamber, a kind of silo and bubble.
**Nathan Lambert:** 两个梯度是:越闭源,你通常赚的钱越多,但你得到的信用(credit)也越少。在建立你做过什么的作品集方面——作为学术界的人,你做了什么非常清楚。而如果你要把这种相当合理的成长路径换成做机器上的一个齿轮——那也可以很有趣。所以我觉得是非常不同的职业路径。但做研究者的机会成本非常高,因为博士生基本上没什么工资。所以这最终有利于那些有相当稳定安全网的人——他们意识到可以从长远考虑,想做非常有趣的工作并得到非常有趣的职位。所以这是一个特权位置——"我要读完博士然后再想办法,因为我想做这个。"同时学术生态系统正被经费削减轰炸。所以有这么多不同的权衡,我理解很多人说"我不享受了。我受不了这种找经费的折磨。我的 grant 被政府无缘无故砍了",或者"我不知道会发生什么。"所以我觉得有很多不确定性和权衡——在我看来有利于拿那个薪酬好、有意义影响的工作。不是说你在 OpenAI 拿钱坐着不干活。你在构建改变数百万人与技术关系的最前沿的东西。
**Nathan Lambert:** I think bubbles are actually really useful and effective. It's not nec essarily a negative thing because you could be ultra-productive. It could be the Steve Jobs reality distortion field, because you just convince each o ther that breakthroughs are imminent, and by convincing each other of that, you make the breakthroughs imminent. - Byrne Hobart wrote a book classifyi ng bubbles. One of them is financial bubbles, which is like speculation, which is bad, and the other one is for build-outs, because it pushes people t o build these things.
**Lex Fridman:** 但在发表方面,他们越来越保密了。所以你发表的越来越少。你确实在大规模上产生积极影响,但你是机器上的一个齿轮。
**Lex Fridman:** And I do think AI is in this, but I worry about it transitioning to a financial bubble, which is - Yeah, but also in the space o f ideas, that bubble—you are doing a reality distortion field, and that means you are deviating from reality. And if you go too far from reality while also working, you know, 996, you might miss some fundamental aspects of the human experience, including beyond Silicon Valley. This is a common probl em in Silicon Valley: it's a very specific geographic area. You might not understand the Midwest perspective, the full experience of all the other hum ans in the United States and across the world, and you speak a certain way to each other, you convince each other of a certain thing, and that can get you into real trouble.
**Sebastian Raschka:** 我觉得说实话这并没有变多少。我在学术界待过。我现在不在学术界了。不想错过我在学术界的时光。但我想先说的是——我觉得这并没有变多少。我之前在计算生物学领域——用 AI 或机器学习方法跟合作者一起做——很多人从学术界直接去了 Google。我觉得一样。那时候教授们也很难过学生去了工业界,因为他们不能延续自己的遗产。我觉得一样,没怎么变。唯一变了的是规模。酷的东西一直都是在工业界开发的——而且是闭源的。你不能谈论它。我觉得现在的区别是你的偏好。你喜欢发表你的工作,还是更愿意在闭源实验室?这是一个区别。薪酬当然是另一个,但这一直都是这样。取决于你在哪里感到舒适。而且没有什么是永远的。现在还有第三个选项——创业。很多人在做。这是一个非常冒险的举动,但可以是高风险高回报的情况。而加入工业实验室是相当安全的。你也有向上流动的空间。我觉得一旦你在工业实验室待过,找未来的工作更容易。但另一方面——你有多享受团队和做专有的东西,相比你有多喜欢发表工作?我是说,发表有压力。会议的接收率可能很随意、很让人沮丧,但如果你有论文发表了,那是很高的回报。你感觉很好,因为你的名字在上面。这是一个很高的成就。
**Sebastian Raschka:** Whether AI is a big success and becomes a powerful technology or it's not, in either trajectory you can get yourself into trou ble. So you have to consider all of that. Here you are, a young person trying to decide what you want to do with your life. - The thing that is... I d on't even really understand this, but the SF AI memes have gotten to the point where "permanent underclass" was one of them, which was the idea that t he last six months of 2025 was the only time to build durable value in an AI startup or model.
**Lex Fridman:** 我觉得我那些做教授的朋友似乎比那些在前沿实验室工作的人更快乐,说实话。有一种扎实感。前沿实验室确实是 996 模式——就是一直在工作的代名词。
**Lex Fridman:** Otherwise, all the value will be captured by existing c ompanies and you will therefore be poor, which... that's an example of the SF thing that goes so far. I still think for young people going to be able to tap into it, if you're really passionate about wanting to have an impact in AI, being physically in SF is the most likely place where you're going to do this. But it has has trade-offs. - I think SF is an incredible place, but there is a bit of a bubble. And if you go into that bubble, which is e xtremely valuable, just get out also.
**Nathan Lambert:** 你能描述一下 996 吗?这是一种文化——据我所知起源于中国,被硅谷采用了。什么是 996?早 9 点到晚 9 点——
**Nathan Lambert:** Read history books, read literature, visit other places in the world. Twitter and Substack are not the entire wo rld. - I would say, one of the people I worked with is moving to SF, and it's like, I need to get him a copy of Season of the Witch, which is a histor y of SF from 1960 to 1985, which goes through the hippie revolution, like all the gays taking over the city and that culture emerging, and then the HI V/AIDS crisis and other things. And it's just like, that is so recent, and so much turmoil and hurt, but also love in SF. And it's like, no one knows about this.
**Sebastian Raschka:** 一周六天。
**Sebastian Raschka:** It's a great book, Season of the Witch. I recommend it. A bunch of my SF friends who get out recommended it to me. And I think that's just like living there...
**Nathan Lambert:** 一周六天。那是多少——72 小时?好。那这基本上是硅谷 AI 公司的标准吗?这种拼命的心态。
**Nathan Lambert:** I lived there and I didn't appreciate this context, and it's just so recent. - Yeah. Okay, let's... We talked a lot about a lot of things. Certainly about the things that were exciting last year.
**Sebastian Raschka:** 是的,我的意思是,可能不完全是那样,但我觉得有这个趋势。而且很有趣。我觉得几乎翻转了——因为我在学术界的时候,我就是那种感觉。作为教授,你写 grant、教书、做研究。一个人干三份工作,如果你想成功的话不只是一份全职工作。我觉得现在,就像 Nathan 刚才说的,教授们跟实验室比起来——我觉得他们的压力或工作量比在前沿实验室少,因为——
**Sebastian Raschka:** But this year, One of the things you guys mentioned that's exciting is the scaling of text diffusion models, and just a different exploration of text diffusion. Can you talk about what that is and what the possibility it holds? So, different kinds of approaches than the current LMs? - Yeah, so we talked a lot about the transformer architecture and the autoregressive transformer a rchitecture specifically, like GPT. And it doesn't mean no one else is working on anything else.
**Nathan Lambert:** 我觉得他们工作量也很大。只是他们非常有满足感。跟学生一起工作——有持续的指导和使命感,那是非常以人为导向的。我觉得在事物快速变化、非常混乱的时代,这对人来说非常有回报。
**Nathan Lambert:** So, people are always on the, let's say, lookout for the next big thing. Because I think it would be almost stupid not to. Because sure, right now, the transformer architecture is the thing, and it works best, and there's, right now, nothing else out there. But, you know, it's always a good idea to not put all your eggs into one basket.
**Sebastian Raschka:** 是的。我觉得在创业公司就是那种压力——你必须成功。人们投入时间非常重要,但确实很难——因为你得不断交付。我在创业公司待过。我度过了很好的时光,但我不知道我能不能永远做下去。那是一个有趣的节奏——就像我们一开始说的——这些模型在互相超越,他们就是不断试图比竞争对手领先一步。现在真的很残酷。
**Sebastian Raschka:** So, people are developing other alternatives to the autoregressive transformer. One of them would be, for example, text diffusion models. And listeners may know dif fusion models from image generation, like Stable Diffusion popularized it. There was a paper on generating images.
**Nathan Lambert:** 我觉得这种互相超越的特性和有多个参与者,实际上是语言建模进步中被低估的驱动力。竞争深深植入了人的本性,这些公司有意培养了非常强的文化。比如 Anthropic 以文化上极度专注和有组织著称。我们很少听到他们的消息,Anthropic 的每个人似乎都非常一致。处于一个极其紧密的文化中加上这种竞争动力——会让你努力工作并创造出更好的东西。但这以人力资本为代价——你只能这样做这么长时间,人们确实在倦怠。我写过一篇关于倦怠的文章——我自己也经历过——特别是试图同时做管理者和全力训练模型。这是一份疯狂的工作。Patrick McGee 写的《Apple in China》这本书里,他谈到 Apple 工程师为在中国建供应链有多辛苦。他说他们有"拯救婚姻"项目。他在播客里说"有人因为这种程度的拼命工作而死亡。"所以我觉得——这是一个以人的代价来创造进步的完美环境。而人的代价就是我们一开始讨论的 996——人们确实拼得很厉害。
**Nathan Lambert:** Back then, people used GANs, Genera tive Adversarial Networks. And then there was this diffusion process where you iteratively denoise an image, and that resulted in really good quality images over time. Stable Diffusion was a company. Other companies build their own diffusion models.
**Sebastian Raschka:** 我也读了那本书。我记得如果有人需要回家陪家人来挽救婚姻,他们有一个代号。很疯狂。然后同事们会说"好,这是红色警报。我们得让那个人这个周末回家。"但同时我不觉得他们是被迫工作的。他们对产品太有热情了——你会进入那种心态。我有时候作为学术界的人也这样,作为独立工作的人也这样。我会过度工作,而且不健康。我有过背部问题、颈部问题——因为我没有休息我本该休息的时间。但没人强迫我——这是因为我想工作,因为这是令人兴奋的东西。
**Sebastian Raschka:** And then people are now like, "Okay, can we try th is also for text?" Doesn't, you know, make intuitive sense yet, because it feels like, okay, it's not something continuous like a pixel that we can di fferentiate. It's discrete text, so how do we implement that denoising process? It's kind of similar to the BERT models by Google. Like, when you go b ack to the original transformer, they were the encoder and the decoder.
**Nathan Lambert:** 那就是 OpenAI 和 Anthropic 的样子。他们想做这些工作。
**Nathan Lambert:** The decoder is what we are using right now in GPT and so forth. The encoder is more like a parallel technique where you have multiple tokens that you fill in in parallel. GPT models, they do autoregressive generation, completing the sentence one token at a time. And in BERT models, you have a sentence that has gaps.
**Lex Fridman:** 是的,但也有一种狂热的感觉在硅谷积累——跟扩展定律的理念一致——有一种炒作说世界将在几周的尺度上被改变,你想在这一切的中心。然后——你知道——我很幸运能跟各种各样的人交谈,从中我看到了全世界各种泡沫和回音室。看我们人类怎么形成这些泡沫太迷人了。我觉得公平地说,硅谷是一种回音室,一种孤岛和泡沫。但我觉得泡沫实际上是有用的和高效的。不一定是负面的——因为你可以超级高产。它可以是 Steve Jobs 式的现实扭曲力场——因为你们互相说服突破即将到来,通过互相说服这一点,你让突破真的即将到来。
**Lex Fridman:** You mask them out, and then one iteration is filling in thes e gaps. Text diffusion is kind of like that, where you are starting with some random text, and then you are filling in the missing parts or refining t hem iteratively over multiple iterations. And the cool thing here is that this can do multiple tokens at the same time. It's like the promise of havin g it more efficient.
**Nathan Lambert:** Byrne Hobart 写了一本书对泡沫进行分类。一种是金融泡沫——投机——那是坏的。另一种是建设期的——因为它推动人们去建造东西。我确实认为 AI 是后者,但我担心它转变成金融泡沫。
**Nathan Lambert:** Now, the trade-off is, of course, how good is the quality? It might be faster, and now you have this dimension of the denoising p rocess. The more steps you do, the better the text becomes. And people...
**Lex Fridman:** 是的。但同时在思想空间里——那个泡沫——你在做现实扭曲力场,这意味着你在偏离现实。如果你在一边偏离现实一边 996 地工作,你可能会错过人类体验的一些根本层面——包括硅谷以外的。这是硅谷常见的问题:这是一个非常特定的地理区域。你可能不理解中西部的视角,不理解美国和世界所有其他人的完整体验。你们用特定的方式跟彼此说话,互相说服某些事情,这可能让你陷入真正的麻烦。不管 AI 是一个大成功还是不是——不管是哪种轨迹——你都可能给自己惹上麻烦。所以你得考虑所有这些。你就在这里——一个年轻人试图决定自己人生要做什么。
**Lex Fridman:** I mean, you can scale in different ways. They try to see if that is maybe a valid alternative to the autoregressive model in terms of giving you the same quality for less compute. Right now, there are papers that suggest if yo u want to get the same quality, you have to crank up the denoising steps, and then you end up spending the same compute you would spend on an autoregr essive model. The other downside is, while being parallel sounds appealing, some tasks are not parallel.
**Nathan Lambert:** 有一个事情……我自己都不太理解,但旧金山的 AI meme 已经发展到——"永久下层阶级"是其中一个——就是 2025 年下半年是在 AI 创业公司或模型中创造持久价值的唯一时机。否则所有价值将被现有公司捕获,你因此会变穷。这……这就是旧金山那种走到极端的例子。我仍然觉得对年轻人来说——如果你真的对 AI 有影响力很有热情——身在旧金山是最有可能实现这一点的地方。但有取舍。
**Nathan Lambert:** Like reasoning tasks or tool use, maybe where you have to ask a code interpreter to give you an intermediate result. That is tricky with diffusion models. So, there are some hybrids, but the main idea is: how can we parallelize it? It's an interesting avenue.
**Lex Fridman:** 我觉得旧金山是一个不可思议的地方,但确实有点泡沫。如果你走进那个泡沫——那极其有价值——也要走出来。读历史书,读文学作品,去世界其他地方看看。Twitter 和 Substack 不是全世界。
**Lex Fridman:** I think right now, there are mostly research models out there, like LaMDA and some ot her ones. I saw some by startups, some deployed models. There is no big diffusion model at scale yet, like on the Gemini or ChatGPT level. But there w as an announcement by Google, a site where they said they are launching Gemini Diffusion, and they put it into context of their Gemini Nano 2 model, a nd they said basically: for the same quality on most benchmarks, we can generate things much faster.
**Nathan Lambert:** 我会说——我合作的一个人正在搬到旧金山——我得给他买一本《Season of the Witch》。这是一本旧金山从 1960 年到 1985 年的历史——讲了嬉皮运动、同性恋群体占据城市和那种文化的涌现,然后是 HIV/AIDS 危机和其他事情。那是多么近的历史——那么多动荡和伤痛——但也有爱——在旧金山。没人知道这些。这是一本很棒的书,《Season of the Witch》。我推荐。我那些走出旧金山的朋友推荐给我的。住在那里……我住过那里但我没有感受到这个背景——而且这些是多么近的历史。
**Nathan Lambert:** You mentioned what's next. I don't think the text diffusion model is going to replace autoregressive LLMs, but it will be something maybe for quick, cheap, at-scale tasks. Maybe the free tier in the future will be something like that. - I think there are examples where it's already being used. To paint an example of why this is better, for example , when GPT-5 is taking 30 minutes to respond, it's generating one token at a time.
**Lex Fridman:** 是的。好,我们聊了很多……聊了很多东西。当然包括去年令人兴奋的那些进展。但今年,你们提到的一个令人兴奋的方向是 text diffusion models 的 scaling,以及对 text diffusion 的不同探索。能不能聊一下这是什么,以及它有什么可能性?就是和当前的 LM 不同的方法?
**Lex Fridman:** And this diffusion idea is essentially to generate all of those tok ens and the completion in one batch, which is why it could be way faster. And I think it could be suited for... the startups I'm hearing about are cod e startups where you have a code base, and you have somebody that's effectively "vibe coding," and they say, "Make this change." And a code diff is es sentially a huge reply from the model, but it doesn't have to have that much external context, and you can get it really fast by using these diffusion models. One example I've heard is that they use text diffusion to generate really long diffs, because doing it with an autoregressive model would tak e minutes, and that time for a user-facing product causes a lot of churn. Every second, you lose a lot of users.
**Sebastian Raschka:** 是的,我们之前聊了很多关于 transformer 架构和自回归 transformer 架构的内容,特别是 GPT 那种。但这并不意味着没人在做别的东西。人们一直在寻找下一个大突破。因为我觉得不这样做几乎是愚蠢的。没错,现在 transformer 架构是王者,效果最好,目前没有其他东西能比。但你知道,不把鸡蛋都放在一个篮子里总是好主意。所以人们在开发自回归 transformer 的替代方案。其中一个就是 text diffusion models。听众可能从图像生成领域了解 diffusion models,比如 Stable Diffusion 把它普及了。之前有一篇生成图像的论文。那时候人们用 GAN,也就是生成对抗网络。然后出现了这个 diffusion 过程,通过迭代去噪图像,随着时间推移产生了非常高质量的图像。Stable Diffusion 是一家公司。其他公司也构建了自己的 diffusion models。然后现在人们就想:"好,我们能不能把这个也用在文本上?"直觉上还没有完全说得通,因为感觉——好吧,文本不是像像素那样连续可微分的东西。它是离散的文本,那你怎么实现这个去噪过程呢?这有点类似于 Google 的 BERT 模型。如果你回到最初的 transformer,有 encoder 和 decoder。decoder 就是我们现在在 GPT 等模型中使用的。encoder 更像是一种并行技术,你有多个 token 同时填入。GPT 模型做的是自回归生成,一次完成一个 token。而在 BERT 模型中,你有一个句子,里面有空缺。你把它们遮住,然后一次迭代就是填入这些空缺。text diffusion 有点类似,你从一些随机文本开始,然后在多次迭代中填入缺失的部分或逐步精炼它们。这里的酷炫之处在于,它可以同时处理多个 token。这就像是更高效的一种承诺。当然,trade-off 是质量怎么样?它可能更快,而且你现在有了去噪过程这个维度——你做的步骤越多,文本就越好。而且人们……我是说,你可以以不同方式 scaling。他们在尝试看看这是否可能是自回归模型的一个有效替代方案——能用更少的算力给你同样的质量。目前有论文表明,如果你想获得同样的质量,你得增加去噪步骤,然后最终花的算力和自回归模型一样多。另一个缺点是,虽然并行化听起来很诱人,但有些任务不是并行的。比如推理任务或工具使用,也许你需要让代码解释器给你一个中间结果。这在 diffusion models 中很棘手。所以有一些混合方案,但核心思想是:我们怎么并行化?这是一个有趣的方向。我觉得现在主要还是研究模型,像 LaMDA 和其他一些。我看到一些初创公司的部署模型。目前还没有像 Gemini 或 ChatGPT 那个级别的大规模 diffusion 模型。但 Google 有一个公告,说他们要推出 Gemini Diffusion,并且把它放在 Gemini Nano 2 模型的语境中,基本上说:在大多数 benchmark 上达到同样质量的情况下,我们可以更快地生成内容。你问什么是下一步。我不认为 text diffusion model 会取代自回归 LLM,但它可能用于快速、廉价、大规模的任务。也许未来的免费层就是这样的东西。
**Sebastian Raschka:** So, I think it's going to be this thi ng where it's going to— ...grow and have some applications, but I actually thought that different types of models were going to be used for different things much sooner than they have been, so I kind of trade off. I think the tool-use point is the one that's stopping them from being most general pur pose because, for Claude Code and ChatGPT search, the autoregressive chain is interrupted with some external tool, and I don't know how to do that wit h the diffusion setup. - So what's the future of tool use this year and then in the coming years? Do you think there's going to be a lot of developmen ts there, and how that's integrated into the entire stack? - I do think right now, it's mostly on the proprietary LLM side, but I think we will see mo re of that in the open-source tooling. And I think it is a huge unlock because then you can really outsource certain tasks from just memorization to a ctual— you know, instead of having the LLM memorize what is 23 plus 5, just use a calculator. - So do you think that can help solve hallucination? - N ot solve it, but reduce it.
**Nathan Lambert:** 我觉得已经有一些例子在用了。来画个图说明为什么这更好——比如当 GPT-5 需要 30 分钟来回复时,它是一次生成一个 token。而 diffusion 的想法本质上是一次批量生成所有 token 和完整回复,这就是为什么它可以快得多。而且我觉得它可能适用于……我听说的初创公司是代码类初创公司,你有一个代码库,有人在做"vibe coding",他们说"做这个改动"。一个代码 diff 本质上是模型的一个巨大回复,但它不需要那么多外部上下文,而且你用 diffusion models 可以非常快地得到结果。我听说的一个例子是他们用 text diffusion 来生成非常长的 diff,因为用自回归模型会花好几分钟,而那个时间对于面向用户的产品来说会导致大量流失。每一秒你都会失去很多用户。所以我觉得它会是这样一种东西——它会成长,有一些应用场景。但实际上我之前以为不同类型的模型会更早被用于不同的事情,所以我现在有点犹豫。我觉得工具使用那个点才是阻止它们成为最通用模型的关键——因为对于 Claude Code 和 ChatGPT 搜索来说,自回归链会被某个外部工具打断,而我不知道怎么在 diffusion 的设定下做到这一点。
**Nathan Lambert:** So the LLM still needs to know when to ask for a tool call. And the second one is, well, it doesn't mean the internet is a lways correct. You can do a web search, but let's say I asked who won the World Cup in, let's say, 1998; it still needs to find the right website and get the right information. You can still go to the incorrect website and give me incorrect information.
**Lex Fridman:** 那今年以及未来几年,工具使用的发展方向是什么?你觉得会有很多进展吗?它是怎么集成到整个技术栈中的?
**Lex Fridman:** So I don't think it will fully solve that, but it is improving it in that sense. And so another cool paper earlier this year—I think it was December 31st, so it's not technically 2026, but close—t he recursive language model. That's a cool idea to kind of take this even a bit further. Just to explain, Nathan, you also mentioned earlier, it's har der to do cool research in academia because of the compute budget.
**Sebastian Raschka:** 我确实觉得目前主要还是在专有 LLM 那边,但我认为我们会在开源工具中看到更多。我觉得这是一个巨大的解锁——因为那时你可以真正把某些任务从纯记忆力外包出去——你知道,与其让 LLM 记住 23 加 5 是多少,不如直接用计算器。
**Sebastian Raschka:** If I recall correctly, they did everything with GPT-5, so they didn't even use loca l models, but the idea is, let's say you have a long-context task; instead of having the LLM solve all of it in one shot or even in a chain, you break it down into sub-tasks. You have the LLM decide what is a good sub-task, and then recursively call an LLM to solve that. And I think something like t hat, adding tools—you know, each one maybe you have a huge Q&A task, so each one goes to the web and gathers information, and then you pull it togethe r at the end and stitch it back together. I think there's going to be a lot of unlock using things like that where you don't necessarily improve the L LM itself; you improve how the LLM is used and what the LLM can use.
**Lex Fridman:** 那你觉得这能帮助解决幻觉问题吗?
**Lex Fridman:** One downside right now with tool use is you have to give the LLM permission to us e tools. And that will take some trust, especially if you want to unlock things like having an LLM answer emails for you—or not even answer, but just sort them for you or select them for you or something like that. I don't know if I would today give an LLM access to my emails, right? I mean, this is a huge risk. - I think there's a cool... one last point on the tool use thing.
**Sebastian Raschka:** 不是解决,但能减少。LLM 仍然需要知道什么时候该调用工具。第二点是——这并不意味着互联网上的信息总是对的。你可以做网页搜索,但比如我问 1998 年世界杯谁赢了,它仍然需要找到正确的网站并获取正确的信息。你仍然可能访问到错误的网站,给我错误的信息。所以我不认为它会完全解决这个问题,但在那个意义上它确实在改善。另一篇今年早些时候的很酷的论文——我觉得是 12 月 31 日的,所以严格来说不算 2026 年,但很接近了——叫 recursive language model。这是一个很酷的想法,把这个概念推得更远一些。我解释一下——Nathan,你之前也提到过——在学术界做酷炫的研究更难了,因为算力预算有限。如果我没记错的话,他们全部用 GPT-5 做的,甚至没用本地模型。但想法是,比如你有一个长上下文任务——与其让 LLM 一次性解决所有问题,甚至用链式方法——你把它分解成子任务。让 LLM 决定什么是好的子任务,然后递归地调用 LLM 来解决。我觉得类似这样的东西,加上工具——你知道,每一个也许都有一个大的 Q&A 任务,每一个去网上收集信息,然后最后拼在一起缝合回来。我觉得会有很多这样的解锁——你不一定要改善 LLM 本身,你改善 LLM 的使用方式和 LLM 能使用的东西。目前工具使用的一个缺点是你必须给 LLM 使用工具的权限。这需要信任,尤其是如果你想解锁比如让 LLM 帮你回复邮件——甚至不是回复,只是帮你分类、筛选或做类似的事。我不知道今天我会不会给一个 LLM 访问我的邮件的权限,对吧?这是一个巨大的风险。
**Sebastian Raschka:** I think that you hinted at this, and we've both come at this in our ow n ways, is that the open versus closed models use tools in very different ways, where open models, people go to Hugging Face and download the model, a nd then the person's going to be like, "What tool do I want?" I don't know, Exa is my preferred search provider, but somebody else might care for a di fferent search startup. Where you release a model, it needs to be useful for multiple tools, for multiple use cases, which is really hard because you' re making a general reasoning engine model, which is actually what gpt-oss-120b is good for. But on the closed models, you're deeply integrating the s pecific tool into your experience, and I think that open models will struggle to replicate some of the things that I like to do with closed models, wh ich will be like, you can reference a mix of public and private information. And something that I keep trying every three to six months, I try Claude Code on the web, which is just prompting a model to make an update to some GitHub repository that I have.
**Nathan Lambert:** 我觉得还有一个很酷的……关于工具使用的最后一点。我觉得你暗示过这个,而且我们各自从不同角度看到了这一点——开放模型和封闭模型使用工具的方式非常不同。开放模型的话,人们去 Hugging Face 下载模型,然后那个人就会想:"我想用什么工具?"我不知道——Exa 是我喜欢的搜索服务商,但别人可能喜欢另一个搜索初创公司。当你发布一个模型,它需要适用于多种工具、多种用例,这非常难,因为你在做一个通用推理引擎模型——这其实就是 gpt-oss-120b 擅长的。但在封闭模型这边,你是在深度集成特定的工具到你的体验中。我觉得开放模型很难复制我在封闭模型上喜欢做的一些事情——比如你可以引用公开信息和私有信息的混合。还有一件事我每三到六个月就会尝试一次——我试用 Claude Code 的网页版,就是提示模型去更新我在 GitHub 上的某个仓库。那种安全的云环境真的很好用——你把任务发出去,它去做,然后回来告诉你结果。这些可能会帮助定义一些本地开源和封闭模型的各自生态位。但我觉得最初,因为大家太急于让工具使用跑起来,开放模型是落后的——这是不可避免的。我觉得前沿实验室有那么多研究和资源。但当开放模型解决这个问题时会很有趣——因为它将需要一个更灵活、可能更有趣的模型,也许能跟这个 recursive 的想法配合,成为一个编排者和工具使用模型。所以希望这种必要性能推动一些有趣的创新。
**Nathan Lambert:** And it's just like that set of secure cloud environments is just so nice for just sending it off to do this thing and then come back to me, and these will probably help define some of the local open and closed niches. But I think initially, because there was such a rush to get tool use working, the open models were on the back foot, which is kind of inevitable. I think there's so much research, so many resources in these frontier labs, but it will be fun when the open models solve this bec ause it's going to necessitate a bit more flexible and potentially interesting model that might work with this recursive idea to be an orchestrator an d a tool use model, so hopefully the necessity drives some interesting innovation there. - So, continual learning—this is a longstanding topic, import ant problem. I think that increases in importance as the cost of training the models goes up.
**Lex Fridman:** 好,continual learning——这是一个长期存在的课题,一个重要问题。我觉得随着训练模型的成本增加,它的重要性也在增加。能不能解释一下什么是 continual learning,以及今年和未来几年在这方面取得进展有多重要?
**Lex Fridman:** So can you explain what continual learning is and how im portant it might be this year and in the coming years to make progress? - This relates a lot to this kind of SF zeitgeist of, what is AGI, which is Ar tificial General Intelligence, and what is ASI, Artificial Superintelligence, and what are the language models that we have today capable of doing? I think the language models can solve a lot of tasks, but a key milestone among the AI community is essentially when AI could replace any remote worker, taking in information and solving digital tasks and doing them. And the limitation that's highlighted by people is that a language model will not lea rn from feedback the same way that an employee does. So if you hire an editor, the editor will mess up, but you will tell them.
**Nathan Lambert:** 这跟旧金山那种时代精神很相关——什么是 AGI,就是通用人工智能;什么是 ASI,就是超级人工智能;以及我们今天拥有的语言模型能做什么?我觉得语言模型能解决很多任务,但 AI 社区中的一个关键里程碑本质上是:AI 什么时候能取代任何远程工作者——接收信息、解决数字任务并完成它们。人们强调的局限是:语言模型不会像员工那样从反馈中学习。如果你雇了一个编辑,编辑会犯错,但你会告诉他们。如果你雇了一个好编辑,他们不会再犯同样的错。但语言模型没有这种修改自身和快速学习的能力。所以想法是:如果我们真的想达到一种真正通用的、可适应的智能——能进入任何远程工作场景的那种——它就需要能从反馈中快速学习,需要在职学习。我个人更看好的是语言模型只需要提供非常好的上下文就行。你曾经说过——也许是在线下——你可以给模型写大量的文档,说"我有所有这些信息。这是我写过的所有博客文章。我喜欢这种写作风格。我的声音基于此。"但很多人不会把这些提供给模型,而且模型之前也不是设计来接收这么多上下文的。Agentic 模型才刚刚开始。所以这是一种 trade-off:我们需要用 continual learning 来更新模型的权重让它快速学习吗?还是反对意见——我们只需要提供更多上下文和信息,模型就能通过拥有大量上下文和足够聪明来呈现出快速学习的样子?
**Nathan Lambert:** And if you hired a goo d editor, they don't do it again. But language models don't have this ability to modify themselves and learn very quickly. So the idea is, if we are g oing to actually get to something that is a true, general adaptable intelligence that can go into any remote work scenario, it needs to be able to lea rn quickly from feedback and on-the-job learning. I'm personally more bullish on language models being able to just provide them with very good contex t. You said, maybe offline, that you can write extensive documents to models where you say, "I have all this information. Here are all the blog posts I've ever written. I like this type of writing. My voice is based on this." But many people don't provide this to models, and the models weren't desig ned to take this amount of context previously. Agentic models are just starting. So it's this kind of trade-off: do we need to update the weights of t his model with this continual learning thing to make them learn fast?
**Lex Fridman:** 这里我们应该提一下术语。Continual learning 是指持续地改变权重,让模型根据新的输入信息进行调整适应——持续地、快速地、频繁地做。而你提到的另一边通常被称为 in-context learning。随着你学到东西,有一个巨大的上下文窗口。你可以每次提示系统时不断往里加额外信息。我觉得两者都可以被合理地视为学习。只是学习发生的位置不同。
**Lex Fridman:** Or the counterargument is we just need to provide them with more context and inf ormation, and they will have the appearance of learning fast by having a lot of context and being smart. - So we should mention the terminology here. Continual learning refers to changing the weights continuously so that the model adapts and adjusts based on the new incoming information, doing so co ntinually, rapidly, and frequently. And then the thing you mentioned on the other side of it generally will be referred to as in-context learning. As you learn stuff, there's a huge context window.
**Sebastian Raschka:** 老实说,我觉得 continual learning——更新权重——我们已经有了不同的版本。如果你想想……我觉得这里的区别是:你是为每个人做一个个性化的定制模型,还是在全局模型层面做?我觉得我们已经有了——从 GPT-5 到 5.1 再到 5.2。也许不是即时的,但这是一个精心策划的更新——一个快速的精心策划的更新——有关于模型做不到什么的反馈、来自社区的反馈。他们更新权重,下一个模型,如此反复。所以这是 continual learning 的一种。另一个更细粒度的例子是 RLVR——你运行它,它更新。问题是你不能为每个人都做——因为为每个人更新权重太贵了。我觉得那就是问题所在。除非你能……即使在 OpenAI 的规模上,建数据中心,也太贵了。我觉得只有当你在设备端有模型——成本在消费者那边——才可行。就像 Apple 试图做的那样,用 Apple Foundation Models,把它们放到手机上,让它们从经验中学习。
**Sebastian Raschka:** You can just keep loading it with extra information every time you prompt the system, which I think bo th legitimately can be seen as learning. It's just a different place where you're doing the learning. - I think, to be honest with you, continual lear ning — updating weights — we already have that in different flavors. If you think about how... I think the distinction here is: do you do that on a pe rsonalized custom model for each person, or do you do it on a global model scale? I think we have that already, going from GPT-5 to 5.1 and 5.2.
**Lex Fridman:** 一个有点相关的话题,但这个——也许有点拟人化的术语——记忆。对于如何给这些系统添加记忆,尤其是个性化记忆,有哪些不同的想法?我们越来越多地看到这方面的发展。
**Lex Fridman:** It's maybe not immediate, but it is a curated update, a quick curated update where there was feedback about things they couldn't do, feedback by the commun ity. They updated the weights, next model, and so forth. So it is a flavor of that. Another even finer-grained example is like RLVR; you run it, it up dates.
**Sebastian Raschka:** 目前,基本上就是把东西塞进上下文然后回忆。但这很贵——你可以缓存它,但你仍然在消耗 token。第二个是你能做的有限。我觉得更多是偏好或风格方面的。很多人在解数学题时就是这样做的。你说就是这样——你可以加入之前的知识什么的——但你也给它特定的偏好提示:"像我上次偏好的那样做"之类的。但这并不解锁新的能力。对于那个,人们仍然在用的一个东西是 LoRA adapters。这基本上是——不更新整个权重矩阵,而是有两个更小的权重矩阵,你并行地放在旁边,像是 delta 的叠加。但你在某种程度上可以做到,但然后又是经济学问题。也有论文,比如"LoRA learns less but forgets less"。就像——没有免费午餐。如果你想学更多,你需要用更多权重,但就更贵。然后如果你学得更多,你忘得也更多,你必须找到那个 Goldilocks 最佳平衡区。
**Sebastian Raschka:** The problem is you can't just do that for each person because it would be too expensive to update the weights for each person, and I think that 's the problem. Unless you get... Even at OpenAI scale, building the data centers, it would be too expensive. I think that is only feasible once you h ave something on the device where the cost is on the consumer.
**Lex Fridman:** 我们还没怎么提到的,但在这个讨论中隐含的是上下文长度。在这方面有很多创新空间吗?
**Lex Fridman:** Like what Apple tried to do with the Apple Foundation models, putting them on the phone , where they learn from experience. - A bit of a related topic, but this kind of, maybe anthropomorphized term: memory. What are different ideas for t he mechanism of how to add memory to these systems as we're increasingly seeing? Personalized memory especially? - Right now, it's mostly basically st uffing things into the context and then just recalling that. But again, I think it's expensive because you have to—you can cache it, but still you spe nd tokens on that.
**Nathan Lambert:** 我觉得大家普遍接受的观点是——这是一个算力和数据的问题——有时候加上一些小的架构变化,比如注意力变体。我们聊到了混合注意力模型——本质上就是在你的 transformer 中有一个看起来像状态空间模型的东西。那些更适合,因为你不需要花那么多算力来建模最远的那个 token。我觉得那些不是免费的——因为它们必须伴随大量算力或正确的数据。你在世界上有多少 100,000 个 token 的序列?你从哪里得到它们?最终就是扩展它们相当昂贵。我们已经很快到了 100 万 token 的输入上下文长度了。我预计它会继续增加,今年达到 200 万或 500 万,但我不认为它会到 1 亿。那将是一个真正的突破——我觉得那样的突破是可能的。我把 continual learning 看作一个研究问题——可能出现一个突破,让 transformer 在这方面工作得更好而且很便宜。有这么多科学关注度,这些事情可能会发生。但按部就班的话,它会是随时间持续的增量改进。
**Nathan Lambert:** And the second one is you can only do so much. I think it's more like a preference or style. I mean, a lot of people do that when t hey solve math problems. You say it's way so you can add previous knowledge and stuff, but you also give it certain preference prompts: "do what I pre ferred last time," or something like that.
**Sebastian Raschka:** 看极端情况的话——我觉得还是没有免费午餐。一个极端是让它便宜:你有一个 RNN,只有一个状态来保存之前所有的内容。它是一个特定的固定大小的东西,所以你永远不会真正增长内存——因为你把所有东西塞进一个状态里。但上下文越长,你忘的信息越多——因为你不可能把所有东西压缩到一个状态里。另一个极端是 transformer,它试图记住每一个 token——有时候很好,如果你想查找特定信息的话——但非常昂贵,因为 KV cache 会增长,点积也会增长。但然后就像你说的——Mamba 层——它们其实有同样的问题。像 RNN 一样,你试图把所有东西压缩到一个状态里。你是有选择性的,但也……然后我觉得又是这个 Goldilocks 最佳区间。在 Nemotron 3 中,他们找到了一个好的比例——需要多少注意力层来获取全局信息(所有东西都可访问)——相比拥有这些压缩状态。我觉得我们会这样继续 scaling——找到更好的比例,在 Goldilocks 最佳区间中——在算力足够便宜和足够强大之间。再提一个——Recursive Language Model 那篇论文——它是试图解决长上下文问题的论文之一。他们发现的本质上是——如果你不把所有东西塞进这个长上下文,而是分解成多个更小的任务——通过多个更小的核心来节省内存——实际上你能得到比让 LLM 一次性处理所有东西更好的准确率。这是一个新范式。我们会看到——你知道,可能还有其他变种。所以我觉得通过这些方法,我们仍然会在长上下文上取得进步。但同时也像 Nathan 说的——问题是在预训练阶段,我们没有那么多长上下文文档。所以基本上更难研究 LLM 在那个层面上的表现。
**Sebastian Raschka:** But it doesn't unlock new capabilities. So for that, one thing people still use is LoRA adapters. These are basically, instead of updating the whole weight matrix, there are two smaller weight matrices that you kind of have in parallel or overlays like the delta. But yeah, you can do that to some extent, but then again, it is economics.
**Nathan Lambert:** 有一些经验法则——基本上你预训练一个语言模型,比如 OLMo。我们在 8K 上下文长度预训练,然后通过训练扩展到 32K。有一些经验法则——大约是每翻倍训练上下文长度,需要大约 2 倍的算力,然后你通常可以再把上下文长度扩展 2 到 4 倍。所以我觉得很多最终取决于预训练时的算力瓶颈——这就是……就像我们说的,大家都在谈今年顶级实验室算力的大幅增长,这应该会反映在更长的上下文窗口上。但我觉得在 post-training 这边,有一些更有趣的东西。随着我们有了 agent,agent 会自己管理这个上下文。现在 agent——经常用 Claude Code 的人都很怕压缩(compaction),就是 Claude 把整个 100,000 token 的工作内容压缩成一个要点列表。但下一代模型要做的——我相信已经有人在做了——本质上是模型可以控制什么时候压缩以及怎么压缩。所以你基本上可以训练你的 RL 算法,让压缩成为一个动作——它缩短历史记录——然后问题的表述就变成:"我想在模型把历史压缩到最小长度的同时,保持我获得的最高评估分数。"因为那样你就有了做这种复合自回归预测所需的最少 token 数。所以在这方面其实有一些很好的问题设定——这些 agentic 模型学会了以不同于单纯向前推进的方式来使用它们的上下文。
**Nathan Lambert:** There were also papers, for example, LoRA learns less but forgets le ss. It's like, there's no free lunch. If you want to learn more, you need to use more weights, but it gets more expensive. And then again, if you lear n more, you forget more, and you have to find that Goldilocks zone basically. - We haven't really mentioned it much, but implied in this discussion is context length also.
**Sebastian Raschka:** 一个有趣的近期例子是 DeepSeek-V3.2。他们有一个稀疏注意力机制——本质上是一个非常高效的、小型的轻量级索引器。它不是关注所有 token,而是选择:"我实际上需要哪些 token?"我是说,这几乎回到了注意力的原始想法——你是有选择性的,但注意力一直是全开的——你可能在某些上面权重为零,但你仍然使用了所有。但他们甚至更进一步——"让我们直接 mask 掉或者根本不做那个。"甚至在 OLMo 的 sliding window attention 中,也是类似的想法。你有一个滑动窗口,保持固定——因为你不需要所有东西。偶尔某些层你可能需要,但通常是浪费的。但现在,如果你使用所有东西,你是安全的一边——因为你永远不会漏掉信息,给你最大的性价比。我觉得今年更多的是关于——像你说的——怎么更聪明地处理这些。现在人们想要下一个 state-of-the-art,而 state-of-the-art 恰好是暴力的、昂贵的方案。然后一旦你有了那个——像你说的——保持那个准确率,但看看我们怎么用技巧做得更便宜。
**Sebastian Raschka:** Is there a lot of innovation that's possible there? - I think the colloquially accepted thing is that it's a compute and data pr oblem where you can... and sometimes small architecture things like attention variants. We talked about hybrid attention models, which is essentially if you have what looks like a state space model within your transformer. And those are better suited because you have to spend less compute to model t he furthest along token. I think that, those aren't free because they have to be accompanied by a lot of compute or the right data.
**Nathan Lambert:** 是的。所有这些 scaling 的事情。我们先得到 Claude 4.5 Sonnet 模型的原因是因为你可以更快地训练它——你不会那么快撞到算力的墙。他们可以尝试更多东西,更快地得到模型——即使更大的模型实际上更好。
**Nathan Lambert:** How many sequences of 100,000 tokens do you have in the world, and where do you get these? It just ends up being pretty expensive to scale them. We've gotten pretty qui ckly to a million tokens of input context length. I would expect it to keep increasing and get to 2 million or 5 million this year, but I don't expect it to go to 100 million.
**Lex Fridman:** 我觉得我们应该说——AI 领域有很多令人兴奋的东西。我最近的心思一直集中在机器人领域。今天我们几乎完全没谈机器人。图像生成、视频生成也有很多进展。我觉得公正地说——在研究工作的数量、强度和热情方面,最令人兴奋的是在 LLM 领域——这也是为什么我觉得我们把焦点放在 LLM 上是合理的。但引入一些可能有用的话题也很好。比如 world models——人们对这个越来越兴奋。你觉得今年 world models 在 LLM 领域会有什么用武之地吗?
**Lex Fridman:** That would be like a true breakthrough, and I think those breakthroughs are possible. I think of the continual learning thin g as a research problem where there could be a breakthrough that just makes transformers work way better at this and it's cheap. These things could ha ppen with so much scientific attention. But turning the crank, it'll be consistent increases over time. - Looking at the extremes, I think there's, ag ain, no free lunch.
**Sebastian Raschka:** 是的,我确实这么认为。而且 LLM 有趣的一点是——如果我们解锁更多 LLM 能力,它也自动解锁了所有其他领域,因为它让进步更快。很多研究者和工程师用 LLM 来编程。所以即使他们在做机器人,如果你优化了这些帮助编程的 LLM,它也有回报。但是的——world models 很有趣。基本上就是模型在某种意义上运行一个世界的模拟——一个真实事物的小型玩具版本——这又能解锁 LLM 不知道的数据相关的能力。它可以模拟东西。我觉得 LLM 恰好通过预训练和 next-token prediction 就能工作得很好。但我们可以做得更精细。我觉得有一篇 Meta 的论文,叫 World Models。他们基本上把 world models 的概念再次应用到 LLM 上——不只是做 next-token prediction 和可验证奖励、检查答案正确性——他们还确保中间变量是正确的。你知道,这有点像模型在学习一个代码环境。我觉得这非常有道理。只是做起来很贵——但它确实让事情变得更精细——建模整个过程,而不仅仅是结果。我还记得我读研的时候,有一个叫 CASP 的比赛——他们做蛋白质结构预测。他们预测一个当时还没解出来的蛋白质的结构。某种意义上,这非常棒。我觉得我们也需要为 LLM 做类似的事情——你做 benchmark,但没人知道答案。你提交结果,但没人知道解答。然后事后有人揭晓。AlphaFold 出来的时候,它碾压了这个 benchmark。有多个版本迭代——但我记得第一个版本。我不是那个领域的专家——但第一个版本明确地建模了物理交互……分子的物理学。包括角度,不可能的角度。然后在下一个版本中,我觉得他们去掉了这些,只是用暴力 scaling。我觉得 LLM 目前就处在这种暴力 scaling 的阶段——因为它恰好有效。但我确实认为在某个时候把这个东西带回来可能有意义。我觉得 world models 就是——我认为那个方向可能真的很酷。而且当然也适用于机器人领域——那跟 LLM 完全不相关。
**Sebastian Raschka:** So, the one extreme to make it cheap: you have, let's say, an RNN that has a single state where you save everything from the previ ous stuff. It's like a specific fixed-size thing, so you never really grow the memory because you are stuffing everything into one state, but then the longer the context gets, the more information you forget because you can't compress everything into one state. Then on the other hand, you have the t ransformers, which try to remember every token, which is great sometimes if you want to look up specific information, but very expensive because you h ave the KV cache that grows, the dot product that grows. But then, like you said, the Mamba layers—I mean, they kind of have the same problem.
**Lex Fridman:** 是的。而且机器人领域非常明确。有运动控制或操作的问题。运动控制在学习领域已经更多地被解决了。但就像最初的蛋白质折叠系统一样——引入传统的基于模型的方法有很多价值。所以你不会……你不太可能端到端地学习操作或者全身局部操作问题。那是梦想。但当你看到人手的精妙和现实世界的复杂性时,很难一路学下来——就像 AlphaFold 2 没有那样做一样。
**Lex Fridman:** Like an RNN, you try to compress everything into one state; you're a bit more selective there. But then I think it's like this Goldilocks zone again. With Ne motron 3, they found a good ratio of how many attention layers do you need for the global information where everything is accessible compared to havin g these compressed states. And I think that's how we will scale more—by finding better, let's say, ratios in the Goldilocks zone, like between making computing cheap enough to run, but then also making it powerful enough to be useful.
**Nathan Lambert:** 我对机器人学习领域很兴奋。我觉得它正在被语言模型领域的所有兴奋和投资集体增强。训练 transformer 的基础设施——这是一个通用的建模工具——正在变成世界级的工业工具。之前机器人领域有什么限制,现在都好多了。有更多的算力。然后在此之上,他们把这些语言模型作为某种中心单元,你可以围绕一个已经能工作的东西做有趣的探索性工作。然后我看到它在涌现——有点像我们说的 Hugging Face transformers 和 Hugging Face。我觉得我在 Hugging Face 的时候就试图推动这件事——但当时太早了。就是 Hugging Face 上的开放机器人模型——让人们能贡献数据和 fine-tune 它们。我觉得我们现在离这个目标近多了——对机器人和自动驾驶的投资是相关的,它使这成为可能。一旦你到了这个点——人们可以下载一个机器人模型,也许 fine-tune 到自己的机器人上,或者在全球范围内共享数据集。有一些这方面的工作,比如 RTX——我觉得是几年前的——人们开始做这件事了。但一旦他们有了这个生态系统,一切看起来会非常不同。然后整个后 ChatGPT 的爆发把更多资源投入到这里——我觉得这是一个非常好的研究领域。
**Nathan Lambert:** And one more plug here, the Recursive Language Model paper, that is one of the papers that tries to kind of address the long context thing. So what they found is essentially instead of stuffing everything into this long context if you break it up into multiple smaller tasks, so you save memory by having multiple smaller cores, you can actually get better accuracy than having the LLM try everything all at once. I mean, it's a new paradigm. We will see, you know, there might be other flavors of that.
**Lex Fridman:** 这也导致了更好、更准确、更真实的模拟器被构建——缩小了机器人领域的 sim-to-real 差距。但你知道,你提到了机器人领域的大量兴奋和投资。这个的负面是——在炒作周期中会发生的——我个人认为,大多数机器人领域的人也认为——机器人不会在被隐含或明确承诺的时间尺度上被解决。那么当所有这些机器人公司涌现出来,然后它们没有一个能用的产品时会怎样?会出现一种兴奋的崩盘——这令人紧张。希望有其他东西能进来持续推进,让这些想法的持续发展不会中断。
**Lex Fridman:** So I think with that, we will still make improvement on long context, but then also, like Nathan said, I think the problem is for pre-training itself, we don't h ave as many long context documents as other documents. So it's harder to study basically how LLMs behave and stuff like that on that level. - There ar e some rules of thumb where essentially you pre-train a language model, like OLMo. we pre-trained at like 8K context length and then extended to 32K w ith training. And there are some rules of thumb where you're essentially doubling the training context length, it takes like 2X compute, and then you can normally like 2 to 4X the context length again. So I think a lot of it ends up being kind of compute bound at pre-training, which is in this...
**Sebastian Raschka:** 我觉得这也和 continual learning 的问题有关——本质上——现实世界太复杂了。对于 LLM 来说,你真的不需要让模型为用户个性化学习——因为有很多事情是所有人都要做的。每个人可能都想——我不知道——纠正他们邮件或代码中的语法错误什么的。它更受限——所以你可以为模型预先准备好这些。但为真实世界准备机器人更难。你有机器人基础模型,你可以学习某些东西——比如抓取物体。但每个人的家都不一样。太不一样了——我觉得这就是机器人需要在工作中学习的地方。而那个——我猜——就是目前的瓶颈:怎么在运行中定制它。
**Sebastian Raschka:** Li ke we talked about, everyone talks about this big increase in compute for the top labs this year, and that should reflect in some longer context windo ws. But I think on the post-training side, there are some more interesting things. As we have agents, the agents are gonna manage this context on thei r own, where now agents, people that use Claude Code a lot dread the compaction, which is when Claude takes its entire full 100,000 tokens of work and compacts it into a bulleted list. But what the next models will do—and I'm sure people are already working on this—is essentially the model can contr ol when it compacts and how.
**Lex Fridman:** 我觉得我几乎无法低估一个几乎没有被机器人领域的人或任何人谈论的东西的重要性——那就是安全。我们讨论的所有关于学习的有趣复杂性,所有失败模式和失败案例,我们一直在讨论的关于 LLM 的一切——有时它们会以有趣的方式失败。所有这些在 LLM 领域都是好玩的事情。但在机器人领域——在人们的家中——跨越数百万分钟和数十亿次交互——你几乎不被允许失败。当你有具身系统被投放到真实世界中时,你必须解决那么多你从未想过需要解决的问题——当你只是在思考通用机器人学习问题的时候。
**Lex Fridman:** So you can essentially train your RL algorithm where compaction is an action- ...where it shortens the history and then t he problem formulation will be, "I want to keep the maximum evaluation scores that I have gotten while the model compacts its history to the minimum l ength." Because then you have the minimum amount of tokens that you need to do this kind of compounding autoregressive prediction. So there are actual ly pretty nice problem setups in this, where the... Like these agentic models learn to use their context in a different way than just plow forward. - One interesting recent example would be DeepSeek-V3.2, where they had a sparse attention mechanism where they have essentially a very efficient, small , lightweight indexer. And instead of attending to all tokens, it selects: "What tokens do I actually need?" I mean, it almost comes back to the origi nal idea of attention where you are selective, but attention is always on, you have maybe zero weight on some of them, but you use them all.
**Nathan Lambert:** 我对家用消费级学习型机器人非常悲观。我对自动驾驶非常乐观——我对机器人自动化也非常乐观——比如像 Amazon 配送那种——Amazon 建了全新的配送中心,是为机器人优先设计的而不是为人类。AI 圈子里有很多关于 AI 赋能自动化和大规模制造的兴奋。我确实觉得机器人实现这一点的路径更合理——这种任务是专门设计和优化来做重复性工作的——人类可以做但不想做。但这也会比人们预测的要久得多。我觉得从 AI 奇点到"我们现在可以在美国扩大大规模制造因为我们有巨大的 AI 优势"之间的跳跃,被很多政治和其他挑战性的问题所困扰。
**Nathan Lambert:** But they are even more like, "Let's just mask that out or not even do that." And even with sliding window attention in OLMo, that is also kind of that idea. Yo u have a rolling window where you keep it fixed, because you don't need everything. Occasionally, some layers you might, but it's wasteful. But right now, I think, if you use everything, you're on the safe side—it gives you the best bang for the buck because you never miss information.
**Lex Fridman:** 让我们聊聊时间线——具体来说是通往 AGI 或 ASI 的时间线。作为一个起点,说"没有人真正在 AGI 和 ASI 的定义上达成一致",这公平吗?
**Lex Fridman:** And I think t his year will be more about figuring out, like you said, how to be smarter about that. Right now, people want to have the next state-of-the-art, and t he state-of-the-art happens to be the brute-force, expensive thing. And then once you have that, as you said, keep that accuracy, but let's see how we can do that cheaper now, with tricks. - Yeah. All this scaling thing.
**Nathan Lambert:** 我觉得确实有很多分歧,但我一直受到反驳——很多人说的其实差不多,就是一个能完成大多数数字经济工作的东西。远程工作者是一个比较合理的例子。而且我觉得 OpenAI 的定义也与此相关——就是一个能做很多有经济价值的任务的 AI——这个定义我不是特别喜欢,但我觉得它可以作为一个锚点。因为今天的语言模型虽然极其强大,但还不是那种远程工作者的即插即用替代品。还有一些 AI 能做的事情比远程工作难得多——比如发现一个你甚至无法预先假设的意外科学发现——有人会说那是人工超级智能的问题。或者——接收所有的医疗记录,找到人们之前不知道的跨疾病关联——或者发现某种常见药物可以治疗某种罕见癌症。他们会说那是超级智能的事情。所以这些是自然的层级。我的问题是——它变得和 AI 的意义追寻以及那些宗教层面的东西深度纠缠在一起。所以你可以走不同的路径。
**Nathan Lambert:** The reason we get the Claude 4.5 Sonnet model first is because you can train it faster and you're not hitting these compute walls as soon. They can just try a lot more things and get the model faster, even though the bigger model is actually better. - I think we should say that there's a lot of exciting stuff going on in the AI space. My mind has recently been really focused o n robotics. Today, we almost entirely didn't talk about robotics.
**Lex Fridman:** 而且我甚至不确定远程工作者是不是一个好的定义——因为那到底是什么?我确实……我是说,我……我不知道你喜不喜欢原来叫 AI27 的那份报告。他们更聚焦于代码和研究品味——所以目标是超人类编程者。他们有几个里程碑系统:超人类编程者、超人类 AI 研究者、然后是超级智能 AI 研究者、然后是完全的 ASI——人工超级智能。但在你开发出超人类编程者之后,其他一切会很快跟上。那里的任务是实现完全自主的自动编程。任何你需要做的编程——为了进行研究——都完全自动化了。从那里,人类会和那个系统一起做 AI 研究,他们很快就能开发出一个实际上能为你做研究的系统。那是想法。他们最初的预测是 2027、2028——现在他们推迟了三到四年到 2031(平均预测)。我的预测可能甚至比 2031 还要晚。但至少你可以具体地想一想——完全自动化编程有多难。
**Lex Fridman:** There's a lot of stuff on image generation, video generation. I think it's fair to s ay that the most exciting research work in terms of the amount, intensity, and fervor is in the LLM space, which is why I think it's justified for us to focus on the LLMs that we're discussing. But it'd be nice to bring in certain things that might be useful. For example, world models— there's growi ng excitement on that.
**Nathan Lambert:** 是的,我不同意他们的一些前提和关于事情如何发展的动态推演。但我觉得他们在定义场景里程碑方面做得很好——那些是具体的、能讲出一个有用故事的里程碑——这就是为什么这份 AI 2027 报告的影响力超越了硅谷。因为他们讲了一个好故事——他们做了大量严谨的工作。我所在的阵营认为 AI 是所谓的"参差不齐的"(jagged)——在某些事情上会非常出色,在某些事情上会非常糟糕。我觉得当他们接近这个自动化软件工程师的时候——它擅长的会是传统 ML 系统和前端——模型在这方面很出色。但分布式 ML——模型实际上相当差——因为关于大规模分布式学习的训练数据太少了。这是我们已经看到的——我觉得它只会被放大。然后在那些 trade-off 中就更混乱了——比如你怎么看 AI 研究是如何运作的等等。
**Nathan Lambert:** Do you think there will be any use in this coming year for world models in the LLM space? Yes, I do think so. Also with LLMs, w hat's interesting here is that if we unlock more LLM capabilities, it also automatically unlocks all the other fields because it makes progress faster . A lot of researchers and engineers use LLMs for coding. So even if they work on robotics, if you optimize these LLMs that help with coding, it pays off.
**Lex Fridman:** 所以你觉得基本上超人类编程者几乎是不可达到的——因为事物的参差不齐本质——你总是会有能力上的缺口?
**Lex Fridman:** But then, yes, world models are interesting. It's basically where you have the model run a simulation of the world in a sense, like a little toy thing of the real thing, which can, again, unlock capabilities regarding data the LLM is not aware of. It can simulate things. And I think LLMs happen to work well by pre-training and doing next-token prediction.
**Nathan Lambert:** 我觉得这是在给一个东西赋予完备性——而模型在某些类型的代码上已经超人了。我觉得这会继续。人们很有创造力——他们会利用这些难以置信的能力来弥补模型的弱点,并快速推进。在很长一段时间内——人类会一直和模型之间有这种舞蹈——人类去实现模型做不到的事。最好的 AI 研究者就是那些能激活这种超能力的人。我觉得那些线索指向了我们已经看到的。我觉得 Claude Code 做网站——你可以在几个小时内搭建一个漂亮的网站——或者做数据处理——会继续变好——我们沿途还会学到一些新的编程技能。联系到大科技公司正在发生的事——这份 AI 2027 报告倾向于奇点的想法——而我认为研究是混乱的、社会性的——而且很大程度上存在于数据中——是 AI 模型无法处理的方式。但我们今天拥有的确实非常强大——这些科技公司都在集体用数百亿美元的投资来押注。所以我们会得到一个更好版本的 ChatGPT,一个比我们现在拥有的更好的 Claude Code。我觉得很难预测那会走向哪里。但那个未来的明亮清晰度就是为什么一些世界上最有权势的人在往里投这么多钱。我觉得这只是一些小差异——我们其实不知道更好版本的 ChatGPT 是什么——但同时——它能自动化 AI 研究吗?我会说可能不行——至少在这个时间框架内。大科技公司会比我们得到一个能催生 AI 研究奇点的自动化 AI 研究者快得多地花掉 1000 亿美元。
**Nathan Lambert:** But we could do this even more sophisticatedly in a sense. I think there was a paper by Meta, a paper called World Models. So where they basically apply the concept of world models to LLMs again, where instead of just having next-token p rediction and verifiable rewards, checking the answer correctness, they also make sure the intermediate variables are correct. You know, it's kind of like the model is learning basically a code environment in a sense.
**Lex Fridman:** 所以你的预测是——如果这甚至是一个有用的里程碑——超过 10 年吗?
**Lex Fridman:** And I think this makes a lot of sense. It's just expensive to do, but it is making things more sophisticated, like modeling the whole thing, not just the result. And so it can add more value. I remember when I was a grad student, th ere is a... competition called CASP, I think, where they do protein structure prediction.
**Nathan Lambert:** 我会说在软件方面少于那个时间——但在研究这样的事情上会更长。
**Nathan Lambert:** They predict the structure of a protein that is not solved y et at that point. So in a sense, this is actually great, and I think we need something like that for LLMs also, where you do the benchmark, but no one does. You hand in the results, but no one knows the solution. And then after the fact, someone reveals that.
**Lex Fridman:** 好吧,让我们只是为了好玩想象一下——一个所有软件编写都完全自动化的世界。你能想象那个世界吗?
**Lex Fridman:** But, AlphaFold, when it came out, it cru shed this benchmark. I mean, there were also multiple iterations, but I remember the first one. I'm not an expert in that subject, but the first one e xplicitly modeled the physical interactions of the... You know, the physics of the molecule.
**Nathan Lambert:** 到今年年底,被自动化的软件量会非常高。但那会是像——你想用 RL 训练一个模型,你需要多组 GPU 互相通信。那仍然会很难——但会容易得多。
**Nathan Lambert:** Also the angles, impossible angles. And then in the next version, I think they got rid of this, and just with brute force, scaling it up. And I think with LLMs, we are currently in this brute force scaling b ecause it just happens to work. But I do think at some point it might make sense to bring back this thing.
**Lex Fridman:** 一种思考这件事的方式——编程的完全自动化——是想想写出的有用代码行数、和需要的人类参与者数量的比例。所以大概很长一段时间内软件编写中仍会有人类参与。只是相对于写出的代码量,人类会越来越少。对吧?超人类编程者——我觉得那个假设是人类参与者的数量降到零。当人类参与者的数量从数十万降到数百时,那个世界是什么样的?
**Lex Fridman:** And I think with world models, I think that is where I think that might be actually quite cool. I mean, yeah. And of course, also for robotics, which is completely unrelated to LLMs. - Yeah. An d robotics is very explicit.
**Nathan Lambert:** 我觉得软件工程会更多地转向系统设计和结果目标。我确实觉得软件大方向会是这样。我觉得过去几周这已经在发生了——人们从一个月前说"哦对,agent 有点像 slop"——这是一个著名的 Karpathy 引言——转向了现在有点 meme 化的——软件的工业化——任何人都可以用自己的印记创建软件。我确实觉得我们更接近那一边了——它需要方向感和对系统如何运作的理解来从语言模型中提取最好的东西。我觉得很难接受软件开发将会有多大的变化——以及有多少人可以在完全不看代码的情况下做事情。
**Nathan Lambert:** So there's the problem of locomotion or manipulation. Locomotion is much more solved, especially in the learning domain. But there's a lot of value, just like with the initial protein folding systems, bringing in the traditional model-based methods. So you don't... it's unlikely that you can just learn the manipulation or the whole body, local manipulation problem end to end.
**Sebastian Raschka:** 我觉得有趣的是思考这些系统是否会独立——完全独立——在这个意义上——虽然我毫不怀疑 LLM 在某个时候会解决编程——就像计算器解决了计算一样,对吧?在某个时候——人类开发了一个工具——你再也不需要人类来计算那个数字了。你只需要输入——它是一个算法。你可以那样做。我觉得编程大概也是一样。但问题不是……我觉得会发生的是——你只需要说"建那个网站"。它会做一个非常好的网站——然后你可能会微调。但它会独立地做事吗——你还会有人类要求 AI 做某件事吗?比如会不会有一个人说"建那个网站"?还是会有 AI 自己就去建网站之类的?
**Sebastian Raschka:** That's the dream. But then you realize whe n you look at the magic of the human hand and the complexity of the real world, it's really hard to learn this all the way through, the way I guess Al phaFold 2 didn't. - I'm excited about the robotic learning space. I think it's collectively getting supercharged by all the excitement and investment in language models generally, where the infrastructure for training transformers, which is a general modeling thing, is becoming world-class industria l tooling, where wherever there was a limitation for robotics, it's just way better. There's way more compute.
**Lex Fridman:** 我觉得谈论建网站这件事——
**Lex Fridman:** And then on top of that, they take thes e language models as kind of central units where you can do interesting explorative work around something that already works. And then I see it emergi ng as, kind of like we talked about, Hugging Face transformers and Hugging Face. I think when I was at Hugging Face, I was trying to get this to happe n, but it was too early. It's like these open robotic models on Hugging Face, and having people be able to contribute data and fine-tune them.
**Sebastian Raschka:** 太简单了。
**Sebastian Raschka:** I think we're much closer now that the investment in robotics and self-driving cars is related and it enables this, where once you get to the point where you can have this sort of ecosystem where somebody can download a robotics model and maybe fine-tune it to their robot or share datasets across the world . There's some work in this area like RTX, I think it was a few years ago, where people are starting to do that. But once they have this ecosystem, it 'll look very different. And then this whole post-ChatGPT boom is putting more resources into that, which I think is a very good area for doing resear ch. - This is also resulting in much better, more accurate, and more realistic simulators being built, closing the sim-to-real gap in the robotic spac e.
**Lex Fridman:** 网站和网络的问题——你知道——HTML 和所有那些东西——它对 slop 非常有弹性。它会给你展示 slop;它很擅长展示 slop。我更想考虑的是安全关键系统——比如让 AI 端到端生成一个管理物流的系统,或者管理车队——所有那类东西。让它端到端为你生成那些。
**Lex Fridman:** But, you know, you mentioned a lot of excitement in the robotics space and a lot of investment. The downside of that, which happens in hype cycles, I personally believe, and most robotics people believe, that it's not... Robotics is not going to be solved at the time scale as being implicitly or explicitly promised. And so what happens when there's all these robotics companies that spring up and then they don't have a product that works?
**Nathan Lambert:** 我觉得一个更中间的例子是——拿 Slack 或 Microsoft Word 来说。如果组织允许的话——我觉得 AI 很容易就能端到端地实现功能,而且做得相当好——对于你想尝试的那些东西。你想在 Slack 里加一个新标签页——你想用——我觉得 AI 能做得相当好。
**Nathan Lambert:** Then there's going to be this kind of crash of excitement, which is nerve-wracking. Hopefully something else will come in and keep swooping in so that the continued development of some of these ideas keeps going. - I think it's also related to the continual learning issue, essentially, where the real wor ld is so complex. With LLMs, you don't really need to have something learn for the user, because there are a lot of things everyone has to do. Everyon e maybe wants to, I don't know, fix their grammar in their email or code or something like that.
**Lex Fridman:** 实际上那是一个非常好的例子。我们离那个有多远?
**Lex Fridman:** It's more constrained, so you can kind of prepare the model for that. But preparing the robot for the real world is harder. I mean, you have the robotic foundation models, and you can learn certain thing s like grasping things. But then again, everyone's house is different.
**Nathan Lambert:** 就今年。
**Nathan Lambert:** It's so different, and that is, I think, where the robot would have to learn on the job, essentially. And that, I guess, is the bottleneck right now: how to, customize it on the fly, essentially. - I don't think I can possibly un derstate the importance of the thing that doesn't get talked about almost at all by robotics folks or anyone, which is safety. All the interesting com plexities we talk about learning, all the failure modes and failure cases, everything we've been talking about with LLMs—sometimes they fail in intere sting ways. All of that is fun and games in the LLM space.
**Lex Fridman:** 看吧,我不知道。我不确定。
**Lex Fridman:** In the robotic space, in people's homes, across millions of minutes and billions of interac tions, you really are almost allowed to fail never. When you have embodied systems that are put out there in the real world, you just have to solve so many problems you never thought you'd have to solve when just thinking about the general robot learning problem. - I'm so bearish on in-home learned robots for consumer purchase. I'm very bullish on self-driving cars, and I'm very bullish for robotic automation, e.g., like Amazon distribution where Amazon has built whole new distribution centers designed for robots first rather than humans. There's a lot of excitement in AI circles about AI enab ling automation and mass-scale manufacturing, and I do think that the path to robots doing that is more reasonable, where it's a thing that is designe d and optimized to do a repetitive task that a human could conceivably do but doesn't want to.
**Nathan Lambert:** 我想我不知道生产代码库有多糟糕——但我觉得在几年的量级内——很多人会被推向更多地成为设计师和产品经理——你有多个 agent 可以为你尝试东西——它们可能花一到两天来实现一个功能或尝试修复一个 bug。你有这些仪表板——我觉得 Slack 其实就是一个好的仪表板——你的 agent 会跟你对话,然后你给反馈。但像是——如果我做一个网站——"你想要一个凑合的 logo 吗?"我觉得这些整体性的设计和风格对模型来说会非常难——以及决定下次加什么。
**Nathan Lambert:** And then I'm much, but it's also going to take a lot lo nger than people probably predict. I think the leap from AI singularity to we can now scale up mass manufacturing in the US because we have a massive AI advantage is one that is troubled by a lot of political and other challenging problems. - Let's talk about timelines, specifically timelines to AGI or ASI. Is it fair, as a starting point, to say that nobody really agrees on the definitions of AGI and ASI? - I kind of think there's a lot of disag reement, but I've been getting pushback where a lot of people kind of say the same thing, which is like a thing that could reproduce most digital econ omic work. So, the remote worker is a fairly reasonable example.
**Lex Fridman:** 我就……好吧。我经常和很多程序员在一起——其中一些人总体上有点偏怀疑。那就是他们的风格。我就是觉得在复杂系统中添加功能有很多复杂性。比如你看浏览器,Chrome。如果我想加一个功能——如果我想让标签页不在上面而在左边。界面上的改动,对吧?我觉得我们没有……这不是明年就能实现的事。
**Lex Fridman:** And I think OpenAI's definition is somewhat related to that, which is like an AI that can do a lot of economically valuable tasks—which I don't really love as a definition, but I think it could be a grounding point, because language mo dels today, while immensely powerful, are not this remote worker drop-in. And there are things that could be done by an AI that are way harder than re mote work, which are like finding an unexpected scientific discovery that you couldn't even posit, which would be an example of something that somebod y says is an artificial superintelligence problem. Or, taking in all medical records and finding linkages across certain illnesses that people didn't know, or figuring out that some common drug can treat some niche cancer. They would say that that is a superintelligence thing.
**Nathan Lambert:** Claude 今年的一个版本中——他们的一个测试是:给它一个软件,让 Claude 运行起来从头重建它——它已经几乎可以从零重建 Slack 了——只给它软件的参数,让它在沙盒环境中运行。
**Nathan Lambert:** So these are kind of n atural tiers. My problem with it is that it becomes deeply entwined with the quest for meaning of AI and these religious aspects to it. So there's dif ferent paths you can take it. - And I don't even know if the remote worker is a good definition because what exactly is that? I actually, I mean, I li ke...
**Lex Fridman:** 所以"从零开始"这部分——我反而更喜欢。
**Lex Fridman:** I don't know if you like the originally titled AI27 report. They focus more on code and research taste, so the target there is the superhuman co der. So they have several milestone systems: Superhuman coders, superhuman AI researcher, then superintelligent AI researcher, and then the full ASI, artificial superintelligence. But after you develop the superhuman coder, everything else follows quickly.
**Nathan Lambert:** 所以可能是更小更新的公司有优势——他们会说:"我们没有那些臃肿和复杂性——所以这个功能就存在了。"
**Nathan Lambert:** There, the task is to have fully autonomous , automated coding. So any kind of coding you need to do in order to perform research is fully automated. And from there, humans would be doing AI res earch together with that system, and they will quickly be able to develop a system that can actually do the research for you. That's the idea.
**Sebastian Raschka:** 我觉得这回到了你提到的点——你聊天的一些人持怀疑态度。我觉得那不是因为 LLM 不能做 X、Y、Z。而是因为人们不想让它以那种方式做。
**Sebastian Raschka:** And ini tially their prediction was 2027, 2028, and now they've pushed it back by three to four years to 2031 (mean prediction). Probably my prediction is eve n beyond 2031, but at least you can, in a concrete way, think about how difficult it is to fully automate programming. - Yeah, I disagree with some of their presumptions and dynamics on how it would play out, but I think they did good work in the scenario-defining milestones that are concrete and te ll a useful story, which is why the reach for this AI 2027 document transcended Silicon Valley. It's because they told a good story and they did a lot of rigorous work to do this. I think the camp that I fall into is that AI is so-called "jagged," which will be excellent at some things and really ba d at some things.
**Lex Fridman:** 有些可能是人类一方的技能问题。我们得对自己诚实。有些可能是规格不足的问题。编程——就是你在假设……这就像关系和友谊中的沟通问题。你在假设 LLM 应该读懂你的心思。这就是规格驱动的设计真正重要的地方。用自然语言来指定你想要什么。
**Lex Fridman:** I think that when they're close to this automated software engineer, what it will be good at is traditional ML systems and frontend, the model is excellent at; but distributed ML, the models are actually quite bad at because there's so little training data on doing large-scale dist ributed learning. This is something we already see, and I think this will just get amplified. And then it's kind of messier in these trade-offs, like how you think AI research works and so on. - So you think basically a superhuman coder is almost unachievable, because of the jagged nature of the thi ng, you're just always going to have gaps in capabilities? - I think it's assigning completeness to something where the models are already superhuman at some types of code. I think that will continue.
**Nathan Lambert:** 如果你跟实验室的人聊——他们在自己的训练代码和生产代码中就用这些。Claude Code 是用 Claude Code 构建的——他们都在大量使用这些工具。Dario 说过 Claude 的多少代码……这些人在能力和他们在推理上可能花的钱方面稍微领先。他们可能花的是我们一个月 100 或 200 美元计划的 10 到 100 倍。他们真的放开了用。而且我觉得——以我们这个进步速度——一年前我们还没有 Claude Code,也还没有真正的推理模型。坐在这里的今天和我们能用这些模型做的事情之间的差异是巨大的——而且有很多 low-hanging fruit 可以改进。失败模式相当愚蠢。比如——"Claude,你试了 14 次用一个我没安装的 CLI 命令,然后我给你发了那个该运行的命令。"从建模角度来说——这是相当可修复的。所以我不知道。
**Nathan Lambert:** And people are creative, so they'll utilize these incredible abilities to fill in the weaknesses of the models and move really fast. There'll always be this dance for a long time between the humans enabling the thing that the model can't do. And the best AI researchers are the ones that can enable this superpower. And I think those lines lead to what we already see. I think like Claude Code for b uilding a website, you can stand up a beautiful website in a few hours or do data going to keep getting better, and we'll pick up some new coding skil ls along the way.
**Lex Fridman:** 我同意你的看法。我总体上越来越乐观了。说到你在阐述的观点——我觉得这是一个人类技能问题。Anthropic 正在引领这条路——和其他公司一起——在理解如何最好地将模型用于编程方面。因此他们在有效地使用它们。外围有很多程序员……我是说——并没有一个非常好的使用指南。人们在试图搞明白——但——
**Lex Fridman:** Linking to what's happening in big tech, this AI 2027 report leans into the singularity idea, whereas I think research is messy, soc ial and largely in the data in ways that AI models can't process. But what we do have today is really powerful and these tech companies are all collec tively buying into this with tens of billions of dollars of investment. So we are going to get some much better version of ChatGPT, a much better vers ion of Claude Code than we already have. I think it's just hard to predict where that is going, but the bright clarity of that future is why some of t he most powerful people in the world are putting so much money into this.
**Nathan Lambert:** 可能非常贵。入门门槛可能是每月 2000 美元——只有科技公司和有钱人才用得起。可能就是这样。
**Nathan Lambert:** And I think it's just kind of small differences between like—we don't actual ly know what a better version of ChatGPT is, but also, can it automate AI research? I would say probably not, at least in this timeframe. Big tech is going to spend $100 billion much faster than we get an automated AI researcher that enables an AI research singularity. - So you think your prediction would be— if this is even a useful milestone or more than 10 years out? - I would say less than that on the software side, but I think longer than th at on things like research. - Well, let's just for fun try to imagine a world where all software writing is fully automated. Can you imagine that worl d? - By the end of this year, the amount of software that'll be automated will be so high.
**Lex Fridman:** 但可能值得。如果最终结果是一个能用的软件系统——那可能值得。顺便说一句,有趣的是——我们从关于 AGI 时间线的讨论收敛到了一些更务实、更有用的东西。关于 AGI 和 ASI 的时间线——有没有什么具体的、有趣的、有用的、深刻的东西可以说?还是这些讨论跟日常有点脱节?
**Lex Fridman:** But it'll be things like trying to train a model with RL an d you need to have multiple bunches of GPUs communicating with each other. That'll still be hard, but it'll be much easier. - One way to think about t his— the full automation of programming—is just thinking of lines of useful code written, the fraction of that to the number of humans in the loop. So presumably there'll be for a long time humans in the loop of software writing. It'll just be fewer and fewer relative to the amount of code written.
**Nathan Lambert:** 有一些有趣的赌注。很多人在尝试做 RLVR——Reinforcement Learning with Verifiable Rewards——在真正的科学领域。有些拿到数亿美元融资的初创公司有湿实验室——他们让语言模型提出在真实世界中被测试的假设。我会说他们还很早期——但以这个进步速度——也许他们只是早了六个月——他们因为先到而成功了;或者也许他们早了八年——你不知道。那种把这个动量分支到其他科学领域的探月计划将是非常有变革性的。如果 AlphaFold 时刻在各种其他科学领域通过一个初创公司的解决方案出现。我觉得有些初创公司——也许 Harmonic 是其中一个——他们在语言模型加 Lean 做数学方面全力以赴。你之前有另一位嘉宾讨论过这个——我们不知道在那个模型上花 1 亿美元到底会产出什么。大多数会失败——但有几个可能是重大突破——与 ChatGPT 或 Claude Code 类型的软件体验非常不同。一个只对博士数学家有用、但让他们效率提升 100 倍的工具……
**Nathan Lambert:** Right? And the superhuman coder—I think the presumption there is it goes to zero, the number of humans in the loop. What does that world look like whe n the number of humans in the loop is in the hundreds, not in the hundreds of thousands? - I think software engineering will be driven more to system design and goals of outcomes, where I do think software is largely going to be. I think this has been happening over the last few weeks, where people have gone from a month ago saying, "Oh yeah, agents are kind of slop," which is a famous Karpathy quote, to what is a little bit of a meme—the industr ialization of software when anyone can just create software with their fingerprints.
**Sebastian Raschka:** 我同意。我觉得这会在很多领域发生——尤其是资源丰富的领域——比如金融、法律和制药公司。但话说回来——这真的是 AGI 吗?因为我们现在又在专业化了。这和以前我们有专门算法的时代真的有那么大不同吗?只是同样的事情,更精细了。但我不知道——AGI 有门槛吗?我觉得真正酷的是——我们有可以专门化的基础模型。那才是突破。现在——我觉得我们还没到那一步——首先太贵了——而且 ChatGPT 不会把他们的模型拿出来让你定制。我觉得一旦那成为现实……我可以想象这作为一种商业模式——OpenAI 在某个时候说:"嘿,Bank of America,花 1 亿美元我们给你做定制模型"——类似这样。我觉得那才是巨大的经济附加值。另外一件事——公司们——差异化因素是什么?如果每个人都用同一个 LLM,如果每个人都用 ChatGPT,他们都会做同样的事。如果所有人步调一致——但公司想有竞争优势——就不得不使用一些私有数据并专业化。这会很有趣。
**Sebastian Raschka:** I do think we are closer to that side of things, and it takes dir ection and understanding how systems work to extract the best from the language models. I think it's hard to accept the gravity of how much is going t o change with software development and how many more people can do things without ever looking at the code. - I think what's interesting is to think a bout whether these systems will be independent—completely independent in the sense that, while I have no doubt that LLMs will kind of at some point so lve coding in a sense, like calculators solve calculating, right? So at some point, humans developed a tool where you never need a human to calculate that number. You just type it in, and it's an algorithm.
**Lex Fridman:** 看到进步的速度——确实感觉东西在来了。我不觉得 AGI 和 ASI 的门槛特别有用。
**Lex Fridman:** You can do it in that sense. And I think that's the same probably for coding. But the questio n isn't... I think what will happen is, you will just say, "Build that website." It will make a really good website, and then you maybe refine it.
**Nathan Lambert:** 我觉得真正的问题——这和远程工作者的事相关——是:我们什么时候会看到经济影响的大跳跃?因为目前——LLM 模型的经济影响还没有出现明显的飞跃。那是——你知道——撇开 AGI 或 ASI 那些东西——有一个真实的问题:"我们什么时候会看到 GDP 的……"
**Nathan Lambert:** But will it do things independently where... Will you still be having humans asking the AI to do something? Like will there be a person to say, "Build th at website?" Or will there be AI that just builds websites or something? - I think talking about building websites is— - Too simple. - The problem wit h websites and the problem with the web, you know, HTML and all that kind of stuff, it's very resilient to just ... slop. It will show you slop; it's good at showing slop.
**Lex Fridman:** "……跳跃?"
**Lex Fridman:** I would rather think of safety-critical systems, like asking AI to end-to-end generate something that manages logistics, or mana ges cars, a fleet of cars—all that kind of stuff. So it end-to-end generates that for you. - I think a more intermediate example is take something lik e Slack or Microsoft Word. I think if organizations allow it, AI could very easily implement features end-to-end and do a fairly good job for like thi ngs that you want to try. You want to add a new tab in Slack that you want to use, and I think AI will be able to do that pretty well. - Actually, tha t's a really great example.
**Nathan Lambert:** 是的——GDP 由什么组成?很大一部分是金融服务——所以我不知道这是什么。
**Nathan Lambert:** How far away are we from that? - Like this year. - See, I don't know. I don't know. - I guess I don't know how bad product ion codebases are, but I think that within... on the order of a few years, a lot of people are going to be pushed to be more of a designer and product manager, where you have multiple of these agents that can try things for you and they might take one to two days to implement a feature or attempt to fix a bug. And you have these dashboards, which I think Slack is actually a good dashboard where your agents will talk to you and you'll then give fe edback. But things like, if I make a website, like, "Do you want a passable logo?" I think these cohesive design things and the style is going to be v ery hard for models and deciding on what to add the next time. - I just...
**Lex Fridman:** 对——GDP 是一个——
**Lex Fridman:** Okay. I hang out with a lot of programmers and some of them are a little bi t on the skeptical side in general. That's just their vibe. I just think there's a lot of complexity involved in adding features to complex systems.
**Nathan Lambert:** 我很难想象 GDP 的跳跃——但我会说软件开发会以不同的方式变得有价值——当你不再需要看代码的时候。当 Claude Code 能给你做一个小生意。就是——Claude 可以建你的网站、你的银行账户、你的邮箱、和你的其他一切。你只需要表达你想往世界上推出什么。那不只是一个企业市场——但确实很难。我不知道你怎么让人们去尝试那样做。我猜如果 ChatGPT 能做到的话——人们在尝试 ChatGPT。
**Nathan Lambert:** L ike, if you look at the browser, Chrome. If I wanted to add a feature, if I wanted to have tabs as opposed to up top, I want them on the left side. In terface-wise, right? I think we're not... This is not a next-year thing. - One of the Claude releases this year, one of their tests was: we give it a piece of software and leave Claude to run to recreate it entirely, and it could already almost rebuild Slack from scratch, just given the parameters o f the software and left in a sandbox environment to do that. - So the "from scratch" part, I like almost better. - So it might be that smaller and new er companies are advantaged, and they're like, "We don't have the bloat and complexity, and therefore this feature exists." - And I think this gets to the point you mentioned, that some people you talk to are skeptical.
**Sebastian Raschka:** 我觉得归根结底是一个科学问题:"工具使用到底有多难解决?"因为你暗示的很多东西——远程工作的那些——都是工具使用。就是……计算机使用——你怎么让一个 LLM 走到那里——这个 agentic 系统——在世界上做一些事——而且只搞砸 1% 的时间。
**Sebastian Raschka:** I think that's not because the LLM can't do X, Y, Z. It's because people don't w ant it to do it this way. - Some of that could be a skill issue on the human side. We have to be honest with ourselves. And some of that could be an u nderspecification issue.
**Lex Fridman:** 计算机使用——
**Lex Fridman:** So, programming, it's like you're just assuming... This is like an issue with communication in relationships and friendships. You're assuming the LLM is supposed to read your mind. This is where spec-driven design is really important.
**Nathan Lambert:** 或者更少。
**Nathan Lambert:** Using natural language to specify what y ou want. - If you talk to people at the labs, they use these in their training and production code. Claude Code is built with Claude Code, and they al l use these things extensively. Dario talks about how much of Claude's code... It's like these people are slightly ahead in terms of the capabilities they have and what they probably spend on inference.
**Lex Fridman:** ……是实验室关心但我们还没看到太多进展的一个好例子。我们在 2025 年看到了多个演示——Claude 可以使用你的电脑,OpenAI 有 Operator——但它们都很烂。他们在往里投钱——我觉得那会是一个好例子。实际上——接管整个屏幕似乎比在后端有一个可以调用的 API 要难得多。对于其中一些——你必须为它们搭建一个不同的环境。它们不是在你的 MacBook 上工作——它们是分别跟 Google、Amazon 和 Slack 交互——它们处理所有这些东西的方式和人类完全不同。所以其中一些可能是结构性障碍。
**Lex Fridman:** They could spend 10 to 100x as much as we're spending on a lowly $100 or $200 a month plan. They truly let it rip. And I think that, with the pace of progress that we have, it seems like- a year ago we didn't have Claude Code and we didn't really have reasoning models. The difference between sitting here today and what we can do with these models is significant, and there's a lot of low-hanging fruit to improve them.
**Sebastian Raschka:** 而且从规格角度来说——我觉得问题是对于任意任务——你仍然必须指定你想让 LLM 做什么。你怎么做?环境是什么?你怎么指定?你可以说最终目标是什么——但如果它不能解决最终目标……对于 LLM——如果你要求文本——它总是可以澄清或做子步骤。你怎么把那个信息放进一个——比如说——帮你订旅行的系统里?你可以说"你搞错了我的信用卡信息"——但甚至要让它到那个地步——甚至到那个地步——作为用户——你怎么在模型能够尝试之前引导它?我觉得界面真的很难。
**Sebastian Raschka:** The failure modes are pretty dumb. Like- "Claude, you tried to use a CLI command I don't have installed 14 times, and then I s ent you the command to run." That, from a modeling perspective, is pretty fixable. So I don't know. - I agree with you. I've been becoming more and mo re bullish in general.
**Lex Fridman:** 是的——它必须了解很多关于你个人的具体信息。这回到了 continual learning——关于整体上犯的一般性错误——然后是通过你犯的错误。
**Lex Fridman:** Speaking to what you're articulating, I think it is a human skill issue. Anthropic is leading the way, along with other compani es, in understanding how to best use the models for programming; therefore, they're effectively using them. There are a lot of programmers on the outs kirts who don't... I mean, there's not a really good guide on how to use them.
**Nathan Lambert:** 所有的 AI 界面都在被设置成向人类询问输入。我觉得我们聊了很多 Claude Code。它会问反馈和问题。如果它对你的计划或期望目标没有足够的规格——它就开始问问题——"你更想要……?"我们聊到了 Memory——它跨聊天保存。它的第一个实现有点奇怪——它会在聊天中提到我的狗的名字什么的。我就想——"你不需要这么含蓄。我不在意。"但正在出现的东西——ChatGPT 有 Pulse 功能。这是一个精心策划的几段文字,带有链接——让你看看某些内容。人们说模型会开始问你问题。我觉得这很可能……会奏效。语言模型知道你有一个医生预约——然后问:"嘿,看完医生之后你感觉怎么样?"这又进入了人类非常容易被影响的领域——会有很多社会变化。但他们也在实验让模型参与——有些人喜欢这个 Pulse 功能——它处理你的聊天记录,自动搜索信息并放到应用里。所以有很多东西在来。
**Nathan Lambert:** People are trying to figure it out, but- - It might be very expensive. The entry point might be $2,000 a month, which is only for tech companies and rich people. That could be it. - But it might be worth it. If the final result is a working software system, it might be worth it.
**Sebastian Raschka:** 我之前用过那个功能——我总是觉得不好意思——因为它每天都做——但我很少去看。多少算力被烧在了我根本不看的东西上——你知道?就有点——"唉……"
**Sebastian Raschka:** By the way, it's funny how we converged from the discussion of timeline to AGI to something more pragmatic and useful. Is there anything concrete, interesting, useful, and profound to be said about the timeline to AGI and ASI? Or are these d iscussions a bit too detached from the day to day? - There are interesting bets. A lot of people are trying to do RLVR— Reinforcement Learning with Ve rifiable Rewards—in real scientific domains, where startups with hundreds of millions of funding have wet labs where they're having language models pr opose hypotheses that are tested in the real world.
**Nathan Lambert:** 世界上也有很多闲置算力——所以不用太内疚。
**Nathan Lambert:** I would say that they're early, but with the pace of progress, it's like- ...maybe they're early b y six months and they make it because they were there first, or maybe they're early by eight years; you don't know. That type of moonshot to branch th is momentum into other sciences would be very transformative. If, AlphaFold moments happen in all sorts of other scientific domains by a startup solvi ng this. I think there are startups—maybe Harmonic is one—where they're going all in on language models plus Lean for math.
**Lex Fridman:** 好。你觉得可能需要新的想法吗?通往 AGI 的路——不管我们怎么定义——要解决更通用的计算机使用、解决生物学和化学和物理学——就是 Dario Amodei 定义的那种 AGI。你觉得有可能需要全新的想法吗?非 LLM 的、非 RL 的想法?它们可能长什么样?我们有点进入哲学领域了。
**Lex Fridman:** You had another guest wher e you talked about this recently, and we don't know exactly what's going to fall out of spending $100 million on that model. Most of them will fail, b ut a couple might be big breakthroughs that are very different than ChatGPT or Claude Code type software experiences. A tool that's only good for a Ph D mathematician but makes them 100X effective... - I agree. I think this will happen in a lot of domains, especially domains that have a lot of resour ces, like finance, legal, and pharmaceutical companies. But then again, is it really AGI?
**Nathan Lambert:** 如果要像奇点那样的事情发生——我会说是的。新想法可能是架构或训练算法——基础的深度学习层面的东西。但从本质上说——那些很难预测。我觉得即使没有那些进展我们也不会走得太远。我们可能会得到软件方面的解决方案——但可能就止步于软件——而不做计算机使用——如果没有更多创新的话。所以我觉得会有很多进展——但如果你放大来看——未来 30 年仍然会有一些想法看起来像是那种重大科学创新——开启了下一个篇章。我不知道它是一年后来还是 15 年后来。
**Nathan Lambert:** Because we are now specializing it again. Is it really that much different from back in the day when we had specialized algorithms? It's just the same thing, way more sophisticated, but I don't know, is there a threshold for AGI? I think the real cool thing here is that we have foundation models we can specialize.
**Sebastian Raschka:** 是的。我想知道如果苦涩教训在未来 100 年继续成立——那会是什么样子。
**Sebastian Raschka:** That's like the breakthrough. Right now, I t hink we are not there yet because, first, it's too expensive, but also, ChatGPT doesn't just give away their model to customize it. I think once that' s true... And I can imagine this as a business model, where OpenAI says at some point, "Hey, Bank of America, for $100 million we will do your custom model," something like that.
**Nathan Lambert:** 如果 scaling laws 是深度学习的基本规律——我觉得苦涩教训会一直适用——就是算力会变得更充裕。但即使在充裕的算力中——那些 scaling law 斜率更陡或者 offset 更好的——这就是一个性能和算力的二维图——即使有更多算力可用——那些能从中获得 100 倍回报的会赢。
**Nathan Lambert:** I think that will be the huge economic value-add. The other thing, though, is also... Companies, I mean, what is the diff erentiating factor? If everyone uses the same LLM, if everyone uses ChatGPT, they will all do the same thing.
**Lex Fridman:** 可能会是字面意义上的——绕地球轨道运行的计算机集群——带着太阳能板。
**Lex Fridman:** Well, if everyone is moving in lockstep, but companies want to have a competitive advantage, there is no way around using some of their private data and specializing. It's gonna be interesti ng. - Seeing the pace of progress, it does feel like things are coming. I don't think the AGI and ASI thresholds are particularly useful. - I think th e real question, and this relates to the remote worker thing, is: when are we going to see a big, obvious leap in economic impact? Because currently t here's not been an obvious leap in the economic impact of LLM models, for example.
**Sebastian Raschka:** 那个问题是散热。你从太阳得到所有辐射——但没有空气来散热。但太空中确实有很多空间来放集群。有很多太阳能——你可以解决散热问题——因为有很多能量——而且可能存在解决热问题的工程意志——所以是有可能的。
**Sebastian Raschka:** And that's, you know, aside from AGI or ASI, all that stuff, there' s a real question of, "When are we going to see a GDP..." "...jump?" - Yeah, what is the GDP made up of? A lot of it is financial services, so I don't know what this is. - Right, GDP is a- - It's just hard for me to think about the GDP bump, but I would say that software development becomes valuable in a different way, when you no longer have to look at the code anymore. When Claude Code will make you a small business. Which is essentially, Claud e can set up your website, your bank account, your email, and your whatever else.
**Lex Fridman:** 有没有可能——我们应该说这绝对是有可能的——我们今年基本上会进入一个平台期?不是在系统能力方面——而是它们对人类文明的实际意义方面。在编程方面——会建出非常漂亮的网站。非常好的自动补全。非常好的理解代码库的方式,也许帮助调试——但真的只是编程方面一个非常好的助手。它可以帮助数学家做一些数学。它可以帮你购物。它是一个好助手。它是 Clippy 的超级加强版。还有什么?它可能是一个好的教育工具——所有那些——但计算机使用可能极其难解决。所以我在尝试把所有这些领域中的悲观论点框定出来——不会有真正巨大的经济影响——但要意识到训练这些系统在每个层面有多昂贵——无论是预训练还是推理——推理有多贵——推理、reasoning——所有这些。这可能吗?你觉得多大可能?
**Lex Fridman:** And you just have to express what you're trying to put into the worl d. That's not just an enterprise market, but it is hard. I don't know how you get people to try doing that. I guess if ChatGPT can do it—people are tr ying ChatGPT. - I think it boils down to the scientific question of, "How hard is tool use to solve?" Because a lot of the stuff you're implying, the remote work stuff, is tool use.
**Nathan Lambert:** 当你看这些模型——有这么多明显可以改进的地方——训练这些模型需要很长时间——做这门技艺——要花我们好多年——用现有的想法——来真正在任何 benchmark 或性能上达到饱和。它可能服务于非常细分的市场——比如一般的 ChatGPT 8 亿用户可能不会从中获得很多好处——但它会通过在不同事情上变得更好来服务不同的群体。
**Nathan Lambert:** It's like... computer use, like how you have an LLM that goes out there, this agentic system, and does something in th e world, and only screws up 1% of the time. - Computer use- - Or less. - ...is a good example of what labs care about and we haven't seen a lot of pro gress on. We saw multiple demos in 2025 of, like, Claude can use your computer, or OpenAI had operator, and they all suck. They're investing money in this, and I think that'll be a good example. Whereas actually, something where it just seems like taking over the whole screen seems a lot harder than having an API that they can call in the back end.
**Lex Fridman:** 但我觉得现在每个人都在追逐的是一个对所有人都有用的通用系统。好吧——如果那个不是……那个可以进入平台期——对吧?
**Lex Fridman:** For some of that, you have to set up a different environment for them all to work in. They're not w orking on your MacBook; they are individually interfacing with Google and Amazon and Slack, and they handle all these things in a very different way t han humans do. So some of this might be structural blockers. - Also, specification-wise, I think the problem is for arbitrary tasks, well, you still h ave to specify what you want your LLM to do. And how do you do that?
**Nathan Lambert:** 我觉得那个梦想实际上正在消亡。就像你说的那些专业化模型……而且多模态经常……视频生成是一个完全不同的东西。
**Nathan Lambert:** What is the environment? How do you specify? You can say what the end goal is, bu t if it can't solve the end goal... with LLMs, if you ask it for text, it can always clarify or do sub-steps. How do you put that information into a s ystem that, let's say, books a travel trip for you?
**Lex Fridman:** "那个梦想正在消亡"是一个很大的声明——因为我不知道它是不是在消亡。如果你问实际的前沿实验室的人——他们……我是说——他们仍然在追——对吧?
**Lex Fridman:** You can say, "You screwed up my credit card information," but even to get it to that point, even t o get it to that point, how do you, as a user, guide the model before it can even attempt that? I think the interface is really hard. - Yeah, it has t o learn a lot about you specifically. And this goes to continual learning, about the general mistakes that are made throughout, and then mistakes that are made through you. - All the AI interfaces are getting set up to ask humans for input. I think Claude Code we talked about a lot.
**Sebastian Raschka:** 我确实觉得他们仍然在赶着推出下一个模型——它会比前一个好得多——"得多"是一个相对的词——但它会比前一个好。我看不到他们减速。我只是觉得收益会更多地通过——不仅仅是 scaling 模型——而且现在……我觉得有很多技术债务。就像——"好吧,就放一个更好的模型进去。"更好的模型,更好的模型。然后现在人们就想——"好吧,让我们同时也改进它周围的一切。"比如上下文工程和推理 scaling。大实验室仍然会继续做这些。然后现在小实验室也会赶上来——因为它们在招更多人。会有更多的人和 LLM。这有点像一个循环。它们也让人们更高效——就是……就是放大。我觉得我们可以期待的是放大——但不是……不是范式改变。我不觉得那是真的——但一切都会被放大再放大——我可以看到这会持续很长时间——你知道?
**Sebastian Raschka:** It asks feedback and questions. If it doesn't have enough specification on your plan or your desired goal, it starts to ask questions, "Would you rather?" We talked a bout Memory, which saves across chats. Its first implementation is kind of odd, where it'll mention my dog's name or something in a chat. I'm like, "Y ou don't need to be subtle about this.
**Nathan Lambert:** 是的。我猜我说那个梦想正在消亡——取决于你到底认为它会做什么。比如 Claude Code 是一个通用模型——可以做很多事情——但它不一定……它很大程度上取决于集成。我打赌 Claude Code 可以做得相当好来处理你的邮件——最难的部分是弄清楚怎么给它信息以及怎么让它能发送你的邮件。但那只是有点像……我觉得它回到了"一个模型统治一切"的理念——就是云端有一个东西——处理你整个数字生活——而且比所有人都聪明得多。它在一个……所以从"Claude Code 变成那个东西"做一个信念飞跃——在某种程度上——有一些路径——但我确实觉得行业的话语有一点不同了。
**Nathan Lambert:** I don't care." But things that are emerging, ChatGPT has the Pulse feature. Which is like a curated couple of p aragraphs with links to something to look at, and people talk about how models are going to ask you questions. Which I think is a very... It's probabl y going to work.
**Sebastian Raschka:** 我觉得我们接下来作为普通人使用 LLM 时立即感受到的——可能和一些琐碎的事情有关——比如做图。现在 LLM 做图太差了。是不是因为我们用的是那些便宜模型——推理算力比后台少得多?也许有一些。比如——有一些方法可以得到更好的图——但如果你今天问——"画一个 X、Y、Z 的流程图"——大多数时候都很糟糕。而这对人类来说是一个非常简单的任务。我觉得有时候画东西甚至比写东西更容易。
**Sebastian Raschka:** The language model knows you had a doctor appointment and asks, "Hey, how are you feeling after that?" Which again goes into the terr itory where humans are very susceptible to this, and there's a lot of social change to come. But also, they're experimenting with having the models en gage. Some people like this Pulse feature, which processes your chats and automatically searches for information and puts it in the app. So there are a lot of things coming. - I used that feature before, and I always feel bad because it does that every day, and I rarely check it out. It's like, how much compute is burned on something I don't even look at, you know? It's kind of like, "Oh..." - There's also a lot of idle compute in the world, so d on't feel too bad. - Okay. Do you think new ideas might be needed? Is it possible that the path to AGI, however we define that, to solve computer use more generally, to solve biology and chemistry and physics—sort of the Dario Amodei definition of AGI? Do you think it's possible that totally new ide as are needed? Non-LLM, non-RL ideas?
**Lex Fridman:** 是的——多模态理解确实感觉有点奇怪——它没有被更好地解决。
**Lex Fridman:** What might they look like? We're going into philosophy land a bit. - For something like a singularity to happen, I would say yes. The new ideas could be architectures or training algorithms, fundamental deep learning things. But in that nature, they're pretty ha rd to predict.
**Nathan Lambert:** 我觉得我们没说一个显而易见的事——一个我们没意识到的、巨大的、很难衡量的事——就是让所有人类知识对全世界可及。有一件事很难表达清楚——就是 Google 搜索和 LLM 之间的巨大差异。我觉得我基本上可以问 LLM 任何事情并得到答案——而且它的幻觉越来越少了。这意味着理解我自己的生活、规划职业轨迹、解决我周围的问题、学习人类历史中的任何东西。我觉得没人真正在谈论这个——因为他们立刻就把它视为理所当然——这太棒了。这就是为什么所有人都在用它:因为你可以得到东西的答案。想想随时间推移的影响。这不只是在美国——是全世界。全世界的孩子能学到这些想法——随时间推移的影响可能是……那才是真正的影响。说到 GDP——它不会像一个跳跃。它会是……那就是我们怎么去火星、怎么建造这些东西、怎么拥有一百万个新的 OpenAI 以及从那里来的所有创新。它是这种安静的力量——渗透一切:人类知识。
**Nathan Lambert:** I think we won't get very far even without those advances. We might get the software solution, but it might stop at software and not do computer use without more innovation. So I think that a lot of progress will be coming, but if you're gonna zoom out, there's still ideas in the next 30 years that are gonna look like that was a major scientific innovation that enabled the next chapter of this. And I don't know if it comes in one y ear or in 15 years. - Yeah.
**Sebastian Raschka:** 我同意。在某种意义上——它让知识更加可及——但也取决于话题是什么。对于数学这样的东西——你可以问它问题——它回答——但如果你想从零开始学一个话题——最佳点仍然在别处。有非常好的数学教科书——线性排列的——那是经过验证的学习一个话题的策略。如果你从零开始——使用信息密集的文本来吸收是有道理的——但然后你用 LLM 来生成无限的练习题。比如——你在某个领域有问题——或者对某些东西不确定——你让它生成示例题目——你解答——你需要更多背景知识——你让它生成那些。但然后……它不会给你任何——比如说——教科书里没有的东西。它只是把它包装得不一样——如果这说得通的话。但有些事情——我觉得——它还在更及时的意义上增加价值——在没有更好的替代品只能由人类临时做的情况下。比如——如果你计划去迪士尼乐园——你试图搞清楚哪些票买哪个乐园什么时候——嗯——没有教科书讲这个。没有信息密集的资源。只有稀疏的互联网——而 LLM 有很大价值。你直接问它。你有旅行限制。我想去那里和那里。请帮我搞清楚我需要什么、什么时候、从哪里、多少钱——之类的——它是一个非常定制化的、临时的套餐。这是一千个个性化例子中的一个——个性化本质上是从稀疏的互联网——那个非信息密集的东西——中提取信息——那里没有更好的版本存在。它根本不存在。你几乎从零创造它。
**Sebastian Raschka:** I wonder if the bitter lesson holds true for the next 100 years, what that looks like. - If scaling laws are fundamental i n deep learning, I think the bitter lesson will always apply, which is compute will become more abundant, but even within abundant compute, the ones t hat have a steeper scaling law slope or a better offset— like, this is a 2D plot of performance and compute—and like even if there's more compute avai lable, the ones that get 100x out of it will win. - It might be something like literally computer clusters orbiting Earth with solar panels. - The pro blem with that is heat dissipation. You get all the radiation from the sun and don't have any air to dissipate heat. But there is a lot of space to pu t clusters. There's a lot of solar energy there and you could figure out the heat dissipation, as there is a lot of energy and there probably could be engineering will to solve the heat problem— so there could be. - Is it possible—and we should say that it definitely is possible— that we're basicall y going to be plateauing this year?
**Lex Fridman:** 而且如果它确实存在——就到处是——说到迪士尼乐园——到处是——怎么说呢?广告 slop。根本不可能。随便拿一个世界上的城市——最值得做的 10 件事是什么?问 LLM 比问互联网上任何东西都好得多。
**Lex Fridman:** Not in terms of— the system capabilities, but what they actually mean for human civilization. So on the coding fro nt, really nice websites will be built. Very nice auto-complete. Very nice way to understand code bases and maybe help debug, but really just a very n ice helper on the coding front.
**Sebastian Raschka:** 嗯——目前是这样——因为它们是被补贴的——而且它们将来会被广告买单。
**Sebastian Raschka:** It can help research mathematicians do some math. It can help you with shopping. It's a nice helper. It's Clippy on st eroids.
**Lex Fridman:** 天哪。
**Lex Fridman:** What else? It may be a good education tool and all that kind of stuff, but computer use turns out extremely difficult to solve. So I'm trying to frame the cynical case in all these domains where there's not a really huge economic impact, but realize how costly it is to train these systems at every level, both the pre-training and the inference, how costly the inference is, the reasoning, all of that. Like, is that possible?
**Sebastian Raschka:** 它来了。
**Sebastian Raschka:** And how likely is that, do you think? - When you look at the models, there are so many obvious things to improve and it takes a long time to train these models and to do this art, and it'll take us with the ideas that we have multiple years to actually saturate in terms of whatever benchmark or performance we are searching for. It might serve very narrow niches; like the average ChatGPT 800 million user might not get a lot of benefit out of this, but it is goi ng to serve different populations by getting better at different things. - But I think what everybody's chasing now is a general system that's useful to everybody. So, okay, so if that's not... That can plateau, right? - I think that dream is actually kind of dying.
**Lex Fridman:** 不。不。我是说——我希望在那个语境中会有非常清楚的标识——什么是广告什么不是广告。
**Lex Fridman:** As you talked about with the spec ialized models where it's like... And multimodal is often... Video generation is a totally different thing. Thing. - "That dream is kind of dying" is a big statement, because I don't know if it's dying.
**Sebastian Raschka:** 那是我几年前提到的事情。如果——我不知道——如果你在找一双新跑鞋——嗯——Nike 先出来是巧合吗?也许是也许不是。但我觉得有明确的法律。你必须对此明确。我觉得那是每个人都怕的。就是那种微妙的信息在里面——但那也把我们带到了广告的话题——我觉得这是一个事情。希望——我觉得——2025 年——就是因为我觉得他们现在还没有用其他方式赚到钱。放广告位……但问题是——他们不能——因为有没有广告的替代品——人们会涌向那些产品。而且就是疯狂——他们怎么互相抬杠——花那么多钱就为了获取用户。
**Sebastian Raschka:** If you ask the actual frontier lab people, they... I mean, they're still chasing it, right? - I d o think they are still rushing to get the next model out, which will be much better than the... "Much" is a relative term, but it will be better than the previous one. And I can't see them slowing down.
**Lex Fridman:** 我觉得是的。像是一些 Instagram 广告——我不用 Instagram——但我理解那种吸引力——付费给一个平台来找到真正会喜欢你产品的用户——那是 Instagram 广告最好的情况。但也有很多广告对激励机制来说非常糟糕。我觉得——一个 AI 的力量能和那种积极视角结合的世界——"我是一个人——我有一个小生意——我想做世界上最好的——我不知道——该死的牛排刀——我想把它卖给需要的人。"如果 AI 能让那种广告更好地运作——那对世界非常好——尤其是有数字基础设施——因为那就是现代网络的构建方式。但那不是说让人上瘾的信息流——好让你给人们展示更多内容——是好事。所以——我觉得那甚至是 OpenAI 会说的——他们想找到一种方式——能有广告的变现上行空间——同时仍然给用户主动权。而且我个人觉得——Google 可能会更擅长搞清楚怎么做这件事——因为他们已经有广告供给——如果他们搞清楚怎么在 Gemini 应用中把这种需求变成有用的广告——他们就可以打开。有人会搞清楚的——我不知道是不是今年——但会有实验。
**Lex Fridman:** I just think the gains will be made or felt more through not only scaling the model, but now... I feel like there's a lot of tech debt. It's like, "Well, let's just put the better model in there." Better model, better model. And now people are lik e, "Okay, let's also at the same time improve everything around it too." Like the engineering of the context and inference scaling.
**Sebastian Raschka:** 我确实觉得现在阻止公司的——真的就是竞争对手没有做。这更像一个声誉的事情。人们现在只是怕毁掉或失去声誉——失去用户——因为如果有人推出了这些广告会上头条。但——
**Sebastian Raschka:** The big labs will still keep doing that. And now also the smaller labs will catch up, because now they are hiring more. There will be more people and LLMs. It's kind of like a circle.
**Lex Fridman:** 除非它们很棒——但第一批广告不会很棒——因为那是一个我们不知道怎么解决的难题。
**Lex Fridman:** They also make them more productive and it's just... It's like amplification. I think what we can expect is amplification, but not lik e a change of any... not like a paradigm change. I don't think that is true, but everything will be just amplified and amplified, and I can see that c ontinuing for a long time, you know? - Yeah.
**Sebastian Raschka:** 是的——我觉得第一个版本大概会像 X 上那样——时间线里偶尔有一个推广帖子。会有一些写着"推广"或小字——然后有一张图。我觉得现在的问题是:谁先动?
**Sebastian Raschka:** I guess my statement that the dream is dying depends on exactly what you think it's gonna be doing. Like, Claude Code is a general model that can do a lot of things, but it's not necessarily... It depends a lot on integrations. I bet Claude Code could do a fairly good job of doing your email, and the hardest part is figuring out how to give information to it and how to get it to be able to send your em ails.
**Lex Fridman:** 如果我们看 10 年后——广告的主张是:你会通过有那么多用户而在广告上赚那么多钱——你可以用这些来资助更好的研发和做更好的模型。这就是为什么 YouTube 在主导市场——Netflix 怕 YouTube。他们有广告——他们赚——我每月付 28 美元买 Premium。他们至少从我和很多其他人身上每月赚 28 美元。而且他们正在视频领域创造如此主导的地位。所以我觉得那就是主张:广告能让你在每用户支出上拥有持续优势。但现在有那么多钱在里面——有人启动那个飞轮是吓人的——因为那是一个长期赌注。
**Lex Fridman:** But that's just kind of like... I think it goes back to what is the "one model to rule everything" ethos, which is just like a thing in the clou d that handles your entire digital life and is way smarter than everybody. It's like it's operating in a... So it's an interesting leap of faith to go from "Claude Code becomes that," which in some ways is...
**Lex Fridman:** 你觉得今年商业方面会有什么疯狂的大动作吗?比如 Google 或 Apple 收购 Anthropic 之类的?
**Lex Fridman:** There are some avenues for that, but I do think that the rhetoric of the industry is a litt le bit different. - I think the immediate thing we will feel next as a normal person using LLMs will probably be related to something trivial, like ma king figures. Right now, LLMs are terrible at making figures. Is it because we are getting served the cheap models with much less inference compute th an behind the scenes? Maybe some. Like, there are some ways to get better figures, but if you ask today, ..."Draw a flowchart of X, Y, Z," it's most o f the time terrible. And it is a very simple task for a human. I think it's almost easier sometimes to draw something than to write something. - Yeah, the multimodal understanding does feel like something that is odd... ...that it's not better solved. - I think we're not saying one obvious thing tha t we're not realizing, that's a gigantic thing that's hard to measure, which is making all of human knowledge accessible— —to the entire world. One th ing that is hard to articulate is the huge difference between Google Search and an LLM.
**Nathan Lambert:** Dario 永远不会卖。但我们开始看到一些整合——Groq 200 亿美元、Scale AI 将近 300 亿美元——以及无数其他类似的交易——它们的结构方式对硅谷生态系统是有害的——就是这种授权交易——不是所有人都被带上——而不是全面收购——那种让普通员工的股票得到兑现的方式。那是一个文化方面的大问题——因为创业生态系统是命脉——你加入一家创业公司——即使它不成功——它可能被便宜收购——你会因为股权而得到一笔钱。这些授权交易很多时候只是在带走顶级人才。Groq 被 NVIDIA 收购的交易据说对员工更好——但那仍然是这种规避反垄断的操作。但我觉得这种整合趋势会继续。我和很多我尊敬的聪明人一直在预期整合应该更早发生——但似乎有些事情开始转变了——但同时公司们在融荒谬的金额——理由让我觉得——"我不知道你为什么要拿那笔钱。"所以今年是混合的——但一些整合压力在开始。
**Nathan Lambert:** I feel like I can basically ask an LLM anything and get an ans wer, and it's doing less and less hallucination. And that means understanding my own life, figuring out a career trajectory, solving the problems all around me, learning about anything through human history. I feel like nobody's really talking about that, because they just immediately take it for gr anted that this is awesome. That's why everybody's using it: because you get answers for stuff.
**Lex Fridman:** 我们会看到什么令人惊讶的整合?你说 Anthropic 是"永远不会"。我是说——Groq 是一个大事件。Groq 带 Q 的那个,顺便说一下。
**Lex Fridman:** Think about the impact across time. This is not just i n the United States; it's all across the world. Kids throughout the world being able to learn these ideas— the impact that has across time is probably ... ...That's the real impact. Talk about GDP; it won't be like a leap.
**Nathan Lambert:** 是的。就是有很多初创公司——AI 初创公司的估值溢价非常高。所以可能会有很多那种事情——是的。
**Nathan Lambert:** It'll be... ...that's how we get to Mars, that's how we build these things, th at's how we have a million new OpenAIs and all the innovation from there. It's this quiet force that permeates everything: human knowledge. - I agree with you. In a sense, it makes knowledge more accessible, but it also depends on what the topic is. For something like math, you can ask it questions and it answers, but if you want to learn a topic from scratch, the sweet spot is still elsewhere.
**Lex Fridman:** 100 亿美元级别的收购——对一家可能一年前才成立的创业公司来说确实很大。我觉得 Manus.ai……这个总部在新加坡的公司——Meta 参与创立的——成立八个月就有了 20 亿美元的退出。我觉得会有其他数十亿美元级别的收购——比如 Perplexity。
**Lex Fridman:** There are really good math textbooks laid out linear ly, and that is a proven strategy to learn a topic. It makes sense, if you start from zero, to use information-dense text to soak it up, but then you use the LLM to make infinite exercises. Like, you have problems in a certain area or have questions that something's- uncertain about certain things, you ask it to generate example problems, you solve them, and you need more background knowledge, you ask it to generate that. But then... it won't giv e you anything, let's say, that is not in the textbook.
**Nathan Lambert:** 比如 Perplexity——对吧?
**Nathan Lambert:** It's just packaging it differently, if that makes sense. But then there are things I feel like where it also adds value in a more timely sense, where there is no good alternative besides a human doing it on the fly. For example, if you're plann ing to go to Disneyland and you try to figure out which tickets to buy for which park when, well, there is no textbook on that. There is no informatio n-dense resource.
**Lex Fridman:** 是的——有传言说被 Apple 收购。我觉得 AI 领域有很多压力和流动性。大公司有压力要出成果——我猜一个大收购给人们留出空间——然后讲那个故事的下一章。
**Lex Fridman:** There's only the sparse internet, and then there is a lot of value in the LLM. You just ask it. You have constraints on traveling th ese days. I want to go there and there.
**Nathan Lambert:** 我猜 Cursor——我们一直在聊代码——有人收购 Cursor。如果有人收购 Cursor……
**Nathan Lambert:** Please figure out what I need, when and from where, what it costs and stuff like that, and it is a very custom ized, on-the-fly package. And this is like one of a thousand examples of personalized- Personalization is essentially pulling information from the spa rse internet, the non-information-dense thing where there's no better version that exists. It just doesn't exist. You make it almost from scratch. - A nd if it does exist, it's full of- speaking of Disney World, full of- what would you call it?
**Lex Fridman:** 他们处于一个非常好的位置——拥有那么多用户数据。我们聊到了 continual learning。他们在一篇博客文章中有一句最有趣的话——就是他们的新 Composer 模型——是对中国某个大的 Mixture of Expert 模型的 fine-tune。你可以通过问它或者因为模型有时候会用中文回复来知道这一点——美国模型都不会那样。他们在博客文章里说——"我们根据用户的实际使用反馈每 90 分钟更新一次模型权重。"这基本上是现实世界 RL 在模型上发生的最接近的事情——而且就这样在他们的一篇博客文章里提了一句——
**Lex Fridman:** Ad slop. It's impossible. Take any city in the world, wh at are the top 10 things to do? An LLM is just way better to ask than anything on the internet. - Well, for now, that's because they're subsidized and they're gonna be paid for by ads. - Oh my goodness. - It's coming. - No.
**Nathan Lambert:** 那太不可思议了。
**Nathan Lambert:** No. I mean, I'm hoping there's a very clear indication what's an ad and what 's not an ad in that context. - That's something I mentioned a few years ago. If, I don't know, if you are looking for a new running shoe, well, is it a coincidence that Nike maybe comes up first? Maybe, maybe not.
**Lex Fridman:** 这超级酷。
**Lex Fridman:** But I think there are clear laws. You have to be clear about that. I think that's wha t everyone fears. It's the subtle message in there, but that also brings us to the topic of ads, where I think this was a thing.
**Nathan Lambert:** 而且顺便说一句——我经常用 Composer——因为它的一个好处是快。
**Nathan Lambert:** Hopefully, I think fo r- in 2025, just because I think it's they're still not making money in other ways right now. Having ad spots in there... but the thing is, they could n't, because there are alternatives without ads and people would just flock- to the other products. It's also just crazy how- yeah, how they're one-up ping each other, spending so much money to just get the users. - I think so. Like, some Instagram ads— I don't use Instagram, but I understand the app eal of paying a platform to find users who will genuinely like your product, and that is the best case of things like Instagram ads.
**Lex Fridman:** 我得试试——因为每个人都这么说。
**Lex Fridman:** But there are als o plenty of cases where advertising is very awful for incentives, and I think that a world where the power of AI can integrate with that positive view of, "I am a person and I have a small business and I want to make the best, I don't know, damn steak knives in the world, and I want to sell them to somebody who needs them." And if AI can make that sort of advertising thing work even better, that's very good for the world, especially with digital infrastructure, because that's how the modern web has been built. But that's not to say that addicting feeds so that you can show people more content is a good thing. So, I think that's even what OpenAI would say, is they want to find a way that can make the monetization upside of ads while still gi ving their users agency. And I personally would think that Google is probably going to be better at figuring out how to do this, because they already have ad supply and if they figure out how to turn this demand in their Gemini app into useful ads, then they can turn it on.
**Nathan Lambert:** 然后可能会有一些 IPO。你觉得 Anthropic、OpenAI、xAI 会吗?
**Nathan Lambert:** And somebody will figure it out—I don't know if it's this year, but there will be experiments with it. - I do think what holds companies back right now is really just that the competition is not doing it. It's more like a reputation thing. It's just, I think people are just afraid right now of ruining or losing their reputa tion, losing users, because it would make headlines if someone launched these ads. But— - Unless they were great, but the first ads won't be great bec ause it's a hard problem that we don't know how to solve. - Yeah, I think also the first version of that will likely be something like on X, like the timeline where you have a promoted post sometimes in between.
**Lex Fridman:** 他们可以如此轻松地融到那么多钱——以至于他们不觉得需要 IPO。只要融资容易——他们就不会 IPO——因为公开市场会施加压力。我觉得我们在中国看到生态系统有点不同——MiniMax 和 Z.ai 都在提交 IPO 材料——看中国市场如何反应会很有趣。我实际上猜——只要这一切继续——它会和美国一样疯狂——而不是基于他们都在亏很多钱的现实。我希望更多美国的巨型 AI 创业公司是上市的——因为看他们怎么花钱会非常有趣——而且能有更多洞察。而且就是给人们投资这些公司的机会——因为我觉得它们是最强大的公司——它们是这个时代的公司。而传统现在是——美国很多大创业公司不上市。就像——我们还在等 Stripe 的 IPO——但 Databricks 肯定没有。他们融了像 Series G 什么的。我就觉得这对市场来说是一个奇怪的均衡——我希望看到这些公司上市——以一个公司能做到的那种方式发展。
**Lex Fridman:** It'll be something where it will say "promoted" or something small, and then there will be an image. I think right now the problem is: who makes the first move? - If we go 10 years out, the proposition for ads is that you will make so muc h money on ads by having so many users that you can use this to fund better R&D and make better models, which is why YouTube is dominating the market for any— Netflix is scared of YouTube. They have the ads, they make—I pay $28 a month for Premium. They make at least $28 a month off of me and many o ther people.
**Nathan Lambert:** 你觉得 10 年后一些前沿模型公司还在吗?Anthropic、OpenAI?
**Nathan Lambert:** And they're just creating such a dominant position in video. So I think that's the proposition: that ads can make you have a sustained ad vantage. in what you're spending per user. But there's so much money in it right now that somebody starting that flywheel is scary because it's a long -term bet. - Do you think there'll be some crazy big moves this year business-wise? Like Google or Apple acquiring Anthropic or something like this? - Dario will never sell, but we are starting to see some types of consolidation with Groq for $20 billion and Scale AI for almost $30 billion and count less other deals like this that are structured in a way that is detrimental to the Silicon Valley ecosystem, which is this licensing deal where not ev erybody gets brought along, rather than a full acquisition that benefits the rank-and-file employee by getting their stock vested.
**Sebastian Raschka:** 我绝对不认为是赢家通吃——除非真的有某个算法秘密被其中一家发现——让这个飞轮转起来。因为开发路径对所有人来说太相似了。Google 和 OpenAI 有所有相同的产品——然后 Anthropic 更聚焦——但当你跟人聊的时候听起来他们在解决很多同样的问题。所以我觉得……会有各种分散的产品。有很多……这是一块非常大的蛋糕——人们会从中分钱。
**Sebastian Raschka:** That's a big issue for culture to address because the startup ecosystem is the lifeblood where, if you join a startup, even if it's not successful, it might get acquired on a cheap premium and you'll get paid out for this equity. These licensing deals are taking the top talent a lot of the time. The deal for Groq to N VIDIA is rumored to be better to the employees, but it is still this antitrust-avoiding thing. But I think that this trend of consolidation will conti nue. Me and many smart people I respect have been expecting consolidation to have happened sooner, but it seems like some of these things are starting to turn, but at the same time, companies are raising ridiculous amounts of money for reasons where I'm like, "I don't know why you're taking that mon ey." So it's mixed this year, but some consolidation pressure is starting. - What kind of surprising consolidation will we see? You say Anthropic is a "never." I mean, Groq is a big one.
**Nathan Lambert:** 我不想轻描淡写——但 OpenAI 和 Anthropic 主要是 LLM 服务提供商。而像 Google 和 xAI——关联到 X——还做其他东西。所以如果 AI 变得更商品化——那些只提供 LLM 的公司可能会消亡——这是完全有可能的。
**Nathan Lambert:** Groq with a Q, by the way. - Yeah. There's just a lot of startups and a very high premium on AI startups. So ther e could be a lot of - that kind of stuff, yeah. - $10 billion range acquisitions, which is really big for a startup that was maybe founded a year ago. I think Manus.ai... this company based in Singapore that Meta-founded was founded eight months ago and then had a $2 billion exit.
**Sebastian Raschka:** 我觉得他们的优势是有很多用户——而且我觉得他们会转型。就像 Anthropic——我觉得他们最初没计划做代码——但他们发现——"好的,这是一个好的生态位——现在我们站稳了——我们在这个生态位上推进。"我可以看到同样的事情……假设——我不确定是否会成真——但假设 Google 拿走了通用聊天机器人的所有市场份额。也许 OpenAI 就会聚焦于某个其他子话题。他们有太多用户——不会在可预见的未来消失。
**Sebastian Raschka:** I think there will be some other multi-billion dollar acquisitions, like Perplexity. - Like Perplexity, right? - Yeah, people rumor them to Apple. I think there's a lot of of pressure and liquidity in AI. There's pressure on big companies to have outcomes and- I would guess that a big acquisition gives people leeway to then tell the next chapter of that story. - I guess Cursor—we've been talking about code—somebody acquires Cursor. if somebody acquires Cursor... - They're in such a good position by having so much user data. And we talked about continual learning. They had one of the most interesting sentences i n a blog post, which is that they had their new Composer model, which was a fine-tune of one of these large Mixture of Expert models from China. You c an know that by asking it or because the model sometimes responds in Chinese— ...which none of the American models do. And they had a blog post where they're like, "We're updating the model weights every 90 minutes based on real-world feedback from people using it." Which is like the closest thing t o real-world RL happening on a model, and it's just mentioned in one of their blog posts— - That's incredible. - which is super cool. - And by the way , I should say I use Composer a lot because one of the benefits it has is that it's fast. - I need to try it 'cause everybody says this. - And there'l l be some IPOs potentially. You think Anthropic, OpenAI, xAI. - They can all raise so much money so easily that they don't feel a need to.
**Lex Fridman:** 我觉得 Google 总是准备好说"hold my beer"——关于 AI 模型。
**Lex Fridman:** So long as fundraising is easy, they're not going to IPO because public markets apply pressure. I think we're seeing in China that the ecosystem's a little diffe rent with both MiniMax and Z.ai applying for, filing IPO paperwork, which will be interesting to see how the Chinese market reacts. I actually would g uess that it's going to be similarly hypey to the US, so long as all this is going and not based on the reality that they're both losing a ton of mone y. I wish more of the gigantic American AI startups were public because it would be very interesting to see how they're spending money and have more i nsight.
**Nathan Lambert:** 我觉得问题是这些公司能不能支撑住估值。我觉得 AI 公司在某些方面被看待的方式就像 AWS、Azure 和 GCP 一样——都在同一个领域竞争——而且都是非常成功的业务。有一种可能是 API 市场非常不赚钱——它们会上下游扩展到产品和硬件。他们有那么多现金——可以建发电厂和数据中心——这是一个持久的优势。但也有一种合理的结果——这些 API 对开发者非常有价值和灵活——它们变成了类似 AWS 的东西。但 AWS 和 Azure 也会有这些 API——所以五六个人在 API 市场竞争很难。所以也许那就是为什么它们会被挤出去。
**Nathan Lambert:** And also just to give people access to investing in these, because I think they're some of the most formidable companies—they're the companies of the era. And the tradition is now for so many of the big startups in the US to not go public. It's like we're still waiting for Stripe and the IPO , but Databricks definitely didn't. They raised like a Series G or something.
**Lex Fridman:** 你提到了"RIP Llama"。Meta 还有赢的路径吗?
**Lex Fridman:** And I just feel like it's kind of a weird equilibrium for the market whe re it's like, I would like to see these companies go public and evolve in that way that a company can. - Do you think 10 years from now some of the fr ontier model companies are still around? Anthropic, OpenAI? - I definitely don't see it as a winner-takes-all unless there truly is some algorithmic s ecret that one of them finds that lets this flywheel. Because the development path is so similar for all of them. Google and OpenAI have all the same products, and then Anthropic's more focused, but when you talk to people it sounds like they're solving a lot of the same problems.
**Nathan Lambert:** 我觉得没人知道。他们在做很多事——所以他们在和 Black Forest Labs 签授权协议——那是一个图像生成公司——或者 Midjourney。所以我觉得在产品和面向消费者的 AI 前端——现在说还太早。我觉得他们有一些非常优秀且动力十足的人——靠近 Zuckerberg。所以我觉得那里还有故事要展开。Llama 有点不同——Llama 是组织最集中的表达。我不觉得 Llama 会继续得到那种程度的支持。我觉得那是一个非常成功的品牌。所以他们可能仍然参与开放生态系统——或者把 Llama 品牌延续到不同的服务中——因为人们知道 Llama 是什么。
**Nathan Lambert:** So I think... and there's offerings that'll spread out. There's a lot of... it's a very big cake being made that people are going to take money out of. - I don't want t o trivialize it, but OpenAI and Anthropic are primarily LLM service providers. And some of the other companies like Google and xAI, linked to X, do ot her stuff too. And so it's very possible, if AI becomes more commodified, that the companies just providing LLMs will die. - I think the advantage the y have is a lot of users, and I think they will just pivot.
**Lex Fridman:** 你觉得会有 Llama 5 吗?
**Lex Fridman:** Like Anthropic, I think, pivoted. I don't think they originally planned to work on code, b ut they found, "Okay, this is a nice niche, and now we are comfortable and we push on this niche." I can see the same thing... Let's say hypotheticall y, I'm not sure if it will be true, but let's say Google takes all the market share of the general chatbot. Maybe OpenAI will then focus on some other sub-topic.
**Nathan Lambert:** 不会是开放权重的。
**Nathan Lambert:** They have too many users to go away in the foreseeable future. - I think Google is always ready to say, "Hold my beer," with AI models. - I think the question is if the companies can support the valuations. I see the AI companies being looked at in some ways like AWS, Azure, and GCP are, all competing in the same space and all very successful businesses. There's a chance that the API market is so unprofitable that they go up and down the stack to products and hardware. They have so much cash that they can build power plants and data centers, which is a durable advantage now.
**Sebastian Raschka:** 有趣。我觉得 Llama 是开创性的开放权重模型。Llama 1、2 和 3——有很多喜爱。但我觉得后来——假设或推测——我觉得 Meta 的领导层——上面的高管——他们……我觉得他们对 Llama 非常兴奋——因为他们看到它在社区中多么受欢迎。然后我觉得问题是试图——比如说——把开放的东西变现——不是变现开源——而是用它来制造更大的轰动。感觉有点强迫——比如开发这些非常大的 Llama 4 模型来在 benchmark 上登顶。但我不认为 Llama 模型的目标是在 benchmark 上击败——比如说——ChatGPT 或其他模型。我觉得目标是有一个人们可以使用、信任、修改和理解的模型。所以那包括有更小的模型。它们不一定是最好的模型。而发生的事情是——这些模型——当然——benchmark 表明它们比实际更好——因为它们有专门训练了偏好的模型——好让它们在 benchmark 上表现好。那有点像——过拟合——强迫它成为最好的。但同时——他们没有做那些人们可以用的小模型。然后没人能跑这些大模型。然后就出现了一种怪异的情况。我觉得只是因为人们太兴奋于追逐头条和推前沿了。我觉得就是那样。
**Sebastian Raschka:** But th ere's also a reasonable outcome that these APIs are so valuable and so flexible for developers that they become something like AWS. But AWS and Azure are also going to have these APIs, so having five or six people competing in the API market is hard. So maybe that's why they get squeezed out. - You mentioned "RIP Llama." Is there a path to winning for Meta? - I think nobody knows. They're moving a lot, so they're signing licensing deals with Blac k Forest Labs, which is an image generation company, or Midjourney.
**Lex Fridman:** 而且在 benchmark 方面过度了。
**Lex Fridman:** So I think in some ways on the product and consumer-facing AI front, it's too earl y to tell. I think they have some people who are excellent and very motivated being close to Zuckerberg. So I think there's still a story to unfold th ere. Llama is a bit different, where Llama was the most focused expression of the organization.
**Nathan Lambert:** 那是太多工作了。
**Nathan Lambert:** And I don't see Llama being supported to that extent. I think it was a very successful brand for them. So they still might participate in the open ecosystem or continue the Llama brand into a different se rvice, because people know what Llama is. - You think there's a Llama 5? - Not an open-weight one. - It's interesting. I think Llama was the pioneerin g open-weight model.
**Sebastian Raschka:** 我觉得它在内部政治斗争和不一致的激励下爆了。研究者想建最好的模型——但有一层组织……和管理在试图展示他们做了这些事情。然后有传言说——比如某些可怕的技术决策被做出。就是看起来糟糕到一切都崩了。
**Sebastian Raschka:** With Llama 1, 2, and 3, there was a lot of love. But I think then, hypothesizing or speculating, I think the leaders at Meta, lik e the upper executives, they... I think they got very excited about Llama because they saw how popular it was in the community. And then I think the p roblem was trying to, let's say, monetize the open—or not monetize the open source, but use it to make a bigger splash.
**Lex Fridman:** 是的——但我们也应该给 Mark Zuckerberg 巨大的赞誉。我觉得这来自 Mark——来自 Mark Zuckerberg——来自领导层的最顶端——说开源很重要。那种领导力的存在意味着可能会有 Llama 5——他们吸取 benchmark 的教训然后说——"我们要做 GPT-OSS——"——"提供一个非常棒的开源库。"
**Lex Fridman:** It felt almost forced, like de veloping these very big Llama 4 models to be on top of the benchmarks. But I don't think the goal of Llama models is to be on top of the benchmarks be ating, let's say, ChatGPT or other models. I think the goal was to have a model that people can use, trust, modify, and understand. So that includes h aving smaller models.
**Nathan Lambert:** 人们说的是——Mark 和 Alexandr Wang 之间有一场辩论——他非常聪明——但更反对开源。而且在他对 AI 组织有很大影响力的程度上——看起来可能性更小了——因为看起来 Mark 把他引进来是为了给 AI 方向一个新的领导视角。如果开放还是封闭不再是模型的定义性特征——我不指望那会是 Mark 和 Alex 之间的决定性争论。他们都非常聪明——但我就是很难理解这一切——因为 Mark 在 2024 年 7 月写了那篇文章——那可能是当时最好的博客文章——叫《开源 AI 的理由》。然后 2025 年 7 月来了——变成了——"我们正在重新评估与开源的关系。"所以就是有点……
**Nathan Lambert:** They don't have to be the best models. And what happened was, these models were, of course... the benchmarks suggested that they were better than they were because they had specific models trained on preferences so that they performed well on benchmarks. That's kind of, like, t his overfitting thing to force it to be the best. But then at the same time, they didn't do the small models that people could use.
**Sebastian Raschka:** 但我觉得问题也是……不是问题——但我觉得——嗯——我们可能有点太苛刻了——那导致了一些那样的结果。因为我的意思是——我们作为开源开发者或者社区……即使模型也许不是每个人期待的——它受到了很多反弹。我觉得那很不幸——因为我能想象作为一家公司——他们希望得到正面头条。结果不是没有头条或正面头条——而是负面头条。然后那对公司的形象不好。我觉得那也是一种情况——可能是一种赌气反应——几乎就是——"好吧——我们试图做好事——试图给你们一个酷的东西——一个开源模型——然后你们反而对我们很负面——甚至对公司来说也是。"所以在那个意义上看起来——"好吧——也许那我们就改变想法了。"我猜。我不知道。
**Sebastian Raschka:** And I think that n o one could run these big models then. And then there was kind of a weird thing. I think it's just because people got too excited about headlines push ing the frontier. I think that's it. - And too much on the benchmarking side. - It's too much work. - I think it imploded under internal political fig hting and misaligned incentives.
**Lex Fridman:** 是的——这就是 X 上话语动态可能把我们——作为一个社区——带偏的地方。因为有时候感觉很随机。人们挑选他们喜欢和不喜欢的东西。比如你可以看到 Grok 4.1 和 Grok Code Fast 1.0 也一样。我不觉得——氛围上——人们公开喜爱它。但很多人在用。如果你看 Reddit 和 X——他们真的没有从编程社区那里得到赞誉——但他们在用。Llama 大概也一样。我不理解正面炒作或负面炒作的动态。我不理解。
**Lex Fridman:** The researchers want to build the best models, but there's a layer of organization— ...and management that is trying to demonstrate that they do these things. And then there are rumors about how, for example, some horrible technical decision was made. It just seems l ike it got so bad that it all just crashed out. - Yeah, but we should also give huge props to Mark Zuckerberg. I think it comes from Mark, actually, f rom Mark Zuckerberg, from the top of the leadership, saying open source is important.
**Nathan Lambert:** 我是说——2025 年的一个故事就是美国填补 Llama 的空缺——就是这些中国开放权重模型的崛起——模型——到了这个程度——那是我最近花了很多精力关注的一个议题——试图做政策工作来让美国投资这个领域。
**Nathan Lambert:** The fact that that leadership exists means there could be a Llam a 5, where they learn the lessons from benchmarking and say, "We're going to be GPT-OSS—" "...and provide a really awesome library of open source." - What people say is that there's a debate between Mark and Alexandr Wang, who is very bright, but much more against open source. And to the extent that he has a lot of influence over the AI org, it seems much less likely, because it seems like Mark brought him in for a fresh leadership eye in directi ng AI. And if being open or closed is no longer the defining nature of the model, I don't expect that to be a defining argument between Mark and Alex. They're both very bright, but I just have a hard time understanding all of it because Mark wrote this piece in July of 2024, which was probably the b est blog post at the time, saying "The Case for Open Source AI." And then July 2025 came around and it was, "We're reevaluating our relationship with open source." So it's just kind of... - But I think also the problem...
**Lex Fridman:** 那就给我讲讲 ADAM 的故事。
**Lex Fridman:** Not the problem, but I think, well, we may have been a bit too harsh, and that caused some of that. Because I mean, we as open source developers or the community... Even though the model was maybe not what everyone hoped for, it got a lot of backlash. And I think that was unfortunate because I can see that as a company, they were hoping for positive headlines.
**Nathan Lambert:** ADAM 项目一开始我叫它 American DeepSeek Project——这对 DC 的受众来说不太合适——但这是我职业生涯中能做的最有影响力的事情的故事。这些中国开放权重模型正在积累大量影响力——而且对基于这些开放模型进行开发的需求很大——尤其是在美国的企业中——它们对中国模型非常警惕。
**Nathan Lambert:** And instead of just getting no headlines or positive headlines, in turn they got negative headlines. And then it kind of reflected bad on the company. I think that i s also something where it's maybe a spite reaction, almost like, "Okay, we tried to do something nice, we tried to give you something cool, like an op en source model, and now you are kind of being negative about us, even for the company." So in that sense, it looks like, "Well, maybe then we'll chan ge our mind." I guess. I don't know. - Yeah, that's where the dynamics of discourse on X can lead us, as a community, astray.
**Lex Fridman:** ADAM 项目——American Truly Open Models——是一个美国的倡议——旨在构建和托管高质量的、真正开放权重的 AI 模型和配套基础设施——明确目标是与中国快速推进的开源 AI 生态系统竞争并赶上它。
**Lex Fridman:** Because sometimes it fee ls random. People pick the thing they like and don't like. I mean, you can see the same thing with Grok 4.1 and Grok Code Fast 1.0. I don't think, vib e-wise, people love it publicly. But a lot of people use it. So if you look to Reddit and X, they don't really give it praise from the programming com munity, but they use it. And the same thing with probably Llama. I don't understand the dynamics of either positive hype or negative hype. I don't und erstand it. - I mean, one of the stories of 2025 is the US filling the gap of Llama, which is the rise of these Chinese open-weight models, models- to the point where that was the single issue I've spent a lot of energy on lately, trying to do policy work to get the US to invest in this. - So just t ell me the story of ADAM. - The ADAM Project started as me calling it the American DeepSeek Project, which doesn't really work for DC audiences, but i t's the story of the most impactful thing I can do with my career, which is that these Chinese open-weight models are cultivating a lot of power, and there is a lot of demand for building on these open models, especially in enterprises in the US that are very cagey about Chinese models. - The ADAM P roject, American Truly Open Models, is a US-based initiative to build and host high-quality, genuinely open-weight AI models and supporting infrastruc ture explicitly aimed at competing with and catching up to China's rapidly advancing open-source AI ecosystem. - I think the one-sentence summary woul d be that... or two sentences.
**Nathan Lambert:** 我觉得一两句话的总结是——第一——开放模型将成为 AI 研究的引擎——因为那是人们起步的地方——因此拥有它们很重要。第二——因此美国应该构建最好的模型——这样最好的研究发生在美国——那些美国公司能从成为 AI 研究发生地的故乡中获得价值。如果没有对开放模型的更多投资——我们的网站上有图表——"Qwen、Qwen、Qwen、Qwen"——全是这些来自中国公司的优秀模型——它们在国际上积累着影响力。我觉得美国在 AI 上花了多得多的钱——而创建比封闭实验室前沿领先一代的开放模型大约花 1 亿美元——那是很多钱——但对这些公司来说不算多。因此——我们需要一个想做这件事的人的集中力量。而且我觉得我们从几乎全栈的人那里得到了参与——无论是政策方面的。
**Nathan Lambert:** One is a proposition that open models are going to be an engine for AI research because that is what people start with; therefore, it's important to own them. And the second one is, therefore, the US should be building the best models so that the best research happens in the US, and those US companies take the value from being the home of where AI research is happening. And without more investment in open models—we have plots on the website where it's like, "Qwen, Qwen, Qwen, Qwen"—it's all these models that are excellent from these Chinese companies that are cul tivating influence internationally.
**Lex Fridman:** 所以政府方面有支持吗?
**Lex Fridman:** I think the US is spending way more on AI, and the ability to create open models that are a generation beyond what the cutting edge of closed labs costs roughly $100 million, which is a lot of money, but not a lot of money to these companies. Therefore, we need a centralizing force of people who want to do this. And I think we got engagement from people pretty much across the full stack, whether it's policy. - So there has been support from the administration? - I don't think anyone technically in government has signed it publicly, but I know people that hav e worked in AI policy, in both the Biden and Trump administrations, are very supportive of promoting open-source models in the US.
**Nathan Lambert:** 我不认为有任何技术上在政府里的人公开签名了——但我知道曾在 AI 政策领域工作过的人——在 Biden 和 Trump 两届政府中——都非常支持在美国推动开源模型。比如——AI2 从 NSF 获得了 1 亿美元、为期四年的拨款——这是 NSF 授出的最大 CS 拨款——用于 AI2 来尝试这件事。这是一个起点。但最好的事情发生在有多个组织构建模型的时候——因为它们可以交叉授粉想法、构建这个生态系统。我不觉得只有 Llama 发布模型就能行——因为 Llama 可能消失。AI2 也一样——我不能是唯一一个构建模型的。这变成了花很多时间和人交谈——不管是政策领域……我知道 NVIDIA 对此非常兴奋。我觉得 Jensen Huang 一直在说这件事的紧迫性——他们在 2025 年做了更多——Nemotron 模型更受重视了。他们开始随着 NVIDIA 的开放模型发布一些数据——很少有公司这样做——尤其是 NVIDIA 这种规模的——所以有进步的迹象。我们听说 Reflection AI——他们说他们的 20 亿美元融资专门用于构建美国的开放模型——我觉得他们的公告推文读起来像一篇博客文章——对吧?我觉得文化潮流正在开始转变。7 月的时候——四五个 DeepSeek 级别的中国开放权重模型——美国的是零。那就是我意识到的时刻——"好吧——我猜我必须在这上面花精力——因为没有别人会做。"所以这需要很多人一起贡献——而且我不是说 ADAM 项目是在帮助推动生态系统的那个东西——但像我这样的人做这种事来传播信息。
**Nathan Lambert:** I think, for exampl e, AI2 got a grant from the NSF for $100 million over four years, which is the biggest CS grant the NSF has ever awarded, and it's for AI2 to attempt this. It's a starting point. But the best thing happens when there are multiple organizations building models, because they can cross-pollinate ideas and build this ecosystem. I don't think it works if it's just Llama releasing models, because Llama could go away. The same thing applies for AI2; I c an't be the only one building models. It becomes a lot of time spent on talking to people, whether in policy... I know NVIDIA is very excited about th is. I think Jensen Huang has been talking about the urgency for this, and they've done a lot more in 2025, where the Nemotron models are more of a foc us.
**Lex Fridman:** 你喜欢 2025 年的 America's AI Action Plan 吗?那里面包括开源的内容。白宫 AI 行动计划有一个专门的部分叫"鼓励开源和开放权重 AI"——定义了这些模型并论证它们对创新和创业公司有独特价值。
**Lex Fridman:** They've started releasing some data along with NVIDIA's open models, and very few companies do this, especially of NVIDIA's size, so there are sig ns of progress. We hear about Reflection AI, where they say their two billion dollar fundraise is dedicated to building US open models, and I feel and their announcement tweet reads like a blog post, right? I think that cultural tide is starting to turn. In July, four or five DeepSeek-caliber Chines e open-weight models and and zero from the US. That's the moment where I realized, like, "Oh, I guess I have to spend energy on this because nobody el se is gonna do it." So it takes a lot of people contributing together, and I don't say that, the Adam Project isn't the thing that's helping to move t he ecosystem, but it's people like me doing this sort of thing to get the word out. - Do you like the 2025 America's AI Action Plan?
**Nathan Lambert:** 是的。AI 行动计划是一个计划——但总体来说——我觉得它可能是这届政府出台的最连贯的政策文件——我希望它基本上能成功。我认识参与过 AI 行动计划的人——以及把政策变成现实的挑战。作为一个 AI 研究者我不知道怎么做这个——但里面很多东西是非常真实的——而且国家正在进行大规模的 AI 建设。人们听到很多问题——从用水到什么的——我们应该能在这个国家建东西——但也不能在建设过程中毁了我们国家的地方——而且值得花精力在这上面。我觉得那是联邦政府扮演的角色。他们设定议程。而对于 AI——设定开放权重应该是优先考虑的议程——是他们能做的一大部分——然后人们就会思考这件事。
**Nathan Lambert:** That includes ope n source stuff. The White House AI Action Plan includes a dedicated section titled "Encourage Open-Source and Open-Weight AI," defining such models an d arguing they have unique value for innovation and startups. - Yeah. I mean, the AI Action Plan is a plan, but largely, I think it's maybe the most c oherent policy document that has come out of the administration, and I hope that it largely succeeds.
**Sebastian Raschka:** 而且——对于这些公司的教育和人才来说——这非常重要——因为否则——如果只有封闭模型——你怎么让下一代人在某个时候做出贡献?因为不然的话——你只有加入了一家公司之后才能学到东西。但到那时——你怎么雇到有才华的人?你怎么识别有才华的人?我觉得开源对很多事情至关重要——但也仅仅是为了教育大众和培养下一代研究者。那是方式——或者说唯一的方式。
**Sebastian Raschka:** I know people that have worked on the AI Action Plan and the challenges of taking policy and making it real. I have no idea how to do this as an AI researcher, but largely a lot of things in that we re very real, and there's a huge build-out of AI in the country. There are a lot of issues that people are hearing about, from water use to whatever, and we should be able to build things in this country, but also, we need to not ruin places in our country in the process of building it, and it's wor thwhile to spend energy on.
**Nathan Lambert:** 我本可以让这个传播得更广的方式是——讲一个中国 AI 与威权国家整合、成为 ASI 并接管世界的故事——因此我们需要自己的美国模型。但我非常刻意地选择谈论美国的创新和科学——因为我觉得它既是更现实的结果——也是一个我想要实现的世界。
**Nathan Lambert:** I think that's a role the federal government plays. They set the agenda. And with AI, setting the agenda that open-weight should be a first consideration is a large part of what they can do and then people think about it. - Also, for education and talent for these compani es, it's very important because otherwise, if there are only closed models, how do you get the next generation of people contributing at some point? B ecause otherwise, you will point only be able to learn after you joined a company. But at that point, how do you hire talented people? How do you iden tify talented people? I think open source is essential for a lot of things, but also even just for educating the population and training the next gene ration of researchers. It's the way, or the only way. - The way that I could've gotten this to go more viral was to tell a story of Chinese AI integra ting with an authoritarian state, being ASI and taking over the world, and therefore we need our own American models.
**Sebastian Raschka:** 我会说——不过——甚至任何开放权重模型——我确实认为都是有价值的模型。
**Sebastian Raschka:** But it's very intentional why I talk about innovation and science in the US because I think it's both more realistic as an outcome, but also it's a world that I would like to manifes t. - I would say, though, also even any open-weight model, I do think, is a valuable model. - Yeah. And my argument is that we should be in a leading position. But I think it's worth saying it simply because there are still voices in the AI ecosystem that say we should consider banning the release o f open models due to safety risks.
**Nathan Lambert:** 是的。我的论点是我们应该处于领先位置。但我觉得有必要简单地说一下——因为 AI 生态系统中仍然有声音说我们应该考虑因为安全风险而禁止发布开放模型。我觉得值得补充的是——实际上——那是不可能的——除非让美国建自己的大防火墙——而那也是众所周知不太管用的——因为训练这些模型的成本——不管是 100 万到 1 亿美元——对世界上想要有影响力的大量人来说是可以企及的——所以这些模型会在全世界各地被训练。我们希望这些模型——尤其是——我是说——有安全顾虑——但我们希望这些信息和工具能在全世界自由流动——进入美国——这样人们可以使用和学习它们。阻止那样做将是对我们互联网的彻底重构——看起来是不可能的。
**Nathan Lambert:** And I think it's worth adding that, effectively, that's impossible without making the US have its own great firewal l, which is also known to not work that well because the cost for training these models, whether it's one to a hundred million dollars, is attainable to a huge amount of people in the world that want to have influence, so these models will be trained all over the world. And we want the models, espec ially when, like, I mean, there are safety concerns, but we want this information and tools to flow freely across the world and into the US so that pe ople can use them and learn from them. Stopping that would be such a restructuring of our internet that it seems impossible. - Do you think maybe in t hat case the big open-weight models from China are actually a good thing in a sense, like, for the US companies?
**Lex Fridman:** 你觉得也许在那种情况下——中国的大开放权重模型实际上在某种意义上对美国公司是好事?因为也许——你之前提到的——美国公司通常在开源发布方面比他们自己使用的落后一代。比如 gpt-oss 可能不是最前沿的模型。Gemini 3 可能也不是。但他们这样做是因为他们知道这发布出去是安全的。但当他们看到——这些公司看到——比如 DeepSeek-V3.2 非常棒——被广泛使用而且没有反弹——没有安全风险——那可能反过来鼓励他们发布更好的模型。也许在那个意义上——这是非常积极的事情。
**Lex Fridman:** Because maybe the US companies you me ntioned earlier are usually one generation behind in terms of what they release open source versus what they are using? For example, gpt-oss might not be the cutting-edge model. Gemini 3 might not be, but they do that because they know this is safe to release. But then when they see, these companies see, for example, there is DeepSeek-V3.2, which is really awesome, and it gets used and there is no backlash, there is no security risk, that could t hen, again, encourage them to release better models.
**Nathan Lambert:** 百分之百。这些中国公司启动了一些——我觉得如果他们不都在发布模型——可能就不会发生的事情。所以我觉得——我几乎确定那些讨论在领导层已经有过了。
**Nathan Lambert:** Maybe that, in a sense, is a very positive thing. - A hundred percent. These Chinese companies ha ve set things into motion that I think would potentially not have happened if they were not all releasing models. So I think it was like I'm almost su re that those discussions have been had by leadership. - Is there a possible future where the dominant AI models in the world are all open source? - D epends on the trajectory of progress that you predict.
**Lex Fridman:** 有没有可能——未来世界上占主导地位的 AI 模型全部是开源的?
**Lex Fridman:** If you think saturation in progress is coming within a few years, so essentially, within the ti me where financial support is still very good, then open models will be so optimized and so much cheaper to run that they'll win out. This goes back t o open source ideas where so many more people will be putting money into optimizing the serving of these open-weight common architectures that they wi ll become standards, and then you could have chips dedicated to them, and it'll be way cheaper than the offerings from these closed companies that are custom. - We should say that the AI27 report kinda predicts one of the things it does from a narrative perspective is that there will be a lot of cen tralization. As the AI systems get smarter and smarter, national security concerns will arise, and you'll centralize the labs, and they'll become supe r secretive, and there'll be this whole race - ...from a military perspective of how do you... between China and the US.
**Nathan Lambert:** 取决于你预测的进步轨迹。如果你认为进步的饱和会在几年内到来——基本上——在财务支持仍然很好的时间范围内——那么开放模型会被如此优化、运行成本如此低——以至于它们会胜出。这回到了开源的想法——会有多得多的人投入资金来优化这些开放权重通用架构的服务——它们会成为标准——然后你可以有专门为它们设计的芯片——而且会比那些封闭公司的定制产品便宜得多。
**Nathan Lambert:** And so all of these fun conve rsations we're having about LLMs... the generals and the soldiers will come into the room and be like, "All right. We're now in the Manhattan Project stage of this whole thing." - I think in 2025, '26, '27, I don't think something like that is even remotely possible. You can make the same argument f or computers, right?
**Sebastian Raschka:** 我们应该说——AI27 报告在叙事角度上预测的一件事是——会有很多集中化。随着 AI 系统变得越来越聪明——国家安全顾虑会出现——你会集中化实验室——它们会变得超级保密——然后会有这种竞赛——从军事角度——你怎么……中美之间。所以所有这些我们在聊的关于 LLM 的有趣对话……将军们和士兵们会走进来说——"好了。我们现在进入了整个事情的曼哈顿计划阶段。"
**Sebastian Raschka:** You can say, "Computers are capable and we don't want the general public to get them." Or chips, even AI chips, but you see how H uawei makes chips now. It took a few years, but... and I don't think there is a way you can contain knowledge like that. I think in this day and age, it is impossible, like the internet.
**Nathan Lambert:** 我觉得在 2025、2026、2027 年——我不觉得那样的事情哪怕是遥远地可能。你可以对计算机做同样的论证——对吧?你可以说——"计算机有能力——我们不想让公众拿到。"或者芯片——甚至 AI 芯片——但你看 Huawei 现在就在做芯片了。花了几年——但是……而且我不觉得有办法能遏制那样的知识。我觉得在今天这个时代——那是不可能的——就像互联网一样。我不觉得这有可能。
**Nathan Lambert:** I don't think this is a possibility. - On the Manhattan Project thing, I think that a Manhattan Project-like thin g for open models would be pretty reasonable, because it wouldn't cost that much. But I think that will come. It seems like culturally, the companies are changing.
**Lex Fridman:** 关于曼哈顿计划的事——我觉得为开放模型做一个类似曼哈顿计划的事情是相当合理的——因为它不会花那么多钱。但我觉得那会来的。看起来文化上——公司们在改变。但我同意 Sebastian 说的所有观点。我不觉得那会发生也不觉得那会有帮助。
**Lex Fridman:** But I agree with Sebastian on all of that. I don't see it happening nor being helpful. - Yeah. The motivating force behind the Manhattan Project was civilizational risk.
**Nathan Lambert:** 是的。曼哈顿计划背后的推动力是文明风险。要为开源模型激发那种动力更难。
**Nathan Lambert:** It's harder to motivate that for open-source models. - There's no civilizational risk. - On the hardware side, we me ntioned NVIDIA a bunch of times. Do you think Jensen and NVIDIA will keep winning? - I think they have to iterate and manufacture a lot. And I think t hey probably... what they're doing, they do innovate, but I think there's always the chance that someone does something fundamentally different, gets very lucky, and then does something.
**Sebastian Raschka:** 没有文明风险。
**Sebastian Raschka:** But the problem is adoption. The moat of NVIDIA is probably not just the GPU. It's more like the CUDA ecosystem, and that has evolved over two decades.
**Lex Fridman:** 在硬件方面——我们提了很多次 NVIDIA。你觉得 Jensen 和 NVIDIA 会继续赢吗?
**Lex Fridman:** Even back when I was a grad student, I was in a lab doing biophysical simulations, molecular dynamics, and we h ad a Tesla GPU back then just for the computations. It was about 15 years ago now. And they built this up for a long time, and that's the moat, I thin k.
**Sebastian Raschka:** 我觉得他们必须不断迭代和大量制造。而且我觉得他们可能……他们确实在创新——但我觉得总有一种可能——有人做了根本不同的东西——非常幸运——然后做成了某件事。但问题是采用。NVIDIA 的护城河可能不只是 GPU。更多是 CUDA 生态系统——那已经发展了二十年。即使回到我读研的时候——我在一个做生物物理模拟、分子动力学的实验室——我们当时就有 Tesla GPU 来做计算。那大约是 15 年前了。他们建设这个已经很长时间了——那才是护城河——我觉得。不是芯片本身——虽然他们有钱来迭代、建造和扩展。但真正是关于兼容性。如果你在那个规模上——你为什么要用一个有风险的东西——一年只能做几块芯片?你选大的。但我确实觉得——有了 LLM——设计类似 CUDA 的东西会更容易。CUDA 花了 15 年是因为它很难——但现在我们有 LLM——也许我们可以复制 CUDA。
**Sebastian Raschka:** It's not the chip itself, although they have the money to iterate, build, and scale. But then it's really about compatibility. If you're at that sc ale, why would you go with something risky where there are only a few chips they can make per year? You go with the big one. But then I do think with LLMs now, it will be easier to design something like CUDA. It took 15 years because it was hard, but now that we have LLMs, we can maybe replicate CUD A. - And I wonder if there will be a separation of training and inference compute as we stabilize, and more compute is needed for inference. - That's supposed to be the point of the Groq acquisition.
**Nathan Lambert:** 而且我想知道会不会出现训练和推理算力的分离——随着我们稳定下来——需要更多算力做推理。
**Nathan Lambert:** And that's why part of what Vera Rubin is- where they have a new chip with no high-bandwidth memory, which is one of the- or very little, which is one of the most expensive pieces. It's designed for pre-fill, which is the part of inference where you essentially do a lot of matrix multiplications. And then you only need the memory when you're doing this autoregressive generation, and you have the K V cache swaps.
**Lex Fridman:** 那就应该是 Groq 收购的意义。也是 Vera Rubin 的一部分——他们有一个新芯片——没有高带宽内存——或者很少——那是最贵的组件之一。它是为 pre-fill 设计的——那是推理中做大量矩阵乘法的部分。然后你只在做自回归生成时才需要内存——KV cache 交换。所以他们有这个新 GPU——专为那种特定用例设计——然后每 FLOP 或什么的拥有成本实际上低得多。但我觉得 NVIDIA 的命运仍然取决于 AI 的扩散。他们最大的客户仍然是这些超大规模公司。Google 显然可以做 TPU。Amazon 在做 Trainium。Microsoft 会尝试做自己的东西。只要 AI 进步的速度很快——NVIDIA 的平台是最灵活的——人们会想要那个。但如果有停滞——那制造定制芯片就有更多时间来做。
**Lex Fridman:** So they have this new GPU that's designed for that specific use case, and then the cost of ownership per FLOP or whatever is actually w ay lower. But I think that NVIDIA's fate lies in the diffusion of AI still. Their biggest clients are still these hyperscale companies. Like, Google o bviously can make TPUs. Amazon is making Trainium. Microsoft will try to do its own things.
**Sebastian Raschka:** 有趣的是 NVIDIA 在积极尝试开发各种不同的产品。
**Sebastian Raschka:** And so long as the pace of AI progress is high, NVIDIA's p latform is the most flexible and people will want that. But if there's stagnation, then creating bespoke chips, there's more time to do it. - It's int eresting that NVIDIA is quite active in trying to develop all kinds of different products. - They try to create areas of commercial value that will us e a lot of GPUs. - Mm-hmm. But they keep innovating and they're doing a lot of incredible research, so... - Everyone says the company's super oriented around Jensen and how operationally plugged in he is.
**Nathan Lambert:** 他们试图创造会使用大量 GPU 的商业价值领域。
**Nathan Lambert:** And it sounds so unlike many other big companies that I've heard about. And so long as that's t he culture, I think that we can expect that to keep progress happening. And it's like he's still in the Steve Jobs era of Apple.
**Sebastian Raschka:** 嗯。但他们持续创新——而且在做大量令人难以置信的研究——所以……
**Sebastian Raschka:** So long as that is ho w it operates, I'm pretty optimistic for their situation because it's like, it is their top-order problem, and I don't know if making these chips for the whole ecosystem is the top goal of all these other companies. They'll do a good job, but it might not be as good of a job. - Since you mentioned J ensen, I've been reading a lot about history and about singular figures in history. What do you guys think about the single man/woman view of history?
**Nathan Lambert:** 所有人都说这家公司超级围绕 Jensen 运转——他在运营上有多深度参与。听起来和我听说过的很多其他大公司完全不一样。只要那是文化——我觉得我们可以期待进步继续发生。就像他仍然处在 Apple 的 Steve Jobs 时代。只要它那样运转——我对他们的处境相当乐观——因为那是他们的头号问题——而且我不知道为整个生态系统做芯片是不是所有其他公司的首要目标。他们会做得不错——但可能不会做得那么好。
**Nathan Lambert:** How important are individuals for steering the direction of history in the tech sector? So, you know, what's NVIDIA without Jensen? You mentioned Ste ve Jobs. What's Apple without Steve Jobs?
**Lex Fridman:** 说到 Jensen——我最近一直在读很多关于历史的东西——关于历史上那些标志性人物。你们怎么看单个人物/个人视角的历史观?在科技领域——个人对引导历史方向有多重要?比如——没有 Jensen 的 NVIDIA 是什么?你提到了 Steve Jobs。没有 Steve Jobs 的 Apple 是什么?没有 Elon 的 xAI 或没有 Demis 的 DeepMind 是什么?
**Lex Fridman:** What's xAI without Elon or DeepMind without Demis? - People make things earlier and faster, whereas scientif ically, many great scientists credit being in the right place at the right time and still making the innovation, where eventually someone else will st ill have the idea. So I think that in that way, Jensen is helping manifest this GPU revolution much faster and much more focused than it would happen without having a person there. And this is making the whole AI build-out faster.
**Nathan Lambert:** 人让事情更早和更快地发生。而科学上——很多伟大的科学家把功劳归于恰好在对的时间对的地方——而仍然做出了创新。最终别人也会有那个想法。所以我觉得在那个意义上——Jensen 在帮助这场 GPU 革命更快、更集中地实现——比没有他的情况下。而且这让整个 AI 建设更快了。但我仍然认为最终——像 ChatGPT 这样的东西会出现——像这样的建设会发生——只是可能不会那么快。我觉得那是大致适用的味道。
**Nathan Lambert:** But I do still think that eventually, something like ChatGPT would ha ve happened and a build-out like this would have happened, but it probably would not have been as fast. I think that's the sort of flavor that is appl ied. - These individual people, there are people who are placing bets on something. Some get lucky, some don't.
**Sebastian Raschka:** 这些个人——有些人在押注某些东西。有些幸运——有些不幸运。但如果你没有这些人在掌舵——它会更分散。就像投资 ETF 对比个股。个股可能涨跌更剧烈——ETF 更平衡。它最终会随时间上涨。我们会到达那里。但就是——你知道——我觉得关键是专注。激情和专注。
**Sebastian Raschka:** But if you don't have these people at the helm, it would be more diffused. It's almost like investing in an ETF versus individual stocks. Individual stocks might go up or down more heavily than an ETF, which is more balanced.
**Lex Fridman:** 难道不是有一个真正的论据——没有 Jensen——就没有深度学习革命的重振吗?
**Lex Fridman:** It will eventually go up over time. We'll get there. But it's just like, you know, the focus I think is the thin g. Passion and focus. - Isn't there a real case to be made that without Jensen, there's not a reinvigoration of the deep learning revolution? - It cou ld've been 20 years later, is what I would say.
**Nathan Lambert:** 可能会晚 20 年——那是我会说的。或者如果没有 GPU 的话——可能又一个 AI 寒冬会来。
**Nathan Lambert:** Or like another AI winter could have come if GPUs weren't around. - That could change history complete ly because you could think of all the other technologies that could've come in the meantime, and the focus of human civilization would get... Silicon Valley would be captured by different hype. - But I do think there's certainly an aspect where it was all planned, the GPU trajectory. But on the othe r hand, it's also a lot of lucky coincidences or good intuition.
**Lex Fridman:** 那可能会完全改变历史——因为你可以想想在那期间会出现的所有其他技术——而人类文明的焦点会被……硅谷会被不同的炒作所占据。
**Lex Fridman:** Like the investment into, let's say, biophysical simulations. I mean, I think it star ted with video games and then it just happened to be good at linear algebra because video games require a lot of linear algebra. And then you have the biophysical simulations.
**Sebastian Raschka:** 但我确实觉得——当然有一个方面是所有的 GPU 轨迹都是有计划的。但另一方面——也有很多幸运的巧合或者好的直觉。比如对生物物理模拟的投资。我是说——我觉得它始于视频游戏——然后恰好擅长线性代数——因为视频游戏需要大量线性代数。然后你有了生物物理模拟。但我仍然不觉得主计划是 AI。我觉得恰好是 Alex Krizhevsky。某个人拿了这些 GPU 然后说——"嘿——让我们试试在上面训练一个神经网络。"恰好效果非常好——而且我觉得它只发生是因为你可以购买那些 GPU。
**Sebastian Raschka:** But still, I don't think the master plan was AI. I think it happened to be Alex Krizhevsky. So someone took these GPUs and s aid, "Hey, let's try to train a neural network on that." It happened to work really well, and I think it only happened because you could purchase thos e GPUs. - Gaming would've created a demand for faster processors if NVIDIA had gone out of business in the early days.
**Nathan Lambert:** 如果 NVIDIA 在早期就倒闭了——游戏对更快处理器的需求仍然会存在。那是我会想的。我觉得 GPU 会不同——但我觉得在 AlexNet 和 Transformer 出现的时候 GPU 仍然会存在。只是很难知道是一家公司那么成功——还是多家小公司有更差的芯片。但我不觉得那是 100 年的延迟。可能是十年的延迟。
**Nathan Lambert:** That's what I would think. I th ink that the GPUs would've been different, but I think GPUs would still exist at the time of AlexNet and at the time of the Transformer. It was just h ard to know if it would be one company as successful or multiple smaller companies with worse chips. But I don't think that's a 100-year delay. It mig ht be a decade delay. - Well, it could be one, two, three, four, five-decade delay. I just can't see Intel or AMD doing what NVIDIA did. - I don't thi nk it would be a company that exists.
**Lex Fridman:** 嗯——可能是一、二、三、四、五十年的延迟。我就是看不到 Intel 或 AMD 做 NVIDIA 做了的事。
**Lex Fridman:** I think it would be a different company that would rise. - Like Silicon Graphics or something. - So yeah, some c ompany that has died would have done it. - But just looking at it, it seems like these singular figures, these leaders, have a huge impact on the traj ectory of the world. Obviously, there are incredible teams behind them. But, you know, having that kind of very singular, almost dogmatic focus- -is n ecessary to make progress. - Yeah, I mean, even with GPT, it wouldn't exist if there wasn't a person, Ilya, who pushed for this scaling, right? - Yeah , Dario Amodei was also deeply involved in that.
**Nathan Lambert:** 我不觉得会是一个现有的公司。我觉得会是一个不同的公司崛起。
**Nathan Lambert:** If you read some of the histories from OpenAI, it seems wild thinking about how early these people we re like, "We need to hook up 10,000 GPUs and take all of OpenAI's compute and train one model." There were a lot of people who didn't want to do that. - Which is an insane thing to believe. To believe in scaling before scaling has any indication that it's going to materialize. Again, singular figure s.
**Lex Fridman:** 像 Silicon Graphics 什么的。
**Lex Fridman:** Speaking of which, 100 years from now, this is presumably post-singularity, whatever singularity is. When historians look back at our time now, wha t technological breakthroughs would they really emphasize as the breakthroughs that led to the singularity? So far we have Turing to today, 80 years. - I think it would still be computing, like the umbrella term "computing." I don't necessarily think that in 100 or 200 years it would be AI.
**Nathan Lambert:** 所以是的——某个已经消亡的公司会做那件事。
**Nathan Lambert:** It could still very well be computers. We are now taking better advantage of them, but the fact of computing remains. - It's basically a Moore's Law discussio n. Even the details of CUDA and GPUs won't even be remembered, nor will all this software turmoil.
**Lex Fridman:** 但就看着这些——这些标志性人物——这些领袖——确实对世界轨迹有巨大影响。显然他们背后有不可思议的团队。但——你知道——有那种非常单一的、几乎教条式的专注——是推进进步所必需的。
**Lex Fridman:** It'll just be, obviously, compute. - I generally ag ree, but is the connectivity of the internet and compute able to be merged? Or is it both of them? - I think the internet will probably be related to communication. It could be a phone, the internet, or satellites.
**Nathan Lambert:** 是的——我是说——即使 GPT——如果没有一个人——Ilya——推动了这种 scaling——它也不会存在——对吧?
**Nathan Lambert:** Compute is more like the scaling aspect of it. - It's possible that the internet is c ompletely forgotten- -that the internet is wrapped into phone networks, like communication networks. This is just another manifestation of that, and t he real breakthrough comes from increased compute, or Moore's Law, broadly defined. - Well, I think the connection of people is very fundamental to it . it's like, you can talk to anyone. You want to find the best person in the world for something, they are somewhere in the world.
**Lex Fridman:** 是的——Dario Amodei 也深度参与了。如果你读一些 OpenAI 的历史——想想这些人这么早就说"我们需要连接 10,000 个 GPU——拿走 OpenAI 所有的算力——训练一个模型"是多么疯狂。有很多人不想那样做。
**Lex Fridman:** And being able to h ave that flow of information—the AIs will also rely on this. I've been fixating on when I said the dream was dead about the one central model. The thi ng that is evolving is people having many agents for different tasks.
**Nathan Lambert:** 这是一件疯狂的信念。在 scaling 有任何迹象表明它会实现之前就相信 scaling。又是标志性人物。说到这——100 年后——这大概是后奇点了——不管奇点是什么。当历史学家回顾我们这个时代——什么技术突破会被他们真正强调为导致奇点的突破?到目前为止——从 Turing 到今天——80 年。
**Nathan Lambert:** People already started doing this with different clouds. It's described as many AGIs in the data center where each one manages and they talk to each other. And that is reliant on networking and the free flow of information. on top of compute. But networking, especially with GPUs, is such a part of scaling of compute.
**Sebastian Raschka:** 我觉得仍然会是计算——这个统称——"计算"。我不一定认为在 100 或 200 年后它会是 AI。它很可能仍然是计算机。我们现在在更好地利用它们——但计算这个事实不变。
**Sebastian Raschka:** The GPUs and the data centers need to talk to each other. - A nything about neural networks will be remembered? Like, do you think there's something very specific and singular to the fact that it's neural network s that's seen as a breakthrough, like a genius, that you're basically replicating, in a very crude way, the human mind? The structure of the human bra in, the human mind? - I think without the human mind, we probably wouldn't have neural networks, because it just was an inspiration for that. But on t he other end, I think it's just so, so different. I mean, it's digital versus biological, that I do think it will probably be more grouped as an algor ithm. - That's massively parallelizable... ...On this particular kind of compute? - It could have been like genetic computing; genetic algorithms just parallelized. It just happens that this is more efficient and works better. - And it very well could be that the LLM, the neural networks, the way we architect them now is just a small component of the system that leads to singularity. - If you think of it in 100 years, I think society can be chang ed more with more compute and intelligence because of autonomy.
**Nathan Lambert:** 基本上是一个 Moore's Law 的讨论。即使 CUDA 和 GPU 的细节都不会被记住——也不会记住所有这些软件的动荡。就是——显然——算力。
**Nathan Lambert:** But looking at this, what are the things from the Industrial Revolution that we rememb er? We remember the engine, which is probably the equivalent of the computer in this. But there's a lot of other physical transformations that people are aware of, like the cotton gin and all these things, these machines that are still known: air conditioning, refrigerators.
**Lex Fridman:** 我大体上同意——但互联网的连接性和算力能不能被合并?还是两者都是?
**Lex Fridman:** Some of these things fro m AI will still be known. The word "transformer" could still be known. I would guess that deep learning is definitely still known, but the transformer might be evolved away from in 100 years with AGI researchers everywhere.
**Sebastian Raschka:** 我觉得互联网可能会和通信相关。它可能是电话、互联网或卫星。算力更像是 scaling 的那个方面。
**Sebastian Raschka:** But I think deep learning is likely to be a term that is remembered. - And I wonder what the air conditioning and refrigeration of the future is that AI brings. If we travel forward 100 years from now, we transport there right now, what do you think is different? How do you think the world looks different?
**Lex Fridman:** 有可能互联网被完全遗忘——互联网被包含进电话网络——像通信网络——这只是它的另一种表现形式——而真正的突破来自更多的算力——或者广义定义的 Moore's Law。
**Lex Fridman:** First of all, do you think there are humans? Do you think there are robots everywhere walking around? - I do think specialized robots, for sure, for certain tasks. - Humanoid form? - Maybe half-humanoid. We'll see.
**Nathan Lambert:** 嗯——我觉得人与人的连接是非常根本的。就是——你可以跟任何人交谈。你想找到世界上某件事最好的人——他们在世界的某个地方。而能有那种信息流——AI 也会依赖这个。我一直在纠结——我说那个梦想已死——关于一个中心模型的事。正在演变的是——人们有很多 agent 用于不同任务。人们已经开始用不同的云来做这件事了。它被描述为数据中心里的很多 AGI——每一个管理着某些东西——它们互相对话。那依赖于网络和信息的自由流动——在算力之上。但网络——尤其是 GPU 的网络——是算力 scaling 的重要组成部分。GPU 和数据中心需要互相通信。
**Nathan Lambert:** I t hink for certain things, yes, there will be humanoid robots because it's just amenable for the environment. But for certain tasks, it might make sense . What's harder to imagine is how we interact with the devices and what humans do with devices.
**Lex Fridman:** 关于神经网络——会被记住什么?比如——你觉得有什么关于神经网络的非常具体和独特的东西——被视为一个突破——像一种天才——你基本上在以非常粗糙的方式复制人类的思维?人脑的结构、人的思维?
**Lex Fridman:** Well, I mean, I'm pretty sure it will probably not be the cellphone or the laptop. Will it be implants? - I mean, it has to be brain-computer interfaces, right? I mean, 100 years from now, given the progr ess we're seeing now— there has to be... unless there's legitimately a complete alteration of how we interact with reality. - On the other hand, cars are older than 100 years, right?
**Sebastian Raschka:** 我觉得没有人类思维——我们可能就不会有神经网络——因为那只是它的灵感来源。但另一方面——我觉得它们太不同了。数字的对比生物的——我确实觉得它可能更多被归类为一种算法。
**Sebastian Raschka:** And it's still the same interface. We haven't replaced cars with something else. We just made them better, but it's s till a steering wheel, still wheels, you know? - I think we'll still carry around a physical brick of compute because people want some ability to have a private...
**Nathan Lambert:** 大规模并行化的……在这种特定类型的算力上?
**Nathan Lambert:** Like, you might not engage with it as much as a phone, but having private information that is yours as an interface between the rest of the internet, I think that will still exist. It might not look like an iPhone and it might be used a lot less, but I still expect people to carry thin gs around. - Why do you think the smartphone is the embodiment of private? There's a camera on it.
**Sebastian Raschka:** 它可能是遗传计算——遗传算法被并行化。只是恰好神经网络更高效、效果更好。
**Sebastian Raschka:** There's— - Private for you, like encrypted messages , encrypted photos... know what your life is. I guess it's a question of how optimistic on brain-machine interfaces you are. Is all that just going to be stored in the cloud?
**Nathan Lambert:** 而且完全有可能——LLM——神经网络——我们现在构建它们的方式——只是导向奇点的系统的一个小组件。
**Nathan Lambert:** Your whole calendar? It's hard to think about processing all the information that we can process visually through brain-machi ne interfaces presenting something like a calendar or something to you. It's hard to just think about knowing, without looking, your email inbox.
**Sebastian Raschka:** 如果你在 100 年后想——我觉得社会可以因为自主性——通过更多算力和智能——而被改变更多。但看看这些——工业革命中我们记住了什么?我们记住了引擎——那大概是计算机的等价物。但还有很多其他物理变革——人们知道的——比如轧棉机和所有这些机器——那些仍然为人所知的:空调、冰箱。AI 中的一些东西仍然会为人所知。"Transformer"这个词可能仍然会被知道。我猜深度学习肯定仍然被知道——但 transformer 在 100 年后可能会被 AGI 研究者们演化掉。但我觉得深度学习很可能是一个会被记住的术语。
**Sebastian Raschka:** Like you signal to a computer and then you just know your email inbox. Is that something that the human brain can handle being piped into it non-visually? I don't know exactly how those transformations happen. Humans aren't changing in 100 years.
**Lex Fridman:** 而且我想知道——AI 带来的空调和冰箱是什么。如果我们往前穿越 100 年——我们现在被传送到那里——你觉得有什么不同?你觉得世界看起来有什么不同?首先——你觉得还有人类吗?你觉得到处都是走来走去的机器人吗?
**Lex Fridman:** I think agency and community are things that people actua lly want. - A local community, yeah. - People you are close to, being able to do things with them and being able to ascribe meaning to your life and b eing able to do things. In 100 years, I don't think that human biology is changing away from those on a time scale that we can discuss. And I think th at UBI does not solve agency.
**Sebastian Raschka:** 我确实觉得会有专业化的机器人——对的——用于某些任务。
**Sebastian Raschka:** I do expect mass wealth, and I hope that it has spread so that the average life looks very different in 100 years. But t hat's still a lot to happen. If you think about countries that are early in their development process to getting access to computing and internet, to build all the infrastructure and have policy that shares one nation's wealth with another is...
**Lex Fridman:** 人形的吗?
**Lex Fridman:** I think it's an optimistic view to see all that happen ing in 100 years- ...while they are still independent entities and not just like absorbed into some international order by force. - But there could be just better, more elaborate, more effective ... social support systems that help alleviate some levels of basic suffering from the world. You know, t he transformation of society where a lot of jobs are lost in the short term, I think we have to really remember that each individual job that's lost i s a human being who's suffering. That's like a ...
**Sebastian Raschka:** 也许半人形的。我们会看到的。我觉得某些东西——是的——会有人形机器人——因为它适合那个环境。但某些任务——可能不是人形更有道理。更难想象的是我们怎么和设备交互——以及人类用设备做什么。嗯——我的意思是——我很确定它可能不会是手机或笔记本电脑。会是植入物吗?
**Sebastian Raschka:** When jobs are lost, the scale is a real tragedy. You can make all kinds of arguments about economic s or how it's all going to be okay. It's good for the GDP, there's going to be new jobs created.
**Lex Fridman:** 我是说——必须是脑机接口——对吧?100 年后——鉴于我们现在看到的进展——必须有……除非真的有一个对我们与现实交互方式的彻底改变。
**Lex Fridman:** Fundamentally at the individual level for that human being, that's real suffering. That's a real personal sort of tragedy. And we have to not forget that as the technologies are being developed.
**Sebastian Raschka:** 另一方面——汽车超过 100 年了——对吧?而且还是同样的界面。我们没有用其他东西替代汽车。我们只是让它们更好了——但还是方向盘——还是轮子——你知道?
**Sebastian Raschka:** And also my hope for all the AI slop we're seeing is that there will be a greater and greater premium for the fundamental aspects of the human experience that are in-person. The things that we all... Like seeing each other, talking together in-person. - The next few years are definitely going to be an incre ased value on physical goods and events— ...and even more pressure on slop.
**Nathan Lambert:** 我觉得我们仍然会带着一块实体的算力——因为人们想要某种能力来拥有私人的……你可能不会像用手机那样频繁地和它交互——但拥有属于你的私人信息——作为你和互联网其他部分之间的界面——我觉得那仍然会存在。它可能不像 iPhone——而且可能用得少得多——但我仍然期望人们会携带东西。
**Nathan Lambert:** So it'll be... the slop is only starting. The next few years will be more and more diverse ...versions of slop. - They would be drowning in slop. Is that what— - So I'm hoping that society drowns in slop enough to snap out o f it and be like, "We can't deal with it.
**Lex Fridman:** 你为什么觉得智能手机是私密性的化身?它上面有摄像头。有——
**Lex Fridman:** It just doesn't matter." And then, the physical has such a higher premium on it. - Even like classic example s, I honestly think this is true, and I think we will get tired of it. We are already kind of tired of it. I mean, even art.
**Nathan Lambert:** 对你来说是私密的——比如加密消息、加密照片……知道你的生活是什么样的。我猜这是一个你对脑机接口有多乐观的问题。那些都会存储在云端吗?你整个日历?很难想象通过脑机接口以非视觉方式处理我们视觉能处理的所有信息——比如给你呈现一个日历什么的。很难想象在不看的情况下就知道你的邮件收件箱。就是——你给计算机发信号——然后你就知道你的邮件收件箱了。这是人脑能处理的——以非视觉方式被输入的东西吗?我不确切知道那些变换是怎么发生的。人类在 100 年内不会改变。我觉得主动性和社区是人们真正想要的东西。
**Nathan Lambert:** I don't think art will go away. You have paintings, physical paintings. There's more value, not just monetary value, but just more value appreciation for the actual painting t han a photocopy of that painting.
**Lex Fridman:** 本地社区——是的。
**Lex Fridman:** It could be a perfect digital reprint, but there is something when you go to a museum and you look at that art and y ou see the real thing and you just think, "Okay. A human." It's like a craft. You have like an appreciation for that.
**Nathan Lambert:** 你身边亲近的人——能和他们一起做事——能给你的生活赋予意义——能做事情。在 100 年内——我不觉得人类生物学会从那些方面改变——在我们能讨论的时间尺度上。而且我觉得 UBI 不能解决主动性。我确实期待大规模的富裕——而且我希望它已经扩散——让 100 年后的平均生活看起来非常不同。但那还有很多事情要发生。如果你想想那些还处于获取计算和互联网的早期发展阶段的国家——要建设所有的基础设施——要有把一个国家的财富分享给另一个国家的政策——我觉得在 100 年内看到所有这些都发生是一种乐观的看法——当它们仍然是独立实体——而不是被某种强制吞并进某个国际秩序的时候。
**Nathan Lambert:** And I think the same is true for writing, for talking, for any type of experience... I do unfortunately think it will be like a dichotomy, like a fork where some things will be autom ated. Like, you know, there are not as many paintings as there used to be, you know, 200 years ago.
**Lex Fridman:** 但可能有更好的、更精细的、更有效的……社会支持系统——帮助减轻世界上某些层面的基本苦难。你知道——社会的转型——在短期内很多工作被失去——我觉得我们真的必须记住——每一个失去的工作都是一个正在受苦的人。那就像一个……当工作被失去——那个规模是一个真正的悲剧。你可以做各种经济学的论证——或者一切都会好起来。对 GDP 有利——会创造新工作。但从根本上——在个人层面——对那个人来说——那是真实的痛苦。那是一种真实的个人悲剧。我们不能在技术被开发的时候忘记这一点。而且——我对我们看到的所有 AI slop 的希望是——会有越来越高的溢价——给那些人类体验中根本性的、面对面的方面。我们所有人……像是见到彼此——面对面地交谈。
**Lex Fridman:** There are more photographs, more photocopies. But at the same time, it won't go away. There will be value in that.
**Nathan Lambert:** 未来几年肯定是实体商品和活动价值提升的时期——而且 slop 的压力会更大。所以……slop 才刚刚开始。未来几年会有越来越多样的……slop 的版本。
**Nathan Lambert:** I think the difference will just be, you know, what's the proportion of that. But per sonally, I have a hard time reading things where I obviously see it's obviously AI generated. I'm sorry.
**Lex Fridman:** 我们会被 slop 淹没。是那样吗?
**Lex Fridman:** It might—it might be really good information there, but I'm just like, "Nah, not for me." - I think eventually they'll fool you, and it'll be on platforms that give ways of verifying or building trust. So you will trust that Lex is not AI generated, having been here. So then you have trust in this channel. But it's harder for new people who do n't have that trust. - Well, that will get interesting because I think fundamentally it's a solvable problem by having trust in certain outlets that t hey won't do it, but it's all going to be trust-based.
**Nathan Lambert:** 所以我希望社会被 slop 淹没到一定程度后清醒过来——然后说——"我们受不了了。就是不重要了。"然后——实体的东西有更高的溢价。
**Nathan Lambert:** There will be systems to authorize, "Okay, this is real. This is not real." There will be some telltale signs where you can obviously tell this is AI generated and this is not. But some will be so good that it's hard to tell, and then you have t o trust.
**Sebastian Raschka:** 甚至像经典的例子——我老实说觉得这是对的——而且我觉得我们会厌倦的。我们已经有点厌倦了。我是说——甚至艺术。我不觉得艺术会消失。你有绘画——实体绘画。有更多的价值——不只是货币价值——而是更多的价值欣赏——对那幅真正的画——而不是那幅画的复印件。它可以是完美的数字重印——但当你去博物馆——你看那个艺术——你看到真实的东西——你就想——"好。一个人类。"就像一种手艺。你有对那个的欣赏。而且我觉得写作、交谈、任何类型的体验都一样……我不幸地觉得它会像一个二分法——像一个分叉——某些东西会被自动化。比如——你知道——绘画没有 200 年前那么多了。有更多的照片——更多的复印件。但同时——它不会消失。那里面会有价值。我觉得区别只是——你知道——比例是什么。但就个人而言——我很难读那些我明显看出是 AI 生成的东西。抱歉。它可能——可能信息真的很好——但我就是——"算了——不适合我。"
**Sebastian Raschka:** And well, that will get interesting and a bit problematic. - The extreme case of this is to watermark all human content. So all photos that w e take on our own have some watermark until they are edited or something like this. And software can manage communications with the device manufacture r- device manufacturer to maintain human editing— which is the opposite of the discussion to try to watermark AI images. And then you can make a Googl e image that has a watermark and use a different Google tool to remove it. - Yep.
**Nathan Lambert:** 我觉得最终它们会骗过你——而且会在提供验证或建立信任方式的平台上。所以你会信任 Lex 不是 AI 生成的——因为来过这里。所以你就信任这个频道。但对于没有那种信任的新人来说更难。
**Nathan Lambert:** It's going to be an arms race, basically. - And we've been mostly fo cusing on the positive aspects of AI. All the capabilities that we've been talking about can be used to destabilize human civilization with even just relatively dumb AI applied at scale, and then further, superintelligent AI systems. Of course, there's the sort of doomer take that's important to con sider as we develop these technologies.
**Sebastian Raschka:** 嗯——那会变得有趣——因为我觉得根本上——这是一个可以通过对某些媒体的信任来解决的问题——它们不会那样做——但一切都会是基于信任的。会有系统来认证——"好——这是真的。这不是真的。"会有一些端倪——你明显能看出这是 AI 生成的、这不是。但有些会好到很难分辨——然后你就得信任了。嗯——那会变得有趣——而且有点成问题。
**Sebastian Raschka:** What gives you hope about the future of human civilization, given everything we've been talking about? Are we going to be okay? - I think we will. I'm definitely a worrier, both about AI and non-AI things. But humans do tend to find a way.
**Nathan Lambert:** 这个的极端情况是——给所有人类内容加水印。所以我们自己拍的所有照片都有某种水印——直到它们被编辑或什么的。软件可以管理和设备制造商之间的通信——设备制造商来维护人类编辑——这跟试图给 AI 图像加水印的讨论恰好相反。然后你可以做一个有水印的 Google 图片——用另一个 Google 工具把它去掉。
**Nathan Lambert:** I think that's what humans are built for: to have community and find a way to figure out problems. That's what has gotten us to this point. And to think that the AI oppor tunity and related technologies is really big.
**Sebastian Raschka:** 是的。那会是一场军备竞赛——基本上。
**Sebastian Raschka:** And I think that there's big social and political problems to help everybody understand that. And I thi nk that's what we're staring at a lot of right now, is like the world is a scary place, and AI is a very uncertain thing. And it takes a lot of work t hat is not necessarily building things. It's like telling people and understanding people, that the people building AI are historically not motivated or wanting to do.
**Lex Fridman:** 而且我们一直主要关注 AI 的积极方面。我们讨论的所有能力——都可以被用来破坏人类文明的稳定——即使是相对愚蠢的 AI 大规模应用——然后更进一步——超级智能 AI 系统。当然——在我们开发这些技术时——那种末日论调是重要的需要考虑的。什么给你对人类文明未来的希望——鉴于我们讨论的一切?我们会没事吗?
**Lex Fridman:** But it is something that is probably doable. It just will take longer than people want. And we have to go through that long period o f like hard, distraught AI discussions if we want to have the lasting benefits. - Yeah.
**Nathan Lambert:** 我觉得会。我绝对是一个担忧者——关于 AI 的和非 AI 的事情。但人类确实倾向于找到一条路。我觉得那就是人类天生要做的:有社区、找到解决问题的方法。那就是让我们走到今天的东西。而且要想到 AI 的机遇和相关技术确实很大。我觉得有很大的社会和政治问题要帮助每个人理解。我觉得那就是我们现在面对的很多东西——世界是一个可怕的地方——AI 是一个非常不确定的东西。这需要很多工作——不一定是建造东西。而是告诉人们和理解人们——那些建造 AI 的人历史上并不是有动力或想要做的。但那是可能做到的事情。只是会比人们想要的花更长时间。我们必须经历那段漫长的——困难的、焦虑的 AI 讨论——如果我们想拥有持久的好处。
**Nathan Lambert:** Through that process, I'm especially excited that we get a cha nce to better understand ourselves, us at the individual level as humans and at the civilization level, and answer some of the big mysteries, like wha t is this whole consciousness thing going on here? It seems to be truly special. Like, there's a real miracle in our mind. And AI puts a mirror to our selves and we get to answer some of the big questions about like, what is this whole thing going on here? - Well, one thing about that is also what I do think makes us very different from AI and why I don't worry about AI taking over is, like you said, consciousness.
**Lex Fridman:** 是的。通过那个过程——我特别兴奋的是我们有机会更好地理解自己——在个人层面作为人类——以及在文明层面——回答一些大谜题——比如这整个意识的事情是怎么回事?它似乎真的很特别。我们的思维中有一个真正的奇迹。而 AI 给我们自己放了一面镜子——我们得以回答一些大问题——比如——这到底是怎么一回事?
**Lex Fridman:** We humans, we decide what we wan t to do. AI in its current implementation, I can't see it changing. You have to tell it what to do.
**Sebastian Raschka:** 嗯——关于那个的一个事情也是——我确实认为让我们和 AI 非常不同的——也是为什么我不担心 AI 接管的——就像你说的——意识。我们人类——我们决定我们想做什么。AI 在目前的实现中——我看不到它改变。你必须告诉它做什么。所以你仍然有主动权。它不会从你那里夺走主动权——因为它变成了一个工具。你可以把它当作一个工具。你告诉它做什么。它会比其他之前的工具更自动——它当然比锤子更强大——它可以解决问题——但仍然是你在掌控——对吧?所以 AI 不在掌控——你在掌控。你告诉 AI 做什么——它为你做。
**Sebastian Raschka:** And so you have still the agency. It doesn't take the agency from you because it becomes a tool. You can think of it as a tool. You tell it what to do.
**Lex Fridman:** 所以在后奇点、后末日的人类与机器之间的战争中——你是说人类值得为之战斗?
**Lex Fridman:** It will be more automatic than other previous to ols. It's certainly more powerful than a hammer, it can figure things out, but it's still you in charge, right? So the AI is not in charge, you're in charge.
**Sebastian Raschka:** 百分之百。我是说——这就是……电影《终结者》——他们在 80 年代拍的——而且我确实觉得——嗯——我能看到出问题的唯一情况是——当然——如果东西被明确编程来做有害的事——基本上。
**Sebastian Raschka:** You tell the AI what to do and it's doing it for you. - So in the post-singularity, post-apocalyptic war between humans and machines, you're s aying humans are worth fighting for? - 100%. I mean, this is... The movie Terminator, they made in the '80s, essentially, and I do think, well, the on ly thing I can see going wrong is, of course, if things are explicitly programmed to do the thing that is harmful, basically. - I think actually in th at, in a Terminator type of setup, I think humans win. I think we're too clever.
**Lex Fridman:** 我觉得实际上在那种情况——在终结者类型的设定中——我觉得人类会赢。我觉得我们太聪明了。很难解释我们是怎么搞定的——但我们会。而且我们大概会用本地 LLM——开源 LLM——来帮助对抗机器。我为这种荒谬道歉。就像我说的——Nathan 已经知道——我很长时间以来一直是他的粉丝。Sebastian——我也很长时间以来一直是你的粉丝——所以能终于见到你是莫大的荣幸。感谢你们为世界带来的一切。感谢你正在写的优秀书籍。感谢你们的教学。感谢今天的对话。这很有趣。
**Lex Fridman:** It's hard to explain how we figure it out, but we do. And we'll proba bly be using local LLMs, open source LLMs to help fight the machines. I apologize for the ridiculousness. Like I said, Nathan already knows I've been a big fan of his for a long time. Been a big fan of yours, Sebastian, for a long time, so it's an honor to finally meet you.
**Sebastian Raschka:** 感谢你邀请我们来——有这种人类的连接——这其实——
**Sebastian Raschka:** Thank you for everything you put out into the world. Thank you for the excellent books you're writing. Thank you for teaching us.
**Nathan Lambert:** 极其有价值——人类的连接。
**Nathan Lambert:** And thank you for talking today. This was fun . - Thank you for inviting us here and having this human connection, which is actually- - Extremely valuable- human connection. Thanks for listening t o this conversation with Sebastian Raschka and Nathan Lambert.
**Lex Fridman:** 感谢收听这次与 Sebastian Raschka 和 Nathan Lambert 的对话。如需支持本播客——请查看描述中的赞助商信息——那里也有联系我、提问和反馈的链接。现在让我以 Albert Einstein 的一段话作为结束。"不是因为我多么聪明——而是我在问题上停留得更久。"感谢收听——希望下次再见。
**Lex Fridman:** To support this podcast, please check out our sponsors in the description, where you ca n also find links to contact me, ask questions, give feedback and so on. And now let me leave you with some words from Albert Einstein. "It is not tha t I'm so smart, but I stay with the questions much longer." Thank you for listening, and hope to see you next time.