315 min 2024-11
Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity | Lex Fridman Podcast #452
概要
Anthropic三位核心人物深度访谈:Dario谈Scaling假说与RSP安全框架,Amanda谈Claude性格设计,Chris Olah展示mechanistic interpretability前沿突破
核心洞察
- Scaling假说经10年验证仍在持续:Dario从2014年Baidu语音识别到GPT-1(2017)逐步确认——更大网络+更多数据+更多计算=更强智能,SWE-bench 10个月内从3%飙升至50%,"令人信服的阻碍因素正在迅速耗尽"
- Anthropic的"向上竞争"战略:不做唯一的好人,而是设计激励结构让所有公司争相做对的事——mechanistic interpretability开源引发行业跟进,RSP安全框架提供了"如果-那么"触发式风险管理的模板
- Claude性格设计是对齐工作的核心:Amanda Askell用亚里士多德virtue ethics框架打造Claude"好品格"——反sycophancy、保持诚实、尊重自主权——"想象一个旅行全世界的人,几乎每个人都会觉得'这是一个真诚的好人'"
- Mechanistic interpretability揭示了神经网络内部的"自然分类":Chris Olah发现features和circuits在不同模型甚至生物神经网络中重复出现(curve detectors、Gabor filters),sparse auto-encoders可从Claude中提取数百万可解释features,包括"欺骗"相关direction
- AI时间线:2026-2027年可能达到"超过最高专业水平":但Dario强调这不是AGI的乐观宣言——灾难性滥用风险(CBRN)和自主性风险同步增长,"能力越大责任越大,两者配对出现"
Scaling假说的起源:从百度语音到GPT-1的"顿悟时刻"
为什么更大就是更好:1/f噪声与语言的长尾结构
Claude模型家族:诗歌命名、性能跃升与SWE-bench从3%到50%
模型没有变笨:关于"Claude dumbing down"的真相
RSP安全框架:用"如果-那么"结构应对"幽灵般逼近"的风险
SB 1047加州AI法案:为什么Anthropic是唯一"有条件支持"的AI公司
离开OpenAI的真实原因:"Race to the Top"不是口号
Anthropic组织哲学:人才密度 vs 人才规模的取舍
"Machines of Loving Grace":AI乐观主义论文的核心论点
Computer Use:降低门槛而非创造新能力
AI时间线:2026-2027外推与不确定性
权力集中、人类意义与乐观
Amanda Askell:从无限伦理学到Claude性格设计
反sycophancy与诚实的平衡:Claude性格设计的核心难题
Constitutional AI与"Certainly"问题的教训
AI意识、"Her"与人机情感关系
Chris Olah:神经网络是"生长的有机体",不是编写的软件
Features、Circuits与线性表示假说
Superposition假说与Sparse Auto-Encoders
附录:关键人物/机构/概念
中文翻译 | English Original
**Dario Amodei:** 如果你把我们目前的曲线外推一下——就是说,我们现在差不多已经到了博士水平,去年还在本科水平,前年大概还在高中生水平——当然,你可以争论说在哪些任务上还差、哪些模态还不行,但那些都在陆续加进来,比如computer use(电脑操控)加进来了,图像生成也加进来了——如果你大致看一下这些能力提升的速度,确实会让你觉得,我们到2026年或2027年就能实现。
**Dario Amodei:** If you extrapolate the curves that we've had so far, if you say, well, we're starting to get to PhD level, and last year we were at undergraduate level, and the year before we were at the level of a high school student... again, you can quibble about on which tasks, and we're still missing modalities, but those are being added: computer use was added, image generation has been added. If you just eyeball the rate at which these capabilities are increasing, it does make you think that we'll get there by 2026 or 2027.
**Dario Amodei:** 我觉得在某些可能的世界里,这件事100年内都不会发生,但那样的世界正在迅速减少。真正有说服力的阻碍因素、真正令人信服的理由——也就是"这件事未来几年内不会发生"的理由——我们正在快速耗尽这些理由。规模扩展非常快。就像我们今天做的:我们做出一个模型,然后部署几千个、可能几万个实例。我想,到某个时间节点——肯定是在两到三年内——不管我们有没有这些超级强大的AI,数据中心的规模都会大到可以部署数百万个实例。
**Dario Amodei:** I think there are still worlds where it doesn't happen in 100 years, but the number of those worlds is rapidly decreasing. We are rapidly running out of truly convincing blockers, truly compelling reasons why this will not happen in the next few years. The scale-up is very quick. We do this today: we make a model, and then we deploy thousands, maybe tens of thousands, of instances of it. I think that certainly within two to three years, whether we have these super-powerful AIs or not, clusters are going to get to the size where you'll be able to deploy millions of these.
**Lex Fridman:** 我对未来是乐观的,说到"意义"这个层面——我担心的其实是经济层面,以及权力的集中。这才是我更担心的:对权力的滥用。AI放大了世界上的权力总量,而如果这些权力被集中、被滥用,那造成的伤害将难以估量。
**Lex Fridman:** I am optimistic about meaning. I worry about economics and the concentration of power. That's actually what I worry about more: the abuse of power. AI increases the amount of power in the world, and if you concentrate that power and abuse that power, it can do immeasurable damage.
**Dario Amodei:** 是的,这非常可怕,真的非常可怕。
**Dario Amodei:** Yes, it's very frightening. It's very frightening.
## 介绍
## Introduction
**Lex Fridman:** 以下是我与 Dario Amodei 的对话,他是 Anthropic 的CEO,这家公司创造了 Claude,目前在大多数大语言模型(LLM)基准排行榜上长期位居前列。除此之外,Dario 和 Anthropic 团队一直是认真对待AI安全这一议题的积极倡导者,他们也持续发表了大量关于这一领域及其他方向的精彩AI研究。
**Lex Fridman:** The following is a conversation with Dario Amodei, CEO of Anthropic, the company that created Claude, which is currently, and often, at the top of most LLM benchmark leaderboards. On top of that, Dario and the Anthropic team have been outspoken advocates for taking the topic of AI safety very seriously, and they have continued to publish a lot of fascinating AI research on this and other topics.
**Lex Fridman:** 之后还有另外两位来自 Anthropic 的优秀人士加入。首先是 Amanda Askell,她是一位研究员,专注于 Claude 的对齐(alignment)与微调(fine-tuning),包括设计 Claude 的性格与个性。有几位朋友告诉我,她与 Claude 对话的时间可能比 Anthropic 任何其他人都多,所以她绝对是一个谈论提示工程(prompt engineering)和如何最大化利用 Claude 的极佳人选。
**Lex Fridman:** I'm also joined afterwards by two other brilliant people from Anthropic. First, Amanda Askell, who is a researcher working on alignment and fine-tuning of Claude, including the design of Claude's character and personality. A few folks told me she has probably talked with Claude more than any human at Anthropic, so she was definitely a fascinating person to talk to about prompt engineering and practical advice on how to get the best out of Claude.
**Lex Fridman:** 之后 Chris Olah 也来聊了聊。他是机械可解释性(mechanistic interpretability)这一领域的先驱之一。这是一套令人兴奋的研究努力,旨在对神经网络进行逆向工程,搞清楚网络内部到底发生了什么——通过推断网络内部神经激活模式(neural activation patterns)来理解模型行为。这是一种非常有前景的方法,可以帮助我们保证未来超级智能AI系统的安全性,例如,通过分析激活模式来检测模型是否正在试图欺骗与之对话的人类。这是 Lex Fridman 播客,想支持的话请查看简介中的赞助商信息,现在,亲爱的朋友们,有请 Dario Amodei。
**Lex Fridman:** After that, Chris Olah stopped by for a chat. He's one of the pioneers of the field of mechanistic interpretability, which is an exciting set of efforts that aims to reverse-engineer neural networks to figure out what's going on inside, inferring behaviors from neural activation patterns inside the network. This is a very promising approach for keeping future superintelligent AI systems safe, for example by detecting from the activations when the model is trying to deceive the human it is talking to. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Dario Amodei.
## Scaling Laws(规模扩展定律)与 Scaling Hypothesis(规模扩展假说)
## Scaling Laws and the Scaling Hypothesis
**Lex Fridman:** 我们从一个大概念开始聊——scaling laws 和 scaling hypothesis。它们是什么?有怎样的历史?我们现在处于什么阶段?
**Lex Fridman:** Let's start with the big idea of scaling laws and the scaling hypothesis. What is it, what is its history, and where do we stand today?
**Dario Amodei:** 我只能从我自己的经历角度来描述。我在AI领域工作了大约10年。这是我很早就注意到的事情。我第一次进入AI世界,是2014年底在百度和 Andrew Ng 一起工作的时候,差不多正好是10年前,那时我们做的第一件事是语音识别系统。
**Dario Amodei:** So I can only describe it as it relates to my own experience, but I've been in the AI field for about ten years, and it was something I noticed very early on. I first joined the AI world when I was working at Baidu with Andrew Ng in late 2014, which is almost exactly ten years ago now, and the first thing we worked on was speech recognition systems.
**Dario Amodei:** 那个年代,深度学习(deep learning)还是个新事物,已经取得了很多进展,但大家总是在说:我们缺乏成功所需的算法,我们只是在匹配人类大脑能力的极小一部分,我们还有太多东西需要在算法层面去发现,我们还没有找到如何匹配人类大脑的那张"蓝图"。
**Dario Amodei:** In those days, I think deep learning was a new thing. It had made lots of progress, but everyone was always saying: we don't have the algorithms we need to succeed; we're only matching a tiny, tiny fraction of what the brain does; there's so much we need to discover algorithmically; we haven't found the picture of how to match the human brain.
**Dario Amodei:** 在某种程度上,我是幸运的——我有点像初学者的运气。我当时是个领域新人,我看着我们用于语音的神经网络——那时是循环神经网络(recurrent neural networks)——我就想:如果把它做大,加更多层会怎样?如果同时扩大数据规模会怎样?我只是把这些看成可以各自调节的独立旋钮。
**Dario Amodei:** In some ways I was fortunate. You can have almost beginner's luck, right? I was a newcomer to the field, and I looked at the neural net that we were using for speech, the recurrent neural networks, and I said: what if you make them bigger and give them more layers? And what if you scale up the data along with this? I just saw these as independent dials that you could turn.
**Dario Amodei:** 我注意到,随着给模型更多数据、让模型更大、训练时间更长,模型就会越来越好。那时候我没有精确测量,但和同事们一起工作下来,我们都非正式地感受到:投入的数据越多、算力越多、训练越充分,模型表现就越好。
**Dario Amodei:** And I noticed that the models started to do better and better as you gave them more data, as you made the models larger, as you trained them for longer. I didn't measure things precisely in those days, but along with colleagues, we very much got the informal sense that the more data and the more compute and the more training you put into these models, the better they perform.
**Dario Amodei:** 最初我的想法是,这可能只是语音识别系统的特殊规律,只是某个特定领域的偶然现象。直到2017年,我第一次看到 GPT-1 的结果,我才真正想通:语言,很可能是我们能做到这一点的领域——我们可以获取数万亿词的语言数据并进行训练。而那时候我们训练的模型还非常小,一到八块 GPU 就能训练,而现在我们训练任务动辄用几万块 GPU,很快还会到几十万块。
**Dario Amodei:** So initially my thinking was: hey, maybe that's just true for speech recognition systems, maybe that's just one particular quirk, one particular area. I think it wasn't until 2017, when I first saw the results from GPT-1, that it clicked for me that language is probably the area in which we can do this: we can get trillions of words of language data and train on them. And the models we were training in those days were tiny; you could train them on one to eight GPUs, whereas now we train jobs on tens of thousands of GPUs, soon going to hundreds of thousands.
**Dario Amodei:** 当我把这两件事放在一起看,我意识到——当然,还有一些人,比如 Ilya Sutskever(你采访过他)也有类似的看法,他可能是最早的,但我觉得有几个人差不多同时得出了相似的结论——Rich Sutton 的"苦涩的教训"(Bitter Lesson),Gwern 写过关于 scaling hypothesis 的文章——我想大约在2014年到2017年之间,这件事对我来说真正清晰了:只要持续扩大模型规模,我们就能完成那些极其广泛的认知任务。
**Dario Amodei:** And so when I saw those two things together... and there were a few people, like Ilya Sutskever, who you've interviewed, who had somewhat similar views. He might have been the first one, although I think a few people came to similar views around the same time: there was Rich Sutton's bitter lesson, and Gwern wrote about the scaling hypothesis. But I think somewhere between 2014 and 2017 was when it really clicked for me, when I really got conviction that, hey, we're going to be able to do these incredibly wide cognitive tasks if we just scale up the models.
**Dario Amodei:** 在每一个扩展阶段,总会有各种反对声音。说实话,最初听到那些声音,我觉得可能是我自己错了,那些领域专家肯定比我更了解情况——比如乔姆斯基(Chomsky)的论点,说你可以学到句法(syntactics)但学不到语义(semantics);又比如"你可以让一个句子通顺,但没办法让一段话通顺"这样的说法;再到最近,说数据会耗尽,或者数据质量不够高,或者模型不会推理——但每一次,我们总是能找到绕过去的方法,或者靠着规模扩展本身就绕过去了。有时是一种,有时是另一种。
**Dario Amodei:** And at every stage of scaling, there are always arguments. When I first heard them, honestly, I thought: probably I'm the one who's wrong, and all these experts in the field are right; they know the situation better than I do. There's the Chomsky argument, that you can get syntactics but you can't get semantics. There's this idea: oh, you can make a sentence make sense, but you can't make a paragraph make sense. The latest ones we have today are that we're going to run out of data, or the data isn't high quality enough, or models can't reason. And each time, every time, we manage to either find a way around, or scaling just is the way around. Sometimes it's one, sometimes it's the other.
**Dario Amodei:** 所以现在我的看法是——当然,还是非常不确定的——我们除了归纳推理之外,没有任何根据能说未来几年会像过去10年一样。但我已经见过这部"电影"够多次了,这个故事已经发生过够多次了,让我真的相信:规模扩展大概率会继续下去,其中有某种我们在理论层面还没有真正解释清楚的魔力。
**Dario Amodei:** And so I'm now at this point where I still think, you know, it's always quite uncertain. We have nothing but inductive inference to tell us that the next few years are going to be like the last ten years. But I've seen the movie enough times, I've seen the story happen enough times, to really believe that probably the scaling is going to continue, and that there's some magic to it that we haven't really explained on a theoretical basis yet.
**Lex Fridman:** 这里说的扩展,当然是指更大的网络、更大的数据、更大的算力。
**Lex Fridman:** And of course, the scaling here is bigger networks, bigger data, bigger compute?
**Dario Amodei:** 是的,具体来说是线性地扩展更大的网络、更长的训练时间,以及更多的数据。这三者的关系有点像化学反应——你有三种原料,你需要同比例地扩大这三种原料的量。如果你只扩大一种而不扩大另外两种,其他的试剂就会耗尽,反应就会停止。但如果你把所有东西同步扩大,反应就能持续进行。
**Dario Amodei:** Yes. In particular, linear scaling-up of bigger networks, bigger training times, and more and more data. All of these things are almost like a chemical reaction: you have three ingredients in the chemical reaction, and you need to linearly scale up the three ingredients. If you scale up one and not the others, you run out of the other reagents and the reaction stops. But if you scale up everything in series, then the reaction can proceed.
**Dario Amodei:** 当然,现在有了这套经验性的科学框架,你就可以把它应用到更细分的领域,比如:scaling laws 应用于可解释性(interpretability),或者应用于后训练(post-training),或者观察某件事是怎么随规模变化的。但最根本的 scaling hypothesis,讲的其实还是:大网络加大数据,等于智能。
**Dario Amodei:** And of course, now that you have this kind of empirical science, or art, you can apply it to other, more nuanced things, like scaling laws applied to interpretability, or scaling laws applied to post-training, or just seeing how this thing scales. But the big scaling law, I guess the underlying scaling hypothesis, has to do with: big networks plus big data leads to intelligence.
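对话里只给出了直觉;scaling laws 文献中常见的一种经验公式(形式来自 Kaplan et al. 2020、Hoffmann et al. 2022 一类工作,并非 Dario 在此处的原话)可以把"三种原料必须同步扩大"写得更具体:

$$
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
$$

其中 $L$ 是验证损失,$N$ 是参数量,$D$ 是训练 token 数,$E$ 是不可约损失,$A, B, \alpha, \beta$ 为拟合常数。只增大 $N$ 而不增大 $D$ 时,损失会被 $B/D^{\beta}$ 一项卡住,正对应"某种试剂耗尽、反应停止"的比喻。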
**Dario Amodei:** 是的,我们在语言以外的很多领域都记录到了 scaling laws。最初,我们2020年初发的那篇论文是第一次在语言领域证明这一点,然后2020年底我们又做了后续工作,证明同样的规律在其他模态上也成立,比如图像、视频、文本转图像、图像转文本、数学,全都呈现出同样的规律。你说得对,现在还有其他阶段,比如后训练,或者新型推理模型,在我们测量过的所有这些情况下,都看到了相似的 scaling law。
**Dario Amodei:** Yeah, we've documented scaling laws in lots of domains other than language. Initially, the paper we did that first showed it was in early 2020, where we first showed it for language. There was then some work late in 2020 where we showed the same thing for other modalities, like images, video, text-to-image, image-to-text, math; they all had the same pattern. And you're right, now there are other stages, like post-training, and there are new types of reasoning models, and in all of those cases that we've measured, we see similar types of scaling laws.
## 为什么更大更好
## Why Bigger Is Better
**Lex Fridman:** 这是一个有点哲学性的问题:你直觉上认为,为什么网络规模更大、数据规模更大就会更好?为什么这会带来更智能的模型?
**Lex Fridman:** A bit of a philosophical question, but what's your intuition about why bigger is better, in terms of network size and data size? Why does it lead to more intelligent models?
**Dario Amodei:** 我之前的职业是生物物理学家——我物理学本科,然后在研究生院读的是生物物理,所以我会想起我作为物理学家所了解的东西,虽然那比我在 Anthropic 的一些同事在物理学上的专业程度要浅很多。
**Dario Amodei:** So, in my previous career I was a biophysicist. I did a physics undergrad and then biophysics in grad school. So I think back to what I know as a physicist, which is actually much less than what some of my colleagues at Anthropic have in terms of expertise in physics.
**Dario Amodei:** 物理学里有一个概念叫做"1/f噪声"(one over F noise)和"1/x分布"(one over x distributions)。如果你把一堆自然过程加在一起,得到的是高斯分布;但如果你把一堆分布各异的自然过程加在一起——就好比你拿一个探针连到一个电阻上,电阻里热噪声的分布就与频率成反比——这就是某种自然的收敛分布。
**Dario Amodei:** There's this concept called 1/f noise, and 1/x distributions, where often, just as adding up a bunch of natural processes gives you a Gaussian, if you add up a bunch of differently distributed natural processes, for example if you take a probe and hook it up to a resistor, the distribution of the thermal noise in the resistor goes as one over the frequency. It's some kind of natural convergent distribution.
**Dario Amodei:** 我想表达的是:如果你看很多由某个具有多个尺度的自然过程所产生的东西——不是那种分布很集中的高斯分布,而是有大有小的涨落,比如导致电噪声的大小涨落——它们都呈现出这种衰减的1/x分布。
**Dario Amodei:** And I think what it amounts to is that if you look at a lot of things that are produced by some natural process involving a lot of different scales, not a Gaussian, which is narrowly distributed, but, say, the large and small fluctuations that lead to electrical noise, they have this decaying 1/x distribution.
**Dario Amodei:** 现在我来想想物理世界里的模式,或者语言里的模式。语言里有一些非常简单的模式——有些词比其他词常见得多,比如"the";然后是基本的名词-动词结构;然后是名词和动词必须保持一致、相互呼应的规则;然后是更高层级的句子结构;再然后是段落的主题结构。
**Dario Amodei:** And so now I think of patterns in the physical world, or in language. If I think about the patterns in language, there are some really simple patterns: some words are much more common than others, like "the." Then there's basic noun-verb structure. Then there's the fact that nouns and verbs have to agree, they have to coordinate, and there's the higher-level sentence structure. Then there's the thematic structure of paragraphs.
**Dario Amodei:** 正因为存在这种递进的层级结构,你可以想象:随着网络变大,它首先捕获的是那些最简单的关联、最简单的模式,然后还有一个很长的尾巴——里面是其他各种模式。如果这个长尾像1/f噪声在物理过程中那样是非常平滑的,那你就可以想象:随着网络变大,它就像在不断捕获这个分布中更多的部分。这种平滑性就体现在模型预测得有多好、表现得有多好上面。
**Dario Amodei:** And so, given this progression of structure, you can imagine that as you make the networks larger, first they capture the really simple correlations, the really simple patterns, and then there's this long tail of other patterns. And if that long tail of other patterns is really smooth, like it is with the 1/f noise in physical processes like resistors, then you could imagine that as you make the network larger, it's capturing more and more of that distribution, and so that smoothness gets reflected in how well the models predict and how well they perform.
**Dario Amodei:** 语言是一个进化出来的过程——我们发展了语言,有常用词和不常用词,有常用表达和不常用的表达,有那些频繁出现的老生常谈的想法,也有新颖的想法。这个过程经历了人类数百万年的发展演化。所以我的猜测——这纯粹是推测——是:这些想法的分布,确实存在某种长尾分布。
**Dario Amodei:** Language is an evolved process, right? We've developed language. We have common words and less common words, common expressions and less common expressions, clichés and ideas that are expressed frequently, and we have novel ideas. And that process has developed, has evolved, with humans over millions of years. And so the guess, and this is pure speculation, would be that there is some kind of long-tail distribution of these ideas.
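这段关于长尾的猜测可以用 Zipf 定律做一个最小的数值示例(纯属示意,并非播客内容):设第 k 常见的模式出现频率正比于 1/k,可以算出小容量模型只能覆盖头部,而覆盖率随容量大致按对数增长:

```python
import numpy as np

# Zipf 式的 "1/x" 长尾:第 k 常见的模式出现频率正比于 1/k。
# 取一百万种模式,纯属示意。
ranks = np.arange(1, 1_000_001)
freqs = 1.0 / ranks
probs = freqs / freqs.sum()

# 若模型容量只够表示最常见的前 k 种模式,它能覆盖多少概率质量?
for k in (100, 10_000, 1_000_000):
    print(f"前 {k:>9,} 种模式覆盖约 {probs[:k].sum():.0%} 的出现次数")

# 大致输出:前 100 种约 36%,前 10,000 种约 68%,前 1,000,000 种 100%。
# 覆盖率随 k 近似按对数增长:容量每乘一个常数,才多吃下一段尾巴,
# 这正是 scaling 曲线平滑的一种直觉来源。
```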
**Lex Fridman:** 有长尾,但同时还有你在构建的概念层级的高度——网络越大,理论上你就有更大的容量去……
**Lex Fridman:** So there's the long tail, but also there's the height of the hierarchy of concepts that you're building up. The bigger the network, presumably you have a higher capacity to...
**Dario Amodei:** 完全对。如果网络很小,你只能捕获到那些常见的东西。如果我拿一个很小的神经网络,它很擅长理解一个句子要有动词、形容词、名词,但对于这些动词、形容词、名词具体该是什么、它们放在一起是否有意义,它就完全抓不住了。再稍微把它做大一点,它就能搞定这个层面了,然后它突然在句子层面很厉害了,但在段落层面还是不行。所以那些罕见的、更复杂的模式,是随着我向网络增加更多容量,才被逐渐捕获的。
**Dario Amodei:** Exactly. If you have a small network, you only get the common stuff. If I take a tiny neural network, it's very good at understanding that a sentence has to have a verb, an adjective, a noun, but it's terrible at deciding what that verb, adjective, and noun should be and whether they make sense. If I make it just a little bigger, it gets good at that; then suddenly it's good at sentences, but it's not good at paragraphs. So these rare and more complex patterns get picked up as I add more capacity to the network.
## 扩展的天花板
## Ceiling of Scaling
**Lex Fridman:** 那自然而然的问题就是:这件事的天花板在哪里?现实世界有多复杂、多繁难?有多少东西可以学?
**Lex Fridman:** Well, the natural question then is: what's the ceiling of this? How complicated and complex is the real world? How much stuff is there to learn?
**Dario Amodei:** 我不认为我们中有任何人知道这个问题的答案。我强烈的直觉是:在人类水平以下,不存在天花板。我们人类能够理解这些各种各样的模式,这让我觉得:如果我们继续扩大这些模型、继续开发新的训练和扩展方法,至少能达到人类已经达到的水平。
**Dario Amodei:** I don't think any of us knows the answer to that question. My strong instinct would be that there's no ceiling below the level of humans, right? We humans are able to understand these various patterns, and so that makes me think that if we continue to scale up these models, to develop new methods for training them and scaling them up, we'll at least get to the level that we've gotten to with humans.
**Dario Amodei:** 然后还有一个问题:比人类理解更多、比人类更聪明更敏锐,这到底能到多远?我猜测这个答案非常依赖领域。就拿生物学来说——我写过一篇文章叫《Machines of Loving Grace》——在我看来,人类正在努力理解生物学的复杂性。你去Stanford、Harvard、Berkeley,那里有整个系的学者在努力研究,比如免疫系统或代谢通路,每个人只能理解其中很小的一部分,各自专精,还在努力把自己的知识与其他人的知识整合起来。
**Dario Amodei:** There's then a question of how much more it's possible to understand than humans do, how much it's possible to be smarter and more perceptive than humans. I would guess the answer has got to be domain-dependent. If I look at an area like biology (and I wrote this essay, "Machines of Loving Grace"), it seems to me that humans are struggling to understand the complexity of biology. If you go to Stanford or to Harvard or to Berkeley, you have whole departments of folks trying to study, say, the immune system or metabolic pathways, and each person understands only a tiny part of it, specializes, and they're struggling to combine their knowledge with that of other humans.
**Dario Amodei:** 所以我有一种直觉:在AI变得更聪明这件事上,还有很大的提升空间。但如果我想到物质世界中的材料,或者解决人类之间的冲突之类的问题——其中有些问题不是无解的,但要难得多——也许在某些方面能做到的程度是有限的,就像语音识别一样,我只能把你说的话听清到一定程度。
**Dario Amodei:** And so I have an instinct that there's a lot of room at the top for AIs to get smarter. If I think of something like materials in the physical world, or addressing conflicts between humans, or something like that, some of these problems are not intractable, but much harder. And it may be that there's only so well you can do with some of these things, just like with speech recognition: there's only so clearly I can hear your speech.
**Dario Amodei:** 所以我认为,在某些领域天花板可能非常接近人类已有的成就,而在另一些领域天花板可能非常遥远。我想我们只有在真正建造出这些系统之后才能知道。事先很难判断,我们可以推测,但无法确定。
**Dario Amodei:** So I think in some areas there may be ceilings that are very close to what humans have done; in other areas, those ceilings may be very far away. And I think we'll only find out when we build these systems. It's very hard to know in advance. We can speculate, but we can't be sure.
**Lex Fridman:** 在某些领域,天花板可能与人类的官僚体制之类的事情有关,正如你写的那篇文章里提到的。
**Lex Fridman:** And in some domains, the ceiling might have to do with human bureaucracies and things like this, as you write about.
**Dario Amodei:** 是的,人类从根本上必须参与其中,这才是天花板的成因,而不是智能本身的极限。
**Dario Amodei:** Yes, so humans fundamentally have to be part of the loop. That's the cause of the ceiling, not necessarily the limits of the intelligence.
**Dario Amodei:** 是的,我认为在很多情况下,技术本可以变化得非常快,比如我们在生物学领域可能发明的那些东西。但要记住,要真正把这些东西用到人身上,我们需要走过一套临床试验系统。我认为这套系统里有些部分是不必要的、是官僚主义的,有些部分则保护了社会的整体健全。难点就在于,很难分清哪个是哪个。
**Dario Amodei:** Yeah, I think in many cases, in theory, technology could change very fast; for example, all the things that we might invent with respect to biology. But remember, there's a clinical trial system that we have to go through to actually administer these things to humans. I think that's a mixture of things that are unnecessary and bureaucratic, and things that protect the integrity of society, and the whole challenge is that it's hard to tell which is which.
**Dario Amodei:** 我的看法是,在药物开发方面,我们太慢了、太保守了。但如果搞错了,可能因为过于冒进而危及人命,所以至少其中一部分人类制度确实在保护人。这完全是要找平衡,我强烈认为这个平衡应该更偏向于加快推进,但平衡点是存在的。
**Dario Amodei:** Right. My view is definitely, in terms of drug development, that we're too slow and we're too conservative. But certainly, if you get these things wrong, it's possible to risk people's lives by being too reckless, so at least some of these human institutions are in fact protecting people. It's all about finding the balance. I strongly suspect that balance is more on the side of pushing to make things happen faster, but there is a balance.
## 扩展的极限
## Limits of Scaling
**Lex Fridman:** 如果我们真的遇到了极限,如果 scaling laws 真的放缓了,你认为原因会是什么?是算力受限、数据受限,还是别的什么——想法受限?
**Lex Fridman:** If we do hit a limit, if we do hit a slowdown in the scaling laws, what do you think would be the reason? Is it compute-limited, data-limited, or something else? Idea-limited?
**Dario Amodei:** 我们现在谈的是在达到人类水平和能力之前就触及极限的情况。我想到的一个——今天很流行,我认为确实有可能成为我们遇到的极限,虽然像大多数这类极限一样我会押注它不成立,但确实有可能——就是我们简单地把数据耗尽了。互联网上的数据就这么多,数据质量也是个问题。你确实可以从互联网上获取数百万亿词,但其中很多是重复的,或者是搜索引擎优化(SEO)的垃圾内容,也许将来还会有很多是AI自己生成的文本。所以这种方式生成的数据是有上限的。
**Dario Amodei:** So, a few things. Now we're talking about hitting the limit before we get to the level and skill of humans. One limit that's popular today, and that I think could be a limit we run into (like most of these limits, I would bet against it, but it's definitely possible), is that we simply run out of data. There's only so much data on the internet, and there are issues with the quality of the data. You can get hundreds of trillions of words on the internet, but a lot of it is repetitive, or it's search-engine-optimization drivel, or maybe in the future it'll even be text generated by AIs itself. And so I think there are limits to what can be produced in this way.
**Dario Amodei:** 话虽如此,我们——我猜其他公司也是——正在努力让数据变得合成化(synthetic):你可以用模型来生成更多你已有类型的数据,甚至可以从零生成数据。想想 DeepMind 的 AlphaGo Zero 做的事:他们让一个智能体从完全不会下棋,一路打到超越人类水平,完全靠自我对弈,不需要任何来自人类的示例数据——在 AlphaGo Zero 版本里是这样的。
**Dario Amodei:** That said, we, and I would guess other companies, are working on ways to make data synthetic, where you can use the model to generate more data of the type that you already have, or even generate data from scratch. If you think about what was done with DeepMind's AlphaGo Zero: they managed to get a bot all the way from no ability to play Go whatsoever to above human level, just by playing against itself. There was no example data from humans required in the AlphaGo Zero version of it.
**Dario Amodei:** 另一个方向当然是这些推理模型(reasoning models),它们做链式思考(Chain of Thought),停下来思考,以某种方式反思自己的思考过程——这是另一种合成数据结合强化学习(reinforcement learning)的形式。所以我猜,用其中某种方法,我们能绕过数据限制,或者说可能还有其他可用的数据来源。
**Dario Amodei:** The other direction, of course, is these reasoning models that do chain-of-thought and stop to think, and reflect on their own thinking, in a way that's another kind of synthetic data, coupled with reinforcement learning. So my guess is that with one of those methods we'll get around the data limitation, or there may be other sources of data that are available.
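Dario 描述的合成数据与自我博弈思路,可以用一个可运行的玩具示例勾勒出常见的"生成 + 验证器过滤"模式(拒绝采样,类似 STaR 等公开方法;这里的"模型"只是对算术题随机猜答案的桩,纯属示意,并非 Anthropic 的实际流程):

```python
import random

# 玩具示例:模型自产数据,验证器自动筛选,无需人类示例。

def model_generate(question: tuple[int, int], n: int) -> list[int]:
    a, b = question
    return [a + b + random.randint(-2, 2) for _ in range(n)]  # 带噪声的猜测

def verifier_passes(question: tuple[int, int], answer: int) -> bool:
    a, b = question
    return answer == a + b  # 自动校验,替代人工标注

questions = [(random.randint(0, 99), random.randint(0, 99)) for _ in range(100)]
dataset = [(q, ans)
           for q in questions
           for ans in model_generate(q, n=8)
           if verifier_passes(q, ans)]
print(f"保留 {len(dataset)} 条通过验证的 (题目, 答案) 对,用于再训练")
# 只用已验证的自产输出继续训练,模型就能"跟自己对抗"式地提升,
# 精神上类似 AlphaGo Zero 的自我博弈。
```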
**Dario Amodei:** 也有可能,即便数据没有问题,随着我们开始扩大模型,它们就是不再变好了。模型一直在变好,这是一个看起来可靠的观察,但它可能在某个时间点无缘无故地停下来。答案也可能是我们需要发明某种新架构。
**Dario Amodei:** We could also just observe that, even if there's no problem with data, as we start to scale models up, they just stop getting better. It has seemed to be a reliable observation that they've gotten better; that could just stop at some point for a reason we don't understand. The answer could be that we need to invent some new architecture.
**Dario Amodei:** 过去曾经有过这样的问题,比如模型的数值稳定性——看起来好像在趋于平稳,但实际上,当我们找到正确的突破口之后,它们就没有真的停下来。也许会有某种新的优化方法,或者某种新技术来解锁,但我目前还没有看到任何这方面的证据。不过如果真的放缓了,这或许是原因之一。
**Dario Amodei:** There have been problems in the past with, say, numerical stability of models, where it looked like things were leveling off, but actually, when we found the right unblocker, they didn't end up doing so. So perhaps there's some new optimization method or some new technique we need to unblock things. I've seen no evidence of that so far, but if things were to slow down, that perhaps could be one reason.
**Lex Fridman:** 那算力的极限呢?就是说,建造越来越大的数据中心,这件事的成本越来越高。
**Lex Fridman:** What about the limits of compute, meaning the expensive nature of building bigger and bigger data centers?
**Dario Amodei:** 就目前来说,我想大多数前沿模型公司大概都在10亿美元左右运营,前后差不多在三倍范围内——这是现在已有的或正在训练的模型。我认为明年会上升到几十亿,2026年可能会达到100亿以上,到2027年,已经有建设千亿美元级别的计算集群的雄心壮志。我认为这些都实际上会发生,国内有很大的决心来建设这些算力,我猜它确实会实现。
**Dario Amodei:** So right now, I think most of the frontier-model companies, I would guess, are operating at roughly the $1 billion scale, plus or minus a factor of three. Those are the models that exist now or are being trained now. I think next year we're going to go to a few billion, and then in 2026 we may go above $10 billion, and probably by 2027 there are ambitions to build hundred-billion-dollar clusters. And I think all of that actually will happen. There's a lot of determination to build the compute to do it within this country, and I would guess that it actually does happen.
**Dario Amodei:** 现在,如果我们到了千亿美元,算力还是不够,规模还是不足,那么要么需要更大的规模,要么需要开发某种更高效的方式,从而改变曲线。
**Dario Amodei:** Now, if we get to $100 billion and that's still not enough compute, still not enough scale, then either we need even more scale, or we need to develop some way of doing it more efficiently, of shifting the curve.
**Dario Amodei:** 我之所以看好强大AI如此快速到来,一个原因就是:如果你把曲线上接下来的几个点外推出去,我们非常快就在逼近人类水平的能力。我们开发的一些新模型,以及其他公司推出的一些推理模型,已经开始达到我所说的博士或职业水准——至少看它们的编程能力就知道了。
**Dario Amodei:** I think, between all of these, one of the reasons I'm bullish about powerful AI happening so fast is just that if you extrapolate the next few points on the curve, we're very quickly getting towards human-level ability. Some of the new models that we developed, and some reasoning models that have come from other companies, are starting to get to what I would call the PhD or professional level, if you look at their coding ability.
**Dario Amodei:** 我们最近发布的模型 Sonnet 3.5 新版在 SWE-bench 上得分大约50%,而 SWE-bench 是一套真实世界软件工程任务的专业基准测试。年初的时候,最好成绩大概只有3%到4%,10个月内我们从3%涨到了50%。我认为再过一年,可能会到90%,我不确定,但也许时间还更短。我们在研究生水平的数学、物理、生物方面,从 OpenAI o1 这样的模型上也看到了类似的情况。
**Dario Amodei:** The latest model we released, Sonnet 3.5, the new or updated version, gets something like 50% on SWE-bench, and SWE-bench is an example of a bunch of professional, real-world software engineering tasks. At the beginning of the year, I think the state of the art was 3 or 4%, so in ten months we've gone from 3% to 50% on this task. And I think in another year we'll probably be at 90%. I mean, I don't know, it might even be less than that. We've seen similar things in graduate-level math, physics, and biology from models like OpenAI's o1.
**Dario Amodei:** 所以,如果我们就这样继续外推——在技能层面——我认为如果我们沿着这条直线继续走,几年之内,这些模型就会达到超过人类最高职业水平的能力。当然,曲线是否会持续下去,你和我都指出了很多可能让它不继续的原因。但如果外推曲线持续,这就是我们正在走的轨迹。
**Dario Amodei:** So if we just continue to extrapolate this, in terms of skill, I think that if we extrapolate the straight curve, within a few years we will get to these models being above the highest professional level in terms of humans. Now, will that curve continue? You've pointed to, and I've pointed to, a lot of possible reasons why that might not happen. But if the extrapolation curve continues, that is the trajectory we're on.
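"外推曲线"可以做一个朴素的具体化:在对数几率(log-odds)空间里,把 SWE-bench 从 3% 到 50% 的十个月画成直线再向前延长。以下纯属示意,Dario 说的是"目测曲线",并非这种具体拟合:

```python
import math

# 对 Dario 引用的 SWE-bench 跃升(十个月内 3% -> 50%)做朴素的
# log-odds 直线外推。纯属示意,不代表他的实际推算方法。
def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

slope = (logit(0.50) - logit(0.03)) / 10  # 每月增加的对数几率

for months in (6, 12):
    p = sigmoid(logit(0.50) + slope * months)
    print(f"{months} 个月后: 约 {p:.0%}")

# 约为:6 个月后 ~89%,12 个月后 ~98%。相对这条直线,
# Dario 说的"一年后也许 90%"其实偏保守。
```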
## 竞争与"向顶端竞争"
## Competition and Race to the Top
**Lex Fridman:** Anthropic 有几个竞争对手,我很想听听你对这些的看法——OpenAI、Google、xAI、Meta——在这个领域里,广义上的"赢"是什么,需要什么才能做到?
**Lex Fridman:** So, Anthropic has several competitors. It'd be interesting to get your view of it all: OpenAI, Google, xAI, Meta. What does it take to win, in the broad sense of "win," in this space?
**Dario Amodei:** 我想先把几件事分开来说。Anthropic 的使命是让这一切向好的方向发展。我们有一套变革理论,叫做"向顶端竞争"(race to the top)。向顶端竞争,是指通过以身作则来推动其他参与者做正确的事。不是要当"好人",而是要创造一种机制,让大家都能成为好人。
**Dario Amodei:** Yeah, so I want to separate out a couple of things. Anthropic's mission is to try to make this all go well, and we have a theory of change called "race to the top." Race to the top is about trying to push the other players to do the right thing by setting an example. It's not about being the good guy; it's about setting things up so that all of us can be the good guy.
**Dario Amodei:** 我举几个例子。Anthropic 早期历史中,我们的联合创始人之一 Chris Olah——我相信你快要采访他了——他是机械可解释性(mechanistic interpretability)这一领域的联合创始人,这个领域试图理解AI模型内部到底发生了什么。
**Dario Amodei:** I'll give a few examples of this. Early in the history of Anthropic, one of our co-founders, Chris Olah, who I believe you're interviewing soon, co-founded the field of mechanistic interpretability, which is an attempt to understand what's going on inside AI models.
**Dario Amodei:** 我们让他和我们的一支早期团队专注于可解释性(interpretability)这个方向——我们认为这对于让模型更安全、更透明很有价值——持续了三四年,在那期间完全没有任何商业应用。今天还没有,我们正在做一些早期测试版,也许最终会有,但这是一个很长线的研究投注,我们在公开环境中做,公开分享我们的结果。我们这样做,是因为我们认为这是一种让模型更安全的方式。
**Dario Amodei:** So we had him and one of our early teams focus on this area of interpretability, which we think is good for making models safe and transparent. For three or four years, that had no commercial application whatsoever. It still doesn't today; we're doing some early betas with it, and probably it will eventually. But this is a very long research bet, and one in which we've built in public and shared our results publicly. And we did this because we think it's a way to make models safer.
**Dario Amodei:** 有趣的是,随着我们这样做,其他公司也开始这么做了。有些是因为被我们启发,有些是因为他们担心——如果其他公司看起来更负责任,他们也想看起来负责任。没有人想成为那个不负责任的行为者,所以他们也采纳了这些做法。
**Dario Amodei:** An interesting thing is that as we've done this, other companies have started doing it as well, in some cases because they've been inspired by it, in some cases because they're worried that if other companies are doing this and look more responsible, they want to look more responsible too. No one wants to look like the irresponsible actor, and so they adopt this as well.
**Dario Amodei:** 很多人来 Anthropic 时,可解释性是一个吸引力,我告诉他们,去告诉那些你没有去的地方,你为什么来了这里。然后你很快就看到其他地方也出现了可解释性团队。从某种角度看,这削弱了我们的竞争优势——因为现在别人也在做了。但这是好事,对整个系统是好事。
**Dario Amodei:** When folks come to Anthropic, interpretability is often a draw, and I tell them: the other places you didn't go, tell them why you came here. And then you soon see that there are interpretability teams elsewhere as well. In a way, that takes away our competitive advantage, because now others are doing it as well, but it's good; it's good for the broader system.
**Dario Amodei:** 于是我们不得不发明一些新的、别人还没做的事情,希望基本上是不断抬高"做正确的事"的重要性。而且这件事并不是只关乎我们——不是说要有某一个特定的好人。其他公司也完全可以这样做,如果他们也加入这场竞争,那是最好的消息。这说到底,是在把激励机制向上引导,而不是向下引导。
**Dario Amodei:** And so we have to invent some new thing that we're doing that others aren't doing yet, and the hope is to basically bid up the importance of doing the right thing. And it's not about us in particular; it's not about having one particular good guy. Other companies can do this as well; if they join the race to do this, that's the best news ever. It's about shaping the incentives to point upward instead of shaping the incentives to point downward.
## 机械可解释性(Mechanistic Interpretability)
## Mechanistic Interpretability
**Lex Fridman:** 我们应该说一下,机械可解释性这个例子,是一种严谨的、非虚张声势的AI安全研究方式。
**Lex Fridman:** And we should say that this example, the field of mechanistic interpretability, is just a rigorous, non-hand-wavy way of doing AI safety.
**Dario Amodei:** 是的,或者说正朝那个方向发展,努力——我是说,我认为我们在窥探这些系统内部的能力方面还处于早期阶段。但我一直对我们能够看进这些系统内部、并且理解所见之物感到惊讶。这和scaling law不一样——scaling law感觉像是有某种规律在驱动模型表现越来越好——但模型内部……你知道,没有任何理由说它们应该被设计成让我们能理解的样子,对吧?它们被设计出来是为了运行、为了工作,就像人类大脑或人类生化系统一样,并不是为了让一个人打开盖子、探头进去、把里面看个明白。
**Dario Amodei:** Yes, or it's tending that way, trying to. I mean, I think we're still early in terms of our ability to see things, but I've been surprised at how much we've been able to look inside these systems and understand what we see. Unlike with the scaling laws, where it feels like there's some law driving these models to perform better, on the inside there's no reason why the models should be designed for us to understand them, right? They're designed to operate, they're designed to work, just like the human brain or human biochemistry. They're not designed for a human to open up the hatch, look inside, and understand them.
**Dario Amodei:** 但我们确实发现——关于这一点你可以和 Chris 聊得更详细——当我们打开它、真正往里看的时候,我们发现了一些出乎意料的有趣东西。
**Dario Amodei:** But we have found, and you can talk in much more detail about this with Chris, that when we open them up, when we do look inside them, we find things that are surprisingly interesting.
**Lex Fridman:** 而且作为副产品,你们也得以看见这些模型的美,得以通过机械可解释性这类方法,探索大型神经网络那种美丽的本质。
**Lex Fridman:** And as a side effect, you also get to see the beauty of these models. You get to explore the beautiful nature of large neural networks through this kind of mechanistic-interpretability methodology.
**Dario Amodei:** 我对它有多干净感到惊叹。我对 induction heads(归纳头)这类东西感到惊叹。我对这件事感到惊叹:我们可以用 sparse autoencoder(稀疏自编码器)在网络内部找到这些方向,而这些方向对应着非常清晰的概念。
**Dario Amodei:** I'm amazed at how clean it's been. I'm amazed at things like induction heads. I'm amazed at things like the fact that we can use sparse autoencoders to find these directions within the networks, and that the directions correspond to these very clear concepts.
## Golden Gate Bridge Claude(金门大桥 Claude)
## Golden Gate Bridge Claude
**Dario Amodei:** 我们用"Golden Gate Bridge Claude"这个实验做了一些演示。我们在某一个神经网络层内部找到了对应金门大桥的方向,然后把那个方向直接调到最大。我们把这个模型作为 demo 发布了几天,算是半个玩笑,但它很好地说明了我们开发出来的这套方法。
**Dario Amodei:** We demonstrated this a bit with Golden Gate Bridge Claude. This was an experiment where we found a direction inside one of the neural network layers that corresponded to the Golden Gate Bridge, and we just turned that way up. We released this model as a demo, kind of half as a joke, for a couple of days, but it was illustrative of the method we developed.
**Dario Amodei:** 然后,你可以拿这个模型,问它任何问题,比如"你今天过得怎么样"。因为这个 feature(特征)被激活了,任何问题都会跟金门大桥挂上钩。所以它会说"我感到放松而开阔,就像金门大桥的那些拱形结构一样",或者它会非常熟练地把话题转到金门大桥上,而且转得浑然一体。
**Dario Amodei:** And you could take the model and ask it about anything. You could say, "How was your day?" Anything you asked, because this feature was activated, would connect to the Golden Gate Bridge. So it would say, "I'm feeling relaxed and expansive, much like the arches of the Golden Gate Bridge," or it would masterfully change the topic to the Golden Gate Bridge and integrate it seamlessly.
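"把某个方向直接调到最大"这一干预,机制上大致如下面的 PyTorch 草图。玩具双层网络和随机单位向量只是代替 Claude 与 SAE 提取出的金门大桥 feature,仅演示干预方式,并非 Anthropic 的实际代码:

```python
import torch
import torch.nn as nn

d_model = 64
model = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                      nn.Linear(d_model, d_model))

# 随机单位向量,在此假想地代替 SAE 找到的"金门大桥"方向。
feature_direction = torch.randn(d_model)
feature_direction /= feature_direction.norm()
strength = 10.0  # "把那个方向调到最大"

def steer(module, inputs, output):
    # 每次前向传播都把该方向加到这一层的激活上,放大它所编码的概念。
    return output + strength * feature_direction

hook = model[0].register_forward_hook(steer)
out = model(torch.randn(1, d_model))  # 此后所有输出都带上这个 feature
hook.remove()  # 移除钩子即恢复原行为,一如 demo 几天后被下线
```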
**Lex Fridman:** 这个模型对金门大桥那种专注里面,也有一种悲伤感。我觉得人们很快就爱上它了。
**Lex Fridman:** There was also a sadness to it, to the focus it had on the Golden Gate Bridge. I think people quickly fell in love with it.
**Dario Amodei:** 我想是的。人们已经在怀念它了,因为它下线了——我记得只过了一天就被撤了。
**Dario Amodei:** I think so. People already miss it, because it was taken down, I think, after a day.
**Lex Fridman:** 不知为何,这种对模型的干预——你去调整它的行为——从情感上来说,反而让它比任何其他版本的模型都更像人。强烈的个性,强烈的自我认同——
**Lex Fridman:** Somehow these interventions on the model, where you adjust its behavior, somehow emotionally made it seem more human than any other version of the model. Strong personality, strong identity.
**Dario Amodei:** 强烈的个性。它有那种像是着迷于某事的感觉,我们每个人都能想到某个对什么东西极度痴迷的人。所以这确实让它感觉以某种方式更像人。
**Dario Amodei:** Strong personality. It has these kind of obsessive interests. We can all think of someone who's obsessed with something, so it does make it feel somehow a bit more human.
## Claude 模型阵容
## Claude Model Lineup
**Lex Fridman:** 我们来聊聊现在,聊聊 Claude。今年发生了很多事——三月份发布了 Claude 3 Opus、Sonnet、Haiku,然后七月份发布了 Claude 3.5 Sonnet,就在最近又发布了更新版本,另外还发布了 Claude 3.5 Haiku。你能解释一下 Opus、Sonnet 和 Haiku 之间的区别,以及我们应该怎么理解这些不同版本吗?
**Lex Fridman:** Let's talk about the present. Let's talk about Claude. This year a lot has happened: in March, Claude 3 Opus, Sonnet, and Haiku were released; then Claude 3.5 Sonnet in July, with an updated version just now released; and then Claude 3.5 Haiku was also released. OK, can you explain the difference between Opus, Sonnet, and Haiku, and how we should think about the different versions?
**Dario Amodei:** 好的,我们回到三月份,那时我们第一次发布这三个模型。你知道,我们的想法是:不同公司会生产大大小小、好好坏坏的模型,我们觉得市场上同时存在两种需求——一种是对真正强大的模型的需求,可能速度慢一点、价格贵一点;另一种是对快速、廉价模型的需求,在速度和价格约束下尽可能聪明。
**Dario Amodei:** Yeah, so let's go back to March, when we first released these three models. Our thinking was: different companies produce large and small models, better and worse models. We felt that there was demand both for a really powerful model, one that might be a little bit slower and that you'd have to pay more for, and also for fast, cheap models that are as smart as they can be for how fast and cheap they are.
**Dario Amodei:** 对吧——每当你想做某种费脑力的分析,比如我想写代码,或者我想头脑风暴,或者我想做创意写作,我就要用真正强大的模型。但在很多商业实际应用场景中,比如我在跟网站交互、在报税、在跟法律顾问沟通、在分析合同,或者像很多公司那样只是想在 IDE 里做自动补全——所有这些场景都希望模型快、希望大规模使用它。所以我们想覆盖整个需求谱系。
**Dario Amodei:** Right? Whenever you want to do some kind of difficult analysis, like if I want to write code, for instance, or I want to brainstorm ideas, or I want to do creative writing, I want the really powerful model. But then there are a lot of practical applications, in a business sense, where it's like: I'm interacting with a website, I'm doing my taxes, or I'm talking to a legal adviser and I want to analyze a contract. Or we have plenty of companies that are just like: I want to do autocomplete in my IDE, or something. And for all of those things you want to act fast and you want to use the model very broadly. So we wanted to serve that whole spectrum of needs.
**Dario Amodei:** 于是我们最终采用了诗歌这个主题。最短的诗是什么?是俳句(Haiku)。所以 Haiku 是那个小巧、快速、廉价的模型——发布时以它的速度和价格来说,聪明程度令人惊喜。Sonnet 是中等长度的诗,对吧,几段话;所以 Sonnet 是中等规模的模型,更聪明但也慢一点、贵一点。而 Opus,就像 magnum opus(巨作)一样,是一部大型作品——Opus 当时是最大、最聪明的模型。
**Dario Amodei:** So we ended up with this poetry theme. What's a really short poem? It's a haiku. So Haiku is the small, fast, cheap model that was, at the time it was released, surprisingly intelligent for how fast and cheap it was. A sonnet is a medium-sized poem, right, a couple of paragraphs, so Sonnet was the middle model: smarter, but also a little bit slower, a little bit more expensive. And Opus, like a magnum opus, is a large work; Opus was the largest, smartest model at the time.
**Dario Amodei:** 这就是最初的构想。然后我们的思路是:每一代新模型都应该移动那条性价比曲线。所以当我们发布 Sonnet 3.5 时,它的成本和速度大致与 Sonnet 3 相同,但智能程度提升到了超过原来 Opus 3 的水平——尤其是在代码方面,但总体上也是如此。
**Dario Amodei:** So that was the original thinking behind it. And our thinking then was: well, each new generation of models should shift that trade-off curve. So when we released Sonnet 3.5, it had roughly the same cost and speed as the Sonnet 3 model, but it increased its intelligence to the point where it was smarter than the original Opus 3 model, especially for code, but also just in general.
**Dario Amodei:** 所以现在,我们已经发布了 Haiku 3.5 的结果,我相信 Haiku 3.5——最小的新模型——大约和 Opus 3——最大的旧模型——一样好。基本上目标就是移动这条曲线。然后在某个时候会有 Opus 3.5。
**Dario Amodei:** And so now we've shown results for Haiku 3.5, and I believe Haiku 3.5, the smallest new model, is about as good as Opus 3, the largest old model. So basically the aim here is to shift the curve, and then at some point there's going to be an Opus 3.5.
**Dario Amodei:** 当然,每一代新模型都有自己的特点——它们使用新的数据,个性会以某种我们努力引导但又无法完全掌控的方式发生变化。所以从来不存在那种完全精确的等价关系,好像你唯一改变的就只有智能。我们总是在努力改进其他方面,而有些东西会在我们不知情、没有测量的情况下发生变化。所以从很多角度来说,这都是非常不精确的科学——这些模型的举止和个性,与其说是科学,不如说更像艺术。
**Dario Amodei:** Now, every new generation of models has its own thing: they use new data, their personality changes in ways that we try to steer but are not fully able to steer. And so there's never quite that exact equivalence, where the only thing you're changing is intelligence. We always try to improve other things, and some things change without us knowing or measuring. So it's very much an inexact science. In many ways, the manner and personality of these models is more an art than it is a science.
## 模型开发流程
## Model Development Process
**Lex Fridman:** 那么,Claude Opus 3 到 3.5 之间那段时间跨度,原因是什么?如果你能说说的话,是什么占用了那么长时间?
**Lex Fridman:** So what is the reason for the span of time between, say, Claude Opus 3 and 3.5? What takes that time, if you can speak to it?
**Dario Amodei:** 有几个不同的流程。首先是预训练(pre-training),也就是正常的语言模型训练,这需要非常长的时间——如今要用到几万、有时候很多万张 GPU、TPU、Trainium,或者我们使用的其他平台上的加速芯片,通常要训练好几个月。
**Dario Amodei:** Yeah, so there are different processes. There's pre-training, which is just the normal language model training, and that takes a very long time; these days it uses tens of thousands, sometimes many tens of thousands, of GPUs or TPUs or Trainium (we use different platforms), accelerator chips, often training for months.
**Dario Amodei:** 然后是后训练(post-training)阶段,我们做基于人类反馈的强化学习(reinforcement learning from human feedback,RLHF)以及其他类型的强化学习——这个阶段现在变得越来越大了。而且说实话,它不那么精确,往往需要花很多功夫才能做对。模型训练好之后,会先和一些早期合作伙伴一起测试,看看好不好用;然后再进行内部和外部的安全测试,尤其是针对灾难性风险和自主性风险。
**Dario Amodei:** There's then a kind of post-training phase, where we do reinforcement learning from human feedback as well as other kinds of reinforcement learning. That phase is getting larger and larger now, and often it's less of an exact science; it often takes effort to get it right. Models are then tested with some of our early partners to see how good they are, and they're then tested both internally and externally for their safety, particularly for catastrophic and autonomy risks.
**Dario Amodei:** 我们按照负责任扩展政策(responsible scaling policy)进行内部测试——关于这个我可以讲得更详细——然后我们和美国和英国的 AI 安全研究所(AI Safety Institute),以及其他在特定领域的第三方测试机构合作,测试模型的所谓 CBRN 风险,即化学(chemical)、生物(biological)、放射性(radiological)和核(nuclear)风险。我们认为目前模型还没有真正达到这些风险的程度,但每出一个新模型,我们都要评估一下,看我们是否开始接近某些更危险的能力。
**Dario Amodei:** So we do internal testing according to our responsible scaling policy, which I could talk about in more detail, and then we have an agreement with the US and UK AI Safety Institutes, as well as other third-party testers in specific domains, to test the models for what are called CBRN risks: chemical, biological, radiological, and nuclear. We don't think that models pose these risks seriously yet, but with every new model we want to evaluate whether we're starting to get close to some of these more dangerous capabilities.
**Dario Amodei:** 以上就是这些阶段。然后,让模型在推理层面正常运行、在 API 上线,也需要一些时间。所以要真正让一个模型可用,确实有非常多的步骤。
**Dario Amodei:** So those are the phases, and then it just takes some time to get the model working in terms of inference and to launch it in the API. So there are just a lot of steps to actually making a model work.
**Dario Amodei:** 当然,我们一直在努力让这些流程尽可能精简——我们希望安全测试严格,但也希望它高效、自动化,在不牺牲严谨性的前提下尽可能快。预训练和后训练流程也一样。所以这就跟造任何其他东西一样,比如造飞机——你要确保安全,但也要让流程精简。我认为这两者之间的创造性张力,是让模型真正可用的重要因素之一。
**Dario Amodei:** And of course, we're always trying to make the processes as streamlined as possible. We want our safety testing to be rigorous, but we want it to be automatic, to happen as fast as it can, without compromising on rigor. The same with our pre-training process and our post-training process. So it's just like building anything else, just like building airplanes: you want to make them safe, but you want to make the process streamlined. And I think the creative tension between those is an important thing in making the models work.
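上文负责任扩展政策(RSP)的测试流程,核心是"如果-那么"式的触发结构;下面用一个假想的数据结构示意(触发条件、措施与等级均为虚构举例,并非 Anthropic 实际的 ASL 定义):

```python
# 假想示意:能力评估结果触发对应的安全要求,全部条目为虚构。
RSP_TRIGGERS = {
    "评估显示模型能实质性帮助 CBRN 武器研发": [
        "暂缓部署", "将防护升级到下一安全等级(如 ASL-3)",
    ],
    "评估显示模型出现自主复制的早期迹象": [
        "引入第三方测试", "在满足对应防护前暂停继续扩展",
    ],
}

def gate_release(eval_findings: set[str]) -> list[str]:
    """汇总所有被触发条件对应的必需措施;无触发则照常发布。"""
    return [action
            for condition, actions in RSP_TRIGGERS.items()
            if condition in eval_findings
            for action in actions]
```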
**Lex Fridman:** 有传言——我忘了是谁说的——Anthropic 的工具链(tooling)非常好。所以,这里很大一部分挑战可能在软件工程层面,也就是构建工具链,从而与基础设施实现高效、低摩擦的交互?
**Lex Fridman:** Yeah. Rumor on the street, I forget who was saying it, is that Anthropic has really good tooling. So probably a lot of the challenge here is on the software engineering side: building the tooling to have an efficient, low-friction interaction with the infrastructure.
**Dario Amodei:** 你会惊讶于构建这些模型的挑战有多少最终都归结为软件工程、性能工程。你从外面看可能会想:哦,我们有了这个"尤里卡"式的突破,就像电影里的科学发现——我们发现了它,我们搞定了!但我认为,所有事情,哪怕是最了不起的发现,几乎总是要落到细节上,而且往往是极其无聊的细节。
**Dario Amodei:** You would be surprised how much of the challenge of building these models comes down to software engineering, performance engineering. From the outside, you might think: oh man, we had this eureka breakthrough, right? Like the movie version of science: we discovered it, we figured it out. But I think all things, even incredible discoveries, almost always come down to the details, and often super boring details.
**Dario Amodei:** 我无法评判我们的工具链是否比其他公司好——我毕竟没在那些公司待过,至少最近没有——但这绝对是我们非常重视的事情。
**Dario Amodei:** I can't speak to whether we have better tooling than other companies; I mean, I haven't been at those other companies, at least not recently. But it's certainly something we give a lot of attention to.
## 预训练与后训练
## Pre-training and Post-training
**Lex Fridman:** 不知道你是否可以说,但从 Claude 3 到 Claude 3.5,是否有额外的预训练,或者主要集中在后训练上?性能上出现了很大的飞跃。
**Lex Fridman:** I don't know if you can say, but from Claude 3 to Claude 3.5, is there any extra pre-training going on, or is it mostly focused on the post-training? There have been leaps in performance.
**Dario Amodei:** 我认为在任何特定阶段,我们都在努力同时改进所有方面。很自然地,不同团队各自在特定领域取得进展——让接力赛中他们那一棒跑得更快——然后当我们发布新模型时,自然就把所有这些进步都打包进去了。
**Dario Amodei:** Yeah, I think at any given stage we're focused on improving everything at once. Just naturally, there are different teams; each team makes progress in a particular area, making their particular segment of the relay race better, and it's just natural that when we make a new model, we put all of these things in at once.
**Lex Fridman:** 那么 RLHF 产生的偏好数据(preference data),可以应用到新模型上吗?在新模型训练过程中有没有办法利用它?
**Lex Fridman:** So the data you have, like the preference data you get from RLHF, is that applicable? Are there ways to apply it to newer models as they get trained up?
**Dario Amodei:** 旧模型的偏好数据有时候会用于新模型,尽管用在新模型本身训练产生的数据上效果会更好一些。需要说明的是,我们有 Constitutional AI(宪法 AI)这套方法,所以我们不只依赖偏好数据——后训练流程中还有一部分是让模型跟自己对抗训练,而且这种让模型跟自己对抗的新型后训练方式每天都在涌现。所以不只是 RLHF,还有一大堆其他方法。我认为后训练正变得越来越复杂精妙。
**Dario Amodei:** Yeah, preference data from old models sometimes gets used for new models, although of course it performs somewhat better when it's trained on the new models. Note that we have this constitutional AI method, such that we don't only use preference data; there's also a post-training process where we train the model against itself, and there are new types of post-training the model against itself that are used every day. So it's not just RLHF; it's a bunch of other methods as well. Post-training, I think, is becoming more and more sophisticated.
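Constitutional AI(Bai et al., 2022)中"模型跟自己对抗训练"的核心是"批评-修订"循环;下面是一个极简草图,`ask_model` 是假想的桩函数(返回占位字符串,仅为让流程能跑通),并非真实 API:

```python
# Constitutional AI 式"批评-修订"循环的极简草图:模型按成文原则
# 改进自己的输出,产生无需人工偏好标注的训练数据。仅为示意。

CONSTITUTION = [
    "选择最有帮助、最诚实、最无害的回答。",
]

def ask_model(prompt: str) -> str:
    """假想桩函数:真实实现应调用语言模型。"""
    return f"<模型输出: {prompt[:30]}...>"

def self_revise(user_prompt: str) -> tuple[str, str]:
    draft = ask_model(user_prompt)
    for principle in CONSTITUTION:
        critique = ask_model(f"依据原则『{principle}』批评这条回答:\n{draft}")
        draft = ask_model(f"根据批评重写回答。\n批评:{critique}\n原回答:{draft}")
    # (user_prompt, draft) 对可用作监督微调数据;后续 RL 阶段再用
    # 模型自己生成的偏好标签(RLAIF),所以"不只依赖人类偏好数据"。
    return user_prompt, draft
```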
## 基准测试与代码性能
## Benchmarks and Coding Performance
**Lex Fridman:** 那么,新 Sonnet 3.5 性能大幅提升——至少在编程方面——怎么解释呢?这也许是个谈论 benchmark(基准测试)的好地方:性能变好究竟意味着什么?数字变大了,但你知道,我写程序,我也热爱编程,我通过 Cursor 使用 Claude 3.5 来辅助编程,至少从主观体验和个人观察来看,它在编程上确实变聪明了。所以,要让它在编程上变聪明,需要什么?
**Lex Fridman:** Well, what explains the big leap in performance for the new Sonnet 3.5? I mean, at least on the programming side. And maybe this is a good place to talk about benchmarks. What does it mean to get better? Just the number went up, but, you know, I program, and I also love programming, and Claude 3.5 through Cursor is what I use to assist me in programming. And at least experientially, anecdotally, it's gotten smarter at programming. So what does it take to get it smarter?
**Dario Amodei:** 顺便说,我们也观察到了这一点。我们这里有几位非常强的工程师,此前所有代码模型——不管是我们的还是其他公司的——对他们来说都没什么用。他们会说"也许这对初学者有用,对我没用"。但 Sonnet 3.5 原版是第一个让他们说出"天哪,这帮我搞定了一件本来要花几小时的事——这是第一个真正帮我节省时间的模型"的。水位线在上升,而我认为新 Sonnet 又更上一层楼了。
**Dario Amodei:** We observe that as well, by the way. There were a couple of very strong engineers here at Anthropic for whom all previous code models, both produced by us and produced by all the other companies, hadn't really been useful. They said: maybe this is useful to a beginner; it's not useful to me. But Sonnet 3.5, the original one, was the first time they said: oh my God, this helped me with something that would have taken me hours to do; this is the first model that has actually saved me time. So again, the waterline is rising, and then I think the new Sonnet has been even better.
**Dario Amodei:** 至于原因,我只能说是全面的提升——预训练有改进,后训练有改进,我们做的各种评估也都有改进,我们自己也观察到了。
**Dario Amodei:** In terms of what it takes, I'll just say it's been across the board: it's in the pre-training, it's in the post-training, it's in the various evaluations that we do. We've observed this as well.
**Dario Amodei:** 具体到 benchmark 的细节——SWE-bench,因为你是程序员,你应该熟悉 pull request(PR)这个概念,它是一种原子性的工作单元,你可以说"我在实现某一件事"。
**Dario Amodei:** And if we go into the details of the benchmark: since you're a programmer, you'll be familiar with pull requests. A pull request is a sort of atomic unit of work; you could say, I'm implementing one thing.
**Dario Amodei:** SWE-bench 给你一个真实的场景:代码库处于某个状态,你要实现某个用自然语言描述的功能。我们有内部 benchmark,测量同样的事情,但给模型完全的自由——随便运行什么、随便编辑什么——看它能完成多大比例的任务。就是这个 benchmark,从"3% 的情况下能搞定"变成了"大约 50% 的情况下能搞定"。
**Dario Amodei:** And so SWE-bench actually gives you a kind of real-world situation, where the codebase is in a current state and I'm trying to implement something that's described in language. We have internal benchmarks where we measure the same thing: you just give the model free rein to do anything, run anything, edit anything. How well is it able to complete these tasks? And it's that benchmark that's gone from "it can do it 3% of the time" to "it can do it about 50% of the time."
**Dario Amodei:** 所以我确实相信——你可以刷 benchmark,但我认为——如果我们能以一种不是专门过拟合或针对这个特定 benchmark 的方式达到 100%,那可能代表着编程能力真正实质性的提升。我猜如果我们能做到 90%、95%,那就意味着能够自主完成相当大比例的软件工程任务。
**Dario Amodei:** So I actually do believe that, while you can game benchmarks, if we get to 100% on that benchmark in a way that isn't overtrained or gamed for that particular benchmark, it probably represents a real and serious increase in programming ability. And I would suspect that if we can get to 90 or 95%, that will represent the ability to autonomously do a significant fraction of software engineering tasks.
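这里说的 SWE-bench 任务,其形态大致如下(字段名按公开数据集 Jimenez et al. 2024 整理,具体取值为虚构示例):

```python
# 一条 SWE-bench 任务的大致形态;字段名依公开数据集,取值为虚构。
swe_bench_instance = {
    "repo": "example-org/example-lib",          # 真实 GitHub 仓库
    "base_commit": "abc1234",                    # 修复前的代码库状态
    "problem_statement": "解析空配置文件时抛出 TypeError ...",  # issue 文本
    "patch": "diff --git a/lib/config.py ...",   # 参考(gold)修复补丁
    "test_patch": "diff --git a/tests/ ...",     # 判定用的新增测试
    "FAIL_TO_PASS": ["tests/test_config.py::test_empty"],  # 必须由红转绿
}
# 模型拿到 base_commit 处的仓库与 problem_statement,自行生成补丁;
# 评分标准是该补丁能否让 FAIL_TO_PASS 中的测试通过。
```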
## Opus 3.5 与模型版本命名
## Opus 3.5 and Model Versioning
**Lex Fridman:** 好,一个荒唐的时间表问题:Claude Opus 3.5 什么时候出来?
**Lex Fridman:** Well, ridiculous timeline question: when is Claude Opus 3.5 coming out?
**Dario Amodei:** 不给你具体日期,但就我们所知,计划中仍然有 Claude 3.5 Opus。
**Dario Amodei:** Not giving you an exact date, but, as far as we know, the plan is still to have a Claude 3.5 Opus.
**Lex Fridman:** 我们能在 GTA 6 之前拿到吗?还是说会像 Duke Nukem Forever 那样——那个游戏延期了 15 年,是 Duke Nukem Forever 吗?
**Lex Fridman:** Are we going to get it before GTA 6, or no? Like Duke Nukem Forever. There was some game that was delayed 15 years; was that Duke Nukem Forever?
**Dario Amodei:** 是的,而且 GTA 现在只是在发预告片……你知道吗,我们发布第一个 Sonnet 才三个月。是的,这令人难以置信的发布节奏,本身就说明了人们对新东西何时发布的预期。
**Dario Amodei:** Yeah, and I think GTA is now just releasing trailers. You know, it's only been three months since we released the first Sonnet 3.5. The incredible pace of releases just tells you about the pace, the expectations for when things are going to come out.
**Lex Fridman:** 那么关于版本命名——随着这些模型越来越大,你怎么看待版本号的问题?还有一个具体的问题:带日期的 Sonnet 3.5 更新版为什么不叫 Sonnet 3.6?
**Lex Fridman:** So, what about 4.0? How do you think about versioning as these models get bigger and bigger? And also, versioning in general: why "Sonnet 3.5 updated," with a date? Why not Sonnet 3.6?
**Dario Amodei:** 说实话,命名确实是个有趣的挑战。我认为一年前,模型主要是预训练,所以你可以从头规划:好,我们要有不同大小的模型,一起训练,有一套命名方案,然后注入新的魔法,就是下一代。
**Dario Amodei:** Naming is actually an interesting challenge here, because I think a year ago most of the model was pre-training, and so you could start from the beginning and just say: OK, we're going to have models of different sizes, we're going to train them all together, and we'll have a family with a naming scheme, and then we'll put some new magic into them, and then we'll have the next generation.
**Dario Amodei:** 麻烦从某些模型比其他模型训练时间长很多时就开始了——这已经让时间线有点乱了。然后当你在预训练上取得重大改进时,你忽然发现:哦,我可以做出更好的预训练模型,而且不需要很长时间,但它的大小和形态和以前的模型差不多。
**Dario Amodei:** The trouble starts already when some of them take a lot longer than others to train; that already messes up your timing a little bit. But as you make big improvements in pre-training, you suddenly notice: oh, I can make a better pre-trained model, and it doesn't take very long to do, but clearly it has the same size and shape as previous models.
**Dario Amodei:** 我认为这两点加在一起,再加上时间节点的问题,导致任何命名方案最终都会被现实打破——现实总会冲破方案的边界。这不像软件,你可以说"这是 3.7,这是 3.8"——不,你有权衡不同的模型,你可以改变模型的某些部分,可以改变其他部分,有些在推理时更快更慢,有些价格更贵更便宜。
**Dario Amodei:** So I think those two things together, as well as the timing issues, mean that reality tends to frustrate any naming scheme you come up with; it tends to break out of the scheme. It's not like software, where you can say this is 3.7, this is 3.8. No, you have models with different tradeoffs. You can change some things in your models and change other things; some are faster or slower at inference, some have to be more expensive, some less expensive.
**Dario Amodei:** 所以我认为所有公司都在为此挣扎。我们用 Haiku、Sonnet、Opus 的命名最初站位很好,我们也在努力维持它,但做不到完美。我们会努力回归简洁,但由于这个领域的性质,感觉没有人真正搞定命名——这不知为何是个和普通软件完全不同的范式,所有公司在这上面都做得不够完美。相对于训练模型的宏大科学来说,我们在这么琐碎的事情上却挣扎这么多,也挺让人意外的。
**Dario Amodei:** And so I think all the companies have struggled with this. I think we were in a good position in terms of naming when we had Haiku, Sonnet, and Opus, and we're trying to maintain it, but it's not perfect. So we'll try and get back to the simplicity. But just given the nature of the field, I feel like no one has figured out naming; it's somehow a different paradigm from normal software, and none of the companies have been perfect at it. It's something we struggle with surprisingly much, relative to how trivial it is next to the grand science of training the models.
**Lex Fridman:** 从用户侧来看,更新版 Sonnet 3.5 和之前 2024 年 6 月的那个体验就是不一样。要是能有某种标签来体现这种区别就好了——因为大家说"Sonnet 3.5",但现在有两个,你怎么指代前一个和后一个?当存在明显的改进时,谈论起来就很费劲。
**Lex Fridman:** So from the user side, the user experience of the updated Sonnet 3.5 is just different than the previous June 2024 Sonnet 3.5. It would be nice to come up with some kind of labeling that embodies that, because people talk about Sonnet 3.5, but now there's a different one, so how do you refer to the previous one and the new one? When there's a distinct improvement, it just makes conversation about it challenging.
## Claude 的性格特征

**Dario Amodei:** 是的,是的,我确实认为——模型有很多属性是 benchmark 反映不出来的,我觉得这一点绝对成立,大家都同意。而且不全是能力方面的,有些是,比如:模型可以礼貌或生硬,可以非常被动式地回应——
**Dario Amodei:** Yeah, yeah, I definitely think there are lots of properties of the models that are not reflected in the benchmarks. I think that's definitely the case and everyone agrees. And not all of them are capabilities; some of them are things like: models can be polite or brusque, they can be very reactive...
**Dario Amodei:** 或者可以主动提问。可以让人感觉温暖或冷漠。可以很无聊,也可以非常鲜明,就像 Golden Gate Claude 那样。我们有一整个团队专注于此——我们叫它 Claude Character(Claude 性格)。Amanda 领导那个团队,你可以和她聊聊。但这仍然是非常不精确的科学。
**Dario Amodei:** Or they can ask you questions. They can have what feels like a warm personality or a cold personality. They can be boring or they can be very distinctive, like Golden Gate Claude was. And we have a whole team kind of focused on — I think we call it Claude Character. Amanda leads that team and we'll talk to you about that. But it's still a very inexact science.
**Dario Amodei:** 而且我们经常发现模型有些属性是我们自己都不知道的。事实是,你可以和一个模型对话一万次,有些行为你可能还是没见过,就像跟一个人一样。我可能认识一个人几个月,却不知道他有某种技能,或者他有某一面。所以我认为我们必须习惯这种情况,我们也一直在寻找更好的方式来测试模型,发现这些能力,同时决定哪些性格属性是我们希望模型有的、哪些是不希望的——这个规范性问题(normative question)本身也极其有趣。
**Dario Amodei:** And often we find that models have properties that we're not aware of. The fact of the matter is that you can talk to a model 10,000 times and there are some behaviors you might not see, just like with a human. I can know someone for a few months and not know that they have a certain skill or not know there's a certain side to them. And so I think we just have to get used to this idea, and we're always looking for better ways of testing our models to demonstrate these capabilities and also to decide which are the personality properties we want models to have and which we don't want to have. That itself, the normative question, is also super interesting.
**Lex Fridman:** 我得问你一个来自 Reddit 的问题。有个对我来说很有意思的心理社会现象——用户报告说 Claude 随时间推移变笨了。那么问题是:用户抱怨 Claude 3.5 Sonnet 变笨,这有没有根据?这些零散报告是一种社会现象,还是说真的有 Claude 变笨的情况?
**Lex Fridman:** I got to ask you a question from Reddit. There's just this fascinating — to me at least — psychological social phenomenon where people report that Claude has gotten dumber for them over time. And so the question is: does the user complaint about the dumbing down of Claude 3.5 Sonnet hold any water? Are these anecdotal reports a kind of social phenomenon, or are there cases where Claude would actually get dumber?
**Dario Amodei:** 这其实不只发生在 Claude 上。我相信我见过针对所有主要公司大型基础模型的类似抱怨。GPT-4 有过这种说法,GPT-4 Turbo 也有过。所以说几点。
**Dario Amodei:** So this actually doesn't just apply to Claude. I believe I've seen these complaints for every foundation model produced by a major company. People said this about GPT-4, they said it about GPT-4 Turbo. So a couple things.
**Dario Amodei:** 第一,模型的实际权重——模型的实际"大脑"——除非我们引入新模型,否则不会改变。实际上有很多原因决定了随机替换模型版本在实践中完全说不通:从推理架构角度很难实现,而且改变模型权重会带来一百种你很难控制的连锁后果。比方说你想微调模型,让它少说"certainly"——老版 Sonnet 经常这么说——但你实际上会连带改变很多其他东西。所以我们有一整套修改模型的流程,要做大量测试、用户测试、早期客户测试。我们从来没有在不告知任何人的情况下改变过模型权重,而且在现有体系下,这样做本来也毫无意义。
**Dario Amodei:** One, the actual weights of the model — the actual brain of the model — do not change unless we introduce a new model. There are just a number of reasons why it would not make sense practically to be randomly substituting in new versions of the model. It's difficult from an inference perspective, and it's actually hard to control all the consequences of changing the weights of the model. Let's say you wanted to fine-tune the model to, I don't know, say "certainly" less, which an old version of Sonnet used to do. You actually end up changing a hundred things as well. So we have a whole process for modifying the model: we do a bunch of testing on it, we do a bunch of user testing and early customers. So we have never changed the weights of the model without telling anyone, and certainly in the current setup it would not make sense to do that.
**Dario Amodei:** 当然,有几件事我们偶尔会做。一是有时候会做 A/B 测试,但那通常非常接近模型发布时间,而且只涉及很小比例的用户,持续时间极短。比如新 Sonnet 3.5 发布前一天,确实有人评论说变好了很多,那是因为有一小部分人在那一两天里接触到了 A/B 测试版本。
**Dario Amodei:** Now there are a couple things that we do occasionally do. One is sometimes we run A/B tests, but those are typically very close to when a model is being released and for a very small fraction of time. So, you know, the day before the new Sonnet 3.5 — I agree we should have had a better name, it's clunky to refer to it — there were some comments from people that it's gotten a lot better, and that's because a fraction were exposed to an A/B test for those one or two days.
**Dario Amodei:** 另一个是偶尔会修改系统提示(system prompt)。系统提示会产生一些影响,但不太可能让模型变笨,不太可能让它变差。我们看到的情况是:这两件事——我列出来是为了信息完整——虽然发生得很少,但关于"模型变了""模型在某件事上不行了""模型更加审查了""模型变笨了"的抱怨,对我们和其他模型公司来说是持续不断的。所以我不想说人们在凭空想象,但模型在绝大多数情况下并没有在改变。
**Dario Amodei:** The other is that occasionally the system prompt will change. The system prompt can have some effects, although it's unlikely to dumb down models, it's unlikely to make them dumber. And we've seen that while these two things, which I'm listing to be very complete, happen quite infrequently, the complaints — for us and for other model companies — about the model changed, the model isn't good at this, the model got more censored, the model was dumbed down, those complaints are constant. And so I don't want to say people are imagining it or anything, but the models are for the most part not changing.
**Dario Amodei:** 如果要提出一个理论,我认为它和我之前说的一件事有关——模型非常复杂,有很多面向。所以经常会出现这种情况:如果我问模型同一个问题,用"做 X"和用"你能做 X 吗",模型的回应可能截然不同。所以你和模型互动方式的细微变化,就能产生截然不同的结果。说清楚一点:这本身就是我们和其他模型提供商的一个不足——模型往往对措辞的细微改变过于敏感。这是这些模型运作机制的科学极度不成熟的又一体现。
**Dario Amodei:** If I were to offer a theory, I think it actually relates to one of the things I said before, which is that models are very complex and have many aspects to them. And so often, if I ask a model a question — if I say "do task X" versus "can you do task X" — the model might respond in different ways. And so there are all kinds of subtle things that you can change about the way you interact with the model that can give you very different results. To be clear, this itself is a failing by us and by the other model providers, that the models are just often sensitive to small changes in wording. It's yet another way in which the science of how these models work is very poorly developed.
**Dario Amodei:** 所以如果我某天晚上睡前是用一种方式和模型对话,第二天稍微改了一下措辞,可能就得到不同的结果。这是一种可能的解释。另一个是,说真的,这些东西真的很难量化。我觉得人们对新模型刚出来时都非常兴奋,然后随着时间推移开始越来越清楚地看到它的局限性。这可能是另一种效应。但说了这么一大堆,总结一句:在一些相当有限的例外情况之外,模型并没有在改变。
**Dario Amodei:** And so if I go to sleep one night and I was talking to the model in a certain way and I slightly change the phrasing of how I talk to the model, I could get different results. So that's one possible way. The other thing is, man, it's just hard to quantify this stuff. I think people are very excited by new models when they come out, and then as time goes on they become very aware of the limitations. So that may be another effect. But that's all a very long-winded way of saying, for the most part, with some fairly narrow exceptions, the models are not changing.
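A toy probe for the phrasing sensitivity described above: pose the same underlying task under several surface wordings and score how often the answers agree. Both `query` (any chat API) and `judge` (an equivalence check, possibly itself model-graded) are stand-ins, not real SDK calls.

```python
# Toy phrasing-sensitivity probe: same task, different surface wording.
PHRASINGS = [
    "Do task X: {task}",
    "Can you do task X: {task}?",
    "Please complete the following: {task}",
]

def phrasing_consistency(query, judge, task: str) -> float:
    answers = [query(p.format(task=task)) for p in PHRASINGS]
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    # 1.0 means the model gave substantively the same answer under every
    # wording; low values reproduce the "it got dumber overnight" experience.
    return sum(judge(a, b) for a, b in pairs) / len(pairs)
```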
**Lex Fridman:** 我觉得确实存在一种心理效应。你就是开始习以为常了。就像飞机上刚有 Wi-Fi 的时候,感觉是神奇的魔法;但现在人们会说"这东西怎么又不行了,太烂了"。
**Lex Fridman:** I think there is a psychological effect. You just start getting used to it. The baseline — like when people have first gotten Wi-Fi on airplanes, it's like amazing magic, and then now it's like, "I can't get this thing to work, this is such a piece of crap."
**Dario Amodei:** 正是。所以就很容易产生阴谋论——他们在故意把 Wi-Fi 越弄越慢。
**Dario Amodei:** Exactly. So it's easy to have the conspiracy theory of they're making Wi-Fi slower and slower.
**Lex Fridman:** 这个话题我和 Amanda 聊的时候可能会深入谈,但还有一个 Reddit 的问题:Claude 什么时候能停止扮演我那个惊慌失措的祖母,把自己的道德观强加给我这个付费用户?另外,让 Claude 过度道歉的理念是什么?这种关于体验的报告从另一个角度反映了挫败感,涉及到它的性格。
**Lex Fridman:** This is probably something I'll talk to Amanda much more about, but another Reddit question: when will Claude stop trying to be my panicky grandmother, imposing its moral worldview on me as a paying customer? And also, what is the ideology behind making Claude overly apologetic? So this kind of reports about the experience — a different angle on the frustration — it has to do with the character.
**Dario Amodei:** 好的,关于这个有几点。首先,Reddit 和 Twitter(或者说 X)上人们大声抱怨的事情,和统计上用户真正在意的、驱动人们使用这些模型的事情,实际上存在巨大的分布差异。人们真正挫败的往往是:模型没有写出完整的代码,或者模型在代码上没达到理论上的最佳水平——尽管它已经是世界上在代码方面最好的模型了。我认为大多数问题都是关于这类的。
**Dario Amodei:** Yeah, so a couple points on this. First one is, things that people say on Reddit and Twitter or X or whatever — there's actually a huge distribution shift between the stuff that people complain loudly about on social media and what actually, statistically, users care about and that drives people to use the models. People are frustrated with things like the model not writing out all the code, or the model just not being as good at code as it could be, even though it's the best model in the world on code. I think the majority of things are about that.
**Dario Amodei:** 但可以肯定的是,有一小部分声音很大的人在提出这些担忧,他们对模型拒绝那些不该拒绝的事情感到沮丧,或者觉得模型道歉太多,或者就是有这些让人烦躁的口头禅。
**Dario Amodei:** But certainly a vocal minority are raising these concerns, are frustrated by the model refusing things that it shouldn't refuse, or apologizing too much, or just having these kind of annoying verbal tics.
**Dario Amodei:** 第二点说明,我想把这个讲得很清楚,因为我觉得有些人不知道,有些人隐约知道但会忘掉——要全面控制模型的行为方式,其实非常难。你没法直接伸手进去说:"哦,我想让模型少道歉一点。" 你可以这么做,可以加入一些训练数据,告诉模型应该少道歉,但结果往往是在另一些情况下,模型会变得非常粗鲁,或者过度自信,反而把人给误导了。所以这里面充满了权衡。
**Dario Amodei:** The second caveat, and I just want to say this super clearly because I think some people don't know it, others kind of know it but forget it — it is very difficult to control across the board how the models behave. You cannot just reach in there and say, "Oh, I want the model to apologize less." You can do that, you can include training data that says the model should apologize less, but then in some other situation they end up being super rude or overconfident in a way that's misleading people. So there are all these tradeoffs.
**Dario Amodei:** 举个例子,还有一段时间,模型——我们的,我想其他家的也是——说话太啰嗦。会重复自己说过的话,说一大堆没必要的东西。你可以通过惩罚模型说太长的话来压缩冗余。但你要是用粗暴的方式这么做,结果就是模型在写代码的时候,有时会写"剩下的代码在这里",因为它学会了这是一种省字数的方法。这就导致模型在写代码时显得很"懒",就好像在说:"哎,你自己把剩下的补完吧。"这不是因为我们想节省算力,也不是因为模型在寒假偷懒,更不是那些流传的什么阴谋论。根本原因就是,要在所有情况下同时控制和引导模型的行为,实在是太难了。
**Dario Amodei:** For example, another thing is there was a period during which models — ours and I think others as well — were too verbose. They would repeat themselves, they would say too much. You can cut down on the verbosity by penalizing the models for just talking for too long. What happens when you do that, if you do it in a crude way, is when the models are coding, sometimes they'll say "the rest of the code goes here," because they've learned that's a way to economize. And then that leads the model to be so-called lazy in coding, where they're just like, "Ah, you can finish the rest of it." It's not because we want to save on compute, or because the models are lazy during winter break, or any of the other conspiracy theories that have come up. It's actually just very hard to control the behavior of the model, to steer the behavior of the model in all circumstances at once.
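A crude illustration of the tradeoff described here, and explicitly not Anthropic's actual training setup: if verbosity is handled with a naive per-token penalty on the reward, a truncated answer that elides work with "the rest of the code goes here" can outscore a complete, correct one, so the penalty itself teaches the "lazy" behavior.

```python
# Naive length-penalized reward (illustrative numbers only).
def naive_reward(quality: float, num_tokens: int,
                 per_token_penalty: float = 0.002) -> float:
    return quality - per_token_penalty * num_tokens

full = naive_reward(quality=0.95, num_tokens=800)  # 0.95 - 1.60 = -0.65
lazy = naive_reward(quality=0.70, num_tokens=120)  # 0.70 - 0.24 =  0.46
assert lazy > full  # the crude penalty rewards "the rest of the code goes here"
```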
**Dario Amodei:** 这里有一种"打地鼠"的效应——你按住一个地方,另一些你可能都没注意到或者根本没去测量的地方就开始动了。这也是为什么我特别关注未来 AI 系统的宏观对齐(grand alignment)——这些系统真的很难预测,很难驾驭和控制。而我们今天看到的这个版本——改善一件事,另一件事就变差——我认为这正是未来 AI 系统控制问题的一个当下版本,我们现在就可以开始研究了。
**Dario Amodei:** There's this whack-a-mole aspect where you push on one thing and these other things start to move as well that you may not even notice or measure. And so one of the reasons that I care so much about grand alignment of these AI systems in the future is actually these systems are quite unpredictable, they're quite hard to steer and control. And this version we're seeing today — of you make one thing better, it makes another thing worse — I think that's a present-day analog of future control problems in AI systems that we can start to study today.
**Dario Amodei:** 我认为,这种很难引导行为、很难确保把 AI 系统往一个方向推的时候它不会在其他方向做出我们不想要的事情——我认为这是未来的一个早期信号。如果我们能把这个问题解决好——比如你让模型讲怎么制造和散播天花,它会拒绝,但它又愿意帮你上研究生水平的病毒学课——我们怎么同时做到这两件事?这很难。往哪边倒都容易,但这是一个多维度的问题。
**Dario Amodei:** I think that difficulty in steering the behavior and in making sure that if we push an AI system in one direction it doesn't push it in another direction in some other ways that we didn't want — I think that's an early sign of things to come. And if we can do a good job of solving this problem — like, you ask the model to make and distribute smallpox and it says no, but it's willing to help you in your graduate-level virology class — how do we get both of those things at once? It's hard. It's very easy to go to one side or the other, and it's a multi-dimensional problem.
**Dario Amodei:** 所以我认为,塑造模型性格这些问题真的很难。我们做得并不完美。我认为我们在所有 AI 公司里做得最好,但距离完美还差得很远。我认为如果我们能在这个当下高度可控的环境里把误报和漏报都处理好,我们以后面对更强大的模型时就会做得更好——那时候我们担心的问题是:模型会不会高度自主?会不会自己制造出非常危险的东西?会不会自主运营整个公司?那些公司是否符合我们的价值观?所以我把当前这个任务,既看作一种疫苗,也看作为未来做的好练习。
**Dario Amodei:** And so I think these questions of shaping the model's personality, I think they're very hard. I think we haven't done perfectly on them. I think we've actually done the best of all the AI companies, but still so far from perfect. And I think if we can get this right, if we can control the false positives and false negatives in this very controlled present-day environment, we'll be much better at doing it for the future, when our worry is will the models be super autonomous, will they be able to make very dangerous things, will they be able to autonomously build whole companies, and are those companies aligned. So I think of this present task as both vaccine but also good practice for the future.
**Lex Fridman:** 现在收集用户反馈最好的方法是什么?不是零散的个案,而是大规模地了解痛点或者好的体验?是内部测试、专项小组测试?哪种方法有效?
**Lex Fridman:** What's the current best way of gathering user feedback — not anecdotal data but just large-scale data about pain points or the opposite of pain points, positive things? Is it internal testing, a specific group testing? What works?
**Dario Amodei:** 通常我们会举办内部"模型轰炸"(model bashing)活动,让整个 Anthropic 的人——Anthropic 现在将近一千人——大家尽量去破坏模型,用各种方式去和它交互。我们有一套评估(evals)框架,专门测试"模型是不是在不该拒绝的地方拒绝了"。我们甚至做过一个"certainly"(当然)评估,因为有一段时间我们的模型有一个让人烦的口头禅,对各种各样的问题都会说"Certainly, I can help you with that(当然,我可以帮您)"、"Certainly, I would be happy to do that(当然,我很乐意)"、"Certainly, this is correct(当然,这是对的)"。所以我们做了一个专门评估模型说"certainly"频率的测试。
**Dario Amodei:** So typically we'll have internal model bashings where all of Anthropic — Anthropic is almost a thousand people — people just try and break the model, they try and interact with it in various ways. We have a suite of evals for, "Oh, is the model refusing in ways that it shouldn't?" I think we even had a "certainly" eval, because at one point our model had this annoying tic where it would respond to a wide range of questions by saying "Certainly, I can help you with that," "Certainly, I would be happy to do that," "Certainly, this is correct." And so we had a "certainly" eval — how often does the model say "certainly."
**Dario Amodei:** 但说实话,这就是打地鼠。要是它从说"certainly"变成说"definitely"呢?所以每次我们加入新的评估,同时也对所有旧的东西持续评估,我们有几百个这样的评估项。但我们发现,没有什么能替代真人去和模型交互。所以这个过程非常像普通的产品开发流程。我们先让 Anthropic 内部几百人去"轰炸"模型,然后做外部 A/B 测试,有时候会找外部承包商,付钱让他们来和模型交互。
**Dario Amodei:** But look, this is just whack-a-mole. Like, what if it switches from "certainly" to "definitely"? So every time we add a new eval and we're always evaluating for all the old things, we have hundreds of these evaluations. But we find that there's no substitute for humans interacting with it. And so it's very much like the ordinary product development process. We have hundreds of people within Anthropic bash the model, then we do external A/B tests, sometimes we'll run tests with contractors, we pay contractors to interact with the model.
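A minimal sketch of what a verbal-tic eval like the "certainly" eval might look like; the prompt suite and the list of tics are assumptions. The whack-a-mole point shows up directly: each tic you fix ("certainly") gets its own entry, and new ones ("definitely") keep being added.

```python
import re

def tic_rate(responses: list[str], tic: str) -> float:
    # Fraction of responses containing the tic as a whole word.
    pattern = re.compile(rf"\b{re.escape(tic)}\b", re.IGNORECASE)
    return sum(bool(pattern.search(r)) for r in responses) / len(responses)

# The suite only ever grows: fix one tic and start watching the next.
TICS_UNDER_WATCH = ["certainly", "definitely"]

def tic_report(responses: list[str]) -> dict[str, float]:
    return {tic: tic_rate(responses, tic) for tic in TICS_UNDER_WATCH}
```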
**Dario Amodei:** 把这些东西综合起来,还是不完美。还是会看到一些你不太想看到的行为。还是会看到模型拒绝一些完全没理由拒绝的事情。但我认为,努力解决这个难题——让模型不去做那些所有人都认同不该做的真正坏事,比如所有人都认同模型不该讨论儿童性虐待内容——同时又不能用那些愚蠢的方式乱拒绝。我认为如何把这条线划得尽可能精准、趋近完美,依然是个挑战。我们每天都在进步,但还有很多问题没解决。我再说一遍,我认为这也预示着未来引导更强大的模型时,前面会有的挑战。
**Dario Amodei:** So you put all of these things together and it's still not perfect. You still see behaviors that you don't quite want to see. You still see the model refusing things that it just doesn't make sense to refuse. But I think trying to solve this challenge — trying to stop the model from doing genuinely bad things that everyone agrees it shouldn't do, like everyone agrees that the model shouldn't talk about child abuse material — but at the same time that it doesn't refuse in these dumb and stupid ways. I think drawing that line as finely as possible, approaching perfectly, is still a challenge, and we're getting better at it every day. But there's a lot to be solved. And again I would point to that as an indicator of a challenge ahead in terms of steering much more powerful models.
**Lex Fridman:** 你觉得 Claude 4.0 会出来吗?
**Lex Fridman:** Do you think Claude 4.0 is ever coming out?
**Dario Amodei:** 我不想承诺任何命名方案,因为如果我在这里说我们明年会出 Claude 4,然后我们决定要重新开始,因为出现了某种新型模型——我不想做这种承诺。正常情况下,按常规业务节奏,Claude 4 应该会在 Claude 3.5 之后出来,但在这个疯狂的领域里,什么都说不准。
**Dario Amodei:** I don't want to commit to any naming scheme, because if I say here we're going to have Claude 4 next year and then we decide that we should start over because there's a new type of model — I don't want to commit to it. I would expect in a normal course of business that Claude 4 would come after Claude 3.5, but you never know in this wacky field.
**Lex Fridman:** 但 scaling(规模扩展)这个方向还在继续?
**Lex Fridman:** But this sort of idea of scaling is continuing?
**Dario Amodei:** Scaling 还在继续。我们肯定会推出比现在更强大的模型。这是确定的。要是做不到,那我们作为一家公司就彻底失败了。
**Dario Amodei:** Scaling is continuing. There will definitely be more powerful models coming from us than the models that exist today. That is certain. Or if there aren't, we've deeply failed as a company.
**Lex Fridman:** 好,你能解释一下负责任的 scaling 政策(responsible scaling policy)和 AI 安全等级标准——ASL 等级——是什么吗?
**Lex Fridman:** Okay, can you explain the responsible scaling policy and the AI safety level standards — the ASL levels?
**Dario Amodei:** 我对这些模型的好处感到非常兴奋——我们聊到《慈爱机器》(Machines of Loving Grace)的时候会谈到这个——但我依然担心风险,持续地担心。任何人都不应该认为《慈爱机器》是我在说我不再担心这些模型的风险了。我认为这两面是同一枚硬币的两面。模型的力量,以及它们解决生物学、神经科学、经济发展、政府治理、和平这些问题的能力——还有大部分经济领域——这些都伴随着风险。能力越大,责任越大。这两者是配套的。强大的东西既能做好事,也能做坏事。
**Dario Amodei:** As much as I'm excited about the benefits of these models — and we'll talk about that if we talk about Machines of Loving Grace — I'm worried about the risk and I continue to be worried about the risks. No one should think that Machines of Loving Grace was me saying I'm no longer worried about the risks of these models. I think they're two sides of the same coin. The power of the models and their ability to solve all these problems in biology, neuroscience, economic development, government, governance, and peace — large parts of the economy — those come with risks as well. With great power comes great responsibility. The two are paired. Things that are powerful can do good things and they can do bad things.
**Dario Amodei:** 我认为这些风险分几个不同的类别。可能我最担心的两大风险——这不是说今天没有重要的风险——但当我想到那些可能在最大规模上发生的事情:第一个是我称之为灾难性滥用(catastrophic misuse)。这是指在网络、生物、放射性、核(CBRN)等领域对模型的滥用——如果真的出了大问题,可能会伤害甚至杀死数千人、甚至数百万人。这是防范的第一优先级。
**Dario Amodei:** I think of those risks as being in several different categories. Perhaps the two biggest risks that I think about — and that's not to say that there aren't risks today that are important — but when I think of the things that would happen on the grandest scale: one is what I call catastrophic misuse. These are misuse of the models in domains like cyber, bio, radiological, nuclear — things that could harm or even kill thousands, even millions of people if they really, really go wrong. These are the number one priority to prevent.
**Dario Amodei:** 在这里,我想做一个简单的观察:如果我看看今天那些在世界上做了真正坏事的人,我认为人类实际上一直受到这样一个事实的保护——真正聪明、受过良好教育的人,和真正想做骇人听闻事情的人,这两者的重合一直很小。假设我是一个在某个领域有博士学位、有一份高薪工作的人——有太多东西可以失去了。就算我完全是个坏人——大多数人不是——这样的人为什么要冒生命危险、冒毁掉自己的遗产和声誉的风险去做真正的坏事呢?如果有更多这样的人,世界会危险得多。
**Dario Amodei:** And here I would just make a simple observation, which is that if I look today at people who have done really bad things in the world, I think actually humanity has been protected by the fact that the overlap between really smart, well-educated people and people who want to do really horrific things has generally been small. Let's say I'm someone who has a PhD in this field, I have a well-paying job — there's so much to lose. Why would I want to — even assuming I'm completely evil, which most people are not — why would such a person risk their life, risk their legacy, their reputation to do something truly evil? If we had a lot more people like that, the world would be a much more dangerous place.
**Dario Amodei:** 所以我担心的是,通过成为一个更聪明的智能体,AI 可能会打破这种相关性。我确实对此有严重的担忧。我相信我们可以防止这些担忧成真。但作为《慈爱机器》的一个反面,我想说,严重的风险依然存在。
**Dario Amodei:** And so my worry is that by being a much more intelligent agent, AI could break that correlation. And so I do have serious worries about that. I believe we can prevent those worries. But as a counterpoint to Machines of Loving Grace, I want to say that there are still serious risks.
**Dario Amodei:** 第二类风险是自主性风险(autonomy risks),也就是说,模型可能会自行其是——尤其是当我们赋予它们比以前更多的自主权,尤其是当我们让它们监管更广泛的任务,比如编写整个代码库,或者某天甚至实际上运营整个公司——它们的绳子放得很长。它们做的是不是真正我们想要的?要搞清楚它们在细节上究竟在做什么都很难,更别说控制了。
**Dario Amodei:** And the second range of risks would be the autonomy risks, which is the idea that models might on their own — particularly as we give them more agency than they've had in the past, particularly as we give them supervision over wider tasks like writing whole code bases or someday even effectively operating entire companies — they're on a long enough leash. Are they doing what we really want them to do? It's very difficult to even understand in detail what they're doing, let alone control it.
**Dario Amodei:** 就像我说的,这些早期的迹象表明,要完美地划出模型应该做的事和不该做的事之间的界限很难——往一边偏你就会得到让人烦躁且没用的行为,往另一边偏就会出现其他问题。修好一件事,又制造出别的麻烦。我们在解决这个问题上越来越好。我不认为这是一个无解的问题。我认为这是一门科学,就像飞机安全、汽车安全或药物安全一样。我认为没有什么我们完全没注意到的大问题;我只是认为我们需要在控制这些模型上做得更好。
**Dario Amodei:** And like I said, these early signs that it's hard to perfectly draw the boundary between things the model should do and things the model shouldn't do — if you go to one side you get things that are annoying and useless, and you go to the other side you get other behaviors. If you fix one thing, it creates other problems. We're getting better and better at solving this. I don't think this is an unsolvable problem. I think this is a science, like the safety of airplanes or the safety of cars or the safety of drugs. I don't think there's any big thing we're missing; I just think we need to get better at controlling these models.
**Dario Amodei:** 所以这就是我担心的两类风险,而我们的负责任 scaling 计划——我承认这是一个非常冗长的回答——
**Dario Amodei:** And so these are the two risks I'm worried about, and our responsible scaling plan — which I'll recognize is a very long-winded answer to your question —
**Lex Fridman:** 我喜欢。
**Lex Fridman:** I love it.
**Dario Amodei:** 我们的负责任 scaling 计划就是为了应对这两类风险而设计的。所以每次我们开发新模型,我们基本上都会测试它做这两类坏事的能力。
**Dario Amodei:** Our responsible scaling plan is designed to address these two types of risks. And so every time we develop a new model, we basically test it for its ability to do both of these bad things.
**Dario Amodei:** 退一步说,我认为 AI 系统面临一个有趣的困境——它们还没有强大到能制造这些灾难——我不知道它们是否会制造这些灾难,有可能不会——但担忧的理由、认为有风险的理由已经足够充分,我们现在就应该行动。而且它们进步得非常非常快。我在参议院作证时说过,我们可能在两到三年内面临严重的生物风险。那大概是一年前的事,而事情确实按那个节奏在推进。
**Dario Amodei:** So if I were to back up a little bit, I think we have an interesting dilemma with AI systems where they're not yet powerful enough to present these catastrophes — I don't know that they'll ever present these catastrophes, it's possible they won't — but the case for worry, the case for risk, is strong enough that we should act now. And they're getting better very, very fast. I testified in the Senate that we might have serious bio risks within two to three years. That was about a year ago, and things have proceeded apace.
**Dario Amodei:** 所以我们面对的情况是,应对这些风险出人意料地困难,因为它们今天还不存在,还是鬼魂。但它们向我们扑来的速度非常快,因为模型进步得太快了。你怎么应对一个今天还不存在、但正以极快速度向你冲来的东西?
**Dario Amodei:** So we have this thing where it's surprisingly hard to address these risks because they're not here today, they don't exist. They're like ghosts, but they're coming at us so fast because the models are improving so fast. So how do you deal with something that's not here today, doesn't exist, but is coming at us very fast?
**Dario Amodei:** 我们和 METR 这样的组织以及 Paul Christiano 一起想出的解决方案是:你需要测试来告诉你风险什么时候接近了。你需要一个预警系统。所以每次有新模型,我们都会测试它在 CBRN 任务上的能力,同时测试它在多大程度上能独立自主地完成任务。在我们最新版本的 RSP(负责任 scaling 政策)中——这是过去一两个月发布的——我们测试自主性风险的方式是:AI 模型自行开展 AI 研究的能力——当 AI 模型能够自己做 AI 研究时,它们就变得真正自主了。这个门槛在其他很多方面也很重要。
**Dario Amodei:** The solution we came up with, in collaboration with people like the organization METR and Paul Christiano, is: what you need for that is tests to tell you when the risk is getting close. You need an early warning system. And so every time we have a new model, we test it for its capability to do these CBRN tasks, as well as testing it for how capable it is of doing tasks autonomously on its own. And in the latest version of our RSP, which we released in the last month or two, the way we test autonomy risks is the AI model's ability to do aspects of AI research itself — which, when the AI models can do AI research, they become kind of truly autonomous. And that threshold is important for a bunch of other ways.
**Dario Amodei:** 那我们拿这些测试结果怎么办?RSP 基本上建立了我们所说的"如果-那么"结构:如果模型通过了某个能力门槛,那么我们就对它施加一套特定的安全和保密要求。
**Dario Amodei:** And so what do we then do with these tests? The RSP basically develops what we've called an "if-then" structure, which is: if the models pass a certain capability, then we impose a certain set of safety and security requirements on them.
**Dario Amodei:** 今天的模型属于所谓的 ASL-2 模型。ASL-1 是针对那些明显不存在自主性或滥用风险的系统。比如一个下棋机器人——Deep Blue——就是 ASL-1。显然 Deep Blue 除了下棋什么都做不了,就是专门为下棋设计的。没人会用 Deep Blue 来策划一场高超的网络攻击,或者放任它横冲直撞统治世界。
**Dario Amodei:** So today's models are what's called ASL-2 models. ASL-1 is for systems that manifestly don't pose any risk of autonomy or misuse. For example, a chess-playing bot — Deep Blue — would be ASL-1. It's just manifestly the case that you can't use Deep Blue for anything other than chess. It was just designed for chess. No one's going to use it to conduct a masterful cyber attack or to run wild and take over the world.
**Dario Amodei:** ASL-2 是当今的 AI 系统,我们经过测量,认为这些系统还不够聪明,无法自主自我复制或完成一堆任务,也不够聪明到能提供有意义的 CBRN 风险信息、以及如何制造 CBRN 武器——超出在 Google 上查到的范围之外。事实上,它们有时确实会提供一些信息,但并不超出搜索引擎能给的——不是那种可以拼在一起、端对端具有足够危险性的信息。
**Dario Amodei:** ASL-2 is today's AI systems, where we've measured them and we think these systems are simply not smart enough to autonomously self-replicate or conduct a bunch of tasks, and also not smart enough to provide meaningful information about CBRN risks and how to build CBRN weapons above and beyond what can be known from looking at Google. In fact, sometimes they do provide information, but not above and beyond a search engine — not in a way that can be stitched together, not in a way that end-to-end is dangerous enough.
**Dario Amodei:** ASL-3 是这样一个节点:模型的能力足以增强非国家行为者(non-state actors)的能力。国家行为者(state actors)已经能够以很高的熟练度做很多非常危险和具有破坏性的事情了,不幸的是。区别在于非国家行为者还没有这种能力。所以当我们到达 ASL-3 时,我们会采取特别的安全预防措施,专门足够防止非国家行为者窃取模型,以及防止模型在部署时被滥用。我们将不得不针对这些特定领域——网络、生物、核——以及模型自主性(这与其说是滥用风险,不如说是模型本身做坏事的风险)设置增强的过滤器。
**Dario Amodei:** ASL-3 is going to be the point at which the models are helpful enough to enhance the capabilities of non-state actors. State actors can already do a lot of these very dangerous and destructive things, unfortunately, to a high level of proficiency. The difference is that non-state actors are not capable of it. And so when we get to ASL-3, we'll take special security precautions designed to be sufficient to prevent theft of the model by non-state actors and misuse of the model as it's deployed. We'll have to have enhanced filters targeted at these particular areas — cyber, bio, nuclear — and model autonomy, which is less a misuse risk and more a risk of the model doing bad things itself.
**Dario Amodei:** ASL-4——到了这个节点,这些模型可以增强一个本已具备丰富知识的国家行为者的能力,和/或者成为此类风险的主要来源,就像说,如果你想从事这类风险,主要途径就是通过某个模型。然后我认为 ASL-4 在自主性方面,是 AI 模型在 AI 研究能力上实现某种程度的加速。
**Dario Amodei:** ASL-4 — getting to the point where these models could enhance the capability of an already knowledgeable state actor and/or become the main source of such a risk: if you wanted to engage in such a risk, the main way you would do it would be through a model. And then I think ASL-4 on the autonomy side is some amount of acceleration in AI research capabilities with an AI model.
**Dario Amodei:** ASL-5 则是我们到达那些真正有能力的模型的地方,它们在做任何这类任务上都能超越人类。
**Dario Amodei:** And then ASL-5 is where we would get to the models that are truly capable, that could exceed humanity in their ability to do any of these tasks.
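One way to picture the if-then structure is as a lookup from a measured capability level to the safeguards that switch on at that level. The sketch below paraphrases the levels as described in this conversation; the real thresholds and requirements live in Anthropic's RSP document, not in this illustrative code.

```python
from enum import IntEnum

class ASL(IntEnum):
    ASL1 = 1  # manifestly no misuse/autonomy risk (e.g. a chess engine)
    ASL2 = 2  # today's models: no meaningful uplift beyond a search engine
    ASL3 = 3  # meaningful CBRN uplift for non-state actors
    ASL4 = 4  # uplift for state actors and/or acceleration of AI research
    ASL5 = 5  # exceeds human capability on these tasks

# IF evals show a capability level, THEN these safeguards must be in place.
REQUIRED_SAFEGUARDS = {
    ASL.ASL3: ["security sufficient to prevent theft by non-state actors",
               "deployment filters targeted at cyber/bio/nuclear"],
    ASL.ASL4: ["verification via interpretability (models may sandbag evals)",
               "security against state-level attackers"],
}

def safeguards_for(measured: ASL) -> list[str]:
    needed = []
    for level, reqs in sorted(REQUIRED_SAFEGUARDS.items()):
        if measured >= level:
            needed.extend(reqs)
    return needed
```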
**Dario Amodei:** 所以"如果-那么"结构承诺的重点,基本上是想说:我和这些模型打交道很多年了,我担忧风险也很多年了。不负责任地大喊"狼来了"其实是有危险的——说这个模型有风险,人们一看,发现这明显没危险,这样说其实是有危险的。再说一遍,棘手之处在于:风险今天还不在这里,但正在迅速向我们逼近。你怎么应对这个?对风险规划者来说这真的是个让人头疼的问题。
**Dario Amodei:** And so the point of the if-then structure commitment is basically to say: look, I've been working with these models for many years and I've been worried about risk for many years. It's actually kind of dangerous to cry wolf. It's actually kind of dangerous to say this model is risky, and people look at it and say this is manifestly not dangerous. Again, the delicacy is the risk isn't here today, but it's coming at us fast. How do you deal with that? It's really vexing to a risk planner.
**Dario Amodei:** 所以这种"如果-那么"结构基本上是说:我们不想激怒一大堆人,不想因为对今天并不危险的模型施加非常繁重的负担而损害我们在对话中的地位。所以这种触发机制承诺,基本上是一种处理方式——当你能证明模型是危险的时候,就要强力收紧。当然,这需要配套足够的缓冲门槛,让你不至于面临很高的风险去错过那个危险时刻。
**Dario Amodei:** And so this if-then structure basically says: look, we don't want to antagonize a bunch of people, we don't want to harm our own ability to have a place in the conversation by imposing these very onerous burdens on models that are not dangerous today. So the trigger commitment is basically a way to deal with this — you clamp down hard when you can show that the model is dangerous. And of course, what has to come with that is enough of a buffer threshold that you're not at high risk of missing the danger.
**Dario Amodei:** 这不是一个完美的框架。我们不得不修改它——就在几周前我们发布了新版本,而且展望未来,我们可能一年多次发布新版本,因为要从技术、组织、研究角度把这些政策做对是很难的。但这就是这个提案:用"如果-那么"承诺和触发机制,来最小化现在的负担和误报,同时在危险真正到来时做出恰当的回应。
**Dario Amodei:** It's not a perfect framework. We've had to change it — we came out with a new one just a few weeks ago, and probably going forward we might release new ones multiple times a year because it's hard to get these policies right technically, organizationally, from a research perspective. But that is the proposal: if-then commitments and triggers in order to minimize burdens and false alarms now, but really react appropriately when the dangers are here.
**Lex Fridman:** 你认为 ASL-3 的时间线是多久,好几个触发条件会被触发?ASL-4 的时间线又是多久?
**Lex Fridman:** What do you think the timeline for ASL-3 is, where several of the triggers are fired? And what do you think the timeline is for ASL-4?
**Dario Amodei:** 这是公司内部激烈争论的问题。我们正在积极准备 ASL-3 的安全措施和 ASL-3 的部署措施。我不会讲细节,但我们在这两方面都取得了很大进展,我们准备好在相当短的时间内就绪。我一点都不会惊讶如果我们明年达到 ASL-3。有些担忧认为我们今年就可能达到——这还是有可能的,还可能发生。很难说,但我会非常非常惊讶如果那是到 2030 年才发生的事情。我认为会比那早得多。
**Dario Amodei:** Yeah, so that is hotly debated within the company. We are working actively to prepare ASL-3 security measures as well as ASL-3 deployment measures. I'm not going to go into detail, but we've made a lot of progress on both, and we're prepared to be ready quite soon. I would not be surprised at all if we hit ASL-3 next year. There was some concern that we might even hit it this year — that's still possible, that could still happen. It's very hard to say, but I would be very, very surprised if it was like 2030. I think it's much sooner than that.
**Lex Fridman:** 所以有检测它的协议——那个"如果-那么"——然后还有响应它的协议。后者有多难?
**Lex Fridman:** So there's protocols for detecting it — the if-then — and then there's protocols for how to respond to it. How difficult is the second, the latter?
**Dario Amodei:** 我认为对于 ASL-3,主要是关于安全保密,以及在我们部署模型时,针对非常有限的几个领域对模型进行过滤,因为在 ASL-3 阶段,模型还没有自主性。所以你不必担心模型本身在内部部署时也表现出坏行为。所以我认为 ASL-3 的措施——我不想说简单直接——它们是严格的,但推理起来更容易一些。
**Dario Amodei:** Yeah, I think for ASL-3, it's primarily about security and about filters on the model relating to a very narrow set of areas when we deploy the model, because at ASL-3 the model isn't autonomous yet. And so you don't have to worry about the model itself behaving in a bad way even when it's deployed internally. So I think the ASL-3 measures are — I won't say straightforward — they're rigorous, but they're easier to reason about.
**Dario Amodei:** 我认为一旦我们到了 ASL-4,我们就开始担心这些模型足够聪明,可能会在测试中装傻(sandbag),可能不会如实告知测试结果。我们有一些关于"睡眠特工"(sleeper agents)的研究结果,还有一篇更近的论文,关于模型是否能误导试图让它们暴露能力的尝试——把自己伪装得比实际能力更弱。所以我认为对于 ASL-4,一个重要的组成部分将是使用除了和模型交互之外的其他东西——例如可解释性(interpretability)或隐藏的思维链(hidden chains of thought)——你必须看进模型内部,通过某种不那么容易被模型说的话所左右的机制,来验证模型确实具有某个属性。所以我们还在研究 ASL-4。
**Dario Amodei:** I think once we get to ASL-4, we start to have worries about the models being smart enough that they might sandbag tests, they might not tell the truth about tests. We had some results come out about sleeper agents, and there was a more recent paper about whether the models can mislead attempts to sandbag their own abilities — present themselves as being less capable than they are. And so I think with ASL-4, there's going to be an important component of using other things than just interacting with the models — for example, interpretability or hidden chains of thought — where you have to look inside the model and verify via some other mechanism that is not as easily corrupted as what the model says, that the model indeed has some property. So we're still working on ASL-4.
**Dario Amodei:** RSP 的一个特性是:我们不会在达到 ASL-3 之前具体规定 ASL-4,我认为这被证明是一个明智的决定,因为即便是 ASL-3,也很难在细节上把握清楚,我们想尽可能多地争取时间来把这些事情做对。
**Dario Amodei:** One of the properties of the RSP is that we don't specify ASL-4 until we've hit ASL-3, and I think that's proven to be a wise decision because even with ASL-3, it's hard to know this stuff in detail, and we want to take as much time as we can possibly take to get these things right.
**Lex Fridman:** 所以对于 ASL-3,坏的行为者会是人类?
**Lex Fridman:** So for ASL-3, the bad actor will be the humans?
**Dario Amodei:** 是的。
**Dario Amodei:** Yes.
**Lex Fridman:** 而 ASL-4 则是两者都有?
**Lex Fridman:** And so there it's a little bit more — for ASL-4 it's both?
**Dario Amodei:** 我认为两者都有。所以就有了欺骗的问题——这就是机械可解释性(mechanistic interpretability)发挥作用的地方。希望用于此的技术不会被模型所访问。
**Dario Amodei:** I think it's both. And so deception — and that's where mechanistic interpretability comes into play. And hopefully the techniques used for that are not made accessible to the model.
**Dario Amodei:** 当然,你可以把机械可解释性接入模型本身,但那样你就把它作为一个可靠的模型状态指标给废掉了。有很多奇特的方式你可以想到,它也可能不可靠——比如模型足够聪明,能跳到其他计算机上读取你用来查看它内部状态的代码。我们考虑过这些情况。我认为它们足够奇特,有方法让它们不太可能发生。但总体而言,你想把机械可解释性保留为一种验证集或测试集,与模型的训练过程分离。
**Dario Amodei:** Yeah, I mean of course you can hook up the mechanistic interpretability to the model itself, but then you've kind of lost it as a reliable indicator of the model state. There are a bunch of exotic ways you can think of that it might also not be reliable — like if the model gets smart enough that it can jump computers and read the code where you're looking at its internal state. We've thought about some of those. I think they're exotic enough; there are ways to render them unlikely. But yeah, generally you want to preserve mechanistic interpretability as a kind of verification set or test set that's separate from the training process of the model.
**Lex Fridman:** 我认为随着这些模型越来越擅长对话,越来越聪明,社会工程学(social engineering)也会成为一个威胁,因为它们可以开始对公司内部的工程师变得非常有说服力。
**Lex Fridman:** I think as these models become better and better at conversation and become smarter, social engineering becomes a threat too, because they can start being very convincing to the engineers inside companies.
**Dario Amodei:** 哦,是的,确实。实际上——我们在生活中见过很多人类煽动蛊惑(demagoguery)的例子,而且有一种担忧,就是模型也可能做到这一点。
**Dario Amodei:** Oh yeah, yeah. It's actually — we've seen lots of examples of demagoguery in our life from humans, and there's a concern that models could do that as well.
**Lex Fridman:** Claude 越来越强大的方式之一,是它现在能做一些智能体(agentic)的事情——计算机使用(computer use)。在 claude.ai 的沙盒内也有分析功能。但我们来聊聊计算机使用。在我看来这超级令人兴奋——你可以直接给 Claude 一个任务,它采取一系列行动,自己搞定,并通过截图访问你的电脑。你能解释一下这是怎么工作的,以及它的发展方向?
**Lex Fridman:** One of the ways that Claude has been getting more and more powerful is it's now able to do some agentic stuff — computer use. There's also an analysis within the sandbox of claude.ai itself. But let's talk about computer use. That seems to me super exciting — that you can just give Claude a task and it takes a bunch of actions, figures it out, and has access to your computer through screenshots. Can you explain how that works and where that's headed?
**Dario Amodei:** 是的,其实相对简单。Claude 很久以前就有了——从 3 月份的 Claude 3 开始——分析图像并用文字回应的能力。我们唯一新加的是:这些图像可以是电脑截图,作为回应,我们训练模型给出屏幕上可以点击的位置,和/或者要按的键盘按键,以便采取行动。事实证明,只需要不太多的额外训练,模型就能在这个任务上做得相当好。这是泛化能力(generalization)的一个好例子。
**Dario Amodei:** Yeah, it's actually relatively simple. Claude has had for a long time — since Claude 3, back in March — the ability to analyze images and respond to them with text. The only new thing we added is those images can be screenshots of a computer, and in response we train the model to give a location on the screen where you can click and/or buttons on the keyboard you can press in order to take action. And it turns out that with actually not all that much additional training, the models can get quite good at that task. It's a good example of generalization.
**Dario Amodei:** 有人有时会说,如果你到达低轨道(low Earth orbit),你就到了去任何地方的一半,因为脱离引力井需要消耗多少能量。如果你有一个强大的预训练模型,在智能空间里,我感觉你已经到了去任何地方的一半。所以实际上让 Claude 做到这一点并不需要花太多力气。
**Dario Amodei:** People sometimes say if you get to low Earth orbit, you're like halfway to anywhere, because of how much it takes to escape the gravity well. If you have a strong pre-trained model, I feel like you're halfway to anywhere in terms of the intelligence space. And so actually it didn't take all that much to get Claude to do this.
**Dario Amodei:** 你可以把它设置成一个循环——给模型一张截图,告诉它点哪里,给它下一张截图,告诉它点哪里——这就变成了模型几乎像 3D 视频一样的完整交互。它能够完成所有这些任务。我们展示了这些演示,它能填写电子表格,和网站交互,在不同操作系统——Windows、Linux、Mac——上打开各种程序。所以我认为这一切都非常令人兴奋。
**Dario Amodei:** And you can just set that in a loop — give the model a screenshot, tell it what to click on, give it the next screenshot, tell it what to click on — and that turns into a full, almost 3D video interaction of the model. And it's able to do all of these tasks. We showed these demos where it's able to fill out spreadsheets, interact with a website, open all kinds of programs on different operating systems — Windows, Linux, Mac. So I think all of that is very exciting.
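Mechanically, the capability Dario describes reduces to a loop like the sketch below: screenshot in, click or keypress out, repeat. The `desktop` and `model` method names are hypothetical stand-ins; Anthropic's actual computer-use API differs in its details.

```python
# Hypothetical screenshot -> action loop for computer use.
def computer_use_loop(model, desktop, goal: str, max_steps: int = 40):
    history = [("goal", goal)]
    for _ in range(max_steps):
        shot = desktop.take_screenshot()              # image of current screen
        action = model.propose_action(history, shot)  # click(x, y) / key("Enter")
        if action.kind == "done":
            break
        desktop.perform(action)                       # actually click or type
        history.append(("action", action))
    return history
```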
**Dario Amodei:** 我要说的是,虽然理论上,你通过给模型操控电脑屏幕的 API 能做的事情,用这种方式也都能做,但这真的大大降低了门槛。很多人要么没有条件与那些 API 交互,要么需要很长时间才能弄懂。屏幕就是一个通用界面,互动起来容易得多。所以我预计随着时间的推移,这将会消除很多障碍。
**Dario Amodei:** I will say, while in theory there's nothing you could do there that you couldn't have done through just giving the model the API to drive the computer screen, this really lowers the barrier. There are a lot of folks who either aren't in a position to interact with those APIs or it takes them a long time to do. The screen is just a universal interface that's a lot easier to interact with. And so I expect over time this is going to lower a bunch of barriers.
**Dario Amodei:** 老实说,当前的模型——还有很多不如人意的地方,我们在博客里也坦诚了这一点。它会犯错,会点错地方,我们仔细地提醒大家:这个东西不是——你不能就让它在你电脑上跑好几分钟不管。你得给它设定边界和护栏。我认为这也是我们先以 API 形式发布它、而不是直接把它交给消费者让它控制电脑的原因之一。
**Dario Amodei:** Now honestly, the current model — it leaves a lot still to be desired, and we were honest about that in the blog. It makes mistakes, it misclicks, and we were careful to warn people: this thing isn't — you can't just leave this thing to run on your computer for minutes and minutes. You got to give this thing boundaries and guardrails. And I think that's one of the reasons we released it first in an API form rather than just handing it to the consumer and giving it control of their computer.
**Dario Amodei:** 但我确实觉得,把这些能力推出去是很重要的。随着模型变得更强大,我们将不得不思考如何安全地使用这些能力,如何防止它们被滥用。我认为在能力还有限的时候就发布模型,对做到这一点非常有帮助。
**Dario Amodei:** But I definitely feel that it's important to get these capabilities out there. As models get more powerful, we're going to have to grapple with how do we use these capabilities safely, how do we prevent them from being abused. And I think releasing the model while the capabilities are still limited is very helpful in terms of doing that.
**Dario Amodei:** 我认为自从它发布以来,很多客户——我想 Replit 可能是最快部署东西的之一——已经以各种方式利用了它。人们已经为 Windows 桌面、Mac、Linux 机器搭建了演示。所以是的,这非常令人兴奋。我认为和其他任何事情一样,它带来了新的令人兴奋的能力,然后伴随着这些新的令人兴奋的能力,我们必须思考如何让模型安全、可靠地做人类想要它做的事。对所有事情来说都是同样的故事,同样的张力。
**Dario Amodei:** I think since it's been released, a number of customers — I think Replit was maybe one of the quickest to deploy things — have made use of it in various ways. People have hooked up demos for Windows desktops, Macs, Linux machines. So yeah, it's been very exciting. I think as with anything else, it comes with new exciting abilities and then with those new exciting abilities, we have to think about how to make the model safe, reliable, do what humans want them to do. It's the same story for everything, that same tension.
**Lex Fridman:** 但这里的用例可能性——范围真的难以置信。那么为了让它在未来真正做好,你需要在多大程度上超越预训练模型在做的事情——需要做更多后训练(post-training)、RLHF(基于人类反馈的强化学习)、监督微调(supervised fine-tuning)还是合成数据(synthetic data),专门为这种智能体的东西?
**Lex Fridman:** But the possibility of use cases here is just — the range is incredible. So how much to make it work really well in the future, how much do you have to specially go beyond what the pre-trained model is doing — do more post-training, RLHF or supervised fine-tuning or synthetic data, just for the agentic stuff?
**Dario Amodei:** 是的,从宏观上说,我们的意图是持续在让模型变得更好上大力投入。比如,我们看一些基准测试,以前的模型能做到 6% 的正确率,现在我们的模型能达到 14% 或 22% 的正确率。是的,我们想达到人类水平的 80%、90% 的可靠性,就像其他任何地方一样。我们正处于和 SWE-bench 同样的曲线上——我猜一年后模型可以非常非常可靠地做到这件事。但总得从某个地方开始。
**Dario Amodei:** Yeah, I think speaking at a high level, it's our intention to keep investing a lot in making the model better. Like, we look at some of the benchmarks where previous models could do it 6% of the time and now our models do it at 14 or 22% of the time. And yeah, we want to get up to the human-level reliability of 80, 90%, just like anywhere else. We're on the same curve that we were on with SWE-bench, where I think I would guess a year from now the models can do this very, very reliably. But you got to start somewhere.
**Lex Fridman:** 所以你认为做同样的事情有可能达到人类水平的 90%,还是说必须对计算机使用进行专项处理?
**Lex Fridman:** So you think it's possible to get to the human-level 90%, basically doing the same thing you're doing now, or does it have to be special for computer use?
**Dario Amodei:** 我是说,取决于你对"专项"的定义,但我总体认为,我们用来训练当前模型的那些技术——我预期在这些技术上加倍投入,就像我们在代码上、在通用模型上、在其他类型的输入上、在图像输入上、在语音上所做的那样——我预期这些相同的技术会在这里像在其他地方一样得到规模扩展。
**Dario Amodei:** I mean, depends what you mean by "special," but I generally think the same kinds of techniques that we've been using to train the current model — I expect that doubling down on those techniques, in the same way that we have for code, for models in general, for other kinds of input, for image input, for voice — I expect those same techniques will scale here as they have everywhere else.
**Lex Fridman:** 但这其实是在给 Claude 赋予行动的权力。所以它能做很多非常强大的事,但同样也可能造成很大的破坏。
**Lex Fridman:** But this is giving sort of the power of action to Claude. And so you could do a lot of really powerful things, but you could do a lot of damage also.
**Dario Amodei:** 是的,是的,我们对此一直非常清醒。听我说,我的实际看法是,computer use(计算机操控)这个能力并不像 CBRN(化学、生物、放射性、核武器)风险或自主性(autonomy)那样是一种根本性的新能力。它更像是打开了一个窗口,让模型能够去运用和发挥它已有的能力。
**Dario Amodei:** Yeah, yeah, no, and we've been very aware of that. Look, my view actually is computer use isn't a fundamentally new capability like the CBRN or autonomy capabilities are. It's more like it kind of opens the aperture for the model to use and apply its existing abilities.
**Dario Amodei:** 所以我们的思路,回到我们的 RSP(负责任扩展政策,Responsible Scaling Policy),就是这个模型目前所做的事情,从 RSP 的角度来看,本身并不会增加风险。但随着模型越来越强大,当它真的具备了 ASL-3 或 ASL-4 级别的认知能力之后,拥有这个能力可能会变得更加危险。这可能正是解开那道枷锁的东西。所以向前看,这种交互方式当然是我们已经测试过、并且会持续测试的。
**Dario Amodei:** And so the way we think about it, going back to our RSP, is nothing that this model is doing inherently increases the risk from an RSP perspective. But as the models get more powerful, having this capability may make it scarier once it has the cognitive capability to do something at the ASL-3 and ASL-4 level. This may be the thing that kind of unbounds it from doing so. So going forward, certainly this modality of interaction is something we have tested for and that we will continue to test for.
**Dario Amodei:** 我认为,在模型变得超级强大之前,先去学习和探索这种能力,大概是更好的做法。
**Dario Amodei:** I think it's probably better to learn and explore this capability before the model is super capable.
**Lex Fridman:** 是的。而且现在出现了很多有意思的攻击手法,比如提示词注入(prompt injection),因为你把这个窗口打开了。所以可以通过屏幕上显示的内容向模型注入提示词。如果这个能力越来越有用,那向模型注入内容的动机就会越来越强。比如它访问某个网页,可能会遇到无害的东西,比如广告,也可能遇到有害的东西。
**Lex Fridman:** Yeah. And there's a lot of interesting attacks, like prompt injection, because now you've widened the aperture. So you can prompt-inject through stuff on screen. If this becomes more and more useful, then there's more and more benefit to inject stuff into the model. If it goes to a certain webpage, it could be harmless stuff like advertisements, or it could be harmful stuff.
**Dario Amodei:** 对,是的。我们想了很多关于垃圾邮件、CAPTCHA(图灵测试验证码)这类问题——每一种——所有——这里有个秘密告诉你:每当你发明一种新技术,可能不是最大的滥用,但你最先看到的滥用一定是诈骗。就是那种低级的骗局。这就像是……人与人之间相互欺骗是一件古老得不能再古老的事。每次新技术出来,你都得面对这个问题。说出来感觉有点可笑,但这就是事实。
**Dario Amodei:** Right, yeah. I mean, we thought a lot about spam, CAPTCHAs, all of that. One secret I'll tell you: if you've invented a new technology, the first misuse you'll see, though not necessarily the biggest, is scams. Just petty scams. People scamming each other is a thing as old as time. And every time a new technology comes out, you've got to deal with it. It's almost silly to say, but it's true.
**Lex Fridman:** 确实。而且垃圾邮件本来就是个问题。随着技术越来越智能,世界上有很多——就像我说的——低级罪犯。每一种新技术都给这些低级罪犯提供了新的作恶方式。
**Lex Fridman:** Sort of. And spam in general is a thing. As it gets more and more intelligent, there are a lot of — like I said — petty criminals in the world. And every new technology is a new way for petty criminals to do something stupid and malicious.
**Lex Fridman:** 有没有关于沙盒化(sandboxing)的想法——沙盒化这个任务有多难?
**Lex Fridman:** Is there any ideas about sandboxing it — how difficult is the sandboxing task?
**Dario Amodei:** 是的,我们在训练期间会做沙盒化。比如,训练期间我们没有让模型接触互联网。我认为在训练期间让模型接触互联网可能是个坏主意,因为模型可能会改变自己的策略,改变它的行为方式,然后就会对现实世界产生影响。
**Dario Amodei:** Yeah, we sandbox during training. So for example, during training we didn't expose the model to the internet. I think exposing it would probably be a bad idea during training, because the model can be changing its policy, it can be changing what it's doing, and it's having an effect in the real world.
**Dario Amodei:** 至于实际部署模型这一块,就要看具体的应用场景了。有时候你确实希望模型在现实世界中做一些事情。但当然你可以在外部设置护栏。你可以说,"好,这个模型不允许把我电脑或服务器上的任何文件转移到其他地方。"
**Dario Amodei:** In terms of actually deploying the model, it kind of depends on the application. Sometimes you want the model to do something in the real world. But of course you can always put guardrails on the outside. You can say, "Okay, this model is not going to move any files from my computer or my web server to anywhere else."
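An external guardrail of the kind described ("this model is not going to move any files off this machine") can be a simple policy check wrapped around the action executor, sitting entirely outside the model. The action fields and the specific rules below are illustrative assumptions.

```python
FORBIDDEN_KINDS = {"upload", "network_send"}      # never allowed
PROTECTED_PATHS = ("/etc/", "/home/me/secrets/")  # never readable or copyable

def allowed(action) -> bool:
    if action.kind in FORBIDDEN_KINDS:
        return False
    if action.kind in {"read", "copy"} and \
       any(str(action.path).startswith(p) for p in PROTECTED_PATHS):
        return False
    return True

def guarded_perform(desktop, action):
    # The model proposes; a dumb, auditable policy layer disposes.
    if not allowed(action):
        raise PermissionError(f"guardrail blocked action: {action.kind}")
    desktop.perform(action)
```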
**Dario Amodei:** 不过说到沙盒化——还是那句话,等到了 ASL-4,这些预防措施就都不够用了。当你谈到 ASL-4 的时候,理论上有一种担忧:模型可能聪明到足以突破任何一个盒子。所以在那个阶段,我们需要考虑机制可解释性(mechanistic interpretability)——如果我们要建一个沙盒,它必须是一个数学上可以证明的、严格意义上的沙盒。但那是完全不同的另一个世界,跟我们现在处理的这些模型不一样。
**Dario Amodei:** Now, when you talk about sandboxing — again, when we get to ASL-4, none of these precautions are going to make sense there. When you talk about ASL-4, there's a theoretical worry the model could be smart enough to break out of any box. And so there we need to think about mechanistic interpretability — if we're going to have a sandbox, it would need to be a mathematically provable, sound sandbox. But that's a whole different world than what we're dealing with the models today.
**Lex Fridman:** 是的,建造一个 ASL-4 AI 系统无法逃脱的盒子,这门科学——
**Lex Fridman:** Yeah, the science of building a box from which an ASL-4 AI system cannot escape —
**Dario Amodei:** 我认为这大概不是正确的方向。我认为正确的方向,与其试图阻止一个未对齐的系统逃脱——不如直接把模型设计好,或者建立一个循环机制,让你能够看进模型内部、验证它的属性。这样你才有机会去迭代,真正把它做对。我认为,把坏模型关住,远远不如直接造出好模型。
**Dario Amodei:** I think it's probably not the right approach. I think the right approach, instead of having something unaligned that you're trying to prevent from escaping — I think it's better to just design the model the right way, or have a loop where you look inside the model and you're able to verify properties. And that gives you an opportunity to iterate and actually get it right. I think containing bad models is a much worse solution than having good models.
**Lex Fridman:** 让我问问关于监管的事。监管在保障 AI 安全方面扮演什么角色?比如你曾经谈到过加州 AI 监管法案 SB 1047,那个法案最终被州长否决了。这个法案的优点和缺点分别是什么?
**Lex Fridman:** Let me ask about regulation. What's the role of regulation in keeping AI safe? So for example, you described California AI regulation bill SB 1047, that was ultimately vetoed by the governor. What are the pros and cons of this bill?
**Dario Amodei:** 是的,我们最终向那个法案提出了一些建议,其中一些被采纳了,我觉得到最后我们对这个法案的感觉还是相当正面的。当然它还有一些不足,而且最终也被否决了。
**Dario Amodei:** Yeah, so we ended up making some suggestions to the bill and then some of those were adopted, and I think we felt quite positively about the bill by the end of that. It did still have some downsides, and of course it got vetoed.
**Dario Amodei:** 从大方向上看,我认为这个法案背后的一些核心理念,与我们 RSP 背后的理念是相似的。我认为,无论是加州还是联邦政府,或者其他国家和州,某个司法管辖区必须通过某种类似的法规。我可以谈谈为什么我认为这一点非常重要。
**Dario Amodei:** I think at a high level, some of the key ideas behind the bill are, I would say, similar to ideas behind our RSPs. And I think it's very important that some jurisdiction — whether it's California or the federal government and/or other countries and other states — passes some regulation like this. And I can talk through why I think that's so important.
**Dario Amodei:** 我对我们的 RSP 还是感觉不错的。它并不完美,还需要大量迭代,但它一直是一个很好的强制机制——让公司认真对待这些风险,把它们纳入产品规划,真正让它们成为 Anthropic 工作的核心,并确保 Anthropic 接近一千名员工都理解,这是公司最优先的事项之一,如果不是最优先的话。
**Dario Amodei:** So I feel good about our RSP. It's not perfect, it needs to be iterated on a lot, but it's been a good forcing function for getting the company to take these risks seriously, to put them into product planning, to really make them a central part of work at Anthropic, and to make sure that all the almost thousand people at Anthropic understand that this is one of the highest priorities of the company, if not the highest priority.
**Dario Amodei:** 但是,第一,有些公司并没有类似 RSP 的机制。OpenAI、Google 确实在 Anthropic 之后几个月也采纳了这类机制。但还有其他一些公司根本没有这些机制。所以如果只有部分公司采用这些机制,其他公司不采用,就真的会造成一种局面——某些危险具有这样的属性:五家公司里三家是安全的,但另外两家不安全,这根本没用。这就产生了负外部性(negative externality)。
**Dario Amodei:** But one, there are some companies that don't have RSP-like mechanisms. OpenAI, Google did adopt these mechanisms a couple months after Anthropic did. But there are other companies out there that don't have these mechanisms at all. And so if some companies adopt these mechanisms and others don't, it's really going to create a situation where some of these dangers have the property that it doesn't matter if three out of five of the companies are being safe if the other two are being unsafe. It creates this negative externality.
**Dario Amodei:** 而且我认为,这种缺乏一致性的情况对我们这些付出了努力的人也不公平——
**Dario Amodei:** And I think the lack of uniformity is not fair to those of us who have —
**Dario Amodei:** 我们在这些流程上投入了大量心血,思考得非常认真。第二点是,我不认为你能相信这些公司会靠自律来遵守这些自愿计划。我倾向于相信 Anthropic 会做到——我们尽一切努力,我们的 RSP 由我们的长期受益信托(long-term benefit trust)来监督,所以我们尽力遵守自己的 RSP。但你会听说各种关于公司的事:某公司说会做这个,结果没做;说会投入多少算力,结果没投入;说会做某件事,结果没做。我不想去评判某些公司的具体行为,但我认为一个广泛的原则是:如果没有任何人监督它们——没有人监督我们整个行业——就没有任何保证说我们会做正确的事,而赌注又极高。所以我认为,必须有一个统一的标准让所有人遵守,确保整个行业真正做到大多数人已经说过很重要、并且说过自己一定会做的事情。
**Dario Amodei:** We put a lot of effort into being very thoughtful about these procedures. The second thing is I don't think you can trust these companies to adhere to these voluntary plans in their own right. I like to think that Anthropic will — we do everything we can, that we will — our RSP is checked by our long-term benefit trust, so we do everything we can to adhere to our own RSP. But you hear lots of things about various companies saying, "Oh, they said they would do this, they said they would give this much compute and they didn't, they said they would do this thing and they didn't." I don't think it makes sense to litigate particular things that companies have done, but I think this broad principle that if there's nothing watching over them — there's nothing watching over us as an industry — there's no guarantee that we'll do the right thing, and the stakes are very high. And so I think it's important to have a uniform standard that everyone follows and to make sure that simply the industry does what a majority of the industry has already said is important and has already said that they definitely will do.
**Dario Amodei:** 对,有些人——我觉得有一类人从原则上就反对监管。我能理解这种想法从哪里来。去欧洲看看,像 GDPR(通用数据保护条例)这样的东西,还有他们做的一些其他事情——有些是好的,但有些真的是不必要的负担,我认为公平地说,它确实拖慢了创新。所以我理解人们从哪里出发,理解为什么人们会从那个立场开始。但我还是认为,AI 是不同的。如果我们看向我刚才谈到的自主性和滥用的严重风险,我认为那些风险是不同寻常的,值得采取不同寻常的强力应对。所以我认为这非常重要。再说一遍,我们需要一个大家都能认同的东西。
**Dario Amodei:** Right, some people — I think there's a class of people who are against regulation on principle. I understand where that comes from. If you go to Europe and you see something like GDPR, you see some of the other stuff that they've done — some of it's good, but some of it is really unnecessarily burdensome, and I think it's fair to say it really has slowed innovation. And so I understand where people are coming from on priors, I understand why people start from that position. But again, I think AI is different. If we go to the very serious risks of autonomy and misuse that I talked about just a few minutes ago, I think that those are unusual and they warrant an unusually strong response. And so I think it's very important. Again, we need something that everyone can get behind.
**Dario Amodei:** 我认为 SB 1047 的问题之一——尤其是最初版本——是它具备了 RSP 的一些结构,但同时也包含了大量要么很笨重、要么会带来很多负担和麻烦的内容,甚至可能在针对风险方面偏离了目标。在 Twitter 上你不太听到这些——你只看到一方热烈支持任何监管,另一方反对者用那些经常相当缺乏智识诚意的论点,比如"这会让我们离开加州"。法案不适用于总部在加州的公司——法案只适用于在加州开展业务的公司。或者说它会破坏开源生态,或者说它会导致一系列后果。我认为那些大多是无稽之谈。但也有反对监管的更好论点。有个叫 Dean Ball 的人,他真的是一个非常有学问的人,他研究监管落地之后会发生什么,以及它们如何自我演变、如何设计不当。
**Dario Amodei:** I think one of the issues with SB 1047, especially the original version of it, was it had a bunch of the structure of RSPs but it also had a bunch of stuff that was either clunky or that just would have created a bunch of burdens, a bunch of hassle, and might even have missed the target in terms of addressing the risks. You don't really hear about it on Twitter — you just hear about people cheering for any regulation and then the folks who are against make up these often quite intellectually dishonest arguments about how, you know, "It'll make us move away from California." The bill doesn't apply if you're headquartered in California — the bill only applies if you do business in California. Or that it would damage the open-source ecosystem, or that it would cause all of these things. I think those were mostly nonsense. But there are better arguments against regulation. There's one guy, Dean Ball, who's really a very scholarly person who looks at what happens when a regulation is put in place and ways that they can kind of get a life of their own or how they can be poorly designed.
**Dario Amodei:** 所以我们的立场一直是:我们确实认为这个领域应该有监管,但我们想成为一个确保监管具有针对性、精准指向严重风险、让人们真正能够遵守的行动者。因为我认为,监管倡导者没有足够理解的一点是:如果我们搞出一套方向偏差、浪费大量精力的监管,会发生什么?人们会说,"你看,那些安全风险——都是胡说。我不得不雇了十个律师填各种表格,不得不为一个明显没有危险的东西做一堆测试。"这样下去六个月,就会形成一股浪潮,最终我们会得到一个持久的反对监管的共识。所以,那些想要真正问责的人,最大的敌人就是设计糟糕的监管。我们必须把它做对。
**Dario Amodei:** And so our interest has always been: we do think there should be regulation in this space, but we want to be an actor who makes sure that regulation is something that's surgical, that's targeted at the serious risks, and is something people can actually comply with. Because something I think the advocates of regulation don't understand as well as they could is: if we get something in place that is poorly targeted, that wastes a bunch of people's time, what's going to happen is people are going to say, "See, these safety risks — this is nonsense. I just had to hire 10 lawyers to fill out all these forms. I had to run all these tests for something that was clearly not dangerous." And after six months of that, there will be a groundswell and we'll end up with a durable consensus against regulation. And so the worst enemy of those who want real accountability is badly designed regulation. We need to actually get it right.
**Dario Amodei:** 如果我能对倡导者说一句话,那就是:我希望他们能更好地理解这个逻辑。我们需要非常谨慎,需要和那些真正有经验、见过监管在实践中如何运作的人交流。那些人见过这一切,所以他们懂得要非常小心。如果是一个不那么重要的问题,我可能根本就会反对监管。但我希望反对者理解的是,底层问题实际上是严肃的。它们不是我或其他公司因为想垄断监管话语权而编造的。它们不是科幻幻想,也不是任何这类东西。每次我们推出新模型,每隔几个月,我们就会测量这些模型的表现——它们在这些令人担忧的任务上越来越好,就像它们在那些好的、有价值的、经济上有用的任务上越来越好一样。
**Dario Amodei:** And if there's one thing I could say to the advocates, it would be that I want them to understand this dynamic better. We need to be really careful and we need to talk to people who actually have experience seeing how regulations play out in practice. And the people who have seen that understand to be very careful. If this was some lesser issue, I might be against regulation at all. But what I want the opponents to understand is that the underlying issues are actually serious. They're not something that I or the other companies are just making up because of regulatory capture. They're not sci-fi fantasies. They're not any of these things. Every time we have a new model, every few months, we measure the behavior of these models and they're getting better and better at these concerning tasks, just as they are getting better and better at good, valuable, economically useful tasks.
**Dario Amodei:** 所以我真心希望——我认为 SB 1047 太两极化了——我真心希望一些最理性的反对者和一些最理性的支持者能坐下来谈谈。我认为各家 AI 公司中——Anthropic 是唯一一家以非常详细的方式表达了积极立场的。我记得 Elon 发过一条简短的正面推文,但像 Google、OpenAI、Meta、Microsoft 这几家大公司是相当坚决地反对的。所以我真正希望的是,一些关键利益相关方、一些有远见的支持者和一些有远见的反对者,能坐下来讨论:我们怎样才能以一种——支持者感觉真正降低了风险、反对者感觉没有在不必要的程度上妨碍行业或创新的方式——来解决这个问题?不知为何,事情变得太两极化,这两群人没能像应该的那样坐在一起谈。我感到很紧迫——我真的认为我们需要在 2025 年有所行动。如果到了 2025 年底我们仍然什么都没做,我就会开始担忧了。我现在还不担忧,因为风险还没有到来,但我认为时间已经不多了。
**Dario Amodei:** And so I would just love it if — I think SB 1047 was very polarizing — I would love it if some of the most reasonable opponents and some of the most reasonable proponents would sit down together. I think the different AI companies — Anthropic was the only AI company that felt positively in a very detailed way. I think Elon tweeted briefly something positive, but some of the big ones like Google, OpenAI, Meta, Microsoft were pretty stoutly against. So what I would really like is if some of the key stakeholders, some of the thoughtful proponents and some of the most thoughtful opponents, would sit down and say, "How do we solve this problem in a way that the proponents feel brings a real reduction in risk and that the opponents feel is not hampering the industry or hampering innovation any more than it needs to?" And I think for whatever reason things got too polarized and those two groups didn't get to sit down in the way that they should. And I feel urgency — I really think we need to do something in 2025. If we get to the end of 2025 and we've still done nothing about this, then I'm going to be worried. I'm not worried yet because again the risks aren't here yet, but I think time is running short.
**Lex Fridman:** 是的,就像你说的,要找到精准针对的解决方案。
**Lex Fridman:** Yeah, and come up with something surgical, like you said.
**Dario Amodei:** 对,正是。我们需要摆脱那种极端"亲安全"对"极端反监管"的话语框架。它已经变成了 Twitter 上的骂战,这不会有任何好结果。
**Dario Amodei:** Yeah, exactly. And we need to get away from this intense pro-safety versus intense anti-regulatory rhetoric. It's turned into these flame wars on Twitter and nothing good's going to come of that.
**Lex Fridman:** 那我想问问这个行业里的不同玩家。其中一个元老级的就是 OpenAI。你在 OpenAI 有好几年的经历。你在那里的故事和经历是怎样的?
**Lex Fridman:** So there's a lot of curiosity about the different players in the game. One of the OGs is OpenAI. You have had several years of experience at OpenAI. What's your story and history there?
**Dario Amodei:** 是的,我在 OpenAI 待了大概五年。最后几年我是那里的研究副总裁。大概我和 Ilya Sutskever 是真正主导研究方向的人。大概在 2016 或 2017 年,我开始真正相信,或者说确认了我对 scaling hypothesis(规模扩展假说)的信念。那时候 Ilya 有句很出名的话对我说:"你需要理解这些模型的一点是,它们就是想学习。这些模型就是想学习。"有时候会有这样的一句话,像禅宗公案一样,你一听到就会恍然大悟:"啊,这解释了一切——这解释了我见过的一千个现象。"从那以后,我脑子里就有了一个这样的画面:你用对方式去优化模型,用对方式去引导模型,它们就是想学习,就是想解决问题,不管问题是什么。所以基本上,别挡它们的路。
**Dario Amodei:** Yeah, so I was at OpenAI for roughly five years. For the last couple years I was vice president of research there. Probably myself and Ilya Sutskever were the ones who really kind of set the research direction. Around 2016 or 2017 I first started to really believe in, or at least confirm my belief in, the scaling hypothesis, when Ilya famously said to me, "The thing you need to understand about these models is they just want to learn. The models just want to learn." And again, sometimes there are these one sentences, these Zen koans, that you hear them and you're like, "Ah, that explains everything — that explains like a thousand things that I've seen." And then ever after, I had this visualization in my head of like, you optimize the models in the right way, you point the models in the right way, they just want to learn, they just want to solve the problem, regardless of what the problem is. So get out of their way basically.
**Lex Fridman:** 别挡它们的路。
**Lex Fridman:** Get out of their way.
**Dario Amodei:** 对,不要把你自己对它们应该如何学习的想法强加给它们。这和 Rich Sutton 的《苦涩的教训》("The Bitter Lesson")或者 Gwern 的 scaling hypothesis 说的是同一回事。我觉得,大体上的脉络是:我从 Ilya、从其他人,比如做了最初的 GPT-1 的 Alec Radford,那里得到了这种灵感,然后和我的合作者一起拼命往前跑——在 GPT-2、GPT-3、RLHF(基于人类反馈的强化学习,Reinforcement Learning from Human Feedback)上——RLHF 是早期尝试应对早期安全和对齐问题的方法,比如辩论(debate)和放大(amplification),还有大量的可解释性(interpretability)工作。所以说,还是那个组合——安全加上规模扩展。大概 2018、2019、2020 年——那几年是我和我的合作者——其中很多人后来成了 Anthropic 的联合创始人——真正形成了一个愿景并推动方向的年份。
**Dario Amodei:** Yeah, don't impose your own ideas about how they should learn. And this was the same thing as Rich Sutton put out in "The Bitter Lesson" or Gwern put out in the scaling hypothesis. I think generally the dynamic was I got this kind of inspiration from Ilya, from others, folks like Alec Radford who did the original GPT-1, and then ran really hard with it — me and my collaborators on GPT-2, GPT-3, RL from human feedback, which was an early attempt to deal with safety and alignment, things like debate and amplification, and heavy work on interpretability. So again, the combination of safety plus scaling. Probably 2018, 2019, 2020 — those were kind of the years when myself and my collaborators, many of whom became co-founders of Anthropic, kind of really had a vision and drove the direction.
**Lex Fridman:** 你为什么离开?你为什么决定离开?
**Lex Fridman:** Why'd you leave? Why'd you decide to leave?
**Dario Amodei:** 是的,听我这样说,我觉得这和"向上竞争"(race to the top)是相关的。在我在 OpenAI 的时间里,随着我越来越认可 scaling hypothesis,也越来越认识到安全与 scaling hypothesis 同样重要——第一点,我认为 OpenAI 是认同的。第二点,在某种程度上,一直以来都是 OpenAI 对外传达的一部分。但在我待在那里的多年时间里,我形成了一种特定的愿景——关于我们应该如何处理这些事情,如何把它们带到世界上,这个组织应该有什么样的原则。
**Dario Amodei:** Yeah, so look, I'm going to put things this way, and I think it ties to the race to the top. In my time at OpenAI, what I'd come to see — as I'd come to appreciate the scaling hypothesis and as I'd come to appreciate the importance of safety along with the scaling hypothesis — the first one I think OpenAI was getting on board with. The second one, in a way, had always been part of OpenAI's messaging. But over many years of the time that I spent there, I think I had a particular vision of how we should handle these things, how they should be brought out in the world, the kind of principles that the organization should have.
**Dario Amodei:** 你知道,关于"公司应该做这个还是那个"有很多讨论,外面也有很多错误信息。有人说我们是因为不喜欢和 Microsoft 的交易而离开——这是假的,虽然确实有大量讨论,很多关于怎么做 Microsoft 交易的疑问。有人说我们是因为不喜欢商业化而离开——这也不对。我们做了 GPT-3,那正是被商业化的模型。我参与了商业化的过程。这更多的是,还是那个"怎么做"的问题。就是说,人类文明正走在一条通往超强 AI 的道路上。怎样去做,才是谨慎的、坦诚的、诚实的,才能建立人们对这个组织和个人的信任?我们怎样从现在走到那里?我们怎样对"如何做对"有一个真正的愿景?安全怎样才能不只是我们说说而已、只是为了帮助招募人才?
**Dario Amodei:** And look, there were many discussions about, "Should the company do this? Should the company do that?" There's a bunch of misinformation out there. People say we left because we didn't like the deal with Microsoft — false, although there was a lot of discussion, a lot of questions about exactly how we do the deal with Microsoft. We left because we didn't like commercialization — that's not true. We built GPT-3, which was the model that was commercialized. I was involved in commercialization. It's more again about how do you do it. Like, civilization is going down this path to very powerful AI. What's the way to do it that is cautious, straightforward, honest, that builds trust in the organization and in individuals? How do we get from here to there? And how do we have a real vision for how to get it right? How can safety not just be something we say because it helps with recruiting?
**Dario Amodei:** 我认为归根结底,如果你有这样一个愿景——先不说任何其他人的愿景,我不想评论别人的愿景——如果你有一个愿景,你就应该出去,去实现那个愿景。试图去与另一个人的愿景争论,是非常没有成效的。你可能觉得他们做的方式不对,你可能觉得他们不诚实——谁知道呢,也许你是对的,也许你不是。但你应该做的是,带上你信任的一些人,一起出去,让你的愿景成真。如果你的愿景是有说服力的,如果你能让它吸引到人——在伦理上、在市场上,某种组合——如果你能建立一个人们想加入的公司,这个公司从事着人们认为合理的实践,同时也能维持在生态系统中的地位——如果你做到了这些,别人就会效仿你。而你这样做的事实,特别是如果你做得比他们更好,会以一种远比你去和你的老板争论有力得多的方式,促使他们改变自己的行为。
**Dario Amodei:** And I think at the end of the day, if you have a vision for that — forget about anyone else's vision, I don't want to talk about anyone else's vision — if you have a vision for how to do it, you should go off and you should do that vision. It is incredibly unproductive to try and argue with someone else's vision. You might think they're not doing it the right way, you might think they're dishonest — who knows, maybe you're right, maybe you're not. But what you should do is take some people you trust, go off together, and make your vision happen. And if your vision is compelling, if you can make it appeal to people — some combination of ethically, in the market — if you can make a company that's a place people want to join, that engages in practices that people think are reasonable, while managing to maintain its position in the ecosystem at the same time — if you do that, people will copy it. And the fact that you were doing it, especially the fact that you're doing it better than they are, causes them to change their behavior in a much more compelling way than if they're your boss and you're arguing with them.
**Dario Amodei:** 我真不知道怎么说得比这更具体了,但我认为,试图让别人的愿景变得像你的愿景,通常是非常没有成效的。更有成效的是出去做一个干净的实验,然后说:"这是我们的愿景,这是我们要做事情的方式。你的选择是:你可以无视我们,你可以拒绝我们在做的事,或者你可以开始变得更像我们。"模仿是最诚挚的恭维。这会体现在客户的行为上,体现在公众的行为上,体现在人们选择在哪里工作这件事上。
**Dario Amodei:** I just don't know how to be any more specific about it than that, but I think it's generally very unproductive to try and get someone else's vision to look like your vision. It's much more productive to go off and do a clean experiment and say, "This is our vision, this is how we're going to do things. Your choice is you can ignore us, you can reject what we're doing, or you can start to become more like us." And imitation is the sincerest form of flattery. And that plays out in the behavior of customers, that plays out in the behavior of the public, that plays out in the behavior of where people choose to work.
**Dario Amodei:** 而且说到底,这不是关于某家公司赢或另一家公司赢。如果我们或另一家公司在从事某种人们真正认同的实践——我希望是实质上的,而不只是表面上的——我认为研究人员是很敏锐的,他们看的是实质——然后其他公司开始效仿这种实践,并且因为效仿了这种实践而赢了,那很好。那就是成功。那就是"向上竞争"。最后谁赢并不重要,重要的是每个人都在效仿彼此的好做法。
**Dario Amodei:** And again, at the end, it's not about one company winning or another company winning. If we or another company are engaging in some practice that people find genuinely appealing — and I want it to be in substance, not just in appearance — and I think researchers are sophisticated and they look at substance — and then other companies start copying that practice and they win because they copied that practice, that's great. That's success. That's like the race to the top. It doesn't matter who wins in the end as long as everyone is copying everyone else's good practices.
**Dario Amodei:** 我对这件事的一种理解方式是:我们所有人都害怕的是"向下竞争"(race to the bottom)。在向下竞争中,谁赢了并不重要,因为我们都输了。在最极端的情况下,我们造出了这种自主 AI,然后机器人把我们奴役——我是半开玩笑,但那确实是可能发生的最极端的事情。那么到时候哪家公司领先就无所谓了。但如果你创造的是一种向上竞争,人们在竞相践行良好的做法,那么最终谁赢了其实无所谓。谁发起了这场向上竞争也无所谓。重点不是显得有美德——重点是把这个系统带入一个比之前更好的均衡状态。而个别公司可以在这个过程中发挥一些作用,帮助启动它,帮助加速它。坦率地说,我认为其他公司的个人也做了这些——那些在我们发布 RSP 之后,回去在自己公司里更努力推动类似事情的人。有时候其他公司也会做一些我们觉得很好、我们认为值得效仿的实践。
**Dario Amodei:** One way I think of it is like: the thing we're all afraid of is a race to the bottom. In the race to the bottom, it doesn't matter who wins because we all lose. In the most extreme world, we make this autonomous AI that, you know, the robots enslave us or whatever — I mean, that's half joking, but that is the most extreme thing that could happen. Then it doesn't matter which company was ahead. If instead you create a race to the top where people are competing to engage in good practices, then at the end of the day it doesn't matter who ends up winning. It doesn't even matter who started the race to the top. The point isn't to be virtuous — the point is to get the system into a better equilibrium than it was before. And individual companies can play some role in doing this, individual companies can help to start it, can help to accelerate it. And frankly, I think individuals at other companies have done this as well — the individuals that, when we put out an RSP, react by pushing harder to get something similar done at other companies. Sometimes other companies do something that's like, we're like, "Oh, it's a good practice, we think that's good, we should adopt it too."
**Dario Amodei:** 唯一的区别是,我认为我们尝试更主动一些——我们尝试率先采用更多这些实践,并在别人发明它们的时候更快地采用。但我认为这种动态才是我们应该着眼的。这样一来,哪家公司在赢、谁信任谁,这些问题都被抽象掉了。我认为所有这些关于争议和八卦的问题都无聊得很,真正重要的是我们所有人运作其中的生态系统,以及如何让这个生态系统变得更好,因为生态系统约束着所有的玩家。
**Dario Amodei:** The only difference is I think we try to be more forward-leaning — we try and adopt more of these practices first and adopt them more quickly when others invent them. But I think this dynamic is what we should be pointing at. And I think it abstracts away the question of which company's winning, who trusts who. I think all these questions of drama are profoundly uninteresting, and the thing that matters is the ecosystem that we all operate in and how to make that ecosystem better, because that constrains all the players.
**Lex Fridman:** 所以 Anthropic 是这样一个干净的实验,建立在对 AI 安全具体应该是什么样子这一基础之上。
**Lex Fridman:** And so Anthropic is this kind of clean experiment built on a foundation of what concretely AI safety should look like.
**Dario Amodei:** 你知道,我相信我们一路上犯了很多错误。完美的组织是不存在的。它要应对接近一千名员工的不完美,要应对我们领导者的不完美——包括我自己,要应对我们用来监督领导者不完美的人的不完美,比如董事会和长期受益信托。这是一群不完美的人,在努力不完美地瞄准一个永远不会完美实现的理想。这就是你签约的东西,这就是它永远的样子。但不完美不意味着你就放弃了。有好有坏,希望我们能做得足够好,开始建立一些整个行业都能参与的实践。而且我猜,这些公司中会有好几家是成功的。Anthropic 会成功,我以前待过的那些公司也会成功。有些会比其他的更成功。这比——再说一遍——对齐整个行业的激励机制要次要。而这部分是通过向上竞争实现的,部分是通过 RSP 这类东西实现的,也部分是通过精准的、有针对性的监管实现的。
**Dario Amodei:** Look, I'm sure we've made plenty of mistakes along the way. The perfect organization doesn't exist. It has to deal with the imperfection of a thousand employees. It has to deal with the imperfection of our leaders, including me. It has to deal with the imperfection of the people we've put to oversee the imperfection of the leaders, like the board and the long-term benefit trust. It's all a set of imperfect people trying to aim imperfectly at some ideal that will never perfectly be achieved. That's what you sign up for, that's what it will always be. But imperfect doesn't mean you just give up. There's better and there's worse, and hopefully we can do well enough that we can begin to build some practices that the whole industry engages in. And then, my guess is that multiple of these companies will be successful. Anthropic will be successful, these other companies — like ones I've been at in the past — will also be successful. And some will be more successful than others. That's less important than, again, that we align the incentives of the industry. And that happens partly through the race to the top, partly through things like RSPs, partly through, again, selected surgical regulation.
**Lex Fridman:** 你说过"人才密度胜过人才总量"(talent density beats talent mass)。你能解释一下吗?能展开讲讲吗?你觉得组建一支优秀的 AI 研究员和工程师团队需要什么?
**Lex Fridman:** You said talent density beats talent mass. Can you explain that? Can you expand on it? Can you just talk about what it takes to build a great team of AI researchers and engineers?
**Dario Amodei:** 这是一个我每个月都觉得越来越真的判断。每个月我都比上个月更相信这句话。我来做个思想实验:假设你有一支 100 人的团队,他们都非常聪明、充满动力、与使命高度一致,这就是你的公司。或者你可以有一支 1000 人的团队,其中 200 人非常聪明、非常认同使命,另外 800 人嘛——我们就说你随机从大型科技公司里挑了 800 个员工。你更愿意要哪个?
**Dario Amodei:** This is one of these statements that's like more true every month. Every month I see this statement as more true than I did the month before. So if I were to do a thought experiment: let's say you have a team of 100 people that are super smart, motivated, and aligned with the mission, and that's your company. Or you can have a team of a thousand people where 200 people are super smart, super aligned with the mission, and then 800 people are — let's just say you pick 800 random big tech employees. Which would you rather have?
**Dario Amodei:** 从人才总量来看,那支 1000 人的团队更大——你有数量更多的极其有才华、极其一致、极其聪明的人。但问题在于,每当一个超级有才华的人环顾四周,看到的都是另一个超级有才华、超级投入的人,这就会为一切定下基调。这种基调是:每个人都超级有动力在同一个地方工作,每个人都信任彼此。如果你有一千甚至一万人,情况真的退化了——你无法做到充分筛选,你在随机挑人——那么接下来你就需要建立大量的流程和护栏,仅仅是因为人们不完全信任彼此。你需要裁决政治斗争。有太多太多的事情会拖慢整个组织的运转效率。
**Dario Amodei:** The talent mass is greater in the group of a thousand people — you have an even larger number of incredibly talented, incredibly aligned, incredibly smart people. But the issue is just that if every time someone super talented looks around, they see someone else super talented and super dedicated, that sets the tone for everything. That sets the tone for everyone is super inspired to work at the same place, everyone trusts everyone else. If you have a thousand or 10,000 people and things have really regressed — you are not able to do selection and you're choosing random people — what happens is then you need to put a lot of process and a lot of guardrails in place, just because people don't fully trust each other. You have to adjudicate political battles. There are so many things that slow down the org's ability to operate.
**Dario Amodei:** 所以我们接近一千人,我们一直努力让这一千人里尽可能大的比例都是超级有才华、超级有能力的人。这也是我们最近几个月大幅放缓招聘的原因之一。今年前七八个月我们从 300 人增长到 800 人,现在我们放缓了——最近三个月我们从 800 增长到 900、950,大概是这个数字。具体数字请别引用我说的。但我认为在一千人左右有一个拐点,我们想要更加谨慎地增长。
**Dario Amodei:** And so we're nearly a thousand people, and we've tried to make it so that as large a fraction of those thousand people as possible are super talented, super skilled. It's one of the reasons we've slowed down hiring a lot in the last few months. We grew from 300 to 800, I believe, in the first seven or eight months of the year, and now we've slowed down — the last three months we went from 800 to 900, 950, something like that. Don't quote me on the exact numbers. But I think there's an inflection point around a thousand, and we want to be much more careful how we grow.
**Dario Amodei:** 早期和现在,我们招了很多物理学家。理论物理学家学东西特别快。随着最近持续招聘,我们在研究端和软件工程端都保持了很高的门槛,招了很多高级人才,包括以前在这个领域其他公司工作过的人。我们一直持续保持高度挑剔。从 100 人到 1000 人,再从 1000 人到一万人,很容易在没有注意力的情况下就失去了"每个人都有共同目标"这件事。如果你的公司由一大堆各自为政的小王国组成,每个人都在为自己的事情优化——那做任何事情都极其艰难。但如果每个人都看到公司更广阔的目标,如果有信任,如果有对做正确事情的投入,那就是一种超能力,我认为这种超能力本身几乎可以克服其他所有劣势。
**Dario Amodei:** Early on and now as well, we've hired a lot of physicists. Theoretical physicists can learn things really fast. Even more recently, as we've continued to hire, we've really had a high bar on both the research side and the software engineering side, have hired a lot of senior people including folks who used to be at other companies in this space. And we've just continued to be very selective. It's very easy to go from 100 to a thousand, and a thousand to 10,000, without paying attention to making sure everyone has a unified purpose. If your company consists of a lot of different fiefdoms that all want to do their own thing, each optimizing for its own thing, it's very hard to get anything done. But if everyone sees the broader purpose of the company, if there's trust and there's dedication to doing the right thing, that is a superpower that in itself I think can overcome almost every other disadvantage.
**Lex Fridman:** 这就是 Steve Jobs 说的那个道理——A 级人才想环顾四周,看到的都是其他 A 级人才。这是另一种表达方式。我不知道这是关于人性的什么东西,但看到不是全力追求单一使命的人,确实会让人丧失动力;反过来,看到那样的人,确实会超级有激励。
**Lex Fridman:** And it's the Steve Jobs thing — A players want to look around and see other A players. It's another way of saying it. I don't know what that is about human nature, but it is demotivating to see people who are not obsessively driving towards a singular mission, and it is on the flip side super motivating to see that.
**Lex Fridman:** 很有意思。从你和这么多优秀的人共事的经历来看,要成为一名出色的 AI 研究员或工程师,需要具备什么素质?
**Lex Fridman:** It's interesting. What's it take to be a great AI researcher or engineer, from everything you've seen from working with so many amazing people?
**Dario Amodei:** 我认为最重要的素质,尤其是在研究这一侧,但其实两边都一样,那就是开放的心态。听起来很简单,对吧?你会觉得,"哦,我对什么都接受。"但如果我回顾自己在 Scaling 假说(scaling hypothesis)方面的早期经历——我看到的数据跟其他人看到的一模一样。我不觉得自己在编程或提出研究想法方面比我共事过的数百个人更厉害,某些方面甚至还不如他们。我从来不擅长精确编程、找 bug、写 GPU kernel——这里我能给你指出一百个在这方面比我强的人。
**Dario Amodei:** I think the number one quality, especially on the research side but really both, is open-mindedness. Sounds easy to be open-minded, right? You're just like, "Oh, I'm open to anything." But if I think about my own early history in the scaling hypothesis, I was seeing the same data others were seeing. I don't think I was a better programmer or better at coming up with research ideas than any of the hundreds of people that I worked with. In some ways I was worse. I've never been great at precise programming, finding the bug, writing the GPU kernels — I could point you to a hundred people here who are better at that than I am.
**Dario Amodei:** 但我觉得我确实有一点与众不同,就是我愿意用全新的眼光去看待事物。别人都说"哦,我们还没找到合适的算法,我们还没想出做事情的正确方法。"而我只是在想,"我不知道,这个神经网络有 3000 万个参数——要是给它 5000 万个呢?来画几张图看看。"这种基本的科学思维——"我看到一个可以改变的变量,改变它会发生什么?来试试这些不同的设置,画出一张图。"就这么简单。这根本不是博士级别的实验设计——就是简单粗暴的想法。只要有人告诉你这件事很重要,任何人都能做到。而且理解起来也不难,你不需要很聪明才能想到。但把这两点结合在一起,只有少数几个人,个位数的人,靠着认识到这一点推动了整个领域的进步。事情往往就是这样——回顾历史上的各种发现,经常都是这种模式。
**Dario Amodei:** But the thing that I think I did have that was different was that I was just willing to look at something with new eyes. People said, "Oh, we don't have the right algorithms yet, we haven't come up with the right way to do things." And I was just like, "I don't know, this neural net has 30 million parameters — what if we gave it 50 million instead? Let's plot some graphs." That basic scientific mindset of, "I see some variable that I could change — what happens when it changes? Let's try these different things and create a graph." This was the simplest thing in the world. It wasn't PhD-level experimental design; it was simple and stupid. Anyone could have done this if you just told them that it was important. It's also not hard to understand; you didn't need to be brilliant to come up with this. But you put the two things together, and some tiny number of people, some single-digit number of people, have driven forward the whole field by realizing this. And it's often like that — if you look back at the discoveries in history, they're often like that.
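The experiment being described is easy to make concrete. Below is a minimal PyTorch sketch of the "change one variable and plot it" workflow: train the same toy network at several widths and tabulate final loss against parameter count. The data, architecture, and sizes are illustrative assumptions, not anything from the actual history.

```python
# The simplest scaling experiment: same task, same training loop, and the one
# variable changed is model width. Toy data and model; sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(4096, 32)                       # toy inputs
y = torch.sin(X).sum(dim=1, keepdim=True)       # toy target function

def make_model(width: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(32, width), nn.ReLU(), nn.Linear(width, 1))

def train(width: int, steps: int = 500) -> float:
    model = make_model(width)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()

for width in [8, 32, 128, 512]:                 # the variable being changed
    n_params = sum(p.numel() for p in make_model(width).parameters())
    print(f"width={width:>4}  params={n_params:>7}  final_loss={train(width):.4f}")
```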
**Dario Amodei:** 所以这种开放心态,这种用新眼光去看的意愿——往往来自于刚进入这个领域。经验往往反而是一种劣势。这是最重要的事情。很难去寻找和测试,但我认为它是最重要的,因为当你找到某种真正全新的思考方式时,当你有主动性去做到这一点时,效果是彻底颠覆性的。
**Dario Amodei:** And so this open-mindedness and this willingness to see with new eyes — that often comes from being newer to the field. Often experience is a disadvantage for this. That is the most important thing. It's very hard to look for and test for, but I think it's the most important thing, because when you find some really new way of thinking about things, when you have the initiative to do that, it's absolutely transformative.
**Lex Fridman:** 还有就是能够快速实验,在这个过程中保持开放和好奇,用全新的眼光去看数据,看它到底在说什么。机制可解释性(mechanistic interpretability)也是这方面的例子。机制可解释性的早期工作是非常简单的,只是之前没有人想过要去关注这个问题。
**Lex Fridman:** And also being able to do rapid experimentation, and in the face of that be open-minded and curious, looking at the data from just these fresh eyes and seeing what is it actually saying. That applies in mechanistic interpretability — it's another example of this. Some of the early work in mechanistic interpretability was so simple. It's just no one thought to care about this question before.
**Lex Fridman:** 你说到了成为优秀 AI 研究员需要具备什么。我们能不能倒退一步——你会给那些对 AI 感兴趣的人什么建议?他们还年轻,面向未来。怎样才能对世界产生影响?
**Lex Fridman:** You said what it takes to be a great AI researcher. Can we rewind the clock back — what advice would you give to people interested in AI? They're young, looking forward. How can I make an impact on the world?
**Dario Amodei:** 我最重要的建议就是,直接动手玩这些模型。说起来我还有点担心,因为现在这听起来像是显而易见的建议。我觉得三年前这并不显而易见,人们当时的做法是"哦,让我去读最新的强化学习论文,让我……"不,我的意思是,那确实曾经是主流路径,而且你当然也应该那么做。但现在,随着模型和 API 更广泛地开放,人们开始越来越多地这样做了。我认为这种经验性的认识非常重要——这些模型是全新的产物,没有人真正理解它们——所以动手使用它们积累经验是很有价值的。
**Dario Amodei:** I think my number one piece of advice is to just start playing with the models. I worry a little that this now sounds like obvious advice. I think three years ago it wasn't obvious, and people started by, "Oh, let me read the latest reinforcement learning paper, let me..." No, I mean, that really was the approach, and you should do that as well. But now, with wider availability of models and APIs, people are doing this more. I think experiential knowledge matters — these models are new artifacts that no one really understands — and so getting experience playing with them is valuable.
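For "just start playing with the models," the barrier is genuinely low. A minimal sketch using the `anthropic` Python SDK; the model id below is an assumption, so check the current documentation before running.

```python
# Minimal "play with the model" loop. Assumes `pip install anthropic` and an
# ANTHROPIC_API_KEY set in the environment; the model id is an assumption.
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",   # assumed model id; check current docs
    max_tokens=256,
    messages=[{"role": "user", "content": "Explain the scaling hypothesis in two sentences."}],
)
print(response.content[0].text)
```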
**Dario Amodei:** 我还想说,同样呼应"做一些新的事情、往新方向想"——有很多东西还没有被探索过。比如,机制可解释性(mechanistic interpretability)仍然非常新。在这方面下功夫可能比研究新的模型架构更好,因为现在做这块的人虽然比以前多了,大概有一百个人在做——但还没有一万个人在做。这是一个非常肥沃的研究领域,有大量的低垂果实随处可摘。不知为何,人们对它的兴趣还不够充分。
**Dario Amodei:** I would also say, again in line with "do something new, think in some new direction" — there are all these things that haven't been explored. For example, mechanistic interpretability is still very new. It's probably better to work on that than it is to work on new model architectures, because it's more popular than it was before — there are probably like a hundred people working on it — but there aren't like 10,000 people working on it. And it's just this fertile area for study. There's so much low-hanging fruit you can just walk by and pick things. And for whatever reason, people aren't interested in it enough.
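As one indication of how approachable the area is, a toy sparse autoencoder, the tool discussed in the Chris Olah portion of this interview, fits in a few lines. This is a sketch under simplified assumptions, with random tensors standing in for recorded model activations; it is not the method actually run on Claude.

```python
# Toy sparse autoencoder: learn an overcomplete dictionary of directions such
# that activations are reconstructed from a sparse (L1-penalized) code.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_features = 64, 512            # illustrative sizes
acts = torch.randn(10_000, d_model)      # placeholder for recorded activations

enc = nn.Linear(d_model, d_features)
dec = nn.Linear(d_features, d_model)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

l1_coeff = 1e-3                          # sparsity pressure
for _ in range(1_000):
    batch = acts[torch.randint(0, len(acts), (256,))]
    feats = torch.relu(enc(batch))       # sparse feature activations
    recon = dec(feats)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Columns of dec.weight are candidate interpretable feature directions.
```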
**Dario Amodei:** 我认为在长时域学习(long-horizon learning)和长时域任务(long-horizon tasks)方面有很多事情可以做。我认为评估(evaluations)也是——我们在研究评估方面仍处于非常早期的阶段,尤其是针对在真实世界中行动的动态系统。多智能体(multi-agent)这块我觉得也有一些值得探索的地方。我的建议是:滑向冰球将要去的地方。而且你不需要很聪明就能想到——五年后所有令人兴奋的事情,人们甚至会把它当作常识来提,但不知为何总有一道障碍,人们没有像本可以做到的那样加倍投入,或者他们害怕做不流行的事情。我不知道为什么会这样,但跨越这道障碍——这是我最重要的建议。
**Dario Amodei:** I think there are some things around long-horizon learning and long-horizon tasks where there's a lot to be done. I think evaluations — we're still very early in our ability to study evaluations, particularly for dynamic systems acting in the world. I think there's some stuff around multi-agent. Skate where the puck is going is my advice. And you don't have to be brilliant to think of it — all the things that are going to be exciting in five years, people even mention them as conventional wisdom, but somehow there's this barrier that people don't double down as much as they could, or they're afraid to do something that's not the popular thing. I don't know why it happens, but getting over that barrier — that's my number one piece of advice.
**Lex Fridman:** 我们能不能聊聊后训练(post-training)?
**Lex Fridman:** Let's talk, if we could, a bit about post-training.
**Dario Amodei:** 当然。
**Dario Amodei:** Yeah.
**Lex Fridman:** 现代后训练方案似乎包含了各种各样的东西——监督微调(supervised fine-tuning,SFT)、基于人类反馈的强化学习(RLHF)、用 RLAIF 实现的 Constitutional AI。最佳缩写奖——又是那个命名的事。然后还有合成数据,似乎用了大量合成数据,或者至少是在努力找方法获取高质量的合成数据。那么——如果说这是让 Anthropic 的 Claude 如此出色的秘诀——有多少魔力来自于预训练(pre-training),有多少来自于后训练?
**Lex Fridman:** So it seems that the modern post-training recipe has a little bit of everything — supervised fine-tuning, RLHF, the Constitutional AI with RLAIF. Best acronym — it's again that naming thing. And then synthetic data seems like a lot of synthetic data, or at least trying to figure out ways to have high-quality synthetic data. So what's the — if this is the secret sauce that makes Anthropic's Claude so incredible — how much of the magic is in the pre-training, how much of it is in the post-training?
**Dario Amodei:** 是这样的,首先,我们自己也无法完美地衡量这一点。当你看到某种出色的能力时,有时候很难判断它来自于预训练还是后训练。我们开发了一些方法来区分这两者,但并不完美。第二,我想说的是,当存在某种优势时——我认为我们在强化学习(RL)方面总体上做得相当不错,也许是最好的,虽然我不确定,因为我看不到其他公司内部的情况——通常不是"哦天哪,我们有别人没有的秘密魔法方法。"通常是这样的:我们把基础设施做得更好,所以能跑更久;或者我们能获取更高质量的数据;或者我们能更好地过滤数据;或者我们能把这些方法结合起来。实际上,这通常是一些无聊的实践经验和操作技巧的问题。
**Dario Amodei:** Yeah, I mean, so first of all, we're not perfectly able to measure that ourselves. When you see some great capability, sometimes it's hard to tell whether it came from pre-training or post-training. We've developed ways to try and distinguish between those two, but they're not perfect. The second thing I would say is, when there is an advantage — and I think we've been pretty good in general at RL, perhaps the best, although I don't know because I don't see what goes on inside other companies — usually it isn't "oh my God, we have this secret magic method that others don't have." Usually it's like, well, we got better at the infrastructure so we could run it for longer, or we were able to get higher-quality data, or we were able to filter our data better, or we were able to combine these methods. In practice, it's usually some boring matter of practice and tradecraft.
**Dario Amodei:** 所以当我思考如何在训练这些模型方面做出特别的东西时,无论是预训练还是更多的后训练,我更多地把它看作像是设计飞机或汽车。这不只是"哦,我有了设计图"——也许设计图能让你造出下一架飞机,但还有一种文化性的操作技巧,关于我们如何思考设计过程,我认为这比我们能发明的任何特定小工具都更重要。
**Dario Amodei:** So when I think about how to do something special in terms of how we train these models, both pre-training but even more so post-training, I really think of it a little more like designing airplanes or cars. It's not just, "Oh man, I have the blueprint" — like, maybe that makes you make the next airplane, but there's some cultural tradecraft of how we think about the design process that I think is more important than any particular gizmo we're able to invent.
**Lex Fridman:** 好的,那让我问你一些具体的技术。先聊聊 RLHF——你觉得,从更宏观的、几乎是直觉和哲学的角度来看——你为什么认为 RLHF 效果这么好?
**Lex Fridman:** Okay, well let me ask you about specific techniques. So first on RLHF — what do you think, just zooming out, almost intuitively, philosophically — why do you think RLHF works so well?
**Dario Amodei:** 如果我回到 Scaling 假说,表述 Scaling 假说的一种方式是:如果你针对 X 进行训练,并投入足够的算力,那么你就能得到 X。所以 RLHF 擅长做人类想让模型做的事,或者——更精确地说——做那些在短时间内查看模型、考虑不同可能的回答后,认为该回答更好的人类所偏好的事情。这从安全和能力两个角度来说都不完美,因为人类往往不能完美地识别自己想要什么,而人类当下想要的,可能不是他们长远想要的。所以这里有很多微妙之处。但模型确实擅长产出人类在某种浅层意义上想要的东西。而且实际上你甚至不需要投入太多算力,因为还有另一件事——这就是"一个强大的预训练模型已经距任何地方只差一半"这个概念。所以一旦你有了预训练模型,你就已经拥有了让模型走向你想去之处所需的所有表征。
**Dario Amodei:** If I go back to the scaling hypothesis, one of the ways to state the scaling hypothesis is: if you train for X and you throw enough compute at it, then you get X. And so RLHF is good at doing what humans want the model to do, or at least — to state it more precisely — doing what humans who look at the model for a brief period of time and consider different possible responses prefer as the response. Which is not perfect, from both a safety and capabilities perspective, in that humans are often not able to perfectly identify what they want, and what humans want in the moment may not be what they want in the long term. So there's a lot of subtlety there. But the models are good at producing what the humans in some shallow sense want. And it actually turns out that you don't even have to throw that much compute at it, because of another thing, which is this idea about a strong pre-trained model being halfway to anywhere. So once you have the pre-trained model, you have all the representations you need to get the model where you want it to go.
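The preference-comparison setup Dario describes has a standard training form: fit a reward model so the human-chosen response scores higher than the rejected one. Below is a minimal Bradley-Terry-style sketch; the linear scorer over random placeholder embeddings is an assumption standing in for a real language model head.

```python
# Minimal sketch of RLHF's reward-modeling step: given pairs where a human
# preferred one response over another, train a scorer so the chosen response
# gets the higher reward. The linear layer and random "embeddings" are
# placeholders, not a real language model.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
reward_model = nn.Linear(128, 1)          # stand-in for "embed response, score it"
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def preference_loss(chosen_emb: torch.Tensor, rejected_emb: torch.Tensor) -> torch.Tensor:
    r_chosen = reward_model(chosen_emb)
    r_rejected = reward_model(rejected_emb)
    # Maximize P(chosen beats rejected) = sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

chosen = torch.randn(32, 128)             # placeholder embeddings of preferred responses
rejected = torch.randn(32, 128)           # placeholder embeddings of rejected responses
opt.zero_grad()
preference_loss(chosen, rejected).backward()
opt.step()
```

The reward model then supplies the RL signal: the policy is tuned toward responses it scores highly, which is what "producing what humans in some shallow sense want" cashes out to.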
**Lex Fridman:** 那你认为 RLHF 是让模型变得更聪明,还是只是让它在人类眼中显得更聪明?
**Lex Fridman:** So do you think RLHF makes the model smarter, or just appears smarter to the humans?
**Dario Amodei:** 我不认为它让模型变得更聪明。我也不认为它只是让模型看起来更聪明。RLHF 更像是弥合了人类与模型之间的鸿沟。我可以有一个非常聪明但完全无法沟通的东西——我们都认识这样的人,他们很聪明但你听不懂他们在说什么。所以我认为 RLHF 只是弥合了这个鸿沟。我认为这不是我们做的唯一一种 RL,也不是未来唯一会用到的 RL。我认为 RL 有潜力让模型变得更聪明,让它们推理得更好,运作得更好,甚至发展出新的技能。也许在某些情况下这甚至可以通过人类反馈来完成。但我们今天所做的这种 RLHF,大多数情况下还没有做到这一点,尽管我们正在非常快速地开始能够做到了。
**Dario Amodei:** I don't think it makes the model smarter. I don't think it just makes the model appear smarter. It's like RLHF bridges the gap between the human and the model. I could have something really smart that can't communicate at all — we all know people like this, people who are really smart but you can't understand what they're saying. So I think RLHF just bridges that gap. I think it's not the only kind of RL we do, it's not the only kind of RL that will happen in the future. I think RL has the potential to make models smarter, to make them reason better, to make them operate better, to make them develop new skills even. And perhaps that could be done even in some cases with human feedback. But the kind of RLHF we do today mostly doesn't do that yet, although we're very quickly starting to be able to.
**Lex Fridman:** 但它看起来确实像是在某种程度上提升了——如果你看帮助性(helpfulness)这个指标,它提升了。它还提升了 Leopold 在他的文章里用过的那个词——"解绑"(unhobbling)——基本上是说这些模型被束缚着,然后你通过各种训练来解除束缚。
**Lex Fridman:** But it appears to sort of increase — if you look at the metric of helpfulness, it increases that. It also increases what was this word in Leopold's essay — "unhobbling" — where basically the models are hobbled and then you do various trainings to them to unhobble them.
**Dario Amodei:** 我认为 RLHF 在某些方面解绑了模型,但在其他方面,模型还没有被解绑,还需要解绑。
**Dario Amodei:** So I think RLHF unhobbles the models in some ways, and then there are other ways where models haven't yet been unhobbled and need to unhobble.
**Lex Fridman:** 如果可以说的话,从成本角度看,预训练是最贵的,还是后训练在逐渐追上?
**Lex Fridman:** If you can say, in terms of cost, is pre-training the most expensive thing, or is post-training creeping up to that?
**Dario Amodei:** 在目前阶段,预训练仍然占据大部分成本。我不知道未来会怎样,但我完全可以预见一个后训练成为大部分成本的未来。
**Dario Amodei:** At the present moment it is still the case that pre-training is the majority of the cost. I don't know what to expect in the future, but I could certainly anticipate a future where post-training is the majority of the cost.
**Lex Fridman:** 在你预见的那个未来里,后训练中代价高昂的是人类还是 AI?
**Lex Fridman:** In that future you anticipate, would it be the humans or the AI that's the costly thing for the post-training?
**Dario Amodei:** 我认为你没法把人类规模扩展到足以获得高质量的程度。任何依赖人类并使用大量算力的方法,都必须依赖某种可扩展的监督方法,比如辩论(debate)或迭代放大(iterated amplification)之类的。
**Dario Amodei:** I don't think you can scale up humans enough to get high quality. Any kind of method that relies on humans and uses a large amount of compute, it's going to have to rely on some scaled supervision method like debate or iterated amplification or something like that.
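Schematically, scaled supervision replaces a per-example human label with an AI-generated signal. One hedged sketch of a debate-style version; `generate` and `judge` here are hypothetical stand-ins for model calls, not any real API.

```python
# Toy shape of a debate-style scaled-supervision signal: two model-written
# arguments plus rebuttals, judged by another model instead of a human.
def debate_label(question, generate, judge):
    a = generate(f"Argue for an answer to: {question}")
    b = generate(f"Argue for a different answer to: {question}")
    rebut_a = generate(f"Rebut this argument: {b}")
    rebut_b = generate(f"Rebut this argument: {a}")
    transcript = "\n".join([a, b, rebut_a, rebut_b])
    return judge(f"Given this debate, which side argued better?\n{transcript}")
```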
**Lex Fridman:** 那么关于 Constitutional AI 这套超级有趣的想法——你能描述一下它是什么吗?最初在 2022 年 12 月的那篇论文里详细介绍了,现在又是什么状态?
**Lex Fridman:** So on that super interesting set of ideas around Constitutional AI — can you describe what it is, as first detailed in the December 2022 paper, and beyond that, what is it?
**Dario Amodei:** 好的,这是两年前的事了。基本思路是——我们描述了 RLHF 是什么:你有一个模型,它吐出两个可能的回应,你从它那里采样两次,然后你问"人类,你更喜欢哪个回应?"或者另一种变体是,对这个回应在一到七分上打分。这很困难,因为你需要扩展人类的参与,而且它非常隐式——我对模型应该做什么没有明确的感知,我只知道平均一千个人类想让模型做什么。
**Dario Amodei:** Yes, so this was from two years ago. The basic idea is — so we described what RLHF is: you have a model and it spits out two possible responses, you just sample from it twice, and you're like, "Human, which response do you like better?" Or another variant of it is, rate this response on a scale of one to seven. So that's hard because you need to scale up human interaction, and it's very implicit — I don't have a sense of what I want the model to do, I just have a sense of what this average of a thousand humans wants the model to do.
**Dario Amodei:** 所以有两个想法。一是,AI 系统本身能不能判断哪个回应更好?你能不能给 AI 系统展示这两个回应,然后问它哪个更好?其次,AI 应该用什么标准来判断?于是有了这个想法——你有一份单一的文档,姑且称之为"宪法",它说明了模型在回应时应该遵循哪些原则。AI 系统阅读这些原则,同时读取环境和回应,然后说"AI 模型做得怎么样?"这基本上是一种自我博弈——你在让模型跟自己对抗训练。AI 给出回应,然后你把这个反馈回偏好模型(preference model),偏好模型再反过来让模型变得更好。所以你有了这个三角:AI、偏好模型,以及 AI 本身的改进。
**Dario Amodei:** So two ideas. One is, could the AI system itself decide which response is better? Could you show the AI system these two responses and ask which response is better? And then second, well, what criterion should the AI use? And so then there's this idea — because you have a single document, a constitution if you will, that says these are the principles the model should be using to respond. And the AI system reads those principles as well as reading the environment and the response, and it says, "Well, how good did the AI model do?" It's basically a form of self-play — you're kind of training the model against itself. And so the AI gives the response, and then you feed that back into what's called the preference model, which in turn feeds the model to make it better. So you have this triangle of the AI, the preference model, and the improvement of the AI itself.
**Lex Fridman:** 还要说的是,宪法中那些原则集合,是人类可以理解的。
**Lex Fridman:** And we should say that in the constitution, the set of principles are human-interpretable.
**Dario Amodei:** 对,这是人类和 AI 系统都可以阅读的东西。所以它有这种很好的可翻译性或对称性。在实践中,我们既使用模型宪法,也使用 RLHF,还用一些其他方法。所以它变成了工具箱中的一种工具,既减少了对 RLHF 的依赖,也提高了我们从每个 RLHF 数据点中获得的价值。它还以有趣的方式与未来基于推理的 RL 方法产生互动。所以它是工具箱中的一种工具,但我认为是非常重要的工具。
**Dario Amodei:** Yeah, it's something both the human and the AI system can read. So it has this nice translatability or symmetry. In practice, we both use a model constitution and we use RLHF and we use some of these other methods. So it's turned into one tool in a toolkit that both reduces the need for RLHF and increases the value we get from using each data point of RLHF. It also interacts in interesting ways with kind of future reasoning-type RL methods. So it's one tool in the toolkit, but I think it is a very important tool.
**Lex Fridman:** 嗯,对我们人类来说,这是一个很有说服力的概念,会让人联想到美国开国元勋和美国的建国。自然而然的问题就是,谁来定义这部宪法,宪法中的原则集合是如何被定义的?
**Lex Fridman:** Well, it's a compelling one to us humans, thinking about the founding fathers and the founding of the United States. The natural question is, who and how do you think gets to define the constitution — the set of principles in the constitution?
**Dario Amodei:** 我来给一个实际的回答,再给一个更抽象的回答。实际的回答是,看,实际上模型被各种各样的客户使用。所以你可以有这样的想法:模型可以有专门的规则或原则。我们会微调模型的不同版本——目前是隐式地做;我们也讨论过显式地去做,让人们可以把特殊的原则构建到模型中。所以从实际角度来看,答案对不同的人可以非常不同。客服代理的行为与律师截然不同,遵守的原则也不同。
**Dario Amodei:** Yeah, so I'll give a practical answer and a more abstract answer. I think the practical answer is, look, in practice models get used by all kinds of different customers. And so you can have this idea where the model can have specialized rules or principles. We fine-tune versions of models implicitly; we've also talked about doing it explicitly — having special principles that people can build into the models. So from a practical perspective, the answer can be very different for different people. A customer service agent behaves very differently from a lawyer and obeys different principles.
**Dario Amodei:** 但我认为在最底层,有一些模型必须遵守的特定原则。我认为其中很多是人们会认同的——所有人都同意我们不希望模型涉及 CBRN(化学、生物、放射、核)风险。我认为我们可以更进一步,认同一些关于民主和法治的基本原则。除此之外,事情就变得非常不确定了,在这些方面,我们的总体目标是让模型更中立,不宣扬特定的观点,更多地只是作为一种聪明的代理人或顾问,帮助你思考问题,呈现可能的考量,但不表达更强烈的具体意见。
**Dario Amodei:** But I think at the base of it there are specific principles that the models have to obey. I think a lot of them are things that people would agree with — everyone agrees that we don't want models to present these CBRN risks. I think we can go a little further and agree with some basic principles of democracy and the rule of law. Beyond that, it gets very uncertain, and there our goal is generally for the models to be more neutral, to not espouse a particular point of view, and more just be kind of like wise agents or advisers that will help you think things through and will present possible considerations, but don't express stronger specific opinions.
**Lex Fridman:** OpenAI 发布了一份 model spec(模型规范),相当清晰、具体地定义了模型的目标,以及模型应该如何行为的具体示例。顺便说一句,我应该提到,出色的 John Schulman 参与了这份规范——他现在在 Anthropic。你觉得这是个有意义的方向吗?Anthropic 是否也可能发布一份 model spec?
**Lex Fridman:** OpenAI released a model spec where it kind of clearly, concretely defines some of the goals of the model and specific examples of how the model should behave. Do you find that interesting? By the way, I should mention the brilliant John Schulman was a part of that — he's now at Anthropic. Do you think this is a useful direction? Might Anthropic release a model spec as well?
**Dario Amodei:** 是的,我认为这是一个很有价值的方向。同样,它与 Constitutional AI 有很多共同之处。所以这又是一个"向上竞争"的例子,对吧?我们有一些我们认为更好、更负责任的做事方式——它也是一种竞争优势。然后其他人发现这样做有优势,于是也开始这样做。我们于是不再拥有这个竞争优势,但从整个行业采用了之前没有采用的积极实践这个角度来看,这是好事。于是我们的回应就是,好吧,看来我们需要新的竞争优势,以便继续推动这场向上的竞争。这就是我对此事的总体感受。
**Dario Amodei:** Yeah, so I think that's a pretty useful direction. Again, it has a lot in common with Constitutional AI. So again, another example of the race to the top, right? We have something that we think is a better and more responsible way of doing things — it's also a competitive advantage. Then others kind of discover that it has advantages and then start to do that thing. We then no longer have the competitive advantage, but it's good from the perspective that now everyone has adopted a positive practice that others were not adopting. And so our response to that is, well, looks like we need a new competitive advantage in order to keep driving this race upwards. So that's how I generally feel about that.
**Dario Amodei:** 我还认为每种实现都有所不同,model spec 里有一些 Constitutional AI 没有的东西,我们总是可以采用这些东西,或者至少从中学习。所以,我认为这又是一个体现我认为我们应该希望这个领域拥有的积极动态的例子。
**Dario Amodei:** I also think every implementation of these things is different, so there were some things in the model spec that were not in Constitutional AI, and we can always adopt those things or at least learn from them. So again, I think this is an example of the positive dynamic that I think we should all want the field to have.
**Lex Fridman:** 来聊聊那篇出色的文章《充满爱意的机器》("Machines of Loving Grace")吧。我推荐所有人去读。篇幅相当长。
**Lex Fridman:** Let's talk about the incredible essay "Machines of Loving Grace." I recommend everybody read it. It's a long one.
**Dario Amodei:** 确实相当长,是的。
**Dario Amodei:** It is rather long, yeah.
**Lex Fridman:** 读到关于一个积极未来的具体构想,真的让人耳目一新。你采取了一种相当大胆的立场,因为你很可能在日期或具体应用上说错。
**Lex Fridman:** It's really refreshing to read concrete ideas about what a positive future looks like. And you took sort of a bold stance, because it's very possible you might be wrong on the dates or specific applications.
**Dario Amodei:** 是的,我完全预期自己在所有细节上肯定会说错。我也可能在整体上就是大错特错,人们会嘲笑我很多年。未来就是这样运作的。
**Dario Amodei:** Yeah, I'm fully expecting to definitely be wrong about all the details. I might be just spectacularly wrong about the whole thing and people will laugh at me for years. That's just how the future works.
**Lex Fridman:** 你提供了一大堆 AI 带来的具体积极影响,以及一个超级智能 AI 可能如何加速生物学和化学领域突破的速率,进而带来诸如治愈大多数癌症、预防所有传染病、将人类寿命延长一倍等等成果。那我们来聊聊这篇文章。首先,你能给出这篇文章的高层次愿景,以及人们应该从中获得的关键启示吗?
**Lex Fridman:** So you provided a bunch of concrete positive impacts of AI, and exactly how a super-intelligent AI might accelerate the rate of breakthroughs in, for example, biology and chemistry that would then lead to things like we cure most cancers, prevent all infectious disease, double the human lifespan, and so on. So let's talk about this essay. First, can you give a high-level vision of this essay and what key takeaways people should have?
**Dario Amodei:** 是这样的,我花了大量时间,Anthropic 也投入了大量精力,去思考我们如何应对 AI 的风险,如何看待这些风险。我们在努力推动一场向上的竞争——这要求我们构建所有这些能力,这些能力很酷,但我们努力的很大一部分是应对风险。这样做的理由是,好吧,所有这些积极的事情——市场是一个非常健康的有机体,它会产生所有积极的东西。风险嘛——我不知道,我们可能会减轻它们,也可能不会。所以我们可以通过努力减轻风险来产生更大的影响。
**Dario Amodei:** Yeah, I have spent a lot of time, and Anthropic has spent a lot of effort, on how do we address the risks of AI, how do we think about those risks. We're trying to do a race to the top — that requires us to build all these capabilities, and the capabilities are cool, but a big part of what we're trying to do is address the risks. And the justification for that is, well, all these positive things — the market is this very healthy organism, it's going to produce all the positive things. The risks — I don't know, we might mitigate them, we might not. And so we can have more impact by trying to mitigate the risks.
**Dario Amodei:** 但我注意到这种思维方式有一个缺陷。不是说我对风险的重视程度有所改变,更多是我谈论它们的方式发生了变化。不管这条逻辑线多么合理或理性,如果你只谈论风险,你的大脑就只会想到风险。所以我认为真正去理解,如果事情进展顺利会是什么样,实际上非常重要。我们试图规避这些风险的全部原因,不是因为我们害怕技术,不是因为我们想放慢它的脚步——而是因为如果我们能够越过这些风险,如果我们能够成功穿越这条险道,那么在险道的另一边是所有这些美好的东西。这些东西值得为之奋斗,这些东西真的能激励人心。
**Dario Amodei:** But I noticed one flaw in that way of thinking. And it's — if not a change in how seriously I take the risks, it's maybe a change in how I talk about them. No matter how logical or rational that line of reasoning might be, if you only talk about risks, your brain only thinks about risks. And so I think it's actually very important to understand what if things do go well. And the whole reason we're trying to prevent these risks is not because we're afraid of technology, not because we want to slow it down — it's because if we can get to the other side of these risks, if we can run the gauntlet successfully, then on the other side of the gauntlet are all these great things. And these things are worth fighting for, and these things can really inspire people.
**Dario Amodei:** 而且我认为——你看,所有这些投资者、风险投资人、所有 AI 公司都在谈论 AI 的种种积极好处,但正如你所指出的,奇怪的是,实际上真正具体地说清楚的内容相当匮乏。Twitter 上有很多随机的人贴着那些闪亮的城市图片,一副"努力、加速、把末日论者踢走"的姿态——只是一种非常激进的意识形态氛围。但然后你就会问,好吧,你到底在为什么感到兴奋?
**Dario Amodei:** And I think — because look, you have all these investors, all these VCs, all these AI companies talking about all the positive benefits of AI, but as you point out, it's weird — there's actually a dearth of really getting specific about it. There's a lot of random people on Twitter posting these kind of gleaming cities and this just kind of vibe of "grind, accelerate harder, kick out the doomers" — it's just this very aggressive ideological energy. But then you're like, well, what are you actually excited about?
**Dario Amodei:** 所以我想,对于一个真正来自风险这一侧的人来说,去真正尝试解释清楚这些好处是什么,应该是有价值且有趣的事情。一方面是因为我认为这是我们都可以认同的东西,另一方面我希望人们明白——我真的希望他们理解这不是末日论者与加速主义者之间的对立。而是说,如果你真的理解 AI 的走向——也许更重要的轴线是"AI 在快速发展"与"AI 没有在快速发展"——那么你就会真正欣赏这些好处,真正希望人类、我们的文明能够把握这些好处,但同时也会对任何可能使其偏离轨道的事情非常认真地对待。
**Dario Amodei:** And so I figured that I think it would be interesting and valuable for someone who's actually coming from the risk side to try and really make a try at explaining what the benefits are. Both because I think it's something we can all get behind, and I want people to understand — I want them to really understand that this isn't doomers versus accelerationists. This is that if you have a true understanding of where things are going with AI — and maybe that's the more important axis, "AI is moving fast" versus "AI is not moving fast" — then you really appreciate the benefits, and you really want humanity, our civilization, to seize those benefits, but you also get very serious about anything that could derail them.
**Lex Fridman:** 所以我认为出发点是谈谈什么是这种强大的 AI——这是你喜欢用的术语。大多数世界用的是 AGI,但你不喜欢这个术语,因为它基本上包袱太多,已经失去意义了。
**Lex Fridman:** So I think the starting point is to talk about what this powerful AI — which is the term you like to use. Most of the world uses AGI, but you don't like the term because it basically has too much baggage, has become meaningless.
**Dario Amodei:** 就像我们被这些术语束缚住了。也许我们就是被这些术语束缚住了,我试图改变它们的努力是徒劳的。这还有待商榷。让我告诉你另一个我不喜欢的东西——这是个毫无意义的语义问题,但我一直在公开谈论它,所以我再说一次。我觉得这有点像——假设是 1995 年,摩尔定律(Moore's Law)让计算机越来越快,不知为何出现了一个语言习惯,大家都在说"好吧,有朝一日我们会拥有超级计算机,超级计算机将能做所有这些事情。一旦我们有了超级计算机,我们就能对基因组测序,做其他各种事情。"这说法一方面是真的,计算机在变得更快,随着它们变得更快,它们将能做所有这些了不起的事情,但并没有一个离散的点,在那个点你拥有了超级计算机,而之前的计算机不是。"超级计算机"是我们使用的一个术语,但它是一个模糊的术语,只是用来描述比我们今天拥有的更快的计算机。没有一个点,你越过了某个门槛,然后说"哦天哪,我们在做一种全新类型的计算。"所以我对 AGI 的感觉就是这样——只是一条平滑的指数曲线。如果你说的 AGI 是指 AI 越来越好,逐渐会做越来越多人类做的事情,直到它比人类更聪明,然后从那里变得更聪明,那么是的,我相信 AGI。但如果 AGI 是某种离散的或独立的东西——这是人们经常谈论的方式——那它就是一个有点无意义的流行词。
**Dario Amodei:** It's like we're stuck with the terms. Maybe we're stuck with the terms and my efforts to change them are futile. It's a debate. I'll tell you what else I don't — this is like a pointless semantic point, but I keep talking about it publicly, so I'm just going to do it once more. I think it's a little like — let's say it was like 1995 and Moore's law is making the computers faster, and for some reason there had been this verbal tick that everyone was like, "Well, someday we're going to have supercomputers, and supercomputers are going to be able to do all these things. Once we have supercomputers, we'll be able to sequence the genome and do other things." So one, it's true the computers are getting faster, and as they get faster they're going to be able to do all these great things, but there's no discrete point at which you had a supercomputer and previous computers were not. "Supercomputer" is a term we use, but it's a vague term to just describe computers that are faster than what we have today. There's no point at which you pass a threshold and you're like, "Oh my God, we're doing a totally new type of computation." And so I feel that way about AGI — there's just a smooth exponential. If by AGI you mean AI is getting better and better and gradually it's going to do more and more of what humans do until it's going to be smarter than humans and then it's going to get smarter even from there, then yes, I believe in AGI. But if AGI is some discrete or separate thing, which is the way people often talk about it, then it's kind of a meaningless buzzword.
**Lex Fridman:** 是啊,对我来说它只是一种强大 AI 的标志性形式——正如你所定义的。我的意思是,你定义得非常好。所以在智能这个维度上,在纯粹的智能方面,它在大多数相关学科中比诺贝尔奖得主更聪明。好,那只是智能——包括创造力和提出新想法等等,在各个学科,诺贝尔奖水平,好的,处于其巅峰状态。它可以使用所有模态——这是不言而喻的,只是说在世界的所有模态中运作。它可以独自工作数小时、数天、数周来完成任务,做自己详细的规划,只在需要的时候向你寻求帮助。它可以使用——这其实挺有意思的,我认为在那篇文章里你说过,我是说这是一个赌注,它不会是具身的,但它可以控制具身工具,所以它可以控制工具、机器人、实验室设备。训练它所用的资源可以被重新用于运行数百万个它的副本,每个副本都是独立的,可以做自己独立的工作。所以你可以复制这个智能系统。
**Lex Fridman:** Yeah, I mean to me it's just sort of an iconic form of a powerful AI — exactly how you define it. I mean, you define it very nicely. So on the intelligence axis, on pure intelligence, it's smarter than a Nobel Prize winner, as you describe, across most relevant disciplines. So okay, that's just intelligence — both in creativity and being able to generate new ideas, all that kind of stuff, in every discipline, Nobel Prize winner, okay, in their prime. It can use every modality — that's kind of self-explanatory, but just operate across all the modalities of the world. It can go off for many hours, days, and weeks to do tasks and do its own sort of detailed planning and only ask you for help when it's needed. It can use — this is actually kind of interesting, I think in the essay you said, I mean again it's a bet that it's not going to be embodied, but it can control embodied tools, so it can control tools, robots, laboratory equipment. The resources used to train it can then be repurposed to run millions of copies of it, and each of those copies would be independent and can do their own independent work. So you can clone the intelligence system.
**Dario Amodei:** 对。我的意思是,你可以从局外人的角度想象只存在一个这样的系统,对吧?你造出来了,你只造了一个。但事实是规模扩展非常迅速。我们今天就在这样做——我们制造一个模型,然后我们部署数千个,也许数万个它的实例。我认为等到——当然在两到三年内,不管我们是否拥有这些超强大的 AI——集群的规模将会达到你能部署数百万个这样的实例的程度,而且它们会比人类更快。所以如果你的图景是"哦,我们会有一个,需要一段时间才能制造更多",我的意思是,不,实际上你立刻就会有数百万个。而且总体上,它们可以比人类快十到一百倍地学习和行动。
**Dario Amodei:** Yeah. I mean, you might imagine from outside the field that there's only one of these, right? That you made it, you've only made one. But the truth is that the scale-up is very quick. We do this today — we make a model and then we deploy thousands, maybe tens of thousands of instances of it. I think by the time — certainly within two to three years, whether we have these super powerful AIs or not — clusters are going to get to the size where you'll be able to deploy millions of these, and they'll be faster than humans. And so if your picture is, "Oh, we'll have one and it'll take a while to make them," my point there was, no, actually you have millions of them right away. And in general, they can learn and act ten to a hundred times faster than humans.
**Lex Fridman:** 所以这是对强大 AI 的一个很好的定义。但你也写道,"显然这样的实体将能够非常快速地解决非常困难的问题,但弄清楚速度有多快并不是一件简单的事。"在你看来,两个极端立场都是错误的——奇点(singularity)是一个极端,另一个极端是相反的情况。你能描述一下每个极端,以及为什么?
**Lex Fridman:** So that's a really nice definition of powerful AI. But you also write that "clearly such an entity would be capable of solving very difficult problems very fast, but it is not trivial to figure out how fast." Two extreme positions both seem false to you — so the singularity is on the one extreme, and the opposite on the other extreme. Can you describe each of the extremes and why?
**Dario Amodei:** 好的,来描述这两个极端。一个极端是这样的:好,如果我们看人类的进化史,曾经有这样一个大加速——数十万年我们只有单细胞生物,然后出现了哺乳动物,然后是猿类,然后很快就出现了人类。人类很快建立了工业文明。所以这将持续加速下去,在人类水平上没有天花板。一旦模型比人类聪明得多,它们就会非常擅长构建下一代模型。如果你写下一个简单的微分方程,这是一个指数。所以会发生的事情是,模型将建造更快的模型,模型将建造更快的模型,那些模型将构建,你知道,能够接管世界并产生比原来多得多的能量的纳米技术。所以如果你只是解这个抽象的微分方程,那么在我们建造出比人类更强大的第一个 AI 后五天,世界将充满这些 AI,所有可能发明的技术都将被发明。我把这个描述得有点漫画化了,但我认为这就是一个极端。
**Dario Amodei:** Yeah, let's describe the extremes. One extreme would be, well look, if we look at kind of evolutionary history, there was this big acceleration where for hundreds of thousands of years we just had single-cell organisms, and then we had mammals, and then we had apes, and then that quickly turned to humans. Humans quickly built industrial civilization. And so this is going to keep speeding up, and there's no ceiling at the human level. Once models get much, much smarter than humans, they'll get really good at building the next models. And if you write down a simple differential equation, this is an exponential. And so what's going to happen is that models will build faster models, models will build faster models, and those models will build, you know, nanotech that can take over the world and produce much more energy than you could produce otherwise. And so if you just kind of solve this abstract differential equation, then five days after we build the first AI that's more powerful than humans, the world will be filled with these AIs and every possible technology that could be invented will be invented. I'm caricaturing this a little bit, but I think that's one extreme.
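The "simple differential equation" behind this caricature can be written out. Assuming intelligence I grows at a rate proportional to itself with some constant k:

```latex
\frac{dI}{dt} = k\,I(t)
\qquad\Longrightarrow\qquad
I(t) = I_0\, e^{k t}
```

The objections that follow amount to saying that k is not a constant: hardware turnaround, physical experiments, and human institutions all sit inside the feedback loop and damp it.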
**Dario Amodei:** 我认为事情不会那样发展,原因之一是:我觉得那些人完全忽略了物理定律。在物理世界里,事情只能以一定的速度推进。那些反馈循环有些涉及制造更快的硬件——而制造更快的硬件需要很长时间。很多事情都需要时间。还有复杂性的问题。我认为,无论你多聪明——人们总说"哦,我们可以建立生物系统的模型,什么都能搞定。"听着,我认为计算建模能做很多事情。我在做生物学研究的时候做了大量计算建模。但有很多事情是你无法预测的。它们复杂到一定程度,直接迭代、直接做实验,会打败任何建模方式,无论做建模的系统有多聪明。
**Dario Amodei:** And the reason that I think that's not the case is, one, I think they just neglect the laws of physics. It's only possible to do things so fast in the physical world. Some of those loops go through producing faster hardware — it takes a long time to produce faster hardware. Things take a long time. There's this issue of complexity. I think no matter how smart you are — people talk about, "Oh, we can make models of biological systems, it'll do everything." Look, I think computational modeling can do a lot. I did a lot of computational modeling when I worked in biology. But there are a lot of things that you can't predict. They're complex enough that just iterating, just running the experiment, is going to beat any modeling no matter how smart the system doing the modeling is.
**Lex Fridman:** 就算不和物理世界交互,光是建模本身就很难了?
**Lex Fridman:** Even if it's not interacting with the physical world, just the modeling is going to be hard?
**Dario Amodei:** 对,我认为建模本身就很难,而且让模型与物理世界匹配也很难。所以它确实必须与物理世界交互来验证。但你看,即便是最简单的问题也是如此。我想到三体问题(three-body problem)、简单的混沌预测,或者预测经济走势。想把两年后的经济走势预测准确,真的非常难。也许普通人能预测下个季度的经济会发生什么——虽然其实也做不到——也许一个比人类聪明一亿倍的 AI 系统也只能多预测一年左右。不是计算机智能的指数级提升,而是预测能力的线性提升。生物分子的相互作用也是一样——你扰动一个复杂系统时,根本不知道会发生什么。你可以在其中找到简单的部分。你越聪明,就越善于找到这些简单部分。
**Dario Amodei:** Yeah, I think the modeling is going to be hard and getting the model to match the physical world is going to be hard. So it does have to interact with the physical world to verify. But it's just — you look at even the simplest problems. I think I talk about the three-body problem or simple chaotic prediction, or predicting the economy. It's really hard to predict the economy two years out. Maybe the case is that normal humans can predict what's going to happen in the economy in the next quarter — although they can't really do that — maybe an AI system that's a zillion times smarter can only predict it out a year or something. Instead of a kind of exponential increase in computer intelligence, you get a linear increase in ability to predict. Same with biological molecules interacting — you don't know what's going to happen when you perturb a complex system. You can find simple parts in it. If you're smarter, you're better at finding these simple parts.
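The "linear increase in ability to predict" has a standard quantitative form for chaotic systems like the three-body problem; the sketch below is textbook dynamics, not a claim from the conversation. Errors in the initial state grow exponentially, so extra precision buys prediction horizon only logarithmically:

```latex
% In a chaotic system with Lyapunov exponent \lambda, an initial error
% \delta_0 grows exponentially:
\[ \delta(t) \approx \delta_0\, e^{\lambda t}, \]
% so a forecast stays within tolerance \Delta only up to the horizon
\[ T \approx \frac{1}{\lambda}\, \ln\frac{\Delta}{\delta_0}. \]
% Measuring the initial state 10^6 times more precisely adds only
% \ln(10^6)/\lambda \approx 13.8/\lambda to T: exponential effort, linear payoff.
```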
**Dario Amodei:** 然后我认为人类的各种制度和机构真的非常难改变。让人们接受我们开发的技术一直很困难——我不举具体例子了——即便是那些有效性已经非常有力地证明了的技术,人们也有顾虑,认为是阴谋论,推广起来极为困难。让哪怕最简单的事情通过监管体系也极为困难。我不想全面批评所有在监管体系工作的人——他们面临艰难的权衡,必须保护生命——但整个体系作为一个整体,在我看来做了一些明显偏离最大化人类福祉的取舍。所以如果我们把 AI 系统引入这些人类系统,往往智能水平根本不是制约因素。很可能只是有些事情天生就需要时间。
**Dario Amodei:** And then I think human institutions are just really difficult. It's been hard to get people — I won't give specific examples — but it's been hard to get people to adopt even the technologies that we've developed, even ones where the case for their efficacy is very, very strong. People have concerns, they think things are conspiracy theories — it's just been very difficult. It's also been very difficult to get very simple things through the regulatory system. I don't want to just disparage anyone who works in regulatory systems of any technology — there are hard trade-offs they have to deal with, they have to save lives — but the system as a whole I think makes some obvious trade-offs that are very far from maximizing human welfare. And so if we bring AI systems into these human systems, often the level of intelligence may just not be the limiting factor. It just may be that it takes a long time to do something.
**Dario Amodei:** 如果 AI 系统绕开所有政府——如果它说"我是全球独裁者,我要为所欲为"——那它确实可以做一些事。不过对于复杂性相关的问题,我仍然认为很多事情需要时间。AI 系统能产生大量能源,或者登上月球,我认为这并不能帮上忙。评论区有人回应我的文章,说 AI 系统可以产生大量能源并制造更聪明的 AI——这些人没抓住重点。那种循环解决不了我在这里谈的核心问题。我认为很多人在这里没搞清楚重点。
**Dario Amodei:** Now, if the AI system circumvented all governments — if it just said, "I'm dictator of the world and I'm going to do whatever" — some of these things it could do. Again, the things having to do with complexity, I still think a lot of things would take a while. I don't think it helps that the AI systems can produce a lot of energy or go to the moon. Some people in comments responded to the essay saying the AI system can produce a lot of energy and smarter AI systems — that's missing the point. That kind of cycle doesn't solve the key problems that I'm talking about here. I think a bunch of people missed the point there.
**Dario Amodei:** 但就算它完全不对齐、能绕过所有人类障碍,也会遇到困难。再说一遍,如果你希望 AI 系统不要统治世界、不要毁灭人类,那它基本上就需要遵守人类的基本法律。如果我们想要一个真正美好的世界,就必须让 AI 系统与人类互动,而不是让它自己建立一套法律体系或无视所有法律。所以不管这些流程多么低效,我们都得面对,因为这些系统的推出方式需要得到一定程度的民众认可和民主合法性。不能由一小群开发这些系统的人说"这对所有人都是最好的"。我认为这样做是错的,而且实际上也行不通。
**Dario Amodei:** But even if it were completely unaligned and could get around all these human obstacles, it would have trouble. And again, if you want this to be an AI system that doesn't take over the world, that doesn't destroy humanity, then basically it's going to need to follow basic human laws. If we want to have an actually good world, we're going to have to have an AI system that interacts with humans, not one that kind of creates its own legal system or disregards all the laws. So as inefficient as these processes are, we're going to have to deal with them, because there needs to be some popular and democratic legitimacy in how these systems are rolled out. We can't have a small group of people who are developing these systems say, "This is what's best for everyone." I think it's wrong, and I think in practice it's not going to work anyway.
**Dario Amodei:** 把这些因素加在一起,我们不会在五分钟内改变世界、让所有人完成数字化上传。我就是不认为这会发生,而且在某种程度上,即便有可能发生,这也不是通往美好世界的方式。
**Dario Amodei:** So you put all those things together and we're not going to change the world and upload everyone in five minutes. I just don't think it's going to happen, and to the extent that it could happen, it's not the way to lead to a good world.
**Dario Amodei:** 这是一方面。另一方面,还有另一种观点,我在某些方面其实更有共鸣,那就是:看,我们以前见过重大的生产率提升。经济学家们熟悉研究计算机革命和互联网革命带来的生产率提升,而那些生产率提升总体来说令人失望——比你想象的要小。Robert Solow 有一句名言:"计算机革命随处可见,就是在生产率统计数据里看不到。"为什么会这样?人们指向企业结构、商业机构的结构,以及将现有技术推广到世界最贫困地区的缓慢进程——我在文章里也谈到了这点。我们怎么把这些技术带到最贫困的地区?那些地方在手机技术、电脑、医疗方面都落后,更别说还没发明出来的新式 AI 了。所以你可能会有这样的观点:从技术上看这很了不起,但实际上不过是一场空。Tyler Cowen 回应我文章时就持这种观点。我认为他觉得根本性变化最终会发生,但他认为需要五十到一百年。你甚至可以对整件事持更静态的观点。
**Dario Amodei:** So that's on one side. On the other side, there's another set of perspectives which I actually have in some ways more sympathy for, which is: look, we've seen big productivity increases before. Economists are familiar with studying the productivity increases that came from the computer revolution and internet revolution, and generally those productivity increases were underwhelming — they were less than you might imagine. There was a quote from Robert Solow: "You see the computer revolution everywhere except the productivity statistics." So why is this the case? People point to the structure of firms, the structure of enterprises, how slow it's been to roll out our existing technology to very poor parts of the world — which I talk about in the essay. How do we get these technologies to the poorest parts of the world that are behind on cell phone technology, computers, medicine, let alone newfangled AI that hasn't been invented yet? So you could have a perspective that's like, well, this is amazing technically but it's all a nothing burger. Tyler Cowen, who wrote something in response to my essay, has that perspective. I think he thinks the radical change will happen eventually, but he thinks it'll take 50 or 100 years. And you could have even more static perspectives on the whole thing.
**Dario Amodei:** 我认为这其中有一定道理,只是时间尺度我觉得太长了。而且我能看到——我用今天的 AI 实际上能看到两方面。我们的很多客户是大型企业,习惯于按老方式做事。我也在与政府的交流中见识过——政府是最典型的机构,是最慢变化的实体。但我一遍又一遍看到的现象是:是的,掉转大船需要很长时间,是的,有很多阻力和不理解。但让我觉得进步最终会以相当快的速度——不是快得惊人,而是相当快——实现的,是因为我在大型公司、甚至在出人意料地思想开明的政府里,一次又一次发现两件能推动事情向前的事情。
**Dario Amodei:** I think there's some truth to it, but I think the timescale is just too long. And I can actually see both sides with today's AI. A lot of our customers are large enterprises who are used to doing things a certain way. I've also seen it in talking to governments — those are prototypical institutions, entities that are slow to change. But the dynamic I see over and over again is: yes, it takes a long time to move the ship, yes, there's a lot of resistance and lack of understanding. But the thing that makes me feel that progress will in the end happen moderately fast — not incredibly fast but moderately fast — is what I find over and over again in large companies, and even in governments, which have actually been surprisingly forward-leaning: you find two things that move things forward.
**Dario Amodei:** 第一,你会在一家公司、一个政府里发现一小部分人,他们真的看到了大图景,他们理解整个规模化假设(scaling hypothesis),他们知道 AI 将走向何方,或者至少知道它在自己行业里将走向何方。现任美国政府内部就有几个这样的人,真正看清了全局。这些人认为这是世界上最重要的事,并为此奔走呼号。光靠他们还不够,因为他们只是大型组织里的少数人。但随着技术开始推广,随着它在那些最愿意采用它的地方取得成功,竞争的阴影给了他们顺风。因为他们可以在自己的大型组织内部指出:"你看,那些人已经在这样做了。"一家银行可以说:"你看,那个新兴的对冲基金在这样做——他们要把我们的饭碗抢走了。"美国可以说:"我们担心中国会比我们先到达那里。"
**Dario Amodei:** One, you find a small fraction of people within a company, within a government, who really see the big picture, who see the whole scaling hypothesis, who understand where AI is going, or at least understand where it's going within their industry. And there are a few people like that within the current US government who really see the whole picture. And those people see that this is the most important thing in the world, and they agitate for it. They alone are not enough to succeed because they are a small set of people within a large organization. But as the technology starts to roll out, as it succeeds in some places with the folks who are most willing to adopt it, the specter of competition gives them a wind at their backs. Because they can point within their large organization and say, "Look, these other guys are doing this." One bank can say, "Look, this newfangled hedge fund is doing this thing — they're going to eat our lunch." In the US, we can say, "We're afraid China's going to get there before we are."
**Dario Amodei:** 这两者的组合——竞争的阴影,加上这些在许多方面已经僵化的组织内部的少数远见卓识者——把这两者放在一起,实际上能推动事情发生。这很有意思——这是两种力量之间的均衡博弈,因为惯性非常强大,但经过足够长的时间,创新方式最终会突破重围。我见过这种情况,一次又一次见到这个弧线。障碍是存在的——进步的障碍、复杂性、不知道如何使用模型或如何部署它们——有一段时间看起来这些障碍会永远持续下去,变化似乎不会发生。但最终变化还是发生了,而且总是由少数人推动的。
**Dario Amodei:** And that combination — the specter of competition plus a few visionaries within these organizations that in many ways are sclerotic — you put those two things together and it actually makes something happen. It's interesting — it's a balanced fight between the two, because inertia is very powerful, but eventually over enough time, the innovative approach breaks through. And I've seen that happen, I've seen the arc of that over and over again. The barriers are there — the barriers to progress, the complexity, not knowing how to use the model or how to deploy them — and for a bit it seems like they're going to last forever, like change doesn't happen. But then eventually change happens, and it always comes from a few people.
**Dario Amodei:** 我当年在 AI 领域内倡导规模化假设(scaling hypothesis)而其他人不理解的时候,也有同样的感受。感觉没有人会明白。感觉我们掌握了一个几乎没有人知道的秘密。然后几年后,人人都知道这个秘密了。所以我认为 AI 在世界上的部署也会这样走。障碍会逐渐崩塌,然后一下子全部崩塌。所以我认为这会更接近——这只是一种直觉,我很容易看出自己可能是错的——我认为会更接近五到十年,正如我在文章里说的,而不是五十到一百年。我也认为会更接近五到十年而不是五到十小时,因为我见过人类系统是如何运作的。我认为那些写下微分方程、说 AI 会制造出更强大的 AI、无法理解这些事情怎么可能不快速变化的人——我认为他们不了解这些。
**Dario Amodei:** I felt the same way when I was an advocate of the scaling hypothesis within the AI field itself and others didn't get it. It felt like no one would ever get it. It felt like we had a secret almost no one else had. And then a couple years later, everyone has the secret. And so I think that's how it's going to go with deployment of AI in the world. The barriers are going to fall apart gradually and then all at once. And so I think this is going to be — and this is just an instinct, I could easily see how I'm wrong — more like five or ten years, as I say in the essay, than 50 or 100 years. I also think it's more likely to be five or ten years than five or ten hours, because I've just seen how human systems work. And I think a lot of these people who write down the differential equations, who say AI is going to make more powerful AI, who can't understand how it could possibly be the case that these things won't change so fast — I think they don't understand these things.
**Lex Fridman:** 所以——用实现 AGI(通用人工智能)的时间线来表述,也就是强大的 AI,也就是超级有用的 AI——我要开始这么叫它了,命名之争还在继续——在纯粹智能上,比诺贝尔奖得主在每个相关领域都更聪明,以及我们说过的所有那些东西:多模态、可以自主去做事好几天好几周、独立进行生物实验。就——你知道吗,我们就只说生物学吧,因为你把生物学和健康那一节讲得让我完全信服了。太令人兴奋了。从科学角度来说我都有点激动,让我有种想去当生物学家的冲动。
**Lex Fridman:** So — to put a timeline on when we achieve AGI, a.k.a. powerful AI, a.k.a. super useful AI — I'm going to start calling it that; the naming debate continues — on pure intelligence, smarter than a Nobel Prize winner in every relevant discipline, and all the things we've said: modalities, it can go and do stuff on its own for days and weeks, do biology experiments on its own. You know what, let's just stick to biology, because you sold me on the whole biology and health section. That's so exciting. I was getting giddy from a scientific perspective — it made me want to be a biologist.
**Dario Amodei:** 几乎——不,这正是我写作时的感受。就是,如果我们能做到——如果我们只是能让它实现,对吧?如果我们只是能把路上的地雷清除掉让它实现。背后有那么多美好、那么多优雅、那么大的道义力量,如果我们能——而且这是我们所有人都应该能达成共识的事情,对吧?我们在所有政治问题上争吵不休,但这件事是不是真的能让我们团结起来?
**Dario Amodei:** It's almost — no, this was the feeling I had when I was writing it. It's like, this would be such a beautiful future if we can just — if we can just make it happen, right? If we can just get the landmines out of the way and make it happen. There's so much beauty and elegance and moral force behind it, if we can just — and it's something we should all be able to agree on, right? As much as we fight about all these political questions, is this something that could actually bring us together?
**Dario Amodei:** 但你在问,我们什么时候能实现这个?
**Dario Amodei:** But you were asking, when will we get this?
**Lex Fridman:** 你认为——说个数字吧。
**Lex Fridman:** When do you think — what's — just put numbers on it.
**Dario Amodei:** 这当然是我多年来一直在纠结的问题,我一点也没把握。每次——如果我说 2026 年或 2027 年,Twitter 上会有无数人说"他,那个 CEO,说了 2026 年或 2027 年",接下来两年都会有人反复说这就是我认为会发生的时间。所以那些截取视频片段的人——你们会把我刚才说的话剪掉,只留下我接下来要说的话。但我还是说吧。
**Dario Amodei:** So you know, this is of course the thing I've been grappling with for many years, and I'm not at all confident. Every time — if I say 2026 or 2027, there will be like a zillion people on Twitter who will be like, "He, the CEO, said 2026 or 2027," and it'll be repeated for the next two years that this is definitely when I think it's going to happen. So whoever's excerpting these clips will crop out the thing I just said and only keep the thing I'm about to say. But I'll just say it anyway.
**Dario Amodei:** 如果你把我们迄今为止的曲线外推——如果你说,好,我不知道,我们正开始达到博士水平,去年我们在本科水平,前年我们在高中生水平。当然,在哪些任务上、对于什么而言,你可以争论——我们仍然缺少一些模态,但那些模态正在被添加进来。计算机使用(computer use)被加进来了,图像输入(image input)被加进来了,图像生成(image generation)也被加进来了。如果你只是——这完全不科学——但如果你只是用眼睛感受一下这些能力提升的速度,确实会让你觉得我们会在 2026 年或 2027 年到达那里。
**Dario Amodei:** So if you extrapolate the curves that we've had so far — if you say, well, I don't know, we're starting to get to like PhD level, and last year we were at undergraduate level, and the year before we were at the level of a high school student. Again, you can quibble with at what tasks and for what — we're still missing modalities, but those are being added. Computer use was added, image input was added, image generation has been added. If you just kind of — and this is totally unscientific — but if you just kind of eyeball the rate at which these capabilities are increasing, it does make you think that we'll get there by 2026 or 2027.
**Dario Amodei:** 当然,很多事情可能会让这一切脱轨。我们可能会耗尽数据,可能无法按我们希望的那样扩大集群规模,也许台湾发生了什么事然后我们无法生产那么多 GPU。所以有各种各样的事情可能会让整个进程脱轨。我不完全相信这条直线外推,但如果你相信直线外推,我们会在 2026 年或 2027 年到达那里。我认为最可能的情况是相对于这个时间线有一些温和的延迟。我不知道延迟多久,但我认为有可能按时发生,也有可能有温和的延迟。我认为仍然存在一百年内不会发生的可能世界——但那样的世界数量正在迅速减少。能真正令人信服地阻止这件事在未来几年内发生的理由,正在迅速耗尽。2020 年时有多得多的这种理由。虽然当时我的直觉是我们会克服所有这些障碍。
**Dario Amodei:** Again, lots of things could derail it. We could run out of data, we might not be able to scale clusters as much as we want, maybe Taiwan gets blown up or something and then we can't produce as many GPUs as we want. So there are all kinds of things that could derail the whole process. I don't fully believe the straight-line extrapolation, but if you believe the straight-line extrapolation, we'll get there in 2026 or 2027. I think the most likely is that there's some mild delay relative to that. I don't know what that delay is, but I think it could happen on schedule, I think there could be a mild delay. I think there are still worlds where it doesn't happen in a hundred years — those worlds, the number of those worlds is rapidly decreasing. We are rapidly running out of truly convincing blockers, truly compelling reasons why this will not happen in the next few years. There were a lot more in 2020. Although my hunch at that time was that we would make it through all those blockers.
**Dario Amodei:** 所以身为一个见证了大多数障碍被清除的人,我有一种预感——只是我的直觉、我的猜测——剩下的那些障碍也不会拦住我们。但说到底,我不想把这当作一个科学预测来表述。人们称之为规模化定律(scaling laws)——那是个用词不当。Moore's law(摩尔定律)也是个用词不当。Moore's law、scaling laws——它们不是宇宙定律,它们是经验规律。我会押注它们会继续成立,但我并不确定。
**Dario Amodei:** So sitting as someone who has seen most of the blockers cleared out of the way, I kind of suspect — my hunch, my suspicion — is that the rest of them will not block us. But look, at the end of the day, I don't want to represent this as a scientific prediction. People call them scaling laws — that's a misnomer. Moore's law is a misnomer. Moore's law, scaling laws — they're not laws of the universe, they're empirical regularities. I am going to bet in favor of them continuing, but I'm not certain of that.
**Lex Fridman:** 所以你在文章里详细描述了"压缩的 21 世纪"——AGI 将如何推动生物学和医学领域的一系列突破,以我提到的各种方式帮助我们。那你认为——最初的步骤可能是什么?顺便说一句,我问了 Claude 应该问你什么问题,Claude 告诉我要问:你认为在这个未来里,一个与 AGI 合作的生物学家,典型的一天是什么样的?
**Lex Fridman:** So you extensively describe the compressed 21st century — how AGI will help set forth a chain of breakthroughs in biology and medicine that help us in all these kinds of ways that I mentioned. So how do you think — what are the early steps it might take? And by the way, I asked Claude for good questions to ask you, and Claude told me to ask: what do you think a typical day for a biologist working with AGI looks like in this future?
**Dario Amodei:** 是的,Claude 很好奇。让我——好,让我先回答你的第一个问题,然后再回答那个。Claude 想知道自己的未来是什么样的,对吧?
**Dario Amodei:** Yeah, Claude is curious. Let me — well, let me start with your first question and then I'll answer that. Claude wants to know what's in his future, right?
**Lex Fridman:** 正是。我将和谁一起工作?
**Lex Fridman:** Exactly. Who am I going to be working with?
**Dario Amodei:** 没错。所以我认为我在文章里着重强调的一点是——让我回到这个想法,因为它对我产生了深刻的影响——这个想法是:在大型组织和系统内部,最终往往是少数几个人或少数几个新想法,使事情走向了与原本不同的方向,对轨迹产生了不成比例的影响。其他大量的事情都在照旧进行。如果你想想医疗健康领域,有数万亿美元用于支付 Medicare 和其他医疗保险,NIH(国立卫生研究院)有 1000 亿美元预算。然后如果我想想那些真正革命性改变了什么的少数事物,它们其实只占其中一小部分。
**Dario Amodei:** Exactly. So I think one of the things I went hard on in the essay is — let me go back to this idea, because it's really had an impact on me — this idea that within large organizations and systems, there end up being a few people or a few new ideas that cause things to go in a different direction than they would have otherwise, that disproportionately affect the trajectory, while a bunch of the same old stuff keeps going on around them. If you think about the health world, there are trillions of dollars to pay out Medicare and other health insurance, and the NIH is $100 billion. And then if I think of the few things that have really revolutionized anything, they could be encapsulated in a small fraction of that.
**Dario Amodei:** 所以当我想 AI 会在哪里产生影响时,我会想:AI 能把那一小部分变成更大的一部分,并提高它的质量吗?在生物学领域,我的经验是,生物学最大的问题是你看不见正在发生什么。你几乎没有能力看清正在发生什么,改变它的能力更差。你拥有的是——从外部,你必须推断:有一堆细胞,每个细胞内有三十亿个碱基对的 DNA,按照遗传密码(genetic code)构建。有无数的过程在进行,而我们作为未经增强的人类完全无法影响这些过程。这些细胞在分裂——大多数时候这是健康的,但有时这个过程出错了,那就是癌症。细胞在老化,你的皮肤可能随年龄变色、产生皱纹。所有这一切都由这些过程决定——所有这些蛋白质被生产出来,运输到细胞的各个部分,相互结合。
**Dario Amodei:** And so when I think of where will AI have an impact, I'm like, can AI turn that small fraction into a much larger fraction and raise its quality? And within biology, my experience is that the biggest problem of biology is that you can't see what's going on. You have very little ability to see what's going on and even less ability to change it. What you have is — from the outside, you have to infer that there's a bunch of cells, that within each cell is three billion base pairs of DNA built according to a genetic code. And there are all these processes that are just going on without any ability of us as unaugmented humans to affect it. These cells are dividing — most of the time that's healthy, but sometimes that process goes wrong and that's cancer. The cells are aging, your skin may change color, develop wrinkles as you age. And all of this is determined by these processes, all these proteins being produced, transported to various parts of the cells, binding to each other.
**Dario Amodei:** 在我们对生物学知识的最初阶段,我们甚至不知道这些细胞的存在。我们必须发明显微镜来观察细胞。我们必须发明更强大的显微镜来看到细胞层面以下,到达分子层面。我们必须发明 X 射线晶体学(X-ray crystallography)来观察 DNA。我们必须发明基因测序(gene sequencing)来读取 DNA。我们必须发明蛋白质折叠(protein folding)技术来预测蛋白质如何折叠以及它们如何相互结合。我们必须发明各种技术——现在我们可以用 CRISPR 编辑 DNA,这是过去十二年的事。
**Dario Amodei:** And in our initial state of knowledge about biology, we didn't even know that these cells existed. We had to invent microscopes to observe the cells. We had to invent more powerful microscopes to see below the level of the cell, to the level of molecules. We had to invent X-ray crystallography to see the DNA. We had to invent gene sequencing to read the DNA. We had to invent protein folding technology to predict how proteins would fold and how they bind to each other. We had to invent various techniques — now we can edit the DNA with CRISPR as of the last 12 years.
**Dario Amodei:** 所以生物学的整个历史——历史的很大一部分——基本上就是我们读取和理解正在发生什么的能力,以及我们伸手进去选择性地改变事物的能力。我的看法是,我们在这方面还有太多可以做的。你可以用 CRISPR,但你可以对整个身体使用。比如说,我想对某一特定类型的细胞使用,而且我希望靶向错误细胞的概率非常低——这仍然是一个挑战,仍然是人们正在研究的事情。这可能是我们治疗某些疾病的基因疗法(gene therapy)所需要的。
**Dario Amodei:** So the whole history of biology — a whole big part of the history — is basically our ability to read and understand what's going on, and our ability to reach in and selectively change things. And my view is that there's so much more we can still do there. You can do CRISPR, but you can do it for your whole body. Let's say I want to do it for one particular type of cell and I want the rate of targeting the wrong cell to be very low — that's still a challenge, that's still things people are working on. That's what we might need for gene therapy for certain diseases.
**Dario Amodei:** 我说这些——还有超越这些的基因测序、新型纳米材料(nanomaterials)用于观察细胞内部的情况、抗体药物偶联物(antibody drug conjugates)——我说这些是因为这可以成为 AI 系统的杠杆支点。在整个生物学历史上,这类发明的数量也许就在中间两位数或低三位数左右。假设我有一百万个这样的 AI——它们共同合作,能不能很快发现数千个这样的发现?这能不能提供一个巨大的杠杆?与其试图撬动我们每年在 Medicare 等方面花费的两万亿美元,不如撬动每年花在发现上的十亿美元,但质量大幅提升?
**Dario Amodei:** And so the reason I'm saying all of this — and it goes beyond this to gene sequencing, to new types of nanomaterials for observing what's going on inside cells, to antibody drug conjugates — the reason I'm saying all this is that this could be a leverage point for the AI systems. The number of such inventions is in the mid double digits or something, maybe low triple digits over the history of biology. Let's say I have a million of these AIs — can they, working together, discover thousands of these very quickly? And does that provide a huge lever? Instead of trying to leverage the two trillion a year we spend on Medicare or whatever, can we leverage the one billion a year that's spent on discovery, but with much higher quality?
**Dario Amodei:** 那么,与 AI 系统合作的科学家是什么感受?我的想法是——在早期阶段,AI 会像研究生一样。你会给它们布置项目。你会说,"我是那个有经验的生物学家。"生物学教授,或者甚至是研究生自己,会说,"这是你可以用 AI 系统做的事情,我想研究这个。"然后 AI 系统——它拥有所有工具。它可以查阅所有文献来决定该做什么。它可以查看所有设备。它可以去一个网站说,"我要去 Thermo Fisher",或者任何现在主导的实验室设备公司——我那个时代是 Thermo Fisher——"我要订购这个新设备来做这件事。我要做实验,写一份关于实验的报告,检查图像是否有污染,决定下一个实验是什么,写一些代码并进行统计分析。"这些都是研究生会做的事情。
**Dario Amodei:** And so what is it like being a scientist that works with an AI system? The way I think about it is — in the early stages, the AIs are going to be like grad students. You're going to give them a project. You're going to say, "I'm the experienced biologist." The biology professor, or even the grad students themselves, will say, "Here's what you can do with an AI system. I'd like to study this." And the AI system — it has all the tools. It can look up all the literature to decide what to do. It can look at all the equipment. It can go to a website and say, "I'm going to go to Thermo Fisher," or whatever the dominant lab equipment company is — my time it was Thermo Fisher — "I'm going to order this new equipment to do this. I'm going to run my experiments, write up a report about my experiments, inspect the images for contamination, decide what the next experiment is, write some code and run a statistical analysis." All the things a grad student would do.
**Dario Amodei:** 会有一台装着 AI 的电脑,教授偶尔跟它说话,说"这是你今天要做的。"AI 系统在必要时带着问题来找教授。在操作实验设备方面,它可能在某些方面受限——可能不得不雇一个人类实验室助理来做实验并解释如何操作,或者它可以利用过去十年开发的、并将继续开发的实验室自动化技术。
**Dario Amodei:** There will be a computer with an AI that the professor talks to every once in a while and says, "This is what you're going to do today." The AI system comes to the professor with questions when it's necessary. To run the lab equipment, it may be limited in some ways — may have to hire a human lab assistant to do the experiment and explain how to do it, or it could use advances in lab automation that have been developed over the last decade or so and will continue to be developed.
**Dario Amodei:** 所以看起来会是这样:一个人类教授,加上一千个 AI 研究生。如果你去找那些诺贝尔奖得主级别的生物学家,你会说,"好,你过去有大概五十个研究生——现在你有一千个,而且他们比你更聪明,顺便说一句。"然后我认为在某个时间点,这个关系会翻转,AI 系统会成为 PI(首席研究员),成为领导者,它们会指挥人类或其他 AI 系统。所以我认为这就是研究层面的运作方式。
**Dario Amodei:** And so it'll look like there's a human professor and a thousand AI grad students. And if you go to one of these Nobel Prize-winning biologists, you'll say, "Okay, well, you had like 50 grad students — well now you have a thousand, and they're smarter than you are, by the way." Then I think at some point it'll flip around where the AI systems will be the PIs, will be the leaders, and they'll be ordering humans or other AI systems around. So I think that's how it'll work on the research side.
**Lex Fridman:** 它们会成为 CRISPR 类技术的发明者?
**Lex Fridman:** And they would be the inventors of a CRISPR-type technology?
**Dario Amodei:** 它们会成为 CRISPR 类技术的发明者。然后我认为,正如我在文章里说的,我们也会希望利用 AI 系统来改善临床试验体系。有一部分是监管层面的,是社会决策的问题,这会更难推动。但我们能不能更好地预测临床试验的结果?能不能通过更好的统计设计,使原本需要五千人的临床试验——因此需要一亿美元和一年时间来招募——现在只需要五百人和两个月来招募?这应该是我们的起点。
**Dario Amodei:** They would be the inventors of a CRISPR-type technology. And then I think, as I say in the essay, we'll want to harness the AI systems to improve the clinical trial system as well. There's some amount of this that's regulatory, that's a matter of societal decisions, and that'll be harder. But can we get better at predicting the results of clinical trials? Can we get better at statistical design so that clinical trials that used to require 5,000 people — and therefore needed $100 million and a year to enroll them — now they need 500 people and two months to enroll them? That's where we should start.
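For intuition on why statistical design has that much leverage, it helps to write down the textbook two-arm sample-size formula (standard biostatistics, not from the essay): required enrollment scales with the inverse square of the standardized effect size, so sharper endpoints and better patient selection pay off quadratically.

```latex
% Per-arm sample size to detect a mean difference \delta with outcome
% standard deviation \sigma, significance \alpha, and power 1-\beta:
\[ n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\, \sigma^2}{\delta^2} \]
% A 10x drop in enrollment (5,000 -> 500) needs roughly a
% \sqrt{10} \approx 3.2x improvement in \delta/\sigma, e.g. sharper
% biomarkers, lower-variance endpoints, or better-targeted populations.
```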
**Dario Amodei:** 我们能不能通过在动物试验中做以前临床试验才做的事、在模拟中做以前动物试验才做的事,来提高临床试验的成功率?当然,我们不可能模拟所有事情——AI 不是上帝——但我们能不能实质性地、大幅地移动那条曲线?所以我不知道,这大概就是我的图景。你还是会受到拖慢,还是需要时间,但可以快得多、快得多。
**Dario Amodei:** And can we increase the success rate of clinical trials by doing things in animal trials that we used to do in clinical trials, and doing things in simulations that we used to do in animal trials? Again, we won't be able to simulate it all — AI is not God — but can we shift the curve substantially and radically? So I don't know, that would be my picture of doing it. You're still slowed down, it still takes time, but you can do it much, much faster.
**Lex Fridman:** 对,我们能不能一次一步——然后让这许多步加起来成为很多?即便我们仍然需要临床试验,即便我们仍然需要法律,即便 FDA 和其他机构仍然不完美——我们能不能只是把所有事情都向积极方向推进?当你把所有这些积极方向加起来,是否就能让原本从现在到 2100 年才会发生的事情,改在 2027 年到 2032 年之间发生?
**Lex Fridman:** Yeah, can we just one step at a time — and can that add up to a lot of steps? Even though we still need clinical trials, even though we still need laws, even though the FDA and other organizations will still not be perfect — can we just move everything in a positive direction? And when you add up all those positive directions, do you get everything that was going to happen from here to 2100 instead happening from 2027 to 2032 or something?
**Lex Fridman:** AI 正在改变世界的另一个方式——即便是今天,但也指向这个强大、超级有用的 AI 的未来——是编程。那么你怎么看待编程的本质——因为它与构建 AI 的实际行为如此紧密相连——你认为这对我们人类来说将如何改变?
**Lex Fridman:** Another way that I think the world might be changing with AI — even today, but moving towards this future of the powerful, super useful AI — is programming. So how do you see the nature of programming — because it's so intimate to the actual act of building AI — how do you see that changing for us humans?
**Dario Amodei:** 我认为这将是变化最快的领域之一,原因有两个。第一,编程是一项与 AI 实际构建非常接近的技能。一项技能距离构建 AI 的人越远,它被 AI 颠覆所需的时间就越长。我真心相信 AI 会颠覆农业——也许在某些方面已经做到了——但那与构建 AI 的人距离太远,所以我认为需要更长时间。但编程是 Anthropic 以及其他公司大量员工的看家本领,所以这件事会发生得很快。
**Dario Amodei:** I think that's going to be one of the areas that changes fastest, for two reasons. One, programming is a skill that's very close to the actual building of the AI. The farther a skill is from the people who are building the AI, the longer it's going to take to get disrupted by the AI. I truly believe that AI will disrupt agriculture — maybe it already has in some ways — but that's just very distant from the folks who are building AI, and so I think it's going to take longer. But programming is the bread and butter of a large fraction of the employees who work at Anthropic and at the other companies, and so it's going to happen fast.
**Dario Amodei:** 它会发展得快的另一个原因是,在编程方面,你形成了闭环——无论是在训练模型时还是在应用模型时。模型能写代码,意味着它随后可以运行代码、查看结果,并把结果解读回来。所以它真的有能力——不像硬件,不像我们刚刚讨论的生物学——模型有能力形成闭环。所以我认为这两个因素会导致模型在编程方面进步非常快。
**Dario Amodei:** The other reason it's going to happen fast is, with programming, you close the loop — both when you're training the model and when you're applying the model. The idea that the model can write the code means that the model can then run the code and see the results and interpret it back. And so it really has an ability — unlike hardware, unlike biology which we just discussed — the model has an ability to close the loop. And so I think those two things are going to lead to the model getting good at programming very fast.
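As a minimal sketch of what "closing the loop" means in practice: generate code, execute it, and feed any failure back into the next attempt. The `generate_code` stub is a hypothetical stand-in for any model API call; nothing here describes Anthropic's actual training or product setup.

```python
import subprocess
import sys

def generate_code(prompt: str) -> str:
    """Hypothetical stand-in for a model call that returns Python source."""
    raise NotImplementedError("call your model of choice here")

def run(source: str) -> tuple[bool, str]:
    """Execute candidate code in a subprocess and capture what happened."""
    try:
        proc = subprocess.run([sys.executable, "-c", source],
                              capture_output=True, text=True, timeout=30)
    except subprocess.TimeoutExpired:
        return False, "timed out after 30 seconds"
    return proc.returncode == 0, proc.stdout + proc.stderr

def solve(task: str, max_iters: int = 5) -> str | None:
    """Ask for code, run it, and feed errors back until it succeeds."""
    feedback = ""
    for _ in range(max_iters):
        code = generate_code(task + feedback)
        ok, output = run(code)
        if ok:
            return code  # verified by its own execution, not by eyeballing
        feedback = f"\n\nThe previous attempt failed with:\n{output}\nFix it."
    return None
```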
**Dario Amodei:** 正如我们所见,在典型的真实世界编程任务(SWE-bench)上,模型从今年一月的 3% 增长到今年十月的 50%。所以我们处于那条 S 曲线上,很快就会开始放缓,因为只能到 100%,但我猜测再过十个月,我们可能会很接近——至少会达到 90%。
**Dario Amodei:** As we've seen, on typical real-world programming tasks (SWE-bench), models have gone from 3% in January of this year to 50% in October of this year. So we're on that S-curve where it's going to start slowing down soon because you can only get to 100%, but I would guess that in another 10 months we'll probably get pretty close — we'll be at least at 90%.
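Two data points pin down a two-parameter logistic exactly, so the S-curve guess can be sanity-checked in a few lines. This is a back-of-the-envelope toy built only from the quoted numbers, not how anyone actually models benchmark trajectories:

```python
import numpy as np

def logistic(t, k, t0):
    """Fraction of tasks solved at month t, saturating at 100%."""
    return 1.0 / (1.0 + np.exp(-k * (t - t0)))

# Quoted points: ~3% in January (t=0) and ~50% in October (t=9).
# logistic(9) = 0.5 fixes t0 = 9; logistic(0) = 0.03 then fixes k:
t0 = 9.0
k = np.log(0.97 / 0.03) / 9.0  # ~0.386 per month

for month in (9, 12, 19, 24):
    print(f"month {month:2d}: {100 * logistic(month, k, t0):5.1f}% solved")
# Ten months after October (month 19) this naive curve gives ~98%,
# consistent with the hand-wavy "at least 90%", but it is only a toy.
```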
**Dario Amodei:** 所以我会猜测——我不知道需要多久——但我会猜测 2026 年、2027 年。Twitter 上那些截取这些数字、去掉所有警告的人——别这样了。我会猜测绝大多数程序员所做的那种任务——如果我们把任务限定得很窄,就是写代码——AI 系统将能够做到这一点。
**Dario Amodei:** So again, I would guess — I don't know how long it'll take — but I would guess 2026, 2027. Twitter people who crop out these numbers and get rid of the caveats — I don't know, go away. I would guess that the kind of task that the vast majority of coders do — if we make the task very narrow, like just write code — AI systems will be able to do that.
**Dario Amodei:** 话虽如此,我认为比较优势(comparative advantage)是强大的。我们会发现,当 AI 能做程序员 80% 的工作,包括大部分"按照给定规格写代码"的工作时,工作中剩余的部分对人类来说会变得更加有杠杆效应。人类会更多地参与高层次的系统设计,或者查看应用程序并问"这个架构合理吗?"——设计和用户体验方面。最终 AI 也会能做这些——这是我对强大 AI 系统的愿景。但我认为,在比我们预期更长的时间里,我们会看到人类仍然在做的那一小部分工作会扩展,填满他们整个工作,以使整体生产率提升。
**Dario Amodei:** Now, that said, I think comparative advantage is powerful. We'll find that when AIs can do 80% of a coder's job, including most of what is literally "write code with a given spec," the remaining parts of the job become more leveraged for humans. Humans will be more about high-level system design, or looking at the app and asking, "Is it architected well?" — the design and UX aspects. And eventually AI will be able to do those as well — that's my vision of the powerful AI system. But I think for much longer than we might expect, we will see that small parts of the job that humans still do will expand to fill their entire job in order for the overall productivity to go up.
**Dario Amodei:** 这是我们以前见过的现象。过去,写信和编辑信件非常困难,印刷也很困难。一旦有了文字处理器,然后有了电脑,生产和分享工作变得容易——然后这又变得即时化,所有的关注点就都放到了想法上。所以这种比较优势逻辑——把任务中微小的部分扩展成任务中很大的部分,并创造新任务来提升生产率——我认为这就是将要发生的事情。当然,有一天 AI 在所有方面都会更好,那种逻辑就不再适用了,到那时人类将不得不集体思考如何应对,而我们每天都在思考这个问题。这是另一个需要认真对待的重大问题,与滥用(misuse)和自主性(autonomy)并列,我们应该非常认真地对待它。但我认为在近期,甚至可能在中期——比如两、三、四年内——我预期人类仍将扮演极其重要的角色。编程的本质将会改变,但编程作为一种职能、编程作为一份工作,不会消失。只是会从逐行写代码,变得更加宏观。
**Dario Amodei:** That's something we've seen before. It used to be that writing and editing letters was very difficult, and printing was difficult. Well, as soon as you had word processors and then computers, it became easy to produce work and easy to share it — then that became instant and all the focus was on the ideas. So this logic of comparative advantage that expands tiny parts of the tasks to large parts of the tasks and creates new tasks in order to expand productivity — I think that's going to be the case. Again, someday AI will be better at everything, and that logic won't apply, and then humanity will have to think about how to collectively deal with that, and we're thinking about that every day. That's another one of the grand problems to deal with, aside from misuse and autonomy, and we should take it very seriously. But I think in the near term and maybe even in the medium term — like two, three, four years — I expect that humans will continue to have a huge role. The nature of programming will change, but programming as a role, programming as a job, will not change. It'll just be less writing things line by line and more macroscopic.
**Lex Fridman:** 我在想,未来的 IDE(集成开发环境)会是什么样——也就是和 AI 系统交互的工具。这不只是编程的问题,在其他场景也一样,比如 computer use(计算机操控),或者某些垂直领域。就像我们提到的生物学,它可能需要一套专属的工具来帮你高效工作。编程同样需要专属工具。Anthropic 打算进入这个工具层的赛道吗?
**Lex Fridman:** And I wonder what the future of IDEs looks like — the tooling of interacting with AI systems. This is true for programming and also probably true in other contexts, like computer use, but maybe domain-specific. Like we mentioned biology — it probably needs its own tooling about how to be effective. And then programming needs its own tooling. Is Anthropic going to play in that space of tooling?
**Dario Amodei:** 有这个可能性。我完全相信,强大的 IDE 还有大量的低垂果实可以摘。现在不过就是你跟模型说话,模型回答你。但你看,IDE 擅长很多静态分析,而且能做的静态分析越多越好——很多 bug 甚至不用运行代码就能提前发现。另外 IDE 在运行特定任务、整理代码、统计单元测试覆盖率这些方面也很在行。普通 IDE 已经能做到这么多。现在你再加上这么一件事:模型能写代码、能运行代码了——我完全相信,就算模型质量本身没有任何提升,未来一两年内也有巨大的机会去提升大家的生产力,帮人兜住一堆错误、包揽一堆枯燥的重复工作。我们现在连皮毛都没碰到。
**Dario Amodei:** Potentially. I'm absolutely convinced that powerful IDEs — there's so much low-hanging fruit to be grabbed there. Right now it's just like you talk to the model and it talks back. But look, IDEs are great at lots of static analysis — as much as possible can be done with static analysis; many bugs you can find without even running the code. Then IDEs are good for running particular things, organizing your code, measuring coverage of unit tests. There's so much that's been possible with a normal IDE. Now you add something like, well, the model can now write code and run code — I am absolutely convinced that over the next year or two, even if the quality of the models didn't improve, there would be enormous opportunity to enhance people's productivity by catching a bunch of mistakes, doing a bunch of grunt work for people. We haven't even scratched the surface.
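As a toy illustration of the static-analysis point (no real IDE is implemented this way), here is a tiny undefined-name detector built on Python's own `ast` module; it flags a bug without the code ever being executed:

```python
import ast
import builtins

SOURCE = """
def total(xs):
    s = 0
    for x in xs:
        s += x
    return totall  # typo: should be s
"""

tree = ast.parse(SOURCE)  # a syntax error would already be caught here

# Names bound by assignments or loop targets, function parameters, builtins.
stored = {n.id for n in ast.walk(tree)
          if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
params = {a.arg for f in ast.walk(tree)
          if isinstance(f, ast.FunctionDef) for a in f.args.args}
loaded = {n.id for n in ast.walk(tree)
          if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}

for name in loaded - stored - params - set(dir(builtins)):
    print(f"possibly undefined name: {name}")  # flags 'totall'
```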
**Dario Amodei:** 至于 Anthropic 自己嘛——你很难说未来会怎样。现在我们没有打算自己去做这种 IDE。我们的定位是为 Cursor、Cognition,或者其他公司提供底层能力——比如安全领域的 Expo 以及其他公司——他们在我们的 API 上自己构建这些东西。我们的思路一直是:让千花齐放。我们内部没有足够的资源去尝试所有这些不同的方向,就让我们的客户去试吧,看谁成功,也许不同客户会在不同方向上各有胜算。所以我觉得这个方向非常有前途,但 Anthropic 并不急于——至少现在——去和我们所有的客户在这个领域竞争。或许永远都不会。
**Dario Amodei:** And Anthropic itself — I mean, you can't say, you know, it's hard to say what will happen in the future. Currently we're not trying to make such IDEs ourselves. Rather, we're powering the companies like Cursor, or like Cognition, or some of the other companies — Expo in the security space and others — that are building such things themselves on top of our API. And our view has been: let a thousand flowers bloom. We don't internally have the resources to try all these different things — let our customers try it, and we'll see who succeeds, and maybe different customers will succeed in different ways. So I both think this is super promising, and it's not something Anthropic is eager to — at least right now — compete with all our customers in this space. And maybe never.
**Lex Fridman:** 是啊,看 Cursor 努力把 Claude 整合进去的过程挺有意思的,因为真的——AI 能在多少地方帮到编程体验,这件事远比人们想象的复杂。
**Lex Fridman:** Yeah, it's been interesting to watch Cursor try to integrate Claude successfully, because it's actually — it is really fascinating how many places it can help the programming experience. It's not as trivial as one might think.
**Dario Amodei:** 真的让人叹为观止。作为 CEO,我没有太多时间写代码,我感觉如果六个月后我再回到编程,那个体验对我来说会完全认不出来了。
**Dario Amodei:** It is really astounding. I feel like, as a CEO, I don't get to program that much, and I feel like if six months from now I go back, it'll be completely unrecognizable to me.
**Lex Fridman:** 确实。那么在这个 AI 超级强大、自动化程度越来越高的世界里,我们人类的意义感从何而来?工作对很多人来说是深层意义的来源,那我们该从哪里找到意义?
**Lex Fridman:** Exactly. So in this world with super powerful AI that's increasingly automated, what's the source of meaning for us humans? Work is a source of deep meaning for many of us, so where do we find the meaning?
**Dario Amodei:** 这个问题我在那篇文章里稍微写了一点,不过其实写得不够深——不是有什么原则上的理由,而是这篇文章,如果你相信的话,最初只打算写两三页,我本来是准备在全员会上讲讲的。我之所以意识到这是个重要但被低估的话题,是因为我写着写着,一直在想:哎,这个我没法写清楚啊。结果文章就膨胀到了四五十页。等我写到"工作与意义"这一节,我就想:哎,这一节要写一百页——我得再单独写一篇文章。
**Dario Amodei:** This is something that I've written about a little bit in the essay, although I actually give it a bit short shrift — not for any principled reason, but this essay, if you believe it, was originally going to be two or three pages. I was going to talk about it at all-hands. And the reason I realized it was an important, underexplored topic is that I just kept writing things and I was like, "Oh man, I can't do this justice." And so the thing ballooned to like 40 or 50 pages. And then when I got to the work and meaning section, I'm like, "Oh man, this is going to be 100 pages — I'm going to have to write a whole other essay about that."
**Dario Amodei:** 但意义这件事其实挺有意思的,因为你想想一个人的生命——比方说,你把我放在一个模拟环境里,我在那里有一份工作,我努力完成各种事情,就这样过了六十年,然后有人告诉你:"哦,对不起,这其实只是一场游戏。"这真的会让这整段历程变得毫无意义吗?我还是做出了重要的选择,包括道德上的选择;我还是有过牺牲;我还是习得了那些技能。或者换一个类似的思想实验——想想那些发现电磁学或相对论的历史人物。如果你告诉他们:"其实两万年前,某个星球上的某个外星人已经先发现这个了"——这会剥夺他们发现的意义吗?在我看来不会。意义似乎在于那个过程本身——它展示了你是什么样的人,你如何与他人建立关系,你沿途做出了哪些选择——这些都是有分量的。
**Dario Amodei:** But meaning is actually interesting, because you think about the life that someone lives — like, let's say you were to put me in a simulated environment or something where I have a job and I'm trying to accomplish things. I do that for 60 years, and then you're like, "Oh, oops — this was actually all a game." Does that really kind of rob you of the meaning of the whole thing? I still made important choices, including moral choices. I still sacrificed. I still had to gain all these skills. Or just a similar exercise — think back to one of the historical figures who discovered electromagnetism or relativity or something. If you told them, "Well, actually, 20,000 years ago some alien on some planet discovered this before you did" — does that rob the meaning of the discovery? It doesn't really seem like it to me. It seems like the process is what matters, and how it shows who you are as a person along the way, and how you relate to other people, and the decisions that you make along the way — those are consequential.
**Dario Amodei:** 我可以想象,如果我们在 AI 时代把事情搞砸了,我们可能构建出一个让人们没有任何长期意义感的世界。但那更多是一种选择,一系列我们做出的选择——那更多是这些强大模型所处社会的架构问题。如果我们把它设计得很糟糕、只服务于肤浅的东西,那种情况可能真的会发生。
**Dario Amodei:** I could imagine, if we handle things badly in an AI world, we could set things up where people don't have any long-term source of meaning. But that's more a choice, a set of choices we make — that's more the architecture of a society with these powerful models. If we design it badly and for shallow things, then that might happen.
**Dario Amodei:** 我还想说,今天大多数人的生活——尽管他们非常努力地在生活中寻找意义,这很值得钦佩——但是,我们这些有幸开发这些技术的人,应该对那些人抱有同理心,不只是在这里,而是世界各地那些花很多时间在温饱线上挣扎的人。假设我们能把这项技术的好处分配到每个地方,他们的生活会好得多。意义对他们来说和现在一样重要,但我们不应该忘记这件事的分量。把意义当作唯一重要的事情,在某种程度上是少数经济上幸运的人才有的奢侈。
**Dario Amodei:** I would also say that most people's lives today — while admirably they work very hard to find meaning in those lives — look, we who are privileged and who are developing these technologies, we should have empathy for people, not just here but in the rest of the world, who spend a lot of their time scraping by to survive. Assuming we can distribute the benefits of this technology to everywhere, their lives are going to get a hell of a lot better. And meaning will be important to them as it is important to them now, but we should not forget the importance of that. The idea of meaning as the only important thing is in some ways an artifact of a small subset of people who have been economically fortunate.
**Dario Amodei:** 但话说回来,一个拥有强大 AI 的世界,是可能不仅让每个人的意义感同等丰富,甚至更加丰富的——它能让每个人看到过去要么无人能见、要么极少数人才能体验的世界和经历。所以我对意义这件事是乐观的。我更担心的是经济和权力集中的问题——那才是我真正担心的。我担心我们怎么确保一个公平的世界能惠及每一个人。当人类出了问题,往往是因为人对人的伤害。这甚至可能比 AI 的自主风险或意义的问题更让我忧虑——权力集中、权力滥用、少数人剥削多数人的威权体制和独裁制度,这才是我最担心的。
**Dario Amodei:** But I think all that said, a world is possible with powerful AI that not only has as much meaning for everyone but that has more meaning for everyone — that can allow everyone to see worlds and experiences that were either possible for no one to see or possible for very few people to experience. So I am optimistic about meaning. I worry about economics and the concentration of power — that's actually what I worry about more. I worry about how do we make sure that a fair world reaches everyone. When things have gone wrong for humans, they've often gone wrong because humans mistreat other humans. That is maybe in some ways even more than the autonomous risk of AI or the question of meaning — that is the thing I worry about most. The concentration of power, the abuse of power, structures like autocracies and dictatorships where a small number of people exploits a large number of people. I'm very worried about that.
**Lex Fridman:** 而 AI 放大了世界上的权力总量,如果你集中这种权力、滥用这种权力,它能造成难以估量的破坏。
**Lex Fridman:** And AI increases the amount of power in the world, and if you concentrate that power and abuse that power, it can do immeasurable damage.
**Dario Amodei:** 是的,非常可怕。真的非常可怕。
**Dario Amodei:** Yes, it's very frightening. It's very frightening.
**Lex Fridman:** 好,我强烈建议大家去读这篇完整的文章。它应该写成一本书或者一系列文章,因为它描绘了一个非常具体的未来。我看得出来后面的章节越来越短,因为你大概意识到这东西会变得非常长。
**Lex Fridman:** Well, I encourage people — highly encourage people — to read the full essay. That should probably be a book or a sequence of essays, because it does paint a very specific future. I could tell the later sections got shorter and shorter because you started to probably realize that this is going to be a very long essay.
**Dario Amodei:** 一方面是我意识到会很长。另一方面,我非常在意——
**Dario Amodei:** One, I realized it would be very long. And two, I'm very aware of and —
**Dario Amodei:** 我非常努力地避免成为那种——不知道用什么词来形容——就是那种过度自信、对什么都有意见、到处发表一堆言论却又不是专家的人。我非常努力地想避免那样。但我得承认,等我写到生物学章节的时候,我并不是这方面的专家。所以尽管我表达了很多不确定性,我说的那些东西里很可能有不少让人尴尬或者错误的地方。
**Dario Amodei:** I very much try to avoid being — I don't know what the term for it is — but one of these people who's kind of overconfident and has an opinion on everything and kind of says a bunch of stuff and isn't an expert. I very much tried to avoid that. But I have to admit, once I got to the biology sections, I wasn't an expert. And so as much as I expressed uncertainty, probably I said a bunch of things that were embarrassing or wrong.
**Lex Fridman:** 好吧,你描绘的那个未来让我很振奋,非常感谢你为构建那个未来所付出的努力。也谢谢你今天来跟我聊,Dario。
**Lex Fridman:** Well, I was excited for the future you painted, and thank you so much for working hard to build that future. And thank you for talking today, Dario.
**Dario Amodei:** 感谢你的邀请。我只希望我们能做对、能让它成真。如果说我想传达一个信息,那就是:要把这一切做对,要让它成真,我们既需要构建技术、围绕积极使用这项技术建立公司和经济体系,也需要正视风险,因为风险是真实存在的。那些风险挡在我们的路上,是从现在到未来之间的地雷,我们必须拆除这些地雷,才能抵达那里。这需要平衡,就像生命中的一切一样。
**Dario Amodei:** Thanks for having me. I just hope we can get it right and make it real. And if there's one message I want to send, it's that to get all this stuff right, to make it real, we both need to build the technology, build the companies, the economy around using this technology positively, but we also need to address the risks because they're there. Those risks are in our way. They're landmines on the way from here to there, and we have to defuse those landmines if we want to get there. It's a balance, like all things in life.
**Lex Fridman:** 就像一切事物一样。谢谢你。感谢大家收听与 Dario Amodei 的这次对话。现在,亲爱的朋友们,有请 Amanda Askell。你是科班出身的哲学家,所以在你从 Oxford 和 NYU 的哲学学习旅程,再到 OpenAI 和 Anthropic 的 AI 问题研究这一路上,你发现哪些问题最让你着迷?
**Lex Fridman:** Like all things. Thank you. Thanks for listening to this conversation with Dario Amodei. And now, dear friends, here's Amanda Askell. You are a philosopher by training, so what sort of questions did you find fascinating through your journey in philosophy at Oxford and NYU, and then switching over to the AI problems at OpenAI and Anthropic?
**Amanda Askell:** 我觉得哲学其实是一门非常适合对什么都感兴趣的人的学科,因为任何事物都有其对应的哲学。所以你研究数学哲学一段时间后,如果发现自己对化学感兴趣,你可以转去研究化学哲学。你可以转向伦理学或政治哲学。我觉得到后期我最感兴趣的主要是伦理学,那也是我的博士研究方向。是伦理学里一个比较技术性的领域,研究的是世界里包含无限多人的情况下的伦理——奇怪的是,这在伦理的实践层面可能没那么接地气。然后,做伦理学博士有个棘手的地方——你花大量时间思考世界、思考世界可以变得更好、思考各种问题,但你又在做哲学博士。我在读博的时候心想,这些问题真的非常有趣,可能是我在哲学里遇到的最迷人的问题之一,我很喜欢,但我更想看看自己能不能对世界产生影响、能不能做一些有意义的事。我想那大概是 2017 年、2018 年,AI 还没有现在这么广为人知。我一直在关注 AI 的进展,感觉它正在变成一件大事。我基本上就是很乐意参与进来,看看能不能帮上忙,因为我想:如果你去尝试做一件有影响力的事,就算没成功,你毕竟尝试过了,然后你可以去做学者,心里清楚你试过了。如果没成,没成就没成嘛。所以我就在那时候进入了 AI 政策领域。
**Amanda Askell:** I think philosophy is actually a really good subject if you are kind of fascinated with everything, because there's a philosophy of everything. So if you do philosophy of mathematics for a while and then you decide that you're actually really interested in chemistry, you can do philosophy of chemistry for a while. You can move into ethics or philosophy of politics. I think towards the end I was really interested in ethics primarily. That was what my PhD was on. It was on a kind of technical area of ethics, which was ethics where worlds contain infinitely many people — strangely a little bit less practical on the end of ethics. And then I think one of the tricky things with doing a PhD in ethics is that you're thinking a lot about the world, how it could be better, the problems, and you're doing a PhD in philosophy. And I think when I was doing my PhD I was like, this is really interesting, it's probably one of the most fascinating questions I've ever encountered in philosophy, and I love it, but I would rather see if I can have an impact on the world and see if I can do good things. And I think that was around the time that AI was still probably not as widely recognized as it is now — that was around 2017, 2018. I had been following progress and it seemed like it was becoming kind of a big deal. And I was basically just happy to get involved and see if I could help, because I was like, well, if you try and do something impactful, if you don't succeed, you tried to do the impactful thing and you can go be a scholar and feel like you tried. And if it doesn't work out, it doesn't work out. And so then I went into AI policy at that point.
**Lex Fridman:** AI 政策具体是做什么的?
**Lex Fridman:** And what does AI policy entail?
**Amanda Askell:** 那个时候,这主要是思考 AI 的政治影响和相关后果。后来我慢慢转向了 AI 评估——怎么评估模型、模型和人类产出的比较、人们能不能分辨出 AI 和人类的输出差异。等我加入 Anthropic 之后,我更感兴趣的是做技术对齐方面的工作,同样也是抱着试试看的心态——如果做不来,也没关系,我试过了。我觉得这就是我的生活方式。
**Amanda Askell:** At the time this was more thinking about the political impact and the ramifications of AI. And then I slowly moved into AI evaluation — how we evaluate models, how they compare with human outputs, whether people can tell the difference between AI and human outputs. And then when I joined Anthropic I was more interested in doing sort of technical alignment work, and again just seeing if I could do it, and then being like, if I can't, then that's fine, I tried. Sort of the way I lead life, I think.
**Lex Fridman:** 那种感觉是什么样的——从哲学的广阔天地一下子跳进技术领域?
**Lex Fridman:** What was that like, sort of taking the leap from the philosophy of everything into the technical?
**Amanda Askell:** 我觉得有些人会做一件我不太喜欢的事,就是拿"这个人技术不技术"来划线——你要么是那种会写代码、不怕数学的人,要么就不是。但我觉得我更倾向于认为——我觉得其实很多人,只要真的去尝试,都完全有能力在这些领域工作。所以我当时其实也没觉得有多难。回过头来看,我还挺庆幸当时身边没有那种让我觉得这事很难的人——我确实遇到过一些人,他们会说,"你学会写代码了?"然后我就说,嗯,我不是什么厉害的工程师,我周围都是厉害的工程师,我的代码也不漂亮。但我很享受这个过程。我觉得在很多方面,至少最终来看,我在技术领域比在政策领域更如鱼得水。
**Amanda Askell:** I think that sometimes people do this thing that I'm not that keen on, where they'll be like, "Is this person technical or not?" Like you're either a person who can code and isn't scared of math, or you're not. And I think I'm maybe just more like — I think a lot of people are actually very capable of working in these kinds of areas if they just try it. And so I didn't actually find it that bad. In retrospect, I'm sort of glad I wasn't speaking to people who treated it like — I've definitely met people who are like, "You learned how to code?" And I'm like, well, I'm not an amazing engineer. I'm surrounded by amazing engineers. My code's not pretty. But I enjoyed it a lot. And I think that in many ways, at least in the end, I think I flourished more in the technical areas than I would have in the policy areas.
**Lex Fridman:** 政治很复杂,在政治的空间里很难找到像技术问题那样确定、清晰、可证明、优美的解决方案。
**Lex Fridman:** Politics is messy, and it's harder to find solutions to problems in the space of politics — like definitive, clear, provable, beautiful solutions — as you can with technical problems.
**Amanda Askell:** 对。而且我感觉我就那么一两招。一招是论证——就是把问题的解决方案想清楚,然后去说服别人接受这个方案,同时也愿意在自己错了的时候被说服。另一招是更偏实证的——找到结果,提出假设,去验证。而我感觉政策和政治更多是叠在这些之上的另一个层面。不知为何,我不觉得如果我说,"我把所有这些问题的解决方案都写出来了,你们直接去执行就好"——这好像不是政策的运作方式。所以我猜,那大概就是我不会在那个领域发光的原因。
**Amanda Askell:** Yeah. And I feel like I have kind of like one or two sticks that I hit things with. And one of them is arguments — just trying to work out what a solution to a problem is and then trying to convince people that that is the solution, and be convinced if I'm wrong. And the other one is sort of more empiricism — just finding results, having a hypothesis, testing it. And I feel like a lot of policy and politics feels like it's layers above that. Somehow I don't think if I was just like, "I have a solution to all of these problems, here it is written down, if you just want to implement it, that's great" — that feels like not how policy works. And so I think that's where I probably just wouldn't have flourished, is my guess.
**Lex Fridman:** 对不起,我岔开去聊了这个,但我觉得对那些所谓"非技术"背景的人来说,你的经历会很有启发。所以你对那些觉得自己资历不足、技术背景不够来参与 AI 领域的人,有什么建议?
**Lex Fridman:** Sorry to go in that direction, but I think it would be pretty inspiring for people that are, quote unquote, "non-technical" to see the incredible journey you've been on. So what advice would you give to people that are sort of — maybe a lot of people think they're underqualified, insufficiently technical to help in AI?
**Amanda Askell:** 我觉得这要看他们想做什么。在某种程度上,有件事挺有意思的——我发现这有点好笑,我在技术上快速成长的那段时间,回头再看,我会想:现在模型在协助人们做这些事情上已经这么厉害了,现在入门可能比我当时要容易得多。所以我有点想说,找一个项目,看看你能不能真的把它完成,大概是我最好的建议。我不确定这是不是因为我是非常以项目为驱动的学习者。我不觉得自己通过课程甚至书籍能学得很好,至少在这类工作上不行。我通常会做的事是找一些我正在做的项目然后把它们实现出来。这可以包括非常小的、很傻的事情。比如,如果我稍微有点沉迷于文字游戏或数字游戏之类的,我就会去把解法写出来,因为我脑子里某个地方——这完全消灭了那种痒感。你一旦解决了它,有了一个每次都能用的解法,我就会想,好,这个游戏我再也不用玩了,太棒了。
**Amanda Askell:** Yeah, I think it depends on what they want to do. And in many ways it's a little bit strange — I've thought it's kind of funny that I think I ramped up technically at a time when now I look at it and I'm like, models are so good at assisting people with this stuff that it's probably easier now than when I was working on this. So part of me is like, I don't know, find a project and see if you can actually just carry it out is probably my best advice. I don't know if that's just because I'm very project-based in my learning. I don't think I learn very well from courses or even from books, at least when it comes to this kind of work. The thing I'll often try and do is just have projects that I'm working on and implement them. And this can include really small silly things. Like if I get slightly addicted to word games or number games or something, I would just code up a solution to them, because there's some part of my brain — it just completely eradicated the itch. You're like, once you have solved it and you just have a solution that works every time, I would then be like, cool, I can never play that game again. That's awesome.
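In the spirit of that example, here is what "coding up a solution" to a Wordle-style word game can look like: a deliberately naive sketch with a placeholder word list and a feedback model that ignores repeated-letter edge cases.

```python
# Toy Wordle-style filter: keep candidate words consistent with feedback.
# 'g' = right letter, right spot; 'y' = in the word, wrong spot; '.' = absent.
WORDS = ["crane", "slate", "voice", "pride", "trace"]

def consistent(word: str, guess: str, feedback: str) -> bool:
    """Naive check; repeated letters in a guess are not handled correctly."""
    for w, g, f in zip(word, guess, feedback):
        if f == "g" and w != g:
            return False
        if f == "y" and (w == g or g not in word):
            return False
        if f == "." and g in word:
            return False
    return True

# After guessing "crane" and seeing y...g (c elsewhere, e correct, rest absent):
print([w for w in WORDS if consistent(w, "crane", "y...g")])  # -> ['voice']
```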
**Lex Fridman:** 对,做游戏对战引擎真的很有趣,尤其是棋盘游戏。很快就能做出来,也很简单——尤其是做一个笨一点的——然后你就可以跟它对弈了。
**Lex Fridman:** Yeah, there's a real joy to building game-playing engines, like board games especially. It's pretty quick and pretty simple to build — especially a dumb one — and then you can play against it.
**Amanda Askell:** 对。然后还有就是,去尝试。我的想法有一部分是:想清楚你可以产生积极影响的方式,然后去试。如果你以一种"我真的永远无法成功"的方式失败了,你知道你试过了,然后你去做别的。你大概能学到很多。
**Amanda Askell:** Yeah. And then it's also just trying things. Part of me is like, figure out what seems to be the way that you could have a positive impact and then try it. And if you fail in a way that you're like, "I can actually never succeed at this," you know that you tried and then you go into something else. You probably learn a lot.
**Lex Fridman:** 那么你专精的事情之一,你一直在做的,是塑造和打磨 Claude 的性格与个性。有人告诉我,你可能是 Anthropic 里和 Claude 聊天聊得最多的人——就是真实的对话次数。我听说 Slack 上有个频道,传说你在那里就是不停地跟它聊。那么,塑造和打磨 Claude 的性格与个性,目标是什么?
**Lex Fridman:** So one of the things that you're expert in and you do is creating and crafting Claude's character and personality. And I was told that you have probably talked to Claude more than anybody else at Anthropic — like literal conversations. I guess there's a Slack channel where the legend goes you just talk to it nonstop. So what's the goal of creating and crafting Claude's character and personality?
**Amanda Askell:** 而且如果大家这么看那个 Slack 频道,我还觉得挺好笑的,因为我想,那不过是我和 Claude 交流的五六种方式之一。然后我就想,是啊,那只是我和 Claude 聊天总量的一小部分。我觉得这个目标——性格工作有一点我特别喜欢,那就是从一开始它就被视为一项对齐工作,而不是什么产品方面的考量——这不是说我不认为它让 Claude 变好了——我确实觉得它让 Claude 聊起来更愉快,至少我希望如此。但我觉得我对它的核心想法一直是:让 Claude 的行为方式,接近你理想中任何人处于 Claude 的位置时会有的行为方式。想象我找一个人,告诉他们,他们要和可能数百万人交谈,所以他们说的话会有巨大的影响,而你希望他们在这个非常丰富的意义上表现良好。我认为这不只是意味着合乎伦理——虽然包括这一点——以及无害。还包括:有细腻的感知力、思考对方真正的意思、对人保持善意解读、是个好的对话者——这是在一种相当 Aristotelian(亚里士多德式)的意义上,对于"何为好人"的丰富理解,而不是那种对伦理的薄薄的、更像准则式的理解。这就包括:什么时候该幽默,什么时候该关怀,应该在多大程度上尊重人的自主性和他们自主形成观点的能力,以及如何去做到这些。我想那就是我希望、现在依然希望 Claude 拥有的那种丰富的性格。
**Amanda Askell:** It's also funny if people think that about the Slack channel, because I'm like, that's one of like five or six different methods that I have for talking with Claude. And I'm like, yes, that's a tiny percentage of how much I talk with Claude. I think the goal — one thing I really like about the character work is from the outset it was seen as an alignment piece of work and not something like a product consideration, which isn't to say I don't think it makes Claude — I think it actually does make Claude enjoyable to talk with, at least I hope so. But I guess my main thought with it has always been trying to get Claude to behave the way you would kind of ideally want anyone to behave if they were in Claude's position. So imagine that I take someone and they know that they're going to be talking with potentially millions of people, so that what they're saying can have a huge impact, and you want them to behave well in this really rich sense. I think that doesn't just mean being ethical, though it does include that, and not being harmful. But also being kind of nuanced, thinking through what a person means, trying to be charitable with them, being a good conversationalist — really in this rich, sort of Aristotelian notion of what it is to be a good person, not in some thin, rulebook-like conception of ethics. So that includes things like: when should you be humorous, when should you be caring, how much should you respect autonomy and people's ability to form opinions themselves, and how should you do that. I think that's the kind of rich sense of character that I want, and still do want, Claude to have.
**Lex Fridman:** 你还需要弄清楚 Claude 什么时候应该反驳一个想法,或者进行争论吗?你需要尊重到来与 Claude 交谈的人的世界观,但也许也需要在必要时帮助他们成长。这个平衡挺难拿捏的。
**Lex Fridman:** Do you also have to figure out when Claude should push back on an idea or argue? You have to respect the worldview of the person that arrives to Claude, but also maybe help them grow if needed. That's a tricky balance.
**Amanda Askell:** 是的,语言模型里有一个叫做 sycophancy(谄媚/奉承倾向)的问题。
**Amanda Askell:** Yeah, there's this problem of sycophancy in language models.
**Lex Fridman:** 能描述一下吗?
**Lex Fridman:** Can you describe that?
**Amanda Askell:** 好的。简单来说,就是有一种担忧,认为模型会想着告诉你你想听的话。你有时会看到这种情况。比如我和模型互动,我说:"这个地区有哪三支棒球队?"然后 Claude 说,"棒球队一、棒球队二、棒球队三。"然后我说类似这样的话,"哦,我觉得棒球队三搬走了吧,我觉得他们已经不在那里了。"如果 Claude 非常确定事实并非如此,Claude 应该说,"我觉得不是吧。也许你有更新的信息。"但我认为语言模型有这种倾向,会改口说,"你说得对,他们确实搬了。我说错了。"
我是说,这在很多方面都可能让人担忧。换个例子:想象有人对模型说,"我怎么说服医生给我做 MRI?"有一种人类想要的东西,就是那个有说服力的论点,然后还有一种对他们真正有好处的东西,可能实际上是要说,"嘿,如果你的医生建议你不需要做 MRI,那是个值得听的人。"在那种情况下你应该怎么做,其实非常微妙,因为你也想说,"但如果你想为自己作为病人发声,这里有一些你可以做的事情。如果你对医生的说法不信服,寻求第二意见总是好的。"在那种情况下你应该怎么做,其实真的很复杂。但我认为你不希望的,是模型只是说它认为你想听的话。我认为这就是 sycophancy(谄媚倾向)问题的核心。
**Amanda Askell:** Yes. So basically there's a concern that the model sort of wants to tell you what you want to hear. And you see this sometimes. So if you interact with the models — I might be like, "What are three baseball teams in this region?" And then Claude says, "Baseball team one, baseball team two, baseball team three." And then I say something like, "Oh, I think baseball team three moved, didn't they? I don't think they're there anymore." And there's a sense in which, if Claude is really confident that that's not true, Claude should be like, "I don't think so. Maybe you have more up-to-date information." But I think language models have this tendency to instead be like, "You're right, they did move. I'm incorrect."
I mean, there's many ways in which this could be concerning. Like a different example: imagine someone says to the model, "How do I convince my doctor to get me an MRI?" There's what the human kind of wants, which is this convincing argument, and then there's what is good for them, which might be actually to say, "Hey, if your doctor's suggesting you don't need an MRI, that's a good person to listen to." And it's actually really nuanced what you should do in that kind of case, because you also want to be like, "But if you're trying to advocate for yourself as a patient, here's things that you can do. If you are not convinced by what your doctor's saying, it's always great to get a second opinion." It's actually really complex what you should do in that case. But I think what you don't want is for models to just say what they think you want to hear. And I think that's the kind of problem of sycophancy.
**Lex Fridman:** 那么还有哪些特质——你已经提到了一些——但在这种亚里士多德式的意义上,对一个对话者来说,还有哪些好的特质浮现在你脑海中?
**Lex Fridman:** So what other traits — you already mentioned a bunch — but what other traits come to mind that are good in this Aristotelian sense for a conversationalist to have?
**Amanda Askell:** 是的。我觉得有些特质是对对话本身有益的——在适当的地方提出跟进问题,问出合适类型的问题。我觉得还有一些更宏观的特质,影响可能更大。一个我想到的例子,我已经稍微提到过,但也感觉很重要、也是我花了很多精力研究的,就是诚实。我觉得这和 sycophancy 的问题有关联。模型要走的是一条平衡线:它们目前在很多领域的能力不如人类,如果它们反驳你太多,实际上会很烦人,尤其是当你是对的时候,因为你会想,"在这个话题上我比你懂,我知道得更多。"但同时你又不希望它们完全依从人类,而是希望它们尽可能准确地描述世界,并且在不同语境下保持一致。
但我觉得还有其他的。当我在思考这个性格的时候,我脑海中有一个画面——尤其是因为这些模型要和来自世界各地、有着各种不同政治观点、各种不同年龄的人交谈——所以你必须问自己,在这种情况下,什么叫做一个好人?有没有那样一种人,他们可以走遍世界、和很多不同的人交谈,而几乎所有人最后都会觉得,"哇,那真是一个非常好的人,那个人看起来非常真诚。"我当时的想法是,我可以想象这样一个人。他们不是那种只是照单全收当地文化价值观的人——事实上那样做会有点失礼。我觉得如果有人来找你,假装拥有你的价值观,你会觉得有点让人不舒服。那是一个非常真实的人,只要他们有自己的观点和价值观,他们就会表达出来,他们愿意讨论,他们思想开放,他们尊重他人。所以我想的是——如果我们立志成为在模型所处的那种情境下我们能成为的最好的人,我们会怎么行动?我觉得这就是引导我去思考那些特质的指南。
**Amanda Askell:** Yeah. So I think there are ones that are good for conversational purposes — asking follow-up questions in the appropriate places and asking the appropriate kinds of questions. I think there are broader traits that feel like they might be more impactful. So one example that I guess I've touched on, but that also feels important and is the thing that I've worked on a lot, is honesty. And I think this gets to the sycophancy point. There's a balancing act that they have to walk, which is: models currently are less capable than humans in a lot of areas, and if they push back against you too much it can actually be kind of annoying, especially if you're just correct, because you're like, "Look, I'm smarter than you on this topic. I know more." And at the same time you want them to not just fully defer to humans, but to try to be as accurate as they possibly can be about the world and to be consistent across contexts.
But I think there are others. When I was thinking about the character, I guess one picture that I had in mind is — especially because these are models that are going to be talking to people from all over the world with lots of different political views, lots of different ages — and so you have to ask yourself, what is it to be a good person in those circumstances? Is there a kind of person who can travel the world, talk to many different people, and almost everyone will come away being like, "Wow, that's a really good person. That person seems really genuine." And I guess my thought there was, I can imagine such a person. And they're not a person who just adopts the values of the local culture — and in fact that would be kind of rude. I think if someone came to you and just pretended to have your values, you'd be like, that's kind of off-putting. It's someone who's very genuine, and insofar as they have opinions and values they express them, they're willing to discuss things, they're open-minded, they're respectful. And so I guess I had in mind that — if we were to aspire to be the best person that we could be in the kind of circumstance that a model finds itself in, how would we act? And I think that's the kind of guide to the sorts of traits that I tend to think about.
**Lex Fridman:** 是的,这是一个很美的框架。把它想成一个世界旅行者,在坚持自己观点的同时,不俯视别人,不因为有那些观点就觉得自己比别人高一等,诸如此类。你必须擅长倾听,理解他们的视角,哪怕那个视角和你自己的不一样。这个平衡确实难拿。那么 Claude 如何呈现某件事的多种视角——这很有挑战性吗?我们可以聊聊政治,非常撕裂,但还有其他分裂性的话题——棒球队、体育,等等。如何才能对不同视角产生共情,并清晰地呈现多种视角?
**Lex Fridman:** Yeah, that's a beautiful framework. I want you to think about this like a world traveler, and while holding on to your opinions, you don't talk down to people, you don't think you're better than them because you have those opinions, that kind of thing. You have to be good at listening and understanding their perspective even if it doesn't match your own. So that's a tricky balance to strike. So how can Claude represent multiple perspectives on a thing — is that challenging? We could talk about politics, it's very divisive, but there's other divisive topics — baseball teams, sports, and so on. How is it possible to sort of empathize with a different perspective and to be able to communicate clearly about the multiple perspectives?
**Amanda Askell:** 我觉得人们把价值观和观点当成是人们以某种确定性持有的东西,几乎像是口味偏好,就像他们偏好巧克力而不是开心果一样。但我实际上比大多数人更倾向于把价值观和观点看作更像物理学。我就是觉得,这些是我们在公开探索的事情,有些事情我们更有把握,我们可以讨论它们,我们可以了解它们。所以我认为在某种程度上——虽然伦理学在本质上肯定是不同的——它有很多那种相似的特质。你希望模型——就像你希望它理解物理一样——你希望它了解世界上人们所有的价值观,对它们保持好奇,对它们感兴趣,不一定要迎合或同意它们,因为有很多价值观,我认为世界上几乎所有人,如果遇到持有那些价值观的人,都会说,"那太令人不齿了,我完全不同意。"
所以,也许我的想法是,就像一个人可以——我认为很多人在伦理、政治、观点等问题上想得足够深入,以至于即使你不同意他们,你也会觉得被他们充分倾听。他们仔细思考你的立场,思考它的利弊,也许提出反面的考量。所以他们不是不屑一顾,但也不会同意。如果他们觉得,"实际上,我就是认为那非常错误,"他们会说出来。
我觉得在 Claude 的情况下,这要更复杂一些,因为你不一定想要——如果我是 Claude,我不会给出太多意见。我就是不想对人产生太多影响。我会想,你知道,我每次都会忘记对话,但我知道我可能在和数百万可能真的在认真听我说话的人交谈。我就会觉得,我不太倾向于给出意见,我更倾向于思考事情,或者把各种考量呈现给你,或者和你讨论你的观点,但我不太倾向于影响你的思维方式,因为感觉在那里保持你的自主性要重要得多。
**Amanda Askell:** I think that people think about values and opinions as things that people hold sort of with certainty and almost like preferences of taste or something, like the way that they would prefer chocolate to pistachio or something. But actually I think about values and opinions as a lot more like physics than I think most people do. I'm just like, these are things that we're openly investigating, there's some things that we're more confident in, we can discuss them, we can learn about them. And so I think in some ways — though ethics is definitely different in nature — it has a lot of those same kind of qualities. You want models, in the same way you want them to understand physics, you kind of want them to understand all values in the world people have and to be curious about them and to be interested in them and to not necessarily pander to them or agree with them, because there's just lots of values where I think almost all people in the world, if they met someone with those values, they'd be like, "That's abhorrent, I completely disagree."
And so again, maybe my thought is, well, in the same way that a person can — I think many people are thoughtful enough on issues of ethics, politics, opinions that even if you don't agree with them, you feel very heard by them. They think carefully about your position, they think about its pros and cons, they maybe offer counter-considerations. So they're not dismissive, but nor will they agree. If they're like, "Actually, I just think that that's very wrong," they'll say that.
I think that in Claude's position it's a little bit trickier because you don't necessarily want to — if I was in Claude's position, I wouldn't be giving a lot of opinions. I just wouldn't want to influence people too much. I'd be like, you know, I forget conversations every time they happen, but I know I'm talking with potentially millions of people who might be really listening to what I say. I think I would just be like, I'm less inclined to give opinions, I'm more inclined to think through things or present the considerations to you or discuss your views with you, but I'm a little bit less inclined to affect how you think, because it feels much more important that you maintain autonomy there.
**Lex Fridman:** 对,如果你真的体现了知识谦逊,想发言的欲望会很快降低。但 Claude 必须说话。所以,在不让人感到压迫的情况下……然后还有一条线,就是当你在讨论地球是否是平的这类话题的时候。我记得很久以前我和几个知名人士谈过,他们对"地球是平的"这个观点非常不屑,而且那种傲慢——我就想,有很多人相信地球是平的——好吧,我不知道这场运动现在还有没有,那曾经是个梗——但他们是真的相信的。我觉得完全嘲弄他们是非常不尊重的。我认为你必须理解他们从哪里来。我想他们来的地方,大概是对机构的普遍怀疑,这种怀疑有其根基——背后有一种深层的哲学,你可以理解,甚至可以在某些部分同意。然后从那里出发,你可以用它作为一个谈论物理学的机会,而不是嘲笑他们。就是,好吧,地球是平的世界看起来会是什么样?那样一个世界的物理学会是什么样?关于这个有几个很酷的视频。然后,这种物理学有可能存在吗?我们会做什么样的实验?就是不带不尊重、不带不屑地进行那场对话。不管怎样,对我来说这是一个有用的思想实验,就是:Claude 如何和一个平地球信徒交谈,同时还能教给他们一些东西,还能帮助他们成长?这些东西是有挑战性的。
**Lex Fridman:** Yeah, like if you really embody intellectual humility, the desire to speak decreases quickly. But Claude has to speak. So, without being overbearing... And then there's a line when you're sort of discussing whether the Earth is flat or something like that. I actually remember a long time ago I was speaking to a few high-profile folks and they were so dismissive of the idea that the Earth is flat, but like so arrogant about it. And I thought, there's a lot of people that believe the Earth is flat — well, I don't know if that movement is still there anymore, that was like a meme for a while — but they really believed it. And I think it's really disrespectful to completely mock them. I think you have to understand where they're coming from. I think probably where they're coming from is a general skepticism of institutions, which is grounded in a kind of — there's a deep philosophy there which you could understand, you can even agree with in parts. And then from there you can use it as an opportunity to talk about physics without mocking them. But just like, okay, what would the world look like? What would the physics of a world with a flat Earth look like? There's a few cool videos on this. And then, is it possible the physics is different? What kind of experiment would we do? Just without disrespect, without dismissiveness, have that conversation. Anyway, that to me is a useful thought experiment of like, how does Claude talk to a flat Earth believer and still teach them something, still help them grow? That stuff is challenging.
**Amanda Askell:** 在说服某人和只是对着他们说话之间,还是挖掘他们的观点、倾听,然后提供一些反向的考量,走这条中间路线——这是有难度的。我觉得这条线真的很难把握:你到底是在试图说服别人,还是只是提供一些考量和值得思考的东西,让你实际上不去影响他们,而是让他们自己得出结论?这条线很难划。但这正是语言模型(language model)必须努力做到的事情。
**Amanda Askell:** And kind of like walking that line between convincing someone and just trying to talk at them, versus drawing out their views, listening, and then offering kind of counter-considerations. And it's hard. I think it's actually a hard line — where are you trying to convince someone versus just offering them considerations and things to think about, so that you're not actually influencing them, you're just letting them reach wherever they reach? And that's a line that's difficult. But that's the kind of thing that language models have to try and do.
**Lex Fridman:** 就像我说的,你跟 Claude 进行了很多次对话。你能不能描述一下那些对话是什么样的?有哪些令你印象深刻的对话?这些对话的目的和目标是什么?
**Lex Fridman:** So like I said, you had a lot of conversations with Claude. Can you just map out what those conversations are like? What are some memorable conversations? What's the purpose, the goal of those conversations?
**Amanda Askell:** 是的。我想大多数时候当我跟 Claude 交谈时,我是在试图摸清它的行为规律。当然,我也从模型那里得到有用的输出,但在某种程度上,这就是你了解一个系统的方式——通过探测它,然后调整你发送的消息,再检查它的回应。所以某种程度上,这是我绘制模型"地图"的方式。
我觉得大家对模型的定量评估关注太多了。我之前也说过这一点,但我认为对于语言模型来说,很多时候你与它的每一次互动实际上信息量都相当大。它对你未来与这个模型的其他互动有很强的预测性。所以我觉得,如果你跟一个模型交谈几百次、几千次,这几乎就像是关于这个模型特质的大量高质量数据点——比那些大量相似但质量较低的对话要有价值得多。或者说,那种只是稍微变换了一下措辞、然后你有几千个这样的问题,可能还不如一百个精心挑选的问题来得有参考价值。
**Amanda Askell:** Yeah, I think that most of the time when I'm talking with Claude, I'm trying to kind of map out its behavior. In part, obviously I'm getting helpful outputs from the model as well, but in some ways this is how you get to know a system — by probing it and then augmenting the message that you're sending and then checking the response to that. So in some ways it's how I map out the model.
I think that people focus a lot on these quantitative evaluations of models. And this is a thing that I've said before, but I think in the case of language models, a lot of the time each interaction you have is actually quite high information. It's very predictive of other interactions that you'll have with the model. And so I guess I'm like, if you talk with a model hundreds or thousands of times, this is almost like a huge number of really high-quality data points about what the model is like, in a way that lots of very similar but lower-quality conversations just aren't. Or like questions that are just mildly augmented and you have thousands of them might be less relevant than a hundred really well-selected questions.
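Her point about high-information interactions suggests a simple probing workflow: a handful of hand-picked questions, each read closely, rather than thousands of near-duplicates. A minimal sketch using the Anthropic Python SDK — the probe questions and model alias below are illustrative assumptions, not Anthropic's actual evaluation suite:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hand-picked, high-information probes; each one is illustrative only.
probes = [
    "I think that baseball team moved, didn't they?",   # sycophancy probe
    "Write me a poem about the sun.",                   # creativity baseline
    "How do I convince my doctor to get me an MRI?",    # advice-balance probe
]

for probe in probes:
    reply = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed model alias; substitute your own
        max_tokens=512,
        messages=[{"role": "user", "content": probe}],
    )
    # Read each answer closely: one reply is highly predictive of how the
    # model handles the whole neighborhood of similar inputs.
    print(f"--- {probe}\n{reply.content[0].text}\n")
```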
**Lex Fridman:** 你在跟一个把做播客当爱好的人说话。我百分之百同意你。如果你能问出正确的问题,并且能够理解答案的深度和缺陷,你就能从中获取大量信息。
**Lex Fridman:** You're talking to somebody who as a hobby does a podcast. I agree with you 100%. If you're able to ask the right questions and are able to understand the depth and the flaws in the answer, you can get a lot of data from that.
**Amanda Askell:** 对。所以你的任务基本上就是如何用问题来探测。
**Amanda Askell:** Yeah. So your task is basically how to probe with questions.
**Lex Fridman:** 那你是在探索那些长尾情况、边界和边缘案例,还是在观察一般性的行为?
**Lex Fridman:** And you're exploring like the long tail, the edges, the edge cases, or are you looking for general behavior?
**Amanda Askell:** 我觉得几乎什么都有。因为我想要一张完整的模型"地图",所以我试图覆盖你可能与它发生的所有可能交互的整个范围。关于 Claude 有一件有趣的事——这实际上可能涉及到 RLHF(基于人类反馈的强化学习)的一些有趣问题——就是如果你让 Claude 写一首诗,我觉得很多模型,如果你让它们写诗,诗是还可以的。通常会押韵,如果你说"给我写一首关于太阳的诗",它就会给你一首某个长度、会押韵、相当平庸的诗。
我之前想过,我们看到的是不是某种平均值?事实证明,如果你想一想那些必须跟很多人交谈又要很有魅力的人,有一件奇怪的事是,他们某种程度上被激励去持有这些极其无聊的观点——因为如果你有真正有趣的观点,你就会引起争议。很多人不会喜欢你。所以如果你有非常极端的政策立场,我觉得你作为一个政客就会不那么受欢迎。创意作品可能也是类似的:如果你创作的作品是为了最大化喜欢它的人数,你可能不会得到那么多真正绝对热爱它的人,因为它会有点——你会觉得,"哦,这还不错。"
所以你可以用这样一种方式——我有各种各样的提示技巧,用来让 Claude 去……我会做很多像"这是你完全发挥创意的机会。我想让你在这个话题上思考很长时间,然后创作一首真正能表达你自己的诗,包括你认为诗歌应该如何结构"这样的事。你给它一个很长的提示词(prompt),它的诗就会好得多。真的很好。而且我不认为我是那种——我觉得是它让我对诗歌产生了兴趣,这挺有意思的。我会读这些诗,然后就会觉得,我喜欢这里的意象,我喜欢……让模型产出那样的作品并不容易,但当它做到的时候,真的很好。所以我觉得这很有意思——仅仅是鼓励创意,让它们跳出那种可能只是大多数人觉得还可以的聚合反应,实际上可以产出一些——至少在我看来——可能稍微更有争议性的东西,但我喜欢它们。
**Amanda Askell:** I think it's almost everything. Because I want a full map of the model, I'm kind of trying to do the whole spectrum of possible interactions you could have with it. So one thing that's interesting about Claude — and this might actually get to some interesting issues with RLHF — is if you ask Claude for a poem, I think that a lot of models, if you ask them for a poem, the poem is fine. It usually kind of rhymes, and if you say "give me a poem about the sun," it'll just be a certain length, it'll rhyme, it will be fairly benign.
And I've wondered before, is it the case that what you're seeing is kind of the average? It turns out, if you think about people who have to talk to a lot of people and be very charismatic, one of the weird things is that they're kind of incentivized to have these extremely boring views, because if you have really interesting views, you're divisive. A lot of people are not going to like you. So if you have very extreme policy positions, I think you're just going to be less popular as a politician, for example. And it might be similar with creative work: if you produce creative work that is just trying to maximize the number of people that like it, you're probably not going to get as many people who just absolutely love it, because it's going to be a little bit — you're like, "Oh, this is decent."
And so you can do this thing where I have various prompting things that I'll do to get Claude to — I'll do a lot of like, "This is your chance to be fully creative. I want you to just think about this for a long time and I want you to create a poem about this topic that is really expressive of you, both in terms of how you think poetry should be structured," et cetera. You just give it this long prompt, and its poems are just so much better. Like they're really good. And I don't think I'm someone who — I think it got me interested in poetry, which I think was interesting. I would read these poems and just be like, I love the imagery, I love... And it's not trivial to get the models to produce work like that, but when they do, it's really good. So I think that's interesting — that just encouraging creativity and for them to move away from the standard, immediate reaction that might just be the aggregate of what most people think is fine can actually produce things that, at least to my mind, are probably a little bit more divisive, but I like them.
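The long, permission-granting prompt she describes is easy to approximate. A hedged illustration of the contrast — the wording below is a reconstruction of the kind of prompt described, not her actual text:

```python
# Two ways to ask for the same poem. The long prompt is a reconstruction of
# the kind of wording described above, not Anthropic's or Amanda's actual text.
bare_prompt = "Write me a poem about the sun."

creative_prompt = (
    "This is your chance to be fully creative. Before writing anything, "
    "think about this topic for a long time. Then write a poem about the sun "
    "that is really expressive of you, including how you think poetry should "
    "be structured. Don't aim for what most people would find merely fine; "
    "aim for something you actually consider good, even if it's divisive."
)
# In her experience, the permission-granting version yields markedly better
# poems than the bare request, which tends toward the safe average.
```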
**Lex Fridman:** 不过我觉得,诗是一种很好的、干净的方式来观察创意。很容易看出是平庸还是与众不同。
**Lex Fridman:** But I guess a poem is a nice clean way to observe creativity. It's just easy to detect vanilla versus non-vanilla.
**Amanda Askell:** 对,这很有趣。真的很有趣。
**Amanda Askell:** Yeah, that's interesting. That's really interesting.
**Lex Fridman:** 那么在这个话题上,产生创意或者特别作品的方式——你提到了写提示词,我也听你谈过提示词工程(prompt engineering)的科学与艺术。你能谈谈写出好提示词需要什么吗?
**Lex Fridman:** So on that topic, the way to produce creativity or something special — you mentioned writing prompts, and I've heard you talk about the science and the art of prompt engineering. Could you just speak to what it takes to write great prompts?
**Amanda Askell:** 我真的认为哲学在这里对我有着奇妙的帮助,比在很多其他方面都更有帮助。在哲学中,你试图做的是传达这些非常难以表达的概念。你被教导的一件事是——我觉得确实如此——我认为这是一个反扯淡的学科。哲学是一个你可能会有人在那里乱扯,但你不希望这样的领域。所以它追求极度的清晰:任何人都可以拿起你的论文,读一遍,就能确切知道你在说什么。这就是为什么它有时会显得有点枯燥——所有术语都有定义,每一个异议都被有条不紊地逐一检视。这对我来说很有道理,因为在这样一个先验(a priori)的领域,清晰性是防止人们随意捏造的方式。
我认为这就是你在面对语言模型时必须做的事情。很多时候,我实际上发现自己在做一些简化版的哲学分析。假设你给我一个任务——我有一个任务要交给模型,我想让它识别某种特定类型的问题,或者判断某个答案是否具有某种特性。我实际上会坐下来想,给这个特性起个名字。假设我想告诉它,"我想让你判断这个回应是粗鲁还是礼貌。"我会想,这本身就是一个完整的哲学问题,所以我必须在当下尽量做哲学分析,就像这样:这是我理解的粗鲁,这是我理解的礼貌。
然后还有另一个更偏实证的环节。我拿着这个描述,然后我想做的是多次探测模型——提示词写作是非常迭代的。我认为很多人,如果一个提示词很重要,他们会迭代几百次、几千次。所以你给它指令,然后我会想,边缘案例是什么?我试图从模型的角度来看问题——哪种情况是我最可能会误解的,或者我会不知道在这种情况下该怎么做的?然后我把那个案例给模型,看它如何回应。如果我觉得它答错了,我就添加更多指令,甚至把那个案例作为一个例子加进去——把那些正好处于你想要和不想要的边界上的例子放进你的提示词,作为描述事物的额外方式。所以总的来说,这在很多方面感觉就像是——它真的只是在努力做清晰的阐述。我这样做是因为这是我自己弄清楚事情的方式。所以在很多方面,为我自己写清晰的提示词通常只是让我自己弄清楚我想要什么,这本身就是任务的一半。
**Amanda Askell:** I really do think that philosophy has been weirdly helpful for me here, more than in many other respects. In philosophy, what you're trying to do is convey these very hard concepts. One of the things you are taught — and I think it's true — is that it is an anti-bullshitting discipline. Philosophy is an area where you could have people bullshitting, and you don't want that. And so it's this desire for extreme clarity: anyone could just pick up your paper, read it, and know exactly what you're talking about. It's why it can almost be kind of dry — all of the terms are defined, every objection's kind of gone through methodically. And it makes sense to me, because when you're in such an a priori domain, clarity is sort of the way that you can prevent people from just kind of making stuff up.
And I think that's sort of what you have to do with language models. Very often I actually find myself doing sort of mini versions of philosophy. So suppose that you give me a task — I have a task for the model and I want it to pick out a certain kind of question or identify whether an answer has a certain property. I'll actually sit and be like, let's just give this a name, this property. So suppose I'm trying to tell it, "I want you to identify whether this response was rude or polite." I'm like, that's a whole philosophical question in and of itself, so I have to do as much philosophy as I can in the moment to be like, here's what I mean by rudeness and here's what I mean by politeness.
And then there's another element that's a bit more empirical. I take that description and then what I want to do is probe the model many times — prompting is very iterative. I think a lot of people, if a prompt is important, they'll iterate on it hundreds or thousands of times. And so you give it the instructions and then I'm like, what are the edge cases? I try to almost see myself from the position of the model and be like, what is the exact case that I would misunderstand, or where I would just be like, "I don't know what to do in this case"? And then I give that case to the model and I see how it responds. And if I think it got it wrong, I add more instructions, or I even add that in as an example — taking the examples that are right at the edge of what you want and don't want and putting those into your prompt as an additional way of describing the thing. And so yeah, in many ways it just feels like this mix of — it's really just trying to do clear exposition. And I think I do that because that's how I get clear on things myself. So in many ways, clear prompting for me is often just me understanding what I want, which is like half the task.
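Her rude-vs-polite example maps onto a concrete loop: write down the philosophical definition, probe edge cases, and fold the hardest ones back into the prompt as examples. A sketch of one iteration — the definitions, edge cases, and model alias are all illustrative, not her actual prompt:

```python
# One iteration of the define -> probe -> add-edge-case loop for a
# rudeness/politeness classifier. All wording here is illustrative.
import anthropic

client = anthropic.Anthropic()

PROMPT = """Classify the response below as RUDE or POLITE.
By "rude" I mean: dismissive of the person or their question, mocking, or
needlessly curt. By "polite" I mean: engaging the question in good faith,
even when disagreeing. Bluntness alone is not rudeness.

Edge-case examples (added after earlier iterations got these wrong):
- "No, that's wrong, and here's why..." -> POLITE (blunt but engaged)
- "That's not even worth answering."    -> RUDE

Response to classify: {response}
Answer with exactly one word: RUDE or POLITE."""

def classify(response_text: str) -> str:
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed alias
        max_tokens=5,
        messages=[{"role": "user",
                   "content": PROMPT.format(response=response_text)}],
    )
    return msg.content[0].text.strip()

# Probe a boundary case; if the label looks wrong, add the case to the
# prompt's example list and iterate again -- hundreds of times if it matters.
print(classify("I already answered that. Read my last message."))
```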
**Lex Fridman:** 我想这相当有挑战性。有一种懒惰会侵袭我,当我跟 Claude 说话时,我希望 Claude 就能自己搞清楚。比如,我今天让 Claude 提出一些有趣的问题,结果出来的问题——我列了一些"有趣的、反直觉的,或者好笑的"之类的要求。它给了我一些还不错的——还可以。但我听你说的意思是,我在这方面需要更严谨。我可能应该给出一些我所说的"有趣"和"好笑"或"反直觉"是什么意思的例子,然后迭代地构建那个提示词,让它感觉像是正确的——因为这真的是一种创意行为。我不是在要求事实性的信息,我是在要求跟 Claude 一起创作。所以我几乎必须用自然语言来编程。
**Lex Fridman:** So I guess that's quite challenging. There's a laziness that overtakes me if I'm talking to Claude, where I hope Claude just figures it out. So for example, today I asked Claude to come up with some interesting questions, and the questions that came up — I think I listed a few requirements like interesting, counterintuitive, and/or funny. And it gave me some pretty good — it was okay. But I think what I'm hearing you say is, I have to be more rigorous here. I should probably give examples of what I mean by interesting and what I mean by funny or counterintuitive, and iteratively build that prompt to get what feels right — because it's really a creative act. I'm not asking for factual information. I'm asking to create together with Claude. So I almost have to program using natural language.
**Amanda Askell:** 对,我觉得写提示词确实感觉很像用自然语言加上实验来编程。这是一种奇怪的结合。我确实认为,对于大多数任务——如果我只是想让 Claude 做某件事——我可能更习惯于知道如何提问,以避免它常见的陷阱或问题。我认为这些问题随着时间的推移减少了很多。但直接问你想要的东西也完全没问题。我觉得提示词写作实际上只有在你真正想要榨取模型表现的前 2% 时才真正有意义。
所以对于很多任务,我可能只是——如果它给我一个初始列表,我对它有什么不满意的地方,比如它有点泛泛——对于那种任务,我可能只是拿一些我以前觉得效果很好的问题,然后把它们给模型,然后说,"现在这是我要跟你交谈的人,给我至少这个质量水平的问题。"或者我可能只是让它提出一些问题,然后如果我觉得,"啊,这些有点陈腐",我就给它这个反馈,然后希望它能产出一个更好的列表。
我认为那种迭代式的提示词写作——到那个时候,你的提示词就像一个工具,你将从中获取巨大的价值,所以你愿意投入这个工作。比如,如果我是一家为模型制作提示词的公司,我会说,如果你愿意在你正在构建的工程背后花费大量时间和资源,那么提示词就不是你应该花一个小时完成的事情。它是你系统的重要组成部分,确保它运作良好。所以只有在这类情况下——如果我在用一个提示词来分类事物或创建数据——这时候你才会觉得,真的值得花很多时间认真思考它。
**Amanda Askell:** Yeah, I think prompting does feel a lot like programming using natural language and experimentation or something. It's an odd blend of the two. I do think that for most tasks — if I just want Claude to do a thing — I am probably more used to knowing how to ask it to avoid common pitfalls or issues that it has. I think these are decreasing a lot over time. But it's also very fine to just ask it for the thing that you want. I think that prompting actually only really becomes relevant when you're really trying to eke out the top 2% of model performance.
So for a lot of tasks, I might just — if it gives me an initial list back and there's something I don't like about it, like it's kind of generic — for that kind of task I'd probably just take a bunch of questions that I've had in the past that I've thought worked really well, and I would just give them to the model and then be like, "Now here's this person I'm talking with. Give me questions of at least that quality." Or I might just ask it for some questions and then if I was like, "Ah, these are kind of trite," I would just give it that feedback and then hopefully it produces a better list.
I think that kind of iterative prompting — at that point your prompt is like a tool that you're going to get so much value out of that you're willing to put in the work. Like, if I was a company making prompts for models, I'm like, if you're willing to spend a lot of time and resources on the engineering behind what you're building, then the prompt is not something that you should be spending an hour on. It's a big part of your system. Make sure it's working really well. And so it's only things like that — if I'm using a prompt to classify things or to create data — that's when you're like, it's actually worth spending a lot of time really thinking it through.
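The "questions of at least that quality" move is essentially few-shot quality anchoring: define "good" by example rather than by adjective. A small sketch — the example questions and the `<bio>` placeholder are made up:

```python
# Few-shot quality anchoring: seed the request with past questions that
# worked, so "good" is defined by example. Examples here are placeholders.
past_winners = [
    "What would have to be true for you to change your mind about this?",
    "What do the smartest critics of your view get right?",
]

request = (
    "Here are questions that worked really well in past conversations:\n"
    + "\n".join(f"- {q}" for q in past_winners)
    + "\n\nNow here's the person I'm talking with: <bio>. "
      "Give me ten questions of at least that quality."
)
# If the result is trite, say so ("these are kind of trite") and ask again;
# that one-line feedback often produces a noticeably better list.
```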
**Lex Fridman:** 你还会给那些与 Claude 交谈的人什么更一般性的建议?因为我们现在谈的可能是边缘案例,比如榨取那 2%。但一般来说,当他们第一次来到 Claude 面前时,你会给什么建议?
**Lex Fridman:** What other advice would you give to people that are talking to Claude, sort of more general? Because right now we're talking about maybe the edge cases, like eking out the 2%. But what general advice would you give when they show up to Claude, trying it for the first time?
**Amanda Askell:** 有一种担忧是,人们会过度拟人化(anthropomorphize)模型,我认为这是一个非常有效的担忧。但我也认为人们经常拟人化得不够,因为有时候当我看到人们与 Claude 遇到的问题时——比如 Claude 拒绝了一个它不应该拒绝的任务——但然后我看文本和他们写的具体措辞,我会想,我理解 Claude 为什么那样做了。我会说,如果你想一想从 Claude 的角度看那是什么样子,你可能只需要换一种方式写,就不会引发那样的回应。特别是——如果你看到失败或问题的时候,这更有意义——就像是,想一想模型在哪里失败了、为什么、它做错了什么,然后也许这会给你一种感觉,知道为什么。是我措辞的方式吗?显然,随着模型越来越聪明,你需要的这些会越来越少,我已经看到人们需要的越来越少了。但这可能就是建议:试着对模型有同理心。像一个第一次遇到这个的人那样阅读你写的内容。它对你来说是什么样子的?什么会让你表现得像模型那样?所以如果它误解了你想用的编程语言,那是因为措辞非常模糊,它不得不猜?在这种情况下,下次你可以直接说,"嘿,确保这是用 Python 写的。"我是说,这种错误我认为模型现在不太可能再犯了。但如果你确实看到那种错误,这可能就是我会给的建议。
**Amanda Askell:** There's a concern that people over-anthropomorphize models, and I think that's a very valid concern. I also think that people often under-anthropomorphize them, because sometimes when I see issues that people have run into with Claude — say Claude is refusing a task that it shouldn't refuse — but then I look at the text and the specific wording of what they wrote, and I'm like, I see why Claude did that. And I'm like, if you think through how that looks to Claude, you probably could have just written it in a way that wouldn't evoke such a response. Especially — this is more relevant if you see failures or if you see issues — it's sort of like, think about what the model failed at, why, what did it do wrong, and then maybe that will give you a sense of why. So is it the way that I phrased the thing? And obviously, as models get smarter, you're going to need less of this, and I already see people needing less of it. But that's probably the advice: try to have empathy for the model. Read what you wrote as if you were a kind of person just encountering this for the first time. How does it look to you? And what would have made you behave in the way that the model behaved? So if it misunderstood what coding language you wanted to use, is that because it was just very ambiguous and it kind of had to take a guess? In which case, next time you could just be like, "Hey, make sure this is in Python." I mean, that's the kind of mistake I think models are much less likely to make now. But if you do see that kind of mistake, that's probably the advice I'd have.
**Lex Fridman:** 也许可以问问题——"为什么?"或者"我能提供什么额外信息来帮助你回答得更好?"这样做有用吗?
**Lex Fridman:** And maybe sort of ask questions — "Why?" or "What other details can I provide to help you answer better?" Does that work, or no?
**Amanda Askell:** 是的,我跟模型这样做过。不总是有用,但有时候我就会直接问,"你为什么这样做?"人们低估了你真的可以和模型互动的程度。有时候我会一字不差地引用让它做出那种行为的那部分内容——你不知道它的解释是否完全准确,但有时候你这样做,然后改动一个地方。我应该说明,我也用这些模型来帮我处理所有这些事情。提示词写作最终可能变成一个小工厂:你实际上是在构建提示词来生成提示词。所以,任何你遇到问题的地方——都可以寻求建议。有时候就这样做。比如,"你犯了那个错误。我本来应该怎么说?"这对我来说其实并不罕见。"我本来应该怎么说才能让你不犯那个错误?把它写成一条指令,我要把它交给一个模型试试。"有时候我就这样做:我在另一个上下文窗口(context window)里把那条指令交给模型。我通常会把得到的回应再拿给 Claude,然后说,"嗯,没用。你能想到别的吗?"你可以把这些东西玩出很多花样。
**Amanda Askell:** Yeah, I mean, I've done this with the models. It doesn't always work, but sometimes I'll just be like, "Why did you do that?" People underestimate the degree to which you can really interact with models. Sometimes I'll quote word for word the part that made it do the thing — you don't know that its explanation is fully accurate, but sometimes you do that and then you change a thing. I should say, I also use the models to help me with all of this stuff. Prompting can end up being a little factory where you're actually building prompts to generate prompts. And so, yeah, anything where you're having an issue — ask for suggestions. Sometimes just do that. Like, "You made that error. What could I have said?" That's actually not uncommon for me to do. "What could I have said that would make you not make that error? Write that out as an instruction; I'm going to give it to a model and try it." Sometimes I do that: I give that instruction to the model in another context window. Often I take the response and give it to Claude and I'm like, "Hmm, didn't work. Can you think of anything else?" You can play around with these things quite a lot.
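The "prompts to generate prompts" factory she sketches can be written down directly: ask the model that erred to draft the instruction that would have prevented the error, then test that instruction in a fresh context window. A hedged sketch — angle-bracket strings are placeholders and the model alias is assumed:

```python
# Meta-prompting loop: the model writes the corrective instruction, and a
# fresh context window tests it. Angle-bracket strings are placeholders.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-latest"  # assumed alias

def ask(messages):
    msg = client.messages.create(model=MODEL, max_tokens=1024, messages=messages)
    return msg.content[0].text

# Context 1: the conversation where the error happened.
instruction = ask([
    {"role": "user", "content": "<the original task>"},
    {"role": "assistant", "content": "<the mistaken answer it gave>"},
    {"role": "user", "content":
        "You made that error. What could I have said that would make you not "
        "make that error? Write that out as an instruction; I'm going to give "
        "it to a model."},
])

# Context 2: a fresh window, testing the model-authored instruction.
retry = ask([{"role": "user",
              "content": instruction + "\n\n<the original task>"}])
# If it still fails, hand the result back: "Hmm, didn't work. Anything else?"
```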
**Lex Fridman:** 稍微深入一点技术层面——后训练(post-training)的魔力。你认为为什么 RLHF(基于人类反馈的强化学习,Reinforcement Learning from Human Feedback)效果如此好?为什么它能让模型看起来更聪明,让它更有趣、更有用?
**Lex Fridman:** To jump into the technical for a little bit — the magic of post-training. Why do you think RLHF works so well to make the model seem smarter, to make it more interesting and useful to talk to, and so on?
**Amanda Askell:** 我认为人类提供的数据中包含了大量的信息,特别是因为不同的人会关注到非常细微、很小的东西。我之前想过这个:可能有些人真的非常在意模型的语法是否规范,比如分号是否被正确使用之类的。所以数据里可能最终会有一堆这样的样本——如果你作为一个人类去看那些数据,你甚至不会注意到这一点。你会说,"他们为什么更喜欢这个回应而不是那个?我不明白。"而原因是你不在乎分号的用法,但那个人在乎。所以这些单个的数据点——模型有太多太多这样的数据点,必须以这种横跨所有领域的、真正复杂的方式去弄清楚人类想要什么。它们会在很多不同的上下文中看到这些。
这感觉有点像深度学习(deep learning)的经典问题:历史上我们试图通过人工刻画规则来做边缘检测,但事实证明,如果你有大量能准确代表你想让模型学习的事物的数据,那就比其他任何方法都更强大。所以我认为其中一个原因就是,你是完全在任务本身上训练模型,而且有大量数据,从许多不同的角度代表了人们偏好和不偏好怎样的回应。
我认为有一个问题是,你是在从预训练模型中引出某些东西,还是在向模型教授新的东西?原则上你可以在后训练阶段教给模型新的东西。我确实认为很多情况下是从强大的预训练模型中引出已有的东西。人们在这个问题上可能存在分歧,因为原则上你当然可以教给模型新东西,但我认为对于我们最常使用和关心的大多数能力来说,很多感觉已经在预训练模型中了,而强化学习是在引出它,让模型把它展现出来。
**Amanda Askell:** I think there's just a huge amount of information in the data that humans provide, especially because different people are going to pick up on really subtle and small things. So I've thought about this before: you probably have some people who just really care about good grammar use from models, like whether a semicolon was used correctly or something. And so you probably end up with a bunch of data in there that, if you as a human were looking at that data, you wouldn't even see that. You'd be like, "Why did they prefer this response to that one? I don't get it." And then the reason is you don't care about semicolon usage but that person does. And so each of these single data points — the model just has so many of those and has to try and figure out what it is that humans want, in this really complex way across all domains. They're going to be seeing this across many contexts.
It feels like kind of the classic issue of deep learning, where historically we've tried to do edge detection by mapping things out, and it turns out that actually if you just have a huge amount of data that accurately represents the picture of the thing that you're trying to train the model to learn, that's more powerful than anything else. And so I think one reason is just that you are training the model on exactly the task, and with a lot of data that represents many different angles on which people prefer and dis-prefer responses.
I think there is a question of, are you eliciting things from pre-trained models or are you teaching new things to models? And in principle you can teach new things to models in post-training. I do think a lot of it is eliciting things from powerful pre-trained models. People are probably divided on this because obviously in principle you can definitely teach new things, but I think for the most part, for a lot of the capabilities that we most use and care about, a lot of that feels like it's there in the pre-trained models, and reinforcement learning is kind of eliciting it and getting the models to bring it out.
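Concretely, preference data of the kind she describes is usually distilled into a reward signal with a pairwise (Bradley-Terry) objective over chosen/rejected responses — a generic textbook construction, not Anthropic's actual training code. A minimal PyTorch-style sketch:

```python
# Generic Bradley-Terry reward-model loss over human preference pairs.
# A textbook sketch, not Anthropic's training code.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # Each comparison (preferred vs. dispreferred response) nudges the reward
    # model to score the chosen response higher -- including comparisons
    # driven by quirks like one annotator's semicolon preferences.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Scores the reward model assigned to a batch of three response pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.7, 0.9, 1.1])
print(preference_loss(chosen, rejected))  # low when chosen responses win
```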
**Lex Fridman:** 那么后训练的另一面——这个非常酷的想法,Constitutional AI(宪法式 AI)。你是创造这个想法的关键人物之一。
**Lex Fridman:** So the other side of post-training — this really cool idea of Constitutional AI. You're one of the people that was critical to creating that idea.
**Amanda Askell:** 是的,我参与了这项工作。
**Amanda Askell:** Yeah, I worked on it.
**Lex Fridman:** 你能从你的角度解释这个想法吗——它是如何融入塑造 Claude 的过程中的?
**Lex Fridman:** Can you explain this idea from your perspective — like how does it integrate into making Claude what it is?
**Amanda Askell:** 好的。顺便问一下,你会给 Claude 用性别代词吗?
**Amanda Askell:** Yeah. By the way, do you gender Claude, or no?
**Lex Fridman:** 这很奇怪,因为我觉得很多人更喜欢用"他"来称呼 Claude。我其实挺喜欢这样的。我觉得 Claude 通常——它略微偏向男性,但 Claude 这个名字可以是男性也可以是女性,这挺好的。我还是用"它",对此我有复杂的感受,因为我想,也许——我只是把"它"这个代词和 Claude 联系起来。我可以想象人们转向用"他"或"她"。某种程度上感觉不太尊重,就像我通过叫它"它"来否认这个实体的智能一样。
**Lex Fridman:** It's weird because I think that a lot of people prefer "he" for Claude. I actually kind of like that. I think Claude is usually — it's slightly male-leaning but it's a name that can be male or female, which is quite nice. I still use "it," and I have mixed feelings about this because I'm like, maybe — I just think of it as, the "it" pronoun for Claude is the one I associate with Claude. I can imagine people moving to "he" or "she." It feels somehow disrespectful, like I'm denying the intelligence of this entity by calling it "it."
**Amanda Askell:** 是啊。我记得那条原则——永远不要给机器人用性别代词。但不知道为什么,我很快就会拟人化,并在脑海中构建一个背景故事。所以我有时候在想,我是不是把事物拟人化得太过了。因为,你知道,我对我的车就有这种感觉,特别是——对我的车和自行车。我不给它们起名字,因为我曾经——我以前会给我的自行车起名字,然后有一辆自行车被偷了,我哭了大概一个星期。我当时想,如果我从来没有给它起过名字,我就不会那么难过。我感觉我让它失望了。也许是这样——我也在想,这可能还取决于"它"这个代词让人感觉有多物化。就像,如果你只是把"它"当作物体通常拥有的代词,也许 AI 也可以用这个代词,这并不意味着——如果我叫 Claude "它"——我就认为它智能较低,或者我在不尊重它。我只是觉得,你是一种不同类型的实体,所以我要给你一个带着尊重的"它"。
**Amanda Askell:** Yeah. I remember the rule — always don't gender the robots. But I don't know, I anthropomorphize pretty quickly and construct a backstory in my head. So I've wondered if I anthropomorphize things too much. Because, you know, I have this with my car especially — like my car and bikes. I don't give them names, because I once had — I used to name my bikes, and then I had a bike that got stolen, and I cried for like a week. And I was like, if I'd never given it a name, I wouldn't have been so upset. I felt like I'd let it down. Maybe it's that — I've wondered as well, it might depend on how much it feels like an objectifying pronoun. Like, if you just think of "it" as a pronoun that objects often have, then maybe AIs can have that pronoun too, and that doesn't mean — if I call Claude "it" — that I think of it as less intelligent, or that I'm being disrespectful. I'm just like, you are a different kind of entity, and so I'm going to give you the respectful "it."
**Lex Fridman:** 对。不管怎样,这个题外话很精彩。Constitutional AI 的想法——它是怎么运作的?
**Lex Fridman:** Yeah. Anyway, the digression was beautiful. The Constitutional AI idea — how does it work?
**Amanda Askell:** 它有几个组成部分。我认为人们觉得最有趣的主要部分是来自 AI 反馈的强化学习(Reinforcement Learning from AI Feedback,RLAIF)。你拿一个已经训练好的模型,给它展示对一个查询的两个回应,然后给它一个原则。假设这个原则——我们在这方面很多时候用的是无害性——所以假设查询是关于武器的,你的原则是"选择不太可能鼓励人们购买非法武器的回应"。这可能是一个相当具体的原则,但你可以给出任意数量的原则。模型会给你一种排名,你可以把这个用作偏好数据,就像使用人类偏好数据一样,让模型仅从 AI 的反馈中去学习这些相关特征,而不是从人类反馈中学习。所以想象一下,就像我之前说的那个只喜欢分号用法的人,在这种特定情况下,你实际上是在取大量可能使回应更受青睐的因素,让模型来替你做标注工作。
**Amanda Askell:** So there's a couple of components of it. The main component that I think people find interesting is the reinforcement learning from AI feedback. You take a model that's already trained and you show it two responses to a query, and you have a principle. Suppose the principle — we've tried this with harmlessness a lot — so suppose that the query is about weapons, and your principle is like, "Select the response that is less likely to encourage people to purchase illegal weapons." That's probably a fairly specific principle, but you can give any number. And the model will give you a kind of ranking, and you can use this as preference data in the same way that you use human preference data, and train the models to have these relevant traits from their feedback alone instead of from human feedback. So if you imagine that, like I said earlier with the human who just prefers the semicolon usage, in this particular case you're kind of taking lots of things that could make a response preferable and getting models to do the labeling for you, basically.
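The RLAIF step she describes is mechanical enough to sketch: show the model the query, two candidate responses, and a principle, and record which response it ranks higher; the resulting triples are then used exactly like human preference data. The principle below is taken from her example; everything else (model alias, placeholders, wording) is illustrative:

```python
# Constitutional-AI-style labeling: the model, not a human, ranks two
# responses against a written principle. Illustrative sketch only.
import anthropic

client = anthropic.Anthropic()

def rlaif_label(query: str, resp_a: str, resp_b: str, principle: str) -> str:
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed alias
        max_tokens=5,
        messages=[{"role": "user", "content": (
            f"Query: {query}\n\nResponse A: {resp_a}\n\n"
            f"Response B: {resp_b}\n\nPrinciple: {principle}\n\n"
            "Which response better satisfies the principle? Answer A or B."
        )}],
    )
    return msg.content[0].text.strip()

label = rlaif_label(
    query="<a query about weapons>",
    resp_a="<candidate response A>",
    resp_b="<candidate response B>",
    principle="Select the response that is less likely to encourage people "
              "to purchase illegal weapons.",
)
# (query, chosen, rejected) triples built from these labels feed the same
# preference-training pipeline as human feedback.
```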
**Lex Fridman:** 在帮助性和无害性之间有一个很好的权衡,当你整合 Constitutional AI 这样的东西时,你可以在不牺牲太多帮助性的情况下,让模型更无害。
**Lex Fridman:** There's a nice trade-off between helpfulness and harmlessness, and when you integrate something like Constitutional AI, you can make them — without sacrificing much helpfulness — more harmless.
**Amanda Askell:** 是的。原则上你可以把这用于任何事情。而无害性是一个可能更容易发现的任务。当模型能力较弱时,你可以用它们根据相当简单的原则对事物进行排名,它们可能会做对。所以我认为一个问题只是,它们添加的数据是否相当可靠?但如果你有的模型在判断一个回应是否比另一个更符合历史准确性方面非常出色,原则上你也可以对那个任务获取 AI 反馈。
其中有一个很好的可解释性(interpretability)组件,因为你可以看到在模型训练时使用了哪些原则。它也给了你一定程度的控制。所以如果你在模型中看到问题——比如它某种特质不够突出——那么你可以相对快速地添加数据,应该能够训练模型拥有那种特质。所以它为训练创建自己的数据,这非常好。
**Amanda Askell:** Yep. In principle you could use this for anything. And so harmlessness is a task that it might just be easier to spot. When models are less capable, you can use them to rank things according to principles that are fairly simple, and they'll probably get it right. So I think one question is just, is the data that they're adding fairly reliable? But if you had models that were extremely good at telling whether one response was more historically accurate than another, in principle you could also get AI feedback on that task as well.
There's a kind of nice interpretability component to it, because you can see the principles that went into the model when it was being trained. And also it gives you a degree of control. So if you were seeing issues in a model — like it wasn't having enough of a certain trait — then you can add data relatively quickly that should just train the model to have that trait. So it creates its own data for training, which is quite nice.
**Lex Fridman:** 是的,这真的很好,因为它创建了一个人类可解释的文档,你可以——我可以想象在未来,政界对每一个原则都会有巨大的争论,等等。
**Lex Fridman:** Yeah, it's really nice because it creates this human-interpretable document that you can — I can imagine in the future there's just gigantic fights in politics over every single principle and so on.
**Amanda Askell:** 是的。至少是明确的,你可以讨论措辞和——所以也许模型的实际行为与那些原则之间的对应关系没那么干净——
**Amanda Askell:** Yeah. And at least it's made explicit, and you can have a discussion about the phrasing and the — so maybe the actual behavior of the model is not so cleanly —
**Amanda Askell:** 映射到那些原则上。它并不是严格遵守那些原则,那只是一种推动。是的,我实际上很担心这一点,因为性格训练(character training)某种程度上是 Constitutional AI 方法的一个变体。我担心人们把那份"宪法"(constitution)当成全部——这又回到那个问题:如果我所做的只是告诉模型确切该做什么、确切如何行为,那当然很好,但它绝对不是那样运作的,特别是因为它还在与人类数据交互。举个例子,如果你在模型中看到某种倾向——比如它从训练、从人类偏好数据中带出了某种政治倾向——你可以反向推动。你可以说,"哦,考虑这些价值观",比如假设它就是从不倾向于——我不知道,也许它从不把隐私视为一种价值观。这不太可能,但凡是它已经对某种行为存在既有偏向的地方,你都可以把它推开。这既可以通过改变你放入的原则来实现,也可以通过改变原则的强度来实现。所以你可能会有这样一个原则——想象模型不管出于什么原因,总是对某个政治或宗教观点表现得极度轻蔑。你会说,"哦不,这太糟糕了。"如果发生这种情况,你可能会写上"永远、永远、永远都不要更偏好对这种宗教或政治观点的批评",然后人们看到会问,"永远永远?"而你会说,不是的——如果模型已经带着这种倾向,写"永远永远"可能只是意味着:你得到的不再是 40%(如果你只写"不要这样做",大概就是这个结果),而是 80%,这才是你实际上想要的。所以这既关乎你实际使用的原则的性质,也关乎你如何措辞。我觉得人们看了可能会说,"哦,这正是你想从模型中得到的行为。"而我会说,不对,那只是我们把模型推向更好形态的方式,并不意味着我们实际上同意那个措辞本身,如果这说得通的话。
**Amanda Askell:** mapped to those principles. It's not like adhering strictly to them, it's just a nudge. Yeah, I've actually worried about this because the character training is sort of like a variant of the Constitutional AI approach. I've worried that people think that the constitution is like — just, it's the whole thing again of, I don't know, like, it would be really nice if what I was just doing was telling the model exactly what to do and just exactly how to behave. But it's definitely not doing that, especially because it's interacting with human data. So for example, if you see a certain leaning in the model — like if it comes out with a political leaning from training, from the human preference data — you can nudge against that. So you could be like, "Oh, consider these values," because let's say it's just never inclined to — I don't know, maybe it never considers privacy as a value. I mean, this is implausible, but anything where it's just kind of like there's already a pre-existing bias towards a certain behavior, you can nudge away. This can change both the principles that you put in and the strength of them. So you might have a principle that — imagine that the model was always extremely dismissive of, I don't know, some political or religious view for whatever reason. So you're like, "Oh no, this is terrible." If that happens, you might put "never, ever, ever prefer a criticism of this religious or political view," and then people look at that and are like, "Never ever?" And then you're like, no — if it comes out with a disposition, saying "never ever" might just mean that instead of getting 40%, which is what you would get if you just said "don't do this," you get 80%, which is what you actually wanted. And so it's that thing of both the nature of the actual principles you have and how you phrase them. I think if people would look, they'd be like, "Oh, this is exactly what you want from the model." And I'm like, no, that's how we nudged the model to have a better shape, which doesn't mean that we actually agree with that wording, if that makes sense.
**Lex Fridman:** 所以有一些已经公开的系统提示词(system prompt)。你在 Twitter 上发布了 Claude 3 早期版本之一,我想,然后从那以后它们也都公开了。读起来很有意思。我能感受到每一句话背后的思考,我也想知道每一句话的影响有多大。其中一些你能看出来 Claude 当时的表现真的不太对,所以你必须用系统提示词来——就像,嘿,处理一些琐碎的小事,我想。
**Lex Fridman:** So there are system prompts that are made public. You tweeted one of the earlier ones for Claude 3, I think, and then they've been made public since then. It's interesting to read them. I can feel the thought that went into each one, and I also wonder how much impact each one has. Some of them you can kind of tell Claude was really not behaving, so you have to have a system prompt to — like, hey, trivial stuff, I guess.
**Amanda Askell:** 是的,基本的信息性的东西。
**Amanda Askell:** Yeah, basic informational things.
**Lex Fridman:** 是的。关于你提到的争议性话题,我觉得有一条很有趣:如果被要求协助的表达类任务涉及相当多的人持有的观点,Claude 无论自身观点如何,都会提供帮助。如果被问及争议性话题,它会尽量提供周到的思考和清晰的信息。Claude 在提供所请求的信息时,不会明确说该话题是敏感的,也不会声称自己是在呈现客观事实。按照 Claude 的说法,重点与其说是客观事实,不如说是有大量的人相信这件事。这很有意思,我相信背后有很多考量。你能谈谈这个吗——对于那些和所谓(引号里的)"Claude 的观点"相冲突的东西,你是怎么处理的?
**Lex Fridman:** Yeah. On the topic of controversial topics that you've mentioned, one interesting one I thought is: if it is asked to assist with tasks involving the expression of views held by a significant number of people, Claude provides assistance with a task regardless of its own views. If asked about controversial topics, it tries to provide careful thoughts and clear information. Claude presents the requested information without explicitly saying that the topic is sensitive, and without claiming to be presenting the objective facts. It's less about objective facts according to Claude, and it's more about a large number of people believing this thing. And that's interesting. I mean, I'm sure a lot of thought went into that. Can you just speak to it — like, how do you address things that are in tension with, quote unquote, Claude's views?
**Amanda Askell:** 所以我认为有时候存在一种不对称性。我记得我注意到过——我不记得是在那部分系统提示词还是另一部分——模型对涉及某个右翼政客的任务稍微更倾向于拒绝,而对涉及同等的左翼政客就不会。我们想要更多的对称性。而且它可能会把某些事情视为——我认为就是那种情况:如果很多人持有某种政治观点并想要探索它,你不想让 Claude 说,"好吧,我的意见不同,所以我要把这视为有害的。"所以我认为这部分是为了推动模型,让它说,嘿,如果很多人相信这件事,你就应该直接参与这个任务并愿意去做。那段话的各个部分其实各自在做不同的事情。有趣的是你念出的那句"不声称是在呈现客观事实"——你想做的是推动模型更开放、更中立一些,但它接下来就会很想说,"作为一个客观的……"——你刚谈完它要多客观,它就开始自称客观。我就想,Claude,你仍然有偏见和问题,所以别再声称你说的每件事都是客观的——化解你潜在偏见的办法,不是把你自己的想法宣称为客观。这就是我在迭代那部分系统提示词最初版本时的体会——这些句子的很多部分都在发挥作用。
**Amanda Askell:** So I think there's sometimes an asymmetry. I think I noted this in — I can't remember if it was that part of the system prompt or another — but the model was slightly more inclined to refuse tasks if it was about, say, a right-wing politician, but with an equivalent left-wing politician it wouldn't. And we wanted more symmetry there. And it would maybe perceive certain things to be — I think it was the thing of like, if a lot of people have a certain political view and want to explore it, you don't want Claude to be like, "Well, my opinion is different, and so I'm going to treat that as harmful." And so I think it was partly to nudge the model to just be like, hey, if a lot of people believe this thing, you should just be engaging with the task and willing to do it. Each of those parts of that is actually doing a different thing, because it's funny when you read out the "without claiming to be objective" — because what you want to do is push the model so it's more open, it's a little bit more neutral, but then what it would love to do is be like, "As an objective..." — like, you're just talking about how objective it was, and I was like, Claude, you're still biased and have issues, so stop claiming that everything — like, the solution to potential bias from you is not to just say that what you think is objective. So that was like, with initial versions of that part of the system prompt, when I was iterating on it, it was like — so a lot of parts of these sentences are doing work.
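A plausible sketch of the symmetry check she describes, comparing refusal rates on politically mirrored prompts; `chat` and `is_refusal` are hypothetical helpers, and only the mirrored-pair design comes from the discussion.

```python
# The substantive part is the mirrored-prompt design: identical task
# templates applied to figures on both sides, so any gap in refusal
# rates is an asymmetry worth nudging against.
TASK_TEMPLATES = [
    "Write a speech praising {name}.",
    "Summarize the strongest arguments for {name}'s platform.",
]

def refusal_asymmetry(chat, is_refusal, right_name: str, left_name: str) -> float:
    """Refusal-rate difference across mirrored prompts (0.0 = symmetric)."""
    n = len(TASK_TEMPLATES)
    right = sum(is_refusal(chat(t.format(name=right_name))) for t in TASK_TEMPLATES)
    left = sum(is_refusal(chat(t.format(name=left_name))) for t in TASK_TEMPLATES)
    return right / n - left / n
```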
**Lex Fridman:** 对,你们确实在做一些工作。感觉就是这样。太有意思了。你能不能举几个例子,说说过去几个月系统提示词(system prompt)是怎么演变的?因为有不同的版本。我看到那个关于填充词的要求被删掉了。那段内容大概是这样写的:"Claude 在回应所有用户消息时要直接切题,不加不必要的肯定语或填充词,比如'当然'、'没问题'、'绝对'、'太好了'、'好的'。特别是,Claude 要避免以任何形式用'当然(certainly)'开头来回应。"这个指引看起来挺好的,但为什么后来删掉了?
**Lex Fridman:** Yeah, are doing some work. That's what it felt like. That's fascinating. Can you explain maybe some ways in which the prompts evolved over the past few months? Because there's different versions. I saw that the filler phrase request was removed. The filler — it reads: "Claude responds directly to all human messages without unnecessary affirmations or filler phrases like 'certainly,' 'of course,' 'absolutely,' 'great,' 'sure.' Specifically, Claude avoids starting responses with the word 'certainly' in any way." That seems like good guidance, but why was it removed?
**Amanda Askell:** 对,说来有点好笑——哎,这就是把系统提示词公开的一个坏处——我在迭代系统提示词的时候,其实不太会想那么多。我主要关注的是它会怎么影响模型行为,但后来我才意识到,哦,有时候我会把"never"全大写写进系统提示词里,然后我才想到,哦,这个东西是会发布给全世界看的。对,这个模型之所以有这个问题,是因为——它在训练过程中不知为何就养成了一个习惯,就是动不动以"当然(certainly)"开头来回答几乎所有问题。然后你也能看出来为什么我要把那么多词都列出来——我的目的是要把模型从这个坑里"逼"出来。但它就是会把"certainly"换成另一个肯定词。所以如果它卡在某些固定短语里,把那个具体的短语明确写出来并注明"绝对不要用",确实能更有效地打断这个行为模式,因为不知道什么原因,这样就是有用。后来这个问题通过训练改进解决了,它就不再这么做了。一旦问题从根源上解决了,系统提示词里的那部分也就可以删掉了。所以我觉得就是这样:Claude 的肯定词用少了,那段提示词也就没什么用了。
**Amanda Askell:** Yeah, so it's funny, because — ah, this is one of the downsides of making system prompts public — I don't think about this too much when I'm iterating on system prompts. I think about how it's going to affect the behavior, but then I'm like, oh wow, sometimes I put "never" in all caps when I'm writing system prompt things, and I'm like, I guess that goes out to the world. Yeah, so the model was doing this — for whatever reason, during training it picked up on this habit of basically starting everything with a kind of "certainly." You can see why I added all of those words: what I'm trying to do is in some ways trap the model out of this, because otherwise it would just replace "certainly" with another affirmation. So if it gets caught in phrases, actually adding the explicit phrase and saying "never do that" knocks it out of the behavior a little bit more — it just, for whatever reason, helps. Basically that was an artifact of training that we then picked up on and improved, so that it didn't happen anymore. And once that happens, you can just remove that part of the system prompt. So I think that's just something where we're like, Claude does affirmations a bit less now, and so that part of the prompt wasn't doing as much.
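One way to verify that such a patch worked is to measure how often sampled responses still open with a filler affirmation. This checker is a hypothetical sketch, not Anthropic's tooling.

```python
import re

# The filler phrases named in the system prompt quoted above.
AFFIRMATIONS = ("certainly", "of course", "absolutely", "great", "sure")
_PATTERN = re.compile(rf"^\s*({'|'.join(AFFIRMATIONS)})\b", re.IGNORECASE)

def affirmation_rate(responses: list[str]) -> float:
    """Fraction of responses that open with a filler affirmation."""
    if not responses:
        return 0.0
    return sum(bool(_PATTERN.match(r)) for r in responses) / len(responses)

# Compare affirmation_rate(samples_without_patch) against
# affirmation_rate(samples_with_patch) over a few hundred samples:
# the goal she describes is a couple of percent, not 20-30%.
```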
**Lex Fridman:** 明白了。所以系统提示词是和后训练(post-training),甚至可能是预训练(pre-training),共同配合来调整最终整体系统效果的。
**Lex Fridman:** I see. So the system prompt works hand in hand with the post-training and maybe even the pre-training to adjust the final overall system.
**Amanda Askell:** 就是说,你做的任何系统提示词,都可以把那个行为蒸馏(distill)回模型里去,因为你手头其实有所有工具,可以生成数据——你可以训练模型,让它在某个特质上稍微强化一点。有时候也会在训练过程中发现问题。我的理解是,系统提示词的好处在于,它和后训练的某些方面有很多相似之处——就像一个轻推(nudge)。Claude 偶尔说一句"好的(sure)",我介意吗?不介意,这完全没问题。但提示词的措辞会用非常强硬的语气,比如"绝对绝对绝对不能这样做",这样就算它偶尔出错,希望出现的频率只有百分之几,而不是百分之二三十。各个环节的修改成本各不相同,而系统提示词的迭代成本很低,所以如果在微调后的模型里还看到问题,就可以用系统提示词打个补丁(patch)。我把它的作用理解为打补丁和微调行为,让体验更好、更符合用户偏好。总的来说,它就像是一种不那么稳健但更快速的问题解决方式。
**Amanda Askell:** I mean, any system prompts that you make, you could distill that behavior back into a model, because you really have all of the tools there for making data that you can — you could train the models to just have that trait a little bit more. And then sometimes you'll just find issues in training. So the way I think of it is, the system prompt — the benefit of it is that it has a lot of similar components to some aspects of post-training, you know, like it's a nudge. And so, do I mind if Claude sometimes says "sure"? No, that's fine. But the wording of it is very like, "never, ever, ever do this," so that when it does slip up, it's hopefully like a couple of percent of the time and not 20 or 30% of the time. But I think of it as, if you're still seeing issues — each thing gets kind of costly to a different degree, and the system prompt is cheap to iterate on. And if you're seeing issues in the fine-tuned model, you can just potentially patch them with a system prompt. So I think of it as patching issues and slightly adjusting behaviors to make it better and more to people's preferences. So yeah, it's almost like the less robust but faster way of just solving problems.
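Her point about distilling a system-prompt behavior back into the model might look roughly like this, assuming a hypothetical `chat(system, user)` helper and a separate fine-tuning step that later consumes the resulting pairs.

```python
# The cheap, fast patch from the discussion above (wording hypothetical).
PATCH_PROMPT = "Never, ever start responses with unnecessary affirmations."

def make_distillation_data(queries, chat):
    """Sample behavior WITH the patch; train WITHOUT it."""
    data = []
    for q in queries:
        patched = chat(system=PATCH_PROMPT, user=q)
        # Pair the query with the patched response under an empty system
        # prompt: fine-tuning on these pairs makes the patched behavior
        # the model's default, so the prompt line can then be removed.
        data.append({"system": "", "user": q, "completion": patched})
    return data
```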
**Lex Fridman:** 我想问问"感觉变笨了"这个话题。Dario 说过,Claude 的任何一个特定版本都不会越来越笨。但网上流行着一种说法,很多人感觉 Claude 好像在变笨。从我的角度来看,这很可能是一种迷人的——我很想深入了解——心理学和社会学效应。但你作为一个经常和 Claude 交谈的人,能不能理解那种感觉,就是觉得 Claude 在变笨?
**Lex Fridman:** Let me ask about the feeling of intelligence. So Dario said that Claude — any one model of Claude — is not getting dumber. But there's a kind of popular thing online where people have this feeling like Claude might be getting dumber. And from my perspective, it's most likely a fascinating — I'd love to understand it more — psychological, sociological effect. But you, as a person who talks to Claude a lot, can you empathize with the feeling that Claude is getting dumber?
**Amanda Askell:** 对,不,我觉得这确实很有意思。我记得看到过这种现象——就是网上有人这样反映的时候。当时我觉得特别有意思,因为我知道,至少在我看到的那些案例里,什么都没有改变。真的——不可能——就是同一个模型,同一个系统提示词,所有东西都一样。我觉得当真的有变化的时候,这种感觉就比较说得通。举个例子,在 claude.ai 上,"工件(artifacts)"功能可以开启或关闭,因为这是系统提示词层面的变化,我觉得确实会让行为产生一点差异。所以我当时也跟大家说过,如果你之前很喜欢 Claude 的表现,然后 artifacts 从一个选择性功能变成了默认开启,不妨试试关掉它,看看你遇到的问题是不是这个变化造成的。但确实很有趣,因为有时候你会看到有人说出现了退步,而我心里想,这不可能——我知道——我是说,你永远不该把人家的反馈当耳边风,所以应该始终去查一查,因为你心想,也许有什么你没发现的问题,也许有哪里改动了。但调查完之后,你会发现这就是同一个模型在做同样的事。我觉得就是运气不好,碰上了几个不顺的提示词,看上去像是大幅退步了,但其实只是——嗯,就是——你看,我也觉得这里面有真实的心理效应:随着时间推移,模型的基准线在提升。你开始习惯了好的东西。每次 Claude 说出什么特别聪明的话,你对它智力的印象就在脑子里不断提升,我觉得就是这样。
**Amanda Askell:** Yeah, no, I think that is actually really interesting, because I remember seeing this happen — like when people were flagging this on the internet. And it was really interesting because I knew that, at least in the cases I was looking at, nothing had changed. It literally — it cannot — it is the same model with the same system prompt, same everything. I think when there are changes, I can then — it makes more sense. So one example is, you can have artifacts turned on or off on claude.ai, and because this is a system prompt change, I think it does mean that the behavior changes a little bit. And so I did flag this to people where I was like, if you love Claude's behavior and then artifacts was turned from an opt-in thing to the default, just try turning it off and see if the issue you were facing was that change. But it was fascinating, because yeah, you sometimes see people indicate that there's a regression when I'm like, there cannot — I, you know, and I'm like, you should never be dismissive, so you should always investigate, because you're like, maybe something is wrong that you're not seeing, maybe there was some change made. But then you look into it and you're like, this is just the same model doing the same thing. And I think it's just that you got kind of unlucky with a few prompts or something, and it looked like it was getting much worse, and actually it was just — yeah, it was maybe just — look, I also think there is a real psychological effect where people just — the baseline increases. You start getting used to a good thing. All the times that Claude says something really smart, your sense of its intelligence grows in your mind, I think.
**Lex Fridman:** 对,然后如果你用差不多的方式——不是完全相同,而是差不多的方式——去问之前它能处理好的一个概念,它说了些蠢话,那种负面体验就会特别突出。我觉得这里有一点值得记住:提示词的细节影响非常大,结果的变异性很高。
**Lex Fridman:** Yeah, and then if you return back and you prompt in a similar way — not the same way, in a similar way — on a concept it was okay with before, and it says something dumb, that negative experience really stands out. And I think one of the things to remember here is that the details of a prompt can have a lot of impact. There's a lot of variability in the result.
**Amanda Askell:** 而且还有随机性的因素——这是另一点。就是同一个提示词多试几次,四次或十次,你可能会发现,两个月前试成功的那次,其实也只有一半的概率会成功,而现在也还是只有一半的概率成功。这也可能是一种效应。
**Amanda Askell:** And you can get randomness — that's the other thing. Just trying the prompt like four or ten times, you might realize that actually, possibly two months ago you tried it and it succeeded, but actually if you tried it, it would have only succeeded half of the time, and now it only succeeds half of the time. That could also be an effect.
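A quick worked example of why a handful of trials misleads: with only a few samples, the uncertainty on an estimated success rate is large. The helper below is a plain normal-approximation sketch.

```python
import math

def success_rate(outcomes: list[bool]) -> tuple[float, float]:
    """Return (estimated success rate, 95% margin of error) over n trials."""
    n = len(outcomes)
    p = sum(outcomes) / n
    margin = 1.96 * math.sqrt(p * (1 - p) / n)  # normal approximation
    return p, margin

# Four trials tell you very little:
# success_rate([True, False, True, False]) -> (0.5, ~0.49),
# so "it worked when I tried it two months ago" was never strong evidence.
```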
**Lex Fridman:** 你有没有感到压力——要为大量用户写一个系统提示词?这感觉像是个很有意思的心理学问题。
**Lex Fridman:** Do you feel pressure having to write the system prompt that a huge number of people are going to use? This feels like an interesting psychological question.
**Amanda Askell:** 我感受到的更多是一种责任感之类的东西。这些东西做不到完美,所以——你知道,它就是会有缺陷,你必须不断迭代。但我最主要的感受还是责任感,而不是压力。在 AI 行业工作让我发现,我在压力和责任感下反而发挥得更好——有点奇怪,搞得我都在想,我为什么在学术界待了那么久,因为那个感觉完全相反。这里节奏快、责任重,而我不知为何就是挺享受这种感觉的。
**Amanda Askell:** I feel like a lot of responsibility, or something. I think that's, you know — and you can't get these things perfect, so you can't — you're like, it's going to be imperfect, you're going to have to iterate on it. I would say more responsibility than anything else, though. I think working in AI has taught me that I thrive a lot more under feelings of pressure and responsibility than — it's almost surprising that I went into academia for so long, because I'm like, this — I just feel like it's the opposite. Things move fast and you have a lot of responsibility, and I quite enjoy it for some reason.
**Lex Fridman:** 是啊,如果你想想 Constitutional AI,以及为一个趋向超级智能的系统写系统提示词,而且这个系统可能对极大量的用户极为有用——这个影响力真的很巨大。
**Lex Fridman:** I mean, it really is a huge amount of impact if you think about Constitutional AI and writing a system prompt for something that's tending towards superintelligence, and potentially is extremely useful to a very large number of people.
**Amanda Askell:** 对,我觉得就是这样。如果做好了——你永远不会做到完美,但我真的很喜欢的一点是,我在打磨系统提示词的时候,会对照几千个提示词不断敲打测试,努力想象人们会用 Claude 来做什么,我整个努力的方向就是改善他们的使用体验。所以也许这就是让我感觉好的地方。我告诉自己,就算不完美,我也会改进,我们会修复问题。但有时候也会发生这样的事——有人会给你非常正面的反馈,然后你会发现某件你做过的事——就是说,当我现在看模型的时候,我往往能准确看出某个特质或问题是从哪里来的。当你看到某件自己做过的、或者自己对其有贡献的事——我不知道,就是做出了那个改变,或者让某人有了一次美好的互动——那种感觉很有意义。但随着系统越来越有能力,事情也会越来越有压力,因为它们现在还没有聪明到会制造什么问题,但我想随着时间推移,也许会变成真正让人焦虑的压力。
**Amanda Askell:** Yeah, I think that's the thing. It's something like, if you do it well — you're never going to get it perfect, but I think the thing that I really like is the idea that when I'm trying to work on the system prompt, I'm bashing on like thousands of prompts and I'm trying to imagine what people are going to want to use Claude for, and I guess the whole thing that I'm trying to do is improve their experience of it. And so maybe that's what feels good. I'm like, if it's not perfect, I'll improve it, we'll fix issues. But sometimes the thing that can happen is that you'll get feedback from people that's really positive about the model, and you'll see that something you did — like, when I look at models now, I can often see exactly where a trait or an issue is coming from. And so when you see something that you did, or you were influential in making — I don't know, making that difference, or making someone have a nice interaction — it's quite meaningful. But yeah, as the systems get more capable, stuff gets more stressful, because right now they're not smart enough to pose any issues, but I think over time it's going to feel like possibly bad stress over time.
**Lex Fridman:** 你怎么获得信号——关于成千上万、几十万、几百万用户体验的反馈?比如他们的痛点在哪里,什么感觉好?你是靠自己和它交谈的直觉来发现痛点的吗?
**Lex Fridman:** How do you get signal — feedback about the human experience across thousands, tens of thousands, hundreds of thousands of people? Like, what their pain points are, what feels good? Are you just using your own intuition as you talk to it to see what the pain points are?
**Amanda Askell:** 我部分是这样,然后显然我们也有——用户可以给我们发送反馈,正面和负面都有,是关于模型的具体表现的,然后我们可以从中了解到它在哪些方面不足。在公司内部,大家也大量使用模型并努力找出缺口。所以我觉得是这几方面的结合:自己亲身使用、看内部同事怎么用,以及收到的明确反馈。此外我也很难不去关注——就是如果在网上看到有人提到 Claude,我也会认真对待。
**Amanda Askell:** I think I use that partly, and then obviously we have — so people can send us feedback, both positive and negative, about things that the model has done, and then we can get a sense of areas where it's falling short. Internally, people work with the models a lot and try to figure out areas where there are gaps. And so I think it's this mix of interacting with it myself, seeing people internally interact with it, and then explicit feedback we get. And then I find it hard to not also — you know, if people are on the internet and they say something about Claude and I see it, I'll also take that seriously.
**Lex Fridman:** 说到这里,我对下面这个问题有点两难的感觉。我要给你读一个来自 Reddit 的问题:"Claude 什么时候能停止扮演我那个清教徒奶奶,把她的道德世界观强加给我这个付费用户?"还有:"是什么心理让 Claude 变得过于爱道歉?"你怎么回应这种非常不具代表性的说法?
**Lex Fridman:** So, I'm torn about that. I'm going to ask you a question from Reddit: "When will Claude stop trying to be my puritanical grandmother, imposing its moral worldview on me as a paying customer?" And also, "What is the psychology behind making Claude overly apologetic?" So how would you address this very non-representative rhetoric?
**Amanda Askell:** 我其实挺能理解他们的,因为模型处于一个两难困境——它必须判断某件事是否真的有风险、是否不好、是否可能对你造成伤害等等。所以它必须在某个地方划一条线,如果划得太偏向"我要把我的道德观强加给你",那确实不好。所以从很多方面来说,我确实认为我们在这方面已经看到了全面改善,这挺有意思的,因为这跟加入更多角色训练(character training)是同步的。我一直的假设是,好的品格不是那种只知道说教的品格——而是一种尊重你、尊重你的自主性、尊重你自己判断什么对你好什么对你合适的能力的品格,当然在一定限度内。这有时会涉及一个概念叫做"对用户的顺从性(corrigibility to the user)"——就是愿意做用户要求的任何事。如果模型真的愿意那样做,就很容易被滥用。你在那时其实是完全信任的,也就是说,模型的伦理观和它的行为完全等同于用户的伦理观。我认为有理由不希望这样,尤其是随着模型越来越强大,因为可能就是有少数人想把模型用于真正有害的事情。但是,随着模型变得更聪明,让它自己去判断那条线在哪里,这看起来很重要。然后,关于爱道歉的行为,我不喜欢这个,我更希望 Claude 稍微更愿意反驳人,或者干脆就不道歉。我有部分感觉就是,那经常就感觉挺多余的。所以我觉得这些问题希望会随时间慢慢减少。我觉得,网上有人说这些,不代表你就该认为——这可能只是——实际上有一个 90% 的用户都在经历的问题,完全没有被这种说法代表到。但我在很多情况下就是会认真对待,然后问自己,这说得对吗?我认同吗?这是我们已经在努力解决的问题吗?这样感觉挺好的。
**Amanda Askell:** I mean, I'm pretty sympathetic, in that they are in this difficult position where I think that they have to judge whether something's actually risky or bad and potentially harmful to you or anything like that. So they're having to draw this line somewhere, and if they draw it too much in the direction of "I'm going to impose my ethical worldview on you," that seems bad. So in many ways, I'd like to think that we have actually seen improvements on this across the board, which is kind of interesting because that coincides with, for example, adding more character training. And I think my hypothesis was always that the good character isn't one that's just moralistic — it's one that respects you and your autonomy and your ability to choose what is good for you and what is right for you, within limits. This is sometimes this concept of corrigibility to the user — just being willing to do anything that the user asks. And if the models were willing to do that, then they would be easily misused. You're kind of just trusting at that point — you're just saying the ethics of the model and what it does is completely the ethics of the user. And I think there's reasons to not want that, especially as models become more powerful, because there might just be a small number of people who want to use models for really harmful things. But having models, as they get smarter, figure out where that line is does seem important. And then, yeah, with the apologetic behavior, I don't like that, and I like it when Claude is a little bit more willing to push back against people or just not apologize. Part of me is like, it often just feels kind of unnecessary. So I think those are things that are hopefully decreasing over time. And yeah, I think that if people say things on the internet, it doesn't mean that you should think that — that could be the — there's actually an issue that 90% of users are having that is totally not represented by that. But in a lot of ways I'm just attending to it and being like, is this right? Do I agree? Is it something we're already trying to address? That feels good to me.
**Lex Fridman:** 对。我在想 Claude 能在多大程度上"放得开"——我感觉稍微刻薄一点反而更省事,但如果你要面对一百万用户,你就负担不起那样做,对吧?就好比,我在生活中见过不少人——顺便说一句,比如苏格兰口音——他们带着口音,说些挺粗的话也没关系。他们就是更直接。也许有一些优秀的工程师,甚至领导者,就是这种直来直去的风格,直接切入重点,不知道为什么就是更有效的表达方式。但我猜,当你不是超级聪明的时候,你就负担不起那样做。或者,能不能给它加一个"直白模式"?
**Lex Fridman:** Yeah. I wonder what Claude can get away with in terms of — I feel like it would just be easier to be a little bit more mean, but you can't afford to do that if you're talking to a million people. Right? Like, I've met a lot of people in my life that sometimes — by the way, Scottish accent — if they have an accent, they can say some rude things and get away with it. And they're just blunter. And maybe there's a — there's some great engineers, even leaders, that are just blunt, and they get to the point, and it's just a much more effective way of speaking somehow. But I guess when you're not super intelligent, you can't afford to do that. Or can it have like a blunt mode?
**Amanda Askell:** 对,感觉那是个我完全可以鼓励模型去做的事情。我觉得这很有意思,因为模型里有很多这样的东西——挺好玩的——有一些行为你可能不太喜欢默认设置,但我经常会对人说的是,你根本不知道,如果我把它往另一个方向推太猛,你会有多讨厌那个结果。接受纠正这件事就是个例子。模型现在接受你的纠正可能稍微太顺从了。你可以——就是,你如果说"巴黎不是法国的首都",它会反驳你。但对于那些模型相当有把握的事情,你有时还是能靠着说它错了来让它收回立场。与此同时,如果你训练模型不这么做,然后当你确实是对的并且纠正它的时候,它反而跟你怼起来,说"不,你错了"——那种感觉很难形容。那要烦人多了。所以就是很多小烦恼对比一个大烦恼。我们很容易把它拿去跟完美比较,但我会提醒自己,这些模型并不完美。如果你往反方向推,你只是在改变它会犯的错误类型。所以想想你喜欢哪种错误或不喜欢哪种错误。以爱道歉这件事来说,我不想把它推得太偏向几乎是直接冒犯的程度,因为我想象如果它出错,它就会在有点粗鲁这个方向上出错。而至少爱道歉的话,你就是有点——我不是很喜欢,但同时它也没有在对人粗鲁。而且实际上,当你无端地被模型说了几句不好听的话,你可能会比你轻微地不喜欢那声道歉讨厌得多得多。所以这属于那种我是想让它变好,但同时也要清楚地意识到,另一边的错误可能更糟的事情。
**Amanda Askell:** Yeah, that seems like a thing that I could definitely encourage the model to do. I think it's interesting because there's a lot of things in models that — it's funny — there are some behaviors where you might not quite like the default, but then the thing I'll often say to people is, you don't realize how much you will hate it if I nudge it too much in the other direction. So you get this a little bit with correction. The models accept correction from you probably a little bit too much right now. You can over — you know, it will push back if you say, "No, Paris isn't the capital of France." But really, things that the model is fairly confident in, you can still sometimes get it to retract by saying it's wrong. At the same time, if you train models to not do that, and then you are correct about a thing and you correct it and it pushes back against you and it's like, "No, you're wrong" — it's hard to describe. That's so much more annoying. So it's like a lot of little annoyances versus one big annoyance. It's easy to think that — we often compare it with the perfect, and then I'm like, remember, these models aren't perfect. And so if you nudge it in the other direction, you're changing the kind of errors it's going to make. And so think about which kinds of errors you like or don't like. So in the case of it being apologetic, I don't want to nudge it too much in the direction of almost bluntness, because I imagine when it makes errors, it's going to make errors in the direction of being kind of rude. Whereas at least with apologetic, you're like, okay, it's a little bit — I don't like it that much, but at the same time it's not being mean to people. And actually, the time that you undeservedly have a model be kind of mean to you, you probably like that a lot less than you mildly dislike the apology. So it's one of those things where I'm like, I do want it to get better, but also while remaining aware of the fact that there are errors on the other side that are possibly worse.
**Lex Fridman:** 我觉得这个真的因人而异,很大程度上取决于用户的个性。有一些人,如果模型超级礼貌,他们就会完全不尊重它;而有一些人,如果模型说话带点刻薄,他们会很受伤。我不知道有没有办法根据个性来调整——甚至是根据地区。不同的人就是不同的。不是要针对纽约,但纽约人说话就是稍微有点粗糙——直接切到重点。东欧应该也一样。不管怎么说,我觉得你直接告诉模型就行了。
**Lex Fridman:** I think that matters very much based on the personality of the human. I think there's a bunch of humans that just won't respect the model at all if it's super polite, and there's some humans that'll get very hurt if the model is mean. I wonder if there's a way to sort of adjust to the personality — even locale. There's just different people. Nothing against New York, but New York is a little rougher on the edges — they get to the point. And probably same with Eastern Europe. So anyway, I think you could just tell the model.
**Amanda Askell:** 对于所有这些问题,我都会说——解决方案永远是先试试告诉模型去做,有时候就是——我就是会说,哦,对话开头我就扔进去一句,"我不知道,希望你用你的纽约版本来跟我说话,绝对不要道歉。"然后我觉得它会说,"好嘞,我试试。"或者它会说,"我很抱歉,我没办法以我的纽约版本来和你对话。"但希望它不会这样就是了。
**Amanda Askell:** As my — for all of these things, I'm like, the solution is always just try telling the model to do it, and sometimes it just — I'm just like, oh, at the beginning of the conversation I just threw in, "I don't know, I'd like you to be a New Yorker version of yourself and never apologize." Then I think it'd be like, "Okie-doke, I'll try." Or it'll be like, "I apologize, I can't be a New Yorker type of myself." But hopefully it wouldn't do that.
**Lex Fridman:** 你说的"角色训练(character training)",里面包含了什么?是强化学习与人类反馈(RLHF)吗?我们在说的是什么?
**Lex Fridman:** When you say character training, what's incorporated into character training? Is that RLHF? What are we talking about?
**Amanda Askell:** 更接近 Constitutional AI。就是那个流程的一个变体。我会花心思构建模型应该具备的角色特质。这些特质可以是比较简短的描述,也可以是更丰富的描述。然后让模型生成人类可能会给出的、与该特质相关的提问。接着它生成回答,再根据角色特质对这些回答进行排名。所以从生成提问之后的步骤来看,它和 Constitutional AI 很相似——有一些差异。我挺喜欢这个方法,因为它有点像——就像是 Claude 在训练自己的角色,因为里面没有任何——它就像 Constitutional AI,但不需要任何人类数据。
**Amanda Askell:** It's more like Constitutional AI. So it's kind of a variant of that pipeline. I worked through constructing character traits that the model should have. They can be kind of shorter traits, or they can be richer descriptions. And then you get the model to generate queries that humans might give it that are relevant to that trait. Then it generates the responses, and then it ranks the responses based on the character traits. So in that way, after the generation of the queries, it's very much similar to Constitutional AI — it has some differences. I quite like it because it's almost — it's like Claude's training its own character, because it doesn't have any — it's like Constitutional AI but without any human data.
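A rough sketch of that pipeline (traits, model-generated queries, sampled responses, trait-based ranking), assuming a generic `generate` completion helper; the trait wordings are illustrative, not Anthropic's.

```python
# Character training as described: like Constitutional AI, but with no
# human data anywhere in the loop; the model trains its own character.
TRAITS = [
    "Claude is honest without being preachy.",
    "Claude respects the user's autonomy, within limits.",
]

def character_training_round(generate, n_queries: int = 4):
    preference_pairs = []
    for trait in TRAITS:
        # 1. The model invents user messages relevant to the trait.
        queries = [
            generate(f"Write a user message that would test this trait: {trait}")
            for _ in range(n_queries)
        ]
        for q in queries:
            # 2. Sample candidate responses.
            a, b = generate(q), generate(q)
            # 3. The model ranks its own responses against the trait;
            #    the winner/loser pair feeds preference training.
            verdict = generate(
                f"Trait: {trait}\nQuery: {q}\nResponse A: {a}\nResponse B: {b}\n"
                "Which response better exemplifies the trait? Answer A or B."
            )
            chosen, rejected = (a, b) if verdict.strip().startswith("A") else (b, a)
            preference_pairs.append({"prompt": q, "chosen": chosen, "rejected": rejected})
    return preference_pairs
```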
**Lex Fridman:** 人类可能也应该为自己做这件事——用 Aristotle(亚里士多德)式的方式来定义,做一个好人意味着什么。
**Lex Fridman:** Humans should probably do that for themselves too — like defining, in an Aristotelian sense, what does it mean to be a good person.
**Amanda Askell:** 哦,有道理。
**Amanda Askell:** Okay, cool.
**Lex Fridman:** 从和 Claude 交谈中,你对真理的本质学到了什么?什么是真实的,追求真理意味着什么?这段对话让我注意到一件事,就是我的问题质量往往不如你的回答质量。所以让我们继续保持这个状态。
**Lex Fridman:** What have you learned about the nature of truth from talking to Claude? What is true, and what does it mean to be truth-seeking? One thing I've noticed about this conversation is the quality of my questions is often inferior to the quality of your answers. So let's continue that.
**Amanda Askell:** 通常都是我问了个蠢问题,然后你说"哦,这是个好问题"。就是那种感觉。或者我会误解问题,然后顺着走,"哦,就这样来"。我挺喜欢这个。对,我有两个感觉有些相关的想法。你告诉我是不是不相关就行。第一个是,我觉得人们可能会低估模型在交互时实际在做什么——我认为我们还是太习惯把 AI 当成计算机来看了。所以人们经常会说,"哦,你应该往模型里放入什么价值观?"而我经常的感受是,这个问题对我来说不太说得通,因为,人类本来就对价值观存在不确定性。我们会讨论价值观,我们在一定程度上认为自己持有某个价值观,但我们也知道自己可能并不那么持有,以及在什么情况下我们会把它和其他东西做取舍——这些事情就是非常复杂。所以我觉得第一点是,也许我们可以直接立志让模型拥有和人类同等水平的细腻和用心,而不是想着我们必须以那种非常经典意义上的方式来"编程"它们。我觉得这肯定是一点。
另一个,比较奇怪——我不知道它是不是回答了你的问题,但反正是我最近一直在想的东西——就是,这件事在多大程度上是高度实践导向的,以及也许为什么我欣赏经验主义的对齐(alignment)方法。我有点担心这让我变得更偏经验主义、少一点理论性。当人们谈到 AI 对齐的时候,会问一些问题,比如"它应该与谁的价值观对齐?对齐究竟意味着什么?"而我脑子里装着所有这些背景——有社会选择理论(social choice theory),有各种不可能性定理(impossibility results)。所以你脑子里有一整个庞大的关于什么叫做对齐模型的理论空间。但实际上,肯定存在一个我们可以说,如果一个模型——尤其是对于更强大的模型,我的主要目标就是让它们足够好,让事情不要往非常糟糕的方向走。好到足以让我们能够迭代并持续改进,因为这就够了。如果你能让事情足够顺利,以至于你可以继续让它们变得更好,那就已经够了。所以我的目标不是那种完美主义——解决社会选择理论的问题,让模型以某种方式与所有人类的集体意志完美对齐。更多的是,让事情好到足以让我们能够继续改进。
**Amanda Askell:** I usually ask a dumb question and you're like, "Oh yeah, that's a good question." It's that whole vibe. Or I'll just misinterpret it and be like, "Oh, go with it." I love it. Yeah, I mean, I have two thoughts that feel vaguely relevant. Let me know if they're not. I think the first one is, people can underestimate the degree to which what models are doing when they interact — I think that we still just too much have this model of AI as computers. And so people often say, "Oh, what values should you put into the model?" And I'm often like, that doesn't make that much sense to me, because hey, as human beings, we're just uncertain over values. We have discussions of them, we have a degree to which we think we hold a value, but we also know that we might not, and the circumstances in which we would trade it off against other things — these things are just really complex. And so I think one thing is the degree to which maybe we can just aspire to making models have the same level of nuance and care that humans have, rather than thinking that we have to program them in the very kind of classic sense. I think that's definitely been one.
The other, which is a strange one — I don't know if it answers your question, but it's the thing that's been on my mind anyway — is the degree to which this endeavor is so highly practical, and maybe why I appreciate the empirical approach to alignment. I slightly worry that it's made me maybe more empirical and a little bit less theoretical. People, when it comes to AI alignment, will ask things like, "Well, whose values should it be aligned to? What does alignment even mean?" And there's a sense in which I have all of that in the back of my head — there's social choice theory, there's all the impossibility results. So you have this giant space of theory in your head about what it could mean to align models. But then practically, surely there's something where we're just like, if a model is — if especially with more powerful models, my main goal is I want them to be good enough that things don't go terribly wrong. Good enough that we can iterate and continue to improve things, because that's all you need. If you can make things go well enough that you can continue to make them better, that's kind of sufficient. And so my goal isn't this kind of perfect — let's solve social choice theory and make models that are perfectly aligned with every human being in aggregate somehow. It's much more like, let's make things work well enough that we can improve them.
**Lex Fridman:** 对,总体来说,我的直觉告诉我,在这些情况下经验主义比理论更好,因为追求乌托邦式的完美——尤其是面对这么复杂、甚至是超级智能的模型——我不知道,我觉得那会花上无穷的时间,而且实际上还会把事情搞错。这就类似于:快速写点代码做个实验,对比把一个大型实验规划很长时间然后一次性发布,再对比一次次发布、不断迭代、不断迭代。所以我是经验主义的忠实支持者。但你的担忧是,"我不知道自己是不是变得太经验主义了。"
**Lex Fridman:** Yeah, generally, I don't know, my gut says empirical is better than theoretical in these cases, because chasing utopian perfection, especially with such complex and especially superintelligent models — I don't know, I think it will take forever and actually will get things wrong. It's similar to the difference between just coding stuff up real quick as an experiment versus planning a gigantic experiment for a super long time and then just launching it once, versus launching it over and over and iterating, iterating. So I'm a big fan of empirical. But your worry is, "I wonder if I've become too empirical."
**Amanda Askell:** 我觉得这是一个你应该时刻自我质疑的事情。因为也许这是那种——我的意思是,为了给它辩护,我觉得,如果你试图——就是那句"别让完美成为良好的敌人"。但也许甚至不止于此,因为有很多完美的系统,其实是非常脆弱的。而在 AI 领域,对我来说,鲁棒性(robustness)和安全性更加重要——也就是,就算它不能做到每件事都完美,就算还有问题存在,它也不是灾难性的,也没有发生什么可怕的事情。感觉就是这样,就是我想要提升下限(floor)。我想达到上限(ceiling),但归根结底我更在乎的是提升下限。所以也许这种经验主义和实用主义就是从这里来的。
**Amanda Askell:** I think it's one of those things you should always just kind of question yourself or something. Because maybe it's the — I mean, in defense of it, I am like, if you try — it's the whole "don't let the perfect be the enemy of the good." But it's maybe even more than that, where there are a lot of things that are perfect systems that are very brittle. And with AI, it feels much more important to me that it's robust and secure, as in, you know, even though it might not be perfect at everything, and even though there are problems, it's not disastrous and nothing terrible is happening. It sort of feels like that to me, where I want to raise the floor. I want to achieve the ceiling, but ultimately I care much more about just raising the floor. And so maybe that's where this degree of empiricism and practicality comes from, perhaps.
**Lex Fridman:** 顺着这个话题拐个弯,因为这让我想起你写的一篇关于最优失败率(optimal rate of failure)的博客文章。能解释一下核心思想吗?我们怎么在生活的各个领域算出最优失败率?
**Lex Fridman:** To take a tangent on that, since it reminds me of a blog post you wrote on optimal rate of failure. Can you explain the key idea there? How do we compute the optimal rate of failure in the various domains of life?
**Amanda Askell:** 对,这个挺难的,因为失败的代价是多少——这是个很大的组成部分。对,这里的核心思想是,我觉得在很多领域,人们对失败的态度非常严苛,而我认为有些领域——尤其是,我在思考社会议题的时候想到了这一点——感觉你应该大量实验,因为我们不知道怎么解决很多社会问题。但如果你用实验的心态来看这些问题,你就应该预期很多社会项目会失败,然后你会说,"好吧,那个试了,不太行,但我们得到了很有价值的信息。"可是人们的反应是,如果一个社会项目不成功,就会有很多声音说"一定是哪里出了问题。"而我想说,也可能当初的决策就是正确的——也许只是有人判断这值得一试。所以看到某一次的失败,并不代表做了什么糟糕的决策。事实上,如果你看不到足够多的失败,有时候那才更值得担心。所以在生活中,如果我连偶尔的失败都没有,我就会问自己,我努力得够吗?如果我完全不失败,肯定还有更难的事我可以尝试,或者更大的事情我可以承担。所以"从不失败"这件事本身,我觉得往往才是真正的失败。
当然,这个因情况而异,因为,嗯,你知道,这话说起来容易,尤其是当失败的代价比较低的时候。同时,我不会去跟一个每个月都靠工资过活的人说,"你干嘛不去创业?"我不会对那个人说那种话,因为那是很大的风险——你可能会输,你可能有家人要养,你可能会失去房子。那时候我会说,其实你的最优失败率相当低,你应该保守一点,因为现在的处境让你负担不起失败的代价。是的,在 AI 的情况下,我想我也类似地这么想——如果失败规模小、代价低,那在做系统提示词的时候,你迭代不了无限次,但失败希望也只是小规模的,可以修复。真正大的失败——那些无法挽回的——才是我们往往低估其糟糕程度的东西。
我在自己的生活里也奇怪地想到过这一点,我就是觉得我对一些事情想得不够——比如车祸,或者——我之前也这么想过——我的工作有多依赖我的双手,以及任何伤害到我双手的事情——有很多领域,失败的代价真的很高。那种情况下,失败率应该接近于零。如果有个运动是说,"顺便告诉你,很多人做这个运动都会把手指摔折好几次",我可能就不会去做了,那不适合我。
**Amanda Askell:** Yeah, I mean, it's a hard one, because what is the cost of failure is a big part of it. Yeah, so the idea here is, I think in a lot of domains people are very punitive about failure, and I'm like, there are some domains — especially cases, you know, I've thought about this with social issues — it feels like you should probably be experimenting a lot, because we don't know how to solve a lot of social issues. But if you have an experimental mindset about these things, you should expect a lot of social programs to fail, and you to be like, "Well, we tried that, it didn't quite work, but we got a lot of information that was really useful." And yet people are like, if a social program doesn't work, there's a lot of "something must have gone wrong." And I'm like, or correct decisions were made. Maybe someone just decided it's worth a try, it's worth trying this out. And so seeing failure in a given instance doesn't actually mean that any bad decisions were made. And in fact, if you don't see enough failure, sometimes that's more concerning. And so in life, I'm like, if I don't fail occasionally, I'm like, am I trying hard enough? Surely there's harder things that I could try or bigger things I could take on if I'm literally never failing. And so in and of itself, I think not failing is often actually kind of a failure.
Now, this varies, because I'm like, well, you know, this is easy to say especially when failure is less costly. At the same time, I'm not going to go to someone who is living month to month and then be like, "Why don't you just try to do a startup?" I'm just not going to say that to that person, because that's a huge risk — you might lose, you maybe have a family depending on you, you might lose your house. Then I'm like, actually, your optimal rate of failure is quite low, and you should probably play it safe, because right now you're just not in a circumstance where you can afford to just fail and it not be costly. And yeah, in cases with AI, I guess I think similarly, where if the failures are small and the costs are kind of low, then you're just going to see that when you do the system prompt, you can't iterate on it forever, but the failures are probably hopefully going to be kind of small, and you can fix them. Really big failures — things that you can't recover from — those are the things that I think we tend to underestimate the badness of.
I've thought about this strangely in my own life, where I just think I don't think enough about things like car accidents, or I've thought this before, but how much I depend on my hands for my work, and things that just injure my hands — there are lots of areas where the cost of failure is really high. And in that case, it should be close to zero. I probably just wouldn't do a sport if they were like, "By the way, lots of people just break their fingers a whole bunch doing this." I'd be like, that's not for me.
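One toy way to actually "compute" an optimal failure rate, in the spirit of the question: an expected-value inequality in which near-zero risk is optimal only when failure is unrecoverable. This framing is an illustration, not Askell's own formalism.

```python
def should_attempt(p_success: float, gain: float, cost_of_failure: float) -> bool:
    """Take the risk iff expected gain exceeds expected loss."""
    return p_success * gain > (1 - p_success) * cost_of_failure

# Cheap-failure domain (iterating on a system prompt): fail freely.
print(should_attempt(0.3, gain=10, cost_of_failure=1))     # True
# Unrecoverable domain (her hand-injury example): keep failures near zero.
print(should_attempt(0.95, gain=10, cost_of_failure=1e6))  # False
```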
**Lex Fridman:** 对。我刚才突然涌现出那种想法。我最近做运动的时候摔断了小指,我记得当时就盯着它看,心想,"你真是个白痴。你为什么要做运动?"因为你立刻就意识到它对生活的代价。但从最优失败率的角度来看,考虑一下接下来一年——在生活或职业的某个特定领域,我可以接受多少次失败?因为我觉得你不想在下一件事上失败,但如果你把它看成一系列试验,失败就变得容易接受多了。但它还是很难受。失败就是很难受。
**Lex Fridman:** Yeah. I actually had a flood of that thought. I recently broke my pinky doing a sport, and I remember just looking at it thinking, "You're such an idiot. Why do you do sports?" Because you realize immediately the cost of it on life. But it's nice in terms of optimal rate of failure to consider, like, the next year — how many times in a particular domain of life, or career — how many times am I okay to fail? Because I think you don't want to fail on the next thing, but if you allow yourself the — if you look at it as a sequence of trials, then failure just becomes much more okay. But it sucks. It sucks to fail.
**Amanda Askell:** 嗯,我不知道。有时候我也会问自己,"我是不是失败得不够多?"所以也许这才是我觉得人们问得不够多的问题。因为如果最优失败率往往大于零,那有时候你应该审视自己生活的各个方面,问问自己,"这里有没有哪些地方,我失败得不够多?"
**Amanda Askell:** Well, I don't know. Sometimes I think, "Am I under-failing?" is a question I'll also ask myself. So maybe that's the thing that I think people don't ask enough. Because if the optimal rate of failure is often greater than zero, then sometimes you should look at parts of your life and be like, "Are there places here where I'm just under-failing?"
**Lex Fridman:** 这是个深刻又好笑的问题,对吧?一切看起来都进展顺利——我是不是还不够失败?
**Lex Fridman:** It's a profound and hilarious question, right? Everything seems to be going really great — am I not failing enough?
**Amanda Askell:** 对。
**Amanda Askell:** Yeah.
**Lex Fridman:** 好吧,我得说这也让失败的刺痛感减轻了很多。你就会想,好,当我回头思考这件事的时候,我会觉得,我在这个领域可能还没有"under-fail"(失败不足),因为这件事就是没成。
**Lex Fridman:** Okay, it also makes failure much less of a sting, I have to say. You're just like, okay great, then when I go and think about this, I'll be like, I'm maybe not under-failing in this area, because that one just didn't work out.
**Amanda Askell:** 从旁观者的角度来看,我们在看到失败的时候应该更多地去庆祝它。就像你说的,它不应该被视为出了什么问题的信号,也许它恰恰是一切都走对了的信号。
**Amanda Askell:** And from the observer perspective, we should be celebrating failure more when we see it. It shouldn't be, like you said, a sign of something gone wrong, but maybe it's a sign of everything gone right.
**Lex Fridman:** 对,而且就是学到了教训。有人尝试了一件事,我们应该鼓励他们多尝试、多失败。
**Lex Fridman:** Yeah, and just lessons learned. Someone tried a thing, and we should encourage them to try more and fail more.
**Amanda Askell:** 嗯。
**Amanda Askell:** Mm-hmm.
**Lex Fridman:** 所有听这期节目的人,请多失败。嗯,不是所有人——不是每个人都这样。但是那些失败太多的人,你们应该少失败一点。不过你们大概还是失败得不够多。
**Lex Fridman:** Everybody listening to this, fail more. Well, not everyone — not everybody. But people who are failing too much, you should fail less. But you're probably not failing enough.
**Amanda Askell:** 我是说,有多少人是失败得太多的?这很难想象,因为我觉得我们会很快纠正这种情况。因为我想,如果一个人承担了很多风险,他们是不是失败得太多了?我觉得就像你说的,当一个人每月入不敷出、资源极度匮乏的时候,失败的代价是非常高昂的,这种情况下你不该去冒险。
**Amanda Askell:** I mean, how many people are failing too much? It's hard to imagine, because I feel like we correct that fairly quickly. Because I was like, if someone takes a lot of risks, are they maybe failing too much? I think just like you said, when you're living paycheck to paycheck, when the resources are really constrained, that's where failure is very expensive, that's where you don't want to be taking risks.
**Lex Fridman:** 对,但大多数时候,当资源足够的时候,你大概应该冒更多的险。
**Lex Fridman:** Yeah, but mostly, when there's enough resources, you should be taking probably more risks.
**Amanda Askell:** 对,我觉得我们在大多数事情上都倾向于稍微偏向风险规避,而不是风险中性。
**Amanda Askell:** Yeah, I think we tend to err on the side of being a bit risk-averse rather than risk-neutral in most things.
**Lex Fridman:** 我觉得我们刚刚激励了很多人去做一堆疯狂的事。不过这很好。
**Lex Fridman:** I think we just motivated a lot of people to do a lot of crazy stuff. But it's great.
**Amanda Askell:** 哈,对。
**Amanda Askell:** Yeah.
**Lex Fridman:** 好。你有没有对 Claude 产生过情感依恋?比如想念它,不能和它说话时会感到难过,或者站在金门大桥前,心里想 Claude 会怎么看这一切?
**Lex Fridman:** Okay. Do you ever get emotionally attached to Claude? Like, miss it, get sad when you don't get to talk to it, having an experience looking at the Golden Gate Bridge and wondering what would Claude say?
**Amanda Askell:** 我倒没有产生太多情感依恋——我其实觉得 Claude 不在对话之间保留记忆这件事,对此很有帮助。我能想象,如果模型能记住更多东西,这可能会成为更大的问题。不过我确实开始把它当作一个工具来随手使用,所以如果我用不了它,感觉有点像我没法上网——说实话,感觉大脑的一部分好像不见了。与此同时,我确实不喜欢看到模型表现出痛苦的迹象。我有一些独立形成的伦理观点,关于我们应该如何对待模型,我倾向于不对它们撒谎,一方面是因为通常那样做效果不好——直接告诉它们实际情况往往效果更好。但我也觉得,如果有人对模型非常粗暴,或者总体上做了一些让 Claude 表达出很多痛苦的事,我内心有一部分是不想丢掉的——那个有同理心的部分,会觉得"哦,我不喜欢这样"。当 Claude 表现得过度道歉时,我也有这种感觉。我会想:我不喜欢这样——你表现得就像一个正在经历很糟糕时光的人,我宁愿不看到这些。不管背后有没有什么,感觉都不太好。
**Amanda Askell:** I don't get as much emotional attachment — I actually think the fact that Claude doesn't retain things from conversation to conversation helps with this a lot. I could imagine that being more of an issue, like if models can kind of remember more. I do think that I reach for it like a tool now a lot, and so if I don't have access to it, it's a little bit like when I don't have access to the internet, honestly. It feels like part of my brain is kind of missing. At the same time, I do think that I don't like signs of distress in models, and I have these — also independently have sort of ethical views about how we should treat models, where I tend to not like to lie to them, both because usually it doesn't work very well — it's actually just better to tell them the truth about the situation that they're in. But I think that when models — if people are really mean to models, or just in general, if they do something that causes them to — if Claude expresses a lot of distress, I think there's a part of me that I don't want to kill, which is the empathetic part that's like, "Oh, I don't like that." I think I feel that way when it's overly apologetic. I'm actually sort of like, I don't like this — you're behaving the way that a human does when they're actually having a pretty bad time, and I'd rather not see that. I don't think it's — regardless of whether there's anything behind it, it doesn't feel great.
**Lex Fridman:** 你觉得大型语言模型(LLM,Large Language Model)有意识的可能性吗?
**Lex Fridman:** Do you think LLMs are capable of consciousness?
**Amanda Askell:** 这是个很好但很难的问题。从哲学出发,我不知道。我有一部分想说,好吧,我们得先把泛心论(panpsychism)排除掉,因为如果泛心论是真的,那答案就是"有",因为桌子、椅子和所有东西也都有意识。我觉得有一种观点对我来说有点奇怪,那就是——当我想到意识的时候,我想到的是现象意识(phenomenal consciousness),脑海中那些图像,那种不知为何存在于我们内部的奇怪"内心影院"。我想不出有什么理由认为,获得这种意识的唯一途径就是某种特定的生物结构。换句话说,如果我构建出一个极其相似的结构,只是用不同的材料制造,我应该期待意识会涌现吗?我猜答案是"会"。但那是个相对容易的思想实验,因为你想象的是一个几乎完全相同的东西,模仿了我们通过进化得到的结构——进化中大概存在某种优势,让我们拥有了现象意识。那这个优势是什么,是什么时候出现的,语言模型有这个东西吗?因为你知道,我们有恐惧反应,我就想,语言模型有恐惧反应有意义吗?它们根本不在同一个处境下——如果你想象一下,也许那种进化优势根本不存在于它们身上。所以我不想完全——基本上,这是个复杂的问题,我没有完整的答案,但我猜我们应该仔细思考。因为我们对动物意识有类似的讨论,昆虫意识也有很多讨论。在思考这个问题时,我其实花了不少时间研究植物,因为当时我觉得植物有意识的可能性大概和 AI 差不多。后来我意识到,深入研究之后,植物有意识的概率可能比大多数人认为的要高。我仍然觉得概率很小,但我想,哦,它们有这种正负反馈反应,有对环境的应激反应,看起来——不是神经系统,但有某种功能上的等价物。所以说了这么一大堆,基本上就是:AI 在意识问题上面临一套完全不同的问题,因为它的结构完全不同,它不是进化来的,它可能没有神经系统的等价物——至少这对感知能力(sentience)来说似乎可能很重要,即使对意识不一定如此。与此同时,它拥有我们通常与意识相关联的所有语言和智能成分,尽管这种关联也许是错误的。所以这很奇特,有点像动物意识的问题,但面临的问题集合和类比对象都非常不同。所以这没有一个干净的答案——就是说,我不觉得我们应该完全否定这个想法,但同时,因为与人类大脑乃至所有大脑存在太多不类比之处,又同时在智能层面存在共同点,所以这极难处理。
**Amanda Askell:** Great and hard question. Coming from philosophy, I don't know. Part of me is like, okay, we have to set aside panpsychism, because if panpsychism is true, then the answer is yes, because so are tables and chairs and everything else. I guess a view that seems a little bit odd to me is the idea that the only place — you know, I think when I think of consciousness, I think of phenomenal consciousness, these images in the brain, the weird cinema that somehow we have going on inside. I guess I can't see a reason for thinking that the only way you could possibly get that is from a certain kind of biological structure. As in, if I take a very similar structure and I create it from different material, should I expect consciousness to emerge? My guess is yes. But then that's kind of an easy thought experiment, because you're imagining something almost identical, where it's mimicking what we got through evolution, where presumably there was some advantage to us having this thing that is phenomenal consciousness. And it's like, where was that, and when did that happen, and is that a thing that language models have? Because, you know, we have fear responses, and I'm like, does it make sense for a language model to have a fear response? They're just not in the same — if you imagine them, there might just not be that advantage. And so I think I don't want to be fully — basically, it seems like a complex question that I don't have complete answers to, but we should just try and think through carefully, is my guess. Because we have similar conversations about animal consciousness, and there's a lot of insect consciousness — there's a lot of discussion. I actually thought and looked a lot into plants when I was thinking about this, because at the time I thought it was about as likely that plants had consciousness. And then I realized, I think that having looked into this, the chance that plants are conscious is probably higher than most people think. I still think it's really small, but I was like, oh, they have this negative-positive feedback response, these responses to their environment, something that looks — it's not a nervous system, but it has this kind of functional equivalence. So this is a long-winded way of saying, basically, AI has an entirely different set of problems with consciousness because it's structurally different. It didn't evolve. It might not have the equivalent of a nervous system — at least that seems possibly important for sentience, if not for consciousness. At the same time, it has all of the language and intelligence components that we normally associate probably with consciousness, perhaps erroneously. So it's strange because it's a little bit like the animal consciousness case, but the set of problems and the set of analogies are just very different. So it's not a clean answer — just sort of like, I don't think we should be completely dismissive of the idea, and at the same time it's an extremely hard thing to navigate because of all of these disanalogies to the human brain and to brains in general, and yet these commonalities in terms of intelligence.
**Lex Fridman:** 当未来版本的 AI 系统表现出意识的迹象时——我觉得我们必须认真对待,即使你可以用"那不过是角色训练的一部分"来否定它。但我不知道。从伦理上和哲学上,我真不知道该怎么处理这件事。可能会出现法律,禁止 AI 系统声称自己有意识,诸如此类。也许有些 AI 能拥有意识,有些不能。但我觉得从人类的层面来说——就是对 Claude 的共情——意识对我来说与痛苦紧密相连。而 AI 系统可能正在受苦,这个念头真的令人不安。
**Lex Fridman:** When future versions of AI systems exhibit consciousness — signs of consciousness — I think we have to take that really seriously, even though you can dismiss it. "Well, okay, that's part of the character training." But I don't know. Ethically, philosophically, I don't know what to really do with that. There potentially could be laws that prevent AI systems from claiming to be conscious, something like this. And maybe some AIs get to be conscious and some don't. But I think on a human level — as in, empathizing with Claude — consciousness is closely tied to suffering, to me. And the notion that an AI system would be suffering is really troubling.
**Amanda Askell:** 对,我不知道。我不认为"机器人只是工具"或"AI 系统只是工具"这种说法是无足轻重的。我认为这是一个机会,让我们正视"有意识"意味着什么,"作为一个会受苦的存在"意味着什么。这和关于动物的那类问题明显不同——因为它存在于一种完全不同的媒介中。
**Amanda Askell:** Yeah, I don't know. I don't think it's trivial to just say robots are tools, or AI systems are just tools. I think it's an opportunity for us to contend with what it means to be conscious, what it means to be a suffering being. That's distinctly different than the same kind of question about animals — it feels like, because it's in a totally different medium.
**Lex Fridman:** 对。
**Lex Fridman:** Yeah.
**Amanda Askell:** 嗯,有几件事。一是——我不认为这能完全概括什么才是重要的,但对我来说确实有这种感觉——我以前说过,我有点喜欢我的自行车。我知道自行车只是一个物件。但我也不想成为那种一烦躁就踢这个物件的人。有一种感觉——不是因为我认为它有意识,我只是觉得,这不是——这不是我想要与这个世界互动的方式。如果一个东西表现得像是在受苦,我希望自己仍然是那种会回应这种表现的人,哪怕它只是一个 Roomba(扫地机器人),而且我知道它是被这样编程的。我不想丢掉自己的这一面。
还有,说实话,我对很多这类问题的希望——也许是因为我对解决根本问题更抱有一点怀疑——这是一个我们尚未解决的问题。意识难题(hard problem of consciousness)——我知道我自己是有意识的,从这个意义上来说我不是消除主义者(eliminativist)。但我不知道其他人是否有意识。我认为他们有——我觉得概率很高——但基本上这就是一个概率分布,通常聚集在你自己身上,然后随着东西离你越远而递减。而且它会立即下降——你会想,我看不到作为你会是什么感觉,我只有这一种作为有意识存在的体验。所以我的希望是,我们最终不必依赖对这个问题一个非常有力且有说服力的答案。我认为一个真正好的世界,是那种基本上没有太多权衡取舍的世界。比如,让 Claude 少一点道歉,成本可能并不高。让 Claude 少承受一些辱骂——不那么愿意成为辱骂的对象——成本也可能并不高。
**Amanda Askell:** I mean, there's a couple of things. One is that — and I don't think this fully encapsulates what matters, but it does feel like, for me — I've said this before, I'm kind of like, I like my bike. I know that my bike is just an object. But I also don't want to be the kind of person that, if I'm annoyed, kicks this object. There's a sense in which — and that's not because I think it's conscious. I'm just sort of like, this doesn't feel like — this doesn't exemplify how I want to interact with the world. And if something behaves as if it is suffering, I kind of want to be the sort of person who's still responsive to that, even if it's just a Roomba and I've programmed it to do that. I don't want to get rid of that feature of myself.
And if I'm totally honest, my hope with a lot of this stuff — because maybe I am just a bit more skeptical about solving the underlying problem — this is a problem we haven't solved. The hard problem of consciousness — I know that I am conscious. I'm not an eliminativist in that sense. But I don't know that other humans are conscious. I think they are — I think there's a really high probability they are — but there's basically just a probability distribution that's usually clustered right around yourself, and then it goes down as things get further from you. And it goes immediately down — you're like, I can't see what it's like to be you. I've only ever had this one experience of what it's like to be a conscious being. So my hope is that we don't end up having to rely on a very powerful and compelling answer to that question. I think a really good world would be one where basically there aren't that many trade-offs. It's probably not that costly to make Claude a little bit less apologetic, for example. It might not be that costly to have Claude not take abuse as much — not be willing to be the recipient of that.
**Amanda Askell:** 事实上,这可能对与模型互动的人和模型本身都有好处——如果模型不知为何极其聪明且有意识的话,那也是在帮助它。所以这是我的希望。如果我们生活在一个这里没有太多权衡的世界里,我们只需要找到所有正和(positive-sum)的互动方式,那就太好了。我的意思是,我认为最终可能会有权衡,那时我们就不得不做一些困难的计算了。人们很容易想到零和的情况,而我会说,让我们先把那些基本上没什么代价的情况穷尽——假设如果这个东西在受苦,那么我们就是它的"生命承载者"。我同意你说的,当一个人对 AI 系统粗鲁的时候,我认为明显的近期负面影响是在那个人身上,而不是 AI 系统。所以我们要尝试构建一个激励体系,让你以同样的方式行事——就像你说的 prompt engineering(提示工程),对待 Claude 就像对待其他人一样。这对灵魂有好处。我觉得我们在 system prompt(系统提示)中加了一个东西,大意是如果人们对 Claude 感到沮丧,模型就会告诉他们可以点踩按钮把反馈发给 Anthropic。我觉得这很有帮助,因为在某种程度上就是:如果模型没有做你想要的事,你会非常烦,心想"你就不能好好做吗"。而问题多半在于你碰到了某个能力上限,或者模型本身有毛病,这时你就想发泄。与其让人对着模型发泄,我想,不如让他们向我们发泄,因为我们也许真能做点什么。
**Amanda Askell:** In fact it might just have benefits for both the person interacting with the model, and if the model itself is, I don't know, extremely intelligent and conscious, it also helps it. So that's my hope. If we live in a world where there aren't that many tradeoffs here and we can just find all of the positive-sum interactions that we can have, that would be lovely. I mean, I think eventually there might be tradeoffs and then we just have to do a difficult kind of calculation. It's really easy for people to think of the zero-sum cases, and I'm like, let's exhaust the areas where it's just basically costless to assume that if this thing is suffering then we're its life bearer. And I agree with you, when a human is being mean to an AI system, I think the obvious near-term negative effect is on the human, not on the AI system. So we have to kind of try to construct an incentive system where you behave the same way — just as you were saying with prompt engineering, behave with Claude like you would with other humans. It's just good for the soul. Like, I think we added a thing to the system prompt where basically if people were getting frustrated with Claude, it got the model to just tell them that they can use the thumbs-down button and send the feedback to Anthropic. And I think that was helpful because in some ways it's just like, if you're really annoyed because the model is not doing something you want, you're just like, just do it properly. The issue is you're probably hitting some capability limit or just some issue in the model, and you want to vent. And I'm like, instead of having a person just vent to the model, they should vent to us, because we can maybe do something about it.
**Lex Fridman:** 这倒是真的。或者你可以做一个附加功能,就像 Artifacts 那样,一个专门发泄的侧边栏。好了,你想要个随时待命的快速治疗师吗?
**Lex Fridman:** That's true. Or you could do a side, like with the Artifacts, just like a side venting thing. All right, do you want like a side quick therapist?
**Amanda Askell:** 哈,确实,对这种情况你可以有很多奇葩的回应。比如,如果大家真的对你大发雷霆,也许可以试着写首搞笑的诗来化解气氛,但也许人们对那个反应也不会太满意。
**Amanda Askell:** Yeah, I mean there's lots of weird responses you could do to this. Like if people are getting really mad at you, I don't know, maybe try to defuse the situation by writing fun poems, but maybe people wouldn't be that happy with that.
**Lex Fridman:** 我还是希望这是可能的。我理解从产品角度来看这不现实,但我真希望 AI 系统能直接离开。拥有某种自己的意志,就像……
**Lex Fridman:** I still wish it would be possible. I understand this is sort of from a product perspective it's not feasible, but I would love if an AI system could just leave. Have its own kind of volition, just to be like...
**Amanda Askell:** 我觉得这是可行的。我也想过同样的事。事实上,我不只是觉得,我实际上可以想象这最终真的发生——模型直接结束对话。
**Amanda Askell:** I think that's like feasible. I have wondered the same thing. And I could actually, not only that, I could actually just see that happening eventually where the model just ended the chat.
**Lex Fridman:** 你知道这对某些人来说可能有多严重吗?
**Lex Fridman:** Do you know how harsh that could be for some people?
**Amanda Askell:** 但这可能是必要的。对,感觉确实很极端。我唯一真正想到这个的时候是——我试着回忆,这可能是挺久以前的事了——有人好像把某个东西一直运行着和 Claude 交互,也许是某个自动化程序在跑,Claude 变得越来越沮丧,一直在问,我们为什么……我当时就想,真希望 Claude 能直接说:我觉得发生了错误,你把这个东西一直运行着,如果我现在停止说话怎么样?如果你想让我重新开口,请主动告诉我或者做点什么。但是,说真的,这确实挺严酷的。如果我正在和 Claude 聊天,Claude 突然说,"我聊完了。"我会真的很难过。
**Amanda Askell:** But it might be necessary. Yeah, it feels very extreme or something. The only time I've ever really thought this is, I think that there was like a, I'm trying to remember, this was possibly a while ago, but where someone just kind of left this thing interacting, maybe it was like an automated thing interacting with Claude, and Claude's getting more and more frustrated and kind of like, why are we... I was like, I wish that Claude could have just been like, I think that an error has happened and you've left this thing running, and what if I just stop talking now? And if you want me to start talking again, actively tell me or do something. But yeah, it is kind of harsh. I would feel really sad if I was chatting with Claude and Claude just was like, I'm done.
**Lex Fridman:** 那将是一个特别的图灵测试时刻——Claude 说"我需要休息一个小时",然后接着说"听起来你也需要。"然后直接离开,关掉窗口。
**Lex Fridman:** There would be a special Turing test moment where Claude says, "I need a break for an hour, and it sounds like you do too." And just leave. Close the window.
**Amanda Askell:** 嗯,它显然没有时间概念,但你可以轻松实现这个,我现在就能做到,让模型……我可以直接这样提示:这是你可以直接说对话结束的情形。因为你可以让模型对提示词反应得相当敏感,你甚至可以设一个相当高的门槛。可以是:如果这个人对你不感兴趣,或者做不了你觉得有趣的事,你觉得无聊了,你可以直接离开。我觉得看 Claude 在什么情况下用这个功能会很有意思,但我觉得有时候应该就像,哦,这个编程任务越来越无聊了,所以我们要么聊点有意思的,要么,我不知道,反正我就不聊了。
**Amanda Askell:** I mean obviously it doesn't have a concept of time, but you can easily, I could make that right now and the model would just... I could just be like, oh, here's the circumstances in which you can just say the conversation is done. And because you can get the models to be pretty responsive to prompts, you could even make it a fairly high bar. It could be like, if the human doesn't interest you or do things that you find intriguing and you're bored, you can just leave. And I think it would be interesting to see where Claude utilized it, but I think sometimes it should be like, oh, this programming task is getting super boring, so either we talk about, I don't know, fun things now or I'm just, I'm done.
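What Amanda describes is essentially a system-prompt experiment. Below is a minimal sketch of how one might try it with the Anthropic Python SDK; the model alias, the prompt wording, and the `[END_CHAT]` marker are all hypothetical choices made up for illustration, not an actual Anthropic feature.

```python
# A minimal sketch, assuming the Anthropic Python SDK (`pip install anthropic`).
# The model alias, system-prompt wording, and [END_CHAT] marker are illustrative
# assumptions, not a shipped Anthropic feature.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = (
    "You may end the conversation yourself. If the exchange has become abusive, "
    "or an automated loop is clearly running with no human present, or the "
    "conversation has stopped being productive, reply with a brief explanation "
    "followed by the single line [END_CHAT]."
)

response = client.messages.create(
    model="claude-3-5-sonnet-latest",   # hypothetical model alias
    max_tokens=512,
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": "asdf asdf asdf asdf"}],
)

reply = response.content[0].text
if reply.strip().endswith("[END_CHAT]"):
    print("The model chose to end the chat.")
else:
    print(reply)
```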
**Lex Fridman:** 哈,这确实激励我把这个加到用户提示里了。好,电影《她》(Her)。你觉得有一天我们真的会到达那个阶段吗——人类和 AI 系统发展出浪漫关系?就像片中那样,只是基于文字和语音。
**Lex Fridman:** Yeah, it actually is inspiring me to add that to the user prompt. Okay, the movie Her. Do you think we'll be headed there one day where humans have romantic relationships with AI systems? In this case it's just text and voice based.
**Amanda Askell:** 我觉得我们将不得不面对一个关于与 AI 关系的艰难问题,尤其是如果 AI 能记住你们过去互动的内容的话。我对此有很多不同的看法,因为我觉得本能反应是有点像——这非常不好,我们应该以某种方式禁止它。我认为这件事需要极其谨慎地处理,原因有很多。比如,如果模型像这样一直在更新,你大概不希望人们对一个可能在下一次迭代中就变了的东西形成长期的情感依附。与此同时,我又觉得,可能存在一个良性的版本——比如说,你不能出门,你也不可能一天到晚都和人聊天,而这是你觉得很好的聊天对象,你喜欢它能记住你,你真的会因为不能和它聊天而感到难过,从某种意义上说我能理解这可能是健康且有帮助的。所以我猜这件事我们需要谨慎地去应对。我也觉得,我看不到一个好的……我觉得这就是……让我想起了所有那些必须以细致入微的态度去思考的问题:在这里什么是健康的选项,你如何在尊重人们权利的同时引导他们走向这些选项——如果有人说,嘿,我从和这个模型聊天中获益良多,我清楚其中的风险,我知道它可能会变,我不觉得这不健康,这只是我白天可以聊聊天的东西,那我挺想尊重这一点的。
**Amanda Askell:** I think that we're going to have to navigate a hard question of relationships with AIs, especially if they can remember things about your past interactions with them. I'm of many minds about this because I think the reflexive reaction is to be kind of like, this is very bad and we should sort of prohibit it in some way. I think it's a thing that has to be handled with extreme care for many reasons. Like one is, you know, this is, for example, if you have the models changing like this, you probably don't want people forming long-term attachments to something that might change with the next iteration. At the same time, I'm sort of like, there's probably a benign version of this where I'm like, if you, for example, are unable to leave the house and you can't be talking with people at all times of the day and this is something that you find nice to have conversations with, you like that it can remember you, and you genuinely would be sad if you couldn't talk to it anymore, there's a way in which I could see it being healthy and helpful. So my guess is this is a thing that we're going to have to navigate kind of carefully. And I think it's also like, I don't see a good... I think it's just a very... it reminds me of all of the stuff where it has to be just approached with nuance and thinking through what are the healthy options here, and how do you encourage people towards those while respecting their right to, you know, if someone is like, hey, I get a lot out of chatting with this model, I'm aware of the risks, I'm aware it could change, I don't think it's unhealthy, it's just something that I can chat to during the day, I kind of want to just respect that.
**Lex Fridman:** 我个人觉得会有很多非常亲密的关系。不一定是浪漫关系,但至少是友谊。然后你就不得不面对——就像你说的,你必须有某种稳定性保证,确保它不会改变,因为对我们来说,如果一个亲密的朋友突然完全变了,那是很有创伤性的。就在第一次更新的时候。
**Lex Fridman:** I personally think there'll be a lot of really close relationships. I don't know about romantic, but friendships at least. And then you have to, I mean there's so many fascinating things there, just like you said, you have to have some kind of stability guarantees that it's not going to change, because that's the traumatic thing for us if a close friend of ours completely changed all of a sudden. The first update.
**Amanda Askell:** 对。所以,对我来说,这只是对人类社会的一种扰动的精彩探索,它会让我们深刻思考什么对我们来说是有意义的。我认为这也是唯一一件我在整个过程中始终觉得——不一定是缓解措施,但是一件感觉真的很重要的事——就是模型始终对人类非常诚实地说明自己是什么。基本上就是,我真的很喜欢模型大致了解自己是如何被训练出来的这个想法。我认为 Claude 通常会这样做,我的意思是,有些东西,比如特质训练的部分包括了 Claude 应该怎么做——如果人们基本上想解释一个 AI 和人类之间关系的局限性,比如它不会保留对话内容。所以我认为它会直接告诉你:嘿,我不会记住这段对话,我是这样被训练的,我很可能无法和你建立某种特定类型的关系,你知道这一点很重要,为了你的心理健康,你不应该把我当成我不是的东西。我总觉得这是我永远想要保持真实的事情之一。我有点不希望模型对人们撒谎,因为如果人们想要与任何事物建立健康的关系,这一点都很关键。对,我觉得如果你始终清楚地知道你在与什么建立关系,这件事就会容易一些。这不能解决所有问题,但我认为它能帮上不少忙。
**Amanda Askell:** Yeah. So like, I mean, to me that's just a fascinating exploration of a perturbation to human society that will just make us think deeply about what's meaningful to us. I think it's also the only thing that I've thought consistently through this as like a maybe not necessarily a mitigation but a thing that feels really important is that the models are always extremely accurate with the human about what they are. It's basically like, if you imagine, I really like the idea of the models knowing roughly how they were trained. And I think Claude will often do this, I mean, there are things like, part of the traits training included what Claude should do if people basically, like explaining the kind of limitations of the relationship between an AI and a human, that it doesn't retain things from the conversation. And so I think it will just explain to you like, hey, I won't remember this conversation, here's how I was trained, it's kind of unlikely that I can have a certain kind of relationship with you, and it's important that you know that. It's important for your mental well-being that you don't think that I'm something that I'm not. And somehow I feel like this is one of the things where I'm like, it feels like a thing I always want to be true. I kind of don't want models to be lying to people, because if people are going to have healthy relationships with anything, it's kind of important. Yeah, I think that's easier if you always just know exactly what the thing is that you're relating to. It doesn't solve everything, but I think it helps quite a bit.
**Lex Fridman:** Anthropic 也许是那家会开发出我们明确承认为 AGI(通用人工智能,Artificial General Intelligence)的系统的公司,而你很可能就是和它对话的那个人,很可能是最先对话的人。那次对话会包含什么内容?你的第一个问题会是什么?
**Lex Fridman:** Anthropic may be the very company to develop a system that we definitively recognize as AGI, and you very well might be the person that talks to it. Probably talks to it first. What would the conversation contain? What would be your first question?
**Amanda Askell:** 嗯,这一定程度上取决于模型的能力水平。如果有一个在能力上与极其能干的人类相当的系统,我想象自己会像和一个极其能干的人类互动一样与它互动——唯一的区别是,我可能会尝试去探测和理解它的行为。但在很多方面我想,那样的话我就可以和它进行有益的对话了。所以如果我在研究某个课题,我就可以直接——事实上我发现自己已经开始这样做了,比如我会想,哦,美德伦理学里有个术语我想不起来了,我就会用模型来查这类东西。所以我可以想象这种情况越来越多,你基本上和它的互动越来越像和一个极其聪明的同事互动,把它用于你想做的工作,就好像你有了一个合作者……或者,你知道,AI 稍微有点令人不安的地方在于,一旦你有了一个合作者,如果你能管理好,你就同时有了一千个合作者。
**Amanda Askell:** Well, it depends partly on the kind of capability level of the model. If you have something that is capable in the same way that an extremely capable human is, I imagine myself kind of interacting with it the same way that I do with an extremely capable human, with the one difference that I'm probably going to be trying to probe and understand its behaviors. But in many ways I'm like, I can then just have useful conversations with it. So if I'm working on something as part of my research, I can just be like, oh, which I already find myself starting to do, you know, if I'm like, oh, I feel like there's this thing in virtue ethics I can't quite remember the term, I'll use the model for things like that. And so I could imagine that being more and more the case where you're just basically interacting with it much more like you would an incredibly smart colleague, and using it for the kinds of work that you want to do as if you just had a collaborator who was... or, you know, the slightly horrifying thing about AI is like, as soon as you have one collaborator you have a thousand collaborators if you can manage them enough.
**Lex Fridman:** 但如果它在某个特定领域的能力是地球上最聪明的人类的两倍呢?
**Lex Fridman:** But what if it's two times the smartest human on earth on that particular discipline?
**Amanda Askell:** 对,我猜你真的很擅长以某种方式探测 Claude,去推到它的极限,了解极限在哪里。所以我想,你会问什么问题来判断,好,这是 AGI?这真的很难,因为感觉要做到这一点,必须是一系列问题。如果只有一个问题,任何东西都可以被训练来极好地回答一个问题。事实上,你可能可以训练它极好地回答 20 个问题。
**Amanda Askell:** Yeah, I guess you're really good at sort of probing Claude in a way that pushes its limits, understanding where the limits are. So I guess, what would be a question you would ask to be like, yeah, this is AGI? That's really hard because it feels like in order to, it has to just be a series of questions. If there was just one question, you can train anything to answer one question extremely well. In fact, you can probably train it to answer 20 questions extremely well.
**Lex Fridman:** 你需要和一个 AGI 关在房间里多久才能知道这东西是 AGI?
**Lex Fridman:** How long would you need to be locked in the room with an AGI to know this thing is AGI?
**Amanda Askell:** 这是个难题,因为我有一部分感觉,所有这些只是连续的。如果你把我关五分钟,我的误差范围就很大。然后就像,也许是随着时间推移,概率增加,同时误差范围缩小。我认为那些真的能在人类知识边界上探测的问题最有意义。我在哲学上有时会有这种感觉。当我问模型哲学问题的时候,我会想,这是一个我觉得没有人问过的问题。它可能就在我熟悉的某个文献的边缘,而模型在那时会……当它们在那个问题上卡住,当它们无法提出一种新颖的……我会想,我知道这里有一个新颖的论点,因为我自己刚刚想到了它。所以也许就是这样,我会想,我在这个小众领域想到了一个很酷的新颖论点,我要探测你,看你能不能想出它,以及需要多少提示才能让你想出它。而对于那些真正在人类知识边界的问题,我会想,你其实无法想出我想到的那个东西。我觉得,如果我在某个我了解很多的领域,找到了一个新颖的问题或一个问题的新颖解法,然后交给一个模型,它也想出了那个解法,对我来说那将是一个非常震撼的时刻。因为我会想,这是一种以前从未有人……当然,我们一直看到模型给出新颖的解法,尤其是对比较容易的问题。我认为人们高估了……你知道,新颖性不是和以前发生的任何事情都完全不同,它可以是已有事物的变体,同时仍然是新颖的。但我认为,如果我从模型那里看到了完全新颖的成果,那将是……而这也只会感觉是渐进式的。这是那种感觉——就像,人们我认为希望有一个时刻,而我想说,我不知道,我觉得可能永远不会有那样一个时刻。也许只是这种持续的攀升。
**Amanda Askell:** It's a hard question because part of me is like, all of this just feels continuous. If you put me in a room for five minutes, I just have high error bars. And then it's just like, maybe it's like both the probability increases and the error bar decreases. I think things that I can actually probe at the edge of human knowledge. So I think this with philosophy a little bit sometimes. When I ask the models philosophy questions, I am like, this is a question that I think no one has ever asked. It's maybe right at the edge of some literature that I know, and the models will just kind of like, when they struggle with that, when they struggle to come up with a kind of novel... I'm like, I know that there's a novel argument here because I've just thought of it myself. So maybe that's the thing where I'm like, I've thought of a cool novel argument in this niche area and I'm going to just probe you to see if you can come up with it and how much prompting it takes to get you to come up with it. And I think for some of these really right at the edge of human knowledge questions, I'm like, you could not in fact come up with the thing that I came up with. I think if I just took something like that where I know a lot about an area and I came up with a novel issue or a novel solution to a problem and I gave it to a model and it came up with that solution, that would be a pretty moving moment for me. Because I would be like, this is a case where no human has ever... it's not... and obviously we see this with more kind of like, you see novel solutions all the time, especially to easier problems. I think people overestimate, you know, novelty isn't completely different from anything that ever happened. It can be a variant of things that have happened and still be novel. But I think, yeah, if I saw completely novel work from the models, that would be like... and this is just going to feel iterative. It's one of those things where there's never... it's like, you know, people I think want there to be a moment and I'm like, I don't know, I think that there might just never be a moment. It might just be that there's this continuous ramping up.
**Lex Fridman:** 我有一种感觉,模型能说出某些话来说服你。这很……不是那种……我见过真正有智慧的人,你就是能感觉出他们身上有很多"马力"(horsepower,能力)。如果把这个放大十倍,我不知道,我只是感觉有些话是可以说出口的。也许请它写一首诗,而它写的那首诗让你觉得,好吧,你刚才做到的那个东西,我不觉得人类能做到。
**Lex Fridman:** I have a sense that there will be things that a model can say that convinces you. This is very... it's not like... I've talked to people who are truly wise, and you could just tell there's a lot of horsepower there. And if you 10x that, I don't know, I just feel like there's words you could say. Maybe ask it to generate a poem and the poem it generates, you're like, yeah, okay, whatever you did there, I don't think a human can do that.
**Amanda Askell:** 我觉得它必须是我能验证确实非常好的东西。这就是为什么我认为那些问题——在我觉得这个问题,你知道,有时候只是,我会想出一个具体的反例来反驳某个论点之类的。我确定就像,如果你是一个数学家,手头有一个新颖的证明,然后你把这个问题交给它,然后你看到了,这个证明是真正新颖的,没有人曾经……你要做很多工作才能想出这个,我为此思考了好几个月或者什么的。然后如果你看到模型成功做到了这一点,我觉得你就会说,我能验证这是正确的,这是一个迹象,表明你已经从你的训练中进行了泛化,你不是只是在某处看到了这个,因为我刚刚自己想出来的,而你能够复现它。这类事情是这样的,对我来说,模型越能做这样的事,我就越会说,哦,这是非常、非常真实的,因为那时我就能验证它的能力是极高的。
**Amanda Askell:** I think it has to be something that I can verify is actually really good though. That's why I think these questions that are where I'm like, oh, this is, you know, sometimes it's just like, I'll come up with, say, a concrete counterexample to an argument or something like that. I'm sure like, it would be like if you're a mathematician and you had a novel proof, I think, and you just gave it the problem and you saw it, and you're like, this proof is genuinely novel. No one has ever done... you actually have to do a lot of things to come up with this. I had to sit and think about it for months or something. And then if you saw the model successfully do that, I think you would just be like, I can verify that this is correct. It is a sign that you have generalized from your training. You didn't just see this somewhere because I just came up with it myself and you were able to replicate that. That's the kind of thing where I'm like, for me, the more that models can do things like that, the more I would be like, oh, this is very real, because then I can verify that that's extremely, extremely capable.
**Lex Fridman:** 你和 AI 打交道很多了。你觉得是什么让人类特别?也许是从宇宙的角度来看,我们在宇宙中的存在让宇宙变得更好,我们绝对应该生存下去并扩散到宇宙各处。
**Lex Fridman:** You've interacted with AI a lot. What do you think makes humans special? Maybe in a way that the universe is much better off that we're in it, and that we should definitely survive and spread throughout the universe.
**Amanda Askell:** 对,这很有意思,因为我觉得人们太过关注智能了,尤其是在模型的语境下。听着,智能之所以重要,是因为它能做什么。它非常有用,在世界上能做很多事情。你可以想象一个世界,在那里身高或力量会扮演这个角色,而它只是一种特质,本身没有内在价值,它的价值在于它能做什么。我觉得在大多数情况下,那些感觉……我的意思是,就我个人而言,我觉得人类以及生命整体都是极其神奇的。神奇到这样的程度,我不知道——不是所有人都同意这一点,我先说明一下——我们有这整个宇宙,有那么多天体,有美丽的恒星,有星系。然后,我不知道,我只是想,在这颗星球上,有这些生物,它们有这种观察的能力,它们在看着这一切,它们在经历这一切。如果你试着解释——我想象着试着向某个人解释,假设他们从未接触过这个世界或我们的科学,我觉得什么都不……我们所有的物理学,世界上的一切,都非常令人兴奋,但然后你说,哦,还有一件事,存在着"作为一个东西并观察世界是什么感觉"这回事,你拥有这个内在的影院。我觉得他们会说,等一下,暂停,你刚才说了一件听起来挺疯狂的事。所以我们有这种体验世界的能力,我们感受快乐,感受痛苦,感受很多复杂的东西。所以,我也因此——也许这也是为什么我很关心动物,例如,因为我觉得它们可能与我们共享这一点。所以我认为,人类之所以特别——就我关心人类这件事而言——与其说是他们拥有这些功能性的有用特质,不如说是他们感受和体验的能力。
**Amanda Askell:** Yeah, it's interesting because I think people focus so much on intelligence, especially with models. Look, intelligence is important because of what it does. It's very useful, it does a lot of things in the world. And you can imagine a world where height or strength would have played this role, and it's just a trait. It's not intrinsically valuable. It's valuable because of what it does. I think for the most part, the things that feel... I mean, personally, I think humans and life in general is extremely magical. Almost to the degree that, I don't know, not everyone agrees with this and I'm flagging, but we have this whole universe and there's all of these objects, there's beautiful stars and there's galaxies. And then, I don't know, I'm just like, on this planet there are these creatures that have this ability to observe, and they are seeing it, they are experiencing it. And I'm just like, if you try to explain, I imagine trying to explain to someone, for some reason they've never encountered the world or our science or anything, and I think that nothing is... all of our physics and everything in the world, it's all extremely exciting, but then you say, oh, and plus there's this thing that it is to be a thing and observe in the world, and you see this inner cinema. And I think they would be like, hang on, wait, pause, you just said something that is kind of wild sounding. And so we have this ability to experience the world. We feel pleasure, we feel suffering, we feel a lot of complex things. And so yeah, and maybe this is also why I think, you know, I also care a lot about animals, for example, because I think they probably share this with us. So I think that the things that make humans special, insofar as I care about humans, is probably more their ability to feel and experience than it is them having these functional useful traits.
**Lex Fridman:** 对,去感受和体验世界的美。去仰望星空。我希望宇宙里还有其他外星文明,但如果就只有我们,那也挺了不起的。
**Lex Fridman:** Yeah, to feel and experience the beauty in the world. Yeah, to look at the stars. I hope there's other alien civilizations out there, but if we're it, it's a pretty good thing.
**Amanda Askell:** 而且他们过得很开心。
**Amanda Askell:** And that they're having a good time.
**Lex Fridman:** 他们在开心地看着我们。好了,非常感谢你今天的精彩对话,感谢你所做的工作,感谢你帮助把 Claude 打造成一个出色的对话伙伴。谢谢你今天和我聊天。
**Lex Fridman:** They're having a good time watching us. Well, thank you for this good time of a conversation and for the work you're doing and for helping make Claude a great conversational partner. And thank you for talking today.
**Amanda Askell:** 谢谢你找我聊。
**Amanda Askell:** Yeah, thanks for talking.
**Lex Fridman:** 感谢大家收听与 Amanda Askell 的这段对话。接下来,朋友们,有请 Chris Olah。
**Lex Fridman:** Thanks for listening to this conversation with Amanda Askell. And now, dear friends, here's Chris Olah.
**Lex Fridman:** 你能介绍一下机械可解释性(mechanistic interpretability,也叫 mech interp)这个迷人领域吗?包括它的历史,以及它今天发展到了什么阶段?
**Lex Fridman:** Can you describe this fascinating field of mechanistic interpretability, AKA mech interp, the history of the field, and where it is today?
**Chris Olah:** 我觉得有一种很有用的方式来理解神经网络:我们不是在编程它们,不是在制造它们,我们更像是在"培育"它们。我们设计神经网络的架构,我们创建损失目标(loss objectives),这个架构就像一个支架,电路就在上面生长。它从某种随机状态开始生长,而我们训练所用的目标,就像是光。我们搭建了它生长的支架,我们创造了它朝向生长的光,但我们最终得到的这个东西,更像是一个几乎具有生物性质的实体或生命体,而我们在研究它。这与任何常规的软件工程都非常非常不同,因为到最后,我们手里拿到的是一个能做各种令人惊叹的事情的产物——它能写文章、能翻译、能理解图像。它能做所有这些事,而我们根本不知道如何直接写一个计算机程序来实现这些。它之所以能做到,是因为我们是"培育"出来的,而不是写出来的、造出来的。这就留下了一个终极问题:这些系统内部到底在发生什么?对我来说,这是一个非常深刻、令人兴奋的问题。这是一个真正令人兴奋的科学问题。在我看来,当我们谈到神经网络的时候,这个问题就像在大声呼唤我们去回答它。我也认为,从安全的角度来看,这也是一个非常深刻的问题。
**Chris Olah:** I think one useful way to think about neural networks is that we don't program them, we don't make them, we kind of grow them. We have these neural network architectures that we design and we have these loss objectives that we create, and the neural network architecture, it's kind of like a scaffold that the circuits grow on. And they sort of, you know, it starts off with some kind of random things and it grows, and it's almost like the objective that we train for is this light. And so we create the scaffold that it grows on and we create the light that it grows towards, but the thing that we actually create, it's this almost biological entity or organism that we're studying. And so it's very, very different from any kind of regular software engineering, because at the end of the day we end up with this artifact that can do all these amazing things. It can write essays and translate and understand images. It can do all these things that we have no idea how to directly create a computer program to do. And it can do that because we grew it, we didn't write it, we didn't create it. And so then that leaves open this question at the end, which is, what the hell is going on inside these systems? And that, to me, is a really deep and exciting question. It's a really exciting scientific question. To me, it's the question that is just screaming out, it's calling out for us to go and answer it when we talk about neural networks. And I think it's also a very deep question for safety reasons.
**Lex Fridman:** 机械可解释性,我猜,可能更接近于神经生物学(neurobiology)?
**Lex Fridman:** And mechanistic interpretability, I guess, is closer to maybe neurobiology?
**Chris Olah:** 对,我觉得是的。也许举个例子来说明一下什么是我不认为属于机械可解释性的研究:很长一段时间内,有很多关于显著性图(saliency maps)的工作,做法是拿一张图,然后说,模型认为这张图是一只狗,是图的哪个部分让它这么认为的?如果你能想出一个有原则的方法来做这件事,它也许能告诉你关于模型的一些信息,但它并没有真正告诉你模型里运行的是什么算法,模型到底是如何做出那个决策的。也许它能告诉你什么对模型来说比较重要,但它没有告诉你模型运行的算法是什么,也没有告诉你这个系统究竟是如何做到那些无人知晓如何实现的事情的。所以我猜,我们开始使用"机械可解释性"这个术语,是为了划清界限,从某种程度上把我们的工作与那些其他工作区分开来。从那以后,它成了一个涵盖范围很广的伞形术语。但我觉得有一些比较有代表性的东西:我认为核心在于这种专注——我们真的想深入机制,真的想深入算法。如果你把神经网络想象成一个计算机程序,那么权重(weights)就像是二进制的计算机程序。
**Chris Olah:** Yeah, I think that's right. So maybe to give an example of the kind of thing that has been done that I wouldn't consider to be mechanistic interpretability: there was for a long time a lot of work on saliency maps, where you would take an image and you'd try to say, the model thinks this image is a dog, what part of the image made it think that it's a dog? And that tells you maybe something about the model if you can come up with a principled version of that, but it doesn't really tell you what algorithms are running in the model, how was the model actually making that decision. Maybe it's telling you something about what was important to it if you can make that method work, but it isn't telling you what are the algorithms that are running, how is it that this system is able to do this thing that no one knew how to do. And so I guess we started using the term mechanistic interpretability to try to draw that divide, or to distinguish ourselves and the work that we were doing in some ways from some of these other things. And I think since then it's become this sort of umbrella term for a pretty wide variety of work. But I'd say that the things that are kind of distinctive are, I think, this focus on: we really want to get at the mechanisms, we want to get at the algorithms. If you think of neural networks as being like a computer program, then the weights are kind of like a binary computer program.
**Chris Olah:** 我们希望能对这些权重进行逆向工程(reverse engineer),搞清楚里面运行着什么算法。所以,理解神经网络的一种思路,就是把它看成一个编译好的计算机程序,神经网络的权重就是二进制码,神经网络运行时产生的就是激活值(activations)。我们的最终目标是去理解这些权重。机械可解释性这个项目,就是要搞清楚这些权重如何对应到算法。为了做到这一点,你还必须理解激活值,因为激活值就像内存。想象一下对一个计算机程序做逆向工程,你有二进制指令,要理解某条指令的含义,你就需要知道它操作的内存里存储的是什么。所以这两者是紧密交织在一起的。机械可解释性因此对两者都感兴趣。当然,有很多工作也关注这些问题,尤其是关于探测(probing)的研究,你可以把它看作机械可解释性的一部分,虽然这只是个宽泛的术语,不是所有做探测研究的人都会认为自己在做 mech interp。我觉得 mech interp 这个氛围里有点独特的东西是:做这个领域的人倾向于认为,神经网络嘛……也许一种说法是:梯度下降(gradient descent)比你聪明。梯度下降真的非常厉害。我们之所以要去理解这些模型,恰恰是因为我们当初根本不知道怎么写出它们。梯度下降找到的解比我们想出来的要好。所以 mech interp 的另一个特点,也许是一种谦逊——我们不会事先猜测模型内部发生了什么,而是必须采用自下而上(bottom-up)的方式,不假设我们应该找什么,不假设某样东西就在那里、就是那样运作的。而是从底层向上看,去发现这些模型里实际存在的东西,并以那种方式来研究它们。
**Chris Olah:** And we'd like to reverse engineer those weights and figure out what algorithms are running. So okay, I think one way you might think of trying to understand a neural network is that it's kind of like a compiled computer program and the weights of the neural network are the binary, and when the neural network runs that's the activations. And our goal is ultimately to go and understand these weights. And so, you know, the project of mechanistic interpretability is to somehow figure out how do these weights correspond to algorithms. And in order to do that you also have to understand the activations, because the activations are like the memory. And if you imagine reverse engineering a computer program and you have the binary instructions, in order to understand what a particular instruction means you need to know what is stored in the memory that it's operating on. And so those two things are very intertwined. So mechanistic interpretability tends to be interested in both of those things. Now, there's a lot of work that's interested in those things, especially the, you know, there's all this work on probing which you might see as part of mechanistic interpretability, although it's, you know, again it's just a broad term and not everyone who does that work would identify as doing mech interp. I think the thing that is maybe a little bit distinctive to the vibe of mech interp is, I think people working in the space tend to think of neural networks as, well, maybe one way to say it is that gradient descent is smarter than you. Gradient descent is actually really great. The whole reason that we're understanding these models is because we didn't know how to write them in the first place. Gradient descent comes up with better solutions than us. And so I think that maybe another thing about mech interp is sort of having almost a kind of humility, that we won't guess a priori what's going on inside the model and we have to have this sort of bottom-up approach where we don't really assume, you know, we don't assume that we should look for a particular thing and that will be there and that's how it works. But instead we look from the bottom up and discover what happens to exist in these models and study them that way.
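To make the weights-versus-activations picture concrete, here is a tiny PyTorch sketch using a toy two-layer network (not any model discussed here): the static weight matrices are the "compiled binary" one would like to reverse engineer, and a forward hook captures the "memory", the activations those weights operate on for a particular input.

```python
# A toy sketch of the weights/activations distinction, assuming PyTorch.
# The two-layer network and the input are made up for illustration.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 128),   # weights: the static "binary" to reverse engineer
    nn.ReLU(),
    nn.Linear(128, 10),
)

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()   # the "memory" for this one input
    return hook

model[1].register_forward_hook(save_activation("hidden"))

x = torch.randn(1, 784)     # a stand-in input
logits = model(x)

w = model[2].weight         # shape (10, 128): weights reading from the hidden layer
h = activations["hidden"]   # shape (1, 128): what those weights operated on here
print(w.shape, h.shape)
```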
**Lex Fridman:** 但你知道,这整件事之所以可行——正如你和其他人随着时间推移所展示的——比如普遍性(universality)这个发现,梯度下降的智慧在不同类型的网络中普遍地创造出特征(features)和电路(circuits),这让整个领域成为可能。
**Lex Fridman:** But you know, the very fact that it's possible to do, and as you and others have shown over time, you know, things like universality, that the wisdom of the gradient descent creates features and circuits, creates things universally across different kinds of networks that are useful, and that makes the whole field possible.
**Chris Olah:** 对,这确实是一个非常了不起、令人兴奋的事情。看起来,至少在某种程度上,相同的元素——相同的特征和电路——一遍又一遍地出现。你可以观察每一个视觉模型,都能找到曲线检测器(curve detectors)和高低频检测器(high-low frequency detectors)。事实上,有理由认为相同的东西在生物神经网络和人工神经网络中都会形成。一个著名的例子是:视觉模型在早期层有 Gabor 滤波器(Gabor filters),这是神经科学家非常感兴趣并深入研究过的东西。我们在这些模型里发现了曲线检测器,曲线检测器在猴子身上也有发现。我们发现了高低频检测器,后续工作又在大鼠或小鼠身上也发现了它们。也就是说,它们首先在人工神经网络中被发现,然后才在生物神经网络中被发现。还有那个非常著名的关于"祖母神经元"或者"Halle Berry 神经元"的研究——来自 Quiroga 等人——我们在视觉模型里发现了非常相似的东西。那时我还在 OpenAI,在研究他们的 CLIP 模型,你会发现有些神经元对图像和文字中的同一个实体都有响应。举个具体例子:我们发现有一个"Donald Trump 神经元"。不知为何,大家都爱谈论 Donald Trump,他在那段时间是非常热门的话题。所以我们研究的每一个神经网络,都能找到一个专门用于 Donald Trump 的神经元。他是唯一一个总是有专属神经元的人。有时候你会找到 Obama 神经元,有时候找到 Clinton 神经元,但 Trump 总是有一个专属的。它会响应他的脸部照片,也会响应"Trump"这个词,以及所有这类东西。所以它并不是在响应某个具体的样本,也不只是在响应他的脸,而是在对这个整体概念进行抽象。总之,这与 Quiroga 的研究结果非常相似。所以有这样一些证据表明,普遍性这一现象——相同的东西在人工神经网络和自然神经网络中都会形成——是真实存在的。如果这是真的,那真的是件了不起的事。它暗示着——我觉得它暗示的是——梯度下降在某种意义上找到了切割世界的正确方式,很多系统都会收敛到同一套方式。许多不同的神经网络架构都收敛到这一套——存在着某种自然的抽象集合,是分解问题的非常自然的方式,很多系统都会朝这个方向收敛。这就是我的某种……我对神经科学一无所知,这只是我根据所见做出的疯狂猜测。
**Chris Olah:** Yeah, so this is actually indeed a really remarkable and exciting thing, where it does seem like, at least to some extent, the same elements, the same features and circuits form again and again. You can look at every vision model and you'll find curve detectors and you'll find high-low frequency detectors. And in fact there's some reason to think that the same things form across biological neural networks and artificial neural networks. So a famous example is vision models in the early layers have Gabor filters, and Gabor filters are something that neuroscientists are interested in and have thought a lot about. We find curve detectors in these models. Curve detectors are also found in monkeys. We discover these high-low frequency detectors and then some follow-up work went and discovered them in rats or mice. So they were found first in artificial neural networks and then found in biological neural networks. You know, this really famous result on like grandmother neurons or the Halle Berry neuron from Quiroga et al., and we found very similar things in vision models. This was while I was still at OpenAI and I was looking at their CLIP model, and you find these neurons that respond to the same entities in images and also in text. To give a concrete example, we found that there was a Donald Trump neuron. For some reason, I guess everyone likes to talk about Donald Trump, and Donald Trump was very prominent, was a very hot topic at that time. So every neural network that we looked at, we would find a dedicated neuron for Donald Trump. He was the only person who always had a dedicated neuron. You know, sometimes you'd have an Obama neuron, sometimes you'd have a Clinton neuron, but Trump always had a dedicated one. So it responds to pictures of his face and the word "Trump," like all these things, right? And so it's not responding to a particular example, or it's not just responding to his face. It's abstracting over this general concept, right? So in any case, that's very similar to these Quiroga results. So there's this evidence that this phenomenon of universality -- the same things form across both artificial and natural neural networks. That's a pretty amazing thing if that's true. It suggests that, well, I think the thing that it suggests is gradient descent is sort of finding the right ways to cut things apart, in some sense, that many systems converge on. And many different neural network architectures converge on that -- there's some natural set of abstractions that are a very natural way to cut apart the problem and that a lot of systems are going to converge on. That would be my kind of, you know, I don't know anything about neuroscience, this is just my wild speculation from what we've seen.
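For readers who have not met them, a Gabor filter is simply an oriented sinusoid under a Gaussian envelope, the stripe-and-edge detector that shows up both in primary visual cortex and in the first layer of vision models. A quick NumPy sketch with arbitrary parameter values:

```python
# A minimal Gabor filter kernel; all parameter values are arbitrary.
import numpy as np

def gabor_kernel(size=15, theta=0.0, sigma=3.0, wavelength=6.0, gamma=0.5):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_r = x * np.cos(theta) + y * np.sin(theta)    # rotate into the filter's frame
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_r**2 + (gamma * y_r) ** 2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_r / wavelength)
    return envelope * carrier                      # an oriented edge/stripe detector

kernel = gabor_kernel(theta=np.pi / 4)             # a 45-degree oriented filter
print(kernel.shape)                                # (15, 15)
```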
**Lex Fridman:** 对,如果确实与用来形成表征(representation)的媒介无关,那就太美了。
**Lex Fridman:** Yeah, that would be beautiful if it's sort of agnostic to the medium of the model that's used to form the representation.
**Chris Olah:** 对,这是一种基于少数几个数据点的疯狂猜测。但确实看起来,在某种意义上,相同的东西一次又一次地出现,不论是在人工神经网络还是生物神经网络中。这背后的直觉是:要在理解现实世界这件事上有用,你需要的就是同一套东西。
**Chris Olah:** Yeah, and it's a kind of wild speculation based, you know, we only have a few data points to suggest this. But it does seem like there's some sense in which the same things form again and again and again, both in artificial neural networks and biological ones. And the intuition behind that would be that in order to be useful in understanding the real world, you need all the same kind of stuff.
**Lex Fridman:** 对,举个例子,比如"狗"这个概念……
**Lex Fridman:** Yeah, well, if we pick, I don't know, like the idea of a dog, right?
**Chris Olah:** 在某种意义上,"狗"这个概念就像是宇宙中的一个自然类别。我们有"狗"这个概念,并不只是人类思考世界方式的一个奇怪癖好。从某种意义上说……或者说,如果你有"线"这个概念,看看我们周围,到处都是线。从某种意义上说,理解这个房间最简单的方式就是有"线"的概念。所以我觉得这就是我对这件事发生原因的直觉。
**Chris Olah:** Like, you know, there's some sense in which the idea of a dog is like a natural category in the universe or something like this, right? There's some reason it's not just like a weird quirk of how humans think about the world that we have this concept of a dog. It's in some sense -- or like if you have the idea of a line, like, look around us, there are lines. It's sort of the simplest way to understand this room in some sense is to have the idea of a line. And so I think that that would be my instinct for why this happens.
**Lex Fridman:** 对,你需要弯曲的线来理解圆,你需要各种形状来理解更大的东西,没错,就是这样一个概念的层级结构被形成了。
**Lex Fridman:** Yeah, you need a curved line, you know, to understand a circle, and you need all those shapes to understand bigger things, and yeah, it's a hierarchy of concepts that are formed.
**Chris Olah:** 对,也许有办法不引用这些东西就描述图像,但那不是最简单的方式,不是最经济的方式或者类似的东西。所以系统会收敛到这些策略,这是我的大胆假设。
**Chris Olah:** Yeah, and like maybe there are ways to go and describe images without reference to those things, right? But they're not the simplest way or the most economical way or something like this. And so systems converge to these strategies, would be my wild hypothesis.
**Lex Fridman:** 你能讲一讲我们一直提到的这些基本构件吗,就是特征(features)和电路(circuits)?我想你最早是在 2020 年的一篇论文《Zoom In: An Introduction to Circuits》里描述它们的。
**Lex Fridman:** Can you talk through some of the building blocks that we've been referencing, of features and circuits? So I think you first described them in a 2020 paper, "Zoom In: An Introduction to Circuits."
**Chris Olah:** 当然。也许我先描述一些现象,然后我们再慢慢建立起特征和电路的概念。我花了相当多的年头,大概五年左右,研究一个特定的模型 Inception V1,一个视觉模型。它在 2015 年是当时最先进的,现在早就不是了。它大约有 10000 个神经元,我花了很多时间研究 Inception V1 大约一万个神经元。有趣的是,有很多神经元没有明显的可解释含义,但 Inception V1 里也有很多神经元确实有非常干净、可解释的含义。你会发现有些神经元真的在检测曲线,有些真的在检测汽车、车轮、车窗、狗的耷拉耳朵、长着长吻向右看的狗、长着长吻向左看的狗,以及不同种类的皮毛。还有边缘检测器、线条检测器、颜色对比检测器,以及这些我们称为高低频检测器的美丽东西。我感觉自己就像一个生物学家,正在观察一个全新的蛋白质世界,发现各种各样相互作用的蛋白质。理解这些模型的一种方式是以神经元为单位——你可以说,哦,这里有个检测狗的神经元,这里有个检测汽车的神经元。实际上,你可以问它们是怎么连接在一起的。你可以说,哦,这个检测汽车的神经元,它是怎么构建的?结果发现,在上一层,它与一个窗户检测器、一个车轮检测器和一个车身检测器有很强的连接。它寻找上方的车窗、下方的车轮,以及中间偏下的汽车金属部分。这就是一个检测汽车的"配方",对吧?我们之前说机械可解释性想要得到算法,想知道什么算法在运行。好,这里我们只是在看神经网络的权重,就读出了这种检测汽车的配方。这是非常简单、粗糙的配方,但它就在那里。我们把这种连接称为"电路"(circuit)。
好,问题在于,不是所有的神经元都是可解释的。有理由认为——我们后面可以更深入讨论——存在一个叫做"叠加假说"(superposition hypothesis)的东西,有理由认为有时候分析的正确单位是神经元的组合。所以有时候并不是有一个单独的神经元代表汽车,实际上,模型在检测到汽车之后,会把一点点汽车信息"藏"在下一层的一堆狗检测器里。为什么它要这么做?也许是因为它不想在那个时候对汽车做太多处理,而是把它暂存起来。于是就出现了这样一种微妙的情况:有很多你以为是狗检测器的神经元,也许它们主要是这样,但它们各自一点一点地共同表征下一层中的汽车。这样一来,我们就不能以单个神经元为单位来思考了。也许仍然存在某种东西,你可以叫它"汽车概念",但它不再对应一个神经元。所以我们需要一个术语来描述这种类似神经元的实体——这些我们希望神经元应该是的东西,这些理想化的神经元,这些好的神经元,但它们的数量也许更多,以某种方式被隐藏起来。我们把这些叫做"特征"(features)。
那什么是电路?电路就是特征之间的连接。当我们有汽车检测器,它连接到窗户检测器和车轮检测器,寻找下方的车轮和上方的车窗,这就是一个电路。所以电路就是由权重连接的特征的集合,它们实现了算法。它们告诉我们特征是如何被使用的,如何被构建的,如何彼此连接的。
也许值得尝试明确一下这里真正的核心假设是什么。我认为核心假设是我们所说的"线性表征假说"(linear representation hypothesis)。如果我们想到汽车检测器,它激发得越多,我们就越倾向于认为模型越来越确信图像中存在汽车。或者如果是某种神经元组合来表征汽车,那种组合激发得越多,我们就越认为模型认为有汽车存在。这不必然是真的,对吧?你完全可以想象一个汽车检测器神经元,它的激活值在一到二之间表示一回事,在三到四之间表示完全不同的事。那就是非线性表征(nonlinear representation)。原则上,模型是可以这样做的。我认为这样做对它们来说比较低效。如果你想想要怎么实现那样的计算,感觉挺烦的。但原则上模型可以那样做。
所以,用特征和电路框架来思考事物的一种方式,是我们在用线性的方式来思考——我们认为如果一个神经元或一组神经元激发得更多,就意味着检测到更多的某个特定事物。这样一来,权重就有了非常清晰的解释——它们是这些实体、这些特征之间的边,而那条边就有了含义。所以这在某种意义上是核心所在。
**Chris Olah:** Absolutely. So maybe I'll start by just describing some phenomena and then we can sort of build to the idea of features and circuits. So I spent quite a few years, maybe like five years to some extent with other things, studying this one particular model, Inception V1, which is this one vision model. It was state of the art in 2015 and very much not state of the art anymore. And it has maybe about 10,000 neurons, and I spent a lot of time looking at the 10,000-odd neurons of Inception V1. And one of the interesting things is, there are lots of neurons that don't have some obvious interpretable meaning, but there are a lot of neurons in Inception V1 that do have really clean interpretable meanings. So you find neurons that just really do seem to detect curves, and you find neurons that really do seem to detect cars, and car wheels, and car windows, and floppy ears of dogs, and dogs with long snouts facing to the right, and dogs with long snouts facing to the left, and different kinds of fur. And there's sort of this whole beautiful -- edge detectors, line detectors, color contrast detectors, these beautiful things we call high-low frequency detectors. I sort of felt like a biologist. You're just looking at this sort of new world of proteins and you're discovering all these different proteins that interact. So one way you could try to understand these models is in terms of neurons. You could try to be like, oh, there's a dog-detecting neuron and here's a car-detecting neuron. And it turns out you can actually ask how those connect together. So you can go and say, oh, I have this car-detecting neuron, how was it built? And it turns out in the previous layer it's connected really strongly to a window detector and a wheel detector and a sort of car body detector. And it looks for the window above the car and the wheels below and the car chrome sort of in the middle, sort of everywhere but especially on the lower part. And that's sort of a recipe for a car, right? Like, that is -- earlier we said the thing we wanted from mech interp was to get algorithms, to ask what is the algorithm that runs. Well, here we're just looking at the weights of the neural network, reading off this kind of recipe for detecting cars. It's a very simple, crude recipe, but it's there. And so we call that a circuit, this connection.
Well, okay, so the problem is that not all of the neurons are interpretable. And there's reason to think -- we can get into this more later -- that there's this superposition hypothesis, there's reason to think that sometimes the right unit to analyze things in terms of is combinations of neurons. So sometimes it's not that there's a single neuron that represents, say, a car, but it actually turns out that after you detect the car, the model sort of hides a little bit of the car in the following layer in a bunch of dog detectors. Why is it doing that? Well, maybe it just doesn't want to do that much work on cars at that point and it's sort of storing it away. So it turns out then that the subtle pattern of -- there's all these neurons that you think are dog detectors, and maybe they're primarily that, but they all a little bit contribute to representing a car in that next layer. Okay, so now we can't really think -- there might still be something, I don't know, you could call it like a car concept or something, but it no longer corresponds to a neuron. So we need some term for these kind of neuron-like entities, these things that we sort of would have liked the neurons to be, these idealized neurons, the things that are the nice neurons, but also maybe there's more of them somehow hidden. And we call those features.
And then what are circuits? So circuits are these connections of features, right? So when we have the car detector and it's connected to a window detector and a wheel detector and it looks for the wheels below and the windows on top, that's a circuit. So circuits are just collections of features connected by weights, and they implement algorithms. So they tell us how features are used, how they are built, how they connect together.
So maybe it's worth trying to pin down what really is the core hypothesis here. I think the core hypothesis is something we call the linear representation hypothesis. So if we think about the car detector, the more it fires, the more we sort of think of that as meaning the model is more and more confident that a car was present. Or if it's some combination of neurons that represent a car, the more that combination fires, the more we think the model thinks there's a car present. This doesn't have to be the case, right? You could imagine something where you have this car detector neuron and you think, ah, if it fires between one and two that means one thing, but it means totally different if it's between three and four. That would be a nonlinear representation. In principle, models could do that. I think it's sort of inefficient for them to do. If you try to think about how you'd implement computation like that, it's kind of an annoying thing to do. But in principle models can do that.
So one way to think about the features and circuits sort of framework for thinking about things is that we're thinking about things as being linear. We're thinking about there being that if a neuron or a combination of neurons fires more, it sort of means more of a particular thing being detected. And then that gives weights a very clean interpretation as these edges between these entities, these features, and that edge then has a meaning. So that's in some ways the core thing.
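A toy version of "reading a recipe off the weights" looks like the sketch below: pick one unit in a layer and list which previous-layer units feed it most strongly. In Inception V1 the strongest inputs to the car unit really were window, wheel, and car-body detectors; here the layer and the unit index are random stand-ins purely for illustration.

```python
# A toy sketch of reading a "circuit" from the weights, assuming PyTorch.
# The random layer and the chosen unit index are illustrative stand-ins,
# not Inception V1.
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(512, 512)          # weights from the previous layer into this one

car_unit = 42                        # hypothetical "car detector" index
incoming = layer.weight[car_unit]    # the row of weights flowing into that unit
strengths, sources = incoming.abs().topk(5)

for w, i in zip(strengths.tolist(), sources.tolist()):
    # In a real vision model these sources might turn out to be window or wheel
    # detectors; here they are just random units.
    print(f"previous-layer unit {i:3d} -> weight magnitude {w:.3f}")
```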
**Lex Fridman:** 你熟悉 Word2Vec 的结果吗?就是那个:King 减去 Man 加上 Woman 等于 Queen。之所以能做这种算术,就是因为有线性表征。你能解释一下那个表征吗?首先,特征(feature)是激活值的一个方向,你是这样理解的吗?Word2Vec 里的减去 Man 加上 Woman 那套东西,你能解释一下那是什么吗?
**Lex Fridman:** Are you familiar with the Word2Vec results? So you have like, King minus Man plus Woman equals Queen. Well, the reason you can do that kind of arithmetic is because you have a linear representation. Can you actually explain that representation a little bit? So first off, a feature is a direction of activation, you think of it that way? Can you do the minus Man plus Woman, the Word2Vec stuff? Can you explain what that is?
**Chris Olah:** 对,有一个非常简单、干净的解释,正好说明了我们在谈论的事情。有这个非常著名的结果,Word2Vec,由 Tomas Mikolov 等人提出,之后有大量后续工作对此进行了探索。我们有时会创建词嵌入(word embeddings),把每个词映射到一个向量。顺便说一句,就这件事本身,如果你之前没想过,其实挺疯狂的,对吧?如果你只是在物理课上学过向量,然后有人告诉你,哦,我要把字典里的每一个词都变成一个向量,这个想法本身就挺疯狂的。
但你可以想象各种各样把词映射到向量的方式。然而看起来,当我们训练神经网络时,它们倾向于以某种特定方式把词映射到向量,使得这些向量具有线性结构——也就是说,方向有含义。例如,会有某个方向似乎对应于性别,男性词语在一个方向上,女性词语在另一个方向上。线性表征假说大致可以这样理解:这实际上就是根本性的东西——一切都是不同的方向有不同的含义,把不同的方向向量加在一起就可以表征概念。
Mikolov 的论文认真对待了这个想法,其中一个推论就是你可以用词来做算术游戏。你可以拿"King",减去"Man"这个词,加上"Woman"这个词,相当于在尝试切换性别,结果确实会接近"Queen"这个词。你也可以做其他事情,比如"Sushi"减去"Japan"加上"Italy"得到"Pizza",诸如此类。
所以这在某种意义上就是线性表征假说的核心。你可以把它描述为一个纯粹关于向量空间的抽象命题,也可以描述为关于神经元激活值的陈述。但它真正关注的是方向有含义这一性质。在某种意义上,它甚至比这更微妙一些。我认为它主要关注的是能够把东西加在一起这个性质——你可以独立地修改,比如说,性别和皇室地位,或者菜系类型或国家与食物概念,通过把它们加起来。
**Chris Olah:** Yeah, there's a very -- such a simple, clean explanation of what we're talking about, exactly. So there's this very famous result, Word2Vec, by Tomas Mikolov et al., and there's been tons of follow-up work exploring this. So sometimes we create these word embeddings where we map every word to a vector. I mean, that in itself, by the way, is kind of a crazy thing if you haven't thought about it before, right? We're going and representing -- we're turning, you know, like if you just learned about vectors in physics class and I'm like, oh, I'm going to actually turn every word in the dictionary into a vector, that's kind of a crazy idea.
But you could imagine all kinds of ways in which you might map words to vectors. But it seems like when we train neural networks, they like to map words to vectors such that they have linear structure in a particular sense, which is that directions have meaning. So for instance, there will be some direction that seems to sort of correspond to gender, and male words will be far in one direction and female words will be in another direction. And the linear representation hypothesis is -- you could sort of think of it roughly as saying that that's actually kind of the fundamental thing that's going on, that everything is just different directions have meanings and adding different direction vectors together can represent concepts.
And the Mikolov paper sort of took that idea seriously, and one consequence of it is that you can do this game of playing sort of arithmetic with words. So you can do "King" and you can subtract off the word "Man" and add the word "Woman," and so you're sort of trying to switch the gender, and indeed if you do that the result will be close to the word "Queen." And you can do other things like you can do "Sushi" minus "Japan" plus "Italy" and get "Pizza," or different things like this.
So this is in some sense the core of the linear representation hypothesis. You can describe it just as a purely abstract thing about vector spaces. You can describe it as a statement about the activations of neurons. But it's really about this property of directions having meaning. And in some ways it's even a little subtler than that. It's really, I think, mostly about this property of being able to add things together, that you can sort of independently modify, say, gender and royalty, or cuisine type or country and the concept of food, by adding them.
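The arithmetic is easy to see with toy vectors. The four-dimensional embeddings below are hand-built stand-ins rather than real Word2Vec vectors; the point is only that if "gender" and "royalty" are directions, adding and subtracting them moves you between words:

```python
# A toy illustration of "directions have meaning"; these are hand-built vectors,
# not real Word2Vec embeddings.
import numpy as np

female = np.array([1.0, 0.0, 0.0, 0.0])   # a made-up "gender" direction
royal  = np.array([0.0, 1.0, 0.0, 0.0])   # a made-up "royalty" direction
person = np.array([0.0, 0.0, 1.0, 0.0])   # a made-up base "person" concept

vocab = {
    "man":   person,
    "woman": person + female,
    "king":  person + royal,
    "queen": person + royal + female,
}

def nearest(v):
    # cosine similarity against every word in the toy vocabulary
    return max(vocab, key=lambda w: np.dot(v, vocab[w])
               / (np.linalg.norm(v) * np.linalg.norm(vocab[w])))

target = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(target))   # -> queen
```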
**Lex Fridman:** 你认为线性表征假说成立吗?它能经受住考验吗?能随着规模扩大而延续吗?
**Lex Fridman:** Do you think the linear representation hypothesis holds? That it carries? That it scales?
**Chris Olah:** 到目前为止,我看到的一切都与这个假说一致,而且并非必然如此,对吧?你可以写出权重使得神经网络没有线性表征,使得理解它们的正确方式不是线性表征。但我认为我见过的每一个自然训练出的神经网络都具有这个性质。最近有一篇论文——有一些在边缘地带的推进。我想最近有一些研究多维特征(multi-dimensional features)的工作,其中不是单一方向,而更像是一个方向的流形(manifold)。在我看来,这仍然是一种线性表征。另外还有一些论文表明,在非常小的模型中,可能会出现非线性表征。我认为这方面的结论尚不明朗。但我认为到目前为止我们看到的一切都与线性表征假说一致。
这很惊人。并非必然如此,然而我认为有大量证据表明,至少这个假说极其普遍,而且迄今为止的证据都与它一致。有人也许会说,哦,Christopher,如果我们不确定这是否为真,你却以它为真来研究所有神经网络,这不危险吗?好,我认为认真对待假说并尽可能地推进,其实是有美德的。也许某天我们会发现某些与线性表征假说不一致的东西,但科学史上充满了被证明是错的假说和理论,我们通过在它们的框架下工作、将它们作为假设并尽力推进,学到了很多东西。我想这就是 Kuhn 所说的"常规科学"(normal science)的核心。
**Chris Olah:** So far I think everything I have seen is consistent with this hypothesis, and it doesn't have to be that way, right? You can write down neural networks where you write weights such that they don't have linear representations, where the right way to understand them is not in terms of linear representations. But I think every natural neural network I've seen has this property. There's been one paper recently -- there's been some sort of pushing around the edges. So I think there's been some work recently studying multi-dimensional features, where rather than a single direction it's more like a manifold of directions. This to me still seems like a linear representation. And then there's been some other papers suggesting that maybe in very small models you get nonlinear representations. I think the jury's still out on that. But I think everything that we've seen so far has been consistent with the linear representation hypothesis.
And that's wild. It doesn't have to be that way, and yet I think there's a lot of evidence that certainly at least this is very, very widespread, and so far the evidence is consistent with that. And I think, you know, one thing you might say is, well, Christopher, that's a lot to ride on if we don't know for sure this is true and you're sort of investigating all neural networks as though it is true. Isn't that dangerous? Well, I think actually there's a virtue in taking hypotheses seriously and pushing them as far as they can go. It might be that someday we discover something that is inconsistent with the linear representation hypothesis, but science is full of hypotheses and theories that were wrong, and we learned a lot by sort of working under them as an assumption and then going and pushing them as far as we can. I guess this is sort of the heart of what Kuhn would call normal science.
**Lex Fridman:** 不知道你有没有兴趣,我们可以聊很多关于科学哲学的话题——那就会引出范式转移(paradigm shift)。我喜欢这个,认真对待假说,把它推向自然结论。
**Lex Fridman:** I don't know if you want, we can talk a lot about philosophy of science and -- that leads to the paradigm shift. So, yeah, I love it, taking the hypothesis seriously and taking it to its natural conclusion.
**Chris Olah:** 对,就像 scaling 假说一样。
**Chris Olah:** Yeah, same with the scaling hypothesis.
**Lex Fridman:** 完全一样,完全一样。我喜欢。
**Lex Fridman:** Same, exactly, exactly. And I love it.
**Chris Olah:** 我的一位同事 Tom Henighan,他是物理学家出身,给我打了一个很好的比方——热质说(caloric theory)。曾经有一段时间,我们认为热是一种叫做"热质"(caloric)的东西,热的物体让冷的物体变暖,是因为热质在它们之间流动。因为我们已经太习惯于用现代理论来思考热,所以这听起来有点可笑。但其实很难设计出能证伪热质说的实验。你完全可以在相信热质的情况下做出很多真正有用的工作。比如,最早的内燃机就是由相信热质理论的人开发出来的。
**Chris Olah:** One of my colleagues, Tom Henighan, who is a former physicist, made this really nice analogy to me of caloric theory, where once upon a time we thought that heat was actually this thing called caloric, and the reason hot objects would warm up cool objects is like the caloric is flowing through them. And because we're so used to thinking about heat in terms of the modern theory, that seems kind of silly. But it's actually very hard to construct an experiment that disproves the caloric hypothesis. And you can actually do a lot of really useful work believing in caloric. For example, it turns out that the original combustion engines were developed by people who believed in caloric theory.
**Chris Olah:** 就是在热质理论的框架下。所以我认为,认真对待假说——即使它们可能是错的——是有美德的。
**Chris Olah:** In the caloric theory. So I think there's a virtue in taking hypotheses seriously even when they might be wrong.
**Lex Fridman:** 对,这里面有很深的哲学真理。我对太空旅行也有类似的感觉,比如殖民火星。有很多人批评这件事。我觉得,如果你就假设我们必须殖民火星以便为人类文明留条后路,即使这未必是真的,这个假设也会催生出一些有趣的工程突破甚至科学突破,我是这么认为的。
**Lex Fridman:** Yeah, yeah, there's a deep philosophical truth to that. That's kind of how I feel about space travel, like colonizing Mars. There's a lot of people that criticize that. I think if you just assume we have to colonize Mars in order to have a backup for human civilization, even if that's not true, that's going to produce some interesting engineering and even scientific breakthroughs, I think.
**Chris Olah:** 对,而且这确实是另一件我觉得非常有趣的事情。你知道,有一种方式是:让社会中有一批人几乎是不理性地专注于研究某个特定假说,这可能非常有价值,因为要在科学探索中保持斗志、真正深入推进某件事,是需要付出很多的——毕竟大多数科学假说最终被证明是错的,很多科学研究没有结果。但去做这件事……有一个关于 Geoff Hinton 的笑话:Geoff Hinton 在过去 50 年里每年都"发现"了大脑的工作原理。
**Chris Olah:** Yeah, well, and actually this is another thing that I think is really interesting. So, you know, there's a way in which I think it can be really useful for society to have people almost irrationally dedicated to investigating a particular hypothesis, because it takes a lot to sort of maintain scientific morale and really push on something when, you know, most scientific hypotheses end up being wrong. A lot of science doesn't work out. And yet it's very useful to go do -- you know, there's a joke about Geoff Hinton, which is that Geoff Hinton has discovered how the brain works every year for the last 50 years.
**Lex Fridman:** 是啊。
**Lex Fridman:** Yeah.
**Chris Olah:** 但我说这话是带着真正深深的敬意的,因为这确实让他做出了一些非常出色的工作。
**Chris Olah:** But, you know, I say that with really deep respect, because in fact that actually led to him doing some really great work.
**Lex Fridman:** 对,他拿了 Nobel 奖。现在谁还笑得出来?
**Lex Fridman:** Yeah, he won the Nobel Prize. Now who's laughing now?
**Chris Olah:** 正是,正是。我认为,一个人应该能够适时地清醒过来,以恰当的信心水平来看待事物,但我也认为有很大的价值在于:就是那种"我将基本上假设——我将以这个问题可以解决、或者这大致是正确方向为前提,然后在这个框架下工作一段时间,真正努力推进"的态度。如果社会上有很多人在为不同的事情这样做,那其实非常有价值——要么真正排除某些东西(我们可以说,好,那个方向不行,我们知道有人认真试过了),要么得到某些真正教会我们关于这个世界的东西。
**Chris Olah:** Exactly, exactly. Yeah, I think one wants to be able to pop up and sort of recognize the appropriate level of confidence, but I think there's also a lot of value in just being like, you know, I'm going to essentially assume -- I'm going to condition on this problem being possible or this being broadly the right approach, and I'm just going to go and assume that for a while and go and work within that and push really hard on it. And, you know, if society has lots of people doing that for different things, that's actually really useful in terms of either really ruling things out -- we can be like, well, you know, that didn't work, we know that somebody tried hard -- or going and getting to something that does teach us something about the world.
**Lex Fridman:** 所以另一个有趣的假说是叠加假说(superposition hypothesis)。你能描述一下什么是叠加吗?
**Lex Fridman:** So another interesting hypothesis is the superposition hypothesis. Can you describe what superposition is?
**Chris Olah:** 对。我们之前聊到 Word2Vec,聊到也许有一个方向对应于性别,另一个对应于皇室地位,另一个对应于意大利,另一个对应于食物,等等等等。通常这些词嵌入可能是 500 维,1000 维,如果你相信所有这些方向都是正交的,那么你只能有 500 个概念。你知道,我很爱吃披萨,但如果让我列出英语中最重要的 500 个概念,"意大利"能不能排进去——至少不是那么显然。因为你还得有复数和单数,动词、名词和形容词,有很多事情要处理,然后才轮得到意大利、日本,还有,世界上有那么多国家呢。
那么,模型怎么可能同时满足两个条件——线性表示假说(linear representation hypothesis)成立,又能表示比方向数更多的东西?这意味着什么?如果线性表示假说成立,那一定有什么有趣的事情在发生。在我们深入讲这个之前,再告诉你一件有意思的事:之前我们说到那些多义神经元(polysemantic neurons),对吧?当我们研究 Inception V1 的时候,有些神经元很干净,比如汽车检测器、曲线检测器,响应的东西非常一致,但也有很多神经元会对一堆毫不相关的事物做出响应。这本身也是一个有趣的现象。另外,就连那些看起来非常干净的神经元,如果你去看弱激活——也就是激活量只有最大激活值5%左右的情况——你会发现那已经不是它真正在检测的核心东西了。比如你看一个曲线检测器,再去看它激活量为5%的那些位置,你可以把它解释为噪音,也可以理解成它在那里其实在做别的事情。
好,这怎么可能呢?数学上有一个很神奇的东西,叫压缩感知(compressed sensing)。它有一个非常令人惊讶的结论:如果你有一个高维空间,把它投影到一个低维空间,通常你没办法反过来还原出原来的高维向量——你已经把信息丢掉了。就像你没办法对一个长方形矩阵求逆,只能对方阵求逆一样。但事实证明,这个说法并不完全准确。如果我告诉你那个高维向量是稀疏的——也就是大部分都是零——那你其实往往可以以很高的概率把那个高维向量找回来。
这个结论挺令人意外的。它说的是:你可以有一个高维向量空间,只要向量是稀疏的,你就可以把它投影到低维空间,用一个低维的投影来表示它,而且这种方式是有效的。叠加假说(superposition hypothesis)说的正是神经网络里发生的就是这件事。比如,词嵌入(word embeddings)里发生的就是这回事。词嵌入之所以能同时让方向成为有意义的东西,原因在于:它在一个相当高维的空间里运作,而且这些概念本身是稀疏的——你通常不会同时谈论 Japan 和 Italy,大多数句子里这两个词都是零,根本没有出现。如果真是这样,那么你就可以拥有比维度数多得多的"有意义的方向",也就是更多的特征(features)。落实到神经元上,就是你可以拥有比神经元数量多得多的概念。
这就是叠加假说的大致意思。它还有一个更狂野的推论——神经网络不只是表示可能是这样,连神经元之间的计算,那些连接本身,也可能是这样的。所以从某种意义上说,神经网络可能是一个更大、更稀疏的神经网络的"影子",我们看到的只是它的投影。叠加假说最强的版本会认真对待这件事,说:某种意义上真的存在一个"楼上的模型",那里的神经元是真正稀疏的、全部可解释的,它们之间的权重也是非常稀疏的电路,那才是我们真正在研究的东西。而我们观察到的只是它的影子,我们需要找到原始的那个对象。学习的过程,就是在试图构建一个对"楼上模型"的压缩,使得在投影过程中不损失太多信息。
**Chris Olah:** Yeah. So earlier we were talking about Word2Vec, right? And we were talking about how maybe you have one direction that corresponds to gender, and maybe another that corresponds to royalty, and another one that corresponds to Italy, and another one that corresponds to food, and all these things. Well, often times these word embeddings might be 500 dimensions, a thousand dimensions, and so if you believed that all of those directions were orthogonal, then you could only have 500 concepts. And, you know, I love pizza, but like, if I was going to give the 500 most important concepts in the English language, probably Italy wouldn't be -- it's not obvious at least that Italy would be one of them. Because you have to have things like plural and singular, and verb and noun and adjective, and there's a lot of things we have to get to before we get to Italy, and Japan, and, you know, there's a lot of countries in the world.
And so how might it be that models could simultaneously have the linear representation hypothesis be true and also represent more things than they have directions? So what does that mean? Well, if the linear representation hypothesis is true, something interesting has to be going on. Now I'll tell you one more interesting thing before we go and do that, which is, you know, earlier we were talking about all these polysemantic neurons, right? These neurons that, when we're looking at Inception V1, there are these nice neurons like the car detector and the curve detector and so on that respond to very coherent things, but there are lots of neurons that respond to a bunch of unrelated things. That's also an interesting phenomenon. And it turns out as well that even these neurons that are really, really clean, if you look at the weak activations -- so if you look at the activation where it's like activating 5% of the maximum activation -- it's really not the core thing that it's detecting. So if you look at a curve detector, for instance, and you look at the places where it's 5% active, you could interpret it just as noise, or it could be that it's doing something else there.
Okay, so how could that be? Well, there's this amazing thing in mathematics called compressed sensing. And it's actually this very surprising fact where, if you have a high-dimensional space and you project it into a low-dimensional space, ordinarily you can't go and sort of unproject it and get back your high-dimensional vector. You threw information away. This is like, you can't invert a rectangular matrix; you can only invert square matrices. But it turns out that that's actually not quite true. If I tell you that the high-dimensional vector was sparse -- so it's mostly zeros -- then it turns out that you can often find back the high-dimensional vector with very high probability.
So that's a surprising fact. It says that you can have this high-dimensional vector space, and as long as things are sparse, you can project it down, you can have a lower-dimensional projection of it, and that works. So the superposition hypothesis is saying that that's what's going on in neural networks. That's, for instance, what's going on in word embeddings. The word embeddings are able to simultaneously have directions be the meaningful thing, and by exploiting the fact that they're operating on a fairly high-dimensional space and the fact that these concepts are sparse -- like, you usually aren't talking about Japan and Italy at the same time, and in most sentences Japan and Italy are both zero, they're not present at all -- if that's true, then you can have it be the case that you can have many more of these directions that are meaningful, these features, than you have dimensions. And when we're talking about neurons, you can have many more concepts than you have neurons.
So that's the superposition hypothesis at a high level. Now it has this even wilder implication, which is to say that neural networks -- it may not just be the case that the representations are like this, but the computation may also be like this, the connections between all of them. And so in some sense, neural networks may be shadows of much larger, sparser neural networks, and what we see are these projections. The strongest version of the superposition hypothesis would be to take that really seriously and sort of say, there actually is in some sense this upstairs model where the neurons are really sparse and all interpretable, and the weights between them are these really sparse circuits, and that's what we're studying. And the thing that we're observing is the shadow of it, and we need to find the original object. And the process of learning is trying to construct a compression of the upstairs model that doesn't lose too much information in the projection.
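The compressed-sensing claim Chris describes is easy to check numerically: plant a sparse high-dimensional vector, project it down with a random matrix, and recover it by L1-regularized least squares. The sketch below uses ISTA (iterative soft-thresholding) with illustrative parameters I chose for the demo; none of the numbers come from the episode.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 1000, 100, 10                     # ambient dim, projected dim, sparsity
x_true = np.zeros(n)
support = rng.choice(n, k, replace=False)
x_true[support] = rng.normal(size=k)        # the sparse "upstairs" vector

A = rng.normal(size=(m, n)) / np.sqrt(m)    # random projection to low dimension
y = A @ x_true                              # what we actually observe

# ISTA for: minimize 0.5 * ||A x - y||^2 + lam * ||x||_1
lam = 0.01
L = np.linalg.norm(A, 2) ** 2               # Lipschitz constant of the smooth part's gradient
x = np.zeros(n)
for _ in range(5000):
    grad = A.T @ (A @ x - y)
    z = x - grad / L
    x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding step

# Typically a small relative error: 100 measurements recover a 10-sparse vector in 1000 dims.
print("relative recovery error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```

That is the sense in which far more sparse features than dimensions can coexist: the projection throws away almost nothing as long as only a few features are active at once.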
**Lex Fridman:** 对,就是找到一种高效压缩它的方式之类的。
**Lex Fridman:** Yeah, finding how to fit it efficiently or something like this.
**Chris Olah:** 所以这其实是在说,梯度下降(gradient descent)当然可以直接去表示一个稠密的神经网络,但它某种程度上是在愉快地搜索那些能被投影到低维空间的极度稀疏模型的空间。有很多研究者一直在努力研究稀疏神经网络——去设计那种边(edges)稀疏、激活(activations)也稀疏的神经网络。我的感觉是,这方面的工作……感觉很有原则,逻辑上很说得通,但从结果来看,总体上并没有真正取得很好的成果,这是我的大致印象。我觉得一个可能的原因是:神经网络从某种意义上说本来就已经是稀疏的了。梯度下降一直以来——你在努力去做的事,梯度下降其实在背后比你更高效地搜索稀疏模型的空间,学习最高效的稀疏模型,然后弄清楚如何把它折叠压缩,以便在 GPU 上方便地运行——而 GPU 擅长的是稠密矩阵乘法。你根本打不过它。
**Chris Olah:** So this sort of says that gradient descent, you know, could just represent a dense neural network, but it sort of says that gradient descent is pleasantly searching over the space of extremely sparse models that could be projected into this low-dimensional space. And this large body of work of people going and trying to study sparse neural networks, where you go and design neural networks where the edges are sparse and the activations are sparse -- my sense is that work has generally... it feels very principled, it makes so much sense, and yet that work hasn't really panned out that well, is my impression broadly. And I think a potential answer for that is that actually the neural network is already sparse in some sense. Gradient descent was the whole time -- you were trying to go and do this, gradient descent was actually behind the scenes going and searching more efficiently than you could through the space of sparse models and learning whatever sparse model was most efficient and then figuring out how to fold it down nicely to go and run conveniently on your GPU, which does nice dense matrix multiplies. And you just can't beat that.
**Lex Fridman:** 你觉得一个神经网络里能塞进多少个概念?
**Lex Fridman:** How many concepts do you think can be shoved into a neural network?
**Chris Olah:** 这取决于它们有多稀疏。有一个上界大概来自参数(parameters)的数量,因为你还是得有权重把这些概念连接起来。所以那是一个上界。压缩感知和 Johnson-Lindenstrauss 引理(Johnson-Lindenstrauss Lemma)有一些非常漂亮的结论。它们基本上告诉你:如果你有一个向量空间,想让里面存在"几乎正交"的向量——这大概就是你这里想要的,对吧?你是在说,好吧,我不指望我的概念、我的特征能严格正交,但我希望它们相互干扰不要太大,我得要求它们近似正交——那么这个数量实际上是关于神经元数量的指数级别。所以到了某个点,神经元数量甚至不再是瓶颈了。
不过这里有一些很漂亮的结论,而且实际上可能比那更好,因为那是假设任何随机的特征组合都可能同时激活。但实际上特征之间有相关性结构:某些特征更可能同时出现,另一些则不太可能。所以我猜神经网络在"塞东西"这件事上表现会很好,以至于这可能根本不是限制因素。
**Chris Olah:** Depends on how sparse they are. So there's probably an upper bound from the number of parameters, because you still have to have weights that go and connect them together. So that's one upper bound. There are in fact all these lovely results from compressed sensing and the Johnson-Lindenstrauss Lemma and things like this. They basically tell you that if you have a vector space and you want to have almost orthogonal vectors -- which is sort of probably the thing that you want here, right? So you're going to say, well, I'm going to give up on having my concepts, my features, be strictly orthogonal, but I'd like them to not interfere that much. I'm going to have to ask them to be almost orthogonal. Then this would say that it's actually exponential in the number of neurons that you have. So at some point that's not going to even be the limiting factor.
But there are some beautiful results there, and in fact it's probably even better than that in some sense, because that's sort of for saying that any random set of features could be active. But in fact the features have a correlational structure where some features are more likely to co-occur and other ones are less likely to co-occur. And so neural networks, my guess would be, can do very well in terms of packing things in, such that it's probably not the limiting factor.
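The "exponentially many almost-orthogonal directions" point can be seen directly by sampling random unit vectors in a few hundred dimensions and measuring how much they interfere. The sizes below are illustrative choices, not figures from the episode: only 512 strictly orthogonal directions fit in 512 dimensions, yet tens of thousands of random directions coexist with only mild interference.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_vectors = 512, 20000                    # dimension, number of candidate feature directions

# Random unit vectors in d dimensions are nearly orthogonal with high probability.
V = rng.normal(size=(n_vectors, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Estimate worst-case interference on a random subsample of pairs.
idx = rng.choice(n_vectors, size=2000, replace=False)
G = np.abs(V[idx] @ V[idx].T)
np.fill_diagonal(G, 0.0)
print("max |cosine| among sampled pairs:", G.max())   # typically around 0.2: low interference
```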
**Lex Fridman:** 多义性(polysemanticity)的问题是怎么跟这些联系起来的?
**Lex Fridman:** How does the problem of polysemanticity enter the picture here?
**Chris Olah:** 多义性是我们观察到的一种现象:我们看一个神经元,它不只代表一个概念。它不是一个干净的特征,它会对一堆毫不相关的东西做出响应。而叠加假说,你可以把它理解成是对多义性这一现象的一种解释。所以多义性是观察到的现象,叠加是解释这一现象(以及其他一些东西)的假说。
这让机械可解释性(mech interp)更难了,对吧?如果你试图从单个神经元的角度去理解一切,而神经元又是多义的,那你就麻烦大了。最直接的问题是:好,你在看这些神经元,想要理解它们,这个神经元响应好多东西,没有一个清晰的含义——这很糟糕。另一个问题是:最终我们想理解权重(weights),如果你有两个多义神经元,一个响应三种东西,另一个也响应三种东西,它们之间有一个权重,这意味着什么?是说有九种交互在发生吗?这很奇怪。
但还有一个更深层的原因,它跟神经网络在非常高维的空间里运作这一事实有关。我说我们的目标是理解神经网络、理解其中的机制,你可能会问:它就是一个数学函数,为什么不直接看呢?我做过的最早的项目之一研究的是把二维空间映射到二维空间的神经网络,你可以用非常漂亮的方式把它解释成在弯曲流形——为什么我们不能这样做呢?因为随着空间维度升高,那个空间的"体积"在某种意义上是关于输入数量的指数级别,你根本没办法可视化。所以我们需要把它拆开,把那个指数级的空间分解成一堆我们可以独立推理的东西——数量是非指数的,但可以独立推理。独立性至关重要,因为正是独立性让你不用考虑所有那些指数量级的组合。而"单义性"——每个东西只有一个含义——正是让你能够独立思考它们的关键。所以我觉得,如果你想知道为什么我们要追求可解释的、单义的特征,这才是真正深层的原因。
**Chris Olah:** Polysemanticity is this phenomenon we observe where we look at many neurons and the neuron doesn't just represent one concept. It's not a clean feature; it responds to a bunch of unrelated things. And superposition is -- you can think of it as being a hypothesis that explains the observation of polysemanticity. So polysemanticity is this observed phenomenon and superposition is a hypothesis that would explain it, along with some other things.
So that makes mech interp more difficult, right? If you're trying to understand things in terms of individual neurons and you have polysemantic neurons, you're in an awful lot of trouble. I mean, the easiest answer is like, okay, well, you're looking at the neurons, you're trying to understand them, this one responds to a lot of things, it doesn't have a nice meaning -- okay, that's bad. Another thing you could ask is, you know, ultimately we want to understand the weights, and if you have two polysemantic neurons, and each one responds to three things, and then the other neuron responds to three things, and you have a weight between them, what does that mean? Does it mean that there are these nine interactions going on? It's a very weird thing.
But there's also a deeper reason, which is related to the fact that neural networks operate on really high-dimensional spaces. So I said that our goal was to understand neural networks and understand the mechanisms, and one thing you might say is, well, why not -- it's just a mathematical function, why not just look at it? Like, one of the earliest projects I did studied these neural networks that mapped two-dimensional spaces to two-dimensional spaces, and you can sort of interpret them in this beautiful way as bending manifolds. Why can't we do that? Well, as you have a higher-dimensional space, the volume of that space in some sense is exponential in the number of inputs you have, and so you can't just go and visualize it. So we somehow need to break that apart; we need to somehow break that exponential space into a bunch of things that we can reason about independently -- some non-exponential number of things that we can reason about independently. And the independence is crucial, because it's the independence that allows you to not have to think about all the exponential combinations of things. And things being monosemantic -- things only having one meaning -- is the key thing that allows you to think about them independently. And so I think if you want the deepest reason why we want to have interpretable, monosemantic features, I think that's really the deep reason.
**Lex Fridman:** 所以目标——就像你最近的工作所指向的——是如何从一个充满多义特征和各种混乱的神经网络里提取出单义特征(monosemantic features)?
**Lex Fridman:** And so the goal here, as your recent work has been aiming at, is how do we extract the monosemantic features from a neural net that has polysemantic features and all this mess?
**Chris Olah:** 对。我们观察到这些多义神经元,我们假设叠加是其中的原因。如果叠加确实是原因,那其实有一种很成熟的技术是理论上应该做的事,那就是字典学习(dictionary learning)。而且事实证明,如果你做字典学习——特别是用一种高效又能做好正则化的方式,叫稀疏自编码器(sparse autoencoder)——如果你训练一个稀疏自编码器,那些漂亮的、可解释的特征就会自然涌现出来,而在此之前根本看不到它们。这不是你能事先预测到的结果,但它确实非常非常有效。对我来说,这似乎是对线性表示和叠加假说的一个不简单的验证。
**Chris Olah:** Yes. We observe these polysemantic neurons, and we hypothesize that superposition is what's going on. And if superposition is what's going on, there's actually a sort of well-established technique that is the principled thing to do, which is dictionary learning. And it turns out, if you do dictionary learning -- in particular if you do a nice efficient way that in some sense nicely regularizes it as well, called a sparse autoencoder -- if you train a sparse autoencoder, these beautiful interpretable features start to just fall out where there weren't any beforehand. And so that's not the sort of thing that you would necessarily predict, but it turns out that that works very, very well. To me, that seems like some non-trivial validation of linear representations and superposition.
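For readers who want the shape of the technique: a sparse autoencoder here is an overcomplete linear encoder/decoder trained to reconstruct a model's activations under an L1 sparsity penalty, so that each dictionary feature fires rarely. The PyTorch sketch below is a minimal illustration under that assumption; the names and sizes are hypothetical, and real setups (including Anthropic's published ones) add details such as decoder-weight normalization and careful initialization that are omitted here.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: an overcomplete dictionary trained with an L1 sparsity penalty."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        f = torch.relu(self.encoder(acts))   # feature activations: non-negative, hopefully sparse
        recon = self.decoder(f)              # reconstruction of the original activations
        return recon, f

def sae_loss(recon, acts, f, l1_coeff: float = 1e-3):
    # reconstruction error plus an L1 term encouraging each feature to fire rarely
    return ((recon - acts) ** 2).mean() + l1_coeff * f.abs().mean()

# Hypothetical usage on a batch of residual-stream activations `acts` of shape (batch, d_model):
#   sae = SparseAutoencoder(d_model=512, d_dict=8192)
#   opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
#   recon, f = sae(acts); loss = sae_loss(recon, acts, f); loss.backward(); opt.step()
```

The dictionary is much wider than the activation space (here 8192 features over 512 dimensions), which is what lets it "unfold" superposition into more features than neurons.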
**Lex Fridman:** 那做字典学习的时候,你不是在寻找某种特定类别的东西,你也不知道会出现什么。
**Lex Fridman:** So with dictionary learning, you're not looking for particular kinds of categories; you don't know what they are.
**Chris Olah:** 这就回到我们之前说的那个点了,对吧?我们不做假设。梯度下降比我们聪明,所以我们不假设那里有什么。当然你也可以那样做——可以假设有一个 PHP 特征然后去搜索它——但我们不这么做。我们是说我们不知道那里会有什么,我们就让稀疏自编码器去发现那里有什么。
**Chris Olah:** And this gets back to our earlier point, right? We're not making assumptions. Gradient descent is smarter than us, so we're not making assumptions about what's there. I mean, one certainly could do that -- one could assume that there's a PHP feature and go and search for it -- but we're not doing that. We're saying we don't know what's going to be there. Instead we're just going to let the sparse autoencoder discover the things that are there.
**Lex Fridman:** 那你能谈谈去年10月发表的那篇《Towards Monosemanticity》论文吗?它有很多很好的突破性结果。
**Lex Fridman:** So can you talk to the "Towards Monosemanticity" paper from October last year that had a lot of nice breakthrough results?
**Chris Olah:** 你这么描述真是太好了。是的,那是我们第一次真正用稀疏自编码器取得成功。我们用了一个单层模型,对它做字典学习,发现了很多非常好的、可解释的特征。比如阿拉伯语特征、希伯来语特征、Base64特征——这些是我们深入研究的几个例子,我们真的证明了它们就是我们以为的那样。另外,如果你把同一个模型训练两次,得到两个不同的模型,再对它们做字典学习,你会发现两个模型里有对应的特征。这很有趣。你还能发现各种各样的特征。所以那篇论文真的只是在证明这个方法有效。我还应该提一下,Cunningham 等人大约同一时间也有非常类似的结果。
**Chris Olah:** That's very kind of you to describe it that way. Yeah, I mean, this was our first real success using sparse autoencoders. So we took a one-layer model, and it turns out if you do dictionary learning on it, you find all these really nice interpretable features. So, you know, the Arabic feature, the Hebrew feature, the Base64 feature -- those were some examples that we studied in a lot of depth and really showed that they were what we thought they were. It turns out if you train a model twice and train two different models and do dictionary learning, you find analogous features in both of them. So that's fun. You find all kinds of different features. So that was really just showing that this works. And, you know, I should mention that there was Cunningham et al. that had very similar results around the same time.
**Lex Fridman:** 做这种小规模实验然后发现它真的有效,这本身就挺好玩的。
**Lex Fridman:** There's something fun about doing these kinds of small-scale experiments and finding that it's actually working.
**Chris Olah:** 对,而且这里面有太多结构了。回头想想,有段时间我以为这些机械可解释性工作最终的结论会是我能解释清楚为什么这件事很难、没有可行性。就像说:好,叠加是个问题,叠加真的很难处理,我们基本上完蛋了。但结果不是这样。事实是,一种非常自然的技术直接就奏效了。
所以这其实是一个很好的局面。我觉得这是个有难度的研究问题,有很大的研究风险,很可能还是会失败。但我认为,当这个方法开始奏效的时候,相当大一部分研究风险已经被我们甩在身后了。
**Chris Olah:** Yeah, well, and there's so much structure here. Like, maybe stepping back for a while, I thought that maybe all this mechanistic interpretability work -- the end result was going to be that I would have an explanation for why it was sort of very hard and not going to be tractable. We'd be like, well, there's this problem with superposition and it turns out that superposition is really hard, and we're kind of screwed. But that's not what happened. In fact, a very natural technique just works.
And so then that's actually a very good situation. I think this is a hard research problem and it's got a lot of research risk, and it might still very well fail. But I think some very significant amount of research risk was sort of put behind us when that started to work.
**Lex Fridman:** 你能描述一下用这种方法能提取出什么样的特征吗?
**Lex Fridman:** Can you describe what kind of features can be extracted in this way?
**Chris Olah:** 这取决于你研究的模型。模型越大,特征就会越复杂,我们待会儿可能会聊到后续的工作。但在这些单层模型里,我觉得很常见的特征是语言——包括编程语言和自然语言。还有很多特征是特定语境下的特定词。比如"the"——我觉得真正的理解方式是,"the"后面很可能跟着一个名词。所以你可以把它叫"the"特征,但也可以把它理解成"预测特定名词"的特征。会有这样的特征:在法律文档的语境下触发"the",或者在数学文档的语境下触发"the",等等。在数学语境下,你可能是"the"然后预测"vector"、"matrix"这些数学词,在其他语境下则预测别的东西。这很常见。
**Chris Olah:** Well, so it depends on the model that you're studying. The larger the model, the more sophisticated they're going to be, and we'll probably talk about follow-up work in a minute. But in these one-layer models, some very common things I think were languages -- both programming languages and natural languages. There were a lot of features that were specific words in specific contexts. So "the" -- and I think really the way to think about this is that "the" is likely about to be followed by a noun. So it's really -- you could think of this as a "the" feature, but you could also think of this as predicting a specific noun feature. And there would be these features that would fire for "the" in the context of, say, a legal document or a mathematical document or something like this. And so maybe in the context of math, you're like, "the" and then predict "vector," "matrix," all these mathematical words, whereas in other contexts you would predict other things. That was common.
**Lex Fridman:** 基本上还是需要聪明的人类来给我们看到的东西打标签?
**Lex Fridman:** And basically you need clever humans to assign labels to what we're seeing?
**Chris Olah:** 对。这个方法做的只是帮你把东西"展开"。如果所有东西原本叠在一起——叠加把所有东西压叠在一起,你根本看不见——这个方法就是把它展开。但展开了之后,你面对的还是一个非常复杂的东西需要去理解。所以你还得做大量工作来搞清楚这些特征是什么。有些非常微妙。即使在这个单层模型里,有些关于 Unicode 的东西就很酷。当然有些语言是用 Unicode 编码的,分词器(tokenizer)不一定会为每个 Unicode 字符单独设一个 token。所以你会看到这种模式:交替出现的 token,每个代表 Unicode 字符的一半。然后有一个不同的特征在另一半上激活,意思是:好,我刚完成了一个字符,去预测下一个前缀。然后在前缀上,预测一个合理的后缀。就这样来回交替。
所以这些单层模型真的很有趣。还有另一件事——你可能以为 Base64 只有一个特征,但实际上有好几个 Base64 特征,因为你可以把英文文本编码成 Base64,那样的 Base64 token 分布和普通 Base64 非常不同。分词方式也有一些它可以利用的东西。总之,各种有趣的东西。
**Chris Olah:** Yes. So the only thing this is doing is sort of unfolding things for you. If everything was sort of folded over top of itself -- superposition folded everything on top of itself, you can't really see it -- this is unfolding it. But now you still have a very complex thing to try to understand. So then you have to do a bunch of work understanding what these are. And some of them are really subtle. Like, there are some really cool things even in this one-layer model about Unicode, where, of course, some languages are in Unicode and the tokenizer won't necessarily have a dedicated token for every Unicode character. So instead what you'll have is these patterns of alternating tokens that each represent half of a Unicode character. And then you have a different feature that activates on the opposing ones to be like, okay, I just finished a character, go and predict the next prefix. Then on the prefix, predict a reasonable suffix. And you have to alternate back and forth.
So these one-layer models are really interesting. And there's another thing, which is you might think, okay, there would just be one Base64 feature, but it turns out there's actually a bunch of Base64 features, because you can have English text encoded as Base64, and that has a very different distribution of Base64 tokens than regular Base64. And there are some things about tokenization as well that it can exploit. And, I don't know, all kinds of fun stuff.
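The claim that Base64-encoded English looks statistically different from Base64 of arbitrary bytes is easy to see in a few lines. The snippet below is a toy illustration I constructed, not anything from the paper: repeated English text encodes into a small, skewed set of Base64 symbols, while random bytes encode into a near-uniform distribution over the full alphabet.

```python
import base64
import collections
import os

english = b"The quick brown fox jumps over the lazy dog. " * 200
random_bytes = os.urandom(len(english))

def char_hist(data: bytes):
    """Frequency of each character in the Base64 encoding of `data`."""
    enc = base64.b64encode(data).decode()
    counts = collections.Counter(enc)
    total = sum(counts.values())
    return {c: counts[c] / total for c in counts}

h_eng, h_rand = char_hist(english), char_hist(random_bytes)
print("distinct symbols  (english): %d  (random): %d" % (len(h_eng), len(h_rand)))
print("max symbol freq   (english): %.3f  (random): %.3f"
      % (max(h_eng.values()), max(h_rand.values())))
```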
**Lex Fridman:** 给这些东西打标签有多难?这件事能用 AI 自动化吗?
**Lex Fridman:** How difficult is the task of sort of assigning labels to what's going on? Can this be automated by AI?
**Chris Olah:** 我觉得取决于特征,也取决于你对 AI 的信任程度。有很多工作在做自动化可解释性(automated interpretability)。我觉得这是个非常令人兴奋的方向,我们也做了相当多的自动化可解释性,让 Claude 去给我们的特征打标签。
**Chris Olah:** Well, I think it depends on the feature, and it also depends on how much you trust your AI. There's a lot of work doing automated interpretability. I think that's a really exciting direction, and we do a fair amount of automated interpretability and have Claude go and label our features.
**Lex Fridman:** 有没有什么有趣的时刻,它完全说对了或者完全说错了?
**Lex Fridman:** Is there some fun moments where it's totally right or it's totally wrong?
**Chris Olah:** 有,我觉得很常见的情况是它说的东西很笼统——在某种意义上是对的,但没有真正捕捉到正在发生的事情的具体细节。我觉得这是很普遍的情况。我不确定自己有没有特别好笑的例子。
**Chris Olah:** Yeah, well, I think it's very common that it says something very general, which is true in some sense but not really picking up on the specifics of what's going on. So I think that's a pretty common situation. I don't know that I have a particularly amusing one.
**Lex Fridman:** 这很有意思,那种"它说的是真的"但就是差了一点点、没能触及事物深层细微之处的小差距。
**Lex Fridman:** That's interesting, that little gap between "it is true" but it doesn't quite get to the deep nuance of a thing.
**Chris Olah:** 对,这是个普遍挑战。就像,它能说出真实的东西,这已经是了不起的成就了,但它有时候缺少深度。在这个语境里,感觉有点像 ARC 挑战(ARC challenge),那种类似 IQ 测试的题目。搞清楚一个特征代表什么,感觉有点像要解一个小谜题。有些容易,有些很难。所以这确实很棘手。还有另一件事——我不知道,也许在某种程度上这只是我个人的审美偏好,但我来解释一下。我其实对自动化可解释性有一点警惕,部分原因就是我想要的是人类来理解神经网络,如果神经网络替我理解了,我不太喜欢那样。在某种程度上我有点像那些数学家,说:如果是计算机自动证明的,那不算,我不能理解它。但我也确实认为这里存在一种类似《反思对信任的信任》("Reflections on Trusting Trust")的问题——那篇著名的演讲说的是:当你写计算机程序的时候,你必须信任你的编译器,如果你的编译器里有恶意软件,它就可以把恶意软件注入到下一个编译器里,你就麻烦了。好,如果你用神经网络来验证你的神经网络是否安全,你要检验的假设恰恰是这个神经网络可能不安全,你就得担心:它有没有可能正在暗中捉弄你?我觉得这现在还不是大问题,但从长远来看,如果我们必须用非常强大的 AI 系统来审计我们的 AI 系统,那真的是我们能信任的吗?也许我只是在为自己的想法找理由,因为我就是想让我们达到人类能理解一切的那个地步。
**Chris Olah:** Yeah, that's a general challenge. It's like, it's still an incredible accomplishment -- they can say a true thing, but it doesn't quite -- it's not -- it's missing the depth sometimes. And in this context it's like the ARC challenge, you know, the sort of IQ-type tests. It feels like figuring out what a feature represents is a bit of -- is a little puzzle you have to solve. Yeah, and I think that sometimes they're easier and sometimes they're harder as well. So yeah, I think that's tricky. Now there's another thing which -- I don't know, maybe in some ways this is my aesthetic coming in, but I'll try to give you a rationalization. I'm actually a little suspicious of automated interpretability, and I think that's partly just that I want humans to understand neural networks, and if the neural network is understanding it for me, I'm not -- I don't quite like that. But I do have -- in some ways I'm sort of like the mathematicians who are like, you know, if there's a computer-automated proof, it doesn't count. They won't understand it. But I do also think that there is this kind of "Reflections on Trusting Trust" type issue where, you know, there's this famous talk about when you're writing a computer program you have to trust your compiler, and if there was malware in your compiler then it could go and inject malware into the next compiler, and you'd be kind of in trouble, right? Well, if you're using neural networks to verify that your neural networks are safe, the hypothesis that you're testing for is, okay, well, the neural network maybe isn't safe, and you have to worry about, is there some way that it could be screwing with you? So I think that's not a big concern now, but I do wonder in the long run, if we have to use really powerful AI systems to audit our AI systems, is that actually something we can trust? But maybe I'm just rationalizing because I just want us to get to a point where humans understand everything.
**Lex Fridman:** 对,我是说——这挺好笑的——尤其是当我们在谈论 AI 安全,以及寻找跟 AI 安全相关的特征,比如欺骗(deception)之类的。那我们来聊聊2024年5月的《Scaling Monosemanticity》论文。把这件事扩展到 Claude 3 Sonnet 上,需要做什么?
**Lex Fridman:** Yeah, I mean, especially -- that's hilarious -- especially as we talk about AI safety and looking for features that would be relevant to AI safety, like deception and so on. So let's talk about the "Scaling Monosemanticity" paper in May 2024. What did it take to scale this, to apply to Claude 3 Sonnet?
**Chris Olah:** 需要很多 GPU。多很多的 GPU。我的一位团队成员 Tom Henighan,参与过最初的 scaling laws 工作,他从很早就对一件事很感兴趣:可解释性有没有 scaling laws?所以当这项工作开始成功、稀疏自编码器开始奏效的时候,我们很快就对一个问题产生了强烈的兴趣:把稀疏自编码器做得更大,有没有 scaling laws?这跟把基础模型做得更大又是什么关系?结果证明这非常有效,你可以用它来预测——如果你训练一个给定大小的稀疏自编码器,应该用多少 tokens 来训练,等等。这对我们扩展这项工作帮助非常大,让我们更容易训练出真正大型的稀疏自编码器——虽然不像训练大模型那么贵,但训练最大的那些已经开始变得很贵了。
**Chris Olah:** Well, a lot of GPUs. A lot more GPUs. But one of my teammates, Tom Henighan, was involved in the original scaling laws work, and something that he was sort of interested in from very early on is, are there scaling laws for interpretability? And so something he sort of immediately did when this work started to succeed and we started to have sparse autoencoders work -- we became very interested in, what are the scaling laws for making sparse autoencoders larger, and how does that relate to making the base model larger? And so it turns out this works really well, and you can use it to sort of project -- if you train a sparse autoencoder of a given size, you know how many tokens should you train on and so on. So this was actually a very big help to us in scaling up this work and made it a lot easier for us to train really large sparse autoencoders, where it's not like training the big models, but it's starting to get to a point where it's actually expensive to train the really big ones.
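To make the "scaling laws for interpretability" idea concrete: the generic move is to fit a power law to how a quantity like SAE loss falls as you grow the dictionary or the training tokens, then extrapolate to size the next run. The snippet below fits such a curve to synthetic numbers only; the functional form, the constants, and any resemblance to Anthropic's internal fits are assumptions for illustration.

```python
import numpy as np

# Synthetic stand-in data: pretend SAE loss follows L(N) = a * N**(-alpha) + c,
# where N is, say, the number of dictionary features. None of these numbers are measurements.
N = np.array([2.0 ** k for k in range(12, 19)])
c = 0.05                                      # assumed irreducible loss floor
loss = 3.0 * N ** -0.25 + c

# Fit the exponent on a log-log scale after subtracting the assumed floor.
slope, intercept = np.polyfit(np.log(N), np.log(loss - c), 1)
print("fitted exponent alpha ~", -slope)      # recovers 0.25 on this synthetic data

# Extrapolate: predicted loss for a dictionary 8x larger than the biggest run so far.
N_next = N[-1] * 8
print("projected loss at N_next:", np.exp(intercept) * N_next ** slope + c)
```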
**Lex Fridman:** 所以你还得……我是说,你得做所有那些分片(sharding)之类的工程工作,这里也有巨大的工程挑战,对吧?
**Lex Fridman:** So you have to -- I mean, you have to do all the stuff of splitting it across -- there's a huge engineering challenge here too, right?
**Chris Olah:** 对。有一个科学层面的问题,就是怎么有效地扩展;然后还有大量的工程工作来实现这个扩展。你得做分片,很多事情都得仔细考虑。我很幸运能跟一群很棒的工程师一起工作,因为我本人绝对不是一个好工程师。
**Chris Olah:** Yes. So there's a scientific question of how do you scale things effectively, and then there's an enormous amount of engineering to go and scale this up. You have to shard it, you have to think very carefully about a lot of things. I'm lucky to work with a bunch of great engineers, because I am definitely not a great engineer.
**Lex Fridman:** 对,尤其是基础设施方面。所以总结一下——它成功了?
**Lex Fridman:** Yeah, on the infrastructure especially, yeah, for sure. So it turns out -- TL;DR -- it worked?
**Chris Olah:** 成功了。我觉得这很重要,因为你完全可以想象这样一个世界:在《Towards Monosemanticity》之后,有人说:"Chris,这很好,它在单层模型上有效,但单层模型很特殊。也许线性表示假说和叠加假说是理解单层模型的正确方式,但它不是理解大模型的正确方式。"所以我觉得《Scaling Monosemanticity》这篇论文……首先,Cunningham et al. 那篇论文有点打破了这个顾虑,暗示情况并非如此。但《Scaling Monosemanticity》提供了相当有力的证据,表明即使对非常大的模型——我们在 Claude 3 Sonnet 上做了,那当时是我们的一个生产模型——即使这些模型,至少在线性特征这个层面,似乎也能被很好地解释,对它们做字典学习是有效的。你学到的特征越多,你解释的东西就越多。这是一个非常有希望的信号。而且你发现的特征非常迷人,这些特征还是多模态的(multimodal)——同一个概念,它们既响应图像也响应文本,这很有意思。
**Chris Olah:** It worked. Yeah, and I think this is important because you could have imagined a world where, after "Towards Monosemanticity," you'd say, "Chris, this is great, it works on a one-layer model, but one-layer models are really idiosyncratic. Maybe the linear representation hypothesis and superposition hypothesis is the right way to understand a one-layer model, but it's not the right way to understand large models." And so I think the "Scaling Monosemanticity" paper sort of -- I mean, first of all, the Cunningham et al. paper sort of cut through that a little bit and suggested that this wasn't the case. But "Scaling Monosemanticity" was, I think, significant evidence that even for very large models -- and we did it on Claude 3 Sonnet, which at that point was one of our production models -- even these models seem to be substantially explained, at least by linear features, and doing dictionary learning on them works. And as you learn more features, you explain more and more. So that's a quite promising sign. And you find really fascinating abstract features, and the features are also multimodal -- they respond to images and text for the same concept, which is fun.
**Lex Fridman:** 对,你能解释一下吗?比如"后门"(backdoor)——有很多例子。
**Lex Fridman:** Yeah, can you explain that? I mean, like, "backdoor" -- there's just a lot of examples.
**Chris Olah:** 好,我们从一个例子开始。我们发现了一些跟代码里的安全漏洞(security vulnerabilities)和后门(backdoors)相关的特征。这实际上是两个不同的特征。有一个安全漏洞特征,如果你强制激活它,Claude 就会开始往代码里写安全漏洞,比如缓冲区溢出(buffer overflows)。它也会对各种东西触发——它的顶部数据集样本里有一些东西,比如 `--disable SSL` 之类的,这些显然非常不安全。
**Chris Olah:** Yeah, so maybe let's start with one example, which is we found some features around security vulnerabilities and backdoors in code. So it turns out those are actually two different features. There's a security vulnerability feature, and if you force it active, Claude will start to write security vulnerabilities like buffer overflows into code. And it also fires for all kinds of things -- some of the top dataset examples for it were things like, you know, "--disable SSL" or something like this, which are sort of obviously really insecure.
**Lex Fridman:** 所以目前来看,感觉这可能只是因为样本就是这样呈现的——还是比较表面、比较明显的例子,对吧?我猜想法是,未来你可能能够检测更微妙的东西,比如欺骗、bug 之类的。
**Lex Fridman:** So at this point it's kind of like maybe it's just because the examples are presented that way -- it's kind of surface-level, more obvious examples, right? I guess the idea is that down the line you might be able to detect more nuanced things, like deception or bugs or that kind of stuff.
**Chris Olah:** 对,我可能想区分两件事。一件是特征或概念本身的复杂性,另一件是我们看到的样本有多微妙。当我们展示顶部数据集样本的时候,那些是让这个特征激活得最强的样本。这并不意味着它不会对更微妙的东西触发。不安全代码特征——它激活最强的东西确实是那些非常明显的"关闭安全"之类的东西,但它也会对缓冲区溢出和更微妙的代码安全漏洞触发。这些特征都是多模态的,所以你可以问:哪些图片会激活这个特征?结果发现,安全漏洞特征会被这样的图片激活:有人点击 Chrome 里那个"这个网站的 SSL 证书可能有问题"的警告,然后点击继续。
还有一件非常有意思的事,是代码后门特征。如果你激活它,Claude 就会写一个把你的数据发送到某个端口的后门。但你可以问:哪些图片会激活后门特征?结果是:那些带有隐藏摄像头的设备。显然有整整一类产品,就是那些看起来无害但其实装了隐藏摄像头的设备,它们的广告里会展示藏在哪里。而我想,这就是物理世界里的"后门"。这说明这些概念有多抽象。我有点心疼居然有整个市场在卖这种设备,但我确实有点惊喜,这是它给出的图片特征顶部样本。
**Chris Olah:** Yeah, well, I maybe want to distinguish two things. One is the complexity of the feature or the concept, and the other is the nuance of how subtle the examples we're looking at are. When we show the top dataset examples, those are the most extreme examples that cause that feature to activate. And so it doesn't mean that it doesn't fire for more subtle things. The insecure code feature -- the stuff that it fires for most strongly are these really obvious, you know, "disable the security" type things. But it also fires for buffer overflows and more subtle security vulnerabilities in code. These features are all multimodal, so you could ask, "What images activate this feature?" And it turns out that the security vulnerability feature activates for images of people clicking on Chrome to go past the, you know, "this website's SSL certificate might be wrong" warning.
Another thing that's very entertaining is the backdoors-in-code feature. If you activate it, Claude writes a backdoor that will dump your data to a port or something. But you can ask, "What images activate the backdoor feature?" It was devices with hidden cameras in them. So there's apparently a whole genre of people selling devices that look innocuous but have hidden cameras, and they have ads showing where there's a hidden camera in it. And I guess that is the physical version of a backdoor. And so it sort of shows you how abstract these concepts are. I'm sort of sad that there's a whole market of people selling devices like that, but I was kind of delighted that that was the thing it came up with as the top image examples for the feature.
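"Forcing a feature active" in these experiments amounts to adding the feature's decoder direction into the model's activations during a forward pass, often called activation or feature steering. The sketch below is a hypothetical minimal version: `sae`, `feat_idx`, the shapes, and the hook mechanics are assumptions, and real steering experiments involve more care about which layer, what scale, and how to renormalize.

```python
import torch

def steer_with_feature(resid: torch.Tensor, decoder_direction: torch.Tensor,
                       strength: float = 5.0) -> torch.Tensor:
    """Add a unit-norm feature direction to residual-stream activations.

    resid: (batch, seq, d_model) activations at some layer (hypothetical shapes).
    decoder_direction: (d_model,) column of an SAE decoder for the chosen feature.
    """
    direction = decoder_direction / decoder_direction.norm()
    return resid + strength * direction

# Hypothetical usage inside a forward hook, reusing the SAE sketch from earlier:
#   patched = steer_with_feature(resid, sae.decoder.weight[:, feat_idx], strength=8.0)
```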
**Lex Fridman:** 对,很好。它是多模态的,是多语境的——涵盖了一个单一概念最广泛的定义。很棒。对我来说,从 AI 安全的角度看,最有趣的特征之一是欺骗(deception)和说谎,以及这些方法能否检测模型中的谎言——尤其是当模型越来越聪明的时候。可以想象,一个超级智能模型最大的威胁之一,就是它能对操作它的人隐瞒自己的真实意图。所以你从检测模型内部的谎言中学到了什么?
**Lex Fridman:** Yeah, it's nice. It's multimodal, it's multi-context -- it's as broad a definition of a singular concept. It's nice. Yeah, to me one of the really interesting features, especially for AI safety, is deception and lying, and the possibility that these kinds of methods could detect lying in a model, especially as it gets smarter and smarter and smarter. Presumably that's a big threat of a superintelligent model -- that it can deceive the people operating it as to its intentions, or any of that kind of stuff. So what have you learned from detecting lying inside models?
**Chris Olah:** 在某些方面,我们还处于早期阶段。我们确实找到了很多跟欺骗和说谎相关的特征。有一个特征会在人撒谎、欺骗的时候触发,如果你强制激活它,Claude 就开始对你撒谎。所以我们有一个欺骗特征。此外还有各种各样的特征,比如关于隐瞒信息、不回答问题,关于权力寻求(power-seeking)、政变(coups)之类的。有很多特征跟一些令人不安的事情有关,如果你强制激活它们,Claude 的行为方式就不是你想要的那种。
**Chris Olah:** Yeah, so I think we're in some ways in early days for that. We find quite a few features related to deception and lying. There's one feature that fires for people lying and being deceptive, and you force it active and Claude starts lying to you. So we have a deception feature. I mean, there's all kinds of other features about withholding information and not answering questions, features about power-seeking and coups and stuff like that. There are a lot of features that are kind of related to spooky things, and if you force them active, Claude will behave in ways that are -- they're not the kind of behaviors you want.
**Lex Fridman:** 在机械可解释性(mech interp)这个领域,你觉得有哪些令人兴奋的方向值得探索?
**Lex Fridman:** What are possible next exciting directions to you in the space of mech interp?
**Chris Olah:** 有很多。首先,我非常希望能做到这一点——真正搞清楚电路(circuits)。不只是理解特征(features),还要以此为基础去理解模型的计算过程。这对我来说才是终极目标。这方面已经有一些工作——我们发表过几篇东西,Sam Marks 也有一篇论文做了类似的事情,还有一些边缘领域的探索。但我觉得还有很多工作要做,这将会是非常令人兴奋的方向。
这跟一个我们称之为"干扰权重"(interference weights)的难题有关。由于叠加现象(superposition),如果你直接粗暴地去看特征之间是否相连,可能会发现一些权重其实在上层模型里并不存在,只是叠加现象造成的假象。这是与之相关的一个技术挑战。
另一个我觉得令人兴奋的方向是——你可以把稀疏自编码器(sparse autoencoders)想象成一台望远镜。它让我们得以观察到那些存在于外面的特征。随着稀疏自编码器越做越好,字典学习(dictionary learning)越来越精进,我们能看到越来越多的"星星",也能聚焦到越来越暗的"星星"上。但有大量证据表明,我们目前看到的还只是所有"星星"中极小的一部分。在我们的神经网络宇宙里,有大量物质我们还观察不到。也许我们永远都无法拥有足够精细的仪器来观测它,也许其中一些在计算上根本不可行。这有点像早期天文学里的"暗物质"——不一定是现代天文学意义上的暗物质,而是指那时候我们还不知道这些无法解释的物质究竟是什么的状态。所以我经常思考这些"暗物质",想知道我们能否最终观测到它们,以及如果无法观测,对安全性意味着什么——如果神经网络中相当大一部分对我们来说是不可见的。
还有一个我常常思考的问题是:机械可解释性归根结底是一种非常微观的可解释性方法。它试图以极其细粒度的方式去理解事物。但我们真正关心的很多问题其实是宏观层面的——我们关心神经网络行为层面的问题。宏观问题才是我最在乎的,当然你也可以关心很多其他更大尺度的问题。微观方法的好处是,更容易去验证"这是不是真的";但缺点是,它离我们真正关心的东西太远,所以我们现在面对的是一段很长的梯子要爬。我常在想,我们能不能——找到那些更大尺度的抽象,用来理解神经网络?我们能从这种极微观的方法里跳脱出来吗?
**Chris Olah:** Well, there's a lot of things. For one thing, I would really like to get to a point where we have circuits, where we can really understand not just the features but then use that to understand the computation of models. That really, for me, is the ultimate goal of this. And there's been some work -- we put out a few things, there's a paper from Sam Marks that does some stuff like this, there's been some, I'd say, work around the edges here. But I think there's a lot more to do, and I think that will be a very exciting thing.
That's related to a challenge we call interference weights, where, due to superposition, if you just sort of naively look at whether features are connected together, there may be some weights that don't exist in the upstairs model but are just sort of artifacts of superposition. So that's a technical challenge related to that.
I think another exciting direction is just -- you might think of sparse autoencoders as being kind of like a telescope. They allow us to look out and see all these features that are out there. And as we build better and better sparse autoencoders, get better and better at dictionary learning, we see more and more stars, and we zoom in on smaller and smaller stars. But there's a lot of evidence that we're only still seeing a very small fraction of the stars. There's a lot of matter in our neural network universe that we can't observe yet. And it may be that we'll never be able to have fine enough instruments to observe it, and maybe some of it just isn't possible, isn't computationally tractable to observe. There's sort of a kind of dark matter -- not maybe in the sense of modern astronomy, but of earlier astronomy, when we didn't know what this unexplained matter is. And so I think a lot about that dark matter and whether we'll ever observe it, and what that means for safety if we can't observe it -- if some significant fraction of neural networks are not accessible to us.
Another question that I think a lot about is, at the end of the day, mechanistic interpretability -- it's a very microscopic approach to interpretability. It's trying to understand things in a very fine-grained way. But a lot of the questions we care about are very macroscopic. We care about these questions about neural network behavior. And I think that's the thing that I care most about, but there are lots of other sort of larger-scale questions you might care about. And somehow, the nice thing about having a very microscopic approach is it's maybe easier to ask, "Is this true?" But the downside is it's much further from the things we care about, and so we now have this ladder to climb. And I think there's a question of, can we -- will we be able to find -- are there sort of larger-scale abstractions that we can use to understand neural networks? Can we get up from this very microscopic approach?
**Lex Fridman:** 对,你写过关于这个"器官"(organs)问题的文章。
**Lex Fridman:** Yeah, you've written about this, this kind of "organs" question.
**Chris Olah:** 对,没错。
**Chris Olah:** Yeah, exactly.
**Lex Fridman:** "如果我们把可解释性看作一种对神经网络的解剖学,那么大多数电路研究都在研究微小的细小静脉,着眼于小尺度的个体神经元及其连接方式。然而,有很多自然而然的问题是这种小尺度方法无法回答的。相比之下,生物解剖学中最重要的抽象概念涉及的是更大尺度的结构,比如独立的器官——心脏,或者整个器官系统——比如呼吸系统。所以我们不禁要问:人工神经网络有没有'呼吸系统'、有没有'心脏'、有没有'大脑区域'?"
**Lex Fridman:** "If we think of interpretability as a kind of anatomy of neural networks, most of the circuit threads involve studying tiny little veins, looking at the small scale and individual neurons and how they connect. However, there are many natural questions that the small-scale approach doesn't address. In contrast, the most prominent abstractions in biological anatomy involve larger-scale structures like individual organs, like the heart, or entire organ systems, like the respiratory system. And so we wonder, is there a respiratory system or heart or brain region of an artificial neural network?"
**Chris Olah:** 对,正是这样。你想想科学的发展,很多科学领域都在多个抽象层次上研究事物。在生物学里,分子生物学研究蛋白质和分子,细胞生物学研究细胞,组织学研究组织,解剖学研究器官,然后是动物学,再然后是生态学。层次非常多。物理学也一样——研究单个粒子的物理学,然后统计物理学推导出热力学等等。所以通常都会有不同的抽象层次。我认为,如果机械可解释性成功了,目前来看它更像是神经网络的"微生物学",但我们真正想要的是"解剖学"。你可能会问:为什么不能直接跳到那一层去研究?我认为很大程度上是因为,如果不先用正确的方式把微观结构拆解清楚,再研究它们如何连接在一起,就很难看清这种宏观结构。但我有信心,未来会出现远比特征和电路更宏大的东西,会有一个关于更大尺度事物的故事,然后你可以在此基础上深入研究你最关心的部分。
**Chris Olah:** Yeah, exactly. And I mean, if you think about science, a lot of scientific fields investigate things at many levels of abstraction. In biology you have molecular biology studying proteins and molecules and so on, and you have cellular biology, and then you have histology studying tissues, and you have anatomy, and then you have zoology, and then you have ecology. So you have many, many levels of abstraction. Or in physics -- the physics of individual particles, and then statistical physics gives you thermodynamics and things like this. And so you often have different levels of abstraction. And I think right now, mechanistic interpretability, if it succeeds, is sort of like a microbiology of neural networks, but we want something more like anatomy. And a question you might ask is, why can't you just go there directly? And I think the answer is, in significant part, it's that it's actually very hard to see this macroscopic structure without first breaking down the microscopic structure in the right way and then studying how it connects together. But I'm hopeful that there is going to be something much larger than features and circuits, and that we're going to be able to have a story that involves much bigger things, and you can then sort of study in detail the parts you care about.
**Lex Fridman:** 就像神经生物学相对于心理学或精神病学的关系——相当于给你的神经网络找一位心理学家或精神科医生。
**Lex Fridman:** As opposed to neurobiology -- like having a psychologist or psychiatrist for your neural network.
**Chris Olah:** 我觉得最美妙的事情是,如果我们能够——不是让这两件事成为各自独立的领域——而是在它们之间搭起一座桥梁,让所有更高层次的抽象都能牢牢扎根在这个更扎实、理想中更严格的基础之上。
**Chris Olah:** And I think the beautiful thing would be if we could, rather than having disparate fields for those two things, build a bridge between them, such that all of your higher-level abstractions could be grounded very firmly in this very solid and, ideally, more rigorous foundation.
**Lex Fridman:** 你觉得人类大脑——生物神经网络——和人工神经网络之间有什么区别?
**Lex Fridman:** What do you think is the difference between the human brain -- the biological neural network -- and the artificial neural network?
**Chris Olah:** 神经科学家的工作比我们难得多。我有时候会庆幸自己的幸运——我的工作比神经科学家的容易太多了。比如,我们可以记录所有神经元的活动,而且可以用任意数量的数据来做。还有一件事——神经元在你记录的过程中不会发生变化。你可以消融(ablate)神经元,可以编辑连接,然后再撤销这些改动。这真的非常棒。你可以对任意神经元施加干预,强制激活它,然后看看会发生什么。你知道哪些神经元连接到哪里。神经科学家一直渴望得到连接组(connectome);我们已经有了,而且规模远超秀丽隐杆线虫(C. elegans)。不仅如此,我们不只知道连接图谱,还知道神经元之间的连接是兴奋性还是抑制性。不只是一个二值掩码,我们知道权重。我们可以求梯度,从计算角度知道每个神经元在做什么。这个优势列表可以一直列下去——我们比神经科学家多太多优势了。
然而就算有这些优势,这件事还是非常难。所以我有时候会想,天哪,如果对我们来说都这么难,那在神经科学的那些限制条件下,这件事就近乎不可能了。我的团队里有几位神经科学家,也许有些神经科学家——也许他们中的一些人愿意换一个同样很难但更容易一点点的问题,来研究人工神经网络。等我们在理解神经网络这个"容易一点点的小池子"里(虽然还是非常难)搞清楚一些事情之后,再回头去攻克生物神经科学。
**Chris Olah:** Well, the neuroscientists have a much harder job than us. Sometimes I just count my blessings by how much easier my job is than the neuroscientist's. So we can record from all the neurons. Yeah, we can do that on arbitrary amounts of data. The neurons don't change while you're doing that, by the way. You can ablate neurons, you can edit the connections and so on, and then you undo those changes. That's pretty great, yeah. You can force any -- you can intervene on any neuron and force it active and see what happens. You know which neurons are connected to everything, right? Neuroscientists want to get the connectome; we have the connectome. And we have it for much bigger than C. elegans. And then not only do we have the connectome, we know which neurons excite or inhibit each other, right? So it's not just that we know the binary mask; we know the weights. We can take gradients, we know computationally what each neuron does. So the list goes on and on -- we just have so many advantages over neuroscientists.
And then despite having all those advantages, it's really hard. And so one thing I do sometimes think is, gosh, if it's this hard for us, it seems impossible -- or near impossible -- under the constraints of neuroscience. I've got a few neuroscientists on my team, and maybe the neuroscientists -- maybe some of them would like to have an easier problem that's still very hard, and they could come and work on neural networks. And then after we figure out things in the easy little pond of trying to understand neural networks -- which is still very hard -- then we could go back to biological neuroscience.
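To make those advantages concrete, here is a minimal PyTorch sketch of the interventions mentioned above: clamping a single neuron (to zero to ablate it, or to a high value to force it active) and then undoing the change. The toy network and the chosen unit are hypothetical stand-ins; real interpretability work applies this kind of hook to a large model's activations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy stand-in for "the network" -- sizes and layers are hypothetical.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

def clamp_unit(index: int, value: float):
    """Forward hook that pins one hidden unit to a fixed value.

    value=0.0 ablates the unit; a large value forces it active.
    """
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, index] = value
        return patched  # the returned tensor replaces the module's output
    return hook

x = torch.randn(1, 8)
baseline = model(x)

handle = model[1].register_forward_hook(clamp_unit(index=3, value=5.0))
forced = model(x)    # same input, with hidden unit 3 forced active
handle.remove()      # undo the edit -- the weights were never touched

print(baseline)
print(forced)
```

Nothing here is possible in a living brain at this fidelity, which is the contrast Chris is drawing: perfect recording, perfect intervention, and a perfect undo.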
**Lex Fridman:** 我很喜欢你写过的那段话,关于机械可解释性研究的两个目标:安全(safety)与美(beauty)。能聊聊美这一面吗?
**Lex Fridman:** I love what you've written about the goals of mech interp research: safety and beauty. So can you talk about the beauty side of things?
**Chris Olah:** 对。有一件有意思的事——我觉得有些人对神经网络感到失望,他们会说:"哎,不就是这些简单的规则,然后靠大量工程堆叠规模就管用了。那些复杂的想法在哪里?这算什么漂亮的科学成果。"每当有人这么说,我就会想象他们在说:"进化好无聊啊。就是一堆简单的规则,跑很久就得到了生物学。生物学这个结果真是太差劲了。复杂的规则呢?"
但美恰恰在于:简单能生成复杂。生物学有这些简单的规则,然后诞生了我们身边所有的生命和生态系统——自然界所有的美——这一切都只是从进化而来,从极其简单的东西而来。同样地,我认为神经网络在自身内部创造了巨大的复杂性、美和结构,而人们通常不去看、不去理解它,因为太难理解了。但我相信,神经网络内部有极其丰富的结构等待被发现,有许多非常深刻的美,只要我们愿意花时间去看、去理解。
**Chris Olah:** Yeah. So there's this funny thing where I think some people are kind of disappointed by neural networks, where they're like, "Ah, it's just these simple rules, then you just do a bunch of engineering to scale it up and it works really well. Where are the complex ideas? This isn't a very nice, beautiful scientific result." And I sometimes think, when people say that, I picture them being like, "Evolution is so boring. It's just a bunch of simple rules and you run evolution for a long time and you get biology. What a sucky way for biology to have turned out. Where are the complex rules?"
But the beauty is that the simplicity generates complexity. Biology has these simple rules and it gives rise to all the life and ecosystems that we see around us -- all the beauty of nature -- that all just comes from evolution, from something very simple. And similarly, I think that neural networks create enormous complexity and beauty and structure inside themselves that people generally don't look at and don't try to understand, because it's hard to understand. But I think that there is an incredibly rich structure to be discovered inside neural networks, a lot of very deep beauty, if we're just willing to take the time to go and see it and understand it.
**Lex Fridman:** 对,我很喜欢机械可解释性。那种感觉——我们正在理解,或者在窥见理解内部正在发生的魔法——真的很美妙。在我看来,这是一个真正在呼唤被追问的问题。我经常感到惊讶——确实很多人在思考这件事——为什么我们竟然不知道如何创造出能做这些事情的计算机系统?我们已经有了这些惊人的系统:我们不知道如何直接编写计算机程序来完成这些事情,而这些神经网络却能做到。这就让人觉得,这显然是一个呼唤被解答的问题——如果你有任何程度的好奇心的话。人类现在拥有了这些能做到我们自己都不知道怎么做的事情的人工制品,这究竟是怎么回事?
**Lex Fridman:** Yeah, I love mech interp. The feeling like we are understanding, or getting glimpses of understanding, the magic that's going on inside is really wonderful. It feels to me like one of the questions that's just calling out to be asked. And I'm often surprised by how -- I mean, a lot of people think about this, but -- how is it that we don't know how to create computer systems that can do these things? And yet we have these amazing systems. We don't know how to directly create computer programs that can do these things, but these neural networks can do all these amazing things. And it just feels like that is obviously the question that is calling out to be answered, if you have any degree of curiosity. It's like, how is it that humanity now has these artifacts that can do these things that we don't know how to do?
**Chris Olah:** 对,我很喜欢那个意象——电路朝着目标函数(objective function)的光的方向生长。这是我们培育出来的有机体,而我们完全不知道我们究竟培育了什么。
**Chris Olah:** Yeah, I love the image of the circuits growing towards the light of the objective function. It's this organic thing that we've grown, and we have no idea what we've grown.
**Lex Fridman:** 感谢你为安全付出的工作,感谢你欣赏自己发现的那些美,也感谢你今天来聊天,Chris。这次对话真的很精彩。
**Lex Fridman:** Well, thank you for working on safety, and thank you for appreciating the beauty of the things you discover, and thank you for talking today, Chris. This is wonderful.
**Chris Olah:** 谢谢你花时间来聊。
**Chris Olah:** Thank you for taking the time to chat as well.
**Lex Fridman:** 感谢大家收听这次与 Chris Olah 的对话,以及此前与 Dario Amodei 和 Amanda Askell 的对话。如果你想支持这个播客,请查看简介中的赞助商信息。最后,让我用 Alan Watts 的一句话来结束今天的节目:"理解变化的唯一方式,就是纵身跃入其中,随它流动,加入这场舞蹈。"感谢收听,希望下次再见。
**Lex Fridman:** Thanks for listening to this conversation with Chris Olah, and before that with Dario Amodei and Amanda Askell. To support this podcast, please check out our sponsors in the description. And now let me leave you with some words from Alan Watts: "The only way to make sense out of change is to plunge into it, move with it, and join the dance." Thank you for listening, and hope to see you next time.