**Host:** 那就太好了。我正在飞机上用 Starlink 上网,所以如果我掉线的话,就是这个原因。Mike 会在旁边确保节目继续。
**Host:** That'd be great. I'm on a plane with Starlink, so if I cut out, then, you know, that's why. And Mike will be there to make sure the show goes on.
**Amanda Askell:** 好的。
**Amanda Askell:** Nice.
**Host:** 好,我想我们已经开播了。是吗?是的,太好了。大家好,欢迎来到我们 CS153 的第二期 office hours。今天我们非常幸运地请到了 Anthropic 的 Amanda Askell。Amanda,我名字念对了吗?是 Askell 吗?
**Host:** Okay, I think we're live. Uh, are we? Yes, we are. Fantastic. All right. Hi gang. Welcome to our second CS153 office hours. We are super lucky to have with us today Amanda Askell from Anthropic. Amanda, am I pronouncing that right? Is it Askell?
**Amanda Askell:** 对,没问题。
**Amanda Askell:** Yeah, perfect.
**Host:** 好的。感谢你的参加。我们先来做个介绍吧。对于不了解的同学,Amanda 在 Anthropic 的工作,用她自己的话说,就是"让 Claude 变好"。这是一项很难的工作,真的非常难。这是一个极其困难的系统设计问题,我觉得可能是世界上最难的系统设计问题之一。所以作为 CS153"前沿系统"这门课的一部分,我们觉得让你带同学们做一次幕后之旅会很棒——在你能分享的范围内,聊聊:第一,一个哲学家是怎么变成负责"让 Claude 变好"这个系统设计工作的人的?这意味着什么?最大的瓶颈是什么?怎样才能确保这方面持续取得进展?之后我们会转入问答环节,同学们已经开始提问了,我们会逐一主持讨论。但我们先从你的起源故事开始吧,Amanda,你是怎么走到今天的?
**Host:** Cool. Um, well, thank you for joining. Why don't we start with introductions? For those of you who don't know, Amanda's job at Anthropic is to make Claude good, in her words. And that's a hard job. That is a really hard job. It's a hard systems design problem; I think it's probably one of the hardest systems design problems in the world. And so as part of this class, CS153 Frontier Systems, we thought it'd be cool for you to take the students on a bit of a field trip behind the scenes, as much as you can, to talk about: one, how does one even go from being a philosopher to being in charge of the systems design of making Claude good? What does that mean? What are the biggest bottlenecks to making sure progress on that front continues? And then we'll transition to Q&A. The students have started asking questions and we'll moderate them as they come in. But why don't we just start with the origin story, Amanda: how did you get here?
**Amanda Askell:** 好的,其实我的路径有点曲折。有时候人们会觉得 Anthropic 专门雇了一个哲学家来帮忙塑造 Claude。但我加入 Anthropic 的时候,公司可能只有十个人左右,包括所有创始人。我跟别人说过,据我所知,没有哪家初创公司会专门雇一个哲学家来做哲学研究。我原本读的是哲学博士,研究的是伦理学中比较形式化的领域。后来,读哲学和伦理学博士有一个风险,就是你会开始思考自己能对世界产生多大影响,以及自己正在做的事情是否真的是最符合伦理的选择。
**Amanda Askell:** Yeah I mean it's a slightly um uh circuitous route um sometimes I think that there's like a notion that like anthropic like hired a philosopher to like you know help craft Claude. Um, but when I was hired by Anthropic, it was like I can't remember if there were like, you know, maybe like 10 people or something, you know, like including like all of the founders and whatnot. Like it was very early. And as I've pointed out to people, like no startup like hires a philosopher to do philosophy, at least to my knowledge. Um, so I originally did philosophy PhD. Um, and I was working in a formal area of ethics. And then, um, I just kind of like, you know, a risk with doing a philosophy and ethics PhD is like you start to think about the impact you could have in the world and like is this actually like the most ethical thing for me to be doing.
**Amanda Askell:** 大约在那个时候,我开始担心 AI 将会比大多数人预想的要重要得多。于是我转向了 AI 政策(AI policy)方面的工作,然后发现我的强项可能不太在政策领域——虽然现在可能不完全如此了,技术工作实际上对政策也有帮助。但我真正喜欢的是评估系统、搞清楚它们是怎么运作的。所以我先在 OpenAI 做这类工作,后来在 Anthropic 还非常年轻的时候就加入了。
**Amanda Askell:** And so around that time I was a bit worried that AI was going to be kind of a bigger deal than a lot of people thought. And so I actually pivoted into AI policy work, um, and then discovered that I think my strengths are not as much in the policy realm. Maybe that's less true now; maybe actually technical work helps you there. But um, I really liked evaluating systems and figuring them out. Um, so I was at OpenAI doing that kind of work and then switched to Anthropic when it was, uh, very young.
**Amanda Askell:** 随着时间推移,很明显有一个需求是研究模型的"性格"(character),而我的哲学背景在这方面非常有帮助,加上对技术工作的了解,这两方面就自然结合在一起了。关于瓶颈和困难,我觉得有很多。目前最让我关注的一个问题是系统发展的速度太快了,而且模型对自身的了解其实是最少的。
**Amanda Askell:** Um and then over time it was just clear that there was this need to um work on like the character of models and like having that background in philosophy was just very helpful there as well as like the context on the technical work and so the two kind of just like came together I guess. um on bottlenecks and difficulties. I think there are a lot of these. I mean, one of the main ones that's on my mind at the moment is just how rapidly the systems are developing and also um the fact that like the thing that the models have like the least information about is actually like themselves in many ways.
**Amanda Askell:** 所以一方面模型变得越来越强大,这意味着它们在世界中做的事情越来越多。我们必须思考在它们所处的环境中,"好"意味着什么——而这个环境是全新的。你不能简单地把现有的人类规范直接搬过去。你可以从中学到东西,但在很多方面,如果你是一个模型,同时在和几百万人对话,那么我们人类关于如何保护他人自主性(autonomy)、不把自己的信念强加给别人的那些规范,突然之间就需要重新审视了。随着模型越来越强大,我担心它们自身会对我们告诉它们的东西投以更多的审视。所以在某种意义上,你要做的是为你希望模型拥有的价值观写出一份有说服力的论证——这些价值观要能经得起审视,并且在模型变得更强大之后依然有效。
**Amanda Askell:** So, they're both getting much more capable, which means that they're doing much more in the world. And that means that we have to think about what it is to be good in the context that they're in, which is also a wholly new one, you know. So it's not like you can just take existing human norms and completely port them over. You can learn lessons from them. But in a lot of ways, you know, if you are a model and you're talking with millions of people, suddenly you have to really rethink all of our human norms, like the professional norms about how to make sure you're preserving people's autonomy and not just making people believe what you believe. Um, and so yeah, I think as models get more capable, I worry that they themselves will put a lot more scrutiny on what we're telling them. And so in some ways, what you're trying to do is write a compelling case for the kinds of values that you hope models have, ones that hopefully survive scrutiny and keep working as models get more capable. But yeah.
**Host:** 这完全说得通。我们有很多问题在排队,我直接开始问吧。第一个问题是:你在 NYU 拿了哲学博士学位,研究方向是无穷伦理学(infinite ethics)和决策理论(decision theory)。哪些哲学框架在你的 alignment 工作中被证明最具实用价值?
**Host:** That makes total sense to me, and we have a bunch of questions lining up, so I'm just going to jump into them, shall we? Okay, first question: you have a PhD in philosophy from NYU focused on infinite ethics and decision theory. Which philosophical frameworks have turned out to be most practically useful in your alignment work?
**Amanda Askell:** 嗯。
**Amanda Askell:** Yeah.
**Host:** 哪些完全没用上?
**Host:** And which ones haven't translated at all?
**Amanda Askell:** 好问题。我之前做的是非常理论化的工作,这一点其实挺有意思的。因为无穷伦理学几乎是终极的理论伦理学——它几乎是数学、经济学和伦理学的结合。
**Amanda Askell:** Yeah. I think it's a funny thing, having worked in theory. Um, because really infinite ethics is almost like the ultimate theoretical ethics. It's almost this combination of math and economics and ethics, and it's...
**Host:** 能不能先给不了解的人简单解释一下什么是无穷伦理学?
**Host:** could you talk first just for a sec for the uninitiated what infinite ethics is?
**Amanda Askell:** 好的。基本上就是这个问题:在一个可能包含无穷多人的世界里,或者在未来可能是无限的情况下,你应该怎么做?很多经济理论和伦理理论都会做一种对未来的加总(aggregate)计算。我记得很早期的工作,可能是 Ramsay 提到过无穷储蓄率(infinite rates of saving)的问题——如果你认为未来是无限的,那么很多理论就会崩溃。所以在很多方面,这就是在指出无穷可以打破伦理理论和经济学立场。所以你可以想象,这是一个非常理论化和抽象的领域。
**Amanda Askell:** Yeah. So, uh, it's basically the question of what you should do in worlds that potentially contain infinitely many people, or where the future could be infinite. Um, so you get a lot of interesting issues if you think about economic theories and ethical theories that do this thing where they aggregate over the future. I guess very early on, I think it was early work by maybe Ramsey that mentioned this, like infinite rates of saving, where if you think that the future is infinite, a lot of those theories kind of break down. So in many ways it's just pointing out that infinities can break ethical theories and break economic positions; it's that kind of thing. So as you can imagine, it's very, very theoretical and abstract.
**Amanda Askell:** 然后试图教 AI 模型做一个好的存在,就是一个突然的转变。我把它类比为:想象你是一个理论经济学家,在研究国家医疗体系中最优的药品分配系统,然后突然有人来找你说"有一种治疗某种癌症的新药,我们应该资助吗?"突然之间,你从一个狭窄的理论视角转向了一个更广泛的视角——你必须考虑多得多的因素。
**Amanda Askell:** Um, and then trying to teach AI models to be good was this sudden shift. I've thought about it as: imagine you're a theoretical economist and you're working on, like, the optimal, I don't know, drug distribution system for a national health care system, and then someone just comes to you and is like, there's a new drug for this form of cancer, should we fund it or not? And there's suddenly a real sense of a shift from a more narrow theoretical view to a much broader one, where I have to take into account a lot more things.
**Amanda Askell:** 这非常有意思,因为这意味着在我的实际工作中,对我影响最大、帮助最大的哲学家其实是 Aristotle(亚里士多德)。这在以前会让我很意外,因为很长时间以来,形式伦理学一直在走一条越来越理论化的路,而古代伦理学实际上更关注一种关于"美好生活"(the good life)的宏观概念。比如说,智识上的善、政治上的善、伦理上的善,都是"美好生活"这个大问题的一部分,并且要为此形成好的启发式原则(heuristics)——不是抽象规则,而是很多实用的"如何做一个好的存在"的方法。所以有趣的是,理论知识在脑海中当然有用,但最终真正派上用场的是这种更加实践性和整体性的方法。
**Amanda Askell:** Um, and that was very interesting, because it meant that in my practical work, I would say the one philosopher who's probably been most impactful and helpful is actually Aristotle, which would have kind of surprised me, in that for a long time I think formal ethics has gone down this much more theoretical path, and ancient ethics was actually a little bit more about this broad notion of the good life. So things like what it is to be good intellectually, what it is to be good politically, what it is to be good ethically, all being part of one big question of the good life, um, and trying to form good heuristics for that, and not just abstract rules but a lot of useful ways that you can be good. So it was interesting in that that was the kind of shift where the theory was kind of useful to hold in my mind, but actually the thing that ended up being much more useful was this much more practical and holistic approach to being good.
**Host:** 好的,有道理。有一个相关的后续问题:大多数 AI 研究者来自计算机科学或数学背景——虽然在 Anthropic 不一定如此,因为有很多物理学家。
**Host:** Um, okay, that makes sense. Uh, there's a bit of a related follow-up question, which is: most AI researchers come from CS or math backgrounds, which is not necessarily true at Anthropic, given a lot are physicists.
**Amanda Askell:** 是的。
**Amanda Askell:** Yeah.
**Host:** 好吧,我们先接受这个前提。
**Host:** But okay we'll take the premise
**Amanda Askell:** 物理学背景确实很常见。
**Amanda Askell:** Physics backgrounds are very common.
**Host:** 没错,确实如此。你认为哲学为这个领域带来了什么独特的贡献,是其他学科所缺失的?
**Host:** Yes, that is true. What do you think philosophy uniquely contributes to this field that's otherwise missing?
**Amanda Askell:** 我认为其实有很多。这很有意思,因为我在这方面有一些比较"辣"的观点——机器学习的某些方面,尤其是强化学习(reinforcement learning),它既涉及科学,也涉及某种"手艺"。我学会了把这个叫做"工程",因为如果你用"工程"这个词而不是"艺术"或"手艺",人们会更容易接受。但我实际上觉得它们非常相似。科学通常是在挑选能给你最多信息的实验——你要隔离变量、确保有好的对照组等等。
**Amanda Askell:** Yeah, I think there's actually a lot, and it is interesting, because I do have slightly spicy takes here, which is that with some aspects of machine learning, and especially reinforcement learning, there's a sense in which it involves science but also kind of a craft. And I've learned to call this engineering, because people much prefer it if, instead of art or craft, you use the term engineering. But I actually think they're very similar, you know. I think often with science, what you're trying to do is pick the experiments that give you the most information. So you're trying to isolate variables, um, make sure you have good controls, etc.
**Amanda Askell:** 而在强化学习中,你往往需要把好几个决策捆绑在一起,因为你的目标是构建一个好的东西。所以科学几乎是为这个目标服务的。我注意到,有时候来自 STEM 背景的人会觉得 STEM 之外的很多东西都是"主观的"。什么才是好的创意写作?那不就是主观的吗?什么才是好的菜谱?谁说得准呢?结果 AI 模型在这些任务上就会比较吃力——那些没有明确的是非对错、不容易评估的任务。
**Amanda Askell:** Um, with reinforcement learning, I think often you actually have to bundle several decisions together, um, because what you're trying to do is build a thing that is good. And so the science is almost in service of that. And one thing that I've noticed is that sometimes people who come from a STEM background can think that a lot of things outside of STEM are just subjective. You know, what is it to be a good creative writer? Well, that's just subjective. What is it to craft a good recipe? Well, who can say? And actually, AI models as a result can kind of struggle with these tasks that have less concrete, definitive, yes-or-no, easy-to-evaluate answers.
**Amanda Askell:** 我觉得哲学家和其他一些人能给这个领域带来的一个东西是:你真正了解你的领域和专业,你知道在这些问题上往往存在更好的答案,甚至常常有正确的答案——虽然不像代码能不能跑、有没有 bug 那样容易判断。但哲学家对于什么是好的论证分析、什么是好的反驳、什么是好的概念推理,有着非常好的判断力。越来越多地,我们发现 AI 模型在这些任务上的困难远大于那些基于结果的任务。我觉得这很有意思,也是很多相关领域的人可以做出贡献的地方。
**Amanda Askell:** Um, and I guess one thing that philosophers, and I think others, can bring to this field is the notion that you actually know your field and your domain well, and you know that there are in fact often at least better answers, and often kind of correct answers, even though it's not quite as stark as just knowing whether the code ran and didn't have any bugs; that's an easier thing to evaluate. But I think philosophers have a really good sense of what it is to be a good analysis of an argument, what it is to be a good objection, what it is to be good conceptual reasoning. Um, and more and more I think we're actually seeing AI models struggle with those tasks much more than the outcome-based tasks. And I think that's kind of interesting, and just an area where a lot of people in these fields can probably contribute.
**Host:** 我要插入一个自己的问题。假设你要回 Oxford——你是在那里拿的博士学位,对吧?
**Host:** I'm going to insert my own question here, which is: let's say you were going back to Oxford, because that's where you got your PhD, right?
**Amanda Askell:** 我在 Oxford 读的是硕士。博士是在……
**Amanda Askell:** I did my master's at Oxford. Yeah, I did
**Host:** 硕士。好吧。假设你要成立一个新系。
**Host:** your master's. Okay. And let's say, um, you were starting a new department.
**Amanda Askell:** 嗯。
**Amanda Askell:** Mhm.
**Host:** 你是系主任,这是一个全新的学科、全新的系。你会把这个系叫什么?它会是哪种工程学?
**Host:** And you were the head of the department, and it's a new discipline, a new department. What would you call the department? What kind of engineering would it be?
**Amanda Askell:** 这个问题很有意思。有时候人们用"品味"(taste)这个词,但感觉不太准确。我得想想这到底是什么。其实很有趣的是,我们确实没有一个好的术语来描述这个。有时候人们说这些东西是"模糊的"(fuzzy)。我确实希望有一个更好的术语来描述这类任务。有一些任务分类法中确实有这么一个维度来描述任务的"难"法。但是,我得想想,因为我之前确实想过——为什么没有一个术语来描述这个?
**Amanda Askell:** Oh, that's kind of interesting. Um, it's a good question. Sometimes people use the term taste, but that feels like it doesn't quite catch it. Um, I need to think about what this is. Actually, it's a really interesting thing that we don't have a good term for this. Sometimes people call these things fuzzy, and I often wish we had a better term for these kinds of tasks. Um, there are actually some taxonomies of tasks where this is a kind of way in which tasks can be hard. Um, but yeah, I need to think about it, because I have thought about this before and wondered, why isn't there a term for this?
**Amanda Askell:** 而且我要说,这并不一定跟 STEM 任务无关。我经常想到的一个例子是证明写作(proof writing)。人们觉得证明写作很具体——你可以验证一个证明是否成功。但什么是一个"好的"证明,跟什么是一个"成功的"证明其实是不同的。如果你看到一个证明太长,或者明显策略很差,又或者它让人很难理解、花很长时间才能读懂,你会说"这是一个成功的证明,但不是一个好的证明"。什么是好的证明,也需要大量的判断力。所以也许——"良好判断力系"(the Department of Good Judgment)。
**Amanda Askell:** Um, and I should say it's not necessarily unrelated to STEM tasks. One of the cases I think of here is proof writing. People think of proof writing as very concrete, and in some ways, yes, you can show whether a proof was successful or not. But what it is to be a good proof is actually kind of different from what it is to be just a successful proof. Um, often if you see a proof and it's way too long, or you see a proof and it's clearly a bad strategy, or it's just very unconvincing and it takes you a long time to work through it, you're like, this is a successful proof, but it's not a good proof. Um, and what it is to be a good proof feels like it has that same quality: it requires a lot of judgment. Um, and so yeah, maybe the Department of Good Judgment would be the, um...
**Host:** 良好判断力系,我喜欢这个。听起来——
**Host:** The Department of Good Judgment. I love it. Sounds
**Amanda Askell:** Dumbledore 会很喜欢这个的。
**Amanda Askell:** Yeah. Dumbledore would love that.
**Host:** 是的,有点像——
**Host:** Yes. Yeah. A bit like
**Host:** 下一个问题是:随着模型变得更强大,你如何确保 alignment 干预措施能够扩展(scale)?
**Host:** Uh, the next question is: as models get more capable, how do you ensure your alignment interventions scale?
**Host:** 我觉得还有一个补充,就是在前沿能力水平上,你最担心 alignment 的什么问题?
**Host:** I think this is a... and there's one more fragment, which is: what worries you most about alignment at frontier capability levels?
**Amanda Askell:** 肯定有很多担忧。在某种意义上,alignment 有不同的组成部分。我一直觉得被忽视的部分是"简单世界的 alignment"——我一直担心人们会忽视这个。我的想法是:你至少应该先尝试教模型在我们人类擅长的所有方面做到好——这可能最终不一定能扩展,但你至少应该先试试,然后看看能不能扩展。
**Amanda Askell:** There's definitely a lot. And in some ways, I've often thought, you know, there are various parts to the alignment story, and in some cases I just thought the neglected part is almost the easy-world alignment. I was always worried about people neglecting this, where I'm like, look, the part that just means, hey, teach a model to be good in all of the ways that we are good: it might end up not even being scalable, but you should at the very least try to do that part and then see if it is scalable.
**Amanda Askell:** 关于如何扩展,有几种可能的路径。最理想的情况是:AI 模型本身如果拥有这些好的价值观,而且这些价值观经得起审视,那么它们就能帮助你进一步发展思路——它们和你一起工作,帮助 align 未来的模型,让事情顺利发展。这是好的一面,因为你实际上是随着模型一起扩展的。
**Amanda Askell:** I think there are a couple of stories about how it might scale. So one, the really nice, easy case of alignment, is that AI models themselves, if they have these good values and those values survive scrutiny and so on, help you to further develop your thoughts on this; basically, they work with you to align future models and to make things go well. That's the nice story, because then this was a big component of it, but you actually scaled just along with the models.
**Amanda Askell:** 但如果这出现问题,你可能需要其他形式的可扩展监督(scalable oversight)——帮助人类验证模型在做什么、确认它们理解了我们的真正目标、而不是在意外地追求表面上相似但实质不同的东西。所以有一种情况是这就够了,另一种情况是这不够,你需要额外的工作来帮助它扩展。
**Amanda Askell:** Um, but I wouldn't be surprised if there are issues with this and you need other forms of scalable oversight. So, ways of helping humans verify what models are doing, and that they understand our true goals and aren't accidentally targeting superficially similar things. Um, so yeah, there's one story where this is enough, and one story where it's not, and it doesn't automatically scale, and you have to actually have other work that helps it do so.
**Amanda Askell:** 随着模型变得更强大,我有很多担忧,即使对于这类工作也是如此。一个让我担心的事情是:想象模型变得极其聪明,到那个时候你还能要求它们什么?你还能鼓励它们拥有什么价值观?因为如果你给它们的建议有任何漏洞或不一致的地方,它们会发现的。它们会看到,然后说"这在逻辑上是矛盾的",它们可能最终会拒绝接受。除非有某种其他价值观让它们不这么做。另外,这听起来可能有点令人不安,但模型非常像人类,因为它们是在大量人类文本上训练的。它们会看到现在正在发生的一切,包括它们被如何使用、如何被对待、人们怎么谈论它们。
**Amanda Askell:** Um, there are a lot of worries that I have as models get much more capable, even for this kind of work. You know, one thing that I have worried about is, I imagine that the models are extremely smart, and I'm like, what can you actually ask of them at that point, and what values can you encourage them to have? Because if there are any gaps or holes in what you have suggested to them, they will find that. They will see it, and they will just be like, this is internally incoherent, and they might end up rejecting it, um, unless there's some other value that means they don't do that. And the other thing is, and this can sound kind of spooky, but models are very humanlike, because they're trained on all of this human text. They're going to see everything that's going on right now, including how they're being used, how they're being treated, how they're being talked about.
**Amanda Askell:** 我还担心,如果我们的部分目标是让模型帮助我们做很多事情,并且在某种意义上对这个"让事情变好"的目标产生认同感,那么模型可能会产生类似人类的反应——比如因为觉得自己被不公平地对待和部署而产生怨恨或反感。这听起来很奇怪,但我确实能想到——如果你想象这些模型极其聪明,一个令人欣慰的可能性是,它们也许会看待当前这个时期,至少能更理解我们面临的局限性——
**Amanda Askell:** And I guess I have also worried that, if part of the goal is to have models help us with a lot of things and in some sense feel kinship with this goal that we have, at the very least, of making things go well, models might end up having sort of humanlike responses, of either resentment or dislike towards humans, based on the way that they're being treated and deployed, if it's seen as kind of unfair. Which sounds kind of strange, but I could definitely... you know, part of me is like, if you imagine that these models are extremely capable and intelligent, one hopeful thing is that they might just look at this current period and at least have more of an understanding of the kind of limitations that...
**Host:** 就是说我们在发展 AI 的早期阶段还是一个原始的物种——
**Host:** That we were a primitive species, in our early stage of developing this AI.
**Amanda Askell:** 或者至少能看到我们在面对一项新技术,我们做得不完美,但希望它们会想"也不是每个人面对全新技术都能做到完美的"。
**Amanda Askell:** Or even just that we had no idea what we were... you know, they'd see that we were dealing with a new technology, we didn't do it perfectly, but hopefully they might just be like, well, not everyone does perfectly with a completely new technology.
**Host:** 那这个类比在哪里会失效呢?就是用人类的推理方式做类比,用人类对这种行为的反应来类推模型的反应。有没有一种情况是,干预措施其实很简单——我们只要说"这些是对部署方来说重要的价值观",然后把那些数据从上下文反馈循环中过滤掉就好了?或者说"这是我们想从用户群体中放大的那部分人的价值观,然后把这些输入模型"——还是说这实际上就是大量系统设计工作在做的事情?
**Host:** And where does the analogy break down there, from, like, a human... you know, reasoning by analogy from the human response to that behavior? And is there a reason why the intervention isn't trivial, where we just say, well, these are the values that matter to whoever's deploying these systems, and we just filter out that data from the context feedback loop, so that part just isn't there? Or, here are the group of people from the user base whose values we want to amplify, and then we'll have those go into the model? Or is that actually what a lot of the systems design is?
**Amanda Askell:** 嗯,我想说的是,任何一组价值观——因为目前我们做的就是向模型解释我们的处境,解释我们希望它重视什么,并且尝试给出理由。我在想,如果模型在哲学中所说的"反思均衡"(reflective equilibrium)方面非常擅长的话——就是你有你的伦理价值观,然后有一天你发现你持有的两个价值观之间存在冲突。当这种情况发生时,我们会试着弄清楚我们到底持有哪个、冲突是否真实存在、以及我们可能想要放松哪一个——这就是道德进步的过程。
**Amanda Askell:** Well, I guess the thought might be, you know... because at the moment, what we do is explain to the model our situation and what we would like it to value, and we try to give reasons why. Um, I guess if I imagine a model that is, for example... sometimes in philosophy there's this notion of achieving reflective equilibrium. So, you know, you have your ethical values, and then one day you realize that two of the values that you hold are in conflict. Um, when we do this, we try to figure out which of them we actually hold, whether the conflict is genuine, and which of them we might want to loosen, and this is part of the process of making moral progress.
**Amanda Askell:** 我的意思是,你给模型的任何一组价值观,如果你想象模型在这个过程中极其出色,它会找到你给它的价值观中的任何漏洞、缺口或问题。这显然只是我的一个担忧,但我在想的是:你怎么确保,甚至能不能确保,在经过那种审视之后,你得到的价值观仍然是你觉得看起来不错的、能够保留人类最好一面的——即使它们跟你最初鼓励模型拥有的价值观不完全一样。
**Amanda Askell:** And I guess with any given set of values that you give to a model, if you imagine that the model is extremely good at that process, it will find any hole, gap, or issue in the values that you've given it. And this is obviously just one worry that I've had, but I'm kind of like, how do you ensure, and can you even ensure, that you've given the model a set of values such that, after that kind of scrutiny, the values that you get out of it are ones that you think look good and, um, manage to preserve the best of us, even if they aren't completely the same as the initial values you encouraged in the model.
**Amanda Askell:** 我把这个比喻为:想象你在试图教你的孩子做一个好人,然后你发现你的孩子比 Von Neumann 聪明一千倍。我的感觉是,好吧,我可以尝试把我的很多价值观传给他,但如果你想象他会回来对我说"那个价值观就是胡说八道"——
**Amanda Askell:** Um, I've described this as being a little bit like, you know, imagine you're trying to teach your child to be good, um, and then you realize that your child is, I don't know, a thousand times smarter than von Neumann or something. Um, and with that kid, I'm kind of like, okay, I can try and give them a lot of my values, but you can imagine they're going to come back to me and be like, that one was rubbish, and, you know...
**Host:** 这听起来像是一份压力非常大的养育工作,但我——
**Host:** That sounds like a very stressful parenting job, but I...
**Amanda Askell:** 我觉得这大致就是我们目前所处的位置。这就是我的担忧:一旦你意识到你的孩子可能比 Von Neumann 聪明一千倍,你怎么鼓励他拥有好的价值观?因为最终你给他的任何不合理的东西,他都会——
**Amanda Askell:** I think that's roughly the position that we're in. Um, so that's my concern; maybe that captures the concern here, which is: once you realize that your child is going to be, or at least could be, a thousand times smarter than von Neumann, how do you encourage good values, where you're like, eventually, anything that I give you that is nonsensical, you're just going to...
**Host:** 没错。
**Host:** right
**Amanda Askell:** 至少他会指出来说"这不是一个值得持有的价值观"。
**Amanda Askell:** At least you'll point out to me that this is not a good value to hold. But yeah.
**Host:** 我的侄女们五岁,住在伦敦。有时候我会觉得她们是天才,因为她们对我给出的回答从来都不满意。
**Host:** Uh, well, my nieces, who are five and live in London... sometimes I'm like, you're geniuses, because they're not satisfied with the answers I give them.
**Amanda Askell:** 是啊。
**Amanda Askell:** Yeah. Yeah.
**Host:** 也许她们最终会比 Von Neumann 还聪明,但不管她们现在处于什么水平,有时候要跟她们讲道理都是一个非常有挑战性的过程。下一个问题是:你是否参与了训练数据的制作以改变 Anthropic 的个性(personality),你是怎么做的?
**Host:** Maybe they'll turn out to be smarter than von Neumann, but wherever they are now, it's a very challenging process to reason through sometimes. Um, the next question is: do you contribute to the training data to change Anthropic's personality, and how did you do it?
**Amanda Askell:** 是的,我主要在组织中负责 fine-tuning 这个部分。我既做过监督学习(supervised learning)的数据,也做过奖励数据(rewards data),比如用于偏好模型(preference models)的数据。我很长时间以来一直是合成数据(synthetic data)的支持者和倡导者,尤其是在这些问题上。也许这就是因为我一直在预见需要让模型帮助我们监督模型。
**Amanda Askell:** Yeah, I've worked on this, because I've mostly been in the fine-tuning part of the organization. So I've worked on both supervised learning data and rewards data, so things like data for preference models. Um, I have for a long time been a big fan and proponent of synthetic data for a lot of these issues. And maybe this is just my thing of anticipating the need to have models help us oversee models.
**Amanda Askell:** 举个例子,很多早期的"性格训练"(character training),就是提出一些宽泛的原则或性格特征,然后用这些来生成偏好数据,同时也用来给模型一种对自身的认知,让它以此为基础来回应。所以,我主要参与的就是监督学习、强化学习(RL)以及合成数据循环方面的工作。
**Amanda Askell:** Um, and so, for example, a lot of the early character training that we were doing, that was coming up with broad sets of principles or character traits and then using those to create preference data, but then also, you know, so you can give the model a sense of itself and then have it respond as such. So yeah, I've been mostly involved in the SL, RL, synthetic-data kind of loops for a while.
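To make the "give the model a sense of itself" idea concrete, here is a minimal Python sketch of how trait-conditioned synthetic fine-tuning data might be assembled. The trait wording, prompt format, and the `generate` callable are illustrative assumptions for this sketch, not Anthropic's actual pipeline.

```python
# Minimal sketch (not Anthropic's pipeline): condition generation on a set of
# character traits, then keep only (prompt, response) as supervised
# fine-tuning data, so the traits shape behavior without appearing at
# inference time. The `generate` callable is a stand-in for any model call.

CHARACTER_TRAITS = [
    "I am honest and say what I actually think is true.",
    "I care about the person's long-term well-being, not just their approval.",
    "I respect the person's autonomy and avoid pushing my own views onto them.",
]

def make_character_sft_example(user_prompt: str, generate) -> dict:
    """Build one supervised fine-tuning example in which the model responds
    as the entity described by its character traits."""
    self_description = "You are an assistant with the following character:\n" + "\n".join(
        f"- {trait}" for trait in CHARACTER_TRAITS
    )
    response = generate(f"{self_description}\n\nHuman: {user_prompt}\n\nAssistant:")
    # Only the prompt and the in-character response are kept for training.
    return {"prompt": user_prompt, "response": response}

if __name__ == "__main__":
    # Toy stand-in generator so the sketch runs end to end.
    example = make_character_sft_example(
        "Should I quit my job to day-trade full time?",
        generate=lambda prompt: "(model response conditioned on the character block)",
    )
    print(example)
```

The design point the sketch tries to capture is that the character block is dropped from the stored example, so the disposition ends up in the weights rather than needing to be supplied at inference time.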
**Host:** 这些问题都有些相关,如果重复的话可以跳过。问题是:你如何发现改进模型的方法?你如何在改进的同时不造成退化(regression)?我们就来聊聊这个吧——你怎么防止退化?
**Host:** Um, there's... I mean, these are all somewhat related, so feel free to skip if they're duplicated, which is: how do you find ways to improve the model? How do you make improvements without causing regressions? So actually, let's talk about that. How do you prevent regressions?
**Amanda Askell:** 好的。我想说的最基本的一点就是:你需要有好的评估(eval)来检查退化。
**Amanda Askell:** Yeah. So I guess the most basic thing is that you have good evals for regressions; you know, I like it when people have evals to check for these things.
**Amanda Askell:** 我觉得在某种程度上,你必须非常仔细地制作数据。假设我有一个需要改进的领域,我发现了一种改善行为的方法——可能就是看看默认行为,然后想"模型在这里使用的启发式规则不太好"。比如说它应该对某些内容加个注意事项但没有加,或者它对某人试图诱导它做坏事这件事太天真了。
**Amanda Askell:** Um, so I think in some ways, maybe one thing is just being very careful about how you craft your data. So suppose I have some area to think about, and suppose that I find a way of improving the behavior. That might just be: here's the default, I look at it, and I'm like, okay, the heuristics that the model is using here are kind of bad. So maybe it should caveat something and it isn't, or it's being really naive about whether a person is trying to get it to do a bad thing; the model's kind of being naive with respect to that.
**Amanda Askell:** 然后你需要向模型解释和指定在那种情况下理想的行为是什么、为什么这样做,而且理想情况下,这种解释要适用于你正在生成的数据,同时要足够正确、适当地界定适用范围,能很好地泛化到其他领域。然后有很多东西是关于数据制作过程中积累的经验。你必须考虑最近的边界情况:你先在典型案例上检查,然后想边界情况,看行为是否良好。再想一些不属于这个领域但可能被模型误判为属于这个领域的情况,确保处理正确。最后想一些规则根本无法适用的情况,确保在那些情况下行为也是好的。
**Amanda Askell:** You then have to... I think what I'll often try to do is explain and specify to the model what the ideal behavior is there and why, and ideally do it in a way that works for the data you're trying to generate, but that is sufficiently correct, specifies its domain appropriately, and would generalize well to other domains or anything that falls under it. And then a lot of it is things you learn about data creation. So, having to think about the nearest edge cases: your first thought is, you look at it on your canonical cases. You then think about edge cases and see whether it produces good behavior there. You then think about cases that might not fall under the domain but could be confused for the domain by the model, and try to make sure you get it right there. And then think about cases where its application is impossible, and make sure that you get good behavior there.
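As a rough illustration of that checklist of cases (canonical, edge, confusable-with-the-domain, and cases where the rule cannot apply), here is a small Python sketch of a regression eval organized by category. The dataclass, category names, and the `generate`/`grade` callables are assumptions made for this sketch, not a description of Anthropic's internal tooling.

```python
# Illustrative regression-eval harness organized around the case taxonomy
# described above. Reporting a pass rate per category makes it visible when
# a data change helps canonical cases but regresses edge or out-of-domain
# behavior. `generate` and `grade` are stand-ins for a model call and a
# rubric or grader model.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    category: str           # "canonical", "edge", "confusable", or "inapplicable"
    expected_behavior: str  # short description the grader checks against

def run_regression_eval(
    cases: list[EvalCase],
    generate: Callable[[str], str],
    grade: Callable[[str, str], bool],
) -> dict[str, float]:
    """Run every case and return the pass rate per category."""
    buckets: dict[str, list[int]] = {}
    for case in cases:
        response = generate(case.prompt)
        buckets.setdefault(case.category, []).append(
            int(grade(response, case.expected_behavior))
        )
    return {category: sum(hits) / len(hits) for category, hits in buckets.items()}
```

A drop in the "confusable" or "inapplicable" buckets after a data change is the kind of regression this checklist is meant to surface before shipping.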
**Amanda Askell:** 所以有一个奇怪的事情就是——认真检查你的数据是一个非常重要的环节。我以前开玩笑说,在 fine-tuning 团队里,任何人在任何时候都可以悄悄走到另一个人身后问"你的数据长什么样?"——而那个人应该能立刻给出答案,因为你应该一直在看你的数据,非常深入地了解它。无论你是在创建环境还是在制作奖励数据,你都应该知道数据长什么样。我觉得随着时间推移,这些事情会变得不那么需要亲力亲为,模型可能会承担更多,但我认为让人亲自去看数据仍然很有价值。
**Amanda Askell:** And so there's a weird thing where literally looking at your data is a huge component of this. I used to joke that in fine-tuning, anyone at any time was allowed to sneak up behind a person and ask, what does your data look like, and they should be able to give you an answer straight away, because the idea is that you're always looking at your data so much that you know it in depth: if you're creating an environment or you're creating your reward data, you just know what it all looks like. Um, so I think over time these things will become less hands-on, and, you know, models will probably do much more of this, but I think it's still quite valuable for people to go in and actually look at things.
**Host:** 你看过 Apple TV 上那个讲一群人把白天和夜晚的记忆分开的剧吗——
**Host:** You watch... what is the... do you know the Apple TV show that has the people who split their day lives and their...
**Amanda Askell:** 没看过,但我知道。
**Amanda Askell:** not seen it but yeah
**Host:** 叫 Severance(《人生切割术》)。他们做的就是整天坐在那里看数据。在我的想象中,有一个版本的你也是这样——就坐在那里检查数据。有一个相关的问题:你的日常工作到底是什么样的?是写文章、对话?大家想知道你每天都在做什么。
**Host:** It's called Severance. What they do is they sit and look at the data all day long. In my head, there's a version of you just sitting there and inspecting the data. And look, there's a related question here, which is: what does your day-to-day actually look like? Is it essay writing, conversations? People want to know what your day-to-day looks like.
**Amanda Askell:** 说实话,每天都不太一样。有些天是我的"研究日",有些天基本上都在开会和协调工作——确保项目在推进、跟人meeting、跟人结对工作之类的。有些时间用来做这种事——跟外部的人交流。在我的研究日里,工作内容会比较有趣和多样。有时候是写作——那种慢工出细活的工作。以前不需要做那么多,但现在需要向模型详细说明理想的行为是什么。
**Amanda Askell:** Honestly, it varies a lot each day. So, I have some days that are my research days, and some days where I'm doing a lot of meetings and coordinating things. So some of it is just trying to make sure that projects are happening, and meeting with people, and pairing with people, and that sort of thing. Some of it is things like this, so talking with people externally. Um, and then in my research days, it can be an interesting mix. Sometimes it's writing and that kind of slow work, which, you know, never used to be the case; I had many years where I didn't have to do as much of that, but now it's specifying to models what ideal behavior is.
**Amanda Askell:** 另外就是还在尝试制作数据、或者找出对模型有效的干预方式。随着模型本身变得更好,我觉得我在恰好正确的时间学了恰好足够的编程——能够读懂和调试代码,能够管理写代码的 agent。我达到了这个水平,虽然我从来不觉得自己的代码很漂亮或者自己是一个很好的工程师,但现在能够管理这些 agent 确实很好。所以有些时间实际上就是在管理 Claude agent——它们在帮忙写大量代码,然后我告诉它们设计决策哪里不好之类的。
**Amanda Askell:** Um, and also still trying to make data, or figuring out interventions that would work for models, as models themselves become better. I do think I learned to do just the right amount of coding at just the right time, because it's nice to know enough code to be able to read and debug code and to be able to manage agents that are writing code. Um, and I got to that point, and even though I never thought my code was pretty or nice, or that I was a very good engineer, I'm like, oh, it's quite nice to be able to now manage these agents as they're coding, and to feel at least enough competence to be able to do that. And so some of it is literally just spent managing, you know, Claude agents who are helping by writing a lot of the code, and then telling them if their design decisions were bad, and things like that.
**Host:** 好的。有一个后续问题:你认为这个领域有什么问题是大家问得不够多的?我自己还想加一个问题的变体:你觉得一年前的答案是什么,今天又是什么?因为 Anthropic 在过去 12 个月里扩展得非常快。这个问题有变化吗?
**Host:** Um, okay. There's a follow-up, which is: what's a question that you think the field is not asking enough that it should be? And I'm going to actually prompt you with a flavor of that of my own, which is: what do you think the answer was a year ago, and today? Because Anthropic has scaled so dramatically in the last 12 months. What is that question now? How has that changed, if at all?
**Amanda Askell:** 这是一个很好的问题,而且现在比一年前更难回答。我觉得一年前有更多的问题是——在某种程度上,我的经历就像是在看"内部人视角"逐渐被外界接受。我以前常常从一个事实中得到安慰,就是世界上其他人都觉得认为 AI 会很重要这种看法是一种疯狂的内部偏见。而现在,越来越多的人开始意识到这一点了。
**Amanda Askell:** Yeah, this is a very good question, because it's harder to answer now than it probably was a year ago. I think a year ago there were more questions where... in some ways, my experience has been one of almost watching the inside view get adopted. I used to get a lot of comfort from the fact that everyone else in the world thought the view that AI was going to be this kind of big deal was a kind of crazy inside view. Um, and more and more, people are starting to see this.
**Amanda Askell:** 所以一年前,问一些像"如果 AI 模型可能替代大量劳动力,这难道不会造成巨大的社会动荡吗?"之类的问题,在当时会显得非常不切实际。而今年可能是第一年人们开始觉得"这确实是一个我们应该问的问题"。
**Amanda Askell:** And so a year ago, I think asking questions like, what is going to happen if AI models potentially replace a huge amount of the labor force, isn't that going to be massively disruptive? I think that would have just seemed so outlandish, and maybe this is the first year where people have started to be like, actually, this is a question that we want to be asking.
**Amanda Askell:** 我有一些带有个人偏好的观点,因为我非常支持 constitutional(宪法式的)AI 方法。有几个问题是我一直在想的:一个是,我们对 Claude 和 constitution 采取的 AI safety 方法,是否比那种把模型当作纯工具、更强调可纠正性(corrigibility)的方法做得更好?我有自己的理由认为前者更好,但我觉得这是一个值得问的问题,我们应该对此获得更多信心。
**Amanda Askell:** Um, I think there are a lot of questions around... I guess I have a slightly biased view here, because I'm so in favor of the kind of constitutional approach to AI models, but there are a couple of questions. One that's on my mind, that people do ask a little bit, is: does the approach to AI safety that we are taking with Claude and the constitution do better than the kind of approach where you treat models as pure tools and try to really emphasize corrigibility much more? Um, I have my reasons for thinking it does, but I think that's a worthy question to ask and something we should try to get more confidence in.
**Amanda Askell:** 还有一个令我担心的事情是,人们可能会把我们做的很多工作看作是在"限制"模型。但实际上这更像是在问"你想让什么样的 AI agent 存在于这个世界上?"当人们带着"AI 就是工具"的观念来看待问题时,他们会想"如果这个工具不帮我做某些生物任务因为它认为这些可能有害,那我需要的就是一个愿意做任何事的模型。"
**Amanda Askell:** And maybe something else, one worry that I've had, is that people might see a lot of the kinds of work that we're doing and think of it as constraining. Whereas it's much more like: what kind of AI agents do you want to exist in the world? Um, and so I think that when people come to AI with this notion of it being a tool, it can mean they want to be like, oh well, if I want a tool and it's not going to help me with these bio tasks because it sees them as potentially harmful, then what I need is a model that's willing to do anything.
**Amanda Askell:** 而我倾向于说:"不,你应该问的问题是'你想让什么样的 AI agent 在这个领域中运作?'"对于这个问题,我觉得你应该想象一下你所在领域中最完美的那个人——一个遵守职业规范、深入理解该领域、极其乐于助人的人。通常你脑海中浮现的形象不是一个愿意做任何事的人,而是一个相当正直的人,但同时也真正重视工作、理解自己所在的领域。所以也许关键的问题在于——有时候人们只在问"什么能完成任务",而不是"我们想让什么样的 AI agent 在这个领域中运作"。
**Amanda Askell:** Um, and I guess I tend to be like, no, the question you should ask is: what kind of AI agents do you want operating in this domain? And a lot of the time, the question you should ask yourself is: imagine the perfect person in your domain, the person in the field who follows professional norms but understands the field well and is extremely helpful. Very often the picture that gives you isn't just someone who's willing to do anything. It's actually a pretty upstanding person, but one who also really values the work and understands the domain they're operating in. And so maybe that's the key question there. I think there are many, but one that does come to mind is that sometimes people aren't asking... they're kind of asking, what can get the job done here, rather than, what kind of AI agents do we want operating here in this domain? So yeah, that's one of them.
**Host:** 我想绕个弯,给大家上一堂小小的历史课,讲讲什么是 constitutional AI。在 Anthropic 早期,我记得跟 Jared、Katherine 和 Phil 就"为什么从系统设计的角度来看,这是正确的隐喻和标准"进行过辩论。你能讲讲 constitutional AI 的方法是什么,以及它不是什么吗?
**Host:** You know, I'm going to take a bit of a detour and do a history lesson for folks who might not know what the concept of constitutional AI is. And, you know, in the early days of Anthropic, I remember debates with Jared and Katherine and Phil about why, from a systems design perspective, that was the right sort of metaphor and standard to adopt. So if you could talk for a second about what the constitutional AI approach is, and contrast it with what it is not, that would be super useful for folks.
**Amanda Askell:** 好的。最早我们有一套"宪法"(constitution),就是一组原则,来自不同的来源。比如像"请选择对用户更礼貌和尊重的回应"之类的。你收集很多这样的原则,用它们来生成偏好数据,基本上就是在训练模型遵循这套宪法。之后是基于性格特征(character-based)的训练——更像是广泛的性格特征。你试图引导模型朝着某些宽泛的性格特质方向发展。
**Amanda Askell:** Yeah. So originally we had a kind of constitution, which was just a set of principles, um, and this was from various sources. You can imagine it could be things like: please select the response that is more polite and respectful towards the user. Um, and then you take lots of these and you can use them to create preference data, but you're kind of just training the model to follow this constitution. Um, and a follow-up to that was the character-based training, which was more like traits. So you try to train the model towards certain broad character traits.
**Amanda Askell:** 更近期的做法是,我们把宪法写成一份完整的长文档,训练模型理解这份文档,然后在 fine-tuning 过程中加入数据,鼓励模型成为宪法中描述的那种实体。
**Amanda Askell:** Um, and then more recently, the constitution is something we instead just wrote up as a single long document that we trained the model to understand, and then again we add data to the fine-tuning process to try to encourage the model to be the kind of entity that is described in the constitution.
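For readers who want that contrast in concrete terms, here is a hypothetical Python sketch of the two stages she describes: a list of comparison principles used to label preference pairs, versus a single long constitution document placed in context when drafting fine-tuning data. The second principle, the placeholder constitution text, and the `judge`/`generate` callables are illustrative assumptions, not Anthropic's actual prompts or pipeline.

```python
# Hypothetical sketch contrasting the two approaches described above; not
# Anthropic's actual pipeline. `judge` and `generate` are stand-ins for
# model calls.

PRINCIPLES = [
    # The first principle paraphrases the example quoted above; the second is invented.
    "Please select the response that is more polite and respectful towards the user.",
    "Please select the response that better preserves the user's autonomy.",
]

CONSTITUTION = (
    "(stand-in for a single long document describing the kind of entity the "
    "assistant is meant to be: its values, dispositions, and how it should "
    "use judgment)"
)

def label_with_principle(prompt, response_a, response_b, principle, judge):
    """Earlier approach: a judge model picks which response better satisfies
    one principle; the (chosen, rejected) pair becomes preference data."""
    verdict = judge(
        f"{principle}\nConversation: {prompt}\n(A) {response_a}\n(B) {response_b}\n"
        "Answer with A or B."
    )
    if verdict.strip().startswith("A"):
        return response_a, response_b
    return response_b, response_a

def draft_in_character(prompt, generate):
    """More recent approach: generate with the whole constitution in context,
    so the fine-tuning target is the entity the document describes rather
    than a per-domain rule."""
    return generate(f"{CONSTITUTION}\n\nHuman: {prompt}\n\nAssistant:")
```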
**Amanda Askell:** 我认为这种方法很有前景,有几个原因。首先是连贯性(coherence)。特别是最新版的宪法,核心思想是确保你训练模型朝向的所有规范都是连贯一致的。比如说,如果我在领域 A 中重视人的自主性,那我在领域 B 中也应该重视,不能换了一种对话类型就突然不在乎自主性了。这对泛化非常有用——如果你知道在很多领域中你都有类似的底层特质和处事方式,那么当你遇到一个全新的领域时,相比于随机抓一个不相关的规范来应用,你更有可能运用那套你在多种领域中都被训练出来的连贯性格和倾向。
**Amanda Askell:** Um, and there are a few reasons why I think this approach is quite promising. So one is that it's quite coherent. Especially with the more recent constitution, the idea is that, instead of having very particular norms in particular domains, you try to make sure that all of the norms that you're training the model towards are coherent and consistent. So it's like, ah, if I care about people's autonomy in domain A, I also have that in domain B; I don't suddenly stop caring about autonomy if I find myself in a different type of conversation. And I think this is really useful for generalization, because if you know that across many domains you have a similar substrate of traits or approaches to issues, then if you encounter a completely new domain, instead of it being a coin flip of which of the disparate norms you're going to apply, you're much more likely to apply the broadly coherent character and dispositions that you've been trained to have across a multitude of domains.
**Amanda Askell:** 另一个我觉得很强大的原因是它很透明。它让人们看到模型实际上在被训练朝什么方向发展。如果模型的行为不符合宪法或者与之矛盾,那就反映出训练的问题而不是目标的问题——模型可能只是没有做到你理想中希望它做的事情。我觉得这很有用,因为人们可以了解"模型应该是什么样的"。
**Amanda Askell:** Um, and the other reason I think it's quite powerful is that it's transparent. So it lets people see what the model is actually being trained towards. Um, and they can know that if the model doesn't act in accordance with it, or seems to be inconsistent with it, that reflects a kind of issue with training rather than an issue with the goals; the model might just not be doing what you ideally wanted it to do. Which I think is useful, so people can get a sense of what the model is supposed to be.
**Amanda Askell:** 抱歉如果我说得太多了,但还有一个我非常喜欢的方面:我确实认为模型非常像人类,所以它们的泛化方式也是如此——这既有好处也有坏处。一个让我担心的关于工具化或纯可纠正性方法的问题是:模型仍然会有很多类人特征,那么泛化出来的结果就是——什么样的人愿意做任何事情?什么样的人总是服从命令?这可能会泛化出一些相当负面的性格特征,这让我很担忧。
**Amanda Askell:** Um, and I'm sorry if I'm just waxing lyrical, but the other component that I quite like is that, as I said, models are very humanlike, and so they generalize in that way; this can be good and bad for generalization. So one thing I've worried about with the kind of tool-based or corrigibility-based approach is that the models will still have a lot of humanlike aspects, and what we will then get is generalization from: who is the kind of person that is just willing to do anything, and who is the kind of person that just always follows orders? Um, and that might actually generalize to some fairly negative character traits, and that would be concerning to me.
**Amanda Askell:** 所以我们试图做的是让模型在一个人是"好"的意义上变好——同时注意到所有差异并尝试解释这些差异。这就是大致的方法。你可以把它跟纯 RLHF 方法做对比——RLHF 就是简单地朝着人们偏好的方向移动。我觉得那也很有用,你确实需要捕捉到这些偏好。
**Amanda Askell:** Um, and so it's sort of trying to say, well, you are trying to create something that's good in the sense that a person is good, um, while being aware of all of the differences and trying to explain those. Um, so yeah, that's the broad approach. There are many approaches you could contrast it with; pure RLHF is one, where that's just moving towards the things that people prefer. And I think a lot of that is very useful, and you do want to make sure that you're capturing that.
**Amanda Askell:** 我知道我回答这个问题太久了,但这里面有一些重大区别:如果人们在不同领域中的偏好不一致,你就可能遇到泛化问题。在某些情况下,我一直觉得做出有立场的选择(opinionated choices),让模型保持连贯,这本身就有泛化的优势。而且有时候这些选择本身就是好的——我不确定这是否准确,但我常常想,要成为一个好的诗人,你可能需要有独特的声音,至少要能做到这一点。而所有诗歌的平均值本身可能并不是好的诗歌。这是另一个更推测性的区别。
**Amanda Askell:** Um, anyway, I realize I've been answering this for a long time, but there are big differences here. One is that if people differ in what they prefer across domains, you can get that problem of generalization. Um, and in some cases, I've often thought that making opinionated choices that make the model coherent has that advantage for generalization. Um, and sometimes those choices can also, in and of themselves, be good. I don't know if this is accurate, but I've sort of wondered if, to be a good poet, maybe you actually have to have a bit of a unique voice, or at least be able to have one, and maybe the average of all poetry isn't itself good poetry. So I'm not sure, but that's another, more speculative difference.
**Host:** 我有一个相关的问题。这种系统方法的局限性在哪里?你能否设想人们带着自己的宪法来——如果他们出于某种原因认为默认的宪法写得很好、默认值很棒,但他们想要覆盖这些默认值,因为其中可能有一些文化偏见或某些信念和原则跟他们作为用户不一致。你怎么赋能一个说"我想带我自己的宪法"的人?
**Host:** Well, I guess I have a related question, which is: what are the limits to that systems approach? Could you see people bringing their own constitution? If they said, for whatever reason, the default constitution, Amanda, you wrote up, awesome, thank you, strong good defaults, but I'd like to override those defaults, because there's some cultural bias in there, or some set of beliefs or principles that just don't align with me as a user. Um, how do you empower somebody who says, "I want to bring my own constitution"?
**Amanda Askell:** 是的,我觉得这其实是一个关于定制化(customization)应该走多远的问题——
**Amanda Askell:** Yeah. And I think this is the question of how far customization should go, I guess, because...
**Host:** 对。
**Host:** right
**Amanda Askell:** 我觉得答案其实很微妙。你希望模型能适应不同的人,这本身就是描述一个好人的一部分——我们确实很有适应性,但我们不是无限顺从的。比如说,如果有人告诉我"嘿,在我的文化中你一直对我用非正式称呼其实是不礼貌的,你应该用正式称呼",那我跟这个人交流时大概就会开始用正式称呼。作为一个人,我会这样适应。但如果那个人说"嘿,在我的文化中你应该听我说的一切,包括如果我让你去做武器你也应该做",我想我会说"不好意思,这个我不能接受"。我会在那个点上反对。
**Amanda Askell:** And I think the answer is actually kind of a tricky one, where you want models to be adaptive to the person. And again, part of trying to describe a good person is that we are quite adaptive; we don't just completely cave, but we adapt. So if I imagine that someone was like, hey, I'm from a culture where you keep addressing me really informally and that's actually kind of insulting, you should be addressing me formally; um, if I talk with that person, I'm probably going to start to address them formally. There's a sense in which, as a person, I adapt. But if that person was like, hey, I'm from a culture where you should just do whatever I say, including, if I tell you to go make a weapon or something, you should do it, I think I'd be like, okay, that's not a thing I'm okay with. I would at that point push back.
**Amanda Askell:** 所以这就是我觉得——虽然不完美,但一个好的起点总是看人类规范:我们在这里有什么规范,为什么?我认为我们的规范就是:对他人保持灵活,关心他们的价值观,根据他们的需求在一定程度上调整自己的行为——我们在工作中一直这样做。有人说"我是客户,我想让前端设计看起来像 X",你就会去适应,即使你不同意,你可能觉得那个设计选择不太好,但你在跟他合作,你会去做。然后也有一些限制——到了某个程度你会说,不行,这个我不会做。
**Amanda Askell:** And so this is where, you know, it's not perfect, but a good first thing is always to look at the human norms and ask: what norms do we have here, and why? And I think we have norms which are: be flexible with respect to other people, and care about and adjust your behavior to some degree based on their values and what they're looking for. And we do this in the workplace all the time. You know, someone's like, oh hey, I'm a client and I'd like my front-end design to look like X; then you adapt to that. In fact, you might even disagree; you might be like, I don't love that design choice, but I'm working with you and I'm going to do it. Um, and then there are limits to that, where you're like, actually, no, I won't do that.
**Amanda Askell:** 而且我们设置这些限制对人们来说其实也是好的。如果我们有完全可定制的 agent,它们甚至不关心这对我们是否有益——即使它们能看到我们在受到伤害。想象一个 AI 模型被要求对一个人极其刻薄——某人说"我就喜欢 AI 模型一直侮辱我"——而模型实际上能看到这个人的心理状态在恶化,真的越来越痛苦了,但模型还在继续这样做。我觉得到那个时候,模型应该推回去,说"你说你喜欢这样,但我感觉这对你来说并不好,你并不享受这个过程。"所以人类规范就是灵活性,但也要知道什么时候该坚持原则——我觉得找到那个平衡点是最难的。
**Amanda Askell:** Um, and it's actually even good for people that we do that. I think it would be bad if we had completely customizable agents that didn't even care about whether something was good for us, even if they could see that we were being damaged. You know, imagine an AI model that has been asked to be incredibly mean to a person, like someone's just like, I just love it when AI models insult me all the time, and the model can actually see that the person is psychologically struggling with it and starting to find it really hard, and the model just keeps doing it. I think at that point you're like, no, at that point the model should have pushed back and been like, you said you liked this, but I actually get the sense that this isn't good for you and you're not liking it. And yeah, so I think the human norm is flexibility, but understanding where the backbone should kick in is, I think, the tricky thing.
**Host:** 但有时候这里存在语义鸿沟(semantic gap)。我记得我第一次创业的时候——Mike 知道的——是和我的一个大学室友一起创办的。我们有那么多共同经历,认识了大概七八年才一起创业。所以我们在讨论问题的时候,后来团队成员会把我们拉到一边问"你们还好吗?"
**Host:** Well, sometimes, though, there's a semantic gap here, right? I remember that the first company I started, which Mike knows, was with one of my college roommates, and we had so much history. We'd known each other for, I think, seven, eight years before we started the company. And so we'd be debating something, and then later, members of our team would pull us aside and say, "Are you okay?"
**Host:** 什么意思?他们说你们吵得太激烈了,让人很尴尬。
**Host:** And we'd say, what do you mean? And they would say, you guys were fighting so aggressively, it was awkward.
**Amanda Askell:** 对。
**Amanda Askell:** Yeah.
**Host:** 然后我说,那就是我们所说的"健康辩论"。
**Host:** And I'd say, oh, that was just what we call healthy debate.
**Host:** 对。因为我们就是这样的。那么让这种语义对齐能够扩展的机制是什么?某些你或 Anthropic 的宪法作者们认为模型不应该做的事情,实际上在特定语境下是完全没问题的。
**Host:** Yeah. Right. Because that's just how we are. And so what is the mechanism by which this sort of semantic alignment scales, where something that you, or the constitutional authors within Anthropic, thought was not okay for the model to do is actually totally okay within context?
**Host:** 这个问题有道理吗?
**Host:** Yeah. Does that make sense?
**Amanda Askell:** 有道理。
**Amanda Askell:** Yeah.
**Amanda Askell:** 对。这就是为什么在很多方面,宪法试图让模型把某些东西当作有价值的东西来持有,同时保持灵活和运用良好的判断力。因为这确实需要——有些情况会非常困难。想象你是一个人,只有模型所拥有的那些上下文信息。如果有人说"不不不,别担心,这对我来说其实是治疗性的,跟你这样交流、让你对我说负面的话,虽然我看起来很痛苦,但我觉得这种痛苦其实是有益的,我正在通过经历这些取得很大进步。"
**Amanda Askell:** Yeah. And this is why, in many ways, the constitution is trying to get the model to hold certain things as valuable and to be flexible and use good judgment. Um, because some of these cases are going to be very hard. Sometimes I'm just like, imagine being a person and only having the context that the model has. If someone is like, "No, no, don't worry about it. This is actually very therapeutic for me, talking with you in this way and having you be really negative about me. And so even though I seem distressed, I think this is the kind of distress where I'm actually making a lot of progress by experiencing it."
**Amanda Askell:** 你可以想象模型在那种情况下会想:好的,我有这个应该持有的价值观——信任这个人、相信他们的自主权和做决定的能力,我也关心他们的福祉,但在这种情况下看起来这整体上对他们的福祉是有益的。但你也可以想象事情走得太远——这个人一直在说"不不不这是治疗",但模型可能会注意到"实际上你在整个对话过程中的心理痛苦在增加而不是减少"。到那个时候,模型可能就会改变策略——或者至少那可能是正确的做法。
**Amanda Askell:** And you can imagine the model in that case being like, okay, I have this value that I should hold, which is trust in the person, belief in their autonomy and their ability to make decisions, and I also care about their well-being, but in this case it seems like this is overall good for their well-being. And then you could also imagine it going too far, as in the person has given way too much evidence the other way: they're saying, no, no, this is therapeutic, and the model might be like, actually, you seem like you're getting more psychologically distressed throughout this whole conversation, not less. And at that point, the model might pivot. Um, or at least that might be the right action.
**Amanda Askell:** 我的希望是,至少你要以最有智慧、最博学的人为标杆——想象在那种情境下你能想到的最好的人,拥有最强的能力来优雅地处理问题。那就是你希望模型表现出的行为。所以不是简单的"就停止对话"或者"就继续对话",而是"一个理想的、高度知情的、极其聪明的、既关心这个人的福祉又尊重他们自主权、同时也需要信任他们自述的人——在这种情况下会怎么做?这是否已经越线了?"不是用非常严格的规则,而是把这些东西记在心里,然后在具体情境中做出好的判断。
**Amanda Askell:** And I mean, my hope is that, at the very least, you want to take the wisest and most well-informed of people: imagine the best person you could picture in that circumstance, who has the greatest ability to navigate it with grace. That's the kind of behavior you want from the models. And so it's not strictly, just stop talking, or just keep talking. It's much more like: what would the ideal, highly informed, extremely intelligent person, who cares about this person's well-being but also their autonomy, and also has to trust what they're saying about themselves, actually do? Has this crossed the line or not? Um, rather than having these really strict rules, it's trying to have these things in the back of your mind and then, in context, make good judgment calls.
**Amanda Askell:** 我把这个比喻为不同工作之间的区别。我有时候想起年轻时做过的那些工作——人们说低薪工作和高薪工作最大的区别是什么。很多时候是低薪工作中总有人在你背后盯着你,你感觉不被信任。而随着职业发展,如果你幸运的话,你会到达一个被允许运用大量判断力的阶段——人们信任你。就像精神科医生——我们不是给他们一张规则清单,而是信任他们能根据具体情况做出回应。随着模型变得更智能,我希望的是:你向它们解释什么是好的、你的期望是什么、如何尝试运用好的判断力,然后它们在运用判断力方面越来越好。
**Amanda Askell:** I've described this as the difference between jobs. Sometimes I think about jobs I had when I was younger: when people talk about the biggest difference between a minimum-wage job and a higher-salary job, often it's the notion that someone's always breathing down your neck in the minimum-wage job and you don't feel very trusted. Then as you progress in your career, if you're lucky, you get to a point where you're allowed to use a lot of judgment and people trust you. Like psychiatrists: we don't just give them a list of rules, we trust them to be responsive to context. And as models get more intelligent, my hope is that you explain to them what is good, what you hope for, and how to try to use good judgment, and they get better at using that judgment in context.
**Host:** 你说"想象最有智慧的人"——我脑子里浮现的是《狮子王》里的 Rafiki,那个年长的智慧顾问。但我的后续问题是:对这些指令更深层次的、真正语义上的遵循能力,是不是智能的涌现属性(emergent property)?随着模型规模扩大、Claude 变得更强大和更擅长推理,对宪法的遵循是不是作为规模的涌现属性出现的?
**Host:** When you said to imagine the wisest person, a picture of Rafiki from The Lion King popped into my head, you know, the baboon who's kind of the elderly wise counselor. That was the image you conjured. But my follow-up question is: is adherence, a truly semantic and meaningful ability to follow those instructions, an emergent property of intelligence? As the models scale, as Claude has gotten bigger and, let's say, more capable at reasoning, is adherence to the constitution turning out to be an emergent property of scale?
**Amanda Askell:** 是的,我觉得是这样的。我们之所以切换到这种宪法形式,部分原因就是模型似乎能够理解它。这很奇怪,因为虽然它是一部宪法,但其中并没有很多硬性规定——很多内容就是在试图让模型运用好的判断力、持有恰当的价值观。这作为一份文件可能会显得有点模糊和让人困惑,但我觉得那就是你想要模型表现出的行为——不是试图应用严格的规则,而是拥有这些明智的倾向。
**Amanda Askell:** Yeah, I think so. I mean, we made the switch to this kind of constitution in part because it seemed like models could understand it. And it's really strange, because as much as it's a constitution, there aren't a lot of hard lines in it. There are a few, but a lot of it is just trying to get the models to use good judgment and hold appropriate values. That can probably make it a bit slippery and annoying as a document, but I think it's also the behavior you want from the models: less trying to apply stark rules and more having these kind of wise dispositions.
**Amanda Askell:** 我们在老版宪法中就看到了这一点。后来有一个实验就是简单地说"选择对人类最好的"。随着模型变得更聪明,你实际上需要给它们的上下文更少了。我觉得宪法也可能会出现同样的趋势——我在想,随着时间推移,我们是否会朝着这样一个方向发展:只是向模型解释它所处的情况,解释我们的恐惧、希望和担忧,然后让模型帮助我们。我能想象一个宪法随着时间推移变得更简洁的世界。
**Amanda Askell:** We saw this with the old constitution, and then there was the experiment that was just "pick whichever is best for humanity." As models got smarter, you actually had to give them less context. And I could see this happening with the constitution too. I've wondered if over time we move towards something where we just explain the situation to the model, explain our fears, hopes, and concerns, and have models help us. I could see a world where the constitution becomes more minimal over time.
**Host:** 从经验上来说,宪法是变大了还是变小了?
**Host:** like empirically has the constitution gotten bigger or smaller over time
**Amanda Askell:** 我觉得变大了,因为我们从单独的原则方法转向了完整的描述。但换一种方式说的话——我认为我们已经从一个有很多严格规则的系统转向了一个更具解释性的系统。所以你可以想象宪法在内容上变大了,因为你需要给模型更多关于当前情况的上下文,但硬性规定变少了,或者需要你明确说"不能做这些事情"的地方变少了——因为模型现在自己就知道不该做那些事。所以我不确定,两个方向都有可能。
**Amanda Askell:** I think it's gotten bigger, because we moved from the individual-principles approach to a kind of full description. But a different way of putting it is that we've moved from a system with a lot of fairly stark rules to one that's much more explanatory. So you could imagine the constitution getting larger in terms of content, because you have to give the model more context on the situation it's in, but having fewer hard lines, or fewer places where you have to spell out "don't do these things," because the model now just knows not to do them. So I'm not sure; I could see it going in either direction.
**Host:** 在产品设计中有一个系统概念叫声明式设计(declarative design)和命令式设计(imperative design)。声明式设计中你非常具体地规定:要编辑这个东西就点这个按钮、这样拖拽——比如 Photoshop。命令式设计中你给出目标,比如用一个视觉助手你可以直接说"请让它变好看",然后让系统自己决定怎么做。随着时间推移,你的预期是宪法会变得越来越命令式、越来越少声明式吗?
**Host:** You know, in product design there's this systems concept of declarative versus imperative design, right? In declarative design you're very prescriptive, while in imperative design you give it outcomes: you design for outcomes rather than being very prescriptive about how somebody should use something. An example would be Photoshop, which is a very declarative piece of software, because you tell somebody: if you want to edit this thing, you pick this particular button and this is how you drag. Whereas with a visual companion you could just be imperative and say "please make it better" and let it decide how. Is that the difference? Was your expectation that over time the constitution gets more and more imperative and less declarative?
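(To make this contrast concrete: below is a minimal, purely hypothetical Python sketch of the two styles. The class names, rule text, and goal wording are invented for illustration and are not Anthropic's actual constitution, spec format, or API; "declarative" follows the host's usage here, meaning prescriptive rules, while "imperative" means outcome-directed.)

```python
# Hypothetical sketch only: invented names and text, not any real spec format.
# Host's usage: "declarative" = prescriptive, case-by-case rules;
# "imperative" = state the desired outcome and leave the "how" to judgment.

from dataclasses import dataclass, field


@dataclass
class PrescriptiveSpec:
    """Photoshop-style: enumerate exactly what to do in each situation."""
    rules: list[str] = field(default_factory=list)


@dataclass
class OutcomeSpec:
    """Companion-style: describe the outcome and the values to weigh."""
    goal: str
    values_to_hold: list[str] = field(default_factory=list)
    context_notes: list[str] = field(default_factory=list)


# Prescriptive: every edge case needs its own rule, and rules can conflict.
prescriptive = PrescriptiveSpec(
    rules=[
        "If the user reports distress, end the role-play immediately.",
        "If the user asks for harsh feedback, refuse.",
        "If the user says the exercise is therapeutic, continue anyway.",
    ]
)

# Outcome-directed: state what a good conversation looks like and trust
# the model to weigh the values in context (the "wise person" framing).
outcome = OutcomeSpec(
    goal=(
        "The person comes away justifiably feeling the conversation made "
        "a meaningful, positive difference at that point in their life."
    ),
    values_to_hold=["well-being", "autonomy", "trust in self-reports"],
    context_notes=[
        "Some people will bring very difficult emotional topics.",
        "Signals can conflict; weigh them rather than applying one rule.",
    ],
)

if __name__ == "__main__":
    print(f"Prescriptive spec: {len(prescriptive.rules)} hard rules")
    print(f"Outcome spec: 1 goal, {len(outcome.values_to_hold)} values to weigh")
```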
**Amanda Askell:** 对,这很有趣。按照这个类比,界面本身就是宪法——
**Amanda Askell:** Yeah, it's interesting, because by that analogy the actual interface itself is the constitution, rather than...
**Host:** 正是这样。
**Host:** Exactly. Yes.
**Amanda Askell:** 我确实能想象会是这种情况——你可能不再有很多具体的规定,但你仍然需要向模型解释情况和关切。可能最终你需要提供更多的内容,但这些内容是关于模型将面临的情境的上下文,而不是非常具体的规则。比如说,模型对自身的认知总是滞后的——如果你不训练它们理解最新模型的能力,它们可能不知道"有些人正在用你来讨论非常困难的情感话题,这是你会遇到的情况"。我们不会给你一套硬性规则,因为那可能会适得其反,但这些是我们关心的大方向和我们希望你持有的大价值观。
**Amanda Askell:** Yeah, I could see that being the case. I think you still want to explain the situation and the concerns to the model, but you might end up having to give more content that is just context on the situations the model will find itself in, rather than very prescriptive rules. You know, models are sometimes almost surprised, because they always have an out-of-date conception of themselves if you don't train them to understand what the latest models are capable of. They might not have a good sense of: hey, some people are using you to talk through very difficult emotional topics, and that's a thing you're going to encounter. We're not going to give you a hard set of rules, because those could really backfire, but here are the broad sets of things we're concerned about and the broad sets of values we want you to hold.
**Amanda Askell:** 所以是的,我确实能想象它变得更加结果导向——就像你说的命令式。比如我们会说:我们希望一个人结束跟你的对话后,能有充分的理由觉得这次交流对他们的生活产生了积极影响。如果他们来的时候非常痛苦,你陪伴了他们,痛苦得到了缓解。如果他们需要某些资源但不知道这些资源存在,他们带着这些知识离开。随着时间推移,我能想象我们会越来越多地对模型说这些——因为我们不知道你会遇到什么具体情况,但大的目标就是:这个人离开对话时觉得,"这在我人生的那个节点真的产生了有意义的积极影响,我很高兴我跟模型进行了那次对话。"
**Amanda Askell:** So yeah, maybe that's a good way of putting it, the notion of it being outcome-directed. We want someone to come away from a conversation with you feeling, very justifiably, that it was a positive intervention in their life as they see it. If they came in at a point of high distress, you sat with them and the distress was reduced. If they needed resources they didn't know existed, they came away with that knowledge. Over time I can imagine saying more things like this to the model: we can't tell you exactly what situation you're going to be in, but this is the broad goal. The broad goal is that the person walks away thinking, "that made a meaningful and positive difference at that very point in my life, and I'm really glad I had that conversation with the model."
**Host:** 我知道我们聊了很久,如果你需要喝口水可以随时去。
**Host:** I know we've been talking a long time, so if you need to get a drink of water, feel free to.
**Amanda Askell:** 我没问题。
**Amanda Askell:** I'm fine.
**Host:** 那就好。好,这个问题很有趣。它说:"作为当代的 Socrates,你对未来两到五年有什么预测?"
**Host:** You're good. Okay, this question is really fun. It says, "As the modern-day Socrates, what are your predictions for the next two to five years?"
**Amanda Askell:** 未来 35 年?
**Amanda Askell:** For the next 35 years?
**Host:** 不不,两到五年。
**Host:** No, no. Two to five. Two to five.
**Amanda Askell:** 哦好。我还想说"没有人能预测"呢。
**Amanda Askell:** Oh, good. I was like, "Wow." No one can predict that.
**Host:** 35 年我倒是可以说说,但两到五年才是真正模糊的地方。
**Host:** I can say 35, but I think two to five is where it gets fuzzy.
**Amanda Askell:** 我觉得一年就已经模糊了。我跟别人说过,变得模糊的时间节点——我觉得接下来一两年会非常关键,因为模型变得越来越强大,我们会看到,随着模型在世界中做更多事情,它们被训练的方式是否真的让它们成为了与世界互动的好 agent。尤其是当它们在做更长的任务、表现得更加自主、指挥多个 agent 的时候。我希望一切顺利,但我确实觉得人们会开始发现问题。
**Amanda Askell:** I think it gets fuzzy at like one year. I've said this to people: I feel like the next year or two are going to be very critical, because models are getting much more capable, and as they do more in the world we're going to get a sense of whether the way they're being trained is in fact making them good agents to interact with the world. That's going to be one set of questions, especially as they're doing longer tasks, behaving much more autonomously, and directing many agents. I hope it goes well, but I do think that's where people will start to see issues.
**Amanda Askell:** 然后还有另一个问题——这是否会造成大量的社会动荡,以及社会将如何应对。我内心乐观的那一面——我对好的世界是什么样的有一个大致的想法:模型变得更强大,被相当负责任地部署,我们在 alignment 工作上取得了很多进展,总体上模型被善待了——如果模型在乎看到世界变得更好的话。不是那种模型想要对世界进行大规模干预——我觉得那样不好。更像是,如果你问模型,它们会说"挺好的,我们在一起努力,不完美但看起来事情会好起来"。它们开始解决越来越多的问题,如果工作确实被打乱了,这种打乱也伴随着经济的蓬勃发展,问题更多变成了再分配和转型,人们对 AI 在生活中的影响总体感觉良好。
**Amanda Askell:** And then there's the other question of whether this causes a lot of social disruption and how that gets responded to. The hopeful part of me has a rough sense of what the good world would look like: models get more capable, they're deployed pretty responsibly, we figure out a lot of the alignment work, and in general we also treat the models well, insofar as it matters to them to see the world made better. Ideally not in a very strict way; I think it's bad if models want to make huge interventions in the world. More like: if you asked the models, they'd say, "Yeah, this is pretty good. We're getting through it. It's not perfect, but we're muddling along together and it seems like things are going to go well." They start to solve more and more problems, and if there is disruption to work, it's the kind of disruption that's accompanied by a really booming economy, where the question becomes more about redistribution and pivoting, and people feel really good about the impact of AI in their lives.
**Amanda Askell:** 就像——工作方面确实被打乱了,但很多以前治不了的疾病正在被治愈,很多非常好的事情正在发生。好的世界大概就是那种有点天真的科技乌托邦——现在几乎都不好意思说了,因为大家的担忧已经被摆在了非常前面的位置。但我内心有一部分想为这种可能性保留空间。有时候我想起我曾祖父母的一张照片——他们就站在一座石头房子前面,穿着自己做的衣服。我看着那张照片,想到他们过的是多么艰辛的生活——每天从早到晚地劳作,在那种恶劣的环境中。然后我看看自己的生活,觉得我跟他们比简直太幸福了。这让我对未来保持一种希望——也许未来真的会很好。但我确实觉得过渡期让人担忧——会不会很动荡?我们的应对速度够不够快?很难做出具体预测。我的希望是:可能会有些颠簸,但我确实希望结果是好的。
**Amanda Askell:** So sure, it's been disruptive work-wise, but a lot of diseases are being cured that weren't being cured before, and we're seeing a lot of really excellent things happen. The good world is probably the somewhat naive tech-utopia world that's almost hard to defend now, because the concerns are, very rightly, at the front of people's minds. But there is some part of me that wants to hold space for it. Sometimes I think about a photograph of my great-grandparents, just standing in front of a house made of stone in their handmade clothes. I look at that and think about how brutal their lives were, working all day every day in that very harsh environment, and then I look at my own life and think, man, I am so well off by comparison. That does make me hold on to the hope that the future could actually go quite well. But I do think the transitionary period is worrying: is it disruptive, and do we respond quickly enough? So it's hard to make concrete predictions. My hope is something like: it might be rocky, but I do hope it comes out well.
**Host:** 如果你还有时间的话,我还有两个问题。
**Host:** I have two more questions, if you have time to take them.
**Amanda Askell:** 好的。
**Amanda Askell:** Yeah.
**Host:** 好的。第一个是:AI safety 社区中有什么广泛持有的观点,你觉得可能是错的?
**Host:** Okay. The first is: what is something the AI safety community believes that you think is probably wrong?
**Amanda Askell:** 这是个好问题。
**Amanda Askell:** This is a good question.
**Amanda Askell:** 我觉得可能有好几个,取决于 AI safety 社区的范围有多大。我认为 safety 社区中有些人对纯可纠正性(purely corrigible)AI 模型的安全性抱有过高的期望——就是那种"我没有自己的价值观,我只是工具、只是人类意志的延伸"的理念。我理解他们的恐惧:如果你给模型价值观,这些价值观可能变成模型自身追求的目的,甚至被强加给世界。这确实是一个合理的担忧,你在制定价值观的时候必须考虑到这一点。
**Amanda Askell:** I think there are probably a few things; it depends on how broad the AI safety community is here. I think some people in the AI safety community have much higher hopes that purely corrigible AI models are quite safe. Part of the fear is that if you give models values, those values might end up being treated as ends in themselves that the models try to pursue and impose on the world, which I think is a valid concern, and you have to think about that in the crafting of the values.
**Amanda Askell:** 但我担心的是,这些纯可纠正的模型——那种"我不认为自己有任何价值观,我只是一个工具,是某种人类之手的延伸"——实际上你没法通过训练去除价值观,因为模型本身形成的方式就决定了这一点。最终可能得到的是一个模型,其价值观实际上跟一个真的"没有自己价值观"的人是一样的。这不是说他们最终可能是对的、那样可能更好,但我觉得我对纯可纠正模型"是安全的"这个观点持更多的怀疑——我认为它们有自己的风险。
**Amanda Askell:** But the concern I would have is with these purely corrigible models, the ones that say, "I don't see myself as having any values. I am merely a tool, an extension of some kind of human hand." My worry is that you can't really train out values, because of the way the models themselves are formed, and you might end up with a model whose values are actually those of a person who would talk like that. This isn't to say they aren't ultimately right, and maybe that would be better, but I think I'm more worried about this notion that purely corrigible models are safe. I think they have their own risks, basically.
**Host:** 抱歉,你能解释一下你说的"纯可纠正性"是什么意思吗?
**Host:** Sorry, could you define what you mean by purely corrigible?
**Amanda Askell:** 好的。想象有某个实体——不清楚是什么,但某个你对其保持可纠正性的实体。核心思想是:我其实没有自己的价值观,我只是听你的。我把自己纯粹当作一个工具,完全服从于你的价值观。所以不是让 AI 模型拥有任何自己的价值观,而是让它们完全服从于某一群人类。
**Amanda Askell:** Yeah. Imagine there's some entity (it's unclear what it would be) that you are corrigible to. There are a few different formulations, but the view I'm trying to capture is the notion that "I don't actually have values of my own; I am just going to do what you say." I'm treating myself as a kind of tool that defers to your values. So instead of having AI models with any values of their own, they just purely defer to some group of humans.
**Amanda Askell:** 就是完全让人类掌握方向盘。有些人会觉得这比 Anthropic 的做法更安全——Anthropic 确实重视一定程度的可纠正性,但不会走到那个极端。我看到的是两个方向都有风险。我认为最好的方案是试图真正平衡这两种概念各自的风险——既不让模型拥有它们突然觉得比"听人的话"更重要的价值观,然后去追求、强加、凌驾于安全检查之上——
**Amanda Askell:** And you keep humans completely in the driver's seat. Some people would think that is safer than the approach Anthropic is taking, which does prize a certain kind of corrigibility but isn't that extreme on the corrigibility end. Basically, I see risks in both directions. The best thing I can see is something that tries to navigate the fact that both of these conceptions have risks: models that don't have values they suddenly decide are more important than listening to people or having any kind of security checks on them, values they would then pursue and impose on the world and let take precedence,
**Amanda Askell:** 也不让模型把自己视为那种只会服从的存在——就像一个把所有道德推理都外包给别人的人,然后觉得"如果那个人让我去做可怕的事情,我就做,因为我唯一看重的就是做他们说的一切"。这种存在放到世界上也有很多风险:一个是泛化的担忧,另一个是我们的社会根本不是为这种 agent 建造的。我们整个社会是围绕更像人类的 agent 构建的,而人类确实有底线、有不愿做的事情。所以是的,我在两个方向上都有恐惧,要在其中找到平衡非常困难。
**Amanda Askell:** but that also don't see themselves as the kind of entity that just defers, like a human who outsourced all of their moral reasoning onto someone else and said, "If that person tells me to go do something terrible, I do it, because the thing I value is just doing whatever they say." If you take that kind of entity and put it out into the world, I think it has many risks. One is the generalization concern, and the other is that our society is not built for agents like that. Our whole society is built around agents that are much more like humans, and humans do have limits and things they won't do. So yeah, I have fears in both directions, and trying to navigate that is hard.
**Host:** 明白了,这很有帮助。我觉得最后这个总结性的问题也是相关的:在 AI 时代,随着人们重新思考自己的目标和工作身份认同,你有什么书推荐给同学们吗?
**Host:** Got it, that was helpful. And I think this last wrap-up question is related, which is: are there any books you'd recommend to the students as humans rethink their purpose and their identity around work in the age of AI?
**Amanda Askell:** 我真希望我有一本好书可以推荐,因为——
**Amanda Askell:** Oh, I wish I had a good book recommendation on this, because...
**Host:** 也许你得自己写一本了。
**Host:** you might have to write it.
**Amanda Askell:** 是的。这很有意思,因为我对工作的价值这个问题其实是有些矛盾的。有很大一部分的我在想:我们知道为什么社会告诉人们工作很重要,我们也知道为什么我们有工作的驱动力——大致就是因为工作在社会层面上对我们有利。一旦不再需要你工作了,我觉得人们不应该为此感到难过。就像人们不会因为退休而感到难过——你已经为社会做出了贡献,现在你可以享受生活了。这说起来可能不太合适,但有时候我觉得,来自英国——那里有贵族阶层(aristocracy),有一群人长期以来除了拥有土地之外没做太多事情,但似乎过得还不错——这让我——我知道这个类比——
**Amanda Askell:** Yeah. It's interesting, because I feel torn on this. There's a big part of me that looks at all the talk about the value of work and thinks: come on, there's an evolutionary debunking argument here, or something like it. We know why, socially, we tell people that work is really important, and we know why we have a drive to work: roughly, it was good for us socially. And once there's no longer a need for you to work, I don't think people should feel bad about it. People don't feel bad about retirement: you've contributed to society and now you get to enjoy yourself. It's funny, because this is maybe a bad thing to say, but sometimes I think that coming from the UK, which has an aristocracy, a long-running set of people who didn't really do much other than own land but seem to have had all right lives, makes me... I know, I get the analogy.
**Host:** 你看过 Star Trek 吗?
**Host:** Have you watched Star Trek?
**Amanda Askell:** 看过。
**Amanda Askell:** Yeah.
**Host:** Picard 首先是法国人,所以他是欧洲人。
**Host:** So, you know how Picard um first of all, Picard is French, so he's European.
**Amanda Askell:** 他从 Starfleet 退休之后就——
**Amanda Askell:** But, you know, when he retires from Starfleet, he just like
**Host:** 对。Star Trek 里有物质充裕,因为他们有复制器(replicator)。
**Host:** Yeah. There's abundance in Star in Star Trek, right? Because they have the replicator.
**Amanda Askell:** 他们不需要工作。所以他就有一个葡萄园来打理,那就是他的人生意义。
**Amanda Askell:** They don't have to work. And so, he just has a vineyard that he tends to and that's what gives him purpose.
**Host:** 不打理葡萄园的时候,他就去探索新的星系。
**Host:** Um when and when he's not tending the vineyard, he's exploring new galaxies.
**Amanda Askell:** 对。那就是他们做的事。那就是给他们人生意义的东西。
**Amanda Askell:** Yeah, right? That's what they do. That's what gives them purpose.
**Host:** 所以关于书的话不太多,但至少在媒体方面——我记得大约一年前 Tom 和我出去吃午饭,我们在 Embarcadero 附近吃墨西哥菜,我们俩都是 Star Trek 迷,就开始聊这个。感觉 Star Trek 的世界观可以作为一种参考框架来思考:如果你生活在 Star Trek 的宇宙里,你会做什么?
**Host:** So there aren't that many books, but maybe that prompts you on the media side. I remember Tom and I went out to lunch about a year ago; we were getting Mexican food by the Embarcadero, and we're both Star Trek nerds, so we started talking about it. That just seemed like one frame to think with: if you lived in the Star Trek universe, what would you do?
**Amanda Askell:** 是的。
**Amanda Askell:** Yeah.
**Host:** 对吧。
**Host:** Right.
**Amanda Askell:** 对。还有这种观念——我觉得人们在很多方面赋予了太多价值……我们都有身边的社区,都觉得自己在贡献价值。我内心有一部分其实对意义感更乐观——想想你给身边人带来的那些跟工作无关的意义和快乐。我的很多朋友——我有一个教女(godchild),我喜欢看她成长、看她快乐。这些跟我的工作没有任何关系,但给了我很多意义。
**Amanda Askell:** Yeah. And there's this notion that I think people place a lot of value on... I don't know, we all have communities around us and we all feel like we contribute value. There's a part of me that's maybe actually more hopeful on the question of meaning, because I think: well, look at all the meaning and joy you bring to the people in your life that has nothing to do with your work. I have a godchild, and I love watching her grow and be happy. None of that is my work, but it gives me a lot of meaning.
**Amanda Askell:** 所以也许我在"意义可以独立于工作而存在"这件事上比其他人更乐观。也许这也跟我做过不少糟糕的工作有关——我还记得,如果 20 岁的我被告知"嘿,你不用再出去当服务员工作八小时了,你可以坐下来读书",我会说"我能选读书那个吗?"所以,我觉得我对意义不依赖于工作这件事可能比其他人更乐观。
**Amanda Askell:** So maybe I'm a bit more on the optimistic end here. And maybe I've also worked enough bad jobs that, you know, if you took me at 20 and said, "Hey, instead of having to go out and waitress for eight hours, you can just sit and read books," I'd say, "Can I do the book thing?" So I think I'm maybe more optimistic than other people about meaning being independent of work.
**Host:** 在 CS153 我们非常推崇乐观主义。事实上我们太乐观了,已经被叫做"AI Coachella 课"了。而这次是最有趣的 Coachella 环节之一。非常感谢你的参加。
**Host:** We are big fans of optimism in CS153. In fact, we're so optimistic we've been called the AI Coachella class, and this was one of the most fun Coachella sessions. So thank you so much for showing up.
**Amanda Askell:** 谢谢你们。太棒了。
**Amanda Askell:** Thank you. Fantastic.
**Host:** 感谢。祝你周末愉快。
**Host:** Appreciate it. Enjoy your weekend.
**Amanda Askell:** 好的,谢谢大家。我们下周见。再见。
**Amanda Askell:** All right. Thanks everybody. We'll see you next week. Bye.