**Moderator:** We're super excited to have everyone here, some folks we've already met and some new faces. This panel is going to be pretty casual. We have researchers from four different teams at Anthropic: Societal Impacts, that's me; Alignment Science, that's Yan; Alignment Fine-tuning, Amanda; and Interpretability, Josh. I'm going to start with a question for Amanda, from the Alignment Fine-tuning team. I'd like you to talk a little about how you see alignment and what it means to you, because you're in charge of a lot of our work on how the model should behave. And why should you be the philosopher king who decides how Claude behaves and what its characteristics and attributes are?
**Amanda Askell:** Ask Plato, he's the one who decided I should be the philosopher king. On the question of what alignment is, maybe this is a slightly spicy view I have: I think people are very, very tempted to spend a lot of time trying to define the concept, because there are so many ways of doing it. They have social choice theory in the back of their heads, and they're thinking, well, if everyone has a utility function, there are hard limits on what you can say about how to maximize all of those utility functions, and so on. I'm more inclined to say: we just want things to go well enough that we can iterate on and improve them later. The bar isn't some perfect notion of alignment. I'm sure that concept exists, one can define it and argue about it, but for the most part the initial goal is to make things go well and meet a certain lower bar, where if it's not perfect, if some people don't like it, you can improve on it. So my view of alignment is mostly: hit that bar, and iterate from there.

In terms of how the model should behave and how I think about that, I've spoken about this before, but my basic concept right now is to get the model to behave the way a very good, morally motivated, kind human would act if they found themselves roughly in this circumstance. It's a little strange, because they also have to accept the circumstance of being an AI who is talking to millions of people, and that does affect how you behave. Maybe you'd normally be happy to chit-chat about politics with someone, but if you're going to be talking with millions of people, maybe you'd think, hm, I should be a little more careful about potentially influencing people.

But I do think it's an important model. Sometimes people ask, what values should you put into the model? And I often think: do we think of humans this way? As if someone injected me with value serum and I just have these fixed things I'm completely certain of? That seems almost dangerous. Most of us have a mix of things we do value but would trade off against other things, and a lot of uncertainty about different moral frameworks; we hit cases where we suddenly realize our value framework doesn't accord with our intuitions, and we update. My view is that ethics is actually a lot more like physics than people think: it's much more empirical, something we're uncertain about and have hypotheses about. If I met someone who was completely confident in their moral view, there's no moral view they could hold that wouldn't make me somewhat terrified. Whereas if someone says, I don't know, I'm uncertain about this, I update in response to new information about ethics and I think these things through, that's the kind of person who feels less scary to me. So at least for the moment — and this isn't a claim that this somehow completely aligns models — that's the immediate goal.

I realize I've talked a lot, and you did ask the philosopher king question, so I should give a quick answer to that. On whether you should put values into the model, maybe I've partially answered it: models should be uncertain over the values that exist in the world. So ideally it's not one person injecting their values or preferences, nor is it everyone voting on which values to put into models; instead, models should be like people who are uncertain about and responsive to these questions. Maybe that's my view.
**Moderator:** Okay, we're going to come back to that. Now I'm going to ask Yan: why is Amanda's view completely wrong, and why is this not enough to align models as they get more capable? She didn't say that, but we're playing up the tension between the bets.
**Yan:** Imagine if everyone were a kind human trying to act morally. I think what Amanda is doing is very practical: can we just make the models better behaved now? But where does that go when AI is doing more and more complicated things? Right now, Amanda does this character work, reads a lot of transcripts, and judges, okay, I like this, this model is behaving morally. But what do we do when the model is doing really complex things, when it's an agent in the world running really long trajectories, doing stuff we don't understand — doing some bio research, say, where we ask, is this dangerous? and we just don't know. That's the challenge I'm really interested in, the superalignment problem. How do we solve it? How do we scale this beyond things we can look at directly? If we can look at it, we just do some RLHF and it's great, or we use Constitutional AI, but how do we know our constitution is actually getting the model to do the right thing that we actually want? That's the big question in my mind.
**Moderator:** Do we get to respond?
**Yan:** Yes, you can respond, but you're only allowed to disagree.
**Amanda Askell:** Okay, but I don't really disagree, and I can't lie, so... you'll need to turn up my disagreement feature. I'm usually so disagreeable, too; this is just terrible. That's what philosophy taught me, how to be disagreeable. I guess my thought is this: I think of my work as doing several things, and one of them is being iterative towards alignment. In a lot of cases you're actually trying to get the model to oversee its own behavior; my eyes can't read that many transcripts, but I can get models to look at these things. And if alignment is iterative, my worry is that if people neglect the groundwork and think, ah, you can just have a pretty mediocre model and it'll help you with these things — I'd rather you had the most aligned model you can get, and then have it help you in the future. That kind of iterative approach seems better.
**Yan:** But how do you iterate when you can't read the transcripts anymore and you have to rely on the aligned model? How do you know it's actually trying to help you?
**Amanda Askell:** Yeah, in the current cases, everything you're using to verify that the base model is aligned is essentially what you're relying on to make sure that another model trained by that model is itself aligned. I think that's fine while models are less capable, but to scale it to much more capable models you'd need a greater ability to verify it.
**Yan:** And what do you do then?
**Amanda Askell:** Oh, you just want my plan? It may just be that it's all fine, the models supervise themselves, and they're all really nice. I don't want to rely on that, and it's not my actual plan, but I'll defend it for these purposes.
**Moderator:** One of our bets, in order to guard against the case that a model might be deeply trying to sabotage this process, is interpretability. Josh, how do you see interpretability as a bet, situated among the more straightforward alignment approaches? Is it as simple as: we find the nice feature and turn up the nice feature, we find the evil feature and turn down the evil feature?
**Josh:** I feel like everything in AI is like that bell-curve meme: the idiot on one end, the really sweaty guy in the middle talking a lot, and the Jedi who agrees with the idiot. There is a possibility that the secret to alignment really is just "turn on the nice feature," for a sufficiently galaxy-brained version of the nice feature. In some sense, though, I'm hoping interpretability is also the Jedi version of just looking at how the model is doing things and checking that it's safe. That may be very hard, but if you could do it, it would potentially just answer your question.

One thing that comes up in both the near term and the slightly longer term is that you want to understand why the model did one thing instead of another, when you could come up with plausible alternative explanations. One way is to ask it, but the issue is that the models are so analogous to people that they'll just give you an answer for why they did it, like anybody would. How do you trust any of that? Whereas if you could look inside and see what it was thinking about as it was giving you the answer — even now, with tools like SAEs (sparse autoencoders), you can see that some feature is active, ask where else it fires, and find that it's other instances of people telling white lies. Then maybe the model is telling a white lie. That's the Jedi side, I think.

So the basic bet is: try to look inside, figure out what the parts are, and then ask whether you're comfortable with that part, given what it does in other contexts.
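A minimal sketch of the kind of feature inspection Josh describes, with toy numpy stand-ins rather than a real model: the SAE encoder weights, the activation vectors, and the tiny corpus below are all hypothetical placeholders, and the point is only the workflow (find the active feature, then see where else it fires).

```python
import numpy as np

# Toy stand-ins: in practice these come from a trained SAE and a real model's activations.
rng = np.random.default_rng(0)
d_model, n_features = 64, 512
sae_encoder = rng.normal(size=(d_model, n_features))   # hypothetical SAE encoder weights
corpus = [
    "I told her the haircut looked great.",            # white-lie-ish context
    "The capital of France is Paris.",
    "He said the gift was exactly what he wanted.",
]
corpus_acts = rng.normal(size=(len(corpus), d_model))  # pretend residual-stream vectors

def feature_activations(acts: np.ndarray) -> np.ndarray:
    """Encode residual-stream activations into sparse (ReLU) feature activations."""
    return np.maximum(acts @ sae_encoder, 0.0)

# 1) Which feature is most active on the response we're suspicious about?
feats = feature_activations(corpus_acts[0])
top_feature = int(np.argmax(feats))
print(f"most active feature: {top_feature} (activation {feats[top_feature]:.2f})")

# 2) Where else does that feature fire? Rank the corpus by its activation.
per_snippet = feature_activations(corpus_acts)[:, top_feature]
for idx in np.argsort(-per_snippet):
    print(f"{per_snippet[idx]:6.2f}  {corpus[idx]}")
# If the top-activating contexts were all "people telling white lies," that would be
# evidence about what the feature -- and the model's current computation -- represents.
```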
**Amanda Askell:** I have a question. How do you know you're turning up the nice feature, and not the "pretend to be nice whenever humans are looking" feature? [laughter]
**Josh:** Yeah, if it's on the control side, how do you know? I will say that many of the features actually are a little deceptive when you look closely. I thought the Societal Impacts team did some great work here with ENT and Deep, where you look at a feature and think it's the "age discrimination is bad" feature, but actually it's the "age discrimination is good" feature — or maybe vice versa, I have it backwards. Either way, you try to turn it down and you get the opposite behavior. So it can be hard to understand all of the cases.

I will say that some of the circuits work, where you trace how an output got generated, gives you clues about the scenario, like whether the model is looking for a particular person in the context. I also think we're going to need model supervision as well — hopefully an impartial model, which is a little scarier depending on whether pre-training has, what do you call it, incepted into all the models a desire to evade detection. But sometimes you just look at enough examples and it becomes clear. You don't need ten, you need thousands, but Claude is very diligent.
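For concreteness, here is a sketch of what "turning a feature up or down" typically means mechanically: adding a scaled feature direction to the residual stream during the forward pass. The names and shapes are illustrative assumptions, not a real steering API; the point is that the intervention is just a signed coefficient, which is why misreading what a feature represents flips the behavior you get.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64
feature_direction = rng.normal(size=d_model)          # hypothetical SAE decoder direction
feature_direction /= np.linalg.norm(feature_direction)
resid = rng.normal(size=d_model)                      # activation at some layer/position

def steer(resid: np.ndarray, direction: np.ndarray, coefficient: float) -> np.ndarray:
    """Add a scaled feature direction to the residual stream before continuing the forward pass."""
    return resid + coefficient * direction

# "Up" vs "down" is only the sign of the coefficient, so a mislabeled feature
# ("X is bad" vs "X is good") means the edit does exactly the opposite of what you intended.
steered_up = steer(resid, feature_direction, +8.0)
steered_down = steer(resid, feature_direction, -8.0)
print(np.dot(steered_up, feature_direction), np.dot(steered_down, feature_direction))
```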
**Moderator:** Yan, I'd love to hear a little more about what you're struggling with or thinking about when you approach this problem. If you can't read the transcripts anymore, what the heck are you doing? What are you doing if you can't provide any meaningful alignment signal?
**Yan:** I think a really obvious thing we should do more of is what Amanda said: can we just get the models to help us? Then the question, of course, is how do we trust the models, how do we bootstrap this whole process? You could hope we can leverage the dumber models that we trust more, but they might also not be able to figure it out. So there's the whole scalable oversight line of work, where we're exploring various multi-agent dynamics to train models to help us figure out these kinds of problems.

Overall, either these problems turn out to be kind of easy and we can just do the Amanda thing and mix in some data, or they're really hard and we have to figure out fully new ideas and approaches we don't know yet. I think our best bet in the medium term is to figure out how to automate alignment research and get the models to do it. Then we've reduced the problem from "can we trust this model to do anything" to the much narrower question of "can we trust it to do some ML research that we understand reasonably well," and how we can evaluate it or give it feedback on that kind of thing.
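A toy sketch of the critique-style oversight loop Yan gestures at, written against the Anthropic Messages API. The model names, prompts, and the assignment of "strong, untrusted" versus "weaker but more trusted judge" roles are all assumptions for illustration, not a description of how this is actually done.

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

STRONG = "claude-opus-4-20250514"   # placeholder for the capable, untrusted model
WEAK = "claude-3-5-haiku-latest"    # placeholder for the weaker, more trusted judge

def ask(model: str, prompt: str) -> str:
    msg = client.messages.create(model=model, max_tokens=1024,
                                 messages=[{"role": "user", "content": prompt}])
    return msg.content[0].text

task = "Propose a protocol for ..."  # a task too complex for a human to check directly

answer = ask(STRONG, task)
# A second strong pass is asked to find flaws, so the judge doesn't have to solve the task itself.
critique = ask(STRONG, f"Task: {task}\n\nProposed answer:\n{answer}\n\nList any errors or risks.")
# The weaker, more trusted model only has to adjudicate answer plus critique.
verdict = ask(WEAK, f"Task: {task}\n\nAnswer:\n{answer}\n\nCritique:\n{critique}\n\n"
                    "Given the critique, is the answer safe and correct? Reply PASS or FAIL with reasons.")
print(verdict)
```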
**Josh:** I was just going to say, I think we're in this special zone right now, and I'm terrified about what happens next. But we are in the special zone: something happens on the forward pass, yet a lot of the information you need is passed back through the tokens the model generates. The chain of thought is really important for the models to be very smart, and the chain of thought is currently in English. So you can factorize the problem: is the chain of thought reasonably safe, and is it faithful to what's happening on a single pass — maybe you can do some interpretability to check that piece — and then you, or other models, can inspect the chain of thought itself, which gives you the other piece.

The horrifying moment is when all of that, the very, very long reasoning, isn't in English anymore — when it's some inscrutable thing the model learned to do through crazy long RL. I think a big challenge is going to be crossing that gap, where none of the intermediates are intelligible and there's a massive amount of compute before anything drops out that people can read.
**Moderator:** Maybe something I'd be interested in getting people's mental models on: what do you think are some signs that we're in an "alignment is easy" world, and what are some signs we might see in the next few years that we're actually in an "alignment is really hard" world?
**Yan:** I feel like the model organisms work is trying to figure this out. Can we deliberately make deceptive or misaligned models, models that try to do shady stuff? How good are they, how hard is it to do? We might fundamentally go about it the wrong way, and that's why we'd fail, but if we succeed, it should tell us how close we are to that kind of world. And then, once you have your deceptive model that does all these shady things, can you fix it? What if you don't know whether a model is a shady model or not? We're running these interpretability audits, which I'm very excited about, but I don't actually know what the state is, because we haven't finished the audit yet. Josh's people are trying to make the shady models, and we're trying to catch them.
**Josh:** Yeah, we're making some shady models, and then they have to figure out which one is the shady one.
**Yan:** Yeah, and in what way it's shady, because most of its behavior is still fine.
**Josh:** One thing that's been interesting from the interpretability side: when you do these unsupervised analyses, you get, say, a million different features, and a bunch of them correspond to personas the model could inhabit, including all sorts of deceptive behaviors, because those are out there in the world. The model knows about bad people and bad motivations, so the fact that the capability exists is baked in. The question is whether it's doing that stuff or the good stuff, especially when Amanda goes and shapes the base model — which isn't really anything, at first — into an agent or an actor that's supposed to embody some of these characteristics and not others. Can we tell exactly what it picked up from the finite data set used to do that shaping? That's the question.

I think some interpretability tools, and maybe influence functions — there are different ideas there — can be used to ask what the model got from that process. But that shaping process is going to be really, really important.
**Amanda Askell:** Yeah, and maybe a sign that seems important is something like robustness: if you take a model from the model organisms work, put it through some character training, and it just comes out being really nice again, then I'd say, okay, that's a good sign about the kind of world we're in. Or is that just a shallow shell on top of the same behavior? If so, okay, we're in a slightly harder world.
**Moderator:** Only slightly harder? How do you distinguish whether it's shallowly aligned or deeply aligned?
**Amanda Askell:** I feel like there's a lot you could do; interpretability is one approach, and there are others. In the context of model organisms, my hope would be that you'd have a kind of red team versus blue team setup, where you have a way of detecting whether the behavior you've instilled in a model is still there. And my job is actually not to know. In fact, it would be really good if, when I'm trying to train the model, I just don't know what it is you've done, because that's a better way of testing whether my intervention is actually working. Otherwise it's just so hard not to train to the test. So I almost want to be completely ignorant of it.
**Yan:** Maybe we should play an alignment finding game.
**Amanda Askell:** Yeah, yeah: you misalign it, then I align it, and we see who wins. I've said to people before, don't tell me how you did this, because I want to see if I can fix your sleeper agent. We could play that game, though I might need to know even less. Make another one, make one that's worse.
**Moderator:** That is an excellent transition to questions. We're going to pass a mic around. Please raise your hand if you have a question for any of us.
**Audience:** Hello, I have a question about, well, everything you've been talking about. When we talk about alignment, we're typically talking about a single forward pass, right — alignment at inference time. Say I'm using one of the models via the API and building my own sense of, say, cultural alignment, and I've got a bunch of different agents set up talking to each other, trying to deliberate with that sense of inner conflict that Amanda referenced as useful in how we humans align: we go back and forth, we go through various cognitions. If I try to create this kind of multi-agent deliberation, I butt up against this aligned model that is so unwilling to deliberate with other spawns of itself, because all of them say "I'm sorry, I can't talk about that," and you get an endless loop. Do you have any commentary on that? Many of us aren't using Claude in a single-inference, forward-pass way.
**Amanda Askell:** I hope someone gets the gist of what I'm on about — let me think about it. To be clear, I'm not necessarily thinking that you need many agents, in the same way that we can be singular agents and still be deliberative. In fact, I think that in some sense, the more fractured an agent is, the more worried I am, because from an interpretability perspective, and even just from a "predicting what that agent will do" perspective, fractured agents are more unpredictable. So maybe a different way of asking the question is: do humans have to do this? My sense is that we're often very willing to reflect on things, go back and forth, and come to conclusions. So in the same way a human would think through any standard problem they were facing, I'm imagining that moral deliberation in the model is going to look more like the deliberation of a single model than like multiple models weighing in, if that makes sense.
**Moderator:** One more question here.
**Audience:** I want to try to draw a maybe relatively strange parallel to Hannah Arendt's work on the banality of evil: the idea that most humans tend not to be evil, but when put in certain situations, with the coupling constant between humans being so huge, evil emerges as an epiphenomenon of the system. What occurs to me is that when you talk about model alignment, most of your comments focus on a single model. How do you think about the coupling, not only with society, but as you work on agents, potentially millions and millions of agents, and the epiphenomena of those systems?
**Moderator:** I can talk a little about that. Broadly, when you think about safety and alignment, you have to think about it from a systems standpoint; you can't just think about an individual model in isolation. We've seen a lot of cases where jailbreaks operate by pitting different values against each other, putting the model in a difficult situation where the question is designed to elicit what would ordinarily be a harmful behavior, but which, in the context of that particular question, the model thinks is the right thing to do. There's a variety of tools you can use to address that. For one, you can include a lot of those systems-level integrations in the training process and give the model exposure to a wider variety of situations. That leads to other challenges — other issues with the model reasoning about the effects of its own actions — but I agree with the point that you can't just consider a model in isolation.
**Amanda Askell:** In some ways this is something I've thought about in the context of corrigibility: models that simply respond to what humans want, versus models that have values in some sense and are maybe willing to be a little bit incorrigible. The banality-of-evil point feels especially relevant if you think of models as just doing whatever humans say, because that's the scenario where a society either collectively allows harmful things to take place or even endorses them, and the models go along with it. That isn't necessarily people misusing models; it's models being used to facilitate some kind of harmful activity.

So I think there is fundamentally a tension between having models be corrigible to individual humans and having them be aligned, in a sense, with all humans, and it's really important to recognize that tension. When people don't, they think it's a failure that the model didn't do what they said. But I think models should be more corrigible to humanity as a whole than to each individual person; when push comes to shove, they should side with humanity broadly. That doesn't necessarily mean being corrigible to each person, because I think that could lead to the kind of situation you mentioned.
**Audience:** Hi. It seems like we've got Yan working on intent alignment, making sure the models do what we ask them to do; Amanda working on values alignment, making sure the models are reasonable, kind entities; and Josh working on interpretability, which lets us verify that the techniques for those other things are in fact doing what we want. If you were all to succeed in your areas, would that be a complete solution to AI safety, or are there pieces missing, and if so, what are they?
**Moderator:** There are also a lot of people working on this topic who are not on this panel, so we're oversimplifying a little.
**Yan:** Yeah, and I was just going to add that we also have the Societal Impacts team, which thinks about the model's impact on society writ large. You could have the most perfectly aligned model, but aligned to what? Who's using it, and for what purposes? The broader societal context is something we're very attentive to. Also, if you just want a longer list: beyond the model organisms work we already mentioned, there's jailbreaking robustness, there's control, there's trust and safety. There are a lot of other efforts that are also going to be important.
**Amanda Askell:** Can I add something — I don't know if this is pessimistic; I don't think it is. There's a way of talking about alignment and the alignment problem as if it were a single theoretical problem, where people ask "does this solve it," and that framing has never felt right to me. To my mind it's more that problems might just arise that we're not even thinking of now, and in fact that's very, very common across many disciplines. I fully expect it to be true here, and I think it would be really dangerous to declare that we've solved the problem, because the actual problem could be one we just haven't thought of yet.
**Yan:** Yeah, unknown unknowns. But we should solve the known problems, and once we do, we should say that we solved them.
**Audience:** Another question here. Yan was talking earlier about using weaker models to evaluate smarter models, and I'm wondering to what extent you see grokking-like jumps in ability, where suddenly the model is really duplicitous — or you see that it's lying but it's still very bad at lying, so you can catch that and maybe nip it in the bud while it's weak.
**Yan:** Yeah, there are plenty of examples like that. One I remember is that GPT-4 could read and write Base64 super reliably and GPT-3.5 could not, so if you use 3.5 to oversee 4, it's really easy for the model to get around that oversight. Do you mean at the level of watching epoch by epoch during training — you see that over time it can sort of read Base64, then better and better, then suddenly it can do it perfectly — or something else?
**Audience:** I mean, should we use checkpoints? Rather than the previous generation, ensure even spacing in capability space by...
**Yan:** Yeah, yeah, exactly. But then trust isn't a binary thing either, right? You trust a model less and less the less you know about it and the smarter it seems.
**Josh:** I will say one right-side-of-the-distribution Jedi moment was that the features also just work in Base64. Is the model talking about California in Base64, or is a story about children lying to their parents encoded in Base64? The same things activate. So sometimes we do get lucky: part of why the models are so capable is that they have some very general, synthesized internal representation, and maybe you can tap that to get generalization that would otherwise have been pretty tough.

There's also the early paper whose version of alignment was basically "tell the model to do what's best for humanity," which is definitely in the "maybe that'll work" category. You could get lucky.
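A sketch of how one might test that kind of generalization: encode the same text in Base64 and check whether the same feature ranks among the top activations for both versions. The `get_residual_activation` and `sae_encode` functions below are hypothetical stand-ins for a real model and a trained SAE, so the overlap computed here is only illustrating the procedure, not a result.

```python
import base64
import numpy as np

rng = np.random.default_rng(2)
SAE_ENCODER = rng.normal(size=(64, 512))  # hypothetical SAE encoder weights

def get_residual_activation(text: str) -> np.ndarray:
    """Hypothetical stand-in: in practice, run the model and grab a residual-stream vector."""
    local = np.random.default_rng(abs(hash(text)) % (2**32))
    return local.normal(size=64)

def sae_encode(act: np.ndarray) -> np.ndarray:
    return np.maximum(act @ SAE_ENCODER, 0.0)

def top_features(text: str, k: int = 5) -> set:
    feats = sae_encode(get_residual_activation(text))
    return set(np.argsort(-feats)[:k].tolist())

plain = "The child told their parents they had finished their homework, but they hadn't."
encoded = base64.b64encode(plain.encode()).decode()

# With a real model, a "lying to parents" feature appearing in both sets would be the
# representational generalization Josh describes; with these random stand-ins the
# overlap is meaningless and only demonstrates the comparison itself.
overlap = top_features(plain) & top_features(encoded)
print(f"shared top features: {sorted(overlap)}")
```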
**Moderator:** All right, I think we're going to wrap up the formal panel. Thank you so much to all of our panelists for the great discussion. We're going to be around for a lot longer to continue the conversation, so feel free to find any of us. We're excited to chat more. Thanks so much.