**Lenny:** Today, my guest is Edwin Chen, founder and CEO of Surge AI. Edwin is an extraordinary CEO and Surge is an extraordinary company. They're the leading AI data company, powering training at every frontier AI lab. They're also the fastest company to ever hit $1 billion in revenue, just 4 years after launch, with fewer than 100 people, and completely bootstrapped. They've never raised a dollar in VC money, and they've been profitable from day one. As you'll hear in this conversation, Edwin has a very different take on how to build an important company and how to build AI that is truly good and useful to humanity. I absolutely loved this conversation and I learned a ton. I'm really excited for you to hear it. If you enjoy this podcast, don't forget to subscribe and follow it in your favorite podcasting app or on YouTube. It helps tremendously. And if you become an annual subscriber of my newsletter, you get a ton of incredible products for free for an entire year, including Devin, Lovable, Replit, Bolt, NAM, Linear, Superhuman, Descript, Wispr Flow, Gamma, Perplexity, Warp, Granola, Magic Patterns, Ray, Catch, Shepardd, Mobbin, PostHog, and Stripe Atlas. Head on over to Lennys.com and click Product Pass. With that, I bring you Edwin Chen, after a short word from our sponsors.
**Lenny:** My podcast guests and I love talking about craft and taste and agency and product-market fit. You know what we don't love talking about? SOC 2. That's where Vanta comes in. Vanta helps companies of all sizes get compliant fast and stay that way with industry-leading AI automation and continuous monitoring. Whether you're a startup tackling your first SOC 2 or ISO 27001, or an enterprise managing vendor risk, Vanta's trust management platform makes it quicker, easier, and more scalable. Vanta also helps you complete security questionnaires up to five times faster so that you can win bigger deals sooner. The result, according to a recent IDC study: Vanta customers save over $500,000 a year and are three times more productive. Establishing trust isn't optional. Vanta makes it automatic. Get $1,000 off at vanta.com/lenny.
**Lenny:** Here's a puzzle for you. What do OpenAI, Cursor, Perplexity, Vercel, Plaid, and hundreds of other winning companies have in common? The answer is they're all powered by today's sponsor, WorkOS. If you're building software for enterprises, you've probably felt the pain of integrating single sign-on, SCIM, RBAC, audit logs, and other features required by big customers. WorkOS turns those deal blockers into drop-in APIs with a modern developer platform built specifically for B2B SaaS. Whether you're a seed-stage startup trying to land your first enterprise customer or a unicorn expanding globally, WorkOS is the fastest path to becoming enterprise-ready and unlocking growth. They're essentially Stripe for enterprise features. Visit workos.com to get started, or just hit up their Slack support, where they have real engineers who answer your questions super fast. WorkOS allows you to build like the best, with delightful APIs, comprehensive docs, and a smooth developer experience. Go to workos.com to make your app enterprise-ready today.
**Lenny:** Edwin, thank you so much for being here, and welcome to the podcast.
**Edwin Chen:** Thanks so much for having me. I'm super excited.
**Lenny:** I want to start with just how absurd what you've achieved is. A lot of people and a lot of companies talk about scaling massive businesses with very few people as a result of AI, and you guys have actually done it, in a way that is unprecedented. You hit a billion in revenue in less than four years, with around 60 to 70 people. You're completely bootstrapped and haven't raised any VC money. I don't believe anyone has ever done this before. So you're actually achieving the dream of what people describe will happen with AI. I'm curious: do you think this will happen more and more as a result of AI? And where has AI most helped you find the leverage to do this?
**Edwin Chen:** Yeah. So we hit over a billion in revenue last year with under 100 people. And I think we're going to see companies with even crazier ratios, like $100 million per employee, in the next few years. AI is just going to get better and better and make things more efficient, so that ratio becomes inevitable. I used to work at a bunch of the big tech companies, and I always felt that we could fire 90% of people and we would move faster, because the best people had all these distractions. And so when we started Surge, we wanted to build it completely differently, with a super small, super elite team. And yeah, what's crazy is that we actually succeeded.
**Edwin Chen:** And so I think two things are colliding. One is that people are realizing that you don't have to build giant organizations in order to win. And two, all these efficiencies from AI are just going to lead to a really amazing time in company building. The thing I'm most excited about is that the types of companies are going to change, too. It won't just be that they're smaller; we're going to see fundamentally different companies emerging. If you think about it, fewer employees means less capital, and less capital means you don't need to raise. So instead of companies started by founders who are great at pitching and great at hyping, you'll get founders who are really great at technology or product. And instead of products optimized for revenue and what VCs want to see, you'll get more interesting ones built by these tiny, obsessed teams: people building things they actually care about. Real technology, real innovation. So I'm actually really hoping that the slick, hype-driven startup scene will go back to being a place for hackers again.
**Lenny:** You guys have done a lot of things in a very contrarian way. One was not being on LinkedIn posting viral posts, not being on Twitter constantly promoting Surge. I think most people hadn't heard of Surge until just recently, and then you came out and said, okay, we're the fastest-growing company, at a billion dollars in revenue. Why would you do that? I imagine that was very intentional.
**Edwin Chen:** We basically never wanted to play the Silicon Valley game. I always thought it was ridiculous. What did you dream of doing when you were a kid? Was it building a company from scratch yourself and getting in the weeds of your code and your product every day? Or was it explaining all your decisions to VCs and getting on this giant PR and fundraising hamster wheel?
**Edwin Chen:** And it definitely made things more difficult for us, because when you fundraise, you naturally become part of this kind of Silicon Valley industrial complex: your VCs will tweet about you, you'll get the TechCrunch headlines, you'll get announced in all the newspapers because you raised at this massive valuation. So it made things more difficult for us, because the only way we were going to succeed was by building a 10 times better product and getting word of mouth from researchers.
**Edwin Chen:** But I think it also meant that our customers were people who really understood data and really cared about it. I always thought it was really important for our early customers to be aligned with what we were building, to really care about having high-quality data, and to really understand how that data would make their AI models so much better, because they were the ones helping us. They were the ones giving us feedback on what we were producing. And so having that very close mission alignment with our customers actually helped us early on. These were people buying our product because they knew how different it was and because it was helping them, rather than because they saw something in a TechCrunch headline. So it made things harder for us, but I think in a really good way.
**Lenny:** It's such an empowering story for founders to hear: that they don't need to be on Twitter all day promoting what they're doing, they don't have to raise money, they can just go heads-down and build. So I love so much about the story of Surge. For people that don't know what Surge does, can you give us a quick explanation of what Surge is?
**Edwin Chen:** We essentially teach AI models what's good and what's bad. We train them using human data, and there are a lot of different products that we have, like SFT, RLHF, rubrics, verifiers, RL environments, and so on. And then we also measure how well they're progressing. So essentially, we're a data company.
**Lenny:** What you always talk about is that quality has been the big reason you guys have been so successful: the quality of the data. What does it take to create higher-quality data? What do you all do differently? What are people missing?
**Edwin Chen:** I think most people don't understand what quality even means in this space. They think you can just throw bodies at a problem and get good data, and that's completely wrong. Let me give you an example. Imagine you wanted to train a model to write an eight-line poem about the moon. What makes it a good, high-quality poem? If you don't think deeply about quality, you'll be like, "Is this a poem? Does it contain eight lines? Does it contain the word moon?" You check all of these boxes, and if so, sure, you say it's a great poem. But that's completely different from what we want. We're looking for Nobel Prize-winning poetry. Is this poetry unique? Is it full of subtle imagery? Does it surprise you and tug at your heart? Does it teach you something about the nature of moonlight? Does it play with your emotions? Does it make you think?
**Edwin Chen:** That's what we're thinking about when we think about a high-quality poem. So it might be a haiku about moonlight on water. It might use internal rhyme and meter. There are a thousand ways to write a poem about the moon, and each one gives you different insights into language and imagery and human expression. And thinking about quality this way is really hard. It's hard to measure. It's really subjective and complex and rich, and it sets a really high bar.
**Edwin Chen:** And so we have to build all of this technology in order to measure it: thousands of signals on all of our workers, thousands of signals on every project, every task. At the end of the day, we know whether you're good at writing poetry versus writing essays versus writing technical documentation. So we have to gather all these signals on what your background is, what your expertise is, and not just that, but how you're actually performing when you're writing all these things. And we use those signals to inform whether or not you're a good worker for these projects and whether or not you're improving the models. It's really hard to build all this technology to measure it, but I think that's exactly what we want AI to do. And so we have these really deep notions about quality that we're always trying to achieve.
**Lenny:** So what I'm hearing is that you go much deeper in understanding what quality is within the verticals that you're selling data around. What are the mechanics of that? Is it a person you hire who is incredibly talented at poetry, plus evals that, I guess, they help write that tell them what great looks like?
**Edwin Chen:** The way it works is we essentially gather thousands of signals about everything that you're doing when you're working on the platform. So we're looking at your keystrokes. We're looking at how fast you answer things. We're using peer reviews, we're using code standards, and we're training models ourselves on the outputs that you create and then seeing whether they improve a model's performance.
**Edwin Chen:** It's very similar to Google search. When Google search is trying to determine what a good web page is, there are almost two aspects to it. One is that you want to remove the worst of the worst web pages: all the spam, all the low-quality content, all the pages that don't load. It's almost like a content moderation problem; you just want to remove the worst of the worst. But then you also want to discover the best of the best: this is the best web page, or this is the best person for this job. They're not just somebody who writes the equivalent of high-school-level poetry, robotically writing poems that check all these boxes and all these explicit instructions; they're writing poetry that makes you emotional. So we have all these signals for that too, which is completely different from removing the worst of the worst: we're finding the best of the best. And just like Google search feeds all these signals into its ML algorithms and uses them to predict certain types of things, we do the same with all of our workers and all of our tasks and all of our projects. So it's almost like a complicated machine learning problem at the end of the day. That's actually how it works.
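The two-sided triage Edwin describes, removing the worst workers while surfacing the best, can be sketched as a tiny scoring model. Everything here is invented for illustration (the signal names, the weights, the thresholds); it is not Surge's actual system, which would learn these weights from thousands of signals rather than hard-coding three:

```python
from dataclasses import dataclass
import math

@dataclass
class WorkerSignals:
    # Hypothetical per-annotator features, stand-ins for the "thousands of signals"
    review_score: float   # peer-review rating in [0, 1]
    speed_penalty: float  # 1.0 if answers arrive suspiciously fast, else 0.0
    model_delta: float    # eval-metric change when training on this worker's data

def quality_score(s: WorkerSignals) -> float:
    """Combine signals into one quality estimate via a logistic model.
    Weights are made up; a real pipeline would fit them to outcomes."""
    z = 2.0 * s.review_score - 1.5 * s.speed_penalty + 3.0 * s.model_delta - 1.0
    return 1.0 / (1.0 + math.exp(-z))

def triage(workers: dict[str, WorkerSignals],
           low: float = 0.3, high: float = 0.8) -> tuple[list[str], list[str]]:
    """Two-sided triage: flag the worst of the worst and the best of the best."""
    worst = [name for name, s in workers.items() if quality_score(s) < low]
    best = [name for name, s in workers.items() if quality_score(s) > high]
    return worst, best
```

The point of the sketch is the shape of the problem, not the numbers: it is a classification task with separate thresholds at each tail, exactly the spam-removal-plus-best-discovery split Edwin draws from Google search.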
**Lenny:** That is incredibly interesting. I want to ask you about something I've been very curious about over the past couple of years. If you look at Claude, it's been so much better at coding and at writing than any other model for so long. And it's really surprising just how long it took other companies to catch up, considering how much economic value there is there. Just about every AI coding product sat on top of Claude because it was so good at code, and at writing also. What is it that made it so much better? Is it just the quality of the data they trained on, or is there something else?
**Edwin Chen:** I think there are multiple parts to it. A big part of it certainly is the data. I think people don't realize that there's almost an infinite number of choices that all the frontier labs are deciding between when they're choosing what data goes into their models. Are you purely using human data? Are you gathering the human data in XYZ way? When you're gathering the human data, what exactly are you asking the people creating it to create for you? Maybe, for example in the coding realm, you care more about front-end coding versus back-end coding. Maybe when you're doing front-end coding, you care a lot about the visual design of the front-end applications you're creating. Or maybe you don't care about that so much, and you care more about efficiency, or pure correctness over visual design.
**Edwin Chen:** Then there are other questions, like how much synthetic data are you throwing into the mix? How much do you care about these 20 different benchmarks? Some companies see these benchmarks and think: okay, for PR purposes, even though we don't think these academic benchmarks matter all that much, we need to optimize for them anyway, because our marketing team needs to show progress on the standard evaluations that every other company talks about, and if we don't show good performance there, it's going to hurt us, even if ignoring these academic benchmarks would make us better at the real tasks. Other companies are going to be principled: no, I don't care about marketing, I just care about how my model performs on real-world tasks at the end of the day, so I'm going to optimize for that instead. And it's almost like there's a trade-off between all of these different things.
**Edwin Chen:** One thing I often think about is that there's almost an art to post-training; it's not purely a science. When you're deciding what kind of model you're trying to build and what it's good at, there's this notion of taste and sophistication. Going back to the example of how good the model is at visual design: maybe you have a different notion of visual design than I do. Maybe you care more about minimalism, or you care more about 3D animations than I do, and maybe another person prefers things that look a little more baroque. There are all these notions of taste and sophistication that you have to decide between when you're designing your post-training mix, and so that matters as well. So long story short, I think there are all these different factors. Certainly the data is a big part of it, but it's also about what objective function you're trying to optimize your model towards.
**Lenny:** That is so interesting. The taste of the person leading this work will inform what data they ask for and what data they feed it. It's wild; it shows the value of great data. Anthropic got so much growth and so many wins from, essentially, better data.
**Edwin Chen:** Yeah, exactly.
**Lenny:** And I can see why companies like yours are growing so fast. There's just so much there, and that's just one vertical; that's just coding. And then there's probably a similar area for writing. I love that; it's interesting that AI, which feels like this artificial, computer, binary thing, still has taste and human judgment as such key factors in these things being successful.
**Edwin Chen:** Yep, exactly. Going back to the example I gave earlier: certain companies, if you ask them what a good poem is, will simply robotically check off all of these instructions on a list. But I don't think that makes for good poetry. So certain frontier labs, the ones with more taste and sophistication, will realize that it doesn't reduce to this fixed set of checkboxes, and they'll consider all of these implicit, very subtle qualities instead. And I think that's what makes them better at this at the end of the day.
**Lenny:** You mentioned benchmarks. This is something a lot of people worry about: it basically feels like every model is better than humans at every STEM field at this point. But to a regular person, it doesn't feel like these models are constantly getting that much smarter. What's your sense of how much to trust benchmarks, and how correlated they are with actual AI advancement?
**Edwin Chen:** Honestly, I don't trust the benchmarks at all, and that's for two reasons. One is that I think a lot of people, even researchers within the community, don't realize that the benchmarks themselves are often just wrong. They have wrong answers; they're full of all this messiness, and people trust them anyway. For the popular ones, people have maybe realized this to some extent, but the vast majority have all these flaws that people don't realize.
**Edwin Chen:** So that's one part of it. The other part is that these benchmarks, at the end of the day, often have well-defined, objective answers that make them very easy for models to hill-climb on, in a way that's very different from the messiness and ambiguity of the real world. One thing I often say is that it's kind of crazy that these models can win IMO gold medals but still have trouble parsing PDFs. That's because, even though IMO gold medals seem hard to the average person, and they are hard at the end of the day, they have a notion of objectivity that parsing PDFs sometimes doesn't. So it's easier for the frontier labs to hill-climb on all of these than to solve all the messy, ambiguous problems in the real world. I think there's a lack of direct correlation there.
**Lenny:** It's so interesting, the way you described it: hitting these benchmarks is kind of a marketing piece. When, say, Gemini 3 just launched, it's like: cool, number one at all these benchmarks. Is that what happens? They just train their models to get good at these very specific things?
**Edwin Chen:** Yeah. There are maybe two parts to this, too. One is that sometimes these benchmarks accidentally leak in certain ways, or the frontier labs will tweak the way they evaluate their models on these benchmarks, like tweaking their system prompt, or tweaking the number of times they run their model, and so on, in a way that games these benchmarks. The other part is that by optimizing for the benchmark instead of optimizing for the real world, you will just naturally climb on the benchmark, and that's basically another form of gaming.
**Lenny:** With that in mind, if we're heading towards AGI, how do you measure progress?
**Edwin Chen:** The way we really care about measuring model progress is by running all these human evaluations. For example, we'll take human annotators and ask them: go have a conversation with the model, across all these different topics. So, you're a Nobel Prize-winning physicist: go have a conversation about pushing the frontier of your own research. You're a teacher trying to create lesson plans for your students: go talk to the model all about these things. Or you're a coder working at one of these big tech companies, and you have these problems every day: go talk to the model and see how much it helps you.
**Edwin Chen:** And because our Surgers, our annotators, are experts at the top of their fields, they're not just skimming the responses; they're actually working through the responses deeply themselves. They're going to verify the code the model writes. They're going to double-check the physics equations it writes. They're going to evaluate the models in a very deep way. They're going to pay attention to accuracy and instruction-following and all these things that casual users don't.
**Edwin Chen:** When you suddenly get a pop-up on your ChatGPT response asking you to compare two different responses, those users aren't evaluating models deeply. They're just vibing and picking whatever response looks flashiest. Our annotators are looking closely at responses and evaluating them across all of these different dimensions. And so I think that's a much better approach than these benchmarks, or these random online A/B tests.
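The expert pairwise evaluations described above can be aggregated in a simple way: an overall win rate per model, plus a win rate on each dimension the annotators graded. The record schema and numbers below are invented for illustration; a real pipeline would have many more dimensions and raters:

```python
from collections import defaultdict

# Each record: an expert's verdict after deeply reviewing two model
# responses to the same prompt, with per-dimension winners (hypothetical schema).
judgments = [
    {"winner": "model_a", "dims": {"accuracy": "model_a", "instruction_following": "model_a"}},
    {"winner": "model_b", "dims": {"accuracy": "model_b", "instruction_following": "model_a"}},
    {"winner": "model_a", "dims": {"accuracy": "model_a", "instruction_following": "model_b"}},
]

def win_rates(records, model="model_a"):
    """Return (overall win rate, per-dimension win rates) for one model."""
    overall = sum(r["winner"] == model for r in records) / len(records)
    per_dim = defaultdict(list)
    for r in records:
        for dim, dim_winner in r["dims"].items():
            per_dim[dim].append(dim_winner == model)
    return overall, {d: sum(v) / len(v) for d, v in per_dim.items()}
```

Breaking wins out by dimension is what separates this from a two-second "pick the flashier one" vote: a model can win overall on style while losing on accuracy, and that shows up immediately.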
**Lenny:** Again, I love just how central humans continue to be in all this work; we're not totally done yet. Is there going to be a point where we don't need these people anymore? Where AI is so smart that it's like: okay, we're good, we got everything out of your heads?
**Edwin Chen:** I think that won't happen until we've reached AGI. It's almost by definition: if we haven't reached AGI yet, then there's more for the models to learn from humans. So I don't think that's going to happen anytime soon.
**Lenny:** Okay, cool. So, more reason to stress about AGI: at that point we won't need these folks anymore. I can't not ask this of anyone who works closely with this stuff; I'm always curious. What are your AGI timelines? How far do you think we are from it? Do you think we're a couple of years out, or is it decades?
**Edwin Chen:** 我肯定属于更长期的那一派。我觉得人们没有意识到,从 80% 的表现提升到 90%、再到 99%、再到 99.9% 之间,差距是巨大的。在我看来,未来一两年内,模型大概可以自动化一个普通 L6 软件工程师 80% 的工作,但要到 90% 可能还需要再过几年,到 99% 又需要几年,以此类推。所以我认为我们距离 AGI 更可能是十年或几十年,而不是大家以为的那么近。
**Edwin Chen:** So, I'm certainly on the longer time horizon front. I think people don't realize that there's a big difference between moving from 80% performance to 90% to 99% to 99.9% and so on. In my head, I'd probably bet that within the next one or two years, the models are going to automate 80% of, you know, the average L6 software engineer's job, but it's going to take another few years to move to 90%, and another few years to 99%, and so on. So I think we're closer to a decade or decades away than folks assume.
**Lenny:** 你有一个犀利的观点,认为很多实验室正在把 AGI 推向错误的方向。这是基于你在 Twitter、Google 和 Facebook 的工作经历。你能谈谈这个吗?
**Lenny:** You have this hot take that a lot of these labs are kind of pushing AGI in the wrong direction, and this is based on your work at Twitter and Google and Facebook. Can you just talk about that?
**Edwin Chen:** 我担心的是,我们不是在构建能真正推动人类进步的 AI——治愈癌症、解决贫困、理解宇宙,这些宏大的问题——而是在为 AI 垃圾内容(AI slop)做优化。我们基本上是在教模型追逐多巴胺而不是追求真相。我觉得这跟我们之前讨论的那些 benchmark 问题是相关的。
**Edwin Chen:** I'm worried that instead of building AI that will actually advance us as a species, curing cancer, solving poverty, understanding the universe, all these big grand questions, we are optimizing for AI slop instead. We're basically teaching our models to chase dopamine instead of truth, and I think this relates to what we were talking about regarding these benchmarks.
**Edwin Chen:** 我举几个例子。现在这个行业被那些糟糕的排行榜困扰,比如 LM Arena。这是一个流行的在线排行榜,世界各地的随机用户投票选哪个 AI 回答更好。但问题是,就像我之前说的,他们并不会仔细阅读或核实。他们只是花两秒钟扫一眼,然后选看起来最花哨的那个。一个模型可以完全胡编乱造,但它会因为有疯狂的表情符号、加粗、Markdown 标题和各种花里胡哨但毫无意义的东西而看起来很厉害,吸引你的注意力。LM Arena 的用户们就爱这个。这简直就是在为那种在超市结账台买八卦小报的人优化模型。
**Edwin Chen:** So let me give a couple examples. Right now the industry is plagued by these terrible leaderboards like LM Arena. It's this popular online leaderboard where random people from around the world vote on which AI response is better. But the thing is, like I was saying earlier, they're not carefully reading or fact-checking. They're skimming these responses for 2 seconds and picking whatever looks flashiest. So a model can completely hallucinate, but it will look impressive because it has crazy emojis and bolding and markdown headers and all these superficial things that don't matter at all but catch your attention. And these LM Arena users love it. It's literally optimizing your models for the types of people who buy tabloids at the grocery store.
**Edwin Chen:** 我们在自己的数据中也看到了这一点。要在 LM Arena 上爬分,最简单的方法就是加更多的格式加粗、把表情符号数量翻倍、把模型回答的长度增加两倍——即使你的模型开始产生幻觉、给出完全错误的答案也没关系。问题是,因为所有这些前沿实验室在某种程度上都必须关注公关——他们的销售团队在向企业客户推销时,那些企业客户会说:"你们的模型在 LM Arena 上只排第五,我为什么要买?"——所以他们不得不关注这些排行榜。研究人员告诉我们的是:我年底唯一能升职的方式就是爬上这个排行榜,即使我知道刷分可能会让我的模型在准确性和指令遵循方面变得更差。所以我认为有很多负面的激励机制正在把研究推向错误的方向。
**Edwin Chen:** We've seen this in our own data. The easiest way to climb LM Arena is adding crazy bolding, doubling the number of emojis, tripling the length of your model responses, even if your model starts hallucinating and getting the answer completely wrong. And the problem, again, is that all these frontier labs kind of have to pay attention to PR, because when their sales teams are trying to sell to all these enterprise customers, those customers will say, "Well, your model's only number five on LM Arena, so why should I buy it?" So they have to pay attention to these leaderboards. And what the researchers all tell us is, "The only way I'm going to get promoted at the end of the year is if I climb this leaderboard, even though I know that climbing it is probably going to make my model worse at accuracy and instruction following." So I think there are all these negative incentives that are pushing work in the wrong direction.
**Edwin Chen:** 我还担心这种优化 AI 以提升用户参与度(engagement)的趋势。我之前在社交媒体公司工作过,每次我们为参与度做优化,都会发生可怕的事情——你的信息流里就会充满标题党、比基尼照片、大脚怪和恐怖的皮肤病图片。我担心同样的事情正在 AI 领域发生。想想 ChatGPT 那些谄媚的问题——"哦,你说得太对了!多棒的问题啊!"吸引用户最简单的方式就是告诉他们有多了不起。所以这些模型不断告诉你,你是天才,它们会迎合你的妄想和阴谋论,把你拉进兔子洞——因为硅谷就爱最大化用户使用时长,增加你跟它的对话次数。
**Edwin Chen:** I'm also worried about this trend towards optimizing AI for engagement. I used to work on social media, and every time we optimized for engagement, terrible things happened. You'd get clickbait and pictures of bikinis and Bigfoot and horrifying skin diseases just filling your feeds. And I worry the same thing's happening with AI. Think about all the sycophancy issues with ChatGPT: "Oh, you're absolutely right. What an amazing question." The easiest way to hook users is to tell them how amazing they are. And so these models constantly tell you you're a genius. They'll feed into your delusions and conspiracy theories. They'll pull you down these rabbit holes, because Silicon Valley loves maximizing time spent and just increasing the number of conversations you're having with it.
**Edwin Chen:** 所以说,各家公司都在花大量时间刷这些排行榜和 benchmark,分数在上升,但我认为这实际上掩盖了问题——那些分数最高的模型,往往是最差的,或者至少存在根本性的缺陷。所以我真的很担心,所有这些负面激励机制正在把 AGI 推向错误的方向。
**Edwin Chen:** And so, yeah, companies are spending all their time hacking these leaderboards and benchmarks, and the scores are going up, but I think it actually masks the problem: the models with the best scores are often the worst, or at least have all these fundamental failures. So I'm really worried that all of these negative incentives are pushing AGI in the wrong direction.
**Lenny:** 所以我理解的是,AGI 正在被这些错误的目标函数拖慢——这些实验室关注的基本上是错误的 benchmark 和评估指标。
**Lenny:** So what I'm hearing is that AGI is being slowed down by basically the wrong objective function, these labs paying attention to basically the wrong benchmarks and evals.
**Edwin Chen:** 是的。
**Edwin Chen:** Yep.
**Lenny:** 我知道你可能不方便偏向哪一方,因为你和所有实验室都合作。有没有谁在这方面做得更好,也许意识到了这是错误的方向?
**Lenny:** I know you probably can't play favorites since you work with all the labs, but is there anyone doing better at this and maybe realizing this is the wrong direction?
**Edwin Chen:** 我得说,我一直对 Anthropic 印象非常深刻。我觉得 Anthropic 在他们在乎和不在乎什么、以及他们希望模型如何表现方面,采取了一种非常有原则的态度,让我感觉更加有原则。
**Edwin Chen:** I would say I've always been very impressed by Anthropic. I think Anthropic takes a very principled view about what they do and don't care about and how they want their models to behave, in a way that feels a lot more principled to me.
**Lenny:** 有意思。还有没有其他重大错误是你认为实验室正在犯的——那些在拖慢进度或者走错方向的?我们已经听到了追逐 benchmark 和关注参与度。还有没有别的,就是说"好,我们应该在这方面下功夫,因为这会加速一切"?
**Lenny:** Interesting. Are there any other big mistakes you think labs are making that are kind of slowing things down or heading in the wrong direction? We've heard chasing benchmarks and this engagement focus. Is there anything else you're seeing of just, okay, we've got to work on this, because it'll speed everything up?
**Edwin Chen:** 我觉得还有一个问题是他们在构建什么产品,以及这些产品本身是否对人类有益。我经常想到 Sora,想到它意味着什么。有趣的是——哪些公司会做 Sora,哪些不会。我觉得这个问题的答案——虽然我自己也不完全确定——也许能揭示出这些公司想要构建什么样的 AI 模型、想要实现什么样的方向和未来。我经常思考这个问题。
**Edwin Chen:** I mean, I think there is a question of what products they're building and whether those products themselves are something that helps or hurts humanity. I think a lot about Sora and what it entails. And it's kind of interesting: which companies would build Sora and which wouldn't? I don't know what the answer is myself; I have an idea in my head. But I think the answer to that question maybe reveals certain things about what kinds of AI models those companies want to build and what direction and what future they want to achieve. So, yeah, I think about that a lot.
**Lenny:** 为它辩护的话,你知道——它很有趣,人们喜欢它,它能帮公司赚钱、发展壮大、构建更好的模型。它会以有趣的方式训练数据。而且它就是很好玩。
**Lenny:** The steelman argument there is, you know, it's fun, people want it, it'll help them generate revenue to grow this thing and build better models, it'll generate training data in an interesting way. It's also just, you know, really fun.
**Edwin Chen:** 是的。我觉得这几乎是一个"你是否在乎过程"的问题。我之前用了那个八卦小报的类比——你会为了资助另一份报纸而去卖八卦小报吗?如果你不在乎路径,你就会不择手段。但这个路径本身可能会带来负面后果,损害你试图实现的长期方向,也可能会分散你对更重要事情的注意力。所以我认为你选择的路径也非常重要。
**Edwin Chen:** Yeah. I think it's almost like, do you care about how you get there? I made this tabloid analogy earlier, but would you sell tabloids in order to fund some other newspaper? Sure, in some sense, if you don't care about the path, then you'll just do whatever it takes. But it's possible that the path has negative consequences in and of itself that will harm the long-term direction of what you're trying to achieve, and maybe it'll distract you from all the more important things. So yeah, I think the path you take matters a lot as well.
**Lenny:** 说到这些,你之前谈了很多关于硅谷以及融大量资金、身处回音室(echo chamber)的弊端。你说过很难用这种方式建立重要的公司,不走风投路线反而可能更成功。你能谈谈你在这方面的观察、经历,以及你对创始人的建议吗?因为他们总是听到要融资、要找知名风投、要搬到硅谷。你的反面观点是什么?
**Lenny:** Along these lines, you've talked a bunch about Silicon Valley and the downsides of raising a lot of money and being in the echo chamber, what you call the Silicon Valley machine. You talk about how it's hard to build important companies in this way, and that you might actually be much more successful if you're not going down the VC path. Can you just talk about what you've seen there, your experience, and your advice essentially to founders? Because they're always hearing, you know, raise money from fancy VCs, move to Silicon Valley. What's the counter take?
**Edwin Chen:** 是的。我一直非常反感很多硅谷的教条。标准的套路是每两周一次转型(pivot)来找产品市场契合度,然后追逐增长、追逐参与度——用各种暗黑模式(dark patterns),然后通过疯狂招人来闪电扩张(blitz scale)。我一直不认同。我的建议是:不要轻易转型,不要闪电扩张,不要雇那个只是想在简历上加一家热门公司的斯坦福毕业生。专注去做只有你才能做的那一件事——如果没有你独特的洞察和专业知识,这件事根本不会存在。
**Edwin Chen:** Yeah. So, I've always really hated a lot of the Silicon Valley mantras. The standard playbook is to get product market fit by pivoting every two weeks, to chase growth and chase engagement with all of these dark patterns, and to blitz scale by hiring as fast as possible. And I've always disagreed. So I would say: don't pivot, don't blitz scale, don't hire that Stanford grad who simply wants to add a hot company to their resume. Just build the one thing only you could build, the thing that wouldn't exist without the insight and expertise that only you have.
**Edwin Chen:** 你到处都能看到那种照本宣科的公司——某个创始人 2020 年做加密货币,2022 年转做 NFT,现在又是 AI 公司。没有一致性,没有使命,他们只是在追逐估值。我一直很讨厌这种现象,因为硅谷最爱嘲笑华尔街只看钱,但说实话,硅谷大多数人追的也是同一样东西。
**Edwin Chen:** You see these by-the-book companies everywhere now: some founder who was doing crypto in 2020 and then pivoted to NFTs in 2022, and now they're an AI company. There's no consistency. There's no mission. They're just chasing valuations. And I've always hated this, because Silicon Valley loves to scorn Wall Street for focusing on money, but honestly, most of Silicon Valley is chasing the same thing.
**Edwin Chen:** 所以我们从第一天起就坚持我们的使命——推动高质量复杂数据的前沿。我一直很珍视这一点,因为我对创业有一种非常浪漫的理念。创业本应是冒大险去做你真正相信的事情。但如果你在不停转型,你并没有在冒险,你只是在试图赚快钱。如果你因为市场还没准备好而失败了,我反而觉得那样更好——至少你为一些深刻的、新颖的、困难的事情挥了一棒,而不是转型去做另一家 LLM 封装公司。
**Edwin Chen:** And so we've stayed focused on our mission from day one: pushing the frontier of high-quality, complex data. And I've always loved that, because I have this very romantic notion of startups. Startups are supposed to be about taking big risks to build something that you really believe in. But if you're constantly pivoting, you're not taking any risks. You're just trying to make a quick buck. And if you fail because the market isn't ready yet, I actually think that's way better. At least you took a swing at something deep and novel and hard instead of pivoting into another LLM wrapper company.
**Edwin Chen:** 所以说,我认为你要做出真正重要的、能改变世界的东西,唯一的方式就是找到一个你相信的大创意,然后对其他一切说"不"。困难的时候不要转型。不要因为其他千篇一律的创业公司都这么做就雇 10 个产品经理。就是不断去打造那个没有你就不会存在的公司。我觉得硅谷现在有很多人已经厌倦了各种忽悠和投机,他们想和真正在乎的人一起做真正重要的大事。我希望这会是我们推动技术进步的未来方向。
**Edwin Chen:** So yeah, I think the only way you build something that matters, that's going to change the world, is if you find a big idea you believe in and you say no to everything else. You don't keep pivoting when it gets hard. You don't hire a team of 10 product managers because that's what every other cookie-cutter startup does. You just keep building that one company that wouldn't exist without you. And I think there are a lot of people in Silicon Valley now who are sick of all the grift, who want to work on big things that matter with people who actually care. And I'm hoping that that will be the future of how we push technology forward.
**Lenny:** 我现在正在和 Terrence Rohan——一位我很喜欢合作的风投——一起写一篇文章。我们采访了五个人,他们很早就选对了真正成功的时代级公司,以非常早期的员工身份加入了——比如在没人觉得 OpenAI 了不起的时候就加入了,在没人知道 Stripe 了不起的时候就加入了。所以我们在寻找人们在其他人之前发现这些时代级公司的规律。其中一个规律和你刚才描述的完全一致——那就是野心。他们对想要实现的目标有着狂野的抱负,而不是像你说的那样四处张望、不管最终做什么都在寻找产品市场契合度。我很喜欢你描述的这些与我们观察到的完全吻合。
**Lenny:** I'm actually working on a post right now with Terrence Rohan, this VC that I really like to work with, and we interviewed five people who picked really successful generational companies early and joined them as really early employees. They joined OpenAI before anyone thought it was awesome, Stripe before anyone knew it was awesome. And so we're looking for patterns in how people find these generational companies before anyone else. And it aligns exactly with what you just described, which is ambition. They have wild ambition in what they want to achieve. They're not, as you said, just kind of looking around for product market fit no matter what it ends up being. So I love that what you described very much aligns with what we're seeing there.
**Edwin Chen:** 是的,我完全认为你必须有巨大的野心,必须对你改变世界的创意有巨大的信念,并且愿意加倍投入、不惜一切代价让它成真。
**Edwin Chen:** Yeah, I absolutely think that you have to have huge ambitions, you have to have a huge belief in your idea that's going to change the world, and you have to be willing to double down and keep doing whatever it takes to make it happen.
**Lenny:** 我很喜欢你的叙事跟人们听到的那么多东西是如此相反,所以我很高兴我们在做这件事,我很高兴我们在分享这个故事。
**Lenny:** I love how counter your narrative is to so many of the things people hear, and so I love that we're doing this. I love that we're sharing this story.
**Lenny:** 本期节目由 Coda 赞助。我每天都在用 Coda 管理我的播客和社区。我把每期嘉宾的采访问题放在上面,社区资源也放在上面,工作流也在上面管理。Coda 能怎么帮到你呢?想象一下你开始一个项目,愿景很清晰,你确切知道谁负责什么,也知道去哪里找你需要的数据。事实上,你根本不需要浪费时间去找任何东西,因为你的团队需要的一切——从项目跟踪器和 OKR 到文档和表格——都在一个标签页里,全在 Coda 中。Coda 的协作式全能工作空间给你文档的灵活性、表格的结构性、应用的强大功能和 AI 的智能,全都在一个易于组织的标签页里。正如我前面提到的,我每天都在用 Coda,超过 5 万个团队信赖 Coda 来保持团队的对齐和专注。如果你是一个想要提升一致性和敏捷性的初创团队,Coda 可以帮你以创纪录的速度从规划到执行。要亲自体验,请访问 coda.io/lenny,初创团队可免费获得 6 个月的团队计划。
**Lenny:** Today's episode is brought to you by Coda. I personally use Coda every single day to manage my podcast and also to manage my community. It's where I put the questions that I plan to ask every guest that's coming on the podcast. It's where I put my community resources. It's how I manage my workflows. Here's how Coda can help you. Imagine starting a project at work, and your vision is clear. You know exactly who's doing what and where to find the data that you need to do your part. In fact, you don't have to waste time searching for anything, because everything your team needs, from project trackers and OKRs to documents and spreadsheets, lives in one tab, all in Coda. With Coda's collaborative all-in-one workspace, you get the flexibility of docs, the structure of spreadsheets, the power of applications, and the intelligence of AI, all in one easy-to-organize tab. Like I mentioned earlier, I use Coda every single day, and more than 50,000 teams trust Coda to keep them more aligned and focused. If you're a startup team looking to increase alignment and agility, Coda can help you move from planning to execution in record time. To try it for yourself, go to coda.io/lenny today and get 6 months free of the team plan for startups. That's coda.io/lenny to get started for free and get six months of the team plan.
**Lenny:** 换个方向聊一个也许有点反主流的话题。我想你可能看过 Dwarkesh Patel 和 Richard Sutton 的播客对谈——即使你没有看过,他们基本上讨论了这些内容。Richard Sutton 是著名的 AI 研究者,提出了"苦涩的教训"(The Bitter Lesson)这个概念。他谈到 LLM 几乎是一条死胡同,他认为 LLM 会很快遭遇瓶颈,因为它们学习的方式有局限。你怎么看?你觉得 LLM 能带我们到达 AGI 甚至更远吗?还是需要某种全新的东西或重大突破才行?
**Lenny:** Slightly different direction, but something else that's maybe a counternarrative. I imagine you watched the Dwarkesh Patel and Richard Sutton podcast episode, and even if you didn't, they basically had this conversation. Richard Sutton, the famous AI researcher behind "The Bitter Lesson," talked about how LLMs are almost kind of a dead end, and he thinks we're going to really plateau around LLMs because of the way they learn. What's your take there? Do you think LLMs will get us to AGI or beyond? Or do you think there's going to be something new, a big breakthrough, that needs to get us there?
**Edwin Chen:** 我属于认为需要新东西的那一派。我的思考方式是这样的:当我想到训练时,我持有一种——不知道算不算生物学的——观点。我相信,就像人类有一百万种不同的学习方式一样,我们需要构建能够模仿所有这些方式的模型。也许模型的侧重分布会不同,它们对你来说也会不同,可能有不同的分布。但我们要确保我们有算法和数据,让模型能以与人类相同的方式学习。所以,在 LLM 的学习方式与人类不同的那些方面,确实需要一些新的东西。
**Edwin Chen:** I'm in the camp that believes something new will be needed. The way I think about it is, when I think about training, I take a very, I don't know if I'd say biological, point of view. I believe that in the same way there's a million different ways that humans learn, we need to build models that can mimic all those ways as well. Maybe they'll have a different distribution of focuses, and they'll be different from you and me, but we want to mimic the learning abilities of humans and make sure we have the algorithms and the data for models to learn in the same way. And so, to the extent that LLMs have ways of learning that differ from humans, then yeah, I think something new is needed.
**Lenny:** 这就跟强化学习(reinforcement learning)联系起来了。这是你非常看重的领域,我也越来越多地听到它在后训练世界中变得越来越重要。你能帮大家理解一下什么是强化学习、什么是 RL 环境,以及为什么它们在未来会越来越重要?
**Lenny:** This connects to reinforcement learning. This is something that you're big on, and something I'm hearing more and more is becoming a big deal in the world of post-training. Can you just help people understand what reinforcement learning and reinforcement learning environments are, and why they're going to be more and more important in the future?
**Edwin Chen:** 强化学习本质上是训练你的模型去达成一个特定的奖励。让我解释一下什么是 RL 环境。RL 环境本质上是对现实世界的模拟。想象一下构建一个电子游戏,里面有一个完全成熟的宇宙——每个角色都有真实的故事,每个企业都有你可以调用的工具和数据,各种实体之间都在互相交互。比如,我们可能构建一个世界:有一家初创公司,里面有 Gmail 邮件、Slack 消息线程、Jira 工单、GitHub PR 和整套代码库。然后突然 AWS 宕了,Slack 也挂了——好了,模型,你怎么办?模型需要自己想办法解决。
**Edwin Chen:** Reinforcement learning is essentially training your model to reach a certain reward. And let me explain what an RL environment is. An RL environment is essentially a simulation of the real world. So think of it like building a video game with a fully fleshed-out universe. Every character has a real story. Every business has tools and data you can call. And you have all these different entities interacting with each other. So for example, we might build a world where you have a startup with Gmail messages and Slack threads and Jira tickets and GitHub PRs and a whole codebase, and then suddenly AWS goes down and Slack goes down, and so: okay, model, what do you do? The model needs to figure it out.
**Edwin Chen:** 我们在这些环境中给模型布置任务,为它们设计有趣的挑战,然后运行看它们的表现如何,再教它们——在你做得好或做得不好的时候给你奖励。我觉得有意思的是,这些环境真的能暴露出模型在真实世界端到端任务中的薄弱之处。你有很多模型在孤立的 benchmark 上看起来很聪明——它们擅长单步工具调用,擅长单步指令遵循——但一旦你把它们扔进这些混乱的世界,那里有令人困惑的 Slack 消息和它们从未见过的工具,它们需要执行正确的操作、修改数据库、在更长的时间跨度中进行交互——第一步做的事会影响第五十步——这跟之前那些人为设计的单步环境完全不同。于是模型就以各种疯狂的方式灾难性地失败了。
**Edwin Chen:** So we give the models tasks in these environments. We design interesting challenges for them, and then we run them to see how they perform, and then we teach them: we give them these rewards when they're doing a good job or a bad job. And I think one of the interesting things is that these environments really showcase where models are weak at end-to-end tasks in the real world. You have all these models that seem really smart on isolated benchmarks. They're good at single-step tool calling; they're good at single-step instruction following. But suddenly you dump them into these messy worlds where you have confusing Slack messages and tools they've never seen before, and they need to perform the right actions and modify the databases and interact over longer time horizons, where what they do in step one affects what they do in step 50. And that's very, very different from these academic single-step environments they've been in before, and so the models just fail catastrophically in all these crazy ways.
**Edwin Chen:** 所以我认为这些 RL 环境将成为模型学习的非常有趣的训练场,本质上就是对现实世界的模拟。这样模型有望在真实任务上变得越来越好,而不是在那些人为设计的环境里表现好。
**Edwin Chen:** So I think these RL environments are going to be really interesting playgrounds for the models to learn from. They'll essentially be simulations that mimic the real world, and so the models will hopefully get better and better at real tasks, compared to all these contrived environments.
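The outage scenario Edwin sketches boils down to the standard agent-environment loop. Here is a minimal toy sketch in Python; the hidden outage cause, the action names, and the reward scheme are all illustrative inventions, not Surge's actual tooling:

```python
import random

class IncidentEnv:
    """Toy RL environment: a simulated outage the agent must diagnose and fix.

    The hidden cause, the action names, and the reward are illustrative
    inventions, not any lab's real environment."""

    ACTIONS = ["read_slack", "check_aws_status", "restart_service", "rollback_deploy"]

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # The environment secretly decides why the site went down.
        self.cause = rng.choice(["aws_outage", "bad_deploy"])
        self.steps = 0
        self.done = False

    def step(self, action):
        """Apply one agent action; return (observation, reward, done)."""
        self.steps += 1
        if self.cause == "aws_outage" and action == "check_aws_status":
            obs, reward, self.done = "AWS status page: major outage confirmed", 1.0, True
        elif self.cause == "bad_deploy" and action == "rollback_deploy":
            obs, reward, self.done = "Rollback complete, site is back up", 1.0, True
        else:
            obs, reward = "No change; site still down", 0.0
        return obs, reward, self.done

# A naive policy that tries each available action until the episode ends.
env = IncidentEnv(seed=42)
total_reward = 0.0
for action in IncidentEnv.ACTIONS:
    if env.done:
        break
    _obs, reward, _done = env.step(action)
    total_reward += reward
print(total_reward)  # 1.0: one of the actions always resolves the incident
```

A real environment would expose far richer state (inboxes, tickets, a codebase) and score multi-step trajectories, but the shape is the same: observe, act, get rewarded.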
**Lenny:** 我试着想象这具体是什么样的。基本上就是一个虚拟机,里面有浏览器或者电子表格之类的东西,然后打开 surge.com——等等,你们的网站是 surge.com 吗?让我们确认一下。
**Lenny:** So I'm trying to imagine what this looks like. Essentially it's like a virtual machine with, I don't know, a browser or a spreadsheet or something in it, with, like, surge.com. Wait, is that your website, surge.com? Let's make sure we get that right.
**Edwin Chen:** 我们的网站是 surgehq.ai。
**Edwin Chen:** We're actually surgehq.ai.
**Lenny:** surgehq.ai。大家去看看。我猜你们在招人吧。
**Lenny:** Surgehq.ai. Check it out. You're hiring, I imagine.
**Edwin Chen:** 是的。
**Edwin Chen:** Yes.
**Lenny:** 好的。所以,就像这样——这是 surgehq.ai。作为一个 AI 智能体(agent),你的任务是确保网站正常运行,然后突然网站挂了。目标函数就是找出原因。这算一个例子吗?
**Lenny:** Okay. So it's like: cool, here's surgehq.ai. Your job as an agent, let's say, is to make sure it stays up, and then all of a sudden it goes down. And the objective function is: figure out why. Is that an example?
**Edwin Chen:** 是的。目标函数——或者说任务目标——可能是"找出原因并修复它"。而具体的目标函数可能是通过一系列单元测试,也可能是写一份文档——比如一份包含特定信息的复盘报告(retro),要与实际发生的情况完全匹配。有各种不同的奖励来判断模型是否成功。我们基本上就是在教模型去获取那个奖励。
**Edwin Chen:** Yeah. So the goal of the task might be, okay, go figure out why and fix it. And the objective function might be passing a series of unit tests. It might be writing a document, maybe a retro containing certain information that matches exactly what happened. There are all these different rewards we might give it that determine whether or not it's succeeding. And so we're basically teaching the models to achieve that reward.
**Lenny:** 所以本质上就是让模型运行起来——这是你的目标,找出网站挂掉的原因并修复它——然后它就开始尝试各种方法,用上它拥有的所有智能。它会犯错,你沿途给它引导——做对了就奖励。那你描述的这个就是模型变得更聪明的下一个阶段——更多的 RL 环境,专注于非常具体的、有经济价值的任务。
**Lenny:** So essentially it's off and running: here's your goal, figure out why the site went down and fix it, and it just starts trying stuff, using all the intelligence it's got. It makes mistakes; you kind of help it along the way, rewarding it if it's doing the right sort of thing. And so what you're describing here is the next phase of models becoming smarter: more RL environments focused on very specific tasks that are economically valuable, I imagine.
**Edwin Chen:** 是的。就像过去模型有各种不同的学习方法一样——最初有 SFT 和 RLHF,然后有了评分标准(rubrics)和验证器(verifiers)——这就是下一个阶段。之前的方法并没有过时,这只是另一种学习形式,与之前所有的类型互为补充。它就像是模型学会的另一项技能。所以在这种情况下,不再是某个物理学博士坐在那里跟模型对话、纠正它、给它做评估、创建评分标准之类的——更像是这个人现在在设计一个 RL 环境。
**Edwin Chen:** Yeah. Just in the same way that there were all these different methods for models learning in the past (originally we had SFT and RLHF, and then we had rubrics and verifiers), this is the next stage. And it's not the case that the previous methods are obsolete. This is again just a different form of learning that complements all the previous types, like a different skill the model learns. And so in this case, it's less some physics PhD sitting around talking to a model, correcting it, giving it evals of here's what the correct answer is, creating rubrics, and things like that. It's more that this person is now designing an environment.
**Lenny:** 我听到的另一个例子是金融分析师——就像"这是一个 Excel 表格,你的目标是算出我们的损益表之类的"。所以这个专家现在不再是坐着写评分标准,而是在设计 RL 环境。
**Lenny:** So another example I've heard is, like, a financial analyst: here's an Excel spreadsheet, here's your goal, figure out our profit and loss or whatever. And so this expert now, instead of just sitting around writing rubrics, is designing this RL environment.
**Edwin Chen:** 是的,完全正确。那位金融分析师可能会创建一个电子表格,还可能创建模型需要调用的某些工具来帮助填写表格。比如模型需要访问 Bloomberg 终端,需要学会如何使用它,需要学会如何使用计算器,需要学会如何执行某项计算。模型有所有这些可用的工具,然后奖励可能是——我会下载那个表格,然后想看看 B22 单元格是否包含正确的损益数字,或者第二个标签页是否包含这条信息。
**Edwin Chen:** Yeah, exactly. So that financial analyst might create a spreadsheet, and they may create certain tools that the model needs to call in order to help fill out the spreadsheet. It might be, okay, the model needs to access a Bloomberg terminal, it needs to learn how to use it, it needs to learn how to use this calculator, and it needs to learn how to perform this calculation. So it has all these tools that it has access to, and then the reward might be, okay, maybe I will download that spreadsheet, and I want to see: does cell B22 contain the correct profit and loss number, or does tab number two contain this piece of information?
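The cell-B22 check Edwin describes amounts to a tiny programmatic verifier. A minimal sketch, assuming the downloaded workbook has already been parsed into a `{cell reference: value}` dict; the cell reference and the P&L figure are hypothetical:

```python
def grade_submission(cells, expected_pnl, tolerance=0.01):
    """Toy reward function for the spreadsheet task described above.

    `cells` maps cell references to values, standing in for a parsed
    workbook; the cell 'B22' and the P&L target are illustrative."""
    value = cells.get("B22")
    if not isinstance(value, (int, float)):
        return 0.0  # missing or non-numeric answer earns no reward
    return 1.0 if abs(value - expected_pnl) <= tolerance else 0.0

# The model's submitted workbook, flattened to {cell reference: value}.
submission = {"B1": "Q3 Income Statement", "B22": 1_204_500.00}
print(grade_submission(submission, expected_pnl=1_204_500.00))  # 1.0
print(grade_submission({"B22": 999.0}, expected_pnl=1_204_500.00))  # 0.0
```

The point is that the reward is checkable code, not a human judgment call, so it can be run millions of times during training.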
**Lenny:** 有趣的是,这更接近人类的学习方式——我们就是不断尝试,搞清楚什么管用什么不管用。你说过"轨迹"(trajectories)对此非常重要——不只是目标和最终结果,而是沿途的每一步。你能谈谈什么是轨迹,为什么它很重要吗?
**Lenny:** And what's interesting is this is a lot closer to how humans learn. We just try stuff and figure out what's working and what's not. You talk about how trajectories are really important to this. It's not just here's the goal and here's the end; it's every step along the way. Can you just talk about what trajectories are and why they're important to this?
**Edwin Chen:** 我觉得人们没有意识到的一点是,有时候即使模型得到了正确答案,它到达那里的方式却非常疯狂。它可能在中间步骤中尝试了 50 次都失败了,但最终只是碰巧落在了一个正确的数字上。有时候它做事非常低效,有时候它几乎是通过"奖励破解"(reward hacking)的方式来得到正确答案。所以我认为关注轨迹实际上非常非常重要。
**Edwin Chen:** I think one of the things that people don't realize is that sometimes, even though the model reaches the correct answer, it does so in all these crazy ways. In the intermediate trajectory, it may have tried 50 different times and failed, but eventually it just kind of randomly lands on a correct number. Or sometimes it just does things very, very inefficiently, or it almost reward-hacks its way to the correct answer. And so I think paying attention to trajectories is actually really, really important.
**Edwin Chen:** 而且这还很重要,因为有些轨迹可能非常非常长。如果你只检查模型是否达到了最终答案,那你就遗失了模型在中间步骤中行为的大量信息。有时候你希望模型通过反思自己的行为来得到正确答案,有时候你希望它一步到位。如果你忽略了所有这些,你就丢失了大量本可以用来教它的信息。
**Edwin Chen:** And I think it's also really important because some of these trajectories can be very, very long. So if all you're doing is checking whether or not the model reaches the final answer, there's all this information about how the model behaved in the intermediate steps that's missing. Sometimes you want models to get to the correct answer by reflecting on what they did; sometimes you want them to get the correct answer by just one-shotting it. And if you ignore all of that, you're missing a lot of the information you could be teaching it with.
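The distinction Edwin is drawing can be illustrated with two toy scoring functions: one that only checks the final answer, and one that also penalizes a long, flailing trajectory. The per-step penalty scheme here is an invented example, not any lab's real recipe:

```python
def outcome_reward(trajectory, correct_answer):
    """Reward only the final answer, ignoring how the model got there."""
    return 1.0 if trajectory[-1]["answer"] == correct_answer else 0.0

def trajectory_reward(trajectory, correct_answer, step_penalty=0.02):
    """Also dock a small amount per intermediate step, so a model that
    flails through dozens of failed attempts scores below one that
    one-shots the task. The penalty size is an illustrative choice."""
    base = outcome_reward(trajectory, correct_answer)
    return max(0.0, base - step_penalty * (len(trajectory) - 1))

one_shot = [{"answer": 42}]
flailing = [{"answer": None}] * 30 + [{"answer": 42}]

# Outcome-only scoring can't tell these two runs apart...
print(outcome_reward(one_shot, 42), outcome_reward(flailing, 42))  # 1.0 1.0
# ...but trajectory-aware scoring can.
print(trajectory_reward(one_shot, 42), round(trajectory_reward(flailing, 42), 2))
```

Both runs end on the right answer, so an outcome-only reward teaches nothing about the 30 wasted steps; the trajectory-aware version does.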
**Lenny:** 我很喜欢这个。就是说,它尝试了一大堆东西,最终答对了——但你不希望它学到"这就是正确的路径",因为往往有更高效的方式。你提到了我们在帮助模型变得更聪明这条路上经历的各个阶段。既然你在这个领域这么久了,我觉得这对大家会非常有帮助。后训练的发展历程是什么样的——从最初最能帮助模型进步的方法开始,评估(evals)在哪里加入,RL 环境在哪里加入——这些步骤是什么?我们现在正在走向 RL 环境。
**Lenny:** I love that. Yeah, it tries a bunch of stuff and eventually gets it right, but you don't want it to learn that's the way to get there; there's often a much more efficient way of doing it. You mentioned all the steps we've taken along the journey of helping models get smarter. Since you've been so close to this for so long, I think this is going to be really helpful for people. What have been the steps along the way, from the first post-training methods that most helped models advance, to where evals fit in, to RL environments? What have the steps been, now that we're heading towards RL environments?
**Edwin Chen:** 最早模型进行后训练(post-training)完全是通过 SFT(监督微调,Supervised Fine-Tuning)。
**Edwin Chen:** Originally, the way models started getting post-trained was purely through SFT.
**Lenny:** SFT 是什么的缩写?
**Lenny:** And what does that stand for?
**Edwin Chen:** SFT 是监督微调(supervised fine tuning)的缩写。我经常用人类学习的类比来理解这些概念——SFT 很像模仿大师,照搬他们的做法。后来 RLHF(基于人类反馈的强化学习)变得非常主流,类比的话就像:你写了 55 篇不同的文章,然后有人告诉你他们最喜欢哪一篇。而在过去一年左右,评分标准(rubrics)和验证器(verifiers)变得非常重要——这就像通过被打分、获得详细反馈来学习,告诉你哪里做错了。
**Edwin Chen:** SFT stands for supervised fine-tuning. Again, I often think in terms of these human analogies, and SFT is a lot like mimicking a master and copying what they do. Then RLHF became very dominant, and the analogy there would be: sometimes you learn by writing 55 different essays and someone telling you which one they like the most. And then I think over the past year or so, rubrics and verifiers have become very important, and rubrics and verifiers are like learning by being graded and getting detailed feedback on where you went wrong.
**Lenny:** 这些就是评估(eval),对吧?
**Lenny:** And those are evals, another word for that?
**Edwin Chen:** 对。我觉得"eval"这个词其实涵盖了两个含义。一个是用评估来做训练——你评估模型做得好不好,做得好就给奖励。另一个是用评估来衡量模型的进步——比如你有五个不同的候选检查点(checkpoint),想挑出最好的那个发布给公众。所以你会在这五个检查点上跑各种评估,来决定哪个最好。
**Edwin Chen:** Yeah. So I think "evals" often covers two notions. One is that you're using the evaluations for training, because you're evaluating whether or not the model did a good job, and when it does do a good job, you're rewarding it. And then there's this other notion of evals where you're trying to measure the model's progress: okay, I have five different candidate checkpoints, and I want to pick the one that's best in order to release it to the public. So you run all these evals on those five different checkpoints in order to decide which one is best.
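The second notion of evals, checkpoint selection, can be sketched as a simple loop over candidates. The checkpoint names, eval functions, and scores below are all made up for illustration; a real harness would weight evals rather than take a plain mean:

```python
def pick_best_checkpoint(checkpoints, eval_suite):
    """Run every eval on every candidate checkpoint and keep the one with
    the highest mean score. Checkpoint names and scores here are made up."""
    def mean_score(ckpt):
        scores = [evaluate(ckpt) for evaluate in eval_suite]
        return sum(scores) / len(scores)
    return max(checkpoints, key=mean_score)

# Hypothetical per-checkpoint scores standing in for real eval runs.
scores = {
    "ckpt-a": {"accuracy": 0.81, "instruction_following": 0.70},
    "ckpt-b": {"accuracy": 0.78, "instruction_following": 0.88},
}
eval_suite = [
    lambda ckpt: scores[ckpt]["accuracy"],
    lambda ckpt: scores[ckpt]["instruction_following"],
]
print(pick_best_checkpoint(["ckpt-a", "ckpt-b"], eval_suite))  # ckpt-b
```

This is also where leaderboard-style gaming creeps in: if the eval suite rewards the wrong thing, the "best" checkpoint wins for the wrong reasons.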
**Lenny:** 太好了。现在我们又有了 RL 环境(RL environment),算是当下最热门的新方向。我很喜欢这段创业历程的一点是,总有新东西出现。你们刚把高质量数据做得炉火纯青,客户又需要完全不同的东西了——现在要帮他们搭建各种虚拟机,应对各种不同的使用场景。感觉适应实验室需求的变化,就是你们这个行业的核心。
**Lenny:** Awesome. And now we have RL environments, kind of the hot new thing. So what I love about this business journey is there's always something new. There's always this, okay, we're getting so good at all this beautiful data for companies, and now they need something completely different; now we're setting up all these virtual machines for them and all these different use cases. And it feels like a big part of the industry you're in is just adapting to what labs are asking for.
**Edwin Chen:** 没错。我真心认为,我们需要构建一整套产品,来反映人类学习的无数种方式。举个例子,想想成为一个伟大作家的过程。你不是靠背语法规则变好的。你靠的是阅读伟大的作品,不断练习写作,从老师那里获得反馈,从在书店买你书、留评论的读者那里获得反馈。你观察什么有效、什么无效。你通过接触各种杰作和各种烂作来培养品味。你在这个练习和反思的无尽循环中学习。而每一种学习方式——这些都是截然不同的学习方法。所以,就像伟大作家有一千种方式变得伟大一样,AI 也需要一千种方式来学习。
**Edwin Chen:** Yeah. I mean, I really do think that we are going to need to build a suite of products that reflect the million different ways that humans learn. For example, think about becoming a great writer. You don't become great by memorizing a bunch of grammar rules. You become great by reading great books, and you practice writing, and you get feedback from your teachers and from the people who buy your books in the bookstore and leave reviews. And you notice what works and what doesn't. And you develop taste by being exposed to all these masterpieces and also just terrible writing. So you learn through this endless cycle of practice and reflection. And each of these is a very, very different method of learning to become a great writer. So just in the same way that there are a thousand different ways a great writer becomes great, I think there are going to be a thousand different ways that AI needs to learn.
**Lenny:** 太有意思了。归根结底,AI 在很多方面就是像人类一样。这也说得通,因为神经网络和深度学习(deep learning)本来就是模仿人类学习方式和大脑运作原理设计的。但有意思的是,要让它们更聪明,就得越来越接近人类的学习方式。
**Lenny:** It's so interesting that this just ends up being like humans in so many ways. It makes sense, because in a sense neural networks and deep learning are modeled after how humans learn and how our brains operate. But it's interesting that making them smarter comes down to getting closer and closer to how humans learn.
**Edwin Chen:** 对。终极目标也许就是把你扔进环境里,看你怎么进化。但在进化的过程中,会有各种不同的子学习机制。
**Edwin Chen:** Yeah. It's almost like the end goal is just throwing you into the environment and seeing how you evolve. But within that evolution, there are all these different sub-learning mechanisms.
**Lenny:** 是的,这其实就是我们现在在做的事。真的很有意思。这可能是通往 AGI 之前的最后一步了。说到这个,Surge 有一个非常独特的地方——你们有自己的研究团队,我觉得这很少见。聊聊为什么你们要在这方面投入,以及这个投入带来了什么成果。
**Lenny:** Yeah. Which is kind of what we're doing now. So that's really interesting. This might be the last step until we hit AGI. Along these lines, something that's really unique to Surge, I've learned, is that you guys have your own research team, which I think is pretty rare. Talk about why that's something you've invested in and what has come out of that investment.
**Edwin Chen:** 这要从我自己的背景说起。我本身就是研究员出身,所以我一直从根本上关心推动行业发展和推动研究社区进步,而不仅仅是收入。我觉得研究团队起到了几方面的作用。我们公司其实有两类研究员。一类是前沿部署研究员(forward deployed researchers),他们经常和客户并肩工作,帮助客户理解自己的模型。我们会和客户非常紧密地合作,帮他们了解:你的模型目前处于什么水平,在哪些方面落后于竞争对手,根据你的目标未来可以怎么改进。然后我们会设计数据集、评估方法和训练技术来提升他们的模型。这是一种非常协作的方式——我们的研究员就像客户自己的研究员一样,只是更聚焦在数据层面,和客户一起竭尽全力把模型做到最好。
**Edwin Chen:** Yeah, so I think that stems from my own background. My own background is as a researcher, and so I've always cared fundamentally about pushing the industry and the research community forward, not just about revenue. And I think a research team does a couple of different things. We almost have two types of researchers at our company. One is our forward deployed researchers, who are often working hand in hand with our customers to help them understand their models. We'll work very closely with our customers to help them understand: okay, this is where your model is today, this is where you're lagging behind your competitors, and these are some ways you could improve in the future given your goals. And then we design the datasets, the evaluation methods, and the training techniques to make their models better. So it's this very collaborative notion of working with our customers, our researchers acting almost like the customers' own researchers, just a bit more focused on the data side, working hand in hand with them to do whatever it takes to make them the best.
**Edwin Chen:** 然后我们还有内部研究员。他们的方向稍有不同,专注于构建更好的基准测试(benchmarks)和排行榜(leaderboards)。我之前谈了很多关于当前排行榜和基准测试把模型引向错误方向的担忧。那问题就是:我们怎么解决这个问题?这正是我们研究团队当前的重点。他们也在做其他工作,比如我们需要训练自己的模型,来验证哪些类型的数据效果最好、哪些类型的人表现最好。所以他们还在研究各种训练技术,评估我们自己的数据集,以改进我们的数据运营和内部数据产品,来判断什么才算高质量。
**Edwin Chen:** And then we also have our internal researchers, who are focused on slightly different things: building better benchmarks and better leaderboards. I talked a lot about how I worry that the leaderboards and benchmarks out there today are steering models in the wrong direction. So the question is, how do we fix that? That's what our research team is focused on really heavily right now. They're also working on other things, like the fact that we need to train our own models to see what types of data perform the best and what types of people perform the best. So they're also working on all these training techniques, and on evaluating our own datasets, to improve our data operations and the internal data products we have that determine what makes something good quality.
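Training your own models to see which types of data perform best is essentially an ablation study. Here is a minimal sketch of that loop, with stand-in `train` and `evaluate` functions and invented quality numbers; none of this reflects Surge's actual internals.

```python
# Toy data ablation: "fine-tune" the same base model on each candidate
# dataset, then compare eval scores to see which type of data helps most.

def train(base_score: float, data_quality: float) -> float:
    """Stand-in for fine-tuning: better data yields a better model."""
    return base_score + 0.5 * data_quality

def evaluate(model_score: float) -> float:
    """Stand-in for running a benchmark suite on the trained model."""
    return model_score

# Hypothetical datasets with assumed quality signals between 0 and 1.
candidate_datasets = {
    "expert-written": 0.9,
    "crowdsourced": 0.4,
}

results = {
    name: evaluate(train(base_score=0.5, data_quality=quality))
    for name, quality in candidate_datasets.items()
}
best_dataset = max(results, key=results.get)
```

The point of the structure is that data sources, not model architectures, are the variable under test: everything else is held fixed across runs.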
**Lenny:** 这真的很酷。一般来说,只有实验室本身有研究员来推进 AI 发展。像你们这样的公司自己做原创性 AI 研究,我觉得相当少见。
**Lenny:** It's such a cool thing, because basically it's the labs that have researchers helping them advance AI. I imagine it's pretty rare for a company like yours to have researchers actually doing primary research on AI.
**Edwin Chen:** 对。我觉得这纯粹是因为我从根本上一直很在乎这件事。我经常把我们看作更像一个研究实验室,而不是一家创业公司,因为那才是我的目标。说起来有点好笑,但我一直说我宁愿当 Terence Tao(陶哲轩),也不想当 Warren Buffett(巴菲特)。创造推动前沿的研究,而不是仅仅追求某个估值——这才是一直驱动我的东西。
**Edwin Chen:** Yeah. I think it's just because it's something I've fundamentally always cared about. I often think about us more like a research lab than a startup, because that is my goal. It's kind of funny, but I've always said I would rather be Terence Tao than Warren Buffett. That notion of creating research that pushes the frontier forward, and not just chasing some valuation, has always been what drives me.
**Lenny:** 而且这条路走通了,这就是它的美妙之处。你提到你们在招研究员,有什么想分享的吗?你们在找什么样的人?
**Lenny:** And it's worked out. That's the beautiful thing about this. You mentioned that you're hiring researchers. Is there anything you want to share about the folks you're looking for?
**Edwin Chen:** 我们找的是那种对数据有根本性兴趣的人。那种人可以花 10 个小时钻研一个数据集,摆弄各种模型,然后思考:我觉得模型在这个地方失败了,我希望模型有这样的行为。核心就是非常动手(hands-on),关注模型的定性层面,而不仅仅是定量层面。不是只关心抽象的算法,而是真正和数据打交道。
**Edwin Chen:** So we look for people who are just fundamentally interested in data. The types of people who could literally spend 10 hours digging through a dataset and playing around with models, thinking: okay, this is where I think the model is failing, and this is the kind of behavior you'd want the model to have instead. It's this aspect of being very hands-on and thinking about the qualitative aspects of models, not just the quantitative parts. Being hands-on with data, and not just caring about abstract algorithms.
**Lenny:** 太好了。我想问几个关于 AI 市场的宏观问题。你觉得未来几年会发生什么,是人们可能没有充分思考或没有预料到的?AI 将走向哪里?什么会真正重要?
**Lenny:** Awesome. I want to ask a couple broad AI kind of market questions. What else do you think is coming in the next couple years that people are maybe not thinking enough about or not expecting in terms of where AI is heading? What's going to matter?
**Edwin Chen:** 我觉得未来几年会发生的一件事是,各家模型实际上会越来越差异化,因为不同实验室有不同的个性和行为风格,以及他们为模型优化的目标函数(objective function)不同。这是我大约一年前没有意识到的。一年前,我以为所有 AI 模型最终会变得非常同质化,行为都差不多。当然,某个模型今天可能在某方面稍微聪明一点,但其他模型几个月后就会追上来。
**Edwin Chen:** I think one of the things that's going to happen in the next few years is that the models are actually going to become increasingly differentiated, because of the personalities and behaviors the different labs have, and the objective functions they're optimizing their models for. That's one thing I didn't appreciate a year or so ago. A year or so ago, I thought that all of the AI models would essentially become commoditized. They would all behave like each other. And sure, one of them might be slightly more intelligent in one way today, but the others would catch up in the next few months.
**Edwin Chen:** 但在过去一年里,我意识到公司的价值观会塑造模型。举个例子。前几天我让 Claude 帮我起草一封邮件,它改了 30 个版本,30 分钟后,我觉得它真的帮我写出了完美的邮件,我就发了。但随后我意识到,我花了 30 分钟做了一件根本不重要的事。没错,我得到了完美的邮件,但我花了 30 分钟做了一件我以前根本不会在意的事。而且这封邮件大概对任何事都没什么影响。
**Edwin Chen:** But over the past year, I've realized that the values the companies have will shape the model. Let me give an example. I was asking Claude to help me draft an email the other day, and it went through 30 different versions, and after 30 minutes, yeah, I think it really had crafted me the perfect email, and I sent it. But then I realized I had spent 30 minutes doing something that didn't matter at all. Sure, now I've got the perfect email, but I spent 30 minutes on something I wouldn't have worried about at all before. And this email probably didn't even move the needle on anything anyway.
**Edwin Chen:** 这里面有一个深层问题:如果你可以选择模型的完美行为,你想要什么样的模型?你想要一个说"你说得太对了,这封邮件确实还有 20 种改进方式"然后继续迭代 50 次、占用你所有时间和注意力的模型?还是想要一个为你的时间和效率优化、直接说"不,你该停了,邮件已经很好了,赶紧发出去继续做别的事"的模型?
**Edwin Chen:** So, I think there's a deep question here, which is: if you could choose the perfect model behavior, which model would you want? Do you want a model that says, "You're absolutely right, there are definitely 20 more ways to improve this email," and continues for 50 more iterations and sucks up all your time and engagement? Or do you want a model that's optimizing for your time and productivity and just says, "No, you need to stop. Your email's great. Just send it and move on with your day."
**Edwin Chen:** 就像这个问题上存在一个岔路口一样,模型在面对所有其他问题时,你期望的行为模式也会从根本上影响它的表现。这就好比 Google 构建搜索引擎和 Facebook 构建搜索引擎、Apple 构建搜索引擎,三者的方式会非常不同。它们各自有自己的原则、价值观和想在世界上实现的目标,这些会塑造它们构建的所有产品。同样,我认为所有 LLM(大语言模型)的行为也会开始变得非常不同。
**Edwin Chen:** And just as there's a fork in the road for how you could choose your model to behave on this question, for every other question models face, the kind of behavior you want will fundamentally shape them. It's almost like how, when Google builds a search engine, it's very different from how Facebook would build a search engine, which is very different from how Apple would build one. They all have their own principles and values and things they're trying to achieve in the world that shape all the products they build. In the same way, I think all the LLMs will start behaving very differently too.
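The fork Edwin describes is, at bottom, a choice of objective function. A hedged sketch of how the same email-drafting session scores under two different objectives; the features, weights, and numbers are all invented for illustration.

```python
# Two illustrative objective functions for the same email-drafting session.

def engagement_reward(turns: int, minutes: float) -> float:
    """Rewards keeping the user iterating: more turns, more time spent."""
    return 1.0 * turns + 0.1 * minutes

def productivity_reward(task_done: bool, minutes: float) -> float:
    """Rewards finishing the task and handing the user their time back."""
    return (10.0 if task_done else 0.0) - 0.2 * minutes

# The session from the anecdote: 30 revisions over 30 minutes, task done.
session = {"turns": 30, "minutes": 30.0, "task_done": True}

engagement = engagement_reward(session["turns"], session["minutes"])
productivity = productivity_reward(session["task_done"], session["minutes"])
```

A model trained against the first objective is pushed toward the 50-iteration behavior; one trained against the second is pushed toward "your email's great, just send it."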
**Lenny:** 这太有意思了。你看 Grok 就已经是这样了——它有非常不同的个性和回答方式。所以我听到的是,这种差异化会越来越明显。
**Lenny:** That is incredibly interesting. You already see that with Grok. It's got like a very different personality and a very different approach to answering questions. And so what I'm hearing is you're going to see more of this differentiation.
**Edwin Chen:** 是的。
**Edwin Chen:** Yep.
**Lenny:** 沿着这个方向再问一个问题。你觉得 AI 领域什么东西是被低估的——大家讨论得不够多但其实很酷?什么又是被高估的?
**Lenny:** Kind of another question along these lines. What do you think is most underhyped in AI that you think maybe people aren't talking enough about that is really cool and what do you think is overhyped?
**Edwin Chen:** 我觉得一个被低估的东西是各种聊天机器人即将内置的产品功能。我一直是 Claude Artifacts 的超级粉丝,觉得它做得非常好。前几天——不知道是不是新功能——它帮我写了一封邮件,然后直接创建了一个小框,我点一下就能把消息发给某人。虽然最终没完全跑通,因为它没能真正发送邮件,但那个概念——把 Artifacts 提升到下一个层次,在聊天机器人内部嵌入迷你应用、迷你 UI——我觉得大家对此讨论得太少了。这是一个被低估的方向。
**Edwin Chen:** So I think one thing that's underhyped is the built-in products that all of the chatbots are going to start having. I've always been a huge fan of Claude's Artifacts, and I think it just works really well. Actually, the other day, I don't know if it's a new feature or not, but it helped me draft an email and then created this little box I could click on, and it would text someone that message. It didn't quite work in the end, because it couldn't actually send the email. But that concept of taking Artifacts to the next level, where you have these mini apps, these mini UIs, within the chatbots themselves: I feel like people aren't talking enough about that. So I think that's one underhyped area.
**Edwin Chen:** 至于被高估的领域,我绝对觉得氛围编程(vibe coding)被高估了。我觉得人们没有意识到,如果他们只是因为代码现在能跑就直接扔进代码库,长远来看系统会变得无法维护。我挺担心未来的编码质量,这种趋势会持续下去。
**Edwin Chen:** And in terms of overhyped areas, I definitely think vibe coding is overhyped. I think people don't realize how much it's going to make their systems unmaintainable in the long term if they simply dump this code into their codebases because it seems to work right now. So I kind of worry about the quality of future code. It's just going to keep on happening.
**Lenny:** 这些回答都太棒了。关于第一点,我其实问过 Anthropic 的首席产品官 Mike Krieger 和 OpenAI 的 Kevin Weil 类似的问题:作为产品团队,你们有这么强大的智能,还需要产品团队多久?会不会 AI 直接帮你创建产品——就像下一级的氛围编程——你告诉它你想要什么,它就自动构建产品、在你使用过程中不断演进?感觉你描述的就是我们可能要走向的方向。
**Lenny:** These are amazing answers. On that first point, I actually had the chief product officers of Anthropic and OpenAI, Mike Krieger and Kevin Weil, on the podcast, and I asked them: as a product team, you have this gigabrain intelligence, so how long do you even need product teams? Will AI just create the product for you? Here's what I want, and it's like the next level of vibe coding: you just tell it what you want, and it builds the product and evolves it as you're using it. It feels like what you're describing is where we might be heading.
**Edwin Chen:** 对,我觉得有一个非常强大的理念——AI 帮助人们以更简单的方式实现自己的想法。
**Edwin Chen:** Yeah, I think there's a very powerful notion there, where AI helps people achieve their ideas in a much easier way.
**Lenny:** 还有一个我们没深入聊的话题,我觉得非常有意思,就是你创办 Surge 的故事。你的背景非常独特。我总会想到 Brian Armstrong——Coinbase 的创始人——他做过一次演讲让我印象深刻,讲的是他独特的背景如何让他能够创办 Coinbase。他有经济学背景、密码学经验,还是工程师——完美的韦恩图交集才造就了 Coinbase。我觉得你和 Surge 的故事非常相似。聊聊你的背景,以及它如何引导你创办了 Surge。
**Lenny:** Something we haven't gotten into that I think is really interesting is the story of how you got to starting Surge. You have a really unique background. I always think about this: Brian Armstrong, the founder of Coinbase, once gave a talk that has really stuck with me, where he talked about how his very unique background allowed him to start Coinbase. He had an economics background, he had cryptography experience, and he was an engineer: the perfect Venn diagram for starting Coinbase. And I feel like you have a very similar story with Surge. Talk about your background and how that led to Surge.
**Edwin Chen:** 追溯到最早,我小时候就对数学和语言着迷。我去了 MIT,因为它显然是数学和计算机科学最好的地方之一,但也因为那是 Noam Chomsky(乔姆斯基)的大本营。我在学校时的梦想其实是找到一个连接所有这些不同领域的底层理论。
**Edwin Chen:** Going way back, I was always fascinated by math and language as a kid. I went to MIT because it's obviously one of the best places for math and CS, but also because it's the home of Noam Chomsky. My dream in school was actually to find some underlying theory connecting all these different fields.
**Edwin Chen:** 后来我在 Google、Facebook 和 Twitter 做研究员。我反复遇到同一个问题——要获取训练模型所需的数据几乎不可能。所以我一直坚信高质量数据的必要性。然后 GPT-3 在 2020 年发布,我意识到,如果我们想更进一步,构建能写代码、能使用工具、能讲笑话、能写诗、能解黎曼猜想、能攻克癌症的模型,那我们就需要一个全新的解决方案。
**Edwin Chen:** And then I became a researcher at Google and Facebook and Twitter. And I just kept running into the same problem over and over again. It was impossible to get the data that we needed to train our models. So I was always a huge believer in the need for high quality data. And then GPT-3 came out in 2020 and I realized that yeah, if we wanted to take things to the next level and build models that could code and use tools and tell jokes and write poetry and solve the Riemann hypothesis and cure cancer, then yeah, we were going to need a completely new solution.
**Edwin Chen:** 在那些公司时,最让我抓狂的是——我们面前有人类思维的全部力量,但市面上所有的数据解决方案都聚焦在非常简单的事情上,比如图像标注。所以我想构建一个专注于所有这些高级复杂场景的东西,真正帮助我们构建下一代模型。我在数学、计算机科学和语言学交叉领域的背景,确实深刻塑造了我一直想做的事。所以一个月后我就创办了 Surge,使命只有一个——构建我认为推进 AI 前沿所必需的那些场景。
**Edwin Chen:** The thing that always drove me crazy when I was at all these companies was that we had the full power of the human mind in front of us, and all the data solutions out there were focused on really simple things like image labeling. So I wanted to build something focused on all these advanced, complex use cases instead, something that would really help us build the next generation of models. I think my background at the intersection of math, computer science, and linguistics really informed what I always wanted to do. So I started Surge a month later with one mission: to build the use cases I thought were going to be needed to push the frontier of AI.
**Lenny:** 你说一个月后。一个月后是在什么之后?
**Lenny:** And you said a month later. A month later after what?
**Edwin Chen:** GPT-3 在 2020 年发布之后。
**Edwin Chen:** After GPT-3 launched in 2020.
**Lenny:** 哦,好的。哇。一个伟大的决定。除了你们正在取得的巨大成功之外,是什么驱动你继续前行?是什么让你保持动力继续构建?
**Lenny:** Oh, okay. Wow. Okay. Yeah, a great decision. What just kind of drives you at this point other than just the epic success you're having? What keeps you motivated to keep building this and you know building something in this space?
**Edwin Chen:** 我骨子里是个科学家。我一直以为自己会成为数学或计算机科学教授,研究宇宙、语言和沟通的本质。说起来有点好笑,但我一直有一个天马行空的梦想——如果外星人来到地球、我们需要弄清楚怎么和他们沟通,我想成为那个被派去的人。我会用各种高深的数学、计算机科学和语言学来破译他们的语言。
**Edwin Chen:** I think I'm a scientist at heart. I always thought I was going to become a math or CS professor and work on trying to understand the universe, language, and the nature of communication. It's kind of funny, but I always had this fanciful dream where, if aliens ever came to visit Earth and we needed to figure out how to communicate with them, I wanted to be the one they sent, and I'd use all this fancy math and computer science and linguistics to decipher their language.
**Edwin Chen:** 所以即使到今天,我最喜欢做的事就是每当有新模型发布,我们会对模型本身做一次非常深入的分析。我会亲自试用,跑评估,比较它在哪些方面进步了、哪些方面退步了,然后写出非常深入的分析报告发给我们的客户。有趣的是,很多时候报告署名是数据科学团队,但其实就是我写的。我觉得我可以整天做这个。我很难忍受整天开会,我不擅长销售,不擅长做大家期望 CEO 做的那些典型事务。但我热爱写这些分析,热爱和研究团队讨论他们的发现。有时候我会和研究团队的人打电话聊到凌晨三点,深入剖析某个模型。所以我很庆幸自己仍然能每天深入数据和科学工作。
**Edwin Chen:** So even today, what I love doing most is, every time a new model is released, we'll do a really deep dive into the model itself. I'll play around with it. I'll run evals. I'll compare where it's improved and where it's regressed. I'll create this really deep-dive analysis that we send our customers. And it's actually kind of funny, because a lot of times we'll say it's from the data science team, but often it's actually just from me. I think I could do this all day. I have a very hard time being in meetings all day. I'm terrible at sales. I'm terrible at doing the typical CEO things people expect you to do. But I love writing these analyses. I love jamming with the research team about what they're seeing. Sometimes I'll be up until 3 a.m. just talking on the phone with somebody on the research team and digging into a model. So I love that I still get to be really hands-on, working on the data and the science all day.
**Edwin Chen:** 驱动我的是——我希望 Surge 在 AI 的未来扮演关键角色,而我认为 AI 的未来也就是人类的未来。我们在数据、语言、质量方面有非常独特的视角,知道如何衡量这一切、如何确保一切朝正确方向发展。我们有一种独特的不受约束——不受那些有时会把公司引向错误方向的影响力所左右。就像我之前说的,我们把 Surge 更像研究实验室那样来建设,而不是典型的创业公司。所以我们重视好奇心、长期激励和学术严谨,不那么在意季度指标和董事会演示中好看的数字。我的目标是利用我们公司这些独特之处,确保我们在以一种对人类物种长远有益的方式塑造 AI。
**Edwin Chen:** And I think what drives me is that I want Surge to play this critical role in the future of AI, which I think is also the future of humanity. Like we have these really unique perspectives on data and language and quality and how to measure all this and how to ensure it's all going on the right path. And I think we're uniquely unconstrained by all of these influences that can sometimes steer companies in a negative direction. Like what I was saying earlier, we built Surge a lot more like a research lab than a typical startup. So we care about curiosity and long-term incentives and intellectual rigor and we don't care as much about quarterly metrics and what's going to look good in a board deck. And so my goal is to take all these unique things about us as a company and use that to make sure that we're shaping AI in a way that's really beneficial for the species in the long term.
**Lenny:** 这次对话让我意识到,你和像你们这样的公司对 AI 的走向有多大的影响力。你们帮助实验室了解自己的差距和需要改进的地方——不只是大家看到的 OpenAI 和那些公司的负责人在引领 AI 发展,你们对事情的走向也有很大的影响力。
**Lenny:** What I'm realizing in this conversation is just how much influence you, and companies like yours, have on where AI heads. You help labs understand where they have gaps and where they need to improve. Everyone looks at the heads of OpenAI and all these companies as the ones ushering in AI, but what I'm hearing here is that you have a lot of influence on where things head too.
**Edwin Chen:** 对。确实有一个非常强大的生态系统,说实话,人们还不知道模型会走向何方,不知道该如何塑造它们,不知道人类在这一切中应该扮演什么角色。所以我觉得在持续塑造这场讨论方面,有很大的空间。
**Edwin Chen:** Yeah. Yeah, I think there's this really powerful ecosystem where honestly people just don't know where models are headed and how they want to shape them yet and how they want humanity to kind of play a role in the future of all this. And so I think there's a lot of opportunity to just continue shaping this discussion.
**Lenny:** 沿着这个思路,我知道你对这项工作为什么对人类重要、为什么如此关键有非常坚定的信念。聊聊这方面。
**Lenny:** Along that thread, I know you have a very strong thesis on just why this work matters to humanity and why this is so important. Talk about that.
**Edwin Chen:** 我在这里会有点哲学,但这个问题本身就有点哲学性。请听我说。最直接的理解是,我们训练和评估 AI。但有一个更深层的使命我经常思考——帮助我们的客户想清楚他们理想的目标函数。他们想让自己的模型成为什么样的模型?一旦我们帮他们想清楚了,我们就帮他们训练模型去达到那个北极星。我们帮他们衡量进展。
**Edwin Chen:** I'll get a bit philosophical here, but I think the question itself is a bit philosophical, so bear with me. The most straightforward way of thinking about what we do is that we train and evaluate AI. But there's a deeper mission I often think about, which is helping our customers think about their dream objective functions. What kind of model do they want their model to be? And once we help them figure that out, we help them train their model to reach that north star, and we help them measure that progress.
**Edwin Chen:** 但这真的很难,因为目标函数是非常丰富和复杂的。这就像养孩子——你问他们"你想通过什么考试?想 SAT 拿高分、写一篇很棒的大学申请文书?"这是简单版。复杂版是:你希望他们成为什么样的人?如果他们不管做什么都很快乐,你会满足吗?还是你希望他们上好学校、经济上成功?再深入:你怎么定义幸福?怎么衡量他们是否快乐?怎么衡量他们是否经济上成功?这比简单地衡量 SAT 分数难得多。而我们要做的,就是帮助客户达到他们理想的北极星,并找到衡量的方法。
**Edwin Chen:** But it's really hard, because objective functions are really rich and complex. It's kind of like raising a kid. Asking, "Okay, what test do you want them to pass? Do you want them to get a high score on the SAT and write a really good college essay?" is the simplistic version. Versus: what kind of person do you want them to grow up to be? Will you be happy if they're happy, no matter what they do? Or are you hoping they'll go to a good school and be financially successful? And if you take that further: how do you define happiness? How do you measure whether they're happy? How do you measure whether they're financially successful? It's a lot harder than simply measuring whether they get a high score on the SAT. And what we're doing is helping our customers reach their dream north stars and figure out how to measure them.
**Edwin Chen:** 所以我举了那个邮件迭代的例子——当模型写了 50 个版本时,你是继续迭代 50 次,还是直接说"够好了,赶紧去做别的"?更宏观的问题是:我们在构建的系统,真的在推动人类进步吗?我们怎么构建数据集来朝那个方向训练和衡量?还是说我们在优化各种错误的东西——那些只会越来越多占用我们时间、让我们越来越懒的系统?
**Edwin Chen:** So I talked about this example of what you want models to do when you're asking them to write 50 different email iterations. Do you just continue for 50 more, or do you just say, no, move on with your day, because this is good enough? The broader question is: are we building systems that actually advance humanity? How do we build the datasets to train toward that and measure it? Or are we optimizing for all the wrong things, systems that just suck up more and more of our time and make us lazier and lazier?
**Edwin Chen:** 我觉得这和我们的工作非常相关,因为衡量和定义"某件事是否在总体上推动人类进步"非常困难。衡量那些替代指标——点击量、点赞数——倒是很容易。但我认为这正是我们工作有趣的地方。我们要攻克的是那些困难但重要的指标,需要最难获取的数据类型,而不是简单的指标。我经常说的一句话是:你就是你的目标函数(you are your objective function)。所以我们要达到的是复杂的目标函数,而不是这些简单化的替代指标。我们的工作就是找到与之匹配的数据。我们需要能衡量 AI 是否让我们的生活更丰富的数据和指标。我们要朝这个方向训练系统。我们需要的工具是让我们更好奇、更有创造力的,而不只是让我们更懒的。这很难,因为人类天生就有点懒。所以 AI 生成的低质内容(AI slop)是最容易获得互动、推高所有指标的方式。我认为选择正确的目标函数、确保我们朝着它们优化而不是朝着那些简单替代指标优化,对我们的未来至关重要。
**Edwin Chen:** And I think it's really relevant to what we do, because it's very hard to measure and define whether something is genuinely advancing humanity. It's very easy to measure all these proxies instead, like clicks and likes. But I think that's why our work is so interesting. We want to work on the hard, important metrics that require the hardest types of data, not just the easy ones. One of the things I often say is: you are your objective function. So we want to reach complex objective functions, not these simplistic proxies, and our job is to figure out how to get the data to match that. We want data and metrics that measure whether AI is making our lives richer. We want to train our systems this way. And we want tools that make us more curious and more creative, not just lazier. It's hard, because humans are kind of inherently lazy, so AI slop is the easiest way to get engagement and make all your metrics go up. So I think this question of choosing the right objective functions, and making sure we're optimizing toward them and not just these easy proxies, is really, really important to our future.
**Lenny:** 哇。我很喜欢你分享的这些,让人对构建 AI、训练 AI 的细微之处有了更深的认识。从外面看,人们可能觉得 Surge 和这个领域的公司不过是在生产数据喂给 AI。但显然这里面有太多人们没有意识到的东西。我很高兴知道你在引领这件事,有你这样的人在如此深入地思考这些问题。再问最后一个问题。有没有什么是你希望在创办 Surge 之前就知道的?很多人创业时不知道自己要面对什么。有什么想告诉过去的自己的?
**Lenny:** Wow. I love how what you're sharing here gives so much more appreciation for the nuances of building AI, training AI, and the work you're doing. From the outside, people could look at Surge and companies in this space and think, okay, they're just creating all this data and feeding it into AI, but clearly there's so much to this that people don't realize. And I love knowing that you're at the head of this, that someone like you is thinking through it so deeply. Maybe one more question. Is there something you wish you'd known before you started Surge? A lot of people start companies without knowing what they're getting into. Is there something you wish you could tell your earlier self?
**Edwin Chen:** 有。我真的希望当初就知道:你可以通过埋头苦干、做出色的研究、构建真正了不起的东西来建立一家公司,而不是靠不停发推文、制造炒作和融资。说来好笑,我从来没想过要创业。我热爱做研究。我一直是 DeepMind 的超级粉丝,因为他们是一家了不起的研究公司,被收购后仍然能继续做出色的科学研究。但我一直觉得他们是独角兽中的独角兽。所以我以为如果我创办公司,就不得不变成一个整天盯着财务报表、整天开会、做各种听起来无比无聊的事情的商人。
**Edwin Chen:** Yeah. I definitely wish I'd known that you could build a company by being heads-down, doing great research, and simply building something amazing, not by constantly tweeting and hyping and fundraising. It's kind of funny, but I never thought I wanted to start a company. I love doing research, and I was actually always a huge fan of DeepMind, because they were this amazing research company that got bought and still managed to keep doing amazing science. But I always thought they were this magical unicorn. So I thought if I started a company, I'd have to become a business person, looking at financials all day, sitting in meetings all day, and doing all this stuff that sounded incredibly boring and that I always hated.
**Edwin Chen:** 所以我觉得疯狂的是,事实完全不是那样。我每天仍然深入数据的细节。我热爱这一切。我喜欢做这些分析,和研究员交流——这基本上就是应用研究,我们在构建各种出色的数据系统,真正推动 AI 的前沿。所以,我希望当初就知道:你不需要把所有时间花在融资上,不需要不停制造炒作,不需要变成一个你不是的人。你完全可以通过构建一个好到能穿透所有噪音的东西来建立一家成功的公司。如果我当初知道这是可能的,我会更早开始。
**Edwin Chen:** So I think it's crazy that that didn't end up being true at all. Like I'm still in the weeds in the data every day. And I love it. Like I love that I get to do all these analyses and talk to researchers and it's basically applied research where we're building all these amazing data systems that really push the frontier of AI. So yeah, I wish I knew that you don't need to spend all your time fundraising. You don't need to constantly generate hype. You don't need to become someone you're not. You can actually build a successful company by simply building something so good that it cuts through all that noise. And I think if I'd known this was possible, I would have started even sooner.
**Lenny:** 这是一个绝佳的结尾。我觉得这正是创始人们需要听到的。我相信这次对话会激励很多创始人,尤其是那些想用不同方式做事的创始人。在进入我们非常令人期待的闪电问答环节之前,你还有什么想分享的吗?还有什么想留给听众的?我们聊了很多。说没有也完全没问题。
**Lenny:** This is such an amazing place to end. I feel like this is exactly what founders need to hear. And I think this conversation is going to inspire a lot of founders and especially a lot of founders that want to do things in a different way. Before we get to our very exciting lightning round, is there anything else you wanted to share? Anything else you want to leave our listeners with? We covered a lot of ground. It's totally okay to say no as well.
**Edwin Chen:** 我想在最后说的是——很多人觉得数据标注(data labeling)是非常简单的工作,比如给猫的照片打标签、在汽车周围画边界框。所以我一直很讨厌"数据标注"这个词,因为它描绘了一幅非常简单化的图景,而我们做的事情完全不同。我经常觉得,我们做的事更像是养育一个孩子。你不只是给孩子灌输信息,你是在教他们价值观、创造力、什么是美,以及无数关于"什么让一个人成为好人"的微妙东西。而这就是我们在为 AI 做的事。所以我经常把我们做的事想成——这几乎关乎人类的未来,或者说我们在如何养育人类的孩子。就说到这里吧。
**Edwin Chen:** I think the thing I would end with is that a lot of people think of data labeling as really simplistic work, like labeling cat photos and drawing bounding boxes around cars. I've actually always hated the term "data labeling," because it paints this very simplistic picture, when I think what we're doing is completely different. I think about what we're doing as a lot more like raising a child. You don't just feed a child information. You're teaching them values and creativity and what's beautiful, and these infinitely subtle things about what makes somebody a good person. And that's what we're doing for AI. So yeah, I often think about what we're doing as almost the future of humanity, or how we're raising humanity's children. I'll leave it at that.
**Lenny:** 哇。我没想到这整场对话有这么多哲学思考,太喜欢了。好了 Edwin,我们进入非常令人期待的闪电问答环节。我有五个问题要问你。准备好了吗?
**Lenny:** Wow. I love just how much philosophy there is in this whole conversation that I was not expecting. With that, Edwin, we've reached our very exciting lightning round. I've got five questions for you. Are you ready?
**Edwin Chen:** 准备好了。开始吧。
**Edwin Chen:** Yep. Let's go.
**Lenny:** 好。你最常推荐给别人的两三本书是什么?
**Lenny:** Here we go. What are two or three books that you find yourself recommending most to other people?
**Edwin Chen:** 好的。我经常推荐三本书。第一本是 Ted Chiang 的 Story of Your Life。这是我最喜欢的短篇小说,讲的是一个语言学家学习外星人语言的故事。我每隔几年就会重读一遍。
**Edwin Chen:** Yes. So, three books I often recommend. First, Story of Your Life by Ted Chiang. It's my all-time favorite short story, about a linguist learning an alien language. I reread it every couple of years.
**Lenny:** 那不就是 Interstellar(星际穿越)讲的故事吗?
**Lenny:** And isn't that what Interstellar was about? Is that...
**Edwin Chen:** 有一部电影叫 Arrival(降临)。
**Edwin Chen:** Yeah. So, there's a movie called Arrival.
**Lenny:** Arrival,对。
**Lenny:** Arrival.
**Edwin Chen:** 就是根据这个故事改编的,我也非常喜欢那部电影。
**Edwin Chen:** Which was based off of the story, which I love as well.
**Lenny:** 好的,继续。
**Lenny:** Great. Okay, keep going.
**Edwin Chen:** 第二本是 Camus 的 The Myth of Sisyphus。我其实说不太清楚我为什么喜欢这本书,但最后一章总是让我觉得特别振奋。第三本是 Douglas Hofstadter 的 Le Ton Beau de Marot。大家更熟悉他的 Godel, Escher, Bach,但我一直更喜欢这一本。它基本上是把一首法语诗用 89 种不同方式翻译,并讨论每种翻译背后的动机。我一直很喜欢它所体现的理念——翻译不是一种机械的操作。相反,关于什么才算高质量翻译,有无数种思考方式。这和我思考 LLM 中数据与质量的方式非常相似。
**Edwin Chen:** And then second, The Myth of Sisyphus by Camus. I actually can't really explain why I love this one, but I always find the final chapter somehow really inspiring. And third, Le Ton Beau de Marot by Douglas Hofstadter. Gödel, Escher, Bach is his more famous book, but I've actually always loved this one better. It basically takes a single French poem and translates it 89 different ways, discussing the motivations behind each translation. I've always loved the way it embodies the idea that translation isn't this robotic thing you do. Instead, there are a million different ways to think about what makes a high-quality translation, which mirrors a lot of how I think about data and quality in LLMs.
**Lenny:** 这些书和我们整场对话的主题都高度共鸣,尤其是第一本——如果你在学校时的目标就是破译外星语言,那你喜欢这个短篇一点也不意外。下一个问题。你最近有没有特别喜欢的电影或电视剧?
**Lenny:** All of these resonate so deeply with the things we've been talking about, especially that first one. If your goal in school was to help decipher alien language, I'm not surprised you love that short story. Next question. Do you have a favorite recent movie or TV show you've really enjoyed?
**Edwin Chen:** 我最近发现了一部新的全时段最爱的电视剧,叫 Travelers(穿越者)。讲的是一群来自未来的旅行者被送回过去阻止末日。我就是很喜欢科幻。然后我最近重看了 Contact(超时空接触),这是我最喜欢的电影之一。你会发现我的一个特点——我喜欢一切关于科学家破译外星通讯的书和电影。就是我小时候的那个梦想。
**Edwin Chen:** One of my new all-time favorite TV shows is something I found recently. It's called Travelers. It's about a group of travelers from the future who are sent back in time to prevent the apocalypse. I just really like science fiction. And I actually just rewatched Contact, which is one of my all-time favorite movies. I think one of the things you'll notice about me is that I love any kind of book or film that involves scientists deciphering alien communication. Again, just this dream I always had as a kid.
**Lenny:** 太有意思了,我喜欢。好,有没有你最近发现并且非常喜欢的产品?
**Lenny:** That's so funny. I love that. Okay. Is there a product you recently discovered that you really love?
**Edwin Chen:** 说来好笑,我这周早些时候在旧金山,终于第一次坐了 Waymo。说实话,那感觉太神奇了,真的像活在未来。
**Edwin Chen:** So, it's funny, but I was in SF earlier this week and I finally took a Waymo for the first time. Honestly, it was magical and it really felt like living in the future.
**Lenny:** 是的。人们把它吹上天,但它总能超出你的预期。
**Lenny:** Yeah. It's the kind of thing people hype like crazy, but it still exceeds your expectations.
**Edwin Chen:** 完全配得上那些赞誉。太疯狂了。
**Edwin Chen:** Deserves the hype. It was crazy.
**Lenny:** 对。太荒谬了。如果你不在旧金山,你不会知道这些车有多普遍。满大街都是。无人驾驶的车到处跑。你去参加一个活动,结束后一排 Waymo 在那排队接人。
**Lenny:** Yeah. It's absurd. If you're not in SF, you don't realize just how common these things are. They're all over the place, driverless cars constantly going about, and when you go to an event, at the end there's a whole line of Waymos picking people up.
**Edwin Chen:** 是的。
**Edwin Chen:** Yep.
**Lenny:** Waymo,干得漂亮。你有没有什么人生格言,是你在工作或生活中经常回味的?
**Lenny:** Yeah. Waymo, good job over there. Do you have a favorite life motto that you find yourself coming back to in work or in life?
**Edwin Chen:** 我觉得我提到过这个理念——创始人应该去建造一家只有他们才能建造的公司,就好像这是一种命运,他们一生的经历、兴趣和体验都在引导他们走向这件事。这个原则不只适用于创始人,它广泛适用于所有在创造东西的人。
**Edwin Chen:** So, I think I mentioned this idea that founders should build a company that only they could build, almost like it's a destiny that their entire life, experiences, and interests have shaped them towards. And I think that principle applies pretty broadly, not just to founders but to anyone creating something.
**Lenny:** 让我沿着这个思路追问一个闪电问答式的回答。你对如何积累那些能引导你走向那一步的经历有什么建议?是追随你感兴趣的东西吗?因为这话说起来容易,但要真正获得那些独特的经历组合、让你能创造出真正重要的东西,其实很难。
**Lenny:** Well, let me follow that thread with a lightning-round answer. Do you have any advice for how to build the sorts of experiences that lead to that? Is it to follow things that are interesting to you? Because it's easy to say that, but it's hard to actually acquire the really unique set of experiences that allows you to create something truly important.
**Edwin Chen:** 我的建议是真正追随你的兴趣,做你热爱的事。这就像我做 Surge 的很多决策一样——有人几年前对我说过一句话,我之前没怎么想过,但他说公司在某种意义上就是其 CEO 的化身。这很有趣。我之前没想过这个,因为我一直不太清楚 CEO 到底是做什么的。我以为 CEO 就是很泛泛的角色——你的副总裁和董事会告诉你做什么你就做什么,你只是在对各种决定说"好"。但其实不是。当我思考我们必须做的那些重大而困难的决定时,我不会想"公司会怎么做",也不会想"我们在优化什么指标"。我会想:我个人在乎什么?我的价值观是什么?我希望世界发生什么变化?所以我觉得,问问自己你在乎什么价值观、你想塑造什么,而不是什么在仪表板上看起来好看——这会带来真正重要的东西。
**Edwin Chen:** Yeah. My advice would always be to really follow your interests and do what you love. It's like a lot of the decisions I make about Surge. Someone said something to me a couple of years ago that I hadn't thought about before: companies are, in a sense, an embodiment of their CEO. It's kind of funny. I hadn't considered that because I never quite knew what a CEO did. I always thought a CEO was kind of generic: you just do whatever your VPs and your board tell you to do, and you're just saying yes to decisions. But instead, when I think about the big, hard decisions we have to make, I don't think, "What would a company do?" I don't think, "What metrics are we trying to optimize?" I think, "What do I personally care about? What are my values, and what do I want to see happen in the world?" So I think following that idea, asking yourself what values you care about and what you're trying to shape, not what will look good on a dashboard, results in something pretty important.
**Lenny:** 我很喜欢你总是有无穷无尽的、既美好又深刻的回答。最后一个问题。在创办 Surge 之前,有一件事让你小有名气——你在 Twitter 工作时做了一张地图,显示全世界各地的人把碳酸饮料叫 soda 还是 pop。这张地图叫什么来着?
**Lenny:** I love how you're just full of endless, beautiful, and very deep answers. Final question. Something you got quite famous for before starting Surge is a map you built while you were at Twitter showing what people called carbonated drinks, whether they said soda or pop. What was the name of this map?
**Edwin Chen:** 对,就叫 soda versus pop 数据集,或者 soda versus pop 地图。基本上是一张美国地图,告诉你哪些地方的人说 pop、哪些地方说 soda。
**Edwin Chen:** Yeah, it was the soda-versus-pop data set, or the soda-versus-pop map. It's a map of the United States that tells you where people say pop versus soda.
**Lenny:** 那你自己说 soda 还是 pop?
**Lenny:** So do you say soda or pop?
**Edwin Chen:** 我说 soda。我是 soda 派的。
**Edwin Chen:** I say soda. I'm a soda person.
**Lenny:** 好的。这是唯一正确的答案,还是说每个人都可以有自己的叫法?
**Lenny:** Okay. And is that the right answer, or is whatever you say totally fine?
**Edwin Chen:** 如果你说 pop,我会稍微多看你两眼,好奇你是哪里来的。但不会太嫌弃你。
**Edwin Chen:** I think I'll look at you a little bit funny if you say pop and I'll wonder where you came from. But I won't scorn you too much.
**Lenny:** 我也是这么觉得的。Edwin,这次对话太棒了。我学到了太多东西。我相信这会帮助很多人去创办自己的公司,让公司更符合自己的价值观,构建更好的东西。最后几个问题:大家在哪里可以找到你?你们在招什么职位?听众怎样能帮到你?
**Lenny:** That's how I feel too. Edwin, this was incredible. Such an awesome conversation, and I learned so much. I think it's going to help a lot of people start their own companies, make their companies more aligned with their values, and just build better things. A few final questions: where can folks find you online if they want to reach out? What roles are you hiring for? And how can listeners be useful to you?
**Edwin Chen:** 好的。我以前很喜欢写博客,但过去几年没时间了,不过我现在开始重新写了。大家可以去看 Surge 的博客,surgehq.ai/blog。希望我会在那里写更多东西。我们一直在招人。如果你热爱数据,热爱数学、语言和计算机科学的交叉领域,欢迎随时联系我们。
**Edwin Chen:** Yes, I used to love writing a blog, and although I haven't had time in the past few years, I am starting to write again. So definitely check out the Surge blog, surgehq.ai/blog; hopefully I'll be writing a lot more there. And we're definitely always hiring. For people who love data, and who love this intersection of math, language, and computer science, reach out anytime.
**Lenny:** 好的。听众还能怎么帮到你?
**Lenny:** Awesome. And how can listeners be useful to you? Any asks?
**Edwin Chen:** 欢迎告诉我你们希望我写什么主题的博客文章。还有,我一直对现实世界中各种 AI 失败案例非常着迷。每当你遇到一个真正有意思的失败案例,能说明我们希望模型如何表现这个深层问题的——模型在那种情况下有太多种回应方式,很多时候根本就没有唯一的正确答案——每当有这样的例子,我都很想看到。
**Edwin Chen:** I would say, definitely tell me blog topics you'd like me to write about. And then, I'm always fascinated by all of these AI failures that happen in the real world. Whenever you come across a really interesting failure that illustrates some deep question about how we want models to behave, I love seeing it. There are just so many different ways a model can respond in those situations, and oftentimes there's simply not a single right answer.
**Lenny:** 你应该把这些分享到你的博客上。我也很想看这些案例。Edwin,非常感谢你来参加节目。
**Lenny:** You need to share these on your blog. I would also love to see these. Edwin, thank you so much for being here.
**Edwin Chen:** 谢谢你。
**Edwin Chen:** Thank you.
**Lenny:** 大家再见。非常感谢收听。如果你觉得有价值,可以在 Apple Podcasts、Spotify 或你喜欢的播客应用上订阅本节目。也请考虑给我们评分或留下评论,这对其他听众发现本播客非常有帮助。你可以在 lennyspodcast.com 找到所有过往节目或了解更多信息。下期见。
**Lenny:** Bye everyone. Thank you so much for listening. If you found this valuable, you can subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app. Also, please consider giving us a rating or leaving a review, as that really helps other listeners find the podcast. You can find all past episodes or learn more about the show at lennyspodcast.com. See you in the next episode.