Stanford CS 153: Infra @ Scale - Cursor CTO & Co-Founder Sualeh Asif
Overview
Cursor co-founder and CTO Sualeh Asif visits Stanford CS 153 to share what 100x infrastructure growth looks like in practice: near-fatal database incidents, the migration to object storage, and the model-provider negotiation chain.
Key Insights
- Cursor has grown roughly 100x in the past year. Its in-house models handle about 100 million calls per day; autocomplete runs on every keystroke, peaking at roughly 20,000 model calls per second on a globally distributed fleet of about 1,000 to 2,000 H100s. Cursor accounts for a "significant fraction" of frontier-model traffic, and the indexing system processes about a billion documents per day, hundreds of billions cumulatively. This is already a genuinely global-scale service.
- Cursor's infrastructure rests on three pillars: indexing, models, and the product layer. Indexing covers the retrieval system and Git-history understanding; models cover the in-house autocomplete model and frontier-model calls; the product layer holds engineering tricks, like the apply model, that keep the experience smooth. Apply looks like copy-paste but is actually a model processing 100,000 to 200,000 tokens. A fourth pillar, data-streaming infrastructure, powers continuous improvement.
- Two signature Sev incidents exposed the deep challenges of distributed systems. In September 2023 the indexing system fell into an infinite loop driven by a DynamoDB caching bug and a Merkle-tree race condition; later, the Postgres database bloated to 22TB, vacuum failed, and the database ground to a complete halt. The latter was resolved when co-founder Arvid, mid-incident, spent two hours rewriting the core storage onto object storage (S3/R2). An emergency stopgap became the official architectural direction.
- Capacity poker with model providers is a core day-to-day pain. Cursor is likely among the largest customers of several frontier-model providers and constantly faces rate limits and provider outages. The playbook: run multiple providers at once, deploy multi-cloud, and balance traffic between providers in real time. Phoning for quota on the eve of a model launch is routine; the night before 3.5 Sonnet shipped, co-founder Aman was still hunting for someone who could reach Anthropic about a higher quota.
- The thread running through the whole talk is a philosophy of building large-scale infrastructure with a small team. From choosing Postgres over YugaByte ("don't get fancy"), to rewriting a system onto object storage live during an incident, to using AI assistance to cover far more infrastructure surface area than headcount suggests, Sualeh keeps returning to one idea: simple architecture plus exceptional individuals plus AI leverage beats piling on people and complexity.
Cursor's Scale: 100 Million Model Calls a Day, GPUs Across the Globe
Key point: Cursor grew about 100x in the past year; its in-house models serve 100 million calls a day and the indexing system processes a billion documents daily. It is already a global-scale infrastructure service.
- The autocomplete model runs on every single keystroke, peaking at roughly 20,000 model calls per second on about 1,000 to 2,000 H100s. Measured in input tokens, autocomplete far exceeds frontier-model calls, since every keystroke touches tens of thousands of tokens.
- GPU clusters sit on the US East Coast (Virginia), in the western US (Phoenix), and in London and Tokyo. A Frankfurt deployment was attempted but proved unstable. The global spread keeps autocomplete relatively fast even for users in Japan.
- The indexing system processes about a billion documents per day, and hundreds of billions since the company was founded. Some users' repositories rival those of companies like Instacart.
- Cursor accounts for a "significant fraction" of frontier-model traffic, and has also begun hosting some large models itself.
"Over the past year we've grown roughly 100x, and along some dimensions even more." — Sualeh Asif
The Three Infrastructure Pillars: Indexing, Models, and the Product Layer
Key point: Cursor's infrastructure splits into three pillars. The indexing system understands the repository, the model layer handles autocomplete and frontier reasoning, and the product layer uses careful engineering to make the experience feel like you are not using a model at all.
- Indexing: the retrieval system (searching the repository when a user asks the agent a question) and Git-history indexing (understanding how the codebase evolved). The design principle is full automation: opening the editor triggers indexing, no questions asked, with Cursor footing the bill.
- Models: the in-house autocomplete model runs on every keystroke; frontier models (Anthropic, OpenAI, etc.) handle the heavier reasoning. The two have completely different traffic profiles.
- Product layer: the apply model is the flagship example. It feels as fast as copy-paste, but is actually a model processing 100,000 to 200,000 tokens. The sensation of "not using a model at all" rests on a pile of inference-level optimizations.
- The fourth, hidden pillar is data-streaming infrastructure: storing and processing incoming data for screening and background jobs. Users never touch it directly; it is the engine that makes Cursor better every day.
"You want apply to feel so fast that you don't think you're using a model at all; it feels like copy-paste. But of course it isn't copy-paste. It's a model processing 100,000 to 200,000 tokens." — Sualeh Asif
Architecture Philosophy: Big Monolith + Strict Isolation + Server-Side Simplicity
Key point: Cursor runs a large monolithic architecture, but strict service isolation keeps critical services insulated from experimental code, and server-side code treats simplicity as the highest principle.
- User requests land on East Coast servers as one big monolith deployment. The key at this scale: keep the things that truly matter in safely isolated zones, away from experimental code.
- The lesson from June-July 2023: everything ran on one big server, someone wrote a "particularly gnarly" infinite loop, and it took down other people's chat service. Compartmentalizing the servers became a top priority afterward (see the sketch after this list).
- While a user writes code with Composer, many background requests decide which parts of the repository make it into the final prompt. A bidirectional stream runs between the server side and clients around the world.
- The core architectural rule: "Server-side code must stay strictly simple. If it's too complicated you can't understand it, and if you can't understand it you can't run it reliably."
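A minimal sketch of that compartmentalization idea (hypothetical names, not Cursor's actual code): run experimental handlers in a separate process with a hard timeout, so a runaway infinite loop dies alone instead of taking login or chat down with it.

```python
# Sketch: isolate experimental handlers from core services (hypothetical).
# Assumes the "fork" start method (the Linux default) so a local closure
# can be used as the process target.
import multiprocessing as mp

def run_experimental(handler, payload, timeout_s: float = 5.0):
    """Run an experimental handler in its own process with a hard budget."""
    result = mp.Queue(maxsize=1)

    def wrapper():
        result.put(handler(payload))

    proc = mp.Process(target=wrapper, daemon=True)
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():              # e.g. an accidental `while True:` loop
        proc.terminate()             # only the sandboxed process dies
        return {"error": "experimental handler timed out"}
    try:
        return result.get_nowait()
    except Exception:                # handler crashed before producing output
        return {"error": "experimental handler crashed"}
```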
The Indexing System's Merkle Trees and the Big September 2023 Incident
Key point: the indexing system reconciles file changes between client-side and server-side Merkle trees, but a subtle DynamoDB caching bug plus a Merkle-tree race condition triggered a painful cascading failure.
The core mechanism is Merkle-tree synchronization: client and server each maintain a tree in which every file has a hash and every folder's hash combines its children's. When a user reboots or checks out an old revision, comparing root hashes and descending level by level locates the changed files.
Sualeh was a database nerd at MIT and initially chose YugaByte (a descendant of Google Spanner), expecting it to "scale infinitely." Instead the team "just could not get it to run," burning money while scaling nodes down. The lesson came with the migration to RDS Postgres: "Don't pick a complicated database. Go with the hyperscalers; they know what they're doing. Use Postgres. Don't get fancy."
After the move to Postgres, traffic spiked and exposed a cascading failure chain:
1. Cache-layer bug: the DynamoDB cache had a subtle bug where sufficiently large files were never cached, and the team was not monitoring that database's error rate, so nobody noticed.
2. Embedding-model overload: large files bypassed the cache and hit the embedding model directly, driving up load.
3. Merkle-tree race condition: after chunking and embedding, the hashes of the tree's upper nodes must be updated before the transaction commits. That update ran on a queue with a race condition, so large files could never commit.
4. Infinite loop: the client saw the large file uncommitted and retried; the retry missed the cache again, hit the embedding model again, and failed to commit again, forever.
- Global error rates looked completely normal; the problem lived only in those large, roughly sixty-line load-bearing files, and exactly those files triggered the whole chain. Diagnosis took "a really painful week."
- The repair path: first notice the anomalous cache hit rate, fix the caching bug, then chase down why commits failed, and finally pin the race condition. Sualeh stressed that even when you know a race condition probably exists, finding it in production is "enormously difficult"; in the end "it came down to a lot of careful thought."
"Race conditions in distributed systems — anyone who has taken a distributed-systems class will tell you they're incredibly gnarly and hard to find. And that doesn't make them any easier to find in real production." — Sualeh Asif
The 22TB Postgres Crisis: Rewriting onto Object Storage Mid-Incident
Key point: after RDS Postgres bloated to 22TB, vacuum failed, the database ground to a halt, and it would not restart. Co-founder Arvid spent two hours mid-incident rewriting the core storage onto object storage, and it shipped faster than every other fix.
After the indexing fixes, the system "ran beautifully for about eight months, with zero incidents"; Sualeh even forgot the monitoring dashboards existed. Then one day he was woken by a flood of alerts: the cache and the embedding models were all firing at once.
The root cause was a clash between the workload and Postgres's storage mechanics. Postgres updates are not in-place edits (as in MySQL) but a delete plus an insert. Every user keystroke triggered massive deletes and re-inserts; space is never reclaimed directly, only the background vacuum process cleans it up. Once vacuum and the anti-wraparound vacuum both broke, the database became "a car that starts, shudders, and grinds to a halt": throttle traffic, clear the backlog, restore traffic, stall again, on repeat.
After hours of this, the database would not even restart. Sualeh phoned his friend Martin in search of RDS experts; AWS support "had no clue what was going on." The team did include a former VP of Infrastructure at GitHub who knows databases inside out. They even got the RDS architects, "the people who wrote the database," on the phone, whose verdict was: "Well, I can't help you much. You're in big trouble."
Sualeh fanned out tasks in parallel: one person deleting all foreign keys (some tables in the 22TB had four foreign-key relations, adding heavy pointer-chasing load); one person rewriting the workload (one table held 20 of the 22TB); others trying wild ideas. Every knob was already maxed out: "This was the biggest RDS instance you can get on AWS. It could not get any bigger."
The turning point: mid-incident, Sualeh told co-founder Arvid to "try moving this to object storage." Arvid's job was to rewrite the 20TB chunk-storage table onto object storage. He actually finished before every other fix, and this was one of the longest-running Sevs Cursor had ever had. About ten hours in, the team abandoned reviving the instance and cut traffic over to the new system Arvid had finished two hours earlier.
"It's someone telling you 'my code has no bugs, it doesn't need tests, I wrote it two hours ago,' and then you put your highest-throughput service on it." — Sualeh Asif
Sualeh stressed his "absolute trust" in Arvid. Cursor now has a new project to move the entire system onto object storage and "kill all the databases in the middle, because the best way to scale a database is to not have a database."
Andre noted how closely the rhythm of this story matches the Midjourney Sev covered in the first CS 153 class: "The physics of incident response end up looking strikingly similar. There's always one person on the infra team who becomes the key figure, and their rewrite becomes the official system, far more stable and long-lived than you expected."
The Object Storage Trend: The Architectural Shift Most Worth Watching in Databases
Key point: Sualeh sees "building databases on top of object storage (S3/R2/Azure Blob)" as the most important database trend of recent years. Object storage is extremely reliable and extremely scalable, and the hardest part of writing a database, the storage layer, simply disappears.
- The database world's "legends": Google Spanner (global scale), FoundationDB, and KV stores like Redis. "With those three you can basically build everything, maybe plus an analytics engine."
- The early-2010s trend was big analytical engines (Snowflake, Databricks), famous for vectorized operations. The recent trend is building databases on object storage.
- Flagship example: WarpStream (acquired by Confluent), which rewrote Kafka to run on blob storage. Sualeh added: "If you're ever starting an infra team, the first thing people will tell you is never run Kafka yourself. It's like cancer; you can never get rid of it."
- Turbopuffer, which Cursor relies on: a vector database built on object storage.
- The core advantage: "Nothing in the world is more reliable than object storage." The hardest part of writing a database is the storage layer, where corruption creeps in; building on object storage sidesteps the problem.
Cold Starts and Incident Recovery: No Model Provider Is Mature Yet
Key point: the cold-start problem in inference serving is badly underrated. If all nodes die and you recover without throttling traffic, the first few nodes back up get crushed by every user's requests, "killed before they can even become healthy."
- Sualeh was blunt: "I don't think anyone has fully figured out inference." Everyone runs at global scale, but "everyone is still relatively immature."
- One unnamed provider had just stood up its inference service as Cursor was scaling. Cursor kept asking for "ten million more tokens per minute... no, a hundred million," and the provider fell over at thirty to forty million tokens per minute: "their caching wasn't sorted out yet."
- The cold-start scenario: you are serving 100,000 requests per second and every node dies. Bring up ten nodes out of a thousand and those ten are crushed by all the traffic. Options: kill some traffic, tier users by priority, or do what WhatsApp does and restore a few critical prefix routes first.
- "Go look at the big model providers' status pages. Their historical reliability is all terrible. Nobody has good reliability."
"I don't think anyone has fully figured out inference. We and everyone else run these services at global scale, but everyone is still relatively immature." — Sualeh Asif
A One-Line Change Causes a Sev 1: The Future of Model Code Review
Key point: about a week and a half ago Cursor took an outage of roughly 230 minutes caused by a single-line code change; "if you looked at the change, you would never spot the problem." Sualeh believes AI code review is close to practical for catching exactly this class of hidden bug.
- On a severity scale of one to five (one being worst), the incident was rated a one. Andre joked it "should probably have been a zero."
- Sualeh revealed that the bugbot the team is building would have caught the bug had it been running. Chris is shipping the bugbot tool.
- On people during incidents: "There are two kinds of people: the ones who get jittery, and the ones who really come alive. One of our co-founders doesn't like day-to-day infra work, but he loves Sevs. The happiest you'll ever see him is in the middle of a Sev, with everything falling apart."
The Capacity Poker with Model Providers: A Phone Chain from Cursor to Racked TPUs
Key point: Cursor is likely among the largest customers of several frontier-model providers, so capacity negotiation is core day-to-day work. Demand travels up a long phone chain: Cursor to Anthropic to Google's TPU team to the Borg team.
- One of "the messiest parts of the system" is the machinery handling rate limits and token quotas. Cursor runs on multiple providers at once, and they "go down all the time."
- Andre sketched the phone chain vividly: the night before 3.5 Sonnet launched, Cursor co-founder Aman emailed Andre asking who knew someone at Anthropic and could help secure a higher quota. Anthropic then calls Google for more TPUs, and Google's TPU team calls the Borg team asking when more servers can be racked. "It depends on the day, and on who in that chain is the bottleneck."
- The playbook: contact several providers at once (AWS, Google Cloud, OVH), ask "who can get us more tokens first," then balance user traffic across providers; see the sketch after this list. "You basically get a race condition between the providers' sales teams competing to reach you."
- Sualeh likened it to "capacity-reservation deals in the early cloud era," except "nobody anticipated how sharply today's inference workloads would jump."
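A minimal sketch of the kind of cross-provider balancing described above. Everything here (provider names, quotas, the headroom heuristic) is a hypothetical illustration; real routing would also weigh latency, cache affinity, and which models each provider serves.

```python
# Sketch: quota-aware weighted routing across model providers.
# All names and numbers are hypothetical.
import random
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    quota_tpm: int              # negotiated tokens-per-minute limit
    used_tpm: int = 0           # tokens consumed in the current minute
    healthy: bool = True

providers = [
    Provider("provider-a", 40_000_000),
    Provider("provider-b", 25_000_000),
    Provider("provider-c", 10_000_000),
]

def pick_provider(estimated_tokens: int) -> Provider:
    """Route a request to a healthy provider, weighted by quota headroom."""
    candidates = [
        p for p in providers
        if p.healthy and p.used_tpm + estimated_tokens <= p.quota_tpm
    ]
    if not candidates:
        raise RuntimeError("all providers saturated: shed or queue traffic")
    headroom = [p.quota_tpm - p.used_tpm for p in candidates]
    chosen = random.choices(candidates, weights=headroom, k=1)[0]
    chosen.used_tpm += estimated_tokens
    return chosen

print(pick_provider(50_000).name)   # usually the provider with most headroom
```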
Competition and Differentiation: Copilot Went a Year and a Half Without Changing
Key point: the founding team was already following scaling laws at MIT and initially assumed the Microsoft+OpenAI combination behind Copilot was an unbeatable dream team. But Copilot went a year and a half without improving; "crowded" is hindsight bias.
- The founders' original idea was a programming product, but Copilot's existence pushed them to try "some other complicated directions" first. They returned to programming once they saw Copilot stagnating.
- Sualeh believes the ceiling "goes all the way up to automating most engineering work," yet at the time nothing beyond autocomplete was visible. "For the longest time, there was basically no real competition."
- On code security: even embedded code is protected cryptographically; every vector in the vector database is encrypted with keys stored on the user's device. "Even if someone got the vector database, which is hard in itself, in the worst case they couldn't read it." Sualeh is 99.99% sure code cannot be reconstructed from vectors, but "because that isn't a provable fact, it's safer to have this layer of encryption."
The Free-Token Arms Race and Holding the Line at $20
Key point: Cursor pours real effort into fighting creative abusers while holding its $20 price: "rather than raise prices, we spend a lot of effort optimizing the system so everyone can afford it."
- Sualeh's open invitation: anyone who finds a way to get free tokens and reports it responsibly gets a free subscription. The condition is "responsible disclosure."
- A recent case: someone created tens or hundreds of thousands of Hotmail accounts on a relatively reputable domain (not a throwaway one), sending traffic from about 10,000 IP addresses through 100,000 sharded users. Not a DDoS; they "wanted to run their own service on the free tier and get their code written."
- Pricing: for people in the industry, $20 is nothing, and raising prices would be the easy move. Cursor instead holds the price down with heavy systems optimization. "Every keystroke does roughly a hundred billion floating-point operations. That's just a lot of compute."
"Models are expensive, but they're so useful. They're worth more than a hamburger. Just pay the $20; I promise it's worth it." — Sualeh Asif
CS Education in the AI Era and the Future of the IDE
Key point: Sualeh does not think studying CS is a waste of time; he expects AI to make people more creative: "there will be more systems in the future, not fewer." The IDE will not disappear, though its definition may change.
- "The unrealistic version is 'we'll all lose our jobs.' The realistic version: more and more of the work gets automated." But the claim that "there's no reason to spend time architecting and designing new systems" is wrong.
- Cursor itself is the proof: a "very talented but not very large team" covers far more infrastructure surface area than expected, with AI assistance as the key lever. Sualeh expects that ratio to keep growing.
- The analogy: "Take adding a parameter to a function call. That's like asking whether a football player is just moving their legs. They are, but is that the most meaningful part?"
- On the future of the IDE: "Will programmers have a tool they write code with? Yes. Okay, next question." The form may change; the tool itself will not disappear.
- On dropping out to found a company: "The choice should be obvious to you. If it's obvious, do it; if it's not, don't."
Appendix: Key People / Organizations / Products / Numbers
| Item | Details |
|------|------|
| Sualeh Asif | Cursor CTO & co-founder, MIT alum, self-described "database nerd" |
| Arvid | Cursor co-founder; rewrote core storage onto object storage in two hours during the Postgres crisis |
| Aman | Cursor co-founder; spent the eve of the 3.5 Sonnet launch hunting for a higher Anthropic quota |
| Michael | Cursor co-founder; the one yelling "can we get the service back up" during the incident |
| Chris | Cursor team member shipping the bugbot tool |
| Martin | Sualeh's friend, called during the database crisis to find RDS experts |
| Andre | CS 153 course host/interviewer |
| Former GitHub VP of Infrastructure | Cursor team member, pivotal in the database crisis |
| Ben Mann | Anthropic co-founder, the previous week's CS 153 guest |
| Autocomplete model | Runs on every keystroke; ~20,000 calls/sec at peak on ~1,000-2,000 H100s |
| Apply model | Processes 100k-200k tokens; feels like copy-paste |
| Bugbot | AI code-review tool under development at Cursor |
| YugaByte | Spanner descendant; dropped by early Cursor over scaling troubles |
| RDS Postgres | The replacement for YugaByte; hit the vacuum crisis at 22TB |
| DynamoDB | The indexing system's cache layer; a large-file bug triggered the cascading failure |
| Turbopuffer | Vector database built on object storage; a core Cursor dependency |
| WarpStream | Rewrote Kafka to run on blob storage; acquired by Confluent |
| Merkle tree | Client-server file-sync mechanism; locates changes by comparing hashes level by level |
| 100 million/day | Daily calls to Cursor's in-house models |
| 1 billion/day | Documents processed daily by the indexing system |
| 22TB / 64TB | Postgres data size / RDS storage ceiling |
| ~230 minutes | Duration of the recent Sev 1 outage |
| ~100x | Scale growth over the past year |
| $20/month | Cursor's subscription price, held steady |
| 100 billion | Floating-point operations per keystroke |
Transcript Excerpts
Then there's the model section. There's an autocomplete model that runs on every single keystroke, which means at any point in time we're doing roughly 20,000 model calls per second on a fleet of something like one to two thousand H100s. That's just autocomplete. That infrastructure is now split up: much of it on the US East Coast (Virginia), some of it out west (Phoenix), some in London and Tokyo. At one point we tried a Frankfurt deployment, but it was a little unstable. The GPUs are now pretty spread across the world, so in general, if you're in Japan you still get a relatively fast autocomplete experience.
On the indexing side, we're scaling up fairly large retrieval systems. When you ask the agent a question and it goes and searches the repository, that system now runs at pretty high throughput.
As for the third pillar I forgot to mention: it's the entire streaming infrastructure, how we store the data that comes in and how we use it for screening and all sorts of things that run in the background. It's not something you query or interact with directly; it's the thing we use to make Cursor better for everyone, every day.
A thing that used to happen, around June or July 2023: everything ran on one big server, and someone would write an infinite loop. Infinite loops obviously happen, and this one was particularly gnarly. It took down chat for other people. You don't want core services like login to go down because someone accidentally wrote an infinite loop on the server. So compartmentalizing the servers became really important.
When you're using Composer to write code over your codebase, many requests go out in the background while you type, figuring out which relevant pieces of the codebase will be selected into the final prompt. Then there's a bidirectional stream: the model takes actions on the server side, and the results get communicated back and forth with clients everywhere, whether you're in India or Pakistan.
One thing we increasingly design for: server-side code must stay strictly simple. If it's too complicated you can't understand it, and if you can't understand it you can't run it reliably. That has become quite an important part.
One unnamed provider had just stood up its inference service while we were scaling. Every day we'd tell them: "we need ten million more tokens per minute... no, a hundred million." We needed the capacity for peak traffic, and they fell over at thirty or forty million tokens per minute. Their caching wasn't sorted out yet.
An underappreciated fact here is the cold-start problem. Imagine you're running a service at 100,000 requests per second and all your nodes die. If you start restoring nodes without blocking new requests, say the first ten out of a fleet of a thousand, those ten get smashed by every single user's requests. They're killed before they can even become healthy. So a cold-start problem exists across all of these services.
That's exactly what makes recovering from a really bad incident tricky. You either kill some of the traffic or reach for tricks. On our side we keep user priorities, or we simply throttle everyone at the same time. WhatsApp's approach: if they go down completely, they bring up certain critical prefix routes first instead of trying to serve everyone at once, because you simply can't.
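The admission-control idea behind this can be sketched as follows; the tiers, floor, and function names are hypothetical illustrations, not anyone's production policy.

```python
# Sketch: proportional admission control during cold-start recovery.
# Thresholds and tier semantics are hypothetical.
import random

def should_admit(healthy_nodes: int, total_nodes: int,
                 user_tier: int, floor: float = 0.02) -> bool:
    """Admit traffic in proportion to fleet health so recovering nodes survive.

    user_tier: 0 = critical (always served, like WhatsApp's key prefixes);
    higher tiers receive progressively less of the recovering capacity.
    floor: minimum admission rate so recovery never starves completely.
    """
    if user_tier == 0:
        return True
    capacity_fraction = max(healthy_nodes / max(total_nodes, 1), floor)
    return random.random() < capacity_fraction / (1 + user_tier)

# Example: 10 of 1,000 nodes back up, tier-1 users get ~1% of requests admitted.
print(sum(should_admit(10, 1000, 1) for _ in range(10_000)) / 10_000)
```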
Go look at the model providers' status pages: historically their reliability is all terrible. Nobody has good reliability.
It started as a thing on GitHub, and then we realized: if you give users a bunch of buttons to index their codebase manually, nobody indexes anything. You have to make it fully automatic, and the only way is to trigger indexing the moment the user opens the editor. As long as you consent to us indexing (or consent by default), no matter how big the codebase (unless it's at Instacart scale), we just index it, no questions asked, and we eat the cost.
The way the system works is that it computes a Merkle tree on the client and one on the server. Every file gets hashed, and every folder's hash is the combination of its children's hashes, up to a root hash. When you need to figure out what changed (you shut your laptop and opened it again, or checked out a version from a year ago), Cursor reconciles by comparing the client and server Merkle trees. If the root hashes differ, at least one folder underneath must have changed, because otherwise the hashes would match. You descend level by level until you find the specific files that changed. It's a back-and-forth in which the server knows part of the file state, the client knows part, and the two reconcile.
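A minimal sketch of that reconciliation, simplified to a single directory level (real trees are nested and the comparison happens over the network, level by level):

```python
# Sketch: Merkle-style diff between client and server views of a file set.
import hashlib

def file_hash(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

def root_hash(file_hashes: dict[str, str]) -> str:
    # A parent's hash combines its children's hashes in a stable order.
    joined = "".join(h for _, h in sorted(file_hashes.items()))
    return hashlib.sha256(joined.encode()).hexdigest()

def build(files: dict[str, bytes]) -> dict:
    hashes = {path: file_hash(c) for path, c in files.items()}
    return {"root": root_hash(hashes), "files": hashes}

def changed_files(client: dict, server: dict) -> set[str]:
    """Descend only if the roots differ; return paths that need re-syncing."""
    if client["root"] == server["root"]:
        return set()                     # nothing below the root has changed
    paths = set(client["files"]) | set(server["files"])
    return {p for p in paths if client["files"].get(p) != server["files"].get(p)}

# Example: one file was edited locally since the server last indexed it.
server = build({"src/a.py": b"print(1)", "src/b.py": b"print(2)"})
client = build({"src/a.py": b"print(1)", "src/b.py": b"print(3)"})
print(changed_files(client, server))     # {'src/b.py'}
```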
OK, so as this indexing system was scaling up, we hit a few problems. First, we were living on a database called YugaByte. Back when I was studying at MIT I was a big databases fan, and in the database world there are two or three legendary databases people think of as the solution to all their problems. One is Google's Spanner, famous for being global scale. Then there's FoundationDB, which is really famous. And then there are the KV stores like Redis. Basically, with those three you can build everything, maybe plus an analytics engine.
YugaByte is one of Spanner's descendants. I was thinking: we have to scale this thing, so let's pick something that scales infinitely. YugaByte has that property in theory, but we could not get it to run. We were paying a lot of money and trying to scale down to as few nodes as we could, and it just would not run. The failure mode looked like this: as soon as you turned traffic on, the database got too hot and basically died. The reason was that we were issuing a very large number of long transactions, and globally distributed databases have to run a consensus protocol for those, which can eat a lot of time. A single node would have had no problem whatsoever; the globally distributed version of it became a complete mess. Presumably YugaByte runs really well if you run it right, but you should start off with Postgres. A life lesson: we moved the workload to RDS (the relational database service) and it worked like a charm. Don't pick a complicated database. Go with the hyperscalers; they know what they're doing and their stuff is really good. Use Postgres. Don't get fancy.
After we moved to Postgres, we hit a massive traffic spike almost immediately. The first thing that happened: we had an enormous cache in the middle running on AWS DynamoDB, and there was a slightly gnarly bug in the code where a sufficiently large file would not be cached in Dynamo. We didn't know, because we weren't monitoring that database's error rates, so we missed it completely.
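The shape of that bug is worth pausing on. The sketch below is hypothetical, not Cursor's code, but DynamoDB really does cap items at 400 KB, which is exactly the kind of constraint that makes oversized writes fail or get silently skipped:

```python
# Sketch (hypothetical): a cache write that silently skips large values.
# DynamoDB items are capped at 400 KB, so oversized values must be
# rejected, split, or skipped; a silent skip is invisible unless you
# monitor error/miss rates for the cache itself.
MAX_ITEM_BYTES = 400 * 1024

def cache_put(cache: dict, key: str, value: bytes) -> bool:
    if len(value) > MAX_ITEM_BYTES:
        # BUG PATTERN: returning without emitting a metric means large
        # files always miss the cache and hit the embedding model directly.
        return False
    cache[key] = value
    return True
```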
So the users with really large files: those files went straight to the embedding model, and because they were so big, the load on the embedding model climbed. Then we had another gnarly race condition on our queue. Once the chunking and embedding are done, you save; before you can commit the transaction, you have to update the hashes of all the nodes above it in the Merkle tree. That hash update happens on a queue, and the queue had a race condition, especially around large files. So large files never got committed. The client discovered the large file wasn't committed and retried; the retry missed the cache again, went to the embedding model again, and again failed to commit. An infinite loop.
We watched load climb and climb with no explanation of what on earth was going on, because the global error rates looked totally normal. The problem lived only in those critical large files. It was a really painful week, because the entire Merkle-tree reconciliation kept missing exactly those huge, roughly sixty-line files, and those were causing all the drama.
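Condensed into a few lines, the feedback loop looked something like this (a hypothetical simplification of the three interacting bugs):

```python
# Sketch: the cascading retry loop, condensed. Hypothetical simplification.
MAX_ITEM_BYTES = 400 * 1024

def embed(content: bytes) -> None:
    pass  # stand-in for the embedding-model call absorbing the extra load

def commit_merkle_update(large: bool) -> bool:
    return not large  # stand-in for the queued hash update losing its race

def index_file(path: str, content: bytes, cache: dict) -> bool:
    large = len(content) > MAX_ITEM_BYTES
    if large or path not in cache:   # large files always miss the cache...
        embed(content)               # ...so every retry re-embeds them
        if not large:
            cache[path] = content
    return commit_merkle_update(large)

# The client's sync loop over a large file never terminates:
#     while not index_file(path, big_content, cache):
#         pass   # each retry adds embedding load and commits nothing
```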
But race conditions in distributed systems: anyone who has taken a distributed-systems class will tell you they're really gnarly and hard to find, and that doesn't make them any easier to find in real life. Even if you know you probably have a race condition, finding it is enormously difficult. In the end it comes down to a lot of careful thought.
Then one day I was woken by a flood of alerts, and everything in the system was firing at once. The cache was in trouble, and the embedding models were in trouble. We eventually traced the cause: RDS now held about 22TB of data, and RDS has a 64TB ceiling.
It came down to the shape of our workload. Every time you update a file, we update a record. MySQL updates a record in place on disk; that is not how Postgres works. A Postgres update is actually a delete plus an insert. If your workload is mostly people typing, all you see is updates, which means enormous numbers of deletes and enormous numbers of inserts.
In Postgres, a delete is just a tombstone: it marks the row dead and tells the index not to visit it again, but it doesn't reclaim the space. You keep deleting and re-inserting records and the space is never freed. A background process called vacuum is meant to clean all this up, rewriting into new blocks and reclaiming the space. And there's a related mechanism around transaction IDs, the anti-wraparound vacuum.
If vacuum and the anti-wraparound vacuum both go haywire... we noticed the database was getting unhappy. What does an unhappy database look like? Transactions sort of squeeze through, and then it's like a car that starts and grinds to a halt. You do all this work to throttle traffic, slowly clean up the backlog of transactions, slowly restart traffic, and boom, it grinds to a halt again. Just picture a car that keeps shuddering and simply won't move.
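For reference, this is the sort of diagnostic you can run while that is happening; it uses Postgres's standard statistics views (via the psycopg2 driver here, with placeholder connection details):

```python
# Sketch: inspect dead-tuple bloat and vacuum recency via pg_stat views.
import psycopg2

conn = psycopg2.connect("dbname=app host=localhost user=postgres")
with conn.cursor() as cur:
    # Dead tuples pile up when vacuum can't keep pace with delete+insert churn.
    cur.execute("""
        SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 10;
    """)
    for name, live, dead, last_vac in cur.fetchall():
        print(f"{name}: {dead} dead vs {live} live tuples, last vacuum {last_vac}")

    # Transaction-ID age: as this nears ~2 billion, Postgres forces the
    # anti-wraparound vacuum described above.
    cur.execute("SELECT datname, age(datfrozenxid) FROM pg_database;")
    print(cur.fetchall())
```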
After a few hours of this, at some point the database even stopped booting back up. Now the emails were pouring in, everyone furious that indexing wasn't working, and because the database had ground to a complete halt, we couldn't even run queries.
At that point I called a friend of ours, Martin: "Martin, we're in big trouble. Do you know any RDS experts?" We called AWS support; they had no clue what was going on. At least there are a few extremely talented people on the team. One of them was a former VP of Infrastructure at GitHub, so he knows his databases up, down, and center.
Then we got on a call with the RDS architects, the people who wrote the database, and they said: "Well, gosh, I can't really tell you anything. You guys are in big trouble." Very helpful. Very helpful.
In my personal opinion there are two kinds of people in Sevs: the people who get jittery, and the people who really come alive. One of our co-founders just does not like day-to-day infra, but he loves Sevs. The happiest you'll ever see him is in the middle of a Sev with everything collapsing. (A Sev, short for severity, is an incident, something going wrong.)
This is stressful, because Michael is yelling: "Can we put the service back on?" You really want the service back healthy, at whatever cost. We faced a few options. We had this extremely nice schema structure where some tables had four foreign keys into other tables. Foreign keys add a lot of load to the database because it has to do a lot of pointer chasing.
So you assign one person: go delete all the foreign keys; your job is to relieve the load on the database. We assigned another person to rewrite the workload: of the 22TB, one table was 20TB, and his job was to kill that table as fast as possible and completely change the workload. We assigned another person to try something else. One of the special things about the team is that there are quite a few really talented folks, so you can hand them genuinely crazy ideas and let them run.
He rewrote it to run on object storage. A digression on the trend here: the past few years have had some interesting currents in databases. The early 2010s were the big analytical engines, most famously Snowflake, plus Databricks and lots of open-source versions. Those engines are famous for vectorized operations, so they're really fast.
A recent trend I really like is building databases on top of object storage. Object storage means S3, R2, Azure Blob, and the like. It has been optimized to death, it's extremely scalable, relatively fast, and very, very reliable. Nothing in the world is more reliable than object storage.
One of the hardest parts of writing a database is the storage layer; it can get really messy, and there are all these subtle things you have to worry about to avoid corrupting data. One example of the trend is WarpStream, acquired by Confluent, which rewrote Kafka, the famous streaming system, to run on blob storage. By the way, if you're ever starting an infra team, the thing people will tell you is never run Kafka yourself; it's like cancer, you can never really get rid of it. Another provider we really like and rely on is Turbopuffer, a vector database also built on object storage.
Back to the Sev. In the middle of it I told this person: your job is to rewrite this on object storage, the biggest table, the one storing the chunks. He actually managed to move it to object storage before any of the rest of us finished, because this was one of the longest-running Sevs we'd ever had, and you don't know if you'll ever get your database back online.
Then we moved a lot of the workload off the database. At that point it was essentially a full refactor, a full new migration.
But the person doing this was one of our co-founders, Arvid, and I'd trust him with my life. He just does his thing: we rewrote it on the fly, and it went onto blob storage. There's now even a new project to move the entire thing onto blob storage and kill all the databases in the middle. Because the best way to scale a database is to just not have a database.
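What a chunk store on object storage can look like, as a minimal hypothetical sketch (content-addressed keys over S3 via boto3; the talk doesn't describe Cursor's actual layout):

```python
# Sketch (hypothetical): a content-addressed chunk store on S3.
# Keys are content hashes, so writes are idempotent and objects immutable,
# exactly the access pattern object storage handles extremely well.
import hashlib
import boto3

s3 = boto3.client("s3")
BUCKET = "example-chunk-store"    # placeholder bucket name

def put_chunk(data: bytes) -> str:
    key = "chunks/" + hashlib.sha256(data).hexdigest()
    s3.put_object(Bucket=BUCKET, Key=key, Body=data)
    return key                     # persist this key where a DB row once lived

def get_chunk(key: str) -> bytes:
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
```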
The funny thing, though, is that in the world we're in now, these single-line-change incidents are sometimes things models can actually catch. It's kind of cool to be protected by having models review your code; they'll find things you don't predict and probably save you a lot of time and pain.
But Copilot didn't change at all for a year and a half. The first AI product was Copilot autocomplete, and then they made no improvements. And anyone could clearly see the ceiling was really, really high. I truly believe the ceiling goes all the way up to automating a lot of engineering, yet at the time you couldn't see anything past autocomplete. So I think it's a hindsight phenomenon: it only feels crowded in retrospect. For the longest time, there was basically no real competition.
A lot of time has been spent blocking spammers who find genuinely creative methods. The one from two nights ago: some guy, I don't know where he is, managed to create tens of thousands, maybe hundreds of thousands, of Hotmail accounts, on a relatively reputable domain, not even a throwaway one. He sent traffic from something like 10,000 IP addresses across 100,000 sharded users. You have to spend a lot of time analyzing the traffic to block people like that.
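A toy sketch of the kind of traffic analysis this implies; the heuristic and thresholds are entirely hypothetical, and real systems combine many more signals:

```python
# Sketch: flag sharded-account abuse by clustering signups per IP block.
# Field names and the threshold are hypothetical illustrations.
from collections import defaultdict

def suspicious_clusters(signups, max_accounts_per_block: int = 50):
    """signups: iterable of (account_id, ip, email_domain) records."""
    by_block = defaultdict(set)
    for account_id, ip, domain in signups:
        block = ".".join(ip.split(".")[:3])          # group by /24 prefix
        by_block[(block, domain)].add(account_id)
    return {key: accounts for key, accounts in by_block.items()
            if len(accounts) > max_accounts_per_block}

# Example: 60 accounts from one /24 on one domain trips the threshold.
events = [(f"acct-{i}", f"203.0.113.{i % 250}", "hotmail.com") for i in range(60)]
print(suspicious_clusters(events).keys())
```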
I'm not sure anyone else does this, but because we're handling code that goes to OpenAI and all these other services, I sleep a little better at night knowing there's a key. Even though I'm 99.99% sure there's no way to go from a vector back to code, it's not a provable fact about the world, so it's better to just have the encryption key. It's obviously a responsibility, and we use some fairly elaborate tricks to make sure we don't screw it up.
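The client-held-key idea can be sketched like this; it's hypothetical (using the `cryptography` library's Fernet for authenticated encryption) and shows only the property that matters: the key never leaves the user's device, so server-side data is opaque if leaked.

```python
# Sketch (hypothetical): encrypt server-stored data with a device-held key.
from cryptography.fernet import Fernet

# Generated once on the user's machine and never uploaded.
device_key = Fernet.generate_key()
codec = Fernet(device_key)

def encrypt_for_storage(payload: bytes) -> bytes:
    return codec.encrypt(payload)    # ciphertext is safe to ship server-side

def decrypt_on_device(token: bytes) -> bytes:
    return codec.decrypt(token)      # only the key holder can read it back

ciphertext = encrypt_for_storage(b"def secret(): ...")
assert decrypt_on_device(ciphertext) == b"def secret(): ..."
```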
In fact, what probably happens is that you can actually be more creative. A lot of these systems need a lot of people to build, and as they get more and more complicated you need even more. Being able to collaborate with models on them will actually be a huge accelerant. I expect a lot more systems in the future, not fewer.
Another way of saying it: I don't think Cursor would cover as much surface area on our infra systems as we do without AI assistance. We do it with a very talented but not very large team, and AI assistance is part of why. I expect that ratio to get bigger and bigger, and I expect us to take on really difficult, complicated things, because the models are there to help with the boring, meaningless aspects. Adding a parameter to a function call is like asking whether the only thing a football player does is move their legs. They are moving their legs, but is that the most meaningful part?