Inference, Diffusion, World Models, and More | YC Paper Club

Show

Y Combinator

Guests

Tanishk, Stannis, Isaac Ward, Ashe, Con Woo

Date

2026-05

Duration

67 min

Summary

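Inference is not just a cost problem; it is a capability ceiling. Tanishk's core thesis: when a model's performance is proportional to how much it thinks, tokens/second equals the peak intelligence you can deliver. Speculative Speculative Decoding (SSD) parallelizes drafting and verification, reaching 300 tokens/sec for Llama 3 70B on 4×H100 and correctly predicting verification outcomes 80-90% of the time.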

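Diffusion models can address two major problems in robot control at once: compounding errors and action selection. Stannis of Google DeepMind presented DMPC (Diffusion Model Predictive Control), whose factorized representation (a separate action proposal and dynamics model) allows runtime adaptation to new reward functions and new environment dynamics (for example, after a walker's ankle breaks, adapting only the dynamics model is enough to recover performance).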

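The core challenge for world models is representational collapse, and LWM solves it with one elegant regularizer. Isaac Ward presented a world model built on Yann LeCun's JEPA architecture: it predicts in latent space rather than pixel space and uses the SIG regularizer (Sketched Isotropic Gaussian) to keep the embedding distribution healthy. Results: roughly 50x faster than competing methods, on a single GPU with <24GB VRAM and 15M parameters. LeCun has raised $1.03 billion specifically to train this kind of model.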

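The "mysteries" of deep learning are not that mysterious. Ashe of Q Labs presented Andrew Gordon Wilson's work, which uses the classical PAC-Bayes framework to give a unified explanation of why overparameterization improves generalization (lower empirical risk plus more flat minima, hence better compressibility), why benign overfitting is possible (a soft inductive bias combines flexibility with regularization), and what these theories mean in practice for optimizing generalization efficiency.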

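When data becomes the bottleneck, classical machine learning techniques get a new life. Con Woo presented a surprising set of results: internet data grows only ~3% per year while compute spending grows ~4-5x per year, so the data-constrained era has arrived. Under a 200M-token budget, combining aggressive regularization (weight decay 30x the usual value), ensembling, and distillation yields a 5x data efficiency win; self-distillation is equivalent to an implicit 2-ensemble; and in the CPT setting, 4B tokens with these techniques match the performance of 73B tokens (17x data efficiency).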

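The thread running through the whole event is "doing more with less": SSD hides latency with parallel speculation, DMPC gets runtime reuse from factorization, LWM replaces complex tricks with a single regularizer, and ensembling plus distillation trades compute for data. All five papers explore how smarter algorithm design can push past the efficiency frontier under each field's core constraint.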

YC Paper Club Opening: The Pioneer Campus and Building the AI Community

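Key point: The first YC Paper Club was held at Pioneer (YC's Woodside campus). More than 1,000 people applied and only ~100 were selected, with the goal of connecting top founders and researchers.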

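The event took place at Pioneer, YC's campus in Woodside, which has special meaning for the host: he was here during the W16 batch, when 10-15 of the 140 companies went on to become unicorns (WPY, Astranis, Deepgram, and others).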

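Back then Sam Altman was still running YC, and also sitting in this room were Andrej Karpathy, Wojciech Zaremba, and Greg Brockman, who were starting something called OpenAI. "There weren't many AI companies back then," and they were even asking YC companies "what problems are you solving" to find research directions.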

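The audience was exceptionally strong: when the host asked for a show of hands, some attendees had 10,000+ citations and some had raised more than $50 million. He added that probably only Chris Manning is at the 300,000-citation level.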

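The hidden mission: "Make Pioneer Great Again." Roughly half of Bay Area AI talent is in San Francisco (Anthropic, OpenAI, Cursor) and the other half is on the Peninsula (Google DeepMind, Tesla, xAI, Thinking Machines). The Peninsula lacks community gatherings, and Paper Club aims to fill that gap.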

Speculative Speculative Decoding: Inference Is a Capability, Not Just a Cost

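Key point: When model performance is proportional to the amount of thinking (compute at inference), tokens/second equals peak intelligence. SSD parallelizes the inherently serial speculative decoding process by predicting verification outcomes.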

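Tanishk's core thesis is "inference as capability" rather than cost or convenience. His ideal future is "a data center of 20,000 B200s doing just one thing: cracking the Riemann hypothesis." This is not about saving money; it is about inference speed directly setting the ceiling on the intelligence you can deliver.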

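Three reasons inference matters: (1) serving billions of users (or 10 heavy Claude Code users) means trillions of tokens, and inference cost already exceeds training cost; (2) RL compute demand is starting to exceed pre-training, and RL is essentially a wrapper around inference; (3) most important and least discussed, inference speed is the capability ceiling.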

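How vanilla speculative decoding works: a small model (the draft) quickly generates several guessed tokens, then the large model (the target) verifies all of them in a single forward pass. The core asymmetry is that verification (parallel) is cheaper than generation (serial): a transformer can compute probabilities for every position in a sequence in one forward pass, but generation has to proceed token by token. When verification finds a token it does not trust, it rejects that token and samples a "bonus token" at the rejection position for free.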
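
As a rough illustration of the draft-then-verify loop, here is a minimal Python sketch. It assumes greedy decoding, so "verification" is a simple token match rather than the rejection sampling used with real probability distributions, and `toy_target` and `toy_draft` are made-up stand-ins for the large and small models, not anything from the talk.

```python
from typing import Callable, List

# A "model" here is a hypothetical greedy next-token function: context -> token.
Model = Callable[[List[int]], int]


def speculative_round(target: Model, draft: Model, ctx: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens it emits."""
    # Draft: the small model guesses k tokens one at a time (serial, but cheap).
    guesses: List[int] = []
    for _ in range(k):
        guesses.append(draft(ctx + guesses))
    # Verify: the target checks every guessed position. A real transformer gets
    # all of these predictions from one forward pass; this loop only emulates it.
    for i, guess in enumerate(guesses):
        expected = target(ctx + guesses[:i])
        if expected != guess:
            # Reject from position i onward; the target's own token is the bonus token.
            return guesses[:i] + [expected]
    # Everything was accepted: the same pass also yields one extra target token.
    return guesses + [target(ctx + guesses)]


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_tokens: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out += speculative_round(target, draft, out, k)
    return out[: len(prompt) + n_tokens]


def greedy_decode(target: Model, prompt: List[int], n_tokens: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out


def toy_target(ctx: List[int]) -> int:
    """Made-up 'large model'."""
    return (sum(ctx) * 31 + len(ctx)) % 11


def toy_draft(ctx: List[int]) -> int:
    """Made-up 'small model': agrees with the target except at every third position."""
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 11


fast = speculative_decode(toy_target, toy_draft, [1, 2], n_tokens=20)
slow = greedy_decode(toy_target, [1, 2], n_tokens=20)
print(fast == slow)
```

The two decoders produce the same sequence because every emitted token is one the target itself would have chosen; the draft only changes how many target passes are needed.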

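SSD's breakthrough: the bottleneck in vanilla speculative decoding is that drafting and verification must run serially, since the draft for round t cannot start until the verification result of round t-1 is available. SSD's core idea is very simple: let drafting and verification happen in parallel. After the draft sends out one round of guesses, it does not wait for verification to finish; it immediately predicts the most likely verification outcomes and starts drafting the next round on top of those predictions.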

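The intuition for predicting verification outcomes: when the draft generates its tokens (shown in blue on the slide), there are also candidate tokens that were "almost chosen," and these are precisely the likely candidates for the target's bonus token. Using the draft's token distribution to predict the target's verification outcome is correct 80-90% of the time, which is enough for a significant speedup.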
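
The idea of drafting ahead on predicted verification outcomes can be sketched as a small cache keyed by (number of accepted tokens, bonus token). This is a toy under loud assumptions, not the SSD implementation: decoding is greedy, `draft_top2` and `target` are invented models, the runner-up token is the only predicted bonus candidate per position, and the "parallel" work is run sequentially here.

```python
from typing import Dict, List, Tuple

K = 4  # tokens drafted per round (an arbitrary choice for this toy)
Draft = Tuple[List[int], List[int]]  # (guesses, runner-up token at each position)


def draft_top2(ctx: List[int]) -> List[int]:
    """Toy draft model: returns [best guess, runner-up] for the next token."""
    return [(ctx[-1] + 1) % 10, (ctx[-1] + 2) % 10]


def target(ctx: List[int]) -> int:
    """Toy target model: usually agrees with the draft, sometimes picks its runner-up."""
    step = 2 if len(ctx) % 7 == 0 else 1
    return (ctx[-1] + step) % 10


def draft_round(ctx: List[int]) -> Draft:
    guesses: List[int] = []
    runner_ups: List[int] = []
    for _ in range(K):
        best, alt = draft_top2(ctx + guesses)
        guesses.append(best)
        runner_ups.append(alt)
    return guesses, runner_ups


def verify(ctx: List[int], guesses: List[int]) -> Tuple[int, int]:
    """Return (number of accepted draft tokens, bonus token)."""
    for i, guess in enumerate(guesses):
        expected = target(ctx + guesses[:i])
        if expected != guess:
            return i, expected
    return len(guesses), target(ctx + guesses)


def predraft(ctx: List[int], guesses: List[int],
             runner_ups: List[int]) -> Dict[Tuple[int, int], Draft]:
    """Draft the next round for each predicted verification outcome."""
    cache: Dict[Tuple[int, int], Draft] = {}
    for i, alt in enumerate(runner_ups):  # rejected at i, bonus = runner-up
        cache[(i, alt)] = draft_round(ctx + guesses[:i] + [alt])
    for tok in draft_top2(ctx + guesses):  # everything accepted
        cache[(K, tok)] = draft_round(ctx + guesses + [tok])
    return cache


def generate(prompt: List[int], rounds: int) -> Tuple[List[int], int]:
    ctx, hits = list(prompt), 0
    guesses, runner_ups = draft_round(ctx)
    for _ in range(rounds):
        # On real hardware these two calls would overlap in time.
        cache = predraft(ctx, guesses, runner_ups)
        n_accepted, bonus = verify(ctx, guesses)
        ctx = ctx + guesses[:n_accepted] + [bonus]
        if (n_accepted, bonus) in cache:  # prediction hit: next draft is ready
            hits += 1
            guesses, runner_ups = cache[(n_accepted, bonus)]
        else:  # miss: fall back to drafting serially
            guesses, runner_ups = draft_round(ctx)
    return ctx, hits


def greedy(prompt: List[int], n_tokens: int) -> List[int]:
    ctx = list(prompt)
    for _ in range(n_tokens):
        ctx.append(target(ctx))
    return ctx


tokens, hits = generate([0], rounds=6)
print(hits, tokens == greedy([0], len(tokens) - 1))
```

In this toy the target always picks one of the draft's top two tokens, so every round is a cache hit; with real models the hit rate is what the 80-90% figure above refers to, and a miss simply costs one serial drafting step.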

Results: Llama 3 70B reaches 300 tokens/sec on 4×H100. Compared with SGLang (the fastest open-source inference engine in the tests), SSD improves both latency and throughput. Speculative decoding usually wins on latency but not on throughput; SSD wins on both.

"Inference today is seen as a cost or convenience lever. But in one, two, or three years, inference is going to be seen as a capability." —— Tanishk

Diffusion Model Predictive Control: Robot Control with Diffusion Models

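Key point: DMPC uses diffusion models to learn both multi-step action proposals and multi-step dynamics models. The factorized architecture lets the model adapt at runtime to new reward functions and new environment dynamics without retraining.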

The two pain points of Model Predictive Control (MPC): (1) the dynamics model is not accurate enough, which leads to compounding errors; (2) the planning algorithm is not strong enough to pick a good action sequence. DMPC addresses both with diffusion models.

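The DMPC algorithm is very simple. It learns three components from an offline dataset: a policy (predicts actions given observations), a dynamics model (predicts future states given actions), and an objective function. The key innovation is multi-step prediction: the action proposal predicts actions for the whole horizon at once (similar to diffusion policy but trained on more diverse data), and the dynamics model rolls out multiple steps at once (reducing compounding error).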

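The power of the factorized representation: because the action proposal and the dynamics model are separate, you can replace just one of them. Stannis showed a neat experiment in which the walker's left ankle breaks and the environment dynamics change. Collecting a small amount of play data in the new environment and adapting the dynamics model is enough to recover most of the performance. Joint modeling approaches (such as Diffuser) cannot do this.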
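
To make the factorization concrete, here is a toy Python sketch with three swappable pieces: an action proposal, a multi-step dynamics model, and a reward. Everything in it is an assumption for illustration: the proposal is uniform random noise standing in for a diffusion model, the world is one-dimensional, the `gain` parameter plays the role of the broken ankle, and the sample-and-rank planner is just one simple planner choice.

```python
import random
from typing import Callable, List

Dynamics = Callable[[float, List[float]], List[float]]
Reward = Callable[[List[float]], float]


def propose_actions(rng: random.Random, horizon: int) -> List[float]:
    """Stand-in for the multi-step action proposal (a diffusion model in DMPC)."""
    return [rng.uniform(-1.0, 1.0) for _ in range(horizon)]


def make_dynamics(gain: float) -> Dynamics:
    """Stand-in for the multi-step dynamics model: (state, actions) -> future states."""
    def rollout(x0: float, actions: List[float]) -> List[float]:
        states, x = [], x0
        for a in actions:
            x += gain * a
            states.append(x)
        return states
    return rollout


def reach(goal: float) -> Reward:
    """Reward that prefers trajectories ending near the goal position."""
    return lambda states: -abs(states[-1] - goal)


def plan(x0: float, dynamics: Dynamics, reward: Reward, rng: random.Random,
         horizon: int = 5, n_samples: int = 256) -> List[float]:
    """Sample action sequences, imagine their outcomes, keep the best-scoring one."""
    candidates = [propose_actions(rng, horizon) for _ in range(n_samples)]
    return max(candidates, key=lambda acts: reward(dynamics(x0, acts)))


true_env = make_dynamics(gain=0.5)       # after the change: actions move half as far
stale_model = make_dynamics(gain=1.0)    # dynamics model learned before the change
adapted_model = make_dynamics(gain=0.5)  # dynamics model after adapting on new data

rng = random.Random(0)
stale_plan = plan(0.0, stale_model, reach(1.0), rng)
adapted_plan = plan(0.0, adapted_model, reach(1.0), rng)
stale_error = abs(true_env(0.0, stale_plan)[-1] - 1.0)
adapted_error = abs(true_env(0.0, adapted_plan)[-1] - 1.0)
print(adapted_error < stale_error)

# Swapping only the reward at plan time yields a different behavior.
backward_plan = plan(0.0, adapted_model, reach(-1.0), rng)
print(true_env(0.0, backward_plan)[-1] < 0.0)
```

Only the dynamics model or the reward is replaced between the calls to `plan`; the proposal and the planner stay untouched, which is the property a joint state-action model does not offer.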

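The landscape of diffusion models in robotics: Stannis outlined four paradigms: Diffusion Policy (behavior cloning, requires expert data), Diffuser (joint state-action modeling), Decision Diffuser (observation-only learning, can use pure video data), and DMPC (runtime adaptation, the most flexible but requires a planner).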

DMPC can adapt to novel reward functions at runtime: training covers only locomotion (moving forward, jumping), and changing the reward at inference time produces new behaviors.

LWM / JEPA World Model: Solving Representational Collapse Elegantly with One Regularizer

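Key point: The core challenge of the world model approach (JEPA) that Yann LeCun has bet $1.03 billion on is representational collapse during training. LWM avoids it at very low cost with the SIG regularizer, achieving a roughly 50x speedup with a lightweight 15M-parameter model.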

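Isaac Ward noted that world models are not a new concept; they trace back to a 1990 conference paper by Richard S. Sutton: "a black box that takes a situation and an action as input and outputs a prediction of the next situation." Today the idea just comes with new packaging and marketing.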

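Three key capabilities: (1) generating "imagined" future states (for planning); (2) model-based control (covered in Stannis's preceding talk); (3) surprise quantification: a world model can quantify its prediction error, which spikes when the environment changes, a capability that model-free methods do not have natively.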

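Model-free vs. model-based: this is a question the research and startup communities are currently "at war" over. Isaac pointed out that even a model-free policy hides a highly entangled world model internally (there is published evidence for this), so the question is not "is there a world model" but "is it represented explicitly."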

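The core difficulty in training world models is representational collapse: you are simultaneously learning "how to represent the world" and "how actions change the world," and the optimization landscape has many trivially collapsed local minima (such as "all states are the same"). Existing methods avoid them with various tricks (explicit heuristics, foundation models as backbones, privileged data).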

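LWM's elegant solution: JEPA (Joint Embedding Predictive Architecture) predicts in latent space rather than pixel space. An encoder maps each observation to a latent vector, a predictor predicts "the next latent after executing the action," and a SIG regularizer is added:
Sketched: take one-dimensional slices of the high-dimensional embedding
Isotropic: every direction looks the same
Gaussian: the distribution on each slice should be Gaussian
If all three conditions hold, the latent distribution is healthy and does not collapse.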
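
The three conditions can be illustrated with a small pure-Python sketch: project the embeddings onto random unit directions and penalize each one-dimensional slice for deviating from a standard Gaussian. Note the assumption: this version only matches the mean and variance of each slice, which is a simplified stand-in for whatever Gaussianity statistic the paper uses, and the function names are invented for this example.

```python
import math
import random
from typing import List


def random_unit_vector(rng: random.Random, dim: int) -> List[float]:
    """A uniformly random direction, used to take one 1-D slice ("sketch")."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def slice_penalty(values: List[float]) -> float:
    """How far a 1-D slice is from N(0, 1), judged only by mean and variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean ** 2 + (var - 1.0) ** 2


def sig_penalty(embeddings: List[List[float]], rng: random.Random,
                n_slices: int = 16) -> float:
    """Average slice penalty over random directions (isotropy: all directions count)."""
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(n_slices):
        direction = random_unit_vector(rng, dim)
        projected = [sum(e * d for e, d in zip(emb, direction)) for emb in embeddings]
        total += slice_penalty(projected)
    return total / n_slices


data_rng = random.Random(0)
healthy = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(512)]
collapsed = [[0.0] * 8 for _ in range(512)]  # "all states are the same"

print(sig_penalty(healthy, random.Random(1)) < sig_penalty(collapsed, random.Random(1)))
```

A collapsed encoder has zero variance along every slice, so its penalty is large, while a well-spread embedding scores close to zero; adding such a term to the prediction loss is what discourages the trivial solution.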

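Empirical results: on the 2D task (Push-T), LWM outperforms competing methods; on 3D tasks, DINO World Model does better (because it has a foundation-model backbone); but LWM is roughly 50x faster (all the work happens in latent space, with no extra forward passes or second model), runs on a single GPU with <24GB VRAM, and has only 15M parameters.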

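The coolest capability demo: change the color of the Push-T block or teleport it to a new position, and the world model detects a spike in model error at the moment of the perturbation. This kind of "surprise quantification" is critical for real-world deployment.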
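
One simple way to turn prediction error into a surprise signal is to compare each new error against the running statistics of the errors seen so far. The sketch below is an assumption about how such a detector could look, not the method shown in the talk; the threshold rule (`mean + k * std`), the `warmup` length, and the error values are all made up.

```python
import math
from typing import List


def surprise_steps(errors: List[float], warmup: int = 5, k: float = 4.0) -> List[int]:
    """Return the time steps whose prediction error spikes above the running baseline."""
    flagged = []
    for t in range(warmup, len(errors)):
        history = errors[:t]
        mean = sum(history) / t
        std = math.sqrt(sum((e - mean) ** 2 for e in history) / t)
        if errors[t] > mean + k * max(std, 1e-6):
            flagged.append(t)
    return flagged


# Hypothetical per-step latent prediction errors; the object is moved at step 7.
errors = [0.10, 0.11, 0.09, 0.10, 0.12, 0.10, 0.11, 0.95, 0.10]
print(surprise_steps(errors))  # prints [7]
```

A model-free policy has no prediction to compare against, so it has no direct equivalent of this error signal.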

"Hidden in this presentation is really a billion-dollar question — Yan LeCun's raised $1.03 billion basically just to train world models." —— Isaac Ward

The "Mysteries" of Deep Learning Are Not So Mysterious: A Unified Explanation via PAC-Bayes

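Key point: Andrew Gordon Wilson's work uses classical PAC-Bayes theory to give a unified explanation of three "mysteries" (overparameterization, benign overfitting, and double descent). The key insight is that overparameterization lowers both the training loss and the compression term.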

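PAC-Bayes bound: test loss (generalization) ≤ training loss + compression term. People previously found this bound to be vacuous (too loose to be useful) for large models, but Andrew points out that this is because the compression term was being computed the wrong way.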
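
For readers who want to see the shape of such a bound, one standard compression-style form is written out below. It is offered as an illustration rather than the exact statement from the talk, and it assumes a loss bounded in [0, 1], n i.i.d. training samples, and a prefix-free code for hypotheses that is fixed before seeing the data. The symbols R, \hat{R}, K, n, and \delta are notation introduced here.

```latex
% With probability at least 1 - \delta over the draw of the training set,
% for every hypothesis h:
R(h) \;\le\; \hat{R}(h) \;+\; \sqrt{\frac{K(h)\,\ln 2 + \ln(1/\delta)}{2n}}
% R(h):       expected (test) loss of h
% \hat{R}(h): empirical (training) loss of h on n samples
% K(h):       length in bits of the code for h (the compression term)
```

The second term shrinks when the trained model can be described in fewer bits, which is why a more compressible solution gives a tighter guarantee even when the raw parameter count is large.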

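Why overparameterization improves generalization: (1) more parameters lead to lower empirical risk (training loss), which is intuitive. (2) The key finding is that more parameters lead to more compressible solutions (experiments by Lotfi et al. found that parameter count is negatively correlated with the number of bits needed to encode the training set). Both terms decrease, so generalization improves.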

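The flatness view: as the parameter count increases, the volume of flat minima in parameter space grows exponentially, while the volume of sharp minima grows much more slowly. Flat minima are more compressible, so overparameterization naturally leads to better generalization. This lets PAC-Bayes bounds give useful (non-vacuous) guarantees even for billion-parameter models.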

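Benign overfitting: neural networks can fit completely random noise while still generalizing well on structured data, which looks contradictory. Andrew builds intuition with a regularized polynomial model: the model has enough parameters to fit any data, but regularization pushes it to prefer low-order terms. Neural networks are "highly expressive models with a soft inductive bias," combining flexibility with a preference for solutions that generalize.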

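A corollary of the No Free Lunch theorem: the only way to improve learning efficiency is through inductive bias. If we can find the right inductive biases (guided by these theories), we may be able to optimize generalization performance directly. Given the huge sample efficiency gap between AI and humans, the potential payoff is very large.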

Pre-training in the Data-Constrained Era: Ensembling + Distillation = 5x Data Efficiency

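Key point: Internet data grows only ~3% per year while compute spending grows ~4-5x per year, so data will become the primary bottleneck for pre-training. Combining aggressive regularization, ensembling, and distillation gives a 5x data efficiency win at a fixed data budget, giving classical ML techniques a new life.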

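The data wall is approaching: human-generated internet text grows about 3% per year while pre-training compute grows about 4-5x per year, so the compute spent per data point grows about 4x per year. This means "data-constrained but compute-unconstrained" is becoming the new algorithmic regime.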
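
The "about 4x per data point" figure follows from dividing the two growth rates. A minimal check, using only the numbers quoted above (the function name is made up for this example):

```python
def compute_per_token_growth(compute_growth: float, data_growth: float = 1.03) -> float:
    """Yearly growth factor of compute available per data point."""
    return compute_growth / data_growth


for compute_growth in (4.0, 5.0):
    print(compute_growth, round(compute_per_token_growth(compute_growth), 2))
```

With compute growing 4-5x and data growing 1.03x per year, compute per data point grows by roughly 3.9x to 4.9x per year.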

Canonical setting: 200M tokens from DCLM (general web data). The standard recipe (epoching plus scaling up the model) starts to overfit once the model is overparameterized, and the loss turns back up.

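First technique: aggressive regularization, with weight decay set to 30 times the value used in standard compute-optimal pre-training. Result: loss follows a very clean power law in parameter count (exponent = 1, consistent with data constraint theory) with a clear asymptote (3.43) that represents "the best possible performance under infinite compute."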
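
A power law with an asymptote can be written as L(N) = E + A / N^alpha, where E is the loss in the infinite-parameter limit. The sketch below shows how E can be recovered from two measurements when alpha = 1. The asymptote 3.43 is the value quoted above; the scale A and the two model sizes are synthetic numbers chosen only for illustration.

```python
from typing import Tuple


def power_law_loss(n_params: float, asymptote: float, scale: float,
                   exponent: float = 1.0) -> float:
    """L(N) = E + A / N^alpha."""
    return asymptote + scale / n_params ** exponent


def fit_asymptote(n1: float, l1: float, n2: float, l2: float) -> Tuple[float, float]:
    """Solve L = E + A / N for (E, A) from two (N, L) points, assuming exponent 1."""
    scale = (l1 - l2) / (1.0 / n1 - 1.0 / n2)
    asymptote = l1 - scale / n1
    return asymptote, scale


# Synthetic measurements generated from E = 3.43 and a made-up A = 6e7.
small = power_law_loss(150e6, asymptote=3.43, scale=6e7)
large = power_law_loss(600e6, asymptote=3.43, scale=6e7)
estimated_asymptote, estimated_scale = fit_asymptote(150e6, small, 600e6, large)
print(round(estimated_asymptote, 2))
```

In practice the fit would use more than two model sizes and a least-squares procedure, but the idea is the same: the asymptote summarizes how good the recipe can get with unlimited parameters at a fixed token budget.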

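Second technique: ensembling. An ensemble of five 300M models (1.5B parameters in total) beats a single 1.5B model in the data-constrained setting. The ensemble also follows a clean power law, and its asymptote is significantly lower than that of single-model regularization.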
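
As a small illustration of what ensembling does at prediction time, the sketch below averages the next-token distributions of several members and compares the resulting loss with the members' average loss. Averaging probabilities (rather than logits) is an assumption made here, and the three distributions are invented.

```python
import math
from typing import List


def ensemble_probs(member_probs: List[List[float]]) -> List[float]:
    """Average the members' next-token distributions, one entry per vocabulary item."""
    k = len(member_probs)
    vocab = len(member_probs[0])
    return [sum(p[v] for p in member_probs) / k for v in range(vocab)]


def nll(probs: List[float], target: int) -> float:
    """Negative log-likelihood of the correct token."""
    return -math.log(probs[target])


# Three hypothetical members predicting over a 3-token vocabulary; token 0 is correct.
members = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.6, 0.1, 0.3]]
target = 0

ensemble_loss = nll(ensemble_probs(members), target)
mean_member_loss = sum(nll(p, target) for p in members) / len(members)
print(ensemble_loss <= mean_member_loss)
```

By Jensen's inequality the ensemble's loss is never worse than the average member's loss; how much better it gets depends on how differently the members err, which is where independently trained models help.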

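Combining the two (Joint Scaling Recipe): take the ensemble's asymptote (the limit over K), then take the limit over model size. The double limit gives the theoretical best performance of "an infinitely large model with infinitely many ensemble members." Result: a 5x data efficiency win over the standard recipe.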

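Making the result practical with distillation: an 8-ensemble (2.4B parameters in total) distilled into a single 300M model retains 83% of the loss improvement. Even more surprising is self-distillation: distilling one 300M model into another fresh 300M model also lowers the loss significantly, even beyond the asymptote of the regularized recipe. Prior work shows that self-distillation is equivalent to implicitly training a 2-ensemble.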
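
A common way to distill is to train the student against the teacher's full output distribution instead of the one-hot label. The sketch below shows that soft-target loss for a single token position. It is a generic form assumed for illustration and may differ from the distillation procedure used in the paper; the teacher distribution is invented.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(teacher_probs: List[float], student_logits: List[float]) -> float:
    """Cross-entropy of the student's distribution against the teacher's soft targets."""
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


# Hypothetical teacher output (for example, an averaged ensemble distribution).
teacher = [0.6, 0.3, 0.1]
matched_student = [math.log(p) for p in teacher]  # reproduces the teacher exactly
uniform_student = [0.0, 0.0, 0.0]                 # knows nothing yet

print(distill_loss(teacher, matched_student) < distill_loss(teacher, uniform_student))
```

The loss is minimized when the student reproduces the teacher's distribution, so a single small model is pulled toward the ensemble's behavior; in self-distillation the teacher is simply another model of the same size.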

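Independence from data scale: the experiments were repeated at 4 different token counts (up to 1.7B tokens), and the data scaling law shows that the data efficiency win is constant across token counts, so the 5x advantage should hold even when extrapolated to the 10-trillion-token scale.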

CPT validation: continued pre-training of a 3B model on only 4B math-related tokens (instead of the full 73B corpus), combined with aggressive epoching and ensembling, matched the performance of the full 73B tokens, a 17x data efficiency win.

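The host pointed to the bigger picture: the two major open problems in AI are intelligence per watt and intelligence per sample. The former is off by 1-2 orders of magnitude and the latter by many orders of magnitude. "I don't know what percentage of the internet you've read, but I haven't read the whole internet."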

"When you're constrained by data and unconstrained by compute, the types of algorithmic choices you make matter a lot." —— Con Woo

Appendix: Key People, Institutions, Products, and Data

Item	Details
Tanishk	Stanford PhD student, author of the SSD paper; collaborators Tri Dao and Avner May
Stannis	Google DeepMind staff research scientist; DMPC paper
Isaac Ward	World models researcher; presented the LWM paper
Ashe	Q Labs co-founder/president (YC startup); works with Andrew Gordon Wilson
Con Woo (Ku)	Co-lead of the data efficiency scaling paper; collaborators Suhas, Percy Liang, and Tatsu
Andrew Gordon Wilson	Generalization theory researcher; "Deep Learning is Not So Mysterious or Different"
Yann LeCun	Proposed the JEPA architecture; raised $1.03 billion in March 2026 to train world models
Chris Manning	Cited by the host as possibly the only person at the 300,000-citation level
Chris Ré	Stanford professor; the host's lab, which studies intelligence per sample
SSD	Speculative Speculative Decoding; parallelizes speculative decoding
DMPC	Diffusion Model Predictive Control; factorized diffusion-based control
LWM	LeWorldModel; JEPA architecture + SIG regularizer
SIG	Sketched Isotropic Gaussian regularizer; prevents representational collapse
PAC-Bayes	Classical generalization theory framework: test loss ≤ training loss + compression term
Llama 3 70B	Reaches 300 tokens/sec with SSD on 4×H100
DCLM	General web dataset used for the data efficiency experiments
W16 batch	YC Winter 2016; 140 companies, 10-15 unicorns
Data growth rate	Internet text ~3%/year vs. pre-training compute ~4-5x/year
Joint scaling recipe	5x data efficiency win (double limit)
Self-distillation	Equivalent to an implicit 2-ensemble
CPT 17x win	4B math tokens match the full 73B corpus