
Sergey Levine: Realizing Autonomous AI Robots: The "Self-Reinforcing Cycle" Toward Practical Deployment Begins Within 10 Years

· Approx. 143 minutes

Preface

The other day, I used AI to organize a freshly uploaded video. If things go as Professor Sergey Levine predicts, autonomous AI robots will gradually start spreading through society around ten years from now, and in 20-30 years (a generation later) society may no longer be able to function without them... perhaps.

Summary

AI

The Evolution and Future of Robot AI

In this podcast conversation, Sergey Levine, a professor at UC Berkeley and co-founder of Physical Intelligence, discusses the progress and future potential of robot AI with Dwarkesh Patel.

Levine describes Physical Intelligence's effort to build a general-purpose foundation model for robots, and highlights early progress in getting robots to perform dexterous tasks such as folding laundry and cleaning.

The conversation digs into key challenges such as continual learning on physical tasks, common sense, and human-robot collaboration. It also considers the timeline and economic impact of autonomous robots in comparison with the evolution of large language models (LLMs), and touches on the roles of hardware and software in AI development, the challenges of data collection, and geopolitical aspects.

Table of Contents

  1. Preface
  2. Summary
  3. Overview
    1. 1. Goals and Current State of Robot Foundation Models
    2. 2. Challenges and Timeline for Realizing Autonomous Robots
    3. 3. Comparison with LLMs and Advantages of Robot Learning
    4. 4. Comparing Robots and Autonomous Driving
    5. 5. Factors Accelerating Progress in Robot AI
    6. 6. Vision Models and Their Relevance to Real-World Tasks
    7. 7. Emergent Capabilities and Moravec's Paradox
    8. 8. Robot Compute and Hardware Challenges
    9. 9. From Imitation Learning to Reinforcement Learning
    10. 10. Integrating Robots and Knowledge Work
    11. 11. Simulation and Real-World Data
    12. 12. Socioeconomic Impact and Preparing for the Future
    13. Conclusion
  4. Realizing Autonomous AI Robots: The "Self-Reinforcing Cycle" Toward Practical Deployment Begins Within 10 Years
  5. Timeline
  6. Key People
  7. Sources
    1. Video description

Overview

AI

Detailed Briefing Document: The Evolution and Future of Robot AI

Overview

This briefing document is based on excerpts from a conversation with Sergey Levine (co-founder of Physical Intelligence, professor at UC Berkeley), "Fully autonomous robots are much closer than you think – Sergey Levine". The conversation digs deeply into the current state of robot AI, its outlook, the major challenges, and the socioeconomic impact. The central themes are the potential of robot foundation models, when autonomous robots will arrive, and the technical and strategic considerations for getting there.

Key Themes and Important Ideas/Facts

1. Goals and Current State of Robot Foundation Models

  • Goal: Physical Intelligence aims to build a robot foundation model, a "general-purpose model that can control any robot to perform any task." Levine sees this effort as "a very fundamental aspect of the AI problem," saying that "if you can get a robot that's really general, then you can do a large chunk of what people can do." (00:00:46)
  • Current state: The company has succeeded in building the basics of robots that can handle "dexterous tasks" such as folding laundry and cleaning up a kitchen. "I think the results are pretty cool," Levine says, but this is only "the very, very early beginnings," and the end goal goes well beyond this "very simple and basic version." (00:01:05, 00:02:29)
  • Ultimate vision: Rather than telling a robot "please fold this T-shirt," the aim is to give it high-level, ongoing instructions such as "Robot, from now on you're doing all my housework. I'd like dinner made at 6 p.m., I wake up and go to work at 7 a.m., and I'd like my laundry done on Saturday, so have it ready," and have the robot carry that out autonomously for months to a year. (00:02:37)

2. Challenges and Timeline for Realizing Autonomous Robots

  • Key challenges: Realizing the ultimate vision requires "the ability to learn continuously," "understanding of the physical world and common sense," "the ability to pull in more information when needed," "the ability to handle edge cases intelligently," "the ability to improve continuously," "an understanding of safety and reliability," and "the ability to fix one's own mistakes." (00:03:23-00:03:59)
  • Timeline: Levine sees the point at which robots reach "a basic level of competence where they deliver something useful" and the "flywheel" of "collecting experience in the real world and leveraging it to get better" starts spinning as "very close." (00:04:22, 00:04:42) Specifically, he considers "single-digit years" very realistic, and personally is "really hoping something will actually be out there within one or two years." (00:05:14)
  • A fully autonomous housekeeper: For something "fairly robust" like a fully autonomous housekeeper, he also expects "probably single digits," and when Patel proposed "five years" as a median, he agreed that "five is a good median." (00:09:39, 00:10:40) That would imply being able to do "most blue-collar work." (00:10:52)
  • Nature of the progress: This will not be a case of "developing everything in the laboratory and then being done." Rather, "as we've seen with AI assistants, once a basic level of competence is reached, it will go out into the world, accumulate experience, and improve through it," an incremental process. (00:04:12)

3. Comparison with LLMs and Advantages of Robot Learning

  • The LLM flywheel: In the LLM world, a fully automated flywheel has not clearly been established yet, but many organizations are working on exactly that, and Levine notes that a "human-in-the-loop flywheel" already exists. (00:05:55)
  • Advantages of robot learning: Robotics is "not that different" from LLMs, Levine says, but a few differences make things "a bit more manageable." (00:07:08)
    • Natural supervision: When a person supervises or directs a robot, there are "very natural sources of supervision," and the person has a "big incentive" to help the robot succeed. (00:07:18)
    • Recovering from and learning from mistakes: When acting in the physical world, robots have more opportunities to make mistakes, recover from them, and take away lessons that help avoid them in the future. "If you're folding a T-shirt and you messed up a little bit, it's pretty obvious," Levine offers as an example. (00:07:37-00:07:58)
    • Diverse learning signals: Robots can also learn from spoken instructions and from the natural feedback that arises when working alongside people. This points to learning possibilities beyond observing human behavior or labeling actions. (00:15:36)

4. Comparing Robots and Autonomous Driving

  • What is different from 2009: As the biggest difference between 2009 and today, Levine cites progress in "the technology of machine learning systems, especially perception for understanding the world around them." "At this point in 2025, we have much better technology for generalizable and robust perception systems, and more generally for systems that understand the world around us," he says. (00:18:22, 00:18:59)
  • Properties of robotic manipulation: Robotic manipulation is "in some ways a much harder problem," but "in other ways it's a problem space where it's easier to get the flywheel started with a more limited scope." (00:19:15, 00:19:24)
    • Tolerance for mistakes: In driving, mistakes have serious consequences, which makes learning from them hard; in many manipulation tasks such as doing the dishes, a robot can "make mistakes, correct them, and learn from them." (00:19:56, 00:20:18)
    • Common sense: Thanks to LLMs and VLMs, "the ability to make reasonable inferences about what might happen" (common sense) has improved dramatically. "No autonomous car in 2009 would have been able to answer that question," Levine notes. (00:20:26, 00:21:02)

5. Factors Accelerating Progress in Robot AI

  • Industrial-scale effort: Making robot foundation models work requires "an industrial-scale building effort," not "just a laboratory science experiment." Past research was important as "fundamental research," but it lacked "the impetus to make it real." This calls for "a singular focus dedicated to really nailing the robot foundation model for its own sake." (00:22:50, 00:23:08, 00:23:25)
  • Data-collection challenges: Data is a major bottleneck, but what matters is understanding "which axes of scale contribute to which axes of capability." Beyond simply collecting more data, you have to identify "what kind of data to collect and in what settings," and "which methods consume that data and how they work." (00:24:04, 00:24:58)
  • The learning flywheel: More important than knowing how much data is ultimately needed is the question of when you can start "a data flywheel that represents self-sustaining, ever-growing data collection," i.e., "when can we get started?" (00:26:17, 00:26:48) This includes robots "learning on the job," or acquiring data in ways where the collection process itself is "useful and valuable." (00:26:48, 00:27:00)
  • Model architecture: Physical Intelligence's current model is "a vision-language model adapted for motor control": an open-source LLM such as Gemma combined with an "action expert." It is an "end-to-end transformer" that, in addition to handling image and language information, has an "action decoder" that generates the robot's continuous actions (a minimal illustrative sketch follows this list). (00:27:54, 00:29:10, 00:29:47)
  • Leveraging existing knowledge: The biggest benefit recent AI innovations bring to robotics is "the ability to leverage prior knowledge." The "abstracted knowledge about the world" that comes from pretrained LLMs and VLMs is extremely powerful. (00:29:59, 00:30:26)
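As a rough illustration of the architecture described in the "Model architecture" bullet above, here is a minimal PyTorch-style sketch: a (stand-in) vision-language backbone produces hidden states, and a separate action expert decodes a chunk of continuous actions from them. This is not Physical Intelligence's code; the class names, dimensions, stand-in transformer backbone, and plain regression output are assumptions made for illustration (the real model builds on a pretrained VLM such as Gemma and uses a flow-matching action decoder rather than simple regression).

```python
# Minimal sketch of a "VLM backbone + action expert" policy, assuming hypothetical
# class names and sizes. Not the actual Physical Intelligence implementation.
import torch
import torch.nn as nn


class ActionExpert(nn.Module):
    """Decodes a chunk of continuous actions from the backbone's hidden states."""

    def __init__(self, hidden_dim: int, action_dim: int, chunk_len: int):
        super().__init__()
        # Learned queries cross-attend to the VLM context, one query per action step.
        self.queries = nn.Parameter(torch.randn(chunk_len, hidden_dim))
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.out = nn.Linear(hidden_dim, action_dim)  # continuous outputs, not text tokens

    def forward(self, vlm_hidden: torch.Tensor) -> torch.Tensor:
        # vlm_hidden: (batch, seq_len, hidden_dim)
        q = self.queries.unsqueeze(0).expand(vlm_hidden.size(0), -1, -1)
        ctx, _ = self.attn(q, vlm_hidden, vlm_hidden)
        return self.out(ctx)  # (batch, chunk_len, action_dim)


class RobotPolicy(nn.Module):
    """Stand-in VLM backbone plus action expert, wired end to end."""

    def __init__(self, hidden_dim=512, action_dim=14, chunk_len=50):
        super().__init__()
        # Stand-in for a pretrained vision-language model (image + text tokens in).
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.action_expert = ActionExpert(hidden_dim, action_dim, chunk_len)

    def forward(self, obs_tokens: torch.Tensor) -> torch.Tensor:
        # obs_tokens: already-embedded image and language tokens.
        hidden = self.backbone(obs_tokens)
        return self.action_expert(hidden)


policy = RobotPolicy()
actions = policy(torch.randn(1, 128, 512))
print(actions.shape)  # torch.Size([1, 50, 14])
```

The separate ActionExpert module reflects the point made in the transcript: actions are continuous and high-frequency, so they need a different output head and data format than discrete text tokens.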

6. Vision Models and Their Relevance to Real-World Tasks

  • The challenge with video models: Historically, generating images and video did not yield the kind of "deep understanding of the world" that language models gained. Levine's view is that text has already been abstracted into the information humans care about, whereas video is represented at a different semantic level, more like "compressed pixels." (00:34:02, 00:34:24, 00:35:22)
  • Why purpose matters for robots: For a robot, however, the key point is that it is "trying to do a job." Its perception is "in service to fulfilling that purpose," and this "powerful focusing factor" helps it filter information and learn, much as humans exhibit tunnel vision and "literally do not see what is right in front of their eyes" when it is irrelevant to their goal. (00:35:49, 00:36:09)
  • Limits of passive observation: Levine thinks it is hard for a robot to learn the physical world efficiently just by watching huge amounts of video such as YouTube, because without a concrete goal it is unclear "what to look for." However, an "embodied foundation model" that "learns from interaction" may, by having goals, be able to "absorb other data sources better." (00:36:57, 00:37:20, 00:37:42)

7. Emergent Capabilities and Moravec's Paradox

  • Emergent capabilities: The emergent capabilities of LLMs stem not only from how much information internet data contains, but from the fact that "once generalization reaches a certain level, it becomes composable." Levine gives the example of an LLM that can write a recipe in the International Phonetic Alphabet, which he describes as "compositional generalization." (00:39:23, 00:39:55, 00:40:22)
  • Emergent capabilities in robots: Physical Intelligence's robots have also shown emergent behaviors by happenstance. For example, dropping one T-shirt after accidentally picking up two, or standing a grocery bag back up after it tipped over, appeared even though "we never explicitly directed data collection for that." "When you do learning at scale, you get this kind of composability," Levine says. (00:40:53, 00:41:17, 00:41:37)
  • Short context windows: That a robot can perform "tasks that take minutes" with "only about one second of context" is, Levine explains, down to "Moravec's paradox." (00:42:32, 00:43:00, 00:43:36)
    • Moravec's paradox: In AI, "the easy things are hard and the hard things are easy." Perception and object manipulation, which humans do unconsciously, are hard for AI, whereas cognitively demanding tasks such as chess or calculus are easy for it.
    • Memory and cognitive load: When humans perform cognitively heavy tasks (complex math problems, for example) they need to hold a lot in memory, but when performing well-practiced skills (like an Olympic swimmer) they "focus on the moment" and "don't need to carefully think through all the context," so they can execute with far less memory. (00:44:07, 00:44:41, 00:45:04)

8. Robot Compute and Hardware Challenges

  • Inference trade-offs: Robots face three main computational trade-offs: "inference speed," "context length," and "model size." The current model has roughly 100 ms inference latency, about one second of context, and billions of parameters, all of which are "orders of magnitude smaller" than the human brain (see the back-of-envelope sketch after this list). (00:46:17, 00:46:57)
  • Representing context: One part of the solution is finding "the right representation of context": summarizing past observations and changes compactly and discarding what is unnecessary enables more efficient processing. Levine expects progress in multimodal models to help here. (00:47:19, 00:48:39, 00:49:09)
  • Brains vs. GPUs: The human brain excels at parallel processing, whereas today's models lean toward sequential processing. But because transformers are inherently "parallelizable," future robot systems may adopt parallel processes that "do perception, proprioception, and planning at the same time." (00:50:26, 00:51:05, 00:51:47)
  • Hardware and algorithmic progress: Five years from now, not only better GPUs but also algorithmic advances such as "devising the right representations" will enable higher-performing robots. In particular, exploiting the "temporal correlation" of sensor streams to use more compressed representations can greatly improve efficiency. (00:53:42, 00:54:11, 00:54:44)
  • On-board vs. off-board inference: In the future we will likely see "both" forms: cheaper systems that "externalize part of their thinking" via off-board inference, and higher-cost on-board inference for systems that need more reliability, such as outdoor robots that cannot rely on connectivity. (00:55:02, 00:56:14)
  • Planning and reacting: When controlling a robot's movements, heavy thinking is not needed at every time step; plans can be made "up front" and then "unrolled," with the robot reacting "at a more basic level of abstraction" during the motion itself. (00:56:51, 00:57:19, 00:57:38)
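To make the context-representation point concrete, the short calculation below compares a naive one-second context built from raw per-frame vision tokens with a compressed representation that exploits temporal correlation. Only the ~100 ms inference time, the roughly one-second context, and the billions of parameters come from the conversation; the camera count, frame rate, and token counts are assumptions chosen purely for illustration.

```python
# Back-of-envelope sketch of the compute trade-offs described above.
# Camera count, frame rate, and tokens-per-frame are illustrative assumptions.

cameras = 3             # assumed number of camera streams on the robot
fps = 30                # assumed frame rate
tokens_per_frame = 256  # assumed vision-encoder tokens per frame
context_seconds = 1.0   # roughly what the current model keeps, per the talk

raw_tokens = int(cameras * fps * tokens_per_frame * context_seconds)
print(f"Naive 1 s context: {raw_tokens} tokens")        # 23040 tokens

# Exploiting temporal correlation: one full keyframe plus small per-frame "change" summaries.
keyframe_tokens = cameras * tokens_per_frame
delta_tokens_per_frame = 16  # assumed compressed update per subsequent frame
compressed = int(keyframe_tokens + cameras * (fps * context_seconds - 1) * delta_tokens_per_frame)
print(f"Compressed 1 s context: {compressed} tokens")   # 2160 tokens

# A 100 ms forward pass bounds how often the high-level model can re-plan;
# lower-level reactive control has to run in between those updates.
inference_ms = 100
print(f"High-level decision rate: {1000 / inference_ms:.0f} Hz")
```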

9. From Imitation Learning to Reinforcement Learning

  • The need for RL: Levine has not changed his view that "reinforcement learning (RL) is better than imitation learning in many cases." To learn effectively from your own experience, however, "already knowing something about what you're doing" is critically important. (00:58:01, 00:58:24)
  • Building prior knowledge: For that reason, the current models are trained with "supervised learning" to build "a foundation of prior knowledge," which "lets you figure things out much faster later." (00:58:57) He expects this to follow the same trajectory as LLMs, which were first trained on "next-token prediction" with RL built on top of that foundation (a minimal sketch of that two-stage recipe follows this list). (00:59:28)
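The sketch below illustrates, under heavy simplification, the two-stage recipe described above: supervised imitation on demonstrations to build a prior, followed by reinforcement learning on the policy's own experience. This is not Physical Intelligence's training code; the toy policy, synthetic data, and placeholder reward are all assumptions for illustration.

```python
# Minimal sketch: behavior cloning first, then RL fine-tuning of the same policy.
# All data, dimensions, and the reward function are illustrative stand-ins.
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 7
policy = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, act_dim))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# --- Stage 1: supervised imitation (the "prior knowledge" base) ---
demo_obs = torch.randn(1024, obs_dim)   # stand-in for teleoperated demonstrations
demo_act = torch.randn(1024, act_dim)
for _ in range(100):
    pred = policy(demo_obs)
    bc_loss = ((pred - demo_act) ** 2).mean()   # imitate demonstrated actions
    opt.zero_grad(); bc_loss.backward(); opt.step()

# --- Stage 2: RL fine-tuning on the policy's own experience ---
def rollout_reward(actions: torch.Tensor) -> torch.Tensor:
    # Placeholder reward, e.g. task-success or progress signals gathered on the job.
    return -(actions ** 2).sum(dim=-1)

for _ in range(100):
    obs = torch.randn(256, obs_dim)              # observations from deployment
    mean = policy(obs)
    dist = torch.distributions.Normal(mean, 0.1)
    actions = dist.sample()
    reward = rollout_reward(actions)
    # REINFORCE-style update: make higher-reward actions more likely.
    logp = dist.log_prob(actions).sum(dim=-1)
    rl_loss = -(logp * (reward - reward.mean())).mean()
    opt.zero_grad(); rl_loss.backward(); opt.step()
```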

10. Integrating Robots and Knowledge Work

  • Converging models: Levine says he "really hopes" that in ten years "the best models for knowledge work will also be robot models, or will have an action expert." (00:59:52)
  • What robots contribute: He believes the embodied aspects of robots will improve everything else.
    • Representation and focus: The "focus" that comes from trying to accomplish a task helps structure the world and makes it possible to use other signals more effectively. (01:00:19)
    • A deep understanding of the physical world: Understanding the physical world at a "very deep, fundamental level" that cannot be expressed in language also helps with other problems. "We experience the world in a particular way, and our subjective experience shapes how we think in very deep ways," Levine explains. (01:00:52)

11. Simulation and Real-World Data

  • Limits of simulation: Simulation is useful for "rehearsal and considering counterfactuals," but it cannot "teach you more information about the world." "The information has to be injected into the system," Levine stresses. (01:02:36, 01:06:33, 01:07:12)
  • Simulation from models: Levine believes the most powerful way to create synthetic experience comes "from a really good model," because such a model "probably knows more about the fine-grained details than a human does." That model, however, still "has to get its knowledge from experiencing the world." (01:07:12, 01:07:44)
  • Considering counterfactuals: The heart of optimal decision-making, Levine says, is "considering counterfactuals." If you have a mechanism for answering "would doing this instead of that have been better?", it ultimately does not matter whether that mechanism is a "learned simulator" or a "value function" (see the small sketch after this list). (01:08:54, 01:09:16)
  • The importance of real-world data: In the end, the key to exploiting other data sources, including simulation, is building a "really good" foundation model, i.e., one that is "trained on real-world data and understands it." (01:05:25, 01:05:53)
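A small sketch of the point above: whether you use a learned simulator (a dynamics model plus a scorer) or a value function, the mechanism you need is one that can answer the counterfactual "would action B have turned out better than action A?". Everything here (the tiny untrained networks, dimensions, and scoring) is an illustrative assumption, not a method from the talk.

```python
# Two interchangeable ways to answer a counterfactual query, as illustration only.
import torch
import torch.nn as nn

obs_dim, act_dim = 16, 4
state = torch.randn(1, obs_dim)
action_a = torch.randn(1, act_dim)
action_b = torch.randn(1, act_dim)

# Option 1: a Q-function directly scores (state, action) pairs.
q_fn = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
def q_value(s, a):
    return q_fn(torch.cat([s, a], dim=-1)).item()

# Option 2: a learned one-step dynamics model imagines the next state,
# and a separate scorer evaluates that imagined outcome.
dynamics = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, obs_dim))
scorer = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
def imagined_score(s, a):
    next_state = dynamics(torch.cat([s, a], dim=-1))
    return scorer(next_state).item()

for name, fn in [("Q-function", q_value), ("learned simulator", imagined_score)]:
    better = "B" if fn(state, action_b) > fn(state, action_a) else "A"
    print(f"{name}: counterfactually prefers action {better}")
```

Either mechanism supports the same counterfactual query; the difference is only whether the evaluation is amortized into a value estimate or unrolled through an imagined outcome.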

12. Socioeconomic Impact and Preparing for the Future

  • The 2030 bottleneck: By 2030, AI capex is expected to reach trillions of dollars per year, at which point the "labor" for physical work such as "building data centers and installing solar panels" could become a major bottleneck. Levine notes that robots "could help" with that construction. (01:10:01, 01:10:36, 01:10:51)
  • Diversity of robots: Robots should be thought of not as "mechanical humans" but as something more like "cars or bulldozers": low maintenance requirements, deployable in many places, no need to resemble people. With intelligence that can drive very diverse robot systems, "you can do a lot more than a mechanical human." (01:11:12)
  • Falling costs: The cost of robot arms has dropped dramatically, from the $400,000 of the research robot Levine started with in 2014 to about $3,000 today, and he thinks "they can be built for a fraction of that." (01:12:34, 01:13:10) This is driven by "economies of scale" and "technological progress," but there is also a large "software factor": "the smarter the AI system gets, the less the hardware has to meet specific requirements." Cheap visual feedback reduces the need for high-precision, highly rigid hardware. (01:13:45)
  • Hardware optimization: Once AI systems are capable enough, many companies will be able to innovate on "the best robot hardware to fill each niche." What matters is an AI system that "can confer a basic level of intelligence," and then identifying the "minimum viable package." (01:16:07, 01:16:46)
  • Supply chains and geopolitics: On the fact that China accounts for much of the AI ecosystem's supply chain, Levine acknowledges that "this is a very complex problem." (01:18:06, 01:19:46)
    • The upside of automation: For an economy with a highly educated, highly productive workforce, automation is very advantageous because it "multiplies each person's productivity." (01:19:46, 01:20:19)
    • A balanced ecosystem: Achieving that goal requires "many good decisions," such as "investing in a balanced robotics ecosystem" and "supporting innovation in both software and hardware." (01:21:08)
    • Robots building robots: He also points to a circular aspect: since producing robots is itself physical work, "being very good at robotics" helps with producing them. (01:22:29)
  • A fully automated society: Dwarkesh Patel argues that society should aim for "full automation," which would bring about "an incredibly wealthy society." Levine agrees that "at some level that's a very reasonable view," but adds that because "technology rarely evolves the way people expect," "the journey matters as much as the destination." (01:25:16, 01:26:06)
  • The importance of education: As a hedge against an uncertain future, Levine emphasizes that "education" is "the best buffer against the negative effects of change." Because a good education builds "the ability to acquire skills and understanding" rather than "particular facts," he concludes that "the single lever we can collectively pull as a society is better education." (01:26:42, 01:27:18, 01:27:33)

Conclusion

The conversation with Sergey Levine suggests that robot AI is not merely a technical innovation but has the potential to transform the very structure of the economy and society. With the arrival of robot foundation models, general-purpose robots could appear within "single-digit years," and the "flywheel" of learning in the real world is increasingly likely to start spinning.

This progress will have a major impact on manufacturing, services, and the construction of AI infrastructure itself. The road there, however, involves many technical challenges: "industrial-scale effort," "collecting the right data," "making compute more efficient," and "reducing hardware costs." On the geopolitical and socioeconomic side, strategic efforts such as "building a balanced ecosystem" and "strengthening education" were also emphasized as essential.

The path to a fully automated society is uncertain, but the direction is clear, and it holds the potential to dramatically increase the productivity and prosperity of human society.

Realizing Autonomous AI Robots: The "Self-Reinforcing Cycle" Toward Practical Deployment Begins Within 10 Years

AI

Within the larger context of the robots' "grand vision and timeline," the sources say the following about the "timeline."

Grand vision: According to Sergey Levine, founder of Physical Intelligence, the company aims to build a "robot foundation model," a general-purpose model that can control any robot to perform any task. The ultimate goal of this vision is for robots not merely to execute specific tasks (folding a T-shirt, folding a box, and so on) but to understand and carry out "broader, more complex instructions," for example "make dinner at 6 p.m., I go to work at 7 a.m., and do the laundry on Saturday." This is explained to require continual learning, an understanding of the physical world, common sense, the ability to pull in needed information, an understanding of safety, reliability, and the ability to fix mistakes. Levine says the current work is the "very, very early beginnings" of tackling these "really tough problems," i.e., putting the basic building blocks in place.

Timeline: Sergey Levine talks about when this grand vision will be realized not in terms of a specific completion date but in terms of "when the flywheel (the self-reinforcing cycle) starts."

  1. The start of the flywheel: Levine thinks that once robots reach "a basic level of competence" and can deliver something useful, they will begin to be deployed in the real world, just like AI assistants, where they will collect experience and use it to improve. He says "the time when the flywheel starts" is "very soon," and Physical Intelligence is already exploring what that could look like.
  2. When something ships: Levine considers "single-digit years (under 10)" very realistic, and he is really hoping "something will actually be out there within one or two years." "Out there" here means a robot that does something people actually care about and want done, competently enough to do it for real.
  3. Broader autonomy: For robots handling broader tasks, such as a housekeeper that runs an entire household autonomously, he also has the sense it will happen in "probably single-digit years" (under 10). When Dwarkesh Patel proposed "five years" as a median, Levine agreed, "I think five is a good median." He frames this, however, in terms of the "scope" of work given to robots expanding gradually, just as "LLMs are good within a certain scope but have limits."

Factors enabling this timeline, and the comparison with self-driving cars: Levine gives several reasons why robotics should not take more than a decade the way self-driving cars did.

  • Progress in machine learning: In 2009 perception systems were immature, but as of 2025 the technology for "generalizable and robust perception systems" has advanced dramatically, providing a much better starting point.
  • Properties of robotic manipulation: While robotic manipulation is in some ways a much harder problem than driving, it is also a problem space where it is "easier to get the flywheel started with a more limited scope."
  • Ease of learning from mistakes: As in the dishwashing example, in physical tasks it is comparatively easy for a robot to make a mistake, correct it, and learn from it. A mistake is clearly visible, can be reflected on, and the task can be done better next time, so knowledge is easy to gain. In driving, mistakes have major consequences, which makes this kind of learning difficult.
  • Common sense: Another big change, impossible about five years ago, is the ability to make reasonable, common-sense guesses using LLMs and VLMs. This lets robots reason about what might happen before experiencing a mistake, opening a path to start with a smaller scope and grow from there.

Challenges ahead and the path forward: Realizing robot foundation models is emphasized to require an industrial-scale building effort, not just a science experiment. Scaling data collection must expand not only "horizontally" (more kinds of tasks) but also along "other (vertical) axes" such as robustness, efficiency, and intelligent handling of edge cases, which requires collecting the right kinds of data and identifying the right methods.

As with LLMs, Sergey Levine points out, the important first step is building a strong foundation of "prior knowledge" through supervised learning. The stronger that foundation, the easier it becomes to further improve the system with more accessible training methods such as reinforcement learning (RL). He also says that humans giving instructions in language and collaborating (human-in-the-loop) increases the chances that robots can learn on the job and acquire new skills.

Timeline

AI
  • 2009: Google starts its self-driving car project. At the time, the technology for machine-learning systems that understand the world (especially perception) was immature and struggled to generalize.

  • 2014: Sergey Levine begins working in robotics. At the time, a research robot (the PR2) cost $400,000 per unit.

  • About five years ago (around 2020): No system yet existed with the kind of common-sense reasoning ability of today's LLMs and VLMs.

  • Present (at the time of the conversation, one year after founding): Physical Intelligence has been founded and aims to build a general-purpose foundation model for robots.

    • Robots can already handle dexterous tasks such as folding laundry and cleaning a kitchen.
    • The cost of a robot arm has fallen to roughly $3,000 per unit.
    • The amount of robot training data is one to two orders of magnitude smaller than multimodal training datasets.
    • Physical Intelligence's current model adds an action expert for motor control to a vision-language model (based on the Gemma model).
    • The potential for robots to learn on the job and acquire skills through human supervision and instruction (e.g., verbal directions) is recognized.
    • Labelbox provides robotics datasets, and Dwarkesh Patel visits the company's lab.
  • Near future (within one to two years, when the flywheel starts): Robots may begin performing genuinely useful tasks competently enough to be deployed in the real world, kicking off the "flywheel" of collecting experience and improving.

    • Early in deployment, the scope of work given to robots will be limited.
  • About five years out (Dwarkesh Patel's estimate): Robots that can run a household fully autonomously may appear (the median of Sergey Levine's "single-digit years" estimate).

    • This could make most blue-collar work automatable, though at first the bigger role will be as tools that boost human productivity.
  • 2025 (between the present and the near future): Machine learning for generalizable, robust perception and for understanding the world has advanced dramatically compared with 2009.

  • 5-10 years out (Sergey Levine's estimate; exact timing uncertain): The "scope of capability" of robots expands gradually, and they handle more tasks autonomously.

    • Innovations in representations and parallel processing address the computational challenges of model memory, context length, model size, and inference speed.
    • Hardware costs fall further, with mobile arms for a few hundred dollars suggested as a possibility.
    • AI systems push the limits of hardware, making hardware reliability and cost increasingly important.
  • 10 years out (Sergey Levine's hope): The best models for knowledge work may also be robotics models or carry an action expert, with the robotics component improving other AI.

  • Around 2030: AI capital expenditure reaches trillions of dollars per year, with large-scale construction of data centers, chip fabs, solar panel factories, and so on.

    • The robot economy is expected to be mature enough to contribute substantially to building this infrastructure.
    • Physically, hundreds of millions to billions of robots may be needed.
  • The future (end state): Society experiences full automation and the accompanying increase in wealth.

    • Technological evolution is unpredictable, and the "journey" toward the end state matters as much as the destination.
    • Education is emphasized as the most important factor in individuals' ability to adapt to change.

Key People

AI
  • Dwarkesh Patel: Host of the podcast. Interviews Sergey Levine, co-founder of Physical Intelligence. Strongly interested in AI, particularly progress in LLMs and robotics and their economic and social impact.

  • Sergey Levine: Co-founder of Physical Intelligence.

    • Professor at UC Berkeley.
    • One of the world's leading researchers in robotics, reinforcement learning (RL), and AI.
    • At Physical Intelligence, aims to build a general-purpose robot foundation model that lets any robot perform any task.
    • Offers detailed views on current robotics progress, the future outlook, and the challenges.
  • Mark: Senior researcher at Hudson River Trading. Appears in the podcast's sponsor segment, discussing a dataset of market prices and historical market data.

  • Sander: Researcher at GDM (Google DeepMind). Works on video and audio models and has his own perspective on the challenges of transfer learning between video and text.

  • Manu: CEO of Labelbox, a company that provides robot training data. Dwarkesh Patel visits his lab and is shown how the data packages are put together.

  • Adnan Esmail: Co-founder alongside Sergey Levine, mentioned as an expert on reducing the cost of robot hardware.

Sources

Video (1:28:28)

Fully autonomous robots are much closer than you think – Sergey Levine

Video description

21,500 views Sep 13, 2025 Dwarkesh Podcast Sergey Levine is one of the world’s top robotics researchers and co-founder of Physical Intelligence. He thinks we’re on the cusp of a “self-improvement flywheel” for general-purpose robots. His median estimate for when robots will be able to run households entirely autonomously? 2030.

If Sergey’s right, the world 5 years from now will be an insanely different place than it is today. This conversation focuses on understanding how we get there: we dive into foundation models for robotics, and how we scale both the data and the hardware necessary to enable a full-blown robotics explosion.


(Below is a speaker-attributed transcript of the conversation video titled "Fully autonomous robots are much closer than you think – Sergey Levine".)

[Dwarkesh Patel] : Today I'm chatting with Sergey Levine, who is a co-founder of Physical Intelligence, which is a robotics foundation model company, and also a professor at UC Berkeley, and just generally one of the world's leading researchers in robotics, RL, and AI. Sergey, thank you for coming on the podcast. Thank you, and thank you for the kind introduction. Let's talk about robotics. So before I pepper you with questions, I'm wondering if you can give the audience a summary of where Physical Intelligence is at right now. You guys started a year ago, and what does the progress look like? (00:00:29)

[Dwarkesh Patel] : What are you guys working on? (00:00:30)

[Sergey Levine] : Yeah, so Physical Intelligence aims to build robotic foundation models, and that basically means general purpose models that could, in principle, control any robot to perform any task. We care about this because we see this as a very fundamental aspect of the AI problem. The robot is essentially encompassing all AI technology, so if you can get a robot that's really general, then you can do, hopefully, a large chunk of what people can do. And where we're at right now is, I think we've kind of gotten to the point where we've built out a lot of the basics. (00:01:05)

[Sergey Levine] : And, you know, I think those basics actually are pretty cool, like they work pretty well. We can get a robot that will, like, fold laundry and that will go into a new home and, like, try to clean up the kitchen. But in my mind, what we're doing at Physical Intelligence right now is really the very, very early beginnings, like putting in place the basic building blocks on top of which we can then tackle all these, like, really tough problems. (00:01:24)

[Dwarkesh Patel] : And what's the year-by-year vision? So, one year in, now I got a chance to watch some of the robots, and they can do pretty dexterous tasks, like folding a box using grippers. And it's like, I don't know, it's like pretty hard to fold a box, even with, like, my hands. If you had to go year by year until we get to the full, like, robotics explosion, what is happening every single year? What is the thing that needs to be unlocked, et cetera? (00:01:46)

[Sergey Levine] : So, there are a few things that we need to get right. I mean, dexterity, obviously, is one of them. And in the beginning, we really wanted to make sure that we understand whether the methods that we're developing have the ability to tackle, like, the kind of intricate tasks that people can do. Now, as you mentioned, like, folding a box, folding different articles of laundry, cleaning up a table, making a coffee, that sort of thing. And that's, like, that's good. (00:02:09)

[Sergey Levine] : Like, that works. You know, I think that the results we've been able to show are pretty cool. But again, like, the end goal of this is not to fold a nice T-shirt. The end goal is to just, like, confirm our initial hypothesis that, like, the basics are kind of solid. But from there, there are a number of really major challenges. And I think that, you know, sometimes when results get abstracted to the level of, like, a three-minute video, someone can look at this video, it's like, oh, that's cool. Like, that's what they're doing. (00:02:33)

[Sergey Levine] : But it's not. Like, it's a very simple and basic version of what I think is to come. Like, what you really want from a robot is not to tell it, like, hey, please fold my T-shirt. What you want from a robot is to tell it, like, hey, robot, like, you're now doing all sorts of home tasks for me. I like to have dinner made at 6 p.m. I wake up and go to work at 7 a.m. I'd like, you know, I like to do my laundry on Saturday, so make sure that's ready. (00:02:58)

[Sergey Levine] : This and this and this. And by the way, check in with me, like, every Monday to see, like, you know, what I want you to do to pick up when you do the shopping. Right. Like, that's the prompt. And then the robot should go and do this for, like, you know, six months, a year. Like, that's the duration of the task. So it's... ultimately, if this stuff is successful, it should be a lot bigger. And it should have that ability to learn continuously. (00:03:23)

[Sergey Levine] : It should have the understanding of the physical world, the common sense, the ability to go in and pull in more information if it needs it. Like, if I ask you, like, hey, tonight, like, you know, can you can you make me this type of salad? It's like, OK, you should, like, figure out what that entails, like, look it up, go and buy the ingredients. So there's a lot that goes into this. It requires common sense. (00:03:40)

[Sergey Levine] : It requires understanding that there are certain edge cases that you need to handle intelligently, cases where you need to think harder. It requires the ability to improve continuously. It requires understanding safety, being reliable at the right time, being able to fix your mistakes when you do make those mistakes. So there's a lot more that goes into this. But the principles there are you need to leverage prior knowledge and you need to have the right representations. (00:04:05)

[Dwarkesh Patel] : So this grand vision, what year, if you had to give a median estimate? Or 25 percentile, 50, 75? (00:04:12)

[Sergey Levine] : I think it's something where it's not going to be a case where we develop everything in the laboratory and then it's done. And then, you know, come 2030-something, you get a robot in a box. I think it'll be the same as what we've seen with AI assistance, that once we reach some basic level of competence where the robot is delivering something useful, it'll go out there in the world. The cool thing is that once it's out there in the world, they can collect experience and leverage that experience to get better. (00:04:38)

[Sergey Levine] : So to me, what I tend to think about a lot in terms of timelines is not the date when it will be done, but the date when it will, when like the flywheel starts, basically. So when does the flywheel start? I think that could be very soon. And I think there's some decisions to be made. Like the trade-off there is the more narrow you scope the thing, the earlier you can get it out into the real world. But soon as in like, this is something we're already exploring. (00:05:04)

[Sergey Levine] : We're already trying to figure out like what are like the real things this thing could do that could allow us to start spinning the flywheel. But I think in terms of like stuff that you would actually care about that you would want to see. So I don't know, but I think that single-digit years is very realistic. I'm really hoping it'll be more like one or two before something is like actually out there, but it's hard to say. And something being out there means what? (00:05:23)

[Sergey Levine] : Like what is out there? It means that there is a robot that does a thing that you actually care about that you want done. And it does so competently enough to like, actually do it for real, for real people that want it done. (00:05:36)

[Dwarkesh Patel] : We already have LLMs, which are like broadly deployed. And that hasn't resulted in some sort of like flywheel. At least not some obvious flywheel for the model companies where the now Claude is like learning how to do every single job in the economy or GPT is learning how to do every single job in the economy. So why doesn't that flywheel work for LLMs? (00:05:55)

[Sergey Levine] : Well, I think it's actually very close to working. And I am like 100% certain that many organizations are working on exactly this. In fact, arguably, there is already a flywheel in the sense that not an automated flywheel, but a human loop flywheel where everybody who's deploying an LLM is, of course, going to look at what it's doing, and it's going to use that to then modify its behavior. It's complex because it comes back to this question of representations and figuring out the right way to derive supervision signals and ground those supervision signals in the behavior of the system so that it actually improves on what you want. (00:06:35)

[Sergey Levine] : And I don't think that's like a profoundly impossible problem. It's just something where the details get like pretty gnarly and challenges with algorithms and stability become pretty complex. So it's just it's something that's taken a while for the community collectively to get their hands around. (00:06:49)

[Dwarkesh Patel] : Do you think it'll be easier for robotics or just that like this the state of this kind of techniques to label data that you collect out in the world and use it as a word will just the sort of like the whole wave will rise and robotics will rise as real or is there some reason that robotics will be will benefit more from this? (00:07:08)

[Sergey Levine] : Yeah, I don't think there's like a profound reason why robotics is that different. But there are a few small differences that I think make things a little bit more manageable. So especially if you have a robot that's doing something in cooperation with people, whether it's a person that's supervising it or directing it, like there are very natural sources of supervision. And there's a there's a big incentive for the person to provide the assistance that will make things succeed. There are a lot of dynamics where you can make mistakes and recover from those mistakes and then reflect back on what happened and avoid that mistake in the future. (00:07:37)

[Sergey Levine] : And I think that when you're doing physical things in the real world, that kind of stuff just happens more often than it does if you're like an AI assistant answering a question. Like if you answer a question, you just like answered it wrong. It's like, well, it's not like you can just like go back and like tweak a few things like the person you told the answer to might not even know that it's wrong. Whereas if you're like folding the T-shirt and you messed up a little bit, like, yeah, it's pretty obvious. (00:07:58)

[Sergey Levine] : Like you can reflect on that, figure out what happened and do it better next time. (00:08:00)

[Dwarkesh Patel] : Yeah. So, okay, in one year, we have robots which are like doing some useful things. Maybe if you have some like relatively simple like loopy process, they can they can do it for you. Just like you got to keep folding like thousands of boxes or something. But then there's some flywheel dot, dot, dot. There's some machine which will like just run my house for me as well as a human housekeeper would. What is the gap between this thing which will be deployed in a year that starts to flywheel and this thing which is like a fully autonomous housekeeper? (00:08:33)

[Sergey Levine] : Well, I think it's actually not that different than what we've seen with LLMs in some ways, that it's a matter of scope. Like, if you think about coding assistance, right? Like initially, the best tools for coding, they could do like a little bit of completion. Like you give them a function signature and they'll like try their best to type out like the whole function and they'll maybe like get half of it right. And as that stuff progresses, then you're willing to give these things a lot more agency. (00:08:57)

[Sergey Levine] : So that like the very best coding assistance now, like if you're doing something relatively formulaic, maybe it can like put together most of a PR for you for something, you know, fairly accessible. So I think it'll be the same thing that we'll see an increase in the scope that we're giving, that we're willing to give to the robots as they get better and better. Where initially the scope might be like, there is a particular thing you do like you're making the coffee or something. (00:09:20)

[Sergey Levine] : Whereas as they get more capable, as their ability to have common sense and a broad repertoire of tasks increases, then we'll give them greater scope. Now you're running the whole coffee shop. (00:09:29)

[Dwarkesh Patel] : I get that there's a spectrum and I get that there won't be a specific moment that feels like we've achieved it. But if you had to give a year in which like that, your median estimate of when that happens. (00:09:39)

[Sergey Levine] : I mean, my sense there too, is that this is probably a single digit thing rather than a double digit thing. But the reason it's so hard to really pin down is because as with all research, it does depend on figuring out a few question marks. And I think my answer in terms of the nature of those question marks is I don't think these are things that require profoundly deeply different ideas, but it does require the right synthesis of the kinds of things that we already know. And, you know, sometimes synthesis, to be clear, is just as difficult as coming up with like profoundly new stuff, right? (00:10:11)

[Sergey Levine] : So I think it's intellectually a very deep and profound problem and figuring that out is going to be like very exciting. But I think we kind of like know like roughly the puzzle pieces, and it's something that we need to work on. And I think if we work on it and we're a bit lucky and everything kind of goes as planned, I think single digit is reasonable. (00:10:33)

[Dwarkesh Patel] : Okay, I'm just going to do binary search until I get a year. Okay, so it's less than 10 years, so more than five years? Your median estimate, I know it's like... I think five is a good median. Okay, five years. So if you can fully autonomously run a house, then I think you've like, you can fully autonomously do most blue collar work. So your estimate is in five years, it should be able to do most like blue collar work in the economy. (00:10:58)

[Sergey Levine] : So I think there's a nuance here. And the nuance is, it becomes more obvious if we consider the analogy to the coding assistance, right? It's not like the nature of coding assistance today is that there's a switch that flips and suddenly, instead of writing software, like suddenly, like all software engineers get fired and everyone's using LMs for everything. And that actually makes a lot of sense that the biggest gain in productivity comes from experts, which is software engineers, whose productivity is now augmented by these really powerful tools. (00:11:34)

[Dwarkesh Patel] : Yeah, I mean, separate from the question of whether people will get fired or not, a different question is like, what will the economic impact be in five years? Yeah. The reason I'm curious about this is with LLMs, the relationship between the revenues for these models to their inherent, their seeming capability has been sort of mysterious in the sense that like, you have something which feels like AGI, you can have a conversation with it really like is like, you know, like passes a Turing test, it really feels like it can do all this knowledge work. (00:12:02)

[Dwarkesh Patel] : It's obviously doing a bunch of coding, etc. But then the revenues for these companies are cumulatively on the order of like 20, $30 billion per year. And that's so much less than all knowledge work, which is 30, $40 trillion. So in five years, are we in a similar situation that LLMs are now? Or is it more like, we have robots deployed everywhere, and they're actually like doing a whole bunch of real work, etc. It's a very subtle question. (00:12:29)

[Sergey Levine] : I think what it probably will come down to is this question of scope, right? Like, the reason that LLMs aren't doing all software engineering is because they're good within a certain scope, but there's limits to that. And those limits are increasing, to be clear, every year. And I think that there's no reason that we wouldn't see the same kind of thing with robots, that the scope will have to start out small, because there will be certain things that these systems can do very well, and certain other things where more human oversight is really important. And the scope will grow, and what that will translate into is increased productivity. (00:13:06)

[Sergey Levine] : And some of that productivity will come from the robots themselves being valuable, and some of it will come from the people using the robots are now more productive in their work. (00:13:16)

[Dwarkesh Patel] : But there's so many things that increase productivity, just like wearing gloves increases productivity, or like, I don't know. But then it's like, you want to understand something which, like, increases productivity 100-fold versus, like, you know, wearing glasses, or something which has, like, a small increase. So robots already increase productivity for workers, right? Where LLMs are right now in terms of the share of knowledge work they can do, which is, it's, I guess, probably like 1,000th of the knowledge work that happens in the economy LLMs are doing, at least in terms of revenue. Are you saying, like, that fraction will be possible for robots, but for physical work in five years? (00:13:54)

[Sergey Levine] : That's a very hard question to answer. I think I'm probably not prepared to tell you what percentage of all labor work can be done by robots, because I don't think right now, off the cuff, I have a sufficient understanding of what's involved in, you know, that big of a cross-section of all physical labor. I think what I can tell you is this, that I think it's much easier to get effective systems rolled out gradually in a human-in-the-loop setup. And again, I think this is exactly what we've seen with coding systems, and I think we'll see the same thing with automation, where, basically, robot plus human is much better than just human or just robot. (00:14:35)

[Sergey Levine] : And that just, like, makes total sense. It also makes it much easier to get all the technology bootstrapped, because when it's robot plus human, now there's a lot more potential for the robot to, like, actually learn on the job, acquire new skills. It's just like, you know... Because the human can label what's happening? And also because the human can help. The human can give hints. You know, let me tell you this story. (00:14:55)

[Sergey Levine] : Like, when we were working on the π0.5 project, this was the paper that we released last April, we initially controlled our robots with teleoperation in a variety of different settings. And then at some point, we actually realized that we can actually make significant headway once the model was good enough by supervising it, not just with low-level actions, but actually literally instructing it through language. Now, you need a certain level of competence before you can do that. But once you have that level of competence, just standing there and telling the robot, OK, now pick up the cup, put the cup in the sink, put the dish in the sink, just with words, already actually gives the robot information that it can use to get better. (00:15:36)

[Sergey Levine] : Right. Now, imagine what this implies for the human-plus-robot dynamic. Like, now, basically, learning for these systems is not just learning from raw action, it's also learning from words, eventually learning from observing what people do, from the kind of natural feedback that you receive when you're doing a job together with somebody else. And this is also the kind of stuff where the prior knowledge that comes from these big models is tremendously valuable, because that lets you understand that interaction dynamic. So, I think that there's a lot of potential for these kind of human-plus-robot deployments to make the model better. (00:16:12)

[Dwarkesh Patel] : Interesting. So, I got to go to Labelbox and see the robotic setup and try operating some of the robots myself. (00:16:19)

[SPEAKER_00] : So, the thing is, like, these triggers, be very mindful of pressing them and don't do some, like, very fast movements. Yeah. Keep it, like, kind of slow. Do I need to keep holding it? Sorry. OK. That's OK. And don't move it very fast, because he can get hurt, actually. Yeah. OK. (00:16:33)

[Dwarkesh Patel] : OK. So, operating ended up being a bit harder than I anticipated. But I did get to see the Labelbox team rip through a bunch of tasks. I also got to see the output data that labs actually have to use to train their robots and ask Manu, Labelbox's CEO, about how all this is packaged together. (00:16:54)

[SPEAKER_01] : So, what you're looking at is actually the final output that is then delivered to the labs, which then they use to train the models. And so, you can see on the left the visualization of the movements of the robot, including its 3D model and so forth. And on the right, you see all the camera streams synchronized with the configuration. (00:17:13)

[Dwarkesh Patel] : Labelbox can get you millions of episodes of robotics data for every single robotics platform and subtask that you want to train on. And if you reach out through labelbox.com slash dwarkesh, Manu will be very happy with me. In terms of robotics progress, why won't it be like self-driving cars, where we, you know, it's been more than 10 years since Google launched its... wasn't it in 2009 that they launched the self-driving car initiative? And then I remember when I was a teenager, like, watching demos where we would go to a Taco Bell and drive back. (00:17:46)

[Dwarkesh Patel] : And only now do we have them actually deployed. And even then, you know, they may make mistakes, etc. And so, maybe it'll be many more years before most of the cars are self-driving. So, why won't robotics, you know, you're saying five years to this, like, quite robust thing, but actually it'll just feel like 20 years or just like, once we get the cool demo in five years, then it'll be another 10 years before, like, we have the Waymo and the Tesla FSD working. Yeah, that's a really good question. (00:18:15)

[Sergey Levine] : So, one of the big things that is different now than it was in 2009 actually has to do with the technology for machine learning systems that understand the world around them. Principally, for autonomous driving, this is perception. For robots, it can mean a few other things as well. And perception certainly was not in a good place in 2009. The trouble with perception is that it's one of those things where you can nail a really good demo with a somewhat engineered system, but hit a brick wall when you try to generalize it. Now, at this point in 2025, we have much better technology for generalizable and robust perception systems and more generally generalizable and robust systems for understanding the world around us. (00:18:59)

[Sergey Levine] : Like, when you say that the system is scalable and machine learning scalable, it really means generalizable. So, that gives us a much better starting point today. So, that's not an argument about robotics being easier than autonomous driving. It's just an argument for 2025 being a better year than 2009. But there's also other things about robotics that are a bit different than driving. Like, in some ways, robotic manipulation is a much, much harder problem. But in other ways, it's a problem space where it's easier to get rolling to start that flywheel with a more limited scope. (00:19:29)

[Sergey Levine] : So, to give you an example, if you're learning how to drive, you would probably be pretty crazy to learn how to drive on your own without somebody helping you. Like, you would not trust your teenage child to learn to drive just on their own, just drop them in the car and say, like, go for it. And that's like a, a 16-year-old who's had a significant amount of time to learn about the world. He would never even dream of putting a five-year-old in a car and tell him to get started. (00:19:56)

[Sergey Levine] : But if you want somebody to, like, clean the dishes, like, dishes can break too, but you would probably be okay with a child trying to do the dishes without somebody constantly, like, you know, sitting next to them with a, with a, with a break, so to speak. So, for a lot of tasks that we want to do with robotic manipulation, there's potential to make mistakes and correct those mistakes. And when you make a mistake and correct it, well, first you've achieved the task because you've corrected, but you've also gained knowledge that allows you to avoid that mistake in the future. (00:20:26)

[Sergey Levine] : With driving, because of the dynamics of how it's set up, it's very hard to make a mistake, correct it, and then learn from it because the mistakes themselves have significant ramifications. Now, not all manipulation tests are like that. There are truly some, like, very safety-critical stuff. And this is where the next thing comes in, which is common sense. Common sense, meaning the ability to make inferences about what might happen that are reasonable guesses, but that do not require you to experience that mistake and learn from it in advance. That's tremendously important, and that's something that we basically had no idea how to do about five years ago. (00:21:02)

[Sergey Levine] : But now, you, we can actually use LLMs and VLMs, ask them questions, and they will make reasonable guesses. Like, they will not give you expert behavior, but you can say, like, hey, there's a sign that says slippery floor. Like, what's going to happen when I walk over that? It's kind of pretty obvious, right? And no autonomous car in 2009 would have been able to answer that question. So, common sense plus the ability to make mistakes and correct those mistakes, like, that's sounding like an awful lot like what a person does when they're trying to learn something. (00:21:30)

[Sergey Levine] : All of that doesn't make robotic manipulation easy necessarily, but it allows us to get started with a smaller scope and then grow from there. (00:21:38)

[Dwarkesh Patel] : So, for years, using... I mean, not since 2009, but we've had lots of video data, language data, and transformers for five, seven, eight years. And lots of companies have tried to build transformer-based robots with lots of training data, including Google, Meta, etc. And what is the reason that they've been hitting roadblocks? What has changed now? Yeah, that's a really good question. (00:22:06)

[Sergey Levine] : So, I'll start out with maybe a slight modification to your comment is, I think they've made a lot of progress. And in some ways, a lot of the work that we're doing now at physical intelligence is built on the backs of lots of other great work that was done, for example, at Google, like many of us were actually at Google before. We were involved in some of that work. Some of it is work that we're drawing on that others did. (00:22:28)

[Sergey Levine] : So, there's definitely been a lot of progress there. But to make robotic foundation models really work, it's not just a laboratory science experiment. It also requires industrial scale building effort. It's more like the Apollo program than it is like a science experiment. And the excellent research that was done in the past industrial research labs, and I know I was involved in much of that, was very much framed as a fundamental research effort. And that's good, like the fundamental research is really important, but it's not enough by itself. (00:23:08)

[Sergey Levine] : You need the fundamental research, and you also need the impetus to make it real. And make it real means like actually put the robots out there, get data that is representative of the kind of tasks that they need to do in the real world, get that data at scale, build out the systems, get all that stuff right. And that requires a degree of focus, a singular focus on really nailing the robotic foundation model for its own sake, not just as a way to do more science, not just as a way to like publish a paper, and not just as a way to kind of like, you know, have a research lab. (00:23:43)

[Dwarkesh Patel] : What is preventing you now from scaling that data even more? If data is a big bottleneck, why can't you just increase the size of your office 100x, have 100x more operators, we're operating these robots and collecting more data? Yeah, why not ramp it up immediately 100x more? Yeah, that's a really good question. (00:24:04)

[Sergey Levine] : So, the challenge here is in understanding which axes of scale contributes to which axis of capability. So, if we want to expand capability horizontally, meaning like the robot knows how to do 10 things now, and I'd like it to do 100 things later, you know, that can be addressed by just directly horizontally scaling what we already have. But we want to get robots to a level of capability where they can do practical, practically useful things in the real world. And that requires expanding along other axes too. (00:24:35)

[Sergey Levine] : It requires, for example, getting to very high robustness. It requires getting them to perform tasks very efficiently, quickly. It requires them to recognize edge cases and respond intelligently. And those things, I think, can also be addressed with scaling, but we have to identify the right axes for that, which means figuring out what kind of data to collect, what settings to collect it in, what kind of methods consume that data, how those methods work. So, answering those questions more thoroughly will give us a greater clarity on the axes, on those dependent variables, on the things that we need to scale. (00:25:09)

[Sergey Levine] : And we don't fully know right now what that will look like. I think we'll figure it out pretty soon. It's something we're working on actively, but we want to really get that right so that when we do scale it up, it'll directly translate into capabilities that are very relevant to practical use. (00:25:25)

[Dwarkesh Patel] : Just to give an order of magnitude, how does the amount of data you have collected compare to internet scale pre-training data? And I know it's hard to do like a token-by-token count because, you know, how does video information compare to internet information, etc.? But like, using your reasonable estimates, what fraction of... (00:25:42)

[Sergey Levine] : That's right. It's very hard to do because robotic experience consists of time steps that are very correlated with each other. So, like, the raw, like, byte representation is enormous, but probably the information density is comparatively low. Maybe a better comparison is to the data sets that are used for multimodal training. And there it's... I believe last time we did that count, it was like between one and two orders of magnitude. (00:26:07)

[Dwarkesh Patel] : The vision you have of robotics will not be possible until you collect, like, what, 100x, 1,000x more data? Well, that's the thing, that we don't know that. (00:26:17)

[Sergey Levine] : It's certainly very reasonable to infer that, like, you know, robotics is a tough problem and probably it requires, you know, as much experience as the language stuff. But because we don't know the answer to that, to me, a much more useful way to think about it is not how much data do we need to get before we're fully done, but how much data do we need to get before we can get started? Meaning, before we can get a data flywheel that represents a self-sustaining and ever-growing data collection... (00:26:48)

[Sergey Levine] : When you say self-sustaining, this is just like learning on the job or do you have something else in mind? Learning on the job or acquiring data in a way that the process of acquisition of that data itself is useful and valuable. I see, like, just some kind of RL. Like doing something, like, actually real. Yeah, I mean, ideally, I would like it to be RL because you can get away with the robot acting autonomously, which is easier. But it's not out of the question that you can have mixed autonomy. (00:27:15)

[Sergey Levine] : You can, you know, as I mentioned before, robots can learn from all sorts of other signals. I described how we can have a robot that learns from a person talking to it. So there's a lot of middle ground in between fully teleoperated robots and fully autonomous robots. (00:27:29)

Yeah. (00:27:29)

[Dwarkesh Patel] : Okay, and how does the Pi model work?

[Sergey Levine] : Yeah, so the current model that we have basically is a vision language model that has been adapted for motor control. So to give you a little bit of like a fanciful brain analogy, a VLM, a vision language model, is basically an LLM that has had a little, like, pseudo visual cortex grafted to it, a vision encoder, right? So our models, they have a vision encoder, but they also have an action expert, an action decoder, essentially. So it has like a little visual cortex and notionally a little motor cortex. (00:28:01)

[Sergey Levine] : And the way that the model actually makes decisions is it reads in the sensor information from the robot. It does some internal processing, and that could involve actually outputting intermediate steps, like you might tell it, clean up the kitchen, and it might think to itself, like, hey, to clean up the kitchen, I need to pick up the dish, and I need to pick up the sponge, and I need to put this and this. And then eventually it works its way through that chain of thought generation down to the action expert, which actually produces continuous actions. (00:28:25)

[Sergey Levine] : And that has to be a different module because the actions are continuous, they're high frequency, so they have a different data format than text tokens. But structurally, it's still an end-to-end transformer. And roughly speaking, technically, it corresponds to a kind of mixture of experts architecture. (00:28:42)

[Dwarkesh Patel] : Mm-hmm. And like what is actually happening is that it's like predicting I should do x thing, then it's like there's an image token, then some action tokens, like what it actually ends up doing, and then more image, more text description, more action tokens. Basically, I'm like looking at what stream is going on. (00:28:59)

[Sergey Levine] : That's right. With the exception that the actions are actually not represented as discrete tokens, it actually uses a flow matching kind of diffusion because they're continuous, and you need to be very precise with your actions for dexterous control. (00:29:10)

[Dwarkesh Patel] : I find it super interesting that, so I think you're using the open source Gemma model, which is like Google's LLM that they release open source, and then adding this action expert on top. I find it super interesting that the progress in different areas of AI is just based on not only the same techniques, but literally the same model that you can just use an open source LLM and then add this action expert on top. It is notable that you naively might think that, oh, there's a separate area of research which is robotics, and there's a separate area of research called LLMs and natural language processing. (00:29:47)

[Dwarkesh Patel] : And no, it's literally the same. The considerations are the same, the architectures are the same, even the weights are the same. I know you do more training on top of these open source models, but that I find super interesting. (00:29:59)

[Sergey Levine] : Yeah, so one theme here that I think is important to keep in mind is that the reason that those building blocks are so valuable is because the AI community has gotten a lot better at leveraging prior knowledge. And a lot of what we're getting from the pre-trained LLMs and VLMs is prior knowledge about the world. And it's kind of like, it's a little bit abstracted knowledge. It's like, you know, you can identify objects, you can figure out like, you know, roughly where things are in image, that sort of thing. (00:30:26)

[Sergey Levine] : But I think if I had to like summarize in one sentence, the big benefit that recent innovations in AI give to robotics is really that prior, the ability to leverage prior knowledge. And I think the fact that the model is the same model, that's like, that's kind of always been the case in deep learning. But it's that ability to pull in that prior knowledge, that abstract knowledge that can come from many different sources. That's really powerful. (00:30:47)

[Dwarkesh Patel] : Today, I'm here with Mark, who is a senior researcher at Hudson River Trading. He has prepared for us a big data set of market prices and historical market data. And we're going to try to figure out what's going on and whether we can predict future prices from historical market data. Mark, let's dig in. Happy to do it. (00:31:06)

[SPEAKER_01] : So it sounds like the first fun thing to do is probably to start looking at what an order book actually looks like. Yeah, I think so. So I've given you like real order book data. That is snapshots of the top five levels of the order book, both on the bid and ask side for a couple of different tech stocks, NVIDIA, Tesla, AMD, etc. What is the shape of the prediction? Are we predicting? (00:31:28)

[SPEAKER_01] : Why don't you take a data frame, look at its Y values and just kind of like histogram it. They are centered at zero. They're roughly centered at zero. Yeah, but target of what exactly? So these things are changes in the mid price from like now to some short period of time in the future. This is actually quite interesting. It's like a mystery to solve. And each one of these can be like a sizable chunk of time for a researcher. If this sounds interesting to you, you should consider working at Hudson River Trading. Mark, where can people learn more? (00:31:54)

[SPEAKER_01] : They can learn more at hudson-trading.com slash dorkash. Amazing. (00:31:57)

[Dwarkesh Patel] : I was talking to this researcher, Sander at GDM, and he works on video and audio models. And he made the interesting point that the reason, in his view, we aren't seeing that much transfer learning between different modalities, that is to say like training a language model on video and images doesn't seem to necessarily make it that much better at textual questions and tasks, is that images are represented at a different semantic level than text. And so his argument is that text has this high level semantic representation within the model, whereas images and videos are just like compressed pixels. (00:32:38)

[Dwarkesh Patel] : There's not really a semantic... When they're embedded, they don't represent some high level semantic information. They're just like compressed pixels. And therefore, there's no transfer learning at the level at which they're going through the model. And obviously, this is super relevant to the work you're doing, because your hope is that by training the model both on the visual data that the robot sees, visual data generally, maybe even from YouTube or whatever eventually, plus language information, plus action information from the robot itself. (00:33:08)

[Dwarkesh Patel] : All of this together will make it generally robust. And then you had a really interesting blog post about why video models aren't as robust as language models. Sorry, this is not a super well-formed question. I just wanted to do a react to some thoughts. What's up with that? Yeah, yeah. (00:33:23)

[Sergey Levine] : Yeah, so I have maybe two things I can say there. I have some bad news and some good news. Yeah. So the bad news is, what you're saying is really getting at the core of a long-running challenge with video and image generation models. In some ways, the idea of getting intelligent systems by predicting video is even older than the idea of getting intelligent systems by predicting text. But the text stuff turned into practically useful things earlier than the video stuff did. (00:34:02)

[Sergey Levine] : I mean, the video stuff is great. You can generate cool videos. And I think that the work there that's been done recently is amazing. But it's not like just generating videos and images has already resulted in systems that have this kind of deep understanding of the world where you can ask them to do stuff beyond just generating more images and videos. Whereas with language, clearly it has. And I think that this point about representations is really key to it. (00:34:24)

[Sergey Levine] : One way we can think about it is this. Imagine pointing a camera outside this building. There's the sky. There's the clouds are moving around, the water, cars driving around, people. If you want to predict everything that will happen in the future, you can do so in many different ways. You can say, okay, there's people around. So let me get really good understanding, like the psychology of how people behave in crowds and predict the pedestrians. But you could also say, like, well, there's clouds moving around. (00:34:49)

[Sergey Levine] : Let me understand everything about water molecules and ice particles in the air. And you can go super deep on that. Like, if you want to, like, fully understand, like, all, you know, down to the subatomic level, everything that's going on, like, as a person, you could spend like decades just thinking about that. And you'll never even get to the pedestrians or the water. Right. So if you want to really predict everything that's going on in that scene, there's just so much stuff that even if you're doing a really great job and capturing like 100% of something, by the time you get to everything else, like, you know, ages will have passed. Whereas with text, it's already been abstract into those bits that we as humans care about. (00:35:22)

[Sergey Levine] : So the representations are already there. And they're not just good representations. They actually focus in on what really matters. Okay. So that's the bad news. Here's the good news. The good news is that we don't have to just get everything out of, like, pointing a camera outside this building. Because when you have a robot, that robot is actually trying to do a job. So it has a purpose. And its perception is in service to fulfilling that purpose. (00:35:49)

[Sergey Levine] : And that is like a really great focusing factor. We know that for people, this really matters. Like, literally what you see is affected by what you're trying to do. Like, there's been no shortage of psychology experiments showing that people have like almost a shocking degree of tunnel vision, where they will like literally not see things right in front of their eyes if it's not relevant to what they're trying to achieve. And that is tremendously powerful. (00:36:09)

[Sergey Levine] : Like, there must be a reason why people do that. Because, you know, certainly if you're out in the jungle, seeing more is better than seeing less. So if you have that powerful focusing mechanism, it must be darn important for getting it to achieve your goal. And I think robots will have that focusing mechanism because they're trying to achieve a goal. (00:36:23)
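
As a toy illustration of the representation gap Levine describes above (the frame size and caption below are assumptions for illustration, not anything from the conversation): a single small video frame carries orders of magnitude more raw values than a short caption that has already been abstracted into the bits humans care about.

```python
# Toy comparison: raw pixel values in one frame vs. words in a caption that
# already abstracts the scene (illustrative numbers only).
frame_values = 224 * 224 * 3                        # RGB values in a small 224x224 frame
caption = "clouds drift over the water while pedestrians cross the street"
print(frame_values, "raw values vs", len(caption.split()), "words")  # 150528 vs 10
```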

[Dwarkesh Patel] : By the way, the fact that video models aren't as robust, is that bearish for robotics? Because it will... so much of the data you will have to use will not be... I guess some of... you're saying a lot of it will be labeled. But like, ideally, you just want to be able to like throw all of everything on YouTube, every video we've ever recorded, and have it learn how the physical world works and how to like move about, etc. Just see humans performing tasks and learn from that. (00:36:50)

[Dwarkesh Patel] : But if, yeah, I guess you're saying like it's hard to learn just from that and actually like needs to practice the task itself. Well, let me put it this way. (00:36:57)

[Sergey Levine] : Like, let's say that I gave you lots of videotapes or lots of recordings of different sporting events and gave you a year to just watch sports. And then after that year, I told you, okay, now your job, you're going to be playing tennis. Yeah. Okay, that's like, that's pretty dumb, right? Whereas if I told you first, like, you're going to be playing tennis, and then I let you study up, right? Like now, you know, you really know what you're looking for. (00:37:20)

[Sergey Levine] : So I think that actually, like, there's a very real challenge here. I don't want to understate the challenge. But I do think that there's also a lot of potential for foundation models that are embodied, that learn from interaction, from controlling robotic systems to actually be better at absorbing the other data sources because they know what they're trying to do. I don't think that that by itself is like a silver bullet. I don't think it solves everything. (00:37:42)

[Sergey Levine] : But I think that it does help a lot. And I think that we've already seen the beginnings of that, where we can see that including web data in training for robots really does help with generalization. And I actually have the suspicion that in the long run, it'll make it easier to use those sources of data that have been tricky to use up until now. (00:38:03)

[Dwarkesh Patel] : Famously, LLMs have all these emerging capabilities that were never engineered in because somewhere in internet text is the data to train and to give it the knowledge to do a certain kind of thing. With robots, it seems like you are collecting all the data manually. So there won't be this mysterious new capability that is somewhere in the data set that you haven't purposefully collected, which seems like it should make it even harder to then have robust out-of-distribution kind of capabilities. And so I wonder if the trek over the next 5-10 years will just be like each subtask, you have to give it thousands of episodes. (00:38:40)

[Dwarkesh Patel] : And then it's very hard to actually automate much work just by doing subtasks. So if you think about what a barista does, what a waiter does, what a chef does, very little of it involves just sitting at one station and doing stuff. It's like you got to move around, you got to restock, you got to fix a machine, etc. Go between the counter and the cashier and the machine, etc. So, yeah, will it just be like, will there just be this long tale of things that you had to keep, skills you had to keep, like adding episodes for manually, then labeling and seeing how well they did, etc. Or is there some reason to think that it will progress more generally than that? (00:39:23)

[Dwarkesh Patel] : Yeah. (00:39:23)

[Sergey Levine] : So there's a subtlety here. Emerging capabilities don't just come from the fact that internet data has a lot of stuff in it. They also come from the fact that generalization, once it reaches a certain level, becomes compositional. There was a cute example that one of my students really liked to use in some of his presentations, which is, you know what international phonetic alphabet is? No. IPA. So if you look in a dictionary, they'll have the pronunciation of a word and written in like kind of funny letters. (00:39:55)

[Sergey Levine] : That's basically international phonetic alphabet. So it's an alphabet that is pretty much exclusively used for writing down pronunciations of individual words and dictionaries. And you can ask an LLM to write you a recipe for like making some meal in international phonetic alphabet, and it will do it. And that's like, like, holy crap, like that is definitely not something that it has ever seen, because IPA is only ever used for writing down pronunciations of individual words. So that's compositional generalization. (00:40:22)

[Sergey Levine] : It's putting together things you've seen like that in new ways. And it's like, you know, arguably, there's nothing like profoundly new here, because like, yes, you've seen different words written in that way, but you figured out that now you can compose the words in this other language the same way that you've composed words in English. So that's actually where the emergent capabilities come from. And because of this, in principle, if we have a sufficient diversity of behaviors, the model should figure out that those behaviors can be composed in new ways as the situation calls for it. (00:40:53)

[Sergey Levine] : And we've actually seen things even with our current models, which, you know, I should say that I think they're in the grand scheme of things, like looking back five years from now, we'll probably think that these are tiny in scale. But we've already seen what I would call emergent capabilities. When we were playing around with some of our laundry folding policies, actually, we would discover this by accident. The robot accidentally picked up two t-shirts out of the bin instead of one, starts folding the first one, the other one gets in the way, picks up the other one, throws it back in the bin. (00:41:17)

[Sergey Levine] : And we're like, we didn't know it would do that. Like, holy crap. And then we tried to play around with it. And it's like, yep, it does that every time. Like you can drop in, you know, it's doing its work, drop something else on the table, just pick it up, put it back. Right. Okay, that's cool. Shopping bag, it starts putting things in the shopping bag, the shopping bag tips over, picks it back up and stands it upright. We didn't tell anybody to collect data for that. (00:41:37)

[Sergey Levine] : I'm sure somebody accidentally at some point or maybe intentionally picked up the shopping bag. But it's just, you have this kind of compositionality that emerges when you do learning at scale. And that's really where all these remarkable capabilities come from. And now you put that together with language, you put that together with all sorts of chain of thought reasoning, and there's a lot of potential for the model to compose things in new ways. (00:41:58)

Right. (00:41:58)

[Dwarkesh Patel] : I had an example like this when I got a tour of the robots, by the way, at your office. So it was folding shorts. And I don't know if there was an episode like this in the training set, but just for fun, I took one of the shorts and turned it inside out. And then it was able to understand that it first needed to... So first of all, the grippers are just, like, two limbs, just an opposable-finger-and-thumb kind of thing. And it's actually shocking how much you can do with just that. (00:42:32)

[Dwarkesh Patel] : Yeah, it understood that it first needed to turn the shorts right side out before folding them correctly. I mean, what's especially surprising about that is, it seems like this model only has about one second of context. Compare that to these language models, which can often see the entire code base, and they're observing hundreds of thousands of tokens and thinking about them before outputting. And they're observing their own train of thought for thousands of tokens before making a plan about how to code something up. Your model is seeing one image, like what happened in the last second. (00:43:00)

[Dwarkesh Patel] : And it vaguely knows, like, it's supposed to fold this short, and it's seen the image of what's happened in the last second. And I guess it works. It's crazy that it will just see the last thing that happened and then keep executing on the plan: turn it right side out, then fold it correctly. But it's shocking that a second of context is enough to execute on a minute-long task. Yeah, I'm curious why you made that choice in the first place, and why it's possible to actually do tasks that way, if a human could only think with, like, a second of memory and had to do physical work. (00:43:34)

[Dwarkesh Patel] : I feel like that would just be impossible. (00:43:35)

Yeah. (00:43:36)

[Sergey Levine] : I mean, it's not that there's something good about having less memory, to be clear. Like, I think that adding memory, adding longer context, adding higher resolution images, all that stuff will make the model better. But the reason why it's not the most important thing for the kind of skills that you saw when you visited us, at some level, I think it comes back to Moravec's paradox. Moravec's paradox is basically... you know, if you want to know one thing about robotics, that's the thing. (00:44:07)

[Sergey Levine] : Moravec's paradox says that basically, in AI, the easy things are hard and the hard things are easy, meaning like the things that we take for granted, like picking up objects, perceiving the world, all that stuff, those are all the hard problems in AI. And the things that we find challenging, like playing chess and doing calculus, actually are often the easier problems. And I think this memory stuff is actually Moravec's paradox in disguise, where we think that the cognitively demanding tasks that we do, that we find hard, that kind of cause us to think like, oh man, I'm sweating, I'm working so hard. Those are the ones that require us to keep lots of stuff in memory, lots of stuff in our minds. (00:44:41)

[Sergey Levine] : Like, if you're solving some big math problem, or you're having a complicated technical conversation on a podcast, those are things where you have to keep all those puzzle pieces in your head. If you're doing a well-rehearsed task, if you are an Olympic swimmer and you're swimming with perfect form and you're right there in the zone, people even say you're "in the moment," right? (00:45:04)

[Sergey Levine] : It's like you've practiced it so much, you've baked it into the neural network in your brain, so that you don't have to think carefully about keeping all that context, right? So it really is just Moravec's paradox manifesting itself. But that doesn't mean that we don't need the memory. It just means that if we want to match the level of dexterity and physical proficiency that people have, there are other things we should get right first, and then gradually go up that stack into the more cognitively demanding areas, into reasoning, into context, into planning, all that kind of stuff. (00:45:36)

[Sergey Levine] : And that stuff will be important too. (00:45:37)

[Dwarkesh Patel] : And how, physically, will... so you have this trilemma: three different things which all take more compute during inference and which you want to increase at the same time. You have the inference speed: humans are processing 24 frames a second or whatever it is; we can react to things extremely fast. Then you have the context length: for the kind of robot that's just cleaning up your house, I think it has to be aware of things that happened minutes ago or hours ago, and how that influences its plan for the next task it's doing. And then you have the model size. (00:46:17)

[Dwarkesh Patel] : And I guess at least with LLMs we've seen that there's gains from increasing the amount of parameters. And I think currently you have 100 millisecond inference speeds, you have a second long context, and then the model is what? A couple billion parameters? How many? Okay. And so each of these, at least two of them, are many orders of magnitude smaller than what seems to be the human equivalent, right? Like the model... (00:46:57)

[Dwarkesh Patel] : if a human brain has like trillions of parameters, and this has like two billion parameters, and then if humans are processing at least as fast as this model, like actually a decent bit faster, and we have hours of context, depends on how you define human context, but hours of context, minutes of context, sometimes decades of context. Yeah, exactly. So you have to have many order of magnitude improvements across all of these three things, which seem to oppose each other, or like increasing one reduces the amount of compute you can dedicate towards the other one in inference. So how are we going to solve this? (00:47:19)

[Sergey Levine] : Yeah, well, that's a very big question. Let's try to unpack this a little bit. I think there's a lot going on in there. One thing that I would say is a really interesting technical problem, and something where we'll see perhaps a lot of really interesting innovation over the next few years, is the question of representation for context. So imagine some of the examples you gave: if you have a home robot that's doing something, it needs to keep track of things. As a person, there are certainly some things you keep track of very symbolically, almost in language. Like, I have my checklist, I'm going shopping, and, at least for me, I can literally visualize my checklist in my mind: pick up the yogurt, pick up the milk, pick up whatever. (00:48:07)

[Sergey Levine] : And I'm not like picturing the shelf with the milk sitting there. I'm just thinking like milk, right? But then there's other things that are much more spatial, almost visual. You know, when I was trying to get to your studio, I was thinking like, okay, here's what the street looks like. Here's what that street looks like. Here's, you know, what I expect the doorway to look like. So representing your context in the right form that captures what you really need to achieve your goal, and otherwise kind of discards all the unnecessary stuff. (00:48:39)

[Sergey Levine] : I think that's a really important thing. And I think we're seeing the beginnings of that with multimodal models. But I think that multimodality has so much more to it than just image plus text. And I think that's a place where there's a lot of room for really exciting innovation.

[Dwarkesh Patel] : Ooh, do you mean in terms of how we represent things?

[Sergey Levine] : Yeah, how we represent both context, meaning what happened in the past, and also plans, or reasoning as you might call it in the LLM world, which is what we would like to happen in the future, or intermediate processing stages in solving a task. (00:49:09)

[Sergey Levine] : I think doing that in a variety of modalities, including potentially learned modalities that are suitable for the job, is something that has, I think, enormous potential to overcome some of these challenges. (00:49:19)

[Dwarkesh Patel] : Interesting. Another question I have, as we're discussing these tough trade-offs in terms of inference, is comparing it to the human brain: figuring out how the human brain is able to have hours, even decades of context, while being able to act on the order of 10 milliseconds, while having 100 trillion parameters or however you want to count it. And I wonder if the best way to understand what's happening here is that human brain hardware is just way more advanced than the hardware we have in GPUs, or that the algorithms for encoding video information are way more efficient. And maybe it's like some crazy mixture of experts where the active parameters are also on the order of a few billion, or some mixture of the two. (00:50:13)

[Dwarkesh Patel] : Basically, if you had to think about why we have these models that are, across many dimensions, orders of magnitude less efficient compared to the brain: is it hardware or algorithms? (00:50:26)

[Sergey Levine] : Yeah, that's a really good question. So I definitely don't know the answer to this. I am not by any means well versed in neuroscience, but if I had to guess, and also provide an answer that leans more on things I know, it's something like this: the brain is extremely parallel. It kind of has to be, just because of the biophysics. But it's even more parallel than your GPU. If you think about how a modern multimodal language model processes the input, if you give it some images and some text, first it reads in the images, then it reads in the text, and then it proceeds one token at a time to generate the output. (00:51:05)

[Sergey Levine] : It makes a lot more sense to me for an embodied system to have parallel processes. Now, mathematically, you can actually make close equivalences between parallel and sequential stuff. Transformers aren't actually fundamentally sequential; you kind of make them sequential by putting in position embeddings. Transformers are fundamentally very parallelizable things. That's what makes them so great. So I don't think that, mathematically, this highly parallel thing where you're doing perception and proprioception and planning all at the same time necessarily needs to look that different from a transformer, although its practical implementation will be different. And you could imagine that the system will in parallel think about: okay, here's my long-term memory, here's what I saw a decade ago. (00:51:47)

[Sergey Levine] : Here's my short-term kind of spatial stuff. Here's my semantic stuff. Here's what I'm seeing now. Here's what I'm planning. And all of that can be implemented with some very familiar kind of attentional mechanism, but in practice all running in parallel, maybe at different rates, with the more complex things running slower and the faster, reactive stuff running faster. (00:52:07)
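
Levine's aside that transformers are only made sequential by position embeddings can be checked directly: plain self-attention is permutation-equivariant, so without positional information the input "order" carries no meaning. A minimal numpy sketch (illustrative only, not production code):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Single-head self-attention with no positional embeddings.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                                # 5 "tokens", dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
perm = rng.permutation(5)

# Permuting the input tokens just permutes the output rows the same way:
print(np.allclose(self_attention(X, Wq, Wk, Wv)[perm],
                  self_attention(X[perm], Wq, Wk, Wv)))    # True
```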

[Dwarkesh Patel] : I'm sure you've been seeing a bunch of fun images that people have been generating with Google's new image generation model, my Xfeed is full of wild images, but you might not realize that this model can also help you do less flashy tasks like restoring historical pictures, or even just cleaning up images. For example, I was reading this old paperback as I was prepping to interview Sarah Payne, and it had this really great graph of World War II Allied shipping that I wanted to overlay in the lecture. Now in the past, this would have taken one of my editors 20 or 30 minutes to digitize and clean up manually. (00:52:38)

[Dwarkesh Patel] : But now we just took a photo of the page and then dropped it into Nano Banana and got back a clean version. This was a one shot. But if Nano Banana doesn't nail it on the first attempt, you can try to just go back and forth with it until you get a result that you're super happy with. We keep finding new use cases for this model. And honestly, this is one of those tools that just doesn't feel real. (00:52:58)

[Dwarkesh Patel] : Check out the Gemini 2.5 Flash Image model, aka Nano Banana, on both Google AI Studio and the Gemini app. All right, back to Sergey. If in five years we have a system which is as robust as a human in terms of interacting with the world, then what has happened that makes it physically possible to run those kinds of models? To have video information that is streaming in real time, or hours of prior video information somehow being encoded and considered, while decoding on a millisecond scale and with many more parameters. Is it just that NVIDIA has shipped much better GPUs, or that you guys have come up with much better encoders and stuff, or what's happened in those five years? (00:53:42)

[Sergey Levine] : I think there are a lot of things to this question. Certainly there's a really fascinating systems problem. I'm by no means a systems expert, but I would imagine that the right architecture in practice, especially if you want an affordable, low-cost system, would be to externalize at least part of the thinking. You could imagine that maybe in the future you'll have a robot where, if your internet connection is not very good, the robot is in kind of a dumber reactive mode, but if you have a good internet connection, then it can be a little smarter. (00:54:11)

[Sergey Levine] : That's pretty cool. But I think there are also research and algorithms, things that can help here, like figuring out the right representations, concisely representing both your past observations, but also changes in observation, right? Like, you know, your sensory stream is extremely temporally correlated, which means that the marginal information gained from each additional observation is not the same as the entirety of that observation. Because the image that I'm seeing now is very correlated to the image I saw before. So, in principle, if I want to represent it concisely, I get away with a much more compressed representation than if I represent the images independently. (00:54:44)

[Sergey Levine] : So, there's a lot that can be done on the algorithm side to get this right. And that's really interesting algorithms work. I think there's also like a really fascinating systems problem. To be truthful, like, I haven't gotten to the systems problem because, you know, you want to implement the system once you sort of know the shape of the machine learning solution. But I think there's a lot of cool stuff to do there. (00:55:02)
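
A toy illustration of the temporal-correlation point above (the frame sizes and the single-pixel change are assumptions for illustration, not a real encoder): when consecutive frames are nearly identical, storing per-frame deltas needs far fewer non-zero values than storing every frame independently.

```python
import numpy as np

rng = np.random.default_rng(0)
frames = [rng.integers(0, 256, size=(64, 64), dtype=np.int16)]
for _ in range(29):                       # 30 highly correlated "frames"
    nxt = frames[-1].copy()
    nxt[rng.integers(0, 64), rng.integers(0, 64)] += 1   # only one pixel changes
    frames.append(nxt)

independent = sum(int(np.count_nonzero(f)) for f in frames)
deltas = int(np.count_nonzero(frames[0])) + sum(
    int(np.count_nonzero(b - a)) for a, b in zip(frames, frames[1:]))
print(f"non-zero values to store: {independent} per-frame vs {deltas} as deltas")
```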

[Dwarkesh Patel] : Yeah, maybe you guys need to hire, like, the people who run the YouTube data centers because, like, they know how to encode video information. Okay, this actually is an interesting question, which is that with LLMs, of course, theoretically you could run your own model on this laptop or whatever. But realistically, what happens is that the largest, most effective models are being run in batches of thousands, millions of users at the same time, not locally. Will the same thing happen in robotics, because of the inherent efficiencies of batching, plus the fact that we have to do this incredibly compute-intensive inference task? (00:55:41)

[Dwarkesh Patel] : And so, you don't want to be carrying around, like, $50,000 GPUs per robot or something. You just want that to happen somewhere else. So, yeah, in this robotics world, should we just be anticipating something where you need connectivity everywhere, you need robots that have, like, super-fast connections, and you're streaming video information back and forth, or at least video information one way? So, does that have interesting implications about how this deployment of robots will actually be instantiated? (00:56:12)

[Sergey Levine] : I don't know. (00:56:14)

[Sergey Levine] : But if I were to guess, I would guess that we'll actually see both: low-cost systems with off-board inference, and more reliable systems that are costlier and have on-board inference, for example in settings like an outdoor robot where you can't rely on connectivity. A few things I'll say from a technical standpoint that might contribute to understanding this. While a real-time system obviously needs to be controlled in real time, often at high frequency, the amount of thinking you actually need to do for every time step might be surprisingly low. (00:56:51)

[Sergey Levine] : And again, we see this in humans and animals. When we plan out movements, there is definitely a real planning process that happens in the brain. Like, if you record, like, from a monkey brain, you will actually find neural correlates of planning. And there is something that happens in advance of a movement. And when that movement actually takes place, the shape of the movement correlates with what happened before the movement. Like, that's planning, right? (00:57:19)

[Sergey Levine] : So that means that you put something in place and, you know, set the initial conditions of some kind of process and then unroll that process. And that's the movement. And that means that during that movement, you're doing less processing and you kind of batch it up in advance. But you're not, like, entirely in open loop. It's not like you're playing back a tape recorder. You are actually reacting as you go. (00:57:38)

[Sergey Levine] : You're just reacting at a level of abstraction, a more basic level of abstraction. And again, this comes back to representations. Figure out which representations are sufficient for kind of planning in advance and then unrolling, which representations require a tight feedback loop. And for that tight feedback loop, like, what is it? What are you doing feedback on? Like, you know, if I'm driving a vehicle, maybe I'm doing feedback on the position of the lane marker so that I stay straight. And then at a lower frequency, I sort of gauge where I am in traffic. (00:58:01)
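
A minimal sketch of the "plan in batches, then unroll with fast local feedback" pattern described here; the loop structure, rates, and gains are hypothetical stand-ins, not Physical Intelligence's actual controller.

```python
PLAN_HZ, CONTROL_HZ = 2, 50            # re-plan twice a second, act 50 times a second

def plan(observation):
    # Stand-in for the expensive model call: returns a chunk of target positions.
    return [0.0] * (CONTROL_HZ // PLAN_HZ)

def fast_feedback(target, proprio, gain=1.0):
    # Stand-in for cheap, high-rate feedback (e.g. tracking a reference).
    return gain * (target - proprio)

proprio, chunk = 0.3, []
for step in range(2 * CONTROL_HZ):      # two seconds of simulated control
    if not chunk:                       # slow loop: think in advance, then unroll
        chunk = plan(proprio)
    command = fast_feedback(chunk.pop(0), proprio)
    proprio += 0.02 * command           # pretend the robot moved a little
print(round(proprio, 3))                # converges toward the planned target (0.0)
```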

[Dwarkesh Patel] : And then, so, you have a couple of lectures from a few years back where you say that, even for robotics, RL is in many cases better than imitation learning. But so far, the models are exclusively doing imitation learning. So I'm curious how your thinking on this has changed. Or maybe it hasn't changed, but then you still need to get to the RL eventually. Like, why can't you do RL yet? (00:58:24)

[Sergey Levine] : Yeah. So the key here is prior knowledge. In order to effectively learn from your own experience, it turns out that it's really, really important to already know something about what you're doing. Otherwise, it takes far too long. It's just like how it takes a person, when they're a child, a very long time to learn very basic things, to learn to write for the first time, for example. Once you already have some knowledge, then you can learn new things very quickly. So the purpose of training the models with supervised learning now is to build out that foundation that provides the prior knowledge, so they can figure things out much more quickly later. (00:58:57)

[Sergey Levine] : And again, this is not a new idea. This is exactly what we've seen with LLMs, right? LLMs started off being trained purely with next token prediction, and that provided an excellent starting point, first for all sorts of synthetic data generation, and then for RL. So I think it makes total sense that we would expect basically any foundation model effort to follow the same trajectory. We would first build out the foundation essentially in like a somewhat brute force way. And the stronger that foundation gets, the easier it is to then make it even better with much more accessible training. (00:59:28)
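
A schematic of the two-phase recipe Levine outlines, with simplified placeholder networks and update rules (this is not Physical Intelligence's actual training code): imitation learning builds the prior, then a policy-gradient-style update refines the same network from its own experience.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(32, 64), nn.Tanh(), nn.Linear(64, 7))  # obs -> action
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def imitation_step(obs, expert_action):
    # Phase 1: supervised / behavior-cloning update on demonstration data.
    loss = ((policy(obs) - expert_action) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def rl_step(obs, action, advantage):
    # Phase 2 (schematic): REINFORCE-style update with a unit-variance Gaussian
    # policy, reusing the imitation-pretrained network as the starting point.
    logp = -0.5 * ((action - policy(obs)) ** 2).sum(-1)
    loss = -(advantage * logp).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

obs = torch.randn(16, 32)
imitation_step(obs, torch.randn(16, 7))              # pretrain on demonstrations
rl_step(obs, torch.randn(16, 7), torch.randn(16))    # then improve from experience
```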

[Dwarkesh Patel] : In 10 years, will the best model for knowledge work also be a robotics model or have like an action expert attached to it? And the reason I ask is like, so far, we've seen advantages from using more general models for things. And will robotics fall into this bucket of we will just have the model, which does everything, including physical work and knowledge work? Or do you think they'll continue to stay separate? (00:59:52)

[Sergey Levine] : I really hope that they will actually be the same. And, you know, obviously I'm extremely biased. I love robotics. I think it's very fundamental to AI. But optimistically, I think it's actually the other way around: the robotics element of the equation will make all the other stuff better. And there are two reasons for this that I could tell you about. One has to do with representations and focus. (01:00:19)

[Sergey Levine] : So, what I said before about video prediction models: if you just want to predict everything that happens, it's very hard to figure out what's relevant. If you have the focus that comes from actually trying to do a task, that acts to structure how you see the world in a way that allows you to more fruitfully utilize the other signals, and that could be extremely powerful. The second one is that understanding the physical world at a very deep, fundamental level, a level that goes beyond just what we can articulate with language, can actually help you solve other problems. And we experience this all the time. (01:00:52)

[Sergey Levine] : Like, when we talk about abstract concepts, we say things like, "this company has a lot of momentum." Yeah, right. Or you use social metaphors to describe inanimate objects, like "my computer hates me," right? We experience the world in a particular way, and our subjective experience shapes how we think in very profound ways. And then we use that as a hammer to basically hit all sorts of other nails that are far too abstract to handle any other way. (01:01:17)

[Dwarkesh Patel] : I guess, but there might be other considerations that are relevant to physical robots, in terms of inference speed and model size, etc., which might be different from the considerations for knowledge work. But then maybe that doesn't change things. Maybe it's still the same model, but you can serve it in different ways. And maybe the advantages of co-training are high enough that... yeah, I'm wondering whether, in five years, if I'm using a model to code for me, it will also know how to do robotic stuff. Maybe the advantages of co-training with robotics are high enough that it's worth it. (01:01:49)

[Sergey Levine] : Well, and I should say that the coding is probably like the pinnacle of abstract knowledge work in the sense that like, just by the mathematical nature of computer programming, it's an extremely abstract activity, which is why people struggle with it so much. (01:02:02)

[Dwarkesh Patel] : Yeah. I'm a bit confused about why simulation doesn't work better for robots. Like, if I look at humans, smart humans do a good job of, if they're intentionally trying to learn, noticing what about the simulation is similar to real life and paying attention to that and learning from that. So if you have like pilots who are learning in simulation or F1 drivers who are learning in simulation, should it be expected to be a case that as robots get smarter, they will also be able to learn more things through simulation or is this cursed and we need real world data forever? (01:02:36)

[Sergey Levine] : This is a very subtle question. Your example with the airplane pilot using simulation is really interesting, but something to remember is that when a pilot is using a simulator to learn to fly an airplane, they're extremely goal directed. So their goal in life is not to learn to use a simulator. Their goal in life is to learn to fly the airplane. They know there will be a test afterwards and they know that eventually they'll be in charge of like a few hundred passengers and they really need to not crash that thing. And when we train models on data from multiple different domains, the models don't know that they're supposed to solve a particular task. (01:03:11)

[Sergey Levine] : They just see like, hey, here's one thing I need to master. Here's another thing I need to master. So maybe like a better analogy there is if you're like playing a video game where you can fly an airplane and then eventually someone puts you in the cockpit of a real one. Like it's not that the video game is useless, but it's not the same thing. And if you're trying to play that video game and your goal is to like really like master the video game, you're not going to go about it in quite the same way. (01:03:33)

[Dwarkesh Patel] : Oh, isn't... can you do some kind of meta-RL on this? Which is almost identical, actually, to... there's this really interesting paper you wrote in 2017 where maybe the loss function is not how well it does at a particular video game or particular simulation, but how well being trained in different video games makes it better at some other downstream task. I did a terrible job explaining, but...

[Sergey Levine] : I understand what you mean. Yeah, yeah. (01:03:57)

[Dwarkesh Patel] : Okay, maybe you can do a better job explaining what I was trying to say? (01:03:59)

[Sergey Levine] : I think what you're trying to say is basically that, well, maybe if we have like a really smart model that's doing meta learning, perhaps it can figure out that its performance on a downstream problem, a real world problem, is increased by doing something in a simulator. And then specifically make that the loss function, right? That's right. But here's the thing with this. There's a set of these ideas that are all going to be like something like trained to make it better on the real thing by leveraging something else. And the key linchpin for all of that is the ability to train to be better on the real thing. (01:04:31)

[Sergey Levine] : The thing is like, I actually suspect in reality, we might not even need to do something quite so explicit because meta learning is emergent, as you pointed out before, right? Like LLMs essentially do a kind of meta learning via in context learning. I mean, we can debate as to how much that's learning or not. But the point is that large, powerful models trained on the right objective on real data get much better at leveraging all the other stuff. And I think that's actually the key. (01:04:55)

[Sergey Levine] : And coming back to your airplane pilot, like the airplane pilot is trained on a real world objective. Like their objective is to be a good airplane pilot, to be successful, to have a good career. And all of that kind of propagates back into the actions they take in leveraging all these other data sources. So what I think is actually the key here to leverage your auxiliary data sources, including simulation, is to build the right foundation model that is really good, that has those emergent abilities. And to your point, to get really good like that, it has to have the right objective. (01:05:25)

[Sergey Levine] : Now, we know how to get the right objective out of real-world data. Maybe we can get it out of other things, but that's harder right now. And I think that, again, we can look to the examples of what happened in other fields. Like, these days, if someone trains an LLM for solving complex problems, they're using lots of synthetic data. But the reason they're able to leverage that synthetic data effectively is because they have this starting point that is trained on lots of real data that kind of gets it. And once it gets it, then it's more able to leverage all this other stuff. (01:05:53)

[Sergey Levine] : So I think, perhaps ironically, the key to leveraging other data sources, including simulation, is to get really good at using real data, understand what's up with the world, and then now you can fruitfully use all this other stuff. (01:06:04)
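
A toy version of the meta-objective idea raised a few exchanges above (all functions and numbers here are hypothetical stand-ins, not the method from the 2017 paper): the outer loss is not performance inside the simulator, but performance on the real task after adapting in simulation.

```python
import numpy as np

def adapt_in_sim(theta, sim_target, lr=0.5, steps=5):
    # Inner loop: a few gradient steps on a simulated objective ||theta - sim_target||^2.
    for _ in range(steps):
        theta = theta - lr * 2 * (theta - sim_target)
    return theta

def real_task_loss(theta, real_target):
    # Outer objective: how well the adapted parameters do on the real task.
    return float(np.sum((theta - real_target) ** 2))

theta0 = np.zeros(3)
simulators = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]  # toy "simulators"
real_task = np.array([0.5, 0.5, 0.0])                                 # toy "real world"
meta_loss = np.mean([real_task_loss(adapt_in_sim(theta0, s), real_task)
                     for s in simulators])
print(round(meta_loss, 4))   # the quantity you would minimize over theta0 / the adaptation rule
```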

[Dwarkesh Patel] : So once we have this, in 2030 or 2035, basically the sci-fi world, are you optimistic about the ability of true AGIs to build simulations in which they are rehearsing skills that no human or AI has ever had a chance to practice before? Say they need to practice being astronauts because we're building the Dyson sphere; can they just do that in simulation? Or will the issue with simulation continue to be one, regardless of how smart the models get? (01:06:33)

[Sergey Levine] : So here's what I would say: deep down, at a very fundamental level, the synthetic experience that you create yourself doesn't allow you to learn more about the world. It allows you to rehearse things, it allows you to consider counterfactuals, but somehow information about the world needs to get injected into the system. And I think the way you pose this question actually elucidates this very nicely, because in robotics, classically, people have often thought about simulation as a way to inject human knowledge, because a person knows how to write down the differential equations, they can code it up, and that gives the robot more knowledge than it had before. (01:07:12)

[Sergey Levine] : But I think that increasingly, what we're learning from experiences in other fields, from how the video generation stuff goes, from synthetic data for LLMs, is that actually, probably the most powerful way to create synthetic experience is from a really good model. Because the model probably knows more than a person does about those fine-grained details. But then, of course, where does that model get the knowledge? From experiencing the world. So, in a sense, what you said is, I think, actually quite right, in that a very powerful AI system can simulate a lot of stuff. (01:07:44)

[Sergey Levine] : But also, at that point, it kind of almost doesn't matter, because viewed as a black box, what's going on with that system is that information comes in and capability comes out. And whether the way it processes that information is by imagining some stuff and simulating or by some model-free method, it's kind of irrelevant in understanding its capabilities. (01:07:59)

[Dwarkesh Patel] : Do you have a sense of what the equivalent is in humans? Like, whatever we're doing when we're daydreaming or sleeping or... I don't know if you have some sense of what this auxiliary thing we're doing is, but if you had to make an ML analogy for it, what is it? (01:08:14)

[Sergey Levine] : Well, yeah, I mean, certainly when you sleep, your brain does stuff that looks an awful lot like what it does when it's awake, that looks an awful lot like playing back experience or perhaps generating new statistically similar experience. And so, I think it's very reasonable to guess that perhaps simulation through a learned model is like part of how your brain figures out like counterfactuals, basically. But something that's kind of even more fundamental than that is that optimal decision making at its core, regardless of how you do it, requires considering counterfactuals. You basically have to ask yourself, if I did this instead of that, would it be better? (01:08:54)

[Sergey Levine] : And you have to answer that question somehow. And whether you answer that question by using a learned simulator or whether you answer that question by using a value function or something like that, by using a reward model, in the end, it's kind of all the same. Like, as long as you have some mechanism for considering counterfactuals and figuring out which counterfactual is better, you've got it. So, I like thinking about it this way because it kind of simplifies things. (01:09:16)

[Sergey Levine] : It tells us that the key is not necessarily to do really good simulation. The key is to figure out how to answer counterfactuals. (01:09:21)
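
A minimal sketch of answering counterfactuals with a value function instead of a simulator (the critic below is a hand-written placeholder, not a trained model): score each candidate action with Q(s, a) and pick the best, without rolling out imagined futures.

```python
import numpy as np

def q_value(state, action):
    # Placeholder learned critic; in practice a neural network trained on real experience.
    return -float(np.sum((action - state[:2]) ** 2))

def choose_action(state, candidate_actions):
    # The counterfactual question: "if I did this instead of that, would it be better?"
    scores = [q_value(state, a) for a in candidate_actions]
    return candidate_actions[int(np.argmax(scores))]

state = np.array([0.2, -0.1, 0.5, 0.0])
candidates = [np.zeros(2), np.array([0.2, -0.1]), np.array([0.5, 0.5])]
print(choose_action(state, candidates))   # picks [0.2, -0.1], the highest-value option
```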

[Dwarkesh Patel] : Yeah, interesting. So, stepping back to the big picture again: the reason I'm interested in getting a concrete understanding of when this robot economy will be deployed is because it's actually pretty relevant to understanding how fast AGI will proceed. Obviously there's the data flywheel. But also, if you just extrapolate out the capex for AI, suppose by 2030, you know, people have different estimates, but many people have estimates in the hundreds of gigawatts: 100, 200, 300 gigawatts. And then you can just crunch numbers: if you have 100 or 200 gigawatts deployed by 2030, the marginal capex per year is in the trillions of dollars. (01:10:01)

[Dwarkesh Patel] : It's like two, three, four trillion dollars a year. And that corresponds to actual data centers you have to build, actual chip foundries you have to build, actual solar panel factories you have to build. And I'm very curious about whether, by this time, by 2030, if the big bottleneck we have is just people to lay out the solar panels next to the data center or assemble the data center, the robot economy will be mature enough to help significantly in that process. (01:10:36)
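
A back-of-envelope check on the capex figures being gestured at here; the cost-per-gigawatt numbers below are loose assumptions for illustration, not figures from the conversation.

```python
# If all-in AI datacenter capacity (chips, buildings, power) costs on the order of
# $30-50B per GW, then adding 50-100 GW per year implies trillions in annual capex.
for cost_per_gw_billion in (30, 50):
    for added_gw_per_year in (50, 100):
        capex_trillion = cost_per_gw_billion * added_gw_per_year / 1000
        print(f"{added_gw_per_year} GW/yr at ${cost_per_gw_billion}B/GW "
              f"is roughly ${capex_trillion:.1f}T per year")
```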

[Sergey Levine] : That's cool. So you're basically saying, like, how much concrete should I buy now to build the data center so that by 2030, I can power all the robots? Yeah, yeah. That is a more ambitious way of thinking about it than has occurred to me. But it's a cool question. I mean, the good thing, of course, is that the robots can help you build that stuff. (01:10:51)

Right. (01:10:51)

[Dwarkesh Patel] : But will they be able to by the time we need them? Like, there's the non-robotic stuff, which will also mandate a lot of capex. And then there's the robot stuff: you actually have to build robot factories, etc. But at every stage, there will be this industrial explosion across the whole stack. And how much will robotics be able to speed that up or make it possible? (01:11:12)

[Sergey Levine] : I mean, in principle, quite a lot, right? I think that we have a tendency sometimes to think about robots as like mechanical people. But that's not the case, right? Like, people are people and robots are robots. Like the better analogy for the robot, it's like your car or a bulldozer. Like, it has much lower maintenance requirements, you can put them into all sorts of weird places, and they don't have to look like people at all. You can make a robot that's, you know, 100 feet tall, you can make a robot that's tiny. (01:11:42)

[Sergey Levine] : So I think that if you have the intelligence to power very heterogeneous robotic systems, you can probably actually do a lot better than just having like, you know, mechanical people in effect. And it can be a big productivity boost for the real people. And it can allow you to solve problems that are very difficult to solve now. Yeah. You can, you know, for example, I'm not an expert on data centers by any means, but you could build data centers in a very remote location, because the robots don't have to worry about whether there's like a shopping center nearby. (01:12:12)

[Dwarkesh Patel] : And then do you have a sense of... so there's the question of where the software will be, and then there's a question of how many physical robots we will have. So, how many of the kinds of robots you're training at Physical Intelligence, like these tabletop arms, are there physically in the world? How many will there be by 2030? How many will be needed? I mean, these are tough questions, like how many will be needed for that? (01:12:34)

[Sergey Levine] : These are very tough questions. And also, you know, economies of scale in robotics so far have not functioned the same way that they probably would in the long term, right? Just to give you an example, when I started working in robotics in 2014, I used a very nice research robot called the PR2 that cost $400,000 to purchase. When I started my research lab at UC Berkeley, I bought robot arms that were $30,000. The kind of robots that we are using now at Physical Intelligence, each arm costs about $3,000. And we think they can be made for a small fraction of that. (01:13:10)

[Sergey Levine] : So, these things...

[Dwarkesh Patel] : What is it? What is the cause of that learning rate?

[Sergey Levine] : Well, there are a few things. One, of course, has to do with economies of scale: custom-built, high-end research hardware is of course going to be much more expensive than more productionized hardware. Then there's a technological element: as we get better at building actuated machines, they become cheaper. But there's also a software element, which is that the smarter your AI system gets, the less you need the hardware to satisfy certain requirements. (01:13:45)

[Sergey Levine] : So, traditional robots in factories need to make motions that are highly repeatable, and that requires a degree of precision and robustness that you don't need if you can use cheap visual feedback. So, AI also makes robots more affordable and lowers the requirements on the hardware. (01:14:02)
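
Taking the three price points Levine cites at face value, the implied learning rate is roughly a one-third drop per year (treating the span as about 2014 to 2025):

```python
# $400k (PR2, ~2014) -> $30k (lab arms) -> ~$3k per arm today (~2025).
start_cost, end_cost, years = 400_000, 3_000, 11
annual_decline = 1 - (end_cost / start_cost) ** (1 / years)
print(f"~{annual_decline:.0%} average cost decline per year")   # roughly 36%/yr
```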

[Dwarkesh Patel] : Interesting. Okay. So, do you think the learning rate will continue? Do you think it will cost hundreds of dollars by the end of the decade to buy mobile arms? (01:14:11)

[Sergey Levine] : That is a great question for my co-founder, Adnan Esmail, who is probably like the best person arguably in the world to ask that question of. But certainly, the drop in costs that I've seen has surprised me year after year. (01:14:24)

Okay. (01:14:25)

[Dwarkesh Patel] : And how many arms are there probably in the world? Is it more than a million, less than a million?

[Sergey Levine] : So, I don't know the answer to that question, but it's also a tricky question to answer, because not all arms are made equal. Like, arguably, the kind of robots that are assembling cars in a factory are just not the right kind to think about.

[Dwarkesh Patel] : So, the kind you want to train on?

[Sergey Levine] : Very few, because they are not currently commercially deployed, unlike the factory robots. (01:14:48)

[Dwarkesh Patel] : So, like, less than 100,000? (01:14:50)

[Sergey Levine] : I don't know, but probably.

[Dwarkesh Patel] : Okay. And we want billions of robots, like, at least millions of robots. If you're just thinking about the industrial explosion that you need for this explosive AI growth, not only do you need the arms, but then you need something that can move around. Basically, I'm just trying to think about whether that will be possible by the time you need a lot more labor to power this AI boom. (01:15:22)

[Sergey Levine] : Well, you know, economies are very good at filling demand when there's a lot of demand, right? Like, how many iPhones were in the world in 2001, right? That's right. So, I think there's definitely a challenge there. And I think it's something that is worth thinking about. And a particularly important question for researchers like myself is, how can AI affect how we think about hardware? (01:15:46)

Right. (01:15:46)

[Sergey Levine] : Because there are some things that I think are going to be really, really important. Like, you probably want your thing to, like, not break all the time. Yeah. There's some things that are firmly in that category of, like, question marks. Like, how many fingers do we need? Like, you said yourself before that you were kind of surprised that a robot with two fingers can do a lot. Okay, maybe you still want, like, more than that, but still, like, finding the bare minimum that still lets you have good functionality, that's important. (01:16:07)

[Sergey Levine] : That's in the question-mark box. And there are some things that we probably don't need. Like, we probably don't need the robot to be super duper precise, because we know that feedback can compensate for that. So, I think my job, as I see it right now, is to figure out what sort of minimal package we can get away with. And I really like to think about robots in terms of a minimal package, because I don't think that we will have the one ultimate robot, sort of the mechanical person, basically. I think what we will have is a bunch of things that good, effective robots need to satisfy, just like good smartphones need to have a touchscreen (that's something that we all kind of agreed on), and then a bunch of other stuff that's optional, depending on the need, depending on the cost point, et cetera. (01:16:46)

[Sergey Levine] : And I think there will be a lot of innovation, where once we have very capable AI systems that can be plugged into any robot to endow it with some basic level of intelligence, then lots of different people can innovate on how to get the robot hardware to be optimal for each niche it needs to fill.

[Dwarkesh Patel] : In terms of manufacturers, is there some Nvidia of robotics?

[Sergey Levine] : Not right now. Maybe there will be someday. (01:17:07)

[Sergey Levine] : I would really like... maybe I'm being idealistic, but I would really like to see a world where there's a lot of heterogeneity in robots.

[Dwarkesh Patel] : What is the biggest bottleneck in the hardware today, as somebody who's designing the algorithms that run on it?

[Sergey Levine] : It's a tough question to answer, mainly because things are changing so fast. I think that, to me, the things I spend a significant amount of time thinking about on the hardware side are really more like reliability and cost. It's not that I'm that worried about cost; it's just that cost translates to number of robots, which translates to amount of data. (01:17:39)

[Sergey Levine] : And being an ML person, I really like having lots of data, so I really like having robots that are low cost, because then I can have more of them and therefore more data. And reliability is important, more or less, for the same reason. But I think it's something that we'll get more clarity on as things progress, because as we... basically, the AI systems of today are not pushing the hardware to the limit. So as the AI systems get better and better, the hardware will get pushed to the limit, and then we'll hopefully have a much better answer to your question. (01:18:06)

[Dwarkesh Patel] : Okay, so this is a question I've had for a lot of guests, and is that if you go through any layer of this AI explosion, you find that a bunch of the actual source supply chain is being manufactured in China. So, other than chips, obviously. But then, you know, if you talk about data centers, and you're like, oh, all the wafers for solar panels, and a bunch of the cells and modules, etc. are manufactured in China, then you just go through the supply chain. (01:18:39)

[Dwarkesh Patel] : And then, obviously, robot arms are being manufactured in China. And so if you live in this world, where the hardware is just incredibly valuable to ramp up manufacturing of, because each robot can produce some fraction of the value that a human worker can produce. And not only is that true, but the value of human workers, or any kind of worker, has just tremendously skyrocketed because we just need tons of bodies to lay out the tens of thousands of solar farms, acres of solar farms, and data centers, and foundries, and everything. (01:19:15)

[Dwarkesh Patel] : In this boom world, the big bottleneck there is just like, how many robots can you physically deploy? How many can you manufacture? Because you guys are going to come up with the algorithms, now we just need the hardware. And so, this is a question I've asked many guests, which is that if you look at the part of the chain that you are observing, what is the reason that China just doesn't win by default? If they're producing all the robots, and you come up with the algorithms that make those robots super valuable, why don't they just win by default? (01:19:46)

[Sergey Levine] : Yeah, so this is a very complex question. I'll start with the broader themes and then try to drill a little bit into the details. One broader theme here is that if you want to have an economy where you get ahead by having a highly educated workforce, by having people that have high productivity, meaning that for each person's hour of work, lots of stuff gets done, automation is really, really good. (01:20:19)

[Sergey Levine] : Because automation is what multiplies the amount of productivity that each person has. Again, same as like LLM coding tools. LLM coding tools amplify the productivity of a software engineer, robots will amplify the productivity of basically everybody that is doing work. Now, that's kind of like a final state, like a desirable final state. Now, there's a lot of complexity in how you get to that state, how you make that an appealing journey to society, how you navigate the geopolitical dimension of that, like all of that stuff is actually pretty complicated. And it requires making a number of really good decisions, like good decisions about investing in a balanced robotics ecosystem, supporting both software innovation and hardware innovation. (01:21:08)

[Sergey Levine] : I don't think any of those are insurmountable problems. It just requires a degree of kind of long-term vision, and the right kind of balance of investment. But what makes me really optimistic about this is that final state. I think we can all agree that in the United States, we would like to have the kind of society where people are highly productive, where we have highly educated people doing high value work. And because that state seems to me very compatible with automation, with robotics, at some level, there should be a lot of incentive to get to that state. (01:21:45)

[Sergey Levine] : And then from there, we have to solve for all the details that will help us get there. And that's not easy. I think there are a lot of complicated decisions that need to be made in terms of private industry, in terms of investment, in terms of the political dimension. But I'm very optimistic about it because it seems to me like the light at the end of the tunnel is in the right direction. (01:22:05)

[Dwarkesh Patel] : I mean, yeah, I guess there's a different question, which is that if the value is sort of bottlenecked by hardware, and so you just need to produce more hardware, what is the path by which hundreds of millions of robots or billions of robots are being manufactured in the U.S. or with allies? I don't know how to approach that question, but it seems like a different question than like, okay, well, what is the impact on human wages or something? (01:22:29)

[Sergey Levine] : So again, for the specifics of how we make that happen, I think that's a very long conversation that I'm probably not the most qualified to speak to. But I think that in terms of the ingredients, the ingredient here that I think is important is that robots help with physical things, physical work. And if producing robots is itself physical work, then getting really good at robotics should help with that. It's a little circular, of course. (01:22:57)

[Sergey Levine] : And, you know, as with all circular things, you have to like kind of bootstrap it and try to get that engine going. But it seems like it is an easier problem to address than, for example, the problem of digital devices, where work goes into creating, you know, computers, phones, et cetera, but the computers and phones don't themselves help with the work. (01:23:18)

Right. (01:23:18)

[Dwarkesh Patel] : I guess feedback loops go both ways. They can help you or they can help others. And it's a positive-sum world, so it's not necessarily bad to help others. But to the extent that a lot of the things which would go into this feedback loop, the subcomponent manufacturing and supply chain, already exist in China, it seems like the stronger feedback loop would exist in China. And then there's a separate discussion: maybe that's fine. Maybe that's good. (01:23:44)

[Dwarkesh Patel] : And maybe they'll continue exporting this to us. But I just find it notable that whenever I talk to guests about different things, it's always, oh yeah, within a few years the key bottleneck to every single part of the supply chain here will be something that China is roughly the 80 percent world supplier of. Well, yeah. (01:24:03)

[Sergey Levine] : And this is why I said before that I think something really important to get right here is a balanced robotics ecosystem. I think AI is tremendously exciting, but I think we should also recognize that getting AI right is not the only thing that we need to do. We need to think about how to balance our priorities, our investment, the kind of things that we spend our time on. Just as an example, at Physical Intelligence, we do take hardware very seriously. (01:24:31)

[Sergey Levine] : Actually, we build a lot of our own things, and we want to have a hardware roadmap alongside our AI roadmap. But that's just us. For the United States, and arguably for human civilization as a whole, I think we need to think about these problems very holistically. (01:24:51)

Yeah. (01:24:51)

[Sergey Levine] : And I think it is easy to get distracted sometimes when there's a lot of excitement and a lot of progress in one area, like AI, and we are tempted to lose track of other things, including the things you've mentioned: there's a hardware component, there's an infrastructure component with compute, and things like that. So I think that in general it's good to have a more holistic view of these things, and I wish we had more holistic conversations about that sometimes. (01:25:16)

[Dwarkesh Patel] : From the perspective of society as a whole, how should people be thinking about the advances in robotics and knowledge work? I think it's basically that society should be planning for full automation. There will be a period in which people's work is way more valuable because there's this huge boom in the economy from building all these data centers and building all these factories. But eventually, humans can do things with their body, and we can do things with our mind. (01:25:39)

[Dwarkesh Patel] : There's not some secret third thing. So what should society be planning for? It should be the full automation of human work. And society will also be much wealthier, so presumably there are ways to do this such that everybody is much better off than they are today. But the end state, the light at the end of the tunnel, is full automation plus a super-wealthy society, with some redistribution or whatever way to figure that out, right? I don't know if you disagree with that characterization. (01:26:06)

[Sergey Levine] : So I think at some level, that's a very reasonable way to look at things. But I think that if there's one thing that I've learned about technology, it's that it rarely evolves quite the way that people expect. And sometimes the journey is just as important as the destination. So I think it's actually very difficult to plan ahead for an end state. But I think directionally what you said makes a lot of sense. And I do think that it's very important for us collectively to think about how to structure the world around us in a way that is amenable to greater and greater automation across all sectors. (01:26:42)

[Sergey Levine] : But I think we should really think about the journey just as much as the destination, because things evolve in all sorts of unpredictable ways, and we'll find automation showing up in all sorts of places, probably not the places we expect first. So the constant here that I think is really important is that education is really, really valuable. Education is the best buffer somebody has against the negative effects of change. So if there's one single lever that we can pull collectively as a society, it's more education, because that's really helpful. (01:27:18)

[Dwarkesh Patel] : Is that true? I mean, Moravec's paradox suggests that the things where education benefits humans the most might be the easiest to automate, because it's really easy to educate AIs. You can throw at them the textbooks it would take you eight years of grad school to get through, in an afternoon. (01:27:33)

[Sergey Levine] : Well, what education gives you is flexibility. So it's less about the particular facts than about your ability to acquire skills, to acquire understanding. So it has to be good education. (01:27:47)

[Dwarkesh Patel] : Right. Okay, Sergey, thank you so much for coming on the podcast. Thank you. Fascinating. Yeah, this was intense. Tough questions. I hope you enjoyed this episode. If you did, the most helpful thing you can do is just share it with other people who you think might enjoy it. Send it to your friends, your group chats, Twitter, wherever else, just let the word go forth. Other than that, super helpful if you can subscribe on YouTube and leave a five star review on Apple Podcasts and Spotify. Check out the sponsors in the description below. (01:28:15)

[Dwarkesh Patel] : If you want to sponsor a future episode, go to dwarkesh.com slash advertise. Thank you for tuning in. I'll see you on the next one. (01:28:23)

(2025-09-13)