Cosmos-Embed1: A Joint Video-Text Embedder for Physical AI

Francesco Ferroni, Prithvijit Chattopadhyay, Greg Heinrich, Mike Ranzinger, Roberto Amoroso, Alice Luo, Andrew Wang, Ming-Yu Liu

May 2025

Video-Text Retrieval, Multimodal, Physical AI, Autonomous Vehicle, Robotics

Overview of our proposed Cosmos-Embed1 architecture.

Abstract

Cosmos-Embed1 is a joint video-text embedder tailored for physical AI. Multimodal embeddings, particularly those from joint video-text embedders, are critical for physical AI development pipelines: they enable essential data curation tasks, including text-to-video search, inverse video search, semantic deduplication, and targeted filtering. These embeddings can also serve as conditioning representations for downstream physical AI models. While existing video-text embedders perform well in general domains, they underperform substantially on physical AI tasks. To bridge this gap, we introduce Cosmos-Embed1.
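The curation tasks named in the abstract all reduce to similarity comparisons in a shared embedding space. The following is a minimal sketch of that idea, not the Cosmos-Embed1 API: the function names (`cosine`, `search`, `deduplicate`), the 3-dimensional toy vectors, and the 0.95 threshold are all hypothetical and chosen only for illustration. In practice the vectors would come from the embedder itself.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_emb, video_embs, top_k=2):
    """Rank video ids by similarity to a query embedding.

    A text embedding as the query gives text-to-video search;
    a video embedding as the query gives inverse video search.
    """
    ranked = sorted(
        video_embs.items(),
        key=lambda item: cosine(query_emb, item[1]),
        reverse=True,
    )
    return [video_id for video_id, _ in ranked[:top_k]]

def deduplicate(video_embs, threshold=0.95):
    """Greedy semantic deduplication: keep a clip only if its similarity
    to every clip already kept is below the threshold."""
    kept = []
    for video_id, emb in video_embs.items():
        if all(cosine(emb, video_embs[k]) < threshold for k in kept):
            kept.append(video_id)
    return kept

# Hypothetical toy embeddings standing in for real model outputs.
videos = {
    "left_turn_a": [0.9, 0.1, 0.0],
    "left_turn_b": [0.89, 0.12, 0.01],  # near-duplicate of left_turn_a
    "robot_grasp": [0.0, 0.2, 0.95],
}
text_query = [1.0, 0.0, 0.0]  # pretend embedding of "car turning left"

text_hits = search(text_query, videos, top_k=2)
video_hits = search(videos["robot_grasp"], videos, top_k=1)
unique_clips = deduplicate(videos)
```

Here `text_hits` ranks the two left-turn clips first, and `unique_clips` drops the near-duplicate. Targeted filtering follows the same pattern: score every clip against a text query and keep or discard those above a chosen similarity threshold.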

Type

Conference paper

Roberto Amoroso

Senior Research Engineer @ NVIDIA
VLMs & Multimodal Retrieval
