Autonomous Vehicle

Scalable Parallel Prompting for Complex AV Video Captioning

[ CVPRW 2026 ] We propose pSVLMs, a scalable video captioning framework based on small VLMs that produces diverse intermediate captions, which are consolidated into a comprehensive unified description for autonomous driving datasets.

April Yang, Roberto Amoroso, Nikita Durasov, Devansh Bisla, Sandipan Kundu, Elmar Haussmann, Ruchi Bhargava, Maying Shen, Nadine Chang, Jose M. Alvarez

Cosmos-Embed1: A Joint Video-Text Embedder for Physical AI

Cosmos-Embed1 is a joint video-text embedder tailored for physical AI applications, including autonomous vehicles (AVs) and robotics. It can be used for text-to-video retrieval, inverse video search, semantic deduplication, zero-shot and k-nearest-neighbors (kNN) classification, and as a base model for video curation tasks.

Francesco Ferroni, Prithvijit Chattopadhyay, Greg Heinrich, Mike Ranzinger, Roberto Amoroso, Alice Luo, Andrew Wang, Ming-Yu Liu
