Scalable Parallel Prompting for Complex AV Video Captioning

April Yang, Roberto Amoroso, Nikita Durasov, Devansh Bisla, Sandipan Kundu, Elmar Haussmann, Ruchi Bhargava, Maying Shen, Nadine Chang, Jose M. Alvarez

April 2026

Overview of our proposed pSVLMs framework for scalable metadata-aware video captioning.

Abstract

Video captioning is crucial for curating datasets to train end-to-end autonomous driving (AD) models. Large video-language models (VLMs) can generate temporally grounded descriptions of driving scenes, but they are costly, have limited spatial understanding, omit specific details, and suffer from compounding hallucinations, in which early errors corrupt the subsequent output. We propose Parallel-SVLMs (pSVLMs), a scalable video captioning framework built on small VLMs. Captions are first decomposed into structured components describing the scene, road entities, and key driving actions. Then, a parallel multi-round prompting module produces diverse intermediate captions, which are consolidated into a comprehensive unified description. Finally, metadata, such as ego trajectories and basic speed and direction information, is incorporated to enhance 3D scene understanding and improve contextual alignment. Experiments on a large internal dataset show that our method produces more comprehensive and granular captions than existing approaches while remaining computationally efficient. This framework enables scalable, metadata-aware multimodal captioning for improved dataset curation and safer AD deployment.
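The three stages named in the abstract (component decomposition, parallel multi-round prompting with consolidation, and metadata injection) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the prompt wording, the component keys, the `fake_vlm` stand-in, the thread-pool parallelism, the heading convention (positive change means a right turn), the 10-degree threshold, and the deduplicate-and-join consolidation are all assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical prompts. The abstract names only the three component types.
COMPONENTS: Dict[str, str] = {
    "scene": "Describe the overall scene.",
    "road_entities": "List the road entities that are visible.",
    "driving_actions": "Describe the key driving actions of the ego vehicle.",
}


def summarize_metadata(speeds_mps: List[float], headings_deg: List[float]) -> str:
    """Turn basic speed and heading samples into a short text hint (assumed format)."""
    mean_speed = sum(speeds_mps) / len(speeds_mps)
    # Net heading change, wrapped into [-180, 180). Assumes clockwise-positive headings.
    change = (headings_deg[-1] - headings_deg[0] + 180.0) % 360.0 - 180.0
    if abs(change) < 10.0:  # assumed threshold
        direction = "going straight"
    elif change > 0:
        direction = "turning right"
    else:
        direction = "turning left"
    return f"Ego vehicle: mean speed {mean_speed:.1f} m/s, {direction}."


def caption_clip(model: Callable[[str, str], str], clip_id: str,
                 metadata_hint: str, rounds: int = 2) -> Dict[str, List[str]]:
    """Query the model once per (component, round) pair, all in parallel."""
    jobs = [(name, f"{question}\n{metadata_hint}\n(round {r + 1})")
            for name, question in COMPONENTS.items() for r in range(rounds)]
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        answers = list(pool.map(lambda job: model(clip_id, job[1]), jobs))
    intermediate: Dict[str, List[str]] = {name: [] for name in COMPONENTS}
    for (name, _), answer in zip(jobs, answers):
        intermediate[name].append(answer)
    return intermediate


def consolidate(intermediate: Dict[str, List[str]]) -> str:
    """Placeholder consolidation: drop exact duplicates, join per component."""
    lines = []
    for name, captions in intermediate.items():
        unique = list(dict.fromkeys(captions))  # order-preserving dedupe
        lines.append(f"{name}: " + " ".join(unique))
    return "\n".join(lines)


def fake_vlm(clip_id: str, prompt: str) -> str:
    """Stand-in for a small VLM call; echoes the first prompt line."""
    return f"[{clip_id}] answer to: {prompt.splitlines()[0]}"


if __name__ == "__main__":
    hint = summarize_metadata([10.0, 12.0], [0.0, 45.0])
    print(consolidate(caption_clip(fake_vlm, "clip0", hint)))
```

Because each (component, round) query is independent, a failure or hallucination in one intermediate caption does not feed into the others, which is one plausible reading of how parallel prompting limits compounding errors. In practice the consolidation step would likely be another model call rather than string joining.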

Type

Conference paper

Publication

In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2026

Video Captioning, Multimodal, VLM, Autonomous Driving, Autonomous Vehicle, Parallel Prompting

Roberto Amoroso

Senior Research Engineer @ NVIDIA
VLMs & Multimodal Retrieval
