Cosmos-Embed1 is a joint video-text embedder tailored for physical AI applications, including autonomous vehicles (AV) and robotics. It can be used for text-to-video retrieval, inverse video search, semantic deduplication, zero-shot and k-nearest-neighbors (kNN) classification, and as a base model for video curation tasks.

Francesco Ferroni, Prithvijit Chattopadhyay, Greg Heinrich, Mike Ranzinger, Roberto Amoroso, Alice Luo, Andrew Wang, Ming-Yu Liu

[ CVIU 2025 ] We propose MaPeT, a novel self-supervised pre-training technique for Vision Transformers, and k-CLIP, a novel image tokenizer that directly employs discretized CLIP features.

Lorenzo Baraldi, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Andrea Pilzer, Rita Cucchiara

[ WACV 2025 ] We propose PQR, a novel LLM-based framework for video question answering that introduces T-Former, a question-guided temporal querying Transformer designed to efficiently extract and integrate video-specific features tailored to a given question.

Roberto Amoroso, Gengyuan Zhang, Rajat Koner, Lorenzo Baraldi, Rita Cucchiara, Volker Tresp

[ CVPR 2024 ] We present FreeDA, a novel training-free diffusion-augmented method for open-vocabulary segmentation, which leverages diffusion models to visually localize generated concepts and local-global similarities to match superpixel-based class-agnostic regions with semantic classes.

Luca Barsellotti, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

[ WACV 2024 ] We present FOSSIL, a novel Unsupervised Open-Vocabulary Semantic Segmentation model that enables a self-supervised visual backbone to perform open-vocabulary segmentation directly on the visual modality by retrieving a support set of generated synthetic references.

Luca Barsellotti, Roberto Amoroso, Lorenzo Baraldi, Rita Cucchiara

[ WACV 2024 ] We present a novel method for enhancing the evaluation of semantic segmentation models by categorizing errors, offering insights into false positives and false negatives, and improving performance by combining the strengths of different models.

Maximilian Bernhard, Roberto Amoroso, Yannic Kindermann, Lorenzo Baraldi, Rita Cucchiara, Volker Tresp, Matthias Schubert

[ TOMM 2024 ] We propose a novel deepfake detection method for images generated by diffusion models and introduce COCO-Fake, a new dataset of 650K generated fake images.

Roberto Amoroso, Davide Morelli, Marcella Cornia, Lorenzo Baraldi, Alberto Del Bimbo, Rita Cucchiara

[ BMVC 2023 ] We present a novel superpixel-based positional encoding technique that combines Vision Transformer (ViT) features with superpixel priors to improve the performance of semantic segmentation architectures.

Roberto Amoroso, Matteo Tomei, Lorenzo Baraldi, Rita Cucchiara

[ ICIAP 2023 ] We propose a novel open-vocabulary semantic segmentation paradigm based on weakly supervised visual prototypes extracted from image-caption pairs and adopt a retrieval-based approach to combine visual and textual features to enhance segmentation performance.

Luca Barsellotti, Roberto Amoroso, Lorenzo Baraldi, Rita Cucchiara

[ IEEE CCNC 2023 ] In this work, we propose a super-resolution technique for traffic matrix estimation. We also expand our design by employing a federated learning model to address scalability and improve performance.

Roberto Amoroso, Lorenzo Pappone, Flavio Esposito