My name is Roberto Amoroso. I am an ELLIS PhD student enrolled in the International Doctorate in ICT program at the AImageLab research group of the University of Modena and Reggio Emilia 🇮🇹, under the supervision of Prof. Rita Cucchiara and Prof. Lorenzo Baraldi. I work on the study and development of novel Computer Vision and Deep Learning techniques.
I am currently a PhD Intern at LMU Munich (Ludwig-Maximilians-Universität München), Germany 🇩🇪, focusing on Multimodal Video Understanding with Large Language Models (LLMs) and Open-vocabulary Image Segmentation, under the co-supervision of Prof. Volker Tresp.
Prior to joining AImageLab, I was a Research Scholar at the Networking Research Group in Saint Louis, USA 🇺🇸, working on super-resolution techniques applied to Internet traffic matrices.
My primary areas of research are Open-vocabulary Segmentation and Multimodal Video Understanding. Open-vocabulary Segmentation comprises multimodal machine learning approaches for learning new semantic concepts from textual information in a weakly supervised or unsupervised way. In addition, I have conducted research on the pre-training and optimization of Transformer-based architectures for image classification, self-supervised learning, deepfake detection of synthetic images, and the development of image watermarking systems for artwork protection.
Feel free to reach out to me if you have any questions or curiosities! :)
ELLIS PhD in AI and Computer Vision, 2024
University of Modena and Reggio Emilia
MS in Artificial Intelligence, 2020
University of Modena and Reggio Emilia
BS in Computer Engineering, 2018
University of Modena and Reggio Emilia
HumanE-AI-NET
project, funded by the EU Framework Programme for Research and Innovation Horizon 2020.
Attended the following courses:
[ BMVC 2023 ] We present a novel superpixel-based positional encoding technique that combines Vision Transformer (ViT) features with superpixel priors to improve the performance of semantic segmentation architectures.
We propose MaPeT, a novel self-supervised pre-training technique for Vision Transformers, and k-CLIP, a novel image tokenizer that directly employs discretized CLIP features.
We propose a novel deepfake detection method for images generated by Diffusion Models and introduce COCO-Fake, a new dataset consisting of 650K generated fake images.