VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs

Abstract

The recent developments in Large Multi-modal Video Models (Video-LMMs) have significantly enhanced our ability to interpret and analyze video data. Despite their impressive capabilities, current Video-LMMs have not been evaluated for anomaly detection tasks, which is critical to their deployment in practical scenarios e.g., towards identifying deepfakes, manipulated video content, traffic accidents and crimes. In this paper, we introduce VANE-Bench, a benchmark designed to assess the proficiency of Video-LMMs in detecting and localizing anomalies and inconsistencies in videos. Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models, encompassing a variety of subtle anomalies and inconsistencies grouped into five categories: unnatural transformations, unnatural appearance, pass-through, disappearance and sudden appearance. Additionally, our benchmark features real-world samples from existing anomaly detection datasets, focusing on crime-related irregularities, atypical pedestrian behavior, and unusual events. The task is structured as a visual question-answering challenge to gauge the models' ability to accurately detect and localize the anomalies within the videos. We evaluate nine existing Video-LMMs, both open and closed sources, on this benchmarking task and find that most of the models encounter difficulties in effectively identifying the subtle anomalies. In conclusion, our research offers significant insights into the current capabilities of Video-LMMs in the realm of anomaly detection, highlighting the importance of our work in evaluating and improving these models for real-world applications.

VANE-Bench serves as a strong benchmark to improve the performance and capabilities of Video-LMMs on anomaly detection.

Key Contributions:

We present VANE-Bench: Video ANomaly Evaluation Benchmark, consisting of 325 video clips, and 559 challenging question-answer pairs from both real-world video surveillance and AI-generated videos.
We perform detailed evaluation of over nine state-of-the-art closed-source and open-source Video-LMMs on VANE-Bench, and show that most models exhibit poor performance, highlighting the challenging nature of our proposed benchmark.
We conduct detailed result analysis, and also perform human evaluation on VANE-Bench to set a reasonable benchmark target.
We open-source our code, and describe the data construction process of VANE-Bench along with making our data publicly available.

VANE-Bench Overview

VANE-Bench is a challenging benchmark created to assess video anomaly detection within Large Multimodal Models (LMMs). It consists of 325 video clips, and 559 QA pairs from over nine different AI-generated and real-world video samples, each showcasing different types of anomalies. The subtle, and hard to detect anomalies present within VANE-Bench makes our benchmark challenging for all the SOTA LMMs, and even for few humans. Our benchmark enables the development and evaluation of stronger Video-LMMs for real-world applications like deepfake detection, crime detection, and traffic accidents identification.

VANE-Bench dataset statistics: Left and Middle: Composition and type of anomalies in AI-generated and real-world video samples. Right: Number of samples and QA pairs in each video dataset.

VANE-Bench Construction

We propose a semi-automatic data construction pipeline which we used to construct our VANE-Bench dataset. Since SOTA AI-Generated videos may have subtle and challenging anomalies, we require high quality captions describing all of the specific inconsistencies present in the given video to construct relevant QA pairs. Our pipeline first annotates the anomalies using the frame annotation module (FAM). The caption generating module (CGM) then utilizes these annotations to produce captions, followed by the question-answer generation module (QAGM) creating QA pairs based on the annotated frames and captions. Annotating the clips before caption generation is crucial for focusing the model on the specific anomaly regions in the video.

Quantitative Results

Evaluation results of Video-LMMs across different types of video samples on the VANE benchmark. We present results for both open-source and closed-source models. The first five rows show results on AI-generated videos and last four contain results on real-world anomaly datasets. The evaluation metric is accuracy out of 100.00
Benchmark Category	Video-LLaMA	VideoChat	Video-ChatGPT	Video-LLaVA	MovieChat	LLaMA-VID	TimeChat	Gemini-1.5 Pro	GPT4o
SORA	11.59	10.74	26.47	10.86	8.69	7.97	21.73	51.45	55.80
SORA
OpenSORA	18.00	28.00	22.00	18.00	10.00	14.00	26.00	84.00	68.00
OpenSORA
Runway Gen2	16.00	4.00	12.00	16.00	16.00	20.00	28.00	28.00	40.00
Runway Gen2
VideoLCM	10.57	17.64	18.26	19.23	14.42	19.23	22.11	49.04	50.96
VideoLCM
Modelscope-T2V	10.41	20.83	16.66	16.66	6.25	14.58	20.83	75.00	64.58
Modelscope-T2V
Avenue	30.00	32.25	39.39	3.03	18.18	27.27	24.20	100.00	84.85
Avenue
UCFCrime	9.47	11.57	31.57	10.52	18.51	15.78	7.30	76.84	83.16
UCFCrime
UCSD-Ped1	16.66	13.33	40.00	2.77	6.66	6.66	27.58	96.67	93.33
UCSD-Ped1
UCSD-Ped2	5.55	13.88	19.44	6.06	11.11	19.44	11.11	94.44	86.11
UCSD-Ped2

Qualitative Results and Other Plots

Left: Performance of Video-LMMs on five anomaly categories of SORA dataset. Right: Overall performance of Video-LMMs averaged across all the benchmark datasets including AI-generated and real-world anomaly datasets.

Human vs Video-LMMs' performance on SORA: Performance comparison of humans vs Video-LMMs on VQA task of detecting anomalies in SORA dataset. We find that closed-source Video-LMMs perform comparably to humans while open-source Video-LMMs struggle to detect the subtle anomalies.

Qualitative examples: The figure shows the response of Video-LMMs to the VQA task of detecting anomalies in the video. The correct answer is written in bold in the user query. We find that the majority of Video-LMMs struggle to answer the questions correctly.

Example showcasing the importance of our Frame Annotation Module (FAM). We note that without FAM, the LMM responsible for generating the captions is not able to identify or describe the accurate anomaly present in the video. However, by providing the bounding box annotation for the inconsistency, we are able to ensure that the generated caption accurately describes the anomaly in the video.

BibTeX

@misc{bharadwaj2024vanebench,
      title={VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs}, 
      author={Rohit Bharadwaj and Hanan Gani and Muzammal Naseer and Fahad Shahbaz Khan and Salman Khan},
      year={2024},
      eprint={2406.10326},
      archivePrefix={arXiv},
      primaryClass={id='cs.CV' full_name='Computer Vision and Pattern Recognition' is_active=True alt_name=None in_archive='cs' is_general=False description='Covers image processing, computer vision, pattern recognition, and scene understanding. Roughly includes material in ACM Subject Classes I.2.10, I.4, and I.5.'}
}