A Comparative Analysis of Contrastive and Generative Vision-Language Models for Zero-Shot Behavior Recognition in Surveillance Videos

Authors

Ayman Mohamed, Saeed Hamouda, Abdelrahman Elsayed, Mohamed M. Reda Ali

DOI:

https://doi.org/10.66279/vcth9h10

Keywords:

Vision-Language Models, Zero-Shot Learning, Behavior Recognition, Surveillance Videos

Abstract

Vision-language models (VLMs) have recently demonstrated strong zero-shot capability for object recognition and scene classification, yet their suitability for recognizing covert human behaviors, such as theft, remains largely unexamined. This paper presents a case study comparing two zero-shot VLM paradigms for cashier theft detection in surveillance footage: a contrastive embedding model (CLIP) and a generative vision-language model (a Llama-3.2-11B-Vision-based pipeline operating in the BLIP/BLIP-2 family of generative architectures). On a set of real cashier-counter recordings, the contrastive model produced near-tied confidence scores for the theft and normal-activity prompts (theft confidence 0.504, normal-activity confidence 0.496), indicating a weak discriminative margin when intent and temporal context are required. The generative pipeline, in contrast, produced confident, structured binary outcomes (theft confidence 1.000, normal-activity confidence 0.000) accompanied by interpretable natural-language descriptions of the suspect and the event. These results are drawn from a small, non-benchmarked sample rather than a large annotated corpus, but they suggest that contrastive similarity scoring is better suited to fast object-level screening, whereas generative reasoning is better suited to behavior-level interpretation. A hybrid pipeline that couples a fast contrastive pre-filter with a generative reasoning stage is proposed as a practical direction for zero-shot surveillance systems that require both efficiency and interpretability.
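
To make the near-tied contrastive scores concrete, the following is a minimal sketch of how a CLIP-style model typically turns image-prompt similarities into per-prompt confidences, and how a margin test could route ambiguous frames to a generative stage. It is not the authors' implementation: the function names, the `logit_scale` of 100, and the `min_margin` threshold of 0.2 are assumptions chosen for illustration, and the embeddings are toy vectors rather than real model outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_scores(image_emb, prompt_embs, logit_scale=100.0):
    """Per-prompt confidences from scaled cosine similarities.

    logit_scale=100.0 is an assumed temperature, not a value
    reported in the paper.
    """
    sims = [cosine(image_emb, p) for p in prompt_embs]
    return softmax([logit_scale * s for s in sims])


def needs_generative_stage(scores, min_margin=0.2):
    """Hypothetical routing rule: escalate when the top two
    confidences are closer than min_margin (assumed threshold)."""
    top, second = sorted(scores, reverse=True)[:2]
    return (top - second) < min_margin


# Scores reported in the abstract for the contrastive model.
reported_contrastive = [0.504, 0.496]
# Scores reported in the abstract for the generative pipeline.
reported_generative = [1.000, 0.000]

escalate = needs_generative_stage(reported_contrastive)
settled = not needs_generative_stage(reported_generative)
```

Under a two-way softmax, a 0.504 versus 0.496 split corresponds to a logit difference of roughly 0.016, which is why such a result carries little discriminative signal. In this sketch the reported contrastive scores fall below the assumed margin and would be escalated, while the reported generative scores would not.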

Author Biographies

  • Ayman Mohamed, Al-Ahliyya Amman University

    Faculty of Information Technology, Al-Ahliyya Amman University, Amman 19328, Jordan

  • Saeed Hamouda, Al-Ahliyya Amman University

    Faculty of Information Technology, Al-Ahliyya Amman University, Amman 19328, Jordan

  • Abdelrahman Elsayed, Al-Ahliyya Amman University

    Faculty of Information Technology, Al-Ahliyya Amman University, Amman 19328, Jordan

  • Mohamed M. Reda Ali, Isra University

    Department of Computer Science, Faculty of Information Technology, Isra University, Jordan

Published

29-06-2026

How to Cite

A Comparative Analysis of Contrastive and Generative Vision-Language Models for Zero-Shot Behavior Recognition in Surveillance Videos. (2026). Engineering Systems and Intelligent Technologies (ESIT), 3(1), 23-37. https://doi.org/10.66279/vcth9h10
