Video action recognition meets vision-language models exploring human factors in scene interaction: a review

doi:https://doi.org/10.1007/s11801-025-5058-9

Optoelectronics Letters, Issue 10, 2025, pp. 626-640

Author: GUO Yuping, GAO Hongwei, YU Jiahui, GE Jinchao, HAN Meng, JU Zhaojie
Affiliation: 1. School of Automation and Electrical Engineering, Shenyang Ligong University, Shenyang 110159, China; 2. Innovation Center for Smart Medical Technologies & Devices, Binjiang Institute of Zhejiang University, Hangzhou 310053, China; 3. School of Computer and Mathematical Sciences, University of Adelaide, Adelaide 5000, Australia; 4. College of Computer Science and Technology, Zhejiang University, Hangzhou 310058, China; 5. College of Biomedical Engineering & Instrument Science, Zhejiang University, Hangzhou 310058, China

Abstract:

Video action recognition (VAR) aims to analyze dynamic behaviors in videos and achieve semantic understanding. VAR faces challenges such as temporal dynamics, action-scene coupling, and the complexity of human interactions. Existing methods can be categorized into motion-level, event-level, and story-level ones based on spatiotemporal granularity. However, single-modal approaches struggle to capture complex behavioral semantics and human factors. Therefore, in recent years, vision-language models (VLMs) have been introduced into this field, providing new research perspectives for VAR. In this paper, we systematically review spatiotemporal hierarchical methods in VAR and explore how the introduction of large models has advanced the field. Additionally, we propose the concept of "Factor" to identify and integrate key information from both visual and textual modalities, enhancing multimodal alignment. We also summarize various multimodal alignment methods and provide in-depth analysis and insights into future research directions.

Citation:

GUO Yuping, GAO Hongwei, YU Jiahui, GE Jinchao, HAN Meng, JU Zhaojie. Video action recognition meets vision-language models exploring human factors in scene interaction: a review[J]. Optoelectronics Letters, 2025, (10): 626-640

History

Received: March 16, 2025
Revised: May 26, 2025
Online: September 22, 2025
