Video action recognition meets vision-language models exploring human factors in scene interaction: a review
Author: GUO Yuping, GAO Hongwei, YU Jiahui, GE Jinchao, HAN Meng, JU Zhaojie
Affiliation:

1. School of Automation and Electrical Engineering, Shenyang Ligong University, Shenyang 110159, China
2. Innovation Center for Smart Medical Technologies & Devices, Binjiang Institute of Zhejiang University, Hangzhou 310053, China
3. School of Computer and Mathematical Sciences, University of Adelaide, Adelaide 5000, Australia
4. College of Computer Science and Technology, Zhejiang University, Hangzhou 310058, China
5. College of Biomedical Engineering & Instrument Science, Zhejiang University, Hangzhou 310058, China

Abstract:

Video action recognition (VAR) aims to analyze dynamic behaviors in videos and achieve semantic understanding. VAR faces challenges such as temporal dynamics, action-scene coupling, and the complexity of human interactions. Existing methods can be categorized by spatiotemporal granularity into motion-level, event-level, and story-level approaches. However, single-modal approaches struggle to capture complex behavioral semantics and human factors, so in recent years vision-language models (VLMs) have been introduced into this field, providing new research perspectives for VAR. In this paper, we systematically review spatiotemporal hierarchical methods in VAR and examine how the introduction of large models has advanced the field. Additionally, we propose the concept of the "Factor" to identify and integrate key information from both the visual and textual modalities, enhancing multimodal alignment. We also summarize various multimodal alignment methods and provide in-depth analysis and insights into future research directions.
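For context on the multimodal alignment methods the review surveys, a common zero-shot baseline (a CLIP-style scheme, not the paper's specific "Factor" mechanism) mean-pools per-frame visual embeddings into a video embedding and scores it against text-prompt embeddings by cosine similarity. A minimal sketch with random placeholder features standing in for real encoder outputs:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def video_text_scores(frame_feats, text_feats):
    """Score each video against each candidate action label.

    frame_feats: (num_videos, num_frames, dim) per-frame visual embeddings
    text_feats:  (num_labels, dim) embeddings of text prompts, e.g. "a video of <action>"
    Returns a (num_videos, num_labels) matrix of cosine similarities.
    """
    video = l2_normalize(frame_feats.mean(axis=1))  # temporal mean pooling over frames
    text = l2_normalize(text_feats)
    return video @ text.T

# Toy example: random arrays stand in for real frame/text encoder outputs.
rng = np.random.default_rng(0)
frames = rng.normal(size=(2, 8, 16))   # 2 videos, 8 frames, 16-dim features
labels = rng.normal(size=(3, 16))      # 3 candidate action prompts
scores = video_text_scores(frames, labels)
pred = scores.argmax(axis=1)           # predicted action index per video
```

Mean pooling discards temporal order, which is precisely the limitation that motivates the temporally-aware, motion-/event-/story-level designs discussed in the review.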

Citation

GUO Yuping, GAO Hongwei, YU Jiahui, GE Jinchao, HAN Meng, JU Zhaojie. Video action recognition meets vision-language models exploring human factors in scene interaction: a review[J]. Optoelectronics Letters, 2025, (10): 626-640.

History
  • Received: March 16, 2025
  • Revised: May 26, 2025
  • Online: September 22, 2025