Abstract: Continuous Sign Language Recognition (CSLR) helps deaf people communicate actively with hearing people by recognizing their sign language as gloss sequences. Enhancing the generalization ability of CSLR visual feature extractors is a worthwhile research direction. In this work, we model gloss as prior knowledge to facilitate the learning of more generalizable visual features, and we present a gloss-guided visual-gloss alignment network (GVAN). Specifically, we extract gloss representations using a pretrained graph-based model. We design a cross-modality graph alignment (CMGA) mechanism that maps video and gloss text features into a heterogeneous graph composed of visual and semantic nodes, enabling effective cross-modality feature alignment. Additionally, we introduce a cross-modality alignment constraint to optimize video-text matching and ensure global semantic consistency. Experimental results on both German and Chinese sign language benchmark datasets demonstrate that the proposed GVAN achieves competitive performance, and ablation studies further validate the effectiveness of its key components.
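To make the CMGA and alignment-constraint ideas concrete, the following is a minimal PyTorch sketch of one plausible realization, not the authors' implementation: the module name, feature dimensions, attention-style message passing over visual/semantic nodes, and the cosine-based global matching loss are all illustrative assumptions.

```python
# A minimal sketch of cross-modality graph alignment, assuming PyTorch.
# All names, dimensions, and the specific message-passing scheme are
# assumptions for illustration, not the GVAN/CMGA implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalityGraphAlignment(nn.Module):
    """Treats frame features as visual nodes and gloss embeddings as
    semantic nodes of one heterogeneous graph, then passes messages
    across soft visual-semantic edges."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.v_proj = nn.Linear(dim, dim)  # project visual nodes
        self.g_proj = nn.Linear(dim, dim)  # project semantic (gloss) nodes
        self.scale = dim ** -0.5

    def forward(self, visual: torch.Tensor, gloss: torch.Tensor):
        # visual: (T, dim) frame-level features; gloss: (L, dim) gloss embeddings
        v, g = self.v_proj(visual), self.g_proj(gloss)
        # Soft heterogeneous edges: affinity between every visual node
        # and every semantic node.
        affinity = v @ g.t() * self.scale                      # (T, L)
        # Message passing in both directions across the two node types.
        v_out = visual + F.softmax(affinity, dim=-1) @ g       # gloss -> visual
        g_out = gloss + F.softmax(affinity.t(), dim=-1) @ v    # visual -> gloss
        return v_out, g_out, affinity


def alignment_loss(visual: torch.Tensor, gloss: torch.Tensor) -> torch.Tensor:
    """One plausible form of a global video-text matching constraint:
    pull the pooled video representation toward the pooled gloss
    representation via cosine similarity."""
    v_global = F.normalize(visual.mean(dim=0), dim=-1)
    g_global = F.normalize(gloss.mean(dim=0), dim=-1)
    return 1.0 - torch.dot(v_global, g_global)


if __name__ == "__main__":
    cmga = CrossModalityGraphAlignment(dim=512)
    frames = torch.randn(120, 512)   # e.g., 120 video frames
    glosses = torch.randn(14, 512)   # e.g., 14 gloss tokens
    v_out, g_out, _ = cmga(frames, glosses)
    print(v_out.shape, g_out.shape, alignment_loss(v_out, g_out).item())
```

In this sketch, the affinity matrix plays the role of cross-modality edge weights, so alignment is learned jointly with feature extraction rather than imposed afterward; the actual GVAN design may differ in graph construction, node features, and loss formulation.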