Abstract: Panoramic video delivers a 360° field of view, allowing users to freely explore and perceive their visual environment. Crucially, its spatial audio provides directional sound cues that guide visual attention and significantly enhance immersive exploration. However, saliency prediction that jointly exploits the visual and auditory modalities remains largely unexplored. To this end, we propose a spatiotemporal attention-based audiovisual saliency prediction (STAV) model that effectively leverages cross-modal spatiotemporal features from both the visual and auditory streams. Specifically, we use the Video Swin Transformer to extract spatiotemporal visual features from videos and design a multi-dimensional feature enhancement module (MFEM) to balance multi-scale spatiotemporal feature representations. Furthermore, we employ SoundNet to extract multi-attribute audio features and compute an audio energy map (AEM) to localize sound sources and capture the spatial distribution of the audio. Finally, we fuse the audio and visual features and combine them with spatially encoded cues from the AEM to generate the final audiovisual saliency map. Comprehensive experiments on three panoramic audiovisual video datasets demonstrate the effectiveness of the proposed model.
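To make the data flow concrete, the following is a minimal PyTorch sketch of the pipeline the abstract outlines. It is an illustration under stated assumptions, not the paper's implementation: `visual_backbone` and `audio_encoder` are lightweight stand-ins for Video Swin Transformer and SoundNet, and the `MFEM` internals, the fusion layer, and the precomputed `aem` input are hypothetical, since the abstract does not specify them.

```python
import torch
import torch.nn as nn

class MFEM(nn.Module):
    """Hypothetical multi-dimensional feature enhancement module:
    rebalances spatiotemporal features via channel-wise attention."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),                    # global spatiotemporal context
            nn.Conv3d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):          # x: (B, C, T, H, W)
        return x * self.attn(x)    # reweight channels by learned importance

class STAVSketch(nn.Module):
    """Sketch of the STAV pipeline: visual backbone -> MFEM,
    audio encoder + AEM -> fusion -> audiovisual saliency map."""
    def __init__(self, vis_dim=96, aud_dim=128):
        super().__init__()
        # Stand-in backbones (the paper uses Video Swin Transformer / SoundNet).
        self.visual_backbone = nn.Conv3d(3, vis_dim, kernel_size=3, padding=1)
        self.audio_encoder = nn.Conv1d(1, aud_dim, kernel_size=9, stride=4)
        self.mfem = MFEM(vis_dim)
        # Fuse visual features, broadcast audio features, and the AEM channel.
        self.fusion = nn.Conv3d(vis_dim + aud_dim + 1, 1, kernel_size=1)

    def forward(self, video, waveform, aem):
        # video: (B, 3, T, H, W); waveform: (B, 1, L); aem: (B, 1, T, H, W)
        v = self.mfem(self.visual_backbone(video))
        a = self.audio_encoder(waveform).mean(dim=-1)          # (B, aud_dim)
        a = a[:, :, None, None, None].expand(-1, -1, *v.shape[2:])
        fused = torch.cat([v, a, aem], dim=1)  # AEM injects spatial audio cues
        return torch.sigmoid(self.fusion(fused))               # saliency in [0, 1]

# Example: dummy inputs (equirectangular frames, mono waveform, AEM grid).
model = STAVSketch()
sal = model(torch.randn(1, 3, 8, 64, 128),
            torch.randn(1, 1, 16000),
            torch.rand(1, 1, 8, 64, 128))
print(sal.shape)  # torch.Size([1, 1, 8, 64, 128])
```

Concatenating the AEM as an extra channel is one plausible reading of "combine them with spatially encoded cues from the AEM"; the paper may instead realize this step with attention or gating.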