Abstract: Emotion recognition plays a critical role in AI applications such as multimedia retrieval, advertising, and mental health monitoring. Traditional methods often rely on unimodal data, whereas multimodal approaches offer more robust performance. This paper introduces Enhanced EmotionClip (EE-CLIP), a multimodal visual emotion recognition model that integrates visual and textual features through Context-Aware Attention (CAA) and Image-Text Feature Fusion (ITFF) mechanisms. We further propose an emotion-guided contrastive learning framework to align visual and textual representations more effectively. Evaluated on the MELD dataset, EE-CLIP outperforms existing methods, achieving higher F1-scores. Through UMAP visualizations and confusion-matrix analysis, we examine relationships among emotion categories and show how EE-CLIP improves emotion understanding. An ablation study further confirms the contribution of CAA and ITFF to the improved performance.
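The abstract refers to an emotion-guided contrastive objective for aligning visual and textual features; the exact formulation is defined later in the paper, but as a rough illustration, a label-aware image-text contrastive loss could be sketched as below. The function name, temperature value, and the choice of treating same-emotion image-text pairs as positives are assumptions for illustration only, not the paper's definitive method.

```python
import torch
import torch.nn.functional as F

def emotion_guided_contrastive_loss(img_emb, txt_emb, emotion_labels, temperature=0.07):
    """Illustrative sketch of a label-aware (emotion-guided) image-text contrastive loss.

    img_emb, txt_emb: (B, D) embeddings from the visual and textual encoders.
    emotion_labels:   (B,) integer emotion class per sample.
    Image-text pairs sharing an emotion label are treated as positives, so the
    alignment is guided by emotion rather than instance identity alone (assumption).
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    logits = img_emb @ txt_emb.t() / temperature                              # (B, B) similarities
    positives = (emotion_labels.unsqueeze(0) == emotion_labels.unsqueeze(1)).float()  # (B, B) mask

    # Image-to-text direction: average log-probability over emotion-consistent texts.
    log_prob_i2t = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    loss_i2t = -(log_prob_i2t * positives).sum(1) / positives.sum(1).clamp(min=1)

    # Text-to-image direction (the positive mask is symmetric).
    log_prob_t2i = logits.t() - torch.logsumexp(logits.t(), dim=1, keepdim=True)
    loss_t2i = -(log_prob_t2i * positives).sum(1) / positives.sum(1).clamp(min=1)

    return (loss_i2t.mean() + loss_t2i.mean()) / 2
```

In this sketch, the loss reduces to the standard CLIP-style InfoNCE objective when every sample in the batch carries a distinct emotion label, and otherwise pulls together all image-text pairs that share an emotion, which is one plausible reading of "emotion-guided" alignment.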