Abstract: To address the challenge of detecting driver yawning, we propose a new method based on the spatiotemporal features of frame sequences, called the 3D Residual Fusion Classification Network (3D-RFCNet). The method captures the spatiotemporal information of frame sequences with 3D convolutions and incorporates an improved 3D Spatial Channel Attention Module (3D-SCAM) to optimize feature weights, along with skip connections that effectively fuse features from the shallow and deep layers of the network. The temporal-domain and frequency-domain features are then fed separately into a Transformer encoder, further strengthening the model's ability to extract temporal features. Finally, the temporal- and frequency-domain features are fused, improving the robustness of the network. Experimental comparisons with classical classification networks on the YawDD dataset validate the superiority of 3D-RFCNet for driver yawning detection. The results show that the proposed method achieves a yawning detection accuracy of 99.42%, outperforming the compared methods and demonstrating its effectiveness for driver yawning detection.
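The abstract describes fusing a temporal branch with a frequency-domain branch before classification, but does not specify how the frequency features are obtained. The toy sketch below illustrates one plausible reading under stated assumptions: a per-frame scalar signal (here a hypothetical mouth-opening ratio) is transformed with a naive DFT to produce frequency features, which are then concatenated with the temporal features as a simple late fusion. The signal values, the function names, and the use of a DFT are all illustrative assumptions, not the paper's actual implementation.

```python
import cmath

def dft_magnitudes(signal):
    """Naive DFT: magnitude spectrum of a 1-D temporal signal.
    Stands in for the paper's (unspecified) frequency-domain branch."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))) / n
            for k in range(n)]

def fuse_features(temporal, frequency):
    """Late fusion by concatenation, mirroring the abstract's fusion of
    temporal- and frequency-domain features before classification."""
    return temporal + frequency

# Hypothetical per-frame mouth-opening ratios for an 8-frame clip.
mouth_open = [0.1, 0.2, 0.6, 0.9, 0.8, 0.5, 0.2, 0.1]
freq_feats = dft_magnitudes(mouth_open)
fused = fuse_features(mouth_open, freq_feats)
print(len(fused))  # 8 temporal + 8 frequency = 16 fused features
```

In the actual model these branches would be learned feature maps processed by the Transformer encoder; the point here is only the structure of the two-branch fusion.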