In audiovisual emotion recognition, representation learning is a research direction receiving considerable attention, and its key challenge lies in constructing effective affective representations that capture both cross-modal consistency and modality-specific variability. However, accurately learning such representations remains difficult. To this end, this paper proposes a cross-modal audiovisual emotion recognition model based on a multi-head cross-attention mechanism. The model performs feature fusion and modality alignment through a multi-head cross-attention architecture and adopts a segmented training strategy to cope with missing modalities. In addition, a unimodal auxiliary loss task is designed and parameters are shared across modalities in order to preserve the independent information of each modality. The model achieved macro and micro F1 scores of 84.5% and 88.2%, respectively, on the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D). The proposed model effectively captures intra- and inter-modal feature representations of the audio and video modalities and unifies unimodal and multimodal emotion recognition within a single framework, providing a new approach to audiovisual emotion recognition.
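The cross-attention fusion described above can be illustrated with a minimal sketch: queries come from one modality and keys/values from the other, so each audio step attends over all video steps. This is a simplified toy illustration in pure Python, not the paper's implementation; the per-head projection matrices of true multi-head attention are omitted, and all names and inputs below are illustrative assumptions.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matmul(A, B):
    # plain list-of-lists matrix product
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def cross_attention(Q, K, V):
    # Q from one modality (e.g. audio frames), K/V from the other
    # (e.g. video frames): scaled dot-product attention across modalities
    d = len(Q[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])  # Q @ K^T
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_cross_attention(Q, K, V, heads=2):
    # split the feature dimension into equal slices, attend per head,
    # then concatenate the head outputs (projections omitted for brevity)
    d = len(Q[0]) // heads
    out = [[] for _ in Q]
    for h in range(heads):
        sl = slice(h * d, (h + 1) * d)
        head = cross_attention([q[sl] for q in Q],
                               [k[sl] for k in K],
                               [v[sl] for v in V])
        for row, o in zip(head, out):
            o.extend(row)
    return out
```

In the symmetric variant used by many audiovisual models, the same block is applied twice with the roles of the two modalities swapped, and the two attended streams are then concatenated or summed before classification.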
Fatigue driving is one of the leading causes of traffic accidents, posing a significant threat to drivers and road safety. Most existing methods rely on whole-brain multi-channel electroencephalogram (EEG) signals, which involve a large number of channels, complex data processing, and cumbersome wearable devices. To address this issue, this paper proposes a fatigue detection method based on frontal EEG signals and constructs a fatigue-driving detection model using an asymptotic hierarchical fusion network. The model employs a hierarchical fusion strategy that integrates an attention module into a multi-level convolutional module. By combining cross-attention and self-attention mechanisms, it effectively fuses the hierarchical semantic features of power spectral density (PSD) and differential entropy (DE), enhancing the learning of feature dependencies and interactions. Experiments were conducted on the public SEED-VIG dataset, where the proposed model achieved an accuracy of 89.80% using only four frontal EEG channels. Comparisons with existing methods demonstrate that the model achieves high accuracy and superior practicality, providing valuable technical support for fatigue-driving monitoring and prevention.
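The DE feature mentioned above has a simple closed form for EEG: a band-pass-filtered segment is commonly modeled as Gaussian, in which case its differential entropy reduces to 0.5·ln(2πeσ²), where σ² is the segment variance (which also equals the mean band power underlying the PSD feature). The sketch below illustrates this relationship; the function names and the toy segment are illustrative assumptions, not from the paper.

```python
import math

def band_power(samples):
    # mean power (variance) of a band-pass-filtered EEG segment;
    # for a Gaussian signal this is sigma^2
    n = len(samples)
    mu = sum(samples) / n
    return sum((x - mu) ** 2 for x in samples) / n

def differential_entropy(samples):
    # DE of an approximately Gaussian EEG segment:
    # 0.5 * ln(2 * pi * e * sigma^2)
    return 0.5 * math.log(2 * math.pi * math.e * band_power(samples))
```

In practice both features are computed per channel and per frequency band (e.g. delta through gamma), yielding the two parallel feature streams that the hierarchical fusion network then combines.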