Article Abstract
XU Huanan, ZHOU Xiaoyan, JIANG Wan, LI Dapeng. Speech emotion recognition algorithm based on 3D and 1D multi-feature fusion[J]. 声学技术 (Technical Acoustics), 2021, 40(4): 496-502.
Speech emotion recognition algorithm based on 3D and 1D multi-feature fusion
Received: 2020-03-31  Revised: 2020-07-02
DOI:10.16300/j.cnki.1000-3630.2021.04.009
Keywords: speech emotion recognition; bilinear convolutional neural network (BCNN); long short-term memory (LSTM); attention mechanism; multi-feature fusion
Funding: National Natural Science Foundation of China (61902064, 81971282); Fundamental Research Funds for the Central Universities (2242018K3DN01).
Author  Affiliation  E-mail
XU Huanan  School of Electronic and Information Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, Jiangsu, China
ZHOU Xiaoyan  School of Electronic and Information Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, Jiangsu, China  18326167806@163.com
JIANG Wan  School of Electronic and Information Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, Jiangsu, China
LI Dapeng  School of Electronic and Information Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, Jiangsu, China
Abstract:
      To address the problems of single feature extraction and low classification accuracy in speech emotion recognition, a 3D and 1D multi-feature fusion method is proposed that improves the feature extraction algorithm. In the 3D network, both spatial feature learning and temporal dependency modeling are considered: a bilinear convolutional neural network (BCNN) extracts spatial features, while a long short-term memory (LSTM) network with an attention mechanism extracts salient time-dependent features. To reduce the influence of speaker differences, the log-Mel (Log-Mel) features of the speech signal and their first-order and second-order difference features are computed to form a 3D Log-Mel feature set. The 1D network uses a framework of one-dimensional convolution and LSTM. Finally, the 3D and 1D features are fused to obtain highly discriminative emotion features, which are classified with a softmax function. On the IEMOCAP and EMO-DB databases, the average recognition rates are 61.22% and 85.69%, respectively, and the multi-feature fusion algorithm shows better recognition performance than the 3D-only and 1D-only algorithms that extract a single type of feature.
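The 3D Log-Mel feature set described above (log-Mel spectrogram stacked with its first- and second-order differences) can be sketched roughly as follows with librosa. The sampling rate, 40 mel bands, and 25 ms / 10 ms framing are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch: build the 3-channel (static / delta / delta-delta) Log-Mel
# feature set described in the abstract. Parameter values are assumptions.
import numpy as np
import librosa

def log_mel_3d(wav_path, sr=16000, n_mels=40, win_ms=25, hop_ms=10):
    """Return an array of shape (3, n_frames, n_mels): log-Mel, delta, delta-delta."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=n_mels,
        n_fft=int(sr * win_ms / 1000), hop_length=int(sr * hop_ms / 1000))
    log_mel = librosa.power_to_db(mel)               # (n_mels, n_frames)
    delta1 = librosa.feature.delta(log_mel, order=1)  # first-order difference
    delta2 = librosa.feature.delta(log_mel, order=2)  # second-order difference
    # Stack as three "channels", time-major, so a 2D CNN can treat it like an image.
    return np.stack([log_mel.T, delta1.T, delta2.T], axis=0)
```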
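The dual-branch fusion itself might look roughly like the PyTorch sketch below. The layer sizes, the simplified per-frame bilinear pooling standing in for the BCNN, the attention pooling over LSTM outputs, and the concatenation-based fusion are all assumptions made for illustration; the paper's exact architecture is not reproduced here.

```python
# Illustrative sketch of a two-branch (3D + 1D) fusion model; all sizes are assumed.
import torch
import torch.nn as nn

class AttentiveLSTM(nn.Module):
    """LSTM followed by learned attention pooling over time."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hid_dim, batch_first=True)
        self.attn = nn.Linear(hid_dim, 1)

    def forward(self, x):                       # x: (B, T, in_dim)
        h, _ = self.lstm(x)                      # (B, T, hid)
        w = torch.softmax(self.attn(h), dim=1)   # attention weights over time
        return (w * h).sum(dim=1)                # (B, hid)

class FusionSER(nn.Module):
    def __init__(self, n_classes=4, n_mels=40):
        super().__init__()
        # 3D branch: CNN over (3, T, n_mels) + per-frame bilinear pooling + LSTM-attention
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 2)),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 2)))
        self.branch3d = AttentiveLSTM(64 * 64, 128)   # bilinear feature per frame
        # 1D branch: one-dimensional convolution over the 1D input + LSTM-attention
        self.conv1d = nn.Sequential(
            nn.Conv1d(1, 32, 9, stride=4, padding=4), nn.ReLU(),
            nn.Conv1d(32, 64, 9, stride=4, padding=4), nn.ReLU())
        self.branch1d = AttentiveLSTM(64, 128)
        self.classifier = nn.Linear(128 + 128, n_classes)

    def forward(self, x3d, x1d):
        # x3d: (B, 3, T, n_mels) Log-Mel feature set; x1d: (B, 1, L) 1D feature sequence
        f = self.cnn(x3d)                                # (B, C, T, M)
        B, C, T, M = f.shape
        f = f.permute(0, 2, 1, 3)                        # (B, T, C, M)
        # Simplified bilinear pooling per frame: channel-channel outer product
        bil = torch.einsum('btcm,btdm->btcd', f, f).reshape(B, T, C * C) / M
        feat3d = self.branch3d(bil)                      # (B, 128)
        s = self.conv1d(x1d).transpose(1, 2)             # (B, L', 64)
        feat1d = self.branch1d(s)                        # (B, 128)
        fused = torch.cat([feat3d, feat1d], dim=1)       # multi-feature fusion
        return torch.softmax(self.classifier(fused), dim=1)  # softmax emotion classification
```

For a batch of 3D Log-Mel inputs of shape (B, 3, T, 40) and 1D inputs of shape (B, 1, L), this sketch returns softmax probabilities over the emotion classes, mirroring the fusion-then-softmax pipeline summarized in the abstract.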