Article Abstract
SONG Nan, WU Peiwen, YANG Hongwu. Sign language to Mandarin-Tibetan bilingual emotional speech conversion integrating facial expressions[J]. Technical Acoustics, 2018, 37(4): 372-379
Sign Language to Mandarin-Tibetan Bilingual Emotional Speech Conversion Integrating Facial Expressions
Gesture-to-emotional speech conversion based on gesture recognition and facial expression recognition
Received: 2017-10-09  Revised: 2017-12-17
DOI:10.16300/j.cnki.1000-3630.2018.04.014
Keywords: gesture recognition; facial expression recognition; deep neural network; Mandarin-Tibetan bilingual emotional speech synthesis; gesture-to-speech conversion
Foundation items: Supported by the National Natural Science Foundation of China (11664036, 61263036, 61262055) and the Science and Technology Innovation Team Program of Higher Education Institutions of Gansu Province (2017C-03).
Authors:
SONG Nan, College of Physics and Electronic Engineering, Northwest Normal University, Lanzhou 730070, Gansu, China
WU Peiwen, College of Physics and Electronic Engineering, Northwest Normal University, Lanzhou 730070, Gansu, China
YANG Hongwu, College of Physics and Electronic Engineering, Northwest Normal University, Lanzhou 730070, Gansu, China, e-mail: yanghw@nwnu.edu.cn
Chinese abstract:
To address the communication barrier between deaf-mute people and hearing people, this paper proposes a method that converts sign language, together with facial expressions, into Mandarin-Tibetan bilingual emotional speech. First, a deep belief network (DBN) model is used to extract the features of gesture images, and a deep neural network (DNN) model is used to extract the expression features of facial images. Second, support vector machines (SVMs) are trained to classify the gesture features and the facial expression features separately; the recognized gestures yield the gesture text, and the recognized facial expressions yield the corresponding emotional tags. At the same time, an emotional speech synthesis system based on hidden Markov models (HMMs) is implemented with speaker adaptive training on a Mandarin emotional speech corpus. Finally, using the recognized gesture text and emotional tags, the gestures and facial expressions are converted into Mandarin or Tibetan emotional speech. Objective evaluations show that the recognition rate for static gestures is 92.8%, and the facial expression recognition rates on the extended Cohn-Kanade (CK+) database and the Japanese Female Facial Expression (JAFFE) database are 94.6% and 80.3%, respectively. Subjective evaluations show that the converted emotional speech obtains an average emotional mean opinion score of 4.0. The Pleasure-Arousal-Dominance (PAD) three-dimensional emotion model is used to evaluate the PAD values of both the facial expressions and the synthesized emotional speech; the two sets of values are highly similar, indicating that the synthesized emotional speech can express the emotions of the facial expressions.
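As a rough illustration of the recognition stage described above, the sketch below stacks restricted Boltzmann machines as a DBN-style feature extractor and classifies the resulting features with an SVM, using scikit-learn as a stand-in for the paper's actual implementation. The layer sizes, kernel, hyperparameters, toy data, and the gesture-to-text mapping are illustrative assumptions, not the configuration used in the paper.

# Minimal sketch, assuming scikit-learn: stacked RBMs approximate the
# DBN gesture-feature extractor, and an SVM classifies the resulting
# features into gesture (text) labels. All sizes are illustrative.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data: 200 flattened 32x32 gesture images scaled to [0, 1],
# with 10 hypothetical gesture classes.
rng = np.random.default_rng(0)
X = rng.random((200, 32 * 32))
y = rng.integers(0, 10, size=200)

# Two stacked RBMs act as an unsupervised DBN-style feature extractor;
# the SVM on top performs the supervised gesture classification.
model = Pipeline([
    ("rbm1", BernoulliRBM(n_components=256, learning_rate=0.05, n_iter=20, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)),
    ("svm", SVC(kernel="rbf", C=10.0)),
])
model.fit(X, y)

# Each recognized gesture class maps to a text token that feeds the
# synthesis stage (this mapping is hypothetical).
gesture_text = {0: "hello", 1: "thank you"}
print(gesture_text.get(int(model.predict(X[:1])[0]), "<unk>"))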
English abstract:
This paper proposes a method for gesture-to-emotional speech conversion integrating facial expressions, to overcome the communication barrier between hearing people and people with speech impairments. Firstly, the features of gesture images are extracted with a deep belief network (DBN) model, and the features of facial expressions are extracted with a deep neural network (DNN) model. Secondly, a set of support vector machines (SVMs) is trained to classify the gestures and facial expressions, yielding the text of the recognized gestures and the emotional tags of the recognized facial expressions. At the same time, a hidden Markov model based Mandarin-Tibetan bilingual emotional speech synthesis system is trained by speaker adaptive training on a Mandarin emotional speech corpus. Finally, Mandarin or Tibetan emotional speech is synthesized from the recognized gesture text and emotional tags. Objective tests show that the recognition rate for static gestures is 92.8%, and the facial expression recognition rate reaches 94.6% on the extended Cohn-Kanade database (CK+) and 80.3% on the Japanese Female Facial Expression (JAFFE) database. Subjective evaluation demonstrates that the synthesized emotional speech achieves an emotional mean opinion score of 4.0. The pleasure-arousal-dominance (PAD) three-dimensional emotion model is employed to evaluate the PAD values of both the facial expressions and the synthesized emotional speech. The results show that the PAD values of the facial expressions are close to those of the synthesized emotional speech, which means the synthesized emotional speech can express the emotion of the facial expressions.
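To make the PAD comparison concrete, the following sketch computes the similarity between the mean PAD ratings of a facial expression and of the corresponding synthesized speech, assuming each emotion is rated on the Pleasure, Arousal, and Dominance scales and averaged per emotion. The PAD values below are placeholders for illustration, not the paper's measured values.

# Minimal sketch of the PAD-based comparison, with placeholder ratings.
import numpy as np

# Hypothetical mean PAD ratings per emotion: (pleasure, arousal, dominance).
pad_facial = {"happy": np.array([0.60, 0.40, 0.30]), "angry": np.array([-0.50, 0.60, 0.30])}
pad_speech = {"happy": np.array([0.55, 0.45, 0.25]), "angry": np.array([-0.45, 0.55, 0.35])}

for emotion in pad_facial:
    a, b = pad_facial[emotion], pad_speech[emotion]
    # Cosine similarity approaches 1 when the two PAD vectors point the
    # same way; a small Euclidean distance means close absolute ratings.
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    print(f"{emotion}: cosine similarity = {cos:.3f}, "
          f"Euclidean distance = {np.linalg.norm(a - b):.3f}")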