DOI: 10.14489/vkit.2021.01.pp.044-051

Shchetinin E. Yu.
EMOTION RECOGNITION IN HUMAN SPEECH USING DEEP NEURAL NETWORKS
(pp. 44-51)

Abstract. Various approaches to recognizing human emotions from voice using deep learning methods are presented. Deep convolutional neural networks and recurrent neural networks with a bidirectional LSTM (Long Short-Term Memory) memory cell are used as the deep neural network models. An ensemble of neural networks is proposed on their basis. Computer experiments were carried out in which the constructed neural networks and popular machine learning algorithms were applied to recognize emotions in human speech from the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) database of audio recordings. The results showed that the neural network models were more effective than the machine learning algorithms. The recognition accuracy obtained with the ensemble of neural networks reached 92 % for individual emotions. Directions for further research in human emotion recognition are proposed.

Keywords: emotion recognition; deep learning; convolutional neural networks; recurrent neural networks; BLSTM model; ensemble of neural networks.
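The abstract does not say how the ensemble combines its member networks. As a minimal sketch, assuming simple weighted averaging of per-class probabilities (soft voting, which may differ from the stacking used in the paper), two models' outputs could be merged as follows. The function names, the weight parameter, and the example probabilities are hypothetical.

```python
def soft_vote(probs_a, probs_b, weight_a=0.5):
    """Blend two per-class probability vectors with a convex weight.

    Assumption: both models output probabilities over the same classes
    in the same order. This is soft voting, not the paper's method.
    """
    if len(probs_a) != len(probs_b):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    weight_b = 1.0 - weight_a
    return [weight_a * a + weight_b * b for a, b in zip(probs_a, probs_b)]


def predict_class(probs_a, probs_b, weight_a=0.5):
    """Return the index of the class with the highest blended probability."""
    blended = soft_vote(probs_a, probs_b, weight_a)
    return max(range(len(blended)), key=blended.__getitem__)


# Hypothetical outputs of a CNN and a BLSTM for one recording, three classes.
cnn_probs = [0.10, 0.60, 0.30]
blstm_probs = [0.05, 0.25, 0.70]
winner = predict_class(cnn_probs, blstm_probs)
```

With equal weights the two models disagree on their top class here, and the blended vector decides between them; a stacked ensemble would instead train a separate model on the members' outputs.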

Shchetinin E. Yu.
EMOTIONS RECOGNITION IN HUMAN SPEECH USING DEEP NEURAL NETWORKS
(pp. 44-51)

Abstract. Recognition of human emotions is one of the most relevant and rapidly developing areas of modern speech technology, and recognition of emotions in speech (RER) is its most in-demand part. In this paper, we propose a computer model of emotion recognition based on an ensemble of a bidirectional recurrent neural network with LSTM memory cells and the deep convolutional neural network ResNet18. Computer experiments are carried out on the RAVDESS database of emotional human speech. RAVDESS is a data set containing 7356 files. The recordings are labeled with the following emotions: 0 – neutral, 1 – calm, 2 – happiness, 3 – sadness, 4 – anger, 5 – fear, 6 – disgust, 7 – surprise. The speech-only subset contains 1440 samples in 16 classes (8 emotions, each divided into male and female speakers). To train machine learning algorithms and deep neural networks to recognize emotions, the audio recordings must be pre-processed to extract the features characteristic of particular emotions. This was done using Mel-frequency cepstral coefficients, chroma coefficients, and characteristics of the frequency spectrum of the recordings. Several neural network models for emotion recognition are studied on these data, and machine learning algorithms are used for comparison. The following models were trained during the experiments: logistic regression (LR), a support vector machine classifier (SVM), a decision tree (DT), a random forest (RF), gradient boosting over trees (XGBoost), a convolutional neural network (CNN, ResNet18), a recurrent neural network (RNN, BLSTM), and an ensemble of the convolutional and recurrent networks (Stacked CNN-RNN). The neural networks recognized and classified emotions with much higher accuracy than the machine learning algorithms. Of the three neural network models, the CNN + BLSTM ensemble showed the highest accuracy.

Keywords: Emotion recognition; Deep learning; Recurrent neural networks; Convolutional networks; BLSTM model; Stacked model.
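The labeling scheme in the abstract (8 emotions split by speaker gender into 16 classes) can be sketched as below. The sketch relies on the public RAVDESS file-naming convention, in which the third of seven hyphen-separated fields is the emotion code (01 to 08) and the seventh is the actor number (odd for male, even for female). The ordering of the 16 class ids and the function names are assumptions made for illustration, not taken from the paper.

```python
# Emotion indices 0..7 as listed in the abstract.
EMOTIONS = ["neutral", "calm", "happiness", "sadness",
            "anger", "fear", "disgust", "surprise"]


def parse_ravdess_name(filename):
    """Return (emotion_index, gender) from a RAVDESS file name."""
    stem = filename.rsplit("/", 1)[-1].split(".")[0]
    fields = stem.split("-")
    if len(fields) != 7:
        raise ValueError(f"unexpected RAVDESS file name: {filename!r}")
    emotion_index = int(fields[2]) - 1  # file codes 01..08 map to 0..7
    actor = int(fields[6])
    gender = "male" if actor % 2 == 1 else "female"
    return emotion_index, gender


def class_id(emotion_index, gender):
    """Map (emotion, gender) to one of 16 ids.

    Assumed layout: ids 0..7 are male, ids 8..15 are female.
    """
    if not 0 <= emotion_index < len(EMOTIONS):
        raise ValueError("emotion index out of range")
    return emotion_index + (len(EMOTIONS) if gender == "female" else 0)


def class_name(cid):
    """Human-readable label for a class id, e.g. 'female_fear'."""
    gender = "female" if cid >= len(EMOTIONS) else "male"
    return f"{gender}_{EMOTIONS[cid % len(EMOTIONS)]}"


emotion, gender = parse_ravdess_name("03-01-06-01-02-01-12.wav")
label = class_name(class_id(emotion, gender))
```

Here the emotion code 06 maps to index 5 (fear) and actor 12 is an even number, so the recording falls into the female fear class.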

About the Authors

E. Yu. Shchetinin (Financial University under the Government of the Russian Federation, Moscow, Russia)
