End-to-End Listening Agent for Audiovisual Emotional and Naturalistic Interactions

Kevin El Haddad, Yara Rizk, Louise Heron, Nadine Hajj, Yong Zhao, Jaebok Kim, Trung Ngô Trọng, Minha Lee, Marwan Doumit, Payton Lin, Yelin Kim, Hüseyin Çakmak

 
   

Abstract


In this work, we established the foundations of a framework for building an end-to-end naturalistic, expressive listening agent. The project was split into modules for recognizing the user's paralinguistic and nonverbal expressions, predicting the agent's reactions, synthesizing the agent's expressions, and recording data of nonverbal conversational expressions. First, a multimodal, multitask deep learning-based emotion classification system was built, along with a rule-based visual expression detection system. Then, several sequence prediction systems for nonverbal expressions were implemented and compared. In addition, an audiovisual concatenation-based synthesis system was implemented. Finally, a naturalistic, dyadic emotional conversation database was collected. We report here the work done for each of these modules and our planned future improvements.


Keywords


Listening Agent; Smile; Laughter; Head Movement; Eyebrow Movement; Speech Emotion Recognition; Nonverbal Expression Detection; Sequence-to-Sequence Prediction Systems; Multimodal Synthesis; Nonverbal Expression Synthesis; Emotion Database; Dyadic Conversation






DOI: http://dx.doi.org/10.7559/citarj.v10i2.424





Journal of Science and Technology of the Arts
Revista de Ciência e Tecnologia das Artes
ISSN: 1646-9798
e-ISSN: 2183-0088
Portuguese Catholic University | Porto

    

This scientific journal is funded by National Funds through FCT – Fundação para a Ciência e a Tecnologia

