Extended reality (XR) systems are about to be integrated into our daily lives and will provide support in a variety of fields such as education and coaching. Enhancing user experience demands agents that are capable of displaying realistic affective and social behaviors within these systems, and, as a prerequisite, with the capability of understanding their interaction partner and responding appropriately. Based on our literature review of recent works published in the field of co-speech gesture generation, researchers have developed complex models capable of generating gestures characterized by a high level of human-likeness and speaker appropriateness. Nevertheless, this is only true in settings where the agent has an active status (i.e., the agent acts as the speaker), or it is delivering a monologue in a non-interactive setting. However, as illustrated in multiple works and competitions like the GENEA Challenge, these models remain inadequate in generating interlocutor-aware gestures. We consider interlocutor-aware gesture generation the process of displaying gestures that take into account the conversation partner’s behavior. Moreover, in settings where the agent is the listener, generated gestures lack the level of naturalness that we expect from a face-to-face conversation. To overcome these issues, we have designed a pipeline, called TAG2G, composed of a diffusion model, which was demonstrated to be a stable and powerful tool in gesture generation, and a vector-quantized variational auto-encoder (VQVAE), widely employed to produce meaningful gesture embeddings. Refocusing from monadic to dyadic multimodal input settings (i.e., taking into account text, audio, and previous gestures of both participants of a conversation) allows us to explore and infer the complex interaction mechanisms that lie in a balanced two-sided conversation. As per our results, a multi-agent conversational input setup improves the generated gestures’ appropriateness with respect to the conversational counterparts. Conversely, when the agent is speaking, a monadic approach performs better in terms of the generated gestures’ appropriateness in relation to the speech.

TAG2G: A Diffusion-Based Approach to Interlocutor-Aware Co-Speech Gesture Generation / Favali, F.; Schmuck, V.; Villani, V.; Celiktutan, O.. - In: ELECTRONICS. - ISSN 2079-9292. - 13:17(2024), pp. 1-19. [10.3390/electronics13173364]

TAG2G: A Diffusion-Based Approach to Interlocutor-Aware Co-Speech Gesture Generation

Favali F.;Villani V.;
2024

Abstract

Extended reality (XR) systems are about to be integrated into our daily lives and will provide support in a variety of fields such as education and coaching. Enhancing user experience demands agents that are capable of displaying realistic affective and social behaviors within these systems, and, as a prerequisite, with the capability of understanding their interaction partner and responding appropriately. Based on our literature review of recent works published in the field of co-speech gesture generation, researchers have developed complex models capable of generating gestures characterized by a high level of human-likeness and speaker appropriateness. Nevertheless, this is only true in settings where the agent has an active status (i.e., the agent acts as the speaker), or it is delivering a monologue in a non-interactive setting. However, as illustrated in multiple works and competitions like the GENEA Challenge, these models remain inadequate in generating interlocutor-aware gestures. We consider interlocutor-aware gesture generation the process of displaying gestures that take into account the conversation partner’s behavior. Moreover, in settings where the agent is the listener, generated gestures lack the level of naturalness that we expect from a face-to-face conversation. To overcome these issues, we have designed a pipeline, called TAG2G, composed of a diffusion model, which was demonstrated to be a stable and powerful tool in gesture generation, and a vector-quantized variational auto-encoder (VQVAE), widely employed to produce meaningful gesture embeddings. Refocusing from monadic to dyadic multimodal input settings (i.e., taking into account text, audio, and previous gestures of both participants of a conversation) allows us to explore and infer the complex interaction mechanisms that lie in a balanced two-sided conversation. As per our results, a multi-agent conversational input setup improves the generated gestures’ appropriateness with respect to the conversational counterparts. Conversely, when the agent is speaking, a monadic approach performs better in terms of the generated gestures’ appropriateness in relation to the speech.
2024
13
17
1
19
TAG2G: A Diffusion-Based Approach to Interlocutor-Aware Co-Speech Gesture Generation / Favali, F.; Schmuck, V.; Villani, V.; Celiktutan, O.. - In: ELECTRONICS. - ISSN 2079-9292. - 13:17(2024), pp. 1-19. [10.3390/electronics13173364]
Favali, F.; Schmuck, V.; Villani, V.; Celiktutan, O.
File in questo prodotto:
File Dimensione Formato  
electronics-13-03364.pdf

Open access

Tipologia: VOR - Versione pubblicata dall'editore
Dimensione 1.46 MB
Formato Adobe PDF
1.46 MB Adobe PDF Visualizza/Apri
Pubblicazioni consigliate

Licenza Creative Commons
I metadati presenti in IRIS UNIMORE sono rilasciati con licenza Creative Commons CC0 1.0 Universal, mentre i file delle pubblicazioni sono rilasciati con licenza Attribuzione 4.0 Internazionale (CC BY 4.0), salvo diversa indicazione.
In caso di violazione di copyright, contattare Supporto Iris

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11380/1367925
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? 0
social impact