Towards More Expressive Human-Robot Interactions: Combining Latent Representations and Diffusion Models for Co-speech Gesture Generation / Favali, Filippo; Schmuck, Viktor; Villani, Valeria; Celiktutan, Oya. - 35:(2025), pp. 30-44. [10.1007/978-3-031-81688-8_3]
Towards More Expressive Human-Robot Interactions: Combining Latent Representations and Diffusion Models for Co-speech Gesture Generation
Favali, Filippo; Villani, Valeria
2025
Abstract
A wide array of new real-world applications of social robots and virtual agents is bringing humans into closer contact with these artificial systems. Consequently, nonverbal human-robot interaction has become a major research focus, with the aim of making exchanges more versatile and natural. In this work, we use a diffusion model to generate fine-grained, highly natural motions, coupled with a latent gesture representation obtained via a Vector Quantized Variational Auto-Encoder (VQVAE) architecture. This approach addresses the well-known limitations of diffusion models in training and inference time. As a result, we achieve up to a 5-fold increase in generation speed. In addition, a subjective evaluation shows that, despite the use of discrete gesture representations, the quality of the generated nonverbal behavior is preserved.
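The abstract does not describe how the discrete gesture representation is computed, so the following is only a minimal sketch of the generic vector-quantization step found in VQ-VAE-style models: each continuous latent vector is replaced by its nearest codebook entry. The codebook values, the latent dimensionality, and the names `squared_distance` and `quantize` are illustrative assumptions and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    latents: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[Vector]]:
    """Map each latent vector to the index and value of its nearest code."""
    indices: List[int] = []
    for z in latents:
        # Distance from this latent to every codebook entry.
        distances = [squared_distance(z, code) for code in codebook]
        # Index of the closest entry (the first one wins on ties).
        indices.append(distances.index(min(distances)))
    quantized = [codebook[k] for k in indices]
    return indices, quantized


# Toy data (hypothetical): three 2-D codes and four encoder outputs.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
latents = [(0.1, -0.2), (0.9, 1.2), (-0.8, 1.7), (0.7, 0.4)]

indices, quantized = quantize(latents, codebook)
print(indices)  # [0, 1, 2, 1]
```

In a setup like this, a sequence of gesture frames is summarized by a short list of code indices, which is one plausible reason a generative model operating on such a representation could be faster than one operating on raw motion data.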
The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.
In case of copyright infringement, contact IRIS Support.