ASRU 2025 Special Session:
Responsible Speech and Audio Generative AI

Hi there 😊, we are planning to organize this special session at ASRU 2025, in beautiful Honolulu, Hawaii! 😊

Abstract

Background

The rapid progress of speech and audio generative AI [1,2] brings exciting new opportunities but also makes questions of responsible use more pressing than ever. Recent years have seen remarkable advances in synthetic speech, voice cloning, singing voice synthesis, music generation, and sound effect generation, profoundly changing domains [3] such as media production, accessibility, and human-computer interaction. At the same time, this progress raises critical concerns about responsibility [4], including bias, misuse, accountability, and the lack of transparency in how generative models operate. This session aims to bring together the speech and audio research community to address these emerging challenges in responsible generative technology.

We welcome submissions that tackle critical issues in neural watermarking [5-7], controllability [8], fairness [9], explainability [10,11], and measurement [12,13]. Precise control is vital for nuanced speech generation; fairness is essential to prevent harmful biases; explainability builds trust; and robust measurement methodologies are indispensable for evaluating system quality. Each of these areas demands rigorous attention to ensure the ethical and beneficial deployment of speech, music, and audio generative technologies. The session will highlight technical innovations, evaluation methodologies, and frameworks that contribute to safer, fairer, and more trustworthy generative AI systems. It aims to foster broader discussions around standards, best practices, and accountability in real-world deployments of speech and audio generative technologies. Ultimately, this session seeks to establish collaborative pathways and guidelines that will shape the future of responsible speech and audio generative AI.
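To make the watermarking theme concrete, here is a minimal, purely illustrative sketch of additive spread-spectrum audio watermarking: a key-seeded pseudo-random carrier is added to the waveform at low amplitude, and detection correlates the signal with the same carrier. All function names and parameters below are our own toy choices, not taken from any cited system; real neural watermarks [5-7] are learned end-to-end and must survive codecs, resampling, and deliberate tampering.

```python
import zlib
import numpy as np

def _carrier(key: str, n: int) -> np.ndarray:
    # Key-seeded pseudo-random carrier; only the key holder can regenerate it.
    rng = np.random.default_rng(zlib.crc32(key.encode()))
    return rng.standard_normal(n)

def embed_watermark(audio: np.ndarray, key: str, strength: float = 0.01) -> np.ndarray:
    """Add a low-amplitude spread-spectrum carrier to the waveform."""
    return audio + strength * _carrier(key, len(audio))

def detect_watermark(audio: np.ndarray, key: str, threshold: float = 0.05):
    """Normalized correlation with the key's carrier.

    For an unmarked signal the correlation concentrates around zero with
    standard deviation ~1/sqrt(N), so a score well above that chance level
    indicates the watermark is present.
    """
    c = _carrier(key, len(audio))
    score = float(np.dot(audio, c) / (np.linalg.norm(audio) * np.linalg.norm(c)))
    return score > threshold, score
```

For example, embedding into one second of a 440 Hz tone at 16 kHz yields a detection score roughly an order of magnitude above the chance level, while the clean signal or a wrong key stays near zero. The robustness and imperceptibility trade-offs that real systems face are exactly what this session's first topic area targets.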

Objectives

This session seeks to advance state-of-the-art research on responsible speech and audio generation, focusing on accountability, controllability, fairness, and interpretability. It aims to bridge communities working on generative modeling and responsible AI while promoting collaboration across machine learning, ethics, human-computer interaction, and speech technology. In doing so, the session will support the development of tools, benchmarks, and evaluation protocols, fostering transparency, robustness, and inclusiveness in generative systems for speech, singing voice, music, and general audio. By encouraging dialogue around risks, governance, and the societal impact of speech and audio generation technologies, this session hopes to lay the groundwork for a more responsible future in this fast-evolving field.

Call for Papers

Submission Instructions

Please follow the official ASRU 2025 Author Instructions for paper formatting and submission. When submitting, be sure to select our special session “SS1. Responsible Speech and Audio Generative AI” as the primary subject area to ensure your paper is considered for inclusion.

Tip: This session will focus on Responsible Speech and Audio Generative AI, including accountability, controllability, fairness, robustness of generative models, and corresponding evaluation methods. Please note that another ASRU 2025 special session, “Frontiers in Deepfake Voice Detection and Beyond”, will focus more specifically on deepfake voice detection and related techniques.

Important Dates

All deadlines are Anywhere on Earth (AoE), unless otherwise specified.

  • Paper submissions open: March 28, 2025 (We welcome your ASRU submissions!)
  • Paper submissions due: May 28, 2025
  • Paper revision due: June 4, 2025
  • Acceptance notification: August 6, 2025
  • Camera-ready deadline: August 13, 2025
  • Workshop date: [TBA]

Topics of Interest

This special session invites work on responsible speech and audio generative AI, spanning text-to-speech (TTS), voice conversion, singing voice synthesis, music and sound-effect generation, and emerging speech/audio large language models, to advance safety, fairness, and transparency.

We welcome submissions on a broad range of topics, including but not limited to the following:

  1. Preventing Misuse and Ensuring Accountability in Speech and Audio Generation
    • Robustness of generative speech and audio models against adversarial or malicious inputs
    • Robust watermarking techniques to resist degradation and detect intentional tampering
    • Evaluation frameworks that simulate degradation and manipulation in real-world applications
  2. Ensuring Fairness and Inclusion in Speech and Audio Generation
    • Bias detection, mitigation, and fairness-aware evaluation across demographic, linguistic, and content-based variation
    • Generalization and transfer learning across language, accent, musical genre, and acoustic domains
    • Inclusive dataset design across diverse speakers, singing styles, musical genres, and acoustic conditions
  3. Enabling Controllability and Transparency in Speech and Audio Generation
    • Techniques for user-aligned controllable synthesis across speech, singing, music, and general audio domains
    • Approaches for human-in-the-loop steering or refining generated speech or audio
    • Techniques for explainable speech or audio generation
  4. Evaluating and Benchmarking for Trustworthy Speech and Audio Generation
    • Objective and subjective evaluation of synthesis quality, consistency, and reliability across diverse conditions
    • Fair and transparent evaluation across speech, singing, music, and other audio generation tasks
    • Mitigating harmful, biased, or misleading content through ethical and risk management strategies
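As a small worked example for topic (4), distributional metrics such as Fréchet audio distance [13] compare generated and reference audio via embeddings from a pretrained audio model, fitting a Gaussian to each set. The sketch below is a simplification we wrote for illustration: it assumes the embeddings are already computed and uses diagonal covariances, whereas full FAD uses full covariance matrices and a matrix square root.

```python
import numpy as np

def frechet_distance_diag(x: np.ndarray, y: np.ndarray) -> float:
    """Fréchet distance between two sets of embeddings (rows = examples),
    under a Gaussian assumption with diagonal covariance.

    With diagonal covariances the general formula
        ||mu_x - mu_y||^2 + Tr(S_x + S_y - 2 (S_x S_y)^{1/2})
    reduces to a sum of per-dimension mean and standard-deviation gaps.
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    var_x, var_y = x.var(axis=0), y.var(axis=0)
    return float(np.sum((mu_x - mu_y) ** 2)
                 + np.sum((np.sqrt(var_x) - np.sqrt(var_y)) ** 2))
```

Identical embedding sets give a distance of zero, and shifting every 8-dimensional embedding by 1.0 gives a distance of about 8, matching the mean-gap term of the formula. Benchmarks in this topic area ask how well such scores track human judgments across speech, singing, and music.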

We invite researchers and practitioners to contribute papers that explore the intersection of generative models, responsible AI, and speech/audio technologies. Let’s build a trustworthy and inclusive generative speech future—together!

If you have any questions, feel free to contact us at: respsa-genai@googlegroups.com

Organizers

Affiliations: Johns Hopkins University, National Taiwan University, NVIDIA, University of Edinburgh, Pindrop, University of Southampton, Microsoft

(The above organizers are sorted in alphabetical order by last name.)


Don’t miss it — see you in beautiful Honolulu, Hawaii! 😊

References

  1. X. Tan, et al. "A survey on neural speech synthesis." arXiv preprint arXiv:2106.15561 (2021).
  2. Y. Su, et al. "Audio-Language Models for Audio-Centric Tasks: A survey." arXiv preprint arXiv:2501.15177 (2025).
  3. G. Tolomei, et al. "Prompt-to-OS: revolutionizing operating systems & human-computer interaction with integrated AI generative models." in CogMI 2023.
  4. W. Hutiri, et al. "Not my voice! A taxonomy of ethical and safety harms of speech generators." in ACM FAccT 2024.
  5. W. Ge, et al. "Proactive Detection of Speaker Identity Manipulation with Neural Watermarking." in 1st Workshop on GenAI Watermarking.
  6. L. Juvela, et al. "Audio codec augmentation for robust collaborative watermarking of speech synthesis." in ICASSP, 2025.
  7. Y. Wen, et al. "SoK: How Robust is Audio Watermarking in Generative AI models?" arXiv preprint arXiv:2503.19176 (2025).
  8. T. Xie, et al. "Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey." arXiv preprint arXiv:2412.06602 (2024).
  9. A. Leschanowsky, et al. "Examining the interplay between privacy and fairness for speech processing: A review and perspective." in Interspeech, 2024.
  10. J. Schneider. "Explainable Generative AI (GenXAI): A survey, conceptualization, and research agenda." Artificial Intelligence Review 57.11 (2024): 289.
  11. T. Saeki, et al. "SpeechBERTScore: Reference-aware automatic evaluation of speech generation leveraging NLP evaluation metrics." in Interspeech, 2024.
  12. E. Cooper, et al. "A review on subjective and objective evaluation of synthetic speech." Acoustical Science and Technology (2024).
  13. Y. Li, et al. "Rethinking emotion bias in music via Fréchet audio distance." in IEEE ICME (2025).

Acknowledgment

We would like to thank Dr. Lin Zhang, Prof. Xin Wang and Prof. Junichi Yamagishi for their valuable comments on neural watermarking.