CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems

Paper, Dataset, Code

Interspeech 2024

TL;DR: We show that training anti-spoofing models on speech re-synthesized with neural audio codecs improves detection of deepfake speech from codec-based TTS systems. We also release a dataset for this purpose.
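The re-synthesis idea can be sketched in a few lines: each real utterance is kept as a "real" example, and every codec's encode-then-decode output for that utterance is added as a "fake" example. This is only an illustrative sketch, not the authors' pipeline. `toy_codec` is a hypothetical stand-in (a uniform quantizer, not a neural codec), and the function names and the "real"/"fake" label strings are assumptions made for this example.

```python
from typing import Callable, Dict, List, Tuple

Waveform = List[float]
Codec = Callable[[Waveform], Waveform]
Example = Tuple[str, str, str, Waveform]  # (utterance ID, source, label, audio)


def toy_codec(levels: int) -> Codec:
    """Hypothetical stand-in for a codec: quantize to `levels` steps, then decode.

    A real pipeline would call a neural codec's encoder and decoder here.
    `levels` must be at least 2.
    """
    half = (levels - 1) / 2

    def resynthesize(wav: Waveform) -> Waveform:
        out = []
        for x in wav:
            clipped = max(-1.0, min(1.0, x))      # keep samples in [-1, 1]
            index = round((clipped + 1.0) * half)  # "encode" to a discrete index
            out.append(index / half - 1.0)         # "decode" back to a sample
        return out

    return resynthesize


def build_training_set(real: Dict[str, Waveform],
                       codecs: Dict[str, Codec]) -> List[Example]:
    """Pair each real utterance with one re-synthesized copy per codec."""
    examples: List[Example] = []
    for utt_id, wav in real.items():
        examples.append((utt_id, "original", "real", wav))
        for codec_name, codec in codecs.items():
            examples.append((utt_id, codec_name, "fake", codec(wav)))
    return examples


if __name__ == "__main__":
    real = {"p225_273": [0.0, 0.25, -0.5, 0.9]}
    codecs = {"toy_8_levels": toy_codec(8), "toy_256_levels": toy_codec(256)}
    for utt_id, source, label, wav in build_training_set(real, codecs):
        print(utt_id, source, label, len(wav))
```

With N real utterances and K codecs this yields N × (K + 1) labeled examples, which an anti-spoofing model could then be trained on.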

Dataset Download

CodecFake is available on Hugging Face.

See our paper and GitHub for more details on using our dataset.

Fake and Real Speech Samples

The audio samples below were generated by our dataset creation pipeline. Ground truth (GT) samples were randomly sampled from the VCTK corpus.

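| Codec Model Used | GT File ID | GT Speech | Generated Speech |
|---|---|---|---|
| SpeechTokenizer | p225_273 | (audio) | (audio) |
| SpeechTokenizer | p295_144 | (audio) | (audio) |
| SpeechTokenizer | p318_114 | (audio) | (audio) |
| SpeechTokenizer | p343_296 | (audio) | (audio) |
| SpeechTokenizer | p351_402 | (audio) | (audio) |
| academicodec_hifi_16k_320d | p225_273 | (audio) | (audio) |
| academicodec_hifi_16k_320d | p295_144 | (audio) | (audio) |
| academicodec_hifi_16k_320d | p318_114 | (audio) | (audio) |
| academicodec_hifi_16k_320d | p343_296 | (audio) | (audio) |
| academicodec_hifi_16k_320d | p351_402 | (audio) | (audio) |
| academicodec_hifi_16k_320d_large_uni | p225_273 | (audio) | (audio) |
| academicodec_hifi_16k_320d_large_uni | p295_144 | (audio) | (audio) |
| academicodec_hifi_16k_320d_large_uni | p318_114 | (audio) | (audio) |
| academicodec_hifi_16k_320d_large_uni | p343_296 | (audio) | (audio) |
| academicodec_hifi_16k_320d_large_uni | p351_402 | (audio) | (audio) |
| academicodec_hifi_24k_320d | p225_273 | (audio) | (audio) |
| academicodec_hifi_24k_320d | p295_144 | (audio) | (audio) |
| academicodec_hifi_24k_320d | p318_114 | (audio) | (audio) |
| academicodec_hifi_24k_320d | p343_296 | (audio) | (audio) |
| academicodec_hifi_24k_320d | p351_402 | (audio) | (audio) |
| audiodec_24k_320d | p225_273 | (audio) | (audio) |
| audiodec_24k_320d | p295_144 | (audio) | (audio) |
| audiodec_24k_320d | p318_114 | (audio) | (audio) |
| audiodec_24k_320d | p343_296 | (audio) | (audio) |
| audiodec_24k_320d | p351_402 | (audio) | (audio) |
| descript-audio-codec-16khz | p225_273 | (audio) | (audio) |
| descript-audio-codec-16khz | p295_144 | (audio) | (audio) |
| descript-audio-codec-16khz | p318_114 | (audio) | (audio) |
| descript-audio-codec-16khz | p343_296 | (audio) | (audio) |
| descript-audio-codec-16khz | p351_402 | (audio) | (audio) |
| descript-audio-codec-24khz | p225_273 | (audio) | (audio) |
| descript-audio-codec-24khz | p295_144 | (audio) | (audio) |
| descript-audio-codec-24khz | p318_114 | (audio) | (audio) |
| descript-audio-codec-24khz | p343_296 | (audio) | (audio) |
| descript-audio-codec-24khz | p351_402 | (audio) | (audio) |
| descript-audio-codec-44khz | p225_273 | (audio) | (audio) |
| descript-audio-codec-44khz | p295_144 | (audio) | (audio) |
| descript-audio-codec-44khz | p318_114 | (audio) | (audio) |
| descript-audio-codec-44khz | p343_296 | (audio) | (audio) |
| descript-audio-codec-44khz | p351_402 | (audio) | (audio) |
| encodec_24khz | p225_273 | (audio) | (audio) |
| encodec_24khz | p295_144 | (audio) | (audio) |
| encodec_24khz | p318_114 | (audio) | (audio) |
| encodec_24khz | p343_296 | (audio) | (audio) |
| encodec_24khz | p351_402 | (audio) | (audio) |
| funcodec-funcodec_en_libritts-16k-gr1nq32ds320 | p225_273 | (audio) | (audio) |
| funcodec-funcodec_en_libritts-16k-gr1nq32ds320 | p295_144 | (audio) | (audio) |
| funcodec-funcodec_en_libritts-16k-gr1nq32ds320 | p318_114 | (audio) | (audio) |
| funcodec-funcodec_en_libritts-16k-gr1nq32ds320 | p343_296 | (audio) | (audio) |
| funcodec-funcodec_en_libritts-16k-gr1nq32ds320 | p351_402 | (audio) | (audio) |
| funcodec-funcodec_en_libritts-16k-gr8nq32ds320 | p225_273 | (audio) | (audio) |
| funcodec-funcodec_en_libritts-16k-gr8nq32ds320 | p295_144 | (audio) | (audio) |
| funcodec-funcodec_en_libritts-16k-gr8nq32ds320 | p318_114 | (audio) | (audio) |
| funcodec-funcodec_en_libritts-16k-gr8nq32ds320 | p343_296 | (audio) | (audio) |
| funcodec-funcodec_en_libritts-16k-gr8nq32ds320 | p351_402 | (audio) | (audio) |
| funcodec-funcodec_en_libritts-16k-nq32ds320 | p225_273 | (audio) | (audio) |
| funcodec-funcodec_en_libritts-16k-nq32ds320 | p295_144 | (audio) | (audio) |
| funcodec-funcodec_en_libritts-16k-nq32ds320 | p318_114 | (audio) | (audio) |
| funcodec-funcodec_en_libritts-16k-nq32ds320 | p343_296 | (audio) | (audio) |
| funcodec-funcodec_en_libritts-16k-nq32ds320 | p351_402 | (audio) | (audio) |
| funcodec-funcodec_en_libritts-16k-nq32ds640 | p225_273 | (audio) | (audio) |
| funcodec-funcodec_en_libritts-16k-nq32ds640 | p295_144 | (audio) | (audio) |
| funcodec-funcodec_en_libritts-16k-nq32ds640 | p318_114 | (audio) | (audio) |
| funcodec-funcodec_en_libritts-16k-nq32ds640 | p343_296 | (audio) | (audio) |
| funcodec-funcodec_en_libritts-16k-nq32ds640 | p351_402 | (audio) | (audio) |
| funcodec-funcodec_zh_en_general_16k_nq32ds320 | p225_273 | (audio) | (audio) |
| funcodec-funcodec_zh_en_general_16k_nq32ds320 | p295_144 | (audio) | (audio) |
| funcodec-funcodec_zh_en_general_16k_nq32ds320 | p318_114 | (audio) | (audio) |
| funcodec-funcodec_zh_en_general_16k_nq32ds320 | p343_296 | (audio) | (audio) |
| funcodec-funcodec_zh_en_general_16k_nq32ds320 | p351_402 | (audio) | (audio) |
| funcodec-funcodec_zh_en_general_16k_nq32ds640 | p225_273 | (audio) | (audio) |
| funcodec-funcodec_zh_en_general_16k_nq32ds640 | p295_144 | (audio) | (audio) |
| funcodec-funcodec_zh_en_general_16k_nq32ds640 | p318_114 | (audio) | (audio) |
| funcodec-funcodec_zh_en_general_16k_nq32ds640 | p343_296 | (audio) | (audio) |
| funcodec-funcodec_zh_en_general_16k_nq32ds640 | p351_402 | (audio) | (audio) |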
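
The sample listing above is a full grid: every codec configuration is paired with the same five VCTK utterances. A minimal sketch of how that grid could be enumerated follows. VCTK file IDs have the form speaker ID, underscore, utterance number. The helper names `split_vctk_id` and `sample_grid` are hypothetical, and nothing here reflects how the released dataset is actually laid out on disk.

```python
from itertools import product
from typing import Dict, List, Tuple

# Codec configurations and ground-truth file IDs as listed in the samples above.
CODECS = [
    "SpeechTokenizer",
    "academicodec_hifi_16k_320d",
    "academicodec_hifi_16k_320d_large_uni",
    "academicodec_hifi_24k_320d",
    "audiodec_24k_320d",
    "descript-audio-codec-16khz",
    "descript-audio-codec-24khz",
    "descript-audio-codec-44khz",
    "encodec_24khz",
    "funcodec-funcodec_en_libritts-16k-gr1nq32ds320",
    "funcodec-funcodec_en_libritts-16k-gr8nq32ds320",
    "funcodec-funcodec_en_libritts-16k-nq32ds320",
    "funcodec-funcodec_en_libritts-16k-nq32ds640",
    "funcodec-funcodec_zh_en_general_16k_nq32ds320",
    "funcodec-funcodec_zh_en_general_16k_nq32ds640",
]
FILE_IDS = ["p225_273", "p295_144", "p318_114", "p343_296", "p351_402"]


def split_vctk_id(file_id: str) -> Tuple[str, str]:
    """Split a VCTK file ID such as 'p225_273' into (speaker, utterance)."""
    speaker, utterance = file_id.split("_", 1)
    return speaker, utterance


def sample_grid() -> List[Dict[str, str]]:
    """Enumerate every (codec, utterance) pair shown in the sample listing."""
    rows = []
    for codec, file_id in product(CODECS, FILE_IDS):
        speaker, utterance = split_vctk_id(file_id)
        rows.append({"codec": codec, "file_id": file_id,
                     "speaker": speaker, "utterance": utterance})
    return rows


if __name__ == "__main__":
    grid = sample_grid()
    print(len(grid), "samples from", len({r["speaker"] for r in grid}), "speakers")
```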

Acknowledgement

CodecFake is built on the VCTK dataset, which is licensed under CC-BY-4.0.