Deepfake Detection for a Trustworthy Speech Communication

Funding Source

Japan Society for the Promotion of Science (JSPS)

Principal Investigator

MAWALIM Candy Olivia

Host Institution

北陸先端科学技術大学院大学 (Japan Advanced Institute of Science and Technology, JAIST)

Grant Number

25K21245

Fiscal Year Awarded

2025

Award Date

Not disclosed

Project Level

National

Research Period

Unknown / Unknown

Funding Amount

JPY 4,810,000

Research Field

Human interface and interaction

Field Code

Not disclosed

Grant Category

Grant-in-Aid for Early-Career Scientists (若手研究)

Keywords

deepfake; spoof attacks; speaker verification; multilingual

Participants

Not disclosed

Participating Institution

北陸先端科学技術大学院大学 (JAIST), Graduate School of Advanced Science and Technology

Project Abstract (Outline of Research at the Start): Advancements in deep learning have enabled the creation of highly realistic synthetic audio (deepfakes), posing threats to voice privacy and security. This research aims to address the limitations of existing deepfake detection research by analyzing the physiological and acoustic characteristics of the human speech production mechanism that distinguish natural speech from deepfakes. Robust deepfake detection methods will be developed that are capable of handling diverse linguistic data, providing clear explanations for detection outcomes, and adapting to evolving deepfake attacks.

  • 1. Privacy-aware speaker trait and multimodal features relationship analysis in job interviews.

    • Keywords:
    • Human-computer interaction; Privacy protection; Speaker traits; Voice anonymization
    • Mawalim, Candy Olivia; Leong, Chee Wee; Okada, Shogo
    • 《Scientific Reports》
    • 2026
    • Journal

    As the use of speech data for applications like emotion detection and health profiling grows, so do the privacy risks associated with voice recordings that can reveal sensitive speaker traits. This study investigates voice anonymization methods designed to protect speaker identity while maintaining essential speech characteristics for accurate trait inference, specifically within the context of job interviews. Our experiments show that while anonymization alters several acoustic parameters, the anonymized speech from signal processing-based methods remains suitable for overall trait assessment, with performance comparable to original speech. The phase vocoder-based method, in particular, offers modest privacy gains with an acceptable trade-off in utility, especially in scenarios with minimal attack vectors. In contrast, a neural audio codec-based method altered prosodic features critical for speaker trait estimation, slightly reducing performance in this specific task. Despite this, when carefully configured, this method provides greater privacy and generally preserves utility for speech recognition and quality assessment, even under semi-informed attack scenarios. © 2026. The Author(s).
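
    A minimal sketch of the phase-vocoder-based direction described above, assuming librosa and soundfile are available: pitch-shift a recording by a few semitones to obscure speaker identity while keeping the content intelligible for downstream trait assessment. The function name, file paths, and shift amount are illustrative, not the study's actual configuration.

        # Hypothetical phase-vocoder-style anonymization via pitch shifting.
        # librosa's pitch_shift uses a phase vocoder (time stretch + resample).
        import librosa
        import soundfile as sf

        def anonymize_pitch(in_path, out_path, n_steps=-3.0):
            """Shift pitch by n_steps semitones to mask speaker identity."""
            y, sr = librosa.load(in_path, sr=None, mono=True)
            y_anon = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
            sf.write(out_path, y_anon, sr)

        anonymize_pitch("interview.wav", "interview_anon.wav")  # hypothetical files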

  • 2. Robust Multilingual Audio Deepfake Detection Through Hybrid Modeling

    • Keywords:
    • Acoustic noise; Artificial intelligence; Audio acoustics; Human computer interaction; Learning systems; Linguistics; Speech communication; Speech recognition; Dataset; Deepfake; Detection system; Generated voice; Human voice; Hybrid model; Linguistic environment; Multilingual; Robust detection; Synthesis techniques
    • Mawalim, Candy Olivia; Wang, Yutong; Adila, Aulia; Okada, Shogo; Unoki, Masashi
    • 《13th ACM Workshop on Information Hiding and Multimedia Security, IH&MMSec 2025》
    • 2025
    • June 18, 2025 - June 20, 2025
    • San Jose, CA, United States
    • Conference

    The increasing sophistication of AI-generated human voice poses a significant threat, demanding robust detection systems that can generalize effectively across diverse linguistic environments and synthesis techniques. In response to the SAFE Challenge, this paper introduces a novel approach to multilingual audio deepfake detection. Our primary contribution lies in the comprehensive study of deepfake detection using a multilingual speech corpus encompassing 17 languages and a broad spectrum of synthesis methods and acoustic conditions, designed to enable more realistic and challenging evaluations. To optimally utilize this diverse data, we propose a hybrid detection model that synergistically combines the strengths of end-to-end RawNet and AASIST architectures with language-agnostic representations learned from a multilingual self-supervised learning model. Additionally, we explore the efficacy of RawBoost data augmentation in enhancing robustness against real-world noise. Our experimental evaluation demonstrates promising generalization in generated audio detection, achieving approximately 73% balanced accuracy across multilingual data and unseen synthesis algorithms. © 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
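
    A hedged sketch of the hybrid fusion idea in PyTorch: a small strided CNN over the raw waveform stands in for the end-to-end RawNet/AASIST branch, and a linear projection stands in for pooled embeddings from a multilingual self-supervised model (e.g., XLS-R), assumed to be precomputed. Layer sizes, names, and the fusion head are illustrative, not the submitted system.

        import torch
        import torch.nn as nn

        class HybridDetector(nn.Module):
            """Toy stand-in for the hybrid raw-waveform + SSL-embedding model."""
            def __init__(self, ssl_dim=1024, hidden=128):
                super().__init__()
                # Raw-waveform branch: strided 1-D convolutions over samples.
                self.raw = nn.Sequential(
                    nn.Conv1d(1, 32, kernel_size=1024, stride=512), nn.ReLU(),
                    nn.Conv1d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
                    nn.AdaptiveAvgPool1d(1), nn.Flatten())          # -> (B, 64)
                # SSL branch: project pooled self-supervised embeddings.
                self.ssl = nn.Sequential(nn.Linear(ssl_dim, hidden), nn.ReLU())
                # Fusion head: concatenate both views, emit one spoof logit.
                self.head = nn.Sequential(nn.Linear(64 + hidden, hidden),
                                          nn.ReLU(), nn.Linear(hidden, 1))

            def forward(self, wave, ssl_emb):
                # wave: (B, num_samples); ssl_emb: (B, ssl_dim)
                z = torch.cat([self.raw(wave.unsqueeze(1)), self.ssl(ssl_emb)], -1)
                return self.head(z).squeeze(-1)  # one logit per utterance

        model = HybridDetector()
        logit = model(torch.randn(2, 64000), torch.randn(2, 1024))  # 4 s @ 16 kHz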

  • 3. Modeling Multi-Level Hearing Loss for Speech Intelligibility Prediction

    • Keywords:
    • Acoustic noise; Audio signal processing; Forecasting; Gears; Hearing aids; Regression analysis; Speech communication; Speech intelligibility; Frequency resolutions; Frequency sensitivity; Hearing loss; Intelligibility predictions; Multilevels; Normalized cross-correlation; Perceptual consequences; Prediction methods; Spectrotemporal modulations; Temporal resolution
    • Zhou, Xiajie; Mawalim, Candy Olivia; Unoki, Masashi
    • 《2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2025》
    • 2025
    • October 12, 2025 - October 15, 2025
    • Tahoe City, CA, United States
    • Conference

    The diverse perceptual consequences of hearing loss severely impede speech communication, but standard clinical audiometry, which is focused on threshold-based frequency sensitivity, does not adequately capture deficits in frequency and temporal resolution. To address this limitation, we propose a speech intelligibility prediction method that explicitly simulates auditory degradations according to hearing loss severity by broadening cochlear filters and applying low-pass modulation filtering to temporal envelopes. Speech signals are subsequently analyzed using spectro-temporal modulation (STM) representations, which reflect how auditory resolution loss alters the underlying modulation structure. In addition, normalized cross-correlation (NCC) matrices quantify the similarity between the STM representations of clean speech and speech in noise. These auditory-informed features are utilized to train a Vision Transformer-based regression model that integrates the STM maps and NCC embeddings to estimate speech intelligibility scores. Evaluations on the Clarity Prediction Challenge corpus show that the proposed method outperforms the Hearing-Aid Speech Perception Index v2 (HASPI v2) in both mild and moderate-to-severe hearing loss groups, with a relative root-mean-squared-error reduction of 16.5% for the mild group and 6.1% for the moderate-to-severe group. These results highlight the importance of explicitly modeling listener-specific frequency and temporal resolution degradations to improve speech intelligibility prediction and provide interpretability regarding auditory distortions. © 2025 IEEE.
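
    A simplified sketch of the similarity feature described above, assuming librosa: the STM representation is approximated as the 2-D modulation spectrum of a log-mel spectrogram, and clean-vs-noisy similarity is scored with a normalized cross-correlation (NCC). The paper's hearing-loss front-end (broadened cochlear filters, modulation low-pass filtering) and the Vision Transformer regressor are omitted here.

        import numpy as np
        import librosa

        def stm(y, sr):
            """Approximate STM: 2-D modulation spectrum of a log-mel spectrogram."""
            mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
            return np.abs(np.fft.fft2(np.log(mel + 1e-8)))

        def ncc(a, b):
            """Normalized cross-correlation between two same-shaped STM maps."""
            a = (a - a.mean()) / (a.std() + 1e-8)
            b = (b - b.mean()) / (b.std() + 1e-8)
            return float((a * b).mean())

        # Usage idea (assumes equal-length clean/noisy signals at rate sr):
        # score = ncc(stm(clean, sr), stm(noisy, sr)); a higher score should
        # track better predicted intelligibility for a given listener.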

  • 4. Phoneme-Specific Challenges to Intelligibility in Hearing Impairment Under Noisy Condition

    • Keywords:
    • Errors; Forecasting; Hearing aids; Speech communication; Speech intelligibility; Speech recognition; Acoustic environment; Auditory sensitivity; Error patterns; Hearing impairments; Hearing loss; High frequency HF; International Phonetic Alphabet; Noisy conditions; Phonemes recognition; Word error rate
    • Junia, Denawati; Mawalim, Candy Olivia
    • 《17th Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2025》
    • 2025
    • October 22, 2025 - October 24, 2025
    • Singapore, Singapore
    • Conference

    Hearing impairment significantly reduces speech intelligibility, particularly in noisy acoustic environments, due to impaired auditory sensitivity and phoneme recognition. This study investigates whether speech intelligibility can be accurately predicted by integrating phoneme-level error patterns in noisy conditions. Using the Clarity Prediction Challenge dataset, phoneme-level errors were computed based on International Phonetic Alphabet transcriptions and quantified by word error rate (WER). Our key findings reveal that high-frequency fricatives and affricates (such as /ʒ/ and /dʒ/), along with voiced phonemes (such as /g/), showed the highest average WER, indicating their particular vulnerability to masking and to the effects of high-frequency hearing loss. While higher SNR generally improves intelligibility, we observed only a weak correlation (ρ ≈ 0.20), underscoring the critical role of individual hearing loss profiles. To further our analysis, we used the five most challenging phonemes and SNR as features to predict speech intelligibility with Random Forest and XGBoost models. This approach yielded slightly better prediction performance compared to the Hearing Aid Speech Perception Index. © 2025 IEEE.
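
    A sketch of the final prediction step under stated assumptions: per-utterance error rates for the five most challenging phonemes, plus SNR, feed a Random Forest regressor (scikit-learn). The arrays below are random placeholders standing in for features derived from the Clarity data, so the printed score is meaningless except as a smoke test.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        n = 200
        X = np.column_stack([
            rng.uniform(0.0, 1.0, (n, 5)),  # WER of the 5 hardest phonemes
            rng.uniform(-6.0, 12.0, n),     # per-utterance SNR in dB
        ])
        y = rng.uniform(0.0, 100.0, n)      # intelligibility (% words correct)

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        model = RandomForestRegressor(n_estimators=300, random_state=0)
        model.fit(X_tr, y_tr)
        print("held-out R^2:", model.score(X_te, y_te))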

  • 5. Study on Signal Processing Techniques in Protecting Voice Personae Against Speech Synthesis Systems

    • Keywords:
    • Audio signal processing; Speech communication; Speech recognition; Artificial reverberation; Identifiability; Perceptual quality; Signal processing technique; Speech detection; Speech quality; Speech signals; Speech synthesis system; Synthesized speech
    • Li, Nopparut; Mawalim, Candy Olivia; Unoki, Masashi
    • 《17th Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2025》
    • 2025
    • October 22, 2025 - October 24, 2025
    • Singapore, Singapore
    • Conference

    Recent advancements in speech synthesis have enabled the generation of natural-sounding speech signals that can closely mimic specific speakers, raising serious concerns about the misuse of voice recordings for impersonation and fraud. Although spoof speech detection has been studied extensively, such approaches can only be applied after the spoofed content has been generated. We propose a method based on F0 component elimination and compare it with conventional filtering and artificial reverberation in terms of their impact on the quality of synthesized speech signals generated with the TorToiSe TTS model, as well as the perceptual quality of the modified speech signals. Results show that the proposed method reduces the identifiability of synthesized speech signals with minimal impact on speech quality, offering a promising direction for protecting voice personae against speech synthesis systems. © 2025 IEEE.
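
    A hedged illustration of the F0-elimination idea using the WORLD vocoder (pyworld): decompose speech into F0, spectral envelope, and aperiodicity, zero out the F0 contour, and resynthesize, so released audio carries a degraded pitch identity for a voice-cloning TTS to copy. This shows one plausible mechanism of F0 component removal, not the paper's exact procedure; file names are hypothetical.

        import numpy as np
        import pyworld
        import soundfile as sf

        x, fs = sf.read("voice_sample.wav")            # hypothetical mono input
        x = np.ascontiguousarray(x, dtype=np.float64)  # WORLD expects float64

        f0, t = pyworld.harvest(x, fs)         # F0 contour estimation
        sp = pyworld.cheaptrick(x, f0, t, fs)  # spectral envelope
        ap = pyworld.d4c(x, f0, t, fs)         # aperiodicity

        f0_removed = np.zeros_like(f0)         # eliminate the F0 component
        y = pyworld.synthesize(f0_removed, sp, ap, fs)
        sf.write("voice_protected.wav", y, fs)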
