Pseudo-Dynamic Preservation and Elucidation of Neural Processing of Endangered Languages Based on Natural Discourse Corpora with Physiological Indices
Project source
Principal investigator
Funded institution
Project number
Year of approval
Date of approval
Project level
Research period
Funding amount
Discipline
Discipline code
Fund category
Keywords
Participating institutions
1. Evaluation of Different Training Strategies and Recognizers in Low Resource Speech Recognition Using Wav2vec2.0
- Keywords:
- Decoding; Learning algorithms; Learning systems; Self-supervised learning; Signal encoding; Speech coding; Speech communication; Supervised learning; Automatic speech recognition; Character error rates; Learning frameworks; Learning strategy; Low resource languages; Low-resource speech recognition; Minority languages; Training strategy; Transformer; Wav2vec
- Koshikawa, Takaki; Ito, Akinori; Nose, Takashi
- 《17th International Conference on Machine Learning and Computing, ICMLC 2025》
- 2025
- February 14, 2025 - February 17, 2025
- Guangzhou, China
- Conference
Automatic Speech Recognition (ASR) is crucial for preserving minority languages, promoting inclusivity, and supporting education. The wav2vec2.0 model, pre-trained through self-supervised learning, is effective for low-resource speech recognition. This study therefore investigates different learning strategies, recognizers, and frameworks to improve ASR performance for low-resource languages. First, we compared five learning strategies for low-resource speech recognition using the wav2vec2.0 model. The Freeze-Transformer strategy, which freezes the CNN and the lower Transformer blocks, achieved the lowest Character Error Rate (CER). Next, we evaluated five types of recognizers: fully connected layers, MLP, RNN, LSTM, and GRU. The bi-GRU recognizer performed best, achieving the lowest CER. Finally, we tested an Encoder-Decoder model with wav2vec2.0 as the encoder and a Transformer decoder as the decoder. The results showed that recognition performance did not improve with this model, even with a large amount of training data. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
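As a concrete reading of the abstract's two best-performing choices, here is a minimal PyTorch sketch of a Freeze-Transformer setup with a bi-GRU recognizer head: the CNN feature encoder and the lower Transformer blocks of a wav2vec2.0 backbone are frozen, and a bidirectional GRU plus a linear layer produce CTC log-probabilities. The checkpoint name, number of frozen layers, head sizes, and vocabulary size are illustrative assumptions, not the paper's reported configuration.

```python
# Sketch of the "Freeze-Transformer" strategy with a bi-GRU recognizer.
# Assumptions: facebook/wav2vec2-base checkpoint, lower 6 of 12 layers
# frozen, 40-symbol character vocabulary (incl. CTC blank).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

NUM_FROZEN_LAYERS = 6  # assumed split between frozen and trainable blocks
VOCAB_SIZE = 40        # assumed character inventory + CTC blank

class FreezeTransformerASR(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        # Freeze the convolutional feature encoder entirely.
        for p in self.backbone.feature_extractor.parameters():
            p.requires_grad = False
        # Freeze the lower Transformer blocks; upper blocks stay trainable.
        for layer in self.backbone.encoder.layers[:NUM_FROZEN_LAYERS]:
            for p in layer.parameters():
                p.requires_grad = False
        # Bi-GRU recognizer head (the best-performing recognizer in the study).
        self.gru = nn.GRU(768, 256, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 256, VOCAB_SIZE)

    def forward(self, input_values):
        hidden = self.backbone(input_values).last_hidden_state  # (B, T, 768)
        hidden, _ = self.gru(hidden)
        return self.classifier(hidden).log_softmax(-1)

model = FreezeTransformerASR()
log_probs = model(torch.randn(1, 16000))  # one second of 16 kHz audio
print(log_probs.shape)  # (1, frames, VOCAB_SIZE); transpose to (T, B, C) for nn.CTCLoss
```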
2. Improving Speaker Consistency in Speech-to-Speech Translation Using Speaker Retention Unit-to-Mel Techniques
- Keywords:
- Semantics; Speech enhancement; Translation (languages); End to end; French-English; Semantic content; Semantic information; Speaker-specific information; Speech-to-speech translation; Synthesized speech; Voice quality; Waveforms
- Zhou, Rui; Ito, Akinori; Nose, Takashi
- 《2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2024》
- 2024
- December 3, 2024 - December 6, 2024
- Macau, China
- Conference
We propose a Speaker-Consistent Speech-to-Speech Translation (SC-S2ST) system that effectively retains speaker-specific information. The paradigm of Speech-to-Unit Translation (S2UT) followed by a unit-to-waveform vocoder has become mainstream for end-to-end S2ST systems, but because the discrete units chiefly carry semantic content, this approach often produces synthesized speech that lacks speaker-specific characteristics such as accent and individual voice quality. Existing S2UT systems with style transfer also suffer from high inference latency. To address these limitations, we introduce a Speaker-Retention Unit-to-Mel (SR-UTM) framework designed to capture and preserve speaker-specific information. In experiments on the CVSS-C and CVSS-T corpora for Spanish-English and French-English translation, our approach achieved BLEU scores of 16.10 and 21.68, comparable to the baseline S2UT system, and our SC-S2ST system excelled at preserving speaker similarity. The speaker-similarity experiments showed that our method retains speaker-specific information without significantly increasing inference time. These results confirm that our approach achieves speaker-consistent speech-to-speech translation. © 2024 IEEE.
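The abstract does not spell out the SR-UTM architecture, but the core idea — the discrete translation units supply the semantics while a speaker embedding from the source utterance conditions the mel-spectrogram decoder — can be sketched as below. All module sizes, the unit vocabulary, and the broadcast-add conditioning scheme are assumptions for illustration, not the paper's actual design.

```python
# Illustrative speaker-conditioned unit-to-mel stage: unit ids are embedded,
# a projected speaker vector is broadcast-added over time, and a Transformer
# encoder predicts mel frames for a downstream vocoder. Purely a sketch of
# the general technique; the paper's SR-UTM module may differ substantially.
import torch
import torch.nn as nn

NUM_UNITS = 1000  # assumed discrete-unit vocabulary size
SPK_DIM = 256     # assumed speaker-embedding dimensionality
N_MELS = 80       # standard mel-spectrogram channel count

class UnitToMel(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.unit_emb = nn.Embedding(NUM_UNITS, d_model)
        self.spk_proj = nn.Linear(SPK_DIM, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.mel_head = nn.Linear(d_model, N_MELS)

    def forward(self, units, spk_emb):
        # units: (B, T) discrete unit ids; spk_emb: (B, SPK_DIM)
        # Broadcast the speaker vector across all time steps.
        x = self.unit_emb(units) + self.spk_proj(spk_emb).unsqueeze(1)
        return self.mel_head(self.encoder(x))  # (B, T, N_MELS)

model = UnitToMel()
mel = model(torch.randint(0, NUM_UNITS, (1, 120)), torch.randn(1, SPK_DIM))
print(mel.shape)  # torch.Size([1, 120, 80]), ready for a neural vocoder
```

Conditioning the mel decoder on a speaker embedding, rather than running a separate style-transfer pass, is one way to preserve voice characteristics without the extra inference latency the abstract attributes to style-transfer S2UT systems.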
...
