Speech Recognition and Dialogue Systems for Robots Participating in Multi-Party Conversations

Funding source

Japan Society for the Promotion of Science (JSPS)

Principal investigator

河原 達也 (Tatsuya Kawahara)

Host institution

Kyoto University

Fiscal year

2025

Start date

Not disclosed

Project number

25H01142

Project level

National

Research period

Unknown / Unknown

Grant amount

JPY 45,630,000

Discipline

Human informatics and related fields

Discipline code

Not disclosed

Grant category

Grant-in-Aid for Scientific Research (A)

Keywords

Speech recognition; spoken dialogue; multi-party conversation

Co-investigators

井上昂治 (Koji Inoue); 井本桂右 (Keisuke Imoto); 熊田孝恒 (Takatsune Kumada); 吉井和佳 (Kazuyoshi Yoshii)

Participating institutions

Kyoto University, Graduate School of Informatics; Kyoto University, Graduate School of Engineering

Project abstract (Outline of Research at the Start): Conventional speech recognition and spoken dialogue systems fundamentally assume a single user, i.e., one speaker talking to the system. In contrast, this project aims at participation in conversations involving multiple people, through modeling and system implementation on both the speech recognition and spoken dialogue fronts. Specifically, we address: (1) speech separation and active-speaker detection (who is speaking), (2) turn-taking recognition (who speaks next), (3) generation of listener responses (how to react when not holding the floor), and (4) dialogue generation based on recognition of emotion and conversational atmosphere. The project addresses the question of whether robots and AI can acquire basic communication skills and social competence in multi-party settings.

  • 1.Joint Analysis of Acoustic Scenes and Sound Events Based on Semi-supervised Training of Sound Events With Partial Labels

    • Keywords:
    • Audio acoustics;Classification (of information);Cost effectiveness;Learning algorithms;Learning systems;Semi-supervised learning;Acoustic scene classification;Detection performance;Event-based;Joint analysis;Labour-intensive;Partial label;Scene classification;Semi-supervised trainings;Sound event detection;Sound events
    • Imoto, Keisuke
    • 《APSIPA Transactions on Signal and Information Processing》
    • 2025
    • Vol. 14
    • No. 1
    • Journal

    Annotating time boundaries of sound events is labor-intensive, limiting the scalability of strongly supervised learning in audio detection. To reduce annotation costs, weakly-supervised learning with only clip-level labels has been widely adopted. As an alternative, partial label learning offers a cost-effective approach, where a set of possible labels is provided instead of exact weak annotations. However, partial label learning for audio analysis remains largely unexplored. Motivated by the observation that acoustic scenes provide contextual information for constructing a set of possible sound events, we utilize acoustic scene information to construct partial labels of sound events. On the basis of this idea, in this paper, we propose a multitask learning framework that jointly performs acoustic scene classification and sound event detection with partial labels of sound events. While reducing annotation costs, weakly-supervised and partial label learning often suffer from decreased detection performance due to lacking the precise event set and their temporal annotations. To better balance between annotation cost and detection performance, we also explore a semi-supervised framework that leverages both strong and partial labels. Moreover, to refine partial labels and achieve better model training, we propose a label refinement method based on self-distillation for the proposed approach with partial labels. © 2025 K. Imoto.
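    The scene-conditioned partial labeling idea above can be illustrated with a toy sketch; the scene-to-event taxonomy, class set, and loss form below are illustrative assumptions, not the paper's actual objective:

```python
import numpy as np

def partial_label_loss(probs, candidate_mask):
    """Negative log of the probability mass assigned to the candidate set.

    probs: (n_events,) predicted event probabilities (softmax output).
    candidate_mask: (n_events,) 1 for events plausible in the scene, else 0.
    """
    p_candidates = float(np.sum(probs * candidate_mask))
    return -np.log(p_candidates + 1e-12)

# Hypothetical taxonomy: each acoustic scene licenses a subset of sound events.
SCENE_EVENTS = {
    "office": [1, 1, 0, 0],   # e.g. {keyboard, speech} possible
    "street": [0, 1, 1, 1],   # e.g. {speech, car, horn} possible
}

probs = np.array([0.6, 0.3, 0.05, 0.05])   # model output over 4 event classes
loss_office = partial_label_loss(probs, np.array(SCENE_EVENTS["office"]))
loss_street = partial_label_loss(probs, np.array(SCENE_EVENTS["street"]))
# A prediction concentrated on office-compatible events is penalized less
# when the clip-level scene label is "office" than when it is "street".
```

    The point of the toy loss is that no event-level annotation is needed: only the scene label (which implies a candidate set) supervises the event predictor.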

  • 2.Efficient Transformer-Based Piano Transcription with Sparse Attention Mechanisms

    • Keywords:
    • Audio acoustics;Audio systems;Computational efficiency;Computer music;Decoding;Electronic musical instruments;Signal encoding;Signaling;Attention mechanisms;Computationally efficient;Encoders and decoders;Long-term dependencies;Music signals;Musical pieces;Performance;Quadratic complexity;Sequence models;Sliding Window
    • Wei, Weixing;Yoshii, Kazuyoshi
    • 《17th Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2025》
    • 2025
    • October 22, 2025 - October 24, 2025
    • Singapore, Singapore
    • Conference

    This paper investigates automatic piano transcription based on computationally efficient yet high-performing variants of the Transformer that can capture longer-term dependencies over the whole musical piece. Recently, Transformer-based sequence-to-sequence models have demonstrated excellent performance in piano transcription. These models, however, fail to deal with the whole piece at once due to the quadratic complexity of the self-attention mechanism, so music signals are typically processed in a sliding-window manner in practice. To overcome this limitation, we propose an efficient architecture with sparse attention mechanisms. Specifically, we introduce sliding-window self-attention mechanisms for both the encoder and decoder, and a hybrid global-local cross-attention mechanism that attends to various spans according to the MIDI token types. We also use a hierarchical pooling strategy between the encoder and decoder to further reduce the computational load. Our experiments on the MAESTRO dataset showed that the proposed model achieved a significant reduction in computational cost and memory usage, accelerating inference, while maintaining transcription performance comparable to the full-attention baseline. This allows for training with longer audio contexts on the same hardware, demonstrating the viability of sparse attention for building efficient and high-performance piano transcription systems. The code is available at https://github.com/WX-Wei/efficient-seq2seq-piano-trans. © 2025 IEEE.
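    The sliding-window (banded) self-attention that replaces full attention can be sketched minimally as follows; the window size, dimensions, and single-head formulation are illustrative, not the paper's exact architecture:

```python
import numpy as np

def banded_attention(q, k, v, window):
    """Self-attention where position i attends only to positions |i - j| <= window."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                 # (n, n) scaled dot-product scores
    i, j = np.indices((n, n))
    scores[np.abs(i - j) > window] = -np.inf      # mask everything outside the band
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d, w = 8, 4, 2
q, k, v = rng.normal(size=(3, n, d))
out, attn = banded_attention(q, k, v, w)
# Each token attends to at most 2*w + 1 neighbours: attention weights vanish
# outside the band, so cost per token is O(w) instead of O(n).
```

    A sparse implementation would only compute the in-band scores; the dense mask above just makes the attention pattern explicit.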

  • 3.TAPA-ICL: Taxonomy-Aware Prompt Augmentation for in-Context Learning in Music Understanding

    • Keywords:
    • Computer music;Context learning;Feature profiles;Features extraction;Human-readable;In contexts;Language model;Learning approach;Learning frameworks;Music understanding;Semantic reasoning
    • Zhao, Jiahao;Li, Yunjia;Yoshii, Kazuyoshi
    • 《17th Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2025》
    • 2025
    • October 22, 2025 - October 24, 2025
    • Singapore, Singapore
    • Conference

    This paper presents TAPA-ICL, a novel in-context learning (ICL) framework for few/zero-shot symbolic music understanding tasks that generates human-readable analysis. Conventional ICL approaches mainly rely on the ability of large language models (LLMs) to infer patterns and tasks from contextual examples. However, while LLMs have shown a basic capability for understanding symbolic music, how to enhance that capability via additional context remains unexplored. To tackle this issue, we focus on two major challenges: (1) the sparsity of score input, addressed through taxonomy-aware prompt augmentation (TAPA), which distills label-to-feature profiles and reduces context length by over 10 times; and (2) the complexity of musical semantic reasoning, addressed via structured chain-of-thought (CoT) prompts enforcing feature extraction, context-aware analysis, and decision-making. By combining the TAPA strategy and the CoT reasoning prompts, our method enables effective few/zero-shot adaptation across emotion recognition, composer identification, and genre classification tasks. Experimental results show that our TAPA-ICL method significantly outperforms conventional few-shot ICL baselines (including those based on ultra-large and mixture-of-experts (MoE) models) on each downstream task and achieves only slightly weaker performance than existing many-shot approaches. © 2025 IEEE.
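    How a taxonomy-aware, chain-of-thought prompt might be assembled can be sketched as below; the label profiles and prompt wording are invented for illustration and are not the paper's distilled profiles:

```python
# Toy label-to-feature profiles standing in for the distilled taxonomy
# (contents are illustrative assumptions, not taken from the paper).
PROFILES = {
    "baroque":  "dense counterpoint, harpsichord-like figuration, terraced dynamics",
    "romantic": "rubato, wide dynamic range, chromatic harmony",
}

COT_STEPS = (
    "1. Feature extraction: list salient rhythmic/harmonic traits of the piece.\n"
    "2. Context-aware analysis: compare the traits against each label profile.\n"
    "3. Decision: output the single best-matching label."
)

def build_prompt(score_excerpt: str) -> str:
    """Assemble a taxonomy-augmented chain-of-thought prompt for an LLM."""
    profile_block = "\n".join(f"- {label}: {feats}" for label, feats in PROFILES.items())
    return (
        "Label profiles:\n" + profile_block + "\n\n"
        "Reason step by step:\n" + COT_STEPS + "\n\n"
        "Piece (ABC notation):\n" + score_excerpt
    )

prompt = build_prompt("X:1\nK:Dm\nDEFG A2 d2 | ...")
```

    Sending compact profiles instead of many full scores is what shrinks the context: the prompt carries a few lines per label rather than a notated example per shot.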

  • 4.Physically Informed Spatial Regularization for Sound Event Localization and Detection

    • Keywords:
    • Acoustic Modeling;Acoustic noise measurement;Architectural acoustics;Audio acoustics;Deep learning;Signal processing;Stochastic models;Acoustic environment;Detection models;Direction detections;Event localizations;Events detection;Localization modeling;Multi-channel signal processing;Sound events;Source directions;Spatial regularizations
    • Liu, Haocheng;Di Carlo, Diego;Arie Nugraha, Aditya;Yoshii, Kazuyoshi;Richard, Gaël;Fontaine, Mathieu
    • 《2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2025》
    • 2025
    • October 12, 2025 - October 15, 2025
    • Tahoe City, CA, United states
    • Conference

    Building Sound Event Localization and Detection (SELD) models that are robust to diverse acoustic environments remains one of the major challenges in multichannel signal processing, as reflections and reverberation can significantly confuse both the source direction and event detection. Introducing priors such as microphone geometry or room impulse responses (RIRs) into the model has proven effective in addressing this issue. Existing methods typically incorporate such priors in a deterministic way, often through data augmentation to enlarge data diversity. However, the uncertainty arising from the complex nature of audio acoustics remains largely underexplored in the SELD literature and naturally calls for stochastic modeling of the acoustic prior. In this paper, we propose regularizing deep-learning-based SELD models with a physically constructed spatial covariance matrix (SCM) based on the estimated direction of arrival (DOA) and sound event detection (SED). © 2025 IEEE.
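    The idea of a physically constructed SCM can be sketched as follows: for a plane wave from an estimated DOA, the implied SCM is the rank-1 outer product of the steering vector, and a simple Frobenius penalty stands in for the regularization term. The array geometry, frequency, and penalty form are illustrative assumptions:

```python
import numpy as np

def steering_vector(mic_pos, doa, freq, c=343.0):
    """Far-field plane-wave steering vector for a given DOA unit vector."""
    delays = mic_pos @ doa / c                       # per-mic arrival delay (s)
    return np.exp(-2j * np.pi * freq * delays)       # (n_mics,) complex gains

def physical_scm(mic_pos, doa, freq):
    """Rank-1 spatial covariance matrix implied by a plane wave from `doa`."""
    d = steering_vector(mic_pos, doa, freq)
    return np.outer(d, d.conj())

def scm_regularizer(scm_est, scm_phys):
    """Frobenius distance used as an (illustrative) spatial regularization term."""
    return float(np.linalg.norm(scm_est - scm_phys) ** 2)

# 4-mic square array, source arriving from the +x direction, 1 kHz.
mics = np.array([[0.0, 0.0, 0.0], [0.05, 0.0, 0.0],
                 [0.0, 0.05, 0.0], [0.05, 0.05, 0.0]])
doa = np.array([1.0, 0.0, 0.0])
R = physical_scm(mics, doa, 1000.0)
# R is Hermitian and rank-1 with a unit-modulus diagonal.
```

    In training, a penalty like `scm_regularizer` would pull the SCM implied by the network's DOA/SED estimates toward this physics-based template.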

  • 5.Narrativity-Aware Video Summarization Based on Vision and Language Foundation Models

    • Keywords:
    • Abstracting;Computational linguistics;Computer vision;Natural language processing systems;Network embeddings;Text processing;Video analysis;Video recording;Visual languages;Embeddings;Foundation models;Language model;Narrativity;Neural-networks;Numerical performance;Textual description;Video summarization;Visual feature;Visual salience
    • Saito, Shumpei;Ueda, Hiroyuki;Ito, Yosuke;Yoshii, Kazuyoshi
    • 《17th Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2025》
    • 2025
    • October 22, 2025 - October 24, 2025
    • Singapore, Singapore
    • Conference

    This paper presents a novel video summarization approach that prioritizes the narrative quality of the summarized video to enhance its enjoyment and appeal. While most video summarization studies focus on extracting salient scenes using low-level visual features, they often neglect the storytelling aspect in order to optimize numerical performance on standard benchmarks. To address this, we propose a multifaceted video summarization method that leverages vision and language foundation models to assess shot-level importance (e.g., over 2-sec intervals) based on both visual salience and textual narrativity. Specifically, our method employs a vision-language model (VLM) to generate objective captions for individual shots. These shot-wise textual descriptions are then fed into a large language model (LLM) with a prompt designed to produce a semantically coherent text summary with strong narrativity. The narrativity-aware text embeddings obtained by the LLM, combined with visual embeddings from a vision foundation model, are processed by a recurrent neural network (RNN) to predict importance scores. The LLM and RNN are jointly fine-tuned to align with existing benchmarks. Experiments on the SumMe benchmark demonstrated the effectiveness of our multifaceted approach, highlighting significant performance improvements and the potential of text-domain video summarization. © 2025 IEEE.

  • 6.SHAMaNS: Sound Localization with Hybrid Alpha-Stable Spatial Measure and Neural Steerer

    • Keywords:
    • Acoustic noise measurement;Deep neural networks;Direction of arrival;Gaussian noise (electronic);Microphone array;Spatial variables measurement;Alpha stable;Localization technique;Neural-networks;Physic-informed deep learning;Sound localization;Sound source localization;Spatial measures;Stable model;Steering vector;Α-stable theory
    • Carlo, Diego Di;Fontaine, Mathieu;Nugraha, Aditya Arie;Bando, Yoshiaki;Yoshii, Kazuyoshi
    • 《33rd European Signal Processing Conference, EUSIPCO 2025》
    • 2025
    • September 8, 2025 - September 12, 2025
    • Palermo, Italy
    • Conference

    This paper describes a sound source localization (SSL) technique that combines an α-stable model for the observed signal with a neural network-based approach for modeling steering vectors. Specifically, a physics-informed neural network, referred to as Neural Steerer, is used to interpolate measured steering vectors (SVs) on a fixed microphone array. This allows for a more robust estimation of the so-called α-stable spatial measure, which represents the most plausible direction of arrival (DOA) of a target signal. As an α-stable model for the non-Gaussian case (α ∈ (0, 2)) theoretically defines a unique spatial measure, we choose to leverage it to account for residual reconstruction error of the Neural Steerer in the downstream tasks. The objective scores indicate that our proposed technique outperforms state-of-the-art methods in the case of multiple sound sources. © 2025 European Signal Processing Conference, EUSIPCO. All rights reserved.
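    A much-simplified relative of this localization pipeline is a steered-response-power scan, with known steering vectors standing in for both the α-stable spatial measure and the Neural Steerer interpolation (array geometry, grid, and signal model below are illustrative assumptions):

```python
import numpy as np

def srp_scan(X, steering_vecs):
    """Steered response power d^H R d over candidate DOAs.

    X: (n_mics, n_frames) observations at one frequency bin.
    steering_vecs: (n_doas, n_mics) candidate steering vectors (in the paper
    these would come from the Neural Steerer; here they are simply given).
    """
    R = X @ X.conj().T / X.shape[1]                       # empirical SCM
    return np.einsum("dm,mn,dn->d", steering_vecs.conj(), R, steering_vecs).real

# 4-element half-wavelength ULA; candidate grid of 37 angles in [0, pi].
m = np.arange(4)
grid = np.linspace(0.0, np.pi, 37)
cands = np.exp(-1j * np.pi * np.outer(np.cos(grid), m))   # (37, 4)

rng = np.random.default_rng(1)
true_idx = 18                                             # broadside (90 deg)
s = rng.normal(size=200) + 1j * rng.normal(size=200)      # source signal
noise = 0.01 * (rng.normal(size=(4, 200)) + 1j * rng.normal(size=(4, 200)))
X = np.outer(cands[true_idx], s) + noise
est = int(np.argmax(srp_scan(X, cands)))                  # recovers true_idx
```

    The paper replaces this quadratic (Gaussian-like) spatial measure with an α-stable one, which is better suited to impulsive, non-Gaussian residuals.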

  • 7.Visually-Informed Multichannel Sound Source Separation Based on 3D Gaussian Primitives

    • Keywords:
    • Acoustic generators;Audio acoustics;Audio signal processing;Covariance matrix;Factorization;Gaussian distribution;Iterative methods;Microphones;Three dimensional computer graphics;3D information;Audio-visual;Distributed microphones;Gaussians;Microphone arrays;Multi channel;Multichannel sounds;Nonnegative matrix factorization;Sound source separation;Spatial covariance matrix
    • Asano, Haruaki;Nihei, Ryunosuke;Bando, Yoshiaki;Nugraha, Aditya Arie;Di Carlo, Diego;Ueda, Hiroyuki;Ito, Yosuke;Yoshii, Kazuyoshi
    • 《17th Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2025》
    • 2025
    • October 22, 2025 - October 24, 2025
    • Singapore, Singapore
    • Conference

    This paper proposes visually-informed sound source separation for audio-visual understanding of indoor scenes captured by distributed microphone arrays and cameras. Our approach leverages the 3D information of sound-emitting objects, reconstructed via 3D Gaussian splatting (3DGS), to overcome a limitation of modern blind source separation methods like multichannel nonnegative matrix factorization (MNMF). While adaptable and potentially performant, the iterative optimization of MNMF often converges to poor local minima due to the highly-expressive full-rank spatial covariance matrices (SCMs) of sources. Our key idea is to treat the set of 3D Gaussians representing a sizable sound source object as a collection of sub-sources that share an audio signal but have unique emission weights, both of which are to be estimated jointly from an observed mixture. To enforce this structure, we guide MNMF by regularizing the SCM of each source object at each frequency. Specifically, we use a prior that centers the SCM estimate around a weighted sum of theoretical SCMs, which are analytically derived from the 3D Gaussian positions. Experiments with simulated data, featuring two 3D human models, demonstrated the effectiveness of the proposed method. To our knowledge, this is the first work to use 3D Gaussians as a common primitive for joint audio-visual analysis. © 2025 IEEE.
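    The weighted sum of theoretical SCMs derived from 3D Gaussian positions can be sketched as follows; the phase-only propagation model, array geometry, Gaussian centers, and weights are illustrative assumptions, not the paper's exact derivation:

```python
import numpy as np

def point_scm(mic_pos, src_pos, freq, c=343.0):
    """Rank-1 SCM implied by a point source at src_pos (phase-only model)."""
    delays = np.linalg.norm(mic_pos - src_pos, axis=1) / c
    d = np.exp(-2j * np.pi * freq * delays)
    return np.outer(d, d.conj())

def prior_scm(mic_pos, gaussian_centers, weights, freq):
    """Weighted sum of theoretical SCMs, one per 3D Gaussian sub-source."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * point_scm(mic_pos, g, freq)
               for wi, g in zip(w, gaussian_centers))

mics = np.array([[0.0, 0.0, 0.0], [0.05, 0.0, 0.0],
                 [0.0, 0.05, 0.0], [0.0, 0.0, 0.05]])
# Three Gaussian centers approximating one sizable sound-emitting object.
centers = [np.array([1.0, 0.2, 0.0]),
           np.array([1.0, 0.0, 0.1]),
           np.array([1.05, 0.1, 0.05])]
R0 = prior_scm(mics, centers, [0.5, 0.3, 0.2], freq=1000.0)
# R0 is Hermitian PSD and, unlike a single rank-1 SCM, can have rank > 1,
# capturing the spatial extent of the object.
```

    Centering the MNMF prior on such a mixture of rank-1 SCMs is what ties the estimated spatial model to the visually reconstructed geometry.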

  • 8.A Multifaceted Multi-Agent Framework for Zero-Shot Emotion Analysis and Recognition of Symbolic Music

    • Keywords:
    • Computer music;Emotion recognition;Human computer interaction;Human engineering;Intelligent agents;Knowledge engineering;Knowledge management;Knowledge transfer;Large datasets;Psychology computing;Affective computing;Language model;Large language model;Multiagent framework;Music emotion recognition;Music emotions;Shot classification;Zero-shot classification
    • Zhao, Jiahao;Li, Yunjia;Yoshii, Kazuyoshi
    • 《27th International Conference on Multimodal Interaction, ICMI 2025》
    • 2025
    • October 13, 2025 - October 17, 2025
    • Canberra, ACT, Australia
    • Conference

    This paper presents the first attempt at zero-shot music emotion recognition (MER) to map musical pieces, represented in symbolic formats (e.g., ABC notation), onto the valence-arousal space. Conventional MER approaches typically train an end-to-end deep neural network (DNN). However, the performance of such supervised methods is limited due to the multifaceted and ambiguous nature of music emotions, compounded by the scarcity of MER datasets. To address this, we leverage knowledge transfer from large language models (LLMs) pre-trained on vast text and symbolic data. We hypothesize that LLMs possess capabilities in low-level music description and high-level emotion reasoning (not necessarily in a musical context). Accordingly, we propose a multi-agent framework that performs zero-shot MER by associating objective musical attributes (harmony, melody, rhythm, and structure) with subjective attributes (valence and arousal). Our system employs a hierarchical architecture comprising (i) musical element descriptors, (ii) chain-of-thought emotion analysts, and (iii) comprehensive predictors. Knowledge injection and zero-shot prompting are utilized to mitigate inherent model biases. Evaluations on the EMOPIA dataset demonstrate that our system, built on the Gemini-2.0-Flash backbone, significantly outperforms baseline LLM models, including ultra-large models and mixture-of-experts (MoE) systems, and performs comparably to fully supervised or fine-tuned models. © 2025 Copyright held by the owner/author(s).

  • 9.Joint Separation and Tracking of Moving Sources with Distributed Microphone Arrays Based on Time-Varying Inertial Spatial Models

    • Keywords:
    • Covariance matrix;Hierarchical systems;Location;Markov processes;Matrix factorization;Maximum likelihood estimation;Microphone array;Distributed microphones;Localisation;Microphone arrays;Moving sound source;Moving source;Multi channel;Sources location;Spatial covariance matrix;Spatial modelling;Time varying
    • Nihei, Ryunosuke;Bando, Yoshiaki;Nugraha, Aditya Arie;Di Carlo, Diego;Ueda, Hiroyuki;Ito, Yosuke;Yoshii, Kazuyoshi
    • 《17th Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2025》
    • 2025
    • October 22, 2025 - October 24, 2025
    • Singapore, Singapore
    • Conference

    This paper describes the first attempt at separation and tracking (3D localization) of multiple moving sound sources using multiple microphone arrays fixed at known locations in an indoor environment. For static sources, location-dependent priors have been placed on the time-invariant spatial covariance matrices (SCMs) of sources within the statistical framework of blind source separation based on multichannel nonnegative matrix factorization (MNMF), achieving maximum-likelihood estimation of source locations. One may thus let both the SCMs and their priors vary over time to deal with source movements. This naive extension, however, fails to localize sources when they are inactive, yielding non-smooth, discontinuous trajectory estimates. To solve this problem, we formulate a hierarchical probabilistic model for multichannel mixture signals that consists of inertial Markov models for source locations, location-aware moving-average models for source SCMs, and NMF-based low-rank models for the power spectral densities (PSDs) of sources. All the time-varying attributes of the sources are jointly estimated under a maximum-a-posteriori (MAP) principle, and the source images are then estimated with a multichannel Wiener filter. An experiment using simulated data with two moving sources and four four-channel arrays showed that the proposed method achieved better separation and smoother localization. © 2025 IEEE.
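    The way an inertial prior bridges frames where a source is silent can be illustrated with a 1-D constant-velocity Kalman filter, a much simpler stand-in for the paper's hierarchical MAP model (all parameters and the 1-D setting are illustrative):

```python
import numpy as np

def cv_kalman(obs, dt=0.1, q=0.05, r=0.2):
    """1-D constant-velocity Kalman filter; obs entries may be None (inactive)."""
    F = np.array([[1.0, dt], [0.0, 1.0]])                 # inertial transition
    Q = q * np.array([[dt**3 / 3, dt**2 / 2], [dt**2 / 2, dt]])
    H = np.array([[1.0, 0.0]])                            # observe position only
    x = np.array([obs[0], 0.0])                           # [position, velocity]
    P = np.eye(2)
    track = []
    for z in obs:
        x = F @ x                                         # predict: inertia carries
        P = F @ P @ F.T + Q                               # through silent frames
        if z is not None:                                 # update only when active
            S = H @ P @ H.T + r
            K = (P @ H.T) / S
            x = x + (K * (z - H @ x)).ravel()
            P = (np.eye(2) - K @ H) @ P
        track.append(x[0])
    return np.array(track)

t = np.arange(50) * 0.1
true_pos = 2.0 * t                                        # source moving at 2 m/s
rng = np.random.default_rng(0)
obs = list(true_pos + 0.2 * rng.normal(size=50))
obs[20:30] = [None] * 10                                  # source inactive: no obs
track = cv_kalman(obs)
# The inertial prior bridges the silent gap with a smooth, continuous estimate.
```

    The estimated velocity keeps the trajectory moving through the observation gap, which is the qualitative behavior the paper's inertial Markov model provides within its joint separation-and-tracking objective.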
