A Study on High Quality Dataset Construction for Multitask Learning
1. Speech-Scenario Generation Based on the Philosophy of a Prominent Leader Within a Small Community
- Keywords:
- Community IS;Graduation ceremony;Language model;LLM;Philosophy;Retrieval-augmented generation;Scenarios generation;Small community;Speech-scenario generation;Text generations
- Kitahata, Tetsuya;Seki, Kazuhiro;Nadamoto, Akiyo
- 36th International Conference on Database and Expert Systems Applications, DEXA 2025
- 2026
- August 25, 2025 - August 27, 2025
- Bangkok, Thailand
- Conference
Research about long text generation has been actively conducted with the advancement of large language models. However, generating long text that considers the unique philosophy within a small community, such as speeches used in graduation ceremonies or company inductions, remains challenging. The reason is that information and literature about prominent leaders in small communities are generally extremely limited compared to those about well-known prominent leaders. This study focuses on prominent leaders within small communities, such as university or company founders. It aims to generate speech scenarios that automatically share small communities’ unique philosophies. In this paper, we target Hachisaburo Hirao, the founder of our university, and extract sentences containing his philosophies from his diaries, lectures, and autobiographies to create a quotations database. We then propose a method to generate speech scenarios based on the user’s input theme using Retrieval-Augmented Generation (RAG) with the quotations database. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
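The pipeline in the abstract retrieves quotations matching a user's theme and grounds the generation prompt in them. A minimal sketch of that retrieve-then-prompt step follows; the word-overlap scorer is a hypothetical stand-in for the dense retriever a real RAG system would use, and the quotations and theme are invented examples, not taken from the actual quotations database.

```python
def retrieve_quotations(theme, quotations, k=2):
    """Rank quotations by word overlap with the input theme (a toy
    stand-in for a real retriever's similarity search)."""
    theme_words = set(theme.lower().split())
    scored = sorted(
        quotations,
        key=lambda q: len(theme_words & set(q.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_speech_prompt(theme, quotations):
    """Assemble the generation prompt: user theme plus retrieved quotes."""
    context = "\n".join(f"- {q}" for q in retrieve_quotations(theme, quotations))
    return (
        f"Write a ceremony speech on the theme: {theme}\n"
        f"Ground it in these quotations from the founder:\n{context}"
    )

# Hypothetical quotations; a real system would query the database built
# from the founder's diaries, lectures, and autobiographies.
quotes = [
    "Learning never ends; carry it into daily work.",
    "Serve the community before yourself.",
    "A ledger kept honestly is a life kept honestly.",
]
prompt = build_speech_prompt("lifelong learning", quotes)
```

The prompt would then be passed to an LLM for the actual speech-scenario generation.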
2. Two-Stage Fine-Tuning for Dialogue Generation with Small Community Prominent Leaders’ Philosophies
- Keywords:
- Speech communication;Dialogue generations;Fine tuning;Generation method;Language model;Philosophy;Public data;Small business;Small community;Speech patterns;Two-stage fine-tuning
- Kitahata, Tetsuya;Seki, Kazuhiro;Nadamoto, Akiyo
- 27th International Conference on Information Integration and Web Intelligence, iiWAS 2025
- 2026
- December 8, 2025 - December 10, 2025
- Matsue, Japan
- Conference
Recent advances in large language models (LLMs) have enabled the replication of speech patterns and philosophies of prominent historical figures. However, generating dialogue that reflects the philosophies of prominent leaders in small communities, such as founders of local universities or small businesses, remains a challenge due to the limited availability of public data. Nevertheless, the philosophies of such leaders often serve as important educational and behavioral foundations for members of these communities. In this study, we propose a dialogue generation method that enables the sharing of a prominent local leader’s philosophy through natural conversation. Specifically, we classify sentences left behind by the leader—such as those in books or diaries—into four types: statements, thoughts, actions, and facts. We then perform two-stage fine-tuning using the statements and thoughts to generate dialogues that faithfully reflect the leader’s values and philosophy. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
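The data flow described above — classify the leader's sentences into four types, then fine-tune in two stages on statements and thoughts — can be sketched as follows. The `train_step` callable and the example sentences are hypothetical placeholders; in the actual method each stage would be a full LLM fine-tuning pass.

```python
def two_stage_finetune(model, sentences, train_step):
    """Partition sentences by type, then fine-tune first on statements
    and then on thoughts; actions and facts are set aside, per the
    abstract's description."""
    by_type = {"statement": [], "thought": [], "action": [], "fact": []}
    for text, kind in sentences:
        by_type[kind].append(text)
    for stage in ("statement", "thought"):  # the two fine-tuning stages
        for text in by_type[stage]:
            model = train_step(model, text)
    return model

# Toy run: the "model" is a list that records what it was trained on,
# and the sentences are invented examples of each type.
data = [
    ("Honesty above profit.", "statement"),
    ("I wondered if the school could grow.", "thought"),
    ("He founded the academy in 1920.", "fact"),
]
trained = two_stage_finetune([], data, lambda m, t: m + [t])
```

The toy run shows that only statement- and thought-type sentences reach the training loop, in stage order.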
3. Parameter Drift as a Signal for Membership Inference in Overfit-Tuned LLMs
- Keywords:
- Inference attacks;Internal parameters;Language model;Large language model;Membership inference attack;Parameter dynamics;Parameters drift;Pre-training;White box;White-box attack
- Kitamura, Takuto;Suzuki, Yu
- 27th International Conference on Big Data Analytics and Knowledge Discovery, DaWaK 2025
- 2026
- August 25, 2025 - August 27, 2025
- Bangkok, Thailand
- Conference
We propose a novel white-box membership inference attack (MIA) for large language models (LLMs) that leverages internal parameter dynamics to determine whether a given text sample was included in a model’s pre-training data. Prior MIA approaches rely primarily on input-output behavior and struggle to distinguish memorized samples from semantically similar but unseen ones due to the probabilistic nature of LLM generation. To address this challenge, we introduce a method based on parameter drift—defined as the Euclidean distance between the entire set of model parameters before and after continual pre-training on a single input. Our hypothesis is that continual pre-trained inputs induce minimal parameter changes, while unseen inputs require greater updates to the model. Notably, we find that even semantically similar inputs yield distinct drift magnitudes, enabling more precise membership inference. We validate our approach on multiple LLMs, including Pythia and LLaMA-2, and show that it consistently outperforms existing MIA baselines such as Min-K% Prob and SaMIA*zlib under various evaluation settings. Furthermore, we demonstrate that focusing on parameters with high drift further improves inference accuracy, achieving state-of-the-art results on benchmark datasets. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
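The drift measure defined above — the Euclidean distance between the full parameter set before and after continual pre-training on one input — can be computed as below. This is a toy illustration: the random "parameter sets" stand in for real model weights, and the noise scales merely mimic the hypothesis that seen inputs induce smaller updates than unseen ones.

```python
import numpy as np

def parameter_drift(params_before, params_after):
    """Euclidean distance between two full sets of model parameters,
    treated as one flattened vector."""
    diffs = [(b - a).ravel() for b, a in zip(params_before, params_after)]
    return float(np.sqrt(sum(np.dot(d, d) for d in diffs)))

# Toy stand-in for a model's weights; a "seen" input barely moves them,
# an "unseen" input moves them more, so its drift is larger.
rng = np.random.default_rng(0)
theta = [rng.normal(size=(4, 4)), rng.normal(size=4)]
theta_seen = [p + 1e-4 * rng.normal(size=p.shape) for p in theta]
theta_unseen = [p + 1e-1 * rng.normal(size=p.shape) for p in theta]

drift_seen = parameter_drift(theta, theta_seen)
drift_unseen = parameter_drift(theta, theta_unseen)
```

The membership decision then reduces to thresholding the drift, with small drift suggesting the sample was in the pre-training data.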
4. Analysis of Behavioral Facilitation Information During Disasters Based on Reader Attributes and Personality Traits
- Keywords:
- Behavioral research;Social networking (online);Storms;Big five;Classifieds;Content-based;Large volumes;Personality traits;Social networking services
- Nadamoto, Akiyo;Wakasugi, Kosuke;Suzuki, Yu;Kumamoto, Tadahiko
- Informatica
- 2026
- Volume 49
- Issue 3
- Journal
During disasters, a large volume of messages are posted on social networking services (SNS). Some of these messages contain behavioral facilitation information, which either encourages or discourages specific actions. However, the interpretation of such information depends on the personality traits of the individuals affected. In this study, we hypothesize that victims’ personality traits influence their perception of behavioral facilitation information, and we analyze the characteristics of these differences. Focusing on typhoons, we propose a method for extracting behavioral facilitation information from posts on X (formerly Twitter) during typhoon-related disasters. The extracted information is then classified into four content-based categories: suggest, inhibition, encouragement, and wish. Furthermore, we categorize individual personality traits into five dimensions (the Big Five), and also take into account their age and sex. We then analyze how the perception of each type of behavioral facilitation information varies according to these traits. Our analysis reveals that, during disasters, the interpretation of behavioral facilitation information exhibits distinct and consistent patterns depending on the personality traits of the victims. © 2026, Slovene Society Informatika. All rights reserved.
5. Automated Instruction Generation via Alternating Evaluation and Creation with LLMs
- Keywords:
- Iterative methods;Crowdsourcing platforms;Dataset creation;High quality;Instruction;Instruction generations;Iterative process;Language model;Large language model;Performance;Workers'
- Tanaka, Ryo;Suzuki, Yu
- 27th International Conference on Information Integration and Web Intelligence, iiWAS 2025
- 2026
- December 8, 2025 - December 10, 2025
- Matsue, Japan
- Conference
On crowdsourcing platforms, the quality of collected data depends on the clarity of instructions, but requesters struggle to create instructions that capture their own implicit criteria. To address this issue, we propose a novel framework that uses two Large Language Models (LLMs) – a Creator and an Evaluator – to automatically explore the space of possible instructions. In this iterative process, the Creator LLM generates diverse instruction candidates, and the Evaluator LLM, acting as a proxy for human workers, assesses their performance on a task, providing a fitness score. Our experiments show that this exploratory approach is effective for discovering high-quality instructions, even if the process does not show monotonic improvement. Using the best-performing instruction created by our method with gemma3, we achieved 5.4% higher accuracy and 0.035 lower RMSE than when gemma3 used an instruction created by a requester. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
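The alternating Creator/Evaluator loop can be sketched as follows. Both LLMs are replaced here by hypothetical stubs — the Creator appends random tweaks and the Evaluator simply rewards longer, more specific instructions — so this shows only the control flow of the search, not the behavior of real models.

```python
import random

def creator(seed_instruction, rng):
    """Stand-in Creator LLM: proposes a mutated instruction candidate."""
    tweaks = ["Be concise.", "Give one example.", "Rate on a 1-5 scale."]
    return seed_instruction + " " + rng.choice(tweaks)

def evaluator(instruction):
    """Stand-in Evaluator LLM acting as a proxy worker: returns a
    fitness score (here, just the instruction's word count)."""
    return len(instruction.split())

def alternate(seed, rounds=5, candidates=3, rng=None):
    """Alternate creation and evaluation, keeping the best instruction."""
    rng = rng or random.Random(0)
    best, best_score = seed, evaluator(seed)
    for _ in range(rounds):
        for cand in (creator(best, rng) for _ in range(candidates)):
            score = evaluator(cand)
            if score > best_score:
                best, best_score = cand, score
    return best, best_score

best, score = alternate("Label each review as positive or negative.")
```

With real LLMs in both roles, the fitness score would come from the Evaluator's task performance under each candidate instruction rather than from a length heuristic.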
6. Analysis of Behavioral Facilitation Information During Typhoon Period Based on Victim Attributes
- Keywords:
- Behavioral facilitation;Large amounts;SNS;Victim
- Wakasugi, Kosuke;Suzuki, Yu;Kumamoto, Tadahiko;Nadamoto, Akiyo
- 13th International Symposium on Information and Communication Technology, SOICT 2024
- 2025
- December 13, 2024 - December 15, 2024
- Danang, Vietnam
- Conference
During a disaster, a large amount of Behavioral Facilitation information (BF information) is posted on SNSs. We previously proposed a method for extracting BF information and categorizing it into four labels (a behavioral axis). In this paper, we analyze the differences in how BF information is received before, during, and after a typhoon, focusing on the attributes of disaster victims. This research aims to clarify the appropriate information for the target victims in each typhoon period and to analyze their relationship. The results show differences in the relationship between the victims’ attributes and their perception of BF information. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
7. Finding Adequate Additional Layer of Auxiliary Task in BERT-Based Multi-task Learning
- Keywords:
- Contrastive Learning;Deep learning;Federated learning;Multi-task learning;Natural language processing systems;BERT;Different layers;Language processing;Learning models;Machine learning models;Machine-learning;Multitask learning;Natural language processing;Natural languages
- Kitamura, Takuto;Suzuki, Yu
- 26th International Conference on Information Integration and Web Intelligence, iiWAS 2024
- 2025
- December 2, 2024 - December 4, 2024
- Bratislava, Slovakia
- Conference
We find adequate additional layers for auxiliary tasks in BERT-based multi-task learning. Multi-task learning is a method for improving the accuracy of a machine learning model by adding auxiliary tasks. Previous studies have proposed multi-task learning models with auxiliary tasks added to different layers, but which layer is most effective for adding auxiliary tasks is still unknown. The aim of this study is to find the additional layer for auxiliary tasks that maximizes model accuracy. We use a BERT-base model consisting of twelve Transformer layers and experiment with seven datasets. Our experimental results show that changing the additional layer of auxiliary tasks improves macro-F1 by up to 5.1% (p-value = 0.019). Moreover, our findings suggest that inserting auxiliary tasks into layers that capture the main task’s characteristics increases accuracy. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
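The architectural idea above — tapping an intermediate layer's hidden state for the auxiliary head while the main head reads the final layer — can be sketched with a toy twelve-layer stack. The `tanh` matrix products are hypothetical stand-ins for BERT's Transformer layers; only the wiring of the two task heads mirrors the paper's setup.

```python
import numpy as np

def forward_with_auxiliary(x, layers, aux_layer, main_head, aux_head):
    """Run a stack of encoder layers; tap the hidden state at
    `aux_layer` for the auxiliary task and the final hidden state
    for the main task."""
    aux_out = None
    h = x
    for i, w in enumerate(layers, start=1):
        h = np.tanh(h @ w)          # toy stand-in for a Transformer layer
        if i == aux_layer:
            aux_out = h @ aux_head  # auxiliary head attached at layer i
    return h @ main_head, aux_out   # main head on the final layer

rng = np.random.default_rng(0)
d = 8
layers = [rng.normal(size=(d, d)) * 0.1 for _ in range(12)]  # 12 layers, as in BERT-base
x = rng.normal(size=(2, d))  # batch of 2 toy inputs
main, aux = forward_with_auxiliary(
    x, layers, aux_layer=9,
    main_head=rng.normal(size=(d, 3)),  # hypothetical 3-class main task
    aux_head=rng.normal(size=(d, 2)),   # hypothetical 2-class auxiliary task
)
```

Varying `aux_layer` from 1 to 12 is the knob the paper's experiments turn to find the layer placement that maximizes accuracy.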
8. A Method to Improve Crowdsourcing Outcome and to Reduce Calculation Costs Using Machine-Learning
- Ota, Nana;Suzuki, Yu
- Springer Science and Business Media Deutschland GmbH
- 2024
- Book
