
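Advancing Software Engineering Tasks by Constructing a Dataset of Functionally Equivalent Methods (機能等価メソッドデータセットの構築によるソフトウェア工学タスクの高度化)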

Funding source

Japan Society for the Promotion of Science (JSPS)

Principal investigator

肥後芳樹

Funded institution

Osaka University

Project number

24H00692

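Fiscal year of approval

2024

Approval date

Not disclosed

Research period

Unknown / Unknown

Project level

National

Funding amount

46,540,000 JPY

Discipline

Information science, computer engineering, and related fields

Discipline code

Not disclosed

Grant category

Grant-in-Aid for Scientific Research (A) (基盤研究(A))

Keywords

Source code analysis; functionally equivalent methods; large language models; code clones; Java; Python; LLM

Participants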

丸山勝久；林晋平；松本真佑；ヌリオリビエ

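Participating institutions

Osaka University, Graduate School of Information Science and Technology; Ritsumeikan University, College of Information Science and Engineering; Institute of Science Tokyo, School of Computing; Osaka University, Institute for Advanced Co-Creation Studies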

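Project abstract: The first year was intended for constructing the Java dataset, but beyond the original plan we were also able to use the constructed dataset to improve the accuracy of LLM-based code clone detection.

In the past fiscal year, with research on "functionally equivalent methods" in software engineering at its core, we obtained two results: (1) the construction of a large-scale dataset and (2) the advancement of clone detection techniques that use LLMs. First, from more than 314M lines of OSS (IJADataset), we extracted Java methods that are functionally equivalent but structurally different by means of automated test generation (EvoSuite) and mutual execution, and after manual verification we released the Functionally Equivalent Method Pair Dataset (FEMPDataset), which contains 1,342 pairs. Evaluating NIL, InferCode, and ASTNN on this dataset showed the limits of existing techniques quantitatively: the token-sequence-based technique missed many clones, whereas the AST/deep-learning-based techniques produced many false positives. Next, we fine-tuned GPT-3.5 turbo, Llama2-Chat-7B, and Code-Llama-7B-Instruct with FEMPDataset as training data and improved their Type-4 clone detection ability. In particular, the Code-Llama model improved substantially in both precision and recall, and the fine-tuned GPT-3.5 achieved higher accuracy than GPT-4-turbo. This demonstrated that combining dataset preparation with model optimization makes large language models effective even for detecting clones with large differences, which had previously been difficult. These results contribute to the development of code clone research and automated program analysis by providing a new data resource and establishing a way to apply LLMs, and the KAKENHI support underpinned both results.

In the past fiscal year the research progressed beyond the original schedule. Going forward, we will continue from last year and further deepen the work on improving the accuracy of code clone detection techniques using the constructed dataset. We will also work on refactoring support that uses the dataset. Furthermore, drawing on the knowledge gained from constructing the Java dataset, we will construct a Python dataset.

Outline of Research at the Start: In this research we construct a dataset of functionally equivalent methods. The candidate functionally equivalent methods that we obtain are checked manually to confirm that they are truly functionally equivalent. After the dataset is constructed, we use it to evaluate software engineering techniques. For example, functionally equivalent methods can be used to evaluate code clone detection tools. Methods that implement the same functionality should ideally be detected as code clones, so investigating how many functionally equivalent methods are detected as code clones makes it possible to evaluate the performance of code clone detection tools. In addition, we aim to advance software engineering tasks by using the constructed dataset to fine-tune large language models.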
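The abstract mentions extracting functionally equivalent methods through automated test generation and mutual execution, but it does not spell out the procedure. The following is a minimal sketch under one assumed reading: tests generated for each method are run against the other, and a pair is kept as a candidate only if both directions pass. All names here (`generate_tests`, `passes_all`, `is_equivalence_candidate`, and the sample methods) are hypothetical, and the recorded-output test generator is only a stand-in for a real tool such as EvoSuite.

```python
from typing import Callable, Iterable, List, Tuple

# A test is an (input, expected output) pair recorded from one method.
Test = Tuple[int, int]
Method = Callable[[int], int]


def generate_tests(method: Method, inputs: Iterable[int]) -> List[Test]:
    """Stand-in test generator: record the method's own outputs as oracles."""
    return [(x, method(x)) for x in inputs]


def passes_all(method: Method, tests: List[Test]) -> bool:
    """Run tests that were generated for another method against this one."""
    return all(method(x) == expected for x, expected in tests)


def is_equivalence_candidate(a: Method, b: Method,
                             inputs_a: Iterable[int],
                             inputs_b: Iterable[int]) -> bool:
    """Mutual execution (assumed reading): b must pass a's tests and vice versa."""
    tests_a = generate_tests(a, inputs_a)
    tests_b = generate_tests(b, inputs_b)
    return passes_all(b, tests_a) and passes_all(a, tests_b)


# Two structurally different implementations of the same behavior.
def sum_loop(n: int) -> int:
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_formula(n: int) -> int:
    return n * (n + 1) // 2 if n > 0 else 0


# A method with different behavior, for contrast.
def sum_squares(n: int) -> int:
    return sum(i * i for i in range(1, n + 1))


if __name__ == "__main__":
    print(is_equivalence_candidate(sum_loop, sum_formula, range(-3, 20), range(0, 50)))
    print(is_equivalence_candidate(sum_loop, sum_squares, range(0, 5), range(0, 5)))
```

Passing tests in both directions only makes a pair a candidate, since tests cannot prove equivalence; this is consistent with the manual verification step that the abstract describes.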


1.A Large-Scale Investigation Into the Loss of Pull Request Data on GitHub

Keywords:
Software development management; Source coding; Application programming interfaces; Testing; Codes; Soft sensors; Reviews; Maintenance; Java; Information science; Empirical software engineering; pull requests; social coding; GitHub; software mining

Tang, Bowen;Maruyama, Katsuhisa
IEEE ACCESS
2026
Volume 14
Journal

Analyzing pull requests (PRs) on GitHub provides valuable insights that can improve software development and maintenance. Therefore, researchers must collect PRs for empirical studies when testing hypotheses and creating practical tools based on these insights. Unfortunately, using GitHub as a data source for PRs carries the risk of data loss, owing to its flexible resource management. Existing studies have indicated that data losses can occur in PRs; however, the types and impacts of these losses remain unclear. This study shares findings from our investigation, which analyzed 84,828 PRs from 30 GitHub repositories and 2,345,724 actions recorded within the PRs. It clarified how different types of data loss affected PRs and highlighted variations in the percentage of PRs affected by loss. The results showed that 54.79% of the PRs experienced some data loss. Source code loss was common, whereas the loss of user information and commits was less frequent. Most user information loss resulted from missing committers. Compared to PRs that were rejected, merged PRs were more likely to have source code losses. The source code loss rate was much lower in testing-related PRs than in those unrelated to testing. PRs that lacked files written in a programming language were more prone to commit loss. These findings help researchers better understand data loss in PRs and develop effective strategies to prevent it.

...

2.An empirical study on the impact of change granularity in refactoring detection

Keywords:
Error detection;Open systems;Coarse-grained;Commit history;Commit message;Empirical studies;Impact of changes;Open science;Refactoring detection;Refactorings;Software Evolution

Chen, Lei;Hayashi, Shinpei
Journal of Systems and Software
2026
Volume 231
Journal

Detecting refactorings in commit history is essential to improve comprehension of code changes in code reviews and to provide valuable information for empirical studies on software evolution. Techniques have been proposed to accurately detect refactorings at the granularity of a single commit. However, refactorings can be made over multiple commits because of their complexity or other practical development problems, so detection at the granularity of a single commit alone is not enough. We observe that some refactorings can only be detected at a coarser granularity, i.e., in changes conducted over multiple commits, while others can be detected at the granularity of a single commit but not at a coarser granularity. We call these types of refactorings coarse-grained refactorings (CGRs) and ephemeral refactorings (EPRs), respectively. We investigated the features and causes of CGRs and EPRs through an empirical study of 32 open-source Java projects and found that both commonly occur during development. In addition, we found that refactoring types related to splitting or merging classes and packages, as well as those involving modifications to the inheritance structure, tend to be CGRs, and that types targeting small objects such as variables and attributes, and refactorings with context-sensitive detection criteria, tend to be EPRs. The causes of CGRs and EPRs are analyzed and categorized, and the relationship between CGRs and their commit messages is also assessed. We found that about 20% of commit messages explicitly suggest the existence of CGRs. We suggest that CGRs and EPRs be valued in refactoring research and that detectors be extended to identify CGRs. Editor's note: Open Science material was validated by the Journal of Systems and Software Open Science Board. © 2025 The Authors

...

3.Coverage Isn’t Enough: SBFL-Driven Insights into Manually Created vs. Automatically Generated Tests

Keywords:
Automatic test pattern generation;Software design;Software testing;Well testing;Automated test-case generations;Automatically generated;Code coverage;Fault localization;Mutation testing;Spectra's;Spectrum-based fault localization;Test case;Testing method;Testing phase

Shimizu, Sasara;Higo, Yoshiki
26th International Conference on Product-Focused Software Process Improvement, PROFES 2025
2026
December 1, 2025 - December 3, 2025
Salerno, Italy
Conference

The testing phase is an essential part of software development, but manually creating test cases can be time-consuming. Consequently, there is a growing need for more efficient testing methods. To reduce the burden on developers, various automated test generation tools have been developed, and several studies have been conducted to evaluate the effectiveness of the tests they produce. However, most of these studies focus primarily on coverage metrics, and only a few examine how well the tests support fault localization—particularly using artificial faults introduced through mutation testing. In this study, we compare the SBFL (Spectrum-Based Fault Localization) score and code coverage of automatically generated tests with those of manually created tests. The SBFL score indicates how accurately faults can be localized using SBFL techniques. By employing SBFL score as an evaluation metric—an approach rarely used in prior studies on test generation—we aim to provide new insights into the respective strengths and weaknesses of manually created and automatically generated tests. Our experimental results show that automatically generated tests achieve higher branch coverage than manually created tests, but their SBFL score is lower, especially for code with deeply nested structures. These findings offer guidance on how to effectively combine automatically generated and manually created testing approaches. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
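The abstract does not say which SBFL formula or score definition the study uses. As background only, here is a minimal sketch of how spectrum-based fault localization ranks code from test results, using the widely known Ochiai formula as an assumed example. The function name `ochiai` and the sample coverage data are hypothetical.

```python
import math
from typing import Dict, Set


def ochiai(coverage: Dict[str, Set[int]], passed: Dict[str, bool]) -> Dict[int, float]:
    """Compute Ochiai suspiciousness per line.

    coverage maps a test name to the set of line numbers it executes;
    passed maps a test name to True if the test passed.
    """
    total_failed = sum(1 for ok in passed.values() if not ok)
    lines = set().union(*coverage.values())
    scores = {}
    for line in lines:
        # ef / ep: failing / passing tests that execute this line.
        ef = sum(1 for t, cov in coverage.items() if line in cov and not passed[t])
        ep = sum(1 for t, cov in coverage.items() if line in cov and passed[t])
        denom = math.sqrt(total_failed * (ef + ep))
        scores[line] = ef / denom if denom else 0.0
    return scores


# Hypothetical spectrum: line 3 is executed only by the failing test.
coverage = {"t1": {1, 2, 3}, "t2": {1, 2}, "t3": {1}}
passed = {"t1": False, "t2": True, "t3": True}

scores = ochiai(coverage, passed)
ranking = sorted(scores, key=lambda line: -scores[line])

if __name__ == "__main__":
    print(ranking)
```

In this toy spectrum the line that only the failing test executes ranks first. A test suite that exercises lines in more varied combinations gives such a ranking more information to separate faulty lines from the rest, which high coverage alone does not guarantee.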

...

4.How Natural Language Proficiency Shapes Generative AI Code for Software Engineering Tasks

Keywords:
Codes; Natural language processing; Software engineering; Software reliability; Software development management; Python; Software measurement; Large language models

Rojpaisarnkit, Ruksit;Fan, Youmei;Matsumoto, Kenichi;Kula, Raula Gaikovina
IEEE SOFTWARE
2026
Volume 43
Issue 1
Journal

Much research has focused on prompt structure, but natural language proficiency is an underexplored factor that can influence the quality of generated code. This article investigates whether English language proficiency affects the proficiency and correctness of code generated by large language models.

...

5.How Much Can a Behavior-Preserving Changeset Be Decomposed into Refactoring Operations?

Keywords:
Behavior preservation;Refactorings

Someya, Kota;Chen, Lei;Decker, Michael J.;Hayashi, Shinpei
41st IEEE International Conference on Software Maintenance and Evolution, ICSME 2025
2025
September 7, 2025 - September 12, 2025
Auckland, New Zealand
Conference

Developers sometimes mix behavior-preserving modifications, such as refactorings, with behavior-altering modifications, such as feature additions. Several approaches have been proposed to support understanding such modifications by separating them into those two parts. Such refactoring-aware approaches are expected to be particularly effective when the behavior-preserving parts can be decomposed into a sequence of more primitive behavior-preserving operations, such as refactorings, but this has not been explored. In this paper, as an initial validation, we quantify how much of the behavior-preserving modifications can be decomposed into refactoring operations using a dataset of functionally-equivalent method pairs. As a result, when using an existing refactoring detector, only 33.9% of the changes could be identified as refactoring operations. In contrast, when including 67 newly defined functionally-equivalent operations, the coverage increased by over 128%. Further investigation into the remaining unexplained differences was conducted, suggesting improvement opportunities. © 2025 IEEE.

...

6.Social Media Reactions to Open Source Promotions: AI-Powered GitHub Projects on Hacker News

Keywords:
Artificial intelligence;Open systems;Social networking (online);Social sciences computing;Software design;Github project;Hacker news;LLM;News sources;Open source software projects;Open-source;Open-source softwares;Social media;Social media platforms;Spread of informations

Meakpaiboonwattana, Prachnachai;Tarntong, Warittha;Mekratanavorakul, Thai;Ragkhitwetsagul, Chaiyong;Sangaroonsilp, Pattaraporn;Kula, Raula Gaikovina;Choetkiertikul, Morakot;Matsumoto, Kenichi;Sunetnanta, Thanwadee
41st IEEE International Conference on Software Maintenance and Evolution, ICSME 2025
2025
September 7, 2025 - September 12, 2025
Auckland, New Zealand
Conference

Social media platforms have become more influential than traditional news sources, shaping public discourse and accelerating the spread of information. With the rapid advancement of artificial intelligence (AI), open-source software (OSS) projects can leverage these platforms to gain visibility and attract contributors. In this study, we investigate the relationship between Hacker News, a social news site focused on computer science and entrepreneurship, and the extent to which it influences developer activity on the promoted GitHub AI projects. We analyzed 2,195 Hacker News (HN) stories and their corresponding comments over a two-year period. Our findings reveal that at least 19% of AI developers promoted their GitHub projects on Hacker News, often receiving positive engagement from the community. By tracking activity on the associated 1,814 GitHub repositories after they were shared on Hacker News, we observed a significant increase in forks, stars, and contributors. These results suggest that Hacker News serves as a viable platform for AI-powered OSS projects, with the potential to gain attention, foster community engagement, and accelerate software development. © 2025 IEEE.

...

7.A Dataset of Software Bill of Materials for Evaluating SBOM Consumption Tools

Keywords:
Open source software;Open systems;Tools;Bill of materials;Evaluating software;Generation tools;Material consumption;Real-world;Software bill of material;Software dependencies;Software-component;SPDX;Tool support

Kishimoto, Rio;Kanda, Tetsuya;Manabe, Yuki;Inoue, Katsuro;Qiu, Shi;Higo, Yoshiki
22nd IEEE/ACM International Conference on Mining Software Repositories, MSR 2025
2025
April 27, 2025 - April 29, 2025
Ottawa, ON, Canada
Conference

A Software Bill of Materials (SBOM) is becoming an essential tool for effective software dependency management. An SBOM is a list of components used in software, including details such as component names, versions, and licenses. Using SBOMs, developers can quickly identify software components and assess whether their software depends on vulnerable libraries. Numerous tools support software dependency management through SBOMs, which can be broadly categorized into two types: tools that generate SBOMs and tools that utilize SBOMs. A substantial collection of accurate SBOMs is required to evaluate tools that utilize SBOMs. However, there is no publicly available dataset specifically designed for this purpose, and research on SBOM consumption tools remains limited. In this paper, we present a dataset of SBOMs to address this gap. The dataset we constructed comprises 46 SBOMs generated from real-world Java projects, with plans to expand it to include a broader range of projects across various programming languages. Accurate and well-structured SBOMs enable researchers to evaluate the functionality of SBOM consumption tools and identify potential issues. We collected 3,271 Java projects from GitHub and generated SBOMs for 798 of them using Maven with an open-source SBOM generation tool. These SBOMs were refined through both automatic and manual corrections to ensure accuracy, currently resulting in 46 SBOMs that comply with the SPDX Lite profile, which defines minimal requirements tailored to practical workflows in industries. This process also revealed issues with the SBOM generation tools themselves. The dataset is publicly available on Zenodo (DOI: 10.5281/zenodo.14233414). © 2025 IEEE.

...

8.Revisiting Method-Level Change Prediction: A Comparative Evaluation at Different Granularities

Keywords:
Computer software;Maintainability;Change prediction;Class level;Comparative evaluations;Comparison methods;Different granularities;Level change;Machine-learning;Maintenance efforts;Performance;Prediction techniques

Sugimori, Hiroto;Hayashi, Shinpei
32nd IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2025
2025
March 4, 2025 - March 7, 2025
Montreal, QC, Canada
Conference

To improve the efficiency of software maintenance, change prediction techniques have been proposed to predict frequently changing modules. Whereas existing techniques focus primarily on class-level prediction, method-level prediction allows for more direct identification of change locations. Method-level prediction can be useful, but it may also negatively affect prediction performance, leading to a trade-off. This makes it unclear which level of granularity users should select for their predictions. In this paper, we evaluated the performance of method-level change prediction compared with that of class-level prediction from three perspectives: direct comparison, method-level comparison, and maintenance effort-aware comparison. The results from 15 open source projects show that, although method-level prediction exhibited lower performance than class-level prediction in the direct comparison, method-level prediction outperformed class-level prediction when both were evaluated at the method level, leading to a median difference of 0.26 in accuracy. Furthermore, the effort-aware comparison shows that method-level prediction performed significantly better when the acceptable maintenance effort is small. © 2025 IEEE.

...

9.Toward Automated Test Generation for Dockerfiles Based on Analysis of Docker Image Layers

Keywords:
Automatic test pattern generation;Codes (symbols);Image processing;Software testing;Automated test generations;Docker;Dockerfile;General programming;Generation techniques;Image layers;Layer;Source codes;Text file;Virtualizations

Goto, Yuki;Matsumoto, Shinsuke;Kusumoto, Shinji
29th International Conference on Evaluation and Assessment of Software Engineering, EASE 2025
2025
June 17, 2025 - June 20, 2025
Istanbul, Turkey
Conference

Docker has gained attention as a lightweight container-based virtualization platform. The process for building a Docker image is defined in a text file called a Dockerfile. A Dockerfile can be considered as a kind of source code that contains instructions on how to build a Docker image. Its behavior should be verified through testing, as is done for source code in a general programming language. For source code in languages such as Java, search-based test generation techniques have been proposed. However, existing automated test generation techniques cannot be applied to Dockerfiles. Since a Dockerfile does not contain branches, the coverage metric, typically used as an objective function in existing methods, becomes meaningless. In this study, we propose an automated test generation method for Dockerfiles based on processing results rather than processing steps. The proposed method determines which files should be tested and generates the corresponding tests based on an analysis of Dockerfile instructions and Docker image layers. The experimental results show that the proposed method can reproduce over 80% of the tests created by developers. © 2025 Copyright held by the owner/author(s).

...

10.Exploring an Inclusion Relation on Test Cases to Identify Unit and Integration Tests

Keywords:
Integration;Debugging efforts;Inclusion relation;Integration test;Line coverage;Measurement methods;Software testings;Test case;Testing efficiency;Testing process;Unit tests

Okamoto, Ryu;Matsumoto, Shinsuke;Kusumoto, Shinji
25th International Conference on Product-Focused Software Process Improvement, PROFES 2024
2025
December 2, 2024 - December 4, 2024
Tartu, Estonia
Conference

In software testing, among the various types of tests, two commonly conducted ones are unit and integration tests. Unit tests verify individual functionalities, and integration tests verify the combination of multiple functionalities. If we can identify unit/integration tests and measure them as ordinal values, such as the degree of integration-ness, we can utilize them to improve testing efficiency. However, the definitions of unit/integration are ambiguous, making it difficult to distinguish between them. To the best of our knowledge, there is currently no method for detecting this distinction. In this study, aiming to support the testing process, we will consider a measurement method for unit/integration tests. The key idea is to utilize an inclusion relation, which naturally exists among test cases. As an application of the inclusion relation, we propose a method for ordering failed tests to streamline debugging. We conducted a mutation analysis to evaluate how much our proposal reduces debugging effort compared to a naive method. The results showed that our proposal was effective in 29.7% of cases and confirmed an average reduction of 20.7% in debugging effort. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
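The abstract does not define the inclusion relation precisely. The sketch below assumes one plausible reading suggested by the "Line coverage" keyword: test A includes test B when the lines B covers are a strict subset of the lines A covers. Under that assumption, a test that includes many others looks more integration-like, and failed tests can be inspected starting from the most unit-like one. The function names and sample data are hypothetical, and this is not presented as the authors' algorithm.

```python
from typing import Dict, List, Set


def inclusion_counts(coverage: Dict[str, Set[int]]) -> Dict[str, int]:
    """For each test, count the other tests whose covered lines it strictly includes."""
    return {
        t: sum(1 for u, cov_u in coverage.items() if u != t and cov_u < cov_t)
        for t, cov_t in coverage.items()
    }


def order_failed_tests(coverage: Dict[str, Set[int]], failed: List[str]) -> List[str]:
    """Order failed tests from the most unit-like (fewest inclusions) upward."""
    counts = inclusion_counts(coverage)
    return sorted(failed, key=lambda t: (counts[t], t))


# Hypothetical line coverage: the third test exercises both smaller ones.
coverage = {
    "test_parse": {10, 11},
    "test_eval": {20, 21},
    "test_parse_and_eval": {10, 11, 20, 21, 30},
}

counts = inclusion_counts(coverage)
ordered = order_failed_tests(coverage, ["test_parse_and_eval", "test_parse"])

if __name__ == "__main__":
    print(counts)
    print(ordered)
```

Inspecting `test_parse` first narrows the search to two lines before the broader test is examined, which is the kind of debugging-effort reduction the abstract evaluates through mutation analysis.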

...
