Collaborative Research:CI... - Mingyi Hon... - 美国国家科学基金(NSF...

Collaborative Research:CIF:Small:Inverse Reinforcement Learning with Heterogeneous Data:Estimation Algorithms with Finite Time and Sample Guarantees

项目来源

美国国家科学基金(NSF)

项目主持人

Mingyi Hong

项目受资助机构

UNIVERSITY OF MINNESOTA

财政年度

2025,2024

立项时间

未公开

项目编号

2414372

研究期限

未知 / 未知

项目级别

国家级

受资助金额

600000.00美元

学科

未公开

学科代码

未公开

基金类别

Standard Grant

关键词

Comm&Information Foundations ; Machine Learning Theory ; SMALL PROJECT ; NETWORK CODING AND INFO THEORY

参与者

未公开

参与机构

REGENTS OF THE UNIVERSITY OF MINNESOTA

项目标书摘要：Learning a structural model of dynamic decision-making helps us better understand and predict how agents,whether human or machine,make decisions over time in changing environments.Instead of just copying actions,this approach allows us to capture both the agent’s goals(preferences)and how it understands the world(environment dynamics).This provides a much deeper insight into behavior,enabling predictions about how the agent would act in new or unseen situations.Such models are valuable because they can help improve decision-making systems,allowing them to adapt and make reliable choices in complex real-world scenarios,such as personalized AI assistants,autonomous systems,or decision support tools.There is an urgent need for models and algorithms that can create such structural frameworks.The outcomes of this project will have broad applications,including areas like control systems,natural language processing,and autonomous driving.Moreover,these efforts offer valuable opportunities to enhance the optimization and reinforcement learning curriculum,engaging students from diverse backgrounds in cross-disciplinary research and K12 outreach initiatives.This project develops machine learning models of an agent’s dynamic decisions subject to structural constraints on observed behavior.Specifically,the agent’s observed behavior(data)is modeled as being consistent with the inter-temporal optimization of a reward function(preferences)given a representation of how the environment evolves pursuant to control actions(dynamics).Unlike behavioral cloning models,a structural model of observed control behavior is a solid basis to perform counterfactual analysis and/or transfer learning.However,developing structural models of control is computationally challenging and the statistical properties of structural estimators are not easy to characterize.This project aims to advance the state-of-the-art on methodologies for learning structural models of control,by considering a diverse set of data(including demonstration and preferences),and by considering both online and offline settings.Finally,extensive experiments will be conducted to evaluate and apply the proposed methodologies in aligning large language models(LLMs),and in autonomous driving.This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

人员信息

Mingyi Hong(Principal Investigator)：mhong@umn.edu；

机构信息

【University of Minnesota(Performance Institution)】StreetAddress：200 Union Street SE 4-174 Keller Hall,MINNEAPOLIS,Minnesota,United States/ZipCode：554552009；【REGENTS OF THE UNIVERSITY OF MINNESOTA】StreetAddress：2221 UNIVERSITY AVE SE STE 100,MINNEAPOLIS,Minnesota,United States/PhoneNumber：6126245599/ZipCode：554143074；

项目主管部门

Directorate for Computer and Information Science and Engineering(CSE)-Division of Computing and Communication Foundations(CCF)

项目官员

Phillip Regalia(Email：pregalia@nsf.gov；Phone：7032922981)

排序方式：时间相关性
显示方式：列表摘要

1.Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment

关键词：
;

Li, Jiaxiang;Zeng, Siliang;Wai, Hoi-To;Li, Chenliang;Garcia, Alfredo;Hong, Mingyi
《38th Conference on Neural Information Processing Systems, NeurIPS 2024》
2024年
December 9, 2024 - December 15, 2024
Vancouver, BC, Canada
会议

Aligning human preference and value is an important requirement for contemporary foundation models. State-of-the-art techniques such as Reinforcement Learning from Human Feedback (RLHF) often consist of two stages: 1) supervised fine-tuning (SFT), where the model is fine-tuned by learning from human demonstration data; 2) Preference learning, where preference data is used to learn a reward model, which is in turn used by a reinforcement learning (RL) step to fine-tune the model. Such reward model serves as a proxy to human preference, and it is critical to guide the RL step towards improving the model quality. In this work, we argue that the SFT stage significantly benefits from learning a reward model as well. Instead of using the human demonstration data directly via supervised learning, we propose to leverage an Inverse Reinforcement Learning (IRL) technique to simultaneously build an reward model and a policy model. This approach leads to new SFT algorithms that are not only efficient to implement, but are robust to the presence of low-quality supervised learning data. Moreover, we discover a connection between the proposed IRL based approach, and a recent line of works called Self-Play Fine-tune (SPIN, Chen et al. [2024]). Theoretically, we show that the proposed algorithms converge to the stationary solutions of the IRL problem. Empirically, we align 1B and 7B models using proposed methods and evaluate them on a reward benchmark model and the HuggingFace Open LLM Leaderboard. The proposed methods show significant performance improvement over existing SFT approaches. Our results indicate that it is beneficial to leverage reward learning throughout the entire alignment process. Our code is available at https://github.com/JasonJiaxiangLi/Reward_learning_SFT. © 2024 Neural information processing systems foundation. All rights reserved.

...

排序方式：时间相关性
显示方式：列表摘要