Semantic Deepening of Video-Based Human Behavior Understanding (映像に基づく人物行動理解の意味的深化)
Project source
Principal investigator
Funded institution
Project number
Year of approval
Date of approval
Project level
Research period
Funding amount
Discipline
Discipline code
Fund category
Keywords
Participants
Participating institutions
1. Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
- Keywords:
- Video understanding; First-person video; Egocentric; Video-language; 3D; Body pose; Networks; Dataset
- Grauman, Kristen;Westbury, Andrew;Torresani, Lorenzo;Kitani, Kris;Malik, Jitendra;Afouras, Triantafyllos;Ashutosh, Kumar;Baiyya, Vijay;Bansal, Siddhant;Boote, Bikram;Byrne, Eugene;Chavis, Zach;Chen, Joya;Cheng, Feng;Chu, Fu-Jen;Crane, Sean;Dasgupta, Avijit;Dong, Jing;Escobar, Maria;Forigua, Cristhian;Gebreselasie, Abrham;Haresh, Sanjay;Huang, Jing;Islam, Md Mohaiminul;Jain, Suyog;Khirodkar, Rawal;Kukreja, Devansh;Liang, Kevin J.;Liu, Jia-Wei;Majumder, Sagnik;Mao, Yongsen;Martin, Miguel;Mavroudi, Effrosyni;Nagarajan, Tushar;Ragusa, Francesco;Ramakrishnan, Santhosh Kumar;Seminara, Luigi;Somayazulu, Arjun;Song, Yale;Su, Shan;Xue, Zihui;Zhang, Edward;Zhang, Jinxu;Castillo, Angela;Chen, Changan;Fu, Xinzhu;Furuta, Ryosuke;Gonzalez, Cristina;Gupta, Prince;Hu, Jiabo;Huang, Yifei;Huang, Yiming;Khoo, Weslie;Kumar, Anush;Kuo, Robert;Lakhavani, Sach;Liu, Miao;Luo, Mi;Luo, Zhengyi;Meredith, Brighid;Miller, Austin;Oguntola, Oluwatumininu;Pan, Xiaqing;Peng, Penny;Pramanick, Shraman;Ramazanova, Merey;Ryan, Fiona;Shan, Wei;Somasundaram, Kiran;Song, Chenan;Southerland, Audrey;Tateno, Masatoshi;Wang, Huiyu;Wang, Yuchen;Yagi, Takuma;Yan, Mingfei;Yang, Xitong;Yu, Zecheng;Zha, Shengxin Cindy;Zhao, Chen;Zhao, Ziwei;Zhu, Zhifan;Zhuo, Jeff;Arbelaez, Pablo;Bertasius, Gedas;Crandall, David;Damen, Dima;Engel, Jakob;Farinella, Giovanni Maria;Furnari, Antonino;Ghanem, Bernard;Hoffman, Judy;Jawahar, C. V.;Newcombe, Richard;Park, Hyun Soo;Rehg, James M.;Sato, Yoichi;Savva, Manolis;Shi, Jianbo;Shou, Mike Zheng;Wray, Michael
- 《INTERNATIONAL JOURNAL OF COMPUTER VISION》
- 2025
- Volume
- Issue
- Journal
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions, including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources are open sourced to fuel new research in the community. https://ego-exo4d-data.org/
2. Masked Video and Body-Worn IMU Autoencoder for Egocentric Action Recognition
- Keywords:
- Graphic methods;Human engineering;Image coding;Motion capture;Motion Picture Experts Group standards;Signal encoding;Video analysis;Action recognition;Auto encoders;Egocentric action recognition;Human limbs;Inertial measurements units;Motion signals;Multi-modal;Multimodal masked autoencoder;Pre-training;Visual signals
- Zhang, Mingfang;Huang, Yifei;Liu, Ruicong;Sato, Yoichi
- 《18th European Conference on Computer Vision, ECCV 2024》
- 2025
- September 29, 2024 - October 4, 2024
- Milan, Italy
- Conference
Compared with visual signals, Inertial Measurement Units (IMUs) placed on human limbs can capture accurate motion signals while being robust to lighting variation and occlusion. While these characteristics are intuitively valuable for egocentric action recognition, the potential of IMUs remains under-explored. In this work, we present a novel method for action recognition that integrates motion data from body-worn IMUs with egocentric video. Due to the scarcity of labeled multimodal data, we design an MAE-based self-supervised pretraining method, obtaining strong multi-modal representations by modeling the natural correlation between visual and motion signals. To model the complex relation of multiple IMU devices placed across the body, we exploit the collaborative dynamics of multiple IMU devices and propose to embed the relative motion features of human joints into a graph structure. Experiments show our method can achieve state-of-the-art performance on multiple public datasets. The effectiveness of our MAE-based pretraining and graph-based IMU modeling is further validated by experiments in more challenging scenarios, including partially missing IMU devices and video quality corruption, promoting more flexible usage in the real world. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
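To illustrate the core idea of the masked visual-IMU pretraining described above, the following is a minimal PyTorch sketch: both modalities are tokenized, a random subset of tokens is masked, and a shared encoder-decoder reconstructs the masked tokens so that each modality can exploit the other. Module names, tensor shapes, and the omission of the paper's graph-based IMU relation modeling are simplifying assumptions, not the authors' released implementation.

```python
# Minimal sketch of a masked video + IMU autoencoder for self-supervised
# pretraining (illustrative assumptions, not the paper's exact architecture).
import torch
import torch.nn as nn


class VideoIMUMAE(nn.Module):
    def __init__(self, video_dim=768, imu_dim=6, d_model=256, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Project pre-extracted video patch tokens and per-device IMU windows
        # into a shared embedding space.
        self.video_proj = nn.Linear(video_dim, d_model)
        self.imu_proj = nn.Linear(imu_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.decoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.video_head = nn.Linear(d_model, video_dim)
        self.imu_head = nn.Linear(d_model, imu_dim)

    def random_mask(self, tokens):
        """Replace a random subset of tokens with the learned mask token."""
        b, n, _ = tokens.shape
        num_mask = int(n * self.mask_ratio)
        idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)
        mask = torch.zeros(b, n, dtype=torch.bool, device=tokens.device)
        mask.scatter_(1, idx[:, :num_mask], True)
        masked = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand_as(tokens), tokens)
        return masked, mask

    def forward(self, video_tokens, imu_tokens):
        # video_tokens: (B, Nv, video_dim); imu_tokens: (B, Ni, imu_dim)
        v = self.video_proj(video_tokens)
        m = self.imu_proj(imu_tokens)
        v_masked, v_mask = self.random_mask(v)
        m_masked, m_mask = self.random_mask(m)
        # Joint encoding lets visible tokens of one modality help reconstruct
        # masked tokens of the other (the cross-modal correlation).
        joint = self.decoder(self.encoder(torch.cat([v_masked, m_masked], dim=1)))
        v_rec = self.video_head(joint[:, : v.size(1)])
        m_rec = self.imu_head(joint[:, v.size(1):])
        loss = (nn.functional.mse_loss(v_rec[v_mask], video_tokens[v_mask])
                + nn.functional.mse_loss(m_rec[m_mask], imu_tokens[m_mask]))
        return loss


# Toy usage with random tensors standing in for real features.
model = VideoIMUMAE()
print(model(torch.randn(2, 32, 768), torch.randn(2, 16, 6)))
```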
3. Learning Multiple Object States from Actions via Large Language Models
- Keywords:
- Modeling languages;Language model;Large language model;Multi-label classifications;Multiple objects;Object state;Single object;State recognition;States change;Video recognition
- Tateno, Masatoshi;Yagi, Takuma;Furuta, Ryosuke;Sato, Yoichi
- 《2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025》
- 2025
- February 28, 2025 - March 4, 2025
- Tucson, AZ, United states
- Conference
Recognizing the states of objects in a video is crucial in understanding the scene beyond actions and objects. For instance, an egg can be raw, cracked, and whisked while cooking an omelet, and these states can coexist simultaneously (an egg can be both raw and whisked). However, most existing research assumes a single object state change (e.g., uncracked → cracked), overlooking the coexisting nature of multiple object states and the influence of past states on the current state. We formulate object state recognition as a multi-label classification task that explicitly handles multiple states. We then propose to learn multiple object states from narrated videos by leveraging LLMs to generate pseudo-labels from the transcribed narrations, capturing the influence of past states. The challenge is that narrations mostly describe human actions in the video but rarely explain object states. Therefore, we use the LLM's knowledge of the relationship between actions and states to derive the missing object states. We further accumulate the derived object states to incorporate past-state context when inferring current object state pseudo-labels. We newly collect the Multiple Object States Transition (MOST) dataset, which includes manual multi-label annotation for evaluation purposes, covering 60 object states across six object categories. Experimental results show that our model trained on LLM-generated pseudo-labels significantly outperforms strong vision-language models, demonstrating the effectiveness of our pseudo-labeling framework that considers past context via LLMs. © 2025 IEEE.
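A minimal sketch of the training signal implied by this abstract: object-state recognition is treated as multi-label classification, with per-state pseudo-labels derived by an LLM from narrations and a mask over states the LLM could not determine. The classifier architecture, the ignore mask, and the feature dimension are illustrative assumptions; the state count of 60 is taken from the MOST description above.

```python
# Minimal sketch of multi-label object-state training on LLM pseudo-labels.
import torch
import torch.nn as nn

K = 60  # number of object states (MOST covers 60 states)

classifier = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, K))
criterion = nn.BCEWithLogitsLoss(reduction="none")


def step(frame_features, pseudo_labels, valid_mask):
    """frame_features: (B, 512) visual features;
    pseudo_labels: (B, K) in {0, 1}, derived by an LLM from narrated actions;
    valid_mask: (B, K) marks states the LLM actually committed to."""
    logits = classifier(frame_features)
    loss = criterion(logits, pseudo_labels.float())
    # Only back-propagate through states the LLM could infer, so
    # undetermined states neither help nor hurt training.
    return (loss * valid_mask).sum() / valid_mask.sum().clamp(min=1)


# Toy usage with random stand-ins for real features and pseudo-labels.
feats = torch.randn(4, 512)
labels = torch.randint(0, 2, (4, K))
mask = torch.ones(4, K)
print(step(feats, labels, mask))
```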
4. SiMHand: Mining Similar Hands for Large-Scale 3D Hand Pose Pre-Training
- Keywords:
- Lin, Nie;Ohkawa, Takehiko;Huang, Yifei;Zhang, Mingfang;Cai, Minjie;Li, Ming;Furuta, Ryosuke;Sato, Yoichi
- 《13th International Conference on Learning Representations, ICLR 2025》
- 2025
- April 24, 2025 - April 28, 2025
- Singapore, Singapore
- Conference
We present a framework for pre-training of 3D hand pose estimation from in-the-wild hand images sharing similar hand characteristics, dubbed SiMHand. Pre-training with large-scale images achieves promising results in various tasks, but prior methods for 3D hand pose pre-training have not fully utilized the potential of diverse hand images accessible from in-the-wild videos. To facilitate scalable pre-training, we first prepare an extensive pool of hand images from in-the-wild videos and design our pre-training method with contrastive learning. Specifically, we collect over 2.0M hand images from recent human-centric videos, such as 100DOH and Ego4D. To extract discriminative information from these images, we focus on the similarity of hands: pairs of non-identical samples with similar hand poses. We then propose a novel contrastive learning method that embeds similar hand pairs closer in the feature space. Our method not only learns from similar samples but also adaptively weights the contrastive learning loss based on inter-sample distance, leading to additional performance gains. Our experiments demonstrate that our method outperforms conventional contrastive learning approaches that produce positive pairs solely from a single image with data augmentation. We achieve significant improvements over the state-of-the-art method (PeCLR) on various datasets, with gains of 15% on FreiHand, 10% on DexYCB, and 4% on AssemblyHands. Our code is available at https://github.com/ut-vision/SiMHand. © 2025 13th International Conference on Learning Representations, ICLR 2025. All rights reserved.
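A hedged sketch of the contrastive objective suggested by the abstract: mined similar-hand pairs serve as positives in an InfoNCE-style loss, and each pair's contribution is re-weighted by how close the two poses actually are. The exponential weighting and the temperature are illustrative choices, not the released SiMHand code.

```python
# Minimal sketch of contrastive pre-training with mined "similar hand" pairs.
import torch
import torch.nn.functional as F


def simhand_style_loss(anchor_emb, positive_emb, pose_dist, temperature=0.07):
    """anchor_emb, positive_emb: (B, D) embeddings of mined similar-hand pairs
    (non-identical images with similar poses); pose_dist: (B,) distance between
    the poses used for mining. Smaller distance -> more reliable positive ->
    larger weight (one possible weighting, assumed here)."""
    a = F.normalize(anchor_emb, dim=1)
    p = F.normalize(positive_emb, dim=1)
    logits = a @ p.t() / temperature            # (B, B): diagonal = positives
    targets = torch.arange(a.size(0), device=a.device)
    per_pair = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.exp(-pose_dist)             # down-weight dissimilar "positives"
    return (weights * per_pair).sum() / weights.sum()


# Toy usage with random embeddings and pose distances.
za, zp = torch.randn(8, 128), torch.randn(8, 128)
d = torch.rand(8)
print(simhand_style_loss(za, zp, d))
```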
5. Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos
- Keywords:
- Data scarcity;Egocentric vision;Full body;In-depth study;Instructional videos;Knowledge transfer;Time segments;Video captioning;Video understanding;Web video
- Ohkawa, Takehiko;Yagi, Takuma;Nishimura, Taichi;Furuta, Ryosuke;Hashimoto, Atsushi;Ushiku, Yoshitaka;Sato, Yoichi
- 《2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025》
- 2025
- February 28, 2025 - March 4, 2025
- Tucson, AZ, United states
- Conference
We propose a novel benchmark for cross-view knowledge transfer of dense video captioning, adapting models from web instructional videos with exocentric views to an egocentric view. While dense video captioning (predicting time segments and their captions) is primarily studied with exocentric videos (e.g., YouCook2), benchmarks with egocentric videos are restricted due to data scarcity. To overcome the limited video availability, transferring knowledge from abundant exocentric web videos is a practical approach. However, learning the correspondence between exocentric and egocentric views is difficult due to their dynamic view changes. The web videos contain shots showing either full-body or hand regions, while the egocentric view is constantly shifting. This necessitates the in-depth study of cross-view transfer under complex view changes. To this end, we first create a real-life egocentric dataset (EgoYC2) whose captions follow the definition of YouCook2 captions, enabling transfer learning between these datasets with access to their ground truth. To bridge the view gaps, we propose a view-invariant learning method using adversarial training, which consists of pretraining and finetuning stages. Our experiments confirm the effectiveness of overcoming the view change problem and knowledge transfer to egocentric views. Our benchmark pushes the study of cross-view transfer into a new task domain of dense video captioning and envisions methodologies that describe egocentric videos in natural language. © 2025 IEEE.
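The view-invariant adversarial training mentioned above can be illustrated with a gradient-reversal-based auxiliary loss: a discriminator tries to tell exocentric from egocentric clip features while the shared encoder is trained to fool it. The network sizes and the fixed lambda below are assumptions for this sketch; the paper's actual pretraining/finetuning schedule is not reproduced here.

```python
# Minimal sketch of view-invariant feature learning via adversarial training
# with a gradient-reversal layer (illustrative, not the paper's exact setup).
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) gradients flowing back into the shared encoder.
        return -ctx.lambd * grad_output, None


view_discriminator = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 2))
adv_criterion = nn.CrossEntropyLoss()


def adversarial_view_loss(features, view_labels, lambd=0.1):
    """features: (B, 512) clip features shared with the captioning head;
    view_labels: (B,) 0 = exocentric, 1 = egocentric."""
    reversed_feats = GradReverse.apply(features, lambd)
    logits = view_discriminator(reversed_feats)
    return adv_criterion(logits, view_labels)


# Toy usage: this loss would be added to the captioning loss during training.
feats = torch.randn(4, 512, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1])
adversarial_view_loss(feats, labels).backward()
```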
6. ActionVOS: Actions as Prompts for Video Object Segmentation
- Keywords:
- Image segmentation;Active object;Active object segmentation;Human actions;Objects segmentation;Referring expression comprehension;Referring expressions;Referring video object segmentation;States change;Target object;Video objects segmentations
- Ouyang, Liangyang;Liu, Ruicong;Huang, Yifei;Furuta, Ryosuke;Sato, Yoichi
- 《18th European Conference on Computer Vision, ECCV 2024》
- 2025
- September 29, 2024 - October 4, 2024
- Milan, Italy
- Conference
Delving into the realm of egocentric vision, the advancement of referring video object segmentation (RVOS) stands as pivotal in understanding human activities. However, the existing RVOS task primarily relies on static attributes such as object names to segment target objects, posing challenges in distinguishing target objects from background objects and in identifying objects undergoing state changes. To address these problems, this work proposes a novel action-aware RVOS setting called ActionVOS, aiming at segmenting only active objects in egocentric videos using human actions as a key language prompt. This is because human actions precisely describe the behavior of humans, thereby helping to identify the objects truly involved in the interaction and to understand possible state changes. We also build a method tailored to work under this specific setting. Specifically, we develop an action-aware labeling module with an efficient action-guided focal loss. These designs enable the ActionVOS model to prioritize active objects with existing readily available annotations. Experimental results on the VISOR dataset reveal that ActionVOS significantly reduces the mis-segmentation of inactive objects, confirming that actions help the ActionVOS model understand objects' involvement. Further evaluations on the VOST and VSCOS datasets show that the novel ActionVOS setting enhances segmentation performance when encountering challenging circumstances involving object state changes. We will make our implementation available at https://github.com/ut-vision/ActionVOS. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
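A minimal sketch of what an action-guided focal loss could look like: a standard binary focal loss over mask logits, up-weighted on pixels that the action-aware labeling marks as active. The weighting map and hyperparameters are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch of an action-guided focal loss for segmentation logits.
import torch
import torch.nn.functional as F


def action_guided_focal_loss(logits, targets, action_weight_map,
                             gamma=2.0, alpha=0.25):
    """logits, targets: (B, H, W) mask logits and {0, 1} ground truth;
    action_weight_map: (B, H, W), >= 1 where the action prompt marks an object
    as active, 1 elsewhere (one possible weighting, assumed here)."""
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    focal = alpha_t * (1 - p_t) ** gamma * ce
    # Pixels of action-relevant (active) objects contribute more to the loss.
    return (action_weight_map * focal).mean()


# Toy usage with random logits and a uniform weight map.
logits = torch.randn(2, 64, 64)
targets = torch.randint(0, 2, (2, 64, 64)).float()
weights = torch.ones(2, 64, 64)
print(action_guided_focal_loss(logits, targets, weights))
```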
...
