Semantic Deepening of Video-Based Human Behavior Understanding (映像に基づく人物行動理解の意味的深化)
Project source
Principal investigator
Funded institution
Project number
Year of approval
Date of approval
Project level
Research period
Funding amount
Discipline
Discipline code
Fund category
Keywords
Participants
Participating institutions
1. Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
- Keywords:
- Video understanding; First-person video; Egocentric; Video-language; 3D; Body pose; Networks; Dataset
- Grauman, Kristen;Westbury, Andrew;Torresani, Lorenzo;Kitani, Kris;Malik, Jitendra;Afouras, Triantafyllos;Ashutosh, Kumar;Baiyya, Vijay;Bansal, Siddhant;Boote, Bikram;Byrne, Eugene;Chavis, Zach;Chen, Joya;Cheng, Feng;Chu, Fu-Jen;Crane, Sean;Dasgupta, Avijit;Dong, Jing;Escobar, Maria;Forigua, Cristhian;Gebreselasie, Abrham;Haresh, Sanjay;Huang, Jing;Islam, Md Mohaiminul;Jain, Suyog;Khirodkar, Rawal;Kukreja, Devansh;Liang, Kevin J.;Liu, Jia-Wei;Majumder, Sagnik;Mao, Yongsen;Martin, Miguel;Mavroudi, Effrosyni;Nagarajan, Tushar;Ragusa, Francesco;Ramakrishnan, Santhosh Kumar;Seminara, Luigi;Somayazulu, Arjun;Song, Yale;Su, Shan;Xue, Zihui;Zhang, Edward;Zhang, Jinxu;Castillo, Angela;Chen, Changan;Fu, Xinzhu;Furuta, Ryosuke;Gonzalez, Cristina;Gupta, Prince;Hu, Jiabo;Huang, Yifei;Huang, Yiming;Khoo, Weslie;Kumar, Anush;Kuo, Robert;Lakhavani, Sach;Liu, Miao;Luo, Mi;Luo, Zhengyi;Meredith, Brighid;Miller, Austin;Oguntola, Oluwatumininu;Pan, Xiaqing;Peng, Penny;Pramanick, Shraman;Ramazanova, Merey;Ryan, Fiona;Shan, Wei;Somasundaram, Kiran;Song, Chenan;Southerland, Audrey;Tateno, Masatoshi;Wang, Huiyu;Wang, Yuchen;Yagi, Takuma;Yan, Mingfei;Yang, Xitong;Yu, Zecheng;Zha, Shengxin Cindy;Zhao, Chen;Zhao, Ziwei;Zhu, Zhifan;Zhuo, Jeff;Arbelaez, Pablo;Bertasius, Gedas;Crandall, David;Damen, Dima;Engel, Jakob;Farinella, Giovanni Maria;Furnari, Antonino;Ghanem, Bernard;Hoffman, Judy;Jawahar, C. V.;Newcombe, Richard;Park, Hyun Soo;Rehg, James M.;Sato, Yoichi;Savva, Manolis;Shi, Jianbo;Shou, Mike Zheng;Wray, Michael
- 《INTERNATIONAL JOURNAL OF COMPUTER VISION》
- 2025
- Volume
- Issue
- Journal
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions, including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources are open sourced to fuel new research in the community. https://ego-exo4d-data.org/
2. Masked Video and Body-Worn IMU Autoencoder for Egocentric Action Recognition
- Keywords:
- Graphic methods;Human engineering;Image coding;Motion capture;Motion Picture Experts Group standards;Signal encoding;Video analysis;Action recognition;Auto encoders;Egocentric action recognition;Human limbs;Inertial measurements units;Motion signals;Multi-modal;Multimodal masked autoencoder;Pre-training;Visual signals
- Zhang, Mingfang;Huang, Yifei;Liu, Ruicong;Sato, Yoichi
- 《18th European Conference on Computer Vision, ECCV 2024》
- 2025
- September 29, 2024 - October 4, 2024
- Milan, Italy
- Conference
Compared with visual signals, Inertial Measurement Units (IMUs) placed on human limbs can capture accurate motion signals while being robust to lighting variation and occlusion. While these characteristics are intuitively valuable for egocentric action recognition, the potential of IMUs remains under-explored. In this work, we present a novel method for action recognition that integrates motion data from body-worn IMUs with egocentric video. Due to the scarcity of labeled multimodal data, we design an MAE-based self-supervised pretraining method, obtaining strong multi-modal representations by modeling the natural correlation between visual and motion signals. To model the complex relation of multiple IMU devices placed across the body, we exploit the collaborative dynamics of multiple IMU devices and propose to embed the relative motion features of human joints into a graph structure. Experiments show our method can achieve state-of-the-art performance on multiple public datasets. The effectiveness of our MAE-based pretraining and graph-based IMU modeling is further validated by experiments in more challenging scenarios, including partially missing IMU devices and video quality corruption, promoting more flexible usage in the real world. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
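To illustrate the core idea of the masked visual-IMU pretraining described above, the following is a minimal PyTorch sketch: both modalities are tokenized, a random subset of tokens is masked, and a shared encoder-decoder reconstructs the masked tokens so that each modality can exploit the other. Module names, tensor shapes, and the omission of the paper's graph-based IMU relation modeling are simplifying assumptions, not the authors' released implementation.

```python
# Minimal sketch of a masked video + IMU autoencoder for self-supervised
# pretraining (illustrative assumptions, not the paper's exact architecture).
import torch
import torch.nn as nn


class VideoIMUMAE(nn.Module):
    def __init__(self, video_dim=768, imu_dim=6, d_model=256, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Project pre-extracted video patch tokens and per-device IMU windows
        # into a shared embedding space.
        self.video_proj = nn.Linear(video_dim, d_model)
        self.imu_proj = nn.Linear(imu_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.decoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.video_head = nn.Linear(d_model, video_dim)
        self.imu_head = nn.Linear(d_model, imu_dim)

    def random_mask(self, tokens):
        """Replace a random subset of tokens with the learned mask token."""
        b, n, _ = tokens.shape
        num_mask = int(n * self.mask_ratio)
        idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)
        mask = torch.zeros(b, n, dtype=torch.bool, device=tokens.device)
        mask.scatter_(1, idx[:, :num_mask], True)
        masked = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand_as(tokens), tokens)
        return masked, mask

    def forward(self, video_tokens, imu_tokens):
        # video_tokens: (B, Nv, video_dim); imu_tokens: (B, Ni, imu_dim)
        v = self.video_proj(video_tokens)
        m = self.imu_proj(imu_tokens)
        v_masked, v_mask = self.random_mask(v)
        m_masked, m_mask = self.random_mask(m)
        # Joint encoding lets visible tokens of one modality help reconstruct
        # masked tokens of the other (the cross-modal correlation).
        joint = self.decoder(self.encoder(torch.cat([v_masked, m_masked], dim=1)))
        v_rec = self.video_head(joint[:, : v.size(1)])
        m_rec = self.imu_head(joint[:, v.size(1):])
        loss = (nn.functional.mse_loss(v_rec[v_mask], video_tokens[v_mask])
                + nn.functional.mse_loss(m_rec[m_mask], imu_tokens[m_mask]))
        return loss


# Toy usage with random tensors standing in for real features.
model = VideoIMUMAE()
print(model(torch.randn(2, 32, 768), torch.randn(2, 16, 6)))
```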
3. Learning Multiple Object States from Actions via Large Language Models
- Keywords:
- Modeling languages;Language model;Large language model;Multi-label classifications;Multiple objects;Object state;Single object;State recognition;States change;Video recognition
- Tateno, Masatoshi;Yagi, Takuma;Furuta, Ryosuke;Sato, Yoichi
- 《2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025》
- 2025
- February 28, 2025 - March 4, 2025
- Tucson, AZ, United states
- Conference
Recognizing the states of objects in a video is crucial in understanding the scene beyond actions and objects. For instance, an egg can be raw, cracked, and whisked while cooking an omelet, and these states can coexist simultaneously (an egg can be both raw and whisked). However, most existing research assumes a single object state change (e.g., uncracked → cracked), overlooking the coexisting nature of multiple object states and the influence of past states on the current state. We formulate object state recognition as a multi-label classification task that explicitly handles multiple states. We then propose to learn multiple object states from narrated videos by leveraging LLMs to generate pseudo-labels from the transcribed narrations, capturing the influence of past states. The challenge is that narrations mostly describe human actions in the video but rarely explain object states. Therefore, we use the LLM's knowledge of the relationship between actions and states to derive the missing object states. We further accumulate the derived object states to incorporate past-state context when inferring current object state pseudo-labels. We newly collect the Multiple Object States Transition (MOST) dataset, which includes manual multi-label annotation for evaluation purposes, covering 60 object states across six object categories. Experimental results show that our model trained on LLM-generated pseudo-labels significantly outperforms strong vision-language models, demonstrating the effectiveness of our pseudo-labeling framework that considers past context via LLMs. © 2025 IEEE.
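A minimal sketch of the training signal implied by this abstract: object-state recognition is treated as multi-label classification, with per-state pseudo-labels derived by an LLM from narrations and a mask over states the LLM could not determine. The classifier architecture, the ignore mask, and the feature dimension are illustrative assumptions; the state count of 60 is taken from the MOST description above.

```python
# Minimal sketch of multi-label object-state training on LLM pseudo-labels.
import torch
import torch.nn as nn

K = 60  # number of object states (MOST covers 60 states)

classifier = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, K))
criterion = nn.BCEWithLogitsLoss(reduction="none")


def step(frame_features, pseudo_labels, valid_mask):
    """frame_features: (B, 512) visual features;
    pseudo_labels: (B, K) in {0, 1}, derived by an LLM from narrated actions;
    valid_mask: (B, K) marks states the LLM actually committed to."""
    logits = classifier(frame_features)
    loss = criterion(logits, pseudo_labels.float())
    # Only back-propagate through states the LLM could infer, so
    # undetermined states neither help nor hurt training.
    return (loss * valid_mask).sum() / valid_mask.sum().clamp(min=1)


# Toy usage with random stand-ins for real features and pseudo-labels.
feats = torch.randn(4, 512)
labels = torch.randint(0, 2, (4, K))
mask = torch.ones(4, K)
print(step(feats, labels, mask))
```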
4. SiMHand: Mining Similar Hands for Large-Scale 3D Hand Pose Pre-Training
- Keywords:
- Lin, Nie;Ohkawa, Takehiko;Huang, Yifei;Zhang, Mingfang;Cai, Minjie;Li, Ming;Furuta, Ryosuke;Sato, Yoichi
- 《13th International Conference on Learning Representations, ICLR 2025》
- 2025
- April 24, 2025 - April 28, 2025
- Singapore, Singapore
- Conference
We present a framework for pre-training of 3D hand pose estimation from in-the-wild hand images sharing similar hand characteristics, dubbed SiMHand. Pre-training with large-scale images achieves promising results in various tasks, but prior methods for 3D hand pose pre-training have not fully utilized the potential of diverse hand images accessible from in-the-wild videos. To facilitate scalable pre-training, we first prepare an extensive pool of hand images from in-the-wild videos and design our pre-training method with contrastive learning. Specifically, we collect over 2.0M hand images from recent human-centric videos, such as 100DOH and Ego4D. To extract discriminative information from these images, we focus on the similarity of hands: pairs of non-identical samples with similar hand poses. We then propose a novel contrastive learning method that embeds similar hand pairs closer in the feature space. Our method not only learns from similar samples but also adaptively weights the contrastive learning loss based on inter-sample distance, leading to additional performance gains. Our experiments demonstrate that our method outperforms conventional contrastive learning approaches that produce positive pairs solely from a single image with data augmentation. We achieve significant improvements over the state-of-the-art method (PeCLR) on various datasets, with gains of 15% on FreiHand, 10% on DexYCB, and 4% on AssemblyHands. Our code is available at https://github.com/ut-vision/SiMHand. © 2025 13th International Conference on Learning Representations, ICLR 2025. All rights reserved.
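A hedged sketch of the contrastive objective suggested by the abstract: mined similar-hand pairs serve as positives in an InfoNCE-style loss, and each pair's contribution is re-weighted by how close the two poses actually are. The exponential weighting and the temperature are illustrative choices, not the released SiMHand code.

```python
# Minimal sketch of contrastive pre-training with mined "similar hand" pairs.
import torch
import torch.nn.functional as F


def simhand_style_loss(anchor_emb, positive_emb, pose_dist, temperature=0.07):
    """anchor_emb, positive_emb: (B, D) embeddings of mined similar-hand pairs
    (non-identical images with similar poses); pose_dist: (B,) distance between
    the poses used for mining. Smaller distance -> more reliable positive ->
    larger weight (one possible weighting, assumed here)."""
    a = F.normalize(anchor_emb, dim=1)
    p = F.normalize(positive_emb, dim=1)
    logits = a @ p.t() / temperature            # (B, B): diagonal = positives
    targets = torch.arange(a.size(0), device=a.device)
    per_pair = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.exp(-pose_dist)             # down-weight dissimilar "positives"
    return (weights * per_pair).sum() / weights.sum()


# Toy usage with random embeddings and pose distances.
za, zp = torch.randn(8, 128), torch.randn(8, 128)
d = torch.rand(8)
print(simhand_style_loss(za, zp, d))
```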
5. Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos
- Keywords:
- Data scarcity;Egocentric vision;Full body;In-depth study;Instructional videos;Knowledge transfer;Time segments;Video captioning;Video understanding;Web video
- Ohkawa, Takehiko;Yagi, Takuma;Nishimura, Taichi;Furuta, Ryosuke;Hashimoto, Atsushi;Ushiku, Yoshitaka;Sato, Yoichi
- 《2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025》
- 2025
- February 28, 2025 - March 4, 2025
- Tucson, AZ, United states
- Conference
We propose a novel benchmark for cross-view knowledge transfer of dense video captioning, adapting models from web instructional videos with exocentric views to an egocentric view. While dense video captioning (predicting time segments and their captions) is primarily studied with exocentric videos (e.g., YouCook2), benchmarks with egocentric videos are restricted due to data scarcity. To overcome the limited video availability, transferring knowledge from abundant exocentric web videos is a practical approach. However, learning the correspondence between exocentric and egocentric views is difficult due to their dynamic view changes. The web videos contain shots showing either full-body or hand regions, while the egocentric view is constantly shifting. This necessitates the in-depth study of cross-view transfer under complex view changes. To this end, we first create a real-life egocentric dataset (EgoYC2) whose captions follow the definition of YouCook2 captions, enabling transfer learning between these datasets with access to their ground truth. To bridge the view gaps, we propose a view-invariant learning method using adversarial training, which consists of pretraining and finetuning stages. Our experiments confirm the effectiveness of overcoming the view change problem and knowledge transfer to egocentric views. Our benchmark pushes the study of cross-view transfer into a new task domain of dense video captioning and envisions methodologies that describe egocentric videos in natural language. © 2025 IEEE.
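The view-invariant adversarial training mentioned above can be illustrated with a gradient-reversal-based auxiliary loss: a discriminator tries to tell exocentric from egocentric clip features while the shared encoder is trained to fool it. The network sizes and the fixed lambda below are assumptions for this sketch; the paper's actual pretraining/finetuning schedule is not reproduced here.

```python
# Minimal sketch of view-invariant feature learning via adversarial training
# with a gradient-reversal layer (illustrative, not the paper's exact setup).
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) gradients flowing back into the shared encoder.
        return -ctx.lambd * grad_output, None


view_discriminator = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 2))
adv_criterion = nn.CrossEntropyLoss()


def adversarial_view_loss(features, view_labels, lambd=0.1):
    """features: (B, 512) clip features shared with the captioning head;
    view_labels: (B,) 0 = exocentric, 1 = egocentric."""
    reversed_feats = GradReverse.apply(features, lambd)
    logits = view_discriminator(reversed_feats)
    return adv_criterion(logits, view_labels)


# Toy usage: this loss would be added to the captioning loss during training.
feats = torch.randn(4, 512, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1])
adversarial_view_loss(feats, labels).backward()
```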
6. ActionVOS: Actions as Prompts for Video Object Segmentation
- Keywords:
- Image segmentation;Active object;Active object segmentation;Human actions;Objects segmentation;Referring expression comprehension;Referring expressions;Referring video object segmentation;States change;Target object;Video objects segmentations
- Ouyang, Liangyang;Liu, Ruicong;Huang, Yifei;Furuta, Ryosuke;Sato, Yoichi
- 《18th European Conference on Computer Vision, ECCV 2024》
- 2025
- September 29, 2024 - October 4, 2024
- Milan, Italy
- Conference
Delving into the realm of egocentric vision, the advancement of referring video object segmentation (RVOS) stands as pivotal in understanding human activities. However, the existing RVOS task primarily relies on static attributes such as object names to segment target objects, posing challenges in distinguishing target objects from background objects and in identifying objects undergoing state changes. To address these problems, this work proposes a novel action-aware RVOS setting called ActionVOS, aiming at segmenting only active objects in egocentric videos using human actions as a key language prompt. This is because human actions precisely describe the behavior of humans, thereby helping to identify the objects truly involved in the interaction and to understand possible state changes. We also build a method tailored to work under this specific setting. Specifically, we develop an action-aware labeling module with an efficient action-guided focal loss. These designs enable the ActionVOS model to prioritize active objects with existing readily available annotations. Experimental results on the VISOR dataset reveal that ActionVOS significantly reduces the mis-segmentation of inactive objects, confirming that actions help the ActionVOS model understand objects' involvement. Further evaluations on the VOST and VSCOS datasets show that the novel ActionVOS setting enhances segmentation performance when encountering challenging circumstances involving object state changes. We will make our implementation available at https://github.com/ut-vision/ActionVOS. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
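A minimal sketch of what an action-guided focal loss could look like: a standard binary focal loss over mask logits, up-weighted on pixels that the action-aware labeling marks as active. The weighting map and hyperparameters are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch of an action-guided focal loss for segmentation logits.
import torch
import torch.nn.functional as F


def action_guided_focal_loss(logits, targets, action_weight_map,
                             gamma=2.0, alpha=0.25):
    """logits, targets: (B, H, W) mask logits and {0, 1} ground truth;
    action_weight_map: (B, H, W), >= 1 where the action prompt marks an object
    as active, 1 elsewhere (one possible weighting, assumed here)."""
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    focal = alpha_t * (1 - p_t) ** gamma * ce
    # Pixels of action-relevant (active) objects contribute more to the loss.
    return (action_weight_map * focal).mean()


# Toy usage with random logits and a uniform weight map.
logits = torch.randn(2, 64, 64)
targets = torch.randint(0, 2, (2, 64, 64)).float()
weights = torch.ones(2, 64, 64)
print(action_guided_focal_loss(logits, targets, weights))
```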
...
