Adaptive Video Streaming with Layered Neural Codecs for Both Machine and Human Vision
1. Leveraging Temporal Down-Sampling Structure and Spatio-Temporal Fusion for Efficient Video Coding
- Keywords:
- deep learning; low-bitrate; video coding; video enhancement
- He, Keren; Gao, Yufei; Wang, Qi; Wang, Haixin; Zhou, Jinjia
- 《Sensors》
- 2026
- Vol. 26
- No. 5
- Journal
Down-sampling-based video compression frameworks have shown great potential in improving compression efficiency in modern sensing and imaging systems. However, existing methods ignore critical spatial and temporal redundancy and treat all frames uniformly during down-sampling, which discards important information and hurts compression efficiency. To address these limitations, this paper proposes a temporal down-sampling system in which only intermediate frames are down-sampled, while key frames are preserved at high quality for reference. On the decoding side, we employ a frame-recurrent enhancement mechanism to maximize the use of temporal redundancy information. In the fusion and enhancement stage, we design a Multi-scale Temporal-Spatial Attention (MTSA) module. MTSA consists of two components: Multi-Temporal Attention (MTA) and Pyramid Spatial Attention (PSA). MTA performs multi-scale temporal correlation modeling, expanding the receptive field and providing stable cues in compressed regions. PSA integrates local spatial saliency and contextual structure in a progressive, multi-stage manner. Extensive experiments show that our approach achieves consistent BD-rate reductions. Under All-Intra, Low-Delay-P, and Random Access configurations, we observe BD-rate reductions for I, P, and B frames ranging from 14% to 39% compared to VVC, and the approach outperforms prior methods anchored to the HEVC standard.
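As an illustration of the temporal down-sampling structure described above, the sketch below down-samples only intermediate frames while keeping key frames at full resolution, and blends the last key frame into the reconstruction as a crude stand-in for the frame-recurrent enhancement stage. The pooling, up-sampling, and blending choices, and all function names, are illustrative assumptions, not the paper's actual codec:

```python
import numpy as np

def downsample2x(frame):
    """2x spatial down-sampling via average pooling (stand-in for a learned encoder)."""
    h, w = frame.shape
    return frame.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2x(frame):
    """2x nearest-neighbour up-sampling (stand-in for a learned decoder)."""
    return frame.repeat(2, axis=0).repeat(2, axis=1)

def temporal_downsample(frames, key_interval=4):
    """Down-sample only intermediate frames; key frames keep full resolution."""
    coded = []
    for t, f in enumerate(frames):
        if t % key_interval == 0:
            coded.append(("key", f))                 # high-quality reference frame
        else:
            coded.append(("inter", downsample2x(f)))
    return coded

def decode(coded):
    """Up-sample intermediate frames, blending with the last key frame as a
    crude stand-in for frame-recurrent enhancement."""
    out, last_key = [], None
    for kind, f in coded:
        if kind == "key":
            last_key = f
            out.append(f)
        else:
            up = upsample2x(f)
            out.append(0.5 * up + 0.5 * last_key if last_key is not None else up)
    return out
```

The point of the structure is that intermediate frames cost fewer bits while key frames remain clean anchors for reference.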
2. On Demand Secure Scalable Video Streaming for Both Human and Machine Applications
- Keywords:
- Codes (symbols);Cryptography;Data privacy;Efficiency;Image communication systems;Man machine systems;Network security;Security systems;Video streaming;Deep video coding;Encrypted video streaming;Heterogeneous devices;High-efficiency video coding;Machine analysis;On demands;Scalable video streaming;Scalable video-coding;Video coding for machine;Video-streaming
- Zain, Alaa; Fan, Yibo; Zhou, Jinjia
- 《Sensors》
- 2026
- Vol. 26
- No. 4
- Journal
Scalable video coding plays an essential role in supporting heterogeneous devices, network conditions, and application requirements in modern video streaming systems. However, most existing scalable coding approaches primarily optimize human perceptual quality and provide limited support for data privacy, machine analysis, and the integration of heterogeneous sensor data. This limitation motivates the adaptive scalable video coding framework developed here. The proposed approach is designed to serve both human viewers and automated analysis systems while ensuring high security and compression efficiency. The method adaptively encrypts selected layers during transmission to protect sensitive content without degrading decoding or analysis performance. Experimental evaluations on benchmark datasets demonstrate that the proposed framework achieves superior rate-distortion efficiency and reconstruction quality, while also improving machine analysis accuracy compared to existing traditional and learning-based codecs. In video surveillance scenarios, where the base layer is preserved for analysis, the proposed scalable human-machine coding (SHMC) method outperforms Scalable High Efficiency Video Coding (SHVC), the scalable extension of H.265/High Efficiency Video Coding (HEVC), reducing average bits per pixel (bpp) by 26.38%, 30.76%, and 60.29% at equivalent mean Average Precision (mAP), Peak Signal-to-Noise Ratio (PSNR), and Multi-Scale Structural Similarity (MS-SSIM) levels, respectively. These results confirm the effectiveness of integrating scalable video coding with intelligent encryption for secure and efficient video transmission. © 2026 by the authors.
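The layer-selective protection idea can be sketched as follows: the base layer is left in the clear so machine analysis can proceed, while enhancement layers are encrypted before transmission. The XOR keystream below is a toy stand-in for illustration only, not real cryptography and not the paper's actual scheme; all names are hypothetical:

```python
import hashlib
from itertools import count

def keystream(key: bytes):
    """Toy keystream from iterated SHA-256 (illustration only, not secure)."""
    for i in count():
        yield from hashlib.sha256(key + i.to_bytes(8, "big")).digest()

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR data against the keystream; applying it twice recovers the input."""
    ks = keystream(key)
    return bytes(b ^ next(ks) for b in data)

def protect_layers(layers, key, encrypt_from=1):
    """Leave layers [0, encrypt_from) in the clear (e.g. a base layer kept
    decodable for machine analysis) and encrypt the rest with per-layer keys."""
    return [
        layer if i < encrypt_from else xor_bytes(layer, key + bytes([i]))
        for i, layer in enumerate(layers)
    ]
```

A surveillance client with no key can still decode the base layer and run detection; only viewers holding the key reconstruct the full-quality enhancement layers.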
3. Style-Aware Music-to-Dance Generation via Multi-Stage Unit Decomposition and Recombination
- Keywords:
- Audio acoustics;Cluster analysis;Clustering algorithms;Computer music;Mapping;Signal processing;Audio features;Complex mapping;Feature motion;Human motions;Multi-stage framework;Multi-stages;Music-to-dance;Recombination-based generation;Specific learning;Style-specific learning
- Gao, Yufei; Wu, Qian; He, Keren; Zhou, Jinjia
- 《32nd International Conference on Neural Information Processing, ICONIP 2025》
- 2026
- November 20, 2025 - November 24, 2025
- Okinawa, Japan
- Conference
Generating dance animations from music presents significant challenges in artificial intelligence, requiring systems to capture the complex mapping between audio features and human motion. We introduce a novel multi-stage decomposition-recombination network for music-driven dance synthesis that addresses two critical limitations in existing approaches. First, unlike end-to-end deep learning models that directly transform music into posture data, often resulting in unrealistic movements, our approach decomposes both music and dance into fundamental units and learns mappings between them, preserving the naturalness of human motion. Second, we establish that style-specific training delivers more distinctive and stylistically coherent choreography than mixed-style approaches. Our framework implements a three-stage process: (1) an accumulation stage that constructs music and dance unit dictionaries through clustering techniques, (2) a learning stage that trains style-specific mapping models between these units, and (3) a creation stage that recombines units to generate coherent dance sequences. Quantitative evaluations show that our method outperforms the baseline approaches by 26.9% in Fréchet Inception Distance (FID) and 13.1% in music-dance correspondence scores. Qualitative analysis confirms our framework's ability to generate dance sequences with both improved temporal coherence and clear stylistic characteristics across breaking, hip-hop, locking, and street jazz styles. The proposed approach offers a significant advancement in realistic, style-specific music-to-dance synthesis. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2026.
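A minimal sketch of the accumulation and learning stages, assuming k-means-style clustering over feature windows to build the unit dictionaries and a simple co-occurrence count for the unit mapping; the actual system's features, clustering, and mapping models are more sophisticated, and all names here are illustrative:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: clusters feature windows into a 'unit dictionary'."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each window to its nearest center, then recompute centers
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return centers, labels

def learn_unit_mapping(music_labels, dance_labels, n_music, n_dance):
    """Map each music unit to the dance unit it most often co-occurs with."""
    counts = np.zeros((n_music, n_dance), dtype=int)
    for m, d in zip(music_labels, dance_labels):
        counts[m, d] += 1
    return counts.argmax(axis=1)  # music unit index -> dance unit index
```

The creation stage would then walk a new piece of music, label each window with its nearest music unit, and emit the mapped dance units in sequence.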
4. High-Frequency Semantic Enhancement in Compressed Scenarios for Robust Visual and Machine Vision Applications
- Keywords:
- Computer vision;Image coding;Machine Perception;Machine vision;Man machine systems;Object detection;Object recognition;Semantic Segmentation;Semantics;Growing demand;High frequency HF;Human vision;Machine-vision;Post-processing;Post-processing techniques;Semantic enhancements;Video coding for machine;Video processing;Vision applications
- He, Keren; Fu, Chen; Gao, Guangwei; Zhou, Jinjia
- 《32nd IEEE International Conference on Image Processing, ICIP 2025》
- 2025
- September 14, 2025 - September 17, 2025
- Anchorage, AK, United States
- Conference
With the growing demand for video processing in both human and machine vision, optimizing post-processing techniques has become a crucial challenge. To address the limitations of current post-processing methods in these domains, this paper introduces a novel approach that enhances high-frequency information through semantic enhancement. We propose a High Semantic Extraction (HSE) model to capture more recognizable details, and design a High-Frequency Semantic Fusion (HFSF) strategy that preserves critical details while suppressing noise. Experimental results demonstrate that our method effectively improves object detection, semantic segmentation, and video quality, marking a significant advancement in optimizing video processing for both human and machine vision. ©2025 IEEE.
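The high-frequency enhancement idea can be illustrated with a classical unsharp-mask-style operation: extract the high-frequency residual, suppress sub-threshold components as noise, and fuse the boosted detail back into the frame. This is only an analogy to the learned HSE/HFSF modules, with hypothetical names and parameters:

```python
import numpy as np

def box_blur(img, r=1):
    """Simple box blur via padded neighbourhood averaging."""
    p = np.pad(img, r, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out += p[r + dy : r + dy + img.shape[0], r + dx : r + dx + img.shape[1]]
    return out / (2 * r + 1) ** 2

def enhance_high_freq(img, gain=1.5, noise_floor=0.01):
    """Unsharp-mask-style enhancement on a [0, 1] grayscale image: boost
    high-frequency detail while zeroing sub-threshold residue as noise."""
    high = img - box_blur(img)                              # high-frequency residual
    high = np.where(np.abs(high) < noise_floor, 0.0, high)  # crude noise suppression
    return np.clip(img + gain * high, 0.0, 1.0)             # fuse boosted detail back
```

In the compressed-video setting, the learned modules play the role of the residual extraction and noise gating here, but select details that matter to both detectors and human viewers rather than boosting all frequencies uniformly.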
