StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
Authors: Haibo Wang‡‡, Bo Feng, Zhengfeng Lai, Mingze Xu, Shiyu Li, Weifeng Ge†, Afshin Dehghan, Meng Cao, Ping Huang
We present StreamBridge, a simple yet effective framework that transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models to online scenarios: (1) limited capability for multi-turn real-time understanding, and (2) the lack of proactive response mechanisms. Specifically, StreamBridge incorporates (1) a memory buffer combined with a round-decayed compression strategy, supporting long-context multi-turn interactions, and (2) a decoupled, lightweight activation model that can be integrated into existing Video-LLMs without modifying them, enabling continuous proactive responses. To further support StreamBridge, we construct Stream-IT, a large-scale dataset tailored for streaming video understanding, featuring interleaved video-text sequences and diverse instruction formats. Extensive experiments show that StreamBridge significantly improves the streaming understanding capabilities of offline Video-LLMs across various tasks, outperforming even proprietary models such as GPT-4o and Gemini 1.5 Pro. At the same time, it achieves competitive or superior performance on standard video understanding benchmarks.
† Fudan University
‡‡ Work done during Apple internship
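The abstract describes a memory buffer with round-decayed compression but gives no implementation details. The following is a minimal sketch of one way such a buffer could behave, under stated assumptions: each dialogue round contributes a list of visual token vectors, the buffer has a fixed token budget, and older rounds are collapsed by mean pooling before newer ones are touched. The class name `RoundMemoryBuffer`, the `max_tokens` parameter, and the choice of mean pooling are all hypothetical and are not taken from the paper.

```python
from typing import List

Token = List[float]  # one visual token embedding (hypothetical representation)


def mean_pool(tokens: List[Token]) -> List[Token]:
    """Collapse a group of token vectors into a single averaged vector."""
    n = len(tokens)
    dim = len(tokens[0])
    return [[sum(t[d] for t in tokens) / n for d in range(dim)]]


class RoundMemoryBuffer:
    """Illustrative buffer: keeps recent rounds intact, compresses old ones first."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens          # assumed token budget
        self.rounds: List[List[Token]] = []   # one token list per dialogue round

    def total_tokens(self) -> int:
        return sum(len(r) for r in self.rounds)

    def add_round(self, tokens: List[Token]) -> None:
        """Append the newest round, then compress older rounds if over budget."""
        self.rounds.append(list(tokens))
        self._compress()

    def _compress(self) -> None:
        # Walk from the oldest round toward the newest, never touching the
        # current round, and stop as soon as the budget is satisfied.
        for i in range(len(self.rounds) - 1):
            if self.total_tokens() <= self.max_tokens:
                return
            if len(self.rounds[i]) > 1:
                self.rounds[i] = mean_pool(self.rounds[i])


buf = RoundMemoryBuffer(max_tokens=5)
buf.add_round([[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]])  # round 0: 3 tokens
buf.add_round([[0.0, 2.0], [2.0, 0.0], [4.0, 4.0]])  # round 1 pushes total to 6
print(buf.total_tokens())  # round 0 was pooled to one token, round 1 is intact
```

In this sketch the most recent round always stays at full resolution, so the budget can still be exceeded if a single round is larger than `max_tokens`; a real system would need a policy for that case.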
May 30, 2025 · Research areas: Methods and Algorithms, Speech and Natural Language Processing
April 11, 2025 · Research areas: Computer Vision, Speech and Natural Language Processing · Conference: ICLR