Improve Vision Language Model Chain-of-thought Reasoning

Authors: Ruohong Zhang†, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun‡, Zhe Gan, Yinfei Yang, Ruoming Pang, Yiming Yang‡

Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness. However, current training recipes often rely on datasets dominated by short annotations with minimal rationales. In this work, we show that training VLMs on short answers leads to poor generalization on reasoning tasks that require more detailed explanations. To address this limitation, we propose a two-stage post-training strategy that extends the usage of short-answer data for enhanced CoT reasoning. First, we augment short answers with CoT reasoning generated by GPT-4o and fine-tune the VLM on the augmented data to strengthen its CoT capabilities. Second, we leverage short answers as outcome rewards for reinforcement learning. Specifically, short answers are used as correctness indicators to construct positive (correct) and negative (incorrect) pairs from model-generated reasoning chains. These pairs are then used to calibrate the model’s reasoning via Direct Preference Optimization (DPO). Our experiments show significant improvements in CoT reasoning on benchmark datasets, along with enhanced generalization to direct answer prediction. This work provides a critical data resource for VLM CoT training and demonstrates the effectiveness of outcome rewards for post-training multimodal models.
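
The second stage can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Answer:` suffix convention, the helper names (`extract_answer`, `build_preference_pairs`, `dpo_loss`), the exact-match comparison, and the value `beta=0.1` are all assumptions made for illustration. The loss is the standard per-pair DPO objective, computed here from scalar sequence log-probabilities rather than from a real model.

```python
import math
from itertools import product


def extract_answer(chain: str) -> str:
    """Hypothetical parser: assumes each chain ends with 'Answer: <text>'."""
    marker = "Answer:"
    idx = chain.rfind(marker)
    if idx == -1:
        return ""  # no final answer found; treated as incorrect below
    return chain[idx + len(marker):].strip().lower()


def build_preference_pairs(chains, short_answer, max_pairs=None):
    """Use the short answer as a correctness indicator to form
    (chosen, rejected) pairs from model-generated reasoning chains."""
    gold = short_answer.strip().lower()
    correct = [c for c in chains if extract_answer(c) == gold]
    incorrect = [c for c in chains if extract_answer(c) != gold]
    pairs = list(product(correct, incorrect))
    return pairs if max_pairs is None else pairs[:max_pairs]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(margin)), where the
    margin compares policy-vs-reference log-ratios of chosen and rejected."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Usage: three sampled chains for one question whose short answer is "3".
chains = [
    "The chart shows three bars. Answer: 3",
    "I count four bars in total. Answer: 4",
    "Bars appear at 2019, 2020 and 2021. Answer: 3",
]
pairs = build_preference_pairs(chains, "3")
```

Each correct chain is paired with each incorrect one, so a question whose samples are all correct or all incorrect contributes no pairs under this scheme. A real pipeline would likely need a more forgiving answer matcher than exact string comparison.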

† Work done while at Apple
‡ Carnegie Mellon University
