Distillation Scaling Laws
Authors: Dan Busbridge, Amitis Shidani†, Floris Weers, Jason Ramapuram, Etai Littwin, Russ Webb
We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student to maximize student performance. We provide compute-optimal distillation recipes for two key scenarios: when a teacher already exists, and when a teacher needs training. In settings involving many students or an existing teacher, distillation outperforms supervised learning up to a compute level that scales predictably with student size. Conversely, if only one student is to be distilled and a teacher also requires training, supervised learning is generally preferable. Additionally, our large-scale study of distillation increases our understanding of the process and helps inform experimental design.
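To make the idea of compute-optimal allocation concrete, below is a minimal sketch of how one might search over the split of a fixed compute budget between teacher and student training. The loss functions, coefficients, and the way teacher quality enters the student loss are illustrative assumptions (a generic Chinchilla-style form), not the fitted distillation scaling law from the paper.

```python
import numpy as np

# Hypothetical Chinchilla-style loss surface, used purely for illustration.
# The coefficients and exponents below are placeholders, not the paper's fits.
def supervised_loss(params, tokens, E=1.7, A=400.0, B=1200.0, alpha=0.34, beta=0.28):
    return E + A / params**alpha + B / tokens**beta

def distilled_student_loss(student_params, student_tokens, teacher_loss, gamma=0.15):
    # Assumed interaction: a stronger teacher (lower loss) pulls the student's
    # achievable loss down. This functional form is an assumption for the sketch.
    return (1.0 - gamma) * supervised_loss(student_params, student_tokens) + gamma * teacher_loss

def best_allocation(total_compute, student_params, teacher_params,
                    splits=np.linspace(0.05, 0.95, 19)):
    """Grid-search the fraction of compute spent on the teacher (C ≈ 6·N·D heuristic)."""
    best = None
    for f in splits:
        teacher_tokens = f * total_compute / (6 * teacher_params)
        student_tokens = (1 - f) * total_compute / (6 * student_params)
        L_T = supervised_loss(teacher_params, teacher_tokens)
        L_S = distilled_student_loss(student_params, student_tokens, L_T)
        if best is None or L_S < best[1]:
            best = (f, L_S)
    return best

frac, loss = best_allocation(total_compute=1e21, student_params=1e9, teacher_params=7e9)
print(f"teacher compute fraction ~ {frac:.2f}, predicted student loss ~ {loss:.3f}")
```

Under a law of this general shape, the optimal split depends on whether the teacher's training cost is amortized across many students or charged to a single distillation run, which is the distinction the paper's two recipes address.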