Knowledge distillation analysis and evaluation for embedded applications on FPGAs
Files
- Vlaeminck_37091500_2021.pdf (Open access, Adobe PDF, 5.46 MB)
Details
- Abstract
- Deep learning has been instrumental in recent advances in artificial intelligence and computer vision, with countless applications. However, deep learning requires extremely large models to achieve its remarkable performance; these models demand substantial processing power both for training and for deployment, and consume a large amount of energy. To reduce power consumption in embedded applications, field-programmable gate arrays (FPGAs) are a possible solution, as they offer high energy efficiency while still delivering higher performance than CPUs. Their limited size, however, requires much smaller networks, which are less accurate. Knowledge distillation is a technique for training deep networks in which a large, accurate model (the teacher) is used to train a smaller but easier-to-deploy network (the student) by providing the latter with the former's outputs as an additional training objective. Knowledge distillation can typically train the student to a much higher accuracy than training it alone, sometimes even surpassing its teacher. This thesis aims to evaluate the use of knowledge distillation to train networks adapted for FPGAs, as well as the factors influencing its success. Our results show that distillation can indeed be applied successfully. We include a theoretical analysis of the impact of distillation on the loss and the gradients, and of what this may reveal about distillation's mechanism of action. We find that the best results are obtained with a higher temperature than is usually used in knowledge distillation. We also find that teachers whose training is stopped early are extremely effective, improving their students' accuracy by as much as 6% on CIFAR-100, and that distillation could potentially reduce the required training time.
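
The distillation objective the abstract describes, where the teacher's softened outputs serve as an additional training target, can be sketched as follows. This is a minimal illustration of the standard temperature-scaled formulation, not the thesis's actual implementation; the function names, the temperature T=4, and the weighting alpha=0.9 are illustrative assumptions.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: a higher T yields a softer distribution,
    # exposing more of the teacher's "dark knowledge" about wrong classes.
    m = max(l / T for l in logits)
    exps = [math.exp(l / T - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label, T=4.0, alpha=0.9):
    # Soft-target term: cross-entropy between the teacher's and the
    # student's temperature-softened output distributions.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft = -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
    # Hard-target term: ordinary cross-entropy with the ground-truth label.
    hard = -math.log(softmax(student_logits)[true_label])
    # The soft term's gradients scale as 1/T^2, so it is conventionally
    # rescaled by T^2 to keep the two terms comparably weighted.
    return alpha * (T ** 2) * soft + (1 - alpha) * hard
```

Raising T flattens both distributions, which is how a higher-than-usual temperature (as the abstract reports works best here) lets the student learn from the relative probabilities the teacher assigns to incorrect classes.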