
Methodology

This study examined four leading knowledge distillation methods established by prior research. The first method explored was Classical Knowledge Distillation, introduced in 2015 as one of the earliest notable implementations of the technique. Beyond this classical method, three more recent methods were investigated: Relational Knowledge Distillation, Curriculum Temperature Knowledge Distillation, and Regularizing Feature Norm and Direction.

Technical Approach

[Figure: Technical approach diagram]

Leading Frameworks

01

Classic Knowledge Distillation

In CKD, rather than training on traditional hard labels alone, the student is trained to mimic the teacher's output distributions (softened logits) produced with a temperature parameter. Increasing the temperature smooths the teacher's outputs, making them easier for the student model to learn from.
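Concretely, the softened distribution is obtained by dividing the logits (written here as z) by the temperature T before applying the softmax:

p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}

A larger T pushes the probabilities toward a more uniform, easier-to-match distribution, while T = 1 recovers the standard softmax.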

02

Relational Knowledge Distillation

The RKD method preserves distance-based and angle-based relationships between data points in the intermediate feature spaces of both the teacher and student networks. At each layer where it is applied, it computes the distances between pairs of feature vectors and the angles formed by triplets of feature vectors in the teacher network, and trains the student to reproduce them.

03

Curriculum Temperature Knowledge Distillation

CTKD examines the limitations of a fixed temperature in knowledge distillation, highlighting its impact on the learning process. Recognizing that higher temperatures become less effective in the later stages of training, it introduces a dynamic strategy that adjusts the distillation temperature throughout the training period.

04

Regularizing Feature Norm and Direction

"Knowledge Distillation via Regularizing Feature Norm and Direction" (KD++ Distillation) focuses on aligning student features with the class-means of the teacher's features. It trains the student to generate features with a larger norm and to align their direction with the teacher's class-means, enhancing the student's learning capacity.

Classic Knowledge Distillation

The principal component of CKD is the loss function, which employs a temperature T to soften the teacher model’s output probabilities, facilitating the transfer of more nuanced information to the student model. The CKD loss function is shown in the figure below.

In this loss, Lce is the cross-entropy loss on the student’s predictions against the ground-truth labels, LKLDivergence is the Kullback-Leibler divergence between the temperature-softened student and teacher logits, and α weights the loss toward either the student’s hard-label predictions or the teacher’s softened logits.

[Figures: CKD training diagram and loss formula]
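As a concrete illustration of this loss, the sketch below shows one way it can be computed in PyTorch. The function name, the default values of T and α, and the convention of placing α on the hard-label term are assumptions for illustration; the exact weighting used in this work is the one given in the formula above.

import torch.nn.functional as F

def ckd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Hard-label term: cross-entropy between the student's raw logits and the labels.
    l_ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between the temperature-softened student and
    # teacher distributions, scaled by T^2 to keep gradient magnitudes comparable across T.
    l_kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # alpha trades off the hard-label loss against the distillation (soft-label) loss.
    return alpha * l_ce + (1 - alpha) * l_kl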

Relational Knowledge Distillation

RKD trains a student model to mimic the relational structure of a teacher model’s embeddings by matching the distances and angles between data-point features, using specialized distance and angle distillation losses.

The distance loss (Ld) encourages the pairwise distances between the student’s feature representations to match the corresponding pairwise distances in the teacher’s feature space. The angle loss (La) encourages the angles formed by triplets of student features to match the corresponding angles in the teacher’s feature space. β allocates relative priority to the distance and angle losses, while α attenuates or strengthens the overall distillation loss (the weighted combination of the distance and angle terms) and can be tuned as a hyperparameter.

[Figures: RKD loss formulas and relational distillation diagram]
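For reference, in the original RKD formulation the distance and angle terms are defined through relational potentials ψD and ψA, measured on teacher features t and student features s; the α and β weighting described above combines these terms with the task loss:

\psi_D(t_i, t_j) = \frac{1}{\mu}\,\lVert t_i - t_j \rVert_2, \qquad L_d = \sum_{(i,j)} \ell_\delta\!\left(\psi_D(s_i, s_j),\ \psi_D(t_i, t_j)\right)

\psi_A(t_i, t_j, t_k) = \cos \angle\, t_i\, t_j\, t_k, \qquad L_a = \sum_{(i,j,k)} \ell_\delta\!\left(\psi_A(s_i, s_j, s_k),\ \psi_A(t_i, t_j, t_k)\right)

Here μ is the mean pairwise distance within the mini-batch (a normalization that makes the potentials scale-invariant) and ℓδ is the Huber loss.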

Curriculum Temperature Distillation

This methodology leverages a dynamically adjusted temperature parameter to ensure a balanced transfer of knowledge from the teacher to the student. Because higher temperatures lose effectiveness in later training phases, CTKD introduces a dynamic method that adjusts the distillation temperature as training progresses. The approach can be applied to any knowledge distillation framework that utilizes a temperature.

[Figure: CTKD framework diagram]
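In the CTKD paper itself, the temperature is a learnable parameter trained adversarially, with a curriculum that gradually increases the difficulty of the distillation task. As a simplified sketch of the core idea, a temperature that varies over training rather than staying fixed, a schedule like the one below could be plugged into any temperature-based distillation loss; the function name and the start/end values are hypothetical:

import math

def scheduled_temperature(epoch, total_epochs, t_start=4.0, t_end=1.0):
    # Cosine schedule: begin with a high temperature (softer, easier targets)
    # and decay toward a lower temperature as training progresses.
    progress = epoch / max(total_epochs - 1, 1)
    return t_end + 0.5 * (t_start - t_end) * (1 + math.cos(math.pi * progress))

# Example: feed the scheduled value into the distillation loss each epoch, e.g.
# ckd_loss(student_logits, teacher_logits, labels, T=scheduled_temperature(e, num_epochs))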

KD++ Distillation

The “Knowledge Distillation via Regularizing Feature Norm and Direction” (KD++) method improves knowledge distillation by aligning student features with the teacher’s class-means and encouraging the student to produce large-norm features. This is achieved through a novel loss term, the ND loss.

The new loss, Lnd, encourages larger student feature norms and minimizes the angular distance between the student’s features and the corresponding teacher class-mean.

[Figures: KD++ diagram and ND loss formula]
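Schematically, and not as the paper's exact definition, a loss consistent with this description rewards the projection of a student feature f_s onto the unit-normalized class-mean of the teacher's features for the ground-truth class y (all symbols below are illustrative):

L_{nd} \approx -\left\langle f_s,\ \hat{c}^{\,y}_t \right\rangle, \qquad \hat{c}^{\,y}_t = \frac{c^{\,y}_t}{\lVert c^{\,y}_t \rVert_2}

Minimizing such a term simultaneously favors a larger student feature norm along the class-mean direction and a smaller angle between f_s and the teacher's class-mean, which is the behavior described above.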