
Compression of Deep Learning Models

for NLP
Manish Gupta1,2, Vasudeva Varma1,2, Sonam Damani1, Kedhar Nath Narahari1
gmanish@microsoft.com
1Microsoft 2IIIT-H
Agenda
• Need for compression of deep learning models
• Broad overview of popular ways of model compression
• Pruning
• Quantization
• Knowledge Distillation
• Parameter sharing
• Matrix decomposition
• Transformers with Linear Complexity
• Applications
• Summary and future trends
What this tutorial is not about
• Non-text methods – i.e., methods proposed only in the vision and speech
communities.
• Hardware optimizations to reduce latency.
• We focus on model compression which does lead to latency reduction.
• But latency reduction using hardware optimizations is not our focus.
• Not exhaustive.
• Given the large number of papers in this area, we could not cover everything within a
short tutorial.
• Every major A/A* deep learning based conference this year has at least 10 papers on
model compression.
Agenda
• Need for compression of deep learning models
• Broad overview of popular ways of model compression
• Pruning
• Quantization
• Knowledge Distillation
• Parameter sharing
• Matrix decomposition
• Transformers with Linear Complexity
• Applications
• Summary and future trends
Sizes of popular Neural Network models

Jameson Toole. Smaller, faster mobile models. O’Reilly AI London, 2019.


The ever-growing size of deep neural models
T5: 11B
T-NLG: 17B
GPT3: 175B
GShard: 600B–1T

https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/

[79] Bianco, Simone, Remi Cadene, Luigi Celona, and Paolo Napoletano. "Benchmark analysis of representative deep neural network architectures." IEEE Access 6 (2018): 64270-64277.
[130] Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
[131] Lepikhin, Dmitry, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. "Gshard: Scaling giant models with conditional computation and automatic sharding." arXiv preprint
arXiv:2006.16668 (2020).
Deep Learning on Edge Devices
• Aiding the blind

Can we process at 30 fps?


http://isca2016.eecs.umich.edu/wp-content/uploads/2016/07/4A-1.pdf
Resource constraints

• RAM
• Prediction Latency – FLOPs
• Power dissipation

Energy cost of basic arithmetic and memory operations in a 45nm CMOS process: memory access is two orders of magnitude more energy expensive than arithmetic operations.
• Large networks do not fit in on-chip storage and hence require the more costly external DRAM accesses.
• Running a 1 billion connection neural network, for example, at 30fps would require (30Hz)(1G)(640pJ) = 19.2W just for DRAM
access - well beyond the power envelope of a typical mobile device.
• Results: Battery drains fast, phone overheats.
• Run on cloud?
• Network delay
• Power consumption for communication
• User Privacy
• Persistent Recurrent Neural Networks [78]
• Cache the weights in on-chip memory such as caches, block RAM, or register files across multiple timesteps.
• High model compression allows significantly large RNNs to be stored in on-chip memory.
• 146x speedup if the entire RNN layer can be stored in registers rather than the GPU DRAM of an NVIDIA TitanX GPU.
[78] Diamos, Greg, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski, Adam Coates, Erich Elsen, Jesse Engel, Awni Hannun, and Sanjeev Satheesh. "Persistent rnns: Stashing recurrent weights on-
chip." In International Conference on Machine Learning, pp. 2024-2033. 2016.
[80] Han, Song, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. "EIE: efficient inference engine on compressed deep neural network." ACM SIGARCH Computer
Architecture News 44, no. 3 (2016): 243-254.
Training/Finetuning large models is difficult
• Impossible to fine-tune pretrained BERT-large on a GPU with 12-16 GB RAM
• This poses a large barrier to entry for communities without the resources to
purchase either several large GPUs or time on Google’s TPUs.
• Tuning various configuration parameters needs lots of resources.
• By removing unimportant weights from a network, several improvements can
be expected: better generalization, fewer training examples required, and
improved speed of learning and/or classification.
Mobile-first companies dislike large apps
• App stores are very sensitive to the size of the binary files.
• For example, iPhone App Store has the restriction “apps above 150 MB will not download until
you connect to Wi-Fi”.
• A feature that increases the binary size by 150MB will receive much more scrutiny
than one that increases it by 10MB.
• “At Baidu, our #1 motivation for compressing networks is to bring down the size of
the binary file. As a mobile-first company, we frequently update various apps via
different app stores. We're very sensitive to the size of our binary files, and a feature
that increases the binary size by 100MB will receive much more scrutiny than one
that increases it by 10MB.” —Andrew Ng
• Smaller models are easier to use and deploy in real world systems. If the model is too
large, it is possible that it will need multiple server nodes.
• Smaller models also decrease the communication overhead of distributed training of
the models.
There is hope.
• There is a surprisingly large amount of redundancy among the weights of
neural networks.
• A small subset of the weights is sufficient to reconstruct the entire network.
• We can predict >95% of the weights without any drop in accuracy.

The first column in each block shows four learned features (parameters of a deep model). The second column shows a few parameters chosen at random from the original set in the
first column. The third column shows that this random set can be used to predict the remaining parameters. From left to right the blocks are: (1) a convnet trained on STL-10 (2) an
MLP trained on MNIST, (3) a convnet trained on CIFAR-10, (4) Reconstruction ICA trained on Hyvarinen’s natural image dataset (5) Reconstruction ICA trained on STL-10.

[113] Denil, Misha, Shakibi, Babak, Dinh, Laurent, de Freitas, Nando, et al. Predicting parameters in deep learning. In NIPS, 2013.
Agenda
• Need for compression of deep learning models
• Broad overview of popular ways of model compression
• Pruning
• Quantization
• Knowledge Distillation
• Parameter sharing
• Matrix decomposition
• Transformers with Linear Complexity
• Applications
• Summary and future trends
Pruning: sparsifying weight matrices
• Methods differ based on what is pruned and the actual logic used to prune.
• Given a matrix, one can prune
• Some weight entries
• Rows/columns (i.e., neurons)
• Blocks
• Heads
• Layers
• Which weights/neurons/blocks/heads to prune?
• Should you prune large networks or build small networks?
• Iterative/gradual pruning? Iterative pruning and densification.
• Interplay between pruning and regularization.
Quantization: Reducing #bits to represent each
weight
• Weights can be quantized to two values (binary), three values (ternary) or
multiple bits.
• Uniform vs non-uniform.
• Deterministic vs stochastic.
• Loss-aware or not.
• Trained vs tuned parameters.
• How to quantize word vectors, RNN/LSTM weight matrices, Transformers.
Knowledge distillation: Train student to mimic a pre-
trained, larger teacher
• Distillation methods vary on
• different types of teacher model
• different types of loss function
• squared error between the logits of the models
• KL divergence between the predictive distributions, or
• some other measure of agreement between the model predictions.
• different choices for what dataset the student model trains on.
• a large unlabeled dataset
• a held-out data set, or
• the original training set.
• Mimic what?
• Teacher’s class probabilities
• Teacher’s feature representation
• Learn from whom?
• Teacher, teacher assistant, other fellow students
Parameter sharing
• Methods differ depending on
• which parameters are shared,
• technique used to share parameters,
• and the level at which sharing is performed
Matrix decomposition: Factorize large matrices into
multiple smaller components
• Methods differ
• in the type of factorization technique,
• matrices being factorized,
• and the property of weight matrix being exploited.
Transformers with Linear Complexity
• Goal: achieve latency that is linear with respect to the input sequence length.
• Methods include
• Sparse Transformer
• Star Transformer
• Linformer
• Reformer
• Sinkhorn Sparse Transformer
• Linear Transformer
Agenda
• Need for compression of deep learning models
• Broad overview of popular ways of model compression
• Pruning
• Quantization
• Knowledge Distillation
• Parameter sharing
• Matrix decomposition
• Transformers with Linear Complexity
• Applications
• Summary and future trends
Pruning: sparsifying weight matrices
• Outline for pruning
• Pruning weights
• Pruning neurons
• Pruning blocks
• Pruning heads and layers
Pruning: Motivation

• Trillions of synapses are generated in the human brain during the first few months after birth.
• At 1 year old, the count peaks at about 1000 trillion synapses.
• Pruning then begins to occur.
• By 10 years old, a child has nearly 500 trillion synapses.
• This ‘pruning’ mechanism removes redundant connections in the brain.
[81] Walsh, Christopher A. "Peter Huttenlocher (1931–2013)." Nature. (2013): 172.
Optimal Brain Damage (OBD)
• One of the oldest pruning works.
• How to define saliency of a weight, besides simply using its magnitude?
• Using change in the objective function caused by deleting that parameter.
• Using Taylor series, a perturbation 𝛿𝑈 of the parameter vector will change the objective function by
• δE = Σᵢ gᵢ δuᵢ + ½ Σᵢ hᵢᵢ δuᵢ² + ½ Σᵢ≠ⱼ hᵢⱼ δuᵢ δuⱼ + O(||δU||³), where the δuᵢ are the components of δU, gᵢ = ∂E/∂uᵢ and hᵢⱼ = ∂²E/(∂uᵢ ∂uⱼ).
• But Matrix H is enormous and is very difficult to compute.
• "diagonal" approximation: removes 3rd term; "extremal" approximation: removes 1st term; "quadratic"
approximation: removes last term.
• Thus, finally, δE = ½ Σᵢ hᵢᵢ δuᵢ². Use this as the saliency of uᵢ.
• How to find the diagonal second derivative (Hessian) ℎ𝑖𝑖 ?
• The procedure is very similar to the back-propagation algorithm used for computing the first derivatives.
• Computing the diagonal Hessian is of the same order of complexity as computing the gradient.
• Drawbacks
• Computationally prohibitive as second derivative computations are expensive.
• Cross terms in the Hessian matrix are ignored.

[35] Yann LeCun, John S Denker, and Sara A Solla. 1990. Optimal brain damage. In Advances in neural information processing systems. 598–605.
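A small illustrative sketch of the OBD saliency computation (our own toy code, not from [35]): the diagonal Hessian is estimated here by central finite differences on a toy quadratic objective, standing in for the back-propagation-style procedure described in the paper.

import numpy as np

def obd_saliency(loss_fn, u, eps=1e-4):
    # Saliency s_i = 0.5 * h_ii * u_i^2; h_ii estimated by central finite differences.
    base = loss_fn(u)
    saliency = np.zeros_like(u)
    for i in range(u.size):
        up, down = u.copy(), u.copy()
        up[i] += eps
        down[i] -= eps
        h_ii = (loss_fn(up) - 2.0 * base + loss_fn(down)) / eps ** 2
        saliency[i] = 0.5 * h_ii * u[i] ** 2
    return saliency

# Toy quadratic objective E(u) = 0.5 * u^T A u (hypothetical example)
A = np.diag([1.0, 10.0, 0.1])
loss = lambda u: 0.5 * u @ A @ u
u = np.array([1.0, 0.3, 2.0])
print(obd_saliency(loss, u))   # prune the parameter with the smallest saliency first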
Optimal Brain Surgeon
• Use information from all second order derivatives of the error function to perform network pruning.
• “Hessians for every problem we have considered are strongly non-diagonal, and this leads OBD to eliminate the
wrong weights.”
• Using a similar derivative of E wrt weight 𝑤𝑞 , saliency of the weight is
• L_q = ½ · w_q² / [H⁻¹]_qq
• If we assume Hessian H to be diagonal, we get OBD.
• Computing 𝐻 −1 is difficult. They provide a faster recursion relation for calculating 𝐻−1 from training
data and structural information of the net.
• Also, unlike other methods (like OBD or magnitude pruning), OBS does not demand (typically slow)
retraining after the pruning of a weight.
• In every step, delete the weight w_q with minimum L_q, and update all weights as δw = −(w_q / [H⁻¹]_qq) · H⁻¹ e_q, where e_q is the unit
vector in weight space corresponding to (scalar) weight w_q.
• Con: Unfortunately, these methods are computationally prohibitive as second derivative
computations are expensive.

[76] Hassibi, Babak, and David G. Stork. "Second order derivatives for network pruning: Optimal brain surgeon." In Advances in neural information processing systems, pp. 164-171. 1993.
Deep Compression
A more computationally feasible method for pruning connections and relearning weights based solely on the magnitude of the original weights.

Representing the matrix sparsity with relative index. Padding filler zero to prevent overflow.

The three-stage compression pipeline: pruning, quantization and Huffman coding. Pruning reduces the number of weights by 10×, while
quantization further improves the compression rate: between 27× and 31×. Huffman coding gives more compression: between 35× and 49×.
The compression rate already includes the meta-data for sparse representation. The compression scheme doesn’t incur any accuracy loss.

[22] Song Han, Huizi Mao, and William J Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015).
Pruning encoder-decoder LSTM models

The LSTM block weight matrix T_{4n,2n} can be decomposed into 8 subgroups that are responsible for the interactions between {input gate i, forget gate f, output gate o, input signal ĥ} × {feed-forward input h_t^(l−1), recurrent input h_(t−1)^l}.

Insight: Higher layers, attention and softmax weights are the most important, while lower layers and the embedding weights hold a lot of redundancy. For the LSTM architecture, at lower layers the parameters for the input are most crucial, but at higher layers the parameters for the gates also become important.

Typical NMT architecture. This example has two layers, but the actual system in [51] has four. The different weight classes are indicated by arrows of different color.

[51] Abigail See, Minh-Thang Luong, and Christopher D Manning. 2016. Compression of neural machine translation models via pruning. arXiv preprint arXiv:1606.09274 (2016).
Pruning Schemes
• How do we distribute the pruning over the different weight classes of our model?
• Class-blind: Take all parameters, sort them by magnitude and prune the x% with smallest magnitude, regardless
of weight class.
• Class-uniform: Within each class, sort the weights by magnitude and prune the x% with smallest magnitude.
• Class-distribution: For each class c, weights with magnitude less than 𝜆𝜎𝑐 are pruned.
• 𝜎𝑐 is the standard deviation of class c and λ is a universal parameter chosen such that in total, x% of all parameters are pruned.
• Retraining the sparse pruned network helps. Retrain with smaller learning rate (LR), simpler LR
schedule (halve every 2 epochs), and train for fewer epochs.
• Class-blind pruning is the simplest and adheres to the principle that pruning weights (or equivalently,
setting them to zero) is least damaging when those weights are small, regardless of their locations in
the architecture.
• Class-uniform pruning and class-distribution pruning both seek to prune proportionally within each
weight class, either absolutely, or relative to the standard deviation of that class.
• Class-blind pruning outperforms both other schemes.
• A WMT’14 English-German NMT model with 200M+ parameters can be pruned by 40% with very little
performance loss.
• With retraining, we can recover and even surpass the original performance with an 80%-pruned model.

[51] Abigail See, Minh-Thang Luong, and Christopher D Manning. 2016. Compression of neural machine translation models via pruning. arXiv preprint arXiv:1606.09274 (2016).
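A minimal NumPy sketch of the three schemes above (illustrative only; the function names, the grid search for λ, and the toy weight classes are our own, not code from [51]):

import numpy as np

def class_blind_prune(weight_classes, x):
    # Prune the x% smallest-magnitude weights across all classes jointly.
    all_w = np.concatenate([w.ravel() for w in weight_classes.values()])
    thresh = np.percentile(np.abs(all_w), x)
    return {name: w * (np.abs(w) >= thresh) for name, w in weight_classes.items()}

def class_uniform_prune(weight_classes, x):
    # Prune the x% smallest-magnitude weights within each class separately.
    pruned = {}
    for name, w in weight_classes.items():
        thresh = np.percentile(np.abs(w), x)
        pruned[name] = w * (np.abs(w) >= thresh)
    return pruned

def class_distribution_prune(weight_classes, x):
    # Prune weights with |w| < lambda * sigma_c; lambda chosen so that x% of all weights are pruned.
    sigmas = {name: w.std() for name, w in weight_classes.items()}
    total = sum(w.size for w in weight_classes.values())
    for lam in np.linspace(0.0, 5.0, 501):   # simple grid search stand-in
        pruned_count = sum((np.abs(w) < lam * sigmas[name]).sum()
                           for name, w in weight_classes.items())
        if 100.0 * pruned_count / total >= x:
            break
    return {name: w * (np.abs(w) >= lam * sigmas[name])
            for name, w in weight_classes.items()}

# Toy usage with two hypothetical weight classes
rng = np.random.default_rng(0)
classes = {"encoder": rng.normal(0, 1.0, (4, 4)), "softmax": rng.normal(0, 0.3, (4, 4))}
pruned = class_blind_prune(classes, x=40)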
Iterative Pruning
• Regularization (L1/L2) while training.
• Dropout Ratio Adjustment
• During retraining, the dropout ratio must be
adjusted to account for the change in model
capacity.
• As pruning already reduced model capacity,
the retraining dropout ratio should be smaller.
• D_r = D_o · √(C_i^r / C_i^o)
• C_i is the number of connections in layer i; C_i^o for the original network, C_i^r for the network after retraining.
• D_o is the original dropout rate; D_r is the dropout rate during retraining.
• dropout works on neurons
• 𝐶𝑖 varies quadratically with number of neurons
• A fixed threshold is used for magnitude pruning in every iteration.

Trade-off curve for parameter reduction and loss in top-5 accuracy. L1 regularization performs better than L2 at learning the connections without retraining, while L2 regularization performs better than L1 at retraining. Iterative pruning gives the best result.

[82] Han, Song, Jeff Pool, John Tran, and William Dally. "Learning both weights and connections for efficient neural network." In Advances in neural information processing systems, pp. 1135-1143. 2015.
Gradual Pruning
• Monotonically increase threshold 𝜖 for pruning.
• We don’t modify the gradients of a pruned weight in the back-propagation step.

Hyper-parameters used for determining the threshold (ε)
• It is possible for the updates of a pruned weight to
be larger than the threshold of that layer. In this
case, the weight will be involved in the forward
pass again.
• To determine q: using a previously trained model,
weights are sorted using absolute values and we
pick the weight corresponding to the 90th
percentile as q.
• We only prune the weights of the recurrent and
linear layers but not the biases or batch norm
parameters since they are much fewer in number.
• Overall, model size can be reduced by 90% and
speed-up is around 2× to 7×.
• Layers closer to the input are pruned more aggressively compared to the final layers.

[77] Narang, Sharan, Erich Elsen, Gregory Diamos, and Shubho Sengupta. "Exploring sparsity in recurrent neural networks." arXiv preprint arXiv:1704.05119 (2017).
Another simpler method for gradual pruning
• Increase the sparsity from an initial sparsity value 𝑠𝑖 (usually 0) to a final sparsity value 𝑠𝑓
over a span of n pruning steps, starting at training step 𝑡0 and with pruning frequency ∆t:
s_t = s_f + (s_i − s_f) · (1 − (t − t_0)/(nΔt))³   for t ∈ {t_0, t_0 + Δt, …, t_0 + nΔt}
• The binary weight masks are updated every ∆t steps as the network is trained to gradually
increase the sparsity of the network while allowing the network training steps to recover
from any pruning-induced loss in accuracy.
• Prune the network rapidly in the initial phase when the redundant connections are
abundant and gradually reduce the number of weights being pruned each time as there are
fewer and fewer weights remaining in the network.
• n depends on the learning rate schedule. SGD typically decays the learning rate during
training.
• Pruning in the presence of a very small learning rate makes it difficult for the subsequent training steps to
recover from the loss in accuracy caused by forcing the weights to zero.
• At the same time, pruning with too high of a learning rate may mean pruning weights when the weights
have not yet converged to a good solution, so it is important to choose the pruning schedule closely with
the learning rate schedule.
[72] Michael Zhu and Suyog Gupta. 2017. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878 (2017).
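A small sketch of the cubic sparsity schedule above (our own code; the hyper-parameter values are hypothetical):

import numpy as np

def sparsity_at_step(t, s_i=0.0, s_f=0.9, t0=0, n=100, dt=1000):
    # Cubic ramp from s_i to s_f over n pruning steps spaced dt apart.
    if t < t0:
        return s_i
    t = min(t, t0 + n * dt)
    return s_f + (s_i - s_f) * (1.0 - (t - t0) / (n * dt)) ** 3

def apply_mask(weights, sparsity):
    # Zero out the smallest-magnitude fraction `sparsity` of the weights.
    thresh = np.quantile(np.abs(weights), sparsity)
    return weights * (np.abs(weights) >= thresh)

# Example: sparsity targets at a few training steps
for step in [0, 25_000, 50_000, 100_000]:
    print(step, round(sparsity_at_step(step), 3))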
Iterative Magnitude Pruning for Transformers
• For starting proportion X% and ending proportion Y%, our iterative magnitude
pruning procedure pruned X% of each of the pre-trained Transformer layers,
began re-training, and pruned (Y −X)/9 % of each of the layers every 1001
iterations.
• By the 10,000th iteration, we reached Y% pruning of the model iteratively.
• Do not factor in word embeddings in compression rate.

WMT 2017 Test set BLEU and Compression Ratio

[10] Robin Cheong and Robel Daniel. 2019. transformers. zip: Compressing Transformers with Pruning and Quantization. Technical Report. Technical report, Stanford University, Stanford, California, 2019.
Sparse BERT with improved pruning
• Two problems with pruning
• The larger weight 𝑤𝑗 is penalized more heavily than smaller
weight 𝑤𝑖 in 𝑙1 regularization, which violates the original intention
of weight pruning, “removing the unimportant connections”.
• Direct optimization of a regularization penalty term causes
divergence from the original loss function and has negative effect
on the effectiveness of gradient-based update.
• Hence, we perform reweighted 𝑙1 minimization:
• min_w f(w) + γ Σᵢ αᵢ |wᵢ|
• where 𝛼𝑖 > 0 and inversely proportional to |𝑤𝑖 |.
• Solution using reweighted proximal pruning (which depends
on proximal operators)
• Decouples the goal of high sparsity from minimizing the loss.
• NIP: progressive/gradual pruning without regularizers.
• Even when 90% of weights are pruned, NSP accuracy stays above 95% with RPP. 80% pruning for most GLUE tasks and 41% for SQuAD 1.1 at zero degradation for BERT-BASE.

BERT-LARGE pruning results on a set of transfer learning tasks. The degradation is contrasted with the original BERT (without pruning) for transfer learning.

[20] Fu-Ming Guo, Sijia Liu, Finlay S Mungall, Xue Lin, and Yanzhi Wang. 2019. Reweighted Proximal Pruning for Large-Scale Language Representation. arXiv preprint arXiv:1909.12486 (2019).
Iteratively Pruning and Densifying (iterative DSD)
The sparse training regularizes the model, and the final dense training restores the pruned weights
(red), increasing the model capacity without overfitting.

Why does DSD training work?


• Escape Saddle Point
• Significantly Better Minima
• Regularized and Sparse Training
• Robust re-initialization
• Break Symmetry

Weight distribution of the original GoogLeNet (a), pruned GoogLeNet (b), after retraining the sparsity-constrained GoogLeNet
(c), ignoring the sparsity constraint and recovering the zero weights (d), and after retraining the dense network (e).
[23] Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, et al. 2016. DSD: Dense-sparsedense training for deep neural networks. arXiv preprint
arXiv:1607.04381 (2016).
Iterative Growth and Prune with Hidden-layer LSTM
(H-LSTM)
• Grow-and-Prune (GP) Training
• Growth policy: Activate a dormant
w in W iff |w.grad| is larger than
the (100α) th percentile of all
elements in |W.grad|.
• Pruning policy: Remove a w iff |w|
is smaller than the (100β) th
percentile of all elements in |W|.
• α and β refer to the growth ratio and pruning ratio, respectively.

Growth and Prune Training


• Pruning of insignificant weights is
an iterative process.

[13] Xiaoliang Dai, Hongxu Yin, and Niraj K Jha. 2018. Grow and prune compact, fast, and accurate LSTMs. arXiv preprint arXiv:1805.11797 (2018).
Should you prune large networks or build small
dense networks?
• Pruning involves extra processing plus sparse
matrices need special handling – can we avoid it?
• Large-sparse models consistently outperform small-
dense models and achieve up to 10x reduction in
number of non-zero parameters with minimal loss
in accuracy.
• Models: stacked LSTMs for language modeling, and
seq2seq models for NMT.

Deep Encoder-Decoder LSTM Models


PTB sparse vs dense models (2 LSTM layers)

[72] Michael Zhu and Suyog Gupta. 2017. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878 (2017).
Pruning: sparsifying weight matrices
• Outline for pruning
• Pruning weights
• Pruning neurons
• Pruning blocks
• Pruning heads and layers
Pruning weights vs pruning neurons

Parameter                    | Pruning Weights        | Pruning Neurons
Flexibility                  | High, can prune freely | Constrained, need to find entire rows/columns for deletion
Sparse Matrix Support Needed | Yes                    | No
Node Importance Functions
• After training, a score is calculated for each hidden node using one of these
importance functions.
• Entropy
• Let 𝑎𝑖 be #instances with node output>0.5, and 𝑑𝑖 be #instances with node output<=0.5.
• Entropy(i) = dᵢ/(aᵢ+dᵢ) · log₂(dᵢ/(aᵢ+dᵢ)) + aᵢ/(aᵢ+dᵢ) · log₂(aᵢ/(aᵢ+dᵢ)).
• The intuition is that if one node’s outputs are almost identical on all training data, these outputs do not
generate variations to later layers and consequently the node may not be useful.
• Output-weights Norm (onorm): average L1-norm of the weights of its outgoing links
• Input-weights norm (inorm): average L1-norm of the weights of its incoming links
• All the nodes are sorted by their scores and nodes with less importance values are
removed.
• Pruned DNN needs to be re-finetuned.
• On switchboard speech recognition data, onorm was found to be the best. DNN
complexity can be reduced to 37.9% without accuracy loss.

[25] Tianxing He, Yuchen Fan, Yanmin Qian, Tian Tan, and Kai Yu. 2014. Reshaping deep neural network for fast decoding by node-pruning. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 245–249.
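An illustrative NumPy sketch of the three importance scores described above (the function names, array shapes, and top-k selection are our own assumptions):

import numpy as np

def _xlog2x(x):
    # x * log2(x), with 0*log(0) treated as 0
    out = np.zeros_like(x)
    nz = x > 0
    out[nz] = x[nz] * np.log2(x[nz])
    return out

def entropy_score(activations, threshold=0.5):
    # activations: (num_instances, num_nodes) outputs of one hidden layer.
    # Implements the score exactly as written on the slide.
    a = (activations > threshold).sum(axis=0).astype(float)
    d = (activations <= threshold).sum(axis=0).astype(float)
    return _xlog2x(d / (a + d)) + _xlog2x(a / (a + d))

def onorm_score(W_out):
    # Average L1 norm of each node's outgoing weights; W_out: (num_nodes, fan_out).
    return np.abs(W_out).mean(axis=1)

def inorm_score(W_in):
    # Average L1 norm of each node's incoming weights; W_in: (fan_in, num_nodes).
    return np.abs(W_in).mean(axis=0)

def top_k_nodes(scores, k):
    # Keep the k highest-scoring nodes, drop the rest, then re-finetune the pruned net.
    return np.argsort(scores)[::-1][:k]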
Special input regularizers
• Loss with a regularizer: minimize L(W) + λR(W), where R(W) is a convex regularizer.
• To remove neurons we need a special regularizer that pushes all the incoming
connection weights to a unit together towards zero.
• Two regularizers
• The ℓ2 norm on a matrix W is R(W) = Σᵢ ||Wᵢ:||₂ = Σᵢ (Σⱼ Wᵢⱼ²)^(1/2)
• This puts equal pressure on each row, but within each row, the larger values contribute more, and
therefore there is more pressure on larger values towards zero.
• The ℓ∞ norm is R(W) = Σᵢ ||Wᵢ:||∞ = Σᵢ maxⱼ |Wᵢⱼ|
• This puts equal pressure on each row, but within each row, only the maximum value (or values) matter, and
therefore the pressure towards zero is entirely on the maximum value(s).
• The sparsity-inducing regularizers are not differentiable at zero, making gradient-
based optimization methods trickier to apply. In this case, proximal gradient method
could be used.
[43] Kenton Murray and David Chiang. 2015. Auto-sizing neural networks: With applications to n-gram language models. arXiv preprint arXiv:1508.05051 (2015).
Input and Output regularizers

[83] Pan, Wei, Hao Dong, and Yike Guo. "Dropneuron: Simplifying the structure of deep neural networks." arXiv preprint arXiv:1606.07326 (2016).
Wiring similar neurons together
• z = a₁ h(W₁ᵀX) + a₂ h(W₂ᵀX) + a₃ h(W₃ᵀX) + … + aₙ h(WₙᵀX)
• If W₁ = W₂, then h(W₁ᵀX) = h(W₂ᵀX)
• Thus, z = (a₁ + a₂) h(W₁ᵀX) + 0 · h(W₂ᵀX) + a₃ h(W₃ᵀX) + … + aₙ h(WₙᵀX)
• Whenever two weight sets (𝑊1 , 𝑊2 ) are equal, one of them can effectively be
removed.
• Surgery step: We need to alter the co-efficient 𝑎1 to 𝑎1 + 𝑎2 .
• Hebbian principle: “neurons which fire together, wire together”.
• If we find neurons that fire together (W₁ = W₂), we wire them together (a₁ ← a₁ + a₂).

A toy example showing the effect of equal weight-sets (W1 = W4). The circles in the diagram are neurons and the lines represent weights. Weights of the same colour in the input layer constitute a weight-set.
• What if ||𝑊1 − 𝑊2 || = ||𝜖1,2 || ≥ 0?
• Neuron removal procedure:
1. Compute saliency 𝑠𝑖,𝑗 for all possible values of (i, j). It can be stored as a square
matrix M, with dimension equal to the number of neurons in the layer being
considered.
• s_{i,j} = ⟨a_j²⟩ · ||ε_{i,j}||₂², where ⟨a_j²⟩ denotes the average of that quantity over all output neurons.
2. Pick the minimum entry in the matrix. Let its indices be (i′, j′). Delete the j′-th neuron, and update a_i′ ← a_i′ + a_j′.
3. Update M by removing the j′-th column and row, and updating the i′-th column (to account for the updated a_i′).
• How many neurons to remove?
• Find the saliency value of the mode in the histogram and use that as the cutoff.

(a) Scaled appropriately, the saliency curve closely follows that of the increase in test error; (b) the histogram of saliency values. The black bar indicates the mode of the Gaussian-like curve.

[54] Suraj Srinivas and R Venkatesh Babu. 2015. Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149 (2015).
Pruning: sparsifying weight matrices
• Outline for pruning
• Pruning weights
• Pruning neurons
• Pruning blocks
• Pruning heads and layers
Block Pruning
• Problems with weight pruning: Irregularity of sparse matrices limits the maximum performance and energy efficiency
achievable on hardware accelerators.
• Block-sparse formats store blocks contiguously in memory reducing irregular memory accesses.
• If the maximum magnitude weight of a block is below the current threshold, we set all the weights in that block to zeros.
• The monotonically growing threshold causes more blocks to be pruned as training progress. We stop pruning more blocks
after around 40% of training has completed.
• Any blocks that had been zeroed out are held at zero even after pruning has ended resulting in a sparse model at the end
of training.

• The start slope and ramp slope determine the rate at which the threshold increases.
• For block pruning, we need to modify the start slope to account for the number of elements in a block (N_b).
• Start slope for weight pruning: θ_w = θ
• Start slope for block pruning: θ_b = θ_w × ⁴√N_b

Heuristics to pick hyper-parameters for block pruning [77]
• Empirically, this leads to block sparsity from 85%
to 95%.
[44] Sharan Narang, Eric Undersander, and Gregory Diamos. 2017. Block-sparse recurrent neural networks. arXiv preprint arXiv:1711.02782 (2017).
[77] Narang, Sharan, Erich Elsen, Gregory Diamos, and Shubho Sengupta. "Exploring sparsity in recurrent neural networks." arXiv preprint arXiv:1704.05119 (2017).
Group Lasso Regularization
• Group lasso is a type of weight regularization that works on groups of weights and can zero out all the weights in a group.
• For each block, we add a loss term proportional to the 𝑙2 norm of the block.
• L = L_training + λ_g Σ_{g=1}^{G} ||w^(g)||₂
• We combine group lasso with block pruning, such that group lasso guides the selection of blocks to prune.
• We apply group lasso regularization to coincide with the pruning schedule. We turn off regularization when the pruning
schedule ends, which is typically after around 40% of training epochs.

• Inducing block sparsity with 4×4 blocks in vanilla RNNs and GRUs
results in 9% to 17% loss in accuracy compared to the dense
baseline. Model size reduces by nearly 10×.
• We can prune a larger dense network to recover this loss in
accuracy while maintaining high block sparsity and reducing the
overall parameter count.
• The technique works with a variety of block sizes up to 32×32.
Larger blocks require lower sparsity to maintain similar accuracy.
GRU and BiRNN model results with 4x4 blocks. Block Pruning (BP), Group
Lasso (GL), and Group Lasso with block pruning (GLP)

[44] Sharan Narang, Eric Undersander, and Gregory Diamos. 2017. Block-sparse recurrent neural networks. arXiv preprint arXiv:1711.02782 (2017).
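A minimal sketch of the group lasso penalty over non-overlapping blocks (illustrative only; the block size and the weighting λ_g are arbitrary choices, not values from [44]):

import numpy as np

def group_lasso_penalty(W, block=4):
    # Sum of l2 norms over non-overlapping block x block groups of W;
    # the total loss would be L_training + lambda_g * group_lasso_penalty(W).
    rows, cols = W.shape
    penalty = 0.0
    for r in range(0, rows, block):
        for c in range(0, cols, block):
            penalty += np.linalg.norm(W[r:r + block, c:c + block])
    return penalty

W = np.random.randn(16, 16)
print(group_lasso_penalty(W, block=4))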
Bank-Balanced Sparsity (BBS)
• Unfortunately, it becomes challenging to maintain
the same model accuracy when block sparsity is
applied.
• Block size (i.e., pruning granularity) is application-
sensitive, making it another hyper-parameter to
tune.
Comparing BBS with unstructured sparsity and block sparsity by pruning a
• Bank-Balanced Sparsity (BBS) dense matrix with a sparsity ratio of 50%.
• BBS splits each weight matrix row into multiple equal-
sized banks, and adopts fine-grained pruning to each
bank independently to obtain identical sparsity among
banks.
• BBS achieves almost the same model accuracy as
unstructured sparsity and significantly outperforms block
sparsity when pruning weights at the same sparsity level.
• BBS is also amenable to FPGA acceleration because it
inherently provides a balanced matrix partitioning for
parallel computing.
• Each bank has the same number of non-zero values. Weight map visualization after pruning with (a) unstructured sparsity, (b) BBS,
• We apply the BBS pruning method iteratively to a pre- and (c) block sparsity (sparsity ratio = 90%). These weight maps are 64 × 64
trained network, and fine-tune the network after each sub-matrices of the whole 1500 × 1500 matrix.
pruning iteration to restore the model accuracy.

[6] Shijie Cao, Chen Zhang, Zhuliang Yao, Wencong Xiao, Lanshun Nie, Dechen Zhan, Yunxin Liu, Ming Wu, and Lintao Zhang. 2019. Efficient and effective sparse LSTM on FPGA with Bank-Balanced Sparsity. In Proceedings of the
2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 63–72.
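A small NumPy sketch of BBS pruning for one weight matrix (illustrative; the bank size and sparsity ratio are arbitrary):

import numpy as np

def bank_balanced_prune(W, bank_size, sparsity):
    # Split each row into equal-sized banks and keep the top-(1 - sparsity)
    # fraction of weights by magnitude inside every bank independently.
    rows, cols = W.shape
    assert cols % bank_size == 0, "row length must be divisible by bank size"
    keep_per_bank = int(round(bank_size * (1.0 - sparsity)))
    out = np.zeros_like(W)
    for r in range(rows):
        for start in range(0, cols, bank_size):
            bank = W[r, start:start + bank_size]
            keep_idx = np.argsort(np.abs(bank))[::-1][:keep_per_bank]
            out[r, start + keep_idx] = bank[keep_idx]
    return out

W = np.random.randn(4, 8)
W_bbs = bank_balanced_prune(W, bank_size=4, sparsity=0.5)
# every bank of 4 now has exactly 2 non-zeros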
Pruning: sparsifying weight matrices
• Outline for pruning
• Pruning weights
• Pruning neurons
• Pruning blocks
• Pruning heads and layers
Prune heads
• Majority of attention heads can be removed without deviating too much
from the original score. Surprisingly, in some cases removing an attention
head results in an increase in BLEU/accuracy.
• Only 8 (out of 96) heads in a 6-layer WMT NMT Transformer (16 heads/layer) cause a statistically significant change in performance when they are removed from the model; half of these actually result in a higher BLEU score.

Distribution of heads by model score after masking (144 BERT heads)
• For most layers, one head is indeed sufficient at test time, even though the
network was trained with 12 (BERT) or 16 (WMT Transformer) attention
heads.
• What if we pruned heads across two or more different layers at the same time?

Best delta accuracy by layer when only one head is kept in the BERT model. None of these results are statistically significant with p < 0.01.
• Sort all the attention heads, and prune.
• Prune up to 20% and 40% of heads from WMT and BERT resp., without incurring
any noticeable negative impact.

Average inference speed of BERT on the MNLI-matched validation set in examples per second (± std dev). The speedup relative to the original model is indicated in parentheses.

BERT on MNLI validation set. Different colors are two ways of scoring heads.
[42] Paul Michel, Omer Levy, and Graham Neubig. 2019. Are Sixteen Heads Really Better than One? arXiv preprint arXiv:1905.10650 (2019).
Head pruning
• Only a small subset of heads are important for
translation. Most important and confident heads
play consistent and often linguistically-interpretable
roles (positional, syntactic, rare words).
• Heads are pruned according to
• Layer-wise relevance propagation (LRP) (Ding et al., 2017). LRP is a method for computing the relative contribution of neurons at one point in a network to neurons at another.
• “Confidence” of a head: the average of its maximum attention weight, excluding the end-of-sentence symbol, where the average is taken over tokens in a set of sentences used for evaluation.

Model trained on 6M OpenSubtitles EN-RU data
• When pruning heads using a method based on stochastic gates and a differentiable relaxation of the L₀ penalty, we observe that specialized heads are the last to be pruned.

BLEU score as a function of the number of retained encoder heads (EN-RU).
• To disable less important heads completely, we apply 𝐿0
regularization to the scalars 𝑔𝑖 . 𝐿0 norm equals the
number of non-zero components and would push the
model to switch off less important heads.
• On the English-Russian WMT dataset, pruning 38
out of 48 encoder heads results in a drop of only
0.15 BLEU.

[109] Voita, Elena, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. "Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned." arXiv preprint arXiv:1905.09418 (2019).
Layer pruning (LayerDrop)
• DropConnect randomly drops weights. LayerDrop does structured dropout: drop groups of weights, heads, FFN matrices, or layers.
• Dropping attention heads does not reduce runtime, as they are usually computed in parallel.

LayerDrop (right) randomly drops layers at training time. At test time, this allows for sub-network selection to any desired depth, as the network has been trained to be robust to pruning. In contrast to standard approaches that must re-train a new model from scratch for each model size (left), our method trains only one network from which multiple shallow models can be extracted.

• Selecting layers to prune:
• Every other (with rate p)
• Search on validation set: Try combinations.
• Data driven: learn the drop rate of each layer.
• Parameterize 𝑝𝑑 as a non-linear function of the activation of
its layer and apply a softmax.
• At inference time, we forward only the fixed top-k highest
scoring layers based on the softmax output.
• We drop entire layers to extract shallow models at inference time.

Results on WMT en-de machine translation (newstest2014 test set)
• Every Other strategy works surprisingly well across
many tasks and configurations. Search on Valid and
Data Driven Pruning only offer marginal gains.

[108] Fan, Angela, Edouard Grave, and Armand Joulin. "Reducing transformer depth on demand with structured dropout." arXiv preprint arXiv:1909.11556 (2019).
Pruning attention heads and MLP layers
• For fine-tuned BERT
• it is possible to find a subnetwork of elements that
achieves performance comparable with that of the
full model
• “good” subnetworks perform considerably better
than similarly-sized subnetworks sampled from the
less important components of the model. However,
both “bad” and “good” subnetworks can be fine-
tuned separately to achieve comparable
performance.
• 86% heads and 57% MLPs survive in less than 7
tasks, which raises concerns about the degree
to which BERT relies on task-specific heuristics
rather than general linguistic knowledge.

[118] Prasanna, Sai, Anna Rogers, and Anna Rumshisky. "When BERT Plays the Lottery, All Tickets Are Winning." arXiv preprint arXiv:2005.00561 (2020).
Agenda
• Need for compression of deep learning models
• Broad overview of popular ways of model compression
• Pruning
• Quantization
• Knowledge Distillation
• Parameter sharing
• Matrix decomposition
• Transformers with Linear Complexity
• Applications
• Summary and future trends
Quantization: Motivation
• Most computer architectures use 32 bits to represent weights.
• Estimated precision of the brain (hippocampal spine) synapses is around 4.6
bits. [85]
• Empirical evidence suggests that most quantities in the nervous system (for
instance, firing of the neurons) have variability of a few percent due to
biological noise, or a precision of 1 in 100 at best [86]. Thus, each decision
could depend on log 2 100=6.64 bits.

[85] Bartol, Thomas M., Cailey Bromer, Justin P. Kinney, Michael A. Chirillo, Jennifer N. Bourne, Kristen M. Harris, and Terrence J. Sejnowski. "Hippocampal spine head sizes are highly precise." bioRxiv (2015): 016329.
[86] Linden, David J., ed. Think Tank: Forty Neuroscientists Explore the Biological Roots of Human Experience. Yale University Press, 2018.
Quantization: Reducing #bits to represent each
weight
• Outline for quantization
• Binarized networks
• Ternarized networks
• General Quantized networks
Binary Quantization
• w_ij = +1 if w_ij ≥ 0; −1 if w_ij < 0
• Size Drop : 32X
• Runtime : Much faster (7x) matrix multiplication for binary matrices [29]
• Accuracy drop: roughly a 20% loss in top-5 accuracy on the ILSVRC dataset.
• In the forward pass, binary networks drastically reduce memory size and accesses, and
replace most arithmetic operations with bit-wise operations, which leads to great increases
of power efficiency
• We do not binarize while training but after the model has been trained.
• Unfortunately, binary quantization of the recurrent weights (𝑊ℎℎ ) never worked. [45]
• When the true value of a weight is near zero,
• its quantized value is stochastically sampled to be −1 or 1 with nearly equal probability.
• the magnitude of the weights increases and the vanishing/exploding gradients problem becomes more severe.
[84] Gong, Yunchao, Liu Liu, Ming Yang, and Lubomir Bourdev. "Compressing deep convolutional networks using vector quantization." arXiv preprint arXiv:1412.6115 (2014).
[45] Joachim Ott, Zhouhan Lin, Ying Zhang, Shih-Chii Liu, and Yoshua Bengio. 2016. Recurrent neural networks with limited numerical precision. arXiv preprint arXiv:1608.06902 (2016).
[29] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2017. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning
Research 18, 1 (2017), 6869–6898.
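A minimal sketch of deterministic (sign-based) binarization applied after training, with the resulting size estimate (our own illustrative code):

import numpy as np

def binarize(W):
    # Map each weight to +1 / -1 by sign (weights >= 0 -> +1).
    return np.where(W >= 0, 1, -1).astype(np.int8)

W = np.random.randn(1024, 1024).astype(np.float32)
W_bin = binarize(W)

# 32-bit floats -> 1 bit per weight: ~32x smaller (ignoring any metadata)
print((W.size * 32) / (W.size * 1))   # 32.0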
Binary Scheme Fixed vs Flexible
• Binary Scheme Fixed (BS-Fixed) [33]
• BS-Fixed stores the original weights and during the forward pass replaces the values with a
masked value of 𝑐1 or 𝑐2 , where 𝑐1 and 𝑐2 are fixed and chosen with hyperparameter tuning.
• w_ij = c₁ if w_ij ≥ 0; c₂ if w_ij < 0
• We keep full precision weights during training.
• At the end of training, we replace the weights with the index of its masked value.
• Binary Scheme Flexible or BS-Flexible [10]
• Choosing the values of 𝑐1 and 𝑐2 can be difficult and time-consuming in BS-Fixed.
• Initialize 𝑐1 and 𝑐2 by running k-means with two centroids over the weights, and then update 𝑐1
and 𝑐2 using back-propagation.
• w_ij = c₁ if w_ij ≥ (c₁ + c₂)/2; c₂ if w_ij < (c₁ + c₂)/2
• These changes eliminate the need for hyperparameter tuning.
[33] Maximilian Lam. 2018. Word2bits-quantized word vectors. arXiv preprint arXiv:1803.05651 (2018).
[10] Robin Cheong and Robel Daniel. 2019. transformers. zip: Compressing Transformers with Pruning and Quantization. Technical Report. Technical report, Stanford University, Stanford, California, 2019.
Stochastic binarization
• Replaces w_b = +1 if w ≥ 0, −1 if w < 0 with w_b = +1 with probability p = σ(w), −1 with probability 1 − p
• where σ(w) = clip((w + 1)/2, 0, 1) = max(0, min(1, (w + 1)/2)), also called the hard sigmoid.

• We only binarize the weights during the forward and backward propagations but not during the parameter update.
• Keeping good precision weights during the updates is necessary for SGD.
• Test-Time Inference: 3 options.
• Use the resulting binary weights 𝑤𝑏 (this makes most sense with the deterministic binarization).
• In the stochastic case, many different networks can be sampled by sampling a 𝑤𝑏 for each weight. The ensemble output of
these networks can then be obtained by averaging the outputs from individual networks.
• Use original weights. But this does not reduce model size.
• Straight Through Estimator (STE) trick: as the quantized value is an approximation of the original value, we can substitute the
gradient with respect to the quantized value for the gradient of original value.
• The trick allows the inclusion of quantization into the computation graph of BP and allows QNNs to represent parameters, activations and gradients with low bitwidth
numbers.
• More efforts that make train/test faster but do not reduce model size
• [87] convert multiplications in the backward pass into bit-shifts by restricting the activations to be power-of-two integers.
• Further, [88] binarizes weights and activations, at the inference phase and the entire training phase of a deep network.

[12] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. 2015. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems. 3123–3131.
[87] Lin, Zhouhan, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. "Neural networks with few multiplications." arXiv preprint arXiv:1510.03009 (2015).
[88] Hubara, Itay, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. "Binarized neural networks." In Advances in neural information processing systems, pp. 4107-4115. 2016.
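A small sketch of stochastic binarization with the hard sigmoid above (illustrative code; the straight-through-estimator behaviour is only indicated in a comment):

import numpy as np

def hard_sigmoid(w):
    # sigma(w) = clip((w + 1) / 2, 0, 1)
    return np.clip((w + 1.0) / 2.0, 0.0, 1.0)

def stochastic_binarize(W, rng):
    # Sample w_b = +1 with probability hard_sigmoid(w), else -1.
    # Binarization is used only in the forward/backward passes; the full-precision W
    # is kept for parameter updates, and with the straight-through estimator the
    # gradient w.r.t. w_b is simply copied to W.
    p = hard_sigmoid(W)
    return np.where(rng.random(W.shape) < p, 1.0, -1.0)

rng = np.random.default_rng(0)
W = np.array([-2.0, -0.2, 0.0, 0.2, 2.0])
print(stochastic_binarize(W, rng))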
Binary-Weight-Networks [90] and LAB [28]
• Approximate the weight W ∈ Rⁿ using a binary B ∈ {+1, −1}ⁿ and a scaling factor α ∈ R⁺ such that W ≈ αB.
• Thus, we wish to find α*, B* = argmin_{α,B} ||W − αB||²
• ||W − αB||² = α² BᵀB − 2α WᵀB + WᵀW
• Since B ∈ {+1, −1}ⁿ, BᵀB = n, and WᵀW is a constant.
• Thus, B* = argmax_B WᵀB s.t. B ∈ {+1, −1}ⁿ
• This optimization is solved by assigning Bᵢ = +1 when Wᵢ ≥ 0, and Bᵢ = −1 otherwise.
• To compute α*, setting the derivative of ||W − αB||² with respect to α to 0, we get α* = WᵀB*/n = Wᵀ sign(W)/n = Σᵢ |Wᵢ| / n.
• Thus, besides the binarized weight matrix, a scaling parameter is also learned in BWN.
• Loss-aware binarization (LAB) [28]
• Weight binarization can be formulated as the following optimization problem: min_ŵ loss(ŵ)
• s.t. ŵ_l = α_l b_l, α_l > 0, b_l ∈ {±1}^(n_l), l = 1, …, L, where L is the number of layers.
• Proximal Newton algorithm is used to solve for 𝛼𝑙 and 𝑏𝑙 .
[90] Rastegari, Mohammad, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. "Xnor-net: Imagenet classification using binary convolutional neural networks." In European conference on computer vision, pp. 525-542. Springer,
Cham, 2016.
[28] Lu Hou, Quanming Yao, and James T Kwok. 2016. Loss-aware binarization of deep networks. arXiv preprint arXiv:1611.01600 (2016).
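A minimal sketch of the closed-form BWN solution derived above, B* = sign(W) and α* = Σ|Wᵢ|/n (illustrative code):

import numpy as np

def bwn_binarize(W):
    # Closed-form solution of min_{alpha,B} ||W - alpha*B||^2 with B in {+1,-1}^n:
    # B* = sign(W), alpha* = sum(|W_i|)/n = mean(|W|).
    B = np.where(W >= 0, 1.0, -1.0)
    alpha = np.abs(W).mean()
    return alpha, B

W = np.random.randn(512, 512)
alpha, B = bwn_binarize(W)
rel_err = np.linalg.norm(W - alpha * B) / np.linalg.norm(W)
print(alpha, rel_err)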
Quantization: Reducing #bits to represent each
weight
• Outline for quantization
• Binarized networks
• Ternarized networks
• General Quantized networks
Ternary Connect
• Many learned weights are zero or close to zero.
• Ternary Connect: for w ∈ (0, 1], w_t = +1 with probability w and 0 with probability 1 − w; for w ∈ [−1, 0], w_t = 0 with probability 1 + w and −1 with probability −w.
• Like binary connect, ternary connect also eliminates all multiplications in the forward pass.
• In the deterministic form, the weights are quantized depending on 2 thresholds:

[87] Lin, Zhouhan, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. "Neural networks with few multiplications." arXiv preprint arXiv:1510.03009 (2015).
Ternary weight networks (TWN)
• Neural networks with weights constrained to +1, 0 and -1.
• TWNs achieve up to 16× or 32× model compression rate.
• The Euclidean distance between the full-precision weights W and the ternary weights W^t scaled by a factor α is minimized.

• This is difficult to solve. Hence, we approximate the solution with threshold-based ternary function.
• Further, we assume that 𝑊𝑖 s are generated from uniform or normal distribution.
• When the Wᵢ are uniformly distributed in [−a, a] and Δ lies in (0, a], the approximated Δ* is a/3, which equals (2/3)·E(|W|).
• When the Wᵢ are generated from normal distributions N(0, σ²), the approximated Δ* is 0.6σ, which equals 0.75·E(|W|).
• Thus, we can use the rule of thumb Δ* ≈ 0.7·E(|W|) ≈ (0.7/n) Σᵢ₌₁ⁿ |Wᵢ| for fast and easy computation.
• Ternary weights can also be used with SGD. Ternary-valued weights are used during the forward and
backward propagations but not during the parameters update.

[91] Li, Fengfu, Bo Zhang, and Bin Liu. "Ternary weight networks." arXiv preprint arXiv:1605.04711 (2016).
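A small sketch of threshold-based ternarization using the Δ* ≈ 0.7·E(|W|) rule of thumb above; computing α as the mean |Wᵢ| over the surviving weights follows the TWN paper and is not spelled out on the slide:

import numpy as np

def twn_ternarize(W):
    # Threshold-based ternarization with the rule of thumb delta = 0.7 * E|W|.
    # alpha = mean |W_i| over the surviving weights (TWN closed form; our assumption here).
    delta = 0.7 * np.abs(W).mean()
    W_t = np.zeros_like(W)
    W_t[W > delta] = 1.0
    W_t[W < -delta] = -1.0
    kept = np.abs(W) > delta
    alpha = np.abs(W[kept]).mean() if kept.any() else 0.0
    return alpha, W_t

W = np.random.normal(0.0, 0.1, size=(256, 256))
alpha, W_t = twn_ternarize(W)
print(alpha, (W_t == 0).mean())   # scaling factor and fraction of zeroed weights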
Trained Ternary Quantization
• This method uses two full-precision scaling coefficients W_l^p and W_l^n for each layer l, and quantizes the weights to {−W_l^n, 0, +W_l^p}.
• W_l^p and W_l^n are trainable parameters, learned using back-propagation.
• We also maintain latent full-precision weights at training time, and discard them at test time. We back-propagate the gradient to both W_l^p and W_l^n and to the latent full-precision weights. This makes it possible to adjust the ternary assignment (i.e., which of the three values a weight is assigned).
• First, we normalize the full-precision weights to the range [-1, +1] by dividing each weight by the maximum weight.
• 2 heuristics adopted:
1. Use the maximum absolute value of the weights as a reference for the layer’s threshold and maintain a constant factor t for all layers: Δ_l = t × max(|w̃|); and
2. Maintain a constant sparsity r for all layers throughout training. By adjusting the hyperparameter r we are able to obtain ternary weight networks with various sparsities.

Overview of the trained ternary quantization procedure.

[92] Zhu, Chenzhuo, Song Han, Huizi Mao, and William J. Dally. "Trained ternary quantization." arXiv preprint arXiv:1612.01064 (2016).
Gaussian based ternary weights
• We first calculate the mean (𝜇) and standard deviation (𝜎) of the weight of a
layer.

• Quaternary Gaussian networks

[1] Md Zahangir Alom, Adam T Moody, Naoya Maruyama, Brian C Van Essen, and Tarek M Taha. 2018. Effective quantization approaches for recurrent neural networks. In 2018 International Joint Conference on Neural
Networks (IJCNN). IEEE, 1–8.
Learning Quantization step size
• Greedy algorithm for search of the quantization step
size, Δ
1. Train network with full precision weights.
2. Quantize all input data and signals of hidden layers.
3. Start with the weight quantizer between the input layer and the first hidden layer, try several step sizes around the initial step size, and measure the output error of the network on the training set. The initial step size is determined using the L2-error minimizing approach (Lloyd-Max algorithm).
4. Choose the step size that minimizes the output error and quantize the weights.
5. Perform the third and fourth steps for the next layer until the last layer is reached.

Histograms of weights before and after applying 3-point and 7-point symmetric uniform quantization, where Δ denotes a quantization step size.
• Retrain the quantized neural network.
• Maintain not only quantized weights but also high-
precision weights.
• Forward pass is done using quantized weights and signals.
• Update these high precision weights, not the quantized
ones, to accumulates small error gradients.
• After updating the high precision weights, quantization is
performed to get new quantized weights.
Hwang, Kyuyeon, and Wonyong Sung. "Fixed-point feedforward deep neural network design using weights+ 1, 0, and− 1." In 2014 IEEE Workshop on Signal Processing Systems (SiPS), pp. 1-6. IEEE, 2014.
HitNet: Hybrid Ternary RNN
• Quantizing weights into ternary values {-1, 0, 1} can only save 1.4× memory consumption in the training phase, while quantizing both weights and activations can achieve up to 16× memory saving.

Theoretical memory saving for an LSTM model through quantization. W_t denotes ternary weights, A_t denotes ternary activations, and A_FP denotes full-precision activations.
• Ternary values can be further compressed by using only 1 bit to represent −1 and +1, without storing the zero values, on some specialized devices. Memory requirements with this further compression can be reduced by up to 32×.

The testing perplexity per word (PPW) of quantized LSTMs on PTB using different quantization methods. W_2bit represents 2-bit weights, W_FP represents full-precision weights, W_t represents ternary weights; the symbol A for activations has a similar meaning. UQ denotes Uniform Quantization, TTQ denotes Threshold Ternary Quantization, and BTQ denotes Bernoulli Ternary Quantization.

Uniform Quantization Threshold Ternary Quantization Bernoulli Ternary Quantization

[63] Peiqi Wang, Xinfeng Xie, Lei Deng, Guoqi Li, Dongsheng Wang, and Yuan Xie. 2018. HitNet: hybrid ternary recurrent neural network. In Advances in Neural Information Processing Systems. 604–614.
HitNet: Hybrid Ternary
RNN
• Fig a-c
• The distribution of weights in RNNs follows the normal
distribution although the range of 𝑊ℎ𝑜 is slightly different
from 𝑊𝑥𝑜 .
• The distribution of activations is very different from that of the weights. The activation range is limited to [0, 1] due to the activation functions, and most values lie near the two poles instead of the middle of this range.
• TTQ is preferable for weights. BTQ for activations.

The distribution of weights (i.e. Wxo, Who) and activations of output gate (i.e. ot)
in LSTM under different quantization methods. (a)-(c) show the distributions in full
precision (FP), (d)-(f) use the 2-bit uniform quantization (UQ), (g)-(i) use the
threshold ternary quantization (TTQ), (j)-(l) use the Bernoulli ternary quantization
(BTQ). The model used here is an LSTM trained on Penn Tree Bank (PTB) corpus.

[63] Peiqi Wang, Xinfeng Xie, Lei Deng, Guoqi Li, Dongsheng Wang, and Yuan Xie. 2018. HitNet: hybrid ternary recurrent neural network. In Advances in Neural Information Processing Systems. 604–614.
Quantization: Reducing #bits to represent each
weight
• Outline for quantization
• Binarized networks
• Ternarized networks
• General Quantized networks
Uniform Quantization
• If values are in the range x ∈ [−1, 1], use the following k-bit quantization:
• q_k(x) = 2 · ( round( ((x + 1)/2) · (2^k − 1) ) / (2^k − 1) − 1/2 )
• After this, scale back to the original range.
• Easy to implement, but far from optimal when quantizing non-uniform data, which is what trained weights and activations of deep neural networks are believed to be.
• Typically, bias parameters are not quantized.
• If the original weights X are in the range [0, 1], a k-bit quantizer Q that rounds to the nearest of 2^k evenly spaced values in [0, 1] can be used.
• When entries in X are not constrained to [0, 1], first map them into [0, 1]: X̃ = X / (2·max(|X|)) + 1/2
• After quantization, we can apply a reverse transform to approximate the original values. Overall, the quantized result is 2·max(|X|)·(Q(X̃) − 1/2).

[90] Rastegari, Mohammad, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. "Xnor-net: Imagenet classification using binary convolutional neural networks." In European conference on computer vision, pp. 525-542. Springer,
Cham, 2016.
[29] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2017. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning
Research 18, 1 (2017), 6869–6898.
[24] Qinyao He, He Wen, Shuchang Zhou, Yuxin Wu, Cong Yao, Xinyu Zhou, and Yuheng Zou. 2016. Effective quantization methods for recurrent neural networks. arXiv preprint arXiv:1611.10176 (2016).
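A minimal sketch of the k-bit uniform quantizer for x ∈ [−1, 1] above, plus one (of several possible) ways to rescale an arbitrary-range matrix before and after quantization (illustrative code):

import numpy as np

def quantize_k_bit(x, k):
    # q_k(x) = 2 * ( round( ((x + 1)/2) * (2^k - 1) ) / (2^k - 1) - 1/2 ), for x in [-1, 1]
    n = 2 ** k - 1
    return 2.0 * (np.round((x + 1.0) / 2.0 * n) / n - 0.5)

def quantize_matrix(X, k):
    # One way to handle arbitrary ranges: scale into [-1, 1], quantize, scale back.
    # Assumes X is not all zeros.
    scale = np.abs(X).max()
    return scale * quantize_k_bit(X / scale, k)

x = np.linspace(-1.0, 1.0, 9)
print(quantize_k_bit(x, k=2))                        # only 4 distinct levels in [-1, 1]
W = np.random.randn(4, 4)
print(np.abs(W - quantize_matrix(W, k=8)).max())     # small quantization error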
Original GRU equations

Quantization of GRU
• Besides the matrix multiplications needed to compute z_t, r_t and h̃_t, the gate structure of h̃_t and h_t brings in the need for element-wise multiplication.
• As h̃_t and h_t are also inputs to computations at the next timestep, and noting that a quantized value multiplied by a quantized value will have a larger bit-width, we need to insert additional quantization steps after the element-wise multiplications.

GRU quantization, where we assume the weights W_z, W_r, W have already been quantized to [−1, 1], and the input x_t has already been quantized to [−1, 1].
• Another problem with quantization of the GRU structure lies in the different value ranges of the gates. The range of tanh is [−1, 1], which is different from the value range [0, 1] of z_t and r_t.
• To simplify the implementation, we can replace the activation function of h̃_t with the sigmoid function so that (1 − z_t) ∗ h_{t−1} + z_t ∗ h̃_t ∈ [0, 1].

LSTM Quantization
[24] Qinyao He, He Wen, Shuchang Zhou, Yuxin Wu, Cong Yao, Xinyu Zhou, and Yuheng Zou. 2016. Effective quantization methods for recurrent neural networks. arXiv preprint arXiv:1611.10176 (2016).
Exponential Quantization
• Quantizing the weight values to an integer power of 2 is also a way of storing
weights in low precision and eliminating multiplications.
• Let p = |W| / 2^⌊log₂|W|⌋ − 1
• Deterministic exponential quantization
• log₂ W_b = ⌈log₂|W|⌉ if p > 0.5, else log₂ W_b = ⌊log₂|W|⌋
• Stochastic exponential quantization

• For the stochastic quantization, we sample the logarithm of weight value to be its
nearest 2 integers, and the probability of getting one of them is proportional to the
distance of the weight value from that integer.

[45] Joachim Ott, Zhouhan Lin, Ying Zhang, Shih-Chii Liu, and Yoshua Bengio. 2016. Recurrent neural networks with limited numerical precision. arXiv preprint arXiv:1608.06902 (2016).
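A small sketch of deterministic and stochastic power-of-two quantization as reconstructed above (illustrative; assumes non-zero weights):

import numpy as np

def exp_quantize(W, stochastic=False, rng=None):
    # Round each |w| to a power of two (sign is kept); assumes non-zero weights.
    # p = |w| / 2^floor(log2|w|) - 1 is the normalized distance to the lower power of two.
    sign = np.sign(W)
    mag = np.abs(W)
    lo = np.floor(np.log2(mag))
    p = mag / 2.0 ** lo - 1.0                     # in [0, 1)
    if stochastic:
        up = rng.random(W.shape) < p              # round the exponent up with probability p
    else:
        up = p > 0.5                              # deterministic rounding
    return sign * 2.0 ** np.where(up, lo + 1, lo)

W = np.array([0.3, -0.7, 1.5, -2.9])
print(exp_quantize(W))                                              # [ 0.25 -0.5  1.  -2. ]
print(exp_quantize(W, stochastic=True, rng=np.random.default_rng(0)))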
Balanced quantization
• Distributions of parameters in Neural Networks are often Floating point copy of weights after 60 epochs of training. The weight
imbalanced, such that the uniform quantization values follow a bell-shaped distribution, and the minimum and
determined from extremal values may under utilize maximum values differ a lot from the other values.
available bitwidth.
• When we quantize values, it may be desirable to make
the quantized values have balanced distributions, to take
full advantage of the available parameter space.
• The method starts by partitioning numbers into 2𝑘 bins
containing roughly the same number of entries. Results of imbalanced quantization (no equalization). After uniform
(percentiles) quantization of weight values to 2-bit numbers, the quantized values
concentrate on the central two out of four possible quantized values.
• Each partition is then mapped to an evenly-divided
interval in the closed interval [0, 1].
• Finally, the quantization step maps intervals into discrete
values and transforms the value range to be
approximately the same as input.
• q_k(x) = 2 · ( round( ((x + 1)/2) · (2^k − 1) ) / (2^k − 1) − 1/2 )

The histogram of the weight values is first equalized by a piecewise linear transform and then mapped to a symmetric distribution. The subfigures are (a) the histogram of floating-point weight values, (b) the histogram-equalized weight values, and (c) the quantized weight values. k = 2.

[71] Shu-Chang Zhou, Yu-Zhi Wang, He Wen, Qin-Yao He, and Yu-Heng Zou. 2017. Balanced quantization: An effective and efficient approach to quantized neural networks. Journal of Computer Science and Technology 32, 4
(2017), 667–682.
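A simplified NumPy sketch of balanced quantization: equal-count (percentile) bins, bin centers in [0, 1], then an affine map back to roughly the original value range (our own simplification of the pipeline described above):

import numpy as np

def balanced_quantize(W, k):
    # 2^k equal-count (percentile) bins -> bin centers in [0, 1] -> affine map back
    # to approximately the original value range.
    n_bins = 2 ** k
    flat = W.ravel()
    edges = np.percentile(flat, np.linspace(0, 100, n_bins + 1))
    bin_idx = np.clip(np.searchsorted(edges, flat, side="right") - 1, 0, n_bins - 1)
    levels01 = (bin_idx + 0.5) / n_bins
    quantized = flat.min() + levels01 * (flat.max() - flat.min())
    return quantized.reshape(W.shape)

W = np.random.randn(1000, 10)
Wq = balanced_quantize(W, k=2)
print(np.unique(Wq).size)          # 4 distinct levels, each holding ~25% of the weights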
Approximate Balanced quantization
• A naive implementation using percentiles as thresholds would require sorting of
weight values during each forward operation in BP, which may slow down the
training process of QNNs.
• The 2𝑘 evenly spaced percentiles required in histogram equalization can be
computed from the recursive application of partitioning of numbers by medians.
• Further, when a distribution has bounded variance σ, the mean µ approximates the
median m as there is an inequality bounding the difference, |µ−m| ≤ σ
• Thus, we can perform approximate histogram equalization without doing sorting.

Balanced quantization with median (before matching value range) Balanced quantization with mean (before matching value range)

[71] Shu-Chang Zhou, Yu-Zhi Wang, He Wen, Qin-Yao He, and Yu-Heng Zou. 2017. Balanced quantization: An effective and efficient approach to quantized neural networks. Journal of Computer Science and Technology 32, 4
(2017), 667–682.
K-Means Quantization after training
• We first train the neural network with high-resolution parameters.
• Then apply K-Means to the weights.
• After clustering, the value of each weight is set to the value of the center of the cluster it belongs to.
• We need to store the mapping from integers to cluster centers; only log₂(k) bits are needed to code the clusters.

[89] Muller, Lorenz K., and Giacomo Indiveri. "Rounding methods for neural networks with low resolution synaptic weights." arXiv preprint arXiv:1504.05767 (2015).
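A minimal sketch of post-training K-Means weight quantization, assuming scikit-learn's KMeans (my own illustration, not the authors' implementation).

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_quantize(w, k=16):
    """Cluster the trained weights and replace each weight by its cluster centroid."""
    flat = w.reshape(-1, 1)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(flat)
    codebook = km.cluster_centers_.ravel()         # k full-precision centroids
    codes = km.labels_.astype(np.uint8)            # log2(k) bits per weight (k <= 256 here)
    w_hat = codebook[codes].reshape(w.shape)       # de-quantized weights
    return w_hat, codebook, codes

w = np.random.randn(64, 64).astype(np.float32)
w_hat, codebook, codes = kmeans_quantize(w, k=16)
print(np.abs(w - w_hat).mean())                    # average quantization error
```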
K-Means Quantization during
training
• Multiple connections share the same weight, and we fine-tune
those shared weights.
• For the forward pass, the index stored for each connection is
mapped to a centroid which is then used as the weight.
• For back-propagation, during the update, all the gradients are grouped by cluster (color) and summed together, multiplied by the learning rate, and subtracted from the shared centroids from the last iteration.

Weight sharing by scalar quantization (top) and centroids fine-tuning (bottom).
• We use k-means clustering to identify the shared weights for
each layer of a trained network, so that all the weights that fall
into the same cluster will share the same weight.
• Weights are not shared across layers.
• To calculate the compression rate: given k clusters, we only need log₂(k) bits to encode the index. In general, for a network with n connections, each represented with b bits, constraining the connections to have only k shared weights results in a compression rate of:
  • 𝑟 = 𝑛𝑏 / (𝑛 log₂(𝑘) + 𝑘𝑏) = (16 ∗ 32) / (16 ∗ 2 + 4 ∗ 32) = 3.2

Distribution of weights (blue) and distribution of codebook before (green cross) and after fine-tuning (red dot).

[22] Song Han, Huizi Mao, and William J Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015).
Product Quantization (PQ) and Residual
Quantization (RQ)
• PQ
  • Partition the vector space into many disjoint subspaces, and perform quantization (K-Means) in each subspace.
  • Partition the weight matrix W column-wise: 𝑊 = [𝑊^1, 𝑊^2, …, 𝑊^𝑠], where 𝑊^𝑖 ∈ 𝑅^{𝑚×(𝑛/𝑠)}, assuming n is divisible by s.
  • Perform K-Means on each submatrix 𝑊^𝑖: min Σ_{𝑧=1}^{𝑚} Σ_{𝑗=1}^{𝑘} ||𝑤_𝑧^𝑖 − 𝑐_𝑗^𝑖||₂²
  • Thus, we get s codebooks.
  • The reconstructed matrix is 𝑊̂ = [𝑊̂^1, 𝑊̂^2, …, 𝑊̂^𝑠], where 𝑤̂_𝑗^𝑖 = 𝑐_𝑗^𝑖 (closest centroid).
  • PQ can be applied to either the x-axis or the y-axis of the matrix.
  • We need to store the cluster indexes and codebooks for each subvector. The compression rate for this method is (32mn)/(32kn + log₂(k)ms).
• RQ
  • First quantize the vectors into k centers.
  • Next, find the residuals for each data point (w − c) and perform K-Means on the residuals. Do this recursively t times.
  • The resultant weight vectors are then calculated as 𝑤̂_𝑧 = 𝑐_𝑗^1 + 𝑐_𝑗^2 + ⋯ + 𝑐_𝑗^𝑡, given we have recursively performed t iterations.
  • We need to store all the codebooks for each iteration, which potentially needs a large amount of memory. The compression rate is m/(tk + log₂(k)tn).

[84] Gong, Yunchao, Liu Liu, Ming Yang, and Lubomir Bourdev. "Compressing deep convolutional networks using vector quantization." arXiv preprint arXiv:1412.6115 (2014).
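A minimal sketch of the PQ step, assuming scikit-learn's KMeans (my own illustration); the residual (RQ) variant would repeat the same procedure on W − Ŵ.

```python
import numpy as np
from sklearn.cluster import KMeans

def product_quantize(W, s=4, k=16):
    """Split W column-wise into s sub-matrices and run K-Means on the rows of each one."""
    m, n = W.shape
    assert n % s == 0
    d = n // s
    W_hat = np.zeros_like(W)
    codebooks, codes = [], []
    for i in range(s):
        sub = W[:, i * d:(i + 1) * d]                        # m x (n/s) sub-matrix
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(sub)
        codebooks.append(km.cluster_centers_)                # codebook: k x (n/s)
        codes.append(km.labels_)                             # m indices, log2(k) bits each
        W_hat[:, i * d:(i + 1) * d] = km.cluster_centers_[km.labels_]
    return W_hat, codebooks, codes

W = np.random.randn(128, 64)
W_hat, _, _ = product_quantize(W)
print(np.linalg.norm(W - W_hat) / np.linalg.norm(W))         # relative reconstruction error
```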
Network Sketching: Greedy approximation
• Consider a weight vector 𝑊 ∈ 𝑅^𝑑.
• We approximate 𝑊 ≈ ⟨𝐵, 𝑎⟩ = Σ_{𝑗=0}^{𝑚−1} 𝑎_𝑗 𝐵_𝑗, where 𝐵 ∈ {+1, −1}^{𝑑×𝑚} and 𝑎 ∈ 𝑅^𝑚 are concatenations of m binary tensors {𝐵_0, …, 𝐵_{𝑚−1}} and the same number of scale factors {𝑎_0, …, 𝑎_{𝑚−1}}, resp.
• We wish to minimize the reconstruction error ||𝑊 − ⟨𝐵, 𝑎⟩||². This is NP-hard, hence this greedy solution:
  • 𝐵_𝑗 and 𝑎_𝑗 are sequentially learnt, and each of them is selected to be the current optimum with respect to the SSE minimization criterion.
  • (𝐵_𝑗, 𝑎_𝑗) = argmin_{𝐵∈{+1,−1}^𝑑, 𝑎∈𝑅} ||𝑊̂_𝑗 − 𝑎𝐵||², where ||𝑋|| = ⟨𝑋, 𝑋⟩^{1/2} and 𝑊̂_𝑗 indicates the approximation residue after combining all the previously generated tensor(s):
  • 𝑊̂_𝑗 = 𝑊 if j = 0, and 𝑊̂_𝑗 = 𝑊 − Σ_{𝑘=0}^{𝑗−1} 𝑎_𝑘 𝐵_𝑘 if 𝑗 ≥ 1.
• Solving by setting derivatives to 0 gives us 𝐵_𝑗 = sign(𝑊̂_𝑗) and 𝑎_𝑗 = ⟨𝐵_𝑗, 𝑊̂_𝑗⟩ / 𝑑.
• Refinement: in the j-th iteration, after minimizing, we add one extra step to refine all computed {𝑎_𝑖}_{𝑖=1}^{𝑗} with the least-squares solution:
  • [𝑎_1, …, 𝑎_𝑗] = (𝑏_𝑗^𝑇 𝑏_𝑗)^{−1} 𝑏_𝑗^𝑇 · 𝑊, where 𝑏_𝑗 = [𝐵_1, …, 𝐵_𝑗].

Approximate the real-valued weight tensor with a sum of scaled binary tensors.

[93] Guo, Yiwen, Anbang Yao, Hao Zhao, and Yurong Chen. "Network sketching: Exploiting binary structure in deep cnns." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5955-5963.
2017.
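A minimal NumPy sketch of the greedy approximation and the least-squares refinement (my own illustration, not the authors' code).

```python
import numpy as np

def sketch(W, m=3, refine=True):
    """Greedily approximate W with a sum of m scaled binary tensors a_j * B_j."""
    d = W.size
    residue = W.astype(np.float64).copy()
    Bs, As = [], []
    for _ in range(m):
        B = np.sign(residue)
        B[B == 0] = 1.0                               # map sign(0) to +1
        a = np.sum(B * residue) / d                   # a_j = <B_j, residue> / d
        Bs.append(B)
        As.append(a)
        residue = residue - a * B
    if refine:                                        # least-squares refit of all scale factors
        Bmat = np.stack([b.ravel() for b in Bs], axis=1)      # d x m
        As = np.linalg.lstsq(Bmat, W.ravel(), rcond=None)[0]
    W_hat = sum(a * b for a, b in zip(As, Bs))
    return W_hat, Bs, As

W = np.random.randn(8, 8)
W_hat, _, _ = sketch(W, m=3)
print(np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```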
Alternating Multi-bit quantization
• For k = 2:
  • If α1 and α2 are known in advance with α1 ≥ α2 ≥ 0, then the quantization codes are restricted to v = {−α1 − α2, −α1 + α2, α1 − α2, α1 + α2}.
  • For any weight w, its quantization code is determined by the least distance to all codes.
  • Consequently, we can partition the number axis into 4 intervals, each corresponding to one particular quantization code. The common point of two adjacent intervals then becomes the middle point of the two quantization codes, i.e., −α1, 0, and α1.
• For the general k-bit quantization, suppose that {α_𝑖}_{𝑖=1}^{𝑘} are known and we have all possible codes in ascending order, i.e., v = {−Σ_{𝑖=1}^{𝑘} α_𝑖, …, Σ_{𝑖=1}^{𝑘} α_𝑖}. The axis is divided into 2^𝑘 intervals. We can use a binary search (tree) to find the right interval and hence the code.
• In the refined greedy approximation, after modification of the computed α’s, the B’s are no longer optimal, while the method keeps all of them fixed.
• To improve the refined greedy approximation, alternately minimizing the α’s and the B’s becomes a natural choice.
• With greedy initialization, only two alternating cycles are enough to find a high-precision quantization.
• Alternating multi-bit quantization for LSTMs
  • Training an LSTM f using quantized weights ŵ = Σ_𝑖 α_𝑖 𝑏_𝑖 can be mathematically formulated as a bi-level optimization.

Illustration of the optimal 2-bit quantization when α1 and α2 (α1 ≥ α2) are known in advance. The values are quantized into −α1−α2, −α1+α2, α1−α2, and α1+α2, respectively. The partition intervals are optimally separated by the middle points of adjacent quantization codes, i.e., −α1, 0, and α1, correspondingly.

Illustration of the binary search tree to determine the optimal quantization.

[65] Chen Xu, Jianqiang Yao, Zhouchen Lin, Wenwu Ou, Yuanbin Cao, Zhirong Wang, and Hongbin Zha. 2018. Alternating multi-bit quantization for recurrent neural networks. arXiv preprint arXiv:1802.00150 (2018).
Quantized Word Vectors (Word2Bits)
• Each word vector is typically represented as a 300-500D vector, with each parameter being 32 bits.
As there are millions of words, word vectors may take up to 3-6 GB of memory/storage.
• Can we quantize word vectors?
• Quantized CBOW
• CBOW with negative sampling aims to minimize
  • 𝐽(𝑢_𝑜, 𝑣̂_𝑐) = − log 𝜎(𝑢_𝑜^𝑇 𝑣̂_𝑐) − Σ_{𝑖=1}^{𝑘} log 𝜎(−𝑢_𝑖^𝑇 𝑣̂_𝑐)
• Trained word vectors with varying levels of precision and
dimension on the 2017 English Wikipedia dump (24G of
text). Vocabulary size is 3.7 million.
• Thresholded vectors (T1) are computed by taking trained
full precision vectors and applying the 1-bit quantization
function on them.
• We use six datasets to evaluate word similarity and two
datasets to evaluate word analogy. The word similarity
test datasets are: WordSim Similarity, WordSim
Relatedness, MEN, Mechanical Turk, Rare Words and
Simlex. The word analogy test datasets are Google’s
analogy dataset and MSR’s analogy dataset.

Word similarity and analogy results


[33] Maximilian Lam. 2018. Word2bits-quantized word vectors. arXiv preprint arXiv:1803.05651 (2018).
Transformer.zip (Pruning+Quantization)
• Iterative Magnitude Pruning: X → Y%
• Start with X% pruned model. Prune (Y −X)/9 % of each of
the layers every 1001 iterations, and end after 10K iters.
• Regularization (L1/L2) while training.
• Quantization
  • Binary Scheme Fixed (BS-Fixed)
    • 𝑤_𝑖𝑗 = 𝑐_1 if 𝑤_𝑖𝑗 ≥ 0, else 𝑐_2; with 𝑐_1 and 𝑐_2 tuned as hyper-parameters.
  • Binary Scheme Flexible (BS-Flexible)
    • 𝑤_𝑖𝑗 = 𝑐_1 if 𝑤_𝑖𝑗 ≥ (𝑐_1 + 𝑐_2)/2, else 𝑐_2; with 𝑐_1 and 𝑐_2 initialized using K-Means.
  • K-Means with linear initialization
• The Transformer architecture is highly resistant to quantization, and is able to match the original model up to a 4-bit representation.
• Pruning performs much worse than quantization.

WMT 2017 test set BLEU and compression ratio.

[10] Robin Cheong and Robel Daniel. 2019. transformers. zip: Compressing Transformers with Pruning and Quantization. Technical Report. Technical report, Stanford University, Stanford, California, 2019.
Q-BERT: Hessian Based Group-wise Quantization
• The size of parameters in BERT_BASE model is 91MB for
embedding, 325MB for encoder and 0.01MB for output.
• Different encoder layers should use different #bits for
quantization. Layers that exhibit flatter curvature of the loss
gradient surface can be quantized to lower bit precision.

The loss landscape for different layers obtained by perturbing the parameters along the first
two dominant eigenvectors of the Hessian.

Quantization results for BERTBASE obtained with 128 groups in each layer. quantization bits
used for weights =“w-bits”, embedding = “e-bits”, model size in MB =“Size”, and model size
without embedding layer in MB =“Size-w/o-e”. All the models except for Baseline are using 8-
bits activation. Furthermore, we compare Q-BERT with direct quantization method (“DirectQ”)
without using mixed precision or group-wise quantization. “MP” =mixed-precision quantization.
[52] Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. 2019. Q-bert: Hessian based ultra low precision quantization of bert. arXiv preprint arXiv:1909.05840 (2019).
Agenda
• Need for compression of deep learning models
• Broad overview of popular ways of model compression
• Pruning
• Quantization
• Knowledge Distillation
• Parameter sharing
• Matrix decomposition
• Transformers with Linear Complexity
• Applications
• Summary and future trends
Overview
• Distillation methods vary on
• different types of teacher model
• different types of loss function
• squared error between the logits of the models
• KL divergence between the predictive distributions, or
• some other measure of agreement between the model predictions.
• different choices for what dataset the student model trains on.
• a large unlabeled dataset
• a held-out data set, or
• the original training set.
• Mimic what?
• Teacher’s class probabilities
• Teacher’s feature representation
• Learn from whom?
• Teacher, teacher assistant, other fellow students
• Adversarial learning
Knowledge distillation: Train student to mimic a pre-
trained, larger teacher
• Outline for knowledge distillation
• Various distillation architectures
• Learning students and teacher together
• Multiple teachers
• Adversarial methods
• Distilling Transformers
• Other distillation settings
Student Teacher Networks (Mimic models)
• Teacher training: We first train a state-of-the-art deep model (or an ensemble of deep models), and then train a shallow model to mimic the deep model.
• Student training: Use the logits before softmax from the teacher on unlabeled data. The mimic model is not trained on the original labels—it is trained to learn the function that was learned by the larger model.
• Notation: T: #instances; 𝑥^(𝑖): i-th instance; 𝑧^(𝑖): logits from the teacher.
• Such distilled student models are more accurate than the same shallow
student trained directly on the original labeled training data.
• Why Mimic Models Can Be More Accurate than Training on Original Labels
• Teacher removes noisy labels, if any.
• If there are complex regions in p(y|X) that are difficult to learn given the
features and sample density, the teacher may provide simpler, soft labels to the
student.
• The uncertainty from the teacher model is more informative to the student
model than the original 0/1 labels.
• The original targets may depend in part on features not available as inputs for
learning, but the student model sees targets that depend only on the input
features. The dependence on unavailable features has been eliminated by
filtering targets through the teacher model.
• Even if a shallow model has more parameters than a deep model,
inference is much faster on shallow models, especially with parallel
computation resources.
Comparison of shallow and deep models: phone error rate (PER) on
TIMIT Phoneme Recognition core test set. Key: c, convolution layer; p,
pooling layer; L, linear.

[3] Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep?. In Advances in neural information processing systems. 2654–2662.
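A minimal PyTorch sketch of mimic training in the spirit of [3] (my own illustration; the two `nn.Sequential` models and the random inputs are placeholders): the student regresses the teacher's logits with an L2 loss on unlabeled data.

```python
import torch
import torch.nn as nn

# Toy teacher and student; in practice the teacher is a trained deep model or ensemble.
teacher = nn.Sequential(nn.Linear(100, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
student = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(student.parameters(), lr=0.01)
mse = nn.MSELoss()

for _ in range(100):                       # loop over unlabeled mini-batches
    x = torch.randn(32, 100)               # stand-in for real unlabeled inputs
    with torch.no_grad():
        z_teacher = teacher(x)             # teacher logits (before softmax)
    loss = mse(student(x), z_teacher)      # L2 loss on logits; no hard labels are used
    opt.zero_grad()
    loss.backward()
    opt.step()
```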
Distillation
• The relative probabilities of incorrect answers tell us a lot about how the teacher
model tends to generalize.
• While [3] learn the student using L2 loss over logits, [26] suggests using
distillation.
• raise the temperature of the final softmax until teacher produces a suitably soft set of
targets.
  • Softmax with temperature: 𝑞_𝑖 = exp(𝑧_𝑖/𝑇) / Σ_𝑗 exp(𝑧_𝑗/𝑇)
  • Then use the same high temperature when training the small model to match these soft targets; after it has been trained, it uses a temperature of 1.
• Use cross entropy (H) for both the soft and the hard part of the loss function. Typically, a smaller weight for the hard part works better.
• When using both the hard and the soft loss, since the magnitudes of the gradients produced by the soft targets scale as 1/𝑇², it is important to multiply them by 𝑇².

[26] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
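A minimal PyTorch sketch of the distillation loss (my own illustration; the soft term uses KL divergence, whose gradient w.r.t. the student equals that of the soft cross-entropy up to a constant; the T and weight values are placeholders).

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, hard_weight=0.1):
    """Soft loss at temperature T (scaled by T^2) plus a smaller-weight hard-label loss."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    soft_loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return (1.0 - hard_weight) * soft_loss + hard_weight * hard_loss

student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(kd_loss(student_logits, teacher_logits, labels))
```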
Using derivatives of loss function for KD
• Sobolev Training for neural networks is a method for
incorporating target derivatives in addition to the target
values while training student network.
• Considering a neural network model m parameterised with θ, one typically seeks to minimise the empirical error in relation to f according to some loss function 𝑙:
  • min_θ Σ_{𝑖=1}^{𝑁} 𝑙(𝑚(𝑥_𝑖|θ), 𝑓(𝑥_𝑖))
• When learning in Sobolev spaces, this is replaced with:
  • min_θ Σ_{𝑖=1}^{𝑁} [ 𝑙(𝑚(𝑥_𝑖|θ), 𝑓(𝑥_𝑖)) + Σ_{𝑗=1}^{𝐾} 𝑙_𝑗(𝐷_𝑥^𝑗 𝑚(𝑥_𝑖|θ), 𝐷_𝑥^𝑗 𝑓(𝑥_𝑖)) ]
• where 𝑙_𝑗 are loss functions measuring the error on the j-th order derivatives.

a) Sobolev Training of order 2. Diamond nodes m and f indicate parameterised functions, where m is trained to approximate f. Green nodes receive supervision. Solid lines indicate connections through which the error signal from losses l, l1, and l2 is backpropagated to train m. b) Stochastic Sobolev Training of order 2. If f and m are multivariate functions, the gradients are Jacobian matrices. To avoid computing these high-dimensional objects, we can efficiently compute and fit their projections on a random vector 𝒗_𝒋 sampled from the unit sphere.

[103] Czarnecki, Wojciech M., Simon Osindero, Max Jaderberg, Grzegorz Swirszcz, and Razvan Pascanu. "Sobolev training for neural networks." In Advances in Neural Information Processing Systems, pp. 4278-4287. 2017.
Learning the flow of solution procedure (FSP)
• We define the distilled knowledge to be transferred in terms
of flow between layers, which is calculated by computing the
inner product between features from two layers.
• If we view the input of the DNN as the question and the
output as the answer, we can think of the generated features
at the middle of the DNN as the intermediate result in the
solution process.
• There are many ways to solve the problem of generating the output from the input. Hence, mimicking the generated features of the teacher DNN can be a hard constraint for the student DNN.
• Learning the solution process from the teacher is important.
• The FSP matrix 𝐺 ∈ 𝑅^{𝑚×𝑛} is generated by the features from two layers. Let one of the selected layers generate the feature map 𝐹¹ ∈ 𝑅^{ℎ×𝑤×𝑚}, where h, w, and m represent the height, width, and number of channels, respectively. The other selected layer generates the feature map 𝐹² ∈ 𝑅^{ℎ×𝑤×𝑛}. Then, the FSP matrix 𝐺 ∈ 𝑅^{𝑚×𝑛} is calculated by 𝐺_{𝑖𝑗} = Σ_{𝑠=1}^{ℎ} Σ_{𝑡=1}^{𝑤} 𝐹¹_{𝑠,𝑡,𝑖} · 𝐹²_{𝑠,𝑡,𝑗} / (ℎ · 𝑤).

The FSP (flow of solution procedure) matrix, which represents the distilled knowledge from the teacher DNN, is generated by the features from two layers. By computing the inner product, which represents the direction, to generate the FSP matrix, the flow between two layers can be represented by the FSP matrix.

[102] Yim, Junho, Donggyu Joo, Jihoon Bae, and Junmo Kim. "A gift from knowledge distillation: Fast optimization, network minimization and transfer learning." In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 4133-4141. 2017.
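A minimal PyTorch sketch of computing an FSP matrix from two feature maps (my own illustration, assuming channels-last tensors).

```python
import torch

def fsp_matrix(f1, f2):
    """f1: (batch, h, w, m), f2: (batch, h, w, n)  ->  FSP matrices of shape (batch, m, n)."""
    b, h, w, m = f1.shape
    n = f2.shape[-1]
    f1 = f1.reshape(b, h * w, m)
    f2 = f2.reshape(b, h * w, n)
    return torch.bmm(f1.transpose(1, 2), f2) / (h * w)   # G_ij = sum_{s,t} F1[s,t,i] F2[s,t,j] / (h*w)

f1 = torch.randn(2, 8, 8, 16)
f2 = torch.randn(2, 8, 8, 32)
print(fsp_matrix(f1, f2).shape)            # torch.Size([2, 16, 32])
# The distillation loss then penalizes the squared difference between the
# teacher's and the student's FSP matrices for the chosen layer pairs.
```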
KD from noisy teachers
• Include a noise-based regularizer while training
the student from the teacher
• Noise is Gaussian noise with mean 0 and std dev
𝜎. This noise is added to teacher’s logits.
• This perturbation need not be imposed on all
samples. We instead select samples from the mini-
batch with some fixed probability α, and the logit
values of the selected samples are then perturbed.
• The perturbed outputs, not only, simulate a
multiple-teacher setting, but also results in noise
in the loss layer, thus producing the effect of a
regularizer.
• Also, experiments showed that a noisy teacher is
more helpful than a noisy student.

[105] Sau, Bharat Bhusan, and Vineeth N. Balasubramanian. "Deep model compression: Distilling knowledge from noisy teachers." arXiv preprint arXiv:1610.09650 (2016).
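A minimal PyTorch sketch of the logit perturbation (my own illustration; the α and σ values are placeholders).

```python
import torch

def noisy_teacher_logits(teacher_logits, alpha=0.5, sigma=0.2):
    """Add zero-mean Gaussian noise to the logits of a randomly chosen subset of samples."""
    noise = torch.randn_like(teacher_logits) * sigma
    selected = (torch.rand(teacher_logits.size(0), 1) < alpha).float()   # pick samples w.p. alpha
    return teacher_logits + selected * noise

logits = torch.randn(8, 10)
print(noisy_teacher_logits(logits))
```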
KD with teacher assistant: Multi-step KD
• Student network performance degrades when the gap between student and teacher is large.
• Given a fixed student network, one cannot employ an arbitrarily large teacher, or in other words, a teacher can effectively transfer its knowledge to students up
to a certain size, not smaller.
• To alleviate this shortcoming, we introduce multi-step knowledge distillation, which employs an intermediate-sized network (teacher assistant) to bridge the
gap between the student and the teacher.
• TA models are distilled from the teacher, and the student is then only distilled from the TAs.
• Multi-step TA: E.g., distillation path 10 → 6 → 4 → 2

TA fills the gap between student & teacher

[97] Mirzadeh, Seyed-Iman, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. "Improved Knowledge Distillation via Teacher Assistant." arXiv preprint arXiv:1902.03393 (2019).
Fitnets (Hint-based training)
• Allow the training of a student that is deeper and thinner than the teacher
• Student is trained using not only the outputs but also the intermediate
representations learned by the teacher as hints to improve the training process and
final performance of the student.
• we choose a hidden layer of the FitNet, the guided layer, to learn from the teacher’s hint layer.
• Because the student intermediate hidden layer will generally be smaller than the
teacher’s intermediate hidden layer, additional parameters are introduced to map
the student hidden layer to the prediction of the teacher hidden layer.
  • 𝐿_𝐻𝑇(𝑊_𝐺𝑢𝑖𝑑𝑒𝑑, 𝑊_𝑟) = ½ ||𝑢_ℎ(𝑥; 𝑊_𝐻𝑖𝑛𝑡) − 𝑟(𝑣_𝑔(𝑥; 𝑊_𝐺𝑢𝑖𝑑𝑒𝑑); 𝑊_𝑟)||²
• Hint-based training with KD can be seen as Curriculum Learning with 2 stages
• First learn intermediate concepts via the hint/guided layer transfer,
• Then train the whole student network jointly, annealing λ, which allows easier examples (on
which the teacher is very confident) to initially have a stronger effect, but progressively
decreasing their importance as λ decays. λ is weight associated with the soft loss term.

[50] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550 (2014).
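A minimal PyTorch sketch of the hint loss (my own illustration; the regressor is a linear layer here, whereas the paper uses a convolutional regressor for convolutional hint/guided layers, and the dimensions are placeholders).

```python
import torch
import torch.nn as nn

teacher_hint_dim, student_guided_dim = 512, 128
regressor = nn.Linear(student_guided_dim, teacher_hint_dim)   # the extra parameters W_r

def hint_loss(teacher_hint, student_guided):
    """L_HT = 1/2 || u_h(x; W_Hint) - r(v_g(x; W_Guided); W_r) ||^2, averaged over the batch."""
    return 0.5 * ((teacher_hint - regressor(student_guided)) ** 2).sum(dim=-1).mean()

teacher_hint = torch.randn(16, teacher_hint_dim)       # output of the teacher's hint layer
student_guided = torch.randn(16, student_guided_dim)   # output of the student's guided layer
print(hint_loss(teacher_hint, student_guided))
```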
Knowledge distillation: Train student to mimic a pre-
trained, larger teacher
• Outline for knowledge distillation
• Various distillation architectures
• Learning students and teacher together
• Multiple teachers
• Adversarial methods
• Distilling Transformers
• Other distillation settings
Apprentice: Quantization+KD
• Student has similar topology as that of teacher, except that the student
network has low-precision neurons compared to the teacher network which
has neurons operating at full-precision.
• Apprentice: has three schemes which produce low-precision networks using
KD techniques.
A. a low-precision network and a full-precision network are jointly trained from
scratch using KD.
B. we start with a full-precision trained network and transfer knowledge from
this trained network continuously to train a low-precision network from
scratch. We find that the low-precision network converges faster (albeit to
similar accuracies as the first scheme) when a trained complex network
guides its training.
C. we start with a trained full-precision large network and an apprentice
network that has been initialised with full-precision weights. The apprentice The teacher network is a high precision network and
network’s precision is lowered and is fine-tuned using KD. the apprentice network is a low-precision network.
• Each of these three schemes produce state-of-the-art ternary precision and
4-bit precision models.
• We quantize both weights and activations for student. We do not lower the
precision of the first layer and the final layer in the apprentice network.
• α = 1, β = 0.5 and γ = 0.5.
• Scheme C is better than A and B.

[96] Mishra, Asit, and Debbie Marr. "Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy." arXiv preprint arXiv:1711.05852 (2017).
Deep Mutual Learning (DML)
• Different from the one-way transfer between a static pre-defined teacher and a student in
model distillation, with DML, an ensemble of students learn collaboratively and teach each
other throughout the training process.
• Surprisingly, it is revealed that no prior powerful teacher network is necessary – mutual
learning of a collection of simple student networks works, and moreover outperforms
distillation from a more powerful yet static teacher.
• Specifically, each student is trained with two losses: a conventional supervised learning loss,
and a mimicry loss that aligns each student’s class posterior with the class probabilities of
other students.

Extension to Larger Student Cohorts

[69] Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. 2018. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4320–4328.
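A minimal PyTorch sketch of the two-student mutual learning losses (my own illustration).

```python
import torch
import torch.nn.functional as F

def dml_losses(logits1, logits2, labels):
    """Each student's loss = supervised cross-entropy + KL mimicry term toward its peer."""
    logp1 = F.log_softmax(logits1, dim=-1)
    logp2 = F.log_softmax(logits2, dim=-1)
    loss1 = F.cross_entropy(logits1, labels) + \
        F.kl_div(logp1, logp2.exp().detach(), reduction="batchmean")   # peer is a fixed target
    loss2 = F.cross_entropy(logits2, labels) + \
        F.kl_div(logp2, logp1.exp().detach(), reduction="batchmean")
    return loss1, loss2                       # each student is updated with its own loss

logits1 = torch.randn(8, 10, requires_grad=True)
logits2 = torch.randn(8, 10, requires_grad=True)
labels = torch.randint(0, 10, (8,))
print(dml_losses(logits1, logits2, labels))
```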
Co-distillation
• Co-distillation refers to distillation performed:
(1) using the same architecture for all the
models; (2) using the same dataset to train all
the models; and (3) using the distillation loss
during training before any model has fully
converged.
• A key characteristic of co-distillation is the simultaneous training of a model and its teacher.
• In the beginning of training, the distillation term in the loss is not very useful or may even be counterproductive, so to maintain model diversity longer and to avoid a complicated loss-function schedule, we only enable the distillation term in the loss function once training has gotten off the ground.
• Distributed scenario
  • The algorithm should be communication efficient.
  • To update the parameters of one network using co-distillation, one only needs the predictions of the other networks, which can be computed locally from copies of the other networks’ weights.
  • Empirically, using stale predictions instead of up-to-date predictions for the other neural networks has little to no adverse effect on the quality of the final trained model produced by co-distillation.
[2] Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E Dahl, and Geoffrey E Hinton. 2018. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235 (2018).
Knowledge distillation: Train student to mimic a pre-
trained, larger teacher
• Outline for knowledge distillation
• Various distillation architectures
• Learning students and teacher together
• Multiple teachers
• Adversarial methods
• Distilling Transformers
• Other distillation settings
Distillation from many teachers (a generalist and an
ensemble of specialists)
• When the number of classes is very large, the teacher model could be an ensemble that contains one
generalist model trained on all the data and many “specialist” models, each of which is trained on data that is
highly enriched in examples from a very confusable subset of the classes (like different types of mushroom).
• The softmax of this type of specialist can be made much smaller by combining all of the classes it does not
care about into a single dustbin class.
• Each specialist model is initialized with the weights of the generalist model. These weights are then slightly
modified by training the specialist with half its examples coming from its special subset and half sampled at
random from the remainder of the training set.
• In order to derive groupings of object categories for the specialists, we decided to focus on categories that our full network often confuses. We compute the covariance matrix of the predictions of our generalist model, and then apply K-Means to the columns of the covariance matrix.
• Training the student
• Step 1: For each instance, we find the n most probable classes according to the generalist model. Call this set of classes k.
  • Step 2: We then take all the specialist models, m, whose special subset of confusable classes, 𝑆^𝑚, has a non-empty intersection with k, and call this the active set of specialists 𝐴_𝑘 (note that this set may be empty). We then find the full probability distribution q over all the classes that minimizes: 𝐾𝐿(𝑝^𝑔, 𝑞) + Σ_{𝑚∈𝐴_𝑘} 𝐾𝐿(𝑝^𝑚, 𝑞)
  • The distribution 𝑝^𝑚 is a distribution over all the specialist classes of m plus a single dustbin class, so when computing its KL divergence from the full q distribution we sum all of the probabilities that the full q distribution assigns to all the classes in m’s dustbin.

[26] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
Multi-lingual NMT using KD
• Individual models are first trained and regarded as teachers, and then the
multilingual model is trained to fit the training data and match the outputs of
individual models simultaneously through KD.
• When the accuracy of multilingual model surpasses the individual model for the accuracy
threshold τ on a certain language pair, we remove the distillation loss and just train the model
with original negative log-likelihood loss for this pair.
• It is burdensome to load all the teacher models in the GPU memory for distillation considering
there are dozens or even hundreds of language pairs in the multilingual setting.
• Alternatively, we first generate the output probability distribution of each teacher model for the sentence
pairs offline, and then just load the top-K probabilities of the distribution into memory and normalize them
so that they sum to 1 for distillation. This can reduce the memory cost again from the scale of |V| (the
vocabulary size) to K.
• It turns out that one model is enough to handle multiple languages (up to 44
languages), with comparable or even better accuracy than individual models.
[57] Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2019. Multilingual neural machine translation with knowledge distillation. arXiv preprint arXiv:1902.10461 (2019).
Learning from Multiple Teacher Networks
• FitNets: encouraged an intermediate layer of student (guided layer) to predict outputs of some intermediate
layer of teacher (a.k.a. hint layer).
• Representational distance learning (RDL) [112]: enables a student to learn the intermediate representational
spaces of a teacher network by turning to representational distance (or dissimilarity) matrices using a portion
of training examples.
  • min_{θ_𝑆} 𝐿_𝑅𝐷𝐿(θ_𝑆) = 1/(2(𝑛²−𝑛)) Σ_{𝑖=1}^{𝑛} Σ_{𝑗≠𝑖}^{𝑛} ( 𝑑_{𝑆,𝑖𝑗}^{𝑖𝑛𝑡𝑒𝑟} − 𝑑_{𝑇,𝑖𝑗}^{𝑖𝑛𝑡𝑒𝑟} )²
  • where the pairwise distances are 𝑑_{𝑆,𝑖𝑗}^{𝑖𝑛𝑡𝑒𝑟} = 𝑑(𝑂_𝑆^{𝑖𝑛𝑡𝑒𝑟}(𝑥_𝑖), 𝑂_𝑆^{𝑖𝑛𝑡𝑒𝑟}(𝑥_𝑗)) and 𝑑_{𝑇,𝑖𝑗}^{𝑖𝑛𝑡𝑒𝑟} = 𝑑(𝑂_𝑇^{𝑖𝑛𝑡𝑒𝑟}(𝑥_𝑖), 𝑂_𝑇^{𝑖𝑛𝑡𝑒𝑟}(𝑥_𝑗)), with d(·,·) a distance metric.

• We present a method to train a thin deep network by incorporating multiple teacher networks not only in
output layer by averaging the softened outputs (dark knowledge) from different networks, but also in the
intermediate layers by imposing a constraint about the dissimilarity among examples.
• We select data triplets from within a mini-batch that are most inconsistent with the ones provided by teacher
networks, for faster training.
• Which intermediate layer of the student and teacher networks should be used?
• Middle layer from student. For teachers, select layers such that most teachers are consistent with the resulting order
relationships under the voting strategy.
• Moreover, we leverage a voting strategy to unify multiple relative dissimilarity information provided by
multiple teacher networks.

[106] You, Shan, Chang Xu, Chao Xu, and Dacheng Tao. "Learning from multiple teacher networks." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1285-1294.
2017.
[112] Patrick McClure and Nikolaus Kriegeskorte. 2016. Representational Distance Learning for Deep Neural Networks. Frontiers in Computational Neuroscience 10 (2016).
Learning from Multiple Teacher Networks
• Simple way to do KD with multiple teachers: 𝐿_𝐾𝐷^{𝑚𝑢𝑙𝑡𝑖} = (1/𝑚) Σ_{𝑡=1}^{𝑚} 𝐻( 𝑁_{𝑇_𝑡}^τ(𝑥_𝑖), 𝑁_𝑆^τ(𝑥_𝑖) )
• Let 𝑝_𝑖 be the intermediate output from the teacher for example 𝑥_𝑖, and 𝑞_𝑖 that of the student. Let 𝑤_𝑠 be the weights in the student up to the intermediate output.
• Also, suppose the teacher indicates 𝑑(𝑝_𝑖, 𝑝_𝑖^+) < 𝑑(𝑝_𝑖, 𝑝_𝑖^−). The student is trained to preserve this ordering on its own intermediate outputs with a hinge (triplet) loss:
  • 𝐿_𝑅𝐷(𝑤_𝑠; 𝑥_𝑖, 𝑥_𝑖^+, 𝑥_𝑖^−) = max(0, 𝑑(𝑞_𝑖, 𝑞_𝑖^+) − 𝑑(𝑞_𝑖, 𝑞_𝑖^−) + δ)
  • where δ > 0 is a small number to prevent the trivial solution.
• Combining multiple teacher networks
• Multiple teacher networks may have different ordering
relationships. Use voting.
• Overall loss for a triplet pair (𝑥𝑖 , 𝑥𝑖+ , 𝑥𝑖− ) and (𝑦𝑖 , 𝑦𝑖+ , 𝑦𝑖− )

A graphical diagram for the proposed method to train a new thin deep student network by
incorporating multiple comparable teacher networks. The method consists of three losses,
including label prediction loss, dark knowledge loss and the relative similarity loss. The
incorporation of multiple teacher networks exists in two places. One is in the output layers
via averaging the softened output targets; the other lies in the intermediate layer by
determining the best triplet ordering relationships.

[106] You, Shan, Chang Xu, Chao Xu, and Dacheng Tao. "Learning from multiple teacher networks." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 1285-1294. 2017.
Knowledge distillation: Train student to mimic a pre-
trained, larger teacher
• Outline for knowledge distillation
• Various distillation architectures
• Learning students and teacher together
• Multiple teachers
• Adversarial methods
• Distilling Transformers
• Other distillation settings
KD with Adversarial samples
• The generalization performance of a classifier is closely related to the adequacy of its decision boundary, so a good classifier bears a good decision
boundary.
• Therefore, transferring information closely related to the decision boundary can be a good attempt for knowledge distillation.
• To realize this goal, we utilize an adversarial attack to discover samples supporting a decision boundary, and train a student classifier based on these
samples.
• To obtain the informative samples close to the decision boundary, we utilize an adversarial attack.
• An adversarial attack tries to find a small modification that can change the class of a sample, i.e., it tries to move the sample beyond a nearby decision boundary.
• A boundary supporting sample (BSS) is an adversarial sample that lies near the decision boundary of a teacher classifier. A BSS is obtained by a gradient descent
method based on a loss function defined over classification scores, and it contains information about both the distance and the path direction from the base
sample to the decision boundary.
• A new loss function using BSSs is suggested for knowledge distillation that transfers decision boundary (in terms of magnitude+direction) to
student.

Iterative scheme to find BSSs for a base sample


The dots in the figure represent the training sample and the circle around a dot represents the distance to the nearest decision
boundary. The samples close to the decision boundary enable more accurate knowledge transfer.

[98] Heo, Byeongho, Minsik Lee, Sangdoo Yun, and Jin Young Choi. "Knowledge distillation with adversarial samples supporting decision boundary." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp.
3771-3778. 2019.
Conditional adversarial networks for KD
Typical KD proposed by Hinton:
• Knowledge is transferred from teacher to student through a
discriminator in our GAN-based approach.
• Discriminator is trained to distinguish whether the output
logits is from teacher or student network, while the student
(the generator) is adversarially trained to fool the
discriminator, i.e., output logits similar to the teacher logits so
that the discriminator can not distinguish.
• The deep teacher is pretrained offline.
• Discriminator update
• The student and discriminator are alternatively updated in the GAN-
based approach.
• We use MLP as discriminator.
• The number of nodes in each layer is the same as the dimension of
logits, i.e., the number of categories C. We denote the discriminator
that predicts binary value “Real/Fake” as D(·).
• Output of discriminator D(·) is a C + 2 dimensional vector with C
Label predictions and a Real/Fake prediction.
• To train D, we fix the student network F(·) and seek to maximize
L_discriminator
• 𝑙𝑖 is the real label for instance i.

[99] Xu, Zheng, Yen-Chang Hsu, and Jiawei Huang. "Training shallow and thin networks for acceleration via knowledge distillation with conditional adversarial networks." arXiv preprint arXiv:1709.00513 (2017).
KDGAN: Knowledge Distillation with Generative
Adversarial Networks
• The classifier and the teacher learn from each other via distillation losses and are adversarially trained against the discriminator via adversarial
losses.
• Within each epoch, we first train the discriminator, then the teacher and finally the student classifier.
• We formulate KDGAN as a minimax game with a classifier C, a teacher T, and a discriminator D.
• Both T and D can use privileged info as features, i.e., features that are available at train time but not at test time (e.g., because such features are
expensive to compute at test time).
• In KDGAN, D aims to maximize the probability of correctly distinguishing the true and pseudo labels, whereas C and T aim to minimize the
probability that D rejects their generated pseudo labels. Meanwhile, C learns from T by mimicking the learned distribution of T. To build a general
framework, we also enable T to learn from C because, in reality, a teacher’s ability can also be enhanced by interacting with students

[100] Wang, Xiaojie, Rui Zhang, Yu Sun, and Jianzhong Qi. "Kdgan: Knowledge distillation with generative adversarial networks." In Advances in Neural Information Processing Systems, pp. 775-786. 2018.
Knowledge distillation: Train student to mimic a pre-
trained, larger teacher
• Outline for knowledge distillation
• Various distillation architectures
• Learning students and teacher together
• Multiple teachers
• Adversarial methods
• Distilling Transformers
• Other distillation settings
BERT student with small vocab
• The embedding table of the BERTBASE model, comprising over 30K
WordPiece tokens, accounts for over 21% of the model size. Can we
compress this word embedding matrix?
• Dual Training
• During distillation, for a given training sequence input to the teacher model, we
propose to mix the teacher and student vocabularies by randomly selecting (with a
probability 𝑝𝐷𝑇 ) tokens from the sequence to segment using the student
vocabulary, with the other tokens segmented using the teacher vocabulary.
• MLM task: the model now needs to learn to predict words from the student
vocabulary using context words segmented using the teacher vocabulary, and vice
versa.
• The expectation is that the student embeddings can be learned effectively this
way from the teacher embeddings as well as teacher model parameters.
• We perform dual training only for the teacher model inputs: the student model
receives words segmented exclusively using the student vocabulary. Also, during
MLM, the model uses different softmax layers for the teacher and the student
vocabularies depending on which one was used to segment the word in question.
• Shared Variable Projections
• Instead of distilling solely on the teacher model’s final-layer outputs, our model
leverages layer-wise teacher model parameters to directly optimize the
parameters of the corresponding layers in the student model.
• In order to align the student variable and the teacher variable’s projection, we
introduce a separate mean square error loss, where ↓ stands for down projection
(since the projection is to a lower dimension).
• Our method can compress BERT_BASE by 61.94x resulting in a language
model with a footprint of under 7MB. Student is 12 layer with 48D size.
KD on BERT with smaller student vocabulary. (Left) A pre-trained teacher BERT with default
parameters (e.g., 30K vocab, 768 hidden state dimension). (Right) A student BERT trained from
scratch with smaller vocab (5K) and hidden state dimension (e.g., 48). During distillation, the
teacher model randomly selects a vocabulary to segment each input word. The red and green
square next to the transformer layers indicate trainable parameters for both the student and
teacher models. The projection matrices U and V, shown as having representative shapes, are
shared across all layers for model parameters that have the same dimensions.

[70] Sanqiang Zhao, Raghav Gupta, Yang Song, and Denny Zhou. 2019. Extreme Language Model Compression with Optimal Subwords and Shared Projections. arXiv preprint arXiv:1909.11687 (2019).
Patient Knowledge Distillation

M denotes the number of layers in the student network, N is the number of training samples.

Model architecture of the Patient KD approach to BERT model compression. (Left) PKD-Skip: the student network learns the teacher’s outputs in every 2 layers. (Right) PKD-Last: the student learns the teacher’s outputs from the last 6 layers. Trm: Transformer.
[55] Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for bert model compression. arXiv preprint arXiv:1908.09355 (2019).
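A minimal PyTorch sketch of the patient loss as described in [55] (my own reading, not the authors' code): mean-squared error between L2-normalized [CLS] hidden states of mapped student/teacher layers; the layer mapping below is a hypothetical PKD-Skip-style example.

```python
import torch
import torch.nn.functional as F

def patient_loss(student_cls_states, teacher_cls_states, layer_map):
    """MSE between L2-normalized [CLS] hidden states of mapped student/teacher layers."""
    loss = 0.0
    for s_idx, t_idx in layer_map:
        hs = F.normalize(student_cls_states[s_idx], dim=-1)
        ht = F.normalize(teacher_cls_states[t_idx], dim=-1)
        loss = loss + ((hs - ht) ** 2).sum(dim=-1).mean()
    return loss / len(layer_map)

student_states = [torch.randn(4, 768) for _ in range(6)]    # 6-layer student, [CLS] per layer
teacher_states = [torch.randn(4, 768) for _ in range(12)]   # 12-layer teacher, [CLS] per layer
pkd_skip_map = [(i, 2 * i + 1) for i in range(5)]           # hypothetical PKD-Skip-style mapping
print(patient_loss(student_states, teacher_states, pkd_skip_map))
```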
KD in MTL setting
• Given the soft targets of the training
data across multiple tasks,
• a single MT-DNN (student) is trained using
MTL and back propagation,
• except that if task t has a teacher,
• the task-specific loss is the average of two Architecture of the MT-DNN model for representation learning
objective functions,
• one for the correct targets and
• the other for the soft targets assigned by the
teacher.
• Distilled MT-DNN significantly
outperforms the original MT-DNN on 7
out of 9 GLUE tasks, pushing the GLUE
benchmark (single model) to 83.7%
[38] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding. arXiv preprint arXiv:1904.09482 (2019).
TinyBERT
• TinyBERT achieves >96% of the performance of the teacher BERT_BASE on the GLUE benchmark, while being 7.5x smaller and 9.4x faster on inference.
• The student loss is defined across all layers. Each student layer is first mapped to a teacher layer.

A summary of KD methods for BERT. Abbreviations: INIT (initializing student BERT with some layers of pre-trained teacher BERT), DA (conducting data augmentation for task-specific training data). Embd, Attn, Hidn, and Pred represent the knowledge from embedding layers, attention matrices, hidden states, and final prediction layers, respectively.

An overview of Transformer distillation: (a) the framework of Transformer distillation, (b) the details of Transformer-layer distillation consisting of Attn_loss (attention based distillation) and Hidn_loss (hidden states based distillation).
[110] Jiao, Xiaoqi, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. "Tinybert: Distilling bert for natural language understanding." arXiv preprint arXiv:1909.10351 (2019).
MiniLM

Overview of Deep Self-Attention Distillation. The student is trained by deeply mimicking the self-attention behavior of the last Transformer layer of the teacher. Besides self-attention distributions, we introduce the self-attention value-relation transfer to help the student achieve a deeper mimicry.

• |x| and 𝐴ℎ represent the sequence length and the number of attention heads. L
and M represent the number of layers for the teacher and student.
• 𝐿 = 𝐿𝐴𝑇 + 𝐿𝑉𝑅
• With a L-layer Transformer teacher with 𝑑ℎ hidden size and M-layer Transformer
student with 𝑑ℎ ′ hidden size, we first distill the teacher into a teacher assistant
with L-layer Transformer and 𝑑ℎ ′ hidden size. Comparison between the publicly released 6-layer models with 768 hidden size distilled
• 6-layer model of 768 hidden dimensions distilled from BERTBASE is 2.0× faster, from BERTBASE. We compare task agnostic distilled models without task-specific
while retaining more than 99% accuracy on SQuAD 2.0 and several GLUE distillation and data augmentation. We report F1 for SQuAD 2.0, and accuracy for other
benchmark tasks. datasets.

Wang, Wenhui, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. "Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers." arXiv preprint arXiv:2002.10957 (2020).
Distilling BERT to BiLSTMs
• We distill knowledge from BERT into a single-layer BiLSTM, as well as its
Siamese counterpart for sentence-pair tasks. The BiLSTM model for single-sentence classification. The
• concatenate–compare operation (in Siamese) between the two sentence vectors: labels are (a) input embeddings, (b) BiLSTM, (c, d) backward
𝑓 ℎ𝑠1 , ℎ𝑠2 = [ℎ𝑠1 , ℎ𝑠2 , ℎ𝑠1 ⊙ ℎ𝑠2 , |ℎ𝑠1 − ℎ𝑠2 |] and forward hidden states, respectively, (e, g) fully-connected
layer; (e) with ReLU, (f) hidden representation, (h) logit
• The distillation objective is to penalize the mean-squared-error (MSE) loss outputs, (i) softmax activation, and (j) final probabilities.
between the student network’s logits against the teacher’s logits
• Across multiple datasets in paraphrasing, NLI, and sentiment
classification, we achieve comparable results with ELMo, while using
roughly 100 times fewer parameters and 15 times less inference time.
• To facilitate effective knowledge transfer, however, we often require a
large, unlabeled dataset.
• Getting this could be difficult. Three textual data augmentation approaches
• With probability 𝑝𝑚𝑎𝑠𝑘 , we randomly replace a word with [MASK], which corresponds
to an unknown token in our models and the masked word token in BERT. e.g., the
teacher network produces less confident logits for “I [MASK] the comedy” than for “I
loved the comedy.”
• POS-guided word replacement. With probability 𝑝𝑝𝑜𝑠 , we replace a word with another
of the same POS tag. e.g., “What do pigs eat?” => “How do pigs eat?” The siamese BiLSTM model for sentence matching, with
• To preserve the original training distribution, the new word is sampled from the unigram word shared encoder weights for both sentences. The labels are (a)
distribution re-normalized by POS tag.
BiLSTM, (b, c) final backward and forward hidden states,
• n-gram sampling. With probability 𝑝𝑛𝑔 , we randomly sample an n-gram from the respectively, (d) concatenate–compare unit, (e, g) fully
example, where n is randomly selected from {1, 2, . . . , 5}. This rule is conceptually
equivalent to dropping out all other words in the example. connected layer; (e) with ReLU, (f) hidden representation, (h)
logit outputs, (i) softmax activation, and (j) final probabilities.

[58] Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. 2019. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks. arXiv preprint arXiv:1903.12136 (2019).
XtremeDistil

• Stage-wise optimization scheme leveraging teacher internal representations, that is agnostic of teacher architecture.
• We do representation transfer from Transformer-based teacher model to BiLSTM-based student model with different embedding dimensions and disparate output
spaces. We use mBERT as our teacher, and 1-layer BiLSTM as student.
• Distillation Features: (1) Teacher Logits (2) Internal Teacher Representations: Hidden representations from a chosen teacher layer l.
• Projection: To make all output spaces compatible, we perform a non-linear projection of the parameters in the student representation ℎ^𝑠 to have the same shape as the teacher representation 𝑧_𝑙^𝑡 for each token 𝑥_𝑘. The projection parameters are learned by minimizing the KL-divergence (KLD) between the student and the 𝑙-th layer teacher representations.
• Stage-wise training
• We do not optimize all loss functions jointly
  • First stage: train the student to mimic teacher representations from its 𝑙-th layer by optimizing 𝑅_𝑅𝐿 on unlabeled data. The student learns the parameters for word embeddings (θ_𝑤), BiLSTM (θ_𝑏) and projections ⟨𝑊^𝑓, 𝑏^𝑓⟩.
  • Second stage: optimize the cross-entropy loss 𝑅_𝐶𝐸 and the logit loss 𝑅_𝐿𝐿 jointly, on labeled and unlabeled data respectively, to learn the corresponding parameters 𝑊_𝑠 and ⟨𝑊^𝑟, 𝑏^𝑟⟩.
• This can be further broken down in two stages, where we sequentially optimize logit loss 𝑅𝐿𝐿 on unlabeled data and then optimize cross-entropy loss 𝑅𝐶𝐸 on labeled data.
• Every stage learns parameters conditioned on those learned in previous stage followed by end-to-end fine-tuning.
• Gradual Unfreezing
• To avoid ‘catastrophic forgetting’, we start from the top layer that contains the most task-specific information and allow the model to configure the task-specific
layer first while others remain frozen. The latter layers are gradually unfrozen one by one and the model trained till convergence. Once a layer is unfrozen, it
maintains the state.
• The approach leads to massive compression of teacher models like mBERT by upto 35x in terms of parameters and 51x in terms of latency for batch
inference while retaining 95% of its F1-score for NER over 41 languages.

Mukherjee, Subhabrata, and Ahmed Hassan Awadallah. "XtremeDistil: Multi-stage Distillation for Massive Multilingual Models." In Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, pp. 2221-2234. 2020.
Knowledge distillation: Train student to mimic a pre-
trained, larger teacher
• Outline for knowledge distillation
• Various distillation architectures
• Learning students and teacher together
• Multiple teachers
• Adversarial methods
• Distilling Transformers
• Other distillation settings
Sequence-Level Knowledge Distillation
Overview of the different knowledge distillation
approaches. In word-level knowledge distillation (left)
cross-entropy is minimized between the student/teacher
distributions (yellow) for each word in the actual target
sequence (ECD), as well as between the student
distribution and the degenerate data distribution, which
has all of its probability mass on one word (black). In
sequence-level knowledge distillation (center) the
student network is trained on the output from beam
search of the teacher network that had the highest score
(ACF). In sequence-level interpolation (right) the student
is trained on the output from beam search of the
teacher network that had the highest sim (say using
BLEU) with the target sequence (ECE).

• Two sequence-level versions of KD that further improve performance, and eliminate the need for beam search (even when applied on the original
teacher model).
• Our best student model runs 10 times faster than its state-of-the-art teacher with little loss in performance.
• It is also significantly better than a baseline model trained without KD: by 4.2/1.7 BLEU for NMT with greedy decoding/beam search.
• Applying weight pruning on top of KD results in a student model that has 13× fewer parameters than the original teacher model, with a decrease of
0.4 BLEU.
• The training of the network gets complicated if the training corpus contains noisy sentence pairs or sentences with several correct translations. In our knowledge distillation approach, we translate the full parallel data with our teacher model. This gives us the option to score each translation against the original reference. We remove sentences with high TER scores (Snover et al., 2006) from our training data.

[32] Yoon Kim and Alexander M Rush. 2016. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947 (2016).
[95] Freitag, Markus, Yaser Al-Onaizan, and Baskaran Sankaran. "Ensemble distillation for neural machine translation." arXiv preprint arXiv:1702.01802 (2017).
Quantized distillation
• Quantized models can leverage distillation loss (Hinton et al., 2015).
• Leverages distillation during the training process, by incorporating
distillation loss, expressed with respect to the teacher network, into
the training of a smaller student network whose weights are quantized
to a limited set of levels.
• Let us define the quantization process as 𝑄(𝑣) = 𝑠𝑐^{−1}(𝑄̂(𝑠𝑐(𝑣)))
  • 𝑠𝑐^{−1} is the inverse of the scaling function,
  • 𝑄̂ is the actual quantization function that only accepts values in [0, 1].
• One form of scaling is 𝑠𝑐(𝑣) = (𝑣 − β)/α, with α = max_𝑖 𝑣_𝑖 − min_𝑖 𝑣_𝑖 and β = min_𝑖 𝑣_𝑖.
• s is number of quantization levels.
• Bucketing: One problem with this formulation is that an identical
scaling factor is used for the whole vector, whose dimension might be
huge. To avoid this, we will use bucketing.
• Differentiable quantized distillation
• optimizes the location of quantization points through SGD, to better fit
the behavior of the teacher model.
• Can we find points p to minimize loss?
• Let 𝑝 = (𝑝1 , … , 𝑝𝑠 ) be the vector of quantization points, and let Q(v, p) be
our quantization function which quantizes each element 𝑣𝑖 to the closest
of these points.

[46] Antonio Polino, Razvan Pascanu, and Dan Alistarh. 2018. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668 (2018).
[94] Bengio, Yoshua, Nicholas Léonard, and Aaron Courville. "Estimating or propagating gradients through stochastic neurons for conditional computation." arXiv preprint arXiv:1308.3432 (2013).
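A minimal NumPy sketch of uniform quantization with per-bucket min-max scaling (my own illustration; the bucket size and number of levels are placeholders).

```python
import numpy as np

def quantize_bucketed(v, levels=4, bucket_size=256):
    """Uniform quantization with s=levels values, using per-bucket min-max scaling."""
    v = v.ravel().astype(np.float64)
    out = np.empty_like(v)
    for start in range(0, v.size, bucket_size):
        chunk = v[start:start + bucket_size]
        beta = chunk.min()
        alpha = max(chunk.max() - beta, 1e-12)
        scaled = (chunk - beta) / alpha                          # sc(v) in [0, 1]
        q_hat = np.round(scaled * (levels - 1)) / (levels - 1)   # Q_hat with `levels` points
        out[start:start + bucket_size] = q_hat * alpha + beta    # sc^{-1}(Q_hat(sc(v)))
    return out

v = np.random.randn(1000)
print(np.abs(v - quantize_bucketed(v)).mean())
```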
Agenda
• Need for compression of deep learning models
• Broad overview of popular ways of model compression
• Pruning
• Quantization
• Knowledge Distillation
• Parameter sharing
• Matrix decomposition
• Transformers with Linear Complexity
• Applications
• Summary and future trends
Parameter sharing
• Outline for parameter sharing
• Character-aware language models
• Parameter sharing in the embedding matrix
• Parameter sharing in Transformers
Character to Word (C2W)
model
• We construct vector representations of words
by composing characters using BiLSTMs.
• Relative to traditional word representation
models that have independent vectors for
each word type, C2W requires only a single
vector per character type and a fixed set of
parameters for the compositional model.
• Despite the compactness of this model, our
“composed” word representations yield state-
of-the-art results in language modeling and
POS tagging.
• As input, we define an alphabet of characters
C. For English, this vocabulary would contain
an entry for each uppercase and lowercase
letter as well as numbers and punctuation.

Illustration of the word lookup tables (top) and the lexical Composition Model (bottom). Square boxes represent vectors of neuron activations. Shaded boxes indicate the application of a non-linearity.
[119] Ling, Wang, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W. Black, and Isabel Trancoso. "Finding function in form: Compositional character models for open vocabulary word
representation." arXiv preprint arXiv:1508.02096 (2015).
Char-CNN and Char-LSTM
• CNN Softmax (Fig b)
• The character-level features allow for a smoother and compact parametrization of the word
embeddings.
• Softmax computes a logit as 𝑧𝑤 = ℎ𝑇 𝑒𝑤 where h is a context vector and 𝑒𝑤 the word
embedding. Instead of building a matrix of |𝑉| × |ℎ| (whose rows correspond to ew), we
produce 𝑒𝑤 with a CNN over the characters of w as 𝑒𝑤 = 𝐶𝑁𝑁(𝑐ℎ𝑎𝑟𝑠𝑤 )
• We used the same network architecture to dynamically generate the Softmax word
embeddings without sharing the parameters with the input word-embedding sub-network.
For inference, the vectors ew can be precomputed, so there is no computational complexity
increase w.r.t. the regular Softmax.
• To get better results, 𝑧𝑤 = ℎ𝑇 𝐶𝑁𝑁 𝑐ℎ𝑎𝑟𝑠𝑤 + ℎ𝑇 𝑀𝑐𝑜𝑟𝑟𝑤 where M is a matrix projecting a
low-dimensional embedding vector 𝑐𝑜𝑟𝑟𝑤 back up to the dimensionality of the projected
LSTM hidden state of h.
• Char LSTMs (Fig c)
• They make predictions one character at a time, thus allowing to compute probabilities over a A high-level diagram of various language models. (a) is a
much smaller vocabulary. standard LSTM LM. (b) represents an LM where both
input and Softmax embeddings have been replaced by a
• Char CNN models are more difficult to train and seem to perform worse. Most likely this is
due to the sequences becoming much longer on average as the LSTM reads the input character CNN. In (c) we replace the Softmax by a next
character by character instead of word by word. character prediction LSTM network.
• Thus, we combine the word and character-level models by feeding a word-level LSTM hidden
state h into a small LSTM that predicts the target word one character at a time.
• To make the whole process reasonably efficient, we train the standard LSTM model until
convergence, freeze its weights, and replace the standard word-level Softmax layer with the
aforementioned character-level LSTM.

[120] Jozefowicz, Rafal, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. "Exploring the limits of language modeling." arXiv preprint arXiv:1602.02410 (2016).
Char-CNN+Highway+LSTM
• Character-level inputs, word-level predictions.
• The model takes “absurdity” as the current input and combines it with the history
(as represented by the hidden state) to predict the next word, “is”.
• First layer performs a lookup of character embeddings (of dimension four) and
stacks them to form the matrix 𝐶 𝑘 .
• Then convolution operations are applied between 𝐶 𝑘 and multiple filter matrices.
• We have 12 filters—3 filters of width 2 (blue), 4 filters of width 3 (yellow), and 5 filters of
width 4 (red).
• A max-over-time pooling operation is applied to obtain a fixed-dimensional
representation of the word, which is given to the highway network.
• Highway network: Compared to an MLP layer where 𝑧 = 𝑔(𝑊𝑦 + 𝑏), in a highway network layer, 𝑧 = 𝑡 ⊙ 𝑔(𝑊_𝐻 𝑦 + 𝑏_𝐻) + (1 − 𝑡) ⊙ 𝑦, where 𝑡 = 𝜎(𝑊_𝑇 𝑦 + 𝑏_𝑇) is called the transform gate and 1 − 𝑡 is called the carry gate.
• The highway network’s output is used as the input to a multi-layer LSTM.
• Finally, an affine transformation followed by a softmax is applied over the hidden
representation of the LSTM to obtain the distribution over the next word.
• Cross entropy loss between the (predicted) distribution over next word and the
actual next word is minimized.

[116] Kim, Yoon, Yacine Jernite, David Sontag, and Alexander M. Rush. "Character-aware neural language models." In Thirtieth AAAI conference on artificial intelligence. 2016.
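A minimal PyTorch sketch of a highway layer applied to pooled character-CNN features (my own illustration; dimensions are placeholders).

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """z = t * g(W_H y + b_H) + (1 - t) * y, with transform gate t = sigmoid(W_T y + b_T)."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # W_H, b_H
        self.gate = nn.Linear(dim, dim)        # W_T, b_T

    def forward(self, y):
        t = torch.sigmoid(self.gate(y))        # transform gate; (1 - t) is the carry gate
        return t * torch.relu(self.transform(y)) + (1.0 - t) * y

y = torch.randn(8, 300)                        # pooled character-CNN features for 8 words
print(Highway(300)(y).shape)
```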
Parameter sharing
• Outline for parameter sharing
• Character-aware language models
• Parameter sharing in the embedding matrix
• Parameter sharing in Transformers
HashedNets
• Consider a weight matrix 𝑉 𝑙 of size 𝑛𝑙 + 1 × 𝑛𝑙+1 between layers l and l+1.
Given a budget 𝐾 𝑙 we want to share weights within 𝑉 𝑙 to have a max of 𝐾 𝑙
unique values.
• A naive implementation of random weight sharing can be trivially achieved by
maintaining a secondary matrix consisting of each connection’s group
assignment. But this needs memory space itself. Hence hashing.
• HashedNets uses a low-cost hash function to randomly group connection
weights into hash buckets, and all connections within the same hash bucket
share a single parameter value.
• Our hashing procedure introduces no additional memory overhead.
• We assign to 𝑉^𝑙_{𝑖𝑗} an element of 𝑤^𝑙 indexed by a hash function ℎ^𝑙(𝑖, 𝑗), as follows: 𝑉^𝑙_{𝑖𝑗} = 𝑤^𝑙_{ℎ^𝑙(𝑖,𝑗)}, where h maps (i, j) to a natural number in {1, …, 𝐾^𝑙}.
• They use https://code.google.com/p/xxhash/
• Now, rather than V we can use w: 𝑧_𝑖 = Σ_{𝑗=1}^{𝑚} 𝑉_{𝑖𝑗} 𝑎_𝑗 = Σ_{𝑗=1}^{𝑚} 𝑤_{ℎ(𝑖,𝑗)} 𝑎_𝑗
• Or equivalently, we could also do feature hashing.
  • To compute 𝑧_𝑖, we first hash the activations from the previous layer, a, with the hash mapping function 𝜙_𝑖(.): 𝑅^𝑚 → 𝑅^𝐾.
  • We then compute the inner product between the hashed representation 𝜙_𝑖(𝑎) and the parameter vector w: 𝑧_𝑖 = 𝑤^𝑇 𝜙_𝑖(𝑎).
  • Both w and 𝜙_𝑖(𝑎) are K-dimensional, with [𝜙_𝑖(𝑎)]_𝑘 = Σ_{𝑗: ℎ(𝑖,𝑗)=𝑘} 𝑎_𝑗

An illustration of a neural network with random weight sharing under compression factor 1/4. The 16 + 8 = 24 virtual weights are compressed into 6 real weights. The colors represent matrix elements that share the same weight value. Connections are randomly grouped into three categories per layer and their weights are shown in the virtual weight matrices V1 and V2.

[8] Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. 2015. Compressing neural networks with the hashing trick. In International Conference on Machine Learning. 2285–2294.
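A minimal sketch of the hashing trick for one layer; it uses Python's built-in hash on (i, j) pairs instead of the xxHash function used in [8], and the layer sizes and bucket count K are illustrative only.

import numpy as np

n_in, n_out, K = 16, 8, 6                      # 16*8 virtual weights, K real weights
w = np.random.randn(K)                          # the K shared real weights

def h(i, j, seed=0):
    # maps connection (i, j) to a bucket in {0, ..., K-1}; [8] uses xxHash here
    return hash((i, j, seed)) % K

def layer_forward(a):                           # a: previous-layer activations, shape (n_in,)
    z = np.zeros(n_out)
    for i in range(n_out):
        for j in range(n_in):
            z[i] += w[h(i, j)] * a[j]           # V_ij = w_{h(i,j)}, V is never materialized
    return z

z = layer_forward(np.random.randn(n_in))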
Toeplitz-like structured matrices
• A hybrid strategy of using Toeplitz-like structured matrices in the bottom layers and projection layers with shared low-rank factors on the top layers reduces the parameters of a standard LSTM by 75%, at a small cost of 0.3% increase in WER, on a 2,000-hr English Voice Search task.
• Low Rank Factorization: W = W_a × W_b.
• h^l_t = σ[W^l_a W^l_b h^{l−1}_t + U^l_a U^l_b h^l_{t−1} + b^l]
• Sharing Low-rank across Layers (Projection Model):
• Share the low-rank factor across layers of the recurrent architecture.
• Also set W^l_b = U^{l−1}_b when doing low rank factorization.
• m^0_t = x_t. Then, m^l_t can be seen as a linear projection of the original hidden layer, which is shared across layers.
• For LSTMs, a similar projection can be done using h^l_t = P^l o^l_t tanh[c^l_t] where P^l is the low rank projection matrix.
• Toeplitz-like Structured Matrices
• Unlike HashedNets where weights are randomly grouped, parameter sharing mechanisms in structured matrices are highly specific and deterministic.
• Toeplitz matrices have parameters tied along diagonals.
• n×n Toeplitz matrices admit O(n log n) time to compute matrix-vector products. Toeplitz matrices also have the property that via certain shift and scale operations as implemented by specific displacement operators, they can be linearly transformed into matrices of rank less than or equal to 2. Thus, the displacement rank of all Toeplitz matrices is up to 2.
• Toeplitz-like matrices allow the displacement rank r to be higher.
• They include products and inverses of Toeplitz matrices, and their linear combinations.
• The displacement rank r serves as a knob on modeling capacity. High displacement rank matrices are increasingly unstructured.
• With displacement rank r, there are 2nr free parameters in the Toeplitz-like structured matrix. We apply the Toeplitz-like transform to U^l, W^l in RNNs, and W^l_i, W^l_f, W^l_c, W^l_o, U^l_i, U^l_f, U^l_c, U^l_o in LSTMs.

[40] Zhiyun Lu, Vikas Sindhwani, and Tara N Sainath. 2016. Learning compact recurrent neural networks. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5960–5964.
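To make the diagonal parameter tying concrete, here is a small numpy sketch: an n×n Toeplitz matrix is fully determined by its first row and column (2n − 1 values), and a matrix-vector product can be formed without storing the full matrix. This is not the displacement-operator machinery of [40], and the naive loop below is O(n^2); an FFT-based implementation attains the O(n log n) cost mentioned above.

import numpy as np
from scipy.linalg import toeplitz

n = 4
c = np.random.randn(n)                  # first column
r = np.random.randn(n); r[0] = c[0]     # first row; 2n - 1 free parameters in total
x = np.random.randn(n)

def toeplitz_matvec(c, r, x):
    # T[i, j] = c[i - j] if i >= j else r[j - i]; the full matrix is never stored
    n = len(x)
    return np.array([sum((c[i - j] if i >= j else r[j - i]) * x[j] for j in range(n))
                     for i in range(n)])

assert np.allclose(toeplitz(c, r) @ x, toeplitz_matvec(c, r, x))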
Sparse Word Representations
• Represent infrequent words’ embeddings with frequent words’ by sparse linear combinations.
• This is inspired by the observation that, in a dictionary, an unfamiliar word is typically defined by common words.
• A dense embedding is assigned to each common word; an infrequent word, on the other hand, computes its
vector representation by a sparse combination of common words’ embeddings.
• We apply same kind of model compression for both word embedding matrix as well as output layer of
RNNs/LSTMs.
• Sparse Representations of Words
• We split the vocabulary V into two disjoint subsets (B and C). B contains a fixed number (8000) of common words. C = V\B is a
set of uncommon words. We would like to use B’s word embeddings to encode C’s.
• Suppose U = (U_1, U_2, ..., U_{|B|}) ∈ R^{E×|B|} is the (learned) embedding matrix of common words, i.e., U_i is the embedding of the i-th word in B.
• For a word w ∈ C, we shall learn a sparse vector x = (x_1, x_2, ..., x_{|B|}) as the sparse code of the word.
• If x has been learned, the embedding of w is ŵ = Σ_{j=1}^{|B|} x_j U_j = Ux
• To learn the sparse representation of a certain word w: min_x ||Ux − w||^2_2 + α||x||_1 + β|1^T x − 1| + γ 1^T max{0, −x}
• where max denotes the component-wise maximum; w is the embedding for a rare word w ∈ C.
• The last two regularization terms favor a solution that sums to 1 and that is nonnegative (for psychological interpretation concerns), respectively.
• As the codes are pre-computed and remain unchanged during language modeling, they are not tunable parameters of our
neural model. Considering the learned sparse codes, we need only 4–8 values for each word on average, as the codes contain
0.05–0.1% nonzero values, which are almost negligible
[9] Yunchuan Chen, Lili Mou, Yan Xu, Ge Li, and Zhi Jin. 2016. Compressing neural language models by sparse word representations. arXiv preprint arXiv:1610.03950 (2016).
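A bare-bones sketch of fitting the sparse code x for one rare word with plain subgradient descent on the objective above; the optimizer, step size, and regularization constants here are illustrative assumptions, not the settings used in [9].

import numpy as np

E, B = 50, 8000                      # embedding dim, number of common words
U = np.random.randn(E, B)            # columns = embeddings of common words
w = np.random.randn(E)               # embedding of one rare word
alpha, beta, gamma, lr = 0.5, 1.0, 1.0, 1e-3

x = np.zeros(B)
for _ in range(500):                 # subgradient descent (assumed; paper's solver may differ)
    grad = 2 * U.T @ (U @ x - w)                 # reconstruction term
    grad += alpha * np.sign(x)                   # alpha * ||x||_1
    grad += beta * np.sign(x.sum() - 1.0)        # beta * |1^T x - 1|
    grad += gamma * (-(x < 0).astype(float))     # gamma * 1^T max(0, -x)
    x -= lr * grad

w_hat = U @ x                        # reconstructed rare-word embedding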
LightRNN
An example of the word table

• 2-Component (2C) shared embedding for word representations.


• We allocate every word in the vocabulary into a table, each row of which is associated with a vector, and each column associated with another
vector.
• Depending on its position in the table, a word is jointly represented by two components: a row vector and a column vector. Thus, we only need 2√|V| vectors to represent a vocabulary of |V| unique words, which are far fewer than the |V| vectors needed otherwise.
• On the One-Billion-Word benchmark Dataset, our algorithm achieves comparable perplexity to previous language models,
whilst reducing the model size by a factor of 40-100, and speeding up the training process by a factor of 2.
• Let n and m denote the dimension of a row/column input vector and that of a hidden state vector
respectively.
• E.g., if n=1024, m=1024, |V|=10M, 32-bit floats: the total size of the two embedding matrices X, Y is (m|V| + n|V|) × 4 = 80GB for the vanilla RNN model, and that of the four embedding matrices X_r, X_c, Y_r, Y_c in LightRNN is 2(m√|V| + n√|V|) × 4 ≈ 50MB.
• Input and output use different embedding matrices but they share the same word-allocation table.
• Word Allocation table creation uses a bootstrap procedure to iteratively refine word allocation
based on the learned word embedding.

LightRNN (left) vs. Conventional RNN (right).

[36] Xiang Li, Tao Qin, Jian Yang, and Tie-Yan Liu. 2016. LightRNN: Memory and computation-efficient recurrent neural networks. In Advances in Neural Information Processing Systems. 4385–4393.
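A small sketch of the 2-component lookup: with a table of side √|V|, a word's input representation comes from one row vector and one column vector. The allocation below is a random stand-in, not the bootstrapped word-allocation procedure of [36], and the sizes are toy values.

import numpy as np

V, n = 10000, 64                       # vocabulary size, row/column vector dimension
side = int(np.ceil(np.sqrt(V)))        # table is side x side, so only 2*side*n input parameters
X_r = np.random.randn(side, n)         # row input vectors
X_c = np.random.randn(side, n)         # column input vectors

word_to_slot = np.random.permutation(side * side)[:V]   # toy word-allocation table

def input_vectors(word_id):
    slot = word_to_slot[word_id]
    r, c = divmod(slot, side)
    return X_r[r], X_c[c]              # consumed by the RNN in two consecutive half-steps

row_vec, col_vec = input_vectors(42)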
Learning Compact Neural Word Embeddings
• Force the embedding vectors to share parameter values, which significantly shrinks model size.
• Let U and V be vocabularies of possible input and output words, respectively.
• Let 𝑒𝑖 = i-th input embedding vector, and 𝑜𝑗 = j-th output embedding vector.
• Given training data D, Skipgram SGNS embeddings are obtained by solving
• Ê, Ô = argmin_{E,O} {Ψ(E, O|D)} where
• Ψ(E, O|D) = Σ_{H∈D} L(s_H) + Σ_{H′∈D′} L(−s_{H′}) where L(x) = log(1 + exp(−x)) and s_H = e_i · o_j
• For CBOW with negative sampling, s_H = Σ_{i∈I} e_i · o_j where H = (I, j)
• Can we compress word embeddings while learning them?
• Split every embedding vector of size D into B equal sub-vectors of size C. Thus D=BxC.
• We assign a limited number of reference vectors to each block of block-splitting vectors. E.g., the
number of reference vectors becomes KxB if we assign K reference vectors to each block. Each
reference vector is of size C.
• Let p_{b,k} be the k-th reference vector assigned to the b-th block. Thus P is a set of B×K C-sized reference vectors.
Comparison of obtained model size between a conventional method and proposed method in typical settings.
• With parameter sharing, optimization changes as follows.
• 𝑞𝑏,𝑘 denotes the k-th reference vector assigned to the b-th block of the output embedding vector.
• The constraints namely 𝑒𝑖,𝑏 ∈ Pb and 𝑜𝑗,𝑏 ∈ 𝑄𝑏 are referred to as parameter sharing constraints
• This optimization is difficult to solve. Hence, we use alternating direction method of multiplier (ADMM)
method.

[56] Jun Suzuki and Masaaki Nagata. 2016. Learning Compact Neural Word Embeddings by Parameter Space Sharing. In IJCAI. 2046–2052.
Slim Embeddings
• Randomly share the structured parameters at both the input and output embedding layers of the RNN
LM to significantly reduce the size of model parameters.
• Assume we divide the input word embedding vector into K even parts, such that the input representation
of the current word is the concatenation of the K parts.
• For a vocabulary of V words, the input word embedding matrix thus is divided into V∗K sub-vectors, and
we map these sub-vectors into M sub-vectors randomly but as uniformly as possible.
• the total number of parameters in the input embedding layer is MN/K instead of VN, which makes the
number of parameters independent from the size of vocabulary.
• We use SGD with backpropagation through time to train our compressed neural language model.
• The output matrix can be compressed in a similar way. In the output layer, the context vector h is
projected to a vector of size |V|, such that for each word w, we compute 𝑧𝑤 = ℎ𝑇 𝑒𝑤 , which is then
normalized by a softmax non-linearity.
• We can split each 𝑒𝑤 = [𝑎𝑤1 , … , 𝑎𝑤𝐾 ] where 𝑎𝑖 are sub-vectors.
• If we also divide the context vector into K even parts h = [h_1, ..., h_K], then z_w = Σ_{i=1}^{K} h_i^T a_{wi}.
• Because many words share the same sub-vectors, for each unique h_i^T a_{wi}, we just need to compute the partial dot product once. Computation of all the unique h_i^T a_{wi} values is O(MH/K) where M is the total number of sub-vectors.
• Each z_w is the sum of K partial dot products. Because the dot product results are already known from the first step, all we need to do is sum the K values for each word. The complexity of this step is O(VK).
Toy example of the original embedding layer and new embedding layer. In this paper the concatenated word vector has the same size as the original one. The assignment of sub-vectors to each word is randomly selected and fixed before the training process. The parameters in the sub-vectors are updated in the training process.

[37] Zhongliang Li, Raymond Kulhanek, Shaojun Wang, Yunxin Zhao, and Shuang Wu. 2018. Slim embedding layers for recurrent neural language models. In Thirty-Second AAAI Conference on Artificial Intelligence.
Parameter sharing
• Outline for parameter sharing
• Character-aware language models
• Parameter sharing in the embedding matrix
• Parameter sharing in Transformers
Universal Transformer

• Transition functions are applied position-wise.


• Dynamic per-position halting mechanism.
• Certain symbols (e.g. some words or phonemes) are usually more
ambiguous than others. It is therefore reasonable to allocate more
processing resources to these more ambiguous symbols.
• Adaptive Computation Time (ACT) (Graves, 2016) is a mechanism for
dynamically modulating the number of computational steps needed to
process each input symbol (called the “ponder time”) in standard
recurrent neural networks based on a scalar halting probability
predicted by the model at each step.
• Comparable perf on QA, LM, subject-verb agreement, etc.

[15] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. 2018. Universal transformers. arXiv preprint arXiv:1807.03819 (2018).
ALBERT: A Lite BERT
• ALBERT incorporates two parameter reduction techniques that lift the major
obstacles in scaling pre-trained models.
• Factorized embedding parameterization: Decompose large vocabulary embedding matrix
into two small matrices
• We reduce the embedding parameters from O(V × H) to O(V × E + E × H) where H >> E.
• cross-layer parameter sharing
• There are multiple ways to share parameters, e.g., only sharing feed-forward network (FFN)
parameters across layers, or only sharing attention parameters. The default decision for ALBERT is to
share all parameters across layers.
• An ALBERT configuration similar to BERT-large has 18x fewer parameters and
can be trained about 1.7x faster.
• The parameter reduction techniques also act as a form of regularization that
stabilizes the training and helps with generalization.

[34] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942
(2019).
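A minimal sketch of ALBERT's factorized embedding parameterization (shapes are illustrative; the cross-layer parameter sharing of the encoder is not shown here):

import torch
import torch.nn as nn

V, E, H = 30000, 128, 768                   # vocab, embedding size, hidden size (E << H)
token_emb = nn.Embedding(V, E)              # V x E lookup table
emb_proj = nn.Linear(E, H, bias=False)      # E x H projection into the hidden space

ids = torch.randint(0, V, (2, 16))          # a batch of token ids
hidden_in = emb_proj(token_emb(ids))        # (2, 16, H); V*E + E*H parameters instead of V*H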
Quaternion Transformer
• Significantly (75%) reduced parameter size due to lesser degrees of freedom in the Hamilton product.
• We move beyond real space, exploring computation in Quaternion space (i.e., hypercomplex numbers) as an inductive bias. Hypercomplex numbers
comprise of a real and three imaginary components (e.g., i, j, k) in which interdependencies between these components are encoded naturally
during training via the Hamilton product ⊗. Hamilton products have fewer degrees of freedom, enabling up to four times compression of model
size.
• Quaternion algebra
• A Quaternion Q ∈ H is a hypercomplex number with three imaginary components as follows: Q = r + xi + yj + zk. where ijk = i^2 = j^2 = k^2 = −1 and
noncommutative multiplication rules apply: ij = k,jk = i, ki = j,ji = −k, kj = −i, ik = −j.
• r is the real value and similarly, x, y, z are real numbers that represent the imaginary components of the Quaternion vector Q.
• Hamilton Product: Q ⊗ P = (Q_r P_r − Q_x P_x − Q_y P_y − Q_z P_z) + (Q_x P_r + Q_r P_x − Q_z P_y + Q_y P_z) i + (Q_y P_r + Q_z P_x + Q_r P_y − Q_x P_z) j + (Q_z P_r − Q_y P_x + Q_x P_y + Q_r P_z) k
• Quaternion feed-forward: Denote by W ∈ H the weight parameter of a Quaternion feed-forward layer and let Q ∈ H be the layer input. The linear
output of the layer is the Hamilton product of two Quaternions: W ⊗ Q.
• Saving Parameters? How and Why – Weight sharing.
• let us express the Hamilton product W ⊗ Q in a Quaternion feed-forward layer in the form of matrix multiplication, which is used in real-space feed-forward.
• 𝑊 = 𝑊𝑟 + 𝑊𝑥 𝑖 + 𝑊𝑦 𝑗 + 𝑊𝑧 𝑘
• We highlight that, there are only 4 distinct parameter variable elements (4 degrees of freedom), namely Wr, Wx, Wy, Wz, in the weight matrix; while in real-space
feed-forward, all the elements of the weight matrix are different parameter variables (4 × 4 = 16 degrees of freedom).
• This results in a 75% reduction in parameterization.
• This matrix multiplication trick is used in both the attention layer and the feed forward layers of a quaternion transformer.

[59] Yi Tay, Aston Zhang, Luu Anh Tuan, Jinfeng Rao, Shuai Zhang, Shuohang Wang, Jie Fu, and Siu Cheung Hui. 2019. Lightweight and Efficient Neural Natural Language Processing with Quaternion Networks. arXiv preprint
arXiv:1906.04393 (2019).
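The parameter saving can be seen by writing the Hamilton product as a real block matrix built from the four component blocks. A numpy sketch under the assumption that inputs and outputs are split into four equal d-sized parts (block size and initialization are illustrative, not from [59]):

import numpy as np

d = 16                                   # each component block is d x d, the full layer is 4d x 4d
Wr, Wx, Wy, Wz = (np.random.randn(d, d) for _ in range(4))   # only 4 distinct parameter blocks

# Hamilton product W ⊗ Q expressed as a real 4d x 4d matrix: 16 blocks, only 4 unique ones
W_real = np.block([
    [Wr, -Wx, -Wy, -Wz],
    [Wx,  Wr, -Wz,  Wy],
    [Wy,  Wz,  Wr, -Wx],
    [Wz, -Wy,  Wx,  Wr],
])

q = np.random.randn(4 * d)               # input split into (r, x, y, z) parts
out = W_real @ q                         # 4*d^2 parameters instead of 16*d^2: 75% fewer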
Agenda
• Need for compression of deep learning models
• Broad overview of popular ways of model compression
• Pruning
• Quantization
• Knowledge Distillation
• Parameter sharing
• Matrix decomposition
• Transformers with Linear Complexity
• Applications
• Summary and future trends
Matrix decomposition: Factorize large matrices into
multiple smaller components
• Outline for matrix decomposition
• Two low-rank factors
• Factorizing into blocks
• Tensor train decomposition
• Block-Term tensor decomposition
LSTMs for large vocabulary
• We present two novel LSTM based RNN
architectures which make more effective use of
model parameters.
• In one of them, we connect the cell output units to a recurrent projection layer which connects to the cell input units and gates for recurrency, in addition to network output units for the prediction of the outputs.
• In the other one, in addition to the recurrent projection layer, we add another non-recurrent projection layer which is directly connected to the output layer.
• It allows us to increase the number of units in the projection layers without increasing the number of parameters in the recurrent connections.
LSTM based RNN architectures with a recurrent projection layer and an optional non-recurrent projection layer

[117] Sak, Haşim, Andrew Senior, and Françoise Beaufays. "Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition." arXiv preprint arXiv:1402.1128 (2014).
Sparse Overcomplete Word Vector Representations
• Transform word vectors into sparse (and optionally binary)
vectors.
• The transformation results in longer, sparser vectors, sometimes
called an “overcomplete” representation.
• Let V be the vocabulary size. 𝑋 ∈ 𝑅𝐿×𝑉 is initial embedding
matrix.
• Sparse Coding (method A)
• In sparse coding (Lee et al., 2006), the goal is to represent each input
vector 𝑥𝑖 as a sparse linear combination of basis vectors, 𝑎𝑖 .
• 𝐷 ∈ 𝑅 𝐿×𝐾

• Sparse Nonnegative Vectors (first part of method B)


• Nonnegativity in the feature space has often been shown to correspond to interpretability.
• This can be solved using online adaptive gradient descent (AdaGrad).
Methods for obtaining sparse overcomplete vectors (top, method A) and sparse, binary overcomplete word vectors (bottom, method B). Observed dense vectors of length L (left) are converted to sparse non-negative vectors (center) of length K which are then projected into the binary vector space (right), where L << K. X is dense, A is sparse, and B is the binary word vector matrix. Strength of colors signify the magnitude of values; negative is red, positive is blue, and zero is white.
• Binarizing Transformation (second part of method B)

[17] Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah Smith. 2015. Sparse overcomplete word vector representations. arXiv preprint arXiv:1506.02004 (2015).
SVD
• We jointly compress the recurrent and inter-layer matrices corresponding to a specific layer l by determining a suitable recurrent projection matrix, denoted by P^l ∈ R^{r^l × N^l}, of rank r^l < N^l such that W^l_h = Z^l_h P^l and W^l_x = Z^l_x P^l.
• We determine P^l by first computing an SVD of the recurrent weight matrix, which we then truncate, retaining only the top r^l singular values (denoted by Σ̃^l_h) and the corresponding singular vectors from U^l_h and V^l_h (denoted by Ũ^l_h and Ṽ^l_h, respectively):
• W^l_h = U^l_h Σ^l_h (V^l_h)^T ≈ Ũ^l_h Σ̃^l_h (Ṽ^l_h)^T = Z^l_h P^l
• We determine Z^l_x as the solution to the following least-squares problem:
• Z^l_x = argmin_Y ||Y P^l − W^l_x||^2_F
• We found that SVD-based initialization performed better than training a model with recurrent projection matrices (i.e., same model architecture) but with random initialization of the network weights.
• In LSTMs, the recurrent weight matrix W^l_h is the concatenation of the 4 gate weight matrices, which can be stacked vertically as [W_im, W_om, W_fm, W_cx]^T.
Typical Deep RNN
The initial model (Figure (a)) is compressed by jointly factorizing recurrent (W^l_h) and inter-layer (W^l_x) matrices, using a shared recurrent projection matrix (P^l) (Figure (b)).

[47] Rohit Prabhavalkar, Ouais Alsharif, Antoine Bruguier, and Lan McGraw. 2016. On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition. In 2016
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5970–5974.
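A numpy sketch of the per-layer procedure described above: a truncated SVD of the recurrent matrix yields P^l and Z^l_h, and Z^l_x is then obtained by least squares. Matrix sizes and the target rank are illustrative assumptions.

import numpy as np

N, r = 512, 64                               # hidden size, target rank r < N
W_h = np.random.randn(N, N)                  # recurrent weight matrix
W_x = np.random.randn(N, N)                  # inter-layer weight matrix

U, s, Vt = np.linalg.svd(W_h, full_matrices=False)
Z_h = U[:, :r] * s[:r]                       # Z_h = U~ Sigma~      (N x r)
P = Vt[:r, :]                                # shared projection P  (r x N)
assert np.allclose(Z_h @ P, U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :])

# Z_x = argmin_Y ||Y P - W_x||_F  <=>  solve P^T Y^T = W_x^T in the least-squares sense
Z_x = np.linalg.lstsq(P.T, W_x.T, rcond=None)[0].T   # (N x r)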
Fix U using kernels
• Typical matrix factorization: W=UV. U has size 𝑛𝑣 × 𝑛𝛼 and V has size 𝑛𝛼 × 𝑛ℎ . Compute derivatives wrt U and V instead of W.
• In practice this naive approach does not perform as well as learning a full-rank weight matrix directly.
• Moreover, the factored representation has redundancy. If Q is any invertible matrix of size n_α × n_α, we have W = UV = UQ Q^{-1} V = Ũ Ṽ. One way to remove this redundancy is to fix the value of U and learn only V.
• What is a reasonable choice for U?
• U will be called static features and V as dynamic features. U is computed but V is learned using SGD.
• E.g., we can build U using kernel functions.
• Consider a weight vector w to be learned. Let w𝛼 be observed values of w. Can we use 𝑤𝛼 to compute w?
• We introduce a kernel matrix 𝐾𝛼 , with entries 𝐾𝛼 𝑖𝑗 = 𝑘(𝑖, 𝑗), to model the covariance between locations 𝑖, 𝑗 ∈ 𝛼.
• The parameters at these locations are (w_α)_i and (w_α)_j. Then, we can compute w = k_α^T (K_α + λI)^{-1} w_α, where k_α has |α| rows and |w| columns. λ is a ridge regularization constant.
• In this case, U = k_α^T (K_α + λI)^{-1} and V = w_α.
• For image patches, where we expect smoothness in pixel space, an example of a kernel is the squared exponential kernel k(i, j) = exp(−((i_x − j_x)^2 + (i_y − j_y)^2) / (2σ^2)), where σ controls the degree of smoothness.
• Constructing kernel function and hence U
• When the weight space has a topological structure where we expect smoothness, for example when the weights correspond to pixels in an image patch, we can
choose a kernel-based dictionary to enforce the type of smoothness we expect.
• When there is no topological structure to exploit, we propose to use data driven dictionaries. An obvious choice here is to use a shallow unsupervised feature
learning, such as an autoencoder, to build a dictionary for the layer.
• Another option is to construct data-driven kernels for ridge regression. Easy choices here are using the empirical covariance or empirical squared covariance of the
hidden units, averaged over the data.
• Since the correlations in hidden activities depend on the weights in lower layers we cannot initialize kernels in deep layers in this way without training the previous
layers. We handle this by pre-training each layer as an autoencoder. We construct the kernel using the empirical covariance of the hidden units over the data using
the pre-trained weights. Once each layer has been pre-trained in this way we fine-tune the entire network with backpropagation, but in this phase the kernel
parameters are fixed.

[113] Denil, Misha, Shakibi, Babak, Dinh, Laurent, de Freitas, Nando, et al. Predicting parameters in deep learning. In NIPS, 2013.
Factorized L0 Pruning (FL0P)
• FL0P is a novel, structured pruning approach based on low rank factorization and augmented Lagrangian 𝑙0 norm regularization.
• 𝑙0 regularization
• Consider a given neural network model f(·; θ) parameterized by θ = {θ_j}_{j=1}^{n}, where each θ_j represents a block of weights.
• A pruning strategy can be parameterized by introducing binary variables z = {z_j}_{j=1}^{n} such that z_j ∈ {0,1} and θ̃ = θ ⊙ z.
• Then, we would like to minimize E_z[(1/|D|) Σ_{i=1}^{|D|} L(x_i, y_i; θ̃)] + λ||θ̃||_0
• Structured pruning
• Consider a fully connected layer which performs a multiplication Wx for the input x. One popular method corresponds to adding the sparsity variables as a sparse
diagonal matrix G = diag(𝑧1 ,· · ·, 𝑧 𝑥 ) to the multiplication, i.e., WGx, where |x| denotes the number of rows in x. This effectively removes a subset of columns of W
for column indices k with 𝑧𝑘 =0.
• one limitation is that this structured pruning method tends to produce lower performance than its unstructured counterpart.
• Structured pruning using factorization
• W = PQ. Let r be the number of columns of P (or equivalently the number of rows of Q), 𝑝𝑘 and 𝑞𝑘 be the k-th column of P and k-th row of Q respectively. We
achieve structured pruning by introducing a pruning variable 𝑧𝑘 for each component.
• W = PGQ = Σ_{k=1}^{r} z_k × (p_k × q_k), where G = diag(z_1, · · · , z_r) is again the diagonal matrix of pruning variables.
• After training, only columns and rows corresponding to non-zero diagonal values need to be stored, resulting in much smaller (but still dense) matrices P and Q. The nonzero values of G
can be absorbed into either P or Q.
• Sparsity could vary depending on 𝜆. Hence, we use augmented Lagrangian approach inspired by (Bastings et al., 2019) to control the final sparsity
level.
• On enwiki8 dataset we obtain a 1.19 perplexity score with just 5M parameters, vastly outperforming a model of the same size trained from scratch.

[64] Ziheng Wang, Jeremy Wohlwend, and Tao Lei. 2019. Structured Pruning of Large Language Models. arXiv preprint arXiv:1910.04732 (2019).
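A simplified sketch of the factored parameterization W = P diag(z) Q; here z is relaxed to a deterministic sigmoid gate with a plain sparsity surrogate, rather than the hard-concrete gates and augmented-Lagrangian controller used in [64]. Shapes and constants are illustrative.

import torch
import torch.nn as nn

d_out, d_in, r = 256, 256, 64
P = nn.Parameter(torch.randn(d_out, r) * 0.02)
Q = nn.Parameter(torch.randn(r, d_in) * 0.02)
log_alpha = nn.Parameter(torch.zeros(r))        # one pruning variable per rank-1 component

def layer(x, l0_strength=1e-3):
    z = torch.sigmoid(log_alpha)                # relaxed gate in (0, 1); [64] uses hard-concrete
    W = P @ torch.diag(z) @ Q                   # W = P G Q with G = diag(z)
    penalty = l0_strength * z.sum()             # crude surrogate for the expected L0 of z
    return x @ W.t(), penalty

y, reg = layer(torch.randn(8, d_in))
# after training, components with z_k ~ 0 are dropped: keep only the surviving
# columns of P and rows of Q, giving smaller (but still dense) matrices.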
Matrix decomposition: Factorize large matrices into
multiple smaller components
• Outline for matrix decomposition
• Two low-rank factors
• Factorizing into blocks
• Tensor train decomposition
• Block-Term tensor decomposition
Differentiated Softmax
• Softmax LM: Last layer needs computation of activations for all words in
vocab.
• y = (1/Z) exp(h^{k+1}), where h^{k+1} is the last hidden layer, and Z = Σ_{j=1}^{V} exp(h_j^{k+1})
• When grouping multiple input examples into a batch, this amounts to a large matrix-matrix product of the form W^{k+1} H^k, where W^{k+1} ∈ R^{V×d_k} and H^k ∈ R^{d_k×l}; l is the number of input examples in a batch. These matrices are very large.
• Differentiated Softmax
• The weight matrix of the final layer 𝑊 𝑘+1 ∈ 𝑅 𝑑𝑘 ×𝑉 stores output embeddings of
size 𝑑𝑘 for the V words
• Differentiated softmax varies the dimension of the output embeddings 𝑑𝑘 across
words depending on how much model capacity is deemed suitable for a given
word. In particular, it is meaningful to assign more parameters to frequent words
than to rare words. By definition, frequent words occur more of ten in the training
data than rare words and therefore allow to fit more parameters.
• We define partitions of the output vocabulary based on word frequency and the words in each partition share the same embedding size. For example, we may partition the frequency-ordered set of output word ids, O = {1, . . . , V}, into A = {1, . . . , K} and B = {K+1, . . . , V} s.t. A ∪ B = O ∧ A ∩ B = ∅, where d_A and d_B are different output embedding sizes and K is a word id.
• Partitioning results in a sparse final weight matrix W^{k+1} which arranges the embeddings of the output words in blocks, each one corresponding to a separate partition.
Final weight matrix W^{k+1} and hidden layer h^k for differentiated softmax for partitions A, B, C of the output vocabulary with embedding dimensions d_A, d_B, d_C; non-shaded areas are zero.
• The size of the final hidden layer ℎ𝑘 is the sum of the embedding sizes of the
partitions. In practice, we compute separate matrix-vector products, or in batched
form, matrix-matrix products, for each partition in 𝑊 𝑘+1 and ℎ𝑘 .
[115] Chen, Welin, David Grangier, and Michael Auli. "Strategies for training large vocabulary neural language models." arXiv preprint arXiv:1512.04906 (2015).
Code-book for word
embeddings
• We propose to construct the embeddings with few basis vectors. For each word, the
composition of basis vectors is determined by a hash code.
• To maximize the compression rate, we adopt the multi-codebook quantization approach instead of a binary coding scheme. Each code is composed of multiple discrete numbers, such as (3, 2, 1, 8), where the value of each component is limited to a fixed range.
Comparison of embedding computations between the conventional approach (a) and compositional coding approach (b) for constructing embedding vectors
• We represent each word w with a code C_w = (C_w^1, C_w^2, …, C_w^M). Each component C_w^i is an integer number in [1, K].
• Once we have obtained such compact codes for all words in the vocabulary, we use embedding vectors to represent the codes rather than the unique words. More specifically, we create M codebooks E_1, E_2, ..., E_M, each containing K codeword vectors. The embedding of a word is computed by summing up the codewords corresponding to all the components in the code: E(C_w) = Σ_{i=1}^{M} E_i(C_w^i)
• The number of vectors in the embedding matrix will be M × K, which is usually much smaller than the vocabulary size.
• Given the original embedding matrix Ẽ, we want to learn E and C such that this reconstruction loss is minimized: Ĉ, Ê = argmin_{C,E} (1/|V|) Σ_{w∈V} ||E(C_w) − Ẽ(w)||^2
• Let Ẽ ∈ R^{|V|×H} be the original embedding matrix, where each embedding vector has H dimensions. By using the reconstruction loss, we are actually finding an approximate matrix factorization Ẽ ≈ Σ_{i=1}^{M} D^i A_i, where A_i ∈ R^{K×H} is a basis matrix for the i-th component. D^i is a |V|×K code matrix in which each row is a K-dimensional one-hot vector.
• E(C_w) = Σ_{i=1}^{M} A_i^T d_w^i
• Therefore, the problem of learning discrete codes C_w can be converted to a problem of finding a set of optimal one-hot vectors d_w^1, …, d_w^M and source dictionaries A_1, ..., A_M that minimize the reconstruction loss. We directly learn the discrete codes in an end-to-end neural network (2-layer MLP) by applying the Gumbel-softmax trick.
The network architecture for learning compositional compact codes. The Gumbel-softmax computation is marked with dashed lines.
• Experiments show the compression rate achieves 98% in a sentiment analysis task and 94% ∼
99% in machine translation tasks without performance loss.

[53] Raphael Shu and Hideki Nakayama. 2017. Compressing word embeddings via deep compositional code learning. arXiv preprint arXiv:1711.01068 (2017).
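A sketch of the lookup once the codes have been learned: M codebooks of K codewords each, and a word's embedding is the sum of its M selected codewords. The codes below are random stand-ins for the Gumbel-softmax-learned codes of [53], and all sizes are illustrative.

import numpy as np

V, H, M, K = 50000, 300, 32, 16
codebooks = np.random.randn(M, K, H) * 0.1        # M codebooks E_1..E_M, K codewords each
codes = np.random.randint(0, K, size=(V, M))      # C_w for every word (learned in [53])

def embed(word_id):
    c = codes[word_id]                            # (C_w^1, ..., C_w^M)
    return sum(codebooks[i, c[i]] for i in range(M))   # E(C_w) = sum_i E_i(C_w^i)

vec = embed(1234)
# storage: M*K*H floats plus V*M small integers, versus V*H floats for the full table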
Word Encoded Sequence Transducers
• Consider embedding and softmax matrices (denoted by E) of size V×d.
• We factorize E as 𝐸 = 𝐶 × 𝐸 𝑐 .
• where 𝐶 ∈ 𝑅 𝑉×𝑛𝑘 is a sparse structured matrix such that each row of C is
concatenation of n weighted one hot vectors of length k
• 𝐸 𝑐 ∈ 𝑅 𝑛𝑘×𝑑 is a structured dense matrix
• We fix the locations of non-zero entries in C and do not change them during training. We train E^c and the values of the non-zero entries in C using SGD.
• Structured sparse matrix
• Let n be code length and k be alphabet size. Let c_i(w) be the i-th entry of the code for word w. It could be a number from 1 to k.
• Given such a codebook c, we construct a sparse matrix C of size V × n·k
• Let λ_{w,i} be the weight corresponding to the entry corresponding to c_i(w). We differentiate between two types of sparse code books, the weighted sparse matrix where the non-zero entries can take any value and the unweighted sparse matrix where the non-zero entries are restricted to be one, i.e., λ_{w,i} = 1 ∀w, ∀i.
• To store the codebook (the sparse matrix entries indices), V·n·ceil(log_2 k) bits are needed. Too big. Hence, we propose to use sparse codebooks that can be stored succinctly with fewer than V·n additional parameters, e.g., language codes and random codes.
• Structured dense matrix
• Block-diagonal structure: Each E^i is a matrix of size kd/n. Total space needed = kd.
• Band structure: Each E^i is a matrix of size kd. E^c is the matrix obtained by stacking the E^i's one below another. Total space needed is kdn.
• If the parameters are tied, i.e., all E^i's are equal, then the number of parameters reduces by a factor of n.
An example of WEST factorization when V = 6, d = 4, k = 3, and n = 2.
• Language codes
• Let F be a collection of sub-units such as characters or word-pieces.
• We then decompose a word into sub-units such as characters or word-pieces and obtain the code by mapping each word into sub-units.
• c(w) = F(w1), F(w2), . . . , F(w_{n'}).
• For example, if the set of words is {i, it, he, she, you, they}, and sub-units are {I, t, he, s, you, y}, then c(she) is (4, 3).
• Random codes
• Each word is mapped to a random sequence of length n such that no two words have the same code.

[61] Ehsan Variani, Ananda Theertha Suresh, and Mitchel Weintraub. 2019. WEST: Word Encoded Sequence Transducers. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 7340–7344.
Matrix decomposition: Factorize large matrices into
multiple smaller components
• Outline for matrix decomposition
• Two low-rank factors
• Factorizing into blocks
• Tensor train decomposition
• Block-Term tensor decomposition
TT-RNN
• The number of RNN parameters can be significantly reduced by representing the weight parameters based on the Tensor Train (TT) format.
Calculating an element W(j_1, .., j_k) using a set of TT-cores

• For all matrices 𝐺𝑘 [𝑗𝑘 ] related to the same dimension k, they must be represented with size 𝑟𝑘−1 × 𝑟𝑘 , where
𝑟0 and 𝑟𝑑 must be equal to 1 to retain the final matrix multiplication result as a scalar.
• In TT-format, we define a sequence of ranks {r_k}_{k=0}^{d}, and we call them the TT-ranks of tensor W. The set of matrices G_k = {G_k[j_k]}_{j_k=1}^{n_k} are called TT-cores.
• By factoring the original tensor W into multiple TT-cores, we can compress the number of elements needed to represent the original tensor from ∏_{k=1}^{d} n_k to Σ_{k=1}^{d} n_k r_{k−1} r_k.
• Representing Linear Transformation using TT-format
Fully Connected vs TT Layer Running Time and Memory
• Consider linear transformation y=Wx+b where 𝑊 ∈ 𝑅 𝑀×𝑁 and 𝑏 ∈ 𝑅 𝑀 .
• We first convert W to a tensor ω by selecting m_k and n_k such that M = ∏_{k=1}^{d} m_k and N = ∏_{k=1}^{d} n_k. Next, we define two bijective functions f_i and f_j. f_i maps each row p ∈ {1..M} into f_i(p) = [i_1(p), …, i_d(p)]. f_j maps each column q ∈ {1..N} into f_j(q) = [j_1(q), .., j_d(q)].
• Thus, W(p, q) = ω(f_i(p), f_j(q)) = ω(i_1(p), …, i_d(p), j_1(q), .., j_d(q)) = G_1[i_1(p), j_1(q)] .. G_d[i_d(p), j_d(q)]
• Note for each k ∈ {1, .., d}, G_k[i_k(p), j_k(q)] ∈ R^{r_{k−1}×r_k}, i_k(p) ∈ {1, .., m_k} and j_k(q) ∈ {1, …, n_k}.
• To represent the linear transformation using tensors, we need to reshape the vector input x into tensor X and bias vector
b into tensor B with order d to match our tensor 𝜔.
• Compressing Simple RNN with TT-format
• We need to represent matrices 𝑊𝑥ℎ and 𝑊ℎℎ into TT format and also change the inputs appropriately to be compatible in
terms of dimensions.
• For each of the two matrices, we first map them to tensors and use different TT decompositions with two different sets of
TT cores.
• Typical RNN: ℎ𝑡 = 𝑓(𝑊𝑥ℎ 𝑥𝑡 + 𝑊ℎℎ ℎ𝑡−1 + 𝑏ℎ ). These get transformed as shown on the right.
• Glorot initialization for TT-cores

[60] Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. 2017. Compressing recurrent neural network with tensor train. In 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 4451–4458.
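To illustrate the TT-matrix format, here is a tiny numpy sketch that reconstructs a single entry W(p, q) as a chain product of core slices G_k[i_k(p), j_k(q)]; the mode sizes, ranks, and the particular index bijection are illustrative assumptions.

import numpy as np

m, n, d = [4, 4, 4], [5, 5, 5], 3           # M = 64 rows, N = 125 columns
ranks = [1, 3, 3, 1]                         # TT-ranks, with r_0 = r_d = 1
# TT-cores: cores[k][i_k, j_k] is an r_{k-1} x r_k matrix
cores = [np.random.randn(m[k], n[k], ranks[k], ranks[k + 1]) for k in range(d)]

def multi_index(p, dims):                    # a simple bijection p -> (i_1(p), ..., i_d(p))
    idx = []
    for dim in dims:
        idx.append(p % dim); p //= dim
    return idx

def tt_entry(p, q):
    i, j = multi_index(p, m), multi_index(q, n)
    out = np.eye(1)                          # 1 x 1 since r_0 = 1
    for k in range(d):
        out = out @ cores[k][i[k], j[k]]     # chain of r_{k-1} x r_k matrices
    return out[0, 0]                         # r_d = 1, so the product is a scalar

w_pq = tt_entry(p=10, q=97)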
TT–embedding
• Tensor Train (TT) decomposition for parametrizing embedding layers.
• Instead of storing huge embedding matrix, we store a sequence of much smaller 2D and 3D tensors, necessary for reconstructing the required
embeddings, which allows compressing the model significantly at the cost of a negligible performance drop.
• The values of TT–ranks directly define the compression ratio, so choosing them to be too small or too large will result into either significant
performance drop or little reduction of the number of parameters. We set all TT– ranks to 16 for problems with small vocabularies and 64 − 192
for problems with larger vocabularies, which result in a good trade-off between embedding layer compression ratio and the metric of interest.
• Compression ratios of the embedding layers: 441 on the IMDB sentiment classification with 1% absolute increase in classification accuracy; 15 on
the WMT 2014 En–De machine translation with 0.3 drop in the BLEU score; 3.8 on the WikiText-103 language modeling with 1.3 drop in test
perplexity.

Construction of the TT–matrix from the standard embedding matrix. Blue color depicts how the single element in the initial matrix is transformed
into the product of the highlighted vectors and matrices in the TT–cores.

[31] Valentin Khrulkov, Oleksii Hrinchuk, Leyla Mirvakhabova, and Ivan Oseledets. 2019. Tensorized Embedding Layers for Efficient Model Compression. arXiv preprint arXiv:1901.10787 (2019).
Matrix decomposition: Factorize large matrices into
multiple smaller components
• Outline for matrix decomposition
• Two low-rank factors
• Factorizing into blocks
• Tensor train decomposition
• Block-Term tensor decomposition
Block-Term tensor
decomposition
Block Term decomposition for a 3-order case tensor. A 3-order tensor
• Compared with alternative low-rank approximations, such as tensor 𝑿 ∈ 𝑹𝑰𝟏 ×𝑰𝟐 ×𝑰𝟑 can be approximated by N Tucker decompositions. We call
train RNN (TT-RNN), Block-Term RNN (BTRNN), is not only more the N the CP-rank, R1, R2, R3 the Tucker-rank and d the Core-order.
concise (when using the same rank), but also able to attain a better
approximation to the original RNNs with much fewer parameters.
• BTD decomposes a high order tensor into a sum of multiple Tucker
decomposition models.
• The redundant dense connections between input and hidden state is
replaced by low-rank BT representation.
• BT-LSTM utilizes 17,388 times fewer parameters than the standard
LSTM to achieve an accuracy improvement over 15.6% in the Action
Recognition task on the UCF11 dataset.
Tensorization operation in a case of 3-order tensors. (a) Tensorizing a vector with shape I = I1 · I2 · I3 to a tensor with shape I1 × I2 × I3; (b) Tensorizing a matrix with shape I1 × (I2 · I3) to a tensor with shape I1 × I2 × I3.
• Decomposing W with BTD
• Given a 2-dimensional weight matrix W ∈ R^{J×I}, we can tensorize it as a 2d-dimensional tensor W ∈ R^{J1×I1×J2×···×Jd×Id}, where I = I1 × I2 ··· Id and J = J1 × J2 ··· Jd. Following BTD, we can decompose W into:
• BTD(W) = Σ_{n=1}^{N} G_n •_1 A_n^1 •_2 A_n^2 •_3 … •_d A_n^d
• where G_n ∈ R^{R1×⋯×Rd} denotes the core tensor, A_n^d ∈ R^{Id×Jd×Rd} denotes the factor tensor, N is the CP rank and d is the core-order.
• Note that 𝑅𝑘 ≤ 𝐼𝑘 (and 𝐽𝑘 ), 𝑘 = 1, . . , 𝑑. To obtain a robust model, we set each
Tucker-rank to be equal, e.g., 𝑅𝑖 = R, i ∈ [1, d], to avoid unbalanced weight
sharing in different dimensions.
• Parameter comparison
• #parameters in BTD = N(Σ_{k=1}^{d} I_k J_k R + R^d). The original weight matrix W has I × J = ∏_{k=1}^{d} I_k J_k parameters, which is several orders of magnitude larger than in the BTD representation.
The weight matrix’s shape is I × J. The input and hidden tensors’ shapes are I = I1 × · · · × Id and J = J1 × · · · × Jd, respectively. Here, J_max = max_k J_k, k ∈ [1, d]. Both TT-RNN and BT-RNN are set to the same rank R.

[68] Jinmian Ye, Linnan Wang, Guangxi Li, Di Chen, Shandian Zhe, Xinqi Chu, and Zenglin Xu. 2018. Learning compact recurrent neural networks with block-term tensor decomposition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. 9378–9387.
Multi-linear attention with Block
Term Decomposition (BTD)
• Tensorized embedding (TE) uses TT to compress embedding 𝑨 ∈ 𝑹𝒅𝟏×𝒅𝟐×𝒅𝟑 is a 3-order tensor, and can be approximated by P Tucker
layers in Transformer-XL, but not the attention layer. decomposition. P is the CP rank, and R1, R2, R3 are the Tucker rank,
respectively. In this paper, we assume that R=R1=R2=R3.
• Also, BTD was used to compress RNNs (Ye et al).
• The self-attention function in Transformer is a non-linear
function, which makes it difficult to compress. It turns out that
the output of the attention function can be linearly represented
by a group of orthonormal base vectors.
• Tensorized Transformer
• We first build a Single-block attention based on the Tucker
decomposition, a low-rank decomposition method.
• First, we assume that the query, key and value can be mapped into three
factor matrices
• After that, we can construct a new attention (i.e., Single-block attention) by
initializing a 3-order diagonal tensor (trainable) which is the G.
• ◦ is the outer product.
• 𝑄𝑖 , 𝐾𝑗 and 𝑉𝑘 are column vectors from matrices Q, K and V, where 𝑄 ∈ 𝑅𝑁×𝑑
, 𝐾 ∈ 𝑅𝑁×𝑑 and 𝑉 ∈ 𝑅𝑁×𝑑 , and N is the length of a sequence. We set
I=J=M=R.
• We can consider G as the trainable weight.

(left) Single-block attention using Tucker decomposition. (right) Multi-linear attention based on Block-Term tensor decomposition. R is the rank of the tensor, N is the length of a sequence, and d is the dimension of the matrix.

[41] Xindian Ma, Peng Zhang, Shuai Zhang, Nan Duan, Yuexian Hou, Dawei Song, and Ming Zhou. 2019. A Tensorized Transformer for Language Modeling. arXiv preprint arXiv:1906.09777 (2019).
Multi-linear attention with Block Term
Decomposition (BTD)
• Tensorized Transformer (Multi-linear attention with BTD)
• To compress the multi-head mechanism, we propose a multi-
linear attention constructed by a BTD.
• We use a group of linear projections, and share the output from
the linear projections.
• The learned linear projection can map queries, keys and values
to three matrices which are composed of basis vectors. After
that, we use BTD to build multi-head mechanism.
• This attention uses the idea of parameters sharing, i.e., sharing
factor matrices across multiple blocks.
• SplitConcat(·) is a function which achieves the concatenation
after splitting for a 3-order tensor.
• 𝑊𝑂 is the parameter matrix which is a full connection layer and
correlated to the output of Multi-linear attention. A diagram which is about the incorporating of multi-linear attention
• AttenTD(·) is the function of Single-block attention, which is a in partial Transformer structure. The parameters are shared in the
part of Multi-linear attention. 𝑊𝑞 , 𝑊𝑘 and 𝑊𝑣 are the constructing of each single-block attention.
parameters matrices which are shared in constructing Multi-
linear attention.

[41] Xindian Ma, Peng Zhang, Shuai Zhang, Nan Duan, Yuexian Hou, Dawei Song, and Ming Zhou. 2019. A Tensorized Transformer for Language Modeling. arXiv preprint arXiv:1906.09777 (2019).
Agenda
• Need for compression of deep learning models
• Broad overview of popular ways of model compression
• Pruning
• Quantization
• Knowledge Distillation
• Parameter sharing
• Matrix decomposition
• Transformers with Linear Complexity
• Applications
• Summary and future trends
Sparse Transformers
• Time and memory in Transformers grow quadratically with the sequence length. Sparse factorizations of the attention matrix reduce this to O(n√n).
• Strided attention (Fig b):
• Have one head attend to the previous l locations, and the other head attend to every l-th location, where l is the stride and chosen to be close to √n.
• Fixed attention (Fig c):
• Specific cells summarize previous locations and propagate that information to all future cells.
Two 2d factorized attention schemes we evaluated in comparison to the full attention of a standard Transformer (a). The top row indicates, for an example 6x6 image, which positions two attention heads receive as input when computing a given output. The bottom row shows the connectivity matrix (not to scale) between all such outputs (rows) and inputs (columns). Sparsity in the connectivity matrix can lead to significantly faster computation. In (b) and (c), full connectivity between elements is preserved when the two heads are computed sequentially.
• Three ways to integrate factorized self-attention
• use one attention type per residual block, and
interleave them sequentially
• have a single head attend to the locations of the pixels
that both factorized heads would attend to, which we
call a merged head
• Use multi-head attention
• They can model sequences tens of thousands of Sparse patterns showed increased speed and also better loss
timesteps long using hundreds of layers.
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509 (2019).
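A sketch of the strided pattern as boolean attention masks for a causal 1-D sequence; the merging/interleaving of the two heads across residual blocks is not shown, and the sequence length is a toy value.

import numpy as np

n = 16
l = int(np.sqrt(n))                     # stride chosen close to sqrt(n)
i = np.arange(n)[:, None]               # output (query) positions
j = np.arange(n)[None, :]               # input (key) positions
causal = j <= i

head_local = causal & (i - j < l)            # attend to the previous l locations
head_strided = causal & ((i - j) % l == 0)   # attend to every l-th location

# each output attends to O(sqrt(n)) inputs per head instead of O(n)
print(head_local.sum(1) + head_strided.sum(1))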
Deep equilibrium models (DEQs)
• We broadly consider the class of weight-tied deep sequence models
(with pass through connections from the input to each layer), which
consist of the update
• z^{i+1}_{1:T} = f_θ(z^i_{1:T}; x_{1:T}), i = 0, .., L − 1, with z^0_{1:T} = 0 and G(x_{1:T}) ≡ z^L_{1:T}
• Note that T is the sequence length. Let the actual label be y_{1:T} ∈ R^{T×q}. Let the original input be x_{1:T} ∈ R^{T×p}.
• We can tie weights across layers of a Transformer, and still obtain
good results.
• To update L-layer networks like Transformers, the backward passes
rely on backpropagating through the same L layers via the chain rule,
which typically necessitates that we store the intermediate values of
these layers. Hence, almost all such models (and deep nets in
general) are stacked, trained and evaluated by unrolling a pre-
determined, fixed number of layers.
• But, if the same transformation is applied at each layer of a deep
network, what is the limit of this process, and how do we model it?
• In principle, the network could have infinite depth. We propose to
directly (and in practice, more quickly) solve for the equilibrium via
any black-box root-finding method. Importantly, we show that DEQ
can directly differentiate through the fixed point equations via A deep equilibrium model operates with significantly less memory than
implicit differentiation, which does not require storing any conventional deep nets due to an analytical backward pass
intermediate activation values. In other words, we can
backpropagate through the infinite-depth network while using only
constant memory, equivalent to a single layer’s activations.

[4] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. 2019. Deep equilibrium models. arXiv preprint arXiv:1909.01377 (2019).
Star Transformer
• To reduce model complexity, we replace the fully-connected structure with a star-shaped topology, in which every two non-adjacent nodes are connected through a shared relay node. Thus, complexity is reduced from quadratic (n^2) to linear (2n).
Left: Connections of one layer in Transformer, circle nodes indicate the hidden states of input tokens. Right: Connections of one layer in Star-Transformer, the square node is the virtual relay node. Red edges and blue edges are ring and radical connections, resp.

Test Accuracy on SST dataset

“Len.” means the average sequence length on the test set.


• If the sequence length is n and the dimension of hidden states is d, the computation complexity of one
layer in the standard Transformer is O(𝑛2 𝑑). The Star-Transformer has two phases, the update of ring
connections costs O(5nd) (the constant 5 comes from the size of context information C), and the
update of radical connections costs O(nd), so the total cost of one layer in the Star-Transformer is
O(6nd).
• Compare with the standard Transformer, all positions are processed in parallel, pair-wise connections
are replaced with a “gather and dispatch” mechanism. As a result, we accelerate the Transformer 10
times on the simulation task and 4.5 times on real tasks.
[21] Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. 2019. Star-Transformer. NAACL-HLT, Volume 1 (Long and Short Papers). 1315–1325.
Reformer (LSH attention)
• We replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(L log L), where L is the length of the sequence.
• Furthermore, we use reversible residual layers instead of the
standard residuals, which allows storing activations only once
in the training process instead of N times, where N is the
number of layers.
• Attention(Q,K,V) = softmax(QK^T/√d_k) V. Now, QK^T is of size square(length). But it does not need to be fully materialized in memory.
Simplified depiction of LSH Attention showing the hash-bucketing, sorting, and chunking steps and the resulting causal attentions. (a-d) Attention matrices for these varieties of attention.
• Shared QK-Transformer: Q=K by sharing the matrix that
projects words/hidden layer to Q or K.
• we are actually only interested in softmax(𝑄𝐾 𝑇 ). Since
softmax is dominated by the largest elements, for each query
𝑞𝑖 we only need to focus on the keys in K that are closest to 𝑞𝑖 .
how can we find the nearest neighbors among the keys? Using
LSH.
• Part (a) depicts that the attention matrix for full attention is
typically sparse, but the computation does not take advantage
of this sparsity.
• We sort the queries by bucket number and, within each
bucket, by sequence position. We can follow a batching
approach where chunks of m consecutive queries (after Speed of attention evaluation as a function of input length for full- and LSH- attention.
sorting) attend to each other, and one chunk back.
Kitaev, Nikita, Łukasz Kaiser, and Anselm Levskaya. "Reformer: The efficient transformer." arXiv preprint arXiv:2001.04451 (2020).
Linformer
Per-layer time complexity and minimum number of sequential
• Self-attention mechanism can be approximated by a low-rank matrix. operations as a function of sequence length (n) for various architectures.
We further exploit this finding to propose a new self-attention
mechanism, which reduces the overall self-attention complexity from
O(𝑛2 ) to O(n) in both time and space.
• The main idea is to add two linear projection matrices 𝐸𝑖 , 𝐹𝑖 ∈𝑅𝑛×𝑘
when computing key and value. We first project the original (n×d)-
dimensional key and value layers 𝐾𝑊𝑖𝐾 and 𝑉𝑊𝑖𝑉 into (k×d)-
dimensional projected key and value layers.
• We then compute an (n × k)-dimensional context mapping matrix 𝑃ത
using scaled dot-product attention.
• If we can choose a very small projected dimension k, such that k<<n,
then we can significantly reduce the memory and space
consumption. Overall, it is O(nk).
• Parameter sharing
• Headwise sharing: 𝐸𝑖 = 𝐸 and 𝐹𝑖 = 𝐹 across all heads i in a layer.
• Key-value sharing: 𝐸𝑖 = 𝐹𝑖 = 𝐸 across all heads i in a layer.
• Layerwise sharing: Single projection matrix E is used across all layers, all
heads for both key and value.
• K=256 shows no degradation on SST-2, IMDB, QNLI, QQP tasks.

Linformer over Transformer. Left table shows time saved. Right table shows memory saved.

Wang, Sinong, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. "Linformer: Self-Attention with Linear Complexity." arXiv preprint arXiv:2006.04768 (2020).
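A sketch of one Linformer head: keys and values are projected from sequence length n down to k before the usual scaled dot-product attention. The per-head projections W_i^Q, W_i^K, W_i^V are omitted, and the sizes and initialization are illustrative.

import torch

n, d, k = 1024, 64, 256
Q = torch.randn(n, d); K = torch.randn(n, d); V = torch.randn(n, d)
E = torch.randn(k, n) / n ** 0.5        # projects keys:   (k, n) @ (n, d) -> (k, d)
F = torch.randn(k, n) / n ** 0.5        # projects values

P_bar = torch.softmax(Q @ (E @ K).t() / d ** 0.5, dim=-1)   # (n, k) context mapping matrix
out = P_bar @ (F @ V)                    # (n, d); cost O(n k) instead of O(n^2)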
Sparse Sinkhorn Attention
• Method is based on differentiable sorting of internal representations.
• Concretely, we introduce a meta sorting network that learns to generate latent Overview of Sparse Sinkhorn Attention. A Meta Sorting Network learns
permutations over sequences. to sort sequences to enable efficient quasi-global local attention.
• Given sorted sequences, we are then able to compute quasi-global attention with
only local windows, improving the memory efficiency of the attention module. • SortNet
• For causal scenarios, we propose new algorithmic innovations such as Causal • Block embedding is sum of embeddings of all tokens belonging
Sinkhorn Balancing and SortCut, a dynamic sequence truncation method for to the local window (block).
tailoring Sinkhorn Attention for encoding and/or decoding purposes.
• X′ = ψ_P(X) where ψ_P(.): R^{l×d} → R^{N_B×d} and ψ_P(X)_i = Σ_{j=i×l_B}^{(i+1)×l_B} X_j.
standard vanilla attention, leading to further performance improvements. 𝐵

• Our method reduces the memory complexity from O(𝑙2 ) to O(𝐵2 + 𝑁𝐵2 ) where • Next, our trainable SortNet is defined as 𝑅𝑖 = 𝑃(𝑋𝑖′ ) where i is
𝐵 = 𝑙/𝑁𝐵 . When l is large, this factorization of sequence length brings about the block index, P(X)=𝜎 𝑊𝐵 𝜎 𝑊𝑃 𝑋 + 𝑏𝑃 + 𝑏𝐵
substantial savings. Our SORTCUT variant further reduces complexity to linear- • Sinkhorn normalization
time, i.e., O(𝑙𝑁𝑘 ) where 𝑁𝑘 is a user defined budget hyperparameter and 𝑁𝑘 <<<𝑙.
• The matrix R becomes a sorting matrix (or permutation matrix)
if it is doubly stochastic (matrix is nonnegative and both rows
and columns all sum to 1).

Overview of the proposed SortCut Encoding Scheme


[124] Tay, Yi, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. "Sparse Sinkhorn Attention." arXiv preprint arXiv:2002.11296 (2020).
Efficient Attention

• Equivalence between Dot-Product and Efficient


Attention
• D(Q, K, V) = (QK^T/n) V
• E(Q, K, V) = (Q/√n)((K^T/√n) V) = (1/√n)Q (1/√n)K^T V = (1/n)(QK^T) V = (QK^T/n) V

Comparison of resource usage of the efficient attention and non-local modules. This Resource requirements under different input sizes. Blue and orange bars depict resource
table assumes that 𝒅𝒗 =d requirements of efficient attention and non-local modules, resp. We assume d=𝒅𝒗 =2𝒅𝒌 =6
[126] Shen, Zhuoran, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. "Efficient Attention: Attention with Linear Complexities." arXiv preprint arXiv:1812.01243 (2018).
Linear Transformers
• We express the self-attention as a linear dot-product of
kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from O(N^2) to O(N), where N is the sequence length.
• We show that this formulation permits an iterative
implementation that dramatically accelerates
autoregressive transformers and reveals their relationship
to RNNs.
• Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences.
Normal attention
• Linearized Attention
• Similarity function between Q and K could be any kernel.
• Computational cost of softmax attention scales with O(𝑁 2 ),
where N represents the sequence length. The same is true for
the memory requirements because the full attention matrix
must be stored to compute the gradients with respect to the
queries, keys and values.
• In contrast, our proposed linear transformer has time and memory complexity O(N) because we can compute Σ_{j=1}^{N} φ(K_j) V_j^T and Σ_{j=1}^{N} φ(K_j) once and reuse them for every query.
Linearized attention
• They propose φ(x) = elu(x) + 1
[127] Katharopoulos, Angelos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." arXiv preprint arXiv:2006.16236 (2020).
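A sketch of the linearized (non-causal) attention with the feature map φ(x) = elu(x) + 1; the two key-side sums are computed once and reused for every query. Sequence length and dimensions are illustrative, and only a single head is shown.

import torch

def phi(x):                                   # feature map proposed in [127]
    return torch.nn.functional.elu(x) + 1

N, d = 2048, 64
Q, K, V = (torch.randn(N, d) for _ in range(3))

Qf, Kf = phi(Q), phi(K)                       # (N, d)
KV = Kf.t() @ V                               # sum_j phi(K_j) V_j^T  -> (d, d), computed once
Z = Kf.sum(dim=0)                             # sum_j phi(K_j)        -> (d,)

out = (Qf @ KV) / (Qf @ Z).unsqueeze(-1)      # O(N d^2) instead of O(N^2 d)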
Agenda
• Need for compression of deep learning models
• Broad overview of popular ways of model compression
• Pruning
• Quantization
• Knowledge Distillation
• Parameter sharing
• Matrix decomposition
• Transformers with Linear Complexity
• Applications
• Summary and future trends
Language modeling: Penn Tree Bank Dataset

Test perplexities on PTB with 650 hidden nodes, K=10. NE=original word embeddings. SE=Slim embeddings [37]
Bank Balanced Sparsity [6]
Impact of pruning on LSTM model. [72]

The testing PPW of HitNet on the PTB dataset. Kapur et al [30], He et al [24], Zhou et
al [71], HitNet, Full Precision (FP).

Quantization on the full precision pre-trained weight of GRU on the PTB dataset (Alternating Multi-bit quantization) [65]
One Billion (LM1B) benchmark using the Base (50M parameter) and Big (430M) setting. [124]
Popular datasets: PTB, One billion word benchmark, Europarl, WikiText, text8
Machine translation: English-German data from
WMT 2014. newstest2014 for test.

LSTM encoder-decoder model [72]

Effects of different pruning schemes [51]

Transformer [10]

German-English: Knowledge distillation based on oracle BLEU (oBLEU) translations [95]
BLEU when incrementally pruning heads from each attention type in the WMT model. [42]
Sequence-level knowledge distillation, LSTM [32]
Sentiment analysis: IMDB, SST

Quaternion Transformer [59]

Binary/Ternary/Quarternary Quantization [1]

Quantized RNNs on IMDB sentence classification [24]

Sentiment analysis, LSTM on IMDB and SST datasets. Embedding compression


is calculated as the ratio between the number of parameters in the full
embedding layer and TT–embedding layer. The LSTM parts are identical in
both models, and the TT–ranks were set to 16 in these experiments. [31]

Dev set results on benchmark natural language understanding tasks. The


RoBERTa-base model here is pretrained with same corpus as BERT. [123]
Question answering: SQuAD

BERT vs ALBERT [34]

Pruned BERTBASE: F1 score of fine-tuning on SQuAD 1.1

The effect of cross-layer parameter-sharing strategies, ALBERT-base configuration.[34]


Natural language inference

Results from the GLUE test server. The best results for 3-layer and 6-layer models are in-bold. Google’s submission results are obtained from official GLUE leaderboard.
BERT12 (Teacher) is our own implementation of the BERT teacher model. FT represents direct fine-tuning on each dataset without using knowledge distillation. KD
represents using a vanilla knowledge distillation method. And PKD represents our proposed Patient-KD-Skip approach. Results show that PKD-Skip outperforms the
baselines on almost all the datasets except for MRPC. The numbers under each dataset indicate the corresponding number of training samples. [55]

Comparison between the publicly released 6-layer models with 768 hidden size distilled from BERTBASE. We compare task agnostic distilled models without task-
specific distillation and data augmentation. We report F1 for SQuAD 2.0, and accuracy for other datasets. The GLUE results of DistillBERT are taken from Sanh et al.
(2019). We report the SQuAD 2.0 result by fine-tuning their released model. For TinyBERT, we fine-tune the latest version of their public model for a fair comparison.
The results of our fine-tuning experiments are an average of 4 runs for each task. [111]
Survey of Compression
methods for BERT
• Due to BERT’s complex architecture, no
existing method takes the entire model Compression methods and their corresponding target architectural components in BERT
as the compression target.
• DQ=Data Quantization
• Pruning (PR): Elementwise Pruning (EP),
Structured Pruning (SP).
• Knowledge Distillation (KD): Distillation on
encoder outputs (EO), Distillation on
output logits (OL), Distillation on attention
maps (AM).
• Architecture-Invariant Compression (AIC):
Parameter sharing (PS), Embedding matrix
compression (EMC), Attention layer
decomposition (AD). Evaluation comparison of various compression methods.

[128] Ganesh, Prakhar, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Deming Chen, Marianne Winslett, Hassan Sajjad, and Preslav Nakov. "Compressing large-scale transformer-based models: A case study on bert." arXiv
preprint arXiv:2002.11985 (2020).
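To make the EP/SP distinction above concrete, here is a minimal NumPy sketch: elementwise pruning zeroes the smallest-magnitude individual weights, while structured pruning removes whole rows (standing in for neurons or attention heads), so the matrix actually shrinks. Magnitude thresholding is just one common criterion; the surveyed methods differ in how they score what to remove.

```python
import numpy as np

def elementwise_prune(W, sparsity=0.5):
    """EP: zero out the smallest-magnitude individual weights."""
    threshold = np.quantile(np.abs(W), sparsity)
    return W * (np.abs(W) >= threshold)

def structured_prune_rows(W, keep_fraction=0.5):
    """SP: drop whole rows (e.g., neurons or heads) with the smallest L2 norm."""
    norms = np.linalg.norm(W, axis=1)
    k = max(1, int(round(keep_fraction * W.shape[0])))
    keep = np.sort(np.argsort(norms)[-k:])   # indices of the rows we keep, in order
    return W[keep]

W = np.random.randn(8, 16)
print("EP nonzeros:", np.count_nonzero(elementwise_prune(W, 0.5)), "of", W.size)
print("SP shape:", structured_prune_rows(W, 0.5).shape)   # the matrix actually shrinks
```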
Agenda
• Need for compression of deep learning models
• Broad overview of popular ways of model compression
• Pruning
• Quantization
• Knowledge Distillation
• Parameter sharing
• Matrix decomposition
• Transformers with Linear Complexity
• Applications
• Summary and future trends
Summary
• Popular in computer vision community: Pruning, Quantization
• Most popular for Transformers: Knowledge Distillation, Parameter sharing

Category               | Method                       | Prediction-time latency | Model compression
O(n²) → O(n)           | Star Transformer / Linformer | Low – O(n)              | O(n) for self-attention
Pruning                | Prune heads/layers           | O(n²)                   | 40-50% of full BERT
Quantization           | Q-BERT                       | O(n²)                   | 4-10x compression
Knowledge Distillation | MiniLM                       | O(n²), ~2x faster       | ~5x compression
Parameter Sharing      | ALBERT                       | O(n²), ~2x faster       | ~18x compression
Matrix Decomposition   | BTD                          | O(n²)                   | ~8x compression
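As a shape-level illustration of the O(n²) → O(n) row, the sketch below applies Linformer-style attention [123]: keys and values are projected from length n down to a fixed length k, so the attention map is n×k rather than n×n. The projection matrices E and F are learned in Linformer; here they are random, only to show the shapes and cost.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Self-attention with length-k projections of K and V: O(n*k) instead of O(n^2)."""
    K_proj = E @ K                   # (k, d): compress n key vectors down to k
    V_proj = F @ V                   # (k, d)
    scores = Q @ K_proj.T / np.sqrt(Q.shape[-1])   # (n, k) instead of (n, n)
    return softmax(scores) @ V_proj  # (n, d)

n, d, k = 4096, 64, 256
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
E, F = (rng.normal(size=(k, n)) / np.sqrt(n) for _ in range(2))
out = linformer_attention(Q, K, V, E, F)
print(out.shape)                     # (4096, 64), computed via a 4096x256 attention map
```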
Future Trends
• A lot of experiments are needed to check how models respond to combinations of
methods.
• Latency results will vary based on GPU architecture. With new architectures
coming in (like T4), some methods like quantization may become more
impactful.
• Real-world settings are often complex: multi-modal, multi-task, multi-label,
multiple teachers, mismatched teacher-student architectures. Efficient ways of
recommending the most promising method are needed.
• Different components/structures of a model may respond to different kinds of
compression methods with specific hyper-parameters. How to choose these?
References
[1] Md Zahangir Alom, Adam T Moody, Naoya Maruyama, Brian C Van Essen, and Tarek M Taha. 2018. Effective quantization approaches for recurrent neural networks. In 2018 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8.
[2] Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E Dahl, and Geoffrey E Hinton. 2018. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235 (2018).
[3] Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep?. In Advances in neural information processing systems. 2654–2662.
[4] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. 2019. Deep equilibrium models. arXiv preprint arXiv:1909.01377 (2019).
[5] Aishwarya Bhandare, Vamsi Sripathi, Deepthi Karkada, Vivek Menon, Sun Choi, Kushal Datta, and Vikram Saletore. 2019. Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model. arXiv preprint arXiv:1906.00532
(2019).
[6] Shijie Cao, Chen Zhang, Zhuliang Yao, Wencong Xiao, Lanshun Nie, Dechen Zhan, Yunxin Liu, Ming Wu, and Lintao Zhang. 2019. Efficient and effective sparse LSTM on FPGA with Bank-Balanced Sparsity. In Proceedings of the 2019 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays. ACM, 63–72.
[7] Patrick Chen, Si Si, Yang Li, Ciprian Chelba, and Cho-Jui Hsieh. 2018. Groupreduce: Block-wise low-rank approximation for neural language model shrinking. In Advances in Neural Information Processing Systems. 10988–10998.
[8] Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. 2015. Compressing neural networks with the hashing trick. In International Conference on Machine Learning. 2285–2294.
[9] Yunchuan Chen, Lili Mou, Yan Xu, Ge Li, and Zhi Jin. 2016. Compressing neural language models by sparse word representations. arXiv preprint arXiv:1610.03950 (2016).
[10] Robin Cheong and Robel Daniel. 2019. transformers. zip: Compressing Transformers with Pruning and Quantization. Technical Report. Technical report, Stanford University, Stanford, California, 2019.
[11] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509 (2019).
[12] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. 2015. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems. 3123–3131.
[13] Xiaoliang Dai, Hongxu Yin, and Niraj K Jha. 2018. Grow and prune compact, fast, and accurate LSTMs. arXiv preprint arXiv:1805.11797 (2018).
[14] Sonam Damani, Kedhar Nath Narahari, Ankush Chatterjee, Manish Gupta, and Puneet Agrawal. 2020. Optimized Transformer Models for FAQ Answering. In The 24th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). To
appear.
[15] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. 2018. Universal transformers. arXiv preprint arXiv:1807.03819 (2018).
[16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[17] Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah Smith. 2015. Sparse overcomplete word vector representations. arXiv preprint arXiv:1506.02004 (2015).
[18] Artem M Grachev, Dmitry I Ignatov, and Andrey V Savchenko. 2019. Compression of recurrent neural networks for efficient language modeling. Applied Soft Computing 79 (2019), 354–362.
References
[19] Scott Gray, Alec Radford, and Diederik P Kingma. 2017. Gpu kernels for blocksparse weights. arXiv preprint arXiv:1711.09224 (2017).
[20] Fu-Ming Guo, Sijia Liu, Finlay S Mungall, Xue Lin, and Yanzhi Wang. 2019. Reweighted Proximal Pruning for Large-Scale Language Representation. arXiv preprint arXiv:1909.12486 (2019).
[21] Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. 2019. Star-Transformer. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers). 1315–1325.
[22] Song Han, Huizi Mao, and William J Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015).
[23] Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, et al. 2016. DSD: Dense-sparsedense training for deep neural networks. arXiv preprint arXiv:1607.04381 (2016).
[24] Qinyao He, He Wen, Shuchang Zhou, Yuxin Wu, Cong Yao, Xinyu Zhou, and Yuheng Zou. 2016. Effective quantization methods for recurrent neural networks. arXiv preprint arXiv:1611.10176 (2016).
[25] Tianxing He, Yuchen Fan, Yanmin Qian, Tian Tan, and Kai Yu. 2014. Reshaping deep neural network for fast decoding by node-pruning. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 245–249.
[26] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
[27] Lu Hou and James T Kwok. 2018. Loss-aware weight quantization of deep networks. arXiv preprint arXiv:1802.08635 (2018).
[28] Lu Hou, Quanming Yao, and James T Kwok. 2016. Loss-aware binarization of deep networks. arXiv preprint arXiv:1611.01600 (2016).
[29] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2017. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research 18, 1 (2017),
6869–6898.
[30] Supriya Kapur, Asit Mishra, and Debbie Marr. 2017. Low precision rnns: Quantizing rnns without losing accuracy. arXiv preprint arXiv:1710.07706 (2017).
[31] Valentin Khrulkov, Oleksii Hrinchuk, Leyla Mirvakhabova, and Ivan Oseledets. 2019. Tensorized Embedding Layers for Efficient Model Compression. arXiv preprint arXiv:1901.10787 (2019).
[32] Yoon Kim and Alexander M Rush. 2016. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947 (2016).
[33] Maximilian Lam. 2018. Word2bits-quantized word vectors. arXiv preprint arXiv:1803.05651 (2018).
[34] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019).
[35] Yann LeCun, John S Denker, and Sara A Solla. 1990. Optimal brain damage. In Advances in neural information processing systems. 598–605.
[36] Xiang Li, Tao Qin, Jian Yang, and Tie-Yan Liu. 2016. LightRNN: Memory and computation-efficient recurrent neural networks. In Advances in Neural Information Processing Systems. 4385–4393.
References
[37] Zhongliang Li, Raymond Kulhanek, Shaojun Wang, Yunxin Zhao, and Shuang Wu. 2018. Slim embedding layers for recurrent neural language models. In Thirty-Second AAAI Conference on Artificial Intelligence.
[38] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding. arXiv preprint arXiv:1904.09482 (2019).
[39] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multitask deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504 (2019).
[40] Zhiyun Lu, Vikas Sindhwani, and Tara N Sainath. 2016. Learning compact recurrent neural networks. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5960–5964.
[41] Xindian Ma, Peng Zhang, Shuai Zhang, Nan Duan, Yuexian Hou, Dawei Song, and Ming Zhou. 2019. A Tensorized Transformer for Language Modeling. arXiv preprint arXiv:1906.09777 (2019).
[42] Paul Michel, Omer Levy, and Graham Neubig. 2019. Are Sixteen Heads Really Better than One? arXiv preprint arXiv:1905.10650 (2019).
[43] Kenton Murray and David Chiang. 2015. Auto-sizing neural networks: With applications to n-gram language models. arXiv preprint arXiv:1508.05051 (2015).
[44] Sharan Narang, Eric Undersander, and Gregory Diamos. 2017. Block-sparse recurrent neural networks. arXiv preprint arXiv:1711.02782 (2017).
[45] Joachim Ott, Zhouhan Lin, Ying Zhang, Shih-Chii Liu, and Yoshua Bengio. 2016. Recurrent neural networks with limited numerical precision. arXiv preprint arXiv:1608.06902 (2016).
[46] Antonio Polino, Razvan Pascanu, and Dan Alistarh. 2018. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668 (2018).
[47] Rohit Prabhavalkar, Ouais Alsharif, Antoine Bruguier, and Lan McGraw. 2016. On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition. In 2016 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5970–5974.
[48] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019).
[49] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683
(2019).
[50] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550 (2014).
[51] Abigail See, Minh-Thang Luong, and Christopher D Manning. 2016. Compression of neural machine translation models via pruning. arXiv preprint arXiv:1606.09274 (2016).
[52] Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. 2019. Q-bert: Hessian based ultra low precision quantization of bert. arXiv preprint arXiv:1909.05840 (2019).
[53] Raphael Shu and Hideki Nakayama. 2017. Compressing word embeddings via deep compositional code learning. arXiv preprint arXiv:1711.01068 (2017).
[54] Suraj Srinivas and R Venkatesh Babu. 2015. Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149 (2015).
References
[55] Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for bert model compression. arXiv preprint arXiv:1908.09355 (2019).
[56] Jun Suzuki and Masaaki Nagata. 2016. Learning Compact Neural Word Embeddings by Parameter Space Sharing.. In IJCAI. 2046–2052.
[57] Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2019. Multilingual neural machine translation with knowledge distillation. arXiv preprint arXiv:1902.10461 (2019).
[58] Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. 2019. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks. arXiv preprint arXiv:1903.12136 (2019).
[59] Yi Tay, Aston Zhang, Luu Anh Tuan, Jinfeng Rao, Shuai Zhang, Shuohang Wang, Jie Fu, and Siu Cheung Hui. 2019. Lightweight and Efficient Neural Natural Language Processing with Quaternion Networks. arXiv preprint arXiv:1906.04393 (2019).
[60] Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. 2017. Compressing recurrent neural network with tensor train. In 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 4451–4458.
[61] Ehsan Variani, Ananda Theertha Suresh, and Mitchel Weintraub. 2019. WEST: Word Encoded Sequence Transducers. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7340–7344.
[62] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
[63] Peiqi Wang, Xinfeng Xie, Lei Deng, Guoqi Li, Dongsheng Wang, and Yuan Xie. 2018. HitNet: hybrid ternary recurrent neural network. In Advances in Neural Information Processing Systems. 604–614.
[64] Ziheng Wang, Jeremy Wohlwend, and Tao Lei. 2019. Structured Pruning of Large Language Models. arXiv preprint arXiv:1910.04732 (2019).
[65] Chen Xu, Jianqiang Yao, Zhouchen Lin, Wenwu Ou, Yuanbin Cao, Zhirong Wang, and Hongbin Zha. 2018. Alternating multi-bit quantization for recurrent neural networks. arXiv preprint arXiv:1802.00150 (2018).
[66] Yafeng Yang, Kaihuan Liang, Xuefeng Xiao, Zecheng Xie, Lianwen Jin, Jun Sun, and Weiying Zhou. 2018. Accelerating and Compressing LSTM Based Model for Online Handwritten Chinese Character Recognition. In 2018 16th International
Conference on Frontiers in Handwriting Recognition (ICFHR). IEEE, 110–115.
[67] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237 (2019).
[68] Jinmian Ye, Linnan Wang, Guangxi Li, Di Chen, Shandian Zhe, Xinqi Chu, and Zenglin Xu. 2018. Learning compact recurrent neural networks with block-term tensor decomposition. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. 9378–9387.
[69] Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. 2018. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4320–4328.
[70] Sanqiang Zhao, Raghav Gupta, Yang Song, and Denny Zhou. 2019. Extreme Language Model Compression with Optimal Subwords and Shared Projections. arXiv preprint arXiv:1909.11687 (2019).
[71] Shu-Chang Zhou, Yu-Zhi Wang, He Wen, Qin-Yao He, and Yu-Heng Zou. 2017. Balanced quantization: An effective and efficient approach to quantized neural networks. Journal of Computer Science and Technology 32, 4 (2017), 667–682.
[72] Michael Zhu and Suyog Gupta. 2017. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878 (2017).
References
[73] Gale, Trevor, Erich Elsen, and Sara Hooker. "The state of sparsity in deep neural networks." arXiv preprint arXiv:1902.09574 (2019).
[74] Yew Ken Chia and Sam Witteveen. “Transformer to CNN: Label-scarce distillation for efficient text classification”. In: 2018.
[75] Jean Senellart et al. “OpenNMT System Description for WNMT 2018: 800 words/sec on a singlecore CPU”. In: Proceedings of the 2nd Workshop on Neural Machine Translation and Generation. Melbourne, Australia: Association for Computational
Linguistics, 2018, pp. 122–128. URL: http://aclweb.org/anthology/W18-2715.
[76] Hassibi, Babak, and David G. Stork. "Second order derivatives for network pruning: Optimal brain surgeon." In Advances in neural information processing systems, pp. 164-171. 1993.
[77] Narang, Sharan, Erich Elsen, Gregory Diamos, and Shubho Sengupta. "Exploring sparsity in recurrent neural networks." arXiv preprint arXiv:1704.05119 (2017).
[78] Diamos, Greg, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski, Adam Coates, Erich Elsen, Jesse Engel, Awni Hannun, and Sanjeev Satheesh. "Persistent rnns: Stashing recurrent weights on-chip." In International Conference on Machine
Learning, pp. 2024-2033. 2016.
[79] Bianco, Simone, Remi Cadene, Luigi Celona, and Paolo Napoletano. "Benchmark analysis of representative deep neural network architectures." IEEE Access 6 (2018): 64270-64277.
[80] Han, Song, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. "EIE: efficient inference engine on compressed deep neural network." ACM SIGARCH Computer Architecture News 44, no. 3 (2016): 243-254.
[81] Walsh, Christopher A. "Peter Huttenlocher (1931–2013)." Nature. (2013): 172.
[82] Han, Song, Jeff Pool, John Tran, and William Dally. "Learning both weights and connections for efficient neural network." In Advances in neural information processing systems, pp. 1135-1143. 2015.
[83] Pan, Wei, Hao Dong, and Yike Guo. "Dropneuron: Simplifying the structure of deep neural networks." arXiv preprint arXiv:1606.07326 (2016).
[84] Gong, Yunchao, Liu Liu, Ming Yang, and Lubomir Bourdev. "Compressing deep convolutional networks using vector quantization." arXiv preprint arXiv:1412.6115 (2014).
[85] Bartol, Thomas M., Cailey Bromer, Justin P. Kinney, Michael A. Chirillo, Jennifer N. Bourne, Kristen M. Harris, and Terrence J. Sejnowski. "Hippocampal spine head sizes are highly precise." bioRxiv (2015): 016329.
[86] Linden, David J., ed. Think Tank: Forty Neuroscientists Explore the Biological Roots of Human Experience. Yale University Press, 2018.
[87] Lin, Zhouhan, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. "Neural networks with few multiplications." arXiv preprint arXiv:1510.03009 (2015).
[88] Hubara, Itay, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. "Binarized neural networks." In Advances in neural information processing systems, pp. 4107-4115. 2016.
[89] Muller, Lorenz K., and Giacomo Indiveri. "Rounding methods for neural networks with low resolution synaptic weights." arXiv preprint arXiv:1504.05767 (2015).
[90] Rastegari, Mohammad, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. "Xnor-net: Imagenet classification using binary convolutional neural networks." In European conference on computer vision, pp. 525-542. Springer, Cham, 2016.
References
[91] Li, Fengfu, Bo Zhang, and Bin Liu. "Ternary weight networks." arXiv preprint arXiv:1605.04711 (2016).
[92] Zhu, Chenzhuo, Song Han, Huizi Mao, and William J. Dally. "Trained ternary quantization." arXiv preprint arXiv:1612.01064 (2016).
[93] Guo, Yiwen, Anbang Yao, Hao Zhao, and Yurong Chen. "Network sketching: Exploiting binary structure in deep cnns." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5955-5963. 2017.
[94] Bengio, Yoshua, Nicholas Léonard, and Aaron Courville. "Estimating or propagating gradients through stochastic neurons for conditional computation." arXiv preprint arXiv:1308.3432 (2013).
[95] Freitag, Markus, Yaser Al-Onaizan, and Baskaran Sankaran. "Ensemble distillation for neural machine translation." arXiv preprint arXiv:1702.01802 (2017).
[96] Mishra, Asit, and Debbie Marr. "Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy." arXiv preprint arXiv:1711.05852 (2017).
[97] Mirzadeh, Seyed-Iman, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. "Improved Knowledge Distillation via Teacher Assistant." arXiv preprint arXiv:1902.03393 (2019).
[98] Heo, Byeongho, Minsik Lee, Sangdoo Yun, and Jin Young Choi. "Knowledge distillation with adversarial samples supporting decision boundary." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3771-3778. 2019.
[99] Xu, Zheng, Yen-Chang Hsu, and Jiawei Huang. "Training shallow and thin networks for acceleration via knowledge distillation with conditional adversarial networks." arXiv preprint arXiv:1709.00513 (2017).
[100] Wang, Xiaojie, Rui Zhang, Yu Sun, and Jianzhong Qi. "Kdgan: Knowledge distillation with generative adversarial networks." In Advances in Neural Information Processing Systems, pp. 775-786. 2018.
[101] Wang, Yunhe, Chang Xu, Chao Xu, and Dacheng Tao. "Adversarial learning of portable student networks." In Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
[102] Yim, Junho, Donggyu Joo, Jihoon Bae, and Junmo Kim. "A gift from knowledge distillation: Fast optimization, network minimization and transfer learning." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.
4133-4141. 2017.
[103] Czarnecki, Wojciech M., Simon Osindero, Max Jaderberg, Grzegorz Swirszcz, and Razvan Pascanu. "Sobolev training for neural networks." In Advances in Neural Information Processing Systems, pp. 4278-4287. 2017.
[104] Tarvainen, Antti, and Harri Valpola. "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results." In Advances in neural information processing systems, pp. 1195-1204. 2017.
[105] Sau, Bharat Bhusan, and Vineeth N. Balasubramanian. "Deep model compression: Distilling knowledge from noisy teachers." arXiv preprint arXiv:1610.09650 (2016).
[106] You, Shan, Chang Xu, Chao Xu, and Dacheng Tao. "Learning from multiple teacher networks." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1285-1294. 2017.
[107] Dai, Zihang, Guokun Lai, Yiming Yang, and Quoc V. Le. "Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing." arXiv preprint arXiv:2006.03236 (2020).
[108] Fan, Angela, Edouard Grave, and Armand Joulin. "Reducing transformer depth on demand with structured dropout." arXiv preprint arXiv:1909.11556 (2019).
References
[109] Voita, Elena, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. "Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned." arXiv preprint arXiv:1905.09418 (2019).
[110] Jiao, Xiaoqi, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. "Tinybert: Distilling bert for natural language understanding." arXiv preprint arXiv:1909.10351 (2019).
[111] Wang, Wenhui, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. "Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers." arXiv preprint arXiv:2002.10957 (2020).
[112] Patrick McClure and Nikolaus Kriegeskorte. 2016. Representational Distance Learning for Deep Neural Networks. Frontiers in Computational Neuroscience 10 (2016).
[113] Denil, Misha, Shakibi, Babak, Dinh, Laurent, de Freitas, Nando, et al. Predicting parameters in deep learning. In NIPS, 2013.
[114] Dai, Zihang, Guokun Lai, Yiming Yang, and Quoc V. Le. "Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing." arXiv preprint arXiv:2006.03236 (2020).
[115] Chen, Welin, David Grangier, and Michael Auli. "Strategies for training large vocabulary neural language models." arXiv preprint arXiv:1512.04906 (2015).
[116] Kim, Yoon, Yacine Jernite, David Sontag, and Alexander M. Rush. "Character-aware neural language models." In Thirtieth AAAI conference on artificial intelligence. 2016.
[117] Sak, Haşim, Andrew Senior, and Françoise Beaufays. "Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition." arXiv preprint arXiv:1402.1128 (2014).
[118] Prasanna, Sai, Anna Rogers, and Anna Rumshisky. "When BERT Plays the Lottery, All Tickets Are Winning." arXiv preprint arXiv:2005.00561 (2020).
[119] Ling, Wang, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W. Black, and Isabel Trancoso. "Finding function in form: Compositional character models for open vocabulary word representation." arXiv preprint
arXiv:1508.02096 (2015).
[120] Jozefowicz, Rafal, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. "Exploring the limits of language modeling." arXiv preprint arXiv:1602.02410 (2016).
[121] Shihao Ji, SVN Vishwanathan, Nadathur Satish, Michael J Anderson, and Pradeep Dubey. Blackout: Speeding up recurrent neural network language models with very large vocabularies. arXiv preprint arXiv:1511.06909, 2015.
[122] Kitaev, Nikita, Łukasz Kaiser, and Anselm Levskaya. "Reformer: The efficient transformer." arXiv preprint arXiv:2001.04451 (2020).
[123] Wang, Sinong, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. "Linformer: Self-Attention with Linear Complexity." arXiv preprint arXiv:2006.04768 (2020).
[124] Tay, Yi, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. "Sparse Sinkhorn Attention." arXiv preprint arXiv:2002.11296 (2020).
[125] Chirkova, Nadezhda, Ekaterina Lobacheva, and Dmitry Vetrov. "Bayesian compression for natural language processing." arXiv preprint arXiv:1810.10927 (2018).
[126] Shen, Zhuoran, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. "Efficient Attention: Attention with Linear Complexities." arXiv preprint arXiv:1812.01243 (2018).
References
[127] Katharopoulos, Angelos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." arXiv preprint arXiv:2006.16236
(2020).
[128] Ganesh, Prakhar, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Deming Chen, Marianne Winslett, Hassan Sajjad, and Preslav Nakov. "Compressing large-scale transformer-based models: A case
study on bert." arXiv preprint arXiv:2002.11985 (2020).
[129] Mukherjee, Subhabrata, and Ahmed Hassan Awadallah. "XtremeDistil: Multi-stage Distillation for Massive Multilingual Models." In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pp. 2221-2234. 2020.
[130] Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language models are few-shot learners." arXiv preprint
arXiv:2005.14165 (2020).
[131] Lepikhin, Dmitry, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. "Gshard: Scaling giant models with conditional
computation and automatic sharding." arXiv preprint arXiv:2006.16668 (2020).
