Training Dynamics And Tricks
Table of Contents
- How to Calculate the Number of FLOPs in Transformer-Based Models?
- Why Does Cosine Annealing With Warmup Stabilize Training?
- How To Fine-Tune Decoder-Only Models For Sequence Classification Using Last Token Pooling?
- How To Fine-Tune Decoder-Only Models For Sequence Classification With Cross-Attention?
- How To Do Teacher-Student Knowledge Distillation?