We finetuned EleutherAI/Pythia-1b models on datasets of 4B tokens.
The datasets were mixtures, in varying proportions, of Wikipedia, PubMed abstracts, and SMILES strings.
We also did a grid search over the learning rates [1e-4, 1e-5].
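A minimal sketch of how the mixture construction and learning-rate sweep could be wired up with Hugging Face `datasets`/`transformers` is below. Only the base model and the two learning rates come from the runs described above; the dataset identifiers, mixture proportions, and remaining hyperparameters are illustrative placeholders, not our exact configuration.

```python
# Sketch of the data-mixture finetuning sweep. Only the base model and the two
# learning rates are taken from the runs described above; dataset identifiers,
# mixture proportions, and the remaining hyperparameters are placeholders.
from datasets import load_dataset, interleave_datasets
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b")
tokenizer.pad_token = tokenizer.eos_token


def load_mixture(wiki_frac: float):
    """Stream the three corpora and interleave them at the requested proportions.

    Each source is assumed to expose a 'text' column; swap in whichever
    Wikipedia / PubMed-abstract / SMILES corpora you actually use.
    """
    wiki = load_dataset("your-org/wikipedia-text", split="train", streaming=True)
    pubmed = load_dataset("your-org/pubmed-abstracts", split="train", streaming=True)
    smiles = load_dataset("your-org/smiles-strings", split="train", streaming=True)
    rest = (1.0 - wiki_frac) / 2
    return interleave_datasets(
        [wiki, pubmed, smiles],
        probabilities=[wiki_frac, rest, rest],
        seed=0,
    )


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)


for lr in (1e-4, 1e-5):                   # the learning-rate grid
    for wiki_frac in (0.05, 0.25, 0.50):  # example mixture proportions
        train = load_mixture(wiki_frac).map(
            tokenize, batched=True, remove_columns=["text"]
        )
        model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b")
        args = TrainingArguments(
            output_dir=f"pythia1b_lr{lr}_wiki{wiki_frac}",
            learning_rate=lr,
            per_device_train_batch_size=8,
            max_steps=10_000,  # choose batch size / steps to cover the ~4B-token budget
            bf16=True,
            report_to="none",
        )
        Trainer(
            model=model,
            args=args,
            train_dataset=train,
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
        ).train()
```

Interleaving streamed datasets by probability keeps the mixture proportion fixed at the example level without materialising the full 4B-token corpus up front.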
Figure: scan over learning rates and data mixtures. Dashed lines mark random performance; n-shot prompting is fixed at 0 for all runs.
[tldr]
- We can maintain LAMBADA performance with very little Wikipedia data (as low as 5%)
- Higher learning rates degrade LAMBADA performance faster (only a small degradation anyway, since Wikipedia is mixed in)
- But a high learning rate is very good for learning CompleteSmile and PubMedQA (see the evaluation sketch below)
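For the numbers behind these bullets, a minimal way to run the 0-shot evaluations with EleutherAI's lm-evaluation-harness might look like the sketch below. `lambada_openai` and `pubmedqa` ship with the harness; `complete_smile` is assumed here to be a custom task registered separately, and the checkpoint names are illustrative run names rather than the actual artifacts.

```python
# 0-shot evaluation sketch using lm-evaluation-harness (pip install lm-eval, v0.4+).
# lambada_openai and pubmedqa are built-in tasks; "complete_smile" is assumed to
# be a custom task registered with the harness, and the checkpoint paths below
# are illustrative run names.
import json

import lm_eval

for ckpt in ("pythia1b_lr0.0001_wiki0.05", "pythia1b_lr1e-05_wiki0.05"):
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={ckpt},dtype=bfloat16",
        tasks=["lambada_openai", "pubmedqa", "complete_smile"],
        num_fewshot=0,   # matches the 0-shot setting used throughout
        batch_size=16,
    )
    print(ckpt, json.dumps(results["results"], indent=2, default=str))
```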








