Baseline results for Pythia-1B compared with Pythia-1B fine-tuned on different data mixes.
Model names encode the mix, e.g. ft_10hend_45pubmed_45wiki: a mix of 10% Hendrycks STEM validation set, 45% PubMed abstracts, and 45% Wikipedia. Each mix was prepared by sampling until the smallest set was exhausted.
— = Random performance
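
The mixing step described above (sampling in fixed proportions until the smallest set is exhausted) could look roughly like the sketch below. This is one possible reading of that procedure, not the script actually used: the function name `build_mix`, the integer-percent interface, the seed, and the toy corpora are all assumptions made for illustration.

```python
import random


def build_mix(sources, percents, seed=0):
    """Sample documents from each source in fixed integer-percent proportions.

    Hypothetical helper: the mix grows until the limiting source cannot
    contribute another full share, then stops.
    """
    if sum(percents.values()) != 100:
        raise ValueError("percents must sum to 100")
    rng = random.Random(seed)
    # Each unit takes `pct` documents from a source; the source that runs
    # out first caps the number of units.
    units = min(len(sources[name]) // pct for name, pct in percents.items())
    mix = []
    for name, pct in percents.items():
        # Sample without replacement so no document appears twice.
        mix.extend(rng.sample(sources[name], pct * units))
    rng.shuffle(mix)
    return mix


# Toy corpora standing in for the real datasets (sizes are made up).
sources = {
    "hend": [f"hend-{i}" for i in range(100)],
    "pubmed": [f"pubmed-{i}" for i in range(10_000)],
    "wiki": [f"wiki-{i}" for i in range(10_000)],
}
mix = build_mix(sources, {"hend": 10, "pubmed": 45, "wiki": 45})
```

With these toy sizes the smallest source (`hend`) is the limit, so it is used in full and the other two are subsampled to keep the 10/45/45 ratio. Integer percentages are used to avoid floating-point rounding when computing per-source counts.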



















