Baseline results for Pythia-1B compared with Pythia-1B finetuned on different mixes of data

e.g. ft_10hend_45pubmed_45wiki: 10% Hendrycks STEM validation set, 45% PubMed abstracts, 45% Wikipedia mix. Each mix is prepared by sampling until the smallest set is exhausted.
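
A rough sketch of how such a mix could be built, for illustration only. The notes do not say how the sampling was implemented, so everything here is an assumption: the function name mix_by_sampling, the toy source lists, and the reading of "until smallest set is exhausted" as "draw a source according to the mix weights, take one unused example from it, and stop as soon as the drawn source has none left".

```python
import random


def mix_by_sampling(sources, weights, seed=0):
    """Hypothetical helper: build a weighted mix without replacement.

    sources: dict mapping a source name to a list of examples.
    weights: dict mapping the same names to mix proportions (e.g. 0.10).
    Sampling stops the first time the drawn source is already exhausted
    (an assumed interpretation of the note above).
    """
    rng = random.Random(seed)
    names = list(sources)
    # Copy and shuffle each pool so popping from the end is a random draw.
    pools = {name: list(sources[name]) for name in names}
    for pool in pools.values():
        rng.shuffle(pool)
    probs = [weights[name] for name in names]

    mixed = []
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        if not pools[name]:
            break  # the drawn source has run out, so the mix is complete
        mixed.append((name, pools[name].pop()))
    return mixed


# Toy stand-ins for the real datasets (placeholders, not the actual data).
toy_sources = {
    "hend": [f"hend_{i}" for i in range(10)],
    "pubmed": [f"pubmed_{i}" for i in range(200)],
    "wiki": [f"wiki_{i}" for i in range(200)],
}
toy_weights = {"hend": 0.10, "pubmed": 0.45, "wiki": 0.45}
toy_mix = mix_by_sampling(toy_sources, toy_weights, seed=0)
```

Under this reading, the source that runs out first is the one with the smallest size relative to its weight, not necessarily the one with the fewest examples, and the final proportions only approximate the target percentages.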

— = Random performance

1 epoch

Screenshot 2023-05-16 at 10.36.44.png

Screenshot 2023-05-16 at 10.35.20.png

Screenshot 2023-05-16 at 10.37.08.png

Screenshot 2023-05-16 at 10.36.01.png

Screenshot 2023-05-16 at 10.35.30.png

Screenshot 2023-05-16 at 10.36.34.png

Screenshot 2023-05-16 at 10.35.47.png

Screenshot 2023-05-16 at 10.36.56.png

Screenshot 2023-05-16 at 10.36.19.png

Screenshot 2023-05-16 at 10.42.34.png

3 epochs

Screenshot 2023-05-15 at 17.49.57.png

Screenshot 2023-05-15 at 17.48.56.png

Screenshot 2023-05-15 at 17.50.14.png

Screenshot 2023-05-15 at 17.49.31.png

Screenshot 2023-05-15 at 17.49.13.png

Screenshot 2023-05-15 at 17.49.49.png

Screenshot 2023-05-15 at 17.49.22.png

Screenshot 2023-05-15 at 17.50.05.png

Screenshot 2023-05-15 at 17.49.39.png

Screenshot 2023-05-15 at 17.52.15.png