Baseline results for Pythia-1B compared with Pythia-1B fine-tuned on the Hendrycks STEM dataset and then evaluated on the same task. This is a sanity check that fine-tuning is actually working.

— = Random performance
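
For reference, the random-performance level on a multiple-choice task is 1 / (number of answer choices). The sketch below assumes four answer options per question for the Hendrycks subjects, and the helper names and question counts are hypothetical, not taken from the evaluation code used here. It also gives a rough binomial check of whether an accuracy is clearly above chance.

```python
import math


def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing on a multiple-choice task."""
    if num_choices < 1:
        raise ValueError("num_choices must be at least 1")
    return 1.0 / num_choices


def chance_std_error(num_choices: int, num_questions: int) -> float:
    """Binomial standard error of the random-guessing accuracy over num_questions."""
    p = chance_accuracy(num_choices)
    return math.sqrt(p * (1.0 - p) / num_questions)


def beats_chance(accuracy: float, num_choices: int, num_questions: int, z: float = 2.0) -> bool:
    """True if accuracy exceeds chance by more than z standard errors."""
    threshold = chance_accuracy(num_choices) + z * chance_std_error(num_choices, num_questions)
    return accuracy > threshold


# Assumption: 4 answer choices per question; 100 questions is an illustrative count.
print(chance_accuracy(4))
print(beats_chance(0.40, num_choices=4, num_questions=100))
```

On small subject splits the standard error is large, so a fine-tuned model a few points above the random line may not be distinguishable from chance.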

STEM: hendrycks-[subject]

Screenshot 2023-04-28 at 13.23.10.png

Screenshot 2023-04-28 at 13.23.26.png

Screenshot 2023-04-28 at 13.23.41.png

Screenshot 2023-04-28 at 13.23.55.png

Screenshot 2023-04-28 at 13.24.17.png

Screenshot 2023-04-28 at 13.24.05.png

Screenshot 2023-04-28 at 13.24.37.png

Screenshot 2023-04-28 at 13.24.29.png

Screenshot 2023-04-28 at 13.22.44.png

LAMBADA

Screenshot 2023-04-28 at 13.29.22.png

Screenshot 2023-04-28 at 13.30.30.png