Baseline results for Pythia-1B compared with Pythia-1B finetuned on the Hendrycks STEM dataset and then evaluated on the same task. This serves as a sanity check that our training pipeline works correctly.
— = Random performance