Repeat Mid-Training Evaluation for 1B EuroPMC using checkpoints from the 3rd and 4th epochs.
<aside> 💡 A brief summary of some highlights:
We don’t lose our gains on college biology and maintain strong performance, ~10% above the pretrained model and comparable to larger few-shot models (BLOOM & OPT 175B+).

Test set ~ 150 questions
We also start to show reasonable boosts in performance on college chemistry towards the end of finetuning, though performance changes are quite sporadic during the first 2 epochs.

Test set ~ 100 questions
We also start to show some modest improvements of ~1-2% on HeadQA. Given that this task consists of ~2,750 questions, a gap of even a few percentage points is relatively significant.

Test set ~ 2,750 questions
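
To illustrate why the size of the test set matters when reading these gaps, here is a minimal sketch that estimates the sampling noise in an accuracy score for each test set size quoted above. It is not part of the original evaluation: the function name `worst_case_margin` is hypothetical, and it assumes independent questions, a normal approximation to the binomial, and the worst case accuracy of 0.5, so no actual scores are needed.

```python
import math


def worst_case_margin(n_questions: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for accuracy measured on n_questions items.

    Assumption: normal approximation to the binomial with worst-case p = 0.5,
    i.e. margin = z * sqrt(0.25 / n). Real margins are somewhat smaller when
    accuracy is far from 50%.
    """
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    return z * math.sqrt(0.25 / n_questions)


# Approximate test set sizes quoted in the summary above.
test_sets = {"college biology": 150, "college chemistry": 100, "HeadQA": 2750}

for task, n in test_sets.items():
    margin_points = 100 * worst_case_margin(n)
    print(f"{task}: n={n}, margin = +/-{margin_points:.1f} points")
```

Under these assumptions the margin is roughly 8 points for 150 questions, 10 points for 100 questions, and under 2 points for 2,750 questions. This is only a rough guide: it ignores the fact that checkpoints are compared on the same questions, where a paired test would be more appropriate.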





