Overview

Repeat of the mid-training evaluation for 1B EuroPMC, using checkpoints from the 3rd and 4th epochs.

<aside> 💡 A brief summary of the highlights:

  1. We outperform the pretrained model (and some much larger base models) on benchmarks requiring college-level biology and chemistry material.
  2. Performance is largely unchanged across the other benchmarks, and we are still missing some figures for general NLP evaluation (though we are aware of additional tasks to run).

</aside>
Good

We don’t lose our gains on college biology: performance holds ~10% above the pretrained model, comparable to much larger few-shot models (BLOOM and OPT, 175B+).

Test set ~ 150 questions

We also start to show reasonable boosts on college chemistry towards the end of finetuning, though performance changes are quite sporadic during the first two epochs.

Test set ~ 100 questions

We also see modest improvements of ~1-2% on HeadQA. Given that this task consists of over 2,750 questions, a gap of even a few percentage points is relatively significant.

Test set ~ 2,750 questions
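
As a rough sanity check (not part of the original runs), a binomial standard-error estimate shows why the test-set sizes above matter: over ~2,750 questions the sampling noise on accuracy is under 1%, so a 1-2% gap can be meaningful, whereas the ~100-150-question sets need gaps of several points before they are clearly above noise. The 40% accuracy used below is an assumed placeholder, not a measured figure.

```python
import math

def accuracy_se(p: float, n: int) -> float:
    """Standard error of an accuracy estimate from n independent questions:
    sqrt(p * (1 - p) / n) for a binomial proportion."""
    return math.sqrt(p * (1 - p) / n)

# Assumed placeholder accuracy of 40% (hypothetical, not from our results).
p = 0.40

print(f"HeadQA (n=2750):           SE = {accuracy_se(p, 2750):.3f}")  # ~0.009
print(f"College biology (n=150):   SE = {accuracy_se(p, 150):.3f}")   # ~0.040
print(f"College chemistry (n=100): SE = {accuracy_se(p, 100):.3f}")   # ~0.049
```

By this estimate, a gap needs to be roughly two standard errors before it is clearly above noise, so ~2% on HeadQA is comparable in reliability to ~8-10% on the college biology/chemistry sets.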

Worse

(3 benchmark screenshots from 2023-08-08; figures not recovered.)

Indifferent

(3 benchmark screenshots from 2023-08-08; figures not recovered.)