Overview

Repeat of the mid-training evaluation for 1B EuroPMC, using checkpoints from the 3rd and 4th epochs.

<aside> 💡 A brief summary of the highlights:

  1. We outperform the pretrained model (and some much larger base models) on benchmarks requiring college-level biology and chemistry material.
  2. Performance is largely unchanged across the other benchmarks, and we are still missing some figures for general NLP evaluation (though we are aware of additional tasks to run).

</aside>
Good

We don’t lose our gains on college biology: performance holds ~10% above the pretrained model, comparable to much larger few-shot models (BLOOM and OPT, 175B+).

Test set ~ 150 questions

We also start to show reasonable boosts on college chemistry towards the end of finetuning, though performance changes are quite sporadic during the first two epochs.

Test set ~ 100 questions

We also see modest improvements of ~1-2% on HeadQA. Given that this task consists of over 2,750 questions, a gap of even a few percentage points is relatively significant.

Test set ~ 2,750 questions
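
As a rough sanity check (not part of the original runs), a binomial standard-error estimate shows why the test-set sizes above matter: over ~2,750 questions the sampling noise on accuracy is under 1%, so a 1-2% gap can be meaningful, whereas the ~100-150-question sets need gaps of several points before they are clearly above noise. The 40% accuracy used below is an assumed placeholder, not a measured figure.

```python
import math

def accuracy_se(p: float, n: int) -> float:
    """Standard error of an accuracy estimate from n independent questions:
    sqrt(p * (1 - p) / n) for a binomial proportion."""
    return math.sqrt(p * (1 - p) / n)

# Assumed placeholder accuracy of 40% (hypothetical, not from our results).
p = 0.40

print(f"HeadQA (n=2750):           SE = {accuracy_se(p, 2750):.3f}")  # ~0.009
print(f"College biology (n=150):   SE = {accuracy_se(p, 150):.3f}")   # ~0.040
print(f"College chemistry (n=100): SE = {accuracy_se(p, 100):.3f}")   # ~0.049
```

By this estimate, a gap needs to be roughly two standard errors before it is clearly above noise, so ~2% on HeadQA is comparable in reliability to ~8-10% on the college biology/chemistry sets.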

Worse

(3 benchmark screenshots from 2023-08-08; figures not recovered.)

Indifferent

(3 benchmark screenshots from 2023-08-08; figures not recovered.)