Overview

<aside> 📊 Highlights

All evaluation below is zero-shot.

</aside>

Context

This model is being finetuned from a pretrained Pythia-1b checkpoint on 250B tokens (≈70k optimiser steps) of EuroPMC biomedical / scientific text. Training suffered loss spikes at roughly steps 10,500, 14,000, 20,000, and 21,000. The loss curve is shown below; each bar in the subsequent evaluation charts corresponds to a checkpoint saved at 5k-step intervals. See more details here.

Screenshot 2023-08-01 at 11.59.21.png (training loss curve)
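For intuition on the scale, the quoted 250B-token / 70k-step budget implies roughly 3.6M tokens per optimiser step. Below is a minimal sketch of that arithmetic plus loading the public Pythia-1b checkpoint with HuggingFace transformers; the sequence length matches Pythia's pretraining context, but the derived batch geometry is an illustrative assumption, not the actual run config.

```python
# Back-of-the-envelope check of the 250B tokens => ~70k steps figure,
# plus loading the public Pythia-1b checkpoint for continued pretraining.
# Assumes HuggingFace `transformers`; batch geometry is illustrative only.
from transformers import GPTNeoXForCausalLM, AutoTokenizer

TOTAL_TOKENS = 250e9
TOTAL_STEPS = 70_000
SEQ_LEN = 2048                                   # Pythia's pretraining context length

tokens_per_step = TOTAL_TOKENS / TOTAL_STEPS     # ~3.57M tokens per optimiser step
global_batch = round(tokens_per_step / SEQ_LEN)  # ~1744 sequences per step (assumed)
print(f"{tokens_per_step:.2e} tokens/step -> global batch ~{global_batch}")

# Start from the pretrained checkpoint and continue training on EuroPMC text.
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-1b")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b")
```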

Results

A float in each chart's legend denotes the number of optimiser steps for which that checkpoint had been finetuned.
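For reference, a loop like the following could produce the per-checkpoint zero-shot scores plotted below. This is a sketch assuming EleutherAI's lm-evaluation-harness (the v0.3-era `simple_evaluate` API); the checkpoint directory layout and the task subset are hypothetical, not the actual eval config.

```python
# Sketch: run every saved checkpoint through lm-evaluation-harness, zero-shot.
# Assumes checkpoints saved every 5k steps under checkpoints/step_<N> (hypothetical).
from lm_eval import evaluator

TASKS = ["lambada_openai", "piqa", "sciq"]       # illustrative task subset

scores = {}
for step in range(5_000, 75_000, 5_000):         # 5k, 10k, ..., 70k
    results = evaluator.simple_evaluate(
        model="hf-causal",
        model_args=f"pretrained=checkpoints/step_{step}",
        tasks=TASKS,
        num_fewshot=0,                           # all evaluation is zero-shot
    )
    scores[step] = {t: results["results"][t]["acc"] for t in TASKS}
```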

Good

Screenshot 2023-08-01 at 12.24.59.png (benchmarks that improved over finetuning)

Bad

Screenshot 2023-08-01 at 12.27.12.png (Lambada performance by checkpoint)

Lambada performance dropped throughout training, and the model falls to almost 0% accuracy at the checkpoint coinciding with a loss spike (red bar at 20k optimiser steps).
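One way to double-check the near-0% reading at the 20k checkpoint is a direct last-word accuracy probe on Lambada. The sketch below assumes HuggingFace `datasets`/`transformers`; the checkpoint path is hypothetical, and greedy single-token matching only approximates the harness's scoring.

```python
# Quick sanity check: greedy last-word accuracy for one checkpoint on Lambada.
# Approximation only; multi-token target words will be scored incorrectly.
import torch
from datasets import load_dataset
from transformers import GPTNeoXForCausalLM, AutoTokenizer

model = GPTNeoXForCausalLM.from_pretrained("checkpoints/step_20000").eval()
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b")

correct = total = 0
for ex in load_dataset("EleutherAI/lambada_openai", split="test").select(range(200)):
    context, target = ex["text"].rsplit(" ", 1)          # predict the final word
    ids = tokenizer(context, return_tensors="pt").input_ids
    with torch.no_grad():
        next_id = model(ids).logits[0, -1].argmax()      # greedy next token
    correct += tokenizer.decode(next_id).strip() == target
    total += 1
print(f"last-token accuracy on 200 examples: {correct / total:.3f}")
```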

Indifferent

Most other benchmarks showed either slight drops in performance or performance roughly equivalent to the base checkpoint.

Screenshot 2023-08-01 at 13.28.36.png

Screenshot 2023-08-01 at 13.29.10.png

Screenshot 2023-08-01 at 13.28.00.png

Screenshot 2023-08-01 at 13.28.47.png

Screenshot 2023-08-01 at 13.29.25.png

Screenshot 2023-08-01 at 13.29.39.png

Screenshot 2023-08-01 at 13.29.53.png