TL;DR

We used perplexity to quantitatively measure the model's "ability to speak like a scientist", and our finetuned model shows improvements over the pretrained model. However, we have not yet demonstrated that finetuning on scientific text improves performance on a general scientific benchmark. We didn't degrade performance on the scientific/medical HeadQA benchmark, so it's possible we just don't have sufficient new information to improve performance, given that these PubMed tokens had already been seen twice by the model prior to finetuning.
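For reference, here is a minimal sketch of how a word-level perplexity like the ones reported below can be computed, assuming the metric is the exponentiated total token negative log-likelihood normalised by the number of whitespace-separated words. This is not the exact evaluation-harness code, and the input text is a placeholder:

```python
# Sketch: word-level perplexity = exp(total token NLL / number of words).
# The text string is a placeholder; swap in a PubMed abstract or full-text excerpt.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def word_perplexity(model_name: str, text: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels=input_ids, the returned loss is the mean cross-entropy
        # per predicted token (all tokens except the first).
        out = model(**enc, labels=enc["input_ids"])
    num_predicted_tokens = enc["input_ids"].shape[1] - 1
    total_nll = out.loss.item() * num_predicted_tokens
    num_words = len(text.split())
    return math.exp(total_nll / num_words)

print(word_perplexity("EleutherAI/pythia-1b", "Example PubMed abstract text ..."))
```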

Overview

Results

| Model | pile_pubmed-abstracts_word_perplexity | pile_pubmed-central_word_perplexity | headqa_en_acc +/- std. err. |
| --- | --- | --- | --- |
| EleutherAI/pythia-1b | 22.487 | 23.647 | 0.289 +/- 0.009 |
| Finetuned-45B-PubMed* | 22.121 | 19.56 | 0.2837 +/- 0.009 |

*Finetuned-45B-PubMed was trained starting from EleutherAI/pythia-1b for 2 further epochs over 22B tokens of PubMed Central full-text papers. These tokens are a PubMed slice of the pretraining dataset (the Pile), so they had already been seen by EleutherAI/pythia-1b during pretraining.
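For context, a rough sketch of what this continued-pretraining setup could look like with Hugging Face's Trainer. The data file, sequence length, batch size, and learning rate are placeholders, since the actual hyperparameters and data pipeline are not listed here:

```python
# Hypothetical sketch of continued causal-LM training of EleutherAI/pythia-1b
# on PubMed Central text (2 epochs, per the note above). Not the actual run.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "EleutherAI/pythia-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Pythia's tokenizer has no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder corpus: swap in the PubMed Central slice of the Pile.
raw = load_dataset("text", data_files={"train": "pubmed_central.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_ds = raw.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM

args = TrainingArguments(
    output_dir="pythia-1b-pubmed",
    num_train_epochs=2,              # 2 epochs over the PubMed Central tokens
    per_device_train_batch_size=4,   # assumed
    learning_rate=1e-5,              # assumed
    bf16=True,
)

Trainer(model=model, args=args, train_dataset=train_ds,
        data_collator=collator).train()
```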

Legend