We used perplexity to quantitatively measure the model's “ability to speak like a scientist”, and our finetuned model shows improvements over the pretrained model. However, we have yet to demonstrate that finetuning on scientific text improves performance on a generic scientific benchmark. We didn't degrade performance on the scientific/medical HeadQA benchmark, so it's possible we simply don't gain enough new information to improve performance, given that these PubMed tokens were already seen twice by the model prior to finetuning.
| Model | pile_pubmed-abstracts_word_perplexity | pile_pubmed-central_word_perplexity | headqa_en_acc +/- std.err |
| --- | --- | --- | --- |
| EleutherAI/pythia-1b | 22.487 | 23.647 | 0.289 +/- 0.009 |
| Finetuned-45B-PubMed* | 22.121 | 19.56 | 0.2837 +/- 0.009 |
*Finetuned-45B-PubMed was trained starting from EleutherAI/pythia-1b for 2 further epochs over 22B tokens of PubMed Central full-text papers. This is a PubMed slice of the pretraining dataset (the Pile), so it had already been seen by EleutherAI/pythia-1b during pretraining.
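For concreteness, here is a minimal sketch of what this continued pretraining could look like with the Hugging Face Trainer. The dataset file, sequence handling, and hyperparameters below are assumptions for illustration, not the exact configuration used to produce Finetuned-45B-PubMed.

```python
# Hypothetical sketch: continue training EleutherAI/pythia-1b on PubMed Central full text.
# Hyperparameters and the dataset path are placeholders, not the run's actual config.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "EleutherAI/pythia-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder: any dataset with a "text" column of PubMed Central full-text papers.
raw = load_dataset("json", data_files={"train": "pubmed_central.jsonl"})["train"]

def tokenize(batch):
    # Simplification: truncate each paper to the 2048-token context window;
    # a real run would typically pack documents into fixed-length blocks instead.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="pythia-1b-pubmed",
        num_train_epochs=2,              # 2 further epochs, as in the footnote above
        per_device_train_batch_size=8,   # assumed; tune for available memory
        learning_rate=1e-5,              # assumed
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```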
- pile_pubmed-abstracts_word_perplexity → word-level perplexity across a set of PubMed abstracts
- pile_pubmed-central_word_perplexity → word-level perplexity across a set of PubMed full-text papers
- headqa_en → a set of multiple-choice examination questions from the Spanish government exam for healthcare and related practitioners. Topics include pharmacology, nursing, psychology, etc., but the aggregate metric is currently only calculated across all questions.
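As a rough illustration of what the word-perplexity metrics measure, here is a sketch of computing word-level perplexity for a causal LM. It assumes word perplexity is the total negative log-likelihood (in nats) normalized by the whitespace-delimited word count, which is our understanding of how the harness-style word_perplexity metrics are defined; the document list passed in is a placeholder.

```python
# Hypothetical sketch: word-level perplexity of a causal LM over a list of documents.
# Assumes word_perplexity = exp(total NLL in nats / total whitespace-delimited words).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-1b"  # or the finetuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def word_perplexity(docs):
    total_nll, total_words = 0.0, 0
    for text in docs:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
        input_ids = enc["input_ids"]
        with torch.no_grad():
            out = model(input_ids, labels=input_ids)
        # out.loss is the mean NLL per predicted token; rescale to a sum over tokens
        n_predicted = input_ids.shape[1] - 1
        total_nll += out.loss.item() * n_predicted
        total_words += len(text.split())
    return math.exp(total_nll / total_words)

print(word_perplexity(["Some PubMed abstract text ...", "Another abstract ..."]))
```

Normalizing by words rather than tokens makes the number comparable across models with different tokenizers, which is why we report it instead of token-level perplexity.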