We used perplexity to quantitatively measure the model's “ability to speak like a scientist”, and our finetuned model shows improvements over the pretrained model. However, we have yet to demonstrate that finetuning on scientific text improves performance on a generic scientific benchmark. We didn't degrade performance on the scientific/medical HeadQA benchmark, so it's possible we simply don't gain enough new information to improve performance, given that these PubMed tokens were already seen twice by the model prior to finetuning.
| Model | pile_pubmed-abstracts_word_perplexity | pile_pubmed-central_word_perplexity | headqa_en_acc +/- std.err |
| --- | --- | --- | --- |
| EleutherAI/pythia-1b | 22.487 | 23.647 | 0.289 +/- 0.009 |
| Finetuned-45B-PubMed* | 22.121 | 19.56 | 0.2837 +/- 0.009 |
*Finetuned-45B-PubMed was trained starting from EleutherAI/pythia-1b for 2 further epochs over 22B tokens of PubMed Central full-text papers. This is a PubMed slice of the pretraining dataset (the Pile), so it had already been seen by EleutherAI/pythia-1b during pretraining.
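For concreteness, here is a minimal sketch of what this continued pretraining could look like with the Hugging Face Trainer. The dataset file, sequence handling, and hyperparameters below are assumptions for illustration, not the exact configuration used to produce Finetuned-45B-PubMed.

```python
# Hypothetical sketch: continue training EleutherAI/pythia-1b on PubMed Central full text.
# Hyperparameters and the dataset path are placeholders, not the run's actual config.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "EleutherAI/pythia-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder: any dataset with a "text" column of PubMed Central full-text papers.
raw = load_dataset("json", data_files={"train": "pubmed_central.jsonl"})["train"]

def tokenize(batch):
    # Simplification: truncate each paper to the 2048-token context window;
    # a real run would typically pack documents into fixed-length blocks instead.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="pythia-1b-pubmed",
        num_train_epochs=2,              # 2 further epochs, as in the footnote above
        per_device_train_batch_size=8,   # assumed; tune for available memory
        learning_rate=1e-5,              # assumed
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```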
- pile_pubmed-abstracts_word_perplexity → word-level perplexity across a set of PubMed abstracts
- pile_pubmed-central_word_perplexity → word-level perplexity across a set of PubMed full-text papers
- headqa_en → a set of multiple-choice examination questions from the Spanish government exam for healthcare and related practitioners. Topics include pharmacology, nursing, psychology, etc., but the aggregate metric is currently only calculated across all questions.
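As a rough illustration of what the word-perplexity metrics measure, here is a sketch of computing word-level perplexity for a causal LM. It assumes word perplexity is the total negative log-likelihood (in nats) normalized by the whitespace-delimited word count, which is our understanding of how the harness-style word_perplexity metrics are defined; the document list passed in is a placeholder.

```python
# Hypothetical sketch: word-level perplexity of a causal LM over a list of documents.
# Assumes word_perplexity = exp(total NLL in nats / total whitespace-delimited words).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-1b"  # or the finetuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def word_perplexity(docs):
    total_nll, total_words = 0.0, 0
    for text in docs:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
        input_ids = enc["input_ids"]
        with torch.no_grad():
            out = model(input_ids, labels=input_ids)
        # out.loss is the mean NLL per predicted token; rescale to a sum over tokens
        n_predicted = input_ids.shape[1] - 1
        total_nll += out.loss.item() * n_predicted
        total_words += len(text.split())
    return math.exp(total_nll / total_words)

print(word_perplexity(["Some PubMed abstract text ...", "Another abstract ..."]))
```

Normalizing by words rather than tokens makes the number comparable across models with different tokenizers, which is why we report it instead of token-level perplexity.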