<aside>
📦 What is EleutherAI’s gpt-neox?
This repository records EleutherAI's library for training large-scale language models on GPUs. Our current framework is based on NVIDIA's Megatron Language Model and has been augmented with techniques from DeepSpeed as well as some novel optimisations. We aim to make this repo a centralised and accessible place to gather techniques for training large-scale autoregressive language models, and accelerate research into large-scale training.
That being said, we are currently migrating to Hugging Face, as it is easier to get up and running and provides a better user experience w.r.t. dependency management.
</aside>
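As an illustration of the Hugging Face route, a checkpoint such as EleutherAI/pythia-160m can be loaded and run in a few lines with the transformers library. This is a minimal sketch for orientation only, not part of our training pipeline, and the prompt is purely illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pull a pretrained checkpoint and its tokeniser from the Hugging Face Hub.
model_name = "EleutherAI/pythia-160m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate a short continuation as a quick smoke test.
inputs = tokenizer("The melting point of benzene is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```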
<aside> 📋 Most of our documentation is stored alongside the code in the repository
https://github.com/OpenBioML/chemnlp
</aside>
See: Tokenizer review
Currently we are using the default GPT-NeoX tokeniser, which is saved with each model checkpoint, e.g. EleutherAI/pythia-160m. They state in ‣ that they use a traditional GPT-2 BPE-based tokeniser with the same total vocabulary size of 50257 but include three core changes:

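Independent of those tokeniser changes, the tokeniser shipped with each checkpoint can be loaded and inspected directly from the Hugging Face Hub. The sketch below is illustrative only (the example string is arbitrary) and uses the standard transformers AutoTokenizer API:

```python
from transformers import AutoTokenizer

# Load the tokeniser that ships with the Pythia checkpoint.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")

# Inspect the size of the BPE vocabulary discussed above.
print(tokenizer.vocab_size)

# See how a chemistry-flavoured string is split into tokens.
print(tokenizer.tokenize("The SMILES string for caffeine is CN1C=NC2=C1C(=O)N(C(=O)N2C)C"))
```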
We have tokenisation scripts at experiments/data/prepare_<framework>_<dataset>.py for tokenising a dataset with either 1) GPT-NeoX or 2) Hugging Face. The relevant script must be run for each new model, as samples are prepared according to that model's context length (see the chunking sketch after the example statistics below).
You can call the Hugging Face script as follows:
```bash
python experiments/data/prepare_hf_chemrxiv.py <hf-model> <context-length> # args
python experiments/data/prepare_hf_chemrxiv.py EleutherAI/pythia-160m 2048 # example
```
For the ChemRxiv dataset this produces the following summary statistics:

```
{
    "total_raw_samples": 10934,          # research articles
    "average_words_per_sample": 6812.0,
    "max_words_per_sample": 126439,
    "min_words_per_sample": 4,
    "total_tokenised_samples": 205281,   # after using article chunking
    "max_context_length": 2048,          # specified for all Pythia models
    "total_tokens_in_billions": 0.1577   # approx. (articles x avg. words x 2.5)
}
# where "words" are counted as space-separated strings
```
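The "article chunking" above refers to packing the tokenised articles into fixed-length samples that match the model's context length. The sketch below illustrates the general idea only; the function name and the exact handling of document boundaries, padding and special tokens are assumptions, not the actual prepare_hf_chemrxiv.py implementation.

```python
from itertools import chain
from transformers import AutoTokenizer

def chunk_articles(articles, model_name="EleutherAI/pythia-160m", context_length=2048):
    """Tokenise raw articles and pack the token stream into context-length samples.

    Illustrative only: the real chemnlp script may differ, e.g. by chunking each
    article separately or keeping trailing partial chunks.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Tokenise every article and concatenate into one long token stream.
    token_stream = list(chain.from_iterable(tokenizer(text)["input_ids"] for text in articles))
    # Slice the stream into non-overlapping context-length chunks,
    # dropping the final partial chunk for simplicity.
    n_chunks = len(token_stream) // context_length
    return [token_stream[i * context_length : (i + 1) * context_length] for i in range(n_chunks)]
```

Packing samples to a fixed context length like this avoids wasting compute on padding tokens during training.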