<aside>
📦 What is EleutherAI’s gpt-neox?
This repository records EleutherAI's library for training large-scale language models on GPUs. Our current framework is based on NVIDIA's Megatron Language Model and has been augmented with techniques from DeepSpeed as well as some novel optimisations. We aim to make this repo a centralised and accessible place to gather techniques for training large-scale autoregressive language models, and accelerate research into large-scale training.
That being said, we are currently migrating to Hugging Face, as it is easier to get up and running and provides a better user experience w.r.t. dependency management.
</aside>
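As an illustration of the Hugging Face route, a checkpoint such as EleutherAI/pythia-160m can be loaded and run in a few lines with the transformers library. This is a minimal sketch for orientation only, not part of our training pipeline, and the prompt is purely illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pull a pretrained checkpoint and its tokeniser from the Hugging Face Hub.
model_name = "EleutherAI/pythia-160m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate a short continuation as a quick smoke test.
inputs = tokenizer("The melting point of benzene is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```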
<aside> 📋 Most of our documentation is stored alongside the code in the repository
https://github.com/OpenBioML/chemnlp
</aside>
See: Tokenizer review
Currently we are using the default GPT-NeoX tokeniser, which is saved with each model checkpoint, e.g. EleutherAI/pythia-160m. They state in ‣ that they use a traditional GPT-2 BPE-based tokeniser with the same total vocabulary size of 50257 but include three core changes:

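Independent of those tokeniser changes, the tokeniser shipped with each checkpoint can be loaded and inspected directly from the Hugging Face Hub. The sketch below is illustrative only (the example string is arbitrary) and uses the standard transformers AutoTokenizer API:

```python
from transformers import AutoTokenizer

# Load the tokeniser that ships with the Pythia checkpoint.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")

# Inspect the size of the BPE vocabulary discussed above.
print(tokenizer.vocab_size)

# See how a chemistry-flavoured string is split into tokens.
print(tokenizer.tokenize("The SMILES string for caffeine is CN1C=NC2=C1C(=O)N(C(=O)N2C)C"))
```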
We have tokenisation scripts at experiments/data/prepare_<framework>_<dataset>.py for tokenising a dataset with either 1) GPT-NeoX or 2) Hugging Face. The relevant script must be run for each new model, as samples are prepared according to that model's context length (see the chunking sketch after the example statistics below).
You can call the Hugging Face script as follows:
```bash
python experiments/data/prepare_hf_chemrxiv.py <hf-model> <context-length> # args
python experiments/data/prepare_hf_chemrxiv.py EleutherAI/pythia-160m 2048 # example
```
For the ChemRxiv dataset this produces the following summary statistics:

```
{
    "total_raw_samples": 10934,          # research articles
    "average_words_per_sample": 6812.0,
    "max_words_per_sample": 126439,
    "min_words_per_sample": 4,
    "total_tokenised_samples": 205281,   # after using article chunking
    "max_context_length": 2048,          # specified for all Pythia models
    "total_tokens_in_billions": 0.1577   # approx. (articles x avg. words x 2.5)
}
# where "words" are counted as space-separated strings
```
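The "article chunking" above refers to packing the tokenised articles into fixed-length samples that match the model's context length. The sketch below illustrates the general idea only; the function name and the exact handling of document boundaries, padding and special tokens are assumptions, not the actual prepare_hf_chemrxiv.py implementation.

```python
from itertools import chain
from transformers import AutoTokenizer

def chunk_articles(articles, model_name="EleutherAI/pythia-160m", context_length=2048):
    """Tokenise raw articles and pack the token stream into context-length samples.

    Illustrative only: the real chemnlp script may differ, e.g. by chunking each
    article separately or keeping trailing partial chunks.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Tokenise every article and concatenate into one long token stream.
    token_stream = list(chain.from_iterable(tokenizer(text)["input_ids"] for text in articles))
    # Slice the stream into non-overlapping context-length chunks,
    # dropping the final partial chunk for simplicity.
    n_chunks = len(token_stream) // context_length
    return [token_stream[i * context_length : (i + 1) * context_length] for i in range(n_chunks)]
```

Packing samples to a fixed context length like this avoids wasting compute on padding tokens during training.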