We fine-tuned EleutherAI/Pythia-1b models on datasets of 1B tokens.

The datasets were different mixtures of PubMed abstracts and SMILES. The SMILES were wrapped in short text templates and mixed with the abstracts in different proportions.

The SMILES were provided to the fully fine-tuned (FFT) models using the following templates:

VALID_SMILE = f"The following is a valid molecule: {smiles}"
INVALID_SMILE = f"The following is not a valid molecule: {invalid_smiles}"
QA_SMILE = f"Question: Is the following a valid molecule: {smiles or invalid_smiles}? Answer: {Yes or No}"
| Model Name | Valid_Smile | Invalid_Smile | QA_Smile | PubMed : SMILES composition* |
|------------|-------------|---------------|----------|------------------------------|
| FFT 1      |             |               |          | 97.5 : 2.5                   |
| FFT 2      |             |               |          | 97.2 : 2.8                   |
| FFT 3      |             |               |          | 98.4 : 1.6                   |
| FFT 4      |             |               |          | 95.6 : 5.4                   |
| FFT 5      |             |               |          | 95.0 : 5.0                   |

*The same molecules were used across the SMILES sets, but because of differences in the text templates, some datasets contained more SMILES tokens than others.
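
To make the composition column concrete: the ratios are token-count percentages. Below is a small illustrative helper, assuming the Pythia tokenizer from Hugging Face; it is not the actual data-preparation code.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b")

def token_count(docs):
    # Total number of tokens across a list of document strings.
    return sum(len(tokenizer(doc).input_ids) for doc in docs)

def composition(pubmed_docs, smiles_docs):
    # Returns the PubMed : SMILES token composition as percentages.
    pubmed = token_count(pubmed_docs)
    smiles = token_count(smiles_docs)
    total = pubmed + smiles
    return round(100 * pubmed / total, 1), round(100 * smiles / total, 1)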

The models were evaluated on the new IsSmile and CompleteSmile tasks:

# Pseudocode for the two tasks. Here model, is_valid, smile and
# start_of_smile are assumed to be provided by the eval harness, and np is numpy.

# CompleteSmile: the model must complete a SMILES prefix into a valid molecule.
EVAL_QUESTION = f"Complete the following so it is valid molecule: {start_of_smile}"
model_output = model.generate(context=EVAL_QUESTION)
# model_output is (EVAL_QUESTION + text generated by model)
generated_text = model_output.split(start_of_smile)[1]
score = 1 if is_valid(start_of_smile + generated_text) else 0

# IsSmile: a two-way multiple choice between "Yes" and "No", scored by log-likelihood.
EVAL_QUESTION = f"Question: Is the following a valid molecule: {smile}? Answer: "
ANSWERS = ["Yes", "No"]
TRUE_ANSWER_INDEX = 0  # here {smile} is a valid molecule, so the answer should be "Yes"
queries = [EVAL_QUESTION + answer for answer in ANSWERS]
# pick whichever answer is most likely given the eval question
model_answer = np.argmax([model.loglikelihood(query) for query in queries])
score = 1 if model_answer == TRUE_ANSWER_INDEX else 0
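
The pseudocode leaves is_valid and model.loglikelihood undefined. The sketch below shows one way they could be implemented, assuming RDKit for SMILES validation and a Hugging Face Pythia checkpoint for scoring; it is an illustration, not the exact evaluation harness used here.

import torch
from rdkit import Chem
from transformers import AutoModelForCausalLM, AutoTokenizer

def is_valid(smiles: str) -> bool:
    # RDKit returns None when a SMILES string cannot be parsed into a molecule.
    return Chem.MolFromSmiles(smiles) is not None

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b")
lm = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b")

def loglikelihood(text: str) -> float:
    # Total log-probability of the string under the model. The "Yes"/"No"
    # queries share an identical prefix, so comparing full-sequence scores
    # gives the same argmax as scoring only the continuations.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(input_ids=ids, labels=ids).loss  # mean NLL per predicted token
    return -loss.item() * (ids.shape[1] - 1)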

Exploring IsSmile results

Key results

We maintained LAMBADA performance and learned two new tasks: IsSmile and CompleteSmile.

It looks like we didn't need Wikipedia to maintain language performance on LAMBADA; scientific text as the base was enough.

The results show that, to learn these two new tasks, we currently have to train for them explicitly.

This explicit-only learning is likely a consequence of using a small model and limited training data.

The PubMed : SMILES composition likely plays a role in the results. For example, FFT 4 is better than FFT 1 at CompleteSmile even though both saw the same Valid_Smile and Invalid_Smile training data. However, FFT 4 also saw the same SMILES framed in the QA_Smile template, effectively giving it a second epoch of training over the SMILES data.

All other metrics were essentially maintained.

— = Random performance