Information about the complete smile task as of 15.05.2023

This is a free-form text generation task. The prompt is: "Complete the following so it is valid molecule: {half of a smile}"
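A minimal sketch of how such a prompt could be built. The note does not say how "half of a smile" was computed, so cutting at half the character count is an assumption, and `make_prompt` is a hypothetical helper name.

```python
PROMPT_TEMPLATE = "Complete the following so it is valid molecule: {prefix}"


def make_prompt(smile: str) -> str:
    """Build the completion prompt from the first half of a SMILES string."""
    # Assumption: "half" means the first half by character count,
    # keeping at least one character.
    cut = max(1, len(smile) // 2)
    return PROMPT_TEMPLATE.format(prefix=smile[:cut])


print(make_prompt("CC(=O)O"))
```

Cutting by character count can split a multi-character token such as `Cl` or a bracket atom, so a tokenizer-aware cut may be preferable in practice.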

We can get ~80% accuracy using a 1B-parameter model trained for 1B tokens on a mix of 50% wiki and 50% smiles.

Training data was in the format "This is a valid molecule {smile}\n\n" and "This is not a valid molecule {part of a smile}\n\n".
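The training format above could be generated roughly as follows. This is a sketch under assumptions: each example ends with two newline characters, the "part of a smile" is a random prefix, and `make_training_examples` is a hypothetical helper. The note does not describe how the partial strings were actually produced.

```python
import random
from typing import List

VALID_TEMPLATE = "This is a valid molecule {smile}\n\n"
INVALID_TEMPLATE = "This is not a valid molecule {smile}\n\n"


def make_training_examples(smile: str, rng: random.Random) -> List[str]:
    """Return one positive and one negative training string for a SMILES."""
    if len(smile) < 2:
        raise ValueError("need at least two characters to take a proper prefix")
    # Assumption: the negative example is a random proper prefix of the SMILES.
    cut = rng.randint(1, len(smile) - 1)
    return [
        VALID_TEMPLATE.format(smile=smile),
        INVALID_TEMPLATE.format(smile=smile[:cut]),
    ]


examples = make_training_examples("CC(=O)O", random.Random(0))
print(examples)
```

A prefix can itself be a valid molecule (for example `CC` from `CCO`), so a real pipeline would probably need to check each prefix with a SMILES parser before labelling it as not valid.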

Of the correct smiles generated, 15% were memorised from the train set.

The rest were original (not in the smiles training data, though they could exist elsewhere).
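The memorised fraction can be computed along these lines. The sketch assumes exact string matching against the training SMILES. The note does not say how the comparison was done, and `memorised_fraction` is a hypothetical name.

```python
from typing import Iterable, List


def memorised_fraction(generated_valid: List[str], train_smiles: Iterable[str]) -> float:
    """Fraction of valid generated SMILES that appear verbatim in the train set."""
    train = set(train_smiles)
    if not generated_valid:
        return 0.0
    hits = sum(1 for smile in generated_valid if smile in train)
    return hits / len(generated_valid)


generated = ["CCO", "CCN", "CCC", "c1ccccc1"]
train = ["CCO", "CCCC"]
print(memorised_fraction(generated, train))
```

The same molecule can be written as several different SMILES strings, so exact matching may undercount memorisation. Canonicalising both sides with a cheminformatics toolkit before comparing would give a stricter estimate.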

Examples of original molecules

Screenshot 2023-06-15 at 17.47.34.png

Screenshot 2023-06-15 at 17.47.44.png

The task has some problems:

A successful smile is not the end of the text generation; the model continues to generate text after it.

This can be prevented by modifying the training set to use an <EOS> token at the end of each smile. We can then use that as the stopping token for smiles generation.

For example, all the items in the list below were classified as valid smiles because the first part of each output was correct, even though it was followed by meaningless text.

Screenshot 2023-06-15 at 15.07.31.png
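The <EOS> fix described above could be applied at evaluation time roughly like this. It is a sketch that assumes the model emits the literal string `<EOS>` after the completed smile. `extract_smile` is a hypothetical helper, and the validity check itself is left out.

```python
from typing import Optional

EOS = "<EOS>"


def extract_smile(generated: str, eos: str = EOS) -> Optional[str]:
    """Return the text before the first EOS marker, or None if it is missing."""
    head, sep, _ = generated.partition(eos)
    if not sep:
        # No stopping token: treat the output as unfinished rather than
        # scoring only its leading part.
        return None
    return head.strip()


print(extract_smile("CC(=O)O<EOS> and then some unrelated text"))
print(extract_smile("CC(=O)O and then some unrelated text"))
```

Returning `None` when the marker is absent means outputs that run on into meaningless text are no longer counted as valid just because their first part parses.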