Looking at the best-performing model on this task from the recent grid search: /fsx/proj-chemnlp/experiments/checkpoints/finetuned/1B_smiles_gridsearch_2/1B_fine_tune_6/checkpoint-final

This is not a particularly structured analysis, just general observations made while exploring the model's answers and output.

Average difference in log-likelihood between the two possible answers (True/False) generated by the model
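
As a rough sketch of how this margin could be computed (the prompt wording and the use of Hugging Face `transformers` here are assumptions; only the checkpoint path comes from the grid search above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "/fsx/proj-chemnlp/experiments/checkpoints/finetuned/1B_smiles_gridsearch_2/1B_fine_tune_6/checkpoint-final"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt).eval()

def answer_logprob(prompt: str, answer: str) -> float:
    """Sum of token log-probabilities of `answer` as a continuation of `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # The logit at position i predicts the token at position i + 1, so the
    # answer tokens are scored from the positions just before them.
    log_probs = logits[0, prompt_ids.shape[1] - 1 : -1].log_softmax(dim=-1)
    return log_probs.gather(1, answer_ids[0].unsqueeze(1)).sum().item()

def margin(prompt: str) -> float:
    """Log-likelihood difference between the two possible answers."""
    return answer_logprob(prompt, "True") - answer_logprob(prompt, "False")
```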

The model is less confident on the examples it got wrong
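
A minimal sketch of how this can be checked, assuming the per-example results have been collected into a pandas DataFrame with hypothetical `margin` (the True-vs-False log-likelihood gap) and `correct` (bool) columns:

```python
import pandas as pd

def confidence_by_correctness(results: pd.DataFrame) -> pd.Series:
    """Mean absolute log-likelihood margin, split by prediction correctness.

    A smaller mean for `correct == False` is what "less confident on the
    examples it got wrong" looks like in this table.
    """
    return results["margin"].abs().groupby(results["correct"]).mean()
```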

There are only 20 wrong answers in the whole test set of 2,000 (i.e. 99% accuracy)

The model was not overly biased toward one answer
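
One hypothetical way to quantify this, by counting how often each answer was predicted (the names here are illustrative, not the actual evaluation code):

```python
from collections import Counter

def answer_balance(predictions: list[str]) -> Counter:
    """Count how often the model answered "True" vs "False"; a roughly
    even split (matching the label distribution) indicates no strong
    bias toward either answer."""
    return Counter(predictions)
```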

Both valid and invalid SMILES were misclassified by the model, so the errors go in both directions
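
A sketch of the corresponding breakdown as a confusion matrix, again assuming a hypothetical results table, now with `label` and `prediction` columns (each "True" or "False"):

```python
import pandas as pd

def error_breakdown(results: pd.DataFrame) -> pd.DataFrame:
    """2x2 confusion matrix of labels vs predictions; non-zero counts in
    both off-diagonal cells mean both valid and invalid SMILES get
    misclassified."""
    return pd.crosstab(results["label"], results["prediction"])
```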

Average length of the SMILES that were misclassified

Average length of the SMILES that were correctly classified
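
A sketch of this length comparison, assuming hypothetical `smiles` (str) and `correct` (bool) columns in the same results table:

```python
import pandas as pd

def mean_smiles_length(results: pd.DataFrame) -> pd.Series:
    """Mean SMILES string length (in characters), split by whether the
    model classified the example correctly."""
    return results["smiles"].str.len().groupby(results["correct"]).mean()
```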