Looking at the best-performing model on this task from the recent grid search: /fsx/proj-chemnlp/experiments/checkpoints/finetuned/1B_smiles_gridsearch_2/1B_fine_tune_6/checkpoint-final
This is not a particularly structured analysis, just general observations made while exploring the model's answers and outputs.
Average difference in log-likelihood between the two possible answers (True/False) generated by the model:
The model is less confident about the answers it got wrong.
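A minimal sketch of how this gap could be computed, assuming a causal LM checkpoint and a True/False answer format; the prompt template and the simple prompt/answer tokenization boundary below are assumptions, not the exact evaluation code. Averaging the gap separately over correct and wrong examples gives the comparison above.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    CKPT = "/fsx/proj-chemnlp/experiments/checkpoints/finetuned/1B_smiles_gridsearch_2/1B_fine_tune_6/checkpoint-final"
    tokenizer = AutoTokenizer.from_pretrained(CKPT)
    model = AutoModelForCausalLM.from_pretrained(CKPT).eval()

    def answer_logprob(prompt: str, answer: str) -> float:
        """Sum of token log-probs of `answer` conditioned on `prompt`."""
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits
        # logits at position i predict token i+1, so shift targets by one
        logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
        targets = full_ids[0, 1:]
        start = prompt_ids.shape[1] - 1  # first answer-token position (assumes the prompt tokenizes identically inside the full string)
        return logprobs[start:].gather(1, targets[start:, None]).sum().item()

    # Hypothetical prompt template for the validity task
    prompt = "Is the following SMILES valid? CC(=O)OC1=CC=CC=C1C(=O)O\nAnswer: "
    gap = abs(answer_logprob(prompt, "True") - answer_logprob(prompt, "False"))
    print(f"log-likelihood gap: {gap:.3f}")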


There are only 20 wrong answers in the whole test set (2,000 examples total).

The model was not overly biased toward one answer:
Both valid and invalid SMILES were misclassified by the model.
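A quick way to check for answer bias, sketched below under the assumption that the per-example results live in a table with `label` and `prediction` columns; the file and column names are hypothetical:

    import pandas as pd

    # Hypothetical results file; `label` and `prediction` hold "True"/"False" strings.
    df = pd.read_csv("test_predictions.csv")

    # 2x2 confusion table: rows are true labels, columns are model answers.
    print(pd.crosstab(df["label"], df["prediction"]))

    # How the 20 errors split across the two classes.
    errors = df[df["label"] != df["prediction"]]
    print(errors["label"].value_counts())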


Average length of the SMILES that were misclassified:

Average length of the SMILES that were correctly classified:
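A sketch of the length comparison, again assuming the same hypothetical predictions table plus a `smiles` column:

    import pandas as pd

    df = pd.read_csv("test_predictions.csv")  # hypothetical file, with a `smiles` column
    df["smiles_len"] = df["smiles"].str.len()  # character length of each SMILES string
    df["correct"] = df["label"] == df["prediction"]

    # Average SMILES length for correctly vs. incorrectly classified examples.
    print(df.groupby("correct")["smiles_len"].mean())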