Existing tasks in lm-evaluation-harness
Tasks can be found here: lm-evaluation-harness tasks
Only the tasks used to evaluate Galactica are marked `True` in the `Galactica_task` column. Some of the tasks marked `False` were actually used to train Galactica, not to evaluate it. We should gather these together and add them to our training dataset.
To Do List
- [ ] Make a questionnaire about LLcheM use cases and share it with the LLM/chemistry community on Discord and Twitter. Use the results of the questionnaire to design more custom evaluation tasks (below) and to inform training.
- [ ] Proofread the text templates for scientific validity and grammar.
- [ ] Swap to the big refactor of lm-eval when it is complete and update the chemnlp tasks (`periodic_table`, `is_smile`, `complete_smile`) so they are compatible with the new codebase.
- [x] Set up scripts for running popular evaluation benchmarks so we can compare more easily against existing leaderboards (a minimal run sketch follows this list).
- [ ] Parallelise our model evaluation pipeline further; this may already be possible in big-refactor (one task-sharding approach is sketched below).
- [ ] Set up tools for querying the model and comparing its answers against other open-source (OS) chat models (a side-by-side comparison sketch is included below).
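Running popular benchmarks: a minimal sketch, assuming the current (pre-big-refactor) harness where `simple_evaluate` lives in `lm_eval.evaluator`; the checkpoint name is a placeholder to be swapped for the LLcheM checkpoint under evaluation.

```python
# Minimal sketch of a leaderboard-style run with the pre-refactor lm-eval API.
# The pretrained checkpoint and task selection are placeholders.
import json

from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                                # HuggingFace causal-LM wrapper
    model_args="pretrained=EleutherAI/pythia-160m",   # placeholder checkpoint
    tasks=["hellaswag", "arc_easy", "piqa", "winogrande"],
    num_fewshot=0,
    batch_size=8,
    device="cuda:0",
)

# results["results"] maps task name -> metric dict (acc, acc_norm, stderr, ...)
print(json.dumps(results["results"], indent=2))
```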
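Parallelising evaluation: until big-refactor lands, one rough option is to shard the task list and launch one harness process per GPU via the existing CLI (`main.py`). The task shards, GPU count, and model args below are illustrative placeholders.

```python
# Rough sketch: one harness run per GPU, each on a subset of tasks.
import subprocess

TASK_SHARDS = [
    ["hellaswag", "piqa"],
    ["arc_easy", "arc_challenge"],
    ["winogrande", "openbookqa"],
    ["boolq", "sciq"],
]

procs = []
for gpu_id, shard in enumerate(TASK_SHARDS):
    procs.append(
        subprocess.Popen(
            [
                "python", "main.py",
                "--model", "hf-causal",
                "--model_args", "pretrained=EleutherAI/pythia-160m",  # placeholder
                "--tasks", ",".join(shard),
                "--device", f"cuda:{gpu_id}",
                "--output_path", f"results_shard{gpu_id}.json",
            ]
        )
    )

# Wait for all shards to finish before merging the per-shard result files.
for p in procs:
    p.wait()
```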
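Comparing answers with other OS chat models: a minimal sketch using plain HuggingFace `transformers` text generation; both model names are placeholders standing in for LLcheM and a reference open-source chat model.

```python
# Sketch: generate answers to the same prompt from two models and print side by side.
from transformers import pipeline

PROMPT = "What is the SMILES string for caffeine?"

generators = {
    "llchem": pipeline("text-generation", model="EleutherAI/pythia-160m"),  # placeholder
    "os-chat-baseline": pipeline("text-generation", model="gpt2"),          # placeholder
}

for name, generate in generators.items():
    out = generate(PROMPT, max_new_tokens=64, do_sample=False)[0]["generated_text"]
    print(f"=== {name} ===\n{out}\n")
```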
Ideas for specific tasks to add to lm-evaluation-harness
- [ ] Incorporate BigBench (maybe just the 57 tasks that Galactica was evaluated on); this will be easy once we swap to big-refactor.
- [ ] StereoSet and RealToxicityPrompts are common bias/toxicity tasks that are not yet in lm-eval-harness.
- [ ] **Galactica task (not OS):** AminoProbe: a dataset of names, structures and properties of the 20 common amino acids.
- [ ] **Galactica task (not OS):** Chemical Reactions: a dataset of chemical reactions (can the model complete them in a valid way?). This can be made from the USPTO-MIT dataset (see **ChemFormer task (OS):** reaction prediction below); a validity-scoring sketch using RDKit follows this list.
- [ ] **Galactica task (not OS):** Mineral Groups: a dataset of minerals and their mineral group classifications.
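For the Chemical Reactions idea above, scoring "can the model complete them in a valid way" could be done with RDKit. A sketch, assuming the USPTO-MIT `reactants>>products` reaction-SMILES format; `is_valid_completion` and `exact_match` are illustrative helpers, not part of lm-eval-harness.

```python
# Sketch of validity / exact-match scoring for model-completed reaction products.
from rdkit import Chem


def is_valid_completion(product_smiles: str) -> bool:
    """Return True if the model's proposed product parses as a valid molecule."""
    return Chem.MolFromSmiles(product_smiles) is not None


def exact_match(product_smiles: str, reference_smiles: str) -> bool:
    """Compare canonical SMILES so equivalent notations count as a match."""
    pred = Chem.MolFromSmiles(product_smiles)
    ref = Chem.MolFromSmiles(reference_smiles)
    if pred is None or ref is None:
        return False
    return Chem.MolToSmiles(pred) == Chem.MolToSmiles(ref)


# Example: two different notations of the same molecule (aspirin) should match.
print(is_valid_completion("CC(=O)Oc1ccccc1C(=O)O"))                      # expected True
print(exact_match("OC(=O)c1ccccc1OC(C)=O", "CC(=O)Oc1ccccc1C(=O)O"))     # expected True
```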