This page is for all things evaluation.
Everything here is a work in progress or an idea, so please contribute!
We could re-run these models on the same tasks we evaluate ourselves on. They are all the models compared in the Galactica paper.
Where does chemistry end and biology/physics start?
What data are we training on vs saving for testing?
General discussion about intended use cases for the model.
I can start out with a light version of this pipeline (smallest model only, plus the simplest task).
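
To make the "light version" concrete, here is a minimal sketch of what such a pipeline could look like. Everything in it is an assumption for illustration: the metric (exact match), the names `exact_match_accuracy`, `toy_model` and `toy_task`, and the example questions are all hypothetical. The real pipeline would swap in the smallest model and whichever task we agree is simplest.

```python
from typing import Callable, Iterable, Tuple


def exact_match_accuracy(
    model: Callable[[str], str],
    examples: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of examples where the model output equals the reference.

    Comparison ignores case and surrounding whitespace. Exact match is an
    assumed metric; other tasks may need a different one.
    """
    examples = list(examples)
    if not examples:
        raise ValueError("no examples to evaluate")
    correct = sum(
        model(prompt).strip().lower() == reference.strip().lower()
        for prompt, reference in examples
    )
    return correct / len(examples)


# Hypothetical stand-in for the smallest model: a fixed lookup table.
def toy_model(prompt: str) -> str:
    answers = {"Symbol for sodium?": "Na", "Symbol for iron?": "Fe"}
    return answers.get(prompt, "")


# Hypothetical stand-in for the simplest task: (prompt, reference) pairs.
toy_task = [
    ("Symbol for sodium?", "Na"),
    ("Symbol for iron?", "Fe"),
    ("Symbol for gold?", "Au"),
]

score = exact_match_accuracy(toy_model, toy_task)
print(f"exact-match accuracy: {score:.2f}")
```

Keeping the model as a plain callable means the same loop can later be pointed at larger models or at the baseline models without changing the scoring code.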