Evaluation Plan

This page is for all things evaluation.

Everything here is a work in progress or an idea, so please contribute!

Evaluation Models

We could re-run these models on the same tasks we use for our own model. These are all the models compared in the Galactica paper.

Where does chemistry end and biology/physics start?

What data are we training on vs saving for testing?
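One possible way to keep the split stable while the dataset is still growing is to assign each example by hashing an identifier, so an example never moves between train and test. This is only a sketch: the function name `assign_split`, the example IDs, and the 10% test fraction are assumptions for illustration, not decisions we have made.

```python
import hashlib


def assign_split(example_id: str, test_fraction: float = 0.1) -> str:
    """Deterministically assign an example to 'train' or 'test' from its ID.

    Hypothetical helper: the ID scheme and the default fraction are placeholders.
    """
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"


# Made-up IDs, just to show the call.
splits = {eid: assign_split(eid) for eid in ["paper-001", "paper-002", "paper-003"]}
```

Because the assignment depends only on the ID, re-running it after adding new data leaves earlier assignments unchanged, which helps avoid test examples leaking into training.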

General discussion about intended use cases for the model.

I can start out with a light version of this pipeline (smallest model only, plus the simplest task).
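As a rough idea of what the light version could look like, here is a minimal sketch that scores one model on one task using exact-match accuracy. Everything in it is an assumption for illustration: `dummy_model` stands in for the smallest real model, `toy_task` is a made-up two-question task, and exact match is just one simple metric we might start with.

```python
from typing import Callable, Iterable, Tuple


def exact_match_accuracy(
    model: Callable[[str], str],
    examples: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of prompts whose answer matches the reference (case-insensitive)."""
    total = 0
    correct = 0
    for prompt, expected in examples:
        total += 1
        if model(prompt).strip().lower() == expected.strip().lower():
            correct += 1
    # Return 0.0 for an empty task instead of dividing by zero.
    return correct / total if total else 0.0


def dummy_model(prompt: str) -> str:
    """Placeholder for the smallest model; always gives the same answer."""
    return "H2O"


# Hypothetical toy task: (prompt, reference answer) pairs.
toy_task = [
    ("What is the chemical formula of water?", "H2O"),
    ("What is the chemical formula of table salt?", "NaCl"),
]

score = exact_match_accuracy(dummy_model, toy_task)
print(f"exact match: {score:.2f}")
```

Swapping `dummy_model` for a function that calls a real model, and `toy_task` for a real benchmark, would leave the scoring loop unchanged, so the same skeleton could grow into the full pipeline.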