This page is for all things evaluation.

Everything here is a work in progress or an idea, so please contribute!

Evaluation Models

We could re-run these models on the same tasks that we evaluate our own model on. These are all the models compared in the Galactica paper.

Evaluation Tasks

Discussion Points

Where does chemistry end and biology/physics start?

What data are we training on vs saving for testing?

General discussion about intended use cases for the model.

Setting up evaluation pipeline

I can start with a light version of this pipeline (smallest model and simplest task only):

  1. Add an evaluation module to the code base
  2. Add all the evaluation datasets listed above to the module
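
To make the two steps above concrete, here is a minimal sketch of what a light evaluation module could look like. Everything in it is an assumption for illustration: the names `EvalTask`, `TASKS`, `register_task`, `exact_match_accuracy`, and `run_evaluation` are hypothetical, the model is treated as a plain function from prompt to text, exact match is just one possible metric, and the toy task is placeholder data rather than one of the real evaluation datasets.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class EvalTask:
    """A named evaluation task made of (prompt, expected answer) pairs."""
    name: str
    examples: List[Tuple[str, str]]


# Hypothetical registry: step 2 would register the real datasets here.
TASKS: Dict[str, EvalTask] = {}


def register_task(task: EvalTask) -> None:
    """Add a task to the registry, keyed by its name."""
    TASKS[task.name] = task


def exact_match_accuracy(model: Callable[[str], str], task: EvalTask) -> float:
    """Fraction of examples whose output matches the answer (case-insensitive, stripped)."""
    if not task.examples:
        return 0.0
    correct = sum(
        model(prompt).strip().lower() == answer.strip().lower()
        for prompt, answer in task.examples
    )
    return correct / len(task.examples)


def run_evaluation(
    model: Callable[[str], str],
    task_names: Optional[List[str]] = None,
) -> Dict[str, float]:
    """Score the model on the named tasks, or on every registered task."""
    names = task_names if task_names is not None else list(TASKS)
    return {name: exact_match_accuracy(model, TASKS[name]) for name in names}


# Toy placeholder task, NOT a real benchmark.
register_task(
    EvalTask(
        name="toy_element_symbols",
        examples=[("Symbol for sodium?", "Na"), ("Symbol for iron?", "Fe")],
    )
)


def toy_model(prompt: str) -> str:
    """Stand-in for the smallest model; always answers 'Na'."""
    return "Na"


scores = run_evaluation(toy_model)
print(scores)
```

Treating the model as a plain callable keeps the module independent of any particular framework, so the same `run_evaluation` call could later wrap our own checkpoints or any of the comparison models.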