Our evaluation pipeline is run using the lm-evaluation-harness submodule.

<aside> πŸ“– Note:

Our evaluation is broken down into two parts: general eval and benchmark eval, each of which includes scripts for running a standardised set of tasks. However, it is also easy to run custom sets of tasks selected from the more than 200 available in lm-evaluation-harness; details of the available tasks are in the lm-evaluation-harness documentation.

General eval: This includes NLP, STEM and safety tasks. It is mostly used to understand how well our models are training, e.g. when comparing pre-trained and fine-tuned model performance. These scripts run fairly quickly and can also be used to study intermediate checkpoints during training.

Benchmark eval: This runs a wider range of evaluation tasks and allows us to place ourselves on the Open LLM Leaderboard for comparison against e.g. GPT models. This script is very slow (some work could be done to parallelise it further), so we will likely want to run it only on the final checkpoint.

</aside>
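
For custom task sets outside the provided configs, the harness can also be called directly from Python. The snippet below is only a minimal sketch of that route: the model type string, model/task names and the exact simple_evaluate signature depend on the lm-evaluation-harness version we have pinned, and the checkpoint shown is a placeholder rather than one of our models.

# minimal sketch (not part of our pipeline) of calling lm-evaluation-harness
# directly on a custom set of tasks; model and task names are placeholders and
# the exact API depends on the pinned harness version
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                               # Hugging Face causal LM backend
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder checkpoint
    tasks=["arc_easy", "hellaswag", "piqa"],         # any of the 200+ available tasks
    num_fewshot=0,
)
print(results["results"])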

Running the general pipeline tasks

The configs for the NLP, STEM and safety tasks (0-shot) are already set up within https://github.com/OpenBioML/chemnlp under experiments/configs/eval_configs.

To run the tasks, first make the following updates within the config to be run:

  1. Update model_args to β€œpretrained=/fsx/path/to/checkpoint”, pointing at the specific model checkpoint to be evaluated.
  2. Update wandb_group and run_name to the correct values for that model.
  3. Everything else can probably stay fixed, although n-shot: int can be modified; this applies the same n-shot setting to all tasks. (A sketch of these config edits is shown below.)
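
As a concrete illustration, here is one way to apply those edits programmatically with PyYAML. This is only a sketch: the key names (model_args, wandb_group, run_name) are taken from the steps above rather than from the actual config schema, and the values are placeholders, so check the config file itself before relying on it.

# sketch of updating an eval config before a run; key names are assumed from the
# steps above and values are placeholders
import yaml

config_path = "experiments/configs/eval_configs/nlp_eval_config.yaml"

with open(config_path) as f:
    config = yaml.safe_load(f)

config["model_args"] = "pretrained=/fsx/path/to/checkpoint"  # placeholder checkpoint path
config["wandb_group"] = "my_model_group"                     # placeholder wandb group
config["run_name"] = "my_model_eval"                         # placeholder run name

with open(config_path, "w") as f:
    yaml.safe_dump(config, f)
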
# if running from inside an interactive machine
# from lm-evaluation-harness submodule dir

# run general nlp tasks
python main_eval.py '/fsx/proj-chemnlp/beth/chemnlp/experiments/configs/eval_configs/nlp_eval_config.yaml'

# run safety tasks
python main_eval.py '/fsx/proj-chemnlp/beth/chemnlp/experiments/configs/eval_configs/safety_eval_config.yaml'

# run stem tasks
python main_eval.py '/fsx/proj-chemnlp/beth/chemnlp/experiments/configs/eval_configs/stem_eval_config.yaml'

# alternatively, run using a job
# from chemnlp dir
srun experiments/scripts/run_eval.sh $1 $2 $3
# e.g. srun experiments/scripts/run_eval.sh beth beth /fsx/proj-chemnlp/beth/chemnlp/experiments/configs/eval_configs/stem_eval_config.yaml

Running the benchmark tasks

Some tasks are conventionally evaluated in the literature with specific n-shot prompting (see the Open LLM Leaderboard and the GPT-4 technical report).

This script spins up a separate machine for each task; the results are logged to our wandb as well as to /fsx/proj-chemnlp/experiments/eval_tables as a table in csv format.

It automatically uses the default eval config /fsx/proj-chemnlp/beth/chemnlp/experiments/configs/eval_configs/default_eval_config.yaml, but this must first be updated with the model path (via model_args) and the logging wandb_group, in the same way as the general configs above; everything else is fixed or overwritten by the script.

# from chemnlp repository root
python experiments/scripts/run_n_shot_benchmarks_eval.py
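
Once the benchmark runs have finished, the logged csv tables can be inspected directly. The snippet below is a small sketch that just lists and previews whatever tables are in the logging directory; the exact filenames depend on the run.

# sketch: preview the benchmark result tables logged as csv files
# (filenames depend on the run, so we just glob the logging directory)
import glob

import pandas as pd

for path in sorted(glob.glob("/fsx/proj-chemnlp/experiments/eval_tables/*.csv")):
    print(path)
    print(pd.read_csv(path).head())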

Evaluating over a grid search of models

Unlike the examples above, where model_args specifies a particular checkpoint file, the script for evaluating grid search models automatically searches for the final checkpoint of each model within the parent grid search directory. Since we won’t be grid searching the big 1B model run this is not a problem, so I am going to ignore it for now. If you do need it, the script is at chemnlp/experiments/scripts/run_eval_batch.sh; a rough sketch of the idea follows.
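
For reference, the idea behind that script is roughly the following. This sketch is not the actual run_eval_batch.sh logic: the grid search directory, the checkpoint-<step> naming convention and the config key names are all assumptions for illustration only.

# rough sketch of evaluating the final checkpoint of every model in a grid search;
# the real logic lives in run_eval_batch.sh - directory layout, checkpoint naming
# and config keys here are assumptions
import glob
import os
import subprocess

import yaml

GRID_DIR = "/fsx/proj-chemnlp/experiments/my_grid_search"  # hypothetical parent dir
BASE_CONFIG = "/fsx/proj-chemnlp/beth/chemnlp/experiments/configs/eval_configs/nlp_eval_config.yaml"

with open(BASE_CONFIG) as f:
    base_config = yaml.safe_load(f)

# assumes each model dir contains checkpoints named checkpoint-<step>
for model_dir in sorted(glob.glob(os.path.join(GRID_DIR, "*"))):
    checkpoints = sorted(
        glob.glob(os.path.join(model_dir, "checkpoint-*")),
        key=lambda p: int(p.rsplit("-", 1)[-1]),
    )
    if not checkpoints:
        continue
    config = dict(base_config)
    config["model_args"] = f"pretrained={checkpoints[-1]}"  # final checkpoint of this model
    config["run_name"] = os.path.basename(model_dir)
    run_config = os.path.join(model_dir, "eval_config.yaml")
    with open(run_config, "w") as f:
        yaml.safe_dump(config, f)
    # run from inside the lm-evaluation-harness submodule dir, as above
    subprocess.run(["python", "main_eval.py", run_config], check=True)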