New Model Guide

This guide may be of special interest to users who use the library outside of this repository, e.g. by installing it from PyPI and calling lm_eval.evaluator.evaluate() to evaluate an existing model.

In order to properly evaluate a given LM, we require a wrapper class that subclasses lm_eval.api.model.LM and defines how the Evaluation Harness should interface with your model. This guide walks through how to write this LM subclass and add it to the library!
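
Once such a class exists, it can be driven directly from Python. As rough orientation, here is a hedged sketch of that eventual usage; it assumes your installed version exposes lm_eval.simple_evaluate (a convenience wrapper around the evaluator), and MyCustomLM is a placeholder for the class this guide builds up.

# Hedged usage sketch: MyCustomLM is the subclass written later in this guide,
# and the exact entry point / arguments may differ between library versions.
import lm_eval

lm_obj = MyCustomLM()  # construct your LM subclass however it needs

results = lm_eval.simple_evaluate(
    model=lm_obj,           # an instantiated LM subclass can be passed here
    tasks=["hellaswag"],    # any task names known to the harness
    num_fewshot=0,
)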

Setup

To get started contributing, go ahead and fork the main repo, clone it, create a branch with the name of your model, and install the project requirements in your environment:

# After forking...
git clone https://github.com/<YOUR-USERNAME>/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout -b <model-type>
pip install -e ".[dev]"

Now, we'll create a new file where we'll be adding our model:

touch lm_eval/models/<my_model_filename>.py

Tip: this filename should not shadow package names! For example, naming your file anthropic.py is disallowed, since the Anthropic API's package on PyPI is named anthropic, but naming it anthropic_llms.py works with no problems.

Interface

All models must subclass the lm_eval.api.model.LM class.

The LM class enforces a common interface via which we can extract responses from a model:

class MyCustomLM(LM):
    #...
    def loglikelihood(self, requests: list[Instance]) -> list[tuple[float, bool]]:
        #...


    def loglikelihood_rolling(self, requests: list[Instance]) -> list[tuple[float]]:
        #...


    def generate_until(self, requests: list[Instance]) -> list[str]:
        #...
    #...

Here, Instance is a dataclass defined in lm_eval.api.instance whose args property has a request-type-dependent signature, described below.

We support three types of requests, consisting of different interactions / measurements with an autoregressive LM.

All three methods take as input requests of type list[Instance] whose Instance.request_type matches the method name. (A sketch showing how each method unpacks its requests follows the list below.)

  • generate_until

    • Each request contains Instance.args : Tuple[str, dict] containing 1. an input string to the LM and 2. a dictionary of keyword arguments used to control generation parameters.
    • Using this input and these generation parameters, text will be sampled from the language model (typically until a maximum output length or specific stopping string sequences--for example, {"until": ["\n\n", "."], "max_gen_toks": 128}).
    • The generated input+output text from the model will then be returned.
  • loglikelihood

    • Each request contains Instance.args : Tuple[str, str] containing 1. an input string to the LM and 2. a target string; the loglikelihood of the LM producing this target, conditioned on the input, will be returned.
    • Each request will have, as its result, (ll, is_greedy): Tuple[float, int] returned, where ll is a floating point number representing the log probability of generating the target string conditioned on the input, and is_greedy is either 0 or 1. is_greedy is 1 if and only if the target string would be generated by greedy sampling from the LM (that is, if the target string is the most likely N-token string to be output by the LM given the input).
  • loglikelihood_rolling

    • Each request contains Instance.args : Tuple[str], which is an input string to the model whose entire loglikelihood, conditioned on purely the EOT token, will be calculated.
    • This is used to evaluate perplexity on a data distribution.
    • It should return (ll,) : Tuple[float], i.e. solely the loglikelihood of producing each piece of text given no starting input.
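
To make the shapes of these arguments concrete, the sketch below unpacks Instance.args inside each of the three methods. It is purely illustrative: _generate, _score, and _score_rolling are hypothetical helpers standing in for whatever your backend actually calls.

# Illustrative sketch only: _generate, _score, and _score_rolling are hypothetical
# placeholders for your model backend; the unpacking mirrors the request types above.
from lm_eval.api.instance import Instance
from lm_eval.api.model import LM

class MyCustomLM(LM):
    def generate_until(self, requests: list[Instance]) -> list[str]:
        outputs = []
        for request in requests:
            context, gen_kwargs = request.args               # (str, dict)
            until = gen_kwargs.get("until", [])              # stop sequences
            max_gen_toks = gen_kwargs.get("max_gen_toks", 128)
            outputs.append(self._generate(context, until, max_gen_toks))
        return outputs

    def loglikelihood(self, requests: list[Instance]) -> list[tuple[float, bool]]:
        results = []
        for request in requests:
            context, continuation = request.args             # (str, str)
            ll, is_greedy = self._score(context, continuation)
            results.append((ll, is_greedy))
        return results

    def loglikelihood_rolling(self, requests: list[Instance]) -> list[tuple[float]]:
        results = []
        for request in requests:
            (text,) = request.args                           # (str,)
            results.append((self._score_rolling(text),))
        return results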

To allow a model to be evaluated on all types of tasks, you will need to implement these three types of measurements (note that loglikelihood_rolling is a special case of loglikelihood). For a reference implementation, check out lm_eval/models/huggingface.py ! Additionally, check out lm_eval.api.model.TemplateLM for a class that abstracts away some commonly used functions across LM subclasses, or see if your model would lend itself well to subclassing the lm_eval.models.huggingface.HFLM class and overriding just the initialization or a couple methods!
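
If your model is already a Hugging Face transformers model, the lightest-weight route is usually to subclass HFLM and override only what differs. The sketch below is hedged: the keyword argument shown is illustrative, and the exact set of constructor arguments HFLM accepts can vary between releases.

# Hedged sketch: reuse HFLM's tokenization, batching, and loglikelihood logic,
# customizing only initialization. Check your installed HFLM for supported kwargs.
from lm_eval.models.huggingface import HFLM

class MyHFBackedLM(HFLM):
    def __init__(self, **kwargs):
        kwargs.setdefault("batch_size", 8)  # illustrative default only
        super().__init__(**kwargs)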

Tip: be careful of indexing in loglikelihood!

LMs take in tokens in position [0 1 2 ... N] and output a probability distribution for token position N+1. We provide a simplified graphic here, excerpted from huggingface.py:

# how this all works (illustrated on a causal decoder-only setup):
#          CTX      CONT
# inp    0 1 2 3|4 5 6 7 8 9   <- last token is deleted by inp[:, :-1]
# model  \               \
# logits   1 2 3|4 5 6 7 8 9   <- the ctx half gets tossed out by the
# cont_toks      4 5 6 7 8 9      [:, -len(continuation_enc):, :self.vocab_size] slice

The final token of the target is not passed into the LM, because we want the LM's predictions up to but not past that final target token. For more information, check out #942.
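
The snippet below spells out that indexing as a hedged, standalone sketch. It assumes a Hugging Face-style causal model and tokenizer; the function name and variable names are illustrative, not the harness's own, and special-token/BOS handling is elided.

# Hedged sketch of the indexing above, assuming an HF-style causal LM/tokenizer.
import torch

def score_continuation(model, tokenizer, context: str, continuation: str):
    context_enc = tokenizer.encode(context, add_special_tokens=False)
    continuation_enc = tokenizer.encode(continuation, add_special_tokens=False)

    # Feed everything except the final continuation token: we want the model's
    # prediction *for* each continuation token, not the prediction past it.
    inp = torch.tensor([(context_enc + continuation_enc)[:-1]])

    with torch.no_grad():
        logits = model(inp).logits                     # [1, seq_len, vocab]

    # Keep only the positions that predict continuation tokens.
    logits = logits[:, -len(continuation_enc):, :]
    log_probs = torch.log_softmax(logits, dim=-1)

    cont_toks = torch.tensor([continuation_enc])
    # Log-probability the model assigns to each actual continuation token.
    token_lls = torch.gather(log_probs, 2, cont_toks.unsqueeze(-1)).squeeze(-1)

    ll = token_lls.sum().item()
    is_greedy = bool((log_probs.argmax(dim=-1) == cont_toks).all().item())
    return ll, is_greedy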

Registration

Congrats on implementing your model! Now it's time to test it out.

To make your model usable via the command line interface to lm-eval using python -m lm_eval, you'll need to tell lm-eval what your model's name is.

This is done via a decorator, lm_eval.api.registry.register_model. Using register_model(), one can both tell the package which name(s) the model can be invoked with via python -m lm_eval --model <name> and alert lm-eval to the model's existence.

from lm_eval.api.registry import register_model

@register_model("<name1>", "<name2>")
class MyCustomLM(LM):
    ...

Using this decorator adds the class to the registry of usable LM types that the library maintains internally at lm_eval.api.registry.MODEL_REGISTRY. See lm_eval.api.registry for more detail on what sorts of registries and decorators exist in the library!

Tip: be sure to import your model in lm_eval/models/__init__.py!
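
That import typically looks something like the following (the module name here is a placeholder for your own file):

# In lm_eval/models/__init__.py -- substitute your actual filename.
from . import my_model_filename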

Testing

We also recommend that new model contributions be accompanied by short tests of their 3 core functionalities, at minimum. To see an example of such tests, look at https://github.com/EleutherAI/lm-evaluation-harness/blob/35bdecd379c0cefad6897e67db892f4a6026a128/tests/test_ggml.py .
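
A hedged sketch of such a test is shown below. The registered name "my-custom-lm" is a placeholder, and the Instance construction is schematic; check lm_eval.api.instance for the exact dataclass fields in your installed version before copying it.

# Hedged test sketch: "my-custom-lm" is a placeholder registered model name,
# and the Instance fields shown are schematic -- verify them against your version.
import pytest

from lm_eval.api.instance import Instance
from lm_eval.api.registry import get_model

@pytest.fixture(scope="module")
def lm():
    return get_model("my-custom-lm")()  # placeholder name / constructor args

def test_generate_until(lm):
    request = Instance(
        request_type="generate_until",
        doc={},
        arguments=("The capital of France is", {"until": ["\n"], "max_gen_toks": 16}),
        idx=0,
    )
    (output,) = lm.generate_until([request])
    assert isinstance(output, str)

def test_loglikelihood(lm):
    request = Instance(
        request_type="loglikelihood",
        doc={},
        arguments=("The capital of France is", " Paris"),
        idx=0,
    )
    ((ll, is_greedy),) = lm.loglikelihood([request])
    assert isinstance(ll, float)

def test_loglikelihood_rolling(lm):
    request = Instance(
        request_type="loglikelihood_rolling",
        doc={},
        arguments=("The quick brown fox jumps over the lazy dog.",),
        idx=0,
    )
    assert len(lm.loglikelihood_rolling([request])) == 1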

Chat Templating

Many models are fine-tuned with a chat template in order to enable back-and-forth interaction between a user's queries and the responses of the model (often called the "assistant"). It can be desirable to evaluate such fine-tuned models on evaluation tasks while wrapped in the conversational format they expect.

In order to make your model optionally compatible with a chat format, three additional methods must be implemented:

class MyCustomLM(LM):
    #...
    @property
    def tokenizer_name(self) -> str:
        # should return a string denoting the name of the model's tokenizer and/or the accompanying chat template.

    @property
    def chat_template(self) -> str:
        # should return a chat template formatting string that is used to build prompt from a user/assistant chat history.
        # this will be saved in the evaluation results for reproducibility.

    def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
        # responsible for taking as input a chat history that would be fed into the model, and
        # rendering it as a string that can be then tokenized and input into the model.
    #...
  • apply_chat_template
    • This method performs the bulk of the work required for chat-formatting.
    • As input, a chat_history: List[Dict[str, str]] is passed in. This is a transcript of a conversation of a form similar to
      [
        {"system": <user-provided system message such as "You are a helpful math-focused chatbot">},
        {"user": <task example - a few-shot example 'input'>}
        {"assistant": <correct response to the above example>},
        # ... more few-shot examples, potentially
        {"user": <test set query--response on which we will evaluate>},
      ]
      
      which can then be converted into a string input.
    • The output is a string representing this conversation that can be fed into the model.
    • For HFLM, for example, this consists of simply calling tokenizer.apply_chat_template; see the implementation there for reference.
  • tokenizer_name
    • LM Eval Harness supports caching requests that are sent to a model, for faster setup when repeating an already-performed evaluation.
    • However, we don't want to use the cache of chat transcripts rendered using one chat template or system prompt to send to a model with a different template! So, we use this lm.tokenizer_name string to distinguish caches for a given model (and chat template) from one another.
  • chat_template
    • Chat templates are typically provided as a Jinja template string or a string formatted with str.format to include user and assistant messages in a single prompt. This template string is saved in the evaluation results to ensure reproducibility.
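
For a model backed by a Hugging Face tokenizer, a hedged sketch of these three members might look like the following. It assumes self.tokenizer is a transformers tokenizer that ships with a chat template and that the chat history uses role/content-keyed messages; other backends will need their own rendering logic.

# Hedged sketch: assumes `self.tokenizer` is a Hugging Face tokenizer with a chat
# template, and that chat_history entries look like {"role": ..., "content": ...}.
from typing import Dict, List

from lm_eval.api.model import LM

class MyCustomLM(LM):
    # ...

    @property
    def tokenizer_name(self) -> str:
        # Used to keep request caches for different tokenizers/templates separate.
        return self.tokenizer.name_or_path.replace("/", "__")

    @property
    def chat_template(self) -> str:
        # Saved into the evaluation results for reproducibility.
        return self.tokenizer.chat_template or ""

    def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
        # Render the conversation into one prompt string, leaving the final
        # assistant turn open for the model to complete.
        return self.tokenizer.apply_chat_template(
            chat_history, tokenize=False, add_generation_prompt=True
        )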

If not implemented for a given model type, the flags --apply_chat_template, --fewshot_as_multiturn, and --system_instruction cannot be used.

Other

Pro tip: In order to make the Evaluation Harness overestimate total runtimes rather than underestimate them, HuggingFace models are set up to serve responses for data points in descending order of total input length, via lm_eval.utils.Reorderer. Take a look at lm_eval.models.huggingface.HFLM to see how this is done, and see if you can implement it in your own model!
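
A hedged sketch of the same idea in plain Python (not the harness's own utility) is shown below: sort request indices by descending input length, run them in that order, and restore the original ordering before returning. run_one is a hypothetical callable standing in for your per-request processing.

# Hedged sketch: process the longest inputs first so early progress gives a
# pessimistic runtime estimate, then restore the original request order.
def run_longest_first(requests, run_one):
    order = sorted(
        range(len(requests)),
        key=lambda i: len(requests[i].args[0]),  # crude proxy: input string length
        reverse=True,
    )
    results = [None] * len(requests)
    for i in order:
        results[i] = run_one(requests[i])
    return results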

Conclusion

After reading this guide, you should be able to add new model APIs or implementations to the Eval Harness library!