Why Use Flexible Evaluations?

Suppose you have already built an application that answers math questions, and you want to evaluate it.

To create an evaluation, you can follow the process for creating and evaluating an external application.

Here is the sample dataset you can run it with.

And here is an example of what the annotations page will look like.

This type of evaluation works well if your application takes a single input and you only want to evaluate a single output. With this setup, the entire application effectively acts as a black box: you can evaluate the quality of its end results, but nothing in between. However, most applications do not fit this shape.

For example, what if you wanted to create an application that takes in the current date and lets the user ask questions such as what the date was 5 days ago?

Or, what if you wanted to ask the model what day of the week it was 543 days ago? Or evaluate a question that takes multiple reasoning steps?
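To make the "multiple reasoning steps" point concrete, here is a minimal sketch of the date question broken into separate steps. It uses only the Python standard library. The function names (`parse_current_date`, `days_ago`, `weekday_name`, `answer`) and the sample date are assumptions made for illustration and are not part of any evaluation API.

```python
from datetime import date, timedelta

# Fixed English names so the result does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def parse_current_date(text: str) -> date:
    """Step 1: turn the user-supplied ISO date string into a date."""
    return date.fromisoformat(text)


def days_ago(current: date, n: int) -> date:
    """Step 2: subtract n days from the current date."""
    return current - timedelta(days=n)


def weekday_name(d: date) -> str:
    """Step 3: map a date to its weekday name (Monday is index 0)."""
    return WEEKDAYS[d.weekday()]


def answer(current_date: str, n: int) -> dict:
    """Run all three steps and keep every intermediate value."""
    current = parse_current_date(current_date)
    target = days_ago(current, n)
    return {
        "current_date": current.isoformat(),
        "target_date": target.isoformat(),
        "weekday": weekday_name(target),
    }


# Hypothetical inputs: the current date plus the number of days to go back.
result = answer("2024-01-10", 543)
print(result)
```

A black-box evaluation would only see the final `weekday` value. Keeping `target_date` as well makes it possible to tell whether a wrong answer came from the date subtraction or from the weekday lookup.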

Flexible evaluations let you:

Evaluate applications with multiple inputs and outputs.
Evaluate multi-step applications, assessing each step individually.
Attach metrics to application outputs so you can define evaluation criteria.
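The three capabilities above can be pictured with a small data-structure sketch. This is not the platform's schema: `EvalRecord`, `exact_match`, the field names, and the sample values are all hypothetical, chosen only to show one way that multiple inputs, intermediate steps, and per-output metrics might sit together in a single record.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class EvalRecord:
    """Hypothetical container for one evaluated test case."""
    inputs: Dict[str, Any]            # several named inputs
    steps: List[Dict[str, Any]]       # intermediate steps, in order
    outputs: Dict[str, Any]           # several named outputs
    metrics: Dict[str, float] = field(default_factory=dict)


def exact_match(expected: Any, actual: Any) -> float:
    """Simplest possible metric: 1.0 on equality, otherwise 0.0."""
    return 1.0 if expected == actual else 0.0


# Illustrative, made-up test case with two inputs, one step, one output.
record = EvalRecord(
    inputs={"current_date": "2024-01-10",
            "question": "What was the date 5 days ago?"},
    steps=[{"name": "subtract_days", "output": "2024-01-05"}],
    outputs={"answer": "2024-01-05"},
)
expected = {"subtract_days": "2024-01-05", "answer": "2024-01-05"}

# Attach one metric per step and one per final output.
for step in record.steps:
    key = "step:" + step["name"]
    record.metrics[key] = exact_match(expected[step["name"]], step["output"])
for name, value in record.outputs.items():
    record.metrics["output:" + name] = exact_match(expected[name], value)

print(record.metrics)
```

In practice the metric would likely be something richer than exact match, but the shape stays the same: each step and each output carries its own score instead of a single score for the whole application.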