Guides

Flexible Evaluations

In addition to standard evaluation runs on the SGP platform, we offer extra functionality for external applications (applications that are not built natively on our platform) to help users customize their evaluation process.

📘

If you haven't run an evaluation for an external application yet, check out this how-to guide first.

When to run a flexible evaluation

Standard evaluations on the SGP platform let users treat an application like a black box and focus on annotating a single string input and a single string result. This works well for simple use cases.

However, for more advanced users, this approach may not suffice. Many Gen AI applications, such as AI agents, require:

  1. Multiple internal steps, such as function calls, before reaching the final answer. These steps need to be evaluated independently to properly assess the performance of the entire system.
  2. Multiple inputs and outputs, potentially with non-string types.
  3. Metrics attached to outputs to better understand the performance of the application.

Advanced users evaluating a RAG application, for example, might want to focus an evaluation on the retrieval stage alone to make sure the retrieved context is in line with expectations. Additionally, for certain test cases they may have metrics like ROUGE or BLEU generated off-platform that they would like to view alongside the results.

Flexible evaluation features

Evaluate multiple stages of the application

Instead of showing just the application input and output on our platform for each test case, developers will be able to upload entire trace views to capture applications that have multiple stages. This also enables users to run evaluations and annotate results for each stage of the test case.
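
As a minimal sketch, a multi-stage trace could be represented as an ordered list of spans and submitted alongside a test case result. The base URL, header, and field names below (such as trace_spans and node_id) are illustrative assumptions, not the actual SGP API.

```python
import requests

SGP_API_URL = "https://api.example.com/v5"  # placeholder base URL, not the real endpoint
API_KEY = "YOUR_API_KEY"

# The trace is modeled as an ordered list of spans, one per application stage,
# so each stage can be evaluated and annotated independently.
trace = [
    {
        "node_id": "retrieval",
        "operation_input": {"query": "What is our refund policy?"},
        "operation_output": {"chunks": ["Refunds are accepted within 30 days..."]},
    },
    {
        "node_id": "completion",
        "operation_input": {"prompt": "Answer the question using the retrieved chunks."},
        "operation_output": {"answer": "You can request a refund within 30 days."},
    },
]

# Hypothetical request attaching the trace to an external application's test case output.
response = requests.post(
    f"{SGP_API_URL}/application-test-case-outputs",
    headers={"x-api-key": API_KEY},
    json={
        "test_case_id": "tc_123",
        "trace_spans": trace,
        "output": trace[-1]["operation_output"],
    },
)
response.raise_for_status()
```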

Evaluate applications with complex inputs and outputs

By default, SGP evaluations expect that applications accept a single string and output a single string. However, many applications accept multiple inputs with complex types, and produce multiple outputs. Flexible evaluation datasets allow you to specify test cases with multiple inputs and expected outputs of various types.
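
For illustration, a flexible test case could be expressed as structured data with several typed inputs and expected outputs; the field names here are assumptions used to show the shape, not the exact dataset schema.

```python
# A flexible-evaluation test case: several typed inputs and several expected
# outputs, rather than a single string in and a single string out.
test_case = {
    "input": {
        "question": "Summarize the attached contract.",  # string
        "document_ids": ["doc_001", "doc_002"],          # list of document IDs
        "max_summary_tokens": 256,                       # integer parameter
    },
    "expected_output": {
        "summary": "The contract covers a 12-month engagement...",
        "cited_documents": ["doc_001"],
    },
}
```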

Create a custom annotations UI

Developers will be able to configure which inputs, outputs, and traces are shown to annotators, and select which questions from the question set annotators should answer for those traces.
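
A hypothetical configuration might look like the sketch below; the keys, question IDs, and stage names are assumptions used only to illustrate mapping questions from a question set to specific trace stages.

```python
# Illustrative annotation configuration: choose which fields annotators see,
# and which questions apply to which trace stage.
annotation_config = {
    "display": {
        "inputs": ["question", "document_ids"],
        "outputs": ["summary"],
        "trace_spans": ["retrieval", "completion"],
    },
    "questions": [
        {"question_id": "retrieval_relevance", "applies_to": "retrieval"},
        {"question_id": "faithfulness", "applies_to": "completion"},
    ],
}
```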

Upload custom metrics

For each test case, users can also upload custom metrics onto the platform, giving developers a holistic view of test case performance with human annotations and other metrics in one place.
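
As an illustration, a metric like ROUGE-L can be computed off-platform with the open-source rouge-score package and then attached to a test case. The upload endpoint and payload fields below are placeholders, not the actual SGP API.

```python
import requests
from rouge_score import rouge_scorer  # pip install rouge-score

reference = "You can request a refund within 30 days."
candidate = "Refunds can be requested within 30 days of purchase."

# Compute ROUGE-L F1 off-platform.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l_f1 = scorer.score(reference, candidate)["rougeL"].fmeasure

# Hypothetical request attaching the metric to a test case result.
requests.post(
    "https://api.example.com/v5/test-case-metrics",  # placeholder endpoint
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "test_case_id": "tc_123",
        "metrics": [{"name": "rouge_l_f1", "value": rouge_l_f1}],
    },
).raise_for_status()
```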

How to get started with flexible evaluations

To get started, see our example recipe:

📘

Flexible evaluations recipe.

Then dive into our more detailed docs, which explore the full capabilities of flexible evaluations.