API Reference

Create Chat Completion

Description

Given a list of messages representing a conversation history, runs LLM inference to produce the next message.

Details

Like completions, chat completions involve an LLM responding to input. However, chat completions take a conversation history as input instead of a single prompt, which enables the LLM to produce responses that take past context into account.

Messages

The primary input to the LLM is a list of messages represented by the messages array, which forms the conversation. The messages array must contain at least one message object.
Each message object is attributed to a specific entity through its role. The available roles are:

  • user: Represents the human querying the model.
  • assistant: Represents the model responding to the user.
  • system: Represents a non-user entity that provides information to guide the behavior of the assistant.

When the role of a message is set to user, assistant, or system, the message must also contain a content field, which is a string holding the actual text of the message. Semantically, when the role is user, content contains the user's query. When the role is assistant, content is the model's response to the user. When the role is system, content represents the instruction for the assistant.
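
For example, a conversation in which the model should produce the next assistant reply might look like this (shown here as a Python list; the JSON request body has the same shape, and the values are illustrative):

```python
# A conversation history: a system instruction, then alternating
# user and assistant turns. The API generates the next assistant message.
messages = [
    {"role": "system", "content": "You are a concise travel assistant."},
    {"role": "user", "content": "What is the capital of Australia?"},
    {"role": "assistant", "content": "The capital of Australia is Canberra."},
    {"role": "user", "content": "How far is that from Sydney?"},
]
```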

Instructions

You may provide instructions to the assistant by supplying instructions in the HTTP request body or by specifying a message with role set to system in the messages array. By convention, the system message should be the first message in the array. Do not specify both an instruction and a system message in the messages array.
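
For example, the following two request bodies supply the same instruction in the two supported ways; the model ID is a placeholder:

```python
# Option 1: instruction supplied via the top-level instructions field.
body_a = {
    "model": "my-model-id",  # placeholder
    "instructions": "Answer in one short sentence.",
    "messages": [{"role": "user", "content": "What is a chat completion?"}],
}

# Option 2: the same instruction as a leading system message.
# Never combine this with the top-level instructions field.
body_b = {
    "model": "my-model-id",  # placeholder
    "messages": [
        {"role": "system", "content": "Answer in one short sentence."},
        {"role": "user", "content": "What is a chat completion?"},
    ],
}
```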

Body Params
account_id
string

The account ID to use for usage tracking. This will be gradually enforced.

model
string
required

The ID of the model to use for chat completions. We only support the models listed here so far.

memory_strategy
object

The memory strategy to use for the agent. A memory strategy is a way to prevent the underlying LLM's context limit from being exceeded. Each memory strategy uses a different technique to condense the input message list into a smaller payload for the underlying LLM.

We only support the Last K memory strategy right now, but will be adding new strategies soon.
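
Conceptually, a Last K strategy keeps only the K most recent messages when constructing the payload for the underlying LLM. The sketch below illustrates the idea only; the condensing happens server-side, and this function is not part of the API:

```python
def apply_last_k(messages: list[dict], k: int) -> list[dict]:
    """Illustrative only: keep the k most recent messages so the
    prompt stays within the underlying LLM's context limit."""
    return messages[-k:]
```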

messages
array
required

The list of messages in the conversation.

Each message type is described in the Messages section above, including when to use it. Most conversations should begin with a single user message.

model_parameters
object

Configuration parameters for the chat completion model, such as temperature, max_tokens, and stop_sequences.

If not specified, the default values are:

  • temperature: 0.2
  • max_tokens: None (limited by the model's max tokens)
  • stop_sequences: None
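
For example, a model_parameters object that overrides the defaults listed above (the values here are illustrative):

```python
model_parameters = {
    "temperature": 0.2,          # default; higher values produce more varied output
    "max_tokens": 256,           # cap the response length; omit to use the model's limit
    "stop_sequences": ["\n\n"],  # stop generating once this sequence appears
}
```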

instructions
string
Defaults to "You are an AI assistant that helps users with their questions by chatting back and forth with them. When asked a question, you should answer it as best as you can with the information you have. If you need more information, you can ask the user for it."

The initial instructions to provide to the chat completion model.

Use this to guide the model to act in more specific ways. For example, if there are specific rules you want the model to follow, you can specify them here.

Good prompt engineering is crucial to getting performant results from the model. If you are having trouble getting the model to perform well, try writing more specific instructions here before trying more expensive techniques such as swapping in other models or finetuning the underlying LLM.

chat_template
string

Currently only supported for LLM-Engine models. A Jinja template string that defines how the chat completion API formats the string prompt.

For Llama models, the template must take in at most a messages object, a bos_token string, and an eos_token string. The messages object is a list of dictionaries, each with the keys role and content.

For Mixtral models, the template must take in at most a messages object and an eos_token string. The messages object looks identical to the Llama models' messages object, but the template can assume the role key takes on the values user or assistant, or system for the first message.

The chat template either needs to handle this system message (which gets set via the instructions field or by the messages), or the instructions field must be set to null and the messages object must not contain any system messages. See the default chat templates in the Llama and Mixtral tokenizers for examples.
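
A minimal sketch of a Llama-style template, passed as a Python string. This is illustrative only, with simplified formatting tags; it is not the default template shipped in the Llama tokenizer. It assumes the messages, bos_token, and eos_token variables described above:

```python
# Illustrative Jinja chat template for a Llama-style model.
# Simplified tags; not the tokenizer's real default template.
chat_template = (
    "{{ bos_token }}"
    "{% for message in messages %}"
    "{% if message['role'] == 'system' %}"
    "<<SYS>>{{ message['content'] }}<</SYS>>"
    "{% elif message['role'] == 'user' %}"
    "[INST] {{ message['content'] }} [/INST]"
    "{% else %}"
    "{{ message['content'] }}{{ eos_token }}"
    "{% endif %}"
    "{% endfor %}"
)
```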

stream
boolean
Defaults to false

Whether or not to stream the response.

Setting this to true streams the completion back in real time.
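
A sketch of consuming a streamed response with Python's requests library. The endpoint URL and auth header are assumptions, since they are not specified on this page; substitute the real values for your deployment:

```python
import requests

resp = requests.post(
    "https://api.example.com/v1/chat-completions",  # assumed URL, not from this page
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # assumed auth scheme
    json={
        "model": "my-model-id",  # placeholder
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
    },
    stream=True,  # tell requests not to buffer the whole response body
)
for line in resp.iter_lines():
    if line:
        print(line.decode("utf-8"))  # each line carries one chunk of the completion
```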
