Chat with an LLM

Overview

What is a chat completion?

A chat completion is a type of output generated by a Large Language Model (LLM) that operates on a list of user and AI interactions rather than a single prompt input. Unlike regular completions, which only rely on the user's original input, chat completions can build up context from the chat history. This allows the LLM to understand the conversation's flow and provide more contextually relevant responses.

Chat completions are particularly useful in interactive applications, such as chatbots and virtual assistants, where the conversation evolves over multiple turns. By considering the entire chat history, the LLM can maintain continuity and coherence, leading to more natural and engaging interactions with users.
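For illustration, a multi-turn message list might look like the following sketch. The "user" and "agent" role names here match the request and response examples later in this guide; treat the exact role names as an assumption and check the API reference for your model.

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "agent", "content": "The capital of France is Paris."},
    # The pronoun "its" only resolves because the model sees the history above.
    {"role": "user", "content": "What is its population?"},
]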

The ability to understand the context from the ongoing conversation empowers chat completions to handle more complex queries and maintain consistent user experiences. However, it also presents challenges, as the model must avoid becoming repetitive or divulging sensitive information from previous interactions.

To ensure responsible and effective use, developers and designers should carefully curate the training data and fine-tune the LLM to adhere to ethical guidelines and respect privacy concerns. Despite the challenges, chat completions offer significant advancements in creating dynamic and interactive AI systems that cater to users' evolving needs.

Working Examples

Please refer to our Guides and Recipes for working end-to-end (E2E) examples of what you can do with Agents.

Introductory Example

Let's make a simple request to the Chat Completions API.

All we need to do is set our API key in the headers, specify a model and a message list to send to the API, and then submit the call with the requests library. Notice that unlike the Completions API, the Chat Completions API takes in a message list, not a single prompt. This message list is how the user gives the agent the full context of the conversation. Our API is stateless, so it is the user's responsibility to maintain this message list in the client-side application.

import requests
import os


SPELLBOOK_API_KEY = os.environ.get("SPELLBOOK_API_KEY")

headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "x-api-key": SPELLBOOK_API_KEY,
}
payload = {
    "model": "gpt-3.5-turbo-0613",
    "messages": [
        {
            "role": "user",
            "content": "Hello, what can you do?"
        }
    ]
}

response = requests.post("https://api.spellbook.scale.com/egp/v1/chat-completions", json=payload, headers=headers)

print(response.text)

Output

{
  "chat_completion": {
    "message": {
      "role": "agent",
      "content": "Hello! I'm an AI assistant here to help you with your questions. I can provide information, answer queries, assist with tasks, and engage in conversation on a wide range of topics. How can I assist you today?",
      "tool_request": null
    },
    "finish_reason": null
  },
  "token_usage": {
    "prompt": 71,
    "completion": 45,
    "total": 116
  }
}
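The response body is plain JSON, so you can parse it with response.json() and read the fields shown above. A minimal sketch:

data = response.json()

# Pull out the assistant's reply and the token accounting from the
# structure shown in the output above.
reply = data["chat_completion"]["message"]["content"]
usage = data["token_usage"]

print(reply)
print(f"{usage['total']} tokens used ({usage['prompt']} prompt + {usage['completion']} completion)")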

Streaming Chat Completions

By default, the chat completions endpoint uses synchronous responses, meaning each request blocks until the completion is fully formed; the result is then sent over in one piece. Since this can take several seconds, we offer the ability to stream responses so that you receive parts of the response sequentially, as soon as they are ready.

With streaming, small sets of tokens (which can be thought of as shorter representations of root words, prefixes, symbols, etc.) are sent over one batch at a time, as soon as they are available from the LLM. This is especially useful when designing a chat app, where a user would rather see tokens appear sequentially over multiple seconds than wait on a blank screen until the entire text pops up at once.

We can enable streaming completions with simple additions to the code above. We also use the SSE (server-sent events) protocol for streaming, which means we'll need to adjust the way we parse the results into a proper Python dictionary.

import requests
import os
import json  # <-- added
import sys  # <-- added


SPELLBOOK_API_KEY = os.environ.get("SPELLBOOK_API_KEY")

headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "x-api-key": SPELLBOOK_API_KEY,
}
payload = {
    "model": "gpt-3.5-turbo-0613",
    "messages": [
        {
            "role": "user",
            "content": "Hello, what can you do?"
        }
    ],
    "stream": True,  # <-- added
}

response = requests.post(
    url="https://api.spellbook.scale.com/egp/v1/chat-completions",
    json=payload,
    headers=headers,
    stream=True,  # <-- added
)

for raw_event in response.iter_lines():
    if raw_event:
        event_str = raw_event.decode()
        # SSE events arrive as lines of the form "data: {json}"
        if event_str.startswith('data: '):
            event_json_str = event_str[len('data: '):]
            stream_chunk = json.loads(event_json_str)
            print(stream_chunk["chat_completion"]["message"]["content"], end="")
            sys.stdout.flush()
print()

Output

The following output will be streamed to your Terminal 🏃

Hello! I'm an AI assistant designed to help answer questions and provide information on a wide range of topics. I can assist with general knowledge, provide definitions, offer suggestions, and more. Feel free to ask me anything you'd like to know!

Now that you know how to use this API, you can easily extend this to build simple chat applications! 😄

End-to-End Working Example

There is one more thing we have to do to keep the chat going. Because the Chat Completions API expects complete messages, not streamed chunks, we need to combine all streamed chunks back into a single assistant message before adding it to the message list we send back to the API.
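A minimal sketch, reusing the streaming loop from above: collect each chunk's content, join the pieces, and append the result to your message list so the next request carries the full history. (We assume the assistant's turns use the "agent" role seen in the responses above.)

# Accumulate streamed chunks into one assistant message.
chunks = []
for raw_event in response.iter_lines():
    if raw_event:
        event_str = raw_event.decode()
        if event_str.startswith('data: '):
            stream_chunk = json.loads(event_str[len('data: '):])
            chunks.append(stream_chunk["chat_completion"]["message"]["content"])

# Append the reassembled reply to the history before the next request.
payload["messages"].append({"role": "agent", "content": "".join(chunks)})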

Check out this walkthrough to see how to address this issue.