Synthesize Chunks

POST https://api.egp.scale.com/v4/chunks/synthesis

Description

Synthesizes a response to the given query using the given chunks as context.

Details

This API operates on "chunks," which are the result of querying a vector store. A chunk is simply a fragment of a larger document. It can optionally have its own metadata or ID. You can also construct your own chunks from scratch so long as you provide the text pertaining to each chunk.
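As a rough illustration of constructing chunks from scratch, the sketch below models a chunk as required text plus an optional ID and optional metadata, then assembles a request body. The `Chunk` class, the `build_synthesis_body` helper, and the field names `chunk_id`, `metadata`, `query`, and `chunks` are assumptions made for illustration; consult the request schema for the exact names.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Chunk:
    """Hypothetical in-memory model of a chunk; only `text` is required."""
    text: str
    chunk_id: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_payload(self) -> Dict[str, Any]:
        # Omit optional fields that were not provided.
        payload: Dict[str, Any] = {"text": self.text}
        if self.chunk_id is not None:
            payload["chunk_id"] = self.chunk_id
        if self.metadata:
            payload["metadata"] = self.metadata
        return payload


def build_synthesis_body(query: str, chunks: List[Chunk]) -> Dict[str, Any]:
    # Assumed body shape: a query string plus a list of chunk objects.
    if not chunks:
        raise ValueError("at least one chunk is required")
    return {"query": query, "chunks": [c.to_payload() for c in chunks]}
```

The resulting dictionary could then be serialized as JSON and sent to the endpoint above with whatever HTTP client and authentication your deployment uses.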

Synthesizing chunks can be thought of as the final step in a retrieval-augmented generation (RAG) system, after querying for chunks and potentially ranking them. Once you have the relevant chunks for responding to the user, you'll want to synthesize a readable natural-language answer. During this synthesis step, we prompt an LLM with instructions and a set of (possibly transformed) chunks to guide it toward synthesizing a natural-language response to the user query.

There are several synthesis strategies, whose differences become more apparent as the number of chunks increases. Currently, EGP supports the compact strategy by default: stuff as many chunks into the synthesis LLM as possible, produce a best-effort answer, then stuff the next set of chunks and answer again. The answer is passed through and gradually refined with each iteration of chunk scanning. This continues until all chunks have been scanned and a final answer is synthesized. More synthesis strategies are currently in development!
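The compact strategy can be sketched roughly as follows. This is a minimal illustration of the stuff-then-refine loop, not EGP's actual implementation: the `llm` callable, the prompt wording, the `budget` parameter, and the whitespace-based `count_tokens` are all hypothetical stand-ins.

```python
from typing import Callable, List, Optional


def count_tokens(text: str) -> int:
    # Crude whitespace-based stand-in for a real tokenizer (assumption).
    return len(text.split())


def pack_chunks(chunks: List[str], budget: int) -> List[List[str]]:
    """Greedily group chunks so each group stays within `budget` tokens.

    A single chunk larger than the budget still gets a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        # Start a new batch when the next chunk would overflow the current one.
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(chunk)
        used += cost
    if current:
        batches.append(current)
    return batches


def compact_synthesize(
    query: str,
    chunks: List[str],
    llm: Callable[[str], str],
    budget: int = 2000,
) -> Optional[str]:
    """Answer from the first batch, then refine with each later batch."""
    answer: Optional[str] = None
    for batch in pack_chunks(chunks, budget):
        context = "\n\n".join(batch)
        if answer is None:
            prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        else:
            prompt = (
                f"Existing answer: {answer}\n\nNew context:\n{context}\n\n"
                f"Refine the existing answer to the question: {query}"
            )
        answer = llm(prompt)  # one LLM call per batch
    return answer
```

Packing as many chunks as fit into each call keeps the number of LLM round trips low, which is why this approach is comparatively token-efficient.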

Restrictions and Limits

Generally, the token consumption of chunk synthesis grows linearly with the number of chunk tokens.
The compact strategy attempts to be as token-efficient as possible by stuffing tokens into each round of chunk scanning. Other strategies (in development!) may produce better summaries or more precise answers at the expense of more tokens consumed.

Broadly, we recommend keeping the number of chunks under 100, or the total number of tokens across all chunks under roughly 10,000.
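A client-side check against these recommendations might look like the sketch below. The `check_limits` helper is hypothetical, and the four-characters-per-token estimate is only a rough heuristic for English text; use your model's tokenizer for real counts.

```python
from typing import List

MAX_CHUNKS = 100           # recommended: fewer than 100 chunks
MAX_TOTAL_TOKENS = 10_000  # recommended: roughly 10,000 tokens at most


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token, rounded up.
    return (len(text) + 3) // 4


def check_limits(chunk_texts: List[str]) -> List[str]:
    """Return a warning for each recommendation the input exceeds."""
    warnings: List[str] = []
    if len(chunk_texts) >= MAX_CHUNKS:
        warnings.append(
            f"{len(chunk_texts)} chunks; recommended is under {MAX_CHUNKS}"
        )
    total = sum(estimate_tokens(t) for t in chunk_texts)
    if total > MAX_TOTAL_TOKENS:
        warnings.append(
            f"about {total} tokens; recommended is under ~{MAX_TOTAL_TOKENS}"
        )
    return warnings
```

An empty list means the input is within both recommendations; otherwise, consider trimming or re-ranking chunks before synthesis.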