Guides

Ingest Custom Data

Ingest Custom Data using our Knowledge Bases and Data Connectors APIs

Overview

In this guide, we'll walk through the steps for ingesting your custom data. You can do this either by uploading your local files directly or by connecting your data via a data connector.

To upload your local files, you will need to:

  1. Create an SGP knowledge base.
  2. Upload your local files as file bytes to your knowledge base.

To connect your data via a data connector, you will need to:

  1. Create an SGP knowledge base.
  2. Integrate your external account credentials with your SGP user.
  3. Create an SGP data connector, using your integrated credentials, in your SGP knowledge base.
  4. Ingest and sync your files' data into your knowledge base via your data connector.
  5. Check the status and details of your syncing job.

You may reuse the same knowledge base for both local file uploads and data connector connections.

In this guide, we'll demonstrate the workflow for ingesting and syncing your Confluence files. However, we also support other data connector types, such as Google Drive and S3, and you can also upload your local files directly to your knowledge base. Using another data source follows a very similar series of steps.

SGP APIs used in this guide

  • Create Knowledge Base
  • Upload File
  • Create Integration
  • Create Data Connector
  • Sync Data Connector
  • Get Data Connector Sync Status

Prerequisites

Authentication

Please follow the steps described here: Authentication

You may also find it helpful to first read through our Data Connectors component overview, which explains what data connectors are and how they fit with other SGP components.

Create a Knowledge Base

The first step to ingesting custom data is to create the knowledge base in which your data will be stored. You can either upload files directly to this knowledge base or create data connectors that will be attached to it. Make sure to save the knowledge_base_id that is returned as part of the API response, as you will need it for creating a data connector.

import requests

# Fill in your own API key
API_KEY = "[Your Spellbook API Key here]"

# Choose a name for your knowledge base
KNOWLEDGE_BASE_NAME = "[Choose a name]"

# Creating a knowledge base
print(f"Creating a knowledge base named {KNOWLEDGE_BASE_NAME}...")
url = "https://api.spellbook.scale.com/egp/v1/knowledge-bases"
payload = {
    "knowledge_base_name": KNOWLEDGE_BASE_NAME,
    "embedding_model": "sentence-transformers/all-MiniLM-L12-v2"
}
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "x-api-key": API_KEY
}
response = requests.post(url, json=payload, headers=headers)
print(f"Response: {response.text}")

# Saving the knowledge base ID for future usage
KNOWLEDGE_BASE_ID = response.json()["knowledge_base_id"]

Upload Local Files

Once you have created a knowledge base, you can upload your local files to it by sending their file bytes to the Upload File endpoint. You'll need to provide (and keep track of) a unique file ID for each file you upload so that, if you want to update or delete a file later, we can match the file ID to locate your original file. If your file paths are unique, you may use the file path as your file ID. An example script for uploading a file is below:

import json
import mimetypes
import os

import requests

API_KEY = "[Your API key]"
FILE_PATH = "[Your file path]"
KNOWLEDGE_BASE_ID = "[Your Knowledge Base ID]"

# Uploading Local Files
print("Uploading local file to knowledge base...")
url = "https://api.spellbook.scale.com/egp/v1/knowledge-bases/" + KNOWLEDGE_BASE_ID + "/files"

# Here the file path is used as the file ID, since file paths are unique on a local machine
apiarg = {
  "file-name": FILE_PATH,
  "file-id": FILE_PATH,
}

headers = {
  "accept": "application/json",
  "content-type": mimetypes.MimeTypes().guess_type(FILE_PATH)[0],
  "content-length": str(os.path.getsize(FILE_PATH)),
  "x-api-key": API_KEY,
  "spellbook-api-arg": json.dumps(apiarg),
}

with open(FILE_PATH, "rb") as f:
  data = f.read()

response = requests.post(url, data=data, headers=headers)
print(f"Response: {response.text}")

If you would like to connect an external data source that can sync on a regular schedule rather than upload individual local files, keep reading to learn how to create and use data connectors.

Integrate Account Credentials for Data Connectors

Before creating a data connector of any type, we highly recommend first creating a data connector integration, which associates your external account credentials with your SGP user.

Let's start with integrating your Confluence account credentials using the Create Integration endpoint. Note that the payload object depends on the type of data connector you want to create. In this case, we are creating a Confluence integration; for other data connector types, see the Create Integration documentation for what information to pass in.

# Fill in your Confluence credentials here
confluence_email = "[Your Confluence account email]"
confluence_site_name = "xyz123"
confluence_api_key = "ABCDEFG12345678"

# Creating Confluence integration
print("Integrating Confluence credentials with SGP...")
url = "https://api.spellbook.scale.com/egp/v1/integrations"
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "x-api-key": API_KEY
}
payload = {
    "integration_type": {
        "name": "confluence",
        "params": {
            "email": confluence_email,
            "site_name": confluence_site_name,
            "api_key": confluence_api_key
        }
    }
}
response = requests.post(url, json=payload, headers=headers)
print(f"Response: {response.text}")

Create a Data Connector

After you've successfully integrated your Confluence credentials, you can now easily create a Confluence data connector that will connect your Confluence space with your SGP knowledge base.

# Choose a name for your data connector
DATA_CONNECTOR_NAME = "my_data_connector_0"

# Select which of your Confluence spaces you want to ingest data from
confluence_space_name = 'confluence-space-1'

# Creating the data connector
print(f"Creating a data connector named {DATA_CONNECTOR_NAME} in knowledge base with id {KNOWLEDGE_BASE_ID}...")
url = "https://api.spellbook.scale.com/egp/v1/data-connectors"
payload = {
    "data_connector_type": {
        "config": { "space_name": confluence_space_name },
        "name": "confluence"
    },
    "data_connector_name": DATA_CONNECTOR_NAME,
    "knowledge_base_id": KNOWLEDGE_BASE_ID,
}
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "x-api-key": API_KEY
}
response = requests.post(url, json=payload, headers=headers)
print(f"Response: {response.text}")

# Saving this value for the next step of syncing
DATA_CONNECTOR_ID = response.json()['data_connector_id']

Sync Your Files via Data Connectors

The final and most important step is to sync the files associated with your data connector. Note here that you will need the data_connector_id of the data connector you just created, which was returned as part of the API response.

Let's start a syncing job for your Confluence data connector, which will take all the files from your specified Confluence space and insert them as vector embeddings into your knowledge base. Make sure to save the sync_id that is returned as part of the API response, as you will need it to check on the status of your syncing job.

# Starting a syncing job for your data connector
url = f"https://api.spellbook.scale.com/egp/v1/data-connectors/{DATA_CONNECTOR_ID}/sync"
headers = {
    "accept": "application/json",
    "x-api-key": API_KEY
}
response = requests.post(url, headers=headers)
print(f"Response: {response.text}")

# Saving the sync ID so we can check the status of the syncing job in the next step
SYNC_ID = response.json()['sync_id']

Check Sync Status of a Data Connector

Once you kick off your syncing job, you can check on its status at any time via the Sync Status endpoint. Depending on how many files you are syncing, the entire job may take some time, but the Sync Status endpoint provides observability into the number of files that have finished syncing, are currently pending sync, or have failed to sync.

Check on the status of your syncing job by running the following snippet:

url = f"https://api.spellbook.scale.com/egp/v1/data-connectors/{DATA_CONNECTOR_ID}/syncs/{SYNC_ID}"
headers = {
    "accept": "application/json",
    "x-api-key": API_KEY
}
response = requests.get(url, headers=headers)
print(f"Response: {response.text}")

You should see a response that provides details on the overall status of your sync, the number of files that are completed/pending/failed, as well as per-file sync status details.
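
If you'd like to wait until the sync finishes before querying your knowledge base, you can poll the Sync Status endpoint in a loop. The sketch below is illustrative only: it assumes the response JSON carries a top-level status field that eventually reads "completed" or "failed", so adjust the field names and values to match the actual response you receive:

import time

import requests

# API_KEY, DATA_CONNECTOR_ID, and SYNC_ID come from the earlier steps in this guide
status_url = f"https://api.spellbook.scale.com/egp/v1/data-connectors/{DATA_CONNECTOR_ID}/syncs/{SYNC_ID}"
headers = {
    "accept": "application/json",
    "x-api-key": API_KEY
}

while True:
    sync_details = requests.get(status_url, headers=headers).json()
    print(f"Current sync details: {sync_details}")
    # NOTE: "status" and its values are assumed field names; check the Sync Status response for the exact schema
    if sync_details.get("status") in ("completed", "failed"):
        break
    time.sleep(30)  # Poll every 30 seconds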

And that's all! Once your syncing job has finished, all of your Confluence files will have successfully been ingested into your knowledge base.

What's Next?

With a knowledge base populated with your custom data, you can perform a variety of tasks including:

  • Data retrieval and querying (see the sketch after this list).
  • Building LangChain QA agents that answer questions based on your knowledge base.
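
As a starting point for retrieval, the sketch below illustrates what a knowledge base query might look like. The /query route and the query/top_k payload fields are assumptions based on typical knowledge base APIs rather than something shown in this guide, so check the Knowledge Bases API reference for the exact route and payload:

import requests

# API_KEY and KNOWLEDGE_BASE_ID come from the earlier steps in this guide.
# NOTE: the /query route and the "query"/"top_k" fields are assumptions; confirm them in the API reference.
url = f"https://api.spellbook.scale.com/egp/v1/knowledge-bases/{KNOWLEDGE_BASE_ID}/query"
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "x-api-key": API_KEY
}
payload = {
    "query": "What is our onboarding process?",  # Natural-language question to search for
    "top_k": 5                                   # Number of most similar chunks to return
}
response = requests.post(url, json=payload, headers=headers)
print(f"Response: {response.text}")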