Components
Vector Indexes are made up of various components that you can customise depending on your application's requirements.
class MediaIndex(VectorIndex):
    # 1. Sources - things to load into your index
    sources = [
        ModelSource(model=MyModel),
    ]

    # 2. Embedding transformers - how your source documents will be embedded
    embedding_transformer = CachedEmbeddingTransformer(
        base_transformer=CoreEmbeddingTransformer(
            llm_service=LLMService.create(
                provider="openai",
                model="text-embedding-3-small",
            )
        ),
    )

    # 3. Storage providers - where to store your indexed data, and how to query it
    storage_provider = PgVectorProvider(model=MediaVectorModel)
Sources
Sources define what data is loaded into your index.
A simple source implements get_documents to return an iterable of Document objects:
import requests

from django_ai_core.contrib.index.source import Source

# Document must be imported as well; its import path is not shown here.


class PokemonSource(Source):
    def get_documents(self):
        # base_url ends with a slash, so the paths below must not start with one
        base_url = "https://pokeapi.co/api/v2/"
        pokemon_list = requests.get(f"{base_url}pokemon-species/?limit=50").json()[
            "results"
        ]
        for pokemon in pokemon_list:
            details = requests.get(
                f"{base_url}pokemon-species/{pokemon['name']}"
            ).json()
            name = details['name']
            key = f"pokemon:{name}"
            yield Document(
                document_key=key,
                content=name,
                metadata={},
            )
ModelSource
Django model instances are likely to be among the most frequently indexed things in your application.
The ModelSource provides a convenient way to define a source directly from a queryset or model:
from django_ai_core.contrib.index.source import ModelSource
from .models import MyModel
# Source which fetches all objects for a given model
all_model_source = ModelSource(model=MyModel)
# Source which fetches only objects for the given queryset
qs_model_source = ModelSource(queryset=MyModel.objects.live())
Overriding Indexed Content
By default, a ModelSource will merge all the fields in your model together to be embedded and indexed.
This behaviour can be customised using the content_fields argument:
# Only generate embeddings from the title and description fields
my_model_source = ModelSource(
    model=MyModel,
    content_fields=["title", "description"],
)
For more control, subclass ModelSource and override the get_content method:
class MyModelSource(ModelSource):
    def get_content(self, obj):
        # Use `obj` to build your embedded data however you need
        return f"{obj.title}: {obj.description}"
Customising Metadata
Metadata can be added to indexed objects. This is not used in embeddings but is stored alongside your indexed objects to allow filtering.
Metadata fields can be added using the metadata_fields argument.
For more control, subclass ModelSource and override the get_metadata method:
from datetime import datetime


class MyModelSource(ModelSource):
    def get_metadata(self, obj):
        # Use obj to build your metadata - make sure to keep the metadata generated by
        # the base class around.
        metadata = super().get_metadata(obj)
        metadata['indexed_at'] = datetime.now()
        return metadata
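The override pattern above can be tried without Django using a small stand-in. This is a sketch under stated assumptions: BaseSource is a hypothetical substitute for ModelSource, the pk and category keys are invented for illustration, and the timestamp is stored as an ISO string on the assumption that metadata may need to be serialisable.

```python
from datetime import datetime, timezone


class BaseSource:
    """Hypothetical stand-in for ModelSource; only get_metadata is modelled."""

    def get_metadata(self, obj: dict) -> dict:
        # Assumed base metadata; the real class may record different keys.
        return {"pk": obj["pk"]}


class CategorisedSource(BaseSource):
    def get_metadata(self, obj: dict) -> dict:
        metadata = super().get_metadata(obj)  # keep what the base class produced
        metadata["category"] = obj["category"]
        metadata["indexed_at"] = datetime.now(timezone.utc).isoformat()
        return metadata  # the override must return the dictionary


source = CategorisedSource()
objects = [
    {"pk": 1, "category": "video"},
    {"pk": 2, "category": "audio"},
]
all_metadata = [source.get_metadata(obj) for obj in objects]

# Metadata plays no part in embeddings; it is only used to narrow results.
videos = [m["pk"] for m in all_metadata if m["category"] == "video"]
```

Here videos contains only the primary key of the object whose category is "video", while every entry keeps the pk supplied by the base class.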
Chunking Documents
To fit within an embedding model's token limits, it may be necessary to split your content into multiple Documents that each fit within these limits.
The ModelSource uses a Chunk Transformer to split strings into multiple strings before they are converted to Documents.
This defaults to the SimpleChunkTransformer, but can be overridden:
from django_ai_core.contrib.index.chunking import SentenceChunkTransformer
my_model_source = ModelSource(
    model=MyModel,
    chunk_transformer=SentenceChunkTransformer(),
)
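To make the idea of sentence-based chunking concrete, the following is a minimal standalone sketch. It is not the implementation of SentenceChunkTransformer; the function name chunk_by_sentence and the max_chars parameter are hypothetical, and a character count is used here as a rough proxy for a token limit.

```python
import re


def chunk_by_sentence(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            # The sentence fits, or it is the first sentence of a new chunk.
            current = candidate
        else:
            # The sentence would overflow: close this chunk and start another.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_by_sentence("One. Two! Three? Four.", max_chars=10)
```

Splitting on sentence boundaries keeps each chunk readable on its own, which tends to matter more for embedding quality than filling every chunk to the limit.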
Embedding Transformers
Embedding Transformers take Documents produced by Sources and pass them to an AI model for embedding.
CoreEmbeddingTransformer
This package comes with a built-in embedding transformer that uses the core LLMService to embed documents.
It can be instantiated with an LLMService instance:
from django_ai_core.llm import LLMService
from django_ai_core.contrib.index import CoreEmbeddingTransformer
embedding_transformer = CoreEmbeddingTransformer(
    llm_service=LLMService.create(
        provider="openai",
        model="text-embedding-3-small",
    )
)

my_documents = [
    Document(...),
    # More documents
]
embedding_transformer.embed_documents(my_documents)
CachedEmbeddingTransformer
This transformer does no embedding itself; it wraps the base_transformer passed to it when instantiated. It caches every embedding request that comes through it in a Django model, so that regenerating an index can reuse cached embeddings instead of fetching them again from the LLM.
Recommended
Using this transformer is recommended as it can significantly decrease LLM costs.
from django_ai_core.llm import LLMService
from django_ai_core.contrib.index import CoreEmbeddingTransformer, CachedEmbeddingTransformer
embedding_transformer = CachedEmbeddingTransformer(
    base_transformer=CoreEmbeddingTransformer(
        llm_service=LLMService.create(
            provider="openai",
            model="text-embedding-3-small",
        )
    )
)

my_documents = [
    Document(...),
    # More documents
]
embedding_transformer.embed_documents(my_documents)
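The wrapping-and-caching idea can be illustrated without any LLM calls. The sketch below is not how CachedEmbeddingTransformer is implemented: CachingEmbedder and fake_embed are hypothetical names, an in-memory dictionary stands in for the Django model, and keying the cache on a hash of the text alone is an assumption (a real cache would probably also need to distinguish embedding models).

```python
import hashlib
from typing import Callable


class CachingEmbedder:
    """Wraps a base embedding function and reuses results for repeated text."""

    def __init__(self, base_embed: Callable[[str], list[float]]):
        self.base_embed = base_embed
        # In-memory stand-in for a persistent cache table.
        self.cache: dict[str, list[float]] = {}

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.cache:
            # Cache miss: call the wrapped embedder once and remember the result.
            self.cache[key] = self.base_embed(text)
        return self.cache[key]


calls: list[str] = []


def fake_embed(text: str) -> list[float]:
    """Pretend embedding model that records every call it receives."""
    calls.append(text)
    return [float(len(text))]


embedder = CachingEmbedder(fake_embed)
first = embedder.embed("bulbasaur")
second = embedder.embed("bulbasaur")  # served from the cache
```

After both calls, calls holds a single entry, which is the saving the cache provides when an index is rebuilt over unchanged content.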
Storage Providers
Storage Providers define where the index is created and how it is searched.
Two storage providers are currently supported: