DuckDB Integration¶
One of Quiver's superpowers is its seamless integration with DuckDB for metadata storage and querying. This integration enables powerful SQL-based filtering and hybrid search capabilities. Let's dive into how Quiver leverages DuckDB to enhance vector search! 🦆
What is DuckDB?¶
DuckDB is an in-process SQL OLAP database management system designed for analytical queries. It's often described as "SQLite for analytics" and offers several advantages:
- Extremely fast analytical queries
- Columnar storage format
- In-process operation (no client-server architecture)
- Rich SQL support
- JSON functions
- Low memory footprint
These characteristics make DuckDB an ideal companion for vector search, allowing Quiver to provide rich metadata filtering capabilities alongside vector similarity search.
How Quiver Uses DuckDB¶
Quiver uses DuckDB for several key functions:
Metadata Storage¶
When you add vectors to Quiver with metadata:
idx.Add(1, vector, map[string]interface{}{
"category": "document",
"title": "Introduction to Vector Databases",
"tags": []string{"database", "vector", "tutorial"},
"created_at": time.Now().Unix(),
})
Quiver stores this metadata in DuckDB with a simple schema:
The id
column matches the vector ID, and the json
column stores the metadata as a JSON object.
SQL Filtering¶
DuckDB's SQL capabilities enable powerful filtering in hybrid searches:
// Search for vectors similar to queryVector that match specific metadata criteria
results, err := idx.SearchWithFilter(queryVector, 10,
"json_extract(json, '$.category') = 'document' AND json_array_contains(json_extract(json, '$.tags'), 'vector')")
This allows you to combine vector similarity with precise metadata filtering, enabling use cases like:
- Finding similar products within a specific price range
- Searching for documents in a particular category
- Filtering images by specific attributes
JSON Functions¶
DuckDB provides rich JSON functions that Quiver exposes for metadata queries:
// Extract nested JSON properties
"json_extract(json, '$.user.preferences.theme') = 'dark'"
// Check if an array contains a value
"json_array_contains(json_extract(json, '$.tags'), 'important')"
// Get array length
"json_array_length(json_extract(json, '$.references')) > 5"
These functions make it easy to work with complex nested metadata structures.
Implementation Details¶
Connection Management¶
Quiver manages DuckDB connections efficiently:
type DuckDB struct {
mu sync.Mutex
db adbc.Database
driver adbc.Driver
opts DuckDBOptions
conns []*DuckDBConn // track open connections
}
type DuckDBConn struct {
parent *DuckDB
conn adbc.Connection
}
Connections are pooled and reused to minimize overhead.
ADBC Interface¶
Quiver uses the Arrow Database Connectivity (ADBC) interface to communicate with DuckDB, which provides:
- Efficient data transfer using Apache Arrow
- Reduced serialization overhead
- Better type safety
// Execute a SQL query
func (c *DuckDBConn) Query(ctx context.Context, sql string) (array.RecordReader, adbc.Statement, int64, error) {
stmt, err := c.conn.NewStatement()
if err != nil {
return nil, nil, 0, fmt.Errorf("failed to create statement: %w", err)
}
if err := stmt.SetSqlQuery(sql); err != nil {
stmt.Close()
return nil, nil, 0, fmt.Errorf("failed to set SQL query: %w", err)
}
reader, err := stmt.ExecuteQuery(ctx)
if err != nil {
stmt.Close()
return nil, nil, 0, fmt.Errorf("failed to execute query: %w", err)
}
return reader, stmt, 0, nil
}
Metadata Caching¶
To improve performance, Quiver caches metadata in memory:
// Get metadata for a vector
func (idx *Index) getMetadata(id uint64) map[string]interface{} {
// Check cache first
if meta, ok := idx.cache.Load(id); ok {
return meta.(map[string]interface{})
}
// If not in cache, query DuckDB
query := fmt.Sprintf("SELECT json FROM metadata WHERE id = %d", id)
results, err := idx.QueryMetadata(query)
if err != nil || len(results) == 0 {
return nil
}
// Store in cache for future use
idx.cache.Store(id, results[0])
return results[0]
}
This caching strategy reduces database load for frequently accessed metadata.
Hybrid Search Strategies¶
Quiver's DuckDB integration enables several hybrid search strategies:
Filter-Then-Search¶
For highly selective filters, Quiver can:
- First query DuckDB to find matching IDs
- Then perform vector search only on those IDs
// Example implementation of filter-then-search
func filterThenSearch(idx *Index, query []float32, filter string, k int) ([]SearchResult, error) {
// Step 1: Get IDs matching the filter
sql := fmt.Sprintf("SELECT id FROM metadata WHERE %s", filter)
metaResults, err := idx.QueryMetadata(sql)
if err != nil {
return nil, err
}
// Extract IDs
ids := make([]uint64, len(metaResults))
for i, result := range metaResults {
ids[i] = result["id"].(uint64)
}
// Step 2: Search only those IDs
return idx.SearchSubset(query, ids, k)
}
This approach is efficient when the filter is highly selective (returns few results).
Search-Then-Filter¶
For less selective filters, Quiver can:
- First perform vector search to find similar vectors
- Then filter the results by metadata
// Example implementation of search-then-filter
func searchThenFilter(idx *Index, query []float32, filter string, k int) ([]SearchResult, error) {
// Step 1: Perform vector search
// Get more results than needed to account for filtering
results, err := idx.Search(query, k*10, 1, k*10)
if err != nil {
return nil, err
}
// Step 2: Filter results
filtered := make([]SearchResult, 0, k)
// Extract IDs
ids := make([]uint64, len(results))
for i, result := range results {
ids[i] = result.ID
}
// Query metadata for these IDs
sql := fmt.Sprintf("SELECT id FROM metadata WHERE id IN (%s) AND %s",
joinUint64(ids, ","), filter)
metaResults, err := idx.QueryMetadata(sql)
if err != nil {
return nil, err
}
// Create a set of matching IDs
matchingIDs := make(map[uint64]bool)
for _, result := range metaResults {
matchingIDs[result["id"].(uint64)] = true
}
// Filter the search results
for _, result := range results {
if matchingIDs[result.ID] {
filtered = append(filtered, result)
if len(filtered) >= k {
break
}
}
}
return filtered, nil
}
This approach is efficient when the filter is less selective (returns many results).
Adaptive Hybrid Search¶
Quiver automatically chooses the most efficient strategy based on the filter:
// Simplified version of Quiver's hybrid search
func (idx *Index) SearchWithFilter(query []float32, k int, filter string) ([]SearchResult, error) {
// Estimate the selectivity of the filter
countSQL := fmt.Sprintf("SELECT COUNT(*) FROM metadata WHERE %s", filter)
count, err := idx.estimateFilterCount(countSQL)
if err != nil {
return nil, err
}
// Choose strategy based on selectivity
if count < 1000 {
// Filter is selective, use filter-then-search
return idx.filterThenSearch(query, filter, k)
} else {
// Filter is not selective, use search-then-filter
return idx.searchThenFilter(query, filter, k)
}
}
This adaptive approach ensures optimal performance for different types of queries.
Performance Considerations¶
Query Optimization¶
DuckDB is already highly optimized for analytical queries, but there are still some best practices:
- Be specific in your filters to reduce the result set
- Use appropriate JSON functions for nested data
- Avoid complex joins or subqueries if possible
Memory Usage¶
DuckDB operates in-process, so it shares memory with Quiver:
config := quiver.Config{
// ... other settings ...
StoragePath: "./data/vectors.db", // Path to DuckDB file
}
For large datasets, consider:
- Using a file-based DuckDB database instead of in-memory
- Monitoring memory usage
- Adjusting cache sizes if needed
Persistence¶
Quiver automatically persists metadata to DuckDB:
// Persist metadata to DuckDB
func (idx *Index) persistMetadata() error {
// Begin transaction
_, err := idx.dbConn.Exec(context.Background(), "BEGIN TRANSACTION")
if err != nil {
return err
}
// Insert or update metadata
for id, meta := range idx.metadataToUpdate {
jsonData, _ := json.Marshal(meta)
sql := fmt.Sprintf("INSERT OR REPLACE INTO metadata (id, json) VALUES (%d, '%s')",
id, string(jsonData))
_, err := idx.dbConn.Exec(context.Background(), sql)
if err != nil {
idx.dbConn.Exec(context.Background(), "ROLLBACK")
return err
}
}
// Commit transaction
_, err = idx.dbConn.Exec(context.Background(), "COMMIT")
return err
}
This ensures that metadata is safely stored on disk and can be recovered after a restart.
Example Queries¶
Here are some example metadata queries you can use with Quiver:
Basic Filtering¶
// Find documents in the "science" category
results, _ := idx.SearchWithFilter(queryVector, 10, "category = 'science'")
// Find products in a price range
results, _ := idx.SearchWithFilter(queryVector, 10, "price >= 10.0 AND price <= 50.0")
// Find items created in the last week
results, _ := idx.SearchWithFilter(queryVector, 10,
fmt.Sprintf("created_at >= %d", time.Now().AddDate(0, 0, -7).Unix()))
Working with Arrays¶
// Find items with specific tag
results, _ := idx.SearchWithFilter(queryVector, 10,
"json_array_contains(tags, 'important')")
// Find items with at least 3 tags
results, _ := idx.SearchWithFilter(queryVector, 10,
"json_array_length(tags) >= 3")
Nested Properties¶
// Find items with specific nested property
results, _ := idx.SearchWithFilter(queryVector, 10,
"json_extract(json, '$.details.publisher') = 'Nature'")
// Find items with high rating
results, _ := idx.SearchWithFilter(queryVector, 10,
"CAST(json_extract(json, '$.ratings.average') AS FLOAT) >= 4.5")
Next Steps¶
Now that you understand how Quiver integrates with DuckDB, check out:
- Performance Tuning - Optimize Quiver for your needs
- Benchmarking - Measure Quiver's performance
- Metadata & Filtering - Learn more about metadata and filtering