Skip to content

DuckDB Integration

One of Quiver's superpowers is its seamless integration with DuckDB for metadata storage and querying. This integration enables powerful SQL-based filtering and hybrid search capabilities. Let's dive into how Quiver leverages DuckDB to enhance vector search! 🦆

What is DuckDB?

DuckDB is an in-process SQL OLAP database management system designed for analytical queries. It's often described as "SQLite for analytics" and offers several advantages:

  • Extremely fast analytical queries
  • Columnar storage format
  • In-process operation (no client-server architecture)
  • Rich SQL support
  • JSON functions
  • Low memory footprint

These characteristics make DuckDB an ideal companion for vector search, allowing Quiver to provide rich metadata filtering capabilities alongside vector similarity search.

How Quiver Uses DuckDB

Quiver uses DuckDB for several key functions:

Metadata Storage

When you add vectors to Quiver with metadata:

idx.Add(1, vector, map[string]interface{}{
    "category": "document",
    "title": "Introduction to Vector Databases",
    "tags": []string{"database", "vector", "tutorial"},
    "created_at": time.Now().Unix(),
})

Quiver stores this metadata in DuckDB with a simple schema:

CREATE TABLE metadata (
    id BIGINT PRIMARY KEY,
    json JSON
)

The id column matches the vector ID, and the json column stores the metadata as a JSON object.

SQL Filtering

DuckDB's SQL capabilities enable powerful filtering in hybrid searches:

// Search for vectors similar to queryVector that match specific metadata criteria
results, err := idx.SearchWithFilter(queryVector, 10, 
    "json_extract(json, '$.category') = 'document' AND json_array_contains(json_extract(json, '$.tags'), 'vector')")

This allows you to combine vector similarity with precise metadata filtering, enabling use cases like:

  • Finding similar products within a specific price range
  • Searching for documents in a particular category
  • Filtering images by specific attributes

JSON Functions

DuckDB provides rich JSON functions that Quiver exposes for metadata queries:

// Extract nested JSON properties
"json_extract(json, '$.user.preferences.theme') = 'dark'"

// Check if an array contains a value
"json_array_contains(json_extract(json, '$.tags'), 'important')"

// Get array length
"json_array_length(json_extract(json, '$.references')) > 5"

These functions make it easy to work with complex nested metadata structures.

Implementation Details

Connection Management

Quiver manages DuckDB connections efficiently:

type DuckDB struct {
    mu     sync.Mutex
    db     adbc.Database
    driver adbc.Driver
    opts   DuckDBOptions

    conns []*DuckDBConn // track open connections
}

type DuckDBConn struct {
    parent *DuckDB
    conn   adbc.Connection
}

Connections are pooled and reused to minimize overhead.

ADBC Interface

Quiver uses the Arrow Database Connectivity (ADBC) interface to communicate with DuckDB, which provides:

  • Efficient data transfer using Apache Arrow
  • Reduced serialization overhead
  • Better type safety
// Execute a SQL query
func (c *DuckDBConn) Query(ctx context.Context, sql string) (array.RecordReader, adbc.Statement, int64, error) {
    stmt, err := c.conn.NewStatement()
    if err != nil {
        return nil, nil, 0, fmt.Errorf("failed to create statement: %w", err)
    }

    if err := stmt.SetSqlQuery(sql); err != nil {
        stmt.Close()
        return nil, nil, 0, fmt.Errorf("failed to set SQL query: %w", err)
    }

    reader, err := stmt.ExecuteQuery(ctx)
    if err != nil {
        stmt.Close()
        return nil, nil, 0, fmt.Errorf("failed to execute query: %w", err)
    }

    return reader, stmt, 0, nil
}

Metadata Caching

To improve performance, Quiver caches metadata in memory:

// Get metadata for a vector
func (idx *Index) getMetadata(id uint64) map[string]interface{} {
    // Check cache first
    if meta, ok := idx.cache.Load(id); ok {
        return meta.(map[string]interface{})
    }

    // If not in cache, query DuckDB
    query := fmt.Sprintf("SELECT json FROM metadata WHERE id = %d", id)
    results, err := idx.QueryMetadata(query)
    if err != nil || len(results) == 0 {
        return nil
    }

    // Store in cache for future use
    idx.cache.Store(id, results[0])

    return results[0]
}

This caching strategy reduces database load for frequently accessed metadata.

Hybrid Search Strategies

Quiver's DuckDB integration enables several hybrid search strategies:

For highly selective filters, Quiver can:

  1. First query DuckDB to find matching IDs
  2. Then perform vector search only on those IDs
// Example implementation of filter-then-search
func filterThenSearch(idx *Index, query []float32, filter string, k int) ([]SearchResult, error) {
    // Step 1: Get IDs matching the filter
    sql := fmt.Sprintf("SELECT id FROM metadata WHERE %s", filter)
    metaResults, err := idx.QueryMetadata(sql)
    if err != nil {
        return nil, err
    }

    // Extract IDs
    ids := make([]uint64, len(metaResults))
    for i, result := range metaResults {
        ids[i] = result["id"].(uint64)
    }

    // Step 2: Search only those IDs
    return idx.SearchSubset(query, ids, k)
}

This approach is efficient when the filter is highly selective (returns few results).

Search-Then-Filter

For less selective filters, Quiver can:

  1. First perform vector search to find similar vectors
  2. Then filter the results by metadata
// Example implementation of search-then-filter
func searchThenFilter(idx *Index, query []float32, filter string, k int) ([]SearchResult, error) {
    // Step 1: Perform vector search
    // Get more results than needed to account for filtering
    results, err := idx.Search(query, k*10, 1, k*10)
    if err != nil {
        return nil, err
    }

    // Step 2: Filter results
    filtered := make([]SearchResult, 0, k)

    // Extract IDs
    ids := make([]uint64, len(results))
    for i, result := range results {
        ids[i] = result.ID
    }

    // Query metadata for these IDs
    sql := fmt.Sprintf("SELECT id FROM metadata WHERE id IN (%s) AND %s",
        joinUint64(ids, ","), filter)
    metaResults, err := idx.QueryMetadata(sql)
    if err != nil {
        return nil, err
    }

    // Create a set of matching IDs
    matchingIDs := make(map[uint64]bool)
    for _, result := range metaResults {
        matchingIDs[result["id"].(uint64)] = true
    }

    // Filter the search results
    for _, result := range results {
        if matchingIDs[result.ID] {
            filtered = append(filtered, result)
            if len(filtered) >= k {
                break
            }
        }
    }

    return filtered, nil
}

This approach is efficient when the filter is less selective (returns many results).

Quiver automatically chooses the most efficient strategy based on the filter:

// Simplified version of Quiver's hybrid search
func (idx *Index) SearchWithFilter(query []float32, k int, filter string) ([]SearchResult, error) {
    // Estimate the selectivity of the filter
    countSQL := fmt.Sprintf("SELECT COUNT(*) FROM metadata WHERE %s", filter)
    count, err := idx.estimateFilterCount(countSQL)
    if err != nil {
        return nil, err
    }

    // Choose strategy based on selectivity
    if count < 1000 {
        // Filter is selective, use filter-then-search
        return idx.filterThenSearch(query, filter, k)
    } else {
        // Filter is not selective, use search-then-filter
        return idx.searchThenFilter(query, filter, k)
    }
}

This adaptive approach ensures optimal performance for different types of queries.

Performance Considerations

Query Optimization

DuckDB is already highly optimized for analytical queries, but there are still some best practices:

  • Be specific in your filters to reduce the result set
  • Use appropriate JSON functions for nested data
  • Avoid complex joins or subqueries if possible

Memory Usage

DuckDB operates in-process, so it shares memory with Quiver:

config := quiver.Config{
    // ... other settings ...
    StoragePath: "./data/vectors.db", // Path to DuckDB file
}

For large datasets, consider:

  • Using a file-based DuckDB database instead of in-memory
  • Monitoring memory usage
  • Adjusting cache sizes if needed

Persistence

Quiver automatically persists metadata to DuckDB:

// Persist metadata to DuckDB
func (idx *Index) persistMetadata() error {
    // Begin transaction
    _, err := idx.dbConn.Exec(context.Background(), "BEGIN TRANSACTION")
    if err != nil {
        return err
    }

    // Insert or update metadata
    for id, meta := range idx.metadataToUpdate {
        jsonData, _ := json.Marshal(meta)
        sql := fmt.Sprintf("INSERT OR REPLACE INTO metadata (id, json) VALUES (%d, '%s')",
            id, string(jsonData))
        _, err := idx.dbConn.Exec(context.Background(), sql)
        if err != nil {
            idx.dbConn.Exec(context.Background(), "ROLLBACK")
            return err
        }
    }

    // Commit transaction
    _, err = idx.dbConn.Exec(context.Background(), "COMMIT")
    return err
}

This ensures that metadata is safely stored on disk and can be recovered after a restart.

Example Queries

Here are some example metadata queries you can use with Quiver:

Basic Filtering

// Find documents in the "science" category
results, _ := idx.SearchWithFilter(queryVector, 10, "category = 'science'")

// Find products in a price range
results, _ := idx.SearchWithFilter(queryVector, 10, "price >= 10.0 AND price <= 50.0")

// Find items created in the last week
results, _ := idx.SearchWithFilter(queryVector, 10, 
    fmt.Sprintf("created_at >= %d", time.Now().AddDate(0, 0, -7).Unix()))

Working with Arrays

// Find items with specific tag
results, _ := idx.SearchWithFilter(queryVector, 10, 
    "json_array_contains(tags, 'important')")

// Find items with at least 3 tags
results, _ := idx.SearchWithFilter(queryVector, 10, 
    "json_array_length(tags) >= 3")

Nested Properties

// Find items with specific nested property
results, _ := idx.SearchWithFilter(queryVector, 10, 
    "json_extract(json, '$.details.publisher') = 'Nature'")

// Find items with high rating
results, _ := idx.SearchWithFilter(queryVector, 10, 
    "CAST(json_extract(json, '$.ratings.average') AS FLOAT) >= 4.5")

Next Steps

Now that you understand how Quiver integrates with DuckDB, check out: