Walker

The Walker component in FlashFS is responsible for traversing file systems and collecting metadata about files and directories. It provides the foundation for creating snapshots by efficiently gathering information about the file system structure.

Overview

The Walker component:

  • Traverses directories recursively
  • Collects metadata about files and directories
  • Computes content hashes for files
  • Handles errors and edge cases during traversal
  • Supports context-based cancellation
  • Provides streaming implementations for memory efficiency and responsiveness

Walker Implementations

FlashFS provides multiple implementations for walking directory trees, each with different characteristics and use cases:

Standard Walker (Walk)

The standard walker is a non-streaming implementation that collects all entries in memory before returning them as a slice.

entries, err := walker.Walk(rootDir)
if err != nil {
    // handle error
}

for _, entry := range entries {
    // process entry
}

Callback-based Streaming Walker (WalkStreamWithCallback)

The callback-based streaming walker processes entries as they're discovered, calling a user-provided function for each entry.

err := walker.WalkStreamWithCallback(context.Background(), rootDir, walker.DefaultWalkOptions(), 
    func(entry walker.SnapshotEntry) error {
        // process entry
        return nil // returning a non-nil error stops the walk
    })
if err != nil {
    // handle error
}

Channel-based Streaming Walker (WalkStream)

The channel-based streaming walker returns channels for entries and errors, allowing for concurrent processing.

entryChan, errChan := walker.WalkStream(context.Background(), rootDir)

// Process entries as they arrive
for entry := range entryChan {
    // process entry
}

// Check for errors after all entries have been processed
if err := <-errChan; err != nil {
    // handle error
}

Core Data Structure

The primary data structure used by the Walker is the SnapshotEntry:

type SnapshotEntry struct {
    Path          string      // Relative path from the root
    Size          int64       // Size in bytes
    ModTime       time.Time   // Modification time
    IsDir         bool        // True if it's a directory
    Permissions   fs.FileMode // File permissions
    Hash          string      // Hash of the file content (if computed)
    SymlinkTarget string      // Target of the symlink (if it's a symlink)
    Error         error       // Error encountered while processing this entry, if any
}

This structure captures essential information about each file or directory.
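
As a quick illustration, here is a minimal sketch of inspecting the fields of a SnapshotEntry; the describeEntry helper is hypothetical and not part of the FlashFS API.

func describeEntry(entry walker.SnapshotEntry) string {
    switch {
    case entry.Error != nil:
        // An entry-level error recorded during the walk (for example, a permission problem).
        return fmt.Sprintf("error at %s: %v", entry.Path, entry.Error)
    case entry.IsDir:
        return fmt.Sprintf("dir  %s (mode %v)", entry.Path, entry.Permissions)
    case entry.SymlinkTarget != "":
        return fmt.Sprintf("link %s -> %s", entry.Path, entry.SymlinkTarget)
    default:
        return fmt.Sprintf("file %s (%d bytes, hash %q)", entry.Path, entry.Size, entry.Hash)
    }
}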

Performance Characteristics

Based on benchmarks, here are the performance characteristics of each implementation:

With Hashing Enabled

Implementation              | Operations/sec | Time/op  | Memory/op | Allocations/op
StandardWalkDir (Go stdlib) | 4,286          | 261.9 µs | 63.2 KB   | 757
Walk                        | 631            | 1.86 ms  | 12.3 MB   | 4,678
WalkStreamWithCallback      | 579            | 2.09 ms  | 13.0 MB   | 4,437
WalkStream                  | 579            | 2.11 ms  | 13.7 MB   | 4,441

Without Hashing

Implementation              | Operations/sec | Time/op  | Memory/op | Allocations/op
Walk                        | 1,642          | 728.7 µs | 277.1 KB  | 2,077
WalkStreamWithCallback      | 1,101          | 1.09 ms  | 185.4 KB  | 1,636
WalkStream                  | 1,056          | 1.11 ms  | 188.1 KB  | 1,641

When to Use Each Implementation

Use the Standard Walker (Walk) when

  • You need all entries before processing can begin
  • The directory structure is small to medium-sized
  • Simplicity is preferred over advanced features
  • You want slightly better performance for small to medium-sized directories

Use the Callback-based Streaming Walker (WalkStreamWithCallback) when

  • You want to process entries as they're discovered
  • You prefer a callback-based programming style
  • You need to handle very large directory structures
  • You want to provide progress updates during the walk
  • Memory efficiency is important

Use the Channel-based Streaming Walker (WalkStream) when

  • You want to process entries as they're discovered
  • You prefer a channel-based programming style
  • You need to integrate with other Go concurrency patterns
  • You want to process entries concurrently with other operations
  • You need to handle very large directory structures
  • Memory efficiency is important

Scalability Considerations

The streaming walker implementations (both callback and channel-based) offer significant advantages for large directory structures:

  1. Memory Efficiency: Since entries are processed as they're discovered, the memory footprint remains relatively constant regardless of the directory size.

  2. Responsiveness: Users can start processing entries immediately, rather than waiting for the entire walk to complete.

  3. Cancellation: Operations can be cancelled mid-walk using context cancellation.

  4. Progress Reporting: Real-time progress updates can be provided during the walk.

For very large directory structures (millions of files), the streaming implementations are strongly recommended: they avoid out-of-memory errors and provide a better user experience.
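
To illustrate points 2–4 above, here is a minimal, hedged sketch that bounds a walk with a timeout context and reports progress as entries arrive; the exact error returned when the context is cancelled depends on the implementation.

ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

visited := 0
err := walker.WalkStreamWithCallback(ctx, rootDir, walker.DefaultWalkOptions(),
    func(entry walker.SnapshotEntry) error {
        visited++
        if visited%10000 == 0 {
            // Report progress while the walk is still in flight.
            fmt.Printf("visited %d entries so far\n", visited)
        }
        return nil
    })
if err != nil {
    // A cancelled or expired context ends the walk early.
    fmt.Printf("walk stopped: %v\n", err)
}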

Configuration Options

The Walker component can be configured using the WalkOptions struct:

type WalkOptions struct {
    // ComputeHashes determines whether file hashes should be computed.
    ComputeHashes bool
    // FollowSymlinks determines whether symbolic links should be followed.
    FollowSymlinks bool
    // MaxDepth is the maximum directory depth to traverse (0 means no limit).
    MaxDepth int
    // NumWorkers specifies the number of worker goroutines for parallel processing.
    NumWorkers int
    // HashAlgorithm specifies the hashing algorithm to use, e.g. "BLAKE3" or "MD5".
    HashAlgorithm string
    // SkipErrors determines whether errors encountered during the walk are skipped instead of aborting the walk.
    SkipErrors bool
    // ExcludePatterns is a list of filepath.Match patterns to exclude from the walk.
    ExcludePatterns []string
    // UsePartialHashing enables partial hashing for large files.
    UsePartialHashing bool
    // PartialHashingThreshold is the file size threshold in bytes above which
    // partial hashing will be used (if enabled). Default is 10MB.
    PartialHashingThreshold int64
}

Default options can be obtained using:

options := walker.DefaultWalkOptions()
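
Continuing from the defaults above, the following is a rough sketch of adjusting individual fields before starting a walk; the depth and exclude patterns are purely illustrative values.

options.ComputeHashes = false                       // metadata only, no content hashing
options.MaxDepth = 5                                // stop descending after five levels
options.SkipErrors = true                           // keep walking past unreadable entries
options.ExcludePatterns = []string{"*.tmp", ".git"} // filepath.Match patterns to skip

err := walker.WalkStreamWithCallback(context.Background(), rootDir, options,
    func(entry walker.SnapshotEntry) error {
        // process entry
        return nil
    })
if err != nil {
    // handle error
}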

Example: Processing a Large Directory Tree

// Using the callback-based streaming walker
ctx := context.Background() // or a cancellable context for long-running walks
processedCount := 0

err := walker.WalkStreamWithCallback(ctx, rootDir, walker.DefaultWalkOptions(), 
    func(entry walker.SnapshotEntry) error {
        processedCount++

        // Process the entry
        if !entry.IsDir {
            // Do something with the file
            fmt.Printf("Processing file %s (%d/%d)\n", entry.Path, processedCount, totalCount)
        }

        return nil
    })

if err != nil {
    fmt.Printf("Error walking directory: %v\n", err)
}

Example: Concurrent Processing with Channels

// Using the channel-based streaming walker
ctx := context.Background() // or a cancellable context for long-running walks
entryChan, errChan := walker.WalkStream(ctx, rootDir)

// Create a worker pool
const numWorkers = 4
var wg sync.WaitGroup
wg.Add(numWorkers)

// Start workers
for i := 0; i < numWorkers; i++ {
    go func() {
        defer wg.Done()
        for entry := range entryChan {
            if !entry.IsDir {
                // Process the file
                processFile(entry)
            }
        }
    }()
}

// Wait for all entries to be processed
wg.Wait()

// Check for errors
if err := <-errChan; err != nil {
    fmt.Printf("Error walking directory: %v\n", err)
}

Integration with Other Components

The Walker integrates with:

  • Serializer: Provides file metadata to be serialized
  • Storage: Indirectly supplies the data for snapshots
  • Diff: Enables comparison by providing consistent metadata

Conclusion

FlashFS provides multiple walker implementations to suit different use cases and programming styles. For most applications, the streaming implementations offer the best balance of features, scalability, and usability, especially for large directory structures. The standard walker provides slightly better performance for small to medium-sized directories when all entries need to be collected before processing.