Walker

The Walker component in FlashFS is responsible for traversing file systems and collecting metadata about files and directories. It provides the foundation for creating snapshots by efficiently gathering information about the file system structure.

Overview

The Walker component:

  • Traverses directories recursively
  • Collects metadata about files and directories
  • Computes content hashes for files
  • Handles errors and edge cases during traversal
  • Supports context-based cancellation
  • Provides streaming implementations for memory efficiency and responsiveness

Walker Implementations

FlashFS provides multiple implementations for walking directory trees, each with different characteristics and use cases:

Standard Walker (Walk)

The standard walker is a non-streaming implementation that collects all entries in memory before returning them as a slice.

entries, err := walker.Walk(rootDir)
if err != nil {
    // handle error
}

for _, entry := range entries {
    // process entry
}

Callback-based Streaming Walker (WalkStreamWithCallback)

The callback-based streaming walker processes entries as they're discovered, calling a user-provided function for each entry.

err := walker.WalkStreamWithCallback(context.Background(), rootDir, walker.DefaultWalkOptions(), 
    func(entry walker.SnapshotEntry) error {
        // process entry
        return nil // returning a non-nil error stops the walk
    })
if err != nil {
    // handle error
}

Channel-based Streaming Walker (WalkStream)

The channel-based streaming walker returns channels for entries and errors, allowing for concurrent processing.

entryChan, errChan := walker.WalkStream(context.Background(), rootDir)

// Process entries as they arrive
for entry := range entryChan {
    // process entry
}

// Check for errors after all entries have been processed
if err := <-errChan; err != nil {
    // handle error
}

Core Data Structure

The primary data structure used by the Walker is the SnapshotEntry:

type SnapshotEntry struct {
    Path          string      // Relative path from the root
    Size          int64       // Size in bytes
    ModTime       time.Time   // Modification time
    IsDir         bool        // True if it's a directory
    Permissions   fs.FileMode // File permissions
    Hash          string      // Hash of the file content (if computed)
    SymlinkTarget string      // Target of the symlink (if it's a symlink)
    Error         error       // Error encountered while processing this entry, if any
}

This structure captures essential information about each file or directory.
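
As a quick illustration, here is a minimal sketch of inspecting the fields of a SnapshotEntry; the describeEntry helper is hypothetical and not part of the FlashFS API.

func describeEntry(entry walker.SnapshotEntry) string {
    switch {
    case entry.Error != nil:
        // An entry-level error recorded during the walk (for example, a permission problem).
        return fmt.Sprintf("error at %s: %v", entry.Path, entry.Error)
    case entry.IsDir:
        return fmt.Sprintf("dir  %s (mode %v)", entry.Path, entry.Permissions)
    case entry.SymlinkTarget != "":
        return fmt.Sprintf("link %s -> %s", entry.Path, entry.SymlinkTarget)
    default:
        return fmt.Sprintf("file %s (%d bytes, hash %q)", entry.Path, entry.Size, entry.Hash)
    }
}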

Performance Characteristics

Based on benchmarks, here are the performance characteristics of each implementation:

With Hashing Enabled

Implementation              | Operations/sec | Time/op  | Memory/op | Allocations/op
StandardWalkDir (Go stdlib) | 4,286          | 261.9 µs | 63.2 KB   | 757
Walk                        | 631            | 1.86 ms  | 12.3 MB   | 4,678
WalkStreamWithCallback      | 579            | 2.09 ms  | 13.0 MB   | 4,437
WalkStream                  | 579            | 2.11 ms  | 13.7 MB   | 4,441

Without Hashing

Implementation              | Operations/sec | Time/op  | Memory/op | Allocations/op
Walk                        | 1,642          | 728.7 µs | 277.1 KB  | 2,077
WalkStreamWithCallback      | 1,101          | 1.09 ms  | 185.4 KB  | 1,636
WalkStream                  | 1,056          | 1.11 ms  | 188.1 KB  | 1,641

When to Use Each Implementation

Use the Standard Walker (Walk) when

  • You need all entries before processing can begin
  • The directory structure is small to medium-sized
  • Simplicity is preferred over advanced features
  • You want slightly better performance for small to medium-sized directories

Use the Callback-based Streaming Walker (WalkStreamWithCallback) when

  • You want to process entries as they're discovered
  • You prefer a callback-based programming style
  • You need to handle very large directory structures
  • You want to provide progress updates during the walk
  • Memory efficiency is important

Use the Channel-based Streaming Walker (WalkStream) when

  • You want to process entries as they're discovered
  • You prefer a channel-based programming style
  • You need to integrate with other Go concurrency patterns
  • You want to process entries concurrently with other operations
  • You need to handle very large directory structures
  • Memory efficiency is important

Scalability Considerations

The streaming walker implementations (both callback and channel-based) offer significant advantages for large directory structures:

  1. Memory Efficiency: Since entries are processed as they're discovered, the memory footprint remains relatively constant regardless of the directory size.

  2. Responsiveness: Users can start processing entries immediately, rather than waiting for the entire walk to complete.

  3. Cancellation: Operations can be cancelled mid-walk using context cancellation.

  4. Progress Reporting: Real-time progress updates can be provided during the walk.

For very large directory structures (millions of files), the streaming implementations are strongly recommended: they avoid out-of-memory errors and provide a better user experience.
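
To illustrate points 2–4 above, here is a minimal, hedged sketch that bounds a walk with a timeout context and reports progress as entries arrive; the exact error returned when the context is cancelled depends on the implementation.

ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

visited := 0
err := walker.WalkStreamWithCallback(ctx, rootDir, walker.DefaultWalkOptions(),
    func(entry walker.SnapshotEntry) error {
        visited++
        if visited%10000 == 0 {
            // Report progress while the walk is still in flight.
            fmt.Printf("visited %d entries so far\n", visited)
        }
        return nil
    })
if err != nil {
    // A cancelled or expired context ends the walk early.
    fmt.Printf("walk stopped: %v\n", err)
}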

Configuration Options

The Walker component can be configured using the WalkOptions struct:

type WalkOptions struct {
    // ComputeHashes determines whether file hashes should be computed.
    ComputeHashes bool
    // FollowSymlinks determines whether symbolic links should be followed.
    FollowSymlinks bool
    // MaxDepth is the maximum directory depth to traverse (0 means no limit).
    MaxDepth int
    // NumWorkers specifies the number of worker goroutines for parallel processing.
    NumWorkers int
    // HashAlgorithm specifies the hashing algorithm to use, e.g. "BLAKE3" or "MD5".
    HashAlgorithm string
    // SkipErrors determines whether errors encountered during the walk are skipped instead of aborting the walk.
    SkipErrors bool
    // ExcludePatterns is a list of filepath.Match patterns to exclude from the walk.
    ExcludePatterns []string
    // UsePartialHashing enables partial hashing for large files.
    UsePartialHashing bool
    // PartialHashingThreshold is the file size threshold in bytes above which
    // partial hashing will be used (if enabled). Default is 10MB.
    PartialHashingThreshold int64
}

Default options can be obtained using:

options := walker.DefaultWalkOptions()
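
Continuing from the defaults above, the following is a rough sketch of adjusting individual fields before starting a walk; the depth and exclude patterns are purely illustrative values.

options.ComputeHashes = false                       // metadata only, no content hashing
options.MaxDepth = 5                                // stop descending after five levels
options.SkipErrors = true                           // keep walking past unreadable entries
options.ExcludePatterns = []string{"*.tmp", ".git"} // filepath.Match patterns to skip

err := walker.WalkStreamWithCallback(context.Background(), rootDir, options,
    func(entry walker.SnapshotEntry) error {
        // process entry
        return nil
    })
if err != nil {
    // handle error
}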

Example: Processing a Large Directory Tree

// Using the callback-based streaming walker
ctx := context.Background() // or a cancellable context for long-running walks
processedCount := 0

err := walker.WalkStreamWithCallback(ctx, rootDir, walker.DefaultWalkOptions(), 
    func(entry walker.SnapshotEntry) error {
        processedCount++

        // Process the entry
        if !entry.IsDir {
            // Do something with the file
            fmt.Printf("Processing file %s (%d/%d)\n", entry.Path, processedCount, totalCount)
        }

        return nil
    })

if err != nil {
    fmt.Printf("Error walking directory: %v\n", err)
}

Example: Concurrent Processing with Channels

// Using the channel-based streaming walker
ctx := context.Background() // or a cancellable context for long-running walks
entryChan, errChan := walker.WalkStream(ctx, rootDir)

// Create a worker pool
const numWorkers = 4
var wg sync.WaitGroup
wg.Add(numWorkers)

// Start workers
for i := 0; i < numWorkers; i++ {
    go func() {
        defer wg.Done()
        for entry := range entryChan {
            if !entry.IsDir {
                // Process the file
                processFile(entry)
            }
        }
    }()
}

// Wait for all entries to be processed
wg.Wait()

// Check for errors
if err := <-errChan; err != nil {
    fmt.Printf("Error walking directory: %v\n", err)
}

Integration with Other Components

The Walker integrates with:

  • Serializer: Provides file metadata to be serialized
  • Storage: Indirectly supplies the data for snapshots
  • Diff: Enables comparison by providing consistent metadata

Conclusion

FlashFS provides multiple walker implementations to suit different use cases and programming styles. For most applications, the streaming implementations offer the best balance of features, scalability, and usability, especially for large directory structures. The standard walker provides slightly better performance for small to medium-sized directories when all entries need to be collected before processing.