Walker¶
The Walker component in FlashFS is responsible for traversing file systems and collecting metadata about files and directories. It provides the foundation for creating snapshots by efficiently gathering information about the file system structure.
Overview¶
The Walker component:
- Traverses directories recursively
- Collects metadata about files and directories
- Computes content hashes for files
- Handles errors and edge cases during traversal
- Supports context-based cancellation
- Provides streaming implementations for memory efficiency and responsiveness
Walker Implementations¶
FlashFS provides multiple implementations for walking directory trees, each with different characteristics and use cases:
Standard Walker (Walk)¶
The standard walker is a non-streaming implementation that collects all entries in memory before returning them as a slice.
entries, err := walker.Walk(rootDir)
if err != nil {
// handle error
}
for _, entry := range entries {
// process entry
}
Callback-based Streaming Walker (WalkStreamWithCallback)¶
The callback-based streaming walker processes entries as they're discovered, calling a user-provided function for each entry.
err := walker.WalkStreamWithCallback(context.Background(), rootDir, walker.DefaultWalkOptions(),
func(entry walker.SnapshotEntry) error {
// process entry
        return nil // returning a non-nil error stops the walk
})
if err != nil {
// handle error
}
Channel-based Streaming Walker (WalkStream)¶
The channel-based streaming walker returns channels for entries and errors, allowing for concurrent processing.
entryChan, errChan := walker.WalkStream(context.Background(), rootDir)
// Process entries as they arrive
for entry := range entryChan {
// process entry
}
// Check for errors after all entries have been processed
if err := <-errChan; err != nil {
// handle error
}
Core Data Structure¶
The primary data structure used by the Walker is the SnapshotEntry:
type SnapshotEntry struct {
Path string // Relative path from the root
Size int64 // Size in bytes
ModTime time.Time // Modification time
IsDir bool // True if it's a directory
Permissions fs.FileMode // File permissions
Hash string // Hash of the file content (if computed)
SymlinkTarget string // Target of the symlink (if it's a symlink)
Error error // Error encountered while processing this entry, if any
}
This structure captures essential information about each file or directory.
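As an illustrative sketch (assuming per-entry problems surface through the Error field and that SymlinkTarget is empty for non-symlinks), code consuming the entries returned by Walk might branch on these fields:
for _, entry := range entries {
    if entry.Error != nil {
        // The walker recorded a problem for this path; log it and move on.
        log.Printf("skipping %s: %v", entry.Path, entry.Error)
        continue
    }
    switch {
    case entry.IsDir:
        fmt.Printf("dir  %s (mode %v)\n", entry.Path, entry.Permissions)
    case entry.SymlinkTarget != "":
        fmt.Printf("link %s -> %s\n", entry.Path, entry.SymlinkTarget)
    default:
        fmt.Printf("file %s (%d bytes, hash %q)\n", entry.Path, entry.Size, entry.Hash)
    }
}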
Performance Characteristics¶
Based on benchmarks, here are the performance characteristics of each implementation:
With Hashing Enabled¶
| Implementation | Operations/sec | Time/op | Memory/op | Allocations/op |
|---|---|---|---|---|
| StandardWalkDir (Go stdlib) | 4,286 | 261.9 µs | 63.2 KB | 757 |
| Walk | 631 | 1.86 ms | 12.3 MB | 4,678 |
| WalkStreamWithCallback | 579 | 2.09 ms | 13.0 MB | 4,437 |
| WalkStream | 579 | 2.11 ms | 13.7 MB | 4,441 |
Without Hashing¶
| Implementation | Operations/sec | Time/op | Memory/op | Allocations/op |
|---|---|---|---|---|
| Walk | 1,642 | 728.7 µs | 277.1 KB | 2,077 |
| WalkStreamWithCallback | 1,101 | 1.09 ms | 185.4 KB | 1,636 |
| WalkStream | 1,056 | 1.11 ms | 188.1 KB | 1,641 |
When to Use Each Implementation¶
Use the Standard Walker (Walk) when¶
- You need all entries before processing can begin
- The directory structure is small to medium-sized
- Simplicity is preferred over advanced features
- You want slightly better performance for small to medium-sized directories
Use the Callback-based Streaming Walker (WalkStreamWithCallback) when¶
- You want to process entries as they're discovered
- You prefer a callback-based programming style
- You need to handle very large directory structures
- You want to provide progress updates during the walk
- Memory efficiency is important
Use the Channel-based Streaming Walker (WalkStream) when¶
- You want to process entries as they're discovered
- You prefer a channel-based programming style
- You need to integrate with other Go concurrency patterns
- You want to process entries concurrently with other operations
- You need to handle very large directory structures
- Memory efficiency is important
Scalability Considerations¶
The streaming walker implementations (both callback and channel-based) offer significant advantages for large directory structures:
- Memory Efficiency: Since entries are processed as they're discovered, the memory footprint remains relatively constant regardless of the directory size.
- Responsiveness: Users can start processing entries immediately, rather than waiting for the entire walk to complete.
- Cancellation: Operations can be cancelled mid-walk using context cancellation (see the sketch below).
- Progress Reporting: Real-time progress updates can be provided during the walk.
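A minimal sketch of mid-walk cancellation, assuming a cancelled or timed-out walk returns the context's error (for example context.DeadlineExceeded):
// Abort the walk if it has not finished within 30 seconds.
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
err := walker.WalkStreamWithCallback(ctx, rootDir, walker.DefaultWalkOptions(),
    func(entry walker.SnapshotEntry) error {
        // process entry
        return nil
    })
if err != nil {
    // A cancelled or timed-out walk is reported here.
    fmt.Printf("walk aborted: %v\n", err)
}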
For very large directory structures (millions of files), the streaming implementations are strongly recommended to avoid out-of-memory errors and to provide a better user experience.
Configuration Options¶
The Walker component can be configured using the WalkOptions struct:
type WalkOptions struct {
// ComputeHashes determines whether file hashes should be computed.
ComputeHashes bool
// FollowSymlinks determines whether symbolic links should be followed.
FollowSymlinks bool
// MaxDepth is the maximum directory depth to traverse (0 means no limit).
MaxDepth int
// NumWorkers specifies the number of worker goroutines for parallel processing.
NumWorkers int
// HashAlgorithm specifies the hash algorithm to use, e.g. "BLAKE3" or "MD5".
HashAlgorithm string
// SkipErrors determines whether errors encountered during the walk are skipped rather than aborting the operation.
SkipErrors bool
// ExcludePatterns is a list of filepath.Match patterns to exclude from the walk.
ExcludePatterns []string
// UsePartialHashing enables partial hashing for large files.
UsePartialHashing bool
// PartialHashingThreshold is the file size threshold in bytes above which
// partial hashing will be used (if enabled). Default is 10MB.
PartialHashingThreshold int64
}
Default options can be obtained using:
options := walker.DefaultWalkOptions()
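As an illustrative sketch, a caller might start from the defaults and override individual fields before passing the options to a walker (the concrete default values are whatever DefaultWalkOptions returns; the patterns and thresholds below are only examples):
options := walker.DefaultWalkOptions()
options.ComputeHashes = true
options.HashAlgorithm = "BLAKE3"
options.MaxDepth = 5                                 // stop five levels below the root
options.SkipErrors = true                            // keep walking past unreadable entries
options.ExcludePatterns = []string{"*.tmp", "*.log"} // filepath.Match patterns
options.UsePartialHashing = true
options.PartialHashingThreshold = 64 << 20           // partially hash files larger than 64 MB
err := walker.WalkStreamWithCallback(ctx, rootDir, options,
    func(entry walker.SnapshotEntry) error {
        // process entry
        return nil
    })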
Example: Processing a Large Directory Tree¶
// Using the callback-based streaming walker
processedCount := 0
err := walker.WalkStreamWithCallback(ctx, rootDir, walker.DefaultWalkOptions(),
func(entry walker.SnapshotEntry) error {
processedCount++
// Process the entry
if !entry.IsDir {
// Do something with the file
fmt.Printf("Processing file %s (%d/%d)\n", entry.Path, processedCount, totalCount)
}
return nil
})
if err != nil {
fmt.Printf("Error walking directory: %v\n", err)
}
Example: Concurrent Processing with Channels¶
// Using the channel-based streaming walker
entryChan, errChan := walker.WalkStream(ctx, rootDir)
// Create a worker pool
const numWorkers = 4
var wg sync.WaitGroup
wg.Add(numWorkers)
// Start workers
for i := 0; i < numWorkers; i++ {
go func() {
defer wg.Done()
for entry := range entryChan {
if !entry.IsDir {
// Process the file
processFile(entry)
}
}
}()
}
// Wait for all entries to be processed
wg.Wait()
// Check for errors
if err := <-errChan; err != nil {
fmt.Printf("Error walking directory: %v\n", err)
}
Integration with Other Components¶
The Walker integrates with:
- Serializer: Provides file metadata to be serialized
- Storage: Indirectly supplies the data for snapshots
- Diff: Enables comparison by providing consistent metadata
Conclusion¶
FlashFS provides multiple walker implementations to suit different use cases and programming styles. For most applications, the streaming implementations offer the best balance of features, scalability, and usability, especially for large directory structures. The standard walker provides slightly better performance for small to medium-sized directories when all entries need to be collected before processing.