Streaming Processing¶
FlashFS implements streaming processing capabilities throughout its architecture to efficiently handle very large datasets that wouldn't fit entirely in memory. This approach enables processing of massive file systems with minimal memory footprint while maintaining high performance.
Core Streaming Components¶
FlashFS implements streaming in several key components:
Streaming Directory Walker¶
The directory walker processes files as they're discovered, rather than collecting all files before processing:
- Memory Efficiency: Only keeps necessary metadata in memory
- Responsiveness: Provides immediate feedback as files are processed
- Parallel Processing: Can process multiple directories concurrently
Streaming Serialization¶
The serializer breaks large datasets into manageable chunks:
- Chunked Processing: Divides data into fixed-size chunks
- Progress Tracking: Reports progress as chunks are processed
- Parallel Processing: Enables concurrent processing of chunks
Streaming Diff Computation¶
The diff engine can compute differences between snapshots in a streaming fashion:
- Incremental Comparison: Compares entries as they're read
- Memory-Efficient: Doesn't require loading entire snapshots into memory
- Early Termination: Can stop processing when specific conditions are met
CLI Commands for Streaming¶
FlashFS provides dedicated CLI commands for streaming operations with enhanced user experience:
flashfs stream [command]
Available Commands¶
  snapshot    Take a snapshot using streaming for large directories
  diff        Compare snapshots using streaming for large datasets
Stream Snapshot Command¶
The stream snapshot command is designed for large directories and provides:
flashfs stream snapshot --source /path/to/large/directory --output backup.snap
Key features:
- Two-Pass Processing: First counts files to provide accurate progress reporting, then processes them
- Real-Time Progress Bars: Shows detailed progress for each phase of the operation
- Detailed Statistics: Displays comprehensive statistics after completion
- Graceful Cancellation: Can be safely interrupted at any point with Ctrl+C
Example output:
Starting streaming snapshot of /path/to/large/directory
Press Ctrl+C to cancel at any time (snapshot will be incomplete)
Counting files (this may take a while for large directories)...
| Counting files... (1,234/-, 5,678 files/s) [2m34s]
Found 10,000 files/directories (2.5 GiB)
Creating snapshot 45% [==================> ] (4,500/10,000, 1,234 files/s)
...
Snapshot completed successfully:
- Files processed: 10,000
- Total data size: 2.5 GiB
- Snapshot file size: 5.2 MiB
- Compression ratio: 492.3x
- Time elapsed: 5m23s
- Processing speed: 7.8 MiB/s
Stream Diff Command¶
The stream diff command is designed for comparing large snapshots:
flashfs stream diff --base snapshot1.snap --target snapshot2.snap --output changes.diff
Key features:
- Progress Reporting: Shows detailed progress for loading snapshots, computing differences, and writing the diff file
- Detailed Statistics: Displays comprehensive statistics after completion
- Graceful Cancellation: Can be safely interrupted at any point with Ctrl+C
Example output:
Comparing snapshots:
- Base: snapshot1.snap
- Target: snapshot2.snap
Press Ctrl+C to cancel at any time (diff will be incomplete)
| Loading snapshots (25.6 MB/s) [0s]
Computing differences...
| Computing diff [2s]
Writing diff to file...
Writing diff 100% [========================================] (4.7 MB/s)
Diff completed successfully:
- Base snapshot size: 5.2 MiB
- Target snapshot size: 5.3 MiB
- Diff size: 256 KiB
- Diff ratio: 4.8%
- Time elapsed: 3s
Benefits of Streaming¶
Memory Efficiency¶
Streaming processing dramatically reduces memory requirements:
- Constant Memory Usage: Memory consumption doesn't grow with dataset size
- Resource Optimization: Efficient use of system resources
- Large Dataset Support: Can handle file systems with millions of files
Performance¶
Streaming improves performance in several ways:
- Reduced Latency: Start processing immediately without waiting for complete data
- Parallel Processing: Process multiple chunks concurrently
- I/O Optimization: Overlap I/O operations with processing
User Experience¶
Streaming enhances the user experience:
- Progress Reporting: Provide real-time feedback on long-running operations
- Responsiveness: Application remains responsive during processing
- Cancellation: Operations can be cancelled mid-stream
Implementation Details¶
Chunk-Based Processing¶
FlashFS uses a chunk-based approach for streaming:
// Process data in chunks
err := processor.ProcessStream(reader, func(chunk Chunk) error {
    // Process this chunk
    processChunk(chunk.Data)

    // Report progress
    progress := float64(chunk.Index) / float64(chunk.Total) * 100
    fmt.Printf("Progress: %.2f%%\n", progress)
    return nil
})
Buffered I/O¶
Streaming operations use buffered I/O for performance:
// Create a buffered reader/writer
bufReader := bufio.NewReaderSize(reader, bufferSize)
bufWriter := bufio.NewWriterSize(writer, bufferSize)
defer bufWriter.Flush() // flush any buffered bytes before the writer goes away
Parallel Chunk Processing¶
Chunks can be processed in parallel for improved performance:
// Create a worker pool
var wg sync.WaitGroup
chunkChan := make(chan Chunk, 10)

// Start worker goroutines
for i := 0; i < runtime.NumCPU(); i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        for chunk := range chunkChan {
            processChunk(chunk)
        }
    }()
}

// Feed chunks to the workers
err := processor.ProcessStream(reader, func(chunk Chunk) error {
    chunkChan <- chunk
    return nil
})
close(chunkChan)

// Wait for all workers to finish, then surface any streaming error
wg.Wait()
if err != nil {
    // handle the error
}
Configuration¶
Streaming behavior can be configured through options:
options := StreamingOptions{
    ChunkSize:  5000,       // Number of entries per chunk
    BufferSize: 128 * 1024, // Buffer size for I/O operations
    NumWorkers: 4,          // Number of worker goroutines
}
Use Cases¶
Very Large File Systems¶
Streaming is essential for processing very large file systems:
- Enterprise Storage: File systems with millions of files
- Media Archives: Large collections of media files
- Source Code Repositories: Large monorepos with many files
Network Transfers¶
Streaming enables efficient network transfers:
- Cloud Backups: Stream data directly to cloud storage
- Remote Synchronization: Efficiently sync changes between systems
- Distributed Processing: Process data across multiple nodes
Resource-Constrained Environments¶
Streaming is valuable in resource-constrained environments:
- Embedded Systems: Devices with limited memory
- Shared Servers: Systems where memory must be conserved
- Container Environments: Containerized applications with memory limits