Streaming Processing

FlashFS implements streaming processing throughout its architecture to handle datasets too large to fit in memory. This approach enables processing of massive file systems with a minimal memory footprint while maintaining high performance.

Core Streaming Components

FlashFS implements streaming in several key components:

Streaming Directory Walker

The directory walker processes files as they're discovered, rather than collecting all files before processing:

  • Memory Efficiency: Only keeps necessary metadata in memory
  • Responsiveness: Provides immediate feedback as files are processed
  • Parallel Processing: Can process multiple directories concurrently

Streaming Serialization

The serializer breaks large datasets into manageable chunks:

  • Chunked Processing: Divides data into fixed-size chunks
  • Progress Tracking: Reports progress as chunks are processed
  • Parallel Processing: Enables concurrent processing of chunks
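The chunking step can be sketched as follows. The entry type and function name are illustrative stand-ins, not FlashFS's serializer API; the point is that each fixed-size chunk can be serialized (and compressed) independently.

```go
package main

import "fmt"

// chunkEntries groups a stream of entries into fixed-size chunks so each
// chunk can be processed independently and progress can be reported per
// chunk. The string entry type is a placeholder for real metadata records.
func chunkEntries(entries []string, chunkSize int) [][]string {
	var chunks [][]string
	for len(entries) > 0 {
		n := chunkSize
		if len(entries) < n {
			n = len(entries)
		}
		chunks = append(chunks, entries[:n])
		entries = entries[n:]
	}
	return chunks
}

func main() {
	entries := []string{"a", "b", "c", "d", "e"}
	for i, chunk := range chunkEntries(entries, 2) {
		fmt.Printf("chunk %d: %v\n", i, chunk)
	}
}
```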

Streaming Diff Computation

The diff engine can compute differences between snapshots in a streaming fashion:

  • Incremental Comparison: Compares entries as they're read
  • Memory-Efficient: Doesn't require loading entire snapshots into memory
  • Early Termination: Can stop processing when specific conditions are met

CLI Commands for Streaming

FlashFS provides dedicated CLI commands for streaming operations with enhanced user experience:

flashfs stream [command]

Available Commands

snapshot   Take a snapshot using streaming for large directories
diff       Compare snapshots using streaming for large datasets

Stream Snapshot Command

The stream snapshot command is designed for snapshotting large directories:

flashfs stream snapshot --source /path/to/large/directory --output backup.snap

Key features:

  • Two-Pass Processing: First counts files to provide accurate progress reporting, then processes them
  • Real-Time Progress Bars: Shows detailed progress for each phase of the operation
  • Detailed Statistics: Displays comprehensive statistics after completion
  • Graceful Cancellation: Can be safely interrupted at any point with Ctrl+C

Example output:

Starting streaming snapshot of /path/to/large/directory
Press Ctrl+C to cancel at any time (snapshot will be incomplete)
Counting files (this may take a while for large directories)...
| Counting files... (1,234/-, 5,678 files/s) [2m34s]
Found 10,000 files/directories (2.5 GiB)
Creating snapshot 45% [==================>                  ] (4,500/10,000, 1,234 files/s)
...
Snapshot completed successfully:
  - Files processed: 10,000
  - Total data size: 2.5 GiB
  - Snapshot file size: 5.2 MiB
  - Compression ratio: 492.3x
  - Time elapsed: 5m23s
  - Processing speed: 7.8 MiB/s

Stream Diff Command

The stream diff command is designed for comparing large snapshots:

flashfs stream diff --base snapshot1.snap --target snapshot2.snap --output changes.diff

Key features:

  • Progress Reporting: Shows detailed progress for loading snapshots, computing differences, and writing the diff file
  • Detailed Statistics: Displays comprehensive statistics after completion
  • Graceful Cancellation: Can be safely interrupted at any point with Ctrl+C

Example output:

Comparing snapshots:
  - Base: snapshot1.snap
  - Target: snapshot2.snap
Press Ctrl+C to cancel at any time (diff will be incomplete)
| Loading snapshots (25.6 MB/s) [0s]
Computing differences...
| Computing diff [2s]
Writing diff to file...
Writing diff 100% [========================================] (4.7 MB/s)
Diff completed successfully:
  - Base snapshot size: 5.2 MiB
  - Target snapshot size: 5.3 MiB
  - Diff size: 256 KiB
  - Diff ratio: 4.8%
  - Time elapsed: 3s

Benefits of Streaming

Memory Efficiency

Streaming processing dramatically reduces memory requirements:

  • Constant Memory Usage: Memory consumption doesn't grow with dataset size
  • Resource Optimization: Efficient use of system resources
  • Large Dataset Support: Can handle file systems with millions of files

Performance

Streaming improves performance in several ways:

  • Reduced Latency: Start processing immediately without waiting for complete data
  • Parallel Processing: Process multiple chunks concurrently
  • I/O Optimization: Overlap I/O operations with processing

User Experience

Streaming enhances the user experience:

  • Progress Reporting: Provide real-time feedback on long-running operations
  • Responsiveness: Application remains responsive during processing
  • Cancellation: Operations can be cancelled mid-stream

Implementation Details

Chunk-Based Processing

FlashFS uses a chunk-based approach for streaming:

// Process data in chunks; the callback is invoked once per chunk
err := processor.ProcessStream(reader, func(chunk Chunk) error {
    // Process this chunk, propagating any failure to abort the stream
    if err := processChunk(chunk.Data); err != nil {
        return err
    }

    // Report progress based on the chunk's position in the stream
    progress := float64(chunk.Index) / float64(chunk.Total) * 100
    fmt.Printf("Progress: %.2f%%\n", progress)

    return nil
})

Buffered I/O

Streaming operations use buffered I/O for performance:

// Create a buffered reader/writer to reduce syscall overhead
bufReader := bufio.NewReaderSize(reader, bufferSize)
bufWriter := bufio.NewWriterSize(writer, bufferSize)
defer bufWriter.Flush() // flush any buffered bytes before closing

Parallel Chunk Processing

Chunks can be processed in parallel for improved performance:

// Create a worker pool
var wg sync.WaitGroup
chunkChan := make(chan Chunk, 10)

// Start one worker goroutine per CPU core
for i := 0; i < runtime.NumCPU(); i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        for chunk := range chunkChan {
            processChunk(chunk)
        }
    }()
}

// Feed chunks to workers, then close the channel to signal completion
err := processor.ProcessStream(reader, func(chunk Chunk) error {
    chunkChan <- chunk
    return nil
})
close(chunkChan)

// Wait for all workers to finish before handling err
wg.Wait()

Configuration

Streaming behavior can be configured through options:

options := StreamingOptions{
    ChunkSize:  5000,     // Number of entries per chunk
    BufferSize: 128*1024, // Buffer size for I/O operations
    NumWorkers: 4,        // Number of worker goroutines
}

Use Cases

Very Large File Systems

Streaming is essential for processing very large file systems:

  • Enterprise Storage: File systems with millions of files
  • Media Archives: Large collections of media files
  • Source Code Repositories: Large monorepos with many files

Network Transfers

Streaming enables efficient network transfers:

  • Cloud Backups: Stream data directly to cloud storage
  • Remote Synchronization: Efficiently sync changes between systems
  • Distributed Processing: Process data across multiple nodes

Resource-Constrained Environments

Streaming is valuable in resource-constrained environments:

  • Embedded Systems: Devices with limited memory
  • Shared Servers: Systems where memory must be conserved
  • Container Environments: Containerized applications with memory limits