Diff Computation¶
FlashFS provides powerful diff computation capabilities that enable efficient comparison, storage, and application of changes between snapshots. This feature is essential for incremental backups, file synchronization, and monitoring changes over time.
Benefits¶
- Storage Efficiency: Store only the changes between snapshots instead of full copies
- Transfer Optimization: Transmit only the differences when synchronizing snapshots
- Change Tracking: Easily identify what files have been added, modified, or deleted
- Performance: Utilize parallel processing and pre-filtering for fast comparisons
- Flexibility: Configure comparison options based on specific needs
How Diff Computation Works¶
FlashFS implements a multi-stage approach to efficiently compute differences between snapshots:
1. Bloom Filter Pre-check¶
Before performing detailed comparisons, FlashFS uses Bloom filters to quickly identify files that may have changed:
- Creates a Bloom filter from the base snapshot
- Tests each file in the target snapshot against the filter
- Files that fail the Bloom filter test are candidates for detailed comparison
This pre-filtering step significantly reduces the number of files that need detailed comparison, especially in large snapshots where most files remain unchanged.
2. Detailed Comparison¶
For files identified by the Bloom filter, FlashFS performs a detailed comparison:
- Compares file metadata (path, size, modification time, permissions)
- Optionally compares file content hashes for detecting changes even when metadata is unchanged
- Identifies files that have been added, modified, or deleted
3. Structured Diff Representation¶
FlashFS uses a structured schema to represent diffs:
- Each changed file is represented as a
DiffEntry
with: - Path of the file
- Type of change (added, modified, deleted)
-
Before and after values for size, modification time, permissions, and content hash
-
All entries are collected in a
Diff
object that provides: - Efficient serialization and deserialization
- Type-safe access to change information
- Structured representation of all changes
4. Parallel Processing¶
To accelerate diff computation for large snapshots, FlashFS can distribute the comparison work across multiple CPU cores:
- Divides the file list into chunks
- Processes each chunk in parallel
- Combines the results into a unified diff
5. Diff Storage¶
The computed diff is stored in a compact format:
- Records only the changes (added, modified, deleted files)
- Uses the same efficient FlatBuffers serialization as snapshots
- Applies Zstd compression to minimize storage requirements
Command Reference¶
Computing a Diff¶
flashfs diff [options]
Options:
--base <file>
: Base snapshot file (required)--target <file>
: Target snapshot file (required)--output <file>
: Output diff file (required)--detailed
: Perform detailed comparison including file content hashes--parallel <n>
: Number of parallel workers for comparison (default: number of CPU cores)--no-hash
: Skip hash comparison (faster but less accurate)--path-filter <pattern>
: Only compare files matching the specified path pattern
Applying a Diff¶
flashfs apply [options]
Options:
--base <file>
: Base snapshot file (required)--diff <file>
: Diff file to apply (required)--output <file>
: Output snapshot file (required)
Viewing Diff Information¶
flashfs diff-info [options]
Options:
--diff <file>
: Diff file to analyze (required)--verbose
: Show detailed information about each changed file
Examples¶
Basic Diff Computation¶
Compute the differences between two snapshots:
flashfs diff --base snapshot1.snap --target snapshot2.snap --output changes.diff
Detailed Comparison with Parallel Processing¶
Perform a detailed comparison using 8 parallel workers:
flashfs diff --base snapshot1.snap --target snapshot2.snap --output changes.diff --detailed --parallel 8
Path-Filtered Comparison¶
Compare only files in a specific directory:
flashfs diff --base snapshot1.snap --target snapshot2.snap --output changes.diff --path-filter "/home/user/documents/*"
Applying a Diff to Generate a New Snapshot¶
Apply a diff to a base snapshot to generate a new snapshot:
flashfs apply --base snapshot1.snap --diff changes.diff --output snapshot2.snap
Viewing Diff Information¶
Analyze the contents of a diff file:
flashfs diff-info --diff changes.diff --verbose
Implementation Details¶
The diff computation is implemented in the SnapshotStore
struct with the following key methods:
ComputeDiff
: Computes the differences between two snapshots and returns a structuredDiff
objectStoreDiff
: Stores the computed diff to a fileApplyDiff
: Applies a diff to a base snapshot to generate a new snapshot
Diff Schema¶
FlashFS uses a structured schema for representing diffs:
table DiffEntry {
path: string;
type: byte; // 0 = added, 1 = modified, 2 = deleted
oldSize: long;
newSize: long;
oldMtime: long;
newMtime: long;
oldPermissions: uint;
newPermissions: uint;
oldHash: [ubyte];
newHash: [ubyte];
}
table Diff {
entries: [DiffEntry];
}
This schema provides a clear, structured representation of changes, with each DiffEntry
capturing all relevant information about a change:
- For added files (
type = 0
), only the new metadata is relevant - For modified files (
type = 1
), both old and new metadata are stored - For deleted files (
type = 2
), only the old metadata is relevant
ComputeDiff Implementation¶
The ComputeDiff
function compares two snapshots and produces a structured diff:
- Deserializes both snapshots
- Creates maps of file entries for efficient lookup
- Identifies added, modified, and deleted files
- Creates
DiffEntry
objects for each change - Assembles all entries into a
Diff
object - Serializes the diff using FlatBuffers
- Compresses the serialized data
ApplyDiff Implementation¶
The ApplyDiff
function applies a diff to a base snapshot:
- Deserializes the base snapshot
- Deserializes the diff
- Creates a map of base entries
- Processes each diff entry:
- For added files, creates a new entry
- For modified files, updates the existing entry
- For deleted files, removes the entry
- Builds a new snapshot with the modified entries
- Serializes and compresses the new snapshot
The diff system uses the same efficient serialization and compression techniques as the snapshot system, ensuring consistent performance and storage efficiency.
Performance Considerations¶
- Bloom Filters: The Bloom filter pre-check significantly reduces comparison time for large snapshots
- Parallel Processing: Utilizing multiple CPU cores can dramatically speed up diff computation
- Hash Comparison: Enabling hash comparison provides more accurate results but increases computation time
- Path Filtering: Using path filters can focus the comparison on relevant files, reducing processing time
- Structured Diffs: The structured diff format allows for efficient processing and application of changes
For optimal performance, adjust the parallelism level based on your system's capabilities and use path filters when only specific directories are of interest.