Erasure coding
~~~~~~~~~~~~~~

bcachefs also supports Reed-Solomon erasure coding (the same algorithm
used by most RAID5/6 implementations). When enabled with the ``ec``
option, the desired redundancy is taken from the ``data_replicas``
option - erasure coding of metadata is not supported.

Erasure coding works significantly differently from both conventional
RAID implementations and other filesystems with similar features. In
conventional RAID, the "write hole" is a significant problem - doing a
small write within a stripe requires the P and Q (recovery) blocks to be
updated as well, and since those writes cannot be done atomically there
is a window where the P and Q blocks are inconsistent - meaning that if
the system crashes and recovers with a drive missing, reconstruct reads
for unrelated data within that stripe will be corrupted.
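
As a toy illustration of the write hole (this is not bcachefs code, and
the block count and block size are made up), consider a four block
stripe with a single XOR parity block: a small write reaches the disk,
the machine crashes before the parity block is rewritten, and a later
reconstruct read of an unrelated block returns garbage::

    /*
     * RAID5 write hole demo: update one data block, "crash" before
     * updating parity, then reconstruct a different, unrelated block
     * from the now-stale parity.
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define NR_DATA    4
    #define BLOCK_SIZE 8

    static void xor_into(uint8_t *dst, const uint8_t *src)
    {
        for (int i = 0; i < BLOCK_SIZE; i++)
            dst[i] ^= src[i];
    }

    static void compute_parity(uint8_t p[BLOCK_SIZE],
                               uint8_t data[NR_DATA][BLOCK_SIZE])
    {
        memset(p, 0, BLOCK_SIZE);
        for (int d = 0; d < NR_DATA; d++)
            xor_into(p, data[d]);
    }

    int main(void)
    {
        uint8_t data[NR_DATA][BLOCK_SIZE], parity[BLOCK_SIZE];

        for (int d = 0; d < NR_DATA; d++)
            memset(data[d], 'a' + d, BLOCK_SIZE);
        compute_parity(parity, data);

        /* Small write to block 2; crash before the parity update: */
        memset(data[2], 'Z', BLOCK_SIZE);
        /* compute_parity(parity, data);  <- never happens */

        /* Drive holding block 0 dies; reconstruct it from the rest: */
        uint8_t rec[BLOCK_SIZE];
        memcpy(rec, parity, BLOCK_SIZE);
        for (int d = 1; d < NR_DATA; d++)
            xor_into(rec, data[d]);

        printf("expected block 0:      aaaaaaaa\n");
        printf("reconstructed block 0: %.*s\n", BLOCK_SIZE, (char *) rec);
        return 0;
    }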

ZFS avoids this by fragmenting individual writes so that every write
becomes a new stripe - this works, but the fragmentation has a negative
effect on performance: metadata becomes bigger, and both read and write
requests are excessively fragmented. Btrfs's erasure coding
implementation is more conventional, and still subject to the write hole
problem.

bcachefs's erasure coding takes advantage of our copy on write nature -
since updating stripes in place is a problem, we simply don't do that.
And since excessively small stripes are a problem for fragmentation, we
don't erasure code individual extents, we erasure code entire buckets -
taking advantage of bucket based allocation and copying garbage
collection.
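
The shape of this can be sketched with a couple of hypothetical C
structures - illustrative only, not the actual bcachefs on-disk format:
a stripe references whole buckets on different devices, and an erasure
coded extent simply records which of the stripe's data buckets it lives
in, so no per-extent P/Q blocks are needed::

    #include <stdio.h>
    #include <stdint.h>

    #define EC_DATA_BLOCKS   4   /* assumed 4+2 geometry */
    #define EC_PARITY_BLOCKS 2

    struct bucket_ref {
        uint32_t dev;            /* device index             */
        uint64_t bucket;         /* bucket nr on that device */
    };

    struct stripe_sketch {
        uint32_t bucket_size;    /* bytes per bucket         */
        struct bucket_ref blocks[EC_DATA_BLOCKS + EC_PARITY_BLOCKS];
    };

    struct ec_extent_sketch {
        uint64_t stripe_idx;     /* which stripe                 */
        uint8_t  block;          /* which data bucket within it  */
        uint32_t offset;         /* byte offset within the bucket */
        uint32_t len;
    };

    int main(void)
    {
        printf("stripe descriptor: %zu bytes, extent ref: %zu bytes\n",
               sizeof(struct stripe_sketch),
               sizeof(struct ec_extent_sketch));
        return 0;
    }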

When erasure coding is enabled, writes are initially replicated, but one
of the replicas is allocated from a bucket that is queued up to be part
of a new stripe. When we finish filling up the new stripe, we write out
the P and Q buckets and then drop the extra replicas for all the data
within that stripe - the effect is similar to full data journalling, and
it means that after erasure coding is done the layout of our data on
disk is ideal.
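
That lifecycle can be modelled roughly as follows; this is an in-memory
toy, not bcachefs code - the names and geometry are made up, each write
fills a whole data bucket for simplicity, and only the XOR P parity is
computed (the real Q parity requires Reed-Solomon arithmetic over
GF(2^8))::

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <stdbool.h>

    #define NR_DATA     3
    #define BUCKET_SIZE 16

    struct pending_stripe {
        uint8_t data[NR_DATA][BUCKET_SIZE]; /* data buckets            */
        uint8_t p[BUCKET_SIZE];             /* parity bucket           */
        int     next;                       /* next data bucket to fill */
        bool    extra_replica[NR_DATA];     /* extra replica still held? */
    };

    /* Buffer a write: one copy goes into the stripe, one extra replica
     * is kept elsewhere so the data is already redundant: */
    static void stripe_write(struct pending_stripe *s, const uint8_t *buf)
    {
        memcpy(s->data[s->next], buf, BUCKET_SIZE);
        s->extra_replica[s->next] = true;
        s->next++;
    }

    /* Stripe is full: write out parity, then drop the extra replicas: */
    static void stripe_complete(struct pending_stripe *s)
    {
        memset(s->p, 0, BUCKET_SIZE);
        for (int d = 0; d < NR_DATA; d++)
            for (int i = 0; i < BUCKET_SIZE; i++)
                s->p[i] ^= s->data[d][i];

        for (int d = 0; d < NR_DATA; d++)
            s->extra_replica[d] = false;    /* now protected by parity */
    }

    int main(void)
    {
        struct pending_stripe s = { .next = 0 };
        uint8_t buf[BUCKET_SIZE];

        for (int d = 0; d < NR_DATA; d++) {
            memset(buf, 'A' + d, BUCKET_SIZE);
            stripe_write(&s, buf);
        }
        stripe_complete(&s);

        /* Sanity check: recover data bucket 1 from parity: */
        uint8_t rec[BUCKET_SIZE];
        memcpy(rec, s.p, BUCKET_SIZE);
        for (int i = 0; i < BUCKET_SIZE; i++)
            rec[i] ^= s.data[0][i] ^ s.data[2][i];

        printf("recovered bucket 1: %.*s\n", BUCKET_SIZE, (char *) rec);
        return 0;
    }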

Since disks have write caches that are only flushed when we issue a
cache flush command - which we only do on journal commit - this full
data journalling should have negligible overhead if we can tweak the
allocator so that the buckets used for the extra replicas are reused
(and then overwritten again) immediately: the extra writes would then
mostly be absorbed by the drive's write cache. This optimization is not
implemented yet, however.