Erasure coding
~~~~~~~~~~~~~~

bcachefs also supports Reed-Solomon erasure coding (the same algorithm
used by most RAID5/6 implementations). When enabled with the ``ec``
option, the desired redundancy is taken from the ``data_replicas``
option - erasure coding of metadata is not supported.

Erasure coding works significantly differently from both conventional
RAID implementations and other filesystems with similar features. In
conventional RAID, the "write hole" is a significant problem: doing a
small write within a stripe requires the P and Q (recovery) blocks to be
updated as well, and since those writes cannot be done atomically there
is a window where the P and Q blocks are inconsistent - meaning that if
the system crashes and recovers with a drive missing, reconstruct reads
for unrelated data within that stripe will be corrupted.
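
To make that window concrete, here is a minimal sketch - plain C,
illustrative only, not bcachefs or md-raid code - of the
read-modify-write cycle a conventional RAID5 array performs for a small
write (only RAID5's single P block is shown; RAID6 adds a Q block but
the window is the same). The two device writes at the end cannot be
made atomic, which is exactly the write hole described above::

    #include <stddef.h>
    #include <stdint.h>

    /*
     * RAID5 keeps a parity block P equal to the XOR of the data blocks in
     * a stripe.  Updating one data block in place means recomputing
     * P_new = P_old ^ D_old ^ D_new, then writing both the data block and
     * P back to two different drives.
     */
    static void raid5_small_write(uint8_t *data_blk, uint8_t *parity_blk,
                                  const uint8_t *new_data, size_t blk_size)
    {
        for (size_t i = 0; i < blk_size; i++) {
            parity_blk[i] ^= data_blk[i] ^ new_data[i]; /* P_new */
            data_blk[i]    = new_data[i];               /* D_new */
        }

        /*
         * The updated data block and parity block must now both be
         * written out.  A crash between those two writes leaves P
         * inconsistent with the data, so a later reconstruct read (with a
         * drive missing) returns corrupt data for unrelated blocks in the
         * stripe.
         */
    }

Traditional mitigations are a parity journal or a battery-backed write
cache; the designs discussed below avoid the inconsistent window
structurally instead.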

ZFS avoids this by fragmenting individual writes so that every write
becomes a new stripe - this works, but the fragmentation has a negative
effect on performance: metadata becomes bigger, and both read and write
requests are excessively fragmented. Btrfs’s erasure coding
implementation is more conventional, and still subject to the write hole
problem.

bcachefs’s erasure coding takes advantage of our copy on write nature -
since updating stripes in place is a problem, we simply don’t do that.
And since excessively small stripes are a problem for fragmentation, we
don’t erasure code individual extents; instead, we erasure code entire
buckets - taking advantage of bucket based allocation and copying
garbage collection.
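
As a sketch of what erasure coding whole buckets means - simplified,
with invented names, and not the actual bcachefs on-disk stripe format -
a stripe can be viewed as a fixed set of data buckets plus P and Q
recovery buckets computed over their entire contents::

    #include <stddef.h>
    #include <stdint.h>

    #define STRIPE_DATA_BUCKETS 6  /* illustrative geometry: 6 data + P + Q */

    /* Hypothetical, simplified stripe made of whole buckets. */
    struct stripe_sketch {
        size_t  bucket_bytes;                 /* size of each bucket */
        uint8_t *data[STRIPE_DATA_BUCKETS];   /* one whole bucket each */
        uint8_t *p;                           /* XOR recovery bucket */
        uint8_t *q;                           /* Reed-Solomon recovery bucket */
    };

    /*
     * P is the XOR of the data buckets, computed over their full
     * contents; Q would be a Reed-Solomon syndrome over GF(2^8), omitted
     * here for brevity.  The point is the granularity: recovery blocks
     * cover whole buckets, so extents are never fragmented into
     * per-write stripes.
     */
    static void stripe_compute_p(struct stripe_sketch *s)
    {
        for (size_t i = 0; i < s->bucket_bytes; i++) {
            uint8_t x = 0;

            for (int d = 0; d < STRIPE_DATA_BUCKETS; d++)
                x ^= s->data[d][i];
            s->p[i] = x;
        }
    }

Because buckets are written once and thereafter only reclaimed by
copying garbage collection, the recovery buckets never need to be
updated in place.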

When erasure coding is enabled, writes are initially replicated, but one
of the replicas is allocated from a bucket that is queued up to be part
of a new stripe. When we finish filling up the new stripe, we write out
the P and Q buckets and then drop the extra replicas for all the data
within that stripe - the effect is similar to full data journalling, and
it means that after erasure coding is done the layout of our data on
disk is ideal.
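
That lifecycle can be modelled with a small toy - every name below is
invented for illustration and nothing here corresponds to actual
bcachefs internals::

    #include <stdbool.h>
    #include <stddef.h>

    #define NR_DATA 4   /* toy geometry: 4 data buckets per stripe */

    /* One piece of data written while the stripe was still open. */
    struct toy_write {
        int nr_replicas;    /* copies written up front */
        int stripe_block;   /* which data bucket of the stripe holds one copy */
    };

    struct toy_stripe {
        struct toy_write *writes[NR_DATA];
        size_t            nr_filled;
        bool              closed;
    };

    /* A new write: fully replicated, with one copy queued into the stripe. */
    static void toy_ec_write(struct toy_stripe *s, struct toy_write *w,
                             int data_replicas)
    {
        w->nr_replicas  = data_replicas;       /* ordinary replication first */
        w->stripe_block = (int)s->nr_filled;   /* one copy joins the stripe */
        s->writes[s->nr_filled++] = w;

        if (s->nr_filled == NR_DATA) {
            /*
             * The stripe is complete: the P and Q buckets would be
             * computed and written out here, and only then do the extra
             * replicas become redundant.
             */
            s->closed = true;

            for (size_t i = 0; i < NR_DATA; i++)
                s->writes[i]->nr_replicas = 1;  /* drop the extra replicas */
        }
    }

Until the stripe closes, the extra replicas are what make the data
recoverable, which is why the effect resembles full data journalling.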

Since disks have write caches that are only flushed when we issue a
cache flush command - which we only do on journal commit - this full
data journalling should have negligible overhead if we can tweak the
allocator so that the buckets used for the extra replicas are reused
(and then overwritten again) immediately. This optimization is not yet
implemented, however.