
Bucket based allocation
-----------------------

As mentioned, bcachefs is descended from bcache, where the ability to
efficiently invalidate cached data and reuse disk space was a core
design requirement. To make this possible, the allocator divides the
disk up into buckets, typically 512 KiB to 2 MiB but possibly larger or
smaller. Buckets and data pointers carry generation numbers: we can
reuse a bucket with cached data in it, without finding and deleting all
the data pointers into it, by incrementing the bucket's generation
number.
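
As a minimal sketch of the idea (illustrative types, not the actual
bcachefs structures): a pointer is only valid while its generation
matches its bucket's, so invalidating a bucket full of cached data is a
single increment::

    #include <stdbool.h>
    #include <stdint.h>

    struct bucket {
            uint8_t gen;            /* bumped each time the bucket is invalidated */
    };

    struct data_ptr {
            uint64_t bucket_nr;     /* which bucket the data lives in */
            uint8_t  gen;           /* the bucket's gen when the pointer was created */
    };

    /* A pointer is stale - and the data it points to reclaimable -
     * once the bucket's generation has moved past it: */
    static bool ptr_stale(const struct bucket *buckets, struct data_ptr p)
    {
            return buckets[p.bucket_nr].gen != p.gen;
    }

    /* Invalidating a bucket of cached data is O(1): bump the gen, and
     * every pointer into the bucket goes stale without being touched. */
    static void invalidate_bucket(struct bucket *buckets, uint64_t nr)
    {
            buckets[nr].gen++;
    }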

In keeping with the copy-on-write theme of avoiding update-in-place
wherever possible, we never rewrite or overwrite data within a bucket:
when we allocate a bucket, we write to it sequentially and then we don't
write to it again until the bucket has been invalidated and its
generation number incremented.

This means we require a copying garbage collector to deal with internal
fragmentation, when patterns of random writes leave us with many buckets
that are partially empty because the data they contained was
overwritten. Copy GC evacuates buckets that are mostly empty by
rewriting the data they still contain to new buckets. This also means
that we need to reserve space on the device for the copy GC reserve when
formatting, typically 8% or 12%.
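
A hedged sketch of the selection step (the names, per-bucket accounting,
and threshold here are illustrative, not bcachefs internals): copygc
walks the buckets looking for ones that are mostly empty, so their live
data can be rewritten elsewhere and the whole bucket reclaimed::

    #include <stddef.h>
    #include <stdint.h>

    #define BUCKET_SIZE     (512 * 1024)        /* e.g. 512 KiB buckets */
    #define EVACUATE_BELOW  (BUCKET_SIZE / 2)   /* "mostly empty" cutoff */

    /*
     * live_bytes[i] is the number of bytes in bucket i still referenced
     * by live pointers.  Returns the next bucket worth evacuating, or -1
     * if every bucket is either empty (reusable as-is) or mostly full.
     */
    static long copygc_pick(const uint32_t *live_bytes, size_t nr_buckets)
    {
            for (size_t i = 0; i < nr_buckets; i++)
                    if (live_bytes[i] > 0 && live_bytes[i] < EVACUATE_BELOW)
                            return (long)i;
            return -1;
    }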

There are some advantages to structuring the allocator this way, besides
being able to support cached data:

- By maintaining multiple write points that are writing to different
  buckets, we're able to easily and naturally segregate unrelated IO
  from different processes, which helps greatly with fragmentation.
- The fast path of the allocator is essentially a simple bump allocator;
  the disk space allocation itself is extremely fast (see the sketch
  after this list).
- Fragmentation is generally a non-issue unless copygc has to kick in,
  and it usually doesn't under typical usage patterns. The allocator
  and copygc are doing essentially the same things as the flash
  translation layer in SSDs, but within the filesystem we have much
  greater visibility into where writes are coming from and how to
  segregate them, as well as which data is actually live; performance
  is generally more predictable than with SSDs under similar usage
  patterns.
- The same algorithms will in the future be used for managing SMR hard
  drives directly, avoiding the translation layer in the hard drive;
  doing this work within the filesystem should give much better
  performance and much more predictable latency.
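
To make the bump-allocator point concrete, here is a hedged sketch of
the fast path (illustrative types, not the real bcachefs write point):
sequential allocation within the currently open bucket amounts to a
bounds check and an addition::

    #include <stdbool.h>
    #include <stdint.h>

    #define BUCKET_SECTORS  1024    /* e.g. 512 KiB bucket, 512-byte sectors */

    struct write_point {
            uint64_t bucket;        /* bucket currently being filled */
            uint32_t offset;        /* sectors already written into it */
    };

    /*
     * Fast path: hand out the next @sectors within the open bucket.
     * Returns false when the bucket is full and a fresh one must be
     * allocated (the slow path, not shown).
     */
    static bool alloc_sectors(struct write_point *wp, uint32_t sectors,
                              uint64_t *bucket, uint32_t *offset)
    {
            if (wp->offset + sectors > BUCKET_SECTORS)
                    return false;

            *bucket = wp->bucket;
            *offset = wp->offset;
            wp->offset += sectors;  /* the "bump" */
            return true;
    }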