
Bucket based allocation
-----------------------

As mentioned, bcachefs is descended from bcache, where the ability to
efficiently invalidate cached data and reuse disk space was a core
design requirement. To make this possible, the allocator divides the
disk up into buckets, typically 512 KiB to 2 MiB but possibly larger or
smaller. Buckets and data pointers carry generation numbers: we can
reuse a bucket with cached data in it, without finding and deleting all
the data pointers into it, by incrementing the bucket's generation
number.
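
As a minimal sketch of the idea (illustrative types, not the actual
bcachefs structures): a pointer is only valid while its generation
matches its bucket's, so invalidating a bucket full of cached data is a
single increment::

    #include <stdbool.h>
    #include <stdint.h>

    struct bucket {
            uint8_t gen;            /* bumped each time the bucket is invalidated */
    };

    struct data_ptr {
            uint64_t bucket_nr;     /* which bucket the data lives in */
            uint8_t  gen;           /* the bucket's gen when the pointer was created */
    };

    /* A pointer is stale - and the data it points to reclaimable -
     * once the bucket's generation has moved past it: */
    static bool ptr_stale(const struct bucket *buckets, struct data_ptr p)
    {
            return buckets[p.bucket_nr].gen != p.gen;
    }

    /* Invalidating a bucket of cached data is O(1): bump the gen, and
     * every pointer into the bucket goes stale without being touched. */
    static void invalidate_bucket(struct bucket *buckets, uint64_t nr)
    {
            buckets[nr].gen++;
    }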

In keeping with the copy-on-write theme of avoiding update-in-place
wherever possible, we never rewrite or overwrite data within a bucket:
when we allocate a bucket, we write to it sequentially and then we don't
write to it again until the bucket has been invalidated and its
generation number incremented.

This means we require a copying garbage collector to deal with internal
fragmentation, when patterns of random writes leave us with many buckets
that are partially empty because the data they contained was
overwritten. Copy GC evacuates buckets that are mostly empty by
rewriting the data they still contain to new buckets. This also means
that we need to reserve space on the device for the copy GC reserve when
formatting, typically 8% or 12%.
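
A hedged sketch of the selection step (the names, per-bucket accounting,
and threshold here are illustrative, not bcachefs internals): copygc
walks the buckets looking for ones that are mostly empty, so their live
data can be rewritten elsewhere and the whole bucket reclaimed::

    #include <stddef.h>
    #include <stdint.h>

    #define BUCKET_SIZE     (512 * 1024)        /* e.g. 512 KiB buckets */
    #define EVACUATE_BELOW  (BUCKET_SIZE / 2)   /* "mostly empty" cutoff */

    /*
     * live_bytes[i] is the number of bytes in bucket i still referenced
     * by live pointers.  Returns the next bucket worth evacuating, or -1
     * if every bucket is either empty (reusable as-is) or mostly full.
     */
    static long copygc_pick(const uint32_t *live_bytes, size_t nr_buckets)
    {
            for (size_t i = 0; i < nr_buckets; i++)
                    if (live_bytes[i] > 0 && live_bytes[i] < EVACUATE_BELOW)
                            return (long)i;
            return -1;
    }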

There are some advantages to structuring the allocator this way, besides
being able to support cached data:

- By maintaining multiple write points that are writing to different
  buckets, we're able to easily and naturally segregate unrelated IO
  from different processes, which helps greatly with fragmentation.
- The fast path of the allocator is essentially a simple bump allocator;
  the disk space allocation itself is extremely fast (see the sketch
  after this list).
- Fragmentation is generally a non-issue unless copygc has to kick in,
  and it usually doesn't under typical usage patterns. The allocator
  and copygc are doing essentially the same things as the flash
  translation layer in SSDs, but within the filesystem we have much
  greater visibility into where writes are coming from and how to
  segregate them, as well as which data is actually live; performance
  is generally more predictable than with SSDs under similar usage
  patterns.
- The same algorithms will in the future be used for managing SMR hard
  drives directly, avoiding the translation layer in the hard drive;
  doing this work within the filesystem should give much better
  performance and much more predictable latency.
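
To make the bump-allocator point concrete, here is a hedged sketch of
the fast path (illustrative types, not the real bcachefs write point):
sequential allocation within the currently open bucket amounts to a
bounds check and an addition::

    #include <stdbool.h>
    #include <stdint.h>

    #define BUCKET_SECTORS  1024    /* e.g. 512 KiB bucket, 512-byte sectors */

    struct write_point {
            uint64_t bucket;        /* bucket currently being filled */
            uint32_t offset;        /* sectors already written into it */
    };

    /*
     * Fast path: hand out the next @sectors within the open bucket.
     * Returns false when the bucket is full and a fresh one must be
     * allocated (the slow path, not shown).
     */
    static bool alloc_sectors(struct write_point *wp, uint32_t sectors,
                              uint64_t *bucket, uint32_t *offset)
    {
            if (wp->offset + sectors > BUCKET_SECTORS)
                    return false;

            *bucket = wp->bucket;
            *offset = wp->offset;
            wp->offset += sectors;  /* the "bump" */
            return true;
    }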