Bucket based allocation
-----------------------

As mentioned, bcachefs is descended from bcache, where the ability to
efficiently invalidate cached data and reuse disk space was a core
design requirement. To make this possible, the allocator divides the
disk up into buckets, typically 512k to 2M but possibly larger or
smaller. Buckets and data pointers have generation numbers: we can reuse
a bucket with cached data in it without finding and deleting all the
data pointers, by simply incrementing the generation number.
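
To make the generation number mechanism concrete, here is a minimal
sketch in C. The structures and field names are illustrative, not the
actual bcachefs data structures, which carry much more state:

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    struct bucket {
        uint8_t  gen;          /* bumped each time the bucket is invalidated */
        uint32_t live_sectors; /* sectors still referenced by live data */
    };

    struct data_ptr {
        uint64_t bucket_nr; /* which bucket the data lives in */
        uint32_t offset;    /* sector offset within that bucket */
        uint8_t  gen;       /* bucket generation when the pointer was made */
    };

    /* A pointer is live only while its generation matches the bucket's. */
    static bool ptr_is_live(const struct bucket *b, const struct data_ptr *p)
    {
        return p->gen == b->gen;
    }

    /* Invalidating cached data is a single increment: every existing
     * pointer into the bucket goes stale without being touched. */
    static void invalidate_bucket(struct bucket *b)
    {
        b->gen++;
        b->live_sectors = 0;
    }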

In keeping with the copy-on-write theme of avoiding update-in-place
wherever possible, we never rewrite or overwrite data within a bucket -
when we allocate a bucket, we write to it sequentially and then we don’t
write to it again until the bucket has been invalidated and the
generation number incremented.
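
This makes the fast path of space allocation a bump allocator: a write
point is just a cursor into the bucket currently being filled, and the
cursor only ever moves forward. A sketch, again with hypothetical names:

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    #define BUCKET_SECTORS 4096u /* e.g. a 2M bucket of 512-byte sectors */

    struct write_point {
        uint64_t bucket_nr;   /* bucket currently being filled */
        uint32_t next_sector; /* bump cursor within that bucket */
    };

    /* Try to allocate @sectors from the current bucket; returns false
     * when it is full and the caller must grab a freshly invalidated
     * bucket and reset the cursor to zero. */
    static bool wp_alloc(struct write_point *wp, uint32_t sectors,
                         uint64_t *bucket_nr, uint32_t *offset)
    {
        if (wp->next_sector + sectors > BUCKET_SECTORS)
            return false;

        *bucket_nr = wp->bucket_nr;
        *offset    = wp->next_sector;
        wp->next_sector += sectors; /* the whole fast path: one add */
        return true;
    }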

This means we require a copying garbage collector to deal with internal
fragmentation, which arises when patterns of random writes leave us with
many buckets that are partially empty (because the data they contained
was overwritten) - copy GC evacuates buckets that are mostly empty by
writing the live data they contain to new buckets. This also means that
we need to reserve space on the device for the copy GC reserve when
formatting - typically 8% or 12% - so that copy GC always has free
buckets to move data into.
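
In outline, a copy GC pass looks something like the following sketch.
The helpers and the evacuation threshold are made up for illustration;
the real implementation scans the btree for live extents and picks
victims with fragmentation-based heuristics:

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    #define NR_BUCKETS     1024u
    #define BUCKET_SECTORS 4096u

    struct bucket {
        uint8_t  gen;
        uint32_t live_sectors;
    };

    static struct bucket buckets[NR_BUCKETS];

    /* Stand-in for the real data movement: rewrite the bucket's live
     * extents through the normal write path into fresh buckets. */
    static void evacuate(struct bucket *b)
    {
        b->live_sectors = 0;
    }

    static void copygc_pass(void)
    {
        for (size_t i = 0; i < NR_BUCKETS; i++) {
            struct bucket *b = &buckets[i];

            /* Evacuate mostly-empty buckets, then invalidate them so
             * their space can be reused by the allocator. */
            if (b->live_sectors < BUCKET_SECTORS / 4) {
                evacuate(b);
                b->gen++;
            }
        }
    }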

There are some advantages to structuring the allocator this way, besides
being able to support cached data:

- By maintaining multiple write points that are writing to different
  buckets, we’re able to easily and naturally segregate unrelated IO
  from different processes, which helps greatly with fragmentation (see
  the sketch after this list).

- The fast path of the allocator is essentially a simple bump allocator -
  disk space allocation is extremely fast.

- Fragmentation is generally a non-issue unless copygc has to kick in,
  and it usually doesn’t under typical usage patterns. The allocator
  and copygc are doing essentially the same things as the flash
  translation layer in SSDs, but within the filesystem we have much
  greater visibility into where writes are coming from and how to
  segregate them, as well as which data is actually live - performance
  is generally more predictable than with SSDs under similar usage
  patterns.

- The same algorithms will in the future be used for managing SMR hard
  drives directly, avoiding the translation layer in the hard drive -
  doing this work within the filesystem should give much better
  performance and much more predictable latency.
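
As a sketch of the write point segregation mentioned in the first point
above: keeping one write point per class of IO means unrelated streams
fill different buckets, so data with similar lifetimes stays together on
disk. The classification below is hypothetical; the real code keys write
points off more state than this:

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical write classes; illustrative only. */
    enum write_class {
        WRITE_CLASS_USER,     /* ordinary user data */
        WRITE_CLASS_METADATA, /* btree nodes, journal, etc. */
        WRITE_CLASS_COPYGC,   /* data being moved by copy GC */
        WRITE_CLASS_NR,
    };

    struct write_point {
        uint64_t bucket_nr;
        uint32_t next_sector;
    };

    /* One write point per class: each stream bump-allocates within its
     * own bucket, so unrelated writes never interleave on disk. */
    static struct write_point write_points[WRITE_CLASS_NR];

    static struct write_point *pick_write_point(enum write_class class)
    {
        return &write_points[class];
    }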