mirror of
https://github.com/koverstreet/bcachefs-tools.git
synced 2025-02-24 00:00:19 +03:00
48 lines
2.4 KiB
ReStructuredText
48 lines
2.4 KiB
ReStructuredText
|
Bucket based allocation
|
|||
|
-----------------------
|
|||
|
|
|||
|
As mentioned bcachefs is descended from bcache, where the ability to
|
|||
|
efficiently invalidate cached data and reuse disk space was a core
|
|||
|
design requirement. To make this possible the allocator divides the disk
|
|||
|
up into buckets, typically 512k to 2M but possibly larger or smaller.
|
|||
|
Buckets and data pointers have generation numbers: we can reuse a bucket
|
|||
|
with cached data in it without finding and deleting all the data
|
|||
|
pointers by incrementing the generation number.
|
|||
|
|
|||
|
In keeping with the copy-on-write theme of avoiding update in place
|
|||
|
wherever possible, we never rewrite or overwrite data within a bucket -
|
|||
|
when we allocate a bucket, we write to it sequentially and then we don’t
|
|||
|
write to it again until the bucket has been invalidated and the
|
|||
|
generation number incremented.
|
|||
|
|
|||
|
This means we require a copying garbage collector to deal with internal
|
|||
|
fragmentation, when patterns of random writes leave us with many buckets
|
|||
|
that are partially empty (because the data they contained was
|
|||
|
overwritten) - copy GC evacuates buckets that are mostly empty by
|
|||
|
writing the data they contain to new buckets. This also means that we
|
|||
|
need to reserve space on the device for the copy GC reserve when
|
|||
|
formatting - typically 8% or 12%.
|
|||
|
|
|||
|
There are some advantages to structuring the allocator this way, besides
|
|||
|
being able to support cached data:
|
|||
|
|
|||
|
- By maintaining multiple write points that are writing to different
|
|||
|
buckets, we’re able to easily and naturally segregate unrelated IO
|
|||
|
from different processes, which helps greatly with fragmentation.
|
|||
|
|
|||
|
- The fast path of the allocator is essentially a simple bump allocator
|
|||
|
- the disk space allocation is extremely fast
|
|||
|
|
|||
|
- Fragmentation is generally a non issue unless copygc has to kick in,
|
|||
|
and it usually doesn’t under typical usage patterns. The allocator
|
|||
|
and copygc are doing essentially the same things as the flash
|
|||
|
translation layer in SSDs, but within the filesystem we have much
|
|||
|
greater visibility into where writes are coming from and how to
|
|||
|
segregate them, as well as which data is actually live - performance
|
|||
|
is generally more predictable than with SSDs under similar usage
|
|||
|
patterns.
|
|||
|
|
|||
|
- The same algorithms will in the future be used for managing SMR hard
|
|||
|
drives directly, avoiding the translation layer in the hard drive -
|
|||
|
doing this work within the filesystem should give much better
|
|||
|
performance and much more predictable latency.
|