# Changelog

## v1.33.0 - Thu Dec 4 2025

`bcachefs_metadata_version_reconcile` (formerly known as rebalance_v2)

### Reconcile

An incompatible upgrade is required to enable reconcile.

Reconcile now handles all IO path options; previously only the background target
and background compression options were handled.

Reconcile can now process metadata (moving it to the correct target,
rereplicating degraded metadata); previously rebalance was only able to handle
user data.

Reconcile now automatically reacts to option changes and device setting
changes, and immediately rereplicates degraded data or metadata.

This obsoletes the commands `data rereplicate`, `data job drop_extra_replicas`,
and others; the new commands are `reconcile status` and `reconcile wait`.

The recovery pass `check_reconcile_work` now checks that data matches the
specified IO path options, and flags an error if it does not (unless the
mismatch is due to an option change that hasn't yet been propagated).

Additional improvements over rebalance and implementation notes:

We now have a separate index, `BTREE_ID_reconcile_pending`, for data that's
scheduled to be processed by reconcile but can't be (e.g. because the specified
target is full); this resolves long-standing reports of rebalance spinning when
a filesystem has more data than fits on the specified background target.

This also means you can create a single device filesystem with replicas=2, and
upon adding a new device data will automatically be replicated on the new
device, no additional user intervention required.

There's a separate index for "high priority" reconcile processing,
`BTREE_ID_reconcile_hipri`. This is used for degraded extents that need to be
rereplicated; they'll be processed ahead of other work.

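As a rough mental model of the pending and high priority indexes described above, the following Python sketch sorts extents into three buckets and processes the high priority bucket first. It is only an illustration, not the kernel implementation: the `Extent` type, the `classify` and `processing_order` helpers, and the rule that a full target always parks an extent as pending are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    name: str
    degraded: bool  # needs rereplication, so it is high priority
    target: str     # hypothetical label for where the extent should end up


def classify(extents, full_targets):
    """Split extents into (hipri, work, pending) lists.

    Assumption for this sketch: anything whose target is full is parked as
    pending, so the worker never spins retrying it.
    """
    hipri, work, pending = [], [], []
    for e in extents:
        if e.target in full_targets:
            pending.append(e)
        elif e.degraded:
            hipri.append(e)
        else:
            work.append(e)
    return hipri, work, pending


def processing_order(extents, full_targets):
    """Names of extents in the order they would be processed: hipri first."""
    hipri, work, _pending = classify(extents, full_targets)
    return [e.name for e in hipri + work]
```

Parking blocked work in its own bucket means the worker only revisits it when something changes (for example, space is freed on the target), instead of rescanning it on every pass.
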
Rotating disks get special handling. We now track whether a disk is rotational
(a hard drive, instead of an SSD); pending work on those disks is additionally
indexed in the `BTREE_ID_reconcile_work_phys` and
`BTREE_ID_reconcile_hipri_phys` btrees so that it can be processed in physical
LBA order, not logical key order, avoiding unnecessary seeks.

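To illustrate why physical ordering matters on rotational disks, here is a small Python sketch that compares the two orderings. The `PendingWork` fields and the `seek_distance` cost proxy are hypothetical and chosen only for the example; the actual btree key layout is not described here.

```python
from typing import NamedTuple


class PendingWork(NamedTuple):
    inode: int   # logical key: (inode, offset)
    offset: int
    device: int  # physical location: (device, lba)
    lba: int


def logical_order(work):
    """Order work by logical key, as a plain logical index would."""
    return sorted(work, key=lambda w: (w.inode, w.offset))


def physical_order(work):
    """Order work by physical location, as a physically keyed index would."""
    return sorted(work, key=lambda w: (w.device, w.lba))


def seek_distance(ordered):
    """Total LBA distance travelled on one device: a crude seek-cost proxy."""
    return sum(abs(b.lba - a.lba) for a, b in zip(ordered, ordered[1:]))
```

With four extents on one device at LBAs 900, 100, 800 and 200 (in logical key order), the logical ordering travels 2100 LBAs while the physical ordering travels 800.
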
We don't yet have the ability to change the rotational setting on an existing
device once it's been set; if you discover you need this, please let us know so
it can be bumped up on the list (it'll be a medium sized project).

`BCH_MEMBER_STATE_failed` has been renamed to `BCH_MEMBER_STATE_evacuating`;
as the name implies, reconcile automatically moves data off of devices in the
evacuating state. In the future, when we have better tracking and monitoring
of drive health, we'll be able to automatically mark failing devices as
evacuating: when this lands, you'll be able to load up a server with disks and
walk away - come back a year later to swap out the ones that have failed.

Reconcile was a massive project: the short and simple user interface is
deceptive; there was an enormous amount of work under the hood to make
everything work consistently and handle all the special cases we've learned
about over the past few years with rebalance.

There's still reconcile-related work to be done on disk space accounting when
devices are read-only or evacuating, and in the future we want to reserve space
up front on option change, so that we can alert the user if they might be doing
something they don't have disk space for.

### Other improvements and changes:

- Degraded data is now always properly reported as degraded (by `bcachefs fs
  usage`); data is considered degraded any time the durability on good
  (non-evacuating) devices is less than the specified replication level.

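The degraded rule above can be written as a one-line check. This is a minimal sketch, assuming each copy of an extent is described by its device's durability and state; the `is_degraded` helper and the state strings are illustrative, not the kernel's data structures.

```python
def is_degraded(replica_devices, replicas):
    """Return True if durability on good devices is below `replicas`.

    replica_devices: iterable of (durability, state) pairs, one per device
    holding a copy of the data. Only the "evacuating" state is treated as
    not good in this sketch.
    """
    good_durability = sum(
        durability
        for durability, state in replica_devices
        if state != "evacuating"
    )
    return good_durability < replicas
```

For example, with `replicas=2`, one copy on a healthy device and one on an evacuating device counts as degraded, because only one unit of durability is on a good device.
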
- Counters (shown by `bcachefs fs top`) and tracepoints have gotten a giant
  cleanup and rework: every counter has a corresponding tracepoint. This makes
  it easy to drill down and investigate when a filesystem is doing something
  unusual and unexpected.

  Under the hood, the conversion of tracepoints to printbufs/pretty printers has
  now been completed, with some much improved helpers. This makes it much easier
  to add new counters and tracepoints or add additional info to existing
  tracepoints; this is typically a 5-20 line patch. If there's something you're
  investigating and you need more info, just ask.

  We now make use of type information on counters to display data rates in
  `bcachefs fs top` where applicable, and many counters have been converted to
  data rates. This makes it much easier to correlate different counters (e.g.
  `data_update`, `data_update_fail`) to check if the rates of slowpath events
  should be a cause for concern.

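The kind of correlation described above comes down to simple arithmetic on counter samples. The sketch below assumes two samples of a monotonically increasing counter taken a known interval apart; the `rate` and `failure_ratio` helpers and the sample values are made up for illustration and are not part of `bcachefs fs top`.

```python
def rate(prev, curr, interval_s):
    """Per-second rate from two samples of a monotonically increasing counter."""
    return (curr - prev) / interval_s


def failure_ratio(update_rate, fail_rate):
    """Fraction of the update rate that failed; 0.0 when nothing was updated."""
    return fail_rate / update_rate if update_rate else 0.0


# Hypothetical samples taken 2 seconds apart.
update_rate = rate(1_000_000, 5_000_000, 2.0)  # stands in for data_update
fail_rate = rate(0, 40_000, 2.0)               # stands in for data_update_fail
ratio = failure_ratio(update_rate, fail_rate)
```

Comparing two rates with the same units gives a ratio that is easy to judge at a glance, which is harder to do with raw event counts.
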
- Logging/error message improvements

  Logging has been a major area of focus, with a lot of under the hood
  improvements to make it ergonomic to generate messages that clearly explain
  what the system is doing and why: error messages should not include just the
  error, but how it was handled (soft error or hard error) and all actions taken
  to correct the error (e.g. scheduling self healing or recovery passes).

  When we receive an IO error from the block layer we now report the specific
  error code we received (e.g. `BLK_STS_IOERR`, `BLK_STS_INVAL`).

  The various write paths (user data, btree, journal) now report one error
  message for the entire operation that includes all the sub-errors for the
  individual replicated writes and the status of the overall operation (soft
  error (wrote degraded data) vs. hard error), like the read paths.

  On failure to mount due to insufficient devices, we now report which device(s)
  were missing; we remember the device name and model in the superblock from the
  last time we saw it so that we can give helpful hints to the user about what's
  missing.

  When btree topology repair recovers via btree node scan, we now report which
  node(s) it was able to recover via scan; this helps with determining if data
  was actually lost or not.

  We now ratelimit soft and hard errors separately, in the data/journal/btree
  read and write paths, ensuring that if the system is being flooded with soft
  errors the hard errors will still be reported.

  All error ratelimiting now obeys the `no_ratelimit_errors` option.

  All recovery passes should now have progress indicators.

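The separate ratelimiting of soft and hard errors described above can be pictured as two independent limiters, with `no_ratelimit_errors` bypassing both. This Python sketch is a simplified model under that assumption; the `RateLimiter` and `ErrorReporter` classes, the fixed-window policy, and the burst and interval values are hypothetical and do not reflect the kernel's ratelimit code.

```python
class RateLimiter:
    """Allow at most `burst` events per `interval` seconds (fixed window)."""

    def __init__(self, burst, interval):
        self.burst = burst
        self.interval = interval
        self.window_start = None
        self.count = 0

    def allow(self, now):
        # Start a new window on first use or once the current one has expired.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.count = 0
        self.count += 1
        return self.count <= self.burst


class ErrorReporter:
    """Keep one limiter per severity so soft errors can't starve hard ones."""

    def __init__(self, burst=2, interval=5.0, no_ratelimit_errors=False):
        self.limiters = {
            "soft": RateLimiter(burst, interval),
            "hard": RateLimiter(burst, interval),
        }
        self.no_ratelimit_errors = no_ratelimit_errors
        self.reported = []

    def report(self, severity, msg, now):
        """Record the message if allowed; return whether it was reported."""
        if self.no_ratelimit_errors or self.limiters[severity].allow(now):
            self.reported.append((severity, msg))
            return True
        return False
```

With a shared limiter, a flood of soft errors would use up the budget and the first hard error would be dropped; with one limiter per severity it still gets through.
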
- New options:

  `mount_trusts_udev`: there have been reports of mounting by UUID failing due
  to known bugs in libblkid. Previously this was available as an environment
  variable, but it now may be specified as a mount option (where it should also
  be much easier to find). When specified, we only use udev for getting the list
  of the system's block devices; we do all the probing for filesystem members
  ourselves.

  `writeback_timeout`: if set, this overrides the `vm.dirty_writeback*` sysctls
  for the given filesystem, and may be set persistently. Useful for setting a
  lower writeback timeout for removable media.

- Other smaller user-visible improvements

  The `mi_btree_bitmap` field in the member info section of the superblock now
  has a recovery pass to clean it up and shrink it; it will be automatically
  scheduled when we notice that there is significantly more space on a device
  marked as containing metadata than we have metadata on that device.

  The member-info btree bitmap is used by btree node scan, for disaster recovery
  repair; shrinking the bitmap reduces the amount of the device that has to be
  scanned if we have to recover from btree nodes that have become unreadable or
  lost despite replication. You don't ever want to need it, but if you do need
  it, it's there.

- Promotes are now ratelimited; this resolves an issue with spinning up far too
  many kworker threads for promotes that wouldn't happen due to the target being
  busy.

- An issue was spotted on a user filesystem where btree node merging wasn't
  happening properly on the `reconcile_work` btree, causing a very slow upgrade.
  Btree node merging has now seen some improvements; btree lookups can now kick
  off asynchronous btree node merges when they spot an empty btree node, and the
  btree write buffer now does btree merging asynchronously, which should be a
  noticeable improvement on system performance under heavy load for some users -
  btree write buffer flushing is single threaded and can be a bottleneck.

  There's also a new recovery pass, `merge_btree_nodes`, to check all btrees for
  nodes that can be merged. It's not run automatically, but can be run if
  desired by passing the `recovery_passes` option to an online fsck.

- And many other bug fixes.

### Notable under-the-hood codebase work:

A lot of codebase modernization has been happening over the past six months,
to prepare for Rust. With the latest features recently available in C and in
the kernel, we can now do incremental refactorings to bring code steadily more
in line with what the Rust version will be, so that the future conversion will
be mostly syntactic - and not a rewrite. The big enabler here was `CLASS()`,
which is the kernel's version of pseudo-RAII based on `__cleanup()`; this
allows for the removal of goto based error handling (Rust notably does not
have goto).

We're now down to ~600 gotos in the entire codebase, down from ~2500 when the
modernization started, with many files being complete.

Other work includes avoiding open coded vectors (bcachefs uses `DARRAY()`, which
is decently close to Rust/C++ vectors) and adopting the `try()` macro for
forwarding errors, stolen from Rust. These cleanups have deleted thousands of
lines from the codebase over the past months.