From 7d5817d9c20cfa2520ad773ac6d5b3e7b69d4976 Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Thu, 4 Dec 2025 10:13:32 -0500 Subject: [PATCH] Changelog for 1.33 Signed-off-by: Kent Overstreet --- Changelog.mdwn | 189 +++++++++++++++++++++++++++++++++++++++++++++++++ debian/NEWS | 189 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 378 insertions(+) create mode 100644 Changelog.mdwn create mode 100644 debian/NEWS diff --git a/Changelog.mdwn b/Changelog.mdwn new file mode 100644 index 00000000..15269b65 --- /dev/null +++ b/Changelog.mdwn @@ -0,0 +1,189 @@ +# Changelog + +## v1.33.0 - Thu Dec 4 2025 + +`bcachefs_metadata_version_reconcile` (formerly known as rebalance_v2) + +### Reconcile + +An incompatible upgrade is required to enable reconcile. + +Reconcile now handles all IO path options; previously only the background target +and background compression options were handled. + +Reconcile can now process metadata (moving it to the correct target, +rereplicating degraded metadata); previously rebalance was only able to handle +user data. + +Reconcile now automatically reacts to option changes and device setting +changes, and immediately rereplicates degraded data or metadata. + +This obsoletes the commands `data rereplicate`, `data job +drop_extra_replicas`, and others; the new commands are `reconcile status` and +`reconcile wait`. + +The recovery pass `check_reconcile_work` now checks that data matches the +specified IO path options, and flags an error if it does not (if it wasn't due +to an option change that hasn't yet been propagated). + +Additional improvements over rebalance and implementation notes: + +We now have a separate index for data that's scheduled to be processed by +reconcile but can't (e.g. because the specified target is full), +`BTREE_ID_reconcile_pending`; this solves long standing reports of rebalance +spinning when a filesystem has more data than fits on the specified background +target.
+ +This also means you can create a single device filesystem with replicas=2, and +upon adding a new device data will automatically be replicated on the new +device, no additional user intervention required. + +There's a separate index for "high priority" reconcile processing - +`BTREE_ID_reconcile_hipri`. This is used for degraded extents that need to be +rereplicated; they'll be processed ahead of other work. + +Rotating disks get special handling. We now track whether a disk is rotational +(a hard drive, instead of an SSD); pending work on those disks is additionally +indexed in the `BTREE_ID_reconcile_work_phys` and +`BTREE_ID_reconcile_hipri_phys` btrees so they can be processed in physical +LBA order, not logical key order, avoiding unnecessary seeks. + +We don't yet have the ability to change the rotational setting on an existing +device, once it's been set; if you discover you need this, please let us know so +it can be bumped up on the list (it'll be a medium sized project). + +`BCH_MEMBER_STATE_failed` has been renamed to `BCH_MEMBER_STATE_evacuating`; +as the name implies, reconcile automatically moves data off of devices in the +evacuating state. In the future, when we have better tracking and monitoring +of drive health, we'll be able to automatically mark failing devices as +evacuating: when this lands, you'll be able to load up a server with disks and +walk away - come back a year later to swap out the ones that have been failed. + +Reconcile was a massive project: the short and simple user interface is +deceptive, there was an enormous amount of work under the hood to make +everything work consistently and handle all the special cases we've learned +about over the past few years with rebalance. 
+ +There's still reconcile-related work to be done on disk space accounting when +devices are read-only or evacuating, and in the future we want to reserve space +up front on option change, so that we can alert the user if they might be doing +something they don't have disk space for. + +### Other improvements and changes: + +- Degraded data is now always properly reported as degraded (by `bcachefs fs + usage`); data is considered degraded any time the durability on good + (non-evacuating) devices is less than the specified replication level. + +- Counters (shown by `bcachefs fs top`) and tracepoints have gotten a giant + cleanup and rework: every counter has a corresponding tracepoint. This makes + it easy to drill down and investigate when a filesystem is doing something + unusual and unexpected. + + Under the hood, the conversion of tracepoints to printbufs/pretty printers has + now been completed, with some much improved helpers. This makes it much easier + to add new counters and tracepoints or add additional info to existing + tracepoints, typically a 5-20 line patch. If there's something you're + investigating and you need more info, just ask. + + We now make use of type information on counters to display data rates in + `bcachefs fs top` where applicable, and many counters have been converted to + data rates. This makes it much easier to correlate different counters (e.g. + `data_update`, `data_update_fail`) to check if the rates of slowpath events + should be a cause for concern. + +- Logging/error message improvements + + Logging has been a major area of focus, with a lot of under the hood + improvements to make it ergonomic to generate messages that clearly explain + what the system is doing and why: error messages should include not just the + error, but how it was handled (soft error or hard error) and all actions taken + to correct the error (e.g. scheduling self healing or recovery passes).
+ + When we receive an IO error from the block layer we now report the specific + error code we received (e.g. `BLK_STS_IOERR`, `BLK_STS_INVAL`). + + The various write paths (user data, btree, journal) now report one error + message for the entire operation that includes all the sub-errors for the + individual replicated writes and the status of the overall operation (soft + error (wrote degraded data) vs. hard error), like the read paths. + + On failure to mount due to insufficient devices, we now report which device(s) + were missing; we remember the device name and model in the superblock from the + last time we saw it so that we can give helpful hints to the user about what's + missing. + + When btree topology repair recovers via btree node scan, we now report which + node(s) it was able to recover via scan; this helps with determining if data + was actually lost or not. + + We now ratelimit soft and hard errors separately, in the data/journal/btree + read and write paths, ensuring that if the system is being flooded with soft + errors the hard errors will still be reported. + + All error ratelimiting now obeys the `no_ratelimit_errors` option. + + All recovery passes should now have progress indicators. + +- New options: + + `mount_trusts_udev`: there have been reports of mounting by UUID failing due + to known bugs in libblkid. Previously this was available as an environment + variable, but it now may be specified as a mount option (where it should also + be much easier to find). When specified, we only use udev for getting the list + of the system's block devices; we do all the probing for filesystem members + ourselves. + + `writeback_timeout`: if set, this overrides the `vm.dirty_writeback*` sysctls + for the given filesystem, and may be set persistently. Useful for setting a + lower writeback timeout for removable media.
+ +- Other smaller user-visible improvements + + The `mi_btree_bitmap` field in the member info section of the superblock now + has a recovery pass to clean it up and shrink it; it will be automatically + scheduled when we notice that there is significantly more space on a device + marked as containing metadata than we have metadata on that device. + + The member-info btree bitmap is used by btree node scan, for disaster recovery + repair; shrinking the bitmap reduces the amount of the device that has to be + scanned if we have to recover from btree nodes that have become unreadable or + lost despite replication. You don't ever want to need it, but if you do need + it it's there. + +- Promotes are now ratelimited; this resolves an issue with spinning up far too + many kworker threads for promotes that wouldn't happen due to the target being + busy. + +- An issue was spotted on a user filesystem where btree node merging wasn't + happening properly on the `reconcile_work` btree, causing a very slow upgrade. + Btree node merging has now seen some improvements; btree lookups can now kick + off asynchronous btree node merges when they spot an empty btree node, and the + btree write buffer now does btree merging asynchronously, which should be a + noticeable improvement on system performance under heavy load for some users - + btree write buffer flushing is single threaded and can be a bottleneck. + + There's also a new recovery pass, `merge_btree_nodes`, to check all btrees for + nodes that can be merged. It's not run automatically, but can be run if + desired by passing the `recovery_passes` option to an online fsck. + +- And many other bug fixes. + +### Notable under-the-hood codebase work: + +A lot of codebase modernization has been happening over the past six months, +to prepare for Rust. 
With the latest features recently available in C and in +the kernel, we can now do incremental refactorings to bring code steadily more +in line with what the Rust version will be, so that the future conversion will +be mostly syntactic - and not a rewrite. The big enabler here was CLASS(), +which is the kernel's version of pseudo-RAII based on `__cleanup()`; this +allows for the removal of goto based error handling (Rust notably does not +have goto). + +We're now down to ~600 gotos in the entire codebase, down from ~2500 when the +modernization started, with many files being complete. + +Other work includes avoiding open coded vectors; bcachefs uses DARRAY(), which +is decently close to Rust/C++ vectors, and the try() macro for forwarding +errors, stolen from Rust. These cleanups have deleted thousands of lines from +the codebase over the past months. diff --git a/debian/NEWS b/debian/NEWS new file mode 100644 index 00000000..15269b65 --- /dev/null +++ b/debian/NEWS @@ -0,0 +1,189 @@ +# Changelog + +## v1.33.0 - Thu Dec 4 2025 + +`bcachefs_metadata_version_reconcile` (formerly known as rebalance_v2) + +### Reconcile + +An incompatible upgrade is required to enable reconcile. + +Reconcile now handles all IO path options; previously only the background target +and background compression options were handled. + +Reconcile can now process metadata (moving it to the correct target, +rereplicating degraded metadata); previously rebalance was only able to handle +user data. + +Reconcile now automatically reacts to option changes and device setting +changes, and immediately rereplicates degraded data or metadata. + +This obsoletes the commands `data rereplicate`, `data job +drop_extra_replicas`, and others; the new commands are `reconcile status` and +`reconcile wait`.
+ +The recovery pass `check_reconcile_work` now checks that data matches the +specified IO path options, and flags an error if it does not (if it wasn't due +to an option change that hasn't yet been propagated). + +Additional improvements over rebalance and implementation notes: + +We now have a separate index for data that's scheduled to be processed by +reconcile but can't (e.g. because the specified target is full), +`BTREE_ID_reconcile_pending`; this solves long standing reports of rebalance +spinning when a filesystem has more data than fits on the specified background +target. + +This also means you can create a single device filesystem with replicas=2, and +upon adding a new device data will automatically be replicated on the new +device, no additional user intervention required. + +There's a separate index for "high priority" reconcile processing - +`BTREE_ID_reconcile_hipri`. This is used for degraded extents that need to be +rereplicated; they'll be processed ahead of other work. + +Rotating disks get special handling. We now track whether a disk is rotational +(a hard drive, instead of an SSD); pending work on those disks is additionally +indexed in the `BTREE_ID_reconcile_work_phys` and +`BTREE_ID_reconcile_hipri_phys` btrees so they can be processed in physical +LBA order, not logical key order, avoiding unnecessary seeks. + +We don't yet have the ability to change the rotational setting on an existing +device, once it's been set; if you discover you need this, please let us know so +it can be bumped up on the list (it'll be a medium sized project). + +`BCH_MEMBER_STATE_failed` has been renamed to `BCH_MEMBER_STATE_evacuating`; +as the name implies, reconcile automatically moves data off of devices in the +evacuating state. 
In the future, when we have better tracking and monitoring +of drive health, we'll be able to automatically mark failing devices as +evacuating: when this lands, you'll be able to load up a server with disks and +walk away - come back a year later to swap out the ones that have been failed. + +Reconcile was a massive project: the short and simple user interface is +deceptive, there was an enormous amount of work under the hood to make +everything work consistently and handle all the special cases we've learned +about over the past few years with rebalance. + +There's still reconcile-related work to be done on disk space accounting when +devices are read-only or evacuating, and in the future we want to reserve space +up front on option change, so that we can alert the user if they might be doing +something they don't have disk space for. + +### Other improvements and changes: + +- Degraded data is now always properly reported as degraded (by `bcachefs fs + usage`); data is considered degraded any time the durability on good + (non-evacuating) devices is less than the specified replication level. + +- Counters (shown by `bcachefs fs top`) and tracepoints have gotten a giant + cleanup and rework: every counter has a corresponding tracepoint. This makes + it easy to drill down and investigate when a filesystem is doing something + unusual and unexpected. + + Under the hood, the conversion of tracepoints to printbufs/pretty printers has + now been completed, with some much improved helpers. This makes it much easier + to add new counters and tracepoints or add additional info to existing + tracepoints, typically a 5-20 line patch. If there's something you're + investigating and you need more info, just ask. + + We now make use of type information on counters to display data rates in + `bcachefs fs top` where applicable, and many counters have been converted to + data rates. This makes it much easier to correlate different counters (e.g.
+ `data_update`, `data_update_fail`) to check if the rates of slowpath events + should be a cause for concern. + +- Logging/error message improvements + + Logging has been a major area of focus, with a lot of under the hood + improvements to make it ergonomic to generate messages that clearly explain + what the system is doing and why: error messages should include not just the + error, but how it was handled (soft error or hard error) and all actions taken + to correct the error (e.g. scheduling self healing or recovery passes). + + When we receive an IO error from the block layer we now report the specific + error code we received (e.g. `BLK_STS_IOERR`, `BLK_STS_INVAL`). + + The various write paths (user data, btree, journal) now report one error + message for the entire operation that includes all the sub-errors for the + individual replicated writes and the status of the overall operation (soft + error (wrote degraded data) vs. hard error), like the read paths. + + On failure to mount due to insufficient devices, we now report which device(s) + were missing; we remember the device name and model in the superblock from the + last time we saw it so that we can give helpful hints to the user about what's + missing. + + When btree topology repair recovers via btree node scan, we now report which + node(s) it was able to recover via scan; this helps with determining if data + was actually lost or not. + + We now ratelimit soft and hard errors separately, in the data/journal/btree + read and write paths, ensuring that if the system is being flooded with soft + errors the hard errors will still be reported. + + All error ratelimiting now obeys the `no_ratelimit_errors` option. + + All recovery passes should now have progress indicators. + +- New options: + + `mount_trusts_udev`: there have been reports of mounting by UUID failing due + to known bugs in libblkid.
Previously this was available as an environment + variable, but it now may be specified as a mount option (where it should also + be much easier to find). When specified, we only use udev for getting the list + of the system's block devices; we do all the probing for filesystem members + ourselves. + + `writeback_timeout`: if set, this overrides the `vm.dirty_writeback*` sysctls + for the given filesystem, and may be set persistently. Useful for setting a + lower writeback timeout for removable media. + +- Other smaller user-visible improvements + + The `mi_btree_bitmap` field in the member info section of the superblock now + has a recovery pass to clean it up and shrink it; it will be automatically + scheduled when we notice that there is significantly more space on a device + marked as containing metadata than we have metadata on that device. + + The member-info btree bitmap is used by btree node scan, for disaster recovery + repair; shrinking the bitmap reduces the amount of the device that has to be + scanned if we have to recover from btree nodes that have become unreadable or + lost despite replication. You don't ever want to need it, but if you do need + it it's there. + +- Promotes are now ratelimited; this resolves an issue with spinning up far too + many kworker threads for promotes that wouldn't happen due to the target being + busy. + +- An issue was spotted on a user filesystem where btree node merging wasn't + happening properly on the `reconcile_work` btree, causing a very slow upgrade. + Btree node merging has now seen some improvements; btree lookups can now kick + off asynchronous btree node merges when they spot an empty btree node, and the + btree write buffer now does btree merging asynchronously, which should be a + noticeable improvement on system performance under heavy load for some users - + btree write buffer flushing is single threaded and can be a bottleneck.
+ + There's also a new recovery pass, `merge_btree_nodes`, to check all btrees for + nodes that can be merged. It's not run automatically, but can be run if + desired by passing the `recovery_passes` option to an online fsck. + +- And many other bug fixes. + +### Notable under-the-hood codebase work: + +A lot of codebase modernization has been happening over the past six months, +to prepare for Rust. With the latest features recently available in C and in +the kernel, we can now do incremental refactorings to bring code steadily more +in line with what the Rust version will be, so that the future conversion will +be mostly syntactic - and not a rewrite. The big enabler here was CLASS(), +which is the kernel's version of pseudo-RAII based on `__cleanup()`; this +allows for the removal of goto based error handling (Rust notably does not +have goto). + +We're now down to ~600 gotos in the entire codebase, down from ~2500 when the +modernization started, with many files being complete. + +Other work includes avoiding open coded vectors; bcachefs uses DARRAY(), which +is decently close to Rust/C++ vectors, and the try() macro for forwarding +errors, stolen from Rust. These cleanups have deleted thousands of lines from +the codebase over the past months.