Changelog for 1.33

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-12-07 00:00:12 +03:00 · 2025-12-04 10:13:32 -05:00 · 2025-12-04 10:13:32 -05:00 · 7d5817d9c2
commit 7d5817d9c2
parent b601a0f2c3
2 changed files with 378 additions and 0 deletions
--- a/Changelog.mdwn
+++ b/Changelog.mdwn
@@ -0,0 +1,189 @@
+# Changelog
+
+## v1.33.0 - Thu Dec  4 2025
+
+`bcachefs_metadata_version_reconcile` (formerly known as rebalance_v2)
+
+### Reconcile
+
+An incompatible upgrade is required to enable reconcile.
+
+Reconcile now handles all IO path options; previously only the background target
+and background compression options were handled.
+
+Reconcile can now process metadata (moving it to the correct target,
+rereplicating degraded metadata); previously rebalance was only able to handle
+user data.
+
+Reconcile now automatically reacts to option changes and device setting
+changes, and immediately rereplicates degraded data or metadata.
+
+This obsoletes the commands `data rereplicate`, `data job
+drop_extra_replicas`, and others; the new commands are `reconcile status` and
+`reconcile wait`.
+
+The recovery pass `check_reconcile_work` now checks that data matches the
+specified IO path options, and flags an error if it does not (unless the
+mismatch is due to an option change that hasn't yet been propagated).
+
+Additional improvements over rebalance and implementation notes:
+
+We now have a separate index for data that's scheduled to be processed by
+reconcile but can't (e.g. because the specified target is full),
+`BTREE_ID_reconcile_pending`; this solves long standing reports of rebalance
+spinning when a filesystem has more data than fits on the specified background
+target.
+
+This also means you can create a single device filesystem with replicas=2, and
+upon adding a new device, data will automatically be replicated on the new
+device, with no additional user intervention required.
+
+There's a separate index for "high priority" reconcile processing -
+`BTREE_ID_reconcile_hipri`. This is used for degraded extents that need to be
+rereplicated; they'll be processed ahead of other work.
+
+Rotating disks get special handling. We now track whether a disk is rotational
+(a hard drive, instead of an SSD); pending work on those disks is additionally
+indexed in the `BTREE_ID_reconcile_work_phys` and
+`BTREE_ID_reconcile_hipri_phys` btrees so they can be processed in physical
+LBA order, not logical key order, avoiding unnecessary seeks.
+
+We don't yet have the ability to change the rotational setting on an existing
+device, once it's been set; if you discover you need this, please let us know so
+it can be bumped up on the list (it'll be a medium sized project).
+
+`BCH_MEMBER_STATE_failed` has been renamed to `BCH_MEMBER_STATE_evacuating`;
+as the name implies, reconcile automatically moves data off of devices in the
+evacuating state. In the future, when we have better tracking and monitoring
+of drive health, we'll be able to automatically mark failing devices as
+evacuating: when this lands, you'll be able to load up a server with disks and
+walk away - come back a year later to swap out the ones that have failed.
+
+Reconcile was a massive project: the short and simple user interface is
+deceptive; there was an enormous amount of work under the hood to make
+everything work consistently and handle all the special cases we've learned
+about over the past few years with rebalance.
+
+There's still reconcile-related work to be done on disk space accounting when
+devices are read-only or evacuating, and in the future we want to reserve space
+up front on option change, so that we can alert the user if they might be doing
+something they don't have disk space for.
+
+### Other improvements and changes:
+
+- Degraded data is now always properly reported as degraded (by `bcachefs fs
+  usage`); data is considered degraded any time the durability on good
+  (non-evacuating) devices is less than the specified replication level.
+
+- Counters (shown by `bcachefs fs top`) and tracepoints have gotten a giant
+  cleanup and rework: every counter has a corresponding tracepoint. This makes
+  it easy to drill down and investigate when a filesystem is doing something
+  unusual and unexpected.
+
+  Under the hood, the conversion of tracepoints to printbufs/pretty printers has
+  now been completed, with some much improved helpers. This makes it much easier
+  to add new counters and tracepoints or add additional info to existing
+  tracepoints, typically a 5-20 line patch. If there's something you're
+  investigating and you need more info, just ask.
+
+  We now make use of type information on counters to display data rates in
+  `bcachefs fs top` where applicable, and many counters have been converted to
+  data rates. This makes it much easier to correlate different counters (e.g.
+  `data_update`, `data_update_fail`) to check if the rates of slowpath events
+  should be a cause for concern.
+
+- Logging/error message improvements
+
+  Logging has been a major area of focus, with a lot of under the hood
+  improvements to make it ergonomic to generate messages that clearly explain
+  what the system is doing and why: error messages should not include just the
+  error, but how it was handled (soft error or hard error) and all actions taken
+  to correct the error (e.g. scheduling self healing or recovery passes).
+
+  When we receive an IO error from the block layer we now report the specific
+  error code we received (e.g. `BLK_STS_IOERR`, `BLK_STS_INVAL`).
+
+  The various write paths (user data, btree, journal) now report one error
+  message for the entire operation that includes all the sub-errors for the
+  individual replicated writes and the status of the overall operation (soft
+  error (wrote degraded data) vs. hard error), like the read paths.
+
+  On failure to mount due to insufficient devices, we now report which device(s)
+  were missing; we remember the device name and model in the superblock from the
+  last time we saw it so that we can give helpful hints to the user about what's
+  missing.
+
+  When btree topology repair recovers via btree node scan, we now report which
+  node(s) it was able to recover via scan; this helps with determining if data
+  was actually lost or not.
+
+  We now ratelimit soft and hard errors separately, in the data/journal/btree
+  read and write paths, ensuring that if the system is being flooded with soft
+  errors the hard errors will still be reported.
+
+  All error ratelimiting now obeys the `no_ratelimit_errors` option.
+
+  All recovery passes should now have progress indicators.
+
+- New options:
+
+  `mount_trusts_udev`: there have been reports of mounting by UUID failing due
+  to known bugs in libblkid. Previously this was available as an environment
+  variable, but it now may be specified as a mount option (where it should also
+  be much easier to find). When specified, we only use udev for getting the list
+  of the system's block devices; we do all the probing for filesystem members
+  ourselves.
+
+  `writeback_timeout`: if set, this overrides the `vm.dirty_writeback*` sysctls
+  for the given filesystem, and may be set persistently. Useful for setting a
+  lower writeback timeout for removable media.
+
+- Other smaller user-visible improvements
+
+  The `mi_btree_bitmap` field in the member info section of the superblock now
+  has a recovery pass to clean it up and shrink it; it will be automatically
+  scheduled when we notice that there is significantly more space on a device
+  marked as containing metadata than we have metadata on that device.
+
+  The member-info btree bitmap is used by btree node scan, for disaster recovery
+  repair; shrinking the bitmap reduces the amount of the device that has to be
+  scanned if we have to recover from btree nodes that have become unreadable or
+  lost despite replication. You don't ever want to need it, but if you do need
+  it, it's there.
+
+- Promotes are now ratelimited; this resolves an issue with spinning up far too
+  many kworker threads for promotes that wouldn't happen due to the target being
+  busy.
+
+- An issue was spotted on a user filesystem where btree node merging wasn't
+  happening properly on the `reconcile_work` btree, causing a very slow upgrade.
+  Btree node merging has now seen some improvements; btree lookups can now kick
+  off asynchronous btree node merges when they spot an empty btree node, and the
+  btree write buffer now does btree merging asynchronously, which should be a
+  noticeable improvement on system performance under heavy load for some users -
+  btree write buffer flushing is single threaded and can be a bottleneck.
+
+  There's also a new recovery pass, `merge_btree_nodes`, to check all btrees for
+  nodes that can be merged. It's not run automatically, but can be run if
+  desired by passing the `recovery_passes` option to an online fsck.
+
+- And many other bug fixes.
+
+### Notable under-the-hood codebase work:
+
+A lot of codebase modernization has been happening over the past six months,
+to prepare for Rust. With the latest features recently available in C and in
+the kernel, we can now do incremental refactorings to bring code steadily more
+in line with what the Rust version will be, so that the future conversion will
+be mostly syntactic - and not a rewrite. The big enabler here was CLASS(),
+which is the kernel's version of pseudo-RAII based on `__cleanup()`; this
+allows for the removal of goto based error handling (Rust notably does not
+have goto).
+
+We're now down to ~600 gotos in the entire codebase, down from ~2500 when the
+modernization started, with many files being complete.
+
+Other work includes avoiding open coded vectors; bcachefs uses DARRAY(), which
+is decently close to Rust/C++ vectors, and the try() macro for forwarding
+errors, stolen from Rust. These cleanups have deleted thousands of lines from
+the codebase over the past months.
--- a/debian/NEWS
+++ b/debian/NEWS
@@ -0,0 +1,189 @@
+# Changelog
+
+## v1.33.0 - Thu Dec  4 2025
+
+`bcachefs_metadata_version_reconcile` (formerly known as rebalance_v2)
+
+### Reconcile
+
+An incompatible upgrade is required to enable reconcile.
+
+Reconcile now handles all IO path options; previously only the background target
+and background compression options were handled.
+
+Reconcile can now process metadata (moving it to the correct target,
+rereplicating degraded metadata); previously rebalance was only able to handle
+user data.
+
+Reconcile now automatically reacts to option changes and device setting
+changes, and immediately rereplicates degraded data or metadata.
+
+This obsoletes the commands `data rereplicate`, `data job
+drop_extra_replicas`, and others; the new commands are `reconcile status` and
+`reconcile wait`.
+
+The recovery pass `check_reconcile_work` now checks that data matches the
+specified IO path options, and flags an error if it does not (unless the
+mismatch is due to an option change that hasn't yet been propagated).
+
+Additional improvements over rebalance and implementation notes:
+
+We now have a separate index for data that's scheduled to be processed by
+reconcile but can't (e.g. because the specified target is full),
+`BTREE_ID_reconcile_pending`; this solves long standing reports of rebalance
+spinning when a filesystem has more data than fits on the specified background
+target.
+
+This also means you can create a single device filesystem with replicas=2, and
+upon adding a new device, data will automatically be replicated on the new
+device, with no additional user intervention required.
+
+There's a separate index for "high priority" reconcile processing -
+`BTREE_ID_reconcile_hipri`. This is used for degraded extents that need to be
+rereplicated; they'll be processed ahead of other work.
+
+Rotating disks get special handling. We now track whether a disk is rotational
+(a hard drive, instead of an SSD); pending work on those disks is additionally
+indexed in the `BTREE_ID_reconcile_work_phys` and
+`BTREE_ID_reconcile_hipri_phys` btrees so they can be processed in physical
+LBA order, not logical key order, avoiding unnecessary seeks.
+
+We don't yet have the ability to change the rotational setting on an existing
+device, once it's been set; if you discover you need this, please let us know so
+it can be bumped up on the list (it'll be a medium sized project).
+
+`BCH_MEMBER_STATE_failed` has been renamed to `BCH_MEMBER_STATE_evacuating`;
+as the name implies, reconcile automatically moves data off of devices in the
+evacuating state. In the future, when we have better tracking and monitoring
+of drive health, we'll be able to automatically mark failing devices as
+evacuating: when this lands, you'll be able to load up a server with disks and
+walk away - come back a year later to swap out the ones that have failed.
+
+Reconcile was a massive project: the short and simple user interface is
+deceptive; there was an enormous amount of work under the hood to make
+everything work consistently and handle all the special cases we've learned
+about over the past few years with rebalance.
+
+There's still reconcile-related work to be done on disk space accounting when
+devices are read-only or evacuating, and in the future we want to reserve space
+up front on option change, so that we can alert the user if they might be doing
+something they don't have disk space for.
+
+### Other improvements and changes:
+
+- Degraded data is now always properly reported as degraded (by `bcachefs fs
+  usage`); data is considered degraded any time the durability on good
+  (non-evacuating) devices is less than the specified replication level.
+
+- Counters (shown by `bcachefs fs top`) and tracepoints have gotten a giant
+  cleanup and rework: every counter has a corresponding tracepoint. This makes
+  it easy to drill down and investigate when a filesystem is doing something
+  unusual and unexpected.
+
+  Under the hood, the conversion of tracepoints to printbufs/pretty printers has
+  now been completed, with some much improved helpers. This makes it much easier
+  to add new counters and tracepoints or add additional info to existing
+  tracepoints, typically a 5-20 line patch. If there's something you're
+  investigating and you need more info, just ask.
+
+  We now make use of type information on counters to display data rates in
+  `bcachefs fs top` where applicable, and many counters have been converted to
+  data rates. This makes it much easier to correlate different counters (e.g.
+  `data_update`, `data_update_fail`) to check if the rates of slowpath events
+  should be a cause for concern.
+
+- Logging/error message improvements
+
+  Logging has been a major area of focus, with a lot of under the hood
+  improvements to make it ergonomic to generate messages that clearly explain
+  what the system is doing and why: error messages should not include just the
+  error, but how it was handled (soft error or hard error) and all actions taken
+  to correct the error (e.g. scheduling self healing or recovery passes).
+
+  When we receive an IO error from the block layer we now report the specific
+  error code we received (e.g. `BLK_STS_IOERR`, `BLK_STS_INVAL`).
+
+  The various write paths (user data, btree, journal) now report one error
+  message for the entire operation that includes all the sub-errors for the
+  individual replicated writes and the status of the overall operation (soft
+  error (wrote degraded data) vs. hard error), like the read paths.
+
+  On failure to mount due to insufficient devices, we now report which device(s)
+  were missing; we remember the device name and model in the superblock from the
+  last time we saw it so that we can give helpful hints to the user about what's
+  missing.
+
+  When btree topology repair recovers via btree node scan, we now report which
+  node(s) it was able to recover via scan; this helps with determining if data
+  was actually lost or not.
+
+  We now ratelimit soft and hard errors separately, in the data/journal/btree
+  read and write paths, ensuring that if the system is being flooded with soft
+  errors the hard errors will still be reported.
+
+  All error ratelimiting now obeys the `no_ratelimit_errors` option.
+
+  All recovery passes should now have progress indicators.
+
+- New options:
+
+  `mount_trusts_udev`: there have been reports of mounting by UUID failing due
+  to known bugs in libblkid. Previously this was available as an environment
+  variable, but it now may be specified as a mount option (where it should also
+  be much easier to find). When specified, we only use udev for getting the list
+  of the system's block devices; we do all the probing for filesystem members
+  ourselves.
+
+  `writeback_timeout`: if set, this overrides the `vm.dirty_writeback*` sysctls
+  for the given filesystem, and may be set persistently. Useful for setting a
+  lower writeback timeout for removable media.
+
+- Other smaller user-visible improvements
+
+  The `mi_btree_bitmap` field in the member info section of the superblock now
+  has a recovery pass to clean it up and shrink it; it will be automatically
+  scheduled when we notice that there is significantly more space on a device
+  marked as containing metadata than we have metadata on that device.
+
+  The member-info btree bitmap is used by btree node scan, for disaster recovery
+  repair; shrinking the bitmap reduces the amount of the device that has to be
+  scanned if we have to recover from btree nodes that have become unreadable or
+  lost despite replication. You don't ever want to need it, but if you do need
+  it, it's there.
+
+- Promotes are now ratelimited; this resolves an issue with spinning up far too
+  many kworker threads for promotes that wouldn't happen due to the target being
+  busy.
+
+- An issue was spotted on a user filesystem where btree node merging wasn't
+  happening properly on the `reconcile_work` btree, causing a very slow upgrade.
+  Btree node merging has now seen some improvements; btree lookups can now kick
+  off asynchronous btree node merges when they spot an empty btree node, and the
+  btree write buffer now does btree merging asynchronously, which should be a
+  noticeable improvement on system performance under heavy load for some users -
+  btree write buffer flushing is single threaded and can be a bottleneck.
+
+  There's also a new recovery pass, `merge_btree_nodes`, to check all btrees for
+  nodes that can be merged. It's not run automatically, but can be run if
+  desired by passing the `recovery_passes` option to an online fsck.
+
+- And many other bug fixes.
+
+### Notable under-the-hood codebase work:
+
+A lot of codebase modernization has been happening over the past six months,
+to prepare for Rust. With the latest features recently available in C and in
+the kernel, we can now do incremental refactorings to bring code steadily more
+in line with what the Rust version will be, so that the future conversion will
+be mostly syntactic - and not a rewrite. The big enabler here was CLASS(),
+which is the kernel's version of pseudo-RAII based on `__cleanup()`; this
+allows for the removal of goto based error handling (Rust notably does not
+have goto).
+
+We're now down to ~600 gotos in the entire codebase, down from ~2500 when the
+modernization started, with many files being complete.
+
+Other work includes avoiding open coded vectors; bcachefs uses DARRAY(), which
+is decently close to Rust/C++ vectors, and the try() macro for forwarding
+errors, stolen from Rust. These cleanups have deleted thousands of lines from
+the codebase over the past months.
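
As a rough illustration of the goto removal described in the changelog above, here is a minimal userspace sketch of scope-based cleanup combined with error forwarding. It is not bcachefs or kernel code. The names `auto_free`, `free_charp`, `cleanups_run`, `check_len` and `copy_name` are hypothetical, and the `try()` shown here is a simplified stand-in for the macro the changelog mentions. The sketch assumes GCC or Clang, since it relies on the `cleanup` variable attribute, the compiler feature that the kernel's `__cleanup()` wraps.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counter, only here so the cleanup calls can be observed. */
static int cleanups_run;

/* Hypothetical cleanup handler: receives a pointer to the variable leaving scope. */
static void free_charp(char **p)
{
	free(*p);		/* free(NULL) is a no-op, so a failed malloc is fine */
	cleanups_run++;
}

/* Hypothetical shorthand, loosely modelled on the kernel's scope-based helpers. */
#define auto_free __attribute__((cleanup(free_charp)))

/* Simplified stand-in for try(): forward any nonzero error code to the caller. */
#define try(expr)				\
do {						\
	int _ret = (expr);			\
	if (_ret)				\
		return _ret;			\
} while (0)

/* Returns 0 if s (plus its terminating NUL) fits in max bytes. */
static int check_len(const char *s, size_t max)
{
	return strlen(s) < max ? 0 : -ENAMETOOLONG;
}

/*
 * Copies src into dst via a temporary buffer. There is no "goto err" label:
 * every return path, including the early return hidden inside try(), runs
 * free_charp() on tmp automatically.
 */
static int copy_name(const char *src, char *dst, size_t dst_size)
{
	assert(dst_size > 0);

	auto_free char *tmp = malloc(dst_size);
	if (!tmp)
		return -ENOMEM;

	try(check_len(src, dst_size));

	strcpy(tmp, src);
	strcpy(dst, tmp);
	return 0;
}
```

Calling `copy_name("abc", buf, sizeof(buf))` with an 8-byte `buf` returns 0 and fills `buf`, while a source string that does not fit returns `-ENAMETOOLONG`; in both cases the temporary buffer is released without any explicit cleanup label.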