mirror of
https://github.com/koverstreet/bcachefs-tools.git
synced 2025-12-07 00:00:12 +03:00
Changelog for 1.33
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
parent b601a0f2c3
commit 7d5817d9c2
189
Changelog.mdwn
Normal file
@@ -0,0 +1,189 @@
# Changelog

## v1.33.0 - Thu Dec 4 2025

`bcachefs_metadata_version_reconcile` (formerly known as rebalance_v2)

### Reconcile

An incompatible upgrade is required to enable reconcile.

Reconcile now handles all IO path options; previously only the background target
and background compression options were handled.

Reconcile can now process metadata (moving it to the correct target,
rereplicating degraded metadata); previously rebalance was only able to handle
user data.

Reconcile now automatically reacts to option changes and device setting
changes, and immediately rereplicates degraded data or metadata.

This obsoletes the commands `data rereplicate`, `data job
drop_extra_replicas`, and others; the new commands are `reconcile status` and
`reconcile wait`.

The recovery pass `check_reconcile_work` now checks that data matches the
specified IO path options, and flags an error if it does not (unless the
mismatch is due to an option change that hasn't yet been propagated).

Additional improvements over rebalance and implementation notes:

We now have a separate index, `BTREE_ID_reconcile_pending`, for data that's
scheduled to be processed by reconcile but can't be (e.g. because the specified
target is full); this solves long standing reports of rebalance spinning when a
filesystem has more data than fits on the specified background target.

This also means you can create a single device filesystem with replicas=2, and
upon adding a new device data will automatically be replicated on the new
device, no additional user intervention required.

There's a separate index for "high priority" reconcile processing -
`BTREE_ID_reconcile_hipri`. This is used for degraded extents that need to be
rereplicated; they'll be processed ahead of other work.

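As a rough illustration of how the pending and high priority indexes described
above divide up the work, here is a minimal sketch in C. The enum, the struct
and `classify_extent()` are hypothetical names made up for this example, not
bcachefs code, and the rule that a degraded extent which cannot make progress
lands in pending is an assumption drawn from the replicas=2 example above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the reconcile indexes named above. */
enum reconcile_queue {
	RECONCILE_NONE,    /* extent already matches its options */
	RECONCILE_HIPRI,   /* degraded: rereplicate ahead of other work */
	RECONCILE_WORK,    /* ordinary background work */
	RECONCILE_PENDING, /* wanted, but cannot make progress right now */
};

/* Hypothetical summary of one extent; not a bcachefs structure. */
struct extent_state {
	bool matches_options;   /* target, compression, replicas all satisfied */
	bool degraded;          /* fewer good replicas than requested */
	bool can_make_progress; /* e.g. the requested target is not full */
};

/* Pick the index an extent would be tracked in, under the assumptions above. */
static enum reconcile_queue classify_extent(const struct extent_state *e)
{
	assert(e);

	if (e->matches_options && !e->degraded)
		return RECONCILE_NONE;
	if (!e->can_make_progress)
		return RECONCILE_PENDING; /* parked, so the worker does not spin */
	if (e->degraded)
		return RECONCILE_HIPRI;
	return RECONCILE_WORK;
}
```

The point of the pending case is that work which cannot proceed is set aside
instead of being retried in a loop; it becomes eligible again once conditions
change, for example when a device is added.
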
Rotating disks get special handling. We now track whether a disk is rotational
(a hard drive, instead of an SSD); pending work on those disks is additionally
indexed in the `BTREE_ID_reconcile_work_phys` and
`BTREE_ID_reconcile_hipri_phys` btrees so they can be processed in physical
LBA order, not logical key order, avoiding unnecessary seeks.

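To show what "physical LBA order, not logical key order" means in practice,
here is a small sketch in C. It sorts an in-memory array with `qsort()`; the
real implementation keeps the work in btrees, so this is only an analogy, and
`struct phys_work` and both functions are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical work item: physical location plus an opaque logical key. */
struct phys_work {
	uint32_t dev;     /* device index */
	uint64_t lba;     /* physical offset on that device, in sectors */
	uint64_t logical; /* logical key, not used for ordering here */
};

/* Compare by device first, then by physical offset on that device. */
static int phys_work_cmp(const void *l, const void *r)
{
	const struct phys_work *a = l, *b = r;

	if (a->dev != b->dev)
		return a->dev < b->dev ? -1 : 1;
	if (a->lba != b->lba)
		return a->lba < b->lba ? -1 : 1;
	return 0;
}

/* Order work so that each rotational device is swept front to back. */
static void sort_by_physical_order(struct phys_work *w, size_t nr)
{
	assert(w || !nr);
	if (nr)
		qsort(w, nr, sizeof(*w), phys_work_cmp);
}
```

Processing items in this order keeps the head of a hard drive moving in one
direction across the platter rather than jumping between unrelated logical
keys.
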
We don't yet have the ability to change the rotational setting on an existing
device, once it's been set; if you discover you need this, please let us know so
it can be bumped up on the list (it'll be a medium sized project).

`BCH_MEMBER_STATE_failed` has been renamed to `BCH_MEMBER_STATE_evacuating`;
as the name implies, reconcile automatically moves data off of devices in the
evacuating state. In the future, when we have better tracking and monitoring
of drive health, we'll be able to automatically mark failing devices as
evacuating: when this lands, you'll be able to load up a server with disks and
walk away - come back a year later to swap out the ones that have failed.

Reconcile was a massive project: the short and simple user interface is
deceptive. There was an enormous amount of work under the hood to make
everything work consistently and handle all the special cases we've learned
about over the past few years with rebalance.

There's still reconcile-related work to be done on disk space accounting when
devices are read-only or evacuating, and in the future we want to reserve space
up front on option change, so that we can alert the user if they might be doing
something they don't have disk space for.

### Other improvements and changes:

- Degraded data is now always properly reported as degraded (by `bcachefs fs
  usage`); data is considered degraded any time the durability on good
  (non-evacuating) devices is less than the specified replication level.

- Counters (shown by `bcachefs fs top`) and tracepoints have gotten a giant
  cleanup and rework: every counter has a corresponding tracepoint. This makes
  it easy to drill down and investigate when a filesystem is doing something
  unusual and unexpected.

  Under the hood, the conversion of tracepoints to printbufs/pretty printers has
  now been completed, with some much improved helpers. This makes it much easier
  to add new counters and tracepoints or add additional info to existing
  tracepoints, typically a 5-20 line patch. If there's something you're
  investigating and you need more info, just ask.

  We now make use of type information on counters to display data rates in
  `bcachefs fs top` where applicable, and many counters have been converted to
  data rates. This makes it much easier to correlate different counters (e.g.
  `data_update`, `data_update_fail`) to check if the rates of slowpath events
  should be a cause for concern.

- Logging/error message improvements

  Logging has been a major area of focus, with a lot of under the hood
  improvements to make it ergonomic to generate messages that clearly explain
  what the system is doing and why: error messages should include not just the
  error, but how it was handled (soft error or hard error) and all actions taken
  to correct the error (e.g. scheduling self healing or recovery passes).

  When we receive an IO error from the block layer we now report the specific
  error code we received (e.g. `BLK_STS_IOERR`, `BLK_STS_INVAL`).

  The various write paths (user data, btree, journal) now report one error
  message for the entire operation that includes all the sub-errors for the
  individual replicated writes and the status of the overall operation (soft
  error (wrote degraded data) vs. hard error), like the read paths.

  On failure to mount due to insufficient devices, we now report which device(s)
  were missing; we remember each device's name and model in the superblock from
  the last time we saw it so that we can give helpful hints to the user about
  what's missing.

  When btree topology repair recovers via btree node scan, we now report which
  node(s) it was able to recover via scan; this helps with determining if data
  was actually lost or not.

  We now ratelimit soft and hard errors separately, in the data/journal/btree
  read and write paths, ensuring that if the system is being flooded with soft
  errors the hard errors will still be reported.

  All error ratelimiting now obeys the `no_ratelimit_errors` option.

  All recovery passes should now have progress indicators.

- New options:

  `mount_trusts_udev`: there have been reports of mounting by UUID failing due
  to known bugs in libblkid. Previously this was available as an environment
  variable, but it now may be specified as a mount option (where it should also
  be much easier to find). When specified, we only use udev for getting the list
  of the system's block devices; we do all the probing for filesystem members
  ourselves.

  `writeback_timeout`: if set, this overrides the `vm.dirty_writeback*` sysctls
  for the given filesystem, and may be set persistently. Useful for setting a
  lower writeback timeout for removable media.

- Other smaller user-visible improvements

  The `mi_btree_bitmap` field in the member info section of the superblock now
  has a recovery pass to clean it up and shrink it; it will be automatically
  scheduled when we notice that there is significantly more space on a device
  marked as containing metadata than we have metadata on that device.

  The member-info btree bitmap is used by btree node scan, for disaster recovery
  repair; shrinking the bitmap reduces the amount of the device that has to be
  scanned if we have to recover from btree nodes that have become unreadable or
  lost despite replication. You don't ever want to need it, but if you do need
  it, it's there.

- Promotes are now ratelimited; this resolves an issue with spinning up far too
  many kworker threads for promotes that wouldn't happen due to the target being
  busy.

- An issue was spotted on a user filesystem where btree node merging wasn't
  happening properly on the `reconcile_work` btree, causing a very slow upgrade.
  Btree node merging has now seen some improvements; btree lookups can now kick
  off asynchronous btree node merges when they spot an empty btree node, and the
  btree write buffer now does btree merging asynchronously, which should be a
  noticeable improvement on system performance under heavy load for some users -
  btree write buffer flushing is single threaded and can be a bottleneck.

  There's also a new recovery pass, `merge_btree_nodes`, to check all btrees for
  nodes that can be merged. It's not run automatically, but can be run if
  desired by passing the `recovery_passes` option to an online fsck.

- And many other bug fixes.

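The degraded-data rule from the first item in the list above can be written as
a small predicate. This is a minimal sketch in C under stated assumptions:
`struct replica` and both functions are hypothetical names for illustration,
and real extents carry far more state than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-replica summary; not the on-disk bcachefs format. */
struct replica {
	unsigned durability; /* durability contributed by the device holding it */
	bool evacuating;     /* device is in the evacuating state */
};

/* Sum durability over replicas that live on good (non-evacuating) devices. */
static unsigned good_durability(const struct replica *r, size_t nr)
{
	unsigned total = 0;

	assert(r || !nr);
	for (size_t i = 0; i < nr; i++)
		if (!r[i].evacuating)
			total += r[i].durability;
	return total;
}

/* Degraded: durability on good devices is below the requested level. */
static bool extent_is_degraded(const struct replica *r, size_t nr,
			       unsigned replicas_wanted)
{
	return good_durability(r, nr) < replicas_wanted;
}
```

Under this rule, a copy on an evacuating device still exists but no longer
counts toward the replication level, so an extent with one good copy and one
evacuating copy is degraded when two replicas are requested.
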
### Notable under-the-hood codebase work:

A lot of codebase modernization has been happening over the past six months,
to prepare for Rust. With the latest features recently available in C and in
the kernel, we can now do incremental refactorings to bring code steadily more
in line with what the Rust version will be, so that the future conversion will
be mostly syntactic - and not a rewrite. The big enabler here was CLASS(),
which is the kernel's version of pseudo-RAII based on `__cleanup()`; this
allows for the removal of goto based error handling (Rust notably does not
have goto).

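For readers who haven't seen the pattern, here is a userspace sketch of the
idea behind `__cleanup()`, using the GCC/Clang `cleanup` variable attribute
directly. It is not the kernel's CLASS() macro; `auto_free`, `free_charp()`,
`dup_str()` and `concat_len()` are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Cleanup handler: receives a pointer to the variable leaving scope. */
static void free_charp(char **p)
{
	free(*p);
}

/* Hypothetical helper macro; the kernel's helpers are richer than this. */
#define auto_free __attribute__((cleanup(free_charp)))

/* Portable string duplicate, to avoid depending on POSIX strdup(). */
static char *dup_str(const char *s)
{
	size_t n = strlen(s) + 1;
	char *p = malloc(n);

	if (p)
		memcpy(p, s, n);
	return p;
}

/*
 * Returns 0 on success or -1 on allocation failure. Both buffers are freed
 * on every return path, with no goto-based unwinding.
 */
static int concat_len(const char *a, const char *b, size_t *out_len)
{
	assert(a && b && out_len);

	auto_free char *first = dup_str(a);
	if (!first)
		return -1;

	auto_free char *joined = malloc(strlen(first) + strlen(b) + 1);
	if (!joined)
		return -1; /* first is still freed here */

	strcpy(joined, first);
	strcat(joined, b);
	*out_len = strlen(joined);
	return 0;
}
```

Because the compiler runs the handler whenever the variable goes out of scope,
early returns are safe by construction, which is what makes it possible to
drop the `goto err` style of unwinding.
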
We're now down to ~600 gotos in the entire codebase, down from ~2500 when the
modernization started, with many files being complete.

Other work includes avoiding open coded vectors; bcachefs uses DARRAY(), which
is decently close to Rust/C++ vectors, and the try() macro for forwarding
errors, stolen from Rust. These cleanups have deleted thousands of lines from
the codebase over the past months.

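To give a flavour of error forwarding in this style, here is a minimal sketch
in C. The `TRY()` macro below is a hypothetical illustration, not the
definition bcachefs uses, and the checker functions exist only for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical TRY(): on a nonzero error code, return it from the caller. */
#define TRY(expr)			\
	do {				\
		int _ret = (expr);	\
		if (_ret)		\
			return _ret;	\
	} while (0)

static int check_positive(int v)
{
	return v > 0 ? 0 : -EINVAL;
}

static int check_even(int v)
{
	return v % 2 == 0 ? 0 : -ERANGE;
}

/* Each step's error is forwarded to our caller; no goto, no if-chains. */
static int validate(int v)
{
	TRY(check_positive(v));
	TRY(check_even(v));
	return 0;
}

/* Stops at the first value that fails and reports that error. */
static int validate_all(const int *vals, size_t nr)
{
	assert(vals || !nr);
	for (size_t i = 0; i < nr; i++)
		TRY(validate(vals[i]));
	return 0;
}
```

This plays the role of Rust's `?` operator: the happy path reads top to bottom
and every failure propagates with its original error code.
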
189
debian/NEWS
vendored
Normal file