# Changelog

## v1.33.0 - Thu Dec 4 2025

`bcachefs_metadata_version_reconcile` (formerly known as rebalance_v2)

### Reconcile

An incompatible upgrade is required to enable reconcile.

Reconcile now handles all IO path options; previously only the background target and background compression options were handled.

Reconcile can now process metadata (moving it to the correct target, rereplicating degraded metadata); previously rebalance was only able to handle user data.

Reconcile now automatically reacts to option changes and device setting changes, and immediately rereplicates degraded data or metadata.

This obsoletes the commands `data rereplicate`, `data job drop_extra_replicas`, and others; the new commands are `reconcile status` and `reconcile wait`.

The recovery pass `check_reconcile_work` now checks that data matches the specified IO path options, and flags an error if it does not (unless the mismatch is due to an option change that hasn't yet been propagated).

Additional improvements over rebalance and implementation notes:

We now have a separate index, `BTREE_ID_reconcile_pending`, for data that's scheduled to be processed by reconcile but can't be (e.g. because the specified target is full); this solves long-standing reports of rebalance spinning when a filesystem has more data than fits on the specified background target.

This also means you can create a single device filesystem with replicas=2, and upon adding a new device, data will automatically be replicated on the new device with no additional user intervention required.

There's a separate index for "high priority" reconcile processing, `BTREE_ID_reconcile_hipri`. This is used for degraded extents that need to be rereplicated; they'll be processed ahead of other work.

Rotating disks get special handling.
We now track whether a disk is rotational (a hard drive, instead of an SSD); pending work on those disks is additionally indexed in the `BTREE_ID_reconcile_work_phys` and `BTREE_ID_reconcile_hipri_phys` btrees so it can be processed in physical LBA order, not logical key order, avoiding unnecessary seeks.

We don't yet have the ability to change the rotational setting on an existing device once it's been set; if you discover you need this, please let us know so it can be bumped up on the list (it'll be a medium sized project).

`BCH_MEMBER_STATE_failed` has been renamed to `BCH_MEMBER_STATE_evacuating`; as the name implies, reconcile automatically moves data off of devices in the evacuating state.

In the future, when we have better tracking and monitoring of drive health, we'll be able to automatically mark failing devices as evacuating: when this lands, you'll be able to load up a server with disks and walk away - come back a year later to swap out the ones that have failed.

Reconcile was a massive project: the short and simple user interface is deceptive. There was an enormous amount of work under the hood to make everything work consistently and handle all the special cases we've learned about over the past few years with rebalance.

There's still reconcile-related work to be done on disk space accounting when devices are read-only or evacuating, and in the future we want to reserve space up front on option change, so that we can alert the user if they might be doing something they don't have disk space for.

### Other improvements and changes:

- Degraded data is now always properly reported as degraded (by `bcachefs fs usage`); data is considered degraded any time the durability on good (non-evacuating) devices is less than the specified replication level.

- Counters (shown by `bcachefs fs top`) and tracepoints have gotten a giant cleanup and rework: every counter has a corresponding tracepoint.
  This makes it easy to drill down and investigate when a filesystem is doing something unusual and unexpected.

  Under the hood, the conversion of tracepoints to printbufs/pretty printers has now been completed, with some much improved helpers. This makes it much easier to add new counters and tracepoints or add additional info to existing tracepoints, typically a 5-20 line patch. If there's something you're investigating and you need more info, just ask.

  We now make use of type information on counters to display data rates in `bcachefs fs top` where applicable, and many counters have been converted to data rates. This makes it much easier to correlate different counters (e.g. `data_update`, `data_update_fail`) to check if the rates of slowpath events should be a cause for concern.

- Logging/error message improvements

  Logging has been a major area of focus, with a lot of under the hood improvements to make it ergonomic to generate messages that clearly explain what the system is doing and why: error messages should include not just the error, but how it was handled (soft error or hard error) and all actions taken to correct the error (e.g. scheduling self healing or recovery passes).

  When we receive an IO error from the block layer we now report the specific error code we received (e.g. `BLK_STS_IOERR`, `BLK_STS_INVAL`).

  The various write paths (user data, btree, journal) now report one error message for the entire operation, like the read paths. The message includes all the sub-errors for the individual replicated writes and the status of the overall operation: soft error (wrote degraded data) vs. hard error.

  On failure to mount due to insufficient devices, we now report which device(s) were missing; we remember the device name and model in the superblock from the last time we saw it so that we can give helpful hints to the user about what's missing.
  When btree topology repair recovers via btree node scan, we now report which node(s) it was able to recover via scan; this helps with determining if data was actually lost or not.

  We now ratelimit soft and hard errors separately in the data/journal/btree read and write paths, ensuring that if the system is being flooded with soft errors the hard errors will still be reported.

  All error ratelimiting now obeys the `no_ratelimit_errors` option.

  All recovery passes should now have progress indicators.

- New options:

  `mount_trusts_udev`: there have been reports of mounting by UUID failing due to known bugs in libblkid. Previously this was available as an environment variable, but it may now be specified as a mount option (where it should also be much easier to find). When specified, we only use udev for getting the list of the system's block devices; we do all the probing for filesystem members ourselves.

  `writeback_timeout`: if set, this overrides the `vm.dirty_writeback*` sysctls for the given filesystem, and may be set persistently. Useful for setting a lower writeback timeout for removable media.

- Other smaller user-visible improvements

  The `mi_btree_bitmap` field in the member info section of the superblock now has a recovery pass to clean it up and shrink it; it will be automatically scheduled when we notice that there is significantly more space on a device marked as containing metadata than we have metadata on that device.

  The member-info btree bitmap is used by btree node scan, for disaster recovery repair; shrinking the bitmap reduces the amount of the device that has to be scanned if we have to recover from btree nodes that have become unreadable or lost despite replication. You don't ever want to need it, but if you do need it, it's there.

- Promotes are now ratelimited; this resolves an issue with spinning up far too many kworker threads for promotes that wouldn't happen due to the target being busy.

- An issue was spotted on a user filesystem where btree node merging wasn't happening properly on the `reconcile_work` btree, causing a very slow upgrade.

  Btree node merging has now seen some improvements: btree lookups can now kick off asynchronous btree node merges when they spot an empty btree node, and the btree write buffer now does btree merging asynchronously. The latter should be a noticeable improvement in system performance under heavy load for some users, since btree write buffer flushing is single threaded and can be a bottleneck.

  There's also a new recovery pass, `merge_btree_nodes`, to check all btrees for nodes that can be merged. It's not run automatically, but can be run if desired by passing the `recovery_passes` option to an online fsck.

- And many other bug fixes.

### Notable under-the-hood codebase work:

A lot of codebase modernization has been happening over the past six months, to prepare for Rust. With the latest features recently available in C and in the kernel, we can now do incremental refactorings to bring code steadily more in line with what the Rust version will be, so that the future conversion will be mostly syntactic - and not a rewrite.

The big enabler here was CLASS(), which is the kernel's version of pseudo-RAII based on `__cleanup()`; this allows for the removal of goto based error handling (Rust notably does not have goto).

We're now down to ~600 gotos in the entire codebase, down from ~2500 when the modernization started, with many files fully converted.

Other work includes avoiding open coded vectors (bcachefs uses DARRAY(), which is decently close to Rust/C++ vectors) and adopting the try() macro for forwarding errors, stolen from Rust.

These cleanups have deleted thousands of lines from the codebase over the past months.
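
To make the goto-removal idea concrete, here is a minimal userspace sketch of the two ingredients mentioned above: scope-based cleanup and an early-return `try()` macro. It is not the kernel's `CLASS()` or the bcachefs `try()` implementation; it only relies on the same GCC/Clang `cleanup` variable attribute. All names in it (`auto_free`, `free_charp`, `check_len`, `fill_buffer`, `cleanups_run`) are hypothetical and exist purely for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Demo-only counter so the effect of the cleanup handler is observable. */
static int cleanups_run;

/* Hypothetical cleanup handler: receives a pointer to the variable itself. */
static void free_charp(char **p)
{
	assert(p != NULL);
	free(*p);		/* free(NULL) is a no-op */
	cleanups_run++;
}

/* Run free_charp() when the annotated variable goes out of scope. */
#define auto_free __attribute__((cleanup(free_charp)))

/*
 * Hypothetical try(): evaluate an expression that returns 0 or a negative
 * errno, and return that error from the enclosing function if nonzero.
 */
#define try(expr)			\
do {					\
	int try_ret_ = (expr);		\
	if (try_ret_)			\
		return try_ret_;	\
} while (0)

static int check_len(size_t len)
{
	return len > 4096 ? -EINVAL : 0;
}

/* Returns the number of bytes written, or a negative errno. */
static int fill_buffer(size_t len)
{
	auto_free char *buf = NULL;

	try(check_len(len));	/* early return; the cleanup still runs */

	buf = malloc(len + 1);
	if (!buf)
		return -ENOMEM;

	memset(buf, 'x', len);
	buf[len] = '\0';
	return (int) strlen(buf);	/* buf is freed after this is evaluated */
}
```

Every exit path from `fill_buffer()` releases `buf` without an `out:` label, which is the property that lets goto-based error handling be deleted. Calling `fill_buffer(16)` takes the success path, while a length above 4096 takes the early-return path through `try()`; in both cases the cleanup handler runs exactly once.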