|
|
|
|
\documentclass{article}
|
|
|
|
|
|
|
|
|
|
\usepackage{imakeidx}
|
|
|
|
|
\usepackage[pdfborder={0 0 0}]{hyperref}
|
|
|
|
|
\usepackage{longtable}
|
|
|
|
|
|
|
|
|
|
\title{bcachefs: Principles of Operation}
|
|
|
|
|
\author{Kent Overstreet}
|
|
|
|
|
|
|
|
|
|
\date{}
|
|
|
|
|
|
|
|
|
|
\begin{document}
|
|
|
|
|
|
|
|
|
|
\maketitle
|
|
|
|
|
\tableofcontents
|
|
|
|
|
|
|
|
|
|
\section{Introduction and overview}
|
|
|
|
|
|
|
|
|
|
Bcachefs is a modern, general purpose, copy on write filesystem descended from
|
|
|
|
|
bcache, a block layer cache.
|
|
|
|
|
|
|
|
|
|
The internal architecture is very different from most existing filesystems where
|
|
|
|
|
the inode is central and many data structures hang off of the inode. Instead,
|
|
|
|
|
bcachefs is architected more like a filesystem on top of a relational database,
|
|
|
|
|
with tables for the different filesystem data types - extents, inodes, dirents,
|
|
|
|
|
xattrs, et cetera.
|
|
|
|
|
|
|
|
|
|
bcachefs supports almost all of the same features as other modern COW
|
|
|
|
|
filesystems, such as ZFS and btrfs, but in general with a cleaner, simpler,
|
|
|
|
|
higher performance design.
|
|
|
|
|
|
|
|
|
|
\subsection{Performance overview}
|
|
|
|
|
|
|
|
|
|
The core of the architecture is a very high performance and very low latency b+
|
|
|
|
|
tree, which is not a conventional b+ tree but more of a hybrid, taking
|
|
|
|
|
concepts from compacting data structures: btree nodes are very large, log
|
|
|
|
|
structured, and compacted (resorted) as necessary in memory. This means our b+
|
|
|
|
|
trees are very shallow compared to other filesystems.
|
|
|
|
|
|
|
|
|
|
What this means for the end user is that since we require very few seeks or disk
|
|
|
|
|
reads, filesystem latency is extremely good - especially cache cold filesystem
|
|
|
|
|
latency, which does not show up in most benchmarks but has a huge impact on real
|
|
|
|
|
world performance, as well as how fast the system ``feels'' in normal interactive
|
|
|
|
|
usage. Latency has been a major focus throughout the codebase - notably, we have
|
|
|
|
|
assertions that we never hold b+ tree locks while doing IO, and the btree
|
|
|
|
|
transaction layer makes it easy to aggressively drop and retake locks as
|
|
|
|
|
needed - one major goal of bcachefs is to be the first general purpose soft
|
|
|
|
|
realtime filesystem.
|
|
|
|
|
|
|
|
|
|
Additionally, unlike other COW btrees, btree updates are journalled. This
|
|
|
|
|
greatly improves our write efficiency on random update workloads, as it means
|
|
|
|
|
btree writes are only done when we have a large block of updates, or when
|
|
|
|
|
required by memory reclaim or journal reclaim.
|
|
|
|
|
|
|
|
|
|
\subsection{Bucket based allocation}
|
|
|
|
|
|
|
|
|
|
As mentioned, bcachefs is descended from bcache, where the ability to efficiently
|
|
|
|
|
invalidate cached data and reuse disk space was a core design requirement. To
|
|
|
|
|
make this possible the allocator divides the disk up into buckets, typically
|
|
|
|
|
512k to 2M but possibly larger or smaller. Buckets and data pointers have
|
|
|
|
|
generation numbers: we can reuse a bucket with cached data in it without finding
|
|
|
|
|
and deleting all the data pointers by incrementing the generation number.
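
The generation-number scheme can be sketched with a small Python model. This is an illustration only, not bcachefs code: the names \texttt{Bucket}, \texttt{Pointer}, \texttt{is\_stale} and \texttt{invalidate} are hypothetical, and the on-disk representation is ignored.

\begin{quote} \begin{verbatim}
```python
from dataclasses import dataclass


@dataclass
class Bucket:
    gen: int = 0  # current generation of this bucket


@dataclass(frozen=True)
class Pointer:
    bucket: int  # index of the bucket the data lives in
    gen: int     # bucket generation when the pointer was created


def is_stale(buckets, ptr):
    """A pointer is stale if its bucket has since been reused."""
    return buckets[ptr.bucket].gen != ptr.gen


def invalidate(buckets, index):
    """Reuse a bucket: one increment invalidates every pointer into it."""
    buckets[index].gen += 1


buckets = [Bucket(), Bucket()]
cached = [Pointer(bucket=0, gen=0), Pointer(bucket=0, gen=0)]
other = Pointer(bucket=1, gen=0)

invalidate(buckets, 0)
```
\end{verbatim} \end{quote}

After the single increment, both pointers into bucket 0 compare as stale without having been visited, while the pointer into bucket 1 is unaffected.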
|
|
|
|
|
|
|
|
|
|
In keeping with the copy-on-write theme of avoiding update in place wherever
|
|
|
|
|
possible, we never rewrite or overwrite data within a bucket - when we allocate
|
|
|
|
|
a bucket, we write to it sequentially and then we don't write to it again until
|
|
|
|
|
the bucket has been invalidated and the generation number incremented.
|
|
|
|
|
|
|
|
|
|
This means we require a copying garbage collector to deal with internal
|
|
|
|
|
fragmentation, when patterns of random writes leave us with many buckets that
|
|
|
|
|
are partially empty (because the data they contained was overwritten) - copy GC
|
|
|
|
|
evacuates buckets that are mostly empty by writing the data they contain to new
|
|
|
|
|
buckets. This also means that we need to reserve space on the device for the
|
|
|
|
|
copy GC reserve when formatting - typically 8\% or 12\%.
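
A rough sketch of the evacuation step, as a simplified Python model rather than the actual copygc implementation. The function name \texttt{copygc}, the \texttt{threshold} parameter and the representation of a bucket as a list of live extent sizes are all assumptions made for illustration.

\begin{quote} \begin{verbatim}
```python
def copygc(buckets, bucket_size, threshold=0.5):
    """Evacuate buckets whose live fraction is below `threshold`.

    `buckets` maps bucket id -> list of live extent sizes.
    Returns (new_buckets, freed_ids); live data is packed
    sequentially into fresh buckets, never rewritten in place.
    """
    victims = [b for b, live in buckets.items()
               if sum(live) < threshold * bucket_size]
    survivors = {b: live for b, live in buckets.items()
                 if b not in victims}

    next_id = max(buckets, default=-1) + 1
    current, used = [], 0
    for b in victims:
        for size in buckets[b]:
            if used + size > bucket_size:  # destination bucket is full
                survivors[next_id] = current
                next_id += 1
                current, used = [], 0
            current.append(size)
            used += size
    if current:
        survivors[next_id] = current
    return survivors, victims


after, freed = copygc({0: [1], 1: [2, 1], 2: [8], 3: [2]}, bucket_size=8)
```
\end{verbatim} \end{quote}

In this example three mostly empty buckets are evacuated into one new bucket, so two buckets are reclaimed in net; the full bucket is left alone. The new bucket has to exist before the old ones can be freed, which is why some space must be held in reserve.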
|
|
|
|
|
|
|
|
|
|
There are some advantages to structuring the allocator this way, besides being
|
|
|
|
|
able to support cached data:
|
|
|
|
|
\begin{itemize}
|
|
|
|
|
\item By maintaining multiple write points that are writing to different buckets,
|
|
|
|
|
we're able to easily and naturally segregate unrelated IO from different
|
|
|
|
|
processes, which helps greatly with fragmentation.
|
|
|
|
|
|
|
|
|
|
\item The fast path of the allocator is essentially a simple bump allocator - the
|
|
|
|
|
disk space allocation is extremely fast
|
|
|
|
|
|
|
|
|
|
\item Fragmentation is generally a non-issue unless copygc has to kick
|
|
|
|
|
in, and it usually doesn't under typical usage patterns. The
|
|
|
|
|
allocator and copygc are doing essentially the same things as
|
|
|
|
|
the flash translation layer in SSDs, but within the filesystem
|
|
|
|
|
we have much greater visibility into where writes are coming
|
|
|
|
|
from and how to segregate them, as well as which data is
|
|
|
|
|
actually live - performance is generally more predictable than
|
|
|
|
|
with SSDs under similar usage patterns.
|
|
|
|
|
|
|
|
|
|
\item The same algorithms will in the future be used for managing SMR
|
|
|
|
|
hard drives directly, avoiding the translation layer in the hard
|
|
|
|
|
drive - doing this work within the filesystem should give much
|
|
|
|
|
better performance and much more predictable latency.
|
|
|
|
|
\end{itemize}
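
The first two points can be illustrated with a toy model of write points doing bump allocation. This is a sketch under simplifying assumptions, not the bcachefs allocator: the class name \texttt{WritePoint} and its interface are hypothetical, and an allocation that does not fit simply moves on to a fresh bucket instead of being split.

\begin{quote} \begin{verbatim}
```python
class WritePoint:
    """Appends sequentially into its own open bucket (bump allocation)."""

    def __init__(self, allocate_bucket, bucket_size):
        self.allocate_bucket = allocate_bucket
        self.bucket_size = bucket_size
        self.bucket = None
        self.offset = 0

    def alloc(self, size):
        """Return (bucket, offset) for `size` sectors (size <= bucket_size)."""
        if self.bucket is None or self.offset + size > self.bucket_size:
            self.bucket = self.allocate_bucket()  # slow path: new bucket
            self.offset = 0
        start = self.offset
        self.offset += size                       # fast path: bump
        return self.bucket, start


free_buckets = iter(range(100))


def new_bucket():
    return next(free_buckets)


# One write point per stream keeps unrelated IO in separate buckets.
wp_a = WritePoint(new_bucket, bucket_size=1024)
wp_b = WritePoint(new_bucket, bucket_size=1024)

placements = [wp_a.alloc(8), wp_b.alloc(16), wp_a.alloc(8), wp_b.alloc(16)]
```
\end{verbatim} \end{quote}

Although the two streams interleave in time, each stream's data ends up contiguous within its own bucket, and the common case costs one comparison and one addition.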
|
|
|
|
|
|
|
|
|
|
\section{Feature overview}
|
|
|
|
|
|
|
|
|
|
\subsection{IO path options}
|
|
|
|
|
|
|
|
|
|
Most options that control the IO path can be set at either the filesystem level
|
|
|
|
|
or on individual inodes (files and directories). When set on a directory via the
|
|
|
|
|
\texttt{bcachefs attr} command, they will be automatically applied recursively.
|
|
|
|
|
|
|
|
|
|
\subsubsection{Checksumming}
|
|
|
|
|
|
|
|
|
|
bcachefs supports both metadata and data checksumming - crc32c by default, but
|
|
|
|
|
stronger checksums are available as well. Enabling data checksumming incurs some
|
|
|
|
|
performance overhead - besides the checksum calculation, writes have to be
|
|
|
|
|
bounced for checksum stability (Linux generally cannot guarantee that the buffer
|
|
|
|
|
being written is not modified in flight), but reads generally do not have to be
|
|
|
|
|
bounced.
|
|
|
|
|
|
|
|
|
|
Checksum granularity in bcachefs is at the level of individual extents, which
|
|
|
|
|
results in smaller metadata but means we have to read entire extents in order to
|
|
|
|
|
verify the checksum. By default, checksummed and compressed extents are capped
|
|
|
|
|
at 64k. For most applications and usage scenarios this is an ideal trade off, but
|
|
|
|
|
small random \texttt{O\_DIRECT} reads will incur significant overhead. In the
|
|
|
|
|
future, checksum granularity will be a per-inode option.
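
The cost for small reads can be made concrete with a short Python sketch. It is illustrative only: \texttt{zlib.crc32} (plain CRC-32) stands in for crc32c, which the Python standard library does not provide, and \texttt{read\_verified} is a hypothetical helper.

\begin{quote} \begin{verbatim}
```python
import zlib

EXTENT_SIZE = 64 * 1024  # default cap for checksummed extents


def read_verified(extent, stored_csum, offset, length):
    """Return (requested bytes, bytes that had to be read).

    The checksum covers the whole extent, so all of it must be read
    and checksummed even when the caller wants only a few bytes.
    """
    if zlib.crc32(extent) != stored_csum:
        raise IOError("checksum mismatch")
    return extent[offset:offset + length], len(extent)


extent = bytes(range(256)) * (EXTENT_SIZE // 256)
csum = zlib.crc32(extent)

data, bytes_read = read_verified(extent, csum, offset=4096, length=4096)
amplification = bytes_read / len(data)
```
\end{verbatim} \end{quote}

A 4k read from a full 64k extent therefore reads and checksums sixteen times the requested data, which is the overhead that small random \texttt{O\_DIRECT} reads run into.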
|
|
|
|
|
|
|
|
|
|
\subsubsection{Encryption}
|
|
|
|
|
|
|
|
|
|
bcachefs supports authenticated (AEAD style) encryption - ChaCha20/Poly1305.
|
|
|
|
|
When encryption is enabled, the poly1305 MAC replaces the normal data and
|
|
|
|
|
metadata checksums. This style of encryption is superior to typical block layer
|
|
|
|
|
or filesystem level encryption (usually AES-XTS), which only operates on blocks
|
|
|
|
|
and doesn't have a way to store nonces or MACs. In contrast, we store a nonce
|
|
|
|
|
and cryptographic MAC alongside data pointers, meaning we have a chain of trust
|
|
|
|
|
up to the superblock (or journal, in the case of unclean shutdowns) and can
|
|
|
|
|
definitely tell if metadata has been modified, dropped, or replaced with an
|
|
|
|
|
earlier version. Therefore, replay attacks are not possible, with the exception
|
|
|
|
|
of an offline rollback of the entire filesystem to a previous version (but see
|
|
|
|
|
the WARNING below).
|
|
|
|
|
|
|
|
|
|
Encryption can only be specified for the entire filesystem, not per file or
|
|
|
|
|
directory - this is because metadata blocks do not belong to a particular file.
|
|
|
|
|
All data and metadata except for the superblock is encrypted, and all data
|
|
|
|
|
and metadata is authenticated.
|
|
|
|
|
|
|
|
|
|
In the future we'll probably add AES-GCM for platforms that have hardware
|
|
|
|
|
acceleration for AES, but in the meantime software implementations of ChaCha20
|
|
|
|
|
are also quite fast on most platforms.
|
|
|
|
|
|
|
|
|
|
\texttt{scrypt} is currently used for the key derivation function (KDF), which
|
|
|
|
|
converts the user supplied passphrase to an encryption key. This is the same
|
|
|
|
|
function used by Tarsnap and Qubes OS's backup support. The key derivation is
|
|
|
|
|
implemented entirely in user-space, so other means of deriving a key can be used
|
|
|
|
|
in the future without any kernel changes.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
To format a filesystem with encryption, use
|
|
|
|
|
\begin{quote} \begin{verbatim}
bcachefs format --encrypted /dev/sda1
\end{verbatim} \end{quote}
|
|
|
|
|
|
|
|
|
|
You will be prompted for a passphrase. Then, to use an encrypted filesystem
|
|
|
|
|
use the command
|
|
|
|
|
\begin{quote} \begin{verbatim}
bcachefs unlock /dev/sda1
\end{verbatim} \end{quote}
|
|
|
|
|
|
|
|
|
|
You will be prompted for the passphrase and the encryption key will be added to
|
|
|
|
|
your in-kernel keyring; mount, fsck and other commands will then work as usual.
|
|
|
|
|
|
|
|
|
|
The passphrase on an existing encrypted filesystem can be changed with the
|
|
|
|
|
\texttt{bcachefs set-passphrase} command. To permanently unlock an encrypted
|
|
|
|
|
filesystem, use the \texttt{bcachefs remove-passphrase} command - this can be
|
|
|
|
|
useful when dumping filesystem metadata for debugging by the developers.
|
|
|
|
|
|
|
|
|
|
There is a \texttt{wide\_macs} option which controls the size of the
|
|
|
|
|
cryptographic MACs stored on disk. By default, only 80 bits are stored, which
|
|
|
|
|
should be sufficient security for most applications. With the
|
|
|
|
|
\texttt{wide\_macs} option enabled we store the full 128 bit MAC, at the cost of
|
|
|
|
|
making extents 8 bytes bigger. \texttt{wide\_macs} is recommended for cases
|
|
|
|
|
where an attacker can make repeated attempts at forging a MAC, such as scenarios
|
|
|
|
|
where the storage device itself is untrusted (but see below).
|
|
|
|
|
|
|
|
|
|
For technical reasons, bcachefs encryption is unsafe if the underlying storage
|
|
|
|
|
is snapshotted and rolled back to an earlier version. (Using bcachefs's own
|
|
|
|
|
snapshot functionality \textit{is} safe.) Therefore, one must exercise care
|
|
|
|
|
when using bcachefs encryption with ``fancy'' storage devices. It is safe to
|
|
|
|
|
rely on bcachefs encryption if both of the following hold:
|
|
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
|
\item You trust your drives to not be actively malicious. For the
|
|
|
|
|
internal storage on your laptop or desktop, this is probably a
|
|
|
|
|
safe assumption, and if it is not, you likely have much worse
|
|
|
|
|
problems. However, it is not necessarily a safe assumption for
|
|
|
|
|
e.g. USB drives or network storage. In those cases you will
|
|
|
|
|
need to decide for yourself if you are worried about this.
|
|
|
|
|
|
|
|
|
|
\item You are not using ``fancy'' storage systems that support snapshots.
|
|
|
|
|
This includes e.g. LVM, ZFS, and loop devices on reflinked or
|
|
|
|
|
snapshotted files. Most network storage and/or virtualization
|
|
|
|
|
solutions also support snapshots.
|
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
|
|
If you \textit{are} using snapshots, you must make sure that you never mount
|
|
|
|
|
a snapshotted, encrypted volume, except with \texttt{-o nochanges}. If this
|
|
|
|
|
rule is violated, an attacker might be able to recover sensitive data that
|
|
|
|
|
the encryption was supposed to protect\footnotemark. Future versions of
|
|
|
|
|
bcachefs will not have this limitation. In the meantime, one can make this
|
|
|
|
|
problem much more difficult to exploit by encrypting the volumes on which
|
|
|
|
|
bcachefs resides using LUKS, provided that LUKS is above anything that could
|
|
|
|
|
take a snapshot. For instance, if you are using bcachefs on LVM and might
|
|
|
|
|
take an LVM snapshot, LUKS would need to be between LVM and bcachefs.
|
|
|
|
|
|
|
|
|
|
\footnotetext{Technical details: AEAD algorithms, such as ChaCha20/Poly1305,
|
|
|
|
|
require that a \textit{nonce} be used for every encryption. This nonce does not
|
|
|
|
|
need to be kept secret, but one must never encrypt more than one message with
|
|
|
|
|
the same (key, nonce) pair. In the case of ChaCha20/Poly1305, violating this
|
|
|
|
|
rule loses confidentiality and integrity for all messages with the reused nonce.
|
|
|
|
|
Unfortunately, bcachefs currently derives the nonce for data and journal extents
|
|
|
|
|
from on-disk state. If a volume is snapshotted and the snapshot mounted,
|
|
|
|
|
bcachefs will use the same keys and nonces for both the original volume and the
|
|
|
|
|
snapshot. As long as at least one of the volumes is strictly read-only, everything
is okay, but as soon as data is written, bcachefs will use the same nonce to
|
|
|
|
|
encrypt what is almost certain to be two different messages, which is insecure.
|
|
|
|
|
Encrypting the volume bcachefs is on makes this much harder to exploit because
|
|
|
|
|
the attacks rely on observing the XOR of the ChaCha20 ciphertexts, and disk
|
|
|
|
|
encryption hides this information.}
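
The nonce-reuse hazard described in the footnote can be demonstrated with any stream cipher. The following Python sketch is not ChaCha20: since the standard library has no ChaCha20, a keystream derived with \texttt{hashlib.shake\_256} stands in for it, and the helper names are hypothetical. The point it illustrates, that the keystream depends only on the (key, nonce) pair, holds for ChaCha20 as well.

\begin{quote} \begin{verbatim}
```python
import hashlib


def keystream(key, nonce, length):
    # Stand-in for ChaCha20: a stream cipher derives its keystream
    # deterministically from (key, nonce).
    return hashlib.shake_256(key + nonce).digest(length)


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt(key, nonce, plaintext):
    return xor(plaintext, keystream(key, nonce, len(plaintext)))


key, nonce = b"k" * 32, b"n" * 12
original = b"secret written on original"
snapshot = b"other data on the snapshot"

c1 = encrypt(key, nonce, original)
c2 = encrypt(key, nonce, snapshot)  # same (key, nonce) reused

# The keystream cancels: an observer of both ciphertexts learns the
# XOR of the two plaintexts without ever knowing the key.
leak = xor(c1, c2)
```
\end{verbatim} \end{quote}

Anyone who knows or can guess one of the plaintexts recovers the other from \texttt{leak}. An outer encryption layer such as LUKS hides \texttt{c1} and \texttt{c2} themselves, which is why it makes the problem much harder to exploit.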
|
|
|
|
|
|
|
|
|
|
\subsubsection{Compression}
|
|
|
|
|
|
|
|
|
|
bcachefs supports gzip, lz4 and zstd compression. As with data checksumming, we
|
|
|
|
|
compress entire extents, not individual disk blocks - this gives us better
|
|
|
|
|
compression ratios than other filesystems, at the cost of reduced small random
|
|
|
|
|
read performance.
|
|
|
|
|
|
|
|
|
|
Data can also be compressed or recompressed with a different algorithm in the
|
|
|
|
|
background by the rebalance thread, if the \texttt{background\_compression}
|
|
|
|
|
option is set.
|
|
|
|
|
|
|
|
|
|
\subsection{Multiple devices}
|
|
|
|
|
|
|
|
|
|
bcachefs is a multi-device filesystem. Devices need not be the same size: by
|
|
|
|
|
default, the allocator stripes across all available devices, biased in
|
|
|
|
|
favor of the devices with more free space, so that all devices in the filesystem
|
|
|
|
|
fill up at the same rate. Devices need not have the same performance
|
|
|
|
|
characteristics: we track device IO latency and direct reads to the device that
|
|
|
|
|
is currently fastest.
|
|
|
|
|
|
|
|
|
|
\subsubsection{Replication}
|
|
|
|
|
|
|
|
|
|
bcachefs supports standard RAID1/10 style redundancy with the
|
|
|
|
|
\texttt{data\_replicas} and \texttt{metadata\_replicas} options. Layout is not
|
|
|
|
|
fixed as with RAID10: a given extent can be replicated across any set of
|
|
|
|
|
devices; the \texttt{bcachefs fs usage} command shows how data is replicated
|
|
|
|
|
within a filesystem.
|
|
|
|
|
|
|
|
|
|
\subsubsection{Erasure coding}
|
|
|
|
|
|
|
|
|
|
bcachefs also supports Reed-Solomon erasure coding (the same algorithm used by
most RAID5/6 implementations). When enabled with the \texttt{ec} option, the
|
|
|
|
|
desired redundancy is taken from the \texttt{data\_replicas} option - erasure
|
|
|
|
|
coding of metadata is not supported.
|
|
|
|
|
|
|
|
|
|
Erasure coding works significantly differently from both conventional RAID
|
|
|
|
|
implementations and other filesystems with similar features. In conventional
|
|
|
|
|
RAID, the ``write hole'' is a significant problem - doing a small write within a
|
|
|
|
|
stripe requires the P and Q (recovery) blocks to be updated as well, and since
|
|
|
|
|
those writes cannot be done atomically there is a window where the P and Q
|
|
|
|
|
blocks are inconsistent - meaning that if the system crashes and recovers with a
|
|
|
|
|
drive missing, reconstruct reads for unrelated data within that stripe will be
|
|
|
|
|
corrupted.
|
|
|
|
|
|
|
|
|
|
ZFS avoids this by fragmenting individual writes so that every write becomes a
|
|
|
|
|
new stripe - this works, but the fragmentation has a negative effect on
|
|
|
|
|
performance: metadata becomes bigger, and both read and write requests are
|
|
|
|
|
excessively fragmented. Btrfs's erasure coding implementation is more
|
|
|
|
|
conventional, and still subject to the write hole problem.
|
|
|
|
|
|
|
|
|
|
bcachefs's erasure coding takes advantage of our copy on write nature - since
|
|
|
|
|
updating stripes in place is a problem, we simply don't do that. And since
|
|
|
|
|
excessively small stripes are a problem for fragmentation, we don't erasure code
|
|
|
|
|
individual extents, we erasure code entire buckets - taking advantage of bucket
|
|
|
|
|
based allocation and copying garbage collection.
|
|
|
|
|
|
|
|
|
|
When erasure coding is enabled, writes are initially replicated, but one of the
|
|
|
|
|
replicas is allocated from a bucket that is queued up to be part of a new
|
|
|
|
|
stripe. When we finish filling up the new stripe, we write out the P and Q
|
|
|
|
|
buckets and then drop the extra replicas for all the data within that stripe -
|
|
|
|
|
the effect is similar to full data journalling, and it means that after erasure
|
|
|
|
|
coding is done the layout of our data on disk is ideal.
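
As a minimal illustration of parity over whole buckets, the Python sketch below uses a single XOR parity block (the P block only). This is a deliberate simplification: it is not Reed-Solomon, it omits Q, and \texttt{xor\_blocks} is a hypothetical helper.

\begin{quote} \begin{verbatim}
```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


# Data buckets of one stripe are filled sequentially and never
# updated in place; parity is computed once the stripe is complete.
stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(stripe)

# Lose one device: rebuild its bucket from the others plus parity.
lost = 1
remaining = [blk for i, blk in enumerate(stripe) if i != lost]
rebuilt = xor_blocks(remaining + [parity])
```
\end{verbatim} \end{quote}

Because the parity is written once, after the data buckets are complete, there is no window in which data and parity disagree, which is the window the write hole depends on.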
|
|
|
|
|
|
|
|
|
|
Since disks have write caches that are only flushed when we issue a cache flush
|
|
|
|
|
command - which we only do on journal commit - if we can tweak the allocator so
|
|
|
|
|
that the buckets used for the extra replicas are reused (and then overwritten
|
|
|
|
|
again) immediately, this full data journalling should have negligible overhead -
|
|
|
|
|
this optimization is not implemented yet, however.
|
|
|
|
|
|
|
|
|
|
\subsubsection{Device labels and targets}
|
|
|
|
|
|
|
|
|
|
By default, writes are striped across all devices in a filesystem, but they may
|
|
|
|
|
be directed to a specific device or set of devices with the various target
|
|
|
|
|
options. The allocator only prefers to allocate from devices matching the
|
|
|
|
|
specified target; if those devices are full, it will fall back to allocating
|
|
|
|
|
from any device in the filesystem.
|
|
|
|
|
|
|
|
|
|
Target options may refer to a device directly, e.g.
|
|
|
|
|
\texttt{foreground\_target=/dev/sda1}, or they may refer to a device label. A
|
|
|
|
|
device label is a path delimited by periods - e.g. ssd.ssd1 (and labels need not
|
|
|
|
|
be unique). This gives us ways of referring to multiple devices in target
|
|
|
|
|
options: If we specify ssd in a target option, that will refer to all devices
|
|
|
|
|
with the label ssd or labels that start with ssd (e.g. ssd.ssd1, ssd.ssd2).
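
The label matching rule can be sketched in a few lines of Python. Matching on whole period-delimited components is an assumption made here (the text only says labels that ``start with'' the target), and \texttt{target\_matches} and \texttt{resolve} are hypothetical names.

\begin{quote} \begin{verbatim}
```python
def target_matches(target, label):
    """True if `label` equals `target` or lies beneath it in the
    period-delimited label hierarchy."""
    t, parts = target.split("."), label.split(".")
    return parts[:len(t)] == t


devices = {
    "/dev/sda": "ssd.ssd1",
    "/dev/sdb": "ssd.ssd2",
    "/dev/sdc": "hdd.hdd1",
}


def resolve(target):
    """Devices selected by a label target such as `ssd`."""
    return sorted(dev for dev, label in devices.items()
                  if target_matches(target, label))
```
\end{verbatim} \end{quote}

With these labels, the target \texttt{ssd} selects both SSDs, while \texttt{ssd.ssd1} selects only \texttt{/dev/sda}.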
|
|
|
|
|
|
|
|
|
|
Four target options exist. These options all may be set at the filesystem level
|
|
|
|
|
(at format time, at mount time, or at runtime via sysfs), or on a particular
|
|
|
|
|
file or directory:
|
|
|
|
|
|
|
|
|
|
\begin{description}
|
|
|
|
|
\item \texttt{foreground\_target}: normal foreground data writes, and
|
|
|
|
|
metadata if \\ \texttt{metadata\_target} is not set
|
|
|
|
|
\item \texttt{metadata\_target}: btree writes
|
|
|
|
|
\item \texttt{background\_target}: If set, user data (not metadata) will
|
|
|
|
|
be moved to this target in the background
|
|
|
|
|
\item \texttt{promote\_target}: If set, a cached copy will be added to
|
|
|
|
|
this target on read, if none exists
|
|
|
|
|
\end{description}
|
|
|
|
|
|
|
|
|
|
\subsubsection{Caching}
|
|
|
|
|
|
|
|
|
|
When an extent has multiple copies on different devices, some of those copies
|
|
|
|
|
may be marked as cached. Buckets containing only cached data are discarded as
|
|
|
|
|
needed by the allocator in LRU order.
|
|
|
|
|
|
|
|
|
|
When data is moved from one device to another according to the \\
|
|
|
|
|
\texttt{background\_target} option, the original copy is left in place but
|
|
|
|
|
marked as cached. With the \texttt{promote\_target} option, the original copy is
|
|
|
|
|
left unchanged and the new copy on the \texttt{promote\_target} device is marked
|
|
|
|
|
as cached.
|
|
|
|
|
|
|
|
|
|
To do writeback caching, set \texttt{foreground\_target} and
|
|
|
|
|
\texttt{promote\_target} to the cache device, and \texttt{background\_target} to
|
|
|
|
|
the backing device. To do writearound caching, set \texttt{foreground\_target}
|
|
|
|
|
to the backing device and \texttt{promote\_target} to the cache device.
|
|
|
|
|
|
|
|
|
|
\subsubsection{Durability}
|
|
|
|
|
|
|
|
|
|
Some devices may be considered to be more reliable than others. For example, we
|
|
|
|
|
might have a filesystem composed of a hardware RAID array and several NVMe flash
|
|
|
|
|
devices, to be used as cache. We can set replicas=2 so that losing any of the
|
|
|
|
|
NVMe flash devices will not cause us to lose data, and then additionally we can
|
|
|
|
|
set durability=2 for the hardware RAID device to tell bcachefs that we don't
|
|
|
|
|
need extra replicas for data on that device - data on that device will count as
|
|
|
|
|
two replicas, not just one.
|
|
|
|
|
|
|
|
|
|
The durability option can also be used for writethrough caching: by setting
|
|
|
|
|
durability=0 for a device, it can be used as a cache and only as a cache -
|
|
|
|
|
bcachefs won't consider copies on that device to count towards the number of
|
|
|
|
|
replicas we're supposed to keep.
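
The accounting implied by the durability option can be written down as a small Python sketch. It models only the arithmetic described above: the function \texttt{replicas\_needed} and the device names are hypothetical.

\begin{quote} \begin{verbatim}
```python
def replicas_needed(wanted, copies, durability):
    """Extra replicas required for an extent.

    `copies` lists the devices holding a copy; `durability` maps each
    device to the number of replicas one copy on it counts for.
    """
    have = sum(durability[dev] for dev in copies)
    return max(0, wanted - have)


durability = {"raid": 2, "nvme0": 1, "nvme1": 1, "cache": 0}

on_raid = replicas_needed(2, ["raid"], durability)
on_one_nvme = replicas_needed(2, ["nvme0"], durability)
on_cache_only = replicas_needed(2, ["cache"], durability)
```
\end{verbatim} \end{quote}

With replicas=2, a single copy on the RAID device is already sufficient, a single copy on one NVMe device needs one more replica, and a copy on a durability=0 device does not count at all.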
|
|
|
|
|
|
|
|
|
|
\subsection{Reflink}
|
|
|
|
|
|
|
|
|
|
bcachefs supports reflink, similarly to other filesystems with the same feature.
|
|
|
|
|
\texttt{cp --reflink} will create a copy that shares the underlying storage.
|
|
|
|
|
Reading from that file will become slightly slower - the extent pointing to that
|
|
|
|
|
data is moved to the reflink btree (with a refcount added) and in the extents
|
|
|
|
|
btree we leave a key that points to the indirect extent in the reflink btree,
|
|
|
|
|
meaning that we now have to do two btree lookups to read from that data instead
|
|
|
|
|
of just one.
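
The indirection can be modelled with two dictionaries standing in for the extents and reflink btrees. This Python sketch is a toy model, not the on-disk format: the key layout and the names \texttt{reflink\_copy} and \texttt{read} are assumptions.

\begin{quote} \begin{verbatim}
```python
extents = {}  # (inode, offset) -> ("direct", data) or ("indirect", idx)
reflink = {}  # idx -> {"data": ..., "refcount": n}


def reflink_copy(src, dst):
    """Share the extent at `src` with `dst` instead of copying data."""
    kind, value = extents[src]
    if kind == "direct":
        # Move the extent to the reflink btree, leaving a pointer behind.
        idx = len(reflink)
        reflink[idx] = {"data": value, "refcount": 1}
        extents[src] = ("indirect", idx)
    else:
        idx = value
    reflink[idx]["refcount"] += 1
    extents[dst] = ("indirect", idx)


def read(key):
    """Return (data, number of btree lookups performed)."""
    kind, value = extents[key]
    if kind == "direct":
        return value, 1
    return reflink[value]["data"], 2


extents[(1, 0)] = ("direct", b"hello")
before = read((1, 0))
reflink_copy((1, 0), (2, 0))
after_src = read((1, 0))
after_dst = read((2, 0))
```
\end{verbatim} \end{quote}

Before the copy, a read of the source takes one lookup; afterwards both the source and the copy take two, and the shared extent carries a refcount of 2.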
|
|
|
|
|
|
|
|
|
|
\subsection{Inline data extents}
|
|
|
|
|
|
|
|
|
|
bcachefs supports inline data extents, controlled by the \texttt{inline\_data}
|
|
|
|
|
option (on by default). When the end of a file is being written and is smaller
|
|
|
|
|
than half of the filesystem blocksize, it will be written as an inline data
|
|
|
|
|
extent. Inline data extents can also be reflinked (moved to the reflink btree
|
|
|
|
|
with a refcount added): as a todo item we also intend to support compressed
|
|
|
|
|
inline data extents.
|
|
|
|
|
|
|
|
|
|
\subsection{Subvolumes and snapshots}
|
|
|
|
|
|
|
|
|
|
bcachefs supports subvolumes and snapshots with a userspace interface similar to that of
|
|
|
|
|
btrfs. A new subvolume may be created empty, or it may be created as a snapshot
|
|
|
|
|
of another subvolume. Snapshots are writeable and may be snapshotted again,
|
|
|
|
|
creating a tree of snapshots.
|
|
|
|
|
|
|
|
|
|
Snapshots are very cheap to create: they're not based on cloning of COW btrees
|
|
|
|
|
as with btrfs, but instead are based on versioning of individual keys in the
|
|
|
|
|
btrees. Many thousands or millions of snapshots can be created, with the only
|
|
|
|
|
limitation being disk space.
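
Key versioning can be sketched as follows. This Python model is an assumption-laden illustration rather than a description of the implementation: it supposes that each key carries a snapshot ID, that snapshot IDs form a tree, and that taking a snapshot adds two child IDs below the current one. All names are hypothetical.

\begin{quote} \begin{verbatim}
```python
import itertools

ids = itertools.count(1)
parent = {}   # snapshot id -> parent id (None at the root)
subvols = {}  # subvolume name -> snapshot id it currently writes to
keys = {}     # (key, snapshot id) -> value


def create(name):
    sid = next(ids)
    parent[sid] = None
    subvols[name] = sid


def snapshot(src, dst):
    """Constant time: add two child ids below src's id; copy no keys."""
    old = subvols[src]
    for name in (src, dst):
        sid = next(ids)
        parent[sid] = old
        subvols[name] = sid


def write(name, key, value):
    keys[(key, subvols[name])] = value


def lookup(name, key):
    """Nearest-ancestor version of `key` visible from subvolume `name`."""
    sid = subvols[name]
    while sid is not None:
        if (key, sid) in keys:
            return keys[(key, sid)]
        sid = parent[sid]
    return None


create("root")
write("root", "file", "v1")
snapshot("root", "snap")
write("snap", "file", "v2")    # snapshots are writeable
write("root", "other", "new")  # not visible from the snapshot
```
\end{verbatim} \end{quote}

In this model, creating a snapshot touches no existing keys, so its cost does not depend on the amount of data in the subvolume; space is consumed only as the two sides diverge.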
|
|
|
|
|
|
|
|
|
|
The following subcommands exist for managing subvolumes and snapshots:
|
|
|
|
|
\begin{itemize}
|
|
|
|
|
\item \texttt{bcachefs subvolume create}: Create a new, empty subvolume
|
|
|
|
|
\item \texttt{bcachefs subvolume delete}: Delete an existing subvolume
|
|
|
|
|
or snapshot
|
|
|
|
|
\item \texttt{bcachefs subvolume snapshot}: Create a snapshot of an
|
|
|
|
|
existing subvolume
|
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
|
|
A subvolume can also be deleted with a normal rmdir after deleting all the
|
|
|
|
|
contents, as with \texttt{rm -rf}. Still to be implemented: read-only snapshots,
|
|
|
|
|
recursive snapshot creation, and a method for recursively listing subvolumes.
|
|
|
|
|
|
|
|
|
|
\subsection{Quotas}
|
|
|
|
|
|
|
|
|
|
bcachefs supports conventional user/group/project quotas. Quotas do not
|
|
|
|
|
currently apply to snapshot subvolumes, because if a file changes ownership in
|
|
|
|
|
the snapshot it would be ambiguous as to what quota data within that file
|
|
|
|
|
should be charged to.
|
|
|
|
|
|
|
|
|
|
When a directory has a project ID set it is inherited automatically by
|
|
|
|
|
descendants on creation and rename. When renaming a directory would cause the
|
|
|
|
|
project ID to change we return \texttt{-EXDEV}, so that the move is done file by file and
the project ID is propagated correctly to descendants - thus, project
|
|
|
|
|
quotas can be used as subdirectory quotas.
|
|
|
|
|
|
|
|
|
|
\section{Management}
|
|
|
|
|
|
|
|
|
|
\subsection{Formatting}
|
|
|
|
|
|
|
|
|
|
To format a new bcachefs filesystem use the subcommand \texttt{bcachefs format},
or \texttt{mkfs.bcachefs}. All persistent filesystem-wide options can
|
|
|
|
|
be specified at format time. For an example of a multi device filesystem with
|
|
|
|
|
compression, encryption, replication and writeback caching:
|
|
|
|
|
\begin{quote} \begin{verbatim}
bcachefs format --compression=lz4 \
    --encrypted \
    --replicas=2 \
    --label=ssd.ssd1 /dev/sda \
    --label=ssd.ssd2 /dev/sdb \
    --label=hdd.hdd1 /dev/sdc \
    --label=hdd.hdd2 /dev/sdd \
    --label=hdd.hdd3 /dev/sde \
    --label=hdd.hdd4 /dev/sdf \
    --foreground_target=ssd \
    --promote_target=ssd \
    --background_target=hdd
\end{verbatim} \end{quote}
|
|
|
|
|
|
|
|
|
|
\subsection{Mounting}
|
|
|
|
|
|
|
|
|
|
To mount a multi device filesystem, there are two options. You can specify all
|
|
|
|
|
component devices, separated by colons, e.g.
|
|
|
|
|
\begin{quote} \begin{verbatim}
mount -t bcachefs /dev/sda:/dev/sdb:/dev/sdc /mnt
\end{verbatim} \end{quote}
|
|
|
|
|
Or, use the mount.bcachefs tool to mount by filesystem UUID. Still todo: improve
|
|
|
|
|
the mount.bcachefs tool to support mounting by filesystem label.
|
|
|
|
|
|
|
|
|
|
No special handling is needed for recovering from unclean shutdown. Journal
|
|
|
|
|
replay happens automatically, and diagnostic messages in the dmesg log will
|
|
|
|
|
indicate whether recovery was from clean or unclean shutdown.
|
|
|
|
|
|
|
|
|
|
The \texttt{-o degraded} option will allow a filesystem to be mounted without
|
|
|
|
|
all the devices, but will fail if data would be missing. The
|
|
|
|
|
\texttt{-o very\_degraded} option can be used to attempt mounting when data would be
|
|
|
|
|
missing.
|
|
|
|
|
|
|
|
|
|
Also relevant is the \texttt{-o nochanges} option. It disallows any and all
|
|
|
|
|
writes to the underlying devices, pinning dirty data in memory as necessary if
|
|
|
|
|
for example journal replay was necessary - think of it as a ``super read-only''
|
|
|
|
|
mode. It can be used for data recovery, and for testing version upgrades.
|
|
|
|
|
|
|
|
|
|
The \texttt{-o verbose} option enables additional log output during the mount process.
|
|
|
|
|
|
|
|
|
|
\subsection{Fsck}
|
|
|
|
|
|
|
|
|
|
It is possible to run fsck either in userspace with the \texttt{bcachefs fsck}
|
|
|
|
|
subcommand (also available as \texttt{fsck.bcachefs}), or in the kernel while
mounting by specifying the \texttt{-o fsck} mount option. In either case the
|
|
|
|
|
exact same fsck implementation is being run, only the environment is different.
|
|
|
|
|
Running fsck in the kernel at mount time has the advantage of somewhat better
|
|
|
|
|
performance, while running in userspace has the ability to be stopped with
|
|
|
|
|
ctrl-c and can prompt the user for fixing errors. To fix errors while running
|
|
|
|
|
fsck in the kernel, use the \texttt{-o fix\_errors} option.
|
|
|
|
|
|
|
|
|
|
The \texttt{-n} option passed to fsck implies the \texttt{-o nochanges} option;
|
|
|
|
|
\texttt{bcachefs fsck -ny} can be used to test filesystem repair in dry-run
|
|
|
|
|
mode.
|
|
|
|
|
|
|
|
|
|
\subsection{Status of data}
|
|
|
|
|
|
|
|
|
|
The \texttt{bcachefs fs usage} command may be used to display filesystem usage broken
|
|
|
|
|
out in various ways. Data usage is broken out by type: superblock, journal,
|
|
|
|
|
btree, data, cached data, and parity, and by which sets of devices extents are
|
|
|
|
|
replicated across. We also give per-device usage which includes fragmentation
|
|
|
|
|
due to partially used buckets.
|
|
|
|
|
|
|
|
|
|
\subsection{Journal}
|
|
|
|
|
|
|
|
|
|
The journal has a number of tunables that affect filesystem performance. Journal
|
|
|
|
|
commits are fairly expensive operations as they require issuing FLUSH and FUA
|
|
|
|
|
operations to the underlying devices. By default, we issue a journal flush one
|
|
|
|
|
second after a filesystem update has been done; this is controlled with the
|
|
|
|
|
\texttt{journal\_flush\_delay} option, which takes a parameter in milliseconds.
|
|
|
|
|
|
|
|
|
|
Filesystem sync and fsync operations issue journal flushes; this can be disabled
|
|
|
|
|
with the \texttt{journal\_flush\_disabled} option - the
|
|
|
|
|
\texttt{journal\_flush\_delay} option will still apply, and in the event of a
|
|
|
|
|
system crash we will never lose more than (by default) one second of work. This
|
|
|
|
|
option may be useful on a personal workstation or laptop, and perhaps less
|
|
|
|
|
appropriate on a server.
|
|
|
|
|
|
|
|
|
|
The journal reclaim thread runs in the background, kicking off btree node writes
|
|
|
|
|
and btree key cache flushes to free up space in the journal. Even in the absence
|
|
|
|
|
of space pressure it will run slowly in the background: this is controlled by
|
|
|
|
|
the \texttt{journal\_reclaim\_delay} parameter, with a default of 100
|
|
|
|
|
milliseconds.
|
|
|
|
|
|
|
|
|
|
The journal should be sized sufficiently that bursts of activity do not fill up
|
|
|
|
|
the journal too quickly; also, a larger journal means that we can queue up
|
|
|
|
|
larger btree writes. The \texttt{bcachefs device resize-journal} command can be used for
|
|
|
|
|
resizing the journal on disk on a particular device - it can be used on a
|
|
|
|
|
mounted or unmounted filesystem.
|
|
|
|
|
|
|
|
|
|
In the future, we should implement a method to see how much space is currently
|
|
|
|
|
utilized in the journal.
|
|
|
|
|
|
|
|
|
|
\subsection{Device management}
|
|
|
|
|
|
|
|
|
|
\subsubsection{Filesystem resize}
|
|
|
|
|
|
|
|
|
|
A filesystem can be resized on a particular device with the
|
|
|
|
|
\texttt{bcachefs device resize} subcommand. Currently only growing is supported,
|
|
|
|
|
not shrinking.
|
|
|
|
|
|
|
|
|
|
\subsubsection{Device add/removal}
|
|
|
|
|
|
|
|
|
|
The following subcommands exist for adding and removing devices from a mounted
|
|
|
|
|
filesystem:
|
|
|
|
|
\begin{itemize}
|
|
|
|
|
\item \texttt{bcachefs device add}: Formats and adds a new device to an
|
|
|
|
|
existing filesystem.
|
|
|
|
|
\item \texttt{bcachefs device remove}: Permanently removes a device from
|
|
|
|
|
an existing filesystem.
|
|
|
|
|
\item \texttt{bcachefs device online}: Connects a device to a running
|
|
|
|
|
		filesystem that was mounted without it (i.e. in degraded mode).
|
|
|
|
|
\item \texttt{bcachefs device offline}: Disconnects a device from a
|
|
|
|
|
mounted filesystem without removing it.
|
|
|
|
|
\item \texttt{bcachefs device evacuate}: Migrates data off of a
|
|
|
|
|
particular device to prepare for removal, setting it read-only
|
|
|
|
|
if necessary.
|
|
|
|
|
\item \texttt{bcachefs device set-state}: Changes the state of a member
|
|
|
|
|
device: one of rw (readwrite), ro (readonly), failed, or spare.
|
|
|
|
|
|
|
|
|
|
A failed device is considered to have 0 durability, and replicas
|
|
|
|
|
on that device won't be counted towards the number of replicas
|
|
|
|
|
an extent should have by rereplicate - however, bcachefs will
|
|
|
|
|
still attempt to read from devices marked as failed.
|
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
|
|
The \texttt{bcachefs device remove}, \texttt{bcachefs device offline} and
|
|
|
|
|
\texttt{bcachefs device set-state} commands take force options for when they
|
|
|
|
|
would leave the filesystem degraded or with data missing. Todo: regularize and
|
|
|
|
|
improve those options.
|
|
|
|
|
|
|
|
|
|
\subsection{Data management}
|
|
|
|
|
|
|
|
|
|
\subsubsection{Data rereplicate}
|
|
|
|
|
|
|
|
|
|
The \texttt{bcachefs data rereplicate} command may be used to scan for extents
|
|
|
|
|
that have insufficient replicas and write additional replicas, e.g. after a
|
|
|
|
|
device has been removed from a filesystem or after replication has been enabled
|
|
|
|
|
or increased.
|
|
|
|
|
|
|
|
|
|
\subsubsection{Rebalance}
|
|
|
|
|
|
|
|
|
|
To be implemented: a command for moving data between devices to equalize usage
|
|
|
|
|
on each device. Not normally required because the allocator attempts to equalize
|
|
|
|
|
usage across devices as it stripes, but can be necessary in certain scenarios -
|
|
|
|
|
e.g. when a two-device filesystem with replication enabled that is very full has
|
|
|
|
|
a third device added.
|
|
|
|
|
|
|
|
|
|
\subsubsection{Scrub}
|
|
|
|
|
|
|
|
|
|
To be implemented: a command for reading all data within a filesystem and
|
|
|
|
|
ensuring that checksums are valid, fixing bitrot when a valid copy can be found.
|
|
|
|
|
|
|
|
|
|
\section{Options}
|
|
|
|
|
|
|
|
|
|
Most bcachefs options can be set filesystem wide, and a significant subset can
|
|
|
|
|
also be set on inodes (files and directories), overriding the global defaults.
|
|
|
|
|
Filesystem wide options may be set when formatting, when mounting, or at runtime
|
|
|
|
|
via \texttt{/sys/fs/bcachefs/<uuid>/options/}. When set at runtime via sysfs,
|
|
|
|
|
the persistent options in the superblock are updated as well; when options are
|
|
|
|
|
passed as mount parameters the persistent options are unmodified.
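
As a rough illustration of the runtime interface, the sketch below builds the
path of a single option under the sysfs options directory and reads or writes
it. It assumes that each option appears there as a file named after the option;
the helper names are hypothetical and are not part of bcachefs-tools.

```python
import os

# Root of the bcachefs sysfs tree, as given in the text.
SYSFS_ROOT = "/sys/fs/bcachefs"


def option_path(uuid, name, root=SYSFS_ROOT):
    """Return the assumed sysfs path for one filesystem-wide option."""
    return os.path.join(root, uuid, "options", name)


def read_option(uuid, name, root=SYSFS_ROOT):
    """Read the current value of an option as a stripped string."""
    with open(option_path(uuid, name, root)) as f:
        return f.read().strip()


def write_option(uuid, name, value, root=SYSFS_ROOT):
    """Set an option at runtime.

    Per the text, a value written via sysfs is also made persistent in the
    superblock, unlike a value passed as a mount parameter.
    """
    with open(option_path(uuid, name, root), "w") as f:
        f.write(str(value))
```

The \texttt{root} parameter exists only so that the helpers can be exercised
against a scratch directory instead of a live filesystem.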
|
|
|
|
|
|
|
|
|
|
\subsection{File and directory options}
|
|
|
|
|
|
|
|
|
|
<say something here about how attrs must be set via bcachefs attr command>
|
|
|
|
|
|
|
|
|
|
Options set on inodes (files and directories) are automatically inherited by
|
|
|
|
|
their descendants, and inodes also record whether a given option was explicitly
|
|
|
|
|
set or inherited from their parent. When renaming a directory would cause
|
|
|
|
|
inherited attributes to change we fail the rename with -EXDEV, causing userspace
|
|
|
|
|
to do the rename file by file so that inherited attributes stay consistent.
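
The userspace side of this behaviour can be sketched as follows: attempt the
atomic rename first, and fall back to moving the tree entry by entry only when
the kernel answers with \texttt{EXDEV}. This is a minimal sketch of the generic
pattern (it is what tools such as \texttt{mv} do), and the function name is
hypothetical.

```python
import errno
import os
import shutil


def rename_with_fallback(src, dst):
    """Rename src to dst, falling back to copy-and-delete on EXDEV."""
    try:
        os.rename(src, dst)
        return "renamed"
    except OSError as e:
        # Any error other than EXDEV is a genuine failure.
        if e.errno != errno.EXDEV:
            raise
    # The atomic rename was refused. shutil.move() copies the source to the
    # destination and then removes the source, so the files are created anew
    # under dst.
    shutil.move(src, dst)
    return "moved"
```

On a filesystem where the rename does not change inherited attributes, the
first branch succeeds and the fallback is never reached.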
|
|
|
|
|
|
|
|
|
|
Inode options are available as extended attributes. The options that have been
|
|
|
|
|
explicitly set are available under the \texttt{bcachefs} namespace, and the
|
|
|
|
|
effective options (explicitly set and inherited options) are available under the
|
|
|
|
|
\texttt{bcachefs\_effective} namespace. Examples of listing options with the
|
|
|
|
|
getfattr command:
|
|
|
|
|
|
|
|
|
|
\begin{quote} \begin{verbatim}
|
|
|
|
|
$ getfattr -d -m '^bcachefs\.' filename
|
|
|
|
|
$ getfattr -d -m '^bcachefs_effective\.' filename
|
|
|
|
|
\end{verbatim} \end{quote}
|
|
|
|
|
|
|
|
|
|
Options may be set via the extended attribute interface, but it is preferable to
|
|
|
|
|
use the \texttt{bcachefs setattr} command as it will correctly propagate options
|
|
|
|
|
recursively.
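
For programmatic access, the two namespaces can be separated by prefix. The
sketch below assumes that the option attributes are returned by
\texttt{listxattr}, as the \texttt{getfattr} examples suggest; the function
names are hypothetical.

```python
import os

EXPLICIT_PREFIX = "bcachefs."
EFFECTIVE_PREFIX = "bcachefs_effective."


def split_option_xattrs(names):
    """Split xattr names into (explicitly set, effective) option names."""
    explicit = sorted(
        n[len(EXPLICIT_PREFIX):] for n in names if n.startswith(EXPLICIT_PREFIX)
    )
    effective = sorted(
        n[len(EFFECTIVE_PREFIX):] for n in names if n.startswith(EFFECTIVE_PREFIX)
    )
    return explicit, effective


def list_inode_options(path):
    """List the option names visible on a file or directory (Linux only)."""
    return split_option_xattrs(os.listxattr(path))
```

An option that appears in the effective list but not in the explicit list was
inherited from a parent directory.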
|
|
|
|
|
|
|
|
|
|
\subsection{Full option list}
|
|
|
|
|
|
|
|
|
|
\begin{tabbing}
|
|
|
|
|
\hspace{0.2in} \= \kill
|
|
|
|
|
\texttt{block\_size} \` \textbf{format} \\
|
|
|
|
|
\> \parbox{4.3in}{Filesystem block size (default 4k)} \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{btree\_node\_size} \` \textbf{format} \\
|
|
|
|
|
\> Btree node size (default 256k) \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{errors} \` \textbf{format,mount,runtime} \\
|
|
|
|
|
\> Action to take on filesystem error \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{metadata\_replicas} \` \textbf{format,mount,runtime} \\
|
|
|
|
|
\> Number of replicas for metadata (journal and btree) \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{data\_replicas} \` \textbf{format,mount,runtime,inode} \\
|
|
|
|
|
\> Number of replicas for user data \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{replicas} \` \textbf{format} \\
|
|
|
|
|
\> Alias for both metadata\_replicas and data\_replicas \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{metadata\_checksum} \` \textbf{format,mount,runtime} \\
|
|
|
|
|
\> Checksum type for metadata writes \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{data\_checksum} \` \textbf{format,mount,runtime,inode} \\
|
|
|
|
|
\> Checksum type for data writes \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{compression} \` \textbf{format,mount,runtime,inode} \\
|
|
|
|
|
\> Compression type \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{background\_compression} \` \textbf{format,mount,runtime,inode} \\
|
|
|
|
|
\> Background compression type \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{str\_hash} \` \textbf{format,mount,runtime,inode} \\
|
|
|
|
|
\> Hash function for string hash tables (directories and xattrs) \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{metadata\_target} \` \textbf{format,mount,runtime,inode} \\
|
|
|
|
|
\> Preferred target for metadata writes \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{foreground\_target} \` \textbf{format,mount,runtime,inode} \\
|
|
|
|
|
\> Preferred target for foreground writes \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{background\_target} \` \textbf{format,mount,runtime,inode} \\
|
|
|
|
|
\> Target for data to be moved to in the background \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{promote\_target} \` \textbf{format,mount,runtime,inode} \\
|
|
|
|
|
\> Target for data to be copied to on read \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{erasure\_code} \` \textbf{format,mount,runtime,inode} \\
|
|
|
|
|
\> Enable erasure coding \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{inodes\_32bit} \` \textbf{format,mount,runtime} \\
|
|
|
|
|
\> Restrict new inode numbers to 32 bits \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{shard\_inode\_numbers} \` \textbf{format,mount,runtime} \\
|
|
|
|
|
\> Use CPU id for high bits of new inode numbers. \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{wide\_macs} \` \textbf{format,mount,runtime} \\
|
|
|
|
|
\> Store full 128 bit cryptographic MACs (default 80) \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{inline\_data} \` \textbf{format,mount,runtime} \\
|
|
|
|
|
\> Enable inline data extents (default on) \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{journal\_flush\_delay} \` \textbf{format,mount,runtime} \\
|
|
|
|
|
\> Delay in milliseconds before automatic journal commit (default 1000) \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{journal\_flush\_disabled} \` \textbf{format,mount,runtime} \\
|
|
|
|
|
\> \begin{minipage}{4.3in}Disables journal flush on sync/fsync.
|
|
|
|
|
\texttt{journal\_flush\_delay} remains in effect, thus with the
|
|
|
|
|
default setting not more than 1 second of work will be lost.
|
|
|
|
|
\end{minipage} \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{journal\_reclaim\_delay}\` \textbf{format,mount,runtime} \\
|
|
|
|
|
\> Delay in milliseconds before automatic journal reclaim (default 100) \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{acl} \` \textbf{format,mount} \\
|
|
|
|
|
\> Enable POSIX ACLs \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{usrquota} \` \textbf{format,mount} \\
|
|
|
|
|
\> Enable user quotas \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{grpquota} \` \textbf{format,mount} \\
|
|
|
|
|
\> Enable group quotas \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{prjquota} \` \textbf{format,mount} \\
|
|
|
|
|
\> Enable project quotas \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{degraded} \` \textbf{mount} \\
|
|
|
|
|
\> Allow mounting with data degraded \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{very\_degraded} \` \textbf{mount} \\
|
|
|
|
|
\> Allow mounting with data missing \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{verbose} \` \textbf{mount} \\
|
|
|
|
|
\> Extra debugging info during mount/recovery \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{fsck} \` \textbf{mount} \\
|
|
|
|
|
\> Run fsck during mount \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{fix\_errors} \` \textbf{mount} \\
|
|
|
|
|
\> Fix errors without asking during fsck \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{ratelimit\_errors} \` \textbf{mount} \\
|
|
|
|
|
\> Ratelimit error messages during fsck \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{read\_only} \` \textbf{mount} \\
|
|
|
|
|
\> Mount in read only mode \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{nochanges} \` \textbf{mount} \\
|
|
|
|
|
\> Issue no writes, even for journal replay \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{norecovery} \` \textbf{mount} \\
|
|
|
|
|
\> Don't replay the journal (not recommended) \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{noexcl} \` \textbf{mount} \\
|
|
|
|
|
\> Don't open devices in exclusive mode \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{version\_upgrade} \` \textbf{mount} \\
|
|
|
|
|
\> Upgrade on disk format to latest version \\ \\
|
|
|
|
|
|
|
|
|
|
\texttt{discard} \` \textbf{device} \\
|
|
|
|
|
\> Enable discard/TRIM support \\ \\
|
|
|
|
|
\end{tabbing}
|
|
|
|
|
|
|
|
|
|
\subsection{Error actions}
|
|
|
|
|
The \texttt{errors} option is used for inconsistencies that indicate some sort
|
|
|
|
|
of a bug. Valid error actions are:
|
|
|
|
|
\begin{description}
|
|
|
|
|
\item[{\tt continue}] Log the error but continue normal operation
|
|
|
|
|
\item[{\tt ro}] Emergency read only, immediately halting any changes
|
|
|
|
|
to the filesystem on disk
|
|
|
|
|
\item[{\tt panic}] Immediately halt the entire machine, printing a
|
|
|
|
|
backtrace on the system console
|
|
|
|
|
\end{description}
|
|
|
|
|
|
|
|
|
|
\subsection{Checksum types}
|
|
|
|
|
Valid checksum types are:
|
|
|
|
|
\begin{description}
|
|
|
|
|
\item[{\tt none}]
|
|
|
|
|
\item[{\tt crc32c}] (default)
|
|
|
|
|
\item[{\tt crc64}]
|
|
|
|
|
\end{description}
|
|
|
|
|
|
|
|
|
|
\subsection{Compression types}
|
|
|
|
|
Valid compression types are:
|
|
|
|
|
\begin{description}
|
|
|
|
|
\item[{\tt none}] (default)
|
|
|
|
|
\item[{\tt lz4}]
|
|
|
|
|
\item[{\tt gzip}]
|
|
|
|
|
\item[{\tt zstd}]
|
|
|
|
|
\end{description}
|
|
|
|
|
|
|
|
|
|
\subsection{String hash types}
|
|
|
|
|
Valid hash types for string hash tables are:
|
|
|
|
|
\begin{description}
|
|
|
|
|
\item[{\tt crc32c}]
|
|
|
|
|
\item[{\tt crc64}]
|
|
|
|
|
\item[{\tt siphash}] (default)
|
|
|
|
|
\end{description}
|
|
|
|
|
|
|
|
|
|
\section{Debugging tools}
|
|
|
|
|
|
|
|
|
|
\subsection{Sysfs interface}
|
|
|
|
|
|
|
|
|
|
Mounted filesystems are available in sysfs at \texttt{/sys/fs/bcachefs/<uuid>/}
|
|
|
|
|
with various options, performance counters and internal debugging aids.
|
|
|
|
|
|
|
|
|
|
\subsubsection{Options}
|
|
|
|
|
|
|
|
|
|
Filesystem options may be viewed and changed via \\
|
|
|
|
|
\texttt{/sys/fs/bcachefs/<uuid>/options/}, and settings changed via sysfs will
|
|
|
|
|
be persistently changed in the superblock as well.
|
|
|
|
|
|
|
|
|
|
\subsubsection{Time stats}
|
|
|
|
|
|
|
|
|
|
bcachefs tracks the latency and frequency of various operations and events, with
|
|
|
|
|
quantiles for latency/duration in the
|
|
|
|
|
\texttt{/sys/fs/bcachefs/<uuid>/time\_stats/} directory.
|
|
|
|
|
|
|
|
|
|
\begin{description}
|
|
|
|
|
\item \texttt{blocked\_allocate} \\
|
|
|
|
|
Tracks when allocating a bucket must wait because none are
|
|
|
|
|
immediately available, meaning the copygc thread is not keeping
|
|
|
|
|
up with evacuating mostly empty buckets or the allocator thread
|
|
|
|
|
is not keeping up with invalidating and discarding buckets.
|
|
|
|
|
|
|
|
|
|
\item \texttt{blocked\_allocate\_open\_bucket} \\
|
|
|
|
|
Tracks when allocating a bucket must wait because all of our
|
|
|
|
|
handles for pinning open buckets are in use (we statically
|
|
|
|
|
allocate 1024).
|
|
|
|
|
|
|
|
|
|
\item \texttt{blocked\_journal} \\
|
|
|
|
|
Tracks when getting a journal reservation must wait, either
|
|
|
|
|
because journal reclaim isn't keeping up with reclaiming space
|
|
|
|
|
in the journal, or because journal writes are taking too long to
|
|
|
|
|
complete and we already have too many in flight.
|
|
|
|
|
|
|
|
|
|
\item \texttt{btree\_gc} \\
|
|
|
|
|
Tracks when the btree\_gc code must walk the btree at runtime -
|
|
|
|
|
for recalculating the oldest outstanding generation number of
|
|
|
|
|
every bucket in the btree.
|
|
|
|
|
|
|
|
|
|
\item \texttt{btree\_lock\_contended\_read}
|
|
|
|
|
\item \texttt{btree\_lock\_contended\_intent}
|
|
|
|
|
\item \texttt{btree\_lock\_contended\_write} \\
|
|
|
|
|
		Tracks when taking a read, intent or write lock on a btree node
|
|
|
|
|
must block.
|
|
|
|
|
|
|
|
|
|
\item \texttt{btree\_node\_mem\_alloc} \\
|
|
|
|
|
Tracks the total time to allocate memory in the btree node cache
|
|
|
|
|
for a new btree node.
|
|
|
|
|
|
|
|
|
|
\item \texttt{btree\_node\_split} \\
|
|
|
|
|
Tracks btree node splits - when a btree node becomes full and is
|
|
|
|
|
		split into two new nodes.
|
|
|
|
|
|
|
|
|
|
\item \texttt{btree\_node\_compact} \\
|
|
|
|
|
Tracks btree node compactions - when a btree node becomes full
|
|
|
|
|
and needs to be compacted on disk.
|
|
|
|
|
|
|
|
|
|
\item \texttt{btree\_node\_merge} \\
|
|
|
|
|
Tracks when two adjacent btree nodes are merged.
|
|
|
|
|
|
|
|
|
|
\item \texttt{btree\_node\_sort} \\
|
|
|
|
|
Tracks sorting and resorting entire btree nodes in memory,
|
|
|
|
|
either after reading them in from disk or for compacting prior
|
|
|
|
|
to creating a new sorted array of keys.
|
|
|
|
|
|
|
|
|
|
\item \texttt{btree\_node\_read} \\
|
|
|
|
|
Tracks reading in btree nodes from disk.
|
|
|
|
|
|
|
|
|
|
\item \texttt{btree\_interior\_update\_foreground} \\
|
|
|
|
|
Tracks foreground time for btree updates that change btree
|
|
|
|
|
topology - i.e. btree node splits, compactions and merges; the
|
|
|
|
|
duration measured roughly corresponds to lock held time.
|
|
|
|
|
|
|
|
|
|
\item \texttt{btree\_interior\_update\_total} \\
|
|
|
|
|
Tracks time to completion for topology changing btree updates;
|
|
|
|
|
first they have a foreground part that updates btree nodes in
|
|
|
|
|
memory, then after the new nodes are written there is a
|
|
|
|
|
transaction phase that records an update to an interior node or
|
|
|
|
|
a new btree root as well as changes to the alloc btree.
|
|
|
|
|
|
|
|
|
|
\item \texttt{data\_read} \\
|
|
|
|
|
Tracks the core read path - looking up a request in the extents
|
|
|
|
|
(and possibly also reflink) btree, allocating bounce buffers if
|
|
|
|
|
necessary, issuing reads, checksumming, decompressing, decrypting,
|
|
|
|
|
and delivering completions.
|
|
|
|
|
|
|
|
|
|
\item \texttt{data\_write} \\
|
|
|
|
|
Tracks the core write path - allocating space on disk for a new
|
|
|
|
|
write, allocating bounce buffers if necessary,
|
|
|
|
|
compressing, encrypting, checksumming, issuing writes, and
|
|
|
|
|
updating the extents btree to point to the new data.
|
|
|
|
|
|
|
|
|
|
\item \texttt{data\_promote} \\
|
|
|
|
|
Tracks promote operations, which happen when a read operation
|
|
|
|
|
writes an additional cached copy of an extent to
|
|
|
|
|
\texttt{promote\_target}. This is done asynchronously from the
|
|
|
|
|
original read.
|
|
|
|
|
|
|
|
|
|
\item \texttt{journal\_flush\_write} \\
|
|
|
|
|
Tracks writing of flush journal entries to disk, which first
|
|
|
|
|
issue cache flush operations to the underlying devices then
|
|
|
|
|
issue the journal writes as FUA writes. Time is tracked starting
|
|
|
|
|
from after all journal reservations have released their
|
|
|
|
|
references or the completion of the previous journal write.
|
|
|
|
|
|
|
|
|
|
\item \texttt{journal\_noflush\_write} \\
|
|
|
|
|
Tracks writing of non-flush journal entries to disk, which do
|
|
|
|
|
not issue cache flushes or FUA writes.
|
|
|
|
|
|
|
|
|
|
\item \texttt{journal\_flush\_seq} \\
|
|
|
|
|
Tracks time to flush a journal sequence number to disk by
|
|
|
|
|
filesystem sync and fsync operations, as well as the allocator
|
|
|
|
|
prior to reusing buckets when none that do not need flushing are
|
|
|
|
|
available.
|
|
|
|
|
\end{description}
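
To capture all of these statistics at once, for example before and after a
benchmark run, the directory can simply be read file by file. The sketch below
assumes nothing about the format of each file and returns the raw text; the
function name is hypothetical.

```python
import os


def read_time_stats(uuid, root="/sys/fs/bcachefs"):
    """Return {stat_name: raw_text} for every file in time_stats/."""
    stats_dir = os.path.join(root, uuid, "time_stats")
    stats = {}
    for name in sorted(os.listdir(stats_dir)):
        path = os.path.join(stats_dir, name)
        if os.path.isfile(path):
            with open(path) as f:
                stats[name] = f.read()
    return stats
```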
|
|
|
|
|
|
|
|
|
|
\subsubsection{Internals}
|
|
|
|
|
|
|
|
|
|
\begin{description}
|
|
|
|
|
\item \texttt{btree\_cache} \\
|
|
|
|
|
Shows information on the btree node cache: number of cached
|
|
|
|
|
nodes, number of dirty nodes, and whether the cannibalize lock
|
|
|
|
|
(for reclaiming cached nodes to allocate new nodes) is held.
|
|
|
|
|
|
|
|
|
|
\item \texttt{dirty\_btree\_nodes} \\
|
|
|
|
|
Prints information related to the interior btree node update
|
|
|
|
|
machinery, which is responsible for ensuring dependent btree
|
|
|
|
|
node writes are ordered correctly.
|
|
|
|
|
|
|
|
|
|
For each dirty btree node, prints:
|
|
|
|
|
\begin{itemize}
|
|
|
|
|
\item Whether the \texttt{need\_write} flag is set
|
|
|
|
|
\item The level of the btree node
|
|
|
|
|
\item The number of sectors written
|
|
|
|
|
\item Whether writing this node is blocked, waiting for
|
|
|
|
|
other nodes to be written
|
|
|
|
|
\item Whether it is waiting on a btree\_update to
|
|
|
|
|
complete and make it reachable on-disk
|
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
|
|
\item \texttt{btree\_key\_cache} \\
|
|
|
|
|
Prints information on the btree key cache: number of freed keys
|
|
|
|
|
		(which must wait for an SRCU barrier to complete before being
|
|
|
|
|
freed), number of cached keys, and number of dirty keys.
|
|
|
|
|
|
|
|
|
|
\item \texttt{btree\_transactions} \\
|
|
|
|
|
		Lists each running btree transaction that has locks held,
		showing which nodes it has locked and with what type of lock,
		which node (if any) the process is blocked attempting to lock,
		and where the btree transaction was invoked from.
|
|
|
|
|
|
|
|
|
|
\item \texttt{btree\_updates} \\
|
|
|
|
|
Lists outstanding interior btree updates: the mode (nothing
|
|
|
|
|
updated yet, or updated a btree node, or wrote a new btree root,
|
|
|
|
|
or was reparented by another btree update), whether its new
|
|
|
|
|
btree nodes have finished writing, its embedded closure's
|
|
|
|
|
refcount (while nonzero, the btree update is still waiting), and
|
|
|
|
|
the pinned journal sequence number.
|
|
|
|
|
|
|
|
|
|
\item \texttt{journal\_debug} \\
|
|
|
|
|
Prints a variety of internal journal state.
|
|
|
|
|
|
|
|
|
|
	\item \texttt{journal\_pins} \\
|
|
|
|
|
Lists items pinning journal entries, preventing them from being
|
|
|
|
|
reclaimed.
|
|
|
|
|
|
|
|
|
|
\item \texttt{new\_stripes} \\
|
|
|
|
|
Lists new erasure-coded stripes being created.
|
|
|
|
|
|
|
|
|
|
\item \texttt{stripes\_heap} \\
|
|
|
|
|
Lists erasure-coded stripes that are available to be reused.
|
|
|
|
|
|
|
|
|
|
\item \texttt{open\_buckets} \\
|
|
|
|
|
Lists buckets currently being written to, along with data type
|
|
|
|
|
and refcount.
|
|
|
|
|
|
|
|
|
|
\item \texttt{io\_timers\_read} \\
|
|
|
|
|
\item \texttt{io\_timers\_write} \\
|
|
|
|
|
Lists outstanding IO timers - timers that wait on total reads or
|
|
|
|
|
writes to the filesystem.
|
|
|
|
|
|
|
|
|
|
\item \texttt{trigger\_journal\_flush} \\
|
|
|
|
|
Echoing to this file triggers a journal commit.
|
|
|
|
|
|
|
|
|
|
\item \texttt{trigger\_gc} \\
|
|
|
|
|
Echoing to this file causes the GC code to recalculate each
|
|
|
|
|
bucket's oldest\_gen field.
|
|
|
|
|
|
|
|
|
|
\item \texttt{prune\_cache} \\
|
|
|
|
|
Echoing to this file prunes the btree node cache.
|
|
|
|
|
|
|
|
|
|
\item \texttt{read\_realloc\_races} \\
|
|
|
|
|
This counts events where the read path reads an extent and
|
|
|
|
|
discovers the bucket that was read from has been reused while
|
|
|
|
|
the IO was in flight, causing the read to be retried.
|
|
|
|
|
|
|
|
|
|
\item \texttt{extent\_migrate\_done} \\
|
|
|
|
|
This counts extents moved by the core move path, used by copygc
|
|
|
|
|
and rebalance.
|
|
|
|
|
|
|
|
|
|
\item \texttt{extent\_migrate\_raced} \\
|
|
|
|
|
This counts extents that the move path attempted to move but no
|
|
|
|
|
longer existed when doing the final btree update.
|
|
|
|
|
\end{description}
|
|
|
|
|
|
|
|
|
|
\subsubsection{Unit and performance tests}
|
|
|
|
|
|
|
|
|
|
Echoing into \texttt{/sys/fs/bcachefs/<uuid>/perf\_test} runs various low level
|
|
|
|
|
btree tests, some intended as unit tests and others as performance tests. The
|
|
|
|
|
syntax is
|
|
|
|
|
\begin{quote} \begin{verbatim}
|
|
|
|
|
echo <test_name> <nr_iterations> <nr_threads> > perf_test
|
|
|
|
|
\end{verbatim} \end{quote}
|
|
|
|
|
|
|
|
|
|
When complete, the elapsed time will be printed in the dmesg log. The full list
|
|
|
|
|
of tests that can be run can be found near the bottom of
|
|
|
|
|
\texttt{fs/bcachefs/tests.c}.
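
When scripting these tests, the three fields can be validated before they are
written to the sysfs file. The sketch below follows the syntax shown above; the
helper names are hypothetical, and the set of valid test names is not checked
here because it is defined only in \texttt{fs/bcachefs/tests.c}.

```python
import os


def perf_test_command(test_name, nr_iterations, nr_threads):
    """Format '<test_name> <nr_iterations> <nr_threads>' for perf_test."""
    if not test_name or any(c.isspace() for c in test_name):
        raise ValueError("test_name must be a single non-empty word")
    if nr_iterations <= 0 or nr_threads <= 0:
        raise ValueError("nr_iterations and nr_threads must be positive")
    # A trailing newline mirrors what `echo` writes.
    return f"{test_name} {nr_iterations} {nr_threads}\n"


def run_perf_test(uuid, test_name, nr_iterations, nr_threads,
                  root="/sys/fs/bcachefs"):
    """Write the command to perf_test; the result appears in dmesg."""
    path = os.path.join(root, uuid, "perf_test")
    with open(path, "w") as f:
        f.write(perf_test_command(test_name, nr_iterations, nr_threads))
```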
|
|
|
|
|
|
|
|
|
|
\subsection{Debugfs interface}
|
|
|
|
|
|
|
|
|
|
The contents of every btree, as well as various internal per-btree-node
|
|
|
|
|
information, are available under \texttt{/sys/kernel/debug/bcachefs/<uuid>/}.
|
|
|
|
|
|
|
|
|
|
For every btree, we have the following files:
|
|
|
|
|
|
|
|
|
|
\begin{description}
|
|
|
|
|
\item \textit{btree\_name} \\
|
|
|
|
|
Entire btree contents, one key per line
|
|
|
|
|
|
|
|
|
|
\item \textit{btree\_name}\texttt{-formats} \\
|
|
|
|
|
Information about each btree node: the size of the packed bkey
|
|
|
|
|
format, how full each btree node is, number of packed and
|
|
|
|
|
unpacked keys, and number of nodes and failed nodes in the
|
|
|
|
|
in-memory search trees.
|
|
|
|
|
|
|
|
|
|
\item \textit{btree\_name}\texttt{-bfloat-failed} \\
|
|
|
|
|
For each sorted set of keys in a btree node, we construct a
|
|
|
|
|
binary search tree in eytzinger layout with compressed keys.
|
|
|
|
|
Sometimes we aren't able to construct a correct compressed
|
|
|
|
|
search key, which results in slower lookups; this file lists the
|
|
|
|
|
keys that resulted in these failed nodes.
|
|
|
|
|
\end{description}
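
Because the plain \textit{btree\_name} file holds one key per line, simple
questions such as how many keys a btree contains can be answered by counting
lines. The sketch below assumes debugfs is mounted at the usual location; the
function name is hypothetical.

```python
import os


def count_btree_keys(uuid, btree_name, root="/sys/kernel/debug/bcachefs"):
    """Count keys in a btree by counting non-empty lines of its debugfs file."""
    path = os.path.join(root, uuid, btree_name)
    with open(path) as f:
        return sum(1 for line in f if line.strip())
```

The file is streamed line by line rather than read whole, since a large btree
can produce a great deal of output.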
|
|
|
|
|
|
|
|
|
|
\subsection{Listing and dumping filesystem metadata}
|
|
|
|
|
|
|
|
|
|
\subsubsection{bcachefs show-super}
|
|
|
|
|
|
|
|
|
|
This subcommand is used for examining and printing bcachefs superblocks. It
|
|
|
|
|
takes two optional parameters:
|
|
|
|
|
\begin{description}
|
|
|
|
|
\item \texttt{-l}: Print superblock layout, which records the amount of
|
|
|
|
|
space reserved for the superblock and the locations of the
|
|
|
|
|
backup superblocks.
|
|
|
|
|
\item \texttt{-f, --fields=(fields)}: List of superblock sections to
|
|
|
|
|
print, \texttt{all} to print all sections.
|
|
|
|
|
\end{description}
|
|
|
|
|
|
|
|
|
|
\subsubsection{bcachefs list}
|
|
|
|
|
|
|
|
|
|
This subcommand gives access to the same functionality as the debugfs interface,
|
|
|
|
|
listing btree nodes and contents, but for offline filesystems.
|
|
|
|
|
|
|
|
|
|
\subsubsection{bcachefs list\_journal}
|
|
|
|
|
|
|
|
|
|
This subcommand lists the contents of the journal, which primarily records btree
|
|
|
|
|
updates ordered by when they occurred.
|
|
|
|
|
|
|
|
|
|
\subsubsection{bcachefs dump}
|
|
|
|
|
|
|
|
|
|
This subcommand can dump all metadata in a filesystem (including multi device
|
|
|
|
|
filesystems) as qcow2 images: when encountering issues that \texttt{fsck}
cannot recover from and that need attention from the developers, this makes it possible
|
|
|
|
|
to send the developers only the required metadata. Encrypted filesystems must
|
|
|
|
|
first be unlocked with \texttt{bcachefs remove-passphrase}.
|
|
|
|
|
|
|
|
|
|
\section{ioctl interface}
|
|
|
|
|
|
|
|
|
|
This section documents bcachefs-specific ioctls:
|
|
|
|
|
|
|
|
|
|
\begin{description}
|
|
|
|
|
\item \texttt{BCH\_IOCTL\_QUERY\_UUID} \\
|
|
|
|
|
Returns the UUID of the filesystem: used to find the sysfs
|
|
|
|
|
directory given a path to a mounted filesystem.
|
|
|
|
|
|
|
|
|
|
\item \texttt{BCH\_IOCTL\_FS\_USAGE} \\
|
|
|
|
|
Queries filesystem usage, returning global counters and a list
|
|
|
|
|
of counters by \texttt{bch\_replicas} entry.
|
|
|
|
|
|
|
|
|
|
\item \texttt{BCH\_IOCTL\_DEV\_USAGE} \\
|
|
|
|
|
Queries usage for a particular device, as bucket and sector
|
|
|
|
|
counts broken out by data type.
|
|
|
|
|
|
|
|
|
|
\item \texttt{BCH\_IOCTL\_READ\_SUPER} \\
|
|
|
|
|
Returns the filesystem superblock, and optionally the superblock
|
|
|
|
|
for a particular device given that device's index.
|
|
|
|
|
|
|
|
|
|
\item \texttt{BCH\_IOCTL\_DISK\_ADD} \\
|
|
|
|
|
Given a path to a device, adds it to a mounted and running
|
|
|
|
|
filesystem. The device must already have a bcachefs superblock;
|
|
|
|
|
options and parameters are read from the new device's superblock
|
|
|
|
|
and added to the member info section of the existing
|
|
|
|
|
filesystem's superblock.
|
|
|
|
|
|
|
|
|
|
\item \texttt{BCH\_IOCTL\_DISK\_REMOVE} \\
|
|
|
|
|
Given a path to a device or a device index, attempts to remove
|
|
|
|
|
it from a mounted and running filesystem. This operation
|
|
|
|
|
requires walking the btree to remove all references to this
|
|
|
|
|
device, and may fail if data would become degraded or lost,
|
|
|
|
|
unless appropriate force flags are set.
|
|
|
|
|
|
|
|
|
|
\item \texttt{BCH\_IOCTL\_DISK\_ONLINE} \\
|
|
|
|
|
Given a path to a device that is a member of a running
|
|
|
|
|
filesystem (in degraded mode), brings it back online.
|
|
|
|
|
|
|
|
|
|
\item \texttt{BCH\_IOCTL\_DISK\_OFFLINE} \\
|
|
|
|
|
Given a path or device index of a device in a multi device
|
|
|
|
|
filesystem, attempts to close it without removing it, so that
|
|
|
|
|
the device may be re-added later and the contents will still be
|
|
|
|
|
available.
|
|
|
|
|
|
|
|
|
|
\item \texttt{BCH\_IOCTL\_DISK\_SET\_STATE} \\
|
|
|
|
|
Given a path or device index of a device in a multi device
|
|
|
|
|
filesystem, attempts to set its state to one of read-write,
|
|
|
|
|
read-only, failed or spare. Takes flags to force if the
|
|
|
|
|
filesystem would become degraded.
|
|
|
|
|
|
|
|
|
|
\item \texttt{BCH\_IOCTL\_DISK\_GET\_IDX} \\
|
|
|
|
|
\item \texttt{BCH\_IOCTL\_DISK\_RESIZE} \\
|
|
|
|
|
\item \texttt{BCH\_IOCTL\_DISK\_RESIZE\_JOURNAL} \\
|
|
|
|
|
\item \texttt{BCH\_IOCTL\_DATA} \\
|
|
|
|
|
		Starts a data job, which walks all data and/or metadata in a
		filesystem, performing some operation on each btree node and
		extent. Returns a file descriptor which can be read from to get
		the current status of the job; closing the file descriptor
		(e.g. on process exit) stops the data job.
|
|
|
|
|
|
|
|
|
|
\item \texttt{BCH\_IOCTL\_SUBVOLUME\_CREATE} \\
|
|
|
|
|
\item \texttt{BCH\_IOCTL\_SUBVOLUME\_DESTROY} \\
|
|
|
|
|
\item \texttt{BCHFS\_IOC\_REINHERIT\_ATTRS} \\
|
|
|
|
|
\end{description}
|
|
|
|
|
|
|
|
|
|
\section{On disk format}
|
|
|
|
|
|
|
|
|
|
\subsection{Superblock}
|
|
|
|
|
|
|
|
|
|
The superblock is the first thing to be read when accessing a bcachefs
|
|
|
|
|
filesystem. It is located 4kb from the start of the device, with redundant
|
|
|
|
|
copies elsewhere - typically one immediately after the first superblock, and one
|
|
|
|
|
at the end of the device.
|
|
|
|
|
|
|
|
|
|
The \texttt{bch\_sb\_layout} records the amount of space reserved for the
|
|
|
|
|
superblock as well as the locations of all the superblocks. It is included with
|
|
|
|
|
every superblock, and additionally written 3584 bytes from the start of the
|
|
|
|
|
device (512 bytes before the first superblock).
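
The offsets above can be checked with a little arithmetic in 512 byte sectors.
The constant names in this sketch are illustrative only and are not taken from
the bcachefs headers.

```python
SECTOR_SIZE = 512

# Byte offsets from the start of the device, as stated in the text.
SB_OFFSET_BYTES = 4096         # first superblock
SB_LAYOUT_OFFSET_BYTES = 3584  # standalone copy of bch_sb_layout


def to_sector(offset_bytes):
    """Convert a byte offset to a sector number, rejecting unaligned values."""
    if offset_bytes % SECTOR_SIZE:
        raise ValueError("offset is not sector aligned")
    return offset_bytes // SECTOR_SIZE


# The layout sits exactly one sector before the first superblock.
gap_sectors = to_sector(SB_OFFSET_BYTES) - to_sector(SB_LAYOUT_OFFSET_BYTES)
```

With these values the first superblock starts at sector 8 and the standalone
layout at sector 7.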
|
|
|
|
|
|
|
|
|
|
Most of the superblock is identical across each device. The exceptions are the
|
|
|
|
|
\texttt{dev\_idx} field, and the journal section which gives the location of the
|
|
|
|
|
journal.
|
|
|
|
|
|
|
|
|
|
The main section of the superblock contains UUIDs, version numbers, number of
|
|
|
|
|
devices within the filesystem and device index, block size, filesystem creation
|
|
|
|
|
time, and various options and settings. The superblock also has a number of
|
|
|
|
|
variable length sections:
|
|
|
|
|
|
|
|
|
|
\begin{description}
|
|
|
|
|
\item \texttt{BCH\_SB\_FIELD\_journal} \\
|
|
|
|
|
List of buckets used for the journal on this device.
|
|
|
|
|
|
|
|
|
|
\item \texttt{BCH\_SB\_FIELD\_members} \\
|
|
|
|
|
List of member devices, as well as per-device options and
|
|
|
|
|
settings, including bucket size, number of buckets and time when
|
|
|
|
|
last mounted.
|
|
|
|
|
|
|
|
|
|
\item \texttt{BCH\_SB\_FIELD\_crypt} \\
|
|
|
|
|
Contains the main chacha20 encryption key, encrypted by the
|
|
|
|
|
user's passphrase, as well as key derivation function settings.
|
|
|
|
|
|
|
|
|
|
\item \texttt{BCH\_SB\_FIELD\_replicas} \\
|
|
|
|
|
Contains a list of replica entries, which are lists of devices
|
|
|
|
|
that have extents replicated across them.
|
|
|
|
|
|
|
|
|
|
\item \texttt{BCH\_SB\_FIELD\_quota} \\
|
|
|
|
|
Contains timelimit and warnlimit fields for each quota type
|
|
|
|
|
(user, group and project) and counter (space, inodes).
|
|
|
|
|
|
|
|
|
|
\item \texttt{BCH\_SB\_FIELD\_disk\_groups} \\
|
|
|
|
|
		Contains device labels, which were formerly referred to as disk
		groups (and still are throughout the code): this section stores
		the label strings and records the tree structure of label paths,
		allowing a label, once parsed, to be referred to by integer ID
		by the target options.
|
|
|
|
|
|
|
|
|
|
\item \texttt{BCH\_SB\_FIELD\_clean} \\
|
|
|
|
|
When the filesystem is clean, this section contains a list of
|
|
|
|
|
journal entries that are normally written with each journal
|
|
|
|
|
write (\texttt{struct jset}): btree roots, as well as filesystem
|
|
|
|
|
usage and read/write counters (total amount of data read/written
|
|
|
|
|
to this filesystem). This allows reading the journal to be
|
|
|
|
|
skipped after clean shutdowns.
|
|
|
|
|
\end{description}
|
|
|
|
|
|
|
|
|
|
\subsection{Journal}
|
|
|
|
|
|
|
|
|
|
Every journal write (\texttt{struct jset}) contains a list of entries:
|
|
|
|
|
\texttt{struct jset\_entry}. Below are listed the various journal entry types.
|
|
|
|
|
|
|
|
|
|
\begin{description}
|
|
|
|
|
\item \texttt{BCH\_JSET\_ENTRY\_btree\_key} \\
|
|
|
|
|
This entry type is used to record every btree update that
|
|
|
|
|
happens. It contains one or more btree keys (\texttt{struct
|
|
|
|
|
bkey}), and the \texttt{btree\_id} and \texttt{level} fields of
|
|
|
|
|
\texttt{jset\_entry} record the btree ID and level the key
|
|
|
|
|
belongs to.
|
|
|
|
|
|
|
|
|
|
\item \texttt{BCH\_JSET\_ENTRY\_btree\_root} \\
|
|
|
|
|
		This entry type is used for pointers to btree roots. In the current
|
|
|
|
|
implementation, every journal write still records every btree
|
|
|
|
|
root, although that is subject to change. A btree root is a bkey
|
|
|
|
|
of type \texttt{KEY\_TYPE\_btree\_ptr\_v2}, and the btree\_id
|
|
|
|
|
and level fields of \texttt{jset\_entry} record the btree ID and
|
|
|
|
|
depth.
|
|
|
|
|
|
|
|
|
|
\item \texttt{BCH\_JSET\_ENTRY\_clock} \\
Records IO time, not wall clock time: the amount of reads and
writes, in 512 byte sectors, since the filesystem was created.

\item \texttt{BCH\_JSET\_ENTRY\_usage} \\
Used for certain persistent counters: number of inodes, current
maximum key version, and sectors of persistent reservations.

\item \texttt{BCH\_JSET\_ENTRY\_data\_usage} \\
Stores replica entries with a usage counter, in sectors.

\item \texttt{BCH\_JSET\_ENTRY\_dev\_usage} \\
Stores usage counters for each device: sectors used and buckets
used, broken out by data type.
\end{description}
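
The entry list described at the start of this section can be walked with a
simple loop. The C sketch below assumes a simplified 8-byte entry header
(\texttt{struct sketch\_entry\_hdr}) followed by a payload whose size is given
in 64-bit words, not counting the header. The struct layout, the numeric type
values and all \texttt{sketch\_} names are hypothetical and are not taken from
the bcachefs headers. It also shows converting a clock value, which is counted
in 512 byte sectors, into bytes.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type values for this sketch only. */
enum sketch_entry_type {
    SKETCH_ENTRY_btree_key  = 0,
    SKETCH_ENTRY_btree_root = 1,
    SKETCH_ENTRY_clock      = 2,
};

/* Simplified entry header (assumed layout): one 64-bit word in total. */
struct sketch_entry_hdr {
    uint16_t u64s;     /* payload size in 64-bit words, header excluded */
    uint8_t  btree_id; /* used by btree_key and btree_root entries */
    uint8_t  level;
    uint8_t  type;     /* enum sketch_entry_type */
    uint8_t  pad[3];
};

_Static_assert(sizeof(struct sketch_entry_hdr) == sizeof(uint64_t),
               "header must occupy exactly one 64-bit word");

/*
 * Append one entry at word offset pos and return the new offset.
 * The caller must ensure buf has room for 1 + u64s more words.
 */
static size_t sketch_entry_append(uint64_t *buf, size_t pos, uint8_t type,
                                  uint8_t btree_id, uint8_t level,
                                  const uint64_t *payload, uint16_t u64s)
{
    struct sketch_entry_hdr h = {
        .u64s = u64s, .btree_id = btree_id, .level = level, .type = type,
    };

    memcpy(buf + pos, &h, sizeof(h));
    memcpy(buf + pos + 1, payload, (size_t) u64s * sizeof(uint64_t));
    return pos + 1 + u64s;
}

/*
 * Count the entries of the given type in buf[0..nr_u64s).
 * Returns -1 if an entry claims more words than remain in the buffer.
 */
static int sketch_count_entries(const uint64_t *buf, size_t nr_u64s,
                                uint8_t type)
{
    size_t pos = 0;
    int nr = 0;

    while (pos < nr_u64s) {
        struct sketch_entry_hdr h;
        size_t len;

        memcpy(&h, buf + pos, sizeof(h));
        len = 1 + (size_t) h.u64s; /* header word plus payload words */
        if (len > nr_u64s - pos)
            return -1;
        if (h.type == type)
            nr++;
        pos += len;
    }
    return nr;
}

/* IO time is counted in 512 byte sectors; convert a sector count to bytes. */
static uint64_t sketch_sectors_to_bytes(uint64_t sectors)
{
    return sectors << 9;
}
\end{verbatim}

Sizing every entry in 64-bit words means a reader can step over entry types it
does not understand, since the header alone says how far to advance.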
\subsection{Btrees}

\subsection{Btree keys}

\begin{description}
\item \texttt{KEY\_TYPE\_deleted}
\item \texttt{KEY\_TYPE\_whiteout}
\item \texttt{KEY\_TYPE\_error}
\item \texttt{KEY\_TYPE\_cookie}
\item \texttt{KEY\_TYPE\_hash\_whiteout}
\item \texttt{KEY\_TYPE\_btree\_ptr}
\item \texttt{KEY\_TYPE\_extent}
\item \texttt{KEY\_TYPE\_reservation}
\item \texttt{KEY\_TYPE\_inode}
\item \texttt{KEY\_TYPE\_inode\_generation}
\item \texttt{KEY\_TYPE\_dirent}
\item \texttt{KEY\_TYPE\_xattr}
\item \texttt{KEY\_TYPE\_alloc}
\item \texttt{KEY\_TYPE\_quota}
\item \texttt{KEY\_TYPE\_stripe}
\item \texttt{KEY\_TYPE\_reflink\_p}
\item \texttt{KEY\_TYPE\_reflink\_v}
\item \texttt{KEY\_TYPE\_inline\_data}
\item \texttt{KEY\_TYPE\_btree\_ptr\_v2}
\item \texttt{KEY\_TYPE\_indirect\_inline\_data}
\item \texttt{KEY\_TYPE\_alloc\_v2}
\item \texttt{KEY\_TYPE\_subvolume}
\item \texttt{KEY\_TYPE\_snapshot}
\item \texttt{KEY\_TYPE\_inode\_v2}
\item \texttt{KEY\_TYPE\_alloc\_v3}
\end{description}

\end{document}