bcachefs-tools/docs/debugging.rst
kenneth topp a539b33911
first cut
2022-11-01 23:30:19 -04:00

305 lines
10 KiB
ReStructuredText
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

Debugging tools
===============
Sysfs interface
---------------
Mounted filesystems are available in sysfs at
``/sys/fs/bcachefs/<uuid>/`` with various options, performance counters
and internal debugging aids.
.. _options-1:
Options
~~~~~~~
| Filesystem options may be viewed and changed via
| ``/sys/fs/bcachefs/<uuid>/options/``, and settings changed via sysfs
will be persistently changed in the superblock as well.
Time stats
~~~~~~~~~~
bcachefs tracks the latency and frequency of various operations and
events, with quantiles for latency/duration in the
``/sys/fs/bcachefs/<uuid>/time_stats/`` directory.
.. container:: description
| ``blocked_allocate``
| Tracks when allocating a bucket must wait because none are
immediately available, meaning the copygc thread is not keeping up
with evacuating mostly empty buckets or the allocator thread is not
keeping up with invalidating and discarding buckets.
| ``blocked_allocate_open_bucket``
| Tracks when allocating a bucket must wait because all of our
handles for pinning open buckets are in use (we statically allocate
1024).
| ``blocked_journal``
| Tracks when getting a journal reservation must wait, either because
journal reclaim isnt keeping up with reclaiming space in the
journal, or because journal writes are taking too long to complete
and we already have too many in flight.
| ``btree_gc``
| Tracks when the btree_gc code must walk the btree at runtime - for
recalculating the oldest outstanding generation number of every
bucket in the btree.
``btree_lock_contended_read``
``btree_lock_contended_intent``
| ``btree_lock_contended_write``
| Track when taking a read, intent or write lock on a btree node must
block.
| ``btree_node_mem_alloc``
| Tracks the total time to allocate memory in the btree node cache
for a new btree node.
| ``btree_node_split``
| Tracks btree node splits - when a btree node becomes full and is
split into two new nodes
| ``btree_node_compact``
| Tracks btree node compactions - when a btree node becomes full and
needs to be compacted on disk.
| ``btree_node_merge``
| Tracks when two adjacent btree nodes are merged.
| ``btree_node_sort``
| Tracks sorting and resorting entire btree nodes in memory, either
after reading them in from disk or for compacting prior to creating
a new sorted array of keys.
| ``btree_node_read``
| Tracks reading in btree nodes from disk.
| ``btree_interior_update_foreground``
| Tracks foreground time for btree updates that change btree topology
- i.e. btree node splits, compactions and merges; the duration
measured roughly corresponds to lock held time.
| ``btree_interior_update_total``
| Tracks time to completion for topology changing btree updates;
first they have a foreground part that updates btree nodes in
memory, then after the new nodes are written there is a transaction
phase that records an update to an interior node or a new btree
root as well as changes to the alloc btree.
| ``data_read``
| Tracks the core read path - looking up a request in the extents
(and possibly also reflink) btree, allocating bounce buffers if
necessary, issuing reads, checksumming, decompressing, decrypting,
and delivering completions.
| ``data_write``
| Tracks the core write path - allocating space on disk for a new
write, allocating bounce buffers if necessary, compressing,
encrypting, checksumming, issuing writes, and updating the extents
btree to point to the new data.
| ``data_promote``
| Tracks promote operations, which happen when a read operation
writes an additional cached copy of an extent to
``promote_target``. This is done asynchronously from the original
read.
| ``journal_flush_write``
| Tracks writing of flush journal entries to disk, which first issue
cache flush operations to the underlying devices then issue the
journal writes as FUA writes. Time is tracked starting from after
all journal reservations have released their references or the
completion of the previous journal write.
| ``journal_noflush_write``
| Tracks writing of non-flush journal entries to disk, which do not
issue cache flushes or FUA writes.
| ``journal_flush_seq``
| Tracks time to flush a journal sequence number to disk by
filesystem sync and fsync operations, as well as the allocator
prior to reusing buckets when none that do not need flushing are
available.
Internals
~~~~~~~~~
.. container:: description
| ``btree_cache``
| Shows information on the btree node cache: number of cached nodes,
number of dirty nodes, and whether the cannibalize lock (for
reclaiming cached nodes to allocate new nodes) is held.
| ``dirty_btree_nodes``
| Prints information related to the interior btree node update
machinery, which is responsible for ensuring dependent btree node
writes are ordered correctly.
For each dirty btree node, prints:
- Whether the ``need_write`` flag is set
- The level of the btree node
- The number of sectors written
- Whether writing this node is blocked, waiting for other nodes to
be written
- Whether it is waiting on a btree_update to complete and make it
reachable on-disk
| ``btree_key_cache``
| Prints infromation on the btree key cache: number of freed keys
(which must wait for a sRCU barrier to complete before being
freed), number of cached keys, and number of dirty keys.
| ``btree_transactions``
| Lists each running btree transactions that has locks held, listing
which nodes they have locked and what type of lock, what node (if
any) the process is blocked attempting to lock, and where the btree
transaction was invoked from.
| ``btree_updates``
| Lists outstanding interior btree updates: the mode (nothing updated
yet, or updated a btree node, or wrote a new btree root, or was
reparented by another btree update), whether its new btree nodes
have finished writing, its embedded closures refcount (while
nonzero, the btree update is still waiting), and the pinned journal
sequence number.
| ``journal_debug``
| Prints a variety of internal journal state.
``journal_pins`` Lists items pinning journal entries, preventing them
from being reclaimed.
| ``new_stripes``
| Lists new erasure-coded stripes being created.
| ``stripes_heap``
| Lists erasure-coded stripes that are available to be reused.
| ``open_buckets``
| Lists buckets currently being written to, along with data type and
refcount.
| ``io_timers_read``
| ``io_timers_write``
| Lists outstanding IO timers - timers that wait on total reads or
writes to the filesystem.
| ``trigger_journal_flush``
| Echoing to this file triggers a journal commit.
| ``trigger_gc``
| Echoing to this file causes the GC code to recalculate each
buckets oldest_gen field.
| ``prune_cache``
| Echoing to this file prunes the btree node cache.
| ``read_realloc_races``
| This counts events where the read path reads an extent and
discovers the bucket that was read from has been reused while the
IO was in flight, causing the read to be retried.
| ``extent_migrate_done``
| This counts extents moved by the core move path, used by copygc and
rebalance.
| ``extent_migrate_raced``
| This counts extents that the move path attempted to move but no
longer existed when doing the final btree update.
Unit and performance tests
~~~~~~~~~~~~~~~~~~~~~~~~~~
Echoing into ``/sys/fs/bcachefs/<uuid>/perf_test`` runs various low
level btree tests, some intended as unit tests and others as performance
tests. The syntax is
::
echo <test_name> <nr_iterations> <nr_threads> > perf_test
When complete, the elapsed time will be printed in the dmesg log. The
full list of tests that can be run can be found near the bottom of
``fs/bcachefs/tests.c``.
Debugfs interface
-----------------
The contents of every btree, as well as various internal per-btree-node
information, are available under ``/sys/kernel/debug/bcachefs/<uuid>/``.
For every btree, we have the following files:
.. container:: description
| *btree_name*
| Entire btree contents, one key per line
| *btree_name*\ ``-formats``
| Information about each btree node: the size of the packed bkey
format, how full each btree node is, number of packed and unpacked
keys, and number of nodes and failed nodes in the in-memory search
trees.
| *btree_name*\ ``-bfloat-failed``
| For each sorted set of keys in a btree node, we construct a binary
search tree in eytzinger layout with compressed keys. Sometimes we
arent able to construct a correct compressed search key, which
results in slower lookups; this file lists the keys that resulted
in these failed nodes.
Listing and dumping filesystem metadata
---------------------------------------
bcachefs show-super
~~~~~~~~~~~~~~~~~~~
This subcommand is used for examining and printing bcachefs superblocks.
It takes two optional parameters:
.. container:: description
``-l``: Print superblock layout, which records the amount of space
reserved for the superblock and the locations of the backup
superblocks.
``-f, fields=(fields)``: List of superblock sections to print,
``all`` to print all sections.
bcachefs list
~~~~~~~~~~~~~
This subcommand gives access to the same functionality as the debugfs
interface, listing btree nodes and contents, but for offline
filesystems.
bcachefs list_journal
~~~~~~~~~~~~~~~~~~~~~
This subcommand lists the contents of the journal, which primarily
records btree updates ordered by when they occured.
bcachefs dump
~~~~~~~~~~~~~
This subcommand can dump all metadata in a filesystem (including multi
device filesystems) as qcow2 images: when encountering issues that
``fsck`` can not recover from and need attention from the developers,
this makes it possible to send the developers only the required
metadata. Encrypted filesystems must first be unlocked with
``bcachefs remove-passphrase``.