Rework the read path so that BCH_READ_NODECODE reads now also self-heal
after a read error and a successful retry - prerequisite for scrub.
- __bch2_read_endio() now handles a read that's both BCH_READ_NODECODE
and a bounce.
Normally, we don't want a BCH_READ_NODECODE read to ever allocate a
split bch_read_bio: we want to maintain the relationship between the
bch_read_bio and the data_update it's embedded in.
But correcting read errors requires allocating a split/bounce rbio
that's embedded in a promote_op. We do still have a 1-1 relationship,
i.e. we only allocate a single split/bounce if it's a
BCH_READ_NODECODE, so things hopefully don't get too crazy.
- __bch2_read_extent() now is allowed to allocate the promote_op for
rewriting after a failed read, even if it's BCH_READ_NODECODE.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
we don't want to block completion of the read - starting a promote calls
into the write path, which will block.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a data update doesn't want to block on allocations (promotes, self
healing on read error) - check if the allocation would fail before
kicking off the data update and calling into the write path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Initialize the write op first, so that in the next patch we can check if
the allocator would block (for BCH_WRITE_alloc_nowait ops) and bail out
before taking nocow locks/dev refs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a drive is failing and we're moving data off of it, we can't
necessairly depend on capacity/disk reservation calculations to avoid
deadlocking/blocking on the allocator.
And, we don't want to queue up infinite self healing moves anyways.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Promotes, like most other internal moves, should only go to the
specified target and not fall back to allocating from the full
filesystem.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that data_update embeds bch_read_bio, BCH_READ_NODECODE means that
the read is embedded in a a data_update - and we can check in the retry
path if the extent has changed and bail out.
This likely fixes some subtle bugs with read errors and data moves.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Read retries are done synchronously, so we definitely shouldn't be
holding any locks (even the srcu lock for btree key cache reclaim) when
submitting the IO.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BTREE_ITER_cached_nofill has some tricky corner cases; it's used
internally for iterators that aren't walking the key cache, but need to
be coherent with the key cache.
It tells traverse to look up and lock the key cache entry if present,
but don't create one if it doesn't exist.
That means we have to have a BTREE_ITER_UPTODATE path (because after
traverse the path has to be UPTODATE, or we pop assertions) that doesn't
point to anything (which is the less bad option, taken by the previous
fix).
The previous fix for this path missed an issue that can happen in
bch2_trans_peek_key_cache(): we can't set should_be_locked on a path
that doesn't point to anything and doesn't hold locks.
Fixes: bd5b09727f ("bcachefs: Don't set btree_path to updtodate if we don't fill")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We've got a problem with bch_stripe that is going to take an on disk
format rev to fix - we can't access the block sector counts if the
checksum type is unknown.
Document it for now, there are a few other things to fix as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The transaction is going to abort, so there will be no cycle involving
this transaction anymore.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When the cycle doesn't involve the initiator of the cycle detection,
we might choose a transaction that is not involved in the cycle to abort.
It shouldn't be that since it won't break the cycle, this patch
therefore chooses the transaction in the cycle to abort.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch introduces a helper function called lock_graph_pop_from,
it pops the graph from i.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If the transaction chose itself as a victim before and restarted, it
might request a no fail lock request this time. But it might be added to
others' lock graph and be chose as the victim again, it's no longer safe
without additional check. We can also convert the cycle detector to be
fully RCU-based to solve that unsoundness, but the latency added to trans_put
and additional memory required may not worth it.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If the lock has been acquired and unlocked, we don't have to do clear
and wakeup again, though harmless since we hold the intent lock. Merge
the condition might be clearer.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This reverts commit 62448afee7.
six_lock_tryupgrade fails only if there is an intent lock held,
it won't fail no matter how many read locks are held.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds another metadata version for accounting directory size.
For the new version of the filesystem, when new subdirectory items
are created or deleted, the parent directory's size will change
accordingly. For the old version of the existed file system, running
fsck will automatically upgrade the metadata version, and it will
do the check and recalculationg of the directory size.
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The isize of directory is 0 in bcachefs if the directory is empty.
With more child dirents created, its size ought to change. Many
other filesystems changed as that (ie. xfs and btrfs). And many of
them changed as the size of child dirent name. Although the directory
size may not seem to convey much, we can still give it some meaning.
The formula of dentry size as follow:
occupied_size = 40 + ALIGN(9 + namelen, 8)
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
check_unreachable_inodes does work in online mode, with the one caveat
that it assumes check_dirents has also run - and check_dirents is not
PASS_ONLINE yet.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Factor out a version of bch2_btree_pos_to_text() that doesn't take a
pointer to a in-memory btree node, to be used for btree node scrub.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
New helper for dropping all write locks; which is distinct from the
helper the transaction commit path uses, which is faster and only
touches updates.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is needed for the interior update locking rework, where we'll be
holding node write locks for the duration of the update - which is
needed for synchronizing with online check_allocations.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>