From: Kent Overstreet <kent.overstreet@linux.dev>
To: linux-bcachefs@vger.kernel.org
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Subject: [PATCH 0/8] Runtime self healing for missing backpointers
Date: Sat, 17 May 2025 15:25:37 -0400
Message-ID: <20250517192547.3849149-1-kent.overstreet@linux.dev>

Since the 6.14 upgrade, there have been some reports of copygc spinning
without making any progress; it turns out it was attempting to evacuate
buckets with missing backpointers.

6.14 made backpointers fsck much faster, partly by deferring the "does
this backpointer point to a matching extent" checks until backpointers
are used. This means that bad backpointers that haven't yet been cleaned
up still count towards the per-bucket sum, so the "sum up backpointers
in a bucket to check if we need to scan for missing backpointers" code
can fail to detect missing backpointers.

Doh.

The solution is to re-run the "do backpointers in a bucket sum up to the
bucket sector counts?" checks _after_ we've walked backpointers in a
bucket at runtime, in codepaths that use backpointer_get_key() - e.g.
move_data_phys. By then backpointer_get_key() will have cleaned up bad
backpointers, so if there are any missing backpointers we'll be able to
detect them reliably.

Note that the "do backpointers sum up to bucket sector counts" check is
quite cheap to do at runtime, particularly here where we've just touched
these keys.
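
As a rough illustration of why this check is cheap, here is a minimal
sketch in plain C. The toy_* types and the function are hypothetical
stand-ins, not the bcachefs structures or the code in this series, and
the sketch compares against a single sector counter only, which is an
assumption made for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the bcachefs on-disk types. */
struct toy_backpointer {
	uint64_t bucket;	/* bucket this backpointer belongs to */
	uint32_t sectors;	/* sectors of the extent within that bucket */
};

struct toy_bucket {
	uint64_t nr;		/* bucket number */
	uint32_t dirty_sectors;	/* sectors the bucket itself claims are in use */
};

/*
 * Sum the sectors covered by the backpointers for bucket @b and compare
 * against the bucket's own counter. Any difference means backpointers
 * are missing (or extra) and a targeted scan is needed.
 */
static bool toy_bucket_backpointer_mismatch(const struct toy_bucket *b,
					    const struct toy_backpointer *bps,
					    size_t nr_bps)
{
	uint64_t sum = 0;

	assert(b != NULL);
	assert(bps != NULL || nr_bps == 0);

	for (size_t i = 0; i < nr_bps; i++)
		if (bps[i].bucket == b->nr)
			sum += bps[i].sectors;

	return sum != b->dirty_sectors;
}
```

The check is one pass over keys the caller has just walked, so it adds
no extra lookups beyond reading the bucket's counters.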

Then, if we do detect missing backpointers, we can kick off
check_extents_to_backpointers. This is also much cheaper than it used to
be, since the scan only walks extent -> backpointer for buckets that we
know have missing backpointers.
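
The "only walk extents for flagged buckets" idea can be sketched with a
per-device bitmap of mismatched buckets. This is an assumption-laden toy
in plain C: the toy_* names, the fixed bucket count, and the flat extent
array are all hypothetical, and the real pass walks btrees rather than
an array.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_BUCKETS		256
#define TOY_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical bitmap: one bit per bucket known to have a mismatch. */
struct toy_bucket_bitmap {
	unsigned long bits[(TOY_NR_BUCKETS + TOY_BITS_PER_LONG - 1) /
			   TOY_BITS_PER_LONG];
};

struct toy_extent {
	uint64_t bucket;	/* bucket the extent's data lives in */
	uint32_t sectors;
};

static void toy_bitmap_set(struct toy_bucket_bitmap *m, uint64_t bucket)
{
	assert(bucket < TOY_NR_BUCKETS);
	m->bits[bucket / TOY_BITS_PER_LONG] |=
		1UL << (bucket % TOY_BITS_PER_LONG);
}

static bool toy_bitmap_test(const struct toy_bucket_bitmap *m, uint64_t bucket)
{
	if (bucket >= TOY_NR_BUCKETS)
		return false;
	return ((m->bits[bucket / TOY_BITS_PER_LONG] >>
		 (bucket % TOY_BITS_PER_LONG)) & 1) != 0;
}

/*
 * Count the extents a targeted scan would have to check: only those
 * whose bucket is flagged in the bitmap; everything else is skipped.
 */
static size_t toy_extents_to_check(const struct toy_bucket_bitmap *m,
				   const struct toy_extent *extents,
				   size_t nr)
{
	size_t n = 0;

	for (size_t i = 0; i < nr; i++)
		if (toy_bitmap_test(m, extents[i].bucket))
			n++;
	return n;
}
```

With few buckets flagged, most extents fail the bitmap test and are
skipped without any backpointer lookup.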

Most of the patch series is the new infrastructure for running recovery
passes automatically and asynchronously. We'll be using this for more
things in the future, as other recovery passes become online passes -
this is one of the last missing pieces for "full runtime self healing
from anything".

To avoid unfortunate situations where something is triggering a recovery
pass continuously, there's ratelimiting, using the new superblock
section for "runtime, time of last run" for each recovery pass. Copygc
doesn't need to be able to evacuate any given bucket right away, so it
can wait until a moderate percentage of buckets have missing
backpointers before kicking off the pass.
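
One way to picture the ratelimiting decision is the sketch below, in
plain C. Everything here is hypothetical: the toy_* structs, the field
names, the units, and the threshold and interval are illustrative
assumptions, not values or layouts taken from this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-pass state, standing in for "time of last run". */
struct toy_pass_state {
	uint64_t last_run_time;	/* seconds; 0 means the pass never ran */
};

/* Hypothetical policy knobs. */
struct toy_ratelimit {
	uint64_t min_interval;	/* minimum seconds between automatic runs */
	unsigned threshold_pct;	/* percent of buckets that must be affected */
};

/*
 * Decide whether to kick off the pass now: enough buckets must be
 * affected, and the previous run must be at least min_interval old.
 */
static bool toy_should_run_pass(const struct toy_pass_state *s,
				const struct toy_ratelimit *rl,
				uint64_t now,
				uint64_t nr_bad, uint64_t nr_total)
{
	assert(nr_total > 0 && nr_bad <= nr_total);

	/* Not enough affected buckets yet: the caller can keep waiting. */
	if (nr_bad * 100 < (uint64_t) rl->threshold_pct * nr_total)
		return false;

	/* Ran too recently: don't let a persistent trigger spin the pass. */
	if (s->last_run_time &&
	    now >= s->last_run_time &&
	    now - s->last_run_time < rl->min_interval)
		return false;

	return true;
}
```

Keeping the last-run time in persistent state means the limit still
applies across a remount, which is the point of storing it in the
superblock.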

Kent Overstreet (8):
  bcachefs: struct bch_fs_recovery
  bcachefs: __bch2_run_recovery_passes()
  bcachefs: Reduce usage of recovery.curr_pass
  bcachefs: bch2_recovery_pass_status_to_text()
  bcachefs: bch2_run_explicit_recovery_pass() cleanup
  bcachefs: Run recovery passes asynchronously
  bcachefs: Improve bucket_bitmap code
  bcachefs: bch2_check_bucket_backpointer_mismatch()

 fs/bcachefs/alloc_background.c      |  10 +-
 fs/bcachefs/alloc_foreground.c      |   6 +-
 fs/bcachefs/backpointers.c          | 197 +++++++++----
 fs/bcachefs/backpointers.h          |  10 +-
 fs/bcachefs/bcachefs.h              |  23 +-
 fs/bcachefs/btree_cache.c           |   2 +-
 fs/bcachefs/btree_io.c              |   7 +-
 fs/bcachefs/btree_node_scan.c       |   2 +-
 fs/bcachefs/btree_update_interior.c |   2 +-
 fs/bcachefs/buckets.c               |  33 +--
 fs/bcachefs/errcode.h               |   1 -
 fs/bcachefs/error.c                 |   2 +-
 fs/bcachefs/fsck.c                  |   9 +-
 fs/bcachefs/move.c                  |  21 +-
 fs/bcachefs/movinggc.c              |  11 +-
 fs/bcachefs/rebalance.c             |   2 +-
 fs/bcachefs/recovery.c              |  46 ++-
 fs/bcachefs/recovery_passes.c       | 431 ++++++++++++++++++----------
 fs/bcachefs/recovery_passes.h       |  23 +-
 fs/bcachefs/recovery_passes_types.h |  27 ++
 fs/bcachefs/sb-members.c            |   4 +-
 fs/bcachefs/snapshot.c              |   4 +-
 fs/bcachefs/subvolume.c             |   6 +-
 fs/bcachefs/super.c                 |  10 +-
 fs/bcachefs/sysfs.c                 |   6 +
 25 files changed, 579 insertions(+), 316 deletions(-)
 create mode 100644 fs/bcachefs/recovery_passes_types.h

-- 
2.49.0

