From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: [PATCH PoC v2 00/10] btrfs: scrub: introduce a new family of ioctl, scrub_fs
Date: Wed, 28 Sep 2022 16:35:37 +0800
Message-ID: <cover.1664353497.git.wqu@suse.com>

[CHANGELOG]
POC v2:
- Move the per-stripe verification to the endio function
  This improves performance: my previous testing shows a Ryzen 5900X
  can only achieve ~2GiB/s of CRC32 per core, so the old verification
  in the main thread can become a bottleneck for fast storage (a single
  fast NVMe device can read well above 2GiB/s).

- Add repair (writeback) support
  Corrupted sectors which still have a good copy in the same vertical
  stripe can now be repaired by writing the good copy back.

- Change stat::data_nocsum_uncertain to stat::data_nocsum
  The main problem here is that we have no way to distinguish
  preallocated extents from real NODATASUM extents (without doing a
  complex backref walk and prealloc checks).

  Thus comparing the NODATASUM ranges inside the same vertical stripe
  doesn't make that much sense.

  So here we just report how many bytes don't have a csum.

[BACKGROUND]
Besides the write-hole problem of RAID56, the existing scrub is not
RAID56 friendly either, in the following aspects:

- Extra IO for RAID56 scrub
  Currently the data stripes of RAID56 can be read 2x (RAID5) or 3x (RAID6).

  This is caused by the fact we do one-thread per-device scrub.

  Dev 1    |  Data 1  | P(3 + 4) |
  Dev 2    |  Data 2  |  Data 3  |
  Dev 3    | P(1 + 2) |  Data 4  |

  When scrubbing Dev 1, we will read Data 1 (treated no differently than
  SINGLE), then read Parity (3 + 4).
  But to determine if Parity (3 + 4) is correct, we have to read Data 3
  and Data 4.

  On the other hand, Data 3 will also be read by scrubbing Dev 2,
  and Data 4 will also be read by scrubbing Dev 3.

  Thus all data stripes will be read twice, causing a slowdown in RAID56
  scrubbing (see the sketch after this list).

- No proper progress report for P/Q stripes
  The scrub_progress has no member for P/Q error reporting at all.

  Thus even if we fixed some P/Q errors, they will not be reported at all.
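
To put a number on that read amplification, here is a small user-space
illustration (not kernel code; the layout table simply mirrors the
3-device RAID5 diagram above) which counts how often each data stripe
gets read when every device is scrubbed independently, versus once per
vertical stripe as scrub_fs does:

#include <stdio.h>

/*
 * Illustration only: model the 3-device RAID5 layout from the diagram
 * above and count reads per data stripe.
 */
int main(void)
{
	/* layout[device][vertical stripe]: 1..4 = Data 1..4, 0 = parity */
	const int layout[3][2] = {
		{ 1, 0 },	/* Dev 1: Data 1,   P(3 + 4) */
		{ 2, 3 },	/* Dev 2: Data 2,   Data 3   */
		{ 0, 4 },	/* Dev 3: P(1 + 2), Data 4   */
	};
	int per_device[5] = { 0 };

	for (int dev = 0; dev < 3; dev++) {
		for (int vs = 0; vs < 2; vs++) {
			if (layout[dev][vs]) {
				/* Data stripe: read directly, like SINGLE. */
				per_device[layout[dev][vs]]++;
			} else {
				/*
				 * Parity: also read the other data stripes
				 * of this vertical stripe to verify it.
				 */
				for (int d = 0; d < 3; d++)
					if (layout[d][vs])
						per_device[layout[d][vs]]++;
			}
		}
	}
	for (int s = 1; s <= 4; s++)
		printf("Data %d: read %d times by per-device scrub, once by scrub_fs\n",
		       s, per_device[s]);
	return 0;
}

Compiled and run, it reports every data stripe being read twice by the
per-device scheme, which is exactly the amplification scrub_fs avoids.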

To address the above problems, this patchset introduces a new family of
ioctls, the scrub_fs ioctls.

[CORE DESIGN]
The new scrub_fs ioctl goes block group by block group to scrub the
full fs.

Inside each block group, we use BTRFS_STRIPE_LEN as the scrub unit (it
will be enlarged later to improve parallelism for RAID0/10).

Then we read the full BTRFS_STRIPE_LEN bytes from each mirror (if there
is an extent inside the range).

The read bios will be submitted to each device at once, so we can still
take advantage of parallel IOs.

The initial per-copy verification (csum checking) is now done in the
read endio functions (see the changelog above), while the later stage
(comparing the copies and deciding what to repair) still happens in the
scrub thread.
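
As a rough user-space-style sketch of that control flow (only the
scrub_fs_iterate_bgs() and scrub_fs_block_group() names are taken from
the patches below; the types and the read/verify stub are purely
illustrative, not the patchset's actual code):

#include <stddef.h>

#define BTRFS_STRIPE_LEN	(64 * 1024)	/* the scrub unit used by scrub_fs */

/* Illustrative stand-in; the real code works on struct btrfs_block_group. */
struct example_bg {
	unsigned long long start;	/* logical bytenr of the block group */
	unsigned long long length;
	int nr_copies;			/* e.g. 2 for RAID1, 1 for SINGLE */
};

/* Read one BTRFS_STRIPE_LEN range of one copy; csums are checked at endio time. */
static void read_and_verify_one_copy(struct example_bg *bg,
				     unsigned long long logical, int copy)
{
	(void)bg; (void)logical; (void)copy;
}

/* Placeholder matching patch 04: scrub one block group, one stripe unit at a time. */
static void scrub_fs_block_group(struct example_bg *bg)
{
	for (unsigned long long off = 0; off < bg->length; off += BTRFS_STRIPE_LEN) {
		/* Skip the unit if it contains no extent (not modelled here). */

		/* Submit the reads for every copy at once, then wait for all of them. */
		for (int copy = 0; copy < bg->nr_copies; copy++)
			read_and_verify_one_copy(bg, bg->start + off, copy);

		/* Later stage: compare the copies and write back repairs if needed. */
	}
}

/* Placeholder matching patch 03: iterate block groups, at most one read-only at a time. */
static void scrub_fs_iterate_bgs(struct example_bg *bgs, size_t nr_bgs)
{
	for (size_t i = 0; i < nr_bgs; i++)
		scrub_fs_block_group(&bgs[i]);
}

int main(void)
{
	struct example_bg bgs[] = { { 1ULL << 30, 256ULL << 20, 2 } };

	scrub_fs_iterate_bgs(bgs, 1);
	return 0;
}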

Also this ioctl family relies on a much larger progress structure; it is
padded to 256 bytes and has room for parity-specific error reporting
(not yet implemented though).
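
The actual layout is defined by the UAPI changes in patch 01; the sketch
below is only meant to illustrate the 256-byte padding, and apart from
data_nocsum (mentioned in the changelog) the field names are made up:

#include <stdint.h>

struct scrub_fs_stat_example {
	uint64_t data_scrubbed;		/* data bytes scrubbed */
	uint64_t meta_scrubbed;		/* metadata bytes scrubbed */
	uint64_t data_nocsum;		/* data bytes without csum (NODATASUM/prealloc) */
	uint64_t data_csum_mismatch;	/* data bytes failing csum verification */
	uint64_t meta_mismatch;		/* metadata bytes failing verification */
	uint64_t parity_scrubbed;	/* parity bytes scrubbed (RAID56, future) */
	uint64_t parity_mismatch;	/* parity bytes mismatching (RAID56, future) */

	/* Pad the whole structure to 256 bytes for future extensions. */
	uint8_t  reserved[256 - 7 * sizeof(uint64_t)];
};

_Static_assert(sizeof(struct scrub_fs_stat_example) == 256,
	       "scrub_fs stat/progress structure is padded to 256 bytes");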

[THE GOOD]
- Every stripe will be iterated at most once
  No double read for data stripes.

- Better error reports for parity mismatch

- No need for complex bio form shaping
  Since we already submit read bios in BTRFS_STRIPE_LEN units and wait
  for them to finish, there are at most nr_copies bios in flight.
  (For the later RAID0/10 optimization it will be nr_stripes.)

  This behavior naturally reduces the IOPS usage, thus there is no need
  to do any bio form shaping.

  This greatly reduces the code size; just check how much code is spent
  on bio form shaping in the old scrub code.

- Fewer block groups marked read-only
  Now there is at most one block group marked read-only for scrub,
  reducing the possibility of ENOSPC during scrub.

[THE BAD]
- Slower for the SINGLE profile
  If someone is using the SINGLE profile across multiple devices,
  scrub_fs will be slower:

  Dev 1:   | SINGLE BG 1 |
  Dev 2:   | SINGLE BG 2 |
  Dev 3:   | SINGLE BG 3 |

  The existing scrub code will scrub SINGLE BG 1~3 at the same time,
  but the new scrub_fs will scrub SINGLE BG 1 first, then 2, then 3,
  resulting in a much slower scrub for such a setup.

  Although I'd argue that, for the above case, the user should go RAID0
  anyway.

[THE UGLY]
Since this is just a proof-of-concept patchset, it lacks the following
functionality/optimization:

- Slower RAID0/RAID10 scrub
  Since we only scrub one BTRFS_STRIPE_LEN unit at a time, scrub_fs will
  not utilize all the devices of RAID0/10.
  This can be easily enhanced later by enlarging the scrub unit to a
  full stripe, though.

- No RAID56 support
  Ironically.

- Very basic btrfs-progs support
  It really only calls the ioctl and prints the result (see the
  user-space sketch after this list).
  No background scrub or scrub status file support.

- No drop-in full fstests run yet
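
For reference, the user-space side can be as small as the sketch below.
The BTRFS_IOC_SCRUB_FS name comes from patch 01, but the request number,
argument structure and its fields here are placeholders for illustration
only, not the real UAPI added by the patchset:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Placeholder argument structure: the real one is added to
 * include/uapi/linux/btrfs.h by patch 01 and looks different.
 */
struct example_scrub_fs_args {
	unsigned long long flags;
	unsigned char progress[256];	/* opaque, 256-byte padded progress/stat area */
};

/* Placeholder request number: 0x94 is the btrfs ioctl magic, the 255 is made up. */
#define EXAMPLE_IOC_SCRUB_FS	_IOWR(0x94, 255, struct example_scrub_fs_args)

int main(int argc, char **argv)
{
	struct example_scrub_fs_args args;
	int fd, ret;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(&args, 0, sizeof(args));
	ret = ioctl(fd, EXAMPLE_IOC_SCRUB_FS, &args);
	if (ret < 0)
		perror("ioctl");
	/* A real tool would decode and print the returned progress/stat area here. */
	close(fd);
	return ret < 0 ? 1 : 0;
}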


Qu Wenruo (10):
  btrfs: introduce BTRFS_IOC_SCRUB_FS family of ioctls
  btrfs: scrub: introduce place holder for btrfs_scrub_fs()
  btrfs: scrub: introduce a place holder helper scrub_fs_iterate_bgs()
  btrfs: scrub: introduce place holder helper scrub_fs_block_group()
  btrfs: scrub: add helpers to fulfill csum/extent_generation
  btrfs: scrub: submit and wait for the read of each copy
  btrfs: scrub: implement metadata verification code for scrub_fs
  btrfs: scrub: implement data verification code for scrub_fs
  btrfs: scrub: implement the later stage of verification
  btrfs: scrub: implement the repair (writeback) functionality

 fs/btrfs/ctree.h           |    4 +
 fs/btrfs/disk-io.c         |   83 ++-
 fs/btrfs/disk-io.h         |    2 +
 fs/btrfs/ioctl.c           |   45 ++
 fs/btrfs/scrub.c           | 1372 ++++++++++++++++++++++++++++++++++++
 include/uapi/linux/btrfs.h |  174 +++++
 6 files changed, 1654 insertions(+), 26 deletions(-)

-- 
2.37.3


Thread overview: 15+ messages
2022-09-28  8:35 Qu Wenruo [this message]
2022-09-28  8:35 ` [PATCH PoC v2 01/10] btrfs: introduce BTRFS_IOC_SCRUB_FS family of ioctls Qu Wenruo
2022-12-03 15:10   ` li zhang
2022-12-03 23:09     ` Qu Wenruo
2022-09-28  8:35 ` [PATCH PoC v2 02/10] btrfs: scrub: introduce place holder for btrfs_scrub_fs() Qu Wenruo
2022-09-28  8:35 ` [PATCH PoC v2 03/10] btrfs: scrub: introduce a place holder helper scrub_fs_iterate_bgs() Qu Wenruo
2022-09-28  8:35 ` [PATCH PoC v2 04/10] btrfs: scrub: introduce place holder helper scrub_fs_block_group() Qu Wenruo
2022-09-28 14:30   ` Wang Yugui
2022-09-28 22:54     ` Qu Wenruo
2022-09-28  8:35 ` [PATCH PoC v2 05/10] btrfs: scrub: add helpers to fulfill csum/extent_generation Qu Wenruo
2022-09-28  8:35 ` [PATCH PoC v2 06/10] btrfs: scrub: submit and wait for the read of each copy Qu Wenruo
2022-09-28  8:35 ` [PATCH PoC v2 07/10] btrfs: scrub: implement metadata verification code for scrub_fs Qu Wenruo
2022-09-28  8:35 ` [PATCH PoC v2 08/10] btrfs: scrub: implement data " Qu Wenruo
2022-09-28  8:35 ` [PATCH PoC v2 09/10] btrfs: scrub: implement the later stage of verification Qu Wenruo
2022-09-28  8:35 ` [PATCH PoC v2 10/10] btrfs: scrub: implement the repair (writeback) functionality Qu Wenruo
