From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: [PATCH PoC v2 00/10] btrfs: scrub: introduce a new family of ioctl, scrub_fs
Date: Wed, 28 Sep 2022 16:35:37 +0800
Message-ID: <cover.1664353497.git.wqu@suse.com>
[CHANGELOG]
POC v2:
- Move the per-stripe verification to the endio function
This improves performance: my previous testing shows a Ryzen 5900X
can only achieve ~2GiB/s of CRC32 per core, so the old verification
in the main thread can become a bottleneck on fast storage.
(A minimal sketch of the idea follows this changelog.)
- Add repair (writeback) support
Corrupted sectors which have a good copy in the same vertical
stripe can now be written back to repair the bad copy.
- Change stat::data_nocsum_uncertain to stat::data_nocsum
The main problem here is that we have no way to distinguish
preallocated extents from real NODATASUM extents (without doing
complex backref walking and prealloc checks).
Thus comparing the NODATASUM ranges inside the same vertical stripe
doesn't make that much sense.
So here we just report how many bytes don't have a csum.
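
To illustrate the first change above, here is a minimal sketch of what
verification at endio time can look like. All names below (the
scrub_fs_stripe structure, scrub_fs_verify_one_copy(), the state bit)
are hypothetical illustrations, not the actual patch code:

static void scrub_fs_read_endio(struct bio *bio)
{
	/* Hypothetical per-stripe context, carried via bi_private. */
	struct scrub_fs_stripe *stripe = bio->bi_private;

	if (bio->bi_status) {
		/* Record the IO error; repair is decided later. */
		set_bit(SCRUB_FS_STRIPE_IOERR, &stripe->state);
	} else {
		/*
		 * Do the csum check right here, in the per-bio completion
		 * context, instead of serializing all checks in the main
		 * scrub thread.  (In practice this may need to be punted
		 * to a workqueue if it can not run in atomic context.)
		 */
		scrub_fs_verify_one_copy(stripe);
	}

	bio_put(bio);

	/* The scrub thread only waits for the last IO to finish. */
	if (atomic_dec_and_test(&stripe->pending_io))
		wake_up(&stripe->io_wait);
}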
[BACKGROUND]
Beyond the write-hole problem of RAID56, scrub is also not RAID56
friendly, in the following ways:
- Extra IO for RAID56 scrub
Currently the data stripes of RAID56 can be read 2x (RAID5) or 3x (RAID6).
This is caused by the fact that we do one-thread-per-device scrub.
Dev 1 | Data 1 | P(3 + 4) |
Dev 2 | Data 2 | Data 3 |
Dev 3 | P(1 + 2) | Data 4 |
When scrubbing Dev 1, we will read Data 1 (treated no differently than
SINGLE), then read Parity (3 + 4).
But to determine if Parity (3 + 4) is correct, we have to read Data 3
and Data 4.
On the other hand, Data 3 will also be read when scrubbing Dev 2,
and Data 4 will also be read when scrubbing Dev 3.
Thus all data stripes will be read twice, slowing down RAID56
scrubbing.
- No proper progress report for P/Q stripes
The scrub_progress structure has no members for P/Q error reporting at
all. Thus even if we fix some P/Q errors, they will not be reported.
To address the above problems, this patchset introduces a new family
of ioctls, the scrub_fs ioctls.
[CORE DESIGN]
The new scrub_fs ioctl goes block group by block group to scrub the
full fs.
Inside each block group, we use BTRFS_STRIPE_LEN as the scrub unit (to
be enlarged later to improve parallelism for RAID0/10).
Then we read the full BTRFS_STRIPE_LEN bytes from each mirror (if there
is an extent inside the range).
The read bios are submitted to all devices at once, so we can still
take advantage of parallel IO.
The later stage of verification (comparing the copies) still happens
only inside the scrub thread; per-copy csum checking is done at endio
time (see the changelog above).
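
A rough outline of the per-block-group loop described above, with all
helper names invented purely for illustration (the real helpers are
introduced across the patches below):

static int scrub_fs_block_group(struct scrub_fs_ctx *sctx,
				struct btrfs_block_group *bg)
{
	u64 cur;

	for (cur = bg->start; cur < bg->start + bg->length;
	     cur += BTRFS_STRIPE_LEN) {
		/* Nothing to verify if no extent overlaps this unit. */
		if (!scrub_fs_unit_has_extent(sctx, cur))
			continue;

		/*
		 * One read bio per mirror, submitted to all the involved
		 * devices at once, so the IO itself is still parallel.
		 * Per-copy csum checking happens at endio time.
		 */
		scrub_fs_submit_read_all_copies(sctx, cur);

		/*
		 * Wait for all the reads, then do the later verification
		 * stage (compare the copies) and the repair writeback in
		 * the scrub thread.
		 */
		scrub_fs_wait_and_compare_copies(sctx, cur);
	}
	return 0;
}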
This ioctl family also relies on a much larger progress structure,
padded to 256 bytes, with parity-specific error reporting (not yet
implemented though).
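
For reference, one possible shape of such a progress structure; the
field names here are only illustrative, the real definition is in the
patched include/uapi/linux/btrfs.h:

struct btrfs_scrub_fs_progress {
	__u64 data_bytes_scrubbed;
	__u64 meta_bytes_scrubbed;
	__u64 data_csum_mismatch;
	__u64 data_nocsum;		/* bytes without csum, see above */
	__u64 meta_corrupted;

	/* Parity-specific reporting, which the old scrub_progress lacks. */
	__u64 parity_bytes_scrubbed;
	__u64 parity_corrupted;
	__u64 parity_repaired;

	/* Pad the structure to 256 bytes for future extensions. */
	__u64 __reserved[24];
};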
[THE GOOD]
- Every stripe will be iterated at most once
No double read for data stripes.
- Better error reports for parity mismatch
- No need for complex bio form shaping
Since we already submit read bios in BTRFS_STRIPE_LEN units and wait
for them to finish, there are at most nr_copies bios in flight.
(For the later RAID0/10 optimization, it will be nr_stripes.)
This behavior naturally limits the IOPS usage, thus there is no need
to do any bio form shaping.
This greatly reduces the code size; just check how much code is spent
on bio form shaping in the old scrub code.
- Fewer block groups marked read-only
Now at most one block group is marked read-only for scrub,
reducing the possibility of ENOSPC during scrub.
[THE BAD]
- Slower for the SINGLE profile
If someone is using the SINGLE profile on multiple devices, scrub_fs
will be slower.
Dev 1: | SINGLE BG 1 |
Dev 2: | SINGLE BG 2 |
Dev 3: | SINGLE BG 3 |
The existing scrub code scrubs SINGLE BGs 1~3 at the same time, but
the new scrub_fs will scrub SINGLE BG 1 first, then 2, then 3,
making the scrub much slower in such a case.
Although I'd argue that for the above case the user should go RAID0
anyway.
[THE UGLY]
Since this is just a proof-of-concept patchset, it lacks the following
functionality/optimization:
- Slower RAID0/RAID10 scrub
Since we only scrub one BTRFS_STRIPE_LEN unit at a time, we will not
utilize all the devices of RAID0/10.
This can be easily enhanced later by enlarging the scrub unit to a
full stripe, though.
- No RAID56 support
Ironically.
- Very basic btrfs-progs support
It really only calls the ioctl and prints the output; there is no
background scrub or scrub status file support.
(A minimal usage sketch follows this list.)
- No drop-in full fstests run yet
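
For completeness, a minimal userspace sketch of how such an ioctl
could be driven. BTRFS_IOC_SCRUB_FS comes from patch 01; the args
structure and its fields below are purely hypothetical (see the
patched include/uapi/linux/btrfs.h for the real definitions):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>	/* patched headers providing BTRFS_IOC_SCRUB_FS */

int main(int argc, char **argv)
{
	/* Hypothetical args structure, zeroed to start from scratch. */
	struct btrfs_ioctl_scrub_fs_args args = {};
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (ioctl(fd, BTRFS_IOC_SCRUB_FS, &args) < 0) {
		perror("BTRFS_IOC_SCRUB_FS");
		return 1;
	}

	/* Field name is illustrative, matching the sketch above. */
	printf("data bytes scrubbed: %llu\n",
	       (unsigned long long)args.progress.data_bytes_scrubbed);
	return 0;
}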
Qu Wenruo (10):
btrfs: introduce BTRFS_IOC_SCRUB_FS family of ioctls
btrfs: scrub: introduce place holder for btrfs_scrub_fs()
btrfs: scrub: introduce a place holder helper scrub_fs_iterate_bgs()
btrfs: scrub: introduce place holder helper scrub_fs_block_group()
btrfs: scrub: add helpers to fulfill csum/extent_generation
btrfs: scrub: submit and wait for the read of each copy
btrfs: scrub: implement metadata verification code for scrub_fs
btrfs: scrub: implement data verification code for scrub_fs
btrfs: scrub: implement the later stage of verification
btrfs: scrub: implement the repair (writeback) functionality
fs/btrfs/ctree.h | 4 +
fs/btrfs/disk-io.c | 83 ++-
fs/btrfs/disk-io.h | 2 +
fs/btrfs/ioctl.c | 45 ++
fs/btrfs/scrub.c | 1372 ++++++++++++++++++++++++++++++++++++
include/uapi/linux/btrfs.h | 174 +++++
6 files changed, 1654 insertions(+), 26 deletions(-)
--
2.37.3