From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: [PATCH 00/13] btrfs: make read repair work in synchronous mode
Date: Tue, 3 May 2022 14:49:44 +0800
Message-ID: <cover.1651559986.git.wqu@suse.com>
[CHANGELOG]
RFC v2 -> v1:
Most updates are to avoid memory-allocation deadlocks in the endio
function context.
- Allocate a new mempool for read_repair_bio
This avoids allocating a new btrfs_bio while we're still holding one
btrfs_bio in its endio function.
The problem has existed in the code for a long time; only recently did
Christoph point out the possible deadlock scenario.
Furthermore, the new read_repair_bio is much smaller than btrfs_bio,
avoiding wasted memory on unused members.
- Submit the assembled bio immediately if the next sector is not
mergeable
Instead of holding the bios in a bio list, this gives us a higher
chance to reclaim the space allocated for the read_repair_bio.
- Pre-allocate the two needed bitmaps inside btrfs_submit_data_bio()
If we fail to allocate the memory, we just fail the bio, and the VFS
layer will retry with a much smaller range, giving us a much higher
chance to allocate the needed memory on the next try.
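The fail-fast allocation can be sketched in user space roughly like
this (a minimal sketch; the structure and function names are
illustrative, not the real kernel API):

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the two repair bitmaps. */
struct repair_bitmaps {
	unsigned long *cur_bad;		/* sectors currently known bad    */
	unsigned long *init_bad;	/* sectors bad in the initial read */
};

#define LONG_BITS (8 * sizeof(unsigned long))

/*
 * Pre-allocate both repair bitmaps for a read covering @nr_sectors.
 * On failure everything is freed and -ENOMEM is returned, so the
 * caller can simply fail the bio and let the VFS layer retry with a
 * smaller range.
 */
static int alloc_repair_bitmaps(struct repair_bitmaps *b,
				unsigned int nr_sectors)
{
	size_t longs = (nr_sectors + LONG_BITS - 1) / LONG_BITS;

	b->cur_bad = calloc(longs, sizeof(unsigned long));
	b->init_bad = calloc(longs, sizeof(unsigned long));
	if (!b->cur_bad || !b->init_bad) {
		free(b->cur_bad);
		free(b->init_bad);
		b->cur_bad = NULL;
		b->init_bad = NULL;
		return -ENOMEM;
	}
	return 0;
}
```

The point of doing this at submit time rather than in the endio path is
that the submitter is allowed to fail and shrink the range, while the
endio context is not.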
- Fix the btrfs/157 failure by introducing RAID56 specific repair
The old repair_io_failure() can handle it pretty well, although in
that case we will lose the async bio submission, but that should still
be acceptable just for RAID56.
RFC v1 -> RFC v2:
- Assemble a bio list for read/write bios and submit them in one go
This requires fewer submit-bio hooks, while still allowing us to wait
for them all to finish.
- Completely remove io_failure_tree infrastructure
Now we don't need to remember which mirror we hit an error on.
At end_bio_extent_readpage() we have either got good data and done the
repair already, or there aren't enough mirrors for us to recover all
the data.
This mostly trades on-stack memory in end_bio_extent_readpage() for
btrfs_inode::io_failure_tree.
The latter tree has a much longer lifespan, thus I think it's still a
win overall.
[RFC POINTS]
- How to improve read_repair_get_sector()?
Currently we always iterate the whole bio to grab the target
page/pgoff.
Is there any better cached method to avoid such iteration?
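For reference, the linear lookup being questioned looks roughly like
the following user-space sketch (the segment structure and function
names are illustrative, not the real kernel API):

```c
#include <stddef.h>

/* Simplified model of a bio's segment list. */
struct bio_seg {
	unsigned long page_index;	/* which page backs this segment  */
	unsigned int pgoff;		/* offset of the segment in page  */
	unsigned int len;		/* length of the segment in bytes */
};

/*
 * Walk the whole segment list from the start to find the page/pgoff
 * backing byte @offset -- the O(n) iteration per lookup that a cached
 * cursor could avoid for sequential callers.
 */
static int get_sector(const struct bio_seg *segs, size_t nr_segs,
		      unsigned int offset,
		      unsigned long *page_index, unsigned int *pgoff)
{
	size_t i;

	for (i = 0; i < nr_segs; i++) {
		if (offset < segs[i].len) {
			*page_index = segs[i].page_index;
			*pgoff = segs[i].pgoff + offset;
			return 0;
		}
		offset -= segs[i].len;
	}
	return -1;	/* offset beyond the bio */
}
```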
- Is this new code logically more reader-friendly?
It's certainly more straightforward, but I doubt whether it's any
easier to read than the old code.
- btrfs/157 failure
Extra checks are needed to find out why btrfs/157 fails.
In theory we should just iterate through all mirrors; my guess is
that, since we have no way to exhaust all combinations, the extra two
"mirrors" can give us a wrong result for RAID6.
[BEFORE]
For corrupted sectors, we just record the logical bytenr and mirror
number in the io_failure_tree, then re-queue the same block with a
different mirror number and call it a day.
The re-queued read will enter the same endio function, with extra
failrec handling to either continue re-queueing (csum mismatch/read
failure), or clear the current failrec and submit a write to fix the
corrupted mirror (read succeeded and csum matches/no csum).
This is harder to read, as we need to enter the same river twice or
even more times.
[AFTER]
Before submitting a data read bio, we first pre-allocate the bitmaps
used by read repair.
If we get no memory, we just fail the bio and let the VFS layer retry
with a smaller range, which gives us a larger chance to get the memory
on the next try.
For corrupted sectors, we record the following things into an on-stack
structure in end_bio_extent_readpage():
- The original bio
- The original file offset of the bio
This is for the direct IO case, as we can not grab the file offset
just using page_offset()
- Offset inside the bio of the corrupted sector
- Corrupted mirror
Then in the new btrfs_read_repair_ctrl structure, we hold those info
like:
Original bio: logical = X, file_offset = Y, inode = (R/I)

Offset inside bio:  0    4K   8K   12K  16K
cur_bad_bitmap:     | X  | X  |    | X  |

Each set bit indicates a corrupted sector inside the original bio.
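The bitmap bookkeeping above can be sketched as follows (a minimal
user-space sketch assuming a 4K sector size; the helper names are
illustrative, not the real kernel API):

```c
#define SECTORSIZE	4096u
#define LONG_BITS	(8 * sizeof(unsigned long))

/* Mark the sector at @bio_offset (bytes into the bio) as corrupted. */
static void set_sector_bad(unsigned long *bitmap, unsigned int bio_offset)
{
	unsigned int bit = bio_offset / SECTORSIZE;

	bitmap[bit / LONG_BITS] |= 1UL << (bit % LONG_BITS);
}

/* Test whether the sector at @bio_offset is still marked corrupted. */
static int sector_is_bad(const unsigned long *bitmap,
			 unsigned int bio_offset)
{
	unsigned int bit = bio_offset / SECTORSIZE;

	return !!(bitmap[bit / LONG_BITS] & (1UL << (bit % LONG_BITS)));
}
```

For the diagram above, the endio loop would set the bits for bio
offsets 0, 4K and 12K and leave the bit for 8K clear.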
After we have iterated all sectors of the original bio, we call
btrfs_read_repair_finish() to do the real repair by:
- Assemble and submit read bios
For the above case, one read bio will cover bio offset [0, 8K), while
another bio covers bio offset [12K, 16K).
The page/pgoff will be extracted from the original bio.
This is a little different from the old behavior, which submitted a
new bio for each sector.
The new behavior saves some btrfs_map_bio() calls.
- Submit the last read bio and wait for all of them to finish
- Re-verify the read result
- Submit write for the corrupted mirror
We do the same as for the read bios: assemble and submit them.
For repaired sectors, clear their bits in @cur_bad_bitmap.
- Repeat the loop until either 1) we have tried all mirrors, or 2) no
corrupted sectors remain
- Handle the remaining corrupted sectors
Either mark them as errors for buffered read, or just return an error
for direct IO.
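The overall mirror loop can be sketched like this (a user-space sketch
only; repair_loop(), read_mirror() and demo_read_mirror() are
hypothetical stand-ins for the real bio submission, checksum
verification and write-back):

```c
/*
 * Try every other mirror in turn until all bad sectors are repaired
 * or the mirrors are exhausted.  @read_mirror returns 0 when the
 * given mirror yields a good (csum-matching) copy of the sector.
 */
static int repair_loop(int *bad, unsigned int nr_sectors,
		       int failed_mirror, int num_mirrors,
		       int (*read_mirror)(unsigned int sector, int mirror),
		       int nr_bad)
{
	int mirror;

	for (mirror = 1; mirror <= num_mirrors && nr_bad > 0; mirror++) {
		unsigned int i;

		if (mirror == failed_mirror)
			continue;
		for (i = 0; i < nr_sectors; i++) {
			if (!bad[i])
				continue;
			/*
			 * Re-read from this mirror; on a good copy, the
			 * real code would also submit a write to fix
			 * the corrupted mirror.
			 */
			if (read_mirror(i, mirror) == 0) {
				bad[i] = 0;
				nr_bad--;
			}
		}
	}
	return nr_bad;	/* sectors we could not repair */
}

/* Demo stub: pretend only mirror 2 holds good copies. */
static int demo_read_mirror(unsigned int sector, int mirror)
{
	(void)sector;
	return mirror == 2 ? 0 : -1;
}
```

Because the loop runs synchronously inside btrfs_read_repair_finish(),
there is no re-entry into the endio function at all.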
By this we can:
- Remove the re-entry behavior of endio function
Now everything is handled inside end_bio_extent_readpage().
- Remove the io_failure_tree completely
As we don't need to record which mirror has failed.
- Slightly reduced overhead on read repair
Now we won't call btrfs_map_bio() for each corrupted sector, as we can
merge the sectors into a much larger bio.
Qu Wenruo (13):
btrfs: introduce a pure data checksum checking helper
btrfs: quit early if the fs has no RAID56 support for raid56 related
checks
btrfs: save the original bi_iter into btrfs_bio for buffered read
btrfs: remove duplicated parameters from submit_data_read_repair()
btrfs: add btrfs_read_repair_ctrl to record corrupted sectors
btrfs: add a helper to queue a corrupted sector for read repair
btrfs: introduce a helper to repair from one mirror
btrfs: allow btrfs read repair to submit writes in asynchronous mode
btrfs: handle RAID56 read repair differently
btrfs: switch buffered read to the new read repair routine
btrfs: switch direct IO routine to use btrfs_read_repair_ctrl
btrfs: remove io_failure_record infrastructure completely
btrfs: remove btrfs_inode::io_failure_tree
fs/btrfs/Makefile | 2 +-
fs/btrfs/btrfs_inode.h | 5 -
fs/btrfs/compression.c | 10 +-
fs/btrfs/ctree.h | 2 +
fs/btrfs/extent-io-tree.h | 15 --
fs/btrfs/extent_io.c | 490 +++++------------------------------
fs/btrfs/extent_io.h | 28 +-
fs/btrfs/inode.c | 121 +++++----
fs/btrfs/read-repair.c | 459 ++++++++++++++++++++++++++++++++
fs/btrfs/read-repair.h | 84 ++++++
fs/btrfs/super.c | 9 +-
fs/btrfs/volumes.c | 6 +
fs/btrfs/volumes.h | 4 +
include/trace/events/btrfs.h | 1 -
14 files changed, 695 insertions(+), 541 deletions(-)
create mode 100644 fs/btrfs/read-repair.c
create mode 100644 fs/btrfs/read-repair.h
--
2.36.0