public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Cc: linux-block@vger.kernel.org, dm-devel@redhat.com
Subject: Re: [PATCH 00/17] btrfs: split bio at btrfs_map_bio() time
Date: Wed, 1 Dec 2021 16:57:35 +0800	[thread overview]
Message-ID: <bf794c47-35aa-7823-35ea-854873a20949@suse.com> (raw)
In-Reply-To: <20211201051756.53742-1-wqu@suse.com>



On 2021/12/1 13:17, Qu Wenruo wrote:
> [BACKGROUND]
> 
> Currently btrfs never uses bio_split() to split its bio against RAID
> stripe boundaries.
> 
> Instead inside btrfs we check our stripe boundary everytime we allocate
> a new bio, and ensure the new bio never cross stripe boundaries.
> 
> [PROBLEMS]
> 
> Although this behavior works fine, it's against the common practice used in
> stacked drivers, and is making the effort to convert to iomap harder.
> 
> There is also an hidden burden, every time we allocate a new bio, we uses
> BIO_MAX_BVECS, but since we know the boundaries, for RAID0/RAID10 we can
> only fit at most 16 pages (fixed 64K stripe size, and 4K page size),
> wasting the 256 slots we allocated.
> 
> [CHALLENGES]
> 
> To change the situation, this patchset attempts to improve the situation
> by moving the bio split into btrfs_map_bio() time, so upper layer should
> no longer bother the bio split against RAID stripes or even chunk
> boundaries.
> 
> But there are several challenges:
> 
> - Conflicts in various endio functions
>    We want the existing granularity, instead of chained endio, thus we
>    must make the involved endio functions to handle split bios.
> 
>    Although most endio functions are already doing their works
>    independent of the bio size, they are not yet fully handling split
>    bio.
> 
>    This patch will convert them to use saved bi_iter and only iterate
>    the split range instead of the whole bio.
>    This change involved 3 types of IOs:
> 
>    * Buffered IO
>      Including both data and metadata
>    * Direct IO
>    * Compressed IO
> 
>    Their endio functions needs different level of updates to handle split
>    bios.
> 
>    Furthermore, there is another endio, end_workqueue_bio(), it can't
>    handle split bios at all, thus we change the timing so that
>    btrfs_bio_wq_end_io() is only called after the bio being split.
> 
> - Checksum verification
>    Currently we rely on btrfs_bio::csum to contain the checksum for the
>    whole bio.
>    If one bio get split, csum will no longer points to the correct
>    location for the split bio.
> 
>    This can be solved by introducing btrfs_bio::offset_to_original, and
>    use that new member to calculate where we should read csum from.
> 
>    For the parent bio, it still has btrfs_bio::csum for the whole bio,
>    thus it can still free it correctly.
> 
> - Independent endio for each split bio
>    Unlike stack drivers, for RAID10 btrfs needs to try its best effort to
>    read every sectors, to handle the following case: (X means bad, either
>    unable to read or failed to pass checksum verification, V means good)
> 
>    Dev 1	(missing) | D1 (X) |
>    Dev 2 (OK)	  | D1 (V) |
>    Dev 3 (OK)	  | D2 (V) |
>    Dev 4 (OK)	  | D2 (X) |
> 
>    In the above RAID10 case, dev1 is missing, and although dev4 is fine,
>    its D2 sector is corrupted (by bit rot or whatever).
> 
>    If we use bio_chain(), read bio for both D1 and D2 will be split, and
>    since D1 is missing, the whole D1 and D2 read will be marked as error,
>    thus we will try to read from dev2 and dev4.
> 
>    But D2 in dev4 has csum mismatch, we can only rebuild D1 and D2
>    correctly from dev2:D1 and dev3:D2.
> 
>    This patchset resolve this by saving bi_iter into btrfs_bio::iter, and
>    uses that at endio to iterate only the split part of an bio.
>    Other than this, existing read/write page endio functions can handle
>    them properly without problem.
> 
> - Bad RAID56 naming/functionality
>    There are quite some RAID56 call sites relies on specific behavior on
>    __btrfs_map_block(), like returning @map_length as stripe_len other
>    than real mapped length.
> 
>    This is handled by some small cleanups specific for RAID56.
> 
> [NEED FEEDBACK]
> In this refactor, btrfs is utilizing a lot of call sites like:
> 
>   btrfs_bio_save_iter();	// Save bi_iter into some other location
>   __bio_for_each_segment(bvec, bio, iter, btrfs_bio->iter) {
> 	/* Doing endio for each bvec */
>   }
> 
> And manually implementing an endio which does some work of
> __bio_chain_endio() but with extra btrfs specific workload.
> 
> I'm wondering if block layer is fine to provide some *enhanced* chain
> bio facilities?
> 
> [CHANGELOG]
> RFC->v1:
> - Better patch split
>    Now patch 01~06 are refactors/cleanups/preparations.
>    While 07~13 are the patches that doing the conversion while can handle
>    both old and new bio split timings.
>    Finally patch 14~16 convert the bio split call sites one by one to
>    newer facility.
>    The final patch is just a small clean up.
> 
> - Various bug fixes
>    During the full fstests run, various stupid bugs are exposed and
>    fixed.

The latest version can be fetched from this branch:

https://github.com/adam900710/linux/tree/refactor_chunk_map

Which already has the fix for the btrfs/187 crash, which is caused by a 
missing modification for scrub_stripe_index_and_offset().

Thanks,
Qu
> 
> Qu Wenruo (17):
>    btrfs: update an stale comment on btrfs_submit_bio_hook()
>    btrfs: save bio::bi_iter into btrfs_bio::iter before submitting
>    btrfs: use correct bio size for error message in btrfs_end_dio_bio()
>    btrfs: refactor btrfs_map_bio()
>    btrfs: move btrfs_bio_wq_end_io() calls into submit_stripe_bio()
>    btrfs: replace btrfs_dio_private::refs with
>      btrfs_dio_private::pending_bytes
>    btrfs: introduce btrfs_bio_split() helper
>    btrfs: make data buffered read path to handle split bio properly
>    btrfs: make data buffered write endio function to be split bio
>      compatible
>    btrfs: make metadata write endio functions to be split bio compatible
>    btrfs: make dec_and_test_compressed_bio() to be split bio compatible
>    btrfs: return proper mapped length for RAID56 profiles in
>      __btrfs_map_block()
>    btrfs: allow btrfs_map_bio() to split bio according to chunk stripe
>      boundaries
>    btrfs: remove buffered IO stripe boundary calculation
>    btrfs: remove stripe boundary calculation for compressed IO
>    btrfs: remove the stripe boundary calculation for direct IO
>    btrfs: unexport btrfs_get_io_geometry()
> 
>   fs/btrfs/btrfs_inode.h |  10 +-
>   fs/btrfs/compression.c |  70 +++-----------
>   fs/btrfs/disk-io.c     |   9 +-
>   fs/btrfs/extent_io.c   | 189 +++++++++++++++++++++++++------------
>   fs/btrfs/extent_io.h   |   2 +
>   fs/btrfs/inode.c       | 210 ++++++++++++++++-------------------------
>   fs/btrfs/raid56.c      |  14 ++-
>   fs/btrfs/raid56.h      |   2 +-
>   fs/btrfs/scrub.c       |   4 +-
>   fs/btrfs/volumes.c     | 157 ++++++++++++++++++++++--------
>   fs/btrfs/volumes.h     |  75 +++++++++++++--
>   11 files changed, 435 insertions(+), 307 deletions(-)
> 


      parent reply	other threads:[~2021-12-01  8:57 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-12-01  5:17 [PATCH 00/17] btrfs: split bio at btrfs_map_bio() time Qu Wenruo
2021-12-01  5:17 ` [PATCH 01/17] btrfs: update an stale comment on btrfs_submit_bio_hook() Qu Wenruo
2021-12-01  5:17 ` [PATCH 02/17] btrfs: save bio::bi_iter into btrfs_bio::iter before submitting Qu Wenruo
2021-12-01 11:52   ` Qu Wenruo
2021-12-01  5:17 ` [PATCH 03/17] btrfs: use correct bio size for error message in btrfs_end_dio_bio() Qu Wenruo
2021-12-01  5:17 ` [PATCH 04/17] btrfs: refactor btrfs_map_bio() Qu Wenruo
2021-12-01  5:17 ` [PATCH 05/17] btrfs: move btrfs_bio_wq_end_io() calls into submit_stripe_bio() Qu Wenruo
2021-12-01  5:17 ` [PATCH 06/17] btrfs: replace btrfs_dio_private::refs with btrfs_dio_private::pending_bytes Qu Wenruo
2021-12-01  5:17 ` [PATCH 07/17] btrfs: introduce btrfs_bio_split() helper Qu Wenruo
2021-12-01  5:17 ` [PATCH 08/17] btrfs: make data buffered read path to handle split bio properly Qu Wenruo
2021-12-01  5:17 ` [PATCH 09/17] btrfs: make data buffered write endio function to be split bio compatible Qu Wenruo
2021-12-01  5:17 ` [PATCH 10/17] btrfs: make metadata write endio functions " Qu Wenruo
2021-12-01  5:17 ` [PATCH 11/17] btrfs: make dec_and_test_compressed_bio() " Qu Wenruo
2022-03-16 19:46   ` Josef Bacik
2022-03-16 23:30     ` Qu Wenruo
2021-12-01  5:17 ` [PATCH 12/17] btrfs: return proper mapped length for RAID56 profiles in __btrfs_map_block() Qu Wenruo
2021-12-01  6:52   ` Qu Wenruo
2021-12-01  5:17 ` [PATCH 13/17] btrfs: allow btrfs_map_bio() to split bio according to chunk stripe boundaries Qu Wenruo
2021-12-01  5:17 ` [PATCH 14/17] btrfs: remove buffered IO stripe boundary calculation Qu Wenruo
2021-12-01  5:17 ` [PATCH 15/17] btrfs: remove stripe boundary calculation for compressed IO Qu Wenruo
2021-12-01  5:17 ` [PATCH 16/17] btrfs: remove the stripe boundary calculation for direct IO Qu Wenruo
2021-12-01  5:17 ` [PATCH 17/17] btrfs: unexport btrfs_get_io_geometry() Qu Wenruo
2021-12-01  8:57 ` Qu Wenruo [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bf794c47-35aa-7823-35ea-854873a20949@suse.com \
    --to=wqu@suse.com \
    --cc=dm-devel@redhat.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox