From: Bob Liu <bob.liu@oracle.com>
To: Andreas Dilger <adilger@dilger.ca>
Cc: Dave Chinner <david@fromorbit.com>,
linux-block <linux-block@vger.kernel.org>,
linux-xfs <linux-xfs@vger.kernel.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Martin Petersen <martin.petersen@oracle.com>,
shirley.ma@oracle.com,
Allison Henderson <allison.henderson@oracle.com>,
darrick.wong@oracle.com, hch@infradead.org
Subject: Re: [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry
Date: Fri, 1 Mar 2019 22:14:31 +0800 [thread overview]
Message-ID: <f6612507-128e-85d1-9045-60af6c87f73e@oracle.com> (raw)
In-Reply-To: <D0A57CD2-1EE2-43C9-9E79-7B09F4FCEAC6@dilger.ca>
On 3/1/19 7:28 AM, Andreas Dilger wrote:
> On Feb 28, 2019, at 7:22 AM, Bob Liu <bob.liu@oracle.com> wrote:
>>
>> On 2/19/19 5:31 AM, Dave Chinner wrote:
>>> On Wed, Feb 13, 2019 at 05:50:35PM +0800, Bob Liu wrote:
>>>> Motivation:
>>>> When fs data/metadata checksum mismatch, lower block devices may have other
>>>> correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
>>>> decides that the metadata is garbage, today it will shut down the entire
>>>> filesystem without trying any of the other mirrors. This is a severe
>>>> loss of service, and we propose these patches to have XFS try harder to
>>>> avoid failure.
>>>>
>>>> This patch prototype this mirror retry idea by:
>>>> * Adding @nr_mirrors to struct request_queue which is similar as
>>>> blk_queue_nonrot(), filesystem can grab device request queue and check max
>>>> mirrors this block device has.
>>>> Helper functions were also added to get/set the nr_mirrors.
>>>>
>>>> * Introducing bi_rd_hint just like bi_write_hint, but bi_rd_hint is a long bitmap
>>>> in order to support stacked layer case.
>>>>
>>>> * Modify md/raid1 to support this retry feature.
>>>>
>>>> * Adapter xfs to use this feature.
>>>> If the read verify fails, we loop over the available mirrors and retry the read.
>>>
>>> Why does the filesystem have to iterate every single posible
>>> combination of devices that are underneath it?
>
> Even if the filesystem isn't doing this iteration, there needs to be
> some way to track which devices or combinations of devices have been
> tried for the bio, which likely still means something inside the bio.
>
>>> Wouldn't it be much simpler to be able to attach a verifier
>>> function to the bio, and have each layer that gets called iterate
>>> over all it's copies internally until the verfier function passes
>>> or all copies are exhausted?
>>>
>>> This works for stacked mirrors - it can pass the higher layer
>>> verifier down as far as necessary. It can work for RAID5/6, too, by
>>> having that layer supply it's own verifier for reads that verifies
>>> parity and can reconstruct of failure, then when it's reconstructed
>>> a valid stripe it can run the verifier that was supplied to it from
>>> above, etc.
>>>
>>> i.e. I dont see why only filesystems should drive retries or have to
>>> be aware of the underlying storage stacking. ISTM that each
>>> layer of the storage stack should be able to verify what has been
>>> returned to it is valid independently of the higher layer
>>> requirements. The only difference from a caller point of view should
>>> be submit_bio(bio); vs submit_bio_verify(bio, verifier_cb_func);
>
> I don't think the filesystem should be aware of the stacking (nor are
> they in the proposed implementation). That said, the filesystem-level
> checksums should, IMHO, be checked at the filesystem level, and this
> proposal allows the filesystem to tell the lower layer "this read was
> bad, try something else".
>
> One option, instead of having a bitmap, with one bit for every possible
> device/combination in the system, would be to have a counter instead.
> This is much denser, and even the existing "__u16 bio_write_hint" field
Indeed! This way is better than a bitmap.
But as Dave mentioned, it's much better and simpler if attaching a verfier function to the bio..
-
Thanks,
Bob
> would be enough for 2^16 different devices/combinations of devices to
> be tried. The main difference would be that the retry layers in the
> device layer would need to have a deterministic iterator for the bio.
>
> For stacked devices it would need to use the same API to determine how
> many possible combinations are below it, and do a modulus to pass down
> the per-device iteration number. The easiest would be to iterate in
> numeric order, but it would also be possible to use something like a
> PRNG seeded by e.g. the block number to change the order on a per-bio
> basis to even out the load, if that is desirable.
>
>> For a two layer stacked md case like:
>> /dev/md0
>> / | \
>> /dev/md1-a /dev/md1-b /dev/sdf
>> / \ / | \
>> /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde
>
> In this case, the top-level md0 would call blk_queue_get_copies() on each
> sub-devices to determine how many sub-devices/combinations they have,
> and pick the maximum (3 in this case), multiplied by the number of
> top-level devices (also 3 in this case). That means the top-level device
> would return blk_queue_get_copies() == 9 combinations, but the same
> could be done recursively for more/non-uniform layers if needed.
>
> The top-level device maps md1-a = [0-2], md1-b = [3-5], md1-c = [6-8],
> and can easily map an incoming bio_read_hint to the next device, either
> by simple increment or by predetermining a device ordering and following
> that (e.g. 0, 3, 6, 1, 4, 7, 2, 5, 8), or any other deterministic order
> that hits all of the devices exactly once). During submission bio_read_hint
> is set to the modulus of the value (so that each layer in the stack sees
> only values in the range [0, copies), and when the bio completes the top-level
> device will set bio_read_hint to be the next sub-device to try (like the
> original proposal was splitting and combining the bitmaps). If a sub-device
> gets a bad index (e.g. md1-a sees bio_read_hint == 2, or sdf sees anything
> other than 0) it is a no-op and returns e.g. -EAGAIN to the upper device
> so that it moves to the next device without returning to the caller.
>
>>> I suspect there's a more important issue to worry about: we run the
>>> XFS read verifiers in an async work queue context after collecting
>>> the IO completion status from the bio, rather than running directly
>>> in bio->bi_end_io() call chain.
>
> In this proposal, XFS would just have to save the __u16 bio_read_hint
> field from the previous bio completion and set it in the retried bio,
> so that it could start at the next device/combination. Obviously,
> this would mean that the internal device iterator couldn't have any
> hidden state for the bio so that just setting bio_read_hint would be
> the same as resubmitting the original bio again, but that is already
> a given or this whole problem wouldn't exist in the first place.
>
> Cheers, Andreas
>
next prev parent reply other threads:[~2019-03-01 14:14 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-02-13 9:50 [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry Bob Liu
2019-02-13 9:50 ` [RFC PATCH v2 1/9] block: add nr_mirrors to request_queue Bob Liu
2019-02-13 10:26 ` Andreas Dilger
2019-02-13 16:04 ` Theodore Y. Ts'o
2019-02-14 5:57 ` Bob Liu
2019-02-18 17:56 ` Theodore Y. Ts'o
2019-02-13 9:50 ` [RFC PATCH v2 2/9] block: add rd_hint to bio and request Bob Liu
2019-02-13 16:18 ` Jens Axboe
2019-02-14 6:10 ` Bob Liu
2019-02-13 9:50 ` [RFC PATCH v2 3/9] md:raid1: set mirrors correctly Bob Liu
2019-02-13 9:50 ` [RFC PATCH v2 4/9] md:raid1: rd_hint support and consider stacked layer case Bob Liu
2019-02-13 9:50 ` [RFC PATCH v2 5/9] Add b_alt_retry to xfs_buf Bob Liu
2019-02-13 9:50 ` [RFC PATCH v2 6/9] xfs: Add b_rd_hint " Bob Liu
2019-02-13 9:50 ` [RFC PATCH v2 7/9] xfs: Add device retry Bob Liu
2019-02-13 9:50 ` [RFC PATCH v2 8/9] xfs: Rewrite retried read Bob Liu
2019-02-13 9:50 ` [RFC PATCH v2 9/9] xfs: Add tracepoints and logging to alternate device retry Bob Liu
2019-02-18 8:08 ` [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror " jianchao.wang
2019-02-19 1:29 ` jianchao.wang
2019-02-18 21:31 ` Dave Chinner
2019-02-19 2:55 ` Darrick J. Wong
2019-02-19 3:33 ` Dave Chinner
2019-02-28 14:22 ` Bob Liu
2019-02-28 21:49 ` Dave Chinner
2019-03-03 2:37 ` Bob Liu
2019-03-03 23:18 ` Dave Chinner
2019-02-28 23:28 ` Andreas Dilger
2019-03-01 14:14 ` Bob Liu [this message]
2019-03-03 23:45 ` Dave Chinner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=f6612507-128e-85d1-9045-60af6c87f73e@oracle.com \
--to=bob.liu@oracle.com \
--cc=adilger@dilger.ca \
--cc=allison.henderson@oracle.com \
--cc=darrick.wong@oracle.com \
--cc=david@fromorbit.com \
--cc=hch@infradead.org \
--cc=linux-block@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=martin.petersen@oracle.com \
--cc=shirley.ma@oracle.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).