From: Shaohua Li <shli@kernel.org>
To: NeilBrown <neilb@suse.com>
Cc: linux-raid@vger.kernel.org, linux-block@vger.kernel.org,
Christoph Hellwig <hch@lst.de>,
linux-kernel@vger.kernel.org, hare@suse.de
Subject: Re: [PATCH/RFC] add "failfast" support for raid1/raid10.
Date: Mon, 21 Nov 2016 18:02:38 -0800 [thread overview]
Message-ID: <20161122020238.qtuxwo5etcwmts4r@kernel.org> (raw)
In-Reply-To: <147944614789.3302.1959091446949640579.stgit@noble>
On Fri, Nov 18, 2016 at 04:16:11PM +1100, Neil Brown wrote:
> Hi,
>
> I've been sitting on these patches for a while because although they
> solve a real problem, it is a fairly limited use-case, and I don't
> really like some of the details.
>
> So I'm posting them as RFC in the hope that a different perspective
> might help me like them better, or find a better approach.
>
> The core idea is that when you have multiple copies of data
> (i.e. mirrored drives) it doesn't make sense to wait for a read from
> a drive that seems to be having problems. It will probably be faster
> to just cancel that read, and read from the other device.
> Similarly, in some circumstances, it might be better to fail a drive
> that is being slow to respond to writes, rather than cause all writes
> to be very slow.
>
> The particular context where this comes up is when mirroring across
> storage arrays, where the storage arrays can temporarily take an
> unusually long time to respond to requests (firmware updates have
> been mentioned). As the array will have redundancy internally, there
> is little risk to the data. The mirrored pair is really only for
> disaster recovery, and it is deemed better to lose the last few
> minutes of updates in the case of a serious disaster, rather than
> occasionally having latency issues because one array needs to do some
> maintenance for a few minutes. The particular storage arrays in
> question are DASD devices which are part of the s390 ecosystem.
>
> Linux block layer has "failfast" flags to direct drivers to fail more
> quickly. These patches allow devices in an md array to be given a
> "failfast" flag, which will cause IO requests to be marked as
> "failfast" providing there is another device available. Once the
> array becomes degraded, we stop using failfast, as that could result
> in data loss.
>
> I don't like the whole "failfast" concept because it is not at all
> clear how fast "fast" is. In fact, these block-layer flags are
> really a misnomer. They should be "noretry" flags.
> REQ_FAILFAST_DEV means "don't retry requests which reported an error
> which seems to come from the device.
> REQ_FAILFAST_TRANSPORT means "don't retry requests which seem to
> indicate a problem with the transport, rather than the device"
> REQ_FAILFAST_DRIVER means .... I'm not exactly sure. I think it
> means whatever a particular driver wants it to mean, basically "I
> cannot seem to handle this right now, just resend and I'll probably
> be more in control next time". It seems to be for internal-use only.
>
> Multipath code uses REQ_FAILFAST_TRANSPORT only, which makes sense.
> btrfs uses REQ_FAILFAST_DEV only (for read-ahead) which doesn't seem
> to make sense.... why would you ever use _DEV without _TRANSPORT?
>
> None of these actually change the timeouts in the driver or in the
> device, which is what I would expect for "failfast", so to get real
> "fast failure" you need to enable failfast, and adjust the timeouts.
> That is what we do for our customers with DASD.
>
> Anyway, it seems to make sense to use _TRANSPORT and _DEV for
> requests from md where there is somewhere to fall-back on.
> If we get an error from a "failfast" request, and the array is still
> non-degraded, we just fail the device. We don't try to repair read
> errors (which is pointless on storage arrays).
>
> It is assumed that some user-space code will notice the failure,
> monitor the device to see when it becomes available again, and then
> --re-add it. Assuming the array has a bitmap, the --re-add should be
> fast and the array will become optimal again without experiencing
> excessive latencies.
>
> My two main concerns are:
> - does this functionality have any use-case outside of mirrored
> storage arrays, and are there other storage arrays which
> occasionally inserted excessive latency (seems like a serious
> misfeature to me, but I know few of the details)?
> - would it be at all possible to have "real" failfast functionality
> in the block layer? I.e. something that is based on time rather
> than retry count. Maybe in some cases a retry would be
> appropriate if the first failure was very fast.
> I.e. it would reduce timeouts and decide on retries based on
> elapsed time rather than number of attempts.
> With this would come the question of "how fast is fast" and I
> don't have a really good answer. Maybe md would need to set a
> timeout, which it would double whenever it got failures on all
> drives. Otherwise the timeout would drift towards (say) 10 times
> the typical response time.
>
> So: comments most welcome. As I say, this does address a genuine
> need. Just find it hard to like it :-(
Patches looks good. As long as these don't break normal raid array (while they
don't if the superblock bit isn't set), I have no objection to apply the
patches even they are for special usage. I'll add the series to the next tree.
Just curious: will the FAILFAST increase the chance IO failure?
Thanks,
Shaohua
next prev parent reply other threads:[~2016-11-22 3:53 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-11-18 5:16 [PATCH/RFC] add "failfast" support for raid1/raid10 NeilBrown
2016-11-18 5:16 ` [md PATCH 2/6] md: Use REQ_FAILFAST_* on metadata writes where appropriate NeilBrown
2016-11-18 5:16 ` [md PATCH 1/6] md/failfast: add failfast flag for md to be used by some personalities NeilBrown
2016-11-18 5:16 ` [md PATCH 5/6] md/raid10: add failfast handling for reads NeilBrown
2016-11-18 5:16 ` [md PATCH 6/6] md/raid10: add failfast handling for writes NeilBrown
2016-11-18 5:16 ` [md PATCH 3/6] md/raid1: add failfast handling for reads NeilBrown
2016-11-18 5:16 ` [md PATCH 4/6] md/raid1: add failfast handling for writes NeilBrown
2016-11-18 7:09 ` [PATCH/RFC] add "failfast" support for raid1/raid10 Hannes Reinecke
2016-11-18 15:41 ` Jack Wang
2016-11-24 4:47 ` NeilBrown
2016-11-24 16:06 ` Jack Wang
2016-11-22 2:02 ` Shaohua Li [this message]
2016-11-24 23:55 ` [mdadm PATCH] Add failfast support NeilBrown
2016-11-28 13:53 ` Jes Sorensen
2016-11-29 22:02 ` [mdadm PATCH] Introduce enum flag_mode for setting and clearing flags NeilBrown
2016-11-29 22:12 ` Jes Sorensen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20161122020238.qtuxwo5etcwmts4r@kernel.org \
--to=shli@kernel.org \
--cc=hare@suse.de \
--cc=hch@lst.de \
--cc=linux-block@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-raid@vger.kernel.org \
--cc=neilb@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.