From: NeilBrown <neilb@suse.com>
To: Alexander Lyakas <alex.bolshoy@gmail.com>
Cc: linux-raid <linux-raid@vger.kernel.org>
Subject: Re: raid1_end_read_request does not retry failed READ from a recovering drive
Date: Fri, 24 Jul 2015 09:24:04 +1000
Message-ID: <20150724092404.66dc284d@noble>
In-Reply-To: <CAGRgLy5DsbiPsjJ2ZK_aOZ_UOqxa-zwbdDREv7hdpt7vLXCwOw@mail.gmail.com>
On Wed, 22 Jul 2015 18:10:31 +0200 Alexander Lyakas
<alex.bolshoy@gmail.com> wrote:
> Hi Neil,
> In continuation of our discussion, I see that you have added a
> commit[1], which has a diff[2].
> But this is only the second part of the fix. We also need the first
> part, I believe, where in raid1_end_read_request we need to replace
> "!test_bit(Faulty)" with "test_bit(In_sync)". Otherwise, we will never
> retry the READ, and thus will never reach the fix_read_error code.
> And we also need to "put all of raid1_spare_active inside the
> spinlock", like you advised.
> Do you agree?
>
> We tested both parts of the fix, not the second part alone. Quoting myself:
> "With this addition, the problem appears to be fixed", i.e., I meant
> we applied both parts.
>
> I am sorry for catching this so late. I never looked at what you
> applied until now, because we are moving to kernel 3.18 long term, and
> I am checking the raid1 changes. I just assumed you applied both
> parts.
>
> If you agree, it would be good if you tag the first part of the fix as
> "cc stable" too.
>
> Thanks,
> Alex.
>
>
>
> [1]
> commit b8cb6b4c121e1bf1963c16ed69e7adcb1bc301cd
> Author: NeilBrown <neilb@suse.de>
> Date: Thu Sep 18 11:09:04 2014 +1000
>
> md/raid1: fix_read_error should act on all non-faulty devices.
>
> If a device is being recovered it is not InSync and is not Faulty.
>
> If a read error is experienced on that device, fix_read_error()
> will be called, but it ignores non-InSync devices. So it will
> neither fix the error nor fail the device.
>
> It is incorrect that fix_read_error() ignores non-InSync devices.
> It should only ignore Faulty devices. So fix it.
>
> This became a bug when we allowed reading from a device that was being
> recovered. It is suitable for any subsequent -stable kernel.
>
> Fixes: da8840a747c0dbf49506ec906757a6b87b9741e9
> Cc: stable@vger.kernel.org (v3.5+)
> Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
> Tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
> Signed-off-by: NeilBrown <neilb@suse.de>
>
> [2]
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 35649dd..55de4f6 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2155,7 +2155,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
>  			d--;
>  			rdev = conf->mirrors[d].rdev;
>  			if (rdev &&
> -			    test_bit(In_sync, &rdev->flags))
> +			    !test_bit(Faulty, &rdev->flags))
>  				r1_sync_page_io(rdev, sect, s,
>  						 conf->tmppage, WRITE);
>  		}
> @@ -2167,7 +2167,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
>  			d--;
>  			rdev = conf->mirrors[d].rdev;
>  			if (rdev &&
> -			    test_bit(In_sync, &rdev->flags)) {
> +			    !test_bit(Faulty, &rdev->flags)) {
>  				if (r1_sync_page_io(rdev, sect, s,
>  						    conf->tmppage, READ)) {
>  					atomic_add(s, &rdev->corrected_errors);
>
> On Mon, Sep 22, 2014 at 2:17 AM, NeilBrown <neilb@suse.de> wrote:
> > On Sun, 21 Sep 2014 19:47:14 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> > wrote:
> >
> >> Thanks, Neil,
> >>
> >> On Thu, Sep 18, 2014 at 4:05 AM, NeilBrown <neilb@suse.de> wrote:
> >> > On Wed, 17 Sep 2014 20:57:13 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> >> > wrote:
> >> >
> >> >> Hi Neil,
> >> >>
> >> >> On Mon, Sep 8, 2014 at 10:17 AM, NeilBrown <neilb@suse.de> wrote:
> >> >> > On Sun, 7 Sep 2014 17:18:16 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> >> >> > wrote:
> >> >> >
> >> >> >> Hi Neil,
> >> >> >> we see the following issue:
> >> >> >>
> >> >> >> # RAID1 has 2 drives A and B, drive B is recovering
> >> >> >> # READ request arrives
> >> >> >> # read_balance selects drive B to read from, because READ sector
> >> >> >> comes before B->recovery_offset
> >> >> >> # READ is issued to drive B, but fails (drive B fails again)
> >> >> >>
> >> >> >> Now raid1_end_read_request() has the following code:
> >> >> >>
> >> >> >> 	if (uptodate)
> >> >> >> 		set_bit(R1BIO_Uptodate, &r1_bio->state);
> >> >> >> 	else {
> >> >> >> 		/* If all other devices have failed, we want to return
> >> >> >> 		 * the error upwards rather than fail the last device.
> >> >> >> 		 * Here we redefine "uptodate" to mean "Don't want to retry"
> >> >> >> 		 */
> >> >> >> 		unsigned long flags;
> >> >> >> 		spin_lock_irqsave(&conf->device_lock, flags);
> >> >> >> 		if (r1_bio->mddev->degraded == conf->raid_disks ||
> >> >> >> 		    (r1_bio->mddev->degraded == conf->raid_disks-1 &&
> >> >> >> 		     !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags)))
> >> >> >> 			uptodate = 1;
> >> >> >> 		spin_unlock_irqrestore(&conf->device_lock, flags);
> >> >> >> 	}
> >> >> >>
> >> >> >> According to this code uptodate wrongly becomes 1, because:
> >> >> >> r1_bio->mddev->degraded == conf->raid_disks-1 is TRUE
> >> >> >> and
> >> >> >> !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags) is also TRUE
> >> >> >>
> >> >> >> Indeed, drive B is not marked as Faulty, but also not marked as In_sync.
> >> >> >> However, this function treats !Faulty as equivalent to In_sync, so it
> >> >> >> decides that the last good drive has failed and does not retry the
> >> >> >> READ.
> >> >> >>
> >> >> >> As a result, there is IO error, while we should have retried the READ
> >> >> >> from the healthy drive.
> >> >> >>
> >> >> >> This is happening in 3.8.13, but your master branch seems to have the
> >> >> >> same issue.
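
For illustration, a minimal userspace model of that test (simplified
flags, not the kernel's rdev structs; scenario values follow this
report: two mirrors, drive B recovering, so B is neither Faulty nor
In_sync and ->degraded == raid_disks-1):

#include <stdbool.h>
#include <stdio.h>

struct mirror { bool faulty; bool in_sync; };

int main(void)
{
	int raid_disks = 2, degraded = 1;	/* drive B is recovering */
	struct mirror b = { .faulty = false, .in_sync = false };

	/* current test: treats any !Faulty device as the last working one */
	bool buggy = degraded == raid_disks ||
		     (degraded == raid_disks - 1 && !b.faulty);

	/* proposed test: only an In_sync device can be the last working one */
	bool fixed = degraded == raid_disks ||
		     (degraded == raid_disks - 1 && b.in_sync);

	printf("!Faulty test: uptodate=%d (READ wrongly not retried)\n", buggy);
	printf("In_sync test: uptodate=%d (READ retried from drive A)\n", fixed);
	return 0;
}

Run, it prints uptodate=1 for the !Faulty test and uptodate=0 for the
In_sync test, i.e. only the latter lets the READ be retried on drive A.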
> >> >> >>
> >> >> >> What is a reasonable fix?
> >> >> >> 1) Do not read from drives which are !In_sync (a bit scary to read
> >> >> >> from such drive)
> >> >> >
> >> >> > It is perfectly safe to read from a !In_sync device, provided you are
> >> >> > reading before ->recovery_offset.
> >> >> >
> >> >> >
> >> >> >> 2) replace !Faulty to In_sync check
> >> >> >
> >> >> > That probably makes sense... though that could race with raid1_spare_active().
> >> >> > If a read-error returned just after raid1_spare_active() set In_sync, and
> >> >> > before 'count' was subtracted from ->degraded, we would still set uptodate
> >> >> > when we shouldn't.
> >> >> > It probably makes sense to put all of raid1_spare_active inside the spinlock -
> >> >> > it doesn't get called often enough for performance to be an issue (I hope).
> >> >> >
> >> >> > So:
> >> >> > 1/ change !Faulty to In_sync
> >> >> > 2/ extend the spinlock in raid1_spare_active to cover the whole function.
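
A sketch of 2/, assuming raid1_spare_active() keeps roughly its
upstream shape (replacement-device handling elided here); moving the
whole scan under device_lock makes In_sync and ->degraded change
atomically with respect to raid1_end_read_request():

static int raid1_spare_active(struct mddev *mddev)
{
	int i;
	struct r1conf *conf = mddev->private;
	int count = 0;
	unsigned long flags;

	/* Hold device_lock across the whole scan so a completing read
	 * error can never observe In_sync set while ->degraded still
	 * counts this device as missing.
	 */
	spin_lock_irqsave(&conf->device_lock, flags);
	for (i = 0; i < conf->raid_disks; i++) {
		struct md_rdev *rdev = conf->mirrors[i].rdev;

		if (rdev
		    && rdev->recovery_offset == MaxSector
		    && !test_bit(Faulty, &rdev->flags)
		    && !test_and_set_bit(In_sync, &rdev->flags)) {
			count++;
			sysfs_notify_dirent_safe(rdev->sysfs_state);
		}
	}
	mddev->degraded -= count;
	spin_unlock_irqrestore(&conf->device_lock, flags);

	print_conf(conf);
	return count;
}

The cost is holding device_lock a little longer, which should not
matter given how rarely spares are activated.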
> >> >>
> >> >>
> >> >> I made these fixes and reproduced the issue. However, the result is
> >> >> not what we expected:
> >> >>
> >> >> # raid1_end_read_request() now indeed adds the r1_bio into retry_list,
> >> >> as we wanted
> >> >> # raid1d calls fix_read_error()
> >> >> # fix_read_error() searches for an In_sync drive to read the data
> >> >> from. It finds such drive (this is our good drive A)
> >> >> # now fix_read_error() wants to rewrite the bad area. But it rewrites
> >> >> only those drives that are In_sync (except the drive it got the
> >> >> data from). In our case, it never tries to rewrite the data on drive B
> >> >> (drive B is marked neither Faulty nor In_sync). As a result,
> >> >> md_error() is not called, so drive B is still not marked Faulty
> >> >> when fix_read_error() completes
> >> >> # so handle_read_error() retries the original READ by calling
> >> >> read_balance(), which again in my case selects the recovering drive
> >> >> B...
> >> >>
> >> >> And then the whole flow repeats itself again and again...and READ
> >> >> never completes.
> >> >>
> >> >> Maybe we should not allow selecting recovering drives for READ? Or
> >> >> some other approach?
> >> >>
> >> >
> >> > Thanks for the testing and analysis.
> >> > Presumably we just want handle_read_error() to write to all non-faulty
> >> > devices, not just the InSync ones.
> >> > i.e. the following patch.
> >> >
> >> > Thanks,
> >> > NeilBrown
> >> >
> >> > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >> > index 6a9c73435eb8..a95f9e179e6f 100644
> >> > --- a/drivers/md/raid1.c
> >> > +++ b/drivers/md/raid1.c
> >> > @@ -2153,7 +2153,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
> >> >  			d--;
> >> >  			rdev = conf->mirrors[d].rdev;
> >> >  			if (rdev &&
> >> > -			    test_bit(In_sync, &rdev->flags))
> >> > +			    !test_bit(Faulty, &rdev->flags))
> >> >  				r1_sync_page_io(rdev, sect, s,
> >> >  						 conf->tmppage, WRITE);
> >> >  		}
> >> > @@ -2165,7 +2165,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
> >> >  			d--;
> >> >  			rdev = conf->mirrors[d].rdev;
> >> >  			if (rdev &&
> >> > -			    test_bit(In_sync, &rdev->flags)) {
> >> > +			    !test_bit(Faulty, &rdev->flags)) {
> >> >  				if (r1_sync_page_io(rdev, sect, s,
> >> >  						    conf->tmppage, READ)) {
> >> >  					atomic_add(s, &rdev->corrected_errors);
> >>
> >> With this addition, the problem appears to be fixed. We will give it
> >> some regression testing & will let you know if we see any issues.
> >>
> >> I presume you will be applying this fix upstream as well.
> >
> > Yes, it is already in my for-next branch.
> > I've just added your tested-by.
> >
> > Thanks,
> > NeilBrown
Hi Alex
thanks for noticing!
Just to be sure we mean the same thing: this is the patch which is
missing - correct?
Thanks,
NeilBrown
From: NeilBrown <neilb@suse.com>
Date: Fri, 24 Jul 2015 09:22:16 +1000
Subject: [PATCH] md/raid1: fix test for 'was read error from last working device'.

When we get a read error from the last working device, we don't
try to repair it, and don't fail the device. We simply report a
read error to the caller.

However the current test for 'is this the last working device' is
wrong. When there is only one fully working device, it assumes that
a non-faulty device is that device. However a spare which is
rebuilding would be non-faulty but is not that fully working device.

So change the test from "!Faulty" to "In_sync". If ->degraded says
there is only one fully working device and this device is in_sync,
this must be the one.

This bug has existed since we allowed read_balance to read from
a recovering spare in v3.0.

Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Fixes: 76073054c95b ("md/raid1: clean up read_balance.")
Cc: stable@vger.kernel.org (v3.0+)
Signed-off-by: NeilBrown <neilb@suse.com>
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 166616411215..b368307a9651 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -336,7 +336,7 @@ static void raid1_end_read_request(struct bio *bio, int error)
 		spin_lock_irqsave(&conf->device_lock, flags);
 		if (r1_bio->mddev->degraded == conf->raid_disks ||
 		    (r1_bio->mddev->degraded == conf->raid_disks-1 &&
-		     !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags)))
+		     test_bit(In_sync, &conf->mirrors[mirror].rdev->flags)))
 			uptodate = 1;
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 	}
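
For completeness: this patch fixes only the test. The race described
earlier in the thread (In_sync set by raid1_spare_active() before
->degraded is decremented) still needs the companion spinlock change.
A deterministic userspace sketch of that window, with simplified
illustrative types (steps 1 and 3 are the two halves of
raid1_spare_active(), step 2 the read-error completion):

#include <stdbool.h>
#include <stdio.h>

struct mirror { bool faulty; bool in_sync; };

/* the patched test from raid1_end_read_request() */
static bool last_working_device(int degraded, int raid_disks,
				const struct mirror *m)
{
	return degraded == raid_disks ||
	       (degraded == raid_disks - 1 && m->in_sync);
}

int main(void)
{
	int raid_disks = 2, degraded = 1;
	struct mirror b = { .faulty = false, .in_sync = false };

	b.in_sync = true;	/* step 1: raid1_spare_active() sets In_sync */

	/* step 2: a read error on B completes in this window; B looks
	 * like the last working device, so the error would wrongly be
	 * reported upwards instead of retried.
	 */
	if (last_working_device(degraded, raid_disks, &b))
		printf("race: read error not retried\n");

	degraded -= 1;		/* step 3: ->degraded finally updated */
	return 0;
}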