From: NeilBrown <neilb@suse.de>
To: Alexander Lyakas <alex.bolshoy@gmail.com>
Cc: linux-raid <linux-raid@vger.kernel.org>,
	Andrei Warkentin <andrey.warkentin@gmail.com>,
	Yair Hershko <yair@zadarastorage.com>
Subject: Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)
Date: Thu, 27 Feb 2014 14:19:09 +1100	[thread overview]
Message-ID: <20140227141909.697fecd3@notabene.brown> (raw)
In-Reply-To: <CAGRgLy6pzRCPjCFT7Pots0QqSC_T_PRqN19-quCxxLoPDjqbfQ@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 10195 bytes --]
On Wed, 26 Feb 2014 15:02:27 +0200 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:
> HI Neil,
> can you please comment what do you think of this change.
Sorry for not replying earlier.
I think the core problem you are addressing is fixed by:
commit f466722ca614edcd14f3337373f33132117c7612
Author: NeilBrown <neilb@suse.de>
Date:   Mon Dec 9 12:04:56 2013 +1100
    md: Change handling of save_raid_disk and metadata update during recovery.
Can you confirm if that is the case?
With respect to the split-brain issue, I see that you would like every
superblock to know which devices are truly operational and which are still
recovering - a distinction which my current code doesn't allow.
You want to know this because if you can see only 2 drives from a 4-drive
RAID6 then:
 - if the other 2 devices are believed to be fully functional, then we risk a
   split-brain situation and shouldn't start the array
 - if either of the other devices is missing or still recovering, then there
   is no risk of split brain and the array can be started.
With the current code you might refuse to start an array because it looks
like it both other drives could be fully functional where in fact they are
still recovering.
This at least fails safely, but it may be that we want to absolutely minimise
the number of false-failures.
However I would hope that the total amount of time that the array spends
recovering a device would be a very small fraction of the total time that the
array is active.  If that is true, then the number of times you get a crash
during recovery will be a small fraction of the total crashes, so the number
of false-failures should be a small fraction of the reported failures.
So I don't see this as a big problem.
With respect to the "biggest problem" you note in the split-brain document
you reference:  this is an issue that has been in the back of my mind for
many years, wondering how much I should care and to what extent it should be
"fixed".  Whenever I have come close to fixing it, I've managed to convince
myself that I shouldn't.
We only record a device as 'failed' if a write fails.  And if a write fails
then that device is clearly inconsistent with the rest of the array.  So it
really seems appropriate to record that fact.
I could possibly be convinced that there comes a point where updating
superblocks any further is completely pointless, but I would need to be
certain that it wouldn't leave a suggestion that the array was more coherent
than it really is.
Thanks,
NeilBrown
> 
> Thanks,
> Alex.
> 
> On Wed, Feb 19, 2014 at 11:51 AM, Alexander Lyakas
> <alex.bolshoy@gmail.com> wrote:
> > Hello Neil, Andrei,
> > I don't how much you recall of this old discussion.
> >
> > Basically, the change that you made means that md doesn't update the
> > superblock on the recovering device, until recovery completes. As a
> > result, when assembling such array, the recovering device has an old
> > event count in the superblock and is not picked during assembly. So
> > later, user has to re-add it manually.
> > This is true for a device that failed and was re-added. For a fresh
> > device, saved_raid_disk==-1, so its superblock will still be updated
> > (and sucn device will be picked on assembly).
> >
> > On the other hand, MD updates the superblock of the In_sync devices
> > and marks the recovering device as having a valid array slot (not
> > 0xFFFF or 0xFFFE, which are treated similarly today).
> >
> > What do you think of the following change:
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 561a65f..4bbc7e3 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -1854,8 +1854,6 @@ retry:
> >                         sb->dev_roles[i] = cpu_to_le16(0xfffe);
> >                 else if (test_bit(In_sync, &rdev2->flags))
> >                         sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
> > -               else if (rdev2->raid_disk >= 0)
> > -                       sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
> >                 else
> >                         sb->dev_roles[i] = cpu_to_le16(0xffff);
> >         }
> >
> > Basically, we do not mark the slot of the recovering device, until its
> > recovery completes. So during assembly, we will not pick it, as
> > before. For a fresh drive, there will be a regression - we will not
> > pick it as well on assembly.
> >
> > The reason I am considering this change is another old discussion that
> > we had - considering split-brain resolution, and the proposal I made
> > in:
> > https://docs.google.com/document/d/1sgO7NgvIFBDccoI3oXp9FNzB6RA5yMwqVN3_-LMSDNE
> >
> > Basically, when MD marks recovering devices as having a valid raid
> > slot in the superblock of In_sync devices, then on array assembly we
> > don't know whether sucn device is recovering or In_sync (if it is
> > inaccessible during assembly). So we have to assume it is In_sync and
> > thus can potentially cause split-brain.
> >
> >
> > What do you think of this change?
> >
> > Thanks,
> > Alex.
> >
> >
> > On Wed, Jun 27, 2012 at 9:22 AM, NeilBrown <neilb@suse.de> wrote:
> >> On Thu, 7 Jun 2012 15:47:46 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> >> wrote:
> >>
> >>> Thanks for commenting, Neil,
> >>>
> >>> On Thu, Jun 7, 2012 at 2:24 PM, NeilBrown <neilb@suse.de> wrote:
> >>> > On Thu, 7 Jun 2012 10:22:24 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> >>> > wrote:
> >>> >
> >>> >> Hi again Neil, and Andrey,
> >>> >>
> >>> >> looking at this email thread:
> >>> >> http://www.spinics.net/lists/raid/msg36236.html
> >>> >> between you and Andrey, the conclusion was:
> >>> >> "So the correct thing to do is to *not* update the metadata on the
> >>> >> recovering device until recovery completes.  Then if it fails and is
> >>> >> re-added, it will look just the same as when it was re-added the first
> >>> >> time, and will do a bitmap-based recovery."
> >>> >>
> >>> >> I have two doubts about this decision:
> >>> >> 1) Since the event count on the recovering drive is not updated, this
> >>> >> means that after reboot, when array is re-assembled, this drive will
> >>> >> not be added to the array, and the user will have to manually re-add
> >>> >> it. I agree this is a minor thing.
> >>> >
> >>> > Still, if it can be fixed it should be.
> >>> >
> >>> >> 2) There are places in mdadm, which check for recovery_offset on the
> >>> >> drive and take decisions based upon that. Specifically, if there is
> >>> >> *no* recovery offset, the data on this drive is considered to be
> >>> >> consistent WRT to the failure time of the drive. So, for example, the
> >>> >> drive can be a candidate for bumping up events during "force
> >>> >> assembly". Now, when superblock on such drive is not updated during
> >>> >> recovery (so there is *no* recovery offset), mdadm will think that the
> >>> >> drive is consistent, while in fact, its data is totally unusable until
> >>> >> after recovery completes. That's because we have updated parts of the
> >>> >> drive, but did not complete bringing the whole drive in-sync.
> >>> >
> >>> > If mdadm would consider updating the event count if not recovery had started,
> >>> > then surely it is just as valid to do so once some recovery has started, even
> >>> > if it hasn't completed.
> >>>
> >>> The patch you accepted from me ("Don't consider disks with a valid
> >>> recovery offset as candidates for bumping up event count") actually
> >>> attempts to protect from that:)
> >>>
> >>> I don't understand why "it is just as valid to do so once some
> >>> recovery has started". My understanding is that once recovery of a
> >>> drive has started, its data is not consistent between different parts
> >>> of the drive, until the recovery completes. This is because md does
> >>> bitmap-based recovery, and not kind of journal/transaction-log based
> >>> recovery.
> >>>
> >>> However, one could argue that for force-assembly case, when data
> >>> anyways can come up as partially corrupted, this is less important.
> >>
> >> Exactly.  And mdadm only updates event counts in the force-assembly case so
> >> while it might not be ideal, it is the best we can do.
> >>
> >>>
> >>> I would still think that there is value in recoding in a superblock
> >>> that a drive is recovering.
> >>
> >> Probably.  It is a bit unfortunate that if you stop an array that is
> >> recovering after a --re-add, you cannot simply 'assemble' it again and
> >> get it back to the same state.
> >> I'll think more on that.
> >>
> >> Meanwhile, this patch might address your other problem.  It allows --re-add
> >> to work if a non-bitmap rebuild fails and is then re-added.
> >>
> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> index c601c4b..d31852e 100644
> >> --- a/drivers/md/md.c
> >> +++ b/drivers/md/md.c
> >> @@ -5784,7 +5784,7 @@ static int add_new_disk(struct mddev * mddev, mdu_disk_info_t *info)
> >>                         super_types[mddev->major_version].
> >>                                 validate_super(mddev, rdev);
> >>                 if ((info->state & (1<<MD_DISK_SYNC)) &&
> >> -                   (!test_bit(In_sync, &rdev->flags) ||
> >> +                   (test_bit(Faulty, &rdev->flags) ||
> >>                      rdev->raid_disk != info->raid_disk)) {
> >>                         /* This was a hot-add request, but events doesn't
> >>                          * match, so reject it.
> >>
> >>
> >>>
> >>> >
> >>> >>
> >>> >> Can you pls comment on my doubts?
> >>> >
> >>> > I think there is probably room for improvement here but I don't think there
> >>> > are any serious problems.
> >>> >
> >>> > However I'm about to go on leave for a couple of week so I'm unlikely to
> >>> > think about it for a while. I've made a note to look at it properly when I
> >>> > get back.
> >>> >
> >>>
> >>> Indeed, don't you think about this while you are resting!
> >>
> >> :-)
> >>
> >>
> >> Thanks.  I did have a very enjoyable break.
> >>
> >> NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
next prev parent reply	other threads:[~2014-02-27  3:19 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-06-06 16:09 Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a) Alexander Lyakas
2012-06-07  7:22 ` Alexander Lyakas
2012-06-07 11:24   ` NeilBrown
2012-06-07 12:47     ` Alexander Lyakas
2012-06-27  7:22       ` NeilBrown
2012-06-27 16:40         ` Alexander Lyakas
2012-07-02  7:57           ` NeilBrown
2012-07-02 17:30             ` John Gehring
2014-02-19  9:51         ` Alexander Lyakas
2014-02-26 13:02           ` Alexander Lyakas
2014-02-27  3:19             ` NeilBrown [this message]
2014-03-02 11:01               ` Alexander Lyakas
2012-06-08  8:59     ` Thanks (was Re: Problem with patch...) John Robinson
2012-06-07 10:26 ` Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a) NeilBrown
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox
  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):
  git send-email \
    --in-reply-to=20140227141909.697fecd3@notabene.brown \
    --to=neilb@suse.de \
    --cc=alex.bolshoy@gmail.com \
    --cc=andrey.warkentin@gmail.com \
    --cc=linux-raid@vger.kernel.org \
    --cc=yair@zadarastorage.com \
    /path/to/YOUR_REPLY
  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
  Be sure your reply has a Subject: header at the top and a blank line
  before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).