* Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)
@ 2012-06-06 16:09 Alexander Lyakas
  2012-06-07  7:22 ` Alexander Lyakas
  2012-06-07 10:26 ` Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a) NeilBrown
  0 siblings, 2 replies; 14+ messages in thread
From: Alexander Lyakas @ 2012-06-06 16:09 UTC (permalink / raw)
  To: NeilBrown, linux-raid
Hi Neil,
it looks like we are hitting an issue with this patch.
Kernel is stock ubuntu-precise (3.2.0-23 36), mdadm is v3.2.3. (We are
aware that this version of mdadm has an issue of unaligned writes, so
we backported the fix to it into this version).
We are testing the following simple scenario:
# 2-way raid1 with a missing drive
# a fresh drive (no superblock) is added (not re-added) to the raid,
and the raid happily starts rebuilding it
# at some point in time, this drive fails, so rebuild aborts, drive is
kicked out.
# later the drive recovers and "mdadm remove & re-add" is attempted
# re-add fails
Some debugging suggests the following:
- The superblock of the drive to be re-added looks like this:
 Avail Dev Size : 10483712 (5.00 GiB 5.37 GB)
     Array Size : 10483688 (5.00 GiB 5.37 GB)
  Used Dev Size : 10483688 (5.00 GiB 5.37 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
Recovery Offset : 100096 sectors
          State : clean
  Resync Offset : N/A sectors
    Device UUID : d9114d02:76153894:87aaf09d:c55da9be
Internal Bitmap : 8 sectors from superblock
    Update Time : Wed Jun  6 18:36:55 2012
       Checksum : 94468c9d - correct
         Events : 192
   Device Role : Active device 1
   Array State : AA ('A' == active, '.' == missing)
So the drive has a valid recovery offset (MD_FEATURE_RECOVERY_OFFSET
set) and a valid role.
- The "re-add" path of mdadm performs calls getinfo_super1(), which performs:
role = __le16_to_cpu(sb->dev_roles[__le32_to_cpu(sb->dev_number)]);
...
info->disk.state = 6; /* active and in sync */
info->disk.raid_disk = role;
So at this point, MD_DISK_ACTIVE and MD_DISK_SYNC are set.
- mdadm issues ADD_NEW_DISK, which arrives to add_new_disk() in the kernel
- the kernel performs:
if (!mddev->persistent) {
... // not relevant to our case
} else
	super_types[mddev->major_version].
		validate_super(mddev, rdev);
- validate_super() performs:
	if ((le32_to_cpu(sb->feature_map) &
	     MD_FEATURE_RECOVERY_OFFSET))
		rdev->recovery_offset = le64_to_cpu(sb->recovery_offset);
	else
		set_bit(In_sync, &rdev->flags);
	rdev->raid_disk = role;
So In_sync flag is not set, because we have a recovery offset.
- Then the code introduced by the patch does:
	if ((info->state & (1<<MD_DISK_SYNC)) &&
	    (!test_bit(In_sync, &rdev->flags) ||
	     rdev->raid_disk != info->raid_disk)) {
		/* This was a hot-add request, but events doesn't
		 * match, so reject it.
		 */
		export_rdev(rdev);
		return -EINVAL;
	}
So the first part of "OR" succeeds and we abort.
An observation: this problem only happens with a fresh drive
apparently. If drive that was already In_sync fails and then is
re-added, kernel does not update its superblock until recovery
completes. So its superblock has no valid recovery_offset, and this
problem doesn't happen. The code that skips updating the superblock
was added as part of this email thread:
http://www.spinics.net/lists/raid/msg36275.html. However, I did not
dig deeper why the superblock gets updated in my case. But the
scenario above repros easily.
Q1: Can you confirm that my analysis is correct?
Q2: Is this an expected behavior? I would assume that no, because the
same case for a failed drive (not a fresh drive) works fine
Q3: How would you recommend to handle this?
Thanks!
Alex.
^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)
  2012-06-06 16:09 Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a) Alexander Lyakas
@ 2012-06-07  7:22 ` Alexander Lyakas
  2012-06-07 11:24   ` NeilBrown
  2012-06-07 10:26 ` Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a) NeilBrown
  1 sibling, 1 reply; 14+ messages in thread
From: Alexander Lyakas @ 2012-06-07  7:22 UTC (permalink / raw)
  To: NeilBrown, andrey.warkentin, linux-raid
Hi again Neil, and Andrey,
looking at this email thread:
http://www.spinics.net/lists/raid/msg36236.html
between you and Andrey, the conclusion was:
"So the correct thing to do is to *not* update the metadata on the
recovering device until recovery completes.  Then if it fails and is
re-added, it will look just the same as when it was re-added the first
time, and will do a bitmap-based recovery."
I have two doubts about this decision:
1) Since the event count on the recovering drive is not updated, this
means that after reboot, when array is re-assembled, this drive will
not be added to the array, and the user will have to manually re-add
it. I agree this is a minor thing.
2) There are places in mdadm, which check for recovery_offset on the
drive and take decisions based upon that. Specifically, if there is
*no* recovery offset, the data on this drive is considered to be
consistent WRT to the failure time of the drive. So, for example, the
drive can be a candidate for bumping up events during "force
assembly". Now, when superblock on such drive is not updated during
recovery (so there is *no* recovery offset), mdadm will think that the
drive is consistent, while in fact, its data is totally unusable until
after recovery completes. That's because we have updated parts of the
drive, but did not complete bringing the whole drive in-sync.
Can you pls comment on my doubts?
Thanks,
Alex.
On Wed, Jun 6, 2012 at 7:09 PM, Alexander Lyakas <alex.bolshoy@gmail.com> wrote:
> Hi Neil,
> it looks like we are hitting an issue with this patch.
> Kernel is stock ubuntu-precise (3.2.0-23 36), mdadm is v3.2.3. (We are
> aware that this version of mdadm has an issue of unaligned writes, so
> we backported the fix to it into this version).
>
> We are testing the following simple scenario:
> # 2-way raid1 with a missing drive
> # a fresh drive (no superblock) is added (not re-added) to the raid,
> and the raid happily starts rebuilding it
> # at some point in time, this drive fails, so rebuild aborts, drive is
> kicked out.
> # later the drive recovers and "mdadm remove & re-add" is attempted
> # re-add fails
>
> Some debugging suggests the following:
>
> - The superblock of the drive to be re-added looks like this:
>
>  Avail Dev Size : 10483712 (5.00 GiB 5.37 GB)
>     Array Size : 10483688 (5.00 GiB 5.37 GB)
>  Used Dev Size : 10483688 (5.00 GiB 5.37 GB)
>    Data Offset : 2048 sectors
>   Super Offset : 8 sectors
> Recovery Offset : 100096 sectors
>          State : clean
>  Resync Offset : N/A sectors
>    Device UUID : d9114d02:76153894:87aaf09d:c55da9be
> Internal Bitmap : 8 sectors from superblock
>    Update Time : Wed Jun  6 18:36:55 2012
>       Checksum : 94468c9d - correct
>         Events : 192
>   Device Role : Active device 1
>   Array State : AA ('A' == active, '.' == missing)
>
>
> So the drive has a valid recovery offset (MD_FEATURE_RECOVERY_OFFSET
> set) and a valid role.
>
> - The "re-add" path of mdadm performs calls getinfo_super1(), which performs:
> role = __le16_to_cpu(sb->dev_roles[__le32_to_cpu(sb->dev_number)]);
> ...
> info->disk.state = 6; /* active and in sync */
> info->disk.raid_disk = role;
>
> So at this point, MD_DISK_ACTIVE and MD_DISK_SYNC are set.
>
> - mdadm issues ADD_NEW_DISK, which arrives to add_new_disk() in the kernel
> - the kernel performs:
> if (!mddev->persistent) {
> ... // not relevant to our case
> } else
>        super_types[mddev->major_version].
>                validate_super(mddev, rdev);
>
> - validate_super() performs:
>        if ((le32_to_cpu(sb->feature_map) &
>             MD_FEATURE_RECOVERY_OFFSET))
>                rdev->recovery_offset = le64_to_cpu(sb->recovery_offset);
>        else
>                set_bit(In_sync, &rdev->flags);
>        rdev->raid_disk = role;
>
> So In_sync flag is not set, because we have a recovery offset.
>
> - Then the code introduced by the patch does:
>        if ((info->state & (1<<MD_DISK_SYNC)) &&
>            (!test_bit(In_sync, &rdev->flags) ||
>             rdev->raid_disk != info->raid_disk)) {
>                /* This was a hot-add request, but events doesn't
>                 * match, so reject it.
>                 */
>                export_rdev(rdev);
>                return -EINVAL;
>        }
>
> So the first part of "OR" succeeds and we abort.
>
> An observation: this problem only happens with a fresh drive
> apparently. If drive that was already In_sync fails and then is
> re-added, kernel does not update its superblock until recovery
> completes. So its superblock has no valid recovery_offset, and this
> problem doesn't happen. The code that skips updating the superblock
> was added as part of this email thread:
> http://www.spinics.net/lists/raid/msg36275.html. However, I did not
> dig deeper why the superblock gets updated in my case. But the
> scenario above repros easily.
>
> Q1: Can you confirm that my analysis is correct?
> Q2: Is this an expected behavior? I would assume that no, because the
> same case for a failed drive (not a fresh drive) works fine
> Q3: How would you recommend to handle this?
>
> Thanks!
> Alex.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)
  2012-06-06 16:09 Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a) Alexander Lyakas
  2012-06-07  7:22 ` Alexander Lyakas
@ 2012-06-07 10:26 ` NeilBrown
  1 sibling, 0 replies; 14+ messages in thread
From: NeilBrown @ 2012-06-07 10:26 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 4358 bytes --]
On Wed, 6 Jun 2012 19:09:25 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:
> Hi Neil,
> it looks like we are hitting an issue with this patch.
> Kernel is stock ubuntu-precise (3.2.0-23 36), mdadm is v3.2.3. (We are
> aware that this version of mdadm has an issue of unaligned writes, so
> we backported the fix to it into this version).
> 
> We are testing the following simple scenario:
> # 2-way raid1 with a missing drive
> # a fresh drive (no superblock) is added (not re-added) to the raid,
> and the raid happily starts rebuilding it
> # at some point in time, this drive fails, so rebuild aborts, drive is
> kicked out.
> # later the drive recovers and "mdadm remove & re-add" is attempted
> # re-add fails
Hmm.. I guess we should drop the "In_sync" test and just keep the "raid_disk"
part of the test?
> 
> Some debugging suggests the following:
> 
> - The superblock of the drive to be re-added looks like this:
> 
>  Avail Dev Size : 10483712 (5.00 GiB 5.37 GB)
>      Array Size : 10483688 (5.00 GiB 5.37 GB)
>   Used Dev Size : 10483688 (5.00 GiB 5.37 GB)
>     Data Offset : 2048 sectors
>    Super Offset : 8 sectors
> Recovery Offset : 100096 sectors
>           State : clean
>   Resync Offset : N/A sectors
>     Device UUID : d9114d02:76153894:87aaf09d:c55da9be
> Internal Bitmap : 8 sectors from superblock
>     Update Time : Wed Jun  6 18:36:55 2012
>        Checksum : 94468c9d - correct
>          Events : 192
>    Device Role : Active device 1
>    Array State : AA ('A' == active, '.' == missing)
> 
> 
> So the drive has a valid recovery offset (MD_FEATURE_RECOVERY_OFFSET
> set) and a valid role.
> 
> - The "re-add" path of mdadm performs calls getinfo_super1(), which performs:
> role = __le16_to_cpu(sb->dev_roles[__le32_to_cpu(sb->dev_number)]);
> ...
> info->disk.state = 6; /* active and in sync */
> info->disk.raid_disk = role;
> 
> So at this point, MD_DISK_ACTIVE and MD_DISK_SYNC are set.
> 
> - mdadm issues ADD_NEW_DISK, which arrives to add_new_disk() in the kernel
> - the kernel performs:
> if (!mddev->persistent) {
> ... // not relevant to our case
> } else
> 	super_types[mddev->major_version].
> 		validate_super(mddev, rdev);
> 
> - validate_super() performs:
> 	if ((le32_to_cpu(sb->feature_map) &
> 	     MD_FEATURE_RECOVERY_OFFSET))
> 		rdev->recovery_offset = le64_to_cpu(sb->recovery_offset);
> 	else
> 		set_bit(In_sync, &rdev->flags);
> 	rdev->raid_disk = role;
> 
> So In_sync flag is not set, because we have a recovery offset.
> 
> - Then the code introduced by the patch does:
> 	if ((info->state & (1<<MD_DISK_SYNC)) &&
> 	    (!test_bit(In_sync, &rdev->flags) ||
> 	     rdev->raid_disk != info->raid_disk)) {
> 		/* This was a hot-add request, but events doesn't
> 		 * match, so reject it.
> 		 */
> 		export_rdev(rdev);
> 		return -EINVAL;
> 	}
> 
> So the first part of "OR" succeeds and we abort.
> 
> An observation: this problem only happens with a fresh drive
> apparently. If drive that was already In_sync fails and then is
> re-added, kernel does not update its superblock until recovery
> completes. So its superblock has no valid recovery_offset, and this
> problem doesn't happen. The code that skips updating the superblock
> was added as part of this email thread:
> http://www.spinics.net/lists/raid/msg36275.html. However, I did not
> dig deeper why the superblock gets updated in my case. But the
> scenario above repros easily.
> 
> Q1: Can you confirm that my analysis is correct?
Looks about  right.
> Q2: Is this an expected behavior? I would assume that no, because the
> same case for a failed drive (not a fresh drive) works fine
It is a case that I hadn't really thought about (re-adding a drive that was
partially recovered), so in a sense, no behaviour is expected :-)
I agree it isn't desirable behaviour.
> Q3: How would you recommend to handle this?
1/ Change the coded added by the commit to remove the
      !test_bit(In_sync, &rdev->flags) ||
  section.
2/ Possibly backport mdadm commit 0a999759b54f94fd63ac0
  This will allow --add to work in more cases.  I'm not sure if you
  are hitting this issue or not.
Please let me know how '1' goes and I'll get that fix upstream.
Thanks,
NeilBrown
> 
> Thanks!
> Alex.
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)
  2012-06-07  7:22 ` Alexander Lyakas
@ 2012-06-07 11:24   ` NeilBrown
  2012-06-07 12:47     ` Alexander Lyakas
  2012-06-08  8:59     ` Thanks (was Re: Problem with patch...) John Robinson
  0 siblings, 2 replies; 14+ messages in thread
From: NeilBrown @ 2012-06-07 11:24 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: andrey.warkentin, linux-raid
[-- Attachment #1: Type: text/plain, Size: 6108 bytes --]
On Thu, 7 Jun 2012 10:22:24 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:
> Hi again Neil, and Andrey,
> 
> looking at this email thread:
> http://www.spinics.net/lists/raid/msg36236.html
> between you and Andrey, the conclusion was:
> "So the correct thing to do is to *not* update the metadata on the
> recovering device until recovery completes.  Then if it fails and is
> re-added, it will look just the same as when it was re-added the first
> time, and will do a bitmap-based recovery."
> 
> I have two doubts about this decision:
> 1) Since the event count on the recovering drive is not updated, this
> means that after reboot, when array is re-assembled, this drive will
> not be added to the array, and the user will have to manually re-add
> it. I agree this is a minor thing.
Still, if it can be fixed it should be.
> 2) There are places in mdadm, which check for recovery_offset on the
> drive and take decisions based upon that. Specifically, if there is
> *no* recovery offset, the data on this drive is considered to be
> consistent WRT to the failure time of the drive. So, for example, the
> drive can be a candidate for bumping up events during "force
> assembly". Now, when superblock on such drive is not updated during
> recovery (so there is *no* recovery offset), mdadm will think that the
> drive is consistent, while in fact, its data is totally unusable until
> after recovery completes. That's because we have updated parts of the
> drive, but did not complete bringing the whole drive in-sync.
If mdadm would consider updating the event count if not recovery had started,
then surely it is just as valid to do so once some recovery has started, even
if it hasn't completed.
> 
> Can you pls comment on my doubts?
I think there is probably room for improvement here but I don't think there
are any serious problems.
However I'm about to go on leave for a couple of week so I'm unlikely to
think about it for a while. I've made a note to look at it properly when I
get back.
NeilBrown
> 
> Thanks,
> Alex.
> 
> 
> On Wed, Jun 6, 2012 at 7:09 PM, Alexander Lyakas <alex.bolshoy@gmail.com> wrote:
> > Hi Neil,
> > it looks like we are hitting an issue with this patch.
> > Kernel is stock ubuntu-precise (3.2.0-23 36), mdadm is v3.2.3. (We are
> > aware that this version of mdadm has an issue of unaligned writes, so
> > we backported the fix to it into this version).
> >
> > We are testing the following simple scenario:
> > # 2-way raid1 with a missing drive
> > # a fresh drive (no superblock) is added (not re-added) to the raid,
> > and the raid happily starts rebuilding it
> > # at some point in time, this drive fails, so rebuild aborts, drive is
> > kicked out.
> > # later the drive recovers and "mdadm remove & re-add" is attempted
> > # re-add fails
> >
> > Some debugging suggests the following:
> >
> > - The superblock of the drive to be re-added looks like this:
> >
> >  Avail Dev Size : 10483712 (5.00 GiB 5.37 GB)
> >     Array Size : 10483688 (5.00 GiB 5.37 GB)
> >  Used Dev Size : 10483688 (5.00 GiB 5.37 GB)
> >    Data Offset : 2048 sectors
> >   Super Offset : 8 sectors
> > Recovery Offset : 100096 sectors
> >          State : clean
> >  Resync Offset : N/A sectors
> >    Device UUID : d9114d02:76153894:87aaf09d:c55da9be
> > Internal Bitmap : 8 sectors from superblock
> >    Update Time : Wed Jun  6 18:36:55 2012
> >       Checksum : 94468c9d - correct
> >         Events : 192
> >   Device Role : Active device 1
> >   Array State : AA ('A' == active, '.' == missing)
> >
> >
> > So the drive has a valid recovery offset (MD_FEATURE_RECOVERY_OFFSET
> > set) and a valid role.
> >
> > - The "re-add" path of mdadm performs calls getinfo_super1(), which performs:
> > role = __le16_to_cpu(sb->dev_roles[__le32_to_cpu(sb->dev_number)]);
> > ...
> > info->disk.state = 6; /* active and in sync */
> > info->disk.raid_disk = role;
> >
> > So at this point, MD_DISK_ACTIVE and MD_DISK_SYNC are set.
> >
> > - mdadm issues ADD_NEW_DISK, which arrives to add_new_disk() in the kernel
> > - the kernel performs:
> > if (!mddev->persistent) {
> > ... // not relevant to our case
> > } else
> >        super_types[mddev->major_version].
> >                validate_super(mddev, rdev);
> >
> > - validate_super() performs:
> >        if ((le32_to_cpu(sb->feature_map) &
> >             MD_FEATURE_RECOVERY_OFFSET))
> >                rdev->recovery_offset = le64_to_cpu(sb->recovery_offset);
> >        else
> >                set_bit(In_sync, &rdev->flags);
> >        rdev->raid_disk = role;
> >
> > So In_sync flag is not set, because we have a recovery offset.
> >
> > - Then the code introduced by the patch does:
> >        if ((info->state & (1<<MD_DISK_SYNC)) &&
> >            (!test_bit(In_sync, &rdev->flags) ||
> >             rdev->raid_disk != info->raid_disk)) {
> >                /* This was a hot-add request, but events doesn't
> >                 * match, so reject it.
> >                 */
> >                export_rdev(rdev);
> >                return -EINVAL;
> >        }
> >
> > So the first part of "OR" succeeds and we abort.
> >
> > An observation: this problem only happens with a fresh drive
> > apparently. If drive that was already In_sync fails and then is
> > re-added, kernel does not update its superblock until recovery
> > completes. So its superblock has no valid recovery_offset, and this
> > problem doesn't happen. The code that skips updating the superblock
> > was added as part of this email thread:
> > http://www.spinics.net/lists/raid/msg36275.html. However, I did not
> > dig deeper why the superblock gets updated in my case. But the
> > scenario above repros easily.
> >
> > Q1: Can you confirm that my analysis is correct?
> > Q2: Is this an expected behavior? I would assume that no, because the
> > same case for a failed drive (not a fresh drive) works fine
> > Q3: How would you recommend to handle this?
> >
> > Thanks!
> > Alex.
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)
  2012-06-07 11:24   ` NeilBrown
@ 2012-06-07 12:47     ` Alexander Lyakas
  2012-06-27  7:22       ` NeilBrown
  2012-06-08  8:59     ` Thanks (was Re: Problem with patch...) John Robinson
  1 sibling, 1 reply; 14+ messages in thread
From: Alexander Lyakas @ 2012-06-07 12:47 UTC (permalink / raw)
  To: NeilBrown; +Cc: andrey.warkentin, linux-raid
Thanks for commenting, Neil,
On Thu, Jun 7, 2012 at 2:24 PM, NeilBrown <neilb@suse.de> wrote:
> On Thu, 7 Jun 2012 10:22:24 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Hi again Neil, and Andrey,
>>
>> looking at this email thread:
>> http://www.spinics.net/lists/raid/msg36236.html
>> between you and Andrey, the conclusion was:
>> "So the correct thing to do is to *not* update the metadata on the
>> recovering device until recovery completes.  Then if it fails and is
>> re-added, it will look just the same as when it was re-added the first
>> time, and will do a bitmap-based recovery."
>>
>> I have two doubts about this decision:
>> 1) Since the event count on the recovering drive is not updated, this
>> means that after reboot, when array is re-assembled, this drive will
>> not be added to the array, and the user will have to manually re-add
>> it. I agree this is a minor thing.
>
> Still, if it can be fixed it should be.
>
>> 2) There are places in mdadm, which check for recovery_offset on the
>> drive and take decisions based upon that. Specifically, if there is
>> *no* recovery offset, the data on this drive is considered to be
>> consistent WRT to the failure time of the drive. So, for example, the
>> drive can be a candidate for bumping up events during "force
>> assembly". Now, when superblock on such drive is not updated during
>> recovery (so there is *no* recovery offset), mdadm will think that the
>> drive is consistent, while in fact, its data is totally unusable until
>> after recovery completes. That's because we have updated parts of the
>> drive, but did not complete bringing the whole drive in-sync.
>
> If mdadm would consider updating the event count if not recovery had started,
> then surely it is just as valid to do so once some recovery has started, even
> if it hasn't completed.
The patch you accepted from me ("Don't consider disks with a valid
recovery offset as candidates for bumping up event count") actually
attempts to protect from that:)
I don't understand why "it is just as valid to do so once some
recovery has started". My understanding is that once recovery of a
drive has started, its data is not consistent between different parts
of the drive, until the recovery completes. This is because md does
bitmap-based recovery, and not kind of journal/transaction-log based
recovery.
However, one could argue that for force-assembly case, when data
anyways can come up as partially corrupted, this is less important.
I would still think that there is value in recoding in a superblock
that a drive is recovering.
>
>>
>> Can you pls comment on my doubts?
>
> I think there is probably room for improvement here but I don't think there
> are any serious problems.
>
> However I'm about to go on leave for a couple of week so I'm unlikely to
> think about it for a while. I've made a note to look at it properly when I
> get back.
>
Indeed, don't you think about this while you are resting!
Alex.
>
>>
>> Thanks,
>> Alex.
>>
>>
>> On Wed, Jun 6, 2012 at 7:09 PM, Alexander Lyakas <alex.bolshoy@gmail.com> wrote:
>> > Hi Neil,
>> > it looks like we are hitting an issue with this patch.
>> > Kernel is stock ubuntu-precise (3.2.0-23 36), mdadm is v3.2.3. (We are
>> > aware that this version of mdadm has an issue of unaligned writes, so
>> > we backported the fix to it into this version).
>> >
>> > We are testing the following simple scenario:
>> > # 2-way raid1 with a missing drive
>> > # a fresh drive (no superblock) is added (not re-added) to the raid,
>> > and the raid happily starts rebuilding it
>> > # at some point in time, this drive fails, so rebuild aborts, drive is
>> > kicked out.
>> > # later the drive recovers and "mdadm remove & re-add" is attempted
>> > # re-add fails
>> >
>> > Some debugging suggests the following:
>> >
>> > - The superblock of the drive to be re-added looks like this:
>> >
>> >  Avail Dev Size : 10483712 (5.00 GiB 5.37 GB)
>> >     Array Size : 10483688 (5.00 GiB 5.37 GB)
>> >  Used Dev Size : 10483688 (5.00 GiB 5.37 GB)
>> >    Data Offset : 2048 sectors
>> >   Super Offset : 8 sectors
>> > Recovery Offset : 100096 sectors
>> >          State : clean
>> >  Resync Offset : N/A sectors
>> >    Device UUID : d9114d02:76153894:87aaf09d:c55da9be
>> > Internal Bitmap : 8 sectors from superblock
>> >    Update Time : Wed Jun  6 18:36:55 2012
>> >       Checksum : 94468c9d - correct
>> >         Events : 192
>> >   Device Role : Active device 1
>> >   Array State : AA ('A' == active, '.' == missing)
>> >
>> >
>> > So the drive has a valid recovery offset (MD_FEATURE_RECOVERY_OFFSET
>> > set) and a valid role.
>> >
>> > - The "re-add" path of mdadm performs calls getinfo_super1(), which performs:
>> > role = __le16_to_cpu(sb->dev_roles[__le32_to_cpu(sb->dev_number)]);
>> > ...
>> > info->disk.state = 6; /* active and in sync */
>> > info->disk.raid_disk = role;
>> >
>> > So at this point, MD_DISK_ACTIVE and MD_DISK_SYNC are set.
>> >
>> > - mdadm issues ADD_NEW_DISK, which arrives to add_new_disk() in the kernel
>> > - the kernel performs:
>> > if (!mddev->persistent) {
>> > ... // not relevant to our case
>> > } else
>> >        super_types[mddev->major_version].
>> >                validate_super(mddev, rdev);
>> >
>> > - validate_super() performs:
>> >        if ((le32_to_cpu(sb->feature_map) &
>> >             MD_FEATURE_RECOVERY_OFFSET))
>> >                rdev->recovery_offset = le64_to_cpu(sb->recovery_offset);
>> >        else
>> >                set_bit(In_sync, &rdev->flags);
>> >        rdev->raid_disk = role;
>> >
>> > So In_sync flag is not set, because we have a recovery offset.
>> >
>> > - Then the code introduced by the patch does:
>> >        if ((info->state & (1<<MD_DISK_SYNC)) &&
>> >            (!test_bit(In_sync, &rdev->flags) ||
>> >             rdev->raid_disk != info->raid_disk)) {
>> >                /* This was a hot-add request, but events doesn't
>> >                 * match, so reject it.
>> >                 */
>> >                export_rdev(rdev);
>> >                return -EINVAL;
>> >        }
>> >
>> > So the first part of "OR" succeeds and we abort.
>> >
>> > An observation: this problem only happens with a fresh drive
>> > apparently. If drive that was already In_sync fails and then is
>> > re-added, kernel does not update its superblock until recovery
>> > completes. So its superblock has no valid recovery_offset, and this
>> > problem doesn't happen. The code that skips updating the superblock
>> > was added as part of this email thread:
>> > http://www.spinics.net/lists/raid/msg36275.html. However, I did not
>> > dig deeper why the superblock gets updated in my case. But the
>> > scenario above repros easily.
>> >
>> > Q1: Can you confirm that my analysis is correct?
>> > Q2: Is this an expected behavior? I would assume that no, because the
>> > same case for a failed drive (not a fresh drive) works fine
>> > Q3: How would you recommend to handle this?
>> >
>> > Thanks!
>> > Alex.
>
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
^ permalink raw reply	[flat|nested] 14+ messages in thread
* Thanks (was Re: Problem with patch...)
  2012-06-07 11:24   ` NeilBrown
  2012-06-07 12:47     ` Alexander Lyakas
@ 2012-06-08  8:59     ` John Robinson
  1 sibling, 0 replies; 14+ messages in thread
From: John Robinson @ 2012-06-08  8:59 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
On 07/06/2012 12:24, NeilBrown wrote:
[...]
> I'm about to go on leave for a couple of week
[...]
I hope you have a good relaxing holiday. Thanks again for your excellent 
work here.
Dear Rest Of The World: don't do anything silly to your RAID arrays in 
the next fortnight!
Cheers,
John.
^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)
  2012-06-07 12:47     ` Alexander Lyakas
@ 2012-06-27  7:22       ` NeilBrown
  2012-06-27 16:40         ` Alexander Lyakas
  2014-02-19  9:51         ` Alexander Lyakas
  0 siblings, 2 replies; 14+ messages in thread
From: NeilBrown @ 2012-06-27  7:22 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: andrey.warkentin, linux-raid
[-- Attachment #1: Type: text/plain, Size: 4380 bytes --]
On Thu, 7 Jun 2012 15:47:46 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:
> Thanks for commenting, Neil,
> 
> On Thu, Jun 7, 2012 at 2:24 PM, NeilBrown <neilb@suse.de> wrote:
> > On Thu, 7 Jun 2012 10:22:24 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> > wrote:
> >
> >> Hi again Neil, and Andrey,
> >>
> >> looking at this email thread:
> >> http://www.spinics.net/lists/raid/msg36236.html
> >> between you and Andrey, the conclusion was:
> >> "So the correct thing to do is to *not* update the metadata on the
> >> recovering device until recovery completes.  Then if it fails and is
> >> re-added, it will look just the same as when it was re-added the first
> >> time, and will do a bitmap-based recovery."
> >>
> >> I have two doubts about this decision:
> >> 1) Since the event count on the recovering drive is not updated, this
> >> means that after reboot, when array is re-assembled, this drive will
> >> not be added to the array, and the user will have to manually re-add
> >> it. I agree this is a minor thing.
> >
> > Still, if it can be fixed it should be.
> >
> >> 2) There are places in mdadm, which check for recovery_offset on the
> >> drive and take decisions based upon that. Specifically, if there is
> >> *no* recovery offset, the data on this drive is considered to be
> >> consistent WRT to the failure time of the drive. So, for example, the
> >> drive can be a candidate for bumping up events during "force
> >> assembly". Now, when superblock on such drive is not updated during
> >> recovery (so there is *no* recovery offset), mdadm will think that the
> >> drive is consistent, while in fact, its data is totally unusable until
> >> after recovery completes. That's because we have updated parts of the
> >> drive, but did not complete bringing the whole drive in-sync.
> >
> > If mdadm would consider updating the event count if not recovery had started,
> > then surely it is just as valid to do so once some recovery has started, even
> > if it hasn't completed.
> 
> The patch you accepted from me ("Don't consider disks with a valid
> recovery offset as candidates for bumping up event count") actually
> attempts to protect from that:)
> 
> I don't understand why "it is just as valid to do so once some
> recovery has started". My understanding is that once recovery of a
> drive has started, its data is not consistent between different parts
> of the drive, until the recovery completes. This is because md does
> bitmap-based recovery, and not kind of journal/transaction-log based
> recovery.
> 
> However, one could argue that for force-assembly case, when data
> anyways can come up as partially corrupted, this is less important.
Exactly.  And mdadm only updates event counts in the force-assembly case so
while it might not be ideal, it is the best we can do.
> 
> I would still think that there is value in recoding in a superblock
> that a drive is recovering.
Probably.  It is a bit unfortunate that if you stop an array that is
recovering after a --re-add, you cannot simply 'assemble' it again and
get it back to the same state.
I'll think more on that.
Meanwhile, this patch might address your other problem.  It allows --re-add
to work if a non-bitmap rebuild fails and is then re-added.
diff --git a/drivers/md/md.c b/drivers/md/md.c
index c601c4b..d31852e 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5784,7 +5784,7 @@ static int add_new_disk(struct mddev * mddev, mdu_disk_info_t *info)
 			super_types[mddev->major_version].
 				validate_super(mddev, rdev);
 		if ((info->state & (1<<MD_DISK_SYNC)) &&
-		    (!test_bit(In_sync, &rdev->flags) ||
+		    (test_bit(Faulty, &rdev->flags) ||
 		     rdev->raid_disk != info->raid_disk)) {
 			/* This was a hot-add request, but events doesn't
 			 * match, so reject it.
> 
> >
> >>
> >> Can you pls comment on my doubts?
> >
> > I think there is probably room for improvement here but I don't think there
> > are any serious problems.
> >
> > However I'm about to go on leave for a couple of week so I'm unlikely to
> > think about it for a while. I've made a note to look at it properly when I
> > get back.
> >
> 
> Indeed, don't you think about this while you are resting!
:-)
Thanks.  I did have a very enjoyable break.
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply related	[flat|nested] 14+ messages in thread
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)
  2012-06-27  7:22       ` NeilBrown
@ 2012-06-27 16:40         ` Alexander Lyakas
  2012-07-02  7:57           ` NeilBrown
  2014-02-19  9:51         ` Alexander Lyakas
  1 sibling, 1 reply; 14+ messages in thread
From: Alexander Lyakas @ 2012-06-27 16:40 UTC (permalink / raw)
  To: NeilBrown; +Cc: andrey.warkentin, linux-raid
Hi Neil,
>>
>> I would still think that there is value in recoding in a superblock
>> that a drive is recovering.
>
> Probably.  It is a bit unfortunate that if you stop an array that is
> recovering after a --re-add, you cannot simply 'assemble' it again and
> get it back to the same state.
> I'll think more on that.
As I mentioned, I see the additional re-add as a minor thing, but
agree it's better to fix it. The fact that we don't know that the
drive is being recovered, bothers me more. Because user might look at
the superblock, and assume the data on the drive is consistent to some
point in time (time of the drive failure). While the actual data,
while doing bitmap-based recovery, is unusable until recovery
successfully completes. So the user might think it's okay to try to
run his app on this drive.
Yes, please think about this.
>
> Meanwhile, this patch might address your other problem.  It allows --re-add
> to work if a non-bitmap rebuild fails and is then re-added.
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index c601c4b..d31852e 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5784,7 +5784,7 @@ static int add_new_disk(struct mddev * mddev, mdu_disk_info_t *info)
>                        super_types[mddev->major_version].
>                                validate_super(mddev, rdev);
>                if ((info->state & (1<<MD_DISK_SYNC)) &&
> -                   (!test_bit(In_sync, &rdev->flags) ||
> +                   (test_bit(Faulty, &rdev->flags) ||
>                     rdev->raid_disk != info->raid_disk)) {
>                        /* This was a hot-add request, but events doesn't
>                         * match, so reject it.
>
I have tested a slightly different patch that you suggested earlier -
just removing the !test_bit(In_sync, &rdev->flags) check. I confirm
that it solves the problem.
The Faulty bit check seems redundant to me, because:
- it can be set by only by validate_super() and only if that drive's
role is 0xfffe in sb->roles[] array
- Long time ago I asked you, how can it happen that a drive thinks
about *itself* that it is Faulty (has 0xfffe for its role in its own
superblock), and you said this should never happen.
Anyways, I tested also the patch you suggested, and it also works.
Is there any chance to see this fix in ubuntu-precise?
Thanks again for your support,
Alex.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)
  2012-06-27 16:40         ` Alexander Lyakas
@ 2012-07-02  7:57           ` NeilBrown
  2012-07-02 17:30             ` John Gehring
  0 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2012-07-02  7:57 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: andrey.warkentin, linux-raid
[-- Attachment #1: Type: text/plain, Size: 3245 bytes --]
On Wed, 27 Jun 2012 19:40:52 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:
> Hi Neil,
> 
> >>
> >> I would still think that there is value in recoding in a superblock
> >> that a drive is recovering.
> >
> > Probably.  It is a bit unfortunate that if you stop an array that is
> > recovering after a --re-add, you cannot simply 'assemble' it again and
> > get it back to the same state.
> > I'll think more on that.
> 
> As I mentioned, I see the additional re-add as a minor thing, but
> agree it's better to fix it. The fact that we don't know that the
> drive is being recovered, bothers me more. Because user might look at
> the superblock, and assume the data on the drive is consistent to some
> point in time (time of the drive failure). While the actual data,
> while doing bitmap-based recovery, is unusable until recovery
> successfully completes. So the user might think it's okay to try to
> run his app on this drive.
> Yes, please think about this.
> 
> >
> > Meanwhile, this patch might address your other problem.  It allows --re-add
> > to work if a non-bitmap rebuild fails and is then re-added.
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index c601c4b..d31852e 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -5784,7 +5784,7 @@ static int add_new_disk(struct mddev * mddev, mdu_disk_info_t *info)
> >                        super_types[mddev->major_version].
> >                                validate_super(mddev, rdev);
> >                if ((info->state & (1<<MD_DISK_SYNC)) &&
> > -                   (!test_bit(In_sync, &rdev->flags) ||
> > +                   (test_bit(Faulty, &rdev->flags) ||
> >                     rdev->raid_disk != info->raid_disk)) {
> >                        /* This was a hot-add request, but events doesn't
> >                         * match, so reject it.
> >
> 
> I have tested a slightly different patch that you suggested earlier -
> just removing the !test_bit(In_sync, &rdev->flags) check. I confirm
> that it solves the problem.
> 
> The Faulty bit check seems redundant to me, because:
> - it can be set by only by validate_super() and only if that drive's
> role is 0xfffe in sb->roles[] array
> - Long time ago I asked you, how can it happen that a drive thinks
> about *itself* that it is Faulty (has 0xfffe for its role in its own
> superblock), and you said this should never happen.
Yes, you are right.  I've remove the test on 'Faulty' - the test on the
raid_disk number is sufficient.
> 
> Anyways, I tested also the patch you suggested, and it also works.
Thanks!
> 
> Is there any chance to see this fix in ubuntu-precise?
Not really up to me.  It doesn't fix a crash or corruption so I'm not sure it
is -stable material .... though maybe it is if it fixes a regression.
I suggest you ask the Ubuntu kernel people after it appears in 3.5-rc
(hopefully later this week).
Thanks,
NeilBrown
> 
> Thanks again for your support,
> Alex.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)
  2012-07-02  7:57           ` NeilBrown
@ 2012-07-02 17:30             ` John Gehring
  0 siblings, 0 replies; 14+ messages in thread
From: John Gehring @ 2012-07-02 17:30 UTC (permalink / raw)
  To: linux-raid
I have what I think is a closely related issue, but since this thread
is pretty long already, I will start with a brief description to see
if this should move to a new thread.
I am seeing a repeatable instance where a drive is added to a raid
set, then almost immediately rejected.
- linux 3.2.15 from kernel.org
- mdadm 3.2.5, but I saw this with 3.2, as well
- 8 drive, raid 6
Sequence:
Remove the device
- md created and fully synced. (/dev/md1 in my case)
- no IO
- pull a drive
- issue dd read to /dev/md1 to get it to recognize missing device
- mdadm /dev/md1 -r /dev/sdc2 (for example) to remove device from the md
Add device back in
- insert drive
- mdadm --zero-superblock /dev/sdc2 (or sd<x>, whatever it came back in as)
- mdadm /dev/md1 -a /dev/sdc2
I see the following about 50% of the time:
[ 1813.355806] md: bind<sdc2>
[ 1813.367430] md: recovery of RAID array md1
[ 1813.371570] md: minimum _guaranteed_  speed: 80000 KB/sec/disk.
[ 1813.377481] md: using maximum available idle IO bandwidth (but not
more than 800000 KB/sec) for recovery.
[ 1813.387039] md: using 128k window, over a total of 62467072k.
[ 1813.392789] md/raid:md1: Disk failure on sdc2, disabling device.
[ 1813.392791] md/raid:md1: Operation continuing on 7 devices.
[ 1813.404346] md: md1: recovery done.
I recently found that after this series of events, if I once again
remove the drive from the raid set (mdadm --remove), and do a read of
the md device via dd, I can then successfully add the device back to
the md with the --zero-superblock command followed by the mdadm --add
command. That 'trick' has worked in the first three times that I've
tried it, so not a lot of data points, but interesting nonetheless.
Finally, I applied the patch referenced earlier in this thread and it
did not affect the behavior I've described above.
This has happened on different systems and with different drives; i.e.
it is not a drive failure issue.
Thoughts? Thanks.
On Mon, Jul 2, 2012 at 1:57 AM, NeilBrown <neilb@suse.de> wrote:
>
> On Wed, 27 Jun 2012 19:40:52 +0300 Alexander Lyakas
> <alex.bolshoy@gmail.com>
> wrote:
>
> > Hi Neil,
> >
> > >>
> > >> I would still think that there is value in recoding in a superblock
> > >> that a drive is recovering.
> > >
> > > Probably.  It is a bit unfortunate that if you stop an array that is
> > > recovering after a --re-add, you cannot simply 'assemble' it again and
> > > get it back to the same state.
> > > I'll think more on that.
> >
> > As I mentioned, I see the additional re-add as a minor thing, but
> > agree it's better to fix it. The fact that we don't know that the
> > drive is being recovered, bothers me more. Because user might look at
> > the superblock, and assume the data on the drive is consistent to some
> > point in time (time of the drive failure). While the actual data,
> > while doing bitmap-based recovery, is unusable until recovery
> > successfully completes. So the user might think it's okay to try to
> > run his app on this drive.
> > Yes, please think about this.
> >
> > >
> > > Meanwhile, this patch might address your other problem.  It allows
> > > --re-add
> > > to work if a non-bitmap rebuild fails and is then re-added.
> > >
> > > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > > index c601c4b..d31852e 100644
> > > --- a/drivers/md/md.c
> > > +++ b/drivers/md/md.c
> > > @@ -5784,7 +5784,7 @@ static int add_new_disk(struct mddev * mddev,
> > > mdu_disk_info_t *info)
> > >                        super_types[mddev->major_version].
> > >                                validate_super(mddev, rdev);
> > >                if ((info->state & (1<<MD_DISK_SYNC)) &&
> > > -                   (!test_bit(In_sync, &rdev->flags) ||
> > > +                   (test_bit(Faulty, &rdev->flags) ||
> > >                     rdev->raid_disk != info->raid_disk)) {
> > >                        /* This was a hot-add request, but events
> > > doesn't
> > >                         * match, so reject it.
> > >
> >
> > I have tested a slightly different patch that you suggested earlier -
> > just removing the !test_bit(In_sync, &rdev->flags) check. I confirm
> > that it solves the problem.
> >
> > The Faulty bit check seems redundant to me, because:
> > - it can be set by only by validate_super() and only if that drive's
> > role is 0xfffe in sb->roles[] array
> > - Long time ago I asked you, how can it happen that a drive thinks
> > about *itself* that it is Faulty (has 0xfffe for its role in its own
> > superblock), and you said this should never happen.
>
> Yes, you are right.  I've remove the test on 'Faulty' - the test on the
> raid_disk number is sufficient.
>
> >
> > Anyways, I tested also the patch you suggested, and it also works.
>
> Thanks!
>
> >
> > Is there any chance to see this fix in ubuntu-precise?
>
> Not really up to me.  It doesn't fix a crash or corruption so I'm not sure
> it
> is -stable material .... though maybe it is if it fixes a regression.
>
> I suggest you ask the Ubuntu kernel people after it appears in 3.5-rc
> (hopefully later this week).
>
> Thanks,
> NeilBrown
>
>
> >
> > Thanks again for your support,
> > Alex.
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)
  2012-06-27  7:22       ` NeilBrown
  2012-06-27 16:40         ` Alexander Lyakas
@ 2014-02-19  9:51         ` Alexander Lyakas
  2014-02-26 13:02           ` Alexander Lyakas
  1 sibling, 1 reply; 14+ messages in thread
From: Alexander Lyakas @ 2014-02-19  9:51 UTC (permalink / raw)
  To: NeilBrown, linux-raid; +Cc: Andrei Warkentin, Yair Hershko
Hello Neil, Andrei,
I don't how much you recall of this old discussion.
Basically, the change that you made means that md doesn't update the
superblock on the recovering device, until recovery completes. As a
result, when assembling such array, the recovering device has an old
event count in the superblock and is not picked during assembly. So
later, user has to re-add it manually.
This is true for a device that failed and was re-added. For a fresh
device, saved_raid_disk==-1, so its superblock will still be updated
(and sucn device will be picked on assembly).
On the other hand, MD updates the superblock of the In_sync devices
and marks the recovering device as having a valid array slot (not
0xFFFF or 0xFFFE, which are treated similarly today).
What do you think of the following change:
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 561a65f..4bbc7e3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1854,8 +1854,6 @@ retry:
                        sb->dev_roles[i] = cpu_to_le16(0xfffe);
                else if (test_bit(In_sync, &rdev2->flags))
                        sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
-               else if (rdev2->raid_disk >= 0)
-                       sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
                else
                        sb->dev_roles[i] = cpu_to_le16(0xffff);
        }
Basically, we do not mark the slot of the recovering device, until its
recovery completes. So during assembly, we will not pick it, as
before. For a fresh drive, there will be a regression - we will not
pick it as well on assembly.
The reason I am considering this change is another old discussion that
we had - considering split-brain resolution, and the proposal I made
in:
https://docs.google.com/document/d/1sgO7NgvIFBDccoI3oXp9FNzB6RA5yMwqVN3_-LMSDNE
Basically, when MD marks recovering devices as having a valid raid
slot in the superblock of In_sync devices, then on array assembly we
don't know whether sucn device is recovering or In_sync (if it is
inaccessible during assembly). So we have to assume it is In_sync and
thus can potentially cause split-brain.
What do you think of this change?
Thanks,
Alex.
On Wed, Jun 27, 2012 at 9:22 AM, NeilBrown <neilb@suse.de> wrote:
> On Thu, 7 Jun 2012 15:47:46 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Thanks for commenting, Neil,
>>
>> On Thu, Jun 7, 2012 at 2:24 PM, NeilBrown <neilb@suse.de> wrote:
>> > On Thu, 7 Jun 2012 10:22:24 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
>> > wrote:
>> >
>> >> Hi again Neil, and Andrey,
>> >>
>> >> looking at this email thread:
>> >> http://www.spinics.net/lists/raid/msg36236.html
>> >> between you and Andrey, the conclusion was:
>> >> "So the correct thing to do is to *not* update the metadata on the
>> >> recovering device until recovery completes.  Then if it fails and is
>> >> re-added, it will look just the same as when it was re-added the first
>> >> time, and will do a bitmap-based recovery."
>> >>
>> >> I have two doubts about this decision:
>> >> 1) Since the event count on the recovering drive is not updated, this
>> >> means that after reboot, when array is re-assembled, this drive will
>> >> not be added to the array, and the user will have to manually re-add
>> >> it. I agree this is a minor thing.
>> >
>> > Still, if it can be fixed it should be.
>> >
>> >> 2) There are places in mdadm, which check for recovery_offset on the
>> >> drive and take decisions based upon that. Specifically, if there is
>> >> *no* recovery offset, the data on this drive is considered to be
>> >> consistent WRT to the failure time of the drive. So, for example, the
>> >> drive can be a candidate for bumping up events during "force
>> >> assembly". Now, when superblock on such drive is not updated during
>> >> recovery (so there is *no* recovery offset), mdadm will think that the
>> >> drive is consistent, while in fact, its data is totally unusable until
>> >> after recovery completes. That's because we have updated parts of the
>> >> drive, but did not complete bringing the whole drive in-sync.
>> >
>> > If mdadm would consider updating the event count if not recovery had started,
>> > then surely it is just as valid to do so once some recovery has started, even
>> > if it hasn't completed.
>>
>> The patch you accepted from me ("Don't consider disks with a valid
>> recovery offset as candidates for bumping up event count") actually
>> attempts to protect from that:)
>>
>> I don't understand why "it is just as valid to do so once some
>> recovery has started". My understanding is that once recovery of a
>> drive has started, its data is not consistent between different parts
>> of the drive, until the recovery completes. This is because md does
>> bitmap-based recovery, and not kind of journal/transaction-log based
>> recovery.
>>
>> However, one could argue that for force-assembly case, when data
>> anyways can come up as partially corrupted, this is less important.
>
> Exactly.  And mdadm only updates event counts in the force-assembly case so
> while it might not be ideal, it is the best we can do.
>
>>
>> I would still think that there is value in recoding in a superblock
>> that a drive is recovering.
>
> Probably.  It is a bit unfortunate that if you stop an array that is
> recovering after a --re-add, you cannot simply 'assemble' it again and
> get it back to the same state.
> I'll think more on that.
>
> Meanwhile, this patch might address your other problem.  It allows --re-add
> to work if a non-bitmap rebuild fails and is then re-added.
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index c601c4b..d31852e 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5784,7 +5784,7 @@ static int add_new_disk(struct mddev * mddev, mdu_disk_info_t *info)
>                         super_types[mddev->major_version].
>                                 validate_super(mddev, rdev);
>                 if ((info->state & (1<<MD_DISK_SYNC)) &&
> -                   (!test_bit(In_sync, &rdev->flags) ||
> +                   (test_bit(Faulty, &rdev->flags) ||
>                      rdev->raid_disk != info->raid_disk)) {
>                         /* This was a hot-add request, but events doesn't
>                          * match, so reject it.
>
>
>>
>> >
>> >>
>> >> Can you pls comment on my doubts?
>> >
>> > I think there is probably room for improvement here but I don't think there
>> > are any serious problems.
>> >
>> > However I'm about to go on leave for a couple of week so I'm unlikely to
>> > think about it for a while. I've made a note to look at it properly when I
>> > get back.
>> >
>>
>> Indeed, don't you think about this while you are resting!
>
> :-)
>
>
> Thanks.  I did have a very enjoyable break.
>
> NeilBrown
^ permalink raw reply related	[flat|nested] 14+ messages in thread
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)
  2014-02-19  9:51         ` Alexander Lyakas
@ 2014-02-26 13:02           ` Alexander Lyakas
  2014-02-27  3:19             ` NeilBrown
  0 siblings, 1 reply; 14+ messages in thread
From: Alexander Lyakas @ 2014-02-26 13:02 UTC (permalink / raw)
  To: NeilBrown, linux-raid; +Cc: Andrei Warkentin, Yair Hershko
HI Neil,
can you please comment what do you think of this change.
Thanks,
Alex.
On Wed, Feb 19, 2014 at 11:51 AM, Alexander Lyakas
<alex.bolshoy@gmail.com> wrote:
> Hello Neil, Andrei,
> I don't how much you recall of this old discussion.
>
> Basically, the change that you made means that md doesn't update the
> superblock on the recovering device, until recovery completes. As a
> result, when assembling such array, the recovering device has an old
> event count in the superblock and is not picked during assembly. So
> later, user has to re-add it manually.
> This is true for a device that failed and was re-added. For a fresh
> device, saved_raid_disk==-1, so its superblock will still be updated
> (and sucn device will be picked on assembly).
>
> On the other hand, MD updates the superblock of the In_sync devices
> and marks the recovering device as having a valid array slot (not
> 0xFFFF or 0xFFFE, which are treated similarly today).
>
> What do you think of the following change:
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 561a65f..4bbc7e3 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1854,8 +1854,6 @@ retry:
>                         sb->dev_roles[i] = cpu_to_le16(0xfffe);
>                 else if (test_bit(In_sync, &rdev2->flags))
>                         sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
> -               else if (rdev2->raid_disk >= 0)
> -                       sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
>                 else
>                         sb->dev_roles[i] = cpu_to_le16(0xffff);
>         }
>
> Basically, we do not mark the slot of the recovering device, until its
> recovery completes. So during assembly, we will not pick it, as
> before. For a fresh drive, there will be a regression - we will not
> pick it as well on assembly.
>
> The reason I am considering this change is another old discussion that
> we had - considering split-brain resolution, and the proposal I made
> in:
> https://docs.google.com/document/d/1sgO7NgvIFBDccoI3oXp9FNzB6RA5yMwqVN3_-LMSDNE
>
> Basically, when MD marks recovering devices as having a valid raid
> slot in the superblock of In_sync devices, then on array assembly we
> don't know whether sucn device is recovering or In_sync (if it is
> inaccessible during assembly). So we have to assume it is In_sync and
> thus can potentially cause split-brain.
>
>
> What do you think of this change?
>
> Thanks,
> Alex.
>
>
> On Wed, Jun 27, 2012 at 9:22 AM, NeilBrown <neilb@suse.de> wrote:
>> On Thu, 7 Jun 2012 15:47:46 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
>> wrote:
>>
>>> Thanks for commenting, Neil,
>>>
>>> On Thu, Jun 7, 2012 at 2:24 PM, NeilBrown <neilb@suse.de> wrote:
>>> > On Thu, 7 Jun 2012 10:22:24 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
>>> > wrote:
>>> >
>>> >> Hi again Neil, and Andrey,
>>> >>
>>> >> looking at this email thread:
>>> >> http://www.spinics.net/lists/raid/msg36236.html
>>> >> between you and Andrey, the conclusion was:
>>> >> "So the correct thing to do is to *not* update the metadata on the
>>> >> recovering device until recovery completes.  Then if it fails and is
>>> >> re-added, it will look just the same as when it was re-added the first
>>> >> time, and will do a bitmap-based recovery."
>>> >>
>>> >> I have two doubts about this decision:
>>> >> 1) Since the event count on the recovering drive is not updated, this
>>> >> means that after reboot, when array is re-assembled, this drive will
>>> >> not be added to the array, and the user will have to manually re-add
>>> >> it. I agree this is a minor thing.
>>> >
>>> > Still, if it can be fixed it should be.
>>> >
>>> >> 2) There are places in mdadm, which check for recovery_offset on the
>>> >> drive and take decisions based upon that. Specifically, if there is
>>> >> *no* recovery offset, the data on this drive is considered to be
>>> >> consistent WRT to the failure time of the drive. So, for example, the
>>> >> drive can be a candidate for bumping up events during "force
>>> >> assembly". Now, when superblock on such drive is not updated during
>>> >> recovery (so there is *no* recovery offset), mdadm will think that the
>>> >> drive is consistent, while in fact, its data is totally unusable until
>>> >> after recovery completes. That's because we have updated parts of the
>>> >> drive, but did not complete bringing the whole drive in-sync.
>>> >
>>> > If mdadm would consider updating the event count if not recovery had started,
>>> > then surely it is just as valid to do so once some recovery has started, even
>>> > if it hasn't completed.
>>>
>>> The patch you accepted from me ("Don't consider disks with a valid
>>> recovery offset as candidates for bumping up event count") actually
>>> attempts to protect from that:)
>>>
>>> I don't understand why "it is just as valid to do so once some
>>> recovery has started". My understanding is that once recovery of a
>>> drive has started, its data is not consistent between different parts
>>> of the drive, until the recovery completes. This is because md does
>>> bitmap-based recovery, and not kind of journal/transaction-log based
>>> recovery.
>>>
>>> However, one could argue that for force-assembly case, when data
>>> anyways can come up as partially corrupted, this is less important.
>>
>> Exactly.  And mdadm only updates event counts in the force-assembly case so
>> while it might not be ideal, it is the best we can do.
>>
>>>
>>> I would still think that there is value in recoding in a superblock
>>> that a drive is recovering.
>>
>> Probably.  It is a bit unfortunate that if you stop an array that is
>> recovering after a --re-add, you cannot simply 'assemble' it again and
>> get it back to the same state.
>> I'll think more on that.
>>
>> Meanwhile, this patch might address your other problem.  It allows --re-add
>> to work if a non-bitmap rebuild fails and is then re-added.
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index c601c4b..d31852e 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -5784,7 +5784,7 @@ static int add_new_disk(struct mddev * mddev, mdu_disk_info_t *info)
>>                         super_types[mddev->major_version].
>>                                 validate_super(mddev, rdev);
>>                 if ((info->state & (1<<MD_DISK_SYNC)) &&
>> -                   (!test_bit(In_sync, &rdev->flags) ||
>> +                   (test_bit(Faulty, &rdev->flags) ||
>>                      rdev->raid_disk != info->raid_disk)) {
>>                         /* This was a hot-add request, but events doesn't
>>                          * match, so reject it.
>>
>>
>>>
>>> >
>>> >>
>>> >> Can you pls comment on my doubts?
>>> >
>>> > I think there is probably room for improvement here but I don't think there
>>> > are any serious problems.
>>> >
>>> > However I'm about to go on leave for a couple of week so I'm unlikely to
>>> > think about it for a while. I've made a note to look at it properly when I
>>> > get back.
>>> >
>>>
>>> Indeed, don't you think about this while you are resting!
>>
>> :-)
>>
>>
>> Thanks.  I did have a very enjoyable break.
>>
>> NeilBrown
^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)
  2014-02-26 13:02           ` Alexander Lyakas
@ 2014-02-27  3:19             ` NeilBrown
  2014-03-02 11:01               ` Alexander Lyakas
  0 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2014-02-27  3:19 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid, Andrei Warkentin, Yair Hershko
[-- Attachment #1: Type: text/plain, Size: 10195 bytes --]
On Wed, 26 Feb 2014 15:02:27 +0200 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:
> HI Neil,
> can you please comment what do you think of this change.
Sorry for not replying earlier.
I think the core problem you are addressing is fixed by:
commit f466722ca614edcd14f3337373f33132117c7612
Author: NeilBrown <neilb@suse.de>
Date:   Mon Dec 9 12:04:56 2013 +1100
    md: Change handling of save_raid_disk and metadata update during recovery.
Can you confirm if that is the case?
With respect to the split-brain issue, I see that you would like every
superblock to know which devices are truly operational and which are still
recovering - a distinction which my current code doesn't allow.
You want to know this because if you can see only 2 drives from a 4-drive
RAID6 then:
 - if the other 2 devices are believed to be fully functional, then we risk a
   split-brain situation and shouldn't start the array
 - if either of the other devices is missing or still recovering, then there
   is no risk of split brain and the array can be started.
With the current code you might refuse to start an array because it looks
like it both other drives could be fully functional where in fact they are
still recovering.
This at least fails safely, but it may be that we want to absolutely minimise
the number of false-failures.
However I would hope that the total amount of time that the array spends
recovering a device would be a very small fraction of the total time that the
array is active.  If that is true, then the number of times you get a crash
during recovery will be a small fraction of the total crashes, so the number
of false-failures should be a small fraction of the reported failures.
So I don't see this as a big problem.
With respect to the "biggest problem" you note in the split-brain document
you reference:  this is an issue that has been in the back of my mind for
many years, wondering how much I should care and to what extent it should be
"fixed".  Whenever I have come close to fixing it, I've managed to convince
myself that I shouldn't.
We only record a device as 'failed' if a write fails.  And if a write fails
then that device is clearly inconsistent with the rest of the array.  So it
really seems appropriate to record that fact.
I could possibly be convinced that there comes a point where updating
superblocks any further is completely pointless, but I would need to be
certain that it wouldn't leave a suggestion that the array was more coherent
than it really is.
Thanks,
NeilBrown
> 
> Thanks,
> Alex.
> 
> On Wed, Feb 19, 2014 at 11:51 AM, Alexander Lyakas
> <alex.bolshoy@gmail.com> wrote:
> > Hello Neil, Andrei,
> > I don't how much you recall of this old discussion.
> >
> > Basically, the change that you made means that md doesn't update the
> > superblock on the recovering device, until recovery completes. As a
> > result, when assembling such array, the recovering device has an old
> > event count in the superblock and is not picked during assembly. So
> > later, user has to re-add it manually.
> > This is true for a device that failed and was re-added. For a fresh
> > device, saved_raid_disk==-1, so its superblock will still be updated
> > (and sucn device will be picked on assembly).
> >
> > On the other hand, MD updates the superblock of the In_sync devices
> > and marks the recovering device as having a valid array slot (not
> > 0xFFFF or 0xFFFE, which are treated similarly today).
> >
> > What do you think of the following change:
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 561a65f..4bbc7e3 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -1854,8 +1854,6 @@ retry:
> >                         sb->dev_roles[i] = cpu_to_le16(0xfffe);
> >                 else if (test_bit(In_sync, &rdev2->flags))
> >                         sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
> > -               else if (rdev2->raid_disk >= 0)
> > -                       sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
> >                 else
> >                         sb->dev_roles[i] = cpu_to_le16(0xffff);
> >         }
> >
> > Basically, we do not mark the slot of the recovering device, until its
> > recovery completes. So during assembly, we will not pick it, as
> > before. For a fresh drive, there will be a regression - we will not
> > pick it as well on assembly.
> >
> > The reason I am considering this change is another old discussion that
> > we had - considering split-brain resolution, and the proposal I made
> > in:
> > https://docs.google.com/document/d/1sgO7NgvIFBDccoI3oXp9FNzB6RA5yMwqVN3_-LMSDNE
> >
> > Basically, when MD marks recovering devices as having a valid raid
> > slot in the superblock of In_sync devices, then on array assembly we
> > don't know whether sucn device is recovering or In_sync (if it is
> > inaccessible during assembly). So we have to assume it is In_sync and
> > thus can potentially cause split-brain.
> >
> >
> > What do you think of this change?
> >
> > Thanks,
> > Alex.
> >
> >
> > On Wed, Jun 27, 2012 at 9:22 AM, NeilBrown <neilb@suse.de> wrote:
> >> On Thu, 7 Jun 2012 15:47:46 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> >> wrote:
> >>
> >>> Thanks for commenting, Neil,
> >>>
> >>> On Thu, Jun 7, 2012 at 2:24 PM, NeilBrown <neilb@suse.de> wrote:
> >>> > On Thu, 7 Jun 2012 10:22:24 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> >>> > wrote:
> >>> >
> >>> >> Hi again Neil, and Andrey,
> >>> >>
> >>> >> looking at this email thread:
> >>> >> http://www.spinics.net/lists/raid/msg36236.html
> >>> >> between you and Andrey, the conclusion was:
> >>> >> "So the correct thing to do is to *not* update the metadata on the
> >>> >> recovering device until recovery completes.  Then if it fails and is
> >>> >> re-added, it will look just the same as when it was re-added the first
> >>> >> time, and will do a bitmap-based recovery."
> >>> >>
> >>> >> I have two doubts about this decision:
> >>> >> 1) Since the event count on the recovering drive is not updated, this
> >>> >> means that after reboot, when array is re-assembled, this drive will
> >>> >> not be added to the array, and the user will have to manually re-add
> >>> >> it. I agree this is a minor thing.
> >>> >
> >>> > Still, if it can be fixed it should be.
> >>> >
> >>> >> 2) There are places in mdadm, which check for recovery_offset on the
> >>> >> drive and take decisions based upon that. Specifically, if there is
> >>> >> *no* recovery offset, the data on this drive is considered to be
> >>> >> consistent WRT to the failure time of the drive. So, for example, the
> >>> >> drive can be a candidate for bumping up events during "force
> >>> >> assembly". Now, when superblock on such drive is not updated during
> >>> >> recovery (so there is *no* recovery offset), mdadm will think that the
> >>> >> drive is consistent, while in fact, its data is totally unusable until
> >>> >> after recovery completes. That's because we have updated parts of the
> >>> >> drive, but did not complete bringing the whole drive in-sync.
> >>> >
> >>> > If mdadm would consider updating the event count if not recovery had started,
> >>> > then surely it is just as valid to do so once some recovery has started, even
> >>> > if it hasn't completed.
> >>>
> >>> The patch you accepted from me ("Don't consider disks with a valid
> >>> recovery offset as candidates for bumping up event count") actually
> >>> attempts to protect from that:)
> >>>
> >>> I don't understand why "it is just as valid to do so once some
> >>> recovery has started". My understanding is that once recovery of a
> >>> drive has started, its data is not consistent between different parts
> >>> of the drive, until the recovery completes. This is because md does
> >>> bitmap-based recovery, and not kind of journal/transaction-log based
> >>> recovery.
> >>>
> >>> However, one could argue that for force-assembly case, when data
> >>> anyways can come up as partially corrupted, this is less important.
> >>
> >> Exactly.  And mdadm only updates event counts in the force-assembly case so
> >> while it might not be ideal, it is the best we can do.
> >>
> >>>
> >>> I would still think that there is value in recoding in a superblock
> >>> that a drive is recovering.
> >>
> >> Probably.  It is a bit unfortunate that if you stop an array that is
> >> recovering after a --re-add, you cannot simply 'assemble' it again and
> >> get it back to the same state.
> >> I'll think more on that.
> >>
> >> Meanwhile, this patch might address your other problem.  It allows --re-add
> >> to work if a non-bitmap rebuild fails and is then re-added.
> >>
> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> index c601c4b..d31852e 100644
> >> --- a/drivers/md/md.c
> >> +++ b/drivers/md/md.c
> >> @@ -5784,7 +5784,7 @@ static int add_new_disk(struct mddev * mddev, mdu_disk_info_t *info)
> >>                         super_types[mddev->major_version].
> >>                                 validate_super(mddev, rdev);
> >>                 if ((info->state & (1<<MD_DISK_SYNC)) &&
> >> -                   (!test_bit(In_sync, &rdev->flags) ||
> >> +                   (test_bit(Faulty, &rdev->flags) ||
> >>                      rdev->raid_disk != info->raid_disk)) {
> >>                         /* This was a hot-add request, but events doesn't
> >>                          * match, so reject it.
> >>
> >>
> >>>
> >>> >
> >>> >>
> >>> >> Can you pls comment on my doubts?
> >>> >
> >>> > I think there is probably room for improvement here but I don't think there
> >>> > are any serious problems.
> >>> >
> >>> > However I'm about to go on leave for a couple of week so I'm unlikely to
> >>> > think about it for a while. I've made a note to look at it properly when I
> >>> > get back.
> >>> >
> >>>
> >>> Indeed, don't you think about this while you are resting!
> >>
> >> :-)
> >>
> >>
> >> Thanks.  I did have a very enjoyable break.
> >>
> >> NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)
  2014-02-27  3:19             ` NeilBrown
@ 2014-03-02 11:01               ` Alexander Lyakas
  0 siblings, 0 replies; 14+ messages in thread
From: Alexander Lyakas @ 2014-03-02 11:01 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Andrei Warkentin, Yair Hershko
Hi Neil,
Thank you for your response to this old issue.
On Thu, Feb 27, 2014 at 5:19 AM, NeilBrown <neilb@suse.de> wrote:
> On Wed, 26 Feb 2014 15:02:27 +0200 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> HI Neil,
>> can you please comment what do you think of this change.
>
> Sorry for not replying earlier.
>
> I think the core problem you are addressing is fixed by:
>
> commit f466722ca614edcd14f3337373f33132117c7612
> Author: NeilBrown <neilb@suse.de>
> Date:   Mon Dec 9 12:04:56 2013 +1100
>
>     md: Change handling of save_raid_disk and metadata update during recovery.
>
> Can you confirm if that is the case?
Yes. This fixes the problem that the metadata of the recovering device
doesn't show that it is recovering (MD_FEATURE_RECOVERY_OFFSET was not
set). Now there is no danger that when assembling an array somebody
will pick such device thinking it has valid data. This is awesome!
Is there any issue with pulling this patch into kernel 3.8.13? (other
than the fact that by default MD is not configured as a loadable
module, so we need to recompile the kernel for that).
>
> With respect to the split-brain issue, I see that you would like every
> superblock to know which devices are truly operational and which are still
> recovering - a distinction which my current code doesn't allow.
Exactly. I don't want recovering devices having a valid "role" in the
superblock (but still want them to have the MD_FEATURE_RECOVERY_OFFSET
set, like the above commit of yours does).
>
> You want to know this because if you can see only 2 drives from a 4-drive
> RAID6 then:
>  - if the other 2 devices are believed to be fully functional, then we risk a
>    split-brain situation and shouldn't start the array
>  - if either of the other devices is missing or still recovering, then there
>    is no risk of split brain and the array can be started.
Precisely! In that case having MD_FEATURE_RECOVERY_OFFSET set does not
help, because we don't see the drive.
>
> With the current code you might refuse to start an array because it looks
> like it both other drives could be fully functional where in fact they are
> still recovering.
> This at least fails safely, but it may be that we want to absolutely minimise
> the number of false-failures.
Precisely!
>
> However I would hope that the total amount of time that the array spends
> recovering a device would be a very small fraction of the total time that the
> array is active.  If that is true, then the number of times you get a crash
> during recovery will be a small fraction of the total crashes, so the number
> of false-failures should be a small fraction of the reported failures.
>
> So I don't see this as a big problem.
Ok, I understand and respect your opinion. Can you please comment
overall on my patch, if we decide to integrate it into our kernel. Do
you think this patch can break something fundamental?
>
> With respect to the "biggest problem" you note in the split-brain document
> you reference:  this is an issue that has been in the back of my mind for
> many years, wondering how much I should care and to what extent it should be
> "fixed".  Whenever I have come close to fixing it, I've managed to convince
> myself that I shouldn't.
>
> We only record a device as 'failed' if a write fails.  And if a write fails
> then that device is clearly inconsistent with the rest of the array.  So it
> really seems appropriate to record that fact.
> I could possibly be convinced that there comes a point where updating
> superblocks any further is completely pointless, but I would need to be
> certain that it wouldn't leave a suggestion that the array was more coherent
> than it really is.
>
> Thanks,
> NeilBrown
I agree with your argument that if array has failed, we should know
about it. And when assembling, user should explicitly say "force",
making us manually adjust the event counts. So I withdraw my
suggestion to not update the superblock if array has failed.
Thanks!
Alex.
>
>
>>
>> Thanks,
>> Alex.
>>
>> On Wed, Feb 19, 2014 at 11:51 AM, Alexander Lyakas
>> <alex.bolshoy@gmail.com> wrote:
>> > Hello Neil, Andrei,
>> > I don't how much you recall of this old discussion.
>> >
>> > Basically, the change that you made means that md doesn't update the
>> > superblock on the recovering device, until recovery completes. As a
>> > result, when assembling such array, the recovering device has an old
>> > event count in the superblock and is not picked during assembly. So
>> > later, user has to re-add it manually.
>> > This is true for a device that failed and was re-added. For a fresh
>> > device, saved_raid_disk==-1, so its superblock will still be updated
>> > (and sucn device will be picked on assembly).
>> >
>> > On the other hand, MD updates the superblock of the In_sync devices
>> > and marks the recovering device as having a valid array slot (not
>> > 0xFFFF or 0xFFFE, which are treated similarly today).
>> >
>> > What do you think of the following change:
>> > diff --git a/drivers/md/md.c b/drivers/md/md.c
>> > index 561a65f..4bbc7e3 100644
>> > --- a/drivers/md/md.c
>> > +++ b/drivers/md/md.c
>> > @@ -1854,8 +1854,6 @@ retry:
>> >                         sb->dev_roles[i] = cpu_to_le16(0xfffe);
>> >                 else if (test_bit(In_sync, &rdev2->flags))
>> >                         sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
>> > -               else if (rdev2->raid_disk >= 0)
>> > -                       sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
>> >                 else
>> >                         sb->dev_roles[i] = cpu_to_le16(0xffff);
>> >         }
>> >
>> > Basically, we do not mark the slot of the recovering device, until its
>> > recovery completes. So during assembly, we will not pick it, as
>> > before. For a fresh drive, there will be a regression - we will not
>> > pick it as well on assembly.
>> >
>> > The reason I am considering this change is another old discussion that
>> > we had - considering split-brain resolution, and the proposal I made
>> > in:
>> > https://docs.google.com/document/d/1sgO7NgvIFBDccoI3oXp9FNzB6RA5yMwqVN3_-LMSDNE
>> >
>> > Basically, when MD marks recovering devices as having a valid raid
>> > slot in the superblock of In_sync devices, then on array assembly we
>> > don't know whether sucn device is recovering or In_sync (if it is
>> > inaccessible during assembly). So we have to assume it is In_sync and
>> > thus can potentially cause split-brain.
>> >
>> >
>> > What do you think of this change?
>> >
>> > Thanks,
>> > Alex.
>> >
>> >
>> > On Wed, Jun 27, 2012 at 9:22 AM, NeilBrown <neilb@suse.de> wrote:
>> >> On Thu, 7 Jun 2012 15:47:46 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
>> >> wrote:
>> >>
>> >>> Thanks for commenting, Neil,
>> >>>
>> >>> On Thu, Jun 7, 2012 at 2:24 PM, NeilBrown <neilb@suse.de> wrote:
>> >>> > On Thu, 7 Jun 2012 10:22:24 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
>> >>> > wrote:
>> >>> >
>> >>> >> Hi again Neil, and Andrey,
>> >>> >>
>> >>> >> looking at this email thread:
>> >>> >> http://www.spinics.net/lists/raid/msg36236.html
>> >>> >> between you and Andrey, the conclusion was:
>> >>> >> "So the correct thing to do is to *not* update the metadata on the
>> >>> >> recovering device until recovery completes.  Then if it fails and is
>> >>> >> re-added, it will look just the same as when it was re-added the first
>> >>> >> time, and will do a bitmap-based recovery."
>> >>> >>
>> >>> >> I have two doubts about this decision:
>> >>> >> 1) Since the event count on the recovering drive is not updated, this
>> >>> >> means that after reboot, when array is re-assembled, this drive will
>> >>> >> not be added to the array, and the user will have to manually re-add
>> >>> >> it. I agree this is a minor thing.
>> >>> >
>> >>> > Still, if it can be fixed it should be.
>> >>> >
>> >>> >> 2) There are places in mdadm, which check for recovery_offset on the
>> >>> >> drive and take decisions based upon that. Specifically, if there is
>> >>> >> *no* recovery offset, the data on this drive is considered to be
>> >>> >> consistent WRT to the failure time of the drive. So, for example, the
>> >>> >> drive can be a candidate for bumping up events during "force
>> >>> >> assembly". Now, when superblock on such drive is not updated during
>> >>> >> recovery (so there is *no* recovery offset), mdadm will think that the
>> >>> >> drive is consistent, while in fact, its data is totally unusable until
>> >>> >> after recovery completes. That's because we have updated parts of the
>> >>> >> drive, but did not complete bringing the whole drive in-sync.
>> >>> >
>> >>> > If mdadm would consider updating the event count if not recovery had started,
>> >>> > then surely it is just as valid to do so once some recovery has started, even
>> >>> > if it hasn't completed.
>> >>>
>> >>> The patch you accepted from me ("Don't consider disks with a valid
>> >>> recovery offset as candidates for bumping up event count") actually
>> >>> attempts to protect from that:)
>> >>>
>> >>> I don't understand why "it is just as valid to do so once some
>> >>> recovery has started". My understanding is that once recovery of a
>> >>> drive has started, its data is not consistent between different parts
>> >>> of the drive, until the recovery completes. This is because md does
>> >>> bitmap-based recovery, and not kind of journal/transaction-log based
>> >>> recovery.
>> >>>
>> >>> However, one could argue that for force-assembly case, when data
>> >>> anyways can come up as partially corrupted, this is less important.
>> >>
>> >> Exactly.  And mdadm only updates event counts in the force-assembly case so
>> >> while it might not be ideal, it is the best we can do.
>> >>
>> >>>
>> >>> I would still think that there is value in recoding in a superblock
>> >>> that a drive is recovering.
>> >>
>> >> Probably.  It is a bit unfortunate that if you stop an array that is
>> >> recovering after a --re-add, you cannot simply 'assemble' it again and
>> >> get it back to the same state.
>> >> I'll think more on that.
>> >>
>> >> Meanwhile, this patch might address your other problem.  It allows --re-add
>> >> to work if a non-bitmap rebuild fails and is then re-added.
>> >>
>> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> >> index c601c4b..d31852e 100644
>> >> --- a/drivers/md/md.c
>> >> +++ b/drivers/md/md.c
>> >> @@ -5784,7 +5784,7 @@ static int add_new_disk(struct mddev * mddev, mdu_disk_info_t *info)
>> >>                         super_types[mddev->major_version].
>> >>                                 validate_super(mddev, rdev);
>> >>                 if ((info->state & (1<<MD_DISK_SYNC)) &&
>> >> -                   (!test_bit(In_sync, &rdev->flags) ||
>> >> +                   (test_bit(Faulty, &rdev->flags) ||
>> >>                      rdev->raid_disk != info->raid_disk)) {
>> >>                         /* This was a hot-add request, but events doesn't
>> >>                          * match, so reject it.
>> >>
>> >>
>> >>>
>> >>> >
>> >>> >>
>> >>> >> Can you pls comment on my doubts?
>> >>> >
>> >>> > I think there is probably room for improvement here but I don't think there
>> >>> > are any serious problems.
>> >>> >
>> >>> > However I'm about to go on leave for a couple of week so I'm unlikely to
>> >>> > think about it for a while. I've made a note to look at it properly when I
>> >>> > get back.
>> >>> >
>> >>>
>> >>> Indeed, don't you think about this while you are resting!
>> >>
>> >> :-)
>> >>
>> >>
>> >> Thanks.  I did have a very enjoyable break.
>> >>
>> >> NeilBrown
>
^ permalink raw reply	[flat|nested] 14+ messages in thread
end of thread, other threads:[~2014-03-02 11:01 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-06-06 16:09 Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a) Alexander Lyakas
2012-06-07  7:22 ` Alexander Lyakas
2012-06-07 11:24   ` NeilBrown
2012-06-07 12:47     ` Alexander Lyakas
2012-06-27  7:22       ` NeilBrown
2012-06-27 16:40         ` Alexander Lyakas
2012-07-02  7:57           ` NeilBrown
2012-07-02 17:30             ` John Gehring
2014-02-19  9:51         ` Alexander Lyakas
2014-02-26 13:02           ` Alexander Lyakas
2014-02-27  3:19             ` NeilBrown
2014-03-02 11:01               ` Alexander Lyakas
2012-06-08  8:59     ` Thanks (was Re: Problem with patch...) John Robinson
2012-06-07 10:26 ` Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a) NeilBrown
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).