* Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)
@ 2012-06-06 16:09 Alexander Lyakas
From: Alexander Lyakas @ 2012-06-06 16:09 UTC (permalink / raw)
To: NeilBrown, linux-raid
Hi Neil,
it looks like we are hitting an issue with this patch.
The kernel is stock ubuntu-precise (3.2.0-23 36), and mdadm is v3.2.3. (We are
aware that this version of mdadm has an issue with unaligned writes, so
we backported the fix into this version.)
We are testing the following simple scenario:
# 2-way raid1 with a missing drive
# a fresh drive (no superblock) is added (not re-added) to the raid,
and the raid happily starts rebuilding it
# at some point in time, this drive fails, so rebuild aborts, drive is
kicked out.
# later the drive recovers and "mdadm remove & re-add" is attempted
# re-add fails
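
The sequence can be scripted roughly as follows (a sketch only; the device
names and the use of --fail to simulate the mid-rebuild failure are
assumptions, not taken from our actual setup):

  # 2-way raid1 created degraded, with an internal bitmap (assumed members)
  mdadm --create /dev/md0 --level=1 --raid-devices=2 --bitmap=internal \
        /dev/sdb1 missing

  # add (not re-add) a fresh drive with no superblock; rebuild starts
  mdadm --zero-superblock /dev/sdc1
  mdadm /dev/md0 --add /dev/sdc1

  # approximate the drive failing while the rebuild is still running
  mdadm /dev/md0 --fail /dev/sdc1

  # "drive recovers": remove it and attempt a re-add
  mdadm /dev/md0 --remove /dev/sdc1
  mdadm /dev/md0 --re-add /dev/sdc1   # rejected with EINVAL on affected kernels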
Some debugging suggests the following:
- The superblock of the drive to be re-added looks like this:
Avail Dev Size : 10483712 (5.00 GiB 5.37 GB)
Array Size : 10483688 (5.00 GiB 5.37 GB)
Used Dev Size : 10483688 (5.00 GiB 5.37 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
Recovery Offset : 100096 sectors
State : clean
Resync Offset : N/A sectors
Device UUID : d9114d02:76153894:87aaf09d:c55da9be
Internal Bitmap : 8 sectors from superblock
Update Time : Wed Jun 6 18:36:55 2012
Checksum : 94468c9d - correct
Events : 192
Device Role : Active device 1
Array State : AA ('A' == active, '.' == missing)
So the drive has a valid recovery offset (MD_FEATURE_RECOVERY_OFFSET
set) and a valid role.
- The "re-add" path of mdadm calls getinfo_super1(), which performs:
role = __le16_to_cpu(sb->dev_roles[__le32_to_cpu(sb->dev_number)]);
...
info->disk.state = 6; /* active and in sync */
info->disk.raid_disk = role;
So at this point, MD_DISK_ACTIVE and MD_DISK_SYNC are set.
- mdadm issues ADD_NEW_DISK, which arrives at add_new_disk() in the kernel
- the kernel performs:
if (!mddev->persistent) {
... // not relevant to our case
} else
super_types[mddev->major_version].
validate_super(mddev, rdev);
- validate_super() performs:
if ((le32_to_cpu(sb->feature_map) &
MD_FEATURE_RECOVERY_OFFSET))
rdev->recovery_offset = le64_to_cpu(sb->recovery_offset);
else
set_bit(In_sync, &rdev->flags);
rdev->raid_disk = role;
So In_sync flag is not set, because we have a recovery offset.
- Then the code introduced by the patch does:
if ((info->state & (1<<MD_DISK_SYNC)) &&
(!test_bit(In_sync, &rdev->flags) ||
rdev->raid_disk != info->raid_disk)) {
/* This was a hot-add request, but events doesn't
* match, so reject it.
*/
export_rdev(rdev);
return -EINVAL;
}
So the first part of "OR" succeeds and we abort.
An observation: this problem apparently only happens with a fresh drive.
If a drive that was already In_sync fails and is then re-added, the
kernel does not update its superblock until recovery completes. So its
superblock has no valid recovery_offset, and this problem doesn't happen.
The code that skips updating the superblock was added as part of this
email thread: http://www.spinics.net/lists/raid/msg36275.html.
However, I did not dig deeper into why the superblock does get updated
in my case. But the scenario above reproduces easily.
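
For what it's worth, the difference between the two cases is visible
directly in the member superblock; for example (device name is an
assumption):

  mdadm --examine /dev/sdc1 | grep -E 'Recovery Offset|Events|Device Role|Array State'
  # The fresh drive that failed mid-rebuild reports a "Recovery Offset" line
  # (as in the dump above), while a previously In_sync drive that failed
  # reports none, because its superblock was not updated during recovery.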
Q1: Can you confirm that my analysis is correct?
Q2: Is this expected behavior? I would assume not, because the same
case for a failed drive (not a fresh drive) works fine.
Q3: How would you recommend handling this?
Thanks!
Alex.

* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)

From: Alexander Lyakas @ 2012-06-07 7:22 UTC (permalink / raw)
To: NeilBrown, andrey.warkentin, linux-raid

Hi again Neil, and Andrey,

looking at this email thread:
http://www.spinics.net/lists/raid/msg36236.html
between you and Andrey, the conclusion was:
"So the correct thing to do is to *not* update the metadata on the
recovering device until recovery completes. Then if it fails and is
re-added, it will look just the same as when it was re-added the first
time, and will do a bitmap-based recovery."

I have two doubts about this decision:

1) Since the event count on the recovering drive is not updated, after
a reboot, when the array is re-assembled, this drive will not be added
to the array, and the user will have to re-add it manually. I agree
this is a minor thing.

2) There are places in mdadm which check for a recovery_offset on the
drive and take decisions based upon that. Specifically, if there is
*no* recovery offset, the data on the drive is considered to be
consistent with respect to the failure time of the drive. So, for
example, the drive can be a candidate for bumping up events during
"force assembly". Now, when the superblock on such a drive is not
updated during recovery (so there is *no* recovery offset), mdadm will
think that the drive is consistent, while in fact its data is totally
unusable until recovery completes. That's because we have updated
parts of the drive, but did not complete bringing the whole drive
in sync.

Can you please comment on my doubts?

Thanks,
Alex.

On Wed, Jun 6, 2012 at 7:09 PM, Alexander Lyakas <alex.bolshoy@gmail.com> wrote:
> [...]
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)

From: NeilBrown @ 2012-06-07 11:24 UTC (permalink / raw)
To: Alexander Lyakas; +Cc: andrey.warkentin, linux-raid

On Thu, 7 Jun 2012 10:22:24 +0300 Alexander Lyakas <alex.bolshoy@gmail.com> wrote:

> [...]
> I have two doubts about this decision:
>
> 1) Since the event count on the recovering drive is not updated, after
> a reboot, when the array is re-assembled, this drive will not be added
> to the array, and the user will have to re-add it manually. I agree
> this is a minor thing.

Still, if it can be fixed it should be.

> 2) There are places in mdadm which check for a recovery_offset on the
> drive and take decisions based upon that. Specifically, if there is
> *no* recovery offset, the data on the drive is considered to be
> consistent with respect to the failure time of the drive. So, for
> example, the drive can be a candidate for bumping up events during
> "force assembly". Now, when the superblock on such a drive is not
> updated during recovery (so there is *no* recovery offset), mdadm will
> think that the drive is consistent, while in fact its data is totally
> unusable until recovery completes. That's because we have updated
> parts of the drive, but did not complete bringing the whole drive
> in sync.

If mdadm would consider updating the event count if no recovery had
started, then surely it is just as valid to do so once some recovery
has started, even if it hasn't completed.

> Can you please comment on my doubts?

I think there is probably room for improvement here, but I don't think
there are any serious problems.

However, I'm about to go on leave for a couple of weeks, so I'm unlikely
to think about it for a while. I've made a note to look at it properly
when I get back.

NeilBrown
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)

From: Alexander Lyakas @ 2012-06-07 12:47 UTC (permalink / raw)
To: NeilBrown; +Cc: andrey.warkentin, linux-raid

Thanks for commenting, Neil.

On Thu, Jun 7, 2012 at 2:24 PM, NeilBrown <neilb@suse.de> wrote:
> [...]
> If mdadm would consider updating the event count if no recovery had
> started, then surely it is just as valid to do so once some recovery
> has started, even if it hasn't completed.

The patch you accepted from me ("Don't consider disks with a valid
recovery offset as candidates for bumping up event count") actually
attempts to protect from that :)

I don't understand why "it is just as valid to do so once some recovery
has started". My understanding is that once recovery of a drive has
started, its data is not consistent between different parts of the
drive until the recovery completes. This is because md does
bitmap-based recovery, not some kind of journal/transaction-log based
recovery.

However, one could argue that for the force-assembly case, when the data
can anyway come up as partially corrupted, this is less important.

I would still think that there is value in recording in the superblock
that a drive is recovering.

> I think there is probably room for improvement here, but I don't think
> there are any serious problems.
>
> However, I'm about to go on leave for a couple of weeks, so I'm unlikely
> to think about it for a while. I've made a note to look at it properly
> when I get back.

Indeed, don't you think about this while you are resting!

Alex.
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)

From: NeilBrown @ 2012-06-27 7:22 UTC (permalink / raw)
To: Alexander Lyakas; +Cc: andrey.warkentin, linux-raid

On Thu, 7 Jun 2012 15:47:46 +0300 Alexander Lyakas <alex.bolshoy@gmail.com> wrote:

> [...]
> However, one could argue that for the force-assembly case, when the data
> can anyway come up as partially corrupted, this is less important.

Exactly. And mdadm only updates event counts in the force-assembly case,
so while it might not be ideal, it is the best we can do.

> I would still think that there is value in recording in the superblock
> that a drive is recovering.

Probably. It is a bit unfortunate that if you stop an array that is
recovering after a --re-add, you cannot simply 'assemble' it again and
get it back to the same state.
I'll think more on that.

Meanwhile, this patch might address your other problem. It allows
--re-add to work if a non-bitmap rebuild fails and is then re-added.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index c601c4b..d31852e 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5784,7 +5784,7 @@ static int add_new_disk(struct mddev * mddev, mdu_disk_info_t *info)
 			super_types[mddev->major_version].
 				validate_super(mddev, rdev);
 		if ((info->state & (1<<MD_DISK_SYNC)) &&
-		    (!test_bit(In_sync, &rdev->flags) ||
+		    (test_bit(Faulty, &rdev->flags) ||
 		     rdev->raid_disk != info->raid_disk)) {
 			/* This was a hot-add request, but events doesn't
 			 * match, so reject it.

> Indeed, don't you think about this while you are resting!

:-)

Thanks. I did have a very enjoyable break.

NeilBrown
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)

From: Alexander Lyakas @ 2012-06-27 16:40 UTC (permalink / raw)
To: NeilBrown; +Cc: andrey.warkentin, linux-raid

Hi Neil,

> > I would still think that there is value in recording in the superblock
> > that a drive is recovering.
>
> Probably. It is a bit unfortunate that if you stop an array that is
> recovering after a --re-add, you cannot simply 'assemble' it again and
> get it back to the same state.
> I'll think more on that.

As I mentioned, I see the additional re-add as a minor thing, but I
agree it's better to fix it. The fact that we don't know that the drive
is being recovered bothers me more, because a user might look at the
superblock and assume the data on the drive is consistent to some point
in time (the time of the drive failure), while the actual data, during
a bitmap-based recovery, is unusable until recovery successfully
completes. So the user might think it's okay to try to run his app on
this drive.
Yes, please think about this.

> Meanwhile, this patch might address your other problem. It allows
> --re-add to work if a non-bitmap rebuild fails and is then re-added.
> [...]

I have tested a slightly different patch that you suggested earlier -
just removing the !test_bit(In_sync, &rdev->flags) check. I confirm
that it solves the problem.

The Faulty bit check seems redundant to me, because:
- it can be set only by validate_super(), and only if that drive's
  role is 0xfffe in the sb->dev_roles[] array
- a long time ago I asked you how it can happen that a drive thinks
  about *itself* that it is Faulty (has 0xfffe for its role in its own
  superblock), and you said this should never happen.

Anyway, I also tested the patch you suggested, and it also works.

Is there any chance to see this fix in ubuntu-precise?

Thanks again for your support,
Alex.
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)

From: NeilBrown @ 2012-07-02 7:57 UTC (permalink / raw)
To: Alexander Lyakas; +Cc: andrey.warkentin, linux-raid

On Wed, 27 Jun 2012 19:40:52 +0300 Alexander Lyakas <alex.bolshoy@gmail.com> wrote:

> [...]
> I have tested a slightly different patch that you suggested earlier -
> just removing the !test_bit(In_sync, &rdev->flags) check. I confirm
> that it solves the problem.
>
> The Faulty bit check seems redundant to me, because:
> - it can be set only by validate_super(), and only if that drive's
>   role is 0xfffe in the sb->dev_roles[] array
> - a long time ago I asked you how it can happen that a drive thinks
>   about *itself* that it is Faulty (has 0xfffe for its role in its own
>   superblock), and you said this should never happen.

Yes, you are right. I've removed the test on 'Faulty' - the test on the
raid_disk number is sufficient.

> Anyway, I also tested the patch you suggested, and it also works.

Thanks!

> Is there any chance to see this fix in ubuntu-precise?

Not really up to me. It doesn't fix a crash or corruption, so I'm not
sure it is -stable material ... though maybe it is if it fixes a
regression.

I suggest you ask the Ubuntu kernel people after it appears in 3.5-rc
(hopefully later this week).

Thanks,
NeilBrown
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)

From: John Gehring @ 2012-07-02 17:30 UTC (permalink / raw)
To: linux-raid

I have what I think is a closely related issue, but since this thread is
pretty long already, I will start with a brief description to see if this
should move to a new thread.

I am seeing a repeatable instance where a drive is added to a raid set,
then almost immediately rejected.

- linux 3.2.15 from kernel.org
- mdadm 3.2.5, but I saw this with 3.2 as well
- 8-drive raid 6

Sequence:

Remove the device
- md created and fully synced (/dev/md1 in my case)
- no IO
- pull a drive
- issue a dd read to /dev/md1 to get it to recognize the missing device
- mdadm /dev/md1 -r /dev/sdc2 (for example) to remove the device from the md

Add the device back in
- insert the drive
- mdadm --zero-superblock /dev/sdc2 (or sd<x>, whatever it came back in as)
- mdadm /dev/md1 -a /dev/sdc2

I see the following about 50% of the time:

[ 1813.355806] md: bind<sdc2>
[ 1813.367430] md: recovery of RAID array md1
[ 1813.371570] md: minimum _guaranteed_  speed: 80000 KB/sec/disk.
[ 1813.377481] md: using maximum available idle IO bandwidth (but not more than 800000 KB/sec) for recovery.
[ 1813.387039] md: using 128k window, over a total of 62467072k.
[ 1813.392789] md/raid:md1: Disk failure on sdc2, disabling device.
[ 1813.392791] md/raid:md1: Operation continuing on 7 devices.
[ 1813.404346] md: md1: recovery done.

I recently found that after this series of events, if I once again remove
the drive from the raid set (mdadm --remove) and do a read of the md
device via dd, I can then successfully add the device back to the md with
the --zero-superblock command followed by the mdadm --add command. That
'trick' has worked the first three times that I've tried it, so not a lot
of data points, but interesting nonetheless.

Finally, I applied the patch referenced earlier in this thread and it did
not affect the behavior I've described above.

This has happened on different systems and with different drives; i.e. it
is not a drive failure issue.

Thoughts?

Thanks.

On Mon, Jul 2, 2012 at 1:57 AM, NeilBrown <neilb@suse.de> wrote:
> [...]
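
For reference, the sequence above can be consolidated into a rough script
(a sketch only; the array and member names are the examples given above,
and the physical pull/re-insert of the drive obviously cannot be scripted):

  # after physically pulling the member drive:
  dd if=/dev/md1 of=/dev/null bs=1M count=64   # force md to notice the missing device
  mdadm /dev/md1 --remove /dev/sdc2

  # after re-inserting the drive (it may come back under a different name):
  mdadm --zero-superblock /dev/sdc2
  mdadm /dev/md1 --add /dev/sdc2
  dmesg | tail    # ~50% of the time: bind<sdc2> followed immediately by
                  # "Disk failure on sdc2, disabling device"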
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)

From: Alexander Lyakas @ 2014-02-19 9:51 UTC (permalink / raw)
To: NeilBrown, linux-raid; +Cc: Andrei Warkentin, Yair Hershko

Hello Neil, Andrei,

I don't know how much you recall of this old discussion.

Basically, the change that you made means that md doesn't update the
superblock on the recovering device until recovery completes. As a
result, when assembling such an array, the recovering device has an old
event count in its superblock and is not picked during assembly. So
later, the user has to re-add it manually.
This is true for a device that failed and was re-added. For a fresh
device, saved_raid_disk==-1, so its superblock will still be updated
(and such a device will be picked on assembly).

On the other hand, MD updates the superblocks of the In_sync devices
and marks the recovering device as having a valid array slot (not
0xFFFF or 0xFFFE, which are treated similarly today).

What do you think of the following change:

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 561a65f..4bbc7e3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1854,8 +1854,6 @@ retry:
 			sb->dev_roles[i] = cpu_to_le16(0xfffe);
 		else if (test_bit(In_sync, &rdev2->flags))
 			sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
-		else if (rdev2->raid_disk >= 0)
-			sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
 		else
 			sb->dev_roles[i] = cpu_to_le16(0xffff);
 	}

Basically, we do not mark the slot of the recovering device until its
recovery completes. So during assembly we will not pick it, as before.
For a fresh drive there will be a regression - we will not pick it on
assembly either.

The reason I am considering this change is another old discussion that
we had - concerning split-brain resolution, and the proposal I made in:
https://docs.google.com/document/d/1sgO7NgvIFBDccoI3oXp9FNzB6RA5yMwqVN3_-LMSDNE

Basically, when MD marks recovering devices as having a valid raid slot
in the superblock of the In_sync devices, then on array assembly we
don't know whether such a device is recovering or In_sync (if it is
inaccessible during assembly). So we have to assume it is In_sync, and
thus it can potentially cause a split-brain.

What do you think of this change?

Thanks,
Alex.

On Wed, Jun 27, 2012 at 9:22 AM, NeilBrown <neilb@suse.de> wrote:
> [...]
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)

From: Alexander Lyakas @ 2014-02-26 13:02 UTC (permalink / raw)
To: NeilBrown, linux-raid; +Cc: Andrei Warkentin, Yair Hershko

Hi Neil,
can you please comment on what you think of this change?

Thanks,
Alex.

On Wed, Feb 19, 2014 at 11:51 AM, Alexander Lyakas <alex.bolshoy@gmail.com> wrote:
> [...]
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)
  2014-02-26 13:02 ` Alexander Lyakas
@ 2014-02-27  3:19   ` NeilBrown
  2014-03-02 11:01     ` Alexander Lyakas
  0 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2014-02-27  3:19 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid, Andrei Warkentin, Yair Hershko

[-- Attachment #1: Type: text/plain, Size: 10195 bytes --]

On Wed, 26 Feb 2014 15:02:27 +0200 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:

> Hi Neil,
> can you please comment on what you think of this change.

Sorry for not replying earlier.

I think the core problem you are addressing is fixed by:

commit f466722ca614edcd14f3337373f33132117c7612
Author: NeilBrown <neilb@suse.de>
Date:   Mon Dec 9 12:04:56 2013 +1100

    md: Change handling of save_raid_disk and metadata update during recovery.

Can you confirm whether that is the case?

With respect to the split-brain issue, I see that you would like every
superblock to know which devices are truly operational and which are still
recovering - a distinction which my current code doesn't allow.

You want to know this because if you can see only 2 drives from a 4-drive
RAID6, then:
- if the other 2 devices are believed to be fully functional, we risk a
  split-brain situation and shouldn't start the array;
- if either of the other devices is missing or still recovering, there
  is no risk of split brain and the array can be started.

With the current code you might refuse to start an array because it looks
like both other drives could be fully functional when in fact they are
still recovering.
This at least fails safely, but it may be that we want to absolutely minimise
the number of false failures.

However, I would hope that the total amount of time the array spends
recovering a device is a very small fraction of the total time that the
array is active. If that is true, then the number of times you get a crash
during recovery will be a small fraction of the total crashes, so the number
of false failures should be a small fraction of the reported failures.

So I don't see this as a big problem.

With respect to the "biggest problem" you note in the split-brain document
you reference: this is an issue that has been in the back of my mind for
many years, wondering how much I should care and to what extent it should be
"fixed". Whenever I have come close to fixing it, I've managed to convince
myself that I shouldn't.

We only record a device as 'failed' if a write fails. And if a write fails,
then that device is clearly inconsistent with the rest of the array, so it
really does seem appropriate to record that fact.
I could possibly be convinced that there comes a point where updating
superblocks any further is completely pointless, but I would need to be
certain that it wouldn't leave a suggestion that the array was more coherent
than it really is.

Thanks,
NeilBrown

>
> Thanks,
> Alex.
>
> On Wed, Feb 19, 2014 at 11:51 AM, Alexander Lyakas
> <alex.bolshoy@gmail.com> wrote:
> > Hello Neil, Andrei,
> > I don't know how much you recall of this old discussion.
> >
> > Basically, the change that you made means that md doesn't update the
> > superblock on the recovering device until recovery completes. As a
> > result, when assembling such an array, the recovering device has an old
> > event count in its superblock and is not picked during assembly, so
> > later the user has to re-add it manually.
> > This is true for a device that failed and was re-added. For a fresh
> > device, saved_raid_disk==-1, so its superblock will still be updated
> > (and such a device will be picked on assembly).
> >
> > On the other hand, MD updates the superblock of the In_sync devices
> > and marks the recovering device as having a valid array slot (not
> > 0xFFFF or 0xFFFE, which are treated similarly today).
> >
> > What do you think of the following change:
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 561a65f..4bbc7e3 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -1854,8 +1854,6 @@ retry:
> >  			sb->dev_roles[i] = cpu_to_le16(0xfffe);
> >  		else if (test_bit(In_sync, &rdev2->flags))
> >  			sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
> > -		else if (rdev2->raid_disk >= 0)
> > -			sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
> >  		else
> >  			sb->dev_roles[i] = cpu_to_le16(0xffff);
> >  	}
> >
> > Basically, we do not mark the slot of the recovering device until its
> > recovery completes. So during assembly we will not pick it, as
> > before. For a fresh drive there will be a regression - we will not
> > pick it on assembly either.
> >
> > The reason I am considering this change is another old discussion that
> > we had - concerning split-brain resolution, and the proposal I made
> > in:
> > https://docs.google.com/document/d/1sgO7NgvIFBDccoI3oXp9FNzB6RA5yMwqVN3_-LMSDNE
> >
> > Basically, when MD marks recovering devices as having a valid raid
> > slot in the superblock of In_sync devices, then on array assembly we
> > don't know whether such a device is recovering or In_sync (if it is
> > inaccessible during assembly). So we have to assume it is In_sync and
> > thus can potentially cause split-brain.
> >
> >
> > What do you think of this change?
> >
> > Thanks,
> > Alex.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread
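As an aside for readers following the split-brain reasoning above, the policy
being discussed can be illustrated with a small stand-alone sketch. This is
not mdadm or kernel code; the function and type names are invented for the
example, and real assembly logic would of course work from on-disk
superblocks rather than an in-memory array.

#include <stdbool.h>
#include <stdio.h>

/* What the surviving superblocks say about a device we cannot see. */
enum absent_state { ABSENT_IN_SYNC, ABSENT_RECOVERING, ABSENT_FAULTY_OR_SPARE };

/*
 * Illustrative split-brain check for a degraded start: the devices we
 * cannot see could only have continued as an array of their own if enough
 * of them were recorded as fully in sync.  A device that was still
 * recovering (or faulty/spare) cannot have contributed to such an array,
 * so it cannot hold newer, diverged data.
 */
static bool split_brain_possible(const enum absent_state *absent, int nr_absent,
				 int raid_disks, int max_degraded)
{
	int in_sync_absent = 0;

	for (int i = 0; i < nr_absent; i++)
		if (absent[i] == ABSENT_IN_SYNC)
			in_sync_absent++;

	/* enough fully in-sync unseen devices to have run on their own? */
	return in_sync_absent >= raid_disks - max_degraded;
}

int main(void)
{
	/* 4-drive RAID6 (up to 2 devices may be missing), 2 devices unseen */
	enum absent_state unseen[2] = { ABSENT_IN_SYNC, ABSENT_RECOVERING };

	printf("split brain possible: %s\n",
	       split_brain_possible(unseen, 2, 4, 2) ? "yes" : "no");
	return 0;
}

With the 4-drive RAID6 example from the message above, the check returns
"yes" only when both unseen devices were recorded as fully in sync - exactly
the case where the array should not be started automatically.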
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a)
  2014-02-27  3:19 ` NeilBrown
@ 2014-03-02 11:01   ` Alexander Lyakas
  0 siblings, 0 replies; 14+ messages in thread
From: Alexander Lyakas @ 2014-03-02 11:01 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Andrei Warkentin, Yair Hershko

Hi Neil,
Thank you for your response to this old issue.

On Thu, Feb 27, 2014 at 5:19 AM, NeilBrown <neilb@suse.de> wrote:
> On Wed, 26 Feb 2014 15:02:27 +0200 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Hi Neil,
>> can you please comment on what you think of this change.
>
> Sorry for not replying earlier.
>
> I think the core problem you are addressing is fixed by:
>
> commit f466722ca614edcd14f3337373f33132117c7612
> Author: NeilBrown <neilb@suse.de>
> Date:   Mon Dec 9 12:04:56 2013 +1100
>
>     md: Change handling of save_raid_disk and metadata update during recovery.
>
> Can you confirm whether that is the case?

Yes. This fixes the problem that the metadata of the recovering device
didn't show that it is recovering (MD_FEATURE_RECOVERY_OFFSET was not
set). Now there is no danger that, when assembling an array, somebody
will pick such a device thinking it has valid data. This is awesome!
Is there any issue with pulling this patch into kernel 3.8.13? (Other
than the fact that by default MD is not configured as a loadable module,
so we need to recompile the kernel for that.)

> With respect to the split-brain issue, I see that you would like every
> superblock to know which devices are truly operational and which are still
> recovering - a distinction which my current code doesn't allow.

Exactly. I don't want recovering devices to have a valid "role" in the
superblock (but I still want them to have MD_FEATURE_RECOVERY_OFFSET
set, as the above commit of yours does).

> You want to know this because if you can see only 2 drives from a 4-drive
> RAID6, then:
> - if the other 2 devices are believed to be fully functional, we risk a
>   split-brain situation and shouldn't start the array;
> - if either of the other devices is missing or still recovering, there
>   is no risk of split brain and the array can be started.

Precisely! In that case having MD_FEATURE_RECOVERY_OFFSET set does not
help, because we don't see the drive.

> With the current code you might refuse to start an array because it looks
> like both other drives could be fully functional when in fact they are
> still recovering.
> This at least fails safely, but it may be that we want to absolutely minimise
> the number of false failures.

Precisely!

> However, I would hope that the total amount of time the array spends
> recovering a device is a very small fraction of the total time that the
> array is active. If that is true, then the number of times you get a crash
> during recovery will be a small fraction of the total crashes, so the number
> of false failures should be a small fraction of the reported failures.
>
> So I don't see this as a big problem.

Ok, I understand and respect your opinion. Can you please comment on my
patch overall, in case we decide to integrate it into our kernel: do you
think it could break something fundamental?

> With respect to the "biggest problem" you note in the split-brain document
> you reference: this is an issue that has been in the back of my mind for
> many years, wondering how much I should care and to what extent it should be
> "fixed". Whenever I have come close to fixing it, I've managed to convince
> myself that I shouldn't.
>
> We only record a device as 'failed' if a write fails. And if a write fails,
> then that device is clearly inconsistent with the rest of the array, so it
> really does seem appropriate to record that fact.
> I could possibly be convinced that there comes a point where updating
> superblocks any further is completely pointless, but I would need to be
> certain that it wouldn't leave a suggestion that the array was more coherent
> than it really is.
>
> Thanks,
> NeilBrown

I agree with your argument that if the array has failed, we should know
about it, and that when assembling, the user should explicitly say
"force", making us manually adjust the event counts. So I withdraw my
suggestion to not update the superblock after the array has failed.

Thanks!
Alex.
^ permalink raw reply	[flat|nested] 14+ messages in thread
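To make the metadata distinction settled on above concrete, here is a rough
sketch of how a device could be classified from the handful of v1.x
superblock fields quoted in this thread (dev_roles, dev_number, feature_map,
recovery_offset). The struct is a deliberately abbreviated stand-in for the
real one, endianness conversions are omitted, and the numeric value of
MD_FEATURE_RECOVERY_OFFSET is assumed for the example; this is an
illustration, not mdadm code.

#include <stdint.h>
#include <stdio.h>

#define MD_FEATURE_RECOVERY_OFFSET  2	/* bit value assumed for this sketch */

/* Abbreviated stand-in for the v1.x superblock; only the fields used here. */
struct sb1_excerpt {
	uint32_t feature_map;
	uint64_t recovery_offset;	/* meaningful only if the bit above is set */
	uint32_t dev_number;		/* index of this device in dev_roles[] */
	uint16_t dev_roles[16];		/* 0xffff = spare, 0xfffe = faulty */
};

enum dev_class { DEV_SPARE, DEV_FAULTY, DEV_RECOVERING, DEV_IN_SYNC };

static enum dev_class classify(const struct sb1_excerpt *sb)
{
	uint16_t role = sb->dev_roles[sb->dev_number];

	if (role == 0xffff)
		return DEV_SPARE;
	if (role == 0xfffe)
		return DEV_FAULTY;
	/*
	 * A device with a valid role but MD_FEATURE_RECOVERY_OFFSET set is
	 * still rebuilding: its data is not usable until recovery finishes,
	 * so it must not be treated as in sync during assembly.
	 */
	if (sb->feature_map & MD_FEATURE_RECOVERY_OFFSET)
		return DEV_RECOVERING;
	return DEV_IN_SYNC;
}

int main(void)
{
	struct sb1_excerpt sb = {
		.feature_map = MD_FEATURE_RECOVERY_OFFSET,
		.recovery_offset = 4096,
		.dev_number = 1,
		.dev_roles = { 0, 1 },	/* this device holds active role 1 */
	};

	printf("class = %d (2 means recovering)\n", classify(&sb));
	return 0;
}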
* Thanks (was Re: Problem with patch...)
  2012-06-07 11:24 ` NeilBrown
  2012-06-07 12:47   ` Alexander Lyakas
@ 2012-06-08  8:59   ` John Robinson
  1 sibling, 0 replies; 14+ messages in thread
From: John Robinson @ 2012-06-08  8:59 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 07/06/2012 12:24, NeilBrown wrote:
[...]
> I'm about to go on leave for a couple of weeks
[...]

I hope you have a good relaxing holiday. Thanks again for your excellent
work here.

Dear Rest Of The World: don't do anything silly to your RAID arrays in the
next fortnight!

Cheers,

John.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a) 2012-06-06 16:09 Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a) Alexander Lyakas 2012-06-07 7:22 ` Alexander Lyakas @ 2012-06-07 10:26 ` NeilBrown 1 sibling, 0 replies; 14+ messages in thread From: NeilBrown @ 2012-06-07 10:26 UTC (permalink / raw) To: Alexander Lyakas; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 4358 bytes --] On Wed, 6 Jun 2012 19:09:25 +0300 Alexander Lyakas <alex.bolshoy@gmail.com> wrote: > Hi Neil, > it looks like we are hitting an issue with this patch. > Kernel is stock ubuntu-precise (3.2.0-23 36), mdadm is v3.2.3. (We are > aware that this version of mdadm has an issue of unaligned writes, so > we backported the fix to it into this version). > > We are testing the following simple scenario: > # 2-way raid1 with a missing drive > # a fresh drive (no superblock) is added (not re-added) to the raid, > and the raid happily starts rebuilding it > # at some point in time, this drive fails, so rebuild aborts, drive is > kicked out. > # later the drive recovers and "mdadm remove & re-add" is attempted > # re-add fails Hmm.. I guess we should drop the "In_sync" test and just keep the "raid_disk" part of the test? > > Some debugging suggests the following: > > - The superblock of the drive to be re-added looks like this: > > Avail Dev Size : 10483712 (5.00 GiB 5.37 GB) > Array Size : 10483688 (5.00 GiB 5.37 GB) > Used Dev Size : 10483688 (5.00 GiB 5.37 GB) > Data Offset : 2048 sectors > Super Offset : 8 sectors > Recovery Offset : 100096 sectors > State : clean > Resync Offset : N/A sectors > Device UUID : d9114d02:76153894:87aaf09d:c55da9be > Internal Bitmap : 8 sectors from superblock > Update Time : Wed Jun 6 18:36:55 2012 > Checksum : 94468c9d - correct > Events : 192 > Device Role : Active device 1 > Array State : AA ('A' == active, '.' == missing) > > > So the drive has a valid recovery offset (MD_FEATURE_RECOVERY_OFFSET > set) and a valid role. > > - The "re-add" path of mdadm performs calls getinfo_super1(), which performs: > role = __le16_to_cpu(sb->dev_roles[__le32_to_cpu(sb->dev_number)]); > ... > info->disk.state = 6; /* active and in sync */ > info->disk.raid_disk = role; > > So at this point, MD_DISK_ACTIVE and MD_DISK_SYNC are set. > > - mdadm issues ADD_NEW_DISK, which arrives to add_new_disk() in the kernel > - the kernel performs: > if (!mddev->persistent) { > ... // not relevant to our case > } else > super_types[mddev->major_version]. > validate_super(mddev, rdev); > > - validate_super() performs: > if ((le32_to_cpu(sb->feature_map) & > MD_FEATURE_RECOVERY_OFFSET)) > rdev->recovery_offset = le64_to_cpu(sb->recovery_offset); > else > set_bit(In_sync, &rdev->flags); > rdev->raid_disk = role; > > So In_sync flag is not set, because we have a recovery offset. > > - Then the code introduced by the patch does: > if ((info->state & (1<<MD_DISK_SYNC)) && > (!test_bit(In_sync, &rdev->flags) || > rdev->raid_disk != info->raid_disk)) { > /* This was a hot-add request, but events doesn't > * match, so reject it. > */ > export_rdev(rdev); > return -EINVAL; > } > > So the first part of "OR" succeeds and we abort. > > An observation: this problem only happens with a fresh drive > apparently. If drive that was already In_sync fails and then is > re-added, kernel does not update its superblock until recovery > completes. 
So its superblock has no valid recovery_offset, and this > problem doesn't happen. The code that skips updating the superblock > was added as part of this email thread: > http://www.spinics.net/lists/raid/msg36275.html. However, I did not > dig deeper why the superblock gets updated in my case. But the > scenario above repros easily. > > Q1: Can you confirm that my analysis is correct? Looks about right. > Q2: Is this an expected behavior? I would assume that no, because the > same case for a failed drive (not a fresh drive) works fine It is a case that I hadn't really thought about (re-adding a drive that was partially recovered), so in a sense, no behaviour is expected :-) I agree it isn't desirable behaviour. > Q3: How would you recommend to handle this? 1/ Change the coded added by the commit to remove the !test_bit(In_sync, &rdev->flags) || section. 2/ Possibly backport mdadm commit 0a999759b54f94fd63ac0 This will allow --add to work in more cases. I'm not sure if you are hitting this issue or not. Please let me know how '1' goes and I'll get that fix upstream. Thanks, NeilBrown > > Thanks! > Alex. [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 14+ messages in thread
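For reference, suggestion 1/ above would leave the check in add_new_disk()
looking roughly like the fragment below. This is an untested sketch of the
idea, written against the snippet quoted earlier in the thread, and not
necessarily the exact change that eventually went upstream.

		if ((info->state & (1<<MD_DISK_SYNC)) &&
		    rdev->raid_disk != info->raid_disk) {
			/* Hot-add request for a slot that doesn't match the
			 * superblock's recorded role, so reject it.  With the
			 * !In_sync part of the test dropped, a device that was
			 * only partially recovered can be re-added again.
			 */
			export_rdev(rdev);
			return -EINVAL;
		}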
end of thread, other threads:[~2014-03-02 11:01 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-06 16:09 Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a) Alexander Lyakas
2012-06-07  7:22 ` Alexander Lyakas
2012-06-07 11:24 ` NeilBrown
2012-06-07 12:47 ` Alexander Lyakas
2012-06-27  7:22 ` NeilBrown
2012-06-27 16:40 ` Alexander Lyakas
2012-07-02  7:57 ` NeilBrown
2012-07-02 17:30 ` John Gehring
2014-02-19  9:51 ` Alexander Lyakas
2014-02-26 13:02 ` Alexander Lyakas
2014-02-27  3:19 ` NeilBrown
2014-03-02 11:01 ` Alexander Lyakas
2012-06-08  8:59 ` Thanks (was Re: Problem with patch...) John Robinson
2012-06-07 10:26 ` Problem with patch: "reject a re-add request that cannot be honoured" (commit bedd86b7773fd97f0d708cc0c371c8963ba7ba9a) NeilBrown