* Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts"
@ 2011-10-26 17:02 Alexander Lyakas
From: Alexander Lyakas @ 2011-10-26 17:02 UTC (permalink / raw)
To: linux-raid
Greetings everybody,
I have a question about the following code in Manage.c:Manage_subdevs()
disc.number = mdi.disk.number;
if (ioctl(fd, GET_DISK_INFO, &disc) != 0
|| disc.major != 0 || disc.minor != 0
|| !enough_fd(fd))
goto skip_re_add;
I do not understand why the checks disc.major != 0 || disc.minor != 0
are required. They basically mean that the kernel already has an
rdev->desc_nr equal to disc.number. But why fail the re-add procedure?
Let's say that enough_fd() returns true, and we go ahead and issue
ioctl(ADD_NEW_DISK). In this case, according to the kernel code in
add_new_disk(), it will not even look at info->number. It will
initialize rdev->desc_nr to -1, and will allocate a free desc_nr for
the rdev later.
Doing this with mdadm 3.1.4, where this check is not present, actually
succeeds. I understand that this code was added for the case when
enough_fd() returns false, which is a perfectly reasonable case to
protect against.
I was thinking that this code should actually check something like:
if (ioctl(fd, GET_DISK_INFO, &disc) != 0
|| disc.raid_disk != mdi.disk.raid_disk
|| !enough_fd(fd))
goto skip_re_add;
That is, check whether the slot previously occupied by the drive we
are trying to re-add is now occupied by a different drive (also
covering the cases raid_disk < 0, raid_disk >= raid_disks, etc.),
rather than checking the desc_nr, which has no persistent meaning.
Perhaps there are some compatibility issues with old kernels? Or
special considerations for ... containers? non-persistent arrays?
Thanks!
Alex.
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts"
From: NeilBrown @ 2011-10-26 21:51 UTC (permalink / raw)
To: Alexander Lyakas; +Cc: linux-raid

On Wed, 26 Oct 2011 19:02:37 +0200 Alexander Lyakas <alex.bolshoy@gmail.com> wrote:

> [...]
> Perhaps there are some compatibility issues with old kernels? Or
> special considerations for ... containers? non-persistent arrays?

The point of this code is to make --re-add fail unless mdadm is certain that
the kernel will accept the re-add, rather than turn the device into a spare.

If a device already exists with the same disk.number, a re-add cannot
succeed, so mdadm doesn't even try.

When you say in 3.1.4 it "actually succeeds" - what succeeds? Does it re-add
the device to the array, or does it turn the device into a spare?
I particularly do not want --re-add to turn a device into a spare because
people sometimes use it in cases where it cannot work, their device gets
turned into a spare, and they lose information that could have been used to
reconstruct the array.

Does that make sense?

NeilBrown
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts"
From: Alexander Lyakas @ 2011-10-27 9:10 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid

Hello Neil,
it makes perfect sense not to turn a device into a spare inadvertently.

However, with mdadm 3.1.4 under gdb, I tried the following:
- I had a raid6 with 4 drives (sda/b/c/d); their desc_nr values in the
kernel were respectively (according to GET_DISK_INFO): 0,1,2,3.
- I failed the last two drives (c & d) via mdadm and removed them from the array.
- I wiped the superblock on drive d.
- I added drive d back to the array.
So now the array had the following setup:
sda: desc_nr=0, raid_disk=0
sdb: desc_nr=1, raid_disk=1
sdd: desc_nr=4, raid_disk=2
So sdd was added to the array into slot 2, and received desc_nr=4.

- Now I asked to re-add drive sdc back to the array. In gdb I followed the
re-add flow, to the place where it fills the mdu_disk_info_t structure
from the superblock read from sdc. It put there the following content:
disc.major = ...
disc.minor = ...
disc.number = 2
disc.raid_disk = 2 (because previously this drive was in slot 2)
disc.state = ...

Now in gdb I changed disc.number to 4 (to match the desc_nr of sdd),
and then issued ADD_NEW_DISK. It succeeded, and the sdc drive received
desc_nr=2 (while it was asking for 4). Of course, it could not have
received the same raid_disk, because that raid_disk was already
occupied by sdd. So it was added as a spare.

But you are saying:
> If a device already exists with the same disk.number, a re-add cannot
> succeed, so mdadm doesn't even try.
whereas in my case it succeeded (though it actually did an "add" and not a "re-add").

That's why I was thinking it makes more sense to check disc.raid_disk
and not disc.number in this check. disc.number is not the drive's role
within the array (right?); it is simply the position of the drive in the
list of all drives.
So if you check raid_disk and see that the raid slot is already
occupied, you know the drive would be converted to a spare, which is
what we want to avoid.

And the enough_fd() check protects us from adding a spare to a failed
array (as you mentioned to me previously).

What am I missing?

Thanks,
Alex.

On Wed, Oct 26, 2011 at 11:51 PM, NeilBrown <neilb@suse.de> wrote:
> [...]
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts"
From: NeilBrown @ 2011-10-30 23:16 UTC (permalink / raw)
To: Alexander Lyakas; +Cc: linux-raid

On Thu, 27 Oct 2011 11:10:54 +0200 Alexander Lyakas <alex.bolshoy@gmail.com> wrote:

> [...]
> while in my case it succeeded (while it actually did "add" and not "re-add").

We seem to be using words differently.
If I ask mdadm to do a "re-add" and it does an "add", then I consider that to
be "failure"; however, you seem to consider it to be a "success".

That seems to be the source of confusion.

> That's why I was thinking it makes more sense to check disc.raid_disk
> and not disc.number in this check. Since disc.number is not the
> drive's role within the array (right?), it is simply a position of the
> drive in the list of all drives.
> So if you check raid_disk, and see that this raid slot is already
> occupied, then, naturally, the drive will be converted to spare, which
> we want to avoid.

It may well be appropriate to check raid_disk as well, yes. It isn't so easy,
though, which is maybe why I didn't.

With 0.90 metadata, disc.number is the same as disc.raid_disk for active
devices. That might be another reason for the code being the way that it is.

Thanks,
NeilBrown
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts"
From: Alexander Lyakas @ 2011-10-31 8:57 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid

Thank you for the clarification, Neil!

I was looking at raid_disk because I am trying to see whether it is
possible to control into which raid slot a disk is being added (not
re-added).

Let's say we had a raid6 with 4 drives (a,b,c,d), and drives c and d
failed. Now let's say it is decided to replace drive d with a new
drive e (--add for drive e). Then it is possible that drive e will take
the raid slot of drive c. So later, if we want to bring drive c back
into the array, it will have to go into a different slot, resulting in
a full reconstruction rather than a bitmap-based reconstruction. Whereas
if we had re-added drive c first, it would have done a bitmap-based
reconstruction, and only drive e would have required a full
reconstruction.

So do you think it makes sense to somehow (perhaps through
disc.raid_disk) instruct the kernel into which slot to add a new
drive?

Thanks,
Alex.

On Mon, Oct 31, 2011 at 1:16 AM, NeilBrown <neilb@suse.de> wrote:
> [...]
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts"
From: NeilBrown @ 2011-10-31 9:19 UTC (permalink / raw)
To: Alexander Lyakas; +Cc: linux-raid

On Mon, 31 Oct 2011 10:57:25 +0200 Alexander Lyakas <alex.bolshoy@gmail.com> wrote:

> [...]
> So do you think it makes sense to somehow (perhaps through
> disc.raid_disk) instruct the kernel into which slot to add a new
> drive?

You should be able to do that via sysfs.
  echo frozen > sync_action
  echo $major:$minor > new_dev
  echo $slot > dev-$DEVNAME/slot
  echo idle > sync_action

something like that.

Seems an odd sort of scenario, though.

Note that with RAID1 this is not an issue. A re-added drive can take any
position in a RAID1 array - it doesn't have to be the same one.

NeilBrown
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts"
From: Alexander Lyakas @ 2011-11-01 16:26 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid

Hi Neil,
I hope you don't find my questions too pesky; I hope that as time goes
on, I will be able to contribute as well, and not just ask for
information.

What you suggested works great, but only if the drive already has a
valid superblock. This is because of md_import_device(), which calls
super_1_load(). Although it calls it with a NULL refdev, it still
performs basic validity checks on the superblock. So the functionality
of mdadm's write_init_super() is needed (perhaps this could be an mdadm
operation; I am not sure how dangerous it is to expose such
functionality).

Another question I was digging into is where the kernel determines
whether it is going to do a bitmap-based reconstruction or a full
reconstruction. Can you verify whether I found it correctly, or am I
way off?

In super_1_validate():
if (ev1 < mddev->bitmap->events_cleared)
        return 0;
This means that the array bitmap was cleared after the drive was
detached, so we bail right out, do not set rdev->raid_disk, and leave
all flags off. In that case, eventually, we will need a full
reconstruction.

Otherwise, after we decide that this drive is an active device (not
failed or spare):
if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RECOVERY_OFFSET))
        rdev->recovery_offset = le64_to_cpu(sb->recovery_offset);
else
        set_bit(In_sync, &rdev->flags);
Here we might or might not set the In_sync flag.

Then in slot_store() and in add_new_disk(), we set the flag that is
important for this question:
if (test_bit(In_sync, &rdev->flags))
        rdev->saved_raid_disk = slot;
else
        rdev->saved_raid_disk = -1;

And later, in hot_add_disk (like raid5_add_disk, raid1_add_disk), we
check this flag and set or do not set the fullsync flag. And in
raid1_add_disk, like you said, the slot doesn't matter; any valid
saved_raid_disk is good.

Is this correct? If yes, then in the sequence you suggested we will
always do a full reconstruction, because the new_dev_store() and
slot_store() sequence does not call validate_super(), so In_sync is
never set and saved_raid_disk remains -1. This is perfectly fine.

Thanks!
Alex.

On Mon, Oct 31, 2011 at 11:19 AM, NeilBrown <neilb@suse.de> wrote:
> [...]
>> >> >> >> But you are saying: >> >> > If a device already exists with the same disk.number, a re-add cannot >> >> > succeed, so mdadm doesn't even try. >> >> while in my case it succeeded (while it actually did "add" and not "re-add"). >> > >> > We seem to be using word differently. >> > If I ask mdadm to do a "re-add" and it does an "add", then I consider that to >> > be "failure", however you seem to consider it to be a "success". >> > >> > That seems to be the source of confusion. >> > >> > >> >> >> >> That's why I was thinking it makes more sense to check disc.raid_disk >> >> and not disc.number in this check. Since disc.number is not the >> >> drive's role within the array (right?), it is simply a position of the >> >> drive in the list of all drives. >> >> So if you check raid_disk, and see that this raid slot is already >> >> occupied, then, naturally, the drive will be converted to spare, which >> >> we want to avoid. >> > >> > It may well be appropriate to check raid_disk as well, yes. It isn't so easy >> > though which is maybe why I didn't. >> > >> > With 0.90 metadata, the disc.number is the same as disc.raid_disk for active >> > devices. That might be another reason for the code being the way that it is. >> > >> > Thanks, >> > NeilBrown >> > >> > >> > >> >> >> >> And the enough_fd() check protects us from adding a spare to a failed >> >> array (like you mentioned to me previously). >> >> >> >> What am I missing? >> >> >> >> Thanks, >> >> Alex. 
>> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> On Wed, Oct 26, 2011 at 11:51 PM, NeilBrown <neilb@suse.de> wrote: >> >> > On Wed, 26 Oct 2011 19:02:37 +0200 Alexander Lyakas <alex.bolshoy@gmail.com> >> >> > wrote: >> >> > >> >> >> Greetings everybody, >> >> >> I have a question about the following code in Manage.c:Manage_subdevs() >> >> >> >> >> >> disc.number = mdi.disk.number; >> >> >> if (ioctl(fd, GET_DISK_INFO, &disc) != 0 >> >> >> || disc.major != 0 || disc.minor != 0 >> >> >> || !enough_fd(fd)) >> >> >> goto skip_re_add; >> >> >> >> >> >> I do not underatand why the checks: disc.major != 0 || disc.minor != 0 >> >> >> are required. This basically means that the kernel already has an >> >> >> rdev->desc_nr equal to disc.number. But why fail the re-add procedure? >> >> >> >> >> >> Let's say that enough_fd() returns true, and we go ahead an issue >> >> >> ioctl(ADD_NEW_DISK). In this case, according to the kernel code in >> >> >> add_new_disk(), it will not even look at info->number. It will >> >> >> initialize rdev->desc_nr to -1, and will allocate a free desc_nr for >> >> >> the rdev later. >> >> >> >> >> >> Doing this with mdadm 3.1.4, where this check is not present, actually >> >> >> succeeds. I understand that this code was added for cases when >> >> >> enough_fd() returns false, which sounds perfectly fine to protect >> >> >> from. >> >> >> >> >> >> I was thinking that this code should actually check something like: >> >> >> if (ioctl(fd, GET_DISK_INFO, &disc) != 0 >> >> >> || disk.raid_disk != mdi.disk.raid_disk >> >> >> || !enough_fd(fd)) >> >> >> goto skip_re_add; >> >> >> >> >> >> That is to check that the slot that was being occupied by the drive we >> >> >> are trying to add, is already occupied by a different drive (need also >> >> >> to cover cases of raid_disk <0, raid_disk >= raid_disks etc...) and >> >> >> not the desc_nr, which does not have any persistent meaning. 
>> >> >> >> >> >> Perhaps there are some compatibility issues with old kernels? Or >> >> >> special considerations for ... containers? non-persistent arrays? >> >> > >> >> > The point of this code is to make --re-add fail unless mdadm is certain that >> >> > the kernel will accept the re-add, rather than turn the device into a spare. >> >> > >> >> > If a device already exists with the same disk.number, a re-add cannot >> >> > succeed, so mdadm doesn't even try. >> >> > >> >> > When you say in 3.1.4 it "actually succeeds" - what succeeds? Does it re-add >> >> > the device to the array, or does it turn the device into a spare? >> >> > I particularly do not want --re-add to turn a device into a spare because >> >> > people sometimes use it in cases where it cannot work, their device gets >> >> > turned into a spare, and they lose information that could have been used to >> >> > reconstruct the array. >> >> > >> >> > That that make sense? >> >> > >> >> > NeilBrown >> >> > >> >> > >> >> -- >> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in >> >> the body of a message to majordomo@vger.kernel.org >> >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > >> > >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 13+ messages in thread
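[Editorial note: the In_sync / saved_raid_disk chain traced in the message above can be condensed into a toy shell function. This is purely illustrative and is not kernel code; it only restates the behaviour discussed in the thread: slot_store()/add_new_disk() record saved_raid_disk only when validate_super() has set In_sync, and hot_add_disk() later treats saved_raid_disk == -1 as "needs full resync".]

```shell
# Toy restatement (not kernel code) of the decision traced above:
# saved_raid_disk gets the slot only when In_sync was set, otherwise -1,
# and -1 downstream forces a full reconstruction.
saved_raid_disk() {        # $1 = 1 if In_sync was set, else 0; $2 = slot
    if [ "$1" -eq 1 ]; then
        echo "$2"          # bitmap-based recovery remains possible
    else
        echo -1            # full reconstruction will be needed
    fi
}
saved_raid_disk 1 2   # re-add path: validate_super() ran and set In_sync
saved_raid_disk 0 2   # sysfs new_dev path: validate_super() never ran
```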
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts" 2011-11-01 16:26 ` Alexander Lyakas @ 2011-11-01 22:52 ` NeilBrown 2011-11-08 16:23 ` Alexander Lyakas 0 siblings, 1 reply; 13+ messages in thread From: NeilBrown @ 2011-11-01 22:52 UTC (permalink / raw) To: Alexander Lyakas; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 3102 bytes --] On Tue, 1 Nov 2011 18:26:14 +0200 Alexander Lyakas <alex.bolshoy@gmail.com> wrote: > Hi Neil, > I hope you don't find my questions too pesky; I hope as time goes, I > will be able to contribute as well, and not just ask for information. Enquiring minds should always be encouraged! > > What you suggested works great, but only if the drive already has a > valid superblock. This is according to md_import_device(), which calls > super_1_load(). Although it calls it with NULL refdev, but it still > performs basic validity checks of the superblock. So it is needed to > have functionality of write_init_super() of mdadm (perhaps can be an > mdadm operation, not sure how dangerous this is to expose such > functionality). Correct. With v1.x metadata - and even more so with DDF and IMSM - the kernel does not know everything about the metadata and cannot create new superblocks. At best it can edit what is there. As for adding things to mdadm - you just need a genuine use-case. I can see it could be valid to add a '--role' option to --add so you can add a device to a specific slot. I doubt it would be used much, but if there was a concrete need I wouldn't object. > > Another question I was digging for, is to find a spot where kernel > determines whether it is going to do a bitmap-based reconstruction or > a full reconstruction. Can you verify that I found it correctly, or I > am way off? 
> > In super_1_validate(): > if (ev1 < mddev->bitmap->events_cleared) > return 0; > This means that the array bitmap was cleared after the drive was > detached, so we bail right out, and do not set rdev->raid_disk, and > leave all flags off. In that case, eventually, we will need full > reconstruction. > Otherwise, after we decide that this drive is an active device (not > failed or spare): > if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RECOVERY_OFFSET)) > rdev->recovery_offset = le64_to_cpu(sb->recovery_offset); > else > set_bit(In_sync, &rdev->flags); > Here we might or might not set the In_sync flag. > Yes. 'In_sync' effectively means that recovery_offset == MAX. > Then in slot_store() and in add_new_disk(), we set the important flag > for that matter: > if (test_bit(In_sync, &rdev->flags)) > rdev->saved_raid_disk = slot; > else > rdev->saved_raid_disk = -1; > > And later, in hot_add_disk (like raid5_add_disk, raid1_add_disk), we > check this flag and set/not set the fullsync flag. And in > raid1_add_disk, like you said, the slot doesn't matter, any valid > saved_raid_disk is good. > > Is this correct? If yes, then in the sequence you suggested, we will > always do full reconstruction. Because new_dev_store() and > slot_store() sequence do not call validate_super(), so In_sync is > never set, so saved_raid_disk remains -1. This is perfectly fine. Yes, this is all correct. In the sequence I suggested you could add echo insync > dev-$DEVNAME/state before echo $slot > dev-$DEVNAME/slot and it would result in saved_raid_disk being set. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
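[Editorial note: the full sysfs sequence from this exchange, including the optional "insync" write Neil suggests for getting saved_raid_disk set, can be sketched as below. MDSYS here is a scratch directory standing in for the real /sys/block/mdX/md of a running array, so the writes can be exercised without a live array; the device name sdc, slot 2, and major:minor 8:32 are made-up example values. On a real array these writes would need root and the real sysfs paths.]

```shell
# Sketch of the sysfs add-to-slot sequence discussed above.
# MDSYS stands in for /sys/block/mdX/md; sdc / 2 / 8:32 are examples.
MDSYS=$(mktemp -d)                         # scratch stand-in for the md sysfs dir
DEVNAME=sdc
mkdir -p "$MDSYS/dev-$DEVNAME"
echo frozen > "$MDSYS/sync_action"         # pause recovery while configuring
echo 8:32   > "$MDSYS/new_dev"             # major:minor of the device to add
echo insync > "$MDSYS/dev-$DEVNAME/state"  # optional: set In_sync first, so that...
echo 2      > "$MDSYS/dev-$DEVNAME/slot"   # ...slot_store() records saved_raid_disk
echo idle   > "$MDSYS/sync_action"         # let recovery proceed
```

Against a scratch directory this only demonstrates the order of the writes; the side effects described in the thread happen only on a live md array.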
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts" 2011-11-01 22:52 ` NeilBrown @ 2011-11-08 16:23 ` Alexander Lyakas 2011-11-08 23:41 ` NeilBrown 0 siblings, 1 reply; 13+ messages in thread From: Alexander Lyakas @ 2011-11-08 16:23 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid Hi Neil, 1) after looking at this more, I see the following: in the ADD_NEW_DISK sequence (to a running array, not during assembly) in the kernel, the desc_nr is loaded from the superblock of the disk to be added in super_1_load(): if (sb->level == cpu_to_le32(LEVEL_MULTIPATH)) rdev->desc_nr = -1; else rdev->desc_nr = le32_to_cpu(sb->dev_number); And later, in bind_rdev_to_array(), it is indeed checked that desc_nr is unique. However, at least for 1.2 arrays, I believe this is too restrictive, don't you think? If the raid slot (not desc_nr) of the device being re-added is *not occupied* yet, can't we just select a free desc_nr for the new disk on that path? Or perhaps mdadm on the re-add path can select a free desc_nr (disc.number) for it (just as it does for --add), after ensuring that the slot is not occupied yet? Where is it better to do this? Otherwise, the re-add fails, while it could perfectly well succeed (it only needs to pick a different desc_nr). 2) About your suggestion of setting the In_sync flag via sysfs: I think it is too dangerous to blindly set it, so I am not going to do it. Let the kernel look at the event count and decide. 3) So it looks like at present, my best way to add (not re-add) a disk into a specific slot is the following: - borrow some code from mdadm to initialize the superblock of the new disk as belonging to this array (init_super). - use sysfs to do what you suggested before: echo frozen > sync_action echo $major:$minor > new_dev echo $slot > dev-$DEVNAME/slot echo idle > sync_action Does this make sense? Thanks! Alex. 
On Wed, Nov 2, 2011 at 12:52 AM, NeilBrown <neilb@suse.de> wrote: > On Tue, 1 Nov 2011 18:26:14 +0200 Alexander Lyakas <alex.bolshoy@gmail.com> > wrote: > >> Hi Neil, >> I hope you don't find my questions too pesky; I hope as time goes, I >> will be able to contribute as well, and not just ask for information. > > Enquiring minds should always be encouraged! > >> >> What you suggested works great, but only if the drive already has a >> valid superblock. This is according to md_import_device(), which calls >> super_1_load(). Although it calls it with NULL refdev, but it still >> performs basic validity checks of the superblock. So it is needed to >> have functionality of write_init_super() of mdadm (perhaps can be an >> mdadm operation, not sure how dangerous this is to expose such >> functionality). > > Correct. With v1.x metadata - and even more so with DDF and IMSM - the > kernel does not know everything about the metadata and cannot create new > superlocks. At best it can edit what is there. > > As for adding things to mdadm - you just need a genuine use-case. > I can see it could be valid to add a '--role' option to --add so you can add > a device to a specific slot. I doubt it would be used much, but if there was > a concrete need I wouldn't object. > > >> >> Another question I was digging for, is to find a spot where kernel >> determines whether it is going to do a bitmap-based reconstruction or >> a full reconstruction. Can you verify that I found it correctly, or I >> am way off? >> >> In super_1_validate(): >> if (ev1 < mddev->bitmap->events_cleared) >> return 0; >> This means that the array bitmap was cleared after the drive was >> detached, so we bail right out, and do not set rdev->raid_disk, and >> leave all flags off. In that case, eventually, we will need full >> reconstruction. 
>> Otherwise, after we decide that this drive is an active device (not >> failed or spare): >> if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RECOVERY_OFFSET)) >> rdev->recovery_offset = le64_to_cpu(sb->recovery_offset); >> else >> set_bit(In_sync, &rdev->flags); >> Here we might or might not set the In_sync flag. >> > > Yes. 'In_sync' effectively means that recovery_offset == MAX. > > >> Then in slot_store() and in add_new_disk(), we set the important flag >> for that matter: >> if (test_bit(In_sync, &rdev->flags)) >> rdev->saved_raid_disk = slot; >> else >> rdev->saved_raid_disk = -1; >> >> And later, in hot_add_disk (like raid5_add_disk, raid1_add_disk), we >> check this flag and set/not set the fullsync flag. And in >> raid1_add_disk, like you said, the slot doesn't matter, any valid >> saved_raid_disk is good. >> >> Is this correct? If yes, then in the sequence you suggested, we will >> always do full reconstruction. Because new_dev_store() and >> slot_store() sequence do not call validate_super(), so In_sync is >> never set, so saved_raid_disk remains -1. This is perfectly fine. > > Yes, this is all correct. > In the sequence I suggested you could add > echo insync > dev-$DEVNAME/state > before > echo $slot > dev-$DEVNAME/slot > > and it would result in saved_raid_disk being set. > > NeilBrown > > -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts" 2011-11-08 16:23 ` Alexander Lyakas @ 2011-11-08 23:41 ` NeilBrown 2011-11-17 11:13 ` Alexander Lyakas 0 siblings, 1 reply; 13+ messages in thread From: NeilBrown @ 2011-11-08 23:41 UTC (permalink / raw) To: Alexander Lyakas; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 2090 bytes --] On Tue, 8 Nov 2011 18:23:49 +0200 Alexander Lyakas <alex.bolshoy@gmail.com> wrote: > Hi Neil, > > 1) > after looking at this more I see the following: on ADD_NEW_DISK > sequence (to a running array, not during assembly) in the kernel, the > desc_nr is loaded from the superblock of the disk to be added in > super_1_load(): > if (sb->level == cpu_to_le32(LEVEL_MULTIPATH)) > rdev->desc_nr = -1; > else > rdev->desc_nr = le32_to_cpu(sb->dev_number); > And later, in bind_rdev_to_array() it is indeed checked that desc_nr is unique. > > However, at least for 1.2 arrays, I believe this is too restrictive, > don't you think? If the raid slot (not desc_nr) of the device being > re-added is *not occupied* yet, can't we just select a free desc_nr > for the new disk on that path? > Or perhaps, mdadm on the re-add path can select a free desc_nr > (disc.number) for it (just as it does for --add), after ensuring that > the slot is not occupied yet? Where it is better to do it? > Otherwise, the re-add fails, while it can perfectly succeed (only pick > a different desc_nr). I think I see what you are saying. However, my question is: is this really an issue? Is there a credible sequence of events that results in the current code making an undesirable decision? Of course, I do not count deliberately editing the metadata as part of a credible sequence of events. > > 2) > About your suggestion of setting the In_sync flag via sysfs, I think > it's too dangerous to blindly set it, so I am not going to do it. Let > the kernel look at event count and decide. 
> > 3) > So it looks like at present, my best way to add (not re-add) a disk > into a specific slot is the following: > - borrow some code from mdadm to initialize the superblock of the new > disk, as belonging to this array (init_super). > - use the sysfs to do what you suggested before: > echo frozen > sync_action > echo $major:$minor > new_dev > echo $slot > dev-$DEVNAME/slot > echo idle > sync_action > > Does this makes sense? Yes it does. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts" 2011-11-08 23:41 ` NeilBrown @ 2011-11-17 11:13 ` Alexander Lyakas 2011-11-21 2:44 ` NeilBrown 0 siblings, 1 reply; 13+ messages in thread From: Alexander Lyakas @ 2011-11-17 11:13 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid Hello Neil, >> However, at least for 1.2 arrays, I believe this is too restrictive, >> don't you think? If the raid slot (not desc_nr) of the device being >> re-added is *not occupied* yet, can't we just select a free desc_nr >> for the new disk on that path? >> Or perhaps, mdadm on the re-add path can select a free desc_nr >> (disc.number) for it (just as it does for --add), after ensuring that >> the slot is not occupied yet? Where it is better to do it? >> Otherwise, the re-add fails, while it can perfectly succeed (only pick >> a different desc_nr). > > I think I see what you are saying. > However my question is: is this really an issue. > Is there a credible sequence of events that results in the current code makes > an undesirable decision? Of course I do not count deliberately editing the > metadata as part of a credible sequence of events. Consider this scenario, in which the code refuses to re-add a drive: Step 1: - I created a raid1 array with 3 drives: A, B, C (and their desc_nr=0,1,2) - I failed drives B and C, removed them from the array, and totally forgot about them for the rest of the scenario. - I added two new drives, D and E, to the array and waited for the resync to complete. The array now has the following structure: A: desc_nr=0 D: desc_nr=3 (was selected during the "add" path in mdadm, as expected) E: desc_nr=4 (was selected during the "add" path in mdadm, as expected) Step 2: - I failed drives D and E, and removed them from the array. The E drive is not used for the rest of the scenario, so we can forget about it. I wrote some data to the array. 
At this point, the array bitmap is dirty, and will not be cleared, since the array is degraded. Step 3: - I added one new drive (the last one, I promise!) to the array - drive F - and waited for it to resync. The array now has the following structure: A: desc_nr=0 F: desc_nr=3 So F took the desc_nr of drive D (desc_nr=3). This is expected according to the mdadm code. Event counters at this point: A and F: events=149, events_cleared=0 D: events=109 Step 4: At this point, mdadm refuses to re-add drive D to the array, because its desc_nr is already taken (I verified that via gdb). On the other hand, if we simply picked a fresh desc_nr for D, then I believe it could be re-added, because: - slots are not important for raid1 (D's slot was actually taken by F). - it should pass the check for bitmap-based resync (events in D's sb >= events_cleared of the array). Do you agree with this, or have I perhaps missed something? Additional notes: - of course, such a scenario is relevant only for arrays with more than single redundancy, so it is not relevant for raid5 - to simulate such a scenario for raid6, at step 3 one needs to add the new drive into a slot that is not the slot of the drive we're going to re-add in step 4 (otherwise it takes D's slot, and then we really cannot re-add). This can be done as we discussed earlier. What do you think? Thanks, Alex. -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 13+ messages in thread
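[Editorial note: the bitmap-vs-full decision argued for drive D above can be restated as a toy check. This is a simplification (the real kernel decision also involves the In_sync handling discussed earlier in the thread), condensed from the super_1_validate() test quoted previously: ev1 < events_cleared means the bitmap can no longer cover the member, forcing a full resync.]

```shell
# Toy restatement of the super_1_validate() check quoted earlier:
# a re-added member qualifies for bitmap-based resync only if the
# events count on its superblock is >= the array's events_cleared.
resync_kind() {    # $1 = events on the member's superblock
                   # $2 = events_cleared of the array bitmap
    if [ "$1" -lt "$2" ]; then
        echo full
    else
        echo bitmap
    fi
}
resync_kind 109 0  # drive D in the scenario: events=109 vs events_cleared=0
```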
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts" 2011-11-17 11:13 ` Alexander Lyakas @ 2011-11-21 2:44 ` NeilBrown 2011-11-22 8:45 ` Alexander Lyakas 0 siblings, 1 reply; 13+ messages in thread From: NeilBrown @ 2011-11-21 2:44 UTC (permalink / raw) To: Alexander Lyakas; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 3805 bytes --] On Thu, 17 Nov 2011 13:13:20 +0200 Alexander Lyakas <alex.bolshoy@gmail.com> wrote: > Hello Neil, > > >> However, at least for 1.2 arrays, I believe this is too restrictive, > >> don't you think? If the raid slot (not desc_nr) of the device being > >> re-added is *not occupied* yet, can't we just select a free desc_nr > >> for the new disk on that path? > >> Or perhaps, mdadm on the re-add path can select a free desc_nr > >> (disc.number) for it (just as it does for --add), after ensuring that > >> the slot is not occupied yet? Where it is better to do it? > >> Otherwise, the re-add fails, while it can perfectly succeed (only pick > >> a different desc_nr). > > > > I think I see what you are saying. > > However my question is: is this really an issue. > > Is there a credible sequence of events that results in the current code makes > > an undesirable decision? Of course I do not count deliberately editing the > > metadata as part of a credible sequence of events. > > Consider this scenario, in which the code refuses to re-add a drive: > > Step 1: > - I created a raid1 array with 3 drives: A,B,C (and their desc_nr=0,1,2) > - I failed drives B and C, and removed them from the array, and > totally forgot about them for the rest of the scenario. > - I added to the array two new drives: D and E, and waited for the > resync to complete. 
The array now has the following structure: > A: descr_nr=0 > D: desc_nr=3 (was selected during the "add" path in mdadm, as expected) > E: desc_nr=4 (was selected during the "add" path in mdadm, as expected) > > Step 2: > - I failed drives D and E, and removed them from the array. The E > drive is not used for the rest of the scenario, so we can forget about > it. > > I wrote some data to the array. At this point, the array bitmap is > dirty, and will not be cleared, since the array is degraded. > > Step 3: > - I added one new drive (last one, I promise!) to the array - drive F, > and waited for it to resync. The array now has the following > structure: > A: descr_nr=0 > F: desc_nr=3 > > So F took desc_nr of D drive (desc_nr=3). This is expected according > to mdadm code. > > Event counters at this point: > A and F: events=149, events_cleared=0 > D: events=109 > > Step 4: > At this point, mdadm refuses to re-add the drive D to the array, > because its desc_nr is already taken (I verified that via gdb). On the > other hand, if we would have simply picked a fresh desc_nr for D, then > it could be re-added I believe, because: > - slots are not important for raid1 (D's slot was taken actually by F). > - it should pass the check for bitmap-based resync (events in D' sb >= > events_cleared of the array) > > Do you agree with this, or perhaps I missed something? > > Additional notes: > - of course, such scenario is relevant only for arrays with more than > single redundancy, so it's not relevant for raid5 > - to simulate such scenario for raid6, need at step 3 to add the new > drive to the slot, which is not the slot of the drive we're going to > re-add in step4 (otherwise, it takes the D's slot, and then we really > cannot re-add). This can be done as we discussed earlier. > > What do you think? I think some of the details in your steps aren't really right, but I do see the point you are making. 
If you keep the array degraded, the events_cleared will not be updated so any old array member can safely be re-added. I'll have a look and see how best to fix the code. Thanks. NeilBrown > > Thanks, > Alex. > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts" 2011-11-21 2:44 ` NeilBrown @ 2011-11-22 8:45 ` Alexander Lyakas 0 siblings, 0 replies; 13+ messages in thread From: Alexander Lyakas @ 2011-11-22 8:45 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid Thanks Neil, for looking into this. On Mon, Nov 21, 2011 at 4:44 AM, NeilBrown <neilb@suse.de> wrote: > On Thu, 17 Nov 2011 13:13:20 +0200 Alexander Lyakas <alex.bolshoy@gmail.com> > wrote: > >> Hello Neil, >> >> >> However, at least for 1.2 arrays, I believe this is too restrictive, >> >> don't you think? If the raid slot (not desc_nr) of the device being >> >> re-added is *not occupied* yet, can't we just select a free desc_nr >> >> for the new disk on that path? >> >> Or perhaps, mdadm on the re-add path can select a free desc_nr >> >> (disc.number) for it (just as it does for --add), after ensuring that >> >> the slot is not occupied yet? Where it is better to do it? >> >> Otherwise, the re-add fails, while it can perfectly succeed (only pick >> >> a different desc_nr). >> > >> > I think I see what you are saying. >> > However my question is: is this really an issue. >> > Is there a credible sequence of events that results in the current code makes >> > an undesirable decision? Of course I do not count deliberately editing the >> > metadata as part of a credible sequence of events. >> >> Consider this scenario, in which the code refuses to re-add a drive: >> >> Step 1: >> - I created a raid1 array with 3 drives: A,B,C (and their desc_nr=0,1,2) >> - I failed drives B and C, and removed them from the array, and >> totally forgot about them for the rest of the scenario. >> - I added to the array two new drives: D and E, and waited for the >> resync to complete. 
The array now has the following structure: >> A: descr_nr=0 >> D: desc_nr=3 (was selected during the "add" path in mdadm, as expected) >> E: desc_nr=4 (was selected during the "add" path in mdadm, as expected) >> >> Step 2: >> - I failed drives D and E, and removed them from the array. The E >> drive is not used for the rest of the scenario, so we can forget about >> it. >> >> I wrote some data to the array. At this point, the array bitmap is >> dirty, and will not be cleared, since the array is degraded. >> >> Step 3: >> - I added one new drive (last one, I promise!) to the array - drive F, >> and waited for it to resync. The array now has the following >> structure: >> A: descr_nr=0 >> F: desc_nr=3 >> >> So F took desc_nr of D drive (desc_nr=3). This is expected according >> to mdadm code. >> >> Event counters at this point: >> A and F: events=149, events_cleared=0 >> D: events=109 >> >> Step 4: >> At this point, mdadm refuses to re-add the drive D to the array, >> because its desc_nr is already taken (I verified that via gdb). On the >> other hand, if we would have simply picked a fresh desc_nr for D, then >> it could be re-added I believe, because: >> - slots are not important for raid1 (D's slot was taken actually by F). >> - it should pass the check for bitmap-based resync (events in D' sb >= >> events_cleared of the array) >> >> Do you agree with this, or perhaps I missed something? >> >> Additional notes: >> - of course, such scenario is relevant only for arrays with more than >> single redundancy, so it's not relevant for raid5 >> - to simulate such scenario for raid6, need at step 3 to add the new >> drive to the slot, which is not the slot of the drive we're going to >> re-add in step4 (otherwise, it takes the D's slot, and then we really >> cannot re-add). This can be done as we discussed earlier. >> >> What do you think? > > I think some of the details in your steps aren't really right, but I do see > the point you are making. 
> If you keep the array degraded, the events_cleared will not be updated so any > old array member can safely be re-added. > > I'll have a look and see how best to fix the code. > > Thanks. > > NeilBrown > > > >> >> Thanks, >> Alex. >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2011-11-22 8:45 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-10-26 17:02 Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts" Alexander Lyakas 2011-10-26 21:51 ` NeilBrown 2011-10-27 9:10 ` Alexander Lyakas 2011-10-30 23:16 ` NeilBrown 2011-10-31 8:57 ` Alexander Lyakas 2011-10-31 9:19 ` NeilBrown 2011-11-01 16:26 ` Alexander Lyakas 2011-11-01 22:52 ` NeilBrown 2011-11-08 16:23 ` Alexander Lyakas 2011-11-08 23:41 ` NeilBrown 2011-11-17 11:13 ` Alexander Lyakas 2011-11-21 2:44 ` NeilBrown 2011-11-22 8:45 ` Alexander Lyakas
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).