* Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts"
From: Alexander Lyakas @ 2011-10-26 17:02 UTC
To: linux-raid
Greetings everybody,
I have a question about the following code in Manage.c:Manage_subdevs()
    disc.number = mdi.disk.number;
    if (ioctl(fd, GET_DISK_INFO, &disc) != 0
        || disc.major != 0 || disc.minor != 0
        || !enough_fd(fd))
            goto skip_re_add;
I do not understand why the checks disc.major != 0 || disc.minor != 0
are required. If they trigger, it basically means that the kernel
already has an rdev whose desc_nr equals disc.number. But why fail the
re-add procedure?
Let's say that enough_fd() returns true, and we go ahead and issue
ioctl(ADD_NEW_DISK). In this case, according to the kernel code in
add_new_disk(), it will not even look at info->number. It will
initialize rdev->desc_nr to -1, and will allocate a free desc_nr for
the rdev later.
Doing this with mdadm 3.1.4, where this check is not present, actually
succeeds. I understand that this code was added for the case when
enough_fd() returns false, which is a perfectly reasonable thing to
protect against.
I was thinking that this code should actually check something like:
    if (ioctl(fd, GET_DISK_INFO, &disc) != 0
        || disc.raid_disk != mdi.disk.raid_disk
        || !enough_fd(fd))
            goto skip_re_add;
That is, check whether the slot previously occupied by the drive we
are trying to re-add is now occupied by a different drive (the cases
raid_disk < 0, raid_disk >= raid_disks, etc. would also need covering;
see the sketch below), rather than checking desc_nr, which has no
persistent meaning.
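Spelled out with the boundary cases, the idea would look roughly like
this (a sketch only, untested; here `array' stands for the
mdu_array_info_t that Manage_subdevs() obtains via GET_ARRAY_INFO):

    /* sketch: compare raid slots instead of desc_nr */
    disc.number = mdi.disk.number;
    if (ioctl(fd, GET_DISK_INFO, &disc) != 0
        || mdi.disk.raid_disk < 0                  /* was not an active member */
        || mdi.disk.raid_disk >= array.raid_disks  /* stale slot number */
        || disc.raid_disk != mdi.disk.raid_disk    /* slot held by another drive? */
        || !enough_fd(fd))
            goto skip_re_add;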
Perhaps there are some compatibility issues with old kernels? Or
special considerations for ... containers? non-persistent arrays?
Thanks!
Alex.
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts"
From: NeilBrown @ 2011-10-26 21:51 UTC
To: Alexander Lyakas; +Cc: linux-raid
On Wed, 26 Oct 2011 19:02:37 +0200 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:
> [...]
The point of this code is to make --re-add fail unless mdadm is certain that
the kernel will accept the re-add, rather than turn the device into a spare.
If a device already exists with the same disk.number, a re-add cannot
succeed, so mdadm doesn't even try.
When you say in 3.1.4 it "actually succeeds" - what succeeds? Does it re-add
the device to the array, or does it turn the device into a spare?
I particularly do not want --re-add to turn a device into a spare because
people sometimes use it in cases where it cannot work, their device gets
turned into a spare, and they lose information that could have been used to
reconstruct the array.
Does that make sense?
NeilBrown
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts"
From: Alexander Lyakas @ 2011-10-27 9:10 UTC
To: NeilBrown; +Cc: linux-raid
Hello Neil,
it makes perfect sense not to turn a device into a spare inadvertently.
However, with mdadm 3.1.4 under gdb, I tried the following:
- I had a raid6 with 4 drives (sda/b/c/d); their "desc_nr" values in
the kernel were, respectively (according to GET_DISK_INFO), 0,1,2,3.
- I failed the last two drives (c & d) via mdadm and removed them from the array
- I wiped the superblock on drive d.
- I added the drive d back to the array
So now the array had the following setup:
sda: desc_nr=0, raid_disk=0
sdb: desc_nr=1, raid_disk=1
sdd: desc_nr=4, raid_disk=2
So sdd was added to the array into slot 2, and received desc_nr=4.
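For reference, the setup up to this point can be reproduced with plain
mdadm commands (the array name and whole-disk members are
illustrative):

    mdadm --create /dev/md0 --level=6 --raid-devices=4 \
          /dev/sda /dev/sdb /dev/sdc /dev/sdd
    mdadm /dev/md0 --fail /dev/sdc --remove /dev/sdc
    mdadm /dev/md0 --fail /dev/sdd --remove /dev/sdd
    mdadm --zero-superblock /dev/sdd   # wipe d's superblock
    mdadm /dev/md0 --add /dev/sdd      # d returns as a new disk: slot 2, desc_nr=4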
- Now I asked to re-add drive sdc back to the array. In gdb I followed
the re-add flow to the place where it fills the mdu_disk_info_t
structure from the superblock read from sdc. It put the following
content there:
disc.major = ...
disc.minor = ...
disc.number = 2
disc.raid_disk = 2 (because previously this drive was in slot 2)
disc.state = ...
Now in gdb I changed disc.number to 4 (to match the desc_nr of sdd),
and then issued ADD_NEW_DISK. It succeeded, and the sdc drive received
desc_nr=2 (while it had asked for 4). Of course, it could not receive
the same raid_disk, because that raid_disk was already occupied by
sdd. So it was added as a spare.
But you are saying:
> If a device already exists with the same disk.number, a re-add cannot
> succeed, so mdadm doesn't even try.
while in my case it succeeded (though what it actually did was an "add", not a "re-add").
That's why I was thinking it makes more sense to check disc.raid_disk
and not disc.number in this check. disc.number is not the drive's role
within the array (right?); it is simply the drive's position in the
list of all drives.
So if you check raid_disk and see that the raid slot is already
occupied, you know the drive would be converted to a spare, which is
what we want to avoid.
And the enough_fd() check protects us from adding a spare to a failed
array (like you mentioned to me previously).
What am I missing?
Thanks,
Alex.
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts"
From: NeilBrown @ 2011-10-30 23:16 UTC
To: Alexander Lyakas; +Cc: linux-raid
On Thu, 27 Oct 2011 11:10:54 +0200 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:
> [...]
> But you are saying:
> > If a device already exists with the same disk.number, a re-add cannot
> > succeed, so mdadm doesn't even try.
> while in my case it succeeded (though what it actually did was an "add", not a "re-add").
We seem to be using words differently.
If I ask mdadm to do a "re-add" and it does an "add", then I consider that to
be "failure", however you seem to consider it to be a "success".
That seems to be the source of confusion.
>
> That's why I was thinking it makes more sense to check disc.raid_disk
> and not disc.number in this check. disc.number is not the drive's
> role within the array (right?); it is simply the drive's position in
> the list of all drives.
> So if you check raid_disk and see that the raid slot is already
> occupied, you know the drive would be converted to a spare, which is
> what we want to avoid.
It may well be appropriate to check raid_disk as well, yes. It isn't
so easy, though, which is maybe why I didn't.
With 0.90 metadata, the disc.number is the same as disc.raid_disk for active
devices. That might be another reason for the code being the way that it is.
Thanks,
NeilBrown
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts"
From: Alexander Lyakas @ 2011-10-31 8:57 UTC
To: NeilBrown; +Cc: linux-raid
Thank you for the clarification, Neil!
I was looking at raid_disk, because I am trying to see whether it is
possible to control into which raid slot a disk is being added (not
re-added).
Let's say we had a raid6 with 4 drives (a,b,c,d), and drives c and d
failed. Now let's say it is decided to replace drive d with a new
drive e (--add for drive e). Then it's possible that drive e will take
the raid slot of drive c. So later, if we want to bring drive c back
into the array, it will have to go into a different slot, resulting in
a full reconstruction rather than a bitmap-based one. Whereas if we
had re-added drive c first, it would have done a bitmap-based
reconstruction, and only drive e would have required a full
reconstruction.
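In mdadm terms, the ordering that preserves the bitmap-based resync
would be something like this (names illustrative):

    mdadm /dev/md0 --re-add /dev/sdc   # c returns to its old slot, bitmap-based resync
    mdadm /dev/md0 --add /dev/sde      # only the new drive e needs a full rebuild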
So do you think it makes sense to somehow (perhaps through
disc.raid_disk) instruct the kernel into which slot to add a new
drive?
Thanks,
Alex.
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts"
From: NeilBrown @ 2011-10-31 9:19 UTC
To: Alexander Lyakas; +Cc: linux-raid
On Mon, 31 Oct 2011 10:57:25 +0200 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:
> Thank you for the clarification, Neil!
>
> I was looking at raid_disk, because I am trying to see whether it is
> possible to control into which raid slot a disk is being added (not
> re-added).
>
> Let's say we had a raid6 with 4 drives (a,b,c,d), and drives c and d
> failed. Now let's say it is decided to replace drive d with a new
> drive e (--add for drive e). Then it's possible that drive e will
> take the raid slot of drive c. So later, if we want to bring drive c
> back into the array, it will have to go into a different slot,
> resulting in a full reconstruction rather than a bitmap-based one.
> Whereas if we had re-added drive c first, it would have done a
> bitmap-based reconstruction, and only drive e would have required a
> full reconstruction.
>
> So do you think it makes sense to somehow (perhaps through
> disc.raid_disk) instruct the kernel into which slot to add a new
> drive?
You should be able to do that via sysfs.
    echo frozen > sync_action
    echo $major:$minor > new_dev
    echo $slot > dev-$DEVNAME/slot
    echo idle > sync_action
something like that.
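For concreteness, with full sysfs paths it would be roughly this
(the array and device names are only examples):

    MD=/sys/block/md0/md
    echo frozen > $MD/sync_action
    echo 8:32   > $MD/new_dev       # major:minor of /dev/sdc
    echo 2      > $MD/dev-sdc/slot  # the raid slot you want
    echo idle   > $MD/sync_action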
Seems an odd sort of scenario, though.
Note that with RAID1 this is not an issue. A re-added drive can take any
position in a RAID1 array - it doesn't have to be the same one.
NeilBrown
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts"
From: Alexander Lyakas @ 2011-11-01 16:26 UTC
To: NeilBrown; +Cc: linux-raid
Hi Neil,
I hope you don't find my questions too pesky; I hope that as time
goes on, I will be able to contribute as well, and not just ask for
information.
What you suggested works great, but only if the drive already has a
valid superblock. This is because of md_import_device(), which calls
super_1_load(). Although it calls it with a NULL refdev, it still
performs basic validity checks on the superblock. So the
write_init_super() functionality of mdadm is needed here (perhaps it
could be exposed as an mdadm operation; I am not sure how dangerous
that would be).
Another thing I was digging for is the spot where the kernel
determines whether it is going to do a bitmap-based reconstruction or
a full reconstruction. Can you verify whether I found it correctly, or
whether I am way off?
In super_1_validate():
    if (ev1 < mddev->bitmap->events_cleared)
        return 0;
This means that the array bitmap was cleared after the drive was
detached, so we bail right out, do not set rdev->raid_disk, and leave
all flags off. In that case we will eventually need a full
reconstruction.
Otherwise, after we decide that this drive is an active device (not
failed or spare):
    if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RECOVERY_OFFSET))
        rdev->recovery_offset = le64_to_cpu(sb->recovery_offset);
    else
        set_bit(In_sync, &rdev->flags);
Here we might or might not set the In_sync flag.
Then in slot_store() and in add_new_disk(), we set the flag that
matters here:
    if (test_bit(In_sync, &rdev->flags))
        rdev->saved_raid_disk = slot;
    else
        rdev->saved_raid_disk = -1;
And later, in the personality's hot_add_disk (raid5_add_disk,
raid1_add_disk, and so on), we check this flag and either set or do
not set the fullsync flag. And in raid1_add_disk, like you said, the
slot doesn't matter; any valid saved_raid_disk is good.
Is this correct? If yes, then in the sequence you suggested we will
always do a full reconstruction, because the new_dev_store() and
slot_store() sequence does not call validate_super(), so In_sync is
never set and saved_raid_disk remains -1. This is perfectly fine.
Thanks!
Alex.
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts"
From: NeilBrown @ 2011-11-01 22:52 UTC
To: Alexander Lyakas; +Cc: linux-raid
On Tue, 1 Nov 2011 18:26:14 +0200 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:
> Hi Neil,
> I hope you don't find my questions too pesky; I hope that as time
> goes on, I will be able to contribute as well, and not just ask for
> information.
Enquiring minds should always be encouraged!
>
> What you suggested works great, but only if the drive already has a
> valid superblock. This is because of md_import_device(), which calls
> super_1_load(). Although it calls it with a NULL refdev, it still
> performs basic validity checks on the superblock. So the
> write_init_super() functionality of mdadm is needed here (perhaps it
> could be exposed as an mdadm operation; I am not sure how dangerous
> that would be).
Correct. With v1.x metadata - and even more so with DDF and IMSM - the
kernel does not know everything about the metadata and cannot create
new superblocks. At best it can edit what is there.
As for adding things to mdadm - you just need a genuine use-case.
I can see it could be valid to add a '--role' option to --add so you can add
a device to a specific slot. I doubt it would be used much, but if there was
a concrete need I wouldn't object.
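Hypothetically, such an option might be used like this (this syntax
does not exist in mdadm; it is only a sketch of the idea):

    mdadm /dev/md0 --add --role=2 /dev/sde   # hypothetical: request raid slot 2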
>
> Another thing I was digging for is the spot where the kernel
> determines whether it is going to do a bitmap-based reconstruction or
> a full reconstruction. Can you verify whether I found it correctly,
> or whether I am way off?
>
> In super_1_validate():
>     if (ev1 < mddev->bitmap->events_cleared)
>         return 0;
> This means that the array bitmap was cleared after the drive was
> detached, so we bail right out, do not set rdev->raid_disk, and leave
> all flags off. In that case we will eventually need a full
> reconstruction.
> Otherwise, after we decide that this drive is an active device (not
> failed or spare):
>     if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RECOVERY_OFFSET))
>         rdev->recovery_offset = le64_to_cpu(sb->recovery_offset);
>     else
>         set_bit(In_sync, &rdev->flags);
> Here we might or might not set the In_sync flag.
>
Yes. 'In_sync' effectively means that recovery_offset == MAX.
> Then in slot_store() and in add_new_disk(), we set the flag that
> matters here:
>     if (test_bit(In_sync, &rdev->flags))
>         rdev->saved_raid_disk = slot;
>     else
>         rdev->saved_raid_disk = -1;
>
> And later, in the personality's hot_add_disk (raid5_add_disk,
> raid1_add_disk, and so on), we check this flag and either set or do
> not set the fullsync flag. And in raid1_add_disk, like you said, the
> slot doesn't matter; any valid saved_raid_disk is good.
>
> Is this correct? If yes, then in the sequence you suggested we will
> always do a full reconstruction, because the new_dev_store() and
> slot_store() sequence does not call validate_super(), so In_sync is
> never set and saved_raid_disk remains -1. This is perfectly fine.
Yes, this is all correct.
In the sequence I suggested you could add
    echo insync > dev-$DEVNAME/state
before
    echo $slot > dev-$DEVNAME/slot
and it would result in saved_raid_disk being set.
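So, putting the whole sequence together:

    echo frozen        > sync_action
    echo $major:$minor > new_dev
    echo insync        > dev-$DEVNAME/state  # mark the device in-sync first...
    echo $slot         > dev-$DEVNAME/slot   # ...so this sets saved_raid_disk
    echo idle          > sync_action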
NeilBrown
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts"
From: Alexander Lyakas @ 2011-11-08 16:23 UTC
To: NeilBrown; +Cc: linux-raid
Hi Neil,
1)
after looking at this more, I see the following: on the ADD_NEW_DISK
path (to a running array, not during assembly), the kernel loads
desc_nr from the superblock of the disk being added, in
super_1_load():
    if (sb->level == cpu_to_le32(LEVEL_MULTIPATH))
        rdev->desc_nr = -1;
    else
        rdev->desc_nr = le32_to_cpu(sb->dev_number);
And later, in bind_rdev_to_array() it is indeed checked that desc_nr is unique.
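Abridged, the relevant logic in bind_rdev_to_array() looks roughly
like this (error handling trimmed, paraphrased from kernels of that
era):

    if (rdev->desc_nr < 0) {
        /* no number requested: pick the first free one */
        int choice = 0;
        if (mddev->pers)
            choice = mddev->raid_disks;
        while (find_rdev_nr(mddev, choice))
            choice++;
        rdev->desc_nr = choice;
    } else {
        /* number came from the superblock: refuse if already taken */
        if (find_rdev_nr(mddev, rdev->desc_nr))
            return -EBUSY;
    }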
However, at least for 1.2 arrays, I believe this is too restrictive,
don't you think? If the raid slot (not desc_nr) of the device being
re-added is *not occupied* yet, can't we just select a free desc_nr
for the new disk on that path?
Or perhaps mdadm, on the re-add path, could select a free desc_nr
(disc.number) for it (just as it does for --add), after ensuring that
the slot is not occupied yet? Where is it better to do this?
Otherwise the re-add fails, while it could perfectly well succeed
(only picking a different desc_nr).
2)
About your suggestion of setting the In_sync flag via sysfs: I think
it's too dangerous to set it blindly, so I am not going to do it. Let
the kernel look at the event count and decide.
3)
So it looks like, at present, my best way to add (not re-add) a disk
into a specific slot is the following:
- borrow some code from mdadm to initialize the superblock of the new
disk as belonging to this array (init_super);
- use sysfs to do what you suggested before:
    echo frozen > sync_action
    echo $major:$minor > new_dev
    echo $slot > dev-$DEVNAME/slot
    echo idle > sync_action
Does this make sense?
Thanks!
Alex.
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts"
From: NeilBrown @ 2011-11-08 23:41 UTC
To: Alexander Lyakas; +Cc: linux-raid
On Tue, 8 Nov 2011 18:23:49 +0200 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:
> Hi Neil,
>
> 1)
> after looking at this more, I see the following: on the ADD_NEW_DISK
> path (to a running array, not during assembly), the kernel loads
> desc_nr from the superblock of the disk being added, in
> super_1_load():
>     if (sb->level == cpu_to_le32(LEVEL_MULTIPATH))
>         rdev->desc_nr = -1;
>     else
>         rdev->desc_nr = le32_to_cpu(sb->dev_number);
> And later, in bind_rdev_to_array() it is indeed checked that desc_nr
> is unique.
>
> However, at least for 1.2 arrays, I believe this is too restrictive,
> don't you think? If the raid slot (not desc_nr) of the device being
> re-added is *not occupied* yet, can't we just select a free desc_nr
> for the new disk on that path?
> Or perhaps mdadm, on the re-add path, could select a free desc_nr
> (disc.number) for it (just as it does for --add), after ensuring that
> the slot is not occupied yet? Where is it better to do this?
> Otherwise the re-add fails, while it could perfectly well succeed
> (only picking a different desc_nr).
I think I see what you are saying.
However, my question is: is this really an issue?
Is there a credible sequence of events that results in the current
code making an undesirable decision? Of course, I do not count
deliberately editing the metadata as part of a credible sequence of
events.
>
> 2)
> About your suggestion of setting the In_sync flag via sysfs: I think
> it's too dangerous to set it blindly, so I am not going to do it. Let
> the kernel look at the event count and decide.
>
> 3)
> So it looks like, at present, my best way to add (not re-add) a disk
> into a specific slot is the following:
> - borrow some code from mdadm to initialize the superblock of the new
> disk as belonging to this array (init_super);
> - use sysfs to do what you suggested before:
>     echo frozen > sync_action
>     echo $major:$minor > new_dev
>     echo $slot > dev-$DEVNAME/slot
>     echo idle > sync_action
>
> Does this make sense?
Yes it does.
NeilBrown
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts"
From: Alexander Lyakas @ 2011-11-17 11:13 UTC
To: NeilBrown; +Cc: linux-raid
Hello Neil,
>> However, at least for 1.2 arrays, I believe this is too restrictive,
>> don't you think? If the raid slot (not desc_nr) of the device being
>> re-added is *not occupied* yet, can't we just select a free desc_nr
>> for the new disk on that path?
>> Or perhaps mdadm, on the re-add path, could select a free desc_nr
>> (disc.number) for it (just as it does for --add), after ensuring that
>> the slot is not occupied yet? Where is it better to do this?
>> Otherwise the re-add fails, while it could perfectly well succeed
>> (only picking a different desc_nr).
>
> I think I see what you are saying.
> However, my question is: is this really an issue?
> Is there a credible sequence of events that results in the current
> code making an undesirable decision? Of course, I do not count
> deliberately editing the metadata as part of a credible sequence of
> events.
Consider this scenario, in which the code refuses to re-add a drive:
Step 1:
- I created a raid1 array with 3 drives: A,B,C (and their desc_nr=0,1,2)
- I failed drives B and C, and removed them from the array, and
totally forgot about them for the rest of the scenario.
- I added two new drives, D and E, to the array, and waited for the
resync to complete. The array now had the following structure:
A: desc_nr=0
D: desc_nr=3 (selected during the "add" path in mdadm, as expected)
E: desc_nr=4 (selected during the "add" path in mdadm, as expected)
Step 2:
- I failed drives D and E, and removed them from the array. The E
drive is not used for the rest of the scenario, so we can forget about
it.
I wrote some data to the array. At this point, the array bitmap is
dirty, and will not be cleared, since the array is degraded.
Step 3:
- I added one new drive (the last one, I promise!) to the array -
drive F - and waited for it to resync. The array now had the following
structure:
A: desc_nr=0
F: desc_nr=3
So F took the desc_nr of drive D (desc_nr=3). This is expected
according to the mdadm code.
Event counters at this point:
A and F: events=149, events_cleared=0
D: events=109
Step 4:
At this point, mdadm refuses to re-add drive D to the array, because
its desc_nr is already taken (I verified that via gdb). On the other
hand, if we simply picked a fresh desc_nr for D, it could be re-added,
I believe, because:
- slots are not important for raid1 (D's slot was actually taken by F);
- it should pass the check for bitmap-based resync (events in D's sb >=
events_cleared of the array).
Do you agree with this, or perhaps I missed something?
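A minimal reproduction of the above, with shell variables standing in
for the drive paths (md0 and the internal bitmap are illustrative):

    mdadm --create /dev/md0 --level=1 --raid-devices=3 --bitmap=internal $A $B $C
    mdadm /dev/md0 --fail $B --remove $B    # step 1
    mdadm /dev/md0 --fail $C --remove $C
    mdadm /dev/md0 --add $D $E              # D and E get desc_nr 3 and 4
    # ...wait for the resync to complete...
    mdadm /dev/md0 --fail $D --remove $D    # step 2
    mdadm /dev/md0 --fail $E --remove $E
    # ...write some data; the bitmap stays dirty on the degraded array...
    mdadm /dev/md0 --add $F                 # step 3: F takes desc_nr=3
    # ...wait for the resync to complete...
    mdadm /dev/md0 --re-add $D              # step 4: refused, desc_nr 3 is taken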
Additional notes:
- of course, such a scenario is relevant only for arrays with more
than a single disk of redundancy, so it's not relevant for raid5
- to simulate such a scenario for raid6, the new drive in step 3 needs
to be added to a slot that is not the slot of the drive we're going to
re-add in step 4 (otherwise it takes D's slot, and then we really
cannot re-add). This can be done as we discussed earlier.
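As a rough way to check the bitmap precondition from step 4 from the
shell (a sketch only; it assumes 1.x metadata with an internal bitmap,
and the "Events" / "Events Cleared" labels as printed by mdadm -E and
mdadm -X; loop device names are placeholders):

    #!/bin/sh
    # events recorded in the superblock of D, the re-add candidate
    EV_D=$(mdadm -E /dev/loop3 | awk '$1 == "Events" {print $3}')
    # events_cleared recorded in the bitmap of a current member
    EV_CLR=$(mdadm -X /dev/loop0 | awk '/Events Cleared/ {print $4}')
    # D passes the bitmap-based resync check if its events are not
    # older than what the bitmap has already cleared
    if [ "$EV_D" -ge "$EV_CLR" ]; then
        echo "bitmap-based re-add of D should be possible"
    else
        echo "D is too old to be re-added via the bitmap"
    fi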
What do you think?
Thanks,
Alex.
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts"
2011-11-17 11:13 ` Alexander Lyakas
@ 2011-11-21 2:44 ` NeilBrown
2011-11-22 8:45 ` Alexander Lyakas
0 siblings, 1 reply; 13+ messages in thread
From: NeilBrown @ 2011-11-21 2:44 UTC (permalink / raw)
To: Alexander Lyakas; +Cc: linux-raid
On Thu, 17 Nov 2011 13:13:20 +0200 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:
> Hello Neil,
>
> >> However, at least for 1.2 arrays, I believe this is too restrictive,
> >> don't you think? If the raid slot (not desc_nr) of the device being
> >> re-added is *not occupied* yet, can't we just select a free desc_nr
> >> for the new disk on that path?
> >> Or perhaps, mdadm on the re-add path can select a free desc_nr
> >> (disc.number) for it (just as it does for --add), after ensuring that
> >> the slot is not occupied yet? Where is it better to do it?
> >> Otherwise, the re-add fails, while it could perfectly well succeed
> >> (simply by picking a different desc_nr).
> >
> > I think I see what you are saying.
> > However, my question is: is this really an issue?
> > Is there a credible sequence of events that results in the current code making
> > an undesirable decision? Of course I do not count deliberately editing the
> > metadata as part of a credible sequence of events.
>
> Consider this scenario, in which the code refuses to re-add a drive:
>
> Step 1:
> - I created a raid1 array with 3 drives: A,B,C (and their desc_nr=0,1,2)
> - I failed drives B and C, and removed them from the array, and
> totally forgot about them for the rest of the scenario.
> - I added two new drives, D and E, to the array, and waited for the
> resync to complete. The array now has the following structure:
> A: desc_nr=0
> D: desc_nr=3 (was selected during the "add" path in mdadm, as expected)
> E: desc_nr=4 (was selected during the "add" path in mdadm, as expected)
>
> Step 2:
> - I failed drives D and E, and removed them from the array. The E
> drive is not used for the rest of the scenario, so we can forget about
> it.
>
> I wrote some data to the array. At this point, the array bitmap is
> dirty, and will not be cleared, since the array is degraded.
>
> Step 3:
> - I added one new drive (last one, I promise!) to the array - drive F,
> and waited for it to resync. The array now has the following
> structure:
> A: desc_nr=0
> F: desc_nr=3
>
> So F took the desc_nr of drive D (desc_nr=3). This is expected
> according to the mdadm code.
>
> Event counters at this point:
> A and F: events=149, events_cleared=0
> D: events=109
>
> Step 4:
> At this point, mdadm refuses to re-add drive D to the array, because
> its desc_nr is already taken (I verified that via gdb). On the other
> hand, if we had simply picked a fresh desc_nr for D, I believe it
> could be re-added, because:
> - slots are not important for raid1 (D's slot was actually taken by F).
> - it should pass the check for bitmap-based resync (events in D's sb >=
> events_cleared of the array).
>
> Do you agree with this, or perhaps I missed something?
>
> Additional notes:
> - of course, such a scenario is relevant only for arrays with more
> than a single disk of redundancy, so it's not relevant for raid5
> - to simulate such a scenario for raid6, the new drive in step 3 needs
> to be added to a slot that is not the slot of the drive we're going to
> re-add in step 4 (otherwise it takes D's slot, and then we really
> cannot re-add). This can be done as we discussed earlier.
>
> What do you think?
I think some of the details in your steps aren't really right, but I do see
the point you are making.
If you keep the array degraded, the events_cleared will not be updated so any
old array member can safely be re-added.
I'll have a look and see how best to fix the code.
Thanks.
NeilBrown
>
> Thanks,
> Alex.
* Re: Question about mdadm commit d6508f0cfb60edf07b36f1532eae4d9cddf7178b "be more careful about add attempts"
2011-11-21 2:44 ` NeilBrown
@ 2011-11-22 8:45 ` Alexander Lyakas
0 siblings, 0 replies; 13+ messages in thread
From: Alexander Lyakas @ 2011-11-22 8:45 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
Thanks, Neil, for looking into this.
On Mon, Nov 21, 2011 at 4:44 AM, NeilBrown <neilb@suse.de> wrote:
> On Thu, 17 Nov 2011 13:13:20 +0200 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Hello Neil,
>>
>> >> However, at least for 1.2 arrays, I believe this is too restrictive,
>> >> don't you think? If the raid slot (not desc_nr) of the device being
>> >> re-added is *not occupied* yet, can't we just select a free desc_nr
>> >> for the new disk on that path?
>> >> Or perhaps, mdadm on the re-add path can select a free desc_nr
>> >> (disc.number) for it (just as it does for --add), after ensuring that
>> >> the slot is not occupied yet? Where is it better to do it?
>> >> Otherwise, the re-add fails, while it could perfectly well succeed
>> >> (simply by picking a different desc_nr).
>> >
>> > I think I see what you are saying.
>> > However, my question is: is this really an issue?
>> > Is there a credible sequence of events that results in the current code making
>> > an undesirable decision? Of course I do not count deliberately editing the
>> > metadata as part of a credible sequence of events.
>>
>> Consider this scenario, in which the code refuses to re-add a drive:
>>
>> Step 1:
>> - I created a raid1 array with 3 drives: A,B,C (and their desc_nr=0,1,2)
>> - I failed drives B and C, and removed them from the array, and
>> totally forgot about them for the rest of the scenario.
>> - I added two new drives, D and E, to the array, and waited for the
>> resync to complete. The array now has the following structure:
>> A: desc_nr=0
>> D: desc_nr=3 (was selected during the "add" path in mdadm, as expected)
>> E: desc_nr=4 (was selected during the "add" path in mdadm, as expected)
>>
>> Step 2:
>> - I failed drives D and E, and removed them from the array. The E
>> drive is not used for the rest of the scenario, so we can forget about
>> it.
>>
>> I wrote some data to the array. At this point, the array bitmap is
>> dirty, and will not be cleared, since the array is degraded.
>>
>> Step 3:
>> - I added one new drive (last one, I promise!) to the array - drive F,
>> and waited for it to resync. The array now has the following
>> structure:
>> A: desc_nr=0
>> F: desc_nr=3
>>
>> So F took the desc_nr of drive D (desc_nr=3). This is expected
>> according to the mdadm code.
>>
>> Event counters at this point:
>> A and F: events=149, events_cleared=0
>> D: events=109
>>
>> Step 4:
>> At this point, mdadm refuses to re-add drive D to the array, because
>> its desc_nr is already taken (I verified that via gdb). On the other
>> hand, if we had simply picked a fresh desc_nr for D, I believe it
>> could be re-added, because:
>> - slots are not important for raid1 (D's slot was actually taken by F).
>> - it should pass the check for bitmap-based resync (events in D's sb >=
>> events_cleared of the array).
>>
>> Do you agree with this, or perhaps I missed something?
>>
>> Additional notes:
>> - of course, such a scenario is relevant only for arrays with more
>> than a single disk of redundancy, so it's not relevant for raid5
>> - to simulate such a scenario for raid6, the new drive in step 3 needs
>> to be added to a slot that is not the slot of the drive we're going to
>> re-add in step 4 (otherwise it takes D's slot, and then we really
>> cannot re-add). This can be done as we discussed earlier.
>>
>> What do you think?
>
> I think some of the details in your steps aren't really right, but I do see
> the point you are making.
> If you keep the array degraded, the events_cleared will not be updated so any
> old array member can safely be re-added.
>
> I'll have a look and see how best to fix the code.
>
> Thanks.
>
> NeilBrown
>
>
>
>>
>> Thanks,
>> Alex.