* ddf: remove failed devices that are no longer in use ?!?
@ 2013-07-26 21:06 Martin Wilck
From: Martin Wilck @ 2013-07-26 21:06 UTC (permalink / raw)
To: NeilBrown, linux-raid
Hi Neil,
here is another question. 2 years ago you committed c7079c84 "ddf:
remove failed devices that are no longer in use", with the reasoning "it
isn't clear what (a phys disk record for every physically attached
device) means in the case of soft raid in a general purpose Linux computer".
I am not sure if this was correct. A common use case for DDF is an
actual BIOS fake RAID, possibly dual-boot with a vendor soft-RAID driver
under Windows. Such a driver might be highly confused by mdadm
auto-removing devices. Not even "missing" devices need to be removed
from the metadata in DDF; they can simply be marked "missing".
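Just to illustrate what I mean (a rough sketch only; the bit values follow
my reading of the DDF spec's phys disk state flags, and the helper is
hypothetical, not actual super-ddf.c code):

    /* Sketch: flag a phys disk entry as missing instead of deleting it.
     * Bit values follow the DDF spec's PD state flags as I read them;
     * the helper and its argument are illustrative only. */
    #include <stdint.h>
    #include <endian.h>

    #define DDF_Online   0x0001
    #define DDF_Missing  0x0040

    static void mark_pde_missing(uint16_t *state_be) /* big-endian on disk */
    {
        uint16_t s = be16toh(*state_be);
        s &= ~DDF_Online;     /* no longer an online member ...     */
        s |= DDF_Missing;     /* ... but the phys disk record stays */
        *state_be = htobe16(s);
    }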
May I ask you to reconsider this, and possibly revert c7079c84?
Martin
* Re: ddf: remove failed devices that are no longer in use ?!?
@ 2013-07-30 1:34 NeilBrown
From: NeilBrown @ 2013-07-30 1:34 UTC (permalink / raw)
To: Martin Wilck; +Cc: linux-raid
On Fri, 26 Jul 2013 23:06:01 +0200 Martin Wilck <mwilck@arcor.de> wrote:
> Hi Neil,
>
> here is another question. 2 years ago you committed c7079c84 "ddf:
> remove failed devices that are no longer in use", with the reasoning "it
> isn't clear what (a phys disk record for every physically attached
> device) means in the case of soft raid in a general purpose Linux computer".
>
> I am not sure if this was correct. A common use case for DDF is an
> actual BIOS fake RAID, possibly dual-boot with a vendor soft-RAID driver
> under Windows. Such a driver might be highly confused by mdadm
> auto-removing devices. Not even "missing" devices need to be removed
> from the metadata in DDF; they can simply be marked "missing".
>
> May I ask you to reconsider this, and possibly revert c7079c84?
> Martin
You may certainly ask ....
I presumably had a motivation for that change. Unfortunately I didn't record
the motivation, only the excuse.
It probably comes down to a question of when *do* you remove phys disk
records?
I think that if I revert that patch we could get a situation where we keep
adding new phys disk records and fill up some table.
We should probably be recording some sort of WWN or path identifier in the
metadata and then have md check in /dev/disk/by-XXX to decide if the device
has really disappeared or is just failed.
Maybe the 'path' field in phys_disk_entry could/should be used here. However
the BIOS might interpret that in a specific way that mdadm would need to
agree with....
If we can come up with a reasonably reliable way to remove phys disk records
at an appropriate time, I'm happy to revert this patch. Until then I'm not
sure it is a good idea.....
But I'm open to being convinced.
Thanks,
NeilBrown
* Re: ddf: remove failed devices that are no longer in use ?!?
@ 2013-07-30 19:24 Martin Wilck
From: Martin Wilck @ 2013-07-30 19:24 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On 07/30/2013 03:34 AM, NeilBrown wrote:
> On Fri, 26 Jul 2013 23:06:01 +0200 Martin Wilck <mwilck@arcor.de> wrote:
>
>> Hi Neil,
>>
>> here is another question. 2 years ago you committed c7079c84 "ddf:
>> remove failed devices that are no longer in use", with the reasoning "it
>> isn't clear what (a phys disk record for every physically attached
>> device) means in the case of soft raid in a general purpose Linux computer".
>>
>> I am not sure if this was correct. A common use case for DDF is an
>> actual BIOS fake RAID, possibly dual-boot with a vendor soft-RAID driver
>> under Windows. Such a driver might be highly confused by mdadm
>> auto-removing devices. Not even "missing" devices need to be removed
>> from the metadata in DDF; they can simply be marked "missing".
>>
>> May I ask you to reconsider this, and possibly revert c7079c84?
>> Martin
>
> You may certainly ask ....
>
> I presumably had a motivation for that change. Unfortunately I didn't record
> the motivation, only the excuse.
>
> It probably comes down to a question of when *do* you remove phys disk
> records?
> I think that if I revert that patch we could get a situation where we keep
> adding new phys disk records and fill up some table.
How is this handled with native metadata? IMSM? Is there any reason to
treat DDF specially? In a HW RAID scenario, the user would remove the
failed disk physically sooner or later, and it would switch to the "missing"
state. So here, I'd expect the user to call mdadm --remove.
We already have find_unused_pde(). We could make this function try
harder - when no empty slot is found, look for slots with
"missing|failed" and then "missing" (or "failed"?) disks, and replace
those with the new disk.
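Roughly along these lines (just a sketch; the entry layout, flags and the
free-slot convention are simplified assumptions, not the real super-ddf.c
structures):

    /* Sketch of a "try harder" slot search: prefer a genuinely unused
     * entry, then fall back to reusing a missing+failed slot, then a
     * merely missing one.  Layout and flags are simplified assumptions
     * for illustration only. */
    #include <stdint.h>

    #define DDF_Failed   0x0002
    #define DDF_Missing  0x0040
    #define PDE_FREE     0xffffffffu   /* assumed "unused" refnum marker */

    struct pde { uint32_t refnum; uint16_t state; };

    static int find_reusable_pde(struct pde *e, int n)
    {
        int i, fallback = -1;

        for (i = 0; i < n; i++) {
            if (e[i].refnum == PDE_FREE)
                return i;                          /* best: empty slot   */
            if ((e[i].state & (DDF_Missing | DDF_Failed))
                       == (DDF_Missing | DDF_Failed))
                fallback = i;                      /* missing and failed */
            else if (fallback < 0 && (e[i].state & DDF_Missing))
                fallback = i;                      /* merely missing     */
        }
        return fallback;                           /* -1: table really full */
    }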
> We should probably be recording some sort of WWN or path identifier in the
> metadata and then have md check in /dev/disk/by-XXX to decide if the device
> has really disappeared or is just failed.
Look for "Cannot be bothered" in super-ddf.c :-)
This is something that waits to be implemented, for SAS/SATA at least.
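Conceptually it could be as simple as this (a sketch; it assumes the usual
udev by-id naming with the WWN embedded in the link name, and the function
name is made up):

    /* Sketch: decide "physically gone" vs. "present but failed" by looking
     * for a udev by-id symlink whose name contains the device's WWN.
     * Hypothetical helper; real code would need to match the exact
     * id-link format. */
    #include <dirent.h>
    #include <string.h>
    #include <stdbool.h>

    static bool disk_still_present(const char *wwn)
    {
        DIR *d = opendir("/dev/disk/by-id");
        struct dirent *de;
        bool found = false;

        if (!d)
            return false;
        while ((de = readdir(d)) != NULL)
            if (strstr(de->d_name, wwn)) {   /* e.g. "wwn-0x5000c500..." */
                found = true;
                break;
            }
        closedir(d);
        return found;
    }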
> Maybe the 'path' field in phys_disk_entry could/should be used here. However
> the BIOS might interpret that in a specific way that mdadm would need to
> agree with....
>
> If we can come up with a reasonably reliable way to remove phys disk records
> at an appropriate time, I'm happy to revert this patch. Until then I'm not
> sure it is a good idea.....
>
> But I'm open to being convinced.
Well, me too, I may be wrong, after all. Perhaps auto-removal is ok. I
need to try it with fake RAID.
Martin
>
> Thanks,
> NeilBrown
* Re: ddf: remove failed devices that are no longer in use ?!?
@ 2013-07-31 3:25 NeilBrown
From: NeilBrown @ 2013-07-31 3:25 UTC (permalink / raw)
To: Martin Wilck; +Cc: linux-raid
On Tue, 30 Jul 2013 21:24:24 +0200 Martin Wilck <mwilck@arcor.de> wrote:
> On 07/30/2013 03:34 AM, NeilBrown wrote:
> > On Fri, 26 Jul 2013 23:06:01 +0200 Martin Wilck <mwilck@arcor.de> wrote:
> >
> >> Hi Neil,
> >>
> >> here is another question. 2 years ago you committed c7079c84 "ddf:
> >> remove failed devices that are no longer in use", with the reasoning "it
> >> isn't clear what (a phys disk record for every physically attached
> >> device) means in the case of soft raid in a general purpose Linux computer".
> >>
> >> I am not sure if this was correct. A common use case for DDF is an
> >> actual BIOS fake RAID, possibly dual-boot with a vendor soft-RAID driver
> >> under Windows. Such a driver might be highly confused by mdadm
> >> auto-removing devices. Not even "missing" devices need to be removed
> >> from the metadata in DDF; they can simply be marked "missing".
> >>
> >> May I ask you to reconsider this, and possibly revert c7079c84?
> >> Martin
> >
> > You may certainly ask ....
> >
> > I presumably had a motivation for that change. Unfortunately I didn't record
> > the motivation, only the excuse.
> >
> > It probably comes down to a question of when *do* you remove phys disk
> > records?
> > I think that if I revert that patch we could get a situation where we keep
> > adding new phys disk records and fill up some table.
>
> How is this handled with native metadata? IMSM? Is there any reason to
> treat DDF specially? In a HW RAID scenario, the user would remove the
> failed disk physically sooner or later, and it would switch to the "missing"
> state. So here, I'd expect the user to call mdadm --remove.
Native metadata doesn't really differentiate between 'device has failed' and
'device isn't present'. There are bits in the 0.90 format that can tell the
difference, but the two states are treated the same.
When a device fails it transitions from 'working' to 'failed but still
present'. Then when you "mdadm --remove" it, it transitions to "not present".
But it is not possible to go back to "failed but still present".
i.e. if you shut down the array and then start it up again, any devices that
have failed will not be included. If they have failed then you possibly
cannot read the metadata and you may not know that they "should" be included.
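Roughly this state model, in other words (illustrative only, not actual
md/mdadm types):

    /* Illustrative only: effective member-device states for native metadata.
     * The only transitions are WORKING -> FAILED_BUT_PRESENT -> REMOVED;
     * there is no way back, and on restart a failed device is simply absent. */
    enum member_state {
        WORKING,              /* active member of the array            */
        FAILED_BUT_PRESENT,   /* kicked out, but still listed/attached */
        REMOVED               /* after "mdadm --remove" (or a restart) */
    };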
DDF is designed to work in the firmware of cards which have a limited number
of ports over which they have complete control. So they may be able to "know"
that a device is present even if it isn't working. mdadm isn't designed with
the same sort of perspective.
If a device is working enough to read the metadata it would be possible to
include it in the container even if it is being treated as 'failed'. Not sure
how much work it would be to achieve this.
If a device is working enough to be detected on the bus, but not enough to
read the metadata, then you could only include it in the container if some
config information said "These ports all belong to that container".
I doubt that would be worth the effort.
>
> We already have find_unused_pde(). We could make this function try
> harder - when no empty slot is found, look for slots with
> "missing|failed" and then "missing" (or "failed"?) disks, and replace
> those with the new disk.
Yes, that could work.
>
> > We should probably be recording some sort of WWN or path identifier in the
> > metadata and then have md check in /dev/disk/by-XXX to decide if the device
> > has really disappeared or is just failed.
>
> Look for "Cannot be bothered" in super-ddf.c :-)
> This is something that waits to be implemented, for SAS/SATA at least.
:-) I'm hoping someone will turn up who can be bothered...
>
> > Maybe the 'path' field in phys_disk_entry could/should be used here. However
> > the BIOS might interpret that in a specific way that mdadm would need to
> > agree with....
> >
> > If we can come up with a reasonably reliable way to remove phys disk records
> > at an appropriate time, I'm happy to revert this patch. Until then I'm not
> > sure it is a good idea.....
> >
> > But I'm open to being convinced.
>
> Well, me too, I may be wrong, after all. Perhaps auto-removal is ok. I
> need to try it with fake RAID.
If you can determine that the current behaviour confuses some BIOS, then
clearly we should fix it.
If not, then the bar for acceptance of changes would probably be a little
higher. As long as it isn't easy to end up with lots of 'failed' entries
accumulating in the table, I can probably be happy.
NeilBrown