* Need some information and help on mdadm in order to support it on IBM z Systems
@ 2008-04-11 12:28 Jean-Baptiste Joret
2008-04-11 14:39 ` Bill Davidsen
0 siblings, 1 reply; 8+ messages in thread
From: Jean-Baptiste Joret @ 2008-04-11 12:28 UTC (permalink / raw)
To: linux-raid
Hello,
I am trying to obtain information, such as a design document or anything
that describes the content of the metadata. I am evaluating the solution
to determine whether it is enterprise-ready for use as a mirroring
solution and whether we can support it at IBM.
Also, I currently have quite a show-stopper issue where help would be
appreciated. I have a RAID1 with two hard disks. When I remove one hard
disk (I put the CHPIDs offline, which is equivalent to telling the system
that the drive is currently not available), the missing disk is marked as
"faulty spare" when calling mdadm -D /dev/md0.
/dev/md0:
Version : 01.02.03
Creation Time : Fri Apr 11 11:11:59 2008
Raid Level : raid1
Array Size : 2403972 (2.29 GiB 2.46 GB)
Used Dev Size : 2403972 (2.29 GiB 2.46 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Fri Apr 11 11:23:04 2008
State : active, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0
Name : 0
UUID : 9a0a6e30:4b8bbe7f:bc0cad81:9fd46804
Events : 8
Number Major Minor RaidDevice State
0 0 0 0 removed
1 94 21 1 active sync /dev/dasdf1
0 94 17 - faulty spare /dev/dasde1
When I put the disk back online it is not automatically reinserted into
the array. The only thing I have tried that worked was a hot remove
followed by a hot add (mdadm /dev/md0 -r /dev/dasde1 and then mdadm
/dev/md0 -a /dev/dasde1). Is that the correct way, or is there an option
to tell md that the disk is back and clean? I don't like my solution very
much, as sometimes I get an error saying the superblock cannot be written.
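
A minimal sketch of that manual recovery sequence, plus the --re-add form
mdadm offers for exactly this case; the device names are the ones from this
report, and how much gets resynced after --re-add depends on the internal
write-intent bitmap being usable:

    # Remove the stale member explicitly, then put it back:
    mdadm /dev/md0 --remove /dev/dasde1
    mdadm /dev/md0 --add /dev/dasde1

    # Alternative: --re-add asks md to slot the old member back into its
    # previous role; with an internal bitmap only the blocks written while
    # the disk was away should need resyncing.
    mdadm /dev/md0 --re-add /dev/dasde1

    # Watch the resync progress:
    cat /proc/mdstat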
Thank you very much for any help you can provide.
Best regards / Mit freundlichen Gruessen / Cordialement / Cordiali Saluti
Jean-Baptiste Joret - Linux on System Z
Phone: +49 7031 16-3278 / ITN: 39203278 - eMail: joret@de.ibm.com
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Martin Jetter
Geschäftsführung: Herbert Kircher
Sitz der Gesellschaft: Böblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294
^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Need some information and help on mdadm in order to support it on IBM z Systems
  2008-04-11 12:28 Need some information and help on mdadm in order to support it on IBM z Systems Jean-Baptiste Joret
@ 2008-04-11 14:39 ` Bill Davidsen
  2008-04-14 11:05   ` Jean-Baptiste Joret
  0 siblings, 1 reply; 8+ messages in thread
From: Bill Davidsen @ 2008-04-11 14:39 UTC (permalink / raw)
To: Jean-Baptiste Joret; +Cc: linux-raid

Jean-Baptiste Joret wrote:
> [ ... ]
>
> When I put the disk back online it is not automatically reinserted into
> the array. The only thing I have tried that worked was a hot remove
> followed by a hot add (mdadm /dev/md0 -r /dev/dasde1 and then mdadm
> /dev/md0 -a /dev/dasde1). Is that the correct way, or is there an option
> to tell md that the disk is back and clean? I don't like my solution very
> much, as sometimes I get an error saying the superblock cannot be written.
>
Start by detailing the versions of the kernel, mdadm, which superblock
you use, and your bitmap configuration (or lack of it).

-- 
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismarck

^ permalink raw reply	[flat|nested] 8+ messages in thread
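
A sketch of how that information is usually collected, assuming the array
and member names from the original report:

    uname -r                             # running kernel version
    mdadm --version                      # mdadm release
    cat /proc/mdstat                     # array state as the kernel sees it
    mdadm --detail /dev/md0              # superblock format, bitmap, member states
    mdadm --examine /dev/dasde1          # per-device superblock (metadata version)
    mdadm --examine-bitmap /dev/dasde1   # internal write-intent bitmap, if any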
* Re: Need some information and help on mdadm in order to support it on IBM z Systems
  2008-04-11 14:39 ` Bill Davidsen
@ 2008-04-14 11:05   ` Jean-Baptiste Joret
  [not found]            ` <4804F9FD.4070606@tmr.com>
  0 siblings, 1 reply; 8+ messages in thread
From: Jean-Baptiste Joret @ 2008-04-14 11:05 UTC (permalink / raw)
To: Bill Davidsen; +Cc: linux-raid

Hi Bill,

I have created the array with "mdadm --create /dev/md0 --level=1
--raid-devices=2 /dev/dasd[ef]1 --metadata=1.2 --bitmap=internal",
using, as you can see, version 1.2 of the metadata format. The kernel is
the SUSE standard kernel 2.6.16.60-0.9-default on s390x (SLES 10 SP2 RC1).
I have this issue with RC2 too.

What I would like is more documentation about the metadata and how it is
used, if you have it or know someone who can provide it.

Thank you in advance.

Best regards / Mit freundlichen Gruessen / Cordialement / Cordiali Saluti

Jean-Baptiste Joret - Linux on System Z
Phone: +49 7031 16-3278 / ITN: 39203278 - eMail: joret@de.ibm.com

IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Martin Jetter
Geschäftsführung: Herbert Kircher
Sitz der Gesellschaft: Böblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294

From: Bill Davidsen <davidsen@tmr.com>
To: Jean-Baptiste Joret/Germany/IBM@IBMDE
Cc: linux-raid@vger.kernel.org
Date: 11.04.2008 16:35
Subject: Re: Need some information and help on mdadm in order to support it on IBM z Systems

Jean-Baptiste Joret wrote:
> [ ... ]
>
Start by detailing the versions of the kernel, mdadm, which superblock
you use, and your bitmap configuration (or lack of it).

-- 
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismarck

^ permalink raw reply	[flat|nested] 8+ messages in thread
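
Since the on-disk layout is documented mainly in the md and mdadm sources,
the quickest practical way to see what a version-1.2 superblock and its
internal bitmap contain is to dump them from a member device; a sketch
using the member names above:

    # Dump the v1.2 superblock of one member (for metadata 1.2 it sits
    # 4K from the start of the device):
    mdadm --examine /dev/dasde1

    # Dump the internal write-intent bitmap that --bitmap=internal created:
    mdadm --examine-bitmap /dev/dasde1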
[parent not found: <4804F9FD.4070606@tmr.com>]
* Re: Need some information and help on mdadm in order to support it on IBM z Systems
  [not found]          ` <4804F9FD.4070606@tmr.com>
@ 2008-04-16 15:20     ` Jean-Baptiste Joret
  2008-04-18  9:46       ` Mario 'BitKoenig' Holbe
  0 siblings, 1 reply; 8+ messages in thread
From: Jean-Baptiste Joret @ 2008-04-16 15:20 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Linux RAID

Hello Bill,

the scenario actually involves simulating a hardware connection issue for
a few seconds and then bringing the disk back online. But once the hardware
comes back online it still does not come back into the array and remains
marked "faulty spare". Moreover, if you then reboot, the mirror comes up
and you can mount it, but it is degraded and my "faulty spare" is now
removed:

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       17        1      active sync   /dev/sdb1

Is there a way, maybe using a udev rule, to mark the device clean so it
can be re-added automatically into the array?

Best regards / Mit freundlichen Gruessen / Cordialement / Cordiali Saluti

Jean-Baptiste Joret - Linux on System Z
Phone: +49 7031 16-3278 / ITN: 39203278 - eMail: joret@de.ibm.com

IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Martin Jetter
Geschäftsführung: Herbert Kircher
Sitz der Gesellschaft: Böblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294

From: Bill Davidsen <davidsen@tmr.com>
To: Jean-Baptiste Joret/Germany/IBM@IBMDE, Linux RAID <linux-raid@vger.kernel.org>
Date: 15.04.2008 20:50
Subject: Re: Need some information and help on mdadm in order to support it on IBM z Systems

I have added the list back into the addresses, you can use "reply all" to
keep the discussion where folks can easily contribute.

Jean-Baptiste Joret wrote:
> [ ... ]

The best (only) description of the metadata is in the md portion of the
kernel or in the mdadm source code.

I am guessing that there is a fix for your problem in more recent kernels,
since a similar thing was mentioned on the mailing list recently. Older
versions of the kernel require some event to start the rebuild, at which
point the spare will be put back into the array. Unfortunately I didn't
find it quickly, although memory tells me that it has been fixed in the
latest kernel.

I think you need to look carefully at any hardware or connection issues
which cause the device to drop out of the array in the first place. The
fact that it comes in as a faulty spare indicates a problem, but I don't
quite see what that is. Your remove and reinsert will get it going again;
is it possible that the device is not ready at boot time for some reason?
There may be log messages from the time when the drive was kicked from
that array which will tell you more.

-- 
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismarck

^ permalink raw reply	[flat|nested] 8+ messages in thread
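
On the udev question above: nothing in this thread confirms a stock rule
for it, but a hand-written rule along these lines could trigger the re-add
when the device node reappears. This is only a sketch; the file name,
device match and hard-coded array name are assumptions, and it re-adds
with no health check at all, which is exactly the trade-off discussed in
the replies that follow:

    # /etc/udev/rules.d/65-md-readd.rules  (hypothetical example)
    # When the block device node for dasde1 comes back, try to re-add it.
    ACTION=="add", SUBSYSTEM=="block", KERNEL=="dasde1", RUN+="/sbin/mdadm /dev/md0 --re-add /dev/dasde1"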
* Re: Need some information and help on mdadm in order to support it on IBM z Systems
  2008-04-16 15:20 ` Jean-Baptiste Joret
@ 2008-04-18  9:46   ` Mario 'BitKoenig' Holbe
  2008-04-18 13:45     ` David Lethe
  0 siblings, 1 reply; 8+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2008-04-18 9:46 UTC (permalink / raw)
To: linux-raid

Jean-Baptiste Joret <JORET@de.ibm.com> wrote:
> the scenario actually involves simulating a hardware connection issue for
> a few seconds and then bringing the disk back online. But once the hardware
> comes back online it still does not come back into the array and remains
> marked "faulty spare". Moreover, if you then reboot, the mirror comes up
> and you can mount it, but it is degraded and my "faulty spare" is now removed:

This is just the normal way md deals with faulty components. And even
more: I personally don't know of any (soft or hard) RAID solution that
would automatically try to re-add faulty components back to an array.

I personally would also consider such an automatic re-add a really bad
idea. There was a reason for the component to fail; you don't want to
touch it again without user intervention - it could make things far
worse (blocking busses, reading wrong data, etc.). A user who knows
better can of course trigger the RAID to touch it again - for md it's
just the way you described already: remove the faulty component from
the array and re-add it.

Being more "intelligent" about such an automatic re-add would require a
far deeper failure analysis to decide whether it would be safe to try
re-adding the component or better to leave it untouched. I don't know
of any software yet that is capable of doing so.

AFAIK, since a little while md contains one such automatism regarding
sector read errors, where it automatically tries to re-write the failed
sector to the failing disk to trigger the disk's sector reallocation.
I personally consider even this behaviour quite dangerous, since there
is no guarantee that the read error really occurred due to a (quite
harmless) single-sector failure, and thus IMHO even there is a chance
of making things worse by touching the failing disk again by default.

regards
   Mario
-- 
Computer Science is no more about computers
than astronomy is about telescopes. -- E. W. Dijkstra

^ permalink raw reply	[flat|nested] 8+ messages in thread
* RE: Re: Need some information and help on mdadm in order to support it on IBM z Systems
  2008-04-18  9:46 ` Mario 'BitKoenig' Holbe
@ 2008-04-18 13:45   ` David Lethe
  2008-04-20 13:41     ` Peter Grandi
  0 siblings, 1 reply; 8+ messages in thread
From: David Lethe @ 2008-04-18 13:45 UTC (permalink / raw)
To: Mario 'BitKoenig' Holbe, linux-raid

Well, I can name many RAID controllers that will automatically add a
"faulty" drive back into an array. This is a very good thing to have,
and is counter-intuitive to all but experienced RAID architects. Seeing
how the OP works for IBM, I'll use the IBM Profibre engine as an example
of an engine that will automatically insert a "known bad" disk.
Infortrend engines actually have a menu item to force an array with a
"known bad" disk online. Several LSI-family controllers have this
feature in their API, and as a backdoor diagnostic feature. Some of the
Xyratex engines give this to you. I can go on by getting really
specific and citing firmware revisions, model numbers, and so on ...
but what do I know ... I just write diagnostic software, RAID
configurators, failure/stress testing, failover drivers, etc.

To be fair, there are correct ways to reinsert these bad disks, and the
architect needs to do a few things to minimize data integrity risks and
repair them as part of the reinsertion process. As this is a public
forum I won't post them, but will instead speak in generalities to make
my point. There are hundreds of published patents concerning data
recovery and availability in various failure scenarios, so anybody who
wants to learn more can simply search the USPTO.GOV database and read
them for themselves.

A few real-world reasons you want this capability ...

* Your RAID system consists of 2 external units, with RAID
  controller(s) and disks in unit A, and unit B is an expansion
  chassis, with interconnecting cables. You have a LUN that is spread
  between 2 enclosures. Enclosure "B" goes offline, either because of a
  power failure; a sysadmin who doesn't know you should power the RAID
  head down first, then the expansion; or he/she powers it up backwards
  .. bottom line is that the drive "failed", but it only "failed"
  because it was disconnected due to power issues. Well-architected
  firmware needs to be able to recognize this scenario and put the disk
  back.

* When a disk drive really does fail, then, depending on the type of
  bus/loop structure in the enclosure and the quality of the backplane
  architecture, other disks may be affected and may not be able to
  respond to I/O for a second or so. If the RAID architecture
  aggressively fails disks in this scenario then you would have cascade
  effects that knock perfectly good arrays offline.

* You have a hot-swap array where disks are not physically "locked" in
  an enclosure, and the removal process starts by pushing the drive in.
  Igor the klutz leans against the enclosure the wrong way and the
  drive temporarily gets disconnected .. but he frantically pushes it
  back in. You get the idea, it happens.

The RAID engine (or md software) needs to be more intelligent and be
able to recognize the difference between a drive failure and a drive
getting disconnected.

Now to go back to the OP and solve his problem. Use a special connector
and extend a pair of wires outside the enclosure that break power. If
this is a fibre-channel backplane, then you should also have external
wires to short out loop A and/or loop B in order to inject other types
of errors.

My RAID testing software (not trying to plug it, just telling you some
of the things you can do so you can write it yourself) sends a CDB to
tell the disk to perform a mediainit command, or commands a disk to
simply spin down. Well-designed RAID software/firmware will handle all
of these problems differently. While on the subject, your RAID testing
software needs to be able to create ECC errors on any disk/block you
need to, so you can combine these "disk" failures with stripes that
have both good and bad parity. (Yes, kinda sorta plugging myself as a
hired gun again, but ...) if your testing scenario doesn't involve
creating ECC errors and running non-destructive data and parity testing
in combination with simulated hardware failures, then you're testing,
not certifying.

To go back to Mario's argument that you *could* make things far worse
.. absolutely. The RAID architect needs to incorporate hot-adding md
disks back into the array, as long as it is done properly. RAID
recovery logic is perhaps 75% of the source code for top-of-the-line
RAID controllers. Their firmware determines why a disk "failed", and
does what it can to bring it back online and fix the damage. A $50 SATA
RAID controller has perhaps 10% of the logic dedicated to
failover/failback. The md driver is somewhere in the middle.

I'll end this post by reminding the md architects to consider how many
days it takes to rebuild a RAID-5 set that uses 500GB or larger disk
drives, and how unnecessary this action can be under certain failure
scenarios.

David @ SANtools ^ com

-----Original Message-----
From: linux-raid-owner@vger.kernel.org
[mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Mario 'BitKoenig' Holbe
Sent: Friday, April 18, 2008 4:46 AM
To: linux-raid@vger.kernel.org
Subject: Re: Need some information and help on mdadm in order to support it on IBM z Systems

[ ... ]

^ permalink raw reply	[flat|nested] 8+ messages in thread
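
For md specifically, some of the failure-injection steps described above
can be approximated from the host without special cabling; a sketch that
assumes an sd-backed test mirror and the sg3_utils/hdparm tools being
available:

    # Let md itself simulate a member failure (no hardware involved):
    mdadm /dev/md0 --fail /dev/sdb1

    # Or make the drive genuinely unresponsive for a while:
    sg_start 0 /dev/sdb           # spin the disk down (sg3_utils; 0 = stop)
    hdparm -Y /dev/sdb            # or put an ATA disk to sleep

    # Afterwards, recover and watch the resync:
    mdadm /dev/md0 --remove /dev/sdb1
    mdadm /dev/md0 --add /dev/sdb1
    watch cat /proc/mdstat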
* Re: Need some information and help on mdadm in order to support it on IBM z Systems
  2008-04-18 13:45 ` David Lethe
@ 2008-04-20 13:41   ` Peter Grandi
  2008-04-20 16:24     ` David Lethe
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Grandi @ 2008-04-20 13:41 UTC (permalink / raw)
To: Linux RAID

[ ... ]

> Well, I can name many RAID controllers that will automatically
> add a "faulty" drive back into an array. This is a very good
> thing to have, and is counter-intuitive to all but experienced
> RAID architects.

It is not counter-intuitive, it is absolutely crazy in the general
case, and in particular cases it leads to loss of focus and mingling
of abstraction layers that should remain separate.

[ ... ]

> There are hundreds of published patents concerning data
> recovery and availability in various failure scenarios, so
> anybody who wants to learn more can simply search the
> USPTO.GOV database and read them for themselves.

Sure, those are heuristics that are implemented on top of a RAID
subsystem, and that are part of a ''computer assisted storage
administration'' logic. Just like IBM developed many years ago expert
systems to automate or assist with recovery from various faults (not
just storage faults) on 370/390 class mainframes.

Such recovery tools have nothing to do with RAID as such, even if they
are often packaged with RAID products. They belong in a totally
different abstraction layer, as this example makes starkly clear:

> A few real-world reasons you want this capability ...

> * Your RAID system consists of 2 external units, with RAID
>   controller(s) and disks in unit A, and unit B is an
>   expansion chassis, with interconnecting cables. You have a
>   LUN that is spread between 2 enclosures. Enclosure "B" goes
>   offline, either because of a power failure; a sysadmin who
>   doesn't know you should power the RAID head down first, then
>   expansion; or he/she powers it up backwards .. bottom line
>   is that the drive "failed", but it only "failed" because it
>   was disconnected due to power issues.
> [ ... ]

This relies on case-based, expert-system-like fault analysis and
recovery using knowledge of non-RAID aspects of the storage subsystem.
Fine, but it has nothing to do with RAID -- as it requires a kind of
''total system'' approach.

> Well architected firmware needs to be able to recognize this
> scenario and put the disk back.

Well architected *RAID* firmware should do nothing of the sort. RAID
has a (deceptively) simple operation model, and yet getting RAID
firmware right is hard enough. Well architected fault analysis and
recovery daemons might well recognize that scenario and put the disk
back, but that's a completely different story from RAID firmware
doing that.

> To go back to Mario's argument that you *could* make things
> far worse .. absolutely.

Sure, because fault analysis and recovery heuristics take chances that
can go spectacularly wrong, as well as being pretty hard to code too.
While I am not against optional fault analysis and recovery layers on
top of RAID, I really object to statements like this:

> The RAID architect needs to incorporate hot-adding md disks
> back into the array, as long as it is done properly.

Because the RAID architect should stay well clear of considering such
kinds of issues, and of polluting the base RAID firmware with
additional complications; even the base RAID logic is amazingly bug
infested in various products I have had the misfortune to suffer.

The role of the RAID architect is to focus on the performance and
correctness of the basic RAID logic, and to let the architects of fault
analysis and recovery daemons worry about other issues, and perhaps to
provide suitable hooks to them to make their life easier.

> RAID recovery logic is perhaps 75% of the source code for
> top-of-the-line RAID controllers. Their firmware determines
> why a disk "failed", and does what it can to bring it back
> online and fix the damage.

There is a rationale for bundling a storage fault analysis and recovery
daemon into a RAID host adapter, but I don't like that, because often
there are two downsides:

* Fault analysis and recovery are usually best done at the highest
  possible abstraction level, that is as software daemons running on
  the host, as they have more information than a daemon running inside
  the host adapter.

* Mingling fault analysis and recovery software with the base RAID
  logic (as the temptation then becomes hard to resist) tends to
  distract from the overridingly important task of getting the latter
  to perform reliably and to report errors clearly and usefully.

In previous discussions on this list there were crazy proposals to make
part of the Linux RAID logic some detection (not too bad) and recovery
(chancy horror) of (ambiguous) unreported errors, and the reason why I
am objecting strenuously here is to help quash calls for something
insane like that. Separation of concerns and of abstraction layers, and
keeping fundamental firmware logic simple, are rather important goals
in mission critical subsystems.

> A $50 SATA RAID controller

Except perhaps the IT8212 chips these don't exist :-).

> has perhaps 10% of the logic dedicated to failover/failback.

That 10% is 10% too many. It is already difficult to find simple,
reliable RAID host adapters, never mind get RAID host adapters that
try to be too clever.

^ permalink raw reply	[flat|nested] 8+ messages in thread
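
md does already expose one such hook for an out-of-band policy daemon:
mdadm's monitor mode can hand every array event to an external program,
which is where re-add heuristics like those debated above could live
without touching the core RAID logic. A sketch; the handler path is a
made-up name for illustration:

    # Run mdadm as a monitoring daemon; on events such as Fail, FailSpare
    # or DegradedArray it invokes the given program with the event name,
    # the md device and (when relevant) the component device as arguments.
    mdadm --monitor --scan --daemonise --program=/usr/local/sbin/md-event-handler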
* RE: Need some information and help on mdadm in order to support it on IBM z Systems
  2008-04-20 13:41 ` Peter Grandi
@ 2008-04-20 16:24   ` David Lethe
  0 siblings, 0 replies; 8+ messages in thread
From: David Lethe @ 2008-04-20 16:24 UTC (permalink / raw)
To: Peter Grandi, Linux RAID

Well, I can't formally speak for IBM, EMC, LSI, NetApp, and others when
I say you are wrong in just about everything you wrote. Their
architectures are "absolutely crazy", and their firmware doesn't meet
your personal criteria for being "well-architected". Qlogic, Emulex and
LSI are also wrong, since they have vanity firmware/drivers for
specific RAID subsystems to increase interoperability between all of
the RAID hardware/software layers.

Now contrast zfs with md+filesystemofyourchoice. The performance,
reliability, security, data integrity, and self-healing capability of
zfs are as profoundly superior to md and your design philosophy as the
current md architecture is to MS-DOS/FAT.

The empirical evidence speaks for itself. The RAID hardware vendors and
the architects of zfs spend billions of dollars annually on R&D, have
superior products, and do it my way, not yours.

If you want to respond with a flame, then take it to a zfs group. I see
no need to respond further.

-----Original Message-----
From: linux-raid-owner@vger.kernel.org
[mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Peter Grandi
Sent: Sunday, April 20, 2008 8:41 AM
To: Linux RAID
Subject: Re: Need some information and help on mdadm in order to support it on IBM z Systems

[ ... ]

^ permalink raw reply	[flat|nested] 8+ messages in thread
end of thread, other threads:[~2008-04-20 16:24 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-04-11 12:28 Need some information and help on mdadm in order to support it on IBM z Systems Jean-Baptiste Joret
2008-04-11 14:39 ` Bill Davidsen
2008-04-14 11:05 ` Jean-Baptiste Joret
[not found] ` <4804F9FD.4070606@tmr.com>
2008-04-16 15:20 ` Jean-Baptiste Joret
2008-04-18 9:46 ` Mario 'BitKoenig' Holbe
2008-04-18 13:45 ` David Lethe
2008-04-20 13:41 ` Peter Grandi
2008-04-20 16:24 ` David Lethe