linux-raid.vger.kernel.org archive mirror
* mdadm --grow failed
@ 2007-02-17  3:22 Marc Marais
  2007-02-17  8:40 ` Neil Brown
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Marc Marais @ 2007-02-17  3:22 UTC (permalink / raw)
  To: linux-raid

I'm trying to grow my raid 5 array as I've just added a new disk. The array 
was originally 3 drives; I've added a fourth using:

mdadm -a /dev/md6 /dev/sda1

Which added the new drive as a spare. I then did:

mdadm --grow /dev/md6 -n 4

Which started the reshape operation. 
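
(For reference: reshape progress can be watched while it runs - a quick
sketch, assuming the array is /dev/md6 as above:

cat /proc/mdstat

/proc/mdstat shows a progress bar and an estimated finish time for the
reshape while it is in flight.)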

Feb 16 23:51:40 xerces kernel: RAID5 conf printout:
Feb 16 23:51:40 xerces kernel:  --- rd:4 wd:4
Feb 16 23:51:40 xerces kernel:  disk 0, o:1, dev:sdb1
Feb 16 23:51:40 xerces kernel:  disk 1, o:1, dev:sdc1
Feb 16 23:51:40 xerces kernel:  disk 2, o:1, dev:sdd1
Feb 16 23:51:40 xerces kernel:  disk 3, o:1, dev:sda1
Feb 16 23:51:40 xerces kernel: md: reshape of RAID array md6
Feb 16 23:51:40 xerces kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Feb 16 23:51:40 xerces kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
Feb 16 23:51:40 xerces kernel: md: using 128k window, over a total of 156288256 blocks.

Unfortunately one of the drives timed out during the operation (not a read 
error - just a timeout - which I would've thought would be retried but 
anyway...):

Feb 17 00:19:16 xerces kernel: ata3: command timeout
Feb 17 00:19:16 xerces kernel: ata3: no sense translation for status: 0x40
Feb 17 00:19:16 xerces kernel: ata3: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
Feb 17 00:19:16 xerces kernel: ata3: status=0x40 { DriveReady }
Feb 17 00:19:16 xerces kernel: sd 3:0:0:0: SCSI error: return code = 0x08000002
Feb 17 00:19:16 xerces kernel: sdc: Current [descriptor]: sense key: Aborted Command
Feb 17 00:19:16 xerces kernel:     Additional sense: No additional sense information
Feb 17 00:19:16 xerces kernel: Descriptor sense data with sense descriptors (in hex):
Feb 17 00:19:16 xerces kernel:         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Feb 17 00:19:16 xerces kernel:         00 00 00 01
Feb 17 00:19:16 xerces kernel: end_request: I/O error, dev sdc, sector 24065423
Feb 17 00:19:16 xerces kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 3 devices

Which then unfortunately aborted the reshape operation:

Feb 17 00:19:16 xerces kernel: md: md6: reshape done.
Feb 17 00:19:17 xerces kernel: RAID5 conf printout:
Feb 17 00:19:17 xerces kernel:  --- rd:4 wd:3
Feb 17 00:19:17 xerces kernel:  disk 0, o:1, dev:sdb1
Feb 17 00:19:17 xerces kernel:  disk 1, o:0, dev:sdc1
Feb 17 00:19:17 xerces kernel:  disk 2, o:1, dev:sdd1
Feb 17 00:19:17 xerces kernel:  disk 3, o:1, dev:sda1
Feb 17 00:19:17 xerces kernel: RAID5 conf printout:
Feb 17 00:19:17 xerces kernel:  --- rd:4 wd:3
Feb 17 00:19:17 xerces kernel:  disk 0, o:1, dev:sdb1
Feb 17 00:19:17 xerces kernel:  disk 2, o:1, dev:sdd1
Feb 17 00:19:17 xerces kernel:  disk 3, o:1, dev:sda1

I re-added the failed disk (sdc) - which, btw, is a brand new disk, so this 
seems to be a controller issue (high IO load?) - and the array then resynced.
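
(The re-add would have been along these lines - a sketch only, since the
faulty member normally has to be removed from the array first:

mdadm /dev/md6 -r /dev/sdc1
mdadm /dev/md6 -a /dev/sdc1

Without a write-intent bitmap, md treats the re-added disk as a fresh spare
and does a full recovery onto it.)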

At this point I'm confused as to the state of the array.

mdadm -D /dev/md6 gives:

/dev/md6:
        Version : 00.91.03
  Creation Time : Tue Aug  1 23:31:54 2006
     Raid Level : raid5
     Array Size : 312576512 (298.10 GiB 320.08 GB)
  Used Dev Size : 156288256 (149.05 GiB 160.04 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 6
    Persistence : Superblock is persistent

    Update Time : Sat Feb 17 12:14:22 2007
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 128K

  Delta Devices : 1, (3->4)

           UUID : 603e7ac0:de4df2d1:d44c6b9b:3d20ad32
         Events : 0.7215890

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       3       8        1        3      active sync   /dev/sda1

Although previously (before issuing the command below) it reported the 
reshape as being at 1%, or something to that effect.

I've attempted to continue the reshape by issuing:

mdadm --grow /dev/md6 -n 4 

Which gives the error that the array can't be reshaped without increasing 
its size!

Is my array destroyed? Seeing as the sda disk wasn't completely synced, I 
wonder what it was using to resync the array when sdc went offline. I've got 
a bad feeling about this :|

Help appreciated. (I do have a full backup of course, but that's a last 
resort - with my luck I'd get a read error from the tape drive.)

Regards,
Marc




--


* Re: mdadm --grow failed
  2007-02-17  3:22 mdadm --grow failed Marc Marais
@ 2007-02-17  8:40 ` Neil Brown
  2007-02-18  9:20   ` Marc Marais
  2007-02-17 18:27 ` Bill Davidsen
  2007-02-18 11:51 ` David Greaves
  2 siblings, 1 reply; 14+ messages in thread
From: Neil Brown @ 2007-02-17  8:40 UTC (permalink / raw)
  To: Marc Marais; +Cc: linux-raid

On Saturday February 17, marcm@liquid-nexus.net wrote:
> 
> Is my array destroyed? Seeing as the sda disk wasn't completely synced, I 
> wonder what it was using to resync the array when sdc went offline. I've got 
> a bad feeling about this :|

I can understand your bad feeling...
What happened there shouldn't happen, but obviously it did.  There is
evidence that all is not lost but obviously I cannot be sure yet.

Can you "fsck -n" the array?  does the data still seem to be intact?

Can you report exactly what version of Linux kernel, and of mdadm you
are using, and give the output of "mdadm -E" on each drive.
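
For example - a sketch, assuming the same device names as in your report:

fsck -n /dev/md6
mdadm -E /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

mdadm -E prints each member's superblock, including the event counts and
the recorded reshape state.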

I'll try to work out what happened and how to go forward, but am
unlikely to get back to you for 24-48 hours (I have a busy weekend:-).

NeilBrown


* Re: mdadm --grow failed
  2007-02-17  3:22 mdadm --grow failed Marc Marais
  2007-02-17  8:40 ` Neil Brown
@ 2007-02-17 18:27 ` Bill Davidsen
  2007-02-17 19:16   ` Justin Piszcz
  2007-02-18 11:51 ` David Greaves
  2 siblings, 1 reply; 14+ messages in thread
From: Bill Davidsen @ 2007-02-17 18:27 UTC (permalink / raw)
  To: Marc Marais; +Cc: linux-raid

Marc Marais wrote:
> I'm trying to grow my raid 5 array as I've just added a new disk. The array 
> was originally 3 drives, I've added a fourth using:
>
> mdadm -a /dev/md6 /dev/sda1
>
> Which added the new drive as a spare. I then did:
>
> mdadm --grow /dev/md6 -n 4
>
> Which started the reshape operation. 
>
> Feb 16 23:51:40 xerces kernel: RAID5 conf printout:
> Feb 16 23:51:40 xerces kernel:  --- rd:4 wd:4
> Feb 16 23:51:40 xerces kernel:  disk 0, o:1, dev:sdb1
> Feb 16 23:51:40 xerces kernel:  disk 1, o:1, dev:sdc1
> Feb 16 23:51:40 xerces kernel:  disk 2, o:1, dev:sdd1
> Feb 16 23:51:40 xerces kernel:  disk 3, o:1, dev:sda1
> Feb 16 23:51:40 xerces kernel: md: reshape of RAID array md6
> Feb 16 23:51:40 xerces kernel: md: minimum _guaranteed_  speed: 1000 
> KB/sec/disk.
> Feb 16 23:51:40 xerces kernel: md: using maximum available idle IO bandwidth 
> (but not more than 200000 KB/sec) for reshape.
> Feb 16 23:51:40 xerces kernel: md: using 128k window, over a total of 
> 156288256 blocks.
>
> Unfortunately one of the drives timed out during the operation (not a read 
> error - just a timeout - which I would've thought would be retried but 
> anyway...):
>
> Feb 17 00:19:16 xerces kernel: ata3: command timeout
> Feb 17 00:19:16 xerces kernel: ata3: no sense translation for status: 0x40
> Feb 17 00:19:16 xerces kernel: ata3: translated ATA stat/err 0x40/00 to SCSI 
> SK/ASC/ASCQ 0xb/00/00
> Feb 17 00:19:16 xerces kernel: ata3: status=0x40 { DriveReady }
> Feb 17 00:19:16 xerces kernel: sd 3:0:0:0: SCSI error: return code = 
> 0x08000002
> Feb 17 00:19:16 xerces kernel: sdc: Current [descriptor]: sense key: Aborted 
> Command
> Feb 17 00:19:16 xerces kernel:     Additional sense: No additional sense 
> information
> Feb 17 00:19:16 xerces kernel: Descriptor sense data with sense descriptors 
> (in hex):
> Feb 17 00:19:16 xerces kernel:         72 0b 00 00 00 00 00 0c 00 0a 80 00 
> 00 00 00 00 
> Feb 17 00:19:16 xerces kernel:         00 00 00 01 
> Feb 17 00:19:16 xerces kernel: end_request: I/O error, dev sdc, sector 
> 24065423
> Feb 17 00:19:16 xerces kernel: raid5: Disk failure on sdc1, disabling 
> device. Operation continuing on 3 devices
>
> Which then unfortunately aborted the reshape operation:
>
> Feb 17 00:19:16 xerces kernel: md: md6: reshape done.
> Feb 17 00:19:17 xerces kernel: RAID5 conf printout:
> Feb 17 00:19:17 xerces kernel:  --- rd:4 wd:3
> Feb 17 00:19:17 xerces kernel:  disk 0, o:1, dev:sdb1
> Feb 17 00:19:17 xerces kernel:  disk 1, o:0, dev:sdc1
> Feb 17 00:19:17 xerces kernel:  disk 2, o:1, dev:sdd1
> Feb 17 00:19:17 xerces kernel:  disk 3, o:1, dev:sda1
> Feb 17 00:19:17 xerces kernel: RAID5 conf printout:
> Feb 17 00:19:17 xerces kernel:  --- rd:4 wd:3
> Feb 17 00:19:17 xerces kernel:  disk 0, o:1, dev:sdb1
> Feb 17 00:19:17 xerces kernel:  disk 2, o:1, dev:sdd1
> Feb 17 00:19:17 xerces kernel:  disk 3, o:1, dev:sda1
>
> I re-added the failed disk (sdc) (which btw is a brand new disk - seems this 
> is a controller issue - high IO load?) which then resynced the array.
>
> At this point I'm confused as to the state of the array.
>
> mdadm -D /dev/md6 gives:
>
> /dev/md6:
>         Version : 00.91.03
>   Creation Time : Tue Aug  1 23:31:54 2006
>      Raid Level : raid5
>      Array Size : 312576512 (298.10 GiB 320.08 GB)
>   Used Dev Size : 156288256 (149.05 GiB 160.04 GB)
>    Raid Devices : 4
>   Total Devices : 4
> Preferred Minor : 6
>     Persistence : Superblock is persistent
>
>     Update Time : Sat Feb 17 12:14:22 2007
>           State : clean
>  Active Devices : 4
> Working Devices : 4
>  Failed Devices : 0
>   Spare Devices : 0
>
>          Layout : left-symmetric
>      Chunk Size : 128K
>
>   Delta Devices : 1, (3->4)
>
>            UUID : 603e7ac0:de4df2d1:d44c6b9b:3d20ad32
>          Events : 0.7215890
>
>     Number   Major   Minor   RaidDevice State
>        0       8       17        0      active sync   /dev/sdb1
>        1       8       33        1      active sync   /dev/sdc1
>        2       8       49        2      active sync   /dev/sdd1
>        3       8        1        3      active sync   /dev/sda1
>
> Although it previously (before issuing the command below) mentioned 
> something about reshape 1% or something to that effect.
>
> I've attempted to continue the reshape by issuing:
>
> mdadm --grow /dev/md6 -n 4 
>
> Which gives the error that the array can't be reshaped without increasing 
> its size!
>
> Is my array destroyed? Seeing as the sda disk wasn't completely synced, I 
> wonder what it was using to resync the array when sdc went offline. I've got 
> a bad feeling about this :|
>
> Help appreciated. (I do have a full backup of course but that's a last 
> resort with my luck I'd get a read error from the tape drive)
I have to think maybe a 'check' would have been good before the grow, 
but since Neil didn't suggest it, please don't now, unless he agrees 
that it's a valid attempt.

However, you certainly can run 'df' and see if the filesystem is resized.
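
Something like this, with a hypothetical mount point:

df -h /mnt/array

Note that df reports the filesystem's own idea of its size, which only
changes after an explicit resize (e.g. resize2fs), so it is a cheap,
read-only sanity check.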

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979



* Re: mdadm --grow failed
  2007-02-17 18:27 ` Bill Davidsen
@ 2007-02-17 19:16   ` Justin Piszcz
  2007-02-17 21:08     ` Neil Brown
  0 siblings, 1 reply; 14+ messages in thread
From: Justin Piszcz @ 2007-02-17 19:16 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Marc Marais, linux-raid



On Sat, 17 Feb 2007, Bill Davidsen wrote:

> Marc Marais wrote:
>> I'm trying to grow my raid 5 array as I've just added a new disk. The array 
>> was originally 3 drives, I've added a fourth using:
>> 
>> mdadm -a /dev/md6 /dev/sda1
>> 
>> Which added the new drive as a spare. I then did:
>> 
>> mdadm --grow /dev/md6 -n 4
>> 
>> Which started the reshape operation. 
>> Feb 16 23:51:40 xerces kernel: RAID5 conf printout:
>> Feb 16 23:51:40 xerces kernel:  --- rd:4 wd:4
>> Feb 16 23:51:40 xerces kernel:  disk 0, o:1, dev:sdb1
>> Feb 16 23:51:40 xerces kernel:  disk 1, o:1, dev:sdc1
>> Feb 16 23:51:40 xerces kernel:  disk 2, o:1, dev:sdd1
>> Feb 16 23:51:40 xerces kernel:  disk 3, o:1, dev:sda1
>> Feb 16 23:51:40 xerces kernel: md: reshape of RAID array md6
>> Feb 16 23:51:40 xerces kernel: md: minimum _guaranteed_  speed: 1000 
>> KB/sec/disk.
>> Feb 16 23:51:40 xerces kernel: md: using maximum available idle IO 
>> bandwidth (but not more than 200000 KB/sec) for reshape.
>> Feb 16 23:51:40 xerces kernel: md: using 128k window, over a total of 
>> 156288256 blocks.
>> 
>> Unfortunately one of the drives timed out during the operation (not a read 
>> error - just a timeout - which I would've thought would be retried but 
>> anyway...):
>> 
>> Feb 17 00:19:16 xerces kernel: ata3: command timeout
>> Feb 17 00:19:16 xerces kernel: ata3: no sense translation for status: 0x40
>> Feb 17 00:19:16 xerces kernel: ata3: translated ATA stat/err 0x40/00 to 
>> SCSI SK/ASC/ASCQ 0xb/00/00
>> Feb 17 00:19:16 xerces kernel: ata3: status=0x40 { DriveReady }
>> Feb 17 00:19:16 xerces kernel: sd 3:0:0:0: SCSI error: return code = 
>> 0x08000002
>> Feb 17 00:19:16 xerces kernel: sdc: Current [descriptor]: sense key: 
>> Aborted Command
>> Feb 17 00:19:16 xerces kernel:     Additional sense: No additional sense 
>> information
>> Feb 17 00:19:16 xerces kernel: Descriptor sense data with sense descriptors 
>> (in hex):
>> Feb 17 00:19:16 xerces kernel:         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
>> Feb 17 00:19:16 xerces kernel:         00 00 00 01
>> Feb 17 00:19:16 xerces kernel: end_request: I/O error, dev sdc, sector 24065423
>> Feb 17 00:19:16 xerces kernel: raid5: Disk failure on sdc1, disabling 
>> device. Operation continuing on 3 devices
>> 
>> Which then unfortunately aborted the reshape operation:
>> 
>> Feb 17 00:19:16 xerces kernel: md: md6: reshape done.
>> Feb 17 00:19:17 xerces kernel: RAID5 conf printout:
>> Feb 17 00:19:17 xerces kernel:  --- rd:4 wd:3
>> Feb 17 00:19:17 xerces kernel:  disk 0, o:1, dev:sdb1
>> Feb 17 00:19:17 xerces kernel:  disk 1, o:0, dev:sdc1
>> Feb 17 00:19:17 xerces kernel:  disk 2, o:1, dev:sdd1
>> Feb 17 00:19:17 xerces kernel:  disk 3, o:1, dev:sda1
>> Feb 17 00:19:17 xerces kernel: RAID5 conf printout:
>> Feb 17 00:19:17 xerces kernel:  --- rd:4 wd:3
>> Feb 17 00:19:17 xerces kernel:  disk 0, o:1, dev:sdb1
>> Feb 17 00:19:17 xerces kernel:  disk 2, o:1, dev:sdd1
>> Feb 17 00:19:17 xerces kernel:  disk 3, o:1, dev:sda1
>> 
>> I re-added the failed disk (sdc) (which btw is a brand new disk - seems 
>> this is a controller issue - high IO load?) which then resynced the array.
>> 
>> At this point I'm confused as to the state of the array.
>> 
>> mdadm -D /dev/md6 gives:
>> 
>> /dev/md6:
>>         Version : 00.91.03
>>   Creation Time : Tue Aug  1 23:31:54 2006
>>      Raid Level : raid5
>>      Array Size : 312576512 (298.10 GiB 320.08 GB)
>>   Used Dev Size : 156288256 (149.05 GiB 160.04 GB)
>>    Raid Devices : 4
>>   Total Devices : 4
>> Preferred Minor : 6
>>     Persistence : Superblock is persistent
>>
>>     Update Time : Sat Feb 17 12:14:22 2007
>>           State : clean
>>  Active Devices : 4
>> Working Devices : 4
>>  Failed Devices : 0
>>   Spare Devices : 0
>>
>>          Layout : left-symmetric
>>      Chunk Size : 128K
>>
>>   Delta Devices : 1, (3->4)
>>
>>            UUID : 603e7ac0:de4df2d1:d44c6b9b:3d20ad32
>>          Events : 0.7215890
>>
>>     Number   Major   Minor   RaidDevice State
>>        0       8       17        0      active sync   /dev/sdb1
>>        1       8       33        1      active sync   /dev/sdc1
>>        2       8       49        2      active sync   /dev/sdd1
>>        3       8        1        3      active sync   /dev/sda1
>> 
>> Although it previously (before issuing the command below) mentioned 
>> something about reshape 1% or something to that effect.
>> 
>> I've attempted to continue the reshape by issuing:
>> 
>> mdadm --grow /dev/md6 -n 4 
>> Which gives the error that the array can't be reshaped without increasing 
>> its size!
>> 
>> Is my array destroyed? Seeing as the sda disk wasn't completely synced, I 
>> wonder what it was using to resync the array when sdc went offline. I've got 
>> a bad feeling about this :|
>> 
>> Help appreciated. (I do have a full backup of course but that's a last 
>> resort with my luck I'd get a read error from the tape drive)
> I have to think maybe a 'check' would have been good before the grow, but 
> since Neil didn't suggest it, please don't now, unless he agrees that it's a 
> valid attempt.
>
> However, you certainly can run 'df' and see if the filesystem is resized.
>
> -- 
> bill davidsen <davidsen@tmr.com>
> CTO TMR Associates, Inc
> Doing interesting things with small computers since 1979
>

Is growing an array with > 1 disk at a time permissible?  I've grown a 
raid 5 from 1.8tb to 3.3tb but always 1 disk at a time.

Justin.


* Re: mdadm --grow failed
  2007-02-17 19:16   ` Justin Piszcz
@ 2007-02-17 21:08     ` Neil Brown
  2007-02-17 21:30       ` Justin Piszcz
  0 siblings, 1 reply; 14+ messages in thread
From: Neil Brown @ 2007-02-17 21:08 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: Bill Davidsen, Marc Marais, linux-raid

On Saturday February 17, jpiszcz@lucidpixels.com wrote:
> 
> Is growing an array with > 1 disk at a time permissible?  I've grown a 
> raid 5 from 1.8tb to 3.3tb but always 1 disk at a time.

Sure is.  >0 is the current requirement.
You can grow a 2-drive raid5 directly to a 10-drive one if you like.
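
For example - hypothetical devices, assuming the spares have already been
added to the array:

mdadm /dev/mdX -a /dev/sde1
mdadm /dev/mdX -a /dev/sdf1
mdadm --grow /dev/mdX -n 6

A single reshape pass then redistributes the data across all six members.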

NeilBrown


* Re: mdadm --grow failed
  2007-02-17 21:08     ` Neil Brown
@ 2007-02-17 21:30       ` Justin Piszcz
  0 siblings, 0 replies; 14+ messages in thread
From: Justin Piszcz @ 2007-02-17 21:30 UTC (permalink / raw)
  To: Neil Brown; +Cc: Bill Davidsen, Marc Marais, linux-raid



On Sun, 18 Feb 2007, Neil Brown wrote:

> On Saturday February 17, jpiszcz@lucidpixels.com wrote:
>>
>> Is growing an array with > 1 disk at a time permissible?  I've grown a
>> raid 5 from 1.8tb to 3.3tb but always 1 disk at a time.
>
> Sure is.  >0 is the current requirement.
> You can grow a 2-drive raid5 directly to a 10-drive one if you like.
>
> NeilBrown

Wow! Thanks for the info, was not aware of this.


* Re: mdadm --grow failed
  2007-02-17  8:40 ` Neil Brown
@ 2007-02-18  9:20   ` Marc Marais
       [not found]     ` <17880.7869.963793.706096@notabene.brown>
  2007-02-19  0:50     ` Neil Brown
  0 siblings, 2 replies; 14+ messages in thread
From: Marc Marais @ 2007-02-18  9:20 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Ok, I understand the risks, which is why I did a full backup before doing 
this. I have subsequently recreated the array and restored my data from 
backup.

Just for information, the e2fsck -n on the drive hung (unresponsive with no 
I/O) so I assume the filesystem was hosed. I suspect resyncing the array 
after the grow failed was a bad idea. 

I'm not sure how the grow operation is performed, but to me it seems that 
there is no fault tolerance during the operation, so any failure will cause a 
corrupt array. My 2c: if any drive fails during a grow operation, it should be 
aborted in such a way as to allow a restart later (if possible) - in my case a 
retry would've probably worked. 

Anyway, if you need more info to help improve growing arrays let me know.

As a side note, either my Promise TX4000 controller is acting up or 
there are still some unresolved issues with libata in general and/or 
sata_promise itself. 

Regards,
Marc

On Sat, 17 Feb 2007 19:40:17 +1100, Neil Brown wrote
> On Saturday February 17, marcm@liquid-nexus.net wrote:
> > 
> > Is my array destroyed? Seeing as the sda disk wasn't completely synced, I
> > wonder what it was using to resync the array when sdc went offline. I've
> > got a bad feeling about this :|
> 
> I can understand your bad feeling...
> What happened there shouldn't happen, but obviously it did.  There is
> evidence that all is not lost but obviously I cannot be sure yet.
> 
> Can you "fsck -n" the array?  does the data still seem to be intact?
> 
> Can you report exactly what version of Linux kernel, and of mdadm you
> are using, and give the output of "mdadm -E" on each drive.
> 
> I'll try to work out what happened and how to go forward, but am
> unlikely to get back to you for 24-48 hours (I have a busy weekend:-).
> 
> NeilBrown


--


* Re: mdadm --grow failed
  2007-02-17  3:22 mdadm --grow failed Marc Marais
  2007-02-17  8:40 ` Neil Brown
  2007-02-17 18:27 ` Bill Davidsen
@ 2007-02-18 11:51 ` David Greaves
  2 siblings, 0 replies; 14+ messages in thread
From: David Greaves @ 2007-02-18 11:51 UTC (permalink / raw)
  To: Marc Marais; +Cc: linux-raid

Marc Marais wrote:
[snip]
> Unfortunately one of the drives timed out during the operation (not a read 
> error - just a timeout - which I would've thought would be retried but 
> anyway...):
> Help appreciated. (I do have a full backup of course but that's a last 
> resort with my luck I'd get a read error from the tape drive)

Hi Marc
It looks like you've since recreated the array and restored your data - good :)

It doesn't appear that you mentioned the kernel and distro you are using, or
the software versions.

I'm sure this is something people will need.

David


* Fw: Re: mdadm --grow failed
       [not found]       ` <20070218105242.M29958@liquid-nexus.net>
@ 2007-02-18 11:57         ` Marc Marais
  2007-02-18 12:13           ` Justin Piszcz
  0 siblings, 1 reply; 14+ messages in thread
From: Marc Marais @ 2007-02-18 11:57 UTC (permalink / raw)
  To: linux-raid

On Sun, 18 Feb 2007 20:39:09 +1100, Neil Brown wrote
> On Sunday February 18, marcm@liquid-nexus.net wrote:
> > Ok, I understand the risks which is why I did a full backup before doing 
> > this. I have subsequently recreated the array and restored my data from 
> > backup.
> 
> Could you still please tell me exactly what kernel/mdadm version you
> were using?
> 
> Thanks,
> NeilBrown

2.6.20, with the patch you supplied in response to the "md6_raid5 crash" 
email I posted to linux-raid a few days ago. Just as background, I replaced 
the failing drive and at the same time bought an additional drive in order 
to increase the array size.

mdadm -V = v2.6 - 21 December 2006. Compiled under Debian (stable).

Also, I've just noticed another drive failure with the new array with a 
similar error to what happened during the grow operation (although on a 
different drive) - I wonder if I should post this to linux-ide?

Feb 18 00:58:10 xerces kernel: ata4: command timeout
Feb 18 00:58:10 xerces kernel: ata4: no sense translation for status: 0x40
Feb 18 00:58:10 xerces kernel: ata4: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
Feb 18 00:58:10 xerces kernel: ata4: status=0x40 { DriveReady }
Feb 18 00:58:10 xerces kernel: sd 4:0:0:0: SCSI error: return code = 0x08000002
Feb 18 00:58:10 xerces kernel: sdd: Current [descriptor]: sense key: Aborted Command
Feb 18 00:58:10 xerces kernel:     Additional sense: No additional sense information
Feb 18 00:58:10 xerces kernel: Descriptor sense data with sense descriptors (in hex):
Feb 18 00:58:10 xerces kernel:         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Feb 18 00:58:10 xerces kernel:         00 00 00 00
Feb 18 00:58:10 xerces kernel: end_request: I/O error, dev sdd, sector 35666775
Feb 18 00:58:10 xerces kernel: raid5: Disk failure on sdd1, disabling device. Operation continuing on 3 devices

Regards,
Marc



* Re: Fw: Re: mdadm --grow failed
  2007-02-18 11:57         ` Fw: " Marc Marais
@ 2007-02-18 12:13           ` Justin Piszcz
  2007-02-18 12:32             ` Marc Marais
  2007-02-19  5:41             ` Marc Marais
  0 siblings, 2 replies; 14+ messages in thread
From: Justin Piszcz @ 2007-02-18 12:13 UTC (permalink / raw)
  To: Marc Marais; +Cc: linux-raid

On Sun, 18 Feb 2007, Marc Marais wrote:

> On Sun, 18 Feb 2007 20:39:09 +1100, Neil Brown wrote
>> On Sunday February 18, marcm@liquid-nexus.net wrote:
>>> Ok, I understand the risks which is why I did a full backup before doing
>>> this. I have subsequently recreated the array and restored my data from
>>> backup.
>>
>> Could you still please tell me exactly what kernel/mdadm version you
>> were using?
>>
>> Thanks,
>> NeilBrown
>
> 2.6.20 with the patch you supplied in response to the "md6_raid5 crash
> email" I posted in linux-raid a few days ago. Just as background, I replaced
> the failing drive and at the same time bought an additional drive in order
> to increase the array size.
>
> mdadm -V = v2.6 - 21 December 2006. Compiled under Debian (stable).
>
> Also, I've just noticed another drive failure with the new array with a
> similar error to what happened during the grow operation (although on a
> different drive) - I wonder if I should post this to linux-ide?
>
> Feb 18 00:58:10 xerces kernel: ata4: command timeout
> Feb 18 00:58:10 xerces kernel: ata4: no sense translation for status: 0x40
> Feb 18 00:58:10 xerces kernel: ata4: translated ATA stat/err 0x40/00 to SCSI
> SK/ASC/ASCQ 0xb/00/00
> Feb 18 00:58:10 xerces kernel: ata4: status=0x40 { DriveReady }
> Feb 18 00:58:10 xerces kernel: sd 4:0:0:0: SCSI error: return code =
> 0x08000002
> Feb 18 00:58:10 xerces kernel: sdd: Current [descriptor]: sense key: Aborted
> Command
> Feb 18 00:58:10 xerces kernel:     Additional sense: No additional sense
> information
> Feb 18 00:58:10 xerces kernel: Descriptor sense data with sense descriptors
> (in hex):
> Feb 18 00:58:10 xerces kernel:         72 0b 00 00 00 00 00 0c 00 0a 80 00
> 00 00 00 00
> Feb 18 00:58:10 xerces kernel:         00 00 00 00
> Feb 18 00:58:10 xerces kernel: end_request: I/O error, dev sdd, sector
> 35666775
> Feb 18 00:58:10 xerces kernel: raid5: Disk failure on sdd1, disabling
> device. Operation continuing on 3 devices
>
> Regards,
> Marc
>

Just out of curiosity:

Feb 18 00:58:10 xerces kernel: end_request: I/O error, dev sdd, sector 35666775

Can you run:

smartctl -d ata -t short /dev/sdd
wait 5 min
smartctl -d ata -t long /dev/sdd
wait 2-3 hr
smartctl -d ata -a /dev/sdd

And then e-mail that output to the list?

Justin.



* Re: mdadm --grow failed
  2007-02-18 12:13           ` Justin Piszcz
@ 2007-02-18 12:32             ` Marc Marais
  2007-02-19  5:41             ` Marc Marais
  1 sibling, 0 replies; 14+ messages in thread
From: Marc Marais @ 2007-02-18 12:32 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-raid

On Sun, 18 Feb 2007 07:13:28 -0500 (EST), Justin Piszcz wrote
> On Sun, 18 Feb 2007, Marc Marais wrote:
> 
> > On Sun, 18 Feb 2007 20:39:09 +1100, Neil Brown wrote
> >> On Sunday February 18, marcm@liquid-nexus.net wrote:
> >>> Ok, I understand the risks which is why I did a full backup before doing
> >>> this. I have subsequently recreated the array and restored my data from
> >>> backup.
> >>
> >> Could you still please tell me exactly what kernel/mdadm version you
> >> were using?
> >>
> >> Thanks,
> >> NeilBrown
> >
> > 2.6.20 with the patch you supplied in response to the "md6_raid5 crash
> > email" I posted in linux-raid a few days ago. Just as background, I 
replaced
> > the failing drive and at the same time bought an additional drive in 
order
> > to increase the array size.
> >
> > mdadm -V = v2.6 - 21 December 2006. Compiled under Debian (stable).
> >
> > Also, I've just noticed another drive failure with the new array with a
> > similar error to what happened during the grow operation (although on a
> > different drive) - I wonder if I should post this to linux-ide?
> >
> > Feb 18 00:58:10 xerces kernel: ata4: command timeout
> > Feb 18 00:58:10 xerces kernel: ata4: no sense translation for status: 0x40
> > Feb 18 00:58:10 xerces kernel: ata4: translated ATA stat/err 0x40/00 to
> > SK/ASC/ASCQ 0xb/00/00
> > Feb 18 00:58:10 xerces kernel: ata4: status=0x40 { DriveReady }
> > Feb 18 00:58:10 xerces kernel: sd 4:0:0:0: SCSI error: return code =
> > 0x08000002
> > Feb 18 00:58:10 xerces kernel: sdd: Current [descriptor]: sense key: Aborted
> > Command
> > Feb 18 00:58:10 xerces kernel:     Additional sense: No additional sense
> > information
> > Feb 18 00:58:10 xerces kernel: Descriptor sense data with sense descriptors
> > (in hex):
> > Feb 18 00:58:10 xerces kernel:         72 0b 00 00 00 00 00 0c 00 0a 80 00
> > 00 00 00 00
> > Feb 18 00:58:10 xerces kernel:         00 00 00 00
> > Feb 18 00:58:10 xerces kernel: end_request: I/O error, dev sdd, sector
> > 35666775
> > Feb 18 00:58:10 xerces kernel: raid5: Disk failure on sdd1, disabling
> > device. Operation continuing on 3 devices
> >
> > Regards,
> > Marc
> >
> 
> Just out of curiosity:
> 
> Feb 18 00:58:10 xerces kernel: end_request: I/O error, dev sdd,
>  sector 35666775
> 
> Can you run:
> 
> smartctl -d ata -t short /dev/sdd
> wait 5 min
> smartctl -d ata -t long /dev/sdd
> wait 2-3 hr
> smartctl -d ata -a /dev/sdd
> 
> And then e-mail that output to the list?
> 
> Justin.

I have smartmontools performing regular short and long scans but I will run 
the tests immediately and send the output of smartctl -a when done. 

Note I'm getting similar errors on sdc too (as in 5 minutes ago). 
Interestingly the SMART error logs for sdc and sdd show no errors at all. 
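
(The drive-side logs can be read directly with, e.g.:

smartctl -d ata -l error /dev/sdc
smartctl -d ata -l selftest /dev/sdc

- the -d ata is needed here because these SATA disks sit behind libata.)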

ata3: command timeout
ata3: no sense translation for status: 0x40
ata3: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
ata4: status=0x40 { DriveReady }
sd 3:0:0:0: SCSI error: return code = 0x08000002
sdd: Current [descriptor]: sense key: Aborted Command
     Additional sense: No additional sense information
Descriptor sense data with sense descriptors (in hex):
         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
         00 00 00 00
end_request: I/O error, dev sdc, sector 260419647
raid5:md6: read error corrected (8 sectors at 260419584 on sdc1)

Will post logs when done...

Marc

--


* Re: mdadm --grow failed
  2007-02-18  9:20   ` Marc Marais
       [not found]     ` <17880.7869.963793.706096@notabene.brown>
@ 2007-02-19  0:50     ` Neil Brown
  1 sibling, 0 replies; 14+ messages in thread
From: Neil Brown @ 2007-02-19  0:50 UTC (permalink / raw)
  To: Marc Marais; +Cc: linux-raid

On Sunday February 18, marcm@liquid-nexus.net wrote:
> 
> I'm not sure how the grow operation is performed but to me it seems that 
> their is no fault tolerance during the operation so any failure will cause a 
> corrupt array. My 2c would be that if any drive fails during a grow 
> operation that the operation is aborted in such a way as to allow a restart 
> later (if possible) - as in my case a retry would've probably worked. 

For what it's worth, the code does exactly what you suggest.  It does
fail gracefully.  The problem is that it doesn't restart quite the
way you would like.

Had you stopped the array and re-assembled it, it would have resumed
the reshape process (at least it did in my testing).
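
That is, roughly - device names as in your report:

mdadm -S /dev/md6
mdadm -A /dev/md6 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sda1

On assembly, md reads the reshape position recorded in the (0.91-version)
superblocks and continues from there.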

The following patch makes it retry a reshape straight away if it was
aborted due to a device failure (of course, if too many devices have
failed, the retry won't get anywhere, but you would expect that).

Thanks for the valuable feedback.

NeilBrown


Restart a (raid5) reshape that has been aborted due to a read/write error.

An error always aborts any resync/recovery/reshape on the understanding
that it will immediately be restarted if that still makes sense.
However a reshape currently doesn't get restarted.  With this patch
it does.
To avoid restarting when it is not possible to do work, we call
into the personality to check that a reshape is ok, and strengthen
raid5_check_reshape to fail if there are too many failed devices.

We also break some code out into a separate function, remove_and_add_spares,
as the indent level for that code was getting crazy.


### Diffstat output
 ./drivers/md/md.c    |   74 +++++++++++++++++++++++++++++++--------------------
 ./drivers/md/raid5.c |    2 +
 2 files changed, 47 insertions(+), 29 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2007-02-19 11:44:51.000000000 +1100
+++ ./drivers/md/md.c	2007-02-19 11:44:54.000000000 +1100
@@ -5343,6 +5343,44 @@ void md_do_sync(mddev_t *mddev)
 EXPORT_SYMBOL_GPL(md_do_sync);
 
 
+static int remove_and_add_spares(mddev_t *mddev)
+{
+	mdk_rdev_t *rdev;
+	struct list_head *rtmp;
+	int spares = 0;
+
+	ITERATE_RDEV(mddev,rdev,rtmp)
+		if (rdev->raid_disk >= 0 &&
+		    (test_bit(Faulty, &rdev->flags) ||
+		     ! test_bit(In_sync, &rdev->flags)) &&
+		    atomic_read(&rdev->nr_pending)==0) {
+			if (mddev->pers->hot_remove_disk(
+				    mddev, rdev->raid_disk)==0) {
+				char nm[20];
+				sprintf(nm,"rd%d", rdev->raid_disk);
+				sysfs_remove_link(&mddev->kobj, nm);
+				rdev->raid_disk = -1;
+			}
+		}
+
+	if (mddev->degraded) {
+		ITERATE_RDEV(mddev,rdev,rtmp)
+			if (rdev->raid_disk < 0
+			    && !test_bit(Faulty, &rdev->flags)) {
+				rdev->recovery_offset = 0;
+				if (mddev->pers->hot_add_disk(mddev,rdev)) {
+					char nm[20];
+					sprintf(nm, "rd%d", rdev->raid_disk);
+					sysfs_create_link(&mddev->kobj,
+							  &rdev->kobj, nm);
+					spares++;
+					md_new_event(mddev);
+				} else
+					break;
+			}
+	}
+	return spares;
+}
 /*
  * This routine is regularly called by all per-raid-array threads to
  * deal with generic issues like resync and super-block update.
@@ -5397,7 +5435,7 @@ void md_check_recovery(mddev_t *mddev)
 		return;
 
 	if (mddev_trylock(mddev)) {
-		int spares =0;
+		int spares = 0;
 
 		spin_lock_irq(&mddev->write_lock);
 		if (mddev->safemode && !atomic_read(&mddev->writes_pending) &&
@@ -5460,35 +5498,13 @@ void md_check_recovery(mddev_t *mddev)
 		 * Spare are also removed and re-added, to allow
 		 * the personality to fail the re-add.
 		 */
-		ITERATE_RDEV(mddev,rdev,rtmp)
-			if (rdev->raid_disk >= 0 &&
-			    (test_bit(Faulty, &rdev->flags) || ! test_bit(In_sync, &rdev->flags)) &&
-			    atomic_read(&rdev->nr_pending)==0) {
-				if (mddev->pers->hot_remove_disk(mddev, rdev->raid_disk)==0) {
-					char nm[20];
-					sprintf(nm,"rd%d", rdev->raid_disk);
-					sysfs_remove_link(&mddev->kobj, nm);
-					rdev->raid_disk = -1;
-				}
-			}
-
-		if (mddev->degraded) {
-			ITERATE_RDEV(mddev,rdev,rtmp)
-				if (rdev->raid_disk < 0
-				    && !test_bit(Faulty, &rdev->flags)) {
-					rdev->recovery_offset = 0;
-					if (mddev->pers->hot_add_disk(mddev,rdev)) {
-						char nm[20];
-						sprintf(nm, "rd%d", rdev->raid_disk);
-						sysfs_create_link(&mddev->kobj, &rdev->kobj, nm);
-						spares++;
-						md_new_event(mddev);
-					} else
-						break;
-				}
-		}
 
-		if (spares) {
+		if (mddev->reshape_position != MaxSector) {
+			if (mddev->pers->check_reshape(mddev) != 0)
+				/* Cannot proceed */
+				goto unlock;
+			set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
+		} else if ((spares = remove_and_add_spares(mddev))) {
 			clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
 			clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
 		} else if (mddev->recovery_cp < MaxSector) {

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c	2007-02-19 11:44:48.000000000 +1100
+++ ./drivers/md/raid5.c	2007-02-19 11:44:54.000000000 +1100
@@ -3814,6 +3814,8 @@ static int raid5_check_reshape(mddev_t *
 	if (err)
 		return err;
 
+	if (mddev->degraded > conf->max_degraded)
+		return -EINVAL;
 	/* looks like we might be able to manage this */
 	return 0;
 }
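
(To try this against a vanilla tree, the usual drill applies - a sketch,
with the diff saved under a hypothetical filename:

cd linux-2.6.20
patch -p1 < md-restart-reshape.patch

then rebuild and install the kernel as usual.)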


* Re: mdadm --grow failed
  2007-02-18 12:13           ` Justin Piszcz
  2007-02-18 12:32             ` Marc Marais
@ 2007-02-19  5:41             ` Marc Marais
  2007-02-19 13:25               ` Justin Piszcz
  1 sibling, 1 reply; 14+ messages in thread
From: Marc Marais @ 2007-02-19  5:41 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-raid

On Sun, 18 Feb 2007 07:13:28 -0500 (EST), Justin Piszcz wrote
> On Sun, 18 Feb 2007, Marc Marais wrote:
> 
> > On Sun, 18 Feb 2007 20:39:09 +1100, Neil Brown wrote
> >> On Sunday February 18, marcm@liquid-nexus.net wrote:
> >>> Ok, I understand the risks which is why I did a full backup before doing
> >>> this. I have subsequently recreated the array and restored my data from
> >>> backup.
> >>
> >> Could you still please tell me exactly what kernel/mdadm version you
> >> were using?
> >>
> >> Thanks,
> >> NeilBrown
> >
> > 2.6.20 with the patch you supplied in response to the "md6_raid5 crash
> > email" I posted in linux-raid a few days ago. Just as background, I replaced
> > the failing drive and at the same time bought an additional drive in order
> > to increase the array size.
> >
> > mdadm -V = v2.6 - 21 December 2006. Compiled under Debian (stable).
> >
> > Also, I've just noticed another drive failure with the new array with a
> > similar error to what happened during the grow operation (although on a
> > different drive) - I wonder if I should post this to linux-ide?
> >
> > Feb 18 00:58:10 xerces kernel: ata4: command timeout
> > Feb 18 00:58:10 xerces kernel: ata4: no sense translation for status: 0x40
> > Feb 18 00:58:10 xerces kernel: ata4: translated ATA stat/err 0x40/00 to SCSI
> > SK/ASC/ASCQ 0xb/00/00
> > Feb 18 00:58:10 xerces kernel: ata4: status=0x40 { DriveReady }
> > Feb 18 00:58:10 xerces kernel: sd 4:0:0:0: SCSI error: return code =
> > 0x08000002
> > Feb 18 00:58:10 xerces kernel: sdd: Current [descriptor]: sense key: Aborted
> > Command
> > Feb 18 00:58:10 xerces kernel:     Additional sense: No additional sense
> > information
> > Feb 18 00:58:10 xerces kernel: Descriptor sense data with sense descriptors
> > (in hex):
> > Feb 18 00:58:10 xerces kernel:         72 0b 00 00 00 00 00 0c 00 0a 80 00
> > 00 00 00 00
> > Feb 18 00:58:10 xerces kernel:         00 00 00 00
> > Feb 18 00:58:10 xerces kernel: end_request: I/O error, dev sdd, sector
> > 35666775
> > Feb 18 00:58:10 xerces kernel: raid5: Disk failure on sdd1, disabling
> > device. Operation continuing on 3 devices
> >
> > Regards,
> > Marc
> >
> 
> Just out of curiosity:
> 
> Feb 18 00:58:10 xerces kernel: end_request: I/O error, dev sdd,
>  sector 35666775
> 
> Can you run:
> 
> smartctl -d ata -t short /dev/sdd
> wait 5 min
> smartctl -d ata -t long /dev/sdd
> wait 2-3 hr
> smartctl -d ata -a /dev/sdd
> 
> And then e-mail that output to the list?
> 
> Justin.

Ok here we go:

/dev/sdd:

smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD1600JB-00EVA0
Serial Number:    WD-WMAEK2751794
Firmware Version: 15.05R15
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Feb 19 14:38:16 2007 GMT-9
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 (5073) seconds.
Offline data collection
capabilities: 			 (0x79) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					No General Purpose Logging support.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  67) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   148   144   021    Pre-fail  Always       -       3141
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       91
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       5070
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   253   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       90
194 Temperature_Celsius     0x0022   116   253   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   155   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       691         -
# 2  Extended offline    Completed without error       00%       686         -
# 3  Short offline       Completed without error       00%       685         -
# 4  Short offline       Completed without error       00%       620         -
# 5  Extended offline    Completed without error       00%       598         -
# 6  Short offline       Completed without error       00%       597         -
# 7  Short offline       Completed without error       00%       573         -
# 8  Short offline       Completed without error       00%       549         -
# 9  Short offline       Completed without error       00%       525         -
#10  Short offline       Completed without error       00%       501         -
#11  Short offline       Completed without error       00%       477         -
#12  Short offline       Completed without error       00%       453         -
#13  Short offline       Completed without error       00%       382         -
#14  Short offline       Completed without error       00%       358         -
#15  Short offline       Completed without error       00%       334         -
#16  Short offline       Completed without error       00%       310         -
#17  Short offline       Completed without error       00%       286         -
#18  Extended offline    Completed without error       00%       264         -
#19  Short offline       Completed without error       00%       263         -
#20  Short offline       Completed without error       00%       239         -
#21  Short offline       Completed without error       00%       215         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

--
/dev/sdc:

smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD1600JB-00REA0
Serial Number:    WD-WCANM4681863
Firmware Version: 20.00K20
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Feb 19 14:38:11 2007 GMT-9
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85)	Offline data collection activity
					was aborted by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 (4980) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  60) minutes.
Conveyance self-test routine
recommended polling time: 	 (   6) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   184   184   021    Pre-fail  Always       -       3775
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       19
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       4834
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       18
194 Temperature_Celsius     0x0022   114   095   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      4823         -
# 2  Extended offline    Completed without error       00%      4819         -
# 3  Short offline       Completed without error       00%      4817         -
# 4  Short offline       Completed without error       00%      4799         -
# 5  Short offline       Completed without error       00%      4775         -
# 6  Short offline       Completed without error       00%      4751         -
# 7  Extended offline    Completed without error       00%      4728         -
# 8  Short offline       Completed without error       00%      4727         -
# 9  Short offline       Completed without error       00%      4703         -
#10  Short offline       Completed without error       00%      4679         -
#11  Short offline       Completed without error       00%      4655         -
#12  Short offline       Completed without error       00%      4631         -
#13  Short offline       Completed without error       00%      4607         -
#14  Short offline       Completed without error       00%      4583         -
#15  Short offline       Completed without error       00%      4511         -
#16  Short offline       Completed without error       00%      4487         -
#17  Short offline       Completed without error       00%      4463         -
#18  Short offline       Completed without error       00%      4439         -
#19  Short offline       Completed without error       00%      4415         -
#20  Extended offline    Completed without error       00%      4393         -
#21  Short offline       Completed without error       00%      4391         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


* Re: mdadm --grow failed
  2007-02-19  5:41             ` Marc Marais
@ 2007-02-19 13:25               ` Justin Piszcz
  0 siblings, 0 replies; 14+ messages in thread
From: Justin Piszcz @ 2007-02-19 13:25 UTC (permalink / raw)
  To: Marc Marais; +Cc: linux-raid



On Mon, 19 Feb 2007, Marc Marais wrote:

> On Sun, 18 Feb 2007 07:13:28 -0500 (EST), Justin Piszcz wrote
>> On Sun, 18 Feb 2007, Marc Marais wrote:
>>
>>> On Sun, 18 Feb 2007 20:39:09 +1100, Neil Brown wrote
>>>> On Sunday February 18, marcm@liquid-nexus.net wrote:
>>>>> Ok, I understand the risks which is why I did a full backup before doing
>>>>> this. I have subsequently recreated the array and restored my data from
>>>>> backup.
>>>>
>>>> Could you still please tell me exactly what kernel/mdadm version you
>>>> were using?
>>>>
>>>> Thanks,
>>>> NeilBrown
>>>
>>> 2.6.20 with the patch you supplied in response to the "md6_raid5 crash
>>> email" I posted in linux-raid a few days ago. Just as background, I replaced
>>> the failing drive and at the same time bought an additional drive in order
>>> to increase the array size.
>>>
>>> mdadm -V = v2.6 - 21 December 2006. Compiled under Debian (stable).
>>>
>>> Also, I've just noticed another drive failure with the new array with a
>>> similar error to what happened during the grow operation (although on a
>>> different drive) - I wonder if I should post this to linux-ide?
>>>
>>> Feb 18 00:58:10 xerces kernel: ata4: command timeout
>>> Feb 18 00:58:10 xerces kernel: ata4: no sense translation for status: 0x40
>>> Feb 18 00:58:10 xerces kernel: ata4: translated ATA stat/err 0x40/00 to SCSI
>>> SK/ASC/ASCQ 0xb/00/00
>>> Feb 18 00:58:10 xerces kernel: ata4: status=0x40 { DriveReady }
>>> Feb 18 00:58:10 xerces kernel: sd 4:0:0:0: SCSI error: return code =
>>> 0x08000002
>>> Feb 18 00:58:10 xerces kernel: sdd: Current [descriptor]: sense key: Aborted
>>> Command
>>> Feb 18 00:58:10 xerces kernel:     Additional sense: No additional sense
>>> information
>>> Feb 18 00:58:10 xerces kernel: Descriptor sense data with sense descriptors
>>> (in hex):
>>> Feb 18 00:58:10 xerces kernel:         72 0b 00 00 00 00 00 0c 00 0a 80 00
>>> 00 00 00 00
>>> Feb 18 00:58:10 xerces kernel:         00 00 00 00
>>> Feb 18 00:58:10 xerces kernel: end_request: I/O error, dev sdd, sector
>>> 35666775
>>> Feb 18 00:58:10 xerces kernel: raid5: Disk failure on sdd1, disabling
>>> device. Operation continuing on 3 devices
>>>
>>> Regards,
>>> Marc
>>>
>>
>> Just out of curiosity:
>>
>> Feb 18 00:58:10 xerces kernel: end_request: I/O error, dev sdd,
>>  sector 35666775
>>
>> Can you run:
>>
>> smartctl -d ata -t short /dev/sdd
>> wait 5 min
>> smartctl -d ata -t long /dev/sdd
>> wait 2-3 hr
>> smartctl -d ata -a /dev/sdd
>>
>> And then e-mail that output to the list?
>>
>> Justin.
>
> Ok here we go:
>
> /dev/sdd:
>
> smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> === START OF INFORMATION SECTION ===
> Device Model:     WDC WD1600JB-00EVA0
> Serial Number:    WD-WMAEK2751794
> Firmware Version: 15.05R15
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   6
> ATA Standard is:  Exact ATA specification draft version not indicated
> Local Time is:    Mon Feb 19 14:38:16 2007 GMT-9
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x84)	Offline data collection activity
> 					was suspended by an interrupting command from host.
> 					Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0)	The previous self-test routine completed
> 					without error or no self-test has ever
> 					been run.
> Total time to complete Offline
> data collection: 		 (5073) seconds.
> Offline data collection
> capabilities: 			 (0x79) SMART execute Offline immediate.
> 					No Auto Offline data collection support.
> 					Suspend Offline collection upon new
> 					command.
> 					Offline surface scan supported.
> 					Self-test supported.
> 					Conveyance Self-test supported.
> 					Selective Self-test supported.
> SMART capabilities:            (0x0003)	Saves SMART data before entering
> 					power-saving mode.
> 					Supports SMART auto save timer.
> Error logging capability:        (0x01)	Error logging supported.
> 					No General Purpose Logging support.
> Short self-test routine
> recommended polling time: 	 (   2) minutes.
> Extended self-test routine
> recommended polling time: 	 (  67) minutes.
> Conveyance self-test routine
> recommended polling time: 	 (   5) minutes.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0007   148   144   021    Pre-fail  Always       -       3141
>   4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       91
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
>   9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       5070
>  10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
>  11 Calibration_Retry_Count 0x0013   100   253   051    Pre-fail  Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       90
> 194 Temperature_Celsius     0x0022   116   253   000    Old_age   Always       -       34
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
> 199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0009   200   155   051    Pre-fail  Offline      -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed without error       00%       691         -
> # 2  Extended offline    Completed without error       00%       686         -
> # 3  Short offline       Completed without error       00%       685         -
> # 4  Short offline       Completed without error       00%       620         -
> # 5  Extended offline    Completed without error       00%       598         -
> # 6  Short offline       Completed without error       00%       597         -
> # 7  Short offline       Completed without error       00%       573         -
> # 8  Short offline       Completed without error       00%       549         -
> # 9  Short offline       Completed without error       00%       525         -
> #10  Short offline       Completed without error       00%       501         -
> #11  Short offline       Completed without error       00%       477         -
> #12  Short offline       Completed without error       00%       453         -
> #13  Short offline       Completed without error       00%       382         -
> #14  Short offline       Completed without error       00%       358         -
> #15  Short offline       Completed without error       00%       334         -
> #16  Short offline       Completed without error       00%       310         -
> #17  Short offline       Completed without error       00%       286         -
> #18  Extended offline    Completed without error       00%       264         -
> #19  Short offline       Completed without error       00%       263         -
> #20  Short offline       Completed without error       00%       239         -
> #21  Short offline       Completed without error       00%       215         -
>
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>    1        0        0  Not_testing
>    2        0        0  Not_testing
>    3        0        0  Not_testing
>    4        0        0  Not_testing
>    5        0        0  Not_testing
> Selective self-test flags (0x0):
>  After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
> --
> /dev/sdc:
>
> smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> === START OF INFORMATION SECTION ===
> Device Model:     WDC WD1600JB-00REA0
> Serial Number:    WD-WCANM4681863
> Firmware Version: 20.00K20
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   7
> ATA Standard is:  Exact ATA specification draft version not indicated
> Local Time is:    Mon Feb 19 14:38:11 2007 GMT-9
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x85)	Offline data collection activity
> 					was aborted by an interrupting command from host.
> 					Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0)	The previous self-test routine completed
> 					without error or no self-test has ever
> 					been run.
> Total time to complete Offline
> data collection: 		 (4980) seconds.
> Offline data collection
> capabilities: 			 (0x7b) SMART execute Offline immediate.
> 					Auto Offline data collection on/off support.
> 					Suspend Offline collection upon new
> 					command.
> 					Offline surface scan supported.
> 					Self-test supported.
> 					Conveyance Self-test supported.
> 					Selective Self-test supported.
> SMART capabilities:            (0x0003)	Saves SMART data before entering
> 					power-saving mode.
> 					Supports SMART auto save timer.
> Error logging capability:        (0x01)	Error logging supported.
> 					General Purpose Logging supported.
> Short self-test routine
> recommended polling time: 	 (   2) minutes.
> Extended self-test routine
> recommended polling time: 	 (  60) minutes.
> Conveyance self-test routine
> recommended polling time: 	 (   6) minutes.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0003   184   184   021    Pre-fail  Always       -       3775
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       19
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
>   9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       4834
>  10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
>  11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       18
> 194 Temperature_Celsius     0x0022   114   095   000    Old_age   Always       -       33
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed without error       00%      4823         -
> # 2  Extended offline    Completed without error       00%      4819         -
> # 3  Short offline       Completed without error       00%      4817         -
> # 4  Short offline       Completed without error       00%      4799         -
> # 5  Short offline       Completed without error       00%      4775         -
> # 6  Short offline       Completed without error       00%      4751         -
> # 7  Extended offline    Completed without error       00%      4728         -
> # 8  Short offline       Completed without error       00%      4727         -
> # 9  Short offline       Completed without error       00%      4703         -
> #10  Short offline       Completed without error       00%      4679         -
> #11  Short offline       Completed without error       00%      4655         -
> #12  Short offline       Completed without error       00%      4631         -
> #13  Short offline       Completed without error       00%      4607         -
> #14  Short offline       Completed without error       00%      4583         -
> #15  Short offline       Completed without error       00%      4511         -
> #16  Short offline       Completed without error       00%      4487         -
> #17  Short offline       Completed without error       00%      4463         -
> #18  Short offline       Completed without error       00%      4439         -
> #19  Short offline       Completed without error       00%      4415         -
> #20  Extended offline    Completed without error       00%      4393         -
> #21  Short offline       Completed without error       00%      4391         -
>
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>    1        0        0  Not_testing
>    2        0        0  Not_testing
>    3        0        0  Not_testing
>    4        0        0  Not_testing
>    5        0        0  Not_testing
> Selective self-test flags (0x0):
>  After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>

Strange. Your disks appear to be fine, so this sounds like an interrupt
problem to me. What does cat /proc/interrupts say? What does dmesg say?
Any errors there?
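
Something like this should show it (the egrep pattern below is just a
rough first filter, not exhaustive):

  cat /proc/interrupts               # is the SATA controller sharing an IRQ with a busy device?
  dmesg | egrep -i 'ata|irq|timeout' # any controller resets or IRQ complaints logged?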

Justin.
