raid5 replace ignored error?

* raid5 replace ignored error?
@ 2014-02-04  9:00 Bill
  2014-02-18  3:46 ` NeilBrown
  0 siblings, 1 reply; 3+ messages in thread
From: Bill @ 2014-02-04  9:00 UTC (permalink / raw)
  To: linux-raid

Hi,

I had something weird happen during a replace in a raid5 array on kernel
3.10.28: it appears an error in writing to / communicating with the
replacement disk was ignored.

I have this array:

md3 : active raid5 sda1[0] sdd1[3] sdb1[1] sdf1[4] sdc1[2]
       3900742144 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
       bitmap: 0/233 pages [0KB], 2048KB chunk

I tried replacing sdf1 with sde1.

     [106666.129833] md: recovery of RAID array md3
     [106666.129836] md: minimum _guaranteed_  speed: 20000 KB/sec/disk.
     [106666.129837] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
     [106666.129842] md: using 128k window, over a total of 975185536k.

Half an hour later I got a flood of errors in dmesg:

     [108334.974861] ata5.00: exception Emask 0x10 SAct 0x7fffffff SErr 0x480100 action 0x6 frozen
     [108334.974864] ata5.00: irq_stat 0x08000000, interface fatal error
     [108334.974866] ata5: SError: { UnrecovData 10B8B Handshk }
     [108334.974868] ata5.00: failed command: WRITE FPDMA QUEUED
     [108334.974872] ata5.00: cmd 61/00:00:10:97:9e/04:00:15:00:00/40 tag 0 ncq 524288 out
     [108334.974872]          res 40/00:b0:10:f7:9e/00:00:15:00:00/40 Emask 0x10 (ATA bus error)
     [108334.974873] ata5.00: status: { DRDY }
     .
     .(29 more of the same message)
     .
     [108344.976877] ata5: softreset failed (1st FIS failed)
     [108344.976883] ata5: hard resetting link
     [108349.874854] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
     [108349.901025] ata5.00: configured for UDMA/133
     [108349.901055] ata5: EH complete

There were no md error messages, the recovery continued, and finished a few
hours later.

     [122443.805899] md: md3: recovery done.


Afterwards I did a QC check and found a mismatch in one file which I mapped
to the area being updated when this error was logged.
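
For readers who want to locate the affected area themselves, the failed
command line in the log above can be decoded. The following is a minimal
sketch, assuming the libata "cmd" field order (command/feature:nsect:lbal:
lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and that,
for NCQ commands such as WRITE FPDMA QUEUED, the sector count is carried in
the feature fields. The function name decode_ncq_taskfile is hypothetical and
not part of any tool.

```python
def decode_ncq_taskfile(cmd):
    """Decode starting LBA and sector count from a libata 'cmd' string.

    Assumes the 12-byte layout described in the lead-in and an NCQ
    command, where the count lives in hob_feature:feature.
    """
    fields = [int(f, 16) for f in cmd.replace("/", ":").split(":")]
    if len(fields) != 12:
        raise ValueError("expected 12 taskfile bytes, got %d" % len(fields))
    (command, feature, nsect, lbal, lbam, lbah,
     hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah, device) = fields

    # 48-bit LBA: the "hob" (high order byte) fields are the upper 24 bits.
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)

    # NCQ sector count; a value of 0 means 65536 sectors.
    sectors = (hob_feature << 8) | feature
    if sectors == 0:
        sectors = 65536
    return lba, sectors


lba, sectors = decode_ncq_taskfile("61/00:00:10:97:9e/04:00:15:00:00/40")
print(lba, sectors, sectors * 512)  # 362714896 1024 524288
```

The byte count of 524288 agrees with the "ncq 524288 out" shown in the log.
Note that the LBA is relative to the whole disk, so the partition start (and
md's data offset) would still have to be subtracted before mapping it to an
array or file offset.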

What should happen in this case?
Should the "replace" have failed or is there something else going on here?

Thanks,
Bill




* Re: raid5 replace ignored error?
  2014-02-04  9:00 raid5 replace ignored error? Bill
@ 2014-02-18  3:46 ` NeilBrown
  2014-02-18 19:16   ` Bill
  0 siblings, 1 reply; 3+ messages in thread
From: NeilBrown @ 2014-02-18  3:46 UTC (permalink / raw)
  To: Bill; +Cc: linux-raid


On Tue, 04 Feb 2014 03:00:50 -0600 Bill <billstuff2001@sbcglobal.net> wrote:

> What should happen in this case?
> Should the "replace" have failed or is there something else going on here?

Hi Bill,
 sorry for the delay.

Were there any messages like:
   end_request: I/O error, dev sde, sector NNNNNNNN

??
If not, then the error never got up to md - the driver thinks that it managed
to recover.
If so, then md really should have marked the replacement as faulty - or
possibly recorded a bad block if the device has a bad-block log on it (mdadm
-E would tell you).

If the write actually failed, but md wasn't told, then that is a problem in
the driver or device.
If md was told, then it certainly would be a bug in md.
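
The check described above can be sketched as a small log scan. This is only
an illustration, assuming the block-layer message has exactly the wording
quoted above and that libata failures are logged as "failed command:" lines;
the names summarize and verdict are hypothetical helpers, not part of mdadm
or the kernel.

```python
import re

# Wording of the block-layer message as quoted in this thread.
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")
# libata failed-command lines, e.g. "ata5.00: failed command: WRITE FPDMA QUEUED".
ATA_FAILED = re.compile(r"ata\d+(?:\.\d+)?: failed command: ")


def summarize(log_lines, dev):
    """Count link-level failures and collect I/O error sectors for one device."""
    io_error_sectors = []
    ata_failed_commands = 0
    for line in log_lines:
        match = IO_ERROR.search(line)
        if match:
            if match.group(1) == dev:
                io_error_sectors.append(int(match.group(2)))
        elif ATA_FAILED.search(line):
            ata_failed_commands += 1
    return {
        "io_error_sectors": io_error_sectors,
        "ata_failed_commands": ata_failed_commands,
    }


def verdict(report):
    """Turn the counts into the two cases distinguished above."""
    if report["io_error_sectors"]:
        return "I/O error reached the block layer: md should have reacted"
    if report["ata_failed_commands"]:
        return "link-level errors only: no I/O error was logged for md to act on"
    return "no errors found"


sample = [
    "[108334.974868] ata5.00: failed command: WRITE FPDMA QUEUED",
    "[108344.976883] ata5: hard resetting link",
    "[108349.901055] ata5: EH complete",
]
print(verdict(summarize(sample, "sde")))
```

In practice the lines would come from the saved dmesg output rather than an
inline list.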

NeilBrown



* Re: raid5 replace ignored error?
  2014-02-18  3:46 ` NeilBrown
@ 2014-02-18 19:16   ` Bill
  0 siblings, 0 replies; 3+ messages in thread
From: Bill @ 2014-02-18 19:16 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 02/17/2014 09:46 PM, NeilBrown wrote:
> Hi Bill,
>   sorry for the delay.
>
> Were there any messages like:
>     end_request: I/O error, dev sde, sector NNNNNNNN
>
> ??
> If not, then the error never got up to md - the driver thinks that it managed
> to recover.
> If so, then md really should have marked the replacement as faulty - or
> possibly recorded a bad block if the device has a bad-block log on it (mdadm
> -E would tell you).
>
> If the write actually failed, but md wasn't told, then that is a problem in
> the driver or device.
> If md was told, then it certainly would be a bug in md.
>

Thanks for the help, Neil

There were no "I/O error" messages, and the drive is healthy according to
SMART data.

I later found that these errors came after I hot-plugged a drive into a
different SATA card. During the hotplug for the new disk, the controller for
sde hard-reset, then I got the flood of errors above, then it hard-reset
again 10 seconds later and things worked ok from there.

When I dug into the file which got corrupted, I found blocks of zero bytes,
which implies the write didn't happen, because I had zeroed the drive before
doing the replace. So it seems like something failed and md didn't hear
about it.
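
A check for zero-filled blocks like the one described above might look like
the following. This is a minimal sketch, assuming a 4096-byte block size and
data small enough to hold in memory; zero_runs is a hypothetical helper name,
and the sample data is made up for illustration.

```python
def zero_runs(data, block_size=4096):
    """Return (offset, length) for each run of all-zero blocks in data."""
    runs = []
    run_start = None
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        if block.count(0) == len(block):
            # Block is entirely zero bytes: start or extend a run.
            if run_start is None:
                run_start = offset
        elif run_start is not None:
            runs.append((run_start, offset - run_start))
            run_start = None
    if run_start is not None:
        # Run extends to the end of the data.
        runs.append((run_start, len(data) - run_start))
    return runs


# Made-up sample: one good block, two zeroed blocks, one good block.
sample_data = b"A" * 4096 + bytes(8192) + b"B" * 4096
print(zero_runs(sample_data))  # [(4096, 8192)]
```

For a real file the data would be read in chunks; zero blocks are only
suspicious when the file is not expected to contain them.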

This happened while I was breaking in a new system, and I've since tightened
up some loose screws and loose cables, and things are much more stable now.

Thanks again,
Bill











