* libata: dma, io error messages
@ 2004-08-06 10:29 Paul Jakma
  2004-08-06 12:32 ` Alan Cox
  0 siblings, 1 reply; 14+ messages in thread
From: Paul Jakma @ 2004-08-06 10:29 UTC (permalink / raw)
  To: Linux Kernel

Hi,

I received a mail this morning from mdadm to notify me of a RAID 
event. A partition, sda3, was kicked from a RAID5 array. The following 
error was logged in dmesg/syslog:

ata1: DMA timeout, stat 0x1
ATA: abnormal status 0xD0 on port 0xD081B087
scsi0: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 02 05 d0 06 00 00 10 00
Current sda: sense key Medium Error
Additional sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sda, sector 33935366
ATA: abnormal status 0xD0 on port 0xD081B087
Aug  6 06:03:58 hibernia last message repeated 2 times

Can anyone shed light on the precise meaning and/or implications of 
the above? In particular, does the additional sense message, "auto 
reallocate failed", mean the drive is out of spare blocks?

Also, the drive is extremely slow now, about 1MB/s drive transfer 
rate as reported by hdparm -T.

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
"Consequences, Schmonsequences, as long as I'm rich."
 		-- "Ali Baba Bunny" [1957, Chuck Jones]


* Re: libata: dma, io error messages
  2004-08-06 10:29 libata: dma, io error messages Paul Jakma
@ 2004-08-06 12:32 ` Alan Cox
  2004-08-06 13:40   ` Paul Jakma
                     ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Alan Cox @ 2004-08-06 12:32 UTC (permalink / raw)
  To: Paul Jakma; +Cc: Linux Kernel Mailing List

On Fri, 2004-08-06 at 11:29, Paul Jakma wrote:
> scsi0: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 02 05 d0 06 00 00 10 00
> Current sda: sense key Medium Error

Disk error

> Additional sense: Unrecovered read error - auto reallocate failed

Bad block, and sufficiently bad that it couldn't then recover the block
and rewrite it. When a drive sees a marginal read (lot of forward error
correction recovery needed) it will try and rewrite or move the block.

In this case it couldn't read the block enough to move it.
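
As a cross-check on the quoted log, the bytes printed after "Read (10)" can be decoded by hand. The sketch below assumes the standard READ(10) layout (flags byte, 4-byte big-endian LBA, group byte, 2-byte big-endian transfer length, control byte); the function name is made up for illustration and is not part of any kernel or library API.

```python
# Decode the bytes the kernel printed after "Read (10)" in the log line.
# Assumed layout (standard READ(10), opcode already stripped by the log):
#   byte 0     flags
#   bytes 1-4  logical block address, big-endian
#   byte 5     group number
#   bytes 6-7  transfer length in blocks, big-endian
#   byte 8     control
logged = "00 02 05 d0 06 00 00 10 00"


def decode_read10(tail_hex):
    """Return (lba, block_count) from the 9 bytes that follow the opcode."""
    raw = bytes.fromhex(tail_hex)
    if len(raw) != 9:
        raise ValueError("expected the 9 bytes that follow the opcode")
    lba = int.from_bytes(raw[1:5], "big")
    blocks = int.from_bytes(raw[6:8], "big")
    return lba, blocks


lba, blocks = decode_read10(logged)
print(lba, blocks)  # prints: 33935366 16
```

The decoded LBA equals the sector in the end_request line (33935366), so the failed request appears to be a 16-block read starting at that sector.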

> Also, the drive is extremely slow now, about 1MB/s drive transfer 
> rate as reported by hdparm -T.

Sounds like it dropped to PIO - that may be a bug triggered by the drive
failing.




* Re: libata: dma, io error messages
  2004-08-06 12:32 ` Alan Cox
@ 2004-08-06 13:40   ` Paul Jakma
  2004-08-06 13:49   ` Mark Lord
  2004-08-06 15:40   ` Jeff Garzik
  2 siblings, 0 replies; 14+ messages in thread
From: Paul Jakma @ 2004-08-06 13:40 UTC (permalink / raw)
  To: Alan Cox; +Cc: Linux Kernel Mailing List

On Fri, 6 Aug 2004, Alan Cox wrote:

>> Additional sense: Unrecovered read error - auto reallocate failed
>
> Bad block, and sufficiently bad that it couldn't then recover the block
> and rewrite it. When a drive sees a marginal read (lot of forward error
> correction recovery needed) it will try and rewrite or move the block.
>
> In this case it couldn't read the block enough to move it.

Ouch.

I've contacted my supplier to replace the disk. Kudos to Neil Brown 
for mdadm - I had an email waiting for me this morning to tell me 
about the failed disk.

Thanks Alan.

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
Windows 95 has been operating for 2 hours, 32 minutes. No errors reported. CALL
GUINESS BOOK OF WORLD RECORDS NOW!


* Re: libata: dma, io error messages
  2004-08-06 12:32 ` Alan Cox
  2004-08-06 13:40   ` Paul Jakma
@ 2004-08-06 13:49   ` Mark Lord
  2004-08-06 14:04     ` Paul Jakma
  2004-08-06 15:40   ` Jeff Garzik
  2 siblings, 1 reply; 14+ messages in thread
From: Mark Lord @ 2004-08-06 13:49 UTC (permalink / raw)
  To: Alan Cox; +Cc: Paul Jakma, Linux Kernel Mailing List

>>Also, the drive is extremely slow now, about 1MB/s drive transfer 
>>rate as reported by hdparm -T. 
> 
> Sounds like it dropped to PIO - that may be a bug triggered by the drive
> failing.

That should read "hdparm -t", not "-T", right?

And the slowness is most likely due to the error recovery
(retries) in the drive and/or driver, which cause the
overall throughput to plummet for the measurement interval.

Cheers
-- 
Mark Lord
(hdparm keeper & the original "Linux IDE Guy")


* Re: libata: dma, io error messages
  2004-08-06 13:49   ` Mark Lord
@ 2004-08-06 14:04     ` Paul Jakma
  2004-08-06 14:59       ` Mark Lord
  0 siblings, 1 reply; 14+ messages in thread
From: Paul Jakma @ 2004-08-06 14:04 UTC (permalink / raw)
  To: Mark Lord; +Cc: Alan Cox, Linux Kernel Mailing List

On Fri, 6 Aug 2004, Mark Lord wrote:

>>> Also, the drive is extremely slow now, about 1MB/s drive transfer rate 
>>> as reported by hdparm -T.

> That should read "hdparm -t", not "-T", right?

erk, sorry, yes.

> And the slowness is most likely due to the error recovery (retries) 
> in the drive and/or driver, which cause the overall throughput to 
> plummet for the measurement interval.

ah. There are no reported errors though. So presumably drive retries 
that end up being successful.

If that is so then this suggests the drive has a far more serious 
problem than just a single bad block, right?

> Cheers

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
Nothing is ever a total loss; it can always serve as a bad example.


* Re: libata: dma, io error messages
  2004-08-06 14:04     ` Paul Jakma
@ 2004-08-06 14:59       ` Mark Lord
  2004-08-06 15:15         ` Paul Jakma
  2004-08-06 15:41         ` Jeff Garzik
  0 siblings, 2 replies; 14+ messages in thread
From: Mark Lord @ 2004-08-06 14:59 UTC (permalink / raw)
  To: Paul Jakma; +Cc: Alan Cox, Linux Kernel Mailing List

 >If that is so then this suggests the drive has a far
 >more serious problem than just a single bad block, right?

The smartmontools can answer that question far more reliably
than anyone here can.

Cheers
-- 
Mark Lord
(hdparm keeper & the original "Linux IDE Guy")


* Re: libata: dma, io error messages
  2004-08-06 14:59       ` Mark Lord
@ 2004-08-06 15:15         ` Paul Jakma
  2004-08-06 15:41         ` Jeff Garzik
  1 sibling, 0 replies; 14+ messages in thread
From: Paul Jakma @ 2004-08-06 15:15 UTC (permalink / raw)
  To: Mark Lord; +Cc: Alan Cox, Linux Kernel Mailing List

On Fri, 6 Aug 2004, Mark Lord wrote:

> The smartmontools can answer that question far more reliably
> than anyone here can.

IIRC, libata doesn't yet implement the raw IDE access (taskfile or 
somesuch?) needed to allow smartctl to work:

# smartctl -a /dev/sda
smartctl version 5.21 Copyright (C) 2002-3 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device:   Version:
Serial number: WD-WMAES2172674
Device type: disk
Local Time is: Fri Aug  6 16:14:29 2004 IST
Device does not support SMART

Device does not support Error Counter logging
Device does not support Self Test logging

:(

> Cheers

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
The control of the production of wealth is the control of human life itself.
 		-- Hilaire Belloc


* Re: libata: dma, io error messages
  2004-08-06 12:32 ` Alan Cox
  2004-08-06 13:40   ` Paul Jakma
  2004-08-06 13:49   ` Mark Lord
@ 2004-08-06 15:40   ` Jeff Garzik
  2004-08-06 16:47     ` Paul Jakma
                       ` (2 more replies)
  2 siblings, 3 replies; 14+ messages in thread
From: Jeff Garzik @ 2004-08-06 15:40 UTC (permalink / raw)
  To: Alan Cox; +Cc: Paul Jakma, Linux Kernel Mailing List

Alan Cox wrote:
> On Fri, 2004-08-06 at 11:29, Paul Jakma wrote:
> 
>>scsi0: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 02 05 d0 06 00 00 10 00
>>Current sda: sense key Medium Error
> 
> 
> Disk error
> 
> 
>>Additional sense: Unrecovered read error - auto reallocate failed
> 
> 
> Bad block, and sufficiently bad that it couldn't then recover the block
> and rewrite it. When a drive sees a marginal read (lot of forward error
> correction recovery needed) it will try and rewrite or move the block.
> 
> In this case it couldn't read the block enough to move it.


Unfortunately, in libata all it means at this point is
(a) the error occurred on a read command, and
(b) we did not attempt to retry the command

So it is a generic, non-specific message that you currently see on all errors.

libata does not (yet) retry cable errors, for example.  Paul, don't 
automatically assume the disk is bad, try swapping cables.
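
For what it is worth, the "abnormal status 0xD0" value in the original log can be decoded too. The sketch below assumes the value is the ATA status register and uses the classic bit names; the table and function are illustrative only and come from the ATA register layout, not from this thread or from libata's source.

```python
# Classic ATA status register bit names (an assumption supplied for
# illustration; the thread itself does not spell these out).
ATA_STATUS_BITS = {
    0x80: "BSY",   # device busy
    0x40: "DRDY",  # device ready
    0x20: "DF",    # device fault
    0x10: "DSC",   # seek complete (legacy meaning)
    0x08: "DRQ",   # data request
    0x04: "CORR",  # corrected data (legacy meaning)
    0x02: "IDX",   # index (legacy meaning)
    0x01: "ERR",   # error
}


def decode_status(value):
    """Return the names of the bits set in an ATA status byte, high bit first."""
    return [name for bit, name in ATA_STATUS_BITS.items() if value & bit]


print(decode_status(0xD0))  # prints: ['BSY', 'DRDY', 'DSC']
```

BSY is set and ERR is not, and while BSY is set the remaining bits are not considered valid. Under that reading the status byte only says the device had not finished the command; it does not by itself separate a media problem from a cable problem.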

	Jeff




* Re: libata: dma, io error messages
  2004-08-06 14:59       ` Mark Lord
  2004-08-06 15:15         ` Paul Jakma
@ 2004-08-06 15:41         ` Jeff Garzik
  1 sibling, 0 replies; 14+ messages in thread
From: Jeff Garzik @ 2004-08-06 15:41 UTC (permalink / raw)
  To: Mark Lord; +Cc: Paul Jakma, Alan Cox, Linux Kernel Mailing List

Mark Lord wrote:
>  >If that is so then this suggests the drive has a far
>  >more serious problem than just a single bad block, right?
> 
> The smartmontools can answer that question far more reliably
> than anyone here can.


if libata supported SMART, which it doesn't yet.

So much to do, so little time...  :/

	Jeff




* Re: libata: dma, io error messages
  2004-08-06 15:40   ` Jeff Garzik
@ 2004-08-06 16:47     ` Paul Jakma
  2004-08-06 17:24       ` Paul Jakma
  2004-08-09  4:48     ` Eric D. Mudama
  2004-08-17  0:48     ` Paul Jakma
  2 siblings, 1 reply; 14+ messages in thread
From: Paul Jakma @ 2004-08-06 16:47 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Alan Cox, Linux Kernel Mailing List

On Fri, 6 Aug 2004, Jeff Garzik wrote:

> libata does not (yet) retry cable errors, for example.  Paul, don't 
> automatically assume the disk is bad, try swapping cables.

Hmmm, I'll see if that's possible, though:

- My spare cables are the exact same brand
- I don't know how to 'cycle' the drive

(IIRC, libata doesn't yet support hotplug and/or the old "reset the 
bus by doing echo 'scsi remove-single-device' and then 
'add-single-device'" trick)

i.e., is it possible to avoid a reboot?

also, the drive has been running fine since late may with this cable, 
and two other identical cables/drives are still happily running. 
Also, before putting the drives into use, I did some load-testing 
(bonnie++) which they survived quite happily.

> 	Jeff

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
Bore, n.:
 	A person who talks when you wish him to listen.
 		-- Ambrose Bierce, "The Devil's Dictionary"


* Re: libata: dma, io error messages
  2004-08-06 16:47     ` Paul Jakma
@ 2004-08-06 17:24       ` Paul Jakma
  0 siblings, 0 replies; 14+ messages in thread
From: Paul Jakma @ 2004-08-06 17:24 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Alan Cox, Linux Kernel Mailing List

On Fri, 6 Aug 2004, Paul Jakma wrote:

> i.e., is it possible to avoid a reboot?

Hmmm, Jeff, the sector number in the error message is correct right:

end_request: I/O error, dev sda, sector 3393536

I can test that sector, destructively if need be (it's been kicked 
from the RAID array) - which is likely to work if the drive still has 
spare blocks and it's genuinely a media error, since it can remap on write.

If I get the same error trying to read that sector again and again, it's a 
drive problem, right? The following command, run a few times, should 
trigger the same errors if it's the drive, right?

 	dd seek=3393535 bs=1b count=2 if=/dev/sda3 of=/dev/null

ah, just tried it, it doesn't.. hmm.
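
Three details may explain why that dd stayed quiet. dd's seek= positions the output file, whereas skip= positions the input. The sector logged at the top of the thread is 33935366, one digit longer than the number used here. And the end_request line names "dev sda", so the sector is presumably counted from the start of the whole disk rather than from sda3. A minimal sketch of a read at the intended location follows; the helper names and the 512-byte sector size are assumptions made for illustration.

```python
import os

SECTOR_SIZE = 512  # assumed logical sector size


def read_sectors(path, lba, count, sector_size=SECTOR_SIZE):
    """Read `count` sectors starting at sector `lba` of the file or device `path`."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # pread takes a byte offset, so the sector number is scaled up here.
        return os.pread(fd, count * sector_size, lba * sector_size)
    finally:
        os.close(fd)


def partition_relative(disk_lba, part_start_lba):
    """Convert a whole-disk sector number to one relative to a partition start."""
    if disk_lba < part_start_lba:
        raise ValueError("sector lies before the partition start")
    return disk_lba - part_start_lba
```

Run as root, something like read_sectors("/dev/sda", 33935366, 16) would re-issue roughly the read that failed; a media error would normally surface as an OSError. The read may also be satisfied from the page cache instead of the disk, so a clean result is weak evidence at best. The start sector of sda3 is not given in the thread, so partition_relative() needs that value from the partition table.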

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
My doctor told me to stop having intimate dinners for four.  Unless there
are three other people.
 		-- Orson Welles


* Re: libata: dma, io error messages
  2004-08-06 15:40   ` Jeff Garzik
  2004-08-06 16:47     ` Paul Jakma
@ 2004-08-09  4:48     ` Eric D. Mudama
  2004-08-09  4:56       ` Joel Jaeggli
  2004-08-17  0:48     ` Paul Jakma
  2 siblings, 1 reply; 14+ messages in thread
From: Eric D. Mudama @ 2004-08-09  4:48 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Alan Cox, Paul Jakma, Linux Kernel Mailing List

On Fri, Aug  6 at 11:40, Jeff Garzik wrote:
>libata does not (yet) retry cable errors, for example.  Paul, don't 
>automatically assume the disk is bad, try swapping cables.

In practice with large #s of drives, cable errors are a given, even
with good cables... eventually *something* is going to glitch, we see
it all the time in our testing, especially at elevated temperatures.

--eric


-- 
Eric D. Mudama
edmudama@mail.bounceswoosh.org



* Re: libata: dma, io error messages
  2004-08-09  4:48     ` Eric D. Mudama
@ 2004-08-09  4:56       ` Joel Jaeggli
  0 siblings, 0 replies; 14+ messages in thread
From: Joel Jaeggli @ 2004-08-09  4:56 UTC (permalink / raw)
  To: Eric D. Mudama
  Cc: Jeff Garzik, Alan Cox, Paul Jakma, Linux Kernel Mailing List

On Sun, 8 Aug 2004, Eric D. Mudama wrote:

> On Fri, Aug  6 at 11:40, Jeff Garzik wrote:
>> libata does not (yet) retry cable errors, for example.  Paul, don't 
>> automatically assume the disk is bad, try swapping cables.
>
> In practice with large #s of drives, cable errors are a given, even
> with good cables... eventually *something* is going to glitch, we see
> it all the time in our testing, especially at elevated temperatures.

I can corroborate this using different drivers (3ware in this case). With 
12 drives and 4 backplanes per chassis, most of our failures have been due 
to disks erroring out on the 3ware rather than actual failed drives. We 
pull the disks, test them and return them to the chassis as spares if 
they're functional. It doesn't happen that often, but we have quite a few 
of these things now (~10TB), so it does happen.

> --eric
>
>
>

-- 
-------------------------------------------------------------------------- 
Joel Jaeggli  	       Unix Consulting 	       joelja@darkwing.uoregon.edu 
GPG Key Fingerprint:     5C6E 0104 BAF0 40B0 5BD3 C38B F000 35AB B67F 56B2



* Re: libata: dma, io error messages
  2004-08-06 15:40   ` Jeff Garzik
  2004-08-06 16:47     ` Paul Jakma
  2004-08-09  4:48     ` Eric D. Mudama
@ 2004-08-17  0:48     ` Paul Jakma
  2 siblings, 0 replies; 14+ messages in thread
From: Paul Jakma @ 2004-08-17  0:48 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Alan Cox, Linux Kernel Mailing List

Hi Jeff,

On Fri, 6 Aug 2004, Jeff Garzik wrote:

> libata does not (yet) retry cable errors, for example.  Paul, don't 
> automatically assume the disk is bad, try swapping cables.

FWIW:

- running dd in a while loop reading 1024k of data at a time from 
about 1000 odd blocks before/after the block number mentioned in the 
error messages did not cause any further errors

however:

- when I tried to add the disk (well partition, spanning from 2GB to 
end of disk at 160GB) back into its RAID-5 array, it errored out 
every time when resyncing the array, but only when reaching somewhere 
close to the end of the resync.

(unfortunately, I couldn't get the error messages; it caused all 
further IO to the RAID-5 to die too and effectively hung the box - I 
think due to the MD layer not being at all happy with a disk failing 
while resyncing. Indeed, MD was not happy with the state of play after 
reboot and wouldn't bring the array back.[1])

- Rogier Wolff reported to me in private mail that in his experience, 
WD disks going 'slow' is a WD quirk and an imminent sign of failure 
of WD disks.

And indeed, the disk did die completely last Friday.. it's now been 
replaced.
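
The dd loop mentioned above could be sketched roughly as follows: read a byte range in fixed-size chunks and note which chunk offsets fail with an I/O error. The function name and the defaults are hypothetical, and this is not the script that was actually run.

```python
import errno
import os


def scan_range(path, start_byte, length, chunk=1024 * 1024):
    """Read [start_byte, start_byte + length) in chunks.

    Returns the byte offsets of chunks whose read failed with EIO.
    """
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = start_byte
        end = start_byte + length
        while offset < end:
            want = min(chunk, end - offset)
            try:
                data = os.pread(fd, want, offset)
            except OSError as exc:
                if exc.errno != errno.EIO:
                    raise  # anything other than an I/O error is unexpected
                bad.append(offset)
            else:
                if not data:
                    break  # reached the end of the device or file
            offset += want
    finally:
        os.close(fd)
    return bad
```

An empty result over the neighbourhood of one sector, as seen here, says little about the rest of the disk; the resync reads the whole partition, which would fit with it failing only near the end.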

How much would it take to get SMART reporting working with libata 
btw? I'd be very interested in this and helping out if possible.

> 	Jeff

1. NB: It would be nice if the Linux RAID superblock had per-drive 
UUIDs in addition to the global array UUID. I hotadded the disk to 
the array without first hotremoving it, and MD was rather confused by 
the presence of '/dev/sda' twice, as a failed disk and as a spare. 
Also, the failure of /dev/sda caused /dev/sdb and /dev/sdc to be 
renamed, after reboot, to /dev/sda and /dev/sdb, which further 
confused matters. A per-component UUID might help MD discriminate and 
prevent addition of disks which are already in the array, and also help it 
recognise that the failed /dev/sda3 in the superblock is not the same 
as the current /dev/sda3 existing on the system (renamed due to 
discovery after the disk failure).
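
To make the idea concrete, here is a toy sketch of membership tracked by a per-component UUID instead of by device name. It is purely hypothetical: the class, its fields and the UUID strings are invented for illustration and do not reflect the MD superblock format or any existing md/mdadm interface.

```python
class ArrayMembers:
    """Toy membership table keyed by a per-component UUID, not a device name."""

    def __init__(self):
        # component UUID -> {"name": current device name, "state": str}
        self._by_uuid = {}

    def add(self, component_uuid, dev_name, state="spare"):
        """Add a component, refusing one whose UUID is already known."""
        known = self._by_uuid.get(component_uuid)
        if known is not None:
            raise ValueError(
                f"{dev_name} is already a member "
                f"(last seen as {known['name']}, state {known['state']})"
            )
        self._by_uuid[component_uuid] = {"name": dev_name, "state": state}

    def mark_failed(self, component_uuid):
        self._by_uuid[component_uuid]["state"] = "failed"

    def rename(self, component_uuid, new_name):
        """Record that a known component reappeared under a new device name."""
        self._by_uuid[component_uuid]["name"] = new_name

    def state_of(self, component_uuid):
        return self._by_uuid[component_uuid]["state"]
```

With such a key, hot-adding the failed disk a second time would be rejected instead of showing up twice, and a healthy disk that came back renamed to /dev/sda after a reboot would still be told apart from the failed one.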

It took a lot of touching of wood, checking and rechecking of 
/etc/raidtab vs lsraid and mdadm -E before doing mkraid 
--dangerous-no-resync .... to rewrite the superblock and get the 
array running again despite my mistake and the 
not-very-friendly-to-fat-fingers properties of md.

regards,
-- 
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
If only God would give me some clear sign!  Like making a large deposit
in my name at a Swiss Bank.
- Woody Allen
