2.4.18-rc3aa3: dma_intr: status=0x51 errors

public inbox for linux-kernel@vger.kernel.org

* 2.4.18-rc3aa3: dma_intr: status=0x51 errors
@ 2002-08-18 19:10 Shane
  2002-08-18 19:28 ` Alan Cox
  2002-08-18 20:05 ` Andre Hedrick
  0 siblings, 2 replies; 6+ messages in thread
From: Shane @ 2002-08-18 19:10 UTC (permalink / raw)
  To: linux-kernel

Hello,

I just tried running Cerberus for 15-20s and I got these errors in the
logs. I do use the nasty binary drivers but I replicated the errors from
a fresh boot without them ever being loaded. Can someone tell me what
these errors mean? And are they dangerous? Are there some docs on these
error codes such that I could translate them myself without having to
bother you guys?

The motherboard is an MSI KT133A
I use LVM on that drive and ext3
The controller the drive is on is a Promise Ultra 133 TX2
The drive is:

/dev/hdg:

 Model=MAXTOR 6L080J4, FwRev=A93.0500, SerialNo=664133005196
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=32256, SectSize=21298, ECCbytes=4
 BuffType=DualPortCache, BuffSize=1819kB, MaxMultSect=16, MultSect=off
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=156355584
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes: pio0 pio1 pio2 pio3 pio4 
 DMA modes: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
 AdvancedPM=no WriteCache=enabled
 Drive Supports : ATA/ATAPI-5 T13 1321D revision 1 : ATA-1 ATA-2 ATA-3
ATA-4 ATA-5 

Aug 18 14:49:58 mars kernel: invalidate: busy buffer
Aug 18 14:49:58 mars last message repeated 21 times
Aug 18 14:50:01 mars CROND[1863]: (root) CMD (/usr/lib/sa/sa1 1 1)
Aug 18 14:50:50 mars kernel: hdg: dma_intr: status=0x51 { DriveReady
SeekComplete Error }
Aug 18 14:50:50 mars kernel: hdg: dma_intr: error=0x40 {
UncorrectableError }, LBAsect=61193, sector=61192
Aug 18 14:50:50 mars kernel: end_request: I/O error, dev 22:00 (hdg),
sector 61192
Aug 18 14:50:55 mars kernel: hdg: dma_intr: status=0x51 { DriveReady
SeekComplete Error }
Aug 18 14:50:55 mars kernel: hdg: dma_intr: error=0x40 {
UncorrectableError }, LBAsect=61195, sector=61194
Aug 18 14:50:55 mars kernel: end_request: I/O error, dev 22:00 (hdg),
sector 61194
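
On translating the codes by hand: the names in braces are the driver's labels for individual bits of the ATA status and error registers. A minimal sketch of that decoding follows. The bit positions are the standard ATA ones. The names follow the driver's log wording where the log above shows it, and the rest are assumed spellings for illustration. The decode() helper is hypothetical, not a kernel function.

```python
# ATA status register bits, highest bit first.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

# ATA error register bits; only meaningful when the Error status bit is set.
ERROR_BITS = [
    (0x80, "BadSector"),  # read as an interface CRC error on UDMA transfers
    (0x40, "UncorrectableError"),
    (0x20, "MediaChanged"),
    (0x10, "SectorIdNotFound"),
    (0x08, "MediaChangeRequested"),
    (0x04, "DriveStatusError"),
    (0x02, "TrackZeroNotFound"),
    (0x01, "AddrMarkNotFound"),
]


def decode(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]


print(decode(0x51, STATUS_BITS))  # ['DriveReady', 'SeekComplete', 'Error']
print(decode(0x40, ERROR_BITS))   # ['UncorrectableError']
```

Read this way, status=0x51 only says that the drive is ready and that the command ended in an error; error=0x40 is the part that identifies an uncorrectable data error on a read.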

I also ran badblocks -v -s -n -b 4096 -c 128 /dev/hdg1 65000 55000 and
it found nothing.

More info:

00:00.0 Host bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133]
(rev 03)
00:01.0 PCI bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133 AGP]
00:07.0 ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super South]
(rev 40)
00:07.1 IDE interface: VIA Technologies, Inc. Bus Master IDE (rev 06)
00:07.2 USB Controller: VIA Technologies, Inc. UHCI USB (rev 16)
00:07.3 USB Controller: VIA Technologies, Inc. UHCI USB (rev 16)
00:07.4 Host bridge: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI]
(rev 40)
00:0a.0 Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink]
(rev 74)
00:0c.0 Multimedia video controller: Brooktree Corporation Bt848 TV with
DMA push (rev 12)
00:0d.0 Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev
05)
00:0d.1 Input device controller: Creative Labs SB Live! (rev 05)
00:0e.0 Unknown mass storage controller: Promise Technology, Inc.:
Unknown device 4d69 (rev 02)
01:00.0 VGA compatible controller: nVidia Corporation NV11 (GeForce2 MX)
(rev a1)

Regards,

Shane


* Re: 2.4.18-rc3aa3: dma_intr: status=0x51 errors
  2002-08-18 19:10 2.4.18-rc3aa3: dma_intr: status=0x51 errors Shane
@ 2002-08-18 19:28 ` Alan Cox
  2002-08-18 20:05 ` Andre Hedrick
  1 sibling, 0 replies; 6+ messages in thread
From: Alan Cox @ 2002-08-18 19:28 UTC (permalink / raw)
  To: Shane; +Cc: linux-kernel

On Sun, 2002-08-18 at 20:10, Shane wrote:
> I just tried running Cerberus for 15-20s and I got these errors in the
> logs. I do use the nasty binary drivers but I replicated the errors from
> a fresh boot without them ever being loaded. Can someone tell me what
> these errors mean? And are they dangerous? Are there some docs on these
> error codes such that I could translate them myself without having to
> bother you guys?


> Aug 18 14:50:50 mars kernel: hdg: dma_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Aug 18 14:50:50 mars kernel: hdg: dma_intr: error=0x40 {
> UncorrectableError }, LBAsect=61193, sector=61192
> Aug 18 14:50:50 mars kernel: end_request: I/O error, dev 22:00 (hdg),
> sector 61192

That's the drive logging a bad block on logical sector 61192 (be careful
with the 512-byte/1K conversions here when using badblocks).
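
The conversion can be sketched as below. This is only an illustration under stated assumptions: the kernel reports 512-byte sectors, and because the log names hdg rather than hdg1 the sector is assumed to count from the start of the disk. The partition start of 63 sectors is a hypothetical value (the real one has to be read from the partition table), and both function names are made up for this example.

```python
SECTOR_SIZE = 512  # the kernel log counts 512-byte sectors


def sector_to_block(disk_sector, block_size, partition_start=0):
    """Map a whole-disk sector to a badblocks block index within a partition."""
    relative = disk_sector - partition_start
    if relative < 0:
        raise ValueError("sector lies before the start of the partition")
    return relative * SECTOR_SIZE // block_size


def block_range_to_sectors(first_block, last_block, block_size, partition_start=0):
    """Return the first and last whole-disk sector covered by a block range."""
    per_block = block_size // SECTOR_SIZE
    first = partition_start + first_block * per_block
    last = partition_start + (last_block + 1) * per_block - 1
    return first, last


# Sector 61192 expressed in 4096-byte blocks, with and without an offset.
print(sector_to_block(61192, 4096))      # 7649
print(sector_to_block(61192, 4096, 63))  # 7641 (hypothetical partition start)

# Sectors covered by blocks 55000..65000 at -b 4096, ignoring any offset.
print(block_range_to_sectors(55000, 65000, 4096))  # (440000, 520007)
```

With -b 4096, blocks 55000 to 65000 cover sectors 440000 to 520007, well past sector 61192, which would explain why that run found nothing.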




* Re: 2.4.18-rc3aa3: dma_intr: status=0x51 errors
  2002-08-18 19:10 2.4.18-rc3aa3: dma_intr: status=0x51 errors Shane
  2002-08-18 19:28 ` Alan Cox
@ 2002-08-18 20:05 ` Andre Hedrick
  2002-08-20 18:50   ` Gunther Mayer
  1 sibling, 1 reply; 6+ messages in thread
From: Andre Hedrick @ 2002-08-18 20:05 UTC (permalink / raw)
  To: Shane; +Cc: linux-kernel


Because it is a hardware error.
Your drive is attempting to reallocate sectors and is failing.

On 18 Aug 2002, Shane wrote:

> Hello,
> 
> I just tried running Cerberus for 15-20s and I got these errors in the
> logs. I do use the nasty binary drivers but I replicated the errors from
> a fresh boot without them ever being loaded. Can someone tell me what
> these errors mean? And are they dangerous? Are there some docs on these
> error codes such that I could translate them myself without having to
> bother you guys?
> 
> The motherboard is an MSI KT133A
> I use LVM on that drive and ext3
> The controller the drive is on is a Promise Ultra 133 TX2
> The drive is:
> 
> /dev/hdg:
> 
>  Model=MAXTOR 6L080J4, FwRev=A93.0500, SerialNo=664133005196
>  Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
>  RawCHS=16383/16/63, TrkSize=32256, SectSize=21298, ECCbytes=4
>  BuffType=DualPortCache, BuffSize=1819kB, MaxMultSect=16, MultSect=off
>  CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=156355584
>  IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
>  PIO modes: pio0 pio1 pio2 pio3 pio4 
>  DMA modes: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
>  AdvancedPM=no WriteCache=enabled
>  Drive Supports : ATA/ATAPI-5 T13 1321D revision 1 : ATA-1 ATA-2 ATA-3
> ATA-4 ATA-5 
> 
> 
> Aug 18 14:49:58 mars kernel: invalidate: busy buffer
> Aug 18 14:49:58 mars last message repeated 21 times
> Aug 18 14:50:01 mars CROND[1863]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Aug 18 14:50:50 mars kernel: hdg: dma_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Aug 18 14:50:50 mars kernel: hdg: dma_intr: error=0x40 {
> UncorrectableError }, LBAsect=61193, sector=61192
> Aug 18 14:50:50 mars kernel: end_request: I/O error, dev 22:00 (hdg),
> sector 61192
> Aug 18 14:50:55 mars kernel: hdg: dma_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Aug 18 14:50:55 mars kernel: hdg: dma_intr: error=0x40 {
> UncorrectableError }, LBAsect=61195, sector=61194
> Aug 18 14:50:55 mars kernel: end_request: I/O error, dev 22:00 (hdg),
> sector 61194
> 
> I also ran badblocks -v -s -n -b 4096 -c 128 /dev/hdg1 65000 55000 and
> it found nothing.
> 
> More info:
> 
> 00:00.0 Host bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133]
> (rev 03)
> 00:01.0 PCI bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133 AGP]
> 00:07.0 ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super South]
> (rev 40)
> 00:07.1 IDE interface: VIA Technologies, Inc. Bus Master IDE (rev 06)
> 00:07.2 USB Controller: VIA Technologies, Inc. UHCI USB (rev 16)
> 00:07.3 USB Controller: VIA Technologies, Inc. UHCI USB (rev 16)
> 00:07.4 Host bridge: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI]
> (rev 40)
> 00:0a.0 Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink]
> (rev 74)
> 00:0c.0 Multimedia video controller: Brooktree Corporation Bt848 TV with
> DMA push (rev 12)
> 00:0d.0 Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev
> 05)
> 00:0d.1 Input device controller: Creative Labs SB Live! (rev 05)
> 00:0e.0 Unknown mass storage controller: Promise Technology, Inc.:
> Unknown device 4d69 (rev 02)
> 01:00.0 VGA compatible controller: nVidia Corporation NV11 (GeForce2 MX)
> (rev a1)
> 
> Regards,
> 
> Shane
> 

Andre Hedrick
LAD Storage Consulting Group



* Re: 2.4.18-rc3aa3: dma_intr: status=0x51 errors
@ 2002-08-18 22:42 Shane
  0 siblings, 0 replies; 6+ messages in thread
From: Shane @ 2002-08-18 22:42 UTC (permalink / raw)
  To: linux-kernel

Thanks for the answers. Also, it is 2.4.19-rc3aa3, not what's in the
subject.

The man page for badblocks encourages me to use e2fsck -c to run
badblocks and not to run it directly. That, in addition to the hint
from Alan that I was testing the incorrect range with badblocks, led me
to run e2fsck -c -c /dev/vg01/biglv. That led to:

kernel: lvm -- lvm_blk_ioctl: unknown command 0x24b
last message repeated 1443 times
message repeated 1422 times

Then I thought, e2fsck works on filesystems and badblocks works on
partitions so maybe this is not a good idea after all? What are the
correct numbers to feed to badblocks to get it to test that portion of
the disk?

Then some of these popped up a few minutes later:

smartd: Device: /dev/hda, S.M.A.R.T. Attribute: 231 Changed 11 
smartd: Device: /dev/hde, S.M.A.R.T. Attribute: 231 Changed 8 
smartd: Device: /dev/hdg, S.M.A.R.T. Attribute: 7 Changed -53 

This reminded me that I had previously enabled SMART on this
box. I don't know much about SMART and I can't seem to find much about
which of the errors below are truly fatal and which are normal. I did a
short self test too.

I see the raw read error rate is 0 but it failed the self test in the 
read element!?
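
The self-test result is a separate field from the attribute table, which is one reason the two can disagree. The smartctl output further down prints it as ( 112). A small sketch of how that byte splits up, assuming the usual ATA layout (upper four bits are the result code, lower four bits are the remaining part of the test in tens of percent); the table wording and the function name are illustrative only.

```python
# Result codes from the upper nibble of the self-test execution status byte.
SELF_TEST_RESULTS = {
    0: "completed without error, or never run",
    1: "aborted by the host",
    2: "interrupted by a host reset",
    3: "fatal or unknown error, test could not complete",
    4: "completed, unknown element failed",
    5: "completed, electrical element failed",
    6: "completed, servo/seek element failed",
    7: "completed, read element failed",
    15: "self-test in progress",
}


def decode_self_test(status):
    """Split the status byte into (result text, percent of test remaining)."""
    code = (status >> 4) & 0x0F
    remaining = (status & 0x0F) * 10
    return SELF_TEST_RESULTS.get(code, "reserved or vendor specific"), remaining


print(decode_self_test(112))  # ('completed, read element failed', 0)
```

112 is 0x70, so the result code is 7 with nothing remaining, which matches the text smartctl prints next to it.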

Is 11964485 a large number for Hardware ECC Recovered?

The drive is totally pooched I guess? Any light you could shed on which
of the below numbers are the tell-tale signs that the drive is dying
would be appreciated.

# smartctl -a /dev/hdg
Device: MAXTOR 6L080J4  Supports ATA Version 5
Drive supports S.M.A.R.T. and is enabled
Check S.M.A.R.T. Passed.

General Smart Values: 
Off-line data collection status: (0x00) Offline data collection activity
                                        was never started
Self-test execution status:      ( 112) The previous self-test completed
					having failed
                                        the read element of the test
Total time to complete off-line 
data collection:                 (  34) Seconds
Offline data collection 
Capabilities:                    (0x1b)SMART EXECUTE OFF-LINE IMMEDIATE
                                        Automatic timer ON/OFF support
                                        Suspend Offline Collection upon
					new command
                                        Offline surface scan supported
                                        Self-test supported
Smart Capablilities:           (0x0003) Saves SMART data before entering
                                        power-saving mode
                                        Supports SMART auto save timer
Error logging capability:        (0x01) Error logging supported
Short self-test routine 
recommended polling time:        (   2) Minutes
Extended self-test routine 
recommended polling time:        (  40) Minutes

Vendor Specific SMART Attributes with Thresholds:
Revision Number: 11
Attribute                    Flag     Value Worst Threshold Raw Value
(  1)Raw Read Error Rate     0x0029   100   253   020       0
(  3)Spin Up Time            0x0027   063   063   020       4659
(  4)Start Stop Count        0x0032   100   100   008       192
(  5)Reallocated Sector Ct   0x0033   097   097   020       18
(  7)Seek Error Rate         0x000b   100   047   023       0
(  9)Power On Hours          0x0012   096   096   001       2980
( 10)Spin Retry Count        0x0026   100   100   000       0
( 11)Calibration Retry Count 0x0013   100   100   020       0
( 12)Power Cycle Count       0x0032   100   100   008       166
( 13)Read Soft Error Rate    0x000b   100   100   023       0
(194)Temperature             0x0022   082   077   042       48
(195)Hardware ECC Recovered  0x001a   100   005   000       11964485
(196)Reallocated Event Count 0x0010   100   100   020       0
(197)Current Pending Sector  0x0032   100   100   020       3
(198)Offline Uncorrectable   0x0010   100   253   000       0
(199)UDMA CRC Error Count    0x001a   197   197   000       3
SMART Error Log:
SMART Error Logging Version: 1
Error Log Data Structure Pointer: 04
ATA Error Count: 109
Non-Fatal Count: 0
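
One way to read the table above, as a rough sketch rather than a verdict on the drive: an attribute counts as failed when its normalized Value is at or below its Threshold, and the Worst column shows whether that ever happened in the past. The raw counters for reallocated and pending sectors are worth watching separately. The rows are copied from the output above (a subset only); the choice of which raw counters to watch and all names in the code are assumptions made for this example.

```python
from collections import namedtuple

Attr = namedtuple("Attr", "id name value worst threshold raw")

# Subset of the attribute table above, copied as reported by smartctl.
ROWS = [
    Attr(1, "Raw Read Error Rate", 100, 253, 20, 0),
    Attr(5, "Reallocated Sector Ct", 97, 97, 20, 18),
    Attr(7, "Seek Error Rate", 100, 47, 23, 0),
    Attr(195, "Hardware ECC Recovered", 100, 5, 0, 11964485),
    Attr(196, "Reallocated Event Count", 100, 100, 20, 0),
    Attr(197, "Current Pending Sector", 100, 100, 20, 3),
    Attr(198, "Offline Uncorrectable", 100, 253, 0, 0),
    Attr(199, "UDMA CRC Error Count", 197, 197, 0, 3),
]

# Raw counters that directly count bad or suspect sectors (assumed watch list).
MEDIA_COUNTER_IDS = (5, 197, 198)


def failing_now(rows):
    """IDs whose normalized value has reached the threshold."""
    return [a.id for a in rows if a.value <= a.threshold]


def failed_in_past(rows):
    """IDs whose worst recorded value has reached the threshold."""
    return [a.id for a in rows if a.worst <= a.threshold]


def media_counters(rows):
    """Non-zero raw counts for the sector-level attributes."""
    return {a.id: a.raw for a in rows if a.id in MEDIA_COUNTER_IDS and a.raw > 0}


print(failing_now(ROWS))     # []
print(failed_in_past(ROWS))  # []
print(media_counters(ROWS))  # {5: 18, 197: 3}
```

No attribute has crossed its threshold, which is consistent with "Check S.M.A.R.T. Passed", yet the raw counts show 18 reallocated sectors and 3 sectors still pending. The raw value of Hardware ECC Recovered is vendor specific, so this sketch does not try to judge whether 11964485 is large.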

Thanks,

Shane




* Re: 2.4.18-rc3aa3: dma_intr: status=0x51 errors
  2002-08-18 20:05 ` Andre Hedrick
@ 2002-08-20 18:50   ` Gunther Mayer
  2002-08-21  7:30     ` Andre Hedrick
  0 siblings, 1 reply; 6+ messages in thread
From: Gunther Mayer @ 2002-08-20 18:50 UTC (permalink / raw)
  Cc: linux-kernel

Andre Hedrick wrote:

>Because it is a hardware error.
>Your drive is attempting to reallocate sectors and is failing.
>
The drive cannot relocate on an "uncorrectable read error",
as this must be communicated to the user, so he can get
the data from backup.





* Re: 2.4.18-rc3aa3: dma_intr: status=0x51 errors
  2002-08-20 18:50   ` Gunther Mayer
@ 2002-08-21  7:30     ` Andre Hedrick
  0 siblings, 0 replies; 6+ messages in thread
From: Andre Hedrick @ 2002-08-21  7:30 UTC (permalink / raw)
  To: Gunther Mayer; +Cc: linux-kernel

On Tue, 20 Aug 2002, Gunther Mayer wrote:

> Andre Hedrick wrote:
> 
> >Because it is a hardware error.
> >Your drive is attempting to reallocate sectors and is failing.
> >
> The drive cannot relocate on an "uncorrectable read error",
> as this must be communicated to the user, so he can get
> the data from backup.

Gunther,

Where are we in disagreement?

me:  the error is reported because the drive failed to reallocate sector(s)
you: drive cannot relocate with this error.

Oh I have a noise maker patch for Erik Anderson, I just need to add it.

Cheers,

Andre Hedrick
LAD Storage Consulting Group



