CMD680, kernel 2.4.21, and heartache

public inbox for linux-kernel@vger.kernel.org

* CMD680, kernel 2.4.21, and heartache
@ 2003-10-03 11:23 Erik Bourget
  2003-10-03 11:59 ` John Bradford
  2003-10-04  1:57 ` jimbleferret
  0 siblings, 2 replies; 10+ messages in thread
From: Erik Bourget @ 2003-10-03 11:23 UTC (permalink / raw)
  To: linux-kernel


Hello,

I've got a Big Problem.

Day 0: 8 new NFS servers go online, they are P4-2.4GHz boxes with two each
120GB Samsung drives attached to CMD680/SiI680 IDE controllers.  They run
Debian stable on a 2.4.21 kernel, with SMP enabled though they are uniproc
boxes, running NFSv3-via-TCP and reiserfs.  CMD680/siimage support compiled
in, obviously.  Software RAID, mirroring drives.

Out of 8 boxes:  

*) One has crashed hard.  I'm about to drive to the datacenter to plug in a
   KVM and take a picture.
*) Three have had DMA turned off and have given extremely spooky errors.
   Read below.

Some factors that are definitely NOT a problem:
- Faulty run of drives.  This has also happened to Hitachi 80GB drives in the
  same configurations.

- Heat.  They're in a chilly room.  The cases haven't overheated.  We've had
  guys checking this every few hours after the first one went bonkers.

Possible problems -
- Simple software problem that somebody can fix and save the day. :)
- All Dell Poweredge 650 servers are broken.  :/

Days 1-6: Faithful service.

Day 7: 
Sep 29 09:06:42 mailstore2-1 -- MARK --
Sep 29 09:12:18 mailstore2-1 kernel: hdc: dma_timer_expiry: dma status == 0x20
Sep 29 09:12:18 mailstore2-1 kernel: hdc: status timeout: status=0xd0 { Busy }
Sep 29 09:12:18 mailstore2-1 kernel: 
Sep 29 09:12:18 mailstore2-1 kernel: ide1: reset: success
Sep 29 09:26:42 mailstore2-1 -- MARK --

Few more days of faithful service.

Little bit ago:
Oct  1 07:28:40 mailstore2-1 -- MARK --
Oct  1 07:47:47 mailstore2-1 kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Oct  1 07:47:47 mailstore2-1 kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=37694874, high=2, low=4140442, sector=35220864
Oct  1 07:47:47 mailstore2-1 kernel: end_request: I/O error, dev 03:03 (hda), sector 35220864
Oct  1 07:47:47 mailstore2-1 kernel: ^IOperation continuing on 1 devices
Oct  1 07:47:47 mailstore2-1 kernel: md: updating md0 RAID superblock on device
Oct  1 07:47:47 mailstore2-1 kernel: md: hdc3 [events: 00000004]<6>(write) hdc3's sb offset: 115949056
Oct  1 07:47:47 mailstore2-1 kernel: md: recovery thread got woken up ...
Oct  1 07:47:47 mailstore2-1 kernel: md: recovery thread finished ...
Oct  1 07:47:47 mailstore2-1 kernel: md: (skipping faulty hda3 )
Oct  1 08:08:41 mailstore2-1 -- MARK --

Oct  1 10:48:45 mailstore2-1 -- MARK --
Oct  1 10:50:44 mailstore2-1 kernel: hdc: dma_timer_expiry: dma status == 0x20
Oct  1 10:50:44 mailstore2-1 kernel: hdc: status timeout: status=0xd0 { Busy }
Oct  1 10:50:44 mailstore2-1 kernel: 
Oct  1 10:50:44 mailstore2-1 kernel: ide1: reset: success
Oct  1 11:08:46 mailstore2-1 -- MARK --

I'll post again when I've got the text of the kernel panic.

- Erik



* Re: CMD680, kernel 2.4.21, and heartache
  2003-10-03 11:23 CMD680, kernel 2.4.21, and heartache Erik Bourget
@ 2003-10-03 11:59 ` John Bradford
  2003-10-03 12:23   ` Erik Bourget
  2003-10-04  1:57 ` jimbleferret
  1 sibling, 1 reply; 10+ messages in thread
From: John Bradford @ 2003-10-03 11:59 UTC (permalink / raw)
  To: Erik Bourget, linux-kernel

> Some factors that are definitely NOT a problem:
> - Faulty run of drives.  This has also happened to Hitachi 80GB drives in the
>   same configurations.
> 
> - Heat.  They're in a chilly room.  The cases haven't overheated.  We've had
>   guys checking this every few hours after the first one went bonkers.
> 
> Possible problems -
> - Simple software problem that somebody can fix and save the day. :)
> - All Dell Poweredge 650 servers are broken.  :/

> Oct  1 07:47:47 mailstore2-1 kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> Oct  1 07:47:47 mailstore2-1 kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=37694874, high=2, low=4140442, sector=35220864

That is definitely an error from the drive.  If you're absolutely sure
it's not a faulty batch of drives or a cooling issue, maybe you have
power supply problems?  Does SMART give you any useful information?
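
For readers following the log lines, here is a minimal sketch of how the
status and error bytes map to the names printed in braces. The bit names
for values that appear in this thread (0x51, 0x40, 0xd0) are taken from the
log output; the remaining names are conventional ATA labels and should be
treated as assumptions. The helper names are hypothetical. The last
function assumes that "high" and "low" are the upper and lower 24-bit
halves of the LBA, which is consistent with the numbers in the log.

```python
# Sketch only: bit tables for the ATA status and error registers.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),      # assumed label, not seen in this thread
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),   # assumed label
    (0x02, "Index"),            # assumed label
    (0x01, "Error"),
]

ERROR_BITS = [
    (0x80, "BadSector"),            # assumed label
    (0x40, "UncorrectableError"),
    (0x20, "MediaChanged"),         # assumed label
    (0x10, "SectorIdNotFound"),     # assumed label
    (0x08, "MediaChangeRequested"), # assumed label
    (0x04, "DriveStatusError"),     # assumed label
    (0x02, "TrackZeroNotFound"),    # assumed label
    (0x01, "AddrMarkNotFound"),     # assumed label
]


def decode(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [name for mask, name in table if value & mask]


def decode_status(status):
    """While Busy is set the other bits are not meaningful, so report only Busy."""
    if status & 0x80:
        return ["Busy"]
    return decode(status, STATUS_BITS)


def combine_lba(high, low):
    """Join two 24-bit halves into one LBA (assumed layout)."""
    return (high << 24) | low


print(decode_status(0x51))          # hda: dma_intr status
print(decode(0x40, ERROR_BITS))     # hda: dma_intr error
print(decode_status(0xd0))          # hdc: status timeout
print(combine_lba(2, 4140442))      # compare with LBAsect in the log
```

The error bit is only meaningful when the status byte has its Error bit
set, which is the case for 0x51.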

John.


* Re: CMD680, kernel 2.4.21, and heartache
  2003-10-03 11:59 ` John Bradford
@ 2003-10-03 12:23   ` Erik Bourget
  2003-10-03 12:40     ` John Bradford
  2003-10-03 12:48     ` Erik Bourget
  0 siblings, 2 replies; 10+ messages in thread
From: Erik Bourget @ 2003-10-03 12:23 UTC (permalink / raw)
  To: John Bradford; +Cc: linux-kernel

John Bradford <john@grabjohn.com> writes:

>> Some factors that are definitely NOT a problem: - Faulty run of drives.
>> This has also happened to Hitachi 80GB drives in the same configurations.
>> 
>> - Heat.  They're in a chilly room.  The cases haven't overheated.  We've had
>>   guys checking this every few hours after the first one went bonkers.
>> 
>> Possible problems -
>> - Simple software problem that somebody can fix and save the day. :)
>> - All Dell Poweredge 650 servers are broken.  :/
>
>> Oct 1 07:47:47 mailstore2-1 kernel: hda: dma_intr: status=0x51 { DriveReady
>> SeekComplete Error } Oct 1 07:47:47 mailstore2-1 kernel: hda: dma_intr:
>> error=0x40 { UncorrectableError }, LBAsect=37694874, high=2, low=4140442,
>> sector=35220864
>
> That is definitely an error from the drive.  If you're absolutely sure
> it's not a faulty batch of drives or a cooling issue, maybe you have
> power supply problems?  Does SMART give you any useful information?
>
> John.

Not power supply problems; two of the machines that have this problem are
located in different facilities even.  What's SMART?

Thanks, 
- Erik



* Re: CMD680, kernel 2.4.21, and heartache
  2003-10-03 12:23   ` Erik Bourget
@ 2003-10-03 12:40     ` John Bradford
  2003-10-03 12:48     ` Erik Bourget
  1 sibling, 0 replies; 10+ messages in thread
From: John Bradford @ 2003-10-03 12:40 UTC (permalink / raw)
  To: Erik Bourget; +Cc: linux-kernel

> >> Oct 1 07:47:47 mailstore2-1 kernel: hda: dma_intr: status=0x51 { DriveReady
> >> SeekComplete Error } Oct 1 07:47:47 mailstore2-1 kernel: hda: dma_intr:
> >> error=0x40 { UncorrectableError }, LBAsect=37694874, high=2, low=4140442,
> >> sector=35220864
> >
> > That is definitely an error from the drive.  If you're absolutely sure
> > it's not a faulty batch of drives or a cooling issue, maybe you have
> > power supply problems?  Does SMART give you any useful information?
> 
> Not power supply problems; two of the machines that have this problem are
> located in different facilities even.  What's SMART?

Self-Monitoring, Analysis and Reporting Technology.  It allows drives
to report reliability statistics.

Smartmontools includes a utility, 'smartctl', which you may already
have installed.  If so:

smartctl -e /dev/hda <- Enable S.M.A.R.T.
smartctl -a /dev/hda <- Dump all data

might provide useful data.

John.


* Re: CMD680, kernel 2.4.21, and heartache
  2003-10-03 12:23   ` Erik Bourget
  2003-10-03 12:40     ` John Bradford
@ 2003-10-03 12:48     ` Erik Bourget
  2003-10-03 13:11       ` John Bradford
  2003-10-03 18:10       ` Tomasz Rola
  1 sibling, 2 replies; 10+ messages in thread
From: Erik Bourget @ 2003-10-03 12:48 UTC (permalink / raw)
  To: John Bradford; +Cc: linux-kernel

Erik Bourget <erik@midmaine.com> writes:

> John Bradford <john@grabjohn.com> writes:
>
>>> Some factors that are definitely NOT a problem: - Faulty run of drives.
>>> This has also happened to Hitachi 80GB drives in the same configurations.
>>> 
>>> - Heat.  They're in a chilly room.  The cases haven't overheated.  We've had
>>>   guys checking this every few hours after the first one went bonkers.
>>> 
>>> Possible problems -
>>> - Simple software problem that somebody can fix and save the day. :)
>>> - All Dell Poweredge 650 servers are broken.  :/
>>
>>> Oct 1 07:47:47 mailstore2-1 kernel: hda: dma_intr: status=0x51 { DriveReady
>>> SeekComplete Error } Oct 1 07:47:47 mailstore2-1 kernel: hda: dma_intr:
>>> error=0x40 { UncorrectableError }, LBAsect=37694874, high=2, low=4140442,
>>> sector=35220864
>>
>> That is definitely an error from the drive.  If you're absolutely sure
>> it's not a faulty batch of drives or a cooling issue, maybe you have
>> power supply problems?  Does SMART give you any useful information?
>>
>> John.
>
> Not power supply problems; two of the machines that have this problem are
> located in different facilities even.  What's SMART?
>

Figured out SMART.  Looks bad:

mailstore2-1:/home/erik# smartctl -a /dev/hda
Device: IC35L120AVV207-0  Supports ATA Version 6
Drive supports S.M.A.R.T. and is enabled
Check S.M.A.R.T. Passed.

General Smart Values: 
Off-line data collection status: (0x85) Offline data collection activity was 
                                        aborted by an interrupting command

Self-test execution status:      ( 245) Self-test routine in progess
                                        50% of test remaining

Total time to complete off-line 
data collection:                 (2855) Seconds

Offline data collection 
Capabilities:                    (0x1b)SMART EXECUTE OFF-LINE IMMEDIATE
                                        Automatic timer ON/OFF support
                                        Suspend Offline Collection upon new
                                        command
                                        Offline surface scan supported
                                        Self-test supported

Smart Capablilities:           (0x0003) Saves SMART data before entering
                                        power-saving mode
                                        Supports SMART auto save timer

Error logging capability:        (0x01) Error logging supported

Short self-test routine 
recommended polling time:        (   1) Minutes

Extended self-test routine 
recommended polling time:        (  48) Minutes

Vendor Specific SMART Attributes with Thresholds:
Revision Number: 16
Attribute                    Flag     Value Worst Threshold Raw Value
(  1)Raw Read Error Rate     0x000b   095   095   060       458761
(  2)Throughput Performance  0x0005   148   148   050       264
(  3)Spin Up Time            0x0007   100   100   024       291
(  4)Start Stop Count        0x0012   100   100   000       6
(  5)Reallocated Sector Ct   0x0033   100   100   005       7
(  7)Seek Error Rate         0x000b   100   100   067       0
(  8)Seek Time Preformance   0x0005   123   123   000       37
(  9)Power On Hours          0x0012   100   100   000       709
( 10)Spin Retry Count        0x0013   100   100   060       0
( 12)Power Cycle Count       0x0032   100   100   000       6
(192)Power-Off Retract Count 0x0032   100   100   050       21
(193)Load Cycle Count        0x0012   100   100   050       21
(194)Temperature             0x0002   196   196   000       1441854
(196)Reallocated Event Count 0x0032   100   100   000       7
(197)Current Pending Sector  0x0022   100   100   000       3
(198)Offline Uncorrectable   0x0008   100   100   000       3
(199)UDMA CRC Error Count    0x000a   200   200   000       0
SMART Error Log:
SMART Error Logging Version: 1
Error Log Data Structure Pointer: 01
ATA Error Count: 1
Non-Fatal Count: 0

Error Log Structure 1:
DCR   FR   SC   SN   CL   SH   D/H   CR   Timestamp
 00   00   08   22   1d   3f    e0   25     1851604
 00   00   08   aa   2b   3f    e0   25     1851604
 00   00   08   6a   1d   3f    e0   25     1851604
 00   00   08   02   96   3f    e0   25     1851604
 00   00   08   9a   2d   3f    e0   25     1851604
 00   40   08   9a   2d   3f    e2   51     0
Error condition:   0    Error State:       3
Number of Hours in Drive Life: 660 (life of the drive in hours)

Eep.

Are these errors set by the drive itself, or could a faulty harddrive
controller / driver cause them?  FWIW, I spoke offline to somebody about this
last week who seemed to think that it was an Alan Cox APIC bug.

- Erik



* Re: CMD680, kernel 2.4.21, and heartache
  2003-10-03 12:48     ` Erik Bourget
@ 2003-10-03 13:11       ` John Bradford
  2003-10-03 18:10       ` Tomasz Rola
  1 sibling, 0 replies; 10+ messages in thread
From: John Bradford @ 2003-10-03 13:11 UTC (permalink / raw)
  To: Erik Bourget; +Cc: linux-kernel

> >>> Oct 1 07:47:47 mailstore2-1 kernel: hda: dma_intr: status=0x51 { DriveReady
> >>> SeekComplete Error } Oct 1 07:47:47 mailstore2-1 kernel: hda: dma_intr:
> >>> error=0x40 { UncorrectableError }, LBAsect=37694874, high=2, low=4140442,
> >>> sector=35220864
> >>
> >> That is definitely an error from the drive.  If you're absolutely sure
> >> it's not a faulty batch of drives or a cooling issue, maybe you have
> >> power supply problems?  Does SMART give you any useful information?
> >
> > Not power supply problems; two of the machines that have this problem are
> > located in different facilities even.  What's SMART?
> 
> Figured out SMART.  Looks bad:
> 
> mailstore2-1:/home/erik# smartctl -a /dev/hda
> Device: IC35L120AVV207-0  Supports ATA Version 6
> Drive supports S.M.A.R.T. and is enabled
> Check S.M.A.R.T. Passed.
> 
> General Smart Values: 
> Off-line data collection status: (0x85) Offline data collection activity was 
>                                         aborted by an interrupting command
> 
> Self-test execution status:      ( 245) Self-test routine in progess
>                                         50% of test remaining
> 
> Total time to complete off-line 
> data collection:                 (2855) Seconds
> 
> Offline data collection 
> Capabilities:                    (0x1b)SMART EXECUTE OFF-LINE IMMEDIATE
>                                         Automatic timer ON/OFF support
>                                         Suspend Offline Collection upon new
>                                         command
>                                         Offline surface scan supported
>                                         Self-test supported
> 
> Smart Capablilities:           (0x0003) Saves SMART data before entering
>                                         power-saving mode
>                                         Supports SMART auto save timer
> 
> Error logging capability:        (0x01) Error logging supported
> 
> Short self-test routine 
> recommended polling time:        (   1) Minutes
> 
> Extended self-test routine 
> recommended polling time:        (  48) Minutes
> 
> Vendor Specific SMART Attributes with Thresholds:
> Revision Number: 16
> Attribute                    Flag     Value Worst Threshold Raw Value
> (  1)Raw Read Error Rate     0x000b   095   095   060       458761
> (  2)Throughput Performance  0x0005   148   148   050       264
> (  3)Spin Up Time            0x0007   100   100   024       291
> (  4)Start Stop Count        0x0012   100   100   000       6
> (  5)Reallocated Sector Ct   0x0033   100   100   005       7
> (  7)Seek Error Rate         0x000b   100   100   067       0
> (  8)Seek Time Preformance   0x0005   123   123   000       37
> (  9)Power On Hours          0x0012   100   100   000       709
> ( 10)Spin Retry Count        0x0013   100   100   060       0
> ( 12)Power Cycle Count       0x0032   100   100   000       6
> (192)Power-Off Retract Count 0x0032   100   100   050       21
> (193)Load Cycle Count        0x0012   100   100   050       21
> (194)Temperature             0x0002   196   196   000       1441854
> (196)Reallocated Event Count 0x0032   100   100   000       7
> (197)Current Pending Sector  0x0022   100   100   000       3
> (198)Offline Uncorrectable   0x0008   100   100   000       3
> (199)UDMA CRC Error Count    0x000a   200   200   000       0
> SMART Error Log:
> SMART Error Logging Version: 1
> Error Log Data Structure Pointer: 01
> ATA Error Count: 1
> Non-Fatal Count: 0
> 
> Error Log Structure 1:
> DCR   FR   SC   SN   CL   SH   D/H   CR   Timestamp
>  00   00   08   22   1d   3f    e0   25     1851604
>  00   00   08   aa   2b   3f    e0   25     1851604
>  00   00   08   6a   1d   3f    e0   25     1851604
>  00   00   08   02   96   3f    e0   25     1851604
>  00   00   08   9a   2d   3f    e0   25     1851604
>  00   40   08   9a   2d   3f    e2   51     0
> Error condition:   0    Error State:       3
> Number of Hours in Drive Life: 660 (life of the drive in hours)
> 
> Eep.
> 
> Are these errors set by the drive itself,

Yes - the error log is direct from the drive - a faulty controller or
driver won't cause it to report bogus errors that it hasn't logged.
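
One way to cross-check the drive's log against the kernel message is to
decode the last row of "Error Log Structure 1".  The sketch below assumes
that the last row holds the register contents at the time of the error,
with the error register in the FR column and the status register in the
CR column, and that the LBA is spread over SN, CL, SH and the low four
bits of D/H (28-bit addressing).  The function names are hypothetical.

```python
COLUMNS = ["DCR", "FR", "SC", "SN", "CL", "SH", "DH", "CR"]


def parse_row(row):
    """Parse the first eight hex fields of an error-log row into a dict."""
    fields = row.split()
    return dict(zip(COLUMNS, (int(f, 16) for f in fields[:8])))


def lba28(regs):
    """Assemble a 28-bit LBA from the task-file registers (assumed layout)."""
    return (
        ((regs["DH"] & 0x0F) << 24)
        | (regs["SH"] << 16)
        | (regs["CL"] << 8)
        | regs["SN"]
    )


# Last row of the error log quoted above.
last = parse_row("00   40   08   9a   2d   3f    e2   51     0")

print("error  =", hex(last["FR"]))
print("status =", hex(last["CR"]))
print("LBA    =", lba28(last))
```

Under those assumptions the row decodes to error 0x40, status 0x51 and
LBA 37694874, the same three values the kernel printed for hda at
07:47:47.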

> or could a faulty harddrive
> controller / driver cause them?

Real errors have definitely been logged by that drive.  A broken
controller may conceivably have contributed to that.

>  FWIW, I spoke offline to somebody about this
> last week who seemed to think that it was an Alan Cox APIC bug.

Alan is the best person to ask.  Or maybe somebody else will pick up
on this thread.  I'd rather not speculate without knowing more about
the specific machines.

John.


* Re: CMD680, kernel 2.4.21, and heartache
  2003-10-03 12:48     ` Erik Bourget
  2003-10-03 13:11       ` John Bradford
@ 2003-10-03 18:10       ` Tomasz Rola
  2003-10-03 18:22         ` Erik Bourget
  1 sibling, 1 reply; 10+ messages in thread
From: Tomasz Rola @ 2003-10-03 18:10 UTC (permalink / raw)
  To: Erik Bourget; +Cc: linux-kernel, Tomasz Rola

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Fri, 3 Oct 2003, Erik Bourget wrote:

> Erik Bourget <erik@midmaine.com> writes:
> 
> (194)Temperature             0x0002   196   196   000       1441854

You should definitely look at the SMART data of the other drives in all
the computers, especially the temperature. Compare it with the maximum
allowed temperature given by the drive manufacturer for that specific
model (it should be somewhere on their website or on Google). Each disk
is different, but a generally safe limit is 40-45 °C, from what I know.

Your room may be cool but it's better to check.

bye
T.

- --
** A C programmer asked whether computer had Buddha's nature.      **
** As the answer, master did "rm -rif" on the programmer's home    **
** directory. And then the C programmer became enlightened...      **
**                                                                 **
** Tomasz Rola          mailto:tomasz_rola@bigfoot.com             **


-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 5.0i for non-commercial use
Charset: noconv

iQA/AwUBP327gxETUsyL9vbiEQIE8gCghrDBFt+6iwWPhT9FpYBeUPH5e74AoMny
9U1IohyoivjzNUbLKpIGN2kY
=E7fR
-----END PGP SIGNATURE-----




* Re: CMD680, kernel 2.4.21, and heartache
  2003-10-03 18:10       ` Tomasz Rola
@ 2003-10-03 18:22         ` Erik Bourget
  2003-10-03 18:47           ` John Bradford
  0 siblings, 1 reply; 10+ messages in thread
From: Erik Bourget @ 2003-10-03 18:22 UTC (permalink / raw)
  To: Tomasz Rola; +Cc: linux-kernel

Tomasz Rola <rtomek@cis.com.pl> writes:

> On Fri, 3 Oct 2003, Erik Bourget wrote:
>
>> Erik Bourget <erik@midmaine.com> writes:
>> 
>> (194)Temperature             0x0002   196   196   000       1441854
>
> You should definitely look at the SMART data of the other drives in all
> the computers, especially the temperature. Compare it with the maximum
> allowed temperature given by the drive manufacturer for that specific
> model (it should be somewhere on their website or on Google). Each disk
> is different, but a generally safe limit is 40-45 °C, from what I know.
>
> Your room may be cool but it's better to check.
>
> bye
> T.

Yeah, it says 196, and that's bizarre.  196 whats?  From looking at other
example output, the '1441854' number is usually the true deg. C of the
machine.  But I'm reasonably sure that it's not at a million and a half
centigrade.

I can open the case up and put my hand on the drive.  It feels cooler to the
touch than the 10k SCSI drives in the next machine over...

Thanks though;

Erik



* Re: CMD680, kernel 2.4.21, and heartache
  2003-10-03 18:22         ` Erik Bourget
@ 2003-10-03 18:47           ` John Bradford
  0 siblings, 0 replies; 10+ messages in thread
From: John Bradford @ 2003-10-03 18:47 UTC (permalink / raw)
  To: Erik Bourget, Tomasz Rola; +Cc: linux-kernel

Quote from Erik Bourget <erik@midmaine.com>:
> Tomasz Rola <rtomek@cis.com.pl> writes:
> 
> > On Fri, 3 Oct 2003, Erik Bourget wrote:
> >
> >> Erik Bourget <erik@midmaine.com> writes:
> >> 
> >> (194)Temperature             0x0002   196   196   000       1441854
> >
> > You should definitely look at the SMART data of the other drives in all
> > the computers, especially the temperature. Compare it with the maximum
> > allowed temperature given by the drive manufacturer for that specific
> > model (it should be somewhere on their website or on Google). Each disk
> > is different, but a generally safe limit is 40-45 °C, from what I know.
> >
> > Your room may be cool but it's better to check.
> 
> Yeah, it says 196, and that's bizarre.  196 whats?  From looking at other
> example output, the '1441854' number is usually the true deg. C of the
> machine.  But I'm reasonably sure that it's not at a million and a half
> centigrade.

The units of the value, worst, and threshold fields are not important
- it's all relative.

Yes, the raw value field is usually the temperature in C as measured
by an on-board sensor, but as far as I know, there is no _requirement_
for it to be.  I've seen disks which report the power on hours in
minutes, for example.
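
Some drives pack several small readings into the one raw field, which
can make it look like a huge number.  As a hedged sketch, splitting the
raw value from this thread into 16-bit words gives small integers.
Whether any of them is a temperature, and in what unit, is vendor
specific and is not established here.  The function name is
hypothetical.

```python
def split_words(raw, count=3):
    """Split a raw attribute value into 16-bit words, lowest word first."""
    return [(raw >> (16 * i)) & 0xFFFF for i in range(count)]


raw_temperature = 1441854  # raw value of attribute 194 quoted above

print(hex(raw_temperature))
print(split_words(raw_temperature))
```

For 1441854 the words are 62, 22 and 0.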

I think the raw value field is really provided for extra-information
purposes - it's better to use the value, worst and threshold fields.

Basically, as long as the value field doesn't reach the threshold
field, don't worry about that aspect of the drive.  Note, the value
may count up or down.  Whether the worst field is preserved over a
power cycle is drive dependent.
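
The comparison described above can be sketched as a small script that
reads attribute rows in the layout quoted in this thread and reports how
far each value is from its threshold.  The row layout is assumed from
that output only; other smartctl versions print different columns.  The
rule that a zero threshold never trips is also an assumption, as are all
the names below.

```python
import re
from typing import NamedTuple


class Attr(NamedTuple):
    ident: int
    name: str
    value: int
    worst: int
    threshold: int
    raw: int


# "(  5)Reallocated Sector Ct   0x0033   100   100   005       7"
ROW = re.compile(
    r"\(\s*(\d+)\)(.*?)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$"
)


def parse_attributes(text):
    """Parse attribute rows; lines that do not match are skipped."""
    attrs = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            ident, name, value, worst, thresh, raw = m.groups()
            attrs.append(
                Attr(int(ident), name.strip(), int(value),
                     int(worst), int(thresh), int(raw))
            )
    return attrs


def at_or_below_threshold(attrs):
    """Attributes whose value has reached the threshold (zero never trips)."""
    return [a for a in attrs if a.threshold > 0 and a.value <= a.threshold]


SAMPLE = """\
(  5)Reallocated Sector Ct   0x0033   100   100   005       7
(194)Temperature             0x0002   196   196   000       1441854
(197)Current Pending Sector  0x0022   100   100   000       3
(198)Offline Uncorrectable   0x0008   100   100   000       3
"""

attrs = parse_attributes(SAMPLE)
for a in attrs:
    print(a.ident, a.name, "margin:", a.value - a.threshold, "raw:", a.raw)
print("at or below threshold:", at_or_below_threshold(attrs))
```

On the four sample rows nothing has reached its threshold, even though
the raw counts for attributes 5, 197 and 198 are nonzero.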

John.


* Re: CMD680, kernel 2.4.21, and heartache
  2003-10-03 11:23 CMD680, kernel 2.4.21, and heartache Erik Bourget
  2003-10-03 11:59 ` John Bradford
@ 2003-10-04  1:57 ` jimbleferret
  1 sibling, 0 replies; 10+ messages in thread
From: jimbleferret @ 2003-10-04  1:57 UTC (permalink / raw)
  To: Erik Bourget, linux-kernel

On Friday 03 October 2003 07:23 am, Erik Bourget wrote:
> Day 0: 8 new NFS servers go online, they are P4-2.4GHz boxes
> with two each 120GB Samsung drives attached to CMD680/SiI680
> IDE controllers.  They run Debian stable on a 2.4.21 kernel,
> with SMP enabled though they are uniproc boxes, running
> NFSv3-via-TCP and reiserfs.  CMD680/siimage support compiled
> in, obviously.  Software RAID, mirroring drives.
>
> Out of 8 boxes:
>
> *) One has crashed hard.  I'm about to drive to the datacenter
> to plug in a KVM and take a picture.
> *) Three have had DMA turned off and have given extremely
> spooky errors. Read below.

I've been having very similar problems with a box here.  Just put 
Gentoo on it, and started having weird errors almost immediately 
- libraries not being found, all the way to gcc being unable to 
make executables.  Occasionally, I'd get a full lock - network 
was dead from the outside, and locally everything was frozen.  
The logs pointed to the same things - 'hda: status error: 
status=0x58 { DriveReady SeekComplete DataRequest },'  'hda: 
timeout waiting for DMA' and then a 'reset: success,' but it 
didn't seem that way.  Turning off DMA didn't seem to have any 
effect on gcc problems or lockups.

The Maxtor utility said everything was ok, and I had been using 
FreeBSD on it for months with no problems there.  Heat isn't a 
problem.

I then took out SiS5513 support (CONFIG_BLK_DEV_SIS5513), and I 
haven't had any problems since.  DMA is back on, and I've had a 
couple of timeouts and resets, but only 3 or 4 in ~10 hours, and 
no lockups or other noticeable weirdness.

Kernel version is gentoo-sources 2.4.20-r7, but I think that has 
patches from >=2.4.21.  I haven't tried any other sources.

Drive is a Maxtor 8 Gig, 90871U2.

Board is a PCChips M571.

