Fatal File System Corruption -Software RAID + NFS

* Fatal File System Corruption -Software RAID + NFS
@ 2003-11-16 17:04 Steven Poulakos
  2003-11-16 19:28 ` Christian Kujau
  0 siblings, 1 reply; 7+ messages in thread
From: Steven Poulakos @ 2003-11-16 17:04 UTC (permalink / raw)
  To: Reiserfs Mailinglist

Hi,

I'm encountering my second wave of reiserfs filesystem corruption on a 
Web/NFS/NIS server.  This time, I had fatal corruption after updating 
the kernel.  Below are the system specs and the error messages that I'm 
receiving.

The system:

Debian (latest unstable)
2.4.22 Kernel  (not patched in any way)
running NFS (from the kernel), which serves home directories to 3 
workstations
Software RAID 1
-2, mirrored 80gig (7200RPM) IBM hard drives
-Each drive is connected to its own channel on a Promise Ultra ATA/100 
Controller (Ultra100 Tx2) card (PCI)  (using ATA 100 cables)
-the two drives are called hda and hdc (contain 3 mirrored RAID partitions)

The problem:

Two of the three RAID'ed partitions have been working without errors 
(/home and /usr).  My root partition generated the errors below for many 
days before they were detected.  After a reboot, the system would not 
boot, and reiserfsck --rebuild-tree could not repair the corruption.  I 
assume that I must rebuild the system now, and knowing how to prevent the 
errors below from happening again will affect how I proceed with the 
rebuild.  Any info on preventing these errors would be great!

-Steve

Here are snippets of the errors that just repeat over and over...

Nov  9 06:47:34 ctdev kernel: hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Nov  9 06:47:34 ctdev kernel:
Nov  9 06:47:34 ctdev kernel: hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Nov  9 06:47:34 ctdev kernel:
Nov  9 06:47:34 ctdev kernel: hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Nov  9 06:47:34 ctdev kernel:
Nov  9 06:47:39 ctdev kernel: hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Nov  9 06:47:39 ctdev kernel:
Nov  9 06:47:39 ctdev kernel: hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Nov  9 06:47:39 ctdev kernel:
Nov  9 06:47:44 ctdev kernel: hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Nov  9 06:47:44 ctdev kernel:
Nov  9 06:47:44 ctdev kernel: hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
.....(snip).......

.....(snip).......
Nov  9 06:47:59 ctdev kernel:
Nov  9 06:48:05 ctdev kernel: hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Nov  9 06:48:05 ctdev kernel:
Nov  9 06:48:05 ctdev kernel: hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Nov  9 06:48:05 ctdev kernel:
Nov  9 06:48:05 ctdev kernel: hdc: status timeout: status=0xd0 { Busy }
Nov  9 06:48:05 ctdev kernel:
Nov  9 06:48:05 ctdev kernel: PDC202XX: Secondary channel reset.
Nov  9 06:48:05 ctdev kernel: ide1: reset: success
Nov  9 06:48:10 ctdev kernel: hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Nov  9 06:48:10 ctdev kernel:
Nov  9 06:48:10 ctdev kernel: hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Nov  9 06:48:10 ctdev kernel:
Nov  9 06:48:10 ctdev kernel: hdc: write_intr error1: nr_sectors=5, stat=0xd0
Nov  9 06:48:10 ctdev kernel: hdc: write_intr: status=0xd0 { Busy }
Nov  9 06:48:10 ctdev kernel:
Nov  9 06:48:10 ctdev kernel: PDC202XX: Secondary channel reset.
Nov  9 06:48:10 ctdev kernel: ide1: reset: success
Nov  9 06:48:15 ctdev kernel: hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
.....(snip).......


.....(snip).......
Nov  9 06:56:45 ctdev kernel: hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Nov  9 06:56:45 ctdev kernel:
Nov  9 06:56:50 ctdev kernel: hdc: write_intr error1: nr_sectors=51, stat=0xd0
Nov  9 06:56:50 ctdev kernel: hdc: write_intr: status=0xd0 { Busy }
Nov  9 06:56:50 ctdev kernel:
Nov  9 06:56:50 ctdev kernel: PDC202XX: Secondary channel reset.
Nov  9 06:56:50 ctdev kernel: ide1: reset: success
Nov  9 06:56:56 ctdev kernel: hdc: write_intr error1: nr_sectors=51, stat=0xd0
Nov  9 06:56:56 ctdev kernel: hdc: write_intr: status=0xd0 { Busy }
Nov  9 06:56:56 ctdev kernel:
Nov  9 06:56:56 ctdev kernel: PDC202XX: Secondary channel reset.
Nov  9 06:56:56 ctdev kernel: ide1: reset: success
Nov  9 06:56:56 ctdev kernel: hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Nov  9 06:56:56 ctdev kernel:
Nov  9 06:57:01 ctdev kernel: hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Nov  9 06:57:01 ctdev kernel:
..... (snip) .....
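
For readers decoding the log above: the names in braces are the driver's 
rendering of the bits in the ATA status byte.  The sketch below is only 
an illustration of that mapping, not the kernel's own code.  The bit 
masks are the standard ATA status register layout; the helper name 
decode_status is made up for this example.  Note that in 0x58 the Error 
bit (0x01) is clear, and that 0xd0 has Busy set, in which case the 
remaining bits are not meaningful and only "Busy" is shown.

```python
# Standard ATA status register bits, labeled the way the log prints them.
ATA_STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(status):
    """Return the names of the bits set in an ATA status byte.

    decode_status is a hypothetical helper for this illustration.
    When Busy is set, the other bits are not valid, so only "Busy"
    is reported, matching the "{ Busy }" lines in the log.
    """
    if status & 0x80:
        return ["Busy"]
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


for value in (0x58, 0xD0):
    print(f"status=0x{value:02x}", "{", " ".join(decode_status(value)), "}")
```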




* Re: Fatal File System Corruption -Software RAID + NFS
  2003-11-16 17:04 Fatal File System Corruption -Software RAID + NFS Steven Poulakos
@ 2003-11-16 19:28 ` Christian Kujau
  2003-11-17 14:47   ` Dan Oglesby
  0 siblings, 1 reply; 7+ messages in thread
From: Christian Kujau @ 2003-11-16 19:28 UTC (permalink / raw)
  To: Steven Poulakos; +Cc: Reiserfs Mailinglist

Steven Poulakos wrote:
[...]
> Here are snippets of the errors that just repeat over and over...
> 
> Nov  9 06:47:34 ctdev kernel: hdc: status error: status=0x58 { 
> DriveReady SeekComplete DataRequest }
> Nov  9 06:47:34 ctdev kernel:
> Nov  9 06:47:34 ctdev kernel: hdc: status error: status=0x58 { 
> DriveReady SeekCo

hardware errors, i'd say :-(
try to dd / dd_rescue to a working disk, then go on with newest 
reiserfsprogs or /bin/cat...
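
The idea behind the dd / dd_rescue suggestion is to copy the failing 
device block by block and keep going past read errors, so that the 
repair tools can then work on the copy.  The following is a minimal 
sketch of that idea only, not a replacement for either tool.  The 
function name rescue_copy, the FlakySource class, and the block size 
are all assumptions made for this illustration; the demo uses in-memory 
data with a simulated bad block rather than a real disk.

```python
import io


def rescue_copy(src, dst, size, block_size=4096):
    """Copy `size` bytes from src to dst, zero-filling unreadable blocks.

    Returns the byte offsets of the blocks that could not be read.
    """
    bad_offsets = []
    for offset in range(0, size, block_size):
        want = min(block_size, size - offset)
        try:
            src.seek(offset)
            chunk = src.read(want)
        except OSError:
            # Unreadable block: remember it and carry on with the next one.
            bad_offsets.append(offset)
            chunk = b""
        # Pad failed or short reads with zeros so later data keeps its offset.
        dst.seek(offset)
        dst.write(chunk.ljust(want, b"\0"))
    return bad_offsets


class FlakySource(io.BytesIO):
    """In-memory stand-in for a disk whose block at offset 4 is unreadable."""

    def read(self, n=-1):
        if self.tell() == 4:
            raise OSError("simulated bad sector")
        return super().read(n)


source = FlakySource(b"abcdefghijkl")
target = io.BytesIO()
bad = rescue_copy(source, target, size=12, block_size=4)
print("unreadable offsets:", bad)
print("recovered image:", target.getvalue())
```

On a real system the source would be the failing partition opened 
read-only in binary mode and the target a file on a known-good disk; 
the zero-filled blocks are where data was lost.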

Christian.
-
BOFH excuse #6:

global warming



* Re: Fatal File System Corruption -Software RAID + NFS
  2003-11-16 19:28 ` Christian Kujau
@ 2003-11-17 14:47   ` Dan Oglesby
  2003-11-17 17:00     ` Steven Poulakos
  2003-11-20 19:59     ` Steven Poulakos
  0 siblings, 2 replies; 7+ messages in thread
From: Dan Oglesby @ 2003-11-17 14:47 UTC (permalink / raw)
  To: Reiserfs Mailinglist

Christian Kujau wrote:
> Steven Poulakos wrote:
> [...]
> 
>> Here are snippets of the errors that just repeat over and over...
>>
>> Nov  9 06:47:34 ctdev kernel: hdc: status error: status=0x58 { 
>> DriveReady SeekComplete DataRequest }
>> Nov  9 06:47:34 ctdev kernel:
>> Nov  9 06:47:34 ctdev kernel: hdc: status error: status=0x58 { 
>> DriveReady SeekCo
> 
> 
> hardware errors, i'd say :-(
> try to dd / dd_rescue to a working disk, then go on with newest 
> reiserfsprogs or /bin/cat...
> 
> Christian.
> -

I saw these types of errors on my software RAID-5 array when I first 
brought it online.  The array is made up of four 20GB hard drives from 
four different manufacturers, two drives to an IDE channel.  Some of the 
drives didn't want to play nice when sharing the IDE channel.

Using hdparm, I found that not all drives were enabling DMA or 32-bit 
transfers, so I forced all of the hard drives to run 32-bit with DMA. 
This solved the problem.

Another interesting thing I found (while I'm talking about my goofy 
array) was that the system performed MUCH better when I disabled write 
caching on all of the hard drives.
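
The exact hdparm invocations are not given in this message.  As a 
sketch under that caveat: hdparm's -d flag controls DMA, -c controls 
32-bit I/O, and -W controls the drive's write cache, so settings like 
the ones described could be expressed as below.  The helper 
hdparm_command is hypothetical, the device names are placeholders, and 
the code only builds and prints the command lines without running them.

```python
import shlex


def hdparm_command(device, dma=True, io32=True, write_cache=None):
    """Build (but do not run) an hdparm command line for one IDE drive.

    hdparm_command is a made-up helper; the flags are real hdparm options.
    """
    args = ["hdparm"]
    args.append("-d1" if dma else "-d0")    # -d: DMA on/off
    args.append("-c1" if io32 else "-c0")   # -c: 32-bit I/O on/off
    if write_cache is not None:
        # -W: drive write cache on/off; omitted when left unchanged
        args.append("-W1" if write_cache else "-W0")
    args.append(device)
    return args


# Placeholder device names; substitute the drives in the actual array.
for dev in ("/dev/hda", "/dev/hdc"):
    print(shlex.join(hdparm_command(dev, write_cache=False)))
```

Settings applied this way do not persist across reboots, so they are 
usually placed in a boot script once they are known to be safe for the 
drives in question.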

The filesystem is ReiserFS 3.6, of course.

--Dan


* Re: Fatal File System Corruption -Software RAID + NFS
  2003-11-17 14:47   ` Dan Oglesby
@ 2003-11-17 17:00     ` Steven Poulakos
  2003-11-17 20:58       ` Christian Kujau
  2003-11-20 19:59     ` Steven Poulakos
  1 sibling, 1 reply; 7+ messages in thread
From: Steven Poulakos @ 2003-11-17 17:00 UTC (permalink / raw)
  To: Dan Oglesby, Reiserfs Mailinglist

Dan Oglesby wrote:

> Christian Kujau wrote:
>
>> Steven Poulakos wrote:
>> [...]
>>
>>> Here are snippets of the errors that just repeat over and over...
>>>
>>> Nov  9 06:47:34 ctdev kernel: hdc: status error: status=0x58 { 
>>> DriveReady SeekComplete DataRequest }
>>> Nov  9 06:47:34 ctdev kernel:
>>> Nov  9 06:47:34 ctdev kernel: hdc: status error: status=0x58 { 
>>> DriveReady SeekCo
>>
>>
>>
>> hardware errors, i'd say :-(
>> try to dd / dd_rescue to a working disk, then go on with newest 
>> reiserfsprogs or /bin/cat...
>>
>> Christian.
>> -
>
>
> I saw these types of errors on my software RAID-5 array when I first 
> brought it online.  The array is made up of four 20GB hard drives from 
> four different manufacturers, two drives to an IDE channel.  Some of 
> the drives didn't want to play nice when sharing the IDE channel.
>
> Using hdparm, I found that not all drives were enabling DMA or 32-bit 
> transfers, so I forced all of the hard drives to run 32-bit with DMA. 
> This solved the problem.
>
> Another interesting thing I found (while I'm talking about my goofy 
> array) was that the system performed MUCH better when I disabled write 
> caching on all of the hard drives.
>
> The filesystem is ReiserFS 3.6, of course.
>
> --Dan

Christian above suggested that I'm experiencing hardware problems.  I'm 
now wondering if the problem is with my PCI bus speed, which is running 
at 33MHz.   Since I'm using a Promise Ultra100Tx2 controller card, it 
might not be able to operate fast enough to support UDMA 5.  Is this 
correct? 

Here are some additional specs:
Asus P2B Motherboard (33MHz PCI bus)
Pentium 350
384 MB PC100 memory
I also have one AGP video card in use and two NICs on the PCI bus

Is a 33MHz bus sufficient for running the hard drives in UDMA 5 mode?  
Would you recommend that I upgrade to a 66MHz PCI bus motherboard?  The 
other system specs (from the previous email) are below.
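
A back-of-the-envelope check of the bandwidth question, using nominal 
figures only: a 32-bit PCI bus moves at most 4 bytes per clock, and 
UDMA 5 (ATA/100) peaks at 100 MB/s per channel.  The function name and 
constants below are chosen for this illustration, and the numbers are 
theoretical peaks; real drives sustain less, and the bus is shared with 
the NICs.  A slower bus should, in principle, limit throughput rather 
than produce status errors like the ones in the log.

```python
PCI_WIDTH_BYTES = 4       # 32-bit conventional PCI: 4 bytes per transfer
PCI_CLOCK_MHZ = 33.33     # nominal "33 MHz" PCI clock
UDMA5_MB_S = 100          # ATA/100 (UDMA mode 5) peak rate per channel
CHANNELS = 2              # hda and hdc each sit on their own channel


def pci_peak_mb_s(clock_mhz, width_bytes=PCI_WIDTH_BYTES):
    """Theoretical peak PCI bandwidth in MB/s (one transfer per clock)."""
    return clock_mhz * width_bytes


bus_33 = pci_peak_mb_s(PCI_CLOCK_MHZ)
bus_66 = pci_peak_mb_s(2 * PCI_CLOCK_MHZ)

print(f"33 MHz PCI peak:       {bus_33:.0f} MB/s")
print(f"66 MHz PCI peak:       {bus_66:.0f} MB/s")
print(f"one UDMA 5 channel:    {UDMA5_MB_S} MB/s")
print(f"both channels at once: {UDMA5_MB_S * CHANNELS} MB/s")
```

By these figures a 33 MHz bus can carry one channel at its peak 
interface rate but not both channels bursting at the same time, which 
matters for RAID 1 because every write goes to both drives.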

Steve

More system specs:
Debian (latest unstable)
2.4.22 Kernel  (not patched in any way)
running NFS (from the kernel), which serves home directories to 3 
workstations
Software RAID 1
-2, mirrored 80gig (7200RPM) IBM hard drives
-Each drive is connected to its own channel on a Promise Ultra ATA/100 
Controller (Ultra100 Tx2) card (PCI)  (using ATA 100 cables)
-the two drives are called hda and hdc (contain 3 mirrored RAID partitions)



* Re: Fatal File System Corruption -Software RAID + NFS
  2003-11-17 17:00     ` Steven Poulakos
@ 2003-11-17 20:58       ` Christian Kujau
  0 siblings, 0 replies; 7+ messages in thread
From: Christian Kujau @ 2003-11-17 20:58 UTC (permalink / raw)
  To: Steven Poulakos; +Cc: Reiserfs Mailinglist

Steven Poulakos wrote:
> at 33MHz.   Since I'm using a Promise Ultra100Tx2 controller card, it 
> might not be able to operate fast enough to support UDMA 5.  Is this 
> correct?

are you using the promise chipset drivers then? did you try without 
these and instead using plain generic ide drivers?

i dunno much of the innards of any bus-subsystems, but "good drives and 
good cabling" are a must, of course. i myself experience SCSI related 
problems when using high-speed scsi disks in a low-end machine....

Christian.
-- 
BOFH excuse #65:

system needs to be rebooted


* Re: Fatal File System Corruption -Software RAID + NFS
  2003-11-20 19:59     ` Steven Poulakos
@ 2003-11-20  8:20       ` Hans Reiser
  0 siblings, 0 replies; 7+ messages in thread
From: Hans Reiser @ 2003-11-20  8:20 UTC (permalink / raw)
  To: Steven Poulakos; +Cc: Dan Oglesby, Reiserfs Mailinglist, mason

Steven Poulakos wrote:

>
> Dan Oglesby wrote:
>
>>
>>
>> Another interesting thing I found (while I'm talking about my goofy 
>> array) was that the system performed MUCH better when I disabled 
>> write caching on all of the hard drives.
>
Dan and Chris Mason, if you achieve an understanding of this, let me know.

-- 
Hans




* Re: Fatal File System Corruption -Software RAID + NFS
  2003-11-17 14:47   ` Dan Oglesby
  2003-11-17 17:00     ` Steven Poulakos
@ 2003-11-20 19:59     ` Steven Poulakos
  2003-11-20  8:20       ` Hans Reiser
  1 sibling, 1 reply; 7+ messages in thread
From: Steven Poulakos @ 2003-11-20 19:59 UTC (permalink / raw)
  To: Dan Oglesby; +Cc: Reiserfs Mailinglist

Thanks, Dan and Christian, for the offered advice (below).  It turns out 
that my problems were hardware related and not with Reiserfs.  One of 
the IDE cables had a small nick in it with exposed wire.  This was 
probably the source of the errors that caused the file system corruption.

Unfortunately, I can't be absolutely certain of this because I also 
upgraded my motherboard to one that supports a 66MHz PCI bus.  This 
allows the Promise ATA/100 controller card to work at full speed. 

My newly rebuilt system has been up and running for three days now.  So 
far, there aren't any errors.

Thanks, again!

Steve

Dan Oglesby wrote:

> Christian Kujau wrote:
>
>> Steven Poulakos wrote:
>> [...]
>>
>>> Here are snippets of the errors that just repeat over and over...
>>>
>>> Nov  9 06:47:34 ctdev kernel: hdc: status error: status=0x58 { 
>>> DriveReady SeekComplete DataRequest }
>>> Nov  9 06:47:34 ctdev kernel:
>>> Nov  9 06:47:34 ctdev kernel: hdc: status error: status=0x58 { 
>>> DriveReady SeekCo
>>
>>
>>
>> hardware errors, i'd say :-(
>> try to dd / dd_rescue to a working disk, then go on with newest 
>> reiserfsprogs or /bin/cat...
>>
>> Christian.
>> -
>
>
> I saw these types of errors on my software RAID-5 array when I first 
> brought it online.  The array is made up of four 20GB hard drives from 
> four different manufacturers, two drives to an IDE channel.  Some of 
> the drives didn't want to play nice when sharing the IDE channel.
>
> Using hdparm, I found that not all drives were enabling DMA or 32-bit 
> transfers, so I forced all of the hard drives to run 32-bit with DMA. 
> This solved the problem.
>
> Another interesting thing I found (while I'm talking about my goofy 
> array) was that the system performed MUCH better when I disabled write 
> caching on all of the hard drives.
>
> The filesystem is ReiserFS 3.6, of course.
>
> --Dan


