* Re: RAID5 causing lockups
@ 2003-06-26 3:54 Corey McGuire
2003-06-26 11:46 ` Mike Black
2003-06-26 13:32 ` Matthew Mitchell
0 siblings, 2 replies; 15+ messages in thread
From: Corey McGuire @ 2003-06-26 3:54 UTC (permalink / raw)
To: linux-raid
Well, two of my drives did have an older BIOS, but the upgrade changed nothing.
I noticed that even unmounted, as long as I didn't raidstop the device, the system still crashes.
I tried paring my BIOS settings down as much as possible, and I'm looking to do the same with the kernel, 2.4.21. I'll try the magic SysRq key, but I can't find my null modem cable to save my life, so I'll have to borrow one from work.
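In case it saves anyone else the digging, here is roughly the SysRq recipe I intend to follow (a sketch from the standard 2.4 docs, not something I've run yet; it assumes the kernel was built with CONFIG_MAGIC_SYSRQ):

    # enable the magic SysRq key at runtime
    echo 1 > /proc/sys/kernel/sysrq
    # when the box wedges, on the local console press:
    #   Alt+SysRq+t  - dump the task list to the kernel log
    #   Alt+SysRq+p  - dump registers
    #   Alt+SysRq+s / u / b  - emergency sync, remount read-only, reboot
    # if the machine survives, pull the output back out of the log with:
    dmesg | tail -n 100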
My server marches on, but without /dev/md2... I'll try just letting it sit overnight with /dev/md2 intact, but for now I need it up, even if only in fits and starts.
Thanks everyone, keep the ideas rolling in.
<sigh>
>Hey folks,
>
>I just upgraded my system from a ~200GB mirror to a ~1TB RAID5, but all has
>not transitioned well.
>
>I really don't know how to debug this issue, though I have tried. I gave
>up this morning before work, but I was going to try the magickey next
>(something I don't really know how to use, but anything for a clue)
>followed by upgrading to 2.4.21.
>
>The lock up is typical to a system with a failing drive; the system is
>responsive to input, but nothing happens. Keyboard works fine, but
>programs become idle (not really crashing.) I tried keeping "top" up,
>hoping I would see something obvious, like raid5syncd doing something
>strange, but if it does, top doesn't update after the problem.
>
>The lockups happen even if the system is doing nothing (other than
>raid5syncd, which is awfully busy since my RAID won't stay up)
>
>If I unmount the RAID5 and RAIDSTOP it, my system will work fine, but I'm
>out 1TB of disk. Right now, I have it running the bare essentials (all
>services on, but my /home directory has only public_html and mail stuff for
>each user.)
>
>Anything I can do to get more information out of this problem? I don't
>really know where to look.
>
>
>System Info
>=======================================================================
>
>My kernel is 2.4.20, my raid tools is raidtools-0.90, no patches on
>anything, home built distro (linux from scratch.) Had been running on a
>mirror for nearly a year.
>
>Each drive on my system is connected to promise UltraATA 100 controllers.
>I have 6 drives and 3 controllers. Each drive is a 200GB WD drive, set to
>"Single/Master" on their channel.
>
>No device has a slave.
>
>Drives are hda hdc hde hdg hdi hdk
>
>------- Each drive is configured exactly like the device below -------
>
>Disk /dev/hda: 255 heads, 63 sectors, 24321 cylinders
>Units = cylinders of 16065 * 512 bytes
>
> Device Boot Start End Blocks Id System
>/dev/hda1 1 319 2562336 fd Linux raid autodetect
>/dev/hda2 320 352 265072+ 82 Linux swap
>/dev/hda3 353 24321 192530992+ fd Linux raid autodetect
>
>------------------------- Here is my raidtab -------------------------
>
>raiddev /dev/md0
> raid-level 1
> chunk-size 32
> nr-raid-disks 2
> nr-spare-disks 0
> persistent-superblock 1
> device /dev/hda1
> raid-disk 0
> device /dev/hdc1
> raid-disk 1
>
>raiddev /dev/md1
> raid-level 1
> chunk-size 32
> nr-raid-disks 2
> nr-spare-disks 0
> persistent-superblock 1
> device /dev/hde1
> raid-disk 0
> device /dev/hdg1
> raid-disk 1
>
>raiddev /dev/md2
> raid-level 5
> chunk-size 32
> nr-raid-disks 6
> nr-spare-disks 0
> persistent-superblock 1
> device /dev/hda3
> raid-disk 0
> device /dev/hdc3
> raid-disk 1
> device /dev/hde3
> raid-disk 2
> device /dev/hdg3
> raid-disk 3
> device /dev/hdi3
> raid-disk 4
> device /dev/hdk3
> raid-disk 5
>
>raiddev /dev/md3
> raid-level 1
> chunk-size 32
> nr-raid-disks 2
> nr-spare-disks 0
> persistent-superblock 1
> device /dev/hdi1
> raid-disk 0
> device /dev/hdk1
> raid-disk 1
>
>-------------------------- Here is my fstab --------------------------
>
># Begin /etc/fstab
>
># filesystem mount-point fs-type options dump fsck-order
>
>/dev/md0 / reiserfs defaults 1 1
>/dev/md1 /mnt/backup reiserfs noauto,defaults 1 3
>/dev/md2 /home reiserfs defaults 1 2
>/dev/hda2 swap swap pri=42 0 0
>/dev/hdc2 swap swap pri=42 0 0
>/dev/hde2 swap swap pri=42 0 0
>/dev/hdg2 swap swap pri=42 0 0
>/dev/hdi2 swap swap pri=42 0 0
>/dev/hdk2 swap swap pri=42 0 0
>proc /proc proc defaults 0 0
>
># End /etc/fstab
>
>=======================================================================
>
>Let me know if I missed anything (probably lots.)
>
>Thanks for your time.
>
>
>/\/\/\/\/\/\ Nothing is foolproof to a talented fool. /\/\/\/\/\/\
>
>coreyfro@coreyfro.com
>http://www.coreyfro.com/
>http://stats.distributed.net/rc5-64/psummary.php3?id=196879
>ICQ : 3168059
>
>-----BEGIN GEEK CODE BLOCK-----
>GCS d--(+) s: a-- C++++$ UBL++>++++ P+ L+ E W+++$ N+ o? K? w++++$>+++++$
>O---- !M--- V- PS+++ PE++(--) Y+ PGP- t--- 5(+) !X- R(+) !tv b-(+)
>Dl++(++++) D++ G+ e>+++ h++(---) r++>+$ y++*>$ H++++ n---(----) p? !au w+
>v- 3+>++ j- G'''' B--- u+++*** f* Quake++++>+++++$
>------END GEEK CODE BLOCK------
>
>Home of Geek Code - http://www.geekcode.com/
>The Geek Code Decoder Page - http://www.ebb.org/ungeek//
>
/\/\/\/\/\/\ Nothing is foolproof to a talented fool. /\/\/\/\/\/\
coreyfro@coreyfro.com
http://www.coreyfro.com/
http://stats.distributed.net/rc5-64/psearch.php3?st=coreyfro
ICQ : 3168059
-----BEGIN GEEK CODE BLOCK-----
GCS !d--(+) s: a- C++++$ UL++>++++ P+ L++>++++ E- W+++$ N++ o? K? w++++$>+++++$ O---- !M--- V- PS+++ PE++(--) Y+ PGP- t--- 5(+) !X- R(+) !tv b-(+) Dl++(++++) D++ G++(-) e>+++ h++(---) r++>+$ y++**>$ H++++ n---(----) p? !au w+ v- 3+>++ j- G'''' B--- u+++*** f* Quake++++>+++++$
------END GEEK CODE BLOCK------
Home of Geek Code - http://www.geekcode.com/
The Geek Code Decoder Page - http://www.ebb.org/ungeek//
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: RAID5 causing lockups
2003-06-26 3:54 RAID5 causing lockups Corey McGuire
@ 2003-06-26 11:46 ` Mike Black
2003-06-26 13:32 ` Matthew Mitchell
1 sibling, 0 replies; 15+ messages in thread
From: Mike Black @ 2003-06-26 11:46 UTC (permalink / raw)
To: Corey McGuire, linux-raid
Why don't you try creating a 3-disk RAID5, then a 4-disk one, and so on?
Perhaps you have one bad disk, which this should point out.
Also... I don't think we ever heard what kind of power supply you have.
You might be overloading your system. If power is the problem, this will probably also show up as good behavior with 3 disks instead of 6.
So... if you find that 3 is OK and 4 is OK but 5 causes problems, try the 5-disk array again with a different drive than the last one you added; if it still fails, you probably have a power problem.
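For concreteness, a 3-disk test array might look something like this in raidtab (a sketch only; /dev/md4 and the choice of member partitions are just an example, and mkraid will wipe whatever superblocks are already on them):

    raiddev /dev/md4
        raid-level              5
        chunk-size              32
        nr-raid-disks           3
        nr-spare-disks          0
        persistent-superblock   1
        device                  /dev/hdc3
        raid-disk               0
        device                  /dev/hde3
        raid-disk               1
        device                  /dev/hdg3
        raid-disk               2

Then mkraid /dev/md4, watch /proc/mdstat through the resync, and repeat with more members until it starts misbehaving.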
----- Original Message -----
From: "Corey McGuire" <coreyfro@coreyfro.com>
To: <linux-raid@vger.kernel.org>
Sent: Wednesday, June 25, 2003 11:54 PM
Subject: Re: RAID5 causing lockups
> Well, two of my drives did have an older bios, but the upgrade changed nothing.
>
> I noticed that even unmounted, as long as I didn't raidstop the device, the system still crashes.
>
> I tried setting down my bios as much as possible, and I am looking to do the same with the kernel, 2.4.21. I'll try the magic
sysrq key, but I can't find my nulmodem cable to save my life, so I'll have to barrow one from work.
>
> My server marches on, but without /dev/md2... I'll try just letting it sit, /dev/md2 intact, over night, but for now, I need it
up, even if it is only for fits and spurts.
>
> Thanks everyone, keep the ideas rolling in.
>
> <sigh>
>
> >Hey folks,
> >
> >I just upgraded my system from a ~200GB mirror to a ~1TB RAID5, but all has
> >not transitioned well.
> >
> >I really don't know how to debug this issue, though I have tried. I gave
> >up this morning before work, but I was going to try the magickey next
> >(something I don't really know how to use, but anything for a clue)
> >followed by upgrading to 2.4.21.
> >
> >The lock up is typical to a system with a failing drive; the system is
> >responsive to input, but nothing happens. Keyboard works fine, but
> >programs become idle (not really crashing.) I tried keeping "top" up,
> >hoping I would see something obvious, like raid5syncd doing something
> >strange, but if it does, top doesn't update after the problem.
> >
> >The lockups happen even if the system is doing nothing (other than
> >raid5syncd, which is awfully busy since my RAID won't stay up)
> >
> >If I unmount the RAID5 and RAIDSTOP it, my system will work fine, but I'm
> >out 1TB of disk. Right now, I have it running the bare essentials (all
> >services on, but my /home directory has only public_html and mail stuff for
> >each user.)
> >
> >Anything I can do to get more information out of this problem? I don't
> >really know where to look.
> >
> >
> >System Info
> >=======================================================================
> >
> >My kernel is 2.4.20, my raid tools is raidtools-0.90, no patches on
> >anything, home built distro (linux from scratch.) Had been running on a
> >mirror for nearly a year.
> >
> >Each drive on my system is connected to promise UltraATA 100 controllers.
> >I have 6 drives and 3 controllers. Each drive is a 200GB WD drive, set to
> >"Single/Master" on their channel.
> >
> >No device has a slave.
> >
> >Drives are hda hdc hde hdg hdi hdk
> >
> >------- Each drive is configured exactly like the device below -------
> >
> >Disk /dev/hda: 255 heads, 63 sectors, 24321 cylinders
> >Units = cylinders of 16065 * 512 bytes
> >
> > Device Boot Start End Blocks Id System
> >/dev/hda1 1 319 2562336 fd Linux raid autodetect
> >/dev/hda2 320 352 265072+ 82 Linux swap
> >/dev/hda3 353 24321 192530992+ fd Linux raid autodetect
> >
> >------------------------- Here is my raidtab -------------------------
> >
> >raiddev /dev/md0
> > raid-level 1
> > chunk-size 32
> > nr-raid-disks 2
> > nr-spare-disks 0
> > persistent-superblock 1
> > device /dev/hda1
> > raid-disk 0
> > device /dev/hdc1
> > raid-disk 1
> >
> >raiddev /dev/md1
> > raid-level 1
> > chunk-size 32
> > nr-raid-disks 2
> > nr-spare-disks 0
> > persistent-superblock 1
> > device /dev/hde1
> > raid-disk 0
> > device /dev/hdg1
> > raid-disk 1
> >
> >raiddev /dev/md2
> > raid-level 5
> > chunk-size 32
> > nr-raid-disks 6
> > nr-spare-disks 0
> > persistent-superblock 1
> > device /dev/hda3
> > raid-disk 0
> > device /dev/hdc3
> > raid-disk 1
> > device /dev/hde3
> > raid-disk 2
> > device /dev/hdg3
> > raid-disk 3
> > device /dev/hdi3
> > raid-disk 4
> > device /dev/hdk3
> > raid-disk 5
> >
> >raiddev /dev/md3
> > raid-level 1
> > chunk-size 32
> > nr-raid-disks 2
> > nr-spare-disks 0
> > persistent-superblock 1
> > device /dev/hdi1
> > raid-disk 0
> > device /dev/hdk1
> > raid-disk 1
> >
> >-------------------------- Here is my fstab --------------------------
> >
> ># Begin /etc/fstab
> >
> ># filesystem mount-point fs-type options dump fsck-order
> >
> >/dev/md0 / reiserfs defaults 1 1
> >/dev/md1 /mnt/backup reiserfs noauto,defaults 1 3
> >/dev/md2 /home reiserfs defaults 1 2
> >/dev/hda2 swap swap pri=42 0 0
> >/dev/hdc2 swap swap pri=42 0 0
> >/dev/hde2 swap swap pri=42 0 0
> >/dev/hdg2 swap swap pri=42 0 0
> >/dev/hdi2 swap swap pri=42 0 0
> >/dev/hdk2 swap swap pri=42 0 0
> >proc /proc proc defaults 0 0
> >
> ># End /etc/fstab
> >
> >=======================================================================
> >
> >Let me know if I missed anything (probably lots.)
> >
> >Thanks for your time.
> >
> >
> >/\/\/\/\/\/\ Nothing is foolproof to a talented fool. /\/\/\/\/\/\
> >
> >coreyfro@coreyfro.com
> >http://www.coreyfro.com/
> >http://stats.distributed.net/rc5-64/psummary.php3?id=196879
> >ICQ : 3168059
> >
> >-----BEGIN GEEK CODE BLOCK-----
> >GCS d--(+) s: a-- C++++$ UBL++>++++ P+ L+ E W+++$ N+ o? K? w++++$>+++++$
> >O---- !M--- V- PS+++ PE++(--) Y+ PGP- t--- 5(+) !X- R(+) !tv b-(+)
> >Dl++(++++) D++ G+ e>+++ h++(---) r++>+$ y++*>$ H++++ n---(----) p? !au w+
> >v- 3+>++ j- G'''' B--- u+++*** f* Quake++++>+++++$
> >------END GEEK CODE BLOCK------
> >
> >Home of Geek Code - http://www.geekcode.com/
> >The Geek Code Decoder Page - http://www.ebb.org/ungeek//
> >
>
>
> /\/\/\/\/\/\ Nothing is foolproof to a talented fool. /\/\/\/\/\/\
>
> coreyfro@coreyfro.com
> http://www.coreyfro.com/
> http://stats.distributed.net/rc5-64/psearch.php3?st=coreyfro
> ICQ : 3168059
>
> -----BEGIN GEEK CODE BLOCK-----
> GCS !d--(+) s: a- C++++$ UL++>++++ P+ L++>++++ E- W+++$ N++ o? K? w++++$>+++++$ O---- !M--- V- PS+++ PE++(--) Y+ PGP- t--- 5(+)
!X- R(+) !tv b-(+) Dl++(++++) D++ G++(-) e>+++ h++(---) r++>+$ y++**>$ H++++ n---(----) p? !au w+ v- 3+>++ j- G'''' B--- u+++*** f*
Quake++++>+++++$
> ------END GEEK CODE BLOCK------
>
> Home of Geek Code - http://www.geekcode.com/
> The Geek Code Decoder Page - http://www.ebb.org/ungeek//
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: RAID5 causing lockups
2003-06-26 3:54 RAID5 causing lockups Corey McGuire
2003-06-26 11:46 ` Mike Black
@ 2003-06-26 13:32 ` Matthew Mitchell
1 sibling, 0 replies; 15+ messages in thread
From: Matthew Mitchell @ 2003-06-26 13:32 UTC (permalink / raw)
To: Corey McGuire; +Cc: linux-raid
Going along with some suggestions of other respondents, be very
suspicious of your power cables. I had a hell of a time getting a
stable raid running because of dodgy power cables. More accurately, the
5v or 12v lines in some of the little plugs were kind of loose, and they
would lose contact. So get in there with a pair of teeny pliers or
tweezers and crimp those babies on tight.
As an aside, has anyone ever experimented with any type of conductive
compound when putting these things together? IMO the cheesiest part of
the setup is the crappy PC-style power connector, and I wondered about
various solutions from compound to soldering on a PCB for a more
reliable means of powering drives.
Best of luck to you.
Corey McGuire wrote:
> Well, two of my drives did have an older bios, but the upgrade changed nothing.
>
> I noticed that even unmounted, as long as I didn't raidstop the device, the system still crashes.
>
> I tried setting down my bios as much as possible, and I am looking to do the same with the kernel, 2.4.21. I'll try the magic sysrq key, but I can't find my nulmodem cable to save my life, so I'll have to barrow one from work.
>
> My server marches on, but without /dev/md2... I'll try just letting it sit, /dev/md2 intact, over night, but for now, I need it up, even if it is only for fits and spurts.
>
> Thanks everyone, keep the ideas rolling in.
--
Matthew Mitchell
Systems Programmer/Administrator matthew@geodev.com
Geophysical Development Corporation phone 713 782 1234
1 Riverway Suite 2100, Houston, TX 77056 fax 713 782 1829
^ permalink raw reply [flat|nested] 15+ messages in thread
* re: RAID5 causing lockups
@ 2003-06-26 17:34 Corey McGuire
2003-06-27 5:02 ` Corey McGuire
0 siblings, 1 reply; 15+ messages in thread
From: Corey McGuire @ 2003-06-26 17:34 UTC (permalink / raw)
To: linux-raid
Much progress has been made, but success is still out of reach.
First of all, 2.4.21 has been very helpful: feedback regarding drive problems is much more verbose. I don't know whom to blame (the RAID people, the ATA people, or the Promise driver people), but immediately I found that one of my controllers was hosing up the works. I moved the devices from that controller to my VIA onboard controller and gained about 5 MB/second on the rebuild speed. I don't know whether this is because 2.4.21 is faster, the VIA controller is faster, I was saturating my PCI bus (since the VIA controller is on the Southbridge), or because I was previously getting these errors with no feedback.
Alas, the problem persists, but I have found out why (90% certain).
Now when there is a crash, the system spits out why and panics. It looks to be HDA (or HDA is getting the blame), and thanks to a seemingly pointless script I wrote to watch the rebuild, I found that the system dies at around 12.5% into the RAID5 rebuild every time.
Bad disk? Maybe, probably, but I'll keep banging my head against it for a while.
Score,
2.4.21 + progress script 1
2.4.20 + crossing fingers 0
I am currently running a kernel with DMA turned off by default. This sounded like a good idea last night, around 4 in the morning, but now it sounds like an exercise in futility. The idea came to me shortly after I was visited by the bovine fairy. She told me that everything can be fixed with "moon pies." I know this apparition was real and not a hallucination because, until last night, I had never heard of moon pies. After a quick search of Google, sure enough: moon pies. They look tasty; maybe she's right.
Score
Bovine fairies 1
Sleep deprivation 0
At any rate, by my calculations, without DMA it will take another 12 hours to get to the 12.5% fail point. I should be back from work by then. Longevity through sloth.
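If it comes to it, a runtime alternative to rebuilding the kernel would be toggling DMA per drive with hdparm. Just a sketch of the standard flags, not something I've tried on this box:

    hdparm -d0 /dev/hda   # turn DMA off for this drive
    hdparm -d1 /dev/hda   # turn it back on
    hdparm -d /dev/hda    # query the current setting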
To answer some questions:
My power situation is good. I have had a lot more juice getting sucked through this power supply before; it used to feed dual P3s with 30 mm Peltiers and three 10,000 RPM Cheetahs. (Peltiers are not worth it; I had to underclock my system and drop the voltage before it would run any cooler.) I think these WDs draw 20 watts peak, 14 otherwise, and my power supply is ~400 watts. It shouldn't be a problem, seeing as how I can run my mirrors just fine for days, yet the box dies minutes after I turn the stripe on.
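Back-of-the-envelope, using those figures (my assumptions, not measurements): 6 drives x 20 W peak = 120 W, which leaves plenty of headroom in a ~400 W supply even with the CPU and fans; the usual gotcha would be the 12 V rail during simultaneous spin-up, not steady-state draw.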
Building smaller RAIDs: yeah, I will give that a whirl, just to make sure HDA is the problem. I don't think I need to yank HDA; I'll just remove it from my raidtab and mkraid again.
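Roughly like this, I expect (a sketch; mkraid rewrites the superblocks, so anything on md2 is toast, and it may insist on --really-force when it sees the old ones):

    raidstop /dev/md2
    # edit /etc/raidtab: drop the /dev/hda3 stanza from the md2 section,
    # set nr-raid-disks to 5, and renumber the remaining raid-disk entries 0-4
    mkraid /dev/md2
    cat /proc/mdstat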
One point I'd like to make: why is a drive failure killing my RAID5? That kinda defeats the purpose.
Here is the aforementioned script plus its results so you can see what I
see.
4tlods.sh (for the love of dog, sync! I said I was sleep deprived.)
while ((1)) ; do top -n 1 | head -n 20 ; echo ; cat /proc/mdstat ; done
2.4.21
12:12am up 19 min, 5 users, load average: 0.87, 1.06, 0.82
49 processes: 48 sleeping, 1 running, 0 zombie, 0 stopped
CPU states: 1.0% user, 52.5% system, 0.0% nice, 46.3% idle
Mem: 516592K av, 95204K used, 421388K free, 0K shrd, 52588K buff
Swap: 1590384K av, 0K used, 1590384K free 17196K cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
1 root 9 0 504 504 440 S 0.0 0.0 0:06 init
2 root 9 0 0 0 0 SW 0.0 0.0 0:00 keventd
 3 root 19 19 0 0 0 SWN 0.0 0.0 0:00 ksoftirqd_CPU0
4 root 9 0 0 0 0 SW 0.0 0.0 0:00 kswapd
5 root 9 0 0 0 0 SW 0.0 0.0 0:00 bdflush
6 root 9 0 0 0 0 SW 0.0 0.0 0:00 kupdated
7 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 mdrecoveryd
8 root 7 -20 0 0 0 SW< 0.0 0.0 6:32 raid5d
9 root 19 19 0 0 0 DWN 0.0 0.0 1:08 raid5syncd
10 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 raid1d
11 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 raid1d
12 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 raid1d
13 root 9 0 0 0 0 SW 0.0 0.0 0:00 kreiserfsd
Personalities : [raid1] [raid5]
read_ahead 1024 sectors
md0 : active raid1 hdc1[1] hda1[0]
2562240 blocks [2/2] [UU]
md1 : active raid1 hdg1[1] hde1[0]
2562240 blocks [2/2] [UU]
md3 : active raid1 hdk1[1] hdi1[0]
2562240 blocks [2/2] [UU]
md2 : active raid5 hdk3[5] hdi3[4] hdg3[3] hde3[2] hdc3[1] hda3[0]
962654400 blocks level 5, 32k chunk, algorithm 0 [6/6] [UUUUUU]
 [==>..................] resync = 12.5% (24153592/192530880) finish=134.7min speed=20822K/sec
unused devices: <none>
2.4.21
2:38am up 19 min, 1 user, load average: 0.63, 1.13, 0.89
42 processes: 41 sleeping, 1 running, 0 zombie, 0 stopped
CPU states: 0.9% user, 52.1% system, 0.0% nice, 46.8% idle
Mem: 516592K av, 89824K used, 426768K free, 0K shrd, 57908K buff
Swap: 0K av, 0K used, 0K free 10644K cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
1 root 8 0 504 504 440 S 0.0 0.0 0:06 init
2 root 9 0 0 0 0 SW 0.0 0.0 0:00 keventd
 3 root 19 19 0 0 0 SWN 0.0 0.0 0:00 ksoftirqd_CPU0
4 root 9 0 0 0 0 SW 0.0 0.0 0:00 kswapd
5 root 9 0 0 0 0 SW 0.0 0.0 0:00 bdflush
6 root 9 0 0 0 0 SW 0.0 0.0 0:00 kupdated
7 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 mdrecoveryd
8 root 15 -20 0 0 0 SW< 0.0 0.0 6:29 raid5d
9 root 19 19 0 0 0 DWN 0.0 0.0 1:09 raid5syncd
14 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 raid1d
15 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 raid1syncd
16 root 9 0 0 0 0 SW 0.0 0.0 0:00 kreiserfsd
74 root 9 0 616 616 512 S 0.0 0.1 0:00 syslogd
Personalities : [raid1] [raid5]
read_ahead 1024 sectors
md0 : active raid1 hdc1[1] hda1[0]
2562240 blocks [2/2] [UU]
resync=DELAYED
md2 : active raid5 hdk3[5] hdi3[4] hdg3[3] hde3[2] hdc3[1] hda3[0]
962654400 blocks level 5, 32k chunk, algorithm 0 [6/6] [UUUUUU]
 [==>..................] resync = 12.5% (24153596/192530880) finish=139.2min speed=20147K/sec
unused devices: <none>
2.4.20
3:22am up 21 min, 1 user, load average: 1.04, 1.31, 1.02
47 processes: 46 sleeping, 1 running, 0 zombie, 0 stopped
CPU states: 0.9% user, 54.7% system, 0.0% nice, 44.2% idle
Mem: 516604K av, 125824K used, 390780K free, 0K shrd, 91628K buff
Swap: 1590384K av, 0K used, 1590384K free 10796K cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
1 root 9 0 504 504 440 S 0.0 0.0 0:10 init
2 root 9 0 0 0 0 SW 0.0 0.0 0:00 keventd
3 root 9 0 0 0 0 SW 0.0 0.0 0:00 kapmd
 4 root 18 19 0 0 0 SWN 0.0 0.0 0:00 ksoftirqd_CPU0
5 root 9 0 0 0 0 SW 0.0 0.0 0:00 kswapd
6 root 9 0 0 0 0 SW 0.0 0.0 0:00 bdflush
7 root 9 0 0 0 0 SW 0.0 0.0 0:00 kupdated
8 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 mdrecoveryd
9 root 4 -20 0 0 0 SW< 0.0 0.0 7:16 raid5d
10 root 19 19 0 0 0 DWN 0.0 0.0 1:07 raid5syncd
11 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 raid1d
12 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 raid1syncd
13 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 raid1d
Personalities : [raid1] [raid5] [multipath]
read_ahead 1024 sectors
md0 : active raid1 hdc1[1] hda1[0]
2562240 blocks [2/2] [UU]
resync=DELAYED
md1 : active raid1 hdg1[1] hde1[0]
2562240 blocks [2/2] [UU]
resync=DELAYED
md3 : active raid1 hdk1[1] hdi1[0]
2562240 blocks [2/2] [UU]
resync=DELAYED
md2 : active raid5 hdk3[5] hdi3[4] hdg3[3] hde3[2] hdc3[1] hda3[0]
962654400 blocks level 5, 32k chunk, algorithm 0 [6/6] [UUUUUU]
 [==>..................] resync = 12.5% (24155416/192530880) finish=181.1min speed=15487K/sec
unused devices: <none>
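For anyone who wants to reuse the watcher script above, a gentler variant (just a sketch: the sleep keeps it from hammering the box, and the log file means the last snapshot survives a lockup as long as the log lives on a disk outside the array):

    while true ; do { date ; top -b -n 1 | head -n 20 ; echo ; cat /proc/mdstat ; } | tee -a /var/log/4tlods.log ; sleep 10 ; done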
Thanks for your help everyone, I'll keep trying.
/\/\/\/\/\/\ Nothing is foolproof to a talented fool. /\/\/\/\/\/\
coreyfro@coreyfro.com
http://www.coreyfro.com/
http://stats.distributed.net/rc5-64/psummary.php3?id=196879
ICQ : 3168059
-----BEGIN GEEK CODE BLOCK-----
GCS d--(+) s: a-- C++++$ UBL++>++++ P+ L+ E W+++$ N+ o? K? w++++$>+++++$
O---- !M--- V- PS+++ PE++(--) Y+ PGP- t--- 5(+) !X- R(+) !tv b-(+)
Dl++(++++) D++ G+ e>+++ h++(---) r++>+$ y++*>$ H++++ n---(----) p? !au w+
v- 3+>++ j- G'''' B--- u+++*** f* Quake++++>+++++$
------END GEEK CODE BLOCK------
Home of Geek Code - http://www.geekcode.com/
The Geek Code Decoder Page - http://www.ebb.org/ungeek//
^ permalink raw reply [flat|nested] 15+ messages in thread
* re: RAID5 causing lockups
2003-06-26 17:34 Corey McGuire
@ 2003-06-27 5:02 ` Corey McGuire
2003-06-27 5:32 ` Mike Dresser
0 siblings, 1 reply; 15+ messages in thread
From: Corey McGuire @ 2003-06-27 5:02 UTC (permalink / raw)
To: linux-raid
Whoa!!!
The strangest thing happened when I hit 12.7% on my RAID5 rebuild:
9:56pm up 14:16, 3 users, load average: 3.33, 2.85, 2.59
51 processes: 44 sleeping, 6 running, 1 zombie, 0 stopped
CPU states: 1.2% user, 10.3% system, 0.0% nice, 4.8% idle
Mem: 516592K av, 511704K used, 4888K free, 0K shrd, 89408K buff
Swap: 1590384K av, 264K used, 1590120K free 394204K cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
13 root 7 -20 0 0 0 SW< 25.8 0.0 0:26 raid1d
4299 root 0 -20 0 0 0 SW< 22.0 0.0 0:31 raid1d
4303 root 19 19 0 0 0 RWN 16.2 0.0 0:12 raid1syncd
6 root 9 0 0 0 0 SW 13.4 0.0 15:41 kupdated
14 root 20 19 0 0 0 RWN 7.6 0.0 0:11 raid1syncd
8 root -1 -20 0 0 0 SW< 5.7 0.0 29:37 raid5d
31151 root 10 0 0 0 0 Z 0.9 0.0 0:00 top <defunct>
31153 root 10 0 920 916 716 R 0.9 0.1 0:00 top
1 root 9 0 504 504 440 S 0.0 0.0 2:37 init
2 root 9 0 0 0 0 SW 0.0 0.0 0:02 keventd
3 root 19 19 0 0 0 SWN 0.0 0.0 35:11 ksoftirqd_CPU0
4 root 9 0 0 0 0 SW 0.0 0.0 0:37 kswapd
5 root 9 0 0 0 0 SW 0.0 0.0 0:00 bdflush
Personalities : [raid1] [raid5]
read_ahead 1024 sectors
md1 : active raid1 hdg1[1] hde1[0]
2562240 blocks [2/2] [UU]
[>....................] resync = 2.9% (75904/2562240) finish=50.8min speed=814K/sec
md0 : active raid1 hdc1[1] hda1[0]
2562240 blocks [2/2] [UU]
[>....................] resync = 2.7% (70656/2562240) finish=53.7min speed=769K/sec
md2 : active raid5 hdk3[5] hdi3[4] hdg3[3] hde3[2] hdc3[1] hda3[0](F)
962654400 blocks level 5, 32k chunk, algorithm 0 [6/5] [_UUUUU]
unused devices: <none>
It stopped rebuilding and moved on to my mirrors... very odd. I'll try forcing another rebuild, but this is quasi-good news.
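Before I force anything, roughly the plan for the suspect drive (a sketch; raidhotremove is the raidtools way to drop an already-failed member, badblocks in its default read-only mode is non-destructive, and smartctl assumes smartmontools is installed):

    raidhotremove /dev/md2 /dev/hda3   # drop the failed member from the array
    badblocks -sv /dev/hda3            # read-only surface scan of the partition
    smartctl -a /dev/hda               # the drive's own error log, if available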
*********** REPLY SEPARATOR ***********
On 6/26/2003 at 10:34 AM Corey McGuire wrote:
>Much progress has been made, but success is still out of reach.
>
>First of all, 2.4.21 has been very helpful. Feedback regarding drive
>problems is much more verbose. I don't know who to blame, the RAID people,
>the ATA people, or the promise driver people, but immediately, I found that
>one of my controllers was hosing up the works. I moved the devices from
>said controller to my VIA onboard controller and gained about 5MB/second on
>the rebuild speed. I don't know if this is because 2.4.21 is faster, VIA
>is faster, I was saturating my PCI bus (since the VIA controller in on the
>Southbridge) or because I was previously getting these errors and no
>feedback.
>
>Alas, problem persists, but I have found out why (90% certain.)
>
>Now when there is a crash, the system spits out why and panics. It looks
>to be HDA (or HDA is getting the blame) and, thanks to a seemingly
>pointless script I wrote to watch the rebuild, I found that the system dies
>at around 12.5% on the RAID5 rebuild every time.
>
>Bad disk? Maybe, probably, but I'll keep banging my head against it for a
>while.
>
>Score,
>2.4.21 + progress script 1
>2.4.20 + crossing fingers 0
>
>I am currently running a kernel with DMA turned off by default. This
>sounded like a good idea last night, around 4 in the morning, but now it
>sounds like an exercise in futility. The idea came to me shortly after I
>was visited by the bovine-fairy. She told me that everything can be fixed
>with "moon pies." I know this apparition was real and not a hallucination
>because, until last night, I had never heard of "moon pies." After a quick
>search of google, sure enough, moon pies; they look tasty, maybe she's
>right.
>
>Score
>Bovine fairies 1
>Sleep depravation 0
>
>At any rate, by my calculations, without DMA, it will take another 12hours
>to get to the 12.5% fail point. I should be back from work by then.
>Longevity through sloth.
>
>To answer some questions,
>
>My power situation is good. I have had a lot more juice getting sucked
>through this power supply before. Used to be a dual P3's with 30MM
>Peltiers and 3 10,000 RPM cheetahs. (Peltiers are not worth it, I had to
>underclock my system and drop the voltage before it would run any cooler.)
>I think these WD's draw 20 watts peak, 14 otherwise. My power supply is
>~400 watts. Shouldn't be a problem, seeing as how I can run my mirrors
>just fine for days, but die after turning my stripe on for minutes.
>
>Building smaller RAID's. Yeah, I will give that a whirl, just to make sure
>HDA is the problem. I don't think I need to yank HDA, I'll just remove it
>from my RAIDTAB and mkraid again.
>
>One point I'd like to make; why is a drive failure killing my RAID5? Kinda
>defeats the purpose.
>
>Here is the aforementioned script plus its results so you can see what I
>see.
>
>4tlods.sh (for the love of dog, sync! I said I was sleep deprived.)
>
>while ((1)) ; do top -n 1 | head -n 20 ; echo ; cat /proc/mdstat ; done
>
>2.4.21
>
>12:12am up 19 min, 5 users, load average: 0.87, 1.06, 0.82
>49 processes: 48 sleeping, 1 running, 0 zombie, 0 stopped
>CPU states: 1.0% user, 52.5% system, 0.0% nice, 46.3% idle
>Mem: 516592K av, 95204K used, 421388K free, 0K shrd, 52588K
>buff
>Swap: 1590384K av, 0K used, 1590384K free 17196K
>cached
>
> PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
> 1 root 9 0 504 504 440 S 0.0 0.0 0:06 init
> 2 root 9 0 0 0 0 SW 0.0 0.0 0:00 keventd
> 3 root 19 19 0 0 0 SWN 0.0 0.0 0:00
>ksoftirqd_CPU0
> 4 root 9 0 0 0 0 SW 0.0 0.0 0:00 kswapd
> 5 root 9 0 0 0 0 SW 0.0 0.0 0:00 bdflush
> 6 root 9 0 0 0 0 SW 0.0 0.0 0:00 kupdated
> 7 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 mdrecoveryd
> 8 root 7 -20 0 0 0 SW< 0.0 0.0 6:32 raid5d
> 9 root 19 19 0 0 0 DWN 0.0 0.0 1:08 raid5syncd
> 10 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 raid1d
> 11 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 raid1d
> 12 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 raid1d
> 13 root 9 0 0 0 0 SW 0.0 0.0 0:00 kreiserfsd
>
>Personalities : [raid1] [raid5]
>read_ahead 1024 sectors
>md0 : active raid1 hdc1[1] hda1[0]
> 2562240 blocks [2/2] [UU]
>
>md1 : active raid1 hdg1[1] hde1[0]
> 2562240 blocks [2/2] [UU]
>
>md3 : active raid1 hdk1[1] hdi1[0]
> 2562240 blocks [2/2] [UU]
>
>md2 : active raid5 hdk3[5] hdi3[4] hdg3[3] hde3[2] hdc3[1] hda3[0]
> 962654400 blocks level 5, 32k chunk, algorithm 0 [6/6] [UUUUUU]
> [==>..................] resync = 12.5% (24153592/192530880)
>finish=134.7min speed=20822K/sec
>unused devices: <none>
>
>
>2.4.21
>
>2:38am up 19 min, 1 user, load average: 0.63, 1.13, 0.89
>42 processes: 41 sleeping, 1 running, 0 zombie, 0 stopped
>CPU states: 0.9% user, 52.1% system, 0.0% nice, 46.8% idle
>Mem: 516592K av, 89824K used, 426768K free, 0K shrd, 57908K
>buff
>Swap: 0K av, 0K used, 0K free 10644K
>cached
>
> PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
> 1 root 8 0 504 504 440 S 0.0 0.0 0:06 init
> 2 root 9 0 0 0 0 SW 0.0 0.0 0:00 keventd
> 3 root 19 19 0 0 0 SWN 0.0 0.0 0:00
>ksoftirqd_CPU0
> 4 root 9 0 0 0 0 SW 0.0 0.0 0:00 kswapd
> 5 root 9 0 0 0 0 SW 0.0 0.0 0:00 bdflush
> 6 root 9 0 0 0 0 SW 0.0 0.0 0:00 kupdated
> 7 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 mdrecoveryd
> 8 root 15 -20 0 0 0 SW< 0.0 0.0 6:29 raid5d
> 9 root 19 19 0 0 0 DWN 0.0 0.0 1:09 raid5syncd
> 14 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 raid1d
> 15 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 raid1syncd
> 16 root 9 0 0 0 0 SW 0.0 0.0 0:00 kreiserfsd
> 74 root 9 0 616 616 512 S 0.0 0.1 0:00 syslogd
>
>Personalities : [raid1] [raid5]
>read_ahead 1024 sectors
>md0 : active raid1 hdc1[1] hda1[0]
> 2562240 blocks [2/2] [UU]
> resync=DELAYED
>md2 : active raid5 hdk3[5] hdi3[4] hdg3[3] hde3[2] hdc3[1] hda3[0]
> 962654400 blocks level 5, 32k chunk, algorithm 0 [6/6] [UUUUUU]
> [==>..................] resync = 12.5% (24153596/192530880)
>finish=139.2min speed=20147K/sec
>unused devices: <none>
>
>
>2.4.20
>
>3:22am up 21 min, 1 user, load average: 1.04, 1.31, 1.02
>47 processes: 46 sleeping, 1 running, 0 zombie, 0 stopped
>CPU states: 0.9% user, 54.7% system, 0.0% nice, 44.2% idle
>Mem: 516604K av, 125824K used, 390780K free, 0K shrd, 91628K
>buff
>Swap: 1590384K av, 0K used, 1590384K free 10796K
>cached
>
> PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
> 1 root 9 0 504 504 440 S 0.0 0.0 0:10 init
> 2 root 9 0 0 0 0 SW 0.0 0.0 0:00 keventd
> 3 root 9 0 0 0 0 SW 0.0 0.0 0:00 kapmd
> 4 root 18 19 0 0 0 SWN 0.0 0.0 0:00
>ksoftirqd_CPU0
> 5 root 9 0 0 0 0 SW 0.0 0.0 0:00 kswapd
> 6 root 9 0 0 0 0 SW 0.0 0.0 0:00 bdflush
> 7 root 9 0 0 0 0 SW 0.0 0.0 0:00 kupdated
> 8 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 mdrecoveryd
> 9 root 4 -20 0 0 0 SW< 0.0 0.0 7:16 raid5d
> 10 root 19 19 0 0 0 DWN 0.0 0.0 1:07 raid5syncd
> 11 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 raid1d
> 12 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 raid1syncd
> 13 root -1 -20 0 0 0 SW< 0.0 0.0 0:00 raid1d
>
>Personalities : [raid1] [raid5] [multipath]
>read_ahead 1024 sectors
>md0 : active raid1 hdc1[1] hda1[0]
> 2562240 blocks [2/2] [UU]
> resync=DELAYED
>md1 : active raid1 hdg1[1] hde1[0]
> 2562240 blocks [2/2] [UU]
> resync=DELAYED
>md3 : active raid1 hdk1[1] hdi1[0]
> 2562240 blocks [2/2] [UU]
> resync=DELAYED
>md2 : active raid5 hdk3[5] hdi3[4] hdg3[3] hde3[2] hdc3[1] hda3[0]
> 962654400 blocks level 5, 32k chunk, algorithm 0 [6/6] [UUUUUU]
> [==>..................] resync = 12.5% (24155416/192530880)
>finish=181.1min speed=15487K/sec
>unused devices: <none>
>
>
>Thanks for your help everyone, I'll keep trying.
>
>
>/\/\/\/\/\/\ Nothing is foolproof to a talented fool. /\/\/\/\/\/\
>
>coreyfro@coreyfro.com
>http://www.coreyfro.com/
>http://stats.distributed.net/rc5-64/psummary.php3?id=196879
>ICQ : 3168059
>
>-----BEGIN GEEK CODE BLOCK-----
>GCS d--(+) s: a-- C++++$ UBL++>++++ P+ L+ E W+++$ N+ o? K? w++++$>+++++$
>O---- !M--- V- PS+++ PE++(--) Y+ PGP- t--- 5(+) !X- R(+) !tv b-(+)
>Dl++(++++) D++ G+ e>+++ h++(---) r++>+$ y++*>$ H++++ n---(----) p? !au w+
>v- 3+>++ j- G'''' B--- u+++*** f* Quake++++>+++++$
>------END GEEK CODE BLOCK------
>
>Home of Geek Code - http://www.geekcode.com/
>The Geek Code Decoder Page - http://www.ebb.org/ungeek//
>
/\/\/\/\/\/\ Nothing is foolproof to a talented fool. /\/\/\/\/\/\
coreyfro@coreyfro.com
http://www.coreyfro.com/
http://stats.distributed.net/rc5-64/psearch.php3?st=coreyfro
ICQ : 3168059
-----BEGIN GEEK CODE BLOCK-----
GCS !d--(+) s: a- C++++$ UL++>++++ P+ L++>++++ E- W+++$ N++ o? K? w++++$>+++++$ O---- !M--- V- PS+++ PE++(--) Y+ PGP- t--- 5(+) !X- R(+) !tv b-(+) Dl++(++++) D++ G++(-) e>+++ h++(---) r++>+$ y++**>$ H++++ n---(----) p? !au w+ v- 3+>++ j- G'''' B--- u+++*** f* Quake++++>+++++$
------END GEEK CODE BLOCK------
Home of Geek Code - http://www.geekcode.com/
The Geek Code Decoder Page - http://www.ebb.org/ungeek//
^ permalink raw reply [flat|nested] 15+ messages in thread
* re: RAID5 causing lockups
2003-06-27 5:02 ` Corey McGuire
@ 2003-06-27 5:32 ` Mike Dresser
2003-06-27 5:47 ` Corey McGuire
0 siblings, 1 reply; 15+ messages in thread
From: Mike Dresser @ 2003-06-27 5:32 UTC (permalink / raw)
To: Corey McGuire; +Cc: linux-raid
On Thu, 26 Jun 2003, Corey McGuire wrote:
> Whoa!!!
>
> The strangest thing happened when I hit 12.7% on my RAID5 rebuild
>
> md2 : active raid5 hdk3[5] hdi3[4] hdg3[3] hde3[2] hdc3[1] hda3[0](F)
> 962654400 blocks level 5, 32k chunk, algorithm 0 [6/5] [_UUUUU]
Isn't that a missing disk? Or do I just not remember RAID5 properly?
> >md2 : active raid5 hdk3[5] hdi3[4] hdg3[3] hde3[2] hdc3[1] hda3[0]
> > 962654400 blocks level 5, 32k chunk, algorithm 0 [6/6] [UUUUUU]
> > [==>..................] resync = 12.5% (24153596/192530880)
Yeah, I thought something had changed here. What's up with that?
There's an (F) by hda, and I think you were talking about hda being bad?
Have you pulled hda out of the RAID and used the manufacturer's utilities
to test this drive?
Mike
^ permalink raw reply [flat|nested] 15+ messages in thread
* re: RAID5 causing lockups
2003-06-27 5:32 ` Mike Dresser
@ 2003-06-27 5:47 ` Corey McGuire
0 siblings, 0 replies; 15+ messages in thread
From: Corey McGuire @ 2003-06-27 5:47 UTC (permalink / raw)
To: linux-raid
>Yeah, I thought something had changed here. What's up with that?
>
>the (F) by hda, and I think you were talkinga bout HDA being bad?
>
>Have you pulled hda out of the raid, and used the manufacturers utilities
>to test this drive?
>
>Mike
Yeah, I saw the little (F)... and the [_UUUUU] bit too.
I just made my floppy, as a matter of fact. Just waiting for the mirrors to rebuild ;-)
12 minutes to go. They are moving at a whopping 800K/sec apiece! Look out!
UDMA is a four-letter word.
I should write the Promise guys, the ATA guys, and the RAID guys to inform everyone that my RAID5 didn't fail with UDMA enabled. Only I'll have to make that sound like a bad thing, because "didn't fail with UDMA enabled" sounds like a good thing, only it's not, 'cuz it should have failed, 'cuz it's bad, and stuff...
This was not the week to give up caffeine...
/\/\/\/\/\/\ Nothing is foolproof to a talented fool. /\/\/\/\/\/\
coreyfro@coreyfro.com
http://www.coreyfro.com/
http://stats.distributed.net/rc5-64/psearch.php3?st=coreyfro
ICQ : 3168059
-----BEGIN GEEK CODE BLOCK-----
GCS !d--(+) s: a- C++++$ UL++>++++ P+ L++>++++ E- W+++$ N++ o? K? w++++$>+++++$ O---- !M--- V- PS+++ PE++(--) Y+ PGP- t--- 5(+) !X- R(+) !tv b-(+) Dl++(++++) D++ G++(-) e>+++ h++(---) r++>+$ y++**>$ H++++ n---(----) p? !au w+ v- 3+>++ j- G'''' B--- u+++*** f* Quake++++>+++++$
------END GEEK CODE BLOCK------
Home of Geek Code - http://www.geekcode.com/
The Geek Code Decoder Page - http://www.ebb.org/ungeek//
^ permalink raw reply [flat|nested] 15+ messages in thread
* RAID5 causing lockups
@ 2003-06-25 19:16 Corey McGuire
2003-06-25 19:28 ` Mike Dresser
2003-06-25 20:36 ` Matt Simonsen
0 siblings, 2 replies; 15+ messages in thread
From: Corey McGuire @ 2003-06-25 19:16 UTC (permalink / raw)
To: alewman, bort, corvus, kratz.franz, blatt.guy, linux-raid,
mario.scalise, harbeck.seth, phil
Hey folks,
I just upgraded my system from a ~200GB mirror to a ~1TB RAID5, but all has
not transitioned well.
I really don't know how to debug this issue, though I have tried. I gave
up this morning before work, but I was going to try the magic SysRq key next
(something I don't really know how to use, but anything for a clue),
followed by upgrading to 2.4.21.
The lockup is typical of a system with a failing drive: the system is
responsive to input, but nothing happens. The keyboard works fine, but
programs become idle (not really crashing). I tried keeping "top" up,
hoping I would see something obvious, like raid5syncd doing something
strange, but if it does, top doesn't update after the problem hits.
The lockups happen even if the system is doing nothing (other than
raid5syncd, which is awfully busy since my RAID won't stay up)
If I unmount the RAID5 and raidstop it, my system will work fine, but I'm
out 1TB of disk. Right now, I have it running the bare essentials (all
services on, but my /home directory has only the public_html and mail stuff
for each user).
Anything I can do to get more information out of this problem? I don't
really know where to look.
System Info
=======================================================================
My kernel is 2.4.20, my RAID tools are raidtools-0.90, no patches on
anything, home-built distro (Linux From Scratch). It had been running on a
mirror for nearly a year.
Each drive in my system is connected to a Promise Ultra ATA/100 controller.
I have 6 drives and 3 controllers. Each drive is a 200GB WD drive, set to
"Single/Master" on its channel.
No device has a slave.
Drives are hda hdc hde hdg hdi hdk
------- Each drive is configured exactly like the device below -------
Disk /dev/hda: 255 heads, 63 sectors, 24321 cylinders
Units = cylinders of 16065 * 512 bytes
Device Boot Start End Blocks Id System
/dev/hda1 1 319 2562336 fd Linux raid autodetect
/dev/hda2 320 352 265072+ 82 Linux swap
/dev/hda3 353 24321 192530992+ fd Linux raid autodetect
------------------------- Here is my raidtab -------------------------
raiddev /dev/md0
    raid-level              1
    chunk-size              32
    nr-raid-disks           2
    nr-spare-disks          0
    persistent-superblock   1
    device                  /dev/hda1
    raid-disk               0
    device                  /dev/hdc1
    raid-disk               1

raiddev /dev/md1
    raid-level              1
    chunk-size              32
    nr-raid-disks           2
    nr-spare-disks          0
    persistent-superblock   1
    device                  /dev/hde1
    raid-disk               0
    device                  /dev/hdg1
    raid-disk               1

raiddev /dev/md2
    raid-level              5
    chunk-size              32
    nr-raid-disks           6
    nr-spare-disks          0
    persistent-superblock   1
    device                  /dev/hda3
    raid-disk               0
    device                  /dev/hdc3
    raid-disk               1
    device                  /dev/hde3
    raid-disk               2
    device                  /dev/hdg3
    raid-disk               3
    device                  /dev/hdi3
    raid-disk               4
    device                  /dev/hdk3
    raid-disk               5

raiddev /dev/md3
    raid-level              1
    chunk-size              32
    nr-raid-disks           2
    nr-spare-disks          0
    persistent-superblock   1
    device                  /dev/hdi1
    raid-disk               0
    device                  /dev/hdk1
    raid-disk               1
-------------------------- Here is my fstab --------------------------
# Begin /etc/fstab
# filesystem mount-point fs-type options dump fsck-order
/dev/md0 / reiserfs defaults 1 1
/dev/md1 /mnt/backup reiserfs noauto,defaults 1 3
/dev/md2 /home reiserfs defaults 1 2
/dev/hda2 swap swap pri=42 0 0
/dev/hdc2 swap swap pri=42 0 0
/dev/hde2 swap swap pri=42 0 0
/dev/hdg2 swap swap pri=42 0 0
/dev/hdi2 swap swap pri=42 0 0
/dev/hdk2 swap swap pri=42 0 0
proc /proc proc defaults 0 0
# End /etc/fstab
=======================================================================
Let me know if I missed anything (probably lots.)
Thanks for your time.
/\/\/\/\/\/\ Nothing is foolproof to a talented fool. /\/\/\/\/\/\
coreyfro@coreyfro.com
http://www.coreyfro.com/
http://stats.distributed.net/rc5-64/psummary.php3?id=196879
ICQ : 3168059
-----BEGIN GEEK CODE BLOCK-----
GCS d--(+) s: a-- C++++$ UBL++>++++ P+ L+ E W+++$ N+ o? K? w++++$>+++++$
O---- !M--- V- PS+++ PE++(--) Y+ PGP- t--- 5(+) !X- R(+) !tv b-(+)
Dl++(++++) D++ G+ e>+++ h++(---) r++>+$ y++*>$ H++++ n---(----) p? !au w+
v- 3+>++ j- G'''' B--- u+++*** f* Quake++++>+++++$
------END GEEK CODE BLOCK------
Home of Geek Code - http://www.geekcode.com/
The Geek Code Decoder Page - http://www.ebb.org/ungeek//
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: RAID5 causing lockups
2003-06-25 19:16 Corey McGuire
@ 2003-06-25 19:28 ` Mike Dresser
2003-06-25 19:41 ` Corey McGuire
2003-06-25 20:36 ` Matt Simonsen
1 sibling, 1 reply; 15+ messages in thread
From: Mike Dresser @ 2003-06-25 19:28 UTC (permalink / raw)
To: Corey McGuire; +Cc: linux-raid
On Wed, 25 Jun 2003, Corey McGuire wrote:
> I have 6 drives and 3 controllers. Each drive is a 200GB WD drive, set to
> "Single/Master" on their channel.
Go get the utility on wdc's site to fix the problems they have, and see
what happens after that.
http://support.wdc.com/download/index.asp#raidno3ware
They have problems with power management, and the drive is kicked out of
the raid array.
This may or may not be the trouble, but at least see if it needs it.
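A quick way to check which firmware each drive is running before and after flashing (just a sketch; hdparm -i only reads the drive's identify data):

    hdparm -i /dev/hda | grep -i fwrev
    # repeat for /dev/hdc, hde, hdg, hdi, hdk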
Mike
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: RAID5 causing lockups
2003-06-25 19:28 ` Mike Dresser
@ 2003-06-25 19:41 ` Corey McGuire
2003-06-25 19:56 ` Mike Dresser
0 siblings, 1 reply; 15+ messages in thread
From: Corey McGuire @ 2003-06-25 19:41 UTC (permalink / raw)
To: linux-raid
NASTY!
Thanks, I'll give that a whirl. Is there a way I can kill all power
manglement outside of the BIOS, just to make sure I prevent this? Once this
is working, I won't even need console blanking.
Should I make sure APM is killed in the kernel config too? I think I may
have turned it on when I added RAID5 support, thinking I'd use idle calls
now that I have six 7200 RPM drives to heat up my case ;-)
The system operated as a mirror just fine for around a year, if that makes a
difference... using two of these drives (hde and hdk, if my memory serves).
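For reference, the sort of drive-side knobs I mean, as a sketch only; support varies by drive and by hdparm version:

    hdparm -S 0 /dev/hda     # disable the drive's spindown timer
    hdparm -B 255 /dev/hda   # disable drive-level APM, if the drive honors it
    hdparm -M 254 /dev/hda   # acoustic management to fastest, newer hdparm only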
*********** REPLY SEPARATOR ***********
On 6/25/2003 at 3:28 PM Mike Dresser wrote:
>On Wed, 25 Jun 2003, Corey McGuire wrote:
>
>> I have 6 drives and 3 controllers. Each drive is a 200GB WD drive, set
>to
>> "Single/Master" on their channel.
>
>Go get the utility on wdc's site to fix the problems they have, and see
>what happens after that.
>
>http://support.wdc.com/download/index.asp#raidno3ware
>
>They have problems with power management, and the drive is kicked out of
>the raid array.
>
>This may or may not be the trouble, but at least see if it needs it.
>
>Mike
/\/\/\/\/\/\ Nothing is foolproof to a talented fool. /\/\/\/\/\/\
coreyfro@coreyfro.com
http://www.coreyfro.com/
http://stats.distributed.net/rc5-64/psummary.php3?id=196879
ICQ : 3168059
-----BEGIN GEEK CODE BLOCK-----
GCS d--(+) s: a-- C++++$ UBL++>++++ P+ L+ E W+++$ N+ o? K? w++++$>+++++$
O---- !M--- V- PS+++ PE++(--) Y+ PGP- t--- 5(+) !X- R(+) !tv b-(+)
Dl++(++++) D++ G+ e>+++ h++(---) r++>+$ y++*>$ H++++ n---(----) p? !au w+
v- 3+>++ j- G'''' B--- u+++*** f* Quake++++>+++++$
------END GEEK CODE BLOCK------
Home of Geek Code - http://www.geekcode.com/
The Geek Code Decoder Page - http://www.ebb.org/ungeek//
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: RAID5 causing lockups
2003-06-25 19:41 ` Corey McGuire
@ 2003-06-25 19:56 ` Mike Dresser
2003-06-25 20:51 ` Corey McGuire
0 siblings, 1 reply; 15+ messages in thread
From: Mike Dresser @ 2003-06-25 19:56 UTC (permalink / raw)
To: Corey McGuire; +Cc: linux-raid
On Wed, 25 Jun 2003, Corey McGuire wrote:
> NASTY!
>
> Thanks, I'll give that a whirl. Is there a way I can kill all power
> manglement outside of the BIOS just to make sure I prevent this? Once this
> is working, I won't even need console blanking.
>
Doh, I made the same mistake I made on IRC a few weeks ago with someone.
Acoustic management, not power management.
*silently beats head against wall*
Sorry about that; I hear the word management and my brain shuts down.
Anyway, the power management is fine. The drive manages its acoustic
noise or something. Search WDC's tech knowledge base for 3ware, and
you'll find the relevant article there.
Mike
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: RAID5 causing lockups
2003-06-25 19:56 ` Mike Dresser
@ 2003-06-25 20:51 ` Corey McGuire
0 siblings, 0 replies; 15+ messages in thread
From: Corey McGuire @ 2003-06-25 20:51 UTC (permalink / raw)
To: linux-raid
>Doh, i made the same mistake I made on irc a few weeks ago with someone.
>
>Accoustic management, not power management
>
>*silently beats head against wall*
ok.... cool... I'll check it out...
>Sorry about that, i hear the word management and my brain shuts down.
bwahahahahahaha!
>Anyways, the power management is fine. The drive manages its accoustic
>noise or something.
I followed the link... I'll try it when I get home...
thanks again
/\/\/\/\/\/\ Nothing is foolproof to a talented fool. /\/\/\/\/\/\
coreyfro@coreyfro.com
http://www.coreyfro.com/
http://stats.distributed.net/rc5-64/psummary.php3?id=196879
ICQ : 3168059
-----BEGIN GEEK CODE BLOCK-----
GCS d--(+) s: a-- C++++$ UBL++>++++ P+ L+ E W+++$ N+ o? K? w++++$>+++++$
O---- !M--- V- PS+++ PE++(--) Y+ PGP- t--- 5(+) !X- R(+) !tv b-(+)
Dl++(++++) D++ G+ e>+++ h++(---) r++>+$ y++*>$ H++++ n---(----) p? !au w+
v- 3+>++ j- G'''' B--- u+++*** f* Quake++++>+++++$
------END GEEK CODE BLOCK------
Home of Geek Code - http://www.geekcode.com/
The Geek Code Decoder Page - http://www.ebb.org/ungeek//
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: RAID5 causing lockups
2003-06-25 19:16 Corey McGuire
2003-06-25 19:28 ` Mike Dresser
@ 2003-06-25 20:36 ` Matt Simonsen
2003-06-25 20:56 ` Corey McGuire
1 sibling, 1 reply; 15+ messages in thread
From: Matt Simonsen @ 2003-06-25 20:36 UTC (permalink / raw)
To: Corey McGuire; +Cc: linux-raid
On Wed, 2003-06-25 at 12:16, Corey McGuire wrote:
> Hey folks,
>
> I just upgraded my system from a ~200GB mirror to a ~1TB RAID5, but all has
> not transitioned well.
Did the RAID array ever finish syncing? It may take a long time
(1048576 MB / 10 MB/sec / 3600 sec/hr = about 29 hours!) ...
I have one (SCSI) system with a slower CPU; while it was syncing the array I was sure
something was wrong. Once the array was up, though, everything has
worked great. I just have to be sure it shuts down cleanly or the rebuild
is painful!
Maybe I'm way off, but I'd just give it a day to see if it eventually
syncs. If you can log in, do a cat of /proc/mdstat every 15 minutes; if
it's making progress, I'd leave it.
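Something like this would do it without babysitting the console (a sketch; watch is in procps, or use the while loop if you'd rather keep a log):

    watch -n 900 cat /proc/mdstat
    # or, appending timestamped snapshots to a file:
    while true ; do date >> /root/mdstat.log ; cat /proc/mdstat >> /root/mdstat.log ; sleep 900 ; done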
Matt
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: RAID5 causing lockups
2003-06-25 20:36 ` Matt Simonsen
@ 2003-06-25 20:56 ` Corey McGuire
[not found] ` <1056575536.24919.101.camel@mattswrk>
0 siblings, 1 reply; 15+ messages in thread
From: Corey McGuire @ 2003-06-25 20:56 UTC (permalink / raw)
To: linux-raid
Drives are silent. No activity at all.
Maybe I'll give it another try tonight. I can give it a bunch more
horsepower (I have it underclocked because the system is basically a big disk);
maybe if I have more available CPU, the arrays will be a bit busier if
they really are rebuilding.
Still, the problem causes top to stop. I wouldn't think that would happen
if the array were just rebuilding...
*********** REPLY SEPARATOR ***********
On 6/25/2003 at 1:36 PM Matt Simonsen wrote:
>On Wed, 2003-06-25 at 12:16, Corey McGuire wrote:
>> Hey folks,
>>
>> I just upgraded my system from a ~200GB mirror to a ~1TB RAID5, but all
>has
>> not transitioned well.
>
>
>Did the RAID array every finish syncing? It may take a long time
>(1048576 megs / 10 mb/sec / 3600 seconds/hr = 29 hours!) ...
>
>I have one (SCSI) system with a slower CPU, syncing the array I was sure
>something was wrong. Once the array was up, though, everything has
>worked great. I just have to be sure it shuts down clean or the rebuild
>is painful!
>
>Maybe I'm way off, but I'd just give it a day to see if it eventually
>syncs. If you can login, do a cat of /proc/mdstat every 15 minutes, if
>it's making progress I'd leave it.
>
>Matt
/\/\/\/\/\/\ Nothing is foolproof to a talented fool. /\/\/\/\/\/\
coreyfro@coreyfro.com
http://www.coreyfro.com/
http://stats.distributed.net/rc5-64/psummary.php3?id=196879
ICQ : 3168059
-----BEGIN GEEK CODE BLOCK-----
GCS d--(+) s: a-- C++++$ UBL++>++++ P+ L+ E W+++$ N+ o? K? w++++$>+++++$
O---- !M--- V- PS+++ PE++(--) Y+ PGP- t--- 5(+) !X- R(+) !tv b-(+)
Dl++(++++) D++ G+ e>+++ h++(---) r++>+$ y++*>$ H++++ n---(----) p? !au w+
v- 3+>++ j- G'''' B--- u+++*** f* Quake++++>+++++$
------END GEEK CODE BLOCK------
Home of Geek Code - http://www.geekcode.com/
The Geek Code Decoder Page - http://www.ebb.org/ungeek//
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread
Thread overview: 15+ messages
2003-06-26 3:54 RAID5 causing lockups Corey McGuire
2003-06-26 11:46 ` Mike Black
2003-06-26 13:32 ` Matthew Mitchell
-- strict thread matches above, loose matches on Subject: below --
2003-06-26 17:34 Corey McGuire
2003-06-27 5:02 ` Corey McGuire
2003-06-27 5:32 ` Mike Dresser
2003-06-27 5:47 ` Corey McGuire
2003-06-25 19:16 Corey McGuire
2003-06-25 19:28 ` Mike Dresser
2003-06-25 19:41 ` Corey McGuire
2003-06-25 19:56 ` Mike Dresser
2003-06-25 20:51 ` Corey McGuire
2003-06-25 20:36 ` Matt Simonsen
2003-06-25 20:56 ` Corey McGuire
[not found] ` <1056575536.24919.101.camel@mattswrk>
2003-06-25 21:16 ` Corey McGuire