* new problem has developed
@ 2003-10-26 20:08 maarten van den Berg
0 siblings, 0 replies; 2+ messages in thread
From: maarten van den Berg @ 2003-10-26 20:08 UTC (permalink / raw)
To: linux-raid
Hi list,
Last week I asked about restoring a raid5 array with multiple failed disks.
Thanks for your help, I've got it back online in degraded mode and was able
to burn 10 DVD+RWs as a backup measure. Now I have a new problem though.
I added a spare new disk to the degraded array after making said backups and
it started resyncing. I went to sleep, only to find out next morning that the
resync was only at 5.3% and speed was 5K/sec (!). The system was still
responsive, no runaway processes and no sign of any hardware trouble in
/var/log/messages. I killed the machine and retried... with the same result:
at exactly 5.3% the speed starts dropping until it is near zero.
For various reasons I decided to decommission the old hardware (AMD K6) and I
built a newer (and 100% known-good) board in it earlier today. That makes a
BIG difference in initial speed: I now get 14000K/sec instead of the crawl the
dead-slow AMD K6 managed. However, at 5.2% the speed drops significantly. We're
now back at 5.3% and speed has dropped from 13000K/sec to 170K/sec and
continues to drop.
I already investigated on the old machine with several tools: mdadm of course,
but also iostat, while keeping an eye on /var/log/messages. All seems proper.
Also, immediately after "rescuing" the array from the multiple disk failure I
ran a long reiserfsck --check on the volume, which found no problems at all.
I'm now at a loss... Does anyone know what to monitor or check first?
I'm unsure if this could be due to a disk hardware fault, but then it would
surely show up in syslog, right? Could disk corruption be the culprit? My
guess would be "no": not only does the reiserfs on top of md0 test fine, but
these are on different layers anyhow, correct?
One last remark: once this state occurs (near the 5.3% mark), any command that
tries to query the array hangs indefinitely (umount, mount, mdadm, even df).
Those commands are unkillable, which also means there is no way to reboot the
machine except by the reset switch (shutdown hangs forever).
Apart from those commands the machine is still totally responsive (on another
terminal). Needless to say this bugs me enormously... :-(
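Since the frustrating part is not being able to tell a slow-but-alive resync
from one that has truly stalled, a small watchdog can pull the current speed
out of /proc/mdstat and alarm when it crawls. A hedged sketch follows; the
mdstat text is a fabricated sample standing in for the real /proc/mdstat, and
the 1000K/sec threshold is an arbitrary assumption:

```shell
# Sketch: extract the resync speed from /proc/mdstat so a loop or cron
# job can distinguish "slow but moving" from "ground to a halt".
# sample_mdstat stands in for `cat /proc/mdstat` on the real machine.
sample_mdstat() {
cat <<'EOF'
md0 : active raid5 hdk1[6] hdi1[5] hdg1[4] hde1[3] hdc1[2] hdb1[1] hda1[0]
      480238976 blocks level 5, 32k chunk, algorithm 2 [7/6] [UUUUUU_]
      [=>...................]  recovery =  5.3% (4238976/80039808) finish=7843.1min speed=170K/sec
EOF
}

# Pull the speed (in K/sec) out of the recovery/resync line.
resync_speed() {
    sample_mdstat | sed -n 's/.*speed=\([0-9]*\)K\/sec.*/\1/p'
}

speed=$(resync_speed)
echo "resync speed: ${speed}K/sec"
# An arbitrary threshold; alarm instead of waiting blind overnight.
if [ "$speed" -lt 1000 ]; then
    echo "resync crawling -- investigate before assuming a crash"
fi
```

Run periodically, this at least replaces "it looks hung" with a number.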
Some info:
The machine has a boot disk which provides "/" and a full Linux system. Apart
from that it has seven (7) 80GB disks connected to two Promise adapters
(100TX2). Those disks should be a raid5 array with one spare (but they are
obviously in some in-between state right now). The kernel was 2.4.10 and is
now 2.4.18.
If anyone can help, that would be greatly appreciated...
Maarten
--
Yes of course I'm sure it's the red cable. I guarante[^%!/+)F#0c|'NO CARRIER
* Re: new problem has developed
[not found] <Pine.LNX.4.44.0310290050190.10948-100000@coffee.psychology.mcmaster.ca>
@ 2003-10-30 13:14 ` maarten van den Berg
From: maarten van den Berg @ 2003-10-30 13:14 UTC (permalink / raw)
To: Mark Hahn, linux-raid
On Wednesday 29 October 2003 06:52, Mark Hahn wrote:
> > For various reasons I decided to decommission the old hardware (AMD K6)
> > and I built a newer (and 100% known-good) board in it earlier today. That
> > makes a BIG difference in initial speed: I now get 14000K/sec instead of
> > the crawl the dead-slow AMD K6 managed. However, at 5.2% the speed drops
> > significantly. We're now back at 5.3% and speed has dropped from
> > 13000K/sec to 170K/sec and continues to drop.
>
> this sort of thing *can* actually occur because of sick disks.
Thanks for replying. Yes, it was a bad disk and I solved it eventually.
> > I investigated already on the old machine with several tools, of course
> > mdadm, but also iostat and keeping an eye on /var/log/messages. All
> > seems proper.
>
> smartctl on the disks?
If only my BIOS would support that... :-(
I don't know if it's the main BIOS or the Promise cards that must support it,
but 'ide-smart' just gives no output at all.
I did a 'badblocks' on one disk that was part of the array but had already
been kicked from it twice. Lo and behold, starting at about 4GB it developed a
problem (slow reads due to endless retries). As I desperately NEEDED this
drive (my array was already degraded!) I decided to use 'dd_rescue' to clone
it to a good disk and re-assemble the array from there. The dd_rescue
operation took more than 30 hours (!) and showed that there were problems
around the 4GB and 71GB marks. Several MB could not be recovered (which is
close to nothing, percentage-wise).
Mdadm then reassembled the array with the fresh drive, and the subsequent
hot-add went as fast as it should. One day later I added a new hot-spare.
All is well now. I will surely find corrupted data at some point due to the
missing MBs, but I see no way to avoid that anyhow.
I just hope it was a file, not reiserfs meta-data, that got killed.
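The recovery sequence above can be sketched as below. All device names are
assumptions (/dev/hde as the failing disk, /dev/hdg as the fresh clone target,
hdk1 as the new spare) and the member list is illustrative, so this is a
dry-run plan rather than something to paste verbatim:

```shell
# Hedged sketch of the rescue sequence. DRYRUN=1 only prints the plan;
# every device name here is hypothetical -- check against your own
# setup before running anything for real.
DRYRUN=1

run() {
    if [ "$DRYRUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# 1. Clone the failing disk onto the fresh one; dd_rescue keeps going
#    past unreadable sectors instead of aborting like plain dd.
run dd_rescue /dev/hde /dev/hdg

# 2. Reassemble the degraded array with the clone standing in for the
#    failing member.
run mdadm --assemble /dev/md0 /dev/hda1 /dev/hdb1 /dev/hdc1 /dev/hdg1

# 3. Hot-add a replacement so md resyncs onto it.
run mdadm /dev/md0 --add /dev/hdk1
```

Keeping it as a printed plan first makes it easy to sanity-check the device
names before touching an already-degraded array.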
Taking into account that dd_rescue took 30 hours, it stands to reason that the
resync might have worked after all, if only I had let it run longer. The
problem is partly that the resync just seems to grind to a halt, whereas
dd_rescue is much more verbose about what it is doing. If I could have
distinguished between a 'crash' and a slow process (that still works, albeit
slowly) this probably wouldn't have happened. Well, now we know...
> > I'm unsure if this could be due to a disk hardware fault but then it
> > would surely show up in syslog, right ?
>
> no. there's no syslog-over-ata/scsi afaict ;)
>
> > Could disk corruption be the culprit ? My
>
> I'd guess vibration. I've seen several kinds of recent disks that, under
> bad conditions (vibration, near-death), just get amazingly slow but
> continue to work. this is, of course, really, really good...
They vibrate, yeah. That's just what happens if you put eight disks together
in a cabinet and put two 120mm papst fans right in front of them... ;-)
(But at least they stay quite cool, really quite cool...)
Maarten
--
Yes of course I'm sure it's the red cable. I guarante[^%!/+)F#0c|'NO CARRIER