* new problem has developed
@ 2003-10-26 20:08 maarten van den Berg
0 siblings, 0 replies; 2+ messages in thread
From: maarten van den Berg @ 2003-10-26 20:08 UTC (permalink / raw)
To: linux-raid
Hi list,
Last week I asked about restoring a raid5 array with multiple failed disks.
Thanks for your help, I've got it back online in degraded mode and was able
to burn 10 DVD+RWs as a backup measure. Now I have a new problem though.
I added a spare new disk to the degraded array after making said backups and
it started resyncing. I went to sleep, only to find out next morning that the
resync was only at 5.3% and speed was 5K/sec (!). The system was still
responsive, no runaway processes and no sign of any hardware trouble in
/var/log/messages. I killed the machine and retried... with the same result:
at exactly 5.3% the speed starts dropping until it is near zero.
For various reasons I decided to decommission the old hardware (AMD K6) and I
built a newer (and 100% known-good) board in it earlier today. That makes a
BIG difference in initial speed: I now get 14000K/sec instead of the crawl the
dead-slow AMD K6 managed. However, at 5.2% the speed drops significantly. We're
now back at 5.3% and speed has dropped from 13000K/sec to 170K/sec and
continues to drop.
I already investigated on the old machine with several tools: mdadm of course,
but also iostat, while keeping an eye on /var/log/messages. All seems proper.
Also, immediately after "rescuing" the array from the multiple disk failure I
ran a long reiserfsck --check on the volume, which found no problems at all.
I'm now at a loss... Does anyone know what to monitor or check first?
I'm unsure if this could be due to a disk hardware fault, but then it would
surely show up in syslog, right? Could disk corruption be the culprit? My
guess would be "no": not only does the reiserfs on top of md0 test fine, but
these are on different layers anyhow, correct?
One last remark: once this state occurs (near the 5.3% mark), any command that
tries to query the array hangs indefinitely (umount, mount, mdadm, even df).
Those commands are unkillable, which also means there is no way to reboot the
machine except by the reset switch (shutdown hangs forever).
Apart from those commands the machine is still totally responsive (on another
terminal). Needless to say this bugs me enormously... :-(
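Since the frustrating part is not being able to tell a slow-but-alive resync
from one that has truly stalled, a small watchdog can pull the current speed
out of /proc/mdstat and alarm when it crawls. A hedged sketch follows; the
mdstat text is a fabricated sample standing in for the real /proc/mdstat, and
the 1000K/sec threshold is an arbitrary assumption:

```shell
# Sketch: extract the resync speed from /proc/mdstat so a loop or cron
# job can distinguish "slow but moving" from "ground to a halt".
# sample_mdstat stands in for `cat /proc/mdstat` on the real machine.
sample_mdstat() {
cat <<'EOF'
md0 : active raid5 hdk1[6] hdi1[5] hdg1[4] hde1[3] hdc1[2] hdb1[1] hda1[0]
      480238976 blocks level 5, 32k chunk, algorithm 2 [7/6] [UUUUUU_]
      [=>...................]  recovery =  5.3% (4238976/80039808) finish=7843.1min speed=170K/sec
EOF
}

# Pull the speed (in K/sec) out of the recovery/resync line.
resync_speed() {
    sample_mdstat | sed -n 's/.*speed=\([0-9]*\)K\/sec.*/\1/p'
}

speed=$(resync_speed)
echo "resync speed: ${speed}K/sec"
# An arbitrary threshold; alarm instead of waiting blind overnight.
if [ "$speed" -lt 1000 ]; then
    echo "resync crawling -- investigate before assuming a crash"
fi
```

Run periodically, this at least replaces "it looks hung" with a number.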
Some info:
The machine has a boot disk which provides "/" and a full Linux system. Apart
from that it has seven (7) 80GB disks connected to two Promise adapters
(100TX2). Those disks should be a raid5 array with one spare (but they are
obviously in some in-between state right now). The kernel was 2.4.10 and is
now 2.4.18.
If anyone can help, that would be greatly appreciated...
Maarten
--
Yes of course I'm sure it's the red cable. I guarante[^%!/+)F#0c|'NO CARRIER
* Re: new problem has developed
[not found] <Pine.LNX.4.44.0310290050190.10948-100000@coffee.psychology.mcmaster.ca>
@ 2003-10-30 13:14 ` maarten van den Berg
From: maarten van den Berg @ 2003-10-30 13:14 UTC (permalink / raw)
To: Mark Hahn, linux-raid
On Wednesday 29 October 2003 06:52, Mark Hahn wrote:
> > For various reasons I decided to decommission the old hardware (AMD K6)
> > and I built a newer (and 100% known-good) board in it earlier today. That
> > makes a BIG difference in initial speed: I now get 14000K/sec instead of
> > the crawl the dead-slow AMD K6 managed. However, at 5.2% the speed drops
> > significantly. We're now back at 5.3% and speed has dropped from
> > 13000K/sec to 170K/sec and continues to drop.
>
> this sort of thing *can* actually occur because of sick disks.
Thanks for replying. Yes, it was a bad disk and I solved it eventually.
> > I investigated already on the old machine with several tools, of course
> > mdadm, but also iostat and keeping an eye on /var/log/messages. All
> > seems proper.
>
> smartctl on the disks?
If only my BIOS would support that... :-(
I don't know if it's the main BIOS or the Promise cards that must support it,
but 'ide-smart' just gives no output at all.
I did a 'badblocks' on one disk that was part of the array but had already
been kicked from it twice. Lo and behold, starting at about 4GB it developed a
problem (slow reads due to endless retries). As I desperately NEEDED this
drive (my array was already degraded!) I decided to use 'dd_rescue' to clone
it to a good disk and re-assemble the array from there. The dd_rescue
operation took more than 30 hours (!) and showed that there were problems
around the 4GB and 71GB marks. Several MB could not be recovered (which is
close to nothing, percentage-wise).
Mdadm then reassembled the array with the fresh drive, and the subsequent
hot-add went as fast as it should. One day later I added a new hot-spare.
All is well now. I will surely find corrupted data at some point due to the
missing MBs, but I see no way to avoid that anyhow.
I just hope it was a file, not reiserfs meta-data, that got killed.
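The recovery sequence above can be sketched as below. All device names are
assumptions (/dev/hde as the failing disk, /dev/hdg as the fresh clone target,
hdk1 as the new spare) and the member list is illustrative, so this is a
dry-run plan rather than something to paste verbatim:

```shell
# Hedged sketch of the rescue sequence. DRYRUN=1 only prints the plan;
# every device name here is hypothetical -- check against your own
# setup before running anything for real.
DRYRUN=1

run() {
    if [ "$DRYRUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# 1. Clone the failing disk onto the fresh one; dd_rescue keeps going
#    past unreadable sectors instead of aborting like plain dd.
run dd_rescue /dev/hde /dev/hdg

# 2. Reassemble the degraded array with the clone standing in for the
#    failing member.
run mdadm --assemble /dev/md0 /dev/hda1 /dev/hdb1 /dev/hdc1 /dev/hdg1

# 3. Hot-add a replacement so md resyncs onto it.
run mdadm /dev/md0 --add /dev/hdk1
```

Keeping it as a printed plan first makes it easy to sanity-check the device
names before touching an already-degraded array.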
Taking into account that dd_rescue took 30 hours, it stands to reason that the
resync might have worked after all, if only I had let it run longer. The
problem is partly that the resync just seems to grind to a halt, whereas
dd_rescue is much more verbose about what it is doing. If I could have
distinguished between a 'crash' and a slow process (that still works, albeit
slowly) this probably wouldn't have happened. Well, now we know...
> > I'm unsure if this could be due to a disk hardware fault but then it
> > would surely show up in syslog, right ?
>
> no. there's no syslog-over-ata/scsi afaict ;)
>
> > Could disk corruption be the culprit ? My
>
> I'd guess vibration. I've seen several kinds of recent disks that, under
> bad conditions (vibration, near-death), just get amazingly slow but
> continue to work. this is, of course, really, really good...
They vibrate, yeah. That's just what happens if you put eight disks together
in a cabinet and put two 120mm papst fans right in front of them... ;-)
(But at least they stay quite cool, really quite cool...)
Maarten
--
Yes of course I'm sure it's the red cable. I guarante[^%!/+)F#0c|'NO CARRIER