* Last ditch plea on remote double raid5 disk failure
From: Marc MERLIN @ 2007-12-31 10:39 UTC
To: H. Peter Anvin, mingo, neilb, linux-raid
Howdy,
Sorry for the direct CCs, I'm not sure if my email to linux-raid will
make it through or not.
Long story short, my main server just died with a double raid failure
today, and I'm on vacation on the other side of the world.
One drive is dead for good; the other one returns an error when I
read at least one block, but seems OK otherwise.
Before I look into doing a remote manual server failover/rebuild over
New Year's Eve :( I was wondering if I can tell the kernel not to kick
a drive out of an array when it sees a block error, but instead return the
block error upstream and continue otherwise (all my partitions are on
a raid5 array, with lvm on top, so even if I were to lose one partition,
I would still likely get the other ones back up if I can stop
the auto-kicking-out that kills the md array).
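For reference, here's roughly how I've been checking the array state
remotely (a sketch; md5 is the raid5 device on my box, yours may differ):
# Summary of all md arrays and which members are up (U) or failed (_)
cat /proc/mdstat
# Detailed per-member state of the raid5 array
mdadm --detail /dev/md5
# Superblock info for the member that's throwing read errors
mdadm --examine /dev/sdd3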
Currently, I get:
/dev/intraid5/usr: recovering journal
sd 0:0:3:0: SCSI error: return code = 0x8000002
sdd: Current: sense key: Medium Error
Additional sense: Unrecovered read error
Info fld=0x89a48
end_request: I/O error, dev sdd, sector 563784
raid5: Disk failure on sdd3, disabling device. Operation continuing on 3 devices
Buffer I/O error on device dm-0, logical block 541
I'm hoping that if I can get raid5 to continue despite the errors, I
can bring back up enough of the server to continue, a bit like the
remount-ro option in ext2/ext3.
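(For reference, the ext3 behaviour I mean is the errors=remount-ro
option; a minimal sketch using my /usr LV:)
# Have the fs drop to read-only on errors instead of continuing
mount -o errors=remount-ro /dev/intraid5/usr /usr
# Or set that as the filesystem's default error behaviour
tune2fs -e remount-ro /dev/intraid5/usr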
If not, oh well...
Thanks,
Marc
* Re: Last ditch plea on remote double raid5 disk failure
From: Neil Brown @ 2007-12-31 12:05 UTC
To: Marc MERLIN; +Cc: H. Peter Anvin, mingo, linux-raid
On Monday December 31, merlin@gmail.com wrote:
>
> I'm hoping that if I can get raid5 to continue despite the errors, I
> can bring back up enough of the server to continue, a bit like the
> remount-ro option in ext2/ext3.
>
> If not, oh well...
Sorry, but it is "oh well".
I could probably make it behave a bit better in this situation, but
not in time for you.
NeilBrown
* Re: Last ditch plea on remote double raid5 disk failure
From: Marc MERLIN @ 2007-12-31 12:40 UTC
To: Neil Brown; +Cc: H. Peter Anvin, mingo, linux-raid
On Dec 31, 2007 1:05 PM, Neil Brown <neilb@suse.de> wrote:
> On Monday December 31, merlin@gmail.com wrote:
> >
> > I'm hoping that if I can get raid5 to continue despite the errors, I
> > can bring back up enough of the server to continue, a bit like the
> > remount-ro option in ext2/ext3.
> >
> > If not, oh well...
>
> Sorry, but it is "oh well".
>
> I could probably make it behave a bit better in this situation, but
> not in time for you.
Understood, thanks much for the answer. I'll work on moving services
to another server now.
If that's a reasonable RFE for the people concerned, it'd be nice to
have a non-default md/raidtab option, or an assemble option, that says
"continue as long as you can if kicking another drive would disable the
array". I'd still be OK with the array staying read-write if block
errors are passed through, since the underlying filesystem would
remount read-only by itself. I think in my case it would have allowed
me to reboot my system with md5 damaged, limping along with one drive
fewer, with the lvm read-write, the /usr partition within it read-only,
and /var (same raid5 md, same LVM VG, but a different LV) still
read-write.
That might be a bit too much wishful thinking, though. Even having
everything still run read-only (again, as a non-default option) would
be nice. No idea how hard or unrealistic that is, but I'm just
throwing the idea out :)
Either way, have a great new year.
Best,
Marc
* Re: Last ditch plea on remote double raid5 disk failure
From: Michael Tokarev @ 2007-12-31 14:19 UTC
To: Neil Brown; +Cc: Marc MERLIN, H. Peter Anvin, mingo, linux-raid
Neil Brown wrote:
> On Monday December 31, merlin@gmail.com wrote:
>> I'm hoping that if I can get raid5 to continue despite the errors, I
>> can bring back up enough of the server to continue, a bit like the
>> remount-ro option in ext2/ext3.
>>
>> If not, oh well...
>
> Sorry, but it is "oh well".
Speaking of all this bad-block handling and dropping devices on
errors: sure, the situation here improved a lot when rewriting
a block after a read error was introduced. That was a very
big step in the right direction. But it is still not sufficient,
I think.
What can be done currently is to extend the bitmap thing to keep more
information. Namely, if a block on one drive fails, and we failed
to rewrite it as well (or there was no way to rewrite it because
the array was already running in degraded mode), don't drop the drive
yet, but fail the original request, AND mark THIS PARTICULAR BLOCK
of THIS PARTICULAR DRIVE as "bad" in the bitmap.
In other words, the bitmap can be extended to cover individual drives
instead of the whole raid device.
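(For context, the bitmap I mean is the write-intent bitmap that can
already be added to a live array; a sketch, with md0 as a stand-in name:)
# Existing feature: add an internal write-intent bitmap to a running array
mdadm --grow /dev/md0 --bitmap=internal
# The proposal: extend this bitmap to also record per-drive bad blocks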
What's more, even if there's no persistent bitmap for the array,
such a thing can still be done anyway, by keeping such a bitmap
in memory only, until the raid array is shut down (at which point
any drive with errors gets marked "bad" as a whole). This way, it's
possible to recover a lot more data without risking losing the whole
array every time.
Further still, until some real write is performed over a "bad"
block, there's no need to record its badness: we can return the same
error again, since the drive is expected to return it on the next read
attempt. It's only a write, a real write, that makes this particular
block become "bad", since we weren't able to write new data to it...
Hm. Even in case of a write failure, we could still keep the whole
drive without marking anything as "bad", again in the hope that the
next read of those blocks will error out again. This is an...
interesting question, really: whether one can rely on a drive to not
return bad (read: random) data after it errored out on a write
operation. I definitely know a case where it's not true: we have a
batch of Seagate drives which seem to have a firmware bug that errors
out on writes with a "Defect list manipulation error" sense code, yet
reads of that very sector still return something, especially after a
fresh boot (after a power-off).
In any case, keeping this info in a bitmap should be sufficient to
stop kicking whole drives out of an array, which is currently the
weakest point in linux software raid (IMHO). As has been pointed
out numerous times before, drives tend to fail several at once, due to
Murphy's law or other factors such as the phase of the Moon (partly
this behaviour can be explained by the fact that after a drive failure,
the other drives receive more I/O requests, especially when
reconstruction starts, and hence have a much greater chance of erroring
out on sectors which had not been read in a long time), and often it
would be trivial to read the missing information from the drive that
was just kicked out of the array, right at the place where another
drive developed a bad sector.
And another thought around all this: linux sw raid definitely needs
a way to proactively replace a (probably failing) drive without removing
it from the array first. Something like
mdadm --add /dev/md0 /dev/sdNEW --inplace /dev/sdFAILING
so that sdNEW becomes a mirror of sdFAILING, and once the "recovery"
procedure finishes (which may use data from the other drives in case of
an I/O error reading sdFAILING, unlike the described scenario of making
a superblock-less mirror of sdNEW and sdFAILING), followed by
mdadm --remove /dev/md0 /dev/sdFAILING
which does not involve any further reconstruction.
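(The scenario I refer to is usually sketched like this; device names
are placeholders, this is an assumption on my part rather than a tested
recipe, and the sync direction of a freshly built raid1 must be
verified before trying it:)
# Stop the raid5, then build a superblock-less raid1 over the failing
# and the new drive, letting md copy sdFAILING's content onto sdNEW
mdadm --stop /dev/md0
mdadm --build /dev/md1 --level=1 --raid-devices=2 /dev/sdFAILING /dev/sdNEW
# Re-assemble the raid5 with the mirror standing in for the old member
mdadm --assemble /dev/md0 /dev/sdOTHER1 /dev/sdOTHER2 /dev/md1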
/mjt
* Re: Last ditch plea on remote double raid5 disk failure
From: Bill Davidsen @ 2007-12-31 15:38 UTC
To: Michael Tokarev
Cc: Neil Brown, Marc MERLIN, H. Peter Anvin, mingo, linux-raid
Michael Tokarev wrote:
> Neil Brown wrote:
>
>> On Monday December 31, merlin@gmail.com wrote:
>>
>>> I'm hoping that if I can get raid5 to continue despite the errors, I
>>> can bring back up enough of the server to continue, a bit like the
>>> remount-ro option in ext2/ext3.
>>>
>>> If not, oh well...
>>>
>> Sorry, but it is "oh well".
>>
> And another thought around all this: linux sw raid definitely needs
> a way to proactively replace a (probably failing) drive without removing
> it from the array first. Something like
> mdadm --add /dev/md0 /dev/sdNEW --inplace /dev/sdFAILING
> so that sdNEW becomes a mirror of sdFAILING, and once the "recovery"
> procedure finishes (which may use data from the other drives in case of
> an I/O error reading sdFAILING, unlike the described scenario of making
> a superblock-less mirror of sdNEW and sdFAILING), followed by
> mdadm --remove /dev/md0 /dev/sdFAILING
> which does not involve any further reconstruction.
>
I really like that idea; it addresses the same problem as the various
posts about creating little raid1 arrays of the old and new drive, etc.
I would like an option to keep a drive with bad sectors in an array if
removing the drive would prevent the array from running (or starting). I
don't think that should be the default, but there are times when some
data is way better than none. I would think the options are: fail the
drive, set the array r/o, or return an error and keep going.
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismarck
* Re: Last ditch plea on remote double raid5 disk failure
From: David Rees @ 2008-01-01 9:56 UTC
To: Marc MERLIN; +Cc: H. Peter Anvin, mingo, neilb, linux-raid
On Dec 31, 2007 2:39 AM, Marc MERLIN <merlin@gmail.com> wrote:
> New Year's Eve :( I was wondering if I can tell the kernel not to kick
> a drive out of an array when it sees a block error, but instead return the
> block error upstream and continue otherwise (all my partitions are on
> a raid5 array, with lvm on top, so even if I were to lose one partition,
> I would still likely get the other ones back up if I can stop
> the auto-kicking-out that kills the md array).
Your best bet is to get a new drive into the machine that is at least
the same size as the bad-sector disk, and use dd_rescue[1] to copy as
much of the bad-sector disk as possible to the new one.
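(Roughly, with Kurt Garloff's dd_rescue from [1], assuming sdd is the
bad-sector disk from your log and sde stands in for the new drive:)
# Copy the failing disk onto the new one; -A zero-fills unreadable
# blocks on the output instead of skipping them, -v is verbose
dd_rescue -A -v /dev/sdd /dev/sde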
Then remove the bad-sector disk, reboot, and hopefully you'll have a
functioning raid array with a bit of bad data on it somewhere.
I'm probably missing a step somewhere, but you get the general idea...
-Dave
[1] http://www.garloff.de/kurt/linux/ddrescue/