* Logging-Loop when a drive in a raid1 fails.
From: Michael Renner @ 2004-12-13 5:52 UTC (permalink / raw)
To: linux-raid
Hi,
One of the drives in a software raid1 failed on a machine running
2.6.9-rc2, leading to the "logging spree" shown in the attached log.
Sorry if this has been fixed in the meantime; it's not that easy to
test the code paths for failing drives with various kernels without
access to special block devices that can fail on demand.
Furthermore, I'm a bit concerned about the overall quality of the md
support in 2.6; this is not the first bug I've seen in code paths that
aren't exercised during normal operation. When all drives are fine the
driver is rock solid, but as soon as a block device gets funky, hell
tends to break loose. I'm a bit surprised because I never had any
problems like this in 2.4. Were there major API or code changes that
could be the cause of this?
Another odd behaviour, for which I no longer have exact information,
occurred when md tried to resync a degraded raid5, hit a bad block on
one of the (supposedly) "good" drives, and entered a tight loop of
resync processes starting and aborting. This was with 2.6.9;
unfortunately I wasn't able to keep records because of time constraints
and rising panic.
best regards,
michael
[-- Attachment #2: md-log.txt --]
Dec 13 02:03:13 stuff kernel: scsi0: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 01 5b ff 3c 00 00 08 00
Dec 13 02:03:13 stuff kernel: Info fld=0x15bff3e, Current sda: sense key Medium Error
Dec 13 02:03:13 stuff kernel: Additional sense: Unrecovered read error
Dec 13 02:03:13 stuff kernel: end_request: I/O error, dev sda, sector 22806332
Dec 13 02:03:13 stuff kernel: raid1: Disk failure on sda2, disabling device.
Dec 13 02:03:13 stuff kernel: ^IOperation continuing on 1 devices
Dec 13 02:03:13 stuff kernel: printk: 77 messages suppressed.
Dec 13 02:03:13 stuff kernel: raid1: sda2: rescheduling sector 18886472
Dec 13 02:03:13 stuff kernel: RAID1 conf printout:
Dec 13 02:03:13 stuff kernel: --- wd:1 rd:2
Dec 13 02:03:13 stuff kernel: disk 0, wo:1, o:0, dev:sda2
Dec 13 02:03:13 stuff kernel: disk 1, wo:0, o:1, dev:sdb2
Dec 13 02:03:13 stuff kernel: RAID1 conf printout:
Dec 13 02:03:13 stuff kernel: --- wd:1 rd:2
Dec 13 02:03:13 stuff kernel: disk 1, wo:0, o:1, dev:sdb2
Dec 13 02:03:13 stuff kernel: raid1: sdb2: redirecting sector 18886472 to another mirror
Dec 13 02:03:13 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:13 stuff kernel: raid1: sdb2: redirecting sector 18886472 to another mirror
Dec 13 02:03:13 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:13 stuff kernel: raid1: sdb2: redirecting sector 18886472 to another mirror
Dec 13 02:03:13 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:13 stuff kernel: raid1: sdb2: redirecting sector 18886472 to another mirror
Dec 13 02:03:13 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:13 stuff kernel: raid1: sdb2: redirecting sector 18886472 to another mirror
Dec 13 02:03:18 stuff kernel: printk: 119408 messages suppressed.
Dec 13 02:03:18 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:23 stuff kernel: printk: 119581 messages suppressed.
Dec 13 02:03:23 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:28 stuff kernel: printk: 125937 messages suppressed.
Dec 13 02:03:28 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:33 stuff kernel: printk: 132411 messages suppressed.
Dec 13 02:03:34 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:38 stuff kernel: printk: 128773 messages suppressed.
Dec 13 02:03:39 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:43 stuff kernel: printk: 129557 messages suppressed.
Dec 13 02:03:44 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:48 stuff kernel: printk: 130365 messages suppressed.
Dec 13 02:03:49 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:53 stuff kernel: printk: 129959 messages suppressed.
Dec 13 02:03:54 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:58 stuff kernel: printk: 125235 messages suppressed.
Dec 13 02:03:59 stuff kernel: raid1: sdb2: rescheduling sector 18886472
[.. Continue ad nauseam ..]
* Re: Logging-Loop when a drive in a raid1 fails.
From: Paul Clements @ 2004-12-13 15:07 UTC (permalink / raw)
To: Michael Renner; +Cc: linux-raid
Michael Renner wrote:
> One of the drives in a software raid1 failed on a machine running
> 2.6.9-rc2, leading to the "logging spree" shown in the attached log.
> Sorry if this has been fixed in the meantime; it's not that easy to
It has. I sent the patch to Neil Brown a while back to fix this problem.
I believe it made 2.6.9.
> test the code paths for failing drives with various kernels without
> access to special block devices that can fail on demand.
mdadm -f /dev/md0 <drive>
roughly approximates a drive failure
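For example, a rough sketch of the whole fail/remove/re-add cycle (the
device names /dev/md0 and /dev/sdb2 are just placeholders for whatever
your array actually uses):

  # mark a member faulty, which exercises md's failure handling
  mdadm /dev/md0 -f /dev/sdb2
  # remove the faulty member from the array
  mdadm /dev/md0 -r /dev/sdb2
  # add it back; raid1 then resyncs onto it
  mdadm /dev/md0 -a /dev/sdb2
  # watch the resync progress
  cat /proc/mdstat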
> Furthermore, I'm a bit concerned about the overall quality of the md
> support in 2.6
I don't think you should be. md in 2.6 (as of 2.6.9 or so) is as stable
as 2.4, at least according to our stress tests.
--
Paul
* Re: Logging-Loop when a drive in a raid1 fails.
From: Michael Renner @ 2004-12-14 10:19 UTC (permalink / raw)
To: Paul Clements; +Cc: linux-raid
Paul Clements wrote:
> Michael Renner wrote:
>
>> One of the drives in a software raid1 failed on a machine running
>> 2.6.9-rc2, leading to the "logging spree" shown in the attached log.
>
>> Sorry if this has been fixed in the meantime; it's not that easy to
>
> It has. I sent the patch to Neil Brown a while back to fix this problem.
> I believe it made 2.6.9.
Ok, good to hear.
>> test the code paths for failing drives with various kernels without
>> access to special block devices that can fail on demand.
>
> mdadm -f /dev/md0 <drive>
>
> roughly approximates a drive failure
IIRC this doesn't touch any of the code paths involved in handling
unreadable blocks on a block device, rescheduling block reads to
another drive and so on, so it isn't a real alternative to funky block
devices ;).
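For what it's worth, one way to fabricate such a funky device without
real failing hardware might be the device-mapper "error" target, along
these lines (a rough sketch; the backing partition /dev/sdc1, the second
disk /dev/sdd1 and the sector numbers are made up and would need to be
adapted to the real devices):

  # describe a 1 GiB mapped device: 8 sectors in the middle always
  # return I/O errors, the rest maps linearly onto /dev/sdc1
  # (table format: start length target args, sizes in 512-byte sectors)
  echo "0       1048576 linear /dev/sdc1 0"       >  flaky.table
  echo "1048576 8       error"                    >> flaky.table
  echo "1048584 1048568 linear /dev/sdc1 1048584" >> flaky.table
  dmsetup create flaky < flaky.table
  # use the mapped device as one half of a throwaway raid1
  mdadm --create /dev/md1 --level=1 --raid-devices=2 \
        /dev/mapper/flaky /dev/sdd1

Reads that land in the error range should then go through the same
rescheduling/redirection code paths as a real medium error.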
>> Furthermore, I'm a bit concerned about the overall quality of the md
>> support in 2.6
>
>
> I don't think you should be. md in 2.6 (as of 2.6.9 or so) is as stable
> as 2.4, at least according to our stress tests.
Including semi-dead/dying drives? As I said, normal operation is rock
solid; it's just the edge-case, rarely exercised code which tend(s|ed)
to break.
best regards,
michael
* Re: Logging-Loop when a drive in a raid1 fails.
From: Paul Clements @ 2004-12-14 15:19 UTC (permalink / raw)
To: Michael Renner; +Cc: linux-raid
Michael Renner wrote:
> Paul Clements wrote:
>> I don't think you should be. md in 2.6 (as of 2.6.9 or so) is as
>> stable as 2.4, at least according to our stress tests.
> Including semi-dead/dying drives? As I said, normal operation is rock
> solid; it's just the edge-case, rarely exercised code which tend(s|ed)
> to break.
Well, we stress test with nbd under raid1. nbd has the nice property
that it returns I/O errors (read or write, depending on what's going on
at the time) when its network connection is broken. So, in our tests we
break and reconnect the nbd connection periodically while doing heavy
I/O; on top of that, of course, the resync activity of raid1 kicks in
whenever the failed device is re-added to the array.
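Roughly like this (the backing file and the device names /dev/nbd0 and
/dev/sdb3 are placeholders, and the exact nbd-server/nbd-client syntax
varies between versions):

  # create a backing file, export it over nbd, and attach it locally
  dd if=/dev/zero of=/tmp/nbd-backing.img bs=1M count=512
  nbd-server 2000 /tmp/nbd-backing.img
  nbd-client localhost 2000 /dev/nbd0
  # build a raid1 from the nbd device and a local partition
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nbd0 /dev/sdb3
  # while heavy I/O runs against /dev/md0, break the connection;
  # outstanding I/O on /dev/nbd0 fails and md kicks the device
  killall nbd-server
  # later: restart the export, reattach, and re-add the member so that
  # raid1 resyncs it
  nbd-server 2000 /tmp/nbd-backing.img
  nbd-client localhost 2000 /dev/nbd0
  mdadm /dev/md0 -a /dev/nbd0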
Both 2.4 and 2.6 md are rock solid under several days of this type of
testing.
--
Paul