linux-raid.vger.kernel.org archive mirror
* Logging-Loop when a drive in a raid1 fails.
From: Michael Renner @ 2004-12-13  5:52 UTC
  To: linux-raid


Hi,

One of the drives in a software raid1 failed on a machine running 
2.6.9-rc2, leading to this "logging-spree" (see attachment).

Sorry if this has been fixed in the meantime; it's not that easy to 
test codepaths for failing drives with various kernels without having 
access to special block devices that can fail on demand.

Furthermore, I'm a bit concerned about the overall quality of the md 
support in 2.6; this is not the first bug I've seen in codepaths that 
don't get used during normal operation. When all drives are fine the 
driver is rock solid, but as soon as a block device gets funky, hell 
tends to break loose. I'm a bit surprised because I've never had any 
problems in 2.4. Were there major API/code changes that could be the 
cause of this?

Another odd behaviour, for which I don't have exact information anymore, 
was when md tried to do a resync of a degraded raid5, hit a bad block on 
one of the (supposedly) "good" drives, and entered a tight loop of 
resync processes starting/aborting. This was with 2.6.9; unfortunately I 
wasn't able to keep records because of time constraints and mounting panic.

best regards,
michael

[-- Attachment #2: md-log.txt --]
[-- Type: text/plain, Size: 2996 bytes --]

Dec 13 02:03:13 stuff kernel: scsi0: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 01 5b ff 3c 00 00 08 00
Dec 13 02:03:13 stuff kernel: Info fld=0x15bff3e, Current sda: sense key Medium Error
Dec 13 02:03:13 stuff kernel: Additional sense: Unrecovered read error
Dec 13 02:03:13 stuff kernel: end_request: I/O error, dev sda, sector 22806332
Dec 13 02:03:13 stuff kernel: raid1: Disk failure on sda2, disabling device.
Dec 13 02:03:13 stuff kernel: ^IOperation continuing on 1 devices
Dec 13 02:03:13 stuff kernel: printk: 77 messages suppressed.
Dec 13 02:03:13 stuff kernel: raid1: sda2: rescheduling sector 18886472
Dec 13 02:03:13 stuff kernel: RAID1 conf printout:
Dec 13 02:03:13 stuff kernel:  --- wd:1 rd:2
Dec 13 02:03:13 stuff kernel:  disk 0, wo:1, o:0, dev:sda2
Dec 13 02:03:13 stuff kernel:  disk 1, wo:0, o:1, dev:sdb2
Dec 13 02:03:13 stuff kernel: RAID1 conf printout:
Dec 13 02:03:13 stuff kernel:  --- wd:1 rd:2
Dec 13 02:03:13 stuff kernel:  disk 1, wo:0, o:1, dev:sdb2
Dec 13 02:03:13 stuff kernel: raid1: sdb2: redirecting sector 18886472 to another mirror
Dec 13 02:03:13 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:13 stuff kernel: raid1: sdb2: redirecting sector 18886472 to another mirror
Dec 13 02:03:13 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:13 stuff kernel: raid1: sdb2: redirecting sector 18886472 to another mirror
Dec 13 02:03:13 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:13 stuff kernel: raid1: sdb2: redirecting sector 18886472 to another mirror
Dec 13 02:03:13 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:13 stuff kernel: raid1: sdb2: redirecting sector 18886472 to another mirror
Dec 13 02:03:18 stuff kernel: printk: 119408 messages suppressed.
Dec 13 02:03:18 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:23 stuff kernel: printk: 119581 messages suppressed.
Dec 13 02:03:23 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:28 stuff kernel: printk: 125937 messages suppressed.
Dec 13 02:03:28 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:33 stuff kernel: printk: 132411 messages suppressed.
Dec 13 02:03:34 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:38 stuff kernel: printk: 128773 messages suppressed.
Dec 13 02:03:39 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:43 stuff kernel: printk: 129557 messages suppressed.
Dec 13 02:03:44 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:48 stuff kernel: printk: 130365 messages suppressed.
Dec 13 02:03:49 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:53 stuff kernel: printk: 129959 messages suppressed.
Dec 13 02:03:54 stuff kernel: raid1: sdb2: rescheduling sector 18886472
Dec 13 02:03:58 stuff kernel: printk: 125235 messages suppressed.
Dec 13 02:03:59 stuff kernel: raid1: sdb2: rescheduling sector 18886472
[.. Continue ad nauseam ..]


* Re: Logging-Loop when a drive in a raid1 fails.
From: Paul Clements @ 2004-12-13 15:07 UTC
  To: Michael Renner; +Cc: linux-raid

Michael Renner wrote:

> One of the drives in a software raid1 failed on a machine running 
> 2.6.9-rc2, leading to this "logging-spree" (see attachment).

> Sorry if this has been fixed in the meantime; it's not that easy to 

It has. I sent the patch to Neil Brown a while back to fix this problem. 
I believe it made 2.6.9.

> test codepaths for failing drives with various kernels without having 
> access to special block devices that can fail on demand.

mdadm -f /dev/md0 <drive>

roughly approximates a drive failure
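
For completeness, a rough sketch of the whole fail/remove/re-add cycle
(the device names are only examples, adjust them to your array):

mdadm /dev/md0 --fail /dev/sdb2     # mark the member faulty
mdadm /dev/md0 --remove /dev/sdb2   # pull it out of the array
mdadm /dev/md0 --add /dev/sdb2      # put it back; raid1 resyncs onto it
cat /proc/mdstat                    # watch the resync progress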

> Furthermore, I'm a bit concerned about the overall quality of the md 
> support in 2.6

I don't think you should be. md in 2.6 (as of 2.6.9 or so) is as stable 
as 2.4, at least according to our stress tests.

--
Paul


* Re: Logging-Loop when a drive in a raid1 fails.
From: Michael Renner @ 2004-12-14 10:19 UTC
  To: Paul Clements; +Cc: linux-raid

Paul Clements wrote:
> Michael Renner wrote:
> 
>> One of the drives in a software raid1 failed on a machine running 
>> 2.6.9-rc2, leading to this "logging-spree" (see attachment).
> 
>> Sorry if this has been fixed in the meantime; it's not that easy to 
> 
> It has. I sent the patch to Neil Brown a while back to fix this problem. 
> I believe it made 2.6.9.

Ok, good to hear.

>> test codepaths for failing drives with various kernels without having 
>> access to special block devices that can fail on demand.
> 
> mdadm -f /dev/md0 <drive>
> 
> roughly approximates a drive failure

IIRC this doesn't touch any of the codepaths involved in handling 
unreadable blocks on a block device, rescheduling block reads to another 
drive, etc., so this isn't a real alternative to funky block devices ;).
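
One way to get such a funky device without sacrificing real hardware
might be device-mapper's "error" target. A rough sketch (sector counts
and the backing device are made up for illustration):

# /dev/mapper/flaky behaves like /dev/sdc1, except for a 1024-sector
# window in the middle that returns I/O errors on every access
dmsetup create flaky <<EOF
0       1000000 linear /dev/sdc1 0
1000000 1024    error
1001024 1000000 linear /dev/sdc1 1001024
EOF

Using /dev/mapper/flaky as a raid1 member and reading across the bad
window should exercise exactly these rescheduling/redirecting codepaths.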

>> Furthermore, I'm a bit concerned about the overall quality of the md 
>> support in 2.6
> 
> 
> I don't think you should be. md in 2.6 (as of 2.6.9 or so) is as stable 
> as 2.4, at least according to our stress tests.

Including semi-dead/dying drives? As I said, normal operation is rock 
solid; it's just the edge-case, rarely used stuff that tend(s|ed) to break.

best regards,
michael


* Re: Logging-Loop when a drive in a raid1 fails.
From: Paul Clements @ 2004-12-14 15:19 UTC
  To: Michael Renner; +Cc: linux-raid

Michael Renner wrote:
> Paul Clements wrote:

>> I don't think you should be. md in 2.6 (as of 2.6.9 or so) is as 
>> stable as 2.4, at least according to our stress tests.

> Including semi-dead/dying drives? As I said, normal operation is rock 
> solid; it's just the edge-case, rarely used stuff that tend(s|ed) to break.

Well, we stress test with nbd under raid1. nbd has the nice property 
that it gives I/O errors (read or write, depending on what's going on at 
the time) when its network connection is broken. So, in our tests we 
break and reconnect the nbd connection periodically while doing heavy 
I/O; on top of that, the resync activity of raid1 kicks in, of course, 
whenever the failed device is re-added to the array.
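
Roughly, the shape of that test looks like this (ports, paths, and
device names are just for illustration, not our actual harness):

# export a file over nbd and mirror it against a local partition
nbd-server 2000 /export/mirror.img
nbd-client localhost 2000 /dev/nbd0
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/nbd0

# repeated periodically while heavy I/O runs against /dev/md0:
nbd-client -d /dev/nbd0              # break the connection -> I/O errors
mdadm /dev/md0 --remove /dev/nbd0    # drop the now-faulty member
nbd-client localhost 2000 /dev/nbd0  # reconnect the export
mdadm /dev/md0 --add /dev/nbd0       # re-add; raid1 resyncs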

Both 2.4 and 2.6 md are rock solid under several days of this type of 
testing.

--
Paul

