* Reduce Timeout on Disk Failure
@ 2003-04-29 11:04 Andreas Kahnt
2003-04-29 13:23 ` jim
0 siblings, 1 reply; 5+ messages in thread
From: Andreas Kahnt @ 2003-04-29 11:04 UTC (permalink / raw)
To: linux-raid
Hello,
we have RAID5 configured and removed one disk. The system hangs for over a minute
on I/O (trying to copy a big file; cp sits in 'uninterruptible sleep') before
continuing in degraded mode. Lots of SCSI errors occurred in the meantime
(kernel 2.4.19). Is it possible to reduce this dead time? Where is it
controlled that md recognizes the disk failure at 17:37:09 but removes sde1 at
17:38:23, over a minute later?
I had a look into md.c and the other sources/includes and found the printk()
messages, but I'm not familiar with the concept... please help.
Excerpt of /var/log/messages:
Apr 25 17:37:09 r16 kernel: SCSI disk error : host 1 channel 0 id 2 lun 0 return code = 10000
Apr 25 17:37:09 r16 kernel: I/O error: dev 08:41, sector 5396720
Apr 25 17:37:09 r16 kernel: raid5: Disk failure on sde1, disabling device. Operation continuing on 3 devices
Apr 25 17:37:09 r16 kernel: md: recovery thread got woken up ...
Apr 25 17:37:09 r16 kernel: md: updating md5 RAID superblock on device
Apr 25 17:37:09 r16 kernel: md: sdh1 [events: 00000003]<6>(write) sdh1's sb offset: 5124608
Apr 25 17:37:09 r16 kernel: SCSI disk error : host 1 channel 0 id 2 lun 0 return code = 10000
Apr 25 17:37:09 r16 kernel: I/O error: dev 08:41, sector 5396728
Apr 25 17:37:10 r16 kernel: md: sdg1 [events: 00000003]<6>(write) sdg1's sb offset: 5124608
Apr 25 17:37:10 r16 kernel: SCSI disk error : host 1 channel 0 id 2 lun 0 return code = 10000
Apr 25 17:37:10 r16 kernel: I/O error: dev 08:41, sector 5396992
... SCSI disk error... + I/O error...
Apr 25 17:37:14 r16 kernel: md: sdf1 [events: 00000003]<6>(write) sdf1's sb offset: 5124608
... SCSI disk error... + I/O error...
Apr 25 17:37:15 r16 kernel: md: (skipping faulty sde1 )
Apr 25 17:37:15 r16 kernel: md5: no spare disk to reconstruct array! -- continuing in degraded mode
Apr 25 17:37:15 r16 kernel: md: recovery thread finished ...
... SCSI disk error... + I/O error...
Apr 25 17:38:09 r16 kernel: scsi1:0:2:0: Attempting to queue an ABORT message
Apr 25 17:38:09 r16 kernel: scsi1: Dumping Card State while idle, at SEQADDR 0x8
... driver messages ...
Apr 25 17:38:09 r16 kernel: (scsi1:A:2:0): Queuing a recovery SCB
Apr 25 17:38:09 r16 kernel: scsi1:0:2:0: Device is disconnected, re-queuing SCB
Apr 25 17:38:09 r16 kernel: Recovery code sleeping
Apr 25 17:38:09 r16 kernel: Recovery SCB completes
Apr 25 17:38:09 r16 kernel: Recovery code awake
Apr 25 17:38:09 r16 kernel: aic7xxx_abort returns 0x2002
Apr 25 17:38:09 r16 kernel: scsi1:0:2:0: Attempting to queue a TARGET RESET message
Apr 25 17:38:09 r16 kernel: scsi1:0:2:0: Command not found
Apr 25 17:38:09 r16 kernel: aic7xxx_dev_reset returns 0x2002
Apr 25 17:38:15 r16 kernel: scsi: device set offline - not ready or command retry failed after bus reset: host 1 channel 0 id 2 lun 0
Apr 25 17:38:15 r16 kernel: SCSI disk error : host 1 channel 0 id 2 lun 0 return code = 10000
Apr 25 17:38:15 r16 kernel: I/O error: dev 08:41, sector 5396760
Apr 25 17:38:15 r16 kernel: I/O error: dev 08:41, sector 5396768
Apr 25 17:38:23 r16 kernel: md: trying to remove sde1 from md5 ...
Apr 25 17:38:23 r16 kernel: RAID5 conf printout:
Apr 25 17:38:23 r16 kernel: --- rd:4 wd:3 fd:1
Apr 25 17:38:23 r16 kernel: disk 0, s:0, o:0, n:0 rd:0 us:1 dev:sde1
Apr 25 17:38:23 r16 kernel: disk 1, s:0, o:1, n:1 rd:1 us:1 dev:sdf1
Apr 25 17:38:23 r16 kernel: disk 2, s:0, o:1, n:2 rd:2 us:1 dev:sdg1
Apr 25 17:38:23 r16 kernel: disk 3, s:0, o:1, n:3 rd:3 us:1 dev:sdh1
Apr 25 17:38:23 r16 kernel: RAID5 conf printout:
Apr 25 17:38:23 r16 kernel: --- rd:4 wd:3 fd:1
Apr 25 17:38:23 r16 kernel: disk 0, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
Apr 25 17:38:23 r16 kernel: disk 1, s:0, o:1, n:1 rd:1 us:1 dev:sdf1
Apr 25 17:38:23 r16 kernel: disk 2, s:0, o:1, n:2 rd:2 us:1 dev:sdg1
Apr 25 17:38:23 r16 kernel: disk 3, s:0, o:1, n:3 rd:3 us:1 dev:sdh1
Apr 25 17:38:23 r16 kernel: md: unbind<sde1,3>
Apr 25 17:38:23 r16 kernel: md: export_rdev(sde1)
Apr 25 17:38:23 r16 kernel: md: updating md5 RAID superblock on device
Apr 25 17:38:23 r16 kernel: md: sdh1 [events: 00000004]<6>(write) sdh1's sb offset: 5124608
Apr 25 17:38:23 r16 kernel: md: sdg1 [events: 00000004]<6>(write) sdg1's sb offset: 5124608
Apr 25 17:38:23 r16 kernel: md: sdf1 [events: 00000004]<6>(write) sdf1's sb offset: 5124608
Thanx,
Andreas.Kahnt@coware.de Coware AG
---------------------------------------------------------
Landsberger Str. 402 D-81241 München
Telefon +49 (0)89 568 236 - 22, Fax -70 www.coware.de
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: Reduce Timeout on Disk Failure
2003-04-29 11:04 Reduce Timeout on Disk Failure Andreas Kahnt
@ 2003-04-29 13:23 ` jim
2003-04-29 14:06 ` Paul Clements
0 siblings, 1 reply; 5+ messages in thread
From: jim @ 2003-04-29 13:23 UTC (permalink / raw)
To: Andreas Kahnt; +Cc: linux-raid
If this is patched, I hope it is also put into a 2.2 update. When a
SW raid is running, a couple of I/O retries might be reasonable, but
not heroic recovery attempts that would make good sense in a
single-disk environment.
We did a simple test of powering down an IDE drive that was part of an
(idle) SW raid, then trying to access the filesystem, and the system
just locked up. Maybe it would have eventually come back to life - I
dunno.
For the curious, we haven't upgraded to 2.4.x because whenever I check
the kernel traffic page, it seems there are still important bugs being
found and corrected - ones we don't want to experience in a production
setup.
Jim
>
> Hello,
>
> we have RAID5 configured and removed one disk. The system hangs for over a minute
> on I/O (trying to copy a big file; cp sits in 'uninterruptible sleep') before
> continuing in degraded mode. Lots of SCSI errors occurred in the meantime
> (kernel 2.4.19). Is it possible to reduce this dead time? Where is it
> controlled that md recognizes the disk failure at 17:37:09 but removes sde1 at
> 17:38:23, over a minute later?
...
* Re: Reduce Timeout on Disk Failure
2003-04-29 13:23 ` jim
@ 2003-04-29 14:06 ` Paul Clements
2003-04-29 14:18 ` Lars Marowsky-Bree
0 siblings, 1 reply; 5+ messages in thread
From: Paul Clements @ 2003-04-29 14:06 UTC (permalink / raw)
To: jim; +Cc: Andreas Kahnt, linux-raid
jim@rubylane.com wrote:
>
> If this is patched, I hope it is also put into a 2.2 update. When a
> SW raid is running, a couple of I/O retries might be reasonable, but
> not heroic recovery attempts that would make good sense in a
> single-disk environment.
Yes, the md driver in 2.2 had a ridiculously large retry loop when an
I/O failure occurred...if I counted correctly, I think it did 4096 retries
on I/O failure! This usually meant that one of the lower level drivers
ended up hung in a pretty tight error handling loop...
> We did a simple test of powering down an IDE drive that was part of an
> (idle) SW raid, then trying to access the filesystem, and the system
> just locked up. Maybe it would have eventually come back to life - I
> dunno.
Yep, we tried similar things with a network block device (breaking the
network connection)...we ended up hacking the raid1 and nbd drivers and
inserting schedule() calls just to mitigate the effects of the retries a
little bit...we at least got the system not to hang completely while the
retries were going on...
> For the curious, we haven't upgraded to 2.4x because whenever I check
> the kernel traffic page, it seems there are still important bugs being
> found and corrected - ones we don't want to experience in a production
> setup.
Well, this particular retry problem does not exist in 2.4. And in
general, as far as software RAID is concerned, 2.4 is a lot better...I
know, at least with raid1, you can fail a device just about anytime you
want (with lots of write activity, during a resync, etc.) and as often
as you want, and it doesn't hang...
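For anyone wanting to exercise this the same way, a sketch of the usual commands (mdadm syntax shown; the 2003-era raidtools equivalents are raidsetfaulty/raidhotremove/raidhotadd; the device names are examples, not from this thread):

```shell
# Simulate a member failure, remove it, inspect, and re-add it.
# /dev/md0 and /dev/sdb1 are example names - substitute your own.
mdadm /dev/md0 --fail /dev/sdb1     # mark the member faulty
mdadm /dev/md0 --remove /dev/sdb1   # detach the failed member
cat /proc/mdstat                    # array should now show degraded
mdadm /dev/md0 --add /dev/sdb1      # re-add; a resync starts
```

These need root and a live md array, so treat them as a recipe rather than something to paste blindly.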
--
Paul
* Re: Reduce Timeout on Disk Failure
2003-04-29 14:06 ` Paul Clements
@ 2003-04-29 14:18 ` Lars Marowsky-Bree
2003-05-01 22:09 ` About bad sectors 3tcdgwg3
0 siblings, 1 reply; 5+ messages in thread
From: Lars Marowsky-Bree @ 2003-04-29 14:18 UTC (permalink / raw)
To: Paul Clements, jim; +Cc: Andreas Kahnt, linux-raid
On 2003-04-29T10:06:14,
Paul Clements <Paul.Clements@SteelEye.com> said:
> Well, this particular retry problem does not exist in 2.4. And in
> general, as far as software RAID is concerned, 2.4 is a lot better...I
> know, at least with raid1, you can fail a device just about anytime you
> want (with lots of write activity, during a resync, etc.) and as often
> as you want, and it doesn't hang...
This depends on the lower level device. qlaxxxx takes about 30s to
report the unplugging of a cable as an I/O error, so access to the md
device blocks for those 30s...
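That driver-side timeout is what actually bounds the blocking. As a tuning sketch: on later kernels (2.6 onward, so not the 2.4 qlaxxxx setup discussed here) the per-device SCSI command timeout is exposed through sysfs:

```shell
# Later-kernel illustration only; sda is an example device name.
cat /sys/block/sda/device/timeout        # SCSI command timeout, seconds
echo 10 > /sys/block/sda/device/timeout  # shorten it (needs root)
```

Shortening it trades slower error recovery on transient glitches for faster failover when a device really is gone.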
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
SuSE Labs - Research & Development, SuSE Linux AG
"If anything can go wrong, it will." "Chance favors the prepared (mind)."
-- Capt. Edward A. Murphy -- Louis Pasteur
* About bad sectors.
2003-04-29 14:18 ` Lars Marowsky-Bree
@ 2003-05-01 22:09 ` 3tcdgwg3
0 siblings, 0 replies; 5+ messages in thread
From: 3tcdgwg3 @ 2003-05-01 22:09 UTC (permalink / raw)
To: linux-raid
Hi,
If I use IDE drives in an array and bad sectors show up, how is
that handled?
Thanks in advance.