* hdparm -Y /dev/sda /dev/sdb -> I/O error -> disk kicked out of RAID - is it normal?
@ 2009-07-16 23:03 Tomasz Chmielewski
From: Tomasz Chmielewski @ 2009-07-16 23:03 UTC (permalink / raw)
To: linux-raid, linux-ide
Putting both drives to sleep with hdparm -Y /dev/sda /dev/sdb produces this:
Jul 17 00:55:50 dom klogd: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Jul 17 00:55:50 dom klogd: ata1.00: waking up from sleep
Jul 17 00:55:50 dom klogd: ata1: hard resetting link
Jul 17 00:55:50 dom klogd: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Jul 17 00:55:50 dom klogd: ata2.00: waking up from sleep
Jul 17 00:55:50 dom klogd: ata2: hard resetting link
Jul 17 00:55:50 dom klogd: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jul 17 00:55:50 dom klogd: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jul 17 00:55:50 dom klogd: ata2.00: configured for UDMA/133
Jul 17 00:55:50 dom klogd: ata2: EH complete
Jul 17 00:55:50 dom klogd: ata1.00: configured for UDMA/133
Jul 17 00:55:50 dom klogd: ata1: EH complete
Jul 17 00:55:55 dom klogd: end_request: I/O error, dev sda, sector 225263226
Jul 17 00:55:55 dom klogd: md: super_written gets error=-5, uptodate=0
Jul 17 00:55:55 dom klogd: raid1: Disk failure on sda2, disabling device.
Jul 17 00:55:55 dom klogd: raid1: Operation continuing on 1 devices.
Jul 17 00:55:55 dom klogd: end_request: I/O error, dev sdb, sector 225263226
Jul 17 00:55:55 dom klogd: md: super_written gets error=-5, uptodate=0
Jul 17 00:55:55 dom klogd: RAID1 conf printout:
Jul 17 00:55:55 dom klogd: --- wd:1 rd:2
Jul 17 00:55:55 dom klogd: disk 0, wo:1, o:0, dev:sda2
Jul 17 00:55:55 dom klogd: disk 1, wo:0, o:1, dev:sdb2
Jul 17 00:55:55 dom klogd: RAID1 conf printout:
Jul 17 00:55:55 dom klogd: --- wd:1 rd:2
Jul 17 00:55:55 dom klogd: disk 1, wo:0, o:1, dev:sdb2
And one of the disks is kicked out of RAID.
Is it expected behaviour (although probably the error happens somewhere in the ata layer)?
This is kernel 2.6.30.1 with the AHCI driver.
00:1f.2 SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller
# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda5[1] sdb5[0]
      511999424 blocks [2/2] [UU]

md3 : active raid1 sda6[1] sdb6[0]
      96256 blocks [2/2] [UU]

md0 : active raid1 sdb1[0] sda1[1]
      10233280 blocks [2/2] [UU]

md1 : active raid1 sda2[2](F) sdb2[1]
      102398208 blocks [2/1] [_U]
unused devices: <none>
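[Editor's note: for reference, once the sleeping drive has woken and is confirmed healthy, the kicked md1 member can be re-added with mdadm. A sketch, using the device names from the mdstat output above; it is a dry run by default that only prints the commands:]

```shell
# Sketch of re-adding the failed member sda2 to md1 after the drive is awake.
# RUN="echo" makes this a dry run that only prints each command; set RUN=""
# to actually execute them. Verify the drive first (e.g. SMART self-test).
RUN="echo"
$RUN mdadm /dev/md1 --remove /dev/sda2   # drop the (F)ailed member
$RUN mdadm /dev/md1 --add /dev/sda2      # re-add it; md starts a resync
$RUN cat /proc/mdstat                    # check the rebuild progress
```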
--
Tomasz Chmielewski
http://wpkg.org
* Re: hdparm -Y /dev/sda /dev/sdb -> I/O error -> disk kicked out of RAID - is it normal?
From: Tejun Heo @ 2009-07-17 4:12 UTC (permalink / raw)
To: Tomasz Chmielewski; +Cc: linux-raid, linux-ide
Tomasz Chmielewski wrote:
> And one of the disks is kicked out of RAID.
>
> Is it expected behaviour (although probably the error happens somewhere
> in the ata layer)?
Yes, it's expected. Wakeup requires reset via EH and md requests have
FAILFAST flag set, so they never get retried. The behavior can be
changed tho. Hmmm... not entirely sure what to do at this point.
Thanks.
--
tejun
* Re: hdparm -Y /dev/sda /dev/sdb -> I/O error -> disk kicked out of RAID - is it normal?
From: NeilBrown @ 2009-07-17 4:38 UTC (permalink / raw)
To: Tejun Heo; +Cc: Tomasz Chmielewski, linux-raid, linux-ide
On Fri, July 17, 2009 2:12 pm, Tejun Heo wrote:
> Tomasz Chmielewski wrote:
>> And one of the disks is kicked out of RAID.
>>
>> Is it expected behaviour (although probably the error happens somewhere
>> in the ata layer)?
>
> Yes, it's expected. Wakeup requires reset via EH and md requests have
> FAILFAST flag set, so they never get retried. The behavior can be
> changed tho. Hmmm... not entirely sure what to do at this point.
Nope, 'md' requests do not get FAILFAST set. I tried that and easily
found cases where it fails way too fast. FAILFAST seems to mean different
things on different devices, making it useless in general (it is still
useful in some specific cases such as multipath on devices which are
expected to be used under multipath and so treat FAILFAST appropriately).
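[Editor's note: the difference in semantics being discussed can be illustrated with a toy model, plain Python with invented names, nothing to do with the actual block-layer API: a failfast-style request that hits a single transient error (such as the link reset needed to wake a sleeping drive) is failed immediately, while a normally-retried request survives it.]

```python
# Toy model of FAILFAST semantics (illustration only; not kernel code).
# A "transient" error clears after the device recovers, e.g. after the
# link reset that wakes a drive put to sleep with hdparm -Y.

def submit(device_errors, failfast, max_retries=3):
    """Return 'ok' or 'error' for one request against a device whose
    first `device_errors` attempts fail transiently."""
    attempts = 1 if failfast else max_retries + 1
    for attempt in range(attempts):
        if attempt >= device_errors:   # error condition has cleared
            return "ok"
    return "error"

# One transient failure (the wakeup reset): the retrying path recovers,
# while the failfast path reports an error up the stack.
print(submit(device_errors=1, failfast=False))  # ok
print(submit(device_errors=1, failfast=True))   # error
```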
NeilBrown
* Re: hdparm -Y /dev/sda /dev/sdb -> I/O error -> disk kicked out of RAID - is it normal?
From: Tejun Heo @ 2009-07-17 4:48 UTC (permalink / raw)
To: NeilBrown; +Cc: Tomasz Chmielewski, linux-raid, linux-ide
NeilBrown wrote:
> On Fri, July 17, 2009 2:12 pm, Tejun Heo wrote:
>> Tomasz Chmielewski wrote:
>>> And one of the disks is kicked out of RAID.
>>>
>>> Is it expected behaviour (although probably the error happens somewhere
>>> in the ata layer)?
>> Yes, it's expected. Wakeup requires reset via EH and md requests have
>> FAILFAST flag set, so they never get retried. The behavior can be
>> changed tho. Hmmm... not entirely sure what to do at this point.
>
> Nope, 'md' requests do not get FAILFAST set.
Oh... then it's unexpected. I'll see if I can reproduce the failure
here.
> I tried that and easily found cases where it fails way too fast.
> FAILFAST seems to mean different things on different devices, making
> it useless in general (it is still useful in some specific cases
> such as multipath on devices which are expected to be used under
> multipath and so treat FAILFAST appropriately).
Yeap, FAILFAST flags seem geared pretty much toward multipathing.
Thanks.
--
tejun
* Re: hdparm -Y /dev/sda /dev/sdb -> I/O error -> disk kicked out of RAID - is it normal?
From: Jeff Garzik @ 2009-07-17 5:20 UTC (permalink / raw)
To: Tejun Heo; +Cc: NeilBrown, Tomasz Chmielewski, linux-raid, linux-ide
Tejun Heo wrote:
> NeilBrown wrote:
>> I tried that and easily found cases where it fails way too fast.
>> FAILFAST seems to mean different things on different devices, making
>> it useless in general (it is still useful in some specific cases
>> such as multipath on devices which are expected to be used under
>> multipath and so treat FAILFAST appropriately).
>
> Yeap, FAILFAST flags seem geared pretty much toward multipathing.
Yes :/
I'm glad this area is getting some attention, because we ideally want to
do two things in parallel:
* send upper layer advisory message, when we first notice a failure
* begin EH recovery
Time passes, libata attempts recovery, and completes the command with
success or failure many seconds later.
Right now, failfast handling is inconsistent, and is not (I think...)
always signalled as soon as we begin EH.
Jeff
* Re: hdparm -Y /dev/sda /dev/sdb -> I/O error -> disk kicked out of RAID - is it normal?
From: Tejun Heo @ 2009-07-17 5:24 UTC (permalink / raw)
To: Jeff Garzik; +Cc: NeilBrown, Tomasz Chmielewski, linux-raid, linux-ide
Jeff Garzik wrote:
> Tejun Heo wrote:
>> NeilBrown wrote:
>>> I tried that and easily found cases where it fails way too fast.
>>> FAILFAST seems to mean different things on different devices, making
>>> it useless in general (it is still useful in some specific cases
>>> such as multipath on devices which are expected to be used under
>>> multipath and so treat FAILFAST appropriately).
>>
>> Yeap, FAILFAST flags seem geared pretty much toward multipathing.
>
> Yes :/
>
> I'm glad this area is getting some attention, because we ideally want to
> do two things in parallel:
>
> * send upper layer advisory message, when we first notice a failure
> * begin EH recovery
>
> Time passes, libata attempts recovery, and completes the command with
> success or failure many seconds later.
>
> Right now, failfast handling is inconsistent, and is not (I think...)
> always signalled as soon as we begin EH.
Heh.. yeah, it's notified on completion of EH, which BTW is pretty
dumb. :-)
--
tejun