linux-raid.vger.kernel.org archive mirror
* Device kicked from raid too easilly
@ 2010-06-05  1:51 Ian Dall
  2010-06-05  7:22 ` Stefan /*St0fF*/ Hübner
  0 siblings, 1 reply; 5+ messages in thread
From: Ian Dall @ 2010-06-05  1:51 UTC (permalink / raw)
  To: Linux RAID

I think this is different to the similarly titled long thread on SATA
timeouts.

I have an array of U320 SCSI disks with similar characteristics from two
different manufacturers.

On two disks I see occasional SCSI parity errors. I don't think this is
a cabling or termination issue, since I never see the parity errors on
the other brand's disks. smartctl shows a number of "non-medium errors",
which I take to be the parity errors.

Now, when I have a RAID 10 of these disks, a SCSI parity error causes
the first disk to be failed. The array then continues degraded with no
apparent problems. If I re-add the failed disk, it always fails again
before the re-sync is complete. E.g.:

Jun  3 23:35:02 fs kernel: md: recovery of RAID array md5
Jun  3 23:35:02 fs kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Jun  3 23:35:02 fs kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Jun  3 23:35:02 fs kernel: md: using 128k window, over a total of 29291904 blocks.
Jun  3 23:35:07 fs kernel: sd 6:0:0:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun  3 23:35:07 fs kernel: sd 6:0:0:0: [sde] Sense Key : Aborted Command [current] 
Jun  3 23:35:07 fs kernel: sd 6:0:0:0: [sde] Add. Sense: Scsi parity error
Jun  3 23:35:07 fs kernel: sd 6:0:0:0: [sde] CDB: Write(10): 2a 00 00 05 9e 00 00 01 00 00
Jun  3 23:35:07 fs kernel: end_request: I/O error, dev sde, sector 368128
Jun  3 23:35:07 fs kernel: raid10: Disk failure on sde, disabling device.
Jun  3 23:35:07 fs kernel: raid10: Operation continuing on 3 devices.
Jun  3 23:35:07 fs kernel: md: md5: recovery done.
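
As a sanity check, the LBA in that WRITE(10) CDB decodes to the same
sector that end_request reports. A quick throwaway program, assuming the
standard WRITE(10) layout (big-endian 32-bit LBA in bytes 2-5, 16-bit
transfer length in bytes 7-8):

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
          /* the CDB bytes from the log above */
          uint8_t cdb[10] = { 0x2a, 0x00, 0x00, 0x05, 0x9e, 0x00,
                              0x00, 0x01, 0x00, 0x00 };
          uint32_t lba = (uint32_t)cdb[2] << 24 | cdb[3] << 16 |
                         cdb[4] << 8 | cdb[5];
          uint16_t len = cdb[7] << 8 | cdb[8];

          /* prints: LBA 368128, 256 blocks */
          printf("LBA %u, %u blocks\n", lba, len);
          return 0;
  }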

Now I can test this disk in isolation (using iozone) pretty heavily and
never see a problem. I can also use it in a RAID 0 and never see a
problem.

I think some of the strangeness is explained by the comment in the
raid10 error handler:  "else if it is the last working disks, ignore the
error".

Parity errors seem to me like they should be treated as transient
errors. Maybe if there are multiple consecutive parity errors it could
be assumed there is a hard fault in the transport layer. U320 uses
"information units" with (stronger than parity) CRC checking. Although
these errors are not reported as CRC errors, that could just be a
reporting issue (the lack of an "additional sense code qualifier").
Given the complexity of the clock recovery, de-skewing, etc. which goes
on for U320, it is not surprising that some disks would do it better
than others, but a non-zero error rate probably shouldn't be considered
fatal.
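
If md wanted to be more tolerant, I imagine something along these lines.
This is purely a hypothetical sketch, not existing md code; the
transient_errors field and the threshold are made up for illustration:

  /* Hypothetical: count transport-level ("transient") errors per member
   * and only kick the device once a threshold is exceeded.  A real
   * version would also reset the counter on successful I/O. */
  #define MAX_TRANSIENT_ERRORS 5

  static void handle_write_error(mddev_t *mddev, mdk_rdev_t *rdev,
                                 int transient)
  {
          if (transient && ++rdev->transient_errors < MAX_TRANSIENT_ERRORS)
                  return;                  /* tolerate it, keep the disk */

          rdev->transient_errors = 0;
          md_error(mddev, rdev);           /* existing path: fail the disk */
  }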

I don't really know where this should be fixed. Maybe the SCSI layer
should be retrying the command, since it knows most about what sort of
error it is. But equally, retrying could be the responsibility of the
upper layers (which gives them the option not to retry if they don't
want to). But if the SCSI layer is not responsible for retrying these
sorts of errors, then the md layer is over-reacting by throwing disks
out too easily.


Regards,
Ian


-- 
Ian Dall <ian@beware.dropbear.id.au>



* Re: Device kicked from raid too easilly
  2010-06-05  1:51 Device kicked from raid too easilly Ian Dall
@ 2010-06-05  7:22 ` Stefan /*St0fF*/ Hübner
  2010-06-08  5:15   ` Ian Dall
  0 siblings, 1 reply; 5+ messages in thread
From: Stefan /*St0fF*/ Hübner @ 2010-06-05  7:22 UTC (permalink / raw)
  To: Ian Dall; +Cc: Linux RAID

Hi Ian,

I do not think this is md-related, nor related to the other dropout
problem.  Here we have a write error, which correctly makes the disk
get dropped.  So if you say the error is a kind of soft error, then
really either the disk's firmware or the SCSI layer should be handling
it.

But as always with write errors: usually there is more than one write
request in the queue, so it's probably hard to find out which data
couldn't be written, or the data has already been discarded.  Write
errors are the kind that should be handled in firmware...

Stefan

P.S.: maybe you should check for firmware updates of the disks?

On 05.06.2010 03:51, Ian Dall wrote:
> [...]


* Re: Device kicked from raid too easilly
  2010-06-05  7:22 ` Stefan /*St0fF*/ Hübner
@ 2010-06-08  5:15   ` Ian Dall
  2010-06-08  5:56     ` Stefan /*St0fF*/ Hübner
  0 siblings, 1 reply; 5+ messages in thread
From: Ian Dall @ 2010-06-08  5:15 UTC (permalink / raw)
  To: st0ff; +Cc: Linux RAID

On Sat, 2010-06-05 at 09:22 +0200, Stefan /*St0fF*/ Hübner wrote:
> Hi Ian,
> 
> I do not think this is md-related, nor related to the other dropout
> problem.  Here we have a write error, which correctly makes the disk
> get dropped.  So if you say the error is a kind of soft error, then
> really either the disk's firmware or the SCSI layer should be handling
> it.
> 
> But as always with write errors: usually there is more than one write
> request in the queue, so it's probably hard to find out which data
> couldn't be written, or the data has already been discarded.  Write
> errors are the kind that should be handled in firmware...

Hmm. Well, that is fair enough. I have struggled a bit in trying to
follow the path up and down all the various layers. I tried modifying
the SCSI I/O completion code to retry these errors, but it wasn't a huge
success! It sometimes took a lot of retries (with the system seeming to
freeze for a few seconds), so it looks like it is a "firm" error!
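
The change was roughly along these lines (a sketch of the idea from
memory, not the actual diff; the structure of scsi_io_completion() may
not match exactly):

  /* Sketch only: treat ABORTED COMMAND with the parity-error ASC (0x47)
   * as retryable instead of failing the request immediately. */
  case ABORTED_COMMAND:
          if (sshdr.asc == 0x47)           /* SCSI PARITY ERROR */
                  action = ACTION_RETRY;   /* was ACTION_FAIL */
          else
                  action = ACTION_FAIL;
          break;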

Puzzlingly, swapping the disks around in the backplane works so long as
it is brand "A" in the first slot and not brand "B"! My current theory
is that there are transmission-line effects (or maybe RFI) which put
some slots outside the range the brand "B" disks can compensate for!

One might think that SATA would be better, but I am simultaneously
looking into a problem in another RAID array which gives me SATA CRC
errors, which I assume to be cable related. Interestingly, the SATA
transport layer treats these CRC errors as soft (at least, there is no
corresponding action from md).
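
Conceptually, the difference seems to be that the transport retries
link-level errors itself, so a persistent failure is all the upper
layers ever see. A hypothetical illustration (not actual libata code;
transport_submit() is a made-up helper):

  /* Hypothetical: a transport that retries CRC-type failures internally
   * only completes the request with an error once the failure looks
   * persistent, so md never reacts to the transient ones. */
  static int submit_with_retry(struct request *rq)
  {
          int tries = 3, ret;

          do {
                  ret = transport_submit(rq);   /* made-up helper */
          } while (ret == -EILSEQ && --tries);  /* CRC-style error */

          return ret;   /* only a persistent failure propagates to md */
  }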

> P.S.: maybe you should check for firmware updates of the disks?

No such luck. These (brand "B") are "Worldisk" (which I believe to be
re-manufactured Fujitsu). The less troublesome disks (brand "A") are
Hitachi.

Thanks for your thoughts.

Regards,
Ian


> [...]

-- 
Ian Dall <ian@beware.dropbear.id.au>


* Re: Device kicked from raid too easilly
  2010-06-08  5:15   ` Ian Dall
@ 2010-06-08  5:56     ` Stefan /*St0fF*/ Hübner
  2010-06-08  6:59       ` Tim Small
  0 siblings, 1 reply; 5+ messages in thread
From: Stefan /*St0fF*/ Hübner @ 2010-06-08  5:56 UTC (permalink / raw)
  To: Ian Dall; +Cc: Linux RAID

On 08.06.2010 07:15, Ian Dall wrote:
> On Sat, 2010-06-05 at 09:22 +0200, Stefan /*St0fF*/ Hübner wrote:
> [...]
> 
> Puzzlingly, swapping the disks around in the backplane works so long as
> it is brand "A" in the first slot and not brand "B"! My current theory
> is that there are transmission-line effects (or maybe RFI) which put
> some slots outside the range the brand "B" disks can compensate for!
> 
> One might think that SATA would be better, but I am simultaneously
> looking into a problem in another RAID array which gives me SATA CRC
> errors, which I assume to be cable related. Interestingly, the SATA
> transport layer treats these CRC errors as soft (at least, there is no
> corresponding action from md).

Weird, but on some Synology RackStation (RS407 and RS408) NAS devices I
had similar problems, and what you wrote is exactly what I told the
customers I thought the problem might be.

The devices of that make which failed always failed on slot #1.  Those
that caused problems always kicked drive #1 sporadically.  But on those
problem children I could never find a real error.  Surprisingly,
swapping another disk into slot #1 made the problems disappear.  So I
guess your thoughts are right: some disks may be able to compensate for
transmission noise better than others.

But I wouldn't want to attribute it to the cabling, because (well, I'm
not 100% sure this is right) slot #1's cable is the shortest.  I'd
rather think it might be not-so-well-placed electronics on the board or
backplane.  Maybe a capacitor or a resistor (of which there should be
some for each port) is placed for port #1 in such a way that it gets a
lot warmer than the other ports' passive elements.  Resistors increase
resistance when they're hotter, and capacitors lose capacitance when
they're hotter.  So this might change the signal levels to some
noticeable extent.

> 
>> P.S.: maybe you should check for firmware updates of the disks?
> 
> No such luck. These (brand "B") are "Worldisk" (which I believe to be
> re-manufactured Fujitsu). The less troublesome disks (brand "A") are
> Hitachi.

That's too bad.  But actually, if our shared thoughts above are right, a
firmware update wouldn't help much...
> 
> Thanks for your thoughts.
> 
> Regards,
> Ian
> 
Stefan
> [...]


* Re: Device kicked from raid too easilly
  2010-06-08  5:56     ` Stefan /*St0fF*/ Hübner
@ 2010-06-08  6:59       ` Tim Small
  0 siblings, 0 replies; 5+ messages in thread
From: Tim Small @ 2010-06-08  6:59 UTC (permalink / raw)
  To: st0ff; +Cc: Stefan /*St0fF*/ Hübner, Ian Dall, Linux RAID

On 08/06/10 06:56, Stefan /*St0fF*/ Hübner wrote:
> But I wouldn't want to attribute it to the cabling, because (well, I'm
> not 100% sure this is right) slot #1's cable is the shortest.
>    

There are some scenarios where a shorter cable could show up a design 
fault where a longer cable wouldn't.  I'm not an expert by any means, 
but just off the top of my head, perhaps:

Over-saturated inputs
Stronger signal reflections
Auto-termination problems

I've certainly heard of badly designed Ethernet devices not working with 
very short cable lengths....

Tim.

-- 
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53  http://seoss.co.uk/ +44-(0)1273-808309



Thread overview: 5+ messages
2010-06-05  1:51 Device kicked from raid too easilly Ian Dall
2010-06-05  7:22 ` Stefan /*St0fF*/ Hübner
2010-06-08  5:15   ` Ian Dall
2010-06-08  5:56     ` Stefan /*St0fF*/ Hübner
2010-06-08  6:59       ` Tim Small
