From: Steve Costaras <stevecs@chaven.com>
To: Dave Chinner <david@fromorbit.com>
Cc: xfs@oss.sgi.com
Subject: Re: XFS data corruption with high I/O even on Areca hardware raid
Date: Thu, 14 Jan 2010 20:15:25 -0600
Message-ID: <4B4FCFBD.3020300@chaven.com>
In-Reply-To: <20100115013507.GB28498@discord.disaster>
On 01/14/2010 19:35, Dave Chinner wrote:
> There stack traces do - everything is waiting on IO completion to
> occur. The elevator queues are full, the block device is congested
> and lots of XFS code is waiting on IO completion to occur.
>
>
>> I don't
>> like the abort device commands to arcmsr, still have not heard anything
>> from Areca support for them to look at it.
>>
>> -------------
>> [ 3494.731923] arcmsr6: abort device command of scsi id = 0 lun = 5
>> [ 3511.746966] arcmsr6: abort device command of scsi id = 0 lun = 5
>> [ 3511.746978] arcmsr6: abort device command of scsi id = 0 lun = 7
>> [ 3528.759509] arcmsr6: abort device command of scsi id = 0 lun = 7
>> [ 3528.759520] arcmsr6: abort device command of scsi id = 0 lun = 7
>> [ 3545.782040] arcmsr6: abort device command of scsi id = 0 lun = 7
>> [ 3545.782052] arcmsr6: abort device command of scsi id = 0 lun = 6
>> [ 3562.785862] arcmsr6: abort device command of scsi id = 0 lun = 6
>> [ 3562.785872] arcmsr6: abort device command of scsi id = 0 lun = 6
>> [ 3579.798404] arcmsr6: abort device command of scsi id = 0 lun = 6
>> [ 3579.798410] arcmsr6: abort device command of scsi id = 0 lun = 5
>>
> Yea, that looks bad - the driver appears to have aborted some IOs
> (no idea why) but probably hasn't handled the error correctly and
> completed the IOs it aborted with an error status (which would cause
> XFS to shut down the filesystem but not freeze like this). So it
> looks like there is a buggy error handling path in the driver being
> triggered by some kind of hardware problem.
>
> I note that it is the same raid controller that has had problems
> as the last report. It might just be a bad RAID card or SATA cables
> from that RAID card. I'd work out which card it is, replace it
> and the cables and see if that makes the problem go away....
>
Yeah, actually this IS a new raid card & cables since the last failure,
so I don't think (statistically) that it's hardware. Could be firmware
or driver. I've forwarded this over to Areca and hopefully they can
come up with something.
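
For a vendor report, the abort messages quoted above could be summarized per LUN. The following is only a rough sketch: the regular expression is written against the log lines shown in this mail, and the function name count_aborts is made up for illustration, not part of any tool.

```python
import re
from collections import Counter

# Pattern matches the arcmsr abort lines as they appear in the dmesg excerpt
# quoted in this mail; other driver versions may word the message differently.
ABORT_RE = re.compile(
    r"arcmsr(\d+): abort device command of scsi id = (\d+) lun = (\d+)"
)


def count_aborts(lines):
    """Count abort messages per (adapter, scsi id, lun) tuple."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            adapter, scsi_id, lun = (int(g) for g in match.groups())
            counts[(adapter, scsi_id, lun)] += 1
    return counts
```

Feeding it the output of dmesg (for example via sys.stdin) would give one counter entry per affected LUN.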
Right now I'm testing with the raid in write-through mode in the hope that,
if it doesn't avoid the problem, it will at least minimize the effect of it.
Thanks to the abort messages above, I also found some references about
lengthening the scsi command timeouts, which may help under 'heavy' load
(/sys/class/scsi_device/{?}/device/timeout). That looks to default to 30
seconds on at least my kernel. I haven't read up in detail yet on why it
is set to 30 or whether that was just some arbitrary number someone
picked; to me it seems long (considering we're talking 10-13ms for
service times generally on a 7200rpm drive), but who knows.
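
As a rough illustration of inspecting and lengthening those timeouts, here is a minimal sketch. It assumes the sysfs layout mentioned above (/sys/class/scsi_device/*/device/timeout, value in seconds); the function names are hypothetical, and writing to the real files needs root.

```python
import glob

# Assumed location of the per-device SCSI command timeout, as referenced above.
SYSFS_GLOB = "/sys/class/scsi_device/*/device/timeout"


def read_timeouts(pattern=SYSFS_GLOB):
    """Return {path: timeout in seconds} for every matching timeout file."""
    timeouts = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            timeouts[path] = int(f.read().strip())
    return timeouts


def set_timeouts(seconds, pattern=SYSFS_GLOB):
    """Write a new timeout (seconds) to every matching file."""
    for path in glob.glob(pattern):
        with open(path, "w") as f:
            f.write(str(seconds))
```

Calling read_timeouts() first shows the current values; set_timeouts(120) would then raise them to 120 seconds, a number picked here purely as an example. A value written this way is not expected to survive a reboot.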
Steve