From: Stan Hoeppner <stan@hardwarefreak.com>
To: Dave Chinner <david@fromorbit.com>
Cc: xfs@oss.sgi.com
Subject: Re: storage, libaio, or XFS problem? 3.4.26
Date: Fri, 29 Aug 2014 21:55:53 -0500
Message-ID: <d20fe777ec1fd318ae5d4054dffda3f4@localhost>
In-Reply-To: <20140829235538.GF20518@dastard>
On Sat, 30 Aug 2014 09:55:38 +1000, Dave Chinner <david@fromorbit.com>
wrote:
> On Fri, Aug 29, 2014 at 11:38:16AM -0500, Stan Hoeppner wrote:
>> On Fri, 29 Aug 2014 09:08:17 +1000, Dave Chinner <david@fromorbit.com>
>> wrote:
>> > On Thu, Aug 28, 2014 at 05:31:33PM -0500, Stan Hoeppner wrote:
>> >> On Thu, 28 Aug 2014 10:32:27 +1000, Dave Chinner <david@fromorbit.com>
>> >> wrote:
>> >> > On Tue, Aug 26, 2014 at 12:19:43PM -0500, Stan Hoeppner wrote:
>> >> >> Aug 25 23:05:39 Anguish-ssu-1 kernel: [22409.328839] XFS (sdd):
>> >> >> xfs_do_force_shutdown(0x8) called from line 3732 of file
>> >> >> fs/xfs/xfs_bmap.c.
>> >> >> Return address = 0xffffffffa01cc9a6
>> >> >
>> >> > Yup, that's kinda important. That's from xfs_bmap_finish(), and
>> >> > freeing an extent has failed and triggered SHUTDOWN_CORRUPT_INCORE
>> >> > which it's found some kind of inconsistency in the free space
>> >> > btrees. So, likely the same problem that caused EFI recovery to fail
>> >> > on the other volume.
>> >> >
>> >> > Are the tests being run on newly made filesystems? If not, have
>> >> > these filesystems had xfs_repair run on them after a failure? If
>> >> > so, what is the error that is fixed? If not, does repairing the
>> >> > filesystem make the problem go away?
>> >>
>> >> Newly made after every error of any kind, whether app, XFS shutdown,
>> >> call trace, etc. I've not attempted xfs_repair.
>> >
>> > Please do.
>>
>> Another storage crash yesterday. xfs_repair output inline below for the
>> 7 filesystems. I'm also pasting the dmesg output. This time there is no
>> oops, no call traces. The filesystems mounted fine after mounting,
>> replaying, and repairing.
>
> Ok, what version of xfs_repair did you use?
3.1.4, which is a little long in the tooth. I believe they built the OS
image from Squeeze 6.0. I was originally told it was Wheezy 7.0, but that
turns out to have been false.
>> > The bug? The bleeding edge storage arrays being used had a
>> > firmware bug in them. When the number of outstanding IOs hit the
>> > *array controller* command tag queue depth limit (some several
>> > thousand simultaneous IOs in flight) it would occasionally misdirect
>> > a single write IO to the *wrong lun*. i.e. it would misdirect a
>> > write.
>> >
>> > It was only under *extreme* loads that this would happen, and it's
>> > this sort of load that AIO+DIO can easily generate - you can have
>> > several thousand IOs in flight without too much hassle, and that
>> > will hit limits in the storage arrays that aren't often hit. Array
>> > controller CTQ depth limits are a good example of a limit that
>> > normal IO won't go near to stressing.
>>
>> I hadn't considered that up to this point. That is *very* insightful,
>> and applicable, since we are dealing with a beta storage array and
>> firmware. Worth mentioning is that the storage vendor has added a custom
>> routine which expends Herculean effort to identify full stripes before
>> writeback.
>
> Hmmmm. Food for thought, especially as it is evident that the
> storage array appears to be crashing completely. At this point,
> I'd say the burden of finding a corruption needs to start with
> proving that the array has not done something wrong. Once you
> know that what is on disk is exactly what the filesystem asked to be
> written, then you can start to isolate filesystem issues. But you
> need the storage to be solid and trust-worthy before going looking
> for filesystem problems....
Agreed. Which is why I put storage first in the subject, AIO second, and
XFS third. My initial instinct was a problem with libaio, as the crashes
only surfaced writing with AIO. I'm now seeing problems with storage on
both systems when not using AIO. We're supposed to receive a new firmware
upload next week, so hopefully that will fix some of these issues.
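One way to approach the "prove the array wrote what it was asked to write" step is to stamp every test block with its intended destination and a checksum, then read everything back. The following is only a minimal sketch of that idea, not anything used in this thread: the 4096-byte block size, the header layout, and the names `make_block`, `check_block`, and `verify_image` are all assumptions made for illustration, and the "LUNs" here are in-memory byte buffers rather than real devices.

```python
import struct
import zlib

BLOCK_SIZE = 4096                 # assumed test block size, not from the thread
HEADER = struct.Struct("<IQI")    # lun_id, block_number, CRC32 of the payload


def make_block(lun_id, block_no):
    """Build one self-describing block: header plus a deterministic payload."""
    payload_len = BLOCK_SIZE - HEADER.size
    payload = bytes((lun_id * 31 + block_no * 17 + i) & 0xFF
                    for i in range(payload_len))
    return HEADER.pack(lun_id, block_no, zlib.crc32(payload)) + payload


def check_block(data, expected_lun, expected_block):
    """Classify a block read back from (expected_lun, expected_block)."""
    if len(data) != BLOCK_SIZE:
        return "short"
    lun_id, block_no, crc = HEADER.unpack_from(data)
    if zlib.crc32(data[HEADER.size:]) != crc:
        return "corrupt"          # payload does not match its own checksum
    if (lun_id, block_no) != (expected_lun, expected_block):
        return "misdirected"      # intact block, but it belongs elsewhere
    return "ok"


def verify_image(image, lun_id):
    """Return a list of (block_number, status) for every block that is not ok."""
    problems = []
    for block_no in range(len(image) // BLOCK_SIZE):
        start = block_no * BLOCK_SIZE
        status = check_block(image[start:start + BLOCK_SIZE], lun_id, block_no)
        if status != "ok":
            problems.append((block_no, status))
    return problems


def build_image(lun_id, blocks):
    """Build a whole stamped image for one (simulated) LUN."""
    return bytearray(b"".join(make_block(lun_id, n) for n in range(blocks)))
```

The useful property is that a block with a valid checksum but the wrong LUN or block number points at a misdirected write, while a checksum failure points at ordinary corruption. Against real hardware the same stamping would have to be driven with direct I/O at the queue depths that trigger the failure, which this sketch does not attempt.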
>> This is because some of our writes for a given low rate stream are as low
>> as 32KB and may be 2-3 seconds apart. With a 64-128KB chunk, 768 to 1536KB
>> stripe width, we'd get massive RMW without this feature. Testing thus far
>> shows it is fairly effective, though we still get pretty serious RMW due
>> to the fact we're writing 350 of these small streams per array at ~72 KB/s
>> max, along with 2 streams at ~48 MB/s, and 50 streams at ~1.2 MB/s.
>> Multiply this by 7 LUNs per controller and it becomes clear we're
>> putting a pretty serious load on the firmware and cache.
>
> Yup, so having the array cache do the equivalent of sequential
> readahead multi-stream detection for writeback would make a big
> difference. But not simple to do....
Not at all, especially with only 3 GB of RAM to work with, as I'm told.
That seems low for a high end controller with 4x 12G SAS ports. We're only
able to achieve ~250 MB/s per array at the application because the access
pattern is essentially random, and still with a serious quantity of RMWs.
That is why we're going to test with an even smaller chunk of 32KB. I believe
that's the lower bound on these controllers. For this workload 16KB or
maybe even 8KB would likely be closer to optimal. We're also going to test
with bcache and a 400 GB Intel 3700 (datacenter grade) SSD caching two LUNs.
But with bcache, chunk size should be far less relevant. I'm anxious to
kick those tires, but it'll be a couple of weeks.
Have you played with bcache yet?
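For anyone following the chunk-size arithmetic, here is a rough sketch of the numbers involved. It assumes the quoted 768KB and 1536KB stripe widths correspond to 64KB and 128KB chunks respectively (which implies 12 data spindles; the RAID level is not stated in this thread), and it treats 1 MB/s as 1000 KB/s. The function names are made up for illustration.

```python
def data_disks(stripe_width_kb, chunk_kb):
    """Number of data spindles implied by a stripe width and chunk size."""
    return stripe_width_kb // chunk_kb


def full_stripe_kb(chunk_kb, disks):
    """Size of one full stripe of data for a given chunk size."""
    return chunk_kb * disks


def stripe_fraction(write_kb, chunk_kb, disks):
    """Fraction of a full stripe covered by a single write of write_kb."""
    return write_kb / full_stripe_kb(chunk_kb, disks)


def aggregate_kb_per_s(streams):
    """Sum of count * per-stream rate, with rates given in KB/s."""
    return sum(count * rate_kb for count, rate_kb in streams)


# Per-array stream mix quoted above, as (stream count, max rate in KB/s).
STREAMS = [(350, 72), (2, 48_000), (50, 1_200)]

DISKS = data_disks(768, 64)   # 12 under the assumption stated above

if __name__ == "__main__":
    for chunk_kb in (128, 64, 32, 16, 8):
        print(f"chunk {chunk_kb:>3}KB: full stripe "
              f"{full_stripe_kb(chunk_kb, DISKS):>4}KB, a 32KB write covers "
              f"{stripe_fraction(32, chunk_kb, DISKS):.1%} of it")
    print("nominal per-array write load:",
          aggregate_kb_per_s(STREAMS), "KB/s")
```

A 32KB write covers only a small slice of a 768KB stripe, so the controller has to coalesce many writes, or read the rest of the stripe, before it can compute parity. Shrinking the chunk shrinks the stripe, which is the reasoning behind testing 32KB and wishing for 16KB or 8KB.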
--
Stan