public inbox for linux-xfs@vger.kernel.org
From: Stan Hoeppner <stan@hardwarefreak.com>
To: Dave Chinner <david@fromorbit.com>
Cc: xfs@oss.sgi.com
Subject: Re: storage, libaio, or XFS problem?  3.4.26
Date: Fri, 29 Aug 2014 21:55:53 -0500
Message-ID: <d20fe777ec1fd318ae5d4054dffda3f4@localhost>
In-Reply-To: <20140829235538.GF20518@dastard>

On Sat, 30 Aug 2014 09:55:38 +1000, Dave Chinner <david@fromorbit.com>
wrote:
> On Fri, Aug 29, 2014 at 11:38:16AM -0500, Stan Hoeppner wrote:
>> On Fri, 29 Aug 2014 09:08:17 +1000, Dave Chinner <david@fromorbit.com>
>> wrote:
>> > On Thu, Aug 28, 2014 at 05:31:33PM -0500, Stan Hoeppner wrote:
>> >> On Thu, 28 Aug 2014 10:32:27 +1000, Dave Chinner <david@fromorbit.com>
>> >> wrote:
>> >> > On Tue, Aug 26, 2014 at 12:19:43PM -0500, Stan Hoeppner wrote:
>> >> >> Aug 25 23:05:39 Anguish-ssu-1 kernel: [22409.328839] XFS (sdd):
>> >> >> xfs_do_force_shutdown(0x8) called from line 3732 of file
>> >> >> fs/xfs/xfs_bmap.c.
>> >> >> Return address = 0xffffffffa01cc9a6
>> >> > 
>> >> > Yup, that's kinda important. That's from xfs_bmap_finish(), and
>> >> > freeing an extent has failed and triggered SHUTDOWN_CORRUPT_INCORE,
>> >> > which means it's found some kind of inconsistency in the free space
>> >> > btrees. So, likely the same problem that caused EFI recovery to
>> >> > fail on the other volume.
>> >> > 
>> >> > Are the tests being run on newly made filesystems? If not, have
>> >> > these filesystems had xfs_repair run on them after a failure?  If
>> >> > so, what is the error that is fixed? If not, does repairing the
>> >> > filesystem make the problem go away?
>> >> 
>> >> Newly made after every error of any kind, whether app, XFS shutdown,
>> >> call trace, etc.  I've not attempted xfs_repair.
>> > 
>> > Please do.
>> 
>> Another storage crash yesterday.  xfs_repair output inline below for
>> the 7 filesystems.  I'm also pasting the dmesg output.  This time there
>> is no oops, no call traces.  The filesystems mounted fine after an
>> initial mount to replay the log, followed by xfs_repair.
> 
> Ok, what version of xfs_repair did you use?

3.1.4, which is a little long in the tooth.  I believe they built the OS
image from Squeeze 6.0.  I was originally told it was Wheezy 7.0, but that
turns out to have been false.
 
>> > The bug? The bleeding edge storage arrays being used had a
>> > firmware bug.  When the number of outstanding IOs hit the
>> > *array controller* command tag queue depth limit (several
>> > thousand simultaneous IOs in flight) it would occasionally misdirect
>> > a single write IO to the *wrong lun*.
>> > 
>> > It was only under *extreme* loads that this would happen, and it's
>> > this sort of load that AIO+DIO can easily generate - you can have
>> > several thousand IOs in flight without too much hassle, and that
>> > will hit limits in the storage arrays that aren't often hit.  Array
>> > controller CTQ depth limits are a good example of a limit that
>> > normal IO won't go near to stressing.
>> 
>> I hadn't considered that up to this point.  That is *very* insightful,
>> and applicable, since we are dealing with a beta storage array and
>> firmware.  Worth mentioning is that the storage vendor has added a
>> custom routine which expends Herculean effort to identify full stripes
>> before writeback.
> 
> Hmmmm. Food for thought, especially as it is evident that the
> storage array appears to be crashing completely. At this point,
> I'd say the burden of finding a corruption needs to start with
> proving that the array has not done something wrong. Once you
> know that what is on disk is exactly what the filesystem asked to be
> written, then you can start to isolate filesystem issues. But you
> need the storage to be solid and trust-worthy before going looking
> for filesystem problems....

Agreed.  Which is why I put storage first in the subject, AIO second, and
XFS third.  My initial instinct was a problem with libaio, as the crashes
only surfaced writing with AIO.  I'm now seeing problems with storage on
both systems when not using AIO.  We're supposed to receive a new firmware
upload next week, so hopefully that will fix some of these issues.
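
One way to check that what is on disk is exactly what was asked to be written is to make every block self-describing: stamp it with the target device, its own offset, and a checksum of its payload, then read everything back and classify any mismatch. The following is a minimal sketch of that idea, not a tool used in this thread. The 4096-byte block size, the header layout, and the names make_block, check_block, write_pattern and verify are all assumptions made for illustration. It uses ordinary buffered I/O on a path, so a real test against an array would also need to bypass the page cache (for example with O_DIRECT) so that reads actually come from the storage.

```python
import hashlib
import os
import struct

BLOCK_SIZE = 4096  # assumed test block size, not taken from the thread
# Header: 16-byte device tag, 8-byte byte offset, 32-byte SHA-256 of the payload.
HEADER = struct.Struct("<16sQ32s")


def make_block(tag: bytes, offset: int) -> bytes:
    """Build one self-describing block for device `tag` at byte `offset`."""
    payload_len = BLOCK_SIZE - HEADER.size
    # Deterministic payload derived from the tag and offset.
    seed = hashlib.sha256(tag + struct.pack("<Q", offset)).digest()
    payload = (seed * (payload_len // len(seed) + 1))[:payload_len]
    return HEADER.pack(tag, offset, hashlib.sha256(payload).digest()) + payload


def check_block(data: bytes, tag: bytes, offset: int) -> str:
    """Classify a block read back from device `tag` at byte `offset`."""
    if len(data) != BLOCK_SIZE:
        return "short"
    found_tag, found_offset, digest = HEADER.unpack_from(data)
    if hashlib.sha256(data[HEADER.size:]).digest() != digest:
        return "corrupt"  # contents damaged or never written by this tool
    if found_tag.rstrip(b"\0") != tag or found_offset != offset:
        return "misdirected"  # intact block that belongs somewhere else
    return "ok"


def write_pattern(path: str, tag: bytes, nblocks: int) -> None:
    """Fill `path` with `nblocks` stamped blocks and flush them to storage."""
    with open(path, "wb") as f:
        for i in range(nblocks):
            f.write(make_block(tag, i * BLOCK_SIZE))
        f.flush()
        os.fsync(f.fileno())


def verify(path: str, tag: bytes, nblocks: int) -> list:
    """Return (byte offset, status) for every block that is not 'ok'."""
    bad = []
    with open(path, "rb") as f:
        for i in range(nblocks):
            status = check_block(f.read(BLOCK_SIZE), tag, i * BLOCK_SIZE)
            if status != "ok":
                bad.append((i * BLOCK_SIZE, status))
    return bad
```

A block that carries a valid checksum but the wrong tag or offset is the signature of a misdirected write, as opposed to plain corruption, which is the distinction that matters when deciding whether to blame the array or the filesystem.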
 
>> This is because some of our writes for a given low rate stream are as
>> small as 32KB and may be 2-3 seconds apart.  With a 64-128KB chunk, 768
>> to 1536KB stripe width, we'd get massive RMW without this feature.
>> Testing thus far shows it is fairly effective, though we still get
>> pretty serious RMW due to the fact we're writing 350 of these small
>> streams per array at ~72 KB/s max, along with 2 streams at ~48 MB/s,
>> and 50 streams at ~1.2 MB/s.  Multiply this by 7 LUNs per controller
>> and it becomes clear we're putting a pretty serious load on the
>> firmware and cache.
> 
> Yup, so having the array cache do the equivalent of sequential
> readahead multi-stream detection for writeback would make a big
> difference. But not simple to do....

Not at all, especially with only 3 GB of RAM to work with, as I'm told. 
Seems low for a high end controller with 4x 12G SAS ports.  We're only able
to achieve ~250 MB/s per array at the application due to the access pattern
being essentially random, and still with a serious quantity of RMWs.  Which
is why we're going to test with an even smaller chunk of 32KB.  I believe
that's the lower bound on these controllers.  For this workload 16KB or
maybe even 8KB would likely be closer to optimal.  We're also going to test
with bcache and a 400 GB Intel 3700 (datacenter grade) SSD backing two LUNs.
With bcache, though, chunk size should be far less relevant.  I'm anxious to
kick those tires, but it'll be a couple of weeks.
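
The chunk-size reasoning can be put into numbers with a rough sketch. It assumes 12 data disks per stripe, which is only inferred from the quoted figures (768 KB / 64 KB = 1536 KB / 128 KB = 12) and is not stated anywhere in the thread, and it treats MB as 1000 KB. The function names are made up for illustration. It shows how long a single ~72 KB/s stream needs to produce a full stripe at each chunk size, and the combined maximum rate of the three stream classes described for one array.

```python
DATA_DISKS = 12  # assumption: inferred from 768/64 and 1536/128, not stated


def stripe_width_kb(chunk_kb, data_disks=DATA_DISKS):
    """Full stripe width in KB for a given chunk size."""
    return chunk_kb * data_disks


def writes_per_stripe(chunk_kb, write_kb, data_disks=DATA_DISKS):
    """How many writes of `write_kb` it takes to cover one full stripe."""
    return stripe_width_kb(chunk_kb, data_disks) / write_kb


def seconds_to_fill_stripe(chunk_kb, stream_kb_per_s, data_disks=DATA_DISKS):
    """Time for one stream at `stream_kb_per_s` to produce a full stripe."""
    return stripe_width_kb(chunk_kb, data_disks) / stream_kb_per_s


def aggregate_mb_per_s(streams):
    """Sum of count * per-stream rate (MB/s) over all stream classes."""
    return sum(count * rate for count, rate in streams)


# Stream mix per array as described in the message (maximum rates).
STREAMS = [(350, 0.072), (2, 48.0), (50, 1.2)]

if __name__ == "__main__":
    for chunk in (128, 64, 32, 16, 8):
        print(chunk, stripe_width_kb(chunk),
              round(seconds_to_fill_stripe(chunk, 72), 2))
    print(round(aggregate_mb_per_s(STREAMS), 1))
```

Under these assumptions a 64KB chunk means a single small stream needs more than ten seconds of data to fill one stripe, so the array cache has to hold partial stripes for hundreds of streams at once. Shrinking the chunk shrinks that window proportionally.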

Have you played with bcache yet?

-- 
Stan

