From: Stan Hoeppner
To: Dave Chinner
Cc: xfs@oss.sgi.com
Subject: Re: storage, libaio, or XFS problem?  3.4.26
Date: Fri, 29 Aug 2014 21:55:53 -0500

On Sat, 30 Aug 2014 09:55:38 +1000, Dave Chinner wrote:
> On Fri, Aug 29, 2014 at 11:38:16AM -0500, Stan Hoeppner wrote:
>> On Fri, 29 Aug 2014 09:08:17 +1000, Dave Chinner wrote:
>> > On Thu, Aug 28, 2014 at 05:31:33PM -0500, Stan Hoeppner wrote:
>> >> On Thu, 28 Aug 2014 10:32:27 +1000, Dave Chinner wrote:
>> >> > On Tue, Aug 26, 2014 at 12:19:43PM -0500, Stan Hoeppner wrote:
>> >> >> Aug 25 23:05:39 Anguish-ssu-1 kernel: [22409.328839] XFS (sdd):
>> >> >> xfs_do_force_shutdown(0x8) called from line 3732 of file
>> >> >> fs/xfs/xfs_bmap.c.
>> >> >> Return address = 0xffffffffa01cc9a6
>> >> >
>> >> > Yup, that's kinda important. That's from xfs_bmap_finish(), and
>> >> > freeing an extent has failed and triggered SHUTDOWN_CORRUPT_INCORE
>> >> > which means it's found some kind of inconsistency in the free space
>> >> > btrees. So, likely the same problem that caused EFI recovery to fail
>> >> > on the other volume.
>> >> >
>> >> > Are the tests being run on newly made filesystems? If not, have
>> >> > these filesystems had xfs_repair run on them after a failure? If
>> >> > so, what is the error that is fixed? If not, does repairing the
>> >> > filesystem make the problem go away?
>> >>
>> >> Newly made after every error of any kind, whether app, XFS shutdown,
>> >> call trace, etc. I've not attempted xfs_repair.
>> >
>> > Please do.
>>
>> Another storage crash yesterday. xfs_repair output inline below for the
>> 7 filesystems. I'm also pasting the dmesg output. This time there is no
>> oops, no call traces. The filesystems mounted fine after mounting,
>> replaying, and repairing.
>
> Ok, what version of xfs_repair did you use?

3.1.4, which is a little long in the tooth. I believe they built the OS
image from Squeeze 6.0. I was originally told it was Wheezy 7.0, but that
turns out to have been false.

>> > The bug? The bleeding edge storage arrays being used had had a
>> > firmware bug in it. When the number of outstanding IOs hit the
>> > *array controller* command tag queue depth limit (some several
>> > thousand simultaneous IOs in flight) it would occasionally misdirect
>> > a single write IO to the *wrong lun*. i.e. it would misdirect a
>> > write.
>> >
>> > It was only under *extreme* loads that this would happen, and it's
>> > this sort of load that AIO+DIO can easily generate - you can have
>> > several thousand IOs in flight without too much hassle, and that
>> > will hit limits in the storage arrays that aren't often hit.
>> > Array
>> > controller CTQ depth limits are a good example of a limit that
>> > normal IO won't go near to stressing.
>>
>> I hadn't considered that up to this point. That is *very* insightful,
>> and applicable, since we are dealing with a beta storage array and
>> firmware. Worth mentioning is that the storage vendor has added a custom
>> routine which expends Herculean effort to identify full stripes before
>> writeback.
>
> Hmmmm. Food for thought, especially as it is evident that the
> storage array appears to be crashing completely. At this point,
> I'd say the burden of finding a corruption needs to start with
> proving that the array has not done something wrong. Once you
> know that what is on disk is exactly what the filesystem asked to be
> written, then you can start to isolate filesystem issues. But you
> need the storage to be solid and trust-worthy before going looking
> for filesystem problems....

Agreed. Which is why I put storage first in the subject, AIO second, and
XFS third. My initial instinct was a problem with libaio, as the crashes
only surfaced writing with AIO. I'm now seeing problems with storage on
both systems when not using AIO. We're supposed to receive a new firmware
upload next week, so hopefully that will fix some of these issues.

>> This is because some of our writes for a given low rate stream are as
>> low as 32KB and may be 2-3 seconds apart. With a 64-128KB chunk, 768 to
>> 1536KB stripe width, we'd get massive RMW without this feature. Testing
>> thus far shows it is fairly effective, though we still get pretty
>> serious RMW due to the fact we're writing 350 of these small streams per
>> array at ~72 KB/s max, along with 2 streams at ~48 MB/s, and 50 streams
>> at ~1.2 MB/s. Multiply this by 7 LUNs per controller and it becomes
>> clear we're putting a pretty serious load on the firmware and cache.
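For a rough sanity check on the figures quoted above, here is a minimal
back-of-the-envelope sketch. The stream counts, rates, chunk sizes, and the
32KB write size come from the text; treating 1 MB as 1000 KB, inferring 12
data disks from 768KB / 64KB, and the function names are assumptions made
only for illustration.

```python
# Back-of-the-envelope check of the per-array write load described above.
# Assumptions (not from the original message): 1 MB = 1000 KB, and the
# data-disk count is inferred as 768 KB stripe / 64 KB chunk = 12.

STREAMS = [
    # (stream count, rate per stream in KB/s)
    (350, 72),      # small low-rate streams, ~72 KB/s max each
    (2, 48_000),    # ~48 MB/s each
    (50, 1_200),    # ~1.2 MB/s each
]

SMALL_WRITE_KB = 32          # smallest write size mentioned in the text
DATA_DISKS = 768 // 64       # inferred, hypothetical: 12 data disks


def aggregate_kb_per_s(streams):
    """Sum of count * rate over all stream classes, in KB/s."""
    return sum(count * rate for count, rate in streams)


def stripe_width_kb(chunk_kb, data_disks):
    """Full stripe width is the chunk size times the number of data disks."""
    return chunk_kb * data_disks


if __name__ == "__main__":
    total = aggregate_kb_per_s(STREAMS)
    print(f"offered write load per array: {total / 1000:.1f} MB/s")
    for chunk in (64, 128):
        width = stripe_width_kb(chunk, DATA_DISKS)
        share = SMALL_WRITE_KB / width
        print(f"{chunk} KB chunk: {width} KB stripe, "
              f"a {SMALL_WRITE_KB} KB write covers {share:.1%} of it")
```

The point of the sketch is only that a single 32KB write covers a small
fraction of a full stripe, so without coalescing in the array cache nearly
every small write would be a partial-stripe update.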
>
> Yup, so having the array cache do the equivalent of sequential
> readahead multi-stream detection for writeback would make a big
> difference. But not simple to do....

Not at all, especially with only 3 GB of RAM to work with, as I'm told.
Seems low for a high end controller with 4x 12G SAS ports.

We're only able to achieve ~250 MB/s per array at the application due to
the access pattern being essentially random, and still with a serious
quantity of RMWs. Which is why we're going to test with an even smaller
chunk of 32KB. I believe that's the lower bound on these controllers. For
this workload 16KB or maybe even 8KB would likely be more optimal. We're
also going to test with bcache and a 400 GB Intel 3700 (datacenter grade)
SSD caching two LUNs. But with bcache, chunk size should be far less
relevant. I'm anxious to kick those tires, but it'll be a couple of weeks.

Have you played with bcache yet?

--
Stan