From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from ipmail06.adl2.internode.on.net ([150.101.137.129]:20526
	"EHLO ipmail06.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1755288AbaH0KIz (ORCPT );
	Wed, 27 Aug 2014 06:08:55 -0400
Date: Wed, 27 Aug 2014 20:08:51 +1000
From: Dave Chinner
Subject: Re: [patch, v3] add an aio test which closes the fd before destroying the ioctx
Message-ID: <20140827100851.GC26465@dastard>
References: <20140820225701.GG26465@dastard>
	<20140821165750.GA7116@lenny.home.zabbo.net>
	<20140825165043.GF20070@kvack.org>
	<20140826172740.GA18337@lenny.home.zabbo.net>
	<20140827084922.GB26465@dastard>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20140827084922.GB26465@dastard>
Sender: fstests-owner@vger.kernel.org
To: Zach Brown
Cc: Jeff Moyer , Benjamin LaHaise , fstests@vger.kernel.org
List-ID:

On Wed, Aug 27, 2014 at 06:49:22PM +1000, Dave Chinner wrote:
> On Tue, Aug 26, 2014 at 10:27:40AM -0700, Zach Brown wrote:
> > On Tue, Aug 26, 2014 at 12:05:11PM -0400, Jeff Moyer wrote:
> > > Benjamin LaHaise writes:
> > >
> > > > Does someone already have a simple test case we can add to the libaio test
> > > > suite to verify this behaviour?
> > >
> > > I can't reproduce this problem using a loop device, which is what the
> > > libaio test suite uses. Even when using real hardware, you have to have
> > > disks that are slow enough in order for this to trigger reliably (or
> > > at all).
> >
> > I wonder if you could use something like dm suspend to abuse indefinite
> > latencies.
> >
> > > I could write a more targeted test within xfstests, but I don't think
> > > that's strictly necessary (it would just make it more clear what the
> > > expectations are, and maybe bump the hit rate percentage up).
> >
> > I think it'd be worth it (he says, not committing *his* time). It would
> > have been nice if a targeted test helped Dave raise the alarm
> > immediately rather than gnaw away at his brain with inconsistent mostly
> > unrelated failures for months.
>
> I'm not sure it's worth the effort. Now we have two tests that have
> triggered the same problem, I've been easily able to reproduce it
> with 2 VMs with test/scratch image files sharing the same spindle.
> i.e. run xfstests in one VM, run generic/323 in the other VM, and
> it reproduces fairly easily.
>
> I'm just running it in a loop now to measure how successfully I'm
> reproducing the problem, then I'll apply the fix and see if it gets
> better. If it does get better, then I'll keep the patch around
> locally until it is upstream, and then I'll shout whenever I see
> this problem occur again....

Ok, so of 32 executions in a tight loop of generic/323, only 5
executions passed while 27 failed. With the patch suggested, it
failed the first 5 executions, so I don't think it fixes the problem.

BTW, generic/323 is pulling 8,000 read IOPS and 500MB/s from my
single spindle. Methinks that the test file is resident in the BBWC
on the RAID controller, which may be why nobody else is reproducing
this problem....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
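
The loop-and-count measurement described above (32 runs of generic/323,
5 passed, 27 failed, i.e. roughly an 84% failure rate) could be sketched
along the following lines. This is only an illustration, not the script
actually used: the names tally_runs and run_generic_323 are hypothetical,
and it assumes the test is started with "./check generic/323" from an
xfstests checkout and that ./check exits non-zero when the test fails.

```python
import subprocess
from typing import Callable, Tuple


def run_generic_323() -> bool:
    """Run the test once and report whether it passed.

    Assumptions (not taken from the thread): the current directory is an
    xfstests checkout, and ./check exits non-zero when the test fails.
    """
    result = subprocess.run(["./check", "generic/323"])
    return result.returncode == 0


def tally_runs(run_once: Callable[[], bool], iterations: int) -> Tuple[int, int]:
    """Call run_once() `iterations` times and return (passed, failed)."""
    passed = 0
    for _ in range(iterations):
        if run_once():
            passed += 1
    return passed, iterations - passed


# Example use (needs an xfstests checkout, so it is left commented out):
# passed, failed = tally_runs(run_generic_323, 32)
# print(f"{passed} passed, {failed} failed")
```

Running the same tally before and after applying a candidate fix gives
two directly comparable pass/fail counts.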