fstests.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH] ext4: ratelimit the file system mounted message
       [not found]   ` <20150817145056.GC27202@thunk.org>
@ 2015-08-17 23:07     ` Dave Chinner
  2015-08-18  1:18       ` Theodore Ts'o
  2015-08-18  3:55       ` Eryu Guan
  0 siblings, 2 replies; 3+ messages in thread
From: Dave Chinner @ 2015-08-17 23:07 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ext4 Developers List, Eryu Guan, fstests

[cc fstests@vger.kernel.org as we are talking about the test rather
than the kernel behaviour. ]

On Mon, Aug 17, 2015 at 10:50:56AM -0400, Theodore Ts'o wrote:
> On Mon, Aug 17, 2015 at 11:12:15AM +1000, Dave Chinner wrote:
> > On Sat, Aug 15, 2015 at 02:59:57PM -0400, Theodore Ts'o wrote:
> > > The xfstests ext4/305 will mount and unmount the same file system over
> > > 4,000 times, and each one of these will cause a system log message.
> > > Ratelimit this message since if we are getting more than a few dozen
> > > of these messages, they probably aren't going to be helpful.
> > 
> > Perhaps you should look at fixing the test or making it a more
> > targetted reproducer. Tests that do "loop doing something basic
> > while looping doing something else basic for X minutes to try to
> > trip a race condition" aren't very good regression tests....
> 
> The problem what we are specifically testing is a race where one
> process is reading from a proc fs file while the file system is being
> unmounted:

*nod*. It's like trying to swat a fly with a sledge hammer: if you
get lucky you might shave the fly, but most of the time you're going
to miss.

> I don't see a better way of doing the test off the top of my head,
> though.... and to be honest I'm not sure how much value the test
> really has, since it's the sort of thing that can easily be checked by
> inspection, and it seems rather unlikely we would regress here.

Yup. A lot of regression tests get written to tick a process box
(i.e. did we fix regression X?), not because they provide on-going
value to prevent future regressions. I try to push back against
tests that won't provide us with useful protection against future
regressions....

FWIW, if we need to trigger a specific race in XFS for testing
purposes we've historically added debug code to add a delay in the
kernel code to allow the race condition to trigger. e.g.
tests/xfs/051 pokes a sysfs entry that only exists on
CONFIG_XFS_DEBUG=y builds to delay log recovery so that we can
trigger IO errors via dm-flakey while log recovery is being
performed.

> BTW, out of curiosity I reverted 9559996 and tried running ext4/305
> many times, on a variety of different VM's ranging from 1 to 8 CPU's,
> and using both a SSD and a laptop HDD.
> 
> In all cases, ext3/305 reliably reproduced the failure within 30
> mount/unmount cycles, and in most cases, under a dozen cycles.  (i.e.,
> under two seconds, and usually in a fraction of a second).  So I'm not
> entirely sure why the test was written to run the loop for 3 minutes
> and thousands of mount/unmount cycles.

There were lots of tests being written at the time that used a 3
minute timeout. It's another of those red flags that I tend to
push back on these days, and this is an example of why - usually the
problem can be hit very quickly, or the test is extremely unreliable
and long runtime is the only way to trigger the race. Hence
running for X minutes doesn't really prove anything....

> Eryu, you wrote the test; any thoughts?  At the very least I'd suggest
> cutting the test down so that it runs for at most 2 seconds, if for no
> other reason than to speed up regression test runs.

Rather than time limiting, how about bounding the number of
mount/unmount cycles?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] ext4: ratelimit the file system mounted message
  2015-08-17 23:07     ` [PATCH] ext4: ratelimit the file system mounted message Dave Chinner
@ 2015-08-18  1:18       ` Theodore Ts'o
  2015-08-18  3:55       ` Eryu Guan
  1 sibling, 0 replies; 3+ messages in thread
From: Theodore Ts'o @ 2015-08-18  1:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Ext4 Developers List, Eryu Guan, fstests

On Tue, Aug 18, 2015 at 09:07:52AM +1000, Dave Chinner wrote:
> Yup. A lot of regression tests get written to tick a process box
> (i.e. did we fix regression X?), not because they provide on-going
> value to prevent future regressions. I try to push back against
> tests that won't provide us with useful protection against future
> regressions....

Yeah, that wasn't the case here.  The bug was fixed by Salman Qazi at
Google in May 2012.  The test was created by Eryu nearly a year later
in April 2013.

> Rather than time limiting, how about bounding the number of
> mount/unmount cycles?

Sure, that makes sense.

						- Ted

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] ext4: ratelimit the file system mounted message
  2015-08-17 23:07     ` [PATCH] ext4: ratelimit the file system mounted message Dave Chinner
  2015-08-18  1:18       ` Theodore Ts'o
@ 2015-08-18  3:55       ` Eryu Guan
  1 sibling, 0 replies; 3+ messages in thread
From: Eryu Guan @ 2015-08-18  3:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Theodore Ts'o, Ext4 Developers List, fstests

On Tue, Aug 18, 2015 at 09:07:52AM +1000, Dave Chinner wrote:
> [cc fstests@vger.kernel.org as we are talking about the test rather
> than the kernel behaviour. ]
> 
[snip]
> > In all cases, ext3/305 reliably reproduced the failure within 30
> > mount/unmount cycles, and in most cases, under a dozen cycles.  (i.e.,
> > under two seconds, and usually in a fraction of a second).  So I'm not
> > entirely sure why the test was written to run the loop for 3 minutes
> > and thousands of mount/unmount cycles.
> 
> There were lots of tests being written at the time that used a 3
> minute timeout. It's another of those red flags that I tend to
> push back on these days, and this is an example of why - usually the
> problem can be hit very quickly, or the test is extremely unreliable
> and long runtime is the only way to trigger the race. Hence
> running for X minutes doesn't really prove anything....

IIRC, 3 minutes time limit was based on my testing before I submitted
the patch, but I could be wrong, it was two years ago..

I think I have better understanding of xfstests and regression tests
now than two years ago, after years education on the list (mainly by
Dave :-))

> 
> > Eryu, you wrote the test; any thoughts?  At the very least I'd suggest
> > cutting the test down so that it runs for at most 2 seconds, if for no
> > other reason than to speed up regression test runs.
> 
> Rather than time limiting, how about bounding the number of
> mount/unmount cycles?

Agreed, 30 cycles seem a reasonable number, I can prepare a patch if no
objection.

Thanks Ted and Dave for looking into this!

Eryu

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2015-08-18  3:55 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1439665197-10766-1-git-send-email-tytso@mit.edu>
     [not found] ` <20150817011215.GA714@dastard>
     [not found]   ` <20150817145056.GC27202@thunk.org>
2015-08-17 23:07     ` [PATCH] ext4: ratelimit the file system mounted message Dave Chinner
2015-08-18  1:18       ` Theodore Ts'o
2015-08-18  3:55       ` Eryu Guan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).