* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
  [not found] <20250203185519.GA2888598@zen.localdomain>
@ 2025-02-03 19:12 ` Amir Goldstein
  2025-02-04  0:57 ` Dave Chinner
  0 siblings, 1 reply; 4+ messages in thread

From: Amir Goldstein @ 2025-02-03 19:12 UTC
To: Boris Burkov; +Cc: lsf-pc, linux-fsdevel, fstests

CC fstests

On Mon, Feb 3, 2025 at 7:54 PM Boris Burkov <boris@bur.io> wrote:
>
> At Meta, we currently rely primarily on fstests 'auto' runs for validating Btrfs as a general purpose filesystem for all of our root drives. While this has obviously proven to be a very useful test suite with rich collaboration across teams and filesystems, we have observed a recent trend in our production filesystem issues that makes us question whether it is sufficient.
>
> Over the last few years, we have had a number of issues (primarily in Btrfs, but at least one notable one in XFS) that have been detected in production, then reproduced with an unreliable, non-specific stressor that takes hours or even days to trigger the issue.
> Examples:
> - Btrfs relocation bugs
>   https://lore.kernel.org/linux-btrfs/68766e66ed15ca2e7550585ed09434249db912a2.1727212293.git.josef@toxicpanda.com/
>   https://lore.kernel.org/linux-btrfs/fc61fb63e534111f5837c204ec341c876637af69.1731513908.git.josef@toxicpanda.com/
> - Btrfs extent map merging corruption
>   https://lore.kernel.org/linux-btrfs/9b98ba80e2cf32f6fb3b15dae9ee92507a9d59c7.1729537596.git.boris@bur.io/
> - Btrfs dio data corruptions from bio splitting (mostly our internal errors trying to make minimal backports of https://lore.kernel.org/linux-btrfs/cover.1679512207.git.boris@bur.io/ and Christoph's related series)
> - XFS large folios
>   https://lore.kernel.org/linux-fsdevel/effc0ec7-cf9d-44dc-aee5-563942242522@meta.com/
>
> In my view, the common threads between these are:
> - We used fstests to validate these systems, in some cases even with specific regression tests for highly related bugs, but still missed the bugs until they hit us during our production release process. In all cases, we had passing 'fstests -g auto' runs.
> - We were able to reproduce the bugs with a predictable concoction of "run a workload and some known nasty btrfs operations in parallel". The most common form of this was running 'fsstress' and 'btrfs balance', but it wasn't quite universal. Sometimes we needed reflink threads, or drop_caches, or memory pressure, etc. to trigger a bug.
> - The relatively generic stressing reproducers took hours or days to produce an issue; the investigating engineer could then tweak and tune them by trial and error to bring that time down for a particular bug.
>
> This leads me to the conclusion that there is some room for improvement in stress testing filesystems (at least Btrfs).
>
> I attempted to study the prior art on this and so far have found:
> - fsstress/fsx and the attendant tests in fstests/. There are ~150-200 tests using fsstress and fsx in fstests/. Most of them are xfs and btrfs tests following the aforementioned pattern of racing fsstress with some scary operations. Most of them tend to run for 30s, though some are longer (and of course subject to TIME_FACTOR configuration).
> - Similar duration error injection tests in fstests (e.g. generic/475)
> - The NFSv4 Test Project: https://www.kernel.org/doc/ols/2006/ols2006v2-pages-275-294.pdf
>   A choice quote regarding stress testing: "One year after we started using FSSTRESS (in April 2005) Linux NFSv4 was able to sustain the concurrent load of 10 processes during 24 hours, without any problem. Three months later, NFSv4 reached 72 hours of stress under FSSTRESS, without any bugs. From this date, NFSv4 filesystem tree manipulation is considered to be stable."
>
> I would like to discuss:
> - Am I missing other strategies people are employing? Apologies if there are obvious ones, but I tried to hunt around for a few days :)
> - What is the universe of interesting stressors (e.g., reflink, scrub, online repair, balance, etc.)?
> - What is the universe of interesting validation conditions (e.g., kernel panic, read-only fs, fsck failure, data integrity error, etc.)?
> - Is there any interest in automating longer running fsstress runs? Are people already doing this with varying TIME_FACTOR configurations in fstests?
> - There is relatively less testing with fsx than fsstress in fstests. I believe this creates gaps for data corruption bugs, as opposed to the "feature logic" issues that the fsstress feature set tends to hit.
> - Can we standardize on some modular "stressors" and stress durations to run to validate file systems?
>
> In the short term, I have been working on these ideas in a separate barebones stress testing framework which I am happy to share, but which isn't particularly interesting in and of itself. It is basically just a skeleton for running some concurrent "stressors" and then validating the fs with some generic "validators". I plan to run it internally just to see if I can get some useful results on our next few major kernel releases.
>
> And of course, I would love to discuss anything else of interest to people who like stress testing filesystems!
>
> Thanks,
> Boris
* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
  2025-02-03 19:12 ` [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems Amir Goldstein
@ 2025-02-04  0:57 ` Dave Chinner
  2025-02-04 19:58 ` Boris Burkov
  0 siblings, 1 reply; 4+ messages in thread

From: Dave Chinner @ 2025-02-04  0:57 UTC
To: Amir Goldstein; +Cc: Boris Burkov, lsf-pc, linux-fsdevel, fstests

On Mon, Feb 03, 2025 at 08:12:59PM +0100, Amir Goldstein wrote:
> CC fstests
>
> On Mon, Feb 3, 2025 at 7:54 PM Boris Burkov <boris@bur.io> wrote:
> >
> > At Meta, we currently rely primarily on fstests 'auto' runs for validating Btrfs as a general purpose filesystem for all of our root drives. While this has obviously proven to be a very useful test suite with rich collaboration across teams and filesystems, we have observed a recent trend in our production filesystem issues that makes us question whether it is sufficient.
> >
> > Over the last few years, we have had a number of issues (primarily in Btrfs, but at least one notable one in XFS) that have been detected in production, then reproduced with an unreliable, non-specific stressor that takes hours or even days to trigger the issue.
> > Examples:
> > - Btrfs relocation bugs
> >   https://lore.kernel.org/linux-btrfs/68766e66ed15ca2e7550585ed09434249db912a2.1727212293.git.josef@toxicpanda.com/
> >   https://lore.kernel.org/linux-btrfs/fc61fb63e534111f5837c204ec341c876637af69.1731513908.git.josef@toxicpanda.com/
> > - Btrfs extent map merging corruption
> >   https://lore.kernel.org/linux-btrfs/9b98ba80e2cf32f6fb3b15dae9ee92507a9d59c7.1729537596.git.boris@bur.io/
> > - Btrfs dio data corruptions from bio splitting (mostly our internal errors trying to make minimal backports of https://lore.kernel.org/linux-btrfs/cover.1679512207.git.boris@bur.io/ and Christoph's related series)
> > - XFS large folios
> >   https://lore.kernel.org/linux-fsdevel/effc0ec7-cf9d-44dc-aee5-563942242522@meta.com/
> >
> > In my view, the common threads between these are:
> > - We used fstests to validate these systems, in some cases even with specific regression tests for highly related bugs, but still missed the bugs until they hit us during our production release process. In all cases, we had passing 'fstests -g auto' runs.

Have you considered the 'soak' test group with a long SOAK_DURATION and then increasing the load using LOAD_FACTOR? Also there is a 'stress' group that TIME_FACTOR acts on.

For XFS, there's also a bunch of fuzzing tests (in the dangerous_fuzzers group) that use the same SOAK_DURATION infrastructure via common/fuzzy.

> > - We were able to reproduce the bugs with a predictable concoction of "run a workload and some known nasty btrfs operations in parallel". The most common form of this was running 'fsstress' and 'btrfs balance', but it wasn't quite universal. Sometimes we needed reflink threads, or drop_caches, or memory pressure, etc. to trigger a bug.

That's pretty much what check-parallel does to a system. Loads of tests run things like drop_caches, memory compaction, CPU hotplug, etc. check-parallel essentially exposes every test to these sorts of background perturbations rather than just the one test that is running that perturbation. IOWs, even the most basic correctness test now gets exercised while cpu hotplug and memory compaction are going on in the background....
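
To be concrete, the perturbations I mean are trivial loops like these - an illustrative composite, not code lifted from any one test; the knobs themselves are standard kernel interfaces:

    # background noise of the sort individual fstests generate today
    while :; do
        echo 3 > /proc/sys/vm/drop_caches       # toss clean page/slab caches
        echo 1 > /proc/sys/vm/compact_memory    # force memory compaction
        sleep 5
    done &

    # cpu hotplug churn; the glob only matches CPUs that can be offlined
    while :; do
        for cpu in /sys/devices/system/cpu/cpu[0-9]*/online; do
            echo 0 > $cpu; sleep 1; echo 1 > $cpu
        done
    done &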
Eventually, I plan to implement these background perturbations as separate control tasks for check-parallel so we don't need specific tests that run a background perturbation whilst the rest of the system is under test.

> > - The relatively generic stressing reproducers took hours or days to produce an issue; the investigating engineer could then tweak and tune them by trial and error to bring that time down for a particular bug.
> >
> > This leads me to the conclusion that there is some room for improvement in stress testing filesystems (at least Btrfs).
> >
> > I attempted to study the prior art on this and so far have found:
> > - fsstress/fsx and the attendant tests in fstests/. There are ~150-200 tests using fsstress and fsx in fstests/. Most of them are xfs and btrfs tests following the aforementioned pattern of racing fsstress with some scary operations. Most of them tend to run for 30s, though some are longer (and of course subject to TIME_FACTOR configuration).

As per above, SOAK_DURATION.

> > - Similar duration error injection tests in fstests (e.g. generic/475)
> > - The NFSv4 Test Project: https://www.kernel.org/doc/ols/2006/ols2006v2-pages-275-294.pdf
> >   A choice quote regarding stress testing: "One year after we started using FSSTRESS (in April 2005) Linux NFSv4 was able to sustain the concurrent load of 10 processes during 24 hours, without any problem. Three months later, NFSv4 reached 72 hours of stress under FSSTRESS, without any bugs. From this date, NFSv4 filesystem tree manipulation is considered to be stable."
> >
> > I would like to discuss:
> > - Am I missing other strategies people are employing? Apologies if there are obvious ones, but I tried to hunt around for a few days :)

check-parallel.

> > - What is the universe of interesting stressors (e.g., reflink, scrub, online repair, balance, etc.)?

memory compaction, cpu hotplug, random reflinks of the underlying loop device image files to simulate dynamic VM image file snapshots, etc.

> > - What is the universe of interesting validation conditions (e.g., kernel panic, read-only fs, fsck failure, data integrity error, etc.)?

All of them. That's the point of check-parallel - it uses simple, existing filesystem correctness tests to generate a massively stressful load on the system...

> > - Is there any interest in automating longer running fsstress runs? Are people already doing this with varying TIME_FACTOR configurations in fstests?

At least for XFS, Darrick is already doing that, and I think Carlos may be as well.

> > - There is relatively less testing with fsx than fsstress in fstests. I believe this creates gaps for data corruption bugs, as opposed to the "feature logic" issues that the fsstress feature set tends to hit.
> > - Can we standardize on some modular "stressors" and stress durations to run to validate file systems?

I think we already have that with the "soak" and "stress" groups...

> > In the short term, I have been working on these ideas in a separate barebones stress testing framework which I am happy to share, but which isn't particularly interesting in and of itself. It is basically just a skeleton for running some concurrent "stressors" and then validating the fs with some generic "validators". I plan to run it internally just to see if I can get some useful results on our next few major kernel releases.

check-parallel is effectively a massive concurrent stress workload for the system.
It does this by running many individual correctness tests concurrently.

Run it on a 64p system or larger, and it will hammer both the test filesystems and the base filesystem that all the loop device image files are laid out on. I'm seeing it generate 5-6GB/s of IO load, use 40-50GB of memory, consistently burn >90% of the CPU in the system, and stress the scheduler at over half a million context switches/s.

> > And of course, I would love to discuss anything else of interest to people who like stress testing filesystems!

Filesystem stress testing by itself isn't really interesting to me. Using filesystem correctness tests to create massively stressful workloads, OTOH, attacks the problem from multiple angles and exercises the system well outside the bounds of just filesystem code.

-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
  2025-02-04  0:57 ` Dave Chinner
@ 2025-02-04 19:58 ` Boris Burkov
  2025-02-04 21:14 ` Dave Chinner
  0 siblings, 1 reply; 4+ messages in thread

From: Boris Burkov @ 2025-02-04 19:58 UTC
To: Dave Chinner; +Cc: Amir Goldstein, lsf-pc, linux-fsdevel, fstests

On Tue, Feb 04, 2025 at 11:57:09AM +1100, Dave Chinner wrote:
> On Mon, Feb 03, 2025 at 08:12:59PM +0100, Amir Goldstein wrote:
> > CC fstests
> >
> > On Mon, Feb 3, 2025 at 7:54 PM Boris Burkov <boris@bur.io> wrote:
> > >
> > > At Meta, we currently rely primarily on fstests 'auto' runs for validating Btrfs as a general purpose filesystem for all of our root drives. While this has obviously proven to be a very useful test suite with rich collaboration across teams and filesystems, we have observed a recent trend in our production filesystem issues that makes us question whether it is sufficient.
> > >
> > > Over the last few years, we have had a number of issues (primarily in Btrfs, but at least one notable one in XFS) that have been detected in production, then reproduced with an unreliable, non-specific stressor that takes hours or even days to trigger the issue.
> > > Examples:
> > > - Btrfs relocation bugs
> > >   https://lore.kernel.org/linux-btrfs/68766e66ed15ca2e7550585ed09434249db912a2.1727212293.git.josef@toxicpanda.com/
> > >   https://lore.kernel.org/linux-btrfs/fc61fb63e534111f5837c204ec341c876637af69.1731513908.git.josef@toxicpanda.com/
> > > - Btrfs extent map merging corruption
> > >   https://lore.kernel.org/linux-btrfs/9b98ba80e2cf32f6fb3b15dae9ee92507a9d59c7.1729537596.git.boris@bur.io/
> > > - Btrfs dio data corruptions from bio splitting (mostly our internal errors trying to make minimal backports of https://lore.kernel.org/linux-btrfs/cover.1679512207.git.boris@bur.io/ and Christoph's related series)
> > > - XFS large folios
> > >   https://lore.kernel.org/linux-fsdevel/effc0ec7-cf9d-44dc-aee5-563942242522@meta.com/
> > >
> > > In my view, the common threads between these are:
> > > - We used fstests to validate these systems, in some cases even with specific regression tests for highly related bugs, but still missed the bugs until they hit us during our production release process. In all cases, we had passing 'fstests -g auto' runs.
>
> Have you considered the 'soak' test group with a long SOAK_DURATION and then increasing the load using LOAD_FACTOR? Also there is a 'stress' group that TIME_FACTOR acts on.
>
> For XFS, there's also a bunch of fuzzing tests (in the dangerous_fuzzers group) that use the same SOAK_DURATION infrastructure via common/fuzzy.

I hadn't realized people were running these for multi-day durations. Thanks for pointing them out, and for your other inline answers to my questions.

> > > - We were able to reproduce the bugs with a predictable concoction of "run a workload and some known nasty btrfs operations in parallel". The most common form of this was running 'fsstress' and 'btrfs balance', but it wasn't quite universal. Sometimes we needed reflink threads, or drop_caches, or memory pressure, etc. to trigger a bug.
>
> That's pretty much what check-parallel does to a system. Loads of tests run things like drop_caches, memory compaction, CPU hotplug, etc. check-parallel essentially exposes every test to these sorts of background perturbations rather than just the one test that is running that perturbation.
> IOWs, even the most basic correctness test now gets exercised while cpu hotplug and memory compaction are going on in the background....
>
> Eventually, I plan to implement these background perturbations as separate control tasks for check-parallel so we don't need specific tests that run a background perturbation whilst the rest of the system is under test.

I think that a framework for introducing background perturbations while running tests is definitely what I'm getting at. If check-parallel is a good version of that, then that sounds great to me. I am particularly excited about your point that it will smash together *every* stimulus with *every* test. I do have some questions in my head about how that would work in practice.

My main questions/concerns are:

How much do you randomize the interleaving of tests? Does check-parallel run them in a random order?

Similarly, the tests' durations are not at all tuned to maximize interesting interactions. If test X and test Y would collide on some faulty interaction, but test X runs only once and finishes in a second, then you would likely never see test X interfere with some interesting moment during test Y. Are you considering feeding the tests back into the run-queue as they finish for these stress-style runs?

It seems that the two objectives of the test harness are somewhat in tension when using check-parallel to stress things. On the one hand you want tests to independently succeed or fail, and on the other hand you want noise from one test to disturb the others. I fear more of the failures will turn out to be "Oh, well, when THAT happens, we would expect this condition to be violated" - especially for the more "unit test" style fstests that carefully use sync to check specific conditions during a run.

This variant also feels like it would make it extremely difficult to distill a failure into a reproducer.

> > > - The relatively generic stressing reproducers took hours or days to produce an issue; the investigating engineer could then tweak and tune them by trial and error to bring that time down for a particular bug.
> > >
> > > This leads me to the conclusion that there is some room for improvement in stress testing filesystems (at least Btrfs).
> > >
> > > I attempted to study the prior art on this and so far have found:
> > > - fsstress/fsx and the attendant tests in fstests/. There are ~150-200 tests using fsstress and fsx in fstests/. Most of them are xfs and btrfs tests following the aforementioned pattern of racing fsstress with some scary operations. Most of them tend to run for 30s, though some are longer (and of course subject to TIME_FACTOR configuration).
>
> As per above, SOAK_DURATION.
>
> > > - Similar duration error injection tests in fstests (e.g. generic/475)
> > > - The NFSv4 Test Project: https://www.kernel.org/doc/ols/2006/ols2006v2-pages-275-294.pdf
> > >   A choice quote regarding stress testing: "One year after we started using FSSTRESS (in April 2005) Linux NFSv4 was able to sustain the concurrent load of 10 processes during 24 hours, without any problem. Three months later, NFSv4 reached 72 hours of stress under FSSTRESS, without any bugs. From this date, NFSv4 filesystem tree manipulation is considered to be stable."
> > >
> > > I would like to discuss:
> > > - Am I missing other strategies people are employing? Apologies if there are obvious ones, but I tried to hunt around for a few days :)
>
> check-parallel.
> > > - What is the universe of interesting stressors (e.g., reflink, scrub, online repair, balance, etc.)?
>
> memory compaction, cpu hotplug, random reflinks of the underlying loop device image files to simulate dynamic VM image file snapshots, etc.
>
> > > - What is the universe of interesting validation conditions (e.g., kernel panic, read-only fs, fsck failure, data integrity error, etc.)?
>
> All of them. That's the point of check-parallel - it uses simple, existing filesystem correctness tests to generate a massively stressful load on the system...
>
> > > - Is there any interest in automating longer running fsstress runs? Are people already doing this with varying TIME_FACTOR configurations in fstests?
>
> At least for XFS, Darrick is already doing that, and I think Carlos may be as well.
>
> > > - There is relatively less testing with fsx than fsstress in fstests. I believe this creates gaps for data corruption bugs, as opposed to the "feature logic" issues that the fsstress feature set tends to hit.
> > > - Can we standardize on some modular "stressors" and stress durations to run to validate file systems?
>
> I think we already have that with the "soak" and "stress" groups...
>
> > > In the short term, I have been working on these ideas in a separate barebones stress testing framework which I am happy to share, but which isn't particularly interesting in and of itself. It is basically just a skeleton for running some concurrent "stressors" and then validating the fs with some generic "validators". I plan to run it internally just to see if I can get some useful results on our next few major kernel releases.
>
> check-parallel is effectively a massive concurrent stress workload for the system. It does this by running many individual correctness tests concurrently.
>
> Run it on a 64p system or larger, and it will hammer both the test filesystems and the base filesystem that all the loop device image files are laid out on. I'm seeing it generate 5-6GB/s of IO load, use 40-50GB of memory, consistently burn >90% of the CPU in the system, and stress the scheduler at over half a million context switches/s.

I will definitely invest some time into getting check-parallel to run with btrfs, and hopefully it turns up some interesting stuff.

> > > And of course, I would love to discuss anything else of interest to people who like stress testing filesystems!
>
> Filesystem stress testing by itself isn't really interesting to me. Using filesystem correctness tests to create massively stressful workloads, OTOH, attacks the problem from multiple angles and exercises the system well outside the bounds of just filesystem code.

From what I see, today we have a handful of tests which race fsx or fsstress with 0-2 operations under test, and you are proposing using check-parallel to hammer the computer with the entirety of all 1000 tests in parallel (awesome). I think I am proposing something in between, where we run fsx AND fsstress AND ~10 known scary operations. That has proven to dredge up bugs in btrfs (where the simpler "fsstress plus one thing" doesn't). I think check-parallel will be more stressful, but this "mega fsstress run" will be more predictable and easier to tune and to get reproducers out of.
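
Concretely, the skeleton is roughly this - a deliberately simplified sketch rather than our actual framework, with illustrative device paths, durations, and op mix:

    #!/bin/bash
    # One workload generator plus a pile of nasty btrfs ops, all racing
    # on the same filesystem, followed by generic validators.
    DEV=/dev/vdb; MNT=/mnt/scratch
    DURATION=$((4 * 3600))

    mkdir -p "$MNT/stress"
    fsstress -d "$MNT/stress" -p 16 -n 10000000 &     # metadata-heavy load
    fsx -N 10000000 "$MNT/fsx.dat" &                  # data integrity load

    (while :; do btrfs balance start --full-balance "$MNT"; done) &
    (while :; do btrfs scrub start -B "$MNT"; done) &
    (while :; do echo 3 > /proc/sys/vm/drop_caches; sleep 10; done) &
    # ...reflink threads, memory pressure, etc. slot in the same way

    sleep "$DURATION"
    kill $(jobs -p) 2>/dev/null
    wait

    # validators: did the fs flip read-only? does fsck pass?
    findmnt -n -o OPTIONS "$MNT" | grep -q '^ro' && echo "fs went readonly"
    umount "$MNT"
    btrfs check "$DEV" || echo "fsck failed"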
Thanks again for your thoughts,
Boris

> -Dave.
> --
> Dave Chinner
> david@fromorbit.com

* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
  2025-02-04 19:58 ` Boris Burkov
@ 2025-02-04 21:14 ` Dave Chinner
  0 siblings, 0 replies; 4+ messages in thread

From: Dave Chinner @ 2025-02-04 21:14 UTC
To: Boris Burkov; +Cc: Amir Goldstein, lsf-pc, linux-fsdevel, fstests

On Tue, Feb 04, 2025 at 11:58:46AM -0800, Boris Burkov wrote:
> On Tue, Feb 04, 2025 at 11:57:09AM +1100, Dave Chinner wrote:
> > > > - We were able to reproduce the bugs with a predictable concoction of "run a workload and some known nasty btrfs operations in parallel". The most common form of this was running 'fsstress' and 'btrfs balance', but it wasn't quite universal. Sometimes we needed reflink threads, or drop_caches, or memory pressure, etc. to trigger a bug.
> >
> > That's pretty much what check-parallel does to a system. Loads of tests run things like drop_caches, memory compaction, CPU hotplug, etc. check-parallel essentially exposes every test to these sorts of background perturbations rather than just the one test that is running that perturbation. IOWs, even the most basic correctness test now gets exercised while cpu hotplug and memory compaction are going on in the background....
> >
> > Eventually, I plan to implement these background perturbations as separate control tasks for check-parallel so we don't need specific tests that run a background perturbation whilst the rest of the system is under test.
>
> I think that a framework for introducing background perturbations while running tests is definitely what I'm getting at. If check-parallel is a good version of that, then that sounds great to me. I am particularly excited about your point that it will smash together *every* stimulus with *every* test. I do have some questions in my head about how that would work in practice.
>
> My main questions/concerns are:
>
> How much do you randomize the interleaving of tests? Does check-parallel run them in a random order?

Same as check - the "-r" option will randomise the test run order. The run order is also somewhat randomised by default, in that tests are sorted based on their runtime in the previous test run. Hence the test run order is not static - it generally runs long running tests before short running tests, but the exact order is not fixed.

> Similarly, the tests' durations are not at all tuned to maximize interesting interactions. If test X and test Y would collide on some faulty interaction, but test X runs only once and finishes in a second, then you would likely never see test X interfere with some interesting moment during test Y. Are you considering feeding the tests back into the run-queue as they finish for these stress-style runs?

Not yet - the infrastructure to directly manage and run tests from check-parallel is not yet in place. It currently generates a test list for each runner thread, then executes that via a check instance per runner thread. I plan to have check-parallel execute tests individually itself by factoring the run loop out of check (similar to how I'm doing the test list parsing). Once there is direct control of the test execution, stuff like dynamic test queues - where runners just pull the next test to run off the queue and keep going until the queue is empty - will be possible.
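
(Functionally something like this, though the real thing will live inside check-parallel rather than in a shell one-liner - illustrative only:)

    # N runners each pull the next test off a shared queue until it is empty
    xargs -P "$NR_RUNNERS" -n 1 ./run-one-test < tests.list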
> It seems that the two objectives of the test harness are somewhat in tension when using check-parallel to stress things. On the one hand you want tests to independently succeed or fail, and on the other hand you want noise from one test to disturb the others.

Yes. Tests are largely written such that they don't interfere with each other.

> I fear more of the failures will turn out to be "Oh, well, when THAT happens, we would expect this condition to be violated" - especially for the more "unit test" style fstests that carefully use sync to check specific conditions during a run.

That's why I currently have an "unreliable_in_parallel" test group definition, and check-parallel excludes that test group. There are about 20 tests I've classified this way, most of them xfs specific tests that rely on exact fragmentation patterns being created. These tests are perturbed by things like sync(1) calls from other tests, which result in a different fragmentation pattern than the test expects to see.

In each case, there is a comment in the test explaining the condition that makes the test unreliable in parallel, so we have some idea of what needs fixing to be able to remove it from the unreliable_in_parallel group. Essentially, I'm using this as a marker and note for future improvements once all the (more important) infrastructure work is done and solid.

> This variant also feels like it would make it extremely difficult to distill a failure into a reproducer.

It's pretty obvious when a test is doing something that is influenced by an outside event. The biggest problem for debugging comes when the test failures appear to be real bugs (e.g. all the weird and whacky off-by-one quota failures that check-parallel triggers on XFS) but cannot be reproduced when the tests are run serially.

.....

> > > > And of course, I would love to discuss anything else of interest to people who like stress testing filesystems!
> >
> > Filesystem stress testing by itself isn't really interesting to me. Using filesystem correctness tests to create massively stressful workloads, OTOH, attacks the problem from multiple angles and exercises the system well outside the bounds of just filesystem code.
>
> From what I see, today we have a handful of tests which race fsx or fsstress with 0-2 operations under test, and you are proposing using check-parallel to hammer the computer with the entirety of all 1000 tests in parallel (awesome).

It's currently running one test per CPU in parallel, not all at once. Many tests run lots of stuff in parallel themselves, too, and some of them hammer large CPU count machines really hard just by themselves, let alone when there are another 63 tests running concurrently....

> I think I am proposing something in between, where we run fsx AND fsstress AND ~10 known scary operations.

Write a set of tests that do this for btrfs and put them in the auto/stress/soak groups. Then run 'check-parallel -g soak,stress ....'
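
Such a test is mostly boilerplate. An untested sketch using standard fstests helpers - the tunings are illustrative, and the scary op mix is the part you'd curate:

    #! /bin/bash
    # SPDX-License-Identifier: GPL-2.0
    #
    # btrfs/NNN: race fsstress against a rotation of scary btrfs
    # operations for SOAK_DURATION. Untested sketch only.
    . ./common/preamble
    _begin_fstest auto soak stress

    _require_scratch

    _scratch_mkfs > /dev/null 2>&1
    _scratch_mount

    mkdir $SCRATCH_MNT/stress
    $FSSTRESS_PROG -d $SCRATCH_MNT/stress -p $((4 * LOAD_FACTOR)) \
        -n 10000000 >> $seqres.full 2>&1 &
    stress_pid=$!

    end=$((SECONDS + ${SOAK_DURATION:-30}))
    while ((SECONDS < end)); do
        $BTRFS_UTIL_PROG balance start --full-balance $SCRATCH_MNT \
            >> $seqres.full 2>&1
        $BTRFS_UTIL_PROG scrub start -B $SCRATCH_MNT >> $seqres.full 2>&1
    done
    # fsx, reflink threads, etc. would slot in the same way

    kill $stress_pid 2>/dev/null
    wait

    # the harness's post-test fsck of the scratch fs is the real validator
    echo Silence is golden
    status=0
    exit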
-Dave.
--
Dave Chinner
david@fromorbit.com