[LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems


* [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
@ 2025-02-03 18:55 Boris Burkov
  2025-02-03 19:12 ` Amir Goldstein
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Boris Burkov @ 2025-02-03 18:55 UTC
  To: lsf-pc; +Cc: linux-fsdevel

At Meta, we currently primarily rely on fstests 'auto' runs for
validating Btrfs as a general purpose filesystem for all of our root
drives. While this has obviously proven to be a very useful test suite
with rich collaboration across teams and filesystems, we have observed a
recent trend in our production filesystem issues that makes us question
if it is sufficient.

Over the last few years, we have had a number of issues (primarily in
Btrfs, but at least one notable one in XFS) that were detected in
production, then reproduced only with an unreliable, non-specific
stressor that takes hours or even days to trigger the issue.
Examples:
- Btrfs relocation bugs
https://lore.kernel.org/linux-btrfs/68766e66ed15ca2e7550585ed09434249db912a2.1727212293.git.josef@toxicpanda.com/
https://lore.kernel.org/linux-btrfs/fc61fb63e534111f5837c204ec341c876637af69.1731513908.git.josef@toxicpanda.com/
- Btrfs extent map merging corruption
https://lore.kernel.org/linux-btrfs/9b98ba80e2cf32f6fb3b15dae9ee92507a9d59c7.1729537596.git.boris@bur.io/
- Btrfs dio data corruptions from bio splitting
(mostly our internal errors trying to make minimal backports of
https://lore.kernel.org/linux-btrfs/cover.1679512207.git.boris@bur.io/
and Christoph's related series)
- XFS large folios
https://lore.kernel.org/linux-fsdevel/effc0ec7-cf9d-44dc-aee5-563942242522@meta.com/

In my view, the common threads between these are that:
- we used fstests to validate these systems, in some cases even with
  specific regression tests for highly related bugs, but still missed
  the bugs until they hit us during our production release process. In
  all cases, we had passing 'fstests -g auto' runs.
- we were able to reproduce the bugs with a predictable concoction of "run
  a workload and some known nasty btrfs operations in parallel". The most
  common form of this was running 'fsstress' and 'btrfs balance', but it
  wasn't quite universal. Sometimes we needed reflink threads, or
  drop_caches, or memory pressure, etc. to trigger a bug.
- the relatively generic stressing reproducers took hours or days to
  produce an issue; only then could the investigating engineer tweak and
  tune the reproducer by trial and error to bring that time down for a
  particular bug.

This leads me to the conclusion that there is some room for improvement in
stress testing filesystems (at least Btrfs).

I attempted to study the prior art on this and so far have found:
- fsstress/fsx and the attendant tests in fstests/. There are ~150-200
  tests using fsstress and fsx in fstests/. Most of them are xfs and
  btrfs tests following the aforementioned pattern of racing fsstress
  with some scary operations. Most of them tend to run for 30s, though
  some are longer (and of course subject to TIME_FACTOR configuration)
- Similar duration error injection tests in fstests (e.g. generic/475)
- The NFSv4 Test Project
  https://www.kernel.org/doc/ols/2006/ols2006v2-pages-275-294.pdf 
  A choice quote regarding stress testing:
  "One year after we started using FSSTRESS (in April 2005) Linux NFSv4
  was able to sustain the concurrent load of 10 processes during 24
  hours, without any problem. Three months later, NFSv4 reached 72 hours
  of stress under FSSTRESS, without any bugs. From this date, NFSv4
  filesystem tree manipulation is considered to be stable."


I would like to discuss:
- Am I missing other strategies people are employing? Apologies if there
  are obvious ones, but I tried to hunt around for a few days :)
- What is the universe of interesting stressors (e.g., reflink, scrub,
  online repair, balance, etc.)?
- What is the universe of interesting validation conditions (e.g.,
  kernel panic, read only fs, fsck failure, data integrity error, etc.)?
- Is there any interest in automating longer running fsstress runs? Are
  people already doing this with varying TIME_FACTOR configurations in
  fstests?
- There is relatively less testing with fsx than fsstress in fstests.
  I believe this creates gaps for data corruption bugs rather than
  "feature logic" issues that the fsstress feature set tends to hit.
- Can we standardize on some modular "stressors" and stress durations
  to run to validate file systems?
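
As one illustration of the validation conditions listed above, the sketch
below checks two of them from plain text: whether a mount point has gone
read-only (using /proc/mounts-style lines) and whether a kernel log contains
suspicious messages. The function names and the log patterns are hypothetical
and chosen only for illustration; real kernel messages vary by version and
filesystem, and fsck or data integrity checks would need separate validators.

```python
import re

# Illustrative patterns only (assumption): tune these for the kernel and
# filesystem under test.
SUSPICIOUS_LOG_PATTERNS = [
    re.compile(r"kernel BUG at"),
    re.compile(r"WARNING: CPU"),
    re.compile(r"BTRFS (error|critical)"),
    re.compile(r"XFS .*corruption", re.IGNORECASE),
]


def is_mounted_read_only(mounts_text: str, mount_point: str) -> bool:
    """Return True if the last mount at mount_point carries the 'ro' option.

    mounts_text uses the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    read_only = False
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # Later entries shadow earlier ones, so keep the last match.
            read_only = "ro" in fields[3].split(",")
    return read_only


def suspicious_log_lines(log_text: str) -> list[str]:
    """Return the log lines that match any of the illustrative patterns."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in SUSPICIOUS_LOG_PATTERNS)
    ]
```

In practice the inputs would come from reading /proc/mounts and the kernel
log after a stress run; passing them in as strings keeps the checks easy to
test.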

In the short term, I have been working on these ideas in a separate
barebones stress testing framework, which I am happy to share but which
isn't particularly interesting in and of itself. It is basically just a
skeleton for running some "stressors" concurrently and then
validating the fs with some generic "validators". I plan to run it
internally just to see if I can get some useful results on our next few
major kernel releases.
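
A minimal sketch of what such a skeleton might look like, assuming each
stressor is a callable that runs until a stop event is set and each validator
is a callable that returns True when the filesystem looks healthy. All names
here (run_stress, Stressor, Validator) are hypothetical, and this is not the
framework mentioned above.

```python
import threading
from typing import Callable, Dict, List

Stressor = Callable[[threading.Event], None]  # loops until the event is set
Validator = Callable[[], bool]                # True means "looks healthy"


def run_stress(stressors: List[Stressor],
               validators: Dict[str, Validator],
               duration_s: float) -> List[str]:
    """Run all stressors concurrently, then run every validator.

    Returns a list of failure descriptions; an empty list means success.
    """
    stop = threading.Event()
    failures: List[str] = []
    lock = threading.Lock()

    def guarded(index: int, stressor: Stressor) -> None:
        try:
            stressor(stop)
        except Exception as exc:
            with lock:
                failures.append(f"stressor[{index}]: {exc!r}")
            stop.set()  # a crashed stressor ends the run early

    threads = [
        threading.Thread(target=guarded, args=(i, s))
        for i, s in enumerate(stressors)
    ]
    for thread in threads:
        thread.start()
    stop.wait(duration_s)  # returns early only if a stressor crashed
    stop.set()
    for thread in threads:
        thread.join()      # assumes stressors honor the stop event

    for name, check in validators.items():
        if not check():
            failures.append(f"validator[{name}]")
    return failures
```

A real stressor would typically launch a tool such as fsstress or a balance
operation in a subprocess and terminate it once the stop event is set; that
part is left out here to keep the sketch self-contained.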

And of course, I would love to discuss anything else of interest to
people who like stress testing filesystems!

Thanks,
Boris


* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
  2025-02-03 18:55 [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems Boris Burkov
@ 2025-02-03 19:12 ` Amir Goldstein
  2025-02-04  0:57   ` Dave Chinner
  2025-02-03 19:14 ` Sweet Tea Dorminy
  2025-02-03 19:53 ` Darrick J. Wong
  2 siblings, 1 reply; 10+ messages in thread
From: Amir Goldstein @ 2025-02-03 19:12 UTC
  To: Boris Burkov; +Cc: lsf-pc, linux-fsdevel, fstests

CC fstests



* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
  2025-02-03 18:55 [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems Boris Burkov
  2025-02-03 19:12 ` Amir Goldstein
@ 2025-02-03 19:14 ` Sweet Tea Dorminy
  2025-02-03 19:53 ` Darrick J. Wong
  2 siblings, 0 replies; 10+ messages in thread
From: Sweet Tea Dorminy @ 2025-02-03 19:14 UTC
  To: Boris Burkov, lsf-pc; +Cc: linux-fsdevel, Matthew Sakai, dm-devel

> I attempted to study the prior art on this and so far have found:
> - fsstress/fsx and the attendant tests in fstests/. There are ~150-200
>    tests using fsstress and fsx in fstests/. Most of them are xfs and
>    btrfs tests following the aforementioned pattern of racing fsstress
>    with some scary operations. Most of them tend to run for 30s, though
>    some are longer (and of course subject to TIME_FACTOR configuration)
> - Similar duration error injection tests in fstests (e.g. generic/475)
> - The NFSv4 Test Project
>    https://www.kernel.org/doc/ols/2006/ols2006v2-pages-275-294.pdf
>    A choice quote regarding stress testing:
>    "One year after we started using FSSTRESS (in April 2005) Linux NFSv4
>    was able to sustain the concurrent load of 10 processes during 24
>    hours, without any problem. Three months later, NFSv4 reached 72 hours
>    of stress under FSSTRESS, without any bugs. From this date, NFSv4
>    filesystem tree manipulation is considered to be stable."
> 
> 
> I would like to discuss:
> - Am I missing other strategies people are employing? Apologies if there
>    are obvious ones, but I tried to hunt around for a few days :)
> - What is the universe of interesting stressors (e.g., reflink, scrub,
>    online repair, balance, etc.)
It's not a filesystem, but the dm-vdo project has some similarities, 
doing deduplication, compression, and thin provisioning. As such, they 
have a fairly extensive set of tests of dm-vdo, and in particular they 
do a fair bit of stress testing.

For them, the universe is reboots, crashes, complete rebuilds, read-only 
entry and exit, compression enable/disable, and 512 byte sector mode 
enable/disable. They've been running about fifty hours a week of these 
tests inside of Red Hat. For instance, 
https://github.com/dm-vdo/vdo-devel/blob/main/src/perl/vdotest/VDOTest/RebuildStress03.pm 
is one of the tests showing the random selection of operations.

When these tests were first introduced eight years ago, they did catch 
some crash or data corruption bugs which were not covered by the 
existing universe of fstests-like tests for dm-vdo. There was also a 
filesystem inconsistency uncovered at the time: 
https://lore.kernel.org/all/CALoZfD4-uqhRSfEh0Y+v8jjSDY2KkAh-hhwdLnRgZopHEETUXA@mail.gmail.com/

I would suggest Matt Sakai, cc'd, or another of the VDO folks as a 
valuable contributor to this discussion, given the VDO folks' long 
experience with stress testing.

Sweet Tea


* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
  2025-02-03 18:55 [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems Boris Burkov
  2025-02-03 19:12 ` Amir Goldstein
  2025-02-03 19:14 ` Sweet Tea Dorminy
@ 2025-02-03 19:53 ` Darrick J. Wong
  2025-02-04 19:38   ` Boris Burkov
  2 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2025-02-03 19:53 UTC
  To: Boris Burkov; +Cc: lsf-pc, linux-fsdevel

On Mon, Feb 03, 2025 at 10:55:19AM -0800, Boris Burkov wrote:
> At Meta, we currently primarily rely on fstests 'auto' runs for
> validating Btrfs as a general purpose filesystem for all of our root
> drives. While this has obviously proven to be a very useful test suite
> with rich collaboration across teams and filesystems, we have observed a
> recent trend in our production filesystem issues that makes us question
> if it is sufficient.
> 
> Over the last few years, we have had a number of issues (primarily in
> Btrfs, but at least one notable one in Xfs) that have been detected in
> production, then reproduced with an unreliable non-specific stressor
> that takes hours or even days to trigger the issue.
> Examples:
> - Btrfs relocation bugs
> https://lore.kernel.org/linux-btrfs/68766e66ed15ca2e7550585ed09434249db912a2.1727212293.git.josef@toxicpanda.com/
> https://lore.kernel.org/linux-btrfs/fc61fb63e534111f5837c204ec341c876637af69.1731513908.git.josef@toxicpanda.com/
> - Btrfs extent map merging corruption
> https://lore.kernel.org/linux-btrfs/9b98ba80e2cf32f6fb3b15dae9ee92507a9d59c7.1729537596.git.boris@bur.io/
> - Btrfs dio data corruptions from bio splitting
> (mostly our internal errors trying to make minimal backports of
> https://lore.kernel.org/linux-btrfs/cover.1679512207.git.boris@bur.io/
> and Christoph's related series)
> - Xfs large folios 
> https://lore.kernel.org/linux-fsdevel/effc0ec7-cf9d-44dc-aee5-563942242522@meta.com/
> 
> In my view, the common threads between these are that:
> - we used fstests to validate these systems, in some cases even with
>   specific regression tests for highly related bugs, but still missed
>   the bugs until they hit us during our production release process. In
>   all cases, we had passing 'fstests -g auto' runs.
> - were able to reproduce the bugs with a predictable concoction of "run
>   a workload and some known nasty btrfs operations in parallel". The most
>   common form of this was running 'fsstress' and 'btrfs balance', but it
>   wasn't quite universal. Sometimes we needed reflink threads, or
>   drop_caches, or memory pressure, etc. to trigger a bug.
> - The relatively generic stressing reproducers took hours or days to
>   produce an issue then the investigating engineer could try to tweak and
>   tune it by trial and error to bring that time down for a particular bug.
> 
> This leads me to the conclusion that there is some room for improvement in
> stress testing filesystems (at least Btrfs).
> 
> I attempted to study the prior art on this and so far have found:
> - fsstress/fsx and the attendant tests in fstests/. There are ~150-200
>   tests using fsstress and fsx in fstests/. Most of them are xfs and
>   btrfs tests following the aforementioned pattern of racing fsstress
>   with some scary operations. Most of them tend to run for 30s, though
>   some are longer (and of course subject to TIME_FACTOR configuration)
> - Similar duration error injection tests in fstests (e.g. generic/475)
> - The NFSv4 Test Project
>   https://www.kernel.org/doc/ols/2006/ols2006v2-pages-275-294.pdf 
>   A choice quote regarding stress testing:
>   "One year after we started using FSSTRESS (in April 2005) Linux NFSv4
>   was able to sustain the concurrent load of 10 processes during 24
>   hours, without any problem. Three months later, NFSv4 reached 72 hours
>   of stress under FSSTRESS, without any bugs. From this date, NFSv4
>   filesystem tree manipulation is considered to be stable."
> 
> 
> I would like to discuss:
> - Am I missing other strategies people are employing? Apologies if there
>   are obvious ones, but I tried to hunt around for a few days :)

At the moment I start six VMs per "configuration", which each run one of:

generic/521	(directio)
generic/522	(bufferedio)
generic/476	(fsstress)
generic/388	(fsstress + log recovery)
xfs/285		(online fsck)
xfs/286		(online metadata rebuild)

with SOAK_DURATION=6.5d so that they wrap up right around the time that
each rc release drops.  I also set FSSTRESS_AVOID="-m 16" so that we
don't end up with gigantic quota files.
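
For a sense of scale, the sketch below converts a duration such as "6.5d"
into seconds. It assumes the suffixes s, m, h, d and w mean seconds, minutes,
hours, days and weeks, and that a bare number means seconds. The function
name is hypothetical, and this is only an illustration of the arithmetic, not
the parsing code fstests itself uses.

```python
import re

# Assumed unit suffixes; a bare number is treated as seconds.
_UNIT_SECONDS = {"": 1, "s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def soak_seconds(spec: str) -> int:
    """Convert a duration such as '6.5d' or '90m' into whole seconds."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([smhdw]?)\s*", spec)
    if match is None:
        raise ValueError(f"unrecognized duration: {spec!r}")
    value, unit = match.groups()
    return int(float(value) * _UNIT_SECONDS[unit])
```

Under these assumptions, soak_seconds("6.5d") is 561600 seconds, or 156
hours, which leaves 12 hours of slack in a 168-hour week for setup and
result collection.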

There are two "configurations" per kernel tree.  The combinations of
tree and configuration are:

djwong-dev:
-m metadir=1,autofsck=1,uquota,gquota,pquota,
-m metadir=1,autofsck=1,uquota,gquota,pquota, -d rtinherit=1,

tot mainline:
-m autofsck=1, -d rtinherit=1,
-m autofsck=1,

for-next:
-m metadir=1,autofsck=1,uquota,gquota,pquota,
-m metadir=1,autofsck=1,uquota,gquota,pquota, -d rtinherit=1,

Actually, I just realized that with 6.14 I need to update the tot
mainline configuration to have metadir=1.
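
The matrix described above can be enumerated as a cross product of trees,
configurations and soak tests. The sketch below treats the option strings as
opaque labels copied from this message, and the names SOAK_TESTS, CONFIGS and
soak_jobs are hypothetical. It assumes every combination gets its own VM.

```python
from itertools import product

# The six soak tests listed above, one VM each per configuration.
SOAK_TESTS = [
    "generic/521", "generic/522", "generic/476",
    "generic/388", "xfs/285", "xfs/286",
]

# Option strings copied verbatim from the message; treated as labels only.
CONFIGS = {
    "djwong-dev": [
        "-m metadir=1,autofsck=1,uquota,gquota,pquota,",
        "-m metadir=1,autofsck=1,uquota,gquota,pquota, -d rtinherit=1,",
    ],
    "tot mainline": [
        "-m autofsck=1, -d rtinherit=1,",
        "-m autofsck=1,",
    ],
    "for-next": [
        "-m metadir=1,autofsck=1,uquota,gquota,pquota,",
        "-m metadir=1,autofsck=1,uquota,gquota,pquota, -d rtinherit=1,",
    ],
}


def soak_jobs():
    """Return one (tree, config, test) tuple per VM in the matrix."""
    jobs = []
    for tree, configs in CONFIGS.items():
        for config, test in product(configs, SOAK_TESTS):
            jobs.append((tree, config, test))
    return jobs
```

With three trees, two configurations per tree and six tests, this yields
3 * 2 * 6 = 36 jobs per weekly cycle.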

> - What is the universe of interesting stressors (e.g., reflink, scrub,
>   online repair, balance, etc.)

Prodding djwong and everyone else into loading up fsx/fsstress with
all their weird new file io calls. ;)

> - What is the universe of interesting validation conditions (e.g.,
>   kernel panic, read only fs, fsck failure, data integrity error, etc.)
> - Is there any interest in automating longer running fsstress runs? Are
>   people already doing this with varying TIME_FACTOR configurations in
>   fstests?

I don't run with SOAK_DURATION > 14 days because I generally haven't
found larger values to be useful in finding bugs.  However, these weekly
long soak test runs have been going since 2016.

FWIW that actually started because we had a lot of customer complaints
in that era about log recovery failures in xfs, and only later did I
spread it beyond generic/388 to the six profiles above.

> - There is relatively less testing with fsx than fsstress in fstests.
>   I believe this creates gaps for data corruption bugs rather than
>   "feature logic" issues that the fsstress feature set tends to hit.

Probably.  I wonder how much we're really flexing io_uring?

--D

> - Can we standardize on some modular "stressors" and stress durations
>   to run to validate file systems?
> 
> In the short term, I have been working on these ideas in a separate
> barebones stress testing framework which I am happy to share, but isn't
> particularly interesting in and of itself. It is basically just a
> skeleton for concurrently running some concurrent "stressors" and then
> validating the fs with some generic "validators". I plan to run it
> internally just to see if I can get some useful results on our next few
> major kernel releases.
> 
> And of course, I would love to discuss anything else of interest to
> people who like stress testing filesystems!
> 
> Thanks,
> Boris
> 


* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
  2025-02-03 19:12 ` Amir Goldstein
@ 2025-02-04  0:57   ` Dave Chinner
  2025-02-04 19:58     ` Boris Burkov
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2025-02-04  0:57 UTC
  To: Amir Goldstein; +Cc: Boris Burkov, lsf-pc, linux-fsdevel, fstests

On Mon, Feb 03, 2025 at 08:12:59PM +0100, Amir Goldstein wrote:
> CC fstests
> 
> On Mon, Feb 3, 2025 at 7:54 PM Boris Burkov <boris@bur.io> wrote:
> >
> > At Meta, we currently primarily rely on fstests 'auto' runs for
> > validating Btrfs as a general purpose filesystem for all of our root
> > drives. While this has obviously proven to be a very useful test suite
> > with rich collaboration across teams and filesystems, we have observed a
> > recent trend in our production filesystem issues that makes us question
> > if it is sufficient.
> >
> > Over the last few years, we have had a number of issues (primarily in
> > Btrfs, but at least one notable one in Xfs) that have been detected in
> > production, then reproduced with an unreliable non-specific stressor
> > that takes hours or even days to trigger the issue.
> > Examples:
> > - Btrfs relocation bugs
> > https://lore.kernel.org/linux-btrfs/68766e66ed15ca2e7550585ed09434249db912a2.1727212293.git.josef@toxicpanda.com/
> > https://lore.kernel.org/linux-btrfs/fc61fb63e534111f5837c204ec341c876637af69.1731513908.git.josef@toxicpanda.com/
> > - Btrfs extent map merging corruption
> > https://lore.kernel.org/linux-btrfs/9b98ba80e2cf32f6fb3b15dae9ee92507a9d59c7.1729537596.git.boris@bur.io/
> > - Btrfs dio data corruptions from bio splitting
> > (mostly our internal errors trying to make minimal backports of
> > https://lore.kernel.org/linux-btrfs/cover.1679512207.git.boris@bur.io/
> > and Christoph's related series)
> > - Xfs large folios
> > https://lore.kernel.org/linux-fsdevel/effc0ec7-cf9d-44dc-aee5-563942242522@meta.com/
> >
> > In my view, the common threads between these are that:
> > - we used fstests to validate these systems, in some cases even with
> >   specific regression tests for highly related bugs, but still missed
> >   the bugs until they hit us during our production release process. In
> >   all cases, we had passing 'fstests -g auto' runs.

Have you considered the 'soak' test group with a long SOAK_DURATION
and then increasing the load using LOAD_FACTOR? Also there is a
'stress' group that TIME_FACTOR acts on.

For XFS, there's also a bunch of fuzzing tests (in the
dangerous_fuzzers group) that use the same SOAK_DURATION
infrastructure via common/fuzzy.


> > - were able to reproduce the bugs with a predictable concoction of "run
> >   a workload and some known nasty btrfs operations in parallel". The most
> >   common form of this was running 'fsstress' and 'btrfs balance', but it
> >   wasn't quite universal. Sometimes we needed reflink threads, or
> >   drop_caches, or memory pressure, etc. to trigger a bug.

That's pretty much what check-parallel does to a system. Loads of
tests run things like drop_caches, memory compaction, CPU hotplug,
etc. check-parallel essentially exposes every test to these sorts
of background perturbations rather than just the one test that is
running that perturbation. IOWs, even the most basic correctness
test now gets exercised while cpu hotplug and memory compaction are
going on in the background....

Eventually, I plan to implement these background perturbations as
separate control tasks for check-parallel so we don't need specific
tests that run a background perturbation whilst the rest of the
system is under test.

> > - The relatively generic stressing reproducers took hours or days to
> >   produce an issue then the investigating engineer could try to tweak and
> >   tune it by trial and error to bring that time down for a particular bug.
> >
> > This leads me to the conclusion that there is some room for improvement in
> > stress testing filesystems (at least Btrfs).
> >
> > I attempted to study the prior art on this and so far have found:
> > - fsstress/fsx and the attendant tests in fstests/. There are ~150-200
> >   tests using fsstress and fsx in fstests/. Most of them are xfs and
> >   btrfs tests following the aforementioned pattern of racing fsstress
> >   with some scary operations. Most of them tend to run for 30s, though
> >   some are longer (and of course subject to TIME_FACTOR configuration)

As per above, SOAK_DURATION.

> > - Similar duration error injection tests in fstests (e.g. generic/475)
> > - The NFSv4 Test Project
> >   https://www.kernel.org/doc/ols/2006/ols2006v2-pages-275-294.pdf
> >   A choice quote regarding stress testing:
> >   "One year after we started using FSSTRESS (in April 2005) Linux NFSv4
> >   was able to sustain the concurrent load of 10 processes during 24
> >   hours, without any problem. Three months later, NFSv4 reached 72 hours
> >   of stress under FSSTRESS, without any bugs. From this date, NFSv4
> >   filesystem tree manipulation is considered to be stable."
> >
> >
> > I would like to discuss:
> > - Am I missing other strategies people are employing? Apologies if there
> >   are obvious ones, but I tried to hunt around for a few days :)

check-parallel.

> > - What is the universe of interesting stressors (e.g., reflink, scrub,
> >   online repair, balance, etc.)

memory compaction, cpu hotplug, random reflinks of the underlying
loop device image files to simulate dynamic VM image file snapshots,
etc.

> > - What is the universe of interesting validation conditions (e.g.,
> >   kernel panic, read only fs, fsck failure, data integrity error, etc.)

All of them. That's the point of check-parallel - it uses simple,
existing filesystem correctness tests to generate a massively
stressful load on the system...

> > - Is there any interest in automating longer running fsstress runs? Are
> >   people already doing this with varying TIME_FACTOR configurations in
> >   fstests?

At least for XFS, Darrick is already doing that, and I think Carlos
may be as well.

> > - There is relatively less testing with fsx than fsstress in fstests.
> >   I believe this creates gaps for data corruption bugs rather than
> >   "feature logic" issues that the fsstress feature set tends to hit.
> > - Can we standardize on some modular "stressors" and stress durations
> >   to run to validate file systems?

I think we already have that with the "soak" and "stress" groups...

> > In the short term, I have been working on these ideas in a separate
> > barebones stress testing framework which I am happy to share, but isn't
> > particularly interesting in and of itself. It is basically just a
> > skeleton for concurrently running some concurrent "stressors" and then
> > validating the fs with some generic "validators". I plan to run it
> > internally just to see if I can get some useful results on our next few
> > major kernel releases.

check-parallel is effectively a massive concurrent stress workload
for the system. It does this by running many individual correctness
tests concurrently.

Run it on a 64p system or larger, and it will hammer both the test
filesystems and base filesystem that all the loop device image files
are laid out on.  I'm seeing it generate 5-6GB/s of IO load, use 40-50GB
of memory, consistently use >90% of the CPU in the system, and
stress the scheduler at over half a million context switches/s.

> > And of course, I would love to discuss anything else of interest to
> > people who like stress testing filesystems!

Filesystem stress testing by itself isn't really interesting to me.
Using filesystem correctness tests to create massively stressful
workloads, OTOH, attacks the problem from multiple angles and
exercises the system well outside the bounds of just filesystem
code.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
  2025-02-03 19:53 ` Darrick J. Wong
@ 2025-02-04 19:38   ` Boris Burkov
  2025-02-04 22:09     ` Darrick J. Wong
  0 siblings, 1 reply; 10+ messages in thread
From: Boris Burkov @ 2025-02-04 19:38 UTC
  To: Darrick J. Wong; +Cc: lsf-pc, linux-fsdevel

On Mon, Feb 03, 2025 at 11:53:43AM -0800, Darrick J. Wong wrote:
> On Mon, Feb 03, 2025 at 10:55:19AM -0800, Boris Burkov wrote:
> > At Meta, we currently primarily rely on fstests 'auto' runs for
> > validating Btrfs as a general purpose filesystem for all of our root
> > drives. While this has obviously proven to be a very useful test suite
> > with rich collaboration across teams and filesystems, we have observed a
> > recent trend in our production filesystem issues that makes us question
> > if it is sufficient.
> > 
> > Over the last few years, we have had a number of issues (primarily in
> > Btrfs, but at least one notable one in Xfs) that have been detected in
> > production, then reproduced with an unreliable non-specific stressor
> > that takes hours or even days to trigger the issue.
> > Examples:
> > - Btrfs relocation bugs
> > https://lore.kernel.org/linux-btrfs/68766e66ed15ca2e7550585ed09434249db912a2.1727212293.git.josef@toxicpanda.com/
> > https://lore.kernel.org/linux-btrfs/fc61fb63e534111f5837c204ec341c876637af69.1731513908.git.josef@toxicpanda.com/
> > - Btrfs extent map merging corruption
> > https://lore.kernel.org/linux-btrfs/9b98ba80e2cf32f6fb3b15dae9ee92507a9d59c7.1729537596.git.boris@bur.io/
> > - Btrfs dio data corruptions from bio splitting
> > (mostly our internal errors trying to make minimal backports of
> > https://lore.kernel.org/linux-btrfs/cover.1679512207.git.boris@bur.io/
> > and Christoph's related series)
> > - Xfs large folios 
> > https://lore.kernel.org/linux-fsdevel/effc0ec7-cf9d-44dc-aee5-563942242522@meta.com/
> > 
> > In my view, the common threads between these are that:
> > - we used fstests to validate these systems, in some cases even with
> >   specific regression tests for highly related bugs, but still missed
> >   the bugs until they hit us during our production release process. In
> >   all cases, we had passing 'fstests -g auto' runs.
> > - were able to reproduce the bugs with a predictable concoction of "run
> >   a workload and some known nasty btrfs operations in parallel". The most
> >   common form of this was running 'fsstress' and 'btrfs balance', but it
> >   wasn't quite universal. Sometimes we needed reflink threads, or
> >   drop_caches, or memory pressure, etc. to trigger a bug.
> > - The relatively generic stressing reproducers took hours or days to
> >   produce an issue then the investigating engineer could try to tweak and
> >   tune it by trial and error to bring that time down for a particular bug.
> > 
> > This leads me to the conclusion that there is some room for improvement in
> > stress testing filesystems (at least Btrfs).
> > 
> > I attempted to study the prior art on this and so far have found:
> > - fsstress/fsx and the attendant tests in fstests/. There are ~150-200
> >   tests using fsstress and fsx in fstests/. Most of them are xfs and
> >   btrfs tests following the aforementioned pattern of racing fsstress
> >   with some scary operations. Most of them tend to run for 30s, though
> >   some are longer (and of course subject to TIME_FACTOR configuration)
> > - Similar duration error injection tests in fstests (e.g. generic/475)
> > - The NFSv4 Test Project
> >   https://www.kernel.org/doc/ols/2006/ols2006v2-pages-275-294.pdf 
> >   A choice quote regarding stress testing:
> >   "One year after we started using FSSTRESS (in April 2005) Linux NFSv4
> >   was able to sustain the concurrent load of 10 processes during 24
> >   hours, without any problem. Three months later, NFSv4 reached 72 hours
> >   of stress under FSSTRESS, without any bugs. From this date, NFSv4
> >   filesystem tree manipulation is considered to be stable."
> > 
> > 
> > I would like to discuss:
> > - Am I missing other strategies people are employing? Apologies if there
> >   are obvious ones, but I tried to hunt around for a few days :)
> 
> At the moment I start six VMs per "configuration", which each run one of:
> 
> generic/521	(directio)
> generic/522	(bufferedio)
> generic/476	(fsstress)
> generic/388	(fsstress + log recovery)
> xfs/285		(online fsck)
> xfs/286		(online metadata rebuild)

That is sweet, and sorry I missed the soak category. I would love to
hear more about your experience with these tests! Do they catch a lot of
bugs? How hard is it to reduce the reproducer down to something smaller
and quicker when you do hit something? I'm also surprised at how few of
these you need. Does xfs just have a lot fewer online "admin" operations
like device replace, defrag, balance, enabling/disabling compression, etc.
than btrfs so you need fewer tests like that? Or do you just not think
that adding more noise would catch enough bugs to make it worth it?
Or does fsstress encompass all the operations you are interested in?

There are a bunch of similar btrfs tests (btrfs/060-074 race fsstress
with one or two interesting btrfs operations each) but we don't
currently run them for much longer than 30s. I am curious to try running
them as soak tests, now, and adding fsx running variants. That will end
up with like ~30 pretty similar tests, that I also feel could sort of
just be one big modular test?

Which kind of gets back to what I was getting at in the first place. I
don't know enough about xfs to fully grok what the various
configurations do to the test (I imagine they enable various features
you want to validate under the soak), but I imagine there are still more
nasty things to do to the system in parallel.

> 
> with SOAK_DURATION=6.5d so that they wrap up right around the time that
> each rc release drops.  I also set FSSTRESS_AVOID="-m 16" so that we
> don't end up with gigantic quota files.
> 
> There are two "configurations" per kernel tree.  The dot product of them
> are:
> 
> djwong-dev:
> -m metadir=1,autofsck=1,uquota,gquota,pquota,
> -m metadir=1,autofsck=1,uquota,gquota,pquota, -d rtinherit=1,
> 
> tot mainline:
> -m autofsck=1, -d rtinherit=1,
> -m autofsck=1,
> 
> for-next:
> -m metadir=1,autofsck=1,uquota,gquota,pquota,
> -m metadir=1,autofsck=1,uquota,gquota,pquota, -d rtinherit=1,
> 
> Actually, I just realized that with 6.14 I need to update the tot
> mainline configuration to have metadir=1.
> 
> > - What is the universe of interesting stressors (e.g., reflink, scrub,
> >   online repair, balance, etc.)
> 
> Prodding djwong and everyone else into loading up fsx/fsstress with
> all their weird new file io calls. ;)

I think this is quite interesting, actually. Fsstress already does
create and delete snapshots and makes reflinks, but there have been a
number of bugs that I have been unable to reproduce with raw fsstress
that do reproduce if I run fsstress PLUS more external
reflinking/snapshotting/syncing/etc threads. It seems, logically, I
could keep fussing with my fsstress invocation to get there, but that
was my experience.

Separately, how much do we want to be adding features that are only in
one or two filesystems to fsstress (similar to my points above regarding
test cardinality explosion)?

> 
> > - What is the universe of interesting validation conditions (e.g.,
> >   kernel panic, read only fs, fsck failure, data integrity error, etc.)
> > - Is there any interest in automating longer running fsstress runs? Are
> >   people already doing this with varying TIME_FACTOR configurations in
> >   fstests?
> 
> I don't run with SOAK_DURATION > 14 days because I generally haven't
> found larger values to be useful in finding bugs.  However, these weekly
> long soak tests runs have been going since 2016.

That makes sense to me, it does feel like a day to a week is probably
the sweet spot.

> 
> FWIW that actually started because we had a lot of customer complaints
> in that era about log recovery failures in xfs, and only later did I
> spread it beyond generic/388 to the six profiles above.
> 
> > - There is relatively less testing with fsx than fsstress in fstests.
> >   I believe this creates gaps for data corruption bugs rather than
> >   "feature logic" issues that the fsstress feature set tends to hit.
> 
> Probably.  I wonder how much we're really flexing io_uring?
> 
> --D
> 
> > - Can we standardize on some modular "stressors" and stress durations
> >   to run to validate file systems?
> > 
> > In the short term, I have been working on these ideas in a separate
> > barebones stress testing framework which I am happy to share, but isn't
> > particularly interesting in and of itself. It is basically just a
> > skeleton for concurrently running some concurrent "stressors" and then
> > validating the fs with some generic "validators". I plan to run it
> > internally just to see if I can get some useful results on our next few
> > major kernel releases.
> > 
> > And of course, I would love to discuss anything else of interest to
> > people who like stress testing filesystems!
> > 
> > Thanks,
> > Boris
> > 


* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
  2025-02-04  0:57   ` Dave Chinner
@ 2025-02-04 19:58     ` Boris Burkov
  2025-02-04 21:14       ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Boris Burkov @ 2025-02-04 19:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Amir Goldstein, lsf-pc, linux-fsdevel, fstests

On Tue, Feb 04, 2025 at 11:57:09AM +1100, Dave Chinner wrote:
> On Mon, Feb 03, 2025 at 08:12:59PM +0100, Amir Goldstein wrote:
> > CC fstests
> > 
> > On Mon, Feb 3, 2025 at 7:54 PM Boris Burkov <boris@bur.io> wrote:
> > >
> > > At Meta, we currently primarily rely on fstests 'auto' runs for
> > > validating Btrfs as a general purpose filesystem for all of our root
> > > drives. While this has obviously proven to be a very useful test suite
> > > with rich collaboration across teams and filesystems, we have observed a
> > > recent trend in our production filesystem issues that makes us question
> > > if it is sufficient.
> > >
> > > Over the last few years, we have had a number of issues (primarily in
> > > Btrfs, but at least one notable one in Xfs) that have been detected in
> > > production, then reproduced with an unreliable non-specific stressor
> > > that takes hours or even days to trigger the issue.
> > > Examples:
> > > - Btrfs relocation bugs
> > > https://lore.kernel.org/linux-btrfs/68766e66ed15ca2e7550585ed09434249db912a2.1727212293.git.josef@toxicpanda.com/
> > > https://lore.kernel.org/linux-btrfs/fc61fb63e534111f5837c204ec341c876637af69.1731513908.git.josef@toxicpanda.com/
> > > - Btrfs extent map merging corruption
> > > https://lore.kernel.org/linux-btrfs/9b98ba80e2cf32f6fb3b15dae9ee92507a9d59c7.1729537596.git.boris@bur.io/
> > > - Btrfs dio data corruptions from bio splitting
> > > (mostly our internal errors trying to make minimal backports of
> > > https://lore.kernel.org/linux-btrfs/cover.1679512207.git.boris@bur.io/
> > > and Christoph's related series)
> > > - Xfs large folios
> > > https://lore.kernel.org/linux-fsdevel/effc0ec7-cf9d-44dc-aee5-563942242522@meta.com/
> > >
> > > In my view, the common threads between these are that:
> > > - we used fstests to validate these systems, in some cases even with
> > >   specific regression tests for highly related bugs, but still missed
> > >   the bugs until they hit us during our production release process. In
> > >   all cases, we had passing 'fstests -g auto' runs.
> 
> Have you considered the 'soak' test group with a long SOAK_DURATION
> and then increasing the load using LOAD_FACTOR? Also there is a
> 'stress' group that TIME_FACTOR acts on.
> 
> For XFS, there's also bunch of fuzzing tests (in the
> dangerous_fuzzers group) that use the same SOAK_DURATION
> infrastructure via common/fuzzy.

I hadn't realized people were running these for multi-day durations.
Thanks for pointing them out and for your other inline answers to my
questions.

> 
> 
> > > - were able to reproduce the bugs with a predictable concoction of "run
> > >   a workload and some known nasty btrfs operations in parallel". The most
> > >   common form of this was running 'fsstress' and 'btrfs balance', but it
> > >   wasn't quite universal. Sometimes we needed reflink threads, or
> > >   drop_caches, or memory pressure, etc. to trigger a bug.
> 
> That's pretty much what check-parallel does to a system. Loads of
> tests run things like drop_caches, memory compaction, CPU hotplug,
> etc. check-parallel essentially exposes every test to these sorts
> of background perturbations rather than just the one test that is
> running that perturbation. IOWs, even the most basic correctness
> test now gets exercised while cpu hotplug and memory compaction are
> going on in the background....
> 
> Eventually, I plan to implement these background perturbations as
> separate control tasks for check-parallel so we don't need specific
> tests that run a background perturbation whilst the rest of the
> system is under test.

I think that a framework for introducing background perturbations while
running tests is definitely what I'm getting at. If check-parallel is a
good version of that, then that sounds great to me. I am particularly
excited about your point that it will smash together *every* stimulus
with *every* test. I do have some questions in my head about how that
would work in practice.

My main questions/concerns are:

How much do you randomize the interleaving of tests? Does
check-parallel run them in a random order?

Similarly, their durations are not at all tuned to maximize
interesting interactions. If test X and test Y would collide on some
faulty interaction, but test X runs once in 1 second, then you would
likely never see test X interfere with some interesting moment during
test Y. Are you considering feeding the tests back into the run-queue
as they finish for these stress style runs?

It seems that the two objectives of the test harness are sort of in
tension with using check-parallel to stress things. On one hand you
want tests to independently succeed or fail and on the other hand you
want noise from one test to disturb the other. I fear more of the
failures will turn out to be "Oh, well, when THAT happens, we would
expect this condition to be violated". Especially for the more "unit
test" style fstests that carefully use sync to check specific conditions
during a run.

This variant also feels like it would be at the extreme of difficulty
for attempting to distill a failure into a reproducer.

> 
> > > - The relatively generic stressing reproducers took hours or days to
> > >   produce an issue then the investigating engineer could try to tweak and
> > >   tune it by trial and error to bring that time down for a particular bug.
> > >
> > > This leads me to the conclusion that there is some room for improvement in
> > > stress testing filesystems (at least Btrfs).
> > >
> > > I attempted to study the prior art on this and so far have found:
> > > - fsstress/fsx and the attendant tests in fstests/. There are ~150-200
> > >   tests using fsstress and fsx in fstests/. Most of them are xfs and
> > >   btrfs tests following the aforementioned pattern of racing fsstress
> > >   with some scary operations. Most of them tend to run for 30s, though
> > >   some are longer (and of course subject to TIME_FACTOR configuration)
> 
> As per above, SOAK_DURATION.
> 
> > > - Similar duration error injection tests in fstests (e.g. generic/475)
> > > - The NFSv4 Test Project
> > >   https://www.kernel.org/doc/ols/2006/ols2006v2-pages-275-294.pdf
> > >   A choice quote regarding stress testing:
> > >   "One year after we started using FSSTRESS (in April 2005) Linux NFSv4
> > >   was able to sustain the concurrent load of 10 processes during 24
> > >   hours, without any problem. Three months later, NFSv4 reached 72 hours
> > >   of stress under FSSTRESS, without any bugs. From this date, NFSv4
> > >   filesystem tree manipulation is considered to be stable."
> > >
> > >
> > > I would like to discuss:
> > > - Am I missing other strategies people are employing? Apologies if there
> > >   are obvious ones, but I tried to hunt around for a few days :)
> 
> check-parallel.
> 
> > > - What is the universe of interesting stressors (e.g., reflink, scrub,
> > >   online repair, balance, etc.)
> 
> memory compaction, cpu hotplug, random reflinks of the underlying
> loop device image files to simulate dynamic VM image file snapshots,
> etc.
> 
> > > - What is the universe of interesting validation conditions (e.g.,
> > >   kernel panic, read only fs, fsck failure, data integrity error, etc.)
> 
> All of them. That's the point of check-parallel - it uses simple,
> existing filesystem correctness tests to generate a massively
> stressful load on the system...
> 
> > > - Is there any interest in automating longer running fsstress runs? Are
> > >   people already doing this with varying TIME_FACTOR configurations in
> > >   fstests?
> 
> At least for XFS, Darrick is already doing that, and I think Carlos
> may be as well.
> 
> > > - There is relatively less testing with fsx than fsstress in fstests.
> > >   I believe this creates gaps for data corruption bugs rather than
> > >   "feature logic" issues that the fsstress feature set tends to hit.
> > > - Can we standardize on some modular "stressors" and stress durations
> > >   to run to validate file systems?
> 
> I think we already have that with the "soak" and "stress" groups...
> 
> > > In the short term, I have been working on these ideas in a separate
> > > barebones stress testing framework which I am happy to share, but isn't
> > > particularly interesting in and of itself. It is basically just a
> > > skeleton for concurrently running some concurrent "stressors" and then
> > > validating the fs with some generic "validators". I plan to run it
> > > internally just to see if I can get some useful results on our next few
> > > major kernel releases.
> 
> check-parallel is effectively a massive concurrent stress workload
> for the system. It does this by running many individual correctness
> tests concurrently.
> 
> Run it on a 64p system or larger, and it will hammer both the test
> filesystems and base filesystem that all the loop device image files
> are laid out on.  I'm seeing it generate 5-6GB/s of IO load, 40-50GB
> of memory usage, consistently use >90% of the CPU in the system, and
> stress the scheduler at over half a million context switches/s.

I will definitely invest some time into getting check-parallel to run
with btrfs, and hopefully it turns up some interesting stuff.

> 
> > > And of course, I would love to discuss anything else of interest to
> > > people who like stress testing filesystems!
> 
> Filesystem stress testing by itself isn't really interesting to me.
> Using filesystem correctness tests to create massively stressful
> workloads, OTOH, attacks the problem from multiple angles and
> exercises the system well outside the bounds of just filesystem
> code.

From what I see, today we have a handful of tests which race fsx or
fsstress with 0-2 operations under test, and you are proposing using
check-parallel to hammer the computer with the entirety of all 1000
tests in parallel (awesome). I think I am proposing something in between
where we run fsx AND fsstress AND ~10 known scary operations. That
has proven to dredge up bugs in btrfs (where the simpler fsstress plus
one thing doesn't). I think check-parallel will be more stressful, but
that this "mega fsstress run" will be more predictable and easier to
tune/get reproducers out of.
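
A minimal sketch of the "stressors plus validators" shape described
above, in Python. Everything in it is an assumption for illustration:
run_stress, make_stressor and fs_still_writable are hypothetical names,
and the stressor bodies are placeholders where a real harness would
spawn fsstress, fsx, a balance loop and so on.

```python
import threading
import time

def run_stress(stressors, validators, duration_s):
    """Run all stressors concurrently for duration_s, then run each validator.

    stressors: callables taking a threading.Event; they loop until it is set.
    validators: callables returning (name, ok) once the stressors have stopped.
    """
    stop = threading.Event()
    threads = [threading.Thread(target=s, args=(stop,)) for s in stressors]
    for t in threads:
        t.start()
    time.sleep(duration_s)
    stop.set()
    for t in threads:
        t.join()
    return dict(v() for v in validators)

# Iteration counters, one key per (placeholder) stressor.
counts = {"fsstress": 0, "balance": 0}

def make_stressor(name):
    # Placeholder: a real stressor would run a workload against the fs.
    def stressor(stop):
        while True:
            counts[name] += 1
            # wait() returns True as soon as the stop event is set.
            if stop.wait(0.001):
                return
    return stressor

def fs_still_writable():
    # Placeholder validator; a real one might scan dmesg or run fsck.
    return ("fs_still_writable", True)

report = run_stress(
    [make_stressor(n) for n in counts],
    [fs_still_writable],
    duration_s=0.05,
)
```

Keeping each stressor behind the same "loop until told to stop"
interface is what would make it easy to drop one stressor at a time
when trying to shrink a reproducer.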

Thanks again for your thoughts,
Boris

> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
  2025-02-04 19:58     ` Boris Burkov
@ 2025-02-04 21:14       ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2025-02-04 21:14 UTC (permalink / raw)
  To: Boris Burkov; +Cc: Amir Goldstein, lsf-pc, linux-fsdevel, fstests

On Tue, Feb 04, 2025 at 11:58:46AM -0800, Boris Burkov wrote:
> On Tue, Feb 04, 2025 at 11:57:09AM +1100, Dave Chinner wrote:
> > > > - were able to reproduce the bugs with a predictable concoction of "run
> > > >   a workload and some known nasty btrfs operations in parallel". The most
> > > >   common form of this was running 'fsstress' and 'btrfs balance', but it
> > > >   wasn't quite universal. Sometimes we needed reflink threads, or
> > > >   drop_caches, or memory pressure, etc. to trigger a bug.
> > 
> > That's pretty much what check-parallel does to a system. Loads of
> > tests run things like drop_caches, memory compaction, CPU hotplug,
> > etc. check-parallel essentially exposes every test to these sorts
> > of background perturbations rather than just the one test that is
> > running that perturbation. IOWs, even the most basic correctness
> > test now gets exercised while cpu hotplug and memory compaction are
> > going on in the background....
> > 
> > Eventually, I plan to implement these background perturbations as
> > separate control tasks for check-parallel so we don't need specific
> > tests that run a background perturbation whilst the rest of the
> > system is under test.
> 
> I think that a framework for introducing background perturbations while
> running tests is definitely what I'm getting at. If check-parallel is a
> good version of that, then that sounds great to me. I am particularly
> excited about your point that it will smash together *every* stimulus
> with *every* test. I do have some questions in my head about how that
> would work in practice.
> 
> My main questions/concerns are:
> 
> How much do you randomize the interleaving of tests? Does
> check-parallel run them in a random order?

Same as check - the "-r" option will randomise the test run order.

The test run order is also somewhat randomised by default in that
it sorts the test run order based on the runtime of each test in
the previous test run. Hence test run order is not static - it
generally runs long running tests before short running tests, but the
exact order is not fixed.
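
The ordering described above can be sketched in a few lines of Python.
This is only an illustration of "longest previous runtime first, or
shuffled on request", not the actual check code; order_tests, the
handling of tests with no recorded runtime, and the runtimes in the
example are all assumptions.

```python
import random

def order_tests(tests, prev_runtimes, randomize=False, seed=None):
    """Return a run order for tests.

    prev_runtimes maps test name -> seconds taken in the previous run.
    Tests with no recorded runtime are treated as 0s and so sort last
    (an assumption made for this sketch).
    """
    if randomize:
        shuffled = list(tests)
        random.Random(seed).shuffle(shuffled)
        return shuffled
    # Longest first; sorted() is stable, so ties keep their input order.
    return sorted(tests, key=lambda t: prev_runtimes.get(t, 0.0), reverse=True)

tests = ["generic/001", "generic/476", "generic/388"]
# Made-up runtimes, for illustration only.
prev = {"generic/001": 4.0, "generic/476": 3600.0, "generic/388": 900.0}
order = order_tests(tests, prev)
```

Because the sort key comes from the previous run, any change in how
long a test took shifts its position in the next run, which is why the
order is not fixed from run to run.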

> Similarly, their durations are not at all tuned to maximize
> interesting interactions. If test X and test Y would collide on some
> faulty interaction, but test X runs once in 1 second, then you would
> likely never see test X interfere with some interesting moment during
> test Y. Are you considering feeding the tests back into the run-queue
> as they finish for these stress style runs?

Not yet - the infrastructure to directly manage and run tests from
check-parallel is not yet in place. It currently generates a test
list for each runner thread then executes that via a check instance
per runner thread.

I plan to have check-parallel execute tests individually itself by
factoring the run loop out of check (similar to how I'm doing the
test list parsing). Once there is direct control of the test
execution, stuff like dynamic test queues where runners just pull
the next test to run off the queue and they keep going until the
queue is empty will be possible.
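
A rough Python sketch of such a dynamic test queue, under the
assumption that each runner is a thread and that run_one is some
callable that executes a single test and returns its status. Both
run_dynamic and run_one are hypothetical names; this is not how
check-parallel is implemented today.

```python
import queue
import threading

def run_dynamic(tests, run_one, nr_runners):
    """Run every test once; idle runners pull the next test from a shared queue."""
    pending = queue.Queue()
    for t in tests:
        pending.put(t)

    results = {}
    lock = threading.Lock()

    def runner():
        while True:
            try:
                test = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, this runner is done
            status = run_one(test)
            with lock:
                results[test] = status

    threads = [threading.Thread(target=runner) for _ in range(nr_runners)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

# Stand-in for executing a real test: always reports "pass".
results = run_dynamic(
    [f"generic/{n:03d}" for n in range(1, 9)],
    lambda test: "pass",
    nr_runners=4,
)
```

Feeding finished tests back into the queue for stress-style runs would
then be a matter of calling pending.put(test) again after run_one
returns, with some separate stop condition such as a deadline.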

> It seems that the two objectives of the test harness are sort of in
> tension with using check-parallel to stress things. On one hand you
> want tests to independently succeed or fail and on the other hand you
> want noise from one test to disturb the other.

Yes. Tests are largely written such that they don't interfere with
each other.

> I fear more of the
> failures will turn out to be "Oh, well, when THAT happens, we would
> expect this condition to be violated". Especially for the more "unit
> test" style fstests that carefully use sync to check specific conditions
> during a run.

That's why I currently have an "unreliable_in_parallel" test group
definition and check-parallel excludes that test group. There's
about 20 tests I've classified this way, most of them xfs specific
tests that are reliant on exact fragmentation patterns being
created. These tests are perturbed by things like sync(1) calls from
other tests which result in a different fragmentation pattern than
the test expects to see.

In each case, there is a comment in the test explaining the
condition that makes the test unreliable in parallel, and so we
have some idea of what needs fixing to be able to remove it from the
unreliable_in_parallel group.

Essentially, I'm using this as a marker and note for future
improvements once all the (more important) infrastructure work is
done and solid.
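
Group-based selection with an exclusion like this can be sketched as
follows. The test names and group memberships are hypothetical, and
select_tests is an illustrative helper rather than anything in fstests;
only the group names come from this thread.

```python
def select_tests(test_groups, include, exclude):
    """Pick tests in any include group and in no exclude group.

    test_groups maps a test name to the set of groups it belongs to.
    """
    return sorted(
        name
        for name, groups in test_groups.items()
        if groups & include and not groups & exclude
    )

# Hypothetical test names and group memberships, for illustration only.
test_groups = {
    "example/001": {"auto", "quick"},
    "example/002": {"auto", "stress"},
    "example/003": {"soak", "stress"},
    "example/004": {"auto", "soak", "unreliable_in_parallel"},
}

selected = select_tests(
    test_groups,
    include={"soak", "stress"},
    exclude={"unreliable_in_parallel"},
)
```

Here example/004 is in the soak group but is still dropped, because
membership in an excluded group wins over membership in an included
one.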

> This variant also feels like it would be at the extreme of difficulty
> for attempting to distill a failure into a reproducer.

It's pretty obvious when a test is doing something that is
influenced by an outside event. The biggest problem for debugging
them comes when the test failures appear to be real bugs (e.g. all
the weird and whacky off-by-one quota failures that check-parallel
triggers on XFS) but they cannot be reproduced when the tests are
run serially.

.....

> > > > And of course, I would love to discuss anything else of interest to
> > > > people who like stress testing filesystems!
> > 
> > Filesystem stress testing by itself isn't really interesting to me.
> > Using filesystem correctness tests to create massively stressful
> > workloads, OTOH, attacks the problem from multiple angles and
> > exercises the system well outside the bounds of just filesystem
> > code.
> 
> From what I see, today we have a handful of tests which race fsx or
> fsstress with 0-2 operations under test, and you are proposing using
> check-parallel to hammer the computer with the entirety of all 1000
> tests in parallel (awesome).

It's currently running one test per CPU in parallel, not all at
once. Many tests run lots of stuff in parallel themselves, too, and
some of them hammer large CPU count machines really hard just by
themselves, let alone when there's another 63 tests running
concurrently....

> I think I am proposing something in between
> where we run fsx AND fsstress AND ~10 known scary operations.

Write a set of tests that do this for btrfs and put them in the
auto/stress/soak groups. Then run 'check-parallel -g soak,stress
....'

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
  2025-02-04 19:38   ` Boris Burkov
@ 2025-02-04 22:09     ` Darrick J. Wong
  2025-02-05  4:38       ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2025-02-04 22:09 UTC (permalink / raw)
  To: Boris Burkov; +Cc: lsf-pc, linux-fsdevel

On Tue, Feb 04, 2025 at 11:38:45AM -0800, Boris Burkov wrote:
> On Mon, Feb 03, 2025 at 11:53:43AM -0800, Darrick J. Wong wrote:
> > On Mon, Feb 03, 2025 at 10:55:19AM -0800, Boris Burkov wrote:
> > > At Meta, we currently primarily rely on fstests 'auto' runs for
> > > validating Btrfs as a general purpose filesystem for all of our root
> > > drives. While this has obviously proven to be a very useful test suite
> > > with rich collaboration across teams and filesystems, we have observed a
> > > recent trend in our production filesystem issues that makes us question
> > > if it is sufficient.
> > > 
> > > Over the last few years, we have had a number of issues (primarily in
> > > Btrfs, but at least one notable one in Xfs) that have been detected in
> > > production, then reproduced with an unreliable non-specific stressor
> > > that takes hours or even days to trigger the issue.
> > > Examples:
> > > - Btrfs relocation bugs
> > > https://lore.kernel.org/linux-btrfs/68766e66ed15ca2e7550585ed09434249db912a2.1727212293.git.josef@toxicpanda.com/
> > > https://lore.kernel.org/linux-btrfs/fc61fb63e534111f5837c204ec341c876637af69.1731513908.git.josef@toxicpanda.com/
> > > - Btrfs extent map merging corruption
> > > https://lore.kernel.org/linux-btrfs/9b98ba80e2cf32f6fb3b15dae9ee92507a9d59c7.1729537596.git.boris@bur.io/
> > > - Btrfs dio data corruptions from bio splitting
> > > (mostly our internal errors trying to make minimal backports of
> > > https://lore.kernel.org/linux-btrfs/cover.1679512207.git.boris@bur.io/
> > > and Christoph's related series)
> > > - Xfs large folios 
> > > https://lore.kernel.org/linux-fsdevel/effc0ec7-cf9d-44dc-aee5-563942242522@meta.com/
> > > 
> > > In my view, the common threads between these are that:
> > > - we used fstests to validate these systems, in some cases even with
> > >   specific regression tests for highly related bugs, but still missed
> > >   the bugs until they hit us during our production release process. In
> > >   all cases, we had passing 'fstests -g auto' runs.
> > > - were able to reproduce the bugs with a predictable concoction of "run
> > >   a workload and some known nasty btrfs operations in parallel". The most
> > >   common form of this was running 'fsstress' and 'btrfs balance', but it
> > >   wasn't quite universal. Sometimes we needed reflink threads, or
> > >   drop_caches, or memory pressure, etc. to trigger a bug.
> > > - The relatively generic stressing reproducers took hours or days to
> > >   produce an issue then the investigating engineer could try to tweak and
> > >   tune it by trial and error to bring that time down for a particular bug.
> > > 
> > > This leads me to the conclusion that there is some room for improvement in
> > > stress testing filesystems (at least Btrfs).
> > > 
> > > I attempted to study the prior art on this and so far have found:
> > > - fsstress/fsx and the attendant tests in fstests/. There are ~150-200
> > >   tests using fsstress and fsx in fstests/. Most of them are xfs and
> > >   btrfs tests following the aforementioned pattern of racing fsstress
> > >   with some scary operations. Most of them tend to run for 30s, though
> > >   some are longer (and of course subject to TIME_FACTOR configuration)
> > > - Similar duration error injection tests in fstests (e.g. generic/475)
> > > - The NFSv4 Test Project
> > >   https://www.kernel.org/doc/ols/2006/ols2006v2-pages-275-294.pdf 
> > >   A choice quote regarding stress testing:
> > >   "One year after we started using FSSTRESS (in April 2005) Linux NFSv4
> > >   was able to sustain the concurrent load of 10 processes during 24
> > >   hours, without any problem. Three months later, NFSv4 reached 72 hours
> > >   of stress under FSSTRESS, without any bugs. From this date, NFSv4
> > >   filesystem tree manipulation is considered to be stable."
> > > 
> > > 
> > > I would like to discuss:
> > > - Am I missing other strategies people are employing? Apologies if there
> > >   are obvious ones, but I tried to hunt around for a few days :)
> > 
> > At the moment I start six VMs per "configuration", which each run one of:
> > 
> > generic/521	(directio)
> > generic/522	(bufferedio)
> > generic/476	(fsstress)
> > generic/388	(fsstress + log recovery)
> > xfs/285		(online fsck)
> > xfs/286		(online metadata rebuild)
> 
> That is sweet, and sorry I missed the soak category. I would love to
> hear more about your experience with these tests! Do they catch a lot of
> bugs? How hard is it to reduce the reproducer down to something smaller
> and quicker when you do hit something? I'm also surprised at how few of

It's usually pretty difficult for data corruption reports from fsx, but
most of the fsstress failures are either really obvious (dead fs) or
emit stacktraces and lockdep/kasan reports.

> these you need. Does xfs just have a lot fewer online "admin" operations
> like device replace, defrag, balance, enabling/disabling compression, etc.
> than btrfs so you need fewer tests like that? Or do you just not think
> that adding more noise would catch enough bugs to make it worth it?
> Or does fsstress encompass all the operations you are interested in?

xfs has a lot less stuff to manage, so fsstress/fsx are usually enough.
There's a quartet of specialty tests xfs/285,286,565,566 that exercise
fsstress/fsx against online fsck and repair.

> There are a bunch of similar btrfs tests (btrfs/060-074 race fsstress
> with one or two interesting btrfs operations each) but we don't
> currently run them for much longer than 30s. I am curious to try running
> them as soak tests, now, and adding fsx running variants. That will end
> up with like ~30 pretty similar tests, that I also feel could sort of
> just be one big modular test?

I suggest leaving the specialty tests around (and not in the auto group)
and creating one btrfs/ test that turns on *everything* and races that
against fsstress?

> Which kind of gets back to what I was getting at in the first place. I
> don't know enough about xfs to fully grok what the various
> configurations do to the test (I imagine they enable various features
> you want to validate under the soak), but I imagine there are still more
> nasty things to do to the system in parallel.

Probably, but we've never really dug into that.  Dave might get there
with check-parallel but I don't have 64p systems to spare right now.

As for configurations -- yeah, that's how we deal with the combinatoric
explosion of mkfs options.  Run a lot of different weird configs in
parallel with a fleet of VMs.  It's too bad that sort of implies that we
all have to work for cloud vendors.
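
To put a number on that fleet, here is a small Python sketch that
enumerates the soak jobs from the figures quoted in this thread: three
kernel trees, two configurations per tree, and six tests per
configuration. The vm_jobs helper and the job dictionary layout are
assumptions made for illustration; the real setup may be organised
differently.

```python
from itertools import product

TREES = ["djwong-dev", "tot mainline", "for-next"]
CONFIGS_PER_TREE = 2
SOAK_TESTS = [
    "generic/521",  # directio
    "generic/522",  # bufferedio
    "generic/476",  # fsstress
    "generic/388",  # fsstress + log recovery
    "xfs/285",      # online fsck
    "xfs/286",      # online metadata rebuild
]

def vm_jobs(trees, nr_configs, tests):
    """One VM job per (tree, configuration index, test) combination."""
    return [
        {"tree": tree, "config": cfg, "test": test}
        for tree, cfg, test in product(trees, range(nr_configs), tests)
    ]

jobs = vm_jobs(TREES, CONFIGS_PER_TREE, SOAK_TESTS)
```

That comes to 36 long-running VMs, and every additional mkfs
configuration per tree adds another six.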

> > 
> > with SOAK_DURATION=6.5d so that they wrap up right around the time that
> > each rc release drops.  I also set FSSTRESS_AVOID="-m 16" so that we
> > don't end up with gigantic quota files.
> > 
> > There are two "configurations" per kernel tree.  The dot product of them
> > are:
> > 
> > djwong-dev:
> > -m metadir=1,autofsck=1,uquota,gquota,pquota,
> > -m metadir=1,autofsck=1,uquota,gquota,pquota, -d rtinherit=1,
> > 
> > tot mainline:
> > -m autofsck=1, -d rtinherit=1,
> > -m autofsck=1,
> > 
> > for-next:
> > -m metadir=1,autofsck=1,uquota,gquota,pquota,
> > -m metadir=1,autofsck=1,uquota,gquota,pquota, -d rtinherit=1,
> > 
> > Actually, I just realized that with 6.14 I need to update the tot
> > mainline configuration to have metadir=1.
> > 
> > > - What is the universe of interesting stressors (e.g., reflink, scrub,
> > >   online repair, balance, etc.)
> > 
> > Prodding djwong and everyone else into loading up fsx/fsstress with
> > all their weird new file io calls. ;)
> 
> I think this is quite interesting, actually. Fsstress already does
> create and delete snapshots and makes reflinks, but there have been a
> number of bugs that I have been unable to reproduce with raw fsstress
> but if I run fsstress PLUS more external
> reflinking/snapshotting/syncing/etc threads, then they reproduce. It
> seems, logically, I could keep fussing with my fsstress invocation to
> get there, but that was my experience.
> 
> Separately, how much do we want to be adding features that are only in
> one or two filesystems to fsstress (similar to my points above regarding
> test cardinality explosion)

That's where I think "add it all to fsstress" becomes less useful -- it
might not be a great idea to clutter it up with too many weird ioctls.
That said, I think xfs and bcachefs slowly co-opt the btrfs ones over
time.

--D

> > 
> > > - What is the universe of interesting validation conditions (e.g.,
> > >   kernel panic, read only fs, fsck failure, data integrity error, etc.)
> > > - Is there any interest in automating longer running fsstress runs? Are
> > >   people already doing this with varying TIME_FACTOR configurations in
> > >   fstests?
> > 
> > I don't run with SOAK_DURATION > 14 days because I generally haven't
> > found larger values to be useful in finding bugs.  However, these weekly
> > long soak tests runs have been going since 2016.
> 
> That makes sense to me, it does feel like a day to a week is probably
> the sweet spot.
> 
> > 
> > FWIW that actually started because we had a lot of customer complaints
> > in that era about log recovery failures in xfs, and only later did I
> > spread it beyond generic/388 to the six profiles above.
> > 
> > > - There is relatively less testing with fsx than fsstress in fstests.
> > >   I believe this creates gaps for data corruption bugs rather than
> > >   "feature logic" issues that the fsstress feature set tends to hit.
> > 
> > Probably.  I wonder how much we're really flexing io_uring?
> > 
> > --D
> > 
> > > - Can we standardize on some modular "stressors" and stress durations
> > >   to run to validate file systems?
> > > 
> > > In the short term, I have been working on these ideas in a separate
> > > barebones stress testing framework which I am happy to share, but isn't
> > > particularly interesting in and of itself. It is basically just a
> > > skeleton for concurrently running some concurrent "stressors" and then
> > > validating the fs with some generic "validators". I plan to run it
> > > internally just to see if I can get some useful results on our next few
> > > major kernel releases.
> > > 
> > > And of course, I would love to discuss anything else of interest to
> > > people who like stress testing filesystems!
> > > 
> > > Thanks,
> > > Boris
> > > 
> 


* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
  2025-02-04 22:09     ` Darrick J. Wong
@ 2025-02-05  4:38       ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2025-02-05  4:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Boris Burkov, lsf-pc, linux-fsdevel

On Tue, Feb 04, 2025 at 02:09:39PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 04, 2025 at 11:38:45AM -0800, Boris Burkov wrote:
> > On Mon, Feb 03, 2025 at 11:53:43AM -0800, Darrick J. Wong wrote:
> > > On Mon, Feb 03, 2025 at 10:55:19AM -0800, Boris Burkov wrote:
> > Which kind of gets back to what I was getting at in the first place. I
> > don't know enough about xfs to fully grok what the various
> > configurations do to the test (I imagine they enable various features
> > you want to validate under the soak), but I imagine there are still more
> > nasty things to do to the system in parallel.
> 
> Probably, but we've never really dug into that.  Dave might get there
> with check-parallel but I don't have 64p systems to spare right now.
> 
> As for configurations -- yeah, that's how we deal with the combinatoric
> explosion of mkfs options.  Run a lot of different weird configs in
> parallel with a fleet of VMs.  It's too bad that sort of implies that we
> all have to work for cloud vendors.

Well, that's one of the issues I'm addressing with check-parallel.

When a full auto run takes 10 minutes, a single developer can
iterate a significant chunk of the configuration matrix on a single
machine in a few hours with a single check-parallel command.

The functionality is already there to do this - if we define all
the configs that are to be tested via config section definitions,
check-parallel will iterate them all in one go.
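
A minimal sketch of what such config section definitions could look like,
generated from a small matrix. It assumes the fstests config-section file
format ([name] headers followed by VAR=value lines). The section names and
the render_sections helper are hypothetical, the MKFS_OPTIONS strings are
the ones quoted earlier in this thread with the trailing commas dropped, and
device variables such as TEST_DEV and SCRATCH_DEV are deliberately left out:

```python
# Sketch only: emit fstests-style config sections for a small mkfs matrix.
# Section names are made up for illustration; device variables (TEST_DEV,
# SCRATCH_DEV, ...) are omitted and would be needed in a real config file.

# mkfs option sets taken from the configurations quoted in this thread,
# with the trailing commas dropped.
MKFS_MATRIX = {
    "xfs_metadir_quota": "-m metadir=1,autofsck=1,uquota,gquota,pquota",
    "xfs_metadir_quota_rt": "-m metadir=1,autofsck=1,uquota,gquota,pquota -d rtinherit=1",
    "xfs_autofsck": "-m autofsck=1",
    "xfs_autofsck_rt": "-m autofsck=1 -d rtinherit=1",
}


def render_sections(matrix, fstyp="xfs"):
    """Return config-file text with one [section] per mkfs option set."""
    blocks = []
    for name, mkfs_options in matrix.items():
        blocks.append(
            f"[{name}]\n"
            f"FSTYP={fstyp}\n"
            f'MKFS_OPTIONS="{mkfs_options}"\n'
        )
    # Separate sections with a blank line for readability.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(render_sections(MKFS_MATRIX))
```

Keeping the matrix as data makes it cheap to add or drop a config without
hand-editing each section.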

That's the way I want to run testing - testing mkfs defaults with
the auto group is a ten minute smoke test that will catch most
regressions in new code. That "full auto" smoke test is now faster
than my typical think-code-build-deploy cycle time. Perfect.

Now running half a dozen common configs (e.g. each of the LTS-kernel
related mkfs defaults) for better coverage becomes a "run it while
I'm at lunch/in a meeting" exercise. It can be done multiple times a
day, and interrupting it to start again with a new build is no
longer a big deal.

End-of-day/overnight testing has a long enough duration (12+ hours)
to exercise /several dozen/ fs configs. That's more than enough
testing to drown a typical developer in things that need analysis
and/or fixing.
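
The budget arithmetic behind these figures can be checked quickly. This
sketch assumes every config costs about the same 10 minutes as the default
auto run and that a lunch break is about an hour; both are assumptions
(non-default configs may run faster or slower), and configs_per_window is a
hypothetical helper name:

```python
# Sketch: how many full-auto runs fit in a time window, assuming each
# config takes roughly the same wall-clock time as the default run.

def configs_per_window(window_minutes, minutes_per_run=10):
    """Number of complete runs that fit in the window (integer division)."""
    return window_minutes // minutes_per_run


lunch = configs_per_window(60)            # assumed one-hour break
overnight = configs_per_window(12 * 60)   # the 12-hour overnight window

print(lunch, overnight)
```

Under those assumptions an hour covers 6 configs and a 12-hour window covers
72, which lines up with "half a dozen" and "several dozen".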

All on one local machine built using cheap commodity parts.

The cloud is a lie.

-Dave.
-- 
Dave Chinner
david@fromorbit.com



Thread overview: 10+ messages
2025-02-03 18:55 [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems Boris Burkov
2025-02-03 19:12 ` Amir Goldstein
2025-02-04  0:57   ` Dave Chinner
2025-02-04 19:58     ` Boris Burkov
2025-02-04 21:14       ` Dave Chinner
2025-02-03 19:14 ` Sweet Tea Dorminy
2025-02-03 19:53 ` Darrick J. Wong
2025-02-04 19:38   ` Boris Burkov
2025-02-04 22:09     ` Darrick J. Wong
2025-02-05  4:38       ` Dave Chinner
