* [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
@ 2025-02-03 18:55 Boris Burkov
From: Boris Burkov @ 2025-02-03 18:55 UTC
To: lsf-pc; +Cc: linux-fsdevel
At Meta, we currently rely primarily on fstests 'auto' runs for
validating Btrfs as a general purpose filesystem for all of our root
drives. While this has obviously proven to be a very useful test suite
with rich collaboration across teams and filesystems, we have observed a
recent trend in our production filesystem issues that makes us question
if it is sufficient.
Over the last few years, we have had a number of issues (primarily in
Btrfs, but at least one notable one in Xfs) that have been detected in
production, then reproduced with an unreliable non-specific stressor
that takes hours or even days to trigger the issue.
Examples:
- Btrfs relocation bugs
https://lore.kernel.org/linux-btrfs/68766e66ed15ca2e7550585ed09434249db912a2.1727212293.git.josef@toxicpanda.com/
https://lore.kernel.org/linux-btrfs/fc61fb63e534111f5837c204ec341c876637af69.1731513908.git.josef@toxicpanda.com/
- Btrfs extent map merging corruption
https://lore.kernel.org/linux-btrfs/9b98ba80e2cf32f6fb3b15dae9ee92507a9d59c7.1729537596.git.boris@bur.io/
- Btrfs dio data corruptions from bio splitting
(mostly our internal errors trying to make minimal backports of
https://lore.kernel.org/linux-btrfs/cover.1679512207.git.boris@bur.io/
and Christoph's related series)
- Xfs large folios
https://lore.kernel.org/linux-fsdevel/effc0ec7-cf9d-44dc-aee5-563942242522@meta.com/
In my view, the common threads between these are that:
- we used fstests to validate these systems, in some cases even with
specific regression tests for highly related bugs, but still missed
the bugs until they hit us during our production release process. In
all cases, we had passing 'fstests -g auto' runs.
- we were able to reproduce the bugs with a predictable concoction of "run
a workload and some known nasty btrfs operations in parallel". The most
common form of this was running 'fsstress' and 'btrfs balance', but it
wasn't quite universal. Sometimes we needed reflink threads, or
drop_caches, or memory pressure, etc. to trigger a bug.
- The relatively generic stressing reproducers took hours or days to
produce an issue, and only then could the investigating engineer tweak and
tune them by trial and error to bring that time down for a particular bug.
This leads me to the conclusion that there is some room for improvement in
stress testing filesystems (at least Btrfs).
I attempted to study the prior art on this and so far have found:
- fsstress/fsx and the attendant tests in fstests/. There are ~150-200
tests using fsstress and fsx in fstests/. Most of them are xfs and
btrfs tests following the aforementioned pattern of racing fsstress
with some scary operations. Most of them tend to run for 30s, though
some are longer (and of course subject to TIME_FACTOR configuration)
- Similar duration error injection tests in fstests (e.g. generic/475)
- The NFSv4 Test Project
https://www.kernel.org/doc/ols/2006/ols2006v2-pages-275-294.pdf
A choice quote regarding stress testing:
"One year after we started using FSSTRESS (in April 2005) Linux NFSv4
was able to sustain the concurrent load of 10 processes during 24
hours, without any problem. Three months later, NFSv4 reached 72 hours
of stress under FSSTRESS, without any bugs. From this date, NFSv4
filesystem tree manipulation is considered to be stable."
I would like to discuss:
- Am I missing other strategies people are employing? Apologies if there
are obvious ones, but I tried to hunt around for a few days :)
- What is the universe of interesting stressors (e.g., reflink, scrub,
online repair, balance, etc.)?
- What is the universe of interesting validation conditions (e.g.,
kernel panic, read only fs, fsck failure, data integrity error, etc.)?
- Is there any interest in automating longer running fsstress runs? Are
people already doing this with varying TIME_FACTOR configurations in
fstests?
- There is relatively less testing with fsx than fsstress in fstests.
I believe this creates gaps for data corruption bugs rather than
"feature logic" issues that the fsstress feature set tends to hit.
- Can we standardize on some modular "stressors" and stress durations
to run to validate file systems?
In the short term, I have been working on these ideas in a separate
barebones stress testing framework which I am happy to share, but isn't
particularly interesting in and of itself. It is basically just a
skeleton for running some "stressors" concurrently and then
validating the fs with some generic "validators". I plan to run it
internally just to see if I can get some useful results on our next few
major kernel releases.
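To make the shape of such a skeleton concrete, here is a rough Python sketch. It is an illustration only, not the framework mentioned above: every name in it (run_stress, the fake stressors, the validators) is hypothetical, and the stand-in stressors just bump counters where a real one would drive fsstress, fsx, or 'btrfs balance'.

```python
import threading
import time
from typing import Callable, List

# Hypothetical types: a stressor loops until told to stop,
# a validator returns True if the filesystem still looks healthy.
Stressor = Callable[[threading.Event], None]
Validator = Callable[[], bool]


def run_stress(stressors: List[Stressor], validators: List[Validator],
               duration_s: float) -> List[str]:
    """Run all stressors concurrently for duration_s seconds, then run
    every validator. Returns the names of the validators that failed."""
    stop = threading.Event()
    threads = [threading.Thread(target=s, args=(stop,), daemon=True)
               for s in stressors]
    for t in threads:
        t.start()
    time.sleep(duration_s)
    stop.set()                      # ask every stressor to wind down
    for t in threads:
        t.join()
    return [v.__name__ for v in validators if not v()]


# Stand-in stressors (assumption: real ones would wrap external tools).
counter = {"writes": 0, "balances": 0}


def fake_workload(stop: threading.Event) -> None:
    while not stop.is_set():
        counter["writes"] += 1
        time.sleep(0.001)


def fake_balance(stop: threading.Event) -> None:
    while not stop.is_set():
        counter["balances"] += 1
        time.sleep(0.001)


# Stand-in validators (assumption: real ones would check dmesg,
# read-only remounts, fsck results, or data checksums).
def workload_made_progress() -> bool:
    return counter["writes"] > 0


def balance_made_progress() -> bool:
    return counter["balances"] > 0


failed = run_stress([fake_workload, fake_balance],
                    [workload_made_progress, balance_made_progress],
                    duration_s=0.05)
print("failed validators:", failed)
```

Keeping stressors and validators as plain callables is one way to keep them modular: adding a reflink or drop_caches stressor would mean adding one function to a list.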
And of course, I would love to discuss anything else of interest to
people who like stress testing filesystems!
Thanks,
Boris
* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
From: Amir Goldstein @ 2025-02-03 19:12 UTC
To: Boris Burkov; +Cc: lsf-pc, linux-fsdevel, fstests

CC fstests

On Mon, Feb 3, 2025 at 7:54 PM Boris Burkov <boris@bur.io> wrote:
> [...]
* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
From: Dave Chinner @ 2025-02-04 0:57 UTC
To: Amir Goldstein; +Cc: Boris Burkov, lsf-pc, linux-fsdevel, fstests

On Mon, Feb 03, 2025 at 08:12:59PM +0100, Amir Goldstein wrote:
> CC fstests
>
> On Mon, Feb 3, 2025 at 7:54 PM Boris Burkov <boris@bur.io> wrote:
> > [...]
> > In my view, the common threads between these are that:
> > - we used fstests to validate these systems, in some cases even with
> > specific regression tests for highly related bugs, but still missed
> > the bugs until they hit us during our production release process. In
> > all cases, we had passing 'fstests -g auto' runs.

Have you considered the 'soak' test group with a long SOAK_DURATION
and then increasing the load using LOAD_FACTOR? Also there is a
'stress' group that TIME_FACTOR acts on.

For XFS, there's also a bunch of fuzzing tests (in the
dangerous_fuzzers group) that use the same SOAK_DURATION
infrastructure via common/fuzzy.

> > - were able to reproduce the bugs with a predictable concoction of "run
> > a workload and some known nasty btrfs operations in parallel". The most
> > common form of this was running 'fsstress' and 'btrfs balance', but it
> > wasn't quite universal. Sometimes we needed reflink threads, or
> > drop_caches, or memory pressure, etc. to trigger a bug.

That's pretty much what check-parallel does to a system. Loads of
tests run things like drop_caches, memory compaction, CPU hotplug,
etc. check-parallel essentially exposes every test to these sorts
of background perturbations rather than just the one test that is
running that perturbation. IOWs, even the most basic correctness
test now gets exercised while cpu hotplug and memory compaction are
going on in the background....

Eventually, I plan to implement these background perturbations as
separate control tasks for check-parallel so we don't need specific
tests that run a background perturbation whilst the rest of the
system is under test.

> > [...]
> > I attempted to study the prior art on this and so far have found:
> > - fsstress/fsx and the attendant tests in fstests/. There are ~150-200
> > tests using fsstress and fsx in fstests/. Most of them are xfs and
> > btrfs tests following the aforementioned pattern of racing fsstress
> > with some scary operations. Most of them tend to run for 30s, though
> > some are longer (and of course subject to TIME_FACTOR configuration)

As per above, SOAK_DURATION.

> > [...]
> > I would like to discuss:
> > - Am I missing other strategies people are employing? Apologies if there
> > are obvious ones, but I tried to hunt around for a few days :)

check-parallel.

> > - What is the universe of interesting stressors (e.g., reflink, scrub,
> > online repair, balance, etc.)

memory compaction, cpu hotplug, random reflinks of the underlying
loop device image files to simulate dynamic VM image file snapshots,
etc.

> > - What is the universe of interesting validation conditions (e.g.,
> > kernel panic, read only fs, fsck failure, data integrity error, etc.)

All of them. That's the point of check-parallel - it uses simple,
existing filesystem correctness tests to generate a massively
stressful load on the system...

> > - Is there any interest in automating longer running fsstress runs? Are
> > people already doing this with varying TIME_FACTOR configurations in
> > fstests?

At least for XFS, Darrick is already doing that, and I think Carlos
may be as well.

> > - There is relatively less testing with fsx than fsstress in fstests.
> > I believe this creates gaps for data corruption bugs rather than
> > "feature logic" issues that the fsstress feature set tends to hit.
> > - Can we standardize on some modular "stressors" and stress durations
> > to run to validate file systems?

I think we already have that with the "soak" and "stress" groups...

> > In the short term, I have been working on these ideas in a separate
> > barebones stress testing framework which I am happy to share, but isn't
> > particularly interesting in and of itself. It is basically just a
> > skeleton for concurrently running some concurrent "stressors" and then
> > validating the fs with some generic "validators". I plan to run it
> > internally just to see if I can get some useful results on our next few
> > major kernel releases.

check-parallel is effectively a massive concurrent stress workload
for the system. It does this by running many individual correctness
tests concurrently.

Run it on a 64p system or larger, and it will hammer both the test
filesystems and base filesystem that all the loop device image files
are laid out on. I'm seeing it generate 5-6GB/s of IO load, 40-50GB
of memory usage, and consistently use >90% of the CPU in the system
and stress the scheduler at over half a million context switches/s.

> > And of course, I would love to discuss anything else of interest to
> > people who like stress testing filesystems!

Filesystem stress testing by itself isn't really interesting to me.
Using filesystem correctness tests to create massively stressful
workloads, OTOH, attacks the problem from multiple angles and
exercises the system well outside the bounds of just filesystem
code.

-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
From: Boris Burkov @ 2025-02-04 19:58 UTC
To: Dave Chinner; +Cc: Amir Goldstein, lsf-pc, linux-fsdevel, fstests

On Tue, Feb 04, 2025 at 11:57:09AM +1100, Dave Chinner wrote:
> [...]
> Have you considered the 'soak' test group with a long SOAK_DURATION
> and then increasing the load using LOAD_FACTOR? Also there is a
> 'stress' group that TIME_FACTOR acts on.
>
> For XFS, there's also bunch of fuzzing tests (in the
> dangerous_fuzzers group) that use the same SOAK_DURATION
> infrastructure via common/fuzzy.

I hadn't realized people were running these for multi-day durations.
Thanks for pointing them out and for your other inline answers to my
questions.

> [...]
> That's pretty much what check-parallel does to a system. Loads of
> tests run things like drop_caches, memory compaction, CPU hotplug,
> etc. check-parallel essentially exposes every test to these sorts
> of background perturbations rather than just the one test that is
> running that perturbation. IOWs, even the most basic correctness
> test now gets exercised while cpu hotplug and memory compaction are
> going on in the background....
>
> Eventually, I plan to implement these background perturbations as
> separate control tasks for check-parallel so we don't need specific
> tests that run a background perturbation whilst the rest of the
> system is under test.

I think that a framework for introducing background perturbations while
running tests is definitely what I'm getting at. If check-parallel is a
good version of that, then that sounds great to me. I am particularly
excited about your point that it will smash together *every* stimulus
with *every* test. I do have some questions in my head about how that
would work in practice.

My main questions/concerns are:

How much do you randomize the interleaving of tests? Does
check-parallel run them in a random order?

Similarly, their durations are not at all tuned to maximize
interesting interactions. If test X and test Y would collide on some
faulty interaction, but test X runs once in 1 second, then you would
likely never see test X interfere with some interesting moment during
test Y. Are you considering feeding the tests back into the run-queue
as they finish for these stress style runs?

It seems that the two objectives of the test harness are sort of in
tension with using check-parallel to stress things. On one hand you
want tests to independently succeed or fail and on the other hand you
want noise from one test to disturb the other. I fear more of the
failures will turn out to be "Oh, well, when THAT happens, we would
expect this condition to be violated". Especially for the more "unit
test" style fstests that carefully use sync to check specific conditions
during a run.

This variant also feels like it would be at the extreme of difficulty
for attempting to distill a failure into a reproducer.

> [...]
> check-parallel is effectively a massive concurrent stress workload
> for the system. It does this by running many individual correctness
> tests concurrently.
>
> Run it on a 64p system or larger, and it will hammer both the test
> filesystems and base filesystem that all the loop device image files
> are laid out on. I'm seeing it generate 5-6GB/s of IO load, 40-50GB
> of memory usage, and consistently use >90% of the CPU in the system
> stress the scheduler at over half a million context switches/s.

I will definitely invest some time into getting check-parallel to run
with btrfs, and hopefully it turns up some interesting stuff.

> [...]
> Filesystem stress testing by itself isn't really interesting to me.
> Using filesystem correctness tests to create massively stressful
> workloads, OTOH, attacks the problem from multiple angles and
> exercises the system well outside the bounds of just filesystem
> code.

From what I see, today we have a handful of tests which race fsx or
fsstress with 0-2 operations under test, and you are proposing using
check-parallel to hammer the computer with the entirety of all 1000
tests in parallel (awesome). I think I am proposing something in between
where we run fsx AND fsstress AND ~10 known scary operations. That has
proven to dredge up bugs in btrfs (where the simpler fsstress plus one
thing doesn't). I think check-parallel will be more stressful, but that
this "mega fsstress run" will be more predictable and easier to tune/get
reproducers out of.

Thanks again for your thoughts,
Boris
* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
2025-02-04 19:58 ` Boris Burkov
@ 2025-02-04 21:14 ` Dave Chinner
0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2025-02-04 21:14 UTC (permalink / raw)
To: Boris Burkov; +Cc: Amir Goldstein, lsf-pc, linux-fsdevel, fstests

On Tue, Feb 04, 2025 at 11:58:46AM -0800, Boris Burkov wrote:
> On Tue, Feb 04, 2025 at 11:57:09AM +1100, Dave Chinner wrote:
> > > > - were able to reproduce the bugs with a predictable concoction of "run
> > > > a workload and some known nasty btrfs operations in parallel". The most
> > > > common form of this was running 'fsstress' and 'btrfs balance', but it
> > > > wasn't quite universal. Sometimes we needed reflink threads, or
> > > > drop_caches, or memory pressure, etc. to trigger a bug.
> >
> > That's pretty much what check-parallel does to a system. Loads of
> > tests run things like drop_caches, memory compaction, CPU hotplug,
> > etc. check-parallel essentially exposes every test to these sorts
> > of background perturbations rather than just the one test that is
> > running that perturbation. IOWs, even the most basic correctness
> > test now gets exercised while cpu hotplug and memory compaction are
> > going on in the background....
> >
> > Eventually, I plan to implement these background perturbations as
> > separate control tasks for check-parallel so we don't need specific
> > tests that run a background perturbation whilst the rest of the
> > system is under test.
>
> I think that a framework for introducing background perturbations while
> running tests is definitely what I'm getting at. If check-parallel is a
> good version of that, then that sounds great to me. I am particularly
> excited about your point that it will smash together *every* stimulus
> with *every* test. I do have some questions in my head about how that
> would work in practice.
>
> My main questions/concerns are:
>
> How much do you randomize the interleaving of tests? Does
> check-parallel run them in a random order?

Same as check - the "-r" option will randomise the test run order.
The test run order is also somewhat randomised by default in that it
sorts the test run order based on the runtime of each test in the
previous test run. Hence test run order is not static - it generally
runs long running tests before short running tests, but the exact
order is not fixed.

> Similarly, their durations are not at all tuned to maximize
> interesting interactions. If test X and test Y would collide on some
> faulty interaction, but test X runs once in 1 second, then you would
> likely never see test X interfere with some interesting moment during
> test Y. Are you considering feeding the tests back into the run-queue
> as they finish for these stress style runs?

Not yet - the infrastructure to directly manage and run tests from
check-parallel is not yet in place. It currently generates a test
list for each runner thread then executes that via a check instance
per runner thread.

I plan to have check-parallel execute tests individually itself by
factoring the run loop out of check (similar to how I'm doing the
test list parsing). Once there is direct control of the test
execution, stuff like dynamic test queues where runners just pull
the next test to run off the queue and they keep going until the
queue is empty will be possible.

> It seems that the two objectives of the test harness are sort of in
> tension with using check-parallel to stress things. On one hand you
> want tests to independently succeed or fail and on the other hand you
> want noise from one test to disturb the other.

Yes. Tests are largely written such that they don't interfere with
each other.

> I fear more of the
> failures will turn out to be "Oh, well, when THAT happens, we would
> expect this condition to be violated". Especially for the more "unit
> test" style fstests that carefully use sync to check specific conditions
> during a run.

That's why I currently have an "unreliable_in_parallel" test group
definition and check-parallel excludes that test group. There are
about 20 tests I've classified this way, most of them xfs specific
tests that are reliant on exact fragmentation patterns being created.
These tests are perturbed by things like sync(1) calls from other
tests, which result in a different fragmentation pattern than the
test expects to see.

In each case, there is a comment in the test explaining the condition
that makes the test unreliable in parallel, and so we have some idea
of what needs fixing to be able to remove it from the
unreliable_in_parallel group. Essentially, I'm using this as a marker
and note for future improvements once all the (more important)
infrastructure work is done and solid.

> This variant also feels like it would be at the extreme of difficulty
> for attempting to distill a failure into a reproducer.

It's pretty obvious when a test is doing something that is influenced
by an outside event. The biggest problem for debugging them comes
when the test failures appear to be real bugs (e.g. all the weird and
whacky off-by-one quota failures that check-parallel triggers on XFS)
but they cannot be reproduced when the tests are run serially.

.....

> > > > And of course, I would love to discuss anything else of interest to
> > > > people who like stress testing filesystems!
> >
> > Filesystem stress testing by itself isn't really interesting to me.
> > Using filesystem correctness tests to create massively stressful
> > workloads, OTOH, attacks the problem from multiple angles and
> > exercises the system well outside the bounds of just filesystem
> > code.
>
> From what I see, today we have a handful of tests which race fsx or
> fsstress with 0-2 operations under test, and you are proposing using
> check-parallel to hammer the computer with the entirety of all 1000
> tests in parallel (awesome).

It's currently running one test per CPU in parallel, not all at once.
Many tests run lots of stuff in parallel themselves, too, and some of
them hammer large CPU count machines really hard just by themselves,
let alone when there's another 63 tests running concurrently....

> I think I am proposing something in between
> where we run fsx AND fsstress AND ~10 known scary operations.

Write a set of tests that do this for btrfs and put them in the
auto/stress/soak groups. Then run 'check-parallel -g soak,stress ....'

-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 10+ messages in thread
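[Editor's sketch] The scheduling ideas in the message above (ordering
by previous runtime, excluding the unreliable_in_parallel group, and
runners pulling the next test off a shared queue until it is empty)
can be illustrated roughly as follows. This is a hypothetical Python
sketch, not check-parallel's code: the test names, groups, runtimes
and the run_test callback are all made up for illustration.

```python
import queue
import threading

# Hypothetical test records: (name, groups, runtime in seconds from the previous run).
TESTS = [
    ("demo/001", {"auto", "quick"}, 4),
    ("demo/002", {"auto", "stress", "soak"}, 180),
    ("demo/003", {"auto", "unreliable_in_parallel"}, 30),
    ("demo/004", {"auto", "stress"}, 95),
    ("demo/005", {"auto", "stress"}, 20),
]


def build_run_list(tests, exclude_group="unreliable_in_parallel"):
    """Drop tests in the excluded group, then order longest previous runtime first."""
    kept = [t for t in tests if exclude_group not in t[1]]
    return sorted(kept, key=lambda t: t[2], reverse=True)


def run_parallel(run_list, num_runners, run_test):
    """Each runner pulls the next test off a shared queue until the queue is empty."""
    work = queue.Queue()
    for test in run_list:
        work.put(test)

    results = {}
    lock = threading.Lock()

    def runner(runner_id):
        while True:
            try:
                name, _groups, _runtime = work.get_nowait()
            except queue.Empty:
                return  # nothing left to run
            outcome = run_test(name)
            with lock:
                results[name] = (runner_id, outcome)

    threads = [threading.Thread(target=runner, args=(i,)) for i in range(num_runners)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    ordered = build_run_list(TESTS)
    print([name for name, _, _ in ordered])
    print(run_parallel(ordered, num_runners=2, run_test=lambda name: "pass"))
```

Feeding finished tests back into the queue, as asked about above,
would be a small change to runner() in a sketch like this; whether
that is the right design for check-parallel is not something this
example tries to settle.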
* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
2025-02-03 18:55 [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems Boris Burkov
2025-02-03 19:12 ` Amir Goldstein
@ 2025-02-03 19:14 ` Sweet Tea Dorminy
2025-02-03 19:53 ` Darrick J. Wong
2 siblings, 0 replies; 10+ messages in thread
From: Sweet Tea Dorminy @ 2025-02-03 19:14 UTC (permalink / raw)
To: Boris Burkov, lsf-pc; +Cc: linux-fsdevel, Matthew Sakai, dm-devel

> I attempted to study the prior art on this and so far have found:
> - fsstress/fsx and the attendant tests in fstests/. There are ~150-200
> tests using fsstress and fsx in fstests/. Most of them are xfs and
> btrfs tests following the aforementioned pattern of racing fsstress
> with some scary operations. Most of them tend to run for 30s, though
> some are longer (and of course subject to TIME_FACTOR configuration)
> - Similar duration error injection tests in fstests (e.g. generic/475)
> - The NFSv4 Test Project
> https://www.kernel.org/doc/ols/2006/ols2006v2-pages-275-294.pdf
> A choice quote regarding stress testing:
> "One year after we started using FSSTRESS (in April 2005) Linux NFSv4
> was able to sustain the concurrent load of 10 processes during 24
> hours, without any problem. Three months later, NFSv4 reached 72 hours
> of stress under FSSTRESS, without any bugs. From this date, NFSv4
> filesystem tree manipulation is considered to be stable."
>
>
> I would like to discuss:
> - Am I missing other strategies people are employing? Apologies if there
> are obvious ones, but I tried to hunt around for a few days :)
> - What is the universe of interesting stressors (e.g., reflink, scrub,
> online repair, balance, etc.)

It's not a filesystem, but the dm-vdo project has some similarities,
doing deduplication, compression, and thin provisioning. As such,
they have a fairly extensive set of tests of dm-vdo, and in
particular they do a fair bit of stress testing. For them, the
universe is reboots, crashes, complete rebuilds, read-only entry and
exit, compression enable/disable, and 512 byte sector mode
enable/disable. They've been running about fifty hours a week of
these tests inside of Red Hat. For instance,
https://github.com/dm-vdo/vdo-devel/blob/main/src/perl/vdotest/VDOTest/RebuildStress03.pm
is one of the tests showing the random selection of operations.

When these tests were first introduced eight years ago, they did
catch some crash or data corruption bugs which were not covered by
the existing universe of fstests-like tests for dm-vdo. There was
also a filesystem inconsistency uncovered at the time:
https://lore.kernel.org/all/CALoZfD4-uqhRSfEh0Y+v8jjSDY2KkAh-hhwdLnRgZopHEETUXA@mail.gmail.com/

I would suggest Matt Sakai, cc'd, or another of the VDO folks as a
valuable contributor to this discussion, given the VDO folks' long
experience with stress testing.

Sweet Tea
^ permalink raw reply [flat|nested] 10+ messages in thread
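[Editor's sketch] "Random selection of operations" from a fixed
universe can be made replayable by seeding the generator, so that a
failing sequence can be run again. The snippet below is a minimal
Python illustration under that assumption. It is not the dm-vdo test
code (which is Perl); the operation labels are only shorthand for the
list in the message above, and plan_operations is a hypothetical
helper.

```python
import random

# Shorthand labels for the operation universe named in the message above.
OPERATIONS = [
    "reboot",
    "crash",
    "complete_rebuild",
    "read_only_entry_exit",
    "compression_toggle",
    "sector_512_mode_toggle",
]


def plan_operations(seed, count, operations=OPERATIONS):
    """Return a reproducible random sequence of operations for one stress run."""
    rng = random.Random(seed)  # private generator: same seed, same sequence
    return [rng.choice(operations) for _ in range(count)]


if __name__ == "__main__":
    # Logging the seed alongside the plan is what makes a failure replayable.
    seed = 1
    print(seed, plan_operations(seed, 8))
```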
* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
2025-02-03 18:55 [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems Boris Burkov
2025-02-03 19:12 ` Amir Goldstein
2025-02-03 19:14 ` Sweet Tea Dorminy
@ 2025-02-03 19:53 ` Darrick J. Wong
2025-02-04 19:38 ` Boris Burkov
2 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2025-02-03 19:53 UTC (permalink / raw)
To: Boris Burkov; +Cc: lsf-pc, linux-fsdevel

On Mon, Feb 03, 2025 at 10:55:19AM -0800, Boris Burkov wrote:
> At Meta, we currently primarily rely on fstests 'auto' runs for
> validating Btrfs as a general purpose filesystem for all of our root
> drives. While this has obviously proven to be a very useful test suite
> with rich collaboration across teams and filesystems, we have observed a
> recent trend in our production filesystem issues that makes us question
> if it is sufficient.
>
> Over the last few years, we have had a number of issues (primarily in
> Btrfs, but at least one notable one in Xfs) that have been detected in
> production, then reproduced with an unreliable non-specific stressor
> that takes hours or even days to trigger the issue.
> Examples:
> - Btrfs relocation bugs
> https://lore.kernel.org/linux-btrfs/68766e66ed15ca2e7550585ed09434249db912a2.1727212293.git.josef@toxicpanda.com/
> https://lore.kernel.org/linux-btrfs/fc61fb63e534111f5837c204ec341c876637af69.1731513908.git.josef@toxicpanda.com/
> - Btrfs extent map merging corruption
> https://lore.kernel.org/linux-btrfs/9b98ba80e2cf32f6fb3b15dae9ee92507a9d59c7.1729537596.git.boris@bur.io/
> - Btrfs dio data corruptions from bio splitting
> (mostly our internal errors trying to make minimal backports of
> https://lore.kernel.org/linux-btrfs/cover.1679512207.git.boris@bur.io/
> and Christoph's related series)
> - Xfs large folios
> https://lore.kernel.org/linux-fsdevel/effc0ec7-cf9d-44dc-aee5-563942242522@meta.com/
>
> In my view, the common threads between these are that:
> - we used fstests to validate these systems, in some cases even with
> specific regression tests for highly related bugs, but still missed
> the bugs until they hit us during our production release process. In
> all cases, we had passing 'fstests -g auto' runs.
> - were able to reproduce the bugs with a predictable concoction of "run
> a workload and some known nasty btrfs operations in parallel". The most
> common form of this was running 'fsstress' and 'btrfs balance', but it
> wasn't quite universal. Sometimes we needed reflink threads, or
> drop_caches, or memory pressure, etc. to trigger a bug.
> - The relatively generic stressing reproducers took hours or days to
> produce an issue then the investigating engineer could try to tweak and
> tune it by trial and error to bring that time down for a particular bug.
>
> This leads me to the conclusion that there is some room for improvement in
> stress testing filesystems (at least Btrfs).
>
> I attempted to study the prior art on this and so far have found:
> - fsstress/fsx and the attendant tests in fstests/. There are ~150-200
> tests using fsstress and fsx in fstests/. Most of them are xfs and
> btrfs tests following the aforementioned pattern of racing fsstress
> with some scary operations. Most of them tend to run for 30s, though
> some are longer (and of course subject to TIME_FACTOR configuration)
> - Similar duration error injection tests in fstests (e.g. generic/475)
> - The NFSv4 Test Project
> https://www.kernel.org/doc/ols/2006/ols2006v2-pages-275-294.pdf
> A choice quote regarding stress testing:
> "One year after we started using FSSTRESS (in April 2005) Linux NFSv4
> was able to sustain the concurrent load of 10 processes during 24
> hours, without any problem. Three months later, NFSv4 reached 72 hours
> of stress under FSSTRESS, without any bugs. From this date, NFSv4
> filesystem tree manipulation is considered to be stable."
>
>
> I would like to discuss:
> - Am I missing other strategies people are employing? Apologies if there
> are obvious ones, but I tried to hunt around for a few days :)

At the moment I start six VMs per "configuration", which each run one of:

generic/521 (directio)
generic/522 (bufferedio)
generic/476 (fsstress)
generic/388 (fsstress + log recovery)
xfs/285 (online fsck)
xfs/286 (online metadata rebuild)

with SOAK_DURATION=6.5d so that they wrap up right around the time that
each rc release drops. I also set FSSTRESS_AVOID="-m 16" so that we
don't end up with gigantic quota files.

There are two "configurations" per kernel tree. The dot product of them
are:

djwong-dev:
-m metadir=1,autofsck=1,uquota,gquota,pquota,
-m metadir=1,autofsck=1,uquota,gquota,pquota, -d rtinherit=1,

tot mainline:
-m autofsck=1, -d rtinherit=1,
-m autofsck=1,

for-next:
-m metadir=1,autofsck=1,uquota,gquota,pquota,
-m metadir=1,autofsck=1,uquota,gquota,pquota, -d rtinherit=1,

Actually, I just realized that with 6.14 I need to update the tot
mainline configuration to have metadir=1.

> - What is the universe of interesting stressors (e.g., reflink, scrub,
> online repair, balance, etc.)

Prodding djwong and everyone else into loading up fsx/fsstress with
all their weird new file io calls. ;)

> - What is the universe of interesting validation conditions (e.g.,
> kernel panic, read only fs, fsck failure, data integrity error, etc.)
> - Is there any interest in automating longer running fsstress runs? Are
> people already doing this with varying TIME_FACTOR configurations in
> fstests?

I don't run with SOAK_DURATION > 14 days because I generally haven't
found larger values to be useful in finding bugs. However, these weekly
long soak test runs have been going since 2016.

FWIW that actually started because we had a lot of customer complaints
in that era about log recovery failures in xfs, and only later did I
spread it beyond generic/388 to the six profiles above.

> - There is relatively less testing with fsx than fsstress in fstests.
> I believe this creates gaps for data corruption bugs rather than
> "feature logic" issues that the fsstress feature set tends to hit.

Probably. I wonder how much we're really flexing io_uring?

--D

> - Can we standardize on some modular "stressors" and stress durations
> to run to validate file systems?
>
> In the short term, I have been working on these ideas in a separate
> barebones stress testing framework which I am happy to share, but isn't
> particularly interesting in and of itself. It is basically just a
> skeleton for concurrently running some concurrent "stressors" and then
> validating the fs with some generic "validators". I plan to run it
> internally just to see if I can get some useful results on our next few
> major kernel releases.
>
> And of course, I would love to discuss anything else of interest to
> people who like stress testing filesystems!
>
> Thanks,
> Boris
>
^ permalink raw reply [flat|nested] 10+ messages in thread
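[Editor's sketch] The soak setup described in the message above is a
combination of kernel tree x mkfs configuration x test, with one VM
per combination. The Python snippet below enumerates that matrix to
show its size. The test names and option strings are copied from the
message; soak_seconds() is plain arithmetic and is only an assumption
about what "6.5d" means, not a copy of how fstests parses
SOAK_DURATION.

```python
from itertools import product

# The six soak tests listed in the message, one VM each per configuration.
SOAK_TESTS = [
    "generic/521",  # directio
    "generic/522",  # bufferedio
    "generic/476",  # fsstress
    "generic/388",  # fsstress + log recovery
    "xfs/285",      # online fsck
    "xfs/286",      # online metadata rebuild
]

# Two mkfs option sets per kernel tree, copied as written in the message.
CONFIGS = {
    "djwong-dev": [
        "-m metadir=1,autofsck=1,uquota,gquota,pquota,",
        "-m metadir=1,autofsck=1,uquota,gquota,pquota, -d rtinherit=1,",
    ],
    "tot mainline": [
        "-m autofsck=1, -d rtinherit=1,",
        "-m autofsck=1,",
    ],
    "for-next": [
        "-m metadir=1,autofsck=1,uquota,gquota,pquota,",
        "-m metadir=1,autofsck=1,uquota,gquota,pquota, -d rtinherit=1,",
    ],
}


def vm_matrix(configs, tests):
    """One (tree, mkfs options, test) tuple per VM."""
    return [
        (tree, opts, test)
        for tree, opt_list in configs.items()
        for opts, test in product(opt_list, tests)
    ]


def soak_seconds(days):
    """Length of a soak run in seconds, assuming 'd' means 24-hour days."""
    return int(days * 24 * 60 * 60)


if __name__ == "__main__":
    print(len(vm_matrix(CONFIGS, SOAK_TESTS)), "VMs")
    print(soak_seconds(6.5), "seconds per soak run")
```

Under these assumptions the matrix comes to 3 trees x 2 configurations
x 6 tests = 36 VMs, each running for a little under a week.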
* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems 2025-02-03 19:53 ` Darrick J. Wong @ 2025-02-04 19:38 ` Boris Burkov 2025-02-04 22:09 ` Darrick J. Wong 0 siblings, 1 reply; 10+ messages in thread From: Boris Burkov @ 2025-02-04 19:38 UTC (permalink / raw) To: Darrick J. Wong; +Cc: lsf-pc, linux-fsdevel On Mon, Feb 03, 2025 at 11:53:43AM -0800, Darrick J. Wong wrote: > On Mon, Feb 03, 2025 at 10:55:19AM -0800, Boris Burkov wrote: > > At Meta, we currently primarily rely on fstests 'auto' runs for > > validating Btrfs as a general purpose filesystem for all of our root > > drives. While this has obviously proven to be a very useful test suite > > with rich collaboration across teams and filesystems, we have observed a > > recent trend in our production filesystem issues that makes us question > > if it is sufficient. > > > > Over the last few years, we have had a number of issues (primarily in > > Btrfs, but at least one notable one in Xfs) that have been detected in > > production, then reproduced with an unreliable non-specific stressor > > that takes hours or even days to trigger the issue. 
> > Examples: > > - Btrfs relocation bugs > > https://lore.kernel.org/linux-btrfs/68766e66ed15ca2e7550585ed09434249db912a2.1727212293.git.josef@toxicpanda.com/ > > https://lore.kernel.org/linux-btrfs/fc61fb63e534111f5837c204ec341c876637af69.1731513908.git.josef@toxicpanda.com/ > > - Btrfs extent map merging corruption > > https://lore.kernel.org/linux-btrfs/9b98ba80e2cf32f6fb3b15dae9ee92507a9d59c7.1729537596.git.boris@bur.io/ > > - Btrfs dio data corruptions from bio splitting > > (mostly our internal errors trying to make minimal backports of > > https://lore.kernel.org/linux-btrfs/cover.1679512207.git.boris@bur.io/ > > and Christoph's related series) > > - Xfs large folios > > https://lore.kernel.org/linux-fsdevel/effc0ec7-cf9d-44dc-aee5-563942242522@meta.com/ > > > > In my view, the common threads between these are that: > > - we used fstests to validate these systems, in some cases even with > > specific regression tests for highly related bugs, but still missed > > the bugs until they hit us during our production release process. In > > all cases, we had passing 'fstests -g auto' runs. > > - were able to reproduce the bugs with a predictable concoction of "run > > a workload and some known nasty btrfs operations in parallel". The most > > common form of this was running 'fsstress' and 'btrfs balance', but it > > wasn't quite universal. Sometimes we needed reflink threads, or > > drop_caches, or memory pressure, etc. to trigger a bug. > > - The relatively generic stressing reproducers took hours or days to > > produce an issue then the investigating engineer could try to tweak and > > tune it by trial and error to bring that time down for a particular bug. > > > > This leads me to the conclusion that there is some room for improvement in > > stress testing filesystems (at least Btrfs). > > > > I attempted to study the prior art on this and so far have found: > > - fsstress/fsx and the attendant tests in fstests/. 
There are ~150-200 > > tests using fsstress and fsx in fstests/. Most of them are xfs and > > btrfs tests following the aforementioned pattern of racing fsstress > > with some scary operations. Most of them tend to run for 30s, though > > some are longer (and of course subject to TIME_FACTOR configuration) > > - Similar duration error injection tests in fstests (e.g. generic/475) > > - The NFSv4 Test Project > > https://www.kernel.org/doc/ols/2006/ols2006v2-pages-275-294.pdf > > A choice quote regarding stress testing: > > "One year after we started using FSSTRESS (in April 2005) Linux NFSv4 > > was able to sustain the concurrent load of 10 processes during 24 > > hours, without any problem. Three months later, NFSv4 reached 72 hours > > of stress under FSSTRESS, without any bugs. From this date, NFSv4 > > filesystem tree manipulation is considered to be stable." > > > > > > I would like to discuss: > > - Am I missing other strategies people are employing? Apologies if there > > are obvious ones, but I tried to hunt around for a few days :) > > At the moment I start six VMs per "configuration", which each run one of: > > generic/521 (directio) > generic/522 (bufferedio) > generic/476 (fsstress) > generic/388 (fsstress + log recovery) > xfs/285 (online fsck) > xfs/286 (online metadata rebuild) That is sweet, and sorry I missed the soak category. I would love to hear more about your experience with these tests! Do they catch a lot of bugs? How hard is it to reduce the reproducer down to something smaller and quicker when you do hit something? I'm also surprised at how few of these you need. Does xfs just have a lot fewer online "admin" operations like device replace, defrag, balance, enabling/disabling compression, etc. than btrfs so you need fewer tests like that? Or do you just not think that adding more noise would catch enough bugs to make it worth it? Or does fsstress encompass all the operations you are interested in? 
There are a bunch of similar btrfs tests (btrfs/060-074 race fsstress with one or two interesting btrfs operations each) but we don't currently run them for much longer than 30s. I am curious to try running them as soak tests, now, and adding fsx running variants. That will end up with like ~30 pretty similar tests, that I also feel could sort of just be one big modular test? Which kind of gets back to what I was getting at in the first place. I don't know enough about xfs to fully grok what the various configurations do to the test (I imagine they enable various features you want to validate under the soak), but I imagine there are still more nasty things to do to the system in parallel. > > with SOAK_DURATION=6.5d so that they wrap up right around the time that > each rc release drops. I also set FSSTRESS_AVOID="-m 16" so that we > don't end up with gigantic quota files. > > There are two "configurations" per kernel tree. The dot product of them > are: > > djwong-dev: > -m metadir=1,autofsck=1,uquota,gquota,pquota, > -m metadir=1,autofsck=1,uquota,gquota,pquota, -d rtinherit=1, > > tot mainline: > -m autofsck=1, -d rtinherit=1, > -m autofsck=1, > > for-next: > -m metadir=1,autofsck=1,uquota,gquota,pquota, > -m metadir=1,autofsck=1,uquota,gquota,pquota, -d rtinherit=1, > > Actually, I just realized that with 6.14 I need to update the tot > mainline configuration to have metadir=1. > > > - What is the universe of interesting stressors (e.g., reflink, scrub, > > online repair, balance, etc.) > > Prodding djwong and everyone else into loading up fsx/fsstress with > all their weird new file io calls. ;) I think this is quite interesting, actually. Fsstress already does create and delete snapshots and makes reflinks, but there have been a number of bugs that I have been unable to reproduce with raw fsstress but if I run fsstress PLUS more external reflinking/snapshotting/syncing/etc threads, then they reproduce. 
It seems, logically, I could keep fussing with my fsstress invocation to get there, but that was my experience. Separately, how much do we want to be adding features that are only in one or two filesystems to fsstress (similar to my points above regarding test cardinality explosion) > > > - What is the universe of interesting validation conditions (e.g., > > kernel panic, read only fs, fsck failure, data integrity error, etc.) > > - Is there any interest in automating longer running fsstress runs? Are > > people already doing this with varying TIME_FACTOR configurations in > > fstests? > > I don't run with SOAK_DURATION > 14 days because I generally haven't > found larger values to be useful in finding bugs. However, these weekly > long soak tests runs have been going since 2016. That makes sense to me, it does feel like a day to a week is probably the sweet spot. > > FWIW that actually started because we had a lot of customer complaints > in that era about log recovery failures in xfs, and only later did I > spread it beyond generic/388 to the six profiles above. > > > - There is relatively less testing with fsx than fsstress in fstests. > > I believe this creates gaps for data corruption bugs rather than > > "feature logic" issues that the fsstress feature set tends to hit. > > Probably. I wonder how much we're really flexing io_uring? > > --D > > > - Can we standardize on some modular "stressors" and stress durations > > to run to validate file systems? > > > > In the short term, I have been working on these ideas in a separate > > barebones stress testing framework which I am happy to share, but isn't > > particularly interesting in and of itself. It is basically just a > > skeleton for concurrently running some concurrent "stressors" and then > > validating the fs with some generic "validators". I plan to run it > > internally just to see if I can get some useful results on our next few > > major kernel releases. 
> > > > And of course, I would love to discuss anything else of interest to > > people who like stress testing filesystems! > > > > Thanks, > > Boris > > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems 2025-02-04 19:38 ` Boris Burkov @ 2025-02-04 22:09 ` Darrick J. Wong 2025-02-05 4:38 ` Dave Chinner 0 siblings, 1 reply; 10+ messages in thread From: Darrick J. Wong @ 2025-02-04 22:09 UTC (permalink / raw) To: Boris Burkov; +Cc: lsf-pc, linux-fsdevel On Tue, Feb 04, 2025 at 11:38:45AM -0800, Boris Burkov wrote: > On Mon, Feb 03, 2025 at 11:53:43AM -0800, Darrick J. Wong wrote: > > On Mon, Feb 03, 2025 at 10:55:19AM -0800, Boris Burkov wrote: > > > At Meta, we currently primarily rely on fstests 'auto' runs for > > > validating Btrfs as a general purpose filesystem for all of our root > > > drives. While this has obviously proven to be a very useful test suite > > > with rich collaboration across teams and filesystems, we have observed a > > > recent trend in our production filesystem issues that makes us question > > > if it is sufficient. > > > > > > Over the last few years, we have had a number of issues (primarily in > > > Btrfs, but at least one notable one in Xfs) that have been detected in > > > production, then reproduced with an unreliable non-specific stressor > > > that takes hours or even days to trigger the issue. 
> > > Examples: > > > - Btrfs relocation bugs > > > https://lore.kernel.org/linux-btrfs/68766e66ed15ca2e7550585ed09434249db912a2.1727212293.git.josef@toxicpanda.com/ > > > https://lore.kernel.org/linux-btrfs/fc61fb63e534111f5837c204ec341c876637af69.1731513908.git.josef@toxicpanda.com/ > > > - Btrfs extent map merging corruption > > > https://lore.kernel.org/linux-btrfs/9b98ba80e2cf32f6fb3b15dae9ee92507a9d59c7.1729537596.git.boris@bur.io/ > > > - Btrfs dio data corruptions from bio splitting > > > (mostly our internal errors trying to make minimal backports of > > > https://lore.kernel.org/linux-btrfs/cover.1679512207.git.boris@bur.io/ > > > and Christoph's related series) > > > - Xfs large folios > > > https://lore.kernel.org/linux-fsdevel/effc0ec7-cf9d-44dc-aee5-563942242522@meta.com/ > > > > > > In my view, the common threads between these are that: > > > - we used fstests to validate these systems, in some cases even with > > > specific regression tests for highly related bugs, but still missed > > > the bugs until they hit us during our production release process. In > > > all cases, we had passing 'fstests -g auto' runs. > > > - were able to reproduce the bugs with a predictable concoction of "run > > > a workload and some known nasty btrfs operations in parallel". The most > > > common form of this was running 'fsstress' and 'btrfs balance', but it > > > wasn't quite universal. Sometimes we needed reflink threads, or > > > drop_caches, or memory pressure, etc. to trigger a bug. > > > - The relatively generic stressing reproducers took hours or days to > > > produce an issue then the investigating engineer could try to tweak and > > > tune it by trial and error to bring that time down for a particular bug. > > > > > > This leads me to the conclusion that there is some room for improvement in > > > stress testing filesystems (at least Btrfs). 
> > > > > > I attempted to study the prior art on this and so far have found: > > > - fsstress/fsx and the attendant tests in fstests/. There are ~150-200 > > > tests using fsstress and fsx in fstests/. Most of them are xfs and > > > btrfs tests following the aforementioned pattern of racing fsstress > > > with some scary operations. Most of them tend to run for 30s, though > > > some are longer (and of course subject to TIME_FACTOR configuration) > > > - Similar duration error injection tests in fstests (e.g. generic/475) > > > - The NFSv4 Test Project > > > https://www.kernel.org/doc/ols/2006/ols2006v2-pages-275-294.pdf > > > A choice quote regarding stress testing: > > > "One year after we started using FSSTRESS (in April 2005) Linux NFSv4 > > > was able to sustain the concurrent load of 10 processes during 24 > > > hours, without any problem. Three months later, NFSv4 reached 72 hours > > > of stress under FSSTRESS, without any bugs. From this date, NFSv4 > > > filesystem tree manipulation is considered to be stable." > > > > > > > > > I would like to discuss: > > > - Am I missing other strategies people are employing? Apologies if there > > > are obvious ones, but I tried to hunt around for a few days :) > > > > At the moment I start six VMs per "configuration", which each run one of: > > > > generic/521 (directio) > > generic/522 (bufferedio) > > generic/476 (fsstress) > > generic/388 (fsstress + log recovery) > > xfs/285 (online fsck) > > xfs/286 (online metadata rebuild) > > That is sweet, and sorry I missed the soak category. I would love to > hear more about your experience with these tests! Do they catch a lot of > bugs? How hard is it to reduce the reproducer down to something smaller > and quicker when you do hit something? I'm also surprised at how few of It's usually pretty difficult for data corruption reports from fsx, but most of the fsstress failures are either really obvious (dead fs) or emit stacktraces and lockdep/kasan reports. 
> these you need. Does xfs just have a lot fewer online "admin" operations > like device replace, defrag, balance, enabling/disabling compression, etc. > than btrfs so you need fewer tests like that? Or do you just not think > that adding more noise would catch enough bugs to make it worth it? > Or does fsstress encompass all the operations you are interested in? xfs has a lot less stuff to manage, so fsstress/fsx are usually enough. There's a quartet of specialty tests xfs/285,286,565,566 that exercise fsstress/fsx against online fsck and repair. > There are a bunch of similar btrfs tests (btrfs/060-074 race fsstress > with one or two interesting btrfs operations each) but we don't > currently run them for much longer than 30s. I am curious to try running > them as soak tests, now, and adding fsx running variants. That will end > up with like ~30 pretty similar tests, that I also feel could sort of > just be one big modular test? I suggest leaving the specialty tests around (and not in the auto group) and creating one btrfs/ test that turns on *everything* and races that against fsstress? > Which kind of gets back to what I was getting at in the first place. I > don't know enough about xfs to fully grok what the various > configurations do to the test (I imagine they enable various features > you want to validate under the soak), but I imagine there are still more > nasty things to do to the system in parallel. Probably, but we've never really dug into that. Dave might get there with check-parallel but I don't have 64p systems to spare right now. As for configurations -- yeah, that's how we deal with the combinatoric explosion of mkfs options. Run a lot of different weird configs in parallel with a fleet of VMs. It's too bad that sort of implies that we all have to work for cloud vendors. > > > > with SOAK_DURATION=6.5d so that they wrap up right around the time that > > each rc release drops. 
I also set FSSTRESS_AVOID="-m 16" so that we > > don't end up with gigantic quota files. > > > > There are two "configurations" per kernel tree. The dot product of them > > are: > > > > djwong-dev: > > -m metadir=1,autofsck=1,uquota,gquota,pquota, > > -m metadir=1,autofsck=1,uquota,gquota,pquota, -d rtinherit=1, > > > > tot mainline: > > -m autofsck=1, -d rtinherit=1, > > -m autofsck=1, > > > > for-next: > > -m metadir=1,autofsck=1,uquota,gquota,pquota, > > -m metadir=1,autofsck=1,uquota,gquota,pquota, -d rtinherit=1, > > > > Actually, I just realized that with 6.14 I need to update the tot > > mainline configuration to have metadir=1. > > > > > - What is the universe of interesting stressors (e.g., reflink, scrub, > > > online repair, balance, etc.) > > > > Prodding djwong and everyone else into loading up fsx/fsstress with > > all their weird new file io calls. ;) > > I think this is quite interesting, actually. Fsstress already does > create and delete snapshots and makes reflinks, but there have been a > number of bugs that I have been unable to reproduce with raw fsstress > but if I run fsstress PLUS more external > reflinking/snapshotting/syncing/etc threads, then they reproduce. It > seems, logically, I could keep fussing with my fsstress invocation to > get there, but that was my experience. > > Separately, how much do we want to be adding features that are only in > one or two filesystems to fsstress (similar to my points above regarding > test cardinality explosion) That's where I think "add it all to fsstress" becomes less useful -- it might not be a great idea to clutter it up with too many weird ioctls. That said, I think xfs and bcachefs slowly co-opt the btrfs ones over time. --D > > > > > - What is the universe of interesting validation conditions (e.g., > > > kernel panic, read only fs, fsck failure, data integrity error, etc.) > > > - Is there any interest in automating longer running fsstress runs? 
Are > > > people already doing this with varying TIME_FACTOR configurations in > > > fstests? > > > > I don't run with SOAK_DURATION > 14 days because I generally haven't > > found larger values to be useful in finding bugs. However, these weekly > > long soak tests runs have been going since 2016. > > That makes sense to me, it does feel like a day to a week is probably > the sweet spot. > > > > > FWIW that actually started because we had a lot of customer complaints > > in that era about log recovery failures in xfs, and only later did I > > spread it beyond generic/388 to the six profiles above. > > > > > - There is relatively less testing with fsx than fsstress in fstests. > > > I believe this creates gaps for data corruption bugs rather than > > > "feature logic" issues that the fsstress feature set tends to hit. > > > > Probably. I wonder how much we're really flexing io_uring? > > > > --D > > > > > - Can we standardize on some modular "stressors" and stress durations > > > to run to validate file systems? > > > > > > In the short term, I have been working on these ideas in a separate > > > barebones stress testing framework which I am happy to share, but isn't > > > particularly interesting in and of itself. It is basically just a > > > skeleton for concurrently running some concurrent "stressors" and then > > > validating the fs with some generic "validators". I plan to run it > > > internally just to see if I can get some useful results on our next few > > > major kernel releases. > > > > > > And of course, I would love to discuss anything else of interest to > > > people who like stress testing filesystems! > > > > > > Thanks, > > > Boris > > > > ^ permalink raw reply [flat|nested] 10+ messages in thread
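[Editor's sketch] The "stressors" plus "validators" skeleton quoted
above, and the suggestion of one test that turns on everything and
races it against fsstress, share a simple shape: run several
stressors concurrently for a fixed time, stop them, then run the
checks. The Python sketch below shows only that shape. It is not
Boris's framework, which is not shown in this thread; the names
run_stress and demo are hypothetical, and the toy stressors only bump
counters where real ones would drive fsstress, balance, reflink
threads and so on.

```python
import threading
import time


def run_stress(stressors, validators, duration_s):
    """Run every stressor concurrently, stop them, then run the validators.

    stressors:  dict of name -> callable performing one step of work
    validators: dict of name -> callable returning True when the check passes
    Returns (stressor_errors, failed_validator_names).
    """
    stop = threading.Event()
    errors = []
    lock = threading.Lock()

    def loop(name, step):
        # Run at least one step, then keep going until asked to stop.
        while True:
            try:
                step()
            except Exception as exc:  # an error ends this stressor only
                with lock:
                    errors.append((name, repr(exc)))
                return
            if stop.is_set():
                return

    threads = [threading.Thread(target=loop, args=item) for item in stressors.items()]
    for t in threads:
        t.start()
    time.sleep(duration_s)
    stop.set()
    for t in threads:
        t.join()

    failed = [name for name, check in validators.items() if not check()]
    return errors, failed


def demo():
    """Toy stand-ins for real stressors and validators."""
    counts = {"writer": 0, "snapshotter": 0}

    def bump(key):
        def step():
            counts[key] += 1
            time.sleep(0.001)
        return step

    stressors = {key: bump(key) for key in counts}
    validators = {
        "every_stressor_ran": lambda: all(n > 0 for n in counts.values()),
        "always_fails": lambda: False,  # shows how a failed check is reported
    }
    return run_stress(stressors, validators, duration_s=0.05)


if __name__ == "__main__":
    print(demo())
```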
* Re: [LSF/MM/BPF TOPIC] Long Duration Stress Testing Filesystems
2025-02-04 22:09 ` Darrick J. Wong
@ 2025-02-05 4:38 ` Dave Chinner
0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2025-02-05 4:38 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Boris Burkov, lsf-pc, linux-fsdevel

On Tue, Feb 04, 2025 at 02:09:39PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 04, 2025 at 11:38:45AM -0800, Boris Burkov wrote:
> > On Mon, Feb 03, 2025 at 11:53:43AM -0800, Darrick J. Wong wrote:
> > > On Mon, Feb 03, 2025 at 10:55:19AM -0800, Boris Burkov wrote:
> > Which kind of gets back to what I was getting at in the first place. I
> > don't know enough about xfs to fully grok what the various
> > configurations do to the test (I imagine they enable various features
> > you want to validate under the soak), but I imagine there are still more
> > nasty things to do to the system in parallel.
>
> Probably, but we've never really dug into that. Dave might get there
> with check-parallel but I don't have 64p systems to spare right now.
>
> As for configurations -- yeah, that's how we deal with the combinatoric
> explosion of mkfs options. Run a lot of different weird configs in
> parallel with a fleet of VMs. It's too bad that sort of implies that we
> all have to work for cloud vendors.

Well, that's one of the issues I'm addressing with check-parallel.
When a full auto run takes 10 minutes, a single developer can
iterate a significant chunk of the configuration matrix on a single
machine in a few hours with a single check-parallel command.

The functionality is already there to do this - if we define all the
configs that are to be tested via config section definitions,
check-parallel will iterate them all in one go.

That's the way I want to run testing - testing mkfs defaults with
the auto group is a ten minute smoke test that will catch most
regressions in new code. That "full auto" smoke test is now faster
than my typical think-code-build-deploy cycle time. Perfect.

Now running half a dozen common configs (e.g. each of the LTS-kernel
related mkfs defaults) for better coverage becomes a "run it while
I'm at lunch/in a meeting" exercise. It can be done multiple times a
day, and interrupting it to start again with a new build is no
longer a big deal.

End-of-day/overnight testing has a long enough duration (12+ hours)
to exercise /several dozen/ fs configs. That's more than enough
testing to drown a typical developer in things that need analysis
and/or fixing.

All on one local machine built using cheap commodity parts.

The cloud is a lie.

-Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 10+ messages in thread
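[Editorial sketch, not part of the thread] The time budget in the
message above can be checked with a few lines of Python. The figures
(a 10 minute auto run, half a dozen configs, a 12 hour overnight
window) are from the message. The function names are hypothetical,
and the calculation assumes that every config takes about as long as
the mkfs-defaults run and that configs are run back to back, neither
of which the thread states.

```python
def configs_in_window(window_hours, run_minutes):
    """Number of back-to-back runs of run_minutes that fit in window_hours."""
    return int(window_hours * 60 // run_minutes)


def hours_for_configs(n_configs, run_minutes):
    """Wall-clock hours needed to run n_configs back to back."""
    return n_configs * run_minutes / 60


if __name__ == "__main__":
    # Half a dozen configs at 10 minutes each: a lunch-break sized job.
    print(hours_for_configs(6, 10))
    # A 12 hour overnight window at 10 minutes per config.
    print(configs_in_window(12, 10))
```

Under those assumptions six configs take about an hour and a 12 hour
window fits roughly 72 runs, which is in line with the "several dozen"
estimate.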