From: Dave Chinner <david@fromorbit.com>
To: Theodore Ts'o <tytso@mit.edu>
Cc: "Darrick J. Wong" <djwong@kernel.org>,
zlang@redhat.com, hch@lst.de, fstests@vger.kernel.org,
linux-xfs@vger.kernel.org
Subject: Re: [PATCH 13/23] generic/650: revert SOAK DURATION changes
Date: Wed, 22 Jan 2025 17:01:47 +1100
Message-ID: <Z5CJy195Fh36NNHN@dread.disaster.area>
In-Reply-To: <20250122040839.GD3761769@mit.edu>
On Tue, Jan 21, 2025 at 11:08:39PM -0500, Theodore Ts'o wrote:
> On Wed, Jan 22, 2025 at 09:15:48AM +1100, Dave Chinner wrote:
> > check-parallel on my 64p machine runs the full auto group test in
> > under 10 minutes.
> >
> > i.e. if you have a typical modern server (64-128p, 256GB RAM and a
> > couple of NVMe SSDs), then check-parallel allows a full test run in
> > the same time that './check -g smoketest' will run....
>
> Interesting. I would have thought that even with NVMe SSD's, you'd be
> I/O speed constrained, especially given that some of the tests
> (especially the ENOSPC hitters) can take quite a lot of time to fill
> the storage device, even if they are using fallocate.
You haven't looked at how check-parallel works, have you? :/
> How do you have your test and scratch devices configured?
Please go and read the check-parallel script. It does all the
per-runner process test and scratch device configuration itself
using loop devices.
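(Not check-parallel's actual code, but a minimal sketch of the technique it uses: each runner builds its own private, loop-backed test and scratch devices from sparse files, so dozens of runners can share a couple of fast NVMe drives. Attaching the loop devices and running mkfs require root, so those steps are shown as comments; the paths and sizes are illustrative assumptions.)

```shell
# Each runner gets its own directory of sparse backing files.
runner_dir=$(mktemp -d)
truncate -s 8G "$runner_dir/test.img" "$runner_dir/scratch.img"

# As root, a runner would then attach loop devices and mkfs them,
# something like:
#   TEST_DEV=$(losetup -f --show "$runner_dir/test.img")
#   SCRATCH_DEV=$(losetup -f --show "$runner_dir/scratch.img")
#   mkfs.xfs -q -f "$TEST_DEV"
# ...and point its own fstests config at $TEST_DEV/$SCRATCH_DEV.

# Sparse files: 8 GiB apparent size, almost nothing on disk until
# the tests actually write data.
stat -c '%n %s' "$runner_dir/test.img" "$runner_dir/scratch.img"
```

Because the backing files are sparse, even ENOSPC tests only consume real storage for the blocks they actually dirty.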
> > Yes, and I've previously made the point about how check-parallel
> > changes the way we should be looking at dev-test cycles. We no
> > longer have to care that auto group testing takes 4 hours to run and
> > have to work around that with things like smoketest groups. If you
> > can run the whole auto test group in 10-15 minutes, then we don't
> > need "quick", "smoketest", etc to reduce dev-test cycle time
> > anymore...
>
> Well, yes, if the only consideration is test run time latency.
Sure.
> I can think of two offsetting considerations. The first is if you
> care about cost.
Which I really don't care about.
That's something for a QE organisation to worry about, and it's up
to them to make the best use of the tools they have within the
budget they have.
> The second concern is that for certain class of failures (UBSAN,
> KCSAN, Lockdep, RCU soft lockups, WARN_ON, BUG_ON, and other
> panics/OOPS), if you are running 64 tests in parallel it might not be
> obvious which test caused the failure.
Then multiple tests will fail with the same dmesg error, but it's
generally pretty clear which of the tests caused it. Yes, it's a bit
more work to isolate the specific test, but it's not hard or any
different to how a test failure is debugged now.
If you want to automate triage of such failures, then my process is
to grep the log files for all the tests that failed with a dmesg
error, then run them again using check instead of check-parallel.
Then I get exactly which test generated the dmesg output without
having to put time into working out which test triggered the
failure.
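That triage step can be sketched in a few lines of shell. fstests saves the offending dmesg excerpt as `<test>.dmesg` in the results directory when a test trips a kernel warning, so collecting those files yields the suspect list to re-run serially. The results layout below is faked up with a temp directory purely for illustration; point RESULTS_DIR at a real check-parallel results tree instead.

```shell
# Stand-in for a real results tree (illustration only).
RESULTS_DIR=$(mktemp -d)
mkdir -p "$RESULTS_DIR/generic" "$RESULTS_DIR/xfs"
touch "$RESULTS_DIR/generic/650.dmesg" "$RESULTS_DIR/xfs/014.dmesg"

# Every *.dmesg file names a test that hit a dmesg error; strip the
# results prefix and the .dmesg suffix to recover the test names.
suspects=$(find "$RESULTS_DIR" -name '*.dmesg' \
	| sed -e "s|^$RESULTS_DIR/||" -e 's|\.dmesg$||' | sort)
echo "$suspects"

# Then, as described above, re-run just those tests serially:
#   ./check $suspects
```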
> Today, even if the test VM
> crashes or hangs, I can have the test manager (which runs on an
> e2-small VM costing $0.021913 USD/hour and can manage dozens of test
> VMs all at the same time) restart the test VM, and we know which test
> is at fault, and we mark that particular test with the JUnit XML status of
> "error" (as distinct from "success" or "failure"). If there are 64
> test runs in parallel, if I wanted to have automated recovery if the
> test appliance hangs or crashes, life gets a lot more complicated.....
Not really. Both dmesg and the results files will have tracked all
the tests in flight when the system crashed, so it's just an extra
step to extract all those tests and run them again using check
and/or check-parallel to further isolate which test caused the
failure....
I'm sure this could be automated eventually, but that's way down my
priority list right now.
> I suppose we could have the human (or test automation) try running
> each individual test that had been running at the time of the crash,
> but that's a lot more complicated, and what if the tests pass when
> run one at a time? I guess we should be happy that check-parallel
> found a bug that plain check didn't find, but the human being still
> has to root-cause the failure.
Yes. This is no different to a test that is flaky or completely
fails when run serially by check multiple times. You still need a
human to find the root cause of the failure.
Nobody is being forced to change their tooling or processes to use
check-parallel if they don't want or need to. It is an alternative
method for running the tests within the fstests suite - if using
check meets your needs, there is no reason to use check-parallel or
even care that it exists...
-Dave.
--
Dave Chinner
david@fromorbit.com