From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>,
zlang@redhat.com, hch@lst.de, fstests@vger.kernel.org,
linux-xfs@vger.kernel.org
Subject: Re: [PATCH 13/23] generic/650: revert SOAK DURATION changes
Date: Tue, 21 Jan 2025 23:02:12 -0800 [thread overview]
Message-ID: <20250122070212.GC1611770@frogsfrogsfrogs> (raw)
In-Reply-To: <Z5CJy195Fh36NNHN@dread.disaster.area>
On Wed, Jan 22, 2025 at 05:01:47PM +1100, Dave Chinner wrote:
> On Tue, Jan 21, 2025 at 11:08:39PM -0500, Theodore Ts'o wrote:
> > On Wed, Jan 22, 2025 at 09:15:48AM +1100, Dave Chinner wrote:
> > > check-parallel on my 64p machine runs the full auto group test in
> > > under 10 minutes.
> > >
> > > i.e. if you have a typical modern server (64-128p, 256GB RAM and a
> > > couple of NVMe SSDs), then check-parallel allows a full test run in
> > > the same time that './check -g smoketest' will run....
> >
> > Interesting. I would have thought that even with NVMe SSD's, you'd be
> > I/O speed constrained, especially given that some of the tests
> > (especially the ENOSPC hitters) can take quite a lot of time to fill
> > the storage device, even if they are using fallocate.
>
> You haven't looked at how check-parallel works, have you? :/
>
> > How do you have your test and scratch devices configured?
>
> Please go and read the check-parallel script. It does all the
> per-runner process test and scratch device configuration itself
> using loop devices.
>
> > > Yes, and I've previously made the point about how check-parallel
> > > changes the way we should be looking at dev-test cycles. We no
> > > longer have to care that auto group testing takes 4 hours to run and
> > > have to work around that with things like smoketest groups. If you
> > > can run the whole auto test group in 10-15 minutes, then we don't
> > > need "quick", "smoketest", etc to reduce dev-test cycle time
> > > anymore...
> >
> > Well, yes, if the only consideration is test run time latency.
>
> Sure.
>
> > I can think of two off-setting considerations. The first is if you
> > care about cost.
>
> Which I really don't care about.
>
> That's something for a QE organisation to worry about, and it's up
> to them to make the best use of the tools they have within the
> budget they have.
>
> > The second concern is that for certain class of failures (UBSAN,
> > KCSAN, Lockdep, RCU soft lockups, WARN_ON, BUG_ON, and other
> > panics/OOPS), if you are runnig 64 tests in parllel it might not be
> > obvious which test caused the failure.
>
> Then multiple tests will fail with the same dmesg error, but it's
> generally pretty clear which of the tests caused it. Yes, it's a bit
> more work to isolate the specific test, but it's not hard or any
> different to how a test failure is debugged now.
>
> If you want to automate such failures, then my process is to grep
> the log files for all the tests that failed with a dmesg error then
> run them again using check instead of check-parallel. Then I get
> exactly which test generated the dmesg output without having to put
> time into trying to work out which test triggered the failure.
>
> > Today, even if the test VM
> > crashes or hangs, I can have test manager (which runs on a e2-small VM
> > costing $0.021913 USD/hour and can manage dozens of test VM's all at the
> > same time), can restart the test VM, and we know which test is at at
> > fault, and we mark that a particular test with the Junit XML status of
> > "error" (as distinct from "success" or "failure"). If there are 64
> > test runs in parallel, if I wanted to have automated recovery if the
> > test appliance hangs or crashes, life gets a lot more complicated.....
>
> Not really. Both dmesg and the results files will have tracked all
> the tests inflight when the system crashes, so it's just an extra
> step to extract all those tests and run them again using check
> and/or check-parallel to further isolate which test caused the
> failure....
That reminds me to go see if ./check actually fsyncs the state and
report files and whatnot between tests, so that we have a better chance
of figuring out where exactly fstests blew up the machine.
(Luckily xfs is stable enough I haven't had a machine explode in quite
some time, good job everyone! :))
--D
> I'm sure this could be automated eventually, but that's way down my
> priority list right now.
>
> > I suppose we could have the human (or test automation) try run each
> > individual test that had been running at the time of the crash but
> > that's a lot more complicated, and what if the tests pass when run
> > once at a time? I guess we should happen that check-parallel found a
> > bug that plain check didn't find, but the human being still has to
> > root cause the failure.
>
> Yes. This is no different to a test that is flakey or compeltely
> fails when run serially by check multiple times. You still need a
> human to find the root cause of the failure.
>
> Nobody is being forced to change their tooling or processes to use
> check-parallel if they don't want or need to. It is an alternative
> method for running the tests within the fstests suite - if using
> check meets your needs, there is no reason to use check-parallel or
> even care that it exists...
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
next prev parent reply other threads:[~2025-01-22 7:02 UTC|newest]
Thread overview: 118+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-01-16 23:21 [PATCHBOMB] fstests: fix check-parallel and get all the xfs 6.13 changes merged Darrick J. Wong
2025-01-16 23:23 ` [PATCHSET 1/7] fstests: random fixes for v2025.01.12 Darrick J. Wong
2025-01-16 23:25 ` [PATCH 01/23] generic/476: fix fsstress process management Darrick J. Wong
2025-01-21 3:03 ` Dave Chinner
2025-01-16 23:25 ` [PATCH 02/23] metadump: make non-local function variables more obvious Darrick J. Wong
2025-01-21 3:06 ` Dave Chinner
2025-01-16 23:25 ` [PATCH 03/23] metadump: fix cleanup for v1 metadump testing Darrick J. Wong
2025-01-21 3:07 ` Dave Chinner
2025-01-16 23:26 ` [PATCH 04/23] generic/482: _run_fsstress needs the test filesystem Darrick J. Wong
2025-01-21 3:12 ` Dave Chinner
2025-01-22 3:27 ` Darrick J. Wong
2025-01-16 23:26 ` [PATCH 05/23] generic/019: don't fail if fio crashes while shutting down Darrick J. Wong
2025-01-21 3:12 ` Dave Chinner
2025-01-16 23:26 ` [PATCH 06/23] fuzzy: do not set _FSSTRESS_PID when exercising fsx Darrick J. Wong
2025-01-21 3:13 ` Dave Chinner
2025-01-16 23:27 ` [PATCH 07/23] common/rc: create a wrapper for the su command Darrick J. Wong
2025-01-16 23:27 ` [PATCH 08/23] common: fix pkill by running test program in a separate session Darrick J. Wong
2025-01-21 3:28 ` Dave Chinner
2025-01-22 4:24 ` Darrick J. Wong
2025-01-22 6:08 ` Dave Chinner
2025-01-22 7:05 ` Darrick J. Wong
2025-01-22 9:42 ` Dave Chinner
2025-01-22 21:46 ` Darrick J. Wong
2025-01-23 1:16 ` Dave Chinner
2025-01-28 4:34 ` Dave Chinner
2025-01-28 7:23 ` Darrick J. Wong
2025-01-28 20:39 ` Dave Chinner
2025-01-29 3:13 ` Darrick J. Wong
2025-01-29 6:06 ` Dave Chinner
2025-01-29 7:33 ` Darrick J. Wong
2025-01-16 23:27 ` [PATCH 09/23] unmount: resume logging of stdout and stderr for filtering Darrick J. Wong
2025-01-21 3:52 ` Dave Chinner
2025-01-22 3:29 ` Darrick J. Wong
2025-01-16 23:27 ` [PATCH 10/23] mkfs: don't hardcode log size Darrick J. Wong
2025-01-21 3:58 ` Dave Chinner
2025-01-21 12:44 ` Theodore Ts'o
2025-01-21 22:05 ` Dave Chinner
2025-01-22 3:40 ` Darrick J. Wong
2025-01-22 3:36 ` Darrick J. Wong
2025-01-22 3:30 ` Darrick J. Wong
2025-01-16 23:28 ` [PATCH 11/23] common/xfs: find loop devices for non-blockdevs passed to _prepare_for_eio_shutdown Darrick J. Wong
2025-01-21 4:37 ` Dave Chinner
2025-01-22 4:05 ` Darrick J. Wong
2025-01-22 5:21 ` Dave Chinner
2025-01-16 23:28 ` [PATCH 12/23] preamble: fix missing _kill_fsstress Darrick J. Wong
2025-01-21 4:37 ` Dave Chinner
2025-01-16 23:28 ` [PATCH 13/23] generic/650: revert SOAK DURATION changes Darrick J. Wong
2025-01-21 4:57 ` Dave Chinner
2025-01-21 13:00 ` Theodore Ts'o
2025-01-21 22:15 ` Dave Chinner
2025-01-22 3:51 ` Darrick J. Wong
2025-01-22 4:08 ` Theodore Ts'o
2025-01-22 6:01 ` Dave Chinner
2025-01-22 7:02 ` Darrick J. Wong [this message]
2025-01-22 3:49 ` Darrick J. Wong
2025-01-22 4:12 ` Dave Chinner
2025-01-22 4:37 ` Darrick J. Wong
2025-01-16 23:28 ` [PATCH 14/23] generic/032: fix pinned mount failure Darrick J. Wong
2025-01-21 5:03 ` Dave Chinner
2025-01-22 4:08 ` Darrick J. Wong
2025-01-22 4:19 ` Dave Chinner
2025-01-16 23:29 ` [PATCH 15/23] fuzzy: stop __stress_scrub_fsx_loop if fsx fails Darrick J. Wong
2025-01-16 23:29 ` [PATCH 16/23] fuzzy: don't use readarray for xfsfind output Darrick J. Wong
2025-01-16 23:29 ` [PATCH 17/23] fuzzy: always stop the scrub fsstress loop on error Darrick J. Wong
2025-01-16 23:29 ` [PATCH 18/23] fuzzy: port fsx and fsstress loop to use --duration Darrick J. Wong
2025-01-16 23:30 ` [PATCH 19/23] common/rc: don't copy fsstress to $TEST_DIR Darrick J. Wong
2025-01-21 5:05 ` Dave Chinner
2025-01-22 3:52 ` Darrick J. Wong
2025-01-16 23:30 ` [PATCH 20/23] fix _require_scratch_duperemove ordering Darrick J. Wong
2025-01-16 23:30 ` [PATCH 21/23] fsstress: fix a memory leak Darrick J. Wong
2025-01-16 23:30 ` [PATCH 22/23] fsx: fix leaked log file pointer Darrick J. Wong
2025-01-16 23:31 ` [PATCH 23/23] build: initialize stack variables to zero by default Darrick J. Wong
2025-01-16 23:23 ` [PATCHSET 2/7] fstests: fix logwrites zeroing Darrick J. Wong
2025-01-16 23:31 ` [PATCH 1/3] logwrites: warn if we don't think read after discard returns zeroes Darrick J. Wong
2025-01-16 23:31 ` [PATCH 2/3] logwrites: use BLKZEROOUT if it's available Darrick J. Wong
2025-01-16 23:31 ` [PATCH 3/3] logwrites: only use BLKDISCARD if we know discard zeroes data Darrick J. Wong
2025-01-16 23:24 ` [PATCHSET v6.2 3/7] fstests: enable metadir Darrick J. Wong
2025-01-16 23:32 ` [PATCH 01/11] various: fix finding metadata inode numbers when metadir is enabled Darrick J. Wong
2025-01-16 23:32 ` [PATCH 02/11] xfs/{030,033,178}: forcibly disable metadata directory trees Darrick J. Wong
2025-01-16 23:32 ` [PATCH 03/11] common/repair: patch up repair sb inode value complaints Darrick J. Wong
2025-01-16 23:32 ` [PATCH 04/11] xfs/206: update for metadata directory support Darrick J. Wong
2025-01-16 23:33 ` [PATCH 05/11] xfs/{050,144,153,299,330}: update quota reports to handle metadir trees Darrick J. Wong
2025-01-16 23:33 ` [PATCH 06/11] xfs/509: adjust inumbers accounting for metadata directories Darrick J. Wong
2025-01-16 23:33 ` [PATCH 07/11] xfs: create fuzz tests " Darrick J. Wong
2025-01-16 23:34 ` [PATCH 08/11] xfs/163: bigger fs for metadir Darrick J. Wong
2025-01-16 23:34 ` [PATCH 09/11] xfs/122: disable this test for any codebase that knows about metadir Darrick J. Wong
2025-01-16 23:34 ` [PATCH 10/11] scrub: race metapath online fsck with fsstress Darrick J. Wong
2025-01-16 23:34 ` [PATCH 11/11] xfs: test metapath repairs Darrick J. Wong
2025-01-16 23:24 ` [PATCHSET v6.2 4/7] fstests: make protofiles less janky Darrick J. Wong
2025-01-16 23:35 ` [PATCH 1/1] fstests: test mkfs.xfs protofiles with xattr support Darrick J. Wong
2025-03-02 13:15 ` Zorro Lang
2025-03-04 17:42 ` Darrick J. Wong
2025-03-04 18:00 ` Zorro Lang
2025-03-04 18:21 ` Darrick J. Wong
2025-01-16 23:24 ` [PATCHSET v6.2 5/7] fstests: shard the realtime section Darrick J. Wong
2025-01-16 23:35 ` [PATCH 01/14] common/populate: refactor caching of metadumps to a helper Darrick J. Wong
2025-01-16 23:35 ` [PATCH 02/14] common/{fuzzy,populate}: use _scratch_xfs_mdrestore Darrick J. Wong
2025-01-16 23:35 ` [PATCH 03/14] fuzzy: stress data and rt sections of xfs filesystems equally Darrick J. Wong
2025-01-16 23:36 ` [PATCH 04/14] common/ext4: reformat external logs during mdrestore operations Darrick J. Wong
2025-01-16 23:36 ` [PATCH 05/14] common/populate: use metadump v2 format by default for fs metadata snapshots Darrick J. Wong
2025-01-16 23:36 ` [PATCH 06/14] punch-alternating: detect xfs realtime files with large allocation units Darrick J. Wong
2025-01-16 23:36 ` [PATCH 07/14] xfs/206: update mkfs filtering for rt groups feature Darrick J. Wong
2025-01-16 23:37 ` [PATCH 08/14] common: pass the realtime device to xfs_db when possible Darrick J. Wong
2025-01-16 23:37 ` [PATCH 09/14] xfs/185: update for rtgroups Darrick J. Wong
2025-01-16 23:37 ` [PATCH 10/14] xfs/449: update test to know about xfs_db -R Darrick J. Wong
2025-01-16 23:37 ` [PATCH 11/14] xfs/271,xfs/556: fix tests to deal with rtgroups output in bmap/fsmap commands Darrick J. Wong
2025-01-16 23:38 ` [PATCH 12/14] common/xfs: capture realtime devices during metadump/mdrestore Darrick J. Wong
2025-01-16 23:38 ` [PATCH 13/14] common/fuzzy: adapt the scrub stress tests to support rtgroups Darrick J. Wong
2025-01-16 23:38 ` [PATCH 14/14] xfs: fix fuzz tests of rtgroups bitmap and summary files Darrick J. Wong
2025-01-16 23:24 ` [PATCHSET v6.2 6/7] fstests: store quota files in the metadir Darrick J. Wong
2025-01-16 23:38 ` [PATCH 1/4] xfs: update tests for " Darrick J. Wong
2025-01-16 23:39 ` [PATCH 2/4] xfs: test persistent quota flags Darrick J. Wong
2025-01-16 23:39 ` [PATCH 3/4] xfs: fix quota detection in fuzz tests Darrick J. Wong
2025-01-16 23:39 ` [PATCH 4/4] xfs: fix tests for persistent qflags Darrick J. Wong
2025-01-16 23:25 ` [PATCHSET v6.2 7/7] fstests: enable quota for realtime volumes Darrick J. Wong
2025-01-16 23:40 ` [PATCH 1/3] common: enable testing of realtime quota when supported Darrick J. Wong
2025-01-16 23:40 ` [PATCH 2/3] xfs: fix quota tests to adapt to realtime quota Darrick J. Wong
2025-01-16 23:40 ` [PATCH 3/3] xfs: regression testing of quota on the realtime device Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250122070212.GC1611770@frogsfrogsfrogs \
--to=djwong@kernel.org \
--cc=david@fromorbit.com \
--cc=fstests@vger.kernel.org \
--cc=hch@lst.de \
--cc=linux-xfs@vger.kernel.org \
--cc=tytso@mit.edu \
--cc=zlang@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox