Date: Tue, 21 Jan 2025 23:02:12 -0800
From: "Darrick J. Wong"
To: Dave Chinner
Cc: Theodore Ts'o, zlang@redhat.com, hch@lst.de, fstests@vger.kernel.org,
 linux-xfs@vger.kernel.org
Subject: Re: [PATCH 13/23] generic/650: revert SOAK DURATION changes
Message-ID: <20250122070212.GC1611770@frogsfrogsfrogs>
References: <173706974044.1927324.7824600141282028094.stgit@frogsfrogsfrogs>
 <173706974273.1927324.11899201065662863518.stgit@frogsfrogsfrogs>
 <20250121130027.GB3809348@mit.edu>
 <20250122040839.GD3761769@mit.edu>

On Wed, Jan 22, 2025 at 05:01:47PM +1100, Dave Chinner wrote:
> On Tue, Jan 21, 2025 at 11:08:39PM -0500, Theodore Ts'o wrote:
> > On Wed, Jan 22, 2025 at 09:15:48AM +1100, Dave Chinner wrote:
> > > check-parallel on my 64p machine runs the full auto group test in
> > > under 10 minutes.
> > >
> > > i.e. if you have a typical modern server (64-128p, 256GB RAM and a
> > > couple of NVMe SSDs), then check-parallel allows a full test run in
> > > the same time that './check -g smoketest' will run....
> >
> > Interesting.  I would have thought that even with NVMe SSDs, you'd
> > be I/O speed constrained, especially given that some of the tests
> > (especially the ENOSPC hitters) can take quite a lot of time to
> > fill the storage device, even if they are using fallocate.
>
> You haven't looked at how check-parallel works, have you? :/
>
> > How do you have your test and scratch devices configured?
>
> Please go and read the check-parallel script. It does all the
> per-runner process test and scratch device configuration itself
> using loop devices.
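(For anyone reading this in the archive: "configures itself using
loop devices" means every runner gets its own backing files and loop
devices, so runners don't share test or scratch devices.  An untested
sketch of the idea -- the paths, sizes, and the sparse-file detail
are my guesses, the script itself is the authority:)

    # roughly what check-parallel sets up for each runner $i
    mkdir -p /mnt/test$i /mnt/scratch$i
    truncate -s 8G /tmp/test$i.img /tmp/scratch$i.img  # sparse files
    TEST_DEV=$(losetup -f --show /tmp/test$i.img)      # attach loopdevs
    SCRATCH_DEV=$(losetup -f --show /tmp/scratch$i.img)
    mkfs.xfs -f $TEST_DEV            # fstests wants TEST_DEV premade
    mount $TEST_DEV /mnt/test$i      # ...and mounted at TEST_DIR

If the backing files really are sparse, that would also explain why
the ENOSPC hitters don't turn into a ton of real I/O -- they only
cost as many blocks as they actually write.)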
> > > Yes, and I've previously made the point about how check-parallel
> > > changes the way we should be looking at dev-test cycles. We no
> > > longer have to care that auto group testing takes 4 hours to run
> > > and have to work around that with things like smoketest groups.
> > > If you can run the whole auto test group in 10-15 minutes, then
> > > we don't need "quick", "smoketest", etc to reduce dev-test cycle
> > > time anymore...
> >
> > Well, yes, if the only consideration is test run time latency.
>
> Sure.
>
> > I can think of two off-setting considerations. The first is if you
> > care about cost.
>
> Which I really don't care about.
>
> That's something for a QE organisation to worry about, and it's up
> to them to make the best use of the tools they have within the
> budget they have.
>
> > The second concern is that for a certain class of failures (UBSAN,
> > KCSAN, Lockdep, RCU soft lockups, WARN_ON, BUG_ON, and other
> > panics/OOPS), if you are running 64 tests in parallel it might not
> > be obvious which test caused the failure.
>
> Then multiple tests will fail with the same dmesg error, but it's
> generally pretty clear which of the tests caused it. Yes, it's a bit
> more work to isolate the specific test, but it's not hard or any
> different to how a test failure is debugged now.
>
> If you want to automate such failures, then my process is to grep
> the log files for all the tests that failed with a dmesg error then
> run them again using check instead of check-parallel. Then I get
> exactly which test generated the dmesg output without having to put
> time into trying to work out which test triggered the failure.
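(That grep-then-rerun step is easy enough to script.  A hypothetical
sketch, assuming the stock results/<dir>/<seq>.dmesg layout -- adjust
for wherever your runners actually put their result files:)

    # gather every test that tripped a dmesg check, then re-run them
    # serially so each failure maps back to exactly one test
    failed=$(find results/ -name '*.dmesg' |
        sed -e 's|^results/||' -e 's|\.dmesg$||' | sort -u)
    ./check $failed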
> > Today, even if the test VM crashes or hangs, I can have the test
> > manager (which runs on an e2-small VM costing $0.021913 USD/hour
> > and can manage dozens of test VMs at the same time) restart the
> > test VM, and we know which test is at fault, and we mark that
> > particular test with the JUnit XML status of "error" (as distinct
> > from "success" or "failure"). If there are 64 test runs in
> > parallel and I want automated recovery when the test appliance
> > hangs or crashes, life gets a lot more complicated.....
>
> Not really. Both dmesg and the results files will have tracked all
> the tests in flight when the system crashes, so it's just an extra
> step to extract all those tests and run them again using check
> and/or check-parallel to further isolate which test caused the
> failure....

That reminds me to go see if ./check actually fsyncs the state and
report files and whatnot between tests, so that we have a better
chance of figuring out where exactly fstests blew up the machine.
(Luckily xfs is stable enough that I haven't had a machine explode in
quite some time, good job everyone! :))

--D

> I'm sure this could be automated eventually, but that's way down my
> priority list right now.
>
> > I suppose we could have the human (or test automation) try to run
> > each individual test that had been running at the time of the
> > crash, but that's a lot more complicated, and what if the tests
> > pass when run one at a time? I guess we should be happy that
> > check-parallel found a bug that plain check didn't find, but the
> > human being still has to root cause the failure.
>
> Yes. This is no different to a test that is flakey or that fails
> completely when run serially by check multiple times. You still
> need a human to find the root cause of the failure.
>
> Nobody is being forced to change their tooling or processes to use
> check-parallel if they don't want or need to. It is an alternative
> method for running the tests within the fstests suite - if using
> check meets your needs, there is no reason to use check-parallel or
> even care that it exists...
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com