Date: Mon, 19 Dec 2022 22:16:43 -0500
From: "Theodore Ts'o"
To: "Darrick J. Wong"
Cc: zlang@redhat.com, linux-xfs@vger.kernel.org, fstests@vger.kernel.org,
    guan@eryu.me, leah.rumancik@gmail.com, quwenruo.btrfs@gmx.com
Subject: Re: [PATCH 1/8] check: generate section reports between tests
In-Reply-To: <167149446946.332657.17186597494532662986.stgit@magnolia>

On Mon, Dec 19, 2022 at 04:01:09PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong
>
> Generate the section report between tests so that the summary report
> always reflects the outcome of the most recent test.  Two usecases are
> envisioned here -- if a cluster-based test runner anticipates that the
> testrun could crash the VM, they can set REPORT_DIR to (say) an NFS
> mount to preserve the intermediate results.  If the VM does indeed
> crash, the scheduler can examine the state of the crashed VM and move
> the tests to another VM.  The second usecase is a reporting agent that
> runs in the VM to upload live results to a test dashboard.

Leah has been working on adding crash recovery for gce-xfstests.
It'll be interesting to see how her work dovetails with your patches.
The basic design we've worked out works by having the test framework
recognize whether the VM had previously been running tests.  We keep
track of the last test that was run by hooking into $LOGGER_PROG.  We
then use a Python script[1] to append to the xunit file a test result
for the test that was running at the time of the crash, set that
result to "error", and then resume running tests from where we left
off.

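To make the recovery step concrete, here is a minimal sketch of
appending an "error" result to an xunit report.  This is not the
add_error_xunit script referenced in [1]; the function name
add_crash_error, the error message text, and the assumption that the
report is a single <testsuite> element carrying "tests" and "errors"
counters are all illustrative.

```python
import xml.etree.ElementTree as ET


def add_crash_error(xunit_xml: str, test_name: str) -> str:
    """Return xunit_xml with an errored <testcase> appended for test_name.

    Hypothetical helper: assumes the root element is <testsuite> and
    that it may carry "tests" and "errors" summary counters.
    """
    suite = ET.fromstring(xunit_xml)

    # Record the test that was running when the VM went down.
    case = ET.SubElement(suite, "testcase", name=test_name)
    ET.SubElement(case, "error",
                  message="VM crashed or was rebooted during this test",
                  type="crash")

    # Keep the summary counters consistent with the new child element.
    for attr in ("tests", "errors"):
        suite.set(attr, str(int(suite.get(attr, "0")) + 1))

    return ET.tostring(suite, encoding="unicode")


# Usage: generic/002 was the last test logged before the crash.
before = ('<testsuite name="xfstests" tests="1" errors="0">'
          '<testcase name="generic/001"/></testsuite>')
after = add_crash_error(before, "generic/002")
```

The test name passed in would come from whatever was last recorded
through the $LOGGER_PROG hook; how that is persisted across the reboot
is left out of the sketch.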
[1] https://github.com/lrumancik/xfstests-bld/blob/ltm-auto-resume-new/test-appliance/files/usr/local/bin/add_error_xunit

To deal with cases where the kernel has deadlocked, when the test VM
is launched by the LTM server, the LTM server will monitor the test
VM.  If the LTM server notices that the test VM has failed to make
forward progress within a set time, it will force the test VM to
reboot, at which point the recovery process described above kicks in.

Eventually, we'll have the LTM server examine the serial console of
the test VM, looking for indications of kernel panics and RCU / soft
lockup warnings, so we can more quickly force a reboot when the system
under test is clearly unhappy.

The advantage of this design is that it doesn't require using NFS to
store the results, and in theory we don't even need to use a separate
monitoring VM; we could just use software and kernel watchdogs to
notice when the tests have stopped making forward progress.

						- Ted

P.S.  We're not using section reporting since we generally launch
separate VMs for each "section" so we can speed up the test run time
by sharding across those VMs.  And then we have the LTM server merge
the results together into a single test run report.
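As an illustration of the forward-progress check mentioned earlier in
this message, here is a rough sketch of the decision logic.  It is not
the LTM server's actual code; the class name ProgressMonitor, the
observe() method, and the idea of polling the name of the currently
running test are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProgressMonitor:
    """Hypothetical forward-progress tracker for one test VM."""
    timeout_s: float                      # allowed time without progress
    last_test: Optional[str] = None       # test seen on the previous poll
    last_change: Optional[float] = None   # time the test name last changed

    def observe(self, current_test: Optional[str], now: float) -> bool:
        """Record one poll; return True if the VM should be force-rebooted."""
        if self.last_change is None or current_test != self.last_test:
            # First poll, or the VM moved on to a new test: that is progress.
            self.last_test = current_test
            self.last_change = now
            return False
        # Same test as before: reboot once it has been stuck too long.
        return now - self.last_change >= self.timeout_s


# Usage: poll timestamps are in seconds, with a five-minute timeout.
mon = ProgressMonitor(timeout_s=300)
mon.observe("generic/001", 100.0)   # first sighting, no reboot
mon.observe("generic/001", 400.0)   # stuck for 300s, reboot requested
```

A single timeout is the simplest possible policy; a real monitor would
probably need per-test allowances, since some tests legitimately run
much longer than others.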