[PATCH 15/28] check-parallel: de-batch test execution

From: Dave Chinner <david@fromorbit.com>
To: fstests@vger.kernel.org
Cc: zlang@kernel.org
Subject: [PATCH 15/28] check-parallel: de-batch test execution
Date: Thu, 17 Apr 2025 13:00:56 +1000
Message-ID: <20250417031208.1852171-16-david@fromorbit.com>
In-Reply-To: <20250417031208.1852171-1-david@fromorbit.com>

From: Dave Chinner <dchinner@redhat.com>

To improve how check-parallel runs tests, it needs to run tests
directly from the runner threads. We currently batch them based on
runtime before we execute any tests, but this results in runner 0
always having a test list with runtime longer than the test list for
runner N.

As a result, we can end up with higher numbered runners finishing
all their tests before runner 0 has even finished the first test it
was given to run. Hence we end up with check-parallel starting with
maximum concurrency, but the test concurrency reduces as the run
goes on.
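
The imbalance can be seen with a small worked example. This is only a
sketch: the runtimes below are made-up numbers, and the loop mirrors the
"ix % runners" round-robin split used by the batching code rather than
reproducing it exactly.

```shell
#!/bin/bash
# Hypothetical test runtimes in seconds, already sorted longest first.
runtimes=(900 600 300 120 60 30 10 5)
runners=4

# Round-robin split: test ix is handed to runner (ix % runners), so
# runner 0 always receives the longest test of each round.
declare -a batch_total
for ((ix = 0; ix < ${#runtimes[@]}; ix++)); do
	rx=$((ix % runners))
	batch_total[$rx]=$(( ${batch_total[$rx]:-0} + runtimes[ix] ))
done

# Print the total runtime each runner was given up front.
for ((rx = 0; rx < runners; rx++)); do
	echo "runner $rx: ${batch_total[$rx]}s"
done
```

With these numbers runner 3 has 125s of work in total, so it goes idle
long before runner 0 has completed its first 900s test.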

To fix this, we need a dynamic test list such that each runner only
needs to be scheduled to run a single test at a time. When they have
finished the current test, they can pop the next test to run off the
time ordered stack and execute that. Test runners therefore won't stop
running until there are no more tests to run, which maximises
concurrency across the entire test run.

To do this, we first need a test list mechanism that is safe for
concurrent destacking from multiple test runners. We place the
test list in a temporary file, then use file locks to serialise
access to the temporary file.
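
A minimal sketch of that locking scheme, assuming flock(1) from util-linux
and GNU sed are available. The file names, the fd number and the
pop_test/runner helpers are hypothetical and exist only for this
illustration; the runners here record the test name instead of running it.

```shell
#!/bin/bash
list=$(mktemp)		# shared test list, one test per line
lock="$list.lock"	# separate lock file, so rewriting the list is safe
out=$(mktemp)		# records which tests were consumed
printf '%s\n' t1 t2 t3 t4 t5 t6 > "$list"

# Remove and print the last line of the list while holding the lock.
# Prints an empty string once the list is exhausted.
pop_test()
{
	local t
	flock 9
	t=$(tail -n 1 "$list")
	if [ -n "$t" ]; then
		sed -i '$d' "$list"
	fi
	flock -u 9
	echo "$t"
}

# Each runner opens its own descriptor on the lock file, then pops
# tests one at a time until nothing is left.
runner()
{
	local t
	exec 9<>"$lock"
	t=$(pop_test)
	while [ -n "$t" ]; do
		echo "$t" >> "$out"
		t=$(pop_test)
	done
}

for i in 0 1 2; do
	runner $i &
done
wait

consumed=$(sort "$out" | paste -sd' ' -)
echo "consumed: $consumed"
rm -f "$list" "$lock" "$out"
```

Every test should appear exactly once in the output regardless of which
runner popped it, because the read and the delete happen under the same
lock.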

We order the list in the test file from lowest runtime to
highest. This means that running tests from longest to shortest
runtime destacks from the end of the file. The next test to run is
therefore always the last line of the file, and we can simply
use truncation based mechanisms to consume the test during
destacking.
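
As an illustration of the ordering, here is a sketch that consumes the
list by truncation. It assumes GNU truncate(1) and ASCII test names; the
test names, runtimes and the pop_longest helper are invented for the
example, and locking is left out to keep the focus on the ordering.

```shell
#!/bin/bash
list=$(mktemp)

# Hypothetical "name seconds" pairs, sorted from lowest to highest
# runtime so that the longest test ends up on the last line.
printf '%s\n' "generic/003 40" "generic/001 5" "generic/002 900" |
	sort -n -k2 | cut -d' ' -f1 > "$list"

# Pop the last line by shrinking the file by that line's length plus
# its trailing newline. Returns non-zero when the list is empty.
pop_longest()
{
	local t
	t=$(tail -n 1 "$list")
	[ -z "$t" ] && return 1
	truncate -s "-$(( ${#t} + 1 ))" "$list"
	echo "$t"
}

order=()
while t=$(pop_longest); do
	order+=("$t")
done
echo "run order: ${order[*]}"
rm -f "$list"
```

Only the tail of the file is ever touched, so consuming a test never has
to rewrite the entries that are still queued.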

Running tests individually via check like this is inefficient as
there is a lot of check setup and initialisation overhead.  However,
by increasing the utilisation of the test runner threads, overall
runtime of check-parallel does not increase with this change.
Reduction of this repeated overhead will also be addressed in future
patches.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 check-parallel | 75 +++++++++++++++++++++++++++++---------------------
 1 file changed, 43 insertions(+), 32 deletions(-)

diff --git a/check-parallel b/check-parallel
index 6fc86fb92..e2cf2c8d0 100755
--- a/check-parallel
+++ b/check-parallel
@@ -18,6 +18,7 @@ run_section=""
 iam="check-parallel"
 
 tmp=/tmp/check-parallel.$$
+test_list="$tmp.test_list"
 
 . ./common/exit
 . ./common/test_names
@@ -150,9 +151,6 @@ if [ -d "$basedir/runner-0/" ]; then
 	prev_results=`ls -tr $basedir/runner-0/ | grep results | tail -1`
 fi
 
-_tl_prepare_test_list
-_tl_strip_test_list
-
 # grab all previously run tests and order them from highest runtime to lowest
 # We are going to try to run the longer tests first, hopefully so we can avoid
 # massive thundering herds trying to run lots of really short tests in parallel
@@ -198,22 +196,22 @@ if ! $_tl_randomise -a ! $_tl_exact_order; then
 	fi
 fi
 
-# split the list amongst N runners
-split_runner_list()
+# Grab the next test to be run from the tail of the file.
+# Returns an empty string if there are no tests remaining to run.
+# File operations are run under flock so concurrent gets are serialised against
+# each other.
+get_next_test()
 {
-	local ix
-	local rx
-	local -a _list=( $_tl_tests )
-	for ((ix = 0; ix < ${#_list[*]}; ix++)); do
-		seq="${_list[$ix]}"
-		rx=$((ix % $runners))
-		if ! _tl_expunge_test $seq; then
-			runner_list[$rx]+="${_list[$ix]} "
-		fi
-		#echo $seq
-	done
+	local test=
+
+	flock 99
+	test=$(tail -1 $test_list)
+	sed -i "\,$test,d" $test_list
+	flock -u 99
+	echo $test
 }
 
+
 _create_loop_device()
 {
         local file=$1 dev
@@ -240,6 +238,8 @@ _destroy_loop_device()
 
 runner_go()
 {
+	exec 99<>$tmp.test_list_lock
+
 	local id=$1
 	local me=$basedir/runner-$id
 	local _test=$me/test.img
@@ -250,6 +250,7 @@ runner_go()
 	local _scratch_log=$me/scratch-log.img
 	local _logwrites=$me/logwrites.img
 	local _results=$me/results-$2
+	local test_to_run=$(get_next_test)
 
 	mkdir -p $me
 
@@ -291,7 +292,15 @@ runner_go()
 	# Similarly, we need to run check in it's own PID namespace so that
 	# operations like pkill only affect the runner instance, not globally
 	# kill processes from other check instances.
-	tools/run_privatens ./check $run_section -x unreliable_in_parallel --exact-order ${runner_list[$id]} >> $me/log 2>&1
+	while [ -n "$test_to_run" ]; do
+		echo "Runner $id: running test $test_to_run"
+		unset FSTESTS_ISOL
+		if ! _tl_expunge_test $test_to_run; then
+			tools/run_privatens ./check $run_section $test_to_run >> $me/log 2>&1
+		fi
+
+		test_to_run=$(get_next_test)
+	done
 
 	wait
 	sleep 1
@@ -320,20 +329,32 @@ cleanup()
 	umount -R $basedir/*/test 2> /dev/null
 	umount -R $basedir/*/scratch 2> /dev/null
 	losetup --detach-all
+	rm -rf $tmp.*
 }
 
 trap "cleanup; exit" HUP INT QUIT TERM
 
 _config_setup_parallel
 
-split_runner_list
+_tl_setup_exclude_group "unreliable_in_parallel"
+_tl_prepare_test_list
+_tl_strip_test_list
+
+if ! $_tl_randomise -a ! $_tl_exact_order; then
+	if [ -f $basedir/runner-0/$prev_results/check.time ]; then
+		time_order_test_list
+	fi
+fi
+
+# reverse the order of tests so that the get_next_test() can pull from the file
+# tail rather than the head.
+echo $_tl_tests |sed -e 's/ /\n/g' | tac > $test_list
 if [ -n "$show_test_list" ]; then
 	echo Time ordered test list:
-	echo $_tl_tests
-	echo
+	cat $test_list
+	exit 0
 fi
 
-
 # Each parallel test runner needs to only see it's own mount points. If we
 # leave the basedir as shared, then all tests see all mounts and then we get
 # mount propagation issues cropping up. For example, cloning a new mount
@@ -349,20 +370,10 @@ mount --make-private $basedir
 
 now=`date +%Y-%m-%d-%H:%M:%S`
 for ((i = 0; i < $runners; i++)); do
-
-	if [ -n "$show_test_list" ]; then
-		echo "Runner $i: ${runner_list[$i]}"
-	else
-		runner_go $i $now &
-	fi
-
+	runner_go $i $now &
 done;
 wait
 
-if [ -n "$show_test_list" ]; then
-	exit 0
-fi
-
 echo -n "Tests run: "
 grep Ran $basedir/*/log | sed -e 's,^.*:,,' -e 's, ,\n,g' | sort | uniq | wc -l
 
-- 
2.45.2
