Re: [PATCH 3/4] multipath-tools tests: add test program for thread runners

From: Benjamin Marzinski <bmarzins@redhat.com>
To: Martin Wilck <martin.wilck@suse.com>
Cc: Christophe Varoqui <christophe.varoqui@opensvc.com>,
	Brian Bunker <brian@purestorage.com>,
	dm-devel@lists.linux.dev, Martin Wilck <mwilck@suse.com>
Subject: Re: [PATCH 3/4] multipath-tools tests: add test program for thread runners
Date: Tue, 24 Mar 2026 01:47:24 -0400
Message-ID: <acIlbMiDtMkyRoDT@redhat.com>
In-Reply-To: <20260319221344.753790-4-mwilck@suse.com>

On Thu, Mar 19, 2026 at 11:13:43PM +0100, Martin Wilck wrote:
> Add a test program for the "runner" thread implementation from
> the previous commit. The test program runs simulated "hanging" threads that
> may time out, and optionally kills all threads at an arbitrary point in
> time. See the comments at the top of the file for details.
> 
> Also add a test driver script (runner-test.sh) with a few reasonable
> combinations of command line arguments.
> 
> The test program has been used to test the "runner" implementation
> extensively on different architectures (x86_64, aarch64, ppc64le), using
> both valgrind and the gcc address sanitizer (libasan) for detection of
> memory leaks and use-after-free errors.
> 
> For valgrind, a suppression file needs to be added, as valgrind doesn't
> seem to capture the deallocation of thread local storage for detached
> threads in the test case where the test program is killed. The suppression
> affects only memory allocated by glibc. This leak has not been seen with
> libasan, only with valgrind.
> 
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>  tests/Makefile         |  15 +-
>  tests/runner-test.sh   |  37 +++
>  tests/runner-test.supp |  15 ++
>  tests/runner.c         | 530 +++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 593 insertions(+), 4 deletions(-)
>  create mode 100755 tests/runner-test.sh
>  create mode 100644 tests/runner-test.supp
>  create mode 100644 tests/runner.c
> 
> [snip]
>
> diff --git a/tests/runner.c b/tests/runner.c
> new file mode 100644
> index 0000000..21a44ca
> --- /dev/null
> +++ b/tests/runner.c
> @@ -0,0 +1,530 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +/*
> + * Test reliability of the runner implementation.
> + *
> + * This test simulates "path checkers" being started through the
> + * runner code. It creates threads that run for a variable amount of time,
> + * optionally ignoring cancellation signals. The runners have a fixed
> + * timeout (TIMEOUT_USEC, "-t" option), after which they are considered
> + * "hanging" and will be cancelled. The actual runtime of the runner is random.
> + * It varies between (TIMEOUT_USEC - NOISE_USEC) and
> + * (TIMEOUT_USEC + NOISE_BIAS * NOISE_USEC). These noise parameters are
> + * set with the "-n" and "-b" options, respectively.
> + * This allows simulating frequent races between cancellation and regular
> + * completion of threads. Note that because timers in different threads
> + * aren't started simultaneously, NOISE_USEC = 0 doesn't mean that the
> + * runners will complete exactly at the point in time when their timer
> + * expires.
> + * The runners simulate waiting checkers by simply sleeping. Optionally,
> + * they can ignore cancellation while sleeping (IGNORE_CANCEL, -i) or
> + * divide the sleep time in multiple "steps" (-s), between which they can
> + * be cancelled.
> + * Like multipathd, the main thread "polls" the status of the runners in
> + * regular time intervals that are set with POLL_USEC (-p). Because of
> + * this polling behavior, it can happen that a runner finishes after
> + * its timeout has expired. The runner code (and this test) treats this case
> + * as successful completion.
> + * If a runner completes, the result is checked against the expected value.
> + * The number of threads that have not finished (either successfully or
> + * cancelled), plus the number of wrong results of completed runners, is
> + * the error count.
> + * The N_RUNNERS (-N) option determines how many simultaneous threads are
> + * started.
> + * The test runs until all runners have either completed or expired, or
> + * until a maximum wait time is reached, which is calculated from the
> + * test parameters (max_wait in run_test()). The REPEAT (-r) parameter
> + * determines the number of times the entire test is repeated.
> + * The KILL_TIMEOUT (-k) parameter is for simulating a shutdown of the
> + * main program (think multipathd). When this timeout expires, all pending
> + * runners are cancelled and the program terminates.
> + *
> + * A "realistic" simulation of multipathd path checkers would use options
> + * roughly like this:
> + *
> + *   runner-test -N 1000 -p 1000 -t 30000 -n 29990 -b 5 -s 1 -i -r 20 -k 300000
> + *
> + * (note that time options are in ms, whereas the code uses us), but this
> + * takes a very long time to run.
> + *
> + * Scaled down, it becomes:
> + *
> + *   runner-test -N 1000 -p 100 -t 3000 -n 2999 -b 5 -s 1 -i -r 20 -k 30000
> + *
> + * A less realistic run with high likelihood of completion / cancellation races:
> + *
> + *   runner-test -N 1000 -p 10 -t 3000 -n 1 -b 1 -s 1 -i -r 20
> + *
> + * Here, all runners finish in a +-1ms timer interval around the timeout.
> + * Even with -n 0 (no noise), with a sufficient number of runners, some runners
> + * will time out.
> + */
> +
> +#include <time.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <signal.h>
> +#include <pthread.h>
> +#include <stdbool.h>
> +#include <sched.h>
> +#include <errno.h>
> +#include <sys/select.h>
> +#include <sys/wait.h>
> +#include "debug.h"
> +#include "time-util.h"
> +#include "runner.h"
> +#include "runner.h"
> +#include "globals.c"
> +
> +#define MILLION 1000000
> +#define BILLION 1000000000
> +
> +static int N_RUNNERS = 100;
> +/* sleep time between runner status polling */
> +static long POLL_USEC = 10000;
> +/* timeout for runner */
> +static long TIMEOUT_USEC = 100000;
> +/* random noise to subtract / add to the sleep time */
> +static long NOISE_USEC = 10000;
> +/*
> + * Factor to increase noise towards longer sleep times (timeouts).
> + * The actual sleep time will be in the interval
> + * [ TIMEOUT_USEC - NOISE_USEC, TIMEOUT_USEC + NOISE_USEC * NOISE_BIAS ]
> + */
> +static int NOISE_BIAS = 5;
> +/* number of sleep intervals the runner uses */
> +static int SLEEP_STEPS = 1;
> +/* time after which to kill all runners */
> +static long KILL_TIMEOUT = 0;
> +/* number of repeated runs */
> +static int REPEAT = 10;
> +/* whether to ignore cancellation signals */
> +static bool IGNORE_CANCEL = false;
> +
> +/* gap in the payload to simulate larger size */
> +#define PAYLOAD_GAP 128
> +
> +struct payload {
> +	long wait_nsec;
> +	int steps;
> +	bool ignore_cancel;
> +	int start;
> +	char pad[PAYLOAD_GAP];
> +	int end;
> +};
> +
> +static void wait_and_add_1(void *arg)
> +{
> +	struct payload *t1 = arg;
> +	struct timespec wait;
> +	int i, cancelstate;
> +
> +	wait.tv_sec = t1->wait_nsec / BILLION;
> +	wait.tv_nsec = t1->wait_nsec % BILLION;
> +	normalize_timespec(&wait);
> +	for (i = 0; i < t1->steps; i++) {
> +		if (t1->ignore_cancel)
> +			pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &cancelstate);
> +		if (nanosleep(&wait, NULL) != 0 && errno != EINTR)
> +			condlog(3, "%s: nanosleep: %s", __func__, strerror(errno));
> +		if (t1->ignore_cancel) {
> +			pthread_setcancelstate(cancelstate, NULL);
> +		}
> +		pthread_testcancel();
> +	}
> +	t1->end = t1->start + 1;
> +}
> +
> +static bool payload_error(const struct payload *p)
> +{
> +	return p->end != p->start + 1;
> +}
> +
> +static int check_payload(struct runner_context *ctx, bool *error)
> +{
> +	struct timespec now;
> +	struct payload t1;
> +	int st = check_runner(ctx, &t1, sizeof(t1));
> +
> +	if (st == RUNNER_RUNNING)
> +		return st;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &now);
> +	if (st == RUNNER_DONE) {
> +		if (error)
> +			*error = payload_error(&t1);
> +		condlog(3, "runner finished in state 'done' at %lld.%06lld, start %d end %d",
> +			(long long)now.tv_sec, (long long)now.tv_nsec / 1000,
> +			t1.start, t1.end);
> +	} else
> +		condlog(3, "runner finished in state '%s' at %lld.%06lld",
> +			runner_state_name(st), (long long)now.tv_sec,
> +			(long long)now.tv_nsec / 1000);
> +	return st;
> +}
> +
> +static struct runner_context *
> +start_runner(long usecs, int start, long noise_range_usec, long bias,
> +	     int steps, bool ignore_cancel)
> +{
> +
> +	struct payload t1;
> +	struct runner_context *ctx;
> +	long noise;
> +
> +	if (noise_range_usec > 0)
> +		noise = random() % ((bias + 1) * noise_range_usec) -
> +			noise_range_usec;
> +	else
> +		noise = 0;
> +	t1.start = start;
> +	t1.end = 0;
> +	t1.wait_nsec = (usecs + noise) * 1000 / steps;
> +	t1.wait_nsec = t1.wait_nsec > 0 ? t1.wait_nsec : 0;
> +	t1.steps = steps;
> +	t1.ignore_cancel = ignore_cancel;
> +
> +	ctx = get_runner(wait_and_add_1, &t1, sizeof(t1), usecs);
> +
> +	if (ctx) {
> +		struct timespec tmo, finish;
> +
> +		clock_gettime(CLOCK_MONOTONIC, &tmo);
> +		tmo.tv_sec += usecs / MILLION;
> +		tmo.tv_nsec += (usecs % MILLION) * 1000;
> +		finish = tmo;
> +		normalize_timespec(&tmo);
> +		finish.tv_sec += noise / MILLION;
> +		finish.tv_nsec += (noise % MILLION) * 1000;
> +		normalize_timespec(&finish);
> +		condlog(4, "started runner start %d timeout %lld.%06lld, finish %lld.%06lld (noise %ld), steps %d, %signoring cancellation",
> +			start, (long long)tmo.tv_sec,
> +			(long long)tmo.tv_nsec / 1000, (long long)finish.tv_sec,
> +			(long long)finish.tv_nsec / 1000, noise, steps,
> +			ignore_cancel ? "" : "not ");
> +		return ctx;
> +	} else {
> +		condlog(0, "failed to start runner for start %d", start);
> +		return NULL;
> +	}
> +}
> +
> +static struct runner_context **context;
> +static volatile bool must_stop = false;
> +void int_handler(int signal)
> +{
> +	must_stop = true;
> +}
> +
> +static void terminate_all(void)
> +{
> +	int i, count;
> +
> +	for (count = 0, i = 0; i < N_RUNNERS; i++)
> +		if (context[i]) {
> +			cancel_runner(context[i]);
> +			context[i] = NULL;
> +			count++;
> +		}
> +	condlog(3, "%s: %d runners cancelled", __func__, count);
> +	/* give runners a chance to clean up */
> +	sched_yield();
> +}
> +
> +static bool test_sleep(const struct timespec *wait)
> +{
> +	sigset_t set;
> +
> +	sigfillset(&set);
> +	sigdelset(&set, SIGTERM);
> +	sigdelset(&set, SIGINT);
> +	sigdelset(&set, SIGQUIT);
> +
> +	pselect(0, NULL, NULL, NULL, wait, &set);
> +
> +	if (!must_stop)
> +		return false;
> +
> +	terminate_all();
> +	return true;
> +}
> +
> +static int run_test(int n)
> +{
> +	int i, running, done, errors;
> +	const struct timespec wait = {.tv_sec = 0, .tv_nsec = 1000 * POLL_USEC};
> +	struct timespec stop, now;
> +	long max_wait = TIMEOUT_USEC + NOISE_BIAS * NOISE_USEC + 10000;
> +	bool killed = false;
> +
> +	for (i = 0; i < N_RUNNERS; i++)
> +		context[i] = start_runner(TIMEOUT_USEC, i, NOISE_USEC, NOISE_BIAS,
> +					  SLEEP_STEPS, IGNORE_CANCEL);
> +
> +	clock_gettime(CLOCK_MONOTONIC, &stop);
> +	stop.tv_sec += max_wait / MILLION;
> +	stop.tv_nsec += (max_wait % MILLION) * 1000;
> +	normalize_timespec(&stop);
> +	running = N_RUNNERS;
> +	done = 0;
> +	errors = 0;
> +	do {
> +		bool err = false;
> +
> +		condlog(4, "%d runners active", running);
> +		killed = test_sleep(&wait);
> +		if (killed)
> +			condlog(3, "%s: terminating on signal...", __func__);
> +		for (running = 0, i = 0; i < N_RUNNERS; i++) {
> +			int st;
> +
> +			if (!context[i])
> +				continue;
> +			st = check_payload(context[i], &err);
> +			switch (st) {
> +			case RUNNER_DONE:
> +				if (err)
> +					errors++;
> +				done++;
> +				/* fallthrough */
> +			case RUNNER_CANCELLED:
> +				context[i] = NULL;
> +				break;
> +			default:
> +				running++;
> +				break;
> +			}
> +		}
> +		if (killed)
> +			break;
> +		clock_gettime(CLOCK_MONOTONIC, &now);
> +		if (timespeccmp(&stop, &now) <= 0)
> +			break;
> +	} while (running);
> +
> +	condlog(2, "%10d%10d%10d%10d%10d", n, N_RUNNERS, N_RUNNERS - running,
> +		done, errors);
> +	if (killed) {
> +		condlog(2, "%s: termination signal received", __func__);
> +		exit(0);
> +	}
> +
> +	if (running > 0) {
> +		condlog(1, "ERROR: %d runners haven't finished", running);
> +		terminate_all();
> +	}
> +	return running + errors;
> +}
> +
> +static void free_ctxs(struct runner_context ***ctxs)
> +{
> +	if (*ctxs)
> +		free(*ctxs);
> +}
> +
> +static int setup_signal_handler(int sig, void (*handler)(int))
> +{
> +	sigset_t set;
> +	sigfillset(&set);
> +	struct sigaction sga = {.sa_handler = NULL};
> +
> +	sga.sa_handler = int_handler;

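Typo. Since you pass in a handler, this should be

sga.sa_handler = handler;

Not that it actually makes any difference: the only other caller passes
dummy_handler for SIGCHLD in the parent, which never looks at must_stop.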
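
For illustration, here is a minimal standalone sketch of the function with
the one-line change suggested above. It is not the code from the patch: the
counters, the choice of SIGUSR1/SIGUSR2 and handler_selftest() are
assumptions made up for the sketch, condlog() is replaced by fprintf() so
that it builds on its own, and memset() stands in for the designated
initializer.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical counters, only for this sketch; tests/runner.c has none. */
static volatile sig_atomic_t int_calls;
static volatile sig_atomic_t dummy_calls;

static void int_handler(int sig) { (void)sig; int_calls++; }
static void dummy_handler(int sig) { (void)sig; dummy_calls++; }

/* Install `handler` for `sig`; all signals are blocked while it runs. */
static int setup_signal_handler(int sig, void (*handler)(int))
{
	struct sigaction sga;

	memset(&sga, 0, sizeof(sga));
	sga.sa_handler = handler;	/* the argument, not a fixed int_handler */
	sigfillset(&sga.sa_mask);
	if (sigaction(sig, &sga, NULL) != 0) {
		fprintf(stderr, "%s: failed to install signal handler for %d: %s\n",
			__func__, sig, strerror(errno));
		return -1;
	}
	return 0;
}

/* Raise one signal per handler and check that each reached its own handler. */
void handler_selftest(void)
{
	int rc_int = setup_signal_handler(SIGUSR1, int_handler);
	int rc_dummy = setup_signal_handler(SIGUSR2, dummy_handler);

	assert(rc_int == 0 && rc_dummy == 0);
	int_calls = 0;
	dummy_calls = 0;
	raise(SIGUSR1);
	raise(SIGUSR2);
	assert(int_calls == 1 && dummy_calls == 1);
}
```

With the hard-coded int_handler, both signals would end up in int_handler,
and the second assertion in handler_selftest() would fail.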

-Ben

> +	sga.sa_mask = set;
> +	if (sigaction(sig, &sga, NULL) != 0) {
> +		condlog(1, "%s: failed to install signal handler for %d: %s",
> +			__func__, sig, strerror(errno));
> +		return -1;
> +	}
> +	return 0;
> +}
> +
> +int run_tests(void)
> +{
> +	int errors = 0, i;
> +	struct runner_context **ctxs __attribute__((cleanup(free_ctxs))) = NULL;
> +
> +	if (setup_signal_handler(SIGINT, int_handler) != 0)
> +		return -1;
> +
> +	ctxs = calloc(N_RUNNERS, sizeof(*context));
> +	if (ctxs == NULL)
> +		/* arbitrary number to indicate OOM error */
> +		return 7000;
> +	context = ctxs;
> +	for (i = 0; i < REPEAT; i++) {
> +		errors += run_test(i + 1);
> +	}
> +	return errors ? 1 : 0;
> +}
> +
> +/* We need to register a dummy handler to avoid system call restarting in
> + * pselect() below */
> +static void dummy_handler(int sig) {}
> +
> +static int fork_test(void)
> +{
> +	sigset_t set;
> +	pid_t child;
> +	int wstatus;
> +	struct timespec wait_to_kill = {.tv_sec = 0};
> +
> +	/* Block all signals. termination signals will be enabled in test_sleep() */
> +	sigfillset(&set);
> +	pthread_sigmask(SIG_SETMASK, &set, NULL);
> +
> +	child = fork();
> +
> +	if (child < 0) {
> +		condlog(0, "error in fork(), %s", strerror(errno));
> +		return -1;
> +	} else if (child == 0) {
> +		/* child */
> +		int rc = run_tests();
> +		exit(rc ? 1 : 0);
> +	}
> +
> +	setup_signal_handler(SIGCHLD, dummy_handler);
> +
> +	/* parent */
> +	if (KILL_TIMEOUT > 0) {
> +		sigset_t set;
> +
> +		condlog(3, "%s: == Child %d will be killed with SIGINT after %ld us",
> +			__func__, child, KILL_TIMEOUT);
> +		wait_to_kill.tv_sec = KILL_TIMEOUT / MILLION;
> +		wait_to_kill.tv_nsec = (KILL_TIMEOUT % MILLION) * 1000;
> +
> +		/*
> +		 * Unblock SIGCHLD in case child terminates
> +		 * (child will receive SIGINT)
> +		 */
> +		sigfillset(&set);
> +		sigdelset(&set, SIGCHLD);
> +		if (pselect(0, NULL, NULL, NULL, &wait_to_kill, &set) != 0) {
> +			if (errno == EINTR)
> +				condlog(2, "main: child terminated");
> +			else
> +				condlog(1, "main: error in pselect: %s",
> +					strerror(errno));
> +		} else
> +			kill(child, SIGINT);
> +	}
> +
> +	if (waitpid(child, &wstatus, 0) <= 0) {
> +		condlog(1, "%s: failed to wait for child %d", __func__, child);
> +		return -1;
> +	}
> +	if (WIFEXITED(wstatus)) {
> +		condlog(3, "%s: child %d return code %d", __func__, child,
> +			WEXITSTATUS(wstatus));
> +		return WEXITSTATUS(wstatus);
> +	} else if (WIFSIGNALED(wstatus)) {
> +		condlog(2, "%s: child %d killed by signal code %d", __func__,
> +			child, WTERMSIG(wstatus));
> +		return -1;
> +	} else {
> +		condlog(1, "%s: unexpected status of child %d", __func__, child);
> +		return -1;
> +	}
> +}
> +
> +static long parse_number(const char *arg, long factor, long deflt)
> +{
> +	char *ep;
> +	long v;
> +
> +	if (*arg) {
> +		v = strtol(arg, &ep, 10);
> +		if (!*ep && v >= 0)
> +			return factor * v;
> +	}
> +	condlog(1, "invalid argument: %s, using %ld", arg, deflt);
> +	return deflt;
> +}
> +
> +static int usage(const char *cmd, int opt)
> +{
> +#define USAGE_FMT \
> +	"Usage: %s [options]\n" \
> +		"	-N runners:  number of parallel runners\n" \
> +		"	-p msecs:    time to sleep between status polls in main thread\n" \
> +		"	-t msecs:    timeout for runners\n" \
> +		"	-n msecs:    random noise for runner sleep time\n" \
> +		"	-b bias:     noise increase factor towards sleeping longer\n" \
> +		"	-s n:        number of steps to divide sleep time into\n" \
> +		"	-k msecs:    timeout after which to kill all runners (0: don't kill)\n" \
> +		"	-r n:        number of times to repeat test\n" \
> +		"	-i:          runners ignore cancellation while sleeping\n" \
> +		"	-v n:        set verbosity level (default 2)\n" \
> +		"	-h:          print this help"
> +	condlog(0, USAGE_FMT, cmd);
> +	return opt == 'h' ? 0 : 1;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	int opt;
> +	int total = 0;
> +	const char *optstring = "+:N:p:t:n:b:s:k:r:v:ih";
> +
> +	init_test_verbosity(2);
> +
> +	while ((opt = getopt(argc, argv, optstring)) != -1) {
> +		switch (opt) {
> +		case 'N':
> +			N_RUNNERS = parse_number(optarg, 1L, N_RUNNERS);
> +			break;
> +		case 'p':
> +			POLL_USEC = parse_number(optarg, 1000L, POLL_USEC);
> +			break;
> +		case 't':
> +			TIMEOUT_USEC = parse_number(optarg, 1000L, TIMEOUT_USEC);
> +			break;
> +		case 'n':
> +			NOISE_USEC = parse_number(optarg, 1000L, NOISE_USEC);
> +			break;
> +		case 'b':
> +			NOISE_BIAS = parse_number(optarg, 1L, NOISE_BIAS);
> +			break;
> +		case 's':
> +			SLEEP_STEPS = parse_number(optarg, 1L, SLEEP_STEPS);
> +			break;
> +		case 'k':
> +			KILL_TIMEOUT = parse_number(optarg, 1000L, KILL_TIMEOUT);
> +			break;
> +		case 'r':
> +			REPEAT = parse_number(optarg, 1L, REPEAT);
> +			break;
> +		case 'i':
> +			IGNORE_CANCEL = true;
> +			break;
> +		case 'v':
> +			libmp_verbosity = parse_number(optarg, 1L, libmp_verbosity);
> +			break;
> +		case 'h':
> +		case ':':
> +		case '?':
> +			return usage(argv[0], opt);
> +			break;
> +		}
> +	}
> +
> +	if (optind != argc)
> +		return usage(argv[0], '?');
> +
> +	condlog(2, "Runner: timeout=%ld, noise interval=[%ld:%ld], steps=%d",
> +		TIMEOUT_USEC, TIMEOUT_USEC - NOISE_USEC,
> +		TIMEOUT_USEC + NOISE_BIAS * NOISE_USEC, SLEEP_STEPS);
> +	condlog(2, "Other : poll interval=%ld, ignore cancellation=%s, runners=%d, repeat=%d, kill timeout=%ld",
> +		POLL_USEC, IGNORE_CANCEL ? "YES" : "NO", N_RUNNERS, REPEAT,
> +		KILL_TIMEOUT);
> +	condlog(2, "%10s%10s%10s%10s%10s", "run", "total", "finished",
> +		"completed", "errors");
> +
> +	total = fork_test();
> +	if (total == -1)
> +		return 130;
> +	condlog(2, "== TOTAL NUMBER OF FAILED RUNS: %d", total);
> +	return total ? 1 : 0;
> +}
> -- 
> 2.53.0

Thread overview: 11+ messages
2026-03-19 22:13 [PATCH 0/4] multipath-tools: generic async threads for TUR checker Martin Wilck
2026-03-19 22:13 ` [PATCH 1/4] multipathd: get_new_state: map PATH_TIMEOUT to PATH_DOWN Martin Wilck
2026-03-24  5:35   ` Benjamin Marzinski
2026-03-19 22:13 ` [PATCH 2/4] libmpathutil: add generic implementation for checker thread runners Martin Wilck
2026-03-24  5:44   ` Benjamin Marzinski
2026-03-19 22:13 ` [PATCH 3/4] multipath-tools tests: add test program for " Martin Wilck
2026-03-24  5:47   ` Benjamin Marzinski [this message]
2026-03-19 22:13 ` [PATCH 4/4] libmultipath: TUR checker: use runner threads Martin Wilck
2026-03-24  6:38   ` Benjamin Marzinski
2026-03-24 12:24     ` Martin Wilck
2026-03-24 14:46       ` Benjamin Marzinski
