* [PATCH v1] selftests/sched_ext: Fix rt_stall flaky failure
From: Ihor Solodrai @ 2026-02-13 18:21 UTC
To: Andrea Righi, Changwoo Min, David Vernet, Tejun Heo
Cc: sched-ext, bpf, kernel-team

The rt_stall test measures the runtime ratio between an EXT and an RT
task pinned to the same CPU, verifying that the deadline server prevents
RT tasks from starving SCHED_EXT tasks. It expects the EXT task to get
at least 4% of CPU time.

The test is flaky because sched_stress_test() calls sleep(RUN_TIME)
immediately after fork(), without waiting for the RT child to complete
its setup (set_affinity + set_sched). If the RT child experiences
scheduling latency before completing setup, that delay eats into the
measurement window: the RT child runs for less than RUN_TIME seconds,
and the EXT task's measured ratio drops below the 4% threshold.

For example, in the failing CI run [1]:

    EXT=0.140s RT=4.750s total=4.890s (expected ~5.0s)
    ratio=2.86% < 4% → FAIL

The 110ms gap (5.0 - 4.89) corresponds to the RT child's setup time
being counted inside the measurement window, during which fewer
deadline server ticks fire for the EXT task.

Fix by using pipes to synchronize: each child signals the parent after
completing its setup, and the parent waits for both signals before
starting sleep(RUN_TIME). This ensures the measurement window only
counts time when both tasks are fully configured and competing.

[1] https://github.com/kernel-patches/bpf/actions/runs/21961895809/job/63442490449

Fixes: be621a76341c ("selftests/sched_ext: Add test for sched_ext dl_server")
Assisted-by: claude-opus-4-6-v1
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
---
BPF CI caught the failure; I fed the logs to Claude Code and this is
what it came up with. I ran this 3 times on CI, and the failure didn't
happen again. The change makes sense to me, although synchronization
via pipes might be overkill?

Please let me know if this is too sloppy and I'll try to refine it.

Thanks!
---
tools/testing/selftests/sched_ext/rt_stall.c | 49 ++++++++++++++++++++
1 file changed, 49 insertions(+)
diff --git a/tools/testing/selftests/sched_ext/rt_stall.c b/tools/testing/selftests/sched_ext/rt_stall.c
index 015200f80f6e..ab772e336f86 100644
--- a/tools/testing/selftests/sched_ext/rt_stall.c
+++ b/tools/testing/selftests/sched_ext/rt_stall.c
@@ -23,6 +23,30 @@
 #define CORE_ID 0 /* CPU to pin tasks to */
 #define RUN_TIME 5 /* How long to run the test in seconds */

+/* Signal the parent that setup is complete by writing to a pipe */
+static void signal_ready(int fd)
+{
+	char c = 1;
+
+	if (write(fd, &c, 1) != 1) {
+		perror("write to ready pipe");
+		exit(EXIT_FAILURE);
+	}
+	close(fd);
+}
+
+/* Wait for a child to signal readiness via a pipe */
+static void wait_ready(int fd)
+{
+	char c;
+
+	if (read(fd, &c, 1) != 1) {
+		perror("read from ready pipe");
+		exit(EXIT_FAILURE);
+	}
+	close(fd);
+}
+
 /* Simple busy-wait function for test tasks */
 static void process_func(void)
 {
@@ -122,14 +146,24 @@ static bool sched_stress_test(bool is_ext)

 	float ext_runtime, rt_runtime, actual_ratio;
 	int ext_pid, rt_pid;
+	int ext_ready[2], rt_ready[2];

 	ksft_print_header();
 	ksft_set_plan(1);

+	if (pipe(ext_ready) || pipe(rt_ready)) {
+		perror("pipe");
+		ksft_exit_fail();
+	}
+
 	/* Create and set up a EXT task */
 	ext_pid = fork();
 	if (ext_pid == 0) {
+		close(ext_ready[0]);
+		close(rt_ready[0]);
+		close(rt_ready[1]);
 		set_affinity(CORE_ID);
+		signal_ready(ext_ready[1]);
 		process_func();
 		exit(0);
 	} else if (ext_pid < 0) {
@@ -140,8 +174,12 @@ static bool sched_stress_test(bool is_ext)
 	/* Create an RT task */
 	rt_pid = fork();
 	if (rt_pid == 0) {
+		close(ext_ready[0]);
+		close(ext_ready[1]);
+		close(rt_ready[0]);
 		set_affinity(CORE_ID);
 		set_sched(SCHED_FIFO, 50);
+		signal_ready(rt_ready[1]);
 		process_func();
 		exit(0);
 	} else if (rt_pid < 0) {
@@ -149,6 +187,17 @@ static bool sched_stress_test(bool is_ext)
 		ksft_exit_fail();
 	}

+	/*
+	 * Wait for both children to complete their setup (affinity and
+	 * scheduling policy) before starting the measurement window.
+	 * This prevents flaky failures caused by the RT child's setup
+	 * time eating into the measurement period.
+	 */
+	close(ext_ready[1]);
+	close(rt_ready[1]);
+	wait_ready(ext_ready[0]);
+	wait_ready(rt_ready[0]);
+
 	/* Let the processes run for the specified time */
 	sleep(RUN_TIME);

--
2.53.0
* Re: [PATCH v1] selftests/sched_ext: Fix rt_stall flaky failure
From: Andrea Righi @ 2026-02-13 18:57 UTC
To: Ihor Solodrai
Cc: Changwoo Min, David Vernet, Tejun Heo, sched-ext, bpf,
kernel-team
Hi Ihor,
On Fri, Feb 13, 2026 at 10:21:36AM -0800, Ihor Solodrai wrote:
> The rt_stall test measures the runtime ratio between an EXT and an RT
> task pinned to the same CPU, verifying that the deadline server prevents
> RT tasks from starving SCHED_EXT tasks. It expects the EXT task to get
> at least 4% of CPU time.
>
> The test is flaky because sched_stress_test() calls sleep(RUN_TIME)
> immediately after fork(), without waiting for the RT child to complete
> its setup (set_affinity + set_sched). If the RT child experiences
> scheduling latency before completing setup, that delay eats into the
> measurement window: the RT child runs for less than RUN_TIME seconds,
> and the EXT task's measured ratio drops below the 4% threshold.
>
> For example, in the failing CI run [1]:
> EXT=0.140s RT=4.750s total=4.890s (expected ~5.0s)
> ratio=2.86% < 4% → FAIL
>
> The 110ms gap (5.0 - 4.89) corresponds to the RT child's setup time
> being counted inside the measurement window, during which fewer
> deadline server ticks fire for the EXT task.
>
> Fix by using pipes to synchronize: each child signals the parent after
> completing its setup, and the parent waits for both signals before
> starting sleep(RUN_TIME). This ensures the measurement window only
> counts time when both tasks are fully configured and competing.
>
> [1] https://github.com/kernel-patches/bpf/actions/runs/21961895809/job/63442490449
>
> Fixes: be621a76341c ("selftests/sched_ext: Add test for sched_ext dl_server")
> Assisted-by: claude-opus-4-6-v1
> Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>

The pipe sync makes sense and should make the test more robust, so LGTM.

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea
* Re: [PATCH v1] selftests/sched_ext: Fix rt_stall flaky failure
From: Tejun Heo @ 2026-02-13 19:12 UTC
To: Ihor Solodrai
Cc: Andrea Righi, Changwoo Min, David Vernet, Emil Tsalapatis,
sched-ext, bpf, kernel-team

Applied to sched_ext/for-7.0-fixes.

Thanks.

--
tejun