* EVL 7.1 and below on armhf: System hang when running the testsuite under stress
From: Florian Bezdeka @ 2026-06-22 7:29 UTC
To: Philippe Gerum; +Cc: xenomai, Jan Kiszka, Tobias Schaffner

Hi Philippe,

while testing the 7.1 based EVL branches I noted a system hang on arm
when trying to execute the evl testsuite multiple times in a row while
stressing the system at the same time.

Environment: Nested virtualization, an x86 VM running the armhf tests in
another VM. Might not be that common but other archs are running fine
with this setup.

I tested against 7.1 and 7.0 rebase branches - both affected - but was
not able yet to go below 7.0 as time is limited. I also enabled
PROVE_LOCKING but that did not trigger any warning but made the problem
go away...

The stressor cmdline:
# stress-ng --cpu 8 --iomix 4 --vm 2 --vm-bytes 128M --fork 4 &

Example output
# evl test -k
basic-xbuf: OK
clock-timer-periodic: OK
clone-fork-exec: OK
detach-self: OK
duplicate-element: OK
element-visibility: OK
fault: OK
file-dup: OK
fpu-preload: OK
fpu-stress: OK
heap-torture: OK
hectic: OK
mapfd: OK
monitor-count-inband: OK
monitor-deadlock: OK
monitor-deboost-stress: OK
monitor-event: OK
monitor-event-sigrel: OK
monitor-event-targeted: OK
monitor-event-untrack: OK
monitor-flags: OK
monitor-flags-broadcast: OK
monitor-flags-inband: OK
monitor-pi: OK
monitor-pi-deadlock: OK
monitor-pi-deboost: OK
monitor-pi-stress: OK
monitor-pp-dynamic: OK
monitor-pp-lazy: OK
monitor-pp-lower: OK
monitor-pp-nested: OK
monitor-pp-pi: OK
monitor-pp-raise: OK
monitor-pp-tryenter: OK
monitor-pp-weak: OK
monitor-recursive: OK
monitor-steal: OK
monitor-trylock: OK
monitor-wait-multiple: OK
monitor-wait-requeue: OK
observable-hm: OK
observable-inband: OK
observable-onchange: OK
observable-oob: OK
observable-race: OK
observable-thread: OK
observable-unicast: OK
poll-close: OK
poll-flags: OK
poll-many: OK
poll-multiple: OK
poll-nested: OK
poll-observable-inband: OK
poll-observable-oob: OK
poll-sem: OK
poll-xbuf: OK
proxy-echo: OK
proxy-eventfd: OK
proxy-pipe: OK
proxy-poll: OK
ptrace-sync: OK
ring-spray: OK
rwlock-read: OK
rwlock-write: OK
sched-quota-accuracy: 78.4%
** sched-tp-accuracy: BROKEN
sched-tp-overrun: OK
sem-close-unblock: OK
sem-flush: OK
sem-timedwait: OK
sem-wait: OK
simple-clone: OK
[ No further output, seems we are stuck in stax-lock test]
[ It's always within this test ]

I was able to fetch the following rcu warning from the serial console
via gdb/lx-dmesg. Wasn't that helpful for me, but maybe it rings a bell.

[ 57.488273] EVL: fault:1957 switching in-band [pid=1957, excpt=0, __copy_to_user_std+0x74/0x374]
[ 57.489105] EVL: fault:1957 resuming out-of-band [pid=1957, excpt=0, __copy_to_user_std+0x360/0x374]
[ 57.489398] EVL: fault:1957 switching in-band [pid=1957, excpt=0, user_pc=0x4707ea]
[ 86.772645] EVL: fault:4193 switching in-band [pid=4193, excpt=0, __copy_to_user_std+0x74/0x374]
[ 86.772942] EVL: fault:4193 resuming out-of-band [pid=4193, excpt=0, __copy_to_user_std+0x360/0x374]
[ 86.773029] EVL: fault:4193 switching in-band [pid=4193, excpt=0, user_pc=0x4507ea]
[ 177.579348] EVL: watchdog triggered on CPU0 -- runaway thread 'post-many-flags:10780.9' signaled
[ 374.037157] EVL: fault:25707 switching in-band [pid=25707, excpt=0, __copy_to_user_std+0x74/0x374]
[ 374.037582] EVL: fault:25707 resuming out-of-band [pid=25707, excpt=0, __copy_to_user_std+0x360/0x374]
[ 374.037705] EVL: fault:25707 switching in-band [pid=25707, excpt=0, user_pc=0x4507ea]
[ 493.107954] EVL: watchdog triggered on CPU0 -- runaway thread 'post-many-flags:2062.4' signaled
[ 599.153154] EVL: fault:9187 switching in-band [pid=9187, excpt=0, __copy_to_user_std+0x74/0x374]
[ 599.153624] EVL: fault:9187 resuming out-of-band [pid=9187, excpt=0, __copy_to_user_std+0x360/0x374]
[ 599.153725] EVL: fault:9187 switching in-band [pid=9187, excpt=0, user_pc=0x4007ea]
[ 627.334572] EVL: fault:11456 switching in-band [pid=11456, excpt=0, __copy_to_user_std+0x74/0x374]
[ 627.335530] EVL: fault:11456 resuming out-of-band [pid=11456, excpt=0, __copy_to_user_std+0x360/0x374]
[ 627.335752] EVL: fault:11456 switching in-band [pid=11456, excpt=0, user_pc=0x4907ea]
[ 730.251556] EVL: watchdog triggered on CPU0 -- runaway thread 'post-many-flags:18782.6' signaled
[ 747.230444] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 747.231393] rcu: (detected by 1, t=2102 jiffies, g=103469, q=1205 ncpus=4)
[ 747.231467] rcu: All QSes seen, last rcu_sched kthread activity 2100 (44723-42623), jiffies_till_next_fqs=1, root ->qsmask 0x0
[ 747.231599] rcu: rcu_sched kthread starved for 2100 jiffies! g103469 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[ 747.231628] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 747.231642] rcu: RCU grace-period kthread stack dump:
[ 747.231701] task:rcu_sched state:R running task stack:0 pid:15 tgid:15 ppid:2 task_flags:0x208040 flags:0x00000000
[ 747.232543] Call trace:
[ 747.233011] __schedule from schedule+0x20/0x130
[ 747.233751] schedule from schedule_timeout+0x84/0xf4
[ 747.233784] schedule_timeout from rcu_gp_fqs_loop+0xe8/0x450
[ 747.233807] rcu_gp_fqs_loop from rcu_gp_kthread+0xf0/0x110
[ 747.233871] rcu_gp_kthread from kthread+0xe8/0x10c
[ 747.233901] kthread from ret_from_fork+0x14/0x30
[ 747.233957] Exception stack(0xf0879fb0 to 0xf0879ff8)
[ 747.234087] 9fa0: 00000000 00000000 00000000 00000000
[ 747.234108] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 747.234122] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[ 747.234251] rcu: Stack dump where RCU GP kthread last ran:
[ 747.234387] Sending NMI from CPU 1 to CPUs 0:
[ 747.234579] Spurious and unmasked percpu IRQ23 on CPU0

This problem is unrelated to the arm pipelining cleanup series. I'm
going to post v3 now.

Another finding triggered by some analysis is that we disable a couple
of tests in CI. There are two tests failing often in this arm qemu
setup:
- clock-timer-periodic
- sched-tp-accuracy

The timer test is especially failing when there is some load on the
host.

Now the question - mainly in the direction of Tobias:
Why are the other tests disabled in CI? Namely:
- sched-quota-accuracy
- sched-tp-accuracy
- sched-tp-overrun
- monitor-event-untrack

Shouldn't we better fix the tests than simply disable them? Haven't seen
any failures on arm64. x86 pending.

Best regards,
Florian

* Re: EVL 7.1 and below on armhf: System hang when running the testsuite under stress
From: Philippe Gerum @ 2026-06-22 9:21 UTC
To: Florian Bezdeka; +Cc: xenomai, Jan Kiszka, Tobias Schaffner

Florian Bezdeka <florian.bezdeka@siemens.com> writes:

<snip>

> sem-wait: OK
> simple-clone: OK
> [ No further output, seems we are stuck in stax-lock test]
> [ It's always within this test ]
>
> I was able to fetch the following rcu warning from the serial console
> via gdb/lx-dmesg. Wasn't that helpful for me, but maybe it rings a bell.
>
> [ 57.488273] EVL: fault:1957 switching in-band [pid=1957, excpt=0, __copy_to_user_std+0x74/0x374]
> [ 57.489105] EVL: fault:1957 resuming out-of-band [pid=1957, excpt=0, __copy_to_user_std+0x360/0x374]
> [ 57.489398] EVL: fault:1957 switching in-band [pid=1957, excpt=0, user_pc=0x4707ea]
> [ 86.772645] EVL: fault:4193 switching in-band [pid=4193, excpt=0, __copy_to_user_std+0x74/0x374]
> [ 86.772942] EVL: fault:4193 resuming out-of-band [pid=4193, excpt=0, __copy_to_user_std+0x360/0x374]
> [ 86.773029] EVL: fault:4193 switching in-band [pid=4193, excpt=0, user_pc=0x4507ea]
> [ 177.579348] EVL: watchdog triggered on CPU0 -- runaway thread 'post-many-flags:10780.9' signaled
> [ 374.037157] EVL: fault:25707 switching in-band [pid=25707, excpt=0, __copy_to_user_std+0x74/0x374]
> [ 374.037582] EVL: fault:25707 resuming out-of-band [pid=25707, excpt=0, __copy_to_user_std+0x360/0x374]
> [ 374.037705] EVL: fault:25707 switching in-band [pid=25707, excpt=0, user_pc=0x4507ea]
> [ 493.107954] EVL: watchdog triggered on CPU0 -- runaway thread 'post-many-flags:2062.4' signaled
> [ 599.153154] EVL: fault:9187 switching in-band [pid=9187, excpt=0, __copy_to_user_std+0x74/0x374]
> [ 599.153624] EVL: fault:9187 resuming out-of-band [pid=9187, excpt=0, __copy_to_user_std+0x360/0x374]
> [ 599.153725] EVL: fault:9187 switching in-band [pid=9187, excpt=0, user_pc=0x4007ea]
> [ 627.334572] EVL: fault:11456 switching in-band [pid=11456, excpt=0, __copy_to_user_std+0x74/0x374]
> [ 627.335530] EVL: fault:11456 resuming out-of-band [pid=11456, excpt=0, __copy_to_user_std+0x360/0x374]
> [ 627.335752] EVL: fault:11456 switching in-band [pid=11456, excpt=0, user_pc=0x4907ea]
> [ 730.251556] EVL: watchdog triggered on CPU0 -- runaway thread 'post-many-flags:18782.6' signaled
> [ 747.230444] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 747.231393] rcu: (detected by 1, t=2102 jiffies, g=103469, q=1205 ncpus=4)
> [ 747.231467] rcu: All QSes seen, last rcu_sched kthread activity 2100 (44723-42623), jiffies_till_next_fqs=1, root ->qsmask 0x0
> [ 747.231599] rcu: rcu_sched kthread starved for 2100 jiffies! g103469 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
> [ 747.231628] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
> [ 747.231642] rcu: RCU grace-period kthread stack dump:
> [ 747.231701] task:rcu_sched state:R running task stack:0 pid:15 tgid:15 ppid:2 task_flags:0x208040 flags:0x00000000
> [ 747.232543] Call trace:
> [ 747.233011] __schedule from schedule+0x20/0x130
> [ 747.233751] schedule from schedule_timeout+0x84/0xf4
> [ 747.233784] schedule_timeout from rcu_gp_fqs_loop+0xe8/0x450
> [ 747.233807] rcu_gp_fqs_loop from rcu_gp_kthread+0xf0/0x110
> [ 747.233871] rcu_gp_kthread from kthread+0xe8/0x10c
> [ 747.233901] kthread from ret_from_fork+0x14/0x30
> [ 747.233957] Exception stack(0xf0879fb0 to 0xf0879ff8)
> [ 747.234087] 9fa0: 00000000 00000000 00000000 00000000
> [ 747.234108] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> [ 747.234122] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> [ 747.234251] rcu: Stack dump where RCU GP kthread last ran:
> [ 747.234387] Sending NMI from CPU 1 to CPUs 0:
> [ 747.234579] Spurious and unmasked percpu IRQ23 on CPU0

Hard to say at the moment whether the pressure imposed on the
virtualized system by the test is responsible for this hang, or the
inter-stage synchronization in the core has issues. Any change with this
patch in?

diff --git a/tests/stax-lock.c b/tests/stax-lock.c
index 51576d9..87511ef 100644
--- a/tests/stax-lock.c
+++ b/tests/stax-lock.c
@@ -66,18 +66,17 @@ static void *test_thread(void *arg)
 	me = 1 << serial;
 
 	oob = !!(serial & 1);
+	delay = running_on_vm() ? 1000000 : 100000;
 	if (oob) {
 		__Tcall_assert(tfd, evl_attach_self("stax.%ld:%d",
 					serial / 2, getpid()));
 		do_ioctl = oob_ioctl;
 		do_usleep = evl_usleep;
-		delay = 100000;
 		/* Any in-band presence is invalid. */
 		invalid = 0x55555555;
 	} else {
 		do_ioctl = ioctl;
 		do_usleep = usleep;
-		delay = 100000;
 		/* Any oob presence is invalid. */
 		invalid = 0xAAAAAAAA;
 	}

Clearly, an improvement would not rule out some issue in the
implementation of the stax mechanism, but this might give us a valuable
hint anyway.

>
> This problem is unrelated to the arm pipelining cleanup series. I'm
> going to post v3 now.
>
> Another finding triggered by some analysis is that we disable a couple
> of tests in CI. There are two tests failing often in this arm qemu
> setup:
> - clock-timer-periodic
> - sched-tp-accuracy
>
> The timer test is especially failing when there is some load on the
> host.

Since r58, we have the running_on_vm() predicate available to test code,
which checks whether the "EVL_ON_VM" environment variable is set to
1/y/yes/Y/YES (unfortunately, I'm not aware of any way to detect this
without user input like the valgrind vm allows via some hypercall).

sched-tp-accuracy, sched-tp-overrun, and monitor-event-untrack have been
fixed up accordingly not to trigger false positives on vm.

>
> Now the question - mainly in the direction of Tobias:
> Why are the other tests disabled in CI? Namely:
> - sched-quota-accuracy
> - sched-tp-accuracy
> - sched-tp-overrun
> - monitor-event-untrack
>
> Shouldn't we better fix the tests than simply disable them? Haven't seen
> any failures on arm64. x86 pending.
>

Yes, it would be better to fix them specifically for vm context, even if
that means disabling some checks based on timing accuracy.

-- 
Philippe.

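To make the predicate described above concrete, here is a minimal sketch of an environment-based check together with the delay selection from the stax-lock patch. It is not the libevl implementation of running_on_vm(): the helper names value_means_vm(), env_says_vm() and pick_delay() are hypothetical. Only the EVL_ON_VM variable name, the accepted values 1/y/yes/Y/YES and the two delay constants come from the message; reading the delay as a microsecond count is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: true if the string is one of the values the
 * thread lists for EVL_ON_VM (1/y/yes/Y/YES), false for anything
 * else, including NULL (variable unset).
 */
bool value_means_vm(const char *val)
{
	static const char *const accepted[] = { "1", "y", "yes", "Y", "YES" };
	size_t n;

	if (val == NULL)
		return false;

	for (n = 0; n < sizeof(accepted) / sizeof(accepted[0]); n++) {
		if (strcmp(val, accepted[n]) == 0)
			return true;
	}

	return false;
}

/* Hypothetical stand-in for running_on_vm(), not the libevl code. */
bool env_says_vm(void)
{
	return value_means_vm(getenv("EVL_ON_VM"));
}

/*
 * Same selection as in the stax-lock patch: a ten times longer delay
 * when a VM is reported. The unit is assumed to be microseconds.
 */
unsigned int pick_delay(void)
{
	return env_says_vm() ? 1000000 : 100000;
}
```

With such a check, a CI job running under QEMU would export EVL_ON_VM=1 before starting the testsuite, and any other value (or an unset variable) would keep the bare-metal timing.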
* Re: EVL 7.1 and below on armhf: System hang when running the testsuite under stress
From: Florian Bezdeka @ 2026-06-22 22:30 UTC
To: Philippe Gerum; +Cc: xenomai, Jan Kiszka, Tobias Schaffner

On Mon, 2026-06-22 at 11:21 +0200, Philippe Gerum wrote:
> Florian Bezdeka <florian.bezdeka@siemens.com> writes:
>
> <snip>
>
> > sem-wait: OK
> > simple-clone: OK
> > [ No further output, seems we are stuck in stax-lock test]
> > [ It's always within this test ]
> >
> > I was able to fetch the following rcu warning from the serial console
> > via gdb/lx-dmesg. Wasn't that helpful for me, but maybe it rings a bell.
> >
> > [ 57.488273] EVL: fault:1957 switching in-band [pid=1957, excpt=0, __copy_to_user_std+0x74/0x374]
> > [ 57.489105] EVL: fault:1957 resuming out-of-band [pid=1957, excpt=0, __copy_to_user_std+0x360/0x374]
> > [ 57.489398] EVL: fault:1957 switching in-band [pid=1957, excpt=0, user_pc=0x4707ea]
> > [ 86.772645] EVL: fault:4193 switching in-band [pid=4193, excpt=0, __copy_to_user_std+0x74/0x374]
> > [ 86.772942] EVL: fault:4193 resuming out-of-band [pid=4193, excpt=0, __copy_to_user_std+0x360/0x374]
> > [ 86.773029] EVL: fault:4193 switching in-band [pid=4193, excpt=0, user_pc=0x4507ea]
> > [ 177.579348] EVL: watchdog triggered on CPU0 -- runaway thread 'post-many-flags:10780.9' signaled
> > [ 374.037157] EVL: fault:25707 switching in-band [pid=25707, excpt=0, __copy_to_user_std+0x74/0x374]
> > [ 374.037582] EVL: fault:25707 resuming out-of-band [pid=25707, excpt=0, __copy_to_user_std+0x360/0x374]
> > [ 374.037705] EVL: fault:25707 switching in-band [pid=25707, excpt=0, user_pc=0x4507ea]
> > [ 493.107954] EVL: watchdog triggered on CPU0 -- runaway thread 'post-many-flags:2062.4' signaled
> > [ 599.153154] EVL: fault:9187 switching in-band [pid=9187, excpt=0, __copy_to_user_std+0x74/0x374]
> > [ 599.153624] EVL: fault:9187 resuming out-of-band [pid=9187, excpt=0, __copy_to_user_std+0x360/0x374]
> > [ 599.153725] EVL: fault:9187 switching in-band [pid=9187, excpt=0, user_pc=0x4007ea]
> > [ 627.334572] EVL: fault:11456 switching in-band [pid=11456, excpt=0, __copy_to_user_std+0x74/0x374]
> > [ 627.335530] EVL: fault:11456 resuming out-of-band [pid=11456, excpt=0, __copy_to_user_std+0x360/0x374]
> > [ 627.335752] EVL: fault:11456 switching in-band [pid=11456, excpt=0, user_pc=0x4907ea]
> > [ 730.251556] EVL: watchdog triggered on CPU0 -- runaway thread 'post-many-flags:18782.6' signaled
> > [ 747.230444] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> > [ 747.231393] rcu: (detected by 1, t=2102 jiffies, g=103469, q=1205 ncpus=4)
> > [ 747.231467] rcu: All QSes seen, last rcu_sched kthread activity 2100 (44723-42623), jiffies_till_next_fqs=1, root ->qsmask 0x0
> > [ 747.231599] rcu: rcu_sched kthread starved for 2100 jiffies! g103469 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
> > [ 747.231628] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
> > [ 747.231642] rcu: RCU grace-period kthread stack dump:
> > [ 747.231701] task:rcu_sched state:R running task stack:0 pid:15 tgid:15 ppid:2 task_flags:0x208040 flags:0x00000000
> > [ 747.232543] Call trace:
> > [ 747.233011] __schedule from schedule+0x20/0x130
> > [ 747.233751] schedule from schedule_timeout+0x84/0xf4
> > [ 747.233784] schedule_timeout from rcu_gp_fqs_loop+0xe8/0x450
> > [ 747.233807] rcu_gp_fqs_loop from rcu_gp_kthread+0xf0/0x110
> > [ 747.233871] rcu_gp_kthread from kthread+0xe8/0x10c
> > [ 747.233901] kthread from ret_from_fork+0x14/0x30
> > [ 747.233957] Exception stack(0xf0879fb0 to 0xf0879ff8)
> > [ 747.234087] 9fa0: 00000000 00000000 00000000 00000000
> > [ 747.234108] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > [ 747.234122] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> > [ 747.234251] rcu: Stack dump where RCU GP kthread last ran:
> > [ 747.234387] Sending NMI from CPU 1 to CPUs 0:
> > [ 747.234579] Spurious and unmasked percpu IRQ23 on CPU0
>
> Hard to say at the moment whether the pressure imposed on the
> virtualized system by the test is responsible for this hang, or the
> inter-stage synchronization in the core has issues. Any change with this
> patch in?
>
> diff --git a/tests/stax-lock.c b/tests/stax-lock.c
> index 51576d9..87511ef 100644
> --- a/tests/stax-lock.c
> +++ b/tests/stax-lock.c
> @@ -66,18 +66,17 @@ static void *test_thread(void *arg)
>  	me = 1 << serial;
>  
>  	oob = !!(serial & 1);
> +	delay = running_on_vm() ? 1000000 : 100000;
>  	if (oob) {
>  		__Tcall_assert(tfd, evl_attach_self("stax.%ld:%d",
>  					serial / 2, getpid()));
>  		do_ioctl = oob_ioctl;
>  		do_usleep = evl_usleep;
> -		delay = 100000;
>  		/* Any in-band presence is invalid. */
>  		invalid = 0x55555555;
>  	} else {
>  		do_ioctl = ioctl;
>  		do_usleep = usleep;
> -		delay = 100000;
>  		/* Any oob presence is invalid. */
>  		invalid = 0xAAAAAAAA;
>  	}

Yep, that helps.

>
> Clearly, an improvement would not rule out some issue in the
> implementation of the stax mechanism, but this might give us a valuable
> hint anyway.
>
> >
> > This problem is unrelated to the arm pipelining cleanup series. I'm
> > going to post v3 now.
> >
> > Another finding triggered by some analysis is that we disable a couple
> > of tests in CI. There are two tests failing often in this arm qemu
> > setup:
> > - clock-timer-periodic
> > - sched-tp-accuracy
> >
> > The timer test is especially failing when there is some load on the
> > host.
>
> Since r58, we have the running_on_vm() predicate available to test code,
> which checks whether the "EVL_ON_VM" environment variable is set to
> 1/y/yes/Y/YES (unfortunately, I'm not aware of any way to detect this
> without user input like the valgrind vm allows via some hypercall).

systemd-detect-virt implements a couple of mechanisms to detect a lot of
different hypervisors. Could we depend on it?

>
> sched-tp-accuracy, sched-tp-overrun, and monitor-event-untrack have been
> fixed up accordingly not to trigger false positives on vm.

Nice.

Tobias, any plans to clean that up in the CI setup?

>
> >
> > Now the question - mainly in the direction of Tobias:
> > Why are the other tests disabled in CI? Namely:
> > - sched-quota-accuracy
> > - sched-tp-accuracy
> > - sched-tp-overrun
> > - monitor-event-untrack
> >
> > Shouldn't we better fix the tests than simply disable them? Haven't seen
> > any failures on arm64. x86 pending.
> >
>
> Yes, it would be better to fix them specifically for vm context, even if
> that means disabling some checks based on timing accuracy.
>
> --
> Philippe.

* Re: EVL 7.1 and below on armhf: System hang when running the testsuite under stress
From: Tobias Schaffner @ 2026-06-23 8:18 UTC
To: Florian Bezdeka, Philippe Gerum; +Cc: xenomai, Jan Kiszka

On 6/23/26 00:30, Florian Bezdeka wrote:
> On Mon, 2026-06-22 at 11:21 +0200, Philippe Gerum wrote:
>> Florian Bezdeka <florian.bezdeka@siemens.com> writes:
>>
>> <snip>
>>
>>> sem-wait: OK
>>> simple-clone: OK
>>> [ No further output, seems we are stuck in stax-lock test]
>>> [ It's always within this test ]
>>>
>>> I was able to fetch the following rcu warning from the serial console
>>> via gdb/lx-dmesg. Wasn't that helpful for me, but maybe it rings a bell.
>>>
>>> [ 57.488273] EVL: fault:1957 switching in-band [pid=1957, excpt=0, __copy_to_user_std+0x74/0x374]
>>> [ 57.489105] EVL: fault:1957 resuming out-of-band [pid=1957, excpt=0, __copy_to_user_std+0x360/0x374]
>>> [ 57.489398] EVL: fault:1957 switching in-band [pid=1957, excpt=0, user_pc=0x4707ea]
>>> [ 86.772645] EVL: fault:4193 switching in-band [pid=4193, excpt=0, __copy_to_user_std+0x74/0x374]
>>> [ 86.772942] EVL: fault:4193 resuming out-of-band [pid=4193, excpt=0, __copy_to_user_std+0x360/0x374]
>>> [ 86.773029] EVL: fault:4193 switching in-band [pid=4193, excpt=0, user_pc=0x4507ea]
>>> [ 177.579348] EVL: watchdog triggered on CPU0 -- runaway thread 'post-many-flags:10780.9' signaled
>>> [ 374.037157] EVL: fault:25707 switching in-band [pid=25707, excpt=0, __copy_to_user_std+0x74/0x374]
>>> [ 374.037582] EVL: fault:25707 resuming out-of-band [pid=25707, excpt=0, __copy_to_user_std+0x360/0x374]
>>> [ 374.037705] EVL: fault:25707 switching in-band [pid=25707, excpt=0, user_pc=0x4507ea]
>>> [ 493.107954] EVL: watchdog triggered on CPU0 -- runaway thread 'post-many-flags:2062.4' signaled
>>> [ 599.153154] EVL: fault:9187 switching in-band [pid=9187, excpt=0, __copy_to_user_std+0x74/0x374]
>>> [ 599.153624] EVL: fault:9187 resuming out-of-band [pid=9187, excpt=0, __copy_to_user_std+0x360/0x374]
>>> [ 599.153725] EVL: fault:9187 switching in-band [pid=9187, excpt=0, user_pc=0x4007ea]
>>> [ 627.334572] EVL: fault:11456 switching in-band [pid=11456, excpt=0, __copy_to_user_std+0x74/0x374]
>>> [ 627.335530] EVL: fault:11456 resuming out-of-band [pid=11456, excpt=0, __copy_to_user_std+0x360/0x374]
>>> [ 627.335752] EVL: fault:11456 switching in-band [pid=11456, excpt=0, user_pc=0x4907ea]
>>> [ 730.251556] EVL: watchdog triggered on CPU0 -- runaway thread 'post-many-flags:18782.6' signaled
>>> [ 747.230444] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
>>> [ 747.231393] rcu: (detected by 1, t=2102 jiffies, g=103469, q=1205 ncpus=4)
>>> [ 747.231467] rcu: All QSes seen, last rcu_sched kthread activity 2100 (44723-42623), jiffies_till_next_fqs=1, root ->qsmask 0x0
>>> [ 747.231599] rcu: rcu_sched kthread starved for 2100 jiffies! g103469 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
>>> [ 747.231628] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
>>> [ 747.231642] rcu: RCU grace-period kthread stack dump:
>>> [ 747.231701] task:rcu_sched state:R running task stack:0 pid:15 tgid:15 ppid:2 task_flags:0x208040 flags:0x00000000
>>> [ 747.232543] Call trace:
>>> [ 747.233011] __schedule from schedule+0x20/0x130
>>> [ 747.233751] schedule from schedule_timeout+0x84/0xf4
>>> [ 747.233784] schedule_timeout from rcu_gp_fqs_loop+0xe8/0x450
>>> [ 747.233807] rcu_gp_fqs_loop from rcu_gp_kthread+0xf0/0x110
>>> [ 747.233871] rcu_gp_kthread from kthread+0xe8/0x10c
>>> [ 747.233901] kthread from ret_from_fork+0x14/0x30
>>> [ 747.233957] Exception stack(0xf0879fb0 to 0xf0879ff8)
>>> [ 747.234087] 9fa0: 00000000 00000000 00000000 00000000
>>> [ 747.234108] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>>> [ 747.234122] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000
>>> [ 747.234251] rcu: Stack dump where RCU GP kthread last ran:
>>> [ 747.234387] Sending NMI from CPU 1 to CPUs 0:
>>> [ 747.234579] Spurious and unmasked percpu IRQ23 on CPU0
>>
>> Hard to say at the moment whether the pressure imposed on the
>> virtualized system by the test is responsible for this hang, or the
>> inter-stage synchronization in the core has issues. Any change with this
>> patch in?
>>
>> diff --git a/tests/stax-lock.c b/tests/stax-lock.c
>> index 51576d9..87511ef 100644
>> --- a/tests/stax-lock.c
>> +++ b/tests/stax-lock.c
>> @@ -66,18 +66,17 @@ static void *test_thread(void *arg)
>>  	me = 1 << serial;
>>  
>>  	oob = !!(serial & 1);
>> +	delay = running_on_vm() ? 1000000 : 100000;
>>  	if (oob) {
>>  		__Tcall_assert(tfd, evl_attach_self("stax.%ld:%d",
>>  					serial / 2, getpid()));
>>  		do_ioctl = oob_ioctl;
>>  		do_usleep = evl_usleep;
>> -		delay = 100000;
>>  		/* Any in-band presence is invalid. */
>>  		invalid = 0x55555555;
>>  	} else {
>>  		do_ioctl = ioctl;
>>  		do_usleep = usleep;
>> -		delay = 100000;
>>  		/* Any oob presence is invalid. */
>>  		invalid = 0xAAAAAAAA;
>>  	}
>
> Yep, that helps.
>
>>
>> Clearly, an improvement would not rule out some issue in the
>> implementation of the stax mechanism, but this might give us a valuable
>> hint anyway.
>>
>>>
>>> This problem is unrelated to the arm pipelining cleanup series. I'm
>>> going to post v3 now.
>>>
>>> Another finding triggered by some analysis is that we disable a couple
>>> of tests in CI. There are two tests failing often in this arm qemu
>>> setup:
>>> - clock-timer-periodic
>>> - sched-tp-accuracy
>>>
>>> The timer test is especially failing when there is some load on the
>>> host.
>>
>> Since r58, we have the running_on_vm() predicate available to test code,
>> which checks whether the "EVL_ON_VM" environment variable is set to
>> 1/y/yes/Y/YES (unfortunately, I'm not aware of any way to detect this
>> without user input like the valgrind vm allows via some hypercall).
>
> systemd-detect-virt implements a couple of mechanisms to detect a lot of
> different hypervisors. Could we depend on it?
>
>>
>> sched-tp-accuracy, sched-tp-overrun, and monitor-event-untrack have been
>> fixed up accordingly not to trigger false positives on vm.
>
> Nice.
>
> Tobias, any plans to clean that up in the CI setup?

Great! Yes, I will send a patch.

>>
>>>
>>> Now the question - mainly in the direction of Tobias:
>>> Why are the other tests disabled in CI? Namely:
>>> - sched-quota-accuracy
>>> - sched-tp-accuracy
>>> - sched-tp-overrun
>>> - monitor-event-untrack
>>>
>>> Shouldn't we better fix the tests than simply disable them? Haven't seen
>>> any failures on arm64. x86 pending.
>>>
>>
>> Yes, it would be better to fix them specifically for vm context, even if
>> that means disabling some checks based on timing accuracy.
>>
>> --
>> Philippe.

* Re: EVL 7.1 and below on armhf: System hang when running the testsuite under stress
From: Philippe Gerum @ 2026-06-23 12:17 UTC
To: Florian Bezdeka; +Cc: xenomai, Jan Kiszka, Tobias Schaffner

Florian Bezdeka <florian.bezdeka@siemens.com> writes:

>>
>> Since r58, we have the running_on_vm() predicate available to test code,
>> which checks whether the "EVL_ON_VM" environment variable is set to
>> 1/y/yes/Y/YES (unfortunately, I'm not aware of any way to detect this
>> without user input like the valgrind vm allows via some hypercall).
>
> systemd-detect-virt implements a couple of mechanisms to detect a lot of
> different hypervisors. Could we depend on it?
>

Yes, running_on_vm() could try this source if present, falling back to
the envvar method if not.

-- 
Philippe.

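The "try the tool, fall back to the variable" idea could look roughly like the sketch below. This is an assumption-laden illustration, not code from libevl: probe_with_command(), env_says_vm() and running_on_vm_sketch() are hypothetical names, and calling the tool through system() is just one possible way to do it. It relies on systemd-detect-virt exiting with 0 when virtualization is detected and non-zero otherwise, and on the shell reporting 126 or 127 when the command cannot be run, which is treated here as "tool not present".

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>

/* Tri-state outcome of one detection attempt. */
enum vm_probe { VM_UNKNOWN = -1, VM_NO = 0, VM_YES = 1 };

/*
 * Run a detector command through the shell and map its exit status:
 * 0 means "virtualized", 126/127 mean the command could not be run,
 * any other status means "not virtualized".
 */
enum vm_probe probe_with_command(const char *cmd)
{
	int status = system(cmd);

	if (status == -1 || !WIFEXITED(status))
		return VM_UNKNOWN;

	switch (WEXITSTATUS(status)) {
	case 0:
		return VM_YES;
	case 126:
	case 127:
		return VM_UNKNOWN;	/* tool missing or not executable */
	default:
		return VM_NO;
	}
}

/* Environment fallback, accepting the values quoted in the thread. */
bool env_says_vm(void)
{
	const char *val = getenv("EVL_ON_VM");

	return val != NULL &&
		(!strcmp(val, "1") || !strcmp(val, "y") || !strcmp(val, "yes") ||
		 !strcmp(val, "Y") || !strcmp(val, "YES"));
}

/* Hypothetical variant of running_on_vm(): tool first, envvar second. */
bool running_on_vm_sketch(void)
{
	enum vm_probe res =
		probe_with_command("systemd-detect-virt --quiet --vm 2>/dev/null");

	if (res != VM_UNKNOWN)
		return res == VM_YES;

	return env_says_vm();
}
```

One consequence of this ordering is that a definite answer from the tool takes precedence over EVL_ON_VM, so the variable only matters on systems where systemd-detect-virt is not installed.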
* Re: EVL 7.1 and below on armhf: System hang when running the testsuite under stress
From: Philippe Gerum @ 2026-06-23 12:30 UTC
To: Florian Bezdeka; +Cc: xenomai, Jan Kiszka, Tobias Schaffner

Philippe Gerum <rpm@xenomai.org> writes:

> Florian Bezdeka <florian.bezdeka@siemens.com> writes:
>
>>>
>>> Since r58, we have the running_on_vm() predicate available to test code,
>>> which checks whether the "EVL_ON_VM" environment variable is set to
>>> 1/y/yes/Y/YES (unfortunately, I'm not aware of any way to detect this
>>> without user input like the valgrind vm allows via some hypercall).
>>
>> systemd-detect-virt implements a couple of mechanisms to detect a lot of
>> different hypervisors. Could we depend on it?
>>
>
> Yes, running_on_vm() could try this source if present, falling back to
> the envvar method if not.

Actually, having evl-test feed EVL_ON_VM would be better. This way we
could still control the vm detection by setting EVL_ON_VM as desired
before running the test manually (i.e. without evl-test). I'm preparing
a patch for this.

-- 
Philippe.

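The design point here is that the launcher seeds EVL_ON_VM while a value set by hand still wins. The thread does not show the patch or how evl-test is implemented, so the following is only a sketch of that precedence rule in C, with a hypothetical helper name seed_evl_on_vm(); how the detection result is obtained is left out on purpose.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical launcher-side helper: export EVL_ON_VM=1 when a VM was
 * detected, unless the caller already set the variable. The final 0
 * tells setenv() not to overwrite an existing value, so a manual
 * setting always takes precedence. Returns 0 on success, -1 on error.
 */
int seed_evl_on_vm(bool detected_vm)
{
	if (!detected_vm)
		return 0;	/* leave the environment untouched */

	return setenv("EVL_ON_VM", "1", 0);
}
```

A launcher would call this once before spawning the individual tests, which inherit the environment; running a test binary directly skips the seeding, so whatever the user exported is what the test sees.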