BUG/WARN issues in kernel/sched/rt.c under stress-ng with cgroup-v2

* BUG/WARN issues in kernel/sched/rt.c under stress-ng with cgroup-v2
@ 2025-09-19 11:10 Ben Dooks
  2025-09-19 16:37 ` Matteo Martelli
  0 siblings, 1 reply; 6+ messages in thread
From: Ben Dooks @ 2025-09-19 11:10 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	linux-kernel@vger.kernel.org, Matteo Martelli, Marcel Ziswiler

We are doing some testing with stress-ng and cgroup-v2 enabled
(CONFIG_RT_GROUP_SCHED), and are running into WARN/BUG reports within a
minute, related to user-space calling sched_setattr() and possibly other
calls.

At the moment we're not sure whether the WARN and BUG calls are entirely
correct; we suspect there may be some sort of race condition that
invalidates assumptions in the code.

We are seeing this kernel BUG in pick_next_rt_entity() being triggered:

	idx = sched_find_first_bit(array->bitmap);
	BUG_ON(idx >= MAX_RT_PRIO);

This suggests that pick_task_rt() ran, decided there was something to
schedule, and called pick_next_rt_entity(), which then found nothing.
pick_task_rt() makes that decision by checking rq->rt.rt_queued before
it tries to pick anything to run.

(This BUG_ON() triggers when no bit in the array's bitmap indicates
  something to run.)
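
As a rough illustration of how idx can end up equal to MAX_RT_PRIO, here
is a minimal userspace sketch in C. It is not kernel code. It assumes
the priority bitmap carries one extra delimiter bit at position
MAX_RT_PRIO that is always set, so a first-bit search over an otherwise
empty bitmap returns exactly MAX_RT_PRIO. All model_* names are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MODEL_MAX_RT_PRIO 100u
#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define MODEL_NBITS (MODEL_MAX_RT_PRIO + 1u) /* +1: assumed delimiter bit */
#define MODEL_NWORDS \
	((MODEL_NBITS + MODEL_BITS_PER_LONG - 1) / MODEL_BITS_PER_LONG)

/* Hypothetical stand-in for the priority bitmap. */
struct model_bitmap {
	unsigned long word[MODEL_NWORDS];
};

static void model_set_bit(struct model_bitmap *b, unsigned int bit)
{
	assert(bit < MODEL_NBITS);
	b->word[bit / MODEL_BITS_PER_LONG] |= 1UL << (bit % MODEL_BITS_PER_LONG);
}

static void model_clear_bit(struct model_bitmap *b, unsigned int bit)
{
	assert(bit < MODEL_NBITS);
	b->word[bit / MODEL_BITS_PER_LONG] &= ~(1UL << (bit % MODEL_BITS_PER_LONG));
}

/* Clear every priority bit, then set the delimiter bit. */
static void model_init(struct model_bitmap *b)
{
	for (size_t i = 0; i < MODEL_NWORDS; i++)
		b->word[i] = 0;
	model_set_bit(b, MODEL_MAX_RT_PRIO);
}

/* Lowest set bit; MODEL_NBITS if no bit at all is set. */
static unsigned int model_find_first_bit(const struct model_bitmap *b)
{
	for (unsigned int bit = 0; bit < MODEL_NBITS; bit++) {
		if (b->word[bit / MODEL_BITS_PER_LONG] &
		    (1UL << (bit % MODEL_BITS_PER_LONG)))
			return bit;
	}
	return MODEL_NBITS;
}
```

In this model, a freshly initialised bitmap yields MODEL_MAX_RT_PRIO
from model_find_first_bit(), and setting any lower bit yields that bit
instead.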

We added some debug output to pick_next_rt_entity() that prints the
current rt_queued alongside the value it had when pick_task_rt()
looked, and we got:

    idx 100 bigger than MAX_RT_PRIO 100, queued = 0 (queued was 1)

This shows the code was entered with the rt_rq indicating that something
was queued, but by the time pick_next_rt_entity() was entered there
seemed to be nothing (assuming the array is in sync with the lists...)
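
The debug output compares a value sampled early with the value seen
later. A minimal userspace sketch of that check-then-act pattern is
below. It is only a model: model_rq, model_pick() and the "between" hook
are hypothetical, and the hook merely stands in for whatever might
change the runqueue between the two reads. It does not claim to identify
the real cause.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified runqueue state. */
struct model_rq {
	unsigned int rt_queued;   /* "something is queued" indicator */
	unsigned int nr_entities; /* entities actually present */
};

/* What the debug print would report. */
struct model_result {
	unsigned int queued_was; /* value sampled on entry */
	unsigned int queued_now; /* value re-read before picking */
	int picked;              /* 1 if an entity was found, else 0 */
};

typedef void (*model_hook)(struct model_rq *rq);

/* Example hook: everything is dequeued between the two reads. */
static void model_dequeue_all(struct model_rq *rq)
{
	rq->nr_entities = 0;
	rq->rt_queued = 0;
}

static struct model_result model_pick(struct model_rq *rq, model_hook between)
{
	struct model_result res = { .queued_was = 0 };

	assert(rq != NULL);
	res.queued_was = rq->rt_queued;
	if (!res.queued_was)
		return res; /* early exit: nothing was queued */

	if (between)
		between(rq); /* assumed window in which state may change */

	res.queued_now = rq->rt_queued;
	res.picked = rq->nr_entities > 0;
	return res;
}
```

Calling model_pick() with model_dequeue_all as the hook gives
queued_was = 1 and queued_now = 0, the same shape as the line printed
above.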

I think the two questions we have are:

- Is the BUG_ON() here appropriate, or would a WARN_ON_ONCE() plus
   return NULL be a better way of handling this? I am going to try
   this and see whether the system still runs with it.

- Are we seeing a race here, and if so where is the best place to
   prevent it?
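
To make the first question concrete, here is a userspace sketch in C of
the "warn once and return NULL" shape. It is not a kernel patch: the
model_* names are hypothetical, the warning is just an fprintf(), and
whether the real callers cope with a NULL return is exactly what would
need checking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MODEL_MAX_RT_PRIO 100

struct model_entity {
	int prio;
};

struct model_rt_rq {
	/* One slot per priority; NULL means nothing queued at that level. */
	struct model_entity *queue[MODEL_MAX_RT_PRIO];
};

/* Counts emitted warnings so the "once" behaviour can be observed. */
static unsigned int model_warn_count;

/* Lowest-numbered non-empty slot, or MODEL_MAX_RT_PRIO if all are empty. */
static int model_first_prio(const struct model_rt_rq *rq)
{
	for (int i = 0; i < MODEL_MAX_RT_PRIO; i++) {
		if (rq->queue[i])
			return i;
	}
	return MODEL_MAX_RT_PRIO;
}

/* Warn on the first empty pick only, and return NULL instead of aborting. */
static struct model_entity *model_pick_next(const struct model_rt_rq *rq)
{
	static bool warned;
	int idx;

	assert(rq != NULL);
	idx = model_first_prio(rq);
	if (idx >= MODEL_MAX_RT_PRIO) {
		if (!warned) {
			warned = true;
			model_warn_count++;
			fprintf(stderr, "model: nothing to pick (idx %d)\n", idx);
		}
		return NULL;
	}
	return rq->queue[idx];
}
```

The trade-off in this shape is that the inconsistency is reported once
and the caller has to handle NULL, rather than the whole system
stopping at the first occurrence.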

Note, we do have a few local backported cgroup-v2 patches.

Our systemd unit file to launch the test is here:

[Service]
Type=simple
Restart=always
ExecStartPre=/bin/sh -c 'echo 500000 > /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
ExecStartPre=/bin/sh -c 'echo 500000 > /sys/fs/cgroup/system.slice/stress-sched-long-system.service/cpu.rt_runtime_us'
ExecStart=sandbox-run /usr/bin/stress-ng --temp-path /tmp/stress-ng --timeout=0 --verify --oom-avoid --metrics --timestamp --exclude=enosys,usersyscall --cpu-sched 0 --timeout 60 --verbose --stressor-time
Environment=SANDBOX_RO_BINDMOUNTS="/usr/share/stress-ng"
Environment=SANDBOX_RW_BINDMOUNTS="/var/log /sys /proc /dev /tmp/stress-ng"
Environment=SANDBOX_EXTRA_ARGS="--cwd /tmp/stress-ng --keep_caps --disable_rlimits --disable_clone_newuser"
Slice=system.slice
OOMPolicy=continue

I added this to dump the array and confirm that at least the bitmap and
the lists were in sync at the point of the bug:

static inline void debug_pick_next(struct rt_rq *rt_rq, int idx, unsigned qs)
{
	struct rt_prio_array *array = &rt_rq->active;
	unsigned int nr;

	pr_err("rt_q %px: idx %d bigger than MAX_RT_PRIO %d, queued = %d (was %u)\n",
	       rt_rq, idx, MAX_RT_PRIO, rt_rq->rt_queued, qs);

	// dump the bitmap one word at a time
	for (nr = 0; nr < MAX_RT_PRIO; nr += sizeof(array->bitmap[0])*8) {
		pr_info("  bitmap idx %u: %lx\n", nr,
			array->bitmap[nr/(sizeof(array->bitmap[0])*8)]);
	}

	// check that the bitmap and the per-priority lists match
	for (nr = 0; nr < MAX_RT_PRIO; nr += 1) {
		bool l_empty = list_empty(array->queue + nr);
		bool a_empty = !test_bit(nr, array->bitmap);

		if (l_empty != a_empty) {
			// l_empty comes from the list, a_empty from the bitmap
			pr_err(" bitmap idx %u: list %s, bitmap %s\n", nr,
			       l_empty ? "empty" : "full",
			       a_empty ? "empty" : "full");
		}
	}
}
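
For anyone who wants to experiment with the same check outside the
kernel, here is a small userspace sketch in C of the bitmap-versus-list
comparison. It is a model under stated assumptions: a bool array stands
in for the bitmap, a length counter stands in for each per-priority
list, and all model_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_RT_PRIO 100

struct model_prio_array {
	bool bit[MODEL_MAX_RT_PRIO];         /* stands in for the bitmap */
	unsigned int len[MODEL_MAX_RT_PRIO]; /* stands in for list length */
};

/*
 * Count priorities where "bit set" and "list non-empty" disagree.
 * If first is non-NULL it receives the first mismatching priority,
 * or -1 when the two views agree everywhere.
 */
static unsigned int model_count_mismatches(const struct model_prio_array *a,
					   int *first)
{
	unsigned int mismatches = 0;

	assert(a != NULL);
	if (first)
		*first = -1;

	for (int nr = 0; nr < MODEL_MAX_RT_PRIO; nr++) {
		bool list_is_empty = a->len[nr] == 0;
		bool bit_is_clear = !a->bit[nr];

		if (list_is_empty != bit_is_clear) {
			if (first && *first < 0)
				*first = nr;
			mismatches++;
		}
	}
	return mismatches;
}
```

A result of zero corresponds to the debug function above printing no
per-index mismatch lines.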
	


-- 
Ben Dooks				http://www.codethink.co.uk/
Senior Engineer				Codethink - Providing Genius

https://www.codethink.co.uk/privacy.html




* Re: BUG/WARN issues in kernel/sched/rt.c under stress-ng with cgroup-v2
  2025-09-19 11:10 BUG/WARN issues in kernel/sched/rt.c under stress-ng with cgroup-v2 Ben Dooks
@ 2025-09-19 16:37 ` Matteo Martelli
  2025-09-23 18:14   ` Dietmar Eggemann
  0 siblings, 1 reply; 6+ messages in thread
From: Matteo Martelli @ 2025-09-19 16:37 UTC (permalink / raw)
  To: Ben Dooks, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, linux-kernel@vger.kernel.org, Marcel Ziswiler,
	Matteo Martelli

Hi all,

On Fri, 19 Sep 2025 12:10:34 +0100, Ben Dooks <ben.dooks@codethink.co.uk> wrote:
> We are doing some testing with stress-ng and the cgroup-v2 enabled
> (CONFIG_RT_GROUP_SCHED) and are running into WARN/BUG within a minute
> related to user-space calling sched_setattr() and possibly other calls.
> 
> At the moment we're not sure if the WARN and BUG calls are entirely
> correct, we are considering there may be some sort of race condition
> which is causing incorrect assumptions in the code.
> 
> We are seeing this kernel bug in pick_next_rt_entity being triggered
> 
> 	idx = sched_find_first_bit(array->bitmap);
> 	BUG_ON(idx >= MAX_RT_PRIO);
> 
> Which suggests that the pick_task_rt() ran, thought there was something
> there to schedule and got into pick_next_rt_entity() which then found
> there was nothing. It does this by checking rq->rt.rt_queued before it
> bothers to try picking something to run.
> 
> (this BUG_ON() is triggered if there is no index in the array indicating
>   something there to run)
> 
> We added some debug to find out what the values in pick_next_rt_entity()
> with the current rt_queued and the value it was when pick_task_rt()
> looked, and we got:
> 
>     idx 100 bigger than MAX_RT_PRIO 100, queued = 0 (queued was 1)
> 
> This shows the code was entered with the rt_q showing something
> should have been queued and by the time the pick_next_rt_entity()
> was entered there seems to be nothing (assuming the array is in
> sync with the lists...)
> 
> I think the two questions we have are:
> 
> - Is the BUG_ON() here appropriate, should a WARN_ON_ONCE() and
>    return NULL be the best way of handling this? I am going to try
>    this and see if the system is still runnable with this.
> 
> - Are we seeing a race here, and if so where is the best place to
>    prevent it?
> 
> Note, we do have a few local backported cgroup-v2 patches.
> 
> Our systemd unit file to launch the test is here:
> 
> [Service]
> Type=simple
> Restart=always
> ExecStartPre=/bin/sh -c 'echo 500000 > 
> /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
> ExecStartPre=/bin/sh -c 'echo 500000 > 
> /sys/fs/cgroup/system.slice/stress-sched-long-system.service/cpu.rt_runtime_us'
> ExecStart=sandbox-run /usr/bin/stress-ng --temp-path /tmp/stress-ng 
> --timeout=0 --verify --oom-avoid --metrics --timestamp 
> --exclude=enosys,usersyscall --cpu-sched 0 --timeout 60 --verbose 
> --stressor-time
> Environment=SANDBOX_RO_BINDMOUNTS="/usr/share/stress-ng"
> Environment=SANDBOX_RW_BINDMOUNTS="/var/log /sys /proc /dev /tmp/stress-ng"
> Environment=SANDBOX_EXTRA_ARGS="--cwd /tmp/stress-ng --keep_caps 
> --disable_rlimits --disable_clone_newuser"
> Slice=system.slice
> OOMPolicy=continue
> 
> I added this to dump the array and confirm at-least the array-v-list
> was in sync at the point of the bug:
> 
> static inline void debug_pick_next(struct rt_rq *rt_rq, int idx, 
> unsigned qs)
> {
> 	struct rt_prio_array *array = &rt_rq->active;
> 	unsigned int nr;
> 
> 	pr_err("rt_q %px: idx %d bigger than MAX_RT_PRIO %d, queued = %d (was 
> %u)\n",
> 	       rt_rq, idx, MAX_RT_PRIO, rt_rq->rt_queued, qs );
> 
> 	for (nr = 0; nr < MAX_RT_PRIO; nr += sizeof(array->bitmap[0])*8) {
> 		pr_info("  bitmap idx %u: %lx\n", nr, 
> array->bitmap[nr/(sizeof(array->bitmap[0])*8)]);
> 	}
> 
> 	// check that the bitmap and array match
> 	for (nr = 0; nr < MAX_RT_PRIO; nr += 1) {
> 		bool l_empty = list_empty(array->queue + nr);
> 		bool a_empty = !test_bit(nr, array->bitmap);
> 
> 		if (l_empty != a_empty) {
> 			pr_err(" bitmap idx %u: array %s, bitmask %s\n", nr,
> 			       a_empty ? "empty" : "full",
> 			       l_empty ? "empty" : "full");
> 		}
> 	}
> }
> 	

To provide some more context: we found this issue while running tests
with the stress-ng scheduler stressor[1] and the RT throttling feature
after enabling the RT_GROUP_SCHED kernel option. Note that we also have
PREEMPT_RT enabled in our config.

I've just reproduced the issue on qemu-x86_64 with a Debian image and
kernel v6.17-rc6. The steps to reproduce it are below.

cd linux
git reset --hard v6.17-rc6 && git clean -f -d

# Apply patch to expose RT_GROUP_SCHED interface to userspace with cgroupv2
b4 shazam --single-message https://lore.kernel.org/all/20250731105543.40832-17-yurand2000@gmail.com/

# Build kernel with defconfig + PREEMPT_RT=y and RT_GROUP_SCHED=y
make mrproper
make defconfig
scripts/config -k -e EXPERT
scripts/config -k -e PREEMPT_RT
scripts/config -k -e RT_GROUP_SCHED
make olddefconfig
make -j12

# Download a debian image and run qemu
wget https://cdimage.debian.org/images/cloud/sid/daily/20250919-2240/debian-sid-nocloud-amd64-daily-20250919-2240.qcow2
qemu-system-x86_64 \
    -m 2G -smp 4 \
    -nographic \
    -nic user,hostfwd=tcp::2222-:22 \
    -M q35,accel=kvm \
    -drive format=qcow2,file=debian-sid-nocloud-amd64-daily-20250919-2240.qcow2 \
    -virtfs local,path=.,mount_tag=shared,security_model=mapped-xattr \
    -monitor none \
    -append "root=/dev/sda1 console=ttyS0,115200 sysctl.kernel.panic_on_oops=1" \
    -kernel arch/x86/boot/bzImage

# Then inside guest machine
# Install stress-ng
apt-get update && apt-get install stress-ng

# Create the stress-ng service. It sets the group RT runtime to 500ms
# (50% BW) via the cgroupv2 interface, then starts the stress-ng
# scheduler stressor. Note also that the CPU affinity is restricted to a
# single CPU, which seems to make the issue easier to reproduce.
echo "[Unit]
Description=Mixed stress with long in the system slice
After=basic.target

[Service]
AllowedCPUs=0
Type=simple
Restart=always
ExecStartPre=/bin/sh -c 'echo 500000 > /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
ExecStart=/usr/bin/stress-ng --timeout=0 --verify --oom-avoid --metrics --timestamp --exclude=enosys,usersyscall --cpu-sched 0 --timeout 60 --verbose --stressor-time
Slice=system.slice
OOMPolicy=continue" > /etc/systemd/system/stress-sched-long-system.service

systemctl start stress-sched-long-system.service

Then the BUG_ON is triggered within a few minutes. See the following logs.

[  345.657737] ------------[ cut here ]------------
[  345.657741] kernel BUG at kernel/sched/rt.c:1673!
[  345.657746] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[  345.657749] CPU: 0 UID: 0 PID: 379 Comm: stress-ng-cpu-s Not tainted 6.17.0-rc6-00001-g6c9be1b0be15 #1 PREEMPT_{RT,(full)}
[  345.657750] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.17.0-1-1 04/01/2014
[  345.657751] RIP: 0010:pick_task_rt+0x6c/0x80
[  345.657762] Code: 85 c0 74 16 48 8b 78 40 48 85 ff 75 c6 48 2d 80 01 00 00 c3 cc cc cc cc 31 c0 c3 cc cc cc cc f3 48 0f bc c0 83 f8 63 7
[  345.657763] RSP: 0018:ffff95de8071bcf8 EFLAGS: 00010002
[  345.657765] RAX: 0000000000000064 RBX: ffff8bd585ab9e00 RCX: 0000000000000000
[  345.657765] RDX: 0000000000000000 RSI: ffff8bd585ab9e00 RDI: ffff8bd5fdc29400
[  345.657766] RBP: ffff95de8071bd70 R08: 0000000000000004 R09: ffff8bd5fdc29200
[  345.657766] R10: 0000000000000001 R11: 000000000000000a R12: ffff8bd585ab9e00
[  345.657767] R13: ffffffff97593180 R14: ffff8bd5fdc29200 R15: 0000000000000000
[  345.657770] FS:  00007f339538fb00(0000) GS:ffff8bd665a0f000(0000) knlGS:0000000000000000
[  345.657770] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  345.657771] CR2: 000056316fe06320 CR3: 0000000006f0e000 CR4: 00000000000006f0
[  345.657772] Call Trace:
[  345.657775]  <TASK>
[  345.657775]  __schedule+0x488/0xf30
[  345.657779]  preempt_schedule+0x2e/0x50
[  345.657780]  preempt_schedule_thunk+0x16/0x30
[  345.657782]  migrate_enable+0xbc/0xd0
[  345.657784]  rt_spin_unlock+0xd/0x40
[  345.657787]  get_signal+0x765/0x8d0
[  345.657789]  ? do_nanosleep+0xe9/0x170
[  345.657791]  arch_do_signal_or_restart+0x38/0x250
[  345.657793]  exit_to_user_mode_loop+0x6b/0xb0
[  345.657796]  do_syscall_64+0x221/0x290
[  345.657798]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  345.657800] RIP: 0033:0x7f3395c4f687
[  345.657801] Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff
[  345.657802] RSP: 002b:00007ffd2cbc0270 EFLAGS: 00000202 ORIG_RAX: 00000000000000e6
[  345.657803] RAX: 0000000000000000 RBX: 00007f339538fb00 RCX: 00007f3395c4f687
[  345.657803] RDX: 00007ffd2cbc02b0 RSI: 0000000000000000 RDI: 0000000000000000
[  345.657804] RBP: 000056316fe06320 R08: 0000000000000000 R09: 0000000000000000
[  345.657804] R10: 00007ffd2cbc02c0 R11: 0000000000000202 R12: 000000000000017b
[  345.657804] R13: 000056315f837030 R14: 0000000000000003 R15: 0000000000000001
[  345.657805]  </TASK>
[  345.657805] Modules linked in:
[  345.657807] ---[ end trace 0000000000000000 ]---
[  345.657807] RIP: 0010:pick_task_rt+0x6c/0x80
[  345.657809] Code: 85 c0 74 16 48 8b 78 40 48 85 ff 75 c6 48 2d 80 01 00 00 c3 cc cc cc cc 31 c0 c3 cc cc cc cc f3 48 0f bc c0 83 f8 63 7e c0 90 <0f> 0b 90 0f 0b 90 31 c0 c3 cc cc cc cc 0f 1f 80 00 00 00 00 90 90
[  345.657809] RSP: 0018:ffff95de8071bcf8 EFLAGS: 00010002
[  345.657810] RAX: 0000000000000064 RBX: ffff8bd585ab9e00 RCX: 0000000000000000
[  345.657810] RDX: 0000000000000000 RSI: ffff8bd585ab9e00 RDI: ffff8bd5fdc29400
[  345.657810] RBP: ffff95de8071bd70 R08: 0000000000000004 R09: ffff8bd5fdc29200
[  345.657811] R10: 0000000000000001 R11: 000000000000000a R12: ffff8bd585ab9e00
[  345.657811] R13: ffffffff97593180 R14: ffff8bd5fdc29200 R15: 0000000000000000
[  345.657814] FS:  00007f339538fb00(0000) GS:ffff8bd665a0f000(0000) knlGS:0000000000000000
[  345.657814] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  345.657815] CR2: 000056316fe06320 CR3: 0000000006f0e000 CR4: 00000000000006f0
[  345.657815] Kernel panic - not syncing: Fatal exception
[  345.657969] Kernel Offset: 0x14e00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  345.685385] ---[ end Kernel panic - not syncing: Fatal exception ]---

Also, the WARNING in __dequeue_rt_entity() is often hit:

[  117.550503] ------------[ cut here ]------------
[  117.550505] WARNING: CPU: 0 PID: 398 at kernel/sched/rt.c:1366 dequeue_rt_stack+0x311/0x330
[  117.550518] Modules linked in:
[  117.550521] CPU: 0 UID: 0 PID: 398 Comm: stress-ng-cpu-s Not tainted 6.17.0-rc6-00001-g6c9be1b0be15 #1 PREEMPT_{RT,(full)}
[  117.550523] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.17.0-1-1 04/01/2014
[  117.550524] RIP: 0010:dequeue_rt_stack+0x311/0x330
[  117.550526] Code: 06 00 00 e9 46 fe ff ff 90 0f 0b 90 85 c0 75 06 90 0f 0b 90 31 c0 b9 01 00 00 00 48 85 d2 0f 85 ce fd ff ff e9 cf fd f
[  117.550526] RSP: 0018:ffffb4af008cbce0 EFLAGS: 00010046
[  117.550528] RAX: 0000000000000000 RBX: ffff979604108120 RCX: ffff979601febc80
[  117.550528] RDX: ffff979604108120 RSI: 0000000000000006 RDI: ffff97967dc29400
[  117.550529] RBP: ffff979604108120 R08: 00000000000e7ef0 R09: ffff979601febc00
[  117.550529] R10: 0000000000000001 R11: 0000000000000002 R12: ffff97967dc29400
[  117.550530] R13: 0000000000000006 R14: 0000000000000002 R15: ffffffff97393180
[  117.550533] FS:  00007f2a872eeb00(0000) GS:ffff9796e5c0f000(0000) knlGS:0000000000000000
[  117.550534] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  117.550534] CR2: 00005633eab27d00 CR3: 00000000041c6000 CR4: 00000000000006f0
[  117.550535] Call Trace:
[  117.550538]  <TASK>
[  117.550539]  dequeue_rt_entity+0x29/0x160
[  117.550541]  dequeue_task_rt+0x25/0x40
[  117.550542]  rt_mutex_setprio+0x356/0x520
[  117.550545]  rt_mutex_slowunlock+0x15c/0x290
[  117.550548]  ? __set_cpus_allowed_ptr+0x5f/0xa0
[  117.550549]  ? migrate_enable+0x6a/0xd0
[  117.550550]  do_send_sig_info+0x61/0xa0
[  117.550553]  kill_pid_info_type+0x8d/0xa0
[  117.550555]  kill_something_info+0x16b/0x1a0
[  117.550556]  __x64_sys_kill+0x88/0xb0
[  117.550557]  do_syscall_64+0xa4/0x290
[  117.550560]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  117.550561] RIP: 0033:0x7f2a87b5f007
[  117.550562] Code: 48 83 c4 08 c3 66 0f 1f 44 00 00 48 8b 15 e9 6d 1a 00 64 89 02 b8 ff ff ff ff eb e4 0f 1f 80 00 00 00 00 b8 3e 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c1 6d 1a 00 f7 d8 64 89 01 48
[  117.550563] RSP: 002b:00007ffefab861c8 EFLAGS: 00000202 ORIG_RAX: 000000000000003e
[  117.550564] RAX: ffffffffffffffda RBX: 00007f2a872d8a00 RCX: 00007f2a87b5f007
[  117.550565] RDX: 0000000000000003 RSI: 0000000000000012 RDI: 00000000000001af
[  117.550565] RBP: 0000000000000002 R08: 000acce4c998f093 R09: 0000000000000000
[  117.550565] R10: 00007f2a8909d000 R11: 0000000000000202 R12: 0000000000000004
[  117.550566] R13: 0000000000000001 R14: 00007ffefab86420 R15: 00000000000001af
[  117.550567]  </TASK>
[  117.550567] ---[ end trace 0000000000000000 ]---

and sometimes we see RCU stalls:

[  453.738633] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  453.738636] rcu:     Tasks blocked on level-0 rcu_node (CPUs 0-3): P853/3:b..l
[  453.738638] rcu:     (detected by 0, t=21002 jiffies, g=1477, q=122 ncpus=4)
[  453.738640] task:stress-ng-cpu-s state:R  running task     stack:14200 pid:853   tgid:853   ppid:849    task_flags:0x400140 flags:0x0000
[  453.738644] Call Trace:
[  453.738645]  <TASK>
[  453.738646]  __schedule+0x3c9/0xf30
[  453.738651]  schedule_rtlock+0x15/0x30
[  453.738652]  rtlock_slowlock_locked+0x1b6/0x1090
[  453.738654]  rt_spin_lock+0x79/0xd0
[  453.738656]  do_send_sig_info+0x31/0xa0
[  453.738659]  kill_pid_info_type+0x8d/0xa0
[  453.738661]  kill_something_info+0x16b/0x1a0
[  453.738662]  __x64_sys_kill+0x88/0xb0
[  453.738663]  do_syscall_64+0xa4/0x290
[  453.738665]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  453.738667] RIP: 0033:0x7f4df5aeb007
[  453.738669] RSP: 002b:00007ffffc9d8558 EFLAGS: 00000202 ORIG_RAX: 000000000000003e
[  453.738670] RAX: ffffffffffffffda RBX: 00007f4df5261b20 RCX: 00007f4df5aeb007
[  453.738671] RDX: 0000000000000012 RSI: 0000000000000012 RDI: 000000000000035e
[  453.738671] RBP: 0000000000000003 R08: 001053484f787e79 R09: 0000000000000000
[  453.738672] R10: 00007f4df7029000 R11: 0000000000000202 R12: 0000000000000004
[  453.738672] R13: 0000000000000001 R14: 00007ffffc9d8738 R15: 000000000000035e
[  453.738673]  </TASK>

[1]: https://github.com/ColinIanKing/stress-ng/blob/master/stress-cpu-sched.c

I hope the additional information is helpful.

Best regards,
Matteo Martelli


* Re: BUG/WARN issues in kernel/sched/rt.c under stress-ng with cgroup-v2
  2025-09-19 16:37 ` Matteo Martelli
@ 2025-09-23 18:14   ` Dietmar Eggemann
  2025-09-24 13:10     ` Matteo Martelli
  0 siblings, 1 reply; 6+ messages in thread
From: Dietmar Eggemann @ 2025-09-23 18:14 UTC (permalink / raw)
  To: Matteo Martelli, Ben Dooks, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, linux-kernel@vger.kernel.org,
	Marcel Ziswiler

On 19.09.25 18:37, Matteo Martelli wrote:
> Hi all,
> 
> On Fri, 19 Sep 2025 12:10:34 +0100, Ben Dooks <ben.dooks@codethink.co.uk> wrote:
>> We are doing some testing with stress-ng and the cgroup-v2 enabled
>> (CONFIG_RT_GROUP_SCHED) and are running into WARN/BUG within a minute
>> related to user-space calling sched_setattr() and possibly other calls.
>>
>> At the moment we're not sure if the WARN and BUG calls are entirely
>> correct, we are considering there may be some sort of race condition
>> which is causing incorrect assumptions in the code.
>>
>> We are seeing this kernel bug in pick_next_rt_entity being triggered
>>
>> 	idx = sched_find_first_bit(array->bitmap);
>> 	BUG_ON(idx >= MAX_RT_PRIO);
>>
>> Which suggests that the pick_task_rt() ran, thought there was something
>> there to schedule and got into pick_next_rt_entity() which then found
>> there was nothing. It does this by checking rq->rt.rt_queued before it
>> bothers to try picking something to run.
>>
>> (this BUG_ON() is triggered if there is no index in the array indicating
>>   something there to run)
>>
>> We added some debug to find out what the values in pick_next_rt_entity()
>> with the current rt_queued and the value it was when pick_task_rt()
>> looked, and we got:
>>
>>     idx 100 bigger than MAX_RT_PRIO 100, queued = 0 (queued was 1)
>>
>> This shows the code was entered with the rt_q showing something
>> should have been queued and by the time the pick_next_rt_entity()
>> was entered there seems to be nothing (assuming the array is in
>> sync with the lists...)
>>
>> I think the two questions we have are:
>>
>> - Is the BUG_ON() here appropriate, should a WARN_ON_ONCE() and
>>    return NULL be the best way of handling this? I am going to try
>>    this and see if the system is still runnable with this.
>>
>> - Are we seeing a race here, and if so where is the best place to
>>    prevent it?
>>
>> Note, we do have a few local backported cgroup-v2 patches.
>>
>> Our systemd unit file to launch the test is here:
>>
>> [Service]
>> Type=simple
>> Restart=always
>> ExecStartPre=/bin/sh -c 'echo 500000 > 
>> /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
>> ExecStartPre=/bin/sh -c 'echo 500000 > 
>> /sys/fs/cgroup/system.slice/stress-sched-long-system.service/cpu.rt_runtime_us'
>> ExecStart=sandbox-run /usr/bin/stress-ng --temp-path /tmp/stress-ng 
>> --timeout=0 --verify --oom-avoid --metrics --timestamp 
>> --exclude=enosys,usersyscall --cpu-sched 0 --timeout 60 --verbose 
>> --stressor-time
>> Environment=SANDBOX_RO_BINDMOUNTS="/usr/share/stress-ng"
>> Environment=SANDBOX_RW_BINDMOUNTS="/var/log /sys /proc /dev /tmp/stress-ng"
>> Environment=SANDBOX_EXTRA_ARGS="--cwd /tmp/stress-ng --keep_caps 
>> --disable_rlimits --disable_clone_newuser"
>> Slice=system.slice
>> OOMPolicy=continue

[...]

> Hi all,
> 
> To provide some more context, we have found out this issue while running
> some tests with stress-ng scheduler stressor[1] and the RT throttling
> feature after enabling the RT_GROUP_SCHED kernel option. Note that we
> also have PREEMPT_RT enabled in our config.
> 
> I've just reproduced the issue on qemu-x86_64 with a debian image and kernel
> v6.17-rc6. See below the steps to reproduce it.
> 
> cd linux
> git reset --hard v6.17-rc6 && git clean -f -d
> 
> # Apply patch to expose RT_GROUP_SCHED interface to userspace with cgroupv2
> b4 shazam --single-message https://lore.kernel.org/all/20250731105543.40832-17-yurand2000@gmail.com/

Don't get this one ... you just pick a single patch from the RFC
patch-set '[RFC PATCH v2 00/25]  Hierarchical Constant Bandwidth Server' ?

https://lore.kernel.org/r/20250731105543.40832-1-yurand2000@gmail.com


> # Build kernel with defconfig + PREEMPT_RT=y and RT_GROUP_SCHED=y
> make mrproper
> make defconfig
> scripts/config -k -e EXPERT
> scripts/config -k -e PREEMPT_RT
> scripts/config -k -e RT_GROUP_SCHED
> make olddefconfig
> make -j12
> 
> # Download a debian image and run qemu
> wget https://cdimage.debian.org/images/cloud/sid/daily/20250919-2240/debian-sid-nocloud-amd64-daily-20250919-2240.qcow2
> qemu-system-x86_64 \
>     -m 2G -smp 4 \
>     -nographic \
>     -nic user,hostfwd=tcp::2222-:22 \
>     -M q35,accel=kvm \
>     -drive format=qcow2,file=debian-sid-nocloud-amd64-daily-20250919-2240.qcow2 \
>     -virtfs local,path=.,mount_tag=shared,security_model=mapped-xattr \
>     -monitor none \
>     -append "root=/dev/sda1 console=ttyS0,115200 sysctl.kernel.panic_on_oops=1" \
>     -kernel arch/x86/boot/bzImage
> 
> # Then inside guest machine
> # Install stress-ng
> apt-get update && apt-get install stress-ng
> 
> # Create the stress-ng service. It sets the group RT runtime to 500ms
> # (50% BW) via the cgroupv2 interface then it starts the stress-ng
> # scheduler stressor. Also note the cpu affinity set to a single CPU
> # which seems to help the issue to be more reproducible.

I assume this is the 'AllowedCPUs=0' line in the systemd service file.

> echo "[Unit]
> Description=Mixed stress with long in the system slice
> After=basic.target
> 
> [Service]
> AllowedCPUs=0
> Type=simple
> Restart=always
> ExecStartPre=/bin/sh -c 'echo 500000 > /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
> ExecStart=/usr/bin/stress-ng --timeout=0 --verify --oom-avoid --metrics --timestamp --exclude=enosys,usersyscall --cpu-sched 0 --


I assume you get 4 stressors since you run 'qemu -smp 4'? How many
stress-ng related tasks do you have running in
'system.slice/stress-sched-long-system.service'? And are all of them on
CPU0?

[...]


* Re: BUG/WARN issues in kernel/sched/rt.c under stress-ng with cgroup-v2
  2025-09-23 18:14   ` Dietmar Eggemann
@ 2025-09-24 13:10     ` Matteo Martelli
  2025-10-22 17:57       ` Ben Dooks
  0 siblings, 1 reply; 6+ messages in thread
From: Matteo Martelli @ 2025-09-24 13:10 UTC (permalink / raw)
  To: Dietmar Eggemann, Ben Dooks, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, linux-kernel@vger.kernel.org,
	Marcel Ziswiler, Matteo Martelli

Hi Dietmar,

On Tue, 23 Sep 2025 20:14:18 +0200, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 19.09.25 18:37, Matteo Martelli wrote:
> > Hi all,
> > 
> > On Fri, 19 Sep 2025 12:10:34 +0100, Ben Dooks <ben.dooks@codethink.co.uk> wrote:
> >> We are doing some testing with stress-ng and the cgroup-v2 enabled
> >> (CONFIG_RT_GROUP_SCHED) and are running into WARN/BUG within a minute
> >> related to user-space calling sched_setattr() and possibly other calls.
> >>
> >> At the moment we're not sure if the WARN and BUG calls are entirely
> >> correct, we are considering there may be some sort of race condition
> >> which is causing incorrect assumptions in the code.
> >>
> >> We are seeing this kernel bug in pick_next_rt_entity being triggered
> >>
> >> 	idx = sched_find_first_bit(array->bitmap);
> >> 	BUG_ON(idx >= MAX_RT_PRIO);
> >>
> >> Which suggests that the pick_task_rt() ran, thought there was something
> >> there to schedule and got into pick_next_rt_entity() which then found
> >> there was nothing. It does this by checking rq->rt.rt_queued before it
> >> bothers to try picking something to run.
> >>
> >> (this BUG_ON() is triggered if there is no index in the array indicating
> >>   something there to run)
> >>
> >> We added some debug to find out what the values in pick_next_rt_entity()
> >> with the current rt_queued and the value it was when pick_task_rt()
> >> looked, and we got:
> >>
> >>     idx 100 bigger than MAX_RT_PRIO 100, queued = 0 (queued was 1)
> >>
> >> This shows the code was entered with the rt_q showing something
> >> should have been queued and by the time the pick_next_rt_entity()
> >> was entered there seems to be nothing (assuming the array is in
> >> sync with the lists...)
> >>
> >> I think the two questions we have are:
> >>
> >> - Is the BUG_ON() here appropriate, should a WARN_ON_ONCE() and
> >>    return NULL be the best way of handling this? I am going to try
> >>    this and see if the system is still runnable with this.
> >>
> >> - Are we seeing a race here, and if so where is the best place to
> >>    prevent it?
> >>
> >> Note, we do have a few local backported cgroup-v2 patches.
> >>
> >> Our systemd unit file to launch the test is here:
> >>
> >> [Service]
> >> Type=simple
> >> Restart=always
> >> ExecStartPre=/bin/sh -c 'echo 500000 > 
> >> /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
> >> ExecStartPre=/bin/sh -c 'echo 500000 > 
> >> /sys/fs/cgroup/system.slice/stress-sched-long-system.service/cpu.rt_runtime_us'
> >> ExecStart=sandbox-run /usr/bin/stress-ng --temp-path /tmp/stress-ng 
> >> --timeout=0 --verify --oom-avoid --metrics --timestamp 
> >> --exclude=enosys,usersyscall --cpu-sched 0 --timeout 60 --verbose 
> >> --stressor-time
> >> Environment=SANDBOX_RO_BINDMOUNTS="/usr/share/stress-ng"
> >> Environment=SANDBOX_RW_BINDMOUNTS="/var/log /sys /proc /dev /tmp/stress-ng"
> >> Environment=SANDBOX_EXTRA_ARGS="--cwd /tmp/stress-ng --keep_caps 
> >> --disable_rlimits --disable_clone_newuser"
> >> Slice=system.slice
> >> OOMPolicy=continue
> 
> [...]
> 
> > Hi all,
> > 
> > To provide some more context, we have found out this issue while running
> > some tests with stress-ng scheduler stressor[1] and the RT throttling
> > feature after enabling the RT_GROUP_SCHED kernel option. Note that we
> > also have PREEMPT_RT enabled in our config.
> > 
> > I've just reproduced the issue on qemu-x86_64 with a debian image and kernel
> > v6.17-rc6. See below the steps to reproduce it.
> > 
> > cd linux
> > git reset --hard v6.17-rc6 && git clean -f -d
> > 
> > # Apply patch to expose RT_GROUP_SCHED interface to userspace with cgroupv2
> > b4 shazam --single-message https://lore.kernel.org/all/20250731105543.40832-17-yurand2000@gmail.com/
> 
> Don't get this one ... you just pick a single patch from the RFC
> patch-set '[RFC PATCH v2 00/25]  Hierarchical Constant Bandwidth Server' ?
> 
> https://lore.kernel.org/r/20250731105543.40832-1-yurand2000@gmail.com
> 

Yes, I was looking for a way to set the cpu.rt_runtime_us parameter for
a specific cgroup from a systemd unit, in order to control the maximum
CPU bandwidth allowed for a systemd slice. Since systemd deprecated
support for cgroupv1, I picked that patch to export the parameters via
cgroupv2. To my understanding, with that patch, setting the
rt_runtime_us and rt_period_us parameters via cgroupv2 should have the
same effect as setting them via cgroupv1. Of course I could have missed
something, and that could be one reason for the issue. I will look into
it further and try to see whether the issue is still reproducible with
cgroupv1.
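
For reference, the "50% BW" figure in the reproduction notes is just the
ratio of runtime to period. A small C sketch of that arithmetic follows.
It assumes an rt_period_us of 1000000 (one second), which is not stated
anywhere in this thread, and model_rt_bw_percent() is a hypothetical
helper, not a kernel or systemd interface.

```c
#include <assert.h>
#include <stdint.h>

/*
 * RT bandwidth share in percent, rounded down.
 * Assumes period_us is non-zero and runtime_us does not exceed it;
 * special values such as "unlimited" are deliberately not modelled.
 */
static unsigned int model_rt_bw_percent(uint64_t runtime_us, uint64_t period_us)
{
	assert(period_us > 0);
	assert(runtime_us <= period_us);
	return (unsigned int)(runtime_us * 100u / period_us);
}
```

With the 500000 written to cpu.rt_runtime_us and the assumed period of
1000000, model_rt_bw_percent(500000, 1000000) evaluates to 50.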

> 
> > # Build kernel with defconfig + PREEMPT_RT=y and RT_GROUP_SCHED=y
> > make mrproper
> > make defconfig
> > scripts/config -k -e EXPERT
> > scripts/config -k -e PREEMPT_RT
> > scripts/config -k -e RT_GROUP_SCHED
> > make olddefconfig
> > make -j12
> > 
> > # Download a debian image and run qemu
> > wget https://cdimage.debian.org/images/cloud/sid/daily/20250919-2240/debian-sid-nocloud-amd64-daily-20250919-2240.qcow2
> > qemu-system-x86_64 \
> >     -m 2G -smp 4 \
> >     -nographic \
> >     -nic user,hostfwd=tcp::2222-:22 \
> >     -M q35,accel=kvm \
> >     -drive format=qcow2,file=debian-sid-nocloud-amd64-daily-20250919-2240.qcow2 \
> >     -virtfs local,path=.,mount_tag=shared,security_model=mapped-xattr \
> >     -monitor none \
> >     -append "root=/dev/sda1 console=ttyS0,115200 sysctl.kernel.panic_on_oops=1" \
> >     -kernel arch/x86/boot/bzImage
> > 
> > # Then inside guest machine
> > # Install stress-ng
> > apt-get update && apt-get install stress-ng
> > 
> > # Create the stress-ng service. It sets the group RT runtime to 500ms
> > # (50% BW) via the cgroupv2 interface then it starts the stress-ng
> > # scheduler stressor. Also note the cpu affinity set to a single CPU
> > # which seems to help the issue to be more reproducible.
> 
> I assume this is the 'AllowedCPUs=0' line in the systemd service file.

Yes, correct.

> 
> > echo "[Unit]
> > Description=Mixed stress with long in the system slice
> > After=basic.target
> > 
> > [Service]
> > AllowedCPUs=0
> > Type=simple
> > Restart=always
> > ExecStartPre=/bin/sh -c 'echo 500000 > /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
> > ExecStart=/usr/bin/stress-ng --timeout=0 --verify --oom-avoid --metrics --timestamp --exclude=enosys,usersyscall --cpu-sched 0 --
> 
> 
> I assume you get 4 stressors since you run 'qemu -smp 4'? How many
> stress-ng related tasks have you running in
> 'system.slice/stress-sched-long-system.service'? And all of them on CPU0?

Yes, with --cpu-sched 0, stress-ng is using 4 scheduler stressors all
running on CPU 0. To my understanding each scheduler stressor forks 16
stress-ng child tasks [1], this is confirmed by the number of stress-ng
tasks running on the system. The test itself is not particularly
meaningful, it just reflects the setup I had when I found the BUG_ON.

> [...]
> 

[1]: https://github.com/ColinIanKing/stress-ng/blob/V0.19.04/stress-cpu-sched.c#L66

Best regards,
Matteo Martelli


* Re: BUG/WARN issues in kernel/sched/rt.c under stress-ng with crgoup-v2
  2025-09-24 13:10     ` Matteo Martelli
@ 2025-10-22 17:57       ` Ben Dooks
  2025-10-23 10:04         ` Ben Dooks
  0 siblings, 1 reply; 6+ messages in thread
From: Ben Dooks @ 2025-10-22 17:57 UTC (permalink / raw)
  To: Matteo Martelli, Dietmar Eggemann, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, linux-kernel@vger.kernel.org,
	Marcel Ziswiler

On 24/09/2025 14:10, Matteo Martelli wrote:
> Hi Dietmar,
> 
> On Tue, 23 Sep 2025 20:14:18 +0200, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> On 19.09.25 18:37, Matteo Martelli wrote:
>>> Hi all,
>>>
>>> On Fri, 19 Sep 2025 12:10:34 +0100, Ben Dooks <ben.dooks@codethink.co.uk> wrote:
>>>> We are doing some testing with stress-ng and the cgroup-v2 enabled
>>>> (CONFIG_RT_GROUP_SCHED) and are running into WARN/BUG within a minute
>>>> related to user-space calling sched_setattr() and possibly other calls.
>>>>
>>>> At the moment we're not sure if the WARN and BUG calls are entirely
>>>> correct, we are considering there may be some sort of race condition
>>>> which is causing incorrect assumptions in the code.
>>>>
>>>> We are seeing this kernel bug in pick_next_rt_entity being triggered
>>>>
>>>> 	idx = sched_find_first_bit(array->bitmap);
>>>> 	BUG_ON(idx >= MAX_RT_PRIO);
>>>>
>>>> Which suggests that the pick_task_rt() ran, thought there was something
>>>> there to schedule and got into pick_next_rt_entity() which then found
>>>> there was nothing. It does this by checking rq->rt.rt_queued before it
>>>> bothers to try picking something to run.
>>>>
>>>> (this BUG_ON() is triggered if there is no index in the array indicating
>>>>    something there to run)
>>>>
>>>> We added some debug to find out what the values in pick_next_rt_entity()
>>>> with the current rt_queued and the value it was when pick_task_rt()
>>>> looked, and we got:
>>>>
>>>>      idx 100 bigger than MAX_RT_PRIO 100, queued = 0 (queued was 1)
>>>>
>>>> This shows the code was entered with the rt_q showing something
>>>> should have been queued and by the time the pick_next_rt_entity()
>>>> was entered there seems to be nothing (assuming the array is in
>>>> sync with the lists...)
>>>>
>>>> I think the two questions we have are:
>>>>
>>>> - Is the BUG_ON() here appropriate, should a WARN_ON_ONCE() and
>>>>     return NULL be the best way of handling this? I am going to try
>>>>     this and see if the system is still runnable with this.
>>>>
>>>> - Are we seeing a race here, and if so where is the best place to
>>>>     prevent it?
>>>>
>>>> Note, we do have a few local backported cgroup-v2 patches.
>>>>
>>>> Our systemd unit file to launch the test is here:
>>>>
>>>> [Service]
>>>> Type=simple
>>>> Restart=always
>>>> ExecStartPre=/bin/sh -c 'echo 500000 >
>>>> /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
>>>> ExecStartPre=/bin/sh -c 'echo 500000 >
>>>> /sys/fs/cgroup/system.slice/stress-sched-long-system.service/cpu.rt_runtime_us'
>>>> ExecStart=sandbox-run /usr/bin/stress-ng --temp-path /tmp/stress-ng
>>>> --timeout=0 --verify --oom-avoid --metrics --timestamp
>>>> --exclude=enosys,usersyscall --cpu-sched 0 --timeout 60 --verbose
>>>> --stressor-time
>>>> Environment=SANDBOX_RO_BINDMOUNTS="/usr/share/stress-ng"
>>>> Environment=SANDBOX_RW_BINDMOUNTS="/var/log /sys /proc /dev /tmp/stress-ng"
>>>> Environment=SANDBOX_EXTRA_ARGS="--cwd /tmp/stress-ng --keep_caps
>>>> --disable_rlimits --disable_clone_newuser"
>>>> Slice=system.slice
>>>> OOMPolicy=continue
>>
>> [...]
>>
>>> Hi all,
>>>
>>> To provide some more context, we have found out this issue while running
>>> some tests with stress-ng scheduler stressor[1] and the RT throttling
>>> feature after enabling the RT_GROUP_SCHED kernel option. Note that we
>>> also have PREEMPT_RT enabled in our config.
>>>
>>> I've just reproduced the issue on qemu-x86_64 with a debian image and kernel
>>> v6.17-rc6. See below the steps to reproduce it.
>>>
>>> cd linux
>>> git reset --hard v6.17-rc6 && git clean -f -d
>>>
>>> # Apply patch to expose RT_GROUP_SCHED interface to userspace with cgroupv2
>>> b4 shazam --single-message https://lore.kernel.org/all/20250731105543.40832-17-yurand2000@gmail.com/
>>
>> Don't get this one ... you just pick a single patch from the RFC
>> patch-set '[RFC PATCH v2 00/25]  Hierarchical Constant Bandwidth Server' ?
>>
>> https://lore.kernel.org/r/20250731105543.40832-1-yurand2000@gmail.com
>>
> 
> Yes, I was looking for a way to set the cpu.rt_runtime_us param for a
> specific cgroup from a systemd unit, in order to control the max CPU
> bandwidth allowed for a systemd slice. Since systemd deprecated support
> for cgroupv1, I picked that patch to export them via cgroupv2. To my
> understanding, with that patch, setting the rt_runtime_us and
> rt_period_us parameters via cgroupv2 should have the same effect as
> setting them via cgroupv1. Of course I could have missed something and
> that could be one reason for the issue. I will look into it further and
> try to see if the issue is still reproducible with cgroupv1.

We are still seeing the WARN_ON_ONCE() trigger in the check at

static void __dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
...
	if (move_entity(flags)) {
...
		WARN_ON_ONCE(!rt_se->on_list);
...
	}
}

This seems to be due to the task_group's rt_entity tripping this under
load. I'm not sure yet if the WARN_ON_ONCE() here is actually useful or
if we are tripping some sort of race condition.
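
To make the condition concrete, here is a minimal user-space sketch of
that check. It assumes the flag values and the move_entity() logic are
as they appear in kernel/sched/sched.h and kernel/sched/rt.c around
v6.17, so please verify against your tree. struct toy_rt_se and
toy_dequeue_warns() are made-up names for illustration, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to match kernel/sched/sched.h; check against your tree. */
#define DEQUEUE_SAVE 0x02
#define DEQUEUE_MOVE 0x04

/* Hypothetical stand-in for the one sched_rt_entity field of interest. */
struct toy_rt_se {
	bool on_list;
};

/* A SAVE without MOVE leaves the entity where it is on its list. */
static bool move_entity(unsigned int flags)
{
	return (flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE;
}

/* Returns true when the WARN_ON_ONCE() shown above would fire. */
static bool toy_dequeue_warns(struct toy_rt_se *rt_se, unsigned int flags)
{
	bool warns = false;

	if (move_entity(flags)) {
		warns = !rt_se->on_list;
		rt_se->on_list = false;	/* models __delist_rt_entity() */
	}
	return warns;
}
```

Under these assumptions the warning can only fire on a dequeue that
really moves the entity; a SAVE-only dequeue skips the on_list check
altogether.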

When switching from one stress-ng pid to another, the task_group's
entity should have been enqueued again with on_list set, but it seems
it was not. I added a quick trace_printk() after moving up to v6.17;
however, adding a print of the rt_se setting seems to have stopped the
issue from reappearing on my system.

I'll run some more tests to see if this comes back.

>>
>>> # Build kernel with defconfig + PREEMPT_RT=y and RT_GROUP_SCHED=y
>>> make mrproper
>>> make defconfig
>>> scripts/config -k -e EXPERT
>>> scripts/config -k -e PREEMPT_RT
>>> scripts/config -k -e RT_GROUP_SCHED
>>> make olddefconfig
>>> make -j12
>>>
>>> # Download a debian image and run qemu
>>> wget https://cdimage.debian.org/images/cloud/sid/daily/20250919-2240/debian-sid-nocloud-amd64-daily-20250919-2240.qcow2
>>> qemu-system-x86_64 \
>>>      -m 2G -smp 4 \
>>>      -nographic \
>>>      -nic user,hostfwd=tcp::2222-:22 \
>>>      -M q35,accel=kvm \
>>>      -drive format=qcow2,file=debian-sid-nocloud-amd64-daily-20250919-2240.qcow2 \
>>>      -virtfs local,path=.,mount_tag=shared,security_model=mapped-xattr \
>>>      -monitor none \
>>>      -append "root=/dev/sda1 console=ttyS0,115200 sysctl.kernel.panic_on_oops=1" \
>>>      -kernel arch/x86/boot/bzImage
>>>
>>> # Then inside guest machine
>>> # Install stress-ng
>>> apt-get update && apt-get install stress-ng
>>>
>>> # Create the stress-ng service. It sets the group RT runtime to 500ms
>>> # (50% BW) via the cgroupv2 interface then it starts the stress-ng
>>> # scheduler stressor. Also note the cpu affinity set to a single CPU
>>> # which seems to help the issue to be more reproducible.
>>
>> I assume this is the 'AllowedCPUs=0' line in the systemd service file.
> 
> Yes, correct.
> 
>>
>>> echo "[Unit]
>>> Description=Mixed stress with long in the system slice
>>> After=basic.target
>>>
>>> [Service]
>>> AllowedCPUs=0
>>> Type=simple
>>> Restart=always
>>> ExecStartPre=/bin/sh -c 'echo 500000 > /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
>>> ExecStart=/usr/bin/stress-ng --timeout=0 --verify --oom-avoid --metrics --timestamp --exclude=enosys,usersyscall --cpu-sched 0 --
>>
>>
>> I assume you get 4 stressors since you run 'qemu -smp 4'? How many
>> stress-ng related tasks have you running in
>> 'system.slice/stress-sched-long-system.service'? And all of them on CPU0?
> 
> Yes, with --cpu-sched 0, stress-ng is using 4 scheduler stressors all
> running on CPU 0. To my understanding each scheduler stressor forks 16
> stress-ng child tasks [1], this is confirmed by the number of stress-ng
> tasks running on the system. The test itself is not particularly
> meaningful, it just reflects the setup I had when I found the BUG_ON.
> 
>> [...]
>>
> 
> [1]: https://github.com/ColinIanKing/stress-ng/blob/V0.19.04/stress-cpu-sched.c#L66
> 
> Best regards,
> Matteo Martelli
> 


-- 
Ben Dooks				http://www.codethink.co.uk/
Senior Engineer				Codethink - Providing Genius

https://www.codethink.co.uk/privacy.html


* Re: BUG/WARN issues in kernel/sched/rt.c under stress-ng with crgoup-v2
  2025-10-22 17:57       ` Ben Dooks
@ 2025-10-23 10:04         ` Ben Dooks
  0 siblings, 0 replies; 6+ messages in thread
From: Ben Dooks @ 2025-10-23 10:04 UTC (permalink / raw)
  To: Matteo Martelli, Dietmar Eggemann, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, linux-kernel@vger.kernel.org,
	Marcel Ziswiler

On 22/10/2025 18:57, Ben Dooks wrote:
> On 24/09/2025 14:10, Matteo Martelli wrote:
>> Hi Dietmar,
>>
>> On Tue, 23 Sep 2025 20:14:18 +0200, Dietmar Eggemann 
>> <dietmar.eggemann@arm.com> wrote:
>>> On 19.09.25 18:37, Matteo Martelli wrote:
>>>> Hi all,
>>>>
>>>> On Fri, 19 Sep 2025 12:10:34 +0100, Ben Dooks 
>>>> <ben.dooks@codethink.co.uk> wrote:
>>>>> We are doing some testing with stress-ng and the cgroup-v2 enabled
>>>>> (CONFIG_RT_GROUP_SCHED) and are running into WARN/BUG within a minute
>>>>> related to user-space calling sched_setattr() and possibly other 
>>>>> calls.
>>>>>
>>>>> At the moment we're not sure if the WARN and BUG calls are entirely
>>>>> correct, we are considering there may be some sort of race condition
>>>>> which is causing incorrect assumptions in the code.
>>>>>
>>>>> We are seeing this kernel bug in pick_next_rt_entity being triggered
>>>>>
>>>>>     idx = sched_find_first_bit(array->bitmap);
>>>>>     BUG_ON(idx >= MAX_RT_PRIO);
>>>>>
>>>>> Which suggests that the pick_task_rt() ran, thought there was 
>>>>> something
>>>>> there to schedule and got into pick_next_rt_entity() which then found
>>>>> there was nothing. It does this by checking rq->rt.rt_queued before it
>>>>> bothers to try picking something to run.
>>>>>
>>>>> (this BUG_ON() is triggered if there is no index in the array 
>>>>> indicating
>>>>>    something there to run)
>>>>>
>>>>> We added some debug to find out what the values in 
>>>>> pick_next_rt_entity()
>>>>> with the current rt_queued and the value it was when pick_task_rt()
>>>>> looked, and we got:
>>>>>
>>>>>      idx 100 bigger than MAX_RT_PRIO 100, queued = 0 (queued was 1)
>>>>>
>>>>> This shows the code was entered with the rt_q showing something
>>>>> should have been queued and by the time the pick_next_rt_entity()
>>>>> was entered there seems to be nothing (assuming the array is in
>>>>> sync with the lists...)
>>>>>
>>>>> I think the two questions we have are:
>>>>>
>>>>> - Is the BUG_ON() here appropriate, should a WARN_ON_ONCE() and
>>>>>     return NULL be the best way of handling this? I am going to try
>>>>>     this and see if the system is still runnable with this.
>>>>>
>>>>> - Are we seeing a race here, and if so where is the best place to
>>>>>     prevent it?
>>>>>
>>>>> Note, we do have a few local backported cgroup-v2 patches.
>>>>>
>>>>> Our systemd unit file to launch the test is here:
>>>>>
>>>>> [Service]
>>>>> Type=simple
>>>>> Restart=always
>>>>> ExecStartPre=/bin/sh -c 'echo 500000 >
>>>>> /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
>>>>> ExecStartPre=/bin/sh -c 'echo 500000 >
>>>>> /sys/fs/cgroup/system.slice/stress-sched-long-system.service/cpu.rt_runtime_us'
>>>>> ExecStart=sandbox-run /usr/bin/stress-ng --temp-path /tmp/stress-ng
>>>>> --timeout=0 --verify --oom-avoid --metrics --timestamp
>>>>> --exclude=enosys,usersyscall --cpu-sched 0 --timeout 60 --verbose
>>>>> --stressor-time
>>>>> Environment=SANDBOX_RO_BINDMOUNTS="/usr/share/stress-ng"
>>>>> Environment=SANDBOX_RW_BINDMOUNTS="/var/log /sys /proc /dev /tmp/stress-ng"
>>>>> Environment=SANDBOX_EXTRA_ARGS="--cwd /tmp/stress-ng --keep_caps
>>>>> --disable_rlimits --disable_clone_newuser"
>>>>> Slice=system.slice
>>>>> OOMPolicy=continue
>>>
>>> [...]
>>>
>>>> Hi all,
>>>>
>>>> To provide some more context, we have found out this issue while 
>>>> running
>>>> some tests with stress-ng scheduler stressor[1] and the RT throttling
>>>> feature after enabling the RT_GROUP_SCHED kernel option. Note that we
>>>> also have PREEMPT_RT enabled in our config.
>>>>
>>>> I've just reproduced the issue on qemu-x86_64 with a debian image 
>>>> and kernel
>>>> v6.17-rc6. See below the steps to reproduce it.
>>>>
>>>> cd linux
>>>> git reset --hard v6.17-rc6 && git clean -f -d
>>>>
>>>> # Apply patch to expose RT_GROUP_SCHED interface to userspace with cgroupv2
>>>> b4 shazam --single-message https://lore.kernel.org/all/20250731105543.40832-17-yurand2000@gmail.com/
>>>
>>> Don't get this one ... you just pick a single patch from the RFC
>>> patch-set '[RFC PATCH v2 00/25]  Hierarchical Constant Bandwidth 
>>> Server' ?
>>>
>>> https://lore.kernel.org/r/20250731105543.40832-1-yurand2000@gmail.com
>>>
>>
>> Yes, I was looking for a way to set the cpu.rt_runtime_us param for a
>> specific cgroup from a systemd unit, in order to control the max CPU
>> bandwidth allowed for a systemd slice. Since systemd deprecated support
>> for cgroupv1, I picked that patch to export them via cgroupv2. To my
>> understanding, with that patch, setting the rt_runtime_us and
>> rt_period_us parameters via cgroupv2 should have the same effect as
>> setting them via cgroupv1. Of course I could have missed something and
>> that could be one reason for the issue. I will look into it further and
>> try to see if the issue is still reproducible with cgroupv1.
> 
> We are still seeing the WARN_ON_ONCE() trigger in the check at
> 
> static void __dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
> {
> ...
>      if (move_entity(flags)) {
> ...
>          WARN_ON_ONCE(!rt_se->on_list);
> ...
>      }
> }
> 
> This seems to be due to the task_group's rt_entity tripping this under
> load. I'm not sure yet if the WARN_ON_ONCE() here is actually useful or
> if we are tripping some sort of race condition.
> 
> When switching from one stress-ng pid to another, the task_group's
> entity should have been enqueued again with on_list set, but it seems
> it was not. I added a quick trace_printk() after moving up to v6.17;
> however, adding a print of the rt_se setting seems to have stopped the
> issue from reappearing on my system.
> 
> I'll run some more tests to see if this comes back.
> 
>>>
>>>> # Build kernel with defconfig + PREEMPT_RT=y and RT_GROUP_SCHED=y
>>>> make mrproper
>>>> make defconfig
>>>> scripts/config -k -e EXPERT
>>>> scripts/config -k -e PREEMPT_RT
>>>> scripts/config -k -e RT_GROUP_SCHED
>>>> make olddefconfig
>>>> make -j12
>>>>
>>>> # Download a debian image and run qemu
>>>> wget https://cdimage.debian.org/images/cloud/sid/daily/20250919-2240/debian-sid-nocloud-amd64-daily-20250919-2240.qcow2
>>>> qemu-system-x86_64 \
>>>>      -m 2G -smp 4 \
>>>>      -nographic \
>>>>      -nic user,hostfwd=tcp::2222-:22 \
>>>>      -M q35,accel=kvm \
>>>>      -drive format=qcow2,file=debian-sid-nocloud-amd64-daily-20250919-2240.qcow2 \
>>>>      -virtfs local,path=.,mount_tag=shared,security_model=mapped-xattr \
>>>>      -monitor none \
>>>>      -append "root=/dev/sda1 console=ttyS0,115200 sysctl.kernel.panic_on_oops=1" \
>>>>      -kernel arch/x86/boot/bzImage
>>>>
>>>> # Then inside guest machine
>>>> # Install stress-ng
>>>> apt-get update && apt-get install stress-ng
>>>>
>>>> # Create the stress-ng service. It sets the group RT runtime to 500ms
>>>> # (50% BW) via the cgroupv2 interface then it starts the stress-ng
>>>> # scheduler stressor. Also note the cpu affinity set to a single CPU
>>>> # which seems to help the issue to be more reproducible.
>>>
>>> I assume this is the 'AllowedCPUs=0' line in the systemd service file.
>>
>> Yes, correct.
>>
>>>
>>>> echo "[Unit]
>>>> Description=Mixed stress with long in the system slice
>>>> After=basic.target
>>>>
>>>> [Service]
>>>> AllowedCPUs=0
>>>> Type=simple
>>>> Restart=always
>>>> ExecStartPre=/bin/sh -c 'echo 500000 > /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
>>>> ExecStart=/usr/bin/stress-ng --timeout=0 --verify --oom-avoid --metrics --timestamp --exclude=enosys,usersyscall --cpu-sched 0 --
>>>
>>>
>>> I assume you get 4 stressors since you run 'qemu -smp 4'? How many
>>> stress-ng related tasks have you running in
>>> 'system.slice/stress-sched-long-system.service'? And all of them on 
>>> CPU0?
>>
>> Yes, with --cpu-sched 0, stress-ng is using 4 scheduler stressors all
>> running on CPU 0. To my understanding each scheduler stressor forks 16
>> stress-ng child tasks [1], this is confirmed by the number of stress-ng
>> tasks running on the system. The test itself is not particularly
>> meaningful, it just reflects the setup I had when I found the BUG_ON.
>>
>>> [...]
>>>
>>
>> [1]: https://github.com/ColinIanKing/stress-ng/blob/V0.19.04/stress-cpu-sched.c#L66
>>
>> Best regards,
>> Matteo Martelli
>>

So after adding tracing and some trace_printk()s to try and work out what
is going on, this is a dump from one of the WARNs.

The sched_rt_entity we're looking at is ffff98b207824f00 (shown with
pid -1 in the trace), which belongs to the task_group at
0xffff98b2020d61c0 for the stress-ng process:

>  stress-ng-cpu-s-483     [000] d..2.   191.336467: sched_switch: prev_comm=stress-ng-cpu-s prev_pid=483 prev_prio=47 prev_state=T ==> next_comm=stress-ng-cpu-s next_pid=473 next_prio=48
>  stress-ng-cpu-s-473     [000] d..42   191.336470: sched_wakeup: comm=stress-ng-cpu-s pid=466 prio=136 target_cpu=000
>  stress-ng-cpu-s-473     [000] d..31   191.336471: dequeue_rt_entity: rt_se ffff98b20a371f80, flags 10
>  stress-ng-cpu-s-473     [000] d..31   191.336471: dequeue_rt_stack: dequeue ffff98b207824f00 (-1)
>  stress-ng-cpu-s-473     [000] d..31   191.336471: dequeue_rt_stack: dequeue ffff98b20a371f80 (473)
>  stress-ng-cpu-s-473     [000] d..31   191.336471: __enqueue_rt_entity: enqueue ffff98b207824f00 pid -1
>  stress-ng-cpu-s-473     [000] d..31   191.336472: dequeue_rt_entity: rt_se ffff98b20a371f80, flags 10, done
>  stress-ng-cpu-s-473     [000] d..31   191.336472: enqueue_rt_entity: rt_se ffff98b20a371f80, flags 10
>  stress-ng-cpu-s-473     [000] d..31   191.336472: dequeue_rt_stack: dequeue ffff98b207824f00 (-1)
>  stress-ng-cpu-s-473     [000] d..31   191.336472: __enqueue_rt_entity: enqueue ffff98b20a371f80 pid 473
>  stress-ng-cpu-s-473     [000] d..31   191.336472: __enqueue_rt_entity: enqueue ffff98b207824f00 pid -1
>  stress-ng-cpu-s-473     [000] d..31   191.336473: enqueue_rt_entity: rt_se ffff98b20a371f80, flags 10, done
> ... removed other cpu events ...
>  stress-ng-cpu-s-473     [000] d..2.   191.336592: dequeue_rt_entity: rt_se ffff98b20a371f80, flags 25
>  stress-ng-cpu-s-473     [000] d..2.   191.336592: __delist_rt_entity: ffff98b207824f00: on_list to 0
>  stress-ng-cpu-s-473     [000] d..2.   191.336592: dequeue_rt_stack: dequeue ffff98b207824f00 (-1)
>  stress-ng-cpu-s-473     [000] d..2.   191.336592: __delist_rt_entity: ffff98b20a371f80: on_list to 0
>  stress-ng-cpu-s-473     [000] d..2.   191.336592: dequeue_rt_stack: dequeue ffff98b20a371f80 (473)
>  stress-ng-cpu-s-473     [000] d..2.   191.336593: __enqueue_rt_entity: ffff98b207824f00: on_list to 1
>  stress-ng-cpu-s-473     [000] d..2.   191.336593: __enqueue_rt_entity: enqueue ffff98b207824f00 pid -1

enqueue and on_list is set here.

>  stress-ng-cpu-s-473     [000] d..2.   191.336593: dequeue_rt_entity: rt_se ffff98b20a371f80, flags 25, done
>  stress-ng-cpu-s-473     [000] d..2.   191.336594: sched_switch: prev_comm=stress-ng-cpu-s prev_pid=473 prev_prio=48 prev_state=Z ==> next_comm=stress-ng-cpu-s next_pid=452 next_prio=98
>  stress-ng-cpu-s-452     [000] d..31   191.336597: enqueue_rt_entity: rt_se ffff98b20a35db80, flags 9
>  stress-ng-cpu-s-452     [000] d..31   191.336597: __delist_rt_entity: ffff98b207824f00: on_list to 0
>  stress-ng-cpu-s-452     [000] d..31   191.336597: dequeue_rt_stack: dequeue ffff98b207824f00 (-1)
>  stress-ng-cpu-s-452     [000] d..31   191.336598: __enqueue_rt_entity: ffff98b20a35db80: on_list to 1
>  stress-ng-cpu-s-452     [000] d..31   191.336598: __enqueue_rt_entity: enqueue ffff98b20a35db80 pid 461
>  stress-ng-cpu-s-452     [000] d..31   191.336598: __enqueue_rt_entity: ffff98b207824f00: on_list to 1
>  stress-ng-cpu-s-452     [000] d..31   191.336598: __enqueue_rt_entity: enqueue ffff98b207824f00 pid -1
>  stress-ng-cpu-s-452     [000] d..31   191.336598: enqueue_rt_entity: rt_se ffff98b20a35db80, flags 9, done
>  stress-ng-cpu-s-452     [000] dN.31   191.336599: sched_wakeup: comm=stress-ng-cpu-s pid=461 prio=5 target_cpu=000
>  stress-ng-cpu-s-452     [000] d..21   191.336600: dequeue_rt_entity: rt_se ffff98b205fe3d80, flags 10
>  stress-ng-cpu-s-452     [000] d..21   191.336600: dequeue_rt_stack: dequeue ffff98b207824f00 (-1)

The group entity is dequeued on the line above.

>  stress-ng-cpu-s-452     [000] d..21   191.336600: dequeue_rt_stack: dequeue ffff98b205fe3d80 (452)
>  stress-ng-cpu-s-452     [000] d..21   191.336600: __delist_rt_entity: ffff98b207824f00: on_list to 0
>  stress-ng-cpu-s-452     [000] d..21   191.336601: dequeue_rt_entity: rt_se ffff98b205fe3d80, flags 10, done
>  stress-ng-cpu-s-452     [000] d..21   191.336601: enqueue_rt_entity: rt_se ffff98b205fe3d80, flags 10

enqueue here, but the flags are for a save

>  stress-ng-cpu-s-452     [000] d..21   191.336601: __enqueue_rt_entity: enqueue ffff98b205fe3d80 pid 452
>  stress-ng-cpu-s-452     [000] d..21   191.336601: __enqueue_rt_entity: enqueue ffff98b207824f00 pid -1

actual enqueue here

>  stress-ng-cpu-s-452     [000] d..21   191.336601: enqueue_rt_entity: rt_se ffff98b205fe3d80, flags 10, done
>  stress-ng-cpu-s-452     [000] d..21   191.336602: sched_switch: prev_comm=stress-ng-cpu-s prev_pid=452 prev_prio=98 prev_state=R+ ==> next_comm=stress-ng-cpu-s next_pid=461 next_prio=5

The sched_switch above moves from pid 452 to pid 461, which is a
different pid but in the same task_group.

>  stress-ng-cpu-s-461     [000] d..3.   191.336604: dequeue_rt_entity: rt_se ffff98b205fe3d80, flags 14
> ... removed other cpu events ...
>  stress-ng-cpu-s-461     [000] d..3.   191.336730: dequeue_rt_stack: WARN: removing entity ffff98b207824f00 (tg[0,ffff98b2020d61c0]) but not on list (pid -1)

The WARN is triggered here because we are removing something that was
not on_list: the previous enqueue didn't put it on the list.

>  stress-ng-cpu-s-461     [000] d..3.   191.336781: dequeue_rt_stack: dequeue ffff98b207824f00 (-1)
>  stress-ng-cpu-s-461     [000] d..3.   191.336781: __delist_rt_entity: ffff98b205fe3d80: on_list to 0
>  stress-ng-cpu-s-461     [000] d..3.   191.336781: dequeue_rt_stack: dequeue ffff98b205fe3d80 (452)
>  stress-ng-cpu-s-461     [000] d..3.   191.336781: dequeue_rt_entity: rt_se ffff98b205fe3d80, flags 14, done
>  stress-ng-cpu-s-461     [000] d..3.   191.336782: enqueue_rt_entity: rt_se ffff98b205fe3d80, flags 14
>  stress-ng-cpu-s-461     [000] d..3.   191.336782: __enqueue_rt_entity: ffff98b205fe3d80: on_list to 1
>  stress-ng-cpu-s-461     [000] d..3.   191.336782: __enqueue_rt_entity: enqueue ffff98b205fe3d80 pid 452
>  stress-ng-cpu-s-461     [000] d..3.   191.336782: __enqueue_rt_entity: ffff98b207824f00: on_list to 1
>  stress-ng-cpu-s-461     [000] d..3.   191.336782: __enqueue_rt_entity: enqueue ffff98b207824f00 pid -1
>  stress-ng-cpu-s-461     [000] d..3.   191.336783: enqueue_rt_entity: rt_se ffff98b205fe3d80, flags 14, done
>  stress-ng-cpu-s-461     [000] d..2.   191.336783: dequeue_rt_entity: rt_se ffff98b20a35db80, flags 9
>  stress-ng-cpu-s-461     [000] d..2.   191.336783: __delist_rt_entity: ffff98b207824f00: on_list to 0
>  stress-ng-cpu-s-461     [000] d..2.   191.336784: dequeue_rt_stack: dequeue ffff98b207824f00 (-1)
>  stress-ng-cpu-s-461     [000] d..2.   191.336784: __delist_rt_entity: ffff98b20a35db80: on_list to 0
>  stress-ng-cpu-s-461     [000] d..2.   191.336784: dequeue_rt_stack: dequeue ffff98b20a35db80 (461)
>  stress-ng-cpu-s-461     [000] d..2.   191.336784: __enqueue_rt_entity: ffff98b207824f00: on_list to 1
>  stress-ng-cpu-s-461     [000] d..2.   191.336784: __enqueue_rt_entity: enqueue ffff98b207824f00 pid -1
>  stress-ng-cpu-s-461     [000] d..2.   191.336784: dequeue_rt_entity: rt_se ffff98b20a35db80, flags 9, done
>  stress-ng-cpu-s-461     [000] d..2.   191.336785: sched_switch: prev_comm=stress-ng-cpu-s prev_pid=461 prev_prio=5 prev_state=D ==> next_comm=stress-ng-cpu-s next_pid=452 next_prio=5
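
As an aid for reading the flags values in the trace, here is a small
decoder sketch. It assumes the trace_printk() output prints flags in
decimal (which fits the "flags are for a save" note on flags 10) and
that bits 0x01, 0x02, 0x04 and 0x08 are SLEEP, SAVE, MOVE and NOCLOCK
as in kernel/sched/sched.h (WAKEUP and RESTORE on the enqueue side for
the first two). Any other bit is printed as raw hex rather than guessed
at. decode_flags() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed bit meanings; verify against kernel/sched/sched.h. */
static const struct {
	unsigned int bit;
	const char *name;
} flag_names[] = {
	{ 0x01, "SLEEP" },	/* ENQUEUE_WAKEUP on the enqueue side */
	{ 0x02, "SAVE" },	/* ENQUEUE_RESTORE on the enqueue side */
	{ 0x04, "MOVE" },
	{ 0x08, "NOCLOCK" },
};

/* Write a '|'-separated description of flags into buf and return buf. */
static const char *decode_flags(unsigned int flags, char *buf, size_t len)
{
	size_t used = 0;
	size_t i;

	if (len == 0)
		return buf;
	buf[0] = '\0';

	for (i = 0; i < sizeof(flag_names) / sizeof(flag_names[0]); i++) {
		if (!(flags & flag_names[i].bit) || used >= len)
			continue;
		used += snprintf(buf + used, len - used, "%s%s",
				 used ? "|" : "", flag_names[i].name);
		flags &= ~flag_names[i].bit;
	}

	/* Bits we do not name are reported as hex. */
	if (flags && used < len)
		snprintf(buf + used, len - used, "%s0x%x",
			 used ? "|" : "", flags);
	return buf;
}
```

With these assumptions, 10 decodes to SAVE|NOCLOCK, 14 to
SAVE|MOVE|NOCLOCK, 9 to SLEEP|NOCLOCK and 25 to SLEEP|NOCLOCK|0x10. So
the flags 10 dequeue/enqueue pair is a save/restore without a move,
while the flags 14 dequeue that hits the WARN does include MOVE.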

-- 
Ben Dooks				http://www.codethink.co.uk/
Senior Engineer				Codethink - Providing Genius

https://www.codethink.co.uk/privacy.html


Thread overview: 6+ messages
2025-09-19 11:10 BUG/WARN issues in kernel/sched/rt.c under stress-ng with crgoup-v2 Ben Dooks
2025-09-19 16:37 ` Matteo Martelli
2025-09-23 18:14   ` Dietmar Eggemann
2025-09-24 13:10     ` Matteo Martelli
2025-10-22 17:57       ` Ben Dooks
2025-10-23 10:04         ` Ben Dooks
