Re: BUG/WARN issues in kernel/sched/rt.c under stress-ng with crgoup-v2

From: Matteo Martelli <matteo.martelli@codethink.co.uk>
To: Ben Dooks <ben.dooks@codethink.co.uk>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Marcel Ziswiler <marcel.ziswiler@codethink.co.uk>,
	Matteo Martelli <matteo.martelli@codethink.co.uk>
Subject: Re: BUG/WARN issues in kernel/sched/rt.c under stress-ng with crgoup-v2
Date: Fri, 19 Sep 2025 18:37:15 +0200
Message-ID: <d6abff7f5f9ee5e41f19cb1f9d02de29@codethink.co.uk>
In-Reply-To: <3308bca2-624e-42a3-8d98-48751acaa3b3@codethink.co.uk>

Hi all,

On Fri, 19 Sep 2025 12:10:34 +0100, Ben Dooks <ben.dooks@codethink.co.uk> wrote:
> We are doing some testing with stress-ng and the cgroup-v2 enabled
> (CONFIG_RT_GROUP_SCHED) and are running into WARN/BUG within a minute
> related to user-space calling sched_setattr() and possibly other calls.
> 
> At the moment we're not sure if the WARN and BUG calls are entirely
> correct, we are considering there may be some sort of race condition
> which is causing incorrect assumptions in the code.
> 
> We are seeing this kernel bug in pick_next_rt_entity being triggered
> 
> 	idx = sched_find_first_bit(array->bitmap);
> 	BUG_ON(idx >= MAX_RT_PRIO);
> 
> Which suggests that the pick_task_rt() ran, thought there was something
> there to schedule and got into pick_next_rt_entity() which then found
> there was nothing. It does this by checking rq->rt.rt_queued before it
> bothers to try picking something to run.
> 
> (this BUG_ON() is triggered if there is no index in the array indicating
>   something there to run)
> 
> We added some debug to find out what the values in pick_next_rt_entity()
> with the current rt_queued and the value it was when pick_task_rt()
> looked, and we got:
> 
>     idx 100 bigger than MAX_RT_PRIO 100, queued = 0 (queued was 1)
> 
> This shows the code was entered with the rt_q showing something
> should have been queued and by the time the pick_next_rt_entity()
> was entered there seems to be nothing (assuming the array is in
> sync with the lists...)
> 
> I think the two questions we have are:
> 
> - Is the BUG_ON() here appropriate, should a WARN_ON_ONCE() and
>    return NULL be the best way of handling this? I am going to try
>    this and see if the system is still runnable with this.
> 
> - Are we seeing a race here, and if so where is the best place to
>    prevent it?
> 
> Note, we do have a few local backported cgroup-v2 patches.
> 
> Our systemd unit file to launch the test is here:
> 
> [Service]
> Type=simple
> Restart=always
> ExecStartPre=/bin/sh -c 'echo 500000 > /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
> ExecStartPre=/bin/sh -c 'echo 500000 > /sys/fs/cgroup/system.slice/stress-sched-long-system.service/cpu.rt_runtime_us'
> ExecStart=sandbox-run /usr/bin/stress-ng --temp-path /tmp/stress-ng --timeout=0 --verify --oom-avoid --metrics --timestamp --exclude=enosys,usersyscall --cpu-sched 0 --timeout 60 --verbose --stressor-time
> Environment=SANDBOX_RO_BINDMOUNTS="/usr/share/stress-ng"
> Environment=SANDBOX_RW_BINDMOUNTS="/var/log /sys /proc /dev /tmp/stress-ng"
> Environment=SANDBOX_EXTRA_ARGS="--cwd /tmp/stress-ng --keep_caps --disable_rlimits --disable_clone_newuser"
> Slice=system.slice
> OOMPolicy=continue
> 
> I added this to dump the array and confirm at-least the array-v-list
> was in sync at the point of the bug:
> 
> static inline void debug_pick_next(struct rt_rq *rt_rq, int idx, unsigned qs)
> {
> 	struct rt_prio_array *array = &rt_rq->active;
> 	unsigned int nr;
> 
> 	pr_err("rt_q %px: idx %d bigger than MAX_RT_PRIO %d, queued = %d (was %u)\n",
> 	       rt_rq, idx, MAX_RT_PRIO, rt_rq->rt_queued, qs);
> 
> 	for (nr = 0; nr < MAX_RT_PRIO; nr += sizeof(array->bitmap[0])*8) {
> 		pr_info("  bitmap idx %u: %lx\n", nr,
> 			array->bitmap[nr/(sizeof(array->bitmap[0])*8)]);
> 	}
> 
> 	// check that the bitmap and array match
> 	for (nr = 0; nr < MAX_RT_PRIO; nr += 1) {
> 		bool l_empty = list_empty(array->queue + nr);
> 		bool a_empty = !test_bit(nr, array->bitmap);
> 
> 		if (l_empty != a_empty) {
> 			pr_err(" bitmap idx %u: list %s, bitmap %s\n", nr,
> 			       l_empty ? "empty" : "full",
> 			       a_empty ? "empty" : "full");
> 		}
> 	}
> }
> 	

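As a side note on the quoted analysis, here is a minimal userspace sketch
of why the reported idx is exactly MAX_RT_PRIO when nothing is queued. It
assumes the priority bitmap carries a delimiter bit at position
MAX_RT_PRIO, so a first-set-bit search on an empty array lands on that
bit. The struct prio_model type and the model_*() helpers are
hypothetical stand-ins for illustration, not the kernel's own code, and
model_pick() returns false where the kernel code hits the BUG_ON().

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

#define MAX_RT_PRIO	100
#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_LONGS	((MAX_RT_PRIO + 1 + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Hypothetical userspace stand-in for the RT priority array. */
struct prio_model {
	unsigned long bitmap[BITMAP_LONGS];	/* one bit per priority + delimiter */
	unsigned int nr_queued[MAX_RT_PRIO];	/* stands in for the per-priority lists */
};

static bool model_test_bit(const struct prio_model *m, unsigned int nr)
{
	return (m->bitmap[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL;
}

static void model_init(struct prio_model *m)
{
	memset(m, 0, sizeof(*m));
	/* Assumption: a delimiter bit at MAX_RT_PRIO terminates the search. */
	m->bitmap[MAX_RT_PRIO / BITS_PER_LONG] |= 1UL << (MAX_RT_PRIO % BITS_PER_LONG);
}

static void model_enqueue(struct prio_model *m, unsigned int prio)
{
	assert(prio < MAX_RT_PRIO);
	m->nr_queued[prio]++;
	m->bitmap[prio / BITS_PER_LONG] |= 1UL << (prio % BITS_PER_LONG);
}

static void model_dequeue(struct prio_model *m, unsigned int prio)
{
	assert(prio < MAX_RT_PRIO && m->nr_queued[prio] > 0);
	/* Clear the bit only when the last entry at this priority leaves. */
	if (--m->nr_queued[prio] == 0)
		m->bitmap[prio / BITS_PER_LONG] &= ~(1UL << (prio % BITS_PER_LONG));
}

/* Returns the lowest set bit; MAX_RT_PRIO means "only the delimiter is set". */
static unsigned int model_find_first_bit(const struct prio_model *m)
{
	unsigned int nr;

	for (nr = 0; nr < MAX_RT_PRIO; nr++)
		if (model_test_bit(m, nr))
			return nr;
	return MAX_RT_PRIO;
}

/* Non-fatal pick: reports "nothing queued" instead of crashing. */
static bool model_pick(const struct prio_model *m, unsigned int *prio)
{
	unsigned int idx = model_find_first_bit(m);

	if (idx >= MAX_RT_PRIO)
		return false;
	*prio = idx;
	return true;
}
```

In this model, a caller that cached "something is queued" before the last
entry was dequeued would see the search return 100, which matches the
"idx 100 ... queued = 0 (queued was 1)" debug line quoted above.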
To provide some more context, we found this issue while running tests
with the stress-ng scheduler stressor [1] and the RT throttling feature
after enabling the RT_GROUP_SCHED kernel option. Note that we also have
PREEMPT_RT enabled in our config.

I've just reproduced the issue on qemu-x86_64 with a Debian image and kernel
v6.17-rc6. The steps to reproduce it are below.

cd linux
git reset --hard v6.17-rc6 && git clean -f -d

# Apply patch to expose RT_GROUP_SCHED interface to userspace with cgroupv2
b4 shazam --single-message https://lore.kernel.org/all/20250731105543.40832-17-yurand2000@gmail.com/

# Build kernel with defconfig + PREEMPT_RT=y and RT_GROUP_SCHED=y
make mrproper
make defconfig
scripts/config -k -e EXPERT
scripts/config -k -e PREEMPT_RT
scripts/config -k -e RT_GROUP_SCHED
make olddefconfig
make -j12

# Download a debian image and run qemu
wget https://cdimage.debian.org/images/cloud/sid/daily/20250919-2240/debian-sid-nocloud-amd64-daily-20250919-2240.qcow2
qemu-system-x86_64 \
    -m 2G -smp 4 \
    -nographic \
    -nic user,hostfwd=tcp::2222-:22 \
    -M q35,accel=kvm \
    -drive format=qcow2,file=debian-sid-nocloud-amd64-daily-20250919-2240.qcow2 \
    -virtfs local,path=.,mount_tag=shared,security_model=mapped-xattr \
    -monitor none \
    -append "root=/dev/sda1 console=ttyS0,115200 sysctl.kernel.panic_on_oops=1" \
    -kernel arch/x86/boot/bzImage

# Then inside guest machine
# Install stress-ng
apt-get update && apt-get install stress-ng

# Create the stress-ng service. It sets the group RT runtime to 500ms
# (50% BW) via the cgroupv2 interface, then starts the stress-ng
# scheduler stressor. Note also that the CPU affinity is set to a single
# CPU, which seems to make the issue more reproducible.
echo "[Unit]
Description=Mixed stress with long in the system slice
After=basic.target

[Service]
AllowedCPUs=0
Type=simple
Restart=always
ExecStartPre=/bin/sh -c 'echo 500000 > /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
ExecStart=/usr/bin/stress-ng --timeout=0 --verify --oom-avoid --metrics --timestamp --exclude=enosys,usersyscall --cpu-sched 0 --timeout 60 --verbose --stressor-time
Slice=system.slice
OOMPolicy=continue" > /etc/systemd/system/stress-sched-long-system.service

systemctl start stress-sched-long-system.service
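
For reference, the "500ms (50% BW)" figure in the service comment can be
checked with a small calculation. This is only a sketch: the 1 s period
(ASSUMED_RT_PERIOD_US) is an assumption implied by that figure, and
rt_bandwidth_percent() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a 1 s RT period, implied by "500ms (50% BW)" above. */
#define ASSUMED_RT_PERIOD_US	1000000ULL

/* Share of each period that RT tasks in the group may run, in percent. */
static unsigned int rt_bandwidth_percent(uint64_t runtime_us, uint64_t period_us)
{
	assert(period_us != 0);
	return (unsigned int)(runtime_us * 100 / period_us);
}
```

With the value written to cpu.rt_runtime_us in the unit file,
rt_bandwidth_percent(500000, ASSUMED_RT_PERIOD_US) evaluates to 50.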

The BUG_ON is then triggered within a few minutes, as the following log shows:

[  345.657737] ------------[ cut here ]------------
[  345.657741] kernel BUG at kernel/sched/rt.c:1673!
[  345.657746] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[  345.657749] CPU: 0 UID: 0 PID: 379 Comm: stress-ng-cpu-s Not tainted 6.17.0-rc6-00001-g6c9be1b0be15 #1 PREEMPT_{RT,(full)}
[  345.657750] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.17.0-1-1 04/01/2014
[  345.657751] RIP: 0010:pick_task_rt+0x6c/0x80
[  345.657762] Code: 85 c0 74 16 48 8b 78 40 48 85 ff 75 c6 48 2d 80 01 00 00 c3 cc cc cc cc 31 c0 c3 cc cc cc cc f3 48 0f bc c0 83 f8 63 7
[  345.657763] RSP: 0018:ffff95de8071bcf8 EFLAGS: 00010002
[  345.657765] RAX: 0000000000000064 RBX: ffff8bd585ab9e00 RCX: 0000000000000000
[  345.657765] RDX: 0000000000000000 RSI: ffff8bd585ab9e00 RDI: ffff8bd5fdc29400
[  345.657766] RBP: ffff95de8071bd70 R08: 0000000000000004 R09: ffff8bd5fdc29200
[  345.657766] R10: 0000000000000001 R11: 000000000000000a R12: ffff8bd585ab9e00
[  345.657767] R13: ffffffff97593180 R14: ffff8bd5fdc29200 R15: 0000000000000000
[  345.657770] FS:  00007f339538fb00(0000) GS:ffff8bd665a0f000(0000) knlGS:0000000000000000
[  345.657770] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  345.657771] CR2: 000056316fe06320 CR3: 0000000006f0e000 CR4: 00000000000006f0
[  345.657772] Call Trace:
[  345.657775]  <TASK>
[  345.657775]  __schedule+0x488/0xf30
[  345.657779]  preempt_schedule+0x2e/0x50
[  345.657780]  preempt_schedule_thunk+0x16/0x30
[  345.657782]  migrate_enable+0xbc/0xd0
[  345.657784]  rt_spin_unlock+0xd/0x40
[  345.657787]  get_signal+0x765/0x8d0
[  345.657789]  ? do_nanosleep+0xe9/0x170
[  345.657791]  arch_do_signal_or_restart+0x38/0x250
[  345.657793]  exit_to_user_mode_loop+0x6b/0xb0
[  345.657796]  do_syscall_64+0x221/0x290
[  345.657798]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  345.657800] RIP: 0033:0x7f3395c4f687
[  345.657801] Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff
[  345.657802] RSP: 002b:00007ffd2cbc0270 EFLAGS: 00000202 ORIG_RAX: 00000000000000e6
[  345.657803] RAX: 0000000000000000 RBX: 00007f339538fb00 RCX: 00007f3395c4f687
[  345.657803] RDX: 00007ffd2cbc02b0 RSI: 0000000000000000 RDI: 0000000000000000
[  345.657804] RBP: 000056316fe06320 R08: 0000000000000000 R09: 0000000000000000
[  345.657804] R10: 00007ffd2cbc02c0 R11: 0000000000000202 R12: 000000000000017b
[  345.657804] R13: 000056315f837030 R14: 0000000000000003 R15: 0000000000000001
[  345.657805]  </TASK>
[  345.657805] Modules linked in:
[  345.657807] ---[ end trace 0000000000000000 ]---
[  345.657807] RIP: 0010:pick_task_rt+0x6c/0x80
[  345.657809] Code: 85 c0 74 16 48 8b 78 40 48 85 ff 75 c6 48 2d 80 01 00 00 c3 cc cc cc cc 31 c0 c3 cc cc cc cc f3 48 0f bc c0 83 f8 63 7e c0 90 <0f> 0b 90 0f 0b 90 31 c0 c3 cc cc cc cc 0f 1f 80 00 00 00 00 90 90
[  345.657809] RSP: 0018:ffff95de8071bcf8 EFLAGS: 00010002
[  345.657810] RAX: 0000000000000064 RBX: ffff8bd585ab9e00 RCX: 0000000000000000
[  345.657810] RDX: 0000000000000000 RSI: ffff8bd585ab9e00 RDI: ffff8bd5fdc29400
[  345.657810] RBP: ffff95de8071bd70 R08: 0000000000000004 R09: ffff8bd5fdc29200
[  345.657811] R10: 0000000000000001 R11: 000000000000000a R12: ffff8bd585ab9e00
[  345.657811] R13: ffffffff97593180 R14: ffff8bd5fdc29200 R15: 0000000000000000
[  345.657814] FS:  00007f339538fb00(0000) GS:ffff8bd665a0f000(0000) knlGS:0000000000000000
[  345.657814] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  345.657815] CR2: 000056316fe06320 CR3: 0000000006f0e000 CR4: 00000000000006f0
[  345.657815] Kernel panic - not syncing: Fatal exception
[  345.657969] Kernel Offset: 0x14e00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  345.685385] ---[ end Kernel panic - not syncing: Fatal exception ]---

The WARNING in __dequeue_rt_entity() is also hit often:

[  117.550503] ------------[ cut here ]------------
[  117.550505] WARNING: CPU: 0 PID: 398 at kernel/sched/rt.c:1366 dequeue_rt_stack+0x311/0x330
[  117.550518] Modules linked in:
[  117.550521] CPU: 0 UID: 0 PID: 398 Comm: stress-ng-cpu-s Not tainted 6.17.0-rc6-00001-g6c9be1b0be15 #1 PREEMPT_{RT,(full)}
[  117.550523] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.17.0-1-1 04/01/2014
[  117.550524] RIP: 0010:dequeue_rt_stack+0x311/0x330
[  117.550526] Code: 06 00 00 e9 46 fe ff ff 90 0f 0b 90 85 c0 75 06 90 0f 0b 90 31 c0 b9 01 00 00 00 48 85 d2 0f 85 ce fd ff ff e9 cf fd f
[  117.550526] RSP: 0018:ffffb4af008cbce0 EFLAGS: 00010046
[  117.550528] RAX: 0000000000000000 RBX: ffff979604108120 RCX: ffff979601febc80
[  117.550528] RDX: ffff979604108120 RSI: 0000000000000006 RDI: ffff97967dc29400
[  117.550529] RBP: ffff979604108120 R08: 00000000000e7ef0 R09: ffff979601febc00
[  117.550529] R10: 0000000000000001 R11: 0000000000000002 R12: ffff97967dc29400
[  117.550530] R13: 0000000000000006 R14: 0000000000000002 R15: ffffffff97393180
[  117.550533] FS:  00007f2a872eeb00(0000) GS:ffff9796e5c0f000(0000) knlGS:0000000000000000
[  117.550534] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  117.550534] CR2: 00005633eab27d00 CR3: 00000000041c6000 CR4: 00000000000006f0
[  117.550535] Call Trace:
[  117.550538]  <TASK>
[  117.550539]  dequeue_rt_entity+0x29/0x160
[  117.550541]  dequeue_task_rt+0x25/0x40
[  117.550542]  rt_mutex_setprio+0x356/0x520
[  117.550545]  rt_mutex_slowunlock+0x15c/0x290
[  117.550548]  ? __set_cpus_allowed_ptr+0x5f/0xa0
[  117.550549]  ? migrate_enable+0x6a/0xd0
[  117.550550]  do_send_sig_info+0x61/0xa0
[  117.550553]  kill_pid_info_type+0x8d/0xa0
[  117.550555]  kill_something_info+0x16b/0x1a0
[  117.550556]  __x64_sys_kill+0x88/0xb0
[  117.550557]  do_syscall_64+0xa4/0x290
[  117.550560]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  117.550561] RIP: 0033:0x7f2a87b5f007
[  117.550562] Code: 48 83 c4 08 c3 66 0f 1f 44 00 00 48 8b 15 e9 6d 1a 00 64 89 02 b8 ff ff ff ff eb e4 0f 1f 80 00 00 00 00 b8 3e 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c1 6d 1a 00 f7 d8 64 89 01 48
[  117.550563] RSP: 002b:00007ffefab861c8 EFLAGS: 00000202 ORIG_RAX: 000000000000003e
[  117.550564] RAX: ffffffffffffffda RBX: 00007f2a872d8a00 RCX: 00007f2a87b5f007
[  117.550565] RDX: 0000000000000003 RSI: 0000000000000012 RDI: 00000000000001af
[  117.550565] RBP: 0000000000000002 R08: 000acce4c998f093 R09: 0000000000000000
[  117.550565] R10: 00007f2a8909d000 R11: 0000000000000202 R12: 0000000000000004
[  117.550566] R13: 0000000000000001 R14: 00007ffefab86420 R15: 00000000000001af
[  117.550567]  </TASK>
[  117.550567] ---[ end trace 0000000000000000 ]---

and sometimes RCU stalls are reported:

[  453.738633] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  453.738636] rcu:     Tasks blocked on level-0 rcu_node (CPUs 0-3): P853/3:b..l
[  453.738638] rcu:     (detected by 0, t=21002 jiffies, g=1477, q=122 ncpus=4)
[  453.738640] task:stress-ng-cpu-s state:R  running task     stack:14200 pid:853   tgid:853   ppid:849    task_flags:0x400140 flags:0x0000
[  453.738644] Call Trace:
[  453.738645]  <TASK>
[  453.738646]  __schedule+0x3c9/0xf30
[  453.738651]  schedule_rtlock+0x15/0x30
[  453.738652]  rtlock_slowlock_locked+0x1b6/0x1090
[  453.738654]  rt_spin_lock+0x79/0xd0
[  453.738656]  do_send_sig_info+0x31/0xa0
[  453.738659]  kill_pid_info_type+0x8d/0xa0
[  453.738661]  kill_something_info+0x16b/0x1a0
[  453.738662]  __x64_sys_kill+0x88/0xb0
[  453.738663]  do_syscall_64+0xa4/0x290
[  453.738665]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  453.738667] RIP: 0033:0x7f4df5aeb007
[  453.738669] RSP: 002b:00007ffffc9d8558 EFLAGS: 00000202 ORIG_RAX: 000000000000003e
[  453.738670] RAX: ffffffffffffffda RBX: 00007f4df5261b20 RCX: 00007f4df5aeb007
[  453.738671] RDX: 0000000000000012 RSI: 0000000000000012 RDI: 000000000000035e
[  453.738671] RBP: 0000000000000003 R08: 001053484f787e79 R09: 0000000000000000
[  453.738672] R10: 00007f4df7029000 R11: 0000000000000202 R12: 0000000000000004
[  453.738672] R13: 0000000000000001 R14: 00007ffffc9d8738 R15: 000000000000035e
[  453.738673]  </TASK>

[1]: https://github.com/ColinIanKing/stress-ng/blob/master/stress-cpu-sched.c

I hope the additional information is helpful.

Best regards,
Matteo Martelli
