From: Matteo Martelli <matteo.martelli@codethink.co.uk>
To: Dietmar Eggemann <dietmar.eggemann@arm.com>,
Ben Dooks <ben.dooks@codethink.co.uk>,
Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
Marcel Ziswiler <marcel.ziswiler@codethink.co.uk>,
Matteo Martelli <matteo.martelli@codethink.co.uk>
Subject: Re: BUG/WARN issues in kernel/sched/rt.c under stress-ng with crgoup-v2
Date: Wed, 24 Sep 2025 15:10:19 +0200
Message-ID: <d6fcbedf0daf259e2f96a1e0cc666cff@codethink.co.uk>
In-Reply-To: <9edb5b8d-8660-4699-b041-bd74329a14e9@arm.com>
Hi Dietmar,
On Tue, 23 Sep 2025 20:14:18 +0200, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 19.09.25 18:37, Matteo Martelli wrote:
> > Hi all,
> >
> > On Fri, 19 Sep 2025 12:10:34 +0100, Ben Dooks <ben.dooks@codethink.co.uk> wrote:
> >> We are doing some testing with stress-ng and the cgroup-v2 enabled
> >> (CONFIG_RT_GROUP_SCHED) and are running into WARN/BUG within a minute
> >> related to user-space calling sched_setattr() and possibly other calls.
> >>
> >> At the moment we're not sure if the WARN and BUG calls are entirely
> >> correct, we are considering there may be some sort of race condition
> >> which is causing incorrect assumptions in the code.
> >>
> >> We are seeing this kernel bug in pick_next_rt_entity being triggered
> >>
> >> idx = sched_find_first_bit(array->bitmap);
> >> BUG_ON(idx >= MAX_RT_PRIO);
> >>
> >> Which suggests that the pick_task_rt() ran, thought there was something
> >> there to schedule and got into pick_next_rt_entity() which then found
> >> there was nothing. It does this by checking rq->rt.rt_queued before it
> >> bothers to try picking something to run.
> >>
> >> (this BUG_ON() is triggered if there is no index in the array indicating
> >> something there to run)
> >>
> >> We added some debug to find out what the values in pick_next_rt_entity()
> >> with the current rt_queued and the value it was when pick_task_rt()
> >> looked, and we got:
> >>
> >> idx 100 bigger than MAX_RT_PRIO 100, queued = 0 (queued was 1)
> >>
> >> This shows the code was entered with the rt_q showing something
> >> should have been queued and by the time the pick_next_rt_entity()
> >> was entered there seems to be nothing (assuming the array is in
> >> sync with the lists...)
> >>
> >> I think the two questions we have are:
> >>
> >> - Is the BUG_ON() here appropriate, should a WARN_ON_ONCE() and
> >> return NULL be the best way of handling this? I am going to try
> >> this and see if the system is still runnable with this.
> >>
> >> - Are we seeing a race here, and if so where is the best place to
> >> prevent it?
> >>
> >> Note, we do have a few local backported cgroup-v2 patches.
> >>
> >> Our systemd unit file to launch the test is here:
> >>
> >> [Service]
> >> Type=simple
> >> Restart=always
> >> ExecStartPre=/bin/sh -c 'echo 500000 >
> >> /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
> >> ExecStartPre=/bin/sh -c 'echo 500000 >
> >> /sys/fs/cgroup/system.slice/stress-sched-long-system.service/cpu.rt_runtime_us'
> >> ExecStart=sandbox-run /usr/bin/stress-ng --temp-path /tmp/stress-ng
> >> --timeout=0 --verify --oom-avoid --metrics --timestamp
> >> --exclude=enosys,usersyscall --cpu-sched 0 --timeout 60 --verbose
> >> --stressor-time
> >> Environment=SANDBOX_RO_BINDMOUNTS="/usr/share/stress-ng"
> >> Environment=SANDBOX_RW_BINDMOUNTS="/var/log /sys /proc /dev /tmp/stress-ng"
> >> Environment=SANDBOX_EXTRA_ARGS="--cwd /tmp/stress-ng --keep_caps
> >> --disable_rlimits --disable_clone_newuser"
> >> Slice=system.slice
> >> OOMPolicy=continue
>
> [...]
>
> > Hi all,
> >
> > To provide some more context, we have found out this issue while running
> > some tests with stress-ng scheduler stressor[1] and the RT throttling
> > feature after enabling the RT_GROUP_SCHED kernel option. Note that we
> > also have PREEMPT_RT enabled in our config.
> >
> > I've just reproduced the issue on qemu-x86_64 with a debian image and kernel
> > v6.17-rc6. See below the steps to reproduce it.
> >
> > cd linux
> > git reset --hard v6.17-rc6 && git clean -f -d
> >
> > # Apply patch to expose RT_GROUP_SCHED interface to userspace with cgroupv2
> > b4 shazam --single-message https://lore.kernel.org/all/20250731105543.40832-17-yurand2000@gmail.com/
>
> Don't get this one ... you just pick a single patch from the RFC
> patch-set '[RFC PATCH v2 00/25] Hierarchical Constant Bandwidth Server' ?
>
> https://lore.kernel.org/r/20250731105543.40832-1-yurand2000@gmail.com
>
Yes, I was looking for a way to set the cpu.rt_runtime_us parameter for a
specific cgroup from a systemd unit, in order to control the maximum CPU
bandwidth allowed for a systemd slice. Since systemd deprecated support
for cgroupv1, I picked that patch to export the parameters via cgroupv2.
To my understanding, with that patch, setting the rt_runtime_us and
rt_period_us parameters via cgroupv2 should have the same effect as
setting them via cgroupv1. Of course I could have missed something, and
that could be one reason for the issue. I will look into it more closely
and try to see whether the issue is still reproducible with cgroupv1.
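
For reference, here is a minimal sketch of the bandwidth arithmetic behind
the "500ms (50% BW)" setting. It assumes the period is left at 1000000 us,
which I believe is the default but have not verified on this setup. The
helper name rt_bw_percent is hypothetical, and the presence of a
cpu.rt_period_us file next to cpu.rt_runtime_us under cgroupv2 is an
assumption based on my reading of the patch.

```shell
#!/bin/sh
# Hypothetical helper: RT bandwidth of a group as an integer percentage.
#   $1 = runtime in microseconds, $2 = period in microseconds
rt_bw_percent() {
    echo $(( $1 * 100 / $2 ))
}

# 500000 us of runtime over an assumed 1000000 us period.
rt_bw_percent 500000 1000000

# If the cgroupv2 files are exposed (assumption: both files exist with the
# patch applied), compute the same figure from the live values.
cg=/sys/fs/cgroup/system.slice
if [ -r "$cg/cpu.rt_runtime_us" ] && [ -r "$cg/cpu.rt_period_us" ]; then
    rt_bw_percent "$(cat "$cg/cpu.rt_runtime_us")" "$(cat "$cg/cpu.rt_period_us")"
fi
```

The integer division truncates, which is enough for a quick check of the
configured ratio.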
>
> > # Build kernel with defconfig + PREEMPT_RT=y and RT_GROUP_SCHED=y
> > make mrproper
> > make defconfig
> > scripts/config -k -e EXPERT
> > scripts/config -k -e PREEMPT_RT
> > scripts/config -k -e RT_GROUP_SCHED
> > make olddefconfig
> > make -j12
> >
> > # Download a debian image and run qemu
> > wget https://cdimage.debian.org/images/cloud/sid/daily/20250919-2240/debian-sid-nocloud-amd64-daily-20250919-2240.qcow2
> > qemu-system-x86_64 \
> > -m 2G -smp 4 \
> > -nographic \
> > -nic user,hostfwd=tcp::2222-:22 \
> > -M q35,accel=kvm \
> > -drive format=qcow2,file=debian-sid-nocloud-amd64-daily-20250919-2240.qcow2 \
> > -virtfs local,path=.,mount_tag=shared,security_model=mapped-xattr \
> > -monitor none \
> > -append "root=/dev/sda1 console=ttyS0,115200 sysctl.kernel.panic_on_oops=1" \
> > -kernel arch/x86/boot/bzImage
> >
> > # Then inside guest machine
> > # Install stress-ng
> > apt-get update && apt-get install stress-ng
> >
> > # Create the stress-ng service. It sets the group RT runtime to 500ms
> > # (50% BW) via the cgroupv2 interface then it starts the stress-ng
> > # scheduler stressor. Also note the cpu affinity set to a single CPU
> > # which seems to help the issue to be more reproducible.
>
> I assume this is the 'AllowedCPUs=0' line in the systemd service file.
Yes, correct.
>
> > echo "[Unit]
> > Description=Mixed stress with long in the system slice
> > After=basic.target
> >
> > [Service]
> > AllowedCPUs=0
> > Type=simple
> > Restart=always
> > ExecStartPre=/bin/sh -c 'echo 500000 > /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
> > ExecStart=/usr/bin/stress-ng --timeout=0 --verify --oom-avoid --metrics --timestamp --exclude=enosys,usersyscall --cpu-sched 0 --
>
>
> I assume you get 4 stressors since you run 'qemu -smp 4'? How many
> stress-ng related tasks have you running in
> 'system.slice/stress-sched-long-system.service'? And all of them on CPU0?
Yes, with --cpu-sched 0, stress-ng starts 4 scheduler stressors, all
running on CPU 0. To my understanding, each scheduler stressor forks 16
stress-ng child tasks [1]; this is confirmed by the number of stress-ng
tasks running on the system. The test itself is not particularly
meaningful; it just reflects the setup I had when I found the BUG_ON.
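
One way to check the task count and CPU placement from inside the guest is
sketched below. It assumes the procps version of ps (for the psr column),
and the helper name count_per_cpu is hypothetical. The process name is
matched on the first word of comm only.

```shell
#!/bin/sh
# Hypothetical helper: read "<cpu> <comm>" lines on stdin and count the
# ones whose command matches the regex in $1, grouped by CPU.
count_per_cpu() {
    awk -v name="$1" '$2 ~ name { n[$1]++ }
        END { for (c in n) printf "cpu%s %d\n", c, n[c] }'
}

# psr is the CPU the task last ran on (procps-specific column).
if command -v ps >/dev/null 2>&1; then
    ps -e -o psr= -o comm= | count_per_cpu '^stress-ng'
fi
```

With AllowedCPUs=0 I would expect a single cpu0 line, covering the child
tasks of the 4 stressors plus their parent processes.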
> [...]
>
[1]: https://github.com/ColinIanKing/stress-ng/blob/V0.19.04/stress-cpu-sched.c#L66
Best regards,
Matteo Martelli