From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from imap4.hz.codethink.co.uk (imap4.hz.codethink.co.uk [188.40.203.114])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id EBF7D11713
	for <linux-kernel@vger.kernel.org>; Wed, 24 Sep 2025 13:10:39 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=188.40.203.114
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1758719443; cv=none; b=NruBvPvfwqeMDnhyeYmaMf4pKRYyAcEOL1gDi3omUz7TbBwKfBwEyzO3Yh1pHoJq8eFLaAG++ZvDcUB4P/40o3YYSTIkbEGXV5PXpwb2Fql7VgdI77cru8Ddrwpyyx+Jr54vHaIwRIztK5K6409FjMDZipoguH/jWfxGAwC+V2c=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1758719443; c=relaxed/simple;
	bh=Vm+enzay0EtJU3tmUcJTW9sdXcV1GXPE3QWm2ac4Uk0=;
	h=Date:Message-ID:From:Subject:To:In-Reply-To:References; b=G7PH6Oc8CDybOA0NemF+WLFFcIvrR0vfPPFcZJ5KGshdl2Uht4w2IdG2NHTTXjLGmOz4q76p3VkfHa4Zf8z+P0kz+E9MXhdqqlbz/sXpI9kqCilyisc9dV9paiKc23ZC6U37PmpR1A1xsp+O+4NWQPxg+s4wFeXkdERWVmEd8X0=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=codethink.co.uk; spf=pass smtp.mailfrom=codethink.co.uk; dkim=pass (2048-bit key) header.d=codethink.co.uk header.i=@codethink.co.uk header.b=CnXnrd2J; arc=none smtp.client-ip=188.40.203.114
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=codethink.co.uk
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=codethink.co.uk
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=codethink.co.uk header.i=@codethink.co.uk header.b="CnXnrd2J"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=codethink.co.uk; s=imap4-20230908; h=Sender:References:In-Reply-To:To:
	Subject:From:Message-ID:Date:Reply-To:Cc:MIME-Version:Content-Type:
	Content-Transfer-Encoding:Content-ID:Content-Description:Resent-Date:
	Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id:
	List-Help:List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive;
	bh=nk1HfuVFuLnLujBiaC6+cv0KJ5hj9Q7odgwOaGzF5wA=; b=CnXnrd2JN7NM5M9p6AGRSoEJJw
	oWEnVNIYVNnQrx2Pqq3rFvC6wsAIpo4knJPX2zYq66v/fLVWQ7xoxgdXK7PoutJD2+g8zkZZsnLAi
	hp0eDUHbmi09kE9BMc7acDhRyWK3dnYJUixgvP++c3l+zrEuMFh4cxAvWqDuDMr1/fSeiANetgiDg
	R2rgonI8aPv37Vk6reGTJqHgQMZLQghT8YJFf16B5tsJjIWqzuN7aLf9z7qEt0VRwpDXKA955P7+l
	Q98mtM2gkmQ+Hycsauz8Ecb7SHAYLrsHMJcwt4vFQvJHmp6CnmEsXJCbkhbZlGTmaqQ3PiFFXLnpx
	K7Fxv/mw==;
Received: from host-79-47-48-17.retail.telecomitalia.it ([79.47.48.17] helo=localhost)
	by imap4.hz.codethink.co.uk with utf8esmtpsa  (Exim 4.94.2 #2 (Debian))
	id 1v1PGF-001Qne-5P; Wed, 24 Sep 2025 14:10:20 +0100
Date: Wed, 24 Sep 2025 15:10:19 +0200
Message-ID: <d6fcbedf0daf259e2f96a1e0cc666cff@codethink.co.uk>
From: Matteo Martelli <matteo.martelli@codethink.co.uk>
Subject: Re: BUG/WARN issues in kernel/sched/rt.c under stress-ng with
 crgoup-v2
To: Dietmar Eggemann <dietmar.eggemann@arm.com>, Ben Dooks
	<ben.dooks@codethink.co.uk>, Ingo Molnar <mingo@redhat.com>, Peter Zijlstra
	<peterz@infradead.org>, Juri Lelli <juri.lelli@redhat.com>, Vincent Guittot
	<vincent.guittot@linaro.org>, "linux-kernel@vger.kernel.org"
	<linux-kernel@vger.kernel.org>, Marcel Ziswiler
	<marcel.ziswiler@codethink.co.uk>, Matteo Martelli
        <matteo.martelli@codethink.co.uk>
In-Reply-To: <9edb5b8d-8660-4699-b041-bd74329a14e9@arm.com>
References: <3308bca2-624e-42a3-8d98-48751acaa3b3@codethink.co.uk>
	<d6abff7f5f9ee5e41f19cb1f9d02de29@codethink.co.uk>
	<9edb5b8d-8660-4699-b041-bd74329a14e9@arm.com>
Sender: matteo.martelli@codethink.co.uk
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>

Hi Dietmar,

On Tue, 23 Sep 2025 20:14:18 +0200, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 19.09.25 18:37, Matteo Martelli wrote:
> > Hi all,
> > 
> > On Fri, 19 Sep 2025 12:10:34 +0100, Ben Dooks <ben.dooks@codethink.co.uk> wrote:
> >> We are doing some testing with stress-ng and the cgroup-v2 enabled
> >> (CONFIG_RT_GROUP_SCHED) and are running into WARN/BUG within a minute
> >> related to user-space calling sched_setattr() and possibly other calls.
> >>
> >> At the moment we're not sure if the WARN and BUG calls are entirely
> >> correct, we are considering there may be some sort of race condition
> >> which is causing incorrect assumptions in the code.
> >>
> >> We are seeing this kernel bug in pick_next_rt_entity being triggered
> >>
> >> 	idx = sched_find_first_bit(array->bitmap);
> >> 	BUG_ON(idx >= MAX_RT_PRIO);
> >>
> >> Which suggests that the pick_task_rt() ran, thought there was something
> >> there to schedule and got into pick_next_rt_entity() which then found
> >> there was nothing. It does this by checking rq->rt.rt_queued before it
> >> bothers to try picking something to run.
> >>
> >> (this BUG_ON() is triggered if there is no index in the array indicating
> >>   something there to run)
> >>
> >> We added some debug to find out what the values in pick_next_rt_entity()
> >> with the current rt_queued and the value it was when pick_task_rt()
> >> looked, and we got:
> >>
> >>     idx 100 bigger than MAX_RT_PRIO 100, queued = 0 (queued was 1)
> >>
> >> This shows the code was entered with the rt_q showing something
> >> should have been queued and by the time the pick_next_rt_entity()
> >> was entered there seems to be nothing (assuming the array is in
> >> sync with the lists...)
> >>
> >> I think the two questions we have are:
> >>
> >> - Is the BUG_ON() here appropriate, should a WARN_ON_ONCE() and
> >>    return NULL be the best way of handling this? I am going to try
> >>    this and see if the system is still runnable with this.
> >>
> >> - Are we seeing a race here, and if so where is the best place to
> >>    prevent it?
> >>
> >> Note, we do have a few local backported cgroup-v2 patches.
> >>
> >> Our systemd unit file to launch the test is here:
> >>
> >> [Service]
> >> Type=simple
> >> Restart=always
> >> ExecStartPre=/bin/sh -c 'echo 500000 > 
> >> /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
> >> ExecStartPre=/bin/sh -c 'echo 500000 > 
> >> /sys/fs/cgroup/system.slice/stress-sched-long-system.service/cpu.rt_runtime_us'
> >> ExecStart=sandbox-run /usr/bin/stress-ng --temp-path /tmp/stress-ng 
> >> --timeout=0 --verify --oom-avoid --metrics --timestamp 
> >> --exclude=enosys,usersyscall --cpu-sched 0 --timeout 60 --verbose 
> >> --stressor-time
> >> Environment=SANDBOX_RO_BINDMOUNTS="/usr/share/stress-ng"
> >> Environment=SANDBOX_RW_BINDMOUNTS="/var/log /sys /proc /dev /tmp/stress-ng"
> >> Environment=SANDBOX_EXTRA_ARGS="--cwd /tmp/stress-ng --keep_caps 
> >> --disable_rlimits --disable_clone_newuser"
> >> Slice=system.slice
> >> OOMPolicy=continue
> 
> [...]
> 
> > Hi all,
> > 
> > To provide some more context, we have found out this issue while running
> > some tests with stress-ng scheduler stressor[1] and the RT throttling
> > feature after enabling the RT_GROUP_SCHED kernel option. Note that we
> > also have PREEMPT_RT enabled in our config.
> > 
> > I've just reproduced the issue on qemu-x86_64 with a debian image and kernel
> > v6.17-rc6. See below the steps to reproduce it.
> > 
> > cd linux
> > git reset --hard v6.17-rc6 && git clean -f -d
> > 
> > # Apply patch to expose RT_GROUP_SCHED interface to userspace with cgroupv2
> > b4 shazam --single-message https://lore.kernel.org/all/20250731105543.40832-17-yurand2000@gmail.com/
> 
> Don't get this one ... you just pick a single patch from the RFC
> patch-set '[RFC PATCH v2 00/25]  Hierarchical Constant Bandwidth Server' ?
> 
> https://lore.kernel.org/r/20250731105543.40832-1-yurand2000@gmail.com
> 

Yes, I was looking for a way to set the cpu.rt_runtime_us param for a
specific cgroup from a systemd unit, in order to control the max CPU
bandwidth allowed for a systemd slice. Since systemd deprecated support
for cgroupv1, I picked that patch to expose the RT parameters via
cgroupv2. To my understanding, with that patch, setting the
rt_runtime_us and rt_period_us parameters via cgroupv2 should have the
same effect as setting them via cgroupv1. Of course I could have missed
something, and that could be one reason for the issue. I will look into
it more closely and check whether the issue is still reproducible with
cgroupv1.
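
For reference, the bandwidth I expect from these two parameters is simply
runtime over period. A minimal sketch, assuming rt_period_us is left at
1000000 (I have not double-checked that value on this setup) and using a
made-up helper name, rt_bw_percent, purely for illustration:

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): percentage of CPU time an
# RT group may use, computed as runtime / period, both in microseconds.
rt_bw_percent() {
    runtime_us=$1
    period_us=$2
    echo $(( runtime_us * 100 / period_us ))
}

# 500000 us of runtime per (assumed) 1000000 us period
rt_bw_percent 500000 1000000
```

With the 500000 written by the unit file this should match the 50% BW
mentioned in the reproduction steps.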

> 
> > # Build kernel with defconfig + PREEMPT_RT=y and RT_GROUP_SCHED=y
> > make mrproper
> > make defconfig
> > scripts/config -k -e EXPERT
> > scripts/config -k -e PREEMPT_RT
> > scripts/config -k -e RT_GROUP_SCHED
> > make olddefconfig
> > make -j12
> > 
> > # Download a debian image and run qemu
> > wget https://cdimage.debian.org/images/cloud/sid/daily/20250919-2240/debian-sid-nocloud-amd64-daily-20250919-2240.qcow2
> > qemu-system-x86_64 \
> >     -m 2G -smp 4 \
> >     -nographic \
> >     -nic user,hostfwd=tcp::2222-:22 \
> >     -M q35,accel=kvm \
> >     -drive format=qcow2,file=debian-sid-nocloud-amd64-daily-20250919-2240.qcow2 \
> >     -virtfs local,path=.,mount_tag=shared,security_model=mapped-xattr \
> >     -monitor none \
> >     -append "root=/dev/sda1 console=ttyS0,115200 sysctl.kernel.panic_on_oops=1" \
> >     -kernel arch/x86/boot/bzImage
> > 
> > # Then inside guest machine
> > # Install stress-ng
> > apt-get update && apt-get install stress-ng
> > 
> > # Create the stress-ng service. It sets the group RT runtime to 500ms
> > # (50% BW) via the cgroupv2 interface then it starts the stress-ng
> > # scheduler stressor. Also note the cpu affinity set to a single CPU
> > # which seems to help the issue to be more reproducible.
> 
> I assume this is the 'AllowedCPUs=0' line in the systemd service file.

Yes, correct.

> 
> > echo "[Unit]
> > Description=Mixed stress with long in the system slice
> > After=basic.target
> > 
> > [Service]
> > AllowedCPUs=0
> > Type=simple
> > Restart=always
> > ExecStartPre=/bin/sh -c 'echo 500000 > /sys/fs/cgroup/system.slice/cpu.rt_runtime_us'
> > ExecStart=/usr/bin/stress-ng --timeout=0 --verify --oom-avoid --metrics --timestamp --exclude=enosys,usersyscall --cpu-sched 0 --
> 
> 
> I assume you get 4 stressors since you run 'qemu -smp 4'? How many
> stress-ng related tasks have you running in
> 'system.slice/stress-sched-long-system.service'? And all of them on CPU0?

Yes, with --cpu-sched 0, stress-ng uses 4 scheduler stressors, all
running on CPU 0. To my understanding each scheduler stressor forks 16
stress-ng child tasks [1], which is confirmed by the number of stress-ng
tasks running on the system. The test itself is not particularly
meaningful; it just reflects the setup I had when I found the BUG_ON.
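
In case it helps, this is roughly how the task count and CPU placement
can be checked from inside the guest. It is only a sketch: the helper
name count_cgroup_procs is made up, and the cgroup path assumes the
service is named stress-sched-long-system.service as in the path from
the original report.

```shell
#!/bin/sh
# Hypothetical helper: count the processes attached to a cgroup v2
# directory by counting the PIDs listed in its cgroup.procs file.
count_cgroup_procs() {
    wc -l < "$1/cgroup.procs" | tr -d ' '
}

# Assumed cgroup path; adjust to the actual unit name.
cg=/sys/fs/cgroup/system.slice/stress-sched-long-system.service
if [ -r "$cg/cgroup.procs" ]; then
    echo "processes in cgroup: $(count_cgroup_procs "$cg")"
    # Number of processes per CPU they last ran on (psr column of ps)
    ps -o psr= -p "$(paste -sd, "$cg/cgroup.procs")" | sort -n | uniq -c
fi
```

With AllowedCPUs=0 I would expect the per-CPU summary to show a single
line for CPU 0.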

> [...]
> 

[1]: https://github.com/ColinIanKing/stress-ng/blob/V0.19.04/stress-cpu-sched.c#L66

Best regards,
Matteo Martelli