* [PATCH] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting
@ 2025-07-21 11:11 Yuri Andriaccio
2025-07-22 0:07 ` kernel test robot
2025-07-25 11:44 ` Matteo Martelli
0 siblings, 2 replies; 5+ messages in thread
From: Yuri Andriaccio @ 2025-07-21 11:11 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio
Fair-servers are currently used in place of the old RT_THROTTLING mechanism to
prevent the starvation of SCHED_OTHER (and other lower-priority) tasks when
real-time FIFO/RR processes try to fully utilize the CPU. To support the
RT_THROTTLING mechanism, the maximum allocatable bandwidth for real-time tasks
has been limited to 95% of the CPU-time.
The RT_THROTTLING mechanism has now been removed in favor of fair-servers,
which are, as expected, currently set to use 5% of the CPU-time. Still, their
bandwidth is accounted within the same budget that admits real-time tasks,
which is still set to 95% of the total CPU-time. This means that, with the
RT_THROTTLING mechanism removed, the bandwidth remaining for real-time
SCHED_DEADLINE tasks and other dl-servers (FIFO/RR are not affected) is only
90%.
To demonstrate this, I've run the following script on the latest tip/master:
----------
PROCS=$(nproc)

echo "Allocate 95% bw per task/cpu"
for ((i = 0; i < $PROCS; i++))
do
    sleep infinity &
    chrt -d -T 95000000 -P 100000000 -D 100000000 -p 0 $!
done
killall sleep

echo "Allocate 90% bw per task/cpu"
for ((i = 0; i < $PROCS; i++))
do
    sleep infinity &
    chrt -d -T 90000000 -P 100000000 -D 100000000 -p 0 $!
done
killall sleep
----------
First off, we try to fully utilize the 95% rt-bandwidth by allocating #CPU
SCHED_DEADLINE tasks, each requesting 95ms of runtime every 100ms. This fails
because, as mentioned, the fair-servers are also accounted within the 95%
realtime bandwidth. The second allocation shows that the scheduler instead
admits 90% bandwidth. By varying the parameters passed to chrt, one can
verify that the allocatable bandwidth for SCHED_DEADLINE tasks is exactly
90%, while stress-tests show that, on a CPU-hog caused by FIFO/RR tasks
(SCHED_DEADLINE tasks are of course limited by the admission test, which as
mentioned fails above 90% total bandwidth utilization), the fair-servers
grant at most 5% of the CPU-time to SCHED_OTHER tasks. Clearly, 5% of the
CPU-time is lost somewhere.
This patch reclaims the 5% of lost SCHED_DEADLINE CPU-time (FIFO/RR tasks are
not affected, as they undergo no admission test) by accounting the
fair-servers' bandwidth separately. With this patch, the above script also
succeeds when allocating 95% bw per task/cpu.
Changes:
- Stop accounting the fair-servers' bandwidth in the total allocated
bandwidth for real-time tasks.
- Do not account for fair-servers in GRUB's bandwidth-reclaiming mechanism.
- Remove the admission control test when allocating a fair-server, as its
bandwidth is now accounted separately.
- Limit the maximum bandwidth to (BW_UNIT - max_rt_bw) when changing the
parameters of a fair-server, preventing overcommitment.
- Add dl_bw_fair(), which computes the total allocated bandwidth of the
fair-servers in the given root-domain.
- Update the admission tests (in sched_dl_global_validate) run when changing
the maximum allocatable bandwidth for real-time tasks, preventing
overcommitment.
Notes:
Since the fair-servers' bandwidth can be changed through debugfs, a
fair-server's bandwidth is not enforced to be exactly equal to (BW_UNIT -
max_rt_bw); it only has to be less than or equal to that value. This retains
the fair-server settings changed through debugfs when the maximum realtime
bandwidth is changed.
This also means that, in order to increase the maximum bandwidth for
real-time tasks, the bandwidth of the fair-servers must first be decreased
through debugfs, otherwise the admission tests will fail; vice versa, to
increase the bandwidth of the fair-servers, the bandwidth of real-time tasks
must be reduced beforehand.
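This ordering constraint matters in practice. For instance, raising the
maximum realtime bandwidth from 95% to 96% requires shrinking the
fair-servers first. A sketch (to be run as root, assuming debugfs is mounted
at /sys/kernel/debug and the default 1s fair-server period):

```sh
# shrink every fair-server to 4% first...
for f in /sys/kernel/debug/sched/fair_server/cpu*/runtime; do
    echo 40000000 > "$f"    # 40ms over the 1s fair-server period
done
# ...only then can the realtime budget grow to 96%
echo 960000 > /proc/sys/kernel/sched_rt_runtime_us
```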
Testing:
This patch has been tested with basic regression tests, by checking that it is
not possible to overcommit the bandwidth of fair-servers and that SCHED_OTHER
tasks do use at least the specified amount of bw (also varying the ratio of
rt/non-rt bandwidth).
It has additionally been tested on top of the following fix, ensuring that
the warning mentioned in the linked bug report is not re-triggered:
https://lore.kernel.org/all/aHpf4LfMtB2V9uNb@jlelli-thinkpadt14gen4.remote.csb/
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
kernel/sched/deadline.c | 66 ++++++++++++++++++-----------------------
kernel/sched/sched.h | 1 -
kernel/sched/topology.c | 8 -----
3 files changed, 29 insertions(+), 46 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1af06e48227..e97a7feb59d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -141,6 +141,24 @@ static inline int dl_bw_cpus(int i)
return cpus;
}
+static inline u64 dl_bw_fair(int i)
+{
+ struct root_domain *rd = cpu_rq(i)->rd;
+ u64 fair_server_bw = 0;
+
+ RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
+ "sched RCU must be held");
+
+ if (cpumask_subset(rd->span, cpu_active_mask))
+ i = cpumask_first(rd->span);
+
+ for_each_cpu_and(i, rd->span, cpu_active_mask) {
+ fair_server_bw += cpu_rq(i)->fair_server.dl_bw;
+ }
+
+ return fair_server_bw;
+}
+
static inline unsigned long __dl_bw_capacity(const struct cpumask *mask)
{
unsigned long cap = 0;
@@ -1657,25 +1675,9 @@ void sched_init_dl_servers(void)
}
}
-void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
-{
- u64 new_bw = dl_se->dl_bw;
- int cpu = cpu_of(rq);
- struct dl_bw *dl_b;
-
- dl_b = dl_bw_of(cpu_of(rq));
- guard(raw_spinlock)(&dl_b->lock);
-
- if (!dl_bw_cpus(cpu))
- return;
-
- __dl_add(dl_b, new_bw, dl_bw_cpus(cpu));
-}
-
int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
{
- u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
- u64 new_bw = to_ratio(period, runtime);
+ u64 max_bw, new_bw = to_ratio(period, runtime);
struct rq *rq = dl_se->rq;
int cpu = cpu_of(rq);
struct dl_bw *dl_b;
@@ -1688,17 +1690,14 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
cpus = dl_bw_cpus(cpu);
cap = dl_bw_capacity(cpu);
+ max_bw = cap_scale(BW_UNIT - dl_b->bw, cap) / cpus;
- if (__dl_overflow(dl_b, cap, old_bw, new_bw))
+ if (new_bw > max_bw)
return -EBUSY;
if (init) {
__add_rq_bw(new_bw, &rq->dl);
- __dl_add(dl_b, new_bw, cpus);
} else {
- __dl_sub(dl_b, dl_se->dl_bw, cpus);
- __dl_add(dl_b, new_bw, cpus);
-
dl_rq_change_utilization(rq, dl_se, new_bw);
}
@@ -2932,17 +2931,6 @@ void dl_clear_root_domain(struct root_domain *rd)
rd->dl_bw.total_bw = 0;
for_each_cpu(i, rd->span)
cpu_rq(i)->dl.extra_bw = cpu_rq(i)->dl.max_bw;
-
- /*
- * dl_servers are not tasks. Since dl_add_task_root_domain ignores
- * them, we need to account for them here explicitly.
- */
- for_each_cpu(i, rd->span) {
- struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server;
-
- if (dl_server(dl_se) && cpu_active(i))
- __dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(i));
- }
}
void dl_clear_root_domain_cpu(int cpu)
@@ -3126,9 +3114,10 @@ int sched_dl_global_validate(void)
u64 period = global_rt_period();
u64 new_bw = to_ratio(period, runtime);
u64 cookie = ++dl_cookie;
+ u64 fair_bw;
struct dl_bw *dl_b;
- int cpu, cpus, ret = 0;
- unsigned long flags;
+ int cpu, ret = 0;
+ unsigned long cap, flags;
/*
* Here we want to check the bandwidth not being set to some
@@ -3142,10 +3131,13 @@ int sched_dl_global_validate(void)
goto next;
dl_b = dl_bw_of(cpu);
- cpus = dl_bw_cpus(cpu);
+ cap = dl_bw_capacity(cpu);
+ fair_bw = dl_bw_fair(cpu);
raw_spin_lock_irqsave(&dl_b->lock, flags);
- if (new_bw * cpus < dl_b->total_bw)
+ if (cap_scale(new_bw, cap) < dl_b->total_bw)
+ ret = -EBUSY;
+ if (cap_scale(new_bw, cap) + fair_bw > cap_scale(BW_UNIT, cap))
ret = -EBUSY;
raw_spin_unlock_irqrestore(&dl_b->lock, flags);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ac953fad8c2..42b5d024dce 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -390,7 +390,6 @@ extern void sched_init_dl_servers(void);
extern void dl_server_update_idle_time(struct rq *rq,
struct task_struct *p);
extern void fair_server_init(struct rq *rq);
-extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
u64 runtime, u64 period, bool init);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 977e133bb8a..4ea3365984a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -500,14 +500,6 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
set_rq_online(rq);
- /*
- * Because the rq is not a task, dl_add_task_root_domain() did not
- * move the fair server bw to the rd if it already started.
- * Add it now.
- */
- if (rq->fair_server.dl_server)
- __dl_server_attach_root(&rq->fair_server, rq);
-
rq_unlock_irqrestore(rq, &rf);
if (old_rd)
--
2.50.1
* Re: [PATCH] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting
2025-07-21 11:11 [PATCH] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting Yuri Andriaccio
@ 2025-07-22 0:07 ` kernel test robot
2025-07-25 11:44 ` Matteo Martelli
1 sibling, 0 replies; 5+ messages in thread
From: kernel test robot @ 2025-07-22 0:07 UTC (permalink / raw)
To: Yuri Andriaccio, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider
Cc: oe-kbuild-all, linux-kernel, Luca Abeni, Yuri Andriaccio
Hi Yuri,
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/sched/core]
[also build test ERROR on tip/master peterz-queue/sched/core next-20250721]
[cannot apply to tip/auto-latest linus/master v6.16-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Yuri-Andriaccio/sched-deadline-Remove-fair-servers-from-real-time-task-s-bandwidth-accounting/20250721-191333
base: tip/sched/core
patch link: https://lore.kernel.org/r/20250721111131.309388-1-yurand2000%40gmail.com
patch subject: [PATCH] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting
config: i386-buildonly-randconfig-001-20250722 (https://download.01.org/0day-ci/archive/20250722/202507220727.BmA1Osdg-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250722/202507220727.BmA1Osdg-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507220727.BmA1Osdg-lkp@intel.com/
All errors (new ones prefixed by >>):
ld: kernel/sched/build_policy.o: in function `dl_server_apply_params':
>> kernel/sched/deadline.c:1693: undefined reference to `__udivdi3'
vim +1693 kernel/sched/deadline.c
1677
1678 int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
1679 {
1680 u64 max_bw, new_bw = to_ratio(period, runtime);
1681 struct rq *rq = dl_se->rq;
1682 int cpu = cpu_of(rq);
1683 struct dl_bw *dl_b;
1684 unsigned long cap;
1685 int retval = 0;
1686 int cpus;
1687
1688 dl_b = dl_bw_of(cpu);
1689 guard(raw_spinlock)(&dl_b->lock);
1690
1691 cpus = dl_bw_cpus(cpu);
1692 cap = dl_bw_capacity(cpu);
> 1693 max_bw = cap_scale(BW_UNIT - dl_b->bw, cap) / cpus;
1694
1695 if (new_bw > max_bw)
1696 return -EBUSY;
1697
1698 if (init) {
1699 __add_rq_bw(new_bw, &rq->dl);
1700 } else {
1701 dl_rq_change_utilization(rq, dl_se, new_bw);
1702 }
1703
1704 dl_se->dl_runtime = runtime;
1705 dl_se->dl_deadline = period;
1706 dl_se->dl_period = period;
1707
1708 dl_se->runtime = 0;
1709 dl_se->deadline = 0;
1710
1711 dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
1712 dl_se->dl_density = to_ratio(dl_se->dl_deadline, dl_se->dl_runtime);
1713
1714 return retval;
1715 }
1716
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
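The undefined reference to `__udivdi3` comes from the new 64-bit division
(`/ cpus`) on 32-bit x86, where the C `/` operator on u64 operands emits a
call into libgcc. Kernel code avoids that with the div64 helpers. A likely
fix for the flagged line (a sketch, not taken from this thread) would be:

```c
/* use div64_u64() so 32-bit builds don't reference libgcc's __udivdi3 */
max_bw = div64_u64(cap_scale(BW_UNIT - dl_b->bw, cap), cpus);
```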
* Re: [PATCH] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting
2025-07-21 11:11 [PATCH] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting Yuri Andriaccio
2025-07-22 0:07 ` kernel test robot
@ 2025-07-25 11:44 ` Matteo Martelli
2025-07-25 15:28 ` Yuri Andriaccio
1 sibling, 1 reply; 5+ messages in thread
From: Matteo Martelli @ 2025-07-25 11:44 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: linux-kernel, Luca Abeni, Yuri Andriaccio, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider
Hi Yuri,
On Mon, 21 Jul 2025 13:11:31 +0200, Yuri Andriaccio <yurand2000@gmail.com> wrote:
> Fair-servers are currently used in place of the old RT_THROTTLING mechanism to
> prevent the starvation of SCHED_OTHER (and other lower priority) tasks when
> real-time FIFO/RR processes are trying to fully utilize the CPU. To allow the
> RT_THROTTLING mechanism, the maximum allocatable bandwidth for real-time tasks
> has been limited to 95% of the CPU-time.
>
> The RT_THROTTLING mechanism is now removed in favor of fair-servers, which are
> currently set to use, as expected, 5% of the CPU-time. Still, they share the
> same bandwidth that allows to run real-time tasks, and which is still set to 95%
> of the total CPU-time. This means that by removing the RT_THROTTLING mechanism,
> the bandwidth remaining for real-time SCHED_DEADLINE tasks and other dl-servers
> (FIFO/RR are not affected) is only 90%.
> ...
I noticed the same issue while recently addressing a stress-ng bug [1].
> This patch reclaims the 5% lost SCHED_DEADLINE CPU-time (FIFO/RR are not
> affected, there is no admission test there to perform), by accounting the
> fair-server's bandwidth separately. After this patch, the above script runs
> successfully also when allocating 95% bw per task/cpu.
> ...
I've tested your patch on top of 6.16.0-rc7-00091-gdd9c17322a6c,
x86 defconfig, on QEMU with 4 virtual CPUs and a Debian sid image [2].
I used stress-ng at latest master commit (35e617844) so that it includes
the fix to run the stress workers at full deadline bandwidth. The patch
seems to work properly but I encountered some kernel warnings while
testing. See the details below.
Checked default settings after boot (95% global BW, 5% fair_server BW):
----------
root@localhost:~# cat /proc/sys/kernel/sched_rt_period_us
1000000
root@localhost:~# cat /proc/sys/kernel/sched_rt_runtime_us
950000
root@localhost:~# cat /sys/kernel/debug/sched/fair_server/cpu*/period
1000000000
1000000000
1000000000
1000000000
root@localhost:~# cat /sys/kernel/debug/sched/fair_server/cpu*/runtime
50000000
50000000
50000000
50000000
----------
Launched stress-ng command using 95% BW: 95ms runtime, 100ms period, 100ms
deadline. All 4 stress processes ran properly:
----------
root@localhost:~# ./stress-ng/stress-ng --sched deadline --sched-runtime 95000000 --sched-period 100000000 --sched-deadline 100000000 --cpu 0 --verbose --metrics --timeout 10s
stress-ng: debug: [351] invoked with './stress-ng/stress-ng --sched deadline --sched-runtime 95000000 --sched-period 100000000 --sched-deadline 100000000 --cpu 0 --verbose --metrics --timeout 10s' by user 0 'root'
stress-ng: debug: [351] stress-ng 0.19.02 g35e617844d9b
stress-ng: debug: [351] system: Linux localhost 6.16.0-rc7-00364-gfe48072cc20f #1 SMP PREEMPT_DYNAMIC Thu Jul 24 18:44:39 CEST 2025 x86_64, gcc 14.2.0, glibc 2.41, little endian
stress-ng: debug: [351] RAM total: 1.9G, RAM free: 1.8G, swap free: 0.0
stress-ng: debug: [351] temporary file path: '/root', filesystem type: ext4 (187735 blocks available, QEMU HARDDISK)
stress-ng: debug: [351] 4 processors online, 4 processors configured
stress-ng: info: [351] setting to a 10 secs run per stressor
stress-ng: debug: [351] CPU data cache: L1: 64K, L2: 512K, L3: 16384K
stress-ng: debug: [351] cache allocate: shared cache buffer size: 16384K
stress-ng: info: [351] dispatching hogs: 4 cpu
stress-ng: debug: [351] starting stressors
stress-ng: debug: [352] deadline: setting scheduler class 'sched' (period=100000000, runtime=95000000, deadline=100000000)
stress-ng: debug: [353] deadline: setting scheduler class 'sched' (period=100000000, runtime=95000000, deadline=100000000)
stress-ng: debug: [352] cpu: [352] started (instance 0 on CPU 3)
stress-ng: debug: [353] cpu: [353] started (instance 1 on CPU 2)
stress-ng: debug: [354] deadline: setting scheduler class 'sched' (period=100000000, runtime=95000000, deadline=100000000)
stress-ng: debug: [354] cpu: [354] started (instance 2 on CPU 1)
stress-ng: debug: [351] 4 stressors started
stress-ng: debug: [355] deadline: setting scheduler class 'sched' (period=100000000, runtime=95000000, deadline=100000000)
stress-ng: debug: [355] cpu: [355] started (instance 3 on CPU 3)
stress-ng: debug: [352] cpu: using method 'all'
stress-ng: debug: [355] cpu: [355] exited (instance 3 on CPU 0)
stress-ng: debug: [354] cpu: [354] exited (instance 2 on CPU 1)
stress-ng: debug: [353] cpu: [353] exited (instance 1 on CPU 2)
stress-ng: debug: [352] cpu: [352] exited (instance 0 on CPU 3)
stress-ng: debug: [351] cpu: [352] terminated (success)
stress-ng: debug: [351] cpu: [353] terminated (success)
stress-ng: debug: [351] cpu: [354] terminated (success)
stress-ng: debug: [351] cpu: [355] terminated (success)
stress-ng: debug: [351] metrics-check: all stressor metrics validated and sane
stress-ng: metrc: [351] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [351] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [351] cpu 75655 10.00 38.01 0.00 7564.01 1990.31 95.01 8460
stress-ng: info: [351] skipped: 0
stress-ng: info: [351] passed: 4: cpu (4)
stress-ng: info: [351] failed: 0
stress-ng: info: [351] metrics untrustworthy: 0
stress-ng: info: [351] successful run completed in 10.10 secs
----------
Launched stress-ng command using 96% BW: 96ms runtime, 100ms period, 100ms
deadline. The last spawned stress process failed to execute:
"stress-ng: warn: [357] cpu: [361] aborted early, out of system resources"
(without the patch, stress-ng produced the same error already with BW > 90%):
----------
root@localhost:~# ./stress-ng/stress-ng --sched deadline --sched-runtime 96000000 --sched-period 100000000 --sched-deadline 100000000 --cpu 0 --verbose --metrics --timeout 10s
stress-ng: debug: [357] invoked with './stress-ng/stress-ng --sched deadline --sched-runtime 96000000 --sched-period 100000000 --sched-deadline 100000000 --cpu 0 --verbose --metrics --timeout 10s' by user 0 'root'
stress-ng: debug: [357] stress-ng 0.19.02 g35e617844d9b
stress-ng: debug: [357] system: Linux localhost 6.16.0-rc7-00364-gfe48072cc20f #1 SMP PREEMPT_DYNAMIC Thu Jul 24 18:44:39 CEST 2025 x86_64, gcc 14.2.0, glibc 2.41, little endian
stress-ng: debug: [357] RAM total: 1.9G, RAM free: 1.8G, swap free: 0.0
stress-ng: debug: [357] temporary file path: '/root', filesystem type: ext4 (187735 blocks available, QEMU HARDDISK)
stress-ng: debug: [357] 4 processors online, 4 processors configured
stress-ng: info: [357] setting to a 10 secs run per stressor
stress-ng: debug: [357] CPU data cache: L1: 64K, L2: 512K, L3: 16384K
stress-ng: debug: [357] cache allocate: shared cache buffer size: 16384K
stress-ng: info: [357] dispatching hogs: 4 cpu
stress-ng: debug: [357] starting stressors
stress-ng: debug: [358] deadline: setting scheduler class 'sched' (period=100000000, runtime=96000000, deadline=100000000)
stress-ng: debug: [359] deadline: setting scheduler class 'sched' (period=100000000, runtime=96000000, deadline=100000000)
stress-ng: debug: [360] deadline: setting scheduler class 'sched' (period=100000000, runtime=96000000, deadline=100000000)
stress-ng: debug: [358] cpu: [358] started (instance 0 on CPU 1)
stress-ng: debug: [359] cpu: [359] started (instance 1 on CPU 2)
stress-ng: debug: [360] cpu: [360] started (instance 2 on CPU 0)
stress-ng: debug: [357] 4 stressors started
stress-ng: debug: [361] deadline: setting scheduler class 'sched' (period=100000000, runtime=96000000, deadline=100000000)
stress-ng: debug: [358] cpu: using method 'all'
stress-ng: debug: [359] cpu: [359] exited (instance 1 on CPU 3)
stress-ng: debug: [360] cpu: [360] exited (instance 2 on CPU 2)
stress-ng: debug: [358] cpu: [358] exited (instance 0 on CPU 0)
stress-ng: debug: [357] cpu: [358] terminated (success)
stress-ng: debug: [357] cpu: [359] terminated (success)
stress-ng: debug: [357] cpu: [360] terminated (success)
stress-ng: warn: [357] cpu: [361] aborted early, out of system resources
stress-ng: debug: [357] cpu: [361] terminated (no resources)
stress-ng: debug: [357] metrics-check: all stressor metrics validated and sane
stress-ng: metrc: [357] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [357] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [357] cpu 58940 10.00 28.80 0.00 5892.97 2046.02 96.01 8444
stress-ng: info: [357] skipped: 1: cpu (1)
stress-ng: info: [357] passed: 3: cpu (3)
stress-ng: info: [357] failed: 0
stress-ng: info: [357] metrics untrustworthy: 0
stress-ng: info: [357] successful run completed in 10.10 secs
----------
Trying to increase the fair_server BW without reducing the global BW failed
as expected:
----------
root@localhost:~# echo 60000000 > /sys/kernel/debug/sched/fair_server/cpu0/runtime
-bash: echo: write error: Device or resource busy
----------
Trying to increase the fair_server BW after reducing the global BW succeeded:
----------
root@localhost:~# echo 940000 > /proc/sys/kernel/sched_rt_runtime_us
root@localhost:~# echo 60000000 > /sys/kernel/debug/sched/fair_server/cpu0/runtime
root@localhost:~# echo 60000000 > /sys/kernel/debug/sched/fair_server/cpu1/runtime
root@localhost:~# echo 60000000 > /sys/kernel/debug/sched/fair_server/cpu2/runtime
root@localhost:~# echo 60000000 > /sys/kernel/debug/sched/fair_server/cpu3/runtime
root@localhost:~#
----------
Checked that settings had been applied correctly:
----------
root@localhost:~# cat /proc/sys/kernel/sched_rt_runtime_us
940000
root@localhost:~# cat /sys/kernel/debug/sched/fair_server/cpu*/runtime
60000000
60000000
60000000
60000000
----------
NOTE: After setting the fair_server runtime for cpu2, I encountered the
following warning; it also happened after setting the runtime for cpu0
during another instance of the same test:
----------
[ 196.647897] ------------[ cut here ]------------
[ 196.649198] WARNING: kernel/sched/deadline.c:284 at dl_rq_change_utilization+0x5f/0x1d0, CPU#1: bash/249
[ 196.650244] Modules linked in:
[ 196.650613] CPU: 1 UID: 0 PID: 249 Comm: bash Not tainted 6.16.0-rc7-00364-gfe48072cc20f #1 PREEMPT(voluntary)
[ 196.651698] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.17.0-1-1 04/01/2014
[ 196.652834] RIP: 0010:dl_rq_change_utilization+0x5f/0x1d0
[ 196.653747] Code: 01 00 00 4d 89 88 d0 08 00 00 48 83 c4 18 c3 cc cc cc cc 90 0f 0b 90 49 c7 80 d0 08 00 00 00 00 00 00 48 85 d2 74 dc 31 c0 90 <0f> 0b 90 49 39 c1 73 d1 e9 04 01 00 00 f6 46 53 10 75 73 48 8b 87
[ 196.655791] RSP: 0018:ffffa47b003e3d60 EFLAGS: 00010093
[ 196.656328] RAX: 0000000000000000 RBX: 000000003b9aca00 RCX: ffff9f25bdd292b0
[ 196.657031] RDX: 000000000000cccc RSI: ffff9f25bdd292b0 RDI: ffff9f25bdd289c0
[ 196.657742] RBP: ffff9f25bdd289c0 R08: ffff9f25bdd289c0 R09: 000000000000f5c2
[ 196.658453] R10: 000000000000000a R11: 0fffffffffffffff R12: ffff9f25bdd292b0
[ 196.659077] R13: 0000000000001000 R14: ffff9f25411a3840 R15: ffff9f25411a3800
[ 196.659752] FS: 00007f3c521c4740(0000) GS:ffff9f26249cb000(0000) knlGS:0000000000000000
[ 196.660321] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 196.660740] CR2: 000055d43eae1a58 CR3: 0000000005281000 CR4: 00000000000006f0
[ 196.661210] Call Trace:
[ 196.661383] <TASK>
[ 196.661605] ? dl_bw_cpus+0x74/0x80
[ 196.661840] dl_server_apply_params+0x1b3/0x1e0
[ 196.662142] sched_fair_server_write.isra.0+0x10a/0x1a0
[ 196.662523] full_proxy_write+0x54/0x90
[ 196.662780] vfs_write+0xc9/0x480
[ 196.663006] ksys_write+0x6e/0xe0
[ 196.663229] do_syscall_64+0xa4/0x260
[ 196.663513] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 196.663845] RIP: 0033:0x7f3c52256687
[ 196.664084] Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff
[ 196.666102] RSP: 002b:00007ffde87d9720 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 196.666631] RAX: ffffffffffffffda RBX: 00007f3c521c4740 RCX: 00007f3c52256687
[ 196.667099] RDX: 0000000000000009 RSI: 000055d43eb74a50 RDI: 0000000000000001
[ 196.667596] RBP: 000055d43eb74a50 R08: 0000000000000000 R09: 0000000000000000
[ 196.668057] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000009
[ 196.668562] R13: 00007f3c523af5c0 R14: 00007f3c523ace80 R15: 0000000000000000
[ 196.669448] </TASK>
[ 196.669795] ---[ end trace 0000000000000000 ]---
----------
I've also encountered this other warning after setting cpu4 during
another instance of the same test:
----------
[ 177.452335] ------------[ cut here ]------------
[ 177.453820] WARNING: kernel/sched/deadline.c:257 at task_non_contending+0x259/0x3b0, CPU#1: swapper/1/0
[ 177.455214] Modules linked in:
[ 177.455555] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Tainted: G W 6.16.0-rc7-00364-gfe48072cc20f #1 PREEMPT(voluntary)
[ 177.456841] Tainted: [W]=WARN
[ 177.457192] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.17.0-1-1 04/01/2014
[ 177.458202] RIP: 0010:task_non_contending+0x259/0x3b0
[ 177.458751] Code: 4c 89 0c 24 e8 b8 46 49 00 4c 8b 0c 24 e9 6b fe ff ff f6 43 53 10 0f 85 57 ff ff ff 49 8b 80 c8 08 00 00 48 2b 43 30 73 06 90 <0f> 0b 90 31 c0 49 63 90 28 0b 00 00 49 89 80 c8 08 00 00 48 c7 c0
[ 177.460760] RSP: 0018:ffffa88f400e8ea0 EFLAGS: 00010087
[ 177.461352] RAX: ffffffffffffd70a RBX: ffff8c8b7dca92b0 RCX: 0000000000000001
[ 177.462121] RDX: fffffffffffdf33f RSI: 0000000003938700 RDI: ffff8c8b7dca92b0
[ 177.462884] RBP: ffff8c8b7dca89c0 R08: ffff8c8b7dca89c0 R09: 0000000000000002
[ 177.463655] R10: ffff8c8b7dca89c0 R11: 0000000000000000 R12: ffffffffa7b55050
[ 177.464403] R13: 0000000000000002 R14: ffff8c8b7dca92b0 R15: ffff8c8b7dc9bb00
[ 177.465124] FS: 0000000000000000(0000) GS:ffff8c8bd3dcb000(0000) knlGS:0000000000000000
[ 177.465775] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 177.466231] CR2: 000055c7394fa93c CR3: 0000000006855000 CR4: 00000000000006f0
[ 177.466757] Call Trace:
[ 177.466952] <IRQ>
[ 177.467104] dl_task_timer+0x3ca/0x830
[ 177.467373] __hrtimer_run_queues+0x12e/0x2a0
[ 177.467686] hrtimer_interrupt+0xf7/0x220
[ 177.468000] __sysvec_apic_timer_interrupt+0x53/0x100
[ 177.468358] sysvec_apic_timer_interrupt+0x66/0x80
[ 177.468701] </IRQ>
[ 177.468866] <TASK>
[ 177.469047] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 177.469413] RIP: 0010:pv_native_safe_halt+0xf/0x20
[ 177.469759] Code: 06 86 00 c3 cc cc cc cc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d f5 97 1f 00 fb f4 <c3> cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90
[ 177.471123] RSP: 0018:ffffa88f400a3ed8 EFLAGS: 00000212
[ 177.471491] RAX: ffff8c8bd3dcb000 RBX: ffff8c8b01249d80 RCX: ffff8c8b02399701
[ 177.472013] RDX: 4000000000000000 RSI: 0000000000000083 RDI: 000000000000bc4c
[ 177.472511] RBP: 0000000000000001 R08: 000000000000bc4c R09: ffff8c8b7dca49d0
[ 177.473021] R10: 0000002954e5b780 R11: 0000000000000002 R12: 0000000000000000
[ 177.473518] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 177.474058] ? ct_kernel_exit.constprop.0+0x63/0xf0
[ 177.474404] default_idle+0x9/0x10
[ 177.474649] default_idle_call+0x2b/0x100
[ 177.474952] do_idle+0x1d0/0x230
[ 177.475186] cpu_startup_entry+0x24/0x30
[ 177.475466] start_secondary+0xf3/0x100
[ 177.475745] common_startup_64+0x13e/0x148
[ 177.476079] </TASK>
[ 177.476241] ---[ end trace 0000000000000000 ]---
----------
Besides the warnings, the new max BW was applied correctly: re-running
stress-ng with 95% BW failed (the last spawned process aborted), while
re-running it with 94% BW passed (all processes ran).
Trying to reset global BW to 95% before decreasing fair_server BW failed as
expected:
----------
root@localhost:~# echo 950000 > /proc/sys/kernel/sched_rt_runtime_us
-bash: echo: write error: Device or resource busy
----------
Setting back global BW to 95% after decreasing fair_server BW back to 5%
succeeded:
----------
root@localhost:~# echo 50000000 > /sys/kernel/debug/sched/fair_server/cpu0/runtime
root@localhost:~# echo 50000000 > /sys/kernel/debug/sched/fair_server/cpu1/runtime
root@localhost:~# echo 50000000 > /sys/kernel/debug/sched/fair_server/cpu2/runtime
root@localhost:~# echo 50000000 > /sys/kernel/debug/sched/fair_server/cpu3/runtime
root@localhost:~# echo 950000 > /proc/sys/kernel/sched_rt_runtime_us
root@localhost:~#
----------
Checked that settings had been applied correctly:
----------
root@localhost:~# cat /proc/sys/kernel/sched_rt_runtime_us
950000
root@localhost:~# cat /sys/kernel/debug/sched/fair_server/cpu*/runtime
50000000
50000000
50000000
50000000
----------
After a few seconds I encountered yet another warning:
----------
[ 749.612165] ------------[ cut here ]------------
[ 749.613416] WARNING: kernel/sched/deadline.c:245 at task_contending+0x167/0x190, CPU#3: swapper/3/0
[ 749.614406] Modules linked in:
[ 749.614783] CPU: 3 UID: 0 PID: 0 Comm: swapper/3 Tainted: G W 6.16.0-rc7-00364-gfe48072cc20f #1 PREEMPT(voluntary)
[ 749.616070] Tainted: [W]=WARN
[ 749.616398] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.17.0-1-1 04/01/2014
[ 749.617383] RIP: 0010:task_contending+0x167/0x190
[ 749.617913] Code: 48 83 c4 08 e9 1a 56 49 00 83 e1 04 0f 85 7a ff ff ff c3 cc cc cc cc 90 0f 0b 90 e9 39 ff ff ff 90 0f 0b 90 e9 f5 fe ff ff 90 <0f> 0b 90 e9 f9 fe ff ff 90 0f 0b 90 eb 91 48 c7 c6 50 4c ef 96 48
[ 749.621091] RSP: 0018:ffffa47b00140de8 EFLAGS: 00010087
[ 749.621685] RAX: ffff9f25bdda89c0 RBX: ffff9f25bdda92b0 RCX: 000000000000f5c2
[ 749.622470] RDX: 000000000000f5c2 RSI: 0000000000000001 RDI: 0000000000000000
[ 749.623243] RBP: 0000000000000001 R08: 00000000041d4a50 R09: 0000000000000002
[ 749.623935] R10: 0000000000000001 R11: ffff9f2541e28010 R12: 0000000000000001
[ 749.624526] R13: ffff9f25bdda8a40 R14: 0000000000000000 R15: 0000000000200b20
[ 749.625066] FS: 0000000000000000(0000) GS:ffff9f2624acb000(0000) knlGS:0000000000000000
[ 749.625676] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 749.626090] CR2: 00007f3c523df18c CR3: 00000000050c8000 CR4: 00000000000006f0
[ 749.626639] Call Trace:
[ 749.626825] <IRQ>
[ 749.626980] enqueue_dl_entity+0x401/0x730
[ 749.627281] dl_server_start+0x35/0x80
[ 749.627588] enqueue_task_fair+0x20c/0x820
[ 749.627891] enqueue_task+0x2c/0x70
[ 749.628150] ttwu_do_activate+0x6e/0x210
[ 749.628439] try_to_wake_up+0x249/0x630
[ 749.628741] ? __pfx_hrtimer_wakeup+0x10/0x10
[ 749.629062] hrtimer_wakeup+0x1d/0x30
[ 749.629332] __hrtimer_run_queues+0x12e/0x2a0
[ 749.629712] hrtimer_interrupt+0xf7/0x220
[ 749.630005] __sysvec_apic_timer_interrupt+0x53/0x100
[ 749.630373] sysvec_apic_timer_interrupt+0x66/0x80
[ 749.630810] </IRQ>
[ 749.630971] <TASK>
[ 749.631131] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 749.631500] RIP: 0010:pv_native_safe_halt+0xf/0x20
[ 749.631821] Code: 06 86 00 c3 cc cc cc cc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d f5 97 1f 00 fb f4 <c3> cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90
[ 749.633041] RSP: 0018:ffffa47b000b3ed8 EFLAGS: 00000202
[ 749.633490] RAX: ffff9f2624acb000 RBX: ffff9f254124bb00 RCX: ffffa47b0056fe30
[ 749.633996] RDX: 4000000000000000 RSI: 0000000000000083 RDI: 000000000006ae44
[ 749.634482] RBP: 0000000000000003 R08: 000000000006ae44 R09: ffff9f25bdda49d0
[ 749.634952] R10: 000000ae900bc540 R11: 0000000000000008 R12: 0000000000000000
[ 749.635437] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 749.637635] ? ct_kernel_exit.constprop.0+0x63/0xf0
[ 749.638053] default_idle+0x9/0x10
[ 749.638349] default_idle_call+0x2b/0x100
[ 749.638682] do_idle+0x1d0/0x230
[ 749.638903] cpu_startup_entry+0x24/0x30
[ 749.639163] start_secondary+0xf3/0x100
[ 749.639420] common_startup_64+0x13e/0x148
[ 749.639718] </TASK>
[ 749.639870] ---[ end trace 0000000000000000 ]---
----------
I then reduced the fair_server runtime to 40000000 ns (on all CPUs), raised the
global sched_rt_runtime_us to 960000, and re-ran stress-ng at 96% BW (passed)
and at 97% BW (failed: one process aborted). During this last test I again got
the two warnings mentioned above (kernel/sched/deadline.c:284 at dequeue_task_dl
and kernel/sched/deadline.c:245 at task_contending+0x167/0x190), both while
setting the fair_server runtime and while running stress-ng.
I couldn't reproduce the warnings on the same kernel without the patch applied,
even when trying different configurations of fair_server BW and global BW,
while they are very reproducible with the patch applied.
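For reference, the knob changes described above can be sketched as a small
script. The fair_server debugfs paths are an assumption on my side (a recent
kernel with the dl-server debugfs interface, and debugfs mounted at
/sys/kernel/debug); the values are the ones used in the test:

```shell
#!/bin/sh
# Sketch of the configuration used in the test above (assumed paths, needs root).
# The fair_server period defaults to 1 s, so 40000000 ns of runtime is 4% per CPU.
for d in /sys/kernel/debug/sched/fair_server/cpu*; do
    echo 40000000 > "$d/runtime"
done
# sched_rt_period_us defaults to 1000000, so 960000 us is a 96% global RT cap.
echo 960000 > /proc/sys/kernel/sched_rt_runtime_us
```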
[1]: https://github.com/ColinIanKing/stress-ng/issues/549
[2]: https://cloud.debian.org/images/cloud/sid/daily/20250606-2135/debian-sid-nocloud-amd64-daily-20250606-2135.qcow2
I hope this is helpful.
Best regards,
Matteo Martelli
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting
2025-07-25 11:44 ` Matteo Martelli
@ 2025-07-25 15:28 ` Yuri Andriaccio
2025-07-25 17:22 ` Matteo Martelli
0 siblings, 1 reply; 5+ messages in thread
From: Yuri Andriaccio @ 2025-07-25 15:28 UTC (permalink / raw)
To: matteo.martelli
Cc: bsegall, dietmar.eggemann, juri.lelli, linux-kernel, luca.abeni,
mgorman, mingo, peterz, rostedt, vincent.guittot, vschneid,
yuri.andriaccio
Hi,
Thank you very much for your testing, I'm very glad to know that the patch works
as intended.
At first glance, I think the warnings you are seeing are related to this bug
report that I posted a few days ago:
https://lore.kernel.org/all/20250718113848.193139-1-yurand2000@gmail.com/
Juri Lelli checked it out and made a patch that addresses these warnings here:
https://lore.kernel.org/all/20250721-upstream-fix-dlserver-lessaggressive-b4-v1-1-4ebc10c87e40@redhat.com/
Have you already applied Juri's patch together with mine? If not, I think it
should solve those issues you mentioned.
Have a nice day,
Yuri
* Re: [PATCH] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting
2025-07-25 15:28 ` Yuri Andriaccio
@ 2025-07-25 17:22 ` Matteo Martelli
0 siblings, 0 replies; 5+ messages in thread
From: Matteo Martelli @ 2025-07-25 17:22 UTC (permalink / raw)
To: Yuri Andriaccio
Cc: bsegall, dietmar.eggemann, juri.lelli, linux-kernel, luca.abeni,
mgorman, mingo, peterz, rostedt, vincent.guittot, vschneid,
yuri.andriaccio
Hi Yuri,
On Fri, 25 Jul 2025 17:28:04 +0200, Yuri Andriaccio <yurand2000@gmail.com> wrote:
> Hi,
>
> Thank you very much for your testing, I'm very glad to know that the patch works
> as intended.
>
> At first glance, I think the warnings you are having are related to this bug
> report that I posted a few days ago:
> https://lore.kernel.org/all/20250718113848.193139-1-yurand2000@gmail.com/
>
> Juri Lelli checked it out and made a patch that addresses these warns here:
> https://lore.kernel.org/all/20250721-upstream-fix-dlserver-lessaggressive-b4-v1-1-4ebc10c87e40@redhat.com/
>
> Have you already applied Juri's patch together with mine? If not, I think it
> should solve those issues you mentioned.
No, I didn't apply Juri's patch before the previously mentioned tests.
I've now applied it on top of yours and have just rerun the same tests.
I confirm that the stress-ng and runtime-variation commands give the same
results, and that the warnings are no longer produced.
>
> Have a nice day,
> Yuri
>
Best regards,
Matteo
end of thread, other threads:[~2025-07-25 17:22 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-21 11:11 [PATCH] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting Yuri Andriaccio
2025-07-22 0:07 ` kernel test robot
2025-07-25 11:44 ` Matteo Martelli
2025-07-25 15:28 ` Yuri Andriaccio
2025-07-25 17:22 ` Matteo Martelli