Re: sched/deadline: Use revised wakeup rule for dl_server

From: Andreas Ziegler <br025@umbiko.net>
To: Christian Loehle <christian.loehle@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	linux-kernel@vger.kernel.org,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	John Stultz <jstultz@google.com>
Subject: Re: sched/deadline: Use revised wakeup rule for dl_server
Date: Mon, 11 May 2026 12:37:12 +0000
Message-ID: <701f3a1dd4730f92cb3013176e068a16@umbiko.net>
In-Reply-To: <50156878-265d-4025-9b36-c819c80b7493@arm.com>

On 2026-05-11 09:47, Christian Loehle wrote:
> On 5/9/26 12:42, Andreas Ziegler wrote:
>> Hi Christian, Everyone,
>> 
>> On 2026-05-08 14:13, Christian Loehle wrote:
>>> On 5/8/26 13:06, Andreas Ziegler wrote:
>>>> Hi Christian,
>>>> 
>>>> On 2026-05-08 09:20, Christian Loehle wrote:
>>>>> On 5/8/26 09:09, Andreas Ziegler wrote:
>>>>>> Linux kernel version: 6.12
>>>>>>   CONFIG_PREEMPT_RT (w/ PREEMPT_RT patch applied)
>>>>>> Architecture: aarch64
>>>>>> Platform: Raspberry Pi 4
>>>>>> 
>>>>>> Hi everyone,
>>>>>> 
>>>>>> Commit d66792919d4f (sched/deadline: Use revised wakeup rule for 
>>>>>> dl_server) [1] introduced a marked degradation in scheduling 
>>>>>> latency for real-time tasks in the presence of heavy I/O load.
>>>>>> 
>>>>>> --- a/kernel/sched/deadline.c
>>>>>> +++ b/kernel/sched/deadline.c
>>>>>> @@ -1079,7 +1079,7 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
>>>>>>      if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
>>>>>>          dl_entity_overflow(dl_se, rq_clock(rq))) {
>>>>>> 
>>>>>> -        if (unlikely(!dl_is_implicit(dl_se) &&
>>>>>> +        if (unlikely((!dl_is_implicit(dl_se) || dl_se->dl_defer) &&
>>>>>>                   !dl_time_before(dl_se->deadline, rq_clock(rq)) &&
>>>>>>                   !is_dl_boosted(dl_se))) {
>>>>>>              update_dl_revised_wakeup(dl_se, rq);
>>>>>> 
>>>>>> This was observed using a modified version of Con Kolivas' 
>>>>>> interactivity benchmark [2]; kernel bisection eventually pointed 
>>>>>> to the above mentioned commit.
>>>>>> 
>>>>>> Benchmark results before d66792919d4f:
>>>>>> 
>>>>>> --- Benchmarking simulated cpu of Audio real time in the presence of simulated ---
>>>>>> Load    Latency +/- SD      median  max [100n]  Desired CPU  Deadlines met [%]
>>>>>> None       76.6 +/- 8.3654      76         166
>>>>>> Video      78.5 +/- 3.9433      78         107
>>>>>> X          76.4 +/- 8.123       75         157
>>>>>> Burn       72.0 +/- 6.4733      71         127
>>>>>> Write     255.3 +/- 26.627     252         331
>>>>>> Read      226.6 +/- 12.38      227         262
>>>>>> Ring       84.2 +/- 6.6207      83         125
>>>>>> Compile   225.3 +/- 23.949     222         328
>>>>>> 
>>>>>>           136.8 +/- 78.462                 331
>>>>>> 
>>>>>> Benchmark results after d66792919d4f:
>>>>>> 
>>>>>> --- Benchmarking simulated cpu of Audio real time in the presence of simulated ---
>>>>>> Load    Latency +/- SD      median  max [100n]  Desired CPU  Deadlines met [%]
>>>>>> None       68.4 +/- 9.7864      67         169
>>>>>> Video      74.4 +/- 3.724       74          97
>>>>>> X          72.0 +/- 6.5681      71         129
>>>>>> Burn       66.9 +/- 5.9059      66         117
>>>>>> Write    9576.9 +/- 67639      250      500418         98.1               98.1
>>>>>> Read      209.3 +/- 11.018     209         267
>>>>>> Ring       80.5 +/- 8.0993      78         125
>>>>>> Compile   239.0 +/- 29.447     234         372
>>>>>> 
>>>>>>          1298.4 +/- 24118               500418
>>>>>> 
>>>>>> Reverting this commit obviously solves the issue for me. I have no 
>>>>>> idea why this issue appears exclusively with heavy write loads in 
>>>>>> the background.
>>>>>> 
>>>>>> Is this a scheduler issue, or rather something in the background?
>>>>>> 
>>>>> 
>>>>> Hi Andreas,
>>>>> You're using cpufreq schedutil for your tests I'm assuming?
>>>>> Is there a difference in cpufreq behavior (avg cpufreq or OPP 
>>>>> residencies?)
>>>>> Does the regression also happen on powersave/performance governor?
>>>> 
>>>> Actually this is a very stripped-down system. The 'performance' 
>>>> cpufreq governor is the only one compiled in, the processor cores 
>>>> run on a fixed frequency. CONFIG_PM_OPP is not set.
>>> 
>>> That certainly makes the analysis easier.
>>> I couldn't reproduce the issue so far on my system, but it does seem
>>> like the dl server could get potentially unbounded running time: very
>>> frequent starting and stopping of the dl server (which presumably
>>> happens because of the writeback) resets the runtime, which then leads
>>> to your 25s observed latency.
>>> Peter, how is the revised wakeup rule supposed to behave here?
>>> 
>>>> [snip]
>> 
>> This seems to be a case of runtime starvation. If I change 
>> sched_rt_runtime_us to a smaller value, the benchmark returns 
>> reasonable latency values.
>> 
>> # echo "980000" > /proc/sys/kernel/sched_rt_runtime_us
>> 
>> I could live with this workaround, since it seems not to impact 
>> overall latency values in a noticeable way.
>> 
> 
> Not a very stable workaround unfortunately :/
> While I try to reproduce this, what you're observing should imply that
> the background SCHED_NORMAL work is enough to fully utilize the system,
> right?
> interbench Write does 4k (buffered) writes of a 1GB file and then
> close+open and repeat, nothing fancy really. Does this actually produce
> significant CPU utilization for you? Can you just run the background
> work and see what that looks like?
> (What you're seeing looks like a bug in any case, just so I'm not going
> down a wrong path when trying to reproduce here).

You are right, the result with the workaround was a false positive: the 
problem seems to be intermittent (maybe 1/20), and I just got lucky for 
one session.
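
For reference, the share of each period that a given sched_rt_runtime_us 
value corresponds to can be computed as in the rough sketch below. The 
helper names are made up for illustration, and the 1000000 us period used 
in the example is the kernel default, assumed here rather than measured 
on this system.

```c
#include <assert.h>
#include <stdio.h>

/* Read one integer from a procfs file. Returns 0 on success, -1 on failure. */
static int read_long(const char *path, long *out)
{
    assert(path != NULL && out != NULL);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/*
 * Percentage of each period covered by runtime_us.
 * Returns -1.0 when runtime_us is negative (-1 means "no limit")
 * or when the period is not positive.
 */
static double rt_share_percent(long runtime_us, long period_us)
{
    if (runtime_us < 0 || period_us <= 0)
        return -1.0;
    return 100.0 * (double)runtime_us / (double)period_us;
}
```

Feeding it the values read from /proc/sys/kernel/sched_rt_runtime_us and 
/proc/sys/kernel/sched_rt_period_us gives the current setting; 980000 
with the assumed default period works out to 98%.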

Some background information about the current state of the system:
   /* CONFIG_CPU_FREQ is not set */
   Root filesystem in RAM (initrd)
   CPU 3 is isolated; boot parameters:
     console=tty1 console=ttyAMA0,115200 isolcpus=nohz,domain,managed_irq,3 nohz_full=3 rcu_nocbs=3
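
Whether the isolation parameters above took effect can be checked at 
run time from the cpulist files under /sys/devices/system/cpu 
("isolated" and "nohz_full"). The sketch below is one way to do that; 
the function names are hypothetical, and the parser assumes the plain 
"a,b-c" list format without stride suffixes.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Return 1 if `cpu` appears in a cpulist string such as "3" or "0-2,5". */
static int cpulist_contains(const char *list, int cpu)
{
    assert(list != NULL);

    const char *p = list;
    while (*p) {
        char *end;
        long lo = strtol(p, &end, 10);
        if (end == p)
            break;              /* no number left: empty list or newline */
        long hi = lo;
        if (*end == '-') {      /* range "lo-hi" */
            const char *q = end + 1;
            hi = strtol(q, &end, 10);
            if (end == q)
                break;          /* malformed range, stop parsing */
        }
        if (cpu >= lo && cpu <= hi)
            return 1;
        if (*end != ',')
            break;              /* end of list */
        p = end + 1;
    }
    return 0;
}

/* Read a sysfs cpulist file; returns 1/0 for membership, -1 if unreadable. */
static int cpu_in_sysfs_list(const char *path, int cpu)
{
    char buf[256];
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, sizeof(buf), f))
        buf[0] = '\0';          /* empty file means empty list */
    fclose(f);
    return cpulist_contains(buf, cpu);
}
```

With the boot parameters shown, cpu_in_sysfs_list() would be expected to 
return 1 for CPU 3 on both files.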

The system is normally close to 100% idle; this is from top after 
reboot:

Mem: 95724K used, 853524K free, 42408K shrd, 72K buff, 43352K cached
CPU:  0.0% usr  0.0% sys  0.0% nic  100% idle  0.0% io  0.0% irq  0.0% sirq
Load average: 0.21 0.17 0.07 3/126 702

The file size used by interbench is even less than 1GB here, due to the 
limits of the rootfs; typical values are around 100-200 MiB. The file is 
written in an infinite loop until the stop message arrives (via pipe) 
from the controlling process. The check for the stop message occurs 
only after a completed write, not at block level.
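
The load pattern described above can be approximated as follows. This is 
a rough sketch, not interbench's actual code: the function names are 
made up, the 4k buffer follows Christian's description, and checking the 
pipe once per full pass over the file is my reading of "after a 
completed write".

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

/* Non-blocking check: 1 if a stop message (or hangup) is pending on the pipe. */
static int stop_requested(int pipe_fd)
{
    struct pollfd pfd = { .fd = pipe_fd, .events = POLLIN };

    return poll(&pfd, 1, 0) > 0 && (pfd.revents & (POLLIN | POLLHUP));
}

/*
 * Rewrite `path` with `file_size` bytes in 4k buffered writes, over and
 * over, until a stop message arrives. The stop check runs once per
 * completed pass, never between blocks. Returns passes done, or -1.
 */
static long write_load(const char *path, size_t file_size, int stop_fd)
{
    char buf[4096];
    long passes = 0;

    assert(path != NULL);
    memset(buf, 0xAA, sizeof(buf));

    for (;;) {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
            return -1;

        for (size_t done = 0; done < file_size; ) {
            size_t left = file_size - done;
            size_t want = left < sizeof(buf) ? left : sizeof(buf);
            ssize_t n = write(fd, buf, want);
            if (n < 0) {
                if (errno == EINTR)
                    continue;   /* interrupted, retry the same block */
                close(fd);
                return -1;
            }
            done += (size_t)n;
        }
        close(fd);
        passes++;

        if (stop_requested(stop_fd))
            return passes;
    }
}
```

With this structure, a stop message sent in the middle of a pass is only 
noticed once the remaining blocks of that pass have been written.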

I just noticed that interbench seems to have a bug itself: it uses only 
one processor, which looks like a mangled CPU mask. Top output during 
the write benchmark:

Mem: 358024K used, 591224K free, 298516K shrd, 2504K buff, 299464K cached
CPU:  1.8% usr 23.1% sys  0.0% nic 74.9% idle  0.0% io  0.0% irq  0.0% sirq
Load average: 1.21 0.46 0.29 5/129 2116
   PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
  2106  2105 root     S     1228  0.1   0 23.6 interbench -r -t 60 -u -w Write -W
  2109  2105 root     S     1228  0.1   0  1.2 interbench -r -t 60 -u -w Write -W
  1829  1274 root     R     1600  0.1   2  0.0 top -d 5
    22     2 root     SW       0  0.0   0  0.0 [rcuc/0]
  1270     2 root     IW       0  0.0   0  0.0 [kworker/0:0-eve]
   652     1 mpd      S    27632  2.9   0  0.0 /usr/bin/mpd
  2023  2021 root     S     4476  0.4   0  0.0 sshd-session: root@notty
   675   673 root     S     4448  0.4   1  0.0 sshd-session: root@pts/0
   673   601 root     S     4140  0.4   0  0.0 sshd-session: root [priv]
  2021   601 root     S     4140  0.4   0  0.0 sshd-session: root [priv]
   601     1 root     S     3736  0.3   1  0.0 sshd: /usr/sbin/sshd [listener] 0
  2024  2023 root     S     3224  0.3   1  0.0 /usr/libexec/sftp-server
  2025  2023 root     S     3188  0.3   2  0.0 /usr/libexec/sftp-server
   501     1 root     S     1884  0.2   1  0.0 /usr/sbin/wpa_supplicant -B -P /va
   131     1 root     S     1672  0.1   0  0.0 /sbin/mdev -df
   676   675 root     S     1636  0.1   1  0.0 -sh
  1274   605 root     S     1636  0.1   1  0.0 -sh
   605     1 root     S     1592  0.1   1  0.0 /usr/sbin/telnetd -F
   527     1 root     S     1576  0.1   2  0.0 udhcpc -t1 -A2 -b -R -O search -O
     1     0 root     S     1576  0.1   0  0.0 init
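
To tell whether the single-CPU behaviour really comes from the affinity 
mask, the mask of a running process can be read back with 
sched_getaffinity(2). The sketch below is one way to do it; 
print_affinity() is a hypothetical helper, not part of interbench.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Print the CPUs that `pid` is allowed to run on (0 means the caller).
 * Returns the number of allowed CPUs, or -1 on error.
 */
static int print_affinity(pid_t pid)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    if (sched_getaffinity(pid, sizeof(set), &set) != 0) {
        perror("sched_getaffinity");
        return -1;
    }
    /* A successful call always reports at least one allowed CPU. */
    assert(CPU_COUNT(&set) > 0);

    printf("pid %d may run on CPUs:", (int)pid);
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            printf(" %d", cpu);
    printf("\n");

    return CPU_COUNT(&set);
}
```

Calling it with the interbench PIDs from the top output (2106 and 2109 
in this run) would show whether they are restricted to CPU 0 or merely 
happen to run there.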

I tried limiting interbench's rather excessive SCHED_FIFO priorities to 
values normal for the system, but without success.
