* SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
@ 2025-04-28 18:04 Marcel Ziswiler
2025-05-02 13:55 ` Juri Lelli
0 siblings, 1 reply; 35+ messages in thread
From: Marcel Ziswiler @ 2025-04-28 18:04 UTC (permalink / raw)
To: linux-kernel
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vineeth Pillai,
Daniel Bristot de Oliveira
Hi
As part of our trustable work [1], we also run a lot of real time scheduler (SCHED_DEADLINE) tests on the
mainline Linux kernel. Overall, the Linux scheduler proves quite capable of scheduling deadline tasks down to a
granularity of 5ms on both of our test systems (amd64-based Intel NUCs and aarch64-based RADXA ROCK5Bs).
However, recently, we noticed a lot of deadline misses if we introduce overrunning jobs with reclaim mode
enabled (SCHED_FLAG_RECLAIM) using GRUB (Greedy Reclamation of Unused Bandwidth). E.g. from hundreds of
millions of test runs over the course of a full week where we usually see absolutely zero deadline misses, we
see 43 million deadline misses on NUC and 600 thousand on ROCK5B (which also has double the CPU cores). This is
with otherwise exactly the same test configuration, which adds exactly the same two overrunning jobs to the job
mix, but once without reclaim enabled and once with reclaim enabled.
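For reference, the only difference between the two configurations is whether SCHED_FLAG_RECLAIM is set in the
job's sched_setattr() call. A minimal sketch of such a call (illustrative parameters only, not our actual test
harness):

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE     6
#endif
#ifndef SCHED_FLAG_RECLAIM
#define SCHED_FLAG_RECLAIM 0x02
#endif

/* Local copy of the UAPI layout, as documented in sched_setattr(2). */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;   /* ns */
	uint64_t sched_deadline;  /* ns */
	uint64_t sched_period;    /* ns */
};

int main(void)
{
	struct sched_attr attr = {
		.size           = sizeof(attr),
		.sched_policy   = SCHED_DEADLINE,
		.sched_flags    = SCHED_FLAG_RECLAIM, /* omitted in the non-reclaim runs */
		.sched_runtime  =   500 * 1000ULL,    /* 0.5 ms */
		.sched_deadline = 10000 * 1000ULL,    /* 10 ms  */
		.sched_period   = 10000 * 1000ULL,    /* 10 ms  */
	};

	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}

	/* ... periodic workload loop would run here ... */
	return 0;
}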
We are wondering whether there are any known limitations to GRUB or what exactly could be the issue.
We are happy to provide more detailed debugging information but are looking for suggestions how/what exactly to
look at.
Any help is much appreciated. Thanks!
Cheers
Marcel
[1] https://projects.eclipse.org/projects/technology.tsf
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-04-28 18:04 SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) Marcel Ziswiler
@ 2025-05-02 13:55 ` Juri Lelli
2025-05-02 14:10 ` luca abeni
2025-05-03 11:14 ` Marcel Ziswiler
0 siblings, 2 replies; 35+ messages in thread
From: Juri Lelli @ 2025-05-02 13:55 UTC (permalink / raw)
To: Marcel Ziswiler
Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai,
Luca Abeni
Hi Marcel,
On 28/04/25 20:04, Marcel Ziswiler wrote:
> Hi
>
> As part of our trustable work [1], we also run a lot of real time scheduler (SCHED_DEADLINE) tests on the
> mainline Linux kernel. Overall, the Linux scheduler proves quite capable of scheduling deadline tasks down to a
> granularity of 5ms on both of our test systems (amd64-based Intel NUCs and aarch64-based RADXA ROCK5Bs).
> However, recently, we noticed a lot of deadline misses if we introduce overrunning jobs with reclaim mode
> enabled (SCHED_FLAG_RECLAIM) using GRUB (Greedy Reclamation of Unused Bandwidth). E.g. from hundreds of
> millions of test runs over the course of a full week where we usually see absolutely zero deadline misses, we
> see 43 million deadline misses on NUC and 600 thousand on ROCK5B (which also has double the CPU cores). This is
> with otherwise exactly the same test configuration, which adds exactly the same two overrunning jobs to the job
> mix, but once without reclaim enabled and once with reclaim enabled.
>
> We are wondering whether there are any known limitations to GRUB or what exactly could be the issue.
>
> We are happy to provide more detailed debugging information but are looking for suggestions how/what exactly to
> look at.
Could you add details of the taskset you are working with? The number of
tasks, their reservation parameters (runtime, period, deadline) and how
much they are running (or trying to run) each time they wake up. Also
which one is using GRUB and which one maybe is not.
Adding Luca in Cc so he can also take a look.
Thanks,
Juri
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-05-02 13:55 ` Juri Lelli
@ 2025-05-02 14:10 ` luca abeni
2025-05-03 13:14 ` Marcel Ziswiler
2025-05-03 11:14 ` Marcel Ziswiler
1 sibling, 1 reply; 35+ messages in thread
From: luca abeni @ 2025-05-02 14:10 UTC (permalink / raw)
To: Juri Lelli
Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
Hi all,
On Fri, 2 May 2025 15:55:42 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:
> Hi Marcel,
>
> On 28/04/25 20:04, Marcel Ziswiler wrote:
> > Hi
> >
> > As part of our trustable work [1], we also run a lot of real time
> > scheduler (SCHED_DEADLINE) tests on the mainline Linux kernel.
> > Overall, the Linux scheduler proves quite capable of scheduling
> > deadline tasks down to a granularity of 5ms on both of our test
> > systems (amd64-based Intel NUCs and aarch64-based RADXA ROCK5Bs).
> > However, recently, we noticed a lot of deadline misses if we
> > introduce overrunning jobs with reclaim mode enabled
> > (SCHED_FLAG_RECLAIM) using GRUB (Greedy Reclamation of Unused
> > Bandwidth). E.g. from hundreds of millions of test runs over the
> > course of a full week where we usually see absolutely zero deadline
> > misses, we see 43 million deadline misses on NUC and 600 thousand
> > on ROCK5B (which also has double the CPU cores). This is with
> > otherwise exactly the same test configuration, which adds exactly
> > the same two overrunning jobs to the job mix, but once without
> > reclaim enabled and once with reclaim enabled.
> >
> > We are wondering whether there are any known limitations to GRUB or
> > what exactly could be the issue.
> >
> > We are happy to provide more detailed debugging information but are
> > looking for suggestions how/what exactly to look at.
>
> Could you add details of the taskset you are working with? The number
> of tasks, their reservation parameters (runtime, period, deadline)
> and how much they are running (or trying to run) each time they wake
> up. Also which one is using GRUB and which one maybe is not.
>
> Adding Luca in Cc so he can also take a look.
Thanks for cc-ing me, Juri!
Marcel, are your tests on a multi-core machine with global scheduling?
If yes, we should check if the taskset is schedulable.
Thanks,
Luca
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-05-02 13:55 ` Juri Lelli
2025-05-02 14:10 ` luca abeni
@ 2025-05-03 11:14 ` Marcel Ziswiler
2025-05-07 20:25 ` luca abeni
1 sibling, 1 reply; 35+ messages in thread
From: Marcel Ziswiler @ 2025-05-03 11:14 UTC (permalink / raw)
To: Juri Lelli
Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai,
Luca Abeni
Hi Juri
Thanks for getting back to me.
On Fri, 2025-05-02 at 15:55 +0200, Juri Lelli wrote:
> Hi Marcel,
>
> On 28/04/25 20:04, Marcel Ziswiler wrote:
> > Hi
> >
> > As part of our trustable work [1], we also run a lot of real time scheduler (SCHED_DEADLINE) tests on the
> > mainline Linux kernel. Overall, the Linux scheduler proves quite capable of scheduling deadline tasks down
> > to a
> > granularity of 5ms on both of our test systems (amd64-based Intel NUCs and aarch64-based RADXA ROCK5Bs).
> > However, recently, we noticed a lot of deadline misses if we introduce overrunning jobs with reclaim mode
> > enabled (SCHED_FLAG_RECLAIM) using GRUB (Greedy Reclamation of Unused Bandwidth). E.g. from hundreds of
> > millions of test runs over the course of a full week where we usually see absolutely zero deadline misses,
> > we
> > see 43 million deadline misses on NUC and 600 thousand on ROCK5B (which also has double the CPU cores).
> > This is
> > with otherwise exactly the same test configuration, which adds exactly the same two overrunning jobs to the
> > job
> > mix, but once without reclaim enabled and once with reclaim enabled.
> >
> > We are wondering whether there are any known limitations to GRUB or what exactly could be the issue.
> >
> > We are happy to provide more detailed debugging information but are looking for suggestions how/what
> > exactly to
> > look at.
>
> Could you add details of the taskset you are working with? The number of
> tasks, their reservation parameters (runtime, period, deadline) and how
> much they are running (or trying to run) each time they wake up. Also
> which one is using GRUB and which one maybe is not.
We currently use three cores as follows:
#### core x
|sched_deadline = sched_period | sched_runtime | CP max run time 90% of sched_runtime | utilisation | reclaim |
| -- | -- | -- | -- | -- |
| 5 ms | 0.15 ms | 0.135 ms | 3.00% | no |
| 10 ms | 1.8 ms | 1.62 ms | 18.00% | no |
| 10 ms | 2.1 ms | 1.89 ms | 21.00% | no |
| 14 ms | 2.3 ms | 2.07 ms | 16.43% | no |
| 50 ms | 8.0 ms | 7.20 ms | 16.00% | no |
| 10 ms | 0.5 ms | **1 | 5.00% | no |
Total utilisation of core x is 79.43% (less than 100%)
**1 - this shall be a rogue process. This process will
a) run for the maximum allowed workload value
b) not collect execution data
This last rogue process is the one which causes massive issues to the rest of the scheduling if we set it to do
reclaim.
#### core y
|sched_deadline = sched_period | sched_runtime | CP max run time 90% of sched_runtime | utilisation | reclaim |
| -- | -- | -- | -- | -- |
| 5 ms | 0.5 ms | 0.45 ms | 10.00% | no |
| 10 ms | 1.9 ms | 1.71 ms | 19.00% | no |
| 12 ms | 1.8 ms | 1.62 ms | 15.00% | no |
| 50 ms | 5.5 ms | 4.95 ms | 11.00% | no |
| 50 ms | 9.0 ms | 8.10 ms | 18.00% | no |
Total utilisation of core y is 73.00% (less than 100%)
#### core z
The third core is special as it will run 50 jobs, all with the same configuration:
|sched_deadline = sched_period | sched_runtime | CP max run time 90% of sched_runtime | utilisation |
| -- | -- | -- | -- |
| 50 ms | 0.8 ms | 0.72 ms | 1.60% |
jobs 1-50 should run with reclaim OFF
Total utilisation of core z is 1.6 * 50 = 80.00% (less than 100%)
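For illustration, the jobs are essentially periodic loops of the following shape (an assumed, simplified sketch,
not our actual test code, which among other things also records execution data): busy-run for the configured max
run time, then sleep until the next activation.

#include <stdint.h>
#include <time.h>

/* Busy-loop for roughly "ns" nanoseconds of wall-clock time. */
static void busy_run_ns(int64_t ns)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000000000LL +
		 (now.tv_nsec - start.tv_nsec) < ns);
}

/* One test job: run for workload_ns every period_ns. */
static void periodic_job(int64_t period_ns, int64_t workload_ns)
{
	struct timespec next;

	clock_gettime(CLOCK_MONOTONIC, &next);
	for (;;) {
		busy_run_ns(workload_ns);   /* e.g. 90% of sched_runtime */

		/* Sleep until the next activation (absolute time);
		 * alternatively, sched_yield() can be used to give up the
		 * remaining runtime of the current job. */
		next.tv_nsec += period_ns % 1000000000LL;
		next.tv_sec  += period_ns / 1000000000LL;
		if (next.tv_nsec >= 1000000000LL) {
			next.tv_nsec -= 1000000000LL;
			next.tv_sec++;
		}
		clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
	}
}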
Please let me know if you need any further details which may help figuring out what exactly is going on.
> Adding Luca in Cc so he can also take a look.
>
> Thanks,
Thank you!
> Juri
Cheers
Marcel
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-05-02 14:10 ` luca abeni
@ 2025-05-03 13:14 ` Marcel Ziswiler
2025-05-05 15:53 ` luca abeni
0 siblings, 1 reply; 35+ messages in thread
From: Marcel Ziswiler @ 2025-05-03 13:14 UTC (permalink / raw)
To: luca abeni, Juri Lelli
Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai
Hi Luca
On Fri, 2025-05-02 at 16:10 +0200, luca abeni wrote:
> Hi all,
>
> On Fri, 2 May 2025 15:55:42 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
>
> > Hi Marcel,
> >
> > On 28/04/25 20:04, Marcel Ziswiler wrote:
> > > Hi
> > >
> > > As part of our trustable work [1], we also run a lot of real time
> > > scheduler (SCHED_DEADLINE) tests on the mainline Linux kernel.
> > > Overall, the Linux scheduler proves quite capable of scheduling
> > > deadline tasks down to a granularity of 5ms on both of our test
> > > systems (amd64-based Intel NUCs and aarch64-based RADXA ROCK5Bs).
> > > However, recently, we noticed a lot of deadline misses if we
> > > introduce overrunning jobs with reclaim mode enabled
> > > (SCHED_FLAG_RECLAIM) using GRUB (Greedy Reclamation of Unused
> > > Bandwidth). E.g. from hundreds of millions of test runs over the
> > > course of a full week where we usually see absolutely zero deadline
> > > misses, we see 43 million deadline misses on NUC and 600 thousand
> > > on ROCK5B (which also has double the CPU cores). This is with
> > > otherwise exactly the same test configuration, which adds exactly
> > > the same two overrunning jobs to the job mix, but once without
> > > reclaim enabled and once with reclaim enabled.
> > >
> > > We are wondering whether there are any known limitations to GRUB or
> > > what exactly could be the issue.
> > >
> > > We are happy to provide more detailed debugging information but are
> > > looking for suggestions how/what exactly to look at.
> >
> > Could you add details of the taskset you are working with? The number
> > of tasks, their reservation parameters (runtime, period, deadline)
> > and how much they are running (or trying to run) each time they wake
> > up. Also which one is using GRUB and which one maybe is not.
> >
> > Adding Luca in Cc so he can also take a look.
>
> Thanks for cc-ing me, Juri!
>
> Marcel, are your tests on a multi-core machine with global scheduling?
> If yes, we should check if the taskset is schedulable.
Yes, as previously mentioned, we run all our tests on multi-core machines. I am not sure, though, what exactly
you are referring to by "global scheduling". Do you mean Global Earliest Deadline First (GEDF)? I guess that is
what SCHED_DEADLINE is using, no?
Concerning the taskset being schedulable, it is not that it does not schedule at all. Remember, from hundreds
of millions of test runs over the course of a full week where we usually see absolutely zero deadline misses
(without reclaim), we see 43 million deadline misses (with that one rogue process set to reclaim) on NUC and
600 thousand on ROCK5B (which also has double the CPU cores).
Please let me know if you need any further details which may help figuring out what exactly is going on.
> Thanks,
Thank you!
> Luca
Cheers
Marcel
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-05-03 13:14 ` Marcel Ziswiler
@ 2025-05-05 15:53 ` luca abeni
0 siblings, 0 replies; 35+ messages in thread
From: luca abeni @ 2025-05-05 15:53 UTC (permalink / raw)
To: Marcel Ziswiler
Cc: Juri Lelli, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
Hi Marcel,
On Sat, 03 May 2025 15:14:50 +0200
Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote:
[...]
> > Marcel, are your tests on a multi-core machine with global
> > scheduling? If yes, we should check if the taskset is schedulable.
>
> Yes, as previously mentioned, we run all our tests on multi-core
> machines. Not sure though, what exactly you are referring to by
> "global scheduling". Do you mean using Global Earliest Deadline First
> (GEDF)? I guess that is what SCHED_DEADLINE is using, not?
Yes, I meant global EDF (and, yes, this is what SCHED_DEADLINE uses
unless you play with isolated cpusets or affinities).
One potential issue is that global EDF does not guarantee the
hard respect of deadlines, but only provides guarantees about bounded
tardiness. Then, in practice many tasksets run without missing
deadlines even if they are not guaranteed to be schedulable (the hard
schedulability tests are very pessimistic).
When using GRUB (actually, m-GRUB), the runtimes of the tasks are
increased to reclaim unreserved CPU time, and this increases the
probability of missing deadlines. m-GRUB guarantees that all deadlines
are respected only if some hard schedulability tests (more complex than
the admission control policy used by SCHED_DEADLINE) are satisfied.
This paper provides more details about such schedulability tests:
https://hal.science/hal-01286130/document
(see Section 4.2)
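(As a concrete example of such a test - though not the m-GRUB-specific one from the paper - the classic
Goossens/Funk/Baruah condition for implicit-deadline tasks under global EDF on m CPUs guarantees schedulability
when

	U_tot <= m - (m - 1) * U_max

where U_max is the largest per-task utilisation. Tasksets above this bound can still run without misses in
practice, which is the pessimism I was referring to.)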
I see that in another email you describe the taskset you are using...
I'll try to have a look at it to check if the issue you are seeing is
related to what I mention above, or if there is some other issue.
Luca
>
> Concerning the taskset being schedulable, it is not that it does not
> schedule at all. Remember, from hundreds of millions of test runs
> over the course of a full week where we usually see absolutely zero
> deadline misses (without reclaim), we see 43 million deadline misses
> (with that one rogue process set to reclaim) on NUC and 600 thousand
> on ROCK5B (which also has double the CPU cores).
>
> Please let me know if you need any further details which may help
> figuring out what exactly is going on.
>
> > Thanks,
>
> Thank you!
>
> > Luca
>
> Cheers
>
> Marcel
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-05-03 11:14 ` Marcel Ziswiler
@ 2025-05-07 20:25 ` luca abeni
2025-05-19 13:32 ` Marcel Ziswiler
0 siblings, 1 reply; 35+ messages in thread
From: luca abeni @ 2025-05-07 20:25 UTC (permalink / raw)
To: Marcel Ziswiler
Cc: Juri Lelli, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
Hi Marcel,
just a quick question to better understand your setup (and check where
the issue comes from):
in the email below, you say that tasks are statically assigned to
cores; how did you do this? Did you use isolated cpusets, or did you
set the tasks affinities after disabling the SCHED_DEADLINE admission
control (echo -1 > /proc/sys/kernel/sched_rt_runtime_us)?
Or am I misunderstanding your setup?
Also, are you using HRTICK_DL?
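(In case it helps checking this: HRTICK_DL is a scheduler feature flag, so if SCHED_DEBUG and debugfs are
available it should show up in /sys/kernel/debug/sched/features and can be toggled by writing HRTICK_DL or
NO_HRTICK_DL to that file, if I remember correctly.)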
Thanks,
Luca
On Sat, 03 May 2025 13:14:53 +0200
Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote:
[...]
> We currently use three cores as follows:
>
> #### core x
>
> |sched_deadline = sched_period | sched_runtime | CP max run time 90% of sched_runtime | utilisation | reclaim |
> | -- | -- | -- | -- | -- |
> | 5 ms | 0.15 ms | 0.135 ms | 3.00% | no |
> | 10 ms | 1.8 ms | 1.62 ms | 18.00% | no |
> | 10 ms | 2.1 ms | 1.89 ms | 21.00% | no |
> | 14 ms | 2.3 ms | 2.07 ms | 16.43% | no |
> | 50 ms | 8.0 ms | 7.20 ms | 16.00% | no |
> | 10 ms | 0.5 ms | **1 | 5.00% | no |
>
> Total utilisation of core x is 79.43% (less than 100%)
>
> **1 - this shall be a rogue process. This process will
> a) run for the maximum allowed workload value
> b) do not collect execution data
>
> This last rogue process is the one which causes massive issues to the
> rest of the scheduling if we set it to do reclaim.
>
> #### core y
>
> |sched_deadline = sched_period | sched_runtime | CP max run time 90% of sched_runtime | utilisation | reclaim |
> | -- | -- | -- | -- | -- |
> | 5 ms | 0.5 ms | 0.45 ms | 10.00% | no |
> | 10 ms | 1.9 ms | 1.71 ms | 19.00% | no |
> | 12 ms | 1.8 ms | 1.62 ms | 15.00% | no |
> | 50 ms | 5.5 ms | 4.95 ms | 11.00% | no |
> | 50 ms | 9.0 ms | 8.10 ms | 18.00% | no |
>
> Total utilisation of core y is 73.00% (less than 100%)
>
> #### core z
>
> The third core is special as it will run 50 jobs with the same
> configuration as such:
>
> |sched_deadline = sched_period | sched_runtime | CP max run time 90% of sched_runtime | utilisation |
> | -- | -- | -- | -- |
> | 50 ms | 0.8 ms | 0.72 ms | 1.60% |
>
> jobs 1-50 should run with reclaim OFF
>
> Total utilisation of core z is 1.6 * 50 = 80.00% (less than 100%)
>
> Please let me know if you need any further details which may help
> figuring out what exactly is going on.
>
> > Adding Luca in Cc so he can also take a look.
> >
> > Thanks,
>
> Thank you!
>
> > Juri
>
> Cheers
>
> Marcel
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-05-07 20:25 ` luca abeni
@ 2025-05-19 13:32 ` Marcel Ziswiler
2025-05-20 16:09 ` luca abeni
2025-05-23 19:46 ` luca abeni
0 siblings, 2 replies; 35+ messages in thread
From: Marcel Ziswiler @ 2025-05-19 13:32 UTC (permalink / raw)
To: luca abeni
Cc: Juri Lelli, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
Hi Luca
Thanks, and sorry for my late reply. I was traveling the Cretan wilderness without access to any work-related
infrastructure.
On Wed, 2025-05-07 at 22:25 +0200, luca abeni wrote:
> Hi Marcel,
>
> just a quick question to better understand your setup (and check where
> the issue comes from):
> in the email below, you say that tasks are statically assigned to
> cores; how did you do this? Did you use isolated cpusets,
Yes, we use the cpuset controller from the cgroup-v2 APIs in the linux kernel in order to partition CPUs and
memory nodes. In detail, we use the AllowedCPUs and
AllowedMemoryNodes in systemd's slice configurations.
> or did you
> set the tasks affinities after disabling the SCHED_DEADLINE admission
> control (echo -1 > /proc/sys/kernel/sched_rt_runtime_us)?
No.
> Or am I misunderstanding your setup?
No, I don't think so.
> Also, are you using HRTICK_DL?
No, not that I am aware of, and definitely not on the ROCK5Bs, while our amd64 configuration currently does not
even enable SCHED_DEBUG. So I am not sure how to easily check the specific HRTICK feature setting in that case.
> Thanks,
> Luca
Thank you very much!
Cheers
Marcel
> On Sat, 03 May 2025 13:14:53 +0200
> Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote:
> [...]
> > We currently use three cores as follows:
> >
> > #### core x
> >
> > |sched_deadline = sched_period | sched_runtime | CP max run time 90% of sched_runtime | utilisation | reclaim |
> > | -- | -- | -- | -- | -- |
> > | 5 ms | 0.15 ms | 0.135 ms | 3.00% | no |
> > | 10 ms | 1.8 ms | 1.62 ms | 18.00% | no |
> > | 10 ms | 2.1 ms | 1.89 ms | 21.00% | no |
> > | 14 ms | 2.3 ms | 2.07 ms | 16.43% | no |
> > | 50 ms | 8.0 ms | 7.20 ms | 16.00% | no |
> > | 10 ms | 0.5 ms | **1 | 5.00% | no |
> >
> > Total utilisation of core x is 79.43% (less than 100%)
> >
> > **1 - this shall be a rogue process. This process will
> > a) run for the maximum allowed workload value
> > b) do not collect execution data
> >
> > This last rogue process is the one which causes massive issues to the
> > rest of the scheduling if we set it to do reclaim.
> >
> > #### core y
> >
> > |sched_deadline = sched_period | sched_runtime | CP max run time 90% of sched_runtime | utilisation | reclaim |
> > | -- | -- | -- | -- | -- |
> > | 5 ms | 0.5 ms | 0.45 ms | 10.00% | no |
> > | 10 ms | 1.9 ms | 1.71 ms | 19.00% | no |
> > | 12 ms | 1.8 ms | 1.62 ms | 15.00% | no |
> > | 50 ms | 5.5 ms | 4.95 ms | 11.00% | no |
> > | 50 ms | 9.0 ms | 8.10 ms | 18.00% | no |
> >
> > Total utilisation of core y is 73.00% (less than 100%)
> >
> > #### core z
> >
> > The third core is special as it will run 50 jobs with the same
> > configuration as such:
> >
> > |sched_deadline = sched_period | sched_runtime | CP max run time 90% of sched_runtime | utilisation |
> > | -- | -- | -- | -- |
> > | 50 ms | 0.8 ms | 0.72 ms | 1.60% |
> >
> > jobs 1-50 should run with reclaim OFF
> >
> > Total utilisation of core z is 1.6 * 50 = 80.00% (less than 100%)
> >
> > Please let me know if you need any further details which may help
> > figuring out what exactly is going on.
> >
> > > Adding Luca in Cc so he can also take a look.
> > >
> > > Thanks,
> >
> > Thank you!
> >
> > > Juri
> >
> > Cheers
> >
> > Marcel
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-05-19 13:32 ` Marcel Ziswiler
@ 2025-05-20 16:09 ` luca abeni
2025-05-21 9:59 ` Marcel Ziswiler
2025-05-23 19:46 ` luca abeni
1 sibling, 1 reply; 35+ messages in thread
From: luca abeni @ 2025-05-20 16:09 UTC (permalink / raw)
To: Marcel Ziswiler
Cc: Juri Lelli, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
Hi Marcel,
On Mon, 19 May 2025 15:32:27 +0200
Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote:
> Hi Luca
>
> Thanks and sorry, for my late reply. I was traveling the Cretan
> wilderness without access to any work related infrastructure.
>
> On Wed, 2025-05-07 at 22:25 +0200, luca abeni wrote:
> > Hi Marcel,
> >
> > just a quick question to better understand your setup (and check
> > where the issue comes from):
> > in the email below, you say that tasks are statically assigned to
> > cores; how did you do this? Did you use isolated cpusets,
>
> Yes, we use the cpuset controller from the cgroup-v2 APIs in the
> linux kernel in order to partition CPUs and memory nodes. In detail,
> we use the AllowedCPUs and AllowedMemoryNodes in systemd's slice
> configurations.
OK, I never tried the v2 API, but if it allows creating a new root
domain (which is an isolated cpuset, I think), then it should work
without issues.
So, since you are seeing unexpected deadline misses, there is a bug
somewhere... I am going to check.
In the meantime, enjoy the Cretan wilderness :)
Luca
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-05-20 16:09 ` luca abeni
@ 2025-05-21 9:59 ` Marcel Ziswiler
0 siblings, 0 replies; 35+ messages in thread
From: Marcel Ziswiler @ 2025-05-21 9:59 UTC (permalink / raw)
To: luca abeni
Cc: Juri Lelli, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
Hi Luca
On Tue, 2025-05-20 at 18:09 +0200, luca abeni wrote:
> Hi Marcel,
>
> On Mon, 19 May 2025 15:32:27 +0200
> Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote:
>
> > Hi Luca
> >
> > Thanks and sorry, for my late reply. I was traveling the Cretan
> > wilderness without access to any work related infrastructure.
> >
> > On Wed, 2025-05-07 at 22:25 +0200, luca abeni wrote:
> > > Hi Marcel,
> > >
> > > just a quick question to better understand your setup (and check
> > > where the issue comes from):
> > > in the email below, you say that tasks are statically assigned to
> > > cores; how did you do this? Did you use isolated cpusets,
> >
> > Yes, we use the cpuset controller from the cgroup-v2 APIs in the
> > linux kernel in order to partition CPUs and memory nodes. In detail,
> > we use the AllowedCPUs and AllowedMemoryNodes in systemd's slice
> > configurations.
>
> OK, I never tried the v2 API, but if it allows creating a new root
> domain (which is an isolated cpuset, I think), then it should work
> without issues.
Yes and it works fine with everything else :)
> So, since you are seeing unexpected deadline misses, there is a bug
> somewhere... I am going to check.
Thank you very much, and let me know if you need any further information to figure out what might be going on.
> In the meantime, enjoy the Cretan wilderness :)
Thanks, I already made it back and, yes, I enjoyed it very much :)
> Luca
Cheers
Marcel
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-05-19 13:32 ` Marcel Ziswiler
2025-05-20 16:09 ` luca abeni
@ 2025-05-23 19:46 ` luca abeni
2025-05-25 19:29 ` Marcel Ziswiler
1 sibling, 1 reply; 35+ messages in thread
From: luca abeni @ 2025-05-23 19:46 UTC (permalink / raw)
To: Marcel Ziswiler
Cc: Juri Lelli, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
Hi Marcel,
sorry, but I have some additional questions to fully understand your
setup...
On Mon, 19 May 2025 15:32:27 +0200
Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote:
[...]
> > just a quick question to better understand your setup (and check
> > where the issue comes from):
> > in the email below, you say that tasks are statically assigned to
> > cores; how did you do this? Did you use isolated cpusets,
>
> Yes, we use the cpuset controller from the cgroup-v2 APIs in the
> linux kernel in order to partition CPUs and memory nodes. In detail,
> we use the AllowedCPUs and AllowedMemoryNodes in systemd's slice
> configurations.
How do you configure systemd? I am having troubles in reproducing your
AllowedCPUs configuration... This is an example of what I am trying:
sudo systemctl set-property --runtime custom-workload.slice AllowedCPUs=1
sudo systemctl set-property --runtime init.scope AllowedCPUs=0,2,3
sudo systemctl set-property --runtime system.slice AllowedCPUs=0,2,3
sudo systemctl set-property --runtime user.slice AllowedCPUs=0,2,3
and then I try to run a SCHED_DEADLINE application with
sudo systemd-run --scope -p Slice=custom-workload.slice <application>
However, this does not work because systemd is not creating an isolated
cpuset... So, the root domain still contains CPUs 0-3, and the
"custom-workload.slice" cpuset only has CPU 1. Hence, the check
		/*
		 * Don't allow tasks with an affinity mask smaller than
		 * the entire root_domain to become SCHED_DEADLINE. We
		 * will also fail if there's no bandwidth available.
		 */
		if (!cpumask_subset(span, p->cpus_ptr) ||
		    rq->rd->dl_bw.bw == 0) {
			retval = -EPERM;
			goto unlock;
		}
in __sched_setscheduler() fails.
How are you configuring the cpusets? Also, which kernel version are you using?
(sorry if you already posted this information in previous emails and I am
missing something obvious)
Thanks,
Luca
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-05-23 19:46 ` luca abeni
@ 2025-05-25 19:29 ` Marcel Ziswiler
2025-05-29 9:39 ` Juri Lelli
2025-05-30 9:21 ` luca abeni
0 siblings, 2 replies; 35+ messages in thread
From: Marcel Ziswiler @ 2025-05-25 19:29 UTC (permalink / raw)
To: luca abeni
Cc: Juri Lelli, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
Hi Luca
On Fri, 2025-05-23 at 21:46 +0200, luca abeni wrote:
> Hi Marcel,
>
> sorry, but I have some additional questions to fully understand your
> setup...
No problem, I am happy to answer any questions :)
> On Mon, 19 May 2025 15:32:27 +0200
> Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote:
> [...]
> > > just a quick question to better understand your setup (and check
> > > where the issue comes from):
> > > in the email below, you say that tasks are statically assigned to
> > > cores; how did you do this? Did you use isolated cpusets,
> >
> > Yes, we use the cpuset controller from the cgroup-v2 APIs in the
> > linux kernel in order to partition CPUs and memory nodes. In detail,
> > we use the AllowedCPUs and AllowedMemoryNodes in systemd's slice
> > configurations.
>
> How do you configure systemd? I am having troubles in reproducing your
> AllowedCPUs configuration... This is an example of what I am trying:
> sudo systemctl set-property --runtime custom-workload.slice AllowedCPUs=1
> sudo systemctl set-property --runtime init.scope AllowedCPUs=0,2,3
> sudo systemctl set-property --runtime system.slice AllowedCPUs=0,2,3
> sudo systemctl set-property --runtime user.slice AllowedCPUs=0,2,3
> and then I try to run a SCHED_DEADLINE application with
> sudo systemd-run --scope -p Slice=custom-workload.slice <application>
We just use a bunch of systemd configuration files as follows:
[root@localhost ~]# cat /lib/systemd/system/monitor.slice
# Copyright (C) 2024 Codethink Limited
# SPDX-License-Identifier: GPL-2.0-only
[Unit]
Description=Prioritized slice for the safety monitor.
Before=slices.target
[Slice]
CPUWeight=1000
AllowedCPUs=0
MemoryAccounting=true
MemoryMin=10%
ManagedOOMPreference=omit
[Install]
WantedBy=slices.target
[root@localhost ~]# cat /lib/systemd/system/safety1.slice
# Copyright (C) 2024 Codethink Limited
# SPDX-License-Identifier: GPL-2.0-only
[Unit]
Description=Slice for Safety case processes.
Before=slices.target
[Slice]
CPUWeight=1000
AllowedCPUs=1
MemoryAccounting=true
MemoryMin=10%
ManagedOOMPreference=omit
[Install]
WantedBy=slices.target
[root@localhost ~]# cat /lib/systemd/system/safety2.slice
# Copyright (C) 2024 Codethink Limited
# SPDX-License-Identifier: GPL-2.0-only
[Unit]
Description=Slice for Safety case processes.
Before=slices.target
[Slice]
CPUWeight=1000
AllowedCPUs=2
MemoryAccounting=true
MemoryMin=10%
ManagedOOMPreference=omit
[Install]
WantedBy=slices.target
[root@localhost ~]# cat /lib/systemd/system/safety3.slice
# Copyright (C) 2024 Codethink Limited
# SPDX-License-Identifier: GPL-2.0-only
[Unit]
Description=Slice for Safety case processes.
Before=slices.target
[Slice]
CPUWeight=1000
AllowedCPUs=3
MemoryAccounting=true
MemoryMin=10%
ManagedOOMPreference=omit
[Install]
WantedBy=slices.target
[root@localhost ~]# cat /lib/systemd/system/system.slice
# Copyright (C) 2024 Codethink Limited
# SPDX-License-Identifier: GPL-2.0-only
#
# This slice will control all processes started by systemd by
# default.
#
[Unit]
Description=System Slice
Documentation=man:systemd.special(7)
Before=slices.target
[Slice]
CPUQuota=150%
AllowedCPUs=0
MemoryAccounting=true
MemoryMax=80%
ManagedOOMSwap=kill
ManagedOOMMemoryPressure=kill
[root@localhost ~]# cat /lib/systemd/system/user.slice
# Copyright (C) 2024 Codethink Limited
# SPDX-License-Identifier: GPL-2.0-only
#
# This slice will control all processes started by systemd-logind
#
[Unit]
Description=User and Session Slice
Documentation=man:systemd.special(7)
Before=slices.target
[Slice]
CPUQuota=25%
AllowedCPUs=0
MemoryAccounting=true
MemoryMax=80%
ManagedOOMSwap=kill
ManagedOOMMemoryPressure=kill
> However, this does not work because systemd is not creating an isolated
> cpuset... So, the root domain still contains CPUs 0-3, and the
> "custom-workload.slice" cpuset only has CPU 1. Hence, the check
> 		/*
> 		 * Don't allow tasks with an affinity mask smaller than
> 		 * the entire root_domain to become SCHED_DEADLINE. We
> 		 * will also fail if there's no bandwidth available.
> 		 */
> 		if (!cpumask_subset(span, p->cpus_ptr) ||
> 		    rq->rd->dl_bw.bw == 0) {
> 			retval = -EPERM;
> 			goto unlock;
> 		}
> in __sched_setscheduler() fails.
>
>
> How are you configuring the cpusets?
See above.
> Also, which kernel version are you using?
> (sorry if you already posted this information in previous emails and I am
> missing something obvious)
Not even sure whether I explicitly mentioned it, other than that we are always running the latest stable.
Two months ago, when we last ran some extensive tests on this, it was actually v6.13.6.
> Thanks,
Thank you!
> Luca
Cheers
Marcel
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-05-25 19:29 ` Marcel Ziswiler
@ 2025-05-29 9:39 ` Juri Lelli
2025-06-02 14:59 ` Marcel Ziswiler
2025-05-30 9:21 ` luca abeni
1 sibling, 1 reply; 35+ messages in thread
From: Juri Lelli @ 2025-05-29 9:39 UTC (permalink / raw)
To: Marcel Ziswiler
Cc: luca abeni, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
[-- Attachment #1: Type: text/plain, Size: 2080 bytes --]
Hi Marcel,
On 25/05/25 21:29, Marcel Ziswiler wrote:
> Hi Luca
>
> On Fri, 2025-05-23 at 21:46 +0200, luca abeni wrote:
> > Hi Marcel,
> >
> > sorry, but I have some additional questions to fully understand your
> > setup...
>
> No Problem, I am happy to answer any questions :)
>
> > On Mon, 19 May 2025 15:32:27 +0200
> > Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote:
> > [...]
> > > > just a quick question to better understand your setup (and check
> > > > where the issue comes from):
> > > > in the email below, you say that tasks are statically assigned to
> > > > cores; how did you do this? Did you use isolated cpusets,
> > >
> > > Yes, we use the cpuset controller from the cgroup-v2 APIs in the
> > > linux kernel in order to partition CPUs and memory nodes. In detail,
> > > we use the AllowedCPUs and AllowedMemoryNodes in systemd's slice
> > > configurations.
> >
> > How do you configure systemd? I am having troubles in reproducing your
> > AllowedCPUs configuration... This is an example of what I am trying:
> > sudo systemctl set-property --runtime custom-workload.slice AllowedCPUs=1
> > sudo systemctl set-property --runtime init.scope AllowedCPUs=0,2,3
> > sudo systemctl set-property --runtime system.slice AllowedCPUs=0,2,3
> > sudo systemctl set-property --runtime user.slice AllowedCPUs=0,2,3
> > and then I try to run a SCHED_DEADLINE application with
> > sudo systemd-run --scope -p Slice=custom-workload.slice <application>
>
> We just use a bunch of systemd configuration files as follows:
>
...
> > How are you configuring the cpusets?
>
> See above.
>
Could you please add 'debug sched_debug sched_verbose' to your kernel
cmdline and share the complete dmesg before starting your tests?
Also, I am attaching a script that should be able to retrieve cpuset
information if you run it with
# python3 get_cpuset_info.py > cpuset.out
Could you please also do that and share the collected information?
It should help us to better understand your setup and possibly reproduce
the problem you are seeing.
Thanks!
Juri
[-- Attachment #2: get_cpuset_info.py --]
[-- Type: text/plain, Size: 1766 bytes --]
import os
def get_cpuset_info(cgroup_path):
    """Retrieves cpuset information for a given cgroup path."""
    info = {}
    files_to_check = [
        'cpuset.cpus',
        'cpuset.mems',
        'cpuset.cpus.effective',
        'cpuset.mems.effective',
        'cpuset.cpus.exclusive',
        'cpuset.cpus.exclusive.effective',
        'cpuset.cpus.partition'
    ]
    for filename in files_to_check:
        filepath = os.path.join(cgroup_path, filename)
        if os.path.exists(filepath) and os.access(filepath, os.R_OK):
            try:
                with open(filepath, 'r') as f:
                    info[filename] = f.read().strip()
            except Exception as e:
                info[filename] = f"Error reading: {e}"
        # else:
        #     info[filename] = "Not found or not readable" # Uncomment if you want to explicitly show missing files
    return info

def main():
    cgroup_root = '/sys/fs/cgroup'
    print(f"Recursively retrieving cpuset information from {cgroup_root} (cgroup v2):\n")
    for dirpath, dirnames, filenames in os.walk(cgroup_root):
        # Skip the root cgroup directory itself if it's not a delegate
        # and only process subdirectories that might have cpuset info.
        # This is a heuristic; if you want to see info for the root too, remove this if.
        # if dirpath == cgroup_root:
        #     continue
        cpuset_info = get_cpuset_info(dirpath)
        if cpuset_info: # Only print if we found some cpuset information
            print(f"Cgroup: {dirpath.replace(cgroup_root, '') or '/'}")
            for key, value in cpuset_info.items():
                print(f" {key}: {value}")
            print("-" * 30) # Separator for readability

if __name__ == "__main__":
    main()
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-05-25 19:29 ` Marcel Ziswiler
2025-05-29 9:39 ` Juri Lelli
@ 2025-05-30 9:21 ` luca abeni
2025-06-03 11:18 ` Marcel Ziswiler
1 sibling, 1 reply; 35+ messages in thread
From: luca abeni @ 2025-05-30 9:21 UTC (permalink / raw)
To: Marcel Ziswiler
Cc: Juri Lelli, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
Hi Marcel,
On Sun, 25 May 2025 21:29:05 +0200
Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote:
[...]
> > How do you configure systemd? I am having troubles in reproducing
> > your AllowedCPUs configuration... This is an example of what I am
> > trying: sudo systemctl set-property --runtime custom-workload.slice
> > AllowedCPUs=1 sudo systemctl set-property --runtime init.scope
> > AllowedCPUs=0,2,3 sudo systemctl set-property --runtime
> > system.slice AllowedCPUs=0,2,3 sudo systemctl set-property
> > --runtime user.slice AllowedCPUs=0,2,3 and then I try to run a
> > SCHED_DEADLINE application with sudo systemd-run --scope -p
> > Slice=custom-workload.slice <application>
>
> We just use a bunch of systemd configuration files as follows:
>
> [root@localhost ~]# cat /lib/systemd/system/monitor.slice
> # Copyright (C) 2024 Codethink Limited
> # SPDX-License-Identifier: GPL-2.0-only
[...]
So, I copied your *.slice files in /lib/systemd/system (and I added
them to the "Wants=" entry of /lib/systemd/system/slices.target,
otherwise the slices are not created), but I am still unable to run
SCHED_DEADLINE applications in these slices.
This is due to the fact that the kernel does not create a new root
domain for these cpusets (probably because the cpusets' CPUs are not
exclusive and the cpuset is not "isolated": for example,
/sys/fs/cgroup/safety1.slice/cpuset.cpus.partition is set to "member",
not to "isolated"). So, the "cpumask_subset(span, p->cpus_ptr)" in
__sched_setscheduler() is still false and the syscall returns -EPERM.
Since I do not know how to obtain an isolated cpuset with cgroup v2 and
systemd, I tried using the old cgroup v1, as described in the
SCHED_DEADLINE documentation.
This worked fine, and enabling SCHED_FLAG_RECLAIM actually reduced the
number of missed deadlines (I tried with a set of periodic tasks having
the same parameters as the ones you described). So, it looks like
reclaiming is working correctly (at least, as far as I can see) when
using cgroup v1 to configure the CPU partitions... Maybe there is some
bug triggered by cgroup v2, or maybe I am misunderstanding your setup.
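(For completeness, and I may well be missing a systemd-level knob: my understanding is that with the raw
cgroup-v2 interface a partition root can be requested by writing to the partition file once cpuset.cpus is set
and exclusive, e.g. something like

	echo root > /sys/fs/cgroup/safety1.slice/cpuset.cpus.partition

but I did not manage to get systemd itself to do this in my quick test.)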
I think the experiment suggested by Juri can help in understanding
where the issue can be.
Thanks,
Luca
> [Unit]
> Description=Prioritized slice for the safety monitor.
> Before=slices.target
>
> [Slice]
> CPUWeight=1000
> AllowedCPUs=0
> MemoryAccounting=true
> MemoryMin=10%
> ManagedOOMPreference=omit
>
> [Install]
> WantedBy=slices.target
>
> [root@localhost ~]# cat /lib/systemd/system/safety1.slice
> # Copyright (C) 2024 Codethink Limited
> # SPDX-License-Identifier: GPL-2.0-only
> [Unit]
> Description=Slice for Safety case processes.
> Before=slices.target
>
> [Slice]
> CPUWeight=1000
> AllowedCPUs=1
> MemoryAccounting=true
> MemoryMin=10%
> ManagedOOMPreference=omit
>
> [Install]
> WantedBy=slices.target
>
> [root@localhost ~]# cat /lib/systemd/system/safety2.slice
> # Copyright (C) 2024 Codethink Limited
> # SPDX-License-Identifier: GPL-2.0-only
> [Unit]
> Description=Slice for Safety case processes.
> Before=slices.target
>
> [Slice]
> CPUWeight=1000
> AllowedCPUs=2
> MemoryAccounting=true
> MemoryMin=10%
> ManagedOOMPreference=omit
>
> [Install]
> WantedBy=slices.target
>
> [root@localhost ~]# cat /lib/systemd/system/safety3.slice
> # Copyright (C) 2024 Codethink Limited
> # SPDX-License-Identifier: GPL-2.0-only
> [Unit]
> Description=Slice for Safety case processes.
> Before=slices.target
>
> [Slice]
> CPUWeight=1000
> AllowedCPUs=3
> MemoryAccounting=true
> MemoryMin=10%
> ManagedOOMPreference=omit
>
> [Install]
> WantedBy=slices.target
>
> [root@localhost ~]# cat /lib/systemd/system/system.slice
> # Copyright (C) 2024 Codethink Limited
> # SPDX-License-Identifier: GPL-2.0-only
>
> #
> # This slice will control all processes started by systemd by
> # default.
> #
>
> [Unit]
> Description=System Slice
> Documentation=man:systemd.special(7)
> Before=slices.target
>
> [Slice]
> CPUQuota=150%
> AllowedCPUs=0
> MemoryAccounting=true
> MemoryMax=80%
> ManagedOOMSwap=kill
> ManagedOOMMemoryPressure=kill
>
> [root@localhost ~]# cat /lib/systemd/system/user.slice
> # Copyright (C) 2024 Codethink Limited
> # SPDX-License-Identifier: GPL-2.0-only
>
> #
> # This slice will control all processes started by systemd-logind
> #
>
> [Unit]
> Description=User and Session Slice
> Documentation=man:systemd.special(7)
> Before=slices.target
>
> [Slice]
> CPUQuota=25%
> AllowedCPUs=0
> MemoryAccounting=true
> MemoryMax=80%
> ManagedOOMSwap=kill
> ManagedOOMMemoryPressure=kill
>
> > However, this does not work because systemd is not creating an
> > isolated cpuset... So, the root domain still contains CPUs 0-3, and
> > the "custom-workload.slice" cpuset only has CPU 1. Hence, the check
> > 		/*
> > 		 * Don't allow tasks with an affinity mask smaller than
> > 		 * the entire root_domain to become SCHED_DEADLINE. We
> > 		 * will also fail if there's no bandwidth available.
> > 		 */
> > 		if (!cpumask_subset(span, p->cpus_ptr) ||
> > 		    rq->rd->dl_bw.bw == 0) {
> > 			retval = -EPERM;
> > 			goto unlock;
> > 		}
> > in __sched_setscheduler() fails.
> >
> >
> > How are you configuring the cpusets?
>
> See above.
>
> > Also, which kernel version are you using?
> > (sorry if you already posted this information in previous emails
> > and I am missing something obvious)
>
> Not even sure, whether I explicitly mentioned that other than that we
> are always running latest stable.
>
> Two months ago when we last run some extensive tests on this it was
> actually v6.13.6.
>
> > Thanks,
>
> Thank you!
>
> > Luca
>
> Cheers
>
> Marcel
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-05-29 9:39 ` Juri Lelli
@ 2025-06-02 14:59 ` Marcel Ziswiler
2025-06-17 12:21 ` Juri Lelli
0 siblings, 1 reply; 35+ messages in thread
From: Marcel Ziswiler @ 2025-06-02 14:59 UTC (permalink / raw)
To: Juri Lelli
Cc: luca abeni, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
Hi Juri
On Thu, 2025-05-29 at 11:39 +0200, Juri Lelli wrote:
> Hi Marcel,
>
> On 25/05/25 21:29, Marcel Ziswiler wrote:
> > Hi Luca
> >
> > On Fri, 2025-05-23 at 21:46 +0200, luca abeni wrote:
> > > Hi Marcel,
> > >
> > > sorry, but I have some additional questions to fully understand your
> > > setup...
> >
> > No Problem, I am happy to answer any questions :)
> >
> > > On Mon, 19 May 2025 15:32:27 +0200
> > > Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote:
> > > [...]
> > > > > just a quick question to better understand your setup (and check
> > > > > where the issue comes from):
> > > > > in the email below, you say that tasks are statically assigned to
> > > > > cores; how did you do this? Did you use isolated cpusets,
> > > >
> > > > Yes, we use the cpuset controller from the cgroup-v2 APIs in the
> > > > linux kernel in order to partition CPUs and memory nodes. In detail,
> > > > we use the AllowedCPUs and AllowedMemoryNodes in systemd's slice
> > > > configurations.
> > >
> > > How do you configure systemd? I am having troubles in reproducing your
> > > AllowedCPUs configuration... This is an example of what I am trying:
> > > sudo systemctl set-property --runtime custom-workload.slice AllowedCPUs=1
> > > sudo systemctl set-property --runtime init.scope AllowedCPUs=0,2,3
> > > sudo systemctl set-property --runtime system.slice AllowedCPUs=0,2,3
> > > sudo systemctl set-property --runtime user.slice AllowedCPUs=0,2,3
> > > and then I try to run a SCHED_DEADLINE application with
> > > sudo systemd-run --scope -p Slice=custom-workload.slice <application>
> >
> > We just use a bunch of systemd configuration files as follows:
> >
>
> ...
>
> > > How are you configuring the cpusets?
> >
> > See above.
> >
>
> Could you please add 'debug sched_debug sched_verbose' to your kernel
> cmdline and share the complete dmesg before starting your tests?
Sure, here you go [1].
> Also, I am attaching a script that should be able to retrieve cpuset
> information if you run it with
>
> # python3 get_cpuset_info.py > cpuset.out
>
> Could you please also do that and share the collected information?
[root@localhost ~]# python3 get_cpuset_info.py > cpuset.out
[root@localhost ~]# cat cpuset.out
Recursively retrieving cpuset information from /sys/fs/cgroup (cgroup v2):
Cgroup: /
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
------------------------------
Cgroup: /safety3.slice
cpuset.cpus: 3
cpuset.mems:
cpuset.cpus.effective: 3
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective: 3
cpuset.cpus.partition: root
------------------------------
Cgroup: /sys-fs-fuse-connections.mount
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /sys-kernel-debug.mount
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /dev-mqueue.mount
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /user.slice
cpuset.cpus: 0
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /monitor.slice
cpuset.cpus: 0
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /safety1.slice
cpuset.cpus: 1
cpuset.mems:
cpuset.cpus.effective: 1
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective: 1
cpuset.cpus.partition: root
------------------------------
Cgroup: /sys-kernel-tracing.mount
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /init.scope
cpuset.cpus: 0
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice
cpuset.cpus: 0
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/systemd-networkd.service
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/systemd-udevd.service
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/system-serial\x2dgetty.slice
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/boot.mount
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/var-lib-containers.mount
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/auditd.service
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/system-modprobe.slice
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/systemd-journald.service
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/systemd-nsresourced.service
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/sshd.service
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/var-tmp.mount
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/test-audio.service
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/tmp.mount
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/systemd-userdbd.service
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/test-speaker.service
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/systemd-oomd.service
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/systemd-resolved.service
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/dbus.service
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/systemd-timesyncd.service
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/system-getty.slice
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/systemd-logind.service
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /system.slice/system-disk\x2dstat\x2dmonitoring.slice
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
Cgroup: /safety2.slice
cpuset.cpus: 2
cpuset.mems:
cpuset.cpus.effective: 2
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective: 2
cpuset.cpus.partition: root
------------------------------
Cgroup: /dev-hugepages.mount
cpuset.cpus:
cpuset.mems:
cpuset.cpus.effective: 0
cpuset.mems.effective: 0
cpuset.cpus.exclusive:
cpuset.cpus.exclusive.effective:
cpuset.cpus.partition: member
------------------------------
> It should help us to better understand your setup and possibly reproduce
> the problem you are seeing.
Sure, I am happy to help.
> Thanks!
Thank you!
> Juri
[1] https://pastebin.com/khFApYgf
Cheers
Marcel
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-05-30 9:21 ` luca abeni
@ 2025-06-03 11:18 ` Marcel Ziswiler
2025-06-06 13:16 ` luca abeni
0 siblings, 1 reply; 35+ messages in thread
From: Marcel Ziswiler @ 2025-06-03 11:18 UTC (permalink / raw)
To: luca abeni
Cc: Juri Lelli, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
Hi Luca
Thank you very much!
On Fri, 2025-05-30 at 11:21 +0200, luca abeni wrote:
> Hi Marcel,
>
> On Sun, 25 May 2025 21:29:05 +0200
> Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote:
> [...]
> > > How do you configure systemd? I am having troubles in reproducing
> > > your AllowedCPUs configuration... This is an example of what I am
> > > trying: sudo systemctl set-property --runtime custom-workload.slice
> > > AllowedCPUs=1 sudo systemctl set-property --runtime init.scope
> > > AllowedCPUs=0,2,3 sudo systemctl set-property --runtime
> > > system.slice AllowedCPUs=0,2,3 sudo systemctl set-property
> > > --runtime user.slice AllowedCPUs=0,2,3 and then I try to run a
> > > SCHED_DEADLINE application with sudo systemd-run --scope -p
> > > Slice=custom-workload.slice <application>
> >
> > We just use a bunch of systemd configuration files as follows:
> >
> > [root@localhost ~]# cat /lib/systemd/system/monitor.slice
> > # Copyright (C) 2024 Codethink Limited
> > # SPDX-License-Identifier: GPL-2.0-only
> [...]
>
> So, I copied your *.slice files in /lib/systemd/system (and I added
> them to the "Wants=" entry of /lib/systemd/system/slices.target,
> otherwise the slices are not created), but I am still unable to run
> SCHED_DEADLINE applications in these slices.
We just link them there e.g.
[root@localhost ~]# ls -l /etc/systemd/system/slices.target.wants/safety1.slice
lrwxrwxrwx 1 root root 37 Nov 10 2011 /etc/systemd/system/slices.target.wants/safety1.slice -> /usr/lib/systemd/system/safety1.slice
BTW: /lib is just sym-linked to /usr/lib in our setup.
> This is due to the fact that the kernel does not create a new root
> domain for these cpusets (probably because the cpusets' CPUs are not
> exclusive and the cpuset is not "isolated": for example,
> /sys/fs/cgroup/safety1.slice/cpuset.cpus.partition is set to "member",
> not to "isolated").
Not sure, but for me it is indeed root e.g.
[root@localhost ~]# cat /sys/fs/cgroup/safety1.slice/cpuset.cpus.partition
root
> So, the "cpumask_subset(span, p->cpus_ptr)" in
> sched_setsched() is still false and the syscall returns -EPERM.
>
>
> Since I do not know how to obtain an isolated cpuset with cgroup v2 and
> systemd, I tried using the old cgroup v1, as described in the
> SCHED_DEADLINE documentation.
I would have thought it should not make any difference whether cgroup v1 or v2 is used, but then who knows.
> This worked fine, and enabling SCHED_FLAG_RECLAIM actually reduced the
> number of missed deadlines (I tried with a set of periodic tasks having
> the same parameters as the ones you described). So, it looks like
> reclaiming is working correctly (at least, as far as I can see) when
> using cgroup v1 to configure the CPU partitions... Maybe there is some
> bug triggered by cgroup v2,
Could be, but either way it would be good to also update the SCHED_DEADLINE documentation for cgroup v2.
> or maybe I am misunderstanding your setup.
No, there should be nothing else special really.
> I think the experiment suggested by Juri can help in understanding
> where the issue can be.
Yes, I already did all that and hope you guys can get some insights from that experiment.
And remember, if I can help in any other way just let me know. Thanks!
> Thanks,
> Luca
>
>
> > [Unit]
> > Description=Prioritized slice for the safety monitor.
> > Before=slices.target
> >
> > [Slice]
> > CPUWeight=1000
> > AllowedCPUs=0
> > MemoryAccounting=true
> > MemoryMin=10%
> > ManagedOOMPreference=omit
> >
> > [Install]
> > WantedBy=slices.target
> >
> > [root@localhost ~]# cat /lib/systemd/system/safety1.slice
> > # Copyright (C) 2024 Codethink Limited
> > # SPDX-License-Identifier: GPL-2.0-only
> > [Unit]
> > Description=Slice for Safety case processes.
> > Before=slices.target
> >
> > [Slice]
> > CPUWeight=1000
> > AllowedCPUs=1
> > MemoryAccounting=true
> > MemoryMin=10%
> > ManagedOOMPreference=omit
> >
> > [Install]
> > WantedBy=slices.target
> >
> > [root@localhost ~]# cat /lib/systemd/system/safety2.slice
> > # Copyright (C) 2024 Codethink Limited
> > # SPDX-License-Identifier: GPL-2.0-only
> > [Unit]
> > Description=Slice for Safety case processes.
> > Before=slices.target
> >
> > [Slice]
> > CPUWeight=1000
> > AllowedCPUs=2
> > MemoryAccounting=true
> > MemoryMin=10%
> > ManagedOOMPreference=omit
> >
> > [Install]
> > WantedBy=slices.target
> >
> > [root@localhost ~]# cat /lib/systemd/system/safety3.slice
> > # Copyright (C) 2024 Codethink Limited
> > # SPDX-License-Identifier: GPL-2.0-only
> > [Unit]
> > Description=Slice for Safety case processes.
> > Before=slices.target
> >
> > [Slice]
> > CPUWeight=1000
> > AllowedCPUs=3
> > MemoryAccounting=true
> > MemoryMin=10%
> > ManagedOOMPreference=omit
> >
> > [Install]
> > WantedBy=slices.target
> >
> > [root@localhost ~]# cat /lib/systemd/system/system.slice
> > # Copyright (C) 2024 Codethink Limited
> > # SPDX-License-Identifier: GPL-2.0-only
> >
> > #
> > # This slice will control all processes started by systemd by
> > # default.
> > #
> >
> > [Unit]
> > Description=System Slice
> > Documentation=man:systemd.special(7)
> > Before=slices.target
> >
> > [Slice]
> > CPUQuota=150%
> > AllowedCPUs=0
> > MemoryAccounting=true
> > MemoryMax=80%
> > ManagedOOMSwap=kill
> > ManagedOOMMemoryPressure=kill
> >
> > [root@localhost ~]# cat /lib/systemd/system/user.slice
> > # Copyright (C) 2024 Codethink Limited
> > # SPDX-License-Identifier: GPL-2.0-only
> >
> > #
> > # This slice will control all processes started by systemd-logind
> > #
> >
> > [Unit]
> > Description=User and Session Slice
> > Documentation=man:systemd.special(7)
> > Before=slices.target
> >
> > [Slice]
> > CPUQuota=25%
> > AllowedCPUs=0
> > MemoryAccounting=true
> > MemoryMax=80%
> > ManagedOOMSwap=kill
> > ManagedOOMMemoryPressure=kill
> >
> > > However, this does not work because systemd is not creating an
> > > isolated cpuset... So, the root domain still contains CPUs 0-3, and
> > > the "custom-workload.slice" cpuset only has CPU 1. Hence, the check
> > > /*
> > > * Don't allow tasks with an affinity mask
> > > smaller than
> > > * the entire root_domain to become
> > > SCHED_DEADLINE. We
> > > * will also fail if there's no bandwidth
> > > available. */
> > > if (!cpumask_subset(span, p->cpus_ptr) ||
> > > rq->rd->dl_bw.bw == 0) {
> > > retval = -EPERM;
> > > goto unlock;
> > > }
> > > in sched_setsched() fails.
> > >
> > >
> > > How are you configuring the cpusets?
> >
> > See above.
> >
> > > Also, which kernel version are you using?
> > > (sorry if you already posted this information in previous emails
> > > and I am missing something obvious)
> >
> > Not even sure, whether I explicitly mentioned that other than that we
> > are always running latest stable.
> >
> > Two months ago when we last run some extensive tests on this it was
> > actually v6.13.6.
> >
> > > Thanks,
> >
> > Thank you!
> >
> > > Luca
Cheers
Marcel
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-06-03 11:18 ` Marcel Ziswiler
@ 2025-06-06 13:16 ` luca abeni
0 siblings, 0 replies; 35+ messages in thread
From: luca abeni @ 2025-06-06 13:16 UTC (permalink / raw)
To: Marcel Ziswiler
Cc: Juri Lelli, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
Hi Marcel,
I am still having issues in reproducing your setup with cgroup v2
(maybe it depends on the systemd version, I do not know), but I ran
some experiments using cgroup v1... Here are some ideas:
- When using reclaiming, the core load can become very high (reaching
95%)... This can increase the CPU temperature, and maybe some thermal
throttling mechanism slows it down to avoid overheating? (this
happened one time to me when replicating your setup). This would
explain some missed deadlines
- Related to this... You probably already mentioned it, but which kind
of CPU are you using? How is frequency scaling configured? (that is:
which cpufreq governor are you using?)
- Another random idea: is it possible that you enabled reclaiming only
for some of the SCHED_DEADLINE threads running on a core? (and
reclaiming is maybe disabled for the thread that is missing
deadlines?)
Also, can you try lowering the value of
/proc/sys/kernel/sched_rt_runtime_us and check if the problem still
happens?
Thanks,
Luca
On Tue, 03 Jun 2025 13:18:23 +0200
Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote:
> Hi Luca
>
> Thank you very much!
>
> On Fri, 2025-05-30 at 11:21 +0200, luca abeni wrote:
> > Hi Marcel,
> >
> > On Sun, 25 May 2025 21:29:05 +0200
> > Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote:
> > [...]
> > > > How do you configure systemd? I am having troubles in
> > > > reproducing your AllowedCPUs configuration... This is an
> > > > example of what I am trying: sudo systemctl set-property
> > > > --runtime custom-workload.slice AllowedCPUs=1 sudo systemctl
> > > > set-property --runtime init.scope AllowedCPUs=0,2,3 sudo
> > > > systemctl set-property --runtime system.slice AllowedCPUs=0,2,3
> > > > sudo systemctl set-property --runtime user.slice
> > > > AllowedCPUs=0,2,3 and then I try to run a SCHED_DEADLINE
> > > > application with sudo systemd-run --scope -p
> > > > Slice=custom-workload.slice <application>
> > >
> > > We just use a bunch of systemd configuration files as follows:
> > >
> > > [root@localhost ~]# cat /lib/systemd/system/monitor.slice
> > > # Copyright (C) 2024 Codethink Limited
> > > # SPDX-License-Identifier: GPL-2.0-only
> > [...]
> >
> > So, I copied your *.slice files in /lib/systemd/system (and I added
> > them to the "Wants=" entry of /lib/systemd/system/slices.target,
> > otherwise the slices are not created), but I am still unable to run
> > SCHED_DEADLINE applications in these slices.
>
> We just link them there e.g.
>
> [root@localhost ~]# ls -l
> /etc/systemd/system/slices.target.wants/safety1.slice lrwxrwxrwx 1
> root root 37 Nov 10 2011
> /etc/systemd/system/slices.target.wants/safety1.slice ->
> /usr/lib/systemd/system/safety1.slice
>
> BTW: /lib is just sym-linked to /usr/lib in our setup.
>
> > This is due to the fact that the kernel does not create a new root
> > domain for these cpusets (probably because the cpusets' CPUs are not
> > exclusive and the cpuset is not "isolated": for example,
> > /sys/fs/cgroup/safety1.slice/cpuset.cpus.partition is set to
> > "member", not to "isolated").
>
> Not sure, but for me it is indeed root e.g.
>
> [root@localhost ~]# cat
> /sys/fs/cgroup/safety1.slice/cpuset.cpus.partition root
>
> > So, the "cpumask_subset(span, p->cpus_ptr)" in
> > sched_setsched() is still false and the syscall returns -EPERM.
> >
> >
> > Since I do not know how to obtain an isolated cpuset with cgroup v2
> > and systemd, I tried using the old cgroup v1, as described in the
> > SCHED_DEADLINE documentation.
>
> I would have thought it should not make any difference whether cgroup
> v1 or v2 is used, but then who knows.
>
> > This worked fine, and enabling SCHED_FLAG_RECLAIM actually reduced
> > the number of missed deadlines (I tried with a set of periodic
> > tasks having the same parameters as the ones you described). So, it
> > looks like reclaiming is working correctly (at least, as far as I
> > can see) when using cgroup v1 to configure the CPU partitions...
> > Maybe there is some bug triggered by cgroup v2,
>
> Could be, but anyway would be good to also update the SCHED_DEADLINE
> documentation to cgroup v2.
>
> > or maybe I am misunderstanding your setup.
>
> No, there should be nothing else special really.
>
> > I think the experiment suggested by Juri can help in understanding
> > where the issue can be.
>
> Yes, I already did all that and hope you guys can get some insights
> from that experiment.
>
> And remember, if I can help in any other way just let me know. Thanks!
>
> > Thanks,
> > Luca
> >
> >
> > > [Unit]
> > > Description=Prioritized slice for the safety monitor.
> > > Before=slices.target
> > >
> > > [Slice]
> > > CPUWeight=1000
> > > AllowedCPUs=0
> > > MemoryAccounting=true
> > > MemoryMin=10%
> > > ManagedOOMPreference=omit
> > >
> > > [Install]
> > > WantedBy=slices.target
> > >
> > > [root@localhost ~]# cat /lib/systemd/system/safety1.slice
> > > # Copyright (C) 2024 Codethink Limited
> > > # SPDX-License-Identifier: GPL-2.0-only
> > > [Unit]
> > > Description=Slice for Safety case processes.
> > > Before=slices.target
> > >
> > > [Slice]
> > > CPUWeight=1000
> > > AllowedCPUs=1
> > > MemoryAccounting=true
> > > MemoryMin=10%
> > > ManagedOOMPreference=omit
> > >
> > > [Install]
> > > WantedBy=slices.target
> > >
> > > [root@localhost ~]# cat /lib/systemd/system/safety2.slice
> > > # Copyright (C) 2024 Codethink Limited
> > > # SPDX-License-Identifier: GPL-2.0-only
> > > [Unit]
> > > Description=Slice for Safety case processes.
> > > Before=slices.target
> > >
> > > [Slice]
> > > CPUWeight=1000
> > > AllowedCPUs=2
> > > MemoryAccounting=true
> > > MemoryMin=10%
> > > ManagedOOMPreference=omit
> > >
> > > [Install]
> > > WantedBy=slices.target
> > >
> > > [root@localhost ~]# cat /lib/systemd/system/safety3.slice
> > > # Copyright (C) 2024 Codethink Limited
> > > # SPDX-License-Identifier: GPL-2.0-only
> > > [Unit]
> > > Description=Slice for Safety case processes.
> > > Before=slices.target
> > >
> > > [Slice]
> > > CPUWeight=1000
> > > AllowedCPUs=3
> > > MemoryAccounting=true
> > > MemoryMin=10%
> > > ManagedOOMPreference=omit
> > >
> > > [Install]
> > > WantedBy=slices.target
> > >
> > > [root@localhost ~]# cat /lib/systemd/system/system.slice
> > > # Copyright (C) 2024 Codethink Limited
> > > # SPDX-License-Identifier: GPL-2.0-only
> > >
> > > #
> > > # This slice will control all processes started by systemd by
> > > # default.
> > > #
> > >
> > > [Unit]
> > > Description=System Slice
> > > Documentation=man:systemd.special(7)
> > > Before=slices.target
> > >
> > > [Slice]
> > > CPUQuota=150%
> > > AllowedCPUs=0
> > > MemoryAccounting=true
> > > MemoryMax=80%
> > > ManagedOOMSwap=kill
> > > ManagedOOMMemoryPressure=kill
> > >
> > > [root@localhost ~]# cat /lib/systemd/system/user.slice
> > > # Copyright (C) 2024 Codethink Limited
> > > # SPDX-License-Identifier: GPL-2.0-only
> > >
> > > #
> > > # This slice will control all processes started by systemd-logind
> > > #
> > >
> > > [Unit]
> > > Description=User and Session Slice
> > > Documentation=man:systemd.special(7)
> > > Before=slices.target
> > >
> > > [Slice]
> > > CPUQuota=25%
> > > AllowedCPUs=0
> > > MemoryAccounting=true
> > > MemoryMax=80%
> > > ManagedOOMSwap=kill
> > > ManagedOOMMemoryPressure=kill
> > >
> > > > However, this does not work because systemd is not creating an
> > > > isolated cpuset... So, the root domain still contains CPUs 0-3,
> > > > and the "custom-workload.slice" cpuset only has CPU 1. Hence,
> > > > the check /*
> > > > * Don't allow tasks with an affinity
> > > > mask smaller than
> > > > * the entire root_domain to become
> > > > SCHED_DEADLINE. We
> > > > * will also fail if there's no
> > > > bandwidth available. */
> > > > if (!cpumask_subset(span, p->cpus_ptr)
> > > > || rq->rd->dl_bw.bw == 0) {
> > > > retval = -EPERM;
> > > > goto unlock;
> > > > }
> > > > in sched_setsched() fails.
> > > >
> > > >
> > > > How are you configuring the cpusets?
> > >
> > > See above.
> > >
> > > > Also, which kernel version are you using?
> > > > (sorry if you already posted this information in previous emails
> > > > and I am missing something obvious)
> > >
> > > Not even sure, whether I explicitly mentioned that other than
> > > that we are always running latest stable.
> > >
> > > Two months ago when we last run some extensive tests on this it
> > > was actually v6.13.6.
> > >
> > > > Thanks,
> > >
> > > Thank you!
> > >
> > > > Luca
>
> Cheers
>
> Marcel
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-06-02 14:59 ` Marcel Ziswiler
@ 2025-06-17 12:21 ` Juri Lelli
2025-06-18 11:24 ` Marcel Ziswiler
0 siblings, 1 reply; 35+ messages in thread
From: Juri Lelli @ 2025-06-17 12:21 UTC (permalink / raw)
To: Marcel Ziswiler
Cc: luca abeni, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
On 02/06/25 16:59, Marcel Ziswiler wrote:
> Hi Juri
>
> On Thu, 2025-05-29 at 11:39 +0200, Juri Lelli wrote:
...
> > It should help us to better understand your setup and possibly reproduce
> > the problem you are seeing.
OK, it definitely took a while (apologies), but I think I managed to
reproduce the issue you are seeing.
I added SCHED_FLAG_RECLAIM support to rt-app [1], so it's easier for me
to play with the taskset and got to the following two situations when
running your coreX taskset on CPU1 of my system (since the issue is
already reproducible, I think it's OK to ignore the other tasksets as
they are running isolated on different CPUs anyway).
This is your coreX taskset, in which the last task is the bad behaving one that
will run without/with RECLAIM in the test.
|sched_deadline = sched_period | sched_runtime | CP max run time 90% of sched_runtime | utilisation | reclaim |
| -- | -- | -- | -- | -- |
| 5 ms | 0.15 ms | 0.135 ms | 3.00% | no |
| 10 ms | 1.8 ms | 1.62 ms | 18.00% | no |
| 10 ms | 2.1 ms | 1.89 ms | 21.00% | no |
| 14 ms | 2.3 ms | 2.07 ms | 16.43% | no |
| 50 ms | 8.0 ms | 7.20 ms | 16.00% | no |
| 10 ms | 0.5 ms | **1 | 5.00% | no |
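For completeness, outside of rt-app the misbehaving job boils down to the
usual sched_setattr() pattern with SCHED_FLAG_RECLAIM set. A minimal sketch
(not the actual test code; the VER0 sched_attr layout and the policy/flag
values follow the uapi definitions as far as I can tell, and the parameters
are taken from the last row above, i.e. the misbehaving job in its "with
reclaim" variant):
---
/*
 * Minimal sketch of a SCHED_DEADLINE task with GRUB reclaiming enabled.
 * Not the actual test code: VER0 sched_attr layout (the newer util-clamp
 * fields are not needed here), parameters from the last row of the table
 * above (10 ms period/deadline, 0.5 ms runtime).
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <linux/types.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif
#ifndef SCHED_FLAG_RECLAIM
#define SCHED_FLAG_RECLAIM 0x02 /* opt in to GRUB bandwidth reclaiming */
#endif

struct sched_attr {
        __u32 size;
        __u32 sched_policy;
        __u64 sched_flags;
        __s32 sched_nice;
        __u32 sched_priority;
        /* SCHED_DEADLINE parameters, in nanoseconds */
        __u64 sched_runtime;
        __u64 sched_deadline;
        __u64 sched_period;
};

static int dl_setattr(pid_t pid, const struct sched_attr *attr)
{
        return syscall(SYS_sched_setattr, pid, attr, 0);
}

int main(void)
{
        struct sched_attr attr = {
                .size           = sizeof(attr),
                .sched_policy   = SCHED_DEADLINE,
                .sched_flags    = SCHED_FLAG_RECLAIM,
                .sched_runtime  =   500000,     /*  0.5 ms */
                .sched_deadline = 10000000,     /* 10   ms */
                .sched_period   = 10000000,     /* 10   ms */
        };

        if (dl_setattr(0, &attr)) {
                perror("sched_setattr");
                return 1;
        }

        /* Deliberately overrun: burn whatever (reclaimed) budget we get. */
        for (;;)
                ;
}
---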
Without reclaim everything looks good (apart from the 1st task, which I
think suffers a bit from the granularity/precision of the rt-app runtime
loop):
https://github.com/jlelli/misc/blob/main/deadline-no-reclaim.png
Order is the same as above, the last task gets constantly throttled and
does no harm to the rest.
With reclaim (only last misbehaving task) we indeed seem to have a problem:
https://github.com/jlelli/misc/blob/main/deadline-reclaim.png
Essentially all other tasks are experiencing long wakeup delays that
cause deadline misses. The bad behaving task seems to be able to almost
monopolize the CPU. Interestingly, even though I left the max
available bandwidth at 95%, the CPU is busy at 100%.
So, yeah, Luca, I think we have a problem. :-)
Will try to find more time soon and keep looking into this.
Thanks,
Juri
1 - https://github.com/jlelli/rt-app/tree/reclaim
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-06-17 12:21 ` Juri Lelli
@ 2025-06-18 11:24 ` Marcel Ziswiler
2025-06-20 9:29 ` Juri Lelli
0 siblings, 1 reply; 35+ messages in thread
From: Marcel Ziswiler @ 2025-06-18 11:24 UTC (permalink / raw)
To: Juri Lelli
Cc: luca abeni, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
Hi Juri
On Tue, 2025-06-17 at 14:21 +0200, Juri Lelli wrote:
> On 02/06/25 16:59, Marcel Ziswiler wrote:
> > Hi Juri
> >
> > On Thu, 2025-05-29 at 11:39 +0200, Juri Lelli wrote:
>
> ...
>
> > > It should help us to better understand your setup and possibly reproduce
> > > the problem you are seeing.
>
> OK, it definitely took a while (apologies), but I think I managed to
> reproduce the issue you are seeing.
No need to apologize, I know how hard it can be trying to bring up random stuff in the Linux world : )
> I added SCHED_FLAG_RECLAIM support to rt-app [1], so it's easier for me
> to play with the taskset and got to the following two situations when
> running your coreX taskset on CPU1 of my system (since the issue is
> already reproducible, I think it's OK to ignore the other tasksets as
> they are running isolated on different CPUs anyway).
>
> This is your coreX taskset, in which the last task is the bad behaving one that
> will run without/with RECLAIM in the test.
>
> > sched_deadline = sched_period | sched_runtime | CP max run time 90% of sched_runtime | utilisation |
> > reclaim |
> > -- | -- | -- | -- | -- |
> > 5 ms | 0.15 ms | 0.135 ms | 3.00% | no |
> > 10 ms | 1.8 ms | 1.62 ms | 18.00% | no |
> > 10 ms | 2.1 ms | 1.89 ms | 21.00% | no |
> > 14 ms | 2.3 ms | 2.07 ms | 16.43% | no |
> > 50 ms | 8.0 ms | 7.20 ms | 16:00% | no |
> > 10 ms | 0.5 ms | **1 | 5.00% | no |
>
> Without reclaim everything looks good (apart from the 1st tasks that I
> think suffers a bit from the granularity/precision of rt-app runtime
> loop):
>
> https://github.com/jlelli/misc/blob/main/deadline-no-reclaim.png
Yeah, granularity/precision is definitely a concern. We initially even started off with 1 ms sched_deadline =
sched_period for task 1 but neither of our test systems (amd64-based Intel NUCs and aarch64-based RADXA
ROCK5Bs) was able to handle that very well. So we opted to increase it to 5 ms which is still rather stressful.
> Order is the same as above, last tasks gets constantly throttled and
> makes no harm to the rest.
>
> With reclaim (only last misbehaving task) we indeed seem to have a problem:
>
> https://github.com/jlelli/misc/blob/main/deadline-reclaim.png
>
> Essentially all other tasks are experiencing long wakeup delays that
> cause deadline misses. The bad behaving task seems to be able to almost
> monopolize the CPU. Interesting to notice that, even if I left max
> available bandwidth to 95%, the CPU is busy at 100%.
Yeah, pretty much completely overloaded.
> So, yeah, Luca, I think we have a problem. :-)
>
> Will try to find more time soon and keep looking into this.
Thank you very much and just let me know if I can help in any way.
> Thanks,
> Juri
>
> 1 - https://github.com/jlelli/rt-app/tree/reclaim
BTW: I will be talking at the OSS NA/ELC next week in Denver, should any of you folks be around.
Cheers
Marcel
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-06-18 11:24 ` Marcel Ziswiler
@ 2025-06-20 9:29 ` Juri Lelli
2025-06-20 9:37 ` luca abeni
0 siblings, 1 reply; 35+ messages in thread
From: Juri Lelli @ 2025-06-20 9:29 UTC (permalink / raw)
To: Marcel Ziswiler
Cc: luca abeni, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
On 18/06/25 12:24, Marcel Ziswiler wrote:
...
> Yeah, granularity/precision is definitely a concern. We initially even started off with 1 ms sched_deadline =
> sched_period for task 1 but neither of our test systems (amd64-based Intel NUCs and aarch64-based RADXA
> ROCK5Bs) was able to handle that very well. So we opted to increase it to 5 ms which is still rather stressful.
Ah, OK, though I actually meant the granularity of the 'fake' runtime of
the tasks. In rt-app we simulate it by essentially reading the clock until
that much runtime has elapsed (or performing floating point operations),
and in some cases it is not super tight.
For runtime enforcement (dl_runtime) and/or period/deadline (dl_{period,
deadline}), did you try enabling HRTICK_DL sched feature? It is kind of
required for parameters under 1ms if one wants precise behavior.
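(For reference, a minimal sketch of how HRTICK_DL can be flipped on; the
debugfs path is an assumption about where the sched features file lives on
recent kernels, and simply echoing the feature name into that file from a
shell works just as well.)
---
/*
 * Sketch: enable the HRTICK_DL scheduler feature at run time.
 * Assumes debugfs is mounted at /sys/kernel/debug (where the sched
 * features file lives on recent kernels); writing "NO_HRTICK_DL"
 * turns it back off. Needs root.
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        const char feat[] = "HRTICK_DL";
        int fd = open("/sys/kernel/debug/sched/features", O_WRONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (write(fd, feat, strlen(feat)) < 0) {
                perror("write");
                close(fd);
                return 1;
        }
        close(fd);
        return 0;
}
---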
> > Order is the same as above, last tasks gets constantly throttled and
> > makes no harm to the rest.
> >
> > With reclaim (only last misbehaving task) we indeed seem to have a problem:
> >
> > https://github.com/jlelli/misc/blob/main/deadline-reclaim.png
> >
> > Essentially all other tasks are experiencing long wakeup delays that
> > cause deadline misses. The bad behaving task seems to be able to almost
> > monopolize the CPU. Interesting to notice that, even if I left max
> > available bandwidth to 95%, the CPU is busy at 100%.
>
> Yeah, pretty much completely overloaded.
>
> > So, yeah, Luca, I think we have a problem. :-)
> >
> > Will try to find more time soon and keep looking into this.
>
> Thank you very much and just let me know if I can help in any way.
I have been playing a little more with this and noticed (by chance) that
after writing a value to sched_rt_runtime_us (even the 950000 default)
this seems to 'work' - I don't see deadline misses anymore.
I thus have moved my attention to GRUB related per-cpu variables [1] and
noticed something that looks fishy with extra_bw: after boot and w/o any
DEADLINE tasks around (other than dl_servers) all dl_rqs have different
values [2]. E.g.,
extra_bw : (u64)447170
extra_bw : (u64)604454
extra_bw : (u64)656882
extra_bw : (u64)691834
extra_bw : (u64)718048
extra_bw : (u64)739018
extra_bw : (u64)756494
extra_bw : (u64)771472
extra_bw : (u64)784578
extra_bw : (u64)796228
...
When we write a value to sched_rt_runtime_us only extra_bw of the first
cpu of a root_domain gets updated. So, this might be the reason why
things seem to improve with single CPU domains like in the situation at
hand, but still probably broken in general. I think the issue here is
that we end up calling init_dl_rq_bw_ratio() only for the first cpu
after the introduction of dl_bw_visited() functionality.
So, this might be one thing to look at, but I am honestly still confused
about why we have weird numbers like the above after boot. I am also a bit
confused by the actual meaning and purpose of the 5 GRUB variables we
have to deal with.
Luca, Vineeth (for the recent introduction of max_bw), maybe we could
take a step back and re-check (and maybe document better :) what
each variable is meant to do and how it gets updated?
Thanks!
Juri
1 - Starts at https://elixir.bootlin.com/linux/v6.16-rc2/source/kernel/sched/sched.h#L866
2 - The drgn script I am using
---
#!/usr/bin/env drgn
desc = """
This is a drgn script to show the current root domains configuration. For more
info on drgn, visit https://github.com/osandov/drgn.
"""

import os
import argparse

import drgn
from drgn import FaultError, NULL, Object, alignof, cast, container_of, execscript, implicit_convert, offsetof, reinterpret, sizeof, stack_trace
from drgn.helpers.common import *
from drgn.helpers.linux import *


def print_dl_bws_info():
    print("Retrieving dl_rq Information:")

    runqueues = prog['runqueues']

    for cpu_id in for_each_possible_cpu(prog):
        try:
            rq = per_cpu(runqueues, cpu_id)
            dl_rq = rq.dl

            print(f"  From CPU: {cpu_id}")
            print(f"    running_bw : {dl_rq.running_bw}")
            print(f"    this_bw    : {dl_rq.this_bw}")
            print(f"    extra_bw   : {dl_rq.extra_bw}")
            print(f"    max_bw     : {dl_rq.max_bw}")
            print(f"    bw_ratio   : {dl_rq.bw_ratio}")
        except drgn.FaultError as fe:
            print(f"  (CPU {cpu_id}: Fault accessing kernel memory: {fe})")
        except AttributeError as ae:
            print(f"  (CPU {cpu_id}: Missing attribute for dl_rq (kernel struct change?): {ae})")
        except Exception as e:
            print(f"  (CPU {cpu_id}: An unexpected error occurred: {e})")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=desc,
                                     formatter_class=argparse.RawTextHelpFormatter)
    args = parser.parse_args()

    print_dl_bws_info()
---
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-06-20 9:29 ` Juri Lelli
@ 2025-06-20 9:37 ` luca abeni
2025-06-20 9:58 ` Juri Lelli
2025-06-20 14:16 ` luca abeni
0 siblings, 2 replies; 35+ messages in thread
From: luca abeni @ 2025-06-20 9:37 UTC (permalink / raw)
To: Juri Lelli
Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
Hi Juri,
On Fri, 20 Jun 2025 11:29:52 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:
[...]
> I have been playing a little more with this and noticed (by chance)
> that after writing a value on sched_rt_runtime_us (even the 950000
> default) this seem to 'work' - I don't see deadline misses anymore.
>
> I thus have moved my attention to GRUB related per-cpu variables [1]
> and noticed something that looks fishy with extra_bw: after boot and
> w/o any DEADLINE tasks around (other than dl_servers) all dl_rqs have
> different values [2]. E.g.,
>
> extra_bw : (u64)447170
> extra_bw : (u64)604454
[...]
> So, this might be one thing to look at, but I am honestly still
> confused by why we have weird numbers as the above after boot. Also a
> bit confused by the actual meaning and purpose of the 5 GRUB
> variables we have to deal with.
Sorry about that... I was under the impression they were documented in
some comments, but I might be wrong...
> Luca, Vineeth (for the recent introduction of max_bw), maybe we could
> take a step back and re-check (and maybe and document better :) what
> each variable is meant to do and how it gets updated?
I am not sure about the funny values initially assigned to these
variables, but I can surely provide some documentation about what these
variables represent... I am going to look at this and I'll send some
comments or patches.
Luca
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-06-20 9:37 ` luca abeni
@ 2025-06-20 9:58 ` Juri Lelli
2025-06-20 14:16 ` luca abeni
1 sibling, 0 replies; 35+ messages in thread
From: Juri Lelli @ 2025-06-20 9:58 UTC (permalink / raw)
To: luca abeni
Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
On 20/06/25 11:37, luca abeni wrote:
> Hi Juri,
>
> On Fri, 20 Jun 2025 11:29:52 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
> [...]
> > I have been playing a little more with this and noticed (by chance)
> > that after writing a value on sched_rt_runtime_us (even the 950000
> > default) this seem to 'work' - I don't see deadline misses anymore.
> >
> > I thus have moved my attention to GRUB related per-cpu variables [1]
> > and noticed something that looks fishy with extra_bw: after boot and
> > w/o any DEADLINE tasks around (other than dl_servers) all dl_rqs have
> > different values [2]. E.g.,
> >
> > extra_bw : (u64)447170
> > extra_bw : (u64)604454
> [...]
> > So, this might be one thing to look at, but I am honestly still
> > confused by why we have weird numbers as the above after boot. Also a
> > bit confused by the actual meaning and purpose of the 5 GRUB
> > variables we have to deal with.
>
> Sorry about that... I was under the impression they were documented in
> some comments, but I might be wrong...
No worries! I am also culpable, as I did test and review the patches. :)
extra_bw in particular I believe can benefit from a bit of attention.
> > Luca, Vineeth (for the recent introduction of max_bw), maybe we could
> > take a step back and re-check (and maybe and document better :) what
> > each variable is meant to do and how it gets updated?
>
> I am not sure about the funny values initially assigned to these
> variables, but I can surely provide some documentation about what these
> variables represent... I am going to look at this and I'll send some
> comments or patches.
Thanks a lot! I am also continuing to dig.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-06-20 9:37 ` luca abeni
2025-06-20 9:58 ` Juri Lelli
@ 2025-06-20 14:16 ` luca abeni
2025-06-20 15:28 ` Juri Lelli
1 sibling, 1 reply; 35+ messages in thread
From: luca abeni @ 2025-06-20 14:16 UTC (permalink / raw)
To: Juri Lelli
Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
On Fri, 20 Jun 2025 11:37:45 +0200
luca abeni <luca.abeni@santannapisa.it> wrote:
[...]
> > Luca, Vineeth (for the recent introduction of max_bw), maybe we
> > could take a step back and re-check (and maybe and document better
> > :) what each variable is meant to do and how it gets updated?
>
> I am not sure about the funny values initially assigned to these
> variables, but I can surely provide some documentation about what
> these variables represent... I am going to look at this and I'll send
> some comments or patches.
So, I had a look trying to remember the situation... This is my
current understanding:
- the max_bw field should be just the maximum amount of CPU bandwidth we
want to use with reclaiming... It is rt_runtime_us / rt_period_us; I
guess it is cached in this field just to avoid computing it every
time.
So, max_bw should be updated only when
/proc/sys/kernel/sched_rt_{runtime,period}_us are written
- the extra_bw field represents an additional amount of CPU bandwidth
we can reclaim on each core (the original m-GRUB algorithm just
reclaimed Uinact, the utilization of inactive tasks).
It is initialized to Umax when no SCHED_DEADLINE tasks exist and
should be decreased by Ui when a task with utilization Ui becomes
SCHED_DEADLINE (and increased by Ui when the SCHED_DEADLINE task
terminates or changes scheduling policy). Since this value is
per-core, Ui is divided by the number of cores in the root domain...
From what you write, I guess extra_bw is not correctly
initialized/updated when a new root domain is created?
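Condensing the above into formulas (my paraphrase, not authoritative),
with m the number of CPUs in the root domain and
u_i = dl_runtime_i / dl_period_i for each admitted SCHED_DEADLINE task:

	Umax = max_bw = sched_rt_runtime_us / sched_rt_period_us
	extra_bw (per runqueue) = Umax - (sum_i u_i) / m

and, if I am reading grub_reclaim() correctly, the budget of a reclaiming
task is then depleted as

	dq = -(max{ u_i, Umax - Uinact - extra_bw } / Umax) dt

where Uinact is the utilization of the currently inactive tasks
(this_bw - running_bw, if I remember the code correctly). So an extra_bw
that is too large makes the depletion too slow and lets a reclaiming task
hog the CPU, which seems consistent with the monopolizing behaviour in
the reclaim plot.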
All this information is probably not properly documented... Should I
improve the description in Documentation/scheduler/sched-deadline.rst
or do you prefer some comments in kernel/sched/deadline.c? (or .h?)
Luca
>
>
> Luca
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-06-20 14:16 ` luca abeni
@ 2025-06-20 15:28 ` Juri Lelli
2025-06-20 16:52 ` luca abeni
0 siblings, 1 reply; 35+ messages in thread
From: Juri Lelli @ 2025-06-20 15:28 UTC (permalink / raw)
To: luca abeni
Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
On 20/06/25 16:16, luca abeni wrote:
> On Fri, 20 Jun 2025 11:37:45 +0200
> luca abeni <luca.abeni@santannapisa.it> wrote:
> [...]
> > > Luca, Vineeth (for the recent introduction of max_bw), maybe we
> > > could take a step back and re-check (and maybe and document better
> > > :) what each variable is meant to do and how it gets updated?
> >
> > I am not sure about the funny values initially assigned to these
> > variables, but I can surely provide some documentation about what
> > these variables represent... I am going to look at this and I'll send
> > some comments or patches.
>
> So, I had a look tying to to remember the situation... This is my
> current understanding:
> - the max_bw field should be just the maximum amount of CPU bandwidth we
> want to use with reclaiming... It is rt_runtime_us / rt_period_us; I
> guess it is cached in this field just to avoid computing it every
> time.
> So, max_bw should be updated only when
> /proc/sys/kernel/sched_rt_{runtime,period}_us are written
> - the extra_bw field represents an additional amount of CPU bandwidth
> we can reclaim on each core (the original m-GRUB algorithm just
> reclaimed Uinact, the utilization of inactive tasks).
> It is initialized to Umax when no SCHED_DEADLINE tasks exist and
Is Umax == max_bw from above?
> should be decreased by Ui when a task with utilization Ui becomes
> SCHED_DEADLINE (and increased by Ui when the SCHED_DEADLINE task
> terminates or changes scheduling policy). Since this value is
> per_core, Ui is divided by the number of cores in the root domain...
> From what you write, I guess extra_bw is not correctly
> initialized/updated when a new root domain is created?
It looks like so, yeah. After boot and when domains are dynamically
created. But I am still not 100% sure, I only see weird numbers that I
struggle to relate to what you say above. :)
> All this information is probably not properly documented... Should I
> improve the description in Documentation/scheduler/sched-deadline.rst
> or do you prefer some comments in kernel/sched/deadline.c? (or .h?)
I think ideally both. sched-deadline.rst should probably contain the
whole picture with more information and .c/.h the condensed version.
Thanks!
Juri
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-06-20 15:28 ` Juri Lelli
@ 2025-06-20 16:52 ` luca abeni
2025-06-24 7:49 ` Juri Lelli
2025-06-24 13:36 ` luca abeni
0 siblings, 2 replies; 35+ messages in thread
From: luca abeni @ 2025-06-20 16:52 UTC (permalink / raw)
To: Juri Lelli
Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
On Fri, 20 Jun 2025 17:28:28 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:
> On 20/06/25 16:16, luca abeni wrote:
[...]
> > So, I had a look tying to to remember the situation... This is my
> > current understanding:
> > - the max_bw field should be just the maximum amount of CPU
> > bandwidth we want to use with reclaiming... It is rt_runtime_us /
> > rt_period_us; I guess it is cached in this field just to avoid
> > computing it every time.
> > So, max_bw should be updated only when
> > /proc/sys/kernel/sched_rt_{runtime,period}_us are written
> > - the extra_bw field represents an additional amount of CPU
> > bandwidth we can reclaim on each core (the original m-GRUB
> > algorithm just reclaimed Uinact, the utilization of inactive tasks).
> > It is initialized to Umax when no SCHED_DEADLINE tasks exist and
>
> Is Umax == max_bw from above?
Yes; sorry about the confusion
> > should be decreased by Ui when a task with utilization Ui becomes
> > SCHED_DEADLINE (and increased by Ui when the SCHED_DEADLINE task
> > terminates or changes scheduling policy). Since this value is
> > per_core, Ui is divided by the number of cores in the root
> > domain... From what you write, I guess extra_bw is not correctly
> > initialized/updated when a new root domain is created?
>
> It looks like so yeah. After boot and when domains are dinamically
> created. But, I am still not 100%, I only see weird numbers that I
> struggle to relate with what you say above. :)
BTW, when running some tests on different machines I think I found out
that 6.11 does not exhibit this issue (this needs to be confirmed, I am
working on reproducing the test with different kernels on the same
machine)
If I manage to reproduce this result, I think I can run a bisect to the
commit introducing the issue (git is telling me that I'll need about 15
tests :)
So, stay tuned...
> > All this information is probably not properly documented... Should I
> > improve the description in
> > Documentation/scheduler/sched-deadline.rst or do you prefer some
> > comments in kernel/sched/deadline.c? (or .h?)
>
> I think ideally both. sched-deadline.rst should probably contain the
> whole picture with more information and .c/.h the condendensed
> version.
OK, I'll try to do this next week
Thanks,
Luca
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-06-20 16:52 ` luca abeni
@ 2025-06-24 7:49 ` Juri Lelli
2025-06-24 12:59 ` Juri Lelli
2025-06-24 13:36 ` luca abeni
1 sibling, 1 reply; 35+ messages in thread
From: Juri Lelli @ 2025-06-24 7:49 UTC (permalink / raw)
To: luca abeni
Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
On 20/06/25 18:52, luca abeni wrote:
> On Fri, 20 Jun 2025 17:28:28 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
>
> > On 20/06/25 16:16, luca abeni wrote:
> [...]
> > > So, I had a look tying to to remember the situation... This is my
> > > current understanding:
> > > - the max_bw field should be just the maximum amount of CPU
> > > bandwidth we want to use with reclaiming... It is rt_runtime_us /
> > > rt_period_us; I guess it is cached in this field just to avoid
> > > computing it every time.
> > > So, max_bw should be updated only when
> > > /proc/sys/kernel/sched_rt_{runtime,period}_us are written
> > > - the extra_bw field represents an additional amount of CPU
> > > bandwidth we can reclaim on each core (the original m-GRUB
> > > algorithm just reclaimed Uinact, the utilization of inactive tasks).
> > > It is initialized to Umax when no SCHED_DEADLINE tasks exist and
> >
> > Is Umax == max_bw from above?
>
> Yes; sorry about the confusion
>
>
> > > should be decreased by Ui when a task with utilization Ui becomes
> > > SCHED_DEADLINE (and increased by Ui when the SCHED_DEADLINE task
> > > terminates or changes scheduling policy). Since this value is
> > > per_core, Ui is divided by the number of cores in the root
> > > domain... From what you write, I guess extra_bw is not correctly
> > > initialized/updated when a new root domain is created?
> >
> > It looks like so yeah. After boot and when domains are dinamically
> > created. But, I am still not 100%, I only see weird numbers that I
> > struggle to relate with what you say above. :)
>
> BTW, when running some tests on different machines I think I found out
> that 6.11 does not exhibit this issue (this needs to be confirmed, I am
> working on reproducing the test with different kernels on the same
> machine)
>
> If I manage to reproduce this result, I think I can run a bisect to the
> commit introducing the issue (git is telling me that I'll need about 15
> tests :)
> So, stay tuned...
The following seems to at least cure the problem after boot. Things are
still broken after cpusets creation. Moving to look into that, but
wanted to share where I am so that we don't duplicate work.
Rationale for the below is that we currently end up calling
__dl_update() with 'cpus' that are not stable yet. So, I tried to move
initialization after SMP is up (all CPUs have been onlined).
---
kernel/sched/core.c | 3 +++
kernel/sched/deadline.c | 39 +++++++++++++++++++++++----------------
kernel/sched/sched.h | 1 +
3 files changed, 27 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8988d38d46a38..d152f8a84818b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8470,6 +8470,8 @@ void __init sched_init_smp(void)
init_sched_rt_class();
init_sched_dl_class();
+ sched_init_dl_servers();
+
sched_smp_initialized = true;
}
@@ -8484,6 +8486,7 @@ early_initcall(migration_init);
void __init sched_init_smp(void)
{
sched_init_granularity();
+ sched_init_dl_servers();
}
#endif /* CONFIG_SMP */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index ad45a8fea245e..9f3b3f3592a58 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1647,22 +1647,6 @@ void dl_server_start(struct sched_dl_entity *dl_se)
{
struct rq *rq = dl_se->rq;
- /*
- * XXX: the apply do not work fine at the init phase for the
- * fair server because things are not yet set. We need to improve
- * this before getting generic.
- */
- if (!dl_server(dl_se)) {
- u64 runtime = 50 * NSEC_PER_MSEC;
- u64 period = 1000 * NSEC_PER_MSEC;
-
- dl_server_apply_params(dl_se, runtime, period, 1);
-
- dl_se->dl_server = 1;
- dl_se->dl_defer = 1;
- setup_new_dl_entity(dl_se);
- }
-
if (!dl_se->dl_runtime)
return;
@@ -1693,6 +1677,29 @@ void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
dl_se->server_pick_task = pick_task;
}
+void sched_init_dl_servers(void)
+{
+ int cpu;
+ struct rq *rq;
+ struct sched_dl_entity *dl_se;
+
+ for_each_online_cpu(cpu) {
+ u64 runtime = 50 * NSEC_PER_MSEC;
+ u64 period = 1000 * NSEC_PER_MSEC;
+
+ rq = cpu_rq(cpu);
+ dl_se = &rq->fair_server;
+
+ WARN_ON(dl_server(dl_se));
+
+ dl_server_apply_params(dl_se, runtime, period, 1);
+
+ dl_se->dl_server = 1;
+ dl_se->dl_defer = 1;
+ setup_new_dl_entity(dl_se);
+ }
+}
+
void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
{
u64 new_bw = dl_se->dl_bw;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 475bb5998295e..22301c28a5d2d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -384,6 +384,7 @@ extern void dl_server_stop(struct sched_dl_entity *dl_se);
extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
dl_server_has_tasks_f has_tasks,
dl_server_pick_f pick_task);
+extern void sched_init_dl_servers(void);
extern void dl_server_update_idle_time(struct rq *rq,
struct task_struct *p);
--
2.49.0
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-06-24 7:49 ` Juri Lelli
@ 2025-06-24 12:59 ` Juri Lelli
2025-06-24 15:00 ` luca abeni
2025-06-25 15:55 ` Marcel Ziswiler
0 siblings, 2 replies; 35+ messages in thread
From: Juri Lelli @ 2025-06-24 12:59 UTC (permalink / raw)
To: luca abeni
Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
Hello again,
On 24/06/25 09:49, Juri Lelli wrote:
...
> The following seem to at least cure the problem after boot. Things are
> still broken after cpusets creation. Moving to look into that, but
> wanted to share where I am so that we don't duplicate work.
I ended up with two additional patches that seem to make things a little
better at my end. You can find them at
https://github.com/jlelli/linux/tree/upstream/fix-grub
Marcel, Luca, can you please give them a quick try to check if they do
any good?
Thanks!
Juri
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-06-20 16:52 ` luca abeni
2025-06-24 7:49 ` Juri Lelli
@ 2025-06-24 13:36 ` luca abeni
1 sibling, 0 replies; 35+ messages in thread
From: luca abeni @ 2025-06-24 13:36 UTC (permalink / raw)
To: Juri Lelli
Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
On Fri, 20 Jun 2025 18:52:48 +0200
luca abeni <luca.abeni@santannapisa.it> wrote:
[...]
> > > should be decreased by Ui when a task with utilization Ui
> > > becomes SCHED_DEADLINE (and increased by Ui when the
> > > SCHED_DEADLINE task terminates or changes scheduling policy).
> > > Since this value is per_core, Ui is divided by the number of
> > > cores in the root domain... From what you write, I guess extra_bw
> > > is not correctly initialized/updated when a new root domain is
> > > created?
> >
> > It looks like so yeah. After boot and when domains are dinamically
> > created. But, I am still not 100%, I only see weird numbers that I
> > struggle to relate with what you say above. :)
>
> BTW, when running some tests on different machines I think I found out
> that 6.11 does not exhibit this issue (this needs to be confirmed, I
> am working on reproducing the test with different kernels on the same
> machine)
>
> If I manage to reproduce this result, I think I can run a bisect to
> the commit introducing the issue (git is telling me that I'll need
> about 15 tests :)
> So, stay tuned...
It took more than I expected, but I think I found the guilty commit...
It seems to be
[5f6bd380c7bdbe10f7b4e8ddcceed60ce0714c6d] sched/rt: Remove default bandwidth control
Starting from this commit, I can reproduce the issue, but if I test the
previous commit (c8a85394cfdb4696b4e2f8a0f3066a1c921af426
sched/core: Fix picking of tasks for core scheduling with DL server)
the issue disappears.
Maybe this information can help in better understanding the problem :)
Luca
>
> > > All this information is probably not properly documented...
> > > Should I improve the description in
> > > Documentation/scheduler/sched-deadline.rst or do you prefer some
> > > comments in kernel/sched/deadline.c? (or .h?)
> >
> > I think ideally both. sched-deadline.rst should probably contain the
> > whole picture with more information and .c/.h the condendensed
> > version.
>
> OK, I'll try to do this in next week
>
>
> Thanks,
> Luca
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-06-24 12:59 ` Juri Lelli
@ 2025-06-24 15:00 ` luca abeni
2025-06-25 9:30 ` Juri Lelli
2025-06-25 15:55 ` Marcel Ziswiler
1 sibling, 1 reply; 35+ messages in thread
From: luca abeni @ 2025-06-24 15:00 UTC (permalink / raw)
To: Juri Lelli
Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
On Tue, 24 Jun 2025 14:59:13 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:
> Hello again,
>
> On 24/06/25 09:49, Juri Lelli wrote:
>
> ...
>
> > The following seem to at least cure the problem after boot. Things
> > are still broken after cpusets creation. Moving to look into that,
> > but wanted to share where I am so that we don't duplicate work.
>
> I ended up with two additional patches that seem to make things a
> little better at my end. You can find them at
>
> https://github.com/jlelli/linux/tree/upstream/fix-grub
>
> Marcel, Luca, can you please give them a quick try to check if they do
> any good?
I applied your 3 patches to the master branch of linux.git, and they
indeed seem to fix the issue!
Now, I need to understand how they relate to
5f6bd380c7bdbe10f7b4e8ddcceed60ce0714c6d :)
One small issue: after applying your patches, I get this WARN at boot
time:
[ 0.384481] ------------[ cut here ]------------
[ 0.385384] WARNING: CPU: 0 PID: 1 at kernel/sched/deadline.c:265 task_non_contending+0x24d/0x3b0
[ 0.385384] Modules linked in:
[ 0.385384] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0-rc2-00234-ge35a18896578 #42 PREEMPT(voluntary)
[ 0.385384] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 0.385384] RIP: 0010:task_non_contending+0x24d/0x3b0
[ 0.385384] Code: 59 49 00 e9 7a fe ff ff 48 8b 53 30 f6 43 53 10 0f 85 4c ff ff ff 48 8b 85 c8 08 00 00 48 29 d0 48 89 85 c8 08 00 00 73 0f 90 <0f> 0b 90 48 c7 85 c8 08 00 00 00 00 00 00 48 63 95 28 0b 00 00 48
[ 0.385384] RSP: 0000:ffffb52300013c08 EFLAGS: 00010093
[ 0.385384] RAX: ffffffffffff3334 RBX: ffff979ffe8292b0 RCX: 0000000000000001
[ 0.385384] RDX: 000000000000cccc RSI: 0000000002faf080 RDI: ffff979ffe8292b0
[ 0.385384] RBP: ffff979ffe8289c0 R08: 0000000000000001 R09: 00000000000002a5
[ 0.385384] R10: 0000000000000000 R11: 0000000000000001 R12: ffffffffffe0ab69
[ 0.385384] R13: ffff979ffe828a40 R14: 0000000000000009 R15: ffff979ffe8289c0
[ 0.385384] FS: 0000000000000000(0000) GS:ffff97a05f709000(0000) knlGS:0000000000000000
[ 0.385384] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.385384] CR2: ffff979fdec01000 CR3: 000000001e030000 CR4: 00000000000006f0
[ 0.385384] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.385384] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 0.385384] Call Trace:
[ 0.385384] <TASK>
[ 0.385384] dl_server_stop+0x21/0x40
[ 0.385384] dequeue_entities+0x604/0x900
[ 0.385384] dequeue_task_fair+0x85/0x190
[ 0.385384] ? update_rq_clock+0x6c/0x110
[ 0.385384] __schedule+0x1f0/0xee0
[ 0.385384] schedule+0x22/0xd0
[ 0.385384] schedule_timeout+0xf4/0x100
[ 0.385384] __wait_for_common+0x97/0x180
[ 0.385384] ? __pfx_schedule_timeout+0x10/0x10
[ 0.385384] ? __pfx_devtmpfsd+0x10/0x10
[ 0.385384] wait_for_completion_killable+0x1f/0x40
[ 0.385384] __kthread_create_on_node+0xe7/0x150
[ 0.385384] kthread_create_on_node+0x4f/0x70
[ 0.385384] ? register_filesystem+0x97/0xc0
[ 0.385384] devtmpfs_init+0x115/0x200
[ 0.385384] driver_init+0x15/0x50
[ 0.385384] kernel_init_freeable+0xf4/0x2d0
[ 0.385384] ? __pfx_kernel_init+0x10/0x10
[ 0.385384] kernel_init+0x15/0x1c0
[ 0.385384] ret_from_fork+0x80/0xd0
[ 0.385384] ? __pfx_kernel_init+0x10/0x10
[ 0.385384] ret_from_fork_asm+0x1a/0x30
[ 0.385384] </TASK>
[ 0.385384] ---[ end trace 0000000000000000 ]---
Luca
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-06-24 15:00 ` luca abeni
@ 2025-06-25 9:30 ` Juri Lelli
2025-06-25 10:11 ` Juri Lelli
0 siblings, 1 reply; 35+ messages in thread
From: Juri Lelli @ 2025-06-25 9:30 UTC (permalink / raw)
To: luca abeni
Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
On 24/06/25 17:00, luca abeni wrote:
> On Tue, 24 Jun 2025 14:59:13 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
>
> > Hello again,
> >
> > On 24/06/25 09:49, Juri Lelli wrote:
> >
> > ...
> >
> > > The following seem to at least cure the problem after boot. Things
> > > are still broken after cpusets creation. Moving to look into that,
> > > but wanted to share where I am so that we don't duplicate work.
> >
> > I ended up with two additional patches that seem to make things a
> > little better at my end. You can find them at
> >
> > https://github.com/jlelli/linux/tree/upstream/fix-grub
> >
> > Marcel, Luca, can you please give them a quick try to check if they do
> > any good?
>
> I applied your 3 patches to the master branch of linux.git, and they
> indeed seems to fix the issue!
>
> Now, I need to understand how they relate to
> 5f6bd380c7bdbe10f7b4e8ddcceed60ce0714c6d :)
>
> One small issue: after applying your patches, I get this WARN at boot
> time:
> [ 0.384481] ------------[ cut here ]------------
> [ 0.385384] WARNING: CPU: 0 PID: 1 at kernel/sched/deadline.c:265 task_non_contending+0x24d/0x3b0
> [ 0.385384] Modules linked in:
> [ 0.385384] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0-rc2-00234-ge35a18896578 #42 PREEMPT(voluntary)
> [ 0.385384] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [ 0.385384] RIP: 0010:task_non_contending+0x24d/0x3b0
> [ 0.385384] Code: 59 49 00 e9 7a fe ff ff 48 8b 53 30 f6 43 53 10 0f 85 4c ff ff ff 48 8b 85 c8 08 00 00 48 29 d0 48 89 85 c8 08 00 00 73 0f 90 <0f> 0b 90 48 c7 85 c8 08 00 00 00 00 00 00 48 63 95 28 0b 00 00 48
> [ 0.385384] RSP: 0000:ffffb52300013c08 EFLAGS: 00010093
> [ 0.385384] RAX: ffffffffffff3334 RBX: ffff979ffe8292b0 RCX: 0000000000000001
> [ 0.385384] RDX: 000000000000cccc RSI: 0000000002faf080 RDI: ffff979ffe8292b0
> [ 0.385384] RBP: ffff979ffe8289c0 R08: 0000000000000001 R09: 00000000000002a5
> [ 0.385384] R10: 0000000000000000 R11: 0000000000000001 R12: ffffffffffe0ab69
> [ 0.385384] R13: ffff979ffe828a40 R14: 0000000000000009 R15: ffff979ffe8289c0
> [ 0.385384] FS: 0000000000000000(0000) GS:ffff97a05f709000(0000) knlGS:0000000000000000
> [ 0.385384] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 0.385384] CR2: ffff979fdec01000 CR3: 000000001e030000 CR4: 00000000000006f0
> [ 0.385384] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 0.385384] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 0.385384] Call Trace:
> [ 0.385384] <TASK>
> [ 0.385384] dl_server_stop+0x21/0x40
> [ 0.385384] dequeue_entities+0x604/0x900
> [ 0.385384] dequeue_task_fair+0x85/0x190
> [ 0.385384] ? update_rq_clock+0x6c/0x110
> [ 0.385384] __schedule+0x1f0/0xee0
> [ 0.385384] schedule+0x22/0xd0
> [ 0.385384] schedule_timeout+0xf4/0x100
> [ 0.385384] __wait_for_common+0x97/0x180
> [ 0.385384] ? __pfx_schedule_timeout+0x10/0x10
> [ 0.385384] ? __pfx_devtmpfsd+0x10/0x10
> [ 0.385384] wait_for_completion_killable+0x1f/0x40
> [ 0.385384] __kthread_create_on_node+0xe7/0x150
> [ 0.385384] kthread_create_on_node+0x4f/0x70
> [ 0.385384] ? register_filesystem+0x97/0xc0
> [ 0.385384] devtmpfs_init+0x115/0x200
> [ 0.385384] driver_init+0x15/0x50
> [ 0.385384] kernel_init_freeable+0xf4/0x2d0
> [ 0.385384] ? __pfx_kernel_init+0x10/0x10
> [ 0.385384] kernel_init+0x15/0x1c0
> [ 0.385384] ret_from_fork+0x80/0xd0
> [ 0.385384] ? __pfx_kernel_init+0x10/0x10
> [ 0.385384] ret_from_fork_asm+0x1a/0x30
> [ 0.385384] </TASK>
> [ 0.385384] ---[ end trace 0000000000000000 ]---
I now see it as well, not sure how I missed it, maybe didn't pay enough
attention. :)
It looks like (at least at my end) it comes from
task_non_contending()
sub_running_bw()
__sub_running_bw()
WARN_ON_ONCE(dl_rq->running_bw > old); /* underflow */
I would guess the later initialization of dl-server is not playing well
wrt running_bw. Will take a look.
BTW, I pushed an additional fixup commit (forgot some needed locking
here and there, oops :).
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-06-25 9:30 ` Juri Lelli
@ 2025-06-25 10:11 ` Juri Lelli
2025-06-25 12:50 ` luca abeni
0 siblings, 1 reply; 35+ messages in thread
From: Juri Lelli @ 2025-06-25 10:11 UTC (permalink / raw)
To: luca abeni
Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
On 25/06/25 11:30, Juri Lelli wrote:
...
> It looks like (at least at my end) it comes from
>
> task_non_contending()
> sub_running_bw()
> __sub_running_bw()
> WARN_ON_ONCE(dl_rq->running_bw > old); /* underflow */
>
> I would guess the later initialization of dl-server is not playing well
> wrt running_bw. Will take a look.
I pushed another fixup adding a check for dl_server_active in
dl_server_stop(). It seems to cure the WARN here.
Could you please pull and re-test?
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-06-25 10:11 ` Juri Lelli
@ 2025-06-25 12:50 ` luca abeni
2025-06-26 10:59 ` Marcel Ziswiler
0 siblings, 1 reply; 35+ messages in thread
From: luca abeni @ 2025-06-25 12:50 UTC (permalink / raw)
To: Juri Lelli
Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
Hi Juri,
On Wed, 25 Jun 2025 12:11:46 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:
[...]
> I pushed another fixup adding a check for dl_server_active in
> dl_server_stop(). It seems to cure the WARN here.
>
> Could you please pull and re-test?
I added your last 2 commits, and tested again; it seems to me that
everything looks fine, now... Marcel, can you confirm?
Luca
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-06-24 12:59 ` Juri Lelli
2025-06-24 15:00 ` luca abeni
@ 2025-06-25 15:55 ` Marcel Ziswiler
1 sibling, 0 replies; 35+ messages in thread
From: Marcel Ziswiler @ 2025-06-25 15:55 UTC (permalink / raw)
To: Juri Lelli, luca abeni
Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai
Hi Juri
On Tue, 2025-06-24 at 14:59 +0200, Juri Lelli wrote:
> Hello again,
>
> On 24/06/25 09:49, Juri Lelli wrote:
>
> ...
>
> > The following seem to at least cure the problem after boot. Things are
> > still broken after cpusets creation. Moving to look into that, but
> > wanted to share where I am so that we don't duplicate work.
>
> I ended up with two additional patches that seem to make things a little
> better at my end. You can find them at
>
> https://github.com/jlelli/linux/tree/upstream/fix-grub
>
> Marcel, Luca, can you please give them a quick try to check if they do
> any good?
I gave this a try yesterday and ran a first longer test in our CI. While that has only run for 16 hours so far, doing
30+ million tests on NUCs and 75+ million on ROCK 5B, it really looks promising.
I will now update to your latest patches and re-run those tests. Usually we need 40+ hours of testing to really
be confident in our statistics around those tests.
> Thanks!
Thank you!
> Juri
Cheers from the OSS NA/ELC in Denver
Marcel
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-06-25 12:50 ` luca abeni
@ 2025-06-26 10:59 ` Marcel Ziswiler
2025-06-26 11:45 ` Juri Lelli
0 siblings, 1 reply; 35+ messages in thread
From: Marcel Ziswiler @ 2025-06-26 10:59 UTC (permalink / raw)
To: luca abeni, Juri Lelli
Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai
Hi Luca and Juri
On Wed, 2025-06-25 at 14:50 +0200, luca abeni wrote:
> Hi Juri,
>
> On Wed, 25 Jun 2025 12:11:46 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
> [...]
> > I pushed another fixup adding a check for dl_server_active in
> > dl_server_stop(). It seems to cure the WARN here.
> >
> > Could you please pull and re-test?
>
> I added your last 2 commits, and tested again; it seems to me that
> everything looks fine, now... Marcel, can you confirm?
Indeed, our CI has now run close to 220 million tests on NUCs and 190 million on ROCK 5B, and so far it hasn't missed a
single beat! Also the statistics around those tests look very good. With reclaim enabled one can now truly get
very good real-time performance. Thank you very much!
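For anyone trying to reproduce this kind of workload: reclaim is opted into per task at admission time via
sched_setattr(), along the lines of the sketch below (following the sched_setattr(2) man page pattern; the
runtime/deadline/period values are placeholders, not our actual test configuration):
#include <linux/sched.h>        /* SCHED_DEADLINE, SCHED_FLAG_RECLAIM */
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <string.h>

struct sched_attr {                     /* layout as in sched_setattr(2) */
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;         /* ns */
        uint64_t sched_deadline;        /* ns */
        uint64_t sched_period;          /* ns */
};

int main(void)
{
        struct sched_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size           = sizeof(attr);
        attr.sched_policy   = SCHED_DEADLINE;
        attr.sched_flags    = SCHED_FLAG_RECLAIM;  /* enable GRUB reclaiming */
        attr.sched_runtime  = 2 * 1000 * 1000;     /* 2 ms, placeholder */
        attr.sched_deadline = 5 * 1000 * 1000;     /* 5 ms, placeholder */
        attr.sched_period   = 5 * 1000 * 1000;     /* 5 ms, placeholder */

        if (syscall(SYS_sched_setattr, 0, &attr, 0))
                return 1;       /* admission control may reject the parameters */

        /* periodic job body would go here, yielding at the end of each job */
        return 0;
}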
We will continue to exercise the Linux kernel scheduler to the fullest and report any inconsistencies we are
seeing.
Just let me know if there is anything else we may help you with. Thanks again!
> Luca
Cheers
Marcel
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
2025-06-26 10:59 ` Marcel Ziswiler
@ 2025-06-26 11:45 ` Juri Lelli
0 siblings, 0 replies; 35+ messages in thread
From: Juri Lelli @ 2025-06-26 11:45 UTC (permalink / raw)
To: Marcel Ziswiler
Cc: luca abeni, linux-kernel, Ingo Molnar, Peter Zijlstra,
Vineeth Pillai
On 26/06/25 04:59, Marcel Ziswiler wrote:
> Hi Luca and Juri
>
> On Wed, 2025-06-25 at 14:50 +0200, luca abeni wrote:
> > Hi Juri,
> >
> > On Wed, 25 Jun 2025 12:11:46 +0200
> > Juri Lelli <juri.lelli@redhat.com> wrote:
> > [...]
> > > I pushed another fixup adding a check for dl_server_active in
> > > dl_server_stop(). It seems to cure the WARN here.
> > >
> > > Could you please pull and re-test?
> >
> > I added your last 2 commits, and tested again; it seems to me that
> > everything looks fine, now... Marcel, can you confirm?
>
> Indeed, our CI has now run close to 220 million tests on NUCs and 190 million on ROCK 5B, and so far it hasn't missed a
> single beat! Also the statistics around those tests look very good. With reclaim enabled one can now truly get
> very good real-time performance. Thank you very much!
>
> We will continue to exercise the Linux kernel scheduler to the fullest and report any inconsistencies we are
> seeing.
>
> Just let me know if there is anything else we may help you with. Thanks again!
Great! Thanks a lot for the testing and for your patience. :-)
I will be sending out a polished version of the set soon. Please take a
look and add your reviewed/tested-by to that if you can. The changes are
the same ones you have been testing already, just with changelogs etc. Let's
see if people spot problems with the actual implementation of the fixes.
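(For completeness, those tags are plain trailer lines sent in a reply to the corresponding patch posting; the
name and address below are placeholders:)
Tested-by: Firstname Lastname <first.last@example.com>
Reviewed-by: Firstname Lastname <first.last@example.com>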
Best,
Juri
^ permalink raw reply [flat|nested] 35+ messages in thread
Thread overview: 35+ messages
2025-04-28 18:04 SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) Marcel Ziswiler
2025-05-02 13:55 ` Juri Lelli
2025-05-02 14:10 ` luca abeni
2025-05-03 13:14 ` Marcel Ziswiler
2025-05-05 15:53 ` luca abeni
2025-05-03 11:14 ` Marcel Ziswiler
2025-05-07 20:25 ` luca abeni
2025-05-19 13:32 ` Marcel Ziswiler
2025-05-20 16:09 ` luca abeni
2025-05-21 9:59 ` Marcel Ziswiler
2025-05-23 19:46 ` luca abeni
2025-05-25 19:29 ` Marcel Ziswiler
2025-05-29 9:39 ` Juri Lelli
2025-06-02 14:59 ` Marcel Ziswiler
2025-06-17 12:21 ` Juri Lelli
2025-06-18 11:24 ` Marcel Ziswiler
2025-06-20 9:29 ` Juri Lelli
2025-06-20 9:37 ` luca abeni
2025-06-20 9:58 ` Juri Lelli
2025-06-20 14:16 ` luca abeni
2025-06-20 15:28 ` Juri Lelli
2025-06-20 16:52 ` luca abeni
2025-06-24 7:49 ` Juri Lelli
2025-06-24 12:59 ` Juri Lelli
2025-06-24 15:00 ` luca abeni
2025-06-25 9:30 ` Juri Lelli
2025-06-25 10:11 ` Juri Lelli
2025-06-25 12:50 ` luca abeni
2025-06-26 10:59 ` Marcel Ziswiler
2025-06-26 11:45 ` Juri Lelli
2025-06-25 15:55 ` Marcel Ziswiler
2025-06-24 13:36 ` luca abeni
2025-05-30 9:21 ` luca abeni
2025-06-03 11:18 ` Marcel Ziswiler
2025-06-06 13:16 ` luca abeni