* SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) @ 2025-04-28 18:04 Marcel Ziswiler 2025-05-02 13:55 ` Juri Lelli 0 siblings, 1 reply; 35+ messages in thread From: Marcel Ziswiler @ 2025-04-28 18:04 UTC (permalink / raw) To: linux-kernel Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vineeth Pillai, Daniel Bristot de Oliveira Hi As part of our trustable work [1], we also run a lot of real time scheduler (SCHED_DEADLINE) tests on the mainline Linux kernel. Overall, the Linux scheduler proves quite capable of scheduling deadline tasks down to a granularity of 5ms on both of our test systems (amd64-based Intel NUCs and aarch64-based RADXA ROCK5Bs). However, recently, we noticed a lot of deadline misses if we introduce overrunning jobs with reclaim mode enabled (SCHED_FLAG_RECLAIM) using GRUB (Greedy Reclamation of Unused Bandwidth). E.g. from hundreds of millions of test runs over the course of a full week where we usually see absolutely zero deadline misses, we see 43 million deadline misses on NUC and 600 thousand on ROCK5B (which also has double the CPU cores). This is with otherwise exactly the same test configuration, which adds exactly the same two overrunning jobs to the job mix, but once without reclaim enabled and once with reclaim enabled. We are wondering whether there are any known limitations to GRUB or what exactly could be the issue. We are happy to provide more detailed debugging information but are looking for suggestions how/what exactly to look at. Any help is much appreciated. Thanks! Cheers Marcel [1] https://projects.eclipse.org/projects/technology.tsf ^ permalink raw reply [flat|nested] 35+ messages in thread
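For readers new to the flag: a task opts into GRUB bandwidth reclaiming by passing SCHED_FLAG_RECLAIM to sched_setattr(2) alongside its SCHED_DEADLINE parameters. The snippet below is a minimal sketch, not code from the test setup discussed in this thread; it shows how such an overrunning, reclaiming task could look, reusing the 0.5 ms runtime / 10 ms period of the "rogue" task described later in the thread.

/*
 * Minimal sketch (hypothetical, not the actual test harness): create a
 * SCHED_DEADLINE reservation with GRUB reclaiming for the calling thread,
 * then overrun it on purpose by spinning forever.
 */
#include <linux/sched.h>          /* SCHED_DEADLINE, SCHED_FLAG_RECLAIM */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

struct sched_attr {               /* layout as documented in sched_setattr(2) */
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;   /* ns */
	uint64_t sched_deadline;  /* ns */
	uint64_t sched_period;    /* ns */
};

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size           = sizeof(attr);
	attr.sched_policy   = SCHED_DEADLINE;
	attr.sched_flags    = SCHED_FLAG_RECLAIM;  /* enable GRUB reclaiming */
	attr.sched_runtime  =   500 * 1000ULL;     /* 0.5 ms */
	attr.sched_deadline = 10000 * 1000ULL;     /* 10 ms  */
	attr.sched_period   = 10000 * 1000ULL;     /* 10 ms  */

	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}

	for (;;)          /* overrun: never blocks, relies on runtime enforcement */
		;
}

Without SCHED_FLAG_RECLAIM such a task is simply throttled once its 0.5 ms runtime is exhausted; with the flag set it is allowed to keep running on reclaimed bandwidth, which is exactly the configuration difference the reporter describes.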
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-04-28 18:04 SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) Marcel Ziswiler @ 2025-05-02 13:55 ` Juri Lelli 2025-05-02 14:10 ` luca abeni 2025-05-03 11:14 ` Marcel Ziswiler 0 siblings, 2 replies; 35+ messages in thread From: Juri Lelli @ 2025-05-02 13:55 UTC (permalink / raw) To: Marcel Ziswiler Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai, Luca Abeni Hi Marcel, On 28/04/25 20:04, Marcel Ziswiler wrote: > Hi > > As part of our trustable work [1], we also run a lot of real time scheduler (SCHED_DEADLINE) tests on the > mainline Linux kernel. Overall, the Linux scheduler proves quite capable of scheduling deadline tasks down to a > granularity of 5ms on both of our test systems (amd64-based Intel NUCs and aarch64-based RADXA ROCK5Bs). > However, recently, we noticed a lot of deadline misses if we introduce overrunning jobs with reclaim mode > enabled (SCHED_FLAG_RECLAIM) using GRUB (Greedy Reclamation of Unused Bandwidth). E.g. from hundreds of > millions of test runs over the course of a full week where we usually see absolutely zero deadline misses, we > see 43 million deadline misses on NUC and 600 thousand on ROCK5B (which also has double the CPU cores). This is > with otherwise exactly the same test configuration, which adds exactly the same two overrunning jobs to the job > mix, but once without reclaim enabled and once with reclaim enabled. > > We are wondering whether there are any known limitations to GRUB or what exactly could be the issue. > > We are happy to provide more detailed debugging information but are looking for suggestions how/what exactly to > look at. Could you add details of the taskset you are working with? The number of tasks, their reservation parameters (runtime, period, deadline) and how much they are running (or trying to run) each time they wake up. Also which one is using GRUB and which one maybe is not. Adding Luca in Cc so he can also take a look. Thanks, Juri ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-05-02 13:55 ` Juri Lelli @ 2025-05-02 14:10 ` luca abeni 2025-05-03 13:14 ` Marcel Ziswiler 2025-05-03 11:14 ` Marcel Ziswiler 1 sibling, 1 reply; 35+ messages in thread From: luca abeni @ 2025-05-02 14:10 UTC (permalink / raw) To: Juri Lelli Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai Hi all, On Fri, 2 May 2025 15:55:42 +0200 Juri Lelli <juri.lelli@redhat.com> wrote: > Hi Marcel, > > On 28/04/25 20:04, Marcel Ziswiler wrote: > > Hi > > > > As part of our trustable work [1], we also run a lot of real time > > scheduler (SCHED_DEADLINE) tests on the mainline Linux kernel. > > Overall, the Linux scheduler proves quite capable of scheduling > > deadline tasks down to a granularity of 5ms on both of our test > > systems (amd64-based Intel NUCs and aarch64-based RADXA ROCK5Bs). > > However, recently, we noticed a lot of deadline misses if we > > introduce overrunning jobs with reclaim mode enabled > > (SCHED_FLAG_RECLAIM) using GRUB (Greedy Reclamation of Unused > > Bandwidth). E.g. from hundreds of millions of test runs over the > > course of a full week where we usually see absolutely zero deadline > > misses, we see 43 million deadline misses on NUC and 600 thousand > > on ROCK5B (which also has double the CPU cores). This is with > > otherwise exactly the same test configuration, which adds exactly > > the same two overrunning jobs to the job mix, but once without > > reclaim enabled and once with reclaim enabled. > > > > We are wondering whether there are any known limitations to GRUB or > > what exactly could be the issue. > > > > We are happy to provide more detailed debugging information but are > > looking for suggestions how/what exactly to look at. > > Could you add details of the taskset you are working with? The number > of tasks, their reservation parameters (runtime, period, deadline) > and how much they are running (or trying to run) each time they wake > up. Also which one is using GRUB and which one maybe is not. > > Adding Luca in Cc so he can also take a look. Thanks for cc-ing me, Jury! Marcel, are your tests on a multi-core machine with global scheduling? If yes, we should check if the taskset is schedulable. Thanks, Luca ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-05-02 14:10 ` luca abeni @ 2025-05-03 13:14 ` Marcel Ziswiler 2025-05-05 15:53 ` luca abeni 0 siblings, 1 reply; 35+ messages in thread From: Marcel Ziswiler @ 2025-05-03 13:14 UTC (permalink / raw) To: luca abeni, Juri Lelli Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai Hi Luca On Fri, 2025-05-02 at 16:10 +0200, luca abeni wrote: > Hi all, > > On Fri, 2 May 2025 15:55:42 +0200 > Juri Lelli <juri.lelli@redhat.com> wrote: > > > Hi Marcel, > > > > On 28/04/25 20:04, Marcel Ziswiler wrote: > > > Hi > > > > > > As part of our trustable work [1], we also run a lot of real time > > > scheduler (SCHED_DEADLINE) tests on the mainline Linux kernel. > > > Overall, the Linux scheduler proves quite capable of scheduling > > > deadline tasks down to a granularity of 5ms on both of our test > > > systems (amd64-based Intel NUCs and aarch64-based RADXA ROCK5Bs). > > > However, recently, we noticed a lot of deadline misses if we > > > introduce overrunning jobs with reclaim mode enabled > > > (SCHED_FLAG_RECLAIM) using GRUB (Greedy Reclamation of Unused > > > Bandwidth). E.g. from hundreds of millions of test runs over the > > > course of a full week where we usually see absolutely zero deadline > > > misses, we see 43 million deadline misses on NUC and 600 thousand > > > on ROCK5B (which also has double the CPU cores). This is with > > > otherwise exactly the same test configuration, which adds exactly > > > the same two overrunning jobs to the job mix, but once without > > > reclaim enabled and once with reclaim enabled. > > > > > > We are wondering whether there are any known limitations to GRUB or > > > what exactly could be the issue. > > > > > > We are happy to provide more detailed debugging information but are > > > looking for suggestions how/what exactly to look at. > > > > Could you add details of the taskset you are working with? The number > > of tasks, their reservation parameters (runtime, period, deadline) > > and how much they are running (or trying to run) each time they wake > > up. Also which one is using GRUB and which one maybe is not. > > > > Adding Luca in Cc so he can also take a look. > > Thanks for cc-ing me, Jury! > > Marcel, are your tests on a multi-core machine with global scheduling? > If yes, we should check if the taskset is schedulable. Yes, as previously mentioned, we run all our tests on multi-core machines. Not sure though, what exactly you are referring to by "global scheduling". Do you mean using Global Earliest Deadline First (GEDF)? I guess that is what SCHED_DEADLINE is using, not? Concerning the taskset being schedulable, it is not that it does not schedule at all. Remember, from hundreds of millions of test runs over the course of a full week where we usually see absolutely zero deadline misses (without reclaim), we see 43 million deadline misses (with that one rogue process set to reclaim) on NUC and 600 thousand on ROCK5B (which also has double the CPU cores). Please let me know if you need any further details which may help figuring out what exactly is going on. > Thanks, Thank you! > Luca Cheers Marcel ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-05-03 13:14 ` Marcel Ziswiler @ 2025-05-05 15:53 ` luca abeni 0 siblings, 0 replies; 35+ messages in thread From: luca abeni @ 2025-05-05 15:53 UTC (permalink / raw) To: Marcel Ziswiler Cc: Juri Lelli, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai Hi Marcel, On Sat, 03 May 2025 15:14:50 +0200 Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote: [...] > > Marcel, are your tests on a multi-core machine with global > > scheduling? If yes, we should check if the taskset is schedulable. > > Yes, as previously mentioned, we run all our tests on multi-core > machines. Not sure though, what exactly you are referring to by > "global scheduling". Do you mean using Global Earliest Deadline First > (GEDF)? I guess that is what SCHED_DEADLINE is using, not? Yes, I meant global EDF (and, yes, this is what SCHED_DEADLINE uses unless you play with isolated cpusets or affinities). One potential issue is that global EDF does not guarantee the hard respect of deadlines, but only provides guarantees about bounded tardiness. Then, in practice many tasksets run without missing deadlines even if they are not guaranteed to be schedulable (the hard schedulability tests are very pessimistic). When using GRUB (actually, m-GRUB), the runtimes of the tasks are increased to reclaim unreserved CPU time, and this increases the probability to miss deadlines. m-GRUB guarantees that all deadlines are respected only if some hard schedulability tests (more complex than the admission control policy used by SCHED_DEADLINE) are respected. This paper provides more details about such schedulability tests: https://hal.science/hal-01286130/document (see Section 4.2) I see that in another email you describe the taskset you are using... I'll try to have a look at it to check if the issue you are seeing is related to what I mention above, or if there is some other issue. Luca > > Concerning the taskset being schedulable, it is not that it does not > schedule at all. Remember, from hundreds of millions of test runs > over the course of a full week where we usually see absolutely zero > deadline misses (without reclaim), we see 43 million deadline misses > (with that one rogue process set to reclaim) on NUC and 600 thousand > on ROCK5B (which also has double the CPU cores). > > Please let me know if you need any further details which may help > figuring out what exactly is going on. > > > Thanks, > > Thank you! > > > Luca > > Cheers > > Marcel ^ permalink raw reply [flat|nested] 35+ messages in thread
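A brief note on how the reclaiming works, since it matters for the rest of the thread: per Documentation/scheduler/sched-deadline.rst (stated here from memory, roughly), a task that set SCHED_FLAG_RECLAIM has its remaining runtime q depleted while it runs according to

\[
  dq = -\max\left\{ \frac{U_i}{U_{max}},\; 1 - U_{inact} - U_{extra} \right\} dt
\]

where U_i is the task's reserved bandwidth (runtime/period), U_max is the maximum reclaimable utilization (bounded by the RT throttling limit), U_inact is the per-runqueue inactive utilization (this_bw - running_bw) and U_extra is the per-runqueue utilization that may not be reclaimed. The practical effect is that a single reclaiming task may consume up to roughly U_max of its CPU rather than only its own U_i, which is why these per-runqueue terms have to be computed correctly.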
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
  2025-05-02 13:55 ` Juri Lelli
  2025-05-02 14:10 ` luca abeni
@ 2025-05-03 11:14 ` Marcel Ziswiler
  2025-05-07 20:25 ` luca abeni
  1 sibling, 1 reply; 35+ messages in thread
From: Marcel Ziswiler @ 2025-05-03 11:14 UTC (permalink / raw)
  To: Juri Lelli
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai, Luca Abeni

Hi Juri

Thanks for getting back to me.

On Fri, 2025-05-02 at 15:55 +0200, Juri Lelli wrote:
> Hi Marcel,
>
> On 28/04/25 20:04, Marcel Ziswiler wrote:
> > Hi
> >
> > As part of our trustable work [1], we also run a lot of real time scheduler (SCHED_DEADLINE) tests on the
> > mainline Linux kernel. Overall, the Linux scheduler proves quite capable of scheduling deadline tasks down
> > to a granularity of 5ms on both of our test systems (amd64-based Intel NUCs and aarch64-based RADXA
> > ROCK5Bs). However, recently, we noticed a lot of deadline misses if we introduce overrunning jobs with
> > reclaim mode enabled (SCHED_FLAG_RECLAIM) using GRUB (Greedy Reclamation of Unused Bandwidth). E.g. from
> > hundreds of millions of test runs over the course of a full week where we usually see absolutely zero
> > deadline misses, we see 43 million deadline misses on NUC and 600 thousand on ROCK5B (which also has
> > double the CPU cores). This is with otherwise exactly the same test configuration, which adds exactly the
> > same two overrunning jobs to the job mix, but once without reclaim enabled and once with reclaim enabled.
> >
> > We are wondering whether there are any known limitations to GRUB or what exactly could be the issue.
> >
> > We are happy to provide more detailed debugging information but are looking for suggestions how/what
> > exactly to look at.
>
> Could you add details of the taskset you are working with? The number of
> tasks, their reservation parameters (runtime, period, deadline) and how
> much they are running (or trying to run) each time they wake up. Also
> which one is using GRUB and which one maybe is not.

We currently use three cores as follows:

#### core x

| sched_deadline = sched_period | sched_runtime | CP max run time (90% of sched_runtime) | utilisation | reclaim |
| -- | -- | -- | -- | -- |
| 5 ms  | 0.15 ms | 0.135 ms | 3.00%  | no |
| 10 ms | 1.8 ms  | 1.62 ms  | 18.00% | no |
| 10 ms | 2.1 ms  | 1.89 ms  | 21.00% | no |
| 14 ms | 2.3 ms  | 2.07 ms  | 16.43% | no |
| 50 ms | 8.0 ms  | 7.20 ms  | 16.00% | no |
| 10 ms | 0.5 ms  | **1      | 5.00%  | no |

Total utilisation of core x is 79.43% (less than 100%)

**1 - this is a rogue process. This process will
a) run for the maximum allowed workload value
b) not collect execution data

This last rogue process is the one which causes massive issues to the rest of the scheduling if we set it to
do reclaim.

#### core y

| sched_deadline = sched_period | sched_runtime | CP max run time (90% of sched_runtime) | utilisation | reclaim |
| -- | -- | -- | -- | -- |
| 5 ms  | 0.5 ms | 0.45 ms | 10.00% | no |
| 10 ms | 1.9 ms | 1.71 ms | 19.00% | no |
| 12 ms | 1.8 ms | 1.62 ms | 15.00% | no |
| 50 ms | 5.5 ms | 4.95 ms | 11.00% | no |
| 50 ms | 9.0 ms | 8.10 ms | 18.00% | no |

Total utilisation of core y is 73.00% (less than 100%)

#### core z

The third core is special as it will run 50 jobs with the same configuration:

| sched_deadline = sched_period | sched_runtime | CP max run time (90% of sched_runtime) | utilisation |
| -- | -- | -- | -- |
| 50 ms | 0.8 ms | 0.72 ms | 1.60% |

jobs 1-50 should run with reclaim OFF

Total utilisation of core z is 1.6% * 50 = 80.00% (less than 100%)

Please let me know if you need any further details which may help figuring out what exactly is going on.

> Adding Luca in Cc so he can also take a look.
>
> Thanks,

Thank you!

> Juri

Cheers

Marcel

^ permalink raw reply	[flat|nested] 35+ messages in thread
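As a cross-check of the figures above (an aside, not part of the original report): SCHED_DEADLINE admission control limits the total reserved bandwidth of a root domain to sched_rt_runtime_us / sched_rt_period_us of its CPU capacity, 950000 / 1000000 = 95% with default settings. With the single-CPU partitions used here, core x for example reserves

\[
  \sum_i \frac{Q_i}{P_i} = 0.03 + 0.18 + 0.21 + 0.1643 + 0.16 + 0.05 \approx 0.7943 \le 0.95
\]

so each per-core taskset sits comfortably inside the admission limit; the misses under discussion are not a plain over-subscription of reserved bandwidth.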
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-05-03 11:14 ` Marcel Ziswiler @ 2025-05-07 20:25 ` luca abeni 2025-05-19 13:32 ` Marcel Ziswiler 0 siblings, 1 reply; 35+ messages in thread From: luca abeni @ 2025-05-07 20:25 UTC (permalink / raw) To: Marcel Ziswiler Cc: Juri Lelli, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai Hi Marcel, just a quick question to better understand your setup (and check where the issue comes from): in the email below, you say that tasks are statically assigned to cores; how did you do this? Did you use isolated cpusets, or did you set the tasks affinities after disabling the SCHED_DEADLINE admission control (echo -1 > /proc/sys/kernel/sched_rt_runtime_us)? Or am I misunderstanding your setup? Also, are you using HRTICK_DL? Thanks, Luca On Sat, 03 May 2025 13:14:53 +0200 Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote: [...] > We currently use three cores as follows: > > #### core x > > |sched_deadline = sched_period | sched_runtime | CP max run time 90% > of sched_runtime | utilisation | reclaim | | -- | -- | -- | -- | -- | > | 5 ms | 0.15 ms | 0.135 ms | 3.00% | no | > | 10 ms | 1.8 ms | 1.62 ms | 18.00% | no | > | 10 ms | 2.1 ms | 1.89 ms | 21.00% | no | > | 14 ms | 2.3 ms | 2.07 ms | 16.43% | no | > | 50 ms | 8.0 ms | 7.20 ms | 16:00% | no | > | 10 ms | 0.5 ms | **1 | 5.00% | no | > > Total utilisation of core x is 79.43% (less than 100%) > > **1 - this shall be a rogue process. This process will > a) run for the maximum allowed workload value > b) do not collect execution data > > This last rogue process is the one which causes massive issues to the > rest of the scheduling if we set it to do reclaim. > > #### core y > > |sched_deadline = sched_period | sched_runtime | CP max run time 90% > of sched_runtime | utilisation | reclaim | | -- | -- | -- | -- | -- | > | 5 ms | 0.5 ms | 0.45 ms | 10.00% | no | > | 10 ms | 1.9 ms | 1.71 ms | 19.00% | no | > | 12 ms | 1.8 ms | 1.62 ms | 15.00% | no | > | 50 ms | 5.5 ms | 4.95 ms | 11.00% | no | > | 50 ms | 9.0 ms | 8.10 ms | 18.00% | no | > > Total utilisation of core y is 73.00% (less than 100%) > > #### core z > > The third core is special as it will run 50 jobs with the same > configuration as such: > > |sched_deadline = sched_period | sched_runtime | CP max run time 90% > of sched_runtime | utilisation | | -- | -- | -- | -- | > | 50 ms | 0.8 ms | 0.72 ms | 1.60% | > > jobs 1-50 should run with reclaim OFF > > Total utilisation of core y is 1.6 * 50 = 80.00% (less than 100%) > > Please let me know if you need any further details which may help > figuring out what exactly is going on. > > > Adding Luca in Cc so he can also take a look. > > > > Thanks, > > Thank you! > > > Juri > > Cheers > > Marcel ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-05-07 20:25 ` luca abeni @ 2025-05-19 13:32 ` Marcel Ziswiler 2025-05-20 16:09 ` luca abeni 2025-05-23 19:46 ` luca abeni 0 siblings, 2 replies; 35+ messages in thread From: Marcel Ziswiler @ 2025-05-19 13:32 UTC (permalink / raw) To: luca abeni Cc: Juri Lelli, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai Hi Luca Thanks and sorry, for my late reply. I was traveling the Cretan wilderness without access to any work related infrastructure. On Wed, 2025-05-07 at 22:25 +0200, luca abeni wrote: > Hi Marcel, > > just a quick question to better understand your setup (and check where > the issue comes from): > in the email below, you say that tasks are statically assigned to > cores; how did you do this? Did you use isolated cpusets, Yes, we use the cpuset controller from the cgroup-v2 APIs in the linux kernel in order to partition CPUs and memory nodes. In detail, we use the AllowedCPUs and AllowedMemoryNodes in systemd's slice configurations. > or did you > set the tasks affinities after disabling the SCHED_DEADLINE admission > control (echo -1 > /proc/sys/kernel/sched_rt_runtime_us)? No. > Or am I misunderstanding your setup? No, I don't think so. > Also, are you using HRTICK_DL? No, not that I am aware of and definitely not on ROCK5Bs while our amd64 configuration currently does not even enable SCHED_DEBUG. Not sure how to easily judge the specific HRTICK feature set in such case. > Thanks, > Luca Thank you very much! Cheers Marcel > On Sat, 03 May 2025 13:14:53 +0200 > Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote: > [...] > > We currently use three cores as follows: > > > > #### core x > > > > > sched_deadline = sched_period | sched_runtime | CP max run time 90% > > of sched_runtime | utilisation | reclaim | | -- | -- | -- | -- | -- | > > > 5 ms | 0.15 ms | 0.135 ms | 3.00% | no | > > > 10 ms | 1.8 ms | 1.62 ms | 18.00% | no | > > > 10 ms | 2.1 ms | 1.89 ms | 21.00% | no | > > > 14 ms | 2.3 ms | 2.07 ms | 16.43% | no | > > > 50 ms | 8.0 ms | 7.20 ms | 16:00% | no | > > > 10 ms | 0.5 ms | **1 | 5.00% | no | > > > > Total utilisation of core x is 79.43% (less than 100%) > > > > **1 - this shall be a rogue process. This process will > > a) run for the maximum allowed workload value > > b) do not collect execution data > > > > This last rogue process is the one which causes massive issues to the > > rest of the scheduling if we set it to do reclaim. > > > > #### core y > > > > > sched_deadline = sched_period | sched_runtime | CP max run time 90% > > of sched_runtime | utilisation | reclaim | | -- | -- | -- | -- | -- | > > > 5 ms | 0.5 ms | 0.45 ms | 10.00% | no | > > > 10 ms | 1.9 ms | 1.71 ms | 19.00% | no | > > > 12 ms | 1.8 ms | 1.62 ms | 15.00% | no | > > > 50 ms | 5.5 ms | 4.95 ms | 11.00% | no | > > > 50 ms | 9.0 ms | 8.10 ms | 18.00% | no | > > > > Total utilisation of core y is 73.00% (less than 100%) > > > > #### core z > > > > The third core is special as it will run 50 jobs with the same > > configuration as such: > > > > > sched_deadline = sched_period | sched_runtime | CP max run time 90% > > of sched_runtime | utilisation | | -- | -- | -- | -- | > > > 50 ms | 0.8 ms | 0.72 ms | 1.60% | > > > > jobs 1-50 should run with reclaim OFF > > > > Total utilisation of core y is 1.6 * 50 = 80.00% (less than 100%) > > > > Please let me know if you need any further details which may help > > figuring out what exactly is going on. 
> > > > > Adding Luca in Cc so he can also take a look. > > > > > > Thanks, > > > > Thank you! > > > > > Juri > > > > Cheers > > > > Marcel ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-05-19 13:32 ` Marcel Ziswiler @ 2025-05-20 16:09 ` luca abeni 2025-05-21 9:59 ` Marcel Ziswiler 2025-05-23 19:46 ` luca abeni 1 sibling, 1 reply; 35+ messages in thread From: luca abeni @ 2025-05-20 16:09 UTC (permalink / raw) To: Marcel Ziswiler Cc: Juri Lelli, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai Hi Marcel, On Mon, 19 May 2025 15:32:27 +0200 Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote: > Hi Luca > > Thanks and sorry, for my late reply. I was traveling the Cretan > wilderness without access to any work related infrastructure. > > On Wed, 2025-05-07 at 22:25 +0200, luca abeni wrote: > > Hi Marcel, > > > > just a quick question to better understand your setup (and check > > where the issue comes from): > > in the email below, you say that tasks are statically assigned to > > cores; how did you do this? Did you use isolated cpusets, > > Yes, we use the cpuset controller from the cgroup-v2 APIs in the > linux kernel in order to partition CPUs and memory nodes. In detail, > we use the AllowedCPUs and AllowedMemoryNodes in systemd's slice > configurations. OK, I never tried the v2 API, but if it allows creating a new root domain (which is an isolated cpuset, I think), then it should work without issues. So, since you are seeing unexpected deadline misses, there is a bug somewhere... I am going to check. In the meantime, enjoy the Cretan wilderness :) Luca ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-05-20 16:09 ` luca abeni @ 2025-05-21 9:59 ` Marcel Ziswiler 0 siblings, 0 replies; 35+ messages in thread From: Marcel Ziswiler @ 2025-05-21 9:59 UTC (permalink / raw) To: luca abeni Cc: Juri Lelli, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai Hi Luca On Tue, 2025-05-20 at 18:09 +0200, luca abeni wrote: > Hi Marcel, > > On Mon, 19 May 2025 15:32:27 +0200 > Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote: > > > Hi Luca > > > > Thanks and sorry, for my late reply. I was traveling the Cretan > > wilderness without access to any work related infrastructure. > > > > On Wed, 2025-05-07 at 22:25 +0200, luca abeni wrote: > > > Hi Marcel, > > > > > > just a quick question to better understand your setup (and check > > > where the issue comes from): > > > in the email below, you say that tasks are statically assigned to > > > cores; how did you do this? Did you use isolated cpusets, > > > > Yes, we use the cpuset controller from the cgroup-v2 APIs in the > > linux kernel in order to partition CPUs and memory nodes. In detail, > > we use the AllowedCPUs and AllowedMemoryNodes in systemd's slice > > configurations. > > OK, I never tried the v2 API, but if it allows creating a new root > domain (which is an isolated cpuset, I think), then it should work > without issues. Yes and it works fine with everything else :) > So, since you are seeing unexpected deadline misses, there is a bug > somewhere... I am going to check. Thanks you very much and let me know if you need any further information figuring out what might be going on. > In the meantime, enjoy the Cretan wilderness :) Thanks, I already made it back and, yes, I enjoyed it very much :) > Luca Cheers Marcel ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-05-19 13:32 ` Marcel Ziswiler 2025-05-20 16:09 ` luca abeni @ 2025-05-23 19:46 ` luca abeni 2025-05-25 19:29 ` Marcel Ziswiler 1 sibling, 1 reply; 35+ messages in thread From: luca abeni @ 2025-05-23 19:46 UTC (permalink / raw) To: Marcel Ziswiler Cc: Juri Lelli, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai Hi Marcel, sorry, but I have some additional questions to fully understand your setup... On Mon, 19 May 2025 15:32:27 +0200 Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote: [...] > > just a quick question to better understand your setup (and check > > where the issue comes from): > > in the email below, you say that tasks are statically assigned to > > cores; how did you do this? Did you use isolated cpusets, > > Yes, we use the cpuset controller from the cgroup-v2 APIs in the > linux kernel in order to partition CPUs and memory nodes. In detail, > we use the AllowedCPUs and AllowedMemoryNodes in systemd's slice > configurations. How do you configure systemd? I am having troubles in reproducing your AllowedCPUs configuration... This is an example of what I am trying: sudo systemctl set-property --runtime custom-workload.slice AllowedCPUs=1 sudo systemctl set-property --runtime init.scope AllowedCPUs=0,2,3 sudo systemctl set-property --runtime system.slice AllowedCPUs=0,2,3 sudo systemctl set-property --runtime user.slice AllowedCPUs=0,2,3 and then I try to run a SCHED_DEADLINE application with sudo systemd-run --scope -p Slice=custom-workload.slice <application> However, this does not work because systemd is not creating an isolated cpuset... So, the root domain still contains CPUs 0-3, and the "custom-workload.slice" cpuset only has CPU 1. Hence, the check /* * Don't allow tasks with an affinity mask smaller than * the entire root_domain to become SCHED_DEADLINE. We * will also fail if there's no bandwidth available. */ if (!cpumask_subset(span, p->cpus_ptr) || rq->rd->dl_bw.bw == 0) { retval = -EPERM; goto unlock; } in sched_setsched() fails. How are you configuring the cpusets? Also, which kernel version are you using? (sorry if you already posted this information in previous emails and I am missing something obvious) Thanks, Luca ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
  2025-05-23 19:46 ` luca abeni
@ 2025-05-25 19:29 ` Marcel Ziswiler
  2025-05-29  9:39 ` Juri Lelli
  2025-05-30  9:21 ` luca abeni
  0 siblings, 2 replies; 35+ messages in thread
From: Marcel Ziswiler @ 2025-05-25 19:29 UTC (permalink / raw)
  To: luca abeni
  Cc: Juri Lelli, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai

Hi Luca

On Fri, 2025-05-23 at 21:46 +0200, luca abeni wrote:
> Hi Marcel,
>
> sorry, but I have some additional questions to fully understand your
> setup...

No problem, I am happy to answer any questions :)

> On Mon, 19 May 2025 15:32:27 +0200
> Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote:
> [...]
> > > just a quick question to better understand your setup (and check
> > > where the issue comes from):
> > > in the email below, you say that tasks are statically assigned to
> > > cores; how did you do this? Did you use isolated cpusets,
> >
> > Yes, we use the cpuset controller from the cgroup-v2 APIs in the
> > linux kernel in order to partition CPUs and memory nodes. In detail,
> > we use the AllowedCPUs and AllowedMemoryNodes in systemd's slice
> > configurations.
>
> How do you configure systemd? I am having troubles in reproducing your
> AllowedCPUs configuration... This is an example of what I am trying:
> sudo systemctl set-property --runtime custom-workload.slice AllowedCPUs=1
> sudo systemctl set-property --runtime init.scope AllowedCPUs=0,2,3
> sudo systemctl set-property --runtime system.slice AllowedCPUs=0,2,3
> sudo systemctl set-property --runtime user.slice AllowedCPUs=0,2,3
> and then I try to run a SCHED_DEADLINE application with
> sudo systemd-run --scope -p Slice=custom-workload.slice <application>

We just use a bunch of systemd configuration files as follows:

[root@localhost ~]# cat /lib/systemd/system/monitor.slice
# Copyright (C) 2024 Codethink Limited
# SPDX-License-Identifier: GPL-2.0-only

[Unit]
Description=Prioritized slice for the safety monitor.
Before=slices.target

[Slice]
CPUWeight=1000
AllowedCPUs=0
MemoryAccounting=true
MemoryMin=10%
ManagedOOMPreference=omit

[Install]
WantedBy=slices.target

[root@localhost ~]# cat /lib/systemd/system/safety1.slice
# Copyright (C) 2024 Codethink Limited
# SPDX-License-Identifier: GPL-2.0-only

[Unit]
Description=Slice for Safety case processes.
Before=slices.target

[Slice]
CPUWeight=1000
AllowedCPUs=1
MemoryAccounting=true
MemoryMin=10%
ManagedOOMPreference=omit

[Install]
WantedBy=slices.target

[root@localhost ~]# cat /lib/systemd/system/safety2.slice
# Copyright (C) 2024 Codethink Limited
# SPDX-License-Identifier: GPL-2.0-only

[Unit]
Description=Slice for Safety case processes.
Before=slices.target

[Slice]
CPUWeight=1000
AllowedCPUs=2
MemoryAccounting=true
MemoryMin=10%
ManagedOOMPreference=omit

[Install]
WantedBy=slices.target

[root@localhost ~]# cat /lib/systemd/system/safety3.slice
# Copyright (C) 2024 Codethink Limited
# SPDX-License-Identifier: GPL-2.0-only

[Unit]
Description=Slice for Safety case processes.
Before=slices.target

[Slice]
CPUWeight=1000
AllowedCPUs=3
MemoryAccounting=true
MemoryMin=10%
ManagedOOMPreference=omit

[Install]
WantedBy=slices.target

[root@localhost ~]# cat /lib/systemd/system/system.slice
# Copyright (C) 2024 Codethink Limited
# SPDX-License-Identifier: GPL-2.0-only
#
# This slice will control all processes started by systemd by
# default.
#

[Unit]
Description=System Slice
Documentation=man:systemd.special(7)
Before=slices.target

[Slice]
CPUQuota=150%
AllowedCPUs=0
MemoryAccounting=true
MemoryMax=80%
ManagedOOMSwap=kill
ManagedOOMMemoryPressure=kill

[root@localhost ~]# cat /lib/systemd/system/user.slice
# Copyright (C) 2024 Codethink Limited
# SPDX-License-Identifier: GPL-2.0-only
#
# This slice will control all processes started by systemd-logind
#

[Unit]
Description=User and Session Slice
Documentation=man:systemd.special(7)
Before=slices.target

[Slice]
CPUQuota=25%
AllowedCPUs=0
MemoryAccounting=true
MemoryMax=80%
ManagedOOMSwap=kill
ManagedOOMMemoryPressure=kill

> However, this does not work because systemd is not creating an isolated
> cpuset... So, the root domain still contains CPUs 0-3, and the
> "custom-workload.slice" cpuset only has CPU 1. Hence, the check
> 	/*
> 	 * Don't allow tasks with an affinity mask smaller than
> 	 * the entire root_domain to become SCHED_DEADLINE. We
> 	 * will also fail if there's no bandwidth available.
> 	 */
> 	if (!cpumask_subset(span, p->cpus_ptr) ||
> 	    rq->rd->dl_bw.bw == 0) {
> 		retval = -EPERM;
> 		goto unlock;
> 	}
> in sched_setsched() fails.
>
> How are you configuring the cpusets?

See above.

> Also, which kernel version are you using?
> (sorry if you already posted this information in previous emails and I am
> missing something obvious)

Not even sure whether I explicitly mentioned that, other than that we are always running latest stable. Two
months ago, when we last ran some extensive tests on this, it was actually v6.13.6.

> Thanks,

Thank you!

> Luca

Cheers

Marcel

^ permalink raw reply	[flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-05-25 19:29 ` Marcel Ziswiler @ 2025-05-29 9:39 ` Juri Lelli 2025-06-02 14:59 ` Marcel Ziswiler 2025-05-30 9:21 ` luca abeni 1 sibling, 1 reply; 35+ messages in thread From: Juri Lelli @ 2025-05-29 9:39 UTC (permalink / raw) To: Marcel Ziswiler Cc: luca abeni, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai [-- Attachment #1: Type: text/plain, Size: 2080 bytes --] Hi Marcel, On 25/05/25 21:29, Marcel Ziswiler wrote: > Hi Luca > > On Fri, 2025-05-23 at 21:46 +0200, luca abeni wrote: > > Hi Marcel, > > > > sorry, but I have some additional questions to fully understand your > > setup... > > No Problem, I am happy to answer any questions :) > > > On Mon, 19 May 2025 15:32:27 +0200 > > Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote: > > [...] > > > > just a quick question to better understand your setup (and check > > > > where the issue comes from): > > > > in the email below, you say that tasks are statically assigned to > > > > cores; how did you do this? Did you use isolated cpusets, > > > > > > Yes, we use the cpuset controller from the cgroup-v2 APIs in the > > > linux kernel in order to partition CPUs and memory nodes. In detail, > > > we use the AllowedCPUs and AllowedMemoryNodes in systemd's slice > > > configurations. > > > > How do you configure systemd? I am having troubles in reproducing your > > AllowedCPUs configuration... This is an example of what I am trying: > > sudo systemctl set-property --runtime custom-workload.slice AllowedCPUs=1 > > sudo systemctl set-property --runtime init.scope AllowedCPUs=0,2,3 > > sudo systemctl set-property --runtime system.slice AllowedCPUs=0,2,3 > > sudo systemctl set-property --runtime user.slice AllowedCPUs=0,2,3 > > and then I try to run a SCHED_DEADLINE application with > > sudo systemd-run --scope -p Slice=custom-workload.slice <application> > > We just use a bunch of systemd configuration files as follows: > ... > > How are you configuring the cpusets? > > See above. > Could you please add 'debug sched_debug sched_verbose' to your kernel cmdline and share the complete dmesg before starting your tests? Also, I am attaching a script that should be able to retrieve cpuset information if you run it with # python3 get_cpuset_info.py > cpuset.out Could you please also do that and share the collected information? It should help us to better understand your setup and possibly reproduce the problem you are seeing. Thanks! 
Juri

[-- Attachment #2: get_cpuset_info.py --]
[-- Type: text/plain, Size: 1766 bytes --]

import os

def get_cpuset_info(cgroup_path):
    """Retrieves cpuset information for a given cgroup path."""
    info = {}
    files_to_check = [
        'cpuset.cpus',
        'cpuset.mems',
        'cpuset.cpus.effective',
        'cpuset.mems.effective',
        'cpuset.cpus.exclusive',
        'cpuset.cpus.exclusive.effective',
        'cpuset.cpus.partition'
    ]

    for filename in files_to_check:
        filepath = os.path.join(cgroup_path, filename)
        if os.path.exists(filepath) and os.access(filepath, os.R_OK):
            try:
                with open(filepath, 'r') as f:
                    info[filename] = f.read().strip()
            except Exception as e:
                info[filename] = f"Error reading: {e}"
        # else:
        #     info[filename] = "Not found or not readable"  # Uncomment if you want to explicitly show missing files
    return info

def main():
    cgroup_root = '/sys/fs/cgroup'

    print(f"Recursively retrieving cpuset information from {cgroup_root} (cgroup v2):\n")

    for dirpath, dirnames, filenames in os.walk(cgroup_root):
        # Skip the root cgroup directory itself if it's not a delegate
        # and only process subdirectories that might have cpuset info.
        # This is a heuristic; if you want to see info for the root too, remove this if.
        # if dirpath == cgroup_root:
        #     continue

        cpuset_info = get_cpuset_info(dirpath)
        if cpuset_info:  # Only print if we found some cpuset information
            print(f"Cgroup: {dirpath.replace(cgroup_root, '') or '/'}")
            for key, value in cpuset_info.items():
                print(f"  {key}: {value}")
            print("-" * 30)  # Separator for readability

if __name__ == "__main__":
    main()

^ permalink raw reply	[flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-05-29 9:39 ` Juri Lelli @ 2025-06-02 14:59 ` Marcel Ziswiler 2025-06-17 12:21 ` Juri Lelli 0 siblings, 1 reply; 35+ messages in thread From: Marcel Ziswiler @ 2025-06-02 14:59 UTC (permalink / raw) To: Juri Lelli Cc: luca abeni, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai Hi Juri On Thu, 2025-05-29 at 11:39 +0200, Juri Lelli wrote: > Hi Marcel, > > On 25/05/25 21:29, Marcel Ziswiler wrote: > > Hi Luca > > > > On Fri, 2025-05-23 at 21:46 +0200, luca abeni wrote: > > > Hi Marcel, > > > > > > sorry, but I have some additional questions to fully understand your > > > setup... > > > > No Problem, I am happy to answer any questions :) > > > > > On Mon, 19 May 2025 15:32:27 +0200 > > > Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote: > > > [...] > > > > > just a quick question to better understand your setup (and check > > > > > where the issue comes from): > > > > > in the email below, you say that tasks are statically assigned to > > > > > cores; how did you do this? Did you use isolated cpusets, > > > > > > > > Yes, we use the cpuset controller from the cgroup-v2 APIs in the > > > > linux kernel in order to partition CPUs and memory nodes. In detail, > > > > we use the AllowedCPUs and AllowedMemoryNodes in systemd's slice > > > > configurations. > > > > > > How do you configure systemd? I am having troubles in reproducing your > > > AllowedCPUs configuration... This is an example of what I am trying: > > > sudo systemctl set-property --runtime custom-workload.slice AllowedCPUs=1 > > > sudo systemctl set-property --runtime init.scope AllowedCPUs=0,2,3 > > > sudo systemctl set-property --runtime system.slice AllowedCPUs=0,2,3 > > > sudo systemctl set-property --runtime user.slice AllowedCPUs=0,2,3 > > > and then I try to run a SCHED_DEADLINE application with > > > sudo systemd-run --scope -p Slice=custom-workload.slice <application> > > > > We just use a bunch of systemd configuration files as follows: > > > > ... > > > > How are you configuring the cpusets? > > > > See above. > > > > Could you please add 'debug sched_debug sched_verbose' to your kernel > cmdline and share the complete dmesg before starting your tests? Sure, here you go [1]. > Also, I am attaching a script that should be able to retrieve cpuset > information if you run it with > > # python3 get_cpuset_info.py > cpuset.out > > Could you please also do that and share the collected information? 
[root@localhost ~]# python3 get_cpuset_info.py > cpuset.out [root@localhost ~]# cat cpuset.out Recursively retrieving cpuset information from /sys/fs/cgroup (cgroup v2): Cgroup: / cpuset.cpus.effective: 0 cpuset.mems.effective: 0 ------------------------------ Cgroup: /safety3.slice cpuset.cpus: 3 cpuset.mems: cpuset.cpus.effective: 3 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: 3 cpuset.cpus.partition: root ------------------------------ Cgroup: /sys-fs-fuse-connections.mount cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /sys-kernel-debug.mount cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /dev-mqueue.mount cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /user.slice cpuset.cpus: 0 cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /monitor.slice cpuset.cpus: 0 cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /safety1.slice cpuset.cpus: 1 cpuset.mems: cpuset.cpus.effective: 1 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: 1 cpuset.cpus.partition: root ------------------------------ Cgroup: /sys-kernel-tracing.mount cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /init.scope cpuset.cpus: 0 cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice cpuset.cpus: 0 cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/systemd-networkd.service cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/systemd-udevd.service cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/system-serial\x2dgetty.slice cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/boot.mount cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/var-lib-containers.mount cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 
cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/auditd.service cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/system-modprobe.slice cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/systemd-journald.service cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/systemd-nsresourced.service cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/sshd.service cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/var-tmp.mount cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/test-audio.service cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/tmp.mount cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/systemd-userdbd.service cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/test-speaker.service cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/systemd-oomd.service cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/systemd-resolved.service cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/dbus.service cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/systemd-timesyncd.service cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/system-getty.slice cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: 
cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/systemd-logind.service cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /system.slice/system-disk\x2dstat\x2dmonitoring.slice cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ Cgroup: /safety2.slice cpuset.cpus: 2 cpuset.mems: cpuset.cpus.effective: 2 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: 2 cpuset.cpus.partition: root ------------------------------ Cgroup: /dev-hugepages.mount cpuset.cpus: cpuset.mems: cpuset.cpus.effective: 0 cpuset.mems.effective: 0 cpuset.cpus.exclusive: cpuset.cpus.exclusive.effective: cpuset.cpus.partition: member ------------------------------ > It should help us to better understand your setup and possibly reproduce > the problem you are seeing. Sure, I am happy to help. > Thanks! Thanks you! > Juri [1] https://pastebin.com/khFApYgf Cheers Marcel ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-06-02 14:59 ` Marcel Ziswiler @ 2025-06-17 12:21 ` Juri Lelli 2025-06-18 11:24 ` Marcel Ziswiler 0 siblings, 1 reply; 35+ messages in thread From: Juri Lelli @ 2025-06-17 12:21 UTC (permalink / raw) To: Marcel Ziswiler Cc: luca abeni, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai On 02/06/25 16:59, Marcel Ziswiler wrote: > Hi Juri > > On Thu, 2025-05-29 at 11:39 +0200, Juri Lelli wrote: ... > > It should help us to better understand your setup and possibly reproduce > > the problem you are seeing. OK, it definitely took a while (apologies), but I think I managed to reproduce the issue you are seeing. I added SCHED_FLAG_RECLAIM support to rt-app [1], so it's easier for me to play with the taskset and got to the following two situations when running your coreX taskset on CPU1 of my system (since the issue is already reproducible, I think it's OK to ignore the other tasksets as they are running isolated on different CPUs anyway). This is your coreX taskset, in which the last task is the bad behaving one that will run without/with RECLAIM in the test. |sched_deadline = sched_period | sched_runtime | CP max run time 90% of sched_runtime | utilisation | reclaim | | -- | -- | -- | -- | -- | | 5 ms | 0.15 ms | 0.135 ms | 3.00% | no | | 10 ms | 1.8 ms | 1.62 ms | 18.00% | no | | 10 ms | 2.1 ms | 1.89 ms | 21.00% | no | | 14 ms | 2.3 ms | 2.07 ms | 16.43% | no | | 50 ms | 8.0 ms | 7.20 ms | 16:00% | no | | 10 ms | 0.5 ms | **1 | 5.00% | no | Without reclaim everything looks good (apart from the 1st tasks that I think suffers a bit from the granularity/precision of rt-app runtime loop): https://github.com/jlelli/misc/blob/main/deadline-no-reclaim.png Order is the same as above, last tasks gets constantly throttled and makes no harm to the rest. With reclaim (only last misbehaving task) we indeed seem to have a problem: https://github.com/jlelli/misc/blob/main/deadline-reclaim.png Essentially all other tasks are experiencing long wakeup delays that cause deadline misses. The bad behaving task seems to be able to almost monopolize the CPU. Interesting to notice that, even if I left max available bandwidth to 95%, the CPU is busy at 100%. So, yeah, Luca, I think we have a problem. :-) Will try to find more time soon and keep looking into this. Thanks, Juri 1 - https://github.com/jlelli/rt-app/tree/reclaim ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-06-17 12:21 ` Juri Lelli @ 2025-06-18 11:24 ` Marcel Ziswiler 2025-06-20 9:29 ` Juri Lelli 0 siblings, 1 reply; 35+ messages in thread From: Marcel Ziswiler @ 2025-06-18 11:24 UTC (permalink / raw) To: Juri Lelli Cc: luca abeni, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai Hi Juri On Tue, 2025-06-17 at 14:21 +0200, Juri Lelli wrote: > On 02/06/25 16:59, Marcel Ziswiler wrote: > > Hi Juri > > > > On Thu, 2025-05-29 at 11:39 +0200, Juri Lelli wrote: > > ... > > > > It should help us to better understand your setup and possibly reproduce > > > the problem you are seeing. > > OK, it definitely took a while (apologies), but I think I managed to > reproduce the issue you are seeing. No need to apologies, I know how hard it can be trying to bring up random stuff in the Linux world : ) > I added SCHED_FLAG_RECLAIM support to rt-app [1], so it's easier for me > to play with the taskset and got to the following two situations when > running your coreX taskset on CPU1 of my system (since the issue is > already reproducible, I think it's OK to ignore the other tasksets as > they are running isolated on different CPUs anyway). > > This is your coreX taskset, in which the last task is the bad behaving one that > will run without/with RECLAIM in the test. > > > sched_deadline = sched_period | sched_runtime | CP max run time 90% of sched_runtime | utilisation | > > reclaim | > > -- | -- | -- | -- | -- | > > 5 ms | 0.15 ms | 0.135 ms | 3.00% | no | > > 10 ms | 1.8 ms | 1.62 ms | 18.00% | no | > > 10 ms | 2.1 ms | 1.89 ms | 21.00% | no | > > 14 ms | 2.3 ms | 2.07 ms | 16.43% | no | > > 50 ms | 8.0 ms | 7.20 ms | 16:00% | no | > > 10 ms | 0.5 ms | **1 | 5.00% | no | > > Without reclaim everything looks good (apart from the 1st tasks that I > think suffers a bit from the granularity/precision of rt-app runtime > loop): > > https://github.com/jlelli/misc/blob/main/deadline-no-reclaim.png Yeah, granularity/precision is definitely a concern. We initially even started off with 1 ms sched_deadline = sched_period for task 1 but neither of our test systems (amd64-based Intel NUCs and aarch64-based RADXA ROCK5Bs) was able to handle that very well. So we opted to increase it to 5 ms which is still rather stressful. > Order is the same as above, last tasks gets constantly throttled and > makes no harm to the rest. > > With reclaim (only last misbehaving task) we indeed seem to have a problem: > > https://github.com/jlelli/misc/blob/main/deadline-reclaim.png > > Essentially all other tasks are experiencing long wakeup delays that > cause deadline misses. The bad behaving task seems to be able to almost > monopolize the CPU. Interesting to notice that, even if I left max > available bandwidth to 95%, the CPU is busy at 100%. Yeah, pretty much completely overloaded. > So, yeah, Luca, I think we have a problem. :-) > > Will try to find more time soon and keep looking into this. Thank you very much and just let me know if I can help in any way. > Thanks, > Juri > > 1 - https://github.com/jlelli/rt-app/tree/reclaim BTW: I will be talking at the OSS NA/ELC next week in Denver should any of you folks be around. Cheers Marcel ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-06-18 11:24 ` Marcel Ziswiler @ 2025-06-20 9:29 ` Juri Lelli 2025-06-20 9:37 ` luca abeni 0 siblings, 1 reply; 35+ messages in thread From: Juri Lelli @ 2025-06-20 9:29 UTC (permalink / raw) To: Marcel Ziswiler Cc: luca abeni, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai On 18/06/25 12:24, Marcel Ziswiler wrote: ... > Yeah, granularity/precision is definitely a concern. We initially even started off with 1 ms sched_deadline = > sched_period for task 1 but neither of our test systems (amd64-based Intel NUCs and aarch64-based RADXA > ROCK5Bs) was able to handle that very well. So we opted to increase it to 5 ms which is still rather stressful. Ah, OK, even though I meant granularity of the 'fake' runtime of the tasks. In rt-app we simulate it by essentially reading the clock until that much runtime elapsed (or performing floating point operations) and in some cases is not super tight. For runtime enforcement (dl_runtime) and/or period/deadline (dl_{period, deadline}), did you try enabling HRTICK_DL sched feature? It is kind of required for parameters under 1ms if one wants precise behavior. > > Order is the same as above, last tasks gets constantly throttled and > > makes no harm to the rest. > > > > With reclaim (only last misbehaving task) we indeed seem to have a problem: > > > > https://github.com/jlelli/misc/blob/main/deadline-reclaim.png > > > > Essentially all other tasks are experiencing long wakeup delays that > > cause deadline misses. The bad behaving task seems to be able to almost > > monopolize the CPU. Interesting to notice that, even if I left max > > available bandwidth to 95%, the CPU is busy at 100%. > > Yeah, pretty much completely overloaded. > > > So, yeah, Luca, I think we have a problem. :-) > > > > Will try to find more time soon and keep looking into this. > > Thank you very much and just let me know if I can help in any way. I have been playing a little more with this and noticed (by chance) that after writing a value on sched_rt_runtime_us (even the 950000 default) this seem to 'work' - I don't see deadline misses anymore. I thus have moved my attention to GRUB related per-cpu variables [1] and noticed something that looks fishy with extra_bw: after boot and w/o any DEADLINE tasks around (other than dl_servers) all dl_rqs have different values [2]. E.g., extra_bw : (u64)447170 extra_bw : (u64)604454 extra_bw : (u64)656882 extra_bw : (u64)691834 extra_bw : (u64)718048 extra_bw : (u64)739018 extra_bw : (u64)756494 extra_bw : (u64)771472 extra_bw : (u64)784578 extra_bw : (u64)796228 ... When we write a value to sched_rt_runtime_us only extra_bw of the first cpu of a root_domain gets updated. So, this might be the reason why things seem to improve with single CPU domains like in the situation at hand, but still probably broken in general. I think the issue here is that we end up calling init_dl_rq_bw_ratio() only for the first cpu after the introduction of dl_bw_visited() functionality. So, this might be one thing to look at, but I am honestly still confused by why we have weird numbers as the above after boot. Also a bit confused by the actual meaning and purpose of the 5 GRUB variables we have to deal with. Luca, Vineeth (for the recent introduction of max_bw), maybe we could take a step back and re-check (and maybe and document better :) what each variable is meant to do and how it gets updated? Thanks! 
Juri

1 - Starts at https://elixir.bootlin.com/linux/v6.16-rc2/source/kernel/sched/sched.h#L866
2 - The drgn script I am using

---
#!/usr/bin/env drgn

desc = """
This is a drgn script to show the current root domains configuration.

For more info on drgn, visit https://github.com/osandov/drgn.
"""

import os
import argparse

import drgn
from drgn import FaultError, NULL, Object, alignof, cast, container_of, execscript, implicit_convert, offsetof, reinterpret, sizeof, stack_trace
from drgn.helpers.common import *
from drgn.helpers.linux import *


def print_dl_bws_info():
    print("Retrieving dl_rq Information:")

    runqueues = prog['runqueues']

    for cpu_id in for_each_possible_cpu(prog):
        try:
            rq = per_cpu(runqueues, cpu_id)
            dl_rq = rq.dl

            print(f"  From CPU: {cpu_id}")
            print(f"    running_bw : {dl_rq.running_bw}")
            print(f"    this_bw    : {dl_rq.this_bw}")
            print(f"    extra_bw   : {dl_rq.extra_bw}")
            print(f"    max_bw     : {dl_rq.max_bw}")
            print(f"    bw_ratio   : {dl_rq.bw_ratio}")
        except drgn.FaultError as fe:
            print(f"  (CPU {cpu_id}: Fault accessing kernel memory: {fe})")
        except AttributeError as ae:
            print(f"  (CPU {cpu_id}: Missing attribute for dl_rq (kernel struct change?): {ae})")
        except Exception as e:
            print(f"  (CPU {cpu_id}: An unexpected error occurred: {e})")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=desc,
                                     formatter_class=argparse.RawTextHelpFormatter)
    args = parser.parse_args()

    print_dl_bws_info()
---

^ permalink raw reply	[flat|nested] 35+ messages in thread
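As an aside, the raw numbers printed by the script are easier to read once you know that
SCHED_DEADLINE stores bandwidths as fixed-point fractions of one CPU scaled by 2^20 (BW_SHIFT),
so the default 95% limit shows up as roughly 996147. The small user-space program below only
illustrates that encoding; it is not kernel code, and to_ratio() here just mimics the kernel
helper of the same name.

/*
 * Illustration of the fixed-point bandwidth encoding used by
 * SCHED_DEADLINE (fractions of one CPU scaled by 2^20). Not kernel code.
 */
#include <stdint.h>
#include <stdio.h>

#define BW_SHIFT 20
#define BW_UNIT  (1ULL << BW_SHIFT)

static uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
    if (period == 0)
        return 0;

    return (runtime << BW_SHIFT) / period;
}

int main(void)
{
    /* default sched_rt_runtime_us / sched_rt_period_us: 950000 / 1000000 */
    uint64_t max_bw = to_ratio(1000000, 950000);
    /* the misbehaving task above: 0.5 ms runtime every 10 ms */
    uint64_t task_bw = to_ratio(10000, 500);

    printf("max_bw  (95%% cap)    = %llu (%.3f of a CPU)\n",
           (unsigned long long)max_bw, (double)max_bw / BW_UNIT);
    printf("task_bw (0.5ms/10ms) = %llu (%.3f of a CPU)\n",
           (unsigned long long)task_bw, (double)task_bw / BW_UNIT);
    return 0;
}

Against that scale, per-CPU extra_bw values such as 447170 or 604454 correspond to roughly 0.43
and 0.58 of a CPU, which is what makes them look odd right after boot with no DEADLINE tasks
around other than the dl_servers.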
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-06-20 9:29 ` Juri Lelli @ 2025-06-20 9:37 ` luca abeni 2025-06-20 9:58 ` Juri Lelli 2025-06-20 14:16 ` luca abeni 0 siblings, 2 replies; 35+ messages in thread From: luca abeni @ 2025-06-20 9:37 UTC (permalink / raw) To: Juri Lelli Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai Hi Juri, On Fri, 20 Jun 2025 11:29:52 +0200 Juri Lelli <juri.lelli@redhat.com> wrote: [...] > I have been playing a little more with this and noticed (by chance) > that after writing a value on sched_rt_runtime_us (even the 950000 > default) this seem to 'work' - I don't see deadline misses anymore. > > I thus have moved my attention to GRUB related per-cpu variables [1] > and noticed something that looks fishy with extra_bw: after boot and > w/o any DEADLINE tasks around (other than dl_servers) all dl_rqs have > different values [2]. E.g., > > extra_bw : (u64)447170 > extra_bw : (u64)604454 [...] > So, this might be one thing to look at, but I am honestly still > confused by why we have weird numbers as the above after boot. Also a > bit confused by the actual meaning and purpose of the 5 GRUB > variables we have to deal with. Sorry about that... I was under the impression they were documented in some comments, but I might be wrong... > Luca, Vineeth (for the recent introduction of max_bw), maybe we could > take a step back and re-check (and maybe and document better :) what > each variable is meant to do and how it gets updated? I am not sure about the funny values initially assigned to these variables, but I can surely provide some documentation about what these variables represent... I am going to look at this and I'll send some comments or patches. Luca ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-06-20 9:37 ` luca abeni @ 2025-06-20 9:58 ` Juri Lelli 2025-06-20 14:16 ` luca abeni 1 sibling, 0 replies; 35+ messages in thread From: Juri Lelli @ 2025-06-20 9:58 UTC (permalink / raw) To: luca abeni Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai On 20/06/25 11:37, luca abeni wrote: > Hi Juri, > > On Fri, 20 Jun 2025 11:29:52 +0200 > Juri Lelli <juri.lelli@redhat.com> wrote: > [...] > > I have been playing a little more with this and noticed (by chance) > > that after writing a value on sched_rt_runtime_us (even the 950000 > > default) this seem to 'work' - I don't see deadline misses anymore. > > > > I thus have moved my attention to GRUB related per-cpu variables [1] > > and noticed something that looks fishy with extra_bw: after boot and > > w/o any DEADLINE tasks around (other than dl_servers) all dl_rqs have > > different values [2]. E.g., > > > > extra_bw : (u64)447170 > > extra_bw : (u64)604454 > [...] > > So, this might be one thing to look at, but I am honestly still > > confused by why we have weird numbers as the above after boot. Also a > > bit confused by the actual meaning and purpose of the 5 GRUB > > variables we have to deal with. > > Sorry about that... I was under the impression they were documented in > some comments, but I might be wrong... No worries! I am also culpable, as I did test and review the patches. :) extra_bw in particular I believe can benefit from a bit of attention. > > Luca, Vineeth (for the recent introduction of max_bw), maybe we could > > take a step back and re-check (and maybe and document better :) what > > each variable is meant to do and how it gets updated? > > I am not sure about the funny values initially assigned to these > variables, but I can surely provide some documentation about what these > variables represent... I am going to look at this and I'll send some > comments or patches. Thanks a lot! I am also continuing to dig. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
  2025-06-20  9:37                     ` luca abeni
  2025-06-20  9:58                       ` Juri Lelli
@ 2025-06-20 14:16                       ` luca abeni
  2025-06-20 15:28                         ` Juri Lelli
  1 sibling, 1 reply; 35+ messages in thread
From: luca abeni @ 2025-06-20 14:16 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai

On Fri, 20 Jun 2025 11:37:45 +0200
luca abeni <luca.abeni@santannapisa.it> wrote:
[...]
> > Luca, Vineeth (for the recent introduction of max_bw), maybe we
> > could take a step back and re-check (and maybe and document better
> > :) what each variable is meant to do and how it gets updated?
>
> I am not sure about the funny values initially assigned to these
> variables, but I can surely provide some documentation about what
> these variables represent... I am going to look at this and I'll send
> some comments or patches.

So, I had a look trying to remember the situation... This is my
current understanding:

- the max_bw field should be just the maximum amount of CPU bandwidth we
  want to use with reclaiming... It is rt_runtime_us / rt_period_us; I
  guess it is cached in this field just to avoid computing it every time.
  So, max_bw should be updated only when
  /proc/sys/kernel/sched_rt_{runtime,period}_us are written

- the extra_bw field represents an additional amount of CPU bandwidth
  we can reclaim on each core (the original m-GRUB algorithm just
  reclaimed Uinact, the utilization of inactive tasks).
  It is initialized to Umax when no SCHED_DEADLINE tasks exist and
  should be decreased by Ui when a task with utilization Ui becomes
  SCHED_DEADLINE (and increased by Ui when the SCHED_DEADLINE task
  terminates or changes scheduling policy). Since this value is
  per-core, Ui is divided by the number of cores in the root domain...

From what you write, I guess extra_bw is not correctly
initialized/updated when a new root domain is created?

All this information is probably not properly documented... Should I
improve the description in Documentation/scheduler/sched-deadline.rst
or do you prefer some comments in kernel/sched/deadline.c? (or .h?)

Luca

>
>
> Luca

^ permalink raw reply	[flat|nested] 35+ messages in thread
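To make the interplay between these fields a bit more concrete, here is a deliberately simplified
user-space sketch of GRUB-style accounting as described above: the runtime charged to a reclaiming
task shrinks when other reservations are inactive or when extra_bw is available, and is capped so
that total use never exceeds max_bw. Names, the fixed-point scaling and the exact clamping are
assumptions for illustration; the kernel's grub_reclaim() differs in details (for instance, it uses
a separate shift for bw_ratio).

/*
 * Simplified sketch of GRUB-style runtime depreciation, following the
 * description above. All bandwidths are fractions of one CPU scaled by
 * 2^20. NOT the kernel implementation: the real code uses different
 * fixed-point shifts, locking and per-entity bookkeeping.
 */
#include <stdint.h>
#include <stdio.h>

#define BW_SHIFT 20
#define BW_UNIT  (1ULL << BW_SHIFT)

struct dl_rq_sketch {
    uint64_t running_bw;   /* sum of bandwidths of active tasks */
    uint64_t this_bw;      /* sum of bandwidths of all tasks on this rq */
    uint64_t extra_bw;     /* extra reclaimable bandwidth on this CPU */
    uint64_t max_bw;       /* cap from sched_rt_{runtime,period}_us */
    uint64_t inv_max_bw;   /* 1/Umax, scaled by 2^20 (kernel: bw_ratio) */
};

/* Runtime to charge for 'delta' ns executed by a task of bandwidth 'u'. */
static uint64_t grub_charge(uint64_t delta, const struct dl_rq_sketch *dl,
                            uint64_t u)
{
    uint64_t u_inact = dl->this_bw - dl->running_bw;
    uint64_t u_act;

    if (u_inact + dl->extra_bw > dl->max_bw - u)
        u_act = u;                              /* nothing left to reclaim */
    else
        u_act = dl->max_bw - u_inact - dl->extra_bw;

    u_act = (u_act * dl->inv_max_bw) >> BW_SHIFT;   /* rescale by 1/Umax */
    return (delta * u_act) >> BW_SHIFT;
}

int main(void)
{
    /* A single 5% reclaiming task, correctly initialized extra_bw: */
    struct dl_rq_sketch dl = {
        .running_bw = 52428,                         /* ~5%, currently active */
        .this_bw    = 52428,
        .extra_bw   = 996147 - 52428,                /* Umax - Ui, as described above */
        .max_bw     = 996147,                        /* ~95% */
        .inv_max_bw = (BW_UNIT * BW_UNIT) / 996147,  /* ~1/0.95 */
    };

    /* Charging ~1 ms of real execution: prints roughly 52630 ns, i.e. the
     * lone 5% task is allowed to run for about 95% of its period. */
    printf("charged: %llu ns\n",
           (unsigned long long)grub_charge(1000000, &dl, 52428));
    return 0;
}

With a stale or uninitialized extra_bw the same computation charges far less per executed
nanosecond, which is one plausible way a reclaiming task could appear to monopolize the CPU as in
the plots discussed earlier.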
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-06-20 14:16 ` luca abeni @ 2025-06-20 15:28 ` Juri Lelli 2025-06-20 16:52 ` luca abeni 0 siblings, 1 reply; 35+ messages in thread From: Juri Lelli @ 2025-06-20 15:28 UTC (permalink / raw) To: luca abeni Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai On 20/06/25 16:16, luca abeni wrote: > On Fri, 20 Jun 2025 11:37:45 +0200 > luca abeni <luca.abeni@santannapisa.it> wrote: > [...] > > > Luca, Vineeth (for the recent introduction of max_bw), maybe we > > > could take a step back and re-check (and maybe and document better > > > :) what each variable is meant to do and how it gets updated? > > > > I am not sure about the funny values initially assigned to these > > variables, but I can surely provide some documentation about what > > these variables represent... I am going to look at this and I'll send > > some comments or patches. > > So, I had a look tying to to remember the situation... This is my > current understanding: > - the max_bw field should be just the maximum amount of CPU bandwidth we > want to use with reclaiming... It is rt_runtime_us / rt_period_us; I > guess it is cached in this field just to avoid computing it every > time. > So, max_bw should be updated only when > /proc/sys/kernel/sched_rt_{runtime,period}_us are written > - the extra_bw field represents an additional amount of CPU bandwidth > we can reclaim on each core (the original m-GRUB algorithm just > reclaimed Uinact, the utilization of inactive tasks). > It is initialized to Umax when no SCHED_DEADLINE tasks exist and Is Umax == max_bw from above? > should be decreased by Ui when a task with utilization Ui becomes > SCHED_DEADLINE (and increased by Ui when the SCHED_DEADLINE task > terminates or changes scheduling policy). Since this value is > per_core, Ui is divided by the number of cores in the root domain... > From what you write, I guess extra_bw is not correctly > initialized/updated when a new root domain is created? It looks like so yeah. After boot and when domains are dinamically created. But, I am still not 100%, I only see weird numbers that I struggle to relate with what you say above. :) > All this information is probably not properly documented... Should I > improve the description in Documentation/scheduler/sched-deadline.rst > or do you prefer some comments in kernel/sched/deadline.c? (or .h?) I think ideally both. sched-deadline.rst should probably contain the whole picture with more information and .c/.h the condendensed version. Thanks! Juri ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-06-20 15:28 ` Juri Lelli @ 2025-06-20 16:52 ` luca abeni 2025-06-24 7:49 ` Juri Lelli 2025-06-24 13:36 ` luca abeni 0 siblings, 2 replies; 35+ messages in thread From: luca abeni @ 2025-06-20 16:52 UTC (permalink / raw) To: Juri Lelli Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai On Fri, 20 Jun 2025 17:28:28 +0200 Juri Lelli <juri.lelli@redhat.com> wrote: > On 20/06/25 16:16, luca abeni wrote: [...] > > So, I had a look tying to to remember the situation... This is my > > current understanding: > > - the max_bw field should be just the maximum amount of CPU > > bandwidth we want to use with reclaiming... It is rt_runtime_us / > > rt_period_us; I guess it is cached in this field just to avoid > > computing it every time. > > So, max_bw should be updated only when > > /proc/sys/kernel/sched_rt_{runtime,period}_us are written > > - the extra_bw field represents an additional amount of CPU > > bandwidth we can reclaim on each core (the original m-GRUB > > algorithm just reclaimed Uinact, the utilization of inactive tasks). > > It is initialized to Umax when no SCHED_DEADLINE tasks exist and > > Is Umax == max_bw from above? Yes; sorry about the confusion > > should be decreased by Ui when a task with utilization Ui becomes > > SCHED_DEADLINE (and increased by Ui when the SCHED_DEADLINE task > > terminates or changes scheduling policy). Since this value is > > per_core, Ui is divided by the number of cores in the root > > domain... From what you write, I guess extra_bw is not correctly > > initialized/updated when a new root domain is created? > > It looks like so yeah. After boot and when domains are dinamically > created. But, I am still not 100%, I only see weird numbers that I > struggle to relate with what you say above. :) BTW, when running some tests on different machines I think I found out that 6.11 does not exhibit this issue (this needs to be confirmed, I am working on reproducing the test with different kernels on the same machine) If I manage to reproduce this result, I think I can run a bisect to the commit introducing the issue (git is telling me that I'll need about 15 tests :) So, stay tuned... > > All this information is probably not properly documented... Should I > > improve the description in > > Documentation/scheduler/sched-deadline.rst or do you prefer some > > comments in kernel/sched/deadline.c? (or .h?) > > I think ideally both. sched-deadline.rst should probably contain the > whole picture with more information and .c/.h the condendensed > version. OK, I'll try to do this in next week Thanks, Luca ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-06-20 16:52 ` luca abeni @ 2025-06-24 7:49 ` Juri Lelli 2025-06-24 12:59 ` Juri Lelli 2025-06-24 13:36 ` luca abeni 1 sibling, 1 reply; 35+ messages in thread From: Juri Lelli @ 2025-06-24 7:49 UTC (permalink / raw) To: luca abeni Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai On 20/06/25 18:52, luca abeni wrote: > On Fri, 20 Jun 2025 17:28:28 +0200 > Juri Lelli <juri.lelli@redhat.com> wrote: > > > On 20/06/25 16:16, luca abeni wrote: > [...] > > > So, I had a look tying to to remember the situation... This is my > > > current understanding: > > > - the max_bw field should be just the maximum amount of CPU > > > bandwidth we want to use with reclaiming... It is rt_runtime_us / > > > rt_period_us; I guess it is cached in this field just to avoid > > > computing it every time. > > > So, max_bw should be updated only when > > > /proc/sys/kernel/sched_rt_{runtime,period}_us are written > > > - the extra_bw field represents an additional amount of CPU > > > bandwidth we can reclaim on each core (the original m-GRUB > > > algorithm just reclaimed Uinact, the utilization of inactive tasks). > > > It is initialized to Umax when no SCHED_DEADLINE tasks exist and > > > > Is Umax == max_bw from above? > > Yes; sorry about the confusion > > > > > should be decreased by Ui when a task with utilization Ui becomes > > > SCHED_DEADLINE (and increased by Ui when the SCHED_DEADLINE task > > > terminates or changes scheduling policy). Since this value is > > > per_core, Ui is divided by the number of cores in the root > > > domain... From what you write, I guess extra_bw is not correctly > > > initialized/updated when a new root domain is created? > > > > It looks like so yeah. After boot and when domains are dinamically > > created. But, I am still not 100%, I only see weird numbers that I > > struggle to relate with what you say above. :) > > BTW, when running some tests on different machines I think I found out > that 6.11 does not exhibit this issue (this needs to be confirmed, I am > working on reproducing the test with different kernels on the same > machine) > > If I manage to reproduce this result, I think I can run a bisect to the > commit introducing the issue (git is telling me that I'll need about 15 > tests :) > So, stay tuned... The following seem to at least cure the problem after boot. Things are still broken after cpusets creation. Moving to look into that, but wanted to share where I am so that we don't duplicate work. Rationale for the below is that we currently end up calling __dl_update() with 'cpus' that are not stable yet. So, I tried to move initialization after SMP is up (all CPUs have been onlined). 
---
 kernel/sched/core.c     |  3 +++
 kernel/sched/deadline.c | 39 +++++++++++++++++++++++----------------
 kernel/sched/sched.h    |  1 +
 3 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8988d38d46a38..d152f8a84818b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8470,6 +8470,8 @@ void __init sched_init_smp(void)
         init_sched_rt_class();
         init_sched_dl_class();
 
+        sched_init_dl_servers();
+
         sched_smp_initialized = true;
 }
 
@@ -8484,6 +8486,7 @@ early_initcall(migration_init);
 void __init sched_init_smp(void)
 {
         sched_init_granularity();
+        sched_init_dl_servers();
 }
 
 #endif /* CONFIG_SMP */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index ad45a8fea245e..9f3b3f3592a58 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1647,22 +1647,6 @@ void dl_server_start(struct sched_dl_entity *dl_se)
 {
         struct rq *rq = dl_se->rq;
 
-        /*
-         * XXX: the apply do not work fine at the init phase for the
-         * fair server because things are not yet set. We need to improve
-         * this before getting generic.
-         */
-        if (!dl_server(dl_se)) {
-                u64 runtime = 50 * NSEC_PER_MSEC;
-                u64 period = 1000 * NSEC_PER_MSEC;
-
-                dl_server_apply_params(dl_se, runtime, period, 1);
-
-                dl_se->dl_server = 1;
-                dl_se->dl_defer = 1;
-                setup_new_dl_entity(dl_se);
-        }
-
         if (!dl_se->dl_runtime)
                 return;
 
@@ -1693,6 +1677,29 @@ void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
         dl_se->server_pick_task = pick_task;
 }
 
+void sched_init_dl_servers(void)
+{
+        int cpu;
+        struct rq *rq;
+        struct sched_dl_entity *dl_se;
+
+        for_each_online_cpu(cpu) {
+                u64 runtime = 50 * NSEC_PER_MSEC;
+                u64 period = 1000 * NSEC_PER_MSEC;
+
+                rq = cpu_rq(cpu);
+                dl_se = &rq->fair_server;
+
+                WARN_ON(dl_server(dl_se));
+
+                dl_server_apply_params(dl_se, runtime, period, 1);
+
+                dl_se->dl_server = 1;
+                dl_se->dl_defer = 1;
+                setup_new_dl_entity(dl_se);
+        }
+}
+
 void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
 {
         u64 new_bw = dl_se->dl_bw;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 475bb5998295e..22301c28a5d2d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -384,6 +384,7 @@ extern void dl_server_stop(struct sched_dl_entity *dl_se);
 extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
                     dl_server_has_tasks_f has_tasks,
                     dl_server_pick_f pick_task);
+extern void sched_init_dl_servers(void);
 
 extern void dl_server_update_idle_time(struct rq *rq,
                     struct task_struct *p);
-- 
2.49.0

^ permalink raw reply related	[flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-06-24 7:49 ` Juri Lelli @ 2025-06-24 12:59 ` Juri Lelli 2025-06-24 15:00 ` luca abeni 2025-06-25 15:55 ` Marcel Ziswiler 0 siblings, 2 replies; 35+ messages in thread From: Juri Lelli @ 2025-06-24 12:59 UTC (permalink / raw) To: luca abeni Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai Hello again, On 24/06/25 09:49, Juri Lelli wrote: ... > The following seem to at least cure the problem after boot. Things are > still broken after cpusets creation. Moving to look into that, but > wanted to share where I am so that we don't duplicate work. I ended up with two additional patches that seem to make things a little better at my end. You can find them at https://github.com/jlelli/linux/tree/upstream/fix-grub Marcel, Luca, can you please give them a quick try to check if they do any good? Thanks! Juri ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-06-24 12:59 ` Juri Lelli @ 2025-06-24 15:00 ` luca abeni 2025-06-25 9:30 ` Juri Lelli 2025-06-25 15:55 ` Marcel Ziswiler 1 sibling, 1 reply; 35+ messages in thread From: luca abeni @ 2025-06-24 15:00 UTC (permalink / raw) To: Juri Lelli Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai On Tue, 24 Jun 2025 14:59:13 +0200 Juri Lelli <juri.lelli@redhat.com> wrote: > Hello again, > > On 24/06/25 09:49, Juri Lelli wrote: > > ... > > > The following seem to at least cure the problem after boot. Things > > are still broken after cpusets creation. Moving to look into that, > > but wanted to share where I am so that we don't duplicate work. > > I ended up with two additional patches that seem to make things a > little better at my end. You can find them at > > https://github.com/jlelli/linux/tree/upstream/fix-grub > > Marcel, Luca, can you please give them a quick try to check if they do > any good? I applied your 3 patches to the master branch of linux.git, and they indeed seems to fix the issue! Now, I need to understand how they relate to 5f6bd380c7bdbe10f7b4e8ddcceed60ce0714c6d :) One small issue: after applying your patches, I get this WARN at boot time: [ 0.384481] ------------[ cut here ]------------ [ 0.385384] WARNING: CPU: 0 PID: 1 at kernel/sched/deadline.c:265 task_non_contending+0x24d/0x3b0 [ 0.385384] Modules linked in: [ 0.385384] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0-rc2-00234-ge35a18896578 #42 PREEMPT(voluntary) [ 0.385384] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 0.385384] RIP: 0010:task_non_contending+0x24d/0x3b0 [ 0.385384] Code: 59 49 00 e9 7a fe ff ff 48 8b 53 30 f6 43 53 10 0f 85 4c ff ff ff 48 8b 85 c8 08 00 00 48 29 d0 48 89 85 c8 08 00 00 73 0f 90 <0f> 0b 90 48 c7 85 c8 08 00 00 00 00 00 00 48 63 95 28 0b 00 00 48 [ 0.385384] RSP: 0000:ffffb52300013c08 EFLAGS: 00010093 [ 0.385384] RAX: ffffffffffff3334 RBX: ffff979ffe8292b0 RCX: 0000000000000001 [ 0.385384] RDX: 000000000000cccc RSI: 0000000002faf080 RDI: ffff979ffe8292b0 [ 0.385384] RBP: ffff979ffe8289c0 R08: 0000000000000001 R09: 00000000000002a5 [ 0.385384] R10: 0000000000000000 R11: 0000000000000001 R12: ffffffffffe0ab69 [ 0.385384] R13: ffff979ffe828a40 R14: 0000000000000009 R15: ffff979ffe8289c0 [ 0.385384] FS: 0000000000000000(0000) GS:ffff97a05f709000(0000) knlGS:0000000000000000 [ 0.385384] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.385384] CR2: ffff979fdec01000 CR3: 000000001e030000 CR4: 00000000000006f0 [ 0.385384] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 0.385384] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 0.385384] Call Trace: [ 0.385384] <TASK> [ 0.385384] dl_server_stop+0x21/0x40 [ 0.385384] dequeue_entities+0x604/0x900 [ 0.385384] dequeue_task_fair+0x85/0x190 [ 0.385384] ? update_rq_clock+0x6c/0x110 [ 0.385384] __schedule+0x1f0/0xee0 [ 0.385384] schedule+0x22/0xd0 [ 0.385384] schedule_timeout+0xf4/0x100 [ 0.385384] __wait_for_common+0x97/0x180 [ 0.385384] ? __pfx_schedule_timeout+0x10/0x10 [ 0.385384] ? __pfx_devtmpfsd+0x10/0x10 [ 0.385384] wait_for_completion_killable+0x1f/0x40 [ 0.385384] __kthread_create_on_node+0xe7/0x150 [ 0.385384] kthread_create_on_node+0x4f/0x70 [ 0.385384] ? register_filesystem+0x97/0xc0 [ 0.385384] devtmpfs_init+0x115/0x200 [ 0.385384] driver_init+0x15/0x50 [ 0.385384] kernel_init_freeable+0xf4/0x2d0 [ 0.385384] ? 
__pfx_kernel_init+0x10/0x10 [ 0.385384] kernel_init+0x15/0x1c0 [ 0.385384] ret_from_fork+0x80/0xd0 [ 0.385384] ? __pfx_kernel_init+0x10/0x10 [ 0.385384] ret_from_fork_asm+0x1a/0x30 [ 0.385384] </TASK> [ 0.385384] ---[ end trace 0000000000000000 ]--- Luca ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-06-24 15:00 ` luca abeni @ 2025-06-25 9:30 ` Juri Lelli 2025-06-25 10:11 ` Juri Lelli 0 siblings, 1 reply; 35+ messages in thread From: Juri Lelli @ 2025-06-25 9:30 UTC (permalink / raw) To: luca abeni Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai On 24/06/25 17:00, luca abeni wrote: > On Tue, 24 Jun 2025 14:59:13 +0200 > Juri Lelli <juri.lelli@redhat.com> wrote: > > > Hello again, > > > > On 24/06/25 09:49, Juri Lelli wrote: > > > > ... > > > > > The following seem to at least cure the problem after boot. Things > > > are still broken after cpusets creation. Moving to look into that, > > > but wanted to share where I am so that we don't duplicate work. > > > > I ended up with two additional patches that seem to make things a > > little better at my end. You can find them at > > > > https://github.com/jlelli/linux/tree/upstream/fix-grub > > > > Marcel, Luca, can you please give them a quick try to check if they do > > any good? > > I applied your 3 patches to the master branch of linux.git, and they > indeed seems to fix the issue! > > Now, I need to understand how they relate to > 5f6bd380c7bdbe10f7b4e8ddcceed60ce0714c6d :) > > One small issue: after applying your patches, I get this WARN at boot > time: > [ 0.384481] ------------[ cut here ]------------ > [ 0.385384] WARNING: CPU: 0 PID: 1 at kernel/sched/deadline.c:265 task_non_contending+0x24d/0x3b0 > [ 0.385384] Modules linked in: > [ 0.385384] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0-rc2-00234-ge35a18896578 #42 PREEMPT(voluntary) > [ 0.385384] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 > [ 0.385384] RIP: 0010:task_non_contending+0x24d/0x3b0 > [ 0.385384] Code: 59 49 00 e9 7a fe ff ff 48 8b 53 30 f6 43 53 10 0f 85 4c ff ff ff 48 8b 85 c8 08 00 00 48 29 d0 48 89 85 c8 08 00 00 73 0f 90 <0f> 0b 90 48 c7 85 c8 08 00 00 00 00 00 00 48 63 95 28 0b 00 00 48 > [ 0.385384] RSP: 0000:ffffb52300013c08 EFLAGS: 00010093 > [ 0.385384] RAX: ffffffffffff3334 RBX: ffff979ffe8292b0 RCX: 0000000000000001 > [ 0.385384] RDX: 000000000000cccc RSI: 0000000002faf080 RDI: ffff979ffe8292b0 > [ 0.385384] RBP: ffff979ffe8289c0 R08: 0000000000000001 R09: 00000000000002a5 > [ 0.385384] R10: 0000000000000000 R11: 0000000000000001 R12: ffffffffffe0ab69 > [ 0.385384] R13: ffff979ffe828a40 R14: 0000000000000009 R15: ffff979ffe8289c0 > [ 0.385384] FS: 0000000000000000(0000) GS:ffff97a05f709000(0000) knlGS:0000000000000000 > [ 0.385384] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 0.385384] CR2: ffff979fdec01000 CR3: 000000001e030000 CR4: 00000000000006f0 > [ 0.385384] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 0.385384] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > [ 0.385384] Call Trace: > [ 0.385384] <TASK> > [ 0.385384] dl_server_stop+0x21/0x40 > [ 0.385384] dequeue_entities+0x604/0x900 > [ 0.385384] dequeue_task_fair+0x85/0x190 > [ 0.385384] ? update_rq_clock+0x6c/0x110 > [ 0.385384] __schedule+0x1f0/0xee0 > [ 0.385384] schedule+0x22/0xd0 > [ 0.385384] schedule_timeout+0xf4/0x100 > [ 0.385384] __wait_for_common+0x97/0x180 > [ 0.385384] ? __pfx_schedule_timeout+0x10/0x10 > [ 0.385384] ? __pfx_devtmpfsd+0x10/0x10 > [ 0.385384] wait_for_completion_killable+0x1f/0x40 > [ 0.385384] __kthread_create_on_node+0xe7/0x150 > [ 0.385384] kthread_create_on_node+0x4f/0x70 > [ 0.385384] ? 
register_filesystem+0x97/0xc0
> [    0.385384]  devtmpfs_init+0x115/0x200
> [    0.385384]  driver_init+0x15/0x50
> [    0.385384]  kernel_init_freeable+0xf4/0x2d0
> [    0.385384]  ? __pfx_kernel_init+0x10/0x10
> [    0.385384]  kernel_init+0x15/0x1c0
> [    0.385384]  ret_from_fork+0x80/0xd0
> [    0.385384]  ? __pfx_kernel_init+0x10/0x10
> [    0.385384]  ret_from_fork_asm+0x1a/0x30
> [    0.385384]  </TASK>
> [    0.385384] ---[ end trace 0000000000000000 ]---

I now see it as well, not sure how I missed it, maybe I didn't pay enough
attention. :)

It looks like (at least at my end) it comes from

 task_non_contending()
   sub_running_bw()
     __sub_running_bw()
       WARN_ON_ONCE(dl_rq->running_bw > old); /* underflow */

I would guess the later initialization of dl-server is not playing well
wrt running_bw. Will take a look.

BTW, I pushed an additional fixup commit (forgot some needed locking here
and there, oops :).

^ permalink raw reply	[flat|nested] 35+ messages in thread
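For readers following along, the guard that fires here is easy to model in isolation: subtracting a
bandwidth that was never added to running_bw wraps the unsigned counter around, which is exactly
what the underflow check is there to catch. The snippet below is a standalone toy model of that
accounting; names and the clamp-to-zero recovery are assumptions, simplified from the kernel
helpers mentioned above.

/*
 * Toy model of the running_bw underflow check referenced above: if a
 * dl-server's bandwidth is removed without ever having been added, the
 * unsigned counter wraps and the guard trips. Simplified; not kernel code.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct dl_rq_bw {
    uint64_t running_bw;
};

static void sub_running_bw(struct dl_rq_bw *dl_rq, uint64_t dl_bw)
{
    uint64_t old = dl_rq->running_bw;

    dl_rq->running_bw -= dl_bw;
    if (dl_rq->running_bw > old) {  /* wrapped around: underflow */
        fprintf(stderr, "underflow: removing %llu from %llu\n",
                (unsigned long long)dl_bw, (unsigned long long)old);
        dl_rq->running_bw = 0;      /* clamp so later math stays sane */
    }
}

int main(void)
{
    struct dl_rq_bw rq = { .running_bw = 0 };

    /* Stop accounting for a server whose bandwidth was never added: */
    sub_running_bw(&rq, 52428);     /* ~5% of a CPU in 2^20 fixed point */
    assert(rq.running_bw == 0);
    return 0;
}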
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-06-25 9:30 ` Juri Lelli @ 2025-06-25 10:11 ` Juri Lelli 2025-06-25 12:50 ` luca abeni 0 siblings, 1 reply; 35+ messages in thread From: Juri Lelli @ 2025-06-25 10:11 UTC (permalink / raw) To: luca abeni Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai On 25/06/25 11:30, Juri Lelli wrote: ... > It looks like (at least at my end) it comes from > > task_non_contending() > sub_running_bw() > __sub_running_bw() > WARN_ON_ONCE(dl_rq->running_bw > old); /* underflow */ > > I would guess the later initialization of dl-server is not playing well > wrt running_bw. Will take a look. I pushed another fixup adding a check for dl_server_active in dl_server_stop(). It seems to cure the WARN here. Could you please pull and re-test? ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-06-25 10:11 ` Juri Lelli @ 2025-06-25 12:50 ` luca abeni 2025-06-26 10:59 ` Marcel Ziswiler 0 siblings, 1 reply; 35+ messages in thread From: luca abeni @ 2025-06-25 12:50 UTC (permalink / raw) To: Juri Lelli Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai Hi Juri, On Wed, 25 Jun 2025 12:11:46 +0200 Juri Lelli <juri.lelli@redhat.com> wrote: [...] > I pushed another fixup adding a check for dl_server_active in > dl_server_stop(). It seems to cure the WARN here. > > Could you please pull and re-test? I added your last 2 commits, and tested again; it seems to me that everything looks fine, now... Marcel, can you confirm? Luca ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-06-25 12:50 ` luca abeni @ 2025-06-26 10:59 ` Marcel Ziswiler 2025-06-26 11:45 ` Juri Lelli 0 siblings, 1 reply; 35+ messages in thread From: Marcel Ziswiler @ 2025-06-26 10:59 UTC (permalink / raw) To: luca abeni, Juri Lelli Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai Hi Luca and Juri On Wed, 2025-06-25 at 14:50 +0200, luca abeni wrote: > Hi Juri, > > On Wed, 25 Jun 2025 12:11:46 +0200 > Juri Lelli <juri.lelli@redhat.com> wrote: > [...] > > I pushed another fixup adding a check for dl_server_active in > > dl_server_stop(). It seems to cure the WARN here. > > > > Could you please pull and re-test? > > I added your last 2 commits, and tested again; it seems to me that > everything looks fine, now... Marcel, can you confirm? Indeed, our CI run now close to 220 mio. tests on NUCs and 190 mio. on ROCK 5B and so far it didn't miss any single beat! Also the statistics around those tests look very good. With reclaim enabled one can now truly get very good real-time performance. Thank you very much! We will continue to exercise the Linux kernel scheduler to the fullest and report any inconsistencies we are seeing. Just let me know if there is anything else we may help you with. Thanks again! > Luca Cheers Marcel ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-06-26 10:59 ` Marcel Ziswiler @ 2025-06-26 11:45 ` Juri Lelli 0 siblings, 0 replies; 35+ messages in thread From: Juri Lelli @ 2025-06-26 11:45 UTC (permalink / raw) To: Marcel Ziswiler Cc: luca abeni, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai On 26/06/25 04:59, Marcel Ziswiler wrote: > Hi Luca and Juri > > On Wed, 2025-06-25 at 14:50 +0200, luca abeni wrote: > > Hi Juri, > > > > On Wed, 25 Jun 2025 12:11:46 +0200 > > Juri Lelli <juri.lelli@redhat.com> wrote: > > [...] > > > I pushed another fixup adding a check for dl_server_active in > > > dl_server_stop(). It seems to cure the WARN here. > > > > > > Could you please pull and re-test? > > > > I added your last 2 commits, and tested again; it seems to me that > > everything looks fine, now... Marcel, can you confirm? > > Indeed, our CI run now close to 220 mio. tests on NUCs and 190 mio. on ROCK 5B and so far it didn't miss any > single beat! Also the statistics around those tests look very good. With reclaim enabled one can now truly get > very good real-time performance. Thank you very much! > > We will continue to exercise the Linux kernel scheduler to the fullest and report any inconsistencies we are > seeing. > > Just let me know if there is anything else we may help you with. Thanks again! Great! Thanks a lot for testing and the patience. :-) I will be sending out a polished version of the set soon. Please take a look and add your reviewed/tested-by to that if you can. The changes are the same you have been testing already, just with changelogs etc. Let's see if people spot problems with the actual implementation of the fixes. Best, Juri ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-06-24 12:59 ` Juri Lelli 2025-06-24 15:00 ` luca abeni @ 2025-06-25 15:55 ` Marcel Ziswiler 1 sibling, 0 replies; 35+ messages in thread From: Marcel Ziswiler @ 2025-06-25 15:55 UTC (permalink / raw) To: Juri Lelli, luca abeni Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai Hi Juri On Tue, 2025-06-24 at 14:59 +0200, Juri Lelli wrote: > Hello again, > > On 24/06/25 09:49, Juri Lelli wrote: > > ... > > > The following seem to at least cure the problem after boot. Things are > > still broken after cpusets creation. Moving to look into that, but > > wanted to share where I am so that we don't duplicate work. > > I ended up with two additional patches that seem to make things a little > better at my end. You can find them at > > https://github.com/jlelli/linux/tree/upstream/fix-grub > > Marcel, Luca, can you please give them a quick try to check if they do > any good? I gave this a try yesterday and run a first longer test in our CI. While that now only run for 16 hours doing 30+ mio. tests on NUCs and 75+ mio. on ROCK 5B it really looks promising so far. I will now update to your latest patches and re-run those tests. Usually we need 40+ hours of testing to really be confident in our statistics around those tests. > Thanks! Thank you! > Juri Cheers from the OSS NA/ELC in Denver Marcel ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-06-20 16:52 ` luca abeni 2025-06-24 7:49 ` Juri Lelli @ 2025-06-24 13:36 ` luca abeni 1 sibling, 0 replies; 35+ messages in thread From: luca abeni @ 2025-06-24 13:36 UTC (permalink / raw) To: Juri Lelli Cc: Marcel Ziswiler, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai On Fri, 20 Jun 2025 18:52:48 +0200 luca abeni <luca.abeni@santannapisa.it> wrote: [...] > > > should be decreased by Ui when a task with utilization Ui > > > becomes SCHED_DEADLINE (and increased by Ui when the > > > SCHED_DEADLINE task terminates or changes scheduling policy). > > > Since this value is per_core, Ui is divided by the number of > > > cores in the root domain... From what you write, I guess extra_bw > > > is not correctly initialized/updated when a new root domain is > > > created? > > > > It looks like so yeah. After boot and when domains are dinamically > > created. But, I am still not 100%, I only see weird numbers that I > > struggle to relate with what you say above. :) > > BTW, when running some tests on different machines I think I found out > that 6.11 does not exhibit this issue (this needs to be confirmed, I > am working on reproducing the test with different kernels on the same > machine) > > If I manage to reproduce this result, I think I can run a bisect to > the commit introducing the issue (git is telling me that I'll need > about 15 tests :) > So, stay tuned... It took more than I expected, but I think I found the guilty commit... It seems to be [5f6bd380c7bdbe10f7b4e8ddcceed60ce0714c6d] sched/rt: Remove default bandwidth control Starting from this commit, I can reproduce the issue, but if I test the previous commit (c8a85394cfdb4696b4e2f8a0f3066a1c921af426 sched/core: Fix picking of tasks for core scheduling with DL server) the issue disappears. Maybe this information can help in better understanding the problem :) Luca > > > > All this information is probably not properly documented... > > > Should I improve the description in > > > Documentation/scheduler/sched-deadline.rst or do you prefer some > > > comments in kernel/sched/deadline.c? (or .h?) > > > > I think ideally both. sched-deadline.rst should probably contain the > > whole picture with more information and .c/.h the condendensed > > version. > > OK, I'll try to do this in next week > > > Thanks, > Luca ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-05-25 19:29 ` Marcel Ziswiler 2025-05-29 9:39 ` Juri Lelli @ 2025-05-30 9:21 ` luca abeni 2025-06-03 11:18 ` Marcel Ziswiler 1 sibling, 1 reply; 35+ messages in thread From: luca abeni @ 2025-05-30 9:21 UTC (permalink / raw) To: Marcel Ziswiler Cc: Juri Lelli, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai Hi Marcel, On Sun, 25 May 2025 21:29:05 +0200 Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote: [...] > > How do you configure systemd? I am having troubles in reproducing > > your AllowedCPUs configuration... This is an example of what I am > > trying: sudo systemctl set-property --runtime custom-workload.slice > > AllowedCPUs=1 sudo systemctl set-property --runtime init.scope > > AllowedCPUs=0,2,3 sudo systemctl set-property --runtime > > system.slice AllowedCPUs=0,2,3 sudo systemctl set-property > > --runtime user.slice AllowedCPUs=0,2,3 and then I try to run a > > SCHED_DEADLINE application with sudo systemd-run --scope -p > > Slice=custom-workload.slice <application> > > We just use a bunch of systemd configuration files as follows: > > [root@localhost ~]# cat /lib/systemd/system/monitor.slice > # Copyright (C) 2024 Codethink Limited > # SPDX-License-Identifier: GPL-2.0-only [...] So, I copied your *.slice files in /lib/systemd/system (and I added them to the "Wants=" entry of /lib/systemd/system/slices.target, otherwise the slices are not created), but I am still unable to run SCHED_DEADLINE applications in these slices. This is due to the fact that the kernel does not create a new root domain for these cpusets (probably because the cpusets' CPUs are not exclusive and the cpuset is not "isolated": for example, /sys/fs/cgroup/safety1.slice/cpuset.cpus.partition is set to "member", not to "isolated"). So, the "cpumask_subset(span, p->cpus_ptr)" in sched_setsched() is still false and the syscall returns -EPERM. Since I do not know how to obtain an isolated cpuset with cgroup v2 and systemd, I tried using the old cgroup v1, as described in the SCHED_DEADLINE documentation. This worked fine, and enabling SCHED_FLAG_RECLAIM actually reduced the number of missed deadlines (I tried with a set of periodic tasks having the same parameters as the ones you described). So, it looks like reclaiming is working correctly (at least, as far as I can see) when using cgroup v1 to configure the CPU partitions... Maybe there is some bug triggered by cgroup v2, or maybe I am misunderstanding your setup. I think the experiment suggested by Juri can help in understanding where the issue can be. Thanks, Luca > [Unit] > Description=Prioritized slice for the safety monitor. > Before=slices.target > > [Slice] > CPUWeight=1000 > AllowedCPUs=0 > MemoryAccounting=true > MemoryMin=10% > ManagedOOMPreference=omit > > [Install] > WantedBy=slices.target > > [root@localhost ~]# cat /lib/systemd/system/safety1.slice > # Copyright (C) 2024 Codethink Limited > # SPDX-License-Identifier: GPL-2.0-only > [Unit] > Description=Slice for Safety case processes. > Before=slices.target > > [Slice] > CPUWeight=1000 > AllowedCPUs=1 > MemoryAccounting=true > MemoryMin=10% > ManagedOOMPreference=omit > > [Install] > WantedBy=slices.target > > [root@localhost ~]# cat /lib/systemd/system/safety2.slice > # Copyright (C) 2024 Codethink Limited > # SPDX-License-Identifier: GPL-2.0-only > [Unit] > Description=Slice for Safety case processes. 
> Before=slices.target > > [Slice] > CPUWeight=1000 > AllowedCPUs=2 > MemoryAccounting=true > MemoryMin=10% > ManagedOOMPreference=omit > > [Install] > WantedBy=slices.target > > [root@localhost ~]# cat /lib/systemd/system/safety3.slice > # Copyright (C) 2024 Codethink Limited > # SPDX-License-Identifier: GPL-2.0-only > [Unit] > Description=Slice for Safety case processes. > Before=slices.target > > [Slice] > CPUWeight=1000 > AllowedCPUs=3 > MemoryAccounting=true > MemoryMin=10% > ManagedOOMPreference=omit > > [Install] > WantedBy=slices.target > > [root@localhost ~]# cat /lib/systemd/system/system.slice > # Copyright (C) 2024 Codethink Limited > # SPDX-License-Identifier: GPL-2.0-only > > # > # This slice will control all processes started by systemd by > # default. > # > > [Unit] > Description=System Slice > Documentation=man:systemd.special(7) > Before=slices.target > > [Slice] > CPUQuota=150% > AllowedCPUs=0 > MemoryAccounting=true > MemoryMax=80% > ManagedOOMSwap=kill > ManagedOOMMemoryPressure=kill > > [root@localhost ~]# cat /lib/systemd/system/user.slice > # Copyright (C) 2024 Codethink Limited > # SPDX-License-Identifier: GPL-2.0-only > > # > # This slice will control all processes started by systemd-logind > # > > [Unit] > Description=User and Session Slice > Documentation=man:systemd.special(7) > Before=slices.target > > [Slice] > CPUQuota=25% > AllowedCPUs=0 > MemoryAccounting=true > MemoryMax=80% > ManagedOOMSwap=kill > ManagedOOMMemoryPressure=kill > > > However, this does not work because systemd is not creating an > > isolated cpuset... So, the root domain still contains CPUs 0-3, and > > the "custom-workload.slice" cpuset only has CPU 1. Hence, the check > > /* > > * Don't allow tasks with an affinity mask > > smaller than > > * the entire root_domain to become > > SCHED_DEADLINE. We > > * will also fail if there's no bandwidth > > available. */ > > if (!cpumask_subset(span, p->cpus_ptr) || > > rq->rd->dl_bw.bw == 0) { > > retval = -EPERM; > > goto unlock; > > } > > in sched_setsched() fails. > > > > > > How are you configuring the cpusets? > > See above. > > > Also, which kernel version are you using? > > (sorry if you already posted this information in previous emails > > and I am missing something obvious) > > Not even sure, whether I explicitly mentioned that other than that we > are always running latest stable. > > Two months ago when we last run some extensive tests on this it was > actually v6.13.6. > > > Thanks, > > Thank you! > > > Luca > > Cheers > > Marcel ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-05-30 9:21 ` luca abeni @ 2025-06-03 11:18 ` Marcel Ziswiler 2025-06-06 13:16 ` luca abeni 0 siblings, 1 reply; 35+ messages in thread From: Marcel Ziswiler @ 2025-06-03 11:18 UTC (permalink / raw) To: luca abeni Cc: Juri Lelli, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai Hi Luca Thank you very much! On Fri, 2025-05-30 at 11:21 +0200, luca abeni wrote: > Hi Marcel, > > On Sun, 25 May 2025 21:29:05 +0200 > Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote: > [...] > > > How do you configure systemd? I am having troubles in reproducing > > > your AllowedCPUs configuration... This is an example of what I am > > > trying: sudo systemctl set-property --runtime custom-workload.slice > > > AllowedCPUs=1 sudo systemctl set-property --runtime init.scope > > > AllowedCPUs=0,2,3 sudo systemctl set-property --runtime > > > system.slice AllowedCPUs=0,2,3 sudo systemctl set-property > > > --runtime user.slice AllowedCPUs=0,2,3 and then I try to run a > > > SCHED_DEADLINE application with sudo systemd-run --scope -p > > > Slice=custom-workload.slice <application> > > > > We just use a bunch of systemd configuration files as follows: > > > > [root@localhost ~]# cat /lib/systemd/system/monitor.slice > > # Copyright (C) 2024 Codethink Limited > > # SPDX-License-Identifier: GPL-2.0-only > [...] > > So, I copied your *.slice files in /lib/systemd/system (and I added > them to the "Wants=" entry of /lib/systemd/system/slices.target, > otherwise the slices are not created), but I am still unable to run > SCHED_DEADLINE applications in these slices. We just link them there e.g. [root@localhost ~]# ls -l /etc/systemd/system/slices.target.wants/safety1.slice lrwxrwxrwx 1 root root 37 Nov 10 2011 /etc/systemd/system/slices.target.wants/safety1.slice -> /usr/lib/systemd/system/safety1.slice BTW: /lib is just sym-linked to /usr/lib in our setup. > This is due to the fact that the kernel does not create a new root > domain for these cpusets (probably because the cpusets' CPUs are not > exclusive and the cpuset is not "isolated": for example, > /sys/fs/cgroup/safety1.slice/cpuset.cpus.partition is set to "member", > not to "isolated"). Not sure, but for me it is indeed root e.g. [root@localhost ~]# cat /sys/fs/cgroup/safety1.slice/cpuset.cpus.partition root > So, the "cpumask_subset(span, p->cpus_ptr)" in > sched_setsched() is still false and the syscall returns -EPERM. > > > Since I do not know how to obtain an isolated cpuset with cgroup v2 and > systemd, I tried using the old cgroup v1, as described in the > SCHED_DEADLINE documentation. I would have thought it should not make any difference whether cgroup v1 or v2 is used, but then who knows. > This worked fine, and enabling SCHED_FLAG_RECLAIM actually reduced the > number of missed deadlines (I tried with a set of periodic tasks having > the same parameters as the ones you described). So, it looks like > reclaiming is working correctly (at least, as far as I can see) when > using cgroup v1 to configure the CPU partitions... Maybe there is some > bug triggered by cgroup v2, Could be, but anyway would be good to also update the SCHED_DEADLINE documentation to cgroup v2. > or maybe I am misunderstanding your setup. No, there should be nothing else special really. > I think the experiment suggested by Juri can help in understanding > where the issue can be. 
Yes, I already did all that and hope you guys can get some insights from that experiment. And remember, if I can help in any other way just let me know. Thanks! > Thanks, > Luca > > > > [Unit] > > Description=Prioritized slice for the safety monitor. > > Before=slices.target > > > > [Slice] > > CPUWeight=1000 > > AllowedCPUs=0 > > MemoryAccounting=true > > MemoryMin=10% > > ManagedOOMPreference=omit > > > > [Install] > > WantedBy=slices.target > > > > [root@localhost ~]# cat /lib/systemd/system/safety1.slice > > # Copyright (C) 2024 Codethink Limited > > # SPDX-License-Identifier: GPL-2.0-only > > [Unit] > > Description=Slice for Safety case processes. > > Before=slices.target > > > > [Slice] > > CPUWeight=1000 > > AllowedCPUs=1 > > MemoryAccounting=true > > MemoryMin=10% > > ManagedOOMPreference=omit > > > > [Install] > > WantedBy=slices.target > > > > [root@localhost ~]# cat /lib/systemd/system/safety2.slice > > # Copyright (C) 2024 Codethink Limited > > # SPDX-License-Identifier: GPL-2.0-only > > [Unit] > > Description=Slice for Safety case processes. > > Before=slices.target > > > > [Slice] > > CPUWeight=1000 > > AllowedCPUs=2 > > MemoryAccounting=true > > MemoryMin=10% > > ManagedOOMPreference=omit > > > > [Install] > > WantedBy=slices.target > > > > [root@localhost ~]# cat /lib/systemd/system/safety3.slice > > # Copyright (C) 2024 Codethink Limited > > # SPDX-License-Identifier: GPL-2.0-only > > [Unit] > > Description=Slice for Safety case processes. > > Before=slices.target > > > > [Slice] > > CPUWeight=1000 > > AllowedCPUs=3 > > MemoryAccounting=true > > MemoryMin=10% > > ManagedOOMPreference=omit > > > > [Install] > > WantedBy=slices.target > > > > [root@localhost ~]# cat /lib/systemd/system/system.slice > > # Copyright (C) 2024 Codethink Limited > > # SPDX-License-Identifier: GPL-2.0-only > > > > # > > # This slice will control all processes started by systemd by > > # default. > > # > > > > [Unit] > > Description=System Slice > > Documentation=man:systemd.special(7) > > Before=slices.target > > > > [Slice] > > CPUQuota=150% > > AllowedCPUs=0 > > MemoryAccounting=true > > MemoryMax=80% > > ManagedOOMSwap=kill > > ManagedOOMMemoryPressure=kill > > > > [root@localhost ~]# cat /lib/systemd/system/user.slice > > # Copyright (C) 2024 Codethink Limited > > # SPDX-License-Identifier: GPL-2.0-only > > > > # > > # This slice will control all processes started by systemd-logind > > # > > > > [Unit] > > Description=User and Session Slice > > Documentation=man:systemd.special(7) > > Before=slices.target > > > > [Slice] > > CPUQuota=25% > > AllowedCPUs=0 > > MemoryAccounting=true > > MemoryMax=80% > > ManagedOOMSwap=kill > > ManagedOOMMemoryPressure=kill > > > > > However, this does not work because systemd is not creating an > > > isolated cpuset... So, the root domain still contains CPUs 0-3, and > > > the "custom-workload.slice" cpuset only has CPU 1. Hence, the check > > > /* > > > * Don't allow tasks with an affinity mask > > > smaller than > > > * the entire root_domain to become > > > SCHED_DEADLINE. We > > > * will also fail if there's no bandwidth > > > available. */ > > > if (!cpumask_subset(span, p->cpus_ptr) || > > > rq->rd->dl_bw.bw == 0) { > > > retval = -EPERM; > > > goto unlock; > > > } > > > in sched_setsched() fails. > > > > > > > > > How are you configuring the cpusets? > > > > See above. > > > > > Also, which kernel version are you using? 
> > > (sorry if you already posted this information in previous emails > > > and I am missing something obvious) > > > > Not even sure, whether I explicitly mentioned that other than that we > > are always running latest stable. > > > > Two months ago when we last run some extensive tests on this it was > > actually v6.13.6. > > > > > Thanks, > > > > Thank you! > > > > > Luca Cheers Marcel ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) 2025-06-03 11:18 ` Marcel Ziswiler @ 2025-06-06 13:16 ` luca abeni 0 siblings, 0 replies; 35+ messages in thread From: luca abeni @ 2025-06-06 13:16 UTC (permalink / raw) To: Marcel Ziswiler Cc: Juri Lelli, linux-kernel, Ingo Molnar, Peter Zijlstra, Vineeth Pillai Hi Marcel, I am still having issues in reproducing your setup with cgroup v2 (maybe it depends on the systemd version, I do not know), but I ran some experiments using cgroup v1... Here are some ideas: - When using reclaiming, the core load can become very high (reaching 95%)... This can increase the CPU temperature, and maybe some thermal throttling mechanism slows it down to avoid overheating? (this happened one time to me when replicating your setup). This would explain some missed deadlines - Related to this... You probably already mentioned it, but which kind of CPU are you using? How is frequency scaling configured? (that is: which cpufreq governor are you using?) - Another random idea: is it possible that you enabled reclaiming only for some of the SCHED_DEADLINE threads running on a core? (and reclaiming is maybe disabled for the thread that is missing deadlines?) Also, can you try lowering the value of /proc/sys/kernel/sched_rt_runtime_us and check if the problem still happens? Thanks, Luca On Tue, 03 Jun 2025 13:18:23 +0200 Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote: > Hi Luca > > Thank you very much! > > On Fri, 2025-05-30 at 11:21 +0200, luca abeni wrote: > > Hi Marcel, > > > > On Sun, 25 May 2025 21:29:05 +0200 > > Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote: > > [...] > > > > How do you configure systemd? I am having troubles in > > > > reproducing your AllowedCPUs configuration... This is an > > > > example of what I am trying: sudo systemctl set-property > > > > --runtime custom-workload.slice AllowedCPUs=1 sudo systemctl > > > > set-property --runtime init.scope AllowedCPUs=0,2,3 sudo > > > > systemctl set-property --runtime system.slice AllowedCPUs=0,2,3 > > > > sudo systemctl set-property --runtime user.slice > > > > AllowedCPUs=0,2,3 and then I try to run a SCHED_DEADLINE > > > > application with sudo systemd-run --scope -p > > > > Slice=custom-workload.slice <application> > > > > > > We just use a bunch of systemd configuration files as follows: > > > > > > [root@localhost ~]# cat /lib/systemd/system/monitor.slice > > > # Copyright (C) 2024 Codethink Limited > > > # SPDX-License-Identifier: GPL-2.0-only > > [...] > > > > So, I copied your *.slice files in /lib/systemd/system (and I added > > them to the "Wants=" entry of /lib/systemd/system/slices.target, > > otherwise the slices are not created), but I am still unable to run > > SCHED_DEADLINE applications in these slices. > > We just link them there e.g. > > [root@localhost ~]# ls -l > /etc/systemd/system/slices.target.wants/safety1.slice lrwxrwxrwx 1 > root root 37 Nov 10 2011 > /etc/systemd/system/slices.target.wants/safety1.slice -> > /usr/lib/systemd/system/safety1.slice > > BTW: /lib is just sym-linked to /usr/lib in our setup. > > > This is due to the fact that the kernel does not create a new root > > domain for these cpusets (probably because the cpusets' CPUs are not > > exclusive and the cpuset is not "isolated": for example, > > /sys/fs/cgroup/safety1.slice/cpuset.cpus.partition is set to > > "member", not to "isolated"). > > Not sure, but for me it is indeed root e.g. 
> > [root@localhost ~]# cat > /sys/fs/cgroup/safety1.slice/cpuset.cpus.partition root > > > So, the "cpumask_subset(span, p->cpus_ptr)" in > > sched_setsched() is still false and the syscall returns -EPERM. > > > > > > Since I do not know how to obtain an isolated cpuset with cgroup v2 > > and systemd, I tried using the old cgroup v1, as described in the > > SCHED_DEADLINE documentation. > > I would have thought it should not make any difference whether cgroup > v1 or v2 is used, but then who knows. > > > This worked fine, and enabling SCHED_FLAG_RECLAIM actually reduced > > the number of missed deadlines (I tried with a set of periodic > > tasks having the same parameters as the ones you described). So, it > > looks like reclaiming is working correctly (at least, as far as I > > can see) when using cgroup v1 to configure the CPU partitions... > > Maybe there is some bug triggered by cgroup v2, > > Could be, but anyway would be good to also update the SCHED_DEADLINE > documentation to cgroup v2. > > > or maybe I am misunderstanding your setup. > > No, there should be nothing else special really. > > > I think the experiment suggested by Juri can help in understanding > > where the issue can be. > > Yes, I already did all that and hope you guys can get some insights > from that experiment. > > And remember, if I can help in any other way just let me know. Thanks! > > > Thanks, > > Luca > > > > > > > [Unit] > > > Description=Prioritized slice for the safety monitor. > > > Before=slices.target > > > > > > [Slice] > > > CPUWeight=1000 > > > AllowedCPUs=0 > > > MemoryAccounting=true > > > MemoryMin=10% > > > ManagedOOMPreference=omit > > > > > > [Install] > > > WantedBy=slices.target > > > > > > [root@localhost ~]# cat /lib/systemd/system/safety1.slice > > > # Copyright (C) 2024 Codethink Limited > > > # SPDX-License-Identifier: GPL-2.0-only > > > [Unit] > > > Description=Slice for Safety case processes. > > > Before=slices.target > > > > > > [Slice] > > > CPUWeight=1000 > > > AllowedCPUs=1 > > > MemoryAccounting=true > > > MemoryMin=10% > > > ManagedOOMPreference=omit > > > > > > [Install] > > > WantedBy=slices.target > > > > > > [root@localhost ~]# cat /lib/systemd/system/safety2.slice > > > # Copyright (C) 2024 Codethink Limited > > > # SPDX-License-Identifier: GPL-2.0-only > > > [Unit] > > > Description=Slice for Safety case processes. > > > Before=slices.target > > > > > > [Slice] > > > CPUWeight=1000 > > > AllowedCPUs=2 > > > MemoryAccounting=true > > > MemoryMin=10% > > > ManagedOOMPreference=omit > > > > > > [Install] > > > WantedBy=slices.target > > > > > > [root@localhost ~]# cat /lib/systemd/system/safety3.slice > > > # Copyright (C) 2024 Codethink Limited > > > # SPDX-License-Identifier: GPL-2.0-only > > > [Unit] > > > Description=Slice for Safety case processes. > > > Before=slices.target > > > > > > [Slice] > > > CPUWeight=1000 > > > AllowedCPUs=3 > > > MemoryAccounting=true > > > MemoryMin=10% > > > ManagedOOMPreference=omit > > > > > > [Install] > > > WantedBy=slices.target > > > > > > [root@localhost ~]# cat /lib/systemd/system/system.slice > > > # Copyright (C) 2024 Codethink Limited > > > # SPDX-License-Identifier: GPL-2.0-only > > > > > > # > > > # This slice will control all processes started by systemd by > > > # default. 
> > > #
> > > 
> > > [Unit]
> > > Description=System Slice
> > > Documentation=man:systemd.special(7)
> > > Before=slices.target
> > > 
> > > [Slice]
> > > CPUQuota=150%
> > > AllowedCPUs=0
> > > MemoryAccounting=true
> > > MemoryMax=80%
> > > ManagedOOMSwap=kill
> > > ManagedOOMMemoryPressure=kill
> > > 
> > > [root@localhost ~]# cat /lib/systemd/system/user.slice
> > > # Copyright (C) 2024 Codethink Limited
> > > # SPDX-License-Identifier: GPL-2.0-only
> > > 
> > > #
> > > # This slice will control all processes started by systemd-logind
> > > #
> > > 
> > > [Unit]
> > > Description=User and Session Slice
> > > Documentation=man:systemd.special(7)
> > > Before=slices.target
> > > 
> > > [Slice]
> > > CPUQuota=25%
> > > AllowedCPUs=0
> > > MemoryAccounting=true
> > > MemoryMax=80%
> > > ManagedOOMSwap=kill
> > > ManagedOOMMemoryPressure=kill
> > > > However, this does not work because systemd is not creating an
> > > > isolated cpuset... So, the root domain still contains CPUs 0-3,
> > > > and the "custom-workload.slice" cpuset only has CPU 1. Hence,
> > > > the check
> > > >   /*
> > > >    * Don't allow tasks with an affinity mask smaller than
> > > >    * the entire root_domain to become SCHED_DEADLINE. We
> > > >    * will also fail if there's no bandwidth available.
> > > >    */
> > > >   if (!cpumask_subset(span, p->cpus_ptr) ||
> > > >       rq->rd->dl_bw.bw == 0) {
> > > >           retval = -EPERM;
> > > >           goto unlock;
> > > >   }
> > > > in sched_setsched() fails.
> > > > 
> > > > How are you configuring the cpusets?
> > > 
> > > See above.
> > > > Also, which kernel version are you using?
> > > > (sorry if you already posted this information in previous emails
> > > > and I am missing something obvious)
> > > 
> > > Not even sure, whether I explicitly mentioned that other than
> > > that we are always running latest stable.
> > > 
> > > Two months ago when we last run some extensive tests on this it
> > > was actually v6.13.6.
> > > > Thanks,
> > > 
> > > Thank you!
> > > 
> > > > Luca
> 
> Cheers
> 
> Marcel

^ permalink raw reply	[flat|nested] 35+ messages in thread
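As a reference for anyone trying to reproduce the reclaiming behaviour discussed
above: the following is a minimal sketch of a periodic task that opts into GRUB
reclaiming via the sched_setattr(2) syscall. It is modelled on the example in
Documentation/scheduler/sched-deadline.rst, not on the actual Codethink test
harness; the dl_* names are local helpers invented here, and the 5 ms runtime /
10 ms deadline and period are illustrative values only.

	/*
	 * dl-reclaim-sketch.c - illustrative only, not the thread's test code.
	 * Request SCHED_DEADLINE with GRUB reclaiming (SCHED_FLAG_RECLAIM).
	 */
	#include <stdio.h>
	#include <unistd.h>
	#include <sched.h>
	#include <sys/syscall.h>
	#include <linux/types.h>

	#ifndef SCHED_DEADLINE
	#define SCHED_DEADLINE		6
	#endif
	#ifndef SCHED_FLAG_RECLAIM
	#define SCHED_FLAG_RECLAIM	0x02	/* opt this task into GRUB reclaiming */
	#endif

	/*
	 * Mirror of the kernel's UAPI struct sched_attr; named dl_* here to
	 * avoid clashing with any sched_setattr() wrapper the libc may provide.
	 */
	struct dl_sched_attr {
		__u32 size;
		__u32 sched_policy;
		__u64 sched_flags;
		__s32 sched_nice;
		__u32 sched_priority;
		/* SCHED_DEADLINE parameters, in nanoseconds */
		__u64 sched_runtime;
		__u64 sched_deadline;
		__u64 sched_period;
	};

	static int dl_sched_setattr(pid_t pid, const struct dl_sched_attr *attr,
				    unsigned int flags)
	{
		return syscall(__NR_sched_setattr, pid, attr, flags);
	}

	int main(void)
	{
		struct dl_sched_attr attr = {
			.size		= sizeof(attr),
			.sched_policy	= SCHED_DEADLINE,
			.sched_flags	= SCHED_FLAG_RECLAIM,
			.sched_runtime	=  5 * 1000 * 1000,	/*  5 ms */
			.sched_deadline	= 10 * 1000 * 1000,	/* 10 ms */
			.sched_period	= 10 * 1000 * 1000,	/* 10 ms */
		};

		if (dl_sched_setattr(0 /* this thread */, &attr, 0)) {
			perror("sched_setattr");	/* e.g. EPERM, see above */
			return 1;
		}

		for (;;) {
			/* one job's worth of work would go here ... */
			sched_yield();	/* ... then give up the remaining
					 * budget until the next period */
		}
	}

Running one instance per partition, e.g. with "systemd-run --scope -p
Slice=safety1.slice ./dl-reclaim-sketch" (binary name made up), roughly mirrors
the slice setup quoted above; an EPERM from sched_setattr() is the same failure
the cpumask_subset() check produces when the slice's cpuset does not form its
own root domain.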
end of thread, other threads:[~2025-06-26 11:45 UTC | newest]

Thread overview: 35+ messages
2025-04-28 18:04 SCHED_DEADLINE tasks missing their deadline with SCHED_FLAG_RECLAIM jobs in the mix (using GRUB) Marcel Ziswiler
2025-05-02 13:55 ` Juri Lelli
2025-05-02 14:10 ` luca abeni
2025-05-03 13:14 ` Marcel Ziswiler
2025-05-05 15:53 ` luca abeni
2025-05-03 11:14 ` Marcel Ziswiler
2025-05-07 20:25 ` luca abeni
2025-05-19 13:32 ` Marcel Ziswiler
2025-05-20 16:09 ` luca abeni
2025-05-21  9:59 ` Marcel Ziswiler
2025-05-23 19:46 ` luca abeni
2025-05-25 19:29 ` Marcel Ziswiler
2025-05-29  9:39 ` Juri Lelli
2025-06-02 14:59 ` Marcel Ziswiler
2025-06-17 12:21 ` Juri Lelli
2025-06-18 11:24 ` Marcel Ziswiler
2025-06-20  9:29 ` Juri Lelli
2025-06-20  9:37 ` luca abeni
2025-06-20  9:58 ` Juri Lelli
2025-06-20 14:16 ` luca abeni
2025-06-20 15:28 ` Juri Lelli
2025-06-20 16:52 ` luca abeni
2025-06-24  7:49 ` Juri Lelli
2025-06-24 12:59 ` Juri Lelli
2025-06-24 15:00 ` luca abeni
2025-06-25  9:30 ` Juri Lelli
2025-06-25 10:11 ` Juri Lelli
2025-06-25 12:50 ` luca abeni
2025-06-26 10:59 ` Marcel Ziswiler
2025-06-26 11:45 ` Juri Lelli
2025-06-25 15:55 ` Marcel Ziswiler
2025-06-24 13:36 ` luca abeni
2025-05-30  9:21 ` luca abeni
2025-06-03 11:18 ` Marcel Ziswiler
2025-06-06 13:16 ` luca abeni