* [PATCH 0/2] sched/deadline: fix cpusets bandwidth accounting
@ 2016-02-08 12:45 Juri Lelli
2016-02-08 12:45 ` [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth Juri Lelli
2016-02-08 12:45 ` [PATCH 2/2] sched/deadline: rq_{online,offline}_dl for root_domain changes Juri Lelli
0 siblings, 2 replies; 46+ messages in thread
From: Juri Lelli @ 2016-02-08 12:45 UTC (permalink / raw)
To: rostedt
Cc: linux-kernel, peterz, mingo, luca.abeni, vincent.guittot,
wanpeng.li, juri.lelli
Steven Rostedt provided a detailed bug report [1] highlighting the fact that we
lose bandwidth accounting information across scheduling domain changes. This
happens when we make changes to cpusets or when hotplug comes into play (this
has already been pointed out in the past by Wanpeng Li). The problem is that these
operations are destructive w.r.t. the root_domain data structure(s), which is where
we store information about the amount of bandwidth that has been admitted.
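For reference, the root_domain level accounting we lose looks roughly like this
(simplified sketch; names as in kernel/sched/sched.h, bandwidths are runtime/period
in fixed point scaled by 2^20, as returned by to_ratio()):

struct dl_bw {
        raw_spinlock_t lock;
        u64 bw;         /* cap, e.g. 95% of a CPU by default */
        u64 total_bw;   /* sum of the admitted tasks' dl_bw */
};

static inline
void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw)
{
        dl_b->total_bw -= tsk_bw;
}

static inline
void __dl_add(struct dl_bw *dl_b, u64 tsk_bw)
{
        dl_b->total_bw += tsk_bw;
}

Since the dl_bw instance lives in the root_domain (on CONFIG_SMP; otherwise in the
dl_rq), total_bw simply vanishes when cpusets/hotplug tear the root_domain down and
rebuild it, even though the admitted tasks are still there.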
This set proposes to fix that by relying on the rq_{online,offline} calls,
which IIUC were introduced exactly to cope with this kind of problem.
Patches description:
o 01/02 introduces per-rq admitted bandwidth accounting (I stole this from
Luca with his approval, but I changed it a bit; I kept attribution in
the patch, but everything that I could have broken is on me; also,
Vincent posted something similar as part of the sched-freq story)
o 02/02 uses the information provided by 01/02 to save/restore bandwidth
information in the root_domain(s) (see the rough sketch below)
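To give an idea of how 02/02 consumes the per-rq ac_bw introduced by 01/02, its core
is something along these lines in deadline.c (rough sketch only; existing function
bodies and locking are elided, the actual patch is the authoritative version):

static void rq_online_dl(struct rq *rq)
{
        /* ... existing body ... */

        /* restore this rq's admitted bandwidth into the (possibly new) root_domain */
        __dl_add(&rq->rd->dl_bw, rq->dl.ac_bw);
}

static void rq_offline_dl(struct rq *rq)
{
        /* ... existing body ... */

        /* this rq is leaving the root_domain: drop its contribution */
        __dl_clear(&rq->rd->dl_bw, rq->dl.ac_bw);
}

This way total_bw can be rebuilt from the surviving runqueues every time the
scheduling domains (and so the root_domains) are reconfigured.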
Steven already provided instructions on how to reproduce the problem and test
the proposed fix in [1]. I updated my tests accordingly:
https://github.com/jlelli/tests
This set is based on tip/sched/core as of today. I pushed this set to a branch,
together with Steve's sched_debug enhancements to print root_domain bandwidth
and some trace_printks that should help track what these changes are doing:
git://linux-arm.org/linux-jl.git upstream/fixes/dl_rootdomain_account-v1
Changelogs can be improved and the patches require more comments, but I wanted to
send this out early to understand whether we are really fixing the problem and to
see if people think this approach could fly. Also, I expect that there are
still corner cases that we don't cover with the bandwidth tracking.
Testing and feedback are more than welcome.
Best,
- Juri
[1] https://lkml.org/lkml/2016/2/3/966
Juri Lelli (2):
sched/deadline: add per rq tracking of admitted bandwidth
sched/deadline: rq_{online,offline}_dl for root_domain changes
kernel/sched/core.c | 2 ++
kernel/sched/deadline.c | 20 ++++++++++++++++++++
kernel/sched/sched.h | 22 ++++++++++++++++++++++
3 files changed, 44 insertions(+)
--
2.7.0
^ permalink raw reply [flat|nested] 46+ messages in thread* [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-08 12:45 [PATCH 0/2] sched/deadline: fix cpusets bandwidth accounting Juri Lelli @ 2016-02-08 12:45 ` Juri Lelli 2016-02-10 11:32 ` Juri Lelli 2016-02-08 12:45 ` [PATCH 2/2] sched/deadline: rq_{online,offline}_dl for root_domain changes Juri Lelli 1 sibling, 1 reply; 46+ messages in thread From: Juri Lelli @ 2016-02-08 12:45 UTC (permalink / raw) To: rostedt Cc: linux-kernel, peterz, mingo, luca.abeni, vincent.guittot, wanpeng.li, juri.lelli Currently SCHED_DEADLINE scheduling policy tracks bandwidth of tasks that passed admission control at root_domain level only. This creates problems when such data structure(s) are destroyed, when we reconfigure scheduling domains for example. This is part one of two changes required to fix the problem. In this patch we add per-rq tracking of admitted bandwidth. Tasks bring with them their bandwidth contribution when they enter the system and are enqueued for the first time. Contributions are then moved around when migrations happen and removed when tasks die. Per-rq admitted bandwidth information will be leveraged in the next commit to save/restore per-rq bandwidth contribution towards the root domain (using the rq_{online,offline} mechanism). Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Reported-by: Wanpeng Li <wanpeng.li@hotmail.com> Reported-by: Steven Rostedt <rostedt@goodmis.org> [ original patch by ] Signed-off-by: Luca Abeni <luca.abeni@unitn.it> Signed-off-by: Juri Lelli <juri.lelli@arm.com> --- kernel/sched/core.c | 2 ++ kernel/sched/deadline.c | 18 ++++++++++++++++++ kernel/sched/sched.h | 22 ++++++++++++++++++++++ 3 files changed, 42 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 24fcdbf..706ca23 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2449,7 +2449,9 @@ static int dl_overflow(struct task_struct *p, int policy, } else if (dl_policy(policy) && task_has_dl_policy(p) && !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) { __dl_clear(dl_b, p->dl.dl_bw); + __dl_sub_ac(task_rq(p), p->dl.dl_bw); __dl_add(dl_b, new_bw); + __dl_add_ac(task_rq(p), new_bw); err = 0; } else if (!dl_policy(policy) && task_has_dl_policy(p)) { __dl_clear(dl_b, p->dl.dl_bw); diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index cd64c97..2480cab 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -83,6 +83,7 @@ void init_dl_rq(struct dl_rq *dl_rq) #else init_dl_bw(&dl_rq->dl_bw); #endif + dl_rq->ac_bw = 0; } #ifdef CONFIG_SMP @@ -278,8 +279,10 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p * By now the task is replenished and enqueued; migrate it. */ deactivate_task(rq, p, 0); + __dl_sub_ac(rq, p->dl.dl_bw); set_task_cpu(p, later_rq->cpu); activate_task(later_rq, p, 0); + __dl_add_ac(later_rq, p->dl.dl_bw); if (!fallback) resched_curr(later_rq); @@ -506,6 +509,7 @@ static void update_dl_entity(struct sched_dl_entity *dl_se, */ if (dl_se->dl_new) { setup_new_dl_entity(dl_se, pi_se); + __dl_add_ac(rq, dl_se->dl_bw); return; } @@ -955,6 +959,9 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags) return; } + if (p->on_rq == TASK_ON_RQ_MIGRATING) + __dl_add_ac(rq, p->dl.dl_bw); + /* * If p is throttled, we do nothing. 
In fact, if it exhausted * its budget it needs a replenishment and, since it now is on @@ -980,6 +987,8 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags) { update_curr_dl(rq); __dequeue_task_dl(rq, p, flags); + if (p->on_rq == TASK_ON_RQ_MIGRATING) + __dl_sub_ac(rq, p->dl.dl_bw); } /* @@ -1219,6 +1228,8 @@ static void task_dead_dl(struct task_struct *p) { struct dl_bw *dl_b = dl_bw_of(task_cpu(p)); + __dl_sub_ac(task_rq(p), p->dl.dl_bw); + /* * Since we are TASK_DEAD we won't slip out of the domain! */ @@ -1511,8 +1522,10 @@ retry: } deactivate_task(rq, next_task, 0); + __dl_sub_ac(rq, next_task->dl.dl_bw); set_task_cpu(next_task, later_rq->cpu); activate_task(later_rq, next_task, 0); + __dl_add_ac(later_rq, next_task->dl.dl_bw); ret = 1; resched_curr(later_rq); @@ -1599,8 +1612,10 @@ static void pull_dl_task(struct rq *this_rq) resched = true; deactivate_task(src_rq, p, 0); + __dl_sub_ac(src_rq, p->dl.dl_bw); set_task_cpu(p, this_cpu); activate_task(this_rq, p, 0); + __dl_add_ac(this_rq, p->dl.dl_bw); dmin = p->dl.deadline; /* Is there any other task even earlier? */ @@ -1705,6 +1720,9 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p) if (!start_dl_timer(p)) __dl_clear_params(p); + if (dl_prio(p->normal_prio)) + __dl_sub_ac(rq, p->dl.dl_bw); + /* * Since this might be the only -deadline task on the rq, * this is the right place to try to pull some other one diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 10f1637..e754680 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -519,6 +519,14 @@ struct dl_rq { #else struct dl_bw dl_bw; #endif + + /* + * ac_bw keeps track of per rq admitted bandwidth. It only changes + * when a new task is admitted, it dies, it changes scheduling policy + * or is migrated to another rq. It is used to correctly save/resore + * total_bw on root_domain changes. + */ + u64 ac_bw; }; #ifdef CONFIG_SMP @@ -720,6 +728,20 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); #define cpu_curr(cpu) (cpu_rq(cpu)->curr) #define raw_rq() raw_cpu_ptr(&runqueues) +static inline +void __dl_sub_ac(struct rq *rq, u64 tsk_bw) +{ + WARN_ON(rq->dl.ac_bw == 0); + + rq->dl.ac_bw -= tsk_bw; +} + +static inline +void __dl_add_ac(struct rq *rq, u64 tsk_bw) +{ + rq->dl.ac_bw += tsk_bw; +} + static inline u64 __rq_clock_broken(struct rq *rq) { return READ_ONCE(rq->clock); -- 2.7.0 ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-08 12:45 ` [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth Juri Lelli @ 2016-02-10 11:32 ` Juri Lelli 2016-02-10 11:43 ` luca abeni ` (2 more replies) 0 siblings, 3 replies; 46+ messages in thread From: Juri Lelli @ 2016-02-10 11:32 UTC (permalink / raw) To: rostedt Cc: linux-kernel, peterz, mingo, luca.abeni, vincent.guittot, wanpeng.li, Juri Lelli Hi, I've updated this patch since, with a bit more testing and talking with Luca in private, I realized that the previous version didn't manage switching back and forth from SCHED_DEADLINE correctly. Thanks a lot Luca for your feedback (even if not visible on the list). I updated the testing branch accordingly and added a test to my tests that stresses switch-in/switch-out. I don't particularly like the fact that we break the scheduling classes abstraction in __dl_overflow(), so I think a little bit of refactoring is still needed. But that can also happen afterwards, if we fix the problem with root domains. Best, - Juri --->8--- >From 62f70ca3051672dce209e8355cf5eddc9d825c2a Mon Sep 17 00:00:00 2001 From: Juri Lelli <juri.lelli@arm.com> Date: Sat, 6 Feb 2016 12:41:09 +0000 Subject: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth Currently SCHED_DEADLINE scheduling policy tracks bandwidth of tasks that passed admission control at root_domain level only. This creates problems when such data structure(s) are destroyed, when we reconfigure scheduling domains for example. This is part one of two changes required to fix the problem. In this patch we add per-rq tracking of admitted bandwidth. Tasks bring with them their bandwidth contribution when they enter the system and are enqueued for the first time. Contributions are then moved around when migrations happen and removed when tasks die. Per-rq admitted bandwidth information will be leveraged in the next commit to save/restore per-rq bandwidth contribution towards the root domain (using the rq_{online,offline} mechanism). Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Reported-by: Wanpeng Li <wanpeng.li@hotmail.com> Reported-by: Steven Rostedt <rostedt@goodmis.org> [ original patch by ] Signed-off-by: Luca Abeni <luca.abeni@unitn.it> Signed-off-by: Juri Lelli <juri.lelli@arm.com> --- kernel/sched/core.c | 6 +++++- kernel/sched/deadline.c | 16 +++++++++++++++- kernel/sched/sched.h | 22 ++++++++++++++++++++++ 3 files changed, 42 insertions(+), 2 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 9503d59..0ee0ec2 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2432,7 +2432,7 @@ static int dl_overflow(struct task_struct *p, int policy, u64 new_bw = dl_policy(policy) ? 
to_ratio(period, runtime) : 0; int cpus, err = -1; - if (new_bw == p->dl.dl_bw) + if (task_has_dl_policy(p) && new_bw == p->dl.dl_bw) return 0; /* @@ -2445,14 +2445,18 @@ static int dl_overflow(struct task_struct *p, int policy, if (dl_policy(policy) && !task_has_dl_policy(p) && !__dl_overflow(dl_b, cpus, 0, new_bw)) { __dl_add(dl_b, new_bw); + __dl_add_ac(task_rq(p), new_bw); err = 0; } else if (dl_policy(policy) && task_has_dl_policy(p) && !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) { __dl_clear(dl_b, p->dl.dl_bw); + __dl_sub_ac(task_rq(p), p->dl.dl_bw); __dl_add(dl_b, new_bw); + __dl_add_ac(task_rq(p), new_bw); err = 0; } else if (!dl_policy(policy) && task_has_dl_policy(p)) { __dl_clear(dl_b, p->dl.dl_bw); + __dl_sub_ac(task_rq(p), p->dl.dl_bw); err = 0; } raw_spin_unlock(&dl_b->lock); diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index cd64c97..8ac0fee 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -83,6 +83,7 @@ void init_dl_rq(struct dl_rq *dl_rq) #else init_dl_bw(&dl_rq->dl_bw); #endif + dl_rq->ac_bw = 0; } #ifdef CONFIG_SMP @@ -278,8 +279,10 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p * By now the task is replenished and enqueued; migrate it. */ deactivate_task(rq, p, 0); + __dl_sub_ac(rq, p->dl.dl_bw); set_task_cpu(p, later_rq->cpu); activate_task(later_rq, p, 0); + __dl_add_ac(later_rq, p->dl.dl_bw); if (!fallback) resched_curr(later_rq); @@ -597,7 +600,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer) /* * The task might have changed its scheduling policy to something - * different than SCHED_DEADLINE (through switched_fromd_dl()). + * different than SCHED_DEADLINE (through switched_from_dl()). */ if (!dl_task(p)) { __dl_clear_params(p); @@ -955,6 +958,9 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags) return; } + if (p->on_rq == TASK_ON_RQ_MIGRATING) + __dl_add_ac(rq, p->dl.dl_bw); + /* * If p is throttled, we do nothing. In fact, if it exhausted * its budget it needs a replenishment and, since it now is on @@ -980,6 +986,8 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags) { update_curr_dl(rq); __dequeue_task_dl(rq, p, flags); + if (p->on_rq == TASK_ON_RQ_MIGRATING) + __dl_sub_ac(rq, p->dl.dl_bw); } /* @@ -1219,6 +1227,8 @@ static void task_dead_dl(struct task_struct *p) { struct dl_bw *dl_b = dl_bw_of(task_cpu(p)); + __dl_sub_ac(task_rq(p), p->dl.dl_bw); + /* * Since we are TASK_DEAD we won't slip out of the domain! */ @@ -1511,8 +1521,10 @@ retry: } deactivate_task(rq, next_task, 0); + __dl_sub_ac(rq, next_task->dl.dl_bw); set_task_cpu(next_task, later_rq->cpu); activate_task(later_rq, next_task, 0); + __dl_add_ac(later_rq, next_task->dl.dl_bw); ret = 1; resched_curr(later_rq); @@ -1599,8 +1611,10 @@ static void pull_dl_task(struct rq *this_rq) resched = true; deactivate_task(src_rq, p, 0); + __dl_sub_ac(src_rq, p->dl.dl_bw); set_task_cpu(p, this_cpu); activate_task(this_rq, p, 0); + __dl_add_ac(this_rq, p->dl.dl_bw); dmin = p->dl.deadline; /* Is there any other task even earlier? */ diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 10f1637..242907f 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -519,6 +519,14 @@ struct dl_rq { #else struct dl_bw dl_bw; #endif + + /* + * ac_bw keeps track of per rq admitted bandwidth. It only changes + * when a new task is admitted, it dies, it changes scheduling policy + * or is migrated to another rq. 
It is used to correctly save/resore + * total_bw on root_domain changes. + */ + u64 ac_bw; }; #ifdef CONFIG_SMP @@ -720,6 +728,20 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); #define cpu_curr(cpu) (cpu_rq(cpu)->curr) #define raw_rq() raw_cpu_ptr(&runqueues) +static inline +void __dl_sub_ac(struct rq *rq, u64 tsk_bw) +{ + WARN_ON(rq->dl.ac_bw < tsk_bw); + + rq->dl.ac_bw -= tsk_bw; +} + +static inline +void __dl_add_ac(struct rq *rq, u64 tsk_bw) +{ + rq->dl.ac_bw += tsk_bw; +} + static inline u64 __rq_clock_broken(struct rq *rq) { return READ_ONCE(rq->clock); -- 2.7.0 ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-10 11:32 ` Juri Lelli @ 2016-02-10 11:43 ` luca abeni 2016-02-10 11:58 ` Juri Lelli 2016-02-10 12:48 ` [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth luca abeni 2016-02-10 14:37 ` Steven Rostedt 2 siblings, 1 reply; 46+ messages in thread From: luca abeni @ 2016-02-10 11:43 UTC (permalink / raw) To: Juri Lelli Cc: rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li Hi all, On Wed, 10 Feb 2016 11:32:58 +0000 Juri Lelli <juri.lelli@arm.com> wrote: [...] > @@ -2445,14 +2445,18 @@ static int dl_overflow(struct task_struct *p, > int policy, if (dl_policy(policy) && !task_has_dl_policy(p) && > !__dl_overflow(dl_b, cpus, 0, new_bw)) { > __dl_add(dl_b, new_bw); > + __dl_add_ac(task_rq(p), new_bw); > err = 0; > } else if (dl_policy(policy) && task_has_dl_policy(p) && > !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) { > __dl_clear(dl_b, p->dl.dl_bw); > + __dl_sub_ac(task_rq(p), p->dl.dl_bw); > __dl_add(dl_b, new_bw); > + __dl_add_ac(task_rq(p), new_bw); > err = 0; > } else if (!dl_policy(policy) && task_has_dl_policy(p)) { > __dl_clear(dl_b, p->dl.dl_bw); > + __dl_sub_ac(task_rq(p), p->dl.dl_bw); Instead of adding __dl_add_ac() and __dl_sub_ac) calls here, maybe they can be added in switched_to_dl() and switched_from_dl()? I'll test this idea locally, and I'll send an updated patch if it works. Luca ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-10 11:43 ` luca abeni @ 2016-02-10 11:58 ` Juri Lelli 2016-02-19 13:43 ` luca abeni 2016-02-22 10:57 ` [PATCH 0/3] cleanup " Luca Abeni 0 siblings, 2 replies; 46+ messages in thread From: Juri Lelli @ 2016-02-10 11:58 UTC (permalink / raw) To: luca abeni Cc: rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li On 10/02/16 12:43, Luca Abeni wrote: > Hi all, > Hi Luca, > On Wed, 10 Feb 2016 11:32:58 +0000 > Juri Lelli <juri.lelli@arm.com> wrote: > [...] > > @@ -2445,14 +2445,18 @@ static int dl_overflow(struct task_struct *p, > > int policy, if (dl_policy(policy) && !task_has_dl_policy(p) && > > !__dl_overflow(dl_b, cpus, 0, new_bw)) { > > __dl_add(dl_b, new_bw); > > + __dl_add_ac(task_rq(p), new_bw); > > err = 0; > > } else if (dl_policy(policy) && task_has_dl_policy(p) && > > !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) { > > __dl_clear(dl_b, p->dl.dl_bw); > > + __dl_sub_ac(task_rq(p), p->dl.dl_bw); > > __dl_add(dl_b, new_bw); > > + __dl_add_ac(task_rq(p), new_bw); > > err = 0; > > } else if (!dl_policy(policy) && task_has_dl_policy(p)) { > > __dl_clear(dl_b, p->dl.dl_bw); > > + __dl_sub_ac(task_rq(p), p->dl.dl_bw); > > Instead of adding __dl_add_ac() and __dl_sub_ac) calls here, maybe they > can be added in switched_to_dl() and switched_from_dl()? > That might work too yes. I think there is value if we are able to move all __dl_{add,sub}_ac calls in deadline.c. Actually, we should probably move __dl_{add,clear} there as well, so that changes to rq and root_domain happen at the same time. > I'll test this idea locally, and I'll send an updated patch if it works. > Thanks! Will wait for it :). Best, - Juri ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-10 11:58 ` Juri Lelli @ 2016-02-19 13:43 ` luca abeni 2016-02-19 14:20 ` Steven Rostedt 2016-02-22 10:57 ` [PATCH 0/3] cleanup " Luca Abeni 1 sibling, 1 reply; 46+ messages in thread From: luca abeni @ 2016-02-19 13:43 UTC (permalink / raw) To: Juri Lelli Cc: rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li [-- Attachment #1: Type: text/plain, Size: 3661 bytes --] Hi all, On Wed, 10 Feb 2016 11:58:12 +0000 Juri Lelli <juri.lelli@arm.com> wrote: [...] > > > } else if (!dl_policy(policy) && task_has_dl_policy(p)) { > > > __dl_clear(dl_b, p->dl.dl_bw); > > > + __dl_sub_ac(task_rq(p), p->dl.dl_bw); > > > > Instead of adding __dl_add_ac() and __dl_sub_ac) calls here, maybe > > they can be added in switched_to_dl() and switched_from_dl()? > > > > That might work too yes. I think there is value if we are able to move > all __dl_{add,sub}_ac calls in deadline.c. Actually, we should > probably move __dl_{add,clear} there as well, so that changes to rq > and root_domain happen at the same time. > > > I'll test this idea locally, and I'll send an updated patch if it > > works. > > > > Thanks! Will wait for it :). I know there are still open issues with select_fallback_rq() and similar stuff, but as promised I worked on moving the __dl_*_ac() calls from core.c to deadline.c (which is orthogonal respect to the select_fallback_rq() thing). I post these patches so that people can see how the code would look like after moving __dl_*_ac() calls to deadline.c and can provide some comments; the patches are not submitted for inclusion (yet). So, the first attached patch (to be applied over Juri's patch) just moves two __dl_sub_ac() and __dl_add_ac() invocations from dl_overflow() to deadline.c, using the switched_from_dl() and switched_to_dl() methods. This should cover the cases of tasks moving from SCHED_{OTHER,RT} to SCHED_DEADLINE and vice-versa. The case in which the deadline parameters of a task are changed still needs some __dl_* calls in dl_overflow(). These calls are moved to deadline.c by the second attached patch, which uses prio_changed_dl()... Here, prio_changed_dl() needs to know the "old" task utilization, while it is given the "old task priority" (which, for a deadline task is equal to the current priority)... So, I changed the meaning of the third parameter of prio_changed_dl(), from "old priority" to "old utilization". I know that changing the meaning of a parameter between different scheduling classes might be a bad idea... But the "prio_changed" method seems to be designed for priority-based schedulers only, and does not provide good information to the SCHED_DEADLINE scheduling class... The alternative is to change the prototype of the function, adding an "oldbw" (or, better, "oldperiod" and "oldruntime") parameter. Let me know what you think about this. Notice that the second patch also contains two hunks that can be extracted from it (if needed): 1) remove the call to switched_to_dl() from prio_changed_dl(): this call seems to be useless 2) change __sched_setscheduler() to invoke prio_changed for deadline tasks changing their parameters. Currently, this method is not called (because when changing parameters deadline tasks do not change their priority, so new_effective_prio == oldprio). I suspect calling prio_changed is the correct behaviour; of course, this can be done in different ways, let me know your opinion. 
I tested these two patches in various ways, and I do not see regressions respect to Juri's patch (but I expect issues with select_fallback_rq()... BTW, if anyone can provide some testcases for it, I can try to fix that case too). Finally, when playing with switched_to_dl() I realized that it can be used to remove the dl_new field. So, I wrote patch 0003 (attached), which seems to be working correctly, but I am probably missing some important tests. Let me know what you think about it: I think it might be a nice simplification of the code, but as usual I might be missing something :) Thanks, Luca [-- Attachment #2: 0001-Move-some-calls-to-__dl_-sub-add-_ac-from-core.c-to-.patch --] [-- Type: text/x-patch, Size: 2219 bytes --] >From 704d8f7f803ff10d2e84a9fcd4737ce684d7ef78 Mon Sep 17 00:00:00 2001 From: Luca Abeni <luca.abeni@unitn.it> Date: Thu, 11 Feb 2016 13:12:12 +0100 Subject: [PATCH 1/4] Move some calls to __dl_{sub,add}_ac() from core.c to deadline.c This moves some deadline-specific calls from core.c (dl_overflow()) switched_from_dl() and switched_to_dl(). Some __dl_{sub,add}_ac() calls are left in dl_overflow(), to handle the case in which the deadline parameters of a task are changed without changing the scheduling class. --- kernel/sched/core.c | 2 -- kernel/sched/deadline.c | 3 +++ 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 0ee0ec2..a4f08d1 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2445,7 +2445,6 @@ static int dl_overflow(struct task_struct *p, int policy, if (dl_policy(policy) && !task_has_dl_policy(p) && !__dl_overflow(dl_b, cpus, 0, new_bw)) { __dl_add(dl_b, new_bw); - __dl_add_ac(task_rq(p), new_bw); err = 0; } else if (dl_policy(policy) && task_has_dl_policy(p) && !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) { @@ -2456,7 +2455,6 @@ static int dl_overflow(struct task_struct *p, int policy, err = 0; } else if (!dl_policy(policy) && task_has_dl_policy(p)) { __dl_clear(dl_b, p->dl.dl_bw); - __dl_sub_ac(task_rq(p), p->dl.dl_bw); err = 0; } raw_spin_unlock(&dl_b->lock); diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 8ac0fee..4cc713a 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1710,6 +1710,8 @@ void __init init_sched_dl_class(void) static void switched_from_dl(struct rq *rq, struct task_struct *p) { + __dl_sub_ac(task_rq(p), p->dl.dl_bw); + /* * Start the deadline timer; if we switch back to dl before this we'll * continue consuming our current CBS slice. If we stay outside of @@ -1736,6 +1738,7 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p) */ static void switched_to_dl(struct rq *rq, struct task_struct *p) { + __dl_add_ac(rq, p->dl.dl_bw); if (task_on_rq_queued(p) && rq->curr != p) { #ifdef CONFIG_SMP if (p->nr_cpus_allowed > 1 && rq->dl.overloaded) -- 2.5.0 [-- Attachment #3: 0002-Move-the-remaining-__dl_-sub-add-_ac-calls-from-core.patch --] [-- Type: text/x-patch, Size: 3602 bytes --] >From e9703fcf14722111e16657ba4e9b08055fb4ba30 Mon Sep 17 00:00:00 2001 From: Luca Abeni <luca.abeni@unitn.it> Date: Thu, 11 Feb 2016 15:33:23 +0100 Subject: [PATCH 2/4] Move the remaining __dl_{sub,add}_ac() calls from core.c to deadline.c To move these calls from dl_overflow() to deadline.c, we must change the meaning of the third parameter of prio_changed_dl(). Instead of passing the "old priority" (which is always equal to the current one, for SCHED_DEADLINE) we pass the old utilization. 
An alternative approach is to change the prototype of the "prio_changed" method of the scheduling class. --- kernel/sched/core.c | 10 ++++++---- kernel/sched/deadline.c | 10 +++++++--- 2 files changed, 13 insertions(+), 7 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a4f08d1..5dc12db 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2449,9 +2449,7 @@ static int dl_overflow(struct task_struct *p, int policy, } else if (dl_policy(policy) && task_has_dl_policy(p) && !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) { __dl_clear(dl_b, p->dl.dl_bw); - __dl_sub_ac(task_rq(p), p->dl.dl_bw); __dl_add(dl_b, new_bw); - __dl_add_ac(task_rq(p), new_bw); err = 0; } else if (!dl_policy(policy) && task_has_dl_policy(p)) { __dl_clear(dl_b, p->dl.dl_bw); @@ -3522,6 +3520,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio) } else p->dl.dl_boosted = 0; p->sched_class = &dl_sched_class; + oldprio = 0; } else if (rt_prio(prio)) { if (dl_prio(oldprio)) p->dl.dl_boosted = 0; @@ -3891,7 +3890,7 @@ static int __sched_setscheduler(struct task_struct *p, { int newprio = dl_policy(attr->sched_policy) ? MAX_DL_PRIO - 1 : MAX_RT_PRIO - 1 - attr->sched_priority; - int retval, oldprio, oldpolicy = -1, queued, running; + int retval, oldprio, oldbw, oldpolicy = -1, queued, running; int new_effective_prio, policy = attr->sched_policy; unsigned long flags; const struct sched_class *prev_class; @@ -4069,6 +4068,7 @@ change: p->sched_reset_on_fork = reset_on_fork; oldprio = p->prio; + oldbw = p->dl.dl_bw; if (pi) { /* @@ -4081,6 +4081,8 @@ change: new_effective_prio = rt_mutex_get_effective_prio(p, newprio); if (new_effective_prio == oldprio) { __setscheduler_params(p, attr); + if (p->sched_class == &dl_sched_class) + p->sched_class->prio_changed(rq, p, oldbw); task_rq_unlock(rq, p, &flags); return 0; } @@ -4110,7 +4112,7 @@ change: enqueue_task(rq, p, enqueue_flags); } - check_class_changed(rq, p, prev_class, oldprio); + check_class_changed(rq, p, prev_class, ((prev_class == &dl_sched_class) && (p->sched_class == &dl_sched_class)) ? oldbw : oldprio); preempt_disable(); /* avoid rq from going away on us */ task_rq_unlock(rq, p, &flags); diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 4cc713a..959e7b7 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1757,8 +1757,13 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p) * a push or pull operation might be needed. 
*/ static void prio_changed_dl(struct rq *rq, struct task_struct *p, - int oldprio) + int oldbw) { + if (oldbw) { + __dl_sub_ac(rq, oldbw); + __dl_add_ac(rq, p->dl.dl_bw); + } + if (task_on_rq_queued(p) || rq->curr == p) { #ifdef CONFIG_SMP /* @@ -1785,8 +1790,7 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p, */ resched_curr(rq); #endif /* CONFIG_SMP */ - } else - switched_to_dl(rq, p); + } } const struct sched_class dl_sched_class = { -- 2.5.0 [-- Attachment #4: 0003-Remove-dl_new.patch --] [-- Type: text/x-patch, Size: 3200 bytes --] >From 12466b28951fc9327af66e450d046cd232a41354 Mon Sep 17 00:00:00 2001 From: Luca Abeni <luca.abeni@unitn.it> Date: Fri, 19 Feb 2016 12:22:59 +0100 Subject: [PATCH 3/4] Remove dl_new switched_to_dl() can be used instead --- kernel/sched/core.c | 1 - kernel/sched/deadline.c | 28 +++++----------------------- 2 files changed, 5 insertions(+), 24 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 5dc12db..4246b1b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2183,7 +2183,6 @@ void __dl_clear_params(struct task_struct *p) dl_se->dl_bw = 0; dl_se->dl_throttled = 0; - dl_se->dl_new = 1; dl_se->dl_yielded = 0; } diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 959e7b7..12cb934 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -355,8 +355,6 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq = dl_rq_of_se(dl_se); struct rq *rq = rq_of_dl_rq(dl_rq); - WARN_ON(!dl_se->dl_new || dl_se->dl_throttled); - /* * We use the regular wall clock time to set deadlines in the * future; in fact, we must consider execution overheads (time @@ -364,7 +362,6 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se, */ dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline; dl_se->runtime = pi_se->dl_runtime; - dl_se->dl_new = 0; } /* @@ -503,15 +500,6 @@ static void update_dl_entity(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq = dl_rq_of_se(dl_se); struct rq *rq = rq_of_dl_rq(dl_rq); - /* - * The arrival of a new instance needs special treatment, i.e., - * the actual scheduling parameters have to be "renewed". - */ - if (dl_se->dl_new) { - setup_new_dl_entity(dl_se, pi_se); - return; - } - if (dl_time_before(dl_se->deadline, rq_clock(rq)) || dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) { dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline; @@ -608,16 +596,6 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer) } /* - * This is possible if switched_from_dl() raced against a running - * callback that took the above !dl_task() path and we've since then - * switched back into SCHED_DEADLINE. - * - * There's nothing to do except drop our task reference. - */ - if (dl_se->dl_new) - goto unlock; - - /* * The task might have been boosted by someone else and might be in the * boosting/deboosting path, its not throttled. */ @@ -920,7 +898,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, * parameters of the task might need updating. Otherwise, * we want a replenishment of its runtime. 
*/ - if (dl_se->dl_new || flags & ENQUEUE_WAKEUP) + if (flags & ENQUEUE_WAKEUP) update_dl_entity(dl_se, pi_se); else if (flags & ENQUEUE_REPLENISH) replenish_dl_entity(dl_se, pi_se); @@ -1738,6 +1716,10 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p) */ static void switched_to_dl(struct rq *rq, struct task_struct *p) { + if (p->dl.deadline <= rq_clock(rq)) { + setup_new_dl_entity(&p->dl, &p->dl); + } + __dl_add_ac(rq, p->dl.dl_bw); if (task_on_rq_queued(p) && rq->curr != p) { #ifdef CONFIG_SMP -- 2.5.0 ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-19 13:43 ` luca abeni @ 2016-02-19 14:20 ` Steven Rostedt 2016-02-19 14:53 ` luca abeni 0 siblings, 1 reply; 46+ messages in thread From: Steven Rostedt @ 2016-02-19 14:20 UTC (permalink / raw) To: luca abeni Cc: Juri Lelli, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li On Fri, 19 Feb 2016 14:43:47 +0100 luca abeni <luca.abeni@unitn.it> wrote: > So, the first attached patch (to be applied over Juri's patch) just > moves two __dl_sub_ac() and __dl_add_ac() invocations from > dl_overflow() to deadline.c, using the switched_from_dl() and > switched_to_dl() methods. This should cover the cases of tasks moving > from SCHED_{OTHER,RT} to SCHED_DEADLINE and vice-versa. The case in > which the deadline parameters of a task are changed still needs some > __dl_* calls in dl_overflow(). Hi Luca, Please send the patches in a patch series. It is very hard to review patches that are attachments. And our scripts are made to apply patches from mailing lists. Having attachments just makes the job harder. -- Steve ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-19 14:20 ` Steven Rostedt @ 2016-02-19 14:53 ` luca abeni 2016-02-19 14:57 ` Steven Rostedt 2016-02-22 11:03 ` luca abeni 0 siblings, 2 replies; 46+ messages in thread From: luca abeni @ 2016-02-19 14:53 UTC (permalink / raw) To: Steven Rostedt Cc: Juri Lelli, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li On Fri, 19 Feb 2016 09:20:08 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > On Fri, 19 Feb 2016 14:43:47 +0100 > luca abeni <luca.abeni@unitn.it> wrote: > > > > So, the first attached patch (to be applied over Juri's patch) just > > moves two __dl_sub_ac() and __dl_add_ac() invocations from > > dl_overflow() to deadline.c, using the switched_from_dl() and > > switched_to_dl() methods. This should cover the cases of tasks > > moving from SCHED_{OTHER,RT} to SCHED_DEADLINE and vice-versa. The > > case in which the deadline parameters of a task are changed still > > needs some __dl_* calls in dl_overflow(). > > Hi Luca, > > Please send the patches in a patch series. It is very hard to review > patches that are attachments. And our scripts are made to apply > patches from mailing lists. Having attachments just makes the job > harder. Sorry about that; I was in hurry, and I tried to do the quickest thing... I'll resend the patches in a more appropriate way on Monday. Luca ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-19 14:53 ` luca abeni @ 2016-02-19 14:57 ` Steven Rostedt 2016-02-22 11:03 ` luca abeni 1 sibling, 0 replies; 46+ messages in thread From: Steven Rostedt @ 2016-02-19 14:57 UTC (permalink / raw) To: luca abeni Cc: Juri Lelli, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li On Fri, 19 Feb 2016 15:53:06 +0100 luca abeni <luca.abeni@unitn.it> wrote: > Sorry about that; I was in hurry, and I tried to do the quickest > thing... I'll resend the patches in a more appropriate way on Monday. Thanks! No rush. -- Steve ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-19 14:53 ` luca abeni 2016-02-19 14:57 ` Steven Rostedt @ 2016-02-22 11:03 ` luca abeni 1 sibling, 0 replies; 46+ messages in thread From: luca abeni @ 2016-02-22 11:03 UTC (permalink / raw) To: Steven Rostedt Cc: Juri Lelli, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li On Fri, 19 Feb 2016 15:53:06 +0100 luca abeni <luca.abeni@unitn.it> wrote: [...] > > Please send the patches in a patch series. It is very hard to review > > patches that are attachments. And our scripts are made to apply > > patches from mailing lists. Having attachments just makes the job > > harder. > > Sorry about that; I was in hurry, and I tried to do the quickest > thing... I'll resend the patches in a more appropriate way on Monday. Ok; I just re-sent the patches with git sent-email. Hopefully, I managed to send them as a reply to Juri's original patch, so that the context of the discussion is not lost. I just realized that the subjects of the emails say "[*/4]" instead of "[*/3]" because of a stupid error of mine; I hope this is not a problem (the patches are sent only for discussion). If this is a problem, I can resend, of course. Thanks (and sorry for the mess), Luca ^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH 0/3] cleanup per rq tracking of admitted bandwidth 2016-02-10 11:58 ` Juri Lelli 2016-02-19 13:43 ` luca abeni @ 2016-02-22 10:57 ` Luca Abeni 2016-02-22 10:57 ` [PATCH 1/4] Move some calls to __dl_{sub,add}_ac() from core.c to deadline.c Luca Abeni ` (2 more replies) 1 sibling, 3 replies; 46+ messages in thread From: Luca Abeni @ 2016-02-22 10:57 UTC (permalink / raw) To: Juri Lelli Cc: rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li, Luca Abeni I know there are still open issues with select_fallback_rq() and similar stuff, but as promised I worked on moving the __dl_*_ac() calls from core.c to deadline.c (which is orthogonal respect to the select_fallback_rq() thing). I post these patches so that people can see how the code would look like after moving __dl_*_ac() calls to deadline.c and can provide some comments; the patches are not submitted for inclusion (yet). All the following patches have to be applied over Juri's patch. So, patch 0001 just moves two __dl_sub_ac() and __dl_add_ac() invocations from dl_overflow() to deadline.c, using the switched_from_dl() and switched_to_dl() methods. This should cover the cases of tasks moving from SCHED_{OTHER,RT} to SCHED_DEADLINE and vice-versa. The case in which the deadline parameters of a task are changed still needs some __dl_* calls in dl_overflow(). These calls are moved to deadline.c patch 0002, which uses prio_changed_dl()... Here, prio_changed_dl() needs to know the "old" task utilization, while it is given the "old task priority" (which, for a deadline task is equal to the current priority)... So, I changed the meaning of the third parameter of prio_changed_dl(), from "old priority" to "old utilization". I know that changing the meaning of a parameter between different scheduling classes might be a bad idea... But the "prio_changed" method seems to be designed for priority-based schedulers only, and does not provide good information to the SCHED_DEADLINE scheduling class... The alternative is to change the prototype of the function, adding an "oldbw" (or, better, "oldperiod" and "oldruntime") parameter. Let me know what you think about this. Notice that patch 0002 also contains two hunks that can be extracted from it (if needed): 1) remove the call to switched_to_dl() from prio_changed_dl(): this call seems to be useless 2) change __sched_setscheduler() to invoke prio_changed for deadline tasks changing their parameters. Currently, this method is not called (because when changing parameters deadline tasks do not change their priority, so new_effective_prio == oldprio). I suspect calling prio_changed is the correct behaviour; of course, this can be done in different ways, let me know your opinion. I tested patches 0001+0002 in various ways, and I do not see regressions respect to Juri's patch (but I expect issues with select_fallback_rq()... BTW, if anyone can provide some testcases for it, I can try to fix that case too). Finally, when playing with switched_to_dl() I realized that it can be used to remove the dl_new field. So, I wrote patch 0003, which seems to be working correctly, but I am probably missing some important tests. 
Let me know what you think about it: I think it might be a nice simplification of the code, but as usual I might be missing something :) Luca Abeni (4): Move some calls to __dl_{sub,add}_ac() from core.c to deadline.c Move the remaining __dl_{sub,add}_ac() calls from core.c to deadline.c Remove dl_new kernel/sched/core.c | 13 ++++++------- kernel/sched/deadline.c | 43 +++++++++++++++++-------------------------- kernel/sched/sched.h | 5 +++++ 3 files changed, 28 insertions(+), 33 deletions(-) -- 2.5.0 ^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH 1/4] Move some calls to __dl_{sub,add}_ac() from core.c to deadline.c 2016-02-22 10:57 ` [PATCH 0/3] cleanup " Luca Abeni @ 2016-02-22 10:57 ` Luca Abeni 2016-02-22 10:57 ` [PATCH 2/4] Move the remaining __dl_{sub,add}_ac() calls " Luca Abeni 2016-02-22 10:57 ` [PATCH 3/4] Remove dl_new Luca Abeni 2 siblings, 0 replies; 46+ messages in thread From: Luca Abeni @ 2016-02-22 10:57 UTC (permalink / raw) To: Juri Lelli Cc: rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li, Luca Abeni This moves some deadline-specific calls from core.c (dl_overflow()) switched_from_dl() and switched_to_dl(). Some __dl_{sub,add}_ac() calls are left in dl_overflow(), to handle the case in which the deadline parameters of a task are changed without changing the scheduling class. --- kernel/sched/core.c | 2 -- kernel/sched/deadline.c | 3 +++ 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 0ee0ec2..a4f08d1 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2445,7 +2445,6 @@ static int dl_overflow(struct task_struct *p, int policy, if (dl_policy(policy) && !task_has_dl_policy(p) && !__dl_overflow(dl_b, cpus, 0, new_bw)) { __dl_add(dl_b, new_bw); - __dl_add_ac(task_rq(p), new_bw); err = 0; } else if (dl_policy(policy) && task_has_dl_policy(p) && !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) { @@ -2456,7 +2455,6 @@ static int dl_overflow(struct task_struct *p, int policy, err = 0; } else if (!dl_policy(policy) && task_has_dl_policy(p)) { __dl_clear(dl_b, p->dl.dl_bw); - __dl_sub_ac(task_rq(p), p->dl.dl_bw); err = 0; } raw_spin_unlock(&dl_b->lock); diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 8ac0fee..4cc713a 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1710,6 +1710,8 @@ void __init init_sched_dl_class(void) static void switched_from_dl(struct rq *rq, struct task_struct *p) { + __dl_sub_ac(task_rq(p), p->dl.dl_bw); + /* * Start the deadline timer; if we switch back to dl before this we'll * continue consuming our current CBS slice. If we stay outside of @@ -1736,6 +1738,7 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p) */ static void switched_to_dl(struct rq *rq, struct task_struct *p) { + __dl_add_ac(rq, p->dl.dl_bw); if (task_on_rq_queued(p) && rq->curr != p) { #ifdef CONFIG_SMP if (p->nr_cpus_allowed > 1 && rq->dl.overloaded) -- 2.5.0 ^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH 2/4] Move the remaining __dl_{sub,add}_ac() calls from core.c to deadline.c 2016-02-22 10:57 ` [PATCH 0/3] cleanup " Luca Abeni 2016-02-22 10:57 ` [PATCH 1/4] Move some calls to __dl_{sub,add}_ac() from core.c to deadline.c Luca Abeni @ 2016-02-22 10:57 ` Luca Abeni 2016-02-22 10:57 ` [PATCH 3/4] Remove dl_new Luca Abeni 2 siblings, 0 replies; 46+ messages in thread From: Luca Abeni @ 2016-02-22 10:57 UTC (permalink / raw) To: Juri Lelli Cc: rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li, Luca Abeni To move these calls from dl_overflow() to deadline.c, we must change the meaning of the third parameter of prio_changed_dl(). Instead of passing the "old priority" (which is always equal to the current one, for SCHED_DEADLINE) we pass the old utilization. An alternative approach is to change the prototype of the "prio_changed" method of the scheduling class. --- kernel/sched/core.c | 10 ++++++---- kernel/sched/deadline.c | 10 +++++++--- 2 files changed, 13 insertions(+), 7 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a4f08d1..5dc12db 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2449,9 +2449,7 @@ static int dl_overflow(struct task_struct *p, int policy, } else if (dl_policy(policy) && task_has_dl_policy(p) && !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) { __dl_clear(dl_b, p->dl.dl_bw); - __dl_sub_ac(task_rq(p), p->dl.dl_bw); __dl_add(dl_b, new_bw); - __dl_add_ac(task_rq(p), new_bw); err = 0; } else if (!dl_policy(policy) && task_has_dl_policy(p)) { __dl_clear(dl_b, p->dl.dl_bw); @@ -3522,6 +3520,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio) } else p->dl.dl_boosted = 0; p->sched_class = &dl_sched_class; + oldprio = 0; } else if (rt_prio(prio)) { if (dl_prio(oldprio)) p->dl.dl_boosted = 0; @@ -3891,7 +3890,7 @@ static int __sched_setscheduler(struct task_struct *p, { int newprio = dl_policy(attr->sched_policy) ? MAX_DL_PRIO - 1 : MAX_RT_PRIO - 1 - attr->sched_priority; - int retval, oldprio, oldpolicy = -1, queued, running; + int retval, oldprio, oldbw, oldpolicy = -1, queued, running; int new_effective_prio, policy = attr->sched_policy; unsigned long flags; const struct sched_class *prev_class; @@ -4069,6 +4068,7 @@ change: p->sched_reset_on_fork = reset_on_fork; oldprio = p->prio; + oldbw = p->dl.dl_bw; if (pi) { /* @@ -4081,6 +4081,8 @@ change: new_effective_prio = rt_mutex_get_effective_prio(p, newprio); if (new_effective_prio == oldprio) { __setscheduler_params(p, attr); + if (p->sched_class == &dl_sched_class) + p->sched_class->prio_changed(rq, p, oldbw); task_rq_unlock(rq, p, &flags); return 0; } @@ -4110,7 +4112,7 @@ change: enqueue_task(rq, p, enqueue_flags); } - check_class_changed(rq, p, prev_class, oldprio); + check_class_changed(rq, p, prev_class, ((prev_class == &dl_sched_class) && (p->sched_class == &dl_sched_class)) ? oldbw : oldprio); preempt_disable(); /* avoid rq from going away on us */ task_rq_unlock(rq, p, &flags); diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 4cc713a..959e7b7 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1757,8 +1757,13 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p) * a push or pull operation might be needed. 
*/ static void prio_changed_dl(struct rq *rq, struct task_struct *p, - int oldprio) + int oldbw) { + if (oldbw) { + __dl_sub_ac(rq, oldbw); + __dl_add_ac(rq, p->dl.dl_bw); + } + if (task_on_rq_queued(p) || rq->curr == p) { #ifdef CONFIG_SMP /* @@ -1785,8 +1790,7 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p, */ resched_curr(rq); #endif /* CONFIG_SMP */ - } else - switched_to_dl(rq, p); + } } const struct sched_class dl_sched_class = { -- 2.5.0 ^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH 3/4] Remove dl_new 2016-02-22 10:57 ` [PATCH 0/3] cleanup " Luca Abeni 2016-02-22 10:57 ` [PATCH 1/4] Move some calls to __dl_{sub,add}_ac() from core.c to deadline.c Luca Abeni 2016-02-22 10:57 ` [PATCH 2/4] Move the remaining __dl_{sub,add}_ac() calls " Luca Abeni @ 2016-02-22 10:57 ` Luca Abeni 2016-02-23 15:42 ` Peter Zijlstra 2 siblings, 1 reply; 46+ messages in thread From: Luca Abeni @ 2016-02-22 10:57 UTC (permalink / raw) To: Juri Lelli Cc: rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li, Luca Abeni switched_to_dl() can be used instead --- kernel/sched/core.c | 1 - kernel/sched/deadline.c | 28 +++++----------------------- 2 files changed, 5 insertions(+), 24 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 5dc12db..4246b1b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2183,7 +2183,6 @@ void __dl_clear_params(struct task_struct *p) dl_se->dl_bw = 0; dl_se->dl_throttled = 0; - dl_se->dl_new = 1; dl_se->dl_yielded = 0; } diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 959e7b7..12cb934 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -355,8 +355,6 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq = dl_rq_of_se(dl_se); struct rq *rq = rq_of_dl_rq(dl_rq); - WARN_ON(!dl_se->dl_new || dl_se->dl_throttled); - /* * We use the regular wall clock time to set deadlines in the * future; in fact, we must consider execution overheads (time @@ -364,7 +362,6 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se, */ dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline; dl_se->runtime = pi_se->dl_runtime; - dl_se->dl_new = 0; } /* @@ -503,15 +500,6 @@ static void update_dl_entity(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq = dl_rq_of_se(dl_se); struct rq *rq = rq_of_dl_rq(dl_rq); - /* - * The arrival of a new instance needs special treatment, i.e., - * the actual scheduling parameters have to be "renewed". - */ - if (dl_se->dl_new) { - setup_new_dl_entity(dl_se, pi_se); - return; - } - if (dl_time_before(dl_se->deadline, rq_clock(rq)) || dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) { dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline; @@ -608,16 +596,6 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer) } /* - * This is possible if switched_from_dl() raced against a running - * callback that took the above !dl_task() path and we've since then - * switched back into SCHED_DEADLINE. - * - * There's nothing to do except drop our task reference. - */ - if (dl_se->dl_new) - goto unlock; - - /* * The task might have been boosted by someone else and might be in the * boosting/deboosting path, its not throttled. */ @@ -920,7 +898,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, * parameters of the task might need updating. Otherwise, * we want a replenishment of its runtime. */ - if (dl_se->dl_new || flags & ENQUEUE_WAKEUP) + if (flags & ENQUEUE_WAKEUP) update_dl_entity(dl_se, pi_se); else if (flags & ENQUEUE_REPLENISH) replenish_dl_entity(dl_se, pi_se); @@ -1738,6 +1716,10 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p) */ static void switched_to_dl(struct rq *rq, struct task_struct *p) { + if (p->dl.deadline <= rq_clock(rq)) { + setup_new_dl_entity(&p->dl, &p->dl); + } + __dl_add_ac(rq, p->dl.dl_bw); if (task_on_rq_queued(p) && rq->curr != p) { #ifdef CONFIG_SMP -- 2.5.0 ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH 3/4] Remove dl_new 2016-02-22 10:57 ` [PATCH 3/4] Remove dl_new Luca Abeni @ 2016-02-23 15:42 ` Peter Zijlstra 2016-02-24 13:53 ` luca abeni 0 siblings, 1 reply; 46+ messages in thread From: Peter Zijlstra @ 2016-02-23 15:42 UTC (permalink / raw) To: Luca Abeni Cc: Juri Lelli, rostedt, linux-kernel, mingo, vincent.guittot, wanpeng.li On Mon, Feb 22, 2016 at 11:57:04AM +0100, Luca Abeni wrote: > switched_to_dl() can be used instead This seems unrelated to the other patches, and looks like a nice cleanup. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/4] Remove dl_new 2016-02-23 15:42 ` Peter Zijlstra @ 2016-02-24 13:53 ` luca abeni 2016-02-25 9:46 ` Juri Lelli 0 siblings, 1 reply; 46+ messages in thread From: luca abeni @ 2016-02-24 13:53 UTC (permalink / raw) To: Peter Zijlstra Cc: Juri Lelli, rostedt, linux-kernel, mingo, vincent.guittot, wanpeng.li On Tue, 23 Feb 2016 16:42:49 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > On Mon, Feb 22, 2016 at 11:57:04AM +0100, Luca Abeni wrote: > > switched_to_dl() can be used instead > > This seems unrelated to the other patches, and looks like a nice > cleanup. Ok; I'll rebase the patch on master and I'll run some more serious tests (Juri, your tests repository is available on github, right? Can I assume that if the patch passes your tests then it is ok?). If everything goes well, I'll submit the patch. Thanks, Luca ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/4] Remove dl_new 2016-02-24 13:53 ` luca abeni @ 2016-02-25 9:46 ` Juri Lelli 2016-03-03 9:03 ` luca abeni 0 siblings, 1 reply; 46+ messages in thread From: Juri Lelli @ 2016-02-25 9:46 UTC (permalink / raw) To: luca abeni Cc: Peter Zijlstra, rostedt, linux-kernel, mingo, vincent.guittot, wanpeng.li Hi, On 24/02/16 14:53, luca abeni wrote: > On Tue, 23 Feb 2016 16:42:49 +0100 > Peter Zijlstra <peterz@infradead.org> wrote: > > > On Mon, Feb 22, 2016 at 11:57:04AM +0100, Luca Abeni wrote: > > > switched_to_dl() can be used instead > > > > This seems unrelated to the other patches, and looks like a nice > > cleanup. > > Ok; I'll rebase the patch on master and I'll run some more serious > tests (Juri, your tests repository is available on github, right? Can I > assume that if the patch passes your tests then it is ok?). > If everything goes well, I'll submit the patch. > Yes, tests reside here https://github.com/jlelli/tests. They should give you some confidence that things are not completely broken, but of course they might be still broken and you do not notice by running such tests. :-) Please run also the PI related onces, they might be important in this case. Anyway, I'd say that, once you are fine with it, you can submit the patch an then we have a second look at it. Thanks, - Juri ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/4] Remove dl_new 2016-02-25 9:46 ` Juri Lelli @ 2016-03-03 9:03 ` luca abeni 2016-03-03 9:28 ` Juri Lelli 0 siblings, 1 reply; 46+ messages in thread From: luca abeni @ 2016-03-03 9:03 UTC (permalink / raw) To: Juri Lelli Cc: Peter Zijlstra, rostedt, linux-kernel, mingo, vincent.guittot, wanpeng.li On Thu, 25 Feb 2016 09:46:55 +0000 Juri Lelli <juri.lelli@arm.com> wrote: > Hi, > > On 24/02/16 14:53, luca abeni wrote: > > On Tue, 23 Feb 2016 16:42:49 +0100 > > Peter Zijlstra <peterz@infradead.org> wrote: > > > > > On Mon, Feb 22, 2016 at 11:57:04AM +0100, Luca Abeni wrote: > > > > switched_to_dl() can be used instead > > > > > > This seems unrelated to the other patches, and looks like a nice > > > cleanup. > > > > Ok; I'll rebase the patch on master and I'll run some more serious > > tests (Juri, your tests repository is available on github, right? > > Can I assume that if the patch passes your tests then it is ok?). > > If everything goes well, I'll submit the patch. > > > > Yes, tests reside here https://github.com/jlelli/tests. They should > give you some confidence that things are not completely broken, but > of course they might be still broken and you do not notice by running > such tests. :-) I am trying these tests, but... Some scripts use "schedtool"; where can I find a proper version of it (supporting SCHED_DEADLINE)? I tried https://github.com/scheduler-tools/schedtool-dl but it does not seem to work correctly... Thanks, Luca ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/4] Remove dl_new 2016-03-03 9:03 ` luca abeni @ 2016-03-03 9:28 ` Juri Lelli 2016-03-03 14:23 ` Steven Rostedt 0 siblings, 1 reply; 46+ messages in thread From: Juri Lelli @ 2016-03-03 9:28 UTC (permalink / raw) To: luca abeni Cc: Peter Zijlstra, rostedt, linux-kernel, mingo, vincent.guittot, wanpeng.li Hi Luca, On 03/03/16 10:03, Luca Abeni wrote: > On Thu, 25 Feb 2016 09:46:55 +0000 > Juri Lelli <juri.lelli@arm.com> wrote: > > > Hi, > > > > On 24/02/16 14:53, luca abeni wrote: > > > On Tue, 23 Feb 2016 16:42:49 +0100 > > > Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > > On Mon, Feb 22, 2016 at 11:57:04AM +0100, Luca Abeni wrote: > > > > > switched_to_dl() can be used instead > > > > > > > > This seems unrelated to the other patches, and looks like a nice > > > > cleanup. > > > > > > Ok; I'll rebase the patch on master and I'll run some more serious > > > tests (Juri, your tests repository is available on github, right? > > > Can I assume that if the patch passes your tests then it is ok?). > > > If everything goes well, I'll submit the patch. > > > > > > > Yes, tests reside here https://github.com/jlelli/tests. They should > > give you some confidence that things are not completely broken, but > > of course they might be still broken and you do not notice by running > > such tests. :-) > I am trying these tests, but... Some scripts use "schedtool"; where can > I find a proper version of it (supporting SCHED_DEADLINE)? > I tried https://github.com/scheduler-tools/schedtool-dl but it does not > seem to work correctly... > That's the one that I use, and I'm not seeing any problems with it. I'll send you the binary in private. Best, - Juri ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/4] Remove dl_new 2016-03-03 9:28 ` Juri Lelli @ 2016-03-03 14:23 ` Steven Rostedt 2016-03-03 14:31 ` luca abeni 2016-03-03 16:12 ` Juri Lelli 0 siblings, 2 replies; 46+ messages in thread From: Steven Rostedt @ 2016-03-03 14:23 UTC (permalink / raw) To: Juri Lelli Cc: luca abeni, Peter Zijlstra, linux-kernel, mingo, vincent.guittot, wanpeng.li On Thu, 3 Mar 2016 09:28:01 +0000 Juri Lelli <juri.lelli@arm.com> wrote: > That's the one that I use, and I'm not seeing any problems with it. I'll > send you the binary in private. That's the one I use too. BTW, Juri, do you plan on submitting patches to schedtool upstream? -- Steve ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/4] Remove dl_new 2016-03-03 14:23 ` Steven Rostedt @ 2016-03-03 14:31 ` luca abeni 2016-03-03 16:12 ` Juri Lelli 1 sibling, 0 replies; 46+ messages in thread From: luca abeni @ 2016-03-03 14:31 UTC (permalink / raw) To: Steven Rostedt Cc: Juri Lelli, Peter Zijlstra, linux-kernel, mingo, vincent.guittot, wanpeng.li Hi Steven, On Thu, 3 Mar 2016 09:23:44 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > On Thu, 3 Mar 2016 09:28:01 +0000 > Juri Lelli <juri.lelli@arm.com> wrote: > > > That's the one that I use, and I'm not seeing any problems with it. > > I'll send you the binary in private. > > That's the one I use too. Juri provided me with a working binary, and I think I found the cause of the issue: it works fine on 64bit systems, but fails on 32bit systems. I think the issue is in the sched_setattr() definition present in syscall_magic.h (which ignores the "flags" parameter). Luca > BTW, Juri, do you plan on submitting patches > to schedtool upstream? ^ permalink raw reply [flat|nested] 46+ messages in thread
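In case others trip over the same 32-bit failure, below is a minimal, hedged sketch of a raw sched_setattr() wrapper that forwards the flags argument; it is only an illustration (not the schedtool or syscall_magic.h code), the struct layout follows include/uapi/linux/sched.h, and the point is that the kernel rejects a non-zero flags value with -EINVAL, so a wrapper that drops the parameter ends up passing whatever happens to be in that register:

/* Illustrative wrapper only, not the syscall_magic.h code. */
#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>

/* Same layout as include/uapi/linux/sched.h (v3.14 and later). */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* ns */
	uint64_t sched_deadline;	/* ns */
	uint64_t sched_period;		/* ns */
};

static int sched_setattr(pid_t pid, const struct sched_attr *attr,
			 unsigned int flags)
{
	/* Forward 'flags' instead of dropping it, so the third syscall
	 * argument is well defined on every ABI. */
	return syscall(__NR_sched_setattr, pid, attr, flags);
}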
* Re: [PATCH 3/4] Remove dl_new 2016-03-03 14:23 ` Steven Rostedt 2016-03-03 14:31 ` luca abeni @ 2016-03-03 16:12 ` Juri Lelli 1 sibling, 0 replies; 46+ messages in thread From: Juri Lelli @ 2016-03-03 16:12 UTC (permalink / raw) To: Steven Rostedt Cc: luca abeni, Peter Zijlstra, linux-kernel, mingo, vincent.guittot, wanpeng.li Hi Steve, On 03/03/16 09:23, Steven Rostedt wrote: > On Thu, 3 Mar 2016 09:28:01 +0000 > Juri Lelli <juri.lelli@arm.com> wrote: > > > That's the one that I use, and I'm not seeing any problems with it. I'll > > send you the binary in private. > > That's the one I use too. BTW, Juri, do you plan on submitting patches > to schedtool upstream? > Good point, I should. But I don't have any plans ATM :-/. OTOH, if anyone else wants to do that I'll be more than happy. :-) Best, - Juri ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-10 11:32 ` Juri Lelli 2016-02-10 11:43 ` luca abeni @ 2016-02-10 12:48 ` luca abeni 2016-02-10 13:42 ` Juri Lelli 2016-02-10 14:37 ` Steven Rostedt 2 siblings, 1 reply; 46+ messages in thread From: luca abeni @ 2016-02-10 12:48 UTC (permalink / raw) To: Juri Lelli Cc: rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li Hi, On Wed, 10 Feb 2016 11:32:58 +0000 Juri Lelli <juri.lelli@arm.com> wrote: [...] > From 62f70ca3051672dce209e8355cf5eddc9d825c2a Mon Sep 17 00:00:00 2001 > From: Juri Lelli <juri.lelli@arm.com> > Date: Sat, 6 Feb 2016 12:41:09 +0000 > Subject: [PATCH 1/2] sched/deadline: add per rq tracking of admitted > bandwidth > > Currently SCHED_DEADLINE scheduling policy tracks bandwidth of tasks > that passed admission control at root_domain level only. This creates > problems when such data structure(s) are destroyed, when we > reconfigure scheduling domains for example. > > This is part one of two changes required to fix the problem. In this > patch we add per-rq tracking of admitted bandwidth. Tasks bring with > them their bandwidth contribution when they enter the system and are > enqueued for the first time. Contributions are then moved around when > migrations happen and removed when tasks die. I think this patch actually does two different things (addressing two separate problems): 1) it introduces the tracking of per-rq utilization (used in your next patch to address the root domain issues) 2) it fixes a bug in the current utilization tracking mechanism. Currently, a task doing while(1) { switch to SCHED_DEADLINE switch to SCHED_OTHER } brings dl_b->total_bw below 0. Thanks to Juri for showing me this problem (and how to reproduce it) in a private email. This happens because when the task switches back from SCHED_DEADLINE to SCHED_OTHER, switched_from_dl() does not clear its deadline parameters (they will be cleared by the deadline timer when it fires). But dl_overflow() removes its utilization from dl_b->total_bw. When the task switches back to SCHED_DEADLINE, the if (new_bw == p->dl.dl_bw) check in dl_overflow() prevents __dl_add() from being called, and so when the task switches back to SCHED_OTHER dl_b->total_bw becomes negative. > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 9503d59..0ee0ec2 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -2432,7 +2432,7 @@ static int dl_overflow(struct task_struct *p, > int policy, u64 new_bw = dl_policy(policy) ? to_ratio(period, > runtime) : 0; int cpus, err = -1; > > - if (new_bw == p->dl.dl_bw) > + if (task_has_dl_policy(p) && new_bw == p->dl.dl_bw) > return 0; This hunk actually fixes issue 2) mentioned above, so I think it should be committed in a short time (independently from the rest of the patch). And maybe is a good candidate for backporting to stable kernels? Thanks, Luca ^ permalink raw reply [flat|nested] 46+ messages in thread
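As an aside, the while(1) loop described above can be spelled out as a self-contained program. The sketch below is only illustrative (it is not the reproducer exchanged in private): it needs root, the runtime/deadline values are arbitrary as long as they pass admission control, and the struct layout follows include/uapi/linux/sched.h. Left running, it makes later admission-control decisions operate on a bogus (eventually negative) total_bw:

/* Hedged reproducer sketch for the accounting bug described above. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE	6
#endif

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr dl = {
		.size		= sizeof(dl),
		.sched_policy	= SCHED_DEADLINE,
		.sched_runtime	= 10 * 1000 * 1000,	/*  10 ms */
		.sched_deadline	= 100 * 1000 * 1000,	/* 100 ms */
		.sched_period	= 100 * 1000 * 1000,	/* 100 ms */
	};
	struct sched_attr other = {
		.size		= sizeof(other),
		.sched_policy	= SCHED_OTHER,
	};

	for (;;) {
		/* __dl_add() only runs the first time around: afterwards
		 * p->dl.dl_bw still holds the old value, so the
		 * new_bw == p->dl.dl_bw check in dl_overflow() bails out
		 * early without re-adding the bandwidth... */
		syscall(__NR_sched_setattr, 0, &dl, 0);
		/* ...while __dl_clear() runs on every switch back, so
		 * dl_b->total_bw keeps drifting down. */
		syscall(__NR_sched_setattr, 0, &other, 0);
	}
}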
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-10 12:48 ` [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth luca abeni @ 2016-02-10 13:42 ` Juri Lelli 2016-02-23 15:48 ` Peter Zijlstra 0 siblings, 1 reply; 46+ messages in thread From: Juri Lelli @ 2016-02-10 13:42 UTC (permalink / raw) To: luca abeni Cc: rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li On 10/02/16 13:48, Luca Abeni wrote: > Hi, > > On Wed, 10 Feb 2016 11:32:58 +0000 > Juri Lelli <juri.lelli@arm.com> wrote: > [...] > > From 62f70ca3051672dce209e8355cf5eddc9d825c2a Mon Sep 17 00:00:00 2001 > > From: Juri Lelli <juri.lelli@arm.com> > > Date: Sat, 6 Feb 2016 12:41:09 +0000 > > Subject: [PATCH 1/2] sched/deadline: add per rq tracking of admitted > > bandwidth > > > > Currently SCHED_DEADLINE scheduling policy tracks bandwidth of tasks > > that passed admission control at root_domain level only. This creates > > problems when such data structure(s) are destroyed, when we > > reconfigure scheduling domains for example. > > > > This is part one of two changes required to fix the problem. In this > > patch we add per-rq tracking of admitted bandwidth. Tasks bring with > > them their bandwidth contribution when they enter the system and are > > enqueued for the first time. Contributions are then moved around when > > migrations happen and removed when tasks die. > > I think this patch actually does two different things (addressing two > separate problems): > 1) it introduces the tracking of per-rq utilization (used in your next > patch to address the root domain issues) > 2) it fixes a bug in the current utilization tracking mechanism. > Currently, a task doing > while(1) { > switch to SCHED_DEADLINE > switch to SCHED_OTHER > } > brings dl_b->total_bw below 0. Thanks to Juri for showing me this > problem (and how to reproduce it) in a private email. > This happens because when the task switches back from SCHED_DEADLINE > to SCHED_OTHER, switched_from_dl() does not clear its deadline > parameters (they will be cleared by the deadline timer when it > fires). But dl_overflow() removes its utilization from > dl_b->total_bw. When the task switches back to SCHED_DEADLINE, the > if (new_bw == p->dl.dl_bw) check in dl_overflow() prevents > __dl_add() from being called, and so when the task switches back to > SCHED_OTHER dl_b->total_bw becomes negative. > Yep. That is also the reason why I think, whatever refactoring we are going to do, we should keep per-rq and root_domain accounting done together. > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index 9503d59..0ee0ec2 100644 > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -2432,7 +2432,7 @@ static int dl_overflow(struct task_struct *p, > > int policy, u64 new_bw = dl_policy(policy) ? to_ratio(period, > > runtime) : 0; int cpus, err = -1; > > > > - if (new_bw == p->dl.dl_bw) > > + if (task_has_dl_policy(p) && new_bw == p->dl.dl_bw) > > return 0; > > This hunk actually fixes issue 2) mentioned above, so I think it should > be committed in a short time (independently from the rest of the > patch). And maybe is a good candidate for backporting to stable kernels? > Yes, this is a sensible fix per se. I can split it and send it separately. Best, - Juri ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-10 13:42 ` Juri Lelli @ 2016-02-23 15:48 ` Peter Zijlstra 2016-02-23 15:51 ` Juri Lelli 0 siblings, 1 reply; 46+ messages in thread From: Peter Zijlstra @ 2016-02-23 15:48 UTC (permalink / raw) To: Juri Lelli Cc: luca abeni, rostedt, linux-kernel, mingo, vincent.guittot, wanpeng.li On Wed, Feb 10, 2016 at 01:42:40PM +0000, Juri Lelli wrote: > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > > index 9503d59..0ee0ec2 100644 > > > --- a/kernel/sched/core.c > > > +++ b/kernel/sched/core.c > > > @@ -2432,7 +2432,7 @@ static int dl_overflow(struct task_struct *p, > > > int policy, u64 new_bw = dl_policy(policy) ? to_ratio(period, > > > runtime) : 0; int cpus, err = -1; > > > > > > - if (new_bw == p->dl.dl_bw) > > > + if (task_has_dl_policy(p) && new_bw == p->dl.dl_bw) > > > return 0; > > > > This hunk actually fixes issue 2) mentioned above, so I think it should > > be committed in a short time (independently from the rest of the > > patch). And maybe is a good candidate for backporting to stable kernels? > > > > Yes, this is a sensible fix per se. I can split it and send it > separately. Did you ever send that? ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-23 15:48 ` Peter Zijlstra @ 2016-02-23 15:51 ` Juri Lelli 0 siblings, 0 replies; 46+ messages in thread From: Juri Lelli @ 2016-02-23 15:51 UTC (permalink / raw) To: Peter Zijlstra Cc: luca abeni, rostedt, linux-kernel, mingo, vincent.guittot, wanpeng.li On 23/02/16 16:48, Peter Zijlstra wrote: > On Wed, Feb 10, 2016 at 01:42:40PM +0000, Juri Lelli wrote: > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > > > index 9503d59..0ee0ec2 100644 > > > > --- a/kernel/sched/core.c > > > > +++ b/kernel/sched/core.c > > > > @@ -2432,7 +2432,7 @@ static int dl_overflow(struct task_struct *p, > > > > int policy, u64 new_bw = dl_policy(policy) ? to_ratio(period, > > > > runtime) : 0; int cpus, err = -1; > > > > > > > > - if (new_bw == p->dl.dl_bw) > > > > + if (task_has_dl_policy(p) && new_bw == p->dl.dl_bw) > > > > return 0; > > > > > > This hunk actually fixes issue 2) mentioned above, so I think it should > > > be committed in a short time (independently from the rest of the > > > patch). And maybe is a good candidate for backporting to stable kernels? > > > > > > > Yes, this is a sensible fix per se. I can split it and send it > > separately. > > Did you ever send that? > No, I didn't. Will do ASAP. Thanks, - Juri ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-10 11:32 ` Juri Lelli 2016-02-10 11:43 ` luca abeni 2016-02-10 12:48 ` [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth luca abeni @ 2016-02-10 14:37 ` Steven Rostedt 2016-02-10 16:27 ` Juri Lelli 2 siblings, 1 reply; 46+ messages in thread From: Steven Rostedt @ 2016-02-10 14:37 UTC (permalink / raw) To: Juri Lelli Cc: linux-kernel, peterz, mingo, luca.abeni, vincent.guittot, wanpeng.li On Wed, 10 Feb 2016 11:32:58 +0000 Juri Lelli <juri.lelli@arm.com> wrote: > Hi, > > I've updated this patch since, with a bit more testing and talking with > Luca in private, I realized that the previous version didn't manage > switching back and forth from SCHED_DEADLINE correctly. Thanks a lot > Luca for your feedback (even if not visible on the list). > > I updated the testing branch accordingly and added a test to my tests > that stresses switch-in/switch-out. > > I don't particularly like the fact that we break the scheduling classes > abstraction in __dl_overflow(), so I think a little bit of refactoring > is still needed. But that can also happen afterwards, if we fix the > problem with root domains. > > Best, > > - Juri > > --->8--- > > >From 62f70ca3051672dce209e8355cf5eddc9d825c2a Mon Sep 17 00:00:00 2001 > From: Juri Lelli <juri.lelli@arm.com> > Date: Sat, 6 Feb 2016 12:41:09 +0000 > Subject: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth > I applied this patch and patch 2 and hit this: [ 2298.134284] ------------[ cut here ]------------ [ 2298.138933] WARNING: CPU: 4 PID: 0 at /home/rostedt/work/git/linux-trace.git/kernel/sched/sched.h:735 task_dead_dl+0xc5/0xd0() [ 2298.150350] Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables xt_CHECKSUM iptable_mangle bridge stp llc bluetooth lockd grace snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core x86_pkg_temp_thermal snd_seq vhost_net snd_seq_device tun vhost macvtap macvlan coretemp iTCO_wdt snd_pcm hp_wmi rfkill kvm_intel sparse_keymap iTCO_vendor_support snd_timer snd acpi_cpufreq kvm i2c_i801 mei_me mei soundcore lpc_ich mfd_core irqbypass wmi serio_raw uinput i915 i2c_algo_bit e1000e drm_kms_helper crc32_pclmul ptp crc32c_intel drm pps_core i2c_core video sunrpc [ 2298.207495] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.5.0-rc1-test+ #204 [ 2298.214392] Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012 [ 2298.223371] ffffffff81abc680 ffff880119433d68 ffffffff81411c3f 0000000000000000 [ 2298.230904] ffffffff81abc680 ffff880119433da0 ffffffff810acf66 ffff88011eb16f40 [ 2298.238435] ffff88001ee16200 ffffffff81fd4a00 00000000000aaaaa 0000000000000001 [ 2298.245958] Call Trace: [ 2298.248431] [<ffffffff81411c3f>] dump_stack+0x50/0xb1 [ 2298.253597] [<ffffffff810acf66>] warn_slowpath_common+0x86/0xc0 [ 2298.259627] [<ffffffff810ad05a>] warn_slowpath_null+0x1a/0x20 [ 2298.265490] [<ffffffff810f92a5>] task_dead_dl+0xc5/0xd0 [ 2298.270828] [<ffffffff810d833f>] finish_task_switch+0x16f/0x310 [ 2298.276871] [<ffffffff810fa7f3>] ? 
pick_next_task_dl+0xb3/0x250 [ 2298.282906] [<ffffffff817f07a3>] __schedule+0x3d3/0x9e0 [ 2298.288252] [<ffffffff817f1001>] schedule+0x41/0xc0 [ 2298.293242] [<ffffffff817f12c8>] schedule_preempt_disabled+0x18/0x30 [ 2298.299712] [<ffffffff810fc974>] cpu_startup_entry+0x74/0x4e0 [ 2298.305573] [<ffffffff8105d16f>] start_secondary+0x2bf/0x330 [ 2298.311347] ---[ end trace 732d16efabe456f1 ]--- It's the warning you added in __dl_sub_ac(). Things appear to still get all screwed up with cpusets and root domains, but I can still recover by playing with load_balance. -- Steve ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-10 14:37 ` Steven Rostedt @ 2016-02-10 16:27 ` Juri Lelli 2016-02-11 12:12 ` Juri Lelli 0 siblings, 1 reply; 46+ messages in thread From: Juri Lelli @ 2016-02-10 16:27 UTC (permalink / raw) To: Steven Rostedt Cc: linux-kernel, peterz, mingo, luca.abeni, vincent.guittot, wanpeng.li On 10/02/16 09:37, Steven Rostedt wrote: > On Wed, 10 Feb 2016 11:32:58 +0000 > Juri Lelli <juri.lelli@arm.com> wrote: > > > Hi, > > > > I've updated this patch since, with a bit more testing and talking with > > Luca in private, I realized that the previous version didn't manage > > switching back and forth from SCHED_DEADLINE correctly. Thanks a lot > > Luca for your feedback (even if not visible on the list). > > > > I updated the testing branch accordingly and added a test to my tests > > that stresses switch-in/switch-out. > > > > I don't particularly like the fact that we break the scheduling classes > > abstraction in __dl_overflow(), so I think a little bit of refactoring > > is still needed. But that can also happen afterwards, if we fix the > > problem with root domains. > > > > Best, > > > > - Juri > > > > --->8--- > > > > >From 62f70ca3051672dce209e8355cf5eddc9d825c2a Mon Sep 17 00:00:00 2001 > > From: Juri Lelli <juri.lelli@arm.com> > > Date: Sat, 6 Feb 2016 12:41:09 +0000 > > Subject: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth > > > > > I applied this patch and patch 2 and hit this: > > [ 2298.134284] ------------[ cut here ]------------ > [ 2298.138933] WARNING: CPU: 4 PID: 0 at /home/rostedt/work/git/linux-trace.git/kernel/sched/sched.h:735 task_dead_dl+0xc5/0xd0() > [ 2298.150350] Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables xt_CHECKSUM iptable_mangle bridge stp llc bluetooth lockd grace snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core x86_pkg_temp_thermal snd_seq vhost_net snd_seq_device tun vhost macvtap macvlan coretemp iTCO_wdt snd_pcm hp_wmi rfkill kvm_intel sparse_keymap iTCO_vendor_support snd_timer snd acpi_cpufreq kvm i2c_i801 mei_me mei soundcore lpc_ich mfd_core irqbypass wmi serio_raw uinput i915 i2c_algo_bit e1000e drm_kms_helper crc32_pclmul ptp crc32c_intel drm pps_core i2c_core video sunrpc > [ 2298.207495] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.5.0-rc1-test+ #204 > [ 2298.214392] Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012 > [ 2298.223371] ffffffff81abc680 ffff880119433d68 ffffffff81411c3f 0000000000000000 > [ 2298.230904] ffffffff81abc680 ffff880119433da0 ffffffff810acf66 ffff88011eb16f40 > [ 2298.238435] ffff88001ee16200 ffffffff81fd4a00 00000000000aaaaa 0000000000000001 > [ 2298.245958] Call Trace: > [ 2298.248431] [<ffffffff81411c3f>] dump_stack+0x50/0xb1 > [ 2298.253597] [<ffffffff810acf66>] warn_slowpath_common+0x86/0xc0 > [ 2298.259627] [<ffffffff810ad05a>] warn_slowpath_null+0x1a/0x20 > [ 2298.265490] [<ffffffff810f92a5>] task_dead_dl+0xc5/0xd0 > [ 2298.270828] [<ffffffff810d833f>] finish_task_switch+0x16f/0x310 > [ 2298.276871] [<ffffffff810fa7f3>] ? 
pick_next_task_dl+0xb3/0x250 > [ 2298.282906] [<ffffffff817f07a3>] __schedule+0x3d3/0x9e0 > [ 2298.288252] [<ffffffff817f1001>] schedule+0x41/0xc0 > [ 2298.293242] [<ffffffff817f12c8>] schedule_preempt_disabled+0x18/0x30 > [ 2298.299712] [<ffffffff810fc974>] cpu_startup_entry+0x74/0x4e0 > [ 2298.305573] [<ffffffff8105d16f>] start_secondary+0x2bf/0x330 > [ 2298.311347] ---[ end trace 732d16efabe456f1 ]--- > > It's the warning you added in __dl_sub_ac(). > OK. There are still holes where we fail to properly update per-rq bw. It seems (by running your test) that we fail to move the per-rq bw when we move the root_domain bw due to css_set_move_task(). So, the final task_dead_dl() tries to remove bw from where there isn't any. I'm trying to see how we can close this hole. Thanks, - Juri ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-10 16:27 ` Juri Lelli @ 2016-02-11 12:12 ` Juri Lelli 2016-02-11 12:22 ` luca abeni 0 siblings, 1 reply; 46+ messages in thread From: Juri Lelli @ 2016-02-11 12:12 UTC (permalink / raw) To: Steven Rostedt Cc: linux-kernel, peterz, mingo, luca.abeni, vincent.guittot, wanpeng.li On 10/02/16 16:27, Juri Lelli wrote: > On 10/02/16 09:37, Steven Rostedt wrote: > > On Wed, 10 Feb 2016 11:32:58 +0000 > > Juri Lelli <juri.lelli@arm.com> wrote: > > [...] > > > > I applied this patch and patch 2 and hit this: > > [...] > > > > It's the warning you added in __dl_sub_ac(). > > > > OK. There are still holes where we fail to properly update per-rq bw. It > seems (by running you test) that we fail to move the per-rq bw when we > move the root_domain bw due css_set_move_task(). So, the final > task_dead_dl() tries to remove bw from where there isn't. > > I'm trying to see how we can close this hole. > So, just to give an update from yesterday (kind of tricky this one :/). I think we still have (at least) two problems: - select_task_rq_dl, if we select a different target - select_task_rq might make use of select_fallback_rq, if cpus_allowed changed after the task went to sleep Second case is what creates the problem here, as we don't update task_rq(p) and fallback_cpu ac_bw. I was thinking we might do so, maybe adding fallback_cpu in task_struct, from migrate_task_rq_dl() (it has to be added yes), but I fear that we should hold both rq locks :/. Luca, did you already face this problem (if I got it right) and thought of a way to fix it? I'll go back and stare a bit more at those paths. Best, - Juri ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-11 12:12 ` Juri Lelli @ 2016-02-11 12:22 ` luca abeni 2016-02-11 12:27 ` Juri Lelli 0 siblings, 1 reply; 46+ messages in thread From: luca abeni @ 2016-02-11 12:22 UTC (permalink / raw) To: Juri Lelli Cc: Steven Rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li Hi Juri, On Thu, 11 Feb 2016 12:12:57 +0000 Juri Lelli <juri.lelli@arm.com> wrote: [...] > I think we still have (at least) two problems: > > - select_task_rq_dl, if we select a different target > - select_task_rq might make use of select_fallback_rq, if > cpus_allowed changed after the task went to sleep > > Second case is what creates the problem here, as we don't update > task_rq(p) and fallback_cpu ac_bw. I was thinking we might do so, > maybe adding fallback_cpu in task_struct, from migrate_task_rq_dl() > (it has to be added yes), but I fear that we should hold both rq > locks :/. > > Luca, did you already face this problem (if I got it right) and > thought of a way to fix it? I'll go back and stare a bit more at > those paths. In my patch I took care of the first case (modifying select_task_rq_dl() to move the utilization from the "old rq" to the "new rq"), but I never managed to trigger select_fallback_rq() in my tests, so I overlooked that case. Luca ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-11 12:22 ` luca abeni @ 2016-02-11 12:27 ` Juri Lelli 2016-02-11 12:40 ` luca abeni 0 siblings, 1 reply; 46+ messages in thread From: Juri Lelli @ 2016-02-11 12:27 UTC (permalink / raw) To: luca abeni Cc: Steven Rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li On 11/02/16 13:22, Luca Abeni wrote: > Hi Juri, > > On Thu, 11 Feb 2016 12:12:57 +0000 > Juri Lelli <juri.lelli@arm.com> wrote: > [...] > > I think we still have (at least) two problems: > > > > - select_task_rq_dl, if we select a different target > > - select_task_rq might make use of select_fallback_rq, if > > cpus_allowed changed after the task went to sleep > > > > Second case is what creates the problem here, as we don't update > > task_rq(p) and fallback_cpu ac_bw. I was thinking we might do so, > > maybe adding fallback_cpu in task_struct, from migrate_task_rq_dl() > > (it has to be added yes), but I fear that we should hold both rq > > locks :/. > > > > Luca, did you already face this problem (if I got it right) and > > thought of a way to fix it? I'll go back and stare a bit more at > > those paths. > In my patch I took care of the first case (modifying > select_task_rq_dl() to move the utilization from the "old rq" to the > "new rq"), but I never managed to trigger select_fallback_rq() in my > tests, so I overlooked that case. > Right, I was thinking to do the same. And you did that after grabbing both locks, right? ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-11 12:27 ` Juri Lelli @ 2016-02-11 12:40 ` luca abeni 2016-02-11 12:49 ` Juri Lelli 0 siblings, 1 reply; 46+ messages in thread From: luca abeni @ 2016-02-11 12:40 UTC (permalink / raw) To: Juri Lelli Cc: Steven Rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li On Thu, 11 Feb 2016 12:27:54 +0000 Juri Lelli <juri.lelli@arm.com> wrote: > On 11/02/16 13:22, Luca Abeni wrote: > > Hi Juri, > > > > On Thu, 11 Feb 2016 12:12:57 +0000 > > Juri Lelli <juri.lelli@arm.com> wrote: > > [...] > > > I think we still have (at least) two problems: > > > > > > - select_task_rq_dl, if we select a different target > > > - select_task_rq might make use of select_fallback_rq, if > > > cpus_allowed changed after the task went to sleep > > > > > > Second case is what creates the problem here, as we don't update > > > task_rq(p) and fallback_cpu ac_bw. I was thinking we might do so, > > > maybe adding fallback_cpu in task_struct, from > > > migrate_task_rq_dl() (it has to be added yes), but I fear that we > > > should hold both rq locks :/. > > > > > > Luca, did you already face this problem (if I got it right) and > > > thought of a way to fix it? I'll go back and stare a bit more at > > > those paths. > > In my patch I took care of the first case (modifying > > select_task_rq_dl() to move the utilization from the "old rq" to the > > "new rq"), but I never managed to trigger select_fallback_rq() in my > > tests, so I overlooked that case. > > > > Right, I was thinking to do the same. And you did that after grabbing > both locks, right? Not sure if I did everything correctly, but my code in select_task_rq_dl() currently looks like this (you can obviously ignore the "migrate_active" and "*_running_bw()" parts, and focus on the "*_rq_bw()" stuff): [...] if (rq != cpu_rq(cpu)) { int migrate_active; raw_spin_lock(&rq->lock); migrate_active = hrtimer_active(&p->dl.inactive_timer); if (migrate_active) { hrtimer_try_to_cancel(&p->dl.inactive_timer); sub_running_bw(&p->dl, &rq->dl); } sub_rq_bw(&p->dl, &rq->dl); raw_spin_unlock(&rq->lock); rq = cpu_rq(cpu); raw_spin_lock(&rq->lock); add_rq_bw(&p->dl, &rq->dl); if (migrate_active) add_running_bw(&p->dl, &rq->dl); raw_spin_unlock(&rq->lock); } [...] lockdep is not screaming, and I am not able to trigger any race condition or strange behaviour (I am currently at more than 24h of continuous stress-testing, but maybe my testcase is not so good in finding races here :) Luca ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-11 12:40 ` luca abeni @ 2016-02-11 12:49 ` Juri Lelli 2016-02-11 13:05 ` luca abeni 0 siblings, 1 reply; 46+ messages in thread From: Juri Lelli @ 2016-02-11 12:49 UTC (permalink / raw) To: luca abeni Cc: Steven Rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li On 11/02/16 13:40, Luca Abeni wrote: > On Thu, 11 Feb 2016 12:27:54 +0000 > Juri Lelli <juri.lelli@arm.com> wrote: > > > On 11/02/16 13:22, Luca Abeni wrote: > > > Hi Juri, > > > > > > On Thu, 11 Feb 2016 12:12:57 +0000 > > > Juri Lelli <juri.lelli@arm.com> wrote: > > > [...] > > > > I think we still have (at least) two problems: > > > > > > > > - select_task_rq_dl, if we select a different target > > > > - select_task_rq might make use of select_fallback_rq, if > > > > cpus_allowed changed after the task went to sleep > > > > > > > > Second case is what creates the problem here, as we don't update > > > > task_rq(p) and fallback_cpu ac_bw. I was thinking we might do so, > > > > maybe adding fallback_cpu in task_struct, from > > > > migrate_task_rq_dl() (it has to be added yes), but I fear that we > > > > should hold both rq locks :/. > > > > > > > > Luca, did you already face this problem (if I got it right) and > > > > thought of a way to fix it? I'll go back and stare a bit more at > > > > those paths. > > > In my patch I took care of the first case (modifying > > > select_task_rq_dl() to move the utilization from the "old rq" to the > > > "new rq"), but I never managed to trigger select_fallback_rq() in my > > > tests, so I overlooked that case. > > > > > > > Right, I was thinking to do the same. And you did that after grabbing > > both locks, right? > > Not sure if I did everything correctly, but my code in > select_task_rq_dl() currently looks like this (you can obviously > ignore the "migrate_active" and "*_running_bw()" parts, and focus on > the "*_rq_bw()" stuff): > [...] > if (rq != cpu_rq(cpu)) { > int migrate_active; > > raw_spin_lock(&rq->lock); > migrate_active = hrtimer_active(&p->dl.inactive_timer); > if (migrate_active) { > hrtimer_try_to_cancel(&p->dl.inactive_timer); > sub_running_bw(&p->dl, &rq->dl); > } > sub_rq_bw(&p->dl, &rq->dl); > raw_spin_unlock(&rq->lock); > rq = cpu_rq(cpu); Can't something happen here? My problem is that I use per-rq bw tracking to save/restore root_domain state. So, I fear that a root_domain update can happen while we are in the middle of moving bw from one cpu to another. > raw_spin_lock(&rq->lock); > add_rq_bw(&p->dl, &rq->dl); > if (migrate_active) > add_running_bw(&p->dl, &rq->dl); > raw_spin_unlock(&rq->lock); > } > [...] > > lockdep is not screaming, and I am not able to trigger any race > condition or strange behaviour (I am currently at more than 24h of > continuous stress-testing, but maybe my testcase is not so good in > finding races here :) > Thanks for sharing what you have! ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-11 12:49 ` Juri Lelli @ 2016-02-11 13:05 ` luca abeni 2016-02-11 14:25 ` Steven Rostedt 0 siblings, 1 reply; 46+ messages in thread From: luca abeni @ 2016-02-11 13:05 UTC (permalink / raw) To: Juri Lelli Cc: Steven Rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li On Thu, 11 Feb 2016 12:49:59 +0000 Juri Lelli <juri.lelli@arm.com> wrote: [...] > > > > > Luca, did you already face this problem (if I got it right) > > > > > and thought of a way to fix it? I'll go back and stare a bit > > > > > more at those paths. > > > > In my patch I took care of the first case (modifying > > > > select_task_rq_dl() to move the utilization from the "old rq" > > > > to the "new rq"), but I never managed to trigger > > > > select_fallback_rq() in my tests, so I overlooked that case. > > > > > > > > > > Right, I was thinking to do the same. And you did that after > > > grabbing both locks, right? > > > > Not sure if I did everything correctly, but my code in > > select_task_rq_dl() currently looks like this (you can obviously > > ignore the "migrate_active" and "*_running_bw()" parts, and focus on > > the "*_rq_bw()" stuff): > > [...] > > if (rq != cpu_rq(cpu)) { > > int migrate_active; > > > > raw_spin_lock(&rq->lock); > > migrate_active = > > hrtimer_active(&p->dl.inactive_timer); if (migrate_active) { > > hrtimer_try_to_cancel(&p->dl.inactive_timer); > > sub_running_bw(&p->dl, &rq->dl); > > } > > sub_rq_bw(&p->dl, &rq->dl); > > raw_spin_unlock(&rq->lock); > > rq = cpu_rq(cpu); > > Can't something happen here? My problem is that I use per-rq bw > tracking to save/restore root_domain state. So, I fear that a > root_domain update can happen while we are in the middle of moving bw > from one cpu to another. Well, I never used the rq utilization to re-build the root_domain utilization (and I never played with root domains too much... :)... So, I do not really know. Maybe the code should do: raw_spin_lock(&rq->lock); raw_spin_lock(&cpu_rq(cpu)->lock); sub_rq_bw(&p->dl, &rq->dl); add_rq_bw(&p->dl, &cpu_rq(cpu)->dl); [...] ? Luca ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-11 13:05 ` luca abeni @ 2016-02-11 14:25 ` Steven Rostedt 2016-02-11 17:10 ` Juri Lelli 2016-02-11 21:48 ` Luca Abeni 0 siblings, 2 replies; 46+ messages in thread From: Steven Rostedt @ 2016-02-11 14:25 UTC (permalink / raw) To: luca abeni Cc: Juri Lelli, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li On Thu, 11 Feb 2016 14:05:45 +0100 luca abeni <luca.abeni@unitn.it> wrote: > Well, I never used the rq utilization to re-build the root_domain > utilization (and I never played with root domains too much... :)... > So, I do not really know. Maybe the code should do: > raw_spin_lock(&rq->lock); > raw_spin_lock(&cpu_rq(cpu)->lock); Of course you want to use double_rq_lock() here instead. -- Steve > sub_rq_bw(&p->dl, &rq->dl); > add_rq_bw(&p->dl, &cpu_rq(cpu)->dl); > [...] > ? > > > Luca ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-11 14:25 ` Steven Rostedt @ 2016-02-11 17:10 ` Juri Lelli 2016-02-12 17:05 ` Peter Zijlstra 2016-02-11 21:48 ` Luca Abeni 1 sibling, 1 reply; 46+ messages in thread From: Juri Lelli @ 2016-02-11 17:10 UTC (permalink / raw) To: Steven Rostedt Cc: luca abeni, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li On 11/02/16 09:25, Steven Rostedt wrote: > On Thu, 11 Feb 2016 14:05:45 +0100 > luca abeni <luca.abeni@unitn.it> wrote: > > > > Well, I never used the rq utilization to re-build the root_domain > > utilization (and I never played with root domains too much... :)... > > So, I do not really know. Maybe the code should do: > > raw_spin_lock(&rq->lock); > > raw_spin_lock(&cpu_rq(cpu)->lock); > > Of course you want to use double_rq_lock() here instead. > Right. Is something like this completely out of question/broken? I slighly tested it with Steve's test and I don't see the warning anymore (sched_debug looks good as well); but my confidence is still pretty low. :( --->8--- >From 9713e12bc682ca364e62f9d69bcd44598c50a8a9 Mon Sep 17 00:00:00 2001 From: Juri Lelli <juri.lelli@arm.com> Date: Thu, 11 Feb 2016 16:55:49 +0000 Subject: [PATCH] fixup! sched/deadline: add per rq tracking of admitted bandwidth Signed-off-by: Juri Lelli <juri.lelli@arm.com> --- include/linux/init_task.h | 1 + include/linux/sched.h | 1 + kernel/sched/core.c | 5 ++++- kernel/sched/deadline.c | 26 +++++++++++++++++++++++++- 4 files changed, 31 insertions(+), 2 deletions(-) diff --git a/include/linux/init_task.h b/include/linux/init_task.h index f2cb8d4..c582f9d 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -199,6 +199,7 @@ extern struct task_group root_task_group; .policy = SCHED_NORMAL, \ .cpus_allowed = CPU_MASK_ALL, \ .nr_cpus_allowed= NR_CPUS, \ + .fallback_cpu = -1, \ .mm = NULL, \ .active_mm = &init_mm, \ .restart_block = { \ diff --git a/include/linux/sched.h b/include/linux/sched.h index a10494a..a6fc95c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1401,6 +1401,7 @@ struct task_struct { struct task_struct *last_wakee; int wake_cpu; + int fallback_cpu; #endif int on_rq; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 7fb9246..4e4bc41 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1442,7 +1442,8 @@ static int select_fallback_rq(int cpu, struct task_struct *p) continue; if (!cpu_active(dest_cpu)) continue; - if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p))) + if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p))) { + p->fallback_cpu = dest_cpu; return dest_cpu; } } @@ -1490,6 +1491,7 @@ out: } } + p->fallback_cpu = dest_cpu; return dest_cpu; } @@ -1954,6 +1956,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) if (task_cpu(p) != cpu) { wake_flags |= WF_MIGRATED; set_task_cpu(p, cpu); + p->fallback_cpu = -1; } #endif /* CONFIG_SMP */ diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 6368f43..1eccecf 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1043,6 +1043,21 @@ static void yield_task_dl(struct rq *rq) #ifdef CONFIG_SMP +static void swap_task_ac_bw(struct task_struct *p, + struct rq *from, + struct rq *to) +{ + unsigned long flags; + + lockdep_assert_held(&p->pi_lock); + local_irq_save(flags); + double_rq_lock(from, to); + __dl_sub_ac(from, p->dl.dl_bw); + __dl_add_ac(to, p->dl.dl_bw); + double_rq_unlock(from, to); + local_irq_restore(flags); +} + static int find_later_rq(struct 
task_struct *task); static int @@ -1077,8 +1092,10 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags) if (target != -1 && (dl_time_before(p->dl.deadline, cpu_rq(target)->dl.earliest_dl.curr) || - (cpu_rq(target)->dl.dl_nr_running == 0))) + (cpu_rq(target)->dl.dl_nr_running == 0))) { cpu = target; + swap_task_ac_bw(p, rq, cpu_rq(target)); + } } rcu_read_unlock(); @@ -1807,6 +1824,12 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p, switched_to_dl(rq, p); } +static void migrate_task_rq_dl(struct task_struct *p) +{ + if (p->fallback_cpu != -1) + swap_task_ac_bw(p, task_rq(p), cpu_rq(p->fallback_cpu)); +} + const struct sched_class dl_sched_class = { .next = &rt_sched_class, .enqueue_task = enqueue_task_dl, @@ -1820,6 +1843,7 @@ const struct sched_class dl_sched_class = { #ifdef CONFIG_SMP .select_task_rq = select_task_rq_dl, + .migrate_task_rq = migrate_task_rq_dl, .set_cpus_allowed = set_cpus_allowed_dl, .rq_online = rq_online_dl, .rq_offline = rq_offline_dl, -- 2.7.0 ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-11 17:10 ` Juri Lelli @ 2016-02-12 17:05 ` Peter Zijlstra 2016-02-12 17:19 ` Juri Lelli 2016-02-24 19:17 ` Peter Zijlstra 0 siblings, 2 replies; 46+ messages in thread From: Peter Zijlstra @ 2016-02-12 17:05 UTC (permalink / raw) To: Juri Lelli Cc: Steven Rostedt, luca abeni, linux-kernel, mingo, vincent.guittot, wanpeng.li On Thu, Feb 11, 2016 at 05:10:12PM +0000, Juri Lelli wrote: > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c > index 6368f43..1eccecf 100644 > --- a/kernel/sched/deadline.c > +++ b/kernel/sched/deadline.c > +static void swap_task_ac_bw(struct task_struct *p, > + struct rq *from, > + struct rq *to) > +{ > + unsigned long flags; > + > + lockdep_assert_held(&p->pi_lock); > + local_irq_save(flags); > + double_rq_lock(from, to); > + __dl_sub_ac(from, p->dl.dl_bw); > + __dl_add_ac(to, p->dl.dl_bw); > + double_rq_unlock(from, to); > + local_irq_restore(flags); > +} > +static void migrate_task_rq_dl(struct task_struct *p) > +{ > + if (p->fallback_cpu != -1) > + swap_task_ac_bw(p, task_rq(p), cpu_rq(p->fallback_cpu)); > +} This patch scares me. Now, my brain is having an awfully hard time trying to re-engage after flu, but this looks very wrong. So we call sched_class::migrate_task_rq() from set_task_cpu(), and we call set_task_cpu() while potentially holding rq::lock's (try push_dl_task() for kicks). Sure, you play horrible games with fallback_cpu, but those games are just that, horrible. So your initial patch migrates the bandwidth along when a runnable task gets moved about, this hack seems to be mostly about waking up. The 'normal' accounting is done on enqueue/dequeue, while here you use the migration hook. Having two separate means of accounting this also feels more fragile than one would want. Let me think a bit about this. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-12 17:05 ` Peter Zijlstra @ 2016-02-12 17:19 ` Juri Lelli 0 siblings, 0 replies; 46+ messages in thread From: Juri Lelli @ 2016-02-12 17:19 UTC (permalink / raw) To: Peter Zijlstra Cc: Steven Rostedt, luca abeni, linux-kernel, mingo, vincent.guittot, wanpeng.li On 12/02/16 18:05, Peter Zijlstra wrote: > On Thu, Feb 11, 2016 at 05:10:12PM +0000, Juri Lelli wrote: > > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c > > index 6368f43..1eccecf 100644 > > --- a/kernel/sched/deadline.c > > +++ b/kernel/sched/deadline.c > > > +static void swap_task_ac_bw(struct task_struct *p, > > + struct rq *from, > > + struct rq *to) > > +{ > > + unsigned long flags; > > + > > + lockdep_assert_held(&p->pi_lock); > > + local_irq_save(flags); > > + double_rq_lock(from, to); > > + __dl_sub_ac(from, p->dl.dl_bw); > > + __dl_add_ac(to, p->dl.dl_bw); > > + double_rq_unlock(from, to); > > + local_irq_restore(flags); > > +} > > > +static void migrate_task_rq_dl(struct task_struct *p) > > +{ > > + if (p->fallback_cpu != -1) > > + swap_task_ac_bw(p, task_rq(p), cpu_rq(p->fallback_cpu)); > > +} > > This patch scares me. > Yeah, same here. However, I haven't yet found a different way to fix this and wanted some help :). > Now, my brain is having an awfully hard time trying to re-engage after > flu, but this looks very wrong. > > So we call sched_class::migrate_task_rq() from set_task_cpu(), and we > call set_task_cpu() while potentially holding rq::lock's (try > push_dl_task() for kicks). > > Sure, you play horrible games with fallback_cpu, but those games are > just that, horrible. > Right. I'm counting on fallback_cpu to be able to not call swap_task_ac_bw() (and the rq locks) during push/pull migrations. I was actually thinking that we could have a non-locked version of swap and call that in push/pull from migrate_task_rq_dl. But this is most probably more horrible. > > So your initial patch migrates the bandwidth along when a runnable task > gets moved about, this hack seems to be mostly about waking up. The > 'normal' accounting is done on enqueue/dequeue, while here you use the > migration hook. > The problem is that I don't do anything in enqueue/dequeue (apart from when cpuset migrates us while still on_rq), and I think we don't want to do anything there as a task dl_bw should remain in ac_bw when it goes to sleep, etc. This is the static view of admitted bw. We want to save/restore the admitted bw in the root_domain also when tasks are sleeping/blocked. > Having two separate means of accounting this also feels more fragile > than one would want. > > Let me think a bit about this. > I was looking at sending out a v2 with this as RFC. I guess it's better if I wait :). Thanks! Best, - Juri ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-12 17:05 ` Peter Zijlstra 2016-02-12 17:19 ` Juri Lelli @ 2016-02-24 19:17 ` Peter Zijlstra 2016-02-24 21:46 ` luca abeni 2016-02-25 10:07 ` Juri Lelli 1 sibling, 2 replies; 46+ messages in thread From: Peter Zijlstra @ 2016-02-24 19:17 UTC (permalink / raw) To: Juri Lelli Cc: Steven Rostedt, luca abeni, linux-kernel, mingo, vincent.guittot, wanpeng.li On Fri, Feb 12, 2016 at 06:05:30PM +0100, Peter Zijlstra wrote: > Having two separate means of accounting this also feels more fragile > than one would want. > > Let me think a bit about this. I think there's a fundamental problem that makes the whole notion of per-rq accounting 'impossible'. On hot-unplug we only migrate runnable tasks, all blocked tasks remain on the dead cpu. This would very much include their bandwidth requirements. This means that between a hot-unplug and the moment that _all_ those blocked tasks have ran at least once, the sum of online bandwidth doesn't match and we can get into admission trouble (same for GRUB, which can also use per-rq bw like this). The main problem is that there is no real way to find blocked tasks; currently the only way is to iterate _all_ tasks and filter on task_cpu(). We could of course add a blocked tree/list for deadline tasks, to explicitly keep track of all these; this would allow migrating blocked tasks on hotplug and avoid the real ugly I think. But I've not tried yet. ^ permalink raw reply [flat|nested] 46+ messages in thread
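To make the "iterate all tasks" option concrete, here is a rough sketch; the helper name, its call site and the locking are assumptions made for illustration and are not part of any patch in this thread:

/*
 * Hedged sketch only: rebuild a root_domain's admitted DEADLINE
 * bandwidth by walking every task and filtering on task_cpu(), so that
 * blocked tasks parked on their last (possibly now offline) CPU are
 * counted as well.  A caller would invoke this with the new rd and
 * rd->span once the scheduling domains have been rebuilt.
 */
static void rebuild_dl_bw(struct root_domain *rd, const struct cpumask *span)
{
	struct task_struct *g, *p;
	u64 total = 0;

	rcu_read_lock();
	for_each_process_thread(g, p) {
		if (!dl_task(p))
			continue;
		/* blocked tasks still report the CPU they last ran on */
		if (!cpumask_test_cpu(task_cpu(p), span))
			continue;
		total += p->dl.dl_bw;
	}
	rcu_read_unlock();

	raw_spin_lock(&rd->dl_bw.lock);
	rd->dl_bw.total_bw = total;
	raw_spin_unlock(&rd->dl_bw.lock);
}

The appeal is that no extra state has to be maintained in the wakeup and migration hot paths; the cost is a full tasklist walk, but only at (rare) domain-rebuild time.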
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-24 19:17 ` Peter Zijlstra @ 2016-02-24 21:46 ` luca abeni 2016-02-25 7:53 ` Peter Zijlstra 2016-02-25 10:07 ` Juri Lelli 1 sibling, 1 reply; 46+ messages in thread From: luca abeni @ 2016-02-24 21:46 UTC (permalink / raw) To: Peter Zijlstra Cc: Juri Lelli, Steven Rostedt, linux-kernel, mingo, vincent.guittot, wanpeng.li Hi, On Wed, 24 Feb 2016 20:17:52 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > On Fri, Feb 12, 2016 at 06:05:30PM +0100, Peter Zijlstra wrote: > > Having two separate means of accounting this also feels more fragile > > than one would want. > > > > Let me think a bit about this. > > I think there's a fundamental problem that makes the whole notion of > per-rq accounting 'impossible'. > > On hot-unplug we only migrate runnable tasks, all blocked tasks remain > on the dead cpu. This would very much include their bandwidth > requirements. > > This means that between a hot-unplug and the moment that _all_ those > blocked tasks have ran at least once, the sum of online bandwidth > doesn't match and we can get into admission trouble (same for GRUB, > which can also use per-rq bw like this). After Juri's patch and emails, I tried to think about the CPU hot-(un)plugging issues, and to check if/how they affect GRUB reclaiming... I arrived to the conclusion that for GRUB this is not a problem (but, as usual, I might be wrong): GRUB just needs to track the per-runqueue active/inactive utilization, and is not badly affected by the fact that inactive utilization is migrated "too late" (when a task wakes up instead of when the CPU goes offline). This is because GRUB does not care about "global" utilization, but considers the various runqueues in isolation (there is a flavor of the m-grub algorithm that uses global inactive utilization, but it is not implemented by the patches I submitted). In other words: Juri's patch uses per-runqueue utilizations to re-build the global utilization, while GRUB does not care if the sum of the "active utilizations" match with the utilization used for admission control. I still have to check some details, and to run some more tests with CPU hot-(un)plug (and this is why I did not send a v2 of the reclaiming RFC yet)... In particular, I need to check what happens if the "inactive timer" fires when the CPU on which the task was running is already offline. Thanks, Luca ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-24 21:46 ` luca abeni @ 2016-02-25 7:53 ` Peter Zijlstra 0 siblings, 0 replies; 46+ messages in thread From: Peter Zijlstra @ 2016-02-25 7:53 UTC (permalink / raw) To: luca abeni Cc: Juri Lelli, Steven Rostedt, linux-kernel, mingo, vincent.guittot, wanpeng.li On Wed, Feb 24, 2016 at 10:46:43PM +0100, luca abeni wrote: > > I arrived to the conclusion that for GRUB this is not a problem (but, > as usual, I might be wrong): GRUB just needs to track the per-runqueue > active/inactive utilization, Ah! indeed, my bad. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-24 19:17 ` Peter Zijlstra 2016-02-24 21:46 ` luca abeni @ 2016-02-25 10:07 ` Juri Lelli 2016-02-25 10:20 ` Peter Zijlstra 1 sibling, 1 reply; 46+ messages in thread From: Juri Lelli @ 2016-02-25 10:07 UTC (permalink / raw) To: Peter Zijlstra Cc: Steven Rostedt, luca abeni, linux-kernel, mingo, vincent.guittot, wanpeng.li Hi Peter, On 24/02/16 20:17, Peter Zijlstra wrote: > On Fri, Feb 12, 2016 at 06:05:30PM +0100, Peter Zijlstra wrote: > > Having two separate means of accounting this also feels more fragile > > than one would want. > > > > Let me think a bit about this. > > I think there's a fundamental problem that makes the whole notion of > per-rq accounting 'impossible'. > > On hot-unplug we only migrate runnable tasks, all blocked tasks remain > on the dead cpu. This would very much include their bandwidth > requirements. > > This means that between a hot-unplug and the moment that _all_ those > blocked tasks have ran at least once, the sum of online bandwidth > doesn't match and we can get into admission trouble (same for GRUB, > which can also use per-rq bw like this). > > The main problem is that there is no real way to find blocked tasks; > currently the only way is to iterate _all_ tasks and filter on > task_cpu(). > > We could of course add a blocked tree/list for deadline tasks, to > explicitly keep track of all these; this would allow migrating blocked > tasks on hotplug and avoid the real ugly I think. But I've not tried > yet. > Argh, this makes a lot of sense to me. I've actually pondered a tree/list solution, but then decided to try the cumulative approach because it looked nicer. But it contains holes, I'm afraid. As Luca already said, GRUB shouldn't have these problems though. I'll try and see what introducing a list of blocked/throttled deadline tasks means, considering also the interaction with cpusets and such. Maybe it's simpler than it seems. I'm not sure this will come anytime soon, unfortunately. I'm almost 100% on the sched-freq/schedutil discussion these days. Anyway, do you also think that what we want to solve the root domain issue is something based on rq_online/offline and per-rq information? Everything else that I tried or thought of was broken/more horrible. :-/ Best, - Juri ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-25 10:07 ` Juri Lelli @ 2016-02-25 10:20 ` Peter Zijlstra 2016-03-24 9:20 ` Peter Zijlstra 0 siblings, 1 reply; 46+ messages in thread From: Peter Zijlstra @ 2016-02-25 10:20 UTC (permalink / raw) To: Juri Lelli Cc: Steven Rostedt, luca abeni, linux-kernel, mingo, vincent.guittot, wanpeng.li On Thu, Feb 25, 2016 at 10:07:06AM +0000, Juri Lelli wrote: > Argh, this makes lot of sense to me. I've actually pondered a tree/list > solution, but then decided to try the cumulative approach because it > looked nicer. But it contains holes, I'm afraid. As Luca already said, > GRUB shouldn't have these problems though. > > I'll try and see what introducting a list of blocked/throttled deadline > tasks means, considering also the interaction with cpusets and such. > Maybe it's simpler than it seems. > > I'm not sure this will come anytime soon, unfortunately. I'm almost 100% > on the sched-freq/schedutil discussion these days. Just skip sleep and write them when it's dark outside :-) > Anyway, do you also think that what we want to solve the root domain > issue is something based on rq_online/offline and per-rq information? > Everything else that I tried or thought of was broken/more horrible. :-/ I was still trying to get my head around this, the above was my suggestion to the per-rq state, but I've not thought hard on alternative approaches to the root_domain issue. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth 2016-02-25 10:20 ` Peter Zijlstra @ 2016-03-24 9:20 ` Peter Zijlstra 0 siblings, 0 replies; 46+ messages in thread From: Peter Zijlstra @ 2016-03-24 9:20 UTC (permalink / raw) To: Juri Lelli Cc: Steven Rostedt, luca abeni, linux-kernel, mingo, vincent.guittot, wanpeng.li On Thu, Feb 25, 2016 at 11:20:34AM +0100, Peter Zijlstra wrote: > On Thu, Feb 25, 2016 at 10:07:06AM +0000, Juri Lelli wrote: > > Argh, this makes lot of sense to me. I've actually pondered a tree/list > > solution, but then decided to try the cumulative approach because it > > looked nicer. But it contains holes, I'm afraid. As Luca already said, > > GRUB shouldn't have these problems though. > > > > I'll try and see what introducting a list of blocked/throttled deadline > > tasks means, considering also the interaction with cpusets and such. > > Maybe it's simpler than it seems. > > > > I'm not sure this will come anytime soon, unfortunately. I'm almost 100% > > on the sched-freq/schedutil discussion these days. > > Just skip sleep and write them when its dark outside :-) > > > Anyway, do you also think that what we want to solve the root domain > > issue is something based on rq_online/offline and per-rq information? > > Everything else that I tried or thought of was broken/more horrible. :-/ > > I was still trying to get my head around this, the above was my > suggestion to the per-rq state, but I've not thought hard on alternative > approaches to the root_domain issue. So the below is the inactive list; it seems to not insta-explode when I run a few simple dl proglets. I don't particularly like it because it makes wakeups (esp. cross-cpu ones) more expensive for the benefit of hotplug/cpusets which is something that 'never' happens. So what I'm going to try and do is forget all about this here patch and see what I can do with a full task-list iteration on rebuild. But I figured that since I wrote it and it might work, I might as well post it. --- include/linux/sched.h | 5 ++ kernel/sched/core.c | 6 ++- kernel/sched/deadline.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++-- kernel/sched/fair.c | 2 +- kernel/sched/sched.h | 7 ++- 5 files changed, 132 insertions(+), 6 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index c617ea12c6b7..d9848eac35f2 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1351,6 +1351,11 @@ struct sched_dl_entity { * own bandwidth to be enforced, thus we need one timer per task. 
 */
        struct hrtimer dl_timer;
+
+#ifdef CONFIG_SMP
+       struct list_head dl_inactive_entry;
+       int dl_inactive_cpu;
+#endif
 };
 
 union rcu_special {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0b21e7a724e1..7f3fab6349a4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1162,7 +1162,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 
        if (task_cpu(p) != new_cpu) {
                if (p->sched_class->migrate_task_rq)
-                       p->sched_class->migrate_task_rq(p);
+                       p->sched_class->migrate_task_rq(p, new_cpu);
                p->se.nr_migrations++;
                perf_event_task_migrate(p);
        }
@@ -2077,6 +2077,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
        RB_CLEAR_NODE(&p->dl.rb_node);
        init_dl_task_timer(&p->dl);
        __dl_clear_params(p);
+#ifdef CONFIG_SMP
+       INIT_LIST_HEAD(&p->dl.dl_inactive_entry);
+#endif
 
        INIT_LIST_HEAD(&p->rt.run_list);
        p->rt.timeout = 0;
@@ -5397,6 +5400,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
                migrate_tasks(rq);
                BUG_ON(rq->nr_running != 1); /* the migration thread */
                raw_spin_unlock_irqrestore(&rq->lock, flags);
+               migrate_inactive_dl(rq);
                break;
 
        case CPU_DEAD:
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c7a036facbe1..f999b8bb6fea 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -80,6 +80,9 @@ void init_dl_rq(struct dl_rq *dl_rq)
        dl_rq->dl_nr_migratory = 0;
        dl_rq->overloaded = 0;
        dl_rq->pushable_dl_tasks_root = RB_ROOT;
+
+       raw_spin_lock_init(&dl_rq->dl_inactive_lock);
+       INIT_LIST_HEAD(&dl_rq->dl_inactive_list);
 #else
        init_dl_bw(&dl_rq->dl_bw);
 #endif
@@ -289,6 +292,62 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
        return later_rq;
 }
 
+static void enqueue_inactive(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+       raw_spin_lock(&dl_rq->dl_inactive_lock);
+       WRITE_ONCE(dl_se->dl_inactive_cpu, rq_of_dl_rq(dl_rq)->cpu);
+       list_add(&dl_se->dl_inactive_entry, &dl_rq->dl_inactive_list);
+       raw_spin_unlock(&dl_rq->dl_inactive_lock);
+}
+
+static void dequeue_inactive(struct sched_dl_entity *dl_se)
+{
+       int tmp, cpu = READ_ONCE(dl_se->dl_inactive_cpu);
+       struct rq *rq;
+
+again:
+       if (cpu == -1)
+               return;
+       rq = cpu_rq(cpu);
+
+       raw_spin_lock(&rq->dl.dl_inactive_lock);
+       tmp = READ_ONCE(dl_se->dl_inactive_cpu);
+       if (cpu != tmp) {
+               cpu = tmp;
+               raw_spin_unlock(&rq->dl.dl_inactive_lock);
+               goto again;
+       }
+       list_del_init(&dl_se->dl_inactive_entry);
+       WRITE_ONCE(dl_se->dl_inactive_cpu, -1);
+       raw_spin_unlock(&rq->dl.dl_inactive_lock);
+}
+
+static void migrate_inactive(struct sched_dl_entity *dl_se, int new_cpu)
+{
+       int tmp, cpu = READ_ONCE(dl_se->dl_inactive_cpu);
+       struct rq *src_rq, *dst_rq;
+
+       dst_rq = cpu_rq(new_cpu);
+again:
+       if (cpu == -1)
+               return;
+       src_rq = cpu_rq(cpu);
+
+       double_raw_lock(&src_rq->dl.dl_inactive_lock,
+                       &dst_rq->dl.dl_inactive_lock);
+       tmp = READ_ONCE(dl_se->dl_inactive_cpu);
+       if (cpu != tmp) {
+               cpu = tmp;
+               raw_spin_unlock(&dst_rq->dl.dl_inactive_lock);
+               raw_spin_unlock(&src_rq->dl.dl_inactive_lock);
+               goto again;
+       }
+       list_move(&dl_se->dl_inactive_entry, &dst_rq->dl.dl_inactive_list);
+       WRITE_ONCE(dl_se->dl_inactive_cpu, new_cpu);
+       raw_spin_unlock(&dst_rq->dl.dl_inactive_lock);
+       raw_spin_unlock(&src_rq->dl.dl_inactive_lock);
+}
+
 #else
 
 static inline
@@ -327,6 +386,11 @@ static inline void queue_push_tasks(struct rq *rq)
 static inline void queue_pull_task(struct rq *rq)
 {
 }
+
+static inline void enqueue_inactive(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq) { }
+static inline void dequeue_inactive(struct sched_dl_entity *dl_se) { }
+static inline void migrate_inactive(struct sched_dl_entity *dl_se, int new_cpu) { }
+
 #endif /* CONFIG_SMP */
 
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
@@ -960,6 +1024,9 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
        if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH))
                return;
 
+       if (!(flags & ENQUEUE_RESTORE))
+               dequeue_inactive(&p->dl);
+
        enqueue_dl_entity(&p->dl, pi_se, flags);
 
        if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
@@ -970,6 +1037,8 @@ static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
        dequeue_dl_entity(&p->dl);
        dequeue_pushable_dl_task(rq, p);
+       if (!(flags & DEQUEUE_SAVE))
+               enqueue_inactive(&p->dl, &rq->dl);
 }
 
 static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
@@ -1074,6 +1143,34 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
        resched_curr(rq);
 }
 
+static void migrate_task_rq_dl(struct task_struct *p, int new_cpu)
+{
+       struct sched_dl_entity *dl_se = &p->dl;
+
+       if (list_empty(&dl_se->dl_inactive_entry))
+               return;
+
+       migrate_inactive(dl_se, new_cpu);
+}
+
+void migrate_inactive_dl(struct rq *src_rq)
+{
+       int cpu = cpumask_any_and(src_rq->rd->online, cpu_active_mask);
+       struct rq *dst_rq = cpu_rq(cpu);
+       struct sched_dl_entity *dl_se, *tmp;
+
+       double_raw_lock(&src_rq->dl.dl_inactive_lock,
+                       &dst_rq->dl.dl_inactive_lock);
+
+       list_for_each_entry_safe(dl_se, tmp, &src_rq->dl.dl_inactive_list, dl_inactive_entry) {
+               WRITE_ONCE(dl_se->dl_inactive_cpu, cpu);
+               list_move(&dl_se->dl_inactive_entry, &dst_rq->dl.dl_inactive_list);
+       }
+
+       raw_spin_unlock(&dst_rq->dl.dl_inactive_lock);
+       raw_spin_unlock(&src_rq->dl.dl_inactive_lock);
+}
+
 #endif /* CONFIG_SMP */
 
 /*
@@ -1211,13 +1308,19 @@ static void task_dead_dl(struct task_struct *p)
 {
        struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 
+       local_irq_disable();
+
        /*
         * Since we are TASK_DEAD we won't slip out of the domain!
         */
-       raw_spin_lock_irq(&dl_b->lock);
+       raw_spin_lock(&dl_b->lock);
        /* XXX we should retain the bw until 0-lag */
        dl_b->total_bw -= p->dl.dl_bw;
-       raw_spin_unlock_irq(&dl_b->lock);
+       raw_spin_unlock(&dl_b->lock);
+
+       dequeue_inactive(&p->dl);
+
+       local_irq_enable();
 }
 
 static void set_curr_task_dl(struct rq *rq)
@@ -1702,7 +1805,12 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
         * this is the right place to try to pull some other one
         * from an overloaded cpu, if any.
         */
-       if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
+       if (!task_on_rq_queued(p)) {
+               dequeue_inactive(&p->dl);
+               return;
+       }
+
+       if (rq->dl.dl_nr_running)
                return;
 
        queue_pull_task(rq);
@@ -1728,6 +1836,9 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
                        resched_curr(rq);
 #endif
        }
+
+       if (!task_on_rq_queued(p))
+               enqueue_inactive(&p->dl, &rq->dl);
 }
 
 /*
@@ -1779,6 +1890,7 @@ const struct sched_class dl_sched_class = {
 
 #ifdef CONFIG_SMP
        .select_task_rq         = select_task_rq_dl,
+       .migrate_task_rq        = migrate_task_rq_dl,
        .set_cpus_allowed       = set_cpus_allowed_dl,
        .rq_online              = rq_online_dl,
        .rq_offline             = rq_offline_dl,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 303d6392b389..04e856a85c0f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5231,7 +5231,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
  * cfs_rq_of(p) references at time of call are still valid and identify the
  * previous cpu. The caller guarantees p->pi_lock or task_rq(p)->lock is held.
  */
-static void migrate_task_rq_fair(struct task_struct *p)
+static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 {
        /*
         * We are supposed to update the task to "current" time, then its up to date
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e6d4a3fa3660..0de1e2894d22 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -517,6 +517,9 @@ struct dl_rq {
         */
        struct rb_root pushable_dl_tasks_root;
        struct rb_node *pushable_dl_tasks_leftmost;
+
+       raw_spinlock_t dl_inactive_lock;
+       struct list_head dl_inactive_list;
 #else
        struct dl_bw dl_bw;
 #endif
@@ -776,6 +779,8 @@ extern int migrate_swap(struct task_struct *, struct task_struct *);
 
 #ifdef CONFIG_SMP
 
+extern void migrate_inactive_dl(struct rq *src_rq);
+
 static inline void
 queue_balance_callback(struct rq *rq,
                        struct callback_head *head,
@@ -1205,7 +1210,7 @@ struct sched_class {
 #ifdef CONFIG_SMP
        int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
-       void (*migrate_task_rq)(struct task_struct *p);
+       void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
 
        void (*task_waking) (struct task_struct *task);
        void (*task_woken) (struct rq *this_rq, struct task_struct *task);

^ permalink raw reply related	[flat|nested] 46+ messages in thread
* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
2016-02-11 14:25 ` Steven Rostedt
2016-02-11 17:10 ` Juri Lelli
@ 2016-02-11 21:48 ` Luca Abeni
1 sibling, 0 replies; 46+ messages in thread
From: Luca Abeni @ 2016-02-11 21:48 UTC (permalink / raw)
To: Steven Rostedt
Cc: Juri Lelli, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li

On Thu, 11 Feb 2016 09:25:46 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 11 Feb 2016 14:05:45 +0100
> luca abeni <luca.abeni@unitn.it> wrote:
>
> > Well, I never used the rq utilization to re-build the root_domain
> > utilization (and I never played with root domains too much... :)...
> > So, I do not really know. Maybe the code should do:
> > 	raw_spin_lock(&rq->lock);
> > 	raw_spin_lock(&cpu_rq(cpu)->lock);
>
> Of course you want to use double_rq_lock() here instead.

Right... Of course I always miss something obvious ;-)

				Luca

^ permalink raw reply	[flat|nested] 46+ messages in thread
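The exchange above turns on lock ordering: taking rq->lock and cpu_rq(cpu)->lock with two back-to-back raw_spin_lock() calls can deadlock against a CPU that acquires the same pair in the opposite order. A minimal sketch of the pattern Steven points to follows; it assumes the stock double_rq_lock()/double_rq_unlock() helpers from kernel/sched/sched.h, while the helper name and body are hypothetical and not part of any posted patch.

/*
 * Illustration only (hypothetical helper, not from the posted patches).
 * double_rq_lock() acquires the two rq->lock's in a fixed internal order,
 * avoiding the ABBA deadlock that two bare raw_spin_lock() calls permit.
 * As with any rq->lock user, the caller is expected to have IRQs disabled.
 */
static void rebuild_rd_bw_from(struct rq *this_rq, int cpu)
{
	struct rq *other = cpu_rq(cpu);

	double_rq_lock(this_rq, other);		/* instead of two raw_spin_lock()s */

	/* ... read other->dl.ac_bw and fold it into this_rq->rd->dl_bw ... */

	double_rq_unlock(this_rq, other);
}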
* [PATCH 2/2] sched/deadline: rq_{online,offline}_dl for root_domain changes
2016-02-08 12:45 [PATCH 0/2] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
2016-02-08 12:45 ` [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth Juri Lelli
@ 2016-02-08 12:45 ` Juri Lelli
1 sibling, 0 replies; 46+ messages in thread
From: Juri Lelli @ 2016-02-08 12:45 UTC (permalink / raw)
To: rostedt
Cc: linux-kernel, peterz, mingo, luca.abeni, vincent.guittot,
	wanpeng.li, juri.lelli

Steven Rostedt reported that we lose information about the amount of
bandwidth that has been admitted to the system across operations that
reconfigure scheduling domains (and are thus destructive w.r.t. root
domains). Wanpeng Li also previously reported a similar problem related
to hotplug, but, since hotplug operations entail scheduling domains
reconfiguration, we can fix both problems at once.

This patch leverages the per-rq admitted bandwidth information introduced
by the previous commit and couples it with rq_{online,offline}_dl calls
to save and restore root_domain state across scheduling domains
reconfiguration.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reported-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
 kernel/sched/deadline.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 2480cab..925814e 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1683,6 +1683,7 @@ static void rq_online_dl(struct rq *rq)
        if (rq->dl.overloaded)
                dl_set_overload(rq);
 
+       __dl_add(&rq->rd->dl_bw, rq->dl.ac_bw);
        cpudl_set_freecpu(&rq->rd->cpudl, rq->cpu);
        if (rq->dl.dl_nr_running > 0)
                cpudl_set(&rq->rd->cpudl, rq->cpu, rq->dl.earliest_dl.curr, 1);
@@ -1696,6 +1697,7 @@ static void rq_offline_dl(struct rq *rq)
        cpudl_set(&rq->rd->cpudl, rq->cpu, 0, 0);
        cpudl_clear_freecpu(&rq->rd->cpudl, rq->cpu);
+       __dl_clear(&rq->rd->dl_bw, rq->dl.ac_bw);
 }
 
 void __init init_sched_dl_class(void)
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 46+ messages in thread
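For context, a rough sketch of the flow this patch relies on when a runqueue is moved to a new root_domain, loosely modelled on rq_attach_root() in kernel/sched/core.c; reference counting of the old root_domain, the other scheduling classes, and locking are omitted, and the function name below is made up for illustration.

/*
 * Illustration only, not the actual kernel code.  When scheduling
 * domains are rebuilt, each rq is detached from its old root_domain
 * and attached to a new one.  With this patch, rq_offline_dl()
 * subtracts rq->dl.ac_bw from the old rd->dl_bw and rq_online_dl()
 * adds it to the new one, so admitted bandwidth survives the rebuild.
 */
static void attach_rq_to_new_rd(struct rq *rq, struct root_domain *new_rd)
{
	if (rq->rd && cpumask_test_cpu(rq->cpu, rq->rd->online))
		set_rq_offline(rq);	/* -> rq_offline_dl(): __dl_clear(&old_rd->dl_bw, rq->dl.ac_bw) */

	rq->rd = new_rd;
	cpumask_set_cpu(rq->cpu, rq->rd->span);

	if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
		set_rq_online(rq);	/* -> rq_online_dl(): __dl_add(&new_rd->dl_bw, rq->dl.ac_bw) */
}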
Thread overview: 46+ messages
2016-02-08 12:45 [PATCH 0/2] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
2016-02-08 12:45 ` [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth Juri Lelli
2016-02-10 11:32 ` Juri Lelli
2016-02-10 11:43 ` luca abeni
2016-02-10 11:58 ` Juri Lelli
2016-02-19 13:43 ` luca abeni
2016-02-19 14:20 ` Steven Rostedt
2016-02-19 14:53 ` luca abeni
2016-02-19 14:57 ` Steven Rostedt
2016-02-22 11:03 ` luca abeni
2016-02-22 10:57 ` [PATCH 0/3] cleanup " Luca Abeni
2016-02-22 10:57 ` [PATCH 1/4] Move some calls to __dl_{sub,add}_ac() from core.c to deadline.c Luca Abeni
2016-02-22 10:57 ` [PATCH 2/4] Move the remaining __dl_{sub,add}_ac() calls " Luca Abeni
2016-02-22 10:57 ` [PATCH 3/4] Remove dl_new Luca Abeni
2016-02-23 15:42 ` Peter Zijlstra
2016-02-24 13:53 ` luca abeni
2016-02-25 9:46 ` Juri Lelli
2016-03-03 9:03 ` luca abeni
2016-03-03 9:28 ` Juri Lelli
2016-03-03 14:23 ` Steven Rostedt
2016-03-03 14:31 ` luca abeni
2016-03-03 16:12 ` Juri Lelli
2016-02-10 12:48 ` [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth luca abeni
2016-02-10 13:42 ` Juri Lelli
2016-02-23 15:48 ` Peter Zijlstra
2016-02-23 15:51 ` Juri Lelli
2016-02-10 14:37 ` Steven Rostedt
2016-02-10 16:27 ` Juri Lelli
2016-02-11 12:12 ` Juri Lelli
2016-02-11 12:22 ` luca abeni
2016-02-11 12:27 ` Juri Lelli
2016-02-11 12:40 ` luca abeni
2016-02-11 12:49 ` Juri Lelli
2016-02-11 13:05 ` luca abeni
2016-02-11 14:25 ` Steven Rostedt
2016-02-11 17:10 ` Juri Lelli
2016-02-12 17:05 ` Peter Zijlstra
2016-02-12 17:19 ` Juri Lelli
2016-02-24 19:17 ` Peter Zijlstra
2016-02-24 21:46 ` luca abeni
2016-02-25 7:53 ` Peter Zijlstra
2016-02-25 10:07 ` Juri Lelli
2016-02-25 10:20 ` Peter Zijlstra
2016-03-24 9:20 ` Peter Zijlstra
2016-02-11 21:48 ` Luca Abeni
2016-02-08 12:45 ` [PATCH 2/2] sched/deadline: rq_{online,offline}_dl for root_domain changes Juri Lelli