From: Paul Turner <pjt@google.com>
To: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: linux-kernel@vger.kernel.org,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Bharata B Rao <bharata@linux.vnet.ibm.com>,
Dhaval Giani <dhaval.giani@gmail.com>,
Balbir Singh <balbir@linux.vnet.ibm.com>,
Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>,
Srivatsa Vaddagiri <vatsa@in.ibm.com>,
Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>,
Ingo Molnar <mingo@elte.hu>, Pavel Emelyanov <xemul@openvz.org>,
Nikhil Rao <ncrao@google.com>
Subject: Re: [patch 09/15] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
Date: Wed, 11 May 2011 02:24:14 -0700 [thread overview]
Message-ID: <BANLkTi=NmCxKX6EbDQcJYDJ5kKyG2N1ssw@mail.gmail.com> (raw)
In-Reply-To: <4DC8E810.1020208@jp.fujitsu.com>
On Tue, May 10, 2011 at 12:24 AM, Hidetoshi Seto
<seto.hidetoshi@jp.fujitsu.com> wrote:
> Some comments...
>
> (2011/05/03 18:28), Paul Turner wrote:
>> At the start of a new period there are several actions we must refresh the
>> global bandwidth pool as well as unthrottle any cfs_rq entities who previously
>> ran out of bandwidth (as quota permits).
>>
>> Unthrottled entities have the cfs_rq->throttled flag cleared and are re-enqueued
>> into the cfs entity hierarchy.
>>
>> Signed-off-by: Paul Turner <pjt@google.com>
>> Signed-off-by: Nikhil Rao <ncrao@google.com>
>> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
>> ---
>> kernel/sched.c | 3 +
>> kernel/sched_fair.c | 105 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>> 2 files changed, 107 insertions(+), 1 deletion(-)
>>
>> Index: tip/kernel/sched.c
>> ===================================================================
>> --- tip.orig/kernel/sched.c
>> +++ tip/kernel/sched.c
>> @@ -9294,6 +9294,9 @@ static int tg_set_cfs_bandwidth(struct t
>> cfs_rq->runtime_enabled = quota != RUNTIME_INF;
>> cfs_rq->runtime_remaining = 0;
>> cfs_rq->runtime_expires = runtime_expires;
>> +
>> + if (cfs_rq_throttled(cfs_rq))
>> + unthrottle_cfs_rq(cfs_rq);
>> raw_spin_unlock_irq(&rq->lock);
>> }
>> out_unlock:
>> Index: tip/kernel/sched_fair.c
>> ===================================================================
>> --- tip.orig/kernel/sched_fair.c
>> +++ tip/kernel/sched_fair.c
>> @@ -1456,10 +1456,88 @@ static void check_enqueue_throttle(struc
>> throttle_cfs_rq(cfs_rq);
>> }
>>
>> +static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>> +{
>> + struct rq *rq = rq_of(cfs_rq);
>> + struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
>> + struct sched_entity *se;
>> + int enqueue = 1;
>> + long task_delta;
>> +
>> + se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
>> +
>> + cfs_rq->throttled = 0;
>> + raw_spin_lock(&cfs_b->lock);
>> + list_del_rcu(&cfs_rq->throttled_list);
>> + raw_spin_unlock(&cfs_b->lock);
>> +
>> + if (!cfs_rq->load.weight)
>> + return;
>> +
>> + task_delta = cfs_rq->h_nr_running;
>> + for_each_sched_entity(se) {
>> + if (se->on_rq)
>> + enqueue = 0;
>> +
>> + cfs_rq = cfs_rq_of(se);
>> + if (enqueue)
>> + enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
>> + cfs_rq->h_nr_running += task_delta;
>> +
>> + if (cfs_rq_throttled(cfs_rq))
>> + break;
>> + }
>> +
>> + if (!se)
>> + rq->nr_running += task_delta;
>> +
>> + /* determine whether we need to wake up potentially idle cpu */
>> + if (rq->curr == rq->idle && rq->cfs.nr_running)
>> + resched_task(rq->curr);
>> +}
>> +
>> +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
>> + u64 remaining, u64 expires)
>> +{
>> + struct cfs_rq *cfs_rq;
>> + u64 runtime = remaining;
>> +
>> + rcu_read_lock();
>> + list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
>> + throttled_list) {
>> + struct rq *rq = rq_of(cfs_rq);
>> +
>> + raw_spin_lock(&rq->lock);
>> + if (!cfs_rq_throttled(cfs_rq))
>> + goto next;
>> +
>> + runtime = -cfs_rq->runtime_remaining + 1;
>
> It will helpful if a comment can explain why negative and 1.
Remaining runtime of <= 0 implies that there was no bandwidth
available. See checks below et al. in check_... functions.
We choose the minimum amount here to return to a positive quota state.
Originally I had elected to take a full slice here. The limitation
became that this then effectively duplicated the assign_cfs_rq_runtime
path, and would require the quota handed out in each to be in
lockstep. Another trade-off is be that when we're in a large state of
arrears, handing out this extra bandwidth (in excess of the minimum
+1) up-front may prevent us from unthrottling another cfs_rq.
Will add a comment explaining that the minimum amount to leave arrears
is chosen above.
>
>> + if (runtime > remaining)
>> + runtime = remaining;
>> + remaining -= runtime;
>> +
>> + cfs_rq->runtime_remaining += runtime;
>> + cfs_rq->runtime_expires = expires;
>> +
>> + /* we check whether we're throttled above */
>> + if (cfs_rq->runtime_remaining > 0)
>> + unthrottle_cfs_rq(cfs_rq);
>> +
>> +next:
>> + raw_spin_unlock(&rq->lock);
>> +
>> + if (!remaining)
>> + break;
>> + }
>> + rcu_read_unlock();
>> +
>> + return remaining;
>> +}
>> +
>> static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>> {
>> u64 quota, runtime = 0, runtime_expires;
>> - int idle = 0;
>> + int idle = 0, throttled = 0;
>>
>> runtime_expires = sched_clock_cpu(smp_processor_id());
>>
>> @@ -1469,6 +1547,7 @@ static int do_sched_cfs_period_timer(str
>> if (quota != RUNTIME_INF) {
>> runtime = quota;
>> runtime_expires += ktime_to_ns(cfs_b->period);
>> + throttled = !list_empty(&cfs_b->throttled_cfs_rq);
>>
>> cfs_b->runtime = runtime;
>> cfs_b->runtime_expires = runtime_expires;
>> @@ -1477,6 +1556,30 @@ static int do_sched_cfs_period_timer(str
>> }
>> raw_spin_unlock(&cfs_b->lock);
>>
>> + if (!throttled || quota == RUNTIME_INF)
>> + goto out;
>> + idle = 0;
>> +
>> +retry:
>> + runtime = distribute_cfs_runtime(cfs_b, runtime, runtime_expires);
>> +
>> + raw_spin_lock(&cfs_b->lock);
>> + /* new new bandwidth may have been set */
>
> Typo? new, newer, newest...?
>
s/new new/new/ :)
>> + if (unlikely(runtime_expires != cfs_b->runtime_expires))
>> + goto out_unlock;
>> + /*
>> + * make sure no-one was throttled while we were handing out the new
>> + * runtime.
>> + */
>> + if (runtime > 0 && !list_empty(&cfs_b->throttled_cfs_rq)) {
>> + raw_spin_unlock(&cfs_b->lock);
>> + goto retry;
>> + }
>> + cfs_b->runtime = runtime;
>> + cfs_b->idle = idle;
>> +out_unlock:
>> + raw_spin_unlock(&cfs_b->lock);
>> +out:
>> return idle;
>> }
>> #else
>
> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
>
> It would be better if this unthrottle patch (09/15) comes before
> throttle patch (08/15) in this series, not to make a small window
> in the history that throttled entity never back to the run queue.
> But I'm just paranoid...
>
The feature is inert unless bandwidth is set so this should be safe.
The trade-off with reversing the order is that a patch undoing state
that doesn't yet exist looks very strange :). If the above is a
concern I'd probably prefer to separate it into 3 parts:
1. add throttle
2. add unthrottle
3. enable throttle
Where (3) would consist only of the enqueue/put checks to trigger throttling.
>
> Thanks,
> H.Seto
>
>
next prev parent reply other threads:[~2011-05-11 16:30 UTC|newest]
Thread overview: 129+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-05-03 9:28 [patch 00/15] CFS Bandwidth Control V6 Paul Turner
2011-05-03 9:28 ` [patch 01/15] sched: (fixlet) dont update shares twice on on_rq parent Paul Turner
2011-05-10 7:14 ` Hidetoshi Seto
2011-05-10 8:32 ` Mike Galbraith
2011-05-11 7:55 ` Hidetoshi Seto
2011-05-11 8:13 ` Paul Turner
2011-05-11 8:45 ` Mike Galbraith
2011-05-11 8:59 ` Hidetoshi Seto
2011-05-03 9:28 ` [patch 02/15] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
2011-05-10 7:17 ` Hidetoshi Seto
2011-05-03 9:28 ` [patch 03/15] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
2011-05-10 7:18 ` Hidetoshi Seto
2011-05-03 9:28 ` [patch 04/15] sched: validate CFS quota hierarchies Paul Turner
2011-05-10 7:20 ` Hidetoshi Seto
2011-05-11 9:37 ` Paul Turner
2011-05-16 9:30 ` Peter Zijlstra
2011-05-16 9:43 ` Peter Zijlstra
2011-05-16 12:32 ` Paul Turner
2011-05-17 15:26 ` Peter Zijlstra
2011-05-18 7:16 ` Paul Turner
2011-05-18 11:57 ` Peter Zijlstra
2011-05-03 9:28 ` [patch 05/15] sched: add a timer to handle CFS bandwidth refresh Paul Turner
2011-05-10 7:21 ` Hidetoshi Seto
2011-05-11 9:27 ` Paul Turner
2011-05-16 10:18 ` Peter Zijlstra
2011-05-16 12:56 ` Paul Turner
2011-05-03 9:28 ` [patch 06/15] sched: accumulate per-cfs_rq cpu usage and charge against bandwidth Paul Turner
2011-05-10 7:22 ` Hidetoshi Seto
2011-05-11 9:25 ` Paul Turner
2011-05-16 10:27 ` Peter Zijlstra
2011-05-16 12:59 ` Paul Turner
2011-05-17 15:28 ` Peter Zijlstra
2011-05-18 7:02 ` Paul Turner
2011-05-16 10:32 ` Peter Zijlstra
2011-05-03 9:28 ` [patch 07/15] sched: expire invalid runtime Paul Turner
2011-05-10 7:22 ` Hidetoshi Seto
2011-05-16 11:05 ` Peter Zijlstra
2011-05-16 11:07 ` Peter Zijlstra
2011-05-03 9:28 ` [patch 08/15] sched: throttle cfs_rq entities which exceed their local runtime Paul Turner
2011-05-10 7:23 ` Hidetoshi Seto
2011-05-16 15:58 ` Peter Zijlstra
2011-05-16 16:05 ` Peter Zijlstra
2011-05-03 9:28 ` [patch 09/15] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
2011-05-10 7:24 ` Hidetoshi Seto
2011-05-11 9:24 ` Paul Turner [this message]
2011-05-03 9:28 ` [patch 10/15] sched: allow for positional tg_tree walks Paul Turner
2011-05-10 7:24 ` Hidetoshi Seto
2011-05-17 13:31 ` Peter Zijlstra
2011-05-18 7:18 ` Paul Turner
2011-05-03 9:28 ` [patch 11/15] sched: prevent interactions between throttled entities and load-balance Paul Turner
2011-05-10 7:26 ` Hidetoshi Seto
2011-05-11 9:11 ` Paul Turner
2011-05-03 9:28 ` [patch 12/15] sched: migrate throttled tasks on HOTPLUG Paul Turner
2011-05-10 7:27 ` Hidetoshi Seto
2011-05-11 9:10 ` Paul Turner
2011-05-03 9:28 ` [patch 13/15] sched: add exports tracking cfs bandwidth control statistics Paul Turner
2011-05-10 7:27 ` Hidetoshi Seto
2011-05-11 7:56 ` Hidetoshi Seto
2011-05-11 9:09 ` Paul Turner
2011-05-03 9:29 ` [patch 14/15] sched: return unused runtime on voluntary sleep Paul Turner
2011-05-10 7:28 ` Hidetoshi Seto
2011-05-03 9:29 ` [patch 15/15] sched: add documentation for bandwidth control Paul Turner
2011-05-10 7:29 ` Hidetoshi Seto
2011-05-11 9:09 ` Paul Turner
2011-06-07 15:45 ` CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned Kamalesh Babulal
2011-06-08 3:09 ` Paul Turner
2011-06-08 10:46 ` Vladimir Davydov
2011-06-08 16:32 ` Kamalesh Babulal
2011-06-09 3:25 ` Paul Turner
2011-06-10 18:17 ` Kamalesh Babulal
2011-06-14 0:00 ` Paul Turner
2011-06-15 5:37 ` Kamalesh Babulal
2011-06-21 19:48 ` Paul Turner
2011-06-24 15:05 ` Kamalesh Babulal
2011-09-07 11:00 ` Srivatsa Vaddagiri
2011-09-07 14:54 ` Srivatsa Vaddagiri
2011-09-07 15:20 ` CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinnede Srivatsa Vaddagiri
2011-09-07 19:22 ` Peter Zijlstra
2011-09-08 15:15 ` Srivatsa Vaddagiri
2011-09-09 12:31 ` Peter Zijlstra
2011-09-09 13:26 ` Srivatsa Vaddagiri
2011-09-12 10:17 ` Srivatsa Vaddagiri
2011-09-12 12:35 ` Peter Zijlstra
2011-09-13 4:15 ` Srivatsa Vaddagiri
2011-09-13 5:03 ` Srivatsa Vaddagiri
2011-09-13 5:05 ` Srivatsa Vaddagiri
2011-09-13 9:39 ` Peter Zijlstra
2011-09-13 11:28 ` Srivatsa Vaddagiri
2011-09-13 14:07 ` Peter Zijlstra
2011-09-13 16:21 ` Srivatsa Vaddagiri
2011-09-13 16:33 ` Peter Zijlstra
2011-09-13 17:41 ` Srivatsa Vaddagiri
2011-09-13 16:36 ` Peter Zijlstra
2011-09-13 17:54 ` Srivatsa Vaddagiri
2011-09-13 18:03 ` Peter Zijlstra
2011-09-13 18:12 ` Srivatsa Vaddagiri
2011-09-13 18:07 ` Peter Zijlstra
2011-09-13 18:19 ` Peter Zijlstra
2011-09-13 18:28 ` Srivatsa Vaddagiri
2011-09-13 18:30 ` Peter Zijlstra
2011-09-13 18:35 ` Srivatsa Vaddagiri
2011-09-15 17:55 ` Kamalesh Babulal
2011-09-15 21:48 ` Peter Zijlstra
2011-09-19 17:51 ` Kamalesh Babulal
2011-09-20 0:38 ` Venki Pallipadi
2011-09-20 11:09 ` Kamalesh Babulal
2011-09-20 13:56 ` Peter Zijlstra
2011-09-20 14:04 ` Peter Zijlstra
2011-09-20 12:55 ` Peter Zijlstra
2011-09-21 17:34 ` Kamalesh Babulal
2011-09-13 14:19 ` Peter Zijlstra
2011-09-13 18:01 ` Srivatsa Vaddagiri
2011-09-13 18:23 ` Peter Zijlstra
2011-09-16 8:14 ` Paul Turner
2011-09-16 8:28 ` Peter Zijlstra
2011-09-19 16:35 ` Srivatsa Vaddagiri
2011-09-16 8:22 ` Paul Turner
2011-06-14 10:16 ` CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned Hidetoshi Seto
2011-06-14 6:58 ` [patch 00/15] CFS Bandwidth Control V6 Hu Tao
2011-06-14 7:29 ` Hidetoshi Seto
2011-06-14 7:44 ` Hu Tao
2011-06-15 8:37 ` Hu Tao
2011-06-16 0:57 ` Hidetoshi Seto
2011-06-16 9:45 ` Hu Tao
2011-06-17 1:22 ` Hidetoshi Seto
2011-06-17 6:05 ` Hu Tao
2011-06-17 6:25 ` Paul Turner
2011-06-17 9:13 ` Hidetoshi Seto
2011-06-18 0:28 ` Paul Turner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='BANLkTi=NmCxKX6EbDQcJYDJ5kKyG2N1ssw@mail.gmail.com' \
--to=pjt@google.com \
--cc=a.p.zijlstra@chello.nl \
--cc=balbir@linux.vnet.ibm.com \
--cc=bharata@linux.vnet.ibm.com \
--cc=dhaval.giani@gmail.com \
--cc=kamalesh@linux.vnet.ibm.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@elte.hu \
--cc=ncrao@google.com \
--cc=seto.hidetoshi@jp.fujitsu.com \
--cc=svaidy@linux.vnet.ibm.com \
--cc=vatsa@in.ibm.com \
--cc=xemul@openvz.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).