From: bsegall@google.com
To: Dave Chiluk <chiluk+linux@indeed.com>
Cc: Phil Auld <pauld@redhat.com>, Peter Oskolkov <posk@posk.io>,
Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>,
cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
Brendan Gregg <bgregg@netflix.com>, Kyle Anderson <kwa@yelp.com>,
Gabriel Munos <gmunoz@netflix.com>,
John Hammond <jhammond@indeed.com>,
Cong Wang <xiyou.wangcong@gmail.com>
Subject: Re: [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining.
Date: Mon, 24 Jun 2019 10:33:07 -0700
Message-ID: <xm26tvcex50s.fsf@bsegall-linux.svl.corp.google.com>
In-Reply-To: <1561391404-14450-2-git-send-email-chiluk+linux@indeed.com> (Dave Chiluk's message of "Mon, 24 Jun 2019 10:50:04 -0500")
Dave Chiluk <chiluk+linux@indeed.com> writes:
> It has been observed, that highly-threaded, user-interactive
> applications running under cpu.cfs_quota_us constraints can hit a high
> percentage of periods throttled while simultaneously not consuming the
> allocated amount of quota. This impacts user-interactive non-cpu bound
> applications, such as those running in kubernetes or mesos when run on
> multiple cores.
>
> This has been root caused to threads being allocated per cpu bandwidth
> slices, and then not fully using that slice within the period. This
> results in min_cfs_rq_runtime remaining on each per-cpu cfs_rq. At the
> end of the period this remaining quota goes unused and expires. This
> expiration of unused time on per-cpu runqueues results in applications
> under-utilizing their quota while simultaneously hitting throttling.
>
> The solution is to return all spare cfs_rq->runtime_remaining when
> cfs_b->runtime nears the sched_cfs_bandwidth_slice. This balances the
> desire to prevent cfs_rq from always pulling quota with the desire to
> allow applications to fully utilize their quota.
>
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
> ---
> kernel/sched/fair.c | 19 ++++++++++++++++---
> 1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..4894eda 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4695,7 +4695,9 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
>  	return 1;
>  }
>
> -/* a cfs_rq won't donate quota below this amount */
> +/* a cfs_rq won't donate quota below this amount unless cfs_b has very little
> + * remaining runtime.
> + */
>  static const u64 min_cfs_rq_runtime = 1 * NSEC_PER_MSEC;
>  /* minimum remaining period time to redistribute slack quota */
>  static const u64 min_bandwidth_expiration = 2 * NSEC_PER_MSEC;
> @@ -4743,16 +4745,27 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
>  static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  {
>  	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> -	s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;
> +	s64 slack_runtime = cfs_rq->runtime_remaining;
>
> +	/* There is no runtime to return. */
>  	if (slack_runtime <= 0)
>  		return;
>
>  	raw_spin_lock(&cfs_b->lock);
>  	if (cfs_b->quota != RUNTIME_INF &&
>  	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> -		cfs_b->runtime += slack_runtime;
> +		/* As we near 0 quota remaining on cfs_b start returning all
> +		 * remaining runtime. This avoids stranding and then expiring
> +		 * runtime on per-cpu cfs_rq.
> +		 *
> +		 * cfs->b has plenty of runtime leave min_cfs_rq_runtime of
> +		 * runtime on this cfs_rq.
> +		 */
> +		if (cfs_b->runtime >= sched_cfs_bandwidth_slice() * 3 &&
> +		    slack_runtime > min_cfs_rq_runtime)
> +			slack_runtime -= min_cfs_rq_runtime;
>
> +		cfs_b->runtime += slack_runtime;
>  		/* we are under rq->lock, defer unthrottling using a timer */
>  		if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
>  		    !list_empty(&cfs_b->throttled_cfs_rq))

This still has a similar cost to reducing min_cfs_rq_runtime to 0 - we
now take a tg-global lock on every group se dequeue. Setting min=0 means
that we have to take it on both enqueue and dequeue, while baseline
means we take it once per min_cfs_rq_runtime in the worst case.

In addition, how much this helps is very dependent on the exact pattern
of sleep/wake - you can still strand all but 15ms of runtime with a
pretty reasonable pattern.

If the cost of taking this global lock across all cpus without a
ratelimit was somehow not a problem, I'd much prefer to just set
min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie
and sorta have 2x period 2x runtime" solution of removing expiration.)
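
For illustration only, the locking difference described above can be
modelled in userspace. This is a rough sketch, not kernel code: it
mirrors just the early-return and slack arithmetic from the quoted diff,
assumes the quota and runtime_expires checks pass, and assumes a 5ms
slice (the usual default; the real value comes from a sysctl). The
struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000LL

/* Value taken from the quoted patch. */
static const int64_t min_cfs_rq_runtime = 1 * NSEC_PER_MSEC;
/* Assumption: stands in for sched_cfs_bandwidth_slice(), default 5ms. */
static const int64_t bandwidth_slice = 5 * NSEC_PER_MSEC;

/* Hypothetical result type for this model. */
struct return_result {
	bool takes_lock;  /* would cfs_b->lock have been taken? */
	int64_t returned; /* ns handed back to the global pool */
};

/* Baseline: anything at or below min_cfs_rq_runtime bails out before
 * the lock, so the lock is only taken when there is real slack. */
static struct return_result return_baseline(int64_t rq_remaining)
{
	struct return_result r = { false, 0 };
	int64_t slack = rq_remaining - min_cfs_rq_runtime;

	if (slack <= 0)
		return r;
	r.takes_lock = true;
	r.returned = slack;
	return r;
}

/* v4: any positive remainder reaches the lock; min_cfs_rq_runtime is
 * kept back only while the global pool holds at least three slices. */
static struct return_result return_v4(int64_t rq_remaining,
				      int64_t pool_runtime)
{
	struct return_result r = { false, 0 };
	int64_t slack = rq_remaining;

	if (slack <= 0)
		return r;
	r.takes_lock = true;
	if (pool_runtime >= bandwidth_slice * 3 &&
	    slack > min_cfs_rq_runtime)
		slack -= min_cfs_rq_runtime;
	r.returned = slack;
	return r;
}
```

With 0.5ms left on the cfs_rq, return_baseline() never reaches the lock,
while return_v4() takes it and returns the 0.5ms even when the pool is
full. That is the per-dequeue cost in question.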