Re: [PATCH] sched: initialize runtime to non-zero on cfs bw set

From: Paul Turner <pjt@google.com>
To: Vladimir Davydov <vdavydov@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	devel@openvz.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched: initialize runtime to non-zero on cfs bw set
Date: Fri, 8 Feb 2013 06:46:01 -0800
Message-ID: <20130208144601.GA13327@google.com>
In-Reply-To: <1360307446-26978-1-git-send-email-vdavydov@parallels.com>

On Fri, Feb 08, 2013 at 11:10:46AM +0400, Vladimir Davydov wrote:
> If cfs_rq->runtime_remaining is <= 0 then either
> - cfs_rq is throttled and waiting for quota redistribution, or
> - cfs_rq is currently executing and will be throttled on
>   put_prev_entity, or
> - cfs_rq is not throttled and has not executed since its quota was set
>   (runtime_remaining is set to 0 on cfs bandwidth reconfiguration).
> 
> It is obvious that the last case is rather an exception from the rule
> "runtime_remaining<=0 iff cfs_rq is throttled or will be throttled as
> soon as it finishes its execution". Moreover, it can lead to a task hang
> as follows. If put_prev_task is called immediately after first
> pick_next_task after quota was set, "immediately" meaning rq->clock in
> both functions is the same, then the corresponding cfs_rq will be
> throttled. Besides being unfair (the cfs_rq has not executed in fact),
> the quota refilling timer can be idle at that time and it won't be
> activated on put_prev_task because update_curr calls
> account_cfs_rq_runtime, which activates the timer, only if delta_exec is
> strictly positive. As a result we can get a task "running" inside a
> throttled cfs_rq which will probably never be unthrottled.
> 
> To avoid the problem, the patch makes tg_set_cfs_bandwidth initialize
> runtime_remaining of each cfs_rq to 1 instead of 0 so that the cfs_rq
> will be throttled only if it has executed for some positive number of
> nanoseconds.
> --
> Several times we had our customers encountered such hangs inside a VM
> (seems something is wrong or rather different in time accounting there).

Yeah, looks like!

It's not ultimately _super_ shocking; I can think of a few places where such
gremlins could lurk, if they caused enough problems for someone to really go
digging.
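
For readers following along, here is a minimal user-space sketch of the hang
described in the quoted text above. It is a model only, not kernel code: the
model_* names and structs are hypothetical stand-ins, and timer activation is
reduced to a single flag set by the accounting path.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structures; not kernel code. */
struct model_cfs_b {
	bool timer_active;		/* is the quota refill timer armed? */
};

struct model_cfs_rq {
	int64_t runtime_remaining;
	bool throttled;
	struct model_cfs_b *b;
};

/*
 * Models the described update_curr() behaviour: accounting, and with it
 * timer activation, only happens when delta_exec is strictly positive.
 */
static void model_update_curr(struct model_cfs_rq *rq, uint64_t delta_exec)
{
	if (delta_exec == 0)
		return;
	rq->runtime_remaining -= (int64_t)delta_exec;
	rq->b->timer_active = true;
}

/* Models the put_prev_entity() path: throttle when no runtime is left. */
static void model_put_prev(struct model_cfs_rq *rq, uint64_t delta_exec)
{
	model_update_curr(rq, delta_exec);
	if (rq->runtime_remaining <= 0)
		rq->throttled = true;
}

/*
 * Runs one put_prev with zero elapsed time after the quota was set.
 * Returns true if the cfs_rq ends up throttled with the timer still idle.
 */
static bool model_hangs(int64_t initial_runtime)
{
	struct model_cfs_b b = { .timer_active = false };
	struct model_cfs_rq rq = {
		.runtime_remaining = initial_runtime,
		.throttled = false,
		.b = &b,
	};

	model_put_prev(&rq, 0);
	return rq.throttled && !b.timer_active;
}
```

In this model, model_hangs(0) reports the stuck state (throttled, timer idle),
while model_hangs(1) does not, which is the effect the one-line patch relies on.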

> Analyzing crash dumps revealed that hung tasks were running inside
> cfs_rq's, which had the following setup
> 
> cfs_rq->throttled=1
> cfs_rq->runtime_enabled=1
> cfs_rq->runtime_remaining=0
> cfs_rq->tg->cfs_bandwidth.idle=1
> cfs_rq->tg->cfs_bandwidth.timer_active=0
> 
> which conforms pretty nice to the explanation given above.
> 
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> ---
>  kernel/sched/core.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 26058d0..c7a078f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7686,7 +7686,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>  
>  		raw_spin_lock_irq(&rq->lock);
>  		cfs_rq->runtime_enabled = runtime_enabled;
> -		cfs_rq->runtime_remaining = 0;
> +		cfs_rq->runtime_remaining = 1;

So I agree this is reasonably correct and would fix the issue identified.
However, one concern is that it would potentially grant a tick of execution
time on all cfs_rqs, which could result in large quota violations on a
many-core machine.  One trick then would be to give them "expired" quota, which
would be safe against put_prev_entity->check_cfs_rq_runtime, e.g.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1dff78a..4369231 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7687,7 +7687,17 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 
 		raw_spin_lock_irq(&rq->lock);
 		cfs_rq->runtime_enabled = runtime_enabled;
-		cfs_rq->runtime_remaining = 0;
+		/*
+		 * On re-definition of bandwidth values we allocate a trivial
+		 * amount of already expired quota.  This guarantees that
+		 * put_prev_entity() cannot lead to a throttle event before we
+		 * have seen a call to account_cfs_rq_runtime(); while not being
+		 * usable by newly waking, or set_curr_task_fair-ing, cpus
+		 * since it would be immediately expired, requiring
+		 * reassignment.
+		 */
+		cfs_rq->runtime_remaining = 1;
+		cfs_rq->runtime_expires = rq_of(cfs_rq)->clock - 1;
 
 		if (cfs_rq->throttled)
 			unthrottle_cfs_rq(cfs_rq);
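
The intent of the expired-quota trick in the patch above can be sketched in
user space as follows. This is a simplified model under stated assumptions: the
exp_* names are hypothetical, and expiry is reduced to a plain "clock has
reached runtime_expires" comparison rather than the kernel's actual expiry
logic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a cfs_rq; not kernel code. */
struct exp_cfs_rq {
	int64_t runtime_remaining;
	uint64_t runtime_expires;	/* clock value at which local quota goes stale */
	bool throttled;
};

/* Seed as the patch does: 1ns of quota that expired 1ns ago. */
static void exp_seed(struct exp_cfs_rq *rq, uint64_t clock)
{
	rq->runtime_remaining = 1;
	rq->runtime_expires = clock - 1;
	rq->throttled = false;
}

/* put_prev_entity()-style check: throttle only if nothing is left locally. */
static void exp_check_runtime(struct exp_cfs_rq *rq)
{
	if (rq->runtime_remaining <= 0)
		rq->throttled = true;
}

/*
 * Accounting path: charge delta_exec, then discard stale local quota.
 * Returns true when fresh quota must be requested from the global pool.
 */
static bool exp_account(struct exp_cfs_rq *rq, uint64_t clock,
			uint64_t delta_exec)
{
	rq->runtime_remaining -= (int64_t)delta_exec;
	if (rq->runtime_remaining > 0 && clock >= rq->runtime_expires)
		rq->runtime_remaining = 0;	/* expired, so not usable */
	return rq->runtime_remaining <= 0;
}
```

After exp_seed(), exp_check_runtime() leaves the runqueue unthrottled because
runtime_remaining is 1, yet the first exp_account() call, even with a zero
delta, discards that nanosecond and asks for reassignment, so no free
execution time is granted.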

A perhaps more explicit approach that should be more consistent would be to
properly allocate bandwidth in the first place.  Something like (compile
tested):

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1dff78a..9646c01 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7682,6 +7682,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	raw_spin_unlock_irq(&cfs_b->lock);
 
 	for_each_possible_cpu(i) {
+		bool exhausted = false;
 		struct cfs_rq *cfs_rq = tg->cfs_rq[i];
 		struct rq *rq = cfs_rq->rq;
 
@@ -7689,9 +7690,27 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 		cfs_rq->runtime_enabled = runtime_enabled;
 		cfs_rq->runtime_remaining = 0;
 
+		/*
+		 * We know there's bandwidth remaining (since this loop would
+		 * have otherwise terminated), so we can unthrottle up-front.
+		 */
 		if (cfs_rq->throttled)
 			unthrottle_cfs_rq(cfs_rq);
+
+		if (cfs_rq->curr) {
+			/* cfs_rq is currently running, force an update */
+			account_cfs_rq_runtime(cfs_rq, 0);
+			/* If we were unable to allocate runtime then:
+			 * (a) We've sent a reschedule against cpu i
+			 * (b) There is no point in visiting further cpus as we
+			 *     have exhausted our new quota.
+			 */
+			if (!cfs_rq->runtime_remaining)
+				exhausted = true;
+		}
 		raw_spin_unlock_irq(&rq->lock);
+		if (exhausted)
+			break;
 	}
 out_unlock:
 	mutex_unlock(&cfs_constraints_mutex);
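
The early-exit loop in the patch above can be modelled in user space roughly
as follows. This is a sketch under stated assumptions, not the kernel's
allocator: MODEL_SLICE, struct pool, struct pcpu_rq and both functions are
hypothetical, and the pool hands out fixed-size slices until it runs dry.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_SLICE 5	/* hypothetical per-request slice, arbitrary units */

/* Hypothetical global pool: quota left in the current period. */
struct pool {
	int64_t runtime;
};

/* Hypothetical per-cpu runqueue. */
struct pcpu_rq {
	bool running;			/* models cfs_rq->curr != NULL */
	int64_t runtime_remaining;
};

/* Hand out up to one slice from the pool; returns the amount granted. */
static int64_t pool_assign(struct pool *p)
{
	int64_t amount = p->runtime < MODEL_SLICE ? p->runtime : MODEL_SLICE;

	p->runtime -= amount;
	return amount;
}

/*
 * Walk the runqueues, allocating to the running ones, and stop as soon as
 * an allocation comes back empty.  Returns how many runqueues were visited.
 */
static int distribute(struct pool *p, struct pcpu_rq *rqs, int nr)
{
	int visited = 0;

	for (int i = 0; i < nr; i++) {
		visited++;
		rqs[i].runtime_remaining = 0;
		if (!rqs[i].running)
			continue;
		rqs[i].runtime_remaining = pool_assign(p);
		if (!rqs[i].runtime_remaining)
			break;	/* pool exhausted: skip the remaining cpus */
	}
	return visited;
}
```

With a pool of 12 units and five running runqueues, the grants in this model
are 5, 5 and 2; the fourth request comes back empty and the loop stops there.
One consequence visible in the model is that runqueues after the break are
never visited, so any per-runqueue state set earlier in the loop body is not
updated for them.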


That said, I actually thought of the first patch (i.e. explicitly using expired
quota) after I wrote the second.  It's perhaps more subtle, but not
unreasonable.  Any thoughts?

Thanks for the report,

- Paul
>  		if (cfs_rq->throttled)
>  			unthrottle_cfs_rq(cfs_rq);
> -- 
> 1.7.1
>
