All of lore.kernel.org
 help / color / mirror / Atom feed
From: bsegall@google.com
To: Huaixin Chang <changhuaixin@linux.alibaba.com>
Cc: linux-kernel@vger.kernel.org, shanpeic@linux.alibaba.com,
	yun.wang@linux.alibaba.com, xlpang@linux.alibaba.com,
	peterz@infradead.org, mingo@redhat.com, bsegall@google.com,
	chiluk+linux@indeed.com, vincent.guittot@linaro.org
Subject: Re: [PATCH v2] sched: Fix race between runtime distribution and assignment
Date: Thu, 26 Mar 2020 10:27:59 -0700	[thread overview]
Message-ID: <xm26v9mrni00.fsf@google.com> (raw)
In-Reply-To: <20200326065606.19193-1-changhuaixin@linux.alibaba.com> (Huaixin Chang's message of "Thu, 26 Mar 2020 14:56:06 +0800")

Huaixin Chang <changhuaixin@linux.alibaba.com> writes:

> Currently, there is a potential race between distribute_cfs_runtime()
> and assign_cfs_rq_runtime(). Race happens when cfs_b->runtime is read,
> distributes without holding lock and finds out there is not enough
> runtime to charge against after distribution. Because
> assign_cfs_rq_runtime() might be called during distribution, and use
> cfs_b->runtime at the same time.
>
> Fibtest is the tool to test this race. Assume all gcfs_rq is throttled
> and cfs period timer runs, slow threads might run and sleep, returning
> unused cfs_rq runtime and keeping min_cfs_rq_runtime in their local
> pool. If all this happens sufficiently quickly, cfs_b->runtime will drop
> a lot. If runtime distributed is large too, over-use of runtime happens.
>
> A runtime over-using by about 70 percent of quota is seen when we
> test fibtest on a 96-core machine. We run fibtest with 1 fast thread and
> 95 slow threads in test group, configure 10ms quota for this group and
> see the CPU usage of fibtest is 17.0%, which is far more than the
> expected 10%.
>
> On a smaller machine with 32 cores, we also run fibtest with 96
> threads. CPU usage is more than 12%, which is also more than expected
> 10%. This shows that on similar workloads, this race do affect CPU
> bandwidth control.
>
> Solve this by holding lock inside distribute_cfs_runtime().

Some minor requests below, other than that

Reviewed-by: Ben Segall <bsegall@google.com>

>
> Fixes: c06f04c70489 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
> ---
> v2: fix spelling, initialize variable rumaining in distribute_cfs_runtime()
> ---
>  kernel/sched/fair.c | 31 ++++++++++++-------------------
>  1 file changed, 12 insertions(+), 19 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c1217bfe5e81..fd30e06a7ffc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4629,11 +4629,11 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>  		resched_curr(rq);
>  }
>  
> -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
> +static void distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
>  {
>  	struct cfs_rq *cfs_rq;
> -	u64 runtime;
> -	u64 starting_runtime = remaining;
> +	u64 runtime, remaining = 1;
> +	unsigned long flags;
>  
>  	rcu_read_lock();
>  	list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
> @@ -4648,10 +4648,13 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
>  		/* By the above check, this should never be true */
>  		SCHED_WARN_ON(cfs_rq->runtime_remaining > 0);
>  
> +		raw_spin_lock_irqsave(&cfs_b->lock, flags);

No need for _irqsave/_irqrestore, the rqlock already did.

>  		runtime = -cfs_rq->runtime_remaining + 1;
> -		if (runtime > remaining)
> -			runtime = remaining;
> -		remaining -= runtime;
> +		if (runtime > cfs_b->runtime)
> +			runtime = cfs_b->runtime;
> +		cfs_b->runtime -= runtime;
> +		remaining = cfs_b->runtime;
> +		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>  
>  		cfs_rq->runtime_remaining += runtime;
>  
> @@ -4666,8 +4669,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
>  			break;
>  	}
>  	rcu_read_unlock();
> -
> -	return starting_runtime - remaining;
>  }
>  
>  /*
> @@ -4678,7 +4679,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
>   */
>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
>  {
> -	u64 runtime;
>  	int throttled;
>  
>  	/* no need to continue the timer with no bandwidth constraint */
> @@ -4708,23 +4708,17 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
>  
>  	/*
>  	 * This check is repeated as we are holding onto the new bandwidth while
> -	 * we unthrottle. This can potentially race with an unthrottled group
> -	 * trying to acquire new bandwidth from the global pool. This can result
> -	 * in us over-using our runtime if it is all used during this loop, but
> -	 * only by limited amounts in that extreme case.
> +	 * we unthrottle.

"This check is repeated as we release cfs_b->lock while we unthrottle."
or something like that. This code is no longer even holding onto the new
bandwidth on its own.

>  	 */
>  	while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
> -		runtime = cfs_b->runtime;
>  		cfs_b->distribute_running = 1;
>  		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>  		/* we can't nest cfs_b->lock while distributing bandwidth */
> -		runtime = distribute_cfs_runtime(cfs_b, runtime);
> +		distribute_cfs_runtime(cfs_b);
>  		raw_spin_lock_irqsave(&cfs_b->lock, flags);
>  
>  		cfs_b->distribute_running = 0;
>  		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
> -
> -		lsub_positive(&cfs_b->runtime, runtime);
>  	}
>  
>  	/*
> @@ -4858,10 +4852,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  	if (!runtime)
>  		return;
>  
> -	runtime = distribute_cfs_runtime(cfs_b, runtime);
> +	distribute_cfs_runtime(cfs_b);
>  
>  	raw_spin_lock_irqsave(&cfs_b->lock, flags);
> -	lsub_positive(&cfs_b->runtime, runtime);
>  	cfs_b->distribute_running = 0;
>  	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>  }

  reply	other threads:[~2020-03-26 17:28 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-25  9:26 [PATCH] alios: sched: Fix race between runtime distribution and assignment Huaixin Chang
2020-03-26  3:28 ` kbuild test robot
2020-03-26  6:56 ` [PATCH v2] " Huaixin Chang
2020-03-26 17:27   ` bsegall [this message]
2020-03-27  3:26 ` [PATCH v3] sched/fair: " Huaixin Chang
2020-03-30 10:44   ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xm26v9mrni00.fsf@google.com \
    --to=bsegall@google.com \
    --cc=changhuaixin@linux.alibaba.com \
    --cc=chiluk+linux@indeed.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=shanpeic@linux.alibaba.com \
    --cc=vincent.guittot@linaro.org \
    --cc=xlpang@linux.alibaba.com \
    --cc=yun.wang@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.