Re: [RFC] sched/fair: hard lockup in sched_cfs_period_timer

public inbox for linux-kernel@vger.kernel.org

From: Phil Auld <pauld@redhat.com>
To: bsegall@google.com
Cc: mingo@redhat.com, peterz@infradead.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
Date: Sat, 9 Mar 2019 15:33:21 -0500
Message-ID: <20190309203320.GA24464@lorien.usersys.redhat.com>
In-Reply-To: <xm26wolbyfe9.fsf@bsegall-linux.svl.corp.google.com>

On Wed, Mar 06, 2019 at 11:25:02AM -0800 bsegall@google.com wrote:
> Phil Auld <pauld@redhat.com> writes:
> 
> > On Tue, Mar 05, 2019 at 12:45:34PM -0800 bsegall@google.com wrote:
> >> Phil Auld <pauld@redhat.com> writes:
> >> 
> >> > Interestingly, if I limit the number of child cgroups to the number of 
> >> > them I'm actually putting processes into (16 down from 2500) the problem
> >> > does not reproduce.
> >> 
> >> That is indeed interesting, and definitely not something we'd want to
> >> matter. (Particularly if it's not root->a->b->c...->throttled_cgroup or
> >> root->throttled->a->...->thread vs root->throttled_cgroup, which is what
> >> I was originally thinking of)
> >> 
> >
> > The locking may be a red herring.
> >
> > The setup is root->throttled->a where a is 1-2500. There are 4 threads in
> > each of the first 16 a groups.  The parent, throttled, is where the 
> > cfs_period/quota_us are set. 
> >
> > I wonder if the problem is the walk_tg_tree_from() call in unthrottle_cfs_rq(). 
> >
> > distribute_cfs_runtime() looks to be O(n * m), where n is the number of
> > throttled cfs_rqs and m is the number of child cgroups. But I'm not 
> > completely clear on how the hierarchical cgroups play together here. 
> >
> > I'll pull on this thread some. 
> >
> > Thanks for your input.
> >
> >
> > Cheers,
> > Phil
> 
> Yeah, that isn't under the cfs_b lock, but is still part of distribute
> (and under rq lock, which might also matter). I was thinking too much
> about just the cfs_b regions. I'm not sure there's any good general
> optimization there.
>

It's really an edge case, but the watchdog NMI is pretty painful.

> I suppose cfs_rqs (tgs/cfs_bs?) could have "nearest
> ancestor with a quota" pointer and ones with quota could have
> "descendants with quota" list, parallel to the children/parent lists of
> tgs. Then throttle/unthrottle would only have to visit these lists, and
> child cgroups/cfs_rqs without their own quotas would just check
> cfs_rq->nearest_quota_cfs_rq->throttle_count. throttled_clock_task_time
> can also probably be tracked there.
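
To make sure I am reading the suggestion right, here is a rough user-space
sketch of the "nearest ancestor with a quota" pointer. All names here
(tg_node, tg_init, tg_throttled) are hypothetical and not kernel code, and
it leaves out the "descendants with quota" list, so it only covers a single
level of quota such as the root->throttled->a setup described above.

```c
#include <stddef.h>

/* Hypothetical stand-in for a task group / cfs_rq pair. */
struct tg_node {
	struct tg_node *parent;
	struct tg_node *nearest_quota;	/* self if this node has a quota */
	int has_quota;
	int throttle_count;		/* only meaningful on quota nodes */
};

/* Set up a node; the parent must already be initialized. */
static void tg_init(struct tg_node *n, struct tg_node *parent, int has_quota)
{
	n->parent = parent;
	n->has_quota = has_quota;
	n->throttle_count = 0;
	if (has_quota)
		n->nearest_quota = n;
	else
		n->nearest_quota = parent ? parent->nearest_quota : NULL;
}

/* O(1) check: no walk over the children is needed. */
static int tg_throttled(const struct tg_node *n)
{
	return n->nearest_quota && n->nearest_quota->throttle_count > 0;
}
```

With this shape, throttling the parent is one counter update no matter how
many quota-less children hang off it; each child looks the state up through
its pointer when it needs it.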

That seems like it would add a lot of complexity for this edge case. Maybe
it would be acceptable to use the safety valve like my first example, or
something like the below, which scales the period up until the timer no
longer overruns indefinitely.  The downside of this one is that it does
change the user's settings, but that could be preferable to an NMI crash.
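
The scaling arithmetic on its own, as a minimal user-space sketch rather
than the kernel code: autotune_scale_up() and max_period_ns are names made
up for illustration, and the 1s cap is assumed to match
max_cfs_quota_period. With a shift of 4, each step grows period and quota
by 1/16 (6.25%), so the quota/period ratio stays roughly the same, subject
to integer truncation.

```c
#include <stdint.h>

#define NSEC_PER_MSEC	1000000LL
#define NSEC_PER_SEC	1000000000LL

/* Assumption: same value as the kernel's max_cfs_quota_period (1s). */
static const int64_t max_period_ns = 1 * NSEC_PER_SEC;

/*
 * Grow period and quota by 1/2^shift each.
 * Returns 1 if both were scaled, 0 if the new period would exceed the cap
 * (in which case nothing is changed).
 */
static int autotune_scale_up(int64_t *period_ns, int64_t *quota_ns, int shift)
{
	int64_t new_period = *period_ns + (*period_ns >> shift);

	if (new_period > max_period_ns)
		return 0;

	*period_ns = new_period;
	*quota_ns += *quota_ns >> shift;
	return 1;
}
```

For example, a 100ms period with a 50ms quota becomes 106.25ms and
53.125ms after one step, and a 950ms period is left alone because the next
step would go past 1s.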

Cheers,
Phil

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 310d0637fe4b..78f9e28adc7b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4859,16 +4859,42 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
+extern const u64 max_cfs_quota_period;
+s64 cfs_quota_period_autotune_thresh = 100 * NSEC_PER_MSEC;
+int cfs_quota_period_autotune_shift  = 4; /* 100 / 16 = 6.25% */
+
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
 	struct cfs_bandwidth *cfs_b =
 		container_of(timer, struct cfs_bandwidth, period_timer);
+	s64 nsprev, nsnow, new_period;
+	ktime_t now;
 	int overrun;
 	int idle = 0;
 
 	raw_spin_lock(&cfs_b->lock);
+	nsprev = ktime_to_ns(hrtimer_cb_get_time(timer));
 	for (;;) {
-		overrun = hrtimer_forward_now(timer, cfs_b->period);
+		/*
+		 * Note: this reverts the change to use hrtimer_forward_now(), so that
+		 * hrtimer_cb_get_time() is not called again for a value we already have.
+		 */
+		now = hrtimer_cb_get_time(timer);
+		nsnow = ktime_to_ns(now);
+		if (nsnow - nsprev >= cfs_quota_period_autotune_thresh) {
+			new_period = ktime_to_ns(cfs_b->period);
+			new_period += new_period >> cfs_quota_period_autotune_shift;
+			if (new_period <= max_cfs_quota_period) {
+				cfs_b->period = ns_to_ktime(new_period);
+				cfs_b->quota += cfs_b->quota >> cfs_quota_period_autotune_shift;
+				pr_warn_ratelimited(
+					"cfs_period_timer [cpu%d] : Running too long, scaling up (new period %lld, new quota = %lld)\n", 
+					smp_processor_id(), cfs_b->period/NSEC_PER_USEC, cfs_b->quota/NSEC_PER_USEC);
+			}
+			nsprev = nsnow;
+		}
+
+		overrun = hrtimer_forward(timer, now, cfs_b->period);
 		if (!overrun)
 			break;
 


--

Thread overview: 17+ messages
2019-03-01 14:52 [RFC] sched/fair: hard lockup in sched_cfs_period_timer Phil Auld
2019-03-04 18:13 ` bsegall
2019-03-04 19:05   ` Phil Auld
2019-03-05 18:49     ` bsegall
2019-03-05 20:05       ` Phil Auld
2019-03-05 20:45         ` bsegall
2019-03-06 16:23           ` Phil Auld
2019-03-06 19:25             ` bsegall
2019-03-09 20:33               ` Phil Auld [this message]
2019-03-11 17:44                 ` bsegall
2019-03-11 20:25                   ` Phil Auld
2019-03-12 13:57                     ` Phil Auld
2019-03-13 17:44                       ` bsegall
2019-03-13 18:50                         ` Phil Auld
2019-03-13 20:26                           ` bsegall
2019-03-13 21:10                             ` Phil Auld
2019-03-12 17:29                     ` bsegall
