From: Peter Zijlstra <peterz@infradead.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ivan Babrou <ivan@cloudflare.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	kernel-team <kernel-team@cloudflare.com>,
	Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>
Subject: Re: Lower than expected CPU pressure in PSI
Date: Sat, 8 Feb 2020 11:19:57 +0100
Message-ID: <20200208101957.GU14946@hirez.programming.kicks-ass.net>
In-Reply-To: <20200207130829.GG14897@hirez.programming.kicks-ass.net>

On Fri, Feb 07, 2020 at 02:08:29PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 09, 2020 at 11:16:32AM -0500, Johannes Weiner wrote:
> > On Wed, Jan 08, 2020 at 11:47:10AM -0800, Ivan Babrou wrote:
> > > We added reporting for PSI in cgroups and the results are somewhat surprising.
> > > 
> > > My test setup consists of 3 services:
> > > 
> > > * stress-cpu1-no-contention.service : taskset -c 1 stress --cpu 1
> > > * stress-cpu2-first-half.service    : taskset -c 2 stress --cpu 1
> > > * stress-cpu2-second-half.service   : taskset -c 2 stress --cpu 1
> > > 
> > > The first service runs unconstrained; the other two compete for CPU.
> > > 
> > > As expected, I can see 500ms/s of sched delay for each of the latter
> > > two and an aggregated 1000ms/s for /system.slice; no surprises here.
> > > 
> > > However, CPU pressure reported by PSI says that none of my services
> > > have any pressure on them. I can see around 434ms/s pressure on
> > > /unified/system.slice and 425ms/s pressure on /unified cgroup, which
> > > is surprising for three reasons:
> > > 
> > > * Pressure is absent for my services (I expect it to match sched delay)
> > > * Pressure on /unified/system.slice is lower than both 500ms/s and 1000ms/s
> > > * Pressure on root cgroup is lower than on system.slice
> > 
> > CPU pressure is currently implemented based only on the number of
> > *runnable* tasks, not on who gets to actively use the CPU. This works
> > for contention within cgroups or at the global scope, but it doesn't
> > correctly reflect competition between cgroups. It also doesn't show
> > the effects of e.g. cpu cycle limiting through cpu.max where there
> > might *be* only one runnable task, but it's not getting the CPU.
> > 
> > I've been working on fixing this, but hadn't gotten around to sending
> > the patch upstream. Attaching it below. Would you mind testing it?
> > 
> > Peter, what would you think of the below?
> 
> I'm not loving it, but I see what it does and I can't quickly see an
> alternative.
> 
> My main gripe is doing even more of those cgroup traversals.
> 
> One thing pick_next_task_fair() does is try and limit the cgroup
> traversal to the sub-tree that contains both prev and next. Not sure
> that is immediately applicable here, but it might be worth looking into.
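
For reference, that sub-tree trick in pick_next_task_fair() looks
roughly like this (quoted from memory, so details may be off; see
kernel/sched/fair.c for the real thing):

while (!(cfs_rq = is_same_group(se, pse))) {
	int se_depth = se->depth;
	int pse_depth = pse->depth;

	if (se_depth <= pse_depth) {
		put_prev_entity(cfs_rq_of(pse), pse);
		pse = parent_entity(pse);
	}
	if (se_depth >= pse_depth) {
		set_next_entity(cfs_rq_of(se), se);
		se = parent_entity(se);
	}
}

That is, prev's and next's entity chains are only put/set below their
first common cfs_rq; everything above that stays untouched.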

One option, I suppose, would be to replace this:

+static inline void psi_sched_switch(struct task_struct *prev,
+                                   struct task_struct *next,
+                                   bool sleep)
+{
+       if (static_branch_likely(&psi_disabled))
+               return;
+
+       /*
+        * Clear the TSK_ONCPU state if the task was preempted. If
+        * it's a voluntary sleep, dequeue will have taken care of it.
+        */
+       if (!sleep)
+               psi_task_change(prev, TSK_ONCPU, 0);
+
+       psi_task_change(next, 0, TSK_ONCPU);
+}

With something like:

static inline void psi_sched_switch(struct task_struct *prev,
				    struct task_struct *next,
				    bool sleep)
{
	struct psi_group *g = NULL, *p = NULL;
	int cpu = task_cpu(next);
	int set, clear;

	set = TSK_ONCPU;
	clear = 0;

	/* walk @next's psi_groups from the leaf cgroup up to the root */
	while ((g = iterate_group(next, g))) {
		u32 nr_running = per_cpu_ptr(g->pcpu, cpu)->tasks[NR_RUNNING];

		/*
		 * @next itself is already accounted runnable here; if
		 * anybody else is too, assume we hit the subtree @prev
		 * lives in and terminate -- TSK_ONCPU stays set for
		 * these common parents.
		 */
		if (nr_running > 1) {
			p = g;
			break;
		}

		/* the rest of psi_task_change */
	}

	if (sleep)
		return;

	set = 0;
	clear = TSK_ONCPU;

	g = NULL;
	while ((g = iterate_group(prev, g))) {
		/* stop where the two subtrees merge */
		if (g == p)
			break;

		/* the rest of psi_task_change */
	}
}

That way we avoid clearing and re-setting TSK_ONCPU on the parents
common to both tasks.
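
To see which groups actually get touched, here is a silly stand-alone
userspace mock-up of the two loops above (hierarchy, names and counts
are made up for the demo; this is obviously not kernel code):

#include <stdio.h>
#include <stddef.h>

struct group {
	const char *name;
	struct group *parent;
	int nr_running;		/* runnable tasks accounted here */
};

/* toy hierarchy: root <- system.slice <- { cg_prev, cg_next } */
static struct group root  = { "root",         NULL,   2 };
static struct group slice = { "system.slice", &root,  2 };
static struct group gprev = { "cg_prev",      &slice, 1 };
static struct group gnext = { "cg_next",      &slice, 1 };

int main(void)
{
	struct group *g, *p = NULL;

	/* loop 1: set ONCPU up @next's chain, stop at the common subtree */
	for (g = &gnext; g; g = g->parent) {
		if (g->nr_running > 1) {	/* @prev accounted here too */
			p = g;
			break;
		}
		printf("set   ONCPU: %s\n", g->name);
	}

	/* loop 2: clear ONCPU up @prev's chain, stop at that same group */
	for (g = &gprev; g && g != p; g = g->parent)
		printf("clear ONCPU: %s\n", g->name);

	return 0;
}

This prints only "set ONCPU: cg_next" and "clear ONCPU: cg_prev";
system.slice and the root are never written to during the switch.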
