From: Peter Zijlstra <peterz@infradead.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Linus Torvalds <torvalds@linux-foundation.org>,
Tejun Heo <tj@kernel.org>, Suren Baghdasaryan <surenb@google.com>,
Vinayak Menon <vinmenon@codeaurora.org>,
Christopher Lameter <cl@linux.com>,
Mike Galbraith <efault@gmx.de>,
Shakeel Butt <shakeelb@google.com>,
linux-mm@kvack.org, cgroups@vger.kernel.org,
linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO
Date: Wed, 18 Jul 2018 14:03:18 +0200
Message-ID: <20180718120318.GC2476@hirez.programming.kicks-ass.net>
In-Reply-To: <20180712172942.10094-9-hannes@cmpxchg.org>
On Thu, Jul 12, 2018 at 01:29:40PM -0400, Johannes Weiner wrote:
> +/* Tracked task states */
> +enum psi_task_count {
> + NR_RUNNING,
> + NR_IOWAIT,
> + NR_MEMSTALL,
> + NR_PSI_TASK_COUNTS,
> +};
> +
> +/* Task state bitmasks */
> +#define TSK_RUNNING (1 << NR_RUNNING)
> +#define TSK_IOWAIT (1 << NR_IOWAIT)
> +#define TSK_MEMSTALL (1 << NR_MEMSTALL)
> +
> +/* Resources that workloads could be stalled on */
> +enum psi_res {
> + PSI_CPU,
> + PSI_MEM,
> + PSI_IO,
> + NR_PSI_RESOURCES,
> +};
> +
> +/* Pressure states for a group of tasks */
> +enum psi_state {
> + PSI_NONE, /* No stalled tasks */
> + PSI_SOME, /* Stalled tasks & working tasks */
> + PSI_FULL, /* Stalled tasks & no working tasks */
> + NR_PSI_STATES,
> +};
> +
> +struct psi_resource {
> + /* Current pressure state for this resource */
> + enum psi_state state;
This leaves a 4-byte hole here (really 7, but GCC is generous and uses 4
bytes for an enum that only spans the value range [0-2]).
> + /* Start of current state (rq_clock) */
> + u64 state_start;
> +
> + /* Time sampling buckets for pressure states SOME and FULL (ns) */
> + u64 times[2];
> +};
> +
> +struct psi_group_cpu {
> + /* States of the tasks belonging to this group */
> + unsigned int tasks[NR_PSI_TASK_COUNTS];
> +
> + /* There are runnable or D-state tasks */
> + int nonidle;
> +
> + /* Start of current non-idle state (rq_clock) */
> + u64 nonidle_start;
> +
> + /* Time sampling bucket for non-idle state (ns) */
> + u64 nonidle_time;
> +
> + /* Per-resource pressure tracking in this group */
> + struct psi_resource res[NR_PSI_RESOURCES];
> +};
> +static DEFINE_PER_CPU(struct psi_group_cpu, system_group_cpus);
Since psi_group_cpu is exactly 2 cachelines big, I think you want the above
to be DEFINE_PER_CPU_SHARED_ALIGNED() to minimize cache misses on
accounting. Also, I think you want to stick ____cacheline_aligned_in_smp
on the structure, such that alloc_percpu() also DTRT.

Of those 2 cachelines, 12 bytes are wasted because of that hole above (4
bytes in each of the 3 psi_resource entries), and a further 8 are wasted
because PSI_CPU does not use FULL, for a total of 20 wasted bytes in there.
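The layout arithmetic above can be checked in userspace. This is a minimal sketch, assuming an LP64 target such as x86-64 with GCC (4-byte enum, 8-byte aligned u64) and 64-byte cachelines; the `u64` typedef, `psi_resource_hole()` and `check_layout()` are stand-ins made up for the illustration, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t u64;	/* userspace stand-in for the kernel type */

enum psi_state { PSI_NONE, PSI_SOME, PSI_FULL, NR_PSI_STATES };
enum { NR_PSI_TASK_COUNTS = 3, NR_PSI_RESOURCES = 3 };

struct psi_resource {
	enum psi_state state;	/* 4 bytes here, followed by padding */
	u64 state_start;
	u64 times[2];
};

struct psi_group_cpu {
	unsigned int tasks[NR_PSI_TASK_COUNTS];
	int nonidle;
	u64 nonidle_start;
	u64 nonidle_time;
	struct psi_resource res[NR_PSI_RESOURCES];
};

/* Padding bytes between 'state' and 'state_start'. */
size_t psi_resource_hole(void)
{
	return offsetof(struct psi_resource, state_start) -
	       (offsetof(struct psi_resource, state) + sizeof(enum psi_state));
}

/* Aborts if this ABI lays the structs out differently (assumed LP64). */
void check_layout(void)
{
	assert(psi_resource_hole() == 4);		/* one hole per resource */
	assert(sizeof(struct psi_group_cpu) == 2 * 64);	/* exactly 2 cachelines */
}
```

With three psi_resource entries the holes add up to the 12 bytes mentioned above; the remaining 8 come from the unused FULL bucket of PSI_CPU.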
> +static void time_state(struct psi_resource *res, int state, u64 now)
> +{
> + if (res->state != PSI_NONE) {
> + bool was_full = res->state == PSI_FULL;
> +
> + res->times[was_full] += now - res->state_start;
> + }
> + if (res->state != state)
> + res->state = state;
> + if (res->state != PSI_NONE)
> + res->state_start = now;
> +}
Does the compiler optimize that and fold the two != NONE branches?
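Note that the two != PSI_NONE tests look at different values: the first at the old state, the second at the new one. A hedged sketch of a variant with a single branch follows; it stores state and state_start unconditionally, which leaves the buckets unchanged because state_start is only read while the state is not PSI_NONE. `time_state_one_branch()`, `replay()` and `check_same_buckets()` are hypothetical names, and the types are userspace stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t u64;	/* userspace stand-in for the kernel type */

enum psi_state { PSI_NONE, PSI_SOME, PSI_FULL };

struct psi_resource {
	enum psi_state state;
	u64 state_start;
	u64 times[2];
};

/* As in the quoted patch. */
void time_state(struct psi_resource *res, int state, u64 now)
{
	if (res->state != PSI_NONE) {
		bool was_full = res->state == PSI_FULL;

		res->times[was_full] += now - res->state_start;
	}
	if (res->state != state)
		res->state = state;
	if (res->state != PSI_NONE)
		res->state_start = now;
}

/* Hypothetical variant: branch on the old state only, store the rest. */
void time_state_one_branch(struct psi_resource *res, int state, u64 now)
{
	if (res->state != PSI_NONE)
		res->times[res->state == PSI_FULL] += now - res->state_start;
	res->state = state;
	res->state_start = now;
}

/* Replay a fixed series of state changes through fn. */
void replay(void (*fn)(struct psi_resource *, int, u64),
	    struct psi_resource *res)
{
	static const int states[] = { PSI_SOME, PSI_FULL, PSI_NONE, PSI_NONE,
				      PSI_FULL, PSI_SOME, PSI_NONE };
	static const u64 when[] = { 10, 25, 40, 55, 70, 100, 130 };
	size_t i;

	for (i = 0; i < sizeof(states) / sizeof(states[0]); i++)
		fn(res, states[i], when[i]);
}

/* Both variants should fill the SOME and FULL buckets identically. */
void check_same_buckets(void)
{
	struct psi_resource a = { PSI_NONE, 0, { 0, 0 } }, b = a;

	replay(time_state, &a);
	replay(time_state_one_branch, &b);
	assert(a.times[0] == b.times[0] && a.times[1] == b.times[1]);
}
```

Whether the extra stores are cheaper than the extra branches would need measuring; the sketch only shows that the accounting result is the same.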
> +static void psi_group_change(struct psi_group *group, int cpu, u64 now,
> + unsigned int clear, unsigned int set)
> +{
> + enum psi_state state = PSI_NONE;
> + struct psi_group_cpu *groupc;
> + unsigned int *tasks;
> + unsigned int to, bo;
> +
> + groupc = per_cpu_ptr(group->cpus, cpu);
> + tasks = groupc->tasks;
bool was_nonidle = tasks[NR_RUNNING] || tasks[NR_IOWAIT] || tasks[NR_MEMSTALL];
> + /* Update task counts according to the set/clear bitmasks */
> + for (to = 0; (bo = ffs(clear)); to += bo, clear >>= bo) {
> + int idx = to + (bo - 1);
> +
> + if (tasks[idx] == 0 && !psi_bug) {
> + printk_deferred(KERN_ERR "psi: task underflow! cpu=%d idx=%d tasks=[%u %u %u] clear=%x set=%x\n",
> + cpu, idx, tasks[0], tasks[1], tasks[2],
> + clear, set);
> + psi_bug = 1;
> + }
WARN_ONCE(!tasks[idx], ...);
> + tasks[idx]--;
> + }
> + for (to = 0; (bo = ffs(set)); to += bo, set >>= bo)
> + tasks[to + (bo - 1)]++;
You want to benchmark this, but since it's only 3 consecutive bits, it
might actually be faster to not use ffs() and simply test all 3 bits:

	for (to = set, bo = 0; to; to &= ~(1 << bo), bo++)
		if (to & (1 << bo))
			tasks[bo]++;

or something like that.
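For comparison, here is a small userspace sketch of the two approaches: the ffs() walk from the quoted patch and a plain test of each of the three bits. It assumes POSIX ffs() from <strings.h> in place of the kernel helper; `count_set_ffs()`, `count_set_bits()` and `check_all_masks()` are hypothetical names. It only checks that both loops agree and says nothing about which one is faster.

```c
#include <assert.h>
#include <strings.h>	/* POSIX ffs(), standing in for the kernel's */

enum { NR_RUNNING, NR_IOWAIT, NR_MEMSTALL, NR_PSI_TASK_COUNTS };

/* The ffs() walk from the quoted patch. */
void count_set_ffs(unsigned int *tasks, unsigned int set)
{
	unsigned int to, bo;

	for (to = 0; (bo = ffs(set)); to += bo, set >>= bo)
		tasks[to + (bo - 1)]++;
}

/* Test each of the three bits directly. */
void count_set_bits(unsigned int *tasks, unsigned int set)
{
	unsigned int bo;

	for (bo = 0; bo < NR_PSI_TASK_COUNTS; bo++)
		if (set & (1u << bo))
			tasks[bo]++;
}

/* Both loops should agree for every combination of the three bits. */
void check_all_masks(void)
{
	unsigned int mask, i;

	for (mask = 0; mask < (1u << NR_PSI_TASK_COUNTS); mask++) {
		unsigned int a[NR_PSI_TASK_COUNTS] = { 0 };
		unsigned int b[NR_PSI_TASK_COUNTS] = { 0 };

		count_set_ffs(a, mask);
		count_set_bits(b, mask);
		for (i = 0; i < NR_PSI_TASK_COUNTS; i++)
			assert(a[i] == b[i]);
	}
}
```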
> +
> + /* Time in which tasks wait for the CPU */
> + state = PSI_NONE;
> + if (tasks[NR_RUNNING] > 1)
> + state = PSI_SOME;
> + time_state(&groupc->res[PSI_CPU], state, now);
> +
> + /* Time in which tasks wait for memory */
> + state = PSI_NONE;
> + if (tasks[NR_MEMSTALL]) {
> + if (!tasks[NR_RUNNING] ||
> + (cpu_curr(cpu)->flags & PF_MEMSTALL))
I'm confused, why do we care if the current task is MEMSTALL or not?
> + state = PSI_FULL;
> + else
> + state = PSI_SOME;
> + }
> + time_state(&groupc->res[PSI_MEM], state, now);
> +
> + /* Time in which tasks wait for IO */
> + state = PSI_NONE;
> + if (tasks[NR_IOWAIT]) {
> + if (!tasks[NR_RUNNING])
> + state = PSI_FULL;
> + else
> + state = PSI_SOME;
> + }
> + time_state(&groupc->res[PSI_IO], state, now);
> +
> + /* Time in which tasks are non-idle, to weigh the CPU in summaries */
if (was_nonidle)
> + groupc->nonidle_time += now - groupc->nonidle_start;
if (tasks[NR_RUNNING] || tasks[NR_IOWAIT] || tasks[NR_MEMSTALL])
> + groupc->nonidle_start = now;
That does away with groupc->nonidle (another 4 bytes), giving us 24 bytes free.
> + /* Kick the stats aggregation worker if it's gone to sleep */
> + if (!delayed_work_pending(&group->clock_work))
> + schedule_delayed_work(&group->clock_work, PSI_FREQ);
> +}
If you always update the time buckets, rename nonidle_start to last_time
and do away with psi_resource::state_start, you gain another 24 bytes,
giving 48 bytes free.

And as said before, we can compress the state from 12 bytes to 6 bits
(or 1 byte), giving another 11 bytes for 59 bytes free.

Leaving us just 5 bytes short of fitting in a single cacheline :/
struct ponies {
	unsigned int tasks[3];                                     /*     0    12 */
	unsigned int cpu_state:2;                                  /*    12:30  4 */
	unsigned int io_state:2;                                   /*    12:28  4 */
	unsigned int mem_state:2;                                  /*    12:26  4 */

	/* XXX 26 bits hole, try to pack */

	/* typedef u64 */ long long unsigned int last_time;        /*    16     8 */
	/* typedef u64 */ long long unsigned int some_time[3];     /*    24    24 */
	/* typedef u64 */ long long unsigned int full_time[2];     /*    48    16 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	/* typedef u64 */ long long unsigned int nonidle_time;     /*    64     8 */

	/* size: 72, cachelines: 2, members: 8 */
	/* bit holes: 1, sum bit holes: 26 bits */
	/* last cacheline: 8 bytes */
};
ARGGH!
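A rough userspace sketch of how accounting against such a packed layout could look, assuming the "always update the time buckets" scheme with a single last_time. The struct name, `charge()`, `stall_state()` and `psi_change()` are hypothetical, the `u64` typedef is a stand-in, and the PF_MEMSTALL check on the current task from the quoted patch is left out because it has no userspace equivalent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t u64;	/* userspace stand-in for the kernel type */

enum { NR_RUNNING, NR_IOWAIT, NR_MEMSTALL, NR_PSI_TASK_COUNTS };
enum { PSI_CPU, PSI_MEM, PSI_IO, NR_PSI_RESOURCES };
enum { PSI_NONE, PSI_SOME, PSI_FULL };

/* Members follow the pahole dump above; the struct name is made up. */
struct psi_group_cpu_packed {
	unsigned int tasks[NR_PSI_TASK_COUNTS];
	unsigned int cpu_state:2;
	unsigned int io_state:2;
	unsigned int mem_state:2;
	u64 last_time;
	u64 some_time[NR_PSI_RESOURCES];
	u64 full_time[2];	/* PSI_MEM and PSI_IO only, CPU has no FULL */
	u64 nonidle_time;
};

/* Add delta to the bucket that matches a resource's current state. */
static void charge(struct psi_group_cpu_packed *g, int res,
		   unsigned int state, u64 delta)
{
	if (state == PSI_SOME) {
		g->some_time[res] += delta;
	} else if (state == PSI_FULL) {
		assert(res != PSI_CPU);	/* CPU never reaches FULL here */
		g->full_time[res - PSI_MEM] += delta;
	}
}

/* Derive NONE/SOME/FULL for MEM or IO from the task counts. */
static unsigned int stall_state(const unsigned int *tasks, int stalled)
{
	if (!tasks[stalled])
		return PSI_NONE;
	return tasks[NR_RUNNING] ? PSI_SOME : PSI_FULL;
}

/*
 * Hypothetical helper: charge the time since last_time to the old
 * states, install the new task counts, then recompute the states.
 */
void psi_change(struct psi_group_cpu_packed *g, u64 now,
		const unsigned int new_tasks[NR_PSI_TASK_COUNTS])
{
	u64 delta = now - g->last_time;
	int i;

	charge(g, PSI_CPU, g->cpu_state, delta);
	charge(g, PSI_MEM, g->mem_state, delta);
	charge(g, PSI_IO, g->io_state, delta);
	if (g->tasks[NR_RUNNING] || g->tasks[NR_IOWAIT] || g->tasks[NR_MEMSTALL])
		g->nonidle_time += delta;
	g->last_time = now;

	for (i = 0; i < NR_PSI_TASK_COUNTS; i++)
		g->tasks[i] = new_tasks[i];
	g->cpu_state = g->tasks[NR_RUNNING] > 1 ? PSI_SOME : PSI_NONE;
	g->mem_state = stall_state(g->tasks, NR_MEMSTALL);
	g->io_state = stall_state(g->tasks, NR_IOWAIT);
}
```

On an LP64 target this struct should come out at the same 72 bytes as the dump, since the 26-bit hole after the bitfields still pushes nonidle_time into the second cacheline.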