From: Ingo Molnar <mingo@kernel.org>
To: Mel Gorman <mgorman@suse.de>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Paul Turner <pjt@google.com>,
Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
Christoph Lameter <cl@linux.com>, Rik van Riel <riel@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Thomas Gleixner <tglx@linutronix.de>,
Johannes Weiner <hannes@cmpxchg.org>,
Hugh Dickins <hughd@google.com>,
Arnaldo Carvalho de Melo <acme@redhat.com>,
Frederic Weisbecker <fweisbec@gmail.com>,
Mike Galbraith <efault@gmx.de>
Subject: Re: NUMA performance comparison between three NUMA kernels and mainline. [Mid-size NUMA system edition.]
Date: Mon, 10 Dec 2012 21:29:33 +0100 [thread overview]
Message-ID: <20121210202933.GA15363@gmail.com> (raw)
In-Reply-To: <20121210123336.GI1009@suse.de>
* Mel Gorman <mgorman@suse.de> wrote:
> > NUMA convergence latency measurements
> > -------------------------------------
> >
> > 'NUMA convergence' latency is the number of seconds a
> > workload takes to reach 'perfectly NUMA balanced' state.
> > This is measured on the CPU placement side: once it has
> > converged, memory typically follows within a couple of
> > seconds.
>
> This is a sort-of misleading metric, so be wary of it: the
> speed at which a workload converges is not necessarily
> useful. It only makes a difference for short-lived workloads
> or during phase changes. If the workload is short-lived, it's
> not interesting anyway. If the workload is rapidly changing
> phases then the migration costs can be a major factor and
> rapidly converging might actually be slower overall.
>
> The speed at which the workload converges will depend very
> heavily on when the PTEs are marked pte_numa and when the
> faults are incurred. If this is happening very rapidly then a
> workload will converge quickly *but* this can incur a high
> system CPU cost (PTE scanning, fault trapping etc). This
> metric can be gamed by always scanning rapidly but the
> overall performance may be worse.
>
> I'm not saying that this metric is not useful, it is. Just be
> careful of optimising for it. numacore's system CPU usage has
> been really high in a number of benchmarks and it may be
> because you are optimising to minimise time to convergence.
You are missing a big part of the NUMA balancing picture here:
the primary use of 'latency of convergence' is to determine
whether a workload converges *at all*.
For example if you look at the 4-process / 8-threads-per-process
latency results:
[ Lower numbers are better. ]
 [test unit]      :  v3.7  | balancenuma-v10 | AutoNUMA-v28 | numa-u-v3 |
 -------------------------------------------------------------------------------
 4x8-convergence  : 101.1  |      101.3      |     3.4      |    3.9    | secs
You'll see that balancenuma does not converge this workload.
Where does such a workload matter? For example in the 4x JVM
SPECjbb tests that Thomas Gleixner has reported today:
http://lkml.org/lkml/2012/12/10/437
There balancenuma does worse than AutoNUMA and the -v3 tree
exactly because it does not NUMA-converge as well (or at all).
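( To make the metric concrete: here is a minimal user-space
sketch of how convergence can be detected - this is not the
actual 'perf bench numa mem' code, just the idea. Each worker
publishes the node it last ran on, a monitor polls once a
second, and the first time all workers sit on the same node,
minus the start time, is the convergence latency. NR_WORKERS
and the worker main loop are assumed to come from the test
harness: )

  #define _GNU_SOURCE
  #include <numa.h>             /* numa_node_of_cpu(), link with -lnuma */
  #include <sched.h>            /* sched_getcpu() */

  #define NR_WORKERS 8          /* assumed: 1 process x 8 threads */

  static volatile int worker_node[NR_WORKERS];

  /* Called from each worker's main loop: publish the NUMA node
   * this thread last ran on. */
  static void worker_tick(int id)
  {
          worker_node[id] = numa_node_of_cpu(sched_getcpu());
  }

  /* Polled by the monitor: 'converged' == every worker last ran
   * on the same node. */
  static int all_on_one_node(void)
  {
          for (int i = 1; i < NR_WORKERS; i++)
                  if (worker_node[i] != worker_node[0])
                          return 0;
          return 1;
  }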
> I'm trying to understand what you're measuring a bit better.
> Take 1x4 for example -- one process, 4 threads. If I'm
> reading this description correctly then all 4 threads use the
> same memory. Is this correct? If so, this is basically a
> variation of numa01 which is an adverse workload. [...]
No, 1x4 and 1x8 are like the SPECjbb JVM tests you have been
performing - not an 'adverse' workload. The threads of the JVM
are sharing memory significantly enough to justify moving them
on the same node.
> [...] balancenuma will not migrate memory in this case as
> it'll never get past the two-stage filter. If there are few
> threads, it might never get scheduled on a new node, in which
> case it'll also do nothing.
>
> The correct action in this case is to interleave memory and
> spread the tasks between nodes but it lacks the information to
> do that. [...]
No, the correct action is to move related threads close to each
other.
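( In manual terms, per related thread, that would be something
like the following libnuma sketch - i.e. what the kernel is
supposed to arrange automatically: )

  #define _GNU_SOURCE
  #include <numa.h>             /* link with -lnuma */
  #include <stdio.h>

  /* Manual equivalent of 'converging' a thread: restrict the
   * calling thread to the CPUs of one node, so it runs next to
   * the siblings and the memory it shares data with. */
  static void move_self_to_node(int node)
  {
          if (numa_available() < 0 || numa_run_on_node(node) < 0)
                  perror("numa_run_on_node");
  }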
> [...] This was deliberate as I was expecting numacore or
> autonuma to be rebased on top and I didn't want to collide.
>
> Does the memory requirement of all threads fit in a single
> node? This is related to my second question -- how do you
> define convergence?
NUMA-convergence means achieving the ideal CPU and memory
placement of tasks.
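( The memory half can be checked from user-space too - a
sketch using the query mode of move_pages(2), where 'buf' and
'len' stand in for the workload's working set: )

  #define _GNU_SOURCE
  #include <numaif.h>           /* move_pages(), link with -lnuma */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  /* Report which node each page of buf[0..len) currently lives
   * on. Passing nodes == NULL makes move_pages() query-only:
   * nothing is migrated, status[] just receives each page's node. */
  static void print_page_nodes(void *buf, size_t len)
  {
          long psize = sysconf(_SC_PAGESIZE);
          unsigned long nr = len / psize;
          void **pages = malloc(nr * sizeof(*pages));
          int *status = malloc(nr * sizeof(*status));

          for (unsigned long i = 0; i < nr; i++)
                  pages[i] = (char *)buf + i * psize;

          if (!move_pages(0 /* self */, nr, pages, NULL, status, 0))
                  for (unsigned long i = 0; i < nr; i++)
                          printf("page %4lu: node %d\n", i, status[i]);

          free(pages);
          free(status);
  }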
> > The 'balancenuma' kernel does not converge any of the
> > workloads where worker threads or processes relate to each
> > other.
>
> I'd like to know if it is because the workload fits on one
> node. If the buffers are all really small, balancenuma would
> have skipped them entirely for example due to this check
>
>     /* Skip small VMAs. They are not likely to be of relevance */
>     if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) < HPAGE_PMD_NR)
>             continue;
No, the memory areas are larger than 2MB.
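( For reference, on x86-64 that cutoff works out to 2 MB: )

  /*
   * PAGE_SHIFT == 12 (4K base pages) and HPAGE_PMD_NR == 512, so
   * the check skips any VMA spanning fewer than 512 base pages:
   *
   *         512 pages * 4 KB/page == 2 MB
   *
   * Only mappings of at least one PMD-sized huge page get scanned,
   * and the test's memory areas are comfortably above that.
   */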
> Another possible explanation is that in the 4x4 case the
> processes' threads are getting scheduled on separate nodes.
> As each thread is sharing data it would not get past the
> two-stage filter.
>
> How realistic is it that threads are accessing the same data?
In practice? Very ...
> That looks like it would be a bad idea even from a caching
> perspective if the data is being updated. I would expect that
> the majority of HPC workloads would have each thread accessing
> mostly private data until the final stages where the results
> are aggregated together.
You tested such a workload many times in the past: the 4x JVM
SPECjbb test ...
> > NUMA workload bandwidth measurements
> > ------------------------------------
> >
> > The other set of numbers I've collected are workload
> > bandwidth measurements, run over 20 seconds. Using 20
> > seconds gives a healthy mix of pre-convergence and
> > post-convergence bandwidth,
>
> 20 seconds is *really* short. That might not even be enough
> time for autonuma's knumad thread to find the process and
> update it as IIRC it starts pretty slowly.
If you check the convergence latency tables you'll see that
AutoNUMA is able to converge within 20 seconds.
Thanks,
Ingo