From: Ingo Molnar <mingo@kernel.org>
To: Mel Gorman <mgorman@suse.de>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Paul Turner <pjt@google.com>,
Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
Christoph Lameter <cl@linux.com>, Rik van Riel <riel@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Thomas Gleixner <tglx@linutronix.de>,
Johannes Weiner <hannes@cmpxchg.org>,
Hugh Dickins <hughd@google.com>,
Arnaldo Carvalho de Melo <acme@redhat.com>,
Frederic Weisbecker <fweisbec@gmail.com>,
Mike Galbraith <efault@gmx.de>
Subject: Re: NUMA performance comparison between three NUMA kernels and mainline. [Mid-size NUMA system edition.]
Date: Mon, 10 Dec 2012 21:29:33 +0100 [thread overview]
Message-ID: <20121210202933.GA15363@gmail.com> (raw)
In-Reply-To: <20121210123336.GI1009@suse.de>
* Mel Gorman <mgorman@suse.de> wrote:
> > NUMA convergence latency measurements
> > -------------------------------------
> >
> > 'NUMA convergence' latency is the number of seconds a
> > workload takes to reach 'perfectly NUMA balanced' state.
> > This is measured on the CPU placement side: once it has
> > converged, memory typically follows within a couple of
> > seconds.
>
> This is a sort-of misleading metric, so be wary of it: the
> speed at which a workload converges is not necessarily
> useful. It only makes a difference for short-lived workloads
> or during phase changes. If the workload is short-lived, it's
> not interesting anyway. If the workload is rapidly changing
> phases then the migration costs can be a major factor and
> rapidly converging might actually be slower overall.
>
> The speed at which the workload converges will depend very
> heavily on when the PTEs are marked pte_numa and when the
> faults are incurred. If this is happening very rapidly then a
> workload will converge quickly *but* this can incur a high
> system CPU cost (PTE scanning, fault trapping etc). This
> metric can be gamed by always scanning rapidly but the
> overall performance may be worse.
>
> I'm not saying that this metric is not useful, it is. Just be
> careful of optimising for it. numacore's system CPU usage has
> been really high in a number of benchmarks and it may be
> because you are optimising to minimise time to convergence.
You are missing a big part of the NUMA balancing picture here:
the primary use of 'latency of convergence' is to determine
whether a workload converges *at all*.
For example if you look at the 4-process / 8-threads-per-process
latency results:
[ Lower numbers are better. ]
 [test unit]      :  v3.7  | balancenuma-v10 | AutoNUMA-v28 | numa-u-v3 |
 -------------------------------------------------------------------------------
 4x8-convergence  : 101.1  |      101.3      |     3.4      |    3.9    | secs
You'll see that balancenuma does not converge this workload.
Where does such a workload matter? For example in the 4x JVM
SPECjbb tests that Thomas Gleixner has reported today:
http://lkml.org/lkml/2012/12/10/437
There balancenuma does worse than AutoNUMA and the -v3 tree
exactly because it does not NUMA-converge as well (or at all).
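( To make the metric concrete: here is a minimal user-space
sketch of how convergence can be detected - this is not the
actual 'perf bench numa mem' code, just the idea. Each worker
publishes the node it last ran on, a monitor polls once a
second, and the first time all workers sit on the same node,
minus the start time, is the convergence latency. NR_WORKERS
and the worker main loop are assumed to come from the test
harness: )

  #define _GNU_SOURCE
  #include <numa.h>             /* numa_node_of_cpu(), link with -lnuma */
  #include <sched.h>            /* sched_getcpu() */

  #define NR_WORKERS 8          /* assumed: 1 process x 8 threads */

  static volatile int worker_node[NR_WORKERS];

  /* Called from each worker's main loop: publish the NUMA node
   * this thread last ran on. */
  static void worker_tick(int id)
  {
          worker_node[id] = numa_node_of_cpu(sched_getcpu());
  }

  /* Polled by the monitor: 'converged' == every worker last ran
   * on the same node. */
  static int all_on_one_node(void)
  {
          for (int i = 1; i < NR_WORKERS; i++)
                  if (worker_node[i] != worker_node[0])
                          return 0;
          return 1;
  }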
> I'm trying to understand what you're measuring a bit better.
> Take 1x4 for example -- one process, 4 threads. If I'm
> reading this description correctly then all 4 threads use the
> same memory. Is this correct? If so, this is basically a
> variation of numa01 which is an adverse workload. [...]
No, 1x4 and 1x8 are like the SPECjbb JVM tests you have been
performing - not an 'adverse' workload. The threads of the JVM
are sharing memory significantly enough to justify moving them
on the same node.
> [...] balancenuma will not migrate memory in this case as
> it'll never get past the two-stage filter. If there are few
> threads, it might never get scheduled on a new node, in which
> case it'll also do nothing.
>
> The correct action in this case is to interleave memory and
> spread the tasks between nodes but it lacks the information to
> do that. [...]
No, the correct action is to move related threads close to each
other.
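( In manual terms, per related thread, that would be something
like the following libnuma sketch - i.e. what the kernel is
supposed to arrange automatically: )

  #define _GNU_SOURCE
  #include <numa.h>             /* link with -lnuma */
  #include <stdio.h>

  /* Manual equivalent of 'converging' a thread: restrict the
   * calling thread to the CPUs of one node, so it runs next to
   * the siblings and the memory it shares data with. */
  static void move_self_to_node(int node)
  {
          if (numa_available() < 0 || numa_run_on_node(node) < 0)
                  perror("numa_run_on_node");
  }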
> [...] This was deliberate as I was expecting numacore or
> autonuma to be rebased on top and I didn't want to collide.
>
> Does the memory requirement of all threads fit in a single
> node? This is related to my second question -- how do you
> define convergence?
NUMA-convergence means achieving the ideal CPU and memory
placement of tasks.
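( The memory half can be checked from user-space too - a
sketch using the query mode of move_pages(2), where 'buf' and
'len' stand in for the workload's working set: )

  #define _GNU_SOURCE
  #include <numaif.h>           /* move_pages(), link with -lnuma */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  /* Report which node each page of buf[0..len) currently lives
   * on. Passing nodes == NULL makes move_pages() query-only:
   * nothing is migrated, status[] just receives each page's node. */
  static void print_page_nodes(void *buf, size_t len)
  {
          long psize = sysconf(_SC_PAGESIZE);
          unsigned long nr = len / psize;
          void **pages = malloc(nr * sizeof(*pages));
          int *status = malloc(nr * sizeof(*status));

          for (unsigned long i = 0; i < nr; i++)
                  pages[i] = (char *)buf + i * psize;

          if (!move_pages(0 /* self */, nr, pages, NULL, status, 0))
                  for (unsigned long i = 0; i < nr; i++)
                          printf("page %4lu: node %d\n", i, status[i]);

          free(pages);
          free(status);
  }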
> > The 'balancenuma' kernel does not converge any of the
> > workloads where worker threads or processes relate to each
> > other.
>
> I'd like to know if it is because the workload fits on one
> node. If the buffers are all really small, balancenuma would
> have skipped them entirely for example due to this check
>
>     /* Skip small VMAs. They are not likely to be of relevance */
>     if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) < HPAGE_PMD_NR)
>             continue;
No, the memory areas are larger than 2MB.
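( For reference, on x86-64 that cutoff works out to 2 MB: )

  /*
   * PAGE_SHIFT == 12 (4K base pages) and HPAGE_PMD_NR == 512, so
   * the check skips any VMA spanning fewer than 512 base pages:
   *
   *         512 pages * 4 KB/page == 2 MB
   *
   * Only mappings of at least one PMD-sized huge page get scanned,
   * and the test's memory areas are comfortably above that.
   */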
> Another possible explanation is that in the 4x4 case the
> processes' threads are getting scheduled on separate nodes.
> As each thread is sharing data it would not get past the
> two-stage filter.
>
> How realistic is it that threads are accessing the same data?
In practice? Very ...
> That looks like it would be a bad idea even from a caching
> perspective if the data is being updated. I would expect that
> the majority of HPC workloads would have each thread accessing
> mostly private data until the final stages where the results
> are aggregated together.
You tested such a workload many times in the past: the 4x JVM
SPECjbb test ...
> > NUMA workload bandwidth measurements
> > ------------------------------------
> >
> > The other set of numbers I've collected are workload
> > bandwidth measurements, run over 20 seconds. Using 20
> > seconds gives a healthy mix of pre-convergence and
> > post-convergence bandwidth,
>
> 20 seconds is *really* short. That might not even be enough
> time for autonuma's knumad thread to find the process and
> update it as IIRC it starts pretty slowly.
If you check the convergence latency tables you'll see that
AutoNUMA is able to converge within 20 seconds.
Thanks,
Ingo