public inbox for linux-kernel@vger.kernel.org
From: Ingo Molnar <mingo@kernel.org>
To: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel@vger.kernel.org, Mel Gorman <mgorman@suse.de>
Subject: Re: [PATCH v2 2/4] sched:Consider imbalance_pct when comparing loads in numa_has_capacity
Date: Tue, 23 Jun 2015 10:10:39 +0200	[thread overview]
Message-ID: <20150623081038.GA26231@gmail.com> (raw)
In-Reply-To: <20150622162958.GB32412@linux.vnet.ibm.com>


* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:

> * Rik van Riel <riel@redhat.com> [2015-06-16 10:39:13]:
> 
> > On 06/16/2015 07:56 AM, Srikar Dronamraju wrote:
> > > This is consistent with all other load balancing instances, where we
> > > absorb unfairness up to env->imbalance_pct. Absorbing unfairness up to
> > > env->imbalance_pct allows us to pull tasks to, and retain them on,
> > > their preferred nodes.
> > >
> > > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> >
> > How does this work with other workloads, eg.
> > single instance SPECjbb2005, or two SPECjbb2005
> > instances on a four node system?
> >
> > Is the load still balanced evenly between nodes
> > with this patch?
> >
> 
> Yes, I have looked at mpstat logs while running SPECjbb2005 with 1 JVM
> per system, 2 JVMs per system and 4 JVMs per system, and observed that
> the load spreading was similar with and without this patch.
> 
> Also, using htop, I have verified while running a 0.5X cpu stress
> workload (i.e. 48 threads on a 96-cpu system) that the spread is
> similar before and after the patch.
> 
> Please let me know if there are any better ways to observe the
> spread. [...]

There are. The tooling you are using is prehistoric; see the various NUMA 
convergence latency measurement utilities in 'perf bench numa':

triton:~/tip> perf bench numa mem -h
# Running 'numa/mem' benchmark:

 # Running main, "perf bench numa numa-mem -h"

 usage: perf bench numa <options>

    -p, --nr_proc <n>     number of processes
    -t, --nr_threads <n>  number of threads per process
    -G, --mb_global <MB>  global  memory (MBs)
    -P, --mb_proc <MB>    process memory (MBs)
    -L, --mb_proc_locked <MB>
                          process serialized/locked memory access (MBs), <= process_memory
    -T, --mb_thread <MB>  thread  memory (MBs)
    -l, --nr_loops <n>    max number of loops to run
    -s, --nr_secs <n>     max number of seconds to run
    -u, --usleep <n>      usecs to sleep per loop iteration
    -R, --data_reads      access the data via reads (can be mixed with -W)
    -W, --data_writes     access the data via writes (can be mixed with -R)
    -B, --data_backwards  access the data backwards as well
    -Z, --data_zero_memset
                          access the data via glibc bzero only
    -r, --data_rand_walk  access the data with random (32bit LFSR) walk
    -z, --init_zero       bzero the initial allocations
    -I, --init_random     randomize the contents of the initial allocations
    -0, --init_cpu0       do the initial allocations on CPU#0
    -x, --perturb_secs <n>
                          perturb thread 0/0 every X secs, to test convergence stability
    -d, --show_details    Show details
    -a, --all             Run all tests in the suite
    -H, --thp <n>         MADV_NOHUGEPAGE < 0 < MADV_HUGEPAGE
    -c, --show_convergence
                          show convergence details
    -m, --measure_convergence
                          measure convergence latency
    -q, --quiet           quiet mode
    -S, --serialize-startup
                          serialize thread startup
    -C, --cpus <cpu[,cpu2,...cpuN]>
                          bind the first N tasks to these specific cpus (the rest is unbound)
    -M, --memnodes <node[,node2,...nodeN]>
                          bind the first N tasks to these specific memory nodes (the rest is unbound)

'-m' will measure convergence.
'-c' will visualize it.
'--thp' can be used to turn transparent hugepages on/off.
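
As a concrete (hypothetical, untuned) combination of those three switches, a 
convergence-measurement run might look like the sketch below; the process and 
thread counts and the 512 MB per-process working set are arbitrary 
illustration values, not recommendations:

```shell
#!/bin/sh
# Sketch only: measure (-m) and visualize (-c) NUMA convergence with
# transparent hugepages enabled (--thp 1). The sizes (-p 2 -t 16 -P 512)
# are arbitrary example values.
CMD="perf bench numa mem -p 2 -t 16 -P 512 -m -c --thp 1"
echo "$CMD"
# Run it only where a perf build with the numa benchmark is installed:
command -v perf >/dev/null 2>&1 && $CMD || true
```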

For example you can create a 'numa02' work-alike by doing:

  vega:~> cat numa02
  #!/bin/bash

  perf bench numa mem --no-data_rand_walk -p 1 -t 32 -G 0 -P 0 -T 32 -l 800 -zZ0c $@

This 'perf bench numa' command mimics numa02 almost exactly on a 32-CPU system.

This will run it in a loop:

  vega:~> cat numa02-loop 

  while :; do
    ./numa02 2>&1 | grep runtime-max/thread
    sleep 1
  done

Or here are various numa01 work-alikes:

  vega:~> cat numa01
  perf bench numa mem --no-data_rand_walk -p 2 -t 16 -G 0 -P 3072 -T 0 -l 50 -zZ0c $@

  vega:~> cat numa01-hard-bind
  ./numa01 --cpus=0-16_16x16#16 --memnodes=0x16,2x16

or numa01-thread-alloc:

  vega:~> cat numa01-THREAD_ALLOC

  perf bench numa mem --no-data_rand_walk -p 2 -t 16 -G 0 -P 0 -T 192 -l 1000 -zZ0c $@

You can generate very flexible setups of NUMA access patterns, and measure their 
behavior accurately.

It's all so much more capable and more flexible than autonumabench ...

Also, when you are trying to report numbers for multiple runs, please use 
something like:

   perf stat --null --repeat 3 ...

This will run the workload 3 times (measuring only elapsed time) and report the 
mean and standard deviation in a human-readable form.
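
For a sense of what that reports: perf prints the mean elapsed time plus a 
relative spread across the repeats. As a rough illustration of that summary 
statistic (not perf's actual source, and using three made-up runtimes), the 
arithmetic looks like:

```shell
# Illustration only: mean and relative standard deviation over three
# hypothetical elapsed times, formatted like perf stat's summary line.
echo "4.01 3.97 4.08" | awk '{
    n = NF
    for (i = 1; i <= n; i++) sum += $i
    mean = sum / n
    for (i = 1; i <= n; i++) var += ($i - mean) ^ 2
    sd = sqrt(var / n)    # population stddev over the repeats
    printf "%.3f seconds time elapsed ( +- %.2f%% )\n", mean, 100 * sd / mean
}'
# prints: 4.020 seconds time elapsed ( +- 1.13% )
```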

Thanks,

	Ingo


Thread overview: 23+ messages
2015-06-16 11:55 [PATCH v2 0/4] Improve numa load balancing Srikar Dronamraju
2015-06-16 11:55 ` [PATCH v2 1/4] sched/tip:Prefer numa hotness over cache hotness Srikar Dronamraju
2015-07-06 15:50   ` [tip:sched/core] sched/numa: Prefer NUMA " tip-bot for Srikar Dronamraju
2015-07-07  0:19     ` Srikar Dronamraju
2015-07-08 13:31       ` Srikar Dronamraju
2015-07-07  6:49   ` tip-bot for Srikar Dronamraju
2015-06-16 11:56 ` [PATCH v2 2/4] sched:Consider imbalance_pct when comparing loads in numa_has_capacity Srikar Dronamraju
2015-06-16 14:39   ` Rik van Riel
2015-06-22 16:29     ` Srikar Dronamraju
2015-06-23  1:18       ` Rik van Riel
2015-06-23  8:10       ` Ingo Molnar [this message]
2015-06-23 13:01         ` Srikar Dronamraju
2015-06-23 14:36           ` Ingo Molnar
2015-07-06 15:50   ` [tip:sched/core] sched/numa: Consider 'imbalance_pct' when comparing loads in numa_has_capacity() tip-bot for Srikar Dronamraju
2015-07-07  6:49   ` tip-bot for Srikar Dronamraju
2015-06-16 11:56 ` [PATCH v2 3/4] sched:Fix task_numa_migrate to always update preferred node Srikar Dronamraju
2015-06-16 14:54   ` Rik van Riel
2015-06-16 17:19     ` Srikar Dronamraju
2015-06-16 18:25       ` Rik van Riel
2015-06-16 17:18   ` Rik van Riel
2015-06-16 11:56 ` [PATCH v2 4/4] sched:Use correct nid while evaluating task weights Srikar Dronamraju
2015-06-16 15:00   ` Rik van Riel
2015-06-16 17:26     ` Srikar Dronamraju
