linux-mm.kvack.org archive mirror
* Announce: the 'perf bench numa mem' NUMA performance measurement tool
@ 2012-12-07 20:55 Ingo Molnar
  2012-12-07 20:55 ` [PATCH] perf: Add 'perf bench numa mem' NUMA performance measurement suite Ingo Molnar
  2012-12-07 21:53 ` NUMA performance comparison between three NUMA kernels and mainline. [Mid-size NUMA system edition.] Ingo Molnar
  0 siblings, 2 replies; 6+ messages in thread
From: Ingo Molnar @ 2012-12-07 20:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins,
	Arnaldo Carvalho de Melo, Frederic Weisbecker, Mike Galbraith

This is a NUMA performance measurement tool I've been honing for some
time. People expressed interest in it, so here's a tidied-up version.

I also pushed it out into the tip:perf/bench branch:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/bench

Maybe others will find it useful too. I'll post a couple of perf bench
NUMA performance numbers in the next hour or so.

Thanks,

	Ingo

------------>
Ingo Molnar (1):
  perf: Add 'perf bench numa mem' NUMA performance measurement suite

 tools/perf/Makefile        |    3 +-
 tools/perf/bench/bench.h   |    1 +
 tools/perf/bench/numa.c    | 1731 ++++++++++++++++++++++++++++++++++++++++++++
 tools/perf/builtin-bench.c |   13 +
 tools/perf/util/hist.h     |    2 +-
 5 files changed, 1748 insertions(+), 2 deletions(-)
 create mode 100644 tools/perf/bench/numa.c

-- 
1.7.11.7



* [PATCH] perf: Add 'perf bench numa mem' NUMA performance measurement suite
  2012-12-07 20:55 Announce: the 'perf bench numa mem' NUMA performance measurement tool Ingo Molnar
@ 2012-12-07 20:55 ` Ingo Molnar
  2012-12-07 21:53 ` NUMA performance comparison between three NUMA kernels and mainline. [Mid-size NUMA system edition.] Ingo Molnar
  1 sibling, 0 replies; 6+ messages in thread
From: Ingo Molnar @ 2012-12-07 20:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins,
	Arnaldo Carvalho de Melo, Frederic Weisbecker, Mike Galbraith,
	Steven Rostedt

Add a suite of NUMA performance benchmarks.

The goal was to simulate the behavior and access patterns of real NUMA
workloads, via a wide range of parameters, so this tool goes well
beyond the simple bzero() measurements that most NUMA micro-benchmarks use:

 - It processes the data and creates a chain of data dependencies,
   like a real workload would. Neither the compiler, nor the
   kernel (via KSM and other optimizations) nor the CPU can
   eliminate parts of the workload.

 - It randomizes the initial state and also randomizes the target
   addresses of the processing - it's not a simple forward scan
   of addresses.

 - It provides flexible options to set process, thread and memory
   relationship information: -G sets "global" memory shared between
   all test processes, -P sets "process" memory shared by all threads
   of a process and -T sets "thread" private memory. (See the example
   invocation after this list.)

 - There's a NUMA convergence monitoring and convergence latency
   measurement option via -c and -m.

 - Micro-sleeps and synchronization can be injected to provoke lock
   contention and scheduling, via the -u and -S options. This simulates
   IO and contention.

 - The -x option instructs the workload to 'perturb' itself artificially
   every N seconds, by moving to the first and last CPU of the system
   periodically. This way the stability of convergence equilibrium and
   the number of steps taken for the scheduler to reach equilibrium again
   can be measured.

 - The amount of work can be specified via the -l loop count, and/or
   via a -s seconds-timeout value.

 - CPU and node memory binding options, to test hard binding scenarios.
   THP can be turned on and off via madvise() calls.

 - Live reporting of convergence progress in an 'at a glance' output format.
   Printing of convergence and de-convergence events.

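For example (using only options documented in the patch below - the
sizes are arbitrary), a 4-process x 4-thread run that sets all three
memory hierarchy levels, with a 20 seconds timeout and convergence
monitoring, would look like this:

   perf bench numa mem -p 4 -t 4 -G 100 -P 512 -T 64 -s 20 -z -c
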
The 'perf bench numa mem -a' option will run an array of several dozen
individual tests, each of which outputs measurements like these:

 # Running  5x5-bw-thread, "perf bench numa mem -p 5 -t 5 -P 512 -s 20 -zZ0q --thp  1"
  5x5-bw-thread,                         20.276, secs,           runtime-max/thread
  5x5-bw-thread,                         20.004, secs,           runtime-min/thread
  5x5-bw-thread,                         20.155, secs,           runtime-avg/thread
  5x5-bw-thread,                          0.671, %,              spread-runtime/thread
  5x5-bw-thread,                         21.153, GB,             data/thread
  5x5-bw-thread,                        528.818, GB,             data-total
  5x5-bw-thread,                          0.959, nsecs,          runtime/byte/thread
  5x5-bw-thread,                          1.043, GB/sec,         thread-speed
  5x5-bw-thread,                         26.081, GB/sec,         total-speed

See the help text and the code for more details.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 tools/perf/Makefile        |    3 +-
 tools/perf/bench/bench.h   |    1 +
 tools/perf/bench/numa.c    | 1731 ++++++++++++++++++++++++++++++++++++++++++++
 tools/perf/builtin-bench.c |   13 +
 tools/perf/util/hist.h     |    2 +-
 5 files changed, 1748 insertions(+), 2 deletions(-)
 create mode 100644 tools/perf/bench/numa.c

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index cca5bb8..91621f9 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -104,7 +104,7 @@ ifdef PARSER_DEBUG
 endif
 
 CFLAGS = -fno-omit-frame-pointer -ggdb3 -funwind-tables -Wall -Wextra -std=gnu99 $(CFLAGS_WERROR) $(CFLAGS_OPTIMIZE) $(EXTRA_WARNINGS) $(EXTRA_CFLAGS) $(PARSER_DEBUG_CFLAGS)
-EXTLIBS = -lpthread -lrt -lelf -lm
+EXTLIBS = -lpthread -lrt -lelf -lm -lnuma
 ALL_CFLAGS = $(CFLAGS) -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE
 ALL_LDFLAGS = $(LDFLAGS)
 STRIP ?= strip
@@ -435,6 +435,7 @@ LIB_OBJS += $(OUTPUT)tests/attr.o
 BUILTIN_OBJS += $(OUTPUT)builtin-annotate.o
 BUILTIN_OBJS += $(OUTPUT)builtin-bench.o
 # Benchmark modules
+BUILTIN_OBJS += $(OUTPUT)bench/numa.o
 BUILTIN_OBJS += $(OUTPUT)bench/sched-messaging.o
 BUILTIN_OBJS += $(OUTPUT)bench/sched-pipe.o
 ifeq ($(RAW_ARCH),x86_64)
diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
index 8f89998..a5223e6 100644
--- a/tools/perf/bench/bench.h
+++ b/tools/perf/bench/bench.h
@@ -1,6 +1,7 @@
 #ifndef BENCH_H
 #define BENCH_H
 
+extern int bench_numa(int argc, const char **argv, const char *prefix);
 extern int bench_sched_messaging(int argc, const char **argv, const char *prefix);
 extern int bench_sched_pipe(int argc, const char **argv, const char *prefix);
 extern int bench_mem_memcpy(int argc, const char **argv,
diff --git a/tools/perf/bench/numa.c b/tools/perf/bench/numa.c
new file mode 100644
index 0000000..30d1c32
--- /dev/null
+++ b/tools/perf/bench/numa.c
@@ -0,0 +1,1731 @@
+/*
+ * numa.c
+ *
+ * numa: Simulate NUMA-sensitive workloads and measure their NUMA performance
+ */
+
+#include "../perf.h"
+#include "../builtin.h"
+#include "../util/util.h"
+#include "../util/parse-options.h"
+
+#include "bench.h"
+
+#include <errno.h>
+#include <sched.h>
+#include <stdio.h>
+#include <assert.h>
+#include <malloc.h>
+#include <signal.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <sys/mman.h>
+#include <sys/time.h>
+#include <sys/wait.h>
+#include <sys/prctl.h>
+#include <sys/types.h>
+
+#include <numa.h>
+#include <numaif.h>
+
+/*
+ * Regular printout to the terminal, suppressed if -q is specified:
+ */
+#define tprintf(x...) do { if (g && g->p.show_details >= 0) printf(x); } while (0)
+
+/*
+ * Debug printf:
+ */
+#define dprintf(x...) do { if (g && g->p.show_details >= 1) printf(x); } while (0)
+
+struct thread_data {
+	int			curr_cpu;
+	cpu_set_t		bind_cpumask;
+	int			bind_node;
+	u8			*process_data;
+	int			process_nr;
+	int			thread_nr;
+	int			task_nr;
+	unsigned int		loops_done;
+	u64			val;
+	u64			runtime_ns;
+	pthread_mutex_t		*process_lock;
+};
+
+/* Parameters set by options: */
+
+struct params {
+	/* Startup synchronization: */
+	bool			serialize_startup;
+
+	/* Task hierarchy: */
+	int			nr_proc;
+	int			nr_threads;
+
+	/* Working set sizes: */
+	const char		*mb_global_str;
+	const char		*mb_proc_str;
+	const char		*mb_proc_locked_str;
+	const char		*mb_thread_str;
+
+	double			mb_global;
+	double			mb_proc;
+	double			mb_proc_locked;
+	double			mb_thread;
+
+	/* Access patterns to the working set: */
+	bool			data_reads;
+	bool			data_writes;
+	bool			data_backwards;
+	bool			data_zero_memset;
+	bool			data_rand_walk;
+	u32			nr_loops;
+	u32			nr_secs;
+	u32			sleep_usecs;
+
+	/* Working set initialization: */
+	bool			init_zero;
+	bool			init_random;
+	bool			init_cpu0;
+
+	/* Misc options: */
+	int			show_details;
+	int			run_all;
+	int			thp;
+
+	long			bytes_global;
+	long			bytes_process;
+	long			bytes_process_locked;
+	long			bytes_thread;
+
+	int			nr_tasks;
+	bool			show_quiet;
+
+	bool			show_convergence;
+	bool			measure_convergence;
+
+	int			perturb_secs;
+	int			nr_cpus;
+	int			nr_nodes;
+
+	/* Affinity options -C and -M: */
+	char			*cpu_list_str;
+	char			*node_list_str;
+};
+
+
+/* Global, read-writable area, accessible to all processes and threads: */
+
+struct global_info {
+	u8			*data;
+
+	pthread_mutex_t		startup_mutex;
+	int			nr_tasks_started;
+
+	pthread_mutex_t		startup_done_mutex;
+
+	pthread_mutex_t		start_work_mutex;
+	int			nr_tasks_working;
+
+	pthread_mutex_t		stop_work_mutex;
+	u64			bytes_done;
+
+	struct thread_data	*threads;
+
+	/* Convergence latency measurement: */
+	bool			all_converged;
+	bool			stop_work;
+
+	int			print_once;
+
+	struct params		p;
+};
+
+static struct global_info	*g = NULL;
+
+static int parse_cpus_opt(const struct option *opt, const char *arg, int unset);
+static int parse_nodes_opt(const struct option *opt, const char *arg, int unset);
+
+struct params p0;
+
+static const struct option options[] = {
+	OPT_INTEGER('p', "nr_proc"	, &p0.nr_proc,		"number of processes"),
+	OPT_INTEGER('t', "nr_threads"	, &p0.nr_threads,	"number of threads per process"),
+
+	OPT_STRING('G', "mb_global"	, &p0.mb_global_str,	"MB", "global  memory (MBs)"),
+	OPT_STRING('P', "mb_proc"	, &p0.mb_proc_str,	"MB", "process memory (MBs)"),
+	OPT_STRING('L', "mb_proc_locked", &p0.mb_proc_locked_str,"MB", "process serialized/locked memory access (MBs), <= process_memory"),
+	OPT_STRING('T', "mb_thread"	, &p0.mb_thread_str,	"MB", "thread  memory (MBs)"),
+
+	OPT_UINTEGER('l', "nr_loops"	, &p0.nr_loops,		"max number of loops to run"),
+	OPT_UINTEGER('s', "nr_secs"	, &p0.nr_secs,		"max number of seconds to run"),
+	OPT_UINTEGER('u', "usleep"	, &p0.sleep_usecs,	"usecs to sleep per loop iteration"),
+
+	OPT_BOOLEAN('R', "data_reads"	, &p0.data_reads,	"access the data via reads (can be mixed with -W)"),
+	OPT_BOOLEAN('W', "data_writes"	, &p0.data_writes,	"access the data via writes (can be mixed with -R)"),
+	OPT_BOOLEAN('B', "data_backwards", &p0.data_backwards,	"access the data backwards as well"),
+	OPT_BOOLEAN('Z', "data_zero_memset", &p0.data_zero_memset,"access the data via glibc bzero only"),
+	OPT_BOOLEAN('r', "data_rand_walk", &p0.data_rand_walk,	"access the data with random (32bit LFSR) walk"),
+
+
+	OPT_BOOLEAN('z', "init_zero"	, &p0.init_zero,	"bzero the initial allocations"),
+	OPT_BOOLEAN('I', "init_random"	, &p0.init_random,	"randomize the contents of the initial allocations"),
+	OPT_BOOLEAN('0', "init_cpu0"	, &p0.init_cpu0,	"do the initial allocations on CPU#0"),
+	OPT_INTEGER('x', "perturb_secs", &p0.perturb_secs,	"perturb thread 0/0 every X secs, to test convergence stability"),
+
+	OPT_INCR   ('d', "show_details"	, &p0.show_details,	"Show details"),
+	OPT_INCR   ('a', "all"		, &p0.run_all,		"Run all tests in the suite"),
+	OPT_INTEGER('H', "thp"		, &p0.thp,		"MADV_NOHUGEPAGE < 0 < MADV_HUGEPAGE"),
+	OPT_BOOLEAN('c', "show_convergence", &p0.show_convergence, "show convergence details"),
+	OPT_BOOLEAN('m', "measure_convergence",	&p0.measure_convergence, "measure convergence latency"),
+	OPT_BOOLEAN('q', "quiet"	, &p0.show_quiet,	"quiet mode: suppress normal output"),
+	OPT_BOOLEAN('S', "serialize-startup", &p0.serialize_startup,"serialize thread startup"),
+
+	/* Special option string parsing callbacks: */
+	OPT_CALLBACK('C', "cpus", NULL, "cpu[,cpu2,...cpuN]",
+			"bind the first N tasks to these specific cpus (the rest is unbound)",
+			parse_cpus_opt),
+	OPT_CALLBACK('M', "memnodes", NULL, "node[,node2,...nodeN]",
+			"bind the first N tasks to these specific memory nodes (the rest is unbound)",
+			parse_nodes_opt),
+	OPT_END()
+};
+
+static const char * const bench_numa_usage[] = {
+	"perf bench numa <options>",
+	NULL
+};
+
+static const char * const numa_usage[] = {
+	"perf bench numa mem [<options>]",
+	NULL
+};
+
+static cpu_set_t bind_to_cpu(int target_cpu)
+{
+	cpu_set_t orig_mask, mask;
+	int ret;
+
+	ret = sched_getaffinity(0, sizeof(orig_mask), &orig_mask);
+	BUG_ON(ret);
+
+	CPU_ZERO(&mask);
+
+	if (target_cpu == -1) {
+		int cpu;
+
+		for (cpu = 0; cpu < g->p.nr_cpus; cpu++)
+			CPU_SET(cpu, &mask);
+	} else {
+		BUG_ON(target_cpu < 0 || target_cpu >= g->p.nr_cpus);
+		CPU_SET(target_cpu, &mask);
+	}
+
+	ret = sched_setaffinity(0, sizeof(mask), &mask);
+	BUG_ON(ret);
+
+	return orig_mask;
+}
+
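+/*
+ * Bind the current task to all CPUs of the target node. Note that this
+ * assumes a regular topology where each node has nr_cpus/nr_nodes
+ * consecutive CPUs - the divisibility BUG_ON() below checks for that:
+ */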
+static cpu_set_t bind_to_node(int target_node)
+{
+	int cpus_per_node = g->p.nr_cpus/g->p.nr_nodes;
+	cpu_set_t orig_mask, mask;
+	int cpu;
+	int ret;
+
+	BUG_ON(cpus_per_node*g->p.nr_nodes != g->p.nr_cpus);
+	BUG_ON(!cpus_per_node);
+
+	ret = sched_getaffinity(0, sizeof(orig_mask), &orig_mask);
+	BUG_ON(ret);
+
+	CPU_ZERO(&mask);
+
+	if (target_node == -1) {
+		for (cpu = 0; cpu < g->p.nr_cpus; cpu++)
+			CPU_SET(cpu, &mask);
+	} else {
+		int cpu_start = (target_node + 0) * cpus_per_node;
+		int cpu_stop  = (target_node + 1) * cpus_per_node;
+
+		BUG_ON(cpu_stop > g->p.nr_cpus);
+
+		for (cpu = cpu_start; cpu < cpu_stop; cpu++)
+			CPU_SET(cpu, &mask);
+	}
+
+	ret = sched_setaffinity(0, sizeof(mask), &mask);
+	BUG_ON(ret);
+
+	return orig_mask;
+}
+
+static void bind_to_cpumask(cpu_set_t mask)
+{
+	int ret;
+
+	ret = sched_setaffinity(0, sizeof(mask), &mask);
+	BUG_ON(ret);
+}
+
+static void mempol_restore(void)
+{
+	int ret;
+
+	ret = set_mempolicy(MPOL_DEFAULT, NULL, g->p.nr_nodes-1);
+
+	BUG_ON(ret);
+}
+
+static void bind_to_memnode(int node)
+{
+	unsigned long nodemask;
+	int ret;
+
+	if (node == -1)
+		return;
+
+	BUG_ON(g->p.nr_nodes > (int)(sizeof(nodemask)*8));
+	nodemask = 1L << node;
+
+	ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask)*8);
+	dprintf("binding to node %d, mask: %016lx => %d\n", node, nodemask, ret);
+
+	BUG_ON(ret);
+}
+
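+/* Huge page size, used for buffer alignment (2MB, the x86 THP page size): */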
+#define HPSIZE (2*1024*1024)
+
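+/* Set the comm name of the current task (as visible in 'top'/'ps'): */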
+#define set_taskname(fmt...)				\
+do {							\
+	char name[20];					\
+							\
+	snprintf(name, 20, fmt);			\
+	prctl(PR_SET_NAME, name);			\
+} while (0)
+
+static u8 *alloc_data(ssize_t bytes0, int map_flags,
+		      int init_zero, int init_cpu0, int thp, int init_random)
+{
+	cpu_set_t orig_mask;
+	ssize_t bytes;
+	u8 *buf;
+	int ret;
+
+	if (!bytes0)
+		return NULL;
+
+	/* Allocate and initialize all memory on CPU#0: */
+	if (init_cpu0) {
+		orig_mask = bind_to_node(0);
+		bind_to_memnode(0);
+	}
+
+	bytes = bytes0 + HPSIZE;
+
+	buf = (void *)mmap(0, bytes, PROT_READ|PROT_WRITE, MAP_ANON|map_flags, -1, 0);
+	BUG_ON(buf == MAP_FAILED);
+
+	if (map_flags == MAP_PRIVATE) {
+		if (thp > 0) {
+			ret = madvise(buf, bytes, MADV_HUGEPAGE);
+			if (ret && !g->print_once) {
+				g->print_once = 1;
+				printf("WARNING: Could not enable THP - do: 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled'\n");
+			}
+		}
+		if (thp < 0) {
+			ret = madvise(buf, bytes, MADV_NOHUGEPAGE);
+			if (ret && !g->print_once) {
+				g->print_once = 1;
+				printf("WARNING: Could not disable THP: run a CONFIG_TRANSPARENT_HUGEPAGE kernel?\n");
+			}
+		}
+	}
+
+	if (init_zero) {
+		bzero(buf, bytes);
+	} else {
+		/* Initialize random contents, different in each word: */
+		if (init_random) {
+			u64 *wbuf = (void *)buf;
+			long off = rand();
+			long i;
+
+			for (i = 0; i < bytes/8; i++)
+				wbuf[i] = i + off;
+		}
+	}
+
+	/* Align to 2MB boundary: */
+	buf = (void *)(((unsigned long)buf + HPSIZE-1) & ~(HPSIZE-1));
+
+	/* Restore affinity: */
+	if (init_cpu0) {
+		bind_to_cpumask(orig_mask);
+		mempol_restore();
+	}
+
+	return buf;
+}
+
+static void free_data(void *data, ssize_t bytes)
+{
+	int ret;
+
+	if (!data)
+		return;
+
+	ret = munmap(data, bytes);
+	BUG_ON(ret);
+}
+
+/*
+ * Create a shared memory buffer that can be shared between processes, zeroed:
+ */
+static void * zalloc_shared_data(ssize_t bytes)
+{
+	return alloc_data(bytes, MAP_SHARED, 1, g->p.init_cpu0,  g->p.thp, g->p.init_random);
+}
+
+/*
+ * Create a shared memory buffer that can be shared between processes:
+ */
+static void * setup_shared_data(ssize_t bytes)
+{
+	return alloc_data(bytes, MAP_SHARED, 0, g->p.init_cpu0,  g->p.thp, g->p.init_random);
+}
+
+/*
+ * Allocate process-local memory - this will either be shared between
+ * threads of this process, or only be accessed by this thread:
+ */
+static void * setup_private_data(ssize_t bytes)
+{
+	return alloc_data(bytes, MAP_PRIVATE, 0, g->p.init_cpu0,  g->p.thp, g->p.init_random);
+}
+
+/*
+ * Initialize a process-shared (global) mutex:
+ */
+static void init_global_mutex(pthread_mutex_t *mutex)
+{
+	pthread_mutexattr_t attr;
+
+	pthread_mutexattr_init(&attr);
+	pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
+	pthread_mutex_init(mutex, &attr);
+}
+
+static int parse_cpu_list(const char *arg)
+{
+	p0.cpu_list_str = strdup(arg);
+
+	dprintf("got CPU list: {%s}\n", p0.cpu_list_str);
+
+	return 0;
+}
+
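+/*
+ * Set up the CPU binding of tasks according to the --cpus list. Each
+ * comma-separated token specifies a single CPU or a "cpu1-cpu2" range,
+ * with optional "#step" stride, "_len" mask-length and "xmul" repeat
+ * modifiers - see the examples in the parsing code below:
+ */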
+static void parse_setup_cpu_list(void)
+{
+	struct thread_data *td;
+	char *str0, *str;
+	int t;
+
+	if (!g->p.cpu_list_str)
+		return;
+
+	dprintf("g->p.nr_tasks: %d\n", g->p.nr_tasks);
+
+	str0 = str = strdup(g->p.cpu_list_str);
+	t = 0;
+
+	BUG_ON(!str);
+
+	tprintf("# binding tasks to CPUs:\n");
+	tprintf("#  ");
+
+	while (true) {
+		int bind_cpu, bind_cpu_0, bind_cpu_1;
+		char *tok, *tok_end, *tok_step, *tok_len, *tok_mul;
+		int bind_len;
+		int step;
+		int mul;
+
+		tok = strsep(&str, ",");
+		if (!tok)
+			break;
+
+		tok_end = strstr(tok, "-");
+
+		dprintf("\ntoken: {%s}, end: {%s}\n", tok, tok_end);
+		if (!tok_end) {
+			/* Single CPU specified: */
+			bind_cpu_0 = bind_cpu_1 = atol(tok);
+		} else {
+			/* CPU range specified (for example: "5-11"): */
+			bind_cpu_0 = atol(tok);
+			bind_cpu_1 = atol(tok_end + 1);
+		}
+
+		step = 1;
+		tok_step = strstr(tok, "#");
+		if (tok_step) {
+			step = atol(tok_step + 1);
+			BUG_ON(step <= 0 || step >= g->p.nr_cpus);
+		}
+
+		/*
+		 * Mask length.
+		 * Eg: "--cpus 8_4-16#4" means: '--cpus 8_4,12_4,16_4',
+		 * where the _4 means the next 4 CPUs are allowed.
+		 */
+		bind_len = 1;
+		tok_len = strstr(tok, "_");
+		if (tok_len) {
+			bind_len = atol(tok_len + 1);
+			BUG_ON(bind_len <= 0 || bind_len > g->p.nr_cpus);
+		}
+
+	/* Multiplier shortcut, "0x8" is a shortcut for: "0,0,0,0,0,0,0,0" */
+		mul = 1;
+		tok_mul = strstr(tok, "x");
+		if (tok_mul) {
+			mul = atol(tok_mul + 1);
+			BUG_ON(mul <= 0);
+		}
+
+		dprintf("CPUs: %d_%d-%d#%dx%d\n", bind_cpu_0, bind_len, bind_cpu_1, step, mul);
+
+		BUG_ON(bind_cpu_0 < 0 || bind_cpu_0 >= g->p.nr_cpus);
+		BUG_ON(bind_cpu_1 < 0 || bind_cpu_1 >= g->p.nr_cpus);
+		BUG_ON(bind_cpu_0 > bind_cpu_1);
+
+		for (bind_cpu = bind_cpu_0; bind_cpu <= bind_cpu_1; bind_cpu += step) {
+			int i;
+
+			for (i = 0; i < mul; i++) {
+				int cpu;
+
+				if (t >= g->p.nr_tasks) {
+					printf("\n# NOTE: ignoring bind CPUs starting at CPU#%d\n #", bind_cpu);
+					goto out;
+				}
+				td = g->threads + t;
+
+				if (t)
+					tprintf(",");
+				if (bind_len > 1) {
+					tprintf("%2d/%d", bind_cpu, bind_len);
+				} else {
+					tprintf("%2d", bind_cpu);
+				}
+
+				CPU_ZERO(&td->bind_cpumask);
+				for (cpu = bind_cpu; cpu < bind_cpu+bind_len; cpu++) {
+					BUG_ON(cpu < 0 || cpu >= g->p.nr_cpus);
+					CPU_SET(cpu, &td->bind_cpumask);
+				}
+				t++;
+			}
+		}
+	}
+out:
+
+	tprintf("\n");
+
+	if (t < g->p.nr_tasks)
+		printf("# NOTE: %d tasks bound, %d tasks unbound\n", t, g->p.nr_tasks - t);
+
+	free(str0);
+}
+
+static int parse_cpus_opt(const struct option *opt __maybe_unused,
+			  const char *arg, int unset __maybe_unused)
+{
+	if (!arg)
+		return -1;
+
+	return parse_cpu_list(arg);
+}
+
+static int parse_node_list(const char *arg)
+{
+	p0.node_list_str = strdup(arg);
+
+	dprintf("got NODE list: {%s}\n", p0.node_list_str);
+
+	return 0;
+}
+
+static void parse_setup_node_list(void)
+{
+	struct thread_data *td;
+	char *str0, *str;
+	int t;
+
+	if (!g->p.node_list_str)
+		return;
+
+	dprintf("g->p.nr_tasks: %d\n", g->p.nr_tasks);
+
+	str0 = str = strdup(g->p.node_list_str);
+	t = 0;
+
+	BUG_ON(!str);
+
+	tprintf("# binding tasks to NODEs:\n");
+	tprintf("# ");
+
+	while (true) {
+		int bind_node, bind_node_0, bind_node_1;
+		char *tok, *tok_end, *tok_step, *tok_mul;
+		int step;
+		int mul;
+
+		tok = strsep(&str, ",");
+		if (!tok)
+			break;
+
+		tok_end = strstr(tok, "-");
+
+		dprintf("\ntoken: {%s}, end: {%s}\n", tok, tok_end);
+		if (!tok_end) {
+			/* Single NODE specified: */
+			bind_node_0 = bind_node_1 = atol(tok);
+		} else {
+			/* NODE range specified (for example: "5-11"): */
+			bind_node_0 = atol(tok);
+			bind_node_1 = atol(tok_end + 1);
+		}
+
+		step = 1;
+		tok_step = strstr(tok, "#");
+		if (tok_step) {
+			step = atol(tok_step + 1);
+			BUG_ON(step <= 0 || step >= g->p.nr_nodes);
+		}
+
+	/* Multiplier shortcut, "0x8" is a shortcut for: "0,0,0,0,0,0,0,0" */
+		mul = 1;
+		tok_mul = strstr(tok, "x");
+		if (tok_mul) {
+			mul = atol(tok_mul + 1);
+			BUG_ON(mul <= 0);
+		}
+
+		dprintf("NODEs: %d-%d #%d\n", bind_node_0, bind_node_1, step);
+
+		BUG_ON(bind_node_0 < 0 || bind_node_0 >= g->p.nr_nodes);
+		BUG_ON(bind_node_1 < 0 || bind_node_1 >= g->p.nr_nodes);
+		BUG_ON(bind_node_0 > bind_node_1);
+
+		for (bind_node = bind_node_0; bind_node <= bind_node_1; bind_node += step) {
+			int i;
+
+			for (i = 0; i < mul; i++) {
+				if (t >= g->p.nr_tasks) {
+					printf("\n# NOTE: ignoring bind NODEs starting at NODE#%d\n", bind_node);
+					goto out;
+				}
+				td = g->threads + t;
+
+				if (!t)
+					tprintf(" %2d", bind_node);
+				else
+					tprintf(",%2d", bind_node);
+
+				td->bind_node = bind_node;
+				t++;
+			}
+		}
+	}
+out:
+
+	tprintf("\n");
+
+	if (t < g->p.nr_tasks)
+		printf("# NOTE: %d tasks mem-bound, %d tasks unbound\n", t, g->p.nr_tasks - t);
+
+	free(str0);
+}
+
+static int parse_nodes_opt(const struct option *opt __maybe_unused,
+			  const char *arg, int unset __maybe_unused)
+{
+	if (!arg)
+		return -1;
+
+	return parse_node_list(arg);
+}
+
+#define BIT(x) (1ul << x)
+
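+/*
+ * One step of a right-shifting Galois LFSR - used by the -r/--data_rand_walk
+ * mode below to generate a cheap, reproducible pseudo-random walk:
+ */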
+static inline uint32_t lfsr_32(uint32_t lfsr)
+{
+	const uint32_t taps = BIT(1) | BIT(5) | BIT(6) | BIT(31);
+	return (lfsr>>1) ^ ((0x0u - (lfsr & 0x1u)) & taps);
+}
+
+/*
+ * Make sure there's a real data dependency on RAM (when read
+ * accesses are enabled), so the compiler, the CPU and the
+ * kernel (KSM, zero page, etc.) cannot optimize away RAM
+ * accesses:
+ */
+static inline u64 access_data(u64 *data __attribute__((unused)), u64 val)
+{
+	if (g->p.data_reads)
+		val += *data;
+	if (g->p.data_writes)
+		*data = val + 1;
+	return val;
+}
+
+/*
+ * The worker process does two types of work, a forwards going
+ * loop and a backwards going loop.
+ *
+ * We do this so that on multiprocessor systems we do not create
+ * a 'train' of processing, with highly synchronized processes,
+ * skewing the whole benchmark.
+ */
+static u64 do_work(u8 *__data, long bytes, int nr, int nr_max, int loop, u64 val)
+{
+	long words = bytes/sizeof(u64);
+	u64 *data = (void *)__data;
+	long chunk_0, chunk_1;
+	u64 *d0, *d, *d1;
+	long off;
+	long i;
+
+	BUG_ON(!data && words);
+	BUG_ON(data && !words);
+
+	if (!data)
+		return val;
+
+	/* Very simple memset() work variant: */
+	if (g->p.data_zero_memset && !g->p.data_rand_walk) {
+		bzero(data, bytes);
+		return val;
+	}
+
+	/* Spread out by PID/TID nr and by loop nr: */
+	chunk_0 = words/nr_max;
+	chunk_1 = words/g->p.nr_loops;
+	off = nr*chunk_0 + loop*chunk_1;
+
+	while (off >= words)
+		off -= words;
+
+	if (g->p.data_rand_walk) {
+		u32 lfsr = nr + loop + val;
+		int j;
+
+		for (i = 0; i < words/1024; i++) {
+			long start, end;
+
+			lfsr = lfsr_32(lfsr);
+
+			start = lfsr % words;
+			end = min(start + 1024, words-1);
+
+			if (g->p.data_zero_memset) {
+				bzero(data + start, (end-start) * sizeof(u64));
+			} else {
+				for (j = start; j < end; j++)
+					val = access_data(data + j, val);
+			}
+		}
+	} else if (!g->p.data_backwards || (nr + loop) & 1) {
+
+		d0 = data + off;
+		d  = data + off + 1;
+		d1 = data + words;
+
+		/* Process data forwards: */
+		for (;;) {
+			if (unlikely(d >= d1))
+				d = data;
+			if (unlikely(d == d0))
+				break;
+
+			val = access_data(d, val);
+
+			d++;
+		}
+	} else {
+		/* Process data backwards: */
+
+		d0 = data + off;
+		d  = data + off - 1;
+		d1 = data + words;
+
+		for (;;) {
+			if (unlikely(d < data))
+				d = data + words-1;
+			if (unlikely(d == d0))
+				break;
+
+			val = access_data(d, val);
+
+			d--;
+		}
+	}
+
+	return val;
+}
+
+static void update_curr_cpu(int task_nr, unsigned long bytes_worked)
+{
+	unsigned int cpu;
+
+	cpu = sched_getcpu();
+
+	g->threads[task_nr].curr_cpu = cpu;
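+	/*
+	 * prctl() with option 0 is invalid and just returns -EINVAL -
+	 * presumably it is only here as a cheap marker that makes the
+	 * per-iteration progress visible to syscall tracers:
+	 */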
+	prctl(0, bytes_worked);
+}
+
+#define MAX_NR_NODES	64
+
+/*
+ * Count the number of nodes a process's threads
+ * are spread out on.
+ *
+ * A count of 1 means that the process is compressed
+ * to a single node. A count of g->p.nr_nodes means it's
+ * spread out on the whole system.
+ */
+static int count_process_nodes(int process_nr)
+{
+	char node_present[MAX_NR_NODES] = { 0, };
+	int nodes;
+	int n, t;
+
+	for (t = 0; t < g->p.nr_threads; t++) {
+		struct thread_data *td;
+		int task_nr;
+		int node;
+
+		task_nr = process_nr*g->p.nr_threads + t;
+		td = g->threads + task_nr;
+
+		node = numa_node_of_cpu(td->curr_cpu);
+		node_present[node] = 1;
+	}
+
+	nodes = 0;
+
+	for (n = 0; n < MAX_NR_NODES; n++)
+		nodes += node_present[n];
+
+	return nodes;
+}
+
+/*
+ * Count the number of distinct process-threads a node contains.
+ *
+ * A count of 1 means that the node contains only a single
+ * process. If all nodes on the system contain at most one
+ * process then we are well-converged.
+ */
+static int count_node_processes(int node)
+{
+	int processes = 0;
+	int t, p;
+
+	for (p = 0; p < g->p.nr_proc; p++) {
+		for (t = 0; t < g->p.nr_threads; t++) {
+			struct thread_data *td;
+			int task_nr;
+			int n;
+
+			task_nr = p*g->p.nr_threads + t;
+			td = g->threads + task_nr;
+
+			n = numa_node_of_cpu(td->curr_cpu);
+			if (n == node) {
+				processes++;
+				break;
+			}
+		}
+	}
+
+	return processes;
+}
+
+static void calc_convergence_compression(int *strong)
+{
+	unsigned int nodes_min, nodes_max;
+	int p;
+
+	nodes_min = -1;
+	nodes_max =  0;
+
+	for (p = 0; p < g->p.nr_proc; p++) {
+		unsigned int nodes = count_process_nodes(p);
+
+		nodes_min = min(nodes, nodes_min);
+		nodes_max = max(nodes, nodes_max);
+	}
+
+	/* Strong convergence: all threads compress on a single node: */
+	if (nodes_min == 1 && nodes_max == 1) {
+		*strong = 1;
+	} else {
+		*strong = 0;
+		tprintf(" {%d-%d}", nodes_min, nodes_max);
+	}
+}
+
+static void calc_convergence(double runtime_ns_max, double *convergence)
+{
+	unsigned int loops_done_min, loops_done_max;
+	int process_groups;
+	int nodes[MAX_NR_NODES];
+	int distance;
+	int nr_min;
+	int nr_max;
+	int strong;
+	int sum;
+	int nr;
+	int node;
+	int cpu;
+	int t;
+
+	if (!g->p.show_convergence && !g->p.measure_convergence)
+		return;
+
+	for (node = 0; node < g->p.nr_nodes; node++)
+		nodes[node] = 0;
+
+	loops_done_min = -1;
+	loops_done_max = 0;
+
+	for (t = 0; t < g->p.nr_tasks; t++) {
+		struct thread_data *td = g->threads + t;
+		unsigned int loops_done;
+
+		cpu = td->curr_cpu;
+
+		/* Not all threads have written it yet: */
+		if (cpu < 0)
+			continue;
+
+		node = numa_node_of_cpu(cpu);
+
+		nodes[node]++;
+
+		loops_done = td->loops_done;
+		loops_done_min = min(loops_done, loops_done_min);
+		loops_done_max = max(loops_done, loops_done_max);
+	}
+
+	nr_max = 0;
+	nr_min = g->p.nr_tasks;
+	sum = 0;
+
+	for (node = 0; node < g->p.nr_nodes; node++) {
+		nr = nodes[node];
+		nr_min = min(nr, nr_min);
+		nr_max = max(nr, nr_max);
+		sum += nr;
+	}
+	BUG_ON(nr_min > nr_max);
+
+	BUG_ON(sum > g->p.nr_tasks);
+
+	if (0 && (sum < g->p.nr_tasks))
+		return;
+
+	/*
+	 * Count the number of distinct process groups present
+	 * on nodes - when we are converged this will decrease
+	 * to g->p.nr_proc:
+	 */
+	process_groups = 0;
+
+	for (node = 0; node < g->p.nr_nodes; node++) {
+		int processes = count_node_processes(node);
+
+		nr = nodes[node];
+		tprintf(" %2d/%-2d", nr, processes);
+
+		process_groups += processes;
+	}
+
+	distance = nr_max - nr_min;
+
+	tprintf(" [%2d/%-2d]", distance, process_groups);
+
+	tprintf(" l:%3d-%-3d (%3d)",
+		loops_done_min, loops_done_max, loops_done_max-loops_done_min);
+
+	if (loops_done_min && loops_done_max) {
+		double skew = 1.0 - (double)loops_done_min/loops_done_max;
+
+		tprintf(" [%4.1f%%]", skew * 100.0);
+	}
+
+	calc_convergence_compression(&strong);
+
+	if (strong && process_groups == g->p.nr_proc) {
+		if (!*convergence) {
+			*convergence = runtime_ns_max;
+			tprintf(" (%6.1fs converged)\n", *convergence/1e9);
+			if (g->p.measure_convergence) {
+				g->all_converged = true;
+				g->stop_work = true;
+			}
+		}
+	} else {
+		if (*convergence) {
+			tprintf(" (%6.1fs de-converged)", runtime_ns_max/1e9);
+			*convergence = 0;
+		}
+		tprintf("\n");
+	}
+}
+
+static void show_summary(double runtime_ns_max, int l, double *convergence)
+{
+	tprintf("\r #  %5.1f%%  [%.1f mins]",
+		(double)(l+1)/g->p.nr_loops*100.0, runtime_ns_max/1e9 / 60.0);
+
+	calc_convergence(runtime_ns_max, convergence);
+
+	if (g->p.show_details >= 0)
+		fflush(stdout);
+}
+
+static void *worker_thread(void *__tdata)
+{
+	struct thread_data *td = __tdata;
+	struct timeval start0, start, stop, diff;
+	int process_nr = td->process_nr;
+	int thread_nr = td->thread_nr;
+	unsigned long last_perturbance;
+	int task_nr = td->task_nr;
+	int details = g->p.show_details;
+	int first_task, last_task;
+	double convergence = 0;
+	u64 val = td->val;
+	double runtime_ns_max;
+	u8 *global_data;
+	u8 *process_data;
+	u8 *thread_data;
+	u64 bytes_done;
+	long work_done;
+	u32 l;
+
+	bind_to_cpumask(td->bind_cpumask);
+	bind_to_memnode(td->bind_node);
+
+	set_taskname("thread %d/%d", process_nr, thread_nr);
+
+	global_data = g->data;
+	process_data = td->process_data;
+	thread_data = setup_private_data(g->p.bytes_thread);
+
+	bytes_done = 0;
+
+	last_task = 0;
+	if (process_nr == g->p.nr_proc-1 && thread_nr == g->p.nr_threads-1)
+		last_task = 1;
+
+	first_task = 0;
+	if (process_nr == 0 && thread_nr == 0)
+		first_task = 1;
+
+	if (details >= 2) {
+		printf("#  thread %2d / %2d global mem: %p, process mem: %p, thread mem: %p\n",
+			process_nr, thread_nr, global_data, process_data, thread_data);
+	}
+
+	if (g->p.serialize_startup) {
+		pthread_mutex_lock(&g->startup_mutex);
+		g->nr_tasks_started++;
+		pthread_mutex_unlock(&g->startup_mutex);
+
+		/* Here we will wait for the main process to start us all at once: */
+		pthread_mutex_lock(&g->start_work_mutex);
+		g->nr_tasks_working++;
+
+		/* Last one wake the main process: */
+		if (g->nr_tasks_working == g->p.nr_tasks)
+			pthread_mutex_unlock(&g->startup_done_mutex);
+
+		pthread_mutex_unlock(&g->start_work_mutex);
+	}
+
+	gettimeofday(&start0, NULL);
+
+	start = stop = start0;
+	last_perturbance = start.tv_sec;
+
+	for (l = 0; l < g->p.nr_loops; l++) {
+		start = stop;
+
+		if (g->stop_work)
+			break;
+
+		val += do_work(global_data,  g->p.bytes_global,  process_nr, g->p.nr_proc,	l, val);
+		val += do_work(process_data, g->p.bytes_process, thread_nr,  g->p.nr_threads,	l, val);
+		val += do_work(thread_data,  g->p.bytes_thread,  0,          1,		l, val);
+
+		if (g->p.sleep_usecs) {
+			pthread_mutex_lock(td->process_lock);
+			usleep(g->p.sleep_usecs);
+			pthread_mutex_unlock(td->process_lock);
+		}
+		/*
+		 * Amount of work to be done under a process-global lock:
+		 */
+		if (g->p.bytes_process_locked) {
+			pthread_mutex_lock(td->process_lock);
+			val += do_work(process_data, g->p.bytes_process_locked, thread_nr,  g->p.nr_threads,	l, val);
+			pthread_mutex_unlock(td->process_lock);
+		}
+
+		work_done = g->p.bytes_global + g->p.bytes_process +
+			    g->p.bytes_process_locked + g->p.bytes_thread;
+
+		update_curr_cpu(task_nr, work_done);
+		bytes_done += work_done;
+
+		if (details < 0 && !g->p.perturb_secs && !g->p.measure_convergence && !g->p.nr_secs)
+			continue;
+
+		td->loops_done = l;
+
+		gettimeofday(&stop, NULL);
+
+		/* Check whether our max runtime timed out: */
+		if (g->p.nr_secs) {
+			timersub(&stop, &start0, &diff);
+			if (diff.tv_sec >= g->p.nr_secs) {
+				g->stop_work = true;
+				break;
+			}
+		}
+
+		/* Update the summary at most once per second: */
+		if (start.tv_sec == stop.tv_sec)
+			continue;
+
+		/*
+		 * Perturb the first task's equilibrium every g->p.perturb_secs seconds,
+		 * by migrating to CPU#0:
+		 */
+		if (first_task && g->p.perturb_secs && (int)(stop.tv_sec - last_perturbance) >= g->p.perturb_secs) {
+			cpu_set_t orig_mask;
+			int target_cpu;
+			int this_cpu;
+
+			last_perturbance = stop.tv_sec;
+
+			/*
+			 * Depending on where we are running, move into
+			 * the other half of the system, to create some
+			 * real disturbance:
+			 */
+			this_cpu = g->threads[task_nr].curr_cpu;
+			if (this_cpu < g->p.nr_cpus/2)
+				target_cpu = g->p.nr_cpus-1;
+			else
+				target_cpu = 0;
+
+			orig_mask = bind_to_cpu(target_cpu);
+
+			/* Here we are running on the target CPU already */
+			if (details >= 1)
+				printf(" (injecting perturbalance, moved to CPU#%d)\n", target_cpu);
+
+			bind_to_cpumask(orig_mask);
+		}
+
+		if (details >= 3) {
+			timersub(&stop, &start, &diff);
+			runtime_ns_max = diff.tv_sec * 1000000000ULL;
+			runtime_ns_max += diff.tv_usec * 1000ULL;
+
+			if (details >= 0) {
+				printf(" #%2d / %2d: %14.2lf nsecs/op [val: %016lx]\n",
+					process_nr, thread_nr, runtime_ns_max / bytes_done, val);
+			}
+			fflush(stdout);
+		}
+		if (!last_task)
+			continue;
+
+		timersub(&stop, &start0, &diff);
+		runtime_ns_max = diff.tv_sec * 1000000000ULL;
+		runtime_ns_max += diff.tv_usec * 1000ULL;
+
+		show_summary(runtime_ns_max, l, &convergence);
+	}
+
+	gettimeofday(&stop, NULL);
+	timersub(&stop, &start0, &diff);
+	td->runtime_ns = diff.tv_sec * 1000000000ULL;
+	td->runtime_ns += diff.tv_usec * 1000ULL;
+
+	free_data(thread_data, g->p.bytes_thread);
+
+	pthread_mutex_lock(&g->stop_work_mutex);
+	g->bytes_done += bytes_done;
+	pthread_mutex_unlock(&g->stop_work_mutex);
+
+	return NULL;
+}
+
+/*
+ * A worker process starts a couple of threads:
+ */
+static void worker_process(int process_nr)
+{
+	pthread_mutex_t process_lock;
+	struct thread_data *td;
+	pthread_t *pthreads;
+	u8 *process_data;
+	int task_nr;
+	int ret;
+	int t;
+
+	pthread_mutex_init(&process_lock, NULL);
+	set_taskname("process %d", process_nr);
+
+	/*
+	 * Pick up the memory policy and the CPU binding of our first thread,
+	 * so that we initialize memory accordingly:
+	 */
+	task_nr = process_nr*g->p.nr_threads;
+	td = g->threads + task_nr;
+
+	bind_to_memnode(td->bind_node);
+	bind_to_cpumask(td->bind_cpumask);
+
+	pthreads = zalloc(g->p.nr_threads * sizeof(pthread_t));
+	process_data = setup_private_data(g->p.bytes_process);
+
+	if (g->p.show_details >= 3) {
+		printf(" # process %2d global mem: %p, process mem: %p\n",
+			process_nr, g->data, process_data);
+	}
+
+	for (t = 0; t < g->p.nr_threads; t++) {
+		task_nr = process_nr*g->p.nr_threads + t;
+		td = g->threads + task_nr;
+
+		td->process_data = process_data;
+		td->process_nr   = process_nr;
+		td->thread_nr    = t;
+		td->task_nr	 = task_nr;
+		td->val          = rand();
+		td->curr_cpu	 = -1;
+		td->process_lock = &process_lock;
+
+		ret = pthread_create(pthreads + t, NULL, worker_thread, td);
+		BUG_ON(ret);
+	}
+
+	for (t = 0; t < g->p.nr_threads; t++) {
+		ret = pthread_join(pthreads[t], NULL);
+		BUG_ON(ret);
+	}
+
+	free_data(process_data, g->p.bytes_process);
+	free(pthreads);
+}
+
+static void print_summary(void)
+{
+	if (g->p.show_details < 0)
+		return;
+
+	printf("\n ###\n");
+	printf(" # %d %s will execute (on %d nodes, %d CPUs):\n",
+		g->p.nr_tasks, g->p.nr_tasks == 1 ? "task" : "tasks", g->p.nr_nodes, g->p.nr_cpus);
+	printf(" #      %5dx %5ldMB global  shared mem operations\n",
+			g->p.nr_loops, g->p.bytes_global/1024/1024);
+	printf(" #      %5dx %5ldMB process shared mem operations\n",
+			g->p.nr_loops, g->p.bytes_process/1024/1024);
+	printf(" #      %5dx %5ldMB thread  local  mem operations\n",
+			g->p.nr_loops, g->p.bytes_thread/1024/1024);
+
+	printf(" ###\n");
+
+	printf("\n ###\n"); fflush(stdout);
+}
+
+static void init_thread_data(void)
+{
+	ssize_t size = sizeof(*g->threads)*g->p.nr_tasks;
+	int t;
+
+	g->threads = zalloc_shared_data(size);
+
+	for (t = 0; t < g->p.nr_tasks; t++) {
+		struct thread_data *td = g->threads + t;
+		int cpu;
+
+		/* Allow all nodes by default: */
+		td->bind_node = -1;
+
+		/* Allow all CPUs by default: */
+		CPU_ZERO(&td->bind_cpumask);
+		for (cpu = 0; cpu < g->p.nr_cpus; cpu++)
+			CPU_SET(cpu, &td->bind_cpumask);
+	}
+}
+
+static void deinit_thread_data(void)
+{
+	ssize_t size = sizeof(*g->threads)*g->p.nr_tasks;
+
+	free_data(g->threads, size);
+}
+
+static int init(void)
+{
+	g = (void *)alloc_data(sizeof(*g), MAP_SHARED, 1, 0, 0 /* THP */, 0);
+
+	/* Copy over options: */
+	g->p = p0;
+
+	g->p.nr_cpus = numa_num_configured_cpus();
+
+	g->p.nr_nodes = numa_max_node() + 1;
+
+	/* char array in count_process_nodes(): */
+	BUG_ON(g->p.nr_nodes > MAX_NR_NODES || g->p.nr_nodes < 0);
+
+	if (g->p.show_quiet && !g->p.show_details)
+		g->p.show_details = -1;
+
+	/* Some memory should be specified: */
+	if (!g->p.mb_global_str && !g->p.mb_proc_str && !g->p.mb_thread_str)
+		return -1;
+
+	if (g->p.mb_global_str) {
+		g->p.mb_global = atof(g->p.mb_global_str);
+		BUG_ON(g->p.mb_global < 0);
+	}
+
+	if (g->p.mb_proc_str) {
+		g->p.mb_proc = atof(g->p.mb_proc_str);
+		BUG_ON(g->p.mb_proc < 0);
+	}
+
+	if (g->p.mb_proc_locked_str) {
+		g->p.mb_proc_locked = atof(g->p.mb_proc_locked_str);
+		BUG_ON(g->p.mb_proc_locked < 0);
+		BUG_ON(g->p.mb_proc_locked > g->p.mb_proc);
+	}
+
+	if (g->p.mb_thread_str) {
+		g->p.mb_thread = atof(g->p.mb_thread_str);
+		BUG_ON(g->p.mb_thread < 0);
+	}
+
+	BUG_ON(g->p.nr_threads <= 0);
+	BUG_ON(g->p.nr_proc <= 0);
+
+	g->p.nr_tasks = g->p.nr_proc*g->p.nr_threads;
+
+	g->p.bytes_global		= g->p.mb_global	*1024L*1024L;
+	g->p.bytes_process		= g->p.mb_proc		*1024L*1024L;
+	g->p.bytes_process_locked	= g->p.mb_proc_locked	*1024L*1024L;
+	g->p.bytes_thread		= g->p.mb_thread	*1024L*1024L;
+
+	g->data = setup_shared_data(g->p.bytes_global);
+
+	/* Startup serialization: */
+	init_global_mutex(&g->start_work_mutex);
+	init_global_mutex(&g->startup_mutex);
+	init_global_mutex(&g->startup_done_mutex);
+	init_global_mutex(&g->stop_work_mutex);
+
+	init_thread_data();
+
+	tprintf("#\n");
+	parse_setup_cpu_list();
+	parse_setup_node_list();
+	tprintf("#\n");
+
+	print_summary();
+
+	return 0;
+}
+
+static void deinit(void)
+{
+	free_data(g->data, g->p.bytes_global);
+	g->data = NULL;
+
+	deinit_thread_data();
+
+	free_data(g, sizeof(*g));
+	g = NULL;
+}
+
+/*
+ * Print a short or long result, depending on the verbosity setting:
+ */
+static void print_res(const char *name, double val,
+		      const char *txt_unit, const char *txt_short, const char *txt_long)
+{
+	if (!name)
+		name = "main,";
+
+	if (g->p.show_quiet)
+		printf(" %-30s %15.3f, %-15s %s\n", name, val, txt_unit, txt_short);
+	else
+		printf(" %14.3f %s\n", val, txt_long);
+}
+
+static int __bench_numa(const char *name)
+{
+	struct timeval start, stop, diff;
+	u64 runtime_ns_min, runtime_ns_sum;
+	pid_t *pids, pid, wpid;
+	double delta_runtime;
+	double runtime_avg;
+	double runtime_sec_max;
+	double runtime_sec_min;
+	int wait_stat;
+	double bytes;
+	int i, t;
+
+	if (init())
+		return -1;
+
+	pids = zalloc(g->p.nr_proc * sizeof(*pids));
+	pid = -1;
+
+	/* All threads try to acquire it; this way we can wait for them to start up: */
+	pthread_mutex_lock(&g->start_work_mutex);
+
+	if (g->p.serialize_startup) {
+		tprintf(" #\n");
+		tprintf(" # Startup synchronization: ..."); fflush(stdout);
+	}
+
+	gettimeofday(&start, NULL);
+
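+	/* Fork one worker process per -p; each worker spawns its -t threads: */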
+	for (i = 0; i < g->p.nr_proc; i++) {
+		pid = fork();
+		dprintf(" # process %2d: PID %d\n", i, pid);
+
+		BUG_ON(pid < 0);
+		if (!pid) {
+			/* Child process: */
+			worker_process(i);
+
+			exit(0);
+		}
+		pids[i] = pid;
+
+	}
+	/* Wait for all the threads to start up: */
+	while (g->nr_tasks_started != g->p.nr_tasks)
+		usleep(1000);
+
+	BUG_ON(g->nr_tasks_started != g->p.nr_tasks);
+
+	if (g->p.serialize_startup) {
+		double startup_sec;
+
+		pthread_mutex_lock(&g->startup_done_mutex);
+
+		/* This will start all threads: */
+		pthread_mutex_unlock(&g->start_work_mutex);
+
+		/* This mutex is locked - the last started thread will wake us: */
+		pthread_mutex_lock(&g->startup_done_mutex);
+
+		gettimeofday(&stop, NULL);
+
+		timersub(&stop, &start, &diff);
+
+		startup_sec = diff.tv_sec * 1000000000.0;
+		startup_sec += diff.tv_usec * 1000.0;
+		startup_sec /= 1e9;
+
+		tprintf(" threads initialized in %.6f seconds.\n", startup_sec);
+		tprintf(" #\n");
+
+		start = stop;
+		pthread_mutex_unlock(&g->startup_done_mutex);
+	} else {
+		gettimeofday(&start, NULL);
+	}
+
+	/* Parent process: */
+
+
+	for (i = 0; i < g->p.nr_proc; i++) {
+		wpid = waitpid(pids[i], &wait_stat, 0);
+		BUG_ON(wpid < 0);
+		BUG_ON(!WIFEXITED(wait_stat));
+
+	}
+
+	runtime_ns_sum = 0;
+	runtime_ns_min = -1LL;
+
+	for (t = 0; t < g->p.nr_tasks; t++) {
+		u64 thread_runtime_ns = g->threads[t].runtime_ns;
+
+		runtime_ns_sum += thread_runtime_ns;
+		runtime_ns_min = min(thread_runtime_ns, runtime_ns_min);
+	}
+
+	gettimeofday(&stop, NULL);
+	timersub(&stop, &start, &diff);
+
+	BUG_ON(bench_format != BENCH_FORMAT_DEFAULT);
+
+	tprintf("\n ###\n");
+	tprintf("\n");
+
+	runtime_sec_max = diff.tv_sec * 1000000000.0;
+	runtime_sec_max += diff.tv_usec * 1000.0;
+	runtime_sec_max /= 1e9;
+
+	runtime_sec_min = runtime_ns_min/1e9;
+
+	bytes = g->bytes_done;
+	runtime_avg = (double)runtime_ns_sum / g->p.nr_tasks / 1e9;
+
+	if (g->p.measure_convergence) {
+		print_res(name, runtime_sec_max,
+			"secs,", "NUMA-convergence-latency", "secs latency to NUMA-converge");
+	}
+
+	print_res(name, runtime_sec_max,
+		"secs,", "runtime-max/thread",	"secs slowest (max) thread-runtime");
+
+	print_res(name, runtime_sec_min,
+		"secs,", "runtime-min/thread",	"secs fastest (min) thread-runtime");
+
+	print_res(name, runtime_avg,
+		"secs,", "runtime-avg/thread",	"secs average thread-runtime");
+
+	delta_runtime = (runtime_sec_max - runtime_sec_min)/2.0;
+	print_res(name, delta_runtime / runtime_sec_max * 100.0,
+		"%,", "spread-runtime/thread",	"% difference between max/avg runtime");
+
+	print_res(name, bytes / g->p.nr_tasks / 1e9,
+		"GB,", "data/thread",		"GB data processed, per thread");
+
+	print_res(name, bytes / 1e9,
+		"GB,", "data-total",		"GB data processed, total");
+
+	print_res(name, runtime_sec_max * 1e9 / (bytes / g->p.nr_tasks),
+		"nsecs,", "runtime/byte/thread","nsecs/byte/thread runtime");
+
+	print_res(name, bytes / g->p.nr_tasks / 1e9 / runtime_sec_max,
+		"GB/sec,", "thread-speed",	"GB/sec/thread speed");
+
+	print_res(name, bytes / runtime_sec_max / 1e9,
+		"GB/sec,", "total-speed",	"GB/sec total speed");
+
+	free(pids);
+
+	deinit();
+
+	return 0;
+}
+
+#define MAX_ARGS 50
+
+static int command_size(const char **argv)
+{
+	int size = 0;
+
+	while (*argv) {
+		size++;
+		argv++;
+	}
+
+	BUG_ON(size >= MAX_ARGS);
+
+	return size;
+}
+
+static void init_params(struct params *p, const char *name, int argc, const char **argv)
+{
+	int i;
+
+	printf("\n # Running %s \"perf bench numa", name);
+
+	for (i = 0; i < argc; i++)
+		printf(" %s", argv[i]);
+
+	printf("\"\n");
+
+	memset(p, 0, sizeof(*p));
+
+	/* Initialize nonzero defaults: */
+
+	p->serialize_startup		= 1;
+	p->data_reads			= true;
+	p->data_writes			= true;
+	p->data_backwards		= true;
+	p->data_rand_walk		= true;
+	p->nr_loops			= -1;
+	p->init_random			= true;
+}
+
+static int run_bench_numa(const char *name, const char **argv)
+{
+	int argc = command_size(argv);
+
+	init_params(&p0, name, argc, argv);
+	argc = parse_options(argc, argv, options, bench_numa_usage, 0);
+	if (argc)
+		goto err;
+
+	if (__bench_numa(name))
+		goto err;
+
+	return 0;
+
+err:
+	usage_with_options(numa_usage, options);
+	return -1;
+}
+
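+/*
+ * Option blocks used by the built-in test-suite below. Note that the
+ * _NOTHP variants append a second "--thp -1" after the base block's
+ * "--thp 1" - the option parsed last wins:
+ */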
+#define OPT_BW_RAM		"-s",  "20", "-zZq",    "--thp", " 1", "--no-data_rand_walk"
+#define OPT_BW_RAM_NOTHP	OPT_BW_RAM,		"--thp", "-1"
+
+#define OPT_CONV		"-s", "100", "-zZ0qcm", "--thp", " 1"
+#define OPT_CONV_NOTHP		OPT_CONV,		"--thp", "-1"
+
+#define OPT_BW			"-s",  "20", "-zZ0q",   "--thp", " 1"
+#define OPT_BW_NOTHP		OPT_BW,			"--thp", "-1"
+
+/*
+ * The built-in test-suite executed by "perf bench numa -a".
+ *
+ * (A minimum of 4 nodes and 16 GB of RAM is recommended.)
+ */
+static const char *tests[][MAX_ARGS] = {
+   /* Basic single-stream NUMA bandwidth measurements: */
+   { "RAM-bw-local,",	  "mem",  "-p",  "1",  "-t",  "1", "-P", "1024",
+			  "-C" ,   "0", "-M",   "0", OPT_BW_RAM },
+   { "RAM-bw-local-NOTHP,",
+			  "mem",  "-p",  "1",  "-t",  "1", "-P", "1024",
+			  "-C" ,   "0", "-M",   "0", OPT_BW_RAM_NOTHP },
+   { "RAM-bw-remote,",	  "mem",  "-p",  "1",  "-t",  "1", "-P", "1024",
+			  "-C" ,   "0", "-M",   "1", OPT_BW_RAM },
+
+   /* 2-stream NUMA bandwidth measurements: */
+   { "RAM-bw-local-2x,",  "mem",  "-p",  "2",  "-t",  "1", "-P", "1024",
+			   "-C", "0,2", "-M", "0x2", OPT_BW_RAM },
+   { "RAM-bw-remote-2x,", "mem",  "-p",  "2",  "-t",  "1", "-P", "1024",
+		 	   "-C", "0,2", "-M", "1x2", OPT_BW_RAM },
+
+   /* Cross-stream NUMA bandwidth measurement: */
+   { "RAM-bw-cross,",     "mem",  "-p",  "2",  "-t",  "1", "-P", "1024",
+		 	   "-C", "0,8", "-M", "1,0", OPT_BW_RAM },
+
+   /* Convergence latency measurements: */
+   { " 1x3-convergence,", "mem",  "-p",  "1", "-t",  "3", "-P",  "512", OPT_CONV },
+   { " 1x4-convergence,", "mem",  "-p",  "1", "-t",  "4", "-P",  "512", OPT_CONV },
+   { " 1x6-convergence,", "mem",  "-p",  "1", "-t",  "6", "-P", "1020", OPT_CONV },
+   { " 2x3-convergence,", "mem",  "-p",  "3", "-t",  "3", "-P", "1020", OPT_CONV },
+   { " 3x3-convergence,", "mem",  "-p",  "3", "-t",  "3", "-P", "1020", OPT_CONV },
+   { " 4x4-convergence,", "mem",  "-p",  "4", "-t",  "4", "-P",  "512", OPT_CONV },
+   { " 4x4-convergence-NOTHP,",
+			  "mem",  "-p",  "4", "-t",  "4", "-P",  "512", OPT_CONV_NOTHP },
+   { " 4x6-convergence,", "mem",  "-p",  "4", "-t",  "6", "-P", "1020", OPT_CONV },
+   { " 4x8-convergence,", "mem",  "-p",  "4", "-t",  "8", "-P",  "512", OPT_CONV },
+   { " 8x4-convergence,", "mem",  "-p",  "8", "-t",  "4", "-P",  "512", OPT_CONV },
+   { " 8x4-convergence-NOTHP,",
+			  "mem",  "-p",  "8", "-t",  "4", "-P",  "512", OPT_CONV_NOTHP },
+   { " 3x1-convergence,", "mem",  "-p",  "3", "-t",  "1", "-P",  "512", OPT_CONV },
+   { " 4x1-convergence,", "mem",  "-p",  "4", "-t",  "1", "-P",  "512", OPT_CONV },
+   { " 8x1-convergence,", "mem",  "-p",  "8", "-t",  "1", "-P",  "512", OPT_CONV },
+   { "16x1-convergence,", "mem",  "-p", "16", "-t",  "1", "-P",  "256", OPT_CONV },
+   { "32x1-convergence,", "mem",  "-p", "32", "-t",  "1", "-P",  "128", OPT_CONV },
+
+   /* Various NUMA process/thread layout bandwidth measurements: */
+   { " 2x1-bw-process,",  "mem",  "-p",  "2", "-t",  "1", "-P", "1024", OPT_BW },
+   { " 3x1-bw-process,",  "mem",  "-p",  "3", "-t",  "1", "-P", "1024", OPT_BW },
+   { " 4x1-bw-process,",  "mem",  "-p",  "4", "-t",  "1", "-P", "1024", OPT_BW },
+   { " 8x1-bw-process,",  "mem",  "-p",  "8", "-t",  "1", "-P", " 512", OPT_BW },
+   { " 8x1-bw-process-NOTHP,",
+			  "mem",  "-p",  "8", "-t",  "1", "-P", " 512", OPT_BW_NOTHP },
+   { "16x1-bw-process,",  "mem",  "-p", "16", "-t",  "1", "-P",  "256", OPT_BW },
+
+   { " 4x1-bw-thread,",	  "mem",  "-p",  "1", "-t",  "4", "-T",  "256", OPT_BW },
+   { " 8x1-bw-thread,",	  "mem",  "-p",  "1", "-t",  "8", "-T",  "256", OPT_BW },
+   { "16x1-bw-thread,",   "mem",  "-p",  "1", "-t", "16", "-T",  "128", OPT_BW },
+   { "32x1-bw-thread,",   "mem",  "-p",  "1", "-t", "32", "-T",   "64", OPT_BW },
+
+   { " 2x3-bw-thread,",	  "mem",  "-p",  "2", "-t",  "3", "-P",  "512", OPT_BW },
+   { " 4x4-bw-thread,",	  "mem",  "-p",  "4", "-t",  "4", "-P",  "512", OPT_BW },
+   { " 4x6-bw-thread,",	  "mem",  "-p",  "4", "-t",  "6", "-P",  "512", OPT_BW },
+   { " 4x8-bw-thread,",	  "mem",  "-p",  "4", "-t",  "8", "-P",  "512", OPT_BW },
+   { " 4x8-bw-thread-NOTHP,",
+			  "mem",  "-p",  "4", "-t",  "8", "-P",  "512", OPT_BW_NOTHP },
+   { " 3x3-bw-thread,",	  "mem",  "-p",  "3", "-t",  "3", "-P",  "512", OPT_BW },
+   { " 5x5-bw-thread,",	  "mem",  "-p",  "5", "-t",  "5", "-P",  "512", OPT_BW },
+
+   { "2x16-bw-thread,",   "mem",  "-p",  "2", "-t", "16", "-P",  "512", OPT_BW },
+   { "1x32-bw-thread,",   "mem",  "-p",  "1", "-t", "32", "-P", "2048", OPT_BW },
+
+   { "numa02-bw,",	  "mem",  "-p",  "1", "-t", "32", "-T",   "32", OPT_BW },
+   { "numa02-bw-NOTHP,",  "mem",  "-p",  "1", "-t", "32", "-T",   "32", OPT_BW_NOTHP },
+   { "numa01-bw-thread,", "mem",  "-p",  "2", "-t", "16", "-T",  "192", OPT_BW },
+   { "numa01-bw-thread-NOTHP,",
+			  "mem",  "-p",  "2", "-t", "16", "-T",  "192", OPT_BW_NOTHP },
+};
+
+static int bench_all(void)
+{
+	int nr = ARRAY_SIZE(tests);
+	int ret;
+	int i;
+
+	ret = system("echo ' #'; echo ' # Running test on: '$(uname -a); echo ' #'");
+	BUG_ON(ret < 0);
+
+	for (i = 0; i < nr; i++) {
+		if (run_bench_numa(tests[i][0], tests[i] + 1))
+			return -1;
+	}
+
+	printf("\n");
+
+	return 0;
+}
+
+int bench_numa(int argc, const char **argv, const char *prefix __maybe_unused)
+{
+	init_params(&p0, "main,", argc, argv);
+	argc = parse_options(argc, argv, options, bench_numa_usage, 0);
+	if (argc)
+		goto err;
+
+	if (p0.run_all)
+		return bench_all();
+
+	if (__bench_numa(NULL))
+		goto err;
+
+	return 0;
+
+err:
+	usage_with_options(numa_usage, options);
+	return -1;
+}
diff --git a/tools/perf/builtin-bench.c b/tools/perf/builtin-bench.c
index cae9a5f..441cdb4 100644
--- a/tools/perf/builtin-bench.c
+++ b/tools/perf/builtin-bench.c
@@ -35,6 +35,16 @@ struct bench_suite {
 /* sentinel: easy for help */
 #define suite_all { "all", "Test all benchmark suites", NULL }
 
+static struct bench_suite numa_suites[] = {
+	{ "mem",
+	  "Benchmark for NUMA workloads",
+	  bench_numa },
+	suite_all,
+	{ NULL,
+	  NULL,
+	  NULL                  }
+};
+
 static struct bench_suite sched_suites[] = {
 	{ "messaging",
 	  "Benchmark for scheduler and IPC mechanisms",
@@ -68,6 +78,9 @@ struct bench_subsys {
 };
 
 static struct bench_subsys subsystems[] = {
+	{ "numa",
+	  "NUMA scheduling and MM behavior",
+	  numa_suites },
 	{ "sched",
 	  "scheduler and IPC mechanism",
 	  sched_suites },
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 1278c2c..8b091a5 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -195,7 +195,7 @@ static inline int hist_entry__tui_annotate(struct hist_entry *self
 	return 0;
 }
 
-static inline int script_browse(const char *script_opt)
+static inline int script_browse(const char *script_opt __maybe_unused)
 {
 	return 0;
 }
-- 
1.7.11.7



* NUMA performance comparison between three NUMA kernels and mainline. [Mid-size NUMA system edition.]
  2012-12-07 20:55 Announce: the 'perf bench numa mem' NUMA performance measurement tool Ingo Molnar
  2012-12-07 20:55 ` [PATCH] perf: Add 'perf bench numa mem' NUMA performance measurement suite Ingo Molnar
@ 2012-12-07 21:53 ` Ingo Molnar
  2012-12-10 12:33   ` Mel Gorman
  1 sibling, 1 reply; 6+ messages in thread
From: Ingo Molnar @ 2012-12-07 21:53 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins,
	Arnaldo Carvalho de Melo, Frederic Weisbecker, Mike Galbraith


Here's a (strongly NUMA-centric) performance comparison of the 
three NUMA kernels: the 'balancenuma-v10' tree from Mel, the 
AutoNUMA-v28 kernel from Andrea and the unified NUMA -v3 tree 
Peter and I are working on.

The goal of these measurements is to specifically quantify the 
NUMA optimization qualities of each of the three NUMA-optimizing 
kernels.

There are lots of numbers in this mail and a lot of material to 
read - sorry about that! :-/

I used the latest available kernel versions everywhere; 
furthermore, the AutoNUMA-v28 tree has been patched with Hugh 
Dickins's THP-migration support patch, to make it a fair 
apples-to-apples comparison.

I have used the 'perf bench numa' tool to do the measurements, 
which can be found at:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/bench

   # to build it, install numactl-dev[el] and do: "cd tools/perf; make -j install"

To get the raw numbers I ran "perf bench numa mem -a" multiple 
times on each kernel, on a 32-way, 64 GB RAM, 4-node Opteron 
test-system. Each kernel used the same base .config, copied from 
a Fedora RPM kernel, with the NUMA-balancing options enabled.

( Note that the testcases are tailored to my test-system: on
  a smaller system you'd want to run slightly smaller testcases,
  on a larger system you'd want to run a couple of larger 
  testcases as well. )

NUMA convergence latency measurements
-------------------------------------

'NUMA convergence' latency is the number of seconds a workload 
takes to reach 'perfectly NUMA balanced' state. This is measured 
on the CPU placement side: once it has converged then memory 
typically follows within a couple of seconds.

Because convergence is not guaranteed, a 100 seconds latency 
time-out is used in the benchmark. If you see a 100 seconds 
result in the table it means that that particular NUMA kernel 
did not manage to converge that workload unit test within 100 
seconds.

The NxM notation denotes the process/thread relationship: a 1x4 
test is 1 process with 4 threads that share a workload - a 4x6 
test is 4 processes with 6 threads in each process, the 
processes isolated from each other but the threads within each 
process working on the same working set.

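For example, per the test table in the patch above, the 
' 4x6-convergence' row below corresponds to this invocation:

   perf bench numa mem -p 4 -t 6 -P 1020 -s 100 -zZ0qcm --thp 1
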
I used a wide set of test-cases I collected in the past:

                           [ Lower numbers are better. ]

 [test unit]            :   v3.7 |balancenuma-v10|  AutoNUMA-v28 |   numa-u-v3   |
------------------------------------------------------------------------------------------
 1x3-convergence        :  100.1 |         100.0 |           0.2 |           2.3 |  secs
 1x4-convergence        :  100.2 |         100.1 |         100.2 |           2.1 |  secs
 1x6-convergence        :  100.3 |         100.4 |         100.8 |           7.3 |  secs
 2x3-convergence        :  100.6 |         100.6 |         100.5 |           4.1 |  secs
 3x3-convergence        :  100.6 |         100.5 |         100.5 |           7.6 |  secs
 4x4-convergence        :  100.6 |         100.5 |           4.1 |           7.4 |  secs
 4x4-convergence-NOTHP  :  101.1 |         100.5 |          12.2 |           9.2 |  secs
 4x6-convergence        :    5.4 |         101.2 |          16.6 |          11.7 |  secs
 4x8-convergence        :  101.1 |         101.3 |           3.4 |           3.9 |  secs
 8x4-convergence        :  100.9 |         100.8 |          18.3 |           8.9 |  secs
 8x4-convergence-NOTHP  :  101.9 |         101.0 |          15.7 |          12.1 |  secs
 3x1-convergence        :    0.7 |           1.0 |           0.8 |           0.9 |  secs
 4x1-convergence        :    0.6 |           0.8 |           0.8 |           0.7 |  secs
 8x1-convergence        :    2.8 |           2.9 |           2.9 |           1.2 |  secs
 16x1-convergence       :    3.5 |           3.7 |           2.5 |           2.0 |  secs
 32x1-convergence       :    3.6 |           2.8 |           3.0 |           1.9 |  secs

As expected, mainline only manages to converge workloads where 
each worker process is isolated and the default 
spread-to-all-nodes scheduling policy creates an ideal layout, 
regardless of task ordering.

[ Note that the mainline kernel got a 'lucky strike' convergence 
  in the 4x6 workload: it's always possible for the workload
  to accidentally converge. On a repeat test this did not occur, 
  but I did not erase the outlier because luck is a valid and 
  existing phenomenon. ]

The 'balancenuma' kernel does not converge any of the workloads 
where worker threads or processes relate to each other.

AutoNUMA does pretty well, but it did not manage to converge in 
4 testcases of shared, under-loaded workloads.

The unified NUMA-v3 tree converged well in every testcase.


NUMA workload bandwidth measurements
------------------------------------

The other set of numbers I've collected are workload bandwidth 
measurements, run over 20 seconds. Using 20 seconds gives a 
healthy mix of pre-convergence and post-convergence bandwidth, 
giving the (non-trivial) expense of convergence and memory 
migration a weight in the result as well. So these are not 
'ideal' results with long runtimes where migration cost gets 
averaged out.
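
( The GB/sec figures follow directly from the raw data: in the 
  mainline 2x1-bw-process run quoted at the end of this mail, 
  each worker processed 63.351 GB in ~20.28 seconds, i.e. 
  63.351/20.280 ~= 3.124 GB/sec per thread - 6.248 GB/sec total 
  across the 2 workers, which is the number in the table. )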

[ The notation of the workloads is similar to the latency 
  measurements: for example "2x3" means 2 processes, 3 threads 
  per process. See the 'perf bench' tool for details. ]

The 'numa02' and 'numa01-THREAD' tests are AutoNUMA-benchmark 
work-alike workloads, with a shorter runtime for numa01.

The results are:

                           [ Higher numbers are better. ]

 [test unit]            :   v3.7 |balancenuma-v10|  AutoNUMA-v28 | numa-u-v3     |
------------------------------------------------------------------------------------------
 2x1-bw-process         :   6.248|  6.136:  -1.8%|  8.073:  29.2%|  9.647:  54.4%|  GB/sec
 3x1-bw-process         :   7.292|  7.250:  -0.6%| 12.583:  72.6%| 14.528:  99.2%|  GB/sec
 4x1-bw-process         :   6.007|  6.867:  14.3%| 12.313: 105.0%| 18.903: 214.7%|  GB/sec
 8x1-bw-process         :   6.100|  7.974:  30.7%| 20.237: 231.8%| 26.829: 339.8%|  GB/sec
 8x1-bw-process-NOTHP   :   5.944|  5.937:  -0.1%| 17.831: 200.0%| 22.237: 274.1%|  GB/sec
 16x1-bw-process        :   5.607|  5.592:  -0.3%|  5.959:   6.3%| 29.294: 422.5%|  GB/sec
 4x1-bw-thread          :   6.035| 13.598: 125.3%| 17.443: 189.0%| 19.290: 219.6%|  GB/sec
 8x1-bw-thread          :   5.941| 16.356: 175.3%| 22.433: 277.6%| 26.391: 344.2%|  GB/sec
 16x1-bw-thread         :   5.648| 24.608: 335.7%| 20.204: 257.7%| 29.557: 423.3%|  GB/sec
 32x1-bw-thread         :   5.929| 25.477: 329.7%| 18.230: 207.5%| 30.232: 409.9%|  GB/sec
 2x3-bw-thread          :   5.756|  8.785:  52.6%| 14.652: 154.6%| 15.327: 166.3%|  GB/sec
 4x4-bw-thread          :   5.605|  6.366:  13.6%|  9.835:  75.5%| 27.957: 398.8%|  GB/sec
 4x6-bw-thread          :   5.771|  6.287:   8.9%| 15.372: 166.4%| 27.877: 383.1%|  GB/sec
 4x8-bw-thread          :   5.858|  5.860:   0.0%| 11.865: 102.5%| 28.439: 385.5%|  GB/sec
 4x8-bw-thread-NOTHP    :   5.645|  6.167:   9.2%|  9.224:  63.4%| 25.067: 344.1%|  GB/sec
 3x3-bw-thread          :   5.937|  8.235:  38.7%|  6.635:  11.8%| 21.560: 263.1%|  GB/sec
 5x5-bw-thread          :   5.771|  5.762:  -0.2%|  9.575:  65.9%| 26.081: 351.9%|  GB/sec
 2x16-bw-thread         :   5.953|  5.920:  -0.6%|  5.945:  -0.1%| 23.269: 290.9%|  GB/sec
 1x32-bw-thread         :   5.879|  5.828:  -0.9%|  5.848:  -0.5%| 18.985: 222.9%|  GB/sec
 numa02-bw              :   6.049| 29.054: 380.3%| 24.744: 309.1%| 31.431: 419.6%|  GB/sec
 numa02-bw-NOTHP        :   5.850| 27.064: 362.6%| 20.415: 249.0%| 29.104: 397.5%|  GB/sec
 numa01-bw-thread       :   5.834| 20.338: 248.6%| 15.169: 160.0%| 28.607: 390.3%|  GB/sec
 numa01-bw-thread-NOTHP :   5.581| 18.528: 232.0%| 12.108: 117.0%| 21.119: 278.4%|  GB/sec
------------------------------------------------------------------------------------------

The first column shows mainline kernel bandwidth in GB/sec; the 
following 3 columns show pairs of GB/sec bandwidth and 
percentage results, where the percentage is the speed 
difference relative to the mainline kernel.
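
For example, in the 2x1-bw-process row, balancenuma's 6.136 
GB/sec versus mainline's 6.248 GB/sec works out to 
(6.136/6.248 - 1)*100 ~= -1.8%.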

Noise is 1-2% in these tests with these durations, so the good 
news is that none of the NUMA kernels regresses on these 
workloads against the mainline kernel. Perhaps balancenuma's 
"2x1-bw-process" and "3x1-bw-process" results might be worth a 
closer look.

No kernel shows particular vulnerability to the NOTHP tests that 
were mixed into the test stream.

As can be expected from the convergence latency results, the 
'balancenuma' tree does well with workloads where there's no 
relationship between threads - but even there it's outperformed 
by the AutoNUMA kernel, and outperformed by an even larger 
margin by the NUMA-v3 kernel. Workloads like the 4x JVM SPECjbb 
test, on the other hand, pose a challenge to the balancenuma 
kernel: both the AutoNUMA and the NUMA-v3 kernels are several 
times faster in those tests.

The AutoNUMA kernel does well in most workloads - its weak spot 
is system-wide shared workloads like 2x16-bw-thread and 
1x32-bw-thread, where it falls back to mainline performance.

The NUMA-v3 kernel outperforms every other NUMA kernel.

Here's a direct comparison between the two fastest kernels, the 
AutoNUMA and the NUMA-v3 kernels:


                        [ Higher numbers are better. ]

 [test unit]            :AutoNUMA| numa-u-v3     |
----------------------------------------------------------
 2x1-bw-process         :   8.073|  9.647:  19.5%|  GB/sec
 3x1-bw-process         :  12.583| 14.528:  15.5%|  GB/sec
 4x1-bw-process         :  12.313| 18.903:  53.5%|  GB/sec
 8x1-bw-process         :  20.237| 26.829:  32.6%|  GB/sec
 8x1-bw-process-NOTHP   :  17.831| 22.237:  24.7%|  GB/sec
 16x1-bw-process        :   5.959| 29.294: 391.6%|  GB/sec
 4x1-bw-thread          :  17.443| 19.290:  10.6%|  GB/sec
 8x1-bw-thread          :  22.433| 26.391:  17.6%|  GB/sec
 16x1-bw-thread         :  20.204| 29.557:  46.3%|  GB/sec
 32x1-bw-thread         :  18.230| 30.232:  65.8%|  GB/sec
 2x3-bw-thread          :  14.652| 15.327:   4.6%|  GB/sec
 4x4-bw-thread          :   9.835| 27.957: 184.3%|  GB/sec
 4x6-bw-thread          :  15.372| 27.877:  81.3%|  GB/sec
 4x8-bw-thread          :  11.865| 28.439: 139.7%|  GB/sec
 4x8-bw-thread-NOTHP    :   9.224| 25.067: 171.8%|  GB/sec
 3x3-bw-thread          :   6.635| 21.560: 224.9%|  GB/sec
 5x5-bw-thread          :   9.575| 26.081: 172.4%|  GB/sec
 2x16-bw-thread         :   5.945| 23.269: 291.4%|  GB/sec
 1x32-bw-thread         :   5.848| 18.985: 224.6%|  GB/sec
 numa02-bw              :  24.744| 31.431:  27.0%|  GB/sec
 numa02-bw-NOTHP        :  20.415| 29.104:  42.6%|  GB/sec
 numa01-bw-thread       :  15.169| 28.607:  88.6%|  GB/sec
 numa01-bw-thread-NOTHP :  12.108| 21.119:  74.4%|  GB/sec


NUMA workload "spread" measurements
-----------------------------------

A third, somewhat obscure category of measurements deals with 
the 'execution spread' between threads. Workloads that have to 
wait for the result of every thread before they can declare a 
result are directly limited by this spread.

The 'spread' is measured as the percentage difference between 
the slowest and fastest thread's execution time in a workload 
(a worked example follows the table):

                           [ Lower numbers are better. ]

 [test unit]            :   v3.7  |balancenuma-v10|  AutoNUMA-v28 |   numa-u-v3   |
------------------------------------------------------------------------------------------
 RAM-bw-local           :    0.0% |          0.0% |          0.0% |          0.0% |  %
 RAM-bw-local-NOTHP     :    0.2% |          0.2% |          0.2% |          0.2% |  %
 RAM-bw-remote          :    0.0% |          0.0% |          0.0% |          0.0% |  %
 RAM-bw-local-2x        :    0.3% |          0.0% |          0.2% |          0.3% |  %
 RAM-bw-remote-2x       :    0.0% |          0.2% |          0.0% |          0.2% |  %
 RAM-bw-cross           :    0.4% |          0.2% |          0.0% |          0.1% |  %
 2x1-bw-process         :    0.5% |          0.2% |          0.2% |          0.2% |  %
 3x1-bw-process         :    0.6% |          0.2% |          0.2% |          0.1% |  %
 4x1-bw-process         :    0.4% |          0.8% |          0.2% |          0.3% |  %
 8x1-bw-process         :    0.8% |          0.1% |          0.2% |          0.2% |  %
 8x1-bw-process-NOTHP   :    0.9% |          0.7% |          0.4% |          0.5% |  %
 16x1-bw-process        :    1.0% |          0.9% |          0.6% |          0.1% |  %
 4x1-bw-thread          :    0.1% |          0.1% |          0.1% |          0.1% |  %
 8x1-bw-thread          :    0.2% |          0.1% |          0.1% |          0.2% |  %
 16x1-bw-thread         :    0.3% |          0.1% |          0.1% |          0.1% |  %
 32x1-bw-thread         :    0.3% |          0.1% |          0.1% |          0.1% |  %
 2x3-bw-thread          :    0.4% |          0.3% |          0.3% |          0.3% |  %
 4x4-bw-thread          :    2.3% |          1.4% |          0.8% |          0.4% |  %
 4x6-bw-thread          :    2.5% |          2.2% |          1.0% |          0.6% |  %
 4x8-bw-thread          :    3.9% |          3.7% |          1.3% |          0.9% |  %
 4x8-bw-thread-NOTHP    :    6.0% |          2.5% |          1.5% |          1.0% |  %
 3x3-bw-thread          :    0.5% |          0.4% |          0.5% |          0.3% |  %
 5x5-bw-thread          :    1.8% |          2.7% |          1.3% |          0.7% |  %
 2x16-bw-thread         :    3.7% |          4.1% |          3.6% |          1.1% |  %
 1x32-bw-thread         :    2.9% |          7.3% |          3.5% |          4.4% |  %
 numa02-bw              :    0.1% |          0.0% |          0.1% |          0.1% |  %
 numa02-bw-NOTHP        :    0.4% |          0.3% |          0.3% |          0.3% |  %
 numa01-bw-thread       :    1.3% |          0.4% |          0.3% |          0.3% |  %
 numa01-bw-thread-NOTHP :    1.8% |          0.8% |          0.8% |          0.9% |  %
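
( To make the metric concrete: judging by the raw data, the 
  spread is computed as half of the max-min runtime delta, 
  relative to the slowest thread. E.g. for the mainline 
  4x6-convergence run quoted below, (5.444 - 2.853)/2 / 5.444 
  * 100 ~= 23.8%, matching the reported 23.794% 
  spread-runtime/thread value. )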

The spread results are pretty good overall, because the runs 
were relatively short, with 20 seconds of runtime.

Both mainline and balancenuma have trouble with the spread of 
shared workloads - possibly signalling memory allocation 
asymmetries. Longer - 60 seconds or more - runs of the key 
workloads would certainly be informative there.

NOTHP (4K ptes) increases the spread and non-determinism of 
every NUMA kernel.

The AutoNUMA and NUMA-v3 kernels have the lowest spread, 
signalling stable NUMA convergence in most scenarios.

Finally, below is the (long!) dump of all the raw data, in case 
someone wants to double-check my results. The 'perf bench' tool 
can be used to double-check the measurements on other systems.
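
( The raw output is comma-separated, so it is easy to 
  post-process - for instance, to pull the total-speed results 
  out of a saved log (file name hypothetical):

     grep 'total-speed' bench.log | awk -F, '{ printf "%-24s %8.3f GB/sec\n", $1, $2 }'
 )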

Thanks,

	Ingo

-------------------->

Here are the exact kernel versions used:

 # kernel 1: {v3.7-rc8-18a2f371f5ed}
 # kernel 2: {balancenuma-v10}
 # kernel 3: {autonuma-v28-c4bba428cc5c}
 # kernel 4: {numa/base-v3}

-------------------->

 #
 # Running test on: Linux vega 3.7.0-rc8+ #3 SMP Fri Dec 7 18:29:16 CET 2012 x86_64 x86_64 x86_64 GNU/Linux
 #
# Running numa/mem benchmark...

 # Running main, "perf bench numa mem -a"

 # Running RAM-bw-local, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp  1 --no-data_rand_walk"
 RAM-bw-local,                           20.111, secs,           runtime-max/thread
 RAM-bw-local,                           20.106, secs,           runtime-min/thread
 RAM-bw-local,                           20.106, secs,           runtime-avg/thread
 RAM-bw-local,                            0.013, %,              spread-runtime/thread
 RAM-bw-local,                          169.651, GB,             data/thread
 RAM-bw-local,                          169.651, GB,             data-total
 RAM-bw-local,                            0.119, nsecs,          runtime/byte/thread
 RAM-bw-local,                            8.436, GB/sec,         thread-speed
 RAM-bw-local,                            8.436, GB/sec,         total-speed

 # Running RAM-bw-local-NOTHP, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp  1 --no-data_rand_walk --thp -1"
 RAM-bw-local-NOTHP,                     20.125, secs,           runtime-max/thread
 RAM-bw-local-NOTHP,                     20.050, secs,           runtime-min/thread
 RAM-bw-local-NOTHP,                     20.050, secs,           runtime-avg/thread
 RAM-bw-local-NOTHP,                      0.187, %,              spread-runtime/thread
 RAM-bw-local-NOTHP,                    169.651, GB,             data/thread
 RAM-bw-local-NOTHP,                    169.651, GB,             data-total
 RAM-bw-local-NOTHP,                      0.119, nsecs,          runtime/byte/thread
 RAM-bw-local-NOTHP,                      8.430, GB/sec,         thread-speed
 RAM-bw-local-NOTHP,                      8.430, GB/sec,         total-speed

 # Running RAM-bw-remote, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 1 -s 20 -zZq --thp  1 --no-data_rand_walk"
 RAM-bw-remote,                          20.141, secs,           runtime-max/thread
 RAM-bw-remote,                          20.134, secs,           runtime-min/thread
 RAM-bw-remote,                          20.134, secs,           runtime-avg/thread
 RAM-bw-remote,                           0.017, %,              spread-runtime/thread
 RAM-bw-remote,                         135.291, GB,             data/thread
 RAM-bw-remote,                         135.291, GB,             data-total
 RAM-bw-remote,                           0.149, nsecs,          runtime/byte/thread
 RAM-bw-remote,                           6.717, GB/sec,         thread-speed
 RAM-bw-remote,                           6.717, GB/sec,         total-speed

 # Running RAM-bw-local-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 0x2 -s 20 -zZq --thp  1 --no-data_rand_walk"
 RAM-bw-local-2x,                        20.128, secs,           runtime-max/thread
 RAM-bw-local-2x,                        20.006, secs,           runtime-min/thread
 RAM-bw-local-2x,                        20.064, secs,           runtime-avg/thread
 RAM-bw-local-2x,                         0.302, %,              spread-runtime/thread
 RAM-bw-local-2x,                       132.607, GB,             data/thread
 RAM-bw-local-2x,                       265.214, GB,             data-total
 RAM-bw-local-2x,                         0.152, nsecs,          runtime/byte/thread
 RAM-bw-local-2x,                         6.588, GB/sec,         thread-speed
 RAM-bw-local-2x,                        13.177, GB/sec,         total-speed

 # Running RAM-bw-remote-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 1x2 -s 20 -zZq --thp  1 --no-data_rand_walk"
 RAM-bw-remote-2x,                       20.102, secs,           runtime-max/thread
 RAM-bw-remote-2x,                       20.094, secs,           runtime-min/thread
 RAM-bw-remote-2x,                       20.094, secs,           runtime-avg/thread
 RAM-bw-remote-2x,                        0.021, %,              spread-runtime/thread
 RAM-bw-remote-2x,                       74.088, GB,             data/thread
 RAM-bw-remote-2x,                      148.176, GB,             data-total
 RAM-bw-remote-2x,                        0.271, nsecs,          runtime/byte/thread
 RAM-bw-remote-2x,                        3.686, GB/sec,         thread-speed
 RAM-bw-remote-2x,                        7.371, GB/sec,         total-speed

 # Running RAM-bw-cross, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,8 -M 1,0 -s 20 -zZq --thp  1 --no-data_rand_walk"
 RAM-bw-cross,                           20.159, secs,           runtime-max/thread
 RAM-bw-cross,                           20.011, secs,           runtime-min/thread
 RAM-bw-cross,                           20.081, secs,           runtime-avg/thread
 RAM-bw-cross,                            0.369, %,              spread-runtime/thread
 RAM-bw-cross,                          122.407, GB,             data/thread
 RAM-bw-cross,                          244.813, GB,             data-total
 RAM-bw-cross,                            0.165, nsecs,          runtime/byte/thread
 RAM-bw-cross,                            6.072, GB/sec,         thread-speed
 RAM-bw-cross,                           12.144, GB/sec,         total-speed

 # Running  1x3-convergence, "perf bench numa mem -p 1 -t 3 -P 512 -s 100 -zZ0qcm --thp  1"
  1x3-convergence,                      100.103, secs,           NUMA-convergence-latency
  1x3-convergence,                      100.103, secs,           runtime-max/thread
  1x3-convergence,                      100.082, secs,           runtime-min/thread
  1x3-convergence,                      100.093, secs,           runtime-avg/thread
  1x3-convergence,                        0.010, %,              spread-runtime/thread
  1x3-convergence,                      278.636, GB,             data/thread
  1x3-convergence,                      835.908, GB,             data-total
  1x3-convergence,                        0.359, nsecs,          runtime/byte/thread
  1x3-convergence,                        2.784, GB/sec,         thread-speed
  1x3-convergence,                        8.351, GB/sec,         total-speed

 # Running  1x4-convergence, "perf bench numa mem -p 1 -t 4 -P 512 -s 100 -zZ0qcm --thp  1"
  1x4-convergence,                      100.211, secs,           NUMA-convergence-latency
  1x4-convergence,                      100.211, secs,           runtime-max/thread
  1x4-convergence,                      100.070, secs,           runtime-min/thread
  1x4-convergence,                      100.140, secs,           runtime-avg/thread
  1x4-convergence,                        0.070, %,              spread-runtime/thread
  1x4-convergence,                      154.887, GB,             data/thread
  1x4-convergence,                      619.549, GB,             data-total
  1x4-convergence,                        0.647, nsecs,          runtime/byte/thread
  1x4-convergence,                        1.546, GB/sec,         thread-speed
  1x4-convergence,                        6.182, GB/sec,         total-speed

 # Running  1x6-convergence, "perf bench numa mem -p 1 -t 6 -P 1020 -s 100 -zZ0qcm --thp  1"
  1x6-convergence,                      100.343, secs,           NUMA-convergence-latency
  1x6-convergence,                      100.343, secs,           runtime-max/thread
  1x6-convergence,                      100.235, secs,           runtime-min/thread
  1x6-convergence,                      100.303, secs,           runtime-avg/thread
  1x6-convergence,                        0.054, %,              spread-runtime/thread
  1x6-convergence,                       95.725, GB,             data/thread
  1x6-convergence,                      574.347, GB,             data-total
  1x6-convergence,                        1.048, nsecs,          runtime/byte/thread
  1x6-convergence,                        0.954, GB/sec,         thread-speed
  1x6-convergence,                        5.724, GB/sec,         total-speed

 # Running  2x3-convergence, "perf bench numa mem -p 3 -t 3 -P 1020 -s 100 -zZ0qcm --thp  1"
  2x3-convergence,                      100.601, secs,           NUMA-convergence-latency
  2x3-convergence,                      100.601, secs,           runtime-max/thread
  2x3-convergence,                      100.054, secs,           runtime-min/thread
  2x3-convergence,                      100.307, secs,           runtime-avg/thread
  2x3-convergence,                        0.272, %,              spread-runtime/thread
  2x3-convergence,                       65.837, GB,             data/thread
  2x3-convergence,                      592.529, GB,             data-total
  2x3-convergence,                        1.528, nsecs,          runtime/byte/thread
  2x3-convergence,                        0.654, GB/sec,         thread-speed
  2x3-convergence,                        5.890, GB/sec,         total-speed

 # Running  3x3-convergence, "perf bench numa mem -p 3 -t 3 -P 1020 -s 100 -zZ0qcm --thp  1"
  3x3-convergence,                      100.572, secs,           NUMA-convergence-latency
  3x3-convergence,                      100.572, secs,           runtime-max/thread
  3x3-convergence,                      100.095, secs,           runtime-min/thread
  3x3-convergence,                      100.330, secs,           runtime-avg/thread
  3x3-convergence,                        0.238, %,              spread-runtime/thread
  3x3-convergence,                       65.837, GB,             data/thread
  3x3-convergence,                      592.529, GB,             data-total
  3x3-convergence,                        1.528, nsecs,          runtime/byte/thread
  3x3-convergence,                        0.655, GB/sec,         thread-speed
  3x3-convergence,                        5.892, GB/sec,         total-speed

 # Running  4x4-convergence, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp  1"
  4x4-convergence,                      100.571, secs,           NUMA-convergence-latency
  4x4-convergence,                      100.571, secs,           runtime-max/thread
  4x4-convergence,                      100.122, secs,           runtime-min/thread
  4x4-convergence,                      100.386, secs,           runtime-avg/thread
  4x4-convergence,                        0.223, %,              spread-runtime/thread
  4x4-convergence,                       35.266, GB,             data/thread
  4x4-convergence,                      564.251, GB,             data-total
  4x4-convergence,                        2.852, nsecs,          runtime/byte/thread
  4x4-convergence,                        0.351, GB/sec,         thread-speed
  4x4-convergence,                        5.610, GB/sec,         total-speed

 # Running  4x4-convergence-NOTHP, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp  1 --thp -1"
  4x4-convergence-NOTHP,                101.051, secs,           NUMA-convergence-latency
  4x4-convergence-NOTHP,                101.051, secs,           runtime-max/thread
  4x4-convergence-NOTHP,                100.066, secs,           runtime-min/thread
  4x4-convergence-NOTHP,                100.683, secs,           runtime-avg/thread
  4x4-convergence-NOTHP,                  0.487, %,              spread-runtime/thread
  4x4-convergence-NOTHP,                 35.769, GB,             data/thread
  4x4-convergence-NOTHP,                572.304, GB,             data-total
  4x4-convergence-NOTHP,                  2.825, nsecs,          runtime/byte/thread
  4x4-convergence-NOTHP,                  0.354, GB/sec,         thread-speed
  4x4-convergence-NOTHP,                  5.664, GB/sec,         total-speed

 # Running  4x6-convergence, "perf bench numa mem -p 4 -t 6 -P 1020 -s 100 -zZ0qcm --thp  1"
  4x6-convergence,                        5.444, secs,           NUMA-convergence-latency
  4x6-convergence,                        5.444, secs,           runtime-max/thread
  4x6-convergence,                        2.853, secs,           runtime-min/thread
  4x6-convergence,                        4.531, secs,           runtime-avg/thread
  4x6-convergence,                       23.794, %,              spread-runtime/thread
  4x6-convergence,                        1.292, GB,             data/thread
  4x6-convergence,                       31.017, GB,             data-total
  4x6-convergence,                        4.212, nsecs,          runtime/byte/thread
  4x6-convergence,                        0.237, GB/sec,         thread-speed
  4x6-convergence,                        5.698, GB/sec,         total-speed

 # Running  4x8-convergence, "perf bench numa mem -p 4 -t 8 -P 512 -s 100 -zZ0qcm --thp  1"
  4x8-convergence,                      101.133, secs,           NUMA-convergence-latency
  4x8-convergence,                      101.133, secs,           runtime-max/thread
  4x8-convergence,                      100.455, secs,           runtime-min/thread
  4x8-convergence,                      100.803, secs,           runtime-avg/thread
  4x8-convergence,                        0.335, %,              spread-runtime/thread
  4x8-convergence,                       18.522, GB,             data/thread
  4x8-convergence,                      592.705, GB,             data-total
  4x8-convergence,                        5.460, nsecs,          runtime/byte/thread
  4x8-convergence,                        0.183, GB/sec,         thread-speed
  4x8-convergence,                        5.861, GB/sec,         total-speed

 # Running  8x4-convergence, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp  1"
  8x4-convergence,                      100.878, secs,           NUMA-convergence-latency
  8x4-convergence,                      100.878, secs,           runtime-max/thread
  8x4-convergence,                      100.021, secs,           runtime-min/thread
  8x4-convergence,                      100.567, secs,           runtime-avg/thread
  8x4-convergence,                        0.425, %,              spread-runtime/thread
  8x4-convergence,                       18.388, GB,             data/thread
  8x4-convergence,                      588.411, GB,             data-total
  8x4-convergence,                        5.486, nsecs,          runtime/byte/thread
  8x4-convergence,                        0.182, GB/sec,         thread-speed
  8x4-convergence,                        5.833, GB/sec,         total-speed

 # Running  8x4-convergence-NOTHP, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp  1 --thp -1"
  8x4-convergence-NOTHP,                101.868, secs,           NUMA-convergence-latency
  8x4-convergence-NOTHP,                101.868, secs,           runtime-max/thread
  8x4-convergence-NOTHP,                100.499, secs,           runtime-min/thread
  8x4-convergence-NOTHP,                101.118, secs,           runtime-avg/thread
  8x4-convergence-NOTHP,                  0.672, %,              spread-runtime/thread
  8x4-convergence-NOTHP,                 17.851, GB,             data/thread
  8x4-convergence-NOTHP,                571.231, GB,             data-total
  8x4-convergence-NOTHP,                  5.707, nsecs,          runtime/byte/thread
  8x4-convergence-NOTHP,                  0.175, GB/sec,         thread-speed
  8x4-convergence-NOTHP,                  5.608, GB/sec,         total-speed

 # Running  3x1-convergence, "perf bench numa mem -p 3 -t 1 -P 512 -s 100 -zZ0qcm --thp  1"
  3x1-convergence,                        0.652, secs,           NUMA-convergence-latency
  3x1-convergence,                        0.652, secs,           runtime-max/thread
  3x1-convergence,                        0.471, secs,           runtime-min/thread
  3x1-convergence,                        0.584, secs,           runtime-avg/thread
  3x1-convergence,                       13.878, %,              spread-runtime/thread
  3x1-convergence,                        1.432, GB,             data/thread
  3x1-convergence,                        4.295, GB,             data-total
  3x1-convergence,                        0.456, nsecs,          runtime/byte/thread
  3x1-convergence,                        2.195, GB/sec,         thread-speed
  3x1-convergence,                        6.584, GB/sec,         total-speed

 # Running  4x1-convergence, "perf bench numa mem -p 4 -t 1 -P 512 -s 100 -zZ0qcm --thp  1"
  4x1-convergence,                        0.643, secs,           NUMA-convergence-latency
  4x1-convergence,                        0.643, secs,           runtime-max/thread
  4x1-convergence,                        0.479, secs,           runtime-min/thread
  4x1-convergence,                        0.562, secs,           runtime-avg/thread
  4x1-convergence,                       12.750, %,              spread-runtime/thread
  4x1-convergence,                        1.074, GB,             data/thread
  4x1-convergence,                        4.295, GB,             data-total
  4x1-convergence,                        0.599, nsecs,          runtime/byte/thread
  4x1-convergence,                        1.669, GB/sec,         thread-speed
  4x1-convergence,                        6.677, GB/sec,         total-speed

 # Running  8x1-convergence, "perf bench numa mem -p 8 -t 1 -P 512 -s 100 -zZ0qcm --thp  1"
  8x1-convergence,                        2.803, secs,           NUMA-convergence-latency
  8x1-convergence,                        2.803, secs,           runtime-max/thread
  8x1-convergence,                        2.509, secs,           runtime-min/thread
  8x1-convergence,                        2.664, secs,           runtime-avg/thread
  8x1-convergence,                        5.250, %,              spread-runtime/thread
  8x1-convergence,                        2.147, GB,             data/thread
  8x1-convergence,                       17.180, GB,             data-total
  8x1-convergence,                        1.305, nsecs,          runtime/byte/thread
  8x1-convergence,                        0.766, GB/sec,         thread-speed
  8x1-convergence,                        6.129, GB/sec,         total-speed

 # Running 16x1-convergence, "perf bench numa mem -p 16 -t 1 -P 256 -s 100 -zZ0qcm --thp  1"
 16x1-convergence,                        3.482, secs,           NUMA-convergence-latency
 16x1-convergence,                        3.482, secs,           runtime-max/thread
 16x1-convergence,                        3.162, secs,           runtime-min/thread
 16x1-convergence,                        3.328, secs,           runtime-avg/thread
 16x1-convergence,                        4.603, %,              spread-runtime/thread
 16x1-convergence,                        1.242, GB,             data/thread
 16x1-convergence,                       19.864, GB,             data-total
 16x1-convergence,                        2.805, nsecs,          runtime/byte/thread
 16x1-convergence,                        0.357, GB/sec,         thread-speed
 16x1-convergence,                        5.704, GB/sec,         total-speed

 # Running 32x1-convergence, "perf bench numa mem -p 32 -t 1 -P 128 -s 100 -zZ0qcm --thp  1"
 32x1-convergence,                        3.612, secs,           NUMA-convergence-latency
 32x1-convergence,                        3.612, secs,           runtime-max/thread
 32x1-convergence,                        3.170, secs,           runtime-min/thread
 32x1-convergence,                        3.456, secs,           runtime-avg/thread
 32x1-convergence,                        6.118, %,              spread-runtime/thread
 32x1-convergence,                        0.671, GB,             data/thread
 32x1-convergence,                       21.475, GB,             data-total
 32x1-convergence,                        5.382, nsecs,          runtime/byte/thread
 32x1-convergence,                        0.186, GB/sec,         thread-speed
 32x1-convergence,                        5.945, GB/sec,         total-speed

 # Running  2x1-bw-process, "perf bench numa mem -p 2 -t 1 -P 1024 -s 20 -zZ0q --thp  1"
  2x1-bw-process,                        20.280, secs,           runtime-max/thread
  2x1-bw-process,                        20.059, secs,           runtime-min/thread
  2x1-bw-process,                        20.166, secs,           runtime-avg/thread
  2x1-bw-process,                         0.546, %,              spread-runtime/thread
  2x1-bw-process,                        63.351, GB,             data/thread
  2x1-bw-process,                       126.702, GB,             data-total
  2x1-bw-process,                         0.320, nsecs,          runtime/byte/thread
  2x1-bw-process,                         3.124, GB/sec,         thread-speed
  2x1-bw-process,                         6.248, GB/sec,         total-speed

 # Running  3x1-bw-process, "perf bench numa mem -p 3 -t 1 -P 1024 -s 20 -zZ0q --thp  1"
  3x1-bw-process,                        20.320, secs,           runtime-max/thread
  3x1-bw-process,                        20.078, secs,           runtime-min/thread
  3x1-bw-process,                        20.202, secs,           runtime-avg/thread
  3x1-bw-process,                         0.595, %,              spread-runtime/thread
  3x1-bw-process,                        49.392, GB,             data/thread
  3x1-bw-process,                       148.176, GB,             data-total
  3x1-bw-process,                         0.411, nsecs,          runtime/byte/thread
  3x1-bw-process,                         2.431, GB/sec,         thread-speed
  3x1-bw-process,                         7.292, GB/sec,         total-speed

 # Running  4x1-bw-process, "perf bench numa mem -p 4 -t 1 -P 1024 -s 20 -zZ0q --thp  1"
  4x1-bw-process,                        20.379, secs,           runtime-max/thread
  4x1-bw-process,                        20.210, secs,           runtime-min/thread
  4x1-bw-process,                        20.291, secs,           runtime-avg/thread
  4x1-bw-process,                         0.413, %,              spread-runtime/thread
  4x1-bw-process,                        30.602, GB,             data/thread
  4x1-bw-process,                       122.407, GB,             data-total
  4x1-bw-process,                         0.666, nsecs,          runtime/byte/thread
  4x1-bw-process,                         1.502, GB/sec,         thread-speed
  4x1-bw-process,                         6.007, GB/sec,         total-speed

 # Running  8x1-bw-process, "perf bench numa mem -p 8 -t 1 -P  512 -s 20 -zZ0q --thp  1"
  8x1-bw-process,                        20.419, secs,           runtime-max/thread
  8x1-bw-process,                        20.073, secs,           runtime-min/thread
  8x1-bw-process,                        20.328, secs,           runtime-avg/thread
  8x1-bw-process,                         0.848, %,              spread-runtime/thread
  8x1-bw-process,                        15.569, GB,             data/thread
  8x1-bw-process,                       124.554, GB,             data-total
  8x1-bw-process,                         1.311, nsecs,          runtime/byte/thread
  8x1-bw-process,                         0.762, GB/sec,         thread-speed
  8x1-bw-process,                         6.100, GB/sec,         total-speed

 # Running  8x1-bw-process-NOTHP, "perf bench numa mem -p 8 -t 1 -P  512 -s 20 -zZ0q --thp  1 --thp -1"
  8x1-bw-process-NOTHP,                  20.502, secs,           runtime-max/thread
  8x1-bw-process-NOTHP,                  20.113, secs,           runtime-min/thread
  8x1-bw-process-NOTHP,                  20.307, secs,           runtime-avg/thread
  8x1-bw-process-NOTHP,                   0.950, %,              spread-runtime/thread
  8x1-bw-process-NOTHP,                  15.234, GB,             data/thread
  8x1-bw-process-NOTHP,                 121.870, GB,             data-total
  8x1-bw-process-NOTHP,                   1.346, nsecs,          runtime/byte/thread
  8x1-bw-process-NOTHP,                   0.743, GB/sec,         thread-speed
  8x1-bw-process-NOTHP,                   5.944, GB/sec,         total-speed

 # Running 16x1-bw-process, "perf bench numa mem -p 16 -t 1 -P 256 -s 20 -zZ0q --thp  1"
 16x1-bw-process,                        20.539, secs,           runtime-max/thread
 16x1-bw-process,                        20.145, secs,           runtime-min/thread
 16x1-bw-process,                        20.407, secs,           runtime-avg/thread
 16x1-bw-process,                         0.959, %,              spread-runtime/thread
 16x1-bw-process,                         7.197, GB,             data/thread
 16x1-bw-process,                       115.159, GB,             data-total
 16x1-bw-process,                         2.854, nsecs,          runtime/byte/thread
 16x1-bw-process,                         0.350, GB/sec,         thread-speed
 16x1-bw-process,                         5.607, GB/sec,         total-speed

 # Running  4x1-bw-thread, "perf bench numa mem -p 1 -t 4 -T 256 -s 20 -zZ0q --thp  1"
  4x1-bw-thread,                         20.105, secs,           runtime-max/thread
  4x1-bw-thread,                         20.047, secs,           runtime-min/thread
  4x1-bw-thread,                         20.071, secs,           runtime-avg/thread
  4x1-bw-thread,                          0.144, %,              spread-runtime/thread
  4x1-bw-thread,                         30.333, GB,             data/thread
  4x1-bw-thread,                        121.333, GB,             data-total
  4x1-bw-thread,                          0.663, nsecs,          runtime/byte/thread
  4x1-bw-thread,                          1.509, GB/sec,         thread-speed
  4x1-bw-thread,                          6.035, GB/sec,         total-speed

 # Running  8x1-bw-thread, "perf bench numa mem -p 1 -t 8 -T 256 -s 20 -zZ0q --thp  1"
  8x1-bw-thread,                         20.106, secs,           runtime-max/thread
  8x1-bw-thread,                         20.021, secs,           runtime-min/thread
  8x1-bw-thread,                         20.062, secs,           runtime-avg/thread
  8x1-bw-thread,                          0.213, %,              spread-runtime/thread
  8x1-bw-thread,                         14.932, GB,             data/thread
  8x1-bw-thread,                        119.454, GB,             data-total
  8x1-bw-thread,                          1.347, nsecs,          runtime/byte/thread
  8x1-bw-thread,                          0.743, GB/sec,         thread-speed
  8x1-bw-thread,                          5.941, GB/sec,         total-speed

 # Running 16x1-bw-thread, "perf bench numa mem -p 1 -t 16 -T 128 -s 20 -zZ0q --thp  1"
 16x1-bw-thread,                         20.176, secs,           runtime-max/thread
 16x1-bw-thread,                         20.049, secs,           runtime-min/thread
 16x1-bw-thread,                         20.125, secs,           runtime-avg/thread
 16x1-bw-thread,                          0.314, %,              spread-runtime/thread
 16x1-bw-thread,                          7.122, GB,             data/thread
 16x1-bw-thread,                        113.951, GB,             data-total
 16x1-bw-thread,                          2.833, nsecs,          runtime/byte/thread
 16x1-bw-thread,                          0.353, GB/sec,         thread-speed
 16x1-bw-thread,                          5.648, GB/sec,         total-speed

 # Running 32x1-bw-thread, "perf bench numa mem -p 1 -t 32 -T 64 -s 20 -zZ0q --thp  1"
 32x1-bw-thread,                         20.159, secs,           runtime-max/thread
 32x1-bw-thread,                         20.034, secs,           runtime-min/thread
 32x1-bw-thread,                         20.120, secs,           runtime-avg/thread
 32x1-bw-thread,                          0.309, %,              spread-runtime/thread
 32x1-bw-thread,                          3.735, GB,             data/thread
 32x1-bw-thread,                        119.521, GB,             data-total
 32x1-bw-thread,                          5.397, nsecs,          runtime/byte/thread
 32x1-bw-thread,                          0.185, GB/sec,         thread-speed
 32x1-bw-thread,                          5.929, GB/sec,         total-speed

 # Running  2x3-bw-thread, "perf bench numa mem -p 2 -t 3 -P 512 -s 20 -zZ0q --thp  1"
  2x3-bw-thread,                         20.239, secs,           runtime-max/thread
  2x3-bw-thread,                         20.092, secs,           runtime-min/thread
  2x3-bw-thread,                         20.183, secs,           runtime-avg/thread
  2x3-bw-thread,                          0.363, %,              spread-runtime/thread
  2x3-bw-thread,                         19.417, GB,             data/thread
  2x3-bw-thread,                        116.501, GB,             data-total
  2x3-bw-thread,                          1.042, nsecs,          runtime/byte/thread
  2x3-bw-thread,                          0.959, GB/sec,         thread-speed
  2x3-bw-thread,                          5.756, GB/sec,         total-speed

 # Running  4x4-bw-thread, "perf bench numa mem -p 4 -t 4 -P 512 -s 20 -zZ0q --thp  1"
  4x4-bw-thread,                         20.978, secs,           runtime-max/thread
  4x4-bw-thread,                         20.005, secs,           runtime-min/thread
  4x4-bw-thread,                         20.576, secs,           runtime-avg/thread
  4x4-bw-thread,                          2.321, %,              spread-runtime/thread
  4x4-bw-thread,                          7.348, GB,             data/thread
  4x4-bw-thread,                        117.575, GB,             data-total
  4x4-bw-thread,                          2.855, nsecs,          runtime/byte/thread
  4x4-bw-thread,                          0.350, GB/sec,         thread-speed
  4x4-bw-thread,                          5.605, GB/sec,         total-speed

 # Running  4x6-bw-thread, "perf bench numa mem -p 4 -t 6 -P 512 -s 20 -zZ0q --thp  1"
  4x6-bw-thread,                         21.118, secs,           runtime-max/thread
  4x6-bw-thread,                         20.082, secs,           runtime-min/thread
  4x6-bw-thread,                         20.819, secs,           runtime-avg/thread
  4x6-bw-thread,                          2.451, %,              spread-runtime/thread
  4x6-bw-thread,                          5.078, GB,             data/thread
  4x6-bw-thread,                        121.870, GB,             data-total
  4x6-bw-thread,                          4.159, nsecs,          runtime/byte/thread
  4x6-bw-thread,                          0.240, GB/sec,         thread-speed
  4x6-bw-thread,                          5.771, GB/sec,         total-speed

 # Running  4x8-bw-thread, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp  1"
  4x8-bw-thread,                         21.994, secs,           runtime-max/thread
  4x8-bw-thread,                         20.290, secs,           runtime-min/thread
  4x8-bw-thread,                         21.387, secs,           runtime-avg/thread
  4x8-bw-thread,                          3.874, %,              spread-runtime/thread
  4x8-bw-thread,                          4.027, GB,             data/thread
  4x8-bw-thread,                        128.849, GB,             data-total
  4x8-bw-thread,                          5.462, nsecs,          runtime/byte/thread
  4x8-bw-thread,                          0.183, GB/sec,         thread-speed
  4x8-bw-thread,                          5.858, GB/sec,         total-speed

 # Running  4x8-bw-thread-NOTHP, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp  1 --thp -1"
  4x8-bw-thread-NOTHP,                   22.728, secs,           runtime-max/thread
  4x8-bw-thread-NOTHP,                   20.013, secs,           runtime-min/thread
  4x8-bw-thread-NOTHP,                   21.968, secs,           runtime-avg/thread
  4x8-bw-thread-NOTHP,                    5.975, %,              spread-runtime/thread
  4x8-bw-thread-NOTHP,                    4.010, GB,             data/thread
  4x8-bw-thread-NOTHP,                  128.312, GB,             data-total
  4x8-bw-thread-NOTHP,                    5.668, nsecs,          runtime/byte/thread
  4x8-bw-thread-NOTHP,                    0.176, GB/sec,         thread-speed
  4x8-bw-thread-NOTHP,                    5.645, GB/sec,         total-speed

 # Running  3x3-bw-thread, "perf bench numa mem -p 3 -t 3 -P 512 -s 20 -zZ0q --thp  1"
  3x3-bw-thread,                         20.526, secs,           runtime-max/thread
  3x3-bw-thread,                         20.317, secs,           runtime-min/thread
  3x3-bw-thread,                         20.467, secs,           runtime-avg/thread
  3x3-bw-thread,                          0.510, %,              spread-runtime/thread
  3x3-bw-thread,                         13.541, GB,             data/thread
  3x3-bw-thread,                        121.870, GB,             data-total
  3x3-bw-thread,                          1.516, nsecs,          runtime/byte/thread
  3x3-bw-thread,                          0.660, GB/sec,         thread-speed
  3x3-bw-thread,                          5.937, GB/sec,         total-speed

 # Running  5x5-bw-thread, "perf bench numa mem -p 5 -t 5 -P 512 -s 20 -zZ0q --thp  1"
  5x5-bw-thread,                         21.023, secs,           runtime-max/thread
  5x5-bw-thread,                         20.252, secs,           runtime-min/thread
  5x5-bw-thread,                         20.701, secs,           runtime-avg/thread
  5x5-bw-thread,                          1.833, %,              spread-runtime/thread
  5x5-bw-thread,                          4.853, GB,             data/thread
  5x5-bw-thread,                        121.333, GB,             data-total
  5x5-bw-thread,                          4.332, nsecs,          runtime/byte/thread
  5x5-bw-thread,                          0.231, GB/sec,         thread-speed
  5x5-bw-thread,                          5.771, GB/sec,         total-speed

 # Running 2x16-bw-thread, "perf bench numa mem -p 2 -t 16 -P 512 -s 20 -zZ0q --thp  1"
 2x16-bw-thread,                         21.646, secs,           runtime-max/thread
 2x16-bw-thread,                         20.065, secs,           runtime-min/thread
 2x16-bw-thread,                         21.026, secs,           runtime-avg/thread
 2x16-bw-thread,                          3.652, %,              spread-runtime/thread
 2x16-bw-thread,                          4.027, GB,             data/thread
 2x16-bw-thread,                        128.849, GB,             data-total
 2x16-bw-thread,                          5.376, nsecs,          runtime/byte/thread
 2x16-bw-thread,                          0.186, GB/sec,         thread-speed
 2x16-bw-thread,                          5.953, GB/sec,         total-speed

 # Running 1x32-bw-thread, "perf bench numa mem -p 1 -t 32 -P 2048 -s 20 -zZ0q --thp  1"
 1x32-bw-thread,                         23.377, secs,           runtime-max/thread
 1x32-bw-thread,                         22.030, secs,           runtime-min/thread
 1x32-bw-thread,                         22.936, secs,           runtime-avg/thread
 1x32-bw-thread,                          2.881, %,              spread-runtime/thread
 1x32-bw-thread,                          4.295, GB,             data/thread
 1x32-bw-thread,                        137.439, GB,             data-total
 1x32-bw-thread,                          5.443, nsecs,          runtime/byte/thread
 1x32-bw-thread,                          0.184, GB/sec,         thread-speed
 1x32-bw-thread,                          5.879, GB/sec,         total-speed

 # Running numa02-bw, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp  1"
 numa02-bw,                              20.065, secs,           runtime-max/thread
 numa02-bw,                              20.012, secs,           runtime-min/thread
 numa02-bw,                              20.050, secs,           runtime-avg/thread
 numa02-bw,                               0.132, %,              spread-runtime/thread
 numa02-bw,                               3.793, GB,             data/thread
 numa02-bw,                             121.366, GB,             data-total
 numa02-bw,                               5.290, nsecs,          runtime/byte/thread
 numa02-bw,                               0.189, GB/sec,         thread-speed
 numa02-bw,                               6.049, GB/sec,         total-speed

 # Running numa02-bw-NOTHP, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp  1 --thp -1"
 numa02-bw-NOTHP,                        20.132, secs,           runtime-max/thread
 numa02-bw-NOTHP,                        19.987, secs,           runtime-min/thread
 numa02-bw-NOTHP,                        20.049, secs,           runtime-avg/thread
 numa02-bw-NOTHP,                         0.360, %,              spread-runtime/thread
 numa02-bw-NOTHP,                         3.681, GB,             data/thread
 numa02-bw-NOTHP,                       117.776, GB,             data-total
 numa02-bw-NOTHP,                         5.470, nsecs,          runtime/byte/thread
 numa02-bw-NOTHP,                         0.183, GB/sec,         thread-speed
 numa02-bw-NOTHP,                         5.850, GB/sec,         total-speed

 # Running numa01-bw-thread, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp  1"
 numa01-bw-thread,                       20.704, secs,           runtime-max/thread
 numa01-bw-thread,                       20.185, secs,           runtime-min/thread
 numa01-bw-thread,                       20.571, secs,           runtime-avg/thread
 numa01-bw-thread,                        1.254, %,              spread-runtime/thread
 numa01-bw-thread,                        3.775, GB,             data/thread
 numa01-bw-thread,                      120.796, GB,             data-total
 numa01-bw-thread,                        5.485, nsecs,          runtime/byte/thread
 numa01-bw-thread,                        0.182, GB/sec,         thread-speed
 numa01-bw-thread,                        5.834, GB/sec,         total-speed

 # Running numa01-bw-thread-NOTHP, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp  1 --thp -1"
 numa01-bw-thread-NOTHP,                 20.780, secs,           runtime-max/thread
 numa01-bw-thread-NOTHP,                 20.023, secs,           runtime-min/thread
 numa01-bw-thread-NOTHP,                 20.418, secs,           runtime-avg/thread
 numa01-bw-thread-NOTHP,                  1.821, %,              spread-runtime/thread
 numa01-bw-thread-NOTHP,                  3.624, GB,             data/thread
 numa01-bw-thread-NOTHP,                115.964, GB,             data-total
 numa01-bw-thread-NOTHP,                  5.734, nsecs,          runtime/byte/thread
 numa01-bw-thread-NOTHP,                  0.174, GB/sec,         thread-speed
 numa01-bw-thread-NOTHP,                  5.581, GB/sec,         total-speed

 #
 # Running test on: Linux vega 3.7.0-rc6+ #2 SMP Fri Dec 7 17:59:13 CET 2012 x86_64 x86_64 x86_64 GNU/Linux
 #
# Running numa/mem benchmark...

 # Running main, "perf bench numa mem -a"

 # Running RAM-bw-local, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp  1 --no-data_rand_walk"
 RAM-bw-local,                           20.049, secs,           runtime-max/thread
 RAM-bw-local,                           20.044, secs,           runtime-min/thread
 RAM-bw-local,                           20.044, secs,           runtime-avg/thread
 RAM-bw-local,                            0.014, %,              spread-runtime/thread
 RAM-bw-local,                          172.872, GB,             data/thread
 RAM-bw-local,                          172.872, GB,             data-total
 RAM-bw-local,                            0.116, nsecs,          runtime/byte/thread
 RAM-bw-local,                            8.622, GB/sec,         thread-speed
 RAM-bw-local,                            8.622, GB/sec,         total-speed

 # Running RAM-bw-local-NOTHP, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp  1 --no-data_rand_walk --thp -1"
 RAM-bw-local-NOTHP,                     20.135, secs,           runtime-max/thread
 RAM-bw-local-NOTHP,                     20.059, secs,           runtime-min/thread
 RAM-bw-local-NOTHP,                     20.059, secs,           runtime-avg/thread
 RAM-bw-local-NOTHP,                      0.189, %,              spread-runtime/thread
 RAM-bw-local-NOTHP,                    172.872, GB,             data/thread
 RAM-bw-local-NOTHP,                    172.872, GB,             data-total
 RAM-bw-local-NOTHP,                      0.116, nsecs,          runtime/byte/thread
 RAM-bw-local-NOTHP,                      8.586, GB/sec,         thread-speed
 RAM-bw-local-NOTHP,                      8.586, GB/sec,         total-speed

 # Running RAM-bw-remote, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 1 -s 20 -zZq --thp  1 --no-data_rand_walk"
 RAM-bw-remote,                          20.080, secs,           runtime-max/thread
 RAM-bw-remote,                          20.073, secs,           runtime-min/thread
 RAM-bw-remote,                          20.073, secs,           runtime-avg/thread
 RAM-bw-remote,                           0.017, %,              spread-runtime/thread
 RAM-bw-remote,                         135.291, GB,             data/thread
 RAM-bw-remote,                         135.291, GB,             data-total
 RAM-bw-remote,                           0.148, nsecs,          runtime/byte/thread
 RAM-bw-remote,                           6.738, GB/sec,         thread-speed
 RAM-bw-remote,                           6.738, GB/sec,         total-speed

 # Running RAM-bw-local-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 0x2 -s 20 -zZq --thp  1 --no-data_rand_walk"
 RAM-bw-local-2x,                        20.127, secs,           runtime-max/thread
 RAM-bw-local-2x,                        20.111, secs,           runtime-min/thread
 RAM-bw-local-2x,                        20.116, secs,           runtime-avg/thread
 RAM-bw-local-2x,                         0.038, %,              spread-runtime/thread
 RAM-bw-local-2x,                       130.997, GB,             data/thread
 RAM-bw-local-2x,                       261.993, GB,             data-total
 RAM-bw-local-2x,                         0.154, nsecs,          runtime/byte/thread
 RAM-bw-local-2x,                         6.509, GB/sec,         thread-speed
 RAM-bw-local-2x,                        13.017, GB/sec,         total-speed

 # Running RAM-bw-remote-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 1x2 -s 20 -zZq --thp  1 --no-data_rand_walk"
 RAM-bw-remote-2x,                       20.183, secs,           runtime-max/thread
 RAM-bw-remote-2x,                       20.110, secs,           runtime-min/thread
 RAM-bw-remote-2x,                       20.143, secs,           runtime-avg/thread
 RAM-bw-remote-2x,                        0.180, %,              spread-runtime/thread
 RAM-bw-remote-2x,                       75.162, GB,             data/thread
 RAM-bw-remote-2x,                      150.324, GB,             data-total
 RAM-bw-remote-2x,                        0.269, nsecs,          runtime/byte/thread
 RAM-bw-remote-2x,                        3.724, GB/sec,         thread-speed
 RAM-bw-remote-2x,                        7.448, GB/sec,         total-speed

 # Running RAM-bw-cross, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,8 -M 1,0 -s 20 -zZq --thp  1 --no-data_rand_walk"
 RAM-bw-cross,                           20.159, secs,           runtime-max/thread
 RAM-bw-cross,                           20.071, secs,           runtime-min/thread
 RAM-bw-cross,                           20.111, secs,           runtime-avg/thread
 RAM-bw-cross,                            0.220, %,              spread-runtime/thread
 RAM-bw-cross,                          124.017, GB,             data/thread
 RAM-bw-cross,                          248.034, GB,             data-total
 RAM-bw-cross,                            0.163, nsecs,          runtime/byte/thread
 RAM-bw-cross,                            6.152, GB/sec,         thread-speed
 RAM-bw-cross,                           12.304, GB/sec,         total-speed

 # Running  1x3-convergence, "perf bench numa mem -p 1 -t 3 -P 512 -s 100 -zZ0qcm --thp  1"
  1x3-convergence,                      100.038, secs,           NUMA-convergence-latency
  1x3-convergence,                      100.038, secs,           runtime-max/thread
  1x3-convergence,                      100.005, secs,           runtime-min/thread
  1x3-convergence,                      100.016, secs,           runtime-avg/thread
  1x3-convergence,                        0.016, %,              spread-runtime/thread
  1x3-convergence,                      379.210, GB,             data/thread
  1x3-convergence,                     1137.629, GB,             data-total
  1x3-convergence,                        0.264, nsecs,          runtime/byte/thread
  1x3-convergence,                        3.791, GB/sec,         thread-speed
  1x3-convergence,                       11.372, GB/sec,         total-speed

 # Running  1x4-convergence, "perf bench numa mem -p 1 -t 4 -P 512 -s 100 -zZ0qcm --thp  1"
  1x4-convergence,                      100.091, secs,           NUMA-convergence-latency
  1x4-convergence,                      100.091, secs,           runtime-max/thread
  1x4-convergence,                      100.016, secs,           runtime-min/thread
  1x4-convergence,                      100.053, secs,           runtime-avg/thread
  1x4-convergence,                        0.037, %,              spread-runtime/thread
  1x4-convergence,                      162.672, GB,             data/thread
  1x4-convergence,                      650.688, GB,             data-total
  1x4-convergence,                        0.615, nsecs,          runtime/byte/thread
  1x4-convergence,                        1.625, GB/sec,         thread-speed
  1x4-convergence,                        6.501, GB/sec,         total-speed

 # Running  1x6-convergence, "perf bench numa mem -p 1 -t 6 -P 1020 -s 100 -zZ0qcm --thp  1"
  1x6-convergence,                      100.366, secs,           NUMA-convergence-latency
  1x6-convergence,                      100.366, secs,           runtime-max/thread
  1x6-convergence,                      100.005, secs,           runtime-min/thread
  1x6-convergence,                      100.144, secs,           runtime-avg/thread
  1x6-convergence,                        0.180, %,              spread-runtime/thread
  1x6-convergence,                      103.924, GB,             data/thread
  1x6-convergence,                      623.546, GB,             data-total
  1x6-convergence,                        0.966, nsecs,          runtime/byte/thread
  1x6-convergence,                        1.035, GB/sec,         thread-speed
  1x6-convergence,                        6.213, GB/sec,         total-speed

 # Running  2x3-convergence, "perf bench numa mem -p 3 -t 3 -P 1020 -s 100 -zZ0qcm --thp  1"
  2x3-convergence,                      100.632, secs,           NUMA-convergence-latency
  2x3-convergence,                      100.632, secs,           runtime-max/thread
  2x3-convergence,                      100.080, secs,           runtime-min/thread
  2x3-convergence,                      100.376, secs,           runtime-avg/thread
  2x3-convergence,                        0.274, %,              spread-runtime/thread
  2x3-convergence,                       87.941, GB,             data/thread
  2x3-convergence,                      791.465, GB,             data-total
  2x3-convergence,                        1.144, nsecs,          runtime/byte/thread
  2x3-convergence,                        0.874, GB/sec,         thread-speed
  2x3-convergence,                        7.865, GB/sec,         total-speed

 # Running  3x3-convergence, "perf bench numa mem -p 3 -t 3 -P 1020 -s 100 -zZ0qcm --thp  1"
  3x3-convergence,                      100.474, secs,           NUMA-convergence-latency
  3x3-convergence,                      100.474, secs,           runtime-max/thread
  3x3-convergence,                      100.070, secs,           runtime-min/thread
  3x3-convergence,                      100.338, secs,           runtime-avg/thread
  3x3-convergence,                        0.201, %,              spread-runtime/thread
  3x3-convergence,                      118.363, GB,             data/thread
  3x3-convergence,                     1065.269, GB,             data-total
  3x3-convergence,                        0.849, nsecs,          runtime/byte/thread
  3x3-convergence,                        1.178, GB/sec,         thread-speed
  3x3-convergence,                       10.602, GB/sec,         total-speed

 # Running  4x4-convergence, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp  1"
  4x4-convergence,                      100.527, secs,           NUMA-convergence-latency
  4x4-convergence,                      100.527, secs,           runtime-max/thread
  4x4-convergence,                      100.179, secs,           runtime-min/thread
  4x4-convergence,                      100.353, secs,           runtime-avg/thread
  4x4-convergence,                        0.173, %,              spread-runtime/thread
  4x4-convergence,                       65.230, GB,             data/thread
  4x4-convergence,                     1043.677, GB,             data-total
  4x4-convergence,                        1.541, nsecs,          runtime/byte/thread
  4x4-convergence,                        0.649, GB/sec,         thread-speed
  4x4-convergence,                       10.382, GB/sec,         total-speed

 # Running  4x4-convergence-NOTHP, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp  1 --thp -1"
  4x4-convergence-NOTHP,                100.532, secs,           NUMA-convergence-latency
  4x4-convergence-NOTHP,                100.532, secs,           runtime-max/thread
  4x4-convergence-NOTHP,                100.095, secs,           runtime-min/thread
  4x4-convergence-NOTHP,                100.343, secs,           runtime-avg/thread
  4x4-convergence-NOTHP,                  0.217, %,              spread-runtime/thread
  4x4-convergence-NOTHP,                 57.311, GB,             data/thread
  4x4-convergence-NOTHP,                916.976, GB,             data-total
  4x4-convergence-NOTHP,                  1.754, nsecs,          runtime/byte/thread
  4x4-convergence-NOTHP,                  0.570, GB/sec,         thread-speed
  4x4-convergence-NOTHP,                  9.121, GB/sec,         total-speed

 # Running  4x6-convergence, "perf bench numa mem -p 4 -t 6 -P 1020 -s 100 -zZ0qcm --thp  1"
  4x6-convergence,                      101.230, secs,           NUMA-convergence-latency
  4x6-convergence,                      101.230, secs,           runtime-max/thread
  4x6-convergence,                      100.093, secs,           runtime-min/thread
  4x6-convergence,                      100.825, secs,           runtime-avg/thread
  4x6-convergence,                        0.562, %,              spread-runtime/thread
  4x6-convergence,                       28.076, GB,             data/thread
  4x6-convergence,                      673.815, GB,             data-total
  4x6-convergence,                        3.606, nsecs,          runtime/byte/thread
  4x6-convergence,                        0.277, GB/sec,         thread-speed
  4x6-convergence,                        6.656, GB/sec,         total-speed

 # Running  4x8-convergence, "perf bench numa mem -p 4 -t 8 -P 512 -s 100 -zZ0qcm --thp  1"
  4x8-convergence,                      101.310, secs,           NUMA-convergence-latency
  4x8-convergence,                      101.310, secs,           runtime-max/thread
  4x8-convergence,                      100.052, secs,           runtime-min/thread
  4x8-convergence,                      100.679, secs,           runtime-avg/thread
  4x8-convergence,                        0.621, %,              spread-runtime/thread
  4x8-convergence,                       18.740, GB,             data/thread
  4x8-convergence,                      599.685, GB,             data-total
  4x8-convergence,                        5.406, nsecs,          runtime/byte/thread
  4x8-convergence,                        0.185, GB/sec,         thread-speed
  4x8-convergence,                        5.919, GB/sec,         total-speed

 # Running  8x4-convergence, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp  1"
  8x4-convergence,                      100.849, secs,           NUMA-convergence-latency
  8x4-convergence,                      100.849, secs,           runtime-max/thread
  8x4-convergence,                      100.020, secs,           runtime-min/thread
  8x4-convergence,                      100.570, secs,           runtime-avg/thread
  8x4-convergence,                        0.411, %,              spread-runtime/thread
  8x4-convergence,                       22.364, GB,             data/thread
  8x4-convergence,                      715.649, GB,             data-total
  8x4-convergence,                        4.509, nsecs,          runtime/byte/thread
  8x4-convergence,                        0.222, GB/sec,         thread-speed
  8x4-convergence,                        7.096, GB/sec,         total-speed

 # Running  8x4-convergence-NOTHP, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp  1 --thp -1"
  8x4-convergence-NOTHP,                100.976, secs,           NUMA-convergence-latency
  8x4-convergence-NOTHP,                100.976, secs,           runtime-max/thread
  8x4-convergence-NOTHP,                100.066, secs,           runtime-min/thread
  8x4-convergence-NOTHP,                100.580, secs,           runtime-avg/thread
  8x4-convergence-NOTHP,                  0.451, %,              spread-runtime/thread
  8x4-convergence-NOTHP,                 27.146, GB,             data/thread
  8x4-convergence-NOTHP,                868.657, GB,             data-total
  8x4-convergence-NOTHP,                  3.720, nsecs,          runtime/byte/thread
  8x4-convergence-NOTHP,                  0.269, GB/sec,         thread-speed
  8x4-convergence-NOTHP,                  8.603, GB/sec,         total-speed

 # Running  3x1-convergence, "perf bench numa mem -p 3 -t 1 -P 512 -s 100 -zZ0qcm --thp  1"
  3x1-convergence,                        1.010, secs,           NUMA-convergence-latency
  3x1-convergence,                        1.010, secs,           runtime-max/thread
  3x1-convergence,                        0.869, secs,           runtime-min/thread
  3x1-convergence,                        0.958, secs,           runtime-avg/thread
  3x1-convergence,                        6.944, %,              spread-runtime/thread
  3x1-convergence,                        2.326, GB,             data/thread
  3x1-convergence,                        6.979, GB,             data-total
  3x1-convergence,                        0.434, nsecs,          runtime/byte/thread
  3x1-convergence,                        2.305, GB/sec,         thread-speed
  3x1-convergence,                        6.914, GB/sec,         total-speed

 # Running  4x1-convergence, "perf bench numa mem -p 4 -t 1 -P 512 -s 100 -zZ0qcm --thp  1"
  4x1-convergence,                        0.782, secs,           NUMA-convergence-latency
  4x1-convergence,                        0.782, secs,           runtime-max/thread
  4x1-convergence,                        0.623, secs,           runtime-min/thread
  4x1-convergence,                        0.689, secs,           runtime-avg/thread
  4x1-convergence,                       10.122, %,              spread-runtime/thread
  4x1-convergence,                        1.208, GB,             data/thread
  4x1-convergence,                        4.832, GB,             data-total
  4x1-convergence,                        0.647, nsecs,          runtime/byte/thread
  4x1-convergence,                        1.545, GB/sec,         thread-speed
  4x1-convergence,                        6.181, GB/sec,         total-speed

 # Running  8x1-convergence, "perf bench numa mem -p 8 -t 1 -P 512 -s 100 -zZ0qcm --thp  1"
  8x1-convergence,                        2.914, secs,           NUMA-convergence-latency
  8x1-convergence,                        2.914, secs,           runtime-max/thread
  8x1-convergence,                        2.533, secs,           runtime-min/thread
  8x1-convergence,                        2.750, secs,           runtime-avg/thread
  8x1-convergence,                        6.538, %,              spread-runtime/thread
  8x1-convergence,                        2.215, GB,             data/thread
  8x1-convergence,                       17.717, GB,             data-total
  8x1-convergence,                        1.316, nsecs,          runtime/byte/thread
  8x1-convergence,                        0.760, GB/sec,         thread-speed
  8x1-convergence,                        6.080, GB/sec,         total-speed

 # Running 16x1-convergence, "perf bench numa mem -p 16 -t 1 -P 256 -s 100 -zZ0qcm --thp  1"
 16x1-convergence,                        3.688, secs,           NUMA-convergence-latency
 16x1-convergence,                        3.688, secs,           runtime-max/thread
 16x1-convergence,                        3.358, secs,           runtime-min/thread
 16x1-convergence,                        3.533, secs,           runtime-avg/thread
 16x1-convergence,                        4.481, %,              spread-runtime/thread
 16x1-convergence,                        1.292, GB,             data/thread
 16x1-convergence,                       20.670, GB,             data-total
 16x1-convergence,                        2.855, nsecs,          runtime/byte/thread
 16x1-convergence,                        0.350, GB/sec,         thread-speed
 16x1-convergence,                        5.604, GB/sec,         total-speed

 # Running 32x1-convergence, "perf bench numa mem -p 32 -t 1 -P 128 -s 100 -zZ0qcm --thp  1"
 32x1-convergence,                        2.762, secs,           NUMA-convergence-latency
 32x1-convergence,                        2.762, secs,           runtime-max/thread
 32x1-convergence,                        2.552, secs,           runtime-min/thread
 32x1-convergence,                        2.735, secs,           runtime-avg/thread
 32x1-convergence,                        3.807, %,              spread-runtime/thread
 32x1-convergence,                        0.516, GB,             data/thread
 32x1-convergence,                       16.509, GB,             data-total
 32x1-convergence,                        5.354, nsecs,          runtime/byte/thread
 32x1-convergence,                        0.187, GB/sec,         thread-speed
 32x1-convergence,                        5.976, GB/sec,         total-speed
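
(In the -convergence tests above, NUMA-convergence-latency is, as the
name suggests, the time the workload needed to settle into a converged
NUMA placement: the ~100 sec values line up with the -s 100 time limit
of these runs, i.e. those workloads never converged within the limit,
while e.g. 3x1-convergence settled in about a second. A quick
"grep NUMA-convergence-latency bench.log" over a saved copy lists just
these latencies.)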

 # Running  2x1-bw-process, "perf bench numa mem -p 2 -t 1 -P 1024 -s 20 -zZ0q --thp  1"
  2x1-bw-process,                        20.123, secs,           runtime-max/thread
  2x1-bw-process,                        20.053, secs,           runtime-min/thread
  2x1-bw-process,                        20.085, secs,           runtime-avg/thread
  2x1-bw-process,                         0.173, %,              spread-runtime/thread
  2x1-bw-process,                        61.740, GB,             data/thread
  2x1-bw-process,                       123.480, GB,             data-total
  2x1-bw-process,                         0.326, nsecs,          runtime/byte/thread
  2x1-bw-process,                         3.068, GB/sec,         thread-speed
  2x1-bw-process,                         6.136, GB/sec,         total-speed

 # Running  3x1-bw-process, "perf bench numa mem -p 3 -t 1 -P 1024 -s 20 -zZ0q --thp  1"
  3x1-bw-process,                        20.143, secs,           runtime-max/thread
  3x1-bw-process,                        20.043, secs,           runtime-min/thread
  3x1-bw-process,                        20.091, secs,           runtime-avg/thread
  3x1-bw-process,                         0.249, %,              spread-runtime/thread
  3x1-bw-process,                        48.676, GB,             data/thread
  3x1-bw-process,                       146.029, GB,             data-total
  3x1-bw-process,                         0.414, nsecs,          runtime/byte/thread
  3x1-bw-process,                         2.417, GB/sec,         thread-speed
  3x1-bw-process,                         7.250, GB/sec,         total-speed

 # Running  4x1-bw-process, "perf bench numa mem -p 4 -t 1 -P 1024 -s 20 -zZ0q --thp  1"
  4x1-bw-process,                        20.327, secs,           runtime-max/thread
  4x1-bw-process,                        20.020, secs,           runtime-min/thread
  4x1-bw-process,                        20.168, secs,           runtime-avg/thread
  4x1-bw-process,                         0.754, %,              spread-runtime/thread
  4x1-bw-process,                        34.897, GB,             data/thread
  4x1-bw-process,                       139.586, GB,             data-total
  4x1-bw-process,                         0.582, nsecs,          runtime/byte/thread
  4x1-bw-process,                         1.717, GB/sec,         thread-speed
  4x1-bw-process,                         6.867, GB/sec,         total-speed

 # Running  8x1-bw-process, "perf bench numa mem -p 8 -t 1 -P  512 -s 20 -zZ0q --thp  1"
  8x1-bw-process,                        20.063, secs,           runtime-max/thread
  8x1-bw-process,                        20.004, secs,           runtime-min/thread
  8x1-bw-process,                        20.034, secs,           runtime-avg/thread
  8x1-bw-process,                         0.148, %,              spread-runtime/thread
  8x1-bw-process,                        19.998, GB,             data/thread
  8x1-bw-process,                       159.988, GB,             data-total
  8x1-bw-process,                         1.003, nsecs,          runtime/byte/thread
  8x1-bw-process,                         0.997, GB/sec,         thread-speed
  8x1-bw-process,                         7.974, GB/sec,         total-speed

 # Running  8x1-bw-process-NOTHP, "perf bench numa mem -p 8 -t 1 -P  512 -s 20 -zZ0q --thp  1 --thp -1"
  8x1-bw-process-NOTHP,                  20.435, secs,           runtime-max/thread
  8x1-bw-process-NOTHP,                  20.150, secs,           runtime-min/thread
  8x1-bw-process-NOTHP,                  20.255, secs,           runtime-avg/thread
  8x1-bw-process-NOTHP,                   0.699, %,              spread-runtime/thread
  8x1-bw-process-NOTHP,                  15.167, GB,             data/thread
  8x1-bw-process-NOTHP,                 121.333, GB,             data-total
  8x1-bw-process-NOTHP,                   1.347, nsecs,          runtime/byte/thread
  8x1-bw-process-NOTHP,                   0.742, GB/sec,         thread-speed
  8x1-bw-process-NOTHP,                   5.937, GB/sec,         total-speed

 # Running 16x1-bw-process, "perf bench numa mem -p 16 -t 1 -P 256 -s 20 -zZ0q --thp  1"
 16x1-bw-process,                        20.451, secs,           runtime-max/thread
 16x1-bw-process,                        20.078, secs,           runtime-min/thread
 16x1-bw-process,                        20.311, secs,           runtime-avg/thread
 16x1-bw-process,                         0.912, %,              spread-runtime/thread
 16x1-bw-process,                         7.147, GB,             data/thread
 16x1-bw-process,                       114.354, GB,             data-total
 16x1-bw-process,                         2.861, nsecs,          runtime/byte/thread
 16x1-bw-process,                         0.349, GB/sec,         thread-speed
 16x1-bw-process,                         5.592, GB/sec,         total-speed

 # Running  4x1-bw-thread, "perf bench numa mem -p 1 -t 4 -T 256 -s 20 -zZ0q --thp  1"
  4x1-bw-thread,                         20.038, secs,           runtime-max/thread
  4x1-bw-thread,                         20.006, secs,           runtime-min/thread
  4x1-bw-thread,                         20.023, secs,           runtime-avg/thread
  4x1-bw-thread,                          0.079, %,              spread-runtime/thread
  4x1-bw-thread,                         68.115, GB,             data/thread
  4x1-bw-thread,                        272.462, GB,             data-total
  4x1-bw-thread,                          0.294, nsecs,          runtime/byte/thread
  4x1-bw-thread,                          3.399, GB/sec,         thread-speed
  4x1-bw-thread,                         13.598, GB/sec,         total-speed

 # Running  8x1-bw-thread, "perf bench numa mem -p 1 -t 8 -T 256 -s 20 -zZ0q --thp  1"
  8x1-bw-thread,                         20.055, secs,           runtime-max/thread
  8x1-bw-thread,                         20.001, secs,           runtime-min/thread
  8x1-bw-thread,                         20.033, secs,           runtime-avg/thread
  8x1-bw-thread,                          0.136, %,              spread-runtime/thread
  8x1-bw-thread,                         41.004, GB,             data/thread
  8x1-bw-thread,                        328.028, GB,             data-total
  8x1-bw-thread,                          0.489, nsecs,          runtime/byte/thread
  8x1-bw-thread,                          2.045, GB/sec,         thread-speed
  8x1-bw-thread,                         16.356, GB/sec,         total-speed

 # Running 16x1-bw-thread, "perf bench numa mem -p 1 -t 16 -T 128 -s 20 -zZ0q --thp  1"
 16x1-bw-thread,                         20.044, secs,           runtime-max/thread
 16x1-bw-thread,                         19.994, secs,           runtime-min/thread
 16x1-bw-thread,                         20.021, secs,           runtime-avg/thread
 16x1-bw-thread,                          0.124, %,              spread-runtime/thread
 16x1-bw-thread,                         30.828, GB,             data/thread
 16x1-bw-thread,                        493.250, GB,             data-total
 16x1-bw-thread,                          0.650, nsecs,          runtime/byte/thread
 16x1-bw-thread,                          1.538, GB/sec,         thread-speed
 16x1-bw-thread,                         24.608, GB/sec,         total-speed

 # Running 32x1-bw-thread, "perf bench numa mem -p 1 -t 32 -T 64 -s 20 -zZ0q --thp  1"
 32x1-bw-thread,                         19.990, secs,           runtime-max/thread
 32x1-bw-thread,                         19.955, secs,           runtime-min/thread
 32x1-bw-thread,                         19.996, secs,           runtime-avg/thread
 32x1-bw-thread,                          0.087, %,              spread-runtime/thread
 32x1-bw-thread,                         15.915, GB,             data/thread
 32x1-bw-thread,                        509.289, GB,             data-total
 32x1-bw-thread,                          1.256, nsecs,          runtime/byte/thread
 32x1-bw-thread,                          0.796, GB/sec,         thread-speed
 32x1-bw-thread,                         25.477, GB/sec,         total-speed

 # Running  2x3-bw-thread, "perf bench numa mem -p 2 -t 3 -P 512 -s 20 -zZ0q --thp  1"
  2x3-bw-thread,                         20.168, secs,           runtime-max/thread
  2x3-bw-thread,                         20.028, secs,           runtime-min/thread
  2x3-bw-thread,                         20.103, secs,           runtime-avg/thread
  2x3-bw-thread,                          0.346, %,              spread-runtime/thread
  2x3-bw-thread,                         29.528, GB,             data/thread
  2x3-bw-thread,                        177.167, GB,             data-total
  2x3-bw-thread,                          0.683, nsecs,          runtime/byte/thread
  2x3-bw-thread,                          1.464, GB/sec,         thread-speed
  2x3-bw-thread,                          8.785, GB/sec,         total-speed

 # Running  4x4-bw-thread, "perf bench numa mem -p 4 -t 4 -P 512 -s 20 -zZ0q --thp  1"
  4x4-bw-thread,                         20.576, secs,           runtime-max/thread
  4x4-bw-thread,                         20.002, secs,           runtime-min/thread
  4x4-bw-thread,                         20.312, secs,           runtime-avg/thread
  4x4-bw-thread,                          1.394, %,              spread-runtime/thread
  4x4-bw-thread,                          8.187, GB,             data/thread
  4x4-bw-thread,                        130.997, GB,             data-total
  4x4-bw-thread,                          2.513, nsecs,          runtime/byte/thread
  4x4-bw-thread,                          0.398, GB/sec,         thread-speed
  4x4-bw-thread,                          6.366, GB/sec,         total-speed

 # Running  4x6-bw-thread, "perf bench numa mem -p 4 -t 6 -P 512 -s 20 -zZ0q --thp  1"
  4x6-bw-thread,                         21.007, secs,           runtime-max/thread
  4x6-bw-thread,                         20.075, secs,           runtime-min/thread
  4x6-bw-thread,                         20.573, secs,           runtime-avg/thread
  4x6-bw-thread,                          2.219, %,              spread-runtime/thread
  4x6-bw-thread,                          5.503, GB,             data/thread
  4x6-bw-thread,                        132.070, GB,             data-total
  4x6-bw-thread,                          3.817, nsecs,          runtime/byte/thread
  4x6-bw-thread,                          0.262, GB/sec,         thread-speed
  4x6-bw-thread,                          6.287, GB/sec,         total-speed

 # Running  4x8-bw-thread, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp  1"
  4x8-bw-thread,                         21.986, secs,           runtime-max/thread
  4x8-bw-thread,                         20.359, secs,           runtime-min/thread
  4x8-bw-thread,                         21.300, secs,           runtime-avg/thread
  4x8-bw-thread,                          3.701, %,              spread-runtime/thread
  4x8-bw-thread,                          4.027, GB,             data/thread
  4x8-bw-thread,                        128.849, GB,             data-total
  4x8-bw-thread,                          5.460, nsecs,          runtime/byte/thread
  4x8-bw-thread,                          0.183, GB/sec,         thread-speed
  4x8-bw-thread,                          5.860, GB/sec,         total-speed

 # Running  4x8-bw-thread-NOTHP, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp  1 --thp -1"
  4x8-bw-thread-NOTHP,                   21.155, secs,           runtime-max/thread
  4x8-bw-thread-NOTHP,                   20.115, secs,           runtime-min/thread
  4x8-bw-thread-NOTHP,                   20.705, secs,           runtime-avg/thread
  4x8-bw-thread-NOTHP,                    2.459, %,              spread-runtime/thread
  4x8-bw-thread-NOTHP,                    4.077, GB,             data/thread
  4x8-bw-thread-NOTHP,                  130.460, GB,             data-total
  4x8-bw-thread-NOTHP,                    5.189, nsecs,          runtime/byte/thread
  4x8-bw-thread-NOTHP,                    0.193, GB/sec,         thread-speed
  4x8-bw-thread-NOTHP,                    6.167, GB/sec,         total-speed

 # Running  3x3-bw-thread, "perf bench numa mem -p 3 -t 3 -P 512 -s 20 -zZ0q --thp  1"
  3x3-bw-thread,                         20.211, secs,           runtime-max/thread
  3x3-bw-thread,                         20.044, secs,           runtime-min/thread
  3x3-bw-thread,                         20.127, secs,           runtime-avg/thread
  3x3-bw-thread,                          0.413, %,              spread-runtime/thread
  3x3-bw-thread,                         18.492, GB,             data/thread
  3x3-bw-thread,                        166.430, GB,             data-total
  3x3-bw-thread,                          1.093, nsecs,          runtime/byte/thread
  3x3-bw-thread,                          0.915, GB/sec,         thread-speed
  3x3-bw-thread,                          8.235, GB/sec,         total-speed

 # Running  5x5-bw-thread, "perf bench numa mem -p 5 -t 5 -P 512 -s 20 -zZ0q --thp  1"
  5x5-bw-thread,                         21.244, secs,           runtime-max/thread
  5x5-bw-thread,                         20.115, secs,           runtime-min/thread
  5x5-bw-thread,                         20.873, secs,           runtime-avg/thread
  5x5-bw-thread,                          2.657, %,              spread-runtime/thread
  5x5-bw-thread,                          4.896, GB,             data/thread
  5x5-bw-thread,                        122.407, GB,             data-total
  5x5-bw-thread,                          4.339, nsecs,          runtime/byte/thread
  5x5-bw-thread,                          0.230, GB/sec,         thread-speed
  5x5-bw-thread,                          5.762, GB/sec,         total-speed

 # Running 2x16-bw-thread, "perf bench numa mem -p 2 -t 16 -P 512 -s 20 -zZ0q --thp  1"
 2x16-bw-thread,                         21.854, secs,           runtime-max/thread
 2x16-bw-thread,                         20.047, secs,           runtime-min/thread
 2x16-bw-thread,                         21.157, secs,           runtime-avg/thread
 2x16-bw-thread,                          4.135, %,              spread-runtime/thread
 2x16-bw-thread,                          4.043, GB,             data/thread
 2x16-bw-thread,                        129.386, GB,             data-total
 2x16-bw-thread,                          5.405, nsecs,          runtime/byte/thread
 2x16-bw-thread,                          0.185, GB/sec,         thread-speed
 2x16-bw-thread,                          5.920, GB/sec,         total-speed

 # Running 1x32-bw-thread, "perf bench numa mem -p 1 -t 32 -P 2048 -s 20 -zZ0q --thp  1"
 1x32-bw-thread,                         23.952, secs,           runtime-max/thread
 1x32-bw-thread,                         20.470, secs,           runtime-min/thread
 1x32-bw-thread,                         22.975, secs,           runtime-avg/thread
 1x32-bw-thread,                          7.268, %,              spread-runtime/thread
 1x32-bw-thread,                          4.362, GB,             data/thread
 1x32-bw-thread,                        139.586, GB,             data-total
 1x32-bw-thread,                          5.491, nsecs,          runtime/byte/thread
 1x32-bw-thread,                          0.182, GB/sec,         thread-speed
 1x32-bw-thread,                          5.828, GB/sec,         total-speed

 # Running numa02-bw, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp  1"
 numa02-bw,                              19.990, secs,           runtime-max/thread
 numa02-bw,                              19.975, secs,           runtime-min/thread
 numa02-bw,                              19.995, secs,           runtime-avg/thread
 numa02-bw,                               0.037, %,              spread-runtime/thread
 numa02-bw,                              18.150, GB,             data/thread
 numa02-bw,                             580.794, GB,             data-total
 numa02-bw,                               1.101, nsecs,          runtime/byte/thread
 numa02-bw,                               0.908, GB/sec,         thread-speed
 numa02-bw,                              29.054, GB/sec,         total-speed

 # Running numa02-bw-NOTHP, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp  1 --thp -1"
 numa02-bw-NOTHP,                        20.072, secs,           runtime-max/thread
 numa02-bw-NOTHP,                        19.965, secs,           runtime-min/thread
 numa02-bw-NOTHP,                        19.998, secs,           runtime-avg/thread
 numa02-bw-NOTHP,                         0.266, %,              spread-runtime/thread
 numa02-bw-NOTHP,                        16.975, GB,             data/thread
 numa02-bw-NOTHP,                       543.213, GB,             data-total
 numa02-bw-NOTHP,                         1.182, nsecs,          runtime/byte/thread
 numa02-bw-NOTHP,                         0.846, GB/sec,         thread-speed
 numa02-bw-NOTHP,                        27.064, GB/sec,         total-speed

 # Running numa01-bw-thread, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp  1"
 numa01-bw-thread,                       20.125, secs,           runtime-max/thread
 numa01-bw-thread,                       19.980, secs,           runtime-min/thread
 numa01-bw-thread,                       20.094, secs,           runtime-avg/thread
 numa01-bw-thread,                        0.361, %,              spread-runtime/thread
 numa01-bw-thread,                       12.791, GB,             data/thread
 numa01-bw-thread,                      409.297, GB,             data-total
 numa01-bw-thread,                        1.573, nsecs,          runtime/byte/thread
 numa01-bw-thread,                        0.636, GB/sec,         thread-speed
 numa01-bw-thread,                       20.338, GB/sec,         total-speed

 # Running numa01-bw-thread-NOTHP, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp  1 --thp -1"
 numa01-bw-thread-NOTHP,                 20.298, secs,           runtime-max/thread
 numa01-bw-thread-NOTHP,                 19.965, secs,           runtime-min/thread
 numa01-bw-thread-NOTHP,                 20.055, secs,           runtime-avg/thread
 numa01-bw-thread-NOTHP,                  0.820, %,              spread-runtime/thread
 numa01-bw-thread-NOTHP,                 11.752, GB,             data/thread
 numa01-bw-thread-NOTHP,                376.078, GB,             data-total
 numa01-bw-thread-NOTHP,                  1.727, nsecs,          runtime/byte/thread
 numa01-bw-thread-NOTHP,                  0.579, GB/sec,         thread-speed
 numa01-bw-thread-NOTHP,                 18.528, GB/sec,         total-speed
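
(Each "# Running <name>, ..." line above echoes the exact command line
of the corresponding sub-test, so any individual result can be
re-checked without re-running the whole suite - for example the
single-threaded local-bandwidth test:

    perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp 1 --no-data_rand_walk

The banner below marks the start of a second pass over the same test
list.)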

 #
 # Running test on: Linux vega 3.6.0+ #4 SMP Fri Dec 7 19:14:49 CET 2012 x86_64 x86_64 x86_64 GNU/Linux
 #
# Running numa/mem benchmark...

 # Running main, "perf bench numa mem -a"

 # Running RAM-bw-local, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp  1 --no-data_rand_walk"
 RAM-bw-local,                           20.080, secs,           runtime-max/thread
 RAM-bw-local,                           20.073, secs,           runtime-min/thread
 RAM-bw-local,                           20.073, secs,           runtime-avg/thread
 RAM-bw-local,                            0.018, %,              spread-runtime/thread
 RAM-bw-local,                          170.725, GB,             data/thread
 RAM-bw-local,                          170.725, GB,             data-total
 RAM-bw-local,                            0.118, nsecs,          runtime/byte/thread
 RAM-bw-local,                            8.502, GB/sec,         thread-speed
 RAM-bw-local,                            8.502, GB/sec,         total-speed

 # Running RAM-bw-local-NOTHP, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp  1 --no-data_rand_walk --thp -1"
 RAM-bw-local-NOTHP,                     20.112, secs,           runtime-max/thread
 RAM-bw-local-NOTHP,                     20.028, secs,           runtime-min/thread
 RAM-bw-local-NOTHP,                     20.028, secs,           runtime-avg/thread
 RAM-bw-local-NOTHP,                      0.209, %,              spread-runtime/thread
 RAM-bw-local-NOTHP,                    169.651, GB,             data/thread
 RAM-bw-local-NOTHP,                    169.651, GB,             data-total
 RAM-bw-local-NOTHP,                      0.119, nsecs,          runtime/byte/thread
 RAM-bw-local-NOTHP,                      8.435, GB/sec,         thread-speed
 RAM-bw-local-NOTHP,                      8.435, GB/sec,         total-speed

 # Running RAM-bw-remote, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 1 -s 20 -zZq --thp  1 --no-data_rand_walk"
 RAM-bw-remote,                          20.101, secs,           runtime-max/thread
 RAM-bw-remote,                          20.093, secs,           runtime-min/thread
 RAM-bw-remote,                          20.093, secs,           runtime-avg/thread
 RAM-bw-remote,                           0.021, %,              spread-runtime/thread
 RAM-bw-remote,                         134.218, GB,             data/thread
 RAM-bw-remote,                         134.218, GB,             data-total
 RAM-bw-remote,                           0.150, nsecs,          runtime/byte/thread
 RAM-bw-remote,                           6.677, GB/sec,         thread-speed
 RAM-bw-remote,                           6.677, GB/sec,         total-speed

 # Running RAM-bw-local-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 0x2 -s 20 -zZq --thp  1 --no-data_rand_walk"
 RAM-bw-local-2x,                        20.109, secs,           runtime-max/thread
 RAM-bw-local-2x,                        20.011, secs,           runtime-min/thread
 RAM-bw-local-2x,                        20.056, secs,           runtime-avg/thread
 RAM-bw-local-2x,                         0.243, %,              spread-runtime/thread
 RAM-bw-local-2x,                       135.291, GB,             data/thread
 RAM-bw-local-2x,                       270.583, GB,             data-total
 RAM-bw-local-2x,                         0.149, nsecs,          runtime/byte/thread
 RAM-bw-local-2x,                         6.728, GB/sec,         thread-speed
 RAM-bw-local-2x,                        13.456, GB/sec,         total-speed

 # Running RAM-bw-remote-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 1x2 -s 20 -zZq --thp  1 --no-data_rand_walk"
 RAM-bw-remote-2x,                       20.292, secs,           runtime-max/thread
 RAM-bw-remote-2x,                       20.279, secs,           runtime-min/thread
 RAM-bw-remote-2x,                       20.281, secs,           runtime-avg/thread
 RAM-bw-remote-2x,                        0.034, %,              spread-runtime/thread
 RAM-bw-remote-2x,                       74.625, GB,             data/thread
 RAM-bw-remote-2x,                      149.250, GB,             data-total
 RAM-bw-remote-2x,                        0.272, nsecs,          runtime/byte/thread
 RAM-bw-remote-2x,                        3.677, GB/sec,         thread-speed
 RAM-bw-remote-2x,                        7.355, GB/sec,         total-speed

 # Running RAM-bw-cross, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,8 -M 1,0 -s 20 -zZq --thp  1 --no-data_rand_walk"
 RAM-bw-cross,                           20.177, secs,           runtime-max/thread
 RAM-bw-cross,                           20.158, secs,           runtime-min/thread
 RAM-bw-cross,                           20.163, secs,           runtime-avg/thread
 RAM-bw-cross,                            0.048, %,              spread-runtime/thread
 RAM-bw-cross,                          122.943, GB,             data/thread
 RAM-bw-cross,                          245.887, GB,             data-total
 RAM-bw-cross,                            0.164, nsecs,          runtime/byte/thread
 RAM-bw-cross,                            6.093, GB/sec,         thread-speed
 RAM-bw-cross,                           12.187, GB/sec,         total-speed

 # Running  1x3-convergence, "perf bench numa mem -p 1 -t 3 -P 512 -s 100 -zZ0qcm --thp  1"
  1x3-convergence,                        0.224, secs,           NUMA-convergence-latency
  1x3-convergence,                        0.224, secs,           runtime-max/thread
  1x3-convergence,                        0.205, secs,           runtime-min/thread
  1x3-convergence,                        0.214, secs,           runtime-avg/thread
  1x3-convergence,                        4.078, %,              spread-runtime/thread
  1x3-convergence,                        0.537, GB,             data/thread
  1x3-convergence,                        1.611, GB,             data-total
  1x3-convergence,                        0.417, nsecs,          runtime/byte/thread
  1x3-convergence,                        2.401, GB/sec,         thread-speed
  1x3-convergence,                        7.202, GB/sec,         total-speed

 # Running  1x4-convergence, "perf bench numa mem -p 1 -t 4 -P 512 -s 100 -zZ0qcm --thp  1"
  1x4-convergence,                      100.173, secs,           NUMA-convergence-latency
  1x4-convergence,                      100.173, secs,           runtime-max/thread
  1x4-convergence,                      100.026, secs,           runtime-min/thread
  1x4-convergence,                      100.067, secs,           runtime-avg/thread
  1x4-convergence,                        0.073, %,              spread-runtime/thread
  1x4-convergence,                      162.672, GB,             data/thread
  1x4-convergence,                      650.688, GB,             data-total
  1x4-convergence,                        0.616, nsecs,          runtime/byte/thread
  1x4-convergence,                        1.624, GB/sec,         thread-speed
  1x4-convergence,                        6.496, GB/sec,         total-speed

 # Running  1x6-convergence, "perf bench numa mem -p 1 -t 6 -P 1020 -s 100 -zZ0qcm --thp  1"
  1x6-convergence,                      100.821, secs,           NUMA-convergence-latency
  1x6-convergence,                      100.821, secs,           runtime-max/thread
  1x6-convergence,                      100.428, secs,           runtime-min/thread
  1x6-convergence,                      100.706, secs,           runtime-avg/thread
  1x6-convergence,                        0.195, %,              spread-runtime/thread
  1x6-convergence,                       99.111, GB,             data/thread
  1x6-convergence,                      594.668, GB,             data-total
  1x6-convergence,                        1.017, nsecs,          runtime/byte/thread
  1x6-convergence,                        0.983, GB/sec,         thread-speed
  1x6-convergence,                        5.898, GB/sec,         total-speed

 # Running  2x3-convergence, "perf bench numa mem -p 3 -t 3 -P 1020 -s 100 -zZ0qcm --thp  1"
  2x3-convergence,                      100.539, secs,           NUMA-convergence-latency
  2x3-convergence,                      100.539, secs,           runtime-max/thread
  2x3-convergence,                      100.015, secs,           runtime-min/thread
  2x3-convergence,                      100.273, secs,           runtime-avg/thread
  2x3-convergence,                        0.260, %,              spread-runtime/thread
  2x3-convergence,                      147.954, GB,             data/thread
  2x3-convergence,                     1331.587, GB,             data-total
  2x3-convergence,                        0.680, nsecs,          runtime/byte/thread
  2x3-convergence,                        1.472, GB/sec,         thread-speed
  2x3-convergence,                       13.245, GB/sec,         total-speed

 # Running  3x3-convergence, "perf bench numa mem -p 3 -t 3 -P 1020 -s 100 -zZ0qcm --thp  1"
  3x3-convergence,                      100.463, secs,           NUMA-convergence-latency
  3x3-convergence,                      100.463, secs,           runtime-max/thread
  3x3-convergence,                      100.066, secs,           runtime-min/thread
  3x3-convergence,                      100.216, secs,           runtime-avg/thread
  3x3-convergence,                        0.198, %,              spread-runtime/thread
  3x3-convergence,                      132.624, GB,             data/thread
  3x3-convergence,                     1193.615, GB,             data-total
  3x3-convergence,                        0.758, nsecs,          runtime/byte/thread
  3x3-convergence,                        1.320, GB/sec,         thread-speed
  3x3-convergence,                       11.881, GB/sec,         total-speed

 # Running  4x4-convergence, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp  1"
  4x4-convergence,                        4.119, secs,           NUMA-convergence-latency
  4x4-convergence,                        4.119, secs,           runtime-max/thread
  4x4-convergence,                        3.751, secs,           runtime-min/thread
  4x4-convergence,                        3.948, secs,           runtime-avg/thread
  4x4-convergence,                        4.462, %,              spread-runtime/thread
  4x4-convergence,                        1.980, GB,             data/thread
  4x4-convergence,                       31.675, GB,             data-total
  4x4-convergence,                        2.081, nsecs,          runtime/byte/thread
  4x4-convergence,                        0.481, GB/sec,         thread-speed
  4x4-convergence,                        7.690, GB/sec,         total-speed

 # Running  4x4-convergence-NOTHP, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp  1 --thp -1"
  4x4-convergence-NOTHP,                 12.166, secs,           NUMA-convergence-latency
  4x4-convergence-NOTHP,                 12.166, secs,           runtime-max/thread
  4x4-convergence-NOTHP,                 11.801, secs,           runtime-min/thread
  4x4-convergence-NOTHP,                 11.917, secs,           runtime-avg/thread
  4x4-convergence-NOTHP,                  1.502, %,              spread-runtime/thread
  4x4-convergence-NOTHP,                  5.234, GB,             data/thread
  4x4-convergence-NOTHP,                 83.752, GB,             data-total
  4x4-convergence-NOTHP,                  2.324, nsecs,          runtime/byte/thread
  4x4-convergence-NOTHP,                  0.430, GB/sec,         thread-speed
  4x4-convergence-NOTHP,                  6.884, GB/sec,         total-speed

 # Running  4x6-convergence, "perf bench numa mem -p 4 -t 6 -P 1020 -s 100 -zZ0qcm --thp  1"
  4x6-convergence,                       16.592, secs,           NUMA-convergence-latency
  4x6-convergence,                       16.592, secs,           runtime-max/thread
  4x6-convergence,                       15.407, secs,           runtime-min/thread
  4x6-convergence,                       16.109, secs,           runtime-avg/thread
  4x6-convergence,                        3.572, %,              spread-runtime/thread
  4x6-convergence,                        6.729, GB,             data/thread
  4x6-convergence,                      161.502, GB,             data-total
  4x6-convergence,                        2.466, nsecs,          runtime/byte/thread
  4x6-convergence,                        0.406, GB/sec,         thread-speed
  4x6-convergence,                        9.734, GB/sec,         total-speed

 # Running  4x8-convergence, "perf bench numa mem -p 4 -t 8 -P 512 -s 100 -zZ0qcm --thp  1"
  4x8-convergence,                        3.385, secs,           NUMA-convergence-latency
  4x8-convergence,                        3.385, secs,           runtime-max/thread
  4x8-convergence,                        1.465, secs,           runtime-min/thread
  4x8-convergence,                        2.846, secs,           runtime-avg/thread
  4x8-convergence,                       28.361, %,              spread-runtime/thread
  4x8-convergence,                        0.638, GB,             data/thread
  4x8-convergence,                       20.401, GB,             data-total
  4x8-convergence,                        5.309, nsecs,          runtime/byte/thread
  4x8-convergence,                        0.188, GB/sec,         thread-speed
  4x8-convergence,                        6.028, GB/sec,         total-speed

 # Running  8x4-convergence, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp  1"
  8x4-convergence,                       18.295, secs,           NUMA-convergence-latency
  8x4-convergence,                       18.295, secs,           runtime-max/thread
  8x4-convergence,                       16.808, secs,           runtime-min/thread
  8x4-convergence,                       17.809, secs,           runtime-avg/thread
  8x4-convergence,                        4.064, %,              spread-runtime/thread
  8x4-convergence,                        3.406, GB,             data/thread
  8x4-convergence,                      108.985, GB,             data-total
  8x4-convergence,                        5.372, nsecs,          runtime/byte/thread
  8x4-convergence,                        0.186, GB/sec,         thread-speed
  8x4-convergence,                        5.957, GB/sec,         total-speed

 # Running  8x4-convergence-NOTHP, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp  1 --thp -1"
  8x4-convergence-NOTHP,                 15.675, secs,           NUMA-convergence-latency
  8x4-convergence-NOTHP,                 15.675, secs,           runtime-max/thread
  8x4-convergence-NOTHP,                 14.861, secs,           runtime-min/thread
  8x4-convergence-NOTHP,                 15.321, secs,           runtime-avg/thread
  8x4-convergence-NOTHP,                  2.596, %,              spread-runtime/thread
  8x4-convergence-NOTHP,                  5.302, GB,             data/thread
  8x4-convergence-NOTHP,                169.651, GB,             data-total
  8x4-convergence-NOTHP,                  2.957, nsecs,          runtime/byte/thread
  8x4-convergence-NOTHP,                  0.338, GB/sec,         thread-speed
  8x4-convergence-NOTHP,                 10.823, GB/sec,         total-speed

 # Running  3x1-convergence, "perf bench numa mem -p 3 -t 1 -P 512 -s 100 -zZ0qcm --thp  1"
  3x1-convergence,                        0.811, secs,           NUMA-convergence-latency
  3x1-convergence,                        0.811, secs,           runtime-max/thread
  3x1-convergence,                        0.739, secs,           runtime-min/thread
  3x1-convergence,                        0.782, secs,           runtime-avg/thread
  3x1-convergence,                        4.431, %,              spread-runtime/thread
  3x1-convergence,                        1.969, GB,             data/thread
  3x1-convergence,                        5.906, GB,             data-total
  3x1-convergence,                        0.412, nsecs,          runtime/byte/thread
  3x1-convergence,                        2.428, GB/sec,         thread-speed
  3x1-convergence,                        7.284, GB/sec,         total-speed

 # Running  4x1-convergence, "perf bench numa mem -p 4 -t 1 -P 512 -s 100 -zZ0qcm --thp  1"
  4x1-convergence,                        0.806, secs,           NUMA-convergence-latency
  4x1-convergence,                        0.806, secs,           runtime-max/thread
  4x1-convergence,                        0.728, secs,           runtime-min/thread
  4x1-convergence,                        0.780, secs,           runtime-avg/thread
  4x1-convergence,                        4.838, %,              spread-runtime/thread
  4x1-convergence,                        1.476, GB,             data/thread
  4x1-convergence,                        5.906, GB,             data-total
  4x1-convergence,                        0.546, nsecs,          runtime/byte/thread
  4x1-convergence,                        1.832, GB/sec,         thread-speed
  4x1-convergence,                        7.329, GB/sec,         total-speed

 # Running  8x1-convergence, "perf bench numa mem -p 8 -t 1 -P 512 -s 100 -zZ0qcm --thp  1"
  8x1-convergence,                        2.879, secs,           NUMA-convergence-latency
  8x1-convergence,                        2.879, secs,           runtime-max/thread
  8x1-convergence,                        2.737, secs,           runtime-min/thread
  8x1-convergence,                        2.805, secs,           runtime-avg/thread
  8x1-convergence,                        2.475, %,              spread-runtime/thread
  8x1-convergence,                        3.288, GB,             data/thread
  8x1-convergence,                       26.307, GB,             data-total
  8x1-convergence,                        0.876, nsecs,          runtime/byte/thread
  8x1-convergence,                        1.142, GB/sec,         thread-speed
  8x1-convergence,                        9.137, GB/sec,         total-speed

 # Running 16x1-convergence, "perf bench numa mem -p 16 -t 1 -P 256 -s 100 -zZ0qcm --thp  1"
 16x1-convergence,                        2.484, secs,           NUMA-convergence-latency
 16x1-convergence,                        2.484, secs,           runtime-max/thread
 16x1-convergence,                        2.169, secs,           runtime-min/thread
 16x1-convergence,                        2.376, secs,           runtime-avg/thread
 16x1-convergence,                        6.353, %,              spread-runtime/thread
 16x1-convergence,                        0.906, GB,             data/thread
 16x1-convergence,                       14.496, GB,             data-total
 16x1-convergence,                        2.742, nsecs,          runtime/byte/thread
 16x1-convergence,                        0.365, GB/sec,         thread-speed
 16x1-convergence,                        5.835, GB/sec,         total-speed

 # Running 32x1-convergence, "perf bench numa mem -p 32 -t 1 -P 128 -s 100 -zZ0qcm --thp  1"
 32x1-convergence,                        3.039, secs,           NUMA-convergence-latency
 32x1-convergence,                        3.039, secs,           runtime-max/thread
 32x1-convergence,                        2.755, secs,           runtime-min/thread
 32x1-convergence,                        2.983, secs,           runtime-avg/thread
 32x1-convergence,                        4.672, %,              spread-runtime/thread
 32x1-convergence,                        0.579, GB,             data/thread
 32x1-convergence,                       18.522, GB,             data-total
 32x1-convergence,                        5.251, nsecs,          runtime/byte/thread
 32x1-convergence,                        0.190, GB/sec,         thread-speed
 32x1-convergence,                        6.094, GB/sec,         total-speed

 # Running  2x1-bw-process, "perf bench numa mem -p 2 -t 1 -P 1024 -s 20 -zZ0q --thp  1"
  2x1-bw-process,                        20.217, secs,           runtime-max/thread
  2x1-bw-process,                        20.126, secs,           runtime-min/thread
  2x1-bw-process,                        20.168, secs,           runtime-avg/thread
  2x1-bw-process,                         0.224, %,              spread-runtime/thread
  2x1-bw-process,                        81.604, GB,             data/thread
  2x1-bw-process,                       163.209, GB,             data-total
  2x1-bw-process,                         0.248, nsecs,          runtime/byte/thread
  2x1-bw-process,                         4.036, GB/sec,         thread-speed
  2x1-bw-process,                         8.073, GB/sec,         total-speed

 # Running  3x1-bw-process, "perf bench numa mem -p 3 -t 1 -P 1024 -s 20 -zZ0q --thp  1"
  3x1-bw-process,                        20.138, secs,           runtime-max/thread
  3x1-bw-process,                        20.075, secs,           runtime-min/thread
  3x1-bw-process,                        20.105, secs,           runtime-avg/thread
  3x1-bw-process,                         0.156, %,              spread-runtime/thread
  3x1-bw-process,                        84.468, GB,             data/thread
  3x1-bw-process,                       253.403, GB,             data-total
  3x1-bw-process,                         0.238, nsecs,          runtime/byte/thread
  3x1-bw-process,                         4.194, GB/sec,         thread-speed
  3x1-bw-process,                        12.583, GB/sec,         total-speed

 # Running  4x1-bw-process, "perf bench numa mem -p 4 -t 1 -P 1024 -s 20 -zZ0q --thp  1"
  4x1-bw-process,                        20.143, secs,           runtime-max/thread
  4x1-bw-process,                        20.052, secs,           runtime-min/thread
  4x1-bw-process,                        20.079, secs,           runtime-avg/thread
  4x1-bw-process,                         0.227, %,              spread-runtime/thread
  4x1-bw-process,                        62.009, GB,             data/thread
  4x1-bw-process,                       248.034, GB,             data-total
  4x1-bw-process,                         0.325, nsecs,          runtime/byte/thread
  4x1-bw-process,                         3.078, GB/sec,         thread-speed
  4x1-bw-process,                        12.313, GB/sec,         total-speed

 # Running  8x1-bw-process, "perf bench numa mem -p 8 -t 1 -P  512 -s 20 -zZ0q --thp  1"
  8x1-bw-process,                        20.109, secs,           runtime-max/thread
  8x1-bw-process,                        20.013, secs,           runtime-min/thread
  8x1-bw-process,                        20.072, secs,           runtime-avg/thread
  8x1-bw-process,                         0.238, %,              spread-runtime/thread
  8x1-bw-process,                        50.869, GB,             data/thread
  8x1-bw-process,                       406.948, GB,             data-total
  8x1-bw-process,                         0.395, nsecs,          runtime/byte/thread
  8x1-bw-process,                         2.530, GB/sec,         thread-speed
  8x1-bw-process,                        20.237, GB/sec,         total-speed

 # Running  8x1-bw-process-NOTHP, "perf bench numa mem -p 8 -t 1 -P  512 -s 20 -zZ0q --thp  1 --thp -1"
  8x1-bw-process-NOTHP,                  20.203, secs,           runtime-max/thread
  8x1-bw-process-NOTHP,                  20.033, secs,           runtime-min/thread
  8x1-bw-process-NOTHP,                  20.071, secs,           runtime-avg/thread
  8x1-bw-process-NOTHP,                   0.422, %,              spread-runtime/thread
  8x1-bw-process-NOTHP,                  45.030, GB,             data/thread
  8x1-bw-process-NOTHP,                 360.240, GB,             data-total
  8x1-bw-process-NOTHP,                   0.449, nsecs,          runtime/byte/thread
  8x1-bw-process-NOTHP,                   2.229, GB/sec,         thread-speed
  8x1-bw-process-NOTHP,                  17.831, GB/sec,         total-speed

 # Running 16x1-bw-process, "perf bench numa mem -p 16 -t 1 -P 256 -s 20 -zZ0q --thp  1"
 16x1-bw-process,                        20.271, secs,           runtime-max/thread
 16x1-bw-process,                        20.021, secs,           runtime-min/thread
 16x1-bw-process,                        20.175, secs,           runtime-avg/thread
 16x1-bw-process,                         0.615, %,              spread-runtime/thread
 16x1-bw-process,                         7.550, GB,             data/thread
 16x1-bw-process,                       120.796, GB,             data-total
 16x1-bw-process,                         2.685, nsecs,          runtime/byte/thread
 16x1-bw-process,                         0.372, GB/sec,         thread-speed
 16x1-bw-process,                         5.959, GB/sec,         total-speed

 # Running  4x1-bw-thread, "perf bench numa mem -p 1 -t 4 -T 256 -s 20 -zZ0q --thp  1"
  4x1-bw-thread,                         20.052, secs,           runtime-max/thread
  4x1-bw-thread,                         20.013, secs,           runtime-min/thread
  4x1-bw-thread,                         20.030, secs,           runtime-avg/thread
  4x1-bw-thread,                          0.097, %,              spread-runtime/thread
  4x1-bw-thread,                         87.443, GB,             data/thread
  4x1-bw-thread,                        349.771, GB,             data-total
  4x1-bw-thread,                          0.229, nsecs,          runtime/byte/thread
  4x1-bw-thread,                          4.361, GB/sec,         thread-speed
  4x1-bw-thread,                         17.443, GB/sec,         total-speed

 # Running  8x1-bw-thread, "perf bench numa mem -p 1 -t 8 -T 256 -s 20 -zZ0q --thp  1"
  8x1-bw-thread,                         20.067, secs,           runtime-max/thread
  8x1-bw-thread,                         20.011, secs,           runtime-min/thread
  8x1-bw-thread,                         20.038, secs,           runtime-avg/thread
  8x1-bw-thread,                          0.140, %,              spread-runtime/thread
  8x1-bw-thread,                         56.271, GB,             data/thread
  8x1-bw-thread,                        450.166, GB,             data-total
  8x1-bw-thread,                          0.357, nsecs,          runtime/byte/thread
  8x1-bw-thread,                          2.804, GB/sec,         thread-speed
  8x1-bw-thread,                         22.433, GB/sec,         total-speed

 # Running 16x1-bw-thread, "perf bench numa mem -p 1 -t 16 -T 128 -s 20 -zZ0q --thp  1"
 16x1-bw-thread,                         20.029, secs,           runtime-max/thread
 16x1-bw-thread,                         20.002, secs,           runtime-min/thread
 16x1-bw-thread,                         20.020, secs,           runtime-avg/thread
 16x1-bw-thread,                          0.067, %,              spread-runtime/thread
 16x1-bw-thread,                         25.292, GB,             data/thread
 16x1-bw-thread,                        404.666, GB,             data-total
 16x1-bw-thread,                          0.792, nsecs,          runtime/byte/thread
 16x1-bw-thread,                          1.263, GB/sec,         thread-speed
 16x1-bw-thread,                         20.204, GB/sec,         total-speed

 # Running 32x1-bw-thread, "perf bench numa mem -p 1 -t 32 -T 64 -s 20 -zZ0q --thp  1"
 32x1-bw-thread,                         19.989, secs,           runtime-max/thread
 32x1-bw-thread,                         19.962, secs,           runtime-min/thread
 32x1-bw-thread,                         20.004, secs,           runtime-avg/thread
 32x1-bw-thread,                          0.068, %,              spread-runtime/thread
 32x1-bw-thread,                         11.388, GB,             data/thread
 32x1-bw-thread,                        364.401, GB,             data-total
 32x1-bw-thread,                          1.755, nsecs,          runtime/byte/thread
 32x1-bw-thread,                          0.570, GB/sec,         thread-speed
 32x1-bw-thread,                         18.230, GB/sec,         total-speed

 # Running  2x3-bw-thread, "perf bench numa mem -p 2 -t 3 -P 512 -s 20 -zZ0q --thp  1"
  2x3-bw-thread,                         20.190, secs,           runtime-max/thread
  2x3-bw-thread,                         20.082, secs,           runtime-min/thread
  2x3-bw-thread,                         20.110, secs,           runtime-avg/thread
  2x3-bw-thread,                          0.268, %,              spread-runtime/thread
  2x3-bw-thread,                         49.303, GB,             data/thread
  2x3-bw-thread,                        295.816, GB,             data-total
  2x3-bw-thread,                          0.410, nsecs,          runtime/byte/thread
  2x3-bw-thread,                          2.442, GB/sec,         thread-speed
  2x3-bw-thread,                         14.652, GB/sec,         total-speed

 # Running  4x4-bw-thread, "perf bench numa mem -p 4 -t 4 -P 512 -s 20 -zZ0q --thp  1"
  4x4-bw-thread,                         20.307, secs,           runtime-max/thread
  4x4-bw-thread,                         20.002, secs,           runtime-min/thread
  4x4-bw-thread,                         20.202, secs,           runtime-avg/thread
  4x4-bw-thread,                          0.750, %,              spread-runtime/thread
  4x4-bw-thread,                         12.482, GB,             data/thread
  4x4-bw-thread,                        199.716, GB,             data-total
  4x4-bw-thread,                          1.627, nsecs,          runtime/byte/thread
  4x4-bw-thread,                          0.615, GB/sec,         thread-speed
  4x4-bw-thread,                          9.835, GB/sec,         total-speed

 # Running  4x6-bw-thread, "perf bench numa mem -p 4 -t 6 -P 512 -s 20 -zZ0q --thp  1"
  4x6-bw-thread,                         20.431, secs,           runtime-max/thread
  4x6-bw-thread,                         20.007, secs,           runtime-min/thread
  4x6-bw-thread,                         20.283, secs,           runtime-avg/thread
  4x6-bw-thread,                          1.036, %,              spread-runtime/thread
  4x6-bw-thread,                         13.086, GB,             data/thread
  4x6-bw-thread,                        314.069, GB,             data-total
  4x6-bw-thread,                          1.561, nsecs,          runtime/byte/thread
  4x6-bw-thread,                          0.641, GB/sec,         thread-speed
  4x6-bw-thread,                         15.372, GB/sec,         total-speed

 # Running  4x8-bw-thread, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp  1"
  4x8-bw-thread,                         20.543, secs,           runtime-max/thread
  4x8-bw-thread,                         20.015, secs,           runtime-min/thread
  4x8-bw-thread,                         20.324, secs,           runtime-avg/thread
  4x8-bw-thread,                          1.287, %,              spread-runtime/thread
  4x8-bw-thread,                          7.617, GB,             data/thread
  4x8-bw-thread,                        243.739, GB,             data-total
  4x8-bw-thread,                          2.697, nsecs,          runtime/byte/thread
  4x8-bw-thread,                          0.371, GB/sec,         thread-speed
  4x8-bw-thread,                         11.865, GB/sec,         total-speed

 # Running  4x8-bw-thread-NOTHP, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp  1 --thp -1"
  4x8-bw-thread-NOTHP,                   20.661, secs,           runtime-max/thread
  4x8-bw-thread-NOTHP,                   20.023, secs,           runtime-min/thread
  4x8-bw-thread-NOTHP,                   20.292, secs,           runtime-avg/thread
  4x8-bw-thread-NOTHP,                    1.546, %,              spread-runtime/thread
  4x8-bw-thread-NOTHP,                    5.956, GB,             data/thread
  4x8-bw-thread-NOTHP,                  190.589, GB,             data-total
  4x8-bw-thread-NOTHP,                    3.469, nsecs,          runtime/byte/thread
  4x8-bw-thread-NOTHP,                    0.288, GB/sec,         thread-speed
  4x8-bw-thread-NOTHP,                    9.224, GB/sec,         total-speed

 # Running  3x3-bw-thread, "perf bench numa mem -p 3 -t 3 -P 512 -s 20 -zZ0q --thp  1"
  3x3-bw-thread,                         20.310, secs,           runtime-max/thread
  3x3-bw-thread,                         20.116, secs,           runtime-min/thread
  3x3-bw-thread,                         20.202, secs,           runtime-avg/thread
  3x3-bw-thread,                          0.480, %,              spread-runtime/thread
  3x3-bw-thread,                         14.973, GB,             data/thread
  3x3-bw-thread,                        134.755, GB,             data-total
  3x3-bw-thread,                          1.356, nsecs,          runtime/byte/thread
  3x3-bw-thread,                          0.737, GB/sec,         thread-speed
  3x3-bw-thread,                          6.635, GB/sec,         total-speed

 # Running  5x5-bw-thread, "perf bench numa mem -p 5 -t 5 -P 512 -s 20 -zZ0q --thp  1"
  5x5-bw-thread,                         20.578, secs,           runtime-max/thread
  5x5-bw-thread,                         20.039, secs,           runtime-min/thread
  5x5-bw-thread,                         20.379, secs,           runtime-avg/thread
  5x5-bw-thread,                          1.309, %,              spread-runtime/thread
  5x5-bw-thread,                          7.881, GB,             data/thread
  5x5-bw-thread,                        197.032, GB,             data-total
  5x5-bw-thread,                          2.611, nsecs,          runtime/byte/thread
  5x5-bw-thread,                          0.383, GB/sec,         thread-speed
  5x5-bw-thread,                          9.575, GB/sec,         total-speed

 # Running 2x16-bw-thread, "perf bench numa mem -p 2 -t 16 -P 512 -s 20 -zZ0q --thp  1"
 2x16-bw-thread,                         21.581, secs,           runtime-max/thread
 2x16-bw-thread,                         20.043, secs,           runtime-min/thread
 2x16-bw-thread,                         20.958, secs,           runtime-avg/thread
 2x16-bw-thread,                          3.564, %,              spread-runtime/thread
 2x16-bw-thread,                          4.010, GB,             data/thread
 2x16-bw-thread,                        128.312, GB,             data-total
 2x16-bw-thread,                          5.382, nsecs,          runtime/byte/thread
 2x16-bw-thread,                          0.186, GB/sec,         thread-speed
 2x16-bw-thread,                          5.945, GB/sec,         total-speed

 # Running 1x32-bw-thread, "perf bench numa mem -p 1 -t 32 -P 2048 -s 20 -zZ0q --thp  1"
 1x32-bw-thread,                         23.503, secs,           runtime-max/thread
 1x32-bw-thread,                         21.850, secs,           runtime-min/thread
 1x32-bw-thread,                         22.953, secs,           runtime-avg/thread
 1x32-bw-thread,                          3.518, %,              spread-runtime/thread
 1x32-bw-thread,                          4.295, GB,             data/thread
 1x32-bw-thread,                        137.439, GB,             data-total
 1x32-bw-thread,                          5.472, nsecs,          runtime/byte/thread
 1x32-bw-thread,                          0.183, GB/sec,         thread-speed
 1x32-bw-thread,                          5.848, GB/sec,         total-speed

 # Running numa02-bw, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp  1"
 numa02-bw,                              19.948, secs,           runtime-max/thread
 numa02-bw,                              19.921, secs,           runtime-min/thread
 numa02-bw,                              19.983, secs,           runtime-avg/thread
 numa02-bw,                               0.068, %,              spread-runtime/thread
 numa02-bw,                              15.425, GB,             data/thread
 numa02-bw,                             493.586, GB,             data-total
 numa02-bw,                               1.293, nsecs,          runtime/byte/thread
 numa02-bw,                               0.773, GB/sec,         thread-speed
 numa02-bw,                              24.744, GB/sec,         total-speed

 # Running numa02-bw-NOTHP, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp  1 --thp -1"
 numa02-bw-NOTHP,                        20.055, secs,           runtime-max/thread
 numa02-bw-NOTHP,                        19.948, secs,           runtime-min/thread
 numa02-bw-NOTHP,                        19.991, secs,           runtime-avg/thread
 numa02-bw-NOTHP,                         0.267, %,              spread-runtime/thread
 numa02-bw-NOTHP,                        12.795, GB,             data/thread
 numa02-bw-NOTHP,                       409.431, GB,             data-total
 numa02-bw-NOTHP,                         1.567, nsecs,          runtime/byte/thread
 numa02-bw-NOTHP,                         0.638, GB/sec,         thread-speed
 numa02-bw-NOTHP,                        20.415, GB/sec,         total-speed

 # Running numa01-bw-thread, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp  1"
 numa01-bw-thread,                       20.107, secs,           runtime-max/thread
 numa01-bw-thread,                       19.978, secs,           runtime-min/thread
 numa01-bw-thread,                       20.067, secs,           runtime-avg/thread
 numa01-bw-thread,                        0.320, %,              spread-runtime/thread
 numa01-bw-thread,                        9.532, GB,             data/thread
 numa01-bw-thread,                      305.010, GB,             data-total
 numa01-bw-thread,                        2.110, nsecs,          runtime/byte/thread
 numa01-bw-thread,                        0.474, GB/sec,         thread-speed
 numa01-bw-thread,                       15.169, GB/sec,         total-speed

 # Running numa01-bw-thread-NOTHP, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp  1 --thp -1"
 numa01-bw-thread-NOTHP,                 20.319, secs,           runtime-max/thread
 numa01-bw-thread-NOTHP,                 19.978, secs,           runtime-min/thread
 numa01-bw-thread-NOTHP,                 20.076, secs,           runtime-avg/thread
 numa01-bw-thread-NOTHP,                  0.839, %,              spread-runtime/thread
 numa01-bw-thread-NOTHP,                  7.688, GB,             data/thread
 numa01-bw-thread-NOTHP,                246.021, GB,             data-total
 numa01-bw-thread-NOTHP,                  2.643, nsecs,          runtime/byte/thread
 numa01-bw-thread-NOTHP,                  0.378, GB/sec,         thread-speed
 numa01-bw-thread-NOTHP,                 12.108, GB/sec,         total-speed

 #
 # Running test on: Linux vega 3.7.0-rc8+ #2 SMP Fri Dec 7 02:46:02 CET 2012 x86_64 x86_64 x86_64 GNU/Linux
 #
# Running numa/mem benchmark...

 # Running main, "perf bench numa mem -a"

 # Running RAM-bw-local, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp  1 --no-data_rand_walk"
 RAM-bw-local,                           20.132, secs,           runtime-max/thread
 RAM-bw-local,                           20.123, secs,           runtime-min/thread
 RAM-bw-local,                           20.123, secs,           runtime-avg/thread
 RAM-bw-local,                            0.024, %,              spread-runtime/thread
 RAM-bw-local,                          171.799, GB,             data/thread
 RAM-bw-local,                          171.799, GB,             data-total
 RAM-bw-local,                            0.117, nsecs,          runtime/byte/thread
 RAM-bw-local,                            8.534, GB/sec,         thread-speed
 RAM-bw-local,                            8.534, GB/sec,         total-speed

 # Running RAM-bw-local-NOTHP, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp  1 --no-data_rand_walk --thp -1"
 RAM-bw-local-NOTHP,                     20.133, secs,           runtime-max/thread
 RAM-bw-local-NOTHP,                     20.047, secs,           runtime-min/thread
 RAM-bw-local-NOTHP,                     20.047, secs,           runtime-avg/thread
 RAM-bw-local-NOTHP,                      0.214, %,              spread-runtime/thread
 RAM-bw-local-NOTHP,                    169.651, GB,             data/thread
 RAM-bw-local-NOTHP,                    169.651, GB,             data-total
 RAM-bw-local-NOTHP,                      0.119, nsecs,          runtime/byte/thread
 RAM-bw-local-NOTHP,                      8.427, GB/sec,         thread-speed
 RAM-bw-local-NOTHP,                      8.427, GB/sec,         total-speed

 # Running RAM-bw-remote, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 1 -s 20 -zZq --thp  1 --no-data_rand_walk"
 RAM-bw-remote,                          20.127, secs,           runtime-max/thread
 RAM-bw-remote,                          20.117, secs,           runtime-min/thread
 RAM-bw-remote,                          20.117, secs,           runtime-avg/thread
 RAM-bw-remote,                           0.025, %,              spread-runtime/thread
 RAM-bw-remote,                         134.218, GB,             data/thread
 RAM-bw-remote,                         134.218, GB,             data-total
 RAM-bw-remote,                           0.150, nsecs,          runtime/byte/thread
 RAM-bw-remote,                           6.669, GB/sec,         thread-speed
 RAM-bw-remote,                           6.669, GB/sec,         total-speed

 # Running RAM-bw-local-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 0x2 -s 20 -zZq --thp  1 --no-data_rand_walk"
 RAM-bw-local-2x,                        20.139, secs,           runtime-max/thread
 RAM-bw-local-2x,                        20.011, secs,           runtime-min/thread
 RAM-bw-local-2x,                        20.070, secs,           runtime-avg/thread
 RAM-bw-local-2x,                         0.319, %,              spread-runtime/thread
 RAM-bw-local-2x,                       130.997, GB,             data/thread
 RAM-bw-local-2x,                       261.993, GB,             data-total
 RAM-bw-local-2x,                         0.154, nsecs,          runtime/byte/thread
 RAM-bw-local-2x,                         6.505, GB/sec,         thread-speed
 RAM-bw-local-2x,                        13.009, GB/sec,         total-speed

 # Running RAM-bw-remote-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 1x2 -s 20 -zZq --thp  1 --no-data_rand_walk"
 RAM-bw-remote-2x,                       20.177, secs,           runtime-max/thread
 RAM-bw-remote-2x,                       20.083, secs,           runtime-min/thread
 RAM-bw-remote-2x,                       20.125, secs,           runtime-avg/thread
 RAM-bw-remote-2x,                        0.233, %,              spread-runtime/thread
 RAM-bw-remote-2x,                       74.088, GB,             data/thread
 RAM-bw-remote-2x,                      148.176, GB,             data-total
 RAM-bw-remote-2x,                        0.272, nsecs,          runtime/byte/thread
 RAM-bw-remote-2x,                        3.672, GB/sec,         thread-speed
 RAM-bw-remote-2x,                        7.344, GB/sec,         total-speed

 # Running RAM-bw-cross, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,8 -M 1,0 -s 20 -zZq --thp  1 --no-data_rand_walk"
 RAM-bw-cross,                           20.122, secs,           runtime-max/thread
 RAM-bw-cross,                           20.094, secs,           runtime-min/thread
 RAM-bw-cross,                           20.103, secs,           runtime-avg/thread
 RAM-bw-cross,                            0.070, %,              spread-runtime/thread
 RAM-bw-cross,                          121.870, GB,             data/thread
 RAM-bw-cross,                          243.739, GB,             data-total
 RAM-bw-cross,                            0.165, nsecs,          runtime/byte/thread
 RAM-bw-cross,                            6.057, GB/sec,         thread-speed
 RAM-bw-cross,                           12.113, GB/sec,         total-speed

 # Running  1x3-convergence, "perf bench numa mem -p 1 -t 3 -P 512 -s 100 -zZ0qcm --thp  1"
  1x3-convergence,                        2.333, secs,           NUMA-convergence-latency
  1x3-convergence,                        2.333, secs,           runtime-max/thread
  1x3-convergence,                        2.304, secs,           runtime-min/thread
  1x3-convergence,                        2.313, secs,           runtime-avg/thread
  1x3-convergence,                        0.620, %,              spread-runtime/thread
  1x3-convergence,                        7.516, GB,             data/thread
  1x3-convergence,                       22.549, GB,             data-total
  1x3-convergence,                        0.310, nsecs,          runtime/byte/thread
  1x3-convergence,                        3.222, GB/sec,         thread-speed
  1x3-convergence,                        9.665, GB/sec,         total-speed

 # Running  1x4-convergence, "perf bench numa mem -p 1 -t 4 -P 512 -s 100 -zZ0qcm --thp  1"
  1x4-convergence,                        2.057, secs,           NUMA-convergence-latency
  1x4-convergence,                        2.057, secs,           runtime-max/thread
  1x4-convergence,                        1.958, secs,           runtime-min/thread
  1x4-convergence,                        1.998, secs,           runtime-avg/thread
  1x4-convergence,                        2.403, %,              spread-runtime/thread
  1x4-convergence,                        4.429, GB,             data/thread
  1x4-convergence,                       17.717, GB,             data-total
  1x4-convergence,                        0.464, nsecs,          runtime/byte/thread
  1x4-convergence,                        2.154, GB/sec,         thread-speed
  1x4-convergence,                        8.614, GB/sec,         total-speed

 # Running  1x6-convergence, "perf bench numa mem -p 1 -t 6 -P 1020 -s 100 -zZ0qcm --thp  1"
  1x6-convergence,                        7.327, secs,           NUMA-convergence-latency
  1x6-convergence,                        7.327, secs,           runtime-max/thread
  1x6-convergence,                        6.879, secs,           runtime-min/thread
  1x6-convergence,                        7.187, secs,           runtime-avg/thread
  1x6-convergence,                        3.063, %,              spread-runtime/thread
  1x6-convergence,                       11.052, GB,             data/thread
  1x6-convergence,                       66.312, GB,             data-total
  1x6-convergence,                        0.663, nsecs,          runtime/byte/thread
  1x6-convergence,                        1.508, GB/sec,         thread-speed
  1x6-convergence,                        9.050, GB/sec,         total-speed

 # Running  2x3-convergence, "perf bench numa mem -p 3 -t 3 -P 1020 -s 100 -zZ0qcm --thp  1"
  2x3-convergence,                        4.086, secs,           NUMA-convergence-latency
  2x3-convergence,                        4.086, secs,           runtime-max/thread
  2x3-convergence,                        3.779, secs,           runtime-min/thread
  2x3-convergence,                        3.960, secs,           runtime-avg/thread
  2x3-convergence,                        3.761, %,              spread-runtime/thread
  2x3-convergence,                        6.774, GB,             data/thread
  2x3-convergence,                       60.964, GB,             data-total
  2x3-convergence,                        0.603, nsecs,          runtime/byte/thread
  2x3-convergence,                        1.658, GB/sec,         thread-speed
  2x3-convergence,                       14.920, GB/sec,         total-speed

 # Running  3x3-convergence, "perf bench numa mem -p 3 -t 3 -P 1020 -s 100 -zZ0qcm --thp  1"
  3x3-convergence,                        7.627, secs,           NUMA-convergence-latency
  3x3-convergence,                        7.627, secs,           runtime-max/thread
  3x3-convergence,                        7.380, secs,           runtime-min/thread
  3x3-convergence,                        7.504, secs,           runtime-avg/thread
  3x3-convergence,                        1.624, %,              spread-runtime/thread
  3x3-convergence,                       15.093, GB,             data/thread
  3x3-convergence,                      135.833, GB,             data-total
  3x3-convergence,                        0.505, nsecs,          runtime/byte/thread
  3x3-convergence,                        1.979, GB/sec,         thread-speed
  3x3-convergence,                       17.809, GB/sec,         total-speed

 # Running  4x4-convergence, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp  1"
  4x4-convergence,                        7.381, secs,           NUMA-convergence-latency
  4x4-convergence,                        7.381, secs,           runtime-max/thread
  4x4-convergence,                        7.149, secs,           runtime-min/thread
  4x4-convergence,                        7.277, secs,           runtime-avg/thread
  4x4-convergence,                        1.569, %,              spread-runtime/thread
  4x4-convergence,                        7.181, GB,             data/thread
  4x4-convergence,                      114.890, GB,             data-total
  4x4-convergence,                        1.028, nsecs,          runtime/byte/thread
  4x4-convergence,                        0.973, GB/sec,         thread-speed
  4x4-convergence,                       15.566, GB/sec,         total-speed

 # Running  4x4-convergence-NOTHP, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp  1 --thp -1"
  4x4-convergence-NOTHP,                  9.200, secs,           NUMA-convergence-latency
  4x4-convergence-NOTHP,                  9.200, secs,           runtime-max/thread
  4x4-convergence-NOTHP,                  8.944, secs,           runtime-min/thread
  4x4-convergence-NOTHP,                  9.047, secs,           runtime-avg/thread
  4x4-convergence-NOTHP,                  1.391, %,              spread-runtime/thread
  4x4-convergence-NOTHP,                 11.778, GB,             data/thread
  4x4-convergence-NOTHP,                188.442, GB,             data-total
  4x4-convergence-NOTHP,                  0.781, nsecs,          runtime/byte/thread
  4x4-convergence-NOTHP,                  1.280, GB/sec,         thread-speed
  4x4-convergence-NOTHP,                 20.483, GB/sec,         total-speed

 # Running  4x6-convergence, "perf bench numa mem -p 4 -t 6 -P 1020 -s 100 -zZ0qcm --thp  1"
  4x6-convergence,                       11.664, secs,           NUMA-convergence-latency
  4x6-convergence,                       11.664, secs,           runtime-max/thread
  4x6-convergence,                       11.155, secs,           runtime-min/thread
  4x6-convergence,                       11.420, secs,           runtime-avg/thread
  4x6-convergence,                        2.180, %,              spread-runtime/thread
  4x6-convergence,                       11.319, GB,             data/thread
  4x6-convergence,                      271.665, GB,             data-total
  4x6-convergence,                        1.030, nsecs,          runtime/byte/thread
  4x6-convergence,                        0.970, GB/sec,         thread-speed
  4x6-convergence,                       23.292, GB/sec,         total-speed

 # Running  4x8-convergence, "perf bench numa mem -p 4 -t 8 -P 512 -s 100 -zZ0qcm --thp  1"
  4x8-convergence,                        3.880, secs,           NUMA-convergence-latency
  4x8-convergence,                        3.880, secs,           runtime-max/thread
  4x8-convergence,                        3.613, secs,           runtime-min/thread
  4x8-convergence,                        3.784, secs,           runtime-avg/thread
  4x8-convergence,                        3.440, %,              spread-runtime/thread
  4x8-convergence,                        2.047, GB,             data/thread
  4x8-convergence,                       65.498, GB,             data-total
  4x8-convergence,                        1.896, nsecs,          runtime/byte/thread
  4x8-convergence,                        0.528, GB/sec,         thread-speed
  4x8-convergence,                       16.882, GB/sec,         total-speed

 # Running  8x4-convergence, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp  1"
  8x4-convergence,                        8.938, secs,           NUMA-convergence-latency
  8x4-convergence,                        8.938, secs,           runtime-max/thread
  8x4-convergence,                        8.556, secs,           runtime-min/thread
  8x4-convergence,                        8.744, secs,           runtime-avg/thread
  8x4-convergence,                        2.135, %,              spread-runtime/thread
  8x4-convergence,                        4.396, GB,             data/thread
  8x4-convergence,                      140.660, GB,             data-total
  8x4-convergence,                        2.033, nsecs,          runtime/byte/thread
  8x4-convergence,                        0.492, GB/sec,         thread-speed
  8x4-convergence,                       15.738, GB/sec,         total-speed

 # Running  8x4-convergence-NOTHP, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp  1 --thp -1"
  8x4-convergence-NOTHP,                 12.123, secs,           NUMA-convergence-latency
  8x4-convergence-NOTHP,                 12.123, secs,           runtime-max/thread
  8x4-convergence-NOTHP,                 11.749, secs,           runtime-min/thread
  8x4-convergence-NOTHP,                 11.936, secs,           runtime-avg/thread
  8x4-convergence-NOTHP,                  1.542, %,              spread-runtime/thread
  8x4-convergence-NOTHP,                  4.480, GB,             data/thread
  8x4-convergence-NOTHP,                143.345, GB,             data-total
  8x4-convergence-NOTHP,                  2.706, nsecs,          runtime/byte/thread
  8x4-convergence-NOTHP,                  0.370, GB/sec,         thread-speed
  8x4-convergence-NOTHP,                 11.824, GB/sec,         total-speed

 # Running  3x1-convergence, "perf bench numa mem -p 3 -t 1 -P 512 -s 100 -zZ0qcm --thp  1"
  3x1-convergence,                        0.879, secs,           NUMA-convergence-latency
  3x1-convergence,                        0.879, secs,           runtime-max/thread
  3x1-convergence,                        0.810, secs,           runtime-min/thread
  3x1-convergence,                        0.839, secs,           runtime-avg/thread
  3x1-convergence,                        3.911, %,              spread-runtime/thread
  3x1-convergence,                        2.326, GB,             data/thread
  3x1-convergence,                        6.979, GB,             data-total
  3x1-convergence,                        0.378, nsecs,          runtime/byte/thread
  3x1-convergence,                        2.647, GB/sec,         thread-speed
  3x1-convergence,                        7.941, GB/sec,         total-speed

 # Running  4x1-convergence, "perf bench numa mem -p 4 -t 1 -P 512 -s 100 -zZ0qcm --thp  1"
  4x1-convergence,                        0.685, secs,           NUMA-convergence-latency
  4x1-convergence,                        0.685, secs,           runtime-max/thread
  4x1-convergence,                        0.617, secs,           runtime-min/thread
  4x1-convergence,                        0.650, secs,           runtime-avg/thread
  4x1-convergence,                        4.967, %,              spread-runtime/thread
  4x1-convergence,                        1.476, GB,             data/thread
  4x1-convergence,                        5.906, GB,             data-total
  4x1-convergence,                        0.464, nsecs,          runtime/byte/thread
  4x1-convergence,                        2.154, GB/sec,         thread-speed
  4x1-convergence,                        8.616, GB/sec,         total-speed

 # Running  8x1-convergence, "perf bench numa mem -p 8 -t 1 -P 512 -s 100 -zZ0qcm --thp  1"
  8x1-convergence,                        1.158, secs,           NUMA-convergence-latency
  8x1-convergence,                        1.158, secs,           runtime-max/thread
  8x1-convergence,                        1.010, secs,           runtime-min/thread
  8x1-convergence,                        1.060, secs,           runtime-avg/thread
  8x1-convergence,                        6.396, %,              spread-runtime/thread
  8x1-convergence,                        1.745, GB,             data/thread
  8x1-convergence,                       13.959, GB,             data-total
  8x1-convergence,                        0.664, nsecs,          runtime/byte/thread
  8x1-convergence,                        1.507, GB/sec,         thread-speed
  8x1-convergence,                       12.054, GB/sec,         total-speed

 # Running 16x1-convergence, "perf bench numa mem -p 16 -t 1 -P 256 -s 100 -zZ0qcm --thp  1"
 16x1-convergence,                        2.010, secs,           NUMA-convergence-latency
 16x1-convergence,                        2.010, secs,           runtime-max/thread
 16x1-convergence,                        1.939, secs,           runtime-min/thread
 16x1-convergence,                        1.991, secs,           runtime-avg/thread
 16x1-convergence,                        1.760, %,              spread-runtime/thread
 16x1-convergence,                        2.668, GB,             data/thread
 16x1-convergence,                       42.681, GB,             data-total
 16x1-convergence,                        0.753, nsecs,          runtime/byte/thread
 16x1-convergence,                        1.327, GB/sec,         thread-speed
 16x1-convergence,                       21.237, GB/sec,         total-speed

 # Running 32x1-convergence, "perf bench numa mem -p 32 -t 1 -P 128 -s 100 -zZ0qcm --thp  1"
 32x1-convergence,                        1.946, secs,           NUMA-convergence-latency
 32x1-convergence,                        1.946, secs,           runtime-max/thread
 32x1-convergence,                        1.850, secs,           runtime-min/thread
 32x1-convergence,                        1.946, secs,           runtime-avg/thread
 32x1-convergence,                        2.479, %,              spread-runtime/thread
 32x1-convergence,                        1.242, GB,             data/thread
 32x1-convergence,                       39.728, GB,             data-total
 32x1-convergence,                        1.568, nsecs,          runtime/byte/thread
 32x1-convergence,                        0.638, GB/sec,         thread-speed
 32x1-convergence,                       20.410, GB/sec,         total-speed

 # Running  2x1-bw-process, "perf bench numa mem -p 2 -t 1 -P 1024 -s 20 -zZ0q --thp  1"
  2x1-bw-process,                        20.146, secs,           runtime-max/thread
  2x1-bw-process,                        20.068, secs,           runtime-min/thread
  2x1-bw-process,                        20.102, secs,           runtime-avg/thread
  2x1-bw-process,                         0.193, %,              spread-runtime/thread
  2x1-bw-process,                        97.174, GB,             data/thread
  2x1-bw-process,                       194.347, GB,             data-total
  2x1-bw-process,                         0.207, nsecs,          runtime/byte/thread
  2x1-bw-process,                         4.824, GB/sec,         thread-speed
  2x1-bw-process,                         9.647, GB/sec,         total-speed

 # Running  3x1-bw-process, "perf bench numa mem -p 3 -t 1 -P 1024 -s 20 -zZ0q --thp  1"
  3x1-bw-process,                        20.177, secs,           runtime-max/thread
  3x1-bw-process,                        20.127, secs,           runtime-min/thread
  3x1-bw-process,                        20.146, secs,           runtime-avg/thread
  3x1-bw-process,                         0.126, %,              spread-runtime/thread
  3x1-bw-process,                        97.711, GB,             data/thread
  3x1-bw-process,                       293.132, GB,             data-total
  3x1-bw-process,                         0.207, nsecs,          runtime/byte/thread
  3x1-bw-process,                         4.843, GB/sec,         thread-speed
  3x1-bw-process,                        14.528, GB/sec,         total-speed

 # Running  4x1-bw-process, "perf bench numa mem -p 4 -t 1 -P 1024 -s 20 -zZ0q --thp  1"
  4x1-bw-process,                        20.165, secs,           runtime-max/thread
  4x1-bw-process,                        20.025, secs,           runtime-min/thread
  4x1-bw-process,                        20.078, secs,           runtime-avg/thread
  4x1-bw-process,                         0.348, %,              spread-runtime/thread
  4x1-bw-process,                        95.295, GB,             data/thread
  4x1-bw-process,                       381.178, GB,             data-total
  4x1-bw-process,                         0.212, nsecs,          runtime/byte/thread
  4x1-bw-process,                         4.726, GB/sec,         thread-speed
  4x1-bw-process,                        18.903, GB/sec,         total-speed

 # Running  8x1-bw-process, "perf bench numa mem -p 8 -t 1 -P  512 -s 20 -zZ0q --thp  1"
  8x1-bw-process,                        20.131, secs,           runtime-max/thread
  8x1-bw-process,                        20.066, secs,           runtime-min/thread
  8x1-bw-process,                        20.090, secs,           runtime-avg/thread
  8x1-bw-process,                         0.161, %,              spread-runtime/thread
  8x1-bw-process,                        67.512, GB,             data/thread
  8x1-bw-process,                       540.092, GB,             data-total
  8x1-bw-process,                         0.298, nsecs,          runtime/byte/thread
  8x1-bw-process,                         3.354, GB/sec,         thread-speed
  8x1-bw-process,                        26.829, GB/sec,         total-speed

 # Running  8x1-bw-process-NOTHP, "perf bench numa mem -p 8 -t 1 -P  512 -s 20 -zZ0q --thp  1 --thp -1"
  8x1-bw-process-NOTHP,                  20.208, secs,           runtime-max/thread
  8x1-bw-process-NOTHP,                  20.002, secs,           runtime-min/thread
  8x1-bw-process-NOTHP,                  20.067, secs,           runtime-avg/thread
  8x1-bw-process-NOTHP,                   0.509, %,              spread-runtime/thread
  8x1-bw-process-NOTHP,                  56.170, GB,             data/thread
  8x1-bw-process-NOTHP,                 449.361, GB,             data-total
  8x1-bw-process-NOTHP,                   0.360, nsecs,          runtime/byte/thread
  8x1-bw-process-NOTHP,                   2.780, GB/sec,         thread-speed
  8x1-bw-process-NOTHP,                  22.237, GB/sec,         total-speed

 # Running 16x1-bw-process, "perf bench numa mem -p 16 -t 1 -P 256 -s 20 -zZ0q --thp  1"
 16x1-bw-process,                        20.068, secs,           runtime-max/thread
 16x1-bw-process,                        20.014, secs,           runtime-min/thread
 16x1-bw-process,                        20.042, secs,           runtime-avg/thread
 16x1-bw-process,                         0.136, %,              spread-runtime/thread
 16x1-bw-process,                        36.742, GB,             data/thread
 16x1-bw-process,                       587.874, GB,             data-total
 16x1-bw-process,                         0.546, nsecs,          runtime/byte/thread
 16x1-bw-process,                         1.831, GB/sec,         thread-speed
 16x1-bw-process,                        29.294, GB/sec,         total-speed

 # Running  4x1-bw-thread, "perf bench numa mem -p 1 -t 4 -T 256 -s 20 -zZ0q --thp  1"
  4x1-bw-thread,                         20.053, secs,           runtime-max/thread
  4x1-bw-thread,                         20.003, secs,           runtime-min/thread
  4x1-bw-thread,                         20.025, secs,           runtime-avg/thread
  4x1-bw-thread,                          0.123, %,              spread-runtime/thread
  4x1-bw-thread,                         96.704, GB,             data/thread
  4x1-bw-thread,                        386.815, GB,             data-total
  4x1-bw-thread,                          0.207, nsecs,          runtime/byte/thread
  4x1-bw-thread,                          4.822, GB/sec,         thread-speed
  4x1-bw-thread,                         19.290, GB/sec,         total-speed

 # Running  8x1-bw-thread, "perf bench numa mem -p 1 -t 8 -T 256 -s 20 -zZ0q --thp  1"
  8x1-bw-thread,                         20.068, secs,           runtime-max/thread
  8x1-bw-thread,                         20.004, secs,           runtime-min/thread
  8x1-bw-thread,                         20.031, secs,           runtime-avg/thread
  8x1-bw-thread,                          0.160, %,              spread-runtime/thread
  8x1-bw-thread,                         66.203, GB,             data/thread
  8x1-bw-thread,                        529.623, GB,             data-total
  8x1-bw-thread,                          0.303, nsecs,          runtime/byte/thread
  8x1-bw-thread,                          3.299, GB/sec,         thread-speed
  8x1-bw-thread,                         26.391, GB/sec,         total-speed

 # Running 16x1-bw-thread, "perf bench numa mem -p 1 -t 16 -T 128 -s 20 -zZ0q --thp  1"
 16x1-bw-thread,                         20.044, secs,           runtime-max/thread
 16x1-bw-thread,                         20.007, secs,           runtime-min/thread
 16x1-bw-thread,                         20.029, secs,           runtime-avg/thread
 16x1-bw-thread,                          0.092, %,              spread-runtime/thread
 16x1-bw-thread,                         37.027, GB,             data/thread
 16x1-bw-thread,                        592.437, GB,             data-total
 16x1-bw-thread,                          0.541, nsecs,          runtime/byte/thread
 16x1-bw-thread,                          1.847, GB/sec,         thread-speed
 16x1-bw-thread,                         29.557, GB/sec,         total-speed

 # Running 32x1-bw-thread, "perf bench numa mem -p 1 -t 32 -T 64 -s 20 -zZ0q --thp  1"
 32x1-bw-thread,                         20.029, secs,           runtime-max/thread
 32x1-bw-thread,                         19.975, secs,           runtime-min/thread
 32x1-bw-thread,                         20.015, secs,           runtime-avg/thread
 32x1-bw-thread,                          0.134, %,              spread-runtime/thread
 32x1-bw-thread,                         18.923, GB,             data/thread
 32x1-bw-thread,                        605.523, GB,             data-total
 32x1-bw-thread,                          1.058, nsecs,          runtime/byte/thread
 32x1-bw-thread,                          0.945, GB/sec,         thread-speed
 32x1-bw-thread,                         30.232, GB/sec,         total-speed

 # Running  2x3-bw-thread, "perf bench numa mem -p 2 -t 3 -P 512 -s 20 -zZ0q --thp  1"
  2x3-bw-thread,                         20.176, secs,           runtime-max/thread
  2x3-bw-thread,                         20.072, secs,           runtime-min/thread
  2x3-bw-thread,                         20.136, secs,           runtime-avg/thread
  2x3-bw-thread,                          0.257, %,              spread-runtime/thread
  2x3-bw-thread,                         51.540, GB,             data/thread
  2x3-bw-thread,                        309.238, GB,             data-total
  2x3-bw-thread,                          0.391, nsecs,          runtime/byte/thread
  2x3-bw-thread,                          2.555, GB/sec,         thread-speed
  2x3-bw-thread,                         15.327, GB/sec,         total-speed

 # Running  4x4-bw-thread, "perf bench numa mem -p 4 -t 4 -P 512 -s 20 -zZ0q --thp  1"
  4x4-bw-thread,                         20.183, secs,           runtime-max/thread
  4x4-bw-thread,                         20.013, secs,           runtime-min/thread
  4x4-bw-thread,                         20.086, secs,           runtime-avg/thread
  4x4-bw-thread,                          0.421, %,              spread-runtime/thread
  4x4-bw-thread,                         35.266, GB,             data/thread
  4x4-bw-thread,                        564.251, GB,             data-total
  4x4-bw-thread,                          0.572, nsecs,          runtime/byte/thread
  4x4-bw-thread,                          1.747, GB/sec,         thread-speed
  4x4-bw-thread,                         27.957, GB/sec,         total-speed

 # Running  4x6-bw-thread, "perf bench numa mem -p 4 -t 6 -P 512 -s 20 -zZ0q --thp  1"
  4x6-bw-thread,                         20.298, secs,           runtime-max/thread
  4x6-bw-thread,                         20.061, secs,           runtime-min/thread
  4x6-bw-thread,                         20.184, secs,           runtime-avg/thread
  4x6-bw-thread,                          0.584, %,              spread-runtime/thread
  4x6-bw-thread,                         23.578, GB,             data/thread
  4x6-bw-thread,                        565.862, GB,             data-total
  4x6-bw-thread,                          0.861, nsecs,          runtime/byte/thread
  4x6-bw-thread,                          1.162, GB/sec,         thread-speed
  4x6-bw-thread,                         27.877, GB/sec,         total-speed

 # Running  4x8-bw-thread, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp  1"
  4x8-bw-thread,                         20.350, secs,           runtime-max/thread
  4x8-bw-thread,                         20.004, secs,           runtime-min/thread
  4x8-bw-thread,                         20.190, secs,           runtime-avg/thread
  4x8-bw-thread,                          0.851, %,              spread-runtime/thread
  4x8-bw-thread,                         18.086, GB,             data/thread
  4x8-bw-thread,                        578.747, GB,             data-total
  4x8-bw-thread,                          1.125, nsecs,          runtime/byte/thread
  4x8-bw-thread,                          0.889, GB/sec,         thread-speed
  4x8-bw-thread,                         28.439, GB/sec,         total-speed

 # Running  4x8-bw-thread-NOTHP, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp  1 --thp -1"
  4x8-bw-thread-NOTHP,                   20.411, secs,           runtime-max/thread
  4x8-bw-thread-NOTHP,                   19.990, secs,           runtime-min/thread
  4x8-bw-thread-NOTHP,                   20.246, secs,           runtime-avg/thread
  4x8-bw-thread-NOTHP,                    1.032, %,              spread-runtime/thread
  4x8-bw-thread-NOTHP,                   15.989, GB,             data/thread
  4x8-bw-thread-NOTHP,                  511.638, GB,             data-total
  4x8-bw-thread-NOTHP,                    1.277, nsecs,          runtime/byte/thread
  4x8-bw-thread-NOTHP,                    0.783, GB/sec,         thread-speed
  4x8-bw-thread-NOTHP,                   25.067, GB/sec,         total-speed

 # Running  3x3-bw-thread, "perf bench numa mem -p 3 -t 3 -P 512 -s 20 -zZ0q --thp  1"
  3x3-bw-thread,                         20.170, secs,           runtime-max/thread
  3x3-bw-thread,                         20.050, secs,           runtime-min/thread
  3x3-bw-thread,                         20.109, secs,           runtime-avg/thread
  3x3-bw-thread,                          0.299, %,              spread-runtime/thread
  3x3-bw-thread,                         48.318, GB,             data/thread
  3x3-bw-thread,                        434.865, GB,             data-total
  3x3-bw-thread,                          0.417, nsecs,          runtime/byte/thread
  3x3-bw-thread,                          2.396, GB/sec,         thread-speed
  3x3-bw-thread,                         21.560, GB/sec,         total-speed

 # Running  5x5-bw-thread, "perf bench numa mem -p 5 -t 5 -P 512 -s 20 -zZ0q --thp  1"
  5x5-bw-thread,                         20.276, secs,           runtime-max/thread
  5x5-bw-thread,                         20.004, secs,           runtime-min/thread
  5x5-bw-thread,                         20.155, secs,           runtime-avg/thread
  5x5-bw-thread,                          0.671, %,              spread-runtime/thread
  5x5-bw-thread,                         21.153, GB,             data/thread
  5x5-bw-thread,                        528.818, GB,             data-total
  5x5-bw-thread,                          0.959, nsecs,          runtime/byte/thread
  5x5-bw-thread,                          1.043, GB/sec,         thread-speed
  5x5-bw-thread,                         26.081, GB/sec,         total-speed

 # Running 2x16-bw-thread, "perf bench numa mem -p 2 -t 16 -P 512 -s 20 -zZ0q --thp  1"
 2x16-bw-thread,                         20.465, secs,           runtime-max/thread
 2x16-bw-thread,                         20.004, secs,           runtime-min/thread
 2x16-bw-thread,                         20.284, secs,           runtime-avg/thread
 2x16-bw-thread,                          1.127, %,              spread-runtime/thread
 2x16-bw-thread,                         14.881, GB,             data/thread
 2x16-bw-thread,                        476.204, GB,             data-total
 2x16-bw-thread,                          1.375, nsecs,          runtime/byte/thread
 2x16-bw-thread,                          0.727, GB/sec,         thread-speed
 2x16-bw-thread,                         23.269, GB/sec,         total-speed

 # Running 1x32-bw-thread, "perf bench numa mem -p 1 -t 32 -P 2048 -s 20 -zZ0q --thp  1"
 1x32-bw-thread,                         21.944, secs,           runtime-max/thread
 1x32-bw-thread,                         20.031, secs,           runtime-min/thread
 1x32-bw-thread,                         20.878, secs,           runtime-avg/thread
 1x32-bw-thread,                          4.358, %,              spread-runtime/thread
 1x32-bw-thread,                         13.019, GB,             data/thread
 1x32-bw-thread,                        416.612, GB,             data-total
 1x32-bw-thread,                          1.686, nsecs,          runtime/byte/thread
 1x32-bw-thread,                          0.593, GB/sec,         thread-speed
 1x32-bw-thread,                         18.985, GB/sec,         total-speed

 # Running numa02-bw, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp  1"
 numa02-bw,                              20.000, secs,           runtime-max/thread
 numa02-bw,                              19.967, secs,           runtime-min/thread
 numa02-bw,                              19.994, secs,           runtime-avg/thread
 numa02-bw,                               0.081, %,              spread-runtime/thread
 numa02-bw,                              19.644, GB,             data/thread
 numa02-bw,                             628.609, GB,             data-total
 numa02-bw,                               1.018, nsecs,          runtime/byte/thread
 numa02-bw,                               0.982, GB/sec,         thread-speed
 numa02-bw,                              31.431, GB/sec,         total-speed

 # Running numa02-bw-NOTHP, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp  1 --thp -1"
 numa02-bw-NOTHP,                        20.062, secs,           runtime-max/thread
 numa02-bw-NOTHP,                        19.940, secs,           runtime-min/thread
 numa02-bw-NOTHP,                        19.988, secs,           runtime-avg/thread
 numa02-bw-NOTHP,                         0.304, %,              spread-runtime/thread
 numa02-bw-NOTHP,                        18.246, GB,             data/thread
 numa02-bw-NOTHP,                       583.881, GB,             data-total
 numa02-bw-NOTHP,                         1.100, nsecs,          runtime/byte/thread
 numa02-bw-NOTHP,                         0.909, GB/sec,         thread-speed
 numa02-bw-NOTHP,                        29.104, GB/sec,         total-speed

 # Running numa01-bw-thread, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp  1"
 numa01-bw-thread,                       20.106, secs,           runtime-max/thread
 numa01-bw-thread,                       19.989, secs,           runtime-min/thread
 numa01-bw-thread,                       20.052, secs,           runtime-avg/thread
 numa01-bw-thread,                        0.293, %,              spread-runtime/thread
 numa01-bw-thread,                       17.975, GB,             data/thread
 numa01-bw-thread,                      575.190, GB,             data-total
 numa01-bw-thread,                        1.119, nsecs,          runtime/byte/thread
 numa01-bw-thread,                        0.894, GB/sec,         thread-speed
 numa01-bw-thread,                       28.607, GB/sec,         total-speed

 # Running numa01-bw-thread-NOTHP, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp  1 --thp -1"
 numa01-bw-thread-NOTHP,                 20.391, secs,           runtime-max/thread
 numa01-bw-thread-NOTHP,                 20.010, secs,           runtime-min/thread
 numa01-bw-thread-NOTHP,                 20.085, secs,           runtime-avg/thread
 numa01-bw-thread-NOTHP,                  0.936, %,              spread-runtime/thread
 numa01-bw-thread-NOTHP,                 13.457, GB,             data/thread
 numa01-bw-thread-NOTHP,                430.638, GB,             data-total
 numa01-bw-thread-NOTHP,                  1.515, nsecs,          runtime/byte/thread
 numa01-bw-thread-NOTHP,                  0.660, GB/sec,         thread-speed
 numa01-bw-thread-NOTHP,                 21.119, GB/sec,         total-speed


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: NUMA performance comparison between three NUMA kernels and mainline. [Mid-size NUMA system edition.]
  2012-12-07 21:53 ` NUMA performance comparison between three NUMA kernels and mainline. [Mid-size NUMA system edition.] Ingo Molnar
@ 2012-12-10 12:33   ` Mel Gorman
  2012-12-10 20:29     ` Ingo Molnar
  0 siblings, 1 reply; 6+ messages in thread
From: Mel Gorman @ 2012-12-10 12:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Andrew Morton,
	Andrea Arcangeli, Linus Torvalds, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins, Arnaldo Carvalho de Melo,
	Frederic Weisbecker, Mike Galbraith

On Fri, Dec 07, 2012 at 10:53:57PM +0100, Ingo Molnar wrote:
> 
> Here's a (strongly NUMA-centric) performance comparison of the 
> three NUMA kernels: the 'balancenuma-v10' tree from Mel, the 
> AutoNUMA-v28 kernel from Andrea and the unified NUMA -v3 tree 
> Peter and I are working on.
> 
> The goal of these measurements is to specifically quantify the 
> NUMA optimization qualities of each of the three NUMA-optimizing 
> kernels.
> 
> There are lots of numbers in this mail and a lot of material to 
> read - sorry about that! :-/
> 
> I used the latest available kernel versions everywhere; 
> furthermore, the AutoNUMA-v28 tree has been patched with Hugh 
> Dickins's THP-migration support patch, to make it a fair 
> apples-to-apples comparison.
> 

Autonuma is still missing the TLB flush optimisations, migration scalability
fixes and the like. Not a big deal as such, I didn't include them either.

> I have used the 'perf bench numa' tool to do the measurements, 
> which can be found at:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/bench
> 
>    # to build it, install numactl-dev[el] and do "cd tools/perf; make -j install"
> 
> To get the raw numbers I ran "perf bench numa mem -a" multiple 
> times on each kernel, on a 32-way, 64 GB RAM, 4-node Opteron 
> test-system. Each kernel used the same base .config, copied from 
> a Fedora RPM kernel, with the NUMA-balancing options enabled.
> 
> ( Note that the testcases are tailored to my test-system: on
>   a smaller system you'd want to run slightly smaller testcases,
>   on a larger system you'd want to run a couple of larger 
>   testcases as well. )
> 
> NUMA convergence latency measurements
> -------------------------------------
> 
> 'NUMA convergence' latency is the number of seconds a workload 
> takes to reach a 'perfectly NUMA balanced' state. This is measured 
> on the CPU placement side: once it has converged then memory 
> typically follows within a couple of seconds.
> 

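As a rough illustration, a convergence check in that spirit might look
like the following minimal sketch - the task table, the strict "thread
runs on its memory node" criterion and the polling loop are assumptions
for illustration, not the tool's actual algorithm:

   #include <stdbool.h>
   #include <time.h>
   #include <unistd.h>

   struct task_placement {
           int curr_node;          /* node the thread last ran on */
           int mem_node;           /* node holding its process's memory */
   };

   static bool converged(const struct task_placement *tasks, int nr_tasks)
   {
           for (int i = 0; i < nr_tasks; i++)
                   if (tasks[i].curr_node != tasks[i].mem_node)
                           return false;
           return true;
   }

   static double convergence_latency(struct task_placement *tasks, int nr_tasks)
   {
           time_t t0 = time(NULL);

           while (difftime(time(NULL), t0) < 100.0) {      /* 100 secs cap */
                   /* a real tool would re-sample tasks[] placement here */
                   if (converged(tasks, nr_tasks))
                           return difftime(time(NULL), t0);
                   usleep(250 * 1000);
           }
           return 100.0;                           /* timed out: not converged */
   }

The 100.0 cap mirrors the time-out described below.
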
Convergence latency is a somewhat misleading metric, so be wary of it:
the speed at which a workload converges is not necessarily useful. It only
makes a difference for short-lived workloads or during phase changes. If the
workload is short-lived, it's not interesting anyway. If the workload is
rapidly changing phases then the migration costs can be a major factor and
converging rapidly might actually be slower overall.

The speed at which the workload converges will depend very heavily on when the PTEs
are marked pte_numa and when the faults are incurred. If this is happening
very rapidly then a workload will converge quickly *but* this can incur a
high system CPU cost (PTE scanning, fault trapping etc).  This metric can
be gamed by always scanning rapidly but the overall performance may be worse.

I'm not saying that this metric is not useful, it is. Just be careful of
optimising for it. numacore's system CPU usage has been really high in a
number of benchmarks and it may be because you are optimising to minimise
time to convergence.
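
( To make the metric under discussion concrete, here is a minimal
  user-space sketch of a CPU-side convergence check: poll whether all
  worker threads were last seen running on the same node. Illustrative
  only - the polling interval and the single-node criterion are
  assumptions, not the tool's actual logic. Build with something like
  "gcc conv.c -lpthread -lnuma". )

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <time.h>
#include <numa.h>

#define NR_THREADS	4

static volatile int thread_node[NR_THREADS] = { -1, -2, -3, -4 };

static void *worker(void *arg)
{
	long id = (long)arg;

	for (;;)	/* a real worker would process its working set here */
		thread_node[id] = numa_node_of_cpu(sched_getcpu());
	return NULL;
}

static int converged(void)
{
	int i;

	for (i = 0; i < NR_THREADS; i++)
		if (thread_node[i] < 0 || thread_node[i] != thread_node[0])
			return 0;
	return 1;
}

int main(void)
{
	pthread_t tid[NR_THREADS];
	struct timespec t0, t1;
	long i;

	if (numa_available() < 0)
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);

	for (;;) {
		double secs;

		clock_gettime(CLOCK_MONOTONIC, &t1);
		secs = t1.tv_sec - t0.tv_sec + (t1.tv_nsec - t0.tv_nsec) / 1e9;
		if (converged()) {
			printf("converged in %.1f secs\n", secs);
			break;
		}
		if (secs > 100.0) {	/* the benchmark's 100 secs time-out */
			printf("did not converge within 100 secs\n");
			break;
		}
		usleep(100 * 1000);
	}
	return 0;
}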

> Because convergence is not guaranteed, a 100-second latency 
> time-out is used in the benchmark. If you see a 100-second 
> result in the table it means that that particular NUMA kernel 
> did not manage to converge that workload unit test within 100 
> seconds.
> 
> The NxM notation denotes the process/thread relationship: a 1x4 test 
> is 1 process with 4 threads that share a workload - a 4x6 test 
> is 4 processes with 6 threads in each process, the processes 
> isolated from each other but the threads working on the same 
> working set.
> 
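
( To make the NxM notation concrete, here is a stripped-down 2x3
  sketch: 2 processes created with fork(), 3 threads each, all the
  threads of a process hammering one shared per-process buffer. The
  buffer size and the access loop are placeholders, not the tool's
  parameters. )

/* Hypothetical 2x3 layout: 2 isolated processes, 3 threads each,
 * the threads of a process sharing one working set. */
#define _GNU_SOURCE
#include <pthread.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define NR_PROCS	2
#define NR_THREADS	3
#define WS_BYTES	(512UL * 1024 * 1024)	/* per-process working set */

static void *worker(void *buf)
{
	int pass;

	for (pass = 0; pass < 100; pass++)	/* placeholder access pattern */
		memset(buf, 0x5a, WS_BYTES);
	return NULL;
}

static void run_process(void)
{
	pthread_t tid[NR_THREADS];
	void *buf = mmap(NULL, WS_BYTES, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int t;

	for (t = 0; t < NR_THREADS; t++)	/* the threads share 'buf' */
		pthread_create(&tid[t], NULL, worker, buf);
	for (t = 0; t < NR_THREADS; t++)
		pthread_join(tid[t], NULL);
}

int main(void)
{
	int p;

	for (p = 0; p < NR_PROCS; p++)
		if (fork() == 0) {		/* the processes are isolated */
			run_process();
			_exit(0);
		}
	for (p = 0; p < NR_PROCS; p++)
		wait(NULL);
	return 0;
}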

I'm trying to understand what you're measuring a bit better.  Take 1x4 for
example -- one process, 4 threads. If I'm reading this description correctly
then all 4 threads use the same memory. Is this correct? If so, this is basically
a variation of numa01 which is an adverse workload.  balancenuma will
not migrate memory in this case as it'll never get past the two-stage
filter. If there are few threads, it might never get scheduled on a new
node in which case it'll also do nothing.

The correct action in this case is to interleave memory and spread the
tasks between nodes but it lacks the information to do that. This was
deliberate as I was expecting numacore or autonuma to be rebased on top
and I didn't want to collide.

Does the memory requirement of all threads fit in a single node? This is
related to my second question -- how do you define convergence?

balancenuma is driven by where the process gets scheduled and it makes no
special attempt to spread itself out between nodes. If the threads are
always scheduled on the same node then it will never migrate to other
nodes because it does not need to. If you define convergence to be "all
nodes are evenly used" then balancenuma will never converge if all the
threads can stay on the same node.

> I used a wide set of test-cases I collected in the past:
> 
>                            [ Lower numbers are better. ]
> 
>  [test unit]            :   v3.7 |balancenuma-v10|  AutoNUMA-v28 |   numa-u-v3   |
> ------------------------------------------------------------------------------------------
>  1x3-convergence        :  100.1 |         100.0 |           0.2 |           2.3 |  secs
>  1x4-convergence        :  100.2 |         100.1 |         100.2 |           2.1 |  secs
>  1x6-convergence        :  100.3 |         100.4 |         100.8 |           7.3 |  secs
>  2x3-convergence        :  100.6 |         100.6 |         100.5 |           4.1 |  secs
>  3x3-convergence        :  100.6 |         100.5 |         100.5 |           7.6 |  secs
>  4x4-convergence        :  100.6 |         100.5 |           4.1 |           7.4 |  secs
>  4x4-convergence-NOTHP  :  101.1 |         100.5 |          12.2 |           9.2 |  secs
>  4x6-convergence        :    5.4 |         101.2 |          16.6 |          11.7 |  secs
>  4x8-convergence        :  101.1 |         101.3 |           3.4 |           3.9 |  secs
>  8x4-convergence        :  100.9 |         100.8 |          18.3 |           8.9 |  secs
>  8x4-convergence-NOTHP  :  101.9 |         101.0 |          15.7 |          12.1 |  secs
>  3x1-convergence        :    0.7 |           1.0 |           0.8 |           0.9 |  secs
>  4x1-convergence        :    0.6 |           0.8 |           0.8 |           0.7 |  secs
>  8x1-convergence        :    2.8 |           2.9 |           2.9 |           1.2 |  secs
>  16x1-convergence       :    3.5 |           3.7 |           2.5 |           2.0 |  secs
>  32x1-convergence       :    3.6 |           2.8 |           3.0 |           1.9 |  secs
> 

So, I recognise that balancenuma is not converging when the threads use
the same memory. They are all basically variations of numa01. It converges
quickly when the memory is private between threads like numa01_thread_alloc
does for example.

The figures do imply though that numacore is able to identify when multiple
threads are sharing the same memory and interleave them.

> As expected, mainline only manages to converge workloads where 
> each worker process is isolated and the default 
> spread-to-all-nodes scheduling policy creates an ideal layout, 
> regardless of task ordering.
> 
> [ Note that the mainline kernel got a 'lucky strike' convergence 
>   in the 4x6 workload: it's always possible for the workload
>   to accidentally converge. On a repeat test this did not occur, 
>   but I did not erase the outlier because luck is a valid and 
>   existing phenomenon. ]
> 
> The 'balancenuma' kernel does not converge any of the workloads 
> where worker threads or processes relate to each other.
> 

I'd like to know if it is because the workload fits on one node. If the
buffers are all really small, balancenuma would have skipped them
entirely for example due to this check

        /* Skip small VMAs. They are not likely to be of relevance */
        if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) < HPAGE_PMD_NR)
                continue;
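
( With 4K base pages on x86, HPAGE_PMD_NR is 512 - i.e. this skips
  every VMA smaller than one 2MB transparent hugepage. )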

Another possible explanation is that in the 4x4 case the processes'
threads are getting scheduled on separate nodes. As each thread is sharing
data it would not get past the two-stage filter.

How realistic is it that threads are accessing the same data? That looks
like it would be a bad idea even from a caching perspective if the data
is being updated. I would expect that the majority of HPC workloads would
have each thread accessing mostly private data until the final stages
where the results are aggregated together.

> AutoNUMA does pretty well, but it did not manage to converge for 
> 4 testcases of shared, under-loaded workloads.
> 
> The unified NUMA-v3 tree converged well in every testcase.
> 
> 
> NUMA workload bandwidth measurements
> ------------------------------------
> 
> The other set of numbers I've collected are workload bandwidth 
> measurements, run over 20 seconds. Using 20 seconds gives a 
> healthy mix of pre-convergence and post-convergence bandwidth, 

20 seconds is *really* short. That might not even be enough time for
autonuma's knumad thread to find the process and update it as IIRC it starts
pretty slowly.

> giving the (non-trivial) expense of convergence and memory 
> migration a weight in the result as well. So these are not 
> 'ideal' results with long runtimes where migration cost gets 
> averaged out.
> 
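( For reference, the GB/sec figures come down to bytes processed over
  wall-clock runtime. A minimal timed-sweep sketch - the access
  pattern is deliberately dumbed down to a plain write pass here,
  unlike the randomized accesses of the real workloads, and the
  final read keeps the stores observable to the compiler: )

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_BYTES	(1UL << 30)	/* 1 GB */

static double now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	char *buf = malloc(BUF_BYTES);
	unsigned long bytes = 0;
	double t0, t1;

	if (!buf)
		return 1;

	t0 = now();
	do {
		memset(buf, 0x5a, BUF_BYTES);
		bytes += BUF_BYTES;
		t1 = now();
	} while (t1 - t0 < 20.0);	/* 20 secs runtime */

	printf("%.3f GB/sec (%d)\n", bytes / (t1 - t0) / 1e9, buf[0]);
	return 0;
}
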
> [ The notation of the workloads is similar to the latency 
>   measurements: for example "2x3" means 2 processes, 3 threads 
>   per process. See the 'perf bench' tool for details. ]
> 
> The 'numa02' and 'numa01-THREAD' tests are AutoNUMA-benchmark 
> work-alike workloads, with a shorter runtime for numa01.
> 
> The results are:
> 
>                            [ Higher numbers are better. ]
> 
>  [test unit]            :   v3.7 |balancenuma-v10|  AutoNUMA-v28 | numa-u-v3     |
> ------------------------------------------------------------------------------------------
>  2x1-bw-process         :   6.248|  6.136:  -1.8%|  8.073:  29.2%|  9.647:  54.4%|  GB/sec
>  3x1-bw-process         :   7.292|  7.250:  -0.6%| 12.583:  72.6%| 14.528:  99.2%|  GB/sec
>  4x1-bw-process         :   6.007|  6.867:  14.3%| 12.313: 105.0%| 18.903: 214.7%|  GB/sec
>  8x1-bw-process         :   6.100|  7.974:  30.7%| 20.237: 231.8%| 26.829: 339.8%|  GB/sec
>  8x1-bw-process-NOTHP   :   5.944|  5.937:  -0.1%| 17.831: 200.0%| 22.237: 274.1%|  GB/sec
>  16x1-bw-process        :   5.607|  5.592:  -0.3%|  5.959:   6.3%| 29.294: 422.5%|  GB/sec
>  4x1-bw-thread          :   6.035| 13.598: 125.3%| 17.443: 189.0%| 19.290: 219.6%|  GB/sec
>  8x1-bw-thread          :   5.941| 16.356: 175.3%| 22.433: 277.6%| 26.391: 344.2%|  GB/sec
>  16x1-bw-thread         :   5.648| 24.608: 335.7%| 20.204: 257.7%| 29.557: 423.3%|  GB/sec
>  32x1-bw-thread         :   5.929| 25.477: 329.7%| 18.230: 207.5%| 30.232: 409.9%|  GB/sec
>  2x3-bw-thread          :   5.756|  8.785:  52.6%| 14.652: 154.6%| 15.327: 166.3%|  GB/sec
>  4x4-bw-thread          :   5.605|  6.366:  13.6%|  9.835:  75.5%| 27.957: 398.8%|  GB/sec
>  4x6-bw-thread          :   5.771|  6.287:   8.9%| 15.372: 166.4%| 27.877: 383.1%|  GB/sec
>  4x8-bw-thread          :   5.858|  5.860:   0.0%| 11.865: 102.5%| 28.439: 385.5%|  GB/sec
>  4x8-bw-thread-NOTHP    :   5.645|  6.167:   9.2%|  9.224:  63.4%| 25.067: 344.1%|  GB/sec
>  3x3-bw-thread          :   5.937|  8.235:  38.7%|  6.635:  11.8%| 21.560: 263.1%|  GB/sec
>  5x5-bw-thread          :   5.771|  5.762:  -0.2%|  9.575:  65.9%| 26.081: 351.9%|  GB/sec
>  2x16-bw-thread         :   5.953|  5.920:  -0.6%|  5.945:  -0.1%| 23.269: 290.9%|  GB/sec
>  1x32-bw-thread         :   5.879|  5.828:  -0.9%|  5.848:  -0.5%| 18.985: 222.9%|  GB/sec
>  numa02-bw              :   6.049| 29.054: 380.3%| 24.744: 309.1%| 31.431: 419.6%|  GB/sec
>  numa02-bw-NOTHP        :   5.850| 27.064: 362.6%| 20.415: 249.0%| 29.104: 397.5%|  GB/sec
>  numa01-bw-thread       :   5.834| 20.338: 248.6%| 15.169: 160.0%| 28.607: 390.3%|  GB/sec
>  numa01-bw-thread-NOTHP :   5.581| 18.528: 232.0%| 12.108: 117.0%| 21.119: 278.4%|  GB/sec
> ------------------------------------------------------------------------------------------
> 

Again, balancenuma's results would depend *very* heavily on how long it took
for the scheduler to put a task on a new node.

> The first column shows mainline kernel bandwidth in GB/sec, the 
> following 3 columns show pairs of GB/sec bandwidth and percentage 
> results, where the percentage shows the speed difference relative 
> to the mainline kernel. (For example numa01-bw-thread under 
> balancenuma: 20.338/5.834 => +248.6%.)
> 
> Noise is 1-2% in these tests with these durations, so the good 
> news is that none of the NUMA kernels regresses on these 
> workloads against the mainline kernel. Perhaps balancenuma's 
> "2x1-bw-process" and "3x1-bw-process" results might be worth a 
> closer look.
> 

Balancenuma takes no action until a task is scheduled on a new node.
Until that time, it assumes that no action is necessary because its workload
is already accessing local memory. It does not take into account that two
processes could be on the same node competing for memory bandwidth. I
expect that is what is happening here. In 2x1-bw-process, both tasks
start on the same node, scheduled on that node's CPUs. As long as they
both fit there and the scheduler does not migrate them in 20 seconds,
it will leave memory where it is.
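
( For what it's worth, that starting condition is easy to reproduce
  by hand: pin two memory-hungry processes to node 0 and they will
  compete for that node's bandwidth. A sketch using libnuma,
  illustrative only: )

/* Reproduce the 2x1-bw-process starting condition: two processes
 * confined to the CPUs and memory of node 0, competing for the
 * memory bandwidth of that single node. Build with -lnuma. */
#include <numa.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define WS_BYTES	(1UL << 30)

int main(void)
{
	int p, pass;

	if (numa_available() < 0)
		return 1;

	for (p = 0; p < 2; p++)
		if (fork() == 0) {
			char *buf;

			numa_run_on_node(0);	/* run on node 0's CPUs only */
			buf = numa_alloc_onnode(WS_BYTES, 0); /* memory on node 0 */
			for (pass = 0; pass < 100; pass++)
				memset(buf, 0x5a, WS_BYTES);
			numa_free(buf, WS_BYTES);
			_exit(0);
		}
	for (p = 0; p < 2; p++)
		wait(NULL);
	return 0;
}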

Addressing this would require calculation of the per-node memory load
and spreading tasks around on that basis via the load balancer. Fairly
straightforward to do, and I believe numacore does something along these
lines but it would violate what balancenuma was for -- a common base that
either numacore or autonuma could use.

> No kernel shows particular vulnerability to the NOTHP tests that 
> were mixed into the test stream.
> 
> As can be expected from the convergence latency results, the 
> 'balancenuma' tree does well with workloads where there's no 
> relationship between threads

I don't think it's exactly about workload isolation. It's more a factor of
how long it takes for the scheduler to put tasks on new nodes and whether it
leaves them there. I can work on patches that calculate per-numa load and
hook into the load balancer but at that point I'm going to start colliding
heavily with your work.

> - but even there it's outperformed 
> by the AutoNUMA kernel, and outperformed by an even larger 
> margin by the NUMA-v3 kernel. Workloads like the 4x JVM SPECjbb 
> on the other hand pose a challenge to the balancenuma kernel - 
> both the AutoNUMA and the NUMA-v3 kernels are several times 
> faster in those tests.
> 
> The AutoNUMA kernel does well in most workloads - its weaknesses 
> are system-wide shared workloads like 2x16-bw-thread and 
> 1x32-bw-thread, where it falls back to mainline performance.
> 
> The NUMA-v3 kernel outperforms every other NUMA kernel.
> 
> Here's a direct comparison between the two fastest kernels, the 
> AutoNUMA and the NUMA-v3 kernels:
> 
> 
>                         [ Higher numbers are better. ]
> 
>  [test unit]            :AutoNUMA| numa-u-v3     |
> ----------------------------------------------------------
>  2x1-bw-process         :   8.073|  9.647:  19.5%|  GB/sec
>  3x1-bw-process         :  12.583| 14.528:  15.5%|  GB/sec
>  4x1-bw-process         :  12.313| 18.903:  53.5%|  GB/sec
>  8x1-bw-process         :  20.237| 26.829:  32.6%|  GB/sec
>  8x1-bw-process-NOTHP   :  17.831| 22.237:  24.7%|  GB/sec
>  16x1-bw-process        :   5.959| 29.294: 391.6%|  GB/sec
>  4x1-bw-thread          :  17.443| 19.290:  10.6%|  GB/sec
>  8x1-bw-thread          :  22.433| 26.391:  17.6%|  GB/sec
>  16x1-bw-thread         :  20.204| 29.557:  46.3%|  GB/sec
>  32x1-bw-thread         :  18.230| 30.232:  65.8%|  GB/sec
>  2x3-bw-thread          :  14.652| 15.327:   4.6%|  GB/sec
>  4x4-bw-thread          :   9.835| 27.957: 184.3%|  GB/sec
>  4x6-bw-thread          :  15.372| 27.877:  81.3%|  GB/sec
>  4x8-bw-thread          :  11.865| 28.439: 139.7%|  GB/sec
>  4x8-bw-thread-NOTHP    :   9.224| 25.067: 171.8%|  GB/sec
>  3x3-bw-thread          :   6.635| 21.560: 224.9%|  GB/sec
>  5x5-bw-thread          :   9.575| 26.081: 172.4%|  GB/sec
>  2x16-bw-thread         :   5.945| 23.269: 291.4%|  GB/sec
>  1x32-bw-thread         :   5.848| 18.985: 224.6%|  GB/sec
>  numa02-bw              :  24.744| 31.431:  27.0%|  GB/sec
>  numa02-bw-NOTHP        :  20.415| 29.104:  42.6%|  GB/sec
>  numa01-bw-thread       :  15.169| 28.607:  88.6%|  GB/sec
>  numa01-bw-thread-NOTHP :  12.108| 21.119:  74.4%|  GB/sec
> 
> 
> NUMA workload "spread" measurements
> -----------------------------------
> 
> A third, somewhat obscure category of measurements deals with 
> the 'execution spread' between threads. Workloads that have to 
> wait for the result of every thread before they can declare a 
> result are directly limited by this spread.
> 
> The 'spread' is measured by the percentage difference between 
> the slowest and fastest thread's execution time in a workload:
> 
>                            [ Lower numbers are better. ]
> 
>  [test unit]            :   v3.7  |balancenuma-v10|  AutoNUMA-v28 |   numa-u-v3   |
> ------------------------------------------------------------------------------------------
>  RAM-bw-local           :    0.0% |          0.0% |          0.0% |          0.0% |  %
>  RAM-bw-local-NOTHP     :    0.2% |          0.2% |          0.2% |          0.2% |  %
>  RAM-bw-remote          :    0.0% |          0.0% |          0.0% |          0.0% |  %
>  RAM-bw-local-2x        :    0.3% |          0.0% |          0.2% |          0.3% |  %
>  RAM-bw-remote-2x       :    0.0% |          0.2% |          0.0% |          0.2% |  %
>  RAM-bw-cross           :    0.4% |          0.2% |          0.0% |          0.1% |  %
>  2x1-bw-process         :    0.5% |          0.2% |          0.2% |          0.2% |  %
>  3x1-bw-process         :    0.6% |          0.2% |          0.2% |          0.1% |  %
>  4x1-bw-process         :    0.4% |          0.8% |          0.2% |          0.3% |  %
>  8x1-bw-process         :    0.8% |          0.1% |          0.2% |          0.2% |  %
>  8x1-bw-process-NOTHP   :    0.9% |          0.7% |          0.4% |          0.5% |  %
>  16x1-bw-process        :    1.0% |          0.9% |          0.6% |          0.1% |  %
>  4x1-bw-thread          :    0.1% |          0.1% |          0.1% |          0.1% |  %
>  8x1-bw-thread          :    0.2% |          0.1% |          0.1% |          0.2% |  %
>  16x1-bw-thread         :    0.3% |          0.1% |          0.1% |          0.1% |  %
>  32x1-bw-thread         :    0.3% |          0.1% |          0.1% |          0.1% |  %
>  2x3-bw-thread          :    0.4% |          0.3% |          0.3% |          0.3% |  %
>  4x4-bw-thread          :    2.3% |          1.4% |          0.8% |          0.4% |  %
>  4x6-bw-thread          :    2.5% |          2.2% |          1.0% |          0.6% |  %
>  4x8-bw-thread          :    3.9% |          3.7% |          1.3% |          0.9% |  %
>  4x8-bw-thread-NOTHP    :    6.0% |          2.5% |          1.5% |          1.0% |  %
>  3x3-bw-thread          :    0.5% |          0.4% |          0.5% |          0.3% |  %
>  5x5-bw-thread          :    1.8% |          2.7% |          1.3% |          0.7% |  %
>  2x16-bw-thread         :    3.7% |          4.1% |          3.6% |          1.1% |  %
>  1x32-bw-thread         :    2.9% |          7.3% |          3.5% |          4.4% |  %
>  numa02-bw              :    0.1% |          0.0% |          0.1% |          0.1% |  %
>  numa02-bw-NOTHP        :    0.4% |          0.3% |          0.3% |          0.3% |  %
>  numa01-bw-thread       :    1.3% |          0.4% |          0.3% |          0.3% |  %
>  numa01-bw-thread-NOTHP :    1.8% |          0.8% |          0.8% |          0.9% |  %
> 
> The results are pretty good because the runs were relatively 
> short with 20 seconds runtime.
> 
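( The spread figure itself is easy to recompute from per-thread
  runtimes. A sketch, assuming the percentage is taken relative to
  the average runtime - the tool's exact reference value may differ: )

#include <stdio.h>

static double spread_pct(const double *runtime, int n)
{
	double min = runtime[0], max = runtime[0], sum = 0.0;
	int i;

	for (i = 0; i < n; i++) {
		if (runtime[i] < min)
			min = runtime[i];
		if (runtime[i] > max)
			max = runtime[i];
		sum += runtime[i];
	}
	return 100.0 * (max - min) / (sum / n);	/* assumed reference: avg */
}

int main(void)
{
	double secs[4] = { 19.9, 20.1, 20.0, 20.3 };

	printf("spread: %.1f%%\n", spread_pct(secs, 4));
	return 0;
}
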
> Both mainline and balancenuma have trouble with the spread of 
> shared workloads - possibly signalling memory allocation 
> asymmetries. Longer - 60 seconds or more - runs of the key 
> workloads would certainly be informative there.
> 
> NOTHP (4K ptes) increases the spread and non-determinism of 
> every NUMA kernel.
> 
> The AutoNUMA and NUMA-v3 kernels have the lowest spread, 
> signalling stable NUMA convergence in most scenarios.
> 
> Finally, below is the (long!) dump of all the raw data, in case 
> someone wants to double-check my results. The perf/bench tool 
> can be used to double check the measurements on other systems.
> 

I'll take your word that you got it right and nothing in the results
surprised me as such.

My reading of the results is basically that balancenuma suffers in these
comparisons because it's not hooking into the scheduler to feed information
to the load balancer on how the tasks should be spread around. As the
scheduler does not move the tasks (too few, too short lived) it looks bad
as a result. I can work on the patches to identify per-node load
and hook into the load balancer to spread the tasks but at that point I'll
start heavily colliding with either an autonuma or numacore rebase, which
I had wanted to avoid.

-- 
Mel Gorman
SUSE Labs

* Re: NUMA performance comparison between three NUMA kernels and mainline. [Mid-size NUMA system edition.]
  2012-12-10 12:33   ` Mel Gorman
@ 2012-12-10 20:29     ` Ingo Molnar
  2012-12-10 21:59       ` Mel Gorman
  0 siblings, 1 reply; 6+ messages in thread
From: Ingo Molnar @ 2012-12-10 20:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Andrew Morton,
	Andrea Arcangeli, Linus Torvalds, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins, Arnaldo Carvalho de Melo,
	Frederic Weisbecker, Mike Galbraith


* Mel Gorman <mgorman@suse.de> wrote:

> > NUMA convergence latency measurements
> > -------------------------------------
> > 
> > 'NUMA convergence' latency is the number of seconds a 
> > workload takes to reach 'perfectly NUMA balanced' state. 
> > This is measured on the CPU placement side: once it has 
> > converged then memory typically follows within a couple of 
> > seconds.
> 
> This is a sort of misleading metric, so be wary of it: the 
> speed at which a workload converges is not necessarily useful. It only 
> makes a difference for short-lived workloads or during phase 
> changes. If the workload is short-lived, it's not interesting 
> anyway. If the workload is rapidly changing phases then the 
> migration costs can be a major factor and rapidly converging 
> might actually be slower overall.
> 
> The speed at which the workload converges will depend very 
> heavily on when the PTEs are marked pte_numa and when the faults are 
> incurred. If this is happening very rapidly then a workload 
> will converge quickly *but* this can incur a high system CPU 
> cost (PTE scanning, fault trapping etc).  This metric can be 
> gamed by always scanning rapidly but the overall performance 
> may be worse.
> 
> I'm not saying that this metric is not useful, it is. Just be 
> careful of optimising for it. numacore's system CPU usage has 
> been really high in a number of benchmarks and it may be 
> because you are optimising to minimise time to convergence.

You are missing a big part of the NUMA balancing picture here: 
the primary use of 'latency of convergence' is to determine 
whether a workload converges *at all*.

For example if you look at the 4-process / 8-threads-per-process 
latency results:

                            [ Lower numbers are better. ]
 
  [test unit]            :   v3.7 |balancenuma-v10|  AutoNUMA-v28 |   numa-u-v3   |
 ------------------------------------------------------------------------------------------
  4x8-convergence        :  101.1 |         101.3 |           3.4 |           3.9 |  secs

You'll see that balancenuma does not converge this workload. 

Where does such a workload matter? For example in the 4x JVM 
SPECjbb tests that Thomas Gleixner has reported today:

    http://lkml.org/lkml/2012/12/10/437

There balancenuma does worse than AutoNUMA and the -v3 tree 
exactly because it does not NUMA-converge as well (or at all).

> I'm trying to understand what you're measuring a bit better.  
> Take 1x4 for example -- one process, 4 threads. If I'm reading 
> this description correctly then all 4 threads use the same memory. Is 
> this correct? If so, this is basically a variation of numa01 
> which is an adverse workload. [...]

No, 1x4 and 1x8 are like the SPECjbb JVM tests you have been 
performing - not an 'adverse' workload. The threads of the JVM 
are sharing memory significantly enough to justify moving them 
onto the same node.

> [...]  balancenuma will not migrate memory in this case as 
> it'll never get past the two-stage filter. If there are few 
> threads, it might never get scheduled on a new node in which 
> case it'll also do nothing.
> 
> The correct action in this case is to interleave memory and 
> spread the tasks between nodes but it lacks the information to 
> do that. [...]

No, the correct action is to move related threads close to each 
other.

> [...] This was deliberate as I was expecting numacore or 
> autonuma to be rebased on top and I didn't want to collide.
> 
> Does the memory requirement of all threads fit in a single 
> node? This is related to my second question -- how do you 
> define convergence?

NUMA-convergence is to achieve the ideal CPU and memory 
placement of tasks.
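
( The memory half of that placement can be inspected directly from
  user-space: move_pages(2) with a NULL 'nodes' argument queries the
  node each page currently lives on. A small illustrative sketch,
  built with -lnuma: )

/* Illustrative memory-side check: ask the kernel which node each
 * page of a buffer currently lives on. */
#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NPAGES	16

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	char *buf = aligned_alloc(psize, NPAGES * psize);
	void *pages[NPAGES];
	int status[NPAGES], i;

	for (i = 0; i < NPAGES; i++) {
		buf[i * psize] = 1;		/* fault the pages in first */
		pages[i] = buf + i * psize;
	}

	/* nodes == NULL turns move_pages() into a pure query: */
	if (move_pages(0, NPAGES, pages, NULL, status, 0) == 0)
		for (i = 0; i < NPAGES; i++)
			printf("page %2d is on node %d\n", i, status[i]);
	return 0;
}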

> > The 'balancenuma' kernel does not converge any of the 
> > workloads where worker threads or processes relate to each 
> > other.
> 
> I'd like to know if it is because the workload fits on one 
> node. If the buffers are all really small, balancenuma would 
> have skipped them entirely for example due to this check
> 
>         /* Skip small VMAs. They are not likely to be of relevance */
>         if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) < HPAGE_PMD_NR)
>                 continue;

No, the memory areas are larger than 2MB.

> Another possible explanation is that in the 4x4 case the 
> processes' threads are getting scheduled on separate nodes. As 
> each thread is sharing data it would not get past the 
> two-stage filter.
> 
> How realistic is it that threads are accessing the same data? 

In practice? Very ...

> That looks like it would be a bad idea even from a caching 
> perspective if the data is being updated. I would expect that 
> the majority of HPC workloads would have each thread accessing 
> mostly private data until the final stages where the results 
> are aggregated together.

You tested such a workload many times in the past: the 4x JVM 
SPECjbb test ...

> > NUMA workload bandwidth measurements
> > ------------------------------------
> > 
> > The other set of numbers I've collected are workload 
> > bandwidth measurements, run over 20 seconds. Using 20 
> > seconds gives a healthy mix of pre-convergence and 
> > post-convergence bandwidth,
> 
> 20 seconds is *really* short. That might not even be enough 
> time for autonuma's knumad thread to find the process and 
> update it as IIRC it starts pretty slowly.

If you check the convergence latency tables you'll see that 
AutoNUMA is able to converge within 20 seconds.

Thanks,

	Ingo

* Re: NUMA performance comparison between three NUMA kernels and mainline. [Mid-size NUMA system edition.]
  2012-12-10 20:29     ` Ingo Molnar
@ 2012-12-10 21:59       ` Mel Gorman
  0 siblings, 0 replies; 6+ messages in thread
From: Mel Gorman @ 2012-12-10 21:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Andrew Morton,
	Andrea Arcangeli, Linus Torvalds, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins, Arnaldo Carvalho de Melo,
	Frederic Weisbecker, Mike Galbraith

On Mon, Dec 10, 2012 at 09:29:33PM +0100, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > > NUMA convergence latency measurements
> > > -------------------------------------
> > > 
> > > 'NUMA convergence' latency is the number of seconds a 
> > > workload takes to reach 'perfectly NUMA balanced' state. 
> > > This is measured on the CPU placement side: once it has 
> > > converged then memory typically follows within a couple of 
> > > seconds.
> > 
> > This is a sort of misleading metric, so be wary of it: the 
> > speed at which a workload converges is not necessarily useful. It only 
> > makes a difference for short-lived workloads or during phase 
> > changes. If the workload is short-lived, it's not interesting 
> > anyway. If the workload is rapidly changing phases then the 
> > migration costs can be a major factor and rapidly converging 
> > might actually be slower overall.
> > 
> > The speed at which the workload converges will depend very 
> > heavily on when the PTEs are marked pte_numa and when the faults are 
> > incurred. If this is happening very rapidly then a workload 
> > will converge quickly *but* this can incur a high system CPU 
> > cost (PTE scanning, fault trapping etc).  This metric can be 
> > gamed by always scanning rapidly but the overall performance 
> > may be worse.
> > 
> > I'm not saying that this metric is not useful, it is. Just be 
> > careful of optimising for it. numacore's system CPU usage has 
> > been really high in a number of benchmarks and it may be 
> > because you are optimising to minimise time to convergence.
> 
> You are missing a big part of the NUMA balancing picture here: 
> the primary use of 'latency of convergence' is to determine 
> whether a workload converges *at all*.
> 
> For example if you look at the 4-process / 8-threads-per-process 
> latency results:
> 
>                             [ Lower numbers are better. ]
>  
>   [test unit]            :   v3.7 |balancenuma-v10|  AutoNUMA-v28 |   numa-u-v3   |
>  ------------------------------------------------------------------------------------------
>   4x8-convergence        :  101.1 |         101.3 |           3.4 |           3.9 |  secs
> 
> You'll see that balancenuma does not converge this workload. 
> 

Does it ever get scheduled on a new node? Balancenuma is completely at the
mercy of the scheduler. It makes no attempts to estimate numa loading or
hint to the load balancer. It does not even start trying to converge until
it's scheduled on a new node.

> Where does such a workload matter? For example in the 4x JVM 
> SPECjbb tests that Thomas Gleixner has reported today:
> 
>     http://lkml.org/lkml/2012/12/10/437
> 
> There balancenuma does worse than AutoNUMA and the -v3 tree 
> exactly because it does not NUMA-converge as well (or at all).
> 

I know. To do that I would have had to hook into the scheduler, build
statistics and use the load balancer to move the tasks around. This would
have directly collided with either an autonuma or a numacore rebase.  I've
made this point often enough and I'm getting very sick of repeating myself.

> > I'm trying to understand what you're measuring a bit better.  
> > Take 1x4 for example -- one process, 4 threads. If I'm reading 
> > this description correctly then all 4 threads use the same memory. Is 
> > this correct? If so, this is basically a variation of numa01 
> > which is an adverse workload. [...]
> 
> No, 1x4 and 1x8 are like the SPECjbb JVM tests you have been 
> performing - not an 'adverse' workload. The threads of the JVM 
> are sharing memory significantly enough to justify moving them 
> > onto the same node.
> 

1x8 would not even be a single JVM test. It would have ranged from 1x8 to
1x72 over the course of the entire test. I also still do not know what
granularity you are sharing data on. If they are using the exact same
pages, it's closer to numa01 than specjbb, which has semi-private data
depending on how the heap is laid out.

> > [...]  balancenuma will not migrate memory in this case as 
> > it'll never get past the two-stage filter. If there are few 
> > threads, it might never get scheduled on a new node in which 
> > case it'll also do nothing.
> > 
> > The correct action in this case is to interleave memory and 
> > spread the tasks between nodes but it lacks the information to 
> > do that. [...]
> 
> No, the correct action is to move related threads close to each 
> other.
> 
> > [...] This was deliberate as I was expecting numacore or 
> > autonuma to be rebased on top and I didn't want to collide.
> > 
> > Does the memory requirement of all threads fit in a single 
> > node? This is related to my second question -- how do you 
> > define convergence?
> 
> NUMA-convergence is to achieve the ideal CPU and memory 
> placement of tasks.
> 

That is your goal; it does not define what convergence is. You also did not
tell me if all the threads can fit in a single node or not. If they do,
then it's possible that balancenuma never "converges" simply because the
data accesses are already local so it does not migrate. If your definition
of convergence includes that tasks should migrate to as many nodes as
possible to maximise memory bandwidth then say that.

> > > The 'balancenuma' kernel does not converge any of the 
> > > workloads where worker threads or processes relate to each 
> > > other.
> > 
> > I'd like to know if it is because the workload fits on one 
> > node. If the buffers are all really small, balancenuma would 
> > have skipped them entirely for example due to this check
> > 
> >         /* Skip small VMAs. They are not likely to be of relevance */
> >         if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) < HPAGE_PMD_NR)
> >                 continue;
> 
> No, the memory areas are larger than 2MB.
> 

Does the workload for 1x8 fit in one node? My test machines are
occupied so I cannot check myself right now.

> > Another possible explanation is that in the 4x4 case the 
> > processes' threads are getting scheduled on separate nodes. As 
> > each thread is sharing data it would not get past the 
> > two-stage filter.
> > 
> > How realistic is it that threads are accessing the same data? 
> 
> In practice? Very ...
> 

I'm skeptical. I would expect HPC workloads in particular to isolate data
between threads where possible.

-- 
Mel Gorman
SUSE Labs
