* [PATCH] perf: Add 'perf bench numa mem' NUMA performance measurement suite
From: Ingo Molnar @ 2012-12-07 20:55 UTC
To: linux-kernel, linux-mm
Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins,
Arnaldo Carvalho de Melo, Frederic Weisbecker, Mike Galbraith,
Steven Rostedt
Add a suite of NUMA performance benchmarks.
The goal was to simulate the behavior and access patterns of real NUMA
workloads, via a wide range of parameters, so this tool goes well
beyond the simple bzero() measurements that most NUMA micro-benchmarks use:
- It processes the data and creates a chain of data dependencies,
like a real workload would. Neither the compiler, nor the
kernel (via KSM and other optimizations), nor the CPU can
eliminate parts of the workload. (A minimal standalone sketch of
this access pattern is included below.)
- It randomizes the initial state and also randomizes the target
addresses of the processing - it's not a simple forward scan
of addresses.
- It provides flexible options to set process, thread and memory
relationship information: -G sets "global" memory shared between
all test processes, -P sets "process" memory shared by all
threads of a process and -T sets "thread" private memory. (See
the example invocations below.)
- There are NUMA convergence monitoring (-c) and convergence latency
measurement (-m) options.
- Micro-sleeps and synchronization can be injected to provoke lock
contention and scheduling, via the -u and -S options. This simulates
IO and contention.
- The -x option instructs the workload to 'perturb' itself artificially
every N seconds, by moving to the first and last CPU of the system
periodically. This way the stability of convergence equilibrium and
the number of steps taken for the scheduler to reach equilibrium again
can be measured.
- The amount of work can be specified via the -l loop count, and/or
via a -s seconds-timeout value.
- CPU and node memory binding options, to test hard binding scenarios.
THP can be turned on and off via madvise() calls.
- Live reporting of convergence progress in an 'at a glance' output
format, plus printing of convergence and deconvergence events.
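These options combine freely. For example, the ' 8x1-bw-thread' and
' 4x4-convergence' entries of the built-in test-suite (see the tests[]
table in the patch below) correspond to running:

  # 1 process, 8 threads, 256 MB of -T thread-private memory each,
  # 20 seconds of bandwidth measurement:
  perf bench numa mem -p 1 -t 8 -T 256 -s 20 -zZ0q --thp 1

  # 4 processes x 4 threads sharing 512 MB of -P process memory,
  # with -c/-m convergence measurement and a 100 seconds timeout:
  perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp 1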
The 'perf bench numa mem -a' option will start an array of about 30
individual tests, each of which outputs measurements like these:
# Running 5x5-bw-thread, "perf bench numa mem -p 5 -t 5 -P 512 -s 20 -zZ0q --thp 1"
5x5-bw-thread, 20.276, secs, runtime-max/thread
5x5-bw-thread, 20.004, secs, runtime-min/thread
5x5-bw-thread, 20.155, secs, runtime-avg/thread
5x5-bw-thread, 0.671, %, spread-runtime/thread
5x5-bw-thread, 21.153, GB, data/thread
5x5-bw-thread, 528.818, GB, data-total
5x5-bw-thread, 0.959, nsecs, runtime/byte/thread
5x5-bw-thread, 1.043, GB/sec, thread-speed
5x5-bw-thread, 26.081, GB/sec, total-speed
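( The derived fields follow from the raw ones - see the print_res()
  calls in __bench_numa() below: total-speed is data-total divided by
  the slowest thread's runtime, 528.818 GB / 20.276 secs = 26.081 GB/sec;
  thread-speed is data/thread over the same runtime,
  21.153 / 20.276 = 1.043 GB/sec; and runtime/byte/thread is its
  inverse, 20.276 / 21.153 = 0.959 nsecs/byte. )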
See the help text and the code for more details.
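For readers who only want the gist of the access pattern, here is a
minimal standalone sketch of the LFSR-driven, data-dependent walk
(simplified from lfsr_32()/access_data() in the patch below - it is
not the full workload, just the core idea):

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define BIT(x) (1u << (x))

  static uint32_t lfsr_32(uint32_t lfsr)
  {
          const uint32_t taps = BIT(1) | BIT(5) | BIT(6) | BIT(31);

          return (lfsr >> 1) ^ ((0x0u - (lfsr & 0x1u)) & taps);
  }

  int main(void)
  {
          long words = 1024 * 1024;               /* 8 MB working set */
          uint64_t *data = calloc(words, sizeof(*data));
          uint64_t val = 1;
          uint32_t lfsr = 0xdeadbeef;
          long i;

          if (!data)
                  return 1;

          /*
           * Chained read-modify-write accesses at pseudo-random offsets,
           * so neither the compiler nor the CPU can elide the RAM traffic:
           */
          for (i = 0; i < words; i++) {
                  lfsr = lfsr_32(lfsr);
                  val += data[lfsr % words];      /* read dependency  */
                  data[lfsr % words] = val + 1;   /* write dependency */
          }

          printf("checksum: %llu\n", (unsigned long long)val);
          free(data);

          return 0;
  }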
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
tools/perf/Makefile | 3 +-
tools/perf/bench/bench.h | 1 +
tools/perf/bench/numa.c | 1731 ++++++++++++++++++++++++++++++++++++++++++++
tools/perf/builtin-bench.c | 13 +
tools/perf/util/hist.h | 2 +-
5 files changed, 1748 insertions(+), 2 deletions(-)
create mode 100644 tools/perf/bench/numa.c
diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index cca5bb8..91621f9 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -104,7 +104,7 @@ ifdef PARSER_DEBUG
endif
CFLAGS = -fno-omit-frame-pointer -ggdb3 -funwind-tables -Wall -Wextra -std=gnu99 $(CFLAGS_WERROR) $(CFLAGS_OPTIMIZE) $(EXTRA_WARNINGS) $(EXTRA_CFLAGS) $(PARSER_DEBUG_CFLAGS)
-EXTLIBS = -lpthread -lrt -lelf -lm
+EXTLIBS = -lpthread -lrt -lelf -lm -lnuma
ALL_CFLAGS = $(CFLAGS) -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE
ALL_LDFLAGS = $(LDFLAGS)
STRIP ?= strip
@@ -435,6 +435,7 @@ LIB_OBJS += $(OUTPUT)tests/attr.o
BUILTIN_OBJS += $(OUTPUT)builtin-annotate.o
BUILTIN_OBJS += $(OUTPUT)builtin-bench.o
# Benchmark modules
+BUILTIN_OBJS += $(OUTPUT)bench/numa.o
BUILTIN_OBJS += $(OUTPUT)bench/sched-messaging.o
BUILTIN_OBJS += $(OUTPUT)bench/sched-pipe.o
ifeq ($(RAW_ARCH),x86_64)
diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
index 8f89998..a5223e6 100644
--- a/tools/perf/bench/bench.h
+++ b/tools/perf/bench/bench.h
@@ -1,6 +1,7 @@
#ifndef BENCH_H
#define BENCH_H
+extern int bench_numa(int argc, const char **argv, const char *prefix);
extern int bench_sched_messaging(int argc, const char **argv, const char *prefix);
extern int bench_sched_pipe(int argc, const char **argv, const char *prefix);
extern int bench_mem_memcpy(int argc, const char **argv,
diff --git a/tools/perf/bench/numa.c b/tools/perf/bench/numa.c
new file mode 100644
index 0000000..30d1c32
--- /dev/null
+++ b/tools/perf/bench/numa.c
@@ -0,0 +1,1731 @@
+/*
+ * numa.c
+ *
+ * numa: Simulate NUMA-sensitive workloads and measure their NUMA performance
+ */
+
+#include "../perf.h"
+#include "../builtin.h"
+#include "../util/util.h"
+#include "../util/parse-options.h"
+
+#include "bench.h"
+
+#include <errno.h>
+#include <sched.h>
+#include <stdio.h>
+#include <assert.h>
+#include <malloc.h>
+#include <signal.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <sys/mman.h>
+#include <sys/time.h>
+#include <sys/wait.h>
+#include <sys/prctl.h>
+#include <sys/types.h>
+
+#include <numa.h>
+#include <numaif.h>
+
+/*
+ * Regular printout to the terminal, suppressed if -q is specified:
+ */
+#define tprintf(x...) do { if (g && g->p.show_details >= 0) printf(x); } while (0)
+
+/*
+ * Debug printf:
+ */
+#define dprintf(x...) do { if (g && g->p.show_details >= 1) printf(x); } while (0)
+
+struct thread_data {
+ int curr_cpu;
+ cpu_set_t bind_cpumask;
+ int bind_node;
+ u8 *process_data;
+ int process_nr;
+ int thread_nr;
+ int task_nr;
+ unsigned int loops_done;
+ u64 val;
+ u64 runtime_ns;
+ pthread_mutex_t *process_lock;
+};
+
+/* Parameters set by options: */
+
+struct params {
+ /* Startup synchronization: */
+ bool serialize_startup;
+
+ /* Task hierarchy: */
+ int nr_proc;
+ int nr_threads;
+
+ /* Working set sizes: */
+ const char *mb_global_str;
+ const char *mb_proc_str;
+ const char *mb_proc_locked_str;
+ const char *mb_thread_str;
+
+ double mb_global;
+ double mb_proc;
+ double mb_proc_locked;
+ double mb_thread;
+
+ /* Access patterns to the working set: */
+ bool data_reads;
+ bool data_writes;
+ bool data_backwards;
+ bool data_zero_memset;
+ bool data_rand_walk;
+ u32 nr_loops;
+ u32 nr_secs;
+ u32 sleep_usecs;
+
+ /* Working set initialization: */
+ bool init_zero;
+ bool init_random;
+ bool init_cpu0;
+
+ /* Misc options: */
+ int show_details;
+ int run_all;
+ int thp;
+
+ long bytes_global;
+ long bytes_process;
+ long bytes_process_locked;
+ long bytes_thread;
+
+ int nr_tasks;
+ bool show_quiet;
+
+ bool show_convergence;
+ bool measure_convergence;
+
+ int perturb_secs;
+ int nr_cpus;
+ int nr_nodes;
+
+ /* Affinity options -C and -M: */
+ char *cpu_list_str;
+ char *node_list_str;
+};
+
+
+/* Global, read-writable area, accessible to all processes and threads: */
+
+struct global_info {
+ u8 *data;
+
+ pthread_mutex_t startup_mutex;
+ int nr_tasks_started;
+
+ pthread_mutex_t startup_done_mutex;
+
+ pthread_mutex_t start_work_mutex;
+ int nr_tasks_working;
+
+ pthread_mutex_t stop_work_mutex;
+ u64 bytes_done;
+
+ struct thread_data *threads;
+
+ /* Convergence latency measurement: */
+ bool all_converged;
+ bool stop_work;
+
+ int print_once;
+
+ struct params p;
+};
+
+static struct global_info *g = NULL;
+
+static int parse_cpus_opt(const struct option *opt, const char *arg, int unset);
+static int parse_nodes_opt(const struct option *opt, const char *arg, int unset);
+
+struct params p0;
+
+static const struct option options[] = {
+ OPT_INTEGER('p', "nr_proc" , &p0.nr_proc, "number of processes"),
+ OPT_INTEGER('t', "nr_threads" , &p0.nr_threads, "number of threads per process"),
+
+ OPT_STRING('G', "mb_global" , &p0.mb_global_str, "MB", "global memory (MBs)"),
+ OPT_STRING('P', "mb_proc" , &p0.mb_proc_str, "MB", "process memory (MBs)"),
+ OPT_STRING('L', "mb_proc_locked", &p0.mb_proc_locked_str,"MB", "process serialized/locked memory access (MBs), <= process_memory"),
+ OPT_STRING('T', "mb_thread" , &p0.mb_thread_str, "MB", "thread memory (MBs)"),
+
+ OPT_UINTEGER('l', "nr_loops" , &p0.nr_loops, "max number of loops to run"),
+ OPT_UINTEGER('s', "nr_secs" , &p0.nr_secs, "max number of seconds to run"),
+ OPT_UINTEGER('u', "usleep" , &p0.sleep_usecs, "usecs to sleep per loop iteration"),
+
+ OPT_BOOLEAN('R', "data_reads" , &p0.data_reads, "access the data via reads (can be mixed with -W)"),
+ OPT_BOOLEAN('W', "data_writes" , &p0.data_writes, "access the data via writes (can be mixed with -R)"),
+ OPT_BOOLEAN('B', "data_backwards", &p0.data_backwards, "access the data backwards as well"),
+ OPT_BOOLEAN('Z', "data_zero_memset", &p0.data_zero_memset,"access the data via glibc bzero only"),
+ OPT_BOOLEAN('r', "data_rand_walk", &p0.data_rand_walk, "access the data with random (32bit LFSR) walk"),
+
+
+ OPT_BOOLEAN('z', "init_zero" , &p0.init_zero, "bzero the initial allocations"),
+ OPT_BOOLEAN('I', "init_random" , &p0.init_random, "randomize the contents of the initial allocations"),
+ OPT_BOOLEAN('0', "init_cpu0" , &p0.init_cpu0, "do the initial allocations on CPU#0"),
+ OPT_INTEGER('x', "perturb_secs", &p0.perturb_secs, "perturb thread 0/0 every X secs, to test convergence stability"),
+
+ OPT_INCR ('d', "show_details" , &p0.show_details, "Show details"),
+ OPT_INCR ('a', "all" , &p0.run_all, "Run all tests in the suite"),
+ OPT_INTEGER('H', "thp" , &p0.thp, "MADV_NOHUGEPAGE < 0 < MADV_HUGEPAGE"),
+ OPT_BOOLEAN('c', "show_convergence", &p0.show_convergence, "show convergence details"),
+ OPT_BOOLEAN('m', "measure_convergence", &p0.measure_convergence, "measure convergence latency"),
+ OPT_BOOLEAN('q', "quiet" , &p0.show_quiet, "quiet mode"),
+ OPT_BOOLEAN('S', "serialize-startup", &p0.serialize_startup,"serialize thread startup"),
+
+ /* Special option string parsing callbacks: */
+ OPT_CALLBACK('C', "cpus", NULL, "cpu[,cpu2,...cpuN]",
+ "bind the first N tasks to these specific cpus (the rest is unbound)",
+ parse_cpus_opt),
+ OPT_CALLBACK('M', "memnodes", NULL, "node[,node2,...nodeN]",
+ "bind the first N tasks to these specific memory nodes (the rest is unbound)",
+ parse_nodes_opt),
+ OPT_END()
+};
+
+static const char * const bench_numa_usage[] = {
+ "perf bench numa <options>",
+ NULL
+};
+
+static const char * const numa_usage[] = {
+ "perf bench numa mem [<options>]",
+ NULL
+};
+
+static cpu_set_t bind_to_cpu(int target_cpu)
+{
+ cpu_set_t orig_mask, mask;
+ int ret;
+
+ ret = sched_getaffinity(0, sizeof(orig_mask), &orig_mask);
+ BUG_ON(ret);
+
+ CPU_ZERO(&mask);
+
+ if (target_cpu == -1) {
+ int cpu;
+
+ for (cpu = 0; cpu < g->p.nr_cpus; cpu++)
+ CPU_SET(cpu, &mask);
+ } else {
+ BUG_ON(target_cpu < 0 || target_cpu >= g->p.nr_cpus);
+ CPU_SET(target_cpu, &mask);
+ }
+
+ ret = sched_setaffinity(0, sizeof(mask), &mask);
+ BUG_ON(ret);
+
+ return orig_mask;
+}
+
+static cpu_set_t bind_to_node(int target_node)
+{
+ int cpus_per_node = g->p.nr_cpus/g->p.nr_nodes;
+ cpu_set_t orig_mask, mask;
+ int cpu;
+ int ret;
+
+ BUG_ON(cpus_per_node*g->p.nr_nodes != g->p.nr_cpus);
+ BUG_ON(!cpus_per_node);
+
+ ret = sched_getaffinity(0, sizeof(orig_mask), &orig_mask);
+ BUG_ON(ret);
+
+ CPU_ZERO(&mask);
+
+ if (target_node == -1) {
+ for (cpu = 0; cpu < g->p.nr_cpus; cpu++)
+ CPU_SET(cpu, &mask);
+ } else {
+ int cpu_start = (target_node + 0) * cpus_per_node;
+ int cpu_stop = (target_node + 1) * cpus_per_node;
+
+ BUG_ON(cpu_stop > g->p.nr_cpus);
+
+ for (cpu = cpu_start; cpu < cpu_stop; cpu++)
+ CPU_SET(cpu, &mask);
+ }
+
+ ret = sched_setaffinity(0, sizeof(mask), &mask);
+ BUG_ON(ret);
+
+ return orig_mask;
+}
+
+static void bind_to_cpumask(cpu_set_t mask)
+{
+ int ret;
+
+ ret = sched_setaffinity(0, sizeof(mask), &mask);
+ BUG_ON(ret);
+}
+
+static void mempol_restore(void)
+{
+ int ret;
+
+ ret = set_mempolicy(MPOL_DEFAULT, NULL, g->p.nr_nodes-1);
+
+ BUG_ON(ret);
+}
+
+static void bind_to_memnode(int node)
+{
+ unsigned long nodemask;
+ int ret;
+
+ if (node == -1)
+ return;
+
+ BUG_ON(g->p.nr_nodes > (int)sizeof(nodemask));
+ nodemask = 1L << node;
+
+ ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask)*8);
+ dprintf("binding to node %d, mask: %016lx => %d\n", node, nodemask, ret);
+
+ BUG_ON(ret);
+}
+
+#define HPSIZE (2*1024*1024)
+
+#define set_taskname(fmt...) \
+do { \
+ char name[20]; \
+ \
+ snprintf(name, 20, fmt); \
+ prctl(PR_SET_NAME, name); \
+} while (0)
+
+static u8 *alloc_data(ssize_t bytes0, int map_flags,
+ int init_zero, int init_cpu0, int thp, int init_random)
+{
+ cpu_set_t orig_mask;
+ ssize_t bytes;
+ u8 *buf;
+ int ret;
+
+ if (!bytes0)
+ return NULL;
+
+ /* Allocate and initialize all memory on CPU#0: */
+ if (init_cpu0) {
+ orig_mask = bind_to_node(0);
+ bind_to_memnode(0);
+ }
+
+ bytes = bytes0 + HPSIZE;
+
+ buf = (void *)mmap(0, bytes, PROT_READ|PROT_WRITE, MAP_ANON|map_flags, -1, 0);
+ BUG_ON(buf == (void *)-1);
+
+ if (map_flags == MAP_PRIVATE) {
+ if (thp > 0) {
+ ret = madvise(buf, bytes, MADV_HUGEPAGE);
+ if (ret && !g->print_once) {
+ g->print_once = 1;
+ printf("WARNING: Could not enable THP - do: 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled'\n");
+ }
+ }
+ if (thp < 0) {
+ ret = madvise(buf, bytes, MADV_NOHUGEPAGE);
+ if (ret && !g->print_once) {
+ g->print_once = 1;
+ printf("WARNING: Could not disable THP: run a CONFIG_TRANSPARENT_HUGEPAGE kernel?\n");
+ }
+ }
+ }
+
+ if (init_zero) {
+ bzero(buf, bytes);
+ } else {
+ /* Initialize random contents, different in each word: */
+ if (init_random) {
+ u64 *wbuf = (void *)buf;
+ long off = rand();
+ long i;
+
+ for (i = 0; i < bytes/8; i++)
+ wbuf[i] = i + off;
+ }
+ }
+
+ /* Align to 2MB boundary: */
+ buf = (void *)(((unsigned long)buf + HPSIZE-1) & ~(HPSIZE-1));
+
+ /* Restore affinity: */
+ if (init_cpu0) {
+ bind_to_cpumask(orig_mask);
+ mempol_restore();
+ }
+
+ return buf;
+}
+
+static void free_data(void *data, ssize_t bytes)
+{
+ int ret;
+
+ if (!data)
+ return;
+
+ ret = munmap(data, bytes);
+ BUG_ON(ret);
+}
+
+/*
+ * Create a shared memory buffer that can be shared between processes, zeroed:
+ */
+static void * zalloc_shared_data(ssize_t bytes)
+{
+ return alloc_data(bytes, MAP_SHARED, 1, g->p.init_cpu0, g->p.thp, g->p.init_random);
+}
+
+/*
+ * Create a shared memory buffer that can be shared between processes:
+ */
+static void * setup_shared_data(ssize_t bytes)
+{
+ return alloc_data(bytes, MAP_SHARED, 0, g->p.init_cpu0, g->p.thp, g->p.init_random);
+}
+
+/*
+ * Allocate process-local memory - this will either be shared between
+ * threads of this process, or only be accessed by this thread:
+ */
+static void * setup_private_data(ssize_t bytes)
+{
+ return alloc_data(bytes, MAP_PRIVATE, 0, g->p.init_cpu0, g->p.thp, g->p.init_random);
+}
+
+/*
+ * Return a process-shared (global) mutex:
+ */
+static void init_global_mutex(pthread_mutex_t *mutex)
+{
+ pthread_mutexattr_t attr;
+
+ pthread_mutexattr_init(&attr);
+ pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
+ pthread_mutex_init(mutex, &attr);
+}
+
+static int parse_cpu_list(const char *arg)
+{
+ p0.cpu_list_str = strdup(arg);
+
+ dprintf("got CPU list: {%s}\n", p0.cpu_list_str);
+
+ return 0;
+}
+
+static void parse_setup_cpu_list(void)
+{
+ struct thread_data *td;
+ char *str0, *str;
+ int t;
+
+ if (!g->p.cpu_list_str)
+ return;
+
+ dprintf("g->p.nr_tasks: %d\n", g->p.nr_tasks);
+
+ str0 = str = strdup(g->p.cpu_list_str);
+ t = 0;
+
+ BUG_ON(!str);
+
+ tprintf("# binding tasks to CPUs:\n");
+ tprintf("# ");
+
+ while (true) {
+ int bind_cpu, bind_cpu_0, bind_cpu_1;
+ char *tok, *tok_end, *tok_step, *tok_len, *tok_mul;
+ int bind_len;
+ int step;
+ int mul;
+
+ tok = strsep(&str, ",");
+ if (!tok)
+ break;
+
+ tok_end = strstr(tok, "-");
+
+ dprintf("\ntoken: {%s}, end: {%s}\n", tok, tok_end);
+ if (!tok_end) {
+ /* Single CPU specified: */
+ bind_cpu_0 = bind_cpu_1 = atol(tok);
+ } else {
+ /* CPU range specified (for example: "5-11"): */
+ bind_cpu_0 = atol(tok);
+ bind_cpu_1 = atol(tok_end + 1);
+ }
+
+ step = 1;
+ tok_step = strstr(tok, "#");
+ if (tok_step) {
+ step = atol(tok_step + 1);
+ BUG_ON(step <= 0 || step >= g->p.nr_cpus);
+ }
+
+ /*
+ * Mask length.
+ * Eg: "--cpus 8_4-16#4" means: '--cpus 8_4,12_4,16_4',
+ * where the _4 means the next 4 CPUs are allowed.
+ */
+ bind_len = 1;
+ tok_len = strstr(tok, "_");
+ if (tok_len) {
+ bind_len = atol(tok_len + 1);
+ BUG_ON(bind_len <= 0 || bind_len > g->p.nr_cpus);
+ }
+
+ /* Multiplicator shortcut, "0x8" is a shortcut for: "0,0,0,0,0,0,0,0" */
+ mul = 1;
+ tok_mul = strstr(tok, "x");
+ if (tok_mul) {
+ mul = atol(tok_mul + 1);
+ BUG_ON(mul <= 0);
+ }
+
+ dprintf("CPUs: %d_%d-%d#%dx%d\n", bind_cpu_0, bind_len, bind_cpu_1, step, mul);
+
+ BUG_ON(bind_cpu_0 < 0 || bind_cpu_0 >= g->p.nr_cpus);
+ BUG_ON(bind_cpu_1 < 0 || bind_cpu_1 >= g->p.nr_cpus);
+ BUG_ON(bind_cpu_0 > bind_cpu_1);
+
+ for (bind_cpu = bind_cpu_0; bind_cpu <= bind_cpu_1; bind_cpu += step) {
+ int i;
+
+ for (i = 0; i < mul; i++) {
+ int cpu;
+
+ if (t >= g->p.nr_tasks) {
+ printf("\n# NOTE: ignoring bind CPUs starting at CPU#%d\n #", bind_cpu);
+ goto out;
+ }
+ td = g->threads + t;
+
+ if (t)
+ tprintf(",");
+ if (bind_len > 1) {
+ tprintf("%2d/%d", bind_cpu, bind_len);
+ } else {
+ tprintf("%2d", bind_cpu);
+ }
+
+ CPU_ZERO(&td->bind_cpumask);
+ for (cpu = bind_cpu; cpu < bind_cpu+bind_len; cpu++) {
+ BUG_ON(cpu < 0 || cpu >= g->p.nr_cpus);
+ CPU_SET(cpu, &td->bind_cpumask);
+ }
+ t++;
+ }
+ }
+ }
+out:
+
+ tprintf("\n");
+
+ if (t < g->p.nr_tasks)
+ printf("# NOTE: %d tasks bound, %d tasks unbound\n", t, g->p.nr_tasks - t);
+
+ free(str0);
+}
+
+static int parse_cpus_opt(const struct option *opt __maybe_unused,
+ const char *arg, int unset __maybe_unused)
+{
+ if (!arg)
+ return -1;
+
+ return parse_cpu_list(arg);
+}
+
+static int parse_node_list(const char *arg)
+{
+ p0.node_list_str = strdup(arg);
+
+ dprintf("got NODE list: {%s}\n", p0.node_list_str);
+
+ return 0;
+}
+
+static void parse_setup_node_list(void)
+{
+ struct thread_data *td;
+ char *str0, *str;
+ int t;
+
+ if (!g->p.node_list_str)
+ return;
+
+ dprintf("g->p.nr_tasks: %d\n", g->p.nr_tasks);
+
+ str0 = str = strdup(g->p.node_list_str);
+ t = 0;
+
+ BUG_ON(!str);
+
+ tprintf("# binding tasks to NODEs:\n");
+ tprintf("# ");
+
+ while (true) {
+ int bind_node, bind_node_0, bind_node_1;
+ char *tok, *tok_end, *tok_step, *tok_mul;
+ int step;
+ int mul;
+
+ tok = strsep(&str, ",");
+ if (!tok)
+ break;
+
+ tok_end = strstr(tok, "-");
+
+ dprintf("\ntoken: {%s}, end: {%s}\n", tok, tok_end);
+ if (!tok_end) {
+ /* Single NODE specified: */
+ bind_node_0 = bind_node_1 = atol(tok);
+ } else {
+ /* NODE range specified (for example: "5-11"): */
+ bind_node_0 = atol(tok);
+ bind_node_1 = atol(tok_end + 1);
+ }
+
+ step = 1;
+ tok_step = strstr(tok, "#");
+ if (tok_step) {
+ step = atol(tok_step + 1);
+ BUG_ON(step <= 0 || step >= g->p.nr_nodes);
+ }
+
+ /* Multiplicator shortcut, "0x8" is a shortcut for: "0,0,0,0,0,0,0,0" */
+ mul = 1;
+ tok_mul = strstr(tok, "x");
+ if (tok_mul) {
+ mul = atol(tok_mul + 1);
+ BUG_ON(mul <= 0);
+ }
+
+ dprintf("NODEs: %d-%d #%d\n", bind_node_0, bind_node_1, step);
+
+ BUG_ON(bind_node_0 < 0 || bind_node_0 >= g->p.nr_nodes);
+ BUG_ON(bind_node_1 < 0 || bind_node_1 >= g->p.nr_nodes);
+ BUG_ON(bind_node_0 > bind_node_1);
+
+ for (bind_node = bind_node_0; bind_node <= bind_node_1; bind_node += step) {
+ int i;
+
+ for (i = 0; i < mul; i++) {
+ if (t >= g->p.nr_tasks) {
+ printf("\n# NOTE: ignoring bind NODEs starting at NODE#%d\n", bind_node);
+ goto out;
+ }
+ td = g->threads + t;
+
+ if (!t)
+ tprintf(" %2d", bind_node);
+ else
+ tprintf(",%2d", bind_node);
+
+ td->bind_node = bind_node;
+ t++;
+ }
+ }
+ }
+out:
+
+ tprintf("\n");
+
+ if (t < g->p.nr_tasks)
+ printf("# NOTE: %d tasks mem-bound, %d tasks unbound\n", t, g->p.nr_tasks - t);
+
+ free(str0);
+}
+
+static int parse_nodes_opt(const struct option *opt __maybe_unused,
+ const char *arg, int unset __maybe_unused)
+{
+ if (!arg)
+ return -1;
+
+ return parse_node_list(arg);
+}
+
+#define BIT(x) (1ul << x)
+
+static inline uint32_t lfsr_32(uint32_t lfsr)
+{
+ const uint32_t taps = BIT(1) | BIT(5) | BIT(6) | BIT(31);
+ return (lfsr>>1) ^ ((0x0u - (lfsr & 0x1u)) & taps);
+}
+
+/*
+ * Make sure there's real data dependency to RAM (when read
+ * accesses are enabled), so the compiler, the CPU and the
+ * kernel (KSM, zero page, etc.) cannot optimize away RAM
+ * accesses:
+ */
+static inline u64 access_data(u64 *data __attribute__((unused)), u64 val)
+{
+ if (g->p.data_reads)
+ val += *data;
+ if (g->p.data_writes)
+ *data = val + 1;
+ return val;
+}
+
+/*
+ * The worker process does two types of work, a forwards going
+ * loop and a backwards going loop.
+ *
+ * We do this so that on multiprocessor systems we do not create
+ * a 'train' of processing, with highly synchronized processes,
+ * skewing the whole benchmark.
+ */
+static u64 do_work(u8 *__data, long bytes, int nr, int nr_max, int loop, u64 val)
+{
+ long words = bytes/sizeof(u64);
+ u64 *data = (void *)__data;
+ long chunk_0, chunk_1;
+ u64 *d0, *d, *d1;
+ long off;
+ long i;
+
+ BUG_ON(!data && words);
+ BUG_ON(data && !words);
+
+ if (!data)
+ return val;
+
+ /* Very simple memset() work variant: */
+ if (g->p.data_zero_memset && !g->p.data_rand_walk) {
+ bzero(data, bytes);
+ return val;
+ }
+
+ /* Spread out by PID/TID nr and by loop nr: */
+ chunk_0 = words/nr_max;
+ chunk_1 = words/g->p.nr_loops;
+ off = nr*chunk_0 + loop*chunk_1;
+
+ while (off >= words)
+ off -= words;
+
+ if (g->p.data_rand_walk) {
+ u32 lfsr = nr + loop + val;
+ int j;
+
+ for (i = 0; i < words/1024; i++) {
+ long start, end;
+
+ lfsr = lfsr_32(lfsr);
+
+ start = lfsr % words;
+ end = min(start + 1024, words-1);
+
+ if (g->p.data_zero_memset) {
+ bzero(data + start, (end-start) * sizeof(u64));
+ } else {
+ for (j = start; j < end; j++)
+ val = access_data(data + j, val);
+ }
+ }
+ } else if (!g->p.data_backwards || (nr + loop) & 1) {
+
+ d0 = data + off;
+ d = data + off + 1;
+ d1 = data + words;
+
+ /* Process data forwards: */
+ for (;;) {
+ if (unlikely(d >= d1))
+ d = data;
+ if (unlikely(d == d0))
+ break;
+
+ val = access_data(d, val);
+
+ d++;
+ }
+ } else {
+ /* Process data backwards: */
+
+ d0 = data + off;
+ d = data + off - 1;
+ d1 = data + words;
+
+ /* Process data backwards: */
+ for (;;) {
+ if (unlikely(d < data))
+ d = data + words-1;
+ if (unlikely(d == d0))
+ break;
+
+ val = access_data(d, val);
+
+ d--;
+ }
+ }
+
+ return val;
+}
+
+static void update_curr_cpu(int task_nr, unsigned long bytes_worked)
+{
+ unsigned int cpu;
+
+ cpu = sched_getcpu();
+
+ g->threads[task_nr].curr_cpu = cpu;
+ prctl(0, bytes_worked);
+}
+
+#define MAX_NR_NODES 64
+
+/*
+ * Count the number of nodes a process's threads
+ * are spread out on.
+ *
+ * A count of 1 means that the process is compressed
+ * to a single node. A count of g->p.nr_nodes means it's
+ * spread out on the whole system.
+ */
+static int count_process_nodes(int process_nr)
+{
+ char node_present[MAX_NR_NODES] = { 0, };
+ int nodes;
+ int n, t;
+
+ for (t = 0; t < g->p.nr_threads; t++) {
+ struct thread_data *td;
+ int task_nr;
+ int node;
+
+ task_nr = process_nr*g->p.nr_threads + t;
+ td = g->threads + task_nr;
+
+ node = numa_node_of_cpu(td->curr_cpu);
+ node_present[node] = 1;
+ }
+
+ nodes = 0;
+
+ for (n = 0; n < MAX_NR_NODES; n++)
+ nodes += node_present[n];
+
+ return nodes;
+}
+
+/*
+ * Count the number of distinct process-threads a node contains.
+ *
+ * A count of 1 means that the node contains only a single
+ * process. If all nodes on the system contain at most one
+ * process then we are well-converged.
+ */
+static int count_node_processes(int node)
+{
+ int processes = 0;
+ int t, p;
+
+ for (p = 0; p < g->p.nr_proc; p++) {
+ for (t = 0; t < g->p.nr_threads; t++) {
+ struct thread_data *td;
+ int task_nr;
+ int n;
+
+ task_nr = p*g->p.nr_threads + t;
+ td = g->threads + task_nr;
+
+ n = numa_node_of_cpu(td->curr_cpu);
+ if (n == node) {
+ processes++;
+ break;
+ }
+ }
+ }
+
+ return processes;
+}
+
+static void calc_convergence_compression(int *strong)
+{
+ unsigned int nodes_min, nodes_max;
+ int p;
+
+ nodes_min = -1;
+ nodes_max = 0;
+
+ for (p = 0; p < g->p.nr_proc; p++) {
+ unsigned int nodes = count_process_nodes(p);
+
+ nodes_min = min(nodes, nodes_min);
+ nodes_max = max(nodes, nodes_max);
+ }
+
+ /* Strong convergence: all threads compress on a single node: */
+ if (nodes_min == 1 && nodes_max == 1) {
+ *strong = 1;
+ } else {
+ *strong = 0;
+ tprintf(" {%d-%d}", nodes_min, nodes_max);
+ }
+}
+
+static void calc_convergence(double runtime_ns_max, double *convergence)
+{
+ unsigned int loops_done_min, loops_done_max;
+ int process_groups;
+ int nodes[MAX_NR_NODES];
+ int distance;
+ int nr_min;
+ int nr_max;
+ int strong;
+ int sum;
+ int nr;
+ int node;
+ int cpu;
+ int t;
+
+ if (!g->p.show_convergence && !g->p.measure_convergence)
+ return;
+
+ for (node = 0; node < g->p.nr_nodes; node++)
+ nodes[node] = 0;
+
+ loops_done_min = -1;
+ loops_done_max = 0;
+
+ for (t = 0; t < g->p.nr_tasks; t++) {
+ struct thread_data *td = g->threads + t;
+ unsigned int loops_done;
+
+ cpu = td->curr_cpu;
+
+ /* Not all threads have written it yet: */
+ if (cpu < 0)
+ continue;
+
+ node = numa_node_of_cpu(cpu);
+
+ nodes[node]++;
+
+ loops_done = td->loops_done;
+ loops_done_min = min(loops_done, loops_done_min);
+ loops_done_max = max(loops_done, loops_done_max);
+ }
+
+ nr_max = 0;
+ nr_min = g->p.nr_tasks;
+ sum = 0;
+
+ for (node = 0; node < g->p.nr_nodes; node++) {
+ nr = nodes[node];
+ nr_min = min(nr, nr_min);
+ nr_max = max(nr, nr_max);
+ sum += nr;
+ }
+ BUG_ON(nr_min > nr_max);
+
+ BUG_ON(sum > g->p.nr_tasks);
+
+ if (0 && (sum < g->p.nr_tasks))
+ return;
+
+ /*
+ * Count the number of distinct process groups present
+ * on nodes - when we are converged this will decrease
+ * to g->p.nr_proc:
+ */
+ process_groups = 0;
+
+ for (node = 0; node < g->p.nr_nodes; node++) {
+ int processes = count_node_processes(node);
+
+ nr = nodes[node];
+ tprintf(" %2d/%-2d", nr, processes);
+
+ process_groups += processes;
+ }
+
+ distance = nr_max - nr_min;
+
+ tprintf(" [%2d/%-2d]", distance, process_groups);
+
+ tprintf(" l:%3d-%-3d (%3d)",
+ loops_done_min, loops_done_max, loops_done_max-loops_done_min);
+
+ if (loops_done_min && loops_done_max) {
+ double skew = 1.0 - (double)loops_done_min/loops_done_max;
+
+ tprintf(" [%4.1f%%]", skew * 100.0);
+ }
+
+ calc_convergence_compression(&strong);
+
+ if (strong && process_groups == g->p.nr_proc) {
+ if (!*convergence) {
+ *convergence = runtime_ns_max;
+ tprintf(" (%6.1fs converged)\n", *convergence/1e9);
+ if (g->p.measure_convergence) {
+ g->all_converged = true;
+ g->stop_work = true;
+ }
+ }
+ } else {
+ if (*convergence) {
+ tprintf(" (%6.1fs de-converged)", runtime_ns_max/1e9);
+ *convergence = 0;
+ }
+ tprintf("\n");
+ }
+}
+
+static void show_summary(double runtime_ns_max, int l, double *convergence)
+{
+ tprintf("\r # %5.1f%% [%.1f mins]",
+ (double)(l+1)/g->p.nr_loops*100.0, runtime_ns_max/1e9 / 60.0);
+
+ calc_convergence(runtime_ns_max, convergence);
+
+ if (g->p.show_details >= 0)
+ fflush(stdout);
+}
+
+static void *worker_thread(void *__tdata)
+{
+ struct thread_data *td = __tdata;
+ struct timeval start0, start, stop, diff;
+ int process_nr = td->process_nr;
+ int thread_nr = td->thread_nr;
+ unsigned long last_perturbance;
+ int task_nr = td->task_nr;
+ int details = g->p.show_details;
+ int first_task, last_task;
+ double convergence = 0;
+ u64 val = td->val;
+ double runtime_ns_max;
+ u8 *global_data;
+ u8 *process_data;
+ u8 *thread_data;
+ u64 bytes_done;
+ long work_done;
+ u32 l;
+
+ bind_to_cpumask(td->bind_cpumask);
+ bind_to_memnode(td->bind_node);
+
+ set_taskname("thread %d/%d", process_nr, thread_nr);
+
+ global_data = g->data;
+ process_data = td->process_data;
+ thread_data = setup_private_data(g->p.bytes_thread);
+
+ bytes_done = 0;
+
+ last_task = 0;
+ if (process_nr == g->p.nr_proc-1 && thread_nr == g->p.nr_threads-1)
+ last_task = 1;
+
+ first_task = 0;
+ if (process_nr == 0 && thread_nr == 0)
+ first_task = 1;
+
+ if (details >= 2) {
+ printf("# thread %2d / %2d global mem: %p, process mem: %p, thread mem: %p\n",
+ process_nr, thread_nr, global_data, process_data, thread_data);
+ }
+
+ if (g->p.serialize_startup) {
+ pthread_mutex_lock(&g->startup_mutex);
+ g->nr_tasks_started++;
+ pthread_mutex_unlock(&g->startup_mutex);
+
+ /* Here we will wait for the main process to start us all at once: */
+ pthread_mutex_lock(&g->start_work_mutex);
+ g->nr_tasks_working++;
+
+ /* Last one wakes the main process: */
+ if (g->nr_tasks_working == g->p.nr_tasks)
+ pthread_mutex_unlock(&g->startup_done_mutex);
+
+ pthread_mutex_unlock(&g->start_work_mutex);
+ }
+
+ gettimeofday(&start0, NULL);
+
+ start = stop = start0;
+ last_perturbance = start.tv_sec;
+
+ for (l = 0; l < g->p.nr_loops; l++) {
+ start = stop;
+
+ if (g->stop_work)
+ break;
+
+ val += do_work(global_data, g->p.bytes_global, process_nr, g->p.nr_proc, l, val);
+ val += do_work(process_data, g->p.bytes_process, thread_nr, g->p.nr_threads, l, val);
+ val += do_work(thread_data, g->p.bytes_thread, 0, 1, l, val);
+
+ if (g->p.sleep_usecs) {
+ pthread_mutex_lock(td->process_lock);
+ usleep(g->p.sleep_usecs);
+ pthread_mutex_unlock(td->process_lock);
+ }
+ /*
+ * Amount of work to be done under a process-global lock:
+ */
+ if (g->p.bytes_process_locked) {
+ pthread_mutex_lock(td->process_lock);
+ val += do_work(process_data, g->p.bytes_process_locked, thread_nr, g->p.nr_threads, l, val);
+ pthread_mutex_unlock(td->process_lock);
+ }
+
+ work_done = g->p.bytes_global + g->p.bytes_process +
+ g->p.bytes_process_locked + g->p.bytes_thread;
+
+ update_curr_cpu(task_nr, work_done);
+ bytes_done += work_done;
+
+ if (details < 0 && !g->p.perturb_secs && !g->p.measure_convergence && !g->p.nr_secs)
+ continue;
+
+ td->loops_done = l;
+
+ gettimeofday(&stop, NULL);
+
+ /* Check whether our max runtime timed out: */
+ if (g->p.nr_secs) {
+ timersub(&stop, &start0, &diff);
+ if (diff.tv_sec >= g->p.nr_secs) {
+ g->stop_work = true;
+ break;
+ }
+ }
+
+ /* Update the summary at most once per second: */
+ if (start.tv_sec == stop.tv_sec)
+ continue;
+
+ /*
+ * Perturb the first task's equilibrium every g->p.perturb_secs seconds,
+ * by migrating to CPU#0:
+ */
+ if (first_task && g->p.perturb_secs && (int)(stop.tv_sec - last_perturbance) >= g->p.perturb_secs) {
+ cpu_set_t orig_mask;
+ int target_cpu;
+ int this_cpu;
+
+ last_perturbance = stop.tv_sec;
+
+ /*
+ * Depending on where we are running, move into
+ * the other half of the system, to create some
+ * real disturbance:
+ */
+ this_cpu = g->threads[task_nr].curr_cpu;
+ if (this_cpu < g->p.nr_cpus/2)
+ target_cpu = g->p.nr_cpus-1;
+ else
+ target_cpu = 0;
+
+ orig_mask = bind_to_cpu(target_cpu);
+
+ /* Here we are running on the target CPU already */
+ if (details >= 1)
+ printf(" (injecting perturbalance, moved to CPU#%d)\n", target_cpu);
+
+ bind_to_cpumask(orig_mask);
+ }
+
+ if (details >= 3) {
+ timersub(&stop, &start, &diff);
+ runtime_ns_max = diff.tv_sec * 1000000000;
+ runtime_ns_max += diff.tv_usec * 1000;
+
+ if (details >= 0) {
+ printf(" #%2d / %2d: %14.2lf nsecs/op [val: %016lx]\n",
+ process_nr, thread_nr, runtime_ns_max / bytes_done, val);
+ }
+ fflush(stdout);
+ }
+ if (!last_task)
+ continue;
+
+ timersub(&stop, &start0, &diff);
+ runtime_ns_max = diff.tv_sec * 1000000000ULL;
+ runtime_ns_max += diff.tv_usec * 1000ULL;
+
+ show_summary(runtime_ns_max, l, &convergence);
+ }
+
+ gettimeofday(&stop, NULL);
+ timersub(&stop, &start0, &diff);
+ td->runtime_ns = diff.tv_sec * 1000000000ULL;
+ td->runtime_ns += diff.tv_usec * 1000ULL;
+
+ free_data(thread_data, g->p.bytes_thread);
+
+ pthread_mutex_lock(&g->stop_work_mutex);
+ g->bytes_done += bytes_done;
+ pthread_mutex_unlock(&g->stop_work_mutex);
+
+ return NULL;
+}
+
+/*
+ * A worker process starts a couple of threads:
+ */
+static void worker_process(int process_nr)
+{
+ pthread_mutex_t process_lock;
+ struct thread_data *td;
+ pthread_t *pthreads;
+ u8 *process_data;
+ int task_nr;
+ int ret;
+ int t;
+
+ pthread_mutex_init(&process_lock, NULL);
+ set_taskname("process %d", process_nr);
+
+ /*
+ * Pick up the memory policy and the CPU binding of our first thread,
+ * so that we initialize memory accordingly:
+ */
+ task_nr = process_nr*g->p.nr_threads;
+ td = g->threads + task_nr;
+
+ bind_to_memnode(td->bind_node);
+ bind_to_cpumask(td->bind_cpumask);
+
+ pthreads = zalloc(g->p.nr_threads * sizeof(pthread_t));
+ process_data = setup_private_data(g->p.bytes_process);
+
+ if (g->p.show_details >= 3) {
+ printf(" # process %2d global mem: %p, process mem: %p\n",
+ process_nr, g->data, process_data);
+ }
+
+ for (t = 0; t < g->p.nr_threads; t++) {
+ task_nr = process_nr*g->p.nr_threads + t;
+ td = g->threads + task_nr;
+
+ td->process_data = process_data;
+ td->process_nr = process_nr;
+ td->thread_nr = t;
+ td->task_nr = task_nr;
+ td->val = rand();
+ td->curr_cpu = -1;
+ td->process_lock = &process_lock;
+
+ ret = pthread_create(pthreads + t, NULL, worker_thread, td);
+ BUG_ON(ret);
+ }
+
+ for (t = 0; t < g->p.nr_threads; t++) {
+ ret = pthread_join(pthreads[t], NULL);
+ BUG_ON(ret);
+ }
+
+ free_data(process_data, g->p.bytes_process);
+ free(pthreads);
+}
+
+static void print_summary(void)
+{
+ if (g->p.show_details < 0)
+ return;
+
+ printf("\n ###\n");
+ printf(" # %d %s will execute (on %d nodes, %d CPUs):\n",
+ g->p.nr_tasks, g->p.nr_tasks == 1 ? "task" : "tasks", g->p.nr_nodes, g->p.nr_cpus);
+ printf(" # %5dx %5ldMB global shared mem operations\n",
+ g->p.nr_loops, g->p.bytes_global/1024/1024);
+ printf(" # %5dx %5ldMB process shared mem operations\n",
+ g->p.nr_loops, g->p.bytes_process/1024/1024);
+ printf(" # %5dx %5ldMB thread local mem operations\n",
+ g->p.nr_loops, g->p.bytes_thread/1024/1024);
+
+ printf(" ###\n");
+
+ printf("\n ###\n"); fflush(stdout);
+}
+
+static void init_thread_data(void)
+{
+ ssize_t size = sizeof(*g->threads)*g->p.nr_tasks;
+ int t;
+
+ g->threads = zalloc_shared_data(size);
+
+ for (t = 0; t < g->p.nr_tasks; t++) {
+ struct thread_data *td = g->threads + t;
+ int cpu;
+
+ /* Allow all nodes by default: */
+ td->bind_node = -1;
+
+ /* Allow all CPUs by default: */
+ CPU_ZERO(&td->bind_cpumask);
+ for (cpu = 0; cpu < g->p.nr_cpus; cpu++)
+ CPU_SET(cpu, &td->bind_cpumask);
+ }
+}
+
+static void deinit_thread_data(void)
+{
+ ssize_t size = sizeof(*g->threads)*g->p.nr_tasks;
+
+ free_data(g->threads, size);
+}
+
+static int init(void)
+{
+ g = (void *)alloc_data(sizeof(*g), MAP_SHARED, 1, 0, 0 /* THP */, 0);
+
+ /* Copy over options: */
+ g->p = p0;
+
+ g->p.nr_cpus = numa_num_configured_cpus();
+
+ g->p.nr_nodes = numa_max_node() + 1;
+
+ /* char array in count_process_nodes(): */
+ BUG_ON(g->p.nr_nodes > MAX_NR_NODES || g->p.nr_nodes < 0);
+
+ if (g->p.show_quiet && !g->p.show_details)
+ g->p.show_details = -1;
+
+ /* Some memory should be specified: */
+ if (!g->p.mb_global_str && !g->p.mb_proc_str && !g->p.mb_thread_str)
+ return -1;
+
+ if (g->p.mb_global_str) {
+ g->p.mb_global = atof(g->p.mb_global_str);
+ BUG_ON(g->p.mb_global < 0);
+ }
+
+ if (g->p.mb_proc_str) {
+ g->p.mb_proc = atof(g->p.mb_proc_str);
+ BUG_ON(g->p.mb_proc < 0);
+ }
+
+ if (g->p.mb_proc_locked_str) {
+ g->p.mb_proc_locked = atof(g->p.mb_proc_locked_str);
+ BUG_ON(g->p.mb_proc_locked < 0);
+ BUG_ON(g->p.mb_proc_locked > g->p.mb_proc);
+ }
+
+ if (g->p.mb_thread_str) {
+ g->p.mb_thread = atof(g->p.mb_thread_str);
+ BUG_ON(g->p.mb_thread < 0);
+ }
+
+ BUG_ON(g->p.nr_threads <= 0);
+ BUG_ON(g->p.nr_proc <= 0);
+
+ g->p.nr_tasks = g->p.nr_proc*g->p.nr_threads;
+
+ g->p.bytes_global = g->p.mb_global *1024L*1024L;
+ g->p.bytes_process = g->p.mb_proc *1024L*1024L;
+ g->p.bytes_process_locked = g->p.mb_proc_locked *1024L*1024L;
+ g->p.bytes_thread = g->p.mb_thread *1024L*1024L;
+
+ g->data = setup_shared_data(g->p.bytes_global);
+
+ /* Startup serialization: */
+ init_global_mutex(&g->start_work_mutex);
+ init_global_mutex(&g->startup_mutex);
+ init_global_mutex(&g->startup_done_mutex);
+ init_global_mutex(&g->stop_work_mutex);
+
+ init_thread_data();
+
+ tprintf("#\n");
+ parse_setup_cpu_list();
+ parse_setup_node_list();
+ tprintf("#\n");
+
+ print_summary();
+
+ return 0;
+}
+
+static void deinit(void)
+{
+ free_data(g->data, g->p.bytes_global);
+ g->data = NULL;
+
+ deinit_thread_data();
+
+ free_data(g, sizeof(*g));
+ g = NULL;
+}
+
+/*
+ * Print a short or long result, depending on the verbosity setting:
+ */
+static void print_res(const char *name, double val,
+ const char *txt_unit, const char *txt_short, const char *txt_long)
+{
+ if (!name)
+ name = "main,";
+
+ if (g->p.show_quiet)
+ printf(" %-30s %15.3f, %-15s %s\n", name, val, txt_unit, txt_short);
+ else
+ printf(" %14.3f %s\n", val, txt_long);
+}
+
+static int __bench_numa(const char *name)
+{
+ struct timeval start, stop, diff;
+ u64 runtime_ns_min, runtime_ns_sum;
+ pid_t *pids, pid, wpid;
+ double delta_runtime;
+ double runtime_avg;
+ double runtime_sec_max;
+ double runtime_sec_min;
+ int wait_stat;
+ double bytes;
+ int i, t;
+
+ if (init())
+ return -1;
+
+ pids = zalloc(g->p.nr_proc * sizeof(*pids));
+ pid = -1;
+
+ /* All threads try to acquire it, this way we can wait for them to start up: */
+ pthread_mutex_lock(&g->start_work_mutex);
+
+ if (g->p.serialize_startup) {
+ tprintf(" #\n");
+ tprintf(" # Startup synchronization: ..."); fflush(stdout);
+ }
+
+ gettimeofday(&start, NULL);
+
+ for (i = 0; i < g->p.nr_proc; i++) {
+ pid = fork();
+ dprintf(" # process %2d: PID %d\n", i, pid);
+
+ BUG_ON(pid < 0);
+ if (!pid) {
+ /* Child process: */
+ worker_process(i);
+
+ exit(0);
+ }
+ pids[i] = pid;
+
+ }
+ /* Wait for all the threads to start up: */
+ while (g->nr_tasks_started != g->p.nr_tasks)
+ usleep(1000);
+
+ BUG_ON(g->nr_tasks_started != g->p.nr_tasks);
+
+ if (g->p.serialize_startup) {
+ double startup_sec;
+
+ pthread_mutex_lock(&g->startup_done_mutex);
+
+ /* This will start all threads: */
+ pthread_mutex_unlock(&g->start_work_mutex);
+
+ /* This mutex is locked - the last started thread will wake us: */
+ pthread_mutex_lock(&g->startup_done_mutex);
+
+ gettimeofday(&stop, NULL);
+
+ timersub(&stop, &start, &diff);
+
+ startup_sec = diff.tv_sec * 1000000000.0;
+ startup_sec += diff.tv_usec * 1000.0;
+ startup_sec /= 1e9;
+
+ tprintf(" threads initialized in %.6f seconds.\n", startup_sec);
+ tprintf(" #\n");
+
+ start = stop;
+ pthread_mutex_unlock(&g->startup_done_mutex);
+ } else {
+ gettimeofday(&start, NULL);
+ }
+
+ /* Parent process: */
+
+
+ for (i = 0; i < g->p.nr_proc; i++) {
+ wpid = waitpid(pids[i], &wait_stat, 0);
+ BUG_ON(wpid < 0);
+ BUG_ON(!WIFEXITED(wait_stat));
+
+ }
+
+ runtime_ns_sum = 0;
+ runtime_ns_min = -1LL;
+
+ for (t = 0; t < g->p.nr_tasks; t++) {
+ u64 thread_runtime_ns = g->threads[t].runtime_ns;
+
+ runtime_ns_sum += thread_runtime_ns;
+ runtime_ns_min = min(thread_runtime_ns, runtime_ns_min);
+ }
+
+ gettimeofday(&stop, NULL);
+ timersub(&stop, &start, &diff);
+
+ BUG_ON(bench_format != BENCH_FORMAT_DEFAULT);
+
+ tprintf("\n ###\n");
+ tprintf("\n");
+
+ runtime_sec_max = diff.tv_sec * 1000000000.0;
+ runtime_sec_max += diff.tv_usec * 1000.0;
+ runtime_sec_max /= 1e9;
+
+ runtime_sec_min = runtime_ns_min/1e9;
+
+ bytes = g->bytes_done;
+ runtime_avg = (double)runtime_ns_sum / g->p.nr_tasks / 1e9;
+
+ if (g->p.measure_convergence) {
+ print_res(name, runtime_sec_max,
+ "secs,", "NUMA-convergence-latency", "secs latency to NUMA-converge");
+ }
+
+ print_res(name, runtime_sec_max,
+ "secs,", "runtime-max/thread", "secs slowest (max) thread-runtime");
+
+ print_res(name, runtime_sec_min,
+ "secs,", "runtime-min/thread", "secs fastest (min) thread-runtime");
+
+ print_res(name, runtime_avg,
+ "secs,", "runtime-avg/thread", "secs average thread-runtime");
+
+ delta_runtime = (runtime_sec_max - runtime_sec_min)/2.0;
+ print_res(name, delta_runtime / runtime_sec_max * 100.0,
+ "%,", "spread-runtime/thread", "% difference between max/avg runtime");
+
+ print_res(name, bytes / g->p.nr_tasks / 1e9,
+ "GB,", "data/thread", "GB data processed, per thread");
+
+ print_res(name, bytes / 1e9,
+ "GB,", "data-total", "GB data processed, total");
+
+ print_res(name, runtime_sec_max * 1e9 / (bytes / g->p.nr_tasks),
+ "nsecs,", "runtime/byte/thread","nsecs/byte/thread runtime");
+
+ print_res(name, bytes / g->p.nr_tasks / 1e9 / runtime_sec_max,
+ "GB/sec,", "thread-speed", "GB/sec/thread speed");
+
+ print_res(name, bytes / runtime_sec_max / 1e9,
+ "GB/sec,", "total-speed", "GB/sec total speed");
+
+ free(pids);
+
+ deinit();
+
+ return 0;
+}
+
+#define MAX_ARGS 50
+
+static int command_size(const char **argv)
+{
+ int size = 0;
+
+ while (*argv) {
+ size++;
+ argv++;
+ }
+
+ BUG_ON(size >= MAX_ARGS);
+
+ return size;
+}
+
+static void init_params(struct params *p, const char *name, int argc, const char **argv)
+{
+ int i;
+
+ printf("\n # Running %s \"perf bench numa", name);
+
+ for (i = 0; i < argc; i++)
+ printf(" %s", argv[i]);
+
+ printf("\"\n");
+
+ memset(p, 0, sizeof(*p));
+
+ /* Initialize nonzero defaults: */
+
+ p->serialize_startup = 1;
+ p->data_reads = true;
+ p->data_writes = true;
+ p->data_backwards = true;
+ p->data_rand_walk = true;
+ p->nr_loops = -1;
+ p->init_random = true;
+}
+
+static int run_bench_numa(const char *name, const char **argv)
+{
+ int argc = command_size(argv);
+
+ init_params(&p0, name, argc, argv);
+ argc = parse_options(argc, argv, options, bench_numa_usage, 0);
+ if (argc)
+ goto err;
+
+ if (__bench_numa(name))
+ goto err;
+
+ return 0;
+
+err:
+ usage_with_options(numa_usage, options);
+ return -1;
+}
+
+#define OPT_BW_RAM "-s", "20", "-zZq", "--thp", " 1", "--no-data_rand_walk"
+#define OPT_BW_RAM_NOTHP OPT_BW_RAM, "--thp", "-1"
+
+#define OPT_CONV "-s", "100", "-zZ0qcm", "--thp", " 1"
+#define OPT_CONV_NOTHP OPT_CONV, "--thp", "-1"
+
+#define OPT_BW "-s", "20", "-zZ0q", "--thp", " 1"
+#define OPT_BW_NOTHP OPT_BW, "--thp", "-1"
+
+/*
+ * The built-in test-suite executed by "perf bench numa -a".
+ *
+ * (A minimum of 4 nodes and 16 GB of RAM is recommended.)
+ */
+static const char *tests[][MAX_ARGS] = {
+ /* Basic single-stream NUMA bandwidth measurements: */
+ { "RAM-bw-local,", "mem", "-p", "1", "-t", "1", "-P", "1024",
+ "-C" , "0", "-M", "0", OPT_BW_RAM },
+ { "RAM-bw-local-NOTHP,",
+ "mem", "-p", "1", "-t", "1", "-P", "1024",
+ "-C" , "0", "-M", "0", OPT_BW_RAM_NOTHP },
+ { "RAM-bw-remote,", "mem", "-p", "1", "-t", "1", "-P", "1024",
+ "-C" , "0", "-M", "1", OPT_BW_RAM },
+
+ /* 2-stream NUMA bandwidth measurements: */
+ { "RAM-bw-local-2x,", "mem", "-p", "2", "-t", "1", "-P", "1024",
+ "-C", "0,2", "-M", "0x2", OPT_BW_RAM },
+ { "RAM-bw-remote-2x,", "mem", "-p", "2", "-t", "1", "-P", "1024",
+ "-C", "0,2", "-M", "1x2", OPT_BW_RAM },
+
+ /* Cross-stream NUMA bandwidth measurement: */
+ { "RAM-bw-cross,", "mem", "-p", "2", "-t", "1", "-P", "1024",
+ "-C", "0,8", "-M", "1,0", OPT_BW_RAM },
+
+ /* Convergence latency measurements: */
+ { " 1x3-convergence,", "mem", "-p", "1", "-t", "3", "-P", "512", OPT_CONV },
+ { " 1x4-convergence,", "mem", "-p", "1", "-t", "4", "-P", "512", OPT_CONV },
+ { " 1x6-convergence,", "mem", "-p", "1", "-t", "6", "-P", "1020", OPT_CONV },
+ { " 2x3-convergence,", "mem", "-p", "3", "-t", "3", "-P", "1020", OPT_CONV },
+ { " 3x3-convergence,", "mem", "-p", "3", "-t", "3", "-P", "1020", OPT_CONV },
+ { " 4x4-convergence,", "mem", "-p", "4", "-t", "4", "-P", "512", OPT_CONV },
+ { " 4x4-convergence-NOTHP,",
+ "mem", "-p", "4", "-t", "4", "-P", "512", OPT_CONV_NOTHP },
+ { " 4x6-convergence,", "mem", "-p", "4", "-t", "6", "-P", "1020", OPT_CONV },
+ { " 4x8-convergence,", "mem", "-p", "4", "-t", "8", "-P", "512", OPT_CONV },
+ { " 8x4-convergence,", "mem", "-p", "8", "-t", "4", "-P", "512", OPT_CONV },
+ { " 8x4-convergence-NOTHP,",
+ "mem", "-p", "8", "-t", "4", "-P", "512", OPT_CONV_NOTHP },
+ { " 3x1-convergence,", "mem", "-p", "3", "-t", "1", "-P", "512", OPT_CONV },
+ { " 4x1-convergence,", "mem", "-p", "4", "-t", "1", "-P", "512", OPT_CONV },
+ { " 8x1-convergence,", "mem", "-p", "8", "-t", "1", "-P", "512", OPT_CONV },
+ { "16x1-convergence,", "mem", "-p", "16", "-t", "1", "-P", "256", OPT_CONV },
+ { "32x1-convergence,", "mem", "-p", "32", "-t", "1", "-P", "128", OPT_CONV },
+
+ /* Various NUMA process/thread layout bandwidth measurements: */
+ { " 2x1-bw-process,", "mem", "-p", "2", "-t", "1", "-P", "1024", OPT_BW },
+ { " 3x1-bw-process,", "mem", "-p", "3", "-t", "1", "-P", "1024", OPT_BW },
+ { " 4x1-bw-process,", "mem", "-p", "4", "-t", "1", "-P", "1024", OPT_BW },
+ { " 8x1-bw-process,", "mem", "-p", "8", "-t", "1", "-P", " 512", OPT_BW },
+ { " 8x1-bw-process-NOTHP,",
+ "mem", "-p", "8", "-t", "1", "-P", " 512", OPT_BW_NOTHP },
+ { "16x1-bw-process,", "mem", "-p", "16", "-t", "1", "-P", "256", OPT_BW },
+
+ { " 4x1-bw-thread,", "mem", "-p", "1", "-t", "4", "-T", "256", OPT_BW },
+ { " 8x1-bw-thread,", "mem", "-p", "1", "-t", "8", "-T", "256", OPT_BW },
+ { "16x1-bw-thread,", "mem", "-p", "1", "-t", "16", "-T", "128", OPT_BW },
+ { "32x1-bw-thread,", "mem", "-p", "1", "-t", "32", "-T", "64", OPT_BW },
+
+ { " 2x3-bw-thread,", "mem", "-p", "2", "-t", "3", "-P", "512", OPT_BW },
+ { " 4x4-bw-thread,", "mem", "-p", "4", "-t", "4", "-P", "512", OPT_BW },
+ { " 4x6-bw-thread,", "mem", "-p", "4", "-t", "6", "-P", "512", OPT_BW },
+ { " 4x8-bw-thread,", "mem", "-p", "4", "-t", "8", "-P", "512", OPT_BW },
+ { " 4x8-bw-thread-NOTHP,",
+ "mem", "-p", "4", "-t", "8", "-P", "512", OPT_BW_NOTHP },
+ { " 3x3-bw-thread,", "mem", "-p", "3", "-t", "3", "-P", "512", OPT_BW },
+ { " 5x5-bw-thread,", "mem", "-p", "5", "-t", "5", "-P", "512", OPT_BW },
+
+ { "2x16-bw-thread,", "mem", "-p", "2", "-t", "16", "-P", "512", OPT_BW },
+ { "1x32-bw-thread,", "mem", "-p", "1", "-t", "32", "-P", "2048", OPT_BW },
+
+ { "numa02-bw,", "mem", "-p", "1", "-t", "32", "-T", "32", OPT_BW },
+ { "numa02-bw-NOTHP,", "mem", "-p", "1", "-t", "32", "-T", "32", OPT_BW_NOTHP },
+ { "numa01-bw-thread,", "mem", "-p", "2", "-t", "16", "-T", "192", OPT_BW },
+ { "numa01-bw-thread-NOTHP,",
+ "mem", "-p", "2", "-t", "16", "-T", "192", OPT_BW_NOTHP },
+};
+
+static int bench_all(void)
+{
+ int nr = ARRAY_SIZE(tests);
+ int ret;
+ int i;
+
+ ret = system("echo ' #'; echo ' # Running test on: '$(uname -a); echo ' #'");
+ BUG_ON(ret < 0);
+
+ for (i = 0; i < nr; i++) {
+ if (run_bench_numa(tests[i][0], tests[i] + 1))
+ return -1;
+ }
+
+ printf("\n");
+
+ return 0;
+}
+
+int bench_numa(int argc, const char **argv, const char *prefix __maybe_unused)
+{
+ init_params(&p0, "main,", argc, argv);
+ argc = parse_options(argc, argv, options, bench_numa_usage, 0);
+ if (argc)
+ goto err;
+
+ if (p0.run_all)
+ return bench_all();
+
+ if (__bench_numa(NULL))
+ goto err;
+
+ return 0;
+
+err:
+ usage_with_options(numa_usage, options);
+ return -1;
+}
diff --git a/tools/perf/builtin-bench.c b/tools/perf/builtin-bench.c
index cae9a5f..441cdb4 100644
--- a/tools/perf/builtin-bench.c
+++ b/tools/perf/builtin-bench.c
@@ -35,6 +35,16 @@ struct bench_suite {
/* sentinel: easy for help */
#define suite_all { "all", "Test all benchmark suites", NULL }
+static struct bench_suite numa_suites[] = {
+ { "mem",
+ "Benchmark for NUMA workloads",
+ bench_numa },
+ suite_all,
+ { NULL,
+ NULL,
+ NULL }
+};
+
static struct bench_suite sched_suites[] = {
{ "messaging",
"Benchmark for scheduler and IPC mechanisms",
@@ -68,6 +78,9 @@ struct bench_subsys {
};
static struct bench_subsys subsystems[] = {
+ { "numa",
+ "NUMA scheduling and MM behavior",
+ numa_suites },
{ "sched",
"scheduler and IPC mechanism",
sched_suites },
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 1278c2c..8b091a5 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -195,7 +195,7 @@ static inline int hist_entry__tui_annotate(struct hist_entry *self
return 0;
}
-static inline int script_browse(const char *script_opt)
+static inline int script_browse(const char *script_opt __maybe_unused)
{
return 0;
}
--
1.7.11.7
* NUMA performance comparison between three NUMA kernels and mainline. [Mid-size NUMA system edition.]
From: Ingo Molnar @ 2012-12-07 21:53 UTC
To: linux-kernel, linux-mm
Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins,
Arnaldo Carvalho de Melo, Frederic Weisbecker, Mike Galbraith
Here's a (strongly NUMA-centric) performance comparison of the
three NUMA kernels: the 'balancenuma-v10' tree from Mel, the
AutoNUMA-v28 kernel from Andrea and the unified NUMA -v3 tree
that Peter and I are working on.
The goal of these measurements is to specifically quantify the
NUMA optimization qualities of each of the three NUMA-optimizing
kernels.
There are lots of numbers in this mail and a lot of material
to read - sorry about that! :-/
I used the latest available kernel versions everywhere;
furthermore the AutoNUMA-v28 tree has been patched with Hugh
Dickins's THP-migration support patch, to make it a fair
apples-to-apples comparison.
I have used the 'perf bench numa' tool to do the measurements,
which can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/bench
# to build it, install numactl-dev[el] and do "cd tools/perf; make -j install"
To get the raw numbers I ran "perf bench numa mem -a" multiple
times on each kernel, on a 32-way, 64 GB RAM, 4-node Opteron
test-system. Each kernel used the same base .config, copied from
a Fedora RPM kernel, with the NUMA-balancing options enabled.
( Note that the testcases are tailored to my test-system: on
a smaller system you'd want to run slightly smaller testcases,
on a larger system you'd want to run a couple of larger
testcases as well. )
NUMA convergence latency measurements
-------------------------------------
'NUMA convergence' latency is the number of seconds a workload
takes to reach 'perfectly NUMA balanced' state. This is measured
on the CPU placement side: once it has converged then memory
typically follows within a couple of seconds.
Because convergence is not guaranteed, a 100 seconds latency
time-out is used in the benchmark. If you see a 100 seconds
result in the table it means that that particular NUMA kernel
did not manage to converge that workload unit test within 100
seconds.
The NxM notation describes the process/thread relationship: a 1x4
test is 1 process with 4 threads that share a workload - a 4x6 test
is 4 processes with 6 threads per process, where the processes are
isolated from each other but the threads within each process work
on the same working set.
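Concretely, the ' 4x6-convergence' unit test in the suite runs:

  perf bench numa mem -p 4 -t 6 -P 1020 -s 100 -zZ0qcm --thp 1

i.e. 24 worker threads in total on this 32-way box, each group of 6
threads sharing 1020 MB of -P process memory, with -c/-m convergence
measurement and the 100 seconds timeout described above.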
I used a wide set of test-cases I collected in the past:
[ Lower numbers are better. ]
[test unit] : v3.7 |balancenuma-v10| AutoNUMA-v28 | numa-u-v3 |
------------------------------------------------------------------------------------------
1x3-convergence : 100.1 | 100.0 | 0.2 | 2.3 | secs
1x4-convergence : 100.2 | 100.1 | 100.2 | 2.1 | secs
1x6-convergence : 100.3 | 100.4 | 100.8 | 7.3 | secs
2x3-convergence : 100.6 | 100.6 | 100.5 | 4.1 | secs
3x3-convergence : 100.6 | 100.5 | 100.5 | 7.6 | secs
4x4-convergence : 100.6 | 100.5 | 4.1 | 7.4 | secs
4x4-convergence-NOTHP : 101.1 | 100.5 | 12.2 | 9.2 | secs
4x6-convergence : 5.4 | 101.2 | 16.6 | 11.7 | secs
4x8-convergence : 101.1 | 101.3 | 3.4 | 3.9 | secs
8x4-convergence : 100.9 | 100.8 | 18.3 | 8.9 | secs
8x4-convergence-NOTHP : 101.9 | 101.0 | 15.7 | 12.1 | secs
3x1-convergence : 0.7 | 1.0 | 0.8 | 0.9 | secs
4x1-convergence : 0.6 | 0.8 | 0.8 | 0.7 | secs
8x1-convergence : 2.8 | 2.9 | 2.9 | 1.2 | secs
16x1-convergence : 3.5 | 3.7 | 2.5 | 2.0 | secs
32x1-convergence : 3.6 | 2.8 | 3.0 | 1.9 | secs
As expected, mainline only manages to converge workloads where
each worker process is isolated and the default
spread-to-all-nodes scheduling policy creates an ideal layout,
regardless of task ordering.
[ Note that the mainline kernel got a 'lucky strike' convergence
in the 4x6 workload: it's always possible for the workload
to accidentally converge. On a repeat test this did not occur,
but I did not erase the outlier because luck is a valid and
existing phenomenon. ]
The 'balancenuma' kernel does not converge any of the workloads
where worker threads or processes relate to each other.
AutoNUMA does pretty well, but it did not manage to converge for
4 testcases of shared, under-loaded workloads.
The unified NUMA-v3 tree converged well in every testcase.
NUMA workload bandwidth measurements
------------------------------------
The other set of numbers I've collected are workload bandwidth
measurements, run over 20 seconds. Using 20 seconds gives a
healthy mix of pre-convergence and post-convergence bandwidth,
giving the (non-trivial) expense of convergence and memory
migration a weight in the result as well. So these are not
'ideal' results with long runtimes where migration cost gets
averaged out.
[ The notation of the workloads is similar to the latency
measurements: for example "2x3" means 2 processes, 3 threads
per process. See the 'perf bench' tool for details. ]
The 'numa02' and 'numa01-THREAD' tests are AutoNUMA-benchmark
work-alike workloads, with a shorter runtime for numa01.
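( In the suite these map to "mem -p 1 -t 32 -T 32" for numa02-bw and
  "mem -p 2 -t 16 -T 192" for numa01-bw-thread, with the usual
  bandwidth options "-s 20 -zZ0q --thp 1" - see the tests[] table in
  the patch. )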
The results are:
[ Higher numbers are better. ]
[test unit] : v3.7 |balancenuma-v10| AutoNUMA-v28 | numa-u-v3 |
------------------------------------------------------------------------------------------
2x1-bw-process : 6.248| 6.136: -1.8%| 8.073: 29.2%| 9.647: 54.4%| GB/sec
3x1-bw-process : 7.292| 7.250: -0.6%| 12.583: 72.6%| 14.528: 99.2%| GB/sec
4x1-bw-process : 6.007| 6.867: 14.3%| 12.313: 105.0%| 18.903: 214.7%| GB/sec
8x1-bw-process : 6.100| 7.974: 30.7%| 20.237: 231.8%| 26.829: 339.8%| GB/sec
8x1-bw-process-NOTHP : 5.944| 5.937: -0.1%| 17.831: 200.0%| 22.237: 274.1%| GB/sec
16x1-bw-process : 5.607| 5.592: -0.3%| 5.959: 6.3%| 29.294: 422.5%| GB/sec
4x1-bw-thread : 6.035| 13.598: 125.3%| 17.443: 189.0%| 19.290: 219.6%| GB/sec
8x1-bw-thread : 5.941| 16.356: 175.3%| 22.433: 277.6%| 26.391: 344.2%| GB/sec
16x1-bw-thread : 5.648| 24.608: 335.7%| 20.204: 257.7%| 29.557: 423.3%| GB/sec
32x1-bw-thread : 5.929| 25.477: 329.7%| 18.230: 207.5%| 30.232: 409.9%| GB/sec
2x3-bw-thread : 5.756| 8.785: 52.6%| 14.652: 154.6%| 15.327: 166.3%| GB/sec
4x4-bw-thread : 5.605| 6.366: 13.6%| 9.835: 75.5%| 27.957: 398.8%| GB/sec
4x6-bw-thread : 5.771| 6.287: 8.9%| 15.372: 166.4%| 27.877: 383.1%| GB/sec
4x8-bw-thread : 5.858| 5.860: 0.0%| 11.865: 102.5%| 28.439: 385.5%| GB/sec
4x8-bw-thread-NOTHP : 5.645| 6.167: 9.2%| 9.224: 63.4%| 25.067: 344.1%| GB/sec
3x3-bw-thread : 5.937| 8.235: 38.7%| 6.635: 11.8%| 21.560: 263.1%| GB/sec
5x5-bw-thread : 5.771| 5.762: -0.2%| 9.575: 65.9%| 26.081: 351.9%| GB/sec
2x16-bw-thread : 5.953| 5.920: -0.6%| 5.945: -0.1%| 23.269: 290.9%| GB/sec
1x32-bw-thread : 5.879| 5.828: -0.9%| 5.848: -0.5%| 18.985: 222.9%| GB/sec
numa02-bw : 6.049| 29.054: 380.3%| 24.744: 309.1%| 31.431: 419.6%| GB/sec
numa02-bw-NOTHP : 5.850| 27.064: 362.6%| 20.415: 249.0%| 29.104: 397.5%| GB/sec
numa01-bw-thread : 5.834| 20.338: 248.6%| 15.169: 160.0%| 28.607: 390.3%| GB/sec
numa01-bw-thread-NOTHP : 5.581| 18.528: 232.0%| 12.108: 117.0%| 21.119: 278.4%| GB/sec
------------------------------------------------------------------------------------------
The first column shows the mainline kernel's bandwidth in GB/sec;
the following 3 columns show pairs of GB/sec bandwidth and
percentage results, where the percentage is the speed difference
relative to the mainline kernel.
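The percentages are plain speed ratios against the mainline
column; a minimal sketch of that calculation (illustrative only,
not taken from the perf code):

  def speedup_pct(kernel_gbps, mainline_gbps):
      # Speed difference relative to the mainline (v3.7) column, in percent.
      return (kernel_gbps / mainline_gbps - 1.0) * 100.0

  # Example, the 2x1-bw-process row:
  print("%.1f%%" % speedup_pct(8.073, 6.248))   # AutoNUMA-v28: ~29.2%
  print("%.1f%%" % speedup_pct(9.647, 6.248))   # numa-u-v3:    ~54.4%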
Noise is 1-2% in these tests with these durations, so the good
news is that none of the NUMA kernels regresses on these
workloads against the mainline kernel. Perhaps balancenuma's
"2x1-bw-process" and "3x1-bw-process" results might be worth a
closer look.
No kernel shows particular vulnerability to the NOTHP tests that
were mixed into the test stream.
As can be expected from the convergence latency results, the
'balancenuma' tree does well with workloads where there's no
relationship between threads - but even there it's outperformed
by the AutoNUMA kernel, and outperformed by an even larger
margin by the NUMA-v3 kernel. Workloads like the 4x JVM SPECjbb
on the other hand pose a challenge to the balancenuma kernel:
both the AutoNUMA and the NUMA-v3 kernels are several times
faster in those tests.
The AutoNUMA kernel does well in most workloads - its weakness
is system-wide shared workloads like 2x16-bw-thread and
1x32-bw-thread, where it falls back to mainline performance.
The NUMA-v3 kernel outperforms every other NUMA kernel.
Here's a direct comparison between the two fastest kernels, the
AutoNUMA and the NUMA-v3 kernels:
[ Higher numbers are better. ]
[test unit] :AutoNUMA| numa-u-v3 |
----------------------------------------------------------
2x1-bw-process : 8.073| 9.647: 19.5%| GB/sec
3x1-bw-process : 12.583| 14.528: 15.5%| GB/sec
4x1-bw-process : 12.313| 18.903: 53.5%| GB/sec
8x1-bw-process : 20.237| 26.829: 32.6%| GB/sec
8x1-bw-process-NOTHP : 17.831| 22.237: 24.7%| GB/sec
16x1-bw-process : 5.959| 29.294: 391.6%| GB/sec
4x1-bw-thread : 17.443| 19.290: 10.6%| GB/sec
8x1-bw-thread : 22.433| 26.391: 17.6%| GB/sec
16x1-bw-thread : 20.204| 29.557: 46.3%| GB/sec
32x1-bw-thread : 18.230| 30.232: 65.8%| GB/sec
2x3-bw-thread : 14.652| 15.327: 4.6%| GB/sec
4x4-bw-thread : 9.835| 27.957: 184.3%| GB/sec
4x6-bw-thread : 15.372| 27.877: 81.3%| GB/sec
4x8-bw-thread : 11.865| 28.439: 139.7%| GB/sec
4x8-bw-thread-NOTHP : 9.224| 25.067: 171.8%| GB/sec
3x3-bw-thread : 6.635| 21.560: 224.9%| GB/sec
5x5-bw-thread : 9.575| 26.081: 172.4%| GB/sec
2x16-bw-thread : 5.945| 23.269: 291.4%| GB/sec
1x32-bw-thread : 5.848| 18.985: 224.6%| GB/sec
numa02-bw : 24.744| 31.431: 27.0%| GB/sec
numa02-bw-NOTHP : 20.415| 29.104: 42.6%| GB/sec
numa01-bw-thread : 15.169| 28.607: 88.6%| GB/sec
numa01-bw-thread-NOTHP : 12.108| 21.119: 74.4%| GB/sec
NUMA workload "spread" measurements
-----------------------------------
A third, somewhat obscure category of measurements deals with
the 'execution spread' between threads. Workloads that have to
wait for the result of every thread before they can declare a
result are directly limited by this spread.
The 'spread' is measured by the percentage difference between
the slowest and fastest thread's execution time in a workload:
[ Lower numbers are better. ]
[test unit] : v3.7 |balancenuma-v10| AutoNUMA-v28 | numa-u-v3 |
------------------------------------------------------------------------------------------
RAM-bw-local : 0.0% | 0.0% | 0.0% | 0.0% | %
RAM-bw-local-NOTHP : 0.2% | 0.2% | 0.2% | 0.2% | %
RAM-bw-remote : 0.0% | 0.0% | 0.0% | 0.0% | %
RAM-bw-local-2x : 0.3% | 0.0% | 0.2% | 0.3% | %
RAM-bw-remote-2x : 0.0% | 0.2% | 0.0% | 0.2% | %
RAM-bw-cross : 0.4% | 0.2% | 0.0% | 0.1% | %
2x1-bw-process : 0.5% | 0.2% | 0.2% | 0.2% | %
3x1-bw-process : 0.6% | 0.2% | 0.2% | 0.1% | %
4x1-bw-process : 0.4% | 0.8% | 0.2% | 0.3% | %
8x1-bw-process : 0.8% | 0.1% | 0.2% | 0.2% | %
8x1-bw-process-NOTHP : 0.9% | 0.7% | 0.4% | 0.5% | %
16x1-bw-process : 1.0% | 0.9% | 0.6% | 0.1% | %
4x1-bw-thread : 0.1% | 0.1% | 0.1% | 0.1% | %
8x1-bw-thread : 0.2% | 0.1% | 0.1% | 0.2% | %
16x1-bw-thread : 0.3% | 0.1% | 0.1% | 0.1% | %
32x1-bw-thread : 0.3% | 0.1% | 0.1% | 0.1% | %
2x3-bw-thread : 0.4% | 0.3% | 0.3% | 0.3% | %
4x4-bw-thread : 2.3% | 1.4% | 0.8% | 0.4% | %
4x6-bw-thread : 2.5% | 2.2% | 1.0% | 0.6% | %
4x8-bw-thread : 3.9% | 3.7% | 1.3% | 0.9% | %
4x8-bw-thread-NOTHP : 6.0% | 2.5% | 1.5% | 1.0% | %
3x3-bw-thread : 0.5% | 0.4% | 0.5% | 0.3% | %
5x5-bw-thread : 1.8% | 2.7% | 1.3% | 0.7% | %
2x16-bw-thread : 3.7% | 4.1% | 3.6% | 1.1% | %
1x32-bw-thread : 2.9% | 7.3% | 3.5% | 4.4% | %
numa02-bw : 0.1% | 0.0% | 0.1% | 0.1% | %
numa02-bw-NOTHP : 0.4% | 0.3% | 0.3% | 0.3% | %
numa01-bw-thread : 1.3% | 0.4% | 0.3% | 0.3% | %
numa01-bw-thread-NOTHP : 1.8% | 0.8% | 0.8% | 0.9% | %
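The exact definition of the spread figure lives in the perf bench
numa code; as an illustration, the following sketch reproduces the
dumped spread numbers under the assumption that the spread is half
of the max-min runtime delta, relative to the slowest thread:

  def spread_pct(runtime_max_secs, runtime_min_secs):
      # Assumed definition: half of the (max - min) runtime delta,
      # expressed as a percentage of the slowest thread's runtime.
      return (runtime_max_secs - runtime_min_secs) / 2.0 / runtime_max_secs * 100.0

  # Example, the 4x8-bw-thread raw data below (v3.7 kernel):
  print("%.3f%%" % spread_pct(21.994, 20.290))   # ~3.874%, as printed in the dump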
The results are pretty good, considering that the runs were
relatively short, at 20 seconds of runtime.
Both mainline and balancenuma have trouble with the spread of
shared workloads - possibly signalling memory allocation
asymmetries. Longer - 60 seconds or more - runs of the key
workloads would certainly be informative there.
NOTHP (4K ptes) increases the spread and non-determinism of
every NUMA kernel.
The AutoNUMA and NUMA-v3 kernels have the lowest spread,
signalling stable NUMA convergence in most scenarios.
Finally, below is the (long!) dump of all the raw data, in case
someone wants to double-check my results. The 'perf bench numa mem'
tool can be used to double-check the measurements on other systems.
Thanks,
Ingo
-------------------->
Here are the exact kernel versions used:
# kernel 1: {v3.7-rc8-18a2f371f5ed}
# kernel 2: {balancenuma-v10}
# kernel 3: {autonuma-v28-c4bba428cc5c}
# kernel 4: {numa/base-v3}
-------------------->
#
# Running test on: Linux vega 3.7.0-rc8+ #3 SMP Fri Dec 7 18:29:16 CET 2012 x86_64 x86_64 x86_64 GNU/Linux
#
# Running numa/mem benchmark...
# Running main, "perf bench numa mem -a"
# Running RAM-bw-local, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp 1 --no-data_rand_walk"
RAM-bw-local, 20.111, secs, runtime-max/thread
RAM-bw-local, 20.106, secs, runtime-min/thread
RAM-bw-local, 20.106, secs, runtime-avg/thread
RAM-bw-local, 0.013, %, spread-runtime/thread
RAM-bw-local, 169.651, GB, data/thread
RAM-bw-local, 169.651, GB, data-total
RAM-bw-local, 0.119, nsecs, runtime/byte/thread
RAM-bw-local, 8.436, GB/sec, thread-speed
RAM-bw-local, 8.436, GB/sec, total-speed
# Running RAM-bw-local-NOTHP, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp 1 --no-data_rand_walk --thp -1"
RAM-bw-local-NOTHP, 20.125, secs, runtime-max/thread
RAM-bw-local-NOTHP, 20.050, secs, runtime-min/thread
RAM-bw-local-NOTHP, 20.050, secs, runtime-avg/thread
RAM-bw-local-NOTHP, 0.187, %, spread-runtime/thread
RAM-bw-local-NOTHP, 169.651, GB, data/thread
RAM-bw-local-NOTHP, 169.651, GB, data-total
RAM-bw-local-NOTHP, 0.119, nsecs, runtime/byte/thread
RAM-bw-local-NOTHP, 8.430, GB/sec, thread-speed
RAM-bw-local-NOTHP, 8.430, GB/sec, total-speed
# Running RAM-bw-remote, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 1 -s 20 -zZq --thp 1 --no-data_rand_walk"
RAM-bw-remote, 20.141, secs, runtime-max/thread
RAM-bw-remote, 20.134, secs, runtime-min/thread
RAM-bw-remote, 20.134, secs, runtime-avg/thread
RAM-bw-remote, 0.017, %, spread-runtime/thread
RAM-bw-remote, 135.291, GB, data/thread
RAM-bw-remote, 135.291, GB, data-total
RAM-bw-remote, 0.149, nsecs, runtime/byte/thread
RAM-bw-remote, 6.717, GB/sec, thread-speed
RAM-bw-remote, 6.717, GB/sec, total-speed
# Running RAM-bw-local-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 0x2 -s 20 -zZq --thp 1 --no-data_rand_walk"
RAM-bw-local-2x, 20.128, secs, runtime-max/thread
RAM-bw-local-2x, 20.006, secs, runtime-min/thread
RAM-bw-local-2x, 20.064, secs, runtime-avg/thread
RAM-bw-local-2x, 0.302, %, spread-runtime/thread
RAM-bw-local-2x, 132.607, GB, data/thread
RAM-bw-local-2x, 265.214, GB, data-total
RAM-bw-local-2x, 0.152, nsecs, runtime/byte/thread
RAM-bw-local-2x, 6.588, GB/sec, thread-speed
RAM-bw-local-2x, 13.177, GB/sec, total-speed
# Running RAM-bw-remote-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 1x2 -s 20 -zZq --thp 1 --no-data_rand_walk"
RAM-bw-remote-2x, 20.102, secs, runtime-max/thread
RAM-bw-remote-2x, 20.094, secs, runtime-min/thread
RAM-bw-remote-2x, 20.094, secs, runtime-avg/thread
RAM-bw-remote-2x, 0.021, %, spread-runtime/thread
RAM-bw-remote-2x, 74.088, GB, data/thread
RAM-bw-remote-2x, 148.176, GB, data-total
RAM-bw-remote-2x, 0.271, nsecs, runtime/byte/thread
RAM-bw-remote-2x, 3.686, GB/sec, thread-speed
RAM-bw-remote-2x, 7.371, GB/sec, total-speed
# Running RAM-bw-cross, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,8 -M 1,0 -s 20 -zZq --thp 1 --no-data_rand_walk"
RAM-bw-cross, 20.159, secs, runtime-max/thread
RAM-bw-cross, 20.011, secs, runtime-min/thread
RAM-bw-cross, 20.081, secs, runtime-avg/thread
RAM-bw-cross, 0.369, %, spread-runtime/thread
RAM-bw-cross, 122.407, GB, data/thread
RAM-bw-cross, 244.813, GB, data-total
RAM-bw-cross, 0.165, nsecs, runtime/byte/thread
RAM-bw-cross, 6.072, GB/sec, thread-speed
RAM-bw-cross, 12.144, GB/sec, total-speed
# Running 1x3-convergence, "perf bench numa mem -p 1 -t 3 -P 512 -s 100 -zZ0qcm --thp 1"
1x3-convergence, 100.103, secs, NUMA-convergence-latency
1x3-convergence, 100.103, secs, runtime-max/thread
1x3-convergence, 100.082, secs, runtime-min/thread
1x3-convergence, 100.093, secs, runtime-avg/thread
1x3-convergence, 0.010, %, spread-runtime/thread
1x3-convergence, 278.636, GB, data/thread
1x3-convergence, 835.908, GB, data-total
1x3-convergence, 0.359, nsecs, runtime/byte/thread
1x3-convergence, 2.784, GB/sec, thread-speed
1x3-convergence, 8.351, GB/sec, total-speed
# Running 1x4-convergence, "perf bench numa mem -p 1 -t 4 -P 512 -s 100 -zZ0qcm --thp 1"
1x4-convergence, 100.211, secs, NUMA-convergence-latency
1x4-convergence, 100.211, secs, runtime-max/thread
1x4-convergence, 100.070, secs, runtime-min/thread
1x4-convergence, 100.140, secs, runtime-avg/thread
1x4-convergence, 0.070, %, spread-runtime/thread
1x4-convergence, 154.887, GB, data/thread
1x4-convergence, 619.549, GB, data-total
1x4-convergence, 0.647, nsecs, runtime/byte/thread
1x4-convergence, 1.546, GB/sec, thread-speed
1x4-convergence, 6.182, GB/sec, total-speed
# Running 1x6-convergence, "perf bench numa mem -p 1 -t 6 -P 1020 -s 100 -zZ0qcm --thp 1"
1x6-convergence, 100.343, secs, NUMA-convergence-latency
1x6-convergence, 100.343, secs, runtime-max/thread
1x6-convergence, 100.235, secs, runtime-min/thread
1x6-convergence, 100.303, secs, runtime-avg/thread
1x6-convergence, 0.054, %, spread-runtime/thread
1x6-convergence, 95.725, GB, data/thread
1x6-convergence, 574.347, GB, data-total
1x6-convergence, 1.048, nsecs, runtime/byte/thread
1x6-convergence, 0.954, GB/sec, thread-speed
1x6-convergence, 5.724, GB/sec, total-speed
# Running 2x3-convergence, "perf bench numa mem -p 3 -t 3 -P 1020 -s 100 -zZ0qcm --thp 1"
2x3-convergence, 100.601, secs, NUMA-convergence-latency
2x3-convergence, 100.601, secs, runtime-max/thread
2x3-convergence, 100.054, secs, runtime-min/thread
2x3-convergence, 100.307, secs, runtime-avg/thread
2x3-convergence, 0.272, %, spread-runtime/thread
2x3-convergence, 65.837, GB, data/thread
2x3-convergence, 592.529, GB, data-total
2x3-convergence, 1.528, nsecs, runtime/byte/thread
2x3-convergence, 0.654, GB/sec, thread-speed
2x3-convergence, 5.890, GB/sec, total-speed
# Running 3x3-convergence, "perf bench numa mem -p 3 -t 3 -P 1020 -s 100 -zZ0qcm --thp 1"
3x3-convergence, 100.572, secs, NUMA-convergence-latency
3x3-convergence, 100.572, secs, runtime-max/thread
3x3-convergence, 100.095, secs, runtime-min/thread
3x3-convergence, 100.330, secs, runtime-avg/thread
3x3-convergence, 0.238, %, spread-runtime/thread
3x3-convergence, 65.837, GB, data/thread
3x3-convergence, 592.529, GB, data-total
3x3-convergence, 1.528, nsecs, runtime/byte/thread
3x3-convergence, 0.655, GB/sec, thread-speed
3x3-convergence, 5.892, GB/sec, total-speed
# Running 4x4-convergence, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp 1"
4x4-convergence, 100.571, secs, NUMA-convergence-latency
4x4-convergence, 100.571, secs, runtime-max/thread
4x4-convergence, 100.122, secs, runtime-min/thread
4x4-convergence, 100.386, secs, runtime-avg/thread
4x4-convergence, 0.223, %, spread-runtime/thread
4x4-convergence, 35.266, GB, data/thread
4x4-convergence, 564.251, GB, data-total
4x4-convergence, 2.852, nsecs, runtime/byte/thread
4x4-convergence, 0.351, GB/sec, thread-speed
4x4-convergence, 5.610, GB/sec, total-speed
# Running 4x4-convergence-NOTHP, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp 1 --thp -1"
4x4-convergence-NOTHP, 101.051, secs, NUMA-convergence-latency
4x4-convergence-NOTHP, 101.051, secs, runtime-max/thread
4x4-convergence-NOTHP, 100.066, secs, runtime-min/thread
4x4-convergence-NOTHP, 100.683, secs, runtime-avg/thread
4x4-convergence-NOTHP, 0.487, %, spread-runtime/thread
4x4-convergence-NOTHP, 35.769, GB, data/thread
4x4-convergence-NOTHP, 572.304, GB, data-total
4x4-convergence-NOTHP, 2.825, nsecs, runtime/byte/thread
4x4-convergence-NOTHP, 0.354, GB/sec, thread-speed
4x4-convergence-NOTHP, 5.664, GB/sec, total-speed
# Running 4x6-convergence, "perf bench numa mem -p 4 -t 6 -P 1020 -s 100 -zZ0qcm --thp 1"
4x6-convergence, 5.444, secs, NUMA-convergence-latency
4x6-convergence, 5.444, secs, runtime-max/thread
4x6-convergence, 2.853, secs, runtime-min/thread
4x6-convergence, 4.531, secs, runtime-avg/thread
4x6-convergence, 23.794, %, spread-runtime/thread
4x6-convergence, 1.292, GB, data/thread
4x6-convergence, 31.017, GB, data-total
4x6-convergence, 4.212, nsecs, runtime/byte/thread
4x6-convergence, 0.237, GB/sec, thread-speed
4x6-convergence, 5.698, GB/sec, total-speed
# Running 4x8-convergence, "perf bench numa mem -p 4 -t 8 -P 512 -s 100 -zZ0qcm --thp 1"
4x8-convergence, 101.133, secs, NUMA-convergence-latency
4x8-convergence, 101.133, secs, runtime-max/thread
4x8-convergence, 100.455, secs, runtime-min/thread
4x8-convergence, 100.803, secs, runtime-avg/thread
4x8-convergence, 0.335, %, spread-runtime/thread
4x8-convergence, 18.522, GB, data/thread
4x8-convergence, 592.705, GB, data-total
4x8-convergence, 5.460, nsecs, runtime/byte/thread
4x8-convergence, 0.183, GB/sec, thread-speed
4x8-convergence, 5.861, GB/sec, total-speed
# Running 8x4-convergence, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp 1"
8x4-convergence, 100.878, secs, NUMA-convergence-latency
8x4-convergence, 100.878, secs, runtime-max/thread
8x4-convergence, 100.021, secs, runtime-min/thread
8x4-convergence, 100.567, secs, runtime-avg/thread
8x4-convergence, 0.425, %, spread-runtime/thread
8x4-convergence, 18.388, GB, data/thread
8x4-convergence, 588.411, GB, data-total
8x4-convergence, 5.486, nsecs, runtime/byte/thread
8x4-convergence, 0.182, GB/sec, thread-speed
8x4-convergence, 5.833, GB/sec, total-speed
# Running 8x4-convergence-NOTHP, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp 1 --thp -1"
8x4-convergence-NOTHP, 101.868, secs, NUMA-convergence-latency
8x4-convergence-NOTHP, 101.868, secs, runtime-max/thread
8x4-convergence-NOTHP, 100.499, secs, runtime-min/thread
8x4-convergence-NOTHP, 101.118, secs, runtime-avg/thread
8x4-convergence-NOTHP, 0.672, %, spread-runtime/thread
8x4-convergence-NOTHP, 17.851, GB, data/thread
8x4-convergence-NOTHP, 571.231, GB, data-total
8x4-convergence-NOTHP, 5.707, nsecs, runtime/byte/thread
8x4-convergence-NOTHP, 0.175, GB/sec, thread-speed
8x4-convergence-NOTHP, 5.608, GB/sec, total-speed
# Running 3x1-convergence, "perf bench numa mem -p 3 -t 1 -P 512 -s 100 -zZ0qcm --thp 1"
3x1-convergence, 0.652, secs, NUMA-convergence-latency
3x1-convergence, 0.652, secs, runtime-max/thread
3x1-convergence, 0.471, secs, runtime-min/thread
3x1-convergence, 0.584, secs, runtime-avg/thread
3x1-convergence, 13.878, %, spread-runtime/thread
3x1-convergence, 1.432, GB, data/thread
3x1-convergence, 4.295, GB, data-total
3x1-convergence, 0.456, nsecs, runtime/byte/thread
3x1-convergence, 2.195, GB/sec, thread-speed
3x1-convergence, 6.584, GB/sec, total-speed
# Running 4x1-convergence, "perf bench numa mem -p 4 -t 1 -P 512 -s 100 -zZ0qcm --thp 1"
4x1-convergence, 0.643, secs, NUMA-convergence-latency
4x1-convergence, 0.643, secs, runtime-max/thread
4x1-convergence, 0.479, secs, runtime-min/thread
4x1-convergence, 0.562, secs, runtime-avg/thread
4x1-convergence, 12.750, %, spread-runtime/thread
4x1-convergence, 1.074, GB, data/thread
4x1-convergence, 4.295, GB, data-total
4x1-convergence, 0.599, nsecs, runtime/byte/thread
4x1-convergence, 1.669, GB/sec, thread-speed
4x1-convergence, 6.677, GB/sec, total-speed
# Running 8x1-convergence, "perf bench numa mem -p 8 -t 1 -P 512 -s 100 -zZ0qcm --thp 1"
8x1-convergence, 2.803, secs, NUMA-convergence-latency
8x1-convergence, 2.803, secs, runtime-max/thread
8x1-convergence, 2.509, secs, runtime-min/thread
8x1-convergence, 2.664, secs, runtime-avg/thread
8x1-convergence, 5.250, %, spread-runtime/thread
8x1-convergence, 2.147, GB, data/thread
8x1-convergence, 17.180, GB, data-total
8x1-convergence, 1.305, nsecs, runtime/byte/thread
8x1-convergence, 0.766, GB/sec, thread-speed
8x1-convergence, 6.129, GB/sec, total-speed
# Running 16x1-convergence, "perf bench numa mem -p 16 -t 1 -P 256 -s 100 -zZ0qcm --thp 1"
16x1-convergence, 3.482, secs, NUMA-convergence-latency
16x1-convergence, 3.482, secs, runtime-max/thread
16x1-convergence, 3.162, secs, runtime-min/thread
16x1-convergence, 3.328, secs, runtime-avg/thread
16x1-convergence, 4.603, %, spread-runtime/thread
16x1-convergence, 1.242, GB, data/thread
16x1-convergence, 19.864, GB, data-total
16x1-convergence, 2.805, nsecs, runtime/byte/thread
16x1-convergence, 0.357, GB/sec, thread-speed
16x1-convergence, 5.704, GB/sec, total-speed
# Running 32x1-convergence, "perf bench numa mem -p 32 -t 1 -P 128 -s 100 -zZ0qcm --thp 1"
32x1-convergence, 3.612, secs, NUMA-convergence-latency
32x1-convergence, 3.612, secs, runtime-max/thread
32x1-convergence, 3.170, secs, runtime-min/thread
32x1-convergence, 3.456, secs, runtime-avg/thread
32x1-convergence, 6.118, %, spread-runtime/thread
32x1-convergence, 0.671, GB, data/thread
32x1-convergence, 21.475, GB, data-total
32x1-convergence, 5.382, nsecs, runtime/byte/thread
32x1-convergence, 0.186, GB/sec, thread-speed
32x1-convergence, 5.945, GB/sec, total-speed
# Running 2x1-bw-process, "perf bench numa mem -p 2 -t 1 -P 1024 -s 20 -zZ0q --thp 1"
2x1-bw-process, 20.280, secs, runtime-max/thread
2x1-bw-process, 20.059, secs, runtime-min/thread
2x1-bw-process, 20.166, secs, runtime-avg/thread
2x1-bw-process, 0.546, %, spread-runtime/thread
2x1-bw-process, 63.351, GB, data/thread
2x1-bw-process, 126.702, GB, data-total
2x1-bw-process, 0.320, nsecs, runtime/byte/thread
2x1-bw-process, 3.124, GB/sec, thread-speed
2x1-bw-process, 6.248, GB/sec, total-speed
# Running 3x1-bw-process, "perf bench numa mem -p 3 -t 1 -P 1024 -s 20 -zZ0q --thp 1"
3x1-bw-process, 20.320, secs, runtime-max/thread
3x1-bw-process, 20.078, secs, runtime-min/thread
3x1-bw-process, 20.202, secs, runtime-avg/thread
3x1-bw-process, 0.595, %, spread-runtime/thread
3x1-bw-process, 49.392, GB, data/thread
3x1-bw-process, 148.176, GB, data-total
3x1-bw-process, 0.411, nsecs, runtime/byte/thread
3x1-bw-process, 2.431, GB/sec, thread-speed
3x1-bw-process, 7.292, GB/sec, total-speed
# Running 4x1-bw-process, "perf bench numa mem -p 4 -t 1 -P 1024 -s 20 -zZ0q --thp 1"
4x1-bw-process, 20.379, secs, runtime-max/thread
4x1-bw-process, 20.210, secs, runtime-min/thread
4x1-bw-process, 20.291, secs, runtime-avg/thread
4x1-bw-process, 0.413, %, spread-runtime/thread
4x1-bw-process, 30.602, GB, data/thread
4x1-bw-process, 122.407, GB, data-total
4x1-bw-process, 0.666, nsecs, runtime/byte/thread
4x1-bw-process, 1.502, GB/sec, thread-speed
4x1-bw-process, 6.007, GB/sec, total-speed
# Running 8x1-bw-process, "perf bench numa mem -p 8 -t 1 -P 512 -s 20 -zZ0q --thp 1"
8x1-bw-process, 20.419, secs, runtime-max/thread
8x1-bw-process, 20.073, secs, runtime-min/thread
8x1-bw-process, 20.328, secs, runtime-avg/thread
8x1-bw-process, 0.848, %, spread-runtime/thread
8x1-bw-process, 15.569, GB, data/thread
8x1-bw-process, 124.554, GB, data-total
8x1-bw-process, 1.311, nsecs, runtime/byte/thread
8x1-bw-process, 0.762, GB/sec, thread-speed
8x1-bw-process, 6.100, GB/sec, total-speed
# Running 8x1-bw-process-NOTHP, "perf bench numa mem -p 8 -t 1 -P 512 -s 20 -zZ0q --thp 1 --thp -1"
8x1-bw-process-NOTHP, 20.502, secs, runtime-max/thread
8x1-bw-process-NOTHP, 20.113, secs, runtime-min/thread
8x1-bw-process-NOTHP, 20.307, secs, runtime-avg/thread
8x1-bw-process-NOTHP, 0.950, %, spread-runtime/thread
8x1-bw-process-NOTHP, 15.234, GB, data/thread
8x1-bw-process-NOTHP, 121.870, GB, data-total
8x1-bw-process-NOTHP, 1.346, nsecs, runtime/byte/thread
8x1-bw-process-NOTHP, 0.743, GB/sec, thread-speed
8x1-bw-process-NOTHP, 5.944, GB/sec, total-speed
# Running 16x1-bw-process, "perf bench numa mem -p 16 -t 1 -P 256 -s 20 -zZ0q --thp 1"
16x1-bw-process, 20.539, secs, runtime-max/thread
16x1-bw-process, 20.145, secs, runtime-min/thread
16x1-bw-process, 20.407, secs, runtime-avg/thread
16x1-bw-process, 0.959, %, spread-runtime/thread
16x1-bw-process, 7.197, GB, data/thread
16x1-bw-process, 115.159, GB, data-total
16x1-bw-process, 2.854, nsecs, runtime/byte/thread
16x1-bw-process, 0.350, GB/sec, thread-speed
16x1-bw-process, 5.607, GB/sec, total-speed
# Running 4x1-bw-thread, "perf bench numa mem -p 1 -t 4 -T 256 -s 20 -zZ0q --thp 1"
4x1-bw-thread, 20.105, secs, runtime-max/thread
4x1-bw-thread, 20.047, secs, runtime-min/thread
4x1-bw-thread, 20.071, secs, runtime-avg/thread
4x1-bw-thread, 0.144, %, spread-runtime/thread
4x1-bw-thread, 30.333, GB, data/thread
4x1-bw-thread, 121.333, GB, data-total
4x1-bw-thread, 0.663, nsecs, runtime/byte/thread
4x1-bw-thread, 1.509, GB/sec, thread-speed
4x1-bw-thread, 6.035, GB/sec, total-speed
# Running 8x1-bw-thread, "perf bench numa mem -p 1 -t 8 -T 256 -s 20 -zZ0q --thp 1"
8x1-bw-thread, 20.106, secs, runtime-max/thread
8x1-bw-thread, 20.021, secs, runtime-min/thread
8x1-bw-thread, 20.062, secs, runtime-avg/thread
8x1-bw-thread, 0.213, %, spread-runtime/thread
8x1-bw-thread, 14.932, GB, data/thread
8x1-bw-thread, 119.454, GB, data-total
8x1-bw-thread, 1.347, nsecs, runtime/byte/thread
8x1-bw-thread, 0.743, GB/sec, thread-speed
8x1-bw-thread, 5.941, GB/sec, total-speed
# Running 16x1-bw-thread, "perf bench numa mem -p 1 -t 16 -T 128 -s 20 -zZ0q --thp 1"
16x1-bw-thread, 20.176, secs, runtime-max/thread
16x1-bw-thread, 20.049, secs, runtime-min/thread
16x1-bw-thread, 20.125, secs, runtime-avg/thread
16x1-bw-thread, 0.314, %, spread-runtime/thread
16x1-bw-thread, 7.122, GB, data/thread
16x1-bw-thread, 113.951, GB, data-total
16x1-bw-thread, 2.833, nsecs, runtime/byte/thread
16x1-bw-thread, 0.353, GB/sec, thread-speed
16x1-bw-thread, 5.648, GB/sec, total-speed
# Running 32x1-bw-thread, "perf bench numa mem -p 1 -t 32 -T 64 -s 20 -zZ0q --thp 1"
32x1-bw-thread, 20.159, secs, runtime-max/thread
32x1-bw-thread, 20.034, secs, runtime-min/thread
32x1-bw-thread, 20.120, secs, runtime-avg/thread
32x1-bw-thread, 0.309, %, spread-runtime/thread
32x1-bw-thread, 3.735, GB, data/thread
32x1-bw-thread, 119.521, GB, data-total
32x1-bw-thread, 5.397, nsecs, runtime/byte/thread
32x1-bw-thread, 0.185, GB/sec, thread-speed
32x1-bw-thread, 5.929, GB/sec, total-speed
# Running 2x3-bw-thread, "perf bench numa mem -p 2 -t 3 -P 512 -s 20 -zZ0q --thp 1"
2x3-bw-thread, 20.239, secs, runtime-max/thread
2x3-bw-thread, 20.092, secs, runtime-min/thread
2x3-bw-thread, 20.183, secs, runtime-avg/thread
2x3-bw-thread, 0.363, %, spread-runtime/thread
2x3-bw-thread, 19.417, GB, data/thread
2x3-bw-thread, 116.501, GB, data-total
2x3-bw-thread, 1.042, nsecs, runtime/byte/thread
2x3-bw-thread, 0.959, GB/sec, thread-speed
2x3-bw-thread, 5.756, GB/sec, total-speed
# Running 4x4-bw-thread, "perf bench numa mem -p 4 -t 4 -P 512 -s 20 -zZ0q --thp 1"
4x4-bw-thread, 20.978, secs, runtime-max/thread
4x4-bw-thread, 20.005, secs, runtime-min/thread
4x4-bw-thread, 20.576, secs, runtime-avg/thread
4x4-bw-thread, 2.321, %, spread-runtime/thread
4x4-bw-thread, 7.348, GB, data/thread
4x4-bw-thread, 117.575, GB, data-total
4x4-bw-thread, 2.855, nsecs, runtime/byte/thread
4x4-bw-thread, 0.350, GB/sec, thread-speed
4x4-bw-thread, 5.605, GB/sec, total-speed
# Running 4x6-bw-thread, "perf bench numa mem -p 4 -t 6 -P 512 -s 20 -zZ0q --thp 1"
4x6-bw-thread, 21.118, secs, runtime-max/thread
4x6-bw-thread, 20.082, secs, runtime-min/thread
4x6-bw-thread, 20.819, secs, runtime-avg/thread
4x6-bw-thread, 2.451, %, spread-runtime/thread
4x6-bw-thread, 5.078, GB, data/thread
4x6-bw-thread, 121.870, GB, data-total
4x6-bw-thread, 4.159, nsecs, runtime/byte/thread
4x6-bw-thread, 0.240, GB/sec, thread-speed
4x6-bw-thread, 5.771, GB/sec, total-speed
# Running 4x8-bw-thread, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp 1"
4x8-bw-thread, 21.994, secs, runtime-max/thread
4x8-bw-thread, 20.290, secs, runtime-min/thread
4x8-bw-thread, 21.387, secs, runtime-avg/thread
4x8-bw-thread, 3.874, %, spread-runtime/thread
4x8-bw-thread, 4.027, GB, data/thread
4x8-bw-thread, 128.849, GB, data-total
4x8-bw-thread, 5.462, nsecs, runtime/byte/thread
4x8-bw-thread, 0.183, GB/sec, thread-speed
4x8-bw-thread, 5.858, GB/sec, total-speed
# Running 4x8-bw-thread-NOTHP, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp 1 --thp -1"
4x8-bw-thread-NOTHP, 22.728, secs, runtime-max/thread
4x8-bw-thread-NOTHP, 20.013, secs, runtime-min/thread
4x8-bw-thread-NOTHP, 21.968, secs, runtime-avg/thread
4x8-bw-thread-NOTHP, 5.975, %, spread-runtime/thread
4x8-bw-thread-NOTHP, 4.010, GB, data/thread
4x8-bw-thread-NOTHP, 128.312, GB, data-total
4x8-bw-thread-NOTHP, 5.668, nsecs, runtime/byte/thread
4x8-bw-thread-NOTHP, 0.176, GB/sec, thread-speed
4x8-bw-thread-NOTHP, 5.645, GB/sec, total-speed
# Running 3x3-bw-thread, "perf bench numa mem -p 3 -t 3 -P 512 -s 20 -zZ0q --thp 1"
3x3-bw-thread, 20.526, secs, runtime-max/thread
3x3-bw-thread, 20.317, secs, runtime-min/thread
3x3-bw-thread, 20.467, secs, runtime-avg/thread
3x3-bw-thread, 0.510, %, spread-runtime/thread
3x3-bw-thread, 13.541, GB, data/thread
3x3-bw-thread, 121.870, GB, data-total
3x3-bw-thread, 1.516, nsecs, runtime/byte/thread
3x3-bw-thread, 0.660, GB/sec, thread-speed
3x3-bw-thread, 5.937, GB/sec, total-speed
# Running 5x5-bw-thread, "perf bench numa mem -p 5 -t 5 -P 512 -s 20 -zZ0q --thp 1"
5x5-bw-thread, 21.023, secs, runtime-max/thread
5x5-bw-thread, 20.252, secs, runtime-min/thread
5x5-bw-thread, 20.701, secs, runtime-avg/thread
5x5-bw-thread, 1.833, %, spread-runtime/thread
5x5-bw-thread, 4.853, GB, data/thread
5x5-bw-thread, 121.333, GB, data-total
5x5-bw-thread, 4.332, nsecs, runtime/byte/thread
5x5-bw-thread, 0.231, GB/sec, thread-speed
5x5-bw-thread, 5.771, GB/sec, total-speed
# Running 2x16-bw-thread, "perf bench numa mem -p 2 -t 16 -P 512 -s 20 -zZ0q --thp 1"
2x16-bw-thread, 21.646, secs, runtime-max/thread
2x16-bw-thread, 20.065, secs, runtime-min/thread
2x16-bw-thread, 21.026, secs, runtime-avg/thread
2x16-bw-thread, 3.652, %, spread-runtime/thread
2x16-bw-thread, 4.027, GB, data/thread
2x16-bw-thread, 128.849, GB, data-total
2x16-bw-thread, 5.376, nsecs, runtime/byte/thread
2x16-bw-thread, 0.186, GB/sec, thread-speed
2x16-bw-thread, 5.953, GB/sec, total-speed
# Running 1x32-bw-thread, "perf bench numa mem -p 1 -t 32 -P 2048 -s 20 -zZ0q --thp 1"
1x32-bw-thread, 23.377, secs, runtime-max/thread
1x32-bw-thread, 22.030, secs, runtime-min/thread
1x32-bw-thread, 22.936, secs, runtime-avg/thread
1x32-bw-thread, 2.881, %, spread-runtime/thread
1x32-bw-thread, 4.295, GB, data/thread
1x32-bw-thread, 137.439, GB, data-total
1x32-bw-thread, 5.443, nsecs, runtime/byte/thread
1x32-bw-thread, 0.184, GB/sec, thread-speed
1x32-bw-thread, 5.879, GB/sec, total-speed
# Running numa02-bw, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp 1"
numa02-bw, 20.065, secs, runtime-max/thread
numa02-bw, 20.012, secs, runtime-min/thread
numa02-bw, 20.050, secs, runtime-avg/thread
numa02-bw, 0.132, %, spread-runtime/thread
numa02-bw, 3.793, GB, data/thread
numa02-bw, 121.366, GB, data-total
numa02-bw, 5.290, nsecs, runtime/byte/thread
numa02-bw, 0.189, GB/sec, thread-speed
numa02-bw, 6.049, GB/sec, total-speed
# Running numa02-bw-NOTHP, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp 1 --thp -1"
numa02-bw-NOTHP, 20.132, secs, runtime-max/thread
numa02-bw-NOTHP, 19.987, secs, runtime-min/thread
numa02-bw-NOTHP, 20.049, secs, runtime-avg/thread
numa02-bw-NOTHP, 0.360, %, spread-runtime/thread
numa02-bw-NOTHP, 3.681, GB, data/thread
numa02-bw-NOTHP, 117.776, GB, data-total
numa02-bw-NOTHP, 5.470, nsecs, runtime/byte/thread
numa02-bw-NOTHP, 0.183, GB/sec, thread-speed
numa02-bw-NOTHP, 5.850, GB/sec, total-speed
# Running numa01-bw-thread, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp 1"
numa01-bw-thread, 20.704, secs, runtime-max/thread
numa01-bw-thread, 20.185, secs, runtime-min/thread
numa01-bw-thread, 20.571, secs, runtime-avg/thread
numa01-bw-thread, 1.254, %, spread-runtime/thread
numa01-bw-thread, 3.775, GB, data/thread
numa01-bw-thread, 120.796, GB, data-total
numa01-bw-thread, 5.485, nsecs, runtime/byte/thread
numa01-bw-thread, 0.182, GB/sec, thread-speed
numa01-bw-thread, 5.834, GB/sec, total-speed
# Running numa01-bw-thread-NOTHP, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp 1 --thp -1"
numa01-bw-thread-NOTHP, 20.780, secs, runtime-max/thread
numa01-bw-thread-NOTHP, 20.023, secs, runtime-min/thread
numa01-bw-thread-NOTHP, 20.418, secs, runtime-avg/thread
numa01-bw-thread-NOTHP, 1.821, %, spread-runtime/thread
numa01-bw-thread-NOTHP, 3.624, GB, data/thread
numa01-bw-thread-NOTHP, 115.964, GB, data-total
numa01-bw-thread-NOTHP, 5.734, nsecs, runtime/byte/thread
numa01-bw-thread-NOTHP, 0.174, GB/sec, thread-speed
numa01-bw-thread-NOTHP, 5.581, GB/sec, total-speed
#
# Running test on: Linux vega 3.7.0-rc6+ #2 SMP Fri Dec 7 17:59:13 CET 2012 x86_64 x86_64 x86_64 GNU/Linux
#
# Running numa/mem benchmark...
# Running main, "perf bench numa mem -a"
# Running RAM-bw-local, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp 1 --no-data_rand_walk"
RAM-bw-local, 20.049, secs, runtime-max/thread
RAM-bw-local, 20.044, secs, runtime-min/thread
RAM-bw-local, 20.044, secs, runtime-avg/thread
RAM-bw-local, 0.014, %, spread-runtime/thread
RAM-bw-local, 172.872, GB, data/thread
RAM-bw-local, 172.872, GB, data-total
RAM-bw-local, 0.116, nsecs, runtime/byte/thread
RAM-bw-local, 8.622, GB/sec, thread-speed
RAM-bw-local, 8.622, GB/sec, total-speed
# Running RAM-bw-local-NOTHP, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp 1 --no-data_rand_walk --thp -1"
RAM-bw-local-NOTHP, 20.135, secs, runtime-max/thread
RAM-bw-local-NOTHP, 20.059, secs, runtime-min/thread
RAM-bw-local-NOTHP, 20.059, secs, runtime-avg/thread
RAM-bw-local-NOTHP, 0.189, %, spread-runtime/thread
RAM-bw-local-NOTHP, 172.872, GB, data/thread
RAM-bw-local-NOTHP, 172.872, GB, data-total
RAM-bw-local-NOTHP, 0.116, nsecs, runtime/byte/thread
RAM-bw-local-NOTHP, 8.586, GB/sec, thread-speed
RAM-bw-local-NOTHP, 8.586, GB/sec, total-speed
# Running RAM-bw-remote, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 1 -s 20 -zZq --thp 1 --no-data_rand_walk"
RAM-bw-remote, 20.080, secs, runtime-max/thread
RAM-bw-remote, 20.073, secs, runtime-min/thread
RAM-bw-remote, 20.073, secs, runtime-avg/thread
RAM-bw-remote, 0.017, %, spread-runtime/thread
RAM-bw-remote, 135.291, GB, data/thread
RAM-bw-remote, 135.291, GB, data-total
RAM-bw-remote, 0.148, nsecs, runtime/byte/thread
RAM-bw-remote, 6.738, GB/sec, thread-speed
RAM-bw-remote, 6.738, GB/sec, total-speed
# Running RAM-bw-local-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 0x2 -s 20 -zZq --thp 1 --no-data_rand_walk"
RAM-bw-local-2x, 20.127, secs, runtime-max/thread
RAM-bw-local-2x, 20.111, secs, runtime-min/thread
RAM-bw-local-2x, 20.116, secs, runtime-avg/thread
RAM-bw-local-2x, 0.038, %, spread-runtime/thread
RAM-bw-local-2x, 130.997, GB, data/thread
RAM-bw-local-2x, 261.993, GB, data-total
RAM-bw-local-2x, 0.154, nsecs, runtime/byte/thread
RAM-bw-local-2x, 6.509, GB/sec, thread-speed
RAM-bw-local-2x, 13.017, GB/sec, total-speed
# Running RAM-bw-remote-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 1x2 -s 20 -zZq --thp 1 --no-data_rand_walk"
RAM-bw-remote-2x, 20.183, secs, runtime-max/thread
RAM-bw-remote-2x, 20.110, secs, runtime-min/thread
RAM-bw-remote-2x, 20.143, secs, runtime-avg/thread
RAM-bw-remote-2x, 0.180, %, spread-runtime/thread
RAM-bw-remote-2x, 75.162, GB, data/thread
RAM-bw-remote-2x, 150.324, GB, data-total
RAM-bw-remote-2x, 0.269, nsecs, runtime/byte/thread
RAM-bw-remote-2x, 3.724, GB/sec, thread-speed
RAM-bw-remote-2x, 7.448, GB/sec, total-speed
# Running RAM-bw-cross, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,8 -M 1,0 -s 20 -zZq --thp 1 --no-data_rand_walk"
RAM-bw-cross, 20.159, secs, runtime-max/thread
RAM-bw-cross, 20.071, secs, runtime-min/thread
RAM-bw-cross, 20.111, secs, runtime-avg/thread
RAM-bw-cross, 0.220, %, spread-runtime/thread
RAM-bw-cross, 124.017, GB, data/thread
RAM-bw-cross, 248.034, GB, data-total
RAM-bw-cross, 0.163, nsecs, runtime/byte/thread
RAM-bw-cross, 6.152, GB/sec, thread-speed
RAM-bw-cross, 12.304, GB/sec, total-speed
# Running 1x3-convergence, "perf bench numa mem -p 1 -t 3 -P 512 -s 100 -zZ0qcm --thp 1"
1x3-convergence, 100.038, secs, NUMA-convergence-latency
1x3-convergence, 100.038, secs, runtime-max/thread
1x3-convergence, 100.005, secs, runtime-min/thread
1x3-convergence, 100.016, secs, runtime-avg/thread
1x3-convergence, 0.016, %, spread-runtime/thread
1x3-convergence, 379.210, GB, data/thread
1x3-convergence, 1137.629, GB, data-total
1x3-convergence, 0.264, nsecs, runtime/byte/thread
1x3-convergence, 3.791, GB/sec, thread-speed
1x3-convergence, 11.372, GB/sec, total-speed
# Running 1x4-convergence, "perf bench numa mem -p 1 -t 4 -P 512 -s 100 -zZ0qcm --thp 1"
1x4-convergence, 100.091, secs, NUMA-convergence-latency
1x4-convergence, 100.091, secs, runtime-max/thread
1x4-convergence, 100.016, secs, runtime-min/thread
1x4-convergence, 100.053, secs, runtime-avg/thread
1x4-convergence, 0.037, %, spread-runtime/thread
1x4-convergence, 162.672, GB, data/thread
1x4-convergence, 650.688, GB, data-total
1x4-convergence, 0.615, nsecs, runtime/byte/thread
1x4-convergence, 1.625, GB/sec, thread-speed
1x4-convergence, 6.501, GB/sec, total-speed
# Running 1x6-convergence, "perf bench numa mem -p 1 -t 6 -P 1020 -s 100 -zZ0qcm --thp 1"
1x6-convergence, 100.366, secs, NUMA-convergence-latency
1x6-convergence, 100.366, secs, runtime-max/thread
1x6-convergence, 100.005, secs, runtime-min/thread
1x6-convergence, 100.144, secs, runtime-avg/thread
1x6-convergence, 0.180, %, spread-runtime/thread
1x6-convergence, 103.924, GB, data/thread
1x6-convergence, 623.546, GB, data-total
1x6-convergence, 0.966, nsecs, runtime/byte/thread
1x6-convergence, 1.035, GB/sec, thread-speed
1x6-convergence, 6.213, GB/sec, total-speed
# Running 2x3-convergence, "perf bench numa mem -p 3 -t 3 -P 1020 -s 100 -zZ0qcm --thp 1"
2x3-convergence, 100.632, secs, NUMA-convergence-latency
2x3-convergence, 100.632, secs, runtime-max/thread
2x3-convergence, 100.080, secs, runtime-min/thread
2x3-convergence, 100.376, secs, runtime-avg/thread
2x3-convergence, 0.274, %, spread-runtime/thread
2x3-convergence, 87.941, GB, data/thread
2x3-convergence, 791.465, GB, data-total
2x3-convergence, 1.144, nsecs, runtime/byte/thread
2x3-convergence, 0.874, GB/sec, thread-speed
2x3-convergence, 7.865, GB/sec, total-speed
# Running 3x3-convergence, "perf bench numa mem -p 3 -t 3 -P 1020 -s 100 -zZ0qcm --thp 1"
3x3-convergence, 100.474, secs, NUMA-convergence-latency
3x3-convergence, 100.474, secs, runtime-max/thread
3x3-convergence, 100.070, secs, runtime-min/thread
3x3-convergence, 100.338, secs, runtime-avg/thread
3x3-convergence, 0.201, %, spread-runtime/thread
3x3-convergence, 118.363, GB, data/thread
3x3-convergence, 1065.269, GB, data-total
3x3-convergence, 0.849, nsecs, runtime/byte/thread
3x3-convergence, 1.178, GB/sec, thread-speed
3x3-convergence, 10.602, GB/sec, total-speed
# Running 4x4-convergence, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp 1"
4x4-convergence, 100.527, secs, NUMA-convergence-latency
4x4-convergence, 100.527, secs, runtime-max/thread
4x4-convergence, 100.179, secs, runtime-min/thread
4x4-convergence, 100.353, secs, runtime-avg/thread
4x4-convergence, 0.173, %, spread-runtime/thread
4x4-convergence, 65.230, GB, data/thread
4x4-convergence, 1043.677, GB, data-total
4x4-convergence, 1.541, nsecs, runtime/byte/thread
4x4-convergence, 0.649, GB/sec, thread-speed
4x4-convergence, 10.382, GB/sec, total-speed
# Running 4x4-convergence-NOTHP, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp 1 --thp -1"
4x4-convergence-NOTHP, 100.532, secs, NUMA-convergence-latency
4x4-convergence-NOTHP, 100.532, secs, runtime-max/thread
4x4-convergence-NOTHP, 100.095, secs, runtime-min/thread
4x4-convergence-NOTHP, 100.343, secs, runtime-avg/thread
4x4-convergence-NOTHP, 0.217, %, spread-runtime/thread
4x4-convergence-NOTHP, 57.311, GB, data/thread
4x4-convergence-NOTHP, 916.976, GB, data-total
4x4-convergence-NOTHP, 1.754, nsecs, runtime/byte/thread
4x4-convergence-NOTHP, 0.570, GB/sec, thread-speed
4x4-convergence-NOTHP, 9.121, GB/sec, total-speed
# Running 4x6-convergence, "perf bench numa mem -p 4 -t 6 -P 1020 -s 100 -zZ0qcm --thp 1"
4x6-convergence, 101.230, secs, NUMA-convergence-latency
4x6-convergence, 101.230, secs, runtime-max/thread
4x6-convergence, 100.093, secs, runtime-min/thread
4x6-convergence, 100.825, secs, runtime-avg/thread
4x6-convergence, 0.562, %, spread-runtime/thread
4x6-convergence, 28.076, GB, data/thread
4x6-convergence, 673.815, GB, data-total
4x6-convergence, 3.606, nsecs, runtime/byte/thread
4x6-convergence, 0.277, GB/sec, thread-speed
4x6-convergence, 6.656, GB/sec, total-speed
# Running 4x8-convergence, "perf bench numa mem -p 4 -t 8 -P 512 -s 100 -zZ0qcm --thp 1"
4x8-convergence, 101.310, secs, NUMA-convergence-latency
4x8-convergence, 101.310, secs, runtime-max/thread
4x8-convergence, 100.052, secs, runtime-min/thread
4x8-convergence, 100.679, secs, runtime-avg/thread
4x8-convergence, 0.621, %, spread-runtime/thread
4x8-convergence, 18.740, GB, data/thread
4x8-convergence, 599.685, GB, data-total
4x8-convergence, 5.406, nsecs, runtime/byte/thread
4x8-convergence, 0.185, GB/sec, thread-speed
4x8-convergence, 5.919, GB/sec, total-speed
# Running 8x4-convergence, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp 1"
8x4-convergence, 100.849, secs, NUMA-convergence-latency
8x4-convergence, 100.849, secs, runtime-max/thread
8x4-convergence, 100.020, secs, runtime-min/thread
8x4-convergence, 100.570, secs, runtime-avg/thread
8x4-convergence, 0.411, %, spread-runtime/thread
8x4-convergence, 22.364, GB, data/thread
8x4-convergence, 715.649, GB, data-total
8x4-convergence, 4.509, nsecs, runtime/byte/thread
8x4-convergence, 0.222, GB/sec, thread-speed
8x4-convergence, 7.096, GB/sec, total-speed
# Running 8x4-convergence-NOTHP, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp 1 --thp -1"
8x4-convergence-NOTHP, 100.976, secs, NUMA-convergence-latency
8x4-convergence-NOTHP, 100.976, secs, runtime-max/thread
8x4-convergence-NOTHP, 100.066, secs, runtime-min/thread
8x4-convergence-NOTHP, 100.580, secs, runtime-avg/thread
8x4-convergence-NOTHP, 0.451, %, spread-runtime/thread
8x4-convergence-NOTHP, 27.146, GB, data/thread
8x4-convergence-NOTHP, 868.657, GB, data-total
8x4-convergence-NOTHP, 3.720, nsecs, runtime/byte/thread
8x4-convergence-NOTHP, 0.269, GB/sec, thread-speed
8x4-convergence-NOTHP, 8.603, GB/sec, total-speed
# Running 3x1-convergence, "perf bench numa mem -p 3 -t 1 -P 512 -s 100 -zZ0qcm --thp 1"
3x1-convergence, 1.010, secs, NUMA-convergence-latency
3x1-convergence, 1.010, secs, runtime-max/thread
3x1-convergence, 0.869, secs, runtime-min/thread
3x1-convergence, 0.958, secs, runtime-avg/thread
3x1-convergence, 6.944, %, spread-runtime/thread
3x1-convergence, 2.326, GB, data/thread
3x1-convergence, 6.979, GB, data-total
3x1-convergence, 0.434, nsecs, runtime/byte/thread
3x1-convergence, 2.305, GB/sec, thread-speed
3x1-convergence, 6.914, GB/sec, total-speed
# Running 4x1-convergence, "perf bench numa mem -p 4 -t 1 -P 512 -s 100 -zZ0qcm --thp 1"
4x1-convergence, 0.782, secs, NUMA-convergence-latency
4x1-convergence, 0.782, secs, runtime-max/thread
4x1-convergence, 0.623, secs, runtime-min/thread
4x1-convergence, 0.689, secs, runtime-avg/thread
4x1-convergence, 10.122, %, spread-runtime/thread
4x1-convergence, 1.208, GB, data/thread
4x1-convergence, 4.832, GB, data-total
4x1-convergence, 0.647, nsecs, runtime/byte/thread
4x1-convergence, 1.545, GB/sec, thread-speed
4x1-convergence, 6.181, GB/sec, total-speed
# Running 8x1-convergence, "perf bench numa mem -p 8 -t 1 -P 512 -s 100 -zZ0qcm --thp 1"
8x1-convergence, 2.914, secs, NUMA-convergence-latency
8x1-convergence, 2.914, secs, runtime-max/thread
8x1-convergence, 2.533, secs, runtime-min/thread
8x1-convergence, 2.750, secs, runtime-avg/thread
8x1-convergence, 6.538, %, spread-runtime/thread
8x1-convergence, 2.215, GB, data/thread
8x1-convergence, 17.717, GB, data-total
8x1-convergence, 1.316, nsecs, runtime/byte/thread
8x1-convergence, 0.760, GB/sec, thread-speed
8x1-convergence, 6.080, GB/sec, total-speed
# Running 16x1-convergence, "perf bench numa mem -p 16 -t 1 -P 256 -s 100 -zZ0qcm --thp 1"
16x1-convergence, 3.688, secs, NUMA-convergence-latency
16x1-convergence, 3.688, secs, runtime-max/thread
16x1-convergence, 3.358, secs, runtime-min/thread
16x1-convergence, 3.533, secs, runtime-avg/thread
16x1-convergence, 4.481, %, spread-runtime/thread
16x1-convergence, 1.292, GB, data/thread
16x1-convergence, 20.670, GB, data-total
16x1-convergence, 2.855, nsecs, runtime/byte/thread
16x1-convergence, 0.350, GB/sec, thread-speed
16x1-convergence, 5.604, GB/sec, total-speed
# Running 32x1-convergence, "perf bench numa mem -p 32 -t 1 -P 128 -s 100 -zZ0qcm --thp 1"
32x1-convergence, 2.762, secs, NUMA-convergence-latency
32x1-convergence, 2.762, secs, runtime-max/thread
32x1-convergence, 2.552, secs, runtime-min/thread
32x1-convergence, 2.735, secs, runtime-avg/thread
32x1-convergence, 3.807, %, spread-runtime/thread
32x1-convergence, 0.516, GB, data/thread
32x1-convergence, 16.509, GB, data-total
32x1-convergence, 5.354, nsecs, runtime/byte/thread
32x1-convergence, 0.187, GB/sec, thread-speed
32x1-convergence, 5.976, GB/sec, total-speed
# Running 2x1-bw-process, "perf bench numa mem -p 2 -t 1 -P 1024 -s 20 -zZ0q --thp 1"
2x1-bw-process, 20.123, secs, runtime-max/thread
2x1-bw-process, 20.053, secs, runtime-min/thread
2x1-bw-process, 20.085, secs, runtime-avg/thread
2x1-bw-process, 0.173, %, spread-runtime/thread
2x1-bw-process, 61.740, GB, data/thread
2x1-bw-process, 123.480, GB, data-total
2x1-bw-process, 0.326, nsecs, runtime/byte/thread
2x1-bw-process, 3.068, GB/sec, thread-speed
2x1-bw-process, 6.136, GB/sec, total-speed
# Running 3x1-bw-process, "perf bench numa mem -p 3 -t 1 -P 1024 -s 20 -zZ0q --thp 1"
3x1-bw-process, 20.143, secs, runtime-max/thread
3x1-bw-process, 20.043, secs, runtime-min/thread
3x1-bw-process, 20.091, secs, runtime-avg/thread
3x1-bw-process, 0.249, %, spread-runtime/thread
3x1-bw-process, 48.676, GB, data/thread
3x1-bw-process, 146.029, GB, data-total
3x1-bw-process, 0.414, nsecs, runtime/byte/thread
3x1-bw-process, 2.417, GB/sec, thread-speed
3x1-bw-process, 7.250, GB/sec, total-speed
# Running 4x1-bw-process, "perf bench numa mem -p 4 -t 1 -P 1024 -s 20 -zZ0q --thp 1"
4x1-bw-process, 20.327, secs, runtime-max/thread
4x1-bw-process, 20.020, secs, runtime-min/thread
4x1-bw-process, 20.168, secs, runtime-avg/thread
4x1-bw-process, 0.754, %, spread-runtime/thread
4x1-bw-process, 34.897, GB, data/thread
4x1-bw-process, 139.586, GB, data-total
4x1-bw-process, 0.582, nsecs, runtime/byte/thread
4x1-bw-process, 1.717, GB/sec, thread-speed
4x1-bw-process, 6.867, GB/sec, total-speed
# Running 8x1-bw-process, "perf bench numa mem -p 8 -t 1 -P 512 -s 20 -zZ0q --thp 1"
8x1-bw-process, 20.063, secs, runtime-max/thread
8x1-bw-process, 20.004, secs, runtime-min/thread
8x1-bw-process, 20.034, secs, runtime-avg/thread
8x1-bw-process, 0.148, %, spread-runtime/thread
8x1-bw-process, 19.998, GB, data/thread
8x1-bw-process, 159.988, GB, data-total
8x1-bw-process, 1.003, nsecs, runtime/byte/thread
8x1-bw-process, 0.997, GB/sec, thread-speed
8x1-bw-process, 7.974, GB/sec, total-speed
# Running 8x1-bw-process-NOTHP, "perf bench numa mem -p 8 -t 1 -P 512 -s 20 -zZ0q --thp 1 --thp -1"
8x1-bw-process-NOTHP, 20.435, secs, runtime-max/thread
8x1-bw-process-NOTHP, 20.150, secs, runtime-min/thread
8x1-bw-process-NOTHP, 20.255, secs, runtime-avg/thread
8x1-bw-process-NOTHP, 0.699, %, spread-runtime/thread
8x1-bw-process-NOTHP, 15.167, GB, data/thread
8x1-bw-process-NOTHP, 121.333, GB, data-total
8x1-bw-process-NOTHP, 1.347, nsecs, runtime/byte/thread
8x1-bw-process-NOTHP, 0.742, GB/sec, thread-speed
8x1-bw-process-NOTHP, 5.937, GB/sec, total-speed
# Running 16x1-bw-process, "perf bench numa mem -p 16 -t 1 -P 256 -s 20 -zZ0q --thp 1"
16x1-bw-process, 20.451, secs, runtime-max/thread
16x1-bw-process, 20.078, secs, runtime-min/thread
16x1-bw-process, 20.311, secs, runtime-avg/thread
16x1-bw-process, 0.912, %, spread-runtime/thread
16x1-bw-process, 7.147, GB, data/thread
16x1-bw-process, 114.354, GB, data-total
16x1-bw-process, 2.861, nsecs, runtime/byte/thread
16x1-bw-process, 0.349, GB/sec, thread-speed
16x1-bw-process, 5.592, GB/sec, total-speed
# Running 4x1-bw-thread, "perf bench numa mem -p 1 -t 4 -T 256 -s 20 -zZ0q --thp 1"
4x1-bw-thread, 20.038, secs, runtime-max/thread
4x1-bw-thread, 20.006, secs, runtime-min/thread
4x1-bw-thread, 20.023, secs, runtime-avg/thread
4x1-bw-thread, 0.079, %, spread-runtime/thread
4x1-bw-thread, 68.115, GB, data/thread
4x1-bw-thread, 272.462, GB, data-total
4x1-bw-thread, 0.294, nsecs, runtime/byte/thread
4x1-bw-thread, 3.399, GB/sec, thread-speed
4x1-bw-thread, 13.598, GB/sec, total-speed
# Running 8x1-bw-thread, "perf bench numa mem -p 1 -t 8 -T 256 -s 20 -zZ0q --thp 1"
8x1-bw-thread, 20.055, secs, runtime-max/thread
8x1-bw-thread, 20.001, secs, runtime-min/thread
8x1-bw-thread, 20.033, secs, runtime-avg/thread
8x1-bw-thread, 0.136, %, spread-runtime/thread
8x1-bw-thread, 41.004, GB, data/thread
8x1-bw-thread, 328.028, GB, data-total
8x1-bw-thread, 0.489, nsecs, runtime/byte/thread
8x1-bw-thread, 2.045, GB/sec, thread-speed
8x1-bw-thread, 16.356, GB/sec, total-speed
# Running 16x1-bw-thread, "perf bench numa mem -p 1 -t 16 -T 128 -s 20 -zZ0q --thp 1"
16x1-bw-thread, 20.044, secs, runtime-max/thread
16x1-bw-thread, 19.994, secs, runtime-min/thread
16x1-bw-thread, 20.021, secs, runtime-avg/thread
16x1-bw-thread, 0.124, %, spread-runtime/thread
16x1-bw-thread, 30.828, GB, data/thread
16x1-bw-thread, 493.250, GB, data-total
16x1-bw-thread, 0.650, nsecs, runtime/byte/thread
16x1-bw-thread, 1.538, GB/sec, thread-speed
16x1-bw-thread, 24.608, GB/sec, total-speed
# Running 32x1-bw-thread, "perf bench numa mem -p 1 -t 32 -T 64 -s 20 -zZ0q --thp 1"
32x1-bw-thread, 19.990, secs, runtime-max/thread
32x1-bw-thread, 19.955, secs, runtime-min/thread
32x1-bw-thread, 19.996, secs, runtime-avg/thread
32x1-bw-thread, 0.087, %, spread-runtime/thread
32x1-bw-thread, 15.915, GB, data/thread
32x1-bw-thread, 509.289, GB, data-total
32x1-bw-thread, 1.256, nsecs, runtime/byte/thread
32x1-bw-thread, 0.796, GB/sec, thread-speed
32x1-bw-thread, 25.477, GB/sec, total-speed
# Running 2x3-bw-thread, "perf bench numa mem -p 2 -t 3 -P 512 -s 20 -zZ0q --thp 1"
2x3-bw-thread, 20.168, secs, runtime-max/thread
2x3-bw-thread, 20.028, secs, runtime-min/thread
2x3-bw-thread, 20.103, secs, runtime-avg/thread
2x3-bw-thread, 0.346, %, spread-runtime/thread
2x3-bw-thread, 29.528, GB, data/thread
2x3-bw-thread, 177.167, GB, data-total
2x3-bw-thread, 0.683, nsecs, runtime/byte/thread
2x3-bw-thread, 1.464, GB/sec, thread-speed
2x3-bw-thread, 8.785, GB/sec, total-speed
# Running 4x4-bw-thread, "perf bench numa mem -p 4 -t 4 -P 512 -s 20 -zZ0q --thp 1"
4x4-bw-thread, 20.576, secs, runtime-max/thread
4x4-bw-thread, 20.002, secs, runtime-min/thread
4x4-bw-thread, 20.312, secs, runtime-avg/thread
4x4-bw-thread, 1.394, %, spread-runtime/thread
4x4-bw-thread, 8.187, GB, data/thread
4x4-bw-thread, 130.997, GB, data-total
4x4-bw-thread, 2.513, nsecs, runtime/byte/thread
4x4-bw-thread, 0.398, GB/sec, thread-speed
4x4-bw-thread, 6.366, GB/sec, total-speed
# Running 4x6-bw-thread, "perf bench numa mem -p 4 -t 6 -P 512 -s 20 -zZ0q --thp 1"
4x6-bw-thread, 21.007, secs, runtime-max/thread
4x6-bw-thread, 20.075, secs, runtime-min/thread
4x6-bw-thread, 20.573, secs, runtime-avg/thread
4x6-bw-thread, 2.219, %, spread-runtime/thread
4x6-bw-thread, 5.503, GB, data/thread
4x6-bw-thread, 132.070, GB, data-total
4x6-bw-thread, 3.817, nsecs, runtime/byte/thread
4x6-bw-thread, 0.262, GB/sec, thread-speed
4x6-bw-thread, 6.287, GB/sec, total-speed
# Running 4x8-bw-thread, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp 1"
4x8-bw-thread, 21.986, secs, runtime-max/thread
4x8-bw-thread, 20.359, secs, runtime-min/thread
4x8-bw-thread, 21.300, secs, runtime-avg/thread
4x8-bw-thread, 3.701, %, spread-runtime/thread
4x8-bw-thread, 4.027, GB, data/thread
4x8-bw-thread, 128.849, GB, data-total
4x8-bw-thread, 5.460, nsecs, runtime/byte/thread
4x8-bw-thread, 0.183, GB/sec, thread-speed
4x8-bw-thread, 5.860, GB/sec, total-speed
# Running 4x8-bw-thread-NOTHP, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp 1 --thp -1"
4x8-bw-thread-NOTHP, 21.155, secs, runtime-max/thread
4x8-bw-thread-NOTHP, 20.115, secs, runtime-min/thread
4x8-bw-thread-NOTHP, 20.705, secs, runtime-avg/thread
4x8-bw-thread-NOTHP, 2.459, %, spread-runtime/thread
4x8-bw-thread-NOTHP, 4.077, GB, data/thread
4x8-bw-thread-NOTHP, 130.460, GB, data-total
4x8-bw-thread-NOTHP, 5.189, nsecs, runtime/byte/thread
4x8-bw-thread-NOTHP, 0.193, GB/sec, thread-speed
4x8-bw-thread-NOTHP, 6.167, GB/sec, total-speed
# Running 3x3-bw-thread, "perf bench numa mem -p 3 -t 3 -P 512 -s 20 -zZ0q --thp 1"
3x3-bw-thread, 20.211, secs, runtime-max/thread
3x3-bw-thread, 20.044, secs, runtime-min/thread
3x3-bw-thread, 20.127, secs, runtime-avg/thread
3x3-bw-thread, 0.413, %, spread-runtime/thread
3x3-bw-thread, 18.492, GB, data/thread
3x3-bw-thread, 166.430, GB, data-total
3x3-bw-thread, 1.093, nsecs, runtime/byte/thread
3x3-bw-thread, 0.915, GB/sec, thread-speed
3x3-bw-thread, 8.235, GB/sec, total-speed
# Running 5x5-bw-thread, "perf bench numa mem -p 5 -t 5 -P 512 -s 20 -zZ0q --thp 1"
5x5-bw-thread, 21.244, secs, runtime-max/thread
5x5-bw-thread, 20.115, secs, runtime-min/thread
5x5-bw-thread, 20.873, secs, runtime-avg/thread
5x5-bw-thread, 2.657, %, spread-runtime/thread
5x5-bw-thread, 4.896, GB, data/thread
5x5-bw-thread, 122.407, GB, data-total
5x5-bw-thread, 4.339, nsecs, runtime/byte/thread
5x5-bw-thread, 0.230, GB/sec, thread-speed
5x5-bw-thread, 5.762, GB/sec, total-speed
# Running 2x16-bw-thread, "perf bench numa mem -p 2 -t 16 -P 512 -s 20 -zZ0q --thp 1"
2x16-bw-thread, 21.854, secs, runtime-max/thread
2x16-bw-thread, 20.047, secs, runtime-min/thread
2x16-bw-thread, 21.157, secs, runtime-avg/thread
2x16-bw-thread, 4.135, %, spread-runtime/thread
2x16-bw-thread, 4.043, GB, data/thread
2x16-bw-thread, 129.386, GB, data-total
2x16-bw-thread, 5.405, nsecs, runtime/byte/thread
2x16-bw-thread, 0.185, GB/sec, thread-speed
2x16-bw-thread, 5.920, GB/sec, total-speed
# Running 1x32-bw-thread, "perf bench numa mem -p 1 -t 32 -P 2048 -s 20 -zZ0q --thp 1"
1x32-bw-thread, 23.952, secs, runtime-max/thread
1x32-bw-thread, 20.470, secs, runtime-min/thread
1x32-bw-thread, 22.975, secs, runtime-avg/thread
1x32-bw-thread, 7.268, %, spread-runtime/thread
1x32-bw-thread, 4.362, GB, data/thread
1x32-bw-thread, 139.586, GB, data-total
1x32-bw-thread, 5.491, nsecs, runtime/byte/thread
1x32-bw-thread, 0.182, GB/sec, thread-speed
1x32-bw-thread, 5.828, GB/sec, total-speed
# Running numa02-bw, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp 1"
numa02-bw, 19.990, secs, runtime-max/thread
numa02-bw, 19.975, secs, runtime-min/thread
numa02-bw, 19.995, secs, runtime-avg/thread
numa02-bw, 0.037, %, spread-runtime/thread
numa02-bw, 18.150, GB, data/thread
numa02-bw, 580.794, GB, data-total
numa02-bw, 1.101, nsecs, runtime/byte/thread
numa02-bw, 0.908, GB/sec, thread-speed
numa02-bw, 29.054, GB/sec, total-speed
# Running numa02-bw-NOTHP, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp 1 --thp -1"
numa02-bw-NOTHP, 20.072, secs, runtime-max/thread
numa02-bw-NOTHP, 19.965, secs, runtime-min/thread
numa02-bw-NOTHP, 19.998, secs, runtime-avg/thread
numa02-bw-NOTHP, 0.266, %, spread-runtime/thread
numa02-bw-NOTHP, 16.975, GB, data/thread
numa02-bw-NOTHP, 543.213, GB, data-total
numa02-bw-NOTHP, 1.182, nsecs, runtime/byte/thread
numa02-bw-NOTHP, 0.846, GB/sec, thread-speed
numa02-bw-NOTHP, 27.064, GB/sec, total-speed
# Running numa01-bw-thread, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp 1"
numa01-bw-thread, 20.125, secs, runtime-max/thread
numa01-bw-thread, 19.980, secs, runtime-min/thread
numa01-bw-thread, 20.094, secs, runtime-avg/thread
numa01-bw-thread, 0.361, %, spread-runtime/thread
numa01-bw-thread, 12.791, GB, data/thread
numa01-bw-thread, 409.297, GB, data-total
numa01-bw-thread, 1.573, nsecs, runtime/byte/thread
numa01-bw-thread, 0.636, GB/sec, thread-speed
numa01-bw-thread, 20.338, GB/sec, total-speed
# Running numa01-bw-thread-NOTHP, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp 1 --thp -1"
numa01-bw-thread-NOTHP, 20.298, secs, runtime-max/thread
numa01-bw-thread-NOTHP, 19.965, secs, runtime-min/thread
numa01-bw-thread-NOTHP, 20.055, secs, runtime-avg/thread
numa01-bw-thread-NOTHP, 0.820, %, spread-runtime/thread
numa01-bw-thread-NOTHP, 11.752, GB, data/thread
numa01-bw-thread-NOTHP, 376.078, GB, data-total
numa01-bw-thread-NOTHP, 1.727, nsecs, runtime/byte/thread
numa01-bw-thread-NOTHP, 0.579, GB/sec, thread-speed
numa01-bw-thread-NOTHP, 18.528, GB/sec, total-speed
#
# Running test on: Linux vega 3.6.0+ #4 SMP Fri Dec 7 19:14:49 CET 2012 x86_64 x86_64 x86_64 GNU/Linux
#
# Running numa/mem benchmark...
# Running main, "perf bench numa mem -a"
# Running RAM-bw-local, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp 1 --no-data_rand_walk"
RAM-bw-local, 20.080, secs, runtime-max/thread
RAM-bw-local, 20.073, secs, runtime-min/thread
RAM-bw-local, 20.073, secs, runtime-avg/thread
RAM-bw-local, 0.018, %, spread-runtime/thread
RAM-bw-local, 170.725, GB, data/thread
RAM-bw-local, 170.725, GB, data-total
RAM-bw-local, 0.118, nsecs, runtime/byte/thread
RAM-bw-local, 8.502, GB/sec, thread-speed
RAM-bw-local, 8.502, GB/sec, total-speed
# Running RAM-bw-local-NOTHP, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp 1 --no-data_rand_walk --thp -1"
RAM-bw-local-NOTHP, 20.112, secs, runtime-max/thread
RAM-bw-local-NOTHP, 20.028, secs, runtime-min/thread
RAM-bw-local-NOTHP, 20.028, secs, runtime-avg/thread
RAM-bw-local-NOTHP, 0.209, %, spread-runtime/thread
RAM-bw-local-NOTHP, 169.651, GB, data/thread
RAM-bw-local-NOTHP, 169.651, GB, data-total
RAM-bw-local-NOTHP, 0.119, nsecs, runtime/byte/thread
RAM-bw-local-NOTHP, 8.435, GB/sec, thread-speed
RAM-bw-local-NOTHP, 8.435, GB/sec, total-speed
# Running RAM-bw-remote, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 1 -s 20 -zZq --thp 1 --no-data_rand_walk"
RAM-bw-remote, 20.101, secs, runtime-max/thread
RAM-bw-remote, 20.093, secs, runtime-min/thread
RAM-bw-remote, 20.093, secs, runtime-avg/thread
RAM-bw-remote, 0.021, %, spread-runtime/thread
RAM-bw-remote, 134.218, GB, data/thread
RAM-bw-remote, 134.218, GB, data-total
RAM-bw-remote, 0.150, nsecs, runtime/byte/thread
RAM-bw-remote, 6.677, GB/sec, thread-speed
RAM-bw-remote, 6.677, GB/sec, total-speed
# Running RAM-bw-local-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 0x2 -s 20 -zZq --thp 1 --no-data_rand_walk"
RAM-bw-local-2x, 20.109, secs, runtime-max/thread
RAM-bw-local-2x, 20.011, secs, runtime-min/thread
RAM-bw-local-2x, 20.056, secs, runtime-avg/thread
RAM-bw-local-2x, 0.243, %, spread-runtime/thread
RAM-bw-local-2x, 135.291, GB, data/thread
RAM-bw-local-2x, 270.583, GB, data-total
RAM-bw-local-2x, 0.149, nsecs, runtime/byte/thread
RAM-bw-local-2x, 6.728, GB/sec, thread-speed
RAM-bw-local-2x, 13.456, GB/sec, total-speed
# Running RAM-bw-remote-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 1x2 -s 20 -zZq --thp 1 --no-data_rand_walk"
RAM-bw-remote-2x, 20.292, secs, runtime-max/thread
RAM-bw-remote-2x, 20.279, secs, runtime-min/thread
RAM-bw-remote-2x, 20.281, secs, runtime-avg/thread
RAM-bw-remote-2x, 0.034, %, spread-runtime/thread
RAM-bw-remote-2x, 74.625, GB, data/thread
RAM-bw-remote-2x, 149.250, GB, data-total
RAM-bw-remote-2x, 0.272, nsecs, runtime/byte/thread
RAM-bw-remote-2x, 3.677, GB/sec, thread-speed
RAM-bw-remote-2x, 7.355, GB/sec, total-speed
# Running RAM-bw-cross, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,8 -M 1,0 -s 20 -zZq --thp 1 --no-data_rand_walk"
RAM-bw-cross, 20.177, secs, runtime-max/thread
RAM-bw-cross, 20.158, secs, runtime-min/thread
RAM-bw-cross, 20.163, secs, runtime-avg/thread
RAM-bw-cross, 0.048, %, spread-runtime/thread
RAM-bw-cross, 122.943, GB, data/thread
RAM-bw-cross, 245.887, GB, data-total
RAM-bw-cross, 0.164, nsecs, runtime/byte/thread
RAM-bw-cross, 6.093, GB/sec, thread-speed
RAM-bw-cross, 12.187, GB/sec, total-speed
# Running 1x3-convergence, "perf bench numa mem -p 1 -t 3 -P 512 -s 100 -zZ0qcm --thp 1"
1x3-convergence, 0.224, secs, NUMA-convergence-latency
1x3-convergence, 0.224, secs, runtime-max/thread
1x3-convergence, 0.205, secs, runtime-min/thread
1x3-convergence, 0.214, secs, runtime-avg/thread
1x3-convergence, 4.078, %, spread-runtime/thread
1x3-convergence, 0.537, GB, data/thread
1x3-convergence, 1.611, GB, data-total
1x3-convergence, 0.417, nsecs, runtime/byte/thread
1x3-convergence, 2.401, GB/sec, thread-speed
1x3-convergence, 7.202, GB/sec, total-speed
# Running 1x4-convergence, "perf bench numa mem -p 1 -t 4 -P 512 -s 100 -zZ0qcm --thp 1"
1x4-convergence, 100.173, secs, NUMA-convergence-latency
1x4-convergence, 100.173, secs, runtime-max/thread
1x4-convergence, 100.026, secs, runtime-min/thread
1x4-convergence, 100.067, secs, runtime-avg/thread
1x4-convergence, 0.073, %, spread-runtime/thread
1x4-convergence, 162.672, GB, data/thread
1x4-convergence, 650.688, GB, data-total
1x4-convergence, 0.616, nsecs, runtime/byte/thread
1x4-convergence, 1.624, GB/sec, thread-speed
1x4-convergence, 6.496, GB/sec, total-speed
# Running 1x6-convergence, "perf bench numa mem -p 1 -t 6 -P 1020 -s 100 -zZ0qcm --thp 1"
1x6-convergence, 100.821, secs, NUMA-convergence-latency
1x6-convergence, 100.821, secs, runtime-max/thread
1x6-convergence, 100.428, secs, runtime-min/thread
1x6-convergence, 100.706, secs, runtime-avg/thread
1x6-convergence, 0.195, %, spread-runtime/thread
1x6-convergence, 99.111, GB, data/thread
1x6-convergence, 594.668, GB, data-total
1x6-convergence, 1.017, nsecs, runtime/byte/thread
1x6-convergence, 0.983, GB/sec, thread-speed
1x6-convergence, 5.898, GB/sec, total-speed
# Running 2x3-convergence, "perf bench numa mem -p 3 -t 3 -P 1020 -s 100 -zZ0qcm --thp 1"
2x3-convergence, 100.539, secs, NUMA-convergence-latency
2x3-convergence, 100.539, secs, runtime-max/thread
2x3-convergence, 100.015, secs, runtime-min/thread
2x3-convergence, 100.273, secs, runtime-avg/thread
2x3-convergence, 0.260, %, spread-runtime/thread
2x3-convergence, 147.954, GB, data/thread
2x3-convergence, 1331.587, GB, data-total
2x3-convergence, 0.680, nsecs, runtime/byte/thread
2x3-convergence, 1.472, GB/sec, thread-speed
2x3-convergence, 13.245, GB/sec, total-speed
# Running 3x3-convergence, "perf bench numa mem -p 3 -t 3 -P 1020 -s 100 -zZ0qcm --thp 1"
3x3-convergence, 100.463, secs, NUMA-convergence-latency
3x3-convergence, 100.463, secs, runtime-max/thread
3x3-convergence, 100.066, secs, runtime-min/thread
3x3-convergence, 100.216, secs, runtime-avg/thread
3x3-convergence, 0.198, %, spread-runtime/thread
3x3-convergence, 132.624, GB, data/thread
3x3-convergence, 1193.615, GB, data-total
3x3-convergence, 0.758, nsecs, runtime/byte/thread
3x3-convergence, 1.320, GB/sec, thread-speed
3x3-convergence, 11.881, GB/sec, total-speed
# Running 4x4-convergence, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp 1"
4x4-convergence, 4.119, secs, NUMA-convergence-latency
4x4-convergence, 4.119, secs, runtime-max/thread
4x4-convergence, 3.751, secs, runtime-min/thread
4x4-convergence, 3.948, secs, runtime-avg/thread
4x4-convergence, 4.462, %, spread-runtime/thread
4x4-convergence, 1.980, GB, data/thread
4x4-convergence, 31.675, GB, data-total
4x4-convergence, 2.081, nsecs, runtime/byte/thread
4x4-convergence, 0.481, GB/sec, thread-speed
4x4-convergence, 7.690, GB/sec, total-speed
# Running 4x4-convergence-NOTHP, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp 1 --thp -1"
4x4-convergence-NOTHP, 12.166, secs, NUMA-convergence-latency
4x4-convergence-NOTHP, 12.166, secs, runtime-max/thread
4x4-convergence-NOTHP, 11.801, secs, runtime-min/thread
4x4-convergence-NOTHP, 11.917, secs, runtime-avg/thread
4x4-convergence-NOTHP, 1.502, %, spread-runtime/thread
4x4-convergence-NOTHP, 5.234, GB, data/thread
4x4-convergence-NOTHP, 83.752, GB, data-total
4x4-convergence-NOTHP, 2.324, nsecs, runtime/byte/thread
4x4-convergence-NOTHP, 0.430, GB/sec, thread-speed
4x4-convergence-NOTHP, 6.884, GB/sec, total-speed
# Running 4x6-convergence, "perf bench numa mem -p 4 -t 6 -P 1020 -s 100 -zZ0qcm --thp 1"
4x6-convergence, 16.592, secs, NUMA-convergence-latency
4x6-convergence, 16.592, secs, runtime-max/thread
4x6-convergence, 15.407, secs, runtime-min/thread
4x6-convergence, 16.109, secs, runtime-avg/thread
4x6-convergence, 3.572, %, spread-runtime/thread
4x6-convergence, 6.729, GB, data/thread
4x6-convergence, 161.502, GB, data-total
4x6-convergence, 2.466, nsecs, runtime/byte/thread
4x6-convergence, 0.406, GB/sec, thread-speed
4x6-convergence, 9.734, GB/sec, total-speed
# Running 4x8-convergence, "perf bench numa mem -p 4 -t 8 -P 512 -s 100 -zZ0qcm --thp 1"
4x8-convergence, 3.385, secs, NUMA-convergence-latency
4x8-convergence, 3.385, secs, runtime-max/thread
4x8-convergence, 1.465, secs, runtime-min/thread
4x8-convergence, 2.846, secs, runtime-avg/thread
4x8-convergence, 28.361, %, spread-runtime/thread
4x8-convergence, 0.638, GB, data/thread
4x8-convergence, 20.401, GB, data-total
4x8-convergence, 5.309, nsecs, runtime/byte/thread
4x8-convergence, 0.188, GB/sec, thread-speed
4x8-convergence, 6.028, GB/sec, total-speed
# Running 8x4-convergence, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp 1"
8x4-convergence, 18.295, secs, NUMA-convergence-latency
8x4-convergence, 18.295, secs, runtime-max/thread
8x4-convergence, 16.808, secs, runtime-min/thread
8x4-convergence, 17.809, secs, runtime-avg/thread
8x4-convergence, 4.064, %, spread-runtime/thread
8x4-convergence, 3.406, GB, data/thread
8x4-convergence, 108.985, GB, data-total
8x4-convergence, 5.372, nsecs, runtime/byte/thread
8x4-convergence, 0.186, GB/sec, thread-speed
8x4-convergence, 5.957, GB/sec, total-speed
# Running 8x4-convergence-NOTHP, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp 1 --thp -1"
8x4-convergence-NOTHP, 15.675, secs, NUMA-convergence-latency
8x4-convergence-NOTHP, 15.675, secs, runtime-max/thread
8x4-convergence-NOTHP, 14.861, secs, runtime-min/thread
8x4-convergence-NOTHP, 15.321, secs, runtime-avg/thread
8x4-convergence-NOTHP, 2.596, %, spread-runtime/thread
8x4-convergence-NOTHP, 5.302, GB, data/thread
8x4-convergence-NOTHP, 169.651, GB, data-total
8x4-convergence-NOTHP, 2.957, nsecs, runtime/byte/thread
8x4-convergence-NOTHP, 0.338, GB/sec, thread-speed
8x4-convergence-NOTHP, 10.823, GB/sec, total-speed
# Running 3x1-convergence, "perf bench numa mem -p 3 -t 1 -P 512 -s 100 -zZ0qcm --thp 1"
3x1-convergence, 0.811, secs, NUMA-convergence-latency
3x1-convergence, 0.811, secs, runtime-max/thread
3x1-convergence, 0.739, secs, runtime-min/thread
3x1-convergence, 0.782, secs, runtime-avg/thread
3x1-convergence, 4.431, %, spread-runtime/thread
3x1-convergence, 1.969, GB, data/thread
3x1-convergence, 5.906, GB, data-total
3x1-convergence, 0.412, nsecs, runtime/byte/thread
3x1-convergence, 2.428, GB/sec, thread-speed
3x1-convergence, 7.284, GB/sec, total-speed
# Running 4x1-convergence, "perf bench numa mem -p 4 -t 1 -P 512 -s 100 -zZ0qcm --thp 1"
4x1-convergence, 0.806, secs, NUMA-convergence-latency
4x1-convergence, 0.806, secs, runtime-max/thread
4x1-convergence, 0.728, secs, runtime-min/thread
4x1-convergence, 0.780, secs, runtime-avg/thread
4x1-convergence, 4.838, %, spread-runtime/thread
4x1-convergence, 1.476, GB, data/thread
4x1-convergence, 5.906, GB, data-total
4x1-convergence, 0.546, nsecs, runtime/byte/thread
4x1-convergence, 1.832, GB/sec, thread-speed
4x1-convergence, 7.329, GB/sec, total-speed
# Running 8x1-convergence, "perf bench numa mem -p 8 -t 1 -P 512 -s 100 -zZ0qcm --thp 1"
8x1-convergence, 2.879, secs, NUMA-convergence-latency
8x1-convergence, 2.879, secs, runtime-max/thread
8x1-convergence, 2.737, secs, runtime-min/thread
8x1-convergence, 2.805, secs, runtime-avg/thread
8x1-convergence, 2.475, %, spread-runtime/thread
8x1-convergence, 3.288, GB, data/thread
8x1-convergence, 26.307, GB, data-total
8x1-convergence, 0.876, nsecs, runtime/byte/thread
8x1-convergence, 1.142, GB/sec, thread-speed
8x1-convergence, 9.137, GB/sec, total-speed
# Running 16x1-convergence, "perf bench numa mem -p 16 -t 1 -P 256 -s 100 -zZ0qcm --thp 1"
16x1-convergence, 2.484, secs, NUMA-convergence-latency
16x1-convergence, 2.484, secs, runtime-max/thread
16x1-convergence, 2.169, secs, runtime-min/thread
16x1-convergence, 2.376, secs, runtime-avg/thread
16x1-convergence, 6.353, %, spread-runtime/thread
16x1-convergence, 0.906, GB, data/thread
16x1-convergence, 14.496, GB, data-total
16x1-convergence, 2.742, nsecs, runtime/byte/thread
16x1-convergence, 0.365, GB/sec, thread-speed
16x1-convergence, 5.835, GB/sec, total-speed
# Running 32x1-convergence, "perf bench numa mem -p 32 -t 1 -P 128 -s 100 -zZ0qcm --thp 1"
32x1-convergence, 3.039, secs, NUMA-convergence-latency
32x1-convergence, 3.039, secs, runtime-max/thread
32x1-convergence, 2.755, secs, runtime-min/thread
32x1-convergence, 2.983, secs, runtime-avg/thread
32x1-convergence, 4.672, %, spread-runtime/thread
32x1-convergence, 0.579, GB, data/thread
32x1-convergence, 18.522, GB, data-total
32x1-convergence, 5.251, nsecs, runtime/byte/thread
32x1-convergence, 0.190, GB/sec, thread-speed
32x1-convergence, 6.094, GB/sec, total-speed
# Running 2x1-bw-process, "perf bench numa mem -p 2 -t 1 -P 1024 -s 20 -zZ0q --thp 1"
2x1-bw-process, 20.217, secs, runtime-max/thread
2x1-bw-process, 20.126, secs, runtime-min/thread
2x1-bw-process, 20.168, secs, runtime-avg/thread
2x1-bw-process, 0.224, %, spread-runtime/thread
2x1-bw-process, 81.604, GB, data/thread
2x1-bw-process, 163.209, GB, data-total
2x1-bw-process, 0.248, nsecs, runtime/byte/thread
2x1-bw-process, 4.036, GB/sec, thread-speed
2x1-bw-process, 8.073, GB/sec, total-speed
# Running 3x1-bw-process, "perf bench numa mem -p 3 -t 1 -P 1024 -s 20 -zZ0q --thp 1"
3x1-bw-process, 20.138, secs, runtime-max/thread
3x1-bw-process, 20.075, secs, runtime-min/thread
3x1-bw-process, 20.105, secs, runtime-avg/thread
3x1-bw-process, 0.156, %, spread-runtime/thread
3x1-bw-process, 84.468, GB, data/thread
3x1-bw-process, 253.403, GB, data-total
3x1-bw-process, 0.238, nsecs, runtime/byte/thread
3x1-bw-process, 4.194, GB/sec, thread-speed
3x1-bw-process, 12.583, GB/sec, total-speed
# Running 4x1-bw-process, "perf bench numa mem -p 4 -t 1 -P 1024 -s 20 -zZ0q --thp 1"
4x1-bw-process, 20.143, secs, runtime-max/thread
4x1-bw-process, 20.052, secs, runtime-min/thread
4x1-bw-process, 20.079, secs, runtime-avg/thread
4x1-bw-process, 0.227, %, spread-runtime/thread
4x1-bw-process, 62.009, GB, data/thread
4x1-bw-process, 248.034, GB, data-total
4x1-bw-process, 0.325, nsecs, runtime/byte/thread
4x1-bw-process, 3.078, GB/sec, thread-speed
4x1-bw-process, 12.313, GB/sec, total-speed
# Running 8x1-bw-process, "perf bench numa mem -p 8 -t 1 -P 512 -s 20 -zZ0q --thp 1"
8x1-bw-process, 20.109, secs, runtime-max/thread
8x1-bw-process, 20.013, secs, runtime-min/thread
8x1-bw-process, 20.072, secs, runtime-avg/thread
8x1-bw-process, 0.238, %, spread-runtime/thread
8x1-bw-process, 50.869, GB, data/thread
8x1-bw-process, 406.948, GB, data-total
8x1-bw-process, 0.395, nsecs, runtime/byte/thread
8x1-bw-process, 2.530, GB/sec, thread-speed
8x1-bw-process, 20.237, GB/sec, total-speed
# Running 8x1-bw-process-NOTHP, "perf bench numa mem -p 8 -t 1 -P 512 -s 20 -zZ0q --thp 1 --thp -1"
8x1-bw-process-NOTHP, 20.203, secs, runtime-max/thread
8x1-bw-process-NOTHP, 20.033, secs, runtime-min/thread
8x1-bw-process-NOTHP, 20.071, secs, runtime-avg/thread
8x1-bw-process-NOTHP, 0.422, %, spread-runtime/thread
8x1-bw-process-NOTHP, 45.030, GB, data/thread
8x1-bw-process-NOTHP, 360.240, GB, data-total
8x1-bw-process-NOTHP, 0.449, nsecs, runtime/byte/thread
8x1-bw-process-NOTHP, 2.229, GB/sec, thread-speed
8x1-bw-process-NOTHP, 17.831, GB/sec, total-speed
# Running 16x1-bw-process, "perf bench numa mem -p 16 -t 1 -P 256 -s 20 -zZ0q --thp 1"
16x1-bw-process, 20.271, secs, runtime-max/thread
16x1-bw-process, 20.021, secs, runtime-min/thread
16x1-bw-process, 20.175, secs, runtime-avg/thread
16x1-bw-process, 0.615, %, spread-runtime/thread
16x1-bw-process, 7.550, GB, data/thread
16x1-bw-process, 120.796, GB, data-total
16x1-bw-process, 2.685, nsecs, runtime/byte/thread
16x1-bw-process, 0.372, GB/sec, thread-speed
16x1-bw-process, 5.959, GB/sec, total-speed
# Running 4x1-bw-thread, "perf bench numa mem -p 1 -t 4 -T 256 -s 20 -zZ0q --thp 1"
4x1-bw-thread, 20.052, secs, runtime-max/thread
4x1-bw-thread, 20.013, secs, runtime-min/thread
4x1-bw-thread, 20.030, secs, runtime-avg/thread
4x1-bw-thread, 0.097, %, spread-runtime/thread
4x1-bw-thread, 87.443, GB, data/thread
4x1-bw-thread, 349.771, GB, data-total
4x1-bw-thread, 0.229, nsecs, runtime/byte/thread
4x1-bw-thread, 4.361, GB/sec, thread-speed
4x1-bw-thread, 17.443, GB/sec, total-speed
# Running 8x1-bw-thread, "perf bench numa mem -p 1 -t 8 -T 256 -s 20 -zZ0q --thp 1"
8x1-bw-thread, 20.067, secs, runtime-max/thread
8x1-bw-thread, 20.011, secs, runtime-min/thread
8x1-bw-thread, 20.038, secs, runtime-avg/thread
8x1-bw-thread, 0.140, %, spread-runtime/thread
8x1-bw-thread, 56.271, GB, data/thread
8x1-bw-thread, 450.166, GB, data-total
8x1-bw-thread, 0.357, nsecs, runtime/byte/thread
8x1-bw-thread, 2.804, GB/sec, thread-speed
8x1-bw-thread, 22.433, GB/sec, total-speed
# Running 16x1-bw-thread, "perf bench numa mem -p 1 -t 16 -T 128 -s 20 -zZ0q --thp 1"
16x1-bw-thread, 20.029, secs, runtime-max/thread
16x1-bw-thread, 20.002, secs, runtime-min/thread
16x1-bw-thread, 20.020, secs, runtime-avg/thread
16x1-bw-thread, 0.067, %, spread-runtime/thread
16x1-bw-thread, 25.292, GB, data/thread
16x1-bw-thread, 404.666, GB, data-total
16x1-bw-thread, 0.792, nsecs, runtime/byte/thread
16x1-bw-thread, 1.263, GB/sec, thread-speed
16x1-bw-thread, 20.204, GB/sec, total-speed
# Running 32x1-bw-thread, "perf bench numa mem -p 1 -t 32 -T 64 -s 20 -zZ0q --thp 1"
32x1-bw-thread, 19.989, secs, runtime-max/thread
32x1-bw-thread, 19.962, secs, runtime-min/thread
32x1-bw-thread, 20.004, secs, runtime-avg/thread
32x1-bw-thread, 0.068, %, spread-runtime/thread
32x1-bw-thread, 11.388, GB, data/thread
32x1-bw-thread, 364.401, GB, data-total
32x1-bw-thread, 1.755, nsecs, runtime/byte/thread
32x1-bw-thread, 0.570, GB/sec, thread-speed
32x1-bw-thread, 18.230, GB/sec, total-speed
# Running 2x3-bw-thread, "perf bench numa mem -p 2 -t 3 -P 512 -s 20 -zZ0q --thp 1"
2x3-bw-thread, 20.190, secs, runtime-max/thread
2x3-bw-thread, 20.082, secs, runtime-min/thread
2x3-bw-thread, 20.110, secs, runtime-avg/thread
2x3-bw-thread, 0.268, %, spread-runtime/thread
2x3-bw-thread, 49.303, GB, data/thread
2x3-bw-thread, 295.816, GB, data-total
2x3-bw-thread, 0.410, nsecs, runtime/byte/thread
2x3-bw-thread, 2.442, GB/sec, thread-speed
2x3-bw-thread, 14.652, GB/sec, total-speed
# Running 4x4-bw-thread, "perf bench numa mem -p 4 -t 4 -P 512 -s 20 -zZ0q --thp 1"
4x4-bw-thread, 20.307, secs, runtime-max/thread
4x4-bw-thread, 20.002, secs, runtime-min/thread
4x4-bw-thread, 20.202, secs, runtime-avg/thread
4x4-bw-thread, 0.750, %, spread-runtime/thread
4x4-bw-thread, 12.482, GB, data/thread
4x4-bw-thread, 199.716, GB, data-total
4x4-bw-thread, 1.627, nsecs, runtime/byte/thread
4x4-bw-thread, 0.615, GB/sec, thread-speed
4x4-bw-thread, 9.835, GB/sec, total-speed
# Running 4x6-bw-thread, "perf bench numa mem -p 4 -t 6 -P 512 -s 20 -zZ0q --thp 1"
4x6-bw-thread, 20.431, secs, runtime-max/thread
4x6-bw-thread, 20.007, secs, runtime-min/thread
4x6-bw-thread, 20.283, secs, runtime-avg/thread
4x6-bw-thread, 1.036, %, spread-runtime/thread
4x6-bw-thread, 13.086, GB, data/thread
4x6-bw-thread, 314.069, GB, data-total
4x6-bw-thread, 1.561, nsecs, runtime/byte/thread
4x6-bw-thread, 0.641, GB/sec, thread-speed
4x6-bw-thread, 15.372, GB/sec, total-speed
# Running 4x8-bw-thread, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp 1"
4x8-bw-thread, 20.543, secs, runtime-max/thread
4x8-bw-thread, 20.015, secs, runtime-min/thread
4x8-bw-thread, 20.324, secs, runtime-avg/thread
4x8-bw-thread, 1.287, %, spread-runtime/thread
4x8-bw-thread, 7.617, GB, data/thread
4x8-bw-thread, 243.739, GB, data-total
4x8-bw-thread, 2.697, nsecs, runtime/byte/thread
4x8-bw-thread, 0.371, GB/sec, thread-speed
4x8-bw-thread, 11.865, GB/sec, total-speed
# Running 4x8-bw-thread-NOTHP, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp 1 --thp -1"
4x8-bw-thread-NOTHP, 20.661, secs, runtime-max/thread
4x8-bw-thread-NOTHP, 20.023, secs, runtime-min/thread
4x8-bw-thread-NOTHP, 20.292, secs, runtime-avg/thread
4x8-bw-thread-NOTHP, 1.546, %, spread-runtime/thread
4x8-bw-thread-NOTHP, 5.956, GB, data/thread
4x8-bw-thread-NOTHP, 190.589, GB, data-total
4x8-bw-thread-NOTHP, 3.469, nsecs, runtime/byte/thread
4x8-bw-thread-NOTHP, 0.288, GB/sec, thread-speed
4x8-bw-thread-NOTHP, 9.224, GB/sec, total-speed
# Running 3x3-bw-thread, "perf bench numa mem -p 3 -t 3 -P 512 -s 20 -zZ0q --thp 1"
3x3-bw-thread, 20.310, secs, runtime-max/thread
3x3-bw-thread, 20.116, secs, runtime-min/thread
3x3-bw-thread, 20.202, secs, runtime-avg/thread
3x3-bw-thread, 0.480, %, spread-runtime/thread
3x3-bw-thread, 14.973, GB, data/thread
3x3-bw-thread, 134.755, GB, data-total
3x3-bw-thread, 1.356, nsecs, runtime/byte/thread
3x3-bw-thread, 0.737, GB/sec, thread-speed
3x3-bw-thread, 6.635, GB/sec, total-speed
# Running 5x5-bw-thread, "perf bench numa mem -p 5 -t 5 -P 512 -s 20 -zZ0q --thp 1"
5x5-bw-thread, 20.578, secs, runtime-max/thread
5x5-bw-thread, 20.039, secs, runtime-min/thread
5x5-bw-thread, 20.379, secs, runtime-avg/thread
5x5-bw-thread, 1.309, %, spread-runtime/thread
5x5-bw-thread, 7.881, GB, data/thread
5x5-bw-thread, 197.032, GB, data-total
5x5-bw-thread, 2.611, nsecs, runtime/byte/thread
5x5-bw-thread, 0.383, GB/sec, thread-speed
5x5-bw-thread, 9.575, GB/sec, total-speed
# Running 2x16-bw-thread, "perf bench numa mem -p 2 -t 16 -P 512 -s 20 -zZ0q --thp 1"
2x16-bw-thread, 21.581, secs, runtime-max/thread
2x16-bw-thread, 20.043, secs, runtime-min/thread
2x16-bw-thread, 20.958, secs, runtime-avg/thread
2x16-bw-thread, 3.564, %, spread-runtime/thread
2x16-bw-thread, 4.010, GB, data/thread
2x16-bw-thread, 128.312, GB, data-total
2x16-bw-thread, 5.382, nsecs, runtime/byte/thread
2x16-bw-thread, 0.186, GB/sec, thread-speed
2x16-bw-thread, 5.945, GB/sec, total-speed
# Running 1x32-bw-thread, "perf bench numa mem -p 1 -t 32 -P 2048 -s 20 -zZ0q --thp 1"
1x32-bw-thread, 23.503, secs, runtime-max/thread
1x32-bw-thread, 21.850, secs, runtime-min/thread
1x32-bw-thread, 22.953, secs, runtime-avg/thread
1x32-bw-thread, 3.518, %, spread-runtime/thread
1x32-bw-thread, 4.295, GB, data/thread
1x32-bw-thread, 137.439, GB, data-total
1x32-bw-thread, 5.472, nsecs, runtime/byte/thread
1x32-bw-thread, 0.183, GB/sec, thread-speed
1x32-bw-thread, 5.848, GB/sec, total-speed
# Running numa02-bw, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp 1"
numa02-bw, 19.948, secs, runtime-max/thread
numa02-bw, 19.921, secs, runtime-min/thread
numa02-bw, 19.983, secs, runtime-avg/thread
numa02-bw, 0.068, %, spread-runtime/thread
numa02-bw, 15.425, GB, data/thread
numa02-bw, 493.586, GB, data-total
numa02-bw, 1.293, nsecs, runtime/byte/thread
numa02-bw, 0.773, GB/sec, thread-speed
numa02-bw, 24.744, GB/sec, total-speed
# Running numa02-bw-NOTHP, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp 1 --thp -1"
numa02-bw-NOTHP, 20.055, secs, runtime-max/thread
numa02-bw-NOTHP, 19.948, secs, runtime-min/thread
numa02-bw-NOTHP, 19.991, secs, runtime-avg/thread
numa02-bw-NOTHP, 0.267, %, spread-runtime/thread
numa02-bw-NOTHP, 12.795, GB, data/thread
numa02-bw-NOTHP, 409.431, GB, data-total
numa02-bw-NOTHP, 1.567, nsecs, runtime/byte/thread
numa02-bw-NOTHP, 0.638, GB/sec, thread-speed
numa02-bw-NOTHP, 20.415, GB/sec, total-speed
# Running numa01-bw-thread, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp 1"
numa01-bw-thread, 20.107, secs, runtime-max/thread
numa01-bw-thread, 19.978, secs, runtime-min/thread
numa01-bw-thread, 20.067, secs, runtime-avg/thread
numa01-bw-thread, 0.320, %, spread-runtime/thread
numa01-bw-thread, 9.532, GB, data/thread
numa01-bw-thread, 305.010, GB, data-total
numa01-bw-thread, 2.110, nsecs, runtime/byte/thread
numa01-bw-thread, 0.474, GB/sec, thread-speed
numa01-bw-thread, 15.169, GB/sec, total-speed
# Running numa01-bw-thread-NOTHP, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp 1 --thp -1"
numa01-bw-thread-NOTHP, 20.319, secs, runtime-max/thread
numa01-bw-thread-NOTHP, 19.978, secs, runtime-min/thread
numa01-bw-thread-NOTHP, 20.076, secs, runtime-avg/thread
numa01-bw-thread-NOTHP, 0.839, %, spread-runtime/thread
numa01-bw-thread-NOTHP, 7.688, GB, data/thread
numa01-bw-thread-NOTHP, 246.021, GB, data-total
numa01-bw-thread-NOTHP, 2.643, nsecs, runtime/byte/thread
numa01-bw-thread-NOTHP, 0.378, GB/sec, thread-speed
numa01-bw-thread-NOTHP, 12.108, GB/sec, total-speed
#
# Running test on: Linux vega 3.7.0-rc8+ #2 SMP Fri Dec 7 02:46:02 CET 2012 x86_64 x86_64 x86_64 GNU/Linux
#
# Running numa/mem benchmark...
# Running main, "perf bench numa mem -a"
# Running RAM-bw-local, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp 1 --no-data_rand_walk"
RAM-bw-local, 20.132, secs, runtime-max/thread
RAM-bw-local, 20.123, secs, runtime-min/thread
RAM-bw-local, 20.123, secs, runtime-avg/thread
RAM-bw-local, 0.024, %, spread-runtime/thread
RAM-bw-local, 171.799, GB, data/thread
RAM-bw-local, 171.799, GB, data-total
RAM-bw-local, 0.117, nsecs, runtime/byte/thread
RAM-bw-local, 8.534, GB/sec, thread-speed
RAM-bw-local, 8.534, GB/sec, total-speed
# Running RAM-bw-local-NOTHP, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp 1 --no-data_rand_walk --thp -1"
RAM-bw-local-NOTHP, 20.133, secs, runtime-max/thread
RAM-bw-local-NOTHP, 20.047, secs, runtime-min/thread
RAM-bw-local-NOTHP, 20.047, secs, runtime-avg/thread
RAM-bw-local-NOTHP, 0.214, %, spread-runtime/thread
RAM-bw-local-NOTHP, 169.651, GB, data/thread
RAM-bw-local-NOTHP, 169.651, GB, data-total
RAM-bw-local-NOTHP, 0.119, nsecs, runtime/byte/thread
RAM-bw-local-NOTHP, 8.427, GB/sec, thread-speed
RAM-bw-local-NOTHP, 8.427, GB/sec, total-speed
# Running RAM-bw-remote, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 1 -s 20 -zZq --thp 1 --no-data_rand_walk"
RAM-bw-remote, 20.127, secs, runtime-max/thread
RAM-bw-remote, 20.117, secs, runtime-min/thread
RAM-bw-remote, 20.117, secs, runtime-avg/thread
RAM-bw-remote, 0.025, %, spread-runtime/thread
RAM-bw-remote, 134.218, GB, data/thread
RAM-bw-remote, 134.218, GB, data-total
RAM-bw-remote, 0.150, nsecs, runtime/byte/thread
RAM-bw-remote, 6.669, GB/sec, thread-speed
RAM-bw-remote, 6.669, GB/sec, total-speed
# Running RAM-bw-local-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 0x2 -s 20 -zZq --thp 1 --no-data_rand_walk"
RAM-bw-local-2x, 20.139, secs, runtime-max/thread
RAM-bw-local-2x, 20.011, secs, runtime-min/thread
RAM-bw-local-2x, 20.070, secs, runtime-avg/thread
RAM-bw-local-2x, 0.319, %, spread-runtime/thread
RAM-bw-local-2x, 130.997, GB, data/thread
RAM-bw-local-2x, 261.993, GB, data-total
RAM-bw-local-2x, 0.154, nsecs, runtime/byte/thread
RAM-bw-local-2x, 6.505, GB/sec, thread-speed
RAM-bw-local-2x, 13.009, GB/sec, total-speed
# Running RAM-bw-remote-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 1x2 -s 20 -zZq --thp 1 --no-data_rand_walk"
RAM-bw-remote-2x, 20.177, secs, runtime-max/thread
RAM-bw-remote-2x, 20.083, secs, runtime-min/thread
RAM-bw-remote-2x, 20.125, secs, runtime-avg/thread
RAM-bw-remote-2x, 0.233, %, spread-runtime/thread
RAM-bw-remote-2x, 74.088, GB, data/thread
RAM-bw-remote-2x, 148.176, GB, data-total
RAM-bw-remote-2x, 0.272, nsecs, runtime/byte/thread
RAM-bw-remote-2x, 3.672, GB/sec, thread-speed
RAM-bw-remote-2x, 7.344, GB/sec, total-speed
# Running RAM-bw-cross, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,8 -M 1,0 -s 20 -zZq --thp 1 --no-data_rand_walk"
RAM-bw-cross, 20.122, secs, runtime-max/thread
RAM-bw-cross, 20.094, secs, runtime-min/thread
RAM-bw-cross, 20.103, secs, runtime-avg/thread
RAM-bw-cross, 0.070, %, spread-runtime/thread
RAM-bw-cross, 121.870, GB, data/thread
RAM-bw-cross, 243.739, GB, data-total
RAM-bw-cross, 0.165, nsecs, runtime/byte/thread
RAM-bw-cross, 6.057, GB/sec, thread-speed
RAM-bw-cross, 12.113, GB/sec, total-speed
# Running 1x3-convergence, "perf bench numa mem -p 1 -t 3 -P 512 -s 100 -zZ0qcm --thp 1"
1x3-convergence, 2.333, secs, NUMA-convergence-latency
1x3-convergence, 2.333, secs, runtime-max/thread
1x3-convergence, 2.304, secs, runtime-min/thread
1x3-convergence, 2.313, secs, runtime-avg/thread
1x3-convergence, 0.620, %, spread-runtime/thread
1x3-convergence, 7.516, GB, data/thread
1x3-convergence, 22.549, GB, data-total
1x3-convergence, 0.310, nsecs, runtime/byte/thread
1x3-convergence, 3.222, GB/sec, thread-speed
1x3-convergence, 9.665, GB/sec, total-speed
# Running 1x4-convergence, "perf bench numa mem -p 1 -t 4 -P 512 -s 100 -zZ0qcm --thp 1"
1x4-convergence, 2.057, secs, NUMA-convergence-latency
1x4-convergence, 2.057, secs, runtime-max/thread
1x4-convergence, 1.958, secs, runtime-min/thread
1x4-convergence, 1.998, secs, runtime-avg/thread
1x4-convergence, 2.403, %, spread-runtime/thread
1x4-convergence, 4.429, GB, data/thread
1x4-convergence, 17.717, GB, data-total
1x4-convergence, 0.464, nsecs, runtime/byte/thread
1x4-convergence, 2.154, GB/sec, thread-speed
1x4-convergence, 8.614, GB/sec, total-speed
# Running 1x6-convergence, "perf bench numa mem -p 1 -t 6 -P 1020 -s 100 -zZ0qcm --thp 1"
1x6-convergence, 7.327, secs, NUMA-convergence-latency
1x6-convergence, 7.327, secs, runtime-max/thread
1x6-convergence, 6.879, secs, runtime-min/thread
1x6-convergence, 7.187, secs, runtime-avg/thread
1x6-convergence, 3.063, %, spread-runtime/thread
1x6-convergence, 11.052, GB, data/thread
1x6-convergence, 66.312, GB, data-total
1x6-convergence, 0.663, nsecs, runtime/byte/thread
1x6-convergence, 1.508, GB/sec, thread-speed
1x6-convergence, 9.050, GB/sec, total-speed
# Running 2x3-convergence, "perf bench numa mem -p 3 -t 3 -P 1020 -s 100 -zZ0qcm --thp 1"
2x3-convergence, 4.086, secs, NUMA-convergence-latency
2x3-convergence, 4.086, secs, runtime-max/thread
2x3-convergence, 3.779, secs, runtime-min/thread
2x3-convergence, 3.960, secs, runtime-avg/thread
2x3-convergence, 3.761, %, spread-runtime/thread
2x3-convergence, 6.774, GB, data/thread
2x3-convergence, 60.964, GB, data-total
2x3-convergence, 0.603, nsecs, runtime/byte/thread
2x3-convergence, 1.658, GB/sec, thread-speed
2x3-convergence, 14.920, GB/sec, total-speed
# Running 3x3-convergence, "perf bench numa mem -p 3 -t 3 -P 1020 -s 100 -zZ0qcm --thp 1"
3x3-convergence, 7.627, secs, NUMA-convergence-latency
3x3-convergence, 7.627, secs, runtime-max/thread
3x3-convergence, 7.380, secs, runtime-min/thread
3x3-convergence, 7.504, secs, runtime-avg/thread
3x3-convergence, 1.624, %, spread-runtime/thread
3x3-convergence, 15.093, GB, data/thread
3x3-convergence, 135.833, GB, data-total
3x3-convergence, 0.505, nsecs, runtime/byte/thread
3x3-convergence, 1.979, GB/sec, thread-speed
3x3-convergence, 17.809, GB/sec, total-speed
# Running 4x4-convergence, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp 1"
4x4-convergence, 7.381, secs, NUMA-convergence-latency
4x4-convergence, 7.381, secs, runtime-max/thread
4x4-convergence, 7.149, secs, runtime-min/thread
4x4-convergence, 7.277, secs, runtime-avg/thread
4x4-convergence, 1.569, %, spread-runtime/thread
4x4-convergence, 7.181, GB, data/thread
4x4-convergence, 114.890, GB, data-total
4x4-convergence, 1.028, nsecs, runtime/byte/thread
4x4-convergence, 0.973, GB/sec, thread-speed
4x4-convergence, 15.566, GB/sec, total-speed
# Running 4x4-convergence-NOTHP, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp 1 --thp -1"
4x4-convergence-NOTHP, 9.200, secs, NUMA-convergence-latency
4x4-convergence-NOTHP, 9.200, secs, runtime-max/thread
4x4-convergence-NOTHP, 8.944, secs, runtime-min/thread
4x4-convergence-NOTHP, 9.047, secs, runtime-avg/thread
4x4-convergence-NOTHP, 1.391, %, spread-runtime/thread
4x4-convergence-NOTHP, 11.778, GB, data/thread
4x4-convergence-NOTHP, 188.442, GB, data-total
4x4-convergence-NOTHP, 0.781, nsecs, runtime/byte/thread
4x4-convergence-NOTHP, 1.280, GB/sec, thread-speed
4x4-convergence-NOTHP, 20.483, GB/sec, total-speed
# Running 4x6-convergence, "perf bench numa mem -p 4 -t 6 -P 1020 -s 100 -zZ0qcm --thp 1"
4x6-convergence, 11.664, secs, NUMA-convergence-latency
4x6-convergence, 11.664, secs, runtime-max/thread
4x6-convergence, 11.155, secs, runtime-min/thread
4x6-convergence, 11.420, secs, runtime-avg/thread
4x6-convergence, 2.180, %, spread-runtime/thread
4x6-convergence, 11.319, GB, data/thread
4x6-convergence, 271.665, GB, data-total
4x6-convergence, 1.030, nsecs, runtime/byte/thread
4x6-convergence, 0.970, GB/sec, thread-speed
4x6-convergence, 23.292, GB/sec, total-speed
# Running 4x8-convergence, "perf bench numa mem -p 4 -t 8 -P 512 -s 100 -zZ0qcm --thp 1"
4x8-convergence, 3.880, secs, NUMA-convergence-latency
4x8-convergence, 3.880, secs, runtime-max/thread
4x8-convergence, 3.613, secs, runtime-min/thread
4x8-convergence, 3.784, secs, runtime-avg/thread
4x8-convergence, 3.440, %, spread-runtime/thread
4x8-convergence, 2.047, GB, data/thread
4x8-convergence, 65.498, GB, data-total
4x8-convergence, 1.896, nsecs, runtime/byte/thread
4x8-convergence, 0.528, GB/sec, thread-speed
4x8-convergence, 16.882, GB/sec, total-speed
# Running 8x4-convergence, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp 1"
8x4-convergence, 8.938, secs, NUMA-convergence-latency
8x4-convergence, 8.938, secs, runtime-max/thread
8x4-convergence, 8.556, secs, runtime-min/thread
8x4-convergence, 8.744, secs, runtime-avg/thread
8x4-convergence, 2.135, %, spread-runtime/thread
8x4-convergence, 4.396, GB, data/thread
8x4-convergence, 140.660, GB, data-total
8x4-convergence, 2.033, nsecs, runtime/byte/thread
8x4-convergence, 0.492, GB/sec, thread-speed
8x4-convergence, 15.738, GB/sec, total-speed
# Running 8x4-convergence-NOTHP, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp 1 --thp -1"
8x4-convergence-NOTHP, 12.123, secs, NUMA-convergence-latency
8x4-convergence-NOTHP, 12.123, secs, runtime-max/thread
8x4-convergence-NOTHP, 11.749, secs, runtime-min/thread
8x4-convergence-NOTHP, 11.936, secs, runtime-avg/thread
8x4-convergence-NOTHP, 1.542, %, spread-runtime/thread
8x4-convergence-NOTHP, 4.480, GB, data/thread
8x4-convergence-NOTHP, 143.345, GB, data-total
8x4-convergence-NOTHP, 2.706, nsecs, runtime/byte/thread
8x4-convergence-NOTHP, 0.370, GB/sec, thread-speed
8x4-convergence-NOTHP, 11.824, GB/sec, total-speed
# Running 3x1-convergence, "perf bench numa mem -p 3 -t 1 -P 512 -s 100 -zZ0qcm --thp 1"
3x1-convergence, 0.879, secs, NUMA-convergence-latency
3x1-convergence, 0.879, secs, runtime-max/thread
3x1-convergence, 0.810, secs, runtime-min/thread
3x1-convergence, 0.839, secs, runtime-avg/thread
3x1-convergence, 3.911, %, spread-runtime/thread
3x1-convergence, 2.326, GB, data/thread
3x1-convergence, 6.979, GB, data-total
3x1-convergence, 0.378, nsecs, runtime/byte/thread
3x1-convergence, 2.647, GB/sec, thread-speed
3x1-convergence, 7.941, GB/sec, total-speed
# Running 4x1-convergence, "perf bench numa mem -p 4 -t 1 -P 512 -s 100 -zZ0qcm --thp 1"
4x1-convergence, 0.685, secs, NUMA-convergence-latency
4x1-convergence, 0.685, secs, runtime-max/thread
4x1-convergence, 0.617, secs, runtime-min/thread
4x1-convergence, 0.650, secs, runtime-avg/thread
4x1-convergence, 4.967, %, spread-runtime/thread
4x1-convergence, 1.476, GB, data/thread
4x1-convergence, 5.906, GB, data-total
4x1-convergence, 0.464, nsecs, runtime/byte/thread
4x1-convergence, 2.154, GB/sec, thread-speed
4x1-convergence, 8.616, GB/sec, total-speed
# Running 8x1-convergence, "perf bench numa mem -p 8 -t 1 -P 512 -s 100 -zZ0qcm --thp 1"
8x1-convergence, 1.158, secs, NUMA-convergence-latency
8x1-convergence, 1.158, secs, runtime-max/thread
8x1-convergence, 1.010, secs, runtime-min/thread
8x1-convergence, 1.060, secs, runtime-avg/thread
8x1-convergence, 6.396, %, spread-runtime/thread
8x1-convergence, 1.745, GB, data/thread
8x1-convergence, 13.959, GB, data-total
8x1-convergence, 0.664, nsecs, runtime/byte/thread
8x1-convergence, 1.507, GB/sec, thread-speed
8x1-convergence, 12.054, GB/sec, total-speed
# Running 16x1-convergence, "perf bench numa mem -p 16 -t 1 -P 256 -s 100 -zZ0qcm --thp 1"
16x1-convergence, 2.010, secs, NUMA-convergence-latency
16x1-convergence, 2.010, secs, runtime-max/thread
16x1-convergence, 1.939, secs, runtime-min/thread
16x1-convergence, 1.991, secs, runtime-avg/thread
16x1-convergence, 1.760, %, spread-runtime/thread
16x1-convergence, 2.668, GB, data/thread
16x1-convergence, 42.681, GB, data-total
16x1-convergence, 0.753, nsecs, runtime/byte/thread
16x1-convergence, 1.327, GB/sec, thread-speed
16x1-convergence, 21.237, GB/sec, total-speed
# Running 32x1-convergence, "perf bench numa mem -p 32 -t 1 -P 128 -s 100 -zZ0qcm --thp 1"
32x1-convergence, 1.946, secs, NUMA-convergence-latency
32x1-convergence, 1.946, secs, runtime-max/thread
32x1-convergence, 1.850, secs, runtime-min/thread
32x1-convergence, 1.946, secs, runtime-avg/thread
32x1-convergence, 2.479, %, spread-runtime/thread
32x1-convergence, 1.242, GB, data/thread
32x1-convergence, 39.728, GB, data-total
32x1-convergence, 1.568, nsecs, runtime/byte/thread
32x1-convergence, 0.638, GB/sec, thread-speed
32x1-convergence, 20.410, GB/sec, total-speed
# Running 2x1-bw-process, "perf bench numa mem -p 2 -t 1 -P 1024 -s 20 -zZ0q --thp 1"
2x1-bw-process, 20.146, secs, runtime-max/thread
2x1-bw-process, 20.068, secs, runtime-min/thread
2x1-bw-process, 20.102, secs, runtime-avg/thread
2x1-bw-process, 0.193, %, spread-runtime/thread
2x1-bw-process, 97.174, GB, data/thread
2x1-bw-process, 194.347, GB, data-total
2x1-bw-process, 0.207, nsecs, runtime/byte/thread
2x1-bw-process, 4.824, GB/sec, thread-speed
2x1-bw-process, 9.647, GB/sec, total-speed
# Running 3x1-bw-process, "perf bench numa mem -p 3 -t 1 -P 1024 -s 20 -zZ0q --thp 1"
3x1-bw-process, 20.177, secs, runtime-max/thread
3x1-bw-process, 20.127, secs, runtime-min/thread
3x1-bw-process, 20.146, secs, runtime-avg/thread
3x1-bw-process, 0.126, %, spread-runtime/thread
3x1-bw-process, 97.711, GB, data/thread
3x1-bw-process, 293.132, GB, data-total
3x1-bw-process, 0.207, nsecs, runtime/byte/thread
3x1-bw-process, 4.843, GB/sec, thread-speed
3x1-bw-process, 14.528, GB/sec, total-speed
# Running 4x1-bw-process, "perf bench numa mem -p 4 -t 1 -P 1024 -s 20 -zZ0q --thp 1"
4x1-bw-process, 20.165, secs, runtime-max/thread
4x1-bw-process, 20.025, secs, runtime-min/thread
4x1-bw-process, 20.078, secs, runtime-avg/thread
4x1-bw-process, 0.348, %, spread-runtime/thread
4x1-bw-process, 95.295, GB, data/thread
4x1-bw-process, 381.178, GB, data-total
4x1-bw-process, 0.212, nsecs, runtime/byte/thread
4x1-bw-process, 4.726, GB/sec, thread-speed
4x1-bw-process, 18.903, GB/sec, total-speed
# Running 8x1-bw-process, "perf bench numa mem -p 8 -t 1 -P 512 -s 20 -zZ0q --thp 1"
8x1-bw-process, 20.131, secs, runtime-max/thread
8x1-bw-process, 20.066, secs, runtime-min/thread
8x1-bw-process, 20.090, secs, runtime-avg/thread
8x1-bw-process, 0.161, %, spread-runtime/thread
8x1-bw-process, 67.512, GB, data/thread
8x1-bw-process, 540.092, GB, data-total
8x1-bw-process, 0.298, nsecs, runtime/byte/thread
8x1-bw-process, 3.354, GB/sec, thread-speed
8x1-bw-process, 26.829, GB/sec, total-speed
# Running 8x1-bw-process-NOTHP, "perf bench numa mem -p 8 -t 1 -P 512 -s 20 -zZ0q --thp 1 --thp -1"
8x1-bw-process-NOTHP, 20.208, secs, runtime-max/thread
8x1-bw-process-NOTHP, 20.002, secs, runtime-min/thread
8x1-bw-process-NOTHP, 20.067, secs, runtime-avg/thread
8x1-bw-process-NOTHP, 0.509, %, spread-runtime/thread
8x1-bw-process-NOTHP, 56.170, GB, data/thread
8x1-bw-process-NOTHP, 449.361, GB, data-total
8x1-bw-process-NOTHP, 0.360, nsecs, runtime/byte/thread
8x1-bw-process-NOTHP, 2.780, GB/sec, thread-speed
8x1-bw-process-NOTHP, 22.237, GB/sec, total-speed
# Running 16x1-bw-process, "perf bench numa mem -p 16 -t 1 -P 256 -s 20 -zZ0q --thp 1"
16x1-bw-process, 20.068, secs, runtime-max/thread
16x1-bw-process, 20.014, secs, runtime-min/thread
16x1-bw-process, 20.042, secs, runtime-avg/thread
16x1-bw-process, 0.136, %, spread-runtime/thread
16x1-bw-process, 36.742, GB, data/thread
16x1-bw-process, 587.874, GB, data-total
16x1-bw-process, 0.546, nsecs, runtime/byte/thread
16x1-bw-process, 1.831, GB/sec, thread-speed
16x1-bw-process, 29.294, GB/sec, total-speed
# Running 4x1-bw-thread, "perf bench numa mem -p 1 -t 4 -T 256 -s 20 -zZ0q --thp 1"
4x1-bw-thread, 20.053, secs, runtime-max/thread
4x1-bw-thread, 20.003, secs, runtime-min/thread
4x1-bw-thread, 20.025, secs, runtime-avg/thread
4x1-bw-thread, 0.123, %, spread-runtime/thread
4x1-bw-thread, 96.704, GB, data/thread
4x1-bw-thread, 386.815, GB, data-total
4x1-bw-thread, 0.207, nsecs, runtime/byte/thread
4x1-bw-thread, 4.822, GB/sec, thread-speed
4x1-bw-thread, 19.290, GB/sec, total-speed
# Running 8x1-bw-thread, "perf bench numa mem -p 1 -t 8 -T 256 -s 20 -zZ0q --thp 1"
8x1-bw-thread, 20.068, secs, runtime-max/thread
8x1-bw-thread, 20.004, secs, runtime-min/thread
8x1-bw-thread, 20.031, secs, runtime-avg/thread
8x1-bw-thread, 0.160, %, spread-runtime/thread
8x1-bw-thread, 66.203, GB, data/thread
8x1-bw-thread, 529.623, GB, data-total
8x1-bw-thread, 0.303, nsecs, runtime/byte/thread
8x1-bw-thread, 3.299, GB/sec, thread-speed
8x1-bw-thread, 26.391, GB/sec, total-speed
# Running 16x1-bw-thread, "perf bench numa mem -p 1 -t 16 -T 128 -s 20 -zZ0q --thp 1"
16x1-bw-thread, 20.044, secs, runtime-max/thread
16x1-bw-thread, 20.007, secs, runtime-min/thread
16x1-bw-thread, 20.029, secs, runtime-avg/thread
16x1-bw-thread, 0.092, %, spread-runtime/thread
16x1-bw-thread, 37.027, GB, data/thread
16x1-bw-thread, 592.437, GB, data-total
16x1-bw-thread, 0.541, nsecs, runtime/byte/thread
16x1-bw-thread, 1.847, GB/sec, thread-speed
16x1-bw-thread, 29.557, GB/sec, total-speed
# Running 32x1-bw-thread, "perf bench numa mem -p 1 -t 32 -T 64 -s 20 -zZ0q --thp 1"
32x1-bw-thread, 20.029, secs, runtime-max/thread
32x1-bw-thread, 19.975, secs, runtime-min/thread
32x1-bw-thread, 20.015, secs, runtime-avg/thread
32x1-bw-thread, 0.134, %, spread-runtime/thread
32x1-bw-thread, 18.923, GB, data/thread
32x1-bw-thread, 605.523, GB, data-total
32x1-bw-thread, 1.058, nsecs, runtime/byte/thread
32x1-bw-thread, 0.945, GB/sec, thread-speed
32x1-bw-thread, 30.232, GB/sec, total-speed
# Running 2x3-bw-thread, "perf bench numa mem -p 2 -t 3 -P 512 -s 20 -zZ0q --thp 1"
2x3-bw-thread, 20.176, secs, runtime-max/thread
2x3-bw-thread, 20.072, secs, runtime-min/thread
2x3-bw-thread, 20.136, secs, runtime-avg/thread
2x3-bw-thread, 0.257, %, spread-runtime/thread
2x3-bw-thread, 51.540, GB, data/thread
2x3-bw-thread, 309.238, GB, data-total
2x3-bw-thread, 0.391, nsecs, runtime/byte/thread
2x3-bw-thread, 2.555, GB/sec, thread-speed
2x3-bw-thread, 15.327, GB/sec, total-speed
# Running 4x4-bw-thread, "perf bench numa mem -p 4 -t 4 -P 512 -s 20 -zZ0q --thp 1"
4x4-bw-thread, 20.183, secs, runtime-max/thread
4x4-bw-thread, 20.013, secs, runtime-min/thread
4x4-bw-thread, 20.086, secs, runtime-avg/thread
4x4-bw-thread, 0.421, %, spread-runtime/thread
4x4-bw-thread, 35.266, GB, data/thread
4x4-bw-thread, 564.251, GB, data-total
4x4-bw-thread, 0.572, nsecs, runtime/byte/thread
4x4-bw-thread, 1.747, GB/sec, thread-speed
4x4-bw-thread, 27.957, GB/sec, total-speed
# Running 4x6-bw-thread, "perf bench numa mem -p 4 -t 6 -P 512 -s 20 -zZ0q --thp 1"
4x6-bw-thread, 20.298, secs, runtime-max/thread
4x6-bw-thread, 20.061, secs, runtime-min/thread
4x6-bw-thread, 20.184, secs, runtime-avg/thread
4x6-bw-thread, 0.584, %, spread-runtime/thread
4x6-bw-thread, 23.578, GB, data/thread
4x6-bw-thread, 565.862, GB, data-total
4x6-bw-thread, 0.861, nsecs, runtime/byte/thread
4x6-bw-thread, 1.162, GB/sec, thread-speed
4x6-bw-thread, 27.877, GB/sec, total-speed
# Running 4x8-bw-thread, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp 1"
4x8-bw-thread, 20.350, secs, runtime-max/thread
4x8-bw-thread, 20.004, secs, runtime-min/thread
4x8-bw-thread, 20.190, secs, runtime-avg/thread
4x8-bw-thread, 0.851, %, spread-runtime/thread
4x8-bw-thread, 18.086, GB, data/thread
4x8-bw-thread, 578.747, GB, data-total
4x8-bw-thread, 1.125, nsecs, runtime/byte/thread
4x8-bw-thread, 0.889, GB/sec, thread-speed
4x8-bw-thread, 28.439, GB/sec, total-speed
# Running 4x8-bw-thread-NOTHP, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp 1 --thp -1"
4x8-bw-thread-NOTHP, 20.411, secs, runtime-max/thread
4x8-bw-thread-NOTHP, 19.990, secs, runtime-min/thread
4x8-bw-thread-NOTHP, 20.246, secs, runtime-avg/thread
4x8-bw-thread-NOTHP, 1.032, %, spread-runtime/thread
4x8-bw-thread-NOTHP, 15.989, GB, data/thread
4x8-bw-thread-NOTHP, 511.638, GB, data-total
4x8-bw-thread-NOTHP, 1.277, nsecs, runtime/byte/thread
4x8-bw-thread-NOTHP, 0.783, GB/sec, thread-speed
4x8-bw-thread-NOTHP, 25.067, GB/sec, total-speed
# Running 3x3-bw-thread, "perf bench numa mem -p 3 -t 3 -P 512 -s 20 -zZ0q --thp 1"
3x3-bw-thread, 20.170, secs, runtime-max/thread
3x3-bw-thread, 20.050, secs, runtime-min/thread
3x3-bw-thread, 20.109, secs, runtime-avg/thread
3x3-bw-thread, 0.299, %, spread-runtime/thread
3x3-bw-thread, 48.318, GB, data/thread
3x3-bw-thread, 434.865, GB, data-total
3x3-bw-thread, 0.417, nsecs, runtime/byte/thread
3x3-bw-thread, 2.396, GB/sec, thread-speed
3x3-bw-thread, 21.560, GB/sec, total-speed
# Running 5x5-bw-thread, "perf bench numa mem -p 5 -t 5 -P 512 -s 20 -zZ0q --thp 1"
5x5-bw-thread, 20.276, secs, runtime-max/thread
5x5-bw-thread, 20.004, secs, runtime-min/thread
5x5-bw-thread, 20.155, secs, runtime-avg/thread
5x5-bw-thread, 0.671, %, spread-runtime/thread
5x5-bw-thread, 21.153, GB, data/thread
5x5-bw-thread, 528.818, GB, data-total
5x5-bw-thread, 0.959, nsecs, runtime/byte/thread
5x5-bw-thread, 1.043, GB/sec, thread-speed
5x5-bw-thread, 26.081, GB/sec, total-speed
# Running 2x16-bw-thread, "perf bench numa mem -p 2 -t 16 -P 512 -s 20 -zZ0q --thp 1"
2x16-bw-thread, 20.465, secs, runtime-max/thread
2x16-bw-thread, 20.004, secs, runtime-min/thread
2x16-bw-thread, 20.284, secs, runtime-avg/thread
2x16-bw-thread, 1.127, %, spread-runtime/thread
2x16-bw-thread, 14.881, GB, data/thread
2x16-bw-thread, 476.204, GB, data-total
2x16-bw-thread, 1.375, nsecs, runtime/byte/thread
2x16-bw-thread, 0.727, GB/sec, thread-speed
2x16-bw-thread, 23.269, GB/sec, total-speed
# Running 1x32-bw-thread, "perf bench numa mem -p 1 -t 32 -P 2048 -s 20 -zZ0q --thp 1"
1x32-bw-thread, 21.944, secs, runtime-max/thread
1x32-bw-thread, 20.031, secs, runtime-min/thread
1x32-bw-thread, 20.878, secs, runtime-avg/thread
1x32-bw-thread, 4.358, %, spread-runtime/thread
1x32-bw-thread, 13.019, GB, data/thread
1x32-bw-thread, 416.612, GB, data-total
1x32-bw-thread, 1.686, nsecs, runtime/byte/thread
1x32-bw-thread, 0.593, GB/sec, thread-speed
1x32-bw-thread, 18.985, GB/sec, total-speed
# Running numa02-bw, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp 1"
numa02-bw, 20.000, secs, runtime-max/thread
numa02-bw, 19.967, secs, runtime-min/thread
numa02-bw, 19.994, secs, runtime-avg/thread
numa02-bw, 0.081, %, spread-runtime/thread
numa02-bw, 19.644, GB, data/thread
numa02-bw, 628.609, GB, data-total
numa02-bw, 1.018, nsecs, runtime/byte/thread
numa02-bw, 0.982, GB/sec, thread-speed
numa02-bw, 31.431, GB/sec, total-speed
# Running numa02-bw-NOTHP, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp 1 --thp -1"
numa02-bw-NOTHP, 20.062, secs, runtime-max/thread
numa02-bw-NOTHP, 19.940, secs, runtime-min/thread
numa02-bw-NOTHP, 19.988, secs, runtime-avg/thread
numa02-bw-NOTHP, 0.304, %, spread-runtime/thread
numa02-bw-NOTHP, 18.246, GB, data/thread
numa02-bw-NOTHP, 583.881, GB, data-total
numa02-bw-NOTHP, 1.100, nsecs, runtime/byte/thread
numa02-bw-NOTHP, 0.909, GB/sec, thread-speed
numa02-bw-NOTHP, 29.104, GB/sec, total-speed
# Running numa01-bw-thread, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp 1"
numa01-bw-thread, 20.106, secs, runtime-max/thread
numa01-bw-thread, 19.989, secs, runtime-min/thread
numa01-bw-thread, 20.052, secs, runtime-avg/thread
numa01-bw-thread, 0.293, %, spread-runtime/thread
numa01-bw-thread, 17.975, GB, data/thread
numa01-bw-thread, 575.190, GB, data-total
numa01-bw-thread, 1.119, nsecs, runtime/byte/thread
numa01-bw-thread, 0.894, GB/sec, thread-speed
numa01-bw-thread, 28.607, GB/sec, total-speed
# Running numa01-bw-thread-NOTHP, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp 1 --thp -1"
numa01-bw-thread-NOTHP, 20.391, secs, runtime-max/thread
numa01-bw-thread-NOTHP, 20.010, secs, runtime-min/thread
numa01-bw-thread-NOTHP, 20.085, secs, runtime-avg/thread
numa01-bw-thread-NOTHP, 0.936, %, spread-runtime/thread
numa01-bw-thread-NOTHP, 13.457, GB, data/thread
numa01-bw-thread-NOTHP, 430.638, GB, data-total
numa01-bw-thread-NOTHP, 1.515, nsecs, runtime/byte/thread
numa01-bw-thread-NOTHP, 0.660, GB/sec, thread-speed
numa01-bw-thread-NOTHP, 21.119, GB/sec, total-speed
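#
# Note: the two runs above (3.6.0+ and 3.7.0-rc8+) emit the same comma-separated
# "test, value, unit, metric" rows, so they can be compared mechanically. Below is
# a minimal post-processing sketch (not part of perf itself) that keys on the
# "GB/sec, total-speed" rows of two such logs; the script and log file names are
# assumptions, adjust to wherever the output was saved.

  #!/usr/bin/env python
  # compare-numa.py (hypothetical helper, not shipped with perf):
  # read two 'perf bench numa mem -a' logs and print total-speed side by side.
  import sys

  def total_speeds(path):
      speeds = {}
      with open(path) as f:
          for line in f:
              fields = [x.strip() for x in line.split(',')]
              # Data rows look like: "5x5-bw-thread, 26.081, GB/sec, total-speed"
              if len(fields) == 4 and fields[3] == 'total-speed':
                  speeds[fields[0]] = float(fields[1])
      return speeds

  old = total_speeds(sys.argv[1])   # e.g. log of the 3.6.0+ run
  new = total_speeds(sys.argv[2])   # e.g. log of the 3.7.0-rc8+ run

  for test in old:
      if test in new:
          delta = 100.0 * (new[test] - old[test]) / old[test]
          print("%-28s %8.3f -> %8.3f GB/sec  (%+6.1f%%)" % (test, old[test], new[test], delta))

# Invoked as e.g. 'python compare-numa.py numa-3.6.log numa-3.7.log' it would print
# one line per test with the relative change; this is only a reading aid and has no
# effect on the benchmark runs themselves.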