From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Hillf Danton <dhillf@gmail.com>, Dan Smith <danms@us.ibm.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu>,
Paul Turner <pjt@google.com>,
Suresh Siddha <suresh.b.siddha@intel.com>,
Mike Galbraith <efault@gmx.de>,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
Lai Jiangshan <laijs@cn.fujitsu.com>,
Bharata B Rao <bharata.rao@gmail.com>,
Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
Rik van Riel <riel@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>,
Christoph Lameter <cl@linux.com>, Alex Shi <alex.shi@intel.com>,
Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>,
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
Don Morris <don.morris@hp.com>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>
Subject: Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
Date: Thu, 28 Jun 2012 16:46:16 +0200
Message-ID: <1340894776.28750.44.camel@twins>
In-Reply-To: <1340888180-15355-14-git-send-email-aarcange@redhat.com>
On Thu, 2012-06-28 at 14:55 +0200, Andrea Arcangeli wrote:
> +/*
> + * This function sched_autonuma_balance() is responsible for deciding
> + * which is the best CPU each process should be running on according
> + * to the NUMA statistics collected in mm->mm_autonuma and
> + * tsk->task_autonuma.
> + *
> + * The core math that evaluates the current CPU against the CPUs of
> + * all _other_ nodes is this:
> + *
> + * if (w_nid > w_other && w_nid > w_cpu_nid)
> + * weight = w_nid - w_other + w_nid - w_cpu_nid;
> + *
> + * w_nid: NUMA affinity of the current thread/process if run on the
> + * other CPU.
> + *
> + * w_other: NUMA affinity of the other thread/process if run on the
> + * other CPU.
> + *
> + * w_cpu_nid: NUMA affinity of the current thread/process if run on
> + * the current CPU.
> + *
> + * weight: combined NUMA affinity benefit in moving the current
> + * thread/process to the other CPU taking into account both the higher
> + * NUMA affinity of the current process if run on the other CPU, and
> + * the increase in NUMA affinity in the other CPU by replacing the
> + * other process.
A lot of words, all meaningless without a proper definition of the w_*
values. How are they calculated, and why?
> + * We run the above math on every CPU not part of the current NUMA
> + * node, and we compare the current process against the other
> + * processes running in the other CPUs in the remote NUMA nodes. The
> + * objective is to select the cpu (in selected_cpu) with a bigger
> + * "weight". The bigger the "weight" the biggest gain we'll get by
> + * moving the current process to the selected_cpu (not only the
> + * biggest immediate CPU gain but also the fewer async memory
> + * migrations that will be required to reach full convergence
> + * later). If we select a cpu we migrate the current process to it.
So you do something like:
max_(i, node(i) != curr_node) { weight_i }
That is, you have this weight, then what do you do?
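For concreteness, the selection described in the quoted comment can be read as an argmax over the CPUs outside the current node. Below is a minimal sketch in C under that reading; it is not the patch's code. The names struct cpu_sample, select_cpu() and the per-node w_nid[] array are hypothetical, and since the comment does not define how the w_* values are computed, they are simply taken as given inputs here.

```c
#include <assert.h>

/* Hypothetical per-CPU snapshot; none of these names come from the patch. */
struct cpu_sample {
	int nid;      /* NUMA node this CPU belongs to */
	long w_other; /* affinity of the task running there for that CPU's node */
};

/*
 * Return the CPU with the largest weight among CPUs outside cur_nid,
 * or -1 if no CPU passes the (w_nid > w_other && w_nid > w_cpu_nid) test.
 * w_nid[n] is the current task's affinity for node n (assumed given).
 */
static int select_cpu(const struct cpu_sample *cpus, int nr_cpus,
		      const long *w_nid, int cur_nid)
{
	long w_cpu_nid, best_weight = 0;
	int selected_cpu = -1;

	assert(cur_nid >= 0);
	w_cpu_nid = w_nid[cur_nid];

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		int nid = cpus[cpu].nid;
		long w_other = cpus[cpu].w_other;
		long weight;

		if (nid == cur_nid)
			continue; /* only CPUs of _other_ nodes are evaluated */
		if (!(w_nid[nid] > w_other && w_nid[nid] > w_cpu_nid))
			continue; /* the conditional from the quoted comment */

		weight = (w_nid[nid] - w_other) + (w_nid[nid] - w_cpu_nid);
		if (weight > best_weight) {
			best_weight = weight;
			selected_cpu = cpu;
		}
	}
	return selected_cpu;
}
```

With affinities {20, 60} for the current task on node 0, and remote tasks with affinities 30 and 70 on node 1, only the first remote CPU passes the conditional, with weight (60 - 30) + (60 - 20) = 70. What happens after the selection (who moves the displaced task, and when) is not captured by this sketch.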
> + * Checking that the current process has higher NUMA affinity than the
> + * other process on the other CPU (w_nid > w_other) and not only that
> + * the current process has higher NUMA affinity on the other CPU than
> + * on the current CPU (w_nid > w_cpu_nid) completely avoids ping pongs
> + * and ensures (temporary) convergence of the algorithm (at least from
> + * a CPU standpoint).
How does that follow?
> + * It's then up to the idle balancing code that will run as soon as
> + * the current CPU goes idle to pick the other process and move it
> + * here (or in some other idle CPU if any).
> + *
> + * By only evaluating running processes against running processes we
> + * avoid interfering with the CFS stock active idle balancing, which
> + * is critical to optimal performance with HT enabled. (getting HT
> + * wrong is worse than running on remote memory so the active idle
> + * balancing has priority)
what?
> + * Idle balancing and all other CFS load balancing become NUMA
> + * affinity aware through the introduction of
> + * sched_autonuma_can_migrate_task(). CFS searches CPUs in the task's
> + * autonuma_node first when it needs to find idle CPUs during idle
> + * balancing or tasks to pick during load balancing.
You talk a lot about idle balance, but there's zero mention of fairness.
This is worrisome.
> + * The task's autonuma_node is the node selected by
> + * sched_autonuma_balance() when it migrates a task to the
> + * selected_cpu in the selected_nid
I think I already said that strict was out of the question and hard
movement like that simply didn't make sense.
> + * Once a process/thread has been moved to another node, closer to the
> + * much of memory it has recently accessed,
closer to the recently accessed memory you mean?
> any memory for that task
> + * not in the new node moves slowly (asynchronously in the background)
> + * to the new node. This is done by the knuma_migratedN (where the
> + * suffix N is the node id) daemon described in mm/autonuma.c.
> + *
> + * One non trivial bit of this logic that deserves an explanation is
> + * how the three crucial variables of the core math
> + * (w_nid/w_other/w_cpu_nid) are going to change depending on whether
> + * the other CPU is running a thread of the current process, or a
> + * thread of a different process.
No no no.. it's not a friggin detail, it's absolutely crucial. Also, if
you'd given a proper definition you wouldn't need to hand-wave your way
around the dynamics either, because that would simply follow from the
definition.
<snip terrible example>
> + * Before scanning all other CPUs' runqueues to compute the above
> + * math,
OK, let's stop calling the one isolated conditional you mentioned 'math'.
On its own it's useless.
> we also verify that the current CPU is not already in the
> + * preferred NUMA node from the point of view of both the process
> + * statistics and the thread statistics. In such case we can return to
> + * the caller without having to check any other CPUs' runqueues
> + * because full convergence has been already reached.
Things being in the 'preferred' place don't have much to do with
convergence. Does your model have local minima/maxima where it can get
stuck, or does it always find a global min/max?
> + * This algorithm might be expanded to take all runnable processes
> + * into account but examining just the currently running processes is
> + * a good enough approximation because some runnable processes may run
> + * only for a short time so statistically there will always be a bias
> + * on the processes that use most of the CPU. This is ideal
> + * because it doesn't matter if NUMA balancing isn't optimal for
> + * processes that run only for a short time.
Almost, but not quite.. it would be so if the sampling could be proven
to be unbiased. But it's quite possible for a task to consume most CPU
time and never show up as the current task in your load-balance run.
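Such an aliasing case is easy to construct. The sketch below is a toy model, not the scheduler: it assumes a hypothetical fixed schedule in which task B runs the first tick of every 'period' ticks and task A runs the rest, and a balance pass that looks at the current task once every 'interval' ticks. All names are made up for illustration.

```c
#include <assert.h>

/*
 * Hypothetical schedule: within every window of `period` ticks, task B
 * runs during the first tick and task A runs the remaining ticks.
 */
static char task_at(long tick, long period)
{
	assert(period > 0);
	return (tick % period == 0) ? 'B' : 'A';
}

/*
 * Count how many balance passes find task A as the current task.
 * Passes fire every `interval` ticks, starting at tick `phase`.
 */
static long passes_seeing_a(long nr_passes, long interval, long phase,
			    long period)
{
	long hits = 0;

	for (long i = 0; i < nr_passes; i++)
		if (task_at(phase + i * interval, period) == 'A')
			hits++;
	return hits;
}
```

With period = 10 and interval = 10 starting at tick 0, task A holds the CPU 90% of the time yet no pass ever sees it as current. Shifting the phase to 3 makes every pass see A, and an interval of 7 gives 900 out of 1000, so what the sampler observes depends on how it lines up with the schedule rather than on CPU usage alone.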
As it stands you wrote a lot of words.. but none of them were really
helpful in understanding what you do.