From: Nai Xia <nai.xia@gmail.com>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Hillf Danton <dhillf@gmail.com>, Dan Smith <danms@us.ibm.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu>,
Paul Turner <pjt@google.com>,
Suresh Siddha <suresh.b.siddha@intel.com>,
Mike Galbraith <efault@gmx.de>,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
Lai Jiangshan <laijs@cn.fujitsu.com>,
Bharata B Rao <bharata.rao@gmail.com>,
Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
Rik van Riel <riel@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>,
Christoph Lameter <cl@linux.com>, Alex Shi <alex.shi@intel.com>,
Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>,
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
Don Morris <don.morris@hp.com>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>
Subject: Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
Date: Fri, 29 Jun 2012 22:11:35 +0800
Message-ID: <4FEDB797.3050804@gmail.com>
In-Reply-To: <1340894776.28750.44.camel@twins>
On 2012-06-28 22:46, Peter Zijlstra wrote:
> On Thu, 2012-06-28 at 14:55 +0200, Andrea Arcangeli wrote:
>> +/*
>> + * This function sched_autonuma_balance() is responsible for deciding
>> + * which is the best CPU each process should be running on according
>> + * to the NUMA statistics collected in mm->mm_autonuma and
>> + * tsk->task_autonuma.
>> + *
>> + * The core math that evaluates the current CPU against the CPUs of
>> + * all _other_ nodes is this:
>> + *
>> + * if (w_nid > w_other && w_nid > w_cpu_nid)
>> + * weight = w_nid - w_other + w_nid - w_cpu_nid;
>> + *
>> + * w_nid: NUMA affinity of the current thread/process if run on the
>> + * other CPU.
>> + *
>> + * w_other: NUMA affinity of the other thread/process if run on the
>> + * other CPU.
>> + *
>> + * w_cpu_nid: NUMA affinity of the current thread/process if run on
>> + * the current CPU.
>> + *
>> + * weight: combined NUMA affinity benefit in moving the current
>> + * thread/process to the other CPU taking into account both the higher
>> + * NUMA affinity of the current process if run on the other CPU, and
>> + * the increase in NUMA affinity in the other CPU by replacing the
>> + * other process.
>
> A lot of words, all meaningless without a proper definition of w_*
> stuff. How are they calculated and why.
>
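For reference, the quoted conditional can be written out as a small function. This is only a sketch: the name autonuma_weight is made up, the three w_* inputs are taken as given (the comment does not say how they are computed, which is exactly the objection above), and returning 0 to mean "no move" is an assumption.

```c
/*
 * Hypothetical helper, not taken from the patch: evaluates the quoted
 * conditional for one candidate CPU on a remote node.
 *
 * w_nid:     affinity of the current task if run on the other CPU
 * w_other:   affinity of the task now running on the other CPU
 * w_cpu_nid: affinity of the current task on the current CPU
 *
 * Returns the combined gain, or 0 (assumed convention) when the move
 * is rejected.
 */
long autonuma_weight(long w_nid, long w_other, long w_cpu_nid)
{
	if (w_nid > w_other && w_nid > w_cpu_nid)
		return (w_nid - w_other) + (w_nid - w_cpu_nid);
	return 0;
}
```

With w_nid = 70, w_other = 20 and w_cpu_nid = 30 this gives (70 - 20) + (70 - 30) = 90. If either comparison fails, the candidate is not considered at all.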
>> + * We run the above math on every CPU not part of the current NUMA
>> + * node, and we compare the current process against the other
>> + * processes running in the other CPUs in the remote NUMA nodes. The
>> + * objective is to select the cpu (in selected_cpu) with a bigger
>> + * "weight". The bigger the "weight" the biggest gain we'll get by
>> + * moving the current process to the selected_cpu (not only the
>> + * biggest immediate CPU gain but also the fewer async memory
>> + * migrations that will be required to reach full convergence
>> + * later). If we select a cpu we migrate the current process to it.
>
> So you do something like:
>
> max_(i, node(i) != curr_node) { weight_i }
>
> That is, you have this weight, then what do you do?
>
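The max over remote CPUs that Peter writes above could be sketched like this. The function name, the array layout and the "-1 means stay put" convention are assumptions for illustration. The per-CPU weights are taken as precomputed inputs, and nothing here says what the patch does beyond migrating to the selected CPU.

```c
/*
 * Hypothetical sketch: pick the CPU with the largest positive weight
 * among CPUs that are not on the current node.
 *
 * weight[i]: precomputed weight for moving the current task to CPU i
 * node[i]:   NUMA node id of CPU i
 *
 * Returns the selected CPU index, or -1 if no remote CPU has a
 * weight greater than zero.
 */
int select_remote_cpu(const long *weight, const int *node,
		      int ncpu, int curr_node)
{
	int selected_cpu = -1;
	long best = 0;

	for (int i = 0; i < ncpu; i++) {
		if (node[i] == curr_node)
			continue;	/* only CPUs of other nodes are evaluated */
		if (weight[i] > best) {
			best = weight[i];
			selected_cpu = i;
		}
	}
	return selected_cpu;
}
```

Note that this is a greedy choice made from the point of view of a single task. It says nothing by itself about what happens to the task that was running on the selected CPU.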
>> + * Checking that the current process has higher NUMA affinity than the
>> + * other process on the other CPU (w_nid > w_other) and not only that
>> + * the current process has higher NUMA affinity on the other CPU than
>> + * on the current CPU (w_nid > w_cpu_nid) completely avoids ping pongs
>> + * and ensures (temporary) convergence of the algorithm (at least from
>> + * a CPU standpoint).
>
> How does that follow?
>
>> + * It's then up to the idle balancing code that will run as soon as
>> + * the current CPU goes idle to pick the other process and move it
>> + * here (or in some other idle CPU if any).
>> + *
>> + * By only evaluating running processes against running processes we
>> + * avoid interfering with the CFS stock active idle balancing, which
>> + * is critical to optimal performance with HT enabled. (getting HT
>> + * wrong is worse than running on remote memory so the active idle
>> + * balancing has priority)
>
> what?
>
>> + * Idle balancing and all other CFS load balancing become NUMA
>> + * affinity aware through the introduction of
>> + * sched_autonuma_can_migrate_task(). CFS searches CPUs in the task's
>> + * autonuma_node first when it needs to find idle CPUs during idle
>> + * balancing or tasks to pick during load balancing.
>
> You talk a lot about idle balance, but there's zero mention of fairness.
> This is worrisome.
>
>> + * The task's autonuma_node is the node selected by
>> + * sched_autonuma_balance() when it migrates a task to the
>> + * selected_cpu in the selected_nid
>
> I think I already said that strict was out of the question and hard
> movement like that simply didn't make sense.
>
>> + * Once a process/thread has been moved to another node, closer to the
>> + * much of memory it has recently accessed,
>
> closer to the recently accessed memory you mean?
>
>> any memory for that task
>> + * not in the new node moves slowly (asynchronously in the background)
>> + * to the new node. This is done by the knuma_migratedN (where the
>> + * suffix N is the node id) daemon described in mm/autonuma.c.
>> + *
>> + * One non trivial bit of this logic that deserves an explanation is
>> + * how the three crucial variables of the core math
>> + * (w_nid/w_other/w_cpu_nid) are going to change depending on whether
>> + * the other CPU is running a thread of the current process, or a
>> + * thread of a different process.
>
> No no no,.. it's not a friggin detail, it's absolutely crucial. Also, if
> you'd given proper definition you wouldn't need to hand wave your way
> around the dynamics either because that would simply follow from the
> definition.
>
> <snip terrible example>
>
>> + * Before scanning all other CPUs' runqueues to compute the above
>> + * math,
>
> OK, let's stop calling the one isolated conditional you mentioned 'math'.
> On its own it's useless.
>
>> we also verify that the current CPU is not already in the
>> + * preferred NUMA node from the point of view of both the process
>> + * statistics and the thread statistics. In such case we can return to
>> + * the caller without having to check any other CPUs' runqueues
>> + * because full convergence has been already reached.
>
> Things being in the 'preferred' place don't have much to do with
> convergence. Does your model have local minima/maxima where it can get
> stuck, or does it always find a global min/max?
>
>
>> + * This algorithm might be expanded to take all runnable processes
>> + * into account but examining just the currently running processes is
>> + * a good enough approximation because some runnable processes may run
>> + * only for a short time so statistically there will always be a bias
>> + * on the processes that use most of the CPU. This is ideal
>> + * because it doesn't matter if NUMA balancing isn't optimal for
>> + * processes that run only for a short time.
>
> Almost, but not quite.. it would be so if the sampling could be proven
> to be unbiased. But it's quite possible for a task to consume most cpu
> time and never show up as the current task in your load-balance run.
Same here, I have a similar question regarding sampling:
If a process accesses a small set of pages on this node very intensively,
but only occasionally touches a large set of pages on another node,
will this algorithm make a very bad judgment? I guess the answer would
be: it's possible, and the judgment depends on how the process races
with your knuma_scand.
Usually, when we use sampling, we assume that if the sampling is
inaccurate we only lose the chance of a better optimization, but do
NOT make a bad/false judgment.
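To make the concern concrete, here is a toy model. It assumes a scan pass records at most one sample per touched page, no matter how often that page is accessed. Whether knuma_scand really behaves this way is an assumption, and the struct and function names are invented for illustration.

```c
/*
 * Hypothetical model, not AutoNUMA code: compare a per-page sample
 * count with the real number of accesses for one node.
 */
struct node_stats {
	unsigned long pages;             /* distinct pages touched on the node */
	unsigned long accesses_per_page; /* real accesses to each of those pages */
};

/* Assumed sampling rule: each touched page counts once per scan pass. */
unsigned long sampled_weight(struct node_stats s)
{
	return s.accesses_per_page ? s.pages : 0;
}

/* What the task really does: total accesses landing on the node. */
unsigned long true_accesses(struct node_stats s)
{
	return s.pages * s.accesses_per_page;
}
```

Take 10 pages on this node hit 1000000 times each, and 1000 pages on the other node hit once each. The samples say 10 versus 1000 in favor of the other node, while the real accesses are 10000000 versus 1000 in favor of this one. Under that assumed rule the inaccuracy does not just miss an optimization, it points the wrong way.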
Andrea, sorry, I don't have enough time to look into the details of all
your patches (and also I'm not on the CCs ;-) ), but my intuition
tells me that your current sampling and weight algorithm is far from
optimal.
>
>
>
> As it stands you wrote a lot of words.. but none of them were really
> helpful in understanding what you do.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>