Re: [PATCH 13/40] autonuma: CPU follow memory algorithm

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Nai Xia <nai.xia@gmail.com>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Hillf Danton <dhillf@gmail.com>, Dan Smith <danms@us.ibm.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu>,
	Paul Turner <pjt@google.com>,
	Suresh Siddha <suresh.b.siddha@intel.com>,
	Mike Galbraith <efault@gmx.de>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Lai Jiangshan <laijs@cn.fujitsu.com>,
	Bharata B Rao <bharata.rao@gmail.com>,
	Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
	Rik van Riel <riel@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>,
	Christoph Lameter <cl@linux.com>, Alex Shi <alex.shi@intel.com>,
	Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	Don Morris <don.morris@hp.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>
Subject: Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
Date: Fri, 29 Jun 2012 22:11:35 +0800	[thread overview]
Message-ID: <4FEDB797.3050804@gmail.com> (raw)
In-Reply-To: <1340894776.28750.44.camel@twins>



On 2012a1'06ae??28ae?JPY 22:46, Peter Zijlstra wrote:
> On Thu, 2012-06-28 at 14:55 +0200, Andrea Arcangeli wrote:
>> +/*
>> + * This function sched_autonuma_balance() is responsible for deciding
>> + * which is the best CPU each process should be running on according
>> + * to the NUMA statistics collected in mm->mm_autonuma and
>> + * tsk->task_autonuma.
>> + *
>> + * The core math that evaluates the current CPU against the CPUs of
>> + * all _other_ nodes is this:
>> + *
>> + *     if (w_nid>  w_other&&  w_nid>  w_cpu_nid)
>> + *             weight = w_nid - w_other + w_nid - w_cpu_nid;
>> + *
>> + * w_nid: NUMA affinity of the current thread/process if run on the
>> + * other CPU.
>> + *
>> + * w_other: NUMA affinity of the other thread/process if run on the
>> + * other CPU.
>> + *
>> + * w_cpu_nid: NUMA affinity of the current thread/process if run on
>> + * the current CPU.
>> + *
>> + * weight: combined NUMA affinity benefit in moving the current
>> + * thread/process to the other CPU taking into account both the
>> higher
>> + * NUMA affinity of the current process if run on the other CPU, and
>> + * the increase in NUMA affinity in the other CPU by replacing the
>> + * other process.
>
> A lot of words, all meaningless without a proper definition of w_*
> stuff. How are they calculated and why.
>
>> + * We run the above math on every CPU not part of the current NUMA
>> + * node, and we compare the current process against the other
>> + * processes running in the other CPUs in the remote NUMA nodes. The
>> + * objective is to select the cpu (in selected_cpu) with a bigger
>> + * "weight". The bigger the "weight" the biggest gain we'll get by
>> + * moving the current process to the selected_cpu (not only the
>> + * biggest immediate CPU gain but also the fewer async memory
>> + * migrations that will be required to reach full convergence
>> + * later). If we select a cpu we migrate the current process to it.
>
> So you do something like:
>
> 	max_(i, node(i) != curr_node) { weight_i }
>
> That is, you have this weight, then what do you do?
>
>> + * Checking that the current process has higher NUMA affinity than
>> the
>> + * other process on the other CPU (w_nid>  w_other) and not only that
>> + * the current process has higher NUMA affinity on the other CPU than
>> + * on the current CPU (w_nid>  w_cpu_nid) completely avoids ping
>> pongs
>> + * and ensures (temporary) convergence of the algorithm (at least
>> from
>> + * a CPU standpoint).
>
> How does that follow?
>
>> + * It's then up to the idle balancing code that will run as soon as
>> + * the current CPU goes idle to pick the other process and move it
>> + * here (or in some other idle CPU if any).
>> + *
>> + * By only evaluating running processes against running processes we
>> + * avoid interfering with the CFS stock active idle balancing, which
>> + * is critical to optimal performance with HT enabled. (getting HT
>> + * wrong is worse than running on remote memory so the active idle
>> + * balancing has priority)
>
> what?
>
>> + * Idle balancing and all other CFS load balancing become NUMA
>> + * affinity aware through the introduction of
>> + * sched_autonuma_can_migrate_task(). CFS searches CPUs in the task's
>> + * autonuma_node first when it needs to find idle CPUs during idle
>> + * balancing or tasks to pick during load balancing.
>
> You talk a lot about idle balance, but there's zero mention of fairness.
> This is worrysome.
>
>> + * The task's autonuma_node is the node selected by
>> + * sched_autonuma_balance() when it migrates a task to the
>> + * selected_cpu in the selected_nid
>
> I think I already said that strict was out of the question and hard
> movement like that simply didn't make sense.
>
>> + * Once a process/thread has been moved to another node, closer to
>> the
>> + * much of memory it has recently accessed,
>
> closer to the recently accessed memory you mean?
>
>>   any memory for that task
>> + * not in the new node moves slowly (asynchronously in the
>> background)
>> + * to the new node. This is done by the knuma_migratedN (where the
>> + * suffix N is the node id) daemon described in mm/autonuma.c.
>> + *
>> + * One non trivial bit of this logic that deserves an explanation is
>> + * how the three crucial variables of the core math
>> + * (w_nid/w_other/wcpu_nid) are going to change depending on whether
>> + * the other CPU is running a thread of the current process, or a
>> + * thread of a different process.
>
> No no no,.. its not a friggin detail, its absolutely crucial. Also, if
> you'd given proper definition you wouldn't need to hand wave your way
> around the dynamics either because that would simply follow from the
> definition.
>
> <snip terrible example>
>
>> + * Before scanning all other CPUs' runqueues to compute the above
>> + * math,
>
> OK, lets stop calling the one isolated conditional you mentioned 'math'.
> On its own its useless.
>
>>   we also verify that the current CPU is not already in the
>> + * preferred NUMA node from the point of view of both the process
>> + * statistics and the thread statistics. In such case we can return
>> to
>> + * the caller without having to check any other CPUs' runqueues
>> + * because full convergence has been already reached.
>
> Things being in the 'preferred' place don't have much to do with
> convergence. Does your model have local minima/maxima where it can get
> stuck, or does it always find a global min/max?
>
>
>> + * This algorithm might be expanded to take all runnable processes
>> + * into account but examining just the currently running processes is
>> + * a good enough approximation because some runnable processes may
>> run
>> + * only for a short time so statistically there will always be a bias
>> + * on the processes that uses most the of the CPU. This is ideal
>> + * because it doesn't matter if NUMA balancing isn't optimal for
>> + * processes that run only for a short time.
>
> Almost, but not quite.. it would be so if the sampling could be proven
> to be unbiased. But its quite possible for a task to consume most cpu
> time and never show up as the current task in your load-balance run.

Same here, I have another similar question regarding sampling:
If one process do very intensive visit of a small set of pages in this
node, but occasional visit of a large set of pages in another node.
Will this algorithm do a very bad judgment? I guess the answer would
be: it's possible and this judgment depends on the racing pattern
between the process and your knuma_scand.

Usually, if we are using sampling, we are on the assumption that if
this sampling would not be accurate, we only lose chance to
better optimization, but NOT to do bad/false judgment.

Andrea, sorry, I don't have enough time to look into all your patches
details(and also since I'm not on the CCs ;-) ),
But my intuition tells me that your current sampling and weight
algorithm is far from optimal.

>
>
>
> As it stands you wrote a lot of words.. but none of them were really
> helpful in understanding what you do.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email:<a href=ilto:"dont@kvack.org">  email@kvack.org</a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2012-06-29 14:11 UTC|newest]

Thread overview: 177+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-06-28 12:55 [PATCH 00/40] AutoNUMA19 Andrea Arcangeli
2012-06-28 12:55 ` [PATCH 01/40] mm: add unlikely to the mm allocation failure check Andrea Arcangeli
2012-06-29 14:10   ` Rik van Riel
2012-06-28 12:55 ` [PATCH 02/40] autonuma: make set_pmd_at always available Andrea Arcangeli
2012-06-29 14:10   ` Rik van Riel
2012-06-28 12:55 ` [PATCH 03/40] autonuma: export is_vma_temporary_stack() even if CONFIG_TRANSPARENT_HUGEPAGE=n Andrea Arcangeli
2012-06-29 14:11   ` Rik van Riel
2012-06-28 12:55 ` [PATCH 04/40] xen: document Xen is using an unused bit for the pagetables Andrea Arcangeli
2012-06-29 14:16   ` Rik van Riel
2012-07-04 23:05     ` Andrea Arcangeli
2012-06-30  4:47   ` Konrad Rzeszutek Wilk
2012-07-03 10:45     ` Andrea Arcangeli
2012-06-28 12:55 ` [PATCH 05/40] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD Andrea Arcangeli
2012-06-28 15:13   ` Don Morris
2012-06-28 15:00     ` Andrea Arcangeli
2012-06-29 14:26   ` Rik van Riel
2012-07-03 20:30     ` Andrea Arcangeli
2012-06-28 12:55 ` [PATCH 06/40] autonuma: x86 pte_numa() and pmd_numa() Andrea Arcangeli
2012-06-29 15:02   ` Rik van Riel
2012-07-04 23:03     ` Andrea Arcangeli
2012-06-28 12:55 ` [PATCH 07/40] autonuma: generic " Andrea Arcangeli
2012-06-29 15:13   ` Rik van Riel
2012-06-28 12:55 ` [PATCH 08/40] autonuma: teach gup_fast about pte_numa Andrea Arcangeli
2012-06-29 15:27   ` Rik van Riel
2012-06-28 12:55 ` [PATCH 09/40] autonuma: introduce kthread_bind_node() Andrea Arcangeli
2012-06-29 15:36   ` Rik van Riel
2012-06-29 16:04     ` Peter Zijlstra
2012-06-29 16:11       ` Rik van Riel
2012-06-29 16:38     ` Andrea Arcangeli
2012-06-29 16:58       ` Rik van Riel
2012-07-05 13:09         ` Johannes Weiner
2012-07-05 18:33           ` Glauber Costa
2012-07-05 20:07             ` Andrea Arcangeli
2012-06-30  4:50   ` Konrad Rzeszutek Wilk
2012-07-04 23:14     ` Andrea Arcangeli
2012-07-05 12:04       ` Konrad Rzeszutek Wilk
2012-07-05 12:28         ` Andrea Arcangeli
2012-07-05 12:18       ` Peter Zijlstra
2012-07-05 12:21         ` Andrea Arcangeli
2012-06-28 12:55 ` [PATCH 10/40] autonuma: mm_autonuma and sched_autonuma data structures Andrea Arcangeli
2012-06-29 15:47   ` Rik van Riel
2012-06-29 17:45   ` Rik van Riel
2012-07-04 23:16     ` Andrea Arcangeli
2012-06-28 12:55 ` [PATCH 11/40] autonuma: define the autonuma flags Andrea Arcangeli
2012-06-29 16:10   ` Rik van Riel
2012-06-30  4:58   ` Konrad Rzeszutek Wilk
2012-07-02 15:42     ` Konrad Rzeszutek Wilk
2012-06-30  5:01   ` Konrad Rzeszutek Wilk
2012-07-04 23:45     ` Andrea Arcangeli
2012-06-28 12:55 ` [PATCH 12/40] autonuma: core autonuma.h header Andrea Arcangeli
2012-06-28 12:55 ` [PATCH 13/40] autonuma: CPU follow memory algorithm Andrea Arcangeli
2012-06-28 14:46   ` Peter Zijlstra
2012-06-29 14:11     ` Nai Xia [this message]
2012-06-29 16:30       ` Andrea Arcangeli
2012-06-29 18:09         ` Nai Xia
2012-06-29 21:02         ` Nai Xia
2012-07-03 11:53     ` Peter Zijlstra
2012-06-28 14:53   ` Peter Zijlstra
2012-06-29 12:16     ` Hillf Danton
2012-06-29 12:55       ` Ingo Molnar
2012-06-29 16:51         ` Dor Laor
2012-06-29 18:41           ` Peter Zijlstra
2012-06-29 18:46             ` Rik van Riel
2012-06-29 18:51               ` Peter Zijlstra
2012-06-29 18:57               ` Peter Zijlstra
2012-06-29 19:03                 ` Peter Zijlstra
2012-06-29 19:19                   ` Rik van Riel
2012-07-02 16:57                     ` Vaidyanathan Srinivasan
2012-07-05 16:56                       ` Vaidyanathan Srinivasan
2012-07-06 13:04                         ` Hillf Danton
2012-07-06 18:38                           ` Vaidyanathan Srinivasan
2012-07-12 13:12                             ` Andrea Arcangeli
2012-06-29 18:49           ` Peter Zijlstra
2012-06-29 18:53           ` Peter Zijlstra
2012-06-29 20:01             ` Nai Xia
2012-06-29 20:44               ` Nai Xia
2012-06-30  1:23               ` Andrea Arcangeli
2012-06-30  2:43                 ` Nai Xia
2012-06-30  5:48                   ` Dor Laor
2012-06-30  6:58                     ` Nai Xia
2012-06-30 13:04                       ` Andrea Arcangeli
2012-06-30 15:19                         ` Nai Xia
2012-06-30 19:37                       ` Dor Laor
2012-07-01  2:41                         ` Nai Xia
2012-06-30 23:55                       ` Benjamin Herrenschmidt
2012-07-01  3:10                         ` Nai Xia
2012-06-30  8:23                     ` Nai Xia
2012-07-02  7:29                       ` Rik van Riel
2012-07-02  7:43                         ` Nai Xia
2012-06-30 12:48                   ` Andrea Arcangeli
2012-06-30 15:10                     ` Nai Xia
2012-07-02  7:36                       ` Rik van Riel
2012-07-02  7:56                         ` Nai Xia
2012-07-02  8:17                           ` Rik van Riel
2012-07-02  8:31                             ` Nai Xia
2012-07-05 18:07               ` Rik van Riel
2012-07-05 22:59                 ` Andrea Arcangeli
2012-07-06  1:00                 ` Nai Xia
2012-06-29 19:04           ` Peter Zijlstra
2012-06-29 20:27             ` Nai Xia
2012-06-29 18:03   ` Rik van Riel
2012-06-28 12:55 ` [PATCH 14/40] autonuma: add page structure fields Andrea Arcangeli
2012-06-29 18:06   ` Rik van Riel
2012-06-28 12:55 ` [PATCH 15/40] autonuma: knuma_migrated per NUMA node queues Andrea Arcangeli
2012-06-29 18:31   ` Rik van Riel
2012-06-28 12:55 ` [PATCH 16/40] autonuma: init knuma_migrated queues Andrea Arcangeli
2012-06-29 18:35   ` Rik van Riel
2012-06-28 12:55 ` [PATCH 17/40] autonuma: autonuma_enter/exit Andrea Arcangeli
2012-06-29 18:37   ` Rik van Riel
2012-06-28 12:55 ` [PATCH 18/40] autonuma: call autonuma_setup_new_exec() Andrea Arcangeli
2012-06-29 18:39   ` Rik van Riel
2012-06-30  5:04   ` Konrad Rzeszutek Wilk
2012-07-12 17:50     ` Andrea Arcangeli
2012-06-28 12:55 ` [PATCH 19/40] autonuma: alloc/free/init sched_autonuma Andrea Arcangeli
2012-06-29 18:52   ` Rik van Riel
2012-06-30  5:10   ` Konrad Rzeszutek Wilk
2012-07-12 17:59     ` Andrea Arcangeli
2012-06-28 12:56 ` [PATCH 20/40] autonuma: alloc/free/init mm_autonuma Andrea Arcangeli
2012-06-29 18:54   ` Rik van Riel
2012-06-30  5:12   ` Konrad Rzeszutek Wilk
2012-07-12 18:08     ` Andrea Arcangeli
2012-07-12 18:17       ` Johannes Weiner
2012-07-13 14:19         ` Christoph Lameter
2012-07-14 17:01           ` Andrea Arcangeli
2012-07-01 15:33   ` Rik van Riel
2012-07-12 18:27     ` Andrea Arcangeli
2012-06-28 12:56 ` [PATCH 21/40] autonuma: avoid CFS select_task_rq_fair to return -1 Andrea Arcangeli
2012-06-29 18:57   ` Rik van Riel
2012-06-29 19:05     ` Peter Zijlstra
2012-06-29 19:07       ` Rik van Riel
2012-06-29 20:48         ` Ingo Molnar
2012-06-28 12:56 ` [PATCH 22/40] autonuma: teach CFS about autonuma affinity Andrea Arcangeli
2012-07-01 16:37   ` Rik van Riel
2012-06-28 12:56 ` [PATCH 23/40] autonuma: sched_set_autonuma_need_balance Andrea Arcangeli
2012-07-01 16:57   ` Rik van Riel
2012-06-28 12:56 ` [PATCH 24/40] autonuma: core Andrea Arcangeli
2012-07-02  4:07   ` Rik van Riel
2012-06-28 12:56 ` [PATCH 25/40] autonuma: follow_page check for pte_numa/pmd_numa Andrea Arcangeli
2012-07-02  4:14   ` Rik van Riel
2012-07-14 16:43     ` Andrea Arcangeli
2012-06-28 12:56 ` [PATCH 26/40] autonuma: default mempolicy follow AutoNUMA Andrea Arcangeli
2012-07-02  4:19   ` Rik van Riel
2012-06-28 12:56 ` [PATCH 27/40] autonuma: call autonuma_split_huge_page() Andrea Arcangeli
2012-07-02  4:22   ` Rik van Riel
2012-06-28 12:56 ` [PATCH 28/40] autonuma: make khugepaged pte_numa aware Andrea Arcangeli
2012-07-02  4:24   ` Rik van Riel
2012-07-12 18:50     ` Andrea Arcangeli
2012-07-12 21:25       ` Rik van Riel
2012-06-28 12:56 ` [PATCH 29/40] autonuma: retain page last_nid information in khugepaged Andrea Arcangeli
2012-07-02  4:33   ` Rik van Riel
2012-06-28 12:56 ` [PATCH 30/40] autonuma: numa hinting page faults entry points Andrea Arcangeli
2012-07-02  4:47   ` Rik van Riel
2012-06-28 12:56 ` [PATCH 31/40] autonuma: reset autonuma page data when pages are freed Andrea Arcangeli
2012-07-02  4:49   ` Rik van Riel
2012-06-28 12:56 ` [PATCH 32/40] autonuma: initialize page structure fields Andrea Arcangeli
2012-07-02  4:50   ` Rik van Riel
2012-06-28 12:56 ` [PATCH 33/40] autonuma: link mm/autonuma.o and kernel/sched/numa.o Andrea Arcangeli
2012-07-02  4:56   ` Rik van Riel
2012-06-28 12:56 ` [PATCH 34/40] autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED Andrea Arcangeli
2012-07-02  4:58   ` Rik van Riel
2012-06-28 12:56 ` [PATCH 35/40] autonuma: boost khugepaged scanning rate Andrea Arcangeli
2012-07-02  5:12   ` Rik van Riel
2012-06-28 12:56 ` [PATCH 36/40] autonuma: page_autonuma Andrea Arcangeli
2012-06-30  5:24   ` Konrad Rzeszutek Wilk
2012-07-12 19:43     ` Andrea Arcangeli
2012-07-02  6:37   ` Rik van Riel
2012-07-12 19:58     ` Andrea Arcangeli
2012-06-28 12:56 ` [PATCH 37/40] autonuma: page_autonuma change #include for sparse Andrea Arcangeli
2012-07-02  6:22   ` Rik van Riel
2012-06-28 12:56 ` [PATCH 38/40] autonuma: autonuma_migrate_head[0] dynamic size Andrea Arcangeli
2012-07-02  5:15   ` Rik van Riel
2012-06-28 12:56 ` [PATCH 39/40] autonuma: bugcheck page_autonuma fields on newly allocated pages Andrea Arcangeli
2012-07-02  6:40   ` Rik van Riel
2012-06-28 12:56 ` [PATCH 40/40] autonuma: shrink the per-page page_autonuma struct size Andrea Arcangeli
2012-07-02  7:18   ` Rik van Riel
2012-07-12 20:21     ` Andrea Arcangeli
2012-07-09 15:40 ` [PATCH 00/40] AutoNUMA19 Rik van Riel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4FEDB797.3050804@gmail.com \
    --to=nai.xia@gmail.com \
    --cc=Lee.Schermerhorn@hp.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.shi@intel.com \
    --cc=benh@kernel.crashing.org \
    --cc=bharata.rao@gmail.com \
    --cc=cl@linux.com \
    --cc=danms@us.ibm.com \
    --cc=dhillf@gmail.com \
    --cc=don.morris@hp.com \
    --cc=efault@gmx.de \
    --cc=hannes@cmpxchg.org \
    --cc=konrad.wilk@oracle.com \
    --cc=laijs@cn.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mauricfo@linux.vnet.ibm.com \
    --cc=mingo@elte.hu \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=pjt@google.com \
    --cc=riel@redhat.com \
    --cc=suresh.b.siddha@intel.com \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=vatsa@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).