linux-mm.kvack.org archive mirror
From: Andrea Arcangeli <aarcange@redhat.com>
To: Rik van Riel <riel@redhat.com>
Cc: Nai Xia <nai.xia@gmail.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	dlaor@redhat.com, Ingo Molnar <mingo@kernel.org>,
	Hillf Danton <dhillf@gmail.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Dan Smith <danms@us.ibm.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu>,
	Paul Turner <pjt@google.com>,
	Suresh Siddha <suresh.b.siddha@intel.com>,
	Mike Galbraith <efault@gmx.de>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Lai Jiangshan <laijs@cn.fujitsu.com>,
	Bharata B Rao <bharata.rao@gmail.com>,
	Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>,
	Christoph Lameter <cl@linux.com>, Alex Shi <alex.shi@intel.com>,
	Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	Don Morris <don.morris@hp.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>
Subject: Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
Date: Fri, 6 Jul 2012 00:59:35 +0200
Message-ID: <20120705225935.GS25422@redhat.com>
In-Reply-To: <4FF5D7CA.5020301@redhat.com>

Hi Rik,

On Thu, Jul 05, 2012 at 02:07:06PM -0400, Rik van Riel wrote:
> Once the first thread gets a NUMA pagefault on a
> particular page, the page is made present in the
> page tables and NO OTHER THREAD will get NUMA
> page faults.

Oh this is a great question, thanks for raising it.

> That means when trying to compare the weighting
> of NUMA accesses between different threads in a
> 10 second interval, we only know THE FIRST FAULT.

The task_autonuma statistics don't include only the first fault of every
10 sec interval. By the time we compare them in the scheduler we have a
trail of fault history that decays exponentially (the decay can be tuned
to be slower; right now it goes down pretty aggressively, with a shift
right by one at every pass):

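/*
 * Halve the per-node fault counters and the total at every pass, so
 * older faults keep contributing with exponentially decreasing weight.
 */
static void cpu_follow_memory_pass(struct task_struct *p,
				   struct task_autonuma *task_autonuma,
				   unsigned long *task_numa_fault)
{
	int nid;

	for_each_node(nid)
		task_numa_fault[nid] >>= 1;
	task_autonuma->task_numa_fault_tot >>= 1;
}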

So depending on which thread gets there first at every pass, that
information will accumulate fine across multiple task_autonuma
structures if there is a large amount of it.
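
To make the decay concrete, here is a minimal userspace sketch (not the
kernel code). struct fault_stats, decay_pass() and account_faults() are
hypothetical names invented for illustration and NR_NODES is an
arbitrary value; only the shift right by one is taken from
cpu_follow_memory_pass().

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 2	/* hypothetical node count for this model */

/* Hypothetical userspace stand-in for the per-task fault statistics. */
struct fault_stats {
	unsigned long fault[NR_NODES];	/* decayed per-node fault counts */
	unsigned long fault_tot;	/* decayed total across all nodes */
};

/* Halve all history, mirroring the ">>= 1" in cpu_follow_memory_pass(). */
static void decay_pass(struct fault_stats *s)
{
	for (size_t nid = 0; nid < NR_NODES; nid++)
		s->fault[nid] >>= 1;
	s->fault_tot >>= 1;
}

/* Account the hinting faults seen on one node since the last pass. */
static void account_faults(struct fault_stats *s, size_t nid, unsigned long nr)
{
	assert(nid < NR_NODES);
	s->fault[nid] += nr;
	s->fault_tot += nr;
}
```

With 100 faults accounted on node 0, one pass, 40 faults accounted on
node 1 and another pass, the counters end at 25 and 20: the older
faults still count, just with a quarter of their original weight.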

> We have no information on whether any other threads
> tried to access the same page, because we do not
> get faults more frequently.

If all threads access the same pages and there is false sharing, the
last_nid logic will eventually trigger. You're right that for maybe two
or three passes in a row the same thread may get to the same page first,
but eventually another thread will get there and the last_nid will
change. The more nodes and the more threads, the less likely it is that
the same thread gets there first.

The moment another thread on a different node accesses the page, any
pending migration is aborted if the page is still in the page_autonuma
LRU, and the autonuma_last_nid will have to be reconfirmed before we
migrate it anywhere again.
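
A simplified model of that confirmation step, as a sketch only: struct
page_model, its fields and hinting_fault() are hypothetical names, and
the real state lives in page_autonuma with locking and LRU handling
that this leaves out. The only behavior modeled is what is described
above: a fault from a new node resets last_nid and drops any pending
migration, and a migration is queued only once the same node faults
twice in a row on a remote page.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-page state; the real fields live in page_autonuma. */
struct page_model {
	int last_nid;		/* node of the latest hinting fault, -1 if none */
	bool migrate_queued;	/* models "still in the page_autonuma LRU" */
};

/*
 * Model one NUMA hinting fault taken by a thread running on this_nid
 * against a page that currently resides on page_nid.
 * Returns true if a migration is queued after the fault.
 */
static bool hinting_fault(struct page_model *pg, int page_nid, int this_nid)
{
	bool confirmed = (pg->last_nid == this_nid);

	pg->last_nid = this_nid;
	if (!confirmed) {
		/* A different node touched the page: abort pending migration. */
		pg->migrate_queued = false;
		return false;
	}
	/* Same node twice in a row: migrate only if the page is remote. */
	if (this_nid != page_nid)
		pg->migrate_queued = true;
	return pg->migrate_queued;
}
```

In this model two threads on different nodes alternating on the same
page never queue a migration, which is the false sharing case.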

> Not only do we not get use frequency information,
> we may not get the information on which threads use
> which memory, at all.
> 
> Somehow Andrea's code still seems to work.

It works when the process fits in the node because we use the
mm_autonuma statistics when comparing the percentage of memory
utilization per node (the so-called w_other/w_nid/w_cpu_nid) of threads
belonging to different processes. This alone solves all false sharing
if the process fits in the node. So the issue above becomes
irrelevant (we have already converged without using task_autonuma).

Now if the process doesn't fit in the node, if there is false sharing,
that will be accounted with a smaller factor, and it will be accounted
for in its original memory location thanks to the last_nid logic. The
memory will not be migrated because of the last_nid logic
(statistically speaking).

Some spillover will definitely materialize but it won't be significant
as long as the NUMA thrashing is not enormous. If the NUMA thrashing is
unlimited, well, that workload is impossible to optimize and it's
impossible to converge anywhere, and the best would be to use
MPOL_INTERLEAVE.

But note that we're only talking about memory with page_mapcount=1
here; shared memory will never generate a single migration spillover
or NUMA hinting page fault.

> How much sense does the following code still make,
> considering we may never get all the info on which
> threads use which memory?

It is required to handle the case of threads that have local memory
and don't fit in a single node. That is the only case involving more
threads than CPUs in the node where we can converge perfectly.

This scenario is handled optimally thanks to the mm == p->mm code
below.
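
As a rough illustration of the thread case in the quoted hunk further
down, here is a standalone sketch of the w_other scaling.
thread_w_other() is a hypothetical helper, BALANCE_SCALE is an assumed
stand-in for AUTONUMA_BALANCE_SCALE (the value 1000 is picked only for
the example), and I'm assuming w_t is a per-node fault weight and w_t_t
the matching total; only the clamp and the division mirror the hunk.

```c
#include <assert.h>

#define BALANCE_SCALE 1000UL	/* assumed stand-in for AUTONUMA_BALANCE_SCALE */

/*
 * Scale the per-node weight w_t against the total w_t_t. The clamp
 * keeps the result from ever exceeding BALANCE_SCALE.
 */
static unsigned long thread_w_other(unsigned long w_t, unsigned long w_t_t)
{
	if (w_t > w_t_t)
		w_t_t = w_t;	/* same clamp as in the quoted hunk */
	assert(w_t_t != 0);	/* this model requires a nonzero total */
	return w_t * BALANCE_SCALE / w_t_t;
}
```

For example, 25 faults out of a total of 100 give a weight of 250, and
a per-node count that has run ahead of the total saturates at
BALANCE_SCALE instead of overshooting.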

There can be false sharing too, as long as there is also some local
memory to converge on (it may be impossible to converge on the false
shared regions even if we accounted them more aggressively).

I don't exclude that the reduced accounting of false shared memory you
are asking about may actually be beneficial. The more threads are
involved in the false sharing, the more the accounting of the false
sharing regions will be reduced, and that may help to converge without
mistakes. The more threads are involved in the false sharing, the more
likely it is that converging on the false shared memory is impossible.

Last but not least, this is what the hardware gives us. It looks like
good enough info to me, but I'm just trying to live with the only
information we can collect from the hardware efficiently.

> 
> +			/*
> +			 * Generate the w_nid/w_cpu_nid from the
> +			 * pre-computed mm/task_numa_weight[] and
> +			 * compute w_other using the w_m/w_t info
> +			 * collected from the other process.
> +			 */
> +			if (mm == p->mm) {
> +				if (w_t > w_t_t)
> +					w_t_t = w_t;
> +				w_other = w_t*AUTONUMA_BALANCE_SCALE/w_t_t;
> +				w_nid = task_numa_weight[nid];
> +				w_cpu_nid = task_numa_weight[cpu_nid];
> +				w_type = W_TYPE_THREAD;
> 
> Andrea, what is the real reason your code works?

Tried to explain above, but it's getting too long again, I wouldn't
know which part to drop though. If it's too messy ignore and I'll try
again later.

PS. this stuff isn't set in stone. I'm not saying this is the best
data collection or the best way to compute the data; I believe it's
close to the absolute minimum amount of info, and the minimum
computation on that data, required to perform as well as hard bindings
in the majority of workloads.
