From: Mel Gorman <mel@csn.ul.ie>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Peter Zijlstra <pzijlstr@redhat.com>, Ingo Molnar <mingo@elte.hu>,
	Hugh Dickins <hughd@google.com>, Rik van Riel <riel@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Hillf Danton <dhillf@gmail.com>,
	Andrew Jones <drjones@redhat.com>, Dan Smith <danms@us.ibm.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Paul Turner <pjt@google.com>, Christoph Lameter <cl@linux.com>,
	Suresh Siddha <suresh.b.siddha@intel.com>,
	Mike Galbraith <efault@gmx.de>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Subject: Re: [PATCH 04/33] autonuma: define _PAGE_NUMA
Date: Thu, 11 Oct 2012 20:48:51 +0100
Message-ID: <20121011194851.GL3317@csn.ul.ie>
In-Reply-To: <20121011164300.GN1818@redhat.com>

On Thu, Oct 11, 2012 at 06:43:00PM +0200, Andrea Arcangeli wrote:
> On Thu, Oct 11, 2012 at 12:01:37PM +0100, Mel Gorman wrote:
> > On Thu, Oct 04, 2012 at 01:50:46AM +0200, Andrea Arcangeli wrote:
> > > The objective of _PAGE_NUMA is to be able to trigger NUMA hinting page
> > > faults to identify the per NUMA node working set of the thread at
> > > runtime.
> > > 
> > > Arming the NUMA hinting page fault mechanism works similarly to
> > > setting up a mprotect(PROT_NONE) virtual range: the present bit is
> > > cleared at the same time that _PAGE_NUMA is set, so when the fault
> > > triggers we can identify it as a NUMA hinting page fault.
> > > 
> > 
> > That implies that there is an atomic update requirement or at least
> > an ordering requirement -- present bit must be cleared before setting
> > NUMA bit. No doubt it'll be clear later in the series how this is
> > accomplished. What you propose seems ok but it all depends how it's
> > implemented so I'm leaving my ack off this particular patch for now.
> 
> Correct. The switch is done atomically (clear _PAGE_PRESENT at the
> same time _PAGE_NUMA is set). The tlb flush is deferred (it's batched
> to avoid firing an IPI for every pte/pmd_numa we establish).
> 

Good. I think you might still be flushing more than you need to, but I
commented on that in the patch itself.
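
For anyone following along, the atomic switch described above can be
sketched in user-space C. This is only an illustration under stated
assumptions, not the code from the series: the bit positions and the sk_*
names are hypothetical, and C11 atomics stand in for whatever primitive
the real patch uses. The point is that PRESENT is cleared and NUMA is set
in one compare-and-swap, so no observer sees an intermediate state.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen for illustration only. */
#define SK_PAGE_PRESENT ((uint64_t)1 << 0)
#define SK_PAGE_NUMA    ((uint64_t)1 << 8)

/*
 * Arm a NUMA hinting fault on one entry: clear PRESENT and set NUMA in a
 * single compare-and-swap. Returns false if the entry was not present,
 * in which case it is left untouched.
 */
static bool sk_arm_numa_hint(_Atomic uint64_t *pte)
{
	uint64_t old = atomic_load(pte);
	uint64_t next;

	do {
		if (!(old & SK_PAGE_PRESENT))
			return false;
		next = (old & ~SK_PAGE_PRESENT) | SK_PAGE_NUMA;
		/* On failure, old is reloaded with the current value. */
	} while (!atomic_compare_exchange_weak(pte, &old, next));

	return true;
}

/* Fault side: a non-present entry with NUMA set is a hinting fault. */
static bool sk_is_numa_hint(uint64_t pte)
{
	return (pte & (SK_PAGE_PRESENT | SK_PAGE_NUMA)) == SK_PAGE_NUMA;
}
```

The sketch deliberately leaves out the deferred, batched TLB flush; it
only shows the bit flip itself.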

> It's still similar to setting a range PROT_NONE (except the way
> _PAGE_PROTNONE and _PAGE_NUMA works is the opposite, and they are
> mutually exclusive, so they can easily share the same pte/pmd
> bitflag). Except PROT_NONE must be synchronous, _PAGE_NUMA is set lazily.
> 
> The NUMA hinting page fault also won't require any TLB flush ever.
> 

It sort of can. The fault itself is still a heavy operation that can do
things like this:

numa_hinting_fault
 -> numa_hinting_fault_memory_follow_cpu
    -> autonuma_migrate_page
      -> sync_isolate_migratepages
         (lru lock for single page)
      -> migrate_pages

and buried down there, where it unmaps the page and makes a migration PTE,
is a TLB flush due to the call to ptep_clear_flush_notify(). That's obviously
the bad case, and the expectation is that it stops being a problem once the
threads converge on a node. While they are converging, though, it will be a
heavy cost.

Tracking how often a numa_hinting_fault results in a migration should be
enough to keep an eye on it.
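
A minimal sketch of the kind of tracking meant here, assuming two plain
counters. The struct and function names are hypothetical and are not
taken from the series; a real implementation would more likely hook into
existing kernel statistics.

```c
#include <stdint.h>

/* Hypothetical counters; names are not taken from the patch series. */
struct sk_numa_stats {
	uint64_t hinting_faults;   /* every NUMA hinting fault taken */
	uint64_t faults_migrated;  /* faults that ended in a page migration */
};

/* Account one hinting fault and whether it led to a migration. */
static void sk_account_fault(struct sk_numa_stats *s, int migrated)
{
	s->hinting_faults++;
	if (migrated)
		s->faults_migrated++;
}

/* Percentage of hinting faults that migrated a page (0 if no faults). */
static unsigned int sk_migrate_percent(const struct sk_numa_stats *s)
{
	if (s->hinting_faults == 0)
		return 0;
	return (unsigned int)(s->faults_migrated * 100 / s->hinting_faults);
}
```

A percentage that stays high long after a workload starts would suggest
the threads are not converging and the flush cost is being paid
repeatedly.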

> So the whole process (establish/teardown) has an incredibly low TLB
> flushing cost.
> 
> The only fixed cost is in knuma_scand and the enter/exit kernel for
> every not-shared page every 10 sec (or whatever you set the duration
> of a knuma_scand pass in sysfs).
> 

A pass every 10 seconds should be sufficiently infrequent. The interval itself
might need to adapt in the future, but a 10 second default will not stomp too
heavily for now.

> Furthermore, if the pmd_scan mode is activated, I guarantee there's at
> max 1 NUMA hinting page fault every 2m virtual region (even if some
> accuracy is lost). You can try to set scan_pmd = 0 in sysfs and also
> to disable THP (echo never >enabled) to measure the exact cost per 4k
> page. It's hardly measurable here. With THP the fault is also 1 every
> 2m virtual region but no accuracy is lost in that case (or more
> precisely, there's no way to get more accuracy than that as we deal
> with a pmd).
> 
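
To put rough numbers on the quoted bound: with 4k pages a 2M region holds
512 ptes, so per-pte hinting could take up to 512 faults per region per
pass where the pmd mode takes at most 1. The helper below is just that
arithmetic, assuming 4k base pages and 2M pmds as in the quoted text; the
sk_* names are hypothetical.

```c
#include <stddef.h>

#define SK_PAGE_SIZE ((size_t)4096)     /* 4k base page, as quoted */
#define SK_PMD_SIZE  ((size_t)2 << 20)  /* 2M pmd region, as quoted */

/*
 * Upper bound on NUMA hinting faults in one scan pass over `bytes` of
 * address space: one per 2M region in pmd mode, else one per 4k page.
 * Partial units are rounded up.
 */
static size_t sk_max_hinting_faults(size_t bytes, int per_pmd)
{
	size_t unit = per_pmd ? SK_PMD_SIZE : SK_PAGE_SIZE;

	return (bytes + unit - 1) / unit;
}
```

For a 1G mapping this gives at most 512 faults per pass in pmd mode
against 262144 per-page faults, which is the accuracy-for-cost trade
being described.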

-- 
Mel Gorman
SUSE Labs