From: Mel Gorman <mgorman@suse.de>
To: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Rik van Riel <riel@redhat.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Hugh Dickins <hughd@google.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux-MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 06/43] mm: numa: Make pte_numa() and pmd_numa() a generic implementation
Date: Fri, 16 Nov 2012 18:44:59 +0000
Message-ID: <20121116184459.GG8218@suse.de>
In-Reply-To: <20121116173755.GB4697@gmail.com>

On Fri, Nov 16, 2012 at 06:37:55PM +0100, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > > AFAICS, this portion of numa/core:
> > > 
> > > c740b1cccdcb x86/mm: Completely drop the TLB flush from ptep_set_access_flags()
> > 
> > We share this.
> > 
> > > 02743c9c03f1 mm/mpol: Use special PROT_NONE to migrate pages
> > 
> > hard-codes prot_none
> 
> I prefer any arch support extensions to be done in the patch 
> that adds that specific arch support.
> 
> That way we can consider the pros and cons of abstraction. Also 
> see further below.
> 

Using _PAGE_NUMA mapped onto _PROT_NONE does not prevent the same
abstraction.

> > > cd203e33c39d mm/mpol: Add MPOL_MF_NOOP
> > 
> > I have a patch that backs this out on the grounds that I don't 
> > think we have adequately discussed if it was the correct 
> > userspace interface. I know Peter put a lot of time into it so 
> > it's probably correct but without man pages or spending time 
> > writing an example program that used it, I played safe.
> 
> I'm fine with not exposing it to user-space.
> 

Ok.

> > > Mel, could you please work on this basis, or point out the 
> > > bits you don't agree with so I can fix it?
> > 
> > My main hangup is the prot_none choice and I know it's 
> > something we have butted heads on without progress. [...]
> 
> It's the basic KISS concept - I think you are over-designing 
> this. An architecture opts in to the new, generic code via 
> doing:
> 
>   select ARCH_SUPPORTS_NUMA_BALANCING
> 
> ... if it cannot enable that then it will extend the core code 
> in *very* visible ways.
> 

It won't kill them to decide if they really want to use _PAGE_NUMA or
not. The fact that the generic helpers end up being a few instructions
is nice although you probably could do the same with some juggling.
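
To illustrate why such helpers can be a few instructions, here is a minimal
userspace sketch. It assumes a NUMA hinting bit that aliases the PROT_NONE
bit. The bit positions, the pte_model_t type and the pte_model_*() names are
hypothetical and exist only for this illustration; they are not the kernel's
definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout for illustration only; real values are per-arch. */
#define PTE_PRESENT  (UINT64_C(1) << 0)
#define PTE_PROTNONE (UINT64_C(1) << 8)
#define PTE_NUMA     PTE_PROTNONE	/* assumption: NUMA aliases PROT_NONE */

typedef struct { uint64_t val; } pte_model_t;

/* A hinting entry has the NUMA bit set and the present bit clear. */
static inline bool pte_model_numa(pte_model_t pte)
{
	return (pte.val & (PTE_NUMA | PTE_PRESENT)) == PTE_NUMA;
}

/* Arm a hinting fault: clear present, set NUMA, keep all other bits. */
static inline pte_model_t pte_model_mknuma(pte_model_t pte)
{
	pte.val = (pte.val & ~PTE_PRESENT) | PTE_NUMA;
	return pte;
}

/* Undo the marking once the hinting fault has been handled. */
static inline pte_model_t pte_model_mknonnuma(pte_model_t pte)
{
	pte.val = (pte.val & ~PTE_NUMA) | PTE_PRESENT;
	return pte;
}
```

Callers in this model ask pte_model_numa() rather than testing the PROT_NONE
bit directly, so an architecture that needs a different bit would only have
to change the PTE_NUMA definition.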

> > [...] I feel it is a lot cleaner to have the _PAGE_NUMA bit 
> > (even if it's PROT_NONE underneath) and the helpers avoid 
> > function calls where possible. It also made the PMD handling 
> > sortof straight-forward and allowed the batching taking of the 
> > PTL and migration if the pages in the PMD were all on the same 
> > node. I liked this.
> > 
> > Yours is closer to what the architecture does and can use 
> > change_protect() with very few changes but on balance I did 
> > not find this a compelling alternative.
> 
> IMO here you are on the wrong side of history as well.
> 
> For example reusing change_protection() *already* uncovered 
> useful optimizations to the generic code:
> 
>    http://comments.gmane.org/gmane.linux.kernel.mm/89707
> 
> (regardless of how this particular change_protection() 
>  optimization will look like.)
> 
> that optimization would not have happened with your open-coded 
> change_protection() variant plain and simple.
> 

As I said before, very little actually stops me using change_protection if
_PAGE_NUMA == _PAGE_NONE. The only reason I didn't convert yet is because
I wanted to see what the full set of requirements were. Right now they
are simple:

1. Something to avoid unnecessary TLB flushes if there are no updates
2. Return if all the pages underneath are on the same node or not so
   that pmd_numa can be set if desired
3. Collect stats on PTE updates

1 should already be there. 2 would be trivial. 3 should also be fairly
trivial with some jiggery pokery.
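
The three requirements can be sketched as a toy userspace model. This is not
change_protection() itself: struct page_model, struct prot_numa_result and
change_prot_numa_model() are hypothetical names invented for this sketch, and
a "page" here is just a node id plus two flags.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page model: owning node and hinting state. */
struct page_model {
	int  nid;	/* node the page currently lives on */
	bool present;	/* is there a mapped page at all? */
	bool numa;	/* already marked for a hinting fault? */
};

struct prot_numa_result {
	unsigned long updated;	/* requirement 3: PTE updates, for stats */
	bool all_same_node;	/* requirement 2: could pmd_numa be set? */
	bool need_tlb_flush;	/* requirement 1: flush only on change */
};

static struct prot_numa_result
change_prot_numa_model(struct page_model *pages, size_t nr)
{
	struct prot_numa_result res = { 0, false, false };
	int first_nid = -1;
	bool mixed = false;

	for (size_t i = 0; i < nr; i++) {
		if (!pages[i].present)
			continue;
		if (first_nid < 0)
			first_nid = pages[i].nid;
		else if (pages[i].nid != first_nid)
			mixed = true;
		if (pages[i].numa)
			continue;	/* already marked, nothing to update */
		pages[i].numa = true;
		res.updated++;
	}

	/* An empty range is reported as "not all on one node". */
	res.all_same_node = first_nid >= 0 && !mixed;
	res.need_tlb_flush = res.updated > 0;
	return res;
}
```

A second pass over an already-marked range returns updated == 0 and
need_tlb_flush == false in this model, which is the case requirement 1 is
meant to catch.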

A conversion is not a fundamental problem. If an arch cannot use _PAGE_NONE
they will need to implement their own version of change_prot_numa() but
that in itself should be sufficient discouragement.

If an arch cannot use _PAGE_NONE in your case, it's a retrofit to find
all the places that use prot_none and see if they really mean prot_none
or if they meant prot_numa.

> So, to put it bluntly, you are not only doing a stupid thing, 
> you are doing an actively harmful thing here...
> 

Great, calling me stupid is going to help.

Is your major problem the change_prot_numa() part? If so, I can fix
that and adjust change_protection in the way I need.

> If you fix that then most of the differences between your tree 
> and numa/core disappears. You'll end up very close to:
> 

MIGRATE_FAULT is still there.

The lack of batch handling of a PMD fault may also be a problem. Right
now you only handle transparent hugepages and then depend on being able to
natively migrate them to avoid a big hit. In my case it is possible to mark
a PMD and deal with it as a single fault even if it's not a transparent
hugepage. This batches the taking of the PTL and migration of pages. This
will trap less although the guy that does trap takes a heavier hit. Maybe
this will work out best, maybe not, but it's possible.

An optimisation of this would be that if all pages in a PMD are on the same
node then only set the PMD. On the next fault if the fault is properly
placed it's one PMD update and the fault is complete. If it's misplaced
then one page needs to migrate and the pte_numa needs to be set on all the
pages below. On a fully converged workload this will be faster as we'll take
one PMD fault instead of 512 PTE faults reducing overall system CPU usage.
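
The arithmetic for the converged case can be written down as a small model.
It assumes 512 PTEs per PMD (4K pages under a 2M PMD, as on x86-64), a task
that touches every page, and a fault that turns out to be properly placed.
PTES_PER_PMD_MODEL and both function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define PTES_PER_PMD_MODEL 512	/* assumption: 4K pages under a 2M PMD */

/* True if every page under the PMD sits on the same node. */
static bool pmd_pages_on_one_node(const int *nids, size_t nr)
{
	if (nr == 0)
		return false;
	for (size_t i = 1; i < nr; i++)
		if (nids[i] != nids[0])
			return false;
	return true;
}

/*
 * Hinting faults taken for one scan pass over a PMD when the task touches
 * every page: one if only the PMD was marked, otherwise one per PTE.
 */
static unsigned long hinting_faults_per_pass(const int *nids, size_t nr,
					     bool pmd_batching)
{
	if (pmd_batching && pmd_pages_on_one_node(nids, nr))
		return 1;
	return (unsigned long)nr;
}
```

With all 512 pages on one node the model returns 1 with batching and 512
without. A single page on another node makes the batched case fall back to
512, which matches the "only set the PMD if all pages are on the same node"
condition. The misplaced-fault path is deliberately left out of the model.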

There are also the stats that track PTE updates, faults and migrations
which allow a user to make an estimation for how expensive automatic
balancing is from /proc/vmstat. This will help debugging user problems,
possibly without profiling.
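
A rough sketch of how a user might pull such counters out of /proc/vmstat
follows. The counter names used here (numa_pte_updates, numa_hint_faults,
numa_pages_migrated) are the ones found in later mainline kernels and are an
assumption as far as this series is concerned. The struct, the function names
and the migrations-per-fault ratio are hypothetical and only one of several
ways to turn the counters into an estimate.

```c
#include <stdio.h>
#include <string.h>

struct numa_stats {
	unsigned long long pte_updates;
	unsigned long long hint_faults;
	unsigned long long pages_migrated;
};

/* Returns 1 if the line held one of the counters of interest, else 0. */
static int numa_stats_parse_line(const char *line, struct numa_stats *st)
{
	char name[64];
	unsigned long long val;

	if (sscanf(line, "%63s %llu", name, &val) != 2)
		return 0;
	if (strcmp(name, "numa_pte_updates") == 0)
		st->pte_updates = val;
	else if (strcmp(name, "numa_hint_faults") == 0)
		st->hint_faults = val;
	else if (strcmp(name, "numa_pages_migrated") == 0)
		st->pages_migrated = val;
	else
		return 0;
	return 1;
}

/* Returns the number of counters found, or -1 if path cannot be opened. */
static int numa_stats_read(const char *path, struct numa_stats *st)
{
	char line[256];
	int found = 0;
	FILE *fp = fopen(path, "r");

	if (!fp)
		return -1;
	memset(st, 0, sizeof(*st));
	while (fgets(line, sizeof(line), fp))
		found += numa_stats_parse_line(line, st);
	fclose(fp);
	return found;
}

/* Crude cost indicator: pages migrated per hinting fault taken. */
static double numa_migrations_per_fault(const struct numa_stats *st)
{
	if (st->hint_faults == 0)
		return 0.0;
	return (double)st->pages_migrated / (double)st->hint_faults;
}
```

Calling numa_stats_read("/proc/vmstat", &st) twice, some seconds apart, and
comparing the deltas would give the rate of updates, faults and migrations
over that interval.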

>   - rebasing numa/core pretty much as-is
>   + add your migrate_displaced() function
>   - remove the user-facing lazy migration facilities.
>   + inline pte_numa()/pmd_numa() if you think it's beneficial
> 

+ regular pmd batch handling
+ stats on PTE updates and faults to estimate costs from /proc/vmstat

> If that works for you I'll test and backmerge all such deltas 
> quickly and we can move on.
> 

Or, if you're willing to backmerge, why not rebase the policy bits on
top of the basic migration policy, picking whichever point among the
patches below suits what you'd like to do?

 mm: numa: Rate limit setting of pte_numa if node is saturated
 sched: numa: Slowly increase the scanning period as NUMA faults are handled
 mm: numa: Introduce last_nid to the page frame
 mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
 sched: numa: Introduce tsk_home_node()
 sched: numa: Make find_busiest_queue() a method
 sched: numa: Implement home-node awareness
 sched: numa: Introduce per-mm and per-task structures

So that way, not only can we see the logical progression of how your
stuff works but also compare it to a basic policy that is not
particularly smart to make sure we are actually going in the right
direction.

-- 
Mel Gorman
SUSE Labs

