Re: [PATCH 26/31] sched, numa, mm: Add fault driven placement and migration policy

linux-mm.kvack.org archive mirror

From: Ingo Molnar <mingo@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Rik van Riel <riel@redhat.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Mel Gorman <mgorman@suse.de>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 26/31] sched, numa, mm: Add fault driven placement and migration policy
Date: Fri, 26 Oct 2012 09:15:32 +0200
Message-ID: <20121026071532.GC8141@gmail.com>
In-Reply-To: <CA+55aFwJdn8Kz9UByuRfGNtf9Hkv-=8xB+WRd47uHZU1YMagZw@mail.gmail.com>

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Oct 25, 2012 at 5:16 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > +       /*
> > +        * Using runtime rather than walltime has the dual advantage that
> > +        * we (mostly) drive the selection from busy threads and that the
> > +        * task needs to have done some actual work before we bother with
> > +        * NUMA placement.
> > +        */
> 
> That explanation makes sense..
> 
> > +       now = curr->se.sum_exec_runtime;
> > +       period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
> > +
> > +       if (now - curr->node_stamp > period) {
> > +               curr->node_stamp = now;
> > +
> > +               if (!time_before(jiffies, curr->mm->numa_next_scan)) {
> 
> .. but then the whole "numa_next_scan" thing ends up being 
> about real-time anyway?
>
> So 'numa_scan_period' in in CPU time (msec, converted to nsec 
> at runtime rather than when setting it), but 'numa_next_scan' 
> is in wallclock time (jiffies)?
> 
> But *both* of them are based on the same 'numa_scan_period' 
> thing that the user sets in ms.
> 
> So numa_scan_period is interpreted as both wallclock *and* as 
> runtime?
> 
> Maybe this works, but it doesn't really make much sense.

So, the relationship between wall clock time and execution 
runtime is that in the limit they run at the same speed: when 
there's a single task running. In any other case execution 
runtime can only run slower than wall time.

So the bit you found weird:

> > +               if (!time_before(jiffies, curr->mm->numa_next_scan)) {

together with the task_numa_work() frequency limit:

        /*
         * Enforce maximal scan/migration frequency..
         */
        migrate = mm->numa_next_scan;
        if (time_before(now, migrate))
                return;

        next_scan = now + 2*msecs_to_jiffies(sysctl_sched_numa_scan_period_min);
        if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
                return;

puts an upper limit on the per mm scanning frequency.

This keeps us from over-sampling if there are many threads: if 
all threads happen to come in at the same time we don't create a 
spike in overhead.
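
As a rough illustration of that claim step, here is a minimal 
user-space sketch using C11 atomics. It is not the kernel code: 
numa_next_scan_sim, tick_before() and try_claim_scan() are 
hypothetical names, abstract "ticks" stand in for jiffies, and 
min_period_ticks stands in for 
msecs_to_jiffies(sysctl_sched_numa_scan_period_min).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for mm->numa_next_scan, in abstract ticks. */
static _Atomic unsigned long numa_next_scan_sim;

/* Wraparound-safe "a is before b", in the style of time_before(). */
static bool tick_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * Returns true if the caller won the right to scan in this window.
 * Of all callers that observed the same deadline, only the one whose
 * compare-and-exchange succeeds moves the deadline forward; the rest
 * back off, as in the cmpxchg() excerpt.
 */
static bool try_claim_scan(unsigned long now, unsigned long min_period_ticks)
{
	unsigned long migrate = atomic_load(&numa_next_scan_sim);

	if (tick_before(now, migrate))
		return false;	/* window still closed */

	return atomic_compare_exchange_strong(&numa_next_scan_sim, &migrate,
					      now + 2 * min_period_ticks);
}
```

Calling try_claim_scan(1000, 100) eight times in a row (standing 
in for eight threads arriving in the same window) lets only the 
first call through, because it moves the deadline to tick 1200. A 
call at tick 1199 is still rejected and one at tick 1200 succeeds.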

We also avoid multiple threads scanning at once in parallel to 
each other. Faults are nicely parallel, especially with all the 
preparatory patches in place, so the distributed nature of the 
faults itself is not a problem.

So we have two conflicting goals here: on one hand we have a 
quality of sampling goal which asks for per task runtime 
proportional scanning on all threads, but we also have a 
performance goal and don't actually want all threads running at 
the same time. This frequency limit avoids the over-sampling 
scenario while still fulfilling the per task sampling property, 
statistically on average.
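
To make the interaction of the two goals concrete, here is a 
small single-threaded simulation sketch. It merges the tick-time 
check and the task_numa_work() check into one function and leaves 
out the atomics, so it is a simplification rather than the actual 
implementation. All sim_* names and the 100 ms parameter values 
are made up for illustration, and every thread is assumed to be 
fully busy on its own CPU.

```c
#include <stdbool.h>
#include <stdint.h>

#define SIM_NSEC_PER_MSEC 1000000ULL
#define SIM_MAX_THREADS   64

struct sim_mm   { uint64_t next_scan_wall_ms; };	/* shared by all threads */
struct sim_task { uint64_t runtime_ns, node_stamp_ns; };

/* Returns true if this task should scan on this tick. */
static bool sim_tick(struct sim_task *t, struct sim_mm *mm, uint64_t wall_ms,
		     uint64_t period_ms, uint64_t period_min_ms)
{
	/* Gate 1, per task, CPU time: has the task done enough work? */
	if (t->runtime_ns - t->node_stamp_ns <= period_ms * SIM_NSEC_PER_MSEC)
		return false;
	t->node_stamp_ns = t->runtime_ns;

	/* Gate 2, per mm, wall clock: has anyone scanned too recently? */
	if (wall_ms < mm->next_scan_wall_ms)
		return false;
	mm->next_scan_wall_ms = wall_ms + 2 * period_min_ms;
	return true;
}

/* Counts scans of one mm over wall_total_ms with 1 ms ticks. */
static unsigned sim_count_scans(unsigned nr_threads, uint64_t wall_total_ms)
{
	struct sim_mm mm = { 0 };
	struct sim_task tasks[SIM_MAX_THREADS] = { 0 };
	unsigned scans = 0;

	if (nr_threads > SIM_MAX_THREADS)
		nr_threads = SIM_MAX_THREADS;

	for (uint64_t wall_ms = 1; wall_ms <= wall_total_ms; wall_ms++) {
		for (unsigned i = 0; i < nr_threads; i++) {
			/* Fully busy thread: runtime tracks wall time. */
			tasks[i].runtime_ns += SIM_NSEC_PER_MSEC;
			if (sim_tick(&tasks[i], &mm, wall_ms, 100, 100))
				scans++;
		}
	}
	return scans;
}
```

With these made-up parameters, sim_count_scans() returns the same 
count over one simulated second whether the mm has 1 thread or 16: 
every thread passes the runtime gate, but the per-mm wall clock 
gate caps the total. In this lockstep model the same thread always 
wins; the "statistically on average" property relies on real 
threads arriving at different times.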

If you agree that we should do it like that and if the 
implementation is correct and optimal, I will put a better 
explanation into the code.

[
  task_numa_work() performance side note:

  We are also *very* close to being able to use down_read() instead
  of down_write() in the sampling-unmap code in 
  task_numa_work(), as it should be safe in theory to call 
  change_protection(PROT_NONE) in parallel - but there's one 
  regression that disagrees with this theory so we use 
  down_write() at the moment.

  Maybe you could help us there: can you see a reason why the
  change_prot_none()->change_protection() call in
  task_numa_work() can not occur in parallel to a page fault in
  another thread on another CPU? It should be safe - yet if we 
  change it I can see occasional corruption of user-space state: 
  segfaults and register corruption.
]

> [...] And what is the impact of this on machines that run lots 
> of loads with delays (whether due to IO or timers)?

I've done sysbench OLTP measurements which showed no apparent 
regressions:

 #
 # Comparing { res-schednuma-NO_NUMA.txt } to { res-schednuma-+NUMA.txt }:
 #
 #  threads     improvement %       SysBench OLTP transactions/second
 #-------------------------------------------------------------------
         2:            2.11 %              #    2160.20  vs.  2205.80
         4:           -5.52 %              #    4202.04  vs.  3969.97
         8:            0.01 %              #    6894.45  vs.  6895.45
        16:           -0.31 %              #   11840.77  vs. 11804.30
        24:           -0.56 %              #   15053.98  vs. 14969.14
        30:            0.56 %              #   17043.23  vs. 17138.21
        32:           -1.08 %              #   17797.04  vs. 17604.67
        34:            1.04 %              #   18158.10  vs. 18347.22
        36:           -0.16 %              #   18125.42  vs. 18096.68
        40:            0.45 %              #   18218.73  vs. 18300.59
        48:           -0.39 %              #   18266.91  vs. 18195.26
        56:           -0.11 %              #   18285.56  vs. 18265.74
        64:            0.23 %              #   18304.74  vs. 18347.51
        96:            0.18 %              #   18268.44  vs. 18302.04
       128:            0.22 %              #   18058.92  vs. 18099.34
       256:            1.63 %              #   17068.55  vs. 17347.14
       512:            6.86 %              #   13452.18  vs. 14375.08
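
For reference, the "improvement %" column appears to be the 
relative change of the +NUMA result against the NO_NUMA result. 
That is inferred from the numbers rather than stated by the 
comparison output, and improvement_pct() is a hypothetical helper 
name:

```c
/*
 * Relative change of the second measurement against the first,
 * in percent. Positive means the second run had more
 * transactions/second.
 */
static double improvement_pct(double without_numa, double with_numa)
{
	return (with_numa - without_numa) / without_numa * 100.0;
}
```

Under that assumption, improvement_pct(2160.20, 2205.80) rounds to 
2.11 and improvement_pct(4202.04, 3969.97) rounds to -5.52, 
matching the 2-thread and 4-thread rows.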

No regression is the best we can hope for I think, given that 
OLTP typically has huge global caches and global serialization, 
so any NUMA consciousness will at most be a nuisance.

We've also done kbuild measurements - kbuild too is a pretty 
sleepy workload, one that is too fast for any migration 
techniques to help.

But even sysbench isn't doing very long delays, so I will do 
more IO delay targeted measurements.

So I've been actively looking for and checking the worst-case 
loads for this feature. The feature obviously helps long-run, 
CPU-intense workloads, but those aren't the challenging ones 
really IMO: I spent 70% of the time analyzing workloads that are 
not expected to be friends with this feature.

We are also keeping CONFIG_SCHED_NUMA off by default for good 
measure.

Thanks,

	Ingo


