From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>,
Andrea Arcangeli <aarcange@redhat.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Hillf Danton <dhillf@gmail.com>, Dan Smith <danms@us.ibm.com>,
Andrew Morton <akpm@linux-foundation.org>,
Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu>,
Paul Turner <pjt@google.com>,
Suresh Siddha <suresh.b.siddha@intel.com>,
Mike Galbraith <efault@gmx.de>,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
Lai Jiangshan <laijs@cn.fujitsu.com>,
Bharata B Rao <bharata.rao@gmail.com>,
Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>,
Christoph Lameter <cl@linux.com>
Subject: Re: [PATCH 00/35] AutoNUMA alpha14
Date: Wed, 30 May 2012 16:46:40 +0200
Message-ID: <1338389200.26856.273.camel@twins>
In-Reply-To: <CA+55aFxpD+LsE+aNvDJtz9sGsGMvdusisgOY3Csbzyx1mEqW-w@mail.gmail.com>
On Sat, 2012-05-26 at 13:42 -0700, Linus Torvalds wrote:
> I'm a *firm* believer that if it cannot be done automatically "well
> enough", the absolute last thing we should ever do is worry about the
> crazy people who think they can tweak it to perfection with complex
> interfaces.
>
> You can't do it, except for trivial loads (often benchmarks), and for
> very specific machines.
>
> So I think very strongly that we should entirely dismiss all the
> people who want to do manual placement and claim that they know what
> their loads do. They're either full of sh*t (most likely), or they
> have a very specific benchmark and platform that they are tuning for
> that is totally irrelevant to everybody else.
>
> What we *should* try to aim for is a system that doesn't do horribly
> badly right out of the box. IOW, no tuning what-so-ever (at most a
> kind of "yes, I want you to try to do the NUMA thing" flag to just
> enable it at all), and try to not suck.
>
> Seriously. "Try to avoid sucking" is *way* superior to "We can let the
> user tweak things to their hearts content". Because users won't get it
> right.
>
> Give the anal people a knob they can tweak, and tell them it does
> something fancy. And never actually wire the damn thing up. They'll be
> really happy with their OCD tweaking, and do lots of nice graphs that
> just show how the error bars are so big that you can find any damn
> pattern you want in random noise.
So the thing is, my homenode-per-process approach should work for
everything except the case where a single process outstrips a single
node in either cpu utilization or memory consumption.
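To make that condition concrete, here is a minimal sketch of the "fits on
one home node" test. It is an illustration only, not code from either patch
set; the struct and function names (node_capacity, proc_demand,
fits_home_node) are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical description of what one NUMA node offers. */
struct node_capacity {
	unsigned int cpus;            /* cores in the node */
	unsigned long long mem_bytes; /* memory attached to the node */
};

/* Hypothetical description of what one process wants. */
struct proc_demand {
	unsigned int runnable_threads; /* threads that run concurrently */
	unsigned long long rss_bytes;  /* resident memory */
};

/*
 * True when a single home node can hold the whole process, i.e. the
 * process outstrips the node in neither cpu nor memory.
 */
static bool fits_home_node(const struct proc_demand *p,
			   const struct node_capacity *n)
{
	return p->runnable_threads <= n->cpus &&
	       p->rss_bytes <= n->mem_bytes;
}
```

Only processes for which this returns false are the big ones that need
something more than a single home node.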
Now I claim such processes are rare since nodes are big, typically 6-8
cores. Writing anything that can sustain parallel execution larger than
that is very specialist (and typically already employs strong data
separation).
Yes, there are such things out there: some use JVMs, some are virtual
machines, some are regular applications. But by and large processes are
small compared to nodes.
So my approach is to focus on the normal case, and provide 2 system calls
to replace sched_setaffinity() and mbind() for the people who use those.
Now, maybe I shouldn't have bothered with the system calls.. but I
thought providing something better than hard-affinity would be nice.
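For reference, this is roughly what the hard-affinity usage being replaced
looks like today. It is a minimal sketch that uses only the existing glibc
sched_getaffinity()/sched_setaffinity() wrappers; the helper name
pin_to_first_allowed_cpu() is made up for illustration, and the new system
calls are deliberately not shown.

```c
#define _GNU_SOURCE
#include <sched.h>

/*
 * Pin the calling thread to the first CPU it is currently allowed to
 * run on. Returns that CPU number, or -1 on failure.
 */
static int pin_to_first_allowed_cpu(void)
{
	cpu_set_t allowed, one;
	int cpu;

	/* pid 0 means the calling thread */
	if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
		return -1;

	for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
		if (!CPU_ISSET(cpu, &allowed))
			continue;
		CPU_ZERO(&one);
		CPU_SET(cpu, &one);
		/* From here on the scheduler may not move us elsewhere. */
		return sched_setaffinity(0, sizeof(one), &one) == 0 ? cpu : -1;
	}
	return -1;
}
```

The mask is a hard constraint: once set, the load-balancer has no freedom
left, even when the chosen cpu turns out to be a bad place to run.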
Andrea went the other way and focused on these big processes. His
approach relies on a pte scanner and faults. His code builds a
page<->thread map and, using this data, moves either memory or
processes around (I'm a little vague on the details simply because I
haven't seen it explained anywhere yet -- and the code is non-obvious).
I have a number of problems with both the approach and the
implementation.
On the approach my biggest complaints are:
- the complexity, it focuses on the rarest sort of processes and thus
results in a rather complex setup.
- load-balance state explosion, the page-tables become part of the
load-balance state -- this is a lot of extra state making
reproduction more 'interesting'.
- the overhead, since it's per page, it needs per-page state.
- I don't see how it can reliably work for virtual machines, because
the host page<->thread (vcpu) relation doesn't reflect a
data<->compute relation in this case. The guest scheduler can move
the guest thread (the compute) part around between the vcpus at a
much higher rate than the host will update its page<->vcpu map.
On the implementation:
- he works around the scheduler instead of with it.
- it's x86 only (although he claims adding archs is trivial,
I've yet to see the first !x86 support).
- complete lack of useful comments describing the balancing goal and
approach.
The worst part is that I've asked for this stuff several times, but
nothing seems forthcoming.
Anyway, I prefer doing the simple thing first and then seeing if there's
need for more complexity, esp. given the overheads involved. But if you
prefer we can dive off the deep end :-)