From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Lameter <cl@linux.com>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
linux-numa@vger.kernel.org, akpm@linux-foundation.org,
Mel Gorman <mel@csn.ul.ie>, Nick Piggin <npiggin@kernel.dk>,
Hugh Dickins <hughd@google.com>,
andi@firstfloor.org, David Rientjes <rientjes@google.com>,
Avi Kivity <avi@redhat.com>
Subject: Re: [PATCH/RFC 0/8] numa - Migrate-on-Fault
Date: Wed, 17 Nov 2010 12:03:56 -0500
Message-ID: <1290013436.3786.149.camel@zaphod>
In-Reply-To: <20101115143350.GC6809@random.random>
[-- Attachment #1: Type: text/plain, Size: 8445 bytes --]
On Mon, 2010-11-15 at 15:33 +0100, Andrea Arcangeli wrote:
> Hi everyone,
>
> On Mon, Nov 15, 2010 at 08:13:14AM -0600, Christoph Lameter wrote:
> > On Sun, 14 Nov 2010, KOSAKI Motohiro wrote:
> >
> > > Nice!
> >
> > Lets not get overenthused. There has been no conclusive proof that the
> > overhead introduced by automatic migration schemes is consistently less
> > than the benefit obtained by moving the data. Quite to the contrary. We
> > have over a decade's worth of research and attempts on this issue and there
> > was no general improvement to be had that way.
> >
> > The reason that the manual placement interfaces exist is because there was
> > no generally beneficial migration scheme available. The manual interfaces
> > allow the writing of various automatic migrations schemes in user space.
> >
> > If we can come up with something that is an improvement then let's go
> > this way, but I am skeptical.
>
> I generally find the patchset very interesting but I think like
> Christoph.
Christoph is correct that we have no concrete data on modern processors
for these patch sets. I did present some results back in '07 from a
4-node, 16-processor ia64 server. The slides from that presentation are
here:
http://mirror.linux.org.au/pub/linux.conf.au/2007/video/talks/197.pdf
Slide 18 shows the effects on stream benchmark execution [per pass] of
restoring the locality after a transient job [a parallel kernel build]
causes a flurry of load balancing. The streams jobs return to 'best
case' performance after the perturbation even tho' they started in a
less than optimal locality configuration.
Slide 29 shows the effects on a stand-alone parallel kernel build of
the patches--disabled and enabled, with and without auto-migration of
page cache pages. Not much change in real or user time.
Auto-migrating page cache pages chewed up a lot of system time for the
stand-alone kernel build because the shared pages of the tool chain
executables and libraries were thrashing badly. With auto-migration
of anon pages only, we see a slight [OK, tiny!] but repeatable
improvement in real time for a ~2% increase in system time.
Slide 30 shows, IMO, a more interesting result. On a heavily loaded
system with the stream benchmark running on all nodes, the interconnect
bandwidth becomes a precious resource, so locality matters more.
Comparing a parallel kernel build in this environment with automigration
[anon pages only] enabled vs disabled, I observed:
~18% improvement in real time
~4% improvement in user time
~21% improvement in system time
Slide 27 gives you an idea of what was happening during a parallel
kernel build. "Swap faults" on that slide are faults on anon pages that
have been moved to the migration cache by automigration. Those stats
were taken with ad hoc instrumentation. I've added some vmstats since
then.
So, if an objective is to pack more jobs [guest vms] on a single system,
one might suppose that we'd have a more heavily loaded system, perhaps
spending a lot of time in the kernel handling various faults. Something
like this approach might help, even on current-generation numa
platforms, altho' I'd expect more benefit on larger socket-count
systems. Should be testable.
>
> It's good to give the patchset more visibility as it's quite unique in
> this area, but when talking with Lee I also thought the synchronous
> migrate on fault was probably too aggressive and I like an algorithm
> where memory follows cpus and cpus follow memory in a total dynamic
> way.
>
> I suggested to Lee during our chat (and also to others during KS+Plumbers)
> that we need a more dynamic algorithm that works in the background
> asynchronously. Specifically I want the cpu to follow memory closely
> whenever idle status allows it (change cpu in context switch is cheap,
> I don't like pinning or "single" home node concept) and then memory
> slowly also in tandem follow cpu in the background with kernel
> thread. So that both having cpu follow memory fast, and memory follow
> cpu slow, eventually things over time should converge in a optimal
> behavior. I like the migration done from a kthread like
> khugepaged/ksmd, not synchronously adding latency to page fault (or
> having to take down ptes to trigger the migrate on fault, migrate
> never need to require the app to exit kernel and take a fault just to
> migrate, it happens transparently as far as userland is concerned,
> well of course unless it trips on the migration pte just at the wrong
> time :).
I don't know about the background migration thread. Christoph mentioned
the decades of research and attempts to address this issue. IMO, most
of these stumbled on the cost of collecting sufficient data to know what
pages to migrate where. And, if you don't know which pages to migrate,
you can end up doing a lot of work for little gain. I recall Christoph
or someone at SGI calling it "just too late" migration.
With lazy migration, we KNOW what pages the task is referencing in the
fault path, so we move only the pages actually needed right now, because
at any time the scheduler could decide to move the task to a different
node. I did add a "migration interval" control to experiment with
different minimum delays between inter-node migrations, to give a task
time to amortize the automigration overhead. Needs more
"experimentation".
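For anyone who hasn't read the patches, the fault-path decision is
conceptually this simple. This is only a rough sketch, not the actual
patch code; the two helpers marked "stand-in" are illustrative names,
not the real functions:

    /*
     * Sketch of the migrate-on-fault check, roughly where do_swap_page()
     * brings a page back in.  Illustrative only.
     */
    static struct page *maybe_migrate_misplaced(struct page *page,
                                                struct vm_area_struct *vma,
                                                unsigned long addr)
    {
            int target_nid;

            /* Only bother with pages no one else has mapped yet. */
            if (page_mapcount(page) != 0)
                    return page;

            /* Where does the applicable mempolicy want this page? */
            target_nid = policy_preferred_node(vma, addr);   /* stand-in */
            if (target_nid < 0 || target_nid == page_to_nid(page))
                    return page;     /* no policy node, or already local */

            /* Misplaced: pull it to the right node before mapping it. */
            return migrate_page_to_node(page, target_nid);   /* stand-in */
    }

The point is just that the page being faulted is, by definition, a page
the task needs right now, so the cost of the move is paid exactly where
the benefit is.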
>
> So the patchset looks very interesting, and it may actually be optimal
> for some slower hardware, but I've the perception these days the
> memory being remote isn't as a big deal as not keeping all two memory
> controllers in action simultaneously (using just one controller is
> worse than using both simultaneously from the wrong end, locality not
> as important as not stepping in each other toes). So in general
> synchronous migrate on fault seems a bit too aggressive to me and not
> ideal for newer hardware. Still this is one of the most interesting
> patchsets at this time in this area I've seen so far.
As I mentioned above, my results were on older hardware, so it will be
interesting to see the results on modern hardware with lots of guest vms
as the workload. Maybe I'll get to this eventually. And I believe
you're correct that on modern systems being remote is not a big deal if
the interconnect and target node are lightly loaded. But, in my
experience with recent 4 and 8-node servers, locality still matters very
much as load increases.
>
> The homenode logic ironically may be optimal with the most important
> bench because the way that bench is setup all vm are fairly small and
> there are plenty of them so it'll never happen that a vm has more
> memory than what can fit in the ram of a single node, but I like
> dynamic approach that works best in all environments, even if it's not
> clearly as simple and maybe not as optimal in the one relevant
> benchmark we care about. I'm unsure what the homenode is supposed to
> decide when the task has two three four times the ram that fits in a
> single node (and that may not be a so uncommon scenario after all).
> I admit not having read enough on this homenode logic, but I never got
> any attraction to it personally as there should never be any single
> "home" to any task in my view.
Well, as we discussed, now we have an implicit "home node" anyway: the
node where a task's kernel data structures are first allocated. A task
that spends much time in the kernel will always run faster on the node
where its task struct and thread info/stack live. So, until we can
migrate these [this was easier in Unix-based kernels], we'll always have
an implicit home node.
I'm attaching some statistics I collected while running a stress load on
the patches before posting them. The 'vmstress-stats' file includes a
description of the statistics. I've also attached a simple script to
watch the automigration stats if you decide to try out these patches.
Heads up: as I mentioned to Kosaki-san in other mail, lazy migration
[migrate-on-fault] incurs a null pointer deref in swap_cgroup_record()
in the most recent mmotm [09nov on 37-rc1]. The patches seem quite
robust [modulo a migration remove/duplicate race, I think, under heavy
load :(] on the mmotm version referenced in the patches: 03nov on
2.6.36. You may be able to find a copy from Andrew, but I've placed a
copy here:
http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.36-mmotm-101103-1217/mmotm-101103-1217.tar.gz
Regards,
Lee
[-- Attachment #2: vmstress-stats --]
[-- Type: text/plain, Size: 5136 bytes --]
Final stats for 58 minute usex vm stress workload.
pgs loc pages pages | tasks pages pages pages | ----------------mig cache------------------
checked misplacd migrated | migrated scanned selected failed | pgs added pgs removd duplicates refs freed
46496887 44500155 63046843 | 965933 348568709 159804782 595 | 151431996 151416378 187409069 338825351
46503015 44505977 63052665 | 967033 349001693 159996085 595 | 151616958 151602326 187615344 339217482
46508962 44511591 63123815 | 968090 349431720 160191652 595 | 151806705 151796445 187825762 339622018
46514565 44516837 63129061 | 969031 349818983 160368637 595 | 151984095 151972326 188023077 339994848
46520744 44522719 63200479 | 970066 350273208 160586754 595 | 152191950 152172881 188256406 340429177
46526660 44528377 63206137 | 971096 351226149 161308899 595 | 152427014 152369709 188512466 340882087
46533120 44534533 63212293 | 972074 351664143 161522889 595 | 152662157 152583427 188776833 341360178
46538985 44540194 63217953 | 973051 352104161 161738022 595 | 152891544 152785119 189029203 341814220
46808222 44809022 63748925 | 975330 352692128 161930205 596 | 153503983 153174539 189746717 342921179
47056161 45056479 63996383 | 978160 353338853 162087525 597 | 153632047 153543820 190009418 343553163
47375470 45375057 64380497 | 981041 354513459 162767074 598 | 154280022 154043105 190781347 344824378
47620387 45619074 64624514 | 983463 355193001 162993768 598 | 154468417 154452153 191055045 345507110
47626207 45624603 64630043 | 984479 355619380 163192799 598 | 154656878 154644741 191271746 345915993
47632616 45630721 64636160 | 985631 356053915 163378618 598 | 154844351 154830927 191478638 346309235
47638285 45635994 64641433 | 986591 356529658 163639764 598 | 155030094 155021407 191689679 346710556
47643792 45641209 64646649 | 987526 356924666 163822582 598 | 155219308 155193598 191908717 347092074
47649221 45646312 64651752 | 988476 357318754 164004331 598 | 155411758 155379515 192129976 347489210
47654343 45651155 64656595 | 989351 357724761 164207764 598 | 155636719 155567538 192410941 347926759
47660396 45656848 64924432 | 990406 358188870 164430730 598 | 155869117 155795886 192680512 348447071
47661993 45658349 64925933 | 990663 358289854 164475908 598 | 155921468 155921168 192736716 348657884
Migrate on Fault stats:
pgs loc checked -- pages found in swap/migration cache by do_swap_page()
with zero page_mapcount() and otherwise "stable".
pages misplacd -- of the "loc checked" pages, the number that were found
to be misplaced relative to the mempolicy in effect--vma, task
or system default.
pages migrated -- All pages migrated: migrate-on-fault, mbind(), ...
Exceeds the misplaced pages in the stats above because the
test load included programs that moved memory regions around
using mbind() with the MPOL_MF_MOVE flag [see the example below].
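For reference, those test programs are using nothing more exotic than
mbind(2). A minimal userspace example (node number and region size are
arbitrary here; build against libnuma with -lnuma) would be something
like:

    #include <numaif.h>          /* mbind(), MPOL_BIND, MPOL_MF_MOVE */
    #include <sys/mman.h>
    #include <string.h>
    #include <stdio.h>

    int main(void)
    {
            size_t len = 16 * 4096;
            void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (buf == MAP_FAILED)
                    return 1;

            memset(buf, 1, len);  /* fault the pages in wherever we run */

            /* Bind the region to node 1 and move already-faulted pages there. */
            unsigned long nodemask = 1UL << 1;
            if (mbind(buf, len, MPOL_BIND, &nodemask,
                      8 * sizeof(nodemask), MPOL_MF_MOVE) != 0) {
                    perror("mbind");
                    return 1;
            }
            return 0;
    }

Pages actually moved by such a call show up in "pages migrated" but not
in "pages misplacd", which is why the former runs ahead of the latter
in the table.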
Auto-migration Stats:
tasks migrated - number of internode task migrations. Each of these
migrations resulted in the task walking its address space
looking for anon pages in vmas with local allocation policy.
Includes kicks via /proc/<pid>/migrate.
pages scanned -- total number of pages examined as candidates for
auto-migration in mm/mempolicy.c:check_range() as a result
of internode task migration or /proc/<pid>/migrate scans.
pages selected -- Anon pages selected for auto-migration.
If lazy auto-migration is enabled [default], these pages
will be unmapped to allow migrate-on-fault to migrate
them if and when a task faults the page. If lazy auto-
migration is disabled, these pages will be directly
migrated [pulled] to the destination node [see the sketch below].
pages failed -- the number of the selected pages that the
kernel failed to unmap for lazy migration or failed
to direct migrate.
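In other words, the path behind those counters is roughly the loop
below. This is only a sketch of my description above; the helpers
marked illustrative are not real names, and the actual pte walk lives
in mm/mempolicy.c:check_range():

    /* Sketch only: illustrative helper names, not the patch code. */
    static void automigrate_task_mm(struct mm_struct *mm, int dest_nid,
                                    bool lazy)
    {
            struct vm_area_struct *vma;

            for (vma = mm->mmap; vma; vma = vma->vm_next) {
                    /* Only vmas whose policy is "allocate locally" matter. */
                    if (!vma_uses_local_alloc_policy(vma))   /* illustrative */
                            continue;

                    /*
                     * check_range() walks the ptes: every anon page examined
                     * counts as "scanned", the ones picked as "selected".
                     * Lazy mode just unmaps them so migrate-on-fault can
                     * move them later; direct mode pulls them to dest_nid.
                     */
                    if (lazy)
                            unmap_anon_pages_for_mof(vma);   /* illustrative */
                    else
                            pull_anon_pages_to_node(vma, dest_nid); /* illustrative */
            }
    }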
Migration Cache Statistics:
pgs added -- the number of pages added to the migration cache.
This occurs when pages are unmapped for lazy migration.
pgs removd -- the number of pages removed from the migration
cache. This occurs when the last pte referencing the
cache entry is replaced with a present page pte.
The number of pages added less the number removed
is the number of pages still in the cache.
duplicates -- count of migration_duplicate() calls, usually
via swap_duplicate(), to add a reference to a migration
cache entry. This occurs when a page in the migration
cache is unmapped in try_to_unmap_one() and when a task
with anon pages in the migration cache forks and all of
its anon pages become COW shared with the child in
copy_one_pte().
refs freed -- count of migration cache entry references freed,
usually via one of the swap cache free functions.
When the reference count on a migration cache entry
goes to zero, the entry is removed from the cache. Thus,
the number of pages added plus the number of duplicates
should equal the number of refs freed plus the number
of pages still in the cache [adds - removes].
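For what it's worth, that invariant checks out against the final sample
in the table above:

    pages added + duplicates        = 155921468 + 192736716 = 348658184
    refs freed  + [adds - removes]  = 348657884 + (155921468 - 155921168)
                                    = 348657884 + 300       = 348658184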
[-- Attachment #3: automig_stats --]
[-- Type: application/x-shellscript, Size: 1383 bytes --]