From: Mel Gorman <mgorman@suse.de>
To: Linux-MM <linux-mm@kvack.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>, Dave Hansen <dave@sr71.net>,
Christoph Lameter <cl@linux.com>,
LKML <linux-kernel@vger.kernel.org>, Mel Gorman <mgorman@suse.de>
Subject: [RFC PATCH 00/22] Per-cpu page allocator replacement prototype
Date: Wed, 8 May 2013 17:02:45 +0100 [thread overview]
Message-ID: <1368028987-8369-1-git-send-email-mgorman@suse.de> (raw)
Two LSF/MM's ago there was discussion on the per-cpu page allocator and
whether it could be replaced due to it's complexity, frequent drains/refills
and IPI overhead for global drains. The main obstacle to removal is that
without those lists the zone->lock becomes very heavily contended and
alternatives are inevitably going to share cache lines. I prototyped a
potential replacement on the flight home and then left it on a TODO list
for another year.
Last LSF/MM this was talked about in the hallway again although the meat
of the discussion was different and took into account Andi Kleen's talk
about lock batching. On this flight home, I rebased the old prototype,
added some additional bits and pieces and tidied it up a bit since. This
TODO item is about 3 years old so apparently sometimes the only way to
get someone to do something is to lock them in a metal box for a few hours.
This is a prototype replacement starts with some minor optimisations that
have nothing to do with anything really other than they were floating around
from another flights worth of work. It then replaces the per-cpu page
allocator with an IRQ-unsafe (no interrupts, no calls with local_irq_save)
magazine that is protected by a spinlock. This effectively has the allocator
use two locks. magazine_lock for non-IRQ users (e.g. page faults) and the
zone->lock for magazine drains/refills and users who have IRQs disabled
(interrupts, slab). It then splits the magazine into two where the preferred
magazine depends on the CPU id of the executing thread. Non-preferred
magazines may also be used and are searched round-robin as they are only
protected by spinlocks. The last part of the series does some mucking
around with lock contention and batching multiple frees due to exit or
compaction under the lock.
It has not been heavily tested in low memory, heavy interrupt situations
or extensively with all the debugging options enabled so it's likely
there are bugs hiding in there. However I'm interested in hearing if
the per-cpu page allocator is something we really want to replace or if
there is a better potential alternative than this prototype. There are
some interesting possibilities with this sort of design. For example, it
would be possible to allocate magazines to allocator-intensive processes
that are chained together for global drains where they are necessary. For
these processes there would be much less contention (only on drains/refills)
without having to use IPIs to drain their pages.
In the follow tests, no debugging was enabled but profiling was running
so the tests are heavily disrupted. Take the results with a grain of
salt. Machine was a single socket i7 with 16G RAM.
kernbench
3.9.0 3.9.0
vanilla magazine
User min 883.59 ( 0.00%) 700.16 ( 20.76%)
User mean 891.11 ( 0.00%) 741.02 ( 16.84%)
User stddev 12.14 ( 0.00%) 50.04 (-312.18%)
User max 915.30 ( 0.00%) 817.50 ( 10.69%)
User range 31.71 ( 0.00%) 117.34 (-270.04%)
System min 58.85 ( 0.00%) 43.65 ( 25.83%)
System mean 59.65 ( 0.00%) 47.65 ( 20.12%)
System stddev 1.35 ( 0.00%) 4.79 (-254.39%)
System max 62.35 ( 0.00%) 55.35 ( 11.23%)
System range 3.50 ( 0.00%) 11.70 (-234.29%)
Elapsed min 127.75 ( 0.00%) 99.37 ( 22.22%)
Elapsed mean 129.19 ( 0.00%) 105.73 ( 18.16%)
Elapsed stddev 2.07 ( 0.00%) 7.56 (-265.78%)
Elapsed max 133.25 ( 0.00%) 117.53 ( 11.80%)
Elapsed range 5.50 ( 0.00%) 18.16 (-230.18%)
CPU min 731.00 ( 0.00%) 742.00 ( -1.50%)
CPU mean 735.20 ( 0.00%) 745.60 ( -1.41%)
CPU stddev 2.99 ( 0.00%) 3.38 (-12.99%)
CPU max 739.00 ( 0.00%) 751.00 ( -1.62%)
CPU range 8.00 ( 0.00%) 9.00 (-12.50%)
Kernel build benchmark seemed fairly successful, if anything they seem
too good and I suspect this might have been a particularly "lucky" run.
A non-profiling run might reveal more.
pagealloc
3.9.0 3.9.0
vanilla magazine
order-0 alloc-1 671.11 ( 0.00%) 734.11 ( -9.39%)
order-0 alloc-2 520.33 ( 0.00%) 517.56 ( 0.53%)
order-0 alloc-4 426.56 ( 0.00%) 461.78 ( -8.26%)
order-0 alloc-8 666.33 ( 0.00%) 381.67 ( 42.72%)
order-0 alloc-16 341.78 ( 0.00%) 354.44 ( -3.71%)
order-0 alloc-32 336.89 ( 0.00%) 345.33 ( -2.51%)
order-0 alloc-64 331.33 ( 0.00%) 324.56 ( 2.05%)
order-0 alloc-128 324.56 ( 0.00%) 325.89 ( -0.41%)
order-0 alloc-256 350.44 ( 0.00%) 326.22 ( 6.91%)
order-0 alloc-512 369.33 ( 0.00%) 345.67 ( 6.41%)
order-0 alloc-1024 381.44 ( 0.00%) 352.67 ( 7.54%)
order-0 alloc-2048 387.50 ( 0.00%) 348.00 ( 10.19%)
order-0 alloc-4096 403.00 ( 0.00%) 384.50 ( 4.59%)
order-0 alloc-8192 413.00 ( 0.00%) 383.00 ( 7.26%)
order-0 alloc-16384 411.00 ( 0.00%) 396.80 ( 3.45%)
order-0 free-1 357.22 ( 0.00%) 458.89 (-28.46%)
order-0 free-2 285.11 ( 0.00%) 349.89 (-22.72%)
order-0 free-4 231.33 ( 0.00%) 296.00 (-27.95%)
order-0 free-8 371.56 ( 0.00%) 229.33 ( 38.28%)
order-0 free-16 189.78 ( 0.00%) 212.89 (-12.18%)
order-0 free-32 185.67 ( 0.00%) 206.11 (-11.01%)
order-0 free-64 178.44 ( 0.00%) 197.78 (-10.83%)
order-0 free-128 178.11 ( 0.00%) 197.00 (-10.61%)
order-0 free-256 227.89 ( 0.00%) 196.56 ( 13.75%)
order-0 free-512 280.11 ( 0.00%) 194.67 ( 30.50%)
order-0 free-1024 301.67 ( 0.00%) 227.50 ( 24.59%)
order-0 free-2048 325.50 ( 0.00%) 238.00 ( 26.88%)
order-0 free-4096 328.00 ( 0.00%) 278.25 ( 15.17%)
order-0 free-8192 337.50 ( 0.00%) 277.00 ( 17.93%)
order-0 free-16384 338.00 ( 0.00%) 285.20 ( 15.62%)
order-0 total-1 1031.89 ( 0.00%) 1197.22 (-16.02%)
order-0 total-2 805.44 ( 0.00%) 867.44 ( -7.70%)
order-0 total-4 657.89 ( 0.00%) 768.67 (-16.84%)
order-0 total-8 1039.56 ( 0.00%) 611.00 ( 41.22%)
order-0 total-16 531.56 ( 0.00%) 569.33 ( -7.11%)
order-0 total-32 523.44 ( 0.00%) 551.44 ( -5.35%)
order-0 total-64 509.78 ( 0.00%) 522.89 ( -2.57%)
order-0 total-128 502.67 ( 0.00%) 522.89 ( -4.02%)
order-0 total-256 579.56 ( 0.00%) 522.78 ( 9.80%)
order-0 total-512 649.44 ( 0.00%) 540.33 ( 16.80%)
order-0 total-1024 686.44 ( 0.00%) 580.17 ( 15.48%)
order-0 total-2048 713.00 ( 0.00%) 586.00 ( 17.81%)
order-0 total-4096 731.00 ( 0.00%) 663.75 ( 9.20%)
order-0 total-8192 750.50 ( 0.00%) 660.00 ( 12.06%)
order-0 total-16384 749.00 ( 0.00%) 683.00 ( 8.81%)
This is a systemtap-drive page allocator microbenchmark that allocates
order-0 pages in increasingly large "batches" and times the length of
time it takes to allocate and free. The results are a bit all over the
map. Frees are slower because they always wait for the preferred magazine
to be free and freeing single pages in a loop is a worst-case scenario.
For larger patches it tends to perform reasonably well though.
pft
3.9.0 3.9.0
vanilla magazine
Faults/cpu 1 953441.5530 ( 0.00%) 954011.8576 ( 0.06%)
Faults/cpu 2 923793.1533 ( 0.00%) 889436.7293 ( -3.72%)
Faults/cpu 3 876829.2292 ( 0.00%) 868230.6471 ( -0.98%)
Faults/cpu 4 819914.9333 ( 0.00%) 735264.4793 (-10.32%)
Faults/cpu 5 689049.0107 ( 0.00%) 663481.5446 ( -3.71%)
Faults/cpu 6 579924.4065 ( 0.00%) 571889.3687 ( -1.39%)
Faults/cpu 7 552024.1040 ( 0.00%) 494345.8698 (-10.45%)
Faults/cpu 8 461452.7560 ( 0.00%) 457877.1810 ( -0.77%)
Faults/sec 1 938245.7112 ( 0.00%) 939198.4047 ( 0.10%)
Faults/sec 2 1814498.4087 ( 0.00%) 1748800.3800 ( -3.62%)
Faults/sec 3 2544466.6368 ( 0.00%) 2359068.2163 ( -7.29%)
Faults/sec 4 3032778.8584 ( 0.00%) 2831753.6553 ( -6.63%)
Faults/sec 5 3025180.2736 ( 0.00%) 2952758.4589 ( -2.39%)
Faults/sec 6 3131131.0106 ( 0.00%) 3058954.4941 ( -2.31%)
Faults/sec 7 3286271.0631 ( 0.00%) 3183931.7940 ( -3.11%)
Faults/sec 8 3135331.0027 ( 0.00%) 3106746.5908 ( -0.91%)
This is a page faulting microbenchmark that is forced to use base pages only.
Here the new design suffers a bit because the allocation path is likely
to contend on the magazine lock.
So preliminary testing indicates the results are mixed bag. As long as
locks are not contended, it performs fine but parallel fault testing
hits into spinlock contention on the magazine locks. A greater problem
is that because CPUs share magazines it means that the struct pages are
frequently dirtied cache lines. If CPU A frees a page to a magazine and
CPU B immediately allocates it then the cache line for the page and the
magazine bounces and this costs. It's on the TODO list to research if the
available literature has anything useful to say that does not depend on
per-cpu lists and the associated problems with them.
Comments?
arch/sparc/mm/init_64.c | 4 +-
arch/sparc/mm/tsb.c | 2 +-
arch/tile/mm/homecache.c | 2 +-
fs/fuse/dev.c | 2 +-
include/linux/gfp.h | 12 +-
include/linux/mm.h | 3 -
include/linux/mmzone.h | 46 +-
include/linux/page-isolation.h | 7 +-
include/linux/pagemap.h | 2 +-
include/linux/swap.h | 2 +-
include/trace/events/kmem.h | 22 +-
init/main.c | 1 -
kernel/power/snapshot.c | 2 -
kernel/sysctl.c | 10 -
mm/compaction.c | 18 +-
mm/memory-failure.c | 2 +-
mm/memory_hotplug.c | 13 +-
mm/page_alloc.c | 1109 +++++++++++++++++-----------------------
mm/page_isolation.c | 30 +-
mm/rmap.c | 2 +-
mm/swap.c | 6 +-
mm/swap_state.c | 2 +-
mm/vmscan.c | 6 +-
mm/vmstat.c | 54 +-
24 files changed, 571 insertions(+), 788 deletions(-)
--
1.8.1.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next reply other threads:[~2013-05-08 16:03 UTC|newest]
Thread overview: 33+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-05-08 16:02 Mel Gorman [this message]
2013-05-08 16:02 ` [PATCH 01/22] mm: page allocator: Lookup pageblock migratetype with IRQs enabled during free Mel Gorman
2013-05-08 16:02 ` [PATCH 02/22] mm: page allocator: Push down where IRQs are disabled during page free Mel Gorman
2013-05-08 16:02 ` [PATCH 03/22] mm: page allocator: Use unsigned int for order in more places Mel Gorman
2013-05-08 16:02 ` [PATCH 04/22] mm: page allocator: Only check migratetype of pages being drained while CMA active Mel Gorman
2013-05-08 16:02 ` [PATCH 05/22] oom: Use number of online nodes when deciding whether to suppress messages Mel Gorman
2013-05-08 16:02 ` [PATCH 06/22] mm: page allocator: Convert hot/cold parameter and immediate callers to bool Mel Gorman
2013-05-08 16:02 ` [PATCH 07/22] mm: page allocator: Do not lookup the pageblock migratetype during allocation Mel Gorman
2013-05-08 16:02 ` [PATCH 08/22] mm: page allocator: Remove the per-cpu page allocator Mel Gorman
2013-05-08 16:02 ` [PATCH 09/22] mm: page allocator: Allocate/free order-0 pages from a per-zone magazine Mel Gorman
2013-05-08 18:41 ` Christoph Lameter
2013-05-09 15:23 ` Mel Gorman
2013-05-09 16:21 ` Christoph Lameter
2013-05-09 17:27 ` Mel Gorman
2013-05-09 18:08 ` Christoph Lameter
2013-05-08 16:02 ` [PATCH 10/22] mm: page allocator: Allocate and free pages from magazine in batches Mel Gorman
2013-05-08 16:02 ` [PATCH 11/22] mm: page allocator: Shrink the magazine to the migratetypes in use Mel Gorman
2013-05-08 16:02 ` [PATCH 12/22] mm: page allocator: Remove knowledge of hot/cold from page allocator Mel Gorman
2013-05-08 16:02 ` [PATCH 13/22] mm: page allocator: Use list_splice to refill the magazine Mel Gorman
2013-05-08 16:02 ` [PATCH 14/22] mm: page allocator: Do not disable IRQs just to update stats Mel Gorman
2013-05-08 16:03 ` [PATCH 15/22] mm: page allocator: Check if interrupts are enabled only once per allocation attempt Mel Gorman
2013-05-08 16:03 ` [PATCH 16/22] mm: page allocator: Remove coalescing improvement heuristic during page free Mel Gorman
2013-05-08 16:03 ` [PATCH 17/22] mm: page allocator: Move magazine access behind accessors Mel Gorman
2013-05-08 16:03 ` [PATCH 18/22] mm: page allocator: Split magazine lock in two to reduce contention Mel Gorman
2013-05-09 15:21 ` Dave Hansen
2013-05-15 19:44 ` Andi Kleen
2013-05-08 16:03 ` [PATCH 19/22] mm: page allocator: Watch for magazine and zone lock contention Mel Gorman
2013-05-08 16:03 ` [PATCH 20/22] mm: page allocator: Hold magazine lock for a batch of pages Mel Gorman
2013-05-08 16:03 ` [PATCH 21/22] mm: compaction: Release free page list under a batched magazine lock Mel Gorman
2013-05-08 16:03 ` [PATCH 22/22] mm: page allocator: Drain magazines for direct compact failures Mel Gorman
2013-05-09 15:41 ` [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Dave Hansen
2013-05-09 16:25 ` Christoph Lameter
2013-05-09 17:33 ` Mel Gorman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1368028987-8369-1-git-send-email-mgorman@suse.de \
--to=mgorman@suse.de \
--cc=cl@linux.com \
--cc=dave@sr71.net \
--cc=hannes@cmpxchg.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).