* (no subject)
@ 2012-10-04 16:50 Andrea Arcangeli
2012-10-04 18:17 ` your mail Christoph Lameter
0 siblings, 1 reply; 9+ messages in thread
From: Andrea Arcangeli @ 2012-10-04 16:50 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
Dan Smith, Thomas Gleixner, Paul Turner, Suresh Siddha,
Mike Galbraith, Paul E. McKenney
Subject: Re: [PATCH 29/33] autonuma: page_autonuma
In-Reply-To: <0000013a2c223da2-632aa43e-21f8-4abd-a0ba-2e1b49881e3a-000000@email.amazonses.com>
Hi Christoph,
On Thu, Oct 04, 2012 at 02:16:14PM +0000, Christoph Lameter wrote:
> On Thu, 4 Oct 2012, Andrea Arcangeli wrote:
>
> > Move the autonuma_last_nid from the "struct page" to a separate
> > page_autonuma data structure allocated in the memsection (with
> > sparsemem) or in the pgdat (with flatmem).
>
> Note that there is an available word in struct page before the autonuma
> patches on x86_64 with CONFIG_HAVE_ALIGNED_STRUCT_PAGE.
>
> In fact the page_autonuma data fills up the structure so it fits nicely
> in one 64 byte cacheline.
Good point indeed.
So we could drop page_autonuma by creating a CONFIG_SLUB=y dependency
(AUTONUMA wouldn't be available in the kernel config if SLAB=y, and it
also wouldn't be available on 32bit archs, but the latter isn't a
problem).
I think it's a reasonable alternative to page_autonuma. Certainly it
looks more appealing than taking over 16 precious bits from
page->flags. There are still pros and cons. I'm neutral on it so more
comments would be welcome ;).
Andrea
PS. I randomly moved some people from Cc over to Bcc as I overflowed the
max header size allowed on linux-kernel, oops!
* Re: your mail
2012-10-04 16:50 Andrea Arcangeli
@ 2012-10-04 18:17 ` Christoph Lameter
2012-10-04 18:38 ` [PATCH 29/33] autonuma: page_autonuma Andrea Arcangeli
0 siblings, 1 reply; 9+ messages in thread
From: Christoph Lameter @ 2012-10-04 18:17 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
Dan Smith, Thomas Gleixner, Paul Turner, Suresh Siddha,
Mike Galbraith, Paul E. McKenney

On Thu, 4 Oct 2012, Andrea Arcangeli wrote:

> So we could drop page_autonuma by creating a CONFIG_SLUB=y dependency
> (AUTONUMA wouldn't be available in the kernel config if SLAB=y, and it
> also wouldn't be available on 32bit archs, but the latter isn't a
> problem).

Nope, it should depend on page struct alignment. Other kernel subsystems
may be depending on page struct alignment in the future (and some other
arches may already have that requirement).
* Re: [PATCH 29/33] autonuma: page_autonuma
2012-10-04 18:17 ` your mail Christoph Lameter
@ 2012-10-04 18:38 ` Andrea Arcangeli
2012-10-04 19:11 ` Christoph Lameter
0 siblings, 1 reply; 9+ messages in thread
From: Andrea Arcangeli @ 2012-10-04 18:38 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
Dan Smith, Thomas Gleixner, Paul Turner, Suresh Siddha,
Mike Galbraith, Paul E. McKenney

Hi Christoph,

On Thu, Oct 04, 2012 at 06:17:37PM +0000, Christoph Lameter wrote:
> On Thu, 4 Oct 2012, Andrea Arcangeli wrote:
>
> > So we could drop page_autonuma by creating a CONFIG_SLUB=y dependency
> > (AUTONUMA wouldn't be available in the kernel config if SLAB=y, and it
> > also wouldn't be available on 32bit archs, but the latter isn't a
> > problem).
>
> Nope, it should depend on page struct alignment. Other kernel subsystems
> may be depending on page struct alignment in the future (and some other
> arches may already have that requirement).

But currently only SLUB on x86 64bit selects CONFIG_HAVE_ALIGNED_STRUCT_PAGE:

arch/Kconfig:config HAVE_ALIGNED_STRUCT_PAGE
arch/x86/Kconfig:	select HAVE_ALIGNED_STRUCT_PAGE if SLUB && !M386
include/linux/mm_types.h:	defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
include/linux/mm_types.h:#ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE
mm/slub.c:	defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
mm/slub.c:	defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
mm/slub.c:	defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)

So in practice a dependency on CONFIG_HAVE_ALIGNED_STRUCT_PAGE would
still mean the same: only available when SLUB enables it, and only on
x86 64bit (ppc64?).

If you mean CONFIG_AUTONUMA=y should select (not depend on)
CONFIG_HAVE_ALIGNED_STRUCT_PAGE, that would allow enabling it in all
.configs, but it would have a worse con: losing 8 bytes per page
unconditionally (even when booting on non-NUMA hardware).

The current page_autonuma solution is substantially memory-cheaper
than selecting CONFIG_HAVE_ALIGNED_STRUCT_PAGE: it allocates 2 bytes
per page at boot time, but only if booting on real NUMA hardware
(without altering the page structure). So it still looks like quite a
decent tradeoff to me.

Thanks,
Andrea
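For readers following along: the alignment Christoph refers to comes
from the attribute at the end of the struct page definition. A sketch
of the 3.6-era include/linux/mm_types.h code (from memory; details may
differ slightly):

    struct page {
    	/* ... fields ... */
    }
    #ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE
    	/*
    	 * SLUB's cmpxchg_double on struct page requires the whole
    	 * struct to be double-word aligned; the padding this creates
    	 * is the "available word" discussed above.
    	 */
    	__aligned(2 * sizeof(unsigned long))
    #endif
    ;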
* Re: [PATCH 29/33] autonuma: page_autonuma
2012-10-04 18:38 ` [PATCH 29/33] autonuma: page_autonuma Andrea Arcangeli
@ 2012-10-04 19:11 ` Christoph Lameter
2012-10-05 11:11 ` Andrea Arcangeli
0 siblings, 1 reply; 9+ messages in thread
From: Christoph Lameter @ 2012-10-04 19:11 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
Dan Smith, Thomas Gleixner, Paul Turner, Suresh Siddha,
Mike Galbraith, Paul E. McKenney

On Thu, 4 Oct 2012, Andrea Arcangeli wrote:

> If you mean CONFIG_AUTONUMA=y should select (not depend on)
> CONFIG_HAVE_ALIGNED_STRUCT_PAGE, that would allow enabling it in all
> .configs, but it would have a worse con: losing 8 bytes per page
> unconditionally (even when booting on non-NUMA hardware).

I did not say anything like that. I am still not convinced that autonuma
is worth doing, or that it is beneficial given the complexity it adds to
the kernel. I just wanted to point out that there is a case to be made
for adding another word to the page struct.
* Re: [PATCH 29/33] autonuma: page_autonuma
2012-10-04 19:11 ` Christoph Lameter
@ 2012-10-05 11:11 ` Andrea Arcangeli
0 siblings, 0 replies; 9+ messages in thread
From: Andrea Arcangeli @ 2012-10-05 11:11 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
Dan Smith, Thomas Gleixner, Paul Turner, Suresh Siddha,
Mike Galbraith, Paul E. McKenney

Hi Christoph,

On Thu, Oct 04, 2012 at 07:11:51PM +0000, Christoph Lameter wrote:
> I did not say anything like that. I am still not convinced that autonuma
> is worth doing, or that it is beneficial given the complexity it adds to
> the kernel. I just wanted to point out that there is a case to be made
> for adding another word to the page struct.

You've seen the benchmarks: no other solution that exists today solves
all those cases, and not one benchmark showed a regression compared to
upstream. Running that much faster is very beneficial in my view.

Expecting the admin of a 2 socket system to use hard bindings manually
is unrealistic; it's unrealistic even for a 4 socket system. If you've
got a 512 node system, well, then you can afford to set everything up
manually and boot with noautonuma, no argument about that.

About the complexity, well, there's no simple solution to a hard
problem. The proof comes from the schednuma crowd, which is currently
copying the AutoNUMA scheduler cpu-follow-memory design at full force
as we speak.

Thanks,
Andrea
* [PATCH 00/33] AutoNUMA27
@ 2012-10-03 23:50 Andrea Arcangeli
2012-10-03 23:51 ` [PATCH 29/33] autonuma: page_autonuma Andrea Arcangeli
0 siblings, 1 reply; 9+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:50 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt
Hello everyone,
This is a new AutoNUMA27 release for Linux v3.6.
I believe that this autonuma version answers all of the review
comments I got upstream. This patch set has undergone a huge series of
changes that includes changing the page migration implementation to
synchronous, reduction of the memory overhead to a minimum, internal
documentation, external documentation and benchmarking. I'm grateful
for all the reviews and contributions, including those from Rik,
Karen, Avi, Peter, Konrad, Hillf and all others, plus all the runtime
feedback received (bug reports, KVM benchmarks, etc.).
The last 4 months were fully dedicated to answer the upstream review.
Linus, Andrew, please review: as the handful of performance results
show, we're in excellent shape for inclusion. Further changes such as
transparent huge page native migration and more are expected, but at
this point I would ask you to accept the current series; further
changes will be added in traditional gradual steps.
====
The objective of AutoNUMA is to provide out-of-the-box performance as
close as possible to (and potentially faster than) manual NUMA hard
bindings.
It is not very intrusive into the kernel core and is well structured
into separate source modules.
AutoNUMA was extensively tested against 3.x upstream kernels and other
NUMA placement algorithms such as numad (in userland through cpusets)
and schednuma (in kernel too) and was found superior in all cases.
Most importantly: not a single benchmark has shown a regression yet
when compared to vanilla kernels. Not even on the 2 node systems where
the NUMA effects are less significant.
=== Some benchmark result ===
Key to the kernels used in the testing:
- 3.6.0 = upstream 3.6.0 kernel
- 3.6.0numactl = 3.6.0 kernel with numactl hard NUMA bindings
- autonuma26MoF = previous autonuma version, based on a 3.6.0-rc7 kernel
== specjbb multi instance, 4 nodes, 4 instances ==
autonuma26MoF outperforms 3.6.0 by 11%, while 3.6.0numactl provides an
additional 9% increase.
3.6.0numactl:
Per-node process memory usage (in MBs):
PID N0 N1 N2 N3
---------- ---------- ---------- ---------- ----------
38901 3075.56 0.54 0.07 7.53
38902 1.31 0.54 3065.37 7.53
38903 1.31 0.54 0.07 3070.10
38904 1.31 3064.56 0.07 7.53
autonuma26MoF:
Per-node process memory usage (in MBs):
PID N0 N1 N2 N3
---------- ---------- ---------- ---------- ----------
9704 94.85 2862.37 50.86 139.35
9705 61.51 20.05 2963.78 40.62
9706 2941.80 11.68 104.12 7.70
9707 35.02 10.62 9.57 3042.25
== specjbb multi instance, 4 nodes, 8 instances (x2 CPU overcommit) ==
This verifies AutoNUMA converges with x2 overcommit too.
autonuma26MoF nmstat every 10sec:
Per-node process memory usage (in MBs):
PID N0 N1 N2 N3
---------- ---------- ---------- ---------- ----------
7410 335.48 2369.66 194.18 191.28
7411 50.09 100.95 2935.93 56.50
7412 2907.98 66.71 33.71 68.93
7413 46.70 31.59 24.24 2974.60
7426 1493.34 1156.18 221.60 217.93
7427 398.18 176.94 269.14 2237.49
7428 1028.12 1471.29 202.76 366.44
7430 126.81 451.92 2270.37 242.75
Per-node process memory usage (in MBs):
PID N0 N1 N2 N3
---------- ---------- ---------- ---------- ----------
7410 4.09 3047.02 20.87 18.79
7411 24.11 75.70 3012.76 32.99
7412 3061.95 28.88 13.70 36.88
7413 12.71 7.56 14.18 3042.85
7426 2521.48 402.80 87.61 77.32
7427 148.09 79.34 87.43 2767.11
7428 279.48 2598.05 71.96 119.30
7430 25.45 109.46 2912.09 45.03
Per-node process memory usage (in MBs):
PID N0 N1 N2 N3
---------- ---------- ---------- ---------- ----------
7410 2.09 3057.18 16.88 14.78
7411 8.13 4.96 3111.52 21.01
7412 3115.94 6.91 7.71 10.92
7413 10.23 3.53 4.20 3059.49
7426 2982.48 63.19 32.25 11.41
7427 68.05 21.32 47.80 2944.93
7428 65.80 2931.43 45.93 25.73
7430 13.56 49.91 3007.72 20.99
Per-node process memory usage (in MBs):
PID N0 N1 N2 N3
---------- ---------- ---------- ---------- ----------
7410 2.08 3128.38 15.55 9.05
7411 6.13 0.96 3119.53 19.14
7412 3124.12 3.03 5.56 8.92
7413 8.27 4.91 5.61 3130.11
7426 3035.93 7.08 17.30 29.37
7427 24.12 6.89 7.85 3043.63
7428 13.77 3022.68 23.95 8.94
7430 2.25 39.51 3044.04 6.68
== specjbb, 4 nodes, 4 instances, but start instance 1 and 2 first,
wait for them to converge, then start instance 3 and 4 under numactl
over the nodes that AutoNUMA picked to converge instance 1 and 2 ==
This verifies AutoNUMA plays along nicely with NUMA hard binding
syscalls.
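The exact commands aren't shown here, but the hard-binding half of the
test would look something like this (hypothetical invocation; the node
numbers match the nmstat output below, and run_specjbb_instanceN stands
in for whatever launcher starts one SPECjbb instance):

numactl --cpunodebind=1 --membind=1 run_specjbb_instance3 &
numactl --cpunodebind=3 --membind=3 run_specjbb_instance4 &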
autonuma26MoF nmstat every 10sec:
Per-node process memory usage (in MBs):
PID N0 N1 N2 N3
---------- ---------- ---------- ---------- ----------
7756 426.33 1171.21 470.66 1063.76
7757 1254.48 152.09 1415.17 244.25
Per-node process memory usage (in MBs):
PID N0 N1 N2 N3
---------- ---------- ---------- ---------- ----------
7756 342.42 1070.75 364.70 1354.14
7757 1260.54 152.10 1411.19 242.29
7883 4.30 2915.12 2.93 0.00
7884 4.30 2.21 2919.59 0.02
Per-node process memory usage (in MBs):
PID N0 N1 N2 N3
---------- ---------- ---------- ---------- ----------
7756 318.39 1036.31 348.68 1428.66
7757 1733.25 96.77 1075.89 160.24
7883 4.30 2975.99 2.93 0.00
7884 4.30 2.21 2989.96 0.02
Per-node process memory usage (in MBs):
PID N0 N1 N2 N3
---------- ---------- ---------- ---------- ----------
7756 35.22 42.48 18.96 3035.60
7757 3027.93 6.63 25.67 6.21
7883 4.30 3064.35 2.93 0.00
7884 4.30 2.21 3074.38 0.02
From the last nmstat we can't even tell which pids were run under
numactl and which were not. You can only tell by reading the first
nmstat: pid 7756 and 7757 were the two processes not run under
numactl.
pid 7756 and 7757 memory and CPUs were decided by AutoNUMA.
pid 7883 and 7884 never ran outside of node N1 and N3 respectively
because of the numactl binds.
== stream modified to run each instance for ~5min ==
Objective: compare autonuma26MoF against itself with CPU and NUMA
bindings
By running 1/4/8/16/32 tasks, we also verified that the idle balancing
is done well, maxing out all memory bandwidth.
Result is "PASS" if the performance of the kernel without bindings is
within -10% and +5% of CPU and NUMA bindings.
upstream result is FAIL (worst DIFF is -33%, best DIFF is +1%).
autonuma26MoF result is PASS (worst DIFF is -2%, best DIFF is +2%).
The autonuma26MoF raw numbers for this test are appended at the end
of this email.
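The PASS criterion boils down to a percentage comparison; a sketch of
the check (my reconstruction of the harness logic per the acceptance
ranges at the end of this email, not the harness's actual code):

/*
 * diff_pct = (default - bound) / bound * 100; negative means the
 * kernel without bindings is slower than the CPU/NUMA-bound run.
 */
static const char *stream_verdict(double diff_pct)
{
	if (diff_pct < -25.0 || diff_pct > 10.0)
		return "FAIL";
	if (diff_pct < -10.0 || diff_pct > 5.0)
		return "WARN";
	return "PASS";
}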
== iozone ==
ALL INIT RE RE RANDOM RANDOM BACKWD RECRE STRIDE F FRE F FRE
FILE TYPE (KB) IOS WRITE WRITE READ READ READ WRITE READ WRITE READ WRITE WRITE READ READ
====--------------------------------------------------------------------------------------------------------------
noautonuma ALL 2492 1224 1874 2699 3669 3724 2327 2638 4091 3525 1142 1692 2668 3696
autonuma ALL 2531 1221 1886 2732 3757 3760 2380 2650 4192 3599 1150 1731 2712 3825
AutoNUMA can't help much with I/O loads, but you can see a small
improvement there too. The important thing for I/O loads is to verify
that there is no regression.
== autonuma benchmark 2 nodes & 8 nodes ==
http://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma-vs-sched-numa-rewrite-20120817.pdf
== autonuma27 ==
git clone --reference linux -b autonuma27 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
Real time updated development autonuma branch:
git clone --reference linux -b autonuma git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
To update:
git fetch && git checkout -f origin/autonuma
Andrea Arcangeli (32):
autonuma: make set_pmd_at always available
autonuma: export is_vma_temporary_stack() even if
CONFIG_TRANSPARENT_HUGEPAGE=n
autonuma: define _PAGE_NUMA
autonuma: pte_numa() and pmd_numa()
autonuma: teach gup_fast about pmd_numa
autonuma: mm_autonuma and task_autonuma data structures
autonuma: define the autonuma flags
autonuma: core autonuma.h header
autonuma: CPU follows memory algorithm
autonuma: add the autonuma_last_nid in the page structure
autonuma: Migrate On Fault per NUMA node data
autonuma: autonuma_enter/exit
autonuma: call autonuma_setup_new_exec()
autonuma: alloc/free/init task_autonuma
autonuma: alloc/free/init mm_autonuma
autonuma: prevent select_task_rq_fair to return -1
autonuma: teach CFS about autonuma affinity
autonuma: memory follows CPU algorithm and task/mm_autonuma stats
collection
autonuma: default mempolicy follow AutoNUMA
autonuma: call autonuma_split_huge_page()
autonuma: make khugepaged pte_numa aware
autonuma: retain page last_nid information in khugepaged
autonuma: split_huge_page: transfer the NUMA type from the pmd to the
pte
autonuma: numa hinting page faults entry points
autonuma: reset autonuma page data when pages are freed
autonuma: link mm/autonuma.o and kernel/sched/numa.o
autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED
autonuma: page_autonuma
autonuma: bugcheck page_autonuma fields on newly allocated pages
autonuma: boost khugepaged scanning rate
autonuma: add migrate_allow_first_fault knob in sysfs
autonuma: add mm_autonuma working set estimation
Karen Noel (1):
autonuma: add Documentation/vm/autonuma.txt
Documentation/vm/autonuma.txt | 364 +++++++++
arch/Kconfig | 3 +
arch/x86/Kconfig | 1 +
arch/x86/include/asm/paravirt.h | 2 -
arch/x86/include/asm/pgtable.h | 65 ++-
arch/x86/include/asm/pgtable_types.h | 20 +
arch/x86/mm/gup.c | 13 +-
fs/exec.c | 7 +
include/asm-generic/pgtable.h | 12 +
include/linux/autonuma.h | 57 ++
include/linux/autonuma_flags.h | 159 ++++
include/linux/autonuma_sched.h | 59 ++
include/linux/autonuma_types.h | 126 +++
include/linux/huge_mm.h | 6 +-
include/linux/mm_types.h | 5 +
include/linux/mmzone.h | 23 +
include/linux/page_autonuma.h | 50 ++
include/linux/sched.h | 3 +
init/main.c | 2 +
kernel/fork.c | 18 +
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 82 ++-
kernel/sched/numa.c | 638 +++++++++++++++
kernel/sched/sched.h | 19 +
mm/Kconfig | 17 +
mm/Makefile | 1 +
mm/autonuma.c | 1414 ++++++++++++++++++++++++++++++++++
mm/huge_memory.c | 96 +++-
mm/memory.c | 10 +
mm/mempolicy.c | 12 +-
mm/mmu_context.c | 3 +
mm/page_alloc.c | 7 +-
mm/page_autonuma.c | 237 ++++++
mm/sparse.c | 126 +++-
35 files changed, 3631 insertions(+), 28 deletions(-)
create mode 100644 Documentation/vm/autonuma.txt
create mode 100644 include/linux/autonuma.h
create mode 100644 include/linux/autonuma_flags.h
create mode 100644 include/linux/autonuma_sched.h
create mode 100644 include/linux/autonuma_types.h
create mode 100644 include/linux/page_autonuma.h
create mode 100644 kernel/sched/numa.c
create mode 100644 mm/autonuma.c
create mode 100644 mm/page_autonuma.c
== Changelog from AutoNUMA24 to AutoNUMA27 ==
o Migrate On Fault
At the mm mini summit some discussion happened about the real need
of asynchronous migration in AutoNUMA. Peter pointed out
asynchronous migration could be removed without adverse performance
effects and that would save lots of memory.
So over the last few weeks asynchronous migration was removed and
replaced with an ad-hoc Migrate On Fault implementation (one that
doesn't require altering the migrate.c API).
All CPU/memory NUMA placement decisions remained identical: the
only change is that instead of adding a page to a migration LRU
list and returning to userland immediately, AutoNUMA is calling
migrate_pages() before returning to userland.
Peter was right: we found Migrate On Fault didn't degrade
performance significantly. Migrate On Fault seems more cache
friendly too.
Also note: after the workload converged, all memory migration stops
so it cannot make any difference after that.
With Migrate On Fault, the memory cost of AutoNUMA has been reduced
to 2 bytes per page.
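In pseudo-C, the control-flow change amounts to the following (a
minimal sketch under the 3.6-era migrate.c API, not the actual patch;
should_migrate() is a hypothetical stand-in for the AutoNUMA
placement policy, while alloc_migrate_dst_page() is the allocator
used by this series):

/*
 * Sketch of the Migrate On Fault flow inside the NUMA hinting page
 * fault. Illustrative only: error handling, the dropped VM locks and
 * the real placement decision are omitted.
 */
static void numa_hinting_fault_sketch(struct page *page, int this_nid)
{
	LIST_HEAD(migratepages);

	if (!should_migrate(page, this_nid))	/* hypothetical policy hook */
		return;
	/*
	 * Old scheme: queue the page on a per-node migration LRU and
	 * let knuma_migrated move it asynchronously. New scheme:
	 * migrate synchronously before returning to userland, reusing
	 * the unmodified migrate.c API.
	 */
	if (isolate_lru_page(page))
		return;
	list_add(&page->lru, &migratepages);
	migrate_pages(&migratepages, alloc_migrate_dst_page,
		      this_nid, false, MIGRATE_SYNC);
}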
o Share the same pmd/pte bitflag (8) for both _PAGE_PROTNONE and
_PAGE_NUMA. This means pte_numa/pmd_numa cannot be used anymore in
code paths where mprotect(PROT_NONE) faults could trigger. Luckily
the paths are mutually exclusive and mprotect(PROT_NONE) regions
cannot reach handle_mm_fault(), so no special checks on the
vma->vm_page_prot are required to find out if it's a pte/pmd_numa or
an mprotect(PROT_NONE).
This doesn't provide any runtime benefit but it leaves _PAGE_PAT
free for different usage in the future, so it looks cleaner.
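On x86 this boils down to something like the following defines (a
sketch; the actual patch may spell it differently, but mainline does
define _PAGE_BIT_PROTNONE as _PAGE_BIT_GLOBAL, which is bit 8 and is
only meaningful on non-present entries):

/* arch/x86/include/asm/pgtable_types.h -- sketch, not the exact patch */
#define _PAGE_BIT_PROTNONE	_PAGE_BIT_GLOBAL /* bit 8, non-present only */
#define _PAGE_BIT_NUMA		_PAGE_BIT_PROTNONE /* shared: uses never overlap */
#define _PAGE_NUMA		(_AT(pteval_t, 1) << _PAGE_BIT_NUMA)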
o New overview document added in Documentation/vm/autonuma.txt
o Lockless NUMA hinting page faults.
Migrate On Fault needs to block and schedule within the context of
the NUMA hinting page faults. So the VM locks must be dropped
before the NUMA hinting page fault starts.
This is a worthwhile change for the asynchronous migration code
too, and it's included in an unofficial "dead" autonuma26 branch
(the last release with asynchronous migration).
o kmap bugfix for 32bit archs in __pmd_numa_fixup (nop for x86-64)
o Converted knuma_scand to use pmd_trans_huge_lock() cleaner API.
o Fixed a kernel crash on a 8 node system during a heavy infiniband
load if knuma_scand encounters an unstable pmd (a pmd_trans_unstable
check was needed as knuma_scand holds the mmap_sem only for
reading). The workload must have been using madvise(MADV_DONTNEED).
o Skip PROT_NONE regions from the knuma_scand scanning. We're now
sharing the same bitflag for mprotect(PROT_NONE) and pte/pmd_numa(),
so knuma_scand couldn't distinguish between a pte/pmd_numa and a
PROT_NONE range during its pass unless we check the vm_flags and
skip the vma. It wouldn't be fatal for knuma_scand to scan a
PROT_NONE range, but it's not worth it.
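The filter is just a vma permission test at scan time; a sketch with a
hypothetical helper name (not the actual code of the series):

/* Sketch: knuma_scand-side filter for PROT_NONE ranges. */
static bool knuma_scand_skip_vma(struct vm_area_struct *vma)
{
	/*
	 * An mprotect(PROT_NONE) region has VM_READ|VM_WRITE|VM_EXEC
	 * all clear, and its ptes carry the bit now shared with
	 * _PAGE_NUMA, so don't scan it at all.
	 */
	return (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) == 0;
}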
o Removed the sub-directories from /sys/kernel/mm/autonuma/ (all sysfs
files are in the same autonuma/ directory now). It looked cleaner
this way after removing the knuma_migrated/ directory, now that the
only kernel daemon left is knuma_scand. This exposes fewer
implementation details through the sysfs interface too, which is a bonus.
o All "tuning" config tweaks in sysfs are visible only if
CONFIG_DEBUG_VM=y.
o Lots of cleanups and minor optimizations (better variable names
etc..).
o The ppc64 support is not included in this upstream submit until Ben
is happy with it (but it's still included in the git branch).
== Changelog from AutoNUMA19 to AutoNUMA24 ==
o Improved lots of comments and header commit messages.
o Rewritten from scratch the comment at the top of kernel/sched/numa.c
as the old comment wasn't well received in upstream reviews. Tried
to describe the algorithm from a global view now.
o Added ppc64 support.
o Improved patch splitup.
o Lots of code cleanups and variable renames to make the code more readable.
o Try to take advantage of task_autonuma_nid before the knuma_scand is
complete.
o Moved some performance tuning sysfs tweaks under DEBUG_VM so they
won't be visible on production kernels.
o Enabled by default the working set mode for the mm_autonuma data
collection.
o Halved the size of the mm_autonuma structure.
o scan_sleep_pass_millisecs is more intuitive now (you can set it
to 10000 to mean one pass every 10 sec; in the previous release it had
to be set to 5000 to get one pass every 10 sec).
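For example (assuming the sysfs file lives in the autonuma/ directory
described above):

# one full knuma_scand pass every 10 seconds
echo 10000 > /sys/kernel/mm/autonuma/scan_sleep_pass_millisecs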
o Removed PF_THREAD_BOUND to allow CPU isolation. Turned the VM_BUG_ON
verifying the hard binding into a WARN_ON_ONCE so the knuma_migrated
can be moved by root anywhere safely.
o Optimized autonuma_possible() to avoid checking num_possible_nodes()
every time.
o Added the math on the last_nid statistical effects from sched-numa
rewrite which also introduced the last_nid logic of AutoNUMA.
o Now handle systems with holes in the NUMA nodemask. Lots of
num_possible_nodes() uses were replaced with nr_node_ids (nr_node_ids
is not such a nice name for that information).
o Fixed a bug affecting KSM. KSM failed to merge pages mapped with a
pte_numa pte, now it passes LTP fine.
o More...
== Changelog from AutoNUMA-alpha14 to AutoNUMA19 ==
o sched_autonuma_balance callout location removed from schedule(); now it
runs in the softirq along with the CFS load_balancing
o lots of documentation about the math in the sched_autonuma_balance algorithm
o fixed a bug in the fast path detection in sched_autonuma_balance that could
decrease performance with many nodes
o reduced the page_autonuma memory overhead from 32 to 12 bytes per page
o fixed a crash in __pmd_numa_fixup
o knuma_numad won't scan VM_MIXEDMAP|PFNMAP (it never touched those ptes
anyway)
o fixed a crash in autonuma_exit
o fixed a crash when split_huge_page returns 0 in knuma_migratedN as the page
has been freed already
o assorted cleanups and probably more
== Changelog from alpha13 to alpha14 ==
o page_autonuma introduction, no memory wasted if the kernel is booted
on non-NUMA hardware. Tested with flatmem/sparsemem on x86
autonuma=y/n and sparsemem/vsparsemem on x86_64 with autonuma=y/n.
"noautonuma" kernel param disables autonuma permanently also when
booted on NUMA hardware (no /sys/kernel/mm/autonuma, and no
page_autonuma allocations, like cgroup_disable=memory)
o autonuma_balance only runs along with run_rebalance_domains, to
avoid altering the usual scheduler runtime. autonuma_balance gives a
"kick" to the scheduler after a rebalance (it overrides the load
balance activity if needed). It's not yet tested on specjbb or more
scheduling-intensive benchmarks; hopefully there's no NUMA
regression. For intensive compute loads not involving a flood of
scheduling activity this doesn't show any performance regression,
and it avoids altering the strict schedule performance. It goes in
the direction of being less intrusive with the stock scheduler
runtime.
Note: autonuma_balance still runs from normal context (not softirq
context like run_rebalance_domains) to be able to wait on process
migration (avoid _nowait), but most of the time it does nothing at
all.
== Changelog from alpha11 to alpha13 ==
o autonuma_balance optimization (take the fast path when process is in
the preferred NUMA node)
== TODO ==
o THP native migration (orthogonal and also needed for
cpuset/migrate_pages(2)/numa/sched).
o powerpc has open issues to address. As a result of this work, Ben
found that other archs (not only some powerpc variants) didn't
implement PROT_NONE properly. Sharing the same pte/pmd bit between
_PAGE_NUMA and _PAGE_PROTNONE is quite handy, as the code paths of the
two features are mutually exclusive, so they don't step on each
other's toes.
== stream benchmark: autonuma26MoF vs CPU/NUMA bindings ==
NUMA is Enabled. # of nodes = 4 nodes (0-3)
RESULTS: (MBs/sec) (higher is better)
| S C H E D U L I N G M O D E | | AFFINITY |
| | DEFAULT COMPARED TO |COMPARED TO|
| DEFAULT CPU AFFINITY NUMA AFFINITY | AFFINITY NUMA | NUMA |
NUMBER | | AVG | AVG | AVG | | |
OF |STREAM | WALL | WALL | WALL | % TEST | % TEST | % TEST |
STREAMS|FUNCT | TOTAL AVG STDEV SCALE CLK | TOTAL AVG STDEV SCALE CLK | TOTAL AVG STDEV SCALE CLK |DIFF STATUS|DIFF STATUS|DIFF STATUS|
-------+-------+----------------------------------+----------------------------------+----------------------------------+-----------+-----------+-----------+
1 | Add | 5496 5496 0.0 - 1606 | 5480 5480 0.0 - 1572 | 5477 5477 0.0 - 1571 | 0 PASS | 0 PASS | 0 PASS |
1 | Copy | 4411 4411 0.0 - 1606 | 4522 4522 0.0 - 1572 | 4521 4521 0.0 - 1571 | -2 PASS | -2 PASS | 0 PASS |
1 | Scale | 4417 4417 0.0 - 1606 | 4510 4510 0.0 - 1572 | 4514 4514 0.0 - 1571 | -2 PASS | -2 PASS | 0 PASS |
1 | Triad | 5338 5338 0.0 - 1606 | 5308 5308 0.0 - 1572 | 5306 5306 0.0 - 1571 | 1 PASS | 1 PASS | 0 PASS |
1 | ALL | 4950 4950 0.0 - 1606 | 4987 4987 0.0 - 1572 | 4990 4990 0.0 - 1571 | -1 PASS | -1 PASS | 0 PASS |
1 | A_OLD | 4916 4916 0.0 - 1606 | 4955 4955 0.0 - 1572 | 4954 4954 0.0 - 1571 | -1 PASS | -1 PASS | 0 PASS |
4 | Add | 22432 5608 81.3 4.1 1574 | 22344 5586 35.1 4.1 1562 | 22244 5561 41.8 4.1 1552 | 0 PASS | 1 PASS | 0 PASS |
4 | Copy | 18280 4570 65.8 4.1 1574 | 18332 4583 50.1 4.1 1562 | 18392 4598 19.5 4.1 1552 | 0 PASS | -1 PASS | 0 PASS |
4 | Scale | 18300 4575 63.1 4.1 1574 | 18328 4582 45.0 4.1 1562 | 18344 4586 31.9 4.1 1552 | 0 PASS | 0 PASS | 0 PASS |
4 | Triad | 21700 5425 66.2 4.1 1574 | 21664 5416 42.7 4.1 1562 | 21560 5390 43.2 4.1 1552 | 0 PASS | 1 PASS | 0 PASS |
4 | ALL | 20256 5064 71.2 4.1 1574 | 20232 5058 50.3 4.1 1562 | 20204 5051 34.3 4.0 1552 | 0 PASS | 0 PASS | 0 PASS |
4 | A_OLD | 20176 5044 495.9 4.1 1574 | 20168 5042 479.8 4.1 1562 | 20136 5034 461.8 4.1 1552 | 0 PASS | 0 PASS | 0 PASS |
8 | Add | 43568 5446 9.3 7.9 1614 | 43344 5418 36.5 7.9 1594 | 43144 5393 58.9 7.9 1614 | 1 PASS | 1 PASS | 0 PASS |
8 | Copy | 36216 4527 64.8 8.2 1614 | 36200 4525 71.6 8.0 1594 | 35904 4488 104.9 7.9 1614 | 0 PASS | 1 PASS | 1 PASS |
8 | Scale | 36496 4562 53.1 8.3 1614 | 36528 4566 47.0 8.1 1594 | 36272 4534 83.6 8.0 1614 | 0 PASS | 1 PASS | 1 PASS |
8 | Triad | 42600 5325 33.9 8.0 1614 | 42496 5312 48.4 8.0 1594 | 42272 5284 73.6 8.0 1614 | 0 PASS | 1 PASS | 1 PASS |
8 | ALL | 39640 4955 60.3 8.0 1614 | 39680 4960 55.2 8.0 1594 | 39448 4931 77.8 7.9 1614 | 0 PASS | 0 PASS | 1 PASS |
8 | A_OLD | 39720 4965 431.9 8.1 1614 | 39640 4955 421.2 8.0 1594 | 39400 4925 429.2 8.0 1614 | 0 PASS | 1 PASS | 1 PASS |
16 | Add | 69216 4326 190.2 12.6 2002 | 67600 4225 23.7 12.3 1991 | 67616 4226 16.1 12.3 1989 | 2 PASS | 2 PASS | 0 PASS |
16 | Copy | 58800 3675 194.1 13.3 2002 | 57408 3588 19.3 12.7 1991 | 57504 3594 17.6 12.7 1989 | 2 PASS | 2 PASS | 0 PASS |
16 | Scale | 60048 3753 135.5 13.6 2002 | 58976 3686 23.2 13.1 1991 | 58992 3687 19.1 13.1 1989 | 2 PASS | 2 PASS | 0 PASS |
16 | Triad | 67648 4228 157.9 12.7 2002 | 66304 4144 17.9 12.5 1991 | 66176 4136 11.1 12.5 1989 | 2 PASS | 2 PASS | 0 PASS |
16 | ALL | 63648 3978 141.9 12.9 2002 | 62480 3905 13.8 12.5 1991 | 62480 3905 12.1 12.5 1989 | 2 PASS | 2 PASS | 0 PASS |
16 | A_OLD | 63936 3996 332.3 13.0 2002 | 62576 3911 280.2 12.6 1991 | 62576 3911 276.8 12.6 1989 | 2 PASS | 2 PASS | 0 PASS |
32 | Add | 75968 2374 13.4 13.8 3562 | 75840 2370 14.1 13.8 3562 | 75840 2370 17.3 13.8 3562 | 0 PASS | 0 PASS | 0 PASS |
32 | Copy | 64032 2001 8.3 14.5 3562 | 64224 2007 2.0 14.2 3562 | 64160 2005 9.8 14.2 3562 | 0 PASS | 0 PASS | 0 PASS |
32 | Scale | 65376 2043 16.7 14.8 3562 | 65248 2039 14.4 14.5 3562 | 65440 2045 21.1 14.5 3562 | 0 PASS | 0 PASS | 0 PASS |
32 | Triad | 74144 2317 13.5 13.9 3562 | 74048 2314 7.7 14.0 3562 | 74400 2325 28.5 14.0 3562 | 0 PASS | 0 PASS | 0 PASS |
32 | ALL | 69440 2170 7.6 14.0 3562 | 69248 2164 2.4 13.9 3562 | 69440 2170 13.5 13.9 3562 | 0 PASS | 0 PASS | 0 PASS |
32 | A_OLD | 69888 2184 164.9 14.2 3562 | 69824 2182 162.2 14.1 3562 | 69952 2186 164.6 14.1 3562 | 0 PASS | 0 PASS | 0 PASS |
Test Acceptance Ranges:
Default vs CPU Affinity/NUMA: FAIL outside [-25, 10], WARN outside [-10, 5], PASS within [-10, 5]
CPU Affinity vs NUMA: FAIL outside [-10, 10], WARN outside [ -5, 5], PASS within [ -5, 5]
Results: PASS
* [PATCH 29/33] autonuma: page_autonuma
2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
@ 2012-10-03 23:51 ` Andrea Arcangeli
2012-10-04 14:16 ` Christoph Lameter
2012-10-04 20:09 ` KOSAKI Motohiro
0 siblings, 2 replies; 9+ messages in thread
From: Andrea Arcangeli @ 2012-10-03 23:51 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
Mel Gorman, Hugh Dickins, Rik van Riel, Johannes Weiner,
Hillf Danton, Andrew Jones, Dan Smith, Thomas Gleixner,
Paul Turner, Christoph Lameter, Suresh Siddha, Mike Galbraith,
Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

Move the autonuma_last_nid from the "struct page" to a separate
page_autonuma data structure, allocated in the memsection (with
sparsemem) or in the pgdat (with flatmem).

This is done to avoid growing the size of "struct page". The
page_autonuma data is only allocated if the kernel is booted on real
NUMA hardware and noautonuma is not passed as a parameter to the
kernel.

An alternative would be to take over 16 bits from page->flags, but:

1) 32 bits are already used (in fact 32bit archs are considering
adding another 32 bits too, to avoid losing common code features),
16 bits would be used by the last_nid, and several bits are used by
per-node (readonly) zone/node information, so we would be left with
just a handful of spare PG_ bits if we stole 16 for the last_nid.

2) We cannot exclude that we'll want to add more bits of information
in the future (and more than 16 wouldn't fit in page->flags).
Changing the format or layout of the page_autonuma structure is
trivial, compared to altering the format of page->flags. So
page_autonuma is much more hackable than page->flags.

3) page->flags can be modified from under us with locked ops
(lock_page and all page flags operations). Normally we never change
more than 1 bit at once on it, so the only way page->flags could be
updated is through cmpxchg. That's slow, and tricky code would need
to be written for it (potentially to be dropped later in case of
point 2 above).

Allocating those 2 bytes separately looks a lot cleaner to me, even
if it takes 0.048% of memory (but only when booting on NUMA
hardware).
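To illustrate point 3: packing last_nid into page->flags would force
every update through a loop like the following (hypothetical sketch
only, not part of this patch; LAST_NID_SHIFT and LAST_NID_MASK do not
exist anywhere):

/* Hypothetical: a lockless last_nid update inside page->flags. */
static void set_last_nid_in_flags(struct page *page, int nid)
{
	unsigned long old, new;

	do {
		old = page->flags;
		new = (old & ~LAST_NID_MASK) |
		      ((unsigned long)nid << LAST_NID_SHIFT);
	} while (cmpxchg(&page->flags, old, new) != old);
}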
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- include/linux/autonuma.h | 8 ++ include/linux/autonuma_types.h | 19 +++ include/linux/mm_types.h | 11 -- include/linux/mmzone.h | 12 ++ include/linux/page_autonuma.h | 50 +++++++++ init/main.c | 2 + mm/Makefile | 2 +- mm/autonuma.c | 37 +++++-- mm/huge_memory.c | 13 ++- mm/page_alloc.c | 14 +-- mm/page_autonuma.c | 237 ++++++++++++++++++++++++++++++++++++++++ mm/sparse.c | 126 ++++++++++++++++++++- 12 files changed, 490 insertions(+), 41 deletions(-) create mode 100644 include/linux/page_autonuma.h create mode 100644 mm/page_autonuma.c diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h index 02d4875..274c616 100644 --- a/include/linux/autonuma.h +++ b/include/linux/autonuma.h @@ -10,6 +10,13 @@ extern void autonuma_exit(struct mm_struct *mm); extern void autonuma_migrate_split_huge_page(struct page *page, struct page *page_tail); extern void autonuma_setup_new_exec(struct task_struct *p); +extern struct page_autonuma *lookup_page_autonuma(struct page *page); + +static inline void autonuma_free_page(struct page *page) +{ + if (autonuma_possible()) + lookup_page_autonuma(page)->autonuma_last_nid = -1; +} #define autonuma_printk(format, args...) \ if (autonuma_debug()) printk(format, ##args) @@ -21,6 +28,7 @@ static inline void autonuma_exit(struct mm_struct *mm) {} static inline void autonuma_migrate_split_huge_page(struct page *page, struct page *page_tail) {} static inline void autonuma_setup_new_exec(struct task_struct *p) {} +static inline void autonuma_free_page(struct page *page) {} #endif /* CONFIG_AUTONUMA */ diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h index 9673ce8..d0c6403 100644 --- a/include/linux/autonuma_types.h +++ b/include/linux/autonuma_types.h @@ -78,6 +78,25 @@ struct task_autonuma { /* do not add more variables here, the above array size is dynamic */ }; +/* + * Per page (or per-pageblock) structure dynamically allocated only if + * autonuma is possible. + */ +struct page_autonuma { + /* + * autonuma_last_nid records the NUMA node that accessed the + * page during the last NUMA hinting page fault. If a + * different node accesses the page next, AutoNUMA will not + * migrate the page. This tries to avoid page thrashing by + * requiring that a page be accessed by the same node twice in + * a row before it is queued for migration. + */ +#if MAX_NUMNODES > 32767 +#error "too many nodes" +#endif + short autonuma_last_nid; +}; + extern int alloc_task_autonuma(struct task_struct *tsk, struct task_struct *orig, int node); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 9e8398a..c80101c 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -152,17 +152,6 @@ struct page { struct page *first_page; /* Compound tail pages */ }; -#ifdef CONFIG_AUTONUMA - /* - * FIXME: move to pgdat section along with the memcg and allocate - * at runtime only in presence of a numa system. - */ -#if MAX_NUMNODES > 32767 -#error "too many nodes" -#endif - short autonuma_last_nid; -#endif - /* * On machines where all RAM is mapped into kernel address space, * we can simply calculate the virtual address. 
On machines with diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index f793541..db68389 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -710,6 +710,9 @@ typedef struct pglist_data { int kswapd_max_order; enum zone_type classzone_idx; #ifdef CONFIG_AUTONUMA +#if !defined(CONFIG_SPARSEMEM) + struct page_autonuma *node_page_autonuma; +#endif /* * Lock serializing the per destination node AutoNUMA memory * migration rate limiting data. @@ -1081,6 +1084,15 @@ struct mem_section { * section. (see memcontrol.h/page_cgroup.h about this.) */ struct page_cgroup *page_cgroup; +#endif +#ifdef CONFIG_AUTONUMA + /* + * If !SPARSEMEM, pgdat doesn't have page_autonuma pointer. We use + * section. + */ + struct page_autonuma *section_page_autonuma; +#endif +#if defined(CONFIG_MEMCG) ^ defined(CONFIG_AUTONUMA) unsigned long pad; #endif }; diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h new file mode 100644 index 0000000..6da6c51 --- /dev/null +++ b/include/linux/page_autonuma.h @@ -0,0 +1,50 @@ +#ifndef _LINUX_PAGE_AUTONUMA_H +#define _LINUX_PAGE_AUTONUMA_H + +#include <linux/autonuma_flags.h> + +#if defined(CONFIG_AUTONUMA) && !defined(CONFIG_SPARSEMEM) +extern void __init page_autonuma_init_flatmem(void); +#else +static inline void __init page_autonuma_init_flatmem(void) {} +#endif + +#ifdef CONFIG_AUTONUMA + +extern void __meminit page_autonuma_map_init(struct page *page, + struct page_autonuma *page_autonuma, + int nr_pages); + +#ifdef CONFIG_SPARSEMEM +#define PAGE_AUTONUMA_SIZE (sizeof(struct page_autonuma)) +#define SECTION_PAGE_AUTONUMA_SIZE (PAGE_AUTONUMA_SIZE * \ + PAGES_PER_SECTION) +#endif + +extern void __meminit pgdat_autonuma_init(struct pglist_data *); + +#else /* CONFIG_AUTONUMA */ + +#ifdef CONFIG_SPARSEMEM +struct page_autonuma; +#define PAGE_AUTONUMA_SIZE 0 +#define SECTION_PAGE_AUTONUMA_SIZE 0 +#endif /* CONFIG_SPARSEMEM */ + +static inline void pgdat_autonuma_init(struct pglist_data *pgdat) {} + +#endif /* CONFIG_AUTONUMA */ + +#ifdef CONFIG_SPARSEMEM +extern struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid, + unsigned long nr_pages); +extern void __kfree_section_page_autonuma(struct page_autonuma *page_autonuma, + unsigned long nr_pages); +extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map, + unsigned long pnum_begin, + unsigned long pnum_end, + unsigned long map_count, + int nodeid); +#endif + +#endif /* _LINUX_PAGE_AUTONUMA_H */ diff --git a/init/main.c b/init/main.c index b286730..586764f 100644 --- a/init/main.c +++ b/init/main.c @@ -69,6 +69,7 @@ #include <linux/slab.h> #include <linux/perf_event.h> #include <linux/file.h> +#include <linux/page_autonuma.h> #include <asm/io.h> #include <asm/bugs.h> @@ -456,6 +457,7 @@ static void __init mm_init(void) * bigger than MAX_ORDER unless SPARSEMEM. 
*/ page_cgroup_init_flatmem(); + page_autonuma_init_flatmem(); mem_init(); kmem_cache_init(); percpu_init_late(); diff --git a/mm/Makefile b/mm/Makefile index 0fd3165..5a4fa30 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -34,7 +34,7 @@ obj-$(CONFIG_FRONTSWAP) += frontswap.o obj-$(CONFIG_HAS_DMA) += dmapool.o obj-$(CONFIG_HUGETLBFS) += hugetlb.o obj-$(CONFIG_NUMA) += mempolicy.o -obj-$(CONFIG_AUTONUMA) += autonuma.o +obj-$(CONFIG_AUTONUMA) += autonuma.o page_autonuma.o obj-$(CONFIG_SPARSEMEM) += sparse.o obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o obj-$(CONFIG_SLOB) += slob.o diff --git a/mm/autonuma.c b/mm/autonuma.c index 1b2530c..b5c5ff6 100644 --- a/mm/autonuma.c +++ b/mm/autonuma.c @@ -55,10 +55,19 @@ void autonuma_migrate_split_huge_page(struct page *page, struct page *page_tail) { int last_nid; + struct page_autonuma *page_autonuma, *page_tail_autonuma; - last_nid = ACCESS_ONCE(page->autonuma_last_nid); + if (!autonuma_possible()) + return; + + page_autonuma = lookup_page_autonuma(page); + page_tail_autonuma = lookup_page_autonuma(page_tail); + + VM_BUG_ON(page_tail_autonuma->autonuma_last_nid != -1); + + last_nid = ACCESS_ONCE(page_autonuma->autonuma_last_nid); if (last_nid >= 0) - page_tail->autonuma_last_nid = last_nid; + page_tail_autonuma->autonuma_last_nid = last_nid; } static int sync_isolate_migratepages(struct list_head *migratepages, @@ -176,13 +185,18 @@ static struct page *alloc_migrate_dst_page(struct page *page, { int nid = (int) data; struct page *newpage; + struct page_autonuma *page_autonuma, *newpage_autonuma; newpage = alloc_pages_exact_node(nid, (GFP_HIGHUSER_MOVABLE | GFP_THISNODE | __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | __GFP_NO_KSWAPD) & ~GFP_IOFS, 0); - if (newpage) - newpage->autonuma_last_nid = page->autonuma_last_nid; + if (newpage) { + page_autonuma = lookup_page_autonuma(page); + newpage_autonuma = lookup_page_autonuma(newpage); + newpage_autonuma->autonuma_last_nid = + page_autonuma->autonuma_last_nid; + } return newpage; } @@ -291,13 +305,14 @@ static void numa_hinting_fault_cpu_follow_memory(struct task_struct *p, static inline bool last_nid_set(struct page *page, int this_nid) { bool ret = true; - int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid); + struct page_autonuma *page_autonuma = lookup_page_autonuma(page); + int autonuma_last_nid = ACCESS_ONCE(page_autonuma->autonuma_last_nid); VM_BUG_ON(this_nid < 0); VM_BUG_ON(this_nid >= MAX_NUMNODES); if (autonuma_last_nid != this_nid) { if (autonuma_last_nid >= 0) ret = false; - ACCESS_ONCE(page->autonuma_last_nid) = this_nid; + ACCESS_ONCE(page_autonuma->autonuma_last_nid) = this_nid; } return ret; } @@ -1185,7 +1200,8 @@ static int __init noautonuma_setup(char *str) } return 1; } -__setup("noautonuma", noautonuma_setup); +/* early so sparse.c also can see it */ +early_param("noautonuma", noautonuma_setup); static bool autonuma_init_checks_failed(void) { @@ -1209,7 +1225,12 @@ static int __init autonuma_init(void) VM_BUG_ON(num_possible_nodes() < 1); if (num_possible_nodes() <= 1 || !autonuma_possible()) { - clear_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags); + /* should have been already initialized by page_autonuma */ + if (autonuma_possible()) { + WARN_ON(1); + /* try to fixup if it wasn't ok */ + clear_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags); + } return -EINVAL; } else if (autonuma_init_checks_failed()) { printk("autonuma disengaged: init checks failed\n"); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 757c1cc..86db742 100644 --- a/mm/huge_memory.c +++ 
b/mm/huge_memory.c @@ -1850,7 +1850,12 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page, bool mknuma = false; #ifdef CONFIG_AUTONUMA int autonuma_last_nid = -1; + struct page_autonuma *src_page_an, *page_an = NULL; + + if (autonuma_possible()) + page_an = lookup_page_autonuma(page); #endif + for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) { pte_t pteval = *_pte; struct page *src_page; @@ -1862,12 +1867,12 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page, src_page = pte_page(pteval); #ifdef CONFIG_AUTONUMA /* pick the first one, better than nothing */ - if (autonuma_last_nid < 0) { + if (autonuma_possible() && autonuma_last_nid < 0) { + src_page_an = lookup_page_autonuma(src_page); autonuma_last_nid = - ACCESS_ONCE(src_page-> - autonuma_last_nid); + ACCESS_ONCE(src_page_an->autonuma_last_nid); if (autonuma_last_nid >= 0) - ACCESS_ONCE(page->autonuma_last_nid) = + ACCESS_ONCE(page_an->autonuma_last_nid) = autonuma_last_nid; } #endif diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e096742..8e6493a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -59,6 +59,7 @@ #include <linux/migrate.h> #include <linux/page-debug-flags.h> #include <linux/autonuma.h> +#include <linux/page_autonuma.h> #include <asm/tlbflush.h> #include <asm/div64.h> @@ -619,9 +620,7 @@ static inline int free_pages_check(struct page *page) bad_page(page); return 1; } -#ifdef CONFIG_AUTONUMA - page->autonuma_last_nid = -1; -#endif + autonuma_free_page(page); if (page->flags & PAGE_FLAGS_CHECK_AT_PREP) page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; return 0; @@ -3797,9 +3796,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, set_pageblock_migratetype(page, MIGRATE_MOVABLE); INIT_LIST_HEAD(&page->lru); -#ifdef CONFIG_AUTONUMA - page->autonuma_last_nid = -1; -#endif #ifdef WANT_PAGE_VIRTUAL /* The shift won't overflow because ZONE_NORMAL is below 4G. 
*/ if (!is_highmem_idx(zone)) @@ -4402,14 +4398,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat, int ret; pgdat_resize_init(pgdat); -#ifdef CONFIG_AUTONUMA - spin_lock_init(&pgdat->autonuma_migrate_lock); - pgdat->autonuma_migrate_nr_pages = 0; - pgdat->autonuma_migrate_last_jiffies = jiffies; -#endif init_waitqueue_head(&pgdat->kswapd_wait); init_waitqueue_head(&pgdat->pfmemalloc_wait); pgdat_page_cgroup_init(pgdat); + pgdat_autonuma_init(pgdat); for (j = 0; j < MAX_NR_ZONES; j++) { struct zone *zone = pgdat->node_zones + j; diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c new file mode 100644 index 0000000..d400d7f --- /dev/null +++ b/mm/page_autonuma.c @@ -0,0 +1,237 @@ +#include <linux/mm.h> +#include <linux/memory.h> +#include <linux/autonuma.h> +#include <linux/page_autonuma.h> +#include <linux/bootmem.h> +#include <linux/vmalloc.h> + +void __meminit page_autonuma_map_init(struct page *page, + struct page_autonuma *page_autonuma, + int nr_pages) +{ + struct page *end; + for (end = page + nr_pages; page < end; page++, page_autonuma++) + page_autonuma->autonuma_last_nid = -1; +} + +static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat) +{ + spin_lock_init(&pgdat->autonuma_migrate_lock); + pgdat->autonuma_migrate_nr_pages = 0; + pgdat->autonuma_migrate_last_jiffies = jiffies; + + /* initialize autonuma_possible() */ + if (num_possible_nodes() <= 1) + clear_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags); +} + +#if !defined(CONFIG_SPARSEMEM) + +static unsigned long total_usage; + +void __meminit pgdat_autonuma_init(struct pglist_data *pgdat) +{ + __pgdat_autonuma_init(pgdat); + pgdat->node_page_autonuma = NULL; +} + +struct page_autonuma *lookup_page_autonuma(struct page *page) +{ + unsigned long pfn = page_to_pfn(page); + unsigned long offset; + struct page_autonuma *base; + + base = NODE_DATA(page_to_nid(page))->node_page_autonuma; +#ifdef CONFIG_DEBUG_VM + /* + * The sanity checks the page allocator does upon freeing a + * page can reach here before the page_autonuma arrays are + * allocated when feeding a range of pages to the allocator + * for the first time during bootup or memory hotplug. 
+ */ + if (unlikely(!base)) + return NULL; +#endif + offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn; + return base + offset; +} + +static int __init alloc_node_page_autonuma(int nid) +{ + struct page_autonuma *base; + unsigned long table_size; + unsigned long nr_pages; + + nr_pages = NODE_DATA(nid)->node_spanned_pages; + if (!nr_pages) + return 0; + + table_size = sizeof(struct page_autonuma) * nr_pages; + + base = __alloc_bootmem_node_nopanic(NODE_DATA(nid), + table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS)); + if (!base) + return -ENOMEM; + NODE_DATA(nid)->node_page_autonuma = base; + total_usage += table_size; + page_autonuma_map_init(NODE_DATA(nid)->node_mem_map, base, nr_pages); + return 0; +} + +void __init page_autonuma_init_flatmem(void) +{ + + int nid, fail; + + /* __pgdat_autonuma_init initialized autonuma_possible() */ + if (!autonuma_possible()) + return; + + for_each_online_node(nid) { + fail = alloc_node_page_autonuma(nid); + if (fail) + goto fail; + } + printk(KERN_INFO "allocated %lu KBytes of page_autonuma\n", + total_usage >> 10); + printk(KERN_INFO "please try the 'noautonuma' option if you" + " don't want to allocate page_autonuma memory\n"); + return; +fail: + printk(KERN_CRIT "allocation of page_autonuma failed.\n"); + printk(KERN_CRIT "please try the 'noautonuma' boot option\n"); + panic("Out of memory"); +} + +#else /* CONFIG_SPARSEMEM */ + +struct page_autonuma *lookup_page_autonuma(struct page *page) +{ + unsigned long pfn = page_to_pfn(page); + struct mem_section *section = __pfn_to_section(pfn); + + /* if it's not a power of two we may be wasting memory */ + BUILD_BUG_ON(SECTION_PAGE_AUTONUMA_SIZE & + (SECTION_PAGE_AUTONUMA_SIZE-1)); + + /* memsection must be a power of two */ + BUILD_BUG_ON(sizeof(struct mem_section) & + (sizeof(struct mem_section)-1)); + +#ifdef CONFIG_DEBUG_VM + /* + * The sanity checks the page allocator does upon freeing a + * page can reach here before the page_autonuma arrays are + * allocated when feeding a range of pages to the allocator + * for the first time during bootup or memory hotplug. 
+ */ + if (!section->section_page_autonuma) + return NULL; +#endif + return section->section_page_autonuma + pfn; +} + +void __meminit pgdat_autonuma_init(struct pglist_data *pgdat) +{ + __pgdat_autonuma_init(pgdat); +} + +struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid, + unsigned long nr_pages) +{ + struct page_autonuma *ret; + struct page *page; + unsigned long memmap_size = PAGE_AUTONUMA_SIZE * nr_pages; + + page = alloc_pages_node(nid, GFP_KERNEL|__GFP_NOWARN, + get_order(memmap_size)); + if (page) + goto got_map_page_autonuma; + + ret = vmalloc(memmap_size); + if (ret) + goto out; + + return NULL; +got_map_page_autonuma: + ret = (struct page_autonuma *)pfn_to_kaddr(page_to_pfn(page)); +out: + return ret; +} + +void __kfree_section_page_autonuma(struct page_autonuma *page_autonuma, + unsigned long nr_pages) +{ + if (is_vmalloc_addr(page_autonuma)) + vfree(page_autonuma); + else + free_pages((unsigned long)page_autonuma, + get_order(PAGE_AUTONUMA_SIZE * nr_pages)); +} + +static struct page_autonuma __init *sparse_page_autonuma_map_populate(unsigned long pnum, + int nid) +{ + struct page_autonuma *map; + unsigned long size; + + map = alloc_remap(nid, SECTION_PAGE_AUTONUMA_SIZE); + if (map) + return map; + + size = PAGE_ALIGN(SECTION_PAGE_AUTONUMA_SIZE); + map = __alloc_bootmem_node_high(NODE_DATA(nid), size, + PAGE_SIZE, __pa(MAX_DMA_ADDRESS)); + return map; +} + +void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map, + unsigned long pnum_begin, + unsigned long pnum_end, + unsigned long map_count, + int nodeid) +{ + void *map; + unsigned long pnum; + unsigned long size = SECTION_PAGE_AUTONUMA_SIZE; + + map = alloc_remap(nodeid, size * map_count); + if (map) { + for (pnum = pnum_begin; pnum < pnum_end; pnum++) { + if (!present_section_nr(pnum)) + continue; + page_autonuma_map[pnum] = map; + map += size; + } + return; + } + + size = PAGE_ALIGN(size); + map = __alloc_bootmem_node_high(NODE_DATA(nodeid), size * map_count, + PAGE_SIZE, __pa(MAX_DMA_ADDRESS)); + if (map) { + for (pnum = pnum_begin; pnum < pnum_end; pnum++) { + if (!present_section_nr(pnum)) + continue; + page_autonuma_map[pnum] = map; + map += size; + } + return; + } + + /* fallback */ + for (pnum = pnum_begin; pnum < pnum_end; pnum++) { + struct mem_section *ms; + + if (!present_section_nr(pnum)) + continue; + page_autonuma_map[pnum] = sparse_page_autonuma_map_populate(pnum, nodeid); + if (page_autonuma_map[pnum]) + continue; + ms = __nr_to_section(pnum); + printk(KERN_ERR "%s: sparsemem page_autonuma map backing failed " + "some memory will not be available.\n", __func__); + } +} + +#endif /* CONFIG_SPARSEMEM */ diff --git a/mm/sparse.c b/mm/sparse.c index fac95f2..5b8d018 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -9,6 +9,7 @@ #include <linux/export.h> #include <linux/spinlock.h> #include <linux/vmalloc.h> +#include <linux/page_autonuma.h> #include "internal.h" #include <asm/dma.h> #include <asm/pgalloc.h> @@ -230,7 +231,8 @@ struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pn static int __meminit sparse_init_one_section(struct mem_section *ms, unsigned long pnum, struct page *mem_map, - unsigned long *pageblock_bitmap) + unsigned long *pageblock_bitmap, + struct page_autonuma *page_autonuma) { if (!present_section(ms)) return -EINVAL; @@ -239,6 +241,14 @@ static int __meminit sparse_init_one_section(struct mem_section *ms, ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) | SECTION_HAS_MEM_MAP; ms->pageblock_flags = 
	pageblock_bitmap;
+#ifdef CONFIG_AUTONUMA
+	if (page_autonuma) {
+		ms->section_page_autonuma = page_autonuma - section_nr_to_pfn(pnum);
+		page_autonuma_map_init(mem_map, page_autonuma, PAGES_PER_SECTION);
+	}
+#else
+	BUG_ON(page_autonuma);
+#endif

 	return 1;
 }
@@ -480,6 +490,9 @@ void __init sparse_init(void)
 	int size2;
 	struct page **map_map;
 #endif
+	struct page_autonuma **uninitialized_var(page_autonuma_map);
+	struct page_autonuma *page_autonuma;
+	int size3;

 	/* Setup pageblock_order for HUGETLB_PAGE_SIZE_VARIABLE */
 	set_pageblock_order();
@@ -577,6 +590,63 @@ void __init sparse_init(void)
 					 map_count, nodeid_begin);
 #endif

+	/* __pgdat_autonuma_init() initializes autonuma_possible() */
+	if (autonuma_possible()) {
+		unsigned long total_page_autonuma;
+		unsigned long page_autonuma_count;
+
+		size3 = sizeof(struct page_autonuma *) * NR_MEM_SECTIONS;
+		page_autonuma_map = alloc_bootmem(size3);
+		if (!page_autonuma_map)
+			panic("can not allocate page_autonuma_map\n");
+
+		for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
+			struct mem_section *ms;
+
+			if (!present_section_nr(pnum))
+				continue;
+			ms = __nr_to_section(pnum);
+			nodeid_begin = sparse_early_nid(ms);
+			pnum_begin = pnum;
+			break;
+		}
+		total_page_autonuma = 0;
+		page_autonuma_count = 1;
+		for (pnum = pnum_begin + 1; pnum < NR_MEM_SECTIONS; pnum++) {
+			struct mem_section *ms;
+			int nodeid;
+
+			if (!present_section_nr(pnum))
+				continue;
+			ms = __nr_to_section(pnum);
+			nodeid = sparse_early_nid(ms);
+			if (nodeid == nodeid_begin) {
+				page_autonuma_count++;
+				continue;
+			}
+			/* ok, we need to take care of pnum_begin to pnum - 1 */
+			sparse_early_page_autonuma_alloc_node(page_autonuma_map,
+							      pnum_begin,
+							      NR_MEM_SECTIONS,
+							      page_autonuma_count,
+							      nodeid_begin);
+			total_page_autonuma += SECTION_PAGE_AUTONUMA_SIZE * page_autonuma_count;
+			/* new start, update count etc. */
+			nodeid_begin = nodeid;
+			pnum_begin = pnum;
+			page_autonuma_count = 1;
+		}
+		/* ok, last chunk */
+		sparse_early_page_autonuma_alloc_node(page_autonuma_map, pnum_begin,
+						      NR_MEM_SECTIONS,
+						      page_autonuma_count, nodeid_begin);
+		total_page_autonuma += SECTION_PAGE_AUTONUMA_SIZE * page_autonuma_count;
+		printk("allocated %lu KBytes of page_autonuma\n",
+		       total_page_autonuma >> 10);
+		printk(KERN_INFO "please try the 'noautonuma' option if you"
+		       " don't want to allocate page_autonuma memory\n");
+	}
+
 	for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
 		if (!present_section_nr(pnum))
 			continue;
@@ -585,6 +655,13 @@ void __init sparse_init(void)
 		if (!usemap)
 			continue;

+		if (autonuma_possible()) {
+			page_autonuma = page_autonuma_map[pnum];
+			if (!page_autonuma)
+				continue;
+		} else
+			page_autonuma = NULL;
+
 #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
 		map = map_map[pnum];
 #else
@@ -594,11 +671,13 @@ void __init sparse_init(void)
 			continue;

 		sparse_init_one_section(__nr_to_section(pnum), pnum, map,
-					usemap);
+					usemap, page_autonuma);
 	}

 	vmemmap_populate_print_last();

+	if (autonuma_possible())
+		free_bootmem(__pa(page_autonuma_map), size3);
 #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
 	free_bootmem(__pa(map_map), size2);
 #endif
@@ -685,7 +764,8 @@ static void free_map_bootmem(struct page *page, unsigned long nr_pages)
 }
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */

-static void free_section_usemap(struct page *memmap, unsigned long *usemap)
+static void free_section_usemap(struct page *memmap, unsigned long *usemap,
+				struct page_autonuma *page_autonuma)
 {
 	struct page *usemap_page;
 	unsigned long nr_pages;
@@ -699,8 +779,14 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 	 */
 	if (PageSlab(usemap_page)) {
 		kfree(usemap);
-		if (memmap)
+		if (memmap) {
 			__kfree_section_memmap(memmap, PAGES_PER_SECTION);
+			if (autonuma_possible())
+				__kfree_section_page_autonuma(page_autonuma,
+							      PAGES_PER_SECTION);
+			else
+				BUG_ON(page_autonuma);
+		}
 		return;
 	}
@@ -717,6 +803,13 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 			>> PAGE_SHIFT;

 		free_map_bootmem(memmap_page, nr_pages);
+
+		if (autonuma_possible()) {
+			struct page *page_autonuma_page;
+			page_autonuma_page = virt_to_page(page_autonuma);
+			free_map_bootmem(page_autonuma_page, nr_pages);
+		} else
+			BUG_ON(page_autonuma);
 	}
 }
@@ -732,6 +825,7 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct mem_section *ms;
 	struct page *memmap;
+	struct page_autonuma *page_autonuma;
 	unsigned long *usemap;
 	unsigned long flags;
 	int ret;
@@ -751,6 +845,16 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 		__kfree_section_memmap(memmap, nr_pages);
 		return -ENOMEM;
 	}
+	if (autonuma_possible()) {
+		page_autonuma = __kmalloc_section_page_autonuma(pgdat->node_id,
+								nr_pages);
+		if (!page_autonuma) {
+			kfree(usemap);
+			__kfree_section_memmap(memmap, nr_pages);
+			return -ENOMEM;
+		}
+	} else
+		page_autonuma = NULL;

 	pgdat_resize_lock(pgdat, &flags);
@@ -762,11 +866,16 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,

 	ms->section_mem_map |= SECTION_MARKED_PRESENT;

-	ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
+	ret = sparse_init_one_section(ms, section_nr, memmap, usemap,
+				      page_autonuma);

 out:
 	pgdat_resize_unlock(pgdat, &flags);
 	if (ret <= 0) {
+		if (autonuma_possible())
+			__kfree_section_page_autonuma(page_autonuma, nr_pages);
+		else
+			BUG_ON(page_autonuma);
 		kfree(usemap);
 		__kfree_section_memmap(memmap, nr_pages);
 	}
@@ -777,6 +886,7 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 {
 	struct page *memmap = NULL;
 	unsigned long *usemap = NULL;
+	struct page_autonuma *page_autonuma = NULL;

 	if (ms->section_mem_map) {
 		usemap = ms->pageblock_flags;
@@ -784,8 +894,12 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 					    __section_nr(ms));
 		ms->section_mem_map = 0;
 		ms->pageblock_flags = NULL;
+
+#ifdef CONFIG_AUTONUMA
+		page_autonuma = ms->section_page_autonuma;
+#endif
 	}

-	free_section_usemap(memmap, usemap);
+	free_section_usemap(memmap, usemap, page_autonuma);
 }
 #endif

^ permalink raw reply related	[flat|nested] 9+ messages in thread
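[A stand-alone sketch of the batching walk in the sparse_init() hunk
above, modeled as user-space C so it can be compiled and run on its own.
alloc_for_node() is a hypothetical stand-in for
sparse_early_page_autonuma_alloc_node(), and the sections[] layout is
invented for illustration; this models the pattern, it is not code from
the patch.]

#include <stdio.h>

#define NSECT 16

static void alloc_for_node(int first, int count, int node)
{
	/* stand-in for the per-node page_autonuma allocation */
	printf("node %d: %d section(s) starting at %d\n",
	       node, count, first);
}

int main(void)
{
	/* node id per section; -1 models !present_section_nr() */
	int sections[NSECT] = { 0, 0, -1, 0, 1, 1, -1, 1,
				2, -1, -1, 2, 2, 2, -1, 3 };
	int begin = -1, node = -1, count = 0;
	int i;

	for (i = 0; i < NSECT; i++) {
		if (sections[i] < 0)		/* hole: skip */
			continue;
		if (begin < 0) {		/* first present section */
			begin = i;
			node = sections[i];
			count = 1;
			continue;
		}
		if (sections[i] == node) {	/* same node: grow the batch */
			count++;
			continue;
		}
		alloc_for_node(begin, count, node); /* flush previous batch */
		begin = i;
		node = sections[i];
		count = 1;
	}
	if (begin >= 0)				/* flush the final batch */
		alloc_for_node(begin, count, node);
	return 0;
}

[The kernel version splits off the search for the first present section,
then starts the walk at pnum_begin + 1 with page_autonuma_count already
at 1; that is the same batching with the first iteration peeled off.]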
* Re: [PATCH 29/33] autonuma: page_autonuma
  2012-10-03 23:51 ` [PATCH 29/33] autonuma: page_autonuma Andrea Arcangeli
@ 2012-10-04 14:16   ` Christoph Lameter
  2012-10-04 20:09   ` KOSAKI Motohiro
  1 sibling, 0 replies; 9+ messages in thread
From: Christoph Lameter @ 2012-10-04 14:16 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
	Dan Smith, Thomas Gleixner, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, 4 Oct 2012, Andrea Arcangeli wrote:

> Move the autonuma_last_nid from the "struct page" to a separate
> page_autonuma data structure allocated in the memsection (with
> sparsemem) or in the pgdat (with flatmem).

Note that there is an available word in struct page before the autonuma
patches on x86_64 with CONFIG_HAVE_ALIGNED_STRUCT_PAGE.

In fact the page_autonuma fills up the structure so that it fits nicely
in one 64-byte cacheline.

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 29/33] autonuma: page_autonuma
  2012-10-03 23:51 ` [PATCH 29/33] autonuma: page_autonuma Andrea Arcangeli
  2012-10-04 14:16   ` Christoph Lameter
@ 2012-10-04 20:09   ` KOSAKI Motohiro
  2012-10-05 11:31     ` Andrea Arcangeli
  1 sibling, 1 reply; 9+ messages in thread
From: KOSAKI Motohiro @ 2012-10-04 20:09 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
	Dan Smith, Thomas Gleixner, Paul Turner, Christoph Lameter,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt, kosaki.motohiro

> +struct page_autonuma *lookup_page_autonuma(struct page *page)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +	unsigned long offset;
> +	struct page_autonuma *base;
> +
> +	base = NODE_DATA(page_to_nid(page))->node_page_autonuma;
> +#ifdef CONFIG_DEBUG_VM
> +	/*
> +	 * The sanity checks the page allocator does upon freeing a
> +	 * page can reach here before the page_autonuma arrays are
> +	 * allocated when feeding a range of pages to the allocator
> +	 * for the first time during bootup or memory hotplug.
> +	 */
> +	if (unlikely(!base))
> +		return NULL;
> +#endif

When using CONFIG_DEBUG_VM, please just use BUG_ON instead of an
additional sanity check. Otherwise only MM people would be able to find
a real bug.

And I have an additional question here. What happens if memory hotplug
occurs and several autonuma_last_nid fields end up pointing to an
invalid node id? My quick skimming didn't find the hotplug callback
code.

^ permalink raw reply	[flat|nested] 9+ messages in thread
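[A sketch, not from the patch, of what the BUG_ON suggestion could look
like inside the quoted lookup_page_autonuma(); VM_BUG_ON() is the
CONFIG_DEBUG_VM-gated assertion from <linux/mmdebug.h>. Note the
tradeoff: it would fire on the legitimate early-boot NULL case the
original comment describes, so that path would need separate handling;
whether that is acceptable is exactly what the follow-up discusses.]

	base = NODE_DATA(page_to_nid(page))->node_page_autonuma;
	/*
	 * Trap a missing per-node page_autonuma array loudly under
	 * CONFIG_DEBUG_VM instead of silently returning NULL, so a
	 * real bug cannot hide behind the debug-only escape hatch.
	 */
	VM_BUG_ON(!base);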
* Re: [PATCH 29/33] autonuma: page_autonuma
  2012-10-04 20:09   ` KOSAKI Motohiro
@ 2012-10-05 11:31     ` Andrea Arcangeli
  0 siblings, 0 replies; 9+ messages in thread
From: Andrea Arcangeli @ 2012-10-05 11:31 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
	Dan Smith, Thomas Gleixner, Paul Turner, Christoph Lameter,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Srivatsa Vaddagiri, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Hi KOSAKI,

On Thu, Oct 04, 2012 at 04:09:40PM -0400, KOSAKI Motohiro wrote:
> > +struct page_autonuma *lookup_page_autonuma(struct page *page)
> > +{
> > +	unsigned long pfn = page_to_pfn(page);
> > +	unsigned long offset;
> > +	struct page_autonuma *base;
> > +
> > +	base = NODE_DATA(page_to_nid(page))->node_page_autonuma;
> > +#ifdef CONFIG_DEBUG_VM
> > +	/*
> > +	 * The sanity checks the page allocator does upon freeing a
> > +	 * page can reach here before the page_autonuma arrays are
> > +	 * allocated when feeding a range of pages to the allocator
> > +	 * for the first time during bootup or memory hotplug.
> > +	 */
> > +	if (unlikely(!base))
> > +		return NULL;
> > +#endif
>
> When using CONFIG_DEBUG_VM, please just use BUG_ON instead of an
> additional sanity check. Otherwise only MM people would be able to
> find a real bug.

Agreed, but I just tried to stick to the page_cgroup.c model. I suggest
you send a patch to fix it in mm/page_cgroup.c; then I'll synchronize
mm/page_autonuma.c with whatever lands in page_cgroup.c.

The idea is that in the future it'd be nice to unify the two with a
common implementation, and the closer page_cgroup.c and page_autonuma.c
stay, the less work it'll be to move them onto a common framework. And
if unifying them is never going to be worth it (if it generates more
code than it saves), keeping the code as similar as possible is still
beneficial, because it makes both easier to review.

> And I have an additional question here. What happens if memory hotplug
> occurs and several autonuma_last_nid fields end up pointing to an
> invalid node id? My quick skimming didn't find the hotplug callback
> code.

last_nid is statistical information, so a random value is harmless (I
didn't add bugchecks to trap uninitialized uses; maybe I should?).
sparse_init_one_section also initializes it, and that's invoked by
sparse_add_one_section. Those fields are initialized again when a page
is first freed into the buddy allocator, but I didn't want to depend on
that: an explicit init right after allocation seemed more robust.

While reviewing this, the only problem I found is that I was wasting a
bit of .text on 32bit builds (CONFIG_SPARSEMEM=n).
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
index d400d7f..303b427 100644
--- a/mm/page_autonuma.c
+++ b/mm/page_autonuma.c
@@ -14,7 +14,7 @@ void __meminit page_autonuma_map_init(struct page *page,
 		page_autonuma->autonuma_last_nid = -1;
 }

-static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
+static void __paginginit __pgdat_autonuma_init(struct pglist_data *pgdat)
 {
 	spin_lock_init(&pgdat->autonuma_migrate_lock);
 	pgdat->autonuma_migrate_nr_pages = 0;
@@ -29,7 +29,7 @@ static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)

 static unsigned long total_usage;

-void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+void __paginginit pgdat_autonuma_init(struct pglist_data *pgdat)
 {
 	__pgdat_autonuma_init(pgdat);
 	pgdat->node_page_autonuma = NULL;
@@ -131,7 +131,7 @@ struct page_autonuma *lookup_page_autonuma(struct page *page)
 	return section->section_page_autonuma + pfn;
 }

-void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+void __paginginit pgdat_autonuma_init(struct pglist_data *pgdat)
 {
 	__pgdat_autonuma_init(pgdat);
 }

So those can be freed in a non-sparsemem build. The caller is marked
__paginginit too, so it should be ok.

The other page_autonuma.c functions, invoked only by the sparsemem
hotplug code, use __meminit, so in theory it should work (I haven't
tested it yet).

Thanks for the review!

Andrea

^ permalink raw reply related	[flat|nested] 9+ messages in thread
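[For readers unfamiliar with the annotation swapped in above:
__paginginit comes from mm/internal.h and, in kernels of this era,
expands approximately as below. With CONFIG_SPARSEMEM=n it becomes
__init, so the functions are discarded after boot, which is the .text
saving mentioned; with sparsemem it stays __meminit so memory hotplug
can still call them after init memory is freed.]

/* mm/internal.h, approximate definition for kernels of this era */
#ifdef CONFIG_SPARSEMEM
#define __paginginit __meminit
#else
#define __paginginit __init
#endif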
end of thread, other threads:[~2012-10-05 11:31 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-04 16:50 Andrea Arcangeli
2012-10-04 18:17 ` your mail Christoph Lameter
2012-10-04 18:38   ` [PATCH 29/33] autonuma: page_autonuma Andrea Arcangeli
2012-10-04 19:11     ` Christoph Lameter
2012-10-05 11:11       ` Andrea Arcangeli
  -- strict thread matches above, loose matches on Subject: below --
2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
2012-10-03 23:51 ` [PATCH 29/33] autonuma: page_autonuma Andrea Arcangeli
2012-10-04 14:16   ` Christoph Lameter
2012-10-04 20:09   ` KOSAKI Motohiro
2012-10-05 11:31     ` Andrea Arcangeli
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).