Transparent Hugepage Support #33

* Transparent Hugepage Support #33
@ 2010-12-15  5:15 Andrea Arcangeli
  2010-12-15 23:55 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2010-12-15  5:15 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov, Miklos Szeredi

Some of the relevant users of the project:

KVM Virtualization
GCC (kernel build included, requires a few-line patch to enable)
JVM
VMware Workstation
HPC

It would be great if it could go in -mm.

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=blob;f=Documentation/vm/transhuge.txt
http://www.linux-kvm.org/wiki/images/9/9e/2010-forum-thp.pdf

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=shortlog

first: git clone git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
or first: git clone --reference linux-2.6 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
later: git fetch; git checkout -f origin/master

The tree is rebased and git pull won't work.

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.37-rc5/transparent_hugepage-33/
http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.37-rc5/transparent_hugepage-33.gz

Diff #32 -> #33:

 b/THP-disable-on-small-systems               |    4 

Improved header.

 b/clear_copy_huge_page                       |   60 +--

Update after upstream changes.

 b/compaction-add-trace-events                |  179 +++++++++
 b/compaction-instead-of-lumpy                |  415 ++++++++++++++++++++++
 b/compaction-lumpy_mode                      |  169 +++++++++
 b/compaction-migrate-async                   |  388 +++++++++++++++++++++
 b/compaction-migrate_pages-api-bool          |  133 +++++++
 b/compaction-movable-pageblocks              |   56 +++
 b/compaction-reclaim_mode                    |  248 +++++++++++++
 b/zone_watermark_ok_safe                     |  372 ++++++++++++++++++++

Mel's lumpy compaction (disables lumpy reclaim and uses compaction
instead when CONFIG_COMPACTION=y) allows proper runtime behavior when
there are frequent hugepage allocations, as with THP on. Picked from
the mmotm broken-out patchset to allow easy -mm integration and to
test it in combination with THP.

 b/compaction-all-orders                      |   23 +
 b/compaction-kswapd                          |  104 +++--

Split the compaction-all-orders part off compaction-kswapd.

 b/compound_get_put                           |   39 +-

Cleanups.

 b/compound_get_put_fix                       |   28 +

While reading the code I found what I think was a super tiny race
(never reproduced) in the put_page of a tail page: split_huge_page
could run on the head page after put_page releases the compound lock
but before put_page_testzero is called. Only after put_page_testzero
returns true are we sure split_huge_page can't run from under us
anymore, as it requires a reference on the head page to run.
Rechecking PageHead is enough to fix it.

 b/compound_lock                              |   13 

Change the API to return flags instead of void.

 b/compound_trans_order                       |  120 ++++++

Be safe while reading compound_order on transparent hugepages that may
be under split_huge_page.

 b/gfp_no_kswapd                              |   17 

Define ___GFP_NO_KSWAPD.

 b/khugepaged-mmap_sem                        |  113 ++++++

Some user reported deadlocks after days of load with pvfs.

Allocate memory with mmap_sem held in read mode (no longer in write
mode) within khugepaged collapse_huge_page, to satisfy certain
filesystems in userland that may benefit from THP (so they don't need
to use MADV_NOHUGEPAGE). Not sure if this bugfix was really required
from a theoretical standpoint (as far as the deadlock is concerned
this may actually hide bugs), but it makes the code more scalable, so
it's an improvement and a no-brainer.

Still investigating the page lock usage in khugepaged vs fuse.

 b/ksm-swapcache                              |   64 ---

Use Hugh's equivalent one-liner fix.

 b/kvm_transparent_hugepage                   |   38 +-

Adjust for hva_to_pfn interface change.

 b/madv_nohugepage                            |  157 ++++++++
 b/madv_nohugepage_define                     |   64 +++

Add MADV_NOHUGEPAGE to disable THP on low priority vmas (needed
especially now that KSM won't scan inside THP; later it will be less
important but maybe still useful to leave hugepages available for
higher priority virtual machines).
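
As a usage illustration from the userland side (a minimal sketch, not
part of the patchset): an application opts a mapping out of THP with a
plain madvise() call. The helper names map_anon() and thp_opt_out() are
hypothetical, and the fallback value 15 for MADV_NOHUGEPAGE is an
assumption that only holds for architectures using the generic mman
header.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Fallback for older userspace headers. 15 is the value in the generic
 * Linux mman header; architectures with their own mman.h may differ, so
 * treat this define as an assumption. */
#ifndef MADV_NOHUGEPAGE
#define MADV_NOHUGEPAGE 15
#endif

/* Hypothetical helper: map an anonymous, private, read-write region.
 * Returns NULL on failure. */
static void *map_anon(size_t len)
{
    assert(len > 0);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: ask the kernel not to back [addr, addr + len)
 * with transparent hugepages. Returns 0 on success, -1 with errno set
 * otherwise (EINVAL is the expected error on kernels without THP). */
static int thp_opt_out(void *addr, size_t len)
{
    return madvise(addr, len, MADV_NOHUGEPAGE);
}
```

A low priority guest or process would call thp_opt_out() on its large
anonymous regions right after mapping them, before touching the memory.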

 b/memcg_compound                             |   71 ++-

Don't batch hugepage releasing in __do_uncharge.

 b/memcg_huge_memory                          |   12 

Optimize with mem_cgroup_uncharge_start/stop().

 b/memory-failure-thp-vs-hugetlbfs            |   44 ++

The new hugetlbfs memory-failure code merged upstream collided with
THP (reported by some users running
mce-test.git/hwpoison/run-huge-test.sh on aa.git).

Use PageHuge to differentiate between THP pages and hugetlbfs pages in
common paths that can run into either of the two types. PageTransHuge
will still return 1 for hugetlbfs pages because PageTransHuge must
only be used in the core VM paths where hugetlbfs pages can't be
processed. In any place where hugetlbfs shares the common paths with
the core VM code, PageHuge should be used to differentiate the
two. Usually PageHuge is only needed in THP context in slow paths
(memory-failure is not just a slow path but even an error path), so
it's ok, and we don't want to slow down PageTransHuge considering
PageHuge is already there for this.
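
The rule can be sketched with a toy userspace model (not kernel code):
struct toy_page and the toy_* helpers are hypothetical stand-ins that
assume only what the paragraph above states, namely that the cheap
PageTransHuge-style check is also true for hugetlbfs pages and that the
PageHuge-style check tells the two apart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page: only the two facts the text relies on. */
struct toy_page {
    bool head;       /* head page of a compound page */
    bool hugetlbfs;  /* compound page owned by hugetlbfs */
};

/* Cheap check: true for any compound head, so also true for hugetlbfs. */
static bool toy_page_trans_huge(const struct toy_page *p)
{
    return p->head;
}

/* Precise (slower) check: true only for hugetlbfs pages. */
static bool toy_page_huge(const struct toy_page *p)
{
    return p->head && p->hugetlbfs;
}

enum toy_kind { TOY_SMALL, TOY_THP, TOY_HUGETLBFS };

/* Common-path classification: rule out hugetlbfs first, and only then
 * trust the cheap check to mean "transparent hugepage". */
static enum toy_kind toy_classify(const struct toy_page *p)
{
    assert(p != NULL);
    if (toy_page_huge(p))
        return TOY_HUGETLBFS;
    if (toy_page_trans_huge(p))
        return TOY_THP;
    return TOY_SMALL;
}
```

Swapping the two tests in toy_classify() would misreport every hugetlbfs
page as a THP page, which is the kind of collision described above.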

 b/pagetranscompound                          |   30 -

Cleanups.

 b/pmd_mangling_generic                       |  488 +++++++++++++++++++--------

Cleanups to save icache by moving slow common methods to
mm/pgtable-generic.c.

 b/pmd_mangling_x86                           |   41 --

Update header and undo a noop change.

 b/pmd_paravirt_ops                           |   12 

Fix x86 32bit build with PAE off and paravirt on.

 b/pmd_trans                                  |   13 

macro -> inline cleanups.

 b/pmd_trans_huge_migrate                     |   31 -

Remove false positive BUG_ON.

 b/pte_alloc_trans_splitting                  |   13 

Add BUG_ON matching the issue in pmd_trans_huge_migrate (pmd must be
null to call __pte_alloc, pmd_present is not enough if pmd_trans_huge
can be set). The reason is that, to optimize away one unnecessary IPI
for every split_huge_page, we very temporarily mark the pmd not
present but still huge for the duration of the IPI (this is to prevent
simultaneous 4k and 2M tlb entries that would machine check some CPUs
with errata).

 b/set-recommended-min_free_kbytes            |   10 

Explicitly call setup_per_zone_wmarks even if min_free_kbytes is
already bigger than recommended_min (otherwise the reserved pageblocks
won't be enabled on huge systems). This makes the kernel version of
set_recommended_min_free_kbytes fully equivalent to the hugeadm
--set-recommended-min_free_kbytes command line.

 b/transhuge-enable-direct-defrag             |    3 

Header update.

 b/transhuge-selects-compaction               |   15 

Header update to explain why THP selects compaction.

 b/transparent_hugepage                       |  114 ++++--

Make PageTransHuge inline and move it from huge_mm.h to page-flags.h.

Add BUG_ON if is_vma_temporary_stack is set during split_huge_page (we
can't fail there). It shall never trigger, because the mremap done on
the initial kernel stack during execve, which sets the temporary stack
flag for its duration, shouldn't work on hugepages. The BUG_ON makes
sure it won't break silently if the user stack is ever born huge.

Use assert_spin_locked instead of VM_BUG_ON.

Remove potentially false positive bugcheck for not present pmd, same
as pte_alloc_trans_splitting.

 b/transparent_hugepage-doc                   |   67 ++-

Doc improvement from Mel.

 b/transparent_hugepage-numa                  |   50 +-

Fix memleak if the memcg charge fails during khugepaged
collapse_huge_page with CONFIG_NUMA=y.

 b/transparent_hugepage_vmstat-anon_vma-chain |   16 

 memcg_consume_stock                          |   56 ---
 remove-lumpy_reclaim                         |  131 -------
 exec-migrate-race-anon_vma-chain

removed.

FAQ:

Q: When will 1G pages be supported? (by far the most frequently asked question
   in the last two days)
A: Not any time soon but it's not entirely impossible... The benefit of going
   from 2M to 1G is likely much lower than the benefit of going from 4k to 2M
   so it's unlikely to be a worthwhile effort for a while. And some CPUs
   won't have a 1G TLB, so there it would only speed up the TLB miss handler
   a bit without actually decreasing the TLB miss rate.

Q: When will this work on file-backed pages? (pagecache/swapcache/tmpfs)
A: Not until it's merged in mainline. It's already feature complete for many
   usages and the moment we expand into pagecache the patch would grow
   significantly.

Q: When will KSM scan inside Transparent Hugepages?
A: Working on that, this should materialize soon enough.

Q: Where is the next place to remove split_huge_page_pmd()?
A: mremap. JVM uses mremap in the garbage collector so the ~18% boost (no virt)
   has further margin for optimizations.
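
The 1G answer above names two separate effects of a larger page size:
fewer TLB entries for the same amount of memory, and a shorter page
table walk on a miss. A rough sketch of that arithmetic follows,
assuming x86-64 with 4-level paging; the helper names are hypothetical
and nothing here measures the actual benefit or models real TLB sizes.

```c
#include <assert.h>
#include <stdint.h>

#define KIB (UINT64_C(1) << 10)
#define MIB (UINT64_C(1) << 20)
#define GIB (UINT64_C(1) << 30)

/* Number of pages (and so TLB entries) needed to map `bytes` of memory,
 * rounding up to whole pages. */
static uint64_t tlb_entries_needed(uint64_t bytes, uint64_t page_size)
{
    assert(page_size != 0);
    return (bytes + page_size - 1) / page_size;
}

/* Page table levels walked on a TLB miss, assuming x86-64 4-level
 * paging: a 4 KiB mapping ends at the PTE level, a 2 MiB mapping at the
 * PMD level, a 1 GiB mapping at the PUD level. Returns 0 for any other
 * size. */
static int walk_levels(uint64_t page_size)
{
    if (page_size == 4 * KIB)
        return 4;
    if (page_size == 2 * MIB)
        return 3;
    if (page_size == 1 * GIB)
        return 2;
    return 0;
}
```

For a 1 GiB working set this gives 262144, 512 and 1 entries, and 4, 3
and 2 levels, for 4k, 2M and 1G pages respectively. On a CPU without 1G
TLB entries only the shorter walk remains, which is the caveat in the
answer.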

Full diffstat:

 Documentation/vm/transhuge.txt        |  298 ++++
 arch/alpha/include/asm/mman.h         |    3 
 arch/mips/include/asm/mman.h          |    3 
 arch/parisc/include/asm/mman.h        |    3 
 arch/powerpc/mm/gup.c                 |   12 
 arch/x86/include/asm/kvm_host.h       |    1 
 arch/x86/include/asm/paravirt.h       |   25 
 arch/x86/include/asm/paravirt_types.h |    6 
 arch/x86/include/asm/pgtable-2level.h |    9 
 arch/x86/include/asm/pgtable-3level.h |   23 
 arch/x86/include/asm/pgtable.h        |  143 ++
 arch/x86/include/asm/pgtable_64.h     |   28 
 arch/x86/include/asm/pgtable_types.h  |    3 
 arch/x86/kernel/paravirt.c            |    3 
 arch/x86/kernel/tboot.c               |    2 
 arch/x86/kernel/vm86_32.c             |    1 
 arch/x86/kvm/mmu.c                    |   60 
 arch/x86/kvm/paging_tmpl.h            |    4 
 arch/x86/mm/gup.c                     |   28 
 arch/x86/mm/pgtable.c                 |   66 
 arch/xtensa/include/asm/mman.h        |    3 
 drivers/base/node.c                   |   21 
 fs/Kconfig                            |    2 
 fs/proc/meminfo.c                     |   14 
 fs/proc/page.c                        |   14 
 include/asm-generic/mman-common.h     |    3 
 include/asm-generic/pgtable.h         |  225 ++-
 include/linux/compaction.h            |   25 
 include/linux/gfp.h                   |   15 
 include/linux/huge_mm.h               |  159 ++
 include/linux/kernel.h                |    7 
 include/linux/khugepaged.h            |   67 
 include/linux/kvm_host.h              |    4 
 include/linux/memory_hotplug.h        |   14 
 include/linux/migrate.h               |   12 
 include/linux/mm.h                    |  137 +
 include/linux/mm_inline.h             |   19 
 include/linux/mm_types.h              |    3 
 include/linux/mmu_notifier.h          |   66 
 include/linux/mmzone.h                |   11 
 include/linux/page-flags.h            |   65 
 include/linux/rmap.h                  |    2 
 include/linux/sched.h                 |    1 
 include/linux/swap.h                  |    2 
 include/linux/vmstat.h                |    5 
 include/trace/events/compaction.h     |   74 +
 include/trace/events/vmscan.h         |    6 
 kernel/fork.c                         |   12 
 kernel/futex.c                        |   55 
 mm/Kconfig                            |   38 
 mm/Makefile                           |    3 
 mm/compaction.c                       |  174 +-
 mm/huge_memory.c                      | 2331 ++++++++++++++++++++++++++++++++++
 mm/hugetlb.c                          |   70 -
 mm/internal.h                         |    4 
 mm/ksm.c                              |   29 
 mm/madvise.c                          |   10 
 mm/memcontrol.c                       |  129 +
 mm/memory-failure.c                   |   22 
 mm/memory.c                           |  199 ++
 mm/memory_hotplug.c                   |   17 
 mm/mempolicy.c                        |   20 
 mm/migrate.c                          |   29 
 mm/mincore.c                          |    7 
 mm/mmap.c                             |    7 
 mm/mmu_notifier.c                     |   20 
 mm/mmzone.c                           |   21 
 mm/mprotect.c                         |   20 
 mm/mremap.c                           |    9 
 mm/page_alloc.c                       |   98 +
 mm/pagewalk.c                         |    1 
 mm/pgtable-generic.c                  |  123 +
 mm/rmap.c                             |   87 -
 mm/sparse.c                           |    4 
 mm/swap.c                             |  131 +
 mm/swap_state.c                       |    6 
 mm/swapfile.c                         |    2 
 mm/vmscan.c                           |  210 ++-
 mm/vmstat.c                           |   69 -
 virt/kvm/iommu.c                      |    2 
 virt/kvm/kvm_main.c                   |   56 
 81 files changed, 5189 insertions(+), 523 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Transparent Hugepage Support #33
  2010-12-15  5:15 Transparent Hugepage Support #33 Andrea Arcangeli
@ 2010-12-15 23:55 ` Andrew Morton
  2010-12-16  2:35   ` kvm mmu transparent hugepage support for linux-next Andrea Arcangeli
  2010-12-16  0:54 ` Transparent Hugepage Support #33 KAMEZAWA Hiroyuki
  2010-12-20 11:16 ` Transparent Hugepage Support #33 Mel Gorman
  2 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2010-12-15 23:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, linux-kernel, Marcelo Tosatti,
	Adam Litke, Avi Kivity, Hugh Dickins, Rik van Riel, Mel Gorman,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Chris Mason, Borislav Petkov,
	Miklos Szeredi

On Wed, 15 Dec 2010 06:15:40 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> Some of the relevant users of the project:
> 
> KVM Virtualization
> GCC (kernel build included, requires a few-line patch to enable)
> JVM
> VMware Workstation
> HPC
> 
> It would be great if it could go in -mm.

That all merged pretty easily on top of the current mm pile.  Except
for kvm-mmu-transparent-hugepage-support.patch which needs some thought
and testing to get it merged into the KVM changes in linux-next.  I
simply omitted kvm-mmu-transparent-hugepage-support.patch so please
take a look?



* Re: Transparent Hugepage Support #33
  2010-12-15  5:15 Transparent Hugepage Support #33 Andrea Arcangeli
  2010-12-15 23:55 ` Andrew Morton
@ 2010-12-16  0:54 ` KAMEZAWA Hiroyuki
  2010-12-16  1:10   ` Daisuke Nishimura
  2010-12-16  1:18   ` Andrew Morton
  2010-12-20 11:16 ` Transparent Hugepage Support #33 Mel Gorman
  2 siblings, 2 replies; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-16  0:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Chris Mason, Borislav Petkov,
	Miklos Szeredi

On Wed, 15 Dec 2010 06:15:40 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> Some of the relevant users of the project:
> 
> KVM Virtualization
> GCC (kernel build included, requires a few-line patch to enable)
> JVM
> VMware Workstation
> HPC
> 
> It would be great if it could go in -mm.

Things that should be done in the memory cgroup are:
 
 - make accounting correct (RSS count will be broken)
 - make move_charge() work
   (at rmdir(), this is now broken. move-charge-at-task-move seems to work)

Do you know of any other issues? I'll look into it when -mm is shipped.


Thanks,
-Kame


* Re: Transparent Hugepage Support #33
  2010-12-16  0:54 ` Transparent Hugepage Support #33 KAMEZAWA Hiroyuki
@ 2010-12-16  1:10   ` Daisuke Nishimura
  2010-12-16  2:13     ` Andrea Arcangeli
  2010-12-16  1:18   ` Andrew Morton
  1 sibling, 1 reply; 13+ messages in thread
From: Daisuke Nishimura @ 2010-12-16  1:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Arcangeli, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Chris Mason, Borislav Petkov, Miklos Szeredi, Daisuke Nishimura

Hi,

On Thu, 16 Dec 2010 09:54:08 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 15 Dec 2010 06:15:40 +0100
> Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> > Some of the relevant users of the project:
> > 
> > KVM Virtualization
> > GCC (kernel build included, requires a few-line patch to enable)
> > JVM
> > VMware Workstation
> > HPC
> > 
> > It would be great if it could go in -mm.
> 
> Things that should be done in the memory cgroup are:
>  
>  - make accounting correct (RSS count will be broken)
>  - make move_charge() work
>    (at rmdir(), this is now broken. move-charge-at-task-move seems to work)
> 
Yes.
I think we should add mem_cgroup_split_hugepage_commit() and add a PageTransHuge()
check in mem_cgroup_move_parent(), as done in the RHEL6 kernel.
As for move-charge-at-task-move, it will work because walk_pmd_range() splits
THP pages (it would be better to change move-charge not to split THP pages, but
it's not so urgent IMHO).

> Do you know of any other issues?
Not yet, but I'll test and check.

> I'll look into it when -mm is shipped.
> 
me too :)


Thanks,
Daisuke Nishimura.


* Re: Transparent Hugepage Support #33
  2010-12-16  0:54 ` Transparent Hugepage Support #33 KAMEZAWA Hiroyuki
  2010-12-16  1:10   ` Daisuke Nishimura
@ 2010-12-16  1:18   ` Andrew Morton
  2010-12-16  2:02     ` linux-next early user mode crash (Was: Re: Transparent Hugepage Support #33) Stephen Rothwell
  1 sibling, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2010-12-16  1:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Arcangeli, linux-mm, Linus Torvalds, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Chris Mason, Borislav Petkov,
	Miklos Szeredi, Paul E. McKenney

On Thu, 16 Dec 2010 09:54:08 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> I'll look into it when -mm is shipped.

That might take a while - linux-next is a screwed-up catastrophe and I
suppose some sucker has some bisecting to do.

(The second trace below looks similar to https://bugzilla.kernel.org/show_bug.cgi?id=24942)

[  241.227687] INFO: task modprobe:904 blocked for more than 120 seconds.
[  241.227979] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  241.228264] modprobe        D 0000000000000007     0   904      1 0x00000000
[  241.228525]  ffff880255cbdc48 0000000000000046 ffff88009edd1dd8 ffff88025736e880
[  241.228973]  ffff880257a508c0 ffff88025736ebd8 0000000000000002 0000000100000000
[  241.229421]  0000000000000002 0000000000000000 ffff88009edd1dd8 0000000000000000
[  241.229879] Call Trace:
[  241.230043]  [<ffffffff81391496>] schedule_timeout+0x24/0x1b6
[  241.230202]  [<ffffffff81391293>] ? wait_for_common+0x3a/0x129
[  241.230364]  [<ffffffff8105e1ca>] ? trace_hardirqs_on+0xd/0xf
[  241.230522]  [<ffffffff81391322>] wait_for_common+0xc9/0x129
[  241.230681]  [<ffffffff810317d1>] ? default_wake_function+0x0/0xf
[  241.230850]  [<ffffffff8139141c>] wait_for_completion+0x18/0x1a
[  241.231010]  [<ffffffff8107e7bb>] synchronize_sched+0x51/0x58
[  241.231169]  [<ffffffff8104d3d0>] ? wakeme_after_rcu+0x0/0xf
[  241.231329]  [<ffffffff8106a772>] load_module+0xd4e/0xe81
[  241.231489]  [<ffffffff8106a8e5>] sys_init_module+0x40/0x1d7
[  241.231658]  [<ffffffff810029bb>] system_call_fastpath+0x16/0x1b
[  241.231831] INFO: lockdep is turned off.

and

[  271.500616] INFO: rcu_sched_state detected stall on CPU 5 (t=65032 jiffies)
[  271.500616] sending NMI to all CPUs:
[  271.500954] NMI backtrace for cpu 2
[  271.501110] CPU 2 
[  271.501157] Modules linked in: ipv6 dm_mirror dm_region_hash dm_log dm_multipath dm_mod video sbs sbshc battery ac lp parport sg snd_hda_intel snd_hda_codec snd_seq_oss snd_seq_midi_event snd_seq ide_cd_mod serio_raw snd_seq_device snd_pcm_oss shpchp cdrom option usb_wwan snd_mixer_oss snd_pcm usbserial snd_timer snd i2c_i801 soundcore button floppy i2c_core intel_rng(-) snd_page_alloc pcspkr ehci_hcd ohci_hcd uhci_hcd
[  271.503961] 
[  271.504122] Pid: 0, comm: kworker/0:1 Tainted: G        W   2.6.37-rc5-mm1 #1 /
[  271.504403] RIP: 0010:[<ffffffff81009c9b>]  [<ffffffff81009c9b>] mwait_idle+0x76/0x82
[  271.504662] RSP: 0018:ffff880257967f08  EFLAGS: 00000246
[  271.504662] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  271.504662] RDX: 0000000000000000 RSI: ffff880257966010 RDI: ffffffff81009c91
[  271.504662] RBP: ffff880257967f18 R08: 0000000000000000 R09: 0000000000000001
[  271.504662] R10: ffffffff8102b7d4 R11: ffffffff81396dcc R12: 0000000000000000
[  271.504662] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  271.504662] FS:  0000000000000000(0000) GS:ffff88009e200000(0000) knlGS:0000000000000000
[  271.504662] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  271.504662] CR2: 0000003e5f0948f0 CR3: 000000000179b000 CR4: 00000000000006e0
[  271.504662] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  271.504662] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  271.504662] Process kworker/0:1 (pid: 0, threadinfo ffff880257966000, task ffff8802579643c0)
[  271.504662] Stack:
[  271.504662]  0000000000000000 0000000000000002 ffff880257967f28 ffffffff810014cf
[  271.504662]  ffff880257967f48 ffffffff8138c3e8 ffffffff8138c25d 0000000000000000
[  271.504662]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  271.504662] Call Trace:
[  271.504662]  [<ffffffff810014cf>] cpu_idle+0x48/0x68
[  271.504662]  [<ffffffff8138c3e8>] start_secondary+0x18b/0x18f
[  271.504662]  [<ffffffff8138c25d>] ? start_secondary+0x0/0x18f
[  271.504662] Code: 31 db 48 89 f0 48 89 d9 48 89 da 0f 01 c8 0f ae f0 48 8b 87 38 e0 ff ff a8 08 75 11 e8 2c 45 05 00 48 89 d8 48 89 d9 fb 0f 01 c9 <eb> 06 e8 1b 45 05 00 fb 58 5b c9 c3 55 ba e8 12 00 00 48 89 e5 
[  271.504662] Call Trace:
[  271.504662]  [<ffffffff810014cf>] cpu_idle+0x48/0x68
[  271.504662]  [<ffffffff8138c3e8>] start_secondary+0x18b/0x18f
[  271.504662]  [<ffffffff8138c25d>] ? start_secondary+0x0/0x18f
[  271.504662] Pid: 0, comm: kworker/0:1 Tainted: G        W   2.6.37-rc5-mm1 #1
[  271.504662] Call Trace:
[  271.504662]  <NMI>  [<ffffffff8139529d>] ? arch_trigger_all_cpu_backtrace_handler+0x64/0x80
[  271.504662]  [<ffffffff81396d97>] ? notifier_call_chain+0x81/0xb6
[  271.504662]  [<ffffffff81396e27>] ? __atomic_notifier_call_chain+0x5b/0x84
[  271.504662]  [<ffffffff81396dcc>] ? __atomic_notifier_call_chain+0x0/0x84
[  271.504662]  [<ffffffff81396e5f>] ? atomic_notifier_call_chain+0xf/0x11
[  271.504662]  [<ffffffff81396e8f>] ? notify_die+0x2e/0x30
[  271.504662]  [<ffffffff8139454d>] ? do_nmi+0xa7/0x2a1
[  271.504662]  [<ffffffff8139424a>] ? nmi+0x1a/0x2c
[  271.504662]  [<ffffffff81396dcc>] ? __atomic_notifier_call_chain+0x0/0x84
[  271.504662]  [<ffffffff8102b7d4>] ? finish_task_switch+0x44/0xb8
[  271.504662]  [<ffffffff81009c91>] ? mwait_idle+0x6c/0x82
[  271.504662]  [<ffffffff81009c9b>] ? mwait_idle+0x76/0x82
[  271.504662]  <<EOE>>  [<ffffffff810014cf>] ? cpu_idle+0x48/0x68
[  271.504662]  [<ffffffff8138c3e8>] ? start_secondary+0x18b/0x18f
[  271.504662]  [<ffffffff8138c25d>] ? start_secondary+0x0/0x18f
[  271.500616] NMI backtrace for cpu 5
[  271.500616] CPU 5 
[  271.500616] Modules linked in: ipv6 dm_mirror dm_region_hash dm_log dm_multipath dm_mod video sbs sbshc battery ac lp parport sg snd_hda_intel snd_hda_codec snd_seq_oss snd_seq_midi_event snd_seq ide_cd_mod serio_raw snd_seq_device snd_pcm_oss shpchp cdrom option usb_wwan snd_mixer_oss snd_pcm usbserial snd_timer snd i2c_i801 soundcore button floppy i2c_core intel_rng(-) snd_page_alloc pcspkr ehci_hcd ohci_hcd uhci_hcd
[  271.500616] 
[  271.500616] Pid: 0, comm: kworker/0:1 Tainted: G        W   2.6.37-rc5-mm1 #1 /
[  271.500616] RIP: 0010:[<ffffffff8119b624>]  [<ffffffff8119b624>] __bitmap_empty+0x5a/0x63
[  271.500616] RSP: 0018:ffff88009e803e90  EFLAGS: 00000046
[  271.500616] RAX: 0000000000000000 RBX: 0000000000002710 RCX: ffffffff8180e4e8
[  271.500616] RDX: 0000000000000000 RSI: 00000000000000ff RDI: ffffffff8180e4e0
[  271.500616] RBP: ffff88009e803e98 R08: 0000000000000003 R09: 0000000000000000
[  271.500616] R10: 0000000000000000 R11: ffff88025589aec0 R12: ffff88009e9ce760
[  271.500616] R13: ffffffff817b3080 R14: 0000000000000000 R15: ffffffff817b3180
[  271.500616] FS:  0000000000000000(0000) GS:ffff88009e800000(0000) knlGS:0000000000000000
[  271.500616] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  271.500616] CR2: 00000000008cfb80 CR3: 0000000255941000 CR4: 00000000000006e0
[  271.500616] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  271.500616] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  271.500616] Process kworker/0:1 (pid: 0, threadinfo ffff8802579e8000, task ffff8802579e66c0)
[  271.500616] Stack:
[  271.500616]  ffff88025589aec0 ffff88009e803eb8 ffffffff8101a53c ffff8802579e66c0
[  271.500616]  0000000000000005 ffff88009e803ef8 ffffffff8107ec29 ffff8802579e66c0
[  271.500616]  0000000000000005 0000000000000005 ffff8802579e66c0 0000000000000000
[  271.500616] Call Trace:
[  271.500616]  <IRQ> 
[  271.500616]  [<ffffffff8101a53c>] arch_trigger_all_cpu_backtrace+0x52/0x6a
[  271.500616]  [<ffffffff8107ec29>] __rcu_pending+0x7e/0x2f0
[  271.500616]  [<ffffffff8107ef1d>] rcu_check_callbacks+0x82/0xb3
[  271.500616]  [<ffffffff8104275f>] update_process_times+0x38/0x6e
[  271.500616]  [<ffffffff8105a0f8>] tick_periodic+0x63/0x6f
[  271.500616]  [<ffffffff8105a122>] tick_handle_periodic+0x1e/0x6b
[  271.500616]  [<ffffffff81019a37>] smp_apic_timer_interrupt+0x83/0x96
[  271.500616]  [<ffffffff810033d3>] apic_timer_interrupt+0x13/0x20
[  271.500616]  <EOI> 
[  271.500616]  [<ffffffff81396dcc>] ? __atomic_notifier_call_chain+0x0/0x84
[  271.500616]  [<ffffffff8102b7d4>] ? finish_task_switch+0x44/0xb8
[  271.500616]  [<ffffffff81009c91>] ? mwait_idle+0x6c/0x82
[  271.500616]  [<ffffffff81009c9b>] ? mwait_idle+0x76/0x82
[  271.500616]  [<ffffffff81009c91>] ? mwait_idle+0x6c/0x82
[  271.500616]  [<ffffffff810014cf>] cpu_idle+0x48/0x68
[  271.500616]  [<ffffffff8138c3e8>] start_secondary+0x18b/0x18f
[  271.500616]  [<ffffffff8138c25d>] ? start_secondary+0x0/0x18f
[  271.500616] Code: 89 f0 83 e0 3f 85 c0 74 24 89 f0 4c 63 c2 b9 40 00 00 00 99 f7 f9 b8 01 00 00 00 89 d1 48 d3 e0 48 ff c8 4a 85 04 c7 74 04 31 c0 <eb> 05 b8 01 00 00 00 c9 c3 55 ba 40 00 00 00 89 f1 48 89 e5 53 


* linux-next early user mode crash (Was: Re: Transparent Hugepage Support #33)
  2010-12-16  1:18   ` Andrew Morton
@ 2010-12-16  2:02     ` Stephen Rothwell
  2010-12-16  5:29       ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Stephen Rothwell @ 2010-12-16  2:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Andrea Arcangeli, linux-mm, Linus Torvalds,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov, Miklos Szeredi,
	Paul E. McKenney

Hi Andrew,

On Wed, 15 Dec 2010 17:18:09 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
>
> That might take a while - linux-next is a screwed-up catastrophe and I
> suppose some sucker has some bisecting to do.

Yeah, all 6 of my boot tests failed last night.  This from a machine with
2G of memory (early after starting user mode):

pidof invoked oom-killer: gfp_mask=0x840d0, order=0, oom_adj=0, oom_score_adj=0
pidof cpuset=/ mems_allowed=0
Call Trace:
[c000000001c62fc0] [c000000000012214] .show_stack+0x7c/0x184 (unreliable)
[c000000001c63070] [c000000000129380] .dump_header.clone.2+0xd0/0x230
[c000000001c63170] [c00000000012955c] .oom_kill_process.clone.0+0x7c/0x304
[c000000001c63250] [c000000000129c78] .out_of_memory+0x494/0x54c
[c000000001c63340] [c00000000012ecb8] .__alloc_pages_nodemask+0x550/0x714
[c000000001c634c0] [c0000000001690f8] .alloc_pages_current+0xc4/0x104
[c000000001c63560] [c00000000016ea70] .new_slab+0xdc/0x2c8
[c000000001c63600] [c00000000016ef5c] .__slab_alloc+0x300/0x484
[c000000001c636d0] [c000000000170754] .kmem_cache_alloc+0x88/0x17c
[c000000001c63780] [c0000000001db3b4] .proc_alloc_inode+0x30/0xa8
[c000000001c63820] [c000000000196fe8] .alloc_inode+0x48/0xf8
[c000000001c638b0] [c0000000001974e0] .new_inode+0x28/0xa8
[c000000001c63930] [c0000000001dd0e8] .proc_pid_make_inode+0x24/0xe8
[c000000001c639d0] [c0000000001e0980] .proc_pid_instantiate+0x2c/0x104
[c000000001c63a60] [c0000000001dca1c] .proc_fill_cache+0x104/0x1f4
[c000000001c63b40] [c0000000001e1180] .proc_pid_readdir+0x134/0x228
[c000000001c63c30] [c0000000001dc2a8] .proc_root_readdir+0x58/0x78
[c000000001c63cc0] [c00000000018d778] .vfs_readdir+0xa4/0x108
[c000000001c63d70] [c00000000018d964] .SyS_getdents+0x84/0x128
[c000000001c63e30] [c000000000008628] syscall_exit+0x0/0x40
Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:  186, btch:  31 usd:  24
CPU    1: hi:  186, btch:  31 usd:  60
active_anon:204 inactive_anon:15 isolated_anon:0
 active_file:0 inactive_file:0 isolated_file:0
 unevictable:7032 dirty:0 writeback:0 unstable:0
 free:1425 slab_reclaimable:34092 slab_unreclaimable:309770
 mapped:380 shmem:19 pagetables:20 bounce:0
Node 0 DMA free:5700kB min:5752kB low:7188kB high:8628kB active_anon:816kB inactive_anon:60kB active_file:0kB inactive_file:0kB unevictable:28128kB isolated(anon):0kB isolated(file):0kB present:2068480kB mlocked:0kB dirty:0kB writeback:0kB mapped:1520kB shmem:76kB slab_reclaimable:136368kB slab_unreclaimable:1239080kB kernel_stack:612912kB pagetables:80kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:14 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
Node 0 DMA: 81*4kB 84*8kB 36*16kB 15*32kB 11*64kB 23*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 5700kB
7072 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap  = 0kB
Total swap = 0kB
524288 pages RAM
15987 pages reserved
623 pages shared
391424 pages non-shared
[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[ 1913]     0  1913     1290      365   1       0             0 plymouthd
[ 7152]     0  7152      617      144   1       0             0 pidof
Out of memory: Kill process 1913 (plymouthd) score 1 or sacrifice child
Killed process 1913 (plymouthd) total-vm:5160kB, anon-rss:392kB, file-rss:1068kB

it went on to say this:

Kernel panic - not syncing: Out of memory and no killable processes...

Next-20101214 booted fine.
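
For reference, the figures in the report above can be cross-checked with a little arithmetic. The sketch below is only a reading of the dump, not a diagnosis: it sums the per-order free list and computes how much of the present memory is accounted for by unreclaimable slab plus kernel stacks. The struct and function names are made up for the illustration; the numbers are copied from the log.

```c
#include <assert.h>
#include <stddef.h>

/* One entry of the per-order free list: `count` blocks of `block_kb` each. */
struct free_block {
	unsigned long count;
	unsigned long block_kb;
};

/* Non-zero entries copied from the "Node 0 DMA:" line of the report. */
static const struct free_block node0_dma[] = {
	{ 81, 4 }, { 84, 8 }, { 36, 16 }, { 15, 32 }, { 11, 64 }, { 23, 128 },
};

/* Sum count * block size over all listed orders, in kB. */
static unsigned long free_list_total_kb(const struct free_block *fl, size_t n)
{
	unsigned long total = 0;

	for (size_t i = 0; i < n; i++)
		total += fl[i].count * fl[i].block_kb;
	return total;
}

/* Integer percentage of part_kb within whole_kb (truncated). */
static unsigned long percent_of(unsigned long part_kb, unsigned long whole_kb)
{
	assert(whole_kb != 0);
	return part_kb * 100 / whole_kb;
}
```

With the values from the log, the free list sums to 5700kB, which is just under the 5752kB min watermark, and unreclaimable slab plus kernel stacks (1239080kB + 612912kB) come to roughly 89% of the 2068480kB present.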
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/


* Re: Transparent Hugepage Support #33
  2010-12-16  1:10   ` Daisuke Nishimura
@ 2010-12-16  2:13     ` Andrea Arcangeli
  0 siblings, 0 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2010-12-16  2:13 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: KAMEZAWA Hiroyuki, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Chris Mason, Borislav Petkov, Miklos Szeredi

Hi Daisuke and Kame,

On Thu, Dec 16, 2010 at 10:10:53AM +0900, Daisuke Nishimura wrote:
> Hi,
> 
> On Thu, 16 Dec 2010 09:54:08 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Wed, 15 Dec 2010 06:15:40 +0100
> > Andrea Arcangeli <aarcange@redhat.com> wrote:
> > 
> > > Some of some relevant user of the project:
> > > 
> > > KVM Virtualization
> > > GCC (kernel build included, requires a few liner patch to enable)
> > > JVM
> > > VMware Workstation
> > > HPC
> > > 
> > > It would be great if it could go in -mm.
> > 
> > Things should be done in memory cgroup is
> >  
> >  - make accounting correct (RSS count will be broken)
> >  - make move_charge() to work
> >    (at rmdir(), this is now broken. It seems move-charge-at-task-move to work)
> > 
> Yes.
> I think we should add mem_cgroup_split_hugepage_commit() and add PageTransHuge()
> check in mem_cgroup_move_parent() as done in RHEL6 kernel.

Yes, unfortunately porting all the RHEL6 THP cgroups bits wasn't
trivial because of the difference in the cgroup code.

> As for move-charge-at-task-move, it will work because walk_pmd_range() splits
> THP pages(it would be better to change move-charge not to split THP pages, but
> it's not so urgent IMHO).
> 
> > Do you have known other viewpoints ?
> Not yet, but I'll test and check.

Same here.

One detail I'd ask you to check is the compound_trans_order I added in
#33 for memory-failure and cgroups. It's not really necessary in memcg
if we stop reading the order and we do page_size = HPAGE_PMD_SIZE
instead. I thought having the cgroup code handling compound pages
without hardwiring the size was better but maybe it's not. Maybe the
compound_lock locking should also be extended there? It's up to you
which you prefer there, but I'll try to help as much as I can.

BTW, now that it's in -mm I'll keep any further changes incremental at
the end and I'll stop rebasing to avoid confusion.

> > I'll look into when -mm is shipped.
> > 
> me too :)

Thanks a lot!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* kvm mmu transparent hugepage support for linux-next
  2010-12-15 23:55 ` Andrew Morton
@ 2010-12-16  2:35   ` Andrea Arcangeli
  0 siblings, 0 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2010-12-16  2:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Linus Torvalds, linux-kernel, Marcelo Tosatti,
	Adam Litke, Avi Kivity, Hugh Dickins, Rik van Riel, Mel Gorman,
	Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Chris Mason, Borislav Petkov,
	Miklos Szeredi, Gleb Natapov

Hi Andrew,

On Wed, Dec 15, 2010 at 03:55:45PM -0800, Andrew Morton wrote:
> On Wed, 15 Dec 2010 06:15:40 +0100
> Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> > Some of some relevant user of the project:
> > 
> > KVM Virtualization
> > GCC (kernel build included, requires a few liner patch to enable)
> > JVM
> > VMware Workstation
> > HPC
> > 
> > It would be great if it could go in -mm.
> 
> That all merged pretty easily on top of the current mm pile.  Except
> for kvm-mmu-transparent-hugepage-support.patch which needs some thought
> and testing to get it merged into the KVM changes in linux-next.  I
> simply omitted kvm-mmu-transparent-hugepage-support.patch so please
> take a look?

Ok, I have an untested patch as a full replacement for
kvm-mmu-transparent-hugepage-support.patch, for linux-next. It's
untested because I didn't even try to boot linux-next after reading
your last mail about it. In the meantime I'd appreciate review from
Marcelo.

For Marcelo: before we were calling gup and checking if the pfn was
part of a compound page, and we were returning the right "level" from
inside mapping_level(). Now mapping_level is only left to detect
hugetlbfs. So if hugetlbfs isn't detected, _after_ gfn_to_pfn runs, we
check if the pfn is part of a trans compound page. If it is, we adjust
pfn/gfn after the fact before invoking spte establishment. It should
be functionally equivalent to the previous version and it eliminates
one unnecessary gfn_to_pfn/gup invocation compared to the previous
code. I had to rewrite it to adjust after the fact (async page fault)
to avoid invalidating async page faults (or to avoid handling async
page faults inside mapping_level itself which would litter its
interface and make it a lot more complex). If we're allowed to adjust
after the fact, this is simpler, more efficient, and it'll live happily
with the async page faults. Note: I didn't adjust the guest virtual
address as I don't think it needs adjustment. Let me know if you see
something wrong with this, thanks! (The good thing is, if something's
wrong we'll notice it very quickly, as soon as we can test it :)
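
To make the "adjust after the fact" step concrete, here is a minimal user-space sketch of the alignment arithmetic. It assumes 4K base pages and 2M hugepages (512 small pages per hugepage); `frame_t`, `hpage_base()` and `adjust_to_hpage()` are hypothetical names for the illustration, and the same-offset assertion is an assumption of the sketch rather than something the patch below checks.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4K base pages and 2M hugepages, i.e. 512 small pages each. */
#define PAGES_PER_HPAGE 512ULL

typedef uint64_t frame_t;	/* stand-in for gfn_t / pfn_t */

/* Round a frame number down to the first frame of its hugepage. */
static frame_t hpage_base(frame_t frame)
{
	return frame & ~(PAGES_PER_HPAGE - 1);
}

/*
 * Align gfn and pfn together. Mapping both with one large entry only
 * makes sense if they share the same offset inside their hugepage;
 * that is asserted here as an assumption of this sketch.
 */
static void adjust_to_hpage(frame_t *gfn, frame_t *pfn)
{
	assert((*gfn & (PAGES_PER_HPAGE - 1)) == (*pfn & (PAGES_PER_HPAGE - 1)));
	*gfn = hpage_base(*gfn);
	*pfn = hpage_base(*pfn);
}
```

For example, a gfn of 0x12345 and a pfn of 0x9a345 (both at offset 0x145 inside their hugepage) become 0x12200 and 0x9a200.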

=========
Subject: kvm mmu transparent hugepage support

From: Andrea Arcangeli <aarcange@redhat.com>

This should work for both hugetlbfs and transparent hugepages.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index bdb9fa9..22062b2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2286,6 +2286,18 @@ static int kvm_handle_bad_page(struct kvm *kvm, gfn_t gfn, pfn_t pfn)
 	return 1;
 }
 
+static void transparent_hugepage_adjust(gfn_t *gfn, pfn_t *pfn, int *level)
+{
+	/* check if it's a transparent hugepage */
+	if (!is_error_pfn(*pfn) && !kvm_is_mmio_pfn(*pfn) &&
+	    *level == PT_PAGE_TABLE_LEVEL &&
+	    PageTransCompound(pfn_to_page(*pfn))) {
+		*level = PT_DIRECTORY_LEVEL;
+		*gfn = *gfn & ~(KVM_PAGES_PER_HPAGE(*level) - 1);
+		*pfn = *pfn & ~(KVM_PAGES_PER_HPAGE(*level) - 1);
+	}
+}
+
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn,
 			 gva_t gva, pfn_t *pfn, bool write, bool *writable);
 
@@ -2314,6 +2326,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn,
 
 	if (try_async_pf(vcpu, no_apf, gfn, v, &pfn, write, &map_writable))
 		return 0;
+	transparent_hugepage_adjust(&gfn, &pfn, &level);
 
 	/* mmio */
 	if (is_error_pfn(pfn))
@@ -2676,6 +2689,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 
 	if (try_async_pf(vcpu, no_apf, gfn, gpa, &pfn, write, &map_writable))
 		return 0;
+	transparent_hugepage_adjust(&gfn, &pfn, &level);
 
 	/* mmio */
 	if (is_error_pfn(pfn))
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 590bf12..bc91891 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -575,6 +575,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	if (try_async_pf(vcpu, no_apf, walker.gfn, addr, &pfn, write_fault,
 			 &map_writable))
 		return 0;
+	transparent_hugepage_adjust(&walker.gfn, &pfn, &level);
 
 	/* mmio */
 	if (is_error_pfn(pfn))
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fb93ff9..4fa0121 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -103,8 +103,36 @@ static pfn_t fault_pfn;
 inline int kvm_is_mmio_pfn(pfn_t pfn)
 {
 	if (pfn_valid(pfn)) {
-		struct page *page = compound_head(pfn_to_page(pfn));
-		return PageReserved(page);
+		struct page *head;
+		struct page *tail = pfn_to_page(pfn);
+		head = compound_head(tail);
+		if (head != tail) {
+			smp_rmb();
+			/*
+			 * head may be a dangling pointer.
+			 * __split_huge_page_refcount clears PageTail
+			 * before overwriting first_page, so if
+			 * PageTail is still there it means the head
+			 * pointer isn't dangling.
+			 */
+			if (PageTail(tail)) {
+				/*
+				 * the "head" is not a dangling
+				 * pointer but the hugepage may have
+				 * been split from under us (and we
+				 * may not hold a reference count on
+				 * the head page so it can be reused
+				 * before we run PageReserved), so
+				 * we have to recheck PageTail before
+				 * returning what we just read.
+				 */
+				int reserved = PageReserved(head);
+				smp_rmb();
+				if (PageTail(tail))
+					return reserved;
+			}
+		}
+		return PageReserved(tail);
 	}
 
 	return true;
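
The check, read, recheck pattern used in kvm_is_mmio_pfn() above can be modelled in user space roughly as follows. This is a single-threaded sketch under loud assumptions: `struct fake_page` is a made-up stand-in for struct page, and a C11 acquire fence is used as an approximation of smp_rmb(). It shows the control flow only and is not a substitute for the kernel primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct page. */
struct fake_page {
	atomic_bool tail;			/* models PageTail() */
	atomic_bool reserved;			/* models PageReserved() */
	_Atomic(struct fake_page *) head;	/* models first_page, may go stale */
};

/* Approximation of smp_rmb(): order earlier loads before later loads. */
static void read_barrier(void)
{
	atomic_thread_fence(memory_order_acquire);
}

static bool page_is_reserved(struct fake_page *page)
{
	struct fake_page *head;

	assert(page != NULL);
	head = atomic_load_explicit(&page->head, memory_order_relaxed);
	if (head != NULL && head != page) {
		read_barrier();
		/* Tail flag still set: head was not dangling when read. */
		if (atomic_load_explicit(&page->tail, memory_order_relaxed)) {
			bool reserved = atomic_load_explicit(&head->reserved,
							     memory_order_relaxed);
			read_barrier();
			/* Trust the value only if no split raced with us. */
			if (atomic_load_explicit(&page->tail, memory_order_relaxed))
				return reserved;
		}
	}
	/* Not a tail page, or split in between: use the page's own flag. */
	return atomic_load_explicit(&page->reserved, memory_order_relaxed);
}
```

The point of the second PageTail check is that the value read from the head is returned only if the page was a tail page both before and after the read; otherwise the page's own flag is used.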



* Re: linux-next early user mode crash (Was: Re: Transparent Hugepage Support #33)
  2010-12-16  2:02     ` linux-next early user mode crash (Was: Re: Transparent Hugepage Support #33) Stephen Rothwell
@ 2010-12-16  5:29       ` Paul E. McKenney
  2010-12-16  6:08         ` Stephen Rothwell
  0 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2010-12-16  5:29 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Andrea Arcangeli, linux-mm,
	Linus Torvalds, linux-kernel, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov, Miklos Szeredi

On Thu, Dec 16, 2010 at 01:02:51PM +1100, Stephen Rothwell wrote:
> Hi Andrew,
> 
> On Wed, 15 Dec 2010 17:18:09 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > That might take a while - linux-next is a screwed-up catastrophe and I
> > suppose some sucker has some bisecting to do.
> 
> Yeah, all 6 of my boot tests failed last night.  This from a machine with
> 2G of memory (early after starting user mode):
> 
> pidof invoked oom-killer: gfp_mask=0x840d0, order=0, oom_adj=0, oom_score_adj=0
> pidof cpuset=/ mems_allowed=0
> Call Trace:
> [c000000001c62fc0] [c000000000012214] .show_stack+0x7c/0x184 (unreliable)
> [c000000001c63070] [c000000000129380] .dump_header.clone.2+0xd0/0x230
> [c000000001c63170] [c00000000012955c] .oom_kill_process.clone.0+0x7c/0x304
> [c000000001c63250] [c000000000129c78] .out_of_memory+0x494/0x54c
> [c000000001c63340] [c00000000012ecb8] .__alloc_pages_nodemask+0x550/0x714
> [c000000001c634c0] [c0000000001690f8] .alloc_pages_current+0xc4/0x104
> [c000000001c63560] [c00000000016ea70] .new_slab+0xdc/0x2c8
> [c000000001c63600] [c00000000016ef5c] .__slab_alloc+0x300/0x484
> [c000000001c636d0] [c000000000170754] .kmem_cache_alloc+0x88/0x17c
> [c000000001c63780] [c0000000001db3b4] .proc_alloc_inode+0x30/0xa8
> [c000000001c63820] [c000000000196fe8] .alloc_inode+0x48/0xf8
> [c000000001c638b0] [c0000000001974e0] .new_inode+0x28/0xa8
> [c000000001c63930] [c0000000001dd0e8] .proc_pid_make_inode+0x24/0xe8
> [c000000001c639d0] [c0000000001e0980] .proc_pid_instantiate+0x2c/0x104
> [c000000001c63a60] [c0000000001dca1c] .proc_fill_cache+0x104/0x1f4
> [c000000001c63b40] [c0000000001e1180] .proc_pid_readdir+0x134/0x228
> [c000000001c63c30] [c0000000001dc2a8] .proc_root_readdir+0x58/0x78
> [c000000001c63cc0] [c00000000018d778] .vfs_readdir+0xa4/0x108
> [c000000001c63d70] [c00000000018d964] .SyS_getdents+0x84/0x128
> [c000000001c63e30] [c000000000008628] syscall_exit+0x0/0x40
> Mem-Info:
> Node 0 DMA per-cpu:
> CPU    0: hi:  186, btch:  31 usd:  24
> CPU    1: hi:  186, btch:  31 usd:  60
> active_anon:204 inactive_anon:15 isolated_anon:0
>  active_file:0 inactive_file:0 isolated_file:0
>  unevictable:7032 dirty:0 writeback:0 unstable:0
>  free:1425 slab_reclaimable:34092 slab_unreclaimable:309770
>  mapped:380 shmem:19 pagetables:20 bounce:0
> Node 0 DMA free:5700kB min:5752kB low:7188kB high:8628kB active_anon:816kB inactive_anon:60kB active_file:0kB inactive_file:0kB unevictable:28128kB isolated(anon):0kB isolated(file):0kB present:2068480kB mlocked:0kB dirty:0kB writeback:0kB mapped:1520kB shmem:76kB slab_reclaimable:136368kB slab_unreclaimable:1239080kB kernel_stack:612912kB pagetables:80kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:14 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0
> Node 0 DMA: 81*4kB 84*8kB 36*16kB 15*32kB 11*64kB 23*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 5700kB
> 7072 total pagecache pages
> 0 pages in swap cache
> Swap cache stats: add 0, delete 0, find 0/0
> Free swap  = 0kB
> Total swap = 0kB
> 524288 pages RAM
> 15987 pages reserved
> 623 pages shared
> 391424 pages non-shared
> [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
> [ 1913]     0  1913     1290      365   1       0             0 plymouthd
> [ 7152]     0  7152      617      144   1       0             0 pidof
> Out of memory: Kill process 1913 (plymouthd) score 1 or sacrifice child
> Killed process 1913 (plymouthd) total-vm:5160kB, anon-rss:392kB, file-rss:1068kB
> 
> it went on to say this:
> 
> Kernel panic - not syncing: Out of memory and no killable processes...
> 
> Next-20101214 booted fine.

RCU problems would normally take longer to run the system out of memory,
but who knows?

I did a push into -rcu in the suspect time frame, so have pulled it.  I am
sure that kernel.org will push this change to its mirrors at some point.
Just in case tree-by-tree bisecting is faster than commit-by-commit
bisecting.

							Thanx, Paul



* Re: linux-next early user mode crash (Was: Re: Transparent Hugepage Support #33)
  2010-12-16  5:29       ` Paul E. McKenney
@ 2010-12-16  6:08         ` Stephen Rothwell
  2010-12-16  7:00           ` Stephen Rothwell
  0 siblings, 1 reply; 13+ messages in thread
From: Stephen Rothwell @ 2010-12-16  6:08 UTC (permalink / raw)
  To: paulmck
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Andrea Arcangeli, linux-mm,
	Linus Torvalds, linux-kernel, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov, Miklos Szeredi

Hi Paul,

On Wed, 15 Dec 2010 21:29:58 -0800 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
>
> RCU problems would normally take longer to run the system out of memory,
> but who knows?
> 
> I did a push into -rcu in the suspect time frame, so have pulled it.  I am
> sure that kernel.org will push this change to its mirrors at some point.
> Just in case tree-by-tree bisecting is faster than commit-by-commit
> bisecting.

I have bisected it down to the rcu tree, so the three commits that were
added yesterday are the suspects.  I am still bisecting.  I will just
revert those three commits from linux-next today in the hope that Andrew
will end up with a working tree.
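
Whether done tree-by-tree or commit-by-commit, bisecting is a binary search for the first point at which the boot test starts failing. A minimal sketch of that search follows, assuming the history can be treated as a linear sequence in which every commit before the culprit boots and every one from it onward does not. `first_bad()` and the example predicate `boots_before_42()` are hypothetical; a real boot test takes the place of the predicate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A boot test: returns true if the tree at this commit index boots. */
typedef bool (*boots_fn)(size_t commit_index);

/*
 * Find the first bad commit in (good, bad], given that `good` boots
 * and `bad` does not. Optionally counts how many boot tests were run.
 */
static size_t first_bad(size_t good, size_t bad, boots_fn boots,
			unsigned int *tests_run)
{
	assert(good < bad);
	while (bad - good > 1) {
		size_t mid = good + (bad - good) / 2;

		if (tests_run)
			(*tests_run)++;
		if (boots(mid))
			good = mid;	/* culprit is later */
		else
			bad = mid;	/* culprit is here or earlier */
	}
	return bad;
}

/* Hypothetical stand-in for a boot test: commits 0..41 boot fine. */
static bool boots_before_42(size_t commit_index)
{
	return commit_index < 42;
}
```

Each test halves the remaining range, so about log2(n) boot tests are needed for n candidate commits; narrowing to one tree first simply shrinks n before the commit-level search starts.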

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/


* Re: linux-next early user mode crash (Was: Re: Transparent Hugepage Support #33)
  2010-12-16  6:08         ` Stephen Rothwell
@ 2010-12-16  7:00           ` Stephen Rothwell
  2010-12-16 15:11             ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Stephen Rothwell @ 2010-12-16  7:00 UTC (permalink / raw)
  To: paulmck
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Andrea Arcangeli, linux-mm,
	Linus Torvalds, linux-kernel, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov, Miklos Szeredi

Hi Paul,

On Thu, 16 Dec 2010 17:08:14 +1100 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> On Wed, 15 Dec 2010 21:29:58 -0800 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> >
> > RCU problems would normally take longer to run the system out of memory,
> > but who knows?
> > 
> > I did a push into -rcu in the suspect time frame, so have pulled it.  I am
> > sure that kernel.org will push this change to its mirrors at some point.
> > Just in case tree-by-tree bisecting is faster than commit-by-commit
> > bisecting.
> 
> I have bisected it down to the rcu tree, so the three commits that were
> added yesterday are the suspects.  I am still bisecting.  I will just
> revert those three commits from linux-next today in the hope that Andrew
> will end up with a working tree.

Bisect finished:

4e40200dab0e673b019979b5b8f5e5d1b25885c2 is first bad commit
commit 4e40200dab0e673b019979b5b8f5e5d1b25885c2
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Fri Dec 10 15:02:47 2010 -0800

    rcu: fine-tune grace-period begin/end checks
    
    Use the CPU's bit in rnp->qsmask to determine whether or not the CPU
    should try to report a quiescent state.  Handle overflow in the check
    for rdp->gpnum having fallen behind.
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

So far 4 of my 6 boot tests that failed yesterday have succeeded today
(with those last three rcu commits reverted) - the others are still
building.
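
The commit message above mentions handling overflow in the check for a counter having fallen behind. The commit's code is not shown in this thread, so the following is only a generic sketch of one common way to compare free-running counters safely across wraparound, using unsigned subtraction. The function names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * True if counter a is "before" counter b, even if the counter wrapped
 * in between. Valid as long as the two are less than half the counter
 * range apart.
 */
static bool counter_before(unsigned long a, unsigned long b)
{
	return a - b > ULONG_MAX / 2;
}

/* A few illustrative cases, including one that straddles the wrap. */
static void counter_before_examples(void)
{
	assert(counter_before(5, 7));
	assert(!counter_before(7, 5));
	assert(!counter_before(7, 7));
	/* ULONG_MAX is one step before 0 once the counter wraps. */
	assert(counter_before(ULONG_MAX, 0));
	assert(!counter_before(0, ULONG_MAX));
}
```

A plain `a < b` would get the last two cases wrong, which is the kind of mistake an overflow-aware check is meant to avoid.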
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/


* Re: linux-next early user mode crash (Was: Re: Transparent Hugepage Support #33)
  2010-12-16  7:00           ` Stephen Rothwell
@ 2010-12-16 15:11             ` Paul E. McKenney
  0 siblings, 0 replies; 13+ messages in thread
From: Paul E. McKenney @ 2010-12-16 15:11 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Andrea Arcangeli, linux-mm,
	Linus Torvalds, linux-kernel, Marcelo Tosatti, Adam Litke,
	Avi Kivity, Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	Christoph Lameter, Chris Wright, bpicco, KOSAKI Motohiro,
	Balbir Singh, Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov, Miklos Szeredi

On Thu, Dec 16, 2010 at 06:00:47PM +1100, Stephen Rothwell wrote:
> Hi Paul,
> 
> On Thu, 16 Dec 2010 17:08:14 +1100 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> >
> > On Wed, 15 Dec 2010 21:29:58 -0800 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > >
> > > RCU problems would normally take longer to run the system out of memory,
> > > but who knows?
> > > 
> > > I did a push into -rcu in the suspect time frame, so have pulled it.  I am
> > > sure that kernel.org will push this change to its mirrors at some point.
> > > Just in case tree-by-tree bisecting is faster than commit-by-commit
> > > bisecting.
> > 
> > I have bisected it down to the rcu tree, so the three commits that were
> > added yesterday are the suspects.  I am still bisecting.  I will just
> > revert those three commits from linux-next today in the hope that Andrew
> > will end up with a working tree.
> 
> Bisect finished:
> 
> 4e40200dab0e673b019979b5b8f5e5d1b25885c2 is first bad commit
> commit 4e40200dab0e673b019979b5b8f5e5d1b25885c2
> Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Date:   Fri Dec 10 15:02:47 2010 -0800
> 
>     rcu: fine-tune grace-period begin/end checks
>     
>     Use the CPU's bit in rnp->qsmask to determine whether or not the CPU
>     should try to report a quiescent state.  Handle overflow in the check
>     for rdp->gpnum having fallen behind.
>     
>     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> So far 4 of my 6 boot tests that failed yesterday have succeeded today
> (with those last three rcu commits reverted) - the others are still
> building.

So I blew it not once, but twice -- once in the patch itself, and once
in messing up my -next process.  :-/

Please accept my apologies!!!

							Thanx, Paul



* Re: Transparent Hugepage Support #33
  2010-12-15  5:15 Transparent Hugepage Support #33 Andrea Arcangeli
  2010-12-15 23:55 ` Andrew Morton
  2010-12-16  0:54 ` Transparent Hugepage Support #33 KAMEZAWA Hiroyuki
@ 2010-12-20 11:16 ` Mel Gorman
  2 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2010-12-20 11:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov, Miklos Szeredi

On Wed, Dec 15, 2010 at 06:15:40AM +0100, Andrea Arcangeli wrote:
> Some of some relevant user of the project:
> 
> KVM Virtualization
> GCC (kernel build included, requires a few liner patch to enable)
> JVM
> VMware Workstation
> HPC
> 
> It would be great if it could go in -mm.
> 

I ran some basic performance tests comparing base pages, hugetlbfs and
transparent huge pages.

STREAM (triad only)
                            base              huge         transhuge
Triad--17.0    18955.94 ( 0.00%) 18955.94 ( 0.00%) 18955.94 ( 0.00%)
Triad--17.33   19756.78 ( 0.00%) 19756.78 ( 0.00%) 19808.90 ( 0.26%)
Triad--17.66   19918.20 ( 0.00%) 19918.20 ( 0.00%) 19918.20 ( 0.00%)
Triad--18.0    19303.15 ( 0.00%) 19687.37 ( 1.95%) 19199.75 (-0.54%)
Triad--18.33   18397.44 ( 0.00%) 18556.45 ( 0.86%) 18443.83 ( 0.25%)
Triad--18.66   18917.43 ( 0.00%) 19088.28 ( 0.90%) 18865.09 (-0.28%)
Triad--19.0    16338.07 ( 0.00%) 18794.78 (13.07%) 16380.81 ( 0.26%)
Triad--19.33   11402.08 ( 0.00%) 11387.21 (-0.13%) 11226.44 (-1.56%)
Triad--19.66    9654.13 ( 0.00%)  9516.96 (-1.44%)  9666.16 ( 0.12%)
Triad--20.0     9556.79 ( 0.00%)  9572.48 ( 0.16%)  9573.63 ( 0.18%)
Triad--20.33    9553.81 ( 0.00%)  9524.22 (-0.31%)  9552.19 (-0.02%)
Triad--20.66    9504.67 ( 0.00%)  9504.67 ( 0.00%)  9509.61 ( 0.05%)
Triad--21.0     9500.04 ( 0.00%)  9538.13 ( 0.40%)  9501.06 ( 0.01%)
Triad--21.33    9355.53 ( 0.00%)  9511.82 ( 1.64%)  9391.13 ( 0.38%)
Triad--21.66    9310.97 ( 0.00%)  9535.04 ( 2.35%)  9459.83 ( 1.57%)
Triad--22.0     9264.88 ( 0.00%)  9521.61 ( 2.70%)  9512.85 ( 2.61%)
Triad--22.33    9197.81 ( 0.00%)  9505.28 ( 3.23%)  9442.67 ( 2.59%)
Triad--22.66    8535.29 ( 0.00%)  8965.94 ( 4.80%)  8839.97 ( 3.45%)
Triad--23.0     7158.25 ( 0.00%)  7462.07 ( 4.07%)  7373.10 ( 2.91%)
Triad--23.33    5659.50 ( 0.00%)  5708.15 ( 0.85%)  5695.34 ( 0.63%)
Triad--23.66    5191.97 ( 0.00%)  5200.99 ( 0.17%)  5175.16 (-0.32%)
Triad--24.0     4960.82 ( 0.00%)  5038.79 ( 1.55%)  5017.61 ( 1.13%)
Triad--24.33    4734.72 ( 0.00%)  4767.03 ( 0.68%)  4752.25 ( 0.37%)
Triad--24.66    4694.59 ( 0.00%)  4687.10 (-0.16%)  4698.72 ( 0.09%)
Triad--25.0     4701.91 ( 0.00%)  4823.23 ( 2.52%)  4759.94 ( 1.22%)
Triad--25.33    4664.94 ( 0.00%)  4748.64 ( 1.76%)  4690.97 ( 0.55%)
Triad--25.66    4670.35 ( 0.00%)  4751.30 ( 1.70%)  4706.59 ( 0.77%)
Triad--26.0     4704.77 ( 0.00%)  4814.09 ( 2.27%)  4788.46 ( 1.75%)
Triad--26.33    4702.14 ( 0.00%)  4707.05 ( 0.10%)  4677.77 (-0.52%)
Triad--26.66    4668.22 ( 0.00%)  4682.79 ( 0.31%)  4671.49 ( 0.07%)
Triad--27.0     4728.34 ( 0.00%)  4807.55 ( 1.65%)  4794.87 ( 1.39%)
Triad--27.33    4722.43 ( 0.00%)  4765.43 ( 0.90%)  4757.13 ( 0.73%)
Triad--27.66    4721.08 ( 0.00%)  4748.82 ( 0.58%)  4748.01 ( 0.57%)
Triad--28.0     4720.13 ( 0.00%)  4804.78 ( 1.76%)  4792.87 ( 1.52%)
Triad--28.33    4685.32 ( 0.00%)  4674.07 (-0.24%)  4627.00 (-1.26%)
Triad--28.66    4689.31 ( 0.00%)  4690.17 ( 0.02%)  4654.35 (-0.75%)
Triad--29.0     4740.42 ( 0.00%)  4780.69 ( 0.84%)  4779.78 ( 0.82%)
Triad--29.33    4688.10 ( 0.00%)  4655.82 (-0.69%)  4722.80 ( 0.73%)
Triad--29.66    4719.65 ( 0.00%)  4670.27 (-1.06%)  4768.32 ( 1.02%)
Triad--30.0     4731.50 ( 0.00%)  4786.19 ( 1.14%)  4773.81 ( 0.89%)
Triad--30.33    4722.82 ( 0.00%)  4734.01 ( 0.24%)  4748.29 ( 0.54%)
Triad--30.66    4732.06 ( 0.00%)  4721.55 (-0.22%)  4733.16 ( 0.02%)
Triad--31.0     4756.53 ( 0.00%)  4784.76 ( 0.59%)  4767.52 ( 0.23%)

I didn't include the other operations because the results are comparable
each time. Broadly speaking, hugetlbfs does slightly better but
transparent huge pages did improve performance a small amount.

SYSBENCH 
threads                     base              huge         transhuge
1              18629.91 ( 0.00%) 19017.23 ( 2.04%) 18766.30 ( 0.73%)
2              29691.39 ( 0.00%) 30062.81 ( 1.24%) 29808.59 ( 0.39%)
3              39824.00 ( 0.00%) 40324.75 ( 1.24%) 40002.75 ( 0.45%)
4              67639.65 ( 0.00%) 69231.83 ( 2.30%) 68305.58 ( 0.97%)
5              66833.81 ( 0.00%) 68339.77 ( 2.20%) 67393.01 ( 0.83%)
6              66168.22 ( 0.00%) 67875.52 ( 2.52%) 67255.45 ( 1.62%)
7              65775.08 ( 0.00%) 67386.93 ( 2.39%) 66208.60 ( 0.65%)
8              64899.14 ( 0.00%) 66588.38 ( 2.54%) 65367.80 ( 0.72%)

In some ways this is more interesting. hugetlbfs is backing only the
shared memory segment, whereas transhuge is promoting other areas. Hence,
it's not really a like-with-like comparison but still, transparent
hugepages are pushing up performance by a small amount.

NAS-SER C Class (time, lower is better)
                            base         huge-heap         transhuge
bt.C            1389.33 ( 0.00%)  1421.64 (-2.27%)  1315.75 ( 5.59%)
cg.C             561.27 ( 0.00%)   509.38 (10.19%)   562.71 (-0.26%)
ep.C             375.78 ( 0.00%)   376.69 (-0.24%)   371.86 ( 1.05%)
ft.C             374.43 ( 0.00%)   371.73 ( 0.73%)   341.87 ( 9.52%)
is.C              17.84 ( 0.00%)    18.80 (-5.11%)    18.49 (-3.52%)
lu.C            1655.91 ( 0.00%)  1668.52 (-0.76%)  1662.25 (-0.38%)
mg.C             134.28 ( 0.00%)   136.96 (-1.96%)   128.04 ( 4.87%)
sp.C            1214.57 ( 0.00%)  1261.40 (-3.71%)  1151.98 ( 5.43%)
ua.C            1070.87 ( 0.00%)  1115.73 (-4.02%)  1048.45 ( 2.14%)

This is more of a like-with-like comparison as hugetlbfs is only backing
the heap. Results were mixed. Sometimes hugetlbfs was better and other times
transhuge was, but THP won the majority of the time.

SPECjvm huge page comparison
                            base              huge         transhuge
compiler         145.54 ( 0.00%)   156.00 ( 6.71%)   156.23 ( 6.84%)
compress         168.07 ( 0.00%)   175.15 ( 4.04%)   174.83 ( 3.87%)
crypto           164.30 ( 0.00%)   157.16 (-4.54%)   156.39 (-5.06%)
derby             53.64 ( 0.00%)    68.71 (21.93%)    58.57 ( 8.42%)
mpegaudio         81.80 ( 0.00%)    94.29 (13.25%)    92.58 (11.64%)
scimark.large     22.97 ( 0.00%)    21.43 (-7.19%)    21.59 (-6.39%)
scimark.small    119.25 ( 0.00%)   122.10 ( 2.33%)   121.44 ( 1.80%)
serial            46.93 ( 0.00%)    46.83 (-0.21%)    47.65 ( 1.51%)
sunflow           47.49 ( 0.00%)    50.03 ( 5.08%)    48.51 ( 2.10%)
xml              206.17 ( 0.00%)   211.42 ( 2.48%)   212.77 ( 3.10%)

hugetlbfs edged out transparent hugepages most of the time but
broadly speaking they were comparable in terms of performance.

Bottom-line is that overall transparent hugepages is delivering the expected
performance for this range of workloads at least. It's generally not as
good as hugetlbfs in terms of raw performance but that is hardly a surprise
considering how they both operate and what their objectives are.
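
A note on reading the percentages in the tables above: judging from the numbers (this is inferred, not stated in the mail), the gain appears to be computed relative to the compared configuration rather than the baseline, with the sign flipped for the time-based NAS results so that positive always means better. A small sketch with hypothetical function names:

```c
#include <assert.h>

/*
 * Throughput-style metric (higher is better): percentage gain of
 * `other` over `base`, relative to `other`.
 */
static double gain_higher_is_better(double base, double other)
{
	assert(other != 0.0);
	return (other - base) / other * 100.0;
}

/*
 * Time-style metric (lower is better): percentage gain of `other`
 * over `base`, relative to `other`.
 */
static double gain_lower_is_better(double base, double other)
{
	assert(other != 0.0);
	return (base - other) / other * 100.0;
}
```

For instance, the one-thread sysbench row gives (19017.23 - 18629.91) / 19017.23, about 2.04%, and cg.C gives (561.27 - 509.38) / 509.38, about 10.19%, both matching the tables.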

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


Thread overview: 13+ messages
2010-12-15  5:15 Transparent Hugepage Support #33 Andrea Arcangeli
2010-12-15 23:55 ` Andrew Morton
2010-12-16  2:35   ` kvm mmu transparent hugepage support for linux-next Andrea Arcangeli
2010-12-16  0:54 ` Transparent Hugepage Support #33 KAMEZAWA Hiroyuki
2010-12-16  1:10   ` Daisuke Nishimura
2010-12-16  2:13     ` Andrea Arcangeli
2010-12-16  1:18   ` Andrew Morton
2010-12-16  2:02     ` linux-next early user mode crash (Was: Re: Transparent Hugepage Support #33) Stephen Rothwell
2010-12-16  5:29       ` Paul E. McKenney
2010-12-16  6:08         ` Stephen Rothwell
2010-12-16  7:00           ` Stephen Rothwell
2010-12-16 15:11             ` Paul E. McKenney
2010-12-20 11:16 ` Transparent Hugepage Support #33 Mel Gorman
