Re: [PATCH v4 00/10, REBASED] Introduce huge zero page

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Ni zhan Chen <nizhan.chen@gmail.com>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	linux-mm@kvack.org, Andi Kleen <ak@linux.intel.com>,
	"H. Peter Anvin" <hpa@linux.intel.com>,
	linux-kernel@vger.kernel.org,
	"Kirill A. Shutemov" <kirill@shutemov.name>
Subject: Re: [PATCH v4 00/10, REBASED] Introduce huge zero page
Date: Tue, 16 Oct 2012 17:53:07 +0800	[thread overview]
Message-ID: <507D2E83.4010702@gmail.com> (raw)
In-Reply-To: <1350280859-18801-1-git-send-email-kirill.shutemov@linux.intel.com>

On 10/15/2012 02:00 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Hi,
>
> Andrew, here's huge zero page patchset rebased to v3.7-rc1.
>
> Andrea, I've dropped your Reviewed-by due not-so-trivial conflicts in during
> rebase. Could you look through it again. Patches 2, 3, 4, 7, 10 had conflicts.
> Mostly due new MMU notifiers interface.
>
> =================
>
> During testing I noticed big (up to 2.5 times) memory consumption overhead
> on some workloads (e.g. ft.A from NPB) if THP is enabled.
>
> The main reason for that big difference is lacking zero page in THP case.
> We have to allocate a real page on read page fault.
>
> A program to demonstrate the issue:
> #include <assert.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> #define MB 1024*1024
>
> int main(int argc, char **argv)
> {
>          char *p;
>          int i;
>
>          posix_memalign((void **)&p, 2 * MB, 200 * MB);
>          for (i = 0; i < 200 * MB; i+= 4096)
>                  assert(p[i] == 0);
>          pause();
>          return 0;
> }
>
> With thp-never RSS is about 400k, but with thp-always it's 200M.
> After the patcheset thp-always RSS is 400k too.
>
> Design overview.
>
> Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with
> zeros.  The way how we allocate it changes in the patchset:
>
> - [01/10] simplest way: hzp allocated on boot time in hugepage_init();
> - [09/10] lazy allocation on first use;
> - [10/10] lockless refcounting + shrinker-reclaimable hzp;
>
> We setup it in do_huge_pmd_anonymous_page() if area around fault address
> is suitable for THP and we've got read page fault.
> If we fail to setup hzp (ENOMEM) we fallback to handle_pte_fault() as we
> normally do in THP.
>
> On wp fault to hzp we allocate real memory for the huge page and clear it.
> If ENOMEM, graceful fallback: we create a new pmd table and set pte around
> fault address to newly allocated normal (4k) page. All other ptes in the
> pmd set to normal zero page.
>
> We cannot split hzp (and it's bug if we try), but we can split the pmd
> which points to it. On splitting the pmd we create a table with all ptes
> set to normal zero page.
>
> Patchset organized in bisect-friendly way:
>   Patches 01-07: prepare all code paths for hzp
>   Patch 08: all code paths are covered: safe to setup hzp
>   Patch 09: lazy allocation
>   Patch 10: lockless refcounting for hzp
>
> v4:
>   - Rebase to v3.7-rc1;
>   - Update commit message;
> v3:
>   - fix potential deadlock in refcounting code on preemptive kernel.
>   - do not mark huge zero page as movable.
>   - fix typo in comment.
>   - Reviewed-by tag from Andrea Arcangeli.
> v2:
>   - Avoid find_vma() if we've already had vma on stack.
>     Suggested by Andrea Arcangeli.
>   - Implement refcounting for huge zero page.
>
> --------------------------------------------------------------------------
>
> By hpa request I've tried alternative approach for hzp implementation (see
> Virtual huge zero page patchset): pmd table with all entries set to zero
> page. This way should be more cache friendly, but it increases TLB
> pressure.

Thanks for your excellent works. But could you explain me why current 
implementation not cache friendly and hpa's request cache friendly? 
Thanks in advance.

>
> The problem with virtual huge zero page: it requires per-arch enabling.
> We need a way to mark that pmd table has all ptes set to zero page.
>
> Some numbers to compare two implementations (on 4s Westmere-EX):
>
> Mirobenchmark1
> ==============
>
> test:
>          posix_memalign((void **)&p, 2 * MB, 8 * GB);
>          for (i = 0; i < 100; i++) {
>                  assert(memcmp(p, p + 4*GB, 4*GB) == 0);
>                  asm volatile ("": : :"memory");
>          }
>
> hzp:
>   Performance counter stats for './test_memcmp' (5 runs):
>
>        32356.272845 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
>                  40 context-switches          #    0.001 K/sec                    ( +-  0.94% )
>                   0 CPU-migrations            #    0.000 K/sec
>               4,218 page-faults               #    0.130 K/sec                    ( +-  0.00% )
>      76,712,481,765 cycles                    #    2.371 GHz                      ( +-  0.13% ) [83.31%]
>      36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle     ( +-  0.28% ) [83.35%]
>       1,684,049,110 stalled-cycles-backend    #    2.20% backend  cycles idle     ( +-  2.96% ) [66.67%]
>     134,355,715,816 instructions              #    1.75  insns per cycle
>                                               #    0.27  stalled cycles per insn  ( +-  0.10% ) [83.35%]
>      13,526,169,702 branches                  #  418.039 M/sec                    ( +-  0.10% ) [83.31%]
>           1,058,230 branch-misses             #    0.01% of all branches          ( +-  0.91% ) [83.36%]
>
>        32.413866442 seconds time elapsed                                          ( +-  0.13% )
>
> vhzp:
>   Performance counter stats for './test_memcmp' (5 runs):
>
>        30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
>                  38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
>                   0 CPU-migrations            #    0.000 K/sec
>               4,218 page-faults               #    0.139 K/sec                    ( +-  0.01% )
>      71,964,773,660 cycles                    #    2.373 GHz                      ( +-  0.13% ) [83.35%]
>      31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle     ( +-  0.40% ) [83.32%]
>         773,484,474 stalled-cycles-backend    #    1.07% backend  cycles idle     ( +-  6.61% ) [66.67%]
>     134,982,215,437 instructions              #    1.88  insns per cycle
>                                               #    0.23  stalled cycles per insn  ( +-  0.11% ) [83.32%]
>      13,509,150,683 branches                  #  445.447 M/sec                    ( +-  0.11% ) [83.34%]
>           1,017,667 branch-misses             #    0.01% of all branches          ( +-  1.07% ) [83.32%]
>
>        30.381324695 seconds time elapsed                                          ( +-  0.13% )

Could you tell me which data I should care in this performance counter. 
And what's the benefit of your current implementation compare to hpa's 
request?

>
> Mirobenchmark2
> ==============
>
> test:
>          posix_memalign((void **)&p, 2 * MB, 8 * GB);
>          for (i = 0; i < 1000; i++) {
>                  char *_p = p;
>                  while (_p < p+4*GB) {
>                          assert(*_p == *(_p+4*GB));
>                          _p += 4096;
>                          asm volatile ("": : :"memory");
>                  }
>          }
>
> hzp:
>   Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
>
>         3505.727639 task-clock                #    0.998 CPUs utilized            ( +-  0.26% )
>                   9 context-switches          #    0.003 K/sec                    ( +-  4.97% )
>               4,384 page-faults               #    0.001 M/sec                    ( +-  0.00% )
>       8,318,482,466 cycles                    #    2.373 GHz                      ( +-  0.26% ) [33.31%]
>       5,134,318,786 stalled-cycles-frontend   #   61.72% frontend cycles idle     ( +-  0.42% ) [33.32%]
>       2,193,266,208 stalled-cycles-backend    #   26.37% backend  cycles idle     ( +-  5.51% ) [33.33%]
>       9,494,670,537 instructions              #    1.14  insns per cycle
>                                               #    0.54  stalled cycles per insn  ( +-  0.13% ) [41.68%]
>       2,108,522,738 branches                  #  601.451 M/sec                    ( +-  0.09% ) [41.68%]
>             158,746 branch-misses             #    0.01% of all branches          ( +-  1.60% ) [41.71%]
>       3,168,102,115 L1-dcache-loads
>            #  903.693 M/sec                    ( +-  0.11% ) [41.70%]
>       1,048,710,998 L1-dcache-misses
>           #   33.10% of all L1-dcache hits    ( +-  0.11% ) [41.72%]
>       1,047,699,685 LLC-load
>                   #  298.854 M/sec                    ( +-  0.03% ) [33.38%]
>               2,287 LLC-misses
>                 #    0.00% of all LL-cache hits     ( +-  8.27% ) [33.37%]
>       3,166,187,367 dTLB-loads
>                 #  903.147 M/sec                    ( +-  0.02% ) [33.35%]
>           4,266,538 dTLB-misses
>                #    0.13% of all dTLB cache hits   ( +-  0.03% ) [33.33%]
>
>         3.513339813 seconds time elapsed                                          ( +-  0.26% )
>
> vhzp:
>   Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
>
>        27313.891128 task-clock                #    0.998 CPUs utilized            ( +-  0.24% )
>                  62 context-switches          #    0.002 K/sec                    ( +-  0.61% )
>               4,384 page-faults               #    0.160 K/sec                    ( +-  0.01% )
>      64,747,374,606 cycles                    #    2.370 GHz                      ( +-  0.24% ) [33.33%]
>      61,341,580,278 stalled-cycles-frontend   #   94.74% frontend cycles idle     ( +-  0.26% ) [33.33%]
>      56,702,237,511 stalled-cycles-backend    #   87.57% backend  cycles idle     ( +-  0.07% ) [33.33%]
>      10,033,724,846 instructions              #    0.15  insns per cycle
>                                               #    6.11  stalled cycles per insn  ( +-  0.09% ) [41.65%]
>       2,190,424,932 branches                  #   80.195 M/sec                    ( +-  0.12% ) [41.66%]
>           1,028,630 branch-misses             #    0.05% of all branches          ( +-  1.50% ) [41.66%]
>       3,302,006,540 L1-dcache-loads
>            #  120.891 M/sec                    ( +-  0.11% ) [41.68%]
>         271,374,358 L1-dcache-misses
>           #    8.22% of all L1-dcache hits    ( +-  0.04% ) [41.66%]
>          20,385,476 LLC-load
>                   #    0.746 M/sec                    ( +-  1.64% ) [33.34%]
>              76,754 LLC-misses
>                 #    0.38% of all LL-cache hits     ( +-  2.35% ) [33.34%]
>       3,309,927,290 dTLB-loads
>                 #  121.181 M/sec                    ( +-  0.03% ) [33.34%]
>       2,098,967,427 dTLB-misses
>                #   63.41% of all dTLB cache hits   ( +-  0.03% ) [33.34%]
>
>        27.364448741 seconds time elapsed                                          ( +-  0.24% )

For this case, the same question as above, thanks in adance. :-)

>
> --------------------------------------------------------------------------
>
> I personally prefer implementation present in this patchset. It doesn't
> touch arch-specific code.
>
>
> Kirill A. Shutemov (10):
>    thp: huge zero page: basic preparation
>    thp: zap_huge_pmd(): zap huge zero pmd
>    thp: copy_huge_pmd(): copy huge zero page
>    thp: do_huge_pmd_wp_page(): handle huge zero page
>    thp: change_huge_pmd(): keep huge zero page write-protected
>    thp: change split_huge_page_pmd() interface
>    thp: implement splitting pmd for huge zero page
>    thp: setup huge zero page on non-write page fault
>    thp: lazy huge zero page allocation
>    thp: implement refcounting for huge zero page
>
>   Documentation/vm/transhuge.txt |    4 +-
>   arch/x86/kernel/vm86_32.c      |    2 +-
>   fs/proc/task_mmu.c             |    2 +-
>   include/linux/huge_mm.h        |   14 ++-
>   include/linux/mm.h             |    8 +
>   mm/huge_memory.c               |  331 +++++++++++++++++++++++++++++++++++++---
>   mm/memory.c                    |   11 +-
>   mm/mempolicy.c                 |    2 +-
>   mm/mprotect.c                  |    2 +-
>   mm/mremap.c                    |    2 +-
>   mm/pagewalk.c                  |    2 +-
>   11 files changed, 334 insertions(+), 46 deletions(-)
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Ni zhan Chen <nizhan.chen@gmail.com>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	linux-mm@kvack.org, Andi Kleen <ak@linux.intel.com>,
	"H. Peter Anvin" <hpa@linux.intel.com>,
	linux-kernel@vger.kernel.org,
	"Kirill A. Shutemov" <kirill@shutemov.name>
Subject: Re: [PATCH v4 00/10, REBASED] Introduce huge zero page
Date: Tue, 16 Oct 2012 17:53:07 +0800	[thread overview]
Message-ID: <507D2E83.4010702@gmail.com> (raw)
In-Reply-To: <1350280859-18801-1-git-send-email-kirill.shutemov@linux.intel.com>

On 10/15/2012 02:00 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Hi,
>
> Andrew, here's huge zero page patchset rebased to v3.7-rc1.
>
> Andrea, I've dropped your Reviewed-by due not-so-trivial conflicts in during
> rebase. Could you look through it again. Patches 2, 3, 4, 7, 10 had conflicts.
> Mostly due new MMU notifiers interface.
>
> =================
>
> During testing I noticed big (up to 2.5 times) memory consumption overhead
> on some workloads (e.g. ft.A from NPB) if THP is enabled.
>
> The main reason for that big difference is lacking zero page in THP case.
> We have to allocate a real page on read page fault.
>
> A program to demonstrate the issue:
> #include <assert.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> #define MB 1024*1024
>
> int main(int argc, char **argv)
> {
>          char *p;
>          int i;
>
>          posix_memalign((void **)&p, 2 * MB, 200 * MB);
>          for (i = 0; i < 200 * MB; i+= 4096)
>                  assert(p[i] == 0);
>          pause();
>          return 0;
> }
>
> With thp-never RSS is about 400k, but with thp-always it's 200M.
> After the patcheset thp-always RSS is 400k too.
>
> Design overview.
>
> Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with
> zeros.  The way how we allocate it changes in the patchset:
>
> - [01/10] simplest way: hzp allocated on boot time in hugepage_init();
> - [09/10] lazy allocation on first use;
> - [10/10] lockless refcounting + shrinker-reclaimable hzp;
>
> We setup it in do_huge_pmd_anonymous_page() if area around fault address
> is suitable for THP and we've got read page fault.
> If we fail to setup hzp (ENOMEM) we fallback to handle_pte_fault() as we
> normally do in THP.
>
> On wp fault to hzp we allocate real memory for the huge page and clear it.
> If ENOMEM, graceful fallback: we create a new pmd table and set pte around
> fault address to newly allocated normal (4k) page. All other ptes in the
> pmd set to normal zero page.
>
> We cannot split hzp (and it's bug if we try), but we can split the pmd
> which points to it. On splitting the pmd we create a table with all ptes
> set to normal zero page.
>
> Patchset organized in bisect-friendly way:
>   Patches 01-07: prepare all code paths for hzp
>   Patch 08: all code paths are covered: safe to setup hzp
>   Patch 09: lazy allocation
>   Patch 10: lockless refcounting for hzp
>
> v4:
>   - Rebase to v3.7-rc1;
>   - Update commit message;
> v3:
>   - fix potential deadlock in refcounting code on preemptive kernel.
>   - do not mark huge zero page as movable.
>   - fix typo in comment.
>   - Reviewed-by tag from Andrea Arcangeli.
> v2:
>   - Avoid find_vma() if we've already had vma on stack.
>     Suggested by Andrea Arcangeli.
>   - Implement refcounting for huge zero page.
>
> --------------------------------------------------------------------------
>
> By hpa request I've tried alternative approach for hzp implementation (see
> Virtual huge zero page patchset): pmd table with all entries set to zero
> page. This way should be more cache friendly, but it increases TLB
> pressure.

Thanks for your excellent works. But could you explain me why current 
implementation not cache friendly and hpa's request cache friendly? 
Thanks in advance.

>
> The problem with virtual huge zero page: it requires per-arch enabling.
> We need a way to mark that pmd table has all ptes set to zero page.
>
> Some numbers to compare two implementations (on 4s Westmere-EX):
>
> Mirobenchmark1
> ==============
>
> test:
>          posix_memalign((void **)&p, 2 * MB, 8 * GB);
>          for (i = 0; i < 100; i++) {
>                  assert(memcmp(p, p + 4*GB, 4*GB) == 0);
>                  asm volatile ("": : :"memory");
>          }
>
> hzp:
>   Performance counter stats for './test_memcmp' (5 runs):
>
>        32356.272845 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
>                  40 context-switches          #    0.001 K/sec                    ( +-  0.94% )
>                   0 CPU-migrations            #    0.000 K/sec
>               4,218 page-faults               #    0.130 K/sec                    ( +-  0.00% )
>      76,712,481,765 cycles                    #    2.371 GHz                      ( +-  0.13% ) [83.31%]
>      36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle     ( +-  0.28% ) [83.35%]
>       1,684,049,110 stalled-cycles-backend    #    2.20% backend  cycles idle     ( +-  2.96% ) [66.67%]
>     134,355,715,816 instructions              #    1.75  insns per cycle
>                                               #    0.27  stalled cycles per insn  ( +-  0.10% ) [83.35%]
>      13,526,169,702 branches                  #  418.039 M/sec                    ( +-  0.10% ) [83.31%]
>           1,058,230 branch-misses             #    0.01% of all branches          ( +-  0.91% ) [83.36%]
>
>        32.413866442 seconds time elapsed                                          ( +-  0.13% )
>
> vhzp:
>   Performance counter stats for './test_memcmp' (5 runs):
>
>        30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
>                  38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
>                   0 CPU-migrations            #    0.000 K/sec
>               4,218 page-faults               #    0.139 K/sec                    ( +-  0.01% )
>      71,964,773,660 cycles                    #    2.373 GHz                      ( +-  0.13% ) [83.35%]
>      31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle     ( +-  0.40% ) [83.32%]
>         773,484,474 stalled-cycles-backend    #    1.07% backend  cycles idle     ( +-  6.61% ) [66.67%]
>     134,982,215,437 instructions              #    1.88  insns per cycle
>                                               #    0.23  stalled cycles per insn  ( +-  0.11% ) [83.32%]
>      13,509,150,683 branches                  #  445.447 M/sec                    ( +-  0.11% ) [83.34%]
>           1,017,667 branch-misses             #    0.01% of all branches          ( +-  1.07% ) [83.32%]
>
>        30.381324695 seconds time elapsed                                          ( +-  0.13% )

Could you tell me which data I should care in this performance counter. 
And what's the benefit of your current implementation compare to hpa's 
request?

>
> Mirobenchmark2
> ==============
>
> test:
>          posix_memalign((void **)&p, 2 * MB, 8 * GB);
>          for (i = 0; i < 1000; i++) {
>                  char *_p = p;
>                  while (_p < p+4*GB) {
>                          assert(*_p == *(_p+4*GB));
>                          _p += 4096;
>                          asm volatile ("": : :"memory");
>                  }
>          }
>
> hzp:
>   Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
>
>         3505.727639 task-clock                #    0.998 CPUs utilized            ( +-  0.26% )
>                   9 context-switches          #    0.003 K/sec                    ( +-  4.97% )
>               4,384 page-faults               #    0.001 M/sec                    ( +-  0.00% )
>       8,318,482,466 cycles                    #    2.373 GHz                      ( +-  0.26% ) [33.31%]
>       5,134,318,786 stalled-cycles-frontend   #   61.72% frontend cycles idle     ( +-  0.42% ) [33.32%]
>       2,193,266,208 stalled-cycles-backend    #   26.37% backend  cycles idle     ( +-  5.51% ) [33.33%]
>       9,494,670,537 instructions              #    1.14  insns per cycle
>                                               #    0.54  stalled cycles per insn  ( +-  0.13% ) [41.68%]
>       2,108,522,738 branches                  #  601.451 M/sec                    ( +-  0.09% ) [41.68%]
>             158,746 branch-misses             #    0.01% of all branches          ( +-  1.60% ) [41.71%]
>       3,168,102,115 L1-dcache-loads
>            #  903.693 M/sec                    ( +-  0.11% ) [41.70%]
>       1,048,710,998 L1-dcache-misses
>           #   33.10% of all L1-dcache hits    ( +-  0.11% ) [41.72%]
>       1,047,699,685 LLC-load
>                   #  298.854 M/sec                    ( +-  0.03% ) [33.38%]
>               2,287 LLC-misses
>                 #    0.00% of all LL-cache hits     ( +-  8.27% ) [33.37%]
>       3,166,187,367 dTLB-loads
>                 #  903.147 M/sec                    ( +-  0.02% ) [33.35%]
>           4,266,538 dTLB-misses
>                #    0.13% of all dTLB cache hits   ( +-  0.03% ) [33.33%]
>
>         3.513339813 seconds time elapsed                                          ( +-  0.26% )
>
> vhzp:
>   Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
>
>        27313.891128 task-clock                #    0.998 CPUs utilized            ( +-  0.24% )
>                  62 context-switches          #    0.002 K/sec                    ( +-  0.61% )
>               4,384 page-faults               #    0.160 K/sec                    ( +-  0.01% )
>      64,747,374,606 cycles                    #    2.370 GHz                      ( +-  0.24% ) [33.33%]
>      61,341,580,278 stalled-cycles-frontend   #   94.74% frontend cycles idle     ( +-  0.26% ) [33.33%]
>      56,702,237,511 stalled-cycles-backend    #   87.57% backend  cycles idle     ( +-  0.07% ) [33.33%]
>      10,033,724,846 instructions              #    0.15  insns per cycle
>                                               #    6.11  stalled cycles per insn  ( +-  0.09% ) [41.65%]
>       2,190,424,932 branches                  #   80.195 M/sec                    ( +-  0.12% ) [41.66%]
>           1,028,630 branch-misses             #    0.05% of all branches          ( +-  1.50% ) [41.66%]
>       3,302,006,540 L1-dcache-loads
>            #  120.891 M/sec                    ( +-  0.11% ) [41.68%]
>         271,374,358 L1-dcache-misses
>           #    8.22% of all L1-dcache hits    ( +-  0.04% ) [41.66%]
>          20,385,476 LLC-load
>                   #    0.746 M/sec                    ( +-  1.64% ) [33.34%]
>              76,754 LLC-misses
>                 #    0.38% of all LL-cache hits     ( +-  2.35% ) [33.34%]
>       3,309,927,290 dTLB-loads
>                 #  121.181 M/sec                    ( +-  0.03% ) [33.34%]
>       2,098,967,427 dTLB-misses
>                #   63.41% of all dTLB cache hits   ( +-  0.03% ) [33.34%]
>
>        27.364448741 seconds time elapsed                                          ( +-  0.24% )

For this case, the same question as above, thanks in adance. :-)

>
> --------------------------------------------------------------------------
>
> I personally prefer implementation present in this patchset. It doesn't
> touch arch-specific code.
>
>
> Kirill A. Shutemov (10):
>    thp: huge zero page: basic preparation
>    thp: zap_huge_pmd(): zap huge zero pmd
>    thp: copy_huge_pmd(): copy huge zero page
>    thp: do_huge_pmd_wp_page(): handle huge zero page
>    thp: change_huge_pmd(): keep huge zero page write-protected
>    thp: change split_huge_page_pmd() interface
>    thp: implement splitting pmd for huge zero page
>    thp: setup huge zero page on non-write page fault
>    thp: lazy huge zero page allocation
>    thp: implement refcounting for huge zero page
>
>   Documentation/vm/transhuge.txt |    4 +-
>   arch/x86/kernel/vm86_32.c      |    2 +-
>   fs/proc/task_mmu.c             |    2 +-
>   include/linux/huge_mm.h        |   14 ++-
>   include/linux/mm.h             |    8 +
>   mm/huge_memory.c               |  331 +++++++++++++++++++++++++++++++++++++---
>   mm/memory.c                    |   11 +-
>   mm/mempolicy.c                 |    2 +-
>   mm/mprotect.c                  |    2 +-
>   mm/mremap.c                    |    2 +-
>   mm/pagewalk.c                  |    2 +-
>   11 files changed, 334 insertions(+), 46 deletions(-)
>

next prev parent reply	other threads:[~2012-10-16  9:53 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-10-15  6:00 [PATCH v4 00/10, REBASED] Introduce huge zero page Kirill A. Shutemov
2012-10-15  6:00 ` Kirill A. Shutemov
2012-10-15  6:00 ` [PATCH v4 01/10] thp: huge zero page: basic preparation Kirill A. Shutemov
2012-10-15  6:00   ` Kirill A. Shutemov
2012-10-15  6:00 ` [PATCH v4 02/10] thp: zap_huge_pmd(): zap huge zero pmd Kirill A. Shutemov
2012-10-15  6:00   ` Kirill A. Shutemov
2012-10-15  6:00 ` [PATCH v4 03/10] thp: copy_huge_pmd(): copy huge zero page Kirill A. Shutemov
2012-10-15  6:00   ` Kirill A. Shutemov
2012-10-15  6:00 ` [PATCH v4 04/10] thp: do_huge_pmd_wp_page(): handle " Kirill A. Shutemov
2012-10-15  6:00   ` Kirill A. Shutemov
2012-10-15  6:00 ` [PATCH v4 05/10] thp: change_huge_pmd(): keep huge zero page write-protected Kirill A. Shutemov
2012-10-15  6:00   ` Kirill A. Shutemov
2012-10-15  6:00 ` [PATCH v4 06/10] thp: change split_huge_page_pmd() interface Kirill A. Shutemov
2012-10-15  6:00   ` Kirill A. Shutemov
2012-10-15  6:00 ` [PATCH v4 07/10] thp: implement splitting pmd for huge zero page Kirill A. Shutemov
2012-10-15  6:00   ` Kirill A. Shutemov
2012-10-15  6:00 ` [PATCH v4 08/10] thp: setup huge zero page on non-write page fault Kirill A. Shutemov
2012-10-15  6:00   ` Kirill A. Shutemov
2012-10-15  6:00 ` [PATCH v4 09/10] thp: lazy huge zero page allocation Kirill A. Shutemov
2012-10-15  6:00   ` Kirill A. Shutemov
2012-10-15  6:00 ` [PATCH v4 10/10] thp: implement refcounting for huge zero page Kirill A. Shutemov
2012-10-15  6:00   ` Kirill A. Shutemov
2012-10-18 23:45   ` Andrew Morton
2012-10-18 23:45     ` Andrew Morton
2012-10-18 23:59     ` Kirill A. Shutemov
2012-10-18 23:59       ` Kirill A. Shutemov
2012-10-23  6:35       ` Kirill A. Shutemov
2012-10-23  6:35         ` Kirill A. Shutemov
2012-10-23  6:43         ` Andrew Morton
2012-10-23  6:43           ` Andrew Morton
2012-10-23  7:00           ` Kirill A. Shutemov
2012-10-23 22:59             ` Andrew Morton
2012-10-23 22:59               ` Andrew Morton
2012-10-23 23:38               ` Kirill A. Shutemov
2012-10-23 23:38                 ` Kirill A. Shutemov
2012-10-24 19:22                 ` Andrew Morton
2012-10-24 19:22                   ` Andrew Morton
2012-10-24 19:45                   ` Kirill A. Shutemov
2012-10-24 20:25                     ` Andrew Morton
2012-10-24 20:25                       ` Andrew Morton
2012-10-24 20:33                       ` Kirill A. Shutemov
2012-10-24 20:44                         ` Andi Kleen
2012-10-24 20:44                           ` Andi Kleen
2012-10-25 20:49                       ` Kirill A. Shutemov
2012-10-25 21:05                         ` Andrew Morton
2012-10-25 21:05                           ` Andrew Morton
2012-10-25 21:22                           ` Kirill A. Shutemov
2012-10-25 21:22                             ` Kirill A. Shutemov
2012-10-25 21:37                             ` Andrew Morton
2012-10-25 21:37                               ` Andrew Morton
2012-10-25 22:10                               ` Kirill A. Shutemov
2012-10-16  9:53 ` Ni zhan Chen [this message]
2012-10-16  9:53   ` [PATCH v4 00/10, REBASED] Introduce " Ni zhan Chen
2012-10-16 10:54   ` Kirill A. Shutemov
2012-10-16 10:54     ` Kirill A. Shutemov
2012-10-16 11:13     ` Ni zhan Chen
2012-10-16 11:13       ` Ni zhan Chen
2012-10-16 11:28       ` Kirill A. Shutemov
2012-10-16 11:28         ` Kirill A. Shutemov
2012-10-16 11:37         ` Ni zhan Chen
2012-10-16 11:37           ` Ni zhan Chen
2012-10-26 15:14 ` [PATCH] thp, vmstat: implement HZP_ALLOC and HZP_ALLOC_FAILED events Kirill A. Shutemov
2012-10-26 15:14   ` Kirill A. Shutemov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=507D2E83.4010702@gmail.com \
    --to=nizhan.chen@gmail.com \
    --cc=aarcange@redhat.com \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=hpa@linux.intel.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kirill@shutemov.name \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.