Re: [PATCH v4 00/10, REBASED] Introduce huge zero page


From: Ni zhan Chen <nizhan.chen@gmail.com>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	linux-mm@kvack.org, Andi Kleen <ak@linux.intel.com>,
	"H. Peter Anvin" <hpa@linux.intel.com>,
	linux-kernel@vger.kernel.org,
	"Kirill A. Shutemov" <kirill@shutemov.name>
Subject: Re: [PATCH v4 00/10, REBASED] Introduce huge zero page
Date: Tue, 16 Oct 2012 17:53:07 +0800
Message-ID: <507D2E83.4010702@gmail.com>
In-Reply-To: <1350280859-18801-1-git-send-email-kirill.shutemov@linux.intel.com>

On 10/15/2012 02:00 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Hi,
>
> Andrew, here's huge zero page patchset rebased to v3.7-rc1.
>
> Andrea, I've dropped your Reviewed-by due to not-so-trivial conflicts during
> the rebase. Could you look through it again? Patches 2, 3, 4, 7, 10 had
> conflicts, mostly due to the new MMU notifiers interface.
>
> =================
>
> During testing I noticed big (up to 2.5 times) memory consumption overhead
> on some workloads (e.g. ft.A from NPB) if THP is enabled.
>
> The main reason for that big difference is the lack of a zero page in the
> THP case. We have to allocate a real page on a read page fault.
>
> A program to demonstrate the issue:
> #include <assert.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> #define MB (1024*1024)
>
> int main(int argc, char **argv)
> {
>          char *p;
>          int i;
>
>          posix_memalign((void **)&p, 2 * MB, 200 * MB);
>          for (i = 0; i < 200 * MB; i+= 4096)
>                  assert(p[i] == 0);
>          pause();
>          return 0;
> }
>
> With thp-never RSS is about 400k, but with thp-always it's 200M.
> After the patchset thp-always RSS is 400k too.
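
A minimal sketch, not part of the patchset, of how the RSS numbers
quoted above could be sampled from inside a test program. It assumes
Linux's /proc/self/statm, whose second field is the resident set size
in pages. rss_kb() and touch_readonly() are hypothetical helper names.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: resident set size of this process in KiB,
 * or -1 on error. Reads the second field of /proc/self/statm. */
static long rss_kb(void)
{
        unsigned long size_pages, resident_pages;
        FILE *f = fopen("/proc/self/statm", "r");

        if (!f)
                return -1;
        if (fscanf(f, "%lu %lu", &size_pages, &resident_pages) != 2) {
                fclose(f);
                return -1;
        }
        fclose(f);
        return (long)(resident_pages *
                      (unsigned long)sysconf(_SC_PAGESIZE) / 1024);
}

/* Hypothetical helper: read one byte in every 4k step of a 2M-aligned
 * buffer, as the quoted program does. Returns the number of bytes
 * that read back non-zero, or -1 if the allocation fails. The buffer
 * is deliberately left allocated so RSS can be sampled afterwards. */
static long touch_readonly(size_t bytes)
{
        void *buf;
        volatile const char *p;
        long nonzero = 0;
        size_t i;

        if (posix_memalign(&buf, 2UL * 1024 * 1024, bytes))
                return -1;
        p = buf;
        for (i = 0; i < bytes; i += 4096)
                nonzero += (p[i] != 0);
        return nonzero;
}
```

Calling rss_kb() before and after touch_readonly(200UL * 1024 * 1024)
and comparing the two values under the thp-never and thp-always
settings would be one way to observe the difference described above.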
>
> Design overview.
>
> Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with
> zeros.  The way we allocate it changes over the course of the patchset:
>
> - [01/10] simplest way: hzp allocated at boot time in hugepage_init();
> - [09/10] lazy allocation on first use;
> - [10/10] lockless refcounting + shrinker-reclaimable hzp;
>
> We set it up in do_huge_pmd_anonymous_page() if the area around the fault
> address is suitable for THP and we've got a read page fault.
> If we fail to set up hzp (ENOMEM) we fall back to handle_pte_fault() as we
> normally do in THP.
>
> On a wp fault to hzp we allocate real memory for the huge page and clear it.
> If ENOMEM, graceful fallback: we create a new pmd table and set the pte
> around the fault address to a newly allocated normal (4k) page. All other
> ptes in the pmd are set to the normal zero page.
>
> We cannot split hzp (and it's a bug if we try), but we can split the pmd
> which points to it. On splitting the pmd we create a table with all ptes
> set to the normal zero page.
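
The split and ENOMEM fallback described above can be pictured with a
small user-space toy model. This is only an illustration under stated
assumptions, not kernel code: the struct, the toy_* functions and
ZERO_PFN are all hypothetical names, and 512 entries per table follows
from 2M / 4k on x86-64.

```c
#include <stddef.h>

#define TOY_PTRS_PER_TABLE 512   /* 2M huge page / 4k page on x86-64 */
#define ZERO_PFN 0UL             /* stand-in for the normal zero page */

/* Toy pte table: one page frame number per 4k slot. */
struct toy_pte_table {
        unsigned long pfn[TOY_PTRS_PER_TABLE];
};

/* Model of splitting a pmd that pointed at the hzp: every pte ends up
 * pointing at the normal (4k) zero page. */
static void toy_split_hzp_pmd(struct toy_pte_table *t)
{
        size_t i;

        for (i = 0; i < TOY_PTRS_PER_TABLE; i++)
                t->pfn[i] = ZERO_PFN;
}

/* Model of the ENOMEM fallback on a write fault: build the table as
 * above, then point only the pte covering fault_addr at a newly
 * allocated 4k page (new_pfn). */
static void toy_wp_fallback(struct toy_pte_table *t,
                            unsigned long fault_addr,
                            unsigned long new_pfn)
{
        toy_split_hzp_pmd(t);
        t->pfn[(fault_addr >> 12) & (TOY_PTRS_PER_TABLE - 1)] = new_pfn;
}

/* Count the ptes that no longer point at the zero page. */
static size_t toy_count_private(const struct toy_pte_table *t)
{
        size_t i, n = 0;

        for (i = 0; i < TOY_PTRS_PER_TABLE; i++)
                n += (t->pfn[i] != ZERO_PFN);
        return n;
}
```

In this model a write fault that falls back costs one private 4k page,
while the remaining 511 slots keep sharing the zero page.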
>
> The patchset is organized in a bisect-friendly way:
>   Patches 01-07: prepare all code paths for hzp
>   Patch 08: all code paths are covered: safe to setup hzp
>   Patch 09: lazy allocation
>   Patch 10: lockless refcounting for hzp
>
> v4:
>   - Rebase to v3.7-rc1;
>   - Update commit message;
> v3:
>   - fix potential deadlock in refcounting code on preemptive kernel.
>   - do not mark huge zero page as movable.
>   - fix typo in comment.
>   - Reviewed-by tag from Andrea Arcangeli.
> v2:
>   - Avoid find_vma() if we've already had vma on stack.
>     Suggested by Andrea Arcangeli.
>   - Implement refcounting for huge zero page.
>
> --------------------------------------------------------------------------
>
> At hpa's request I've tried an alternative approach to the hzp
> implementation (see the Virtual huge zero page patchset): a pmd table with
> all entries set to the zero page. This way should be more cache friendly,
> but it increases TLB pressure.

Thanks for your excellent work. Could you explain why the current
implementation is not cache friendly while the approach hpa requested is?
Thanks in advance.
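
One possible reading of the quoted trade-off, as rough arithmetic and
assuming physically tagged caches: every mapping aliases a single zero
page, so the distinct physical memory behind the region equals the
mapping size, while the number of TLB entries needed to cover the
region grows as the mapping size shrinks. The struct and function
names below are hypothetical.

```c
#include <stdint.h>

struct zero_map_cost {
        uint64_t tlb_entries;         /* translations to cover the region */
        uint64_t distinct_phys_bytes; /* physical bytes actually read */
};

/* Cost of covering region_bytes with mappings of map_bytes each, all
 * pointing at one shared zero page of that same size. */
static struct zero_map_cost zero_map_cost(uint64_t region_bytes,
                                          uint64_t map_bytes)
{
        struct zero_map_cost c;

        c.tlb_entries = region_bytes / map_bytes;
        c.distinct_phys_bytes = map_bytes;
        return c;
}
```

For the 4G compared in the benchmarks, a 2M huge zero page gives 2048
translations over 2M of distinct physical memory, and a table of 4k
zero pages gives 1048576 translations over 4k of physical memory.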

>
> The problem with virtual huge zero page: it requires per-arch enabling.
> We need a way to mark that a pmd table has all ptes set to the zero page.
>
> Some numbers to compare two implementations (on 4s Westmere-EX):
>
> Microbenchmark1
> ===============
>
> test:
>          posix_memalign((void **)&p, 2 * MB, 8 * GB);
>          for (i = 0; i < 100; i++) {
>                  assert(memcmp(p, p + 4*GB, 4*GB) == 0);
>                  asm volatile ("": : :"memory");
>          }
>
> hzp:
>   Performance counter stats for './test_memcmp' (5 runs):
>
>        32356.272845 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
>                  40 context-switches          #    0.001 K/sec                    ( +-  0.94% )
>                   0 CPU-migrations            #    0.000 K/sec
>               4,218 page-faults               #    0.130 K/sec                    ( +-  0.00% )
>      76,712,481,765 cycles                    #    2.371 GHz                      ( +-  0.13% ) [83.31%]
>      36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle     ( +-  0.28% ) [83.35%]
>       1,684,049,110 stalled-cycles-backend    #    2.20% backend  cycles idle     ( +-  2.96% ) [66.67%]
>     134,355,715,816 instructions              #    1.75  insns per cycle
>                                               #    0.27  stalled cycles per insn  ( +-  0.10% ) [83.35%]
>      13,526,169,702 branches                  #  418.039 M/sec                    ( +-  0.10% ) [83.31%]
>           1,058,230 branch-misses             #    0.01% of all branches          ( +-  0.91% ) [83.36%]
>
>        32.413866442 seconds time elapsed                                          ( +-  0.13% )
>
> vhzp:
>   Performance counter stats for './test_memcmp' (5 runs):
>
>        30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
>                  38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
>                   0 CPU-migrations            #    0.000 K/sec
>               4,218 page-faults               #    0.139 K/sec                    ( +-  0.01% )
>      71,964,773,660 cycles                    #    2.373 GHz                      ( +-  0.13% ) [83.35%]
>      31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle     ( +-  0.40% ) [83.32%]
>         773,484,474 stalled-cycles-backend    #    1.07% backend  cycles idle     ( +-  6.61% ) [66.67%]
>     134,982,215,437 instructions              #    1.88  insns per cycle
>                                               #    0.23  stalled cycles per insn  ( +-  0.11% ) [83.32%]
>      13,509,150,683 branches                  #  445.447 M/sec                    ( +-  0.11% ) [83.34%]
>           1,017,667 branch-misses             #    0.01% of all branches          ( +-  1.07% ) [83.32%]
>
>        30.381324695 seconds time elapsed                                          ( +-  0.13% )

Could you tell me which figures in this performance counter output I should
pay attention to? And what is the benefit of your current implementation
compared to the approach hpa requested?
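
As a hedged starting point for comparing the two runs, the elapsed
time and instructions-per-cycle figures can be recomputed from the
quoted counters. The numbers below are copied from the Microbenchmark1
output above; the struct and helper names are hypothetical.

```c
struct perf_sample {
        double seconds;       /* "seconds time elapsed" */
        double cycles;
        double instructions;
};

/* Values copied from the quoted Microbenchmark1 output. */
static const struct perf_sample mb1_hzp  = {
        32.413866442, 76712481765.0, 134355715816.0
};
static const struct perf_sample mb1_vhzp = {
        30.381324695, 71964773660.0, 134982215437.0
};

/* Instructions retired per cycle. */
static double ipc(const struct perf_sample *s)
{
        return s->instructions / s->cycles;
}

/* How many times longer run a took than run b. */
static double time_ratio(const struct perf_sample *a,
                         const struct perf_sample *b)
{
        return a->seconds / b->seconds;
}
```

With these inputs, time_ratio(&mb1_hzp, &mb1_vhzp) comes out a little
under 1.07, so in this test hzp is roughly 7% slower than vhzp.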

>
> Microbenchmark2
> ===============
>
> test:
>          posix_memalign((void **)&p, 2 * MB, 8 * GB);
>          for (i = 0; i < 1000; i++) {
>                  char *_p = p;
>                  while (_p < p+4*GB) {
>                          assert(*_p == *(_p+4*GB));
>                          _p += 4096;
>                          asm volatile ("": : :"memory");
>                  }
>          }
>
> hzp:
>   Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
>
>         3505.727639 task-clock                #    0.998 CPUs utilized            ( +-  0.26% )
>                   9 context-switches          #    0.003 K/sec                    ( +-  4.97% )
>               4,384 page-faults               #    0.001 M/sec                    ( +-  0.00% )
>       8,318,482,466 cycles                    #    2.373 GHz                      ( +-  0.26% ) [33.31%]
>       5,134,318,786 stalled-cycles-frontend   #   61.72% frontend cycles idle     ( +-  0.42% ) [33.32%]
>       2,193,266,208 stalled-cycles-backend    #   26.37% backend  cycles idle     ( +-  5.51% ) [33.33%]
>       9,494,670,537 instructions              #    1.14  insns per cycle
>                                               #    0.54  stalled cycles per insn  ( +-  0.13% ) [41.68%]
>       2,108,522,738 branches                  #  601.451 M/sec                    ( +-  0.09% ) [41.68%]
>             158,746 branch-misses             #    0.01% of all branches          ( +-  1.60% ) [41.71%]
>       3,168,102,115 L1-dcache-loads           #  903.693 M/sec                    ( +-  0.11% ) [41.70%]
>       1,048,710,998 L1-dcache-misses          #   33.10% of all L1-dcache hits    ( +-  0.11% ) [41.72%]
>       1,047,699,685 LLC-load                  #  298.854 M/sec                    ( +-  0.03% ) [33.38%]
>               2,287 LLC-misses                #    0.00% of all LL-cache hits     ( +-  8.27% ) [33.37%]
>       3,166,187,367 dTLB-loads                #  903.147 M/sec                    ( +-  0.02% ) [33.35%]
>           4,266,538 dTLB-misses               #    0.13% of all dTLB cache hits   ( +-  0.03% ) [33.33%]
>
>         3.513339813 seconds time elapsed                                          ( +-  0.26% )
>
> vhzp:
>   Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
>
>        27313.891128 task-clock                #    0.998 CPUs utilized            ( +-  0.24% )
>                  62 context-switches          #    0.002 K/sec                    ( +-  0.61% )
>               4,384 page-faults               #    0.160 K/sec                    ( +-  0.01% )
>      64,747,374,606 cycles                    #    2.370 GHz                      ( +-  0.24% ) [33.33%]
>      61,341,580,278 stalled-cycles-frontend   #   94.74% frontend cycles idle     ( +-  0.26% ) [33.33%]
>      56,702,237,511 stalled-cycles-backend    #   87.57% backend  cycles idle     ( +-  0.07% ) [33.33%]
>      10,033,724,846 instructions              #    0.15  insns per cycle
>                                               #    6.11  stalled cycles per insn  ( +-  0.09% ) [41.65%]
>       2,190,424,932 branches                  #   80.195 M/sec                    ( +-  0.12% ) [41.66%]
>           1,028,630 branch-misses             #    0.05% of all branches          ( +-  1.50% ) [41.66%]
>       3,302,006,540 L1-dcache-loads           #  120.891 M/sec                    ( +-  0.11% ) [41.68%]
>         271,374,358 L1-dcache-misses          #    8.22% of all L1-dcache hits    ( +-  0.04% ) [41.66%]
>          20,385,476 LLC-load                  #    0.746 M/sec                    ( +-  1.64% ) [33.34%]
>              76,754 LLC-misses                #    0.38% of all LL-cache hits     ( +-  2.35% ) [33.34%]
>       3,309,927,290 dTLB-loads                #  121.181 M/sec                    ( +-  0.03% ) [33.34%]
>       2,098,967,427 dTLB-misses               #   63.41% of all dTLB cache hits   ( +-  0.03% ) [33.34%]
>
>        27.364448741 seconds time elapsed                                          ( +-  0.24% )

For this case I have the same question as above. Thanks in advance. :-)
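
For this second test the miss rates look like the most telling
figures. A small sketch that recomputes them from the quoted counters
(values copied from the Microbenchmark2 output above; the struct and
helper names are hypothetical):

```c
struct mb2_counters {
        double seconds;       /* "seconds time elapsed" */
        double l1d_loads;
        double l1d_misses;
        double dtlb_loads;
        double dtlb_misses;
};

/* Values copied from the quoted Microbenchmark2 output. */
static const struct mb2_counters mb2_hzp = {
        3.513339813,
        3168102115.0, 1048710998.0,
        3166187367.0, 4266538.0
};
static const struct mb2_counters mb2_vhzp = {
        27.364448741,
        3302006540.0, 271374358.0,
        3309927290.0, 2098967427.0
};

/* part as a percentage of whole. */
static double pct(double part, double whole)
{
        return 100.0 * part / whole;
}

static double l1d_miss_pct(const struct mb2_counters *c)
{
        return pct(c->l1d_misses, c->l1d_loads);
}

static double dtlb_miss_pct(const struct mb2_counters *c)
{
        return pct(c->dtlb_misses, c->dtlb_loads);
}
```

Read this way, vhzp misses L1 less often (about 8% against about 33%)
but misses the dTLB on roughly 63% of loads against about 0.13% for
hzp, and the run takes close to eight times longer.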

>
> --------------------------------------------------------------------------
>
> I personally prefer the implementation presented in this patchset. It
> doesn't touch arch-specific code.
>
>
> Kirill A. Shutemov (10):
>    thp: huge zero page: basic preparation
>    thp: zap_huge_pmd(): zap huge zero pmd
>    thp: copy_huge_pmd(): copy huge zero page
>    thp: do_huge_pmd_wp_page(): handle huge zero page
>    thp: change_huge_pmd(): keep huge zero page write-protected
>    thp: change split_huge_page_pmd() interface
>    thp: implement splitting pmd for huge zero page
>    thp: setup huge zero page on non-write page fault
>    thp: lazy huge zero page allocation
>    thp: implement refcounting for huge zero page
>
>   Documentation/vm/transhuge.txt |    4 +-
>   arch/x86/kernel/vm86_32.c      |    2 +-
>   fs/proc/task_mmu.c             |    2 +-
>   include/linux/huge_mm.h        |   14 ++-
>   include/linux/mm.h             |    8 +
>   mm/huge_memory.c               |  331 +++++++++++++++++++++++++++++++++++++---
>   mm/memory.c                    |   11 +-
>   mm/mempolicy.c                 |    2 +-
>   mm/mprotect.c                  |    2 +-
>   mm/mremap.c                    |    2 +-
>   mm/pagewalk.c                  |    2 +-
>   11 files changed, 334 insertions(+), 46 deletions(-)
>


Thread overview: 34+ messages
2012-10-15  6:00 [PATCH v4 00/10, REBASED] Introduce huge zero page Kirill A. Shutemov
2012-10-15  6:00 ` [PATCH v4 01/10] thp: huge zero page: basic preparation Kirill A. Shutemov
2012-10-15  6:00 ` [PATCH v4 02/10] thp: zap_huge_pmd(): zap huge zero pmd Kirill A. Shutemov
2012-10-15  6:00 ` [PATCH v4 03/10] thp: copy_huge_pmd(): copy huge zero page Kirill A. Shutemov
2012-10-15  6:00 ` [PATCH v4 04/10] thp: do_huge_pmd_wp_page(): handle " Kirill A. Shutemov
2012-10-15  6:00 ` [PATCH v4 05/10] thp: change_huge_pmd(): keep huge zero page write-protected Kirill A. Shutemov
2012-10-15  6:00 ` [PATCH v4 06/10] thp: change split_huge_page_pmd() interface Kirill A. Shutemov
2012-10-15  6:00 ` [PATCH v4 07/10] thp: implement splitting pmd for huge zero page Kirill A. Shutemov
2012-10-15  6:00 ` [PATCH v4 08/10] thp: setup huge zero page on non-write page fault Kirill A. Shutemov
2012-10-15  6:00 ` [PATCH v4 09/10] thp: lazy huge zero page allocation Kirill A. Shutemov
2012-10-15  6:00 ` [PATCH v4 10/10] thp: implement refcounting for huge zero page Kirill A. Shutemov
2012-10-18 23:45   ` Andrew Morton
2012-10-18 23:59     ` Kirill A. Shutemov
2012-10-23  6:35       ` Kirill A. Shutemov
2012-10-23  6:43         ` Andrew Morton
2012-10-23  7:00           ` Kirill A. Shutemov
2012-10-23 22:59             ` Andrew Morton
2012-10-23 23:38               ` Kirill A. Shutemov
2012-10-24 19:22                 ` Andrew Morton
2012-10-24 19:45                   ` Kirill A. Shutemov
2012-10-24 20:25                     ` Andrew Morton
2012-10-24 20:33                       ` Kirill A. Shutemov
2012-10-24 20:44                         ` Andi Kleen
2012-10-25 20:49                       ` Kirill A. Shutemov
2012-10-25 21:05                         ` Andrew Morton
2012-10-25 21:22                           ` Kirill A. Shutemov
2012-10-25 21:37                             ` Andrew Morton
2012-10-25 22:10                               ` Kirill A. Shutemov
2012-10-16  9:53 ` Ni zhan Chen [this message]
2012-10-16 10:54   ` [PATCH v4 00/10, REBASED] Introduce " Kirill A. Shutemov
2012-10-16 11:13     ` Ni zhan Chen
2012-10-16 11:28       ` Kirill A. Shutemov
2012-10-16 11:37         ` Ni zhan Chen
2012-10-26 15:14 ` [PATCH] thp, vmstat: implement HZP_ALLOC and HZP_ALLOC_FAILED events Kirill A. Shutemov
