Re: [PATCH 0/9] x86/clear_huge_page: multi-page clearing

From: Ankur Arora <ankur.a.arora@oracle.com>
To: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Ankur Arora <ankur.a.arora@oracle.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
	torvalds@linux-foundation.org, akpm@linux-foundation.org,
	luto@kernel.org, bp@alien8.de, dave.hansen@linux.intel.com,
	hpa@zytor.com, mingo@redhat.com, juri.lelli@redhat.com,
	willy@infradead.org, mgorman@suse.de, peterz@infradead.org,
	rostedt@goodmis.org, tglx@linutronix.de,
	vincent.guittot@linaro.org, jon.grimm@amd.com, bharata@amd.com,
	boris.ostrovsky@oracle.com, konrad.wilk@oracle.com
Subject: Re: [PATCH 0/9] x86/clear_huge_page: multi-page clearing
Date: Sat, 08 Apr 2023 15:46:56 -0700
Message-ID: <87ttxqf0v3.fsf@oracle.com>
In-Reply-To: <271b85ec-281e-d33b-5495-59eb2bc9fde4@amd.com>


Raghavendra K T <raghavendra.kt@amd.com> writes:

> On 4/3/2023 10:52 AM, Ankur Arora wrote:
>> This series introduces multi-page clearing for hugepages.

>    *Milan*     mm/clear_huge_page   x86/clear_huge_page   change
>                            (GB/s)           (GB/s)
>   pg-sz=2MB                12.24            17.54    +43.30%
>   pg-sz=1GB                17.98            37.24   +107.11%
>
>
> Hello Ankur,
>
> Was able to test your patches. To summarize, am seeing 2x-3x perf
> improvement for 2M, 1GB base hugepage sizes.

Great. Thanks Raghavendra.

> SUT: Genoa AMD EPYC
>    Thread(s) per core:  2
>    Core(s) per socket:  128
>    Socket(s):           2
>
> NUMA:
>   NUMA node(s):          2
>   NUMA node0 CPU(s):     0-127,256-383
>   NUMA node1 CPU(s):     128-255,384-511
>
> Test:  Use mmap(MAP_HUGETLB) to demand-fault a 64GB region (NUMA node0), for
> both base-hugepage-size=2M and 1GB
>
> perf stat -r 10 -d -d  numactl -m 0 -N 0 <test>
>
> time in seconds elapsed (average of 10 runs) (lower = better)
>
> Result:
> page-size  mm/clear_huge_page   x86/clear_huge_page
> 2M              5.4567          2.6774
> 1G              2.64452         1.011281

So translating into BW (GB/s), for Genoa we have:

page-size  mm/clear_huge_page   x86/clear_huge_page
 2M              11.74              23.97
 1G              24.24              63.36
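
The conversion is just region size over elapsed time. A minimal sketch,
assuming the cleared region is exactly 64 GB and using the averaged elapsed
times quoted above; the names REGION_GB, ELAPSED_S and bandwidth_gbps are
made up for illustration. It lands within about half a percent of the table
(for example 11.73 against 11.74), so the table was presumably derived from
slightly different inputs.

```python
# Illustrative only: these names are not part of the patch series.
REGION_GB = 64.0  # assumption: the whole 64GB region is cleared in each run

# Averaged "seconds time elapsed" from the perf stat runs quoted in this mail.
ELAPSED_S = {
    ("2M", "mm/clear_huge_page"): 5.4567,
    ("2M", "x86/clear_huge_page"): 2.6774,
    ("1G", "mm/clear_huge_page"): 2.64452,
    ("1G", "x86/clear_huge_page"): 1.011281,
}


def bandwidth_gbps(elapsed_s, region_gb=REGION_GB):
    """Clearing bandwidth in GB/s for one averaged run."""
    return region_gb / elapsed_s


BANDWIDTH = {key: bandwidth_gbps(t) for key, t in ELAPSED_S.items()}

for (page_size, impl), bw in BANDWIDTH.items():
    print(f"{page_size:>2}  {impl:<20} {bw:6.2f} GB/s")
```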

That's a pretty good bump over Milan:

>    *Milan*     mm/clear_huge_page   x86/clear_huge_page
>                            (GB/s)           (GB/s)
>   pg-sz=2MB                12.24            17.54
>   pg-sz=1GB                17.98            37.24
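
To put a number on the bump, the per-cell Genoa/Milan ratio can be computed
from the two tables. A minimal sketch that uses only the figures quoted in
this mail; the dictionary and function names are illustrative.

```python
# Bandwidth in GB/s, copied from the Genoa and Milan tables in this mail.
GENOA = {
    ("2M", "mm/clear_huge_page"): 11.74,
    ("2M", "x86/clear_huge_page"): 23.97,
    ("1G", "mm/clear_huge_page"): 24.24,
    ("1G", "x86/clear_huge_page"): 63.36,
}
MILAN = {
    ("2M", "mm/clear_huge_page"): 12.24,
    ("2M", "x86/clear_huge_page"): 17.54,
    ("1G", "mm/clear_huge_page"): 17.98,
    ("1G", "x86/clear_huge_page"): 37.24,
}


def generation_ratio(new, old):
    """Ratio new/old for every (page size, implementation) cell."""
    return {key: new[key] / old[key] for key in new}


RATIO = generation_ratio(GENOA, MILAN)

for (page_size, impl), ratio in RATIO.items():
    print(f"{page_size:>2}  {impl:<20} {ratio:.2f}x")
```

With these inputs the patched 1GB case comes out at roughly 1.7x, while the
unpatched 2MB case is slightly below 1x (11.74 against 12.24).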

Btw, are these numbers with boost=1?

> Full perfstat info
>
>  page size = 2M mm/clear_huge_page
>
>  Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_2M' (10 runs):
>
>           5,434.71 msec task-clock                #    0.996 CPUs utilized          ( +-  0.55% )
>                  8      context-switches          #    1.466 /sec                   ( +-  4.66% )
>                  0      cpu-migrations            #    0.000 /sec
>             32,918      page-faults               #    6.034 K/sec                  ( +-  0.00% )
>     16,977,242,482      cycles                    #    3.112 GHz                    ( +-  0.04% )  (35.70%)
>          1,961,724      stalled-cycles-frontend   #    0.01% frontend cycles idle   ( +-  1.09% )  (35.72%)
>         35,685,674      stalled-cycles-backend    #    0.21% backend cycles idle    ( +-  3.48% )  (35.74%)
>      1,038,327,182      instructions              #    0.06  insn per cycle
>                                                   #    0.04  stalled cycles per insn  ( +-  0.38% )  (35.75%)
>        221,409,216      branches                  #   40.584 M/sec                  ( +-  0.36% )  (35.75%)
>            350,730      branch-misses             #    0.16% of all branches        ( +-  1.18% )  (35.75%)
>      2,520,888,779      L1-dcache-loads           #  462.077 M/sec                  ( +-  0.03% )  (35.73%)
>      1,094,178,209      L1-dcache-load-misses     #   43.46% of all L1-dcache accesses  ( +-  0.02% )  (35.71%)
>         67,751,730      L1-icache-loads           #   12.419 M/sec                  ( +-  0.11% )  (35.70%)
>            271,118      L1-icache-load-misses     #    0.40% of all L1-icache accesses  ( +-  2.55% )  (35.70%)
>            506,635      dTLB-loads                #   92.866 K/sec                  ( +-  3.31% )  (35.70%)
>            237,385      dTLB-load-misses          #   43.64% of all dTLB cache accesses  ( +-  7.00% )  (35.69%)
>                268      iTLB-load-misses          # 6700.00% of all iTLB cache accesses  ( +- 13.86% )  (35.70%)
>
>             5.4567 +- 0.0300 seconds time elapsed  ( +-  0.55% )
>
>  page size = 2M x86/clear_huge_page
>  Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_2M' (10 runs):
>
>           2,780.69 msec task-clock                #    1.039 CPUs utilized          ( +-  1.03% )
>                  3      context-switches          #    1.121 /sec                   ( +- 21.34% )
>                  0      cpu-migrations            #    0.000 /sec
>             32,918      page-faults               #   12.301 K/sec                  ( +-  0.00% )
>      8,143,619,771      cycles                    #    3.043 GHz                    ( +-  0.25% )  (35.62%)
>          2,024,872      stalled-cycles-frontend   #    0.02% frontend cycles idle   ( +-320.93% )  (35.66%)
>        717,198,728      stalled-cycles-backend    #    8.82% backend cycles idle    ( +-  8.26% )  (35.69%)
>        606,549,334      instructions              #    0.07  insn per cycle
>                                                   #    1.39  stalled cycles per insn  ( +-  0.23% )  (35.73%)
>        108,856,550      branches                  #   40.677 M/sec                  ( +-  0.24% )  (35.76%)
>            202,490      branch-misses             #    0.18% of all branches        ( +-  3.58% )  (35.78%)
>      2,348,818,806      L1-dcache-loads           #  877.701 M/sec                  ( +-  0.03% )  (35.78%)
>      1,081,562,988      L1-dcache-load-misses     #   46.04% of all L1-dcache accesses  ( +-  0.01% )  (35.78%)
>    <not supported>      LLC-loads
>    <not supported>      LLC-load-misses
>         43,411,167      L1-icache-loads           #   16.222 M/sec                  ( +-  0.19% )  (35.77%)
>            273,042      L1-icache-load-misses     #    0.64% of all L1-icache accesses  ( +-  4.94% )  (35.76%)
>            834,482      dTLB-loads                #  311.827 K/sec                  ( +-  9.73% )  (35.72%)
>            437,343      dTLB-load-misses          #   65.86% of all dTLB cache accesses  ( +-  8.56% )  (35.68%)
>                  0      iTLB-loads                #    0.000 /sec                                  (35.65%)
>                160      iTLB-load-misses          # 1777.78% of all iTLB cache accesses  ( +- 15.82% )  (35.62%)
>
>             2.6774 +- 0.0287 seconds time elapsed  ( +-  1.07% )
>
>  page size = 1G mm/clear_huge_page
>  Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):
>
>           2,625.24 msec task-clock                #    0.993 CPUs utilized          ( +-  0.23% )
>                  4      context-switches          #    1.513 /sec                   ( +-  4.49% )
>                  1      cpu-migrations            #    0.378 /sec
>                214      page-faults               #   80.965 /sec                   ( +-  0.13% )
>      8,178,624,349      cycles                    #    3.094 GHz                    ( +-  0.23% )  (35.65%)
>          2,942,576      stalled-cycles-frontend   #    0.04% frontend cycles idle   ( +- 75.22% )  (35.69%)
>          7,117,425      stalled-cycles-backend    #    0.09% backend cycles idle    ( +-  3.79% )  (35.73%)
>        454,521,647      instructions              #    0.06  insn per cycle
>                                                   #    0.02  stalled cycles per insn  ( +-  0.10% )  (35.77%)
>        113,223,853      branches                  #   42.837 M/sec                  ( +-  0.08% )  (35.80%)
>             84,766      branch-misses             #    0.07% of all branches        ( +-  5.37% )  (35.80%)
>      2,294,528,890      L1-dcache-loads           #  868.111 M/sec                  ( +-  0.02% )  (35.81%)
>      1,075,907,551      L1-dcache-load-misses     #   46.88% of all L1-dcache accesses  ( +-  0.02% )  (35.78%)
>         26,167,323      L1-icache-loads           #    9.900 M/sec                  ( +-  0.24% )  (35.74%)
>            139,675      L1-icache-load-misses     #    0.54% of all L1-icache accesses  ( +-  0.37% )  (35.70%)
>              3,459      dTLB-loads                #    1.309 K/sec                  ( +- 12.75% )  (35.67%)
>                732      dTLB-load-misses          #   19.71% of all dTLB cache accesses  ( +- 26.61% )  (35.62%)
>                 11      iTLB-load-misses          #  192.98% of all iTLB cache accesses  ( +-238.28% )  (35.62%)
>
>            2.64452 +- 0.00600 seconds time elapsed  ( +-  0.23% )
>
>
>  page size = 1G x86/clear_huge_page
>  Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):
>
>           1,009.09 msec task-clock                #    0.998 CPUs utilized          ( +-  0.06% )
>                  2      context-switches          #    1.980 /sec                   ( +- 23.63% )
>                  1      cpu-migrations            #    0.990 /sec
>                214      page-faults               #  211.887 /sec                   ( +-  0.16% )
>      3,154,980,463      cycles                    #    3.124 GHz                    ( +-  0.06% )  (35.77%)
>            145,051      stalled-cycles-frontend   #    0.00% frontend cycles idle   ( +-  6.26% )  (35.78%)
>        730,087,143      stalled-cycles-backend    #   23.12% backend cycles idle    ( +-  9.75% )  (35.78%)
>         45,813,391      instructions              #    0.01  insn per cycle
>                                                   #   18.51  stalled cycles per insn  ( +-  1.00% )  (35.78%)
>          8,498,282      branches                  #    8.414 M/sec                  ( +-  1.54% )  (35.78%)
>             63,351      branch-misses             #    0.74% of all branches        ( +-  6.70% )  (35.69%)
>         29,135,863      L1-dcache-loads           #   28.848 M/sec                  ( +-  5.67% )  (35.68%)
>          8,537,280      L1-dcache-load-misses     #   28.66% of all L1-dcache accesses  ( +- 10.15% )  (35.68%)
>          1,040,087      L1-icache-loads           #    1.030 M/sec                  ( +-  1.60% )  (35.68%)
>              9,147      L1-icache-load-misses     #    0.85% of all L1-icache accesses  ( +-  6.50% )  (35.67%)
>              1,084      dTLB-loads                #    1.073 K/sec                  ( +- 12.05% )  (35.68%)
>                431      dTLB-load-misses          #   40.28% of all dTLB cache accesses  ( +- 43.46% )  (35.68%)
>                 16      iTLB-load-misses          #    0.00% of all iTLB cache accesses  ( +- 40.54% )  (35.68%)
>
>           1.011281 +- 0.000624 seconds time elapsed  ( +-  0.06% )
>
> Please feel free to add
>
> Tested-by: Raghavendra K T <raghavendra.kt@amd.com>

Thanks

Ankur

> Will come back with further observations on patch/performance if any
