linux-mm.kvack.org archive mirror
From: Ankur Arora <ankur.a.arora@oracle.com>
To: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Ankur Arora <ankur.a.arora@oracle.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
	torvalds@linux-foundation.org, akpm@linux-foundation.org,
	bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com,
	mingo@redhat.com, luto@kernel.org, peterz@infradead.org,
	paulmck@kernel.org, rostedt@goodmis.org, tglx@linutronix.de,
	willy@infradead.org, jon.grimm@amd.com, bharata@amd.com,
	boris.ostrovsky@oracle.com, konrad.wilk@oracle.com
Subject: Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
Date: Tue, 22 Apr 2025 12:22:06 -0700	[thread overview]
Message-ID: <87jz7cq9wh.fsf@oracle.com> (raw)
In-Reply-To: <0d6ba41c-0c90-4130-896a-26eabbd5bd24@amd.com>


Raghavendra K T <raghavendra.kt@amd.com> writes:

> On 4/14/2025 9:16 AM, Ankur Arora wrote:
>> This series adds multi-page clearing for hugepages. It is a rework
>> of [1] which took a detour through PREEMPT_LAZY [2].
>> Why multi-page clearing? It improves on the current page-at-a-time
>> approach by giving the processor a hint about the real region size.
>> A processor can use this hint to, for instance, elide cacheline
>> allocation when clearing a large region.
>> In particular, this optimization is performed by REP; STOS on AMD Zen,
>> where regions larger than the L3 cache are cleared with non-temporal
>> stores. This results in significantly better performance.
>> We also see a performance improvement in cases where this optimization
>> is unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on Intel): REP;
>> STOS is typically microcoded, so its overhead can now be amortized
>> over larger regions, and the hint allows the hardware prefetcher to
>> do a better job.
>> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>>                   mm/folio_zero_user    x86/folio_zero_user     change
>>                    (GB/s  +- stddev)      (GB/s  +- stddev)
>>    pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
>>    pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
>> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>>                   mm/folio_zero_user    x86/folio_zero_user     change
>>                    (GB/s +- stddev)      (GB/s +- stddev)
>>    pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
>>    pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%
>>
> [...]
>
> Hello Ankur,
>
> Thank you for the patches. I was able to test briefly with lazy preempt
> mode.

Thanks for testing.

> (I do understand that there could be a lot of churn based on Ingo's,
> Mateusz's, and others' comments.)
> But here it goes:
>
> SUT: AMD EPYC 9B24 (Genoa) preempt=lazy
>
> metric = time taken in seconds (lower is better); total SIZE=64GB
>                  mm/folio_zero_user    x86/folio_zero_user    change (%)
>   pg-sz=1GB       2.47044  +-  0.38%    1.060877  +-  0.07%    57.06
>   pg-sz=2MB       5.098403 +-  0.01%    2.52015   +-  0.36%    50.57


Just to translate it into the same units I was using above:

                  mm/folio_zero_user       x86/folio_zero_user
   pg-sz=1GB       25.91 GB/s +-  0.38%    60.37 GB/s +-  0.07%
   pg-sz=2MB       12.57 GB/s +-  0.01%    25.39 GB/s +-  0.36%

That's a decent improvement over Milan. Btw, are you using boost=1?

Also, any idea why there's such a large delta in the mm/folio_zero_user
2MB and 1GB cases? Both of these clear a 4K page at a time, so the
difference is a little head-scratching.

There's a gap on Milan as well but it is much smaller.

Thanks
Ankur

> More details (1G example run):
>
> base kernel    =   6.14 (preempt = lazy)
>
> mm/folio_zero_user
> Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):
>
>           2,476.47 msec task-clock               #     1.002 CPUs utilized    ( +-  0.39% )
>                  5      context-switches         #     2.025 /sec             ( +- 29.70% )
>                  2      cpu-migrations           #     0.810 /sec             ( +- 21.15% )
>                202      page-faults              #    81.806 /sec             ( +-  0.18% )
>      7,348,664,233      cycles                   #     2.976 GHz              ( +-  0.38% )  (38.39%)
>        878,805,326      stalled-cycles-frontend  #    11.99% frontend cycles idle  ( +-  0.74% )  (38.43%)
>        339,023,729      instructions             #     0.05 insn per cycle
>                                                  #     2.53 stalled cycles per insn  ( +-  0.08% )  (38.47%)
>         88,579,915      branches                 #    35.873 M/sec            ( +-  0.06% )  (38.51%)
>         17,369,776      branch-misses            #    19.55% of all branches  ( +-  0.04% )  (38.55%)
>      2,261,339,695      L1-dcache-loads          #   915.795 M/sec            ( +-  0.06% )  (38.56%)
>      1,073,880,164      L1-dcache-load-misses    #    47.48% of all L1-dcache accesses  ( +-  0.05% )  (38.56%)
>        511,231,988      L1-icache-loads          #   207.038 M/sec            ( +-  0.25% )  (38.52%)
>            128,533      L1-icache-load-misses    #     0.02% of all L1-icache accesses  ( +-  0.40% )  (38.48%)
>             38,134      dTLB-loads               #    15.443 K/sec            ( +-  4.22% )  (38.44%)
>             33,992      dTLB-load-misses         #   114.39% of all dTLB cache accesses  ( +-  9.42% )  (38.40%)
>                156      iTLB-loads               #    63.177 /sec             ( +- 13.34% )  (38.36%)
>                156      iTLB-load-misses         #   102.50% of all iTLB cache accesses  ( +- 25.98% )  (38.36%)
>
>            2.47044 +- 0.00949 seconds time elapsed  ( +-  0.38% )
>
> x86/folio_zero_user
>           1,056.72 msec task-clock               #     0.996 CPUs utilized    ( +-  0.07% )
>                 10      context-switches         #     9.436 /sec             ( +-  3.59% )
>                  3      cpu-migrations           #     2.831 /sec             ( +- 11.33% )
>                200      page-faults              #   188.718 /sec             ( +-  0.15% )
>      3,146,571,264      cycles                   #     2.969 GHz              ( +-  0.07% )  (38.35%)
>         17,226,261      stalled-cycles-frontend  #     0.55% frontend cycles idle  ( +-  4.12% )  (38.44%)
>         14,130,553      instructions             #     0.00 insn per cycle
>                                                  #     1.39 stalled cycles per insn  ( +-  1.59% )  (38.53%)
>          3,578,614      branches                 #     3.377 M/sec            ( +-  1.54% )  (38.62%)
>            415,807      branch-misses            #    12.45% of all branches  ( +-  1.17% )  (38.62%)
>         22,208,699      L1-dcache-loads          #    20.956 M/sec            ( +-  5.27% )  (38.60%)
>          7,312,684      L1-dcache-load-misses    #    27.79% of all L1-dcache accesses  ( +-  8.46% )  (38.51%)
>          4,032,315      L1-icache-loads          #     3.805 M/sec            ( +-  1.29% )  (38.48%)
>             15,094      L1-icache-load-misses    #     0.38% of all L1-icache accesses  ( +-  1.14% )  (38.39%)
>             14,365      dTLB-loads               #    13.555 K/sec            ( +-  7.23% )  (38.38%)
>              9,477      dTLB-load-misses         #    65.36% of all dTLB cache accesses  ( +- 12.05% )  (38.38%)
>                 18      iTLB-loads               #    16.985 /sec             ( +- 34.84% )  (38.38%)
>                 67      iTLB-load-misses         #   158.39% of all iTLB cache accesses  ( +- 48.32% )  (38.32%)
>
>           1.060877 +- 0.000766 seconds time elapsed  ( +-  0.07% )
>
> Thanks and Regards
> - Raghu


--
ankur



Thread overview: 38+ messages
2025-04-14  3:46 [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing Ankur Arora
2025-04-14  3:46 ` [PATCH v3 1/4] x86/clear_page: extend clear_page*() for " Ankur Arora
2025-04-14  6:32   ` Ingo Molnar
2025-04-14 11:02     ` Peter Zijlstra
2025-04-14 11:14       ` Ingo Molnar
2025-04-14 19:46       ` Ankur Arora
2025-04-14 22:26       ` Mateusz Guzik
2025-04-15  6:14         ` Ankur Arora
2025-04-15  8:22           ` Mateusz Guzik
2025-04-15 20:01             ` Ankur Arora
2025-04-15 20:32               ` Mateusz Guzik
2025-04-14 19:52     ` Ankur Arora
2025-04-14 20:09       ` Matthew Wilcox
2025-04-15 21:59         ` Ankur Arora
2025-04-14  3:46 ` [PATCH v3 2/4] x86/clear_page: add clear_pages() Ankur Arora
2025-04-14  3:46 ` [PATCH v3 3/4] huge_page: allow arch override for folio_zero_user() Ankur Arora
2025-04-14  3:46 ` [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing Ankur Arora
2025-04-14  6:53   ` Ingo Molnar
2025-04-14 21:21     ` Ankur Arora
2025-04-14  7:05   ` Ingo Molnar
2025-04-15  6:36     ` Ankur Arora
2025-04-22  6:36     ` Raghavendra K T
2025-04-22 19:14       ` Ankur Arora
2025-04-15 10:16   ` Mateusz Guzik
2025-04-15 21:46     ` Ankur Arora
2025-04-15 22:01       ` Mateusz Guzik
2025-04-16  4:46         ` Ankur Arora
2025-04-17 14:06           ` Mateusz Guzik
2025-04-14  5:34 ` [PATCH v3 0/4] mm/folio_zero_user: add " Ingo Molnar
2025-04-14 19:30   ` Ankur Arora
2025-04-14  6:36 ` Ingo Molnar
2025-04-14 19:19   ` Ankur Arora
2025-04-15 19:10 ` Zi Yan
2025-04-22 19:32   ` Ankur Arora
2025-04-22  6:23 ` Raghavendra K T
2025-04-22 19:22   ` Ankur Arora [this message]
2025-04-23  8:12     ` Raghavendra K T
2025-04-23  9:18       ` Raghavendra K T
