Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing

From: Ankur Arora <ankur.a.arora@oracle.com>
To: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Ankur Arora <ankur.a.arora@oracle.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
	torvalds@linux-foundation.org, akpm@linux-foundation.org,
	bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com,
	mingo@redhat.com, luto@kernel.org, peterz@infradead.org,
	paulmck@kernel.org, rostedt@goodmis.org, tglx@linutronix.de,
	willy@infradead.org, jon.grimm@amd.com, bharata@amd.com,
	boris.ostrovsky@oracle.com, konrad.wilk@oracle.com
Subject: Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
Date: Tue, 22 Apr 2025 12:22:06 -0700
Message-ID: <87jz7cq9wh.fsf@oracle.com>
In-Reply-To: <0d6ba41c-0c90-4130-896a-26eabbd5bd24@amd.com>


Raghavendra K T <raghavendra.kt@amd.com> writes:

> On 4/14/2025 9:16 AM, Ankur Arora wrote:
>> This series adds multi-page clearing for hugepages. It is a rework
>> of [1] which took a detour through PREEMPT_LAZY [2].
>> Why multi-page clearing?: multi-page clearing improves upon the
>> current page-at-a-time approach by providing the processor with a
>> hint as to the real region size. A processor could use this hint to,
>> for instance, elide cacheline allocation when clearing a large
>> region.
>> This optimization in particular is done by REP; STOS on AMD Zen
>> where regions larger than L3-size use non-temporal stores.
>> This results in significantly better performance.
>> We also see performance improvement for cases where this optimization is
>> unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on Intel) because
>> REP; STOS is typically microcoded which can now be amortized over
>> larger regions and the hint allows the hardware prefetcher to do a
>> better job.
>> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>>                   mm/folio_zero_user    x86/folio_zero_user     change
>>                    (GB/s  +- stddev)      (GB/s  +- stddev)
>>    pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
>>    pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
>> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>>                   mm/folio_zero_user    x86/folio_zero_user     change
>>                    (GB/s +- stddev)      (GB/s +- stddev)
>>    pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
>>    pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%
>>
> [...]
>
> Hello Ankur,
>
> Thank you for the patches. Was able to test briefly w/ lazy preempt
> mode.

Thanks for testing.

> (I do understand that there could be a lot of churn based on Ingo's,
> Mateusz's and others' comments)
> But here it goes:
>
> SUT: AMD EPYC 9B24 (Genoa) preempt=lazy
>
> metric = time taken in sec (lower is better). total SIZE=64GB
>                  mm/folio_zero_user    x86/folio_zero_user     change
>   pg-sz=1GB       2.47044  +-  0.38%    1.060877  +-  0.07%    57.06
>   pg-sz=2MB       5.098403 +-  0.01%    2.52015   +-  0.36%    50.57


Just to translate it into the same units I was using above:

                  mm/folio_zero_user        x86/folio_zero_user
   pg-sz=1GB       25.91 GBps +-  0.38%    60.37 GBps +-  0.07%
   pg-sz=2MB       12.57 GBps +-  0.01%    25.39 GBps +-  0.36%
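
The conversion above is the total size divided by the elapsed time. A minimal sketch of that arithmetic, assuming the 64GB total and the elapsed times from Raghu's table (the names `TOTAL_GB`, `elapsed_sec` and `throughput_gbps` are illustrative only). Computed from the unrounded times, the results agree with the table to within about 0.05 GB/s; the small differences are presumably rounding of the inputs.

```python
# Convert "seconds to clear 64GB" into throughput in GB/s.
# Assumption: total SIZE=64GB, as stated in the quoted report.
TOTAL_GB = 64.0

# Elapsed times in seconds, copied from the quoted table.
elapsed_sec = {
    ("pg-sz=1GB", "mm/folio_zero_user"): 2.47044,
    ("pg-sz=1GB", "x86/folio_zero_user"): 1.060877,
    ("pg-sz=2MB", "mm/folio_zero_user"): 5.098403,
    ("pg-sz=2MB", "x86/folio_zero_user"): 2.52015,
}


def throughput_gbps(seconds, total_gb=TOTAL_GB):
    """Return throughput in GB/s for clearing total_gb in the given time."""
    return total_gb / seconds


for (page_size, variant), seconds in elapsed_sec.items():
    print(f"{page_size}  {variant:<20} {throughput_gbps(seconds):6.2f} GB/s")
```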

That's a decent improvement over Milan. Btw, are you using boost=1?

Also, any idea why there is such a large delta between the 2MB and 1GB
cases for mm/folio_zero_user? Both clear one 4K page at a time, so a
roughly 2x difference is a bit of a head-scratcher.

There's a gap on Milan as well but it is much smaller.
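
To put numbers on that gap, here is a rough sketch that takes the mm/folio_zero_user throughput figures from the two tables above and computes the 1GB/2MB ratio on each machine. The helper name `gap_ratio` is made up for illustration, and the inputs are the rounded GB/s values as reported, so the ratios are approximate.

```python
# Ratio of 1GB-page to 2MB-page throughput for the baseline
# (mm/folio_zero_user), using the GB/s figures reported in this thread.
def gap_ratio(gbps_1gb, gbps_2mb):
    """How many times faster the pg-sz=1GB case is than pg-sz=2MB."""
    return gbps_1gb / gbps_2mb


genoa = gap_ratio(25.91, 12.57)  # EPYC 9B24 (Genoa), preempt=lazy
milan = gap_ratio(16.51, 11.89)  # EPYC 7J13 (Milan), boost=0

print(f"Genoa: {genoa:.2f}x  Milan: {milan:.2f}x")
```

On these figures the baseline 1GB case is about twice as fast as the 2MB case on Genoa, versus roughly 1.4x on Milan.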

Thanks
Ankur

> More details (1G example run):
>
> base kernel    =   6.14 (preempt = lazy)
>
> mm/folio_zero_user
> Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):
>
>           2,476.47 msec task-clock                       #    1.002 CPUs utilized                 ( +-  0.39% )
>                  5      context-switches                 #    2.025 /sec                          ( +- 29.70% )
>                  2      cpu-migrations                   #    0.810 /sec                          ( +- 21.15% )
>                202      page-faults                      #   81.806 /sec                          ( +-  0.18% )
>      7,348,664,233      cycles                           #    2.976 GHz                           ( +-  0.38% )  (38.39%)
>        878,805,326      stalled-cycles-frontend          #   11.99% frontend cycles idle          ( +-  0.74% )  (38.43%)
>        339,023,729      instructions                     #    0.05  insn per cycle
>                                                          #    2.53  stalled cycles per insn       ( +-  0.08% )  (38.47%)
>         88,579,915      branches                         #   35.873 M/sec                         ( +-  0.06% )  (38.51%)
>         17,369,776      branch-misses                    #   19.55% of all branches               ( +-  0.04% )  (38.55%)
>      2,261,339,695      L1-dcache-loads                  #  915.795 M/sec                         ( +-  0.06% )  (38.56%)
>      1,073,880,164      L1-dcache-load-misses            #   47.48% of all L1-dcache accesses     ( +-  0.05% )  (38.56%)
>        511,231,988      L1-icache-loads                  #  207.038 M/sec                         ( +-  0.25% )  (38.52%)
>            128,533      L1-icache-load-misses            #    0.02% of all L1-icache accesses     ( +-  0.40% )  (38.48%)
>             38,134      dTLB-loads                       #   15.443 K/sec                         ( +-  4.22% )  (38.44%)
>             33,992      dTLB-load-misses                 #  114.39% of all dTLB cache accesses    ( +-  9.42% )  (38.40%)
>                156      iTLB-loads                       #   63.177 /sec                          ( +- 13.34% )  (38.36%)
>                156      iTLB-load-misses                 #  102.50% of all iTLB cache accesses    ( +- 25.98% )  (38.36%)
>
>            2.47044 +- 0.00949 seconds time elapsed  ( +-  0.38% )
>
> x86/folio_zero_user
>           1,056.72 msec task-clock                       #    0.996 CPUs utilized                 ( +-  0.07% )
>                 10      context-switches                 #    9.436 /sec                          ( +-  3.59% )
>                  3      cpu-migrations                   #    2.831 /sec                          ( +- 11.33% )
>                200      page-faults                      #  188.718 /sec                          ( +-  0.15% )
>      3,146,571,264      cycles                           #    2.969 GHz                           ( +-  0.07% )  (38.35%)
>         17,226,261      stalled-cycles-frontend          #    0.55% frontend cycles idle          ( +-  4.12% )  (38.44%)
>         14,130,553      instructions                     #    0.00  insn per cycle
>                                                          #    1.39  stalled cycles per insn       ( +-  1.59% )  (38.53%)
>          3,578,614      branches                         #    3.377 M/sec                         ( +-  1.54% )  (38.62%)
>            415,807      branch-misses                    #   12.45% of all branches               ( +-  1.17% )  (38.62%)
>         22,208,699      L1-dcache-loads                  #   20.956 M/sec                         ( +-  5.27% )  (38.60%)
>          7,312,684      L1-dcache-load-misses            #   27.79% of all L1-dcache accesses     ( +-  8.46% )  (38.51%)
>          4,032,315      L1-icache-loads                  #    3.805 M/sec                         ( +-  1.29% )  (38.48%)
>             15,094      L1-icache-load-misses            #    0.38% of all L1-icache accesses     ( +-  1.14% )  (38.39%)
>             14,365      dTLB-loads                       #   13.555 K/sec                         ( +-  7.23% )  (38.38%)
>              9,477      dTLB-load-misses                 #   65.36% of all dTLB cache accesses    ( +- 12.05% )  (38.38%)
>                 18      iTLB-loads                       #   16.985 /sec                          ( +- 34.84% )  (38.38%)
>                 67      iTLB-load-misses                 #  158.39% of all iTLB cache accesses    ( +- 48.32% )  (38.32%)
>
>           1.060877 +- 0.000766 seconds time elapsed  ( +-  0.07% )
>
> Thanks and Regards
> - Raghu


--
ankur
