From: Ankur Arora <ankur.a.arora@oracle.com>
To: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Ankur Arora <ankur.a.arora@oracle.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
torvalds@linux-foundation.org, akpm@linux-foundation.org,
bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com,
mingo@redhat.com, luto@kernel.org, peterz@infradead.org,
paulmck@kernel.org, rostedt@goodmis.org, tglx@linutronix.de,
willy@infradead.org, jon.grimm@amd.com, bharata@amd.com,
boris.ostrovsky@oracle.com, konrad.wilk@oracle.com
Subject: Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
Date: Tue, 22 Apr 2025 12:22:06 -0700
Message-ID: <87jz7cq9wh.fsf@oracle.com>
In-Reply-To: <0d6ba41c-0c90-4130-896a-26eabbd5bd24@amd.com>

Raghavendra K T <raghavendra.kt@amd.com> writes:

> On 4/14/2025 9:16 AM, Ankur Arora wrote:
>> This series adds multi-page clearing for hugepages. It is a rework
>> of [1] which took a detour through PREEMPT_LAZY [2].
>>
>> Why multi-page clearing?: multi-page clearing improves upon the
>> current page-at-a-time approach by providing the processor with a
>> hint as to the real region size. A processor could use this hint to,
>> for instance, elide cacheline allocation when clearing a large
>> region.
>>
>> This optimization in particular is done by REP; STOS on AMD Zen
>> where regions larger than L3-size use non-temporal stores.
>>
>> This results in significantly better performance.
>>
>> We also see performance improvement for cases where this optimization is
>> unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on Intel) because
>> REP; STOS is typically microcoded which can now be amortized over
>> larger regions and the hint allows the hardware prefetcher to do a
>> better job.
>>
>> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>>
>>               mm/folio_zero_user    x86/folio_zero_user     change
>>               (GB/s +- stddev)      (GB/s +- stddev)
>>
>>   pg-sz=1GB   16.51 +- 0.54%        42.80 +- 3.48%        + 159.2%
>>   pg-sz=2MB   11.89 +- 0.78%        16.12 +- 0.12%        +  35.5%
>>
>> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>>
>>               mm/folio_zero_user    x86/folio_zero_user     change
>>               (GB/s +- stddev)      (GB/s +- stddev)
>>
>>   pg-sz=1GB    8.01 +- 0.24%        11.26 +- 0.48%        + 40.57%
>>   pg-sz=2MB    7.95 +- 0.30%        10.90 +- 0.26%        + 37.10%
>>
> [...]
>
> Hello Ankur,
>
> Thank you for the patches. Was able to test briefly w/ lazy preempt
> mode.

Thanks for testing.

> (I do understand that, there could be lot of churn based on Ingo,
> Mateusz and others' comments)
> But here it goes:
>
> SUT: AMD EPYC 9B24 (Genoa) preempt=lazy
>
> metric = time taken in sec (lower is better). total SIZE=64GB
>
>               mm/folio_zero_user     x86/folio_zero_user    change
>   pg-sz=1GB   2.47044  +- 0.38%      1.060877 +- 0.07%      57.06
>   pg-sz=2MB   5.098403 +- 0.01%      2.52015  +- 0.36%      50.57

Just to translate it into the same units I was using above:

              mm/folio_zero_user      x86/folio_zero_user
  pg-sz=1GB   25.91 GB/s +- 0.38%     60.37 GB/s +- 0.07%
  pg-sz=2MB   12.57 GB/s +- 0.01%     25.39 GB/s +- 0.36%

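For reference, the conversion is just total size over elapsed time. A minimal sketch, assuming the 64GB total and the elapsed times from the quoted report (the function and variable names are illustrative only, not from the series):

```python
# Convert the elapsed time (seconds) for clearing a fixed total size
# into bandwidth (GB/s). TOTAL_GB is the "total SIZE=64GB" from the
# report; all names here are illustrative.
TOTAL_GB = 64

def bandwidth_gbps(elapsed_sec, total_gb=TOTAL_GB):
    """Bandwidth in GB/s for total_gb cleared in elapsed_sec seconds."""
    return total_gb / elapsed_sec

# Elapsed times (sec) from the quoted table: (baseline, patched).
elapsed = {
    "pg-sz=1GB": (2.47044, 1.060877),
    "pg-sz=2MB": (5.098403, 2.52015),
}

for pgsz, (base, patched) in elapsed.items():
    print(f"{pgsz}  {bandwidth_gbps(base):6.2f} GB/s  "
          f"{bandwidth_gbps(patched):6.2f} GB/s")
```

Plain division agrees with the table above to within about 0.05 GB/s.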
That's a decent improvement over Milan. Btw, are you using boost=1?

Also, any idea why there is such a huge delta between the 2MB and 1GB
cases for mm/folio_zero_user? Both of these clear a 4K page at a time,
so the size of the gap is a bit of a head-scratcher.

There's a gap on Milan as well, but it is much smaller.
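
To spell out why the gap is surprising: with page-at-a-time clearing, the number of 4K clears per run should be identical for both page sizes. A rough sketch of the arithmetic, assuming 4KB base pages, binary units, and the 64GB total from the quoted run (names are illustrative):

```python
# Count the folios and 4KB page clears needed for a 64GB run when the
# baseline clears one 4KB page at a time. Sizes are in bytes and
# binary units are assumed.
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

BASE_PAGE = 4 * KB
TOTAL = 64 * GB

def clear_counts(folio_size, total=TOTAL, base_page=BASE_PAGE):
    """Return (folios, 4K clears per folio, total 4K clears)."""
    folios = total // folio_size
    per_folio = folio_size // base_page
    return folios, per_folio, folios * per_folio

for name, size in (("pg-sz=1GB", GB), ("pg-sz=2MB", 2 * MB)):
    folios, per_folio, total_clears = clear_counts(size)
    print(f"{name}: {folios} folios x {per_folio} clears = {total_clears}")
```

Either way it is the same 16,777,216 4K clears. Only the number of folios differs: 64 vs 32768.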

Thanks
Ankur

> More details (1G example run):
>
> base kernel = 6.14 (preempt = lazy)
>
> mm/folio_zero_user
> Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):
>
>        2,476.47 msec task-clock              #    1.002 CPUs utilized              ( +-  0.39% )
>               5      context-switches        #    2.025 /sec                       ( +- 29.70% )
>               2      cpu-migrations          #    0.810 /sec                       ( +- 21.15% )
>             202      page-faults             #   81.806 /sec                       ( +-  0.18% )
>   7,348,664,233      cycles                  #    2.976 GHz                        ( +-  0.38% )  (38.39%)
>     878,805,326      stalled-cycles-frontend #   11.99% frontend cycles idle       ( +-  0.74% )  (38.43%)
>     339,023,729      instructions            #     0.05 insn per cycle
>                                              #     2.53 stalled cycles per insn    ( +-  0.08% )  (38.47%)
>      88,579,915      branches                #   35.873 M/sec                      ( +-  0.06% )  (38.51%)
>      17,369,776      branch-misses           #   19.55% of all branches            ( +-  0.04% )  (38.55%)
>   2,261,339,695      L1-dcache-loads         #  915.795 M/sec                      ( +-  0.06% )  (38.56%)
>   1,073,880,164      L1-dcache-load-misses   #   47.48% of all L1-dcache accesses  ( +-  0.05% )  (38.56%)
>     511,231,988      L1-icache-loads         #  207.038 M/sec                      ( +-  0.25% )  (38.52%)
>         128,533      L1-icache-load-misses   #    0.02% of all L1-icache accesses  ( +-  0.40% )  (38.48%)
>          38,134      dTLB-loads              #   15.443 K/sec                      ( +-  4.22% )  (38.44%)
>          33,992      dTLB-load-misses        #  114.39% of all dTLB cache accesses ( +-  9.42% )  (38.40%)
>             156      iTLB-loads              #   63.177 /sec                       ( +- 13.34% )  (38.36%)
>             156      iTLB-load-misses        #  102.50% of all iTLB cache accesses ( +- 25.98% )  (38.36%)
>
> 2.47044 +- 0.00949 seconds time elapsed ( +- 0.38% )
>
> x86/folio_zero_user
>        1,056.72 msec task-clock              #    0.996 CPUs utilized              ( +-  0.07% )
>              10      context-switches        #    9.436 /sec                       ( +-  3.59% )
>               3      cpu-migrations          #    2.831 /sec                       ( +- 11.33% )
>             200      page-faults             #  188.718 /sec                       ( +-  0.15% )
>   3,146,571,264      cycles                  #    2.969 GHz                        ( +-  0.07% )  (38.35%)
>      17,226,261      stalled-cycles-frontend #    0.55% frontend cycles idle       ( +-  4.12% )  (38.44%)
>      14,130,553      instructions            #     0.00 insn per cycle
>                                              #     1.39 stalled cycles per insn    ( +-  1.59% )  (38.53%)
>       3,578,614      branches                #    3.377 M/sec                      ( +-  1.54% )  (38.62%)
>         415,807      branch-misses           #   12.45% of all branches            ( +-  1.17% )  (38.62%)
>      22,208,699      L1-dcache-loads         #   20.956 M/sec                      ( +-  5.27% )  (38.60%)
>       7,312,684      L1-dcache-load-misses   #   27.79% of all L1-dcache accesses  ( +-  8.46% )  (38.51%)
>       4,032,315      L1-icache-loads         #    3.805 M/sec                      ( +-  1.29% )  (38.48%)
>          15,094      L1-icache-load-misses   #    0.38% of all L1-icache accesses  ( +-  1.14% )  (38.39%)
>          14,365      dTLB-loads              #   13.555 K/sec                      ( +-  7.23% )  (38.38%)
>           9,477      dTLB-load-misses        #   65.36% of all dTLB cache accesses ( +- 12.05% )  (38.38%)
>              18      iTLB-loads              #   16.985 /sec                       ( +- 34.84% )  (38.38%)
>              67      iTLB-load-misses        #  158.39% of all iTLB cache accesses ( +- 48.32% )  (38.32%)
>
> 1.060877 +- 0.000766 seconds time elapsed ( +- 0.07% )
>
> Thanks and Regards
> - Raghu
--
ankur