From: Ankur Arora <ankur.a.arora@oracle.com>
To: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Ankur Arora <ankur.a.arora@oracle.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
torvalds@linux-foundation.org, akpm@linux-foundation.org,
bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com,
mingo@redhat.com, luto@kernel.org, peterz@infradead.org,
paulmck@kernel.org, rostedt@goodmis.org, tglx@linutronix.de,
willy@infradead.org, jon.grimm@amd.com, bharata@amd.com,
boris.ostrovsky@oracle.com, konrad.wilk@oracle.com
Subject: Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
Date: Tue, 22 Apr 2025 12:22:06 -0700 [thread overview]
Message-ID: <87jz7cq9wh.fsf@oracle.com> (raw)
In-Reply-To: <0d6ba41c-0c90-4130-896a-26eabbd5bd24@amd.com>
Raghavendra K T <raghavendra.kt@amd.com> writes:
> On 4/14/2025 9:16 AM, Ankur Arora wrote:
>> This series adds multi-page clearing for hugepages. It is a rework
>> of [1] which took a detour through PREEMPT_LAZY [2].
>> Why multi-page clearing? It improves upon the current
>> page-at-a-time approach by providing the processor with a hint
>> about the real region size. A processor could use this hint to,
>> for instance, elide cacheline allocation when clearing a large
>> region.
>> In particular, REP; STOS on AMD Zen performs this optimization:
>> regions larger than the L3 are cleared with non-temporal stores,
>> which results in significantly better performance.
>> We also see a performance improvement in cases where this
>> optimization is unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB
>> on Intel): REP; STOS is typically microcoded, and its overhead can
>> now be amortized over larger regions, while the hint allows the
>> hardware prefetcher to do a better job.
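To make the two approaches concrete, here is a minimal userspace sketch (the function names and the memset() backing are illustrative, not the kernel code): page-at-a-time clearing only ever shows the hardware 4K extents, while multi-page clearing exposes the full region size in a single call, which is what lets an implementation pick non-temporal stores for large regions.

```c
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096UL

/* Page-at-a-time: the hardware only ever sees 4K extents, so it has
 * no basis for eliding cacheline allocation. */
static void clear_page_at_a_time(void *buf, size_t len)
{
	for (size_t off = 0; off < len; off += PAGE_SIZE)
		memset((char *)buf + off, 0, PAGE_SIZE);
}

/* Multi-page: one call exposes the real region size; an
 * implementation backed by REP; STOS can then switch to
 * non-temporal stores for regions larger than the L3. */
static void clear_pages(void *buf, size_t len)
{
	memset(buf, 0, len);
}
```

Both produce the same zeroed memory; the difference is purely in what extent the underlying store machinery gets to see per call.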
>> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>> mm/folio_zero_user x86/folio_zero_user change
>> (GB/s +- stddev) (GB/s +- stddev)
>> pg-sz=1GB 16.51 +- 0.54% 42.80 +- 3.48% + 159.2%
>> pg-sz=2MB 11.89 +- 0.78% 16.12 +- 0.12% + 35.5%
>> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>> mm/folio_zero_user x86/folio_zero_user change
>> (GB/s +- stddev) (GB/s +- stddev)
>> pg-sz=1GB 8.01 +- 0.24% 11.26 +- 0.48% + 40.57%
>> pg-sz=2MB 7.95 +- 0.30% 10.90 +- 0.26% + 37.10%
>>
> [...]
>
> Hello Ankur,
>
> Thank you for the patches. I was able to test briefly with lazy preempt
> mode.
Thanks for testing.
> (I do understand that there could be a lot of churn based on Ingo's,
> Mateusz's, and others' comments.)
> But here it goes:
>
> SUT: AMD EPYC 9B24 (Genoa) preempt=lazy
>
> metric = time taken in sec (lower is better); total SIZE=64GB
>                 mm/folio_zero_user    x86/folio_zero_user   % change
> pg-sz=1GB       2.470440 +- 0.38%     1.060877 +- 0.07%     57.06
> pg-sz=2MB       5.098403 +- 0.01%     2.520150 +- 0.36%     50.57
Just to translate it into the same units I was using above:
                mm/folio_zero_user    x86/folio_zero_user
  pg-sz=1GB     25.91 GB/s +- 0.38%   60.37 GB/s +- 0.07%
  pg-sz=2MB     12.57 GB/s +- 0.01%   25.39 GB/s +- 0.36%
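(For reference, the conversion is just total size over elapsed time, 64 GB divided by the seconds reported by perf. A quick sketch reusing the timings quoted above; small rounding differences against the figures in this mail are expected:)

```c
#include <stdio.h>

/* Bandwidth in GB/s for clearing `gbytes` gigabytes in `secs` seconds. */
static double gbps(double gbytes, double secs)
{
	return gbytes / secs;
}

/* Elapsed times (seconds, SIZE=64GB) from the Genoa runs quoted above. */
static void print_genoa_bandwidths(void)
{
	static const struct { const char *name; double secs; } runs[] = {
		{ "pg-sz=1GB mm/folio_zero_user ", 2.470440 },
		{ "pg-sz=1GB x86/folio_zero_user", 1.060877 },
		{ "pg-sz=2MB mm/folio_zero_user ", 5.098403 },
		{ "pg-sz=2MB x86/folio_zero_user", 2.520150 },
	};
	for (unsigned i = 0; i < sizeof(runs) / sizeof(runs[0]); i++)
		printf("%s %6.2f GB/s\n", runs[i].name,
		       gbps(64.0, runs[i].secs));
}
```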
That's a decent improvement over Milan. Btw, are you using boost=1?
Also, any idea why there is such a huge delta between the
mm/folio_zero_user 2MB and 1GB cases? Both of these clear a 4K page at
a time, so the large difference is a little head-scratching.
There's a gap on Milan as well but it is much smaller.
Thanks
Ankur
> More details (1G example run):
>
> base kernel = 6.14 (preempt = lazy)
>
> mm/folio_zero_user
> Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):
>
>           2,476.47 msec task-clock          # 1.002 CPUs utilized          ( +- 0.39% )
>                  5 context-switches         # 2.025 /sec                   ( +- 29.70% )
>                  2 cpu-migrations           # 0.810 /sec                   ( +- 21.15% )
>                202 page-faults              # 81.806 /sec                  ( +- 0.18% )
>      7,348,664,233 cycles                   # 2.976 GHz                    ( +- 0.38% )  (38.39%)
>        878,805,326 stalled-cycles-frontend  # 11.99% frontend cycles idle  ( +- 0.74% )  (38.43%)
>        339,023,729 instructions             # 0.05 insn per cycle
>                                             # 2.53 stalled cycles per insn ( +- 0.08% )  (38.47%)
>         88,579,915 branches                 # 35.873 M/sec                 ( +- 0.06% )  (38.51%)
>         17,369,776 branch-misses            # 19.55% of all branches       ( +- 0.04% )  (38.55%)
>      2,261,339,695 L1-dcache-loads          # 915.795 M/sec                ( +- 0.06% )  (38.56%)
>      1,073,880,164 L1-dcache-load-misses    # 47.48% of all L1-dcache accesses ( +- 0.05% )  (38.56%)
>        511,231,988 L1-icache-loads          # 207.038 M/sec                ( +- 0.25% )  (38.52%)
>            128,533 L1-icache-load-misses    # 0.02% of all L1-icache accesses ( +- 0.40% )  (38.48%)
>             38,134 dTLB-loads               # 15.443 K/sec                 ( +- 4.22% )  (38.44%)
>             33,992 dTLB-load-misses         # 114.39% of all dTLB cache accesses ( +- 9.42% )  (38.40%)
>                156 iTLB-loads               # 63.177 /sec                  ( +- 13.34% )  (38.36%)
>                156 iTLB-load-misses         # 102.50% of all iTLB cache accesses ( +- 25.98% )  (38.36%)
>
>            2.47044 +- 0.00949 seconds time elapsed  ( +- 0.38% )
>
> x86/folio_zero_user
>           1,056.72 msec task-clock          # 0.996 CPUs utilized          ( +- 0.07% )
>                 10 context-switches         # 9.436 /sec                   ( +- 3.59% )
>                  3 cpu-migrations           # 2.831 /sec                   ( +- 11.33% )
>                200 page-faults              # 188.718 /sec                 ( +- 0.15% )
>      3,146,571,264 cycles                   # 2.969 GHz                    ( +- 0.07% )  (38.35%)
>         17,226,261 stalled-cycles-frontend  # 0.55% frontend cycles idle   ( +- 4.12% )  (38.44%)
>         14,130,553 instructions             # 0.00 insn per cycle
>                                             # 1.39 stalled cycles per insn ( +- 1.59% )  (38.53%)
>          3,578,614 branches                 # 3.377 M/sec                  ( +- 1.54% )  (38.62%)
>            415,807 branch-misses            # 12.45% of all branches       ( +- 1.17% )  (38.62%)
>         22,208,699 L1-dcache-loads          # 20.956 M/sec                 ( +- 5.27% )  (38.60%)
>          7,312,684 L1-dcache-load-misses    # 27.79% of all L1-dcache accesses ( +- 8.46% )  (38.51%)
>          4,032,315 L1-icache-loads          # 3.805 M/sec                  ( +- 1.29% )  (38.48%)
>             15,094 L1-icache-load-misses    # 0.38% of all L1-icache accesses ( +- 1.14% )  (38.39%)
>             14,365 dTLB-loads               # 13.555 K/sec                 ( +- 7.23% )  (38.38%)
>              9,477 dTLB-load-misses         # 65.36% of all dTLB cache accesses ( +- 12.05% )  (38.38%)
>                 18 iTLB-loads               # 16.985 /sec                  ( +- 34.84% )  (38.38%)
>                 67 iTLB-load-misses         # 158.39% of all iTLB cache accesses ( +- 48.32% )  (38.32%)
>
>           1.060877 +- 0.000766 seconds time elapsed  ( +- 0.07% )
>
> Thanks and Regards
> - Raghu
--
ankur
Thread overview: 38+ messages
2025-04-14 3:46 [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing Ankur Arora
2025-04-14 3:46 ` [PATCH v3 1/4] x86/clear_page: extend clear_page*() for " Ankur Arora
2025-04-14 6:32 ` Ingo Molnar
2025-04-14 11:02 ` Peter Zijlstra
2025-04-14 11:14 ` Ingo Molnar
2025-04-14 19:46 ` Ankur Arora
2025-04-14 22:26 ` Mateusz Guzik
2025-04-15 6:14 ` Ankur Arora
2025-04-15 8:22 ` Mateusz Guzik
2025-04-15 20:01 ` Ankur Arora
2025-04-15 20:32 ` Mateusz Guzik
2025-04-14 19:52 ` Ankur Arora
2025-04-14 20:09 ` Matthew Wilcox
2025-04-15 21:59 ` Ankur Arora
2025-04-14 3:46 ` [PATCH v3 2/4] x86/clear_page: add clear_pages() Ankur Arora
2025-04-14 3:46 ` [PATCH v3 3/4] huge_page: allow arch override for folio_zero_user() Ankur Arora
2025-04-14 3:46 ` [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing Ankur Arora
2025-04-14 6:53 ` Ingo Molnar
2025-04-14 21:21 ` Ankur Arora
2025-04-14 7:05 ` Ingo Molnar
2025-04-15 6:36 ` Ankur Arora
2025-04-22 6:36 ` Raghavendra K T
2025-04-22 19:14 ` Ankur Arora
2025-04-15 10:16 ` Mateusz Guzik
2025-04-15 21:46 ` Ankur Arora
2025-04-15 22:01 ` Mateusz Guzik
2025-04-16 4:46 ` Ankur Arora
2025-04-17 14:06 ` Mateusz Guzik
2025-04-14 5:34 ` [PATCH v3 0/4] mm/folio_zero_user: add " Ingo Molnar
2025-04-14 19:30 ` Ankur Arora
2025-04-14 6:36 ` Ingo Molnar
2025-04-14 19:19 ` Ankur Arora
2025-04-15 19:10 ` Zi Yan
2025-04-22 19:32 ` Ankur Arora
2025-04-22 6:23 ` Raghavendra K T
2025-04-22 19:22 ` Ankur Arora [this message]
2025-04-23 8:12 ` Raghavendra K T
2025-04-23 9:18 ` Raghavendra K T