* [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
@ 2025-08-07 8:17 kernel test robot
2025-08-07 8:27 ` Lorenzo Stoakes
0 siblings, 1 reply; 23+ messages in thread
From: kernel test robot @ 2025-08-07 8:17 UTC (permalink / raw)
To: Dev Jain
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Barry Song,
Lorenzo Stoakes, Pedro Falcato, Anshuman Khandual, Bang Li,
Baolin Wang, bibo mao, David Hildenbrand, Hugh Dickins,
Ingo Molnar, Jann Horn, Lance Yang, Liam Howlett, Matthew Wilcox,
Peter Xu, Qi Zheng, Ryan Roberts, Vlastimil Babka, Yang Shi,
Zi Yan, linux-mm, oliver.sang
Hello,
kernel test robot noticed a 37.3% regression of stress-ng.bigheap.realloc_calls_per_sec on:
commit: f822a9a81a31311d67f260aea96005540b18ab07 ("mm: optimize mremap() by PTE batching")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
[still regression on linus/master 186f3edfdd41f2ae87fc40a9ccba52a3bf930994]
[still regression on linux-next/master b9ddaa95fd283bce7041550ddbbe7e764c477110]
testcase: stress-ng
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 192 threads 2 sockets Intel(R) Xeon(R) Platinum 8468V CPU @ 2.4GHz (Sapphire Rapids) with 384G memory
parameters:
nr_threads: 100%
testtime: 60s
test: bigheap
cpufreq_governor: performance
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202508071609.4e743d7c-lkp@intel.com
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250807/202508071609.4e743d7c-lkp@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/igk-spr-2sp1/bigheap/stress-ng/60s
commit:
94dab12d86 ("mm: call pointers to ptes as ptep")
f822a9a81a ("mm: optimize mremap() by PTE batching")
94dab12d86cf77ff f822a9a81a31311d67f260aea96
---------------- ---------------------------
%stddev %change %stddev
\ | \
13777 ± 37% +45.0% 19979 ± 27% numa-vmstat.node1.nr_slab_reclaimable
367205 +2.3% 375703 vmstat.system.in
55106 ± 37% +45.1% 79971 ± 27% numa-meminfo.node1.KReclaimable
55106 ± 37% +45.1% 79971 ± 27% numa-meminfo.node1.SReclaimable
559381 -37.3% 350757 stress-ng.bigheap.realloc_calls_per_sec
11468 +1.2% 11603 stress-ng.time.system_time
296.25 +4.5% 309.70 stress-ng.time.user_time
0.81 ±187% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
9.36 ±165% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
0.81 ±187% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
9.36 ±165% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
5.50 ± 17% +390.9% 27.00 ± 56% perf-c2c.DRAM.local
388.50 ± 10% +114.7% 834.17 ± 33% perf-c2c.DRAM.remote
1214 ± 13% +107.3% 2517 ± 31% perf-c2c.HITM.local
135.00 ± 19% +130.9% 311.67 ± 32% perf-c2c.HITM.remote
1349 ± 13% +109.6% 2829 ± 31% perf-c2c.HITM.total
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 8:17 [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression kernel test robot
@ 2025-08-07 8:27 ` Lorenzo Stoakes
2025-08-07 8:56 ` Dev Jain
` (2 more replies)
0 siblings, 3 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-08-07 8:27 UTC (permalink / raw)
To: kernel test robot
Cc: Dev Jain, oe-lkp, lkp, linux-kernel, Andrew Morton, Barry Song,
Pedro Falcato, Anshuman Khandual, Bang Li, Baolin Wang, bibo mao,
David Hildenbrand, Hugh Dickins, Ingo Molnar, Jann Horn,
Lance Yang, Liam Howlett, Matthew Wilcox, Peter Xu, Qi Zheng,
Ryan Roberts, Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
On Thu, Aug 07, 2025 at 04:17:09PM +0800, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed a 37.3% regression of stress-ng.bigheap.realloc_calls_per_sec on:
>
Dev - could you please investigate and provide a fix for this as a
priority? As these numbers are quite scary (unless they're somehow super
synthetic or not meaningful or something).
>
> commit: f822a9a81a31311d67f260aea96005540b18ab07 ("mm: optimize mremap() by PTE batching")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> [still regression on linus/master 186f3edfdd41f2ae87fc40a9ccba52a3bf930994]
> [still regression on linux-next/master b9ddaa95fd283bce7041550ddbbe7e764c477110]
>
> testcase: stress-ng
> config: x86_64-rhel-9.4
> compiler: gcc-12
> test machine: 192 threads 2 sockets Intel(R) Xeon(R) Platinum 8468V CPU @ 2.4GHz (Sapphire Rapids) with 384G memory
> parameters:
>
> nr_threads: 100%
> testtime: 60s
> test: bigheap
> cpufreq_governor: performance
>
>
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202508071609.4e743d7c-lkp@intel.com
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20250807/202508071609.4e743d7c-lkp@intel.com
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
> gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/igk-spr-2sp1/bigheap/stress-ng/60s
>
> commit:
> 94dab12d86 ("mm: call pointers to ptes as ptep")
> f822a9a81a ("mm: optimize mremap() by PTE batching")
>
> 94dab12d86cf77ff f822a9a81a31311d67f260aea96
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
> 13777 ± 37% +45.0% 19979 ± 27% numa-vmstat.node1.nr_slab_reclaimable
> 367205 +2.3% 375703 vmstat.system.in
> 55106 ± 37% +45.1% 79971 ± 27% numa-meminfo.node1.KReclaimable
> 55106 ± 37% +45.1% 79971 ± 27% numa-meminfo.node1.SReclaimable
> 559381 -37.3% 350757 stress-ng.bigheap.realloc_calls_per_sec
> 11468 +1.2% 11603 stress-ng.time.system_time
> 296.25 +4.5% 309.70 stress-ng.time.user_time
> 0.81 ±187% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> 9.36 ±165% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> 0.81 ±187% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> 9.36 ±165% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> 5.50 ± 17% +390.9% 27.00 ± 56% perf-c2c.DRAM.local
> 388.50 ± 10% +114.7% 834.17 ± 33% perf-c2c.DRAM.remote
> 1214 ± 13% +107.3% 2517 ± 31% perf-c2c.HITM.local
> 135.00 ± 19% +130.9% 311.67 ± 32% perf-c2c.HITM.remote
> 1349 ± 13% +109.6% 2829 ± 31% perf-c2c.HITM.total
Yeah this also looks pretty consistent too...
Yikes.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 8:27 ` Lorenzo Stoakes
@ 2025-08-07 8:56 ` Dev Jain
2025-08-07 10:21 ` David Hildenbrand
2025-08-07 17:37 ` Jann Horn
2 siblings, 0 replies; 23+ messages in thread
From: Dev Jain @ 2025-08-07 8:56 UTC (permalink / raw)
To: Lorenzo Stoakes, kernel test robot
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Barry Song,
Pedro Falcato, Anshuman Khandual, Bang Li, Baolin Wang, bibo mao,
David Hildenbrand, Hugh Dickins, Ingo Molnar, Jann Horn,
Lance Yang, Liam Howlett, Matthew Wilcox, Peter Xu, Qi Zheng,
Ryan Roberts, Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
On 07/08/25 1:57 pm, Lorenzo Stoakes wrote:
> On Thu, Aug 07, 2025 at 04:17:09PM +0800, kernel test robot wrote:
>>
>> Hello,
>>
>> kernel test robot noticed a 37.3% regression of stress-ng.bigheap.realloc_calls_per_sec on:
>>
> Dev - could you please investigate and provide a fix for this as a
> priority? As these numbers are quite scary (unless they're somehow super
> synthetic or not meaningful or something).
Yup I'll start looking.
>
>> commit: f822a9a81a31311d67f260aea96005540b18ab07 ("mm: optimize mremap() by PTE batching")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>
>> [still regression on linus/master 186f3edfdd41f2ae87fc40a9ccba52a3bf930994]
>> [still regression on linux-next/master b9ddaa95fd283bce7041550ddbbe7e764c477110]
>>
>> testcase: stress-ng
>> config: x86_64-rhel-9.4
>> compiler: gcc-12
>> test machine: 192 threads 2 sockets Intel(R) Xeon(R) Platinum 8468V CPU @ 2.4GHz (Sapphire Rapids) with 384G memory
>> parameters:
>>
>> nr_threads: 100%
>> testtime: 60s
>> test: bigheap
>> cpufreq_governor: performance
>>
>>
>>
>>
>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>> the same patch/commit), kindly add following tags
>> | Reported-by: kernel test robot <oliver.sang@intel.com>
>> | Closes: https://lore.kernel.org/oe-lkp/202508071609.4e743d7c-lkp@intel.com
>>
>>
>> Details are as below:
>> -------------------------------------------------------------------------------------------------->
>>
>>
>> The kernel config and materials to reproduce are available at:
>> https://download.01.org/0day-ci/archive/20250807/202508071609.4e743d7c-lkp@intel.com
>>
>> =========================================================================================
>> compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
>> gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/igk-spr-2sp1/bigheap/stress-ng/60s
>>
>> commit:
>> 94dab12d86 ("mm: call pointers to ptes as ptep")
>> f822a9a81a ("mm: optimize mremap() by PTE batching")
>>
>> 94dab12d86cf77ff f822a9a81a31311d67f260aea96
>> ---------------- ---------------------------
>> %stddev %change %stddev
>> \ | \
>> 13777 ± 37% +45.0% 19979 ± 27% numa-vmstat.node1.nr_slab_reclaimable
>> 367205 +2.3% 375703 vmstat.system.in
>> 55106 ± 37% +45.1% 79971 ± 27% numa-meminfo.node1.KReclaimable
>> 55106 ± 37% +45.1% 79971 ± 27% numa-meminfo.node1.SReclaimable
>> 559381 -37.3% 350757 stress-ng.bigheap.realloc_calls_per_sec
>> 11468 +1.2% 11603 stress-ng.time.system_time
>> 296.25 +4.5% 309.70 stress-ng.time.user_time
>> 0.81 ±187% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>> 9.36 ±165% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>> 0.81 ±187% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>> 9.36 ±165% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>> 5.50 ± 17% +390.9% 27.00 ± 56% perf-c2c.DRAM.local
>> 388.50 ± 10% +114.7% 834.17 ± 33% perf-c2c.DRAM.remote
>> 1214 ± 13% +107.3% 2517 ± 31% perf-c2c.HITM.local
>> 135.00 ± 19% +130.9% 311.67 ± 32% perf-c2c.HITM.remote
>> 1349 ± 13% +109.6% 2829 ± 31% perf-c2c.HITM.total
> Yeah this also looks pretty consistent too...
>
> Yikes.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 8:27 ` Lorenzo Stoakes
2025-08-07 8:56 ` Dev Jain
@ 2025-08-07 10:21 ` David Hildenbrand
2025-08-07 16:06 ` Dev Jain
2025-08-07 17:37 ` Jann Horn
2 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2025-08-07 10:21 UTC (permalink / raw)
To: Lorenzo Stoakes, kernel test robot
Cc: Dev Jain, oe-lkp, lkp, linux-kernel, Andrew Morton, Barry Song,
Pedro Falcato, Anshuman Khandual, Bang Li, Baolin Wang, bibo mao,
Hugh Dickins, Ingo Molnar, Jann Horn, Lance Yang, Liam Howlett,
Matthew Wilcox, Peter Xu, Qi Zheng, Ryan Roberts, Vlastimil Babka,
Yang Shi, Zi Yan, linux-mm
On 07.08.25 10:27, Lorenzo Stoakes wrote:
> On Thu, Aug 07, 2025 at 04:17:09PM +0800, kernel test robot wrote:
>>
>>
>> Hello,
>>
>> kernel test robot noticed a 37.3% regression of stress-ng.bigheap.realloc_calls_per_sec on:
>>
>
> Dev - could you please investigate and provide a fix for this as a
> priority? As these numbers are quite scary (unless they're somehow super
> synthetic or not meaningful or something).
>
>>
>> commit: f822a9a81a31311d67f260aea96005540b18ab07 ("mm: optimize mremap() by PTE batching")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>
>> [still regression on linus/master 186f3edfdd41f2ae87fc40a9ccba52a3bf930994]
>> [still regression on linux-next/master b9ddaa95fd283bce7041550ddbbe7e764c477110]
>>
>> testcase: stress-ng
>> config: x86_64-rhel-9.4
>> compiler: gcc-12
>> test machine: 192 threads 2 sockets Intel(R) Xeon(R) Platinum 8468V CPU @ 2.4GHz (Sapphire Rapids) with 384G memory
>> parameters:
>>
>> nr_threads: 100%
>> testtime: 60s
>> test: bigheap
>> cpufreq_governor: performance
>>
>>
>>
>>
>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>> the same patch/commit), kindly add following tags
>> | Reported-by: kernel test robot <oliver.sang@intel.com>
>> | Closes: https://lore.kernel.org/oe-lkp/202508071609.4e743d7c-lkp@intel.com
>>
>>
>> Details are as below:
>> -------------------------------------------------------------------------------------------------->
>>
>>
>> The kernel config and materials to reproduce are available at:
>> https://download.01.org/0day-ci/archive/20250807/202508071609.4e743d7c-lkp@intel.com
>>
>> =========================================================================================
>> compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
>> gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/igk-spr-2sp1/bigheap/stress-ng/60s
>>
>> commit:
>> 94dab12d86 ("mm: call pointers to ptes as ptep")
>> f822a9a81a ("mm: optimize mremap() by PTE batching")
>>
>> 94dab12d86cf77ff f822a9a81a31311d67f260aea96
>> ---------------- ---------------------------
>> %stddev %change %stddev
>> \ | \
>> 13777 ± 37% +45.0% 19979 ± 27% numa-vmstat.node1.nr_slab_reclaimable
>> 367205 +2.3% 375703 vmstat.system.in
>> 55106 ± 37% +45.1% 79971 ± 27% numa-meminfo.node1.KReclaimable
>> 55106 ± 37% +45.1% 79971 ± 27% numa-meminfo.node1.SReclaimable
>> 559381 -37.3% 350757 stress-ng.bigheap.realloc_calls_per_sec
>> 11468 +1.2% 11603 stress-ng.time.system_time
>> 296.25 +4.5% 309.70 stress-ng.time.user_time
>> 0.81 ±187% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>> 9.36 ±165% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>> 0.81 ±187% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>> 9.36 ±165% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>> 5.50 ± 17% +390.9% 27.00 ± 56% perf-c2c.DRAM.local
>> 388.50 ± 10% +114.7% 834.17 ± 33% perf-c2c.DRAM.remote
>> 1214 ± 13% +107.3% 2517 ± 31% perf-c2c.HITM.local
>> 135.00 ± 19% +130.9% 311.67 ± 32% perf-c2c.HITM.remote
>> 1349 ± 13% +109.6% 2829 ± 31% perf-c2c.HITM.total
>
> Yeah this also looks pretty consistent too...
It almost looks like some kind of NUMA effects?
I would have expected that it's the overhead of the vm_normal_folio(),
but not sure how that corresponds to the SLAB + local vs. remote stats.
Maybe they are just noise?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 10:21 ` David Hildenbrand
@ 2025-08-07 16:06 ` Dev Jain
2025-08-07 16:10 ` Lorenzo Stoakes
0 siblings, 1 reply; 23+ messages in thread
From: Dev Jain @ 2025-08-07 16:06 UTC (permalink / raw)
To: David Hildenbrand, Lorenzo Stoakes, kernel test robot
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Barry Song,
Pedro Falcato, Anshuman Khandual, Bang Li, Baolin Wang, bibo mao,
Hugh Dickins, Ingo Molnar, Jann Horn, Lance Yang, Liam Howlett,
Matthew Wilcox, Peter Xu, Qi Zheng, Ryan Roberts, Vlastimil Babka,
Yang Shi, Zi Yan, linux-mm
On 07/08/25 3:51 pm, David Hildenbrand wrote:
> On 07.08.25 10:27, Lorenzo Stoakes wrote:
>> On Thu, Aug 07, 2025 at 04:17:09PM +0800, kernel test robot wrote:
>>>
>>>
>>> Hello,
>>>
>>> kernel test robot noticed a 37.3% regression of
>>> stress-ng.bigheap.realloc_calls_per_sec on:
>>>
>>
>> Dev - could you please investigate and provide a fix for this as a
>> priority? As these numbers are quite scary (unless they're somehow super
>> synthetic or not meaningful or something).
>>
>>>
>>> commit: f822a9a81a31311d67f260aea96005540b18ab07 ("mm: optimize
>>> mremap() by PTE batching")
>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>
>>> [still regression on linus/master
>>> 186f3edfdd41f2ae87fc40a9ccba52a3bf930994]
>>> [still regression on linux-next/master
>>> b9ddaa95fd283bce7041550ddbbe7e764c477110]
>>>
>>> testcase: stress-ng
>>> config: x86_64-rhel-9.4
>>> compiler: gcc-12
>>> test machine: 192 threads 2 sockets Intel(R) Xeon(R) Platinum 8468V
>>> CPU @ 2.4GHz (Sapphire Rapids) with 384G memory
>>> parameters:
>>>
>>> nr_threads: 100%
>>> testtime: 60s
>>> test: bigheap
>>> cpufreq_governor: performance
>>>
>>>
>>>
>>>
>>> If you fix the issue in a separate patch/commit (i.e. not just a new
>>> version of
>>> the same patch/commit), kindly add following tags
>>> | Reported-by: kernel test robot <oliver.sang@intel.com>
>>> | Closes:
>>> https://lore.kernel.org/oe-lkp/202508071609.4e743d7c-lkp@intel.com
>>>
>>>
>>> Details are as below:
>>> -------------------------------------------------------------------------------------------------->
>>>
>>>
>>>
>>> The kernel config and materials to reproduce are available at:
>>> https://download.01.org/0day-ci/archive/20250807/202508071609.4e743d7c-lkp@intel.com
>>>
>>>
>>> =========================================================================================
>>>
>>> compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
>>>
>>> gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/igk-spr-2sp1/bigheap/stress-ng/60s
>>>
>>> commit:
>>> 94dab12d86 ("mm: call pointers to ptes as ptep")
>>> f822a9a81a ("mm: optimize mremap() by PTE batching")
>>>
>>> 94dab12d86cf77ff f822a9a81a31311d67f260aea96
>>> ---------------- ---------------------------
>>> %stddev %change %stddev
>>> \ | \
>>> 13777 ± 37% +45.0% 19979 ± 27%
>>> numa-vmstat.node1.nr_slab_reclaimable
>>> 367205 +2.3% 375703 vmstat.system.in
>>> 55106 ± 37% +45.1% 79971 ± 27%
>>> numa-meminfo.node1.KReclaimable
>>> 55106 ± 37% +45.1% 79971 ± 27%
>>> numa-meminfo.node1.SReclaimable
>>> 559381 -37.3% 350757
>>> stress-ng.bigheap.realloc_calls_per_sec
>>> 11468 +1.2% 11603 stress-ng.time.system_time
>>> 296.25 +4.5% 309.70 stress-ng.time.user_time
>>> 0.81 ±187% -100.0% 0.00
>>> perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>> 9.36 ±165% -100.0% 0.00
>>> perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>> 0.81 ±187% -100.0% 0.00
>>> perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>> 9.36 ±165% -100.0% 0.00
>>> perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>> 5.50 ± 17% +390.9% 27.00 ± 56% perf-c2c.DRAM.local
>>> 388.50 ± 10% +114.7% 834.17 ± 33% perf-c2c.DRAM.remote
>>> 1214 ± 13% +107.3% 2517 ± 31% perf-c2c.HITM.local
>>> 135.00 ± 19% +130.9% 311.67 ± 32% perf-c2c.HITM.remote
>>> 1349 ± 13% +109.6% 2829 ± 31% perf-c2c.HITM.total
>>
>> Yeah this also looks pretty consistent too...
>
> It almost looks like some kind of NUMA effects?
>
> I would have expected that it's the overhead of the vm_normal_folio(),
> but not sure how that corresponds to the SLAB + local vs. remote
> stats. Maybe they are just noise?
Is there any way of making the robot test again? As you said, the only
suspect is vm_normal_folio(), nothing seems to pop up...
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 16:06 ` Dev Jain
@ 2025-08-07 16:10 ` Lorenzo Stoakes
2025-08-07 16:16 ` Lorenzo Stoakes
0 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-08-07 16:10 UTC (permalink / raw)
To: Dev Jain
Cc: David Hildenbrand, kernel test robot, oe-lkp, lkp, linux-kernel,
Andrew Morton, Barry Song, Pedro Falcato, Anshuman Khandual,
Bang Li, Baolin Wang, bibo mao, Hugh Dickins, Ingo Molnar,
Jann Horn, Lance Yang, Liam Howlett, Matthew Wilcox, Peter Xu,
Qi Zheng, Ryan Roberts, Vlastimil Babka, Yang Shi, Zi Yan,
linux-mm
On Thu, Aug 07, 2025 at 09:36:38PM +0530, Dev Jain wrote:
> > > > commit:
> > > > 94dab12d86 ("mm: call pointers to ptes as ptep")
> > > > f822a9a81a ("mm: optimize mremap() by PTE batching")
> > > >
> > > > 94dab12d86cf77ff f822a9a81a31311d67f260aea96
> > > > ---------------- ---------------------------
> > > > %stddev %change %stddev
> > > > \ | \
> > > > 13777 ± 37% +45.0% 19979 ± 27%
> > > > numa-vmstat.node1.nr_slab_reclaimable
> > > > 367205 +2.3% 375703 vmstat.system.in
> > > > 55106 ± 37% +45.1% 79971 ± 27%
> > > > numa-meminfo.node1.KReclaimable
> > > > 55106 ± 37% +45.1% 79971 ± 27%
> > > > numa-meminfo.node1.SReclaimable
> > > > 559381 -37.3% 350757
> > > > stress-ng.bigheap.realloc_calls_per_sec
> > > > 11468 +1.2% 11603 stress-ng.time.system_time
> > > > 296.25 +4.5% 309.70 stress-ng.time.user_time
> > > > 0.81 ±187% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > 9.36 ±165% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > 0.81 ±187% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > 9.36 ±165% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
Hm is lack of zap some kind of clue here?
> > > > 5.50 ± 17% +390.9% 27.00 ± 56% perf-c2c.DRAM.local
> > > > 388.50 ± 10% +114.7% 834.17 ± 33% perf-c2c.DRAM.remote
> > > > 1214 ± 13% +107.3% 2517 ± 31% perf-c2c.HITM.local
> > > > 135.00 ± 19% +130.9% 311.67 ± 32% perf-c2c.HITM.remote
> > > > 1349 ± 13% +109.6% 2829 ± 31% perf-c2c.HITM.total
> > >
> > > Yeah this also looks pretty consistent too...
> >
> > It almost looks like some kind of NUMA effects?
> >
> > I would have expected that it's the overhead of the vm_normal_folio(),
> > but not sure how that corresponds to the SLAB + local vs. remote stats.
> > Maybe they are just noise?
> Is there any way of making the robot test again? As you said, the only
> suspect is vm_normal_folio(), nothing seems to pop up...
>
Not sure there's much point in that, these tests are run repeatedly and
statistical analysis taken from them so what would another run accomplish unless
there's something very consistently wrong with the box that happens only to
trigger at your commit?
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 16:10 ` Lorenzo Stoakes
@ 2025-08-07 16:16 ` Lorenzo Stoakes
2025-08-07 17:04 ` Dev Jain
0 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-08-07 16:16 UTC (permalink / raw)
To: Dev Jain
Cc: David Hildenbrand, kernel test robot, oe-lkp, lkp, linux-kernel,
Andrew Morton, Barry Song, Pedro Falcato, Anshuman Khandual,
Bang Li, Baolin Wang, bibo mao, Hugh Dickins, Ingo Molnar,
Jann Horn, Lance Yang, Liam Howlett, Matthew Wilcox, Peter Xu,
Qi Zheng, Ryan Roberts, Vlastimil Babka, Yang Shi, Zi Yan,
linux-mm
On Thu, Aug 07, 2025 at 05:10:17PM +0100, Lorenzo Stoakes wrote:
> On Thu, Aug 07, 2025 at 09:36:38PM +0530, Dev Jain wrote:
>
> > > > > commit:
> > > > > 94dab12d86 ("mm: call pointers to ptes as ptep")
> > > > > f822a9a81a ("mm: optimize mremap() by PTE batching")
> > > > >
> > > > > 94dab12d86cf77ff f822a9a81a31311d67f260aea96
> > > > > ---------------- ---------------------------
> > > > > %stddev %change %stddev
> > > > > \ | \
> > > > > 13777 ± 37% +45.0% 19979 ± 27%
> > > > > numa-vmstat.node1.nr_slab_reclaimable
> > > > > 367205 +2.3% 375703 vmstat.system.in
> > > > > 55106 ± 37% +45.1% 79971 ± 27%
> > > > > numa-meminfo.node1.KReclaimable
> > > > > 55106 ± 37% +45.1% 79971 ± 27%
> > > > > numa-meminfo.node1.SReclaimable
> > > > > 559381 -37.3% 350757
> > > > > stress-ng.bigheap.realloc_calls_per_sec
> > > > > 11468 +1.2% 11603 stress-ng.time.system_time
> > > > > 296.25 +4.5% 309.70 stress-ng.time.user_time
> > > > > 0.81 ±187% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > > 9.36 ±165% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > > 0.81 ±187% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > > 9.36 ±165% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>
> Hm is lack of zap some kind of clue here?
>
> > > > > 5.50 ± 17% +390.9% 27.00 ± 56% perf-c2c.DRAM.local
> > > > > 388.50 ± 10% +114.7% 834.17 ± 33% perf-c2c.DRAM.remote
> > > > > 1214 ± 13% +107.3% 2517 ± 31% perf-c2c.HITM.local
> > > > > 135.00 ± 19% +130.9% 311.67 ± 32% perf-c2c.HITM.remote
> > > > > 1349 ± 13% +109.6% 2829 ± 31% perf-c2c.HITM.total
> > > >
> > > > Yeah this also looks pretty consistent too...
> > >
> > > It almost looks like some kind of NUMA effects?
> > >
> > > I would have expected that it's the overhead of the vm_normal_folio(),
> > > but not sure how that corresponds to the SLAB + local vs. remote stats.
> > > Maybe they are just noise?
> > Is there any way of making the robot test again? As you said, the only
> > suspect is vm_normal_folio(), nothing seems to pop up...
> >
>
> Not sure there's much point in that, these tests are run repeatedly and
> statistical analysis taken from them so what would another run accomplish unless
> there's something very consistently wrong with the box that happens only to
> trigger at your commit?
>
> Cheers, Lorenzo
Let me play around on my test box roughly and see if I can repro
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 16:16 ` Lorenzo Stoakes
@ 2025-08-07 17:04 ` Dev Jain
2025-08-07 17:07 ` Lorenzo Stoakes
0 siblings, 1 reply; 23+ messages in thread
From: Dev Jain @ 2025-08-07 17:04 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: David Hildenbrand, kernel test robot, oe-lkp, lkp, linux-kernel,
Andrew Morton, Barry Song, Pedro Falcato, Anshuman Khandual,
Bang Li, Baolin Wang, bibo mao, Hugh Dickins, Ingo Molnar,
Jann Horn, Lance Yang, Liam Howlett, Matthew Wilcox, Peter Xu,
Qi Zheng, Ryan Roberts, Vlastimil Babka, Yang Shi, Zi Yan,
linux-mm
On 07/08/25 9:46 pm, Lorenzo Stoakes wrote:
> On Thu, Aug 07, 2025 at 05:10:17PM +0100, Lorenzo Stoakes wrote:
>> On Thu, Aug 07, 2025 at 09:36:38PM +0530, Dev Jain wrote:
>>
>>>>>> commit:
>>>>>> 94dab12d86 ("mm: call pointers to ptes as ptep")
>>>>>> f822a9a81a ("mm: optimize mremap() by PTE batching")
>>>>>>
>>>>>> 94dab12d86cf77ff f822a9a81a31311d67f260aea96
>>>>>> ---------------- ---------------------------
>>>>>> %stddev %change %stddev
>>>>>> \ | \
>>>>>> 13777 ± 37% +45.0% 19979 ± 27%
>>>>>> numa-vmstat.node1.nr_slab_reclaimable
>>>>>> 367205 +2.3% 375703 vmstat.system.in
>>>>>> 55106 ± 37% +45.1% 79971 ± 27%
>>>>>> numa-meminfo.node1.KReclaimable
>>>>>> 55106 ± 37% +45.1% 79971 ± 27%
>>>>>> numa-meminfo.node1.SReclaimable
>>>>>> 559381 -37.3% 350757
>>>>>> stress-ng.bigheap.realloc_calls_per_sec
>>>>>> 11468 +1.2% 11603 stress-ng.time.system_time
>>>>>> 296.25 +4.5% 309.70 stress-ng.time.user_time
>>>>>> 0.81 ±187% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>> 9.36 ±165% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>> 0.81 ±187% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>> 9.36 ±165% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>> Hm is lack of zap some kind of clue here?
>>
>>>>>> 5.50 ± 17% +390.9% 27.00 ± 56% perf-c2c.DRAM.local
>>>>>> 388.50 ± 10% +114.7% 834.17 ± 33% perf-c2c.DRAM.remote
>>>>>> 1214 ± 13% +107.3% 2517 ± 31% perf-c2c.HITM.local
>>>>>> 135.00 ± 19% +130.9% 311.67 ± 32% perf-c2c.HITM.remote
>>>>>> 1349 ± 13% +109.6% 2829 ± 31% perf-c2c.HITM.total
>>>>> Yeah this also looks pretty consistent too...
>>>> It almost looks like some kind of NUMA effects?
>>>>
>>>> I would have expected that it's the overhead of the vm_normal_folio(),
>>>> but not sure how that corresponds to the SLAB + local vs. remote stats.
>>>> Maybe they are just noise?
>>> Is there any way of making the robot test again? As you said, the only
>>> suspect is vm_normal_folio(), nothing seems to pop up...
>>>
>> Not sure there's much point in that, these tests are run repeatedly and
>> statistical analysis taken from them so what would another run accomplish unless
>> there's something very consistently wrong with the box that happens only to
>> trigger at your commit?
>>
>> Cheers, Lorenzo
> Let me play around on my test box roughly and see if I can repro
So I tested with
./stress-ng --timeout 1 --times --verify --metrics --no-rand-seed --oom-avoid --bigheap 20
extracted the number out of the line containing the output "realloc calls per sec", did an
avg and standard deviation over 20 runs. Before the patch:
Average realloc calls/sec: 196907.380000
Standard deviation : 12685.721021
After the patch:
Average realloc calls/sec: 187894.300500
Standard deviation : 12494.153533
which is 5% approx.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 17:04 ` Dev Jain
@ 2025-08-07 17:07 ` Lorenzo Stoakes
2025-08-07 17:11 ` Dev Jain
0 siblings, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-08-07 17:07 UTC (permalink / raw)
To: Dev Jain
Cc: David Hildenbrand, kernel test robot, oe-lkp, lkp, linux-kernel,
Andrew Morton, Barry Song, Pedro Falcato, Anshuman Khandual,
Bang Li, Baolin Wang, bibo mao, Hugh Dickins, Ingo Molnar,
Jann Horn, Lance Yang, Liam Howlett, Matthew Wilcox, Peter Xu,
Qi Zheng, Ryan Roberts, Vlastimil Babka, Yang Shi, Zi Yan,
linux-mm
On Thu, Aug 07, 2025 at 10:34:43PM +0530, Dev Jain wrote:
>
> On 07/08/25 9:46 pm, Lorenzo Stoakes wrote:
> > On Thu, Aug 07, 2025 at 05:10:17PM +0100, Lorenzo Stoakes wrote:
> > > On Thu, Aug 07, 2025 at 09:36:38PM +0530, Dev Jain wrote:
> > >
> > > > > > > commit:
> > > > > > > 94dab12d86 ("mm: call pointers to ptes as ptep")
> > > > > > > f822a9a81a ("mm: optimize mremap() by PTE batching")
> > > > > > >
> > > > > > > 94dab12d86cf77ff f822a9a81a31311d67f260aea96
> > > > > > > ---------------- ---------------------------
> > > > > > > %stddev %change %stddev
> > > > > > > \ | \
> > > > > > > 13777 ± 37% +45.0% 19979 ± 27%
> > > > > > > numa-vmstat.node1.nr_slab_reclaimable
> > > > > > > 367205 +2.3% 375703 vmstat.system.in
> > > > > > > 55106 ± 37% +45.1% 79971 ± 27%
> > > > > > > numa-meminfo.node1.KReclaimable
> > > > > > > 55106 ± 37% +45.1% 79971 ± 27%
> > > > > > > numa-meminfo.node1.SReclaimable
> > > > > > > 559381 -37.3% 350757
> > > > > > > stress-ng.bigheap.realloc_calls_per_sec
> > > > > > > 11468 +1.2% 11603 stress-ng.time.system_time
> > > > > > > 296.25 +4.5% 309.70 stress-ng.time.user_time
> > > > > > > 0.81 ±187% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > > > > 9.36 ±165% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > > > > 0.81 ±187% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > > > > 9.36 ±165% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > Hm is lack of zap some kind of clue here?
> > >
> > > > > > > 5.50 ± 17% +390.9% 27.00 ± 56% perf-c2c.DRAM.local
> > > > > > > 388.50 ± 10% +114.7% 834.17 ± 33% perf-c2c.DRAM.remote
> > > > > > > 1214 ± 13% +107.3% 2517 ± 31% perf-c2c.HITM.local
> > > > > > > 135.00 ± 19% +130.9% 311.67 ± 32% perf-c2c.HITM.remote
> > > > > > > 1349 ± 13% +109.6% 2829 ± 31% perf-c2c.HITM.total
> > > > > > Yeah this also looks pretty consistent too...
> > > > > It almost looks like some kind of NUMA effects?
> > > > >
> > > > > I would have expected that it's the overhead of the vm_normal_folio(),
> > > > > but not sure how that corresponds to the SLAB + local vs. remote stats.
> > > > > Maybe they are just noise?
> > > > Is there any way of making the robot test again? As you said, the only
> > > > suspect is vm_normal_folio(), nothing seems to pop up...
> > > >
> > > Not sure there's much point in that, these tests are run repeatedly and
> > > statistical analysis taken from them so what would another run accomplish unless
> > > there's something very consistently wrong with the box that happens only to
> > > trigger at your commit?
> > >
> > > Cheers, Lorenzo
> > Let me play around on my test box roughly and see if I can repro
>
> So I tested with
> ./stress-ng --timeout 1 --times --verify --metrics --no-rand-seed --oom-avoid --bigheap 20
> extracted the number out of the line containing the output "realloc calls per sec", did an
> avg and standard deviation over 20 runs. Before the patch:
>
> Average realloc calls/sec: 196907.380000
> Standard deviation : 12685.721021
>
> After the patch:
>
> Average realloc calls/sec: 187894.300500
> Standard deviation : 12494.153533
>
> which is 5% approx.
>
Are you testing that on x86-64 bare metal?
Anyway this is _not_ what I get.
I am testing on my test box, and seeing a _very significant_ regression as reported.
I am narrowing down the exact cause and will report back. Non-NUMA box, recent
uArch, dedicated machine.
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 17:07 ` Lorenzo Stoakes
@ 2025-08-07 17:11 ` Dev Jain
0 siblings, 0 replies; 23+ messages in thread
From: Dev Jain @ 2025-08-07 17:11 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: David Hildenbrand, kernel test robot, oe-lkp, lkp, linux-kernel,
Andrew Morton, Barry Song, Pedro Falcato, Anshuman Khandual,
Bang Li, Baolin Wang, bibo mao, Hugh Dickins, Ingo Molnar,
Jann Horn, Lance Yang, Liam Howlett, Matthew Wilcox, Peter Xu,
Qi Zheng, Ryan Roberts, Vlastimil Babka, Yang Shi, Zi Yan,
linux-mm
On 07/08/25 10:37 pm, Lorenzo Stoakes wrote:
> On Thu, Aug 07, 2025 at 10:34:43PM +0530, Dev Jain wrote:
>> On 07/08/25 9:46 pm, Lorenzo Stoakes wrote:
>>> On Thu, Aug 07, 2025 at 05:10:17PM +0100, Lorenzo Stoakes wrote:
>>>> On Thu, Aug 07, 2025 at 09:36:38PM +0530, Dev Jain wrote:
>>>>
>>>>>>>> commit:
>>>>>>>> 94dab12d86 ("mm: call pointers to ptes as ptep")
>>>>>>>> f822a9a81a ("mm: optimize mremap() by PTE batching")
>>>>>>>>
>>>>>>>> 94dab12d86cf77ff f822a9a81a31311d67f260aea96
>>>>>>>> ---------------- ---------------------------
>>>>>>>> %stddev %change %stddev
>>>>>>>> \ | \
>>>>>>>> 13777 ± 37% +45.0% 19979 ± 27%
>>>>>>>> numa-vmstat.node1.nr_slab_reclaimable
>>>>>>>> 367205 +2.3% 375703 vmstat.system.in
>>>>>>>> 55106 ± 37% +45.1% 79971 ± 27%
>>>>>>>> numa-meminfo.node1.KReclaimable
>>>>>>>> 55106 ± 37% +45.1% 79971 ± 27%
>>>>>>>> numa-meminfo.node1.SReclaimable
>>>>>>>> 559381 -37.3% 350757
>>>>>>>> stress-ng.bigheap.realloc_calls_per_sec
>>>>>>>> 11468 +1.2% 11603 stress-ng.time.system_time
>>>>>>>> 296.25 +4.5% 309.70 stress-ng.time.user_time
>>>>>>>> 0.81 ±187% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>>>> 9.36 ±165% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>>>> 0.81 ±187% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>>>> 9.36 ±165% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>> Hm is lack of zap some kind of clue here?
>>>>
>>>>>>>> 5.50 ± 17% +390.9% 27.00 ± 56% perf-c2c.DRAM.local
>>>>>>>> 388.50 ± 10% +114.7% 834.17 ± 33% perf-c2c.DRAM.remote
>>>>>>>> 1214 ± 13% +107.3% 2517 ± 31% perf-c2c.HITM.local
>>>>>>>> 135.00 ± 19% +130.9% 311.67 ± 32% perf-c2c.HITM.remote
>>>>>>>> 1349 ± 13% +109.6% 2829 ± 31% perf-c2c.HITM.total
>>>>>>> Yeah this also looks pretty consistent too...
>>>>>> It almost looks like some kind of NUMA effects?
>>>>>>
>>>>>> I would have expected that it's the overhead of the vm_normal_folio(),
>>>>>> but not sure how that corresponds to the SLAB + local vs. remote stats.
>>>>>> Maybe they are just noise?
>>>>> Is there any way of making the robot test again? As you said, the only
>>>>> suspect is vm_normal_folio(), nothing seems to pop up...
>>>>>
>>>> Not sure there's much point in that, these tests are run repeatedly and
>>>> statistical analysis taken from them so what would another run accomplish unless
>>>> there's something very consistently wrong with the box that happens only to
>>>> trigger at your commit?
>>>>
>>>> Cheers, Lorenzo
>>> Let me play around on my test box roughly and see if I can repro
>> So I tested with
>> ./stress-ng --timeout 1 --times --verify --metrics --no-rand-seed --oom-avoid --bigheap 20
>> extracted the number out of the line containing the output "realloc calls per sec", did an
>> avg and standard deviation over 20 runs. Before the patch:
>>
>> Average realloc calls/sec: 196907.380000
>> Standard deviation : 12685.721021
>>
>> After the patch:
>>
>> Average realloc calls/sec: 187894.300500
>> Standard deviation : 12494.153533
>>
>> which is 5% approx.
>>
> Are you testing that on x86-64 bare metal?
Qemu VM on x86-64.
>
> Anyway this is _not_ what I get.
>
> I am testing on my test box, and seeing a _very significant_ regression as reported.
>
> I am narrowing down the exact cause and will report back. Non-NUMA box, recent
> uArch, dedicated machine.
Oops. Thanks for testing. Lemme stare at my patch for some more time :)
>
> Cheers, Lorenzo
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 8:27 ` Lorenzo Stoakes
2025-08-07 8:56 ` Dev Jain
2025-08-07 10:21 ` David Hildenbrand
@ 2025-08-07 17:37 ` Jann Horn
2025-08-07 17:41 ` Lorenzo Stoakes
2 siblings, 1 reply; 23+ messages in thread
From: Jann Horn @ 2025-08-07 17:37 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: kernel test robot, Dev Jain, oe-lkp, lkp, linux-kernel,
Andrew Morton, Barry Song, Pedro Falcato, Anshuman Khandual,
Bang Li, Baolin Wang, bibo mao, David Hildenbrand, Hugh Dickins,
Ingo Molnar, Lance Yang, Liam Howlett, Matthew Wilcox, Peter Xu,
Qi Zheng, Ryan Roberts, Vlastimil Babka, Yang Shi, Zi Yan,
linux-mm
On Thu, Aug 7, 2025 at 10:28 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Thu, Aug 07, 2025 at 04:17:09PM +0800, kernel test robot wrote:
> > 94dab12d86cf77ff f822a9a81a31311d67f260aea96
> > ---------------- ---------------------------
> > %stddev %change %stddev
> > \ | \
> > 13777 ą 37% +45.0% 19979 ą 27% numa-vmstat.node1.nr_slab_reclaimable
> > 367205 +2.3% 375703 vmstat.system.in
> > 55106 ą 37% +45.1% 79971 ą 27% numa-meminfo.node1.KReclaimable
> > 55106 ą 37% +45.1% 79971 ą 27% numa-meminfo.node1.SReclaimable
> > 559381 -37.3% 350757 stress-ng.bigheap.realloc_calls_per_sec
> > 11468 +1.2% 11603 stress-ng.time.system_time
> > 296.25 +4.5% 309.70 stress-ng.time.user_time
> > 0.81 ą187% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > 9.36 ą165% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > 0.81 ą187% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > 9.36 ą165% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > 5.50 ą 17% +390.9% 27.00 ą 56% perf-c2c.DRAM.local
> > 388.50 ą 10% +114.7% 834.17 ą 33% perf-c2c.DRAM.remote
> > 1214 ą 13% +107.3% 2517 ą 31% perf-c2c.HITM.local
> > 135.00 ą 19% +130.9% 311.67 ą 32% perf-c2c.HITM.remote
> > 1349 ą 13% +109.6% 2829 ą 31% perf-c2c.HITM.total
>
> Yeah this also looks pretty consistent too...
FWIW, HITM hat different meanings depending on exactly which
microarchitecture that test happened on; the message says it is from
Sapphire Rapids, which is a successor of Ice Lake, so HITM is less
meaningful than if it came from a pre-IceLake system (see
https://lore.kernel.org/all/CAG48ez3RmV6SsVw9oyTXxQXHp3rqtKDk2qwJWo9TGvXCq7Xr-w@mail.gmail.com/).
To me those numbers mainly look like you're accessing a lot more
cache-cold data. (On pre-IceLake they would indicate cacheline
bouncing, but I guess here they probably don't.) And that makes sense,
since before the patch, this path was just moving PTEs around without
looking at the associated pages/folios; basically more or less like a
memcpy() on x86-64. But after the patch, for every 8 bytes that you
copy, you have to load a cacheline from the vmemmap to get the page.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 17:37 ` Jann Horn
@ 2025-08-07 17:41 ` Lorenzo Stoakes
2025-08-07 17:46 ` Jann Horn
2025-08-07 17:59 ` David Hildenbrand
0 siblings, 2 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-08-07 17:41 UTC (permalink / raw)
To: Jann Horn
Cc: kernel test robot, Dev Jain, oe-lkp, lkp, linux-kernel,
Andrew Morton, Barry Song, Pedro Falcato, Anshuman Khandual,
Bang Li, Baolin Wang, bibo mao, David Hildenbrand, Hugh Dickins,
Ingo Molnar, Lance Yang, Liam Howlett, Matthew Wilcox, Peter Xu,
Qi Zheng, Ryan Roberts, Vlastimil Babka, Yang Shi, Zi Yan,
linux-mm
On Thu, Aug 07, 2025 at 07:37:38PM +0200, Jann Horn wrote:
> On Thu, Aug 7, 2025 at 10:28 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > On Thu, Aug 07, 2025 at 04:17:09PM +0800, kernel test robot wrote:
> > > 94dab12d86cf77ff f822a9a81a31311d67f260aea96
> > > ---------------- ---------------------------
> > > %stddev %change %stddev
> > > \ | \
> > > 13777 ą 37% +45.0% 19979 ą 27% numa-vmstat.node1.nr_slab_reclaimable
> > > 367205 +2.3% 375703 vmstat.system.in
> > > 55106 ą 37% +45.1% 79971 ą 27% numa-meminfo.node1.KReclaimable
> > > 55106 ą 37% +45.1% 79971 ą 27% numa-meminfo.node1.SReclaimable
> > > 559381 -37.3% 350757 stress-ng.bigheap.realloc_calls_per_sec
> > > 11468 +1.2% 11603 stress-ng.time.system_time
> > > 296.25 +4.5% 309.70 stress-ng.time.user_time
> > > 0.81 ą187% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > 9.36 ą165% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > 0.81 ą187% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > 9.36 ą165% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > 5.50 ą 17% +390.9% 27.00 ą 56% perf-c2c.DRAM.local
> > > 388.50 ą 10% +114.7% 834.17 ą 33% perf-c2c.DRAM.remote
> > > 1214 ą 13% +107.3% 2517 ą 31% perf-c2c.HITM.local
> > > 135.00 ą 19% +130.9% 311.67 ą 32% perf-c2c.HITM.remote
> > > 1349 ą 13% +109.6% 2829 ą 31% perf-c2c.HITM.total
> >
> > Yeah this also looks pretty consistent too...
>
> FWIW, HITM hat different meanings depending on exactly which
> microarchitecture that test happened on; the message says it is from
> Sapphire Rapids, which is a successor of Ice Lake, so HITM is less
> meaningful than if it came from a pre-IceLake system (see
> https://lore.kernel.org/all/CAG48ez3RmV6SsVw9oyTXxQXHp3rqtKDk2qwJWo9TGvXCq7Xr-w@mail.gmail.com/).
>
> To me those numbers mainly look like you're accessing a lot more
> cache-cold data. (On pre-IceLake they would indicate cacheline
> bouncing, but I guess here they probably don't.) And that makes sense,
> since before the patch, this path was just moving PTEs around without
> looking at the associated pages/folios; basically more or less like a
> memcpy() on x86-64. But after the patch, for every 8 bytes that you
> copy, you have to load a cacheline from the vmemmap to get the page.
Yup this is representative of what my investigation is showing.
I've narrowed it down but want to wait to report until I'm sure...
But yeah we're doing a _lot_ more work.
I'm leaning towards disabling except for arm64 atm tbh, seems mremap is
especially sensitive to this (I found issues with this with my abortive mremap
anon merging stuff too, but really expected it there...)
On assumption arm64 is _definitely_ faster. I wonder if older arm64 arches might
suffer?
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 17:41 ` Lorenzo Stoakes
@ 2025-08-07 17:46 ` Jann Horn
2025-08-07 17:50 ` Dev Jain
2025-08-07 17:51 ` Lorenzo Stoakes
2025-08-07 17:59 ` David Hildenbrand
1 sibling, 2 replies; 23+ messages in thread
From: Jann Horn @ 2025-08-07 17:46 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: kernel test robot, Dev Jain, oe-lkp, lkp, linux-kernel,
Andrew Morton, Barry Song, Pedro Falcato, Anshuman Khandual,
Bang Li, Baolin Wang, bibo mao, David Hildenbrand, Hugh Dickins,
Ingo Molnar, Lance Yang, Liam Howlett, Matthew Wilcox, Peter Xu,
Qi Zheng, Ryan Roberts, Vlastimil Babka, Yang Shi, Zi Yan,
linux-mm
On Thu, Aug 7, 2025 at 7:41 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Thu, Aug 07, 2025 at 07:37:38PM +0200, Jann Horn wrote:
> > On Thu, Aug 7, 2025 at 10:28 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > > On Thu, Aug 07, 2025 at 04:17:09PM +0800, kernel test robot wrote:
> > > > 94dab12d86cf77ff f822a9a81a31311d67f260aea96
> > > > ---------------- ---------------------------
> > > > %stddev %change %stddev
> > > > \ | \
> > > > 13777 ą 37% +45.0% 19979 ą 27% numa-vmstat.node1.nr_slab_reclaimable
> > > > 367205 +2.3% 375703 vmstat.system.in
> > > > 55106 ą 37% +45.1% 79971 ą 27% numa-meminfo.node1.KReclaimable
> > > > 55106 ą 37% +45.1% 79971 ą 27% numa-meminfo.node1.SReclaimable
> > > > 559381 -37.3% 350757 stress-ng.bigheap.realloc_calls_per_sec
> > > > 11468 +1.2% 11603 stress-ng.time.system_time
> > > > 296.25 +4.5% 309.70 stress-ng.time.user_time
> > > > 0.81 ą187% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > 9.36 ą165% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > 0.81 ą187% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > 9.36 ą165% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > 5.50 ą 17% +390.9% 27.00 ą 56% perf-c2c.DRAM.local
> > > > 388.50 ą 10% +114.7% 834.17 ą 33% perf-c2c.DRAM.remote
> > > > 1214 ą 13% +107.3% 2517 ą 31% perf-c2c.HITM.local
> > > > 135.00 ą 19% +130.9% 311.67 ą 32% perf-c2c.HITM.remote
> > > > 1349 ą 13% +109.6% 2829 ą 31% perf-c2c.HITM.total
> > >
> > > Yeah this also looks pretty consistent too...
> >
> > FWIW, HITM hat different meanings depending on exactly which
> > microarchitecture that test happened on; the message says it is from
> > Sapphire Rapids, which is a successor of Ice Lake, so HITM is less
> > meaningful than if it came from a pre-IceLake system (see
> > https://lore.kernel.org/all/CAG48ez3RmV6SsVw9oyTXxQXHp3rqtKDk2qwJWo9TGvXCq7Xr-w@mail.gmail.com/).
> >
> > To me those numbers mainly look like you're accessing a lot more
> > cache-cold data. (On pre-IceLake they would indicate cacheline
> > bouncing, but I guess here they probably don't.) And that makes sense,
> > since before the patch, this path was just moving PTEs around without
> > looking at the associated pages/folios; basically more or less like a
> > memcpy() on x86-64. But after the patch, for every 8 bytes that you
> > copy, you have to load a cacheline from the vmemmap to get the page.
>
> Yup this is representative of what my investigation is showing.
>
> I've narrowed it down but want to wait to report until I'm sure...
>
> But yeah we're doing a _lot_ more work.
>
> I'm leaning towards disabling except for arm64 atm tbh, seems mremap is
> especially sensitive to this (I found issues with this with my abortive mremap
> anon merging stuff too, but really expected it there...)
Another approach would be to always read and write PTEs in
contpte-sized chunks here, without caring whether they're actually
contiguous or whatever, or something along those lines.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 17:46 ` Jann Horn
@ 2025-08-07 17:50 ` Dev Jain
2025-08-07 17:53 ` Lorenzo Stoakes
2025-08-07 17:51 ` Lorenzo Stoakes
1 sibling, 1 reply; 23+ messages in thread
From: Dev Jain @ 2025-08-07 17:50 UTC (permalink / raw)
To: Jann Horn, Lorenzo Stoakes
Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Andrew Morton,
Barry Song, Pedro Falcato, Anshuman Khandual, Bang Li,
Baolin Wang, bibo mao, David Hildenbrand, Hugh Dickins,
Ingo Molnar, Lance Yang, Liam Howlett, Matthew Wilcox, Peter Xu,
Qi Zheng, Ryan Roberts, Vlastimil Babka, Yang Shi, Zi Yan,
linux-mm
On 07/08/25 11:16 pm, Jann Horn wrote:
> On Thu, Aug 7, 2025 at 7:41 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
>> On Thu, Aug 07, 2025 at 07:37:38PM +0200, Jann Horn wrote:
>>> On Thu, Aug 7, 2025 at 10:28 AM Lorenzo Stoakes
>>> <lorenzo.stoakes@oracle.com> wrote:
>>>> On Thu, Aug 07, 2025 at 04:17:09PM +0800, kernel test robot wrote:
>>>>> 94dab12d86cf77ff f822a9a81a31311d67f260aea96
>>>>> ---------------- ---------------------------
>>>>> %stddev %change %stddev
>>>>> \ | \
>>>>> 13777 ą 37% +45.0% 19979 ą 27% numa-vmstat.node1.nr_slab_reclaimable
>>>>> 367205 +2.3% 375703 vmstat.system.in
>>>>> 55106 ą 37% +45.1% 79971 ą 27% numa-meminfo.node1.KReclaimable
>>>>> 55106 ą 37% +45.1% 79971 ą 27% numa-meminfo.node1.SReclaimable
>>>>> 559381 -37.3% 350757 stress-ng.bigheap.realloc_calls_per_sec
>>>>> 11468 +1.2% 11603 stress-ng.time.system_time
>>>>> 296.25 +4.5% 309.70 stress-ng.time.user_time
>>>>> 0.81 ą187% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>> 9.36 ą165% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>> 0.81 ą187% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>> 9.36 ą165% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>> 5.50 ą 17% +390.9% 27.00 ą 56% perf-c2c.DRAM.local
>>>>> 388.50 ą 10% +114.7% 834.17 ą 33% perf-c2c.DRAM.remote
>>>>> 1214 ą 13% +107.3% 2517 ą 31% perf-c2c.HITM.local
>>>>> 135.00 ą 19% +130.9% 311.67 ą 32% perf-c2c.HITM.remote
>>>>> 1349 ą 13% +109.6% 2829 ą 31% perf-c2c.HITM.total
>>>> Yeah this also looks pretty consistent too...
>>> FWIW, HITM hat different meanings depending on exactly which
>>> microarchitecture that test happened on; the message says it is from
>>> Sapphire Rapids, which is a successor of Ice Lake, so HITM is less
>>> meaningful than if it came from a pre-IceLake system (see
>>> https://lore.kernel.org/all/CAG48ez3RmV6SsVw9oyTXxQXHp3rqtKDk2qwJWo9TGvXCq7Xr-w@mail.gmail.com/).
>>>
>>> To me those numbers mainly look like you're accessing a lot more
>>> cache-cold data. (On pre-IceLake they would indicate cacheline
>>> bouncing, but I guess here they probably don't.) And that makes sense,
>>> since before the patch, this path was just moving PTEs around without
>>> looking at the associated pages/folios; basically more or less like a
>>> memcpy() on x86-64. But after the patch, for every 8 bytes that you
>>> copy, you have to load a cacheline from the vmemmap to get the page.
>> Yup this is representative of what my investigation is showing.
>>
>> I've narrowed it down but want to wait to report until I'm sure...
>>
>> But yeah we're doing a _lot_ more work.
>>
>> I'm leaning towards disabling except for arm64 atm tbh, seems mremap is
>> especially sensitive to this (I found issues with this with my abortive mremap
>> anon merging stuff too, but really expected it there...)
> Another approach would be to always read and write PTEs in
> contpte-sized chunks here, without caring whether they're actually
> contiguous or whatever, or something along those lines.
The initial approach was to wrap all of this around pte_batch_hint(),
effectively making the optimization only on arm64. I guess that sounds
reasonable now.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 17:46 ` Jann Horn
2025-08-07 17:50 ` Dev Jain
@ 2025-08-07 17:51 ` Lorenzo Stoakes
2025-08-07 18:01 ` David Hildenbrand
1 sibling, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-08-07 17:51 UTC (permalink / raw)
To: Jann Horn
Cc: kernel test robot, Dev Jain, oe-lkp, lkp, linux-kernel,
Andrew Morton, Barry Song, Pedro Falcato, Anshuman Khandual,
Bang Li, Baolin Wang, bibo mao, David Hildenbrand, Hugh Dickins,
Ingo Molnar, Lance Yang, Liam Howlett, Matthew Wilcox, Peter Xu,
Qi Zheng, Ryan Roberts, Vlastimil Babka, Yang Shi, Zi Yan,
linux-mm
On Thu, Aug 07, 2025 at 07:46:39PM +0200, Jann Horn wrote:
> On Thu, Aug 7, 2025 at 7:41 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > On Thu, Aug 07, 2025 at 07:37:38PM +0200, Jann Horn wrote:
> > > On Thu, Aug 7, 2025 at 10:28 AM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > > On Thu, Aug 07, 2025 at 04:17:09PM +0800, kernel test robot wrote:
> > > > > 94dab12d86cf77ff f822a9a81a31311d67f260aea96
> > > > > ---------------- ---------------------------
> > > > > %stddev %change %stddev
> > > > > \ | \
> > > > > 13777 ą 37% +45.0% 19979 ą 27% numa-vmstat.node1.nr_slab_reclaimable
> > > > > 367205 +2.3% 375703 vmstat.system.in
> > > > > 55106 ą 37% +45.1% 79971 ą 27% numa-meminfo.node1.KReclaimable
> > > > > 55106 ą 37% +45.1% 79971 ą 27% numa-meminfo.node1.SReclaimable
> > > > > 559381 -37.3% 350757 stress-ng.bigheap.realloc_calls_per_sec
> > > > > 11468 +1.2% 11603 stress-ng.time.system_time
> > > > > 296.25 +4.5% 309.70 stress-ng.time.user_time
> > > > > 0.81 ą187% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > > 9.36 ą165% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > > 0.81 ą187% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > > 9.36 ą165% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > > 5.50 ą 17% +390.9% 27.00 ą 56% perf-c2c.DRAM.local
> > > > > 388.50 ą 10% +114.7% 834.17 ą 33% perf-c2c.DRAM.remote
> > > > > 1214 ą 13% +107.3% 2517 ą 31% perf-c2c.HITM.local
> > > > > 135.00 ą 19% +130.9% 311.67 ą 32% perf-c2c.HITM.remote
> > > > > 1349 ą 13% +109.6% 2829 ą 31% perf-c2c.HITM.total
> > > >
> > > > Yeah this also looks pretty consistent too...
> > >
> > > FWIW, HITM hat different meanings depending on exactly which
> > > microarchitecture that test happened on; the message says it is from
> > > Sapphire Rapids, which is a successor of Ice Lake, so HITM is less
> > > meaningful than if it came from a pre-IceLake system (see
> > > https://lore.kernel.org/all/CAG48ez3RmV6SsVw9oyTXxQXHp3rqtKDk2qwJWo9TGvXCq7Xr-w@mail.gmail.com/).
> > >
> > > To me those numbers mainly look like you're accessing a lot more
> > > cache-cold data. (On pre-IceLake they would indicate cacheline
> > > bouncing, but I guess here they probably don't.) And that makes sense,
> > > since before the patch, this path was just moving PTEs around without
> > > looking at the associated pages/folios; basically more or less like a
> > > memcpy() on x86-64. But after the patch, for every 8 bytes that you
> > > copy, you have to load a cacheline from the vmemmap to get the page.
> >
> > Yup this is representative of what my investigation is showing.
> >
> > I've narrowed it down but want to wait to report until I'm sure...
> >
> > But yeah we're doing a _lot_ more work.
> >
> > I'm leaning towards disabling except for arm64 atm tbh, seems mremap is
> > especially sensitive to this (I found issues with this with my abortive mremap
> > anon merging stuff too, but really expected it there...)
>
> Another approach would be to always read and write PTEs in
> contpte-sized chunks here, without caring whether they're actually
> contiguous or whatever, or something along those lines.
Not sure I love that, you'd have to figure out offset without cont pte batch and
can it vary? And we're doing this on non-arm64 arches for what reason?
And would it solve anything really? We'd still be looking at folio, yes less
than now, but uselessly for arches that don't benefit?
The basis of this series was (and I did explicitly ask) that it wouldn't harm
other arches.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 17:50 ` Dev Jain
@ 2025-08-07 17:53 ` Lorenzo Stoakes
0 siblings, 0 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-08-07 17:53 UTC (permalink / raw)
To: Dev Jain
Cc: Jann Horn, kernel test robot, oe-lkp, lkp, linux-kernel,
Andrew Morton, Barry Song, Pedro Falcato, Anshuman Khandual,
Bang Li, Baolin Wang, bibo mao, David Hildenbrand, Hugh Dickins,
Ingo Molnar, Lance Yang, Liam Howlett, Matthew Wilcox, Peter Xu,
Qi Zheng, Ryan Roberts, Vlastimil Babka, Yang Shi, Zi Yan,
linux-mm
On Thu, Aug 07, 2025 at 11:20:13PM +0530, Dev Jain wrote:
>
> On 07/08/25 11:16 pm, Jann Horn wrote:
> > On Thu, Aug 7, 2025 at 7:41 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > > On Thu, Aug 07, 2025 at 07:37:38PM +0200, Jann Horn wrote:
> > > > On Thu, Aug 7, 2025 at 10:28 AM Lorenzo Stoakes
> > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > > On Thu, Aug 07, 2025 at 04:17:09PM +0800, kernel test robot wrote:
> > > > > > 94dab12d86cf77ff f822a9a81a31311d67f260aea96
> > > > > > ---------------- ---------------------------
> > > > > > %stddev %change %stddev
> > > > > > \ | \
> > > > > > 13777 ą 37% +45.0% 19979 ą 27% numa-vmstat.node1.nr_slab_reclaimable
> > > > > > 367205 +2.3% 375703 vmstat.system.in
> > > > > > 55106 ą 37% +45.1% 79971 ą 27% numa-meminfo.node1.KReclaimable
> > > > > > 55106 ą 37% +45.1% 79971 ą 27% numa-meminfo.node1.SReclaimable
> > > > > > 559381 -37.3% 350757 stress-ng.bigheap.realloc_calls_per_sec
> > > > > > 11468 +1.2% 11603 stress-ng.time.system_time
> > > > > > 296.25 +4.5% 309.70 stress-ng.time.user_time
> > > > > > 0.81 ą187% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > > > 9.36 ą165% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > > > 0.81 ą187% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > > > 9.36 ą165% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > > > 5.50 ą 17% +390.9% 27.00 ą 56% perf-c2c.DRAM.local
> > > > > > 388.50 ą 10% +114.7% 834.17 ą 33% perf-c2c.DRAM.remote
> > > > > > 1214 ą 13% +107.3% 2517 ą 31% perf-c2c.HITM.local
> > > > > > 135.00 ą 19% +130.9% 311.67 ą 32% perf-c2c.HITM.remote
> > > > > > 1349 ą 13% +109.6% 2829 ą 31% perf-c2c.HITM.total
> > > > > Yeah this also looks pretty consistent too...
> > > > FWIW, HITM hat different meanings depending on exactly which
> > > > microarchitecture that test happened on; the message says it is from
> > > > Sapphire Rapids, which is a successor of Ice Lake, so HITM is less
> > > > meaningful than if it came from a pre-IceLake system (see
> > > > https://lore.kernel.org/all/CAG48ez3RmV6SsVw9oyTXxQXHp3rqtKDk2qwJWo9TGvXCq7Xr-w@mail.gmail.com/).
> > > >
> > > > To me those numbers mainly look like you're accessing a lot more
> > > > cache-cold data. (On pre-IceLake they would indicate cacheline
> > > > bouncing, but I guess here they probably don't.) And that makes sense,
> > > > since before the patch, this path was just moving PTEs around without
> > > > looking at the associated pages/folios; basically more or less like a
> > > > memcpy() on x86-64. But after the patch, for every 8 bytes that you
> > > > copy, you have to load a cacheline from the vmemmap to get the page.
> > > Yup this is representative of what my investigation is showing.
> > >
> > > I've narrowed it down but want to wait to report until I'm sure...
> > >
> > > But yeah we're doing a _lot_ more work.
> > >
> > > I'm leaning towards disabling except for arm64 atm tbh, seems mremap is
> > > especially sensitive to this (I found issues with this with my abortive mremap
> > > anon merging stuff too, but really expected it there...)
> > Another approach would be to always read and write PTEs in
> > contpte-sized chunks here, without caring whether they're actually
> > contiguous or whatever, or something along those lines.
>
> The initial approach was to wrap all of this around pte_batch_hint(),
> effectively making the optimization only on arm64. I guess that sounds
> reasonable now.
>
I wish people would just wait for me to finish checking this on my box...
Anyway, as with Jann's point, I have empirical evidence to support that yes
it's the folio lookup that's the issue.
I also was thinking exactly to try this hint thing and to see.
Let me try...
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 17:41 ` Lorenzo Stoakes
2025-08-07 17:46 ` Jann Horn
@ 2025-08-07 17:59 ` David Hildenbrand
1 sibling, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2025-08-07 17:59 UTC (permalink / raw)
To: Lorenzo Stoakes, Jann Horn
Cc: kernel test robot, Dev Jain, oe-lkp, lkp, linux-kernel,
Andrew Morton, Barry Song, Pedro Falcato, Anshuman Khandual,
Bang Li, Baolin Wang, bibo mao, Hugh Dickins, Ingo Molnar,
Lance Yang, Liam Howlett, Matthew Wilcox, Peter Xu, Qi Zheng,
Ryan Roberts, Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
On 07.08.25 19:41, Lorenzo Stoakes wrote:
> On Thu, Aug 07, 2025 at 07:37:38PM +0200, Jann Horn wrote:
>> On Thu, Aug 7, 2025 at 10:28 AM Lorenzo Stoakes
>> <lorenzo.stoakes@oracle.com> wrote:
>>> On Thu, Aug 07, 2025 at 04:17:09PM +0800, kernel test robot wrote:
>>>> 94dab12d86cf77ff f822a9a81a31311d67f260aea96
>>>> ---------------- ---------------------------
>>>> %stddev %change %stddev
>>>> \ | \
>>>> 13777 ą 37% +45.0% 19979 ą 27% numa-vmstat.node1.nr_slab_reclaimable
>>>> 367205 +2.3% 375703 vmstat.system.in
>>>> 55106 ą 37% +45.1% 79971 ą 27% numa-meminfo.node1.KReclaimable
>>>> 55106 ą 37% +45.1% 79971 ą 27% numa-meminfo.node1.SReclaimable
>>>> 559381 -37.3% 350757 stress-ng.bigheap.realloc_calls_per_sec
>>>> 11468 +1.2% 11603 stress-ng.time.system_time
>>>> 296.25 +4.5% 309.70 stress-ng.time.user_time
>>>> 0.81 ą187% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>> 9.36 ą165% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>> 0.81 ą187% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>> 9.36 ą165% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>> 5.50 ą 17% +390.9% 27.00 ą 56% perf-c2c.DRAM.local
>>>> 388.50 ą 10% +114.7% 834.17 ą 33% perf-c2c.DRAM.remote
>>>> 1214 ą 13% +107.3% 2517 ą 31% perf-c2c.HITM.local
>>>> 135.00 ą 19% +130.9% 311.67 ą 32% perf-c2c.HITM.remote
>>>> 1349 ą 13% +109.6% 2829 ą 31% perf-c2c.HITM.total
>>>
>>> Yeah this also looks pretty consistent too...
>>
>> FWIW, HITM hat different meanings depending on exactly which
>> microarchitecture that test happened on; the message says it is from
>> Sapphire Rapids, which is a successor of Ice Lake, so HITM is less
>> meaningful than if it came from a pre-IceLake system (see
>> https://lore.kernel.org/all/CAG48ez3RmV6SsVw9oyTXxQXHp3rqtKDk2qwJWo9TGvXCq7Xr-w@mail.gmail.com/).
>>
>> To me those numbers mainly look like you're accessing a lot more
>> cache-cold data. (On pre-IceLake they would indicate cacheline
>> bouncing, but I guess here they probably don't.) And that makes sense,
>> since before the patch, this path was just moving PTEs around without
>> looking at the associated pages/folios; basically more or less like a
>> memcpy() on x86-64. But after the patch, for every 8 bytes that you
>> copy, you have to load a cacheline from the vmemmap to get the page.
>
> Yup this is representative of what my investigation is showing.
>
> I've narrowed it down but want to wait to report until I'm sure...
>
> But yeah we're doing a _lot_ more work.
>
> I'm leaning towards disabling except for arm64 atm tbh, seems mremap is
> especially sensitive to this (I found issues with this with my abortive mremap
> anon merging stuff too, but really expected it there...)
>
> On assumption arm64 is _definitely_ faster. I wonder if older arm64 arches might
> suffer?
Are we sure it's not also slower on arm64 with small folios? I would be
surprised if it isn't.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 17:51 ` Lorenzo Stoakes
@ 2025-08-07 18:01 ` David Hildenbrand
2025-08-07 18:04 ` Lorenzo Stoakes
2025-08-07 18:07 ` Jann Horn
0 siblings, 2 replies; 23+ messages in thread
From: David Hildenbrand @ 2025-08-07 18:01 UTC (permalink / raw)
To: Lorenzo Stoakes, Jann Horn
Cc: kernel test robot, Dev Jain, oe-lkp, lkp, linux-kernel,
Andrew Morton, Barry Song, Pedro Falcato, Anshuman Khandual,
Bang Li, Baolin Wang, bibo mao, Hugh Dickins, Ingo Molnar,
Lance Yang, Liam Howlett, Matthew Wilcox, Peter Xu, Qi Zheng,
Ryan Roberts, Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
On 07.08.25 19:51, Lorenzo Stoakes wrote:
> On Thu, Aug 07, 2025 at 07:46:39PM +0200, Jann Horn wrote:
>> On Thu, Aug 7, 2025 at 7:41 PM Lorenzo Stoakes
>> <lorenzo.stoakes@oracle.com> wrote:
>>> On Thu, Aug 07, 2025 at 07:37:38PM +0200, Jann Horn wrote:
>>>> On Thu, Aug 7, 2025 at 10:28 AM Lorenzo Stoakes
>>>> <lorenzo.stoakes@oracle.com> wrote:
>>>>> On Thu, Aug 07, 2025 at 04:17:09PM +0800, kernel test robot wrote:
>>>>>> 94dab12d86cf77ff f822a9a81a31311d67f260aea96
>>>>>> ---------------- ---------------------------
>>>>>> %stddev %change %stddev
>>>>>> \ | \
>>>>>> 13777 ą 37% +45.0% 19979 ą 27% numa-vmstat.node1.nr_slab_reclaimable
>>>>>> 367205 +2.3% 375703 vmstat.system.in
>>>>>> 55106 ą 37% +45.1% 79971 ą 27% numa-meminfo.node1.KReclaimable
>>>>>> 55106 ą 37% +45.1% 79971 ą 27% numa-meminfo.node1.SReclaimable
>>>>>> 559381 -37.3% 350757 stress-ng.bigheap.realloc_calls_per_sec
>>>>>> 11468 +1.2% 11603 stress-ng.time.system_time
>>>>>> 296.25 +4.5% 309.70 stress-ng.time.user_time
>>>>>> 0.81 ą187% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>> 9.36 ą165% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>> 0.81 ą187% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>> 9.36 ą165% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>> 5.50 ą 17% +390.9% 27.00 ą 56% perf-c2c.DRAM.local
>>>>>> 388.50 ą 10% +114.7% 834.17 ą 33% perf-c2c.DRAM.remote
>>>>>> 1214 ą 13% +107.3% 2517 ą 31% perf-c2c.HITM.local
>>>>>> 135.00 ą 19% +130.9% 311.67 ą 32% perf-c2c.HITM.remote
>>>>>> 1349 ą 13% +109.6% 2829 ą 31% perf-c2c.HITM.total
>>>>>
>>>>> Yeah this also looks pretty consistent too...
>>>>
>>>> FWIW, HITM hat different meanings depending on exactly which
>>>> microarchitecture that test happened on; the message says it is from
>>>> Sapphire Rapids, which is a successor of Ice Lake, so HITM is less
>>>> meaningful than if it came from a pre-IceLake system (see
>>>> https://lore.kernel.org/all/CAG48ez3RmV6SsVw9oyTXxQXHp3rqtKDk2qwJWo9TGvXCq7Xr-w@mail.gmail.com/).
>>>>
>>>> To me those numbers mainly look like you're accessing a lot more
>>>> cache-cold data. (On pre-IceLake they would indicate cacheline
>>>> bouncing, but I guess here they probably don't.) And that makes sense,
>>>> since before the patch, this path was just moving PTEs around without
>>>> looking at the associated pages/folios; basically more or less like a
>>>> memcpy() on x86-64. But after the patch, for every 8 bytes that you
>>>> copy, you have to load a cacheline from the vmemmap to get the page.
>>>
>>> Yup this is representative of what my investigation is showing.
>>>
>>> I've narrowed it down but want to wait to report until I'm sure...
>>>
>>> But yeah we're doing a _lot_ more work.
>>>
>>> I'm leaning towards disabling except for arm64 atm tbh, seems mremap is
>>> especially sensitive to this (I found issues with this with my abortive mremap
>>> anon merging stuff too, but really expected it there...)
>>
>> Another approach would be to always read and write PTEs in
>> contpte-sized chunks here, without caring whether they're actually
>> contiguous or whatever, or something along those lines.
>
> Not sure I love that, you'd have to figure out offset without cont pte batch and
> can it vary? And we're doing this on non-arm64 arches for what reason?
>
> And would it solve anything really? We'd still be looking at folio, yes less
> than now, but uselessly for arches that don't benefit?
>
> The basis of this series was (and I did explicitly ask) that it wouldn't harm
> other arches.
We'd need some hint to detect "this is either small" or "this is
unbatchable".
Sure, we could use pte_batch_hint(), but I'm curious if x86 would also
benefit with larger folios (e.g., 64K, 128K) with this patch.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 18:01 ` David Hildenbrand
@ 2025-08-07 18:04 ` Lorenzo Stoakes
2025-08-07 18:13 ` David Hildenbrand
2025-08-07 18:07 ` Jann Horn
1 sibling, 1 reply; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-08-07 18:04 UTC (permalink / raw)
To: David Hildenbrand
Cc: Jann Horn, kernel test robot, Dev Jain, oe-lkp, lkp, linux-kernel,
Andrew Morton, Barry Song, Pedro Falcato, Anshuman Khandual,
Bang Li, Baolin Wang, bibo mao, Hugh Dickins, Ingo Molnar,
Lance Yang, Liam Howlett, Matthew Wilcox, Peter Xu, Qi Zheng,
Ryan Roberts, Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
On Thu, Aug 07, 2025 at 08:01:51PM +0200, David Hildenbrand wrote:
> On 07.08.25 19:51, Lorenzo Stoakes wrote:
> > On Thu, Aug 07, 2025 at 07:46:39PM +0200, Jann Horn wrote:
> > > On Thu, Aug 7, 2025 at 7:41 PM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > > On Thu, Aug 07, 2025 at 07:37:38PM +0200, Jann Horn wrote:
> > > > > On Thu, Aug 7, 2025 at 10:28 AM Lorenzo Stoakes
> > > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > > > On Thu, Aug 07, 2025 at 04:17:09PM +0800, kernel test robot wrote:
> > > > > > > 94dab12d86cf77ff f822a9a81a31311d67f260aea96
> > > > > > > ---------------- ---------------------------
> > > > > > > %stddev %change %stddev
> > > > > > > \ | \
> > > > > > > 13777 ą 37% +45.0% 19979 ą 27% numa-vmstat.node1.nr_slab_reclaimable
> > > > > > > 367205 +2.3% 375703 vmstat.system.in
> > > > > > > 55106 ą 37% +45.1% 79971 ą 27% numa-meminfo.node1.KReclaimable
> > > > > > > 55106 ą 37% +45.1% 79971 ą 27% numa-meminfo.node1.SReclaimable
> > > > > > > 559381 -37.3% 350757 stress-ng.bigheap.realloc_calls_per_sec
> > > > > > > 11468 +1.2% 11603 stress-ng.time.system_time
> > > > > > > 296.25 +4.5% 309.70 stress-ng.time.user_time
> > > > > > > 0.81 ą187% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > > > > 9.36 ą165% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > > > > 0.81 ą187% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > > > > 9.36 ą165% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
> > > > > > > 5.50 ą 17% +390.9% 27.00 ą 56% perf-c2c.DRAM.local
> > > > > > > 388.50 ą 10% +114.7% 834.17 ą 33% perf-c2c.DRAM.remote
> > > > > > > 1214 ą 13% +107.3% 2517 ą 31% perf-c2c.HITM.local
> > > > > > > 135.00 ą 19% +130.9% 311.67 ą 32% perf-c2c.HITM.remote
> > > > > > > 1349 ą 13% +109.6% 2829 ą 31% perf-c2c.HITM.total
> > > > > >
> > > > > > Yeah this also looks pretty consistent too...
> > > > >
> > > > > FWIW, HITM hat different meanings depending on exactly which
> > > > > microarchitecture that test happened on; the message says it is from
> > > > > Sapphire Rapids, which is a successor of Ice Lake, so HITM is less
> > > > > meaningful than if it came from a pre-IceLake system (see
> > > > > https://lore.kernel.org/all/CAG48ez3RmV6SsVw9oyTXxQXHp3rqtKDk2qwJWo9TGvXCq7Xr-w@mail.gmail.com/).
> > > > >
> > > > > To me those numbers mainly look like you're accessing a lot more
> > > > > cache-cold data. (On pre-IceLake they would indicate cacheline
> > > > > bouncing, but I guess here they probably don't.) And that makes sense,
> > > > > since before the patch, this path was just moving PTEs around without
> > > > > looking at the associated pages/folios; basically more or less like a
> > > > > memcpy() on x86-64. But after the patch, for every 8 bytes that you
> > > > > copy, you have to load a cacheline from the vmemmap to get the page.
> > > >
> > > > Yup this is representative of what my investigation is showing.
> > > >
> > > > I've narrowed it down but want to wait to report until I'm sure...
> > > >
> > > > But yeah we're doing a _lot_ more work.
> > > >
> > > > I'm leaning towards disabling except for arm64 atm tbh, seems mremap is
> > > > especially sensitive to this (I found issues with this with my abortive mremap
> > > > anon merging stuff too, but really expected it there...)
> > >
> > > Another approach would be to always read and write PTEs in
> > > contpte-sized chunks here, without caring whether they're actually
> > > contiguous or whatever, or something along those lines.
> >
> > Not sure I love that, you'd have to figure out offset without cont pte batch and
> > can it vary? And we're doing this on non-arm64 arches for what reason?
> >
> > And would it solve anything really? We'd still be looking at folio, yes less
> > than now, but uselessly for arches that don't benefit?
> >
> > The basis of this series was (and I did explicitly ask) that it wouldn't harm
> > other arches.
>
> We'd need some hint to detect "this is either small" or "this is
> unbatchable".
>
> Sure, we could use pte_batch_hint(), but I'm curious if x86 would also
> benefit with larger folios (e.g., 64K, 128K) with this patch.
For the record I did think of using this prior to being mentioned, product of
actually trying to get the data to back this up instead of talking...
Anyway, isn't that chicken and egg? We'd have to go get the folio to find out if
large folio and incur the cost before we knew?
So how could we make that workable?
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 18:01 ` David Hildenbrand
2025-08-07 18:04 ` Lorenzo Stoakes
@ 2025-08-07 18:07 ` Jann Horn
2025-08-07 18:31 ` David Hildenbrand
1 sibling, 1 reply; 23+ messages in thread
From: Jann Horn @ 2025-08-07 18:07 UTC (permalink / raw)
To: David Hildenbrand
Cc: Lorenzo Stoakes, kernel test robot, Dev Jain, oe-lkp, lkp,
linux-kernel, Andrew Morton, Barry Song, Pedro Falcato,
Anshuman Khandual, Bang Li, Baolin Wang, bibo mao, Hugh Dickins,
Ingo Molnar, Lance Yang, Liam Howlett, Matthew Wilcox, Peter Xu,
Qi Zheng, Ryan Roberts, Vlastimil Babka, Yang Shi, Zi Yan,
linux-mm
On Thu, Aug 7, 2025 at 8:02 PM David Hildenbrand <david@redhat.com> wrote:
> Sure, we could use pte_batch_hint(), but I'm curious if x86 would also
> benefit with larger folios (e.g., 64K, 128K) with this patch.
Where would you expect such a benefit to come from? This function is
more or less a memcpy(), except it has to read PTEs with xchg(), write
them atomically, and set softdirty flags. For x86, what the associated
folios look like and whether the PTEs are contiguous shouldn't matter.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 18:04 ` Lorenzo Stoakes
@ 2025-08-07 18:13 ` David Hildenbrand
0 siblings, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2025-08-07 18:13 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Jann Horn, kernel test robot, Dev Jain, oe-lkp, lkp, linux-kernel,
Andrew Morton, Barry Song, Pedro Falcato, Anshuman Khandual,
Bang Li, Baolin Wang, bibo mao, Hugh Dickins, Ingo Molnar,
Lance Yang, Liam Howlett, Matthew Wilcox, Peter Xu, Qi Zheng,
Ryan Roberts, Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
On 07.08.25 20:04, Lorenzo Stoakes wrote:
> On Thu, Aug 07, 2025 at 08:01:51PM +0200, David Hildenbrand wrote:
>> On 07.08.25 19:51, Lorenzo Stoakes wrote:
>>> On Thu, Aug 07, 2025 at 07:46:39PM +0200, Jann Horn wrote:
>>>> On Thu, Aug 7, 2025 at 7:41 PM Lorenzo Stoakes
>>>> <lorenzo.stoakes@oracle.com> wrote:
>>>>> On Thu, Aug 07, 2025 at 07:37:38PM +0200, Jann Horn wrote:
>>>>>> On Thu, Aug 7, 2025 at 10:28 AM Lorenzo Stoakes
>>>>>> <lorenzo.stoakes@oracle.com> wrote:
>>>>>>> On Thu, Aug 07, 2025 at 04:17:09PM +0800, kernel test robot wrote:
>>>>>>>> 94dab12d86cf77ff f822a9a81a31311d67f260aea96
>>>>>>>> ---------------- ---------------------------
>>>>>>>> %stddev %change %stddev
>>>>>>>> \ | \
>>>>>>>> 13777 ą 37% +45.0% 19979 ą 27% numa-vmstat.node1.nr_slab_reclaimable
>>>>>>>> 367205 +2.3% 375703 vmstat.system.in
>>>>>>>> 55106 ą 37% +45.1% 79971 ą 27% numa-meminfo.node1.KReclaimable
>>>>>>>> 55106 ą 37% +45.1% 79971 ą 27% numa-meminfo.node1.SReclaimable
>>>>>>>> 559381 -37.3% 350757 stress-ng.bigheap.realloc_calls_per_sec
>>>>>>>> 11468 +1.2% 11603 stress-ng.time.system_time
>>>>>>>> 296.25 +4.5% 309.70 stress-ng.time.user_time
>>>>>>>> 0.81 ą187% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>>>> 9.36 ą165% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>>>> 0.81 ą187% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>>>> 9.36 ą165% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.zap_pte_range.zap_pmd_range.isra.0
>>>>>>>> 5.50 ą 17% +390.9% 27.00 ą 56% perf-c2c.DRAM.local
>>>>>>>> 388.50 ą 10% +114.7% 834.17 ą 33% perf-c2c.DRAM.remote
>>>>>>>> 1214 ą 13% +107.3% 2517 ą 31% perf-c2c.HITM.local
>>>>>>>> 135.00 ą 19% +130.9% 311.67 ą 32% perf-c2c.HITM.remote
>>>>>>>> 1349 ą 13% +109.6% 2829 ą 31% perf-c2c.HITM.total
>>>>>>>
>>>>>>> Yeah this also looks pretty consistent too...
>>>>>>
>>>>>> FWIW, HITM hat different meanings depending on exactly which
>>>>>> microarchitecture that test happened on; the message says it is from
>>>>>> Sapphire Rapids, which is a successor of Ice Lake, so HITM is less
>>>>>> meaningful than if it came from a pre-IceLake system (see
>>>>>> https://lore.kernel.org/all/CAG48ez3RmV6SsVw9oyTXxQXHp3rqtKDk2qwJWo9TGvXCq7Xr-w@mail.gmail.com/).
>>>>>>
>>>>>> To me those numbers mainly look like you're accessing a lot more
>>>>>> cache-cold data. (On pre-IceLake they would indicate cacheline
>>>>>> bouncing, but I guess here they probably don't.) And that makes sense,
>>>>>> since before the patch, this path was just moving PTEs around without
>>>>>> looking at the associated pages/folios; basically more or less like a
>>>>>> memcpy() on x86-64. But after the patch, for every 8 bytes that you
>>>>>> copy, you have to load a cacheline from the vmemmap to get the page.
>>>>>
>>>>> Yup this is representative of what my investigation is showing.
>>>>>
>>>>> I've narrowed it down but want to wait to report until I'm sure...
>>>>>
>>>>> But yeah we're doing a _lot_ more work.
>>>>>
>>>>> I'm leaning towards disabling except for arm64 atm tbh, seems mremap is
>>>>> especially sensitive to this (I found issues with this with my abortive mremap
>>>>> anon merging stuff too, but really expected it there...)
>>>>
>>>> Another approach would be to always read and write PTEs in
>>>> contpte-sized chunks here, without caring whether they're actually
>>>> contiguous or whatever, or something along those lines.
>>>
>>> Not sure I love that, you'd have to figure out offset without cont pte batch and
>>> can it vary? And we're doing this on non-arm64 arches for what reason?
>>>
>>> And would it solve anything really? We'd still be looking at folio, yes less
>>> than now, but uselessly for arches that don't benefit?
>>>
>>> The basis of this series was (and I did explicitly ask) that it wouldn't harm
>>> other arches.
>>
>> We'd need some hint to detect "this is either small" or "this is
>> unbatchable".
>>
>> Sure, we could use pte_batch_hint(), but I'm curious if x86 would also
>> benefit with larger folios (e.g., 64K, 128K) with this patch.
>
> For the record I did think of using this prior to being mentioned, product of
> actually trying to get the data to back this up instead of talking...
>
> Anyway, isn't that chicken and egg? We'd have to go get the folio to find out if
> large folio and incur the cost before we knew?
>
> So how could we make that workable?
E.g., a best-effort check if the next pte likely points at the next PFN.
But as Jann mentioned, there might actually be no benefit on other
architectures (benchmarking would probably tell us the real story).
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 18:07 ` Jann Horn
@ 2025-08-07 18:31 ` David Hildenbrand
2025-08-07 19:52 ` Lorenzo Stoakes
0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2025-08-07 18:31 UTC (permalink / raw)
To: Jann Horn
Cc: Lorenzo Stoakes, kernel test robot, Dev Jain, oe-lkp, lkp,
linux-kernel, Andrew Morton, Barry Song, Pedro Falcato,
Anshuman Khandual, Bang Li, Baolin Wang, bibo mao, Hugh Dickins,
Ingo Molnar, Lance Yang, Liam Howlett, Matthew Wilcox, Peter Xu,
Qi Zheng, Ryan Roberts, Vlastimil Babka, Yang Shi, Zi Yan,
linux-mm
On 07.08.25 20:07, Jann Horn wrote:
> On Thu, Aug 7, 2025 at 8:02 PM David Hildenbrand <david@redhat.com> wrote:
>> Sure, we could use pte_batch_hint(), but I'm curious if x86 would also
>> benefit with larger folios (e.g., 64K, 128K) with this patch.
>
> Where would you expect such a benefit to come from? This function is
> more or less a memcpy(), except it has to read PTEs with xchg(), write
> them atomically, and set softdirty flags. For x86, what the associated
> folios look like and whether the PTEs are contiguous shouldn't matter.
>
Good point, I was assuming TLB flushing as well, but that doesn't really
apply here because we are already batching that.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression
2025-08-07 18:31 ` David Hildenbrand
@ 2025-08-07 19:52 ` Lorenzo Stoakes
0 siblings, 0 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-08-07 19:52 UTC (permalink / raw)
To: David Hildenbrand
Cc: Jann Horn, kernel test robot, Dev Jain, oe-lkp, lkp, linux-kernel,
Andrew Morton, Barry Song, Pedro Falcato, Anshuman Khandual,
Bang Li, Baolin Wang, bibo mao, Hugh Dickins, Ingo Molnar,
Lance Yang, Liam Howlett, Matthew Wilcox, Peter Xu, Qi Zheng,
Ryan Roberts, Vlastimil Babka, Yang Shi, Zi Yan, linux-mm
On Thu, Aug 07, 2025 at 08:31:18PM +0200, David Hildenbrand wrote:
> On 07.08.25 20:07, Jann Horn wrote:
> > On Thu, Aug 7, 2025 at 8:02 PM David Hildenbrand <david@redhat.com> wrote:
> > > Sure, we could use pte_batch_hint(), but I'm curious if x86 would also
> > > benefit with larger folios (e.g., 64K, 128K) with this patch.
> >
> > Where would you expect such a benefit to come from? This function is
> > more or less a memcpy(), except it has to read PTEs with xchg(), write
> > them atomically, and set softdirty flags. For x86, what the associated
> > folios look like and whether the PTEs are contiguous shouldn't matter.
> >
>
> Good point, I was assuming TLB flushing as well, but that doesn't really
> apply here because we are already batching that.
Ah good point, but indeed, while we force a TLB flush if we discover a
present pte, we do so only _after_ we have finished processing entries in
the PTE table, and we would only batch up to, at most, the end of the PTE
table, so we have zero possible delta here on that.
I did wonder if _somehow_ we'd get some benefit by grouping operations
(yes, this was a handwavey thought).
But Jann's point puts that to bed...
I really feel like this is a super arch-specfic feature that maybe we need
to go around and make arm64-only or predicated on something like the
contpte hint check to be effectively equivalent to.
Because my whole basis for accepting this on other arches is there'd be
little to no impact and now we have seen a huge impact and it's worrying.
>
> --
> Cheers,
>
> David / dhildenb
>
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2025-08-07 19:53 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-07 8:17 [linus:master] [mm] f822a9a81a: stress-ng.bigheap.realloc_calls_per_sec 37.3% regression kernel test robot
2025-08-07 8:27 ` Lorenzo Stoakes
2025-08-07 8:56 ` Dev Jain
2025-08-07 10:21 ` David Hildenbrand
2025-08-07 16:06 ` Dev Jain
2025-08-07 16:10 ` Lorenzo Stoakes
2025-08-07 16:16 ` Lorenzo Stoakes
2025-08-07 17:04 ` Dev Jain
2025-08-07 17:07 ` Lorenzo Stoakes
2025-08-07 17:11 ` Dev Jain
2025-08-07 17:37 ` Jann Horn
2025-08-07 17:41 ` Lorenzo Stoakes
2025-08-07 17:46 ` Jann Horn
2025-08-07 17:50 ` Dev Jain
2025-08-07 17:53 ` Lorenzo Stoakes
2025-08-07 17:51 ` Lorenzo Stoakes
2025-08-07 18:01 ` David Hildenbrand
2025-08-07 18:04 ` Lorenzo Stoakes
2025-08-07 18:13 ` David Hildenbrand
2025-08-07 18:07 ` Jann Horn
2025-08-07 18:31 ` David Hildenbrand
2025-08-07 19:52 ` Lorenzo Stoakes
2025-08-07 17:59 ` David Hildenbrand
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).