* [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
@ 2024-07-09 5:11 kernel test robot
2024-07-10 6:22 ` Yu Zhao
2024-07-17 7:52 ` Janosch Frank
0 siblings, 2 replies; 14+ messages in thread
From: kernel test robot @ 2024-07-09 5:11 UTC (permalink / raw)
To: Yu Zhao
Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
Muchun Song, David Hildenbrand, Frank van der Linden,
Matthew Wilcox, Peter Xu, Yang Shi, linux-kernel, ying.huang,
feng.tang, fengwei.yin, oliver.sang
Hello,
kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
[still regression on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
testcase: vm-scalability
test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
parameters:
runtime: 300s
size: 512G
test: anon-cow-rand-hugetlb
cpufreq_governor: performance
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202407091001.1250ad4a-oliver.sang@intel.com
Details are as follows:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240709/202407091001.1250ad4a-oliver.sang@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-13/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/300s/512G/lkp-icl-2sp2/anon-cow-rand-hugetlb/vm-scalability
commit:
73236245e0 ("cachestat: do not flush stats in recency check")
875fa64577 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
73236245e0b47ea3 875fa64577da9bc8e9963ee14fe
---------------- ---------------------------
%stddev %change %stddev
4.447e+09 ± 13% +342.0% 1.966e+10 ± 7% cpuidle..time
753730 ± 2% +1105.0% 9082792 ± 6% cpuidle..usage
582089 ± 6% +20.7% 702337 ± 8% numa-numastat.node0.local_node
658053 ± 5% +18.3% 778277 ± 3% numa-numastat.node0.numa_hit
255.89 ± 2% +32.4% 338.76 ± 3% uptime.boot
10264 ± 7% +147.3% 25384 ± 5% uptime.idle
2.00 ± 76% +72475.0% 1451 ±125% perf-c2c.DRAM.local
38.50 ± 13% +20427.7% 7903 ± 92% perf-c2c.DRAM.remote
23.33 ± 17% +1567.9% 389.17 ± 61% perf-c2c.HITM.local
11.33 ± 35% +1902.9% 227.00 ± 84% perf-c2c.HITM.remote
17.44 ± 10% +208.4% 53.80 ± 2% vmstat.cpu.id
71.79 ± 2% -37.5% 44.87 ± 3% vmstat.cpu.us
107.92 ± 2% -43.9% 60.57 ± 3% vmstat.procs.r
2724 ± 4% +151.1% 6842 ± 3% vmstat.system.cs
134997 ± 4% -23.8% 102899 ± 7% vmstat.system.in
16.63 ± 11% +36.8 53.43 ± 2% mpstat.cpu.all.idle%
0.34 ± 3% -0.1 0.21 ± 2% mpstat.cpu.all.irq%
0.03 ± 6% +0.0 0.04 ± 9% mpstat.cpu.all.soft%
10.61 ± 3% -9.5 1.15 ± 38% mpstat.cpu.all.sys%
72.40 ± 2% -27.2 45.16 ± 3% mpstat.cpu.all.usr%
6.83 ± 34% +195.1% 20.17 ± 3% mpstat.max_utilization.seconds
102366 ± 29% +115.2% 220258 ± 7% meminfo.AnonHugePages
70562810 +14.4% 80723780 meminfo.CommitLimit
43743 ± 2% -22.7% 33821 ± 2% meminfo.HugePages_Surp
43743 ± 2% -22.7% 33821 ± 2% meminfo.HugePages_Total
89587444 ± 2% -22.7% 69265470 ± 2% meminfo.Hugetlb
66355 ± 18% -40.8% 39283 ± 15% meminfo.Mapped
1.341e+08 +15.3% 1.545e+08 meminfo.MemAvailable
1.351e+08 +15.1% 1.555e+08 meminfo.MemFree
95643557 ± 2% -21.4% 75182673 ± 2% meminfo.Memused
49588 ± 2% -31.7% 33871 ± 3% vm-scalability.median
8.02 ± 9% -2.6 5.40 ± 12% vm-scalability.median_stddev%
6842353 -34.3% 4498326 ± 2% vm-scalability.throughput
205.20 ± 2% +40.7% 288.76 ± 4% vm-scalability.time.elapsed_time
205.20 ± 2% +40.7% 288.76 ± 4% vm-scalability.time.elapsed_time.max
149773 ± 2% -47.3% 78866 ± 3% vm-scalability.time.involuntary_context_switches
10634 ± 2% -43.5% 6008 ± 3% vm-scalability.time.percent_of_cpu_this_job_got
2772 ± 5% -84.9% 419.47 ± 42% vm-scalability.time.system_time
19039 -11.2% 16908 vm-scalability.time.user_time
14514 ± 2% +4380.2% 650265 vm-scalability.time.voluntary_context_switches
617106 ± 42% -67.8% 198580 ±130% numa-vmstat.node0.nr_file_pages
8937 ± 42% -65.6% 3075 ±110% numa-vmstat.node0.nr_mapped
18779 ± 30% -35.4% 12124 ± 40% numa-vmstat.node0.nr_slab_reclaimable
603050 ± 43% -70.7% 176415 ±148% numa-vmstat.node0.nr_unevictable
603050 ± 43% -70.7% 176415 ±148% numa-vmstat.node0.nr_zone_unevictable
657413 ± 5% +18.2% 776975 ± 3% numa-vmstat.node0.numa_hit
581443 ± 6% +20.6% 701035 ± 8% numa-vmstat.node0.numa_local
214166 ±122% +192.8% 627105 ± 40% numa-vmstat.node1.nr_file_pages
11263349 ± 5% +35.8% 15297395 ± 7% numa-vmstat.node1.nr_free_pages
9478 ± 59% +72.7% 16368 ± 29% numa-vmstat.node1.nr_slab_reclaimable
163852 ±161% +260.4% 590489 ± 44% numa-vmstat.node1.nr_unevictable
163852 ±161% +260.4% 590489 ± 44% numa-vmstat.node1.nr_zone_unevictable
49.90 ± 29% +115.4% 107.47 ± 7% proc-vmstat.nr_anon_transparent_hugepages
3345235 +15.3% 3857626 proc-vmstat.nr_dirty_background_threshold
6698650 +15.3% 7724685 proc-vmstat.nr_dirty_threshold
33770919 +15.2% 38900683 proc-vmstat.nr_free_pages
196929 -2.3% 192368 proc-vmstat.nr_inactive_anon
16843 ± 18% -40.4% 10031 ± 14% proc-vmstat.nr_mapped
2693 -7.6% 2487 proc-vmstat.nr_page_table_pages
196929 -2.3% 192368 proc-vmstat.nr_zone_inactive_anon
1404664 +9.0% 1530693 ± 3% proc-vmstat.numa_hit
1271130 +9.3% 1389279 ± 3% proc-vmstat.numa_local
69467 ± 7% -34.8% 45284 ± 22% proc-vmstat.pgactivate
1263160 +12.8% 1425219 ± 3% proc-vmstat.pgfault
37012 +16.6% 43157 ± 4% proc-vmstat.pgreuse
2468473 ± 42% -67.8% 794312 ±130% numa-meminfo.node0.FilePages
75120 ± 30% -35.4% 48496 ± 40% numa-meminfo.node0.KReclaimable
35040 ± 42% -66.3% 11825 ±112% numa-meminfo.node0.Mapped
75120 ± 30% -35.4% 48496 ± 40% numa-meminfo.node0.SReclaimable
227978 ± 10% -17.0% 189123 ± 10% numa-meminfo.node0.Slab
2412201 ± 43% -70.7% 705661 ±148% numa-meminfo.node0.Unevictable
856474 ±123% +192.9% 2508266 ± 40% numa-meminfo.node1.FilePages
25221 ± 4% -34.1% 16618 ± 13% numa-meminfo.node1.HugePages_Surp
25221 ± 4% -34.1% 16618 ± 13% numa-meminfo.node1.HugePages_Total
37917 ± 59% +72.7% 65467 ± 30% numa-meminfo.node1.KReclaimable
45044169 ± 5% +35.8% 61184692 ± 7% numa-meminfo.node1.MemFree
53983914 ± 4% -29.9% 37843391 ± 12% numa-meminfo.node1.MemUsed
37917 ± 59% +72.7% 65467 ± 30% numa-meminfo.node1.SReclaimable
153538 ± 16% +26.2% 193736 ± 10% numa-meminfo.node1.Slab
655409 ±161% +260.4% 2361959 ± 44% numa-meminfo.node1.Unevictable
1482 ± 9% -17.5% 1223 ± 9% sched_debug.cfs_rq:/.runnable_avg.max
661.67 ± 14% -93.0% 46.09 ± 15% sched_debug.cfs_rq:/.util_est.avg
1286 ± 12% -61.4% 496.42 ± 37% sched_debug.cfs_rq:/.util_est.max
123.89 ± 48% -57.2% 53.08 ± 29% sched_debug.cfs_rq:/.util_est.stddev
125242 ± 11% +35.5% 169710 ± 10% sched_debug.cpu.clock.avg
125264 ± 11% +35.5% 169723 ± 10% sched_debug.cpu.clock.max
125213 ± 11% +35.5% 169693 ± 10% sched_debug.cpu.clock.min
124816 ± 11% +35.6% 169267 ± 10% sched_debug.cpu.clock_task.avg
125011 ± 11% +35.6% 169465 ± 10% sched_debug.cpu.clock_task.max
115620 ± 12% +37.8% 159344 ± 10% sched_debug.cpu.clock_task.min
2909 ± 14% +172.9% 7941 ± 10% sched_debug.cpu.nr_switches.avg
715.68 ± 18% +470.7% 4084 ± 23% sched_debug.cpu.nr_switches.min
125215 ± 11% +35.5% 169695 ± 10% sched_debug.cpu_clk
123982 ± 11% +35.9% 168463 ± 10% sched_debug.ktime
126127 ± 11% +35.3% 170626 ± 10% sched_debug.sched_clk
15.81 ± 2% +357.5% 72.34 ± 5% perf-stat.i.MPKI
1.46e+10 ± 2% -32.9% 9.801e+09 ± 4% perf-stat.i.branch-instructions
0.10 ± 3% +0.5 0.65 ± 5% perf-stat.i.branch-miss-rate%
10768595 -27.8% 7778807 ± 3% perf-stat.i.branch-misses
96.93 -19.0 77.95 perf-stat.i.cache-miss-rate%
8.054e+08 ± 2% -33.0% 5.398e+08 ± 4% perf-stat.i.cache-misses
8.26e+08 ± 2% -29.1% 5.855e+08 ± 4% perf-stat.i.cache-references
2668 ± 4% +159.7% 6928 ± 3% perf-stat.i.context-switches
5.07 +42.6% 7.24 ± 12% perf-stat.i.cpi
2.809e+11 ± 2% -44.1% 1.571e+11 ± 3% perf-stat.i.cpu-cycles
213.40 ± 2% +41.5% 301.92 ± 5% perf-stat.i.cpu-migrations
360.56 -9.5% 326.39 ± 5% perf-stat.i.cycles-between-cache-misses
6.256e+10 ± 2% -32.6% 4.218e+10 ± 4% perf-stat.i.instructions
0.24 +39.6% 0.33 ± 3% perf-stat.i.ipc
5779 ± 2% -18.1% 4735 ± 2% perf-stat.i.minor-faults
5780 ± 2% -18.1% 4737 ± 2% perf-stat.i.page-faults
12.99 -1.8% 12.75 perf-stat.overall.MPKI
97.43 -5.1 92.33 perf-stat.overall.cache-miss-rate%
4.52 -17.5% 3.72 perf-stat.overall.cpi
347.63 -16.0% 291.93 perf-stat.overall.cycles-between-cache-misses
0.22 +21.3% 0.27 perf-stat.overall.ipc
10915 -3.4% 10545 perf-stat.overall.path-length
1.433e+10 ± 2% -31.5% 9.821e+09 ± 4% perf-stat.ps.branch-instructions
10358936 ± 2% -25.2% 7745475 ± 4% perf-stat.ps.branch-misses
7.973e+08 ± 2% -32.4% 5.389e+08 ± 4% perf-stat.ps.cache-misses
8.183e+08 ± 2% -28.7% 5.838e+08 ± 4% perf-stat.ps.cache-references
2648 ± 4% +157.5% 6819 ± 3% perf-stat.ps.context-switches
2.771e+11 ± 2% -43.3% 1.572e+11 ± 3% perf-stat.ps.cpu-cycles
211.28 ± 2% +41.6% 299.23 ± 5% perf-stat.ps.cpu-migrations
6.139e+10 ± 2% -31.2% 4.226e+10 ± 4% perf-stat.ps.instructions
5815 ± 2% -19.4% 4686 ± 2% perf-stat.ps.minor-faults
5816 ± 2% -19.4% 4687 ± 2% perf-stat.ps.page-faults
1.265e+13 -3.4% 1.222e+13 perf-stat.total.instructions
60.25 ± 15% -13.0 47.20 ± 55% perf-profile.calltrace.cycles-pp.do_rw_once
47.47 ± 14% -8.3 39.13 ± 57% perf-profile.calltrace.cycles-pp.lrand48_r@plt
2.17 ±130% -1.5 0.65 ±159% perf-profile.calltrace.cycles-pp.__mutex_lock.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
1.57 ±127% -1.0 0.59 ±160% perf-profile.calltrace.cycles-pp.mutex_spin_on_owner.__mutex_lock.hugetlb_fault.handle_mm_fault.do_user_addr_fault
0.56 ±146% -0.6 0.00 perf-profile.calltrace.cycles-pp.osq_lock.__mutex_lock.hugetlb_fault.handle_mm_fault.do_user_addr_fault
1.30 ± 48% -0.4 0.88 ± 71% perf-profile.calltrace.cycles-pp.lrand48_r
2.11 ± 14% -0.1 1.99 ± 49% perf-profile.calltrace.cycles-pp.nrand48_r
0.00 +0.1 0.08 ±223% perf-profile.calltrace.cycles-pp.ast_mode_config_helper_atomic_commit_tail.commit_tail.drm_atomic_helper_commit.drm_atomic_commit.drm_atomic_helper_dirtyfb
0.00 +0.1 0.08 ±223% perf-profile.calltrace.cycles-pp.ast_primary_plane_helper_atomic_update.drm_atomic_helper_commit_planes.drm_atomic_helper_commit_tail_rpm.ast_mode_config_helper_atomic_commit_tail.commit_tail
0.00 +0.1 0.08 ±223% perf-profile.calltrace.cycles-pp.commit_tail.drm_atomic_helper_commit.drm_atomic_commit.drm_atomic_helper_dirtyfb.drm_fbdev_generic_helper_fb_dirty
0.00 +0.1 0.08 ±223% perf-profile.calltrace.cycles-pp.drm_atomic_helper_commit_planes.drm_atomic_helper_commit_tail_rpm.ast_mode_config_helper_atomic_commit_tail.commit_tail.drm_atomic_helper_commit
0.00 +0.1 0.08 ±223% perf-profile.calltrace.cycles-pp.drm_atomic_helper_commit_tail_rpm.ast_mode_config_helper_atomic_commit_tail.commit_tail.drm_atomic_helper_commit.drm_atomic_commit
0.00 +0.1 0.08 ±223% perf-profile.calltrace.cycles-pp.drm_fb_memcpy.ast_primary_plane_helper_atomic_update.drm_atomic_helper_commit_planes.drm_atomic_helper_commit_tail_rpm.ast_mode_config_helper_atomic_commit_tail
0.00 +0.1 0.08 ±223% perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.alloc_hugetlb_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault
0.00 +0.1 0.08 ±223% perf-profile.calltrace.cycles-pp.drm_atomic_commit.drm_atomic_helper_dirtyfb.drm_fbdev_generic_helper_fb_dirty.drm_fb_helper_damage_work.process_one_work
0.00 +0.1 0.08 ±223% perf-profile.calltrace.cycles-pp.drm_atomic_helper_commit.drm_atomic_commit.drm_atomic_helper_dirtyfb.drm_fbdev_generic_helper_fb_dirty.drm_fb_helper_damage_work
0.00 +0.1 0.08 ±223% perf-profile.calltrace.cycles-pp.drm_atomic_helper_dirtyfb.drm_fbdev_generic_helper_fb_dirty.drm_fb_helper_damage_work.process_one_work.worker_thread
0.00 +0.1 0.08 ±223% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.alloc_hugetlb_folio.hugetlb_wp.hugetlb_fault
0.00 +0.1 0.09 ±223% perf-profile.calltrace.cycles-pp.drm_fb_helper_damage_work.process_one_work.worker_thread.kthread.ret_from_fork
0.00 +0.1 0.09 ±223% perf-profile.calltrace.cycles-pp.drm_fbdev_generic_helper_fb_dirty.drm_fb_helper_damage_work.process_one_work.worker_thread.kthread
0.00 +0.1 0.09 ±223% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.copy_mc_enhanced_fast_string
0.00 +0.1 0.09 ±223% perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.copy_mc_enhanced_fast_string.copy_subpage
0.00 +0.1 0.09 ±223% perf-profile.calltrace.cycles-pp.process_one_work.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
0.00 +0.1 0.09 ±223% perf-profile.calltrace.cycles-pp.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
0.00 +0.1 0.10 ±223% perf-profile.calltrace.cycles-pp.clockevents_program_event.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
0.00 +0.1 0.10 ±223% perf-profile.calltrace.cycles-pp.kthread.ret_from_fork.ret_from_fork_asm
0.00 +0.1 0.10 ±223% perf-profile.calltrace.cycles-pp.ret_from_fork.ret_from_fork_asm
0.00 +0.1 0.10 ±223% perf-profile.calltrace.cycles-pp.ret_from_fork_asm
0.00 +0.1 0.10 ±223% perf-profile.calltrace.cycles-pp.prep_new_hugetlb_folio.alloc_surplus_hugetlb_folio.alloc_hugetlb_folio.hugetlb_wp.hugetlb_fault
0.00 +0.1 0.10 ±223% perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.copy_mc_enhanced_fast_string.copy_subpage.copy_user_large_folio
0.00 +0.1 0.11 ±223% perf-profile.calltrace.cycles-pp.update_process_times.tick_nohz_handler.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt
0.00 +0.2 0.17 ±223% perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.copy_mc_enhanced_fast_string.copy_subpage.copy_user_large_folio.hugetlb_wp
0.00 +0.2 0.23 ±145% perf-profile.calltrace.cycles-pp.tick_nohz_handler.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt
0.00 +0.2 0.24 ±144% perf-profile.calltrace.cycles-pp.prep_compound_page.get_page_from_freelist.__alloc_pages_noprof.__folio_alloc_noprof.alloc_buddy_hugetlb_folio
0.00 +0.2 0.25 ±144% perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
0.00 +0.3 0.25 ±142% perf-profile.calltrace.cycles-pp.irq_exit_rcu.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.acpi_safe_halt.acpi_idle_enter
0.00 +0.3 0.26 ±144% perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages_noprof.__folio_alloc_noprof.alloc_buddy_hugetlb_folio.__alloc_fresh_hugetlb_folio
0.00 +0.3 0.26 ±144% perf-profile.calltrace.cycles-pp.__alloc_pages_noprof.__folio_alloc_noprof.alloc_buddy_hugetlb_folio.__alloc_fresh_hugetlb_folio.alloc_surplus_hugetlb_folio
0.00 +0.3 0.28 ±144% perf-profile.calltrace.cycles-pp.__folio_alloc_noprof.alloc_buddy_hugetlb_folio.__alloc_fresh_hugetlb_folio.alloc_surplus_hugetlb_folio.alloc_hugetlb_folio
0.00 +0.3 0.28 ±144% perf-profile.calltrace.cycles-pp.alloc_buddy_hugetlb_folio.__alloc_fresh_hugetlb_folio.alloc_surplus_hugetlb_folio.alloc_hugetlb_folio.hugetlb_wp
0.00 +0.3 0.28 ±144% perf-profile.calltrace.cycles-pp.__alloc_fresh_hugetlb_folio.alloc_surplus_hugetlb_folio.alloc_hugetlb_folio.hugetlb_wp.hugetlb_fault
0.00 +0.5 0.46 ±144% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.acpi_safe_halt
0.00 +0.5 0.48 ±144% perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.acpi_safe_halt.acpi_idle_enter
0.00 +0.6 0.57 ±144% perf-profile.calltrace.cycles-pp.alloc_surplus_hugetlb_folio.alloc_hugetlb_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault
0.00 +0.7 0.72 ±144% perf-profile.calltrace.cycles-pp.alloc_hugetlb_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault
0.00 +0.9 0.90 ±143% perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.acpi_safe_halt.acpi_idle_enter.cpuidle_enter_state
0.00 +1.1 1.14 ±142% perf-profile.calltrace.cycles-pp.acpi_safe_halt.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
0.00 +2.1 2.06 ±143% perf-profile.calltrace.cycles-pp.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
0.00 +2.1 2.09 ±143% perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
0.00 +2.1 2.10 ±143% perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
0.00 +2.2 2.23 ±143% perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.common_startup_64
0.00 +2.3 2.30 ±143% perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.common_startup_64
0.00 +2.3 2.30 ±143% perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.common_startup_64
0.00 +2.3 2.30 ±143% perf-profile.calltrace.cycles-pp.start_secondary.common_startup_64
0.00 +2.4 2.35 ±142% perf-profile.calltrace.cycles-pp.common_startup_64
0.00 +2.7 2.71 ±143% perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.acpi_safe_halt.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter
12.15 ±106% +14.7 26.84 ±129% perf-profile.calltrace.cycles-pp.do_access
10.20 ±128% +14.7 24.91 ±142% perf-profile.calltrace.cycles-pp.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
10.20 ±128% +14.7 24.93 ±141% perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
10.22 ±128% +14.7 24.95 ±142% perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.do_access
10.21 ±128% +14.7 24.95 ±142% perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
10.23 ±128% +14.8 25.05 ±141% perf-profile.calltrace.cycles-pp.asm_exc_page_fault.do_access
7.77 ±127% +15.5 23.22 ±141% perf-profile.calltrace.cycles-pp.copy_mc_enhanced_fast_string.copy_subpage.copy_user_large_folio.hugetlb_wp.hugetlb_fault
7.83 ±127% +15.5 23.34 ±141% perf-profile.calltrace.cycles-pp.copy_subpage.copy_user_large_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault
7.84 ±127% +15.6 23.40 ±141% perf-profile.calltrace.cycles-pp.copy_user_large_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault
8.00 ±127% +16.2 24.20 ±141% perf-profile.calltrace.cycles-pp.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
84.24 ± 14% -17.2 67.02 ± 55% perf-profile.children.cycles-pp.do_rw_once
24.51 ± 14% -4.4 20.11 ± 57% perf-profile.children.cycles-pp.lrand48_r@plt
2.17 ±130% -1.5 0.65 ±159% perf-profile.children.cycles-pp.__mutex_lock
1.57 ±127% -1.0 0.59 ±160% perf-profile.children.cycles-pp.mutex_spin_on_owner
0.59 ±136% -0.6 0.04 ±152% perf-profile.children.cycles-pp.osq_lock
1.01 ± 19% -0.2 0.77 ± 47% perf-profile.children.cycles-pp.lrand48_r
2.81 ± 14% -0.1 2.71 ± 45% perf-profile.children.cycles-pp.nrand48_r
0.10 ± 57% -0.1 0.03 ±100% perf-profile.children.cycles-pp.main
0.10 ± 57% -0.1 0.03 ±100% perf-profile.children.cycles-pp.run_builtin
0.08 ± 92% -0.1 0.02 ±141% perf-profile.children.cycles-pp.__cmd_record
0.08 ± 92% -0.1 0.02 ±141% perf-profile.children.cycles-pp.cmd_record
0.06 ±112% -0.1 0.00 perf-profile.children.cycles-pp.perf_mmap__push
0.06 ±112% -0.1 0.00 perf-profile.children.cycles-pp.record__mmap_read_evlist
0.05 ±113% -0.1 0.00 perf-profile.children.cycles-pp.record__pushfn
0.05 ±111% -0.0 0.00 perf-profile.children.cycles-pp.writen
0.05 ±110% -0.0 0.00 perf-profile.children.cycles-pp.shmem_file_write_iter
0.04 ±110% -0.0 0.00 perf-profile.children.cycles-pp.generic_perform_write
0.01 ±223% -0.0 0.00 perf-profile.children.cycles-pp.copy_page_from_iter_atomic
0.00 +0.0 0.01 ±223% perf-profile.children.cycles-pp.__rmqueue_pcplist
0.00 +0.0 0.01 ±223% perf-profile.children.cycles-pp.io_serial_out
0.00 +0.0 0.01 ±223% perf-profile.children.cycles-pp.prepare_task_switch
0.00 +0.0 0.01 ±223% perf-profile.children.cycles-pp.seq_read_iter
0.00 +0.0 0.01 ±223% perf-profile.children.cycles-pp.update_curr
0.00 +0.0 0.01 ±223% perf-profile.children.cycles-pp.cpuidle_governor_latency_req
0.00 +0.0 0.01 ±223% perf-profile.children.cycles-pp.irqentry_enter
0.00 +0.0 0.01 ±223% perf-profile.children.cycles-pp.delay_tsc
0.00 +0.0 0.01 ±223% perf-profile.children.cycles-pp.nohz_balancer_kick
0.00 +0.0 0.01 ±223% perf-profile.children.cycles-pp.copy_mc_to_kernel
0.00 +0.0 0.01 ±223% perf-profile.children.cycles-pp.ksys_read
0.00 +0.0 0.01 ±223% perf-profile.children.cycles-pp.vfs_read
0.06 ± 17% +0.0 0.08 ± 36% perf-profile.children.cycles-pp.task_tick_fair
0.00 +0.0 0.01 ±223% perf-profile.children.cycles-pp.__sysvec_irq_work
0.00 +0.0 0.01 ±223% perf-profile.children.cycles-pp._printk
0.00 +0.0 0.01 ±223% perf-profile.children.cycles-pp.asm_sysvec_irq_work
0.00 +0.0 0.01 ±223% perf-profile.children.cycles-pp.irq_work_run
0.00 +0.0 0.01 ±223% perf-profile.children.cycles-pp.irq_work_single
0.00 +0.0 0.01 ±223% perf-profile.children.cycles-pp.sysvec_irq_work
0.00 +0.0 0.02 ±223% perf-profile.children.cycles-pp.irq_work_run_list
0.00 +0.0 0.02 ±142% perf-profile.children.cycles-pp.__handle_mm_fault
0.00 +0.0 0.02 ±142% perf-profile.children.cycles-pp.__update_load_avg_cfs_rq
0.00 +0.0 0.02 ±142% perf-profile.children.cycles-pp.dequeue_entity
0.00 +0.0 0.02 ±142% perf-profile.children.cycles-pp.native_apic_msr_eoi
0.00 +0.0 0.02 ±142% perf-profile.children.cycles-pp.sched_ttwu_pending
0.00 +0.0 0.02 ±142% perf-profile.children.cycles-pp.schedule_preempt_disabled
0.00 +0.0 0.02 ±144% perf-profile.children.cycles-pp.rcu_all_qs
0.00 +0.0 0.02 ±144% perf-profile.children.cycles-pp.rcu_pending
0.00 +0.0 0.02 ±144% perf-profile.children.cycles-pp.read
0.00 +0.0 0.02 ±141% perf-profile.children.cycles-pp.update_irq_load_avg
0.00 +0.0 0.02 ±141% perf-profile.children.cycles-pp.update_rq_clock
0.00 +0.0 0.02 ±142% perf-profile.children.cycles-pp.___perf_sw_event
0.00 +0.0 0.02 ±142% perf-profile.children.cycles-pp.__flush_smp_call_function_queue
0.00 +0.0 0.02 ±142% perf-profile.children.cycles-pp.__sysvec_call_function_single
0.00 +0.0 0.02 ±142% perf-profile.children.cycles-pp.dequeue_task_fair
0.00 +0.0 0.02 ±142% perf-profile.children.cycles-pp.enqueue_entity
0.00 +0.0 0.02 ±146% perf-profile.children.cycles-pp.rcu_report_qs_rdp
0.00 +0.0 0.02 ±142% perf-profile.children.cycles-pp.read_tsc
0.00 +0.0 0.02 ±142% perf-profile.children.cycles-pp.rmqueue
0.00 +0.0 0.02 ±141% perf-profile.children.cycles-pp.wait_for_xmitr
0.00 +0.0 0.02 ±141% perf-profile.children.cycles-pp.idle_cpu
0.00 +0.0 0.03 ±141% perf-profile.children.cycles-pp.enqueue_task_fair
0.00 +0.0 0.03 ±143% perf-profile.children.cycles-pp.lapic_next_deadline
0.00 +0.0 0.03 ±143% perf-profile.children.cycles-pp.note_gp_changes
0.00 +0.0 0.03 ±147% perf-profile.children.cycles-pp.rcu_sched_clock_irq
0.00 +0.0 0.03 ±144% perf-profile.children.cycles-pp.sched_clock
0.00 +0.0 0.03 ±141% perf-profile.children.cycles-pp.activate_task
0.00 +0.0 0.03 ±142% perf-profile.children.cycles-pp.native_sched_clock
0.00 +0.0 0.03 ±142% perf-profile.children.cycles-pp.irqtime_account_irq
0.00 +0.0 0.03 ±142% perf-profile.children.cycles-pp.ktime_get_update_offsets_now
0.00 +0.0 0.03 ±142% perf-profile.children.cycles-pp.sched_clock_cpu
0.00 +0.0 0.03 ±142% perf-profile.children.cycles-pp.update_rq_clock_task
0.00 +0.0 0.04 ±141% perf-profile.children.cycles-pp.complete
0.00 +0.0 0.04 ±141% perf-profile.children.cycles-pp.tick_nohz_next_event
0.00 +0.0 0.04 ±141% perf-profile.children.cycles-pp.tick_nohz_irq_exit
0.00 +0.0 0.04 ±142% perf-profile.children.cycles-pp.ttwu_do_activate
0.00 +0.0 0.04 ±144% perf-profile.children.cycles-pp.arch_scale_freq_tick
0.00 +0.0 0.04 ±142% perf-profile.children.cycles-pp.schedule_idle
0.06 ± 9% +0.0 0.10 ± 79% perf-profile.children.cycles-pp.tmigr_requires_handle_remote
0.00 +0.0 0.04 ±141% perf-profile.children.cycles-pp.rcu_do_batch
0.00 +0.0 0.04 ±141% perf-profile.children.cycles-pp.tick_nohz_get_sleep_length
0.04 ± 45% +0.0 0.09 ± 78% perf-profile.children.cycles-pp.get_jiffies_update
0.00 +0.0 0.04 ±141% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
0.00 +0.0 0.04 ±141% perf-profile.children.cycles-pp._raw_spin_lock
0.00 +0.0 0.04 ±152% perf-profile.children.cycles-pp.sched_balance_newidle
0.00 +0.0 0.05 ±141% perf-profile.children.cycles-pp.native_irq_return_iret
0.00 +0.0 0.05 ±141% perf-profile.children.cycles-pp.try_to_wake_up
0.04 ±101% +0.0 0.09 ± 78% perf-profile.children.cycles-pp.free_unref_page
0.00 +0.0 0.05 ±141% perf-profile.children.cycles-pp.clear_page_erms
0.00 +0.0 0.05 ±151% perf-profile.children.cycles-pp.pick_next_task_fair
0.06 ± 77% +0.0 0.10 ± 76% perf-profile.children.cycles-pp.exit_mm
0.00 +0.0 0.05 ±145% perf-profile.children.cycles-pp.__cond_resched
0.00 +0.0 0.05 ±141% perf-profile.children.cycles-pp.__get_user_pages
0.00 +0.0 0.05 ±141% perf-profile.children.cycles-pp.__mm_populate
0.00 +0.0 0.05 ±141% perf-profile.children.cycles-pp.clear_huge_page
0.00 +0.0 0.05 ±141% perf-profile.children.cycles-pp.hugetlb_no_page
0.00 +0.0 0.05 ±141% perf-profile.children.cycles-pp.populate_vma_page_range
0.00 +0.1 0.05 ±141% perf-profile.children.cycles-pp.__mmap
0.06 ± 77% +0.1 0.11 ± 76% perf-profile.children.cycles-pp.__mmput
0.06 ± 77% +0.1 0.11 ± 76% perf-profile.children.cycles-pp.__x64_sys_exit_group
0.06 ± 77% +0.1 0.11 ± 76% perf-profile.children.cycles-pp.do_group_exit
0.06 ± 77% +0.1 0.11 ± 76% perf-profile.children.cycles-pp.exit_mmap
0.00 +0.1 0.05 ±142% perf-profile.children.cycles-pp.rest_init
0.00 +0.1 0.05 ±142% perf-profile.children.cycles-pp.start_kernel
0.00 +0.1 0.05 ±142% perf-profile.children.cycles-pp.x86_64_start_kernel
0.00 +0.1 0.05 ±142% perf-profile.children.cycles-pp.x86_64_start_reservations
0.04 ±102% +0.1 0.10 ± 77% perf-profile.children.cycles-pp.tlb_finish_mmu
0.06 ± 77% +0.1 0.11 ± 75% perf-profile.children.cycles-pp.do_exit
0.00 +0.1 0.05 ±143% perf-profile.children.cycles-pp.update_load_avg
0.04 ±102% +0.1 0.10 ± 77% perf-profile.children.cycles-pp.__tlb_batch_free_encoded_pages
0.04 ±102% +0.1 0.10 ± 77% perf-profile.children.cycles-pp.folios_put_refs
0.04 ±102% +0.1 0.10 ± 77% perf-profile.children.cycles-pp.free_pages_and_swap_cache
0.01 ±223% +0.1 0.06 ±149% perf-profile.children.cycles-pp.task_work_run
0.01 ±223% +0.1 0.06 ±148% perf-profile.children.cycles-pp.task_mm_cid_work
0.00 +0.1 0.06 ±141% perf-profile.children.cycles-pp.ksys_mmap_pgoff
0.00 +0.1 0.06 ±141% perf-profile.children.cycles-pp.__update_blocked_fair
0.00 +0.1 0.06 ±146% perf-profile.children.cycles-pp.update_sg_lb_stats
0.02 ±144% +0.1 0.09 ±145% perf-profile.children.cycles-pp.irqentry_exit_to_user_mode
0.00 +0.1 0.07 ±118% perf-profile.children.cycles-pp.vm_mmap_pgoff
0.16 ± 35% +0.1 0.23 ± 79% perf-profile.children.cycles-pp.ksys_write
0.16 ± 34% +0.1 0.23 ± 79% perf-profile.children.cycles-pp.vfs_write
0.16 ± 35% +0.1 0.23 ± 80% perf-profile.children.cycles-pp.write
0.00 +0.1 0.08 ±145% perf-profile.children.cycles-pp.sched_balance_find_src_group
0.00 +0.1 0.08 ±145% perf-profile.children.cycles-pp.update_sd_lb_stats
0.00 +0.1 0.08 ±148% perf-profile.children.cycles-pp.schedule_timeout
0.00 +0.1 0.09 ±146% perf-profile.children.cycles-pp.__wait_for_common
0.00 +0.1 0.09 ±132% perf-profile.children.cycles-pp._nohz_idle_balance
0.00 +0.1 0.09 ±145% perf-profile.children.cycles-pp.tick_irq_enter
0.00 +0.1 0.09 ±145% perf-profile.children.cycles-pp.wait_for_completion_state
0.00 +0.1 0.09 ±146% perf-profile.children.cycles-pp.__wait_rcu_gp
0.00 +0.1 0.09 ±142% perf-profile.children.cycles-pp.sched_balance_update_blocked_averages
0.00 +0.1 0.09 ±141% perf-profile.children.cycles-pp.rcu_core
0.00 +0.1 0.10 ±145% perf-profile.children.cycles-pp.irq_enter_rcu
0.00 +0.1 0.10 ±143% perf-profile.children.cycles-pp.menu_select
0.00 +0.1 0.10 ±145% perf-profile.children.cycles-pp.hugetlb_vmemmap_optimize_folio
0.13 ± 15% +0.1 0.23 ± 68% perf-profile.children.cycles-pp.sched_tick
0.00 +0.1 0.10 ±141% perf-profile.children.cycles-pp.sched_balance_domains
0.00 +0.1 0.10 ±134% perf-profile.children.cycles-pp.sysvec_call_function_single
0.00 +0.1 0.11 ±146% perf-profile.children.cycles-pp.schedule
0.00 +0.1 0.11 ±143% perf-profile.children.cycles-pp.sched_balance_rq
0.05 ± 76% +0.1 0.18 ± 76% perf-profile.children.cycles-pp.io_serial_in
0.12 ±125% +0.1 0.24 ±144% perf-profile.children.cycles-pp.prep_compound_page
0.08 ± 57% +0.1 0.21 ± 83% perf-profile.children.cycles-pp.devkmsg_emit
0.08 ± 57% +0.1 0.21 ± 83% perf-profile.children.cycles-pp.devkmsg_write
0.07 ± 57% +0.1 0.20 ± 70% perf-profile.children.cycles-pp.wait_for_lsr
0.09 ± 37% +0.1 0.22 ± 72% perf-profile.children.cycles-pp.serial8250_console_write
0.00 +0.1 0.14 ±117% perf-profile.children.cycles-pp.asm_sysvec_call_function_single
0.13 ±124% +0.1 0.27 ±144% perf-profile.children.cycles-pp.__alloc_pages_noprof
0.13 ±124% +0.1 0.26 ±144% perf-profile.children.cycles-pp.get_page_from_freelist
0.10 ± 38% +0.1 0.24 ± 73% perf-profile.children.cycles-pp.console_flush_all
0.10 ± 38% +0.1 0.24 ± 73% perf-profile.children.cycles-pp.console_unlock
0.10 ± 38% +0.1 0.24 ± 72% perf-profile.children.cycles-pp.vprintk_emit
0.10 ± 38% +0.1 0.24 ± 73% perf-profile.children.cycles-pp.ast_mode_config_helper_atomic_commit_tail
0.10 ± 38% +0.1 0.24 ± 73% perf-profile.children.cycles-pp.ast_primary_plane_helper_atomic_update
0.10 ± 38% +0.1 0.24 ± 73% perf-profile.children.cycles-pp.commit_tail
0.10 ± 38% +0.1 0.24 ± 73% perf-profile.children.cycles-pp.drm_atomic_helper_commit_planes
0.10 ± 38% +0.1 0.24 ± 73% perf-profile.children.cycles-pp.drm_atomic_helper_commit_tail_rpm
0.10 ± 38% +0.1 0.24 ± 73% perf-profile.children.cycles-pp.drm_fb_memcpy
0.10 ± 38% +0.1 0.24 ± 73% perf-profile.children.cycles-pp.memcpy_toio
0.13 ±125% +0.1 0.28 ±144% perf-profile.children.cycles-pp.__alloc_fresh_hugetlb_folio
0.13 ±124% +0.1 0.28 ±144% perf-profile.children.cycles-pp.__folio_alloc_noprof
0.10 ± 38% +0.1 0.24 ± 74% perf-profile.children.cycles-pp.drm_atomic_commit
0.10 ± 38% +0.1 0.24 ± 74% perf-profile.children.cycles-pp.drm_atomic_helper_commit
0.10 ± 38% +0.1 0.24 ± 74% perf-profile.children.cycles-pp.drm_atomic_helper_dirtyfb
0.00 +0.1 0.15 ±144% perf-profile.children.cycles-pp.__schedule
0.13 ±124% +0.1 0.28 ±144% perf-profile.children.cycles-pp.alloc_buddy_hugetlb_folio
0.10 ± 36% +0.2 0.25 ± 73% perf-profile.children.cycles-pp.drm_fb_helper_damage_work
0.10 ± 36% +0.2 0.25 ± 73% perf-profile.children.cycles-pp.drm_fbdev_generic_helper_fb_dirty
0.11 ± 36% +0.2 0.26 ± 73% perf-profile.children.cycles-pp.process_one_work
0.11 ± 36% +0.2 0.27 ± 72% perf-profile.children.cycles-pp.worker_thread
0.09 ± 12% +0.2 0.25 ± 89% perf-profile.children.cycles-pp.clockevents_program_event
0.12 ± 35% +0.2 0.29 ± 70% perf-profile.children.cycles-pp.kthread
0.12 ± 35% +0.2 0.29 ± 70% perf-profile.children.cycles-pp.ret_from_fork
0.12 ± 35% +0.2 0.29 ± 70% perf-profile.children.cycles-pp.ret_from_fork_asm
0.00 +0.2 0.18 ±143% perf-profile.children.cycles-pp.prep_new_hugetlb_folio
0.00 +0.2 0.18 ±145% perf-profile.children.cycles-pp.synchronize_rcu_normal
0.24 ± 10% +0.2 0.45 ± 69% perf-profile.children.cycles-pp.update_process_times
0.29 ± 33% +0.2 0.53 ± 55% perf-profile.children.cycles-pp.do_syscall_64
0.29 ± 33% +0.2 0.53 ± 55% perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
0.08 ± 14% +0.3 0.34 ±106% perf-profile.children.cycles-pp.ktime_get
0.27 ± 9% +0.3 0.54 ± 71% perf-profile.children.cycles-pp.tick_nohz_handler
0.28 ± 9% +0.3 0.56 ± 72% perf-profile.children.cycles-pp.__hrtimer_run_queues
0.00 +0.3 0.34 ±144% perf-profile.children.cycles-pp._raw_spin_lock_irq
0.00 +0.3 0.34 ±144% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
0.03 ±100% +0.4 0.39 ±119% perf-profile.children.cycles-pp.irq_exit_rcu
0.13 ±125% +0.4 0.58 ±140% perf-profile.children.cycles-pp.alloc_surplus_hugetlb_folio
0.39 ± 7% +0.5 0.87 ± 79% perf-profile.children.cycles-pp.hrtimer_interrupt
0.40 ± 7% +0.5 0.90 ± 79% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
0.14 ±126% +0.6 0.73 ±141% perf-profile.children.cycles-pp.alloc_hugetlb_folio
0.43 ± 7% +0.9 1.38 ± 96% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
0.59 ± 15% +1.9 2.48 ±108% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
0.02 ±141% +2.2 2.20 ±134% perf-profile.children.cycles-pp.acpi_safe_halt
0.02 ±141% +2.2 2.20 ±134% perf-profile.children.cycles-pp.acpi_idle_enter
0.02 ±141% +2.2 2.24 ±134% perf-profile.children.cycles-pp.cpuidle_enter_state
0.02 ±141% +2.2 2.24 ±134% perf-profile.children.cycles-pp.cpuidle_enter
0.02 ±142% +2.4 2.38 ±134% perf-profile.children.cycles-pp.cpuidle_idle_call
0.02 ±141% +2.4 2.39 ±134% perf-profile.children.cycles-pp.start_secondary
0.03 ±102% +2.4 2.46 ±133% perf-profile.children.cycles-pp.do_idle
0.03 ±102% +2.4 2.46 ±133% perf-profile.children.cycles-pp.common_startup_64
0.03 ±102% +2.4 2.46 ±133% perf-profile.children.cycles-pp.cpu_startup_entry
12.44 ±103% +14.7 27.14 ±127% perf-profile.children.cycles-pp.do_access
10.22 ±128% +14.8 25.04 ±141% perf-profile.children.cycles-pp.do_user_addr_fault
10.22 ±128% +14.8 25.04 ±141% perf-profile.children.cycles-pp.exc_page_fault
10.20 ±128% +14.8 25.04 ±141% perf-profile.children.cycles-pp.hugetlb_fault
10.22 ±128% +14.9 25.08 ±141% perf-profile.children.cycles-pp.handle_mm_fault
10.24 ±128% +14.9 25.15 ±141% perf-profile.children.cycles-pp.asm_exc_page_fault
7.82 ±127% +15.6 23.38 ±141% perf-profile.children.cycles-pp.copy_mc_enhanced_fast_string
7.83 ±127% +15.6 23.42 ±141% perf-profile.children.cycles-pp.copy_subpage
7.84 ±127% +15.6 23.48 ±141% perf-profile.children.cycles-pp.copy_user_large_folio
8.00 ±127% +16.3 24.27 ±141% perf-profile.children.cycles-pp.hugetlb_wp
83.41 ± 14% -17.2 66.25 ± 55% perf-profile.self.cycles-pp.do_rw_once
1.56 ±127% -1.0 0.59 ±160% perf-profile.self.cycles-pp.mutex_spin_on_owner
0.59 ±136% -0.6 0.04 ±152% perf-profile.self.cycles-pp.osq_lock
0.72 ± 25% -0.2 0.49 ± 53% perf-profile.self.cycles-pp.lrand48_r@plt
1.92 ± 14% -0.2 1.72 ± 47% perf-profile.self.cycles-pp.do_access
2.53 ± 14% -0.1 2.44 ± 45% perf-profile.self.cycles-pp.nrand48_r
0.30 ± 15% -0.0 0.28 ± 50% perf-profile.self.cycles-pp.lrand48_r
0.01 ±223% -0.0 0.00 perf-profile.self.cycles-pp.copy_page_from_iter_atomic
0.00 +0.0 0.01 ±223% perf-profile.self.cycles-pp.io_serial_out
0.00 +0.0 0.01 ±223% perf-profile.self.cycles-pp.nohz_balancer_kick
0.00 +0.0 0.01 ±223% perf-profile.self.cycles-pp.asm_sysvec_apic_timer_interrupt
0.00 +0.0 0.01 ±223% perf-profile.self.cycles-pp.note_gp_changes
0.00 +0.0 0.01 ±223% perf-profile.self.cycles-pp.rcu_all_qs
0.00 +0.0 0.01 ±223% perf-profile.self.cycles-pp.rcu_pending
0.00 +0.0 0.01 ±223% perf-profile.self.cycles-pp.delay_tsc
0.00 +0.0 0.01 ±223% perf-profile.self.cycles-pp.__update_load_avg_cfs_rq
0.00 +0.0 0.01 ±223% perf-profile.self.cycles-pp.update_rq_clock_task
0.00 +0.0 0.02 ±141% perf-profile.self.cycles-pp.___perf_sw_event
0.00 +0.0 0.02 ±141% perf-profile.self.cycles-pp.irqtime_account_irq
0.00 +0.0 0.02 ±142% perf-profile.self.cycles-pp.__schedule
0.00 +0.0 0.02 ±142% perf-profile.self.cycles-pp._raw_spin_lock_irq
0.00 +0.0 0.02 ±142% perf-profile.self.cycles-pp.native_apic_msr_eoi
0.00 +0.0 0.02 ±142% perf-profile.self.cycles-pp.update_irq_load_avg
0.00 +0.0 0.02 ±141% perf-profile.self.cycles-pp.copy_user_large_folio
0.00 +0.0 0.02 ±146% perf-profile.self.cycles-pp.idle_cpu
0.00 +0.0 0.02 ±142% perf-profile.self.cycles-pp.read_tsc
0.00 +0.0 0.02 ±143% perf-profile.self.cycles-pp.hugetlb_wp
0.00 +0.0 0.02 ±141% perf-profile.self.cycles-pp.tick_nohz_next_event
0.00 +0.0 0.02 ±141% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.00 +0.0 0.03 ±143% perf-profile.self.cycles-pp.lapic_next_deadline
0.00 +0.0 0.03 ±150% perf-profile.self.cycles-pp.__cond_resched
0.00 +0.0 0.03 ±141% perf-profile.self.cycles-pp.native_sched_clock
0.00 +0.0 0.03 ±141% perf-profile.self.cycles-pp.sched_balance_domains
0.00 +0.0 0.03 ±144% perf-profile.self.cycles-pp.ktime_get_update_offsets_now
0.00 +0.0 0.03 ±144% perf-profile.self.cycles-pp.tick_nohz_handler
0.00 +0.0 0.03 ±146% perf-profile.self.cycles-pp.update_load_avg
0.00 +0.0 0.04 ±143% perf-profile.self.cycles-pp.copy_subpage
0.00 +0.0 0.04 ±141% perf-profile.self.cycles-pp.__update_blocked_fair
0.00 +0.0 0.04 ±147% perf-profile.self.cycles-pp.menu_select
0.00 +0.0 0.04 ±144% perf-profile.self.cycles-pp.arch_scale_freq_tick
0.00 +0.0 0.04 ±141% perf-profile.self.cycles-pp._raw_spin_lock
0.04 ±101% +0.0 0.08 ± 77% perf-profile.self.cycles-pp.free_unref_page
0.04 ± 45% +0.0 0.09 ± 78% perf-profile.self.cycles-pp.get_jiffies_update
0.00 +0.0 0.04 ±141% perf-profile.self.cycles-pp.clear_page_erms
0.00 +0.0 0.05 ±141% perf-profile.self.cycles-pp.native_irq_return_iret
0.00 +0.0 0.05 ±148% perf-profile.self.cycles-pp.update_sg_lb_stats
0.01 ±223% +0.1 0.06 ±149% perf-profile.self.cycles-pp.task_mm_cid_work
0.05 ± 76% +0.1 0.17 ± 75% perf-profile.self.cycles-pp.io_serial_in
0.11 ±124% +0.1 0.24 ±144% perf-profile.self.cycles-pp.prep_compound_page
0.10 ± 38% +0.1 0.24 ± 73% perf-profile.self.cycles-pp.memcpy_toio
0.08 ± 14% +0.2 0.32 ±105% perf-profile.self.cycles-pp.ktime_get
0.00 +0.3 0.34 ±144% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
0.00 +1.1 1.07 ±138% perf-profile.self.cycles-pp.acpi_safe_halt
7.77 ±127% +15.5 23.22 ±141% perf-profile.self.cycles-pp.copy_mc_enhanced_fast_string
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
2024-07-09 5:11 [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression kernel test robot
@ 2024-07-10 6:22 ` Yu Zhao
2024-07-14 12:26 ` Oliver Sang
2024-07-17 7:52 ` Janosch Frank
1 sibling, 1 reply; 14+ messages in thread
From: Yu Zhao @ 2024-07-10 6:22 UTC (permalink / raw)
To: kernel test robot
Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
Muchun Song, David Hildenbrand, Frank van der Linden,
Matthew Wilcox, Peter Xu, Yang Shi, linux-kernel, ying.huang,
feng.tang, fengwei.yin
On Mon, Jul 8, 2024 at 11:11 PM kernel test robot <oliver.sang@intel.com> wrote:
>
> Hello,
>
> kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
>
>
> commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
This is likely caused by the synchronize_rcu() this commit added to the
allocation path. I'll patch that up soon.
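For context, the commit makes each per-folio vmemmap optimize/restore wait
for a full RCU grace period. A condensed view of the pre-fix path (it matches
the lines the fix later in this thread removes; not a complete listing):

/* Pre-fix pattern: one synchronize_rcu() per folio, i.e. one full RCU
 * grace period for every huge page the fault path optimizes. */
void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
{
	LIST_HEAD(vmemmap_pages);

	/* avoid writes from page_ref_add_unless() while folding vmemmap */
	synchronize_rcu();

	__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, 0);

	free_vmemmap_page_list(&vmemmap_pages);
}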
* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
2024-07-10 6:22 ` Yu Zhao
@ 2024-07-14 12:26 ` Oliver Sang
2024-07-15 2:40 ` Muchun Song
0 siblings, 1 reply; 14+ messages in thread
From: Oliver Sang @ 2024-07-14 12:26 UTC (permalink / raw)
To: Yu Zhao
Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
Muchun Song, David Hildenbrand, Frank van der Linden,
Matthew Wilcox, Peter Xu, Yang Shi, linux-kernel, ying.huang,
feng.tang, fengwei.yin, oliver.sang
Hi, Yu Zhao,
On Wed, Jul 10, 2024 at 12:22:40AM -0600, Yu Zhao wrote:
> On Mon, Jul 8, 2024 at 11:11 PM kernel test robot <oliver.sang@intel.com> wrote:
> >
> > Hello,
> >
> > kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
> >
> >
> > commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
>
> This is likely caused by the synchronize_rcu() this commit added to the
> allocation path. I'll patch that up soon.
>
We noticed this commit has already been merged into mainline:
[bd225530a4c717714722c3731442b78954c765b3] mm/hugetlb_vmemmap: fix race with speculative PFN walkers
branch: linus/master
and the regression still exists in our tests. Do you want us to test your
patch? Thanks!
* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
2024-07-14 12:26 ` Oliver Sang
@ 2024-07-15 2:40 ` Muchun Song
2024-07-15 4:08 ` Oliver Sang
0 siblings, 1 reply; 14+ messages in thread
From: Muchun Song @ 2024-07-15 2:40 UTC (permalink / raw)
To: Oliver Sang
Cc: Yu Zhao, oe-lkp, kernel test robot, Linux Memory Management List,
Andrew Morton, David Hildenbrand, Frank van der Linden,
Matthew Wilcox, Peter Xu, Yang Shi, LKML, Huang Ying, Feng Tang,
Yin Fengwei
> On Jul 14, 2024, at 20:26, Oliver Sang <oliver.sang@intel.com> wrote:
>
> Hi, Yu Zhao,
>
> On Wed, Jul 10, 2024 at 12:22:40AM -0600, Yu Zhao wrote:
>> On Mon, Jul 8, 2024 at 11:11 PM kernel test robot <oliver.sang@intel.com> wrote:
>>>
>>> Hello,
>>>
>>> kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
>>>
>>>
>>> commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
>>
>> This is likely caused by the synchronize_rcu() this commit added to the
>> allocation path. I'll patch that up soon.
>>
>
> We noticed this commit has already been merged into mainline:
>
> [bd225530a4c717714722c3731442b78954c765b3] mm/hugetlb_vmemmap: fix race with speculative PFN walkers
> branch: linus/master
Did you test with HVO enabled? There are two ways to enable HVO: 1) add "hugetlb_free_vmemmap=on" to the kernel
cmdline, or 2) write 1 to /proc/sys/vm/hugetlb_optimize_vmemmap. I want to confirm whether the regression is
related to the HVO routine.
Thanks.
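For reference, the runtime knob can also be flipped programmatically; a
trivial userspace sketch (a hypothetical helper, equivalent to
echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap):

#include <stdio.h>

int main(void)
{
	/* Hypothetical helper: enable HVO at runtime via the sysctl file.
	 * Needs root and a kernel with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP. */
	FILE *f = fopen("/proc/sys/vm/hugetlb_optimize_vmemmap", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fputs("1\n", f);
	return fclose(f) ? 1 : 0;
}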
>
> and the regression still exists in our tests. Do you want us to test your
> patch? Thanks!
* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
2024-07-15 2:40 ` Muchun Song
@ 2024-07-15 4:08 ` Oliver Sang
0 siblings, 0 replies; 14+ messages in thread
From: Oliver Sang @ 2024-07-15 4:08 UTC (permalink / raw)
To: Muchun Song
Cc: Yu Zhao, oe-lkp, kernel test robot, Linux Memory Management List,
Andrew Morton, David Hildenbrand, Frank van der Linden,
Matthew Wilcox, Peter Xu, Yang Shi, LKML, Huang Ying, Feng Tang,
Yin Fengwei, oliver.sang
Hi, Muchun Song,
On Mon, Jul 15, 2024 at 10:40:43AM +0800, Muchun Song wrote:
>
>
> > On Jul 14, 2024, at 20:26, Oliver Sang <oliver.sang@intel.com> wrote:
> >
> > Hi, Yu Zhao,
> >
> > On Wed, Jul 10, 2024 at 12:22:40AM -0600, Yu Zhao wrote:
> >> On Mon, Jul 8, 2024 at 11:11 PM kernel test robot <oliver.sang@intel.com> wrote:
> >>>
> >>> Hello,
> >>>
> >>> kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
> >>>
> >>>
> >>> commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> >>
> >> This is likely caused by the synchronize_rcu() this commit added to the
> >> allocation path. I'll patch that up soon.
> >>
> >
> > We noticed this commit has already been merged into mainline:
> >
> > [bd225530a4c717714722c3731442b78954c765b3] mm/hugetlb_vmemmap: fix race with speculative PFN walkers
> > branch: linus/master
>
> Did you test with HVO enabled? There are two ways to enable HVO: 1) add "hugetlb_free_vmemmap=on" to the kernel
> cmdline, or 2) write 1 to /proc/sys/vm/hugetlb_optimize_vmemmap. I want to confirm whether the regression is
> related to the HVO routine.
We found a strange thing: after adding 'hugetlb_free_vmemmap=on', the data
become unstable from run to run (we use kexec to go from the previous job to the next one).
Below is the data for 875fa64577 + 'hugetlb_free_vmemmap=on':
"vm-scalability.throughput": [
611622,
645261,
705923,
833589,
840140,
884010
],
As a comparison, without 'hugetlb_free_vmemmap=on', for 875fa64577:
"vm-scalability.throughput": [
4597606,
4357960,
4385331,
4631803,
4554570,
4462691
],
For 73236245e0 (the parent of 875fa64577):
"vm-scalability.throughput": [
6866441,
6769773,
6942991,
6877124,
6785790,
6812001
],
>
> Thanks.
>
> >
> > and the regression still exists in our tests. Do you want us to test your
> > patch? Thanks!
>
* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
2024-07-09 5:11 [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression kernel test robot
2024-07-10 6:22 ` Yu Zhao
@ 2024-07-17 7:52 ` Janosch Frank
2024-07-17 7:59 ` Christian Borntraeger
2024-07-17 8:36 ` Yu Zhao
1 sibling, 2 replies; 14+ messages in thread
From: Janosch Frank @ 2024-07-17 7:52 UTC (permalink / raw)
To: Yu Zhao
Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
Muchun Song, David Hildenbrand, Frank van der Linden,
Matthew Wilcox, Peter Xu, Yang Shi, linux-kernel, ying.huang,
feng.tang, fengwei.yin, Christian Borntraeger, Claudio Imbrenda,
Marc Hartmayer, Heiko Carstens
On 7/9/24 07:11, kernel test robot wrote:
> Hello,
>
> kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
>
>
> commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>
> [still regression on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
>
This has hit s390 huge page backed KVM guests as well.
Our simple start/stop test case went from ~5 to over 50 seconds of runtime.
* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
2024-07-17 7:52 ` Janosch Frank
@ 2024-07-17 7:59 ` Christian Borntraeger
2024-07-17 8:36 ` Yu Zhao
1 sibling, 0 replies; 14+ messages in thread
From: Christian Borntraeger @ 2024-07-17 7:59 UTC (permalink / raw)
To: Janosch Frank, Yu Zhao
Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
Muchun Song, David Hildenbrand, Frank van der Linden,
Matthew Wilcox, Peter Xu, Yang Shi, linux-kernel, ying.huang,
feng.tang, fengwei.yin, Claudio Imbrenda, Marc Hartmayer,
Heiko Carstens
On 17.07.24 at 09:52, Janosch Frank wrote:
> On 7/9/24 07:11, kernel test robot wrote:
>> Hello,
>>
>> kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
>>
>>
>> commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>
>> [still regression on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
>>
> This has hit s390 huge page backed KVM guests as well.
> Our simple start/stop test case went from ~5 to over 50 seconds of runtime.
Could this be one of the synchronize_rcu() calls? This patch adds lots of them. On s390 with HZ=100 those are really expensive.
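For a rough sense of scale: synchronize_rcu() waits until every CPU passes
through a quiescent state, so a normal grace period spans several jiffies,
and at HZ=100 a jiffy is 10ms. A minimal, hypothetical module sketch (not one
of this thread's patches) to measure a single call:

#include <linux/module.h>
#include <linux/rcupdate.h>
#include <linux/ktime.h>

static int __init rcu_gp_lat_init(void)
{
	ktime_t t0 = ktime_get();

	synchronize_rcu();	/* blocks for at least one full grace period */
	pr_info("synchronize_rcu() took %lld us\n",
		ktime_us_delta(ktime_get(), t0));
	return 0;	/* result lands in dmesg */
}
module_init(rcu_gp_lat_init);

static void __exit rcu_gp_lat_exit(void)
{
}
module_exit(rcu_gp_lat_exit);

MODULE_LICENSE("GPL");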
* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
2024-07-17 7:52 ` Janosch Frank
2024-07-17 7:59 ` Christian Borntraeger
@ 2024-07-17 8:36 ` Yu Zhao
2024-07-17 15:44 ` Yu Zhao
1 sibling, 1 reply; 14+ messages in thread
From: Yu Zhao @ 2024-07-17 8:36 UTC (permalink / raw)
To: Janosch Frank, kernel test robot
Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
Muchun Song, David Hildenbrand, Frank van der Linden,
Matthew Wilcox, Peter Xu, Yang Shi, linux-kernel, ying.huang,
feng.tang, fengwei.yin, Christian Borntraeger, Claudio Imbrenda,
Marc Hartmayer, Heiko Carstens
[-- Attachment #1: Type: text/plain, Size: 746 bytes --]
Hi Janosch and Oliver,
On Wed, Jul 17, 2024 at 1:57 AM Janosch Frank <frankja@linux.ibm.com> wrote:
>
> On 7/9/24 07:11, kernel test robot wrote:
> > Hello,
> >
> > kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
> >
> >
> > commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> >
> > [still regression on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
> >
> This has hit s390 huge page backed KVM guests as well.
> Our simple start/stop test case went from ~5 to over 50 seconds of runtime.
Could you try the attached patch please? Thank you.
[-- Attachment #2: hugetlb-fix.patch --]
[-- Type: application/octet-stream, Size: 3952 bytes --]
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 8193906515c6..9e6fc4ce8d2b 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -43,6 +43,8 @@ struct vmemmap_remap_walk {
#define VMEMMAP_SPLIT_NO_TLB_FLUSH BIT(0)
/* Skip the TLB flush when we remap the PTE */
#define VMEMMAP_REMAP_NO_TLB_FLUSH BIT(1)
+/* synchronize_rcu() to avoid writes from page_ref_add_unless() */
+#define VMEMMAP_SYNCHRONIZE_RCU BIT(2)
unsigned long flags;
};
@@ -451,6 +453,9 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
if (!folio_test_hugetlb_vmemmap_optimized(folio))
return 0;
+ if (flags & VMEMMAP_SYNCHRONIZE_RCU)
+ synchronize_rcu();
+
vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
vmemmap_reuse = vmemmap_start;
vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
@@ -483,10 +488,7 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
*/
int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio)
{
- /* avoid writes from page_ref_add_unless() while unfolding vmemmap */
- synchronize_rcu();
-
- return __hugetlb_vmemmap_restore_folio(h, folio, 0);
+ return __hugetlb_vmemmap_restore_folio(h, folio, VMEMMAP_SYNCHRONIZE_RCU);
}
/**
@@ -509,14 +511,13 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
struct folio *folio, *t_folio;
long restored = 0;
long ret = 0;
-
- /* avoid writes from page_ref_add_unless() while unfolding vmemmap */
- synchronize_rcu();
+ unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU;
list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
if (folio_test_hugetlb_vmemmap_optimized(folio)) {
- ret = __hugetlb_vmemmap_restore_folio(h, folio,
- VMEMMAP_REMAP_NO_TLB_FLUSH);
+ ret = __hugetlb_vmemmap_restore_folio(h, folio, flags);
+ flags &= VMEMMAP_SYNCHRONIZE_RCU;
+
if (ret)
break;
restored++;
@@ -564,6 +565,9 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
return ret;
static_branch_inc(&hugetlb_optimize_vmemmap_key);
+
+ if (flags & VMEMMAP_SYNCHRONIZE_RCU)
+ synchronize_rcu();
/*
* Very Subtle
* If VMEMMAP_REMAP_NO_TLB_FLUSH is set, TLB flushing is not performed
@@ -611,10 +615,7 @@ void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
{
LIST_HEAD(vmemmap_pages);
- /* avoid writes from page_ref_add_unless() while folding vmemmap */
- synchronize_rcu();
-
- __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, 0);
+ __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, VMEMMAP_SYNCHRONIZE_RCU);
free_vmemmap_page_list(&vmemmap_pages);
}
@@ -641,6 +642,7 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
{
struct folio *folio;
LIST_HEAD(vmemmap_pages);
+ unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU;
list_for_each_entry(folio, folio_list, lru) {
int ret = hugetlb_vmemmap_split_folio(h, folio);
@@ -657,14 +659,11 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
flush_tlb_all();
- /* avoid writes from page_ref_add_unless() while folding vmemmap */
- synchronize_rcu();
-
list_for_each_entry(folio, folio_list, lru) {
int ret;
- ret = __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages,
- VMEMMAP_REMAP_NO_TLB_FLUSH);
+ ret = __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, flags);
+ flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
/*
* Pages to be freed may have been accumulated. If we
@@ -678,8 +677,7 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
flush_tlb_all();
free_vmemmap_page_list(&vmemmap_pages);
INIT_LIST_HEAD(&vmemmap_pages);
- __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages,
- VMEMMAP_REMAP_NO_TLB_FLUSH);
+ __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, flags);
}
}
* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
2024-07-17 8:36 ` Yu Zhao
@ 2024-07-17 15:44 ` Yu Zhao
2024-07-18 9:23 ` Marc Hartmayer
2024-07-19 8:42 ` Oliver Sang
0 siblings, 2 replies; 14+ messages in thread
From: Yu Zhao @ 2024-07-17 15:44 UTC (permalink / raw)
To: Janosch Frank, kernel test robot
Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
Muchun Song, David Hildenbrand, Frank van der Linden,
Matthew Wilcox, Peter Xu, Yang Shi, linux-kernel, ying.huang,
feng.tang, fengwei.yin, Christian Borntraeger, Claudio Imbrenda,
Marc Hartmayer, Heiko Carstens, Yosry Ahmed
[-- Attachment #1: Type: text/plain, Size: 1086 bytes --]
On Wed, Jul 17, 2024 at 2:36 AM Yu Zhao <yuzhao@google.com> wrote:
>
> Hi Janosch and Oliver,
>
> On Wed, Jul 17, 2024 at 1:57 AM Janosch Frank <frankja@linux.ibm.com> wrote:
> >
> > On 7/9/24 07:11, kernel test robot wrote:
> > > Hello,
> > >
> > > kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
> > >
> > >
> > > commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > >
> > > [still regression on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
> > >
> > This has hit s390 huge page backed KVM guests as well.
> > Our simple start/stop test case went from ~5 to over 50 seconds of runtime.
>
> Could you try the attached patch please? Thank you.
Thanks, Yosry, for spotting the following typo:
flags &= VMEMMAP_SYNCHRONIZE_RCU;
It's supposed to be:
flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
Reattaching v2 with the above typo fixed. Please let me know, Janosch & Oliver.
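In a nutshell, v2 makes the batched paths pay the grace-period wait once per
batch instead of once per folio; a condensed sketch of the loop pattern from
the diff below (not the complete function):

unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU;

list_for_each_entry(folio, folio_list, lru) {
	/* __hugetlb_vmemmap_optimize_folio() calls synchronize_rcu()
	 * only while VMEMMAP_SYNCHRONIZE_RCU is set in flags ... */
	ret = __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, flags);
	/* ... so clearing it here lets every later folio skip the wait. */
	flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
}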
[-- Attachment #2: hugetlb-fix-v2.patch --]
[-- Type: application/octet-stream, Size: 3953 bytes --]
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 8193906515c6..9e6fc4ce8d2b 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -43,6 +43,8 @@ struct vmemmap_remap_walk {
#define VMEMMAP_SPLIT_NO_TLB_FLUSH BIT(0)
/* Skip the TLB flush when we remap the PTE */
#define VMEMMAP_REMAP_NO_TLB_FLUSH BIT(1)
+/* synchronize_rcu() to avoid writes from page_ref_add_unless() */
+#define VMEMMAP_SYNCHRONIZE_RCU BIT(2)
unsigned long flags;
};
@@ -451,6 +453,9 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
if (!folio_test_hugetlb_vmemmap_optimized(folio))
return 0;
+ if (flags & VMEMMAP_SYNCHRONIZE_RCU)
+ synchronize_rcu();
+
vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
vmemmap_reuse = vmemmap_start;
vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
@@ -483,10 +488,7 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
*/
int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio)
{
- /* avoid writes from page_ref_add_unless() while unfolding vmemmap */
- synchronize_rcu();
-
- return __hugetlb_vmemmap_restore_folio(h, folio, 0);
+ return __hugetlb_vmemmap_restore_folio(h, folio, VMEMMAP_SYNCHRONIZE_RCU);
}
/**
@@ -509,14 +511,13 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
struct folio *folio, *t_folio;
long restored = 0;
long ret = 0;
-
- /* avoid writes from page_ref_add_unless() while unfolding vmemmap */
- synchronize_rcu();
+ unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU;
list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
if (folio_test_hugetlb_vmemmap_optimized(folio)) {
- ret = __hugetlb_vmemmap_restore_folio(h, folio,
- VMEMMAP_REMAP_NO_TLB_FLUSH);
+ ret = __hugetlb_vmemmap_restore_folio(h, folio, flags);
+ flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
+
if (ret)
break;
restored++;
@@ -564,6 +565,9 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
return ret;
static_branch_inc(&hugetlb_optimize_vmemmap_key);
+
+ if (flags & VMEMMAP_SYNCHRONIZE_RCU)
+ synchronize_rcu();
/*
* Very Subtle
* If VMEMMAP_REMAP_NO_TLB_FLUSH is set, TLB flushing is not performed
@@ -611,10 +615,7 @@ void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
{
LIST_HEAD(vmemmap_pages);
- /* avoid writes from page_ref_add_unless() while folding vmemmap */
- synchronize_rcu();
-
- __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, 0);
+ __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, VMEMMAP_SYNCHRONIZE_RCU);
free_vmemmap_page_list(&vmemmap_pages);
}
@@ -641,6 +642,7 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
{
struct folio *folio;
LIST_HEAD(vmemmap_pages);
+ unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU;
list_for_each_entry(folio, folio_list, lru) {
int ret = hugetlb_vmemmap_split_folio(h, folio);
@@ -657,14 +659,11 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
flush_tlb_all();
- /* avoid writes from page_ref_add_unless() while folding vmemmap */
- synchronize_rcu();
-
list_for_each_entry(folio, folio_list, lru) {
int ret;
- ret = __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages,
- VMEMMAP_REMAP_NO_TLB_FLUSH);
+ ret = __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, flags);
+ flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
/*
* Pages to be freed may have been accumulated. If we
@@ -678,8 +677,7 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
flush_tlb_all();
free_vmemmap_page_list(&vmemmap_pages);
INIT_LIST_HEAD(&vmemmap_pages);
- __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages,
- VMEMMAP_REMAP_NO_TLB_FLUSH);
+ __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, flags);
}
}
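The shape of the fix above, in short: synchronize_rcu() moves behind the new VMEMMAP_SYNCHRONIZE_RCU flag, the single-folio entry points pass the flag down so the wait only happens after the early-return checks, and the batched paths clear the flag after the first folio so one grace period covers the whole batch. A condensed, standalone sketch of that control flow follows; the names are stubs for illustration, not the kernel API.

#include <stdio.h>

#define SYNC_RCU (1UL << 2)

/* stand-in for synchronize_rcu(): expensive, so pay for it once per batch */
static void wait_grace_period(void)
{
	puts("waiting for one RCU grace period");
}

static void process_one(int item, unsigned long flags)
{
	if (flags & SYNC_RCU)
		wait_grace_period();
	printf("remapping item %d\n", item);
}

int main(void)
{
	unsigned long flags = SYNC_RCU;

	for (int i = 0; i < 4; i++) {
		process_one(i, flags);
		flags &= ~SYNC_RCU;	/* later items skip the wait */
	}
	return 0;
}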
* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
2024-07-17 15:44 ` Yu Zhao
@ 2024-07-18 9:23 ` Marc Hartmayer
2024-07-19 8:42 ` Oliver Sang
1 sibling, 0 replies; 14+ messages in thread
From: Marc Hartmayer @ 2024-07-18 9:23 UTC (permalink / raw)
To: Yu Zhao, Janosch Frank, kernel test robot
Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
Muchun Song, David Hildenbrand, Frank van der Linden,
Matthew Wilcox, Peter Xu, Yang Shi, linux-kernel, ying.huang,
feng.tang, fengwei.yin, Christian Borntraeger, Claudio Imbrenda,
Heiko Carstens, Yosry Ahmed
On Wed, Jul 17, 2024 at 09:44 AM -0600, Yu Zhao <yuzhao@google.com> wrote:
> On Wed, Jul 17, 2024 at 2:36 AM Yu Zhao <yuzhao@google.com> wrote:
>>
>> Hi Janosch and Oliver,
>>
>> On Wed, Jul 17, 2024 at 1:57 AM Janosch Frank <frankja@linux.ibm.com> wrote:
>> >
>> > On 7/9/24 07:11, kernel test robot wrote:
>> > > Hello,
>> > >
>> > > kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
>> > >
>> > >
>> > > commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
>> > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>> > >
>> > > [still regression on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
>> > >
>> > This has hit s390 huge page backed KVM guests as well.
>> > Our simple start/stop test case went from ~5 to over 50 seconds of runtime.
>>
>> Could you try the attached patch please? Thank you.
>
Hi,
thanks a lot for the fix; it has resolved the problem on s390.
--
Kind regards / Beste Grüße
Marc Hartmayer
IBM Deutschland Research & Development GmbH
Chairman of the Supervisory Board: Wolfgang Wendt
Management: David Faller
Registered office: Böblingen
Registration court: Amtsgericht Stuttgart, HRB 243294
* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
2024-07-17 15:44 ` Yu Zhao
2024-07-18 9:23 ` Marc Hartmayer
@ 2024-07-19 8:42 ` Oliver Sang
2024-07-19 16:06 ` Yu Zhao
1 sibling, 1 reply; 14+ messages in thread
From: Oliver Sang @ 2024-07-19 8:42 UTC (permalink / raw)
To: Yu Zhao
Cc: Janosch Frank, oe-lkp, lkp, Linux Memory Management List,
Andrew Morton, Muchun Song, David Hildenbrand,
Frank van der Linden, Matthew Wilcox, Peter Xu, Yang Shi,
linux-kernel, ying.huang, feng.tang, fengwei.yin,
Christian Borntraeger, Claudio Imbrenda, Marc Hartmayer,
Heiko Carstens, Yosry Ahmed, oliver.sang
hi, Yu Zhao,
On Wed, Jul 17, 2024 at 09:44:33AM -0600, Yu Zhao wrote:
> On Wed, Jul 17, 2024 at 2:36 AM Yu Zhao <yuzhao@google.com> wrote:
> >
> > Hi Janosch and Oliver,
> >
> > On Wed, Jul 17, 2024 at 1:57 AM Janosch Frank <frankja@linux.ibm.com> wrote:
> > >
> > > On 7/9/24 07:11, kernel test robot wrote:
> > > > Hello,
> > > >
> > > > kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
> > > >
> > > >
> > > > commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> > > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > > >
> > > > [still regression on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
> > > >
> > > This has hit s390 huge page backed KVM guests as well.
> > > Our simple start/stop test case went from ~5 to over 50 seconds of runtime.
> >
> > Could you try the attached patch please? Thank you.
>
> Thanks, Yosry, for spotting the following typo:
> flags &= VMEMMAP_SYNCHRONIZE_RCU;
> It's supposed to be:
> flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
>
> Reattaching v2 with the above typo fixed. Please let me know, Janosch & Oliver.
Since the commit is in mainline now, I applied your v2 patch directly on top of
bd225530a4c71 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers").
In our tests, your v2 patch not only recovers the performance regression, it
even shows a +13.7% performance improvement over 5a4d8944d6b1e (the parent of
bd225530a4c71).
Details are as below.
=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-13/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/300s/512G/lkp-icl-2sp2/anon-cow-rand-hugetlb/vm-scalability
commit:
5a4d8944d6b1e ("cachestat: do not flush stats in recency check")
bd225530a4c71 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
9a5b87b521401 <---- your v2 patch
5a4d8944d6b1e1aa bd225530a4c717714722c373144 9a5b87b5214018a2be217dc4648
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
4.271e+09 ± 10% +348.4% 1.915e+10 ± 6% -39.9% 2.567e+09 ± 20% cpuidle..time
774593 ± 4% +1060.9% 8992186 ± 6% -17.2% 641254 cpuidle..usage
555365 ± 8% +28.0% 710795 ± 2% -4.5% 530157 ± 5% numa-numastat.node0.local_node
629633 ± 4% +23.0% 774346 ± 5% +0.6% 633264 ± 4% numa-numastat.node0.numa_hit
255.76 ± 2% +31.1% 335.40 ± 3% -13.8% 220.53 ± 2% uptime.boot
10305 ± 6% +144.3% 25171 ± 5% -17.1% 8543 ± 8% uptime.idle
1.83 ± 58% +96200.0% 1765 ±155% +736.4% 15.33 ± 24% perf-c2c.DRAM.local
33.00 ± 16% +39068.2% 12925 ±122% +95.5% 64.50 ± 49% perf-c2c.DRAM.remote
21.33 ± 8% +2361.7% 525.17 ± 31% +271.1% 79.17 ± 52% perf-c2c.HITM.local
9.17 ± 21% +3438.2% 324.33 ± 57% +270.9% 34.00 ± 60% perf-c2c.HITM.remote
16.11 ± 7% +37.1 53.16 ± 2% -4.6 11.50 ± 19% mpstat.cpu.all.idle%
0.34 ± 2% -0.1 0.22 +0.0 0.35 ± 3% mpstat.cpu.all.irq%
0.03 ± 5% +0.0 0.04 ± 8% -0.0 0.02 mpstat.cpu.all.soft%
10.58 ± 4% -9.5 1.03 ± 36% +0.1 10.71 ± 2% mpstat.cpu.all.sys%
72.94 ± 2% -27.4 45.55 ± 3% +4.5 77.41 ± 2% mpstat.cpu.all.usr%
6.00 ± 16% +230.6% 19.83 ± 5% +8.3% 6.50 ± 17% mpstat.max_utilization.seconds
16.95 ± 7% +215.5% 53.48 ± 2% -26.2% 12.51 ± 16% vmstat.cpu.id
72.33 ± 2% -37.4% 45.31 ± 3% +6.0% 76.65 ± 2% vmstat.cpu.us
2.254e+08 -0.0% 2.254e+08 +14.7% 2.584e+08 vmstat.memory.free
108.30 -43.3% 61.43 ± 2% +5.4% 114.12 ± 2% vmstat.procs.r
2659 +162.6% 6982 ± 3% +3.6% 2753 ± 4% vmstat.system.cs
136384 ± 4% -21.9% 106579 ± 7% +13.3% 154581 ± 3% vmstat.system.in
203.41 ± 2% +39.2% 283.06 ± 4% -17.1% 168.71 ± 2% time.elapsed_time
203.41 ± 2% +39.2% 283.06 ± 4% -17.1% 168.71 ± 2% time.elapsed_time.max
148901 ± 6% -45.6% 81059 ± 4% -8.8% 135748 ± 8% time.involuntary_context_switches
169.83 ± 23% +85.3% 314.67 ± 8% +7.9% 183.33 ± 7% time.major_page_faults
10697 -43.4% 6050 ± 2% +5.6% 11294 ± 2% time.percent_of_cpu_this_job_got
2740 ± 6% -86.7% 365.06 ± 43% -16.1% 2298 time.system_time
19012 -11.9% 16746 -11.9% 16747 time.user_time
14412 ± 5% +4432.0% 653187 -16.6% 12025 ± 3% time.voluntary_context_switches
50095 ± 2% -31.5% 34325 ± 2% +18.6% 59408 vm-scalability.median
8.25 ± 16% -3.4 4.84 ± 22% -6.6 1.65 ± 15% vm-scalability.median_stddev%
6863720 -34.0% 4532485 +13.7% 7805408 vm-scalability.throughput
203.41 ± 2% +39.2% 283.06 ± 4% -17.1% 168.71 ± 2% vm-scalability.time.elapsed_time
203.41 ± 2% +39.2% 283.06 ± 4% -17.1% 168.71 ± 2% vm-scalability.time.elapsed_time.max
148901 ± 6% -45.6% 81059 ± 4% -8.8% 135748 ± 8% vm-scalability.time.involuntary_context_switches
10697 -43.4% 6050 ± 2% +5.6% 11294 ± 2% vm-scalability.time.percent_of_cpu_this_job_got
2740 ± 6% -86.7% 365.06 ± 43% -16.1% 2298 vm-scalability.time.system_time
19012 -11.9% 16746 -11.9% 16747 vm-scalability.time.user_time
14412 ± 5% +4432.0% 653187 -16.6% 12025 ± 3% vm-scalability.time.voluntary_context_switches
1.159e+09 +0.0% 1.159e+09 +1.6% 1.178e+09 vm-scalability.workload
22900043 ± 4% +1.2% 23166356 ± 6% -16.7% 19076170 ± 5% numa-vmstat.node0.nr_free_pages
42856 ± 43% +998.5% 470779 ± 51% +318.6% 179409 ±154% numa-vmstat.node0.nr_unevictable
42856 ± 43% +998.5% 470779 ± 51% +318.6% 179409 ±154% numa-vmstat.node0.nr_zone_unevictable
629160 ± 4% +22.9% 773391 ± 5% +0.5% 632570 ± 4% numa-vmstat.node0.numa_hit
554892 ± 8% +27.9% 709841 ± 2% -4.6% 529463 ± 5% numa-vmstat.node0.numa_local
27469 ± 14% +0.0% 27475 ± 41% -31.7% 18763 ± 13% numa-vmstat.node1.nr_active_anon
767179 ± 2% -55.8% 339212 ± 72% -19.7% 616417 ± 43% numa-vmstat.node1.nr_file_pages
10693349 ± 5% +46.3% 15639681 ± 7% +69.4% 18112002 ± 3% numa-vmstat.node1.nr_free_pages
14210 ± 27% -65.0% 4973 ± 49% -34.7% 9280 ± 39% numa-vmstat.node1.nr_mapped
724050 ± 2% -59.1% 296265 ± 82% -18.9% 587498 ± 47% numa-vmstat.node1.nr_unevictable
27469 ± 14% +0.0% 27475 ± 41% -31.7% 18763 ± 13% numa-vmstat.node1.nr_zone_active_anon
724050 ± 2% -59.1% 296265 ± 82% -18.9% 587498 ± 47% numa-vmstat.node1.nr_zone_unevictable
120619 ± 11% +13.6% 137042 ± 27% -31.2% 82976 ± 7% meminfo.Active
120472 ± 11% +13.6% 136895 ± 27% -31.2% 82826 ± 7% meminfo.Active(anon)
70234807 +14.6% 80512468 +10.2% 77431344 meminfo.CommitLimit
2.235e+08 +0.1% 2.237e+08 +15.1% 2.573e+08 meminfo.DirectMap1G
44064 -22.8% 34027 ± 2% +20.7% 53164 ± 2% meminfo.HugePages_Surp
44064 -22.8% 34027 ± 2% +20.7% 53164 ± 2% meminfo.HugePages_Total
90243440 -22.8% 69688103 ± 2% +20.7% 1.089e+08 ± 2% meminfo.Hugetlb
70163 ± 29% -42.6% 40293 ± 11% -21.9% 54789 ± 15% meminfo.Mapped
1.334e+08 +15.5% 1.541e+08 +10.7% 1.477e+08 meminfo.MemAvailable
1.344e+08 +15.4% 1.551e+08 +10.7% 1.488e+08 meminfo.MemFree
2.307e+08 +0.0% 2.307e+08 +14.3% 2.637e+08 meminfo.MemTotal
96309843 -21.5% 75639108 ± 2% +19.4% 1.15e+08 ± 2% meminfo.Memused
259553 ± 2% -0.9% 257226 ± 15% -10.5% 232211 ± 4% meminfo.Shmem
1.2e+08 -2.4% 1.172e+08 +13.3% 1.36e+08 meminfo.max_used_kB
18884 ± 10% -7.2% 17519 ± 15% +37.6% 25983 ± 6% numa-meminfo.node0.HugePages_Surp
18884 ± 10% -7.2% 17519 ± 15% +37.6% 25983 ± 6% numa-meminfo.node0.HugePages_Total
91526744 ± 4% +1.2% 92620825 ± 6% -16.7% 76248423 ± 5% numa-meminfo.node0.MemFree
40158207 ± 9% -2.7% 39064126 ± 15% +38.0% 55436528 ± 7% numa-meminfo.node0.MemUsed
171426 ± 43% +998.5% 1883116 ± 51% +318.6% 717638 ±154% numa-meminfo.node0.Unevictable
110091 ± 14% -0.1% 109981 ± 41% -31.7% 75226 ± 13% numa-meminfo.node1.Active
110025 ± 14% -0.1% 109915 ± 41% -31.7% 75176 ± 13% numa-meminfo.node1.Active(anon)
3068496 ± 2% -55.8% 1356754 ± 72% -19.6% 2466084 ± 43% numa-meminfo.node1.FilePages
25218 ± 4% -34.7% 16475 ± 12% +7.9% 27213 ± 3% numa-meminfo.node1.HugePages_Surp
25218 ± 4% -34.7% 16475 ± 12% +7.9% 27213 ± 3% numa-meminfo.node1.HugePages_Total
55867 ± 27% -65.5% 19266 ± 50% -34.4% 36671 ± 38% numa-meminfo.node1.Mapped
42795888 ± 5% +46.1% 62520130 ± 7% +69.3% 72441496 ± 3% numa-meminfo.node1.MemFree
99028084 +0.0% 99028084 +33.4% 1.321e+08 numa-meminfo.node1.MemTotal
56232195 ± 3% -35.1% 36507953 ± 12% +6.0% 59616707 ± 4% numa-meminfo.node1.MemUsed
2896199 ± 2% -59.1% 1185064 ± 82% -18.9% 2349991 ± 47% numa-meminfo.node1.Unevictable
507357 +0.0% 507357 +1.7% 516000 proc-vmstat.htlb_buddy_alloc_success
29942 ± 10% +14.3% 34235 ± 27% -30.7% 20740 ± 7% proc-vmstat.nr_active_anon
3324095 +15.7% 3847387 +10.9% 3686860 proc-vmstat.nr_dirty_background_threshold
6656318 +15.7% 7704181 +10.9% 7382735 proc-vmstat.nr_dirty_threshold
33559092 +15.6% 38798108 +10.9% 37209133 proc-vmstat.nr_free_pages
197697 ± 2% -2.5% 192661 +1.0% 199623 proc-vmstat.nr_inactive_anon
17939 ± 28% -42.5% 10307 ± 11% -22.4% 13927 ± 14% proc-vmstat.nr_mapped
2691 -7.1% 2501 +2.9% 2769 proc-vmstat.nr_page_table_pages
64848 ± 2% -0.7% 64386 ± 15% -10.6% 57987 ± 4% proc-vmstat.nr_shmem
29942 ± 10% +14.3% 34235 ± 27% -30.7% 20740 ± 7% proc-vmstat.nr_zone_active_anon
197697 ± 2% -2.5% 192661 +1.0% 199623 proc-vmstat.nr_zone_inactive_anon
1403095 +9.3% 1534152 ± 2% -3.2% 1358244 proc-vmstat.numa_hit
1267544 +10.6% 1401482 ± 2% -3.4% 1224210 proc-vmstat.numa_local
2.608e+08 +0.1% 2.609e+08 +1.7% 2.651e+08 proc-vmstat.pgalloc_normal
1259957 +13.4% 1428284 ± 2% -6.5% 1178198 proc-vmstat.pgfault
2.591e+08 +0.3% 2.6e+08 +2.3% 2.649e+08 proc-vmstat.pgfree
36883 ± 3% +18.5% 43709 ± 5% -12.2% 32371 ± 3% proc-vmstat.pgreuse
1.88 ± 16% -0.6 1.33 ±100% +0.9 2.80 ± 11% perf-profile.calltrace.cycles-pp.nrand48_r
16.19 ± 85% +28.6 44.75 ± 95% -11.4 4.78 ±218% perf-profile.calltrace.cycles-pp.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
16.20 ± 85% +28.6 44.78 ± 95% -11.4 4.78 ±218% perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
16.22 ± 85% +28.6 44.82 ± 95% -11.4 4.79 ±218% perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.do_access
16.22 ± 85% +28.6 44.82 ± 95% -11.4 4.79 ±218% perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
16.24 ± 85% +28.8 45.01 ± 95% -11.4 4.80 ±218% perf-profile.calltrace.cycles-pp.asm_exc_page_fault.do_access
12.42 ± 84% +29.5 41.89 ± 95% -8.8 3.65 ±223% perf-profile.calltrace.cycles-pp.copy_mc_enhanced_fast_string.copy_subpage.copy_user_large_folio.hugetlb_wp.hugetlb_fault
12.52 ± 84% +29.6 42.08 ± 95% -8.8 3.68 ±223% perf-profile.calltrace.cycles-pp.copy_subpage.copy_user_large_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault
12.53 ± 84% +29.7 42.23 ± 95% -8.9 3.68 ±223% perf-profile.calltrace.cycles-pp.copy_user_large_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault
12.80 ± 84% +30.9 43.65 ± 95% -9.0 3.76 ±223% perf-profile.calltrace.cycles-pp.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
2.50 ± 17% -0.7 1.78 ±100% +1.2 3.73 ± 11% perf-profile.children.cycles-pp.nrand48_r
16.24 ± 85% +28.6 44.87 ± 95% -11.4 4.79 ±218% perf-profile.children.cycles-pp.do_user_addr_fault
16.24 ± 85% +28.6 44.87 ± 95% -11.4 4.79 ±218% perf-profile.children.cycles-pp.exc_page_fault
16.20 ± 85% +28.7 44.86 ± 95% -11.4 4.78 ±218% perf-profile.children.cycles-pp.hugetlb_fault
16.22 ± 85% +28.7 44.94 ± 95% -11.4 4.79 ±218% perf-profile.children.cycles-pp.handle_mm_fault
16.26 ± 85% +28.8 45.06 ± 95% -11.5 4.80 ±218% perf-profile.children.cycles-pp.asm_exc_page_fault
12.51 ± 84% +29.5 42.01 ± 95% -8.8 3.75 ±218% perf-profile.children.cycles-pp.copy_mc_enhanced_fast_string
12.52 ± 84% +29.6 42.11 ± 95% -8.8 3.75 ±218% perf-profile.children.cycles-pp.copy_subpage
12.53 ± 84% +29.7 42.25 ± 95% -8.8 3.76 ±218% perf-profile.children.cycles-pp.copy_user_large_folio
12.80 ± 84% +30.9 43.65 ± 95% -9.0 3.83 ±218% perf-profile.children.cycles-pp.hugetlb_wp
2.25 ± 17% -0.7 1.59 ±100% +1.1 3.36 ± 11% perf-profile.self.cycles-pp.nrand48_r
1.74 ± 21% -0.5 1.25 ± 92% +1.2 2.94 ± 13% perf-profile.self.cycles-pp.do_access
0.27 ± 17% -0.1 0.19 ±100% +0.1 0.40 ± 11% perf-profile.self.cycles-pp.lrand48_r
12.41 ± 84% +29.4 41.80 ± 95% -8.7 3.72 ±218% perf-profile.self.cycles-pp.copy_mc_enhanced_fast_string
350208 ± 16% -2.7% 340891 ± 36% -47.2% 184918 ± 9% sched_debug.cfs_rq:/.avg_vruntime.stddev
16833 ±149% -100.0% 3.19 ±100% -100.0% 0.58 ±179% sched_debug.cfs_rq:/.left_deadline.avg
2154658 ±149% -100.0% 317.15 ± 93% -100.0% 74.40 ±179% sched_debug.cfs_rq:/.left_deadline.max
189702 ±149% -100.0% 29.47 ± 94% -100.0% 6.55 ±179% sched_debug.cfs_rq:/.left_deadline.stddev
16833 ±149% -100.0% 3.05 ±102% -100.0% 0.58 ±179% sched_debug.cfs_rq:/.left_vruntime.avg
2154613 ±149% -100.0% 298.70 ± 95% -100.0% 74.06 ±179% sched_debug.cfs_rq:/.left_vruntime.max
189698 ±149% -100.0% 27.96 ± 96% -100.0% 6.52 ±179% sched_debug.cfs_rq:/.left_vruntime.stddev
350208 ± 16% -2.7% 340891 ± 36% -47.2% 184918 ± 9% sched_debug.cfs_rq:/.min_vruntime.stddev
52.88 ± 14% -19.5% 42.56 ± 39% +22.8% 64.94 ± 9% sched_debug.cfs_rq:/.removed.load_avg.stddev
16833 ±149% -100.0% 3.05 ±102% -100.0% 0.58 ±179% sched_debug.cfs_rq:/.right_vruntime.avg
2154613 ±149% -100.0% 298.70 ± 95% -100.0% 74.11 ±179% sched_debug.cfs_rq:/.right_vruntime.max
189698 ±149% -100.0% 27.96 ± 96% -100.0% 6.53 ±179% sched_debug.cfs_rq:/.right_vruntime.stddev
1588 ± 9% -31.2% 1093 ± 18% -20.0% 1270 ± 16% sched_debug.cfs_rq:/.runnable_avg.max
676.36 ± 7% -94.8% 35.08 ± 42% -2.7% 657.82 ± 3% sched_debug.cfs_rq:/.util_est.avg
1339 ± 8% -74.5% 341.42 ± 24% -22.6% 1037 ± 23% sched_debug.cfs_rq:/.util_est.max
152.67 ± 35% -72.3% 42.35 ± 21% -14.9% 129.89 ± 33% sched_debug.cfs_rq:/.util_est.stddev
1116839 ± 7% -7.1% 1037321 ± 4% +22.9% 1372316 ± 11% sched_debug.cpu.avg_idle.max
126915 ± 10% +31.6% 166966 ± 6% -12.2% 111446 ± 2% sched_debug.cpu.clock.avg
126930 ± 10% +31.6% 166977 ± 6% -12.2% 111459 ± 2% sched_debug.cpu.clock.max
126899 ± 10% +31.6% 166949 ± 6% -12.2% 111428 ± 2% sched_debug.cpu.clock.min
126491 ± 10% +31.7% 166537 ± 6% -12.2% 111078 ± 2% sched_debug.cpu.clock_task.avg
126683 ± 10% +31.6% 166730 ± 6% -12.2% 111237 ± 2% sched_debug.cpu.clock_task.max
117365 ± 11% +33.6% 156775 ± 6% -13.0% 102099 ± 2% sched_debug.cpu.clock_task.min
2826 ± 10% +178.1% 7858 ± 8% -10.3% 2534 ± 6% sched_debug.cpu.nr_switches.avg
755.38 ± 15% +423.8% 3956 ± 14% -15.2% 640.33 ± 3% sched_debug.cpu.nr_switches.min
126900 ± 10% +31.6% 166954 ± 6% -12.2% 111432 ± 2% sched_debug.cpu_clk
125667 ± 10% +31.9% 165721 ± 6% -12.3% 110200 ± 2% sched_debug.ktime
0.54 ±141% -99.9% 0.00 ±132% -99.9% 0.00 ±114% sched_debug.rt_rq:.rt_time.avg
69.73 ±141% -99.9% 0.06 ±132% -99.9% 0.07 ±114% sched_debug.rt_rq:.rt_time.max
6.14 ±141% -99.9% 0.01 ±132% -99.9% 0.01 ±114% sched_debug.rt_rq:.rt_time.stddev
127860 ± 10% +31.3% 167917 ± 6% -12.1% 112402 ± 2% sched_debug.sched_clk
15.99 +363.6% 74.14 ± 6% +10.1% 17.61 perf-stat.i.MPKI
1.467e+10 ± 2% -32.0% 9.975e+09 ± 3% +21.3% 1.779e+10 ± 2% perf-stat.i.branch-instructions
0.10 ± 5% +0.6 0.68 ± 5% +0.0 0.11 ± 4% perf-stat.i.branch-miss-rate%
10870114 ± 3% -26.4% 8001551 ± 3% +15.7% 12580898 ± 2% perf-stat.i.branch-misses
97.11 -20.0 77.11 -0.0 97.10 perf-stat.i.cache-miss-rate%
8.118e+08 ± 2% -32.5% 5.482e+08 ± 3% +23.1% 9.992e+08 ± 2% perf-stat.i.cache-misses
8.328e+08 ± 2% -28.4% 5.963e+08 ± 3% +22.8% 1.023e+09 ± 2% perf-stat.i.cache-references
2601 ± 2% +172.3% 7083 ± 3% +2.5% 2665 ± 5% perf-stat.i.context-switches
5.10 +39.5% 7.11 ± 9% -9.2% 4.62 perf-stat.i.cpi
2.826e+11 -44.1% 1.58e+11 ± 2% +5.7% 2.987e+11 ± 2% perf-stat.i.cpu-cycles
216.56 +42.4% 308.33 ± 6% +2.2% 221.23 perf-stat.i.cpu-migrations
358.79 -0.3% 357.70 ± 21% -14.1% 308.23 perf-stat.i.cycles-between-cache-misses
6.286e+10 ± 2% -31.7% 4.293e+10 ± 3% +21.3% 7.626e+10 ± 2% perf-stat.i.instructions
0.24 +39.9% 0.33 ± 4% +13.6% 0.27 perf-stat.i.ipc
5844 -16.9% 4856 ± 2% +12.5% 6577 perf-stat.i.minor-faults
5846 -16.9% 4857 ± 2% +12.5% 6578 perf-stat.i.page-faults
13.00 -2.2% 12.72 +1.2% 13.15 perf-stat.overall.MPKI
0.07 +0.0 0.08 -0.0 0.07 perf-stat.overall.branch-miss-rate%
97.44 -5.3 92.09 +0.2 97.66 perf-stat.overall.cache-miss-rate%
4.51 -18.4% 3.68 -13.0% 3.92 perf-stat.overall.cpi
346.76 -16.6% 289.11 -14.0% 298.06 perf-stat.overall.cycles-between-cache-misses
0.22 +22.6% 0.27 +15.0% 0.26 perf-stat.overall.ipc
10906 -3.4% 10541 -1.1% 10784 perf-stat.overall.path-length
1.445e+10 ± 2% -30.7% 1.001e+10 ± 3% +21.2% 1.752e+10 ± 2% perf-stat.ps.branch-instructions
10469697 ± 3% -23.5% 8005730 ± 3% +18.3% 12387061 ± 2% perf-stat.ps.branch-misses
8.045e+08 ± 2% -31.9% 5.478e+08 ± 3% +22.7% 9.874e+08 ± 2% perf-stat.ps.cache-misses
8.257e+08 ± 2% -27.9% 5.95e+08 ± 3% +22.5% 1.011e+09 ± 2% perf-stat.ps.cache-references
2584 ± 2% +169.3% 6958 ± 3% +2.7% 2654 ± 4% perf-stat.ps.context-switches
2.789e+11 -43.2% 1.583e+11 ± 2% +5.5% 2.943e+11 ± 2% perf-stat.ps.cpu-cycles
214.69 +41.8% 304.37 ± 6% +2.2% 219.46 perf-stat.ps.cpu-migrations
6.19e+10 ± 2% -30.4% 4.309e+10 ± 3% +21.3% 7.507e+10 ± 2% perf-stat.ps.instructions
5849 -18.0% 4799 ± 2% +12.3% 6568 ± 2% perf-stat.ps.minor-faults
5851 -18.0% 4800 ± 2% +12.3% 6570 ± 2% perf-stat.ps.page-faults
1.264e+13 -3.4% 1.222e+13 +0.5% 1.27e+13 perf-stat.total.instructions
* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
2024-07-19 8:42 ` Oliver Sang
@ 2024-07-19 16:06 ` Yu Zhao
2024-08-03 22:07 ` Yu Zhao
0 siblings, 1 reply; 14+ messages in thread
From: Yu Zhao @ 2024-07-19 16:06 UTC (permalink / raw)
To: Oliver Sang
Cc: Janosch Frank, oe-lkp, lkp, Linux Memory Management List,
Andrew Morton, Muchun Song, David Hildenbrand,
Frank van der Linden, Matthew Wilcox, Peter Xu, Yang Shi,
linux-kernel, ying.huang, feng.tang, fengwei.yin,
Christian Borntraeger, Claudio Imbrenda, Marc Hartmayer,
Heiko Carstens, Yosry Ahmed
On Fri, Jul 19, 2024 at 2:44 AM Oliver Sang <oliver.sang@intel.com> wrote:
>
> hi, Yu Zhao,
>
> On Wed, Jul 17, 2024 at 09:44:33AM -0600, Yu Zhao wrote:
> > On Wed, Jul 17, 2024 at 2:36 AM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > Hi Janosch and Oliver,
> > >
> > > On Wed, Jul 17, 2024 at 1:57 AM Janosch Frank <frankja@linux.ibm.com> wrote:
> > > >
> > > > On 7/9/24 07:11, kernel test robot wrote:
> > > > > Hello,
> > > > >
> > > > > kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
> > > > >
> > > > >
> > > > > commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> > > > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > > > >
> > > > > [still regression on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
> > > > >
> > > > This has hit s390 huge page backed KVM guests as well.
> > > > Our simple start/stop test case went from ~5 to over 50 seconds of runtime.
> > >
> > > Could you try the attached patch please? Thank you.
> >
> > Thanks, Yosry, for spotting the following typo:
> > flags &= VMEMMAP_SYNCHRONIZE_RCU;
> > It's supposed to be:
> > flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
> >
> > Reattaching v2 with the above typo fixed. Please let me know, Janosch & Oliver.
>
> Since the commit is in mainline now, I applied your v2 patch directly on top of
> bd225530a4c71 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers").
>
> In our tests, your v2 patch not only recovers the performance regression,
Thanks for verifying the fix!
> it even shows a +13.7% performance improvement over 5a4d8944d6b1e (the parent
> of bd225530a4c71).
Glad to hear!
(The original patch improved and regressed performance at the same
time, but the regression was bigger. The fix removed the regression and
surfaced the improvement.)
> [...]
* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
2024-07-19 16:06 ` Yu Zhao
@ 2024-08-03 22:07 ` Yu Zhao
2024-08-06 3:01 ` Oliver Sang
0 siblings, 1 reply; 14+ messages in thread
From: Yu Zhao @ 2024-08-03 22:07 UTC (permalink / raw)
To: Oliver Sang, Muchun Song
Cc: Janosch Frank, oe-lkp, lkp, Linux Memory Management List,
Andrew Morton, David Hildenbrand, Frank van der Linden,
Matthew Wilcox, Peter Xu, Yang Shi, linux-kernel, ying.huang,
feng.tang, fengwei.yin, Christian Borntraeger, Claudio Imbrenda,
Marc Hartmayer, Heiko Carstens, Yosry Ahmed
[-- Attachment #1: Type: text/plain, Size: 2990 bytes --]
Hi Oliver,
On Fri, Jul 19, 2024 at 10:06 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Fri, Jul 19, 2024 at 2:44 AM Oliver Sang <oliver.sang@intel.com> wrote:
> >
> > hi, Yu Zhao,
> >
> > On Wed, Jul 17, 2024 at 09:44:33AM -0600, Yu Zhao wrote:
> > > On Wed, Jul 17, 2024 at 2:36 AM Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > Hi Janosch and Oliver,
> > > >
> > > > On Wed, Jul 17, 2024 at 1:57 AM Janosch Frank <frankja@linux.ibm.com> wrote:
> > > > >
> > > > > On 7/9/24 07:11, kernel test robot wrote:
> > > > > > Hello,
> > > > > >
> > > > > > kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
> > > > > >
> > > > > >
> > > > > > commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> > > > > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > > > > >
> > > > > > [still regression on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
> > > > > >
> > > > > This has hit s390 huge page backed KVM guests as well.
> > > > > Our simple start/stop test case went from ~5 to over 50 seconds of runtime.
> > > >
> > > > Could you try the attached patch please? Thank you.
> > >
> > > Thanks, Yosry, for spotting the following typo:
> > > flags &= VMEMMAP_SYNCHRONIZE_RCU;
> > > It's supposed to be:
> > > flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
> > >
> > > Reattaching v2 with the above typo fixed. Please let me know, Janosch & Oliver.
> >
> > Since the commit is in mainline now, I applied your v2 patch directly on top of
> > bd225530a4c71 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers").
> >
> > In our tests, your v2 patch not only recovers the performance regression,
>
> Thanks for verifying the fix!
>
> > it even shows a +13.7% performance improvement over 5a4d8944d6b1e (the parent
> > of bd225530a4c71).
>
> Glad to hear!
>
> (The original patch improved and regressed performance at the same
> time, but the regression was bigger. The fix removed the regression and
> surfaced the improvement.)
Can you please run the benchmark again with the attached patch on top
of the last fix?
I spotted something else worth optimizing last time, and with the
patch attached, I was able to measure some significant improvements in
1GB hugeTLB allocation and free time, e.g., when allocating and freeing
700 1GB hugeTLB pages:
Before:
# time echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
real 0m13.500s
user 0m0.000s
sys 0m13.311s
# time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
real 0m11.269s
user 0m0.000s
sys 0m11.187s
After:
# time echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
real 0m10.643s
user 0m0.001s
sys 0m10.487s
# time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
real 0m1.541s
user 0m0.000s
sys 0m1.528s
Thanks!
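For context on why the free path in particular gets so much faster, a plausible explanation is visible in the deletions in the attached patch: the old free path went through __destroy_compound_gigantic_folio(), which walks every tail page of a gigantic folio to undo the hand-built compound state, whereas with __GFP_COMP the page allocator owns that metadata and the walk disappears. A standalone back-of-the-envelope sketch of the per-page work being removed, assuming 4KB base pages and order-18 1GB huge pages on x86-64:

#include <stdio.h>

int main(void)
{
	unsigned int order = 18;			/* 1GB huge page, 4KB base pages */
	unsigned long tail_pages = (1UL << order) - 1;	/* tail pages touched per folio */
	unsigned long folios = 700;			/* as in the timing above */

	printf("per-folio tail-page updates: %lu\n", tail_pages);
	printf("total for %lu folios: %lu\n", folios, folios * tail_pages);
	return 0;
}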
[-- Attachment #2: hugetlb.patch --]
[-- Type: application/octet-stream, Size: 22480 bytes --]
diff --git a/include/linux/cma.h b/include/linux/cma.h
index 9db877506ea8..3d58ce1a8730 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -46,6 +46,7 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
struct cma **res_cma);
extern struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned int align,
bool no_warn);
+extern struct folio *cma_alloc_folio(struct cma *cma, int order);
extern bool cma_pages_valid(struct cma *cma, const struct page *pages, unsigned long count);
extern bool cma_release(struct cma *cma, const struct page *pages, unsigned long count);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c9bf68c239a0..630ab4f5f78d 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -900,9 +900,9 @@ static inline bool hugepage_movable_supported(struct hstate *h)
static inline gfp_t htlb_alloc_mask(struct hstate *h)
{
if (hugepage_movable_supported(h))
- return GFP_HIGHUSER_MOVABLE;
+ return GFP_HIGHUSER_MOVABLE | __GFP_COMP;
else
- return GFP_HIGHUSER;
+ return GFP_HIGHUSER | __GFP_COMP;
}
static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
diff --git a/mm/cma.c b/mm/cma.c
index 3e9724716bad..39b6b99c6af1 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -403,18 +403,8 @@ static void cma_debug_show_areas(struct cma *cma)
spin_unlock_irq(&cma->lock);
}
-/**
- * cma_alloc() - allocate pages from contiguous area
- * @cma: Contiguous memory region for which the allocation is performed.
- * @count: Requested number of pages.
- * @align: Requested alignment of pages (in PAGE_SIZE order).
- * @no_warn: Avoid printing message about failed allocation
- *
- * This function allocates part of contiguous memory on specific
- * contiguous memory area.
- */
-struct page *cma_alloc(struct cma *cma, unsigned long count,
- unsigned int align, bool no_warn)
+static struct page *__cma_alloc(struct cma *cma, unsigned long count,
+ unsigned int align, gfp_t gfp)
{
unsigned long mask, offset;
unsigned long pfn = -1;
@@ -463,8 +453,7 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
mutex_lock(&cma_mutex);
- ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
- GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));
+ ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp);
mutex_unlock(&cma_mutex);
if (ret == 0) {
page = pfn_to_page(pfn);
@@ -494,7 +483,7 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
page_kasan_tag_reset(nth_page(page, i));
}
- if (ret && !no_warn) {
+ if (ret && !(gfp & __GFP_NOWARN)) {
pr_err_ratelimited("%s: %s: alloc failed, req-size: %lu pages, ret: %d\n",
__func__, cma->name, count, ret);
cma_debug_show_areas(cma);
@@ -513,6 +502,31 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
return page;
}
+/**
+ * cma_alloc() - allocate pages from contiguous area
+ * @cma: Contiguous memory region for which the allocation is performed.
+ * @count: Requested number of pages.
+ * @align: Requested alignment of pages (in PAGE_SIZE order).
+ * @no_warn: Avoid printing message about failed allocation
+ *
+ * This function allocates part of contiguous memory on specific
+ * contiguous memory area.
+ */
+struct page *cma_alloc(struct cma *cma, unsigned long count,
+ unsigned int align, bool no_warn)
+{
+ return __cma_alloc(cma, count, align, GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));
+}
+
+struct folio *cma_alloc_folio(struct cma *cma, int order)
+{
+ struct page *page;
+
+ page = __cma_alloc(cma, 1 << order, order, GFP_KERNEL | __GFP_COMP);
+
+ return page ? page_folio(page) : NULL;
+}
+
bool cma_pages_valid(struct cma *cma, const struct page *pages,
unsigned long count)
{
diff --git a/mm/compaction.c b/mm/compaction.c
index eb95e9b435d0..00fb571727d3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -86,33 +86,6 @@ static struct page *mark_allocated_noprof(struct page *page, unsigned int order,
}
#define mark_allocated(...) alloc_hooks(mark_allocated_noprof(__VA_ARGS__))
-static void split_map_pages(struct list_head *freepages)
-{
- unsigned int i, order;
- struct page *page, *next;
- LIST_HEAD(tmp_list);
-
- for (order = 0; order < NR_PAGE_ORDERS; order++) {
- list_for_each_entry_safe(page, next, &freepages[order], lru) {
- unsigned int nr_pages;
-
- list_del(&page->lru);
-
- nr_pages = 1 << order;
-
- mark_allocated(page, order, __GFP_MOVABLE);
- if (order)
- split_page(page, order);
-
- for (i = 0; i < nr_pages; i++) {
- list_add(&page->lru, &tmp_list);
- page++;
- }
- }
- list_splice_init(&tmp_list, &freepages[0]);
- }
-}
-
static unsigned long release_free_list(struct list_head *freepages)
{
int order;
@@ -754,10 +727,9 @@ isolate_freepages_range(struct compact_control *cc,
{
unsigned long isolated, pfn, block_start_pfn, block_end_pfn;
int order;
- struct list_head tmp_freepages[NR_PAGE_ORDERS];
for (order = 0; order < NR_PAGE_ORDERS; order++)
- INIT_LIST_HEAD(&tmp_freepages[order]);
+ INIT_LIST_HEAD(&cc->freepages[order]);
pfn = start_pfn;
block_start_pfn = pageblock_start_pfn(pfn);
@@ -788,7 +760,7 @@ isolate_freepages_range(struct compact_control *cc,
break;
isolated = isolate_freepages_block(cc, &isolate_start_pfn,
- block_end_pfn, tmp_freepages, 0, true);
+ block_end_pfn, cc->freepages, 0, true);
/*
* In strict mode, isolate_freepages_block() returns 0 if
@@ -807,13 +779,10 @@ isolate_freepages_range(struct compact_control *cc,
if (pfn < end_pfn) {
/* Loop terminated early, cleanup. */
- release_free_list(tmp_freepages);
+ release_free_list(cc->freepages);
return 0;
}
- /* __isolate_free_page() does not map the pages */
- split_map_pages(tmp_freepages);
-
/* We don't use freelists for anything. */
return pfn;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index aaf508be0a2b..2061d094cd19 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1512,43 +1512,7 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
((node = hstate_next_node_to_free(hs, mask)) || 1); \
nr_nodes--)
-/* used to demote non-gigantic_huge pages as well */
-static void __destroy_compound_gigantic_folio(struct folio *folio,
- unsigned int order, bool demote)
-{
- int i;
- int nr_pages = 1 << order;
- struct page *p;
-
- atomic_set(&folio->_entire_mapcount, 0);
- atomic_set(&folio->_large_mapcount, 0);
- atomic_set(&folio->_pincount, 0);
-
- for (i = 1; i < nr_pages; i++) {
- p = folio_page(folio, i);
- p->flags &= ~PAGE_FLAGS_CHECK_AT_FREE;
- p->mapping = NULL;
- clear_compound_head(p);
- if (!demote)
- set_page_refcounted(p);
- }
-
- __folio_clear_head(folio);
-}
-
-static void destroy_compound_hugetlb_folio_for_demote(struct folio *folio,
- unsigned int order)
-{
- __destroy_compound_gigantic_folio(folio, order, true);
-}
-
#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-static void destroy_compound_gigantic_folio(struct folio *folio,
- unsigned int order)
-{
- __destroy_compound_gigantic_folio(folio, order, false);
-}
-
static void free_gigantic_folio(struct folio *folio, unsigned int order)
{
/*
@@ -1569,38 +1533,52 @@ static void free_gigantic_folio(struct folio *folio, unsigned int order)
static struct folio *alloc_gigantic_folio(struct hstate *h, gfp_t gfp_mask,
int nid, nodemask_t *nodemask)
{
- struct page *page;
- unsigned long nr_pages = pages_per_huge_page(h);
+ struct folio *folio;
+ int order = huge_page_order(h);
+ bool retry = false;
+
if (nid == NUMA_NO_NODE)
nid = numa_mem_id();
-
+retry:
+ folio = NULL;
#ifdef CONFIG_CMA
{
int node;
- if (hugetlb_cma[nid]) {
- page = cma_alloc(hugetlb_cma[nid], nr_pages,
- huge_page_order(h), true);
- if (page)
- return page_folio(page);
- }
+ if (hugetlb_cma[nid])
+ folio = cma_alloc_folio(hugetlb_cma[nid], order);
- if (!(gfp_mask & __GFP_THISNODE)) {
+ if (!folio && !(gfp_mask & __GFP_THISNODE)) {
for_each_node_mask(node, *nodemask) {
if (node == nid || !hugetlb_cma[node])
continue;
- page = cma_alloc(hugetlb_cma[node], nr_pages,
- huge_page_order(h), true);
- if (page)
- return page_folio(page);
+ folio = cma_alloc_folio(hugetlb_cma[node], order);
+ if (folio)
+ break;
}
}
}
#endif
+ if (!folio) {
+ struct page *page = alloc_contig_pages(1 << order, gfp_mask, nid, nodemask);
- page = alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask);
- return page ? page_folio(page) : NULL;
+ if (!page)
+ return NULL;
+
+ folio = page_folio(page);
+ }
+
+ if (folio_ref_freeze(folio, 1))
+ return folio;
+
+ pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
+ free_gigantic_folio(folio, order);
+ if (!retry) {
+ retry = true;
+ goto retry;
+ }
+ return NULL;
}
#else /* !CONFIG_CONTIG_ALLOC */
@@ -1619,8 +1597,6 @@ static struct folio *alloc_gigantic_folio(struct hstate *h, gfp_t gfp_mask,
}
static inline void free_gigantic_folio(struct folio *folio,
unsigned int order) { }
-static inline void destroy_compound_gigantic_folio(struct folio *folio,
- unsigned int order) { }
#endif
/*
@@ -1747,19 +1723,17 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
folio_clear_hugetlb_hwpoison(folio);
folio_ref_unfreeze(folio, 1);
+ INIT_LIST_HEAD(&folio->_deferred_list);
/*
* Non-gigantic pages demoted from CMA allocated gigantic pages
* need to be given back to CMA in free_gigantic_folio.
*/
if (hstate_is_gigantic(h) ||
- hugetlb_cma_folio(folio, huge_page_order(h))) {
- destroy_compound_gigantic_folio(folio, huge_page_order(h));
+ hugetlb_cma_folio(folio, huge_page_order(h)))
free_gigantic_folio(folio, huge_page_order(h));
- } else {
- INIT_LIST_HEAD(&folio->_deferred_list);
+ else
folio_put(folio);
- }
}
/*
@@ -2032,95 +2006,6 @@ static void prep_new_hugetlb_folio(struct hstate *h, struct folio *folio, int ni
spin_unlock_irq(&hugetlb_lock);
}
-static bool __prep_compound_gigantic_folio(struct folio *folio,
- unsigned int order, bool demote)
-{
- int i, j;
- int nr_pages = 1 << order;
- struct page *p;
-
- __folio_clear_reserved(folio);
- for (i = 0; i < nr_pages; i++) {
- p = folio_page(folio, i);
-
- /*
- * For gigantic hugepages allocated through bootmem at
- * boot, it's safer to be consistent with the not-gigantic
- * hugepages and clear the PG_reserved bit from all tail pages
- * too. Otherwise drivers using get_user_pages() to access tail
- * pages may get the reference counting wrong if they see
- * PG_reserved set on a tail page (despite the head page not
- * having PG_reserved set). Enforcing this consistency between
- * head and tail pages allows drivers to optimize away a check
- * on the head page when they need know if put_page() is needed
- * after get_user_pages().
- */
- if (i != 0) /* head page cleared above */
- __ClearPageReserved(p);
- /*
- * Subtle and very unlikely
- *
- * Gigantic 'page allocators' such as memblock or cma will
- * return a set of pages with each page ref counted. We need
- * to turn this set of pages into a compound page with tail
- * page ref counts set to zero. Code such as speculative page
- * cache adding could take a ref on a 'to be' tail page.
- * We need to respect any increased ref count, and only set
- * the ref count to zero if count is currently 1. If count
- * is not 1, we return an error. An error return indicates
- * the set of pages can not be converted to a gigantic page.
- * The caller who allocated the pages should then discard the
- * pages using the appropriate free interface.
- *
- * In the case of demote, the ref count will be zero.
- */
- if (!demote) {
- if (!page_ref_freeze(p, 1)) {
- pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
- goto out_error;
- }
- } else {
- VM_BUG_ON_PAGE(page_count(p), p);
- }
- if (i != 0)
- set_compound_head(p, &folio->page);
- }
- __folio_set_head(folio);
- /* we rely on prep_new_hugetlb_folio to set the hugetlb flag */
- folio_set_order(folio, order);
- atomic_set(&folio->_entire_mapcount, -1);
- atomic_set(&folio->_large_mapcount, -1);
- atomic_set(&folio->_pincount, 0);
- return true;
-
-out_error:
- /* undo page modifications made above */
- for (j = 0; j < i; j++) {
- p = folio_page(folio, j);
- if (j != 0)
- clear_compound_head(p);
- set_page_refcounted(p);
- }
- /* need to clear PG_reserved on remaining tail pages */
- for (; j < nr_pages; j++) {
- p = folio_page(folio, j);
- __ClearPageReserved(p);
- }
- return false;
-}
-
-static bool prep_compound_gigantic_folio(struct folio *folio,
- unsigned int order)
-{
- return __prep_compound_gigantic_folio(folio, order, false);
-}
-
-static bool prep_compound_gigantic_folio_for_demote(struct folio *folio,
- unsigned int order)
-{
- return __prep_compound_gigantic_folio(folio, order, true);
-}
-
/*
* Find and lock address space (mapping) in write mode.
*
@@ -2159,7 +2044,7 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
*/
if (node_alloc_noretry && node_isset(nid, *node_alloc_noretry))
alloc_try_hard = false;
- gfp_mask |= __GFP_COMP|__GFP_NOWARN;
+ gfp_mask |= __GFP_NOWARN;
if (alloc_try_hard)
gfp_mask |= __GFP_RETRY_MAYFAIL;
if (nid == NUMA_NO_NODE)
@@ -2206,48 +2091,14 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
return folio;
}
-static struct folio *__alloc_fresh_hugetlb_folio(struct hstate *h,
- gfp_t gfp_mask, int nid, nodemask_t *nmask,
- nodemask_t *node_alloc_noretry)
-{
- struct folio *folio;
- bool retry = false;
-
-retry:
- if (hstate_is_gigantic(h))
- folio = alloc_gigantic_folio(h, gfp_mask, nid, nmask);
- else
- folio = alloc_buddy_hugetlb_folio(h, gfp_mask,
- nid, nmask, node_alloc_noretry);
- if (!folio)
- return NULL;
-
- if (hstate_is_gigantic(h)) {
- if (!prep_compound_gigantic_folio(folio, huge_page_order(h))) {
- /*
- * Rare failure to convert pages to compound page.
- * Free pages and try again - ONCE!
- */
- free_gigantic_folio(folio, huge_page_order(h));
- if (!retry) {
- retry = true;
- goto retry;
- }
- return NULL;
- }
- }
-
- return folio;
-}
-
static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
gfp_t gfp_mask, int nid, nodemask_t *nmask,
nodemask_t *node_alloc_noretry)
{
struct folio *folio;
- folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask,
- node_alloc_noretry);
+ folio = hstate_is_gigantic(h) ? alloc_gigantic_folio(h, gfp_mask, nid, nmask) :
+ alloc_buddy_hugetlb_folio(h, gfp_mask, nid, nmask, node_alloc_noretry);
if (folio)
init_new_hugetlb_folio(h, folio);
return folio;
@@ -2265,7 +2116,8 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
{
struct folio *folio;
- folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, NULL);
+ folio = hstate_is_gigantic(h) ? alloc_gigantic_folio(h, gfp_mask, nid, nmask) :
+ alloc_buddy_hugetlb_folio(h, gfp_mask, nid, nmask, NULL);
if (!folio)
return NULL;
@@ -3333,6 +3185,7 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
struct page *page = pfn_to_page(pfn);
+ __ClearPageReserved(folio_page(folio, pfn - head_pfn));
__init_single_page(page, pfn, zone, nid);
prep_compound_tail((struct page *)folio, pfn - head_pfn);
ret = page_ref_freeze(page, 1);
@@ -3950,11 +3803,9 @@ static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio)
}
}
- /*
- * Use destroy_compound_hugetlb_folio_for_demote for all huge page
- * sizes as it will not ref count folios.
- */
- destroy_compound_hugetlb_folio_for_demote(folio, huge_page_order(h));
+ split_page_memcg(&folio->page, huge_page_order(h), huge_page_order(target_hstate));
+ split_page_owner(&folio->page, huge_page_order(h), huge_page_order(target_hstate));
+ pgalloc_tag_split(&folio->page, 1 << huge_page_order(h));
/*
* Taking target hstate mutex synchronizes with set_max_huge_pages.
@@ -3969,11 +3820,7 @@ static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio)
i += pages_per_huge_page(target_hstate)) {
subpage = folio_page(folio, i);
inner_folio = page_folio(subpage);
- if (hstate_is_gigantic(target_hstate))
- prep_compound_gigantic_folio_for_demote(inner_folio,
- target_hstate->order);
- else
- prep_compound_page(subpage, target_hstate->order);
+ prep_compound_page(subpage, target_hstate->order);
folio_change_private(inner_folio, NULL);
prep_new_hugetlb_folio(target_hstate, inner_folio, nid);
free_huge_folio(inner_folio);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 28f80daf5c04..4ecf2c9428f3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1192,16 +1192,36 @@ static void free_pcppages_bulk(struct zone *zone, int count,
spin_unlock_irqrestore(&zone->lock, flags);
}
+/* Split a multi-block free page into its individual pageblocks */
+static void split_large_buddy(struct zone *zone, struct page *page,
+ unsigned long pfn, int order, fpi_t fpi_flags)
+{
+ unsigned long end_pfn = pfn + (1 << order);
+
+ VM_WARN_ON_ONCE(pfn & ((1 << order) - 1));
+ /* Caller removed page from freelist, buddy info cleared! */
+ VM_WARN_ON_ONCE(PageBuddy(page));
+
+ if (order > pageblock_order)
+ order = pageblock_order;
+
+ while (pfn != end_pfn) {
+ int mt = get_pfnblock_migratetype(page, pfn);
+
+ __free_one_page(page, pfn, zone, order, mt, fpi_flags);
+ pfn += 1 << order;
+ page = pfn_to_page(pfn);
+ }
+}
+
static void free_one_page(struct zone *zone, struct page *page,
unsigned long pfn, unsigned int order,
fpi_t fpi_flags)
{
unsigned long flags;
- int migratetype;
spin_lock_irqsave(&zone->lock, flags);
- migratetype = get_pfnblock_migratetype(page, pfn);
- __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
+ split_large_buddy(zone, page, pfn, order, fpi_flags);
spin_unlock_irqrestore(&zone->lock, flags);
}
@@ -1693,27 +1713,6 @@ static unsigned long find_large_buddy(unsigned long start_pfn)
return start_pfn;
}
-/* Split a multi-block free page into its individual pageblocks */
-static void split_large_buddy(struct zone *zone, struct page *page,
- unsigned long pfn, int order)
-{
- unsigned long end_pfn = pfn + (1 << order);
-
- VM_WARN_ON_ONCE(order <= pageblock_order);
- VM_WARN_ON_ONCE(pfn & (pageblock_nr_pages - 1));
-
- /* Caller removed page from freelist, buddy info cleared! */
- VM_WARN_ON_ONCE(PageBuddy(page));
-
- while (pfn != end_pfn) {
- int mt = get_pfnblock_migratetype(page, pfn);
-
- __free_one_page(page, pfn, zone, pageblock_order, mt, FPI_NONE);
- pfn += pageblock_nr_pages;
- page = pfn_to_page(pfn);
- }
-}
-
/**
* move_freepages_block_isolate - move free pages in block for page isolation
* @zone: the zone
@@ -1754,7 +1753,7 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
del_page_from_free_list(buddy, zone, order,
get_pfnblock_migratetype(buddy, pfn));
set_pageblock_migratetype(page, migratetype);
- split_large_buddy(zone, buddy, pfn, order);
+ split_large_buddy(zone, buddy, pfn, order, FPI_NONE);
return true;
}
@@ -1765,7 +1764,7 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
del_page_from_free_list(page, zone, order,
get_pfnblock_migratetype(page, pfn));
set_pageblock_migratetype(page, migratetype);
- split_large_buddy(zone, page, pfn, order);
+ split_large_buddy(zone, page, pfn, order, FPI_NONE);
return true;
}
move:
@@ -6439,6 +6438,40 @@ int __alloc_contig_migrate_range(struct compact_control *cc,
return (ret < 0) ? ret : 0;
}
+static struct page *mark_allocated_noprof(struct page *page, unsigned int order, gfp_t gfp_flags)
+{
+ post_alloc_hook(page, order, __GFP_MOVABLE);
+ return page;
+}
+#define mark_allocated(...) alloc_hooks(mark_allocated_noprof(__VA_ARGS__))
+
+static void split_free_pages(struct list_head *freepages)
+{
+ unsigned int i, order;
+ struct page *page, *next;
+ LIST_HEAD(tmp_list);
+
+ for (order = 0; order < NR_PAGE_ORDERS; order++) {
+ list_for_each_entry_safe(page, next, &freepages[order], lru) {
+ unsigned int nr_pages;
+
+ list_del(&page->lru);
+
+ nr_pages = 1 << order;
+
+ mark_allocated(page, order, __GFP_MOVABLE);
+ if (order)
+ split_page(page, order);
+
+ for (i = 0; i < nr_pages; i++) {
+ list_add(&page->lru, &tmp_list);
+ page++;
+ }
+ }
+ list_splice_init(&tmp_list, &freepages[0]);
+ }
+}
+
/**
* alloc_contig_range() -- tries to allocate given range of pages
* @start: start PFN to allocate
@@ -6551,12 +6584,25 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end,
goto done;
}
- /* Free head and tail (if any) */
- if (start != outer_start)
- free_contig_range(outer_start, start - outer_start);
- if (end != outer_end)
- free_contig_range(end, outer_end - end);
+ if (!(gfp_mask & __GFP_COMP)) {
+ split_free_pages(cc.freepages);
+ /* Free head and tail (if any) */
+ if (start != outer_start)
+ free_contig_range(outer_start, start - outer_start);
+ if (end != outer_end)
+ free_contig_range(end, outer_end - end);
+ } else if (start == outer_start && end == outer_end && is_power_of_2(end - start)) {
+ struct page *head = pfn_to_page(start);
+ int order = ilog2(end - start);
+
+ check_new_pages(head, order);
+ prep_new_page(head, order, gfp_mask, 0);
+ } else {
+ ret = -EINVAL;
+ WARN(true, "PFN range: requested [%lu, %lu), leaked [%lu, %lu)\n",
+ start, end, outer_start, outer_end);
+ }
done:
undo_isolate_page_range(start, end, migratetype);
return ret;
@@ -6665,6 +6711,18 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
void free_contig_range(unsigned long pfn, unsigned long nr_pages)
{
unsigned long count = 0;
+ struct folio *folio = pfn_folio(pfn);
+
+ if (folio_test_large(folio)) {
+ int expected = folio_nr_pages(folio);
+
+ if (nr_pages == expected)
+ folio_put(folio);
+ else
+ WARN(true, "PFN %lu: nr_pages %lu != expected %d\n",
+ pfn, nr_pages, expected);
+ return;
+ }
for (; nr_pages--; pfn++) {
struct page *page = pfn_to_page(pfn);
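As a usage sketch of the new __GFP_COMP contract above: a minimal,
hypothetical caller (grab_1g_folio()/drop_1g_folio() and the 1GB size are
illustrative assumptions; alloc_contig_pages() and free_contig_range() are
the real interfaces changed by these hunks):

    #include <linux/gfp.h>
    #include <linux/mm.h>

    /*
     * With __GFP_COMP, alloc_contig_range() preps the naturally aligned
     * range as a single compound page instead of splitting it into
     * order-0 pages, and free_contig_range() releases it with one
     * folio_put() instead of a per-page loop.
     */
    static struct folio *grab_1g_folio(int nid)
    {
            unsigned long nr = 1UL << (30 - PAGE_SHIFT); /* 1GB of base pages */
            struct page *page;

            page = alloc_contig_pages(nr, GFP_KERNEL | __GFP_COMP, nid, NULL);

            return page ? page_folio(page) : NULL;
    }

    static void drop_1g_folio(struct folio *folio)
    {
            free_contig_range(folio_pfn(folio), folio_nr_pages(folio));
    }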
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
2024-08-03 22:07 ` Yu Zhao
@ 2024-08-06 3:01 ` Oliver Sang
0 siblings, 0 replies; 14+ messages in thread
From: Oliver Sang @ 2024-08-06 3:01 UTC (permalink / raw)
To: Yu Zhao
Cc: Muchun Song, Janosch Frank, oe-lkp, lkp,
Linux Memory Management List, Andrew Morton, David Hildenbrand,
Frank van der Linden, Matthew Wilcox, Peter Xu, Yang Shi,
linux-kernel, ying.huang, feng.tang, fengwei.yin,
Christian Borntraeger, Claudio Imbrenda, Marc Hartmayer,
Heiko Carstens, Yosry Ahmed, oliver.sang
hi, Yu Zhao,
On Sat, Aug 03, 2024 at 04:07:55PM -0600, Yu Zhao wrote:
> Hi Oliver,
>
> On Fri, Jul 19, 2024 at 10:06 AM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Fri, Jul 19, 2024 at 2:44 AM Oliver Sang <oliver.sang@intel.com> wrote:
> > >
> > > hi, Yu Zhao,
> > >
> > > On Wed, Jul 17, 2024 at 09:44:33AM -0600, Yu Zhao wrote:
> > > > On Wed, Jul 17, 2024 at 2:36 AM Yu Zhao <yuzhao@google.com> wrote:
> > > > >
> > > > > Hi Janosch and Oliver,
> > > > >
> > > > > On Wed, Jul 17, 2024 at 1:57 AM Janosch Frank <frankja@linux.ibm.com> wrote:
> > > > > >
> > > > > > On 7/9/24 07:11, kernel test robot wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
> > > > > > >
> > > > > > >
> > > > > > > commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> > > > > > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > > > > > >
> > > > > > > [still regression on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
> > > > > > >
> > > > > > This has hit s390 huge page backed KVM guests as well.
> > > > > > Our simple start/stop test case went from ~5 to over 50 seconds of runtime.
> > > > >
> > > > > Could you try the attached patch please? Thank you.
> > > >
> > > > Thanks, Yosry, for spotting the following typo:
> > > > flags &= VMEMMAP_SYNCHRONIZE_RCU;
> > > > It's supposed to be:
> > > > flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
> > > >
> > > > Reattaching v2 with the above typo fixed. Please let me know, Janosch & Oliver.
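A quick aside on why the missing '~' matters, as a sketch with hypothetical
values (assuming VMEMMAP_SYNCHRONIZE_RCU is a single-bit flag, say BIT(2)):

    unsigned long flags = 0x7;          /* bits 0..2 set */

    flags &= VMEMMAP_SYNCHRONIZE_RCU;   /* typo: keeps only that bit, flags == 0x4 */

    flags = 0x7;
    flags &= ~VMEMMAP_SYNCHRONIZE_RCU;  /* intended: clears only that bit, flags == 0x3 */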
> > >
> > > since the commit is in mainline now, I directly apply your v2 patch upon
> > > bd225530a4c71 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> > >
> > > in our tests, your v2 patch not only recovers the performance regression,
> >
> > Thanks for verifying the fix!
> >
> > > it even shows a +13.7% performance improvement over 5a4d8944d6b1e (the
> > > parent of bd225530a4c71)
> >
> > Glad to hear!
> >
> > (The original patch improved and regressed the performance at the same
> > time, but the regression was bigger. The fix removed the regression and
> > surfaced the improvement.)
>
> Can you please run the benchmark again with the attached patch on top
> of the last fix?
Last time, I applied your previous fix (1) directly on top of mainline commit (2):
9a5b87b521401 fix for 875fa64577 (now bd225530a4 in mainline) <--- (1)
bd225530a4c71 mm/hugetlb_vmemmap: fix race with speculative PFN walkers <--- (2)
This time, your new patch failed to apply on top of (1). I then found that (1)
applies on top of mainline commit (3), giving (4), and that your new patch
applies cleanly on top of (4), giving (5):
e2b8dff50992a new hugetlb-20240805.patch <--- (5)
b5af188232e56 v2 fix for bd225530a4, applied on mainline tip 17712b7ea0756 <--- (4)
17712b7ea0756 Merge tag 'io_uring-6.11-20240802' of git://git.kernel.dk/linux <--- (3)
I tested (3), (4), and (5) and compared them with bd225530a4c71 and its parent;
details are below [1].
You may notice that the data for bd225530a4c71 and its parent differ from the
previous report. This is because we found a problem with gcc-13 and have
switched to gcc-12; our kernel config changed as well.
We have the following observations:
* bd225530a4c71 still shows a similar -36.6% regression compared to its parent
* 17712b7ea0756 shows similar data to bd225530a4c71 (slightly worse: -39.2%
  compared to 5a4d8944d6b1e, the parent of bd225530a4c71)
* your last fix still recovers the regression, but is not better than
  5a4d8944d6b1e
* your patch this time does not seem to impact the performance data much
[1]
=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/300s/512G/lkp-icl-2sp2/anon-cow-rand-hugetlb/vm-scalability
commit:
5a4d8944d6b1e ("cachestat: do not flush stats in recency check")
bd225530a4c71 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
17712b7ea0756 ("Merge tag 'io_uring-6.11-20240802' of git://git.kernel.dk/linux")
b5af188232e56 <--- apply your last fix upon 17712b7ea0756
e2b8dff50992a <--- then apply your patch this time upon b5af188232e56
5a4d8944d6b1e1aa bd225530a4c717714722c373144 17712b7ea0756799635ba159cc7 b5af188232e564d17fc3c1784f7 e2b8dff50992a56c67308f905bd
---------------- --------------------------- --------------------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev %change %stddev %change %stddev
\ | \ | \ | \ | \
3.312e+09 ± 34% +472.2% 1.895e+10 ± 3% +487.2% 1.945e+10 ± 7% -4.4% 3.167e+09 ± 29% -15.6% 2.795e+09 ± 29% cpuidle..time
684985 ± 5% +1112.3% 8304355 ± 2% +1099.5% 8216278 -2.4% 668573 ± 5% -5.2% 649406 ± 2% cpuidle..usage
231.53 ± 3% +40.7% 325.70 ± 2% +45.1% 335.98 ± 5% +1.4% 234.78 ± 4% +0.2% 231.94 ± 3% uptime.boot
10015 ± 10% +156.8% 25723 ± 4% +156.8% 25724 ± 7% +4.3% 10447 ± 12% -0.7% 9945 ± 13% uptime.idle
577860 ± 7% +18.1% 682388 ± 8% +12.1% 647808 ± 6% +9.9% 635341 ± 6% -0.3% 576189 ± 5% numa-numastat.node0.local_node
624764 ± 5% +16.1% 725128 ± 4% +18.0% 736975 ± 2% +10.2% 688399 ± 2% +3.0% 643587 ± 5% numa-numastat.node0.numa_hit
647823 ± 5% +11.3% 721266 ± 9% +15.7% 749411 ± 6% -10.0% 583278 ± 5% -1.0% 641117 ± 3% numa-numastat.node1.local_node
733550 ± 4% +10.6% 811157 ± 4% +8.4% 795091 ± 3% -9.0% 667814 ± 3% -3.4% 708807 ± 3% numa-numastat.node1.numa_hit
6.17 ±108% +1521.6% 100.00 ± 38% +26137.8% 1618 ±172% -74.1% 1.60 ± 84% +27.0% 7.83 ±114% perf-c2c.DRAM.local
46.17 ± 43% +2759.6% 1320 ± 26% +12099.6% 5632 ±112% +18.3% 54.60 ± 56% +48.7% 68.67 ± 42% perf-c2c.DRAM.remote
36.50 ± 52% +1526.5% 593.67 ± 26% +1305.5% 513.00 ± 53% +2.5% 37.40 ± 46% +62.6% 59.33 ± 66% perf-c2c.HITM.local
15.33 ± 74% +2658.7% 423.00 ± 36% +2275.0% 364.17 ± 67% +48.7% 22.80 ± 75% +122.8% 34.17 ± 58% perf-c2c.HITM.remote
15.34 ± 27% +265.8% 56.12 +256.0% 54.63 -2.5% 14.96 ± 23% -12.7% 13.39 ± 23% vmstat.cpu.id
73.93 ± 5% -41.4% 43.30 ± 2% -39.3% 44.85 ± 2% +0.5% 74.27 ± 4% +2.4% 75.72 ± 3% vmstat.cpu.us
110.76 ± 4% -47.2% 58.47 ± 2% -45.7% 60.14 ± 2% +0.1% 110.90 ± 4% +1.9% 112.84 ± 3% vmstat.procs.r
2729 ± 3% +167.3% 7294 ± 2% +155.7% 6979 ± 4% +0.2% 2734 -1.3% 2692 ± 5% vmstat.system.cs
150274 ± 5% -23.2% 115398 ± 6% -27.2% 109377 ± 13% +0.6% 151130 ± 4% +0.9% 151666 ± 3% vmstat.system.in
14.31 ± 29% +41.4 55.74 +40.0 54.31 -0.5 13.85 ± 25% -1.9 12.42 ± 24% mpstat.cpu.all.idle%
0.34 ± 5% -0.1 0.21 ± 2% -0.1 0.21 ± 2% -0.0 0.34 ± 4% +0.0 0.35 ± 4% mpstat.cpu.all.irq%
0.02 ± 4% +0.0 0.03 +0.0 0.03 ± 4% -0.0 0.02 ± 2% -0.0 0.02 ± 2% mpstat.cpu.all.soft%
10.63 ± 4% -10.2 0.43 ± 4% -10.3 0.35 ± 29% +0.1 10.71 ± 2% +0.2 10.79 ± 4% mpstat.cpu.all.sys%
74.69 ± 5% -31.1 43.59 ± 2% -29.6 45.10 ± 2% +0.4 75.08 ± 4% +1.7 76.42 ± 3% mpstat.cpu.all.usr%
6.83 ± 15% +380.5% 32.83 ± 45% +217.1% 21.67 ± 5% +40.5% 9.60 ± 41% -7.3% 6.33 ± 7% mpstat.max_utilization.seconds
0.71 ± 55% +0.4 1.14 ± 3% +0.2 0.96 ± 44% +0.4 1.09 ± 4% +0.2 0.91 ± 30% perf-profile.calltrace.cycles-pp.lrand48_r
65.57 ± 10% +3.5 69.09 -7.3 58.23 ± 45% +2.4 67.94 +7.2 72.76 perf-profile.calltrace.cycles-pp.do_rw_once
0.06 ± 7% -0.0 0.05 ± 46% +0.0 0.11 ± 48% +0.0 0.07 ± 16% +0.0 0.08 ± 16% perf-profile.children.cycles-pp.get_jiffies_update
0.28 ± 10% +0.0 0.29 ± 8% +0.3 0.58 ± 74% +0.0 0.30 ± 13% +0.0 0.32 ± 12% perf-profile.children.cycles-pp.__hrtimer_run_queues
0.24 ± 10% +0.0 0.25 ± 10% +0.2 0.46 ± 66% +0.0 0.26 ± 13% +0.0 0.28 ± 12% perf-profile.children.cycles-pp.update_process_times
0.06 ± 7% -0.0 0.05 ± 46% +0.0 0.11 ± 48% +0.0 0.07 ± 16% +0.0 0.08 ± 16% perf-profile.self.cycles-pp.get_jiffies_update
0.50 ± 7% +0.1 0.56 ± 3% -0.0 0.46 ± 44% +0.0 0.53 ± 3% +0.0 0.52 ± 6% perf-profile.self.cycles-pp.lrand48_r@plt
26722 ± 4% -33.8% 17690 ± 15% -40.0% 16038 ± 29% +6.9% 28560 ± 3% +9.0% 29116 ± 3% numa-meminfo.node0.HugePages_Surp
26722 ± 4% -33.8% 17690 ± 15% -40.0% 16038 ± 29% +6.9% 28560 ± 3% +9.0% 29116 ± 3% numa-meminfo.node0.HugePages_Total
74013758 ± 3% +24.7% 92302659 ± 5% +30.0% 96190384 ± 9% -5.6% 69852204 ± 2% -6.0% 69592735 ± 3% numa-meminfo.node0.MemFree
57671194 ± 4% -31.7% 39382292 ± 12% -38.5% 35494567 ± 26% +7.2% 61832747 ± 3% +7.7% 62092216 ± 3% numa-meminfo.node0.MemUsed
84822 ± 19% +57.1% 133225 ± 17% +13.5% 96280 ± 39% -4.4% 81114 ± 9% -6.0% 79743 ± 11% numa-meminfo.node1.Active
84781 ± 19% +57.1% 133211 ± 17% +13.5% 96254 ± 39% -4.4% 81091 ± 9% -6.0% 79729 ± 11% numa-meminfo.node1.Active(anon)
78416592 ± 7% +13.3% 88860070 ± 5% +6.7% 83660976 ± 11% +4.5% 81951764 ± 4% +2.4% 80309519 ± 3% numa-meminfo.node1.MemFree
53641607 ± 11% -19.5% 43198129 ± 11% -9.8% 48397199 ± 19% -6.6% 50106411 ± 7% -3.5% 51748656 ± 4% numa-meminfo.node1.MemUsed
18516537 ± 3% +24.7% 23084190 ± 5% +29.9% 24053750 ± 9% -5.6% 17484374 ± 3% -6.1% 17387753 ± 2% numa-vmstat.node0.nr_free_pages
624065 ± 5% +16.0% 724171 ± 4% +18.0% 736399 ± 2% +10.1% 687335 ± 2% +3.0% 642802 ± 5% numa-vmstat.node0.numa_hit
577161 ± 8% +18.1% 681431 ± 8% +12.1% 647232 ± 6% +9.9% 634277 ± 6% -0.3% 575404 ± 5% numa-vmstat.node0.numa_local
21141 ± 19% +57.4% 33269 ± 17% +13.7% 24027 ± 39% -4.2% 20242 ± 9% -5.6% 19967 ± 11% numa-vmstat.node1.nr_active_anon
19586357 ± 7% +13.5% 22224344 ± 5% +6.8% 20914089 ± 11% +4.6% 20487157 ± 4% +2.6% 20087311 ± 3% numa-vmstat.node1.nr_free_pages
21141 ± 19% +57.4% 33269 ± 17% +13.7% 24027 ± 39% -4.2% 20242 ± 9% -5.6% 19967 ± 11% numa-vmstat.node1.nr_zone_active_anon
732629 ± 4% +10.5% 809596 ± 4% +8.4% 793911 ± 3% -9.0% 666417 ± 3% -3.5% 707191 ± 3% numa-vmstat.node1.numa_hit
646902 ± 5% +11.3% 719705 ± 9% +15.7% 748231 ± 6% -10.1% 581882 ± 5% -1.1% 639501 ± 3% numa-vmstat.node1.numa_local
167.87 ± 2% +56.6% 262.93 +65.3% 277.49 ± 3% -1.9% 164.74 -1.8% 164.84 time.elapsed_time
167.87 ± 2% +56.6% 262.93 +65.3% 277.49 ± 3% -1.9% 164.74 -1.8% 164.84 time.elapsed_time.max
140035 ± 6% -50.5% 69271 ± 5% -50.8% 68889 ± 4% -5.9% 131759 ± 3% -1.2% 138362 ± 8% time.involuntary_context_switches
163.67 ± 10% +63.7% 268.00 ± 5% +76.5% 288.83 ± 5% +13.0% 185.00 ± 8% +22.4% 200.33 ± 4% time.major_page_faults
11308 ± 2% -48.4% 5830 -47.0% 5995 +1.8% 11514 +2.5% 11591 time.percent_of_cpu_this_job_got
2347 -94.0% 139.98 ± 3% -95.0% 117.34 ± 21% +0.3% 2354 -0.0% 2347 time.system_time
16627 -8.6% 15191 -0.6% 16529 ± 5% -0.1% 16616 +0.8% 16759 ± 2% time.user_time
12158 ± 2% +5329.5% 660155 +5325.1% 659615 -1.0% 12037 ± 3% -1.4% 11985 ± 3% time.voluntary_context_switches
59662 -37.0% 37607 -40.5% 35489 ± 4% +0.5% 59969 -0.1% 59610 vm-scalability.median
2.19 ± 20% +1.7 3.91 ± 30% +3.3 5.51 ± 30% +0.6 2.82 ± 23% +1.5 3.72 ± 25% vm-scalability.median_stddev%
2.92 ± 22% +0.6 3.49 ± 32% +1.5 4.45 ± 19% +0.4 3.35 ± 17% +1.5 4.39 ± 16% vm-scalability.stddev%
7821791 -36.6% 4961402 -39.2% 4758850 ± 2% -0.2% 7809010 -0.7% 7769662 vm-scalability.throughput
167.87 ± 2% +56.6% 262.93 +65.3% 277.49 ± 3% -1.9% 164.74 -1.8% 164.84 vm-scalability.time.elapsed_time
167.87 ± 2% +56.6% 262.93 +65.3% 277.49 ± 3% -1.9% 164.74 -1.8% 164.84 vm-scalability.time.elapsed_time.max
140035 ± 6% -50.5% 69271 ± 5% -50.8% 68889 ± 4% -5.9% 131759 ± 3% -1.2% 138362 ± 8% vm-scalability.time.involuntary_context_switches
11308 ± 2% -48.4% 5830 -47.0% 5995 +1.8% 11514 +2.5% 11591 vm-scalability.time.percent_of_cpu_this_job_got
2347 -94.0% 139.98 ± 3% -95.0% 117.34 ± 21% +0.3% 2354 -0.0% 2347 vm-scalability.time.system_time
16627 -8.6% 15191 -0.6% 16529 ± 5% -0.1% 16616 +0.8% 16759 ± 2% vm-scalability.time.user_time
12158 ± 2% +5329.5% 660155 +5325.1% 659615 -1.0% 12037 ± 3% -1.4% 11985 ± 3% vm-scalability.time.voluntary_context_switches
88841 ± 18% +56.6% 139142 ± 16% +18.6% 105352 ± 34% -3.1% 86098 ± 9% -6.8% 82770 ± 11% meminfo.Active
88726 ± 18% +56.7% 139024 ± 16% +18.6% 105233 ± 34% -3.1% 85984 ± 9% -6.8% 82654 ± 11% meminfo.Active(anon)
79226777 ± 3% +18.1% 93562456 +17.2% 92853282 -0.3% 78961619 ± 2% -1.5% 78023229 ± 2% meminfo.CommitLimit
51410 ± 5% -27.2% 37411 ± 2% -25.9% 38103 ± 2% +0.5% 51669 ± 4% +2.3% 52586 ± 3% meminfo.HugePages_Surp
51410 ± 5% -27.2% 37411 ± 2% -25.9% 38103 ± 2% +0.5% 51669 ± 4% +2.3% 52586 ± 3% meminfo.HugePages_Total
1.053e+08 ± 5% -27.2% 76618243 ± 2% -25.9% 78036556 ± 2% +0.5% 1.058e+08 ± 4% +2.3% 1.077e+08 ± 3% meminfo.Hugetlb
59378 ± 9% -27.2% 43256 ± 9% -29.4% 41897 ± 15% -3.0% 57584 ± 9% -1.5% 58465 ± 8% meminfo.Mapped
1.513e+08 ± 3% +19.0% 1.801e+08 +18.1% 1.787e+08 -0.3% 1.508e+08 ± 3% -1.6% 1.489e+08 ± 2% meminfo.MemAvailable
1.523e+08 ± 3% +18.9% 1.811e+08 +18.0% 1.798e+08 -0.3% 1.518e+08 ± 3% -1.6% 1.499e+08 ± 2% meminfo.MemFree
1.114e+08 ± 4% -25.8% 82607720 ± 2% -24.6% 83956777 ± 2% +0.5% 1.119e+08 ± 4% +2.2% 1.138e+08 ± 3% meminfo.Memused
10914 ± 2% -9.3% 9894 -9.0% 9935 +0.8% 10999 ± 2% +1.3% 11059 meminfo.PageTables
235415 ± 4% +17.2% 275883 ± 9% +2.4% 241001 ± 17% -1.9% 230929 ± 2% -2.2% 230261 ± 4% meminfo.Shmem
22170 ± 18% +57.0% 34801 ± 17% +18.9% 26361 ± 34% -2.6% 21594 ± 9% -6.6% 20698 ± 11% proc-vmstat.nr_active_anon
3774988 ± 3% +19.0% 4493004 +18.2% 4461258 -0.3% 3762775 ± 2% -1.7% 3712537 ± 2% proc-vmstat.nr_dirty_background_threshold
7559208 ± 3% +19.0% 8996995 +18.2% 8933426 -0.3% 7534750 ± 2% -1.7% 7434153 ± 2% proc-vmstat.nr_dirty_threshold
824427 +1.2% 834568 +0.3% 826777 -0.0% 824269 -0.0% 824023 proc-vmstat.nr_file_pages
38091344 ± 3% +18.9% 45280310 +18.0% 44962412 -0.3% 37969040 ± 2% -1.6% 37466065 ± 2% proc-vmstat.nr_free_pages
25681 -1.7% 25241 -1.6% 25268 +0.9% 25908 -0.1% 25665 proc-vmstat.nr_kernel_stack
15161 ± 9% -28.5% 10841 ± 9% -30.3% 10565 ± 14% -3.4% 14641 ± 9% -2.1% 14849 ± 7% proc-vmstat.nr_mapped
2729 ± 2% -9.4% 2473 -9.1% 2480 +0.7% 2748 ± 2% +1.2% 2762 proc-vmstat.nr_page_table_pages
58775 ± 4% +17.3% 68926 ± 9% +2.5% 60274 ± 18% -1.8% 57736 ± 2% -2.1% 57526 ± 4% proc-vmstat.nr_shmem
22170 ± 18% +57.0% 34801 ± 17% +18.9% 26361 ± 34% -2.6% 21594 ± 9% -6.6% 20698 ± 11% proc-vmstat.nr_zone_active_anon
1360860 +13.0% 1537181 +12.7% 1533834 -0.2% 1357949 -0.5% 1354233 proc-vmstat.numa_hit
1228230 +14.4% 1404550 +13.9% 1398987 -0.6% 1220355 -0.7% 1219146 proc-vmstat.numa_local
132626 +0.0% 132681 +1.7% 134822 +3.7% 137582 ± 4% +1.9% 135086 proc-vmstat.numa_other
1186558 +18.1% 1400807 +19.5% 1417837 -0.3% 1182763 -0.5% 1180560 proc-vmstat.pgfault
31861 ± 3% +28.2% 40847 +31.7% 41945 ± 5% -3.1% 30881 ± 3% -1.7% 31316 ± 4% proc-vmstat.pgreuse
17.18 ± 3% +337.2% 75.11 ± 2% +318.3% 71.87 ± 5% -1.3% 16.96 ± 4% -0.0% 17.18 ± 3% perf-stat.i.MPKI
1.727e+10 ± 5% -37.8% 1.073e+10 ± 2% -41.2% 1.015e+10 ± 6% +0.7% 1.738e+10 ± 3% +1.7% 1.757e+10 ± 4% perf-stat.i.branch-instructions
0.12 ± 36% +0.6 0.73 ± 5% +0.7 0.79 ± 6% +0.0 0.12 ± 27% -0.0 0.11 ± 32% perf-stat.i.branch-miss-rate%
10351997 ± 16% -28.0% 7451909 ± 13% -29.7% 7276965 ± 16% -10.0% 9315546 ± 22% -7.3% 9592438 ± 25% perf-stat.i.branch-misses
94.27 ± 3% -20.3 73.99 ± 2% -19.2 75.03 -0.8 93.49 ± 3% +0.3 94.60 ± 3% perf-stat.i.cache-miss-rate%
9.7e+08 ± 5% -39.6% 5.859e+08 ± 2% -42.8% 5.552e+08 ± 5% +0.6% 9.759e+08 ± 3% +1.6% 9.854e+08 ± 4% perf-stat.i.cache-misses
9.936e+08 ± 5% -35.3% 6.431e+08 ± 2% -38.8% 6.084e+08 ± 5% +0.5% 9.99e+08 ± 3% +1.5% 1.008e+09 ± 4% perf-stat.i.cache-references
2640 ± 3% +180.7% 7410 ± 2% +168.8% 7097 ± 4% -0.0% 2640 -1.5% 2601 ± 5% perf-stat.i.context-switches
4.60 ± 2% +22.2% 5.62 +18.1% 5.44 ± 5% -1.0% 4.56 ± 2% +0.5% 4.62 perf-stat.i.cpi
2.888e+11 ± 5% -47.9% 1.503e+11 ± 2% -46.8% 1.538e+11 ± 2% +0.6% 2.907e+11 ± 4% +2.4% 2.956e+11 ± 3% perf-stat.i.cpu-cycles
214.97 ± 3% +48.6% 319.40 ± 2% +50.3% 323.15 +0.3% 215.56 +0.9% 216.91 perf-stat.i.cpu-migrations
7.4e+10 ± 5% -37.6% 4.618e+10 ± 2% -41.0% 4.369e+10 ± 6% +0.7% 7.449e+10 ± 3% +1.7% 7.529e+10 ± 4% perf-stat.i.instructions
0.28 ± 7% +33.6% 0.38 ± 3% +31.5% 0.37 ± 2% +0.0% 0.28 ± 6% -2.7% 0.27 ± 5% perf-stat.i.ipc
6413 ± 4% -21.5% 5037 -24.5% 4839 ± 5% -0.2% 6397 ± 4% +0.8% 6464 ± 2% perf-stat.i.minor-faults
6414 ± 4% -21.5% 5038 -24.5% 4840 ± 5% -0.3% 6398 ± 4% +0.8% 6465 ± 2% perf-stat.i.page-faults
13.16 -4.0% 12.64 -3.9% 12.64 +0.0% 13.17 +0.1% 13.17 perf-stat.overall.MPKI
97.57 -6.3 91.24 -6.1 91.44 +0.1 97.64 +0.1 97.67 perf-stat.overall.cache-miss-rate%
3.91 -16.9% 3.25 -9.8% 3.53 ± 5% -0.0% 3.91 +0.7% 3.94 perf-stat.overall.cpi
296.89 -13.4% 257.07 -6.1% 278.90 ± 5% -0.1% 296.69 +0.7% 298.84 perf-stat.overall.cycles-between-cache-misses
0.26 +20.3% 0.31 +11.1% 0.28 ± 5% +0.0% 0.26 -0.7% 0.25 perf-stat.overall.ipc
10770 -2.2% 10537 -2.3% 10523 +0.2% 10788 +0.1% 10784 perf-stat.overall.path-length
1.7e+10 ± 4% -36.8% 1.074e+10 ± 2% -39.8% 1.023e+10 ± 5% +0.6% 1.711e+10 ± 3% +1.6% 1.727e+10 ± 4% perf-stat.ps.branch-instructions
10207074 ± 15% -27.2% 7428222 ± 13% -29.6% 7182646 ± 16% -9.7% 9221719 ± 22% -6.6% 9530095 ± 25% perf-stat.ps.branch-misses
9.588e+08 ± 4% -39.1% 5.838e+08 -42.0% 5.566e+08 ± 5% +0.7% 9.651e+08 ± 3% +1.6% 9.744e+08 ± 4% perf-stat.ps.cache-misses
9.826e+08 ± 4% -34.9% 6.398e+08 -38.1% 6.087e+08 ± 5% +0.6% 9.884e+08 ± 3% +1.5% 9.975e+08 ± 4% perf-stat.ps.cache-references
2628 ± 3% +176.7% 7271 ± 2% +164.7% 6956 ± 4% +0.3% 2635 -1.0% 2600 ± 5% perf-stat.ps.context-switches
2.847e+11 ± 4% -47.3% 1.501e+11 ± 2% -45.6% 1.548e+11 ± 2% +0.6% 2.864e+11 ± 4% +2.3% 2.911e+11 ± 3% perf-stat.ps.cpu-cycles
213.42 ± 3% +47.5% 314.87 ± 2% +49.2% 318.34 +0.5% 214.42 +1.3% 216.10 perf-stat.ps.cpu-migrations
7.284e+10 ± 4% -36.6% 4.62e+10 ± 2% -39.6% 4.402e+10 ± 5% +0.6% 7.33e+10 ± 3% +1.6% 7.398e+10 ± 4% perf-stat.ps.instructions
6416 ± 3% -22.4% 4976 -25.6% 4772 ± 5% +0.2% 6426 ± 3% +1.6% 6516 ± 2% perf-stat.ps.minor-faults
6417 ± 3% -22.4% 4977 -25.6% 4774 ± 5% +0.2% 6428 ± 3% +1.6% 6517 ± 2% perf-stat.ps.page-faults
1.268e+13 -2.2% 1.241e+13 -2.3% 1.239e+13 +0.2% 1.27e+13 +0.1% 1.27e+13 perf-stat.total.instructions
7783325 ± 13% -22.8% 6008522 ± 10% -20.8% 6163644 ± 20% -13.8% 6708575 ± 22% -4.5% 7429947 ± 26% sched_debug.cfs_rq:/.avg_vruntime.avg
8109328 ± 13% -18.8% 6584206 ± 10% -15.3% 6872509 ± 19% -14.2% 6957983 ± 22% -5.4% 7673718 ± 26% sched_debug.cfs_rq:/.avg_vruntime.max
244161 ± 30% +28.2% 313090 ± 22% +76.6% 431126 ± 21% -23.5% 186903 ± 26% -28.7% 173977 ± 29% sched_debug.cfs_rq:/.avg_vruntime.stddev
0.66 ± 11% -22.0% 0.52 ± 21% -41.3% 0.39 ± 29% -0.1% 0.66 ± 8% -3.5% 0.64 ± 16% sched_debug.cfs_rq:/.h_nr_running.avg
495.88 ± 33% -44.7% 274.12 ± 3% -11.5% 438.85 ± 32% -11.2% 440.30 ± 18% -12.2% 435.24 ± 27% sched_debug.cfs_rq:/.load_avg.max
81.79 ± 28% -33.2% 54.62 ± 16% -15.5% 69.10 ± 26% +7.2% 87.66 ± 23% -8.4% 74.91 ± 38% sched_debug.cfs_rq:/.load_avg.stddev
7783325 ± 13% -22.8% 6008522 ± 10% -20.8% 6163644 ± 20% -13.8% 6708575 ± 22% -4.5% 7429947 ± 26% sched_debug.cfs_rq:/.min_vruntime.avg
8109328 ± 13% -18.8% 6584206 ± 10% -15.3% 6872509 ± 19% -14.2% 6957983 ± 22% -5.4% 7673718 ± 26% sched_debug.cfs_rq:/.min_vruntime.max
244161 ± 30% +28.2% 313090 ± 22% +76.6% 431126 ± 21% -23.5% 186902 ± 26% -28.7% 173977 ± 29% sched_debug.cfs_rq:/.min_vruntime.stddev
0.66 ± 11% -22.3% 0.51 ± 21% -41.5% 0.38 ± 29% -0.4% 0.66 ± 8% -3.8% 0.63 ± 16% sched_debug.cfs_rq:/.nr_running.avg
382.00 ± 36% -44.2% 213.33 ± 8% -23.3% 292.98 ± 42% -2.3% 373.40 ± 18% +4.0% 397.33 ± 20% sched_debug.cfs_rq:/.removed.load_avg.max
194.86 ± 36% -44.3% 108.59 ± 8% -24.0% 148.18 ± 40% -2.4% 190.23 ± 18% +3.7% 202.10 ± 20% sched_debug.cfs_rq:/.removed.runnable_avg.max
194.86 ± 36% -44.3% 108.59 ± 8% -24.0% 148.18 ± 40% -2.4% 190.23 ± 18% +3.7% 202.10 ± 20% sched_debug.cfs_rq:/.removed.util_avg.max
713.50 ± 11% -22.6% 552.00 ± 20% -39.7% 430.54 ± 26% -0.2% 712.27 ± 7% -3.0% 691.86 ± 14% sched_debug.cfs_rq:/.runnable_avg.avg
1348 ± 10% -15.9% 1133 ± 15% -20.9% 1067 ± 12% +2.4% 1380 ± 8% +5.1% 1417 ± 18% sched_debug.cfs_rq:/.runnable_avg.max
708.60 ± 11% -22.6% 548.41 ± 20% -39.6% 427.82 ± 26% -0.1% 707.59 ± 7% -3.1% 686.34 ± 14% sched_debug.cfs_rq:/.util_avg.avg
1119 ± 5% -16.3% 937.08 ± 11% -18.6% 910.83 ± 11% +2.0% 1141 ± 6% -0.1% 1117 ± 8% sched_debug.cfs_rq:/.util_avg.max
633.71 ± 11% -95.7% 27.38 ± 17% -96.7% 21.00 ± 19% -0.6% 630.15 ± 10% -3.6% 610.78 ± 17% sched_debug.cfs_rq:/.util_est.avg
1102 ± 18% -63.9% 397.88 ± 15% -67.5% 358.19 ± 8% +6.1% 1169 ± 14% +6.6% 1174 ± 24% sched_debug.cfs_rq:/.util_est.max
119.77 ± 55% -64.5% 42.46 ± 12% -67.8% 38.59 ± 12% -3.2% 115.93 ± 51% -12.3% 105.01 ± 70% sched_debug.cfs_rq:/.util_est.stddev
145182 ± 12% -37.6% 90551 ± 11% -29.5% 102317 ± 18% -7.3% 134528 ± 10% -17.2% 120251 ± 18% sched_debug.cpu.avg_idle.stddev
122256 ± 8% +41.4% 172906 ± 7% +38.2% 168929 ± 14% -5.4% 115642 ± 6% -1.3% 120639 ± 14% sched_debug.cpu.clock.avg
122268 ± 8% +41.4% 172920 ± 7% +38.2% 168942 ± 14% -5.4% 115657 ± 6% -1.3% 120655 ± 14% sched_debug.cpu.clock.max
122242 ± 8% +41.4% 172892 ± 7% +38.2% 168914 ± 14% -5.4% 115627 ± 6% -1.3% 120621 ± 14% sched_debug.cpu.clock.min
121865 ± 8% +41.5% 172490 ± 7% +38.3% 168517 ± 14% -5.4% 115298 ± 6% -1.3% 120268 ± 14% sched_debug.cpu.clock_task.avg
122030 ± 8% +41.5% 172681 ± 7% +38.3% 168714 ± 14% -5.4% 115451 ± 6% -1.3% 120421 ± 14% sched_debug.cpu.clock_task.max
112808 ± 8% +44.2% 162675 ± 7% +41.0% 159006 ± 15% -5.5% 106630 ± 7% -1.1% 111604 ± 15% sched_debug.cpu.clock_task.min
5671 ± 6% +24.6% 7069 ± 4% +24.0% 7034 ± 8% -7.2% 5261 ± 7% -3.5% 5471 ± 10% sched_debug.cpu.curr->pid.max
0.00 ± 12% +22.5% 0.00 ± 50% +17.7% 0.00 ± 42% +71.0% 0.00 ± 35% +59.0% 0.00 ± 43% sched_debug.cpu.next_balance.stddev
0.66 ± 11% -22.0% 0.51 ± 21% -41.4% 0.39 ± 29% -0.3% 0.66 ± 8% -3.6% 0.64 ± 16% sched_debug.cpu.nr_running.avg
2659 ± 12% +208.6% 8204 ± 7% +192.0% 7763 ± 14% -10.1% 2391 ± 11% -6.2% 2493 ± 15% sched_debug.cpu.nr_switches.avg
679.31 ± 10% +516.8% 4189 ± 14% +401.6% 3407 ± 24% -14.7% 579.50 ± 19% -6.8% 633.18 ± 25% sched_debug.cpu.nr_switches.min
0.00 ± 9% +12202.6% 0.31 ± 42% +12627.8% 0.32 ± 37% +67.0% 0.00 ± 50% -34.8% 0.00 ± 72% sched_debug.cpu.nr_uninterruptible.avg
122243 ± 8% +41.4% 172893 ± 7% +38.2% 168916 ± 14% -5.4% 115628 ± 6% -1.3% 120623 ± 14% sched_debug.cpu_clk
120996 ± 8% +41.9% 171660 ± 7% +38.6% 167751 ± 15% -5.4% 114462 ± 6% -1.3% 119457 ± 14% sched_debug.ktime
123137 ± 8% +41.1% 173805 ± 7% +37.9% 169767 ± 14% -5.4% 116479 ± 6% -1.4% 121452 ± 13% sched_debug.sched_clk
>
> I spotted something else worth optimizing last time, and with the
> patch attached, I was able to measure some significant improvements in
> 1GB hugeTLB allocation and free time, e.g., when allocating and freeing
> 700 1GB hugeTLB pages:
>
> Before:
> # time echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> real 0m13.500s
> user 0m0.000s
> sys 0m13.311s
>
> # time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> real 0m11.269s
> user 0m0.000s
> sys 0m11.187s
>
>
> After:
> # time echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> real 0m10.643s
> user 0m0.001s
> sys 0m10.487s
>
> # time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> real 0m1.541s
> user 0m0.000s
> sys 0m1.528s
>
> Thanks!
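A rough back-of-the-envelope, assuming the attached patch is the __GFP_COMP
conversion shown earlier in this thread: a 1GB hugeTLB page spans
1GB / 4KB = 262,144 base pages, so freeing 700 of them previously meant about
700 * 262,144 ~ 1.8e8 per-page operations in free_contig_range(); with the
folio path it is 700 folio_put() calls on large folios, which lines up with
the free time dropping from ~11.3s to ~1.5s.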
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2024-08-06 3:02 UTC | newest]
Thread overview: 14+ messages
2024-07-09 5:11 [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression kernel test robot
2024-07-10 6:22 ` Yu Zhao
2024-07-14 12:26 ` Oliver Sang
2024-07-15 2:40 ` Muchun Song
2024-07-15 4:08 ` Oliver Sang
2024-07-17 7:52 ` Janosch Frank
2024-07-17 7:59 ` Christian Borntraeger
2024-07-17 8:36 ` Yu Zhao
2024-07-17 15:44 ` Yu Zhao
2024-07-18 9:23 ` Marc Hartmayer
2024-07-19 8:42 ` Oliver Sang
2024-07-19 16:06 ` Yu Zhao
2024-08-03 22:07 ` Yu Zhao
2024-08-06 3:01 ` Oliver Sang