linux-mm.kvack.org archive mirror
* [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
@ 2024-07-09  5:11 kernel test robot
  2024-07-10  6:22 ` Yu Zhao
  2024-07-17  7:52 ` Janosch Frank
  0 siblings, 2 replies; 14+ messages in thread
From: kernel test robot @ 2024-07-09  5:11 UTC (permalink / raw)
  To: Yu Zhao
  Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
	Muchun Song, David Hildenbrand, Frank van der Linden,
	Matthew Wilcox, Peter Xu, Yang Shi, linux-kernel, ying.huang,
	feng.tang, fengwei.yin, oliver.sang



Hello,

kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:


commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

[the regression is still present on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]

testcase: vm-scalability
test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
parameters:

	runtime: 300s
	size: 512G
	test: anon-cow-rand-hugetlb
	cpufreq_governor: performance




If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202407091001.1250ad4a-oliver.sang@intel.com


Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240709/202407091001.1250ad4a-oliver.sang@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
  gcc-13/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/300s/512G/lkp-icl-2sp2/anon-cow-rand-hugetlb/vm-scalability

commit: 
  73236245e0 ("cachestat: do not flush stats in recency check")
  875fa64577 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")

73236245e0b47ea3 875fa64577da9bc8e9963ee14fe 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
 4.447e+09 ± 13%    +342.0%  1.966e+10 ±  7%  cpuidle..time
    753730 ±  2%   +1105.0%    9082792 ±  6%  cpuidle..usage
    582089 ±  6%     +20.7%     702337 ±  8%  numa-numastat.node0.local_node
    658053 ±  5%     +18.3%     778277 ±  3%  numa-numastat.node0.numa_hit
    255.89 ±  2%     +32.4%     338.76 ±  3%  uptime.boot
     10264 ±  7%    +147.3%      25384 ±  5%  uptime.idle
      2.00 ± 76%  +72475.0%       1451 ±125%  perf-c2c.DRAM.local
     38.50 ± 13%  +20427.7%       7903 ± 92%  perf-c2c.DRAM.remote
     23.33 ± 17%   +1567.9%     389.17 ± 61%  perf-c2c.HITM.local
     11.33 ± 35%   +1902.9%     227.00 ± 84%  perf-c2c.HITM.remote
     17.44 ± 10%    +208.4%      53.80 ±  2%  vmstat.cpu.id
     71.79 ±  2%     -37.5%      44.87 ±  3%  vmstat.cpu.us
    107.92 ±  2%     -43.9%      60.57 ±  3%  vmstat.procs.r
      2724 ±  4%    +151.1%       6842 ±  3%  vmstat.system.cs
    134997 ±  4%     -23.8%     102899 ±  7%  vmstat.system.in
     16.63 ± 11%     +36.8       53.43 ±  2%  mpstat.cpu.all.idle%
      0.34 ±  3%      -0.1        0.21 ±  2%  mpstat.cpu.all.irq%
      0.03 ±  6%      +0.0        0.04 ±  9%  mpstat.cpu.all.soft%
     10.61 ±  3%      -9.5        1.15 ± 38%  mpstat.cpu.all.sys%
     72.40 ±  2%     -27.2       45.16 ±  3%  mpstat.cpu.all.usr%
      6.83 ± 34%    +195.1%      20.17 ±  3%  mpstat.max_utilization.seconds
    102366 ± 29%    +115.2%     220258 ±  7%  meminfo.AnonHugePages
  70562810           +14.4%   80723780        meminfo.CommitLimit
     43743 ±  2%     -22.7%      33821 ±  2%  meminfo.HugePages_Surp
     43743 ±  2%     -22.7%      33821 ±  2%  meminfo.HugePages_Total
  89587444 ±  2%     -22.7%   69265470 ±  2%  meminfo.Hugetlb
     66355 ± 18%     -40.8%      39283 ± 15%  meminfo.Mapped
 1.341e+08           +15.3%  1.545e+08        meminfo.MemAvailable
 1.351e+08           +15.1%  1.555e+08        meminfo.MemFree
  95643557 ±  2%     -21.4%   75182673 ±  2%  meminfo.Memused
     49588 ±  2%     -31.7%      33871 ±  3%  vm-scalability.median
      8.02 ±  9%      -2.6        5.40 ± 12%  vm-scalability.median_stddev%
   6842353           -34.3%    4498326 ±  2%  vm-scalability.throughput
    205.20 ±  2%     +40.7%     288.76 ±  4%  vm-scalability.time.elapsed_time
    205.20 ±  2%     +40.7%     288.76 ±  4%  vm-scalability.time.elapsed_time.max
    149773 ±  2%     -47.3%      78866 ±  3%  vm-scalability.time.involuntary_context_switches
     10634 ±  2%     -43.5%       6008 ±  3%  vm-scalability.time.percent_of_cpu_this_job_got
      2772 ±  5%     -84.9%     419.47 ± 42%  vm-scalability.time.system_time
     19039           -11.2%      16908        vm-scalability.time.user_time
     14514 ±  2%   +4380.2%     650265        vm-scalability.time.voluntary_context_switches
    617106 ± 42%     -67.8%     198580 ±130%  numa-vmstat.node0.nr_file_pages
      8937 ± 42%     -65.6%       3075 ±110%  numa-vmstat.node0.nr_mapped
     18779 ± 30%     -35.4%      12124 ± 40%  numa-vmstat.node0.nr_slab_reclaimable
    603050 ± 43%     -70.7%     176415 ±148%  numa-vmstat.node0.nr_unevictable
    603050 ± 43%     -70.7%     176415 ±148%  numa-vmstat.node0.nr_zone_unevictable
    657413 ±  5%     +18.2%     776975 ±  3%  numa-vmstat.node0.numa_hit
    581443 ±  6%     +20.6%     701035 ±  8%  numa-vmstat.node0.numa_local
    214166 ±122%    +192.8%     627105 ± 40%  numa-vmstat.node1.nr_file_pages
  11263349 ±  5%     +35.8%   15297395 ±  7%  numa-vmstat.node1.nr_free_pages
      9478 ± 59%     +72.7%      16368 ± 29%  numa-vmstat.node1.nr_slab_reclaimable
    163852 ±161%    +260.4%     590489 ± 44%  numa-vmstat.node1.nr_unevictable
    163852 ±161%    +260.4%     590489 ± 44%  numa-vmstat.node1.nr_zone_unevictable
     49.90 ± 29%    +115.4%     107.47 ±  7%  proc-vmstat.nr_anon_transparent_hugepages
   3345235           +15.3%    3857626        proc-vmstat.nr_dirty_background_threshold
   6698650           +15.3%    7724685        proc-vmstat.nr_dirty_threshold
  33770919           +15.2%   38900683        proc-vmstat.nr_free_pages
    196929            -2.3%     192368        proc-vmstat.nr_inactive_anon
     16843 ± 18%     -40.4%      10031 ± 14%  proc-vmstat.nr_mapped
      2693            -7.6%       2487        proc-vmstat.nr_page_table_pages
    196929            -2.3%     192368        proc-vmstat.nr_zone_inactive_anon
   1404664            +9.0%    1530693 ±  3%  proc-vmstat.numa_hit
   1271130            +9.3%    1389279 ±  3%  proc-vmstat.numa_local
     69467 ±  7%     -34.8%      45284 ± 22%  proc-vmstat.pgactivate
   1263160           +12.8%    1425219 ±  3%  proc-vmstat.pgfault
     37012           +16.6%      43157 ±  4%  proc-vmstat.pgreuse
   2468473 ± 42%     -67.8%     794312 ±130%  numa-meminfo.node0.FilePages
     75120 ± 30%     -35.4%      48496 ± 40%  numa-meminfo.node0.KReclaimable
     35040 ± 42%     -66.3%      11825 ±112%  numa-meminfo.node0.Mapped
     75120 ± 30%     -35.4%      48496 ± 40%  numa-meminfo.node0.SReclaimable
    227978 ± 10%     -17.0%     189123 ± 10%  numa-meminfo.node0.Slab
   2412201 ± 43%     -70.7%     705661 ±148%  numa-meminfo.node0.Unevictable
    856474 ±123%    +192.9%    2508266 ± 40%  numa-meminfo.node1.FilePages
     25221 ±  4%     -34.1%      16618 ± 13%  numa-meminfo.node1.HugePages_Surp
     25221 ±  4%     -34.1%      16618 ± 13%  numa-meminfo.node1.HugePages_Total
     37917 ± 59%     +72.7%      65467 ± 30%  numa-meminfo.node1.KReclaimable
  45044169 ±  5%     +35.8%   61184692 ±  7%  numa-meminfo.node1.MemFree
  53983914 ±  4%     -29.9%   37843391 ± 12%  numa-meminfo.node1.MemUsed
     37917 ± 59%     +72.7%      65467 ± 30%  numa-meminfo.node1.SReclaimable
    153538 ± 16%     +26.2%     193736 ± 10%  numa-meminfo.node1.Slab
    655409 ±161%    +260.4%    2361959 ± 44%  numa-meminfo.node1.Unevictable
      1482 ±  9%     -17.5%       1223 ±  9%  sched_debug.cfs_rq:/.runnable_avg.max
    661.67 ± 14%     -93.0%      46.09 ± 15%  sched_debug.cfs_rq:/.util_est.avg
      1286 ± 12%     -61.4%     496.42 ± 37%  sched_debug.cfs_rq:/.util_est.max
    123.89 ± 48%     -57.2%      53.08 ± 29%  sched_debug.cfs_rq:/.util_est.stddev
    125242 ± 11%     +35.5%     169710 ± 10%  sched_debug.cpu.clock.avg
    125264 ± 11%     +35.5%     169723 ± 10%  sched_debug.cpu.clock.max
    125213 ± 11%     +35.5%     169693 ± 10%  sched_debug.cpu.clock.min
    124816 ± 11%     +35.6%     169267 ± 10%  sched_debug.cpu.clock_task.avg
    125011 ± 11%     +35.6%     169465 ± 10%  sched_debug.cpu.clock_task.max
    115620 ± 12%     +37.8%     159344 ± 10%  sched_debug.cpu.clock_task.min
      2909 ± 14%    +172.9%       7941 ± 10%  sched_debug.cpu.nr_switches.avg
    715.68 ± 18%    +470.7%       4084 ± 23%  sched_debug.cpu.nr_switches.min
    125215 ± 11%     +35.5%     169695 ± 10%  sched_debug.cpu_clk
    123982 ± 11%     +35.9%     168463 ± 10%  sched_debug.ktime
    126127 ± 11%     +35.3%     170626 ± 10%  sched_debug.sched_clk
     15.81 ±  2%    +357.5%      72.34 ±  5%  perf-stat.i.MPKI
  1.46e+10 ±  2%     -32.9%  9.801e+09 ±  4%  perf-stat.i.branch-instructions
      0.10 ±  3%      +0.5        0.65 ±  5%  perf-stat.i.branch-miss-rate%
  10768595           -27.8%    7778807 ±  3%  perf-stat.i.branch-misses
     96.93           -19.0       77.95        perf-stat.i.cache-miss-rate%
 8.054e+08 ±  2%     -33.0%  5.398e+08 ±  4%  perf-stat.i.cache-misses
  8.26e+08 ±  2%     -29.1%  5.855e+08 ±  4%  perf-stat.i.cache-references
      2668 ±  4%    +159.7%       6928 ±  3%  perf-stat.i.context-switches
      5.07           +42.6%       7.24 ± 12%  perf-stat.i.cpi
 2.809e+11 ±  2%     -44.1%  1.571e+11 ±  3%  perf-stat.i.cpu-cycles
    213.40 ±  2%     +41.5%     301.92 ±  5%  perf-stat.i.cpu-migrations
    360.56            -9.5%     326.39 ±  5%  perf-stat.i.cycles-between-cache-misses
 6.256e+10 ±  2%     -32.6%  4.218e+10 ±  4%  perf-stat.i.instructions
      0.24           +39.6%       0.33 ±  3%  perf-stat.i.ipc
      5779 ±  2%     -18.1%       4735 ±  2%  perf-stat.i.minor-faults
      5780 ±  2%     -18.1%       4737 ±  2%  perf-stat.i.page-faults
     12.99            -1.8%      12.75        perf-stat.overall.MPKI
     97.43            -5.1       92.33        perf-stat.overall.cache-miss-rate%
      4.52           -17.5%       3.72        perf-stat.overall.cpi
    347.63           -16.0%     291.93        perf-stat.overall.cycles-between-cache-misses
      0.22           +21.3%       0.27        perf-stat.overall.ipc
     10915            -3.4%      10545        perf-stat.overall.path-length
 1.433e+10 ±  2%     -31.5%  9.821e+09 ±  4%  perf-stat.ps.branch-instructions
  10358936 ±  2%     -25.2%    7745475 ±  4%  perf-stat.ps.branch-misses
 7.973e+08 ±  2%     -32.4%  5.389e+08 ±  4%  perf-stat.ps.cache-misses
 8.183e+08 ±  2%     -28.7%  5.838e+08 ±  4%  perf-stat.ps.cache-references
      2648 ±  4%    +157.5%       6819 ±  3%  perf-stat.ps.context-switches
 2.771e+11 ±  2%     -43.3%  1.572e+11 ±  3%  perf-stat.ps.cpu-cycles
    211.28 ±  2%     +41.6%     299.23 ±  5%  perf-stat.ps.cpu-migrations
 6.139e+10 ±  2%     -31.2%  4.226e+10 ±  4%  perf-stat.ps.instructions
      5815 ±  2%     -19.4%       4686 ±  2%  perf-stat.ps.minor-faults
      5816 ±  2%     -19.4%       4687 ±  2%  perf-stat.ps.page-faults
 1.265e+13            -3.4%  1.222e+13        perf-stat.total.instructions
     60.25 ± 15%     -13.0       47.20 ± 55%  perf-profile.calltrace.cycles-pp.do_rw_once
     47.47 ± 14%      -8.3       39.13 ± 57%  perf-profile.calltrace.cycles-pp.lrand48_r@plt
      2.17 ±130%      -1.5        0.65 ±159%  perf-profile.calltrace.cycles-pp.__mutex_lock.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
      1.57 ±127%      -1.0        0.59 ±160%  perf-profile.calltrace.cycles-pp.mutex_spin_on_owner.__mutex_lock.hugetlb_fault.handle_mm_fault.do_user_addr_fault
      0.56 ±146%      -0.6        0.00        perf-profile.calltrace.cycles-pp.osq_lock.__mutex_lock.hugetlb_fault.handle_mm_fault.do_user_addr_fault
      1.30 ± 48%      -0.4        0.88 ± 71%  perf-profile.calltrace.cycles-pp.lrand48_r
      2.11 ± 14%      -0.1        1.99 ± 49%  perf-profile.calltrace.cycles-pp.nrand48_r
      0.00            +0.1        0.08 ±223%  perf-profile.calltrace.cycles-pp.ast_mode_config_helper_atomic_commit_tail.commit_tail.drm_atomic_helper_commit.drm_atomic_commit.drm_atomic_helper_dirtyfb
      0.00            +0.1        0.08 ±223%  perf-profile.calltrace.cycles-pp.ast_primary_plane_helper_atomic_update.drm_atomic_helper_commit_planes.drm_atomic_helper_commit_tail_rpm.ast_mode_config_helper_atomic_commit_tail.commit_tail
      0.00            +0.1        0.08 ±223%  perf-profile.calltrace.cycles-pp.commit_tail.drm_atomic_helper_commit.drm_atomic_commit.drm_atomic_helper_dirtyfb.drm_fbdev_generic_helper_fb_dirty
      0.00            +0.1        0.08 ±223%  perf-profile.calltrace.cycles-pp.drm_atomic_helper_commit_planes.drm_atomic_helper_commit_tail_rpm.ast_mode_config_helper_atomic_commit_tail.commit_tail.drm_atomic_helper_commit
      0.00            +0.1        0.08 ±223%  perf-profile.calltrace.cycles-pp.drm_atomic_helper_commit_tail_rpm.ast_mode_config_helper_atomic_commit_tail.commit_tail.drm_atomic_helper_commit.drm_atomic_commit
      0.00            +0.1        0.08 ±223%  perf-profile.calltrace.cycles-pp.drm_fb_memcpy.ast_primary_plane_helper_atomic_update.drm_atomic_helper_commit_planes.drm_atomic_helper_commit_tail_rpm.ast_mode_config_helper_atomic_commit_tail
      0.00            +0.1        0.08 ±223%  perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.alloc_hugetlb_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault
      0.00            +0.1        0.08 ±223%  perf-profile.calltrace.cycles-pp.drm_atomic_commit.drm_atomic_helper_dirtyfb.drm_fbdev_generic_helper_fb_dirty.drm_fb_helper_damage_work.process_one_work
      0.00            +0.1        0.08 ±223%  perf-profile.calltrace.cycles-pp.drm_atomic_helper_commit.drm_atomic_commit.drm_atomic_helper_dirtyfb.drm_fbdev_generic_helper_fb_dirty.drm_fb_helper_damage_work
      0.00            +0.1        0.08 ±223%  perf-profile.calltrace.cycles-pp.drm_atomic_helper_dirtyfb.drm_fbdev_generic_helper_fb_dirty.drm_fb_helper_damage_work.process_one_work.worker_thread
      0.00            +0.1        0.08 ±223%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.alloc_hugetlb_folio.hugetlb_wp.hugetlb_fault
      0.00            +0.1        0.09 ±223%  perf-profile.calltrace.cycles-pp.drm_fb_helper_damage_work.process_one_work.worker_thread.kthread.ret_from_fork
      0.00            +0.1        0.09 ±223%  perf-profile.calltrace.cycles-pp.drm_fbdev_generic_helper_fb_dirty.drm_fb_helper_damage_work.process_one_work.worker_thread.kthread
      0.00            +0.1        0.09 ±223%  perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.copy_mc_enhanced_fast_string
      0.00            +0.1        0.09 ±223%  perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.copy_mc_enhanced_fast_string.copy_subpage
      0.00            +0.1        0.09 ±223%  perf-profile.calltrace.cycles-pp.process_one_work.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
      0.00            +0.1        0.09 ±223%  perf-profile.calltrace.cycles-pp.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
      0.00            +0.1        0.10 ±223%  perf-profile.calltrace.cycles-pp.clockevents_program_event.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
      0.00            +0.1        0.10 ±223%  perf-profile.calltrace.cycles-pp.kthread.ret_from_fork.ret_from_fork_asm
      0.00            +0.1        0.10 ±223%  perf-profile.calltrace.cycles-pp.ret_from_fork.ret_from_fork_asm
      0.00            +0.1        0.10 ±223%  perf-profile.calltrace.cycles-pp.ret_from_fork_asm
      0.00            +0.1        0.10 ±223%  perf-profile.calltrace.cycles-pp.prep_new_hugetlb_folio.alloc_surplus_hugetlb_folio.alloc_hugetlb_folio.hugetlb_wp.hugetlb_fault
      0.00            +0.1        0.10 ±223%  perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.copy_mc_enhanced_fast_string.copy_subpage.copy_user_large_folio
      0.00            +0.1        0.11 ±223%  perf-profile.calltrace.cycles-pp.update_process_times.tick_nohz_handler.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt
      0.00            +0.2        0.17 ±223%  perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.copy_mc_enhanced_fast_string.copy_subpage.copy_user_large_folio.hugetlb_wp
      0.00            +0.2        0.23 ±145%  perf-profile.calltrace.cycles-pp.tick_nohz_handler.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt
      0.00            +0.2        0.24 ±144%  perf-profile.calltrace.cycles-pp.prep_compound_page.get_page_from_freelist.__alloc_pages_noprof.__folio_alloc_noprof.alloc_buddy_hugetlb_folio
      0.00            +0.2        0.25 ±144%  perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
      0.00            +0.3        0.25 ±142%  perf-profile.calltrace.cycles-pp.irq_exit_rcu.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.acpi_safe_halt.acpi_idle_enter
      0.00            +0.3        0.26 ±144%  perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages_noprof.__folio_alloc_noprof.alloc_buddy_hugetlb_folio.__alloc_fresh_hugetlb_folio
      0.00            +0.3        0.26 ±144%  perf-profile.calltrace.cycles-pp.__alloc_pages_noprof.__folio_alloc_noprof.alloc_buddy_hugetlb_folio.__alloc_fresh_hugetlb_folio.alloc_surplus_hugetlb_folio
      0.00            +0.3        0.28 ±144%  perf-profile.calltrace.cycles-pp.__folio_alloc_noprof.alloc_buddy_hugetlb_folio.__alloc_fresh_hugetlb_folio.alloc_surplus_hugetlb_folio.alloc_hugetlb_folio
      0.00            +0.3        0.28 ±144%  perf-profile.calltrace.cycles-pp.alloc_buddy_hugetlb_folio.__alloc_fresh_hugetlb_folio.alloc_surplus_hugetlb_folio.alloc_hugetlb_folio.hugetlb_wp
      0.00            +0.3        0.28 ±144%  perf-profile.calltrace.cycles-pp.__alloc_fresh_hugetlb_folio.alloc_surplus_hugetlb_folio.alloc_hugetlb_folio.hugetlb_wp.hugetlb_fault
      0.00            +0.5        0.46 ±144%  perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.acpi_safe_halt
      0.00            +0.5        0.48 ±144%  perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.acpi_safe_halt.acpi_idle_enter
      0.00            +0.6        0.57 ±144%  perf-profile.calltrace.cycles-pp.alloc_surplus_hugetlb_folio.alloc_hugetlb_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault
      0.00            +0.7        0.72 ±144%  perf-profile.calltrace.cycles-pp.alloc_hugetlb_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault
      0.00            +0.9        0.90 ±143%  perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.acpi_safe_halt.acpi_idle_enter.cpuidle_enter_state
      0.00            +1.1        1.14 ±142%  perf-profile.calltrace.cycles-pp.acpi_safe_halt.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
      0.00            +2.1        2.06 ±143%  perf-profile.calltrace.cycles-pp.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
      0.00            +2.1        2.09 ±143%  perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
      0.00            +2.1        2.10 ±143%  perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
      0.00            +2.2        2.23 ±143%  perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.common_startup_64
      0.00            +2.3        2.30 ±143%  perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.common_startup_64
      0.00            +2.3        2.30 ±143%  perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.common_startup_64
      0.00            +2.3        2.30 ±143%  perf-profile.calltrace.cycles-pp.start_secondary.common_startup_64
      0.00            +2.4        2.35 ±142%  perf-profile.calltrace.cycles-pp.common_startup_64
      0.00            +2.7        2.71 ±143%  perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.acpi_safe_halt.acpi_idle_enter.cpuidle_enter_state.cpuidle_enter
     12.15 ±106%     +14.7       26.84 ±129%  perf-profile.calltrace.cycles-pp.do_access
     10.20 ±128%     +14.7       24.91 ±142%  perf-profile.calltrace.cycles-pp.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
     10.20 ±128%     +14.7       24.93 ±141%  perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
     10.22 ±128%     +14.7       24.95 ±142%  perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.do_access
     10.21 ±128%     +14.7       24.95 ±142%  perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
     10.23 ±128%     +14.8       25.05 ±141%  perf-profile.calltrace.cycles-pp.asm_exc_page_fault.do_access
      7.77 ±127%     +15.5       23.22 ±141%  perf-profile.calltrace.cycles-pp.copy_mc_enhanced_fast_string.copy_subpage.copy_user_large_folio.hugetlb_wp.hugetlb_fault
      7.83 ±127%     +15.5       23.34 ±141%  perf-profile.calltrace.cycles-pp.copy_subpage.copy_user_large_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault
      7.84 ±127%     +15.6       23.40 ±141%  perf-profile.calltrace.cycles-pp.copy_user_large_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault
      8.00 ±127%     +16.2       24.20 ±141%  perf-profile.calltrace.cycles-pp.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
     84.24 ± 14%     -17.2       67.02 ± 55%  perf-profile.children.cycles-pp.do_rw_once
     24.51 ± 14%      -4.4       20.11 ± 57%  perf-profile.children.cycles-pp.lrand48_r@plt
      2.17 ±130%      -1.5        0.65 ±159%  perf-profile.children.cycles-pp.__mutex_lock
      1.57 ±127%      -1.0        0.59 ±160%  perf-profile.children.cycles-pp.mutex_spin_on_owner
      0.59 ±136%      -0.6        0.04 ±152%  perf-profile.children.cycles-pp.osq_lock
      1.01 ± 19%      -0.2        0.77 ± 47%  perf-profile.children.cycles-pp.lrand48_r
      2.81 ± 14%      -0.1        2.71 ± 45%  perf-profile.children.cycles-pp.nrand48_r
      0.10 ± 57%      -0.1        0.03 ±100%  perf-profile.children.cycles-pp.main
      0.10 ± 57%      -0.1        0.03 ±100%  perf-profile.children.cycles-pp.run_builtin
      0.08 ± 92%      -0.1        0.02 ±141%  perf-profile.children.cycles-pp.__cmd_record
      0.08 ± 92%      -0.1        0.02 ±141%  perf-profile.children.cycles-pp.cmd_record
      0.06 ±112%      -0.1        0.00        perf-profile.children.cycles-pp.perf_mmap__push
      0.06 ±112%      -0.1        0.00        perf-profile.children.cycles-pp.record__mmap_read_evlist
      0.05 ±113%      -0.1        0.00        perf-profile.children.cycles-pp.record__pushfn
      0.05 ±111%      -0.0        0.00        perf-profile.children.cycles-pp.writen
      0.05 ±110%      -0.0        0.00        perf-profile.children.cycles-pp.shmem_file_write_iter
      0.04 ±110%      -0.0        0.00        perf-profile.children.cycles-pp.generic_perform_write
      0.01 ±223%      -0.0        0.00        perf-profile.children.cycles-pp.copy_page_from_iter_atomic
      0.00            +0.0        0.01 ±223%  perf-profile.children.cycles-pp.__rmqueue_pcplist
      0.00            +0.0        0.01 ±223%  perf-profile.children.cycles-pp.io_serial_out
      0.00            +0.0        0.01 ±223%  perf-profile.children.cycles-pp.prepare_task_switch
      0.00            +0.0        0.01 ±223%  perf-profile.children.cycles-pp.seq_read_iter
      0.00            +0.0        0.01 ±223%  perf-profile.children.cycles-pp.update_curr
      0.00            +0.0        0.01 ±223%  perf-profile.children.cycles-pp.cpuidle_governor_latency_req
      0.00            +0.0        0.01 ±223%  perf-profile.children.cycles-pp.irqentry_enter
      0.00            +0.0        0.01 ±223%  perf-profile.children.cycles-pp.delay_tsc
      0.00            +0.0        0.01 ±223%  perf-profile.children.cycles-pp.nohz_balancer_kick
      0.00            +0.0        0.01 ±223%  perf-profile.children.cycles-pp.copy_mc_to_kernel
      0.00            +0.0        0.01 ±223%  perf-profile.children.cycles-pp.ksys_read
      0.00            +0.0        0.01 ±223%  perf-profile.children.cycles-pp.vfs_read
      0.06 ± 17%      +0.0        0.08 ± 36%  perf-profile.children.cycles-pp.task_tick_fair
      0.00            +0.0        0.01 ±223%  perf-profile.children.cycles-pp.__sysvec_irq_work
      0.00            +0.0        0.01 ±223%  perf-profile.children.cycles-pp._printk
      0.00            +0.0        0.01 ±223%  perf-profile.children.cycles-pp.asm_sysvec_irq_work
      0.00            +0.0        0.01 ±223%  perf-profile.children.cycles-pp.irq_work_run
      0.00            +0.0        0.01 ±223%  perf-profile.children.cycles-pp.irq_work_single
      0.00            +0.0        0.01 ±223%  perf-profile.children.cycles-pp.sysvec_irq_work
      0.00            +0.0        0.02 ±223%  perf-profile.children.cycles-pp.irq_work_run_list
      0.00            +0.0        0.02 ±142%  perf-profile.children.cycles-pp.__handle_mm_fault
      0.00            +0.0        0.02 ±142%  perf-profile.children.cycles-pp.__update_load_avg_cfs_rq
      0.00            +0.0        0.02 ±142%  perf-profile.children.cycles-pp.dequeue_entity
      0.00            +0.0        0.02 ±142%  perf-profile.children.cycles-pp.native_apic_msr_eoi
      0.00            +0.0        0.02 ±142%  perf-profile.children.cycles-pp.sched_ttwu_pending
      0.00            +0.0        0.02 ±142%  perf-profile.children.cycles-pp.schedule_preempt_disabled
      0.00            +0.0        0.02 ±144%  perf-profile.children.cycles-pp.rcu_all_qs
      0.00            +0.0        0.02 ±144%  perf-profile.children.cycles-pp.rcu_pending
      0.00            +0.0        0.02 ±144%  perf-profile.children.cycles-pp.read
      0.00            +0.0        0.02 ±141%  perf-profile.children.cycles-pp.update_irq_load_avg
      0.00            +0.0        0.02 ±141%  perf-profile.children.cycles-pp.update_rq_clock
      0.00            +0.0        0.02 ±142%  perf-profile.children.cycles-pp.___perf_sw_event
      0.00            +0.0        0.02 ±142%  perf-profile.children.cycles-pp.__flush_smp_call_function_queue
      0.00            +0.0        0.02 ±142%  perf-profile.children.cycles-pp.__sysvec_call_function_single
      0.00            +0.0        0.02 ±142%  perf-profile.children.cycles-pp.dequeue_task_fair
      0.00            +0.0        0.02 ±142%  perf-profile.children.cycles-pp.enqueue_entity
      0.00            +0.0        0.02 ±146%  perf-profile.children.cycles-pp.rcu_report_qs_rdp
      0.00            +0.0        0.02 ±142%  perf-profile.children.cycles-pp.read_tsc
      0.00            +0.0        0.02 ±142%  perf-profile.children.cycles-pp.rmqueue
      0.00            +0.0        0.02 ±141%  perf-profile.children.cycles-pp.wait_for_xmitr
      0.00            +0.0        0.02 ±141%  perf-profile.children.cycles-pp.idle_cpu
      0.00            +0.0        0.03 ±141%  perf-profile.children.cycles-pp.enqueue_task_fair
      0.00            +0.0        0.03 ±143%  perf-profile.children.cycles-pp.lapic_next_deadline
      0.00            +0.0        0.03 ±143%  perf-profile.children.cycles-pp.note_gp_changes
      0.00            +0.0        0.03 ±147%  perf-profile.children.cycles-pp.rcu_sched_clock_irq
      0.00            +0.0        0.03 ±144%  perf-profile.children.cycles-pp.sched_clock
      0.00            +0.0        0.03 ±141%  perf-profile.children.cycles-pp.activate_task
      0.00            +0.0        0.03 ±142%  perf-profile.children.cycles-pp.native_sched_clock
      0.00            +0.0        0.03 ±142%  perf-profile.children.cycles-pp.irqtime_account_irq
      0.00            +0.0        0.03 ±142%  perf-profile.children.cycles-pp.ktime_get_update_offsets_now
      0.00            +0.0        0.03 ±142%  perf-profile.children.cycles-pp.sched_clock_cpu
      0.00            +0.0        0.03 ±142%  perf-profile.children.cycles-pp.update_rq_clock_task
      0.00            +0.0        0.04 ±141%  perf-profile.children.cycles-pp.complete
      0.00            +0.0        0.04 ±141%  perf-profile.children.cycles-pp.tick_nohz_next_event
      0.00            +0.0        0.04 ±141%  perf-profile.children.cycles-pp.tick_nohz_irq_exit
      0.00            +0.0        0.04 ±142%  perf-profile.children.cycles-pp.ttwu_do_activate
      0.00            +0.0        0.04 ±144%  perf-profile.children.cycles-pp.arch_scale_freq_tick
      0.00            +0.0        0.04 ±142%  perf-profile.children.cycles-pp.schedule_idle
      0.06 ±  9%      +0.0        0.10 ± 79%  perf-profile.children.cycles-pp.tmigr_requires_handle_remote
      0.00            +0.0        0.04 ±141%  perf-profile.children.cycles-pp.rcu_do_batch
      0.00            +0.0        0.04 ±141%  perf-profile.children.cycles-pp.tick_nohz_get_sleep_length
      0.04 ± 45%      +0.0        0.09 ± 78%  perf-profile.children.cycles-pp.get_jiffies_update
      0.00            +0.0        0.04 ±141%  perf-profile.children.cycles-pp._raw_spin_lock_irqsave
      0.00            +0.0        0.04 ±141%  perf-profile.children.cycles-pp._raw_spin_lock
      0.00            +0.0        0.04 ±152%  perf-profile.children.cycles-pp.sched_balance_newidle
      0.00            +0.0        0.05 ±141%  perf-profile.children.cycles-pp.native_irq_return_iret
      0.00            +0.0        0.05 ±141%  perf-profile.children.cycles-pp.try_to_wake_up
      0.04 ±101%      +0.0        0.09 ± 78%  perf-profile.children.cycles-pp.free_unref_page
      0.00            +0.0        0.05 ±141%  perf-profile.children.cycles-pp.clear_page_erms
      0.00            +0.0        0.05 ±151%  perf-profile.children.cycles-pp.pick_next_task_fair
      0.06 ± 77%      +0.0        0.10 ± 76%  perf-profile.children.cycles-pp.exit_mm
      0.00            +0.0        0.05 ±145%  perf-profile.children.cycles-pp.__cond_resched
      0.00            +0.0        0.05 ±141%  perf-profile.children.cycles-pp.__get_user_pages
      0.00            +0.0        0.05 ±141%  perf-profile.children.cycles-pp.__mm_populate
      0.00            +0.0        0.05 ±141%  perf-profile.children.cycles-pp.clear_huge_page
      0.00            +0.0        0.05 ±141%  perf-profile.children.cycles-pp.hugetlb_no_page
      0.00            +0.0        0.05 ±141%  perf-profile.children.cycles-pp.populate_vma_page_range
      0.00            +0.1        0.05 ±141%  perf-profile.children.cycles-pp.__mmap
      0.06 ± 77%      +0.1        0.11 ± 76%  perf-profile.children.cycles-pp.__mmput
      0.06 ± 77%      +0.1        0.11 ± 76%  perf-profile.children.cycles-pp.__x64_sys_exit_group
      0.06 ± 77%      +0.1        0.11 ± 76%  perf-profile.children.cycles-pp.do_group_exit
      0.06 ± 77%      +0.1        0.11 ± 76%  perf-profile.children.cycles-pp.exit_mmap
      0.00            +0.1        0.05 ±142%  perf-profile.children.cycles-pp.rest_init
      0.00            +0.1        0.05 ±142%  perf-profile.children.cycles-pp.start_kernel
      0.00            +0.1        0.05 ±142%  perf-profile.children.cycles-pp.x86_64_start_kernel
      0.00            +0.1        0.05 ±142%  perf-profile.children.cycles-pp.x86_64_start_reservations
      0.04 ±102%      +0.1        0.10 ± 77%  perf-profile.children.cycles-pp.tlb_finish_mmu
      0.06 ± 77%      +0.1        0.11 ± 75%  perf-profile.children.cycles-pp.do_exit
      0.00            +0.1        0.05 ±143%  perf-profile.children.cycles-pp.update_load_avg
      0.04 ±102%      +0.1        0.10 ± 77%  perf-profile.children.cycles-pp.__tlb_batch_free_encoded_pages
      0.04 ±102%      +0.1        0.10 ± 77%  perf-profile.children.cycles-pp.folios_put_refs
      0.04 ±102%      +0.1        0.10 ± 77%  perf-profile.children.cycles-pp.free_pages_and_swap_cache
      0.01 ±223%      +0.1        0.06 ±149%  perf-profile.children.cycles-pp.task_work_run
      0.01 ±223%      +0.1        0.06 ±148%  perf-profile.children.cycles-pp.task_mm_cid_work
      0.00            +0.1        0.06 ±141%  perf-profile.children.cycles-pp.ksys_mmap_pgoff
      0.00            +0.1        0.06 ±141%  perf-profile.children.cycles-pp.__update_blocked_fair
      0.00            +0.1        0.06 ±146%  perf-profile.children.cycles-pp.update_sg_lb_stats
      0.02 ±144%      +0.1        0.09 ±145%  perf-profile.children.cycles-pp.irqentry_exit_to_user_mode
      0.00            +0.1        0.07 ±118%  perf-profile.children.cycles-pp.vm_mmap_pgoff
      0.16 ± 35%      +0.1        0.23 ± 79%  perf-profile.children.cycles-pp.ksys_write
      0.16 ± 34%      +0.1        0.23 ± 79%  perf-profile.children.cycles-pp.vfs_write
      0.16 ± 35%      +0.1        0.23 ± 80%  perf-profile.children.cycles-pp.write
      0.00            +0.1        0.08 ±145%  perf-profile.children.cycles-pp.sched_balance_find_src_group
      0.00            +0.1        0.08 ±145%  perf-profile.children.cycles-pp.update_sd_lb_stats
      0.00            +0.1        0.08 ±148%  perf-profile.children.cycles-pp.schedule_timeout
      0.00            +0.1        0.09 ±146%  perf-profile.children.cycles-pp.__wait_for_common
      0.00            +0.1        0.09 ±132%  perf-profile.children.cycles-pp._nohz_idle_balance
      0.00            +0.1        0.09 ±145%  perf-profile.children.cycles-pp.tick_irq_enter
      0.00            +0.1        0.09 ±145%  perf-profile.children.cycles-pp.wait_for_completion_state
      0.00            +0.1        0.09 ±146%  perf-profile.children.cycles-pp.__wait_rcu_gp
      0.00            +0.1        0.09 ±142%  perf-profile.children.cycles-pp.sched_balance_update_blocked_averages
      0.00            +0.1        0.09 ±141%  perf-profile.children.cycles-pp.rcu_core
      0.00            +0.1        0.10 ±145%  perf-profile.children.cycles-pp.irq_enter_rcu
      0.00            +0.1        0.10 ±143%  perf-profile.children.cycles-pp.menu_select
      0.00            +0.1        0.10 ±145%  perf-profile.children.cycles-pp.hugetlb_vmemmap_optimize_folio
      0.13 ± 15%      +0.1        0.23 ± 68%  perf-profile.children.cycles-pp.sched_tick
      0.00            +0.1        0.10 ±141%  perf-profile.children.cycles-pp.sched_balance_domains
      0.00            +0.1        0.10 ±134%  perf-profile.children.cycles-pp.sysvec_call_function_single
      0.00            +0.1        0.11 ±146%  perf-profile.children.cycles-pp.schedule
      0.00            +0.1        0.11 ±143%  perf-profile.children.cycles-pp.sched_balance_rq
      0.05 ± 76%      +0.1        0.18 ± 76%  perf-profile.children.cycles-pp.io_serial_in
      0.12 ±125%      +0.1        0.24 ±144%  perf-profile.children.cycles-pp.prep_compound_page
      0.08 ± 57%      +0.1        0.21 ± 83%  perf-profile.children.cycles-pp.devkmsg_emit
      0.08 ± 57%      +0.1        0.21 ± 83%  perf-profile.children.cycles-pp.devkmsg_write
      0.07 ± 57%      +0.1        0.20 ± 70%  perf-profile.children.cycles-pp.wait_for_lsr
      0.09 ± 37%      +0.1        0.22 ± 72%  perf-profile.children.cycles-pp.serial8250_console_write
      0.00            +0.1        0.14 ±117%  perf-profile.children.cycles-pp.asm_sysvec_call_function_single
      0.13 ±124%      +0.1        0.27 ±144%  perf-profile.children.cycles-pp.__alloc_pages_noprof
      0.13 ±124%      +0.1        0.26 ±144%  perf-profile.children.cycles-pp.get_page_from_freelist
      0.10 ± 38%      +0.1        0.24 ± 73%  perf-profile.children.cycles-pp.console_flush_all
      0.10 ± 38%      +0.1        0.24 ± 73%  perf-profile.children.cycles-pp.console_unlock
      0.10 ± 38%      +0.1        0.24 ± 72%  perf-profile.children.cycles-pp.vprintk_emit
      0.10 ± 38%      +0.1        0.24 ± 73%  perf-profile.children.cycles-pp.ast_mode_config_helper_atomic_commit_tail
      0.10 ± 38%      +0.1        0.24 ± 73%  perf-profile.children.cycles-pp.ast_primary_plane_helper_atomic_update
      0.10 ± 38%      +0.1        0.24 ± 73%  perf-profile.children.cycles-pp.commit_tail
      0.10 ± 38%      +0.1        0.24 ± 73%  perf-profile.children.cycles-pp.drm_atomic_helper_commit_planes
      0.10 ± 38%      +0.1        0.24 ± 73%  perf-profile.children.cycles-pp.drm_atomic_helper_commit_tail_rpm
      0.10 ± 38%      +0.1        0.24 ± 73%  perf-profile.children.cycles-pp.drm_fb_memcpy
      0.10 ± 38%      +0.1        0.24 ± 73%  perf-profile.children.cycles-pp.memcpy_toio
      0.13 ±125%      +0.1        0.28 ±144%  perf-profile.children.cycles-pp.__alloc_fresh_hugetlb_folio
      0.13 ±124%      +0.1        0.28 ±144%  perf-profile.children.cycles-pp.__folio_alloc_noprof
      0.10 ± 38%      +0.1        0.24 ± 74%  perf-profile.children.cycles-pp.drm_atomic_commit
      0.10 ± 38%      +0.1        0.24 ± 74%  perf-profile.children.cycles-pp.drm_atomic_helper_commit
      0.10 ± 38%      +0.1        0.24 ± 74%  perf-profile.children.cycles-pp.drm_atomic_helper_dirtyfb
      0.00            +0.1        0.15 ±144%  perf-profile.children.cycles-pp.__schedule
      0.13 ±124%      +0.1        0.28 ±144%  perf-profile.children.cycles-pp.alloc_buddy_hugetlb_folio
      0.10 ± 36%      +0.2        0.25 ± 73%  perf-profile.children.cycles-pp.drm_fb_helper_damage_work
      0.10 ± 36%      +0.2        0.25 ± 73%  perf-profile.children.cycles-pp.drm_fbdev_generic_helper_fb_dirty
      0.11 ± 36%      +0.2        0.26 ± 73%  perf-profile.children.cycles-pp.process_one_work
      0.11 ± 36%      +0.2        0.27 ± 72%  perf-profile.children.cycles-pp.worker_thread
      0.09 ± 12%      +0.2        0.25 ± 89%  perf-profile.children.cycles-pp.clockevents_program_event
      0.12 ± 35%      +0.2        0.29 ± 70%  perf-profile.children.cycles-pp.kthread
      0.12 ± 35%      +0.2        0.29 ± 70%  perf-profile.children.cycles-pp.ret_from_fork
      0.12 ± 35%      +0.2        0.29 ± 70%  perf-profile.children.cycles-pp.ret_from_fork_asm
      0.00            +0.2        0.18 ±143%  perf-profile.children.cycles-pp.prep_new_hugetlb_folio
      0.00            +0.2        0.18 ±145%  perf-profile.children.cycles-pp.synchronize_rcu_normal
      0.24 ± 10%      +0.2        0.45 ± 69%  perf-profile.children.cycles-pp.update_process_times
      0.29 ± 33%      +0.2        0.53 ± 55%  perf-profile.children.cycles-pp.do_syscall_64
      0.29 ± 33%      +0.2        0.53 ± 55%  perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.08 ± 14%      +0.3        0.34 ±106%  perf-profile.children.cycles-pp.ktime_get
      0.27 ±  9%      +0.3        0.54 ± 71%  perf-profile.children.cycles-pp.tick_nohz_handler
      0.28 ±  9%      +0.3        0.56 ± 72%  perf-profile.children.cycles-pp.__hrtimer_run_queues
      0.00            +0.3        0.34 ±144%  perf-profile.children.cycles-pp._raw_spin_lock_irq
      0.00            +0.3        0.34 ±144%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
      0.03 ±100%      +0.4        0.39 ±119%  perf-profile.children.cycles-pp.irq_exit_rcu
      0.13 ±125%      +0.4        0.58 ±140%  perf-profile.children.cycles-pp.alloc_surplus_hugetlb_folio
      0.39 ±  7%      +0.5        0.87 ± 79%  perf-profile.children.cycles-pp.hrtimer_interrupt
      0.40 ±  7%      +0.5        0.90 ± 79%  perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
      0.14 ±126%      +0.6        0.73 ±141%  perf-profile.children.cycles-pp.alloc_hugetlb_folio
      0.43 ±  7%      +0.9        1.38 ± 96%  perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
      0.59 ± 15%      +1.9        2.48 ±108%  perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
      0.02 ±141%      +2.2        2.20 ±134%  perf-profile.children.cycles-pp.acpi_safe_halt
      0.02 ±141%      +2.2        2.20 ±134%  perf-profile.children.cycles-pp.acpi_idle_enter
      0.02 ±141%      +2.2        2.24 ±134%  perf-profile.children.cycles-pp.cpuidle_enter_state
      0.02 ±141%      +2.2        2.24 ±134%  perf-profile.children.cycles-pp.cpuidle_enter
      0.02 ±142%      +2.4        2.38 ±134%  perf-profile.children.cycles-pp.cpuidle_idle_call
      0.02 ±141%      +2.4        2.39 ±134%  perf-profile.children.cycles-pp.start_secondary
      0.03 ±102%      +2.4        2.46 ±133%  perf-profile.children.cycles-pp.do_idle
      0.03 ±102%      +2.4        2.46 ±133%  perf-profile.children.cycles-pp.common_startup_64
      0.03 ±102%      +2.4        2.46 ±133%  perf-profile.children.cycles-pp.cpu_startup_entry
     12.44 ±103%     +14.7       27.14 ±127%  perf-profile.children.cycles-pp.do_access
     10.22 ±128%     +14.8       25.04 ±141%  perf-profile.children.cycles-pp.do_user_addr_fault
     10.22 ±128%     +14.8       25.04 ±141%  perf-profile.children.cycles-pp.exc_page_fault
     10.20 ±128%     +14.8       25.04 ±141%  perf-profile.children.cycles-pp.hugetlb_fault
     10.22 ±128%     +14.9       25.08 ±141%  perf-profile.children.cycles-pp.handle_mm_fault
     10.24 ±128%     +14.9       25.15 ±141%  perf-profile.children.cycles-pp.asm_exc_page_fault
      7.82 ±127%     +15.6       23.38 ±141%  perf-profile.children.cycles-pp.copy_mc_enhanced_fast_string
      7.83 ±127%     +15.6       23.42 ±141%  perf-profile.children.cycles-pp.copy_subpage
      7.84 ±127%     +15.6       23.48 ±141%  perf-profile.children.cycles-pp.copy_user_large_folio
      8.00 ±127%     +16.3       24.27 ±141%  perf-profile.children.cycles-pp.hugetlb_wp
     83.41 ± 14%     -17.2       66.25 ± 55%  perf-profile.self.cycles-pp.do_rw_once
      1.56 ±127%      -1.0        0.59 ±160%  perf-profile.self.cycles-pp.mutex_spin_on_owner
      0.59 ±136%      -0.6        0.04 ±152%  perf-profile.self.cycles-pp.osq_lock
      0.72 ± 25%      -0.2        0.49 ± 53%  perf-profile.self.cycles-pp.lrand48_r@plt
      1.92 ± 14%      -0.2        1.72 ± 47%  perf-profile.self.cycles-pp.do_access
      2.53 ± 14%      -0.1        2.44 ± 45%  perf-profile.self.cycles-pp.nrand48_r
      0.30 ± 15%      -0.0        0.28 ± 50%  perf-profile.self.cycles-pp.lrand48_r
      0.01 ±223%      -0.0        0.00        perf-profile.self.cycles-pp.copy_page_from_iter_atomic
      0.00            +0.0        0.01 ±223%  perf-profile.self.cycles-pp.io_serial_out
      0.00            +0.0        0.01 ±223%  perf-profile.self.cycles-pp.nohz_balancer_kick
      0.00            +0.0        0.01 ±223%  perf-profile.self.cycles-pp.asm_sysvec_apic_timer_interrupt
      0.00            +0.0        0.01 ±223%  perf-profile.self.cycles-pp.note_gp_changes
      0.00            +0.0        0.01 ±223%  perf-profile.self.cycles-pp.rcu_all_qs
      0.00            +0.0        0.01 ±223%  perf-profile.self.cycles-pp.rcu_pending
      0.00            +0.0        0.01 ±223%  perf-profile.self.cycles-pp.delay_tsc
      0.00            +0.0        0.01 ±223%  perf-profile.self.cycles-pp.__update_load_avg_cfs_rq
      0.00            +0.0        0.01 ±223%  perf-profile.self.cycles-pp.update_rq_clock_task
      0.00            +0.0        0.02 ±141%  perf-profile.self.cycles-pp.___perf_sw_event
      0.00            +0.0        0.02 ±141%  perf-profile.self.cycles-pp.irqtime_account_irq
      0.00            +0.0        0.02 ±142%  perf-profile.self.cycles-pp.__schedule
      0.00            +0.0        0.02 ±142%  perf-profile.self.cycles-pp._raw_spin_lock_irq
      0.00            +0.0        0.02 ±142%  perf-profile.self.cycles-pp.native_apic_msr_eoi
      0.00            +0.0        0.02 ±142%  perf-profile.self.cycles-pp.update_irq_load_avg
      0.00            +0.0        0.02 ±141%  perf-profile.self.cycles-pp.copy_user_large_folio
      0.00            +0.0        0.02 ±146%  perf-profile.self.cycles-pp.idle_cpu
      0.00            +0.0        0.02 ±142%  perf-profile.self.cycles-pp.read_tsc
      0.00            +0.0        0.02 ±143%  perf-profile.self.cycles-pp.hugetlb_wp
      0.00            +0.0        0.02 ±141%  perf-profile.self.cycles-pp.tick_nohz_next_event
      0.00            +0.0        0.02 ±141%  perf-profile.self.cycles-pp._raw_spin_lock_irqsave
      0.00            +0.0        0.03 ±143%  perf-profile.self.cycles-pp.lapic_next_deadline
      0.00            +0.0        0.03 ±150%  perf-profile.self.cycles-pp.__cond_resched
      0.00            +0.0        0.03 ±141%  perf-profile.self.cycles-pp.native_sched_clock
      0.00            +0.0        0.03 ±141%  perf-profile.self.cycles-pp.sched_balance_domains
      0.00            +0.0        0.03 ±144%  perf-profile.self.cycles-pp.ktime_get_update_offsets_now
      0.00            +0.0        0.03 ±144%  perf-profile.self.cycles-pp.tick_nohz_handler
      0.00            +0.0        0.03 ±146%  perf-profile.self.cycles-pp.update_load_avg
      0.00            +0.0        0.04 ±143%  perf-profile.self.cycles-pp.copy_subpage
      0.00            +0.0        0.04 ±141%  perf-profile.self.cycles-pp.__update_blocked_fair
      0.00            +0.0        0.04 ±147%  perf-profile.self.cycles-pp.menu_select
      0.00            +0.0        0.04 ±144%  perf-profile.self.cycles-pp.arch_scale_freq_tick
      0.00            +0.0        0.04 ±141%  perf-profile.self.cycles-pp._raw_spin_lock
      0.04 ±101%      +0.0        0.08 ± 77%  perf-profile.self.cycles-pp.free_unref_page
      0.04 ± 45%      +0.0        0.09 ± 78%  perf-profile.self.cycles-pp.get_jiffies_update
      0.00            +0.0        0.04 ±141%  perf-profile.self.cycles-pp.clear_page_erms
      0.00            +0.0        0.05 ±141%  perf-profile.self.cycles-pp.native_irq_return_iret
      0.00            +0.0        0.05 ±148%  perf-profile.self.cycles-pp.update_sg_lb_stats
      0.01 ±223%      +0.1        0.06 ±149%  perf-profile.self.cycles-pp.task_mm_cid_work
      0.05 ± 76%      +0.1        0.17 ± 75%  perf-profile.self.cycles-pp.io_serial_in
      0.11 ±124%      +0.1        0.24 ±144%  perf-profile.self.cycles-pp.prep_compound_page
      0.10 ± 38%      +0.1        0.24 ± 73%  perf-profile.self.cycles-pp.memcpy_toio
      0.08 ± 14%      +0.2        0.32 ±105%  perf-profile.self.cycles-pp.ktime_get
      0.00            +0.3        0.34 ±144%  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
      0.00            +1.1        1.07 ±138%  perf-profile.self.cycles-pp.acpi_safe_halt
      7.77 ±127%     +15.5       23.22 ±141%  perf-profile.self.cycles-pp.copy_mc_enhanced_fast_string



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki




* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
  2024-07-09  5:11 [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression kernel test robot
@ 2024-07-10  6:22 ` Yu Zhao
  2024-07-14 12:26   ` Oliver Sang
  2024-07-17  7:52 ` Janosch Frank
  1 sibling, 1 reply; 14+ messages in thread
From: Yu Zhao @ 2024-07-10  6:22 UTC (permalink / raw)
  To: kernel test robot
  Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
	Muchun Song, David Hildenbrand, Frank van der Linden,
	Matthew Wilcox, Peter Xu, Yang Shi, linux-kernel, ying.huang,
	feng.tang, fengwei.yin

On Mon, Jul 8, 2024 at 11:11 PM kernel test robot <oliver.sang@intel.com> wrote:
>
> Hello,
>
> kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
>
>
> commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")

This is likely caused by synchronize_rcu() wandering into the
allocation path. I'll patch that up soon.
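
(For context on the cost Yu Zhao is describing: commit 875fa64577 put a blocking
synchronize_rcu() on the hugetlb allocation path, so every vmemmap restore waits
for a full RCU grace period. Below is a toy userspace model of why a per-folio
wait dwarfs a once-per-batch wait; GRACE_MS and FOLIOS are assumed illustrative
numbers, not measurements from this report:)

/*
 * Toy model, not kernel code: total stall from one synchronize_rcu()
 * per restored folio versus a single wait covering the whole batch.
 */
#include <stdio.h>

#define GRACE_MS 25L   /* assumed RCU grace-period latency, in ms */
#define FOLIOS 1000L   /* assumed number of folios restored in a run */

int main(void)
{
        long per_folio = FOLIOS * GRACE_MS; /* wait on every restore */
        long batched = GRACE_MS;            /* wait once for the batch */

        printf("per-folio waits: %ld ms of stall\n", per_folio);
        printf("batched wait:    %ld ms of stall\n", batched);
        return 0;
}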



* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
  2024-07-10  6:22 ` Yu Zhao
@ 2024-07-14 12:26   ` Oliver Sang
  2024-07-15  2:40     ` Muchun Song
  0 siblings, 1 reply; 14+ messages in thread
From: Oliver Sang @ 2024-07-14 12:26 UTC (permalink / raw)
  To: Yu Zhao
  Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
	Muchun Song, David Hildenbrand, Frank van der Linden,
	Matthew Wilcox, Peter Xu, Yang Shi, linux-kernel, ying.huang,
	feng.tang, fengwei.yin, oliver.sang

Hi Yu Zhao,

On Wed, Jul 10, 2024 at 12:22:40AM -0600, Yu Zhao wrote:
> On Mon, Jul 8, 2024 at 11:11 PM kernel test robot <oliver.sang@intel.com> wrote:
> >
> > Hello,
> >
> > kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
> >
> >
> > commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> 
> This is likely caused by synchronize_rcu() wandering into the
> allocation path. I'll patch that up soon.
> 

We noticed this commit has already been merged into mainline:

[bd225530a4c717714722c3731442b78954c765b3] mm/hugetlb_vmemmap: fix race with speculative PFN walkers
branch: linus/master

and the regression still exists in our tests. Do you want us to test your
patch? Thanks!



* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
  2024-07-14 12:26   ` Oliver Sang
@ 2024-07-15  2:40     ` Muchun Song
  2024-07-15  4:08       ` Oliver Sang
  0 siblings, 1 reply; 14+ messages in thread
From: Muchun Song @ 2024-07-15  2:40 UTC (permalink / raw)
  To: Oliver Sang
  Cc: Yu Zhao, oe-lkp, kernel test robot, Linux Memory Management List,
	Andrew Morton, David Hildenbrand, Frank van der Linden,
	Matthew Wilcox, Peter Xu, Yang Shi, LKML, Huang Ying, Feng Tang,
	Yin Fengwei



> On Jul 14, 2024, at 20:26, Oliver Sang <oliver.sang@intel.com> wrote:
> 
> Hi Yu Zhao,
> 
> On Wed, Jul 10, 2024 at 12:22:40AM -0600, Yu Zhao wrote:
>> On Mon, Jul 8, 2024 at 11:11 PM kernel test robot <oliver.sang@intel.com> wrote:
>>> 
>>> Hello,
>>> 
>>> kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
>>> 
>>> 
>>> commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
>> 
>> This is likely caused by synchronize_rcu() wandering into the
>> allocation path. I'll patch that up soon.
>> 
> 
> We noticed this commit has already been merged into mainline:
> 
> [bd225530a4c717714722c3731442b78954c765b3] mm/hugetlb_vmemmap: fix race with speculative PFN walkers
> branch: linus/master

Did you test with HVO enabled? There are two ways to enable HVO: 1) add "hugetlb_free_vmemmap=on" to the kernel
command line, or 2) write 1 to /proc/sys/vm/hugetlb_optimize_vmemmap. I want to confirm whether the regression
is related to the HVO routine.

Thanks.

> 
> and the regression still exists in our tests. Do you want us to test your
> patch? Thanks!
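
(For reference, a minimal userspace check of the sysctl Muchun mentions; this is
a sketch that assumes a Linux host exposing /proc/sys/vm/hugetlb_optimize_vmemmap:)

/* Report whether HVO is currently enabled by reading the sysctl
 * named above.  Errors out if the file is absent or unreadable. */
#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/sys/vm/hugetlb_optimize_vmemmap", "r");
        int val;

        if (!f) {
                perror("hugetlb_optimize_vmemmap");
                return 1;
        }
        if (fscanf(f, "%d", &val) != 1) {
                fprintf(stderr, "unexpected sysctl contents\n");
                fclose(f);
                return 1;
        }
        fclose(f);
        printf("HVO is %s\n", val ? "enabled" : "disabled");
        return 0;
}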




* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
  2024-07-15  2:40     ` Muchun Song
@ 2024-07-15  4:08       ` Oliver Sang
  0 siblings, 0 replies; 14+ messages in thread
From: Oliver Sang @ 2024-07-15  4:08 UTC (permalink / raw)
  To: Muchun Song
  Cc: Yu Zhao, oe-lkp, kernel test robot, Linux Memory Management List,
	Andrew Morton, David Hildenbrand, Frank van der Linden,
	Matthew Wilcox, Peter Xu, Yang Shi, LKML, Huang Ying, Feng Tang,
	Yin Fengwei, oliver.sang

Hi Muchun Song,

On Mon, Jul 15, 2024 at 10:40:43AM +0800, Muchun Song wrote:
> 
> 
> > On Jul 14, 2024, at 20:26, Oliver Sang <oliver.sang@intel.com> wrote:
> > 
> > hi, Yu Zhao,
> > 
> > On Wed, Jul 10, 2024 at 12:22:40AM -0600, Yu Zhao wrote:
> >> On Mon, Jul 8, 2024 at 11:11 PM kernel test robot <oliver.sang@intel.com> wrote:
> >>> 
> >>> Hello,
> >>> 
> >>> kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
> >>> 
> >>> 
> >>> commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> >> 
> >> This is likely caused by synchronize_rcu() wandering into the
> >> allocation path. I'll patch that up soon.
> >> 
> > 
> > we noticed this commit has already been merged into mainline
> > 
> > [bd225530a4c717714722c3731442b78954c765b3] mm/hugetlb_vmemmap: fix race with speculative PFN walkers
> > branch: linus/master
> 
> Did you test with HVO enabled? There are two ways to enable HVO: 1) add "hugetlb_free_vmemmap=on" to the kernel
> command line, or 2) write 1 to /proc/sys/vm/hugetlb_optimize_vmemmap. I want to confirm whether the regression
> is related to the HVO routine.

We found a strange thing: after adding 'hugetlb_free_vmemmap=on', the data
become unstable from run to run (we use kexec to go from one job to the next).
Below is the data for 875fa64577 + 'hugetlb_free_vmemmap=on':

  "vm-scalability.throughput": [
    611622,
    645261,
    705923,
    833589,
    840140,
    884010
  ],


As a comparison, without 'hugetlb_free_vmemmap=on', for 875fa64577:

  "vm-scalability.throughput": [
    4597606,
    4357960,
    4385331,
    4631803,
    4554570,
    4462691
  ],

For 73236245e0 (the parent of 875fa64577):

  "vm-scalability.throughput": [
    6866441,
    6769773,
    6942991,
    6877124,
    6785790,
    6812001
  ],

> 
> Thanks.
> 
> > 
> > and the regression still exists in our tests. Do you want us to test your
> > patch? Thanks!
> 



* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
  2024-07-09  5:11 [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression kernel test robot
  2024-07-10  6:22 ` Yu Zhao
@ 2024-07-17  7:52 ` Janosch Frank
  2024-07-17  7:59   ` Christian Borntraeger
  2024-07-17  8:36   ` Yu Zhao
  1 sibling, 2 replies; 14+ messages in thread
From: Janosch Frank @ 2024-07-17  7:52 UTC (permalink / raw)
  To: Yu Zhao
  Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
	Muchun Song, David Hildenbrand, Frank van der Linden,
	Matthew Wilcox, Peter Xu, Yang Shi, linux-kernel, ying.huang,
	feng.tang, fengwei.yin, Christian Borntraeger, Claudio Imbrenda,
	Marc Hartmayer, Heiko Carstens

On 7/9/24 07:11, kernel test robot wrote:
> Hello,
> 
> kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
> 
> 
> commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> 
> [the regression is still present on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
>
This has hit s390 huge page backed KVM guests as well.
Our simple start/stop test case went from ~5 to over 50 seconds of runtime.



* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
  2024-07-17  7:52 ` Janosch Frank
@ 2024-07-17  7:59   ` Christian Borntraeger
  2024-07-17  8:36   ` Yu Zhao
  1 sibling, 0 replies; 14+ messages in thread
From: Christian Borntraeger @ 2024-07-17  7:59 UTC (permalink / raw)
  To: Janosch Frank, Yu Zhao
  Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
	Muchun Song, David Hildenbrand, Frank van der Linden,
	Matthew Wilcox, Peter Xu, Yang Shi, linux-kernel, ying.huang,
	feng.tang, fengwei.yin, Claudio Imbrenda, Marc Hartmayer,
	Heiko Carstens

Am 17.07.24 um 09:52 schrieb Janosch Frank:
> On 7/9/24 07:11, kernel test robot wrote:
>> Hello,
>>
>> kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
>>
>>
>> commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>
>> [still regression on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
>>
> This has hit s390 huge page backed KVM guests as well.
> Our simple start/stop test case went from ~5 to over 50 seconds of runtime.

Could this be one of the synchronize_rcu() calls? This patch adds lots of them. On s390 with HZ=100 those are really expensive.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
  2024-07-17  7:52 ` Janosch Frank
  2024-07-17  7:59   ` Christian Borntraeger
@ 2024-07-17  8:36   ` Yu Zhao
  2024-07-17 15:44     ` Yu Zhao
  1 sibling, 1 reply; 14+ messages in thread
From: Yu Zhao @ 2024-07-17  8:36 UTC (permalink / raw)
  To: Janosch Frank, kernel test robot
  Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
	Muchun Song, David Hildenbrand, Frank van der Linden,
	Matthew Wilcox, Peter Xu, Yang Shi, linux-kernel, ying.huang,
	feng.tang, fengwei.yin, Christian Borntraeger, Claudio Imbrenda,
	Marc Hartmayer, Heiko Carstens

[-- Attachment #1: Type: text/plain, Size: 746 bytes --]

Hi Janosch and Oliver,

On Wed, Jul 17, 2024 at 1:57 AM Janosch Frank <frankja@linux.ibm.com> wrote:
>
> On 7/9/24 07:11, kernel test robot wrote:
> > Hello,
> >
> > kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
> >
> >
> > commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> >
> > [still regression on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
> >
> This has hit s390 huge page backed KVM guests as well.
> Our simple start/stop test case went from ~5 to over 50 seconds of runtime.

Could you try the attached patch please? Thank you.

[-- Attachment #2: hugetlb-fix.patch --]
[-- Type: application/octet-stream, Size: 3952 bytes --]

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 8193906515c6..9e6fc4ce8d2b 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -43,6 +43,8 @@ struct vmemmap_remap_walk {
 #define VMEMMAP_SPLIT_NO_TLB_FLUSH	BIT(0)
 /* Skip the TLB flush when we remap the PTE */
 #define VMEMMAP_REMAP_NO_TLB_FLUSH	BIT(1)
+/* synchronize_rcu() to avoid writes from page_ref_add_unless() */
+#define VMEMMAP_SYNCHRONIZE_RCU		BIT(2)
 	unsigned long		flags;
 };
 
@@ -451,6 +453,9 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
 	if (!folio_test_hugetlb_vmemmap_optimized(folio))
 		return 0;
 
+	if (flags & VMEMMAP_SYNCHRONIZE_RCU)
+		synchronize_rcu();
+
 	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
 	vmemmap_reuse	= vmemmap_start;
 	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;
@@ -483,10 +488,7 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
  */
 int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio)
 {
-	/* avoid writes from page_ref_add_unless() while unfolding vmemmap */
-	synchronize_rcu();
-
-	return __hugetlb_vmemmap_restore_folio(h, folio, 0);
+	return __hugetlb_vmemmap_restore_folio(h, folio, VMEMMAP_SYNCHRONIZE_RCU);
 }
 
 /**
@@ -509,14 +511,13 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 	struct folio *folio, *t_folio;
 	long restored = 0;
 	long ret = 0;
-
-	/* avoid writes from page_ref_add_unless() while unfolding vmemmap */
-	synchronize_rcu();
+	unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU;
 
 	list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
 		if (folio_test_hugetlb_vmemmap_optimized(folio)) {
-			ret = __hugetlb_vmemmap_restore_folio(h, folio,
-							      VMEMMAP_REMAP_NO_TLB_FLUSH);
+			ret = __hugetlb_vmemmap_restore_folio(h, folio, flags);
+			flags &= VMEMMAP_SYNCHRONIZE_RCU;
+
 			if (ret)
 				break;
 			restored++;
@@ -564,6 +565,9 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
 		return ret;
 
 	static_branch_inc(&hugetlb_optimize_vmemmap_key);
+
+	if (flags & VMEMMAP_SYNCHRONIZE_RCU)
+		synchronize_rcu();
 	/*
 	 * Very Subtle
 	 * If VMEMMAP_REMAP_NO_TLB_FLUSH is set, TLB flushing is not performed
@@ -611,10 +615,7 @@ void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
 {
 	LIST_HEAD(vmemmap_pages);
 
-	/* avoid writes from page_ref_add_unless() while folding vmemmap */
-	synchronize_rcu();
-
-	__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, 0);
+	__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, VMEMMAP_SYNCHRONIZE_RCU);
 	free_vmemmap_page_list(&vmemmap_pages);
 }
 
@@ -641,6 +642,7 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
 {
 	struct folio *folio;
 	LIST_HEAD(vmemmap_pages);
+	unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU;
 
 	list_for_each_entry(folio, folio_list, lru) {
 		int ret = hugetlb_vmemmap_split_folio(h, folio);
@@ -657,14 +659,11 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
 
 	flush_tlb_all();
 
-	/* avoid writes from page_ref_add_unless() while folding vmemmap */
-	synchronize_rcu();
-
 	list_for_each_entry(folio, folio_list, lru) {
 		int ret;
 
-		ret = __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages,
-						       VMEMMAP_REMAP_NO_TLB_FLUSH);
+		ret = __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, flags);
+		flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
 
 		/*
 		 * Pages to be freed may have been accumulated.  If we
@@ -678,8 +677,7 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
 			flush_tlb_all();
 			free_vmemmap_page_list(&vmemmap_pages);
 			INIT_LIST_HEAD(&vmemmap_pages);
-			__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages,
-							 VMEMMAP_REMAP_NO_TLB_FLUSH);
+			__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, flags);
 		}
 	}
 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
  2024-07-17  8:36   ` Yu Zhao
@ 2024-07-17 15:44     ` Yu Zhao
  2024-07-18  9:23       ` Marc Hartmayer
  2024-07-19  8:42       ` Oliver Sang
  0 siblings, 2 replies; 14+ messages in thread
From: Yu Zhao @ 2024-07-17 15:44 UTC (permalink / raw)
  To: Janosch Frank, kernel test robot
  Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
	Muchun Song, David Hildenbrand, Frank van der Linden,
	Matthew Wilcox, Peter Xu, Yang Shi, linux-kernel, ying.huang,
	feng.tang, fengwei.yin, Christian Borntraeger, Claudio Imbrenda,
	Marc Hartmayer, Heiko Carstens, Yosry Ahmed

[-- Attachment #1: Type: text/plain, Size: 1086 bytes --]

On Wed, Jul 17, 2024 at 2:36 AM Yu Zhao <yuzhao@google.com> wrote:
>
> Hi Janosch and Oliver,
>
> On Wed, Jul 17, 2024 at 1:57 AM Janosch Frank <frankja@linux.ibm.com> wrote:
> >
> > On 7/9/24 07:11, kernel test robot wrote:
> > > Hello,
> > >
> > > kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
> > >
> > >
> > > commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > >
> > > [still regression on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
> > >
> > This has hit s390 huge page backed KVM guests as well.
> > Our simple start/stop test case went from ~5 to over 50 seconds of runtime.
>
> Could you try the attached patch please? Thank you.

Thanks, Yosry, for spotting the following typo:
  flags &= VMEMMAP_SYNCHRONIZE_RCU;
It's supposed to be:
  flags &= ~VMEMMAP_SYNCHRONIZE_RCU;

Reattaching v2 with the above typo fixed. Please let me know, Janosch & Oliver.
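
(For context, the mechanics of the fix, as a minimal sketch of the loop
pattern taken from the diff below; all identifiers are from the patch itself:

	unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU;

	list_for_each_entry(folio, folio_list, lru) {
		/* only the first folio in the batch pays the synchronize_rcu() cost */
		ret = __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, flags);
		flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
		...
	}

so a batch of N folios waits for one RCU grace period instead of N. With the
v1 typo, "flags &= VMEMMAP_SYNCHRONIZE_RCU" would instead clear
VMEMMAP_REMAP_NO_TLB_FLUSH and keep the sync bit set, i.e. the opposite of
the intent.)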

[-- Attachment #2: hugetlb-fix-v2.patch --]
[-- Type: application/octet-stream, Size: 3953 bytes --]

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 8193906515c6..9e6fc4ce8d2b 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -43,6 +43,8 @@ struct vmemmap_remap_walk {
 #define VMEMMAP_SPLIT_NO_TLB_FLUSH	BIT(0)
 /* Skip the TLB flush when we remap the PTE */
 #define VMEMMAP_REMAP_NO_TLB_FLUSH	BIT(1)
+/* synchronize_rcu() to avoid writes from page_ref_add_unless() */
+#define VMEMMAP_SYNCHRONIZE_RCU		BIT(2)
 	unsigned long		flags;
 };
 
@@ -451,6 +453,9 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
 	if (!folio_test_hugetlb_vmemmap_optimized(folio))
 		return 0;
 
+	if (flags & VMEMMAP_SYNCHRONIZE_RCU)
+		synchronize_rcu();
+
 	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
 	vmemmap_reuse	= vmemmap_start;
 	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;
@@ -483,10 +488,7 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
  */
 int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio)
 {
-	/* avoid writes from page_ref_add_unless() while unfolding vmemmap */
-	synchronize_rcu();
-
-	return __hugetlb_vmemmap_restore_folio(h, folio, 0);
+	return __hugetlb_vmemmap_restore_folio(h, folio, VMEMMAP_SYNCHRONIZE_RCU);
 }
 
 /**
@@ -509,14 +511,13 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 	struct folio *folio, *t_folio;
 	long restored = 0;
 	long ret = 0;
-
-	/* avoid writes from page_ref_add_unless() while unfolding vmemmap */
-	synchronize_rcu();
+	unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU;
 
 	list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
 		if (folio_test_hugetlb_vmemmap_optimized(folio)) {
-			ret = __hugetlb_vmemmap_restore_folio(h, folio,
-							      VMEMMAP_REMAP_NO_TLB_FLUSH);
+			ret = __hugetlb_vmemmap_restore_folio(h, folio, flags);
+			flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
+
 			if (ret)
 				break;
 			restored++;
@@ -564,6 +565,9 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
 		return ret;
 
 	static_branch_inc(&hugetlb_optimize_vmemmap_key);
+
+	if (flags & VMEMMAP_SYNCHRONIZE_RCU)
+		synchronize_rcu();
 	/*
 	 * Very Subtle
 	 * If VMEMMAP_REMAP_NO_TLB_FLUSH is set, TLB flushing is not performed
@@ -611,10 +615,7 @@ void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
 {
 	LIST_HEAD(vmemmap_pages);
 
-	/* avoid writes from page_ref_add_unless() while folding vmemmap */
-	synchronize_rcu();
-
-	__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, 0);
+	__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, VMEMMAP_SYNCHRONIZE_RCU);
 	free_vmemmap_page_list(&vmemmap_pages);
 }
 
@@ -641,6 +642,7 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
 {
 	struct folio *folio;
 	LIST_HEAD(vmemmap_pages);
+	unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH | VMEMMAP_SYNCHRONIZE_RCU;
 
 	list_for_each_entry(folio, folio_list, lru) {
 		int ret = hugetlb_vmemmap_split_folio(h, folio);
@@ -657,14 +659,11 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
 
 	flush_tlb_all();
 
-	/* avoid writes from page_ref_add_unless() while folding vmemmap */
-	synchronize_rcu();
-
 	list_for_each_entry(folio, folio_list, lru) {
 		int ret;
 
-		ret = __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages,
-						       VMEMMAP_REMAP_NO_TLB_FLUSH);
+		ret = __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, flags);
+		flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
 
 		/*
 		 * Pages to be freed may have been accumulated.  If we
@@ -678,8 +677,7 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
 			flush_tlb_all();
 			free_vmemmap_page_list(&vmemmap_pages);
 			INIT_LIST_HEAD(&vmemmap_pages);
-			__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages,
-							 VMEMMAP_REMAP_NO_TLB_FLUSH);
+			__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, flags);
 		}
 	}
 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
  2024-07-17 15:44     ` Yu Zhao
@ 2024-07-18  9:23       ` Marc Hartmayer
  2024-07-19  8:42       ` Oliver Sang
  1 sibling, 0 replies; 14+ messages in thread
From: Marc Hartmayer @ 2024-07-18  9:23 UTC (permalink / raw)
  To: Yu Zhao, Janosch Frank, kernel test robot
  Cc: oe-lkp, lkp, Linux Memory Management List, Andrew Morton,
	Muchun Song, David Hildenbrand, Frank van der Linden,
	Matthew Wilcox, Peter Xu, Yang Shi, linux-kernel, ying.huang,
	feng.tang, fengwei.yin, Christian Borntraeger, Claudio Imbrenda,
	Heiko Carstens, Yosry Ahmed

On Wed, Jul 17, 2024 at 09:44 AM -0600, Yu Zhao <yuzhao@google.com> wrote:
> On Wed, Jul 17, 2024 at 2:36 AM Yu Zhao <yuzhao@google.com> wrote:
>>
>> Hi Janosch and Oliver,
>>
>> On Wed, Jul 17, 2024 at 1:57 AM Janosch Frank <frankja@linux.ibm.com> wrote:
>> >
>> > On 7/9/24 07:11, kernel test robot wrote:
>> > > Hello,
>> > >
>> > > kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
>> > >
>> > >
>> > > commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
>> > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>> > >
>> > > [still regression on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
>> > >
>> > This has hit s390 huge page backed KVM guests as well.
>> > Our simple start/stop test case went from ~5 to over 50 seconds of runtime.
>>
>> Could you try the attached patch please? Thank you.
>

Hi,

Thanks a lot for the fix; it has fixed the problem on s390.

-- 
Kind regards / Beste Grüße
   Marc Hartmayer

IBM Deutschland Research & Development GmbH
Chairman of the Supervisory Board: Wolfgang Wendt
Management: David Faller
Registered office: Böblingen
Commercial register: Amtsgericht Stuttgart, HRB 243294


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
  2024-07-17 15:44     ` Yu Zhao
  2024-07-18  9:23       ` Marc Hartmayer
@ 2024-07-19  8:42       ` Oliver Sang
  2024-07-19 16:06         ` Yu Zhao
  1 sibling, 1 reply; 14+ messages in thread
From: Oliver Sang @ 2024-07-19  8:42 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Janosch Frank, oe-lkp, lkp, Linux Memory Management List,
	Andrew Morton, Muchun Song, David Hildenbrand,
	Frank van der Linden, Matthew Wilcox, Peter Xu, Yang Shi,
	linux-kernel, ying.huang, feng.tang, fengwei.yin,
	Christian Borntraeger, Claudio Imbrenda, Marc Hartmayer,
	Heiko Carstens, Yosry Ahmed, oliver.sang

hi, Yu Zhao,

On Wed, Jul 17, 2024 at 09:44:33AM -0600, Yu Zhao wrote:
> On Wed, Jul 17, 2024 at 2:36 AM Yu Zhao <yuzhao@google.com> wrote:
> >
> > Hi Janosch and Oliver,
> >
> > On Wed, Jul 17, 2024 at 1:57 AM Janosch Frank <frankja@linux.ibm.com> wrote:
> > >
> > > On 7/9/24 07:11, kernel test robot wrote:
> > > > Hello,
> > > >
> > > > kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
> > > >
> > > >
> > > > commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> > > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > > >
> > > > [still regression on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
> > > >
> > > This has hit s390 huge page backed KVM guests as well.
> > > Our simple start/stop test case went from ~5 to over 50 seconds of runtime.
> >
> > Could you try the attached patch please? Thank you.
> 
> Thanks, Yosry, for spotting the following typo:
>   flags &= VMEMMAP_SYNCHRONIZE_RCU;
> It's supposed to be:
>   flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
> 
> Reattaching v2 with the above typo fixed. Please let me know, Janosch & Oliver.

since the commit is in mainline now, I directly applied your v2 patch on top of
bd225530a4c71 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")

in our tests, your v2 patch not only recovers the performance regression, it
even shows a +13.7% performance improvement over 5a4d8944d6b1e (parent of
bd225530a4c71)

details are as below

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
  gcc-13/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/300s/512G/lkp-icl-2sp2/anon-cow-rand-hugetlb/vm-scalability

commit:
  5a4d8944d6b1e ("cachestat: do not flush stats in recency check")
  bd225530a4c71 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
  9a5b87b521401 <---- your v2 patch

5a4d8944d6b1e1aa bd225530a4c717714722c373144 9a5b87b5214018a2be217dc4648
---------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \
 4.271e+09 ± 10%    +348.4%  1.915e+10 ±  6%     -39.9%  2.567e+09 ± 20%  cpuidle..time
    774593 ±  4%   +1060.9%    8992186 ±  6%     -17.2%     641254        cpuidle..usage
    555365 ±  8%     +28.0%     710795 ±  2%      -4.5%     530157 ±  5%  numa-numastat.node0.local_node
    629633 ±  4%     +23.0%     774346 ±  5%      +0.6%     633264 ±  4%  numa-numastat.node0.numa_hit
    255.76 ±  2%     +31.1%     335.40 ±  3%     -13.8%     220.53 ±  2%  uptime.boot
     10305 ±  6%    +144.3%      25171 ±  5%     -17.1%       8543 ±  8%  uptime.idle
      1.83 ± 58%  +96200.0%       1765 ±155%    +736.4%      15.33 ± 24%  perf-c2c.DRAM.local
     33.00 ± 16%  +39068.2%      12925 ±122%     +95.5%      64.50 ± 49%  perf-c2c.DRAM.remote
     21.33 ±  8%   +2361.7%     525.17 ± 31%    +271.1%      79.17 ± 52%  perf-c2c.HITM.local
      9.17 ± 21%   +3438.2%     324.33 ± 57%    +270.9%      34.00 ± 60%  perf-c2c.HITM.remote
     16.11 ±  7%     +37.1       53.16 ±  2%      -4.6       11.50 ± 19%  mpstat.cpu.all.idle%
      0.34 ±  2%      -0.1        0.22            +0.0        0.35 ±  3%  mpstat.cpu.all.irq%
      0.03 ±  5%      +0.0        0.04 ±  8%      -0.0        0.02        mpstat.cpu.all.soft%
     10.58 ±  4%      -9.5        1.03 ± 36%      +0.1       10.71 ±  2%  mpstat.cpu.all.sys%
     72.94 ±  2%     -27.4       45.55 ±  3%      +4.5       77.41 ±  2%  mpstat.cpu.all.usr%
      6.00 ± 16%    +230.6%      19.83 ±  5%      +8.3%       6.50 ± 17%  mpstat.max_utilization.seconds
     16.95 ±  7%    +215.5%      53.48 ±  2%     -26.2%      12.51 ± 16%  vmstat.cpu.id
     72.33 ±  2%     -37.4%      45.31 ±  3%      +6.0%      76.65 ±  2%  vmstat.cpu.us
 2.254e+08            -0.0%  2.254e+08           +14.7%  2.584e+08        vmstat.memory.free
    108.30           -43.3%      61.43 ±  2%      +5.4%     114.12 ±  2%  vmstat.procs.r
      2659          +162.6%       6982 ±  3%      +3.6%       2753 ±  4%  vmstat.system.cs
    136384 ±  4%     -21.9%     106579 ±  7%     +13.3%     154581 ±  3%  vmstat.system.in
    203.41 ±  2%     +39.2%     283.06 ±  4%     -17.1%     168.71 ±  2%  time.elapsed_time
    203.41 ±  2%     +39.2%     283.06 ±  4%     -17.1%     168.71 ±  2%  time.elapsed_time.max
    148901 ±  6%     -45.6%      81059 ±  4%      -8.8%     135748 ±  8%  time.involuntary_context_switches
    169.83 ± 23%     +85.3%     314.67 ±  8%      +7.9%     183.33 ±  7%  time.major_page_faults
     10697           -43.4%       6050 ±  2%      +5.6%      11294 ±  2%  time.percent_of_cpu_this_job_got
      2740 ±  6%     -86.7%     365.06 ± 43%     -16.1%       2298        time.system_time
     19012           -11.9%      16746           -11.9%      16747        time.user_time
     14412 ±  5%   +4432.0%     653187           -16.6%      12025 ±  3%  time.voluntary_context_switches
     50095 ±  2%     -31.5%      34325 ±  2%     +18.6%      59408        vm-scalability.median
      8.25 ± 16%      -3.4        4.84 ± 22%      -6.6        1.65 ± 15%  vm-scalability.median_stddev%
   6863720           -34.0%    4532485           +13.7%    7805408        vm-scalability.throughput
    203.41 ±  2%     +39.2%     283.06 ±  4%     -17.1%     168.71 ±  2%  vm-scalability.time.elapsed_time
    203.41 ±  2%     +39.2%     283.06 ±  4%     -17.1%     168.71 ±  2%  vm-scalability.time.elapsed_time.max
    148901 ±  6%     -45.6%      81059 ±  4%      -8.8%     135748 ±  8%  vm-scalability.time.involuntary_context_switches
     10697           -43.4%       6050 ±  2%      +5.6%      11294 ±  2%  vm-scalability.time.percent_of_cpu_this_job_got
      2740 ±  6%     -86.7%     365.06 ± 43%     -16.1%       2298        vm-scalability.time.system_time
     19012           -11.9%      16746           -11.9%      16747        vm-scalability.time.user_time
     14412 ±  5%   +4432.0%     653187           -16.6%      12025 ±  3%  vm-scalability.time.voluntary_context_switches
 1.159e+09            +0.0%  1.159e+09            +1.6%  1.178e+09        vm-scalability.workload
  22900043 ±  4%      +1.2%   23166356 ±  6%     -16.7%   19076170 ±  5%  numa-vmstat.node0.nr_free_pages
     42856 ± 43%    +998.5%     470779 ± 51%    +318.6%     179409 ±154%  numa-vmstat.node0.nr_unevictable
     42856 ± 43%    +998.5%     470779 ± 51%    +318.6%     179409 ±154%  numa-vmstat.node0.nr_zone_unevictable
    629160 ±  4%     +22.9%     773391 ±  5%      +0.5%     632570 ±  4%  numa-vmstat.node0.numa_hit
    554892 ±  8%     +27.9%     709841 ±  2%      -4.6%     529463 ±  5%  numa-vmstat.node0.numa_local
     27469 ± 14%      +0.0%      27475 ± 41%     -31.7%      18763 ± 13%  numa-vmstat.node1.nr_active_anon
    767179 ±  2%     -55.8%     339212 ± 72%     -19.7%     616417 ± 43%  numa-vmstat.node1.nr_file_pages
  10693349 ±  5%     +46.3%   15639681 ±  7%     +69.4%   18112002 ±  3%  numa-vmstat.node1.nr_free_pages
     14210 ± 27%     -65.0%       4973 ± 49%     -34.7%       9280 ± 39%  numa-vmstat.node1.nr_mapped
    724050 ±  2%     -59.1%     296265 ± 82%     -18.9%     587498 ± 47%  numa-vmstat.node1.nr_unevictable
     27469 ± 14%      +0.0%      27475 ± 41%     -31.7%      18763 ± 13%  numa-vmstat.node1.nr_zone_active_anon
    724050 ±  2%     -59.1%     296265 ± 82%     -18.9%     587498 ± 47%  numa-vmstat.node1.nr_zone_unevictable
    120619 ± 11%     +13.6%     137042 ± 27%     -31.2%      82976 ±  7%  meminfo.Active
    120472 ± 11%     +13.6%     136895 ± 27%     -31.2%      82826 ±  7%  meminfo.Active(anon)
  70234807           +14.6%   80512468           +10.2%   77431344        meminfo.CommitLimit
 2.235e+08            +0.1%  2.237e+08           +15.1%  2.573e+08        meminfo.DirectMap1G
     44064           -22.8%      34027 ±  2%     +20.7%      53164 ±  2%  meminfo.HugePages_Surp
     44064           -22.8%      34027 ±  2%     +20.7%      53164 ±  2%  meminfo.HugePages_Total
  90243440           -22.8%   69688103 ±  2%     +20.7%  1.089e+08 ±  2%  meminfo.Hugetlb
     70163 ± 29%     -42.6%      40293 ± 11%     -21.9%      54789 ± 15%  meminfo.Mapped
 1.334e+08           +15.5%  1.541e+08           +10.7%  1.477e+08        meminfo.MemAvailable
 1.344e+08           +15.4%  1.551e+08           +10.7%  1.488e+08        meminfo.MemFree
 2.307e+08            +0.0%  2.307e+08           +14.3%  2.637e+08        meminfo.MemTotal
  96309843           -21.5%   75639108 ±  2%     +19.4%   1.15e+08 ±  2%  meminfo.Memused
    259553 ±  2%      -0.9%     257226 ± 15%     -10.5%     232211 ±  4%  meminfo.Shmem
   1.2e+08            -2.4%  1.172e+08           +13.3%   1.36e+08        meminfo.max_used_kB
     18884 ± 10%      -7.2%      17519 ± 15%     +37.6%      25983 ±  6%  numa-meminfo.node0.HugePages_Surp
     18884 ± 10%      -7.2%      17519 ± 15%     +37.6%      25983 ±  6%  numa-meminfo.node0.HugePages_Total
  91526744 ±  4%      +1.2%   92620825 ±  6%     -16.7%   76248423 ±  5%  numa-meminfo.node0.MemFree
  40158207 ±  9%      -2.7%   39064126 ± 15%     +38.0%   55436528 ±  7%  numa-meminfo.node0.MemUsed
    171426 ± 43%    +998.5%    1883116 ± 51%    +318.6%     717638 ±154%  numa-meminfo.node0.Unevictable
    110091 ± 14%      -0.1%     109981 ± 41%     -31.7%      75226 ± 13%  numa-meminfo.node1.Active
    110025 ± 14%      -0.1%     109915 ± 41%     -31.7%      75176 ± 13%  numa-meminfo.node1.Active(anon)
   3068496 ±  2%     -55.8%    1356754 ± 72%     -19.6%    2466084 ± 43%  numa-meminfo.node1.FilePages
     25218 ±  4%     -34.7%      16475 ± 12%      +7.9%      27213 ±  3%  numa-meminfo.node1.HugePages_Surp
     25218 ±  4%     -34.7%      16475 ± 12%      +7.9%      27213 ±  3%  numa-meminfo.node1.HugePages_Total
     55867 ± 27%     -65.5%      19266 ± 50%     -34.4%      36671 ± 38%  numa-meminfo.node1.Mapped
  42795888 ±  5%     +46.1%   62520130 ±  7%     +69.3%   72441496 ±  3%  numa-meminfo.node1.MemFree
  99028084            +0.0%   99028084           +33.4%  1.321e+08        numa-meminfo.node1.MemTotal
  56232195 ±  3%     -35.1%   36507953 ± 12%      +6.0%   59616707 ±  4%  numa-meminfo.node1.MemUsed
   2896199 ±  2%     -59.1%    1185064 ± 82%     -18.9%    2349991 ± 47%  numa-meminfo.node1.Unevictable
    507357            +0.0%     507357            +1.7%     516000        proc-vmstat.htlb_buddy_alloc_success
     29942 ± 10%     +14.3%      34235 ± 27%     -30.7%      20740 ±  7%  proc-vmstat.nr_active_anon
   3324095           +15.7%    3847387           +10.9%    3686860        proc-vmstat.nr_dirty_background_threshold
   6656318           +15.7%    7704181           +10.9%    7382735        proc-vmstat.nr_dirty_threshold
  33559092           +15.6%   38798108           +10.9%   37209133        proc-vmstat.nr_free_pages
    197697 ±  2%      -2.5%     192661            +1.0%     199623        proc-vmstat.nr_inactive_anon
     17939 ± 28%     -42.5%      10307 ± 11%     -22.4%      13927 ± 14%  proc-vmstat.nr_mapped
      2691            -7.1%       2501            +2.9%       2769        proc-vmstat.nr_page_table_pages
     64848 ±  2%      -0.7%      64386 ± 15%     -10.6%      57987 ±  4%  proc-vmstat.nr_shmem
     29942 ± 10%     +14.3%      34235 ± 27%     -30.7%      20740 ±  7%  proc-vmstat.nr_zone_active_anon
    197697 ±  2%      -2.5%     192661            +1.0%     199623        proc-vmstat.nr_zone_inactive_anon
   1403095            +9.3%    1534152 ±  2%      -3.2%    1358244        proc-vmstat.numa_hit
   1267544           +10.6%    1401482 ±  2%      -3.4%    1224210        proc-vmstat.numa_local
 2.608e+08            +0.1%  2.609e+08            +1.7%  2.651e+08        proc-vmstat.pgalloc_normal
   1259957           +13.4%    1428284 ±  2%      -6.5%    1178198        proc-vmstat.pgfault
 2.591e+08            +0.3%    2.6e+08            +2.3%  2.649e+08        proc-vmstat.pgfree
     36883 ±  3%     +18.5%      43709 ±  5%     -12.2%      32371 ±  3%  proc-vmstat.pgreuse
      1.88 ± 16%      -0.6        1.33 ±100%      +0.9        2.80 ± 11%  perf-profile.calltrace.cycles-pp.nrand48_r
     16.19 ± 85%     +28.6       44.75 ± 95%     -11.4        4.78 ±218%  perf-profile.calltrace.cycles-pp.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
     16.20 ± 85%     +28.6       44.78 ± 95%     -11.4        4.78 ±218%  perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
     16.22 ± 85%     +28.6       44.82 ± 95%     -11.4        4.79 ±218%  perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.do_access
     16.22 ± 85%     +28.6       44.82 ± 95%     -11.4        4.79 ±218%  perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
     16.24 ± 85%     +28.8       45.01 ± 95%     -11.4        4.80 ±218%  perf-profile.calltrace.cycles-pp.asm_exc_page_fault.do_access
     12.42 ± 84%     +29.5       41.89 ± 95%      -8.8        3.65 ±223%  perf-profile.calltrace.cycles-pp.copy_mc_enhanced_fast_string.copy_subpage.copy_user_large_folio.hugetlb_wp.hugetlb_fault
     12.52 ± 84%     +29.6       42.08 ± 95%      -8.8        3.68 ±223%  perf-profile.calltrace.cycles-pp.copy_subpage.copy_user_large_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault
     12.53 ± 84%     +29.7       42.23 ± 95%      -8.9        3.68 ±223%  perf-profile.calltrace.cycles-pp.copy_user_large_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault
     12.80 ± 84%     +30.9       43.65 ± 95%      -9.0        3.76 ±223%  perf-profile.calltrace.cycles-pp.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
      2.50 ± 17%      -0.7        1.78 ±100%      +1.2        3.73 ± 11%  perf-profile.children.cycles-pp.nrand48_r
     16.24 ± 85%     +28.6       44.87 ± 95%     -11.4        4.79 ±218%  perf-profile.children.cycles-pp.do_user_addr_fault
     16.24 ± 85%     +28.6       44.87 ± 95%     -11.4        4.79 ±218%  perf-profile.children.cycles-pp.exc_page_fault
     16.20 ± 85%     +28.7       44.86 ± 95%     -11.4        4.78 ±218%  perf-profile.children.cycles-pp.hugetlb_fault
     16.22 ± 85%     +28.7       44.94 ± 95%     -11.4        4.79 ±218%  perf-profile.children.cycles-pp.handle_mm_fault
     16.26 ± 85%     +28.8       45.06 ± 95%     -11.5        4.80 ±218%  perf-profile.children.cycles-pp.asm_exc_page_fault
     12.51 ± 84%     +29.5       42.01 ± 95%      -8.8        3.75 ±218%  perf-profile.children.cycles-pp.copy_mc_enhanced_fast_string
     12.52 ± 84%     +29.6       42.11 ± 95%      -8.8        3.75 ±218%  perf-profile.children.cycles-pp.copy_subpage
     12.53 ± 84%     +29.7       42.25 ± 95%      -8.8        3.76 ±218%  perf-profile.children.cycles-pp.copy_user_large_folio
     12.80 ± 84%     +30.9       43.65 ± 95%      -9.0        3.83 ±218%  perf-profile.children.cycles-pp.hugetlb_wp
      2.25 ± 17%      -0.7        1.59 ±100%      +1.1        3.36 ± 11%  perf-profile.self.cycles-pp.nrand48_r
      1.74 ± 21%      -0.5        1.25 ± 92%      +1.2        2.94 ± 13%  perf-profile.self.cycles-pp.do_access
      0.27 ± 17%      -0.1        0.19 ±100%      +0.1        0.40 ± 11%  perf-profile.self.cycles-pp.lrand48_r
     12.41 ± 84%     +29.4       41.80 ± 95%      -8.7        3.72 ±218%  perf-profile.self.cycles-pp.copy_mc_enhanced_fast_string
    350208 ± 16%      -2.7%     340891 ± 36%     -47.2%     184918 ±  9%  sched_debug.cfs_rq:/.avg_vruntime.stddev
     16833 ±149%    -100.0%       3.19 ±100%    -100.0%       0.58 ±179%  sched_debug.cfs_rq:/.left_deadline.avg
   2154658 ±149%    -100.0%     317.15 ± 93%    -100.0%      74.40 ±179%  sched_debug.cfs_rq:/.left_deadline.max
    189702 ±149%    -100.0%      29.47 ± 94%    -100.0%       6.55 ±179%  sched_debug.cfs_rq:/.left_deadline.stddev
     16833 ±149%    -100.0%       3.05 ±102%    -100.0%       0.58 ±179%  sched_debug.cfs_rq:/.left_vruntime.avg
   2154613 ±149%    -100.0%     298.70 ± 95%    -100.0%      74.06 ±179%  sched_debug.cfs_rq:/.left_vruntime.max
    189698 ±149%    -100.0%      27.96 ± 96%    -100.0%       6.52 ±179%  sched_debug.cfs_rq:/.left_vruntime.stddev
    350208 ± 16%      -2.7%     340891 ± 36%     -47.2%     184918 ±  9%  sched_debug.cfs_rq:/.min_vruntime.stddev
     52.88 ± 14%     -19.5%      42.56 ± 39%     +22.8%      64.94 ±  9%  sched_debug.cfs_rq:/.removed.load_avg.stddev
     16833 ±149%    -100.0%       3.05 ±102%    -100.0%       0.58 ±179%  sched_debug.cfs_rq:/.right_vruntime.avg
   2154613 ±149%    -100.0%     298.70 ± 95%    -100.0%      74.11 ±179%  sched_debug.cfs_rq:/.right_vruntime.max
    189698 ±149%    -100.0%      27.96 ± 96%    -100.0%       6.53 ±179%  sched_debug.cfs_rq:/.right_vruntime.stddev
      1588 ±  9%     -31.2%       1093 ± 18%     -20.0%       1270 ± 16%  sched_debug.cfs_rq:/.runnable_avg.max
    676.36 ±  7%     -94.8%      35.08 ± 42%      -2.7%     657.82 ±  3%  sched_debug.cfs_rq:/.util_est.avg
      1339 ±  8%     -74.5%     341.42 ± 24%     -22.6%       1037 ± 23%  sched_debug.cfs_rq:/.util_est.max
    152.67 ± 35%     -72.3%      42.35 ± 21%     -14.9%     129.89 ± 33%  sched_debug.cfs_rq:/.util_est.stddev
   1116839 ±  7%      -7.1%    1037321 ±  4%     +22.9%    1372316 ± 11%  sched_debug.cpu.avg_idle.max
    126915 ± 10%     +31.6%     166966 ±  6%     -12.2%     111446 ±  2%  sched_debug.cpu.clock.avg
    126930 ± 10%     +31.6%     166977 ±  6%     -12.2%     111459 ±  2%  sched_debug.cpu.clock.max
    126899 ± 10%     +31.6%     166949 ±  6%     -12.2%     111428 ±  2%  sched_debug.cpu.clock.min
    126491 ± 10%     +31.7%     166537 ±  6%     -12.2%     111078 ±  2%  sched_debug.cpu.clock_task.avg
    126683 ± 10%     +31.6%     166730 ±  6%     -12.2%     111237 ±  2%  sched_debug.cpu.clock_task.max
    117365 ± 11%     +33.6%     156775 ±  6%     -13.0%     102099 ±  2%  sched_debug.cpu.clock_task.min
      2826 ± 10%    +178.1%       7858 ±  8%     -10.3%       2534 ±  6%  sched_debug.cpu.nr_switches.avg
    755.38 ± 15%    +423.8%       3956 ± 14%     -15.2%     640.33 ±  3%  sched_debug.cpu.nr_switches.min
    126900 ± 10%     +31.6%     166954 ±  6%     -12.2%     111432 ±  2%  sched_debug.cpu_clk
    125667 ± 10%     +31.9%     165721 ±  6%     -12.3%     110200 ±  2%  sched_debug.ktime
      0.54 ±141%     -99.9%       0.00 ±132%     -99.9%       0.00 ±114%  sched_debug.rt_rq:.rt_time.avg
     69.73 ±141%     -99.9%       0.06 ±132%     -99.9%       0.07 ±114%  sched_debug.rt_rq:.rt_time.max
      6.14 ±141%     -99.9%       0.01 ±132%     -99.9%       0.01 ±114%  sched_debug.rt_rq:.rt_time.stddev
    127860 ± 10%     +31.3%     167917 ±  6%     -12.1%     112402 ±  2%  sched_debug.sched_clk
     15.99          +363.6%      74.14 ±  6%     +10.1%      17.61        perf-stat.i.MPKI
 1.467e+10 ±  2%     -32.0%  9.975e+09 ±  3%     +21.3%  1.779e+10 ±  2%  perf-stat.i.branch-instructions
      0.10 ±  5%      +0.6        0.68 ±  5%      +0.0        0.11 ±  4%  perf-stat.i.branch-miss-rate%
  10870114 ±  3%     -26.4%    8001551 ±  3%     +15.7%   12580898 ±  2%  perf-stat.i.branch-misses
     97.11           -20.0       77.11            -0.0       97.10        perf-stat.i.cache-miss-rate%
 8.118e+08 ±  2%     -32.5%  5.482e+08 ±  3%     +23.1%  9.992e+08 ±  2%  perf-stat.i.cache-misses
 8.328e+08 ±  2%     -28.4%  5.963e+08 ±  3%     +22.8%  1.023e+09 ±  2%  perf-stat.i.cache-references
      2601 ±  2%    +172.3%       7083 ±  3%      +2.5%       2665 ±  5%  perf-stat.i.context-switches
      5.10           +39.5%       7.11 ±  9%      -9.2%       4.62        perf-stat.i.cpi
 2.826e+11           -44.1%   1.58e+11 ±  2%      +5.7%  2.987e+11 ±  2%  perf-stat.i.cpu-cycles
    216.56           +42.4%     308.33 ±  6%      +2.2%     221.23        perf-stat.i.cpu-migrations
    358.79            -0.3%     357.70 ± 21%     -14.1%     308.23        perf-stat.i.cycles-between-cache-misses
 6.286e+10 ±  2%     -31.7%  4.293e+10 ±  3%     +21.3%  7.626e+10 ±  2%  perf-stat.i.instructions
      0.24           +39.9%       0.33 ±  4%     +13.6%       0.27        perf-stat.i.ipc
      5844           -16.9%       4856 ±  2%     +12.5%       6577        perf-stat.i.minor-faults
      5846           -16.9%       4857 ±  2%     +12.5%       6578        perf-stat.i.page-faults
     13.00            -2.2%      12.72            +1.2%      13.15        perf-stat.overall.MPKI
      0.07            +0.0        0.08            -0.0        0.07        perf-stat.overall.branch-miss-rate%
     97.44            -5.3       92.09            +0.2       97.66        perf-stat.overall.cache-miss-rate%
      4.51           -18.4%       3.68           -13.0%       3.92        perf-stat.overall.cpi
    346.76           -16.6%     289.11           -14.0%     298.06        perf-stat.overall.cycles-between-cache-misses
      0.22           +22.6%       0.27           +15.0%       0.26        perf-stat.overall.ipc
     10906            -3.4%      10541            -1.1%      10784        perf-stat.overall.path-length
 1.445e+10 ±  2%     -30.7%  1.001e+10 ±  3%     +21.2%  1.752e+10 ±  2%  perf-stat.ps.branch-instructions
  10469697 ±  3%     -23.5%    8005730 ±  3%     +18.3%   12387061 ±  2%  perf-stat.ps.branch-misses
 8.045e+08 ±  2%     -31.9%  5.478e+08 ±  3%     +22.7%  9.874e+08 ±  2%  perf-stat.ps.cache-misses
 8.257e+08 ±  2%     -27.9%   5.95e+08 ±  3%     +22.5%  1.011e+09 ±  2%  perf-stat.ps.cache-references
      2584 ±  2%    +169.3%       6958 ±  3%      +2.7%       2654 ±  4%  perf-stat.ps.context-switches
 2.789e+11           -43.2%  1.583e+11 ±  2%      +5.5%  2.943e+11 ±  2%  perf-stat.ps.cpu-cycles
    214.69           +41.8%     304.37 ±  6%      +2.2%     219.46        perf-stat.ps.cpu-migrations
  6.19e+10 ±  2%     -30.4%  4.309e+10 ±  3%     +21.3%  7.507e+10 ±  2%  perf-stat.ps.instructions
      5849           -18.0%       4799 ±  2%     +12.3%       6568 ±  2%  perf-stat.ps.minor-faults
      5851           -18.0%       4800 ±  2%     +12.3%       6570 ±  2%  perf-stat.ps.page-faults
 1.264e+13            -3.4%  1.222e+13            +0.5%   1.27e+13        perf-stat.total.instructions



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
  2024-07-19  8:42       ` Oliver Sang
@ 2024-07-19 16:06         ` Yu Zhao
  2024-08-03 22:07           ` Yu Zhao
  0 siblings, 1 reply; 14+ messages in thread
From: Yu Zhao @ 2024-07-19 16:06 UTC (permalink / raw)
  To: Oliver Sang
  Cc: Janosch Frank, oe-lkp, lkp, Linux Memory Management List,
	Andrew Morton, Muchun Song, David Hildenbrand,
	Frank van der Linden, Matthew Wilcox, Peter Xu, Yang Shi,
	linux-kernel, ying.huang, feng.tang, fengwei.yin,
	Christian Borntraeger, Claudio Imbrenda, Marc Hartmayer,
	Heiko Carstens, Yosry Ahmed

On Fri, Jul 19, 2024 at 2:44 AM Oliver Sang <oliver.sang@intel.com> wrote:
>
> hi, Yu Zhao,
>
> On Wed, Jul 17, 2024 at 09:44:33AM -0600, Yu Zhao wrote:
> > On Wed, Jul 17, 2024 at 2:36 AM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > Hi Janosch and Oliver,
> > >
> > > On Wed, Jul 17, 2024 at 1:57 AM Janosch Frank <frankja@linux.ibm.com> wrote:
> > > >
> > > > On 7/9/24 07:11, kernel test robot wrote:
> > > > > Hello,
> > > > >
> > > > > kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
> > > > >
> > > > >
> > > > > commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> > > > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > > > >
> > > > > [still regression on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
> > > > >
> > > > This has hit s390 huge page backed KVM guests as well.
> > > > Our simple start/stop test case went from ~5 to over 50 seconds of runtime.
> > >
> > > Could you try the attached patch please? Thank you.
> >
> > Thanks, Yosry, for spotting the following typo:
> >   flags &= VMEMMAP_SYNCHRONIZE_RCU;
> > It's supposed to be:
> >   flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
> >
> > Reattaching v2 with the above typo fixed. Please let me know, Janosch & Oliver.
>
> since the commit is in mainline now, I directly applied your v2 patch on top of
> bd225530a4c71 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
>
> in our tests, your v2 patch not only recovers the performance regression,

Thanks for verifying the fix!

> it even shows a +13.7% performance improvement over 5a4d8944d6b1e (parent of
> bd225530a4c71)

Glad to hear!

(The original patch improved and regressed the performance at the same
time, but the regression was bigger. The fix removed the regression and
surfaced the improvement.)
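
A quick sanity check against the vm-scalability.throughput row in Oliver's
table: 4532485 / 6863720 ≈ 0.660, i.e. the -34.0% regression, and
7805408 / 6863720 ≈ 1.137, i.e. the +13.7% improvement over the parent.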

> details are as below
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
>   gcc-13/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/300s/512G/lkp-icl-2sp2/anon-cow-rand-hugetlb/vm-scalability
>
> commit:
>   5a4d8944d6b1e ("cachestat: do not flush stats in recency check")
>   bd225530a4c71 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
>   9a5b87b521401 <---- your v2 patch
>
> 5a4d8944d6b1e1aa bd225530a4c717714722c373144 9a5b87b5214018a2be217dc4648
> ---------------- --------------------------- ---------------------------
>          %stddev     %change         %stddev     %change         %stddev
>              \          |                \          |                \
> [ ~185 lines of per-metric comparison data trimmed here; the table is
> quoted verbatim from Oliver's message above ]
>   6.19e+10 ±  2%     -30.4%  4.309e+10 ±  3%     +21.3%  7.507e+10 ±  2%  perf-stat.ps.instructions
>       5849           -18.0%       4799 ±  2%     +12.3%       6568 ±  2%  perf-stat.ps.minor-faults
>       5851           -18.0%       4800 ±  2%     +12.3%       6570 ±  2%  perf-stat.ps.page-faults
>  1.264e+13            -3.4%  1.222e+13            +0.5%   1.27e+13        perf-stat.total.instructions


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
  2024-07-19 16:06         ` Yu Zhao
@ 2024-08-03 22:07           ` Yu Zhao
  2024-08-06  3:01             ` Oliver Sang
  0 siblings, 1 reply; 14+ messages in thread
From: Yu Zhao @ 2024-08-03 22:07 UTC (permalink / raw)
  To: Oliver Sang, Muchun Song
  Cc: Janosch Frank, oe-lkp, lkp, Linux Memory Management List,
	Andrew Morton, David Hildenbrand, Frank van der Linden,
	Matthew Wilcox, Peter Xu, Yang Shi, linux-kernel, ying.huang,
	feng.tang, fengwei.yin, Christian Borntraeger, Claudio Imbrenda,
	Marc Hartmayer, Heiko Carstens, Yosry Ahmed

[-- Attachment #1: Type: text/plain, Size: 2990 bytes --]

Hi Oliver,

On Fri, Jul 19, 2024 at 10:06 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Fri, Jul 19, 2024 at 2:44 AM Oliver Sang <oliver.sang@intel.com> wrote:
> >
> > hi, Yu Zhao,
> >
> > On Wed, Jul 17, 2024 at 09:44:33AM -0600, Yu Zhao wrote:
> > > On Wed, Jul 17, 2024 at 2:36 AM Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > Hi Janosch and Oliver,
> > > >
> > > > On Wed, Jul 17, 2024 at 1:57 AM Janosch Frank <frankja@linux.ibm.com> wrote:
> > > > >
> > > > > On 7/9/24 07:11, kernel test robot wrote:
> > > > > > Hello,
> > > > > >
> > > > > > kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
> > > > > >
> > > > > >
> > > > > > commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> > > > > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > > > > >
> > > > > > [still regression on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
> > > > > >
> > > > > This has hit s390 huge page backed KVM guests as well.
> > > > > Our simple start/stop test case went from ~5 to over 50 seconds of runtime.
> > > >
> > > > Could you try the attached patch please? Thank you.
> > >
> > > Thanks, Yosry, for spotting the following typo:
> > >   flags &= VMEMMAP_SYNCHRONIZE_RCU;
> > > It's supposed to be:
> > >   flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
> > >
> > > Reattaching v2 with the above typo fixed. Please let me know, Janosch & Oliver.
> >
> > since the commit is in mainline now, I directly apply your v2 patch upon
> > bd225530a4c71 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> >
> > in our tests, your v2 patch not only recovers the performance regression,
>
> Thanks for verifying the fix!
>
> > it even has +13.7% performance improvement than 5a4d8944d6b1e (parent of
> > bd225530a4c71)
>
> Glad to hear!
>
> (The original patch improved and regressed the performance at the same
> time, but the regression is bigger. The fix removed the regression and
> surfaced the improvement.)

Can you please run the benchmark again with the attached patch on top
of the last fix?

I spotted something else worth optimizing last time, and with the
patch attached, I was able to measure some significant improvements in
1GB hugeTLB allocation and free times, e.g., when allocating and freeing
700 1GB hugeTLB pages:

Before:
  # time echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
  real  0m13.500s
  user  0m0.000s
  sys   0m13.311s

  # time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
  real  0m11.269s
  user  0m0.000s
  sys   0m11.187s


After:
  # time echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
  real  0m10.643s
  user  0m0.001s
  sys   0m10.487s

  # time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
  real  0m1.541s
  user  0m0.000s
  sys   0m1.528s
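
That is, allocation gets about 21% faster (13.500s -> 10.643s), and
freeing becomes roughly 7.3x faster (11.269s -> 1.541s).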

Thanks!

[-- Attachment #2: hugetlb.patch --]
[-- Type: application/octet-stream, Size: 22480 bytes --]

diff --git a/include/linux/cma.h b/include/linux/cma.h
index 9db877506ea8..3d58ce1a8730 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -46,6 +46,7 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
 					struct cma **res_cma);
 extern struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned int align,
 			      bool no_warn);
+extern struct folio *cma_alloc_folio(struct cma *cma, int order);
 extern bool cma_pages_valid(struct cma *cma, const struct page *pages, unsigned long count);
 extern bool cma_release(struct cma *cma, const struct page *pages, unsigned long count);
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c9bf68c239a0..630ab4f5f78d 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -900,9 +900,9 @@ static inline bool hugepage_movable_supported(struct hstate *h)
 static inline gfp_t htlb_alloc_mask(struct hstate *h)
 {
 	if (hugepage_movable_supported(h))
-		return GFP_HIGHUSER_MOVABLE;
+		return GFP_HIGHUSER_MOVABLE | __GFP_COMP;
 	else
-		return GFP_HIGHUSER;
+		return GFP_HIGHUSER | __GFP_COMP;
 }
 
 static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
diff --git a/mm/cma.c b/mm/cma.c
index 3e9724716bad..39b6b99c6af1 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -403,18 +403,8 @@ static void cma_debug_show_areas(struct cma *cma)
 	spin_unlock_irq(&cma->lock);
 }
 
-/**
- * cma_alloc() - allocate pages from contiguous area
- * @cma:   Contiguous memory region for which the allocation is performed.
- * @count: Requested number of pages.
- * @align: Requested alignment of pages (in PAGE_SIZE order).
- * @no_warn: Avoid printing message about failed allocation
- *
- * This function allocates part of contiguous memory on specific
- * contiguous memory area.
- */
-struct page *cma_alloc(struct cma *cma, unsigned long count,
-		       unsigned int align, bool no_warn)
+static struct page *__cma_alloc(struct cma *cma, unsigned long count,
+				unsigned int align, gfp_t gfp)
 {
 	unsigned long mask, offset;
 	unsigned long pfn = -1;
@@ -463,8 +453,7 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
 
 		pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
 		mutex_lock(&cma_mutex);
-		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
-				     GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));
+		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp);
 		mutex_unlock(&cma_mutex);
 		if (ret == 0) {
 			page = pfn_to_page(pfn);
@@ -494,7 +483,7 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
 			page_kasan_tag_reset(nth_page(page, i));
 	}
 
-	if (ret && !no_warn) {
+	if (ret && !(gfp & __GFP_NOWARN)) {
 		pr_err_ratelimited("%s: %s: alloc failed, req-size: %lu pages, ret: %d\n",
 				   __func__, cma->name, count, ret);
 		cma_debug_show_areas(cma);
@@ -513,6 +502,31 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
 	return page;
 }
 
+/**
+ * cma_alloc() - allocate pages from contiguous area
+ * @cma:   Contiguous memory region for which the allocation is performed.
+ * @count: Requested number of pages.
+ * @align: Requested alignment of pages (in PAGE_SIZE order).
+ * @no_warn: Avoid printing message about failed allocation
+ *
+ * This function allocates part of contiguous memory on specific
+ * contiguous memory area.
+ */
+struct page *cma_alloc(struct cma *cma, unsigned long count,
+		       unsigned int align, bool no_warn)
+{
+	return __cma_alloc(cma, count, align, GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));
+}
+
+struct folio *cma_alloc_folio(struct cma *cma, int order)
+{
+	struct page *page;
+
+	page = __cma_alloc(cma, 1 << order, order, GFP_KERNEL | __GFP_COMP);
+
+	return page ? page_folio(page) : NULL;
+}
+
 bool cma_pages_valid(struct cma *cma, const struct page *pages,
 		     unsigned long count)
 {
diff --git a/mm/compaction.c b/mm/compaction.c
index eb95e9b435d0..00fb571727d3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -86,33 +86,6 @@ static struct page *mark_allocated_noprof(struct page *page, unsigned int order,
 }
 #define mark_allocated(...)	alloc_hooks(mark_allocated_noprof(__VA_ARGS__))
 
-static void split_map_pages(struct list_head *freepages)
-{
-	unsigned int i, order;
-	struct page *page, *next;
-	LIST_HEAD(tmp_list);
-
-	for (order = 0; order < NR_PAGE_ORDERS; order++) {
-		list_for_each_entry_safe(page, next, &freepages[order], lru) {
-			unsigned int nr_pages;
-
-			list_del(&page->lru);
-
-			nr_pages = 1 << order;
-
-			mark_allocated(page, order, __GFP_MOVABLE);
-			if (order)
-				split_page(page, order);
-
-			for (i = 0; i < nr_pages; i++) {
-				list_add(&page->lru, &tmp_list);
-				page++;
-			}
-		}
-		list_splice_init(&tmp_list, &freepages[0]);
-	}
-}
-
 static unsigned long release_free_list(struct list_head *freepages)
 {
 	int order;
@@ -754,10 +727,9 @@ isolate_freepages_range(struct compact_control *cc,
 {
 	unsigned long isolated, pfn, block_start_pfn, block_end_pfn;
 	int order;
-	struct list_head tmp_freepages[NR_PAGE_ORDERS];
 
 	for (order = 0; order < NR_PAGE_ORDERS; order++)
-		INIT_LIST_HEAD(&tmp_freepages[order]);
+		INIT_LIST_HEAD(&cc->freepages[order]);
 
 	pfn = start_pfn;
 	block_start_pfn = pageblock_start_pfn(pfn);
@@ -788,7 +760,7 @@ isolate_freepages_range(struct compact_control *cc,
 			break;
 
 		isolated = isolate_freepages_block(cc, &isolate_start_pfn,
-					block_end_pfn, tmp_freepages, 0, true);
+					block_end_pfn, cc->freepages, 0, true);
 
 		/*
 		 * In strict mode, isolate_freepages_block() returns 0 if
@@ -807,13 +779,10 @@ isolate_freepages_range(struct compact_control *cc,
 
 	if (pfn < end_pfn) {
 		/* Loop terminated early, cleanup. */
-		release_free_list(tmp_freepages);
+		release_free_list(cc->freepages);
 		return 0;
 	}
 
-	/* __isolate_free_page() does not map the pages */
-	split_map_pages(tmp_freepages);
-
 	/* We don't use freelists for anything. */
 	return pfn;
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index aaf508be0a2b..2061d094cd19 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1512,43 +1512,7 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
 		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
 		nr_nodes--)
 
-/* used to demote non-gigantic_huge pages as well */
-static void __destroy_compound_gigantic_folio(struct folio *folio,
-					unsigned int order, bool demote)
-{
-	int i;
-	int nr_pages = 1 << order;
-	struct page *p;
-
-	atomic_set(&folio->_entire_mapcount, 0);
-	atomic_set(&folio->_large_mapcount, 0);
-	atomic_set(&folio->_pincount, 0);
-
-	for (i = 1; i < nr_pages; i++) {
-		p = folio_page(folio, i);
-		p->flags &= ~PAGE_FLAGS_CHECK_AT_FREE;
-		p->mapping = NULL;
-		clear_compound_head(p);
-		if (!demote)
-			set_page_refcounted(p);
-	}
-
-	__folio_clear_head(folio);
-}
-
-static void destroy_compound_hugetlb_folio_for_demote(struct folio *folio,
-					unsigned int order)
-{
-	__destroy_compound_gigantic_folio(folio, order, true);
-}
-
 #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-static void destroy_compound_gigantic_folio(struct folio *folio,
-					unsigned int order)
-{
-	__destroy_compound_gigantic_folio(folio, order, false);
-}
-
 static void free_gigantic_folio(struct folio *folio, unsigned int order)
 {
 	/*
@@ -1569,38 +1533,52 @@ static void free_gigantic_folio(struct folio *folio, unsigned int order)
 static struct folio *alloc_gigantic_folio(struct hstate *h, gfp_t gfp_mask,
 		int nid, nodemask_t *nodemask)
 {
-	struct page *page;
-	unsigned long nr_pages = pages_per_huge_page(h);
+	struct folio *folio;
+	int order = huge_page_order(h);
+	bool retry = false;
+
 	if (nid == NUMA_NO_NODE)
 		nid = numa_mem_id();
-
+retry:
+	folio = NULL;
 #ifdef CONFIG_CMA
 	{
 		int node;
 
-		if (hugetlb_cma[nid]) {
-			page = cma_alloc(hugetlb_cma[nid], nr_pages,
-					huge_page_order(h), true);
-			if (page)
-				return page_folio(page);
-		}
+		if (hugetlb_cma[nid])
+			folio = cma_alloc_folio(hugetlb_cma[nid], order);
 
-		if (!(gfp_mask & __GFP_THISNODE)) {
+		if (!folio && !(gfp_mask & __GFP_THISNODE)) {
 			for_each_node_mask(node, *nodemask) {
 				if (node == nid || !hugetlb_cma[node])
 					continue;
 
-				page = cma_alloc(hugetlb_cma[node], nr_pages,
-						huge_page_order(h), true);
-				if (page)
-					return page_folio(page);
+				folio = cma_alloc_folio(hugetlb_cma[node], order);
+				if (folio)
+					break;
 			}
 		}
 	}
 #endif
+	if (!folio) {
+		struct page *page = alloc_contig_pages(1 << order, gfp_mask, nid, nodemask);
 
-	page = alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask);
-	return page ? page_folio(page) : NULL;
+		if (!page)
+			return NULL;
+
+		folio = page_folio(page);
+	}
+
+	if (folio_ref_freeze(folio, 1))
+		return folio;
+
+	pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
+	free_gigantic_folio(folio, order);
+	if (!retry) {
+		retry = true;
+		goto retry;
+	}
+	return NULL;
 }
 
 #else /* !CONFIG_CONTIG_ALLOC */
@@ -1619,8 +1597,6 @@ static struct folio *alloc_gigantic_folio(struct hstate *h, gfp_t gfp_mask,
 }
 static inline void free_gigantic_folio(struct folio *folio,
 						unsigned int order) { }
-static inline void destroy_compound_gigantic_folio(struct folio *folio,
-						unsigned int order) { }
 #endif
 
 /*
@@ -1747,19 +1723,17 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
 		folio_clear_hugetlb_hwpoison(folio);
 
 	folio_ref_unfreeze(folio, 1);
+	INIT_LIST_HEAD(&folio->_deferred_list);
 
 	/*
 	 * Non-gigantic pages demoted from CMA allocated gigantic pages
 	 * need to be given back to CMA in free_gigantic_folio.
 	 */
 	if (hstate_is_gigantic(h) ||
-	    hugetlb_cma_folio(folio, huge_page_order(h))) {
-		destroy_compound_gigantic_folio(folio, huge_page_order(h));
+	    hugetlb_cma_folio(folio, huge_page_order(h)))
 		free_gigantic_folio(folio, huge_page_order(h));
-	} else {
-		INIT_LIST_HEAD(&folio->_deferred_list);
+	else
 		folio_put(folio);
-	}
 }
 
 /*
@@ -2032,95 +2006,6 @@ static void prep_new_hugetlb_folio(struct hstate *h, struct folio *folio, int ni
 	spin_unlock_irq(&hugetlb_lock);
 }
 
-static bool __prep_compound_gigantic_folio(struct folio *folio,
-					unsigned int order, bool demote)
-{
-	int i, j;
-	int nr_pages = 1 << order;
-	struct page *p;
-
-	__folio_clear_reserved(folio);
-	for (i = 0; i < nr_pages; i++) {
-		p = folio_page(folio, i);
-
-		/*
-		 * For gigantic hugepages allocated through bootmem at
-		 * boot, it's safer to be consistent with the not-gigantic
-		 * hugepages and clear the PG_reserved bit from all tail pages
-		 * too.  Otherwise drivers using get_user_pages() to access tail
-		 * pages may get the reference counting wrong if they see
-		 * PG_reserved set on a tail page (despite the head page not
-		 * having PG_reserved set).  Enforcing this consistency between
-		 * head and tail pages allows drivers to optimize away a check
-		 * on the head page when they need know if put_page() is needed
-		 * after get_user_pages().
-		 */
-		if (i != 0)	/* head page cleared above */
-			__ClearPageReserved(p);
-		/*
-		 * Subtle and very unlikely
-		 *
-		 * Gigantic 'page allocators' such as memblock or cma will
-		 * return a set of pages with each page ref counted.  We need
-		 * to turn this set of pages into a compound page with tail
-		 * page ref counts set to zero.  Code such as speculative page
-		 * cache adding could take a ref on a 'to be' tail page.
-		 * We need to respect any increased ref count, and only set
-		 * the ref count to zero if count is currently 1.  If count
-		 * is not 1, we return an error.  An error return indicates
-		 * the set of pages can not be converted to a gigantic page.
-		 * The caller who allocated the pages should then discard the
-		 * pages using the appropriate free interface.
-		 *
-		 * In the case of demote, the ref count will be zero.
-		 */
-		if (!demote) {
-			if (!page_ref_freeze(p, 1)) {
-				pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
-				goto out_error;
-			}
-		} else {
-			VM_BUG_ON_PAGE(page_count(p), p);
-		}
-		if (i != 0)
-			set_compound_head(p, &folio->page);
-	}
-	__folio_set_head(folio);
-	/* we rely on prep_new_hugetlb_folio to set the hugetlb flag */
-	folio_set_order(folio, order);
-	atomic_set(&folio->_entire_mapcount, -1);
-	atomic_set(&folio->_large_mapcount, -1);
-	atomic_set(&folio->_pincount, 0);
-	return true;
-
-out_error:
-	/* undo page modifications made above */
-	for (j = 0; j < i; j++) {
-		p = folio_page(folio, j);
-		if (j != 0)
-			clear_compound_head(p);
-		set_page_refcounted(p);
-	}
-	/* need to clear PG_reserved on remaining tail pages  */
-	for (; j < nr_pages; j++) {
-		p = folio_page(folio, j);
-		__ClearPageReserved(p);
-	}
-	return false;
-}
-
-static bool prep_compound_gigantic_folio(struct folio *folio,
-							unsigned int order)
-{
-	return __prep_compound_gigantic_folio(folio, order, false);
-}
-
-static bool prep_compound_gigantic_folio_for_demote(struct folio *folio,
-							unsigned int order)
-{
-	return __prep_compound_gigantic_folio(folio, order, true);
-}
-
 /*
  * Find and lock address space (mapping) in write mode.
  *
@@ -2159,7 +2044,7 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
 	 */
 	if (node_alloc_noretry && node_isset(nid, *node_alloc_noretry))
 		alloc_try_hard = false;
-	gfp_mask |= __GFP_COMP|__GFP_NOWARN;
+	gfp_mask |= __GFP_NOWARN;
 	if (alloc_try_hard)
 		gfp_mask |= __GFP_RETRY_MAYFAIL;
 	if (nid == NUMA_NO_NODE)
@@ -2206,48 +2091,14 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
 	return folio;
 }
 
-static struct folio *__alloc_fresh_hugetlb_folio(struct hstate *h,
-				gfp_t gfp_mask, int nid, nodemask_t *nmask,
-				nodemask_t *node_alloc_noretry)
-{
-	struct folio *folio;
-	bool retry = false;
-
-retry:
-	if (hstate_is_gigantic(h))
-		folio = alloc_gigantic_folio(h, gfp_mask, nid, nmask);
-	else
-		folio = alloc_buddy_hugetlb_folio(h, gfp_mask,
-				nid, nmask, node_alloc_noretry);
-	if (!folio)
-		return NULL;
-
-	if (hstate_is_gigantic(h)) {
-		if (!prep_compound_gigantic_folio(folio, huge_page_order(h))) {
-			/*
-			 * Rare failure to convert pages to compound page.
-			 * Free pages and try again - ONCE!
-			 */
-			free_gigantic_folio(folio, huge_page_order(h));
-			if (!retry) {
-				retry = true;
-				goto retry;
-			}
-			return NULL;
-		}
-	}
-
-	return folio;
-}
-
 static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
 		gfp_t gfp_mask, int nid, nodemask_t *nmask,
 		nodemask_t *node_alloc_noretry)
 {
 	struct folio *folio;
 
-	folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask,
-						node_alloc_noretry);
+	folio = hstate_is_gigantic(h) ? alloc_gigantic_folio(h, gfp_mask, nid, nmask) :
+		alloc_buddy_hugetlb_folio(h, gfp_mask, nid, nmask, node_alloc_noretry);
 	if (folio)
 		init_new_hugetlb_folio(h, folio);
 	return folio;
@@ -2265,7 +2116,8 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
 {
 	struct folio *folio;
 
-	folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask, NULL);
+	folio = hstate_is_gigantic(h) ? alloc_gigantic_folio(h, gfp_mask, nid, nmask) :
+		alloc_buddy_hugetlb_folio(h, gfp_mask, nid, nmask, NULL);
 	if (!folio)
 		return NULL;
 
@@ -3333,6 +3185,7 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
 	for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
+		__ClearPageReserved(folio_page(folio, pfn - head_pfn));
 		__init_single_page(page, pfn, zone, nid);
 		prep_compound_tail((struct page *)folio, pfn - head_pfn);
 		ret = page_ref_freeze(page, 1);
@@ -3950,11 +3803,9 @@ static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio)
 		}
 	}
 
-	/*
-	 * Use destroy_compound_hugetlb_folio_for_demote for all huge page
-	 * sizes as it will not ref count folios.
-	 */
-	destroy_compound_hugetlb_folio_for_demote(folio, huge_page_order(h));
+	split_page_memcg(&folio->page, huge_page_order(h), huge_page_order(target_hstate));
+	split_page_owner(&folio->page, huge_page_order(h), huge_page_order(target_hstate));
+	pgalloc_tag_split(&folio->page, 1 <<  huge_page_order(h));
 
 	/*
 	 * Taking target hstate mutex synchronizes with set_max_huge_pages.
@@ -3969,11 +3820,7 @@ static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio)
 				i += pages_per_huge_page(target_hstate)) {
 		subpage = folio_page(folio, i);
 		inner_folio = page_folio(subpage);
-		if (hstate_is_gigantic(target_hstate))
-			prep_compound_gigantic_folio_for_demote(inner_folio,
-							target_hstate->order);
-		else
-			prep_compound_page(subpage, target_hstate->order);
+		prep_compound_page(subpage, target_hstate->order);
 		folio_change_private(inner_folio, NULL);
 		prep_new_hugetlb_folio(target_hstate, inner_folio, nid);
 		free_huge_folio(inner_folio);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 28f80daf5c04..4ecf2c9428f3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1192,16 +1192,36 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
+/* Split a multi-block free page into its individual pageblocks */
+static void split_large_buddy(struct zone *zone, struct page *page,
+			      unsigned long pfn, int order, fpi_t fpi_flags)
+{
+	unsigned long end_pfn = pfn + (1 << order);
+
+	VM_WARN_ON_ONCE(pfn & ((1 << order) - 1));
+	/* Caller removed page from freelist, buddy info cleared! */
+	VM_WARN_ON_ONCE(PageBuddy(page));
+
+	if (order > pageblock_order)
+		order = pageblock_order;
+
+	while (pfn != end_pfn) {
+		int mt = get_pfnblock_migratetype(page, pfn);
+
+		__free_one_page(page, pfn, zone, order, mt, fpi_flags);
+		pfn += 1 << order;
+		page = pfn_to_page(pfn);
+	}
+}
+
 static void free_one_page(struct zone *zone, struct page *page,
 			  unsigned long pfn, unsigned int order,
 			  fpi_t fpi_flags)
 {
 	unsigned long flags;
-	int migratetype;
 
 	spin_lock_irqsave(&zone->lock, flags);
-	migratetype = get_pfnblock_migratetype(page, pfn);
-	__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
+	split_large_buddy(zone, page, pfn, order, fpi_flags);
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
@@ -1693,27 +1713,6 @@ static unsigned long find_large_buddy(unsigned long start_pfn)
 	return start_pfn;
 }
 
-/* Split a multi-block free page into its individual pageblocks */
-static void split_large_buddy(struct zone *zone, struct page *page,
-			      unsigned long pfn, int order)
-{
-	unsigned long end_pfn = pfn + (1 << order);
-
-	VM_WARN_ON_ONCE(order <= pageblock_order);
-	VM_WARN_ON_ONCE(pfn & (pageblock_nr_pages - 1));
-
-	/* Caller removed page from freelist, buddy info cleared! */
-	VM_WARN_ON_ONCE(PageBuddy(page));
-
-	while (pfn != end_pfn) {
-		int mt = get_pfnblock_migratetype(page, pfn);
-
-		__free_one_page(page, pfn, zone, pageblock_order, mt, FPI_NONE);
-		pfn += pageblock_nr_pages;
-		page = pfn_to_page(pfn);
-	}
-}
-
 /**
  * move_freepages_block_isolate - move free pages in block for page isolation
  * @zone: the zone
@@ -1754,7 +1753,7 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
 		del_page_from_free_list(buddy, zone, order,
 					get_pfnblock_migratetype(buddy, pfn));
 		set_pageblock_migratetype(page, migratetype);
-		split_large_buddy(zone, buddy, pfn, order);
+		split_large_buddy(zone, buddy, pfn, order, FPI_NONE);
 		return true;
 	}
 
@@ -1765,7 +1764,7 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
 		del_page_from_free_list(page, zone, order,
 					get_pfnblock_migratetype(page, pfn));
 		set_pageblock_migratetype(page, migratetype);
-		split_large_buddy(zone, page, pfn, order);
+		split_large_buddy(zone, page, pfn, order, FPI_NONE);
 		return true;
 	}
 move:
@@ -6439,6 +6438,40 @@ int __alloc_contig_migrate_range(struct compact_control *cc,
 	return (ret < 0) ? ret : 0;
 }
 
+static struct page *mark_allocated_noprof(struct page *page, unsigned int order, gfp_t gfp_flags)
+{
+	post_alloc_hook(page, order, __GFP_MOVABLE);
+	return page;
+}
+#define mark_allocated(...)	alloc_hooks(mark_allocated_noprof(__VA_ARGS__))
+
+static void split_free_pages(struct list_head *freepages)
+{
+	unsigned int i, order;
+	struct page *page, *next;
+	LIST_HEAD(tmp_list);
+
+	for (order = 0; order < NR_PAGE_ORDERS; order++) {
+		list_for_each_entry_safe(page, next, &freepages[order], lru) {
+			unsigned int nr_pages;
+
+			list_del(&page->lru);
+
+			nr_pages = 1 << order;
+
+			mark_allocated(page, order, __GFP_MOVABLE);
+			if (order)
+				split_page(page, order);
+
+			for (i = 0; i < nr_pages; i++) {
+				list_add(&page->lru, &tmp_list);
+				page++;
+			}
+		}
+		list_splice_init(&tmp_list, &freepages[0]);
+	}
+}
+
 /**
  * alloc_contig_range() -- tries to allocate given range of pages
  * @start:	start PFN to allocate
@@ -6551,12 +6584,25 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end,
 		goto done;
 	}
 
-	/* Free head and tail (if any) */
-	if (start != outer_start)
-		free_contig_range(outer_start, start - outer_start);
-	if (end != outer_end)
-		free_contig_range(end, outer_end - end);
+	if (!(gfp_mask & __GFP_COMP)) {
+		split_free_pages(cc.freepages);
 
+		/* Free head and tail (if any) */
+		if (start != outer_start)
+			free_contig_range(outer_start, start - outer_start);
+		if (end != outer_end)
+			free_contig_range(end, outer_end - end);
+	} else if (start == outer_start && end == outer_end && is_power_of_2(end - start)) {
+		struct page *head = pfn_to_page(start);
+		int order = ilog2(end - start);
+
+		check_new_pages(head, order);
+		prep_new_page(head, order, gfp_mask, 0);
+	} else {
+		ret = -EINVAL;
+		WARN(true, "PFN range: requested [%lu, %lu), leaked [%lu, %lu)\n",
+		     start, end, outer_start, outer_end);
+	}
 done:
 	undo_isolate_page_range(start, end, migratetype);
 	return ret;
@@ -6665,6 +6711,18 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
 void free_contig_range(unsigned long pfn, unsigned long nr_pages)
 {
 	unsigned long count = 0;
+	struct folio *folio = pfn_folio(pfn);
+
+	if (folio_test_large(folio)) {
+		int expected = folio_nr_pages(folio);
+
+		if (nr_pages == expected)
+			folio_put(folio);
+		else
+			WARN(true, "PFN %lu: nr_pages %lu != expected %d\n",
+			     pfn, nr_pages, expected);
+		return;
+	}
 
 	for (; nr_pages--; pfn++) {
 		struct page *page = pfn_to_page(pfn);

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression
  2024-08-03 22:07           ` Yu Zhao
@ 2024-08-06  3:01             ` Oliver Sang
  0 siblings, 0 replies; 14+ messages in thread
From: Oliver Sang @ 2024-08-06  3:01 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Muchun Song, Janosch Frank, oe-lkp, lkp,
	Linux Memory Management List, Andrew Morton, David Hildenbrand,
	Frank van der Linden, Matthew Wilcox, Peter Xu, Yang Shi,
	linux-kernel, ying.huang, feng.tang, fengwei.yin,
	Christian Borntraeger, Claudio Imbrenda, Marc Hartmayer,
	Heiko Carstens, Yosry Ahmed, oliver.sang

hi, Yu Zhao,

On Sat, Aug 03, 2024 at 04:07:55PM -0600, Yu Zhao wrote:
> Hi Oliver,
> 
> On Fri, Jul 19, 2024 at 10:06 AM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Fri, Jul 19, 2024 at 2:44 AM Oliver Sang <oliver.sang@intel.com> wrote:
> > >
> > > hi, Yu Zhao,
> > >
> > > On Wed, Jul 17, 2024 at 09:44:33AM -0600, Yu Zhao wrote:
> > > > On Wed, Jul 17, 2024 at 2:36 AM Yu Zhao <yuzhao@google.com> wrote:
> > > > >
> > > > > Hi Janosch and Oliver,
> > > > >
> > > > > On Wed, Jul 17, 2024 at 1:57 AM Janosch Frank <frankja@linux.ibm.com> wrote:
> > > > > >
> > > > > > On 7/9/24 07:11, kernel test robot wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > kernel test robot noticed a -34.3% regression of vm-scalability.throughput on:
> > > > > > >
> > > > > > >
> > > > > > > commit: 875fa64577da9bc8e9963ee14fef8433f20653e7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> > > > > > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > > > > > >
> > > > > > > [still regression on linux-next/master 0b58e108042b0ed28a71cd7edf5175999955b233]
> > > > > > >
> > > > > > This has hit s390 huge page backed KVM guests as well.
> > > > > > Our simple start/stop test case went from ~5 to over 50 seconds of runtime.
> > > > >
> > > > > Could you try the attached patch please? Thank you.
> > > >
> > > > Thanks, Yosry, for spotting the following typo:
> > > >   flags &= VMEMMAP_SYNCHRONIZE_RCU;
> > > > It's supposed to be:
> > > >   flags &= ~VMEMMAP_SYNCHRONIZE_RCU;
> > > >
> > > > Reattaching v2 with the above typo fixed. Please let me know, Janosch & Oliver.
> > >
> > > since the commit is in mainline now, I directly apply your v2 patch upon
> > > bd225530a4c71 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> > >
> > > in our tests, your v2 patch not only recovers the performance regression,
> >
> > Thanks for verifying the fix!
> >
> > > it even has +13.7% performance improvement than 5a4d8944d6b1e (parent of
> > > bd225530a4c71)
> >
> > Glad to hear!
> >
> > (The original patch improved and regressed the performance at the same
> > time, but the regression is bigger. The fix removed the regression and
> > surfaced the improvement.)
> 
> Can you please run the benchmark again with the attached patch on top
> of the last fix?

last time, I applied your last fix (1) directly upon mainline commit (2)

9a5b87b521401 fix for 875fa64577 (then bd225530a4 in main)               <--- (1)
bd225530a4c71 mm/hugetlb_vmemmap: fix race with speculative PFN walkers  <--- (2)

but I failed to apply your patch this time upon (1)

then I found I can apply the above (1) upon mainline commit (3), as below (4).
your patch this time applied successfully upon (4), as below (5); a rough
reproduction sketch follows the list:

e2b8dff50992a new hugetlb-20240805.patch                                       <--- (5)
b5af188232e56 v2 fix for bd225530a4 but apply on mainline tip 17712b7ea0756    <--- (4)
17712b7ea0756 Merge tag 'io_uring-6.11-20240802' of git://git.kernel.dk/linux  <--- (3)
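
(for reference, a rough sketch of how this stack can be reproduced;
"v2-fix.patch" below is a hypothetical name for your last fix, and since
both attachments are raw diffs, git apply + commit is used:)

  git checkout 17712b7ea0756
  git apply v2-fix.patch           # hypothetical file name for the last fix
  git commit -am 'v2 fix for bd225530a4 but apply on mainline tip 17712b7ea0756'   # -> (4)
  git apply hugetlb-20240805.patch
  git commit -am 'new hugetlb-20240805.patch'                                      # -> (5)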

I tested (3)(4)(5) and compared them with bd225530a4c71 and its parent. details
are as below [1]

you may notice the data for bd225530a4c71 and its parent differ from the
previous data. this is because we found some problems with gcc-13, so we have
switched to gcc-12 now, and our config has also changed.

we have the following observations.
* bd225530a4c71 still has a similar -36.6% regression compared to its parent
* 17712b7ea0756 shows similar data to bd225530a4c71 (a little worse: -39.2%
  compared to 5a4d8944d6b1e, the parent of bd225530a4c71)
* your last fix still recovers the regression, but is not better
  than 5a4d8944d6b1e
* your patch this time does not seem to impact the performance data much


[1]

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/300s/512G/lkp-icl-2sp2/anon-cow-rand-hugetlb/vm-scalability

commit:
  5a4d8944d6b1e ("cachestat: do not flush stats in recency check")
  bd225530a4c71 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
  17712b7ea0756 ("Merge tag 'io_uring-6.11-20240802' of git://git.kernel.dk/linux")
  b5af188232e56  <--- apply your last fix upon 17712b7ea0756
  e2b8dff50992a  <--- then apply your patch this time upon b5af188232e56

5a4d8944d6b1e1aa bd225530a4c717714722c373144 17712b7ea0756799635ba159cc7 b5af188232e564d17fc3c1784f7 e2b8dff50992a56c67308f905bd
---------------- --------------------------- --------------------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \          |                \          |                \
 3.312e+09 ± 34%    +472.2%  1.895e+10 ±  3%    +487.2%  1.945e+10 ±  7%      -4.4%  3.167e+09 ± 29%     -15.6%  2.795e+09 ± 29%  cpuidle..time
    684985 ±  5%   +1112.3%    8304355 ±  2%   +1099.5%    8216278            -2.4%     668573 ±  5%      -5.2%     649406 ±  2%  cpuidle..usage
    231.53 ±  3%     +40.7%     325.70 ±  2%     +45.1%     335.98 ±  5%      +1.4%     234.78 ±  4%      +0.2%     231.94 ±  3%  uptime.boot
     10015 ± 10%    +156.8%      25723 ±  4%    +156.8%      25724 ±  7%      +4.3%      10447 ± 12%      -0.7%       9945 ± 13%  uptime.idle
    577860 ±  7%     +18.1%     682388 ±  8%     +12.1%     647808 ±  6%      +9.9%     635341 ±  6%      -0.3%     576189 ±  5%  numa-numastat.node0.local_node
    624764 ±  5%     +16.1%     725128 ±  4%     +18.0%     736975 ±  2%     +10.2%     688399 ±  2%      +3.0%     643587 ±  5%  numa-numastat.node0.numa_hit
    647823 ±  5%     +11.3%     721266 ±  9%     +15.7%     749411 ±  6%     -10.0%     583278 ±  5%      -1.0%     641117 ±  3%  numa-numastat.node1.local_node
    733550 ±  4%     +10.6%     811157 ±  4%      +8.4%     795091 ±  3%      -9.0%     667814 ±  3%      -3.4%     708807 ±  3%  numa-numastat.node1.numa_hit
      6.17 ±108%   +1521.6%     100.00 ± 38%  +26137.8%       1618 ±172%     -74.1%       1.60 ± 84%     +27.0%       7.83 ±114%  perf-c2c.DRAM.local
     46.17 ± 43%   +2759.6%       1320 ± 26%  +12099.6%       5632 ±112%     +18.3%      54.60 ± 56%     +48.7%      68.67 ± 42%  perf-c2c.DRAM.remote
     36.50 ± 52%   +1526.5%     593.67 ± 26%   +1305.5%     513.00 ± 53%      +2.5%      37.40 ± 46%     +62.6%      59.33 ± 66%  perf-c2c.HITM.local
     15.33 ± 74%   +2658.7%     423.00 ± 36%   +2275.0%     364.17 ± 67%     +48.7%      22.80 ± 75%    +122.8%      34.17 ± 58%  perf-c2c.HITM.remote
     15.34 ± 27%    +265.8%      56.12          +256.0%      54.63            -2.5%      14.96 ± 23%     -12.7%      13.39 ± 23%  vmstat.cpu.id
     73.93 ±  5%     -41.4%      43.30 ±  2%     -39.3%      44.85 ±  2%      +0.5%      74.27 ±  4%      +2.4%      75.72 ±  3%  vmstat.cpu.us
    110.76 ±  4%     -47.2%      58.47 ±  2%     -45.7%      60.14 ±  2%      +0.1%     110.90 ±  4%      +1.9%     112.84 ±  3%  vmstat.procs.r
      2729 ±  3%    +167.3%       7294 ±  2%    +155.7%       6979 ±  4%      +0.2%       2734            -1.3%       2692 ±  5%  vmstat.system.cs
    150274 ±  5%     -23.2%     115398 ±  6%     -27.2%     109377 ± 13%      +0.6%     151130 ±  4%      +0.9%     151666 ±  3%  vmstat.system.in
     14.31 ± 29%     +41.4       55.74           +40.0       54.31            -0.5       13.85 ± 25%      -1.9       12.42 ± 24%  mpstat.cpu.all.idle%
      0.34 ±  5%      -0.1        0.21 ±  2%      -0.1        0.21 ±  2%      -0.0        0.34 ±  4%      +0.0        0.35 ±  4%  mpstat.cpu.all.irq%
      0.02 ±  4%      +0.0        0.03            +0.0        0.03 ±  4%      -0.0        0.02 ±  2%      -0.0        0.02 ±  2%  mpstat.cpu.all.soft%
     10.63 ±  4%     -10.2        0.43 ±  4%     -10.3        0.35 ± 29%      +0.1       10.71 ±  2%      +0.2       10.79 ±  4%  mpstat.cpu.all.sys%
     74.69 ±  5%     -31.1       43.59 ±  2%     -29.6       45.10 ±  2%      +0.4       75.08 ±  4%      +1.7       76.42 ±  3%  mpstat.cpu.all.usr%
      6.83 ± 15%    +380.5%      32.83 ± 45%    +217.1%      21.67 ±  5%     +40.5%       9.60 ± 41%      -7.3%       6.33 ±  7%  mpstat.max_utilization.seconds
      0.71 ± 55%      +0.4        1.14 ±  3%      +0.2        0.96 ± 44%      +0.4        1.09 ±  4%      +0.2        0.91 ± 30%  perf-profile.calltrace.cycles-pp.lrand48_r
     65.57 ± 10%      +3.5       69.09            -7.3       58.23 ± 45%      +2.4       67.94            +7.2       72.76        perf-profile.calltrace.cycles-pp.do_rw_once
      0.06 ±  7%      -0.0        0.05 ± 46%      +0.0        0.11 ± 48%      +0.0        0.07 ± 16%      +0.0        0.08 ± 16%  perf-profile.children.cycles-pp.get_jiffies_update
      0.28 ± 10%      +0.0        0.29 ±  8%      +0.3        0.58 ± 74%      +0.0        0.30 ± 13%      +0.0        0.32 ± 12%  perf-profile.children.cycles-pp.__hrtimer_run_queues
      0.24 ± 10%      +0.0        0.25 ± 10%      +0.2        0.46 ± 66%      +0.0        0.26 ± 13%      +0.0        0.28 ± 12%  perf-profile.children.cycles-pp.update_process_times
      0.06 ±  7%      -0.0        0.05 ± 46%      +0.0        0.11 ± 48%      +0.0        0.07 ± 16%      +0.0        0.08 ± 16%  perf-profile.self.cycles-pp.get_jiffies_update
      0.50 ±  7%      +0.1        0.56 ±  3%      -0.0        0.46 ± 44%      +0.0        0.53 ±  3%      +0.0        0.52 ±  6%  perf-profile.self.cycles-pp.lrand48_r@plt
     26722 ±  4%     -33.8%      17690 ± 15%     -40.0%      16038 ± 29%      +6.9%      28560 ±  3%      +9.0%      29116 ±  3%  numa-meminfo.node0.HugePages_Surp
     26722 ±  4%     -33.8%      17690 ± 15%     -40.0%      16038 ± 29%      +6.9%      28560 ±  3%      +9.0%      29116 ±  3%  numa-meminfo.node0.HugePages_Total
  74013758 ±  3%     +24.7%   92302659 ±  5%     +30.0%   96190384 ±  9%      -5.6%   69852204 ±  2%      -6.0%   69592735 ±  3%  numa-meminfo.node0.MemFree
  57671194 ±  4%     -31.7%   39382292 ± 12%     -38.5%   35494567 ± 26%      +7.2%   61832747 ±  3%      +7.7%   62092216 ±  3%  numa-meminfo.node0.MemUsed
     84822 ± 19%     +57.1%     133225 ± 17%     +13.5%      96280 ± 39%      -4.4%      81114 ±  9%      -6.0%      79743 ± 11%  numa-meminfo.node1.Active
     84781 ± 19%     +57.1%     133211 ± 17%     +13.5%      96254 ± 39%      -4.4%      81091 ±  9%      -6.0%      79729 ± 11%  numa-meminfo.node1.Active(anon)
  78416592 ±  7%     +13.3%   88860070 ±  5%      +6.7%   83660976 ± 11%      +4.5%   81951764 ±  4%      +2.4%   80309519 ±  3%  numa-meminfo.node1.MemFree
  53641607 ± 11%     -19.5%   43198129 ± 11%      -9.8%   48397199 ± 19%      -6.6%   50106411 ±  7%      -3.5%   51748656 ±  4%  numa-meminfo.node1.MemUsed
  18516537 ±  3%     +24.7%   23084190 ±  5%     +29.9%   24053750 ±  9%      -5.6%   17484374 ±  3%      -6.1%   17387753 ±  2%  numa-vmstat.node0.nr_free_pages
    624065 ±  5%     +16.0%     724171 ±  4%     +18.0%     736399 ±  2%     +10.1%     687335 ±  2%      +3.0%     642802 ±  5%  numa-vmstat.node0.numa_hit
    577161 ±  8%     +18.1%     681431 ±  8%     +12.1%     647232 ±  6%      +9.9%     634277 ±  6%      -0.3%     575404 ±  5%  numa-vmstat.node0.numa_local
     21141 ± 19%     +57.4%      33269 ± 17%     +13.7%      24027 ± 39%      -4.2%      20242 ±  9%      -5.6%      19967 ± 11%  numa-vmstat.node1.nr_active_anon
  19586357 ±  7%     +13.5%   22224344 ±  5%      +6.8%   20914089 ± 11%      +4.6%   20487157 ±  4%      +2.6%   20087311 ±  3%  numa-vmstat.node1.nr_free_pages
     21141 ± 19%     +57.4%      33269 ± 17%     +13.7%      24027 ± 39%      -4.2%      20242 ±  9%      -5.6%      19967 ± 11%  numa-vmstat.node1.nr_zone_active_anon
    732629 ±  4%     +10.5%     809596 ±  4%      +8.4%     793911 ±  3%      -9.0%     666417 ±  3%      -3.5%     707191 ±  3%  numa-vmstat.node1.numa_hit
    646902 ±  5%     +11.3%     719705 ±  9%     +15.7%     748231 ±  6%     -10.1%     581882 ±  5%      -1.1%     639501 ±  3%  numa-vmstat.node1.numa_local
    167.87 ±  2%     +56.6%     262.93           +65.3%     277.49 ±  3%      -1.9%     164.74            -1.8%     164.84        time.elapsed_time
    167.87 ±  2%     +56.6%     262.93           +65.3%     277.49 ±  3%      -1.9%     164.74            -1.8%     164.84        time.elapsed_time.max
    140035 ±  6%     -50.5%      69271 ±  5%     -50.8%      68889 ±  4%      -5.9%     131759 ±  3%      -1.2%     138362 ±  8%  time.involuntary_context_switches
    163.67 ± 10%     +63.7%     268.00 ±  5%     +76.5%     288.83 ±  5%     +13.0%     185.00 ±  8%     +22.4%     200.33 ±  4%  time.major_page_faults
     11308 ±  2%     -48.4%       5830           -47.0%       5995            +1.8%      11514            +2.5%      11591        time.percent_of_cpu_this_job_got
      2347           -94.0%     139.98 ±  3%     -95.0%     117.34 ± 21%      +0.3%       2354            -0.0%       2347        time.system_time
     16627            -8.6%      15191            -0.6%      16529 ±  5%      -0.1%      16616            +0.8%      16759 ±  2%  time.user_time
     12158 ±  2%   +5329.5%     660155         +5325.1%     659615            -1.0%      12037 ±  3%      -1.4%      11985 ±  3%  time.voluntary_context_switches
     59662           -37.0%      37607           -40.5%      35489 ±  4%      +0.5%      59969            -0.1%      59610        vm-scalability.median
      2.19 ± 20%      +1.7        3.91 ± 30%      +3.3        5.51 ± 30%      +0.6        2.82 ± 23%      +1.5        3.72 ± 25%  vm-scalability.median_stddev%
      2.92 ± 22%      +0.6        3.49 ± 32%      +1.5        4.45 ± 19%      +0.4        3.35 ± 17%      +1.5        4.39 ± 16%  vm-scalability.stddev%
   7821791           -36.6%    4961402           -39.2%    4758850 ±  2%      -0.2%    7809010            -0.7%    7769662        vm-scalability.throughput
    167.87 ±  2%     +56.6%     262.93           +65.3%     277.49 ±  3%      -1.9%     164.74            -1.8%     164.84        vm-scalability.time.elapsed_time
    167.87 ±  2%     +56.6%     262.93           +65.3%     277.49 ±  3%      -1.9%     164.74            -1.8%     164.84        vm-scalability.time.elapsed_time.max
    140035 ±  6%     -50.5%      69271 ±  5%     -50.8%      68889 ±  4%      -5.9%     131759 ±  3%      -1.2%     138362 ±  8%  vm-scalability.time.involuntary_context_switches
     11308 ±  2%     -48.4%       5830           -47.0%       5995            +1.8%      11514            +2.5%      11591        vm-scalability.time.percent_of_cpu_this_job_got
      2347           -94.0%     139.98 ±  3%     -95.0%     117.34 ± 21%      +0.3%       2354            -0.0%       2347        vm-scalability.time.system_time
     16627            -8.6%      15191            -0.6%      16529 ±  5%      -0.1%      16616            +0.8%      16759 ±  2%  vm-scalability.time.user_time
     12158 ±  2%   +5329.5%     660155         +5325.1%     659615            -1.0%      12037 ±  3%      -1.4%      11985 ±  3%  vm-scalability.time.voluntary_context_switches
     88841 ± 18%     +56.6%     139142 ± 16%     +18.6%     105352 ± 34%      -3.1%      86098 ±  9%      -6.8%      82770 ± 11%  meminfo.Active
     88726 ± 18%     +56.7%     139024 ± 16%     +18.6%     105233 ± 34%      -3.1%      85984 ±  9%      -6.8%      82654 ± 11%  meminfo.Active(anon)
  79226777 ±  3%     +18.1%   93562456           +17.2%   92853282            -0.3%   78961619 ±  2%      -1.5%   78023229 ±  2%  meminfo.CommitLimit
     51410 ±  5%     -27.2%      37411 ±  2%     -25.9%      38103 ±  2%      +0.5%      51669 ±  4%      +2.3%      52586 ±  3%  meminfo.HugePages_Surp
     51410 ±  5%     -27.2%      37411 ±  2%     -25.9%      38103 ±  2%      +0.5%      51669 ±  4%      +2.3%      52586 ±  3%  meminfo.HugePages_Total
 1.053e+08 ±  5%     -27.2%   76618243 ±  2%     -25.9%   78036556 ±  2%      +0.5%  1.058e+08 ±  4%      +2.3%  1.077e+08 ±  3%  meminfo.Hugetlb
     59378 ±  9%     -27.2%      43256 ±  9%     -29.4%      41897 ± 15%      -3.0%      57584 ±  9%      -1.5%      58465 ±  8%  meminfo.Mapped
 1.513e+08 ±  3%     +19.0%  1.801e+08           +18.1%  1.787e+08            -0.3%  1.508e+08 ±  3%      -1.6%  1.489e+08 ±  2%  meminfo.MemAvailable
 1.523e+08 ±  3%     +18.9%  1.811e+08           +18.0%  1.798e+08            -0.3%  1.518e+08 ±  3%      -1.6%  1.499e+08 ±  2%  meminfo.MemFree
 1.114e+08 ±  4%     -25.8%   82607720 ±  2%     -24.6%   83956777 ±  2%      +0.5%  1.119e+08 ±  4%      +2.2%  1.138e+08 ±  3%  meminfo.Memused
     10914 ±  2%      -9.3%       9894            -9.0%       9935            +0.8%      10999 ±  2%      +1.3%      11059        meminfo.PageTables
    235415 ±  4%     +17.2%     275883 ±  9%      +2.4%     241001 ± 17%      -1.9%     230929 ±  2%      -2.2%     230261 ±  4%  meminfo.Shmem
     22170 ± 18%     +57.0%      34801 ± 17%     +18.9%      26361 ± 34%      -2.6%      21594 ±  9%      -6.6%      20698 ± 11%  proc-vmstat.nr_active_anon
   3774988 ±  3%     +19.0%    4493004           +18.2%    4461258            -0.3%    3762775 ±  2%      -1.7%    3712537 ±  2%  proc-vmstat.nr_dirty_background_threshold
   7559208 ±  3%     +19.0%    8996995           +18.2%    8933426            -0.3%    7534750 ±  2%      -1.7%    7434153 ±  2%  proc-vmstat.nr_dirty_threshold
    824427            +1.2%     834568            +0.3%     826777            -0.0%     824269            -0.0%     824023        proc-vmstat.nr_file_pages
  38091344 ±  3%     +18.9%   45280310           +18.0%   44962412            -0.3%   37969040 ±  2%      -1.6%   37466065 ±  2%  proc-vmstat.nr_free_pages
     25681            -1.7%      25241            -1.6%      25268            +0.9%      25908            -0.1%      25665        proc-vmstat.nr_kernel_stack
     15161 ±  9%     -28.5%      10841 ±  9%     -30.3%      10565 ± 14%      -3.4%      14641 ±  9%      -2.1%      14849 ±  7%  proc-vmstat.nr_mapped
      2729 ±  2%      -9.4%       2473            -9.1%       2480            +0.7%       2748 ±  2%      +1.2%       2762        proc-vmstat.nr_page_table_pages
     58775 ±  4%     +17.3%      68926 ±  9%      +2.5%      60274 ± 18%      -1.8%      57736 ±  2%      -2.1%      57526 ±  4%  proc-vmstat.nr_shmem
     22170 ± 18%     +57.0%      34801 ± 17%     +18.9%      26361 ± 34%      -2.6%      21594 ±  9%      -6.6%      20698 ± 11%  proc-vmstat.nr_zone_active_anon
   1360860           +13.0%    1537181           +12.7%    1533834            -0.2%    1357949            -0.5%    1354233        proc-vmstat.numa_hit
   1228230           +14.4%    1404550           +13.9%    1398987            -0.6%    1220355            -0.7%    1219146        proc-vmstat.numa_local
    132626            +0.0%     132681            +1.7%     134822            +3.7%     137582 ±  4%      +1.9%     135086        proc-vmstat.numa_other
   1186558           +18.1%    1400807           +19.5%    1417837            -0.3%    1182763            -0.5%    1180560        proc-vmstat.pgfault
     31861 ±  3%     +28.2%      40847           +31.7%      41945 ±  5%      -3.1%      30881 ±  3%      -1.7%      31316 ±  4%  proc-vmstat.pgreuse
     17.18 ±  3%    +337.2%      75.11 ±  2%    +318.3%      71.87 ±  5%      -1.3%      16.96 ±  4%      -0.0%      17.18 ±  3%  perf-stat.i.MPKI
 1.727e+10 ±  5%     -37.8%  1.073e+10 ±  2%     -41.2%  1.015e+10 ±  6%      +0.7%  1.738e+10 ±  3%      +1.7%  1.757e+10 ±  4%  perf-stat.i.branch-instructions
      0.12 ± 36%      +0.6        0.73 ±  5%      +0.7        0.79 ±  6%      +0.0        0.12 ± 27%      -0.0        0.11 ± 32%  perf-stat.i.branch-miss-rate%
  10351997 ± 16%     -28.0%    7451909 ± 13%     -29.7%    7276965 ± 16%     -10.0%    9315546 ± 22%      -7.3%    9592438 ± 25%  perf-stat.i.branch-misses
     94.27 ±  3%     -20.3       73.99 ±  2%     -19.2       75.03            -0.8       93.49 ±  3%      +0.3       94.60 ±  3%  perf-stat.i.cache-miss-rate%
   9.7e+08 ±  5%     -39.6%  5.859e+08 ±  2%     -42.8%  5.552e+08 ±  5%      +0.6%  9.759e+08 ±  3%      +1.6%  9.854e+08 ±  4%  perf-stat.i.cache-misses
 9.936e+08 ±  5%     -35.3%  6.431e+08 ±  2%     -38.8%  6.084e+08 ±  5%      +0.5%   9.99e+08 ±  3%      +1.5%  1.008e+09 ±  4%  perf-stat.i.cache-references
      2640 ±  3%    +180.7%       7410 ±  2%    +168.8%       7097 ±  4%      -0.0%       2640            -1.5%       2601 ±  5%  perf-stat.i.context-switches
      4.60 ±  2%     +22.2%       5.62           +18.1%       5.44 ±  5%      -1.0%       4.56 ±  2%      +0.5%       4.62        perf-stat.i.cpi
 2.888e+11 ±  5%     -47.9%  1.503e+11 ±  2%     -46.8%  1.538e+11 ±  2%      +0.6%  2.907e+11 ±  4%      +2.4%  2.956e+11 ±  3%  perf-stat.i.cpu-cycles
    214.97 ±  3%     +48.6%     319.40 ±  2%     +50.3%     323.15            +0.3%     215.56            +0.9%     216.91        perf-stat.i.cpu-migrations
   7.4e+10 ±  5%     -37.6%  4.618e+10 ±  2%     -41.0%  4.369e+10 ±  6%      +0.7%  7.449e+10 ±  3%      +1.7%  7.529e+10 ±  4%  perf-stat.i.instructions
      0.28 ±  7%     +33.6%       0.38 ±  3%     +31.5%       0.37 ±  2%      +0.0%       0.28 ±  6%      -2.7%       0.27 ±  5%  perf-stat.i.ipc
      6413 ±  4%     -21.5%       5037           -24.5%       4839 ±  5%      -0.2%       6397 ±  4%      +0.8%       6464 ±  2%  perf-stat.i.minor-faults
      6414 ±  4%     -21.5%       5038           -24.5%       4840 ±  5%      -0.3%       6398 ±  4%      +0.8%       6465 ±  2%  perf-stat.i.page-faults
     13.16            -4.0%      12.64            -3.9%      12.64            +0.0%      13.17            +0.1%      13.17        perf-stat.overall.MPKI
     97.57            -6.3       91.24            -6.1       91.44            +0.1       97.64            +0.1       97.67        perf-stat.overall.cache-miss-rate%
      3.91           -16.9%       3.25            -9.8%       3.53 ±  5%      -0.0%       3.91            +0.7%       3.94        perf-stat.overall.cpi
    296.89           -13.4%     257.07            -6.1%     278.90 ±  5%      -0.1%     296.69            +0.7%     298.84        perf-stat.overall.cycles-between-cache-misses
      0.26           +20.3%       0.31           +11.1%       0.28 ±  5%      +0.0%       0.26            -0.7%       0.25        perf-stat.overall.ipc
     10770            -2.2%      10537            -2.3%      10523            +0.2%      10788            +0.1%      10784        perf-stat.overall.path-length
   1.7e+10 ±  4%     -36.8%  1.074e+10 ±  2%     -39.8%  1.023e+10 ±  5%      +0.6%  1.711e+10 ±  3%      +1.6%  1.727e+10 ±  4%  perf-stat.ps.branch-instructions
  10207074 ± 15%     -27.2%    7428222 ± 13%     -29.6%    7182646 ± 16%      -9.7%    9221719 ± 22%      -6.6%    9530095 ± 25%  perf-stat.ps.branch-misses
 9.588e+08 ±  4%     -39.1%  5.838e+08           -42.0%  5.566e+08 ±  5%      +0.7%  9.651e+08 ±  3%      +1.6%  9.744e+08 ±  4%  perf-stat.ps.cache-misses
 9.826e+08 ±  4%     -34.9%  6.398e+08           -38.1%  6.087e+08 ±  5%      +0.6%  9.884e+08 ±  3%      +1.5%  9.975e+08 ±  4%  perf-stat.ps.cache-references
      2628 ±  3%    +176.7%       7271 ±  2%    +164.7%       6956 ±  4%      +0.3%       2635            -1.0%       2600 ±  5%  perf-stat.ps.context-switches
 2.847e+11 ±  4%     -47.3%  1.501e+11 ±  2%     -45.6%  1.548e+11 ±  2%      +0.6%  2.864e+11 ±  4%      +2.3%  2.911e+11 ±  3%  perf-stat.ps.cpu-cycles
    213.42 ±  3%     +47.5%     314.87 ±  2%     +49.2%     318.34            +0.5%     214.42            +1.3%     216.10        perf-stat.ps.cpu-migrations
 7.284e+10 ±  4%     -36.6%   4.62e+10 ±  2%     -39.6%  4.402e+10 ±  5%      +0.6%   7.33e+10 ±  3%      +1.6%  7.398e+10 ±  4%  perf-stat.ps.instructions
      6416 ±  3%     -22.4%       4976           -25.6%       4772 ±  5%      +0.2%       6426 ±  3%      +1.6%       6516 ±  2%  perf-stat.ps.minor-faults
      6417 ±  3%     -22.4%       4977           -25.6%       4774 ±  5%      +0.2%       6428 ±  3%      +1.6%       6517 ±  2%  perf-stat.ps.page-faults
 1.268e+13            -2.2%  1.241e+13            -2.3%  1.239e+13            +0.2%   1.27e+13            +0.1%   1.27e+13        perf-stat.total.instructions
   7783325 ± 13%     -22.8%    6008522 ± 10%     -20.8%    6163644 ± 20%     -13.8%    6708575 ± 22%      -4.5%    7429947 ± 26%  sched_debug.cfs_rq:/.avg_vruntime.avg
   8109328 ± 13%     -18.8%    6584206 ± 10%     -15.3%    6872509 ± 19%     -14.2%    6957983 ± 22%      -5.4%    7673718 ± 26%  sched_debug.cfs_rq:/.avg_vruntime.max
    244161 ± 30%     +28.2%     313090 ± 22%     +76.6%     431126 ± 21%     -23.5%     186903 ± 26%     -28.7%     173977 ± 29%  sched_debug.cfs_rq:/.avg_vruntime.stddev
      0.66 ± 11%     -22.0%       0.52 ± 21%     -41.3%       0.39 ± 29%      -0.1%       0.66 ±  8%      -3.5%       0.64 ± 16%  sched_debug.cfs_rq:/.h_nr_running.avg
    495.88 ± 33%     -44.7%     274.12 ±  3%     -11.5%     438.85 ± 32%     -11.2%     440.30 ± 18%     -12.2%     435.24 ± 27%  sched_debug.cfs_rq:/.load_avg.max
     81.79 ± 28%     -33.2%      54.62 ± 16%     -15.5%      69.10 ± 26%      +7.2%      87.66 ± 23%      -8.4%      74.91 ± 38%  sched_debug.cfs_rq:/.load_avg.stddev
   7783325 ± 13%     -22.8%    6008522 ± 10%     -20.8%    6163644 ± 20%     -13.8%    6708575 ± 22%      -4.5%    7429947 ± 26%  sched_debug.cfs_rq:/.min_vruntime.avg
   8109328 ± 13%     -18.8%    6584206 ± 10%     -15.3%    6872509 ± 19%     -14.2%    6957983 ± 22%      -5.4%    7673718 ± 26%  sched_debug.cfs_rq:/.min_vruntime.max
    244161 ± 30%     +28.2%     313090 ± 22%     +76.6%     431126 ± 21%     -23.5%     186902 ± 26%     -28.7%     173977 ± 29%  sched_debug.cfs_rq:/.min_vruntime.stddev
      0.66 ± 11%     -22.3%       0.51 ± 21%     -41.5%       0.38 ± 29%      -0.4%       0.66 ±  8%      -3.8%       0.63 ± 16%  sched_debug.cfs_rq:/.nr_running.avg
    382.00 ± 36%     -44.2%     213.33 ±  8%     -23.3%     292.98 ± 42%      -2.3%     373.40 ± 18%      +4.0%     397.33 ± 20%  sched_debug.cfs_rq:/.removed.load_avg.max
    194.86 ± 36%     -44.3%     108.59 ±  8%     -24.0%     148.18 ± 40%      -2.4%     190.23 ± 18%      +3.7%     202.10 ± 20%  sched_debug.cfs_rq:/.removed.runnable_avg.max
    194.86 ± 36%     -44.3%     108.59 ±  8%     -24.0%     148.18 ± 40%      -2.4%     190.23 ± 18%      +3.7%     202.10 ± 20%  sched_debug.cfs_rq:/.removed.util_avg.max
    713.50 ± 11%     -22.6%     552.00 ± 20%     -39.7%     430.54 ± 26%      -0.2%     712.27 ±  7%      -3.0%     691.86 ± 14%  sched_debug.cfs_rq:/.runnable_avg.avg
      1348 ± 10%     -15.9%       1133 ± 15%     -20.9%       1067 ± 12%      +2.4%       1380 ±  8%      +5.1%       1417 ± 18%  sched_debug.cfs_rq:/.runnable_avg.max
    708.60 ± 11%     -22.6%     548.41 ± 20%     -39.6%     427.82 ± 26%      -0.1%     707.59 ±  7%      -3.1%     686.34 ± 14%  sched_debug.cfs_rq:/.util_avg.avg
      1119 ±  5%     -16.3%     937.08 ± 11%     -18.6%     910.83 ± 11%      +2.0%       1141 ±  6%      -0.1%       1117 ±  8%  sched_debug.cfs_rq:/.util_avg.max
    633.71 ± 11%     -95.7%      27.38 ± 17%     -96.7%      21.00 ± 19%      -0.6%     630.15 ± 10%      -3.6%     610.78 ± 17%  sched_debug.cfs_rq:/.util_est.avg
      1102 ± 18%     -63.9%     397.88 ± 15%     -67.5%     358.19 ±  8%      +6.1%       1169 ± 14%      +6.6%       1174 ± 24%  sched_debug.cfs_rq:/.util_est.max
    119.77 ± 55%     -64.5%      42.46 ± 12%     -67.8%      38.59 ± 12%      -3.2%     115.93 ± 51%     -12.3%     105.01 ± 70%  sched_debug.cfs_rq:/.util_est.stddev
    145182 ± 12%     -37.6%      90551 ± 11%     -29.5%     102317 ± 18%      -7.3%     134528 ± 10%     -17.2%     120251 ± 18%  sched_debug.cpu.avg_idle.stddev
    122256 ±  8%     +41.4%     172906 ±  7%     +38.2%     168929 ± 14%      -5.4%     115642 ±  6%      -1.3%     120639 ± 14%  sched_debug.cpu.clock.avg
    122268 ±  8%     +41.4%     172920 ±  7%     +38.2%     168942 ± 14%      -5.4%     115657 ±  6%      -1.3%     120655 ± 14%  sched_debug.cpu.clock.max
    122242 ±  8%     +41.4%     172892 ±  7%     +38.2%     168914 ± 14%      -5.4%     115627 ±  6%      -1.3%     120621 ± 14%  sched_debug.cpu.clock.min
    121865 ±  8%     +41.5%     172490 ±  7%     +38.3%     168517 ± 14%      -5.4%     115298 ±  6%      -1.3%     120268 ± 14%  sched_debug.cpu.clock_task.avg
    122030 ±  8%     +41.5%     172681 ±  7%     +38.3%     168714 ± 14%      -5.4%     115451 ±  6%      -1.3%     120421 ± 14%  sched_debug.cpu.clock_task.max
    112808 ±  8%     +44.2%     162675 ±  7%     +41.0%     159006 ± 15%      -5.5%     106630 ±  7%      -1.1%     111604 ± 15%  sched_debug.cpu.clock_task.min
      5671 ±  6%     +24.6%       7069 ±  4%     +24.0%       7034 ±  8%      -7.2%       5261 ±  7%      -3.5%       5471 ± 10%  sched_debug.cpu.curr->pid.max
      0.00 ± 12%     +22.5%       0.00 ± 50%     +17.7%       0.00 ± 42%     +71.0%       0.00 ± 35%     +59.0%       0.00 ± 43%  sched_debug.cpu.next_balance.stddev
      0.66 ± 11%     -22.0%       0.51 ± 21%     -41.4%       0.39 ± 29%      -0.3%       0.66 ±  8%      -3.6%       0.64 ± 16%  sched_debug.cpu.nr_running.avg
      2659 ± 12%    +208.6%       8204 ±  7%    +192.0%       7763 ± 14%     -10.1%       2391 ± 11%      -6.2%       2493 ± 15%  sched_debug.cpu.nr_switches.avg
    679.31 ± 10%    +516.8%       4189 ± 14%    +401.6%       3407 ± 24%     -14.7%     579.50 ± 19%      -6.8%     633.18 ± 25%  sched_debug.cpu.nr_switches.min
      0.00 ±  9%  +12202.6%       0.31 ± 42%  +12627.8%       0.32 ± 37%     +67.0%       0.00 ± 50%     -34.8%       0.00 ± 72%  sched_debug.cpu.nr_uninterruptible.avg
    122243 ±  8%     +41.4%     172893 ±  7%     +38.2%     168916 ± 14%      -5.4%     115628 ±  6%      -1.3%     120623 ± 14%  sched_debug.cpu_clk
    120996 ±  8%     +41.9%     171660 ±  7%     +38.6%     167751 ± 15%      -5.4%     114462 ±  6%      -1.3%     119457 ± 14%  sched_debug.ktime
    123137 ±  8%     +41.1%     173805 ±  7%     +37.9%     169767 ± 14%      -5.4%     116479 ±  6%      -1.4%     121452 ± 13%  sched_debug.sched_clk


> 
> I spotted something else worth optimizing last time, and with the
> patch attached, I was able to measure some significant improvements in
> 1GB hugeTLB allocation and free time, e.g., when allocating and
> freeing 700 1GB hugeTLB pages:
> 
> Before:
>   # time echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>   real  0m13.500s
>   user  0m0.000s
>   sys   0m13.311s
> 
>   # time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>   real  0m11.269s
>   user  0m0.000s
>   sys   0m11.187s
> 
> 
> After:
>   # time echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>   real  0m10.643s
>   user  0m0.001s
>   sys   0m10.487s
> 
>   # time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>   real  0m1.541s
>   user  0m0.000s
>   sys   0m1.528s
> 
> Thanks!
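
For anyone wanting to repeat the measurement above, here is a minimal
sketch of the timing procedure, assuming root privileges, a machine with
1GB hugepage support, and the sysfs path and page count taken from the
quoted commands (the variable name NR is just local shorthand):

  #!/bin/bash
  # Time allocation and freeing of 700 1GB hugeTLB pages via sysfs.
  # Assumes root and 1GB hugepage support; with hugetlb_vmemmap enabled,
  # these writes also exercise the vmemmap optimize/restore paths that
  # the attached patch targets.
  NR=/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

  time bash -c "echo 700 > $NR"   # allocate 700 1GB pages
  time bash -c "echo 0 > $NR"     # free them again

Comparing the reported sys times before and after applying the patch
reproduces the kind of deltas quoted above.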





end of thread, other threads:[~2024-08-06  3:02 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-09  5:11 [linux-next:master] [mm/hugetlb_vmemmap] 875fa64577: vm-scalability.throughput -34.3% regression kernel test robot
2024-07-10  6:22 ` Yu Zhao
2024-07-14 12:26   ` Oliver Sang
2024-07-15  2:40     ` Muchun Song
2024-07-15  4:08       ` Oliver Sang
2024-07-17  7:52 ` Janosch Frank
2024-07-17  7:59   ` Christian Borntraeger
2024-07-17  8:36   ` Yu Zhao
2024-07-17 15:44     ` Yu Zhao
2024-07-18  9:23       ` Marc Hartmayer
2024-07-19  8:42       ` Oliver Sang
2024-07-19 16:06         ` Yu Zhao
2024-08-03 22:07           ` Yu Zhao
2024-08-06  3:01             ` Oliver Sang

This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).