* [linus:master] [mm]  c0bff412e6:  stress-ng.clone.ops_per_sec -2.9% regression
@ 2024-07-30  5:00 kernel test robot
  2024-07-30  8:11 ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: kernel test robot @ 2024-07-30  5:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, David Hildenbrand,
	Huacai Chen, Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor,
	Ryan Roberts, WANG Xuerui, linux-mm, ying.huang, feng.tang,
	fengwei.yin, oliver.sang



Hello,

kernel test robot noticed a -2.9% regression of stress-ng.clone.ops_per_sec on:


commit: c0bff412e67b781d761e330ff9578aa9ed2be79e ("mm: allow anon exclusive check over hugetlb tail pages")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

testcase: stress-ng
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:

	nr_threads: 100%
	testtime: 60s
	test: clone
	cpufreq_governor: performance




If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202407301049.5051dc19-oliver.sang@intel.com


Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240730/202407301049.5051dc19-oliver.sang@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  gcc-13/performance/x86_64-rhel-8.3/100%/debian-12-x86_64-20240206.cgz/lkp-icl-2sp8/clone/stress-ng/60s

commit: 
  9cb28da546 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
  c0bff412e6 ("mm: allow anon exclusive check over hugetlb tail pages")

9cb28da54643ad46 c0bff412e67b781d761e330ff95 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
     37842            -3.4%      36554        vmstat.system.cs
      0.00 ± 17%     -86.4%       0.00 ±223%  sched_debug.rt_rq:.rt_time.avg
      0.19 ± 17%     -86.4%       0.03 ±223%  sched_debug.rt_rq:.rt_time.max
      0.02 ± 17%     -86.4%       0.00 ±223%  sched_debug.rt_rq:.rt_time.stddev
     24081            -3.7%      23200        proc-vmstat.nr_page_table_pages
    399380            -2.3%     390288        proc-vmstat.nr_slab_reclaimable
   1625589            -2.4%    1585989        proc-vmstat.nr_slab_unreclaimable
 1.019e+08            -3.8%   98035999        proc-vmstat.numa_hit
 1.018e+08            -3.9%   97870705        proc-vmstat.numa_local
 1.092e+08            -3.8%   1.05e+08        proc-vmstat.pgalloc_normal
  1.06e+08            -3.8%  1.019e+08        proc-vmstat.pgfree
   2659199            -2.3%    2597978        proc-vmstat.pgreuse
      2910            +3.4%       3010        stress-ng.clone.microsecs_per_clone
    562874            -2.9%     546587        stress-ng.clone.ops
      9298            -2.9%       9031        stress-ng.clone.ops_per_sec
    686858            -2.8%     667416        stress-ng.time.involuntary_context_switches
   9091031            -3.9%    8734352        stress-ng.time.minor_page_faults
      4200            +2.4%       4299        stress-ng.time.percent_of_cpu_this_job_got
      2543            +2.4%       2603        stress-ng.time.system_time
    342849            -2.8%     333189        stress-ng.time.voluntary_context_switches
      6.67            -6.1%       6.26        perf-stat.i.MPKI
 6.388e+08            -5.4%  6.045e+08        perf-stat.i.cache-misses
 1.558e+09            -4.6%  1.487e+09        perf-stat.i.cache-references
     40791            -3.6%      39330        perf-stat.i.context-switches
    353.55            +5.4%     372.76        perf-stat.i.cycles-between-cache-misses
      7.95 ±  3%      -6.5%       7.43 ±  3%  perf-stat.i.metric.K/sec
    251389 ±  3%      -6.5%     235029 ±  3%  perf-stat.i.minor-faults
    251423 ±  3%      -6.5%     235064 ±  3%  perf-stat.i.page-faults
      6.75            -6.1%       6.33        perf-stat.overall.MPKI
      0.38            -0.0        0.37        perf-stat.overall.branch-miss-rate%
    350.09            +5.8%     370.24        perf-stat.overall.cycles-between-cache-misses
  68503488            -1.2%   67660585        perf-stat.ps.branch-misses
  6.33e+08            -5.4%  5.987e+08        perf-stat.ps.cache-misses
 1.518e+09            -4.6%  1.449e+09        perf-stat.ps.cache-references
     38819            -3.3%      37542        perf-stat.ps.context-switches
      3637            +1.2%       3680        perf-stat.ps.cpu-migrations
    235473 ±  3%      -6.3%     220601 ±  3%  perf-stat.ps.minor-faults
    235504 ±  3%      -6.3%     220632 ±  3%  perf-stat.ps.page-faults
     45.55            -2.5       43.04        perf-profile.calltrace.cycles-pp.zap_pmd_range.unmap_page_range.unmap_vmas.exit_mmap.__mmput
     44.86            -2.5       42.37        perf-profile.calltrace.cycles-pp.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas.exit_mmap
     44.42            -2.1       42.37        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
     44.42            -2.1       42.37        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
     44.41            -2.1       42.36        perf-profile.calltrace.cycles-pp.__x64_sys_exit.do_syscall_64.entry_SYSCALL_64_after_hwframe
     44.41            -2.1       42.36        perf-profile.calltrace.cycles-pp.do_exit.__x64_sys_exit.do_syscall_64.entry_SYSCALL_64_after_hwframe
     39.08            -1.7       37.34        perf-profile.calltrace.cycles-pp.exit_mm.do_exit.__x64_sys_exit.do_syscall_64.entry_SYSCALL_64_after_hwframe
     38.96            -1.7       37.22        perf-profile.calltrace.cycles-pp.exit_mmap.__mmput.exit_mm.do_exit.__x64_sys_exit
     38.97            -1.7       37.24        perf-profile.calltrace.cycles-pp.__mmput.exit_mm.do_exit.__x64_sys_exit.do_syscall_64
     36.16            -1.6       34.57        perf-profile.calltrace.cycles-pp.unmap_vmas.exit_mmap.__mmput.exit_mm.do_exit
     35.99            -1.6       34.40        perf-profile.calltrace.cycles-pp.unmap_page_range.unmap_vmas.exit_mmap.__mmput.exit_mm
     32.17            -1.5       30.62        perf-profile.calltrace.cycles-pp.zap_present_ptes.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas
     12.49            -1.0       11.52        perf-profile.calltrace.cycles-pp._compound_head.zap_present_ptes.zap_pte_range.zap_pmd_range.unmap_page_range
      9.66            -0.9        8.74        perf-profile.calltrace.cycles-pp.unmap_vmas.exit_mmap.__mmput.copy_process.kernel_clone
      9.61            -0.9        8.69        perf-profile.calltrace.cycles-pp.unmap_page_range.unmap_vmas.exit_mmap.__mmput.copy_process
     10.71            -0.9        9.84        perf-profile.calltrace.cycles-pp.__mmput.copy_process.kernel_clone.__do_sys_clone3.do_syscall_64
     10.70            -0.9        9.84        perf-profile.calltrace.cycles-pp.exit_mmap.__mmput.copy_process.kernel_clone.__do_sys_clone3
     10.41            -0.8        9.58        perf-profile.calltrace.cycles-pp.__tlb_batch_free_encoded_pages.tlb_flush_mmu.zap_pte_range.zap_pmd_range.unmap_page_range
     10.42            -0.8        9.59        perf-profile.calltrace.cycles-pp.tlb_flush_mmu.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas
     10.21            -0.8        9.40        perf-profile.calltrace.cycles-pp.free_pages_and_swap_cache.__tlb_batch_free_encoded_pages.tlb_flush_mmu.zap_pte_range.zap_pmd_range
      5.47            -0.4        5.04        perf-profile.calltrace.cycles-pp.folios_put_refs.free_pages_and_swap_cache.__tlb_batch_free_encoded_pages.tlb_flush_mmu.zap_pte_range
      1.11            -0.3        0.79 ± 33%  perf-profile.calltrace.cycles-pp.anon_vma_interval_tree_insert.anon_vma_clone.anon_vma_fork.dup_mmap.dup_mm
     14.18            -0.3       13.87        perf-profile.calltrace.cycles-pp.folio_remove_rmap_ptes.zap_present_ptes.zap_pte_range.zap_pmd_range.unmap_page_range
      5.17            -0.3        4.86        perf-profile.calltrace.cycles-pp.put_files_struct.do_exit.__x64_sys_exit.do_syscall_64.entry_SYSCALL_64_after_hwframe
      4.80            -0.3        4.53        perf-profile.calltrace.cycles-pp.filp_close.put_files_struct.do_exit.__x64_sys_exit.do_syscall_64
      4.40            -0.3        4.14        perf-profile.calltrace.cycles-pp.filp_flush.filp_close.put_files_struct.do_exit.__x64_sys_exit
      2.74            -0.2        2.58        perf-profile.calltrace.cycles-pp.anon_vma_fork.dup_mmap.dup_mm.copy_process.kernel_clone
      2.25            -0.1        2.11        perf-profile.calltrace.cycles-pp.anon_vma_clone.anon_vma_fork.dup_mmap.dup_mm.copy_process
      1.47            -0.1        1.34        perf-profile.calltrace.cycles-pp.put_files_struct.copy_process.kernel_clone.__do_sys_clone3.do_syscall_64
      1.87            -0.1        1.76        perf-profile.calltrace.cycles-pp.dnotify_flush.filp_flush.filp_close.put_files_struct.do_exit
      1.98            -0.1        1.88        perf-profile.calltrace.cycles-pp.free_pgtables.exit_mmap.__mmput.exit_mm.do_exit
      1.28            -0.1        1.18        perf-profile.calltrace.cycles-pp.filp_close.put_files_struct.copy_process.kernel_clone.__do_sys_clone3
      1.19            -0.1        1.09        perf-profile.calltrace.cycles-pp.filp_flush.filp_close.put_files_struct.copy_process.kernel_clone
      1.31 ±  2%      -0.1        1.25        perf-profile.calltrace.cycles-pp.unlink_anon_vmas.free_pgtables.exit_mmap.__mmput.exit_mm
      0.58            -0.0        0.55        perf-profile.calltrace.cycles-pp.vm_normal_page.zap_present_ptes.zap_pte_range.zap_pmd_range.unmap_page_range
     33.54            +0.6       34.10        perf-profile.calltrace.cycles-pp.syscall
     33.45            +0.6       34.01        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
     33.45            +0.6       34.01        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.syscall
     33.35            +0.6       33.90        perf-profile.calltrace.cycles-pp.__do_sys_clone3.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
     33.34            +0.6       33.90        perf-profile.calltrace.cycles-pp.kernel_clone.__do_sys_clone3.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
     33.30            +0.6       33.86        perf-profile.calltrace.cycles-pp.copy_process.kernel_clone.__do_sys_clone3.do_syscall_64.entry_SYSCALL_64_after_hwframe
     20.63            +1.6       22.21        perf-profile.calltrace.cycles-pp.dup_mm.copy_process.kernel_clone.__do_sys_clone3.do_syscall_64
     20.55            +1.6       22.14        perf-profile.calltrace.cycles-pp.dup_mmap.dup_mm.copy_process.kernel_clone.__do_sys_clone3
     19.40            +1.8       21.19        perf-profile.calltrace.cycles-pp.__clone
     19.24            +1.8       21.04        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__clone
     19.24            +1.8       21.04        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__clone
     19.14            +1.8       20.94        perf-profile.calltrace.cycles-pp.__do_sys_clone.do_syscall_64.entry_SYSCALL_64_after_hwframe.__clone
     19.14            +1.8       20.94        perf-profile.calltrace.cycles-pp.kernel_clone.__do_sys_clone.do_syscall_64.entry_SYSCALL_64_after_hwframe.__clone
     19.05            +1.8       20.85        perf-profile.calltrace.cycles-pp.copy_process.kernel_clone.__do_sys_clone.do_syscall_64.entry_SYSCALL_64_after_hwframe
     18.74            +1.8       20.56        perf-profile.calltrace.cycles-pp.dup_mm.copy_process.kernel_clone.__do_sys_clone.do_syscall_64
     18.67            +1.8       20.49        perf-profile.calltrace.cycles-pp.dup_mmap.dup_mm.copy_process.kernel_clone.__do_sys_clone
     12.24            +3.1       15.35        perf-profile.calltrace.cycles-pp._compound_head.copy_present_ptes.copy_pte_range.copy_p4d_range.copy_page_range
     34.37            +3.7       38.02        perf-profile.calltrace.cycles-pp.copy_page_range.dup_mmap.dup_mm.copy_process.kernel_clone
     34.34            +3.7       38.00        perf-profile.calltrace.cycles-pp.copy_p4d_range.copy_page_range.dup_mmap.dup_mm.copy_process
     30.99            +3.7       34.69        perf-profile.calltrace.cycles-pp.copy_present_ptes.copy_pte_range.copy_p4d_range.copy_page_range.dup_mmap
     33.16            +3.7       36.88        perf-profile.calltrace.cycles-pp.copy_pte_range.copy_p4d_range.copy_page_range.dup_mmap.dup_mm
      0.00            +3.9        3.90        perf-profile.calltrace.cycles-pp.folio_try_dup_anon_rmap_ptes.copy_present_ptes.copy_pte_range.copy_p4d_range.copy_page_range
     49.67            -2.6       47.07        perf-profile.children.cycles-pp.exit_mmap
     49.69            -2.6       47.08        perf-profile.children.cycles-pp.__mmput
     45.84            -2.5       43.32        perf-profile.children.cycles-pp.unmap_vmas
     45.56            -2.5       43.05        perf-profile.children.cycles-pp.zap_pmd_range
     45.61            -2.5       43.10        perf-profile.children.cycles-pp.unmap_page_range
     44.98            -2.5       42.48        perf-profile.children.cycles-pp.zap_pte_range
     44.53            -2.1       42.48        perf-profile.children.cycles-pp.__x64_sys_exit
     44.54            -2.1       42.48        perf-profile.children.cycles-pp.do_exit
     39.10            -1.7       37.36        perf-profile.children.cycles-pp.exit_mm
     32.99            -1.6       31.41        perf-profile.children.cycles-pp.zap_present_ptes
     10.53            -0.8        9.71        perf-profile.children.cycles-pp.tlb_flush_mmu
     10.91            -0.7       10.19        perf-profile.children.cycles-pp.__tlb_batch_free_encoded_pages
     10.88            -0.7       10.16        perf-profile.children.cycles-pp.free_pages_and_swap_cache
      6.64            -0.4        6.22        perf-profile.children.cycles-pp.put_files_struct
      5.76            -0.4        5.38        perf-profile.children.cycles-pp.folios_put_refs
      6.11            -0.4        5.73        perf-profile.children.cycles-pp.filp_close
      5.62            -0.4        5.25        perf-profile.children.cycles-pp.filp_flush
     14.28            -0.3       13.97        perf-profile.children.cycles-pp.folio_remove_rmap_ptes
      2.75            -0.2        2.58        perf-profile.children.cycles-pp.anon_vma_fork
      2.38            -0.2        2.22        perf-profile.children.cycles-pp.dnotify_flush
      2.50            -0.1        2.36        perf-profile.children.cycles-pp.free_pgtables
      2.25            -0.1        2.11        perf-profile.children.cycles-pp.anon_vma_clone
      0.20 ± 33%      -0.1        0.08 ± 58%  perf-profile.children.cycles-pp.ordered_events__queue
      0.20 ± 33%      -0.1        0.08 ± 58%  perf-profile.children.cycles-pp.queue_event
      1.24 ±  4%      -0.1        1.14        perf-profile.children.cycles-pp.down_write
      1.67 ±  2%      -0.1        1.58        perf-profile.children.cycles-pp.unlink_anon_vmas
      1.59            -0.1        1.50        perf-profile.children.cycles-pp.__alloc_pages_noprof
      1.55            -0.1        1.46        perf-profile.children.cycles-pp.alloc_pages_mpol_noprof
      1.58            -0.1        1.50        perf-profile.children.cycles-pp.vm_normal_page
      1.11            -0.1        1.04        perf-profile.children.cycles-pp.anon_vma_interval_tree_insert
      1.33            -0.1        1.26 ±  2%  perf-profile.children.cycles-pp.pte_alloc_one
      0.47 ± 11%      -0.1        0.40 ±  4%  perf-profile.children.cycles-pp.rwsem_down_write_slowpath
      0.45 ± 11%      -0.1        0.38 ±  4%  perf-profile.children.cycles-pp.rwsem_optimistic_spin
      1.00            -0.1        0.94 ±  2%  perf-profile.children.cycles-pp.get_page_from_freelist
      1.36            -0.1        1.31        perf-profile.children.cycles-pp.kmem_cache_free
      1.08            -0.0        1.04        perf-profile.children.cycles-pp.kmem_cache_alloc_noprof
      0.62            -0.0        0.58 ±  2%  perf-profile.children.cycles-pp.dup_fd
      0.63            -0.0        0.59 ±  3%  perf-profile.children.cycles-pp.__pte_alloc
      0.73            -0.0        0.69        perf-profile.children.cycles-pp.__tlb_remove_folio_pages_size
      0.58            -0.0        0.54        perf-profile.children.cycles-pp.locks_remove_posix
      0.90            -0.0        0.86        perf-profile.children.cycles-pp.copy_huge_pmd
      0.54            -0.0        0.51        perf-profile.children.cycles-pp.__memcg_kmem_charge_page
      0.76            -0.0        0.72        perf-profile.children.cycles-pp.vm_area_dup
      0.31 ±  2%      -0.0        0.28 ±  3%  perf-profile.children.cycles-pp.rwsem_spin_on_owner
      0.50            -0.0        0.47        perf-profile.children.cycles-pp.__anon_vma_interval_tree_remove
      0.53            -0.0        0.50        perf-profile.children.cycles-pp.clear_page_erms
      0.49            -0.0        0.46        perf-profile.children.cycles-pp.free_swap_cache
      0.72            -0.0        0.69        perf-profile.children.cycles-pp.__memcg_slab_post_alloc_hook
      0.37 ±  2%      -0.0        0.34 ±  2%  perf-profile.children.cycles-pp.unlink_file_vma
      0.62            -0.0        0.60        perf-profile.children.cycles-pp.__memcg_slab_free_hook
      0.42            -0.0        0.40 ±  2%  perf-profile.children.cycles-pp.rmqueue
      0.37            -0.0        0.35 ±  2%  perf-profile.children.cycles-pp.__rmqueue_pcplist
      0.28            -0.0        0.25        perf-profile.children.cycles-pp.__rb_insert_augmented
      0.35            -0.0        0.33 ±  2%  perf-profile.children.cycles-pp.rmqueue_bulk
      0.56            -0.0        0.54        perf-profile.children.cycles-pp.fput
      0.48            -0.0        0.46        perf-profile.children.cycles-pp._raw_spin_lock
      0.51            -0.0        0.50        perf-profile.children.cycles-pp.free_unref_page
      0.45            -0.0        0.43        perf-profile.children.cycles-pp.__x64_sys_unshare
      0.44            -0.0        0.42        perf-profile.children.cycles-pp.free_unref_page_commit
      0.45            -0.0        0.43        perf-profile.children.cycles-pp.ksys_unshare
      0.31            -0.0        0.30        perf-profile.children.cycles-pp.memcg_account_kmem
      0.27            -0.0        0.26        perf-profile.children.cycles-pp.__mod_memcg_state
      0.44            -0.0        0.43        perf-profile.children.cycles-pp.__slab_free
      0.28            -0.0        0.26        perf-profile.children.cycles-pp.__vm_area_free
      0.22 ±  2%      -0.0        0.21        perf-profile.children.cycles-pp.___slab_alloc
      0.21            -0.0        0.20 ±  2%  perf-profile.children.cycles-pp.__tlb_remove_folio_pages
      0.13            -0.0        0.12        perf-profile.children.cycles-pp.__rb_erase_color
      0.07            -0.0        0.06        perf-profile.children.cycles-pp.find_idlest_cpu
      0.09            -0.0        0.08        perf-profile.children.cycles-pp.wake_up_new_task
      0.06            -0.0        0.05        perf-profile.children.cycles-pp.kfree
      0.06            -0.0        0.05        perf-profile.children.cycles-pp.update_sg_wakeup_stats
      0.11            -0.0        0.10        perf-profile.children.cycles-pp.allocate_slab
      0.44 ±  2%      +0.1        0.53 ±  2%  perf-profile.children.cycles-pp.tlb_finish_mmu
     98.24            +0.2       98.46        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
     98.24            +0.2       98.46        perf-profile.children.cycles-pp.do_syscall_64
     33.55            +0.6       34.10        perf-profile.children.cycles-pp.syscall
     33.35            +0.6       33.90        perf-profile.children.cycles-pp.__do_sys_clone3
     19.41            +1.8       21.20        perf-profile.children.cycles-pp.__clone
     19.14            +1.8       20.94        perf-profile.children.cycles-pp.__do_sys_clone
     24.94            +2.1       27.07        perf-profile.children.cycles-pp._compound_head
     52.48            +2.4       54.84        perf-profile.children.cycles-pp.kernel_clone
     52.36            +2.4       54.72        perf-profile.children.cycles-pp.copy_process
     39.38            +3.4       42.77        perf-profile.children.cycles-pp.dup_mm
     39.24            +3.4       42.64        perf-profile.children.cycles-pp.dup_mmap
     34.34            +3.7       38.00        perf-profile.children.cycles-pp.copy_p4d_range
     34.37            +3.7       38.03        perf-profile.children.cycles-pp.copy_page_range
     33.28            +3.7       36.98        perf-profile.children.cycles-pp.copy_pte_range
     31.41            +3.8       35.18        perf-profile.children.cycles-pp.copy_present_ptes
      0.00            +4.0        4.01        perf-profile.children.cycles-pp.folio_try_dup_anon_rmap_ptes
     18.44            -3.2       15.24        perf-profile.self.cycles-pp.copy_present_ptes
      5.66            -0.4        5.28        perf-profile.self.cycles-pp.folios_put_refs
      4.78            -0.3        4.46        perf-profile.self.cycles-pp.free_pages_and_swap_cache
     14.11            -0.3       13.80        perf-profile.self.cycles-pp.folio_remove_rmap_ptes
      4.82            -0.2        4.59        perf-profile.self.cycles-pp.zap_present_ptes
      2.66            -0.2        2.49        perf-profile.self.cycles-pp.filp_flush
      2.36            -0.2        2.20        perf-profile.self.cycles-pp.dnotify_flush
      0.20 ± 32%      -0.1        0.08 ± 58%  perf-profile.self.cycles-pp.queue_event
      1.44            -0.1        1.36        perf-profile.self.cycles-pp.zap_pte_range
      1.11            -0.1        1.03        perf-profile.self.cycles-pp.anon_vma_interval_tree_insert
      1.26            -0.1        1.20        perf-profile.self.cycles-pp.vm_normal_page
      0.56            -0.0        0.52 ±  2%  perf-profile.self.cycles-pp.dup_fd
      0.56            -0.0        0.53        perf-profile.self.cycles-pp.locks_remove_posix
      0.31            -0.0        0.28        perf-profile.self.cycles-pp.put_files_struct
      0.58            -0.0        0.55        perf-profile.self.cycles-pp.__tlb_remove_folio_pages_size
      0.49            -0.0        0.46 ±  2%  perf-profile.self.cycles-pp.__anon_vma_interval_tree_remove
      0.30 ±  3%      -0.0        0.28 ±  3%  perf-profile.self.cycles-pp.rwsem_spin_on_owner
      0.52            -0.0        0.49 ±  2%  perf-profile.self.cycles-pp.clear_page_erms
      0.31            -0.0        0.29        perf-profile.self.cycles-pp.free_swap_cache
      0.33            -0.0        0.31        perf-profile.self.cycles-pp.__memcg_slab_free_hook
      0.45            -0.0        0.43        perf-profile.self.cycles-pp._raw_spin_lock
      0.55            -0.0        0.53        perf-profile.self.cycles-pp.fput
      0.38            -0.0        0.36        perf-profile.self.cycles-pp.__memcg_slab_post_alloc_hook
      0.47            -0.0        0.45        perf-profile.self.cycles-pp.up_write
      0.26            -0.0        0.24        perf-profile.self.cycles-pp.__rb_insert_augmented
      0.33            -0.0        0.32        perf-profile.self.cycles-pp.mod_objcg_state
      0.31            -0.0        0.30        perf-profile.self.cycles-pp.__free_one_page
      0.09            -0.0        0.08        perf-profile.self.cycles-pp.___slab_alloc
     24.40            +2.1       26.55        perf-profile.self.cycles-pp._compound_head
      0.00            +3.9        3.89        perf-profile.self.cycles-pp.folio_try_dup_anon_rmap_ptes




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki




* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-07-30  5:00 [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression kernel test robot
@ 2024-07-30  8:11 ` David Hildenbrand
  2024-08-01  6:39   ` Yin, Fengwei
  0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2024-07-30  8:11 UTC (permalink / raw)
  To: kernel test robot, Peter Xu
  Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Huacai Chen,
	Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor, Ryan Roberts,
	WANG Xuerui, linux-mm, ying.huang, feng.tang, fengwei.yin

On 30.07.24 07:00, kernel test robot wrote:
> 
> 
> Hello,
> 
> kernel test robot noticed a -2.9% regression of stress-ng.clone.ops_per_sec on:

Is that test even using hugetlb? Anyhow, this pretty much sounds like 
noise and can be ignored.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-07-30  8:11 ` David Hildenbrand
@ 2024-08-01  6:39   ` Yin, Fengwei
  2024-08-01  6:49     ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: Yin, Fengwei @ 2024-08-01  6:39 UTC (permalink / raw)
  To: David Hildenbrand, kernel test robot, Peter Xu
  Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Huacai Chen,
	Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor, Ryan Roberts,
	WANG Xuerui, linux-mm, ying.huang, feng.tang

Hi David,

On 7/30/2024 4:11 PM, David Hildenbrand wrote:
> On 30.07.24 07:00, kernel test robot wrote:
>>
>>
>> Hello,
>>
>> kernel test robot noticed a -2.9% regression of 
>> stress-ng.clone.ops_per_sec on:
> 
> Is that test even using hugetlb? Anyhow, this pretty much sounds like 
> noise and can be ignored.
> 
It's not about hugetlb. It looks related to this change:

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 888353c209c03..7577fe7debafc 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -1095,7 +1095,12 @@ PAGEFLAG(Isolated, isolated, PF_ANY);
  static __always_inline int PageAnonExclusive(const struct page *page)
  {
         VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
-       VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+       /*
+        * HugeTLB stores this information on the head page; THP keeps it per
+        * page
+        */
+       if (PageHuge(page))
+               page = compound_head(page);
         return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);


The PageAnonExclusive() function is changed. And the profiling data
showed it:

      0.00            +3.9        3.90        perf-profile.calltrace.cycles-pp.folio_try_dup_anon_rmap_ptes.copy_present_ptes.copy_pte_range.copy_p4d_range.copy_page_range

According to
https://download.01.org/0day-ci/archive/20240730/202407301049.5051dc19-oliver.sang@intel.com/config-6.9.0-rc4-00197-gc0bff412e67b:
	# CONFIG_DEBUG_VM is not set
So maybe such a code change could make a difference?
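
For context, the rmap duplication helper that fork() reaches via
copy_present_ptes() runs this check once per PTE, so the extra PageHuge()
branch is taken for every anonymous page being copied. A hand-simplified
sketch of the loop (the real one lives in include/linux/rmap.h and also
handles pinning and error cases; the function name below is made up for
illustration):

static inline void dup_anon_rmap_ptes_sketch(struct page *page, int nr_pages)
{
	do {
		/* PageAnonExclusive() now also performs the PageHuge() test */
		if (PageAnonExclusive(page))
			ClearPageAnonExclusive(page);
	} while (page++, --nr_pages > 0);
}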

And yes, a 2.9% regression can be within the noise range. Thanks.


Regards
Yin, Fengwei



* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-01  6:39   ` Yin, Fengwei
@ 2024-08-01  6:49     ` David Hildenbrand
  2024-08-01  7:44       ` Yin, Fengwei
  2024-08-01 13:30       ` Mateusz Guzik
  0 siblings, 2 replies; 22+ messages in thread
From: David Hildenbrand @ 2024-08-01  6:49 UTC (permalink / raw)
  To: Yin, Fengwei, kernel test robot, Peter Xu
  Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Huacai Chen,
	Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor, Ryan Roberts,
	WANG Xuerui, linux-mm, ying.huang, feng.tang

On 01.08.24 08:39, Yin, Fengwei wrote:
> Hi David,
> 
> On 7/30/2024 4:11 PM, David Hildenbrand wrote:
>> On 30.07.24 07:00, kernel test robot wrote:
>>>
>>>
>>> Hello,
>>>
>>> kernel test robot noticed a -2.9% regression of
>>> stress-ng.clone.ops_per_sec on:
>>
>> Is that test even using hugetlb? Anyhow, this pretty much sounds like
>> noise and can be ignored.
>>
> It's not about hugetlb. It looks related to this change:

Ah, that makes sense!

> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 888353c209c03..7577fe7debafc 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -1095,7 +1095,12 @@ PAGEFLAG(Isolated, isolated, PF_ANY);
>    static __always_inline int PageAnonExclusive(const struct page *page)
>    {
>           VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
> -       VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
> +       /*
> +        * HugeTLB stores this information on the head page; THP keeps it per
> +        * page
> +        */
> +       if (PageHuge(page))
> +               page = compound_head(page);
>           return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
> 
> 
> The PageAnonExclusive() function is changed. And the profiling data
> showed it:
> 
>         0.00            +3.9        3.90        perf-profile.calltrace.cycles-pp.folio_try_dup_anon_rmap_ptes.copy_present_ptes.copy_pte_range.copy_p4d_range.copy_page_range
> 
> According to
> https://download.01.org/0day-ci/archive/20240730/202407301049.5051dc19-oliver.sang@intel.com/config-6.9.0-rc4-00197-gc0bff412e67b:
> 	# CONFIG_DEBUG_VM is not set
> So maybe such a code change could make a difference?

Yes indeed. fork() can be extremely sensitive to each added instruction.

I even pointed out to Peter why I didn't add the PageHuge check in there 
originally [1].

"Well, and I didn't want to have runtime-hugetlb checks in
PageAnonExclusive code called on certainly-not-hugetlb code paths."


We now have to do a page_folio(page) and then test for hugetlb.

	return folio_test_hugetlb(page_folio(page));

Nowadays, folio_test_hugetlb() will be faster than it was at the time of
c0bff412e6, so maybe at least part of the overhead is gone.


[1] 
https://lore.kernel.org/r/all/8b0b24bb-3c38-4f27-a2c9-f7d7adc4a115@redhat.com/


-- 
Cheers,

David / dhildenb




* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-01  6:49     ` David Hildenbrand
@ 2024-08-01  7:44       ` Yin, Fengwei
  2024-08-01  7:54         ` David Hildenbrand
  2024-08-01 13:30       ` Mateusz Guzik
  1 sibling, 1 reply; 22+ messages in thread
From: Yin, Fengwei @ 2024-08-01  7:44 UTC (permalink / raw)
  To: David Hildenbrand, kernel test robot, Peter Xu
  Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Huacai Chen,
	Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor, Ryan Roberts,
	WANG Xuerui, linux-mm, ying.huang, feng.tang

Hi David,

On 8/1/2024 2:49 PM, David Hildenbrand wrote:
> We now have to do a page_folio(page) and then test for hugetlb.
> 
>      return folio_test_hugetlb(page_folio(page));
> 
> Nowadays, folio_test_hugetlb() will be faster than it was at the time
> of c0bff412e6, so maybe at least part of the overhead is gone.
This is great. We will check the trend to see whether it has recovered
to some degree.


Regards
Yin, Fengwei



* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-01  7:44       ` Yin, Fengwei
@ 2024-08-01  7:54         ` David Hildenbrand
  0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2024-08-01  7:54 UTC (permalink / raw)
  To: Yin, Fengwei, kernel test robot, Peter Xu
  Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Huacai Chen,
	Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor, Ryan Roberts,
	WANG Xuerui, linux-mm, ying.huang, feng.tang

On 01.08.24 09:44, Yin, Fengwei wrote:
> Hi David,
> 
> On 8/1/2024 2:49 PM, David Hildenbrand wrote:
>> We now have to do a page_folio(page) and then test for hugetlb.
>>
>>       return folio_test_hugetlb(page_folio(page));
>>
>> Nowadays, folio_test_hugetlb() will be faster than it was at the time
>> of c0bff412e6, so maybe at least part of the overhead is gone.
> This is great. We will check the trend to see whether it has recovered
> to some degree.

Oh, I think d99e3140a4d33e26066183ff727d8f02f56bec64 went upstream 
before c0bff412e67b781d761e330ff9578aa9ed2be79e, so at the time of 
c0bff412e6 we already should have had the faster check!

-- 
Cheers,

David / dhildenb




* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-01  6:49     ` David Hildenbrand
  2024-08-01  7:44       ` Yin, Fengwei
@ 2024-08-01 13:30       ` Mateusz Guzik
  2024-08-01 13:34         ` David Hildenbrand
  1 sibling, 1 reply; 22+ messages in thread
From: Mateusz Guzik @ 2024-08-01 13:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Yin, Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
	linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
	Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
	linux-mm, ying.huang, feng.tang

On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
> Yes indeed. fork() can be extremely sensitive to each added instruction.
> 
> I even pointed out to Peter why I didn't add the PageHuge check in there
> originally [1].
> 
> "Well, and I didn't want to have runtime-hugetlb checks in
> PageAnonExclusive code called on certainly-not-hugetlb code paths."
> 
> 
> We now have to do a page_folio(page) and then test for hugetlb.
> 
> 	return folio_test_hugetlb(page_folio(page));
> 
> Nowadays, folio_test_hugetlb() will be faster than it was at the time of
> c0bff412e6, so maybe at least part of the overhead is gone.
> 

I'll note page_folio expands to a call to _compound_head.

While _compound_head is declared as an inline, it ends up being big
enough that the compiler decides to emit a real function instead and
real func calls are not particularly cheap.

I had a brief look with a profiler myself and for single-threaded usage
the func is quite high up there, while it manages to get out with the
first branch -- that is to say there is definitely performance lost for
having a func call instead of an inlined branch.

The routine is deinlined because of a call to page_fixed_fake_head,
which itself is annotated with always_inline.

This is of course patchable with minor shoveling.

I did not go for it because stress-ng results were too unstable for me
to confidently state win/loss.

But should you want to whack the regression, this is what I would look
into.
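
To illustrate the cost in isolation, here is a hypothetical userspace
microbenchmark (not from this thread; every name in it is made up). It
contrasts an out-of-line helper that returns through its first branch
with the same logic force-inlined, mimicking the deinlined
_compound_head() fast path described above; build with gcc -O2:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS (1u << 28)

/* Mimics the deinlined _compound_head(): the fast path is one branch. */
__attribute__((noinline))
static uint64_t head_outlined(const uint64_t *word)
{
	if (!(*word & 1))	/* "not a tail page": always taken here */
		return (uint64_t)(uintptr_t)word;
	return *word - 1;
}

static inline uint64_t head_inlined(const uint64_t *word)
{
	if (!(*word & 1))
		return (uint64_t)(uintptr_t)word;
	return *word - 1;
}

static double now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	uint64_t word = 0;	/* even value: take the fast path every time */
	uint64_t sink = 0;
	uint32_t i;
	double t;

	t = now();
	for (i = 0; i < ITERS; i++) {
		sink += head_outlined(&word);
		asm volatile("" : "+m"(word));	/* keep the load in the loop */
	}
	printf("outlined: %.3fs\n", now() - t);

	t = now();
	for (i = 0; i < ITERS; i++) {
		sink += head_inlined(&word);
		asm volatile("" : "+m"(word));
	}
	printf("inlined:  %.3fs\n", now() - t);

	asm volatile("" :: "r"(sink));	/* don't let the sums be elided */
	return 0;
}

The difference between the two loops approximates the per-call overhead
that the profile above attributes to _compound_head() on the fork path.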



* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-01 13:30       ` Mateusz Guzik
@ 2024-08-01 13:34         ` David Hildenbrand
  2024-08-01 13:37           ` Mateusz Guzik
  0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2024-08-01 13:34 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Yin, Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
	linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
	Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
	linux-mm, ying.huang, feng.tang

On 01.08.24 15:30, Mateusz Guzik wrote:
> On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
>> Yes indeed. fork() can be extremely sensitive to each added instruction.
>>
>> I even pointed out to Peter why I didn't add the PageHuge check in there
>> originally [1].
>>
>> "Well, and I didn't want to have runtime-hugetlb checks in
>> PageAnonExclusive code called on certainly-not-hugetlb code paths."
>>
>>
>> We now have to do a page_folio(page) and then test for hugetlb.
>>
>> 	return folio_test_hugetlb(page_folio(page));
>>
>> Nowadays, folio_test_hugetlb() will be faster than it was at the time of
>> c0bff412e6, so maybe at least part of the overhead is gone.
>>
> 
> I'll note page_folio expands to a call to _compound_head.
> 
> While _compound_head is declared as an inline, it ends up being big
> enough that the compiler decides to emit a real function instead and
> real func calls are not particularly cheap.
> 
> I had a brief look with a profiler myself and for single-threaded usage
> the func is quite high up there, while it manages to get out with the
> first branch -- that is to say there is definitely performance lost for
> having a func call instead of an inlined branch.
> 
> The routine is deinlined because of a call to page_fixed_fake_head,
> which itself is annotated with always_inline.
> 
> This is of course patchable with minor shoveling.
> 
> I did not go for it because stress-ng results were too unstable for me
> to confidently state win/loss.
> 
> But should you want to whack the regression, this is what I would look
> into.
> 

This might improve it, at least for small folios I guess:

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5769fe6e4950..7796ae116018 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -1086,7 +1086,7 @@ PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
   */
  static inline bool PageHuge(const struct page *page)
  {
-       return folio_test_hugetlb(page_folio(page));
+       return PageCompound(page) && folio_test_hugetlb(page_folio(page));
  }
  
  /*


We would avoid the function call for small folios.
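
For reference, PageCompound() reduces to a pair of inline bit tests,
roughly the following (simplified from include/linux/page-flags.h; a
sketch, not the verbatim definition), so the small-folio case would be
decided without any out-of-line call:

static __always_inline int PageCompound(const struct page *page)
{
	/* head page, or tail page (bit 0 of compound_head set)? */
	return test_bit(PG_head, &page->flags) ||
	       READ_ONCE(page->compound_head) & 1;
}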

-- 
Cheers,

David / dhildenb




* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-01 13:34         ` David Hildenbrand
@ 2024-08-01 13:37           ` Mateusz Guzik
  2024-08-01 13:44             ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: Mateusz Guzik @ 2024-08-01 13:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Yin, Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
	linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
	Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
	linux-mm, ying.huang, feng.tang

On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 01.08.24 15:30, Mateusz Guzik wrote:
> > On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
> >> Yes indeed. fork() can be extremely sensitive to each added instruction.
> >>
> >> I even pointed out to Peter why I didn't add the PageHuge check in there
> >> originally [1].
> >>
> >> "Well, and I didn't want to have runtime-hugetlb checks in
> >> PageAnonExclusive code called on certainly-not-hugetlb code paths."
> >>
> >>
> >> We now have to do a page_folio(page) and then test for hugetlb.
> >>
> >>      return folio_test_hugetlb(page_folio(page));
> >>
> >> Nowadays, folio_test_hugetlb() will be faster than it was at the time
> >> of c0bff412e6, so maybe at least part of the overhead is gone.
> >>
> >
> > I'll note page_folio expands to a call to _compound_head.
> >
> > While _compound_head is declared as an inline, it ends up being big
> > enough that the compiler decides to emit a real function instead and
> > real func calls are not particularly cheap.
> >
> > I had a brief look with a profiler myself and for single-threaded usage
> > the func is quite high up there, while it manages to get out with the
> > first branch -- that is to say there is definitely performance lost for
> > having a func call instead of an inlined branch.
> >
> > The routine is deinlined because of a call to page_fixed_fake_head,
> > which itself is annotated with always_inline.
> >
> > This is of course patchable with minor shoveling.
> >
> > I did not go for it because stress-ng results were too unstable for me
> > to confidently state win/loss.
> >
> > But should you want to whack the regression, this is what I would look
> > into.
> >
>
> This might improve it, at least for small folios I guess:
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 5769fe6e4950..7796ae116018 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -1086,7 +1086,7 @@ PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
>    */
>   static inline bool PageHuge(const struct page *page)
>   {
> -       return folio_test_hugetlb(page_folio(page));
> +       return PageCompound(page) && folio_test_hugetlb(page_folio(page));
>   }
>
>   /*
>
>
> We would avoid the function call for small folios.
>

why not massage _compound_head back to an inlineable form instead? for
all i know you may even register a small win in total

-- 
Mateusz Guzik <mjguzik gmail.com>



* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-01 13:37           ` Mateusz Guzik
@ 2024-08-01 13:44             ` David Hildenbrand
  2024-08-12  4:43               ` Yin Fengwei
  0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2024-08-01 13:44 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Yin, Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
	linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
	Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
	linux-mm, ying.huang, feng.tang

On 01.08.24 15:37, Mateusz Guzik wrote:
> On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 01.08.24 15:30, Mateusz Guzik wrote:
>>> On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
>>>> Yes indeed. fork() can be extremely sensitive to each added instruction.
>>>>
>>>> I even pointed out to Peter why I didn't add the PageHuge check in there
>>>> originally [1].
>>>>
>>>> "Well, and I didn't want to have runtime-hugetlb checks in
>>>> PageAnonExclusive code called on certainly-not-hugetlb code paths."
>>>>
>>>>
>>>> We now have to do a page_folio(page) and then test for hugetlb.
>>>>
>>>>       return folio_test_hugetlb(page_folio(page));
>>>>
>>>> Nowadays, folio_test_hugetlb() will be faster than it was at the time
>>>> of c0bff412e6, so maybe at least part of the overhead is gone.
>>>>
>>>
>>> I'll note page_folio expands to a call to _compound_head.
>>>
>>> While _compound_head is declared as an inline, it ends up being big
>>> enough that the compiler decides to emit a real function instead and
>>> real func calls are not particularly cheap.
>>>
>>> I had a brief look with a profiler myself and for single-threaded usage
>>> the func is quite high up there, while it manages to get out with the
>>> first branch -- that is to say there is definitely performance lost for
>>> having a func call instead of an inlined branch.
>>>
>>> The routine is deinlined because of a call to page_fixed_fake_head,
>>> which itself is annotated with always_inline.
>>>
>>> This is of course patchable with minor shoveling.
>>>
>>> I did not go for it because stress-ng results were too unstable for me
>>> to confidently state win/loss.
>>>
>>> But should you want to whack the regression, this is what I would look
>>> into.
>>>
>>
>> This might improve it, at least for small folios I guess:
>>
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index 5769fe6e4950..7796ae116018 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -1086,7 +1086,7 @@ PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
>>     */
>>    static inline bool PageHuge(const struct page *page)
>>    {
>> -       return folio_test_hugetlb(page_folio(page));
>> +       return PageCompound(page) && folio_test_hugetlb(page_folio(page));
>>    }
>>
>>    /*
>>
>>
>> We would avoid the function call for small folios.
>>
> 
> why not massage _compound_head back to an inlineable form instead? for
> all i know you may even register a small win in total

Agreed, likely it will increase code size a bit which is why the 
compiler decides to not inline. We could force it with __always_inline.

Finding ways to shrink page_fixed_fake_head() might be even better.

-- 
Cheers,

David / dhildenb




* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-01 13:44             ` David Hildenbrand
@ 2024-08-12  4:43               ` Yin Fengwei
  2024-08-12  4:49                 ` Mateusz Guzik
  0 siblings, 1 reply; 22+ messages in thread
From: Yin Fengwei @ 2024-08-12  4:43 UTC (permalink / raw)
  To: David Hildenbrand, Mateusz Guzik
  Cc: kernel test robot, Peter Xu, oe-lkp, lkp, linux-kernel,
	Andrew Morton, Huacai Chen, Jason Gunthorpe, Matthew Wilcox,
	Nathan Chancellor, Ryan Roberts, WANG Xuerui, linux-mm,
	ying.huang, feng.tang

Hi David,

On 8/1/24 09:44, David Hildenbrand wrote:
> On 01.08.24 15:37, Mateusz Guzik wrote:
>> On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@redhat.com> 
>> wrote:
>>>
>>> On 01.08.24 15:30, Mateusz Guzik wrote:
>>>> On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
>>>>> Yes indeed. fork() can be extremely sensitive to each added 
>>>>> instruction.
>>>>>
>>>>> I even pointed out to Peter why I didn't add the PageHuge check in 
>>>>> there
>>>>> originally [1].
>>>>>
>>>>> "Well, and I didn't want to have runtime-hugetlb checks in
>>>>> PageAnonExclusive code called on certainly-not-hugetlb code paths."
>>>>>
>>>>>
>>>>> We now have to do a page_folio(page) and then test for hugetlb.
>>>>>
>>>>>       return folio_test_hugetlb(page_folio(page));
>>>>>
>>>>> Nowadays, folio_test_hugetlb() will be faster than it was at the
>>>>> time of c0bff412e6, so maybe at least part of the overhead is gone.
>>>>>
>>>>
>>>> I'll note page_folio expands to a call to _compound_head.
>>>>
>>>> While _compound_head is declared as an inline, it ends up being big
>>>> enough that the compiler decides to emit a real function instead and
>>>> real func calls are not particularly cheap.
>>>>
>>>> I had a brief look with a profiler myself and for single-threaded usage
>>>> the func is quite high up there, while it manages to get out with the
>>>> first branch -- that is to say there is definitely performance lost for
>>>> having a func call instead of an inlined branch.
>>>>
>>>> The routine is deinlined because of a call to page_fixed_fake_head,
>>>> which itself is annotated with always_inline.
>>>>
>>>> This is of course patchable with minor shoveling.
>>>>
>>>> I did not go for it because stress-ng results were too unstable for me
>>>> to confidently state win/loss.
>>>>
>>>> But should you want to whack the regression, this is what I would look
>>>> into.
>>>>
>>>
>>> This might improve it, at least for small folios I guess:
Do you want us to test this change? Or do you have further optimization
ongoing? Thanks.

Regards
Yin, Fengwei

>>>
>>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>>> index 5769fe6e4950..7796ae116018 100644
>>> --- a/include/linux/page-flags.h
>>> +++ b/include/linux/page-flags.h
>>> @@ -1086,7 +1086,7 @@ PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
>>>     */
>>>    static inline bool PageHuge(const struct page *page)
>>>    {
>>> -       return folio_test_hugetlb(page_folio(page));
>>> +       return PageCompound(page) && 
>>> folio_test_hugetlb(page_folio(page));
>>>    }
>>>
>>>    /*
>>>
>>>
>>> We would avoid the function call for small folios.
>>>
>>
>> why not massage _compound_head back to an inlineable form instead? for
>> all i know you may even register a small win in total
> 
> Agreed, likely it will increase code size a bit which is why the 
> compiler decides to not inline. We could force it with __always_inline.
> 
> Finding ways to shrink page_fixed_fake_head() might be even better.
> 




* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-12  4:43               ` Yin Fengwei
@ 2024-08-12  4:49                 ` Mateusz Guzik
  2024-08-12  8:12                   ` David Hildenbrand
  2024-08-13  7:09                   ` Yin Fengwei
  0 siblings, 2 replies; 22+ messages in thread
From: Mateusz Guzik @ 2024-08-12  4:49 UTC (permalink / raw)
  To: Yin Fengwei
  Cc: David Hildenbrand, kernel test robot, Peter Xu, oe-lkp, lkp,
	linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
	Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
	linux-mm, ying.huang, feng.tang

On Mon, Aug 12, 2024 at 12:43:08PM +0800, Yin Fengwei wrote:
> Hi David,
> 
> On 8/1/24 09:44, David Hildenbrand wrote:
> > On 01.08.24 15:37, Mateusz Guzik wrote:
> > > On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@redhat.com>
> > > wrote:
> > > > 
> > > > On 01.08.24 15:30, Mateusz Guzik wrote:
> > > > > On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
> > > > > > Yes indeed. fork() can be extremely sensitive to each
> > > > > > added instruction.
> > > > > > 
> > > > > > I even pointed out to Peter why I didn't add the
> > > > > > PageHuge check in there
> > > > > > originally [1].
> > > > > > 
> > > > > > "Well, and I didn't want to have runtime-hugetlb checks in
> > > > > > PageAnonExclusive code called on certainly-not-hugetlb code paths."
> > > > > > 
> > > > > > 
> > > > > > We now have to do a page_folio(page) and then test for hugetlb.
> > > > > > 
> > > > > >       return folio_test_hugetlb(page_folio(page));
> > > > > > 
> > > > > > Nowadays, folio_test_hugetlb() will be faster than it was at
> > > > > > the time of c0bff412e6, so maybe at least part of the overhead
> > > > > > is gone.
> > > > > > 
> > > > > 
> > > > > I'll note page_folio expands to a call to _compound_head.
> > > > > 
> > > > > While _compound_head is declared as an inline, it ends up being big
> > > > > enough that the compiler decides to emit a real function instead and
> > > > > real func calls are not particularly cheap.
> > > > > 
> > > > > I had a brief look with a profiler myself and for single-threaded usage
> > > > > the func is quite high up there, while it manages to get out with the
> > > > > first branch -- that is to say there is definitely performance lost for
> > > > > having a func call instead of an inlined branch.
> > > > > 
> > > > > The routine is deinlined because of a call to page_fixed_fake_head,
> > > > > which itself is annotated with always_inline.
> > > > > 
> > > > > This is of course patchable with minor shoveling.
> > > > > 
> > > > > I did not go for it because stress-ng results were too unstable for me
> > > > > to confidently state win/loss.
> > > > > 
> > > > > But should you want to whack the regression, this is what I would look
> > > > > into.
> > > > > 
> > > > 
> > > > This might improve it, at least for small folios I guess:
> Do you want us to test this change? Or do you have further optimization
> ongoing? Thanks.

I verified that the thing below boots, but I have no idea about performance.
If it helps, it can be massaged later from a style perspective.

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5769fe6e4950..2d5d61ab385b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -194,34 +194,13 @@ enum pageflags {
 #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 DECLARE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
 
-/*
- * Return the real head page struct iff the @page is a fake head page, otherwise
- * return the @page itself. See Documentation/mm/vmemmap_dedup.rst.
- */
+const struct page *_page_fixed_fake_head(const struct page *page);
+
 static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
 {
 	if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
 		return page;
-
-	/*
-	 * Only addresses aligned with PAGE_SIZE of struct page may be fake head
-	 * struct page. The alignment check aims to avoid access the fields (
-	 * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly)
-	 * cold cacheline in some cases.
-	 */
-	if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
-	    test_bit(PG_head, &page->flags)) {
-		/*
-		 * We can safely access the field of the @page[1] with PG_head
-		 * because the @page is a compound page composed with at least
-		 * two contiguous pages.
-		 */
-		unsigned long head = READ_ONCE(page[1].compound_head);
-
-		if (likely(head & 1))
-			return (const struct page *)(head - 1);
-	}
-	return page;
+	return _page_fixed_fake_head(page);
 }
 #else
 static inline const struct page *page_fixed_fake_head(const struct page *page)
@@ -235,7 +214,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
 	return page_fixed_fake_head(page) != page;
 }
 
-static inline unsigned long _compound_head(const struct page *page)
+static __always_inline unsigned long _compound_head(const struct page *page)
 {
 	unsigned long head = READ_ONCE(page->compound_head);
 
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 829112b0a914..3fbc00db607a 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -19,6 +19,33 @@
 #include <asm/tlbflush.h>
 #include "hugetlb_vmemmap.h"
 
+/*
+ * Return the real head page struct iff the @page is a fake head page, otherwise
+ * return the @page itself. See Documentation/mm/vmemmap_dedup.rst.
+ */
+const struct page *_page_fixed_fake_head(const struct page *page)
+{
+	/*
+	 * Only addresses aligned with PAGE_SIZE of struct page may be fake head
+	 * struct page. The alignment check aims to avoid access the fields (
+	 * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly)
+	 * cold cacheline in some cases.
+	 */
+	if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
+	    test_bit(PG_head, &page->flags)) {
+		/*
+		 * We can safely access the field of the @page[1] with PG_head
+		 * because the @page is a compound page composed with at least
+		 * two contiguous pages.
+		 */
+		unsigned long head = READ_ONCE(page[1].compound_head);
+
+		if (likely(head & 1))
+			return (const struct page *)(head - 1);
+	}
+	return page;
+}
+
 /**
  * struct vmemmap_remap_walk - walk vmemmap page table
  *



* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-12  4:49                 ` Mateusz Guzik
@ 2024-08-12  8:12                   ` David Hildenbrand
  2024-08-12  8:18                     ` Mateusz Guzik
  2024-08-13  7:09                   ` Yin Fengwei
  1 sibling, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2024-08-12  8:12 UTC (permalink / raw)
  To: Mateusz Guzik, Yin Fengwei
  Cc: kernel test robot, Peter Xu, oe-lkp, lkp, linux-kernel,
	Andrew Morton, Huacai Chen, Jason Gunthorpe, Matthew Wilcox,
	Nathan Chancellor, Ryan Roberts, WANG Xuerui, linux-mm,
	ying.huang, feng.tang

On 12.08.24 06:49, Mateusz Guzik wrote:
> On Mon, Aug 12, 2024 at 12:43:08PM +0800, Yin Fengwei wrote:
>> Hi David,
>>
>> On 8/1/24 09:44, David Hildenbrand wrote:
>>> On 01.08.24 15:37, Mateusz Guzik wrote:
>>>> On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@redhat.com>
>>>> wrote:
>>>>>
>>>>> On 01.08.24 15:30, Mateusz Guzik wrote:
>>>>>> On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
>>>>>>> Yes indeed. fork() can be extremely sensitive to each
>>>>>>> added instruction.
>>>>>>>
>>>>>>> I even pointed out to Peter why I didn't add the
>>>>>>> PageHuge check in there
>>>>>>> originally [1].
>>>>>>>
>>>>>>> "Well, and I didn't want to have runtime-hugetlb checks in
>>>>>>> PageAnonExclusive code called on certainly-not-hugetlb code paths."
>>>>>>>
>>>>>>>
>>>>>>> We now have to do a page_folio(page) and then test for hugetlb.
>>>>>>>
>>>>>>>        return folio_test_hugetlb(page_folio(page));
>>>>>>>
>>>>>>> Nowadays, folio_test_hugetlb() will be faster than it was at the
>>>>>>> time of c0bff412e6, so maybe at least part of the overhead is gone.
>>>>>>>
>>>>>>
>>>>>> I'll note page_folio expands to a call to _compound_head.
>>>>>>
>>>>>> While _compound_head is declared as an inline, it ends up being big
>>>>>> enough that the compiler decides to emit a real function instead and
>>>>>> real func calls are not particularly cheap.
>>>>>>
>>>>>> I had a brief look with a profiler myself and for single-threaded usage
>>>>>> the func is quite high up there, while it manages to get out with the
>>>>>> first branch -- that is to say there is definitely performance lost for
>>>>>> having a func call instead of an inlined branch.
>>>>>>
>>>>>> The routine is deinlined because of a call to page_fixed_fake_head,
>>>>>> which itself is annotated with always_inline.
>>>>>>
>>>>>> This is of course patchable with minor shoveling.
>>>>>>
>>>>>> I did not go for it because stress-ng results were too unstable for me
>>>>>> to confidently state win/loss.
>>>>>>
>>>>>> But should you want to whack the regression, this is what I would look
>>>>>> into.
>>>>>>
>>>>>
>>>>> This might improve it, at least for small folios I guess:
>> Do you want us to test this change? Or you have further optimization
>> ongoing? Thanks.
> 
> I verified the thing below boots, I have no idea about performance. If
> it helps it can be massaged later from style perspective.

As quite a lot of setups already run with the vmemmap optimization enabled, I
wonder how effective this would be (might need more fine tuning, did not look
at the generated code):


diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 085dd8dcbea2..7ddcdbd712ec 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -233,7 +233,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
         return page_fixed_fake_head(page) != page;
  }
  
-static inline unsigned long _compound_head(const struct page *page)
+static __always_inline unsigned long _compound_head(const struct page *page)
  {
         unsigned long head = READ_ONCE(page->compound_head);
  


-- 
Cheers,

David / dhildenb
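
[Aside: the "quite a lot of setups" concern above is about the static
key guarding page_fixed_fake_head(): hugetlb_optimize_vmemmap_key is a
static branch, patched directly into the instruction stream once the
vmemmap optimization is first enabled, after which every
compound_head() lookup takes the slow side of the branch. A toy
illustration of the static-key pattern itself, with hypothetical names
(not the hugetlb code):

	#include <linux/jump_label.h>
	#include <linux/types.h>

	DEFINE_STATIC_KEY_FALSE(my_feature_key);

	void my_feature_enable(void)
	{
		/* Patches every branch site from a no-op to a jump. */
		static_branch_enable(&my_feature_key);
	}

	bool my_feature_active(void)
	{
		/* Compiles to a fall-through no-op until the key flips. */
		return static_branch_unlikely(&my_feature_key);
	}
]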



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-12  8:12                   ` David Hildenbrand
@ 2024-08-12  8:18                     ` Mateusz Guzik
  2024-08-12  8:23                       ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: Mateusz Guzik @ 2024-08-12  8:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Yin Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
	linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
	Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
	linux-mm, ying.huang, feng.tang

On Mon, Aug 12, 2024 at 10:12 AM David Hildenbrand <david@redhat.com> wrote:
>
> [snip]
>
> As quite a lot of setups already run with the vmemmap optimization enabled, I
> wonder how effective this would be (might need more fine tuning, did not look
> at the generated code):
>
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 085dd8dcbea2..7ddcdbd712ec 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -233,7 +233,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
>          return page_fixed_fake_head(page) != page;
>   }
>
> -static inline unsigned long _compound_head(const struct page *page)
> +static __always_inline unsigned long _compound_head(const struct page *page)
>   {
>          unsigned long head = READ_ONCE(page->compound_head);
>
>

Well one may need to justify it with bloat-o-meter which is why I did
not just straight up inline the entire thing.

But if you are down to fight opposition of the sort I agree this is
the patch to benchmark. :)
-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-12  8:18                     ` Mateusz Guzik
@ 2024-08-12  8:23                       ` David Hildenbrand
  0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2024-08-12  8:23 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Yin Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
	linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
	Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
	linux-mm, ying.huang, feng.tang

On 12.08.24 10:18, Mateusz Guzik wrote:
> On Mon, Aug 12, 2024 at 10:12 AM David Hildenbrand <david@redhat.com> wrote:
>> [snip]
>> -static inline unsigned long _compound_head(const struct page *page)
>> +static __always_inline unsigned long _compound_head(const struct page *page)
> 
> Well one may need to justify it with bloat-o-meter which is why I did
> not just straight up inline the entire thing.
> 
> But if you are down to fight opposition of the sort I agree this is
> the patch to benchmark. :)

I spotted that we already do that for
PageHead()/PageTail()/page_is_fake_head(). So we effectively
force-inline it everywhere except into _compound_head(), I think.

But yeah, measuring the bloat would be a necessary exercise.
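
[Aside: plain "inline" is only a hint that GCC is free to ignore once a
function body grows, which is what deinlined _compound_head() here
after page_fixed_fake_head() was force-inlined into it; __always_inline
turns the hint into a requirement. The kernel's definition, from
include/linux/compiler_types.h, is essentially:

	#define __always_inline inline __attribute__((__always_inline__))
]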

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-12  4:49                 ` Mateusz Guzik
  2024-08-12  8:12                   ` David Hildenbrand
@ 2024-08-13  7:09                   ` Yin Fengwei
  2024-08-13  7:14                     ` Mateusz Guzik
  1 sibling, 1 reply; 22+ messages in thread
From: Yin Fengwei @ 2024-08-13  7:09 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: David Hildenbrand, kernel test robot, Peter Xu, oe-lkp, lkp,
	linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
	Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
	linux-mm, ying.huang, feng.tang

On 8/12/24 00:49, Mateusz Guzik wrote:
> [snip]
> 
> I verified the thing below boots, I have no idea about performance. If
> it helps it can be massaged later from style perspective.
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 5769fe6e4950..2d5d61ab385b 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -194,34 +194,13 @@ enum pageflags {
>   #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>   DECLARE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
>   
> -/*
> - * Return the real head page struct iff the @page is a fake head page, otherwise
> - * return the @page itself. See Documentation/mm/vmemmap_dedup.rst.
> - */
> +const struct page *_page_fixed_fake_head(const struct page *page);
> +
>   static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
>   {
>   	if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
>   		return page;
> -
> -	/*
> -	 * Only addresses aligned with PAGE_SIZE of struct page may be fake head
> -	 * struct page. The alignment check aims to avoid access the fields (
> -	 * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly)
> -	 * cold cacheline in some cases.
> -	 */
> -	if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
> -	    test_bit(PG_head, &page->flags)) {
> -		/*
> -		 * We can safely access the field of the @page[1] with PG_head
> -		 * because the @page is a compound page composed with at least
> -		 * two contiguous pages.
> -		 */
> -		unsigned long head = READ_ONCE(page[1].compound_head);
> -
> -		if (likely(head & 1))
> -			return (const struct page *)(head - 1);
> -	}
> -	return page;
> +	return _page_fixed_fake_head(page);
>   }
>   #else
>   static inline const struct page *page_fixed_fake_head(const struct page *page)
> @@ -235,7 +214,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
>   	return page_fixed_fake_head(page) != page;
>   }
>   
> -static inline unsigned long _compound_head(const struct page *page)
> +static __always_inline unsigned long _compound_head(const struct page *page)
>   {
>   	unsigned long head = READ_ONCE(page->compound_head);
>   
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 829112b0a914..3fbc00db607a 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -19,6 +19,33 @@
>   #include <asm/tlbflush.h>
>   #include "hugetlb_vmemmap.h"
>   
> +/*
> + * Return the real head page struct iff the @page is a fake head page, otherwise
> + * return the @page itself. See Documentation/mm/vmemmap_dedup.rst.
> + */
> +const struct page *_page_fixed_fake_head(const struct page *page)
> +{
> +	/*
> +	 * Only addresses aligned with PAGE_SIZE of struct page may be fake head
> +	 * struct page. The alignment check aims to avoid access the fields (
> +	 * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly)
> +	 * cold cacheline in some cases.
> +	 */
> +	if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
> +	    test_bit(PG_head, &page->flags)) {
> +		/*
> +		 * We can safely access the field of the @page[1] with PG_head
> +		 * because the @page is a compound page composed with at least
> +		 * two contiguous pages.
> +		 */
> +		unsigned long head = READ_ONCE(page[1].compound_head);
> +
> +		if (likely(head & 1))
> +			return (const struct page *)(head - 1);
> +	}
> +	return page;
> +}
> +
>   /**
>    * struct vmemmap_remap_walk - walk vmemmap page table
>    *
> 

The change resolves the regression (from -3% to +0.5%):

Please note:
   9cb28da54643ad464c47585cd5866c30b0218e67 is the parent commit
   3f16e4b516ef02d9461b7e0b6c50e05ba0811886 is the commit with above
                                            patch
   c0bff412e67b781d761e330ff9578aa9ed2be79e is the commit which
                                            introduced regression


=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_threads/testtime/test/cpufreq_governor/debug-setup:
 
lkp-icl-2sp8/stress-ng/debian-12-x86_64-20240206.cgz/x86_64-rhel-8.3/gcc-12/100%/60s/clone/performance/yfw_test2

commit:
   9cb28da54643ad464c47585cd5866c30b0218e67
   3f16e4b516ef02d9461b7e0b6c50e05ba0811886
   c0bff412e67b781d761e330ff9578aa9ed2be79e

9cb28da54643ad46 3f16e4b516ef02d9461b7e0b6c5 c0bff412e67b781d761e330ff95
---------------- --------------------------- ---------------------------
        fail:runs  %reproduction    fail:runs  %reproduction    fail:runs
            |             |             |             |             |
           3:3            0%           3:3            0%           3:3   stress-ng.clone.microsecs_per_clone.pass
           3:3            0%           3:3            0%           3:3   stress-ng.clone.pass
          %stddev     %change         %stddev     %change         %stddev
              \          |                \          |                \
       2904            -0.6%       2886            +3.7%       3011   stress-ng.clone.microsecs_per_clone
     563520            +0.5%     566296            -3.1%     546122   stress-ng.clone.ops
       9306            +0.5%       9356            -3.0%       9024   stress-ng.clone.ops_per_sec


BTW, the change needs to export the symbol _page_fixed_fake_head,
otherwise some modules hit a build error.
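
[Aside: that build error is the usual consequence of moving an inline
helper out of line: modular code that inlines compound_head() now emits
a real call to _page_fixed_fake_head(), which must be exported. The
fixup would presumably be a one-liner after the function's closing
brace in mm/hugetlb_vmemmap.c, along the lines of:

	EXPORT_SYMBOL(_page_fixed_fake_head);
]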


Regards
Yin, Fengwei


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-13  7:09                   ` Yin Fengwei
@ 2024-08-13  7:14                     ` Mateusz Guzik
  2024-08-14  3:02                       ` Yin Fengwei
  0 siblings, 1 reply; 22+ messages in thread
From: Mateusz Guzik @ 2024-08-13  7:14 UTC (permalink / raw)
  To: Yin Fengwei
  Cc: David Hildenbrand, kernel test robot, Peter Xu, oe-lkp, lkp,
	linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
	Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
	linux-mm, ying.huang, feng.tang

On Tue, Aug 13, 2024 at 9:09 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
>
> [snip]
>
> The change resolves the regression (from -3% to +0.5%):
>

thanks for testing

would you mind benchmarking the change which merely force-inlines _compound_head?

https://lore.kernel.org/linux-mm/66c4fcc5-47f6-438c-a73a-3af6e19c3200@redhat.com/

> [snip]
> BTW, the change needs to export the symbol _page_fixed_fake_head,
> otherwise some modules hit a build error.
>

ok, I'll patch that up if this approach ends up being the one we go with

-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-13  7:14                     ` Mateusz Guzik
@ 2024-08-14  3:02                       ` Yin Fengwei
  2024-08-14  4:10                         ` Mateusz Guzik
  0 siblings, 1 reply; 22+ messages in thread
From: Yin Fengwei @ 2024-08-14  3:02 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: David Hildenbrand, kernel test robot, Peter Xu, oe-lkp, lkp,
	linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
	Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
	linux-mm, ying.huang, feng.tang

On 8/13/24 03:14, Mateusz Guzik wrote:
> thanks for testing
> 
> would you mind benchmarking the change which merely force-inlines _compound_head?
> 
> https://lore.kernel.org/linux-mm/66c4fcc5-47f6-438c-a73a-3af6e19c3200@redhat.com/
This change also resolves the regression:
=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_threads/testtime/test/cpufreq_governor/debug-setup:
 
lkp-icl-2sp8/stress-ng/debian-12-x86_64-20240206.cgz/x86_64-rhel-8.3/gcc-12/100%/60s/clone/performance/yfw_test2

commit:
   9cb28da54643ad464c47585cd5866c30b0218e67  parent commit
   c0bff412e67b781d761e330ff9578aa9ed2be79e  commit introduced regression
   450b96d2c4f740152e03c6b79b484a10347b3ea9  the change proposed by David
                                             in above link

9cb28da54643ad46 c0bff412e67b781d761e330ff95 450b96d2c4f740152e03c6b79b4
---------------- --------------------------- ---------------------------
          %stddev     %change         %stddev     %change         %stddev
              \          |                \          |                \
       2906            +3.5%       3007            +0.4%       2919   stress-ng.clone.microsecs_per_clone
     562884            -2.9%     546575            -0.6%     559718   stress-ng.clone.ops
       9295            -2.9%       9028            -0.5%       9248   stress-ng.clone.ops_per_sec



Regards
Yin, Fengwei



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-14  3:02                       ` Yin Fengwei
@ 2024-08-14  4:10                         ` Mateusz Guzik
  2024-08-14  9:45                           ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: Mateusz Guzik @ 2024-08-14  4:10 UTC (permalink / raw)
  To: Yin Fengwei
  Cc: David Hildenbrand, kernel test robot, Peter Xu, oe-lkp, lkp,
	linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
	Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
	linux-mm, ying.huang, feng.tang

On Wed, Aug 14, 2024 at 5:02 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
>
> On 8/13/24 03:14, Mateusz Guzik wrote:
> > would you mind benchmarking the change which merely force-inlines _compound_head?
> >
> > https://lore.kernel.org/linux-mm/66c4fcc5-47f6-438c-a73a-3af6e19c3200@redhat.com/
> This change also resolves the regression:

Great, thanks.

David, I guess this means it would be fine to inline the entire thing
at least from this bench standpoint. Given that this is your idea I
guess you should do the needful(tm)? :)

> [snip]


-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-14  4:10                         ` Mateusz Guzik
@ 2024-08-14  9:45                           ` David Hildenbrand
  2024-08-14 11:06                             ` Mateusz Guzik
  0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2024-08-14  9:45 UTC (permalink / raw)
  To: Mateusz Guzik, Yin Fengwei
  Cc: kernel test robot, Peter Xu, oe-lkp, lkp, linux-kernel,
	Andrew Morton, Huacai Chen, Jason Gunthorpe, Matthew Wilcox,
	Nathan Chancellor, Ryan Roberts, WANG Xuerui, linux-mm,
	ying.huang, feng.tang

On 14.08.24 06:10, Mateusz Guzik wrote:
> On Wed, Aug 14, 2024 at 5:02 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>
>> On 8/13/24 03:14, Mateusz Guzik wrote:
> >>> would you mind benchmarking the change which merely force-inlines _compound_head?
>>>
>>> https://lore.kernel.org/linux-mm/66c4fcc5-47f6-438c-a73a-3af6e19c3200@redhat.com/
> >> This change also resolves the regression:
> 
> Great, thanks.
> 
> David, I guess this means it would be fine to inline the entire thing
> at least from this bench standpoint. Given that this is your idea I
> guess you should do the needful(tm)? :)

Testing

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5769fe6e4950..25e25b34f4a0 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -235,7 +235,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
         return page_fixed_fake_head(page) != page;
  }
  
-static inline unsigned long _compound_head(const struct page *page)
+static __always_inline unsigned long _compound_head(const struct page *page)
  {
         unsigned long head = READ_ONCE(page->compound_head);
  

With a kernel-config based on something derived from Fedora
config-6.8.9-100.fc38.x86_64 for convenience with

CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y

add/remove: 15/14 grow/shrink: 79/87 up/down: 12836/-13917 (-1081)
Function                                     old     new   delta
change_pte_range                               -    2308   +2308
iommu_put_dma_cookie                         454    1276    +822
get_hwpoison_page                           2007    2580    +573
end_bbio_data_read                          1171    1626    +455
end_bbio_meta_read                           492     934    +442
ext4_finish_bio                              773    1208    +435
fq_ring_free_locked                          128     541    +413
end_bbio_meta_write                          493     872    +379
gup_fast_fallback                           4207    4568    +361
v1_free_pgtable                              166     519    +353
iommu_v1_map_pages                          2747    3098    +351
end_bbio_data_write                          609     960    +351
fsverity_verify_bio                          334     656    +322
follow_page_mask                            3399    3719    +320
__read_end_io                                316     635    +319
btrfs_end_super_write                        494     789    +295
iommu_alloc_pages_node.constprop             286     572    +286
free_buffers.part                              -     271    +271
gup_must_unshare                               -     268    +268
smaps_pte_range                             1285    1513    +228
pagemap_pmd_range                           2189    2393    +204
iommu_alloc_pages_node                         -     193    +193
smaps_hugetlb_range                          705     897    +192
follow_page_pte                             1584    1758    +174
__migrate_device_pages                      2435    2595    +160
unpin_user_pages_dirty_lock                  205     362    +157
_compound_head                                 -     150    +150
unpin_user_pages                             143     282    +139
put_ref_page.part                              -     126    +126
iomap_finish_ioend                           866     972    +106
iomap_read_end_io                            673     763     +90
end_bbio_meta_read.cold                       42     131     +89
btrfs_do_readpage                           1759    1845     +86
extent_write_cache_pages                    2133    2212     +79
end_bbio_data_write.cold                      32     108     +76
end_bbio_meta_write.cold                      40     108     +68
__read_end_io.cold                            25      91     +66
btrfs_end_super_write.cold                    25      89     +64
ext4_finish_bio.cold                         118     178     +60
fsverity_verify_bio.cold                      25      84     +59
block_write_begin                            217     274     +57
end_bbio_data_read.cold                      378     426     +48
__pfx__compound_head                           -      48     +48
copy_hugetlb_page_range                     3050    3097     +47
lruvec_stat_mod_folio.constprop              585     630     +45
iomap_finish_ioend.cold                      163     202     +39
md_bitmap_file_unmap                         150     187     +37
free_pgd_range                              1949    1985     +36
prep_move_freepages_block                    319     349     +30
iommu_alloc_pages_node.cold                    -      25     +25
iomap_read_end_io.cold                        65      89     +24
zap_huge_pmd                                 874     897     +23
cont_write_begin.cold                        108     130     +22
skb_splice_from_iter                         822     843     +21
set_pmd_migration_entry                     1037    1058     +21
zerocopy_fill_skb_from_iter                 1321    1340     +19
pagemap_scan_pmd_entry                      3261    3279     +18
try_grab_folio_fast                          452     469     +17
change_huge_pmd                             1174    1191     +17
folio_put                                     48      64     +16
__pfx_set_p4d                                  -      16     +16
__pfx_put_ref_page.part                        -      16     +16
__pfx_lruvec_stat_mod_folio.constprop        208     224     +16
__pfx_iommu_alloc_pages_node.constprop        16      32     +16
__pfx_iommu_alloc_pages_node                   -      16     +16
__pfx_gup_must_unshare                         -      16     +16
__pfx_free_buffers.part                        -      16     +16
__pfx_folio_put                               48      64     +16
__pfx_change_pte_range                         -      16     +16
__pfx___pte                                   32      48     +16
offline_pages                               1962    1975     +13
memfd_pin_folios                            1284    1297     +13
uprobe_write_opcode                         2062    2073     +11
set_p4d                                        -      11     +11
__pte                                         22      33     +11
copy_page_from_iter_atomic                  1714    1724     +10
__migrate_device_pages.cold                   60      70     +10
try_to_unmap_one                            3355    3364      +9
try_to_migrate_one                          3310    3319      +9
stable_page_flags                           1034    1043      +9
io_sqe_buffer_register                      1404    1413      +9
dio_zero_block                               644     652      +8
add_ra_bio_pages.constprop.isra             1542    1550      +8
__add_to_kill                                969     977      +8
btrfs_writepage_fixup_worker                1199    1206      +7
write_protect_page                          1186    1192      +6
iommu_v2_map_pages.cold                      145     151      +6
gup_fast_fallback.cold                       112     117      +5
try_to_merge_one_page                       1857    1860      +3
__apply_to_page_range                       2235    2238      +3
wbc_account_cgroup_owner                     217     219      +2
change_protection.cold                       105     107      +2
can_change_pte_writable                      354     356      +2
vmf_insert_pfn_pud                           699     700      +1
split_huge_page_to_list_to_order.cold        152     151      -1
pte_pfn                                       40      39      -1
move_pages                                  5270    5269      -1
isolate_single_pageblock                    1056    1055      -1
__apply_to_page_range.cold                    92      91      -1
unmap_page_range.cold                         88      86      -2
do_huge_pmd_numa_page                       1175    1173      -2
free_pgd_range.cold                          162     159      -3
copy_page_to_iter                            329     326      -3
copy_page_range.cold                         149     146      -3
copy_page_from_iter                          307     304      -3
can_finish_ordered_extent                    551     548      -3
__replace_page                              1133    1130      -3
__reset_isolation_pfn                        645     641      -4
dio_send_cur_page                           1113    1108      -5
__access_remote_vm                          1010    1005      -5
pagemap_hugetlb_category                     468     459      -9
extent_write_locked_range                   1148    1139      -9
unuse_pte_range                             1834    1821     -13
do_migrate_range                            1935    1922     -13
__get_user_pages                            1952    1938     -14
migrate_vma_collect_pmd                     2817    2802     -15
copy_page_to_iter_nofault                   2373    2358     -15
hugetlb_fault                               4054    4038     -16
__pfx_shake_page                              16       -     -16
__pfx_put_page                                16       -     -16
__pfx_pfn_swap_entry_to_page                  32      16     -16
__pfx_gup_must_unshare.part                   16       -     -16
__pfx_gup_folio_next                          16       -     -16
__pfx_free_buffers                            16       -     -16
__pfx___get_unpoison_page                     16       -     -16
btrfs_cleanup_ordered_extents                622     604     -18
read_rdev                                    694     673     -21
isolate_migratepages_block.cold              222     197     -25
hugetlb_mfill_atomic_pte                    1869    1844     -25
folio_pte_batch.constprop                   1020     995     -25
hugetlb_reserve_pages                       1468    1441     -27
__alloc_fresh_hugetlb_folio                  676     649     -27
intel_pasid_alloc_table.cold                  83      52     -31
__pfx_iommu_put_pages_list                    48      16     -32
__pfx_PageHuge                                32       -     -32
__blockdev_direct_IO.cold                    952     920     -32
io_ctl_prepare_pages                         832     794     -38
__handle_mm_fault                           4237    4195     -42
finish_fault                                1007     962     -45
__pfx_pfn_swap_entry_folio                    64      16     -48
vm_normal_folio_pmd                           84      34     -50
vm_normal_folio                               84      34     -50
set_migratetype_isolate                     1429    1375     -54
do_set_pmd                                   618     561     -57
can_change_pmd_writable                      293     229     -64
__unmap_hugepage_range                      2389    2325     -64
do_fault                                    1187    1121     -66
fault_dirty_shared_page                      425     358     -67
madvise_free_huge_pmd                        863     792     -71
insert_page_into_pte_locked.isra             502     429     -73
restore_exclusive_pte                        539     463     -76
isolate_migratepages_block                  5436    5355     -81
__do_fault                                   366     276     -90
set_pte_range                                593     502     -91
follow_devmap_pmd                            559     468     -91
__pfx_bio_first_folio                        144      48     -96
shake_page                                   105       -    -105
hugetlb_change_protection                   2314    2204    -110
hugetlb_wp                                  2134    2017    -117
__blockdev_direct_IO                        5063    4946    -117
skb_tx_error                                 272     149    -123
put_page                                     123       -    -123
gup_must_unshare.part                        135       -    -135
PageHuge                                     136       -    -136
ksm_scan_thread                             9172    9032    -140
intel_pasid_alloc_table                      596     447    -149
copy_huge_pmd                               1539    1385    -154
skb_split                                   1534    1376    -158
split_huge_pmd_locked                       4024    3865    -159
skb_append_pagefrags                         663     504    -159
memory_failure                              2784    2624    -160
unpoison_memory                             1328    1167    -161
cont_write_begin                             959     793    -166
pfn_swap_entry_to_page                       250      82    -168
skb_pp_cow_data                             1539    1367    -172
gup_folio_next                               180       -    -180
intel_pasid_get_entry.isra                   607     425    -182
v2_alloc_pgtable                             309     126    -183
do_huge_pmd_wp_page                         1173     988    -185
bio_first_folio.cold                         315     105    -210
unmap_page_range                            6091    5873    -218
split_huge_page_to_list_to_order            4141    3905    -236
move_pages_huge_pmd                         2053    1813    -240
free_buffers                                 286       -    -286
iommu_v2_map_pages                          1722    1428    -294
soft_offline_page                           2149    1843    -306
do_wp_page                                  3340    2993    -347
do_swap_page                                4619    4265    -354
md_import_device                            1002     635    -367
copy_page_range                             7436    7040    -396
__get_unpoison_page                          415       -    -415
pfn_swap_entry_folio                         596     149    -447
iommu_put_pages_list                        1071     344    -727
bio_first_folio                             2322     774   -1548
change_protection                           5008    2790   -2218
Total: Before=32786363, After=32785282, chg -0.00%


-- 
Cheers,

David / dhildenb



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-14  9:45                           ` David Hildenbrand
@ 2024-08-14 11:06                             ` Mateusz Guzik
  2024-08-14 12:02                               ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: Mateusz Guzik @ 2024-08-14 11:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Yin Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
	linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
	Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
	linux-mm, ying.huang, feng.tang

On Wed, Aug 14, 2024 at 11:45 AM David Hildenbrand <david@redhat.com> wrote:
>
> [snip]
>
> Testing
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 5769fe6e4950..25e25b34f4a0 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -235,7 +235,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
>          return page_fixed_fake_head(page) != page;
>   }
>
> -static inline unsigned long _compound_head(const struct page *page)
> +static __always_inline unsigned long _compound_head(const struct page *page)
>   {
>          unsigned long head = READ_ONCE(page->compound_head);
>
>
> With a kernel-config based on something derived from Fedora
> config-6.8.9-100.fc38.x86_64 for convenience with
>
> CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
>
> add/remove: 15/14 grow/shrink: 79/87 up/down: 12836/-13917 (-1081)
[snip]
> Total: Before=32786363, After=32785282, chg -0.00%

I guess there should be no opposition then?

Given that this is your patch I presume you are going to see this through.

I don't want any mention or cc on the patch, thanks for understanding :)

-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-14 11:06                             ` Mateusz Guzik
@ 2024-08-14 12:02                               ` David Hildenbrand
  0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2024-08-14 12:02 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Yin Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
	linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
	Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
	linux-mm, ying.huang, feng.tang

On 14.08.24 13:06, Mateusz Guzik wrote:
> [snip]
> 
> I guess there should be no opposition then?
> 
> Given that this is your patch I presume you are going to see this through.

I was hoping that you could send an official patch, after all you did 
most of the work here.

> 
> I don't want any mention or cc on the patch, thanks for understanding :)

If I have to send it, I will respect that.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2024-08-14 12:02 UTC | newest]

Thread overview: 22+ messages
2024-07-30  5:00 [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression kernel test robot
2024-07-30  8:11 ` David Hildenbrand
2024-08-01  6:39   ` Yin, Fengwei
2024-08-01  6:49     ` David Hildenbrand
2024-08-01  7:44       ` Yin, Fengwei
2024-08-01  7:54         ` David Hildenbrand
2024-08-01 13:30       ` Mateusz Guzik
2024-08-01 13:34         ` David Hildenbrand
2024-08-01 13:37           ` Mateusz Guzik
2024-08-01 13:44             ` David Hildenbrand
2024-08-12  4:43               ` Yin Fengwei
2024-08-12  4:49                 ` Mateusz Guzik
2024-08-12  8:12                   ` David Hildenbrand
2024-08-12  8:18                     ` Mateusz Guzik
2024-08-12  8:23                       ` David Hildenbrand
2024-08-13  7:09                   ` Yin Fengwei
2024-08-13  7:14                     ` Mateusz Guzik
2024-08-14  3:02                       ` Yin Fengwei
2024-08-14  4:10                         ` Mateusz Guzik
2024-08-14  9:45                           ` David Hildenbrand
2024-08-14 11:06                             ` Mateusz Guzik
2024-08-14 12:02                               ` David Hildenbrand
