* [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
@ 2024-07-30 5:00 kernel test robot
2024-07-30 8:11 ` David Hildenbrand
0 siblings, 1 reply; 22+ messages in thread
From: kernel test robot @ 2024-07-30 5:00 UTC (permalink / raw)
To: Peter Xu
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, David Hildenbrand,
Huacai Chen, Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor,
Ryan Roberts, WANG Xuerui, linux-mm, ying.huang, feng.tang,
fengwei.yin, oliver.sang
Hello,
kernel test robot noticed a -2.9% regression of stress-ng.clone.ops_per_sec on:
commit: c0bff412e67b781d761e330ff9578aa9ed2be79e ("mm: allow anon exclusive check over hugetlb tail pages")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
testcase: stress-ng
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:
nr_threads: 100%
testtime: 60s
test: clone
cpufreq_governor: performance
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202407301049.5051dc19-oliver.sang@intel.com
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240730/202407301049.5051dc19-oliver.sang@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
gcc-13/performance/x86_64-rhel-8.3/100%/debian-12-x86_64-20240206.cgz/lkp-icl-2sp8/clone/stress-ng/60s
commit:
9cb28da546 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
c0bff412e6 ("mm: allow anon exclusive check over hugetlb tail pages")
9cb28da54643ad46 c0bff412e67b781d761e330ff95
---------------- ---------------------------
%stddev %change %stddev
37842 -3.4% 36554 vmstat.system.cs
0.00 ± 17% -86.4% 0.00 ±223% sched_debug.rt_rq:.rt_time.avg
0.19 ± 17% -86.4% 0.03 ±223% sched_debug.rt_rq:.rt_time.max
0.02 ± 17% -86.4% 0.00 ±223% sched_debug.rt_rq:.rt_time.stddev
24081 -3.7% 23200 proc-vmstat.nr_page_table_pages
399380 -2.3% 390288 proc-vmstat.nr_slab_reclaimable
1625589 -2.4% 1585989 proc-vmstat.nr_slab_unreclaimable
1.019e+08 -3.8% 98035999 proc-vmstat.numa_hit
1.018e+08 -3.9% 97870705 proc-vmstat.numa_local
1.092e+08 -3.8% 1.05e+08 proc-vmstat.pgalloc_normal
1.06e+08 -3.8% 1.019e+08 proc-vmstat.pgfree
2659199 -2.3% 2597978 proc-vmstat.pgreuse
2910 +3.4% 3010 stress-ng.clone.microsecs_per_clone
562874 -2.9% 546587 stress-ng.clone.ops
9298 -2.9% 9031 stress-ng.clone.ops_per_sec
686858 -2.8% 667416 stress-ng.time.involuntary_context_switches
9091031 -3.9% 8734352 stress-ng.time.minor_page_faults
4200 +2.4% 4299 stress-ng.time.percent_of_cpu_this_job_got
2543 +2.4% 2603 stress-ng.time.system_time
342849 -2.8% 333189 stress-ng.time.voluntary_context_switches
6.67 -6.1% 6.26 perf-stat.i.MPKI
6.388e+08 -5.4% 6.045e+08 perf-stat.i.cache-misses
1.558e+09 -4.6% 1.487e+09 perf-stat.i.cache-references
40791 -3.6% 39330 perf-stat.i.context-switches
353.55 +5.4% 372.76 perf-stat.i.cycles-between-cache-misses
7.95 ± 3% -6.5% 7.43 ± 3% perf-stat.i.metric.K/sec
251389 ± 3% -6.5% 235029 ± 3% perf-stat.i.minor-faults
251423 ± 3% -6.5% 235064 ± 3% perf-stat.i.page-faults
6.75 -6.1% 6.33 perf-stat.overall.MPKI
0.38 -0.0 0.37 perf-stat.overall.branch-miss-rate%
350.09 +5.8% 370.24 perf-stat.overall.cycles-between-cache-misses
68503488 -1.2% 67660585 perf-stat.ps.branch-misses
6.33e+08 -5.4% 5.987e+08 perf-stat.ps.cache-misses
1.518e+09 -4.6% 1.449e+09 perf-stat.ps.cache-references
38819 -3.3% 37542 perf-stat.ps.context-switches
3637 +1.2% 3680 perf-stat.ps.cpu-migrations
235473 ± 3% -6.3% 220601 ± 3% perf-stat.ps.minor-faults
235504 ± 3% -6.3% 220632 ± 3% perf-stat.ps.page-faults
45.55 -2.5 43.04 perf-profile.calltrace.cycles-pp.zap_pmd_range.unmap_page_range.unmap_vmas.exit_mmap.__mmput
44.86 -2.5 42.37 perf-profile.calltrace.cycles-pp.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas.exit_mmap
44.42 -2.1 42.37 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
44.42 -2.1 42.37 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
44.41 -2.1 42.36 perf-profile.calltrace.cycles-pp.__x64_sys_exit.do_syscall_64.entry_SYSCALL_64_after_hwframe
44.41 -2.1 42.36 perf-profile.calltrace.cycles-pp.do_exit.__x64_sys_exit.do_syscall_64.entry_SYSCALL_64_after_hwframe
39.08 -1.7 37.34 perf-profile.calltrace.cycles-pp.exit_mm.do_exit.__x64_sys_exit.do_syscall_64.entry_SYSCALL_64_after_hwframe
38.96 -1.7 37.22 perf-profile.calltrace.cycles-pp.exit_mmap.__mmput.exit_mm.do_exit.__x64_sys_exit
38.97 -1.7 37.24 perf-profile.calltrace.cycles-pp.__mmput.exit_mm.do_exit.__x64_sys_exit.do_syscall_64
36.16 -1.6 34.57 perf-profile.calltrace.cycles-pp.unmap_vmas.exit_mmap.__mmput.exit_mm.do_exit
35.99 -1.6 34.40 perf-profile.calltrace.cycles-pp.unmap_page_range.unmap_vmas.exit_mmap.__mmput.exit_mm
32.17 -1.5 30.62 perf-profile.calltrace.cycles-pp.zap_present_ptes.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas
12.49 -1.0 11.52 perf-profile.calltrace.cycles-pp._compound_head.zap_present_ptes.zap_pte_range.zap_pmd_range.unmap_page_range
9.66 -0.9 8.74 perf-profile.calltrace.cycles-pp.unmap_vmas.exit_mmap.__mmput.copy_process.kernel_clone
9.61 -0.9 8.69 perf-profile.calltrace.cycles-pp.unmap_page_range.unmap_vmas.exit_mmap.__mmput.copy_process
10.71 -0.9 9.84 perf-profile.calltrace.cycles-pp.__mmput.copy_process.kernel_clone.__do_sys_clone3.do_syscall_64
10.70 -0.9 9.84 perf-profile.calltrace.cycles-pp.exit_mmap.__mmput.copy_process.kernel_clone.__do_sys_clone3
10.41 -0.8 9.58 perf-profile.calltrace.cycles-pp.__tlb_batch_free_encoded_pages.tlb_flush_mmu.zap_pte_range.zap_pmd_range.unmap_page_range
10.42 -0.8 9.59 perf-profile.calltrace.cycles-pp.tlb_flush_mmu.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas
10.21 -0.8 9.40 perf-profile.calltrace.cycles-pp.free_pages_and_swap_cache.__tlb_batch_free_encoded_pages.tlb_flush_mmu.zap_pte_range.zap_pmd_range
5.47 -0.4 5.04 perf-profile.calltrace.cycles-pp.folios_put_refs.free_pages_and_swap_cache.__tlb_batch_free_encoded_pages.tlb_flush_mmu.zap_pte_range
1.11 -0.3 0.79 ± 33% perf-profile.calltrace.cycles-pp.anon_vma_interval_tree_insert.anon_vma_clone.anon_vma_fork.dup_mmap.dup_mm
14.18 -0.3 13.87 perf-profile.calltrace.cycles-pp.folio_remove_rmap_ptes.zap_present_ptes.zap_pte_range.zap_pmd_range.unmap_page_range
5.17 -0.3 4.86 perf-profile.calltrace.cycles-pp.put_files_struct.do_exit.__x64_sys_exit.do_syscall_64.entry_SYSCALL_64_after_hwframe
4.80 -0.3 4.53 perf-profile.calltrace.cycles-pp.filp_close.put_files_struct.do_exit.__x64_sys_exit.do_syscall_64
4.40 -0.3 4.14 perf-profile.calltrace.cycles-pp.filp_flush.filp_close.put_files_struct.do_exit.__x64_sys_exit
2.74 -0.2 2.58 perf-profile.calltrace.cycles-pp.anon_vma_fork.dup_mmap.dup_mm.copy_process.kernel_clone
2.25 -0.1 2.11 perf-profile.calltrace.cycles-pp.anon_vma_clone.anon_vma_fork.dup_mmap.dup_mm.copy_process
1.47 -0.1 1.34 perf-profile.calltrace.cycles-pp.put_files_struct.copy_process.kernel_clone.__do_sys_clone3.do_syscall_64
1.87 -0.1 1.76 perf-profile.calltrace.cycles-pp.dnotify_flush.filp_flush.filp_close.put_files_struct.do_exit
1.98 -0.1 1.88 perf-profile.calltrace.cycles-pp.free_pgtables.exit_mmap.__mmput.exit_mm.do_exit
1.28 -0.1 1.18 perf-profile.calltrace.cycles-pp.filp_close.put_files_struct.copy_process.kernel_clone.__do_sys_clone3
1.19 -0.1 1.09 perf-profile.calltrace.cycles-pp.filp_flush.filp_close.put_files_struct.copy_process.kernel_clone
1.31 ± 2% -0.1 1.25 perf-profile.calltrace.cycles-pp.unlink_anon_vmas.free_pgtables.exit_mmap.__mmput.exit_mm
0.58 -0.0 0.55 perf-profile.calltrace.cycles-pp.vm_normal_page.zap_present_ptes.zap_pte_range.zap_pmd_range.unmap_page_range
33.54 +0.6 34.10 perf-profile.calltrace.cycles-pp.syscall
33.45 +0.6 34.01 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
33.45 +0.6 34.01 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.syscall
33.35 +0.6 33.90 perf-profile.calltrace.cycles-pp.__do_sys_clone3.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
33.34 +0.6 33.90 perf-profile.calltrace.cycles-pp.kernel_clone.__do_sys_clone3.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
33.30 +0.6 33.86 perf-profile.calltrace.cycles-pp.copy_process.kernel_clone.__do_sys_clone3.do_syscall_64.entry_SYSCALL_64_after_hwframe
20.63 +1.6 22.21 perf-profile.calltrace.cycles-pp.dup_mm.copy_process.kernel_clone.__do_sys_clone3.do_syscall_64
20.55 +1.6 22.14 perf-profile.calltrace.cycles-pp.dup_mmap.dup_mm.copy_process.kernel_clone.__do_sys_clone3
19.40 +1.8 21.19 perf-profile.calltrace.cycles-pp.__clone
19.24 +1.8 21.04 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__clone
19.24 +1.8 21.04 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__clone
19.14 +1.8 20.94 perf-profile.calltrace.cycles-pp.__do_sys_clone.do_syscall_64.entry_SYSCALL_64_after_hwframe.__clone
19.14 +1.8 20.94 perf-profile.calltrace.cycles-pp.kernel_clone.__do_sys_clone.do_syscall_64.entry_SYSCALL_64_after_hwframe.__clone
19.05 +1.8 20.85 perf-profile.calltrace.cycles-pp.copy_process.kernel_clone.__do_sys_clone.do_syscall_64.entry_SYSCALL_64_after_hwframe
18.74 +1.8 20.56 perf-profile.calltrace.cycles-pp.dup_mm.copy_process.kernel_clone.__do_sys_clone.do_syscall_64
18.67 +1.8 20.49 perf-profile.calltrace.cycles-pp.dup_mmap.dup_mm.copy_process.kernel_clone.__do_sys_clone
12.24 +3.1 15.35 perf-profile.calltrace.cycles-pp._compound_head.copy_present_ptes.copy_pte_range.copy_p4d_range.copy_page_range
34.37 +3.7 38.02 perf-profile.calltrace.cycles-pp.copy_page_range.dup_mmap.dup_mm.copy_process.kernel_clone
34.34 +3.7 38.00 perf-profile.calltrace.cycles-pp.copy_p4d_range.copy_page_range.dup_mmap.dup_mm.copy_process
30.99 +3.7 34.69 perf-profile.calltrace.cycles-pp.copy_present_ptes.copy_pte_range.copy_p4d_range.copy_page_range.dup_mmap
33.16 +3.7 36.88 perf-profile.calltrace.cycles-pp.copy_pte_range.copy_p4d_range.copy_page_range.dup_mmap.dup_mm
0.00 +3.9 3.90 perf-profile.calltrace.cycles-pp.folio_try_dup_anon_rmap_ptes.copy_present_ptes.copy_pte_range.copy_p4d_range.copy_page_range
49.67 -2.6 47.07 perf-profile.children.cycles-pp.exit_mmap
49.69 -2.6 47.08 perf-profile.children.cycles-pp.__mmput
45.84 -2.5 43.32 perf-profile.children.cycles-pp.unmap_vmas
45.56 -2.5 43.05 perf-profile.children.cycles-pp.zap_pmd_range
45.61 -2.5 43.10 perf-profile.children.cycles-pp.unmap_page_range
44.98 -2.5 42.48 perf-profile.children.cycles-pp.zap_pte_range
44.53 -2.1 42.48 perf-profile.children.cycles-pp.__x64_sys_exit
44.54 -2.1 42.48 perf-profile.children.cycles-pp.do_exit
39.10 -1.7 37.36 perf-profile.children.cycles-pp.exit_mm
32.99 -1.6 31.41 perf-profile.children.cycles-pp.zap_present_ptes
10.53 -0.8 9.71 perf-profile.children.cycles-pp.tlb_flush_mmu
10.91 -0.7 10.19 perf-profile.children.cycles-pp.__tlb_batch_free_encoded_pages
10.88 -0.7 10.16 perf-profile.children.cycles-pp.free_pages_and_swap_cache
6.64 -0.4 6.22 perf-profile.children.cycles-pp.put_files_struct
5.76 -0.4 5.38 perf-profile.children.cycles-pp.folios_put_refs
6.11 -0.4 5.73 perf-profile.children.cycles-pp.filp_close
5.62 -0.4 5.25 perf-profile.children.cycles-pp.filp_flush
14.28 -0.3 13.97 perf-profile.children.cycles-pp.folio_remove_rmap_ptes
2.75 -0.2 2.58 perf-profile.children.cycles-pp.anon_vma_fork
2.38 -0.2 2.22 perf-profile.children.cycles-pp.dnotify_flush
2.50 -0.1 2.36 perf-profile.children.cycles-pp.free_pgtables
2.25 -0.1 2.11 perf-profile.children.cycles-pp.anon_vma_clone
0.20 ± 33% -0.1 0.08 ± 58% perf-profile.children.cycles-pp.ordered_events__queue
0.20 ± 33% -0.1 0.08 ± 58% perf-profile.children.cycles-pp.queue_event
1.24 ± 4% -0.1 1.14 perf-profile.children.cycles-pp.down_write
1.67 ± 2% -0.1 1.58 perf-profile.children.cycles-pp.unlink_anon_vmas
1.59 -0.1 1.50 perf-profile.children.cycles-pp.__alloc_pages_noprof
1.55 -0.1 1.46 perf-profile.children.cycles-pp.alloc_pages_mpol_noprof
1.58 -0.1 1.50 perf-profile.children.cycles-pp.vm_normal_page
1.11 -0.1 1.04 perf-profile.children.cycles-pp.anon_vma_interval_tree_insert
1.33 -0.1 1.26 ± 2% perf-profile.children.cycles-pp.pte_alloc_one
0.47 ± 11% -0.1 0.40 ± 4% perf-profile.children.cycles-pp.rwsem_down_write_slowpath
0.45 ± 11% -0.1 0.38 ± 4% perf-profile.children.cycles-pp.rwsem_optimistic_spin
1.00 -0.1 0.94 ± 2% perf-profile.children.cycles-pp.get_page_from_freelist
1.36 -0.1 1.31 perf-profile.children.cycles-pp.kmem_cache_free
1.08 -0.0 1.04 perf-profile.children.cycles-pp.kmem_cache_alloc_noprof
0.62 -0.0 0.58 ± 2% perf-profile.children.cycles-pp.dup_fd
0.63 -0.0 0.59 ± 3% perf-profile.children.cycles-pp.__pte_alloc
0.73 -0.0 0.69 perf-profile.children.cycles-pp.__tlb_remove_folio_pages_size
0.58 -0.0 0.54 perf-profile.children.cycles-pp.locks_remove_posix
0.90 -0.0 0.86 perf-profile.children.cycles-pp.copy_huge_pmd
0.54 -0.0 0.51 perf-profile.children.cycles-pp.__memcg_kmem_charge_page
0.76 -0.0 0.72 perf-profile.children.cycles-pp.vm_area_dup
0.31 ± 2% -0.0 0.28 ± 3% perf-profile.children.cycles-pp.rwsem_spin_on_owner
0.50 -0.0 0.47 perf-profile.children.cycles-pp.__anon_vma_interval_tree_remove
0.53 -0.0 0.50 perf-profile.children.cycles-pp.clear_page_erms
0.49 -0.0 0.46 perf-profile.children.cycles-pp.free_swap_cache
0.72 -0.0 0.69 perf-profile.children.cycles-pp.__memcg_slab_post_alloc_hook
0.37 ± 2% -0.0 0.34 ± 2% perf-profile.children.cycles-pp.unlink_file_vma
0.62 -0.0 0.60 perf-profile.children.cycles-pp.__memcg_slab_free_hook
0.42 -0.0 0.40 ± 2% perf-profile.children.cycles-pp.rmqueue
0.37 -0.0 0.35 ± 2% perf-profile.children.cycles-pp.__rmqueue_pcplist
0.28 -0.0 0.25 perf-profile.children.cycles-pp.__rb_insert_augmented
0.35 -0.0 0.33 ± 2% perf-profile.children.cycles-pp.rmqueue_bulk
0.56 -0.0 0.54 perf-profile.children.cycles-pp.fput
0.48 -0.0 0.46 perf-profile.children.cycles-pp._raw_spin_lock
0.51 -0.0 0.50 perf-profile.children.cycles-pp.free_unref_page
0.45 -0.0 0.43 perf-profile.children.cycles-pp.__x64_sys_unshare
0.44 -0.0 0.42 perf-profile.children.cycles-pp.free_unref_page_commit
0.45 -0.0 0.43 perf-profile.children.cycles-pp.ksys_unshare
0.31 -0.0 0.30 perf-profile.children.cycles-pp.memcg_account_kmem
0.27 -0.0 0.26 perf-profile.children.cycles-pp.__mod_memcg_state
0.44 -0.0 0.43 perf-profile.children.cycles-pp.__slab_free
0.28 -0.0 0.26 perf-profile.children.cycles-pp.__vm_area_free
0.22 ± 2% -0.0 0.21 perf-profile.children.cycles-pp.___slab_alloc
0.21 -0.0 0.20 ± 2% perf-profile.children.cycles-pp.__tlb_remove_folio_pages
0.13 -0.0 0.12 perf-profile.children.cycles-pp.__rb_erase_color
0.07 -0.0 0.06 perf-profile.children.cycles-pp.find_idlest_cpu
0.09 -0.0 0.08 perf-profile.children.cycles-pp.wake_up_new_task
0.06 -0.0 0.05 perf-profile.children.cycles-pp.kfree
0.06 -0.0 0.05 perf-profile.children.cycles-pp.update_sg_wakeup_stats
0.11 -0.0 0.10 perf-profile.children.cycles-pp.allocate_slab
0.44 ± 2% +0.1 0.53 ± 2% perf-profile.children.cycles-pp.tlb_finish_mmu
98.24 +0.2 98.46 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
98.24 +0.2 98.46 perf-profile.children.cycles-pp.do_syscall_64
33.55 +0.6 34.10 perf-profile.children.cycles-pp.syscall
33.35 +0.6 33.90 perf-profile.children.cycles-pp.__do_sys_clone3
19.41 +1.8 21.20 perf-profile.children.cycles-pp.__clone
19.14 +1.8 20.94 perf-profile.children.cycles-pp.__do_sys_clone
24.94 +2.1 27.07 perf-profile.children.cycles-pp._compound_head
52.48 +2.4 54.84 perf-profile.children.cycles-pp.kernel_clone
52.36 +2.4 54.72 perf-profile.children.cycles-pp.copy_process
39.38 +3.4 42.77 perf-profile.children.cycles-pp.dup_mm
39.24 +3.4 42.64 perf-profile.children.cycles-pp.dup_mmap
34.34 +3.7 38.00 perf-profile.children.cycles-pp.copy_p4d_range
34.37 +3.7 38.03 perf-profile.children.cycles-pp.copy_page_range
33.28 +3.7 36.98 perf-profile.children.cycles-pp.copy_pte_range
31.41 +3.8 35.18 perf-profile.children.cycles-pp.copy_present_ptes
0.00 +4.0 4.01 perf-profile.children.cycles-pp.folio_try_dup_anon_rmap_ptes
18.44 -3.2 15.24 perf-profile.self.cycles-pp.copy_present_ptes
5.66 -0.4 5.28 perf-profile.self.cycles-pp.folios_put_refs
4.78 -0.3 4.46 perf-profile.self.cycles-pp.free_pages_and_swap_cache
14.11 -0.3 13.80 perf-profile.self.cycles-pp.folio_remove_rmap_ptes
4.82 -0.2 4.59 perf-profile.self.cycles-pp.zap_present_ptes
2.66 -0.2 2.49 perf-profile.self.cycles-pp.filp_flush
2.36 -0.2 2.20 perf-profile.self.cycles-pp.dnotify_flush
0.20 ± 32% -0.1 0.08 ± 58% perf-profile.self.cycles-pp.queue_event
1.44 -0.1 1.36 perf-profile.self.cycles-pp.zap_pte_range
1.11 -0.1 1.03 perf-profile.self.cycles-pp.anon_vma_interval_tree_insert
1.26 -0.1 1.20 perf-profile.self.cycles-pp.vm_normal_page
0.56 -0.0 0.52 ± 2% perf-profile.self.cycles-pp.dup_fd
0.56 -0.0 0.53 perf-profile.self.cycles-pp.locks_remove_posix
0.31 -0.0 0.28 perf-profile.self.cycles-pp.put_files_struct
0.58 -0.0 0.55 perf-profile.self.cycles-pp.__tlb_remove_folio_pages_size
0.49 -0.0 0.46 ± 2% perf-profile.self.cycles-pp.__anon_vma_interval_tree_remove
0.30 ± 3% -0.0 0.28 ± 3% perf-profile.self.cycles-pp.rwsem_spin_on_owner
0.52 -0.0 0.49 ± 2% perf-profile.self.cycles-pp.clear_page_erms
0.31 -0.0 0.29 perf-profile.self.cycles-pp.free_swap_cache
0.33 -0.0 0.31 perf-profile.self.cycles-pp.__memcg_slab_free_hook
0.45 -0.0 0.43 perf-profile.self.cycles-pp._raw_spin_lock
0.55 -0.0 0.53 perf-profile.self.cycles-pp.fput
0.38 -0.0 0.36 perf-profile.self.cycles-pp.__memcg_slab_post_alloc_hook
0.47 -0.0 0.45 perf-profile.self.cycles-pp.up_write
0.26 -0.0 0.24 perf-profile.self.cycles-pp.__rb_insert_augmented
0.33 -0.0 0.32 perf-profile.self.cycles-pp.mod_objcg_state
0.31 -0.0 0.30 perf-profile.self.cycles-pp.__free_one_page
0.09 -0.0 0.08 perf-profile.self.cycles-pp.___slab_alloc
24.40 +2.1 26.55 perf-profile.self.cycles-pp._compound_head
0.00 +3.9 3.89 perf-profile.self.cycles-pp.folio_try_dup_anon_rmap_ptes
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
2024-07-30 5:00 [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression kernel test robot
@ 2024-07-30 8:11 ` David Hildenbrand
2024-08-01 6:39 ` Yin, Fengwei
0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2024-07-30 8:11 UTC (permalink / raw)
To: kernel test robot, Peter Xu
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Huacai Chen,
Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor, Ryan Roberts,
WANG Xuerui, linux-mm, ying.huang, feng.tang, fengwei.yin
On 30.07.24 07:00, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed a -2.9% regression of stress-ng.clone.ops_per_sec on:
Is that test even using hugetlb? Anyhow, this pretty much sounds like
noise and can be ignored.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
2024-07-30 8:11 ` David Hildenbrand
@ 2024-08-01 6:39 ` Yin, Fengwei
2024-08-01 6:49 ` David Hildenbrand
0 siblings, 1 reply; 22+ messages in thread
From: Yin, Fengwei @ 2024-08-01 6:39 UTC (permalink / raw)
To: David Hildenbrand, kernel test robot, Peter Xu
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Huacai Chen,
Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor, Ryan Roberts,
WANG Xuerui, linux-mm, ying.huang, feng.tang
Hi David,
On 7/30/2024 4:11 PM, David Hildenbrand wrote:
> On 30.07.24 07:00, kernel test robot wrote:
>>
>>
>> Hello,
>>
>> kernel test robot noticed a -2.9% regression of
>> stress-ng.clone.ops_per_sec on:
>
> Is that test even using hugetlb? Anyhow, this pretty much sounds like
> noise and can be ignored.
>
It's not about hugetlb. It looks like it's related to the change:
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 888353c209c03..7577fe7debafc 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -1095,7 +1095,12 @@ PAGEFLAG(Isolated, isolated, PF_ANY);
static __always_inline int PageAnonExclusive(const struct page *page)
{
VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
- VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+ /*
+ * HugeTLB stores this information on the head page; THP keeps it per
+ * page
+ */
+ if (PageHuge(page))
+ page = compound_head(page);
return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
The PageAnonExclusive() function was changed, and the profiling data
shows it:
0.00 +3.9 3.90
perf-profile.calltrace.cycles-pp.folio_try_dup_anon_rmap_ptes.copy_present_ptes.copy_pte_range.copy_p4d_range.copy_page_range
According to
https://download.01.org/0day-ci/archive/20240730/202407301049.5051dc19-oliver.sang@intel.com/config-6.9.0-rc4-00197-gc0bff412e67b:
# CONFIG_DEBUG_VM is not set
So maybe such a code change could make a difference?
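To illustrate (a sketch only): with CONFIG_DEBUG_VM unset the
VM_BUG_ON_PGFLAGS() checks compile away, so the helper effectively
becomes

static __always_inline int PageAnonExclusive(const struct page *page)
{
	/* added by c0bff412e6: runtime hugetlb check and head-page lookup */
	if (PageHuge(page))
		page = compound_head(page);
	return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
}

and the new PageHuge()/compound_head() path is the only extra work left
on this code path.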
And yes, a 2.9% regression can be within the noise range. Thanks.
Regards
Yin, Fengwei
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
2024-08-01 6:39 ` Yin, Fengwei
@ 2024-08-01 6:49 ` David Hildenbrand
2024-08-01 7:44 ` Yin, Fengwei
2024-08-01 13:30 ` Mateusz Guzik
0 siblings, 2 replies; 22+ messages in thread
From: David Hildenbrand @ 2024-08-01 6:49 UTC (permalink / raw)
To: Yin, Fengwei, kernel test robot, Peter Xu
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Huacai Chen,
Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor, Ryan Roberts,
WANG Xuerui, linux-mm, ying.huang, feng.tang
On 01.08.24 08:39, Yin, Fengwei wrote:
> Hi David,
>
> On 7/30/2024 4:11 PM, David Hildenbrand wrote:
>> On 30.07.24 07:00, kernel test robot wrote:
>>>
>>>
>>> Hello,
>>>
>>> kernel test robot noticed a -2.9% regression of
>>> stress-ng.clone.ops_per_sec on:
>>
>> Is that test even using hugetlb? Anyhow, this pretty much sounds like
>> noise and can be ignored.
>>
> It's not about hugetlb. It looks like it's related to the change:
Ah, that makes sense!
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 888353c209c03..7577fe7debafc 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -1095,7 +1095,12 @@ PAGEFLAG(Isolated, isolated, PF_ANY);
> static __always_inline int PageAnonExclusive(const struct page *page)
> {
> VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
> - VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
> + /*
> + * HugeTLB stores this information on the head page; THP keeps it per
> + * page
> + */
> + if (PageHuge(page))
> + page = compound_head(page);
> return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
>
>
> The PageAnonExclusive() function was changed, and the profiling data
> shows it:
>
> 0.00 +3.9 3.90
> perf-profile.calltrace.cycles-pp.folio_try_dup_anon_rmap_ptes.copy_present_ptes.copy_pte_range.copy_p4d_range.copy_page_range
>
> According to
> https://download.01.org/0day-ci/archive/20240730/202407301049.5051dc19-oliver.sang@intel.com/config-6.9.0-rc4-00197-gc0bff412e67b:
> # CONFIG_DEBUG_VM is not set
> So maybe such a code change could make a difference?
Yes indeed. fork() can be extremely sensitive to each added instruction.
I even pointed out to Peter why I didn't add the PageHuge check in there
originally [1].
"Well, and I didn't want to have runtime-hugetlb checks in
PageAnonExclusive code called on certainly-not-hugetlb code paths."
We now have to do a page_folio(page) and then test for hugetlb.
return folio_test_hugetlb(page_folio(page));
Nowadays, folio_test_hugetlb() will be faster than at c0bff412e6 times,
so maybe at least part of the overhead is gone.
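For reference, the helper in question reads, more or less:

static inline bool PageHuge(const struct page *page)
{
	return folio_test_hugetlb(page_folio(page));
}

so every PageAnonExclusive() call now also pays for page_folio() plus
the hugetlb test.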
[1]
https://lore.kernel.org/r/all/8b0b24bb-3c38-4f27-a2c9-f7d7adc4a115@redhat.com/
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
2024-08-01 6:49 ` David Hildenbrand
@ 2024-08-01 7:44 ` Yin, Fengwei
2024-08-01 7:54 ` David Hildenbrand
2024-08-01 13:30 ` Mateusz Guzik
1 sibling, 1 reply; 22+ messages in thread
From: Yin, Fengwei @ 2024-08-01 7:44 UTC (permalink / raw)
To: David Hildenbrand, kernel test robot, Peter Xu
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Huacai Chen,
Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor, Ryan Roberts,
WANG Xuerui, linux-mm, ying.huang, feng.tang
Hi David,
On 8/1/2024 2:49 PM, David Hildenbrand wrote:
> We now have to do a page_folio(page) and then test for hugetlb.
>
> return folio_test_hugetlb(page_folio(page));
>
> Nowadays, folio_test_hugetlb() will be faster than at c0bff412e6 times,
> so maybe at least part of the overhead is gone.
This is great. We will check the trend to see whether it has recovered
to some extent.
Regards
Yin, Fengwei
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
2024-08-01 7:44 ` Yin, Fengwei
@ 2024-08-01 7:54 ` David Hildenbrand
0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2024-08-01 7:54 UTC (permalink / raw)
To: Yin, Fengwei, kernel test robot, Peter Xu
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Huacai Chen,
Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor, Ryan Roberts,
WANG Xuerui, linux-mm, ying.huang, feng.tang
On 01.08.24 09:44, Yin, Fengwei wrote:
> Hi David,
>
> On 8/1/2024 2:49 PM, David Hildenbrand wrote:
>> We now have to do a page_folio(page) and then test for hugetlb.
>>
>> return folio_test_hugetlb(page_folio(page));
>>
>> Nowadays, folio_test_hugetlb() will be faster than at c0bff412e6 times,
>> so maybe at least part of the overhead is gone.
> This is great. We will check the trend to see whether it has recovered
> to some extent.
Oh, I think d99e3140a4d33e26066183ff727d8f02f56bec64 went upstream
before c0bff412e67b781d761e330ff9578aa9ed2be79e, so at the time of
c0bff412e6 we already should have had the faster check!
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
2024-08-01 6:49 ` David Hildenbrand
2024-08-01 7:44 ` Yin, Fengwei
@ 2024-08-01 13:30 ` Mateusz Guzik
2024-08-01 13:34 ` David Hildenbrand
1 sibling, 1 reply; 22+ messages in thread
From: Mateusz Guzik @ 2024-08-01 13:30 UTC (permalink / raw)
To: David Hildenbrand
Cc: Yin, Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
linux-mm, ying.huang, feng.tang
On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
> Yes indeed. fork() can be extremely sensitive to each added instruction.
>
> I even pointed out to Peter why I didn't add the PageHuge check in there
> originally [1].
>
> "Well, and I didn't want to have runtime-hugetlb checks in
> PageAnonExclusive code called on certainly-not-hugetlb code paths."
>
>
> We now have to do a page_folio(page) and then test for hugetlb.
>
> return folio_test_hugetlb(page_folio(page));
>
> Nowadays, folio_test_hugetlb() will be faster than at c0bff412e6 times, so
> maybe at least part of the overhead is gone.
>
I'll note page_folio expands to a call to _compound_head.
While _compound_head is declared as an inline, it ends up being big
enough that the compiler decides to emit a real function instead and
real func calls are not particularly cheap.
I had a brief look with a profiler myself and for single-threaded usage
the func is quite high up there, while it manages to get out with the
first branch -- that is to say there is definitely performance lost for
having a func call instead of an inlined branch.
The routine is deinlined because of a call to page_fixed_fake_head,
which itself is annotated with always_inline.
This is of course patchable with minor shoveling.
I did not go for it because stress-ng results were too unstable for me
to confidently state win/loss.
But should you want to whack the regression, this is what I would look
into.
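For context, a simplified sketch of the chain in question (loosely based
on include/linux/page-flags.h at the time, not the exact source):

/* page_folio(p) boils down to _compound_head(p) */
static inline unsigned long _compound_head(const struct page *page)
{
	unsigned long head = READ_ONCE(page->compound_head);

	if (unlikely(head & 1))
		return head - 1;
	/*
	 * page_fixed_fake_head() is __always_inline and large enough that
	 * the compiler ends up emitting _compound_head() as a real,
	 * out-of-line function at call sites such as page_folio().
	 */
	return (unsigned long)page_fixed_fake_head(page);
}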
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
2024-08-01 13:30 ` Mateusz Guzik
@ 2024-08-01 13:34 ` David Hildenbrand
2024-08-01 13:37 ` Mateusz Guzik
0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2024-08-01 13:34 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Yin, Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
linux-mm, ying.huang, feng.tang
On 01.08.24 15:30, Mateusz Guzik wrote:
> On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
>> Yes indeed. fork() can be extremely sensitive to each added instruction.
>>
>> I even pointed out to Peter why I didn't add the PageHuge check in there
>> originally [1].
>>
>> "Well, and I didn't want to have runtime-hugetlb checks in
>> PageAnonExclusive code called on certainly-not-hugetlb code paths."
>>
>>
>> We now have to do a page_folio(page) and then test for hugetlb.
>>
>> return folio_test_hugetlb(page_folio(page));
>>
>> Nowadays, folio_test_hugetlb() will be faster than at c0bff412e6 times, so
>> maybe at least part of the overhead is gone.
>>
>
> I'll note page_folio expands to a call to _compound_head.
>
> While _compound_head is declared as an inline, it ends up being big
> enough that the compiler decides to emit a real function instead and
> real func calls are not particularly cheap.
>
> I had a brief look with a profiler myself and for single-threaded usage
> the func is quite high up there, while it manages to get out with the
> first branch -- that is to say there is definitely performance lost for
> having a func call instead of an inlined branch.
>
> The routine is deinlined because of a call to page_fixed_fake_head,
> which itself is annotated with always_inline.
>
> This is of course patchable with minor shoveling.
>
> I did not go for it because stress-ng results were too unstable for me
> to confidently state win/loss.
>
> But should you want to whack the regression, this is what I would look
> into.
>
This might improve it, at least for small folios I guess:
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5769fe6e4950..7796ae116018 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -1086,7 +1086,7 @@ PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
*/
static inline bool PageHuge(const struct page *page)
{
- return folio_test_hugetlb(page_folio(page));
+ return PageCompound(page) && folio_test_hugetlb(page_folio(page));
}
/*
We would avoid the function call for small folios.
--
Cheers,
David / dhildenb
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
2024-08-01 13:34 ` David Hildenbrand
@ 2024-08-01 13:37 ` Mateusz Guzik
2024-08-01 13:44 ` David Hildenbrand
0 siblings, 1 reply; 22+ messages in thread
From: Mateusz Guzik @ 2024-08-01 13:37 UTC (permalink / raw)
To: David Hildenbrand
Cc: Yin, Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
linux-mm, ying.huang, feng.tang
On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 01.08.24 15:30, Mateusz Guzik wrote:
> > On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
> >> Yes indeed. fork() can be extremely sensitive to each added instruction.
> >>
> >> I even pointed out to Peter why I didn't add the PageHuge check in there
> >> originally [1].
> >>
> >> "Well, and I didn't want to have runtime-hugetlb checks in
> >> PageAnonExclusive code called on certainly-not-hugetlb code paths."
> >>
> >>
> >> We now have to do a page_folio(page) and then test for hugetlb.
> >>
> >> return folio_test_hugetlb(page_folio(page));
> >>
> >> Nowadays, folio_test_hugetlb() will be faster than at c0bff412e6 times, so
> >> maybe at least part of the overhead is gone.
> >>
> >
> > I'll note page_folio expands to a call to _compound_head.
> >
> > While _compound_head is declared as an inline, it ends up being big
> > enough that the compiler decides to emit a real function instead and
> > real func calls are not particularly cheap.
> >
> > I had a brief look with a profiler myself and for single-threaded usage
> > the func is quite high up there, while it manages to get out with the
> > first branch -- that is to say there is definitely performance lost for
> > having a func call instead of an inlined branch.
> >
> > The routine is deinlined because of a call to page_fixed_fake_head,
> > which itself is annotated with always_inline.
> >
> > This is of course patchable with minor shoveling.
> >
> > I did not go for it because stress-ng results were too unstable for me
> > to confidently state win/loss.
> >
> > But should you want to whack the regression, this is what I would look
> > into.
> >
>
> This might improve it, at least for small folios I guess:
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 5769fe6e4950..7796ae116018 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -1086,7 +1086,7 @@ PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
> */
> static inline bool PageHuge(const struct page *page)
> {
> - return folio_test_hugetlb(page_folio(page));
> + return PageCompound(page) && folio_test_hugetlb(page_folio(page));
> }
>
> /*
>
>
> We would avoid the function call for small folios.
>
why not massage _compound_head back to an inlineable form instead? for
all i know you may even register a small win in total
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
2024-08-01 13:37 ` Mateusz Guzik
@ 2024-08-01 13:44 ` David Hildenbrand
2024-08-12 4:43 ` Yin Fengwei
0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2024-08-01 13:44 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Yin, Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
linux-mm, ying.huang, feng.tang
On 01.08.24 15:37, Mateusz Guzik wrote:
> On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 01.08.24 15:30, Mateusz Guzik wrote:
>>> On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
>>>> Yes indeed. fork() can be extremely sensitive to each added instruction.
>>>>
>>>> I even pointed out to Peter why I didn't add the PageHuge check in there
>>>> originally [1].
>>>>
>>>> "Well, and I didn't want to have runtime-hugetlb checks in
>>>> PageAnonExclusive code called on certainly-not-hugetlb code paths."
>>>>
>>>>
>>>> We now have to do a page_folio(page) and then test for hugetlb.
>>>>
>>>> return folio_test_hugetlb(page_folio(page));
>>>>
>>>> Nowadays, folio_test_hugetlb() will be faster than at c0bff412e6 times, so
>>>> maybe at least part of the overhead is gone.
>>>>
>>>
>>> I'll note page_folio expands to a call to _compound_head.
>>>
>>> While _compound_head is declared as an inline, it ends up being big
>>> enough that the compiler decides to emit a real function instead and
>>> real func calls are not particularly cheap.
>>>
>>> I had a brief look with a profiler myself and for single-threaded usage
>>> the func is quite high up there, while it manages to get out with the
>>> first branch -- that is to say there is definitely performance lost for
>>> having a func call instead of an inlined branch.
>>>
>>> The routine is deinlined because of a call to page_fixed_fake_head,
>>> which itself is annotated with always_inline.
>>>
>>> This is of course patchable with minor shoveling.
>>>
>>> I did not go for it because stress-ng results were too unstable for me
>>> to confidently state win/loss.
>>>
>>> But should you want to whack the regression, this is what I would look
>>> into.
>>>
>>
>> This might improve it, at least for small folios I guess:
>>
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index 5769fe6e4950..7796ae116018 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -1086,7 +1086,7 @@ PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
>> */
>> static inline bool PageHuge(const struct page *page)
>> {
>> - return folio_test_hugetlb(page_folio(page));
>> + return PageCompound(page) && folio_test_hugetlb(page_folio(page));
>> }
>>
>> /*
>>
>>
>> We would avoid the function call for small folios.
>>
>
> why not massage _compound_head back to an inlineable form instead? for
> all i know you may even register a small win in total
Agreed, likely it will increase code size a bit which is why the
compiler decides to not inline. We could force it with __always_inline.
Finding ways to shrink page_fixed_fake_head() might be even better.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
2024-08-01 13:44 ` David Hildenbrand
@ 2024-08-12 4:43 ` Yin Fengwei
2024-08-12 4:49 ` Mateusz Guzik
0 siblings, 1 reply; 22+ messages in thread
From: Yin Fengwei @ 2024-08-12 4:43 UTC (permalink / raw)
To: David Hildenbrand, Mateusz Guzik
Cc: kernel test robot, Peter Xu, oe-lkp, lkp, linux-kernel,
Andrew Morton, Huacai Chen, Jason Gunthorpe, Matthew Wilcox,
Nathan Chancellor, Ryan Roberts, WANG Xuerui, linux-mm,
ying.huang, feng.tang
Hi David,
On 8/1/24 09:44, David Hildenbrand wrote:
> On 01.08.24 15:37, Mateusz Guzik wrote:
>> On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@redhat.com>
>> wrote:
>>>
>>> On 01.08.24 15:30, Mateusz Guzik wrote:
>>>> On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
>>>>> Yes indeed. fork() can be extremely sensitive to each added
>>>>> instruction.
>>>>>
>>>>> I even pointed out to Peter why I didn't add the PageHuge check in
>>>>> there
>>>>> originally [1].
>>>>>
>>>>> "Well, and I didn't want to have runtime-hugetlb checks in
>>>>> PageAnonExclusive code called on certainly-not-hugetlb code paths."
>>>>>
>>>>>
>>>>> We now have to do a page_folio(page) and then test for hugetlb.
>>>>>
>>>>> return folio_test_hugetlb(page_folio(page));
>>>>>
>>>>> Nowadays, folio_test_hugetlb() will be faster than at c0bff412e6
>>>>> times, so
>>>>> maybe at least part of the overhead is gone.
>>>>>
>>>>
>>>> I'll note page_folio expands to a call to _compound_head.
>>>>
>>>> While _compound_head is declared as an inline, it ends up being big
>>>> enough that the compiler decides to emit a real function instead and
>>>> real func calls are not particularly cheap.
>>>>
>>>> I had a brief look with a profiler myself and for single-threaded usage
>>>> the func is quite high up there, while it manages to get out with the
>>>> first branch -- that is to say there is definitely performance lost for
>>>> having a func call instead of an inlined branch.
>>>>
>>>> The routine is deinlined because of a call to page_fixed_fake_head,
>>>> which itself is annotated with always_inline.
>>>>
>>>> This is of course patchable with minor shoveling.
>>>>
>>>> I did not go for it because stress-ng results were too unstable for me
>>>> to confidently state win/loss.
>>>>
>>>> But should you want to whack the regression, this is what I would look
>>>> into.
>>>>
>>>
>>> This might improve it, at least for small folios I guess:
Do you want us to test this change? Or you have further optimization
ongoing? Thanks.
Regards
Yin, Fengwei
>>>
>>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>>> index 5769fe6e4950..7796ae116018 100644
>>> --- a/include/linux/page-flags.h
>>> +++ b/include/linux/page-flags.h
>>> @@ -1086,7 +1086,7 @@ PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
>>> */
>>> static inline bool PageHuge(const struct page *page)
>>> {
>>> - return folio_test_hugetlb(page_folio(page));
>>> + return PageCompound(page) && folio_test_hugetlb(page_folio(page));
>>> }
>>>
>>> /*
>>>
>>>
>>> We would avoid the function call for small folios.
>>>
>>
>> why not massage _compound_head back to an inlineable form instead? for
>> all i know you may even register a small win in total
>
> Agreed, likely it will increase code size a bit which is why the
> compiler decides to not inline. We could force it with __always_inline.
>
> Finding ways to shrink page_fixed_fake_head() might be even better.
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
2024-08-12 4:43 ` Yin Fengwei
@ 2024-08-12 4:49 ` Mateusz Guzik
2024-08-12 8:12 ` David Hildenbrand
2024-08-13 7:09 ` Yin Fengwei
0 siblings, 2 replies; 22+ messages in thread
From: Mateusz Guzik @ 2024-08-12 4:49 UTC (permalink / raw)
To: Yin Fengwei
Cc: David Hildenbrand, kernel test robot, Peter Xu, oe-lkp, lkp,
linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
linux-mm, ying.huang, feng.tang
On Mon, Aug 12, 2024 at 12:43:08PM +0800, Yin Fengwei wrote:
> Hi David,
>
> On 8/1/24 09:44, David Hildenbrand wrote:
> > On 01.08.24 15:37, Mateusz Guzik wrote:
> > > On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@redhat.com>
> > > wrote:
> > > >
> > > > On 01.08.24 15:30, Mateusz Guzik wrote:
> > > > > On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
> > > > > > Yes indeed. fork() can be extremely sensitive to each
> > > > > > added instruction.
> > > > > >
> > > > > > I even pointed out to Peter why I didn't add the
> > > > > > PageHuge check in there
> > > > > > originally [1].
> > > > > >
> > > > > > "Well, and I didn't want to have runtime-hugetlb checks in
> > > > > > PageAnonExclusive code called on certainly-not-hugetlb code paths."
> > > > > >
> > > > > >
> > > > > > We now have to do a page_folio(page) and then test for hugetlb.
> > > > > >
> > > > > > return folio_test_hugetlb(page_folio(page));
> > > > > >
> > > > > > Nowadays, folio_test_hugetlb() will be faster than at
> > > > > > c0bff412e6 times, so
> > > > > > maybe at least part of the overhead is gone.
> > > > > >
> > > > >
> > > > > I'll note page_folio expands to a call to _compound_head.
> > > > >
> > > > > While _compound_head is declared as an inline, it ends up being big
> > > > > enough that the compiler decides to emit a real function instead and
> > > > > real func calls are not particularly cheap.
> > > > >
> > > > > I had a brief look with a profiler myself and for single-threaded usage
> > > > > the func is quite high up there, while it manages to get out with the
> > > > > first branch -- that is to say there is definitely performance lost for
> > > > > having a func call instead of an inlined branch.
> > > > >
> > > > > The routine is deinlined because of a call to page_fixed_fake_head,
> > > > > which itself is annotated with always_inline.
> > > > >
> > > > > This is of course patchable with minor shoveling.
> > > > >
> > > > > I did not go for it because stress-ng results were too unstable for me
> > > > > to confidently state win/loss.
> > > > >
> > > > > But should you want to whack the regression, this is what I would look
> > > > > into.
> > > > >
> > > >
> > > > This might improve it, at least for small folios I guess:
> Do you want us to test this change? Or you have further optimization
> ongoing? Thanks.
I verified the thing below boots, I have no idea about performance. If
it helps it can be massaged later from style perspective.
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5769fe6e4950..2d5d61ab385b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -194,34 +194,13 @@ enum pageflags {
#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
DECLARE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
-/*
- * Return the real head page struct iff the @page is a fake head page, otherwise
- * return the @page itself. See Documentation/mm/vmemmap_dedup.rst.
- */
+const struct page *_page_fixed_fake_head(const struct page *page);
+
static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
{
if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
return page;
-
- /*
- * Only addresses aligned with PAGE_SIZE of struct page may be fake head
- * struct page. The alignment check aims to avoid access the fields (
- * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly)
- * cold cacheline in some cases.
- */
- if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
- test_bit(PG_head, &page->flags)) {
- /*
- * We can safely access the field of the @page[1] with PG_head
- * because the @page is a compound page composed with at least
- * two contiguous pages.
- */
- unsigned long head = READ_ONCE(page[1].compound_head);
-
- if (likely(head & 1))
- return (const struct page *)(head - 1);
- }
- return page;
+ return _page_fixed_fake_head(page);
}
#else
static inline const struct page *page_fixed_fake_head(const struct page *page)
@@ -235,7 +214,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
return page_fixed_fake_head(page) != page;
}
-static inline unsigned long _compound_head(const struct page *page)
+static __always_inline unsigned long _compound_head(const struct page *page)
{
unsigned long head = READ_ONCE(page->compound_head);
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 829112b0a914..3fbc00db607a 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -19,6 +19,33 @@
#include <asm/tlbflush.h>
#include "hugetlb_vmemmap.h"
+/*
+ * Return the real head page struct iff the @page is a fake head page, otherwise
+ * return the @page itself. See Documentation/mm/vmemmap_dedup.rst.
+ */
+const struct page *_page_fixed_fake_head(const struct page *page)
+{
+ /*
+ * Only addresses aligned with PAGE_SIZE of struct page may be fake head
+ * struct page. The alignment check aims to avoid access the fields (
+ * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly)
+ * cold cacheline in some cases.
+ */
+ if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
+ test_bit(PG_head, &page->flags)) {
+ /*
+ * We can safely access the field of the @page[1] with PG_head
+ * because the @page is a compound page composed with at least
+ * two contiguous pages.
+ */
+ unsigned long head = READ_ONCE(page[1].compound_head);
+
+ if (likely(head & 1))
+ return (const struct page *)(head - 1);
+ }
+ return page;
+}
+
/**
* struct vmemmap_remap_walk - walk vmemmap page table
*
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
2024-08-12 4:49 ` Mateusz Guzik
@ 2024-08-12 8:12 ` David Hildenbrand
2024-08-12 8:18 ` Mateusz Guzik
2024-08-13 7:09 ` Yin Fengwei
1 sibling, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2024-08-12 8:12 UTC (permalink / raw)
To: Mateusz Guzik, Yin Fengwei
Cc: kernel test robot, Peter Xu, oe-lkp, lkp, linux-kernel,
Andrew Morton, Huacai Chen, Jason Gunthorpe, Matthew Wilcox,
Nathan Chancellor, Ryan Roberts, WANG Xuerui, linux-mm,
ying.huang, feng.tang
On 12.08.24 06:49, Mateusz Guzik wrote:
> On Mon, Aug 12, 2024 at 12:43:08PM +0800, Yin Fengwei wrote:
>> Hi David,
>>
>> On 8/1/24 09:44, David Hildenbrand wrote:
>>> On 01.08.24 15:37, Mateusz Guzik wrote:
>>>> On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@redhat.com>
>>>> wrote:
>>>>>
>>>>> On 01.08.24 15:30, Mateusz Guzik wrote:
>>>>>> On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
>>>>>>> Yes indeed. fork() can be extremely sensitive to each
>>>>>>> added instruction.
>>>>>>>
>>>>>>> I even pointed out to Peter why I didn't add the
>>>>>>> PageHuge check in there
>>>>>>> originally [1].
>>>>>>>
>>>>>>> "Well, and I didn't want to have runtime-hugetlb checks in
>>>>>>> PageAnonExclusive code called on certainly-not-hugetlb code paths."
>>>>>>>
>>>>>>>
>>>>>>> We now have to do a page_folio(page) and then test for hugetlb.
>>>>>>>
>>>>>>> return folio_test_hugetlb(page_folio(page));
>>>>>>>
>>>>>>> Nowadays, folio_test_hugetlb() will be faster than at
>>>>>>> c0bff412e6 times, so
>>>>>>> maybe at least part of the overhead is gone.
>>>>>>>
>>>>>>
>>>>>> I'll note page_folio expands to a call to _compound_head.
>>>>>>
>>>>>> While _compound_head is declared as an inline, it ends up being big
>>>>>> enough that the compiler decides to emit a real function instead and
>>>>>> real func calls are not particularly cheap.
>>>>>>
>>>>>> I had a brief look with a profiler myself and for single-threaded usage
>>>>>> the func is quite high up there, while it manages to get out with the
>>>>>> first branch -- that is to say there is definitely performance lost for
>>>>>> having a func call instead of an inlined branch.
>>>>>>
>>>>>> The routine is deinlined because of a call to page_fixed_fake_head,
>>>>>> which itself is annotated with always_inline.
>>>>>>
>>>>>> This is of course patchable with minor shoveling.
>>>>>>
>>>>>> I did not go for it because stress-ng results were too unstable for me
>>>>>> to confidently state win/loss.
>>>>>>
>>>>>> But should you want to whack the regression, this is what I would look
>>>>>> into.
>>>>>>
>>>>>
>>>>> This might improve it, at least for small folios I guess:
>> Do you want us to test this change? Or you have further optimization
>> ongoing? Thanks.
>
> I verified the thing below boots, I have no idea about performance. If
> it helps it can be massaged later from style perspective.
As quite a lot of setups already run with the vmemmap optimization enabled, I
wonder how effective this would be (might need more fine tuning, did not look
at the generated code):
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 085dd8dcbea2..7ddcdbd712ec 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -233,7 +233,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
return page_fixed_fake_head(page) != page;
}
-static inline unsigned long _compound_head(const struct page *page)
+static __always_inline unsigned long _compound_head(const struct page *page)
{
unsigned long head = READ_ONCE(page->compound_head);
--
Cheers,
David / dhildenb
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
2024-08-12 8:12 ` David Hildenbrand
@ 2024-08-12 8:18 ` Mateusz Guzik
2024-08-12 8:23 ` David Hildenbrand
0 siblings, 1 reply; 22+ messages in thread
From: Mateusz Guzik @ 2024-08-12 8:18 UTC (permalink / raw)
To: David Hildenbrand
Cc: Yin Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
linux-mm, ying.huang, feng.tang
On Mon, Aug 12, 2024 at 10:12 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.08.24 06:49, Mateusz Guzik wrote:
> > On Mon, Aug 12, 2024 at 12:43:08PM +0800, Yin Fengwei wrote:
> >> Hi David,
> >>
> >> On 8/1/24 09:44, David Hildenbrand wrote:
> >>> On 01.08.24 15:37, Mateusz Guzik wrote:
> >>>> On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@redhat.com>
> >>>> wrote:
> >>>>>
> >>>>> On 01.08.24 15:30, Mateusz Guzik wrote:
> >>>>>> On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
> >>>>>>> Yes indeed. fork() can be extremely sensitive to each
> >>>>>>> added instruction.
> >>>>>>>
> >>>>>>> I even pointed out to Peter why I didn't add the
> >>>>>>> PageHuge check in there
> >>>>>>> originally [1].
> >>>>>>>
> >>>>>>> "Well, and I didn't want to have runtime-hugetlb checks in
> >>>>>>> PageAnonExclusive code called on certainly-not-hugetlb code paths."
> >>>>>>>
> >>>>>>>
> >>>>>>> We now have to do a page_folio(page) and then test for hugetlb.
> >>>>>>>
> >>>>>>> return folio_test_hugetlb(page_folio(page));
> >>>>>>>
> >>>>>>> Nowadays, folio_test_hugetlb() will be faster than at
> >>>>>>> c0bff412e6 times, so
> >>>>>>> maybe at least part of the overhead is gone.
> >>>>>>>
> >>>>>>
> >>>>>> I'll note page_folio expands to a call to _compound_head.
> >>>>>>
> >>>>>> While _compound_head is declared as an inline, it ends up being big
> >>>>>> enough that the compiler decides to emit a real function instead and
> >>>>>> real func calls are not particularly cheap.
> >>>>>>
> >>>>>> I had a brief look with a profiler myself and for single-threaded usage
> >>>>>> the func is quite high up there, while it manages to get out with the
> >>>>>> first branch -- that is to say there is definitely performance lost for
> >>>>>> having a func call instead of an inlined branch.
> >>>>>>
> >>>>>> The routine is deinlined because of a call to page_fixed_fake_head,
> >>>>>> which itself is annotated with always_inline.
> >>>>>>
> >>>>>> This is of course patchable with minor shoveling.
> >>>>>>
> >>>>>> I did not go for it because stress-ng results were too unstable for me
> >>>>>> to confidently state win/loss.
> >>>>>>
> >>>>>> But should you want to whack the regression, this is what I would look
> >>>>>> into.
> >>>>>>
> >>>>>
> >>>>> This might improve it, at least for small folios I guess:
> >> Do you want us to test this change? Or you have further optimization
> >> ongoing? Thanks.
> >
> > I verified the thing below boots, I have no idea about performance. If
> > it helps it can be massaged later from style perspective.
>
> As quite a lot of setups already run with the vmemmap optimization enabled, I
> wonder how effective this would be (might need more fine tuning, did not look
> at the generated code):
>
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 085dd8dcbea2..7ddcdbd712ec 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -233,7 +233,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
> return page_fixed_fake_head(page) != page;
> }
>
> -static inline unsigned long _compound_head(const struct page *page)
> +static __always_inline unsigned long _compound_head(const struct page *page)
> {
> unsigned long head = READ_ONCE(page->compound_head);
>
>
Well one may need to justify it with bloat-o-meter which is why I did
not just straight up inline the entire thing.
But if you are down to fight opposition of the sort I agree this is
the patch to benchmark. :)
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
2024-08-12 8:18 ` Mateusz Guzik
@ 2024-08-12 8:23 ` David Hildenbrand
0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2024-08-12 8:23 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Yin Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
linux-mm, ying.huang, feng.tang
On 12.08.24 10:18, Mateusz Guzik wrote:
> On Mon, Aug 12, 2024 at 10:12 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 12.08.24 06:49, Mateusz Guzik wrote:
>>> On Mon, Aug 12, 2024 at 12:43:08PM +0800, Yin Fengwei wrote:
>>>> Hi David,
>>>>
>>>> On 8/1/24 09:44, David Hildenbrand wrote:
>>>>> On 01.08.24 15:37, Mateusz Guzik wrote:
>>>>>> On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@redhat.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> On 01.08.24 15:30, Mateusz Guzik wrote:
>>>>>>>> On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
>>>>>>>>> Yes indeed. fork() can be extremely sensitive to each
>>>>>>>>> added instruction.
>>>>>>>>>
>>>>>>>>> I even pointed out to Peter why I didn't add the
>>>>>>>>> PageHuge check in there
>>>>>>>>> originally [1].
>>>>>>>>>
>>>>>>>>> "Well, and I didn't want to have runtime-hugetlb checks in
>>>>>>>>> PageAnonExclusive code called on certainly-not-hugetlb code paths."
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> We now have to do a page_folio(page) and then test for hugetlb.
>>>>>>>>>
>>>>>>>>> return folio_test_hugetlb(page_folio(page));
>>>>>>>>>
>>>>>>>>> Nowadays, folio_test_hugetlb() will be faster than at
>>>>>>>>> c0bff412e6 times, so
>>>>>>>>> maybe at least part of the overhead is gone.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I'll note page_folio expands to a call to _compound_head.
>>>>>>>>
>>>>>>>> While _compound_head is declared as an inline, it ends up being big
>>>>>>>> enough that the compiler decides to emit a real function instead and
>>>>>>>> real func calls are not particularly cheap.
>>>>>>>>
>>>>>>>> I had a brief look with a profiler myself and for single-threaded usage
>>>>>>>> the func is quite high up there, while it manages to get out with the
>>>>>>>> first branch -- that is to say there is definitely performance lost for
>>>>>>>> having a func call instead of an inlined branch.
>>>>>>>>
>>>>>>>> The routine is deinlined because of a call to page_fixed_fake_head,
>>>>>>>> which itself is annotated with always_inline.
>>>>>>>>
>>>>>>>> This is of course patchable with minor shoveling.
>>>>>>>>
>>>>>>>> I did not go for it because stress-ng results were too unstable for me
>>>>>>>> to confidently state win/loss.
>>>>>>>>
>>>>>>>> But should you want to whack the regression, this is what I would look
>>>>>>>> into.
>>>>>>>>
>>>>>>>
>>>>>>> This might improve it, at least for small folios I guess:
>>>> Do you want us to test this change? Or you have further optimization
>>>> ongoing? Thanks.
>>>
>>> I verified the thing below boots, I have no idea about performance. If
>>> it helps it can be massaged later from style perspective.
>>
>> As quite a lot of setups already run with the vmemmap optimization enabled, I
>> wonder how effective this would be (might need more fine tuning, did not look
>> at the generated code):
>>
>>
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index 085dd8dcbea2..7ddcdbd712ec 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -233,7 +233,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
>> return page_fixed_fake_head(page) != page;
>> }
>>
>> -static inline unsigned long _compound_head(const struct page *page)
>> +static __always_inline unsigned long _compound_head(const struct page *page)
>> {
>> unsigned long head = READ_ONCE(page->compound_head);
>>
>>
>
> Well one may need to justify it with bloat-o-meter which is why I did
> not just straight up inline the entire thing.
>
> But if you are down to fight opposition of the sort I agree this is
> the patch to benchmark. :)
I spotted that we already do that for
PageHead()/PageTail()/page_is_fake_head(). So we effectively
force-inline page_fixed_fake_head() everywhere except into _compound_head(), I think.
But yeah, measuring the bloat would be a necessary exercise.
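(For readers following along, this is roughly the shape of the path under discussion;
a simplified paraphrase of the mainline code of that era, not a verbatim copy:

	/* page_folio(p) boils down to a call to _compound_head(p) */
	static inline unsigned long _compound_head(const struct page *page)
	{
		unsigned long head = READ_ONCE(page->compound_head);

		/* tail page: bit 0 of compound_head marks "tail", the rest points at the head */
		if (unlikely(head & 1))
			return head - 1;
		/*
		 * Head or order-0 page: with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP the
		 * fake-head check below is __always_inline and large enough that the
		 * compiler stops inlining _compound_head() itself, hence the
		 * out-of-line call showing up in the profiles discussed above.
		 */
		return (unsigned long)page_fixed_fake_head(page);
	}

Force-inlining _compound_head(), or moving the fake-head slow path out of line as in
Mateusz's patch, removes that call from the common case, which is then just a load
and a bit test.)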
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
2024-08-12 4:49 ` Mateusz Guzik
2024-08-12 8:12 ` David Hildenbrand
@ 2024-08-13 7:09 ` Yin Fengwei
2024-08-13 7:14 ` Mateusz Guzik
1 sibling, 1 reply; 22+ messages in thread
From: Yin Fengwei @ 2024-08-13 7:09 UTC (permalink / raw)
To: Mateusz Guzik
Cc: David Hildenbrand, kernel test robot, Peter Xu, oe-lkp, lkp,
linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
linux-mm, ying.huang, feng.tang
On 8/12/24 00:49, Mateusz Guzik wrote:
> On Mon, Aug 12, 2024 at 12:43:08PM +0800, Yin Fengwei wrote:
>> Hi David,
>>
>> On 8/1/24 09:44, David Hildenbrand wrote:
>>> On 01.08.24 15:37, Mateusz Guzik wrote:
>>>> On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@redhat.com>
>>>> wrote:
>>>>>
>>>>> On 01.08.24 15:30, Mateusz Guzik wrote:
>>>>>> On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
>>>>>>> Yes indeed. fork() can be extremely sensitive to each
>>>>>>> added instruction.
>>>>>>>
>>>>>>> I even pointed out to Peter why I didn't add the
>>>>>>> PageHuge check in there
>>>>>>> originally [1].
>>>>>>>
>>>>>>> "Well, and I didn't want to have runtime-hugetlb checks in
>>>>>>> PageAnonExclusive code called on certainly-not-hugetlb code paths."
>>>>>>>
>>>>>>>
>>>>>>> We now have to do a page_folio(page) and then test for hugetlb.
>>>>>>>
>>>>>>> return folio_test_hugetlb(page_folio(page));
>>>>>>>
>>>>>>> Nowadays, folio_test_hugetlb() will be faster than at
>>>>>>> c0bff412e6 times, so
>>>>>>> maybe at least part of the overhead is gone.
>>>>>>>
>>>>>>
>>>>>> I'll note page_folio expands to a call to _compound_head.
>>>>>>
>>>>>> While _compound_head is declared as an inline, it ends up being big
>>>>>> enough that the compiler decides to emit a real function instead and
>>>>>> real func calls are not particularly cheap.
>>>>>>
>>>>>> I had a brief look with a profiler myself and for single-threaded usage
>>>>>> the func is quite high up there, while it manages to get out with the
>>>>>> first branch -- that is to say there is definitely performance lost for
>>>>>> having a func call instead of an inlined branch.
>>>>>>
>>>>>> The routine is deinlined because of a call to page_fixed_fake_head,
>>>>>> which itself is annotated with always_inline.
>>>>>>
>>>>>> This is of course patchable with minor shoveling.
>>>>>>
>>>>>> I did not go for it because stress-ng results were too unstable for me
>>>>>> to confidently state win/loss.
>>>>>>
>>>>>> But should you want to whack the regression, this is what I would look
>>>>>> into.
>>>>>>
>>>>>
>>>>> This might improve it, at least for small folios I guess:
>> Do you want us to test this change? Or you have further optimization
>> ongoing? Thanks.
>
> I verified the thing below boots, I have no idea about performance. If
> it helps it can be massaged later from style perspective.
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 5769fe6e4950..2d5d61ab385b 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -194,34 +194,13 @@ enum pageflags {
> #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> DECLARE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
>
> -/*
> - * Return the real head page struct iff the @page is a fake head page, otherwise
> - * return the @page itself. See Documentation/mm/vmemmap_dedup.rst.
> - */
> +const struct page *_page_fixed_fake_head(const struct page *page);
> +
> static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
> {
> if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
> return page;
> -
> - /*
> - * Only addresses aligned with PAGE_SIZE of struct page may be fake head
> - * struct page. The alignment check aims to avoid access the fields (
> - * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly)
> - * cold cacheline in some cases.
> - */
> - if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
> - test_bit(PG_head, &page->flags)) {
> - /*
> - * We can safely access the field of the @page[1] with PG_head
> - * because the @page is a compound page composed with at least
> - * two contiguous pages.
> - */
> - unsigned long head = READ_ONCE(page[1].compound_head);
> -
> - if (likely(head & 1))
> - return (const struct page *)(head - 1);
> - }
> - return page;
> + return _page_fixed_fake_head(page);
> }
> #else
> static inline const struct page *page_fixed_fake_head(const struct page *page)
> @@ -235,7 +214,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
> return page_fixed_fake_head(page) != page;
> }
>
> -static inline unsigned long _compound_head(const struct page *page)
> +static __always_inline unsigned long _compound_head(const struct page *page)
> {
> unsigned long head = READ_ONCE(page->compound_head);
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 829112b0a914..3fbc00db607a 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -19,6 +19,33 @@
> #include <asm/tlbflush.h>
> #include "hugetlb_vmemmap.h"
>
> +/*
> + * Return the real head page struct iff the @page is a fake head page, otherwise
> + * return the @page itself. See Documentation/mm/vmemmap_dedup.rst.
> + */
> +const struct page *_page_fixed_fake_head(const struct page *page)
> +{
> + /*
> + * Only addresses aligned with PAGE_SIZE of struct page may be fake head
> + * struct page. The alignment check aims to avoid access the fields (
> + * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly)
> + * cold cacheline in some cases.
> + */
> + if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
> + test_bit(PG_head, &page->flags)) {
> + /*
> + * We can safely access the field of the @page[1] with PG_head
> + * because the @page is a compound page composed with at least
> + * two contiguous pages.
> + */
> + unsigned long head = READ_ONCE(page[1].compound_head);
> +
> + if (likely(head & 1))
> + return (const struct page *)(head - 1);
> + }
> + return page;
> +}
> +
> /**
> * struct vmemmap_remap_walk - walk vmemmap page table
> *
>
The change resolves the regression (from -3% to +0.5%):
Please note:
9cb28da54643ad464c47585cd5866c30b0218e67 is the parent commit
3f16e4b516ef02d9461b7e0b6c50e05ba0811886 is the commit with the above patch
c0bff412e67b781d761e330ff9578aa9ed2be79e is the commit which introduced the regression
=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_threads/testtime/test/cpufreq_governor/debug-setup:
lkp-icl-2sp8/stress-ng/debian-12-x86_64-20240206.cgz/x86_64-rhel-8.3/gcc-12/100%/60s/clone/performance/yfw_test2
commit:
9cb28da54643ad464c47585cd5866c30b0218e67
3f16e4b516ef02d9461b7e0b6c50e05ba0811886
c0bff412e67b781d761e330ff9578aa9ed2be79e
9cb28da54643ad46 3f16e4b516ef02d9461b7e0b6c5 c0bff412e67b781d761e330ff95
---------------- --------------------------- ---------------------------
fail:runs %reproduction fail:runs %reproduction fail:runs
| | | | |
         3:3            0%            3:3            0%            3:3    stress-ng.clone.microsecs_per_clone.pass
         3:3            0%            3:3            0%            3:3    stress-ng.clone.pass
      %stddev      %change      %stddev      %change      %stddev
          \            |            \            |            \
        2904         -0.6%         2886         +3.7%         3011    stress-ng.clone.microsecs_per_clone
      563520         +0.5%       566296         -3.1%       546122    stress-ng.clone.ops
        9306         +0.5%         9356         -3.0%         9024    stress-ng.clone.ops_per_sec
BTW, the change needs to export the _page_fixed_fake_head symbol, otherwise
some modules hit a build error.
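(A sketch of what that export might look like, next to the new out-of-line helper
in mm/hugetlb_vmemmap.c; whether plain EXPORT_SYMBOL or EXPORT_SYMBOL_GPL is wanted
would be a maintainer call:

	/* let modular users of compound_head()/page_folio() link against the helper */
	EXPORT_SYMBOL_GPL(_page_fixed_fake_head);

The alternative, of course, is David's one-line __always_inline change, which keeps
everything in the header and needs no export.)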
Regards
Yin, Fengwei
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
2024-08-13 7:09 ` Yin Fengwei
@ 2024-08-13 7:14 ` Mateusz Guzik
2024-08-14 3:02 ` Yin Fengwei
0 siblings, 1 reply; 22+ messages in thread
From: Mateusz Guzik @ 2024-08-13 7:14 UTC (permalink / raw)
To: Yin Fengwei
Cc: David Hildenbrand, kernel test robot, Peter Xu, oe-lkp, lkp,
linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
linux-mm, ying.huang, feng.tang
On Tue, Aug 13, 2024 at 9:09 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
>
> On 8/12/24 00:49, Mateusz Guzik wrote:
> > On Mon, Aug 12, 2024 at 12:43:08PM +0800, Yin Fengwei wrote:
> >> Hi David,
> >>
> >> On 8/1/24 09:44, David Hildenbrand wrote:
> >>> On 01.08.24 15:37, Mateusz Guzik wrote:
> >>>> On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@redhat.com>
> >>>> wrote:
> >>>>>
> >>>>> On 01.08.24 15:30, Mateusz Guzik wrote:
> >>>>>> On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
> >>>>>>> Yes indeed. fork() can be extremely sensitive to each
> >>>>>>> added instruction.
> >>>>>>>
> >>>>>>> I even pointed out to Peter why I didn't add the
> >>>>>>> PageHuge check in there
> >>>>>>> originally [1].
> >>>>>>>
> >>>>>>> "Well, and I didn't want to have runtime-hugetlb checks in
> >>>>>>> PageAnonExclusive code called on certainly-not-hugetlb code paths."
> >>>>>>>
> >>>>>>>
> >>>>>>> We now have to do a page_folio(page) and then test for hugetlb.
> >>>>>>>
> >>>>>>> return folio_test_hugetlb(page_folio(page));
> >>>>>>>
> >>>>>>> Nowadays, folio_test_hugetlb() will be faster than at
> >>>>>>> c0bff412e6 times, so
> >>>>>>> maybe at least part of the overhead is gone.
> >>>>>>>
> >>>>>>
> >>>>>> I'll note page_folio expands to a call to _compound_head.
> >>>>>>
> >>>>>> While _compound_head is declared as an inline, it ends up being big
> >>>>>> enough that the compiler decides to emit a real function instead and
> >>>>>> real func calls are not particularly cheap.
> >>>>>>
> >>>>>> I had a brief look with a profiler myself and for single-threaded usage
> >>>>>> the func is quite high up there, while it manages to get out with the
> >>>>>> first branch -- that is to say there is definitely performance lost for
> >>>>>> having a func call instead of an inlined branch.
> >>>>>>
> >>>>>> The routine is deinlined because of a call to page_fixed_fake_head,
> >>>>>> which itself is annotated with always_inline.
> >>>>>>
> >>>>>> This is of course patchable with minor shoveling.
> >>>>>>
> >>>>>> I did not go for it because stress-ng results were too unstable for me
> >>>>>> to confidently state win/loss.
> >>>>>>
> >>>>>> But should you want to whack the regression, this is what I would look
> >>>>>> into.
> >>>>>>
> >>>>>
> >>>>> This might improve it, at least for small folios I guess:
> >> Do you want us to test this change? Or you have further optimization
> >> ongoing? Thanks.
> >
> > I verified the thing below boots, I have no idea about performance. If
> > it helps it can be massaged later from style perspective.
> >
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 5769fe6e4950..2d5d61ab385b 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -194,34 +194,13 @@ enum pageflags {
> > #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> > DECLARE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
> >
> > -/*
> > - * Return the real head page struct iff the @page is a fake head page, otherwise
> > - * return the @page itself. See Documentation/mm/vmemmap_dedup.rst.
> > - */
> > +const struct page *_page_fixed_fake_head(const struct page *page);
> > +
> > static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
> > {
> > if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
> > return page;
> > -
> > - /*
> > - * Only addresses aligned with PAGE_SIZE of struct page may be fake head
> > - * struct page. The alignment check aims to avoid access the fields (
> > - * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly)
> > - * cold cacheline in some cases.
> > - */
> > - if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
> > - test_bit(PG_head, &page->flags)) {
> > - /*
> > - * We can safely access the field of the @page[1] with PG_head
> > - * because the @page is a compound page composed with at least
> > - * two contiguous pages.
> > - */
> > - unsigned long head = READ_ONCE(page[1].compound_head);
> > -
> > - if (likely(head & 1))
> > - return (const struct page *)(head - 1);
> > - }
> > - return page;
> > + return _page_fixed_fake_head(page);
> > }
> > #else
> > static inline const struct page *page_fixed_fake_head(const struct page *page)
> > @@ -235,7 +214,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
> > return page_fixed_fake_head(page) != page;
> > }
> >
> > -static inline unsigned long _compound_head(const struct page *page)
> > +static __always_inline unsigned long _compound_head(const struct page *page)
> > {
> > unsigned long head = READ_ONCE(page->compound_head);
> >
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index 829112b0a914..3fbc00db607a 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -19,6 +19,33 @@
> > #include <asm/tlbflush.h>
> > #include "hugetlb_vmemmap.h"
> >
> > +/*
> > + * Return the real head page struct iff the @page is a fake head page, otherwise
> > + * return the @page itself. See Documentation/mm/vmemmap_dedup.rst.
> > + */
> > +const struct page *_page_fixed_fake_head(const struct page *page)
> > +{
> > + /*
> > + * Only addresses aligned with PAGE_SIZE of struct page may be fake head
> > + * struct page. The alignment check aims to avoid access the fields (
> > + * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly)
> > + * cold cacheline in some cases.
> > + */
> > + if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
> > + test_bit(PG_head, &page->flags)) {
> > + /*
> > + * We can safely access the field of the @page[1] with PG_head
> > + * because the @page is a compound page composed with at least
> > + * two contiguous pages.
> > + */
> > + unsigned long head = READ_ONCE(page[1].compound_head);
> > +
> > + if (likely(head & 1))
> > + return (const struct page *)(head - 1);
> > + }
> > + return page;
> > +}
> > +
> > /**
> > * struct vmemmap_remap_walk - walk vmemmap page table
> > *
> >
>
> The change can resolve the regression (from -3% to 0.5%):
>
thanks for testing
would you mind benchmarking the change which merely force-inlines _compound_head?
https://lore.kernel.org/linux-mm/66c4fcc5-47f6-438c-a73a-3af6e19c3200@redhat.com/
> Please note:
> 9cb28da54643ad464c47585cd5866c30b0218e67 is the parent commit
> 3f16e4b516ef02d9461b7e0b6c50e05ba0811886 is the commit with above patch
> c0bff412e67b781d761e330ff9578aa9ed2be79e is the commit which introduced regression
>
>
> =========================================================================================
> tbox_group/testcase/rootfs/kconfig/compiler/nr_threads/testtime/test/cpufreq_governor/debug-setup:
>
> lkp-icl-2sp8/stress-ng/debian-12-x86_64-20240206.cgz/x86_64-rhel-8.3/gcc-12/100%/60s/clone/performance/yfw_test2
>
> commit:
> 9cb28da54643ad464c47585cd5866c30b0218e67
> 3f16e4b516ef02d9461b7e0b6c50e05ba0811886
> c0bff412e67b781d761e330ff9578aa9ed2be79e
>
> 9cb28da54643ad46 3f16e4b516ef02d9461b7e0b6c5 c0bff412e67b781d761e330ff95
> ---------------- --------------------------- ---------------------------
> fail:runs %reproduction fail:runs %reproduction fail:runs
> | | | | |
>          3:3            0%            3:3            0%            3:3    stress-ng.clone.microsecs_per_clone.pass
>          3:3            0%            3:3            0%            3:3    stress-ng.clone.pass
>       %stddev      %change      %stddev      %change      %stddev
>           \            |            \            |            \
>         2904         -0.6%         2886         +3.7%         3011    stress-ng.clone.microsecs_per_clone
>       563520         +0.5%       566296         -3.1%       546122    stress-ng.clone.ops
>         9306         +0.5%         9356         -3.0%         9024    stress-ng.clone.ops_per_sec
>
>
> BTW, the change needs to export symbol _page_fixed_fake_head otherwise
> some modules hit build error.
>
OK, I'll patch that up if this approach turns out to be the one we go with.
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
2024-08-13 7:14 ` Mateusz Guzik
@ 2024-08-14 3:02 ` Yin Fengwei
2024-08-14 4:10 ` Mateusz Guzik
0 siblings, 1 reply; 22+ messages in thread
From: Yin Fengwei @ 2024-08-14 3:02 UTC (permalink / raw)
To: Mateusz Guzik
Cc: David Hildenbrand, kernel test robot, Peter Xu, oe-lkp, lkp,
linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
linux-mm, ying.huang, feng.tang
On 8/13/24 03:14, Mateusz Guzik wrote:
> thanks for testing
>
> would you mind benchmarking the change which merely force-inlines _compund_page?
>
> https://lore.kernel.org/linux-mm/66c4fcc5-47f6-438c-a73a-3af6e19c3200@redhat.com/
This change also resolves the regression:
=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_threads/testtime/test/cpufreq_governor/debug-setup:
lkp-icl-2sp8/stress-ng/debian-12-x86_64-20240206.cgz/x86_64-rhel-8.3/gcc-12/100%/60s/clone/performance/yfw_test2
commit:
9cb28da54643ad464c47585cd5866c30b0218e67   parent commit
c0bff412e67b781d761e330ff9578aa9ed2be79e   commit which introduced the regression
450b96d2c4f740152e03c6b79b484a10347b3ea9   the change proposed by David in the above link
9cb28da54643ad46 c0bff412e67b781d761e330ff95 450b96d2c4f740152e03c6b79b4
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
        2906         +3.5%         3007         +0.4%         2919    stress-ng.clone.microsecs_per_clone
      562884         -2.9%       546575         -0.6%       559718    stress-ng.clone.ops
        9295         -2.9%         9028         -0.5%         9248    stress-ng.clone.ops_per_sec
Regards
Yin, Fengwei
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
2024-08-14 3:02 ` Yin Fengwei
@ 2024-08-14 4:10 ` Mateusz Guzik
2024-08-14 9:45 ` David Hildenbrand
0 siblings, 1 reply; 22+ messages in thread
From: Mateusz Guzik @ 2024-08-14 4:10 UTC (permalink / raw)
To: Yin Fengwei
Cc: David Hildenbrand, kernel test robot, Peter Xu, oe-lkp, lkp,
linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
linux-mm, ying.huang, feng.tang
On Wed, Aug 14, 2024 at 5:02 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
>
> On 8/13/24 03:14, Mateusz Guzik wrote:
> > would you mind benchmarking the change which merely force-inlines _compund_page?
> >
> > https://lore.kernel.org/linux-mm/66c4fcc5-47f6-438c-a73a-3af6e19c3200@redhat.com/
> This change can resolve the regression also:
Great, thanks.
David, I guess this means it would be fine to inline the entire thing,
at least from this benchmark's standpoint. Given that this is your idea,
I guess you should do the needful(tm)? :)
> =========================================================================================
> tbox_group/testcase/rootfs/kconfig/compiler/nr_threads/testtime/test/cpufreq_governor/debug-setup:
>
> lkp-icl-2sp8/stress-ng/debian-12-x86_64-20240206.cgz/x86_64-rhel-8.3/gcc-12/100%/60s/clone/performance/yfw_test2
>
> commit:
> 9cb28da54643ad464c47585cd5866c30b0218e67 parent commit
> c0bff412e67b781d761e330ff9578aa9ed2be79e commit introduced regression
> 450b96d2c4f740152e03c6b79b484a10347b3ea9 the change proposed by David in above link
>
> 9cb28da54643ad46 c0bff412e67b781d761e330ff95 450b96d2c4f740152e03c6b79b4
> ---------------- --------------------------- ---------------------------
> %stddev %change %stddev %change %stddev
> \ | \ | \
>         2906         +3.5%         3007         +0.4%         2919    stress-ng.clone.microsecs_per_clone
>       562884         -2.9%       546575         -0.6%       559718    stress-ng.clone.ops
>         9295         -2.9%         9028         -0.5%         9248    stress-ng.clone.ops_per_sec
>
>
>
> Regards
> Yin, Fengwei
>
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
2024-08-14 4:10 ` Mateusz Guzik
@ 2024-08-14 9:45 ` David Hildenbrand
2024-08-14 11:06 ` Mateusz Guzik
0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2024-08-14 9:45 UTC (permalink / raw)
To: Mateusz Guzik, Yin Fengwei
Cc: kernel test robot, Peter Xu, oe-lkp, lkp, linux-kernel,
Andrew Morton, Huacai Chen, Jason Gunthorpe, Matthew Wilcox,
Nathan Chancellor, Ryan Roberts, WANG Xuerui, linux-mm,
ying.huang, feng.tang
On 14.08.24 06:10, Mateusz Guzik wrote:
> On Wed, Aug 14, 2024 at 5:02 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>
>> On 8/13/24 03:14, Mateusz Guzik wrote:
>>> would you mind benchmarking the change which merely force-inlines _compund_page?
>>>
>>> https://lore.kernel.org/linux-mm/66c4fcc5-47f6-438c-a73a-3af6e19c3200@redhat.com/
>> This change can resolve the regression also:
>
> Great, thanks.
>
> David, I guess this means it would be fine to inline the entire thing
> at least from this bench standpoint. Given that this is your idea I
> guess you should do the needful(tm)? :)
Testing
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5769fe6e4950..25e25b34f4a0 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -235,7 +235,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
return page_fixed_fake_head(page) != page;
}
-static inline unsigned long _compound_head(const struct page *page)
+static __always_inline unsigned long _compound_head(const struct page *page)
{
unsigned long head = READ_ONCE(page->compound_head);
With a kernel config derived from Fedora's
config-6.8.9-100.fc38.x86_64 (for convenience), with
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y:
add/remove: 15/14 grow/shrink: 79/87 up/down: 12836/-13917 (-1081)
Function old new delta
change_pte_range - 2308 +2308
iommu_put_dma_cookie 454 1276 +822
get_hwpoison_page 2007 2580 +573
end_bbio_data_read 1171 1626 +455
end_bbio_meta_read 492 934 +442
ext4_finish_bio 773 1208 +435
fq_ring_free_locked 128 541 +413
end_bbio_meta_write 493 872 +379
gup_fast_fallback 4207 4568 +361
v1_free_pgtable 166 519 +353
iommu_v1_map_pages 2747 3098 +351
end_bbio_data_write 609 960 +351
fsverity_verify_bio 334 656 +322
follow_page_mask 3399 3719 +320
__read_end_io 316 635 +319
btrfs_end_super_write 494 789 +295
iommu_alloc_pages_node.constprop 286 572 +286
free_buffers.part - 271 +271
gup_must_unshare - 268 +268
smaps_pte_range 1285 1513 +228
pagemap_pmd_range 2189 2393 +204
iommu_alloc_pages_node - 193 +193
smaps_hugetlb_range 705 897 +192
follow_page_pte 1584 1758 +174
__migrate_device_pages 2435 2595 +160
unpin_user_pages_dirty_lock 205 362 +157
_compound_head - 150 +150
unpin_user_pages 143 282 +139
put_ref_page.part - 126 +126
iomap_finish_ioend 866 972 +106
iomap_read_end_io 673 763 +90
end_bbio_meta_read.cold 42 131 +89
btrfs_do_readpage 1759 1845 +86
extent_write_cache_pages 2133 2212 +79
end_bbio_data_write.cold 32 108 +76
end_bbio_meta_write.cold 40 108 +68
__read_end_io.cold 25 91 +66
btrfs_end_super_write.cold 25 89 +64
ext4_finish_bio.cold 118 178 +60
fsverity_verify_bio.cold 25 84 +59
block_write_begin 217 274 +57
end_bbio_data_read.cold 378 426 +48
__pfx__compound_head - 48 +48
copy_hugetlb_page_range 3050 3097 +47
lruvec_stat_mod_folio.constprop 585 630 +45
iomap_finish_ioend.cold 163 202 +39
md_bitmap_file_unmap 150 187 +37
free_pgd_range 1949 1985 +36
prep_move_freepages_block 319 349 +30
iommu_alloc_pages_node.cold - 25 +25
iomap_read_end_io.cold 65 89 +24
zap_huge_pmd 874 897 +23
cont_write_begin.cold 108 130 +22
skb_splice_from_iter 822 843 +21
set_pmd_migration_entry 1037 1058 +21
zerocopy_fill_skb_from_iter 1321 1340 +19
pagemap_scan_pmd_entry 3261 3279 +18
try_grab_folio_fast 452 469 +17
change_huge_pmd 1174 1191 +17
folio_put 48 64 +16
__pfx_set_p4d - 16 +16
__pfx_put_ref_page.part - 16 +16
__pfx_lruvec_stat_mod_folio.constprop 208 224 +16
__pfx_iommu_alloc_pages_node.constprop 16 32 +16
__pfx_iommu_alloc_pages_node - 16 +16
__pfx_gup_must_unshare - 16 +16
__pfx_free_buffers.part - 16 +16
__pfx_folio_put 48 64 +16
__pfx_change_pte_range - 16 +16
__pfx___pte 32 48 +16
offline_pages 1962 1975 +13
memfd_pin_folios 1284 1297 +13
uprobe_write_opcode 2062 2073 +11
set_p4d - 11 +11
__pte 22 33 +11
copy_page_from_iter_atomic 1714 1724 +10
__migrate_device_pages.cold 60 70 +10
try_to_unmap_one 3355 3364 +9
try_to_migrate_one 3310 3319 +9
stable_page_flags 1034 1043 +9
io_sqe_buffer_register 1404 1413 +9
dio_zero_block 644 652 +8
add_ra_bio_pages.constprop.isra 1542 1550 +8
__add_to_kill 969 977 +8
btrfs_writepage_fixup_worker 1199 1206 +7
write_protect_page 1186 1192 +6
iommu_v2_map_pages.cold 145 151 +6
gup_fast_fallback.cold 112 117 +5
try_to_merge_one_page 1857 1860 +3
__apply_to_page_range 2235 2238 +3
wbc_account_cgroup_owner 217 219 +2
change_protection.cold 105 107 +2
can_change_pte_writable 354 356 +2
vmf_insert_pfn_pud 699 700 +1
split_huge_page_to_list_to_order.cold 152 151 -1
pte_pfn 40 39 -1
move_pages 5270 5269 -1
isolate_single_pageblock 1056 1055 -1
__apply_to_page_range.cold 92 91 -1
unmap_page_range.cold 88 86 -2
do_huge_pmd_numa_page 1175 1173 -2
free_pgd_range.cold 162 159 -3
copy_page_to_iter 329 326 -3
copy_page_range.cold 149 146 -3
copy_page_from_iter 307 304 -3
can_finish_ordered_extent 551 548 -3
__replace_page 1133 1130 -3
__reset_isolation_pfn 645 641 -4
dio_send_cur_page 1113 1108 -5
__access_remote_vm 1010 1005 -5
pagemap_hugetlb_category 468 459 -9
extent_write_locked_range 1148 1139 -9
unuse_pte_range 1834 1821 -13
do_migrate_range 1935 1922 -13
__get_user_pages 1952 1938 -14
migrate_vma_collect_pmd 2817 2802 -15
copy_page_to_iter_nofault 2373 2358 -15
hugetlb_fault 4054 4038 -16
__pfx_shake_page 16 - -16
__pfx_put_page 16 - -16
__pfx_pfn_swap_entry_to_page 32 16 -16
__pfx_gup_must_unshare.part 16 - -16
__pfx_gup_folio_next 16 - -16
__pfx_free_buffers 16 - -16
__pfx___get_unpoison_page 16 - -16
btrfs_cleanup_ordered_extents 622 604 -18
read_rdev 694 673 -21
isolate_migratepages_block.cold 222 197 -25
hugetlb_mfill_atomic_pte 1869 1844 -25
folio_pte_batch.constprop 1020 995 -25
hugetlb_reserve_pages 1468 1441 -27
__alloc_fresh_hugetlb_folio 676 649 -27
intel_pasid_alloc_table.cold 83 52 -31
__pfx_iommu_put_pages_list 48 16 -32
__pfx_PageHuge 32 - -32
__blockdev_direct_IO.cold 952 920 -32
io_ctl_prepare_pages 832 794 -38
__handle_mm_fault 4237 4195 -42
finish_fault 1007 962 -45
__pfx_pfn_swap_entry_folio 64 16 -48
vm_normal_folio_pmd 84 34 -50
vm_normal_folio 84 34 -50
set_migratetype_isolate 1429 1375 -54
do_set_pmd 618 561 -57
can_change_pmd_writable 293 229 -64
__unmap_hugepage_range 2389 2325 -64
do_fault 1187 1121 -66
fault_dirty_shared_page 425 358 -67
madvise_free_huge_pmd 863 792 -71
insert_page_into_pte_locked.isra 502 429 -73
restore_exclusive_pte 539 463 -76
isolate_migratepages_block 5436 5355 -81
__do_fault 366 276 -90
set_pte_range 593 502 -91
follow_devmap_pmd 559 468 -91
__pfx_bio_first_folio 144 48 -96
shake_page 105 - -105
hugetlb_change_protection 2314 2204 -110
hugetlb_wp 2134 2017 -117
__blockdev_direct_IO 5063 4946 -117
skb_tx_error 272 149 -123
put_page 123 - -123
gup_must_unshare.part 135 - -135
PageHuge 136 - -136
ksm_scan_thread 9172 9032 -140
intel_pasid_alloc_table 596 447 -149
copy_huge_pmd 1539 1385 -154
skb_split 1534 1376 -158
split_huge_pmd_locked 4024 3865 -159
skb_append_pagefrags 663 504 -159
memory_failure 2784 2624 -160
unpoison_memory 1328 1167 -161
cont_write_begin 959 793 -166
pfn_swap_entry_to_page 250 82 -168
skb_pp_cow_data 1539 1367 -172
gup_folio_next 180 - -180
intel_pasid_get_entry.isra 607 425 -182
v2_alloc_pgtable 309 126 -183
do_huge_pmd_wp_page 1173 988 -185
bio_first_folio.cold 315 105 -210
unmap_page_range 6091 5873 -218
split_huge_page_to_list_to_order 4141 3905 -236
move_pages_huge_pmd 2053 1813 -240
free_buffers 286 - -286
iommu_v2_map_pages 1722 1428 -294
soft_offline_page 2149 1843 -306
do_wp_page 3340 2993 -347
do_swap_page 4619 4265 -354
md_import_device 1002 635 -367
copy_page_range 7436 7040 -396
__get_unpoison_page 415 - -415
pfn_swap_entry_folio 596 149 -447
iommu_put_pages_list 1071 344 -727
bio_first_folio 2322 774 -1548
change_protection 5008 2790 -2218
Total: Before=32786363, After=32785282, chg -0.00%
--
Cheers,
David / dhildenb
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
2024-08-14 9:45 ` David Hildenbrand
@ 2024-08-14 11:06 ` Mateusz Guzik
2024-08-14 12:02 ` David Hildenbrand
0 siblings, 1 reply; 22+ messages in thread
From: Mateusz Guzik @ 2024-08-14 11:06 UTC (permalink / raw)
To: David Hildenbrand
Cc: Yin Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
linux-mm, ying.huang, feng.tang
On Wed, Aug 14, 2024 at 11:45 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 14.08.24 06:10, Mateusz Guzik wrote:
> > On Wed, Aug 14, 2024 at 5:02 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >>
> >> On 8/13/24 03:14, Mateusz Guzik wrote:
> >>> would you mind benchmarking the change which merely force-inlines _compund_page?
> >>>
> >>> https://lore.kernel.org/linux-mm/66c4fcc5-47f6-438c-a73a-3af6e19c3200@redhat.com/
> >> This change can resolve the regression also:
> >
> > Great, thanks.
> >
> > David, I guess this means it would be fine to inline the entire thing
> > at least from this bench standpoint. Given that this is your idea I
> > guess you should do the needful(tm)? :)
>
> Testing
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 5769fe6e4950..25e25b34f4a0 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -235,7 +235,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
> return page_fixed_fake_head(page) != page;
> }
>
> -static inline unsigned long _compound_head(const struct page *page)
> +static __always_inline unsigned long _compound_head(const struct page *page)
> {
> unsigned long head = READ_ONCE(page->compound_head);
>
>
> With a kernel-config based on something derived from Fedora
> config-6.8.9-100.fc38.x86_64 for convenience with
>
> CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
>
> add/remove: 15/14 grow/shrink: 79/87 up/down: 12836/-13917 (-1081)
[snip]
> Total: Before=32786363, After=32785282, chg -0.00%
I guess there should be no opposition then?
Given that this is your patch, I presume you are going to see this through.
I don't want any mention or CC on the patch; thanks for understanding :)
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
2024-08-14 11:06 ` Mateusz Guzik
@ 2024-08-14 12:02 ` David Hildenbrand
0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2024-08-14 12:02 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Yin Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
linux-mm, ying.huang, feng.tang
On 14.08.24 13:06, Mateusz Guzik wrote:
> On Wed, Aug 14, 2024 at 11:45 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 14.08.24 06:10, Mateusz Guzik wrote:
>>> On Wed, Aug 14, 2024 at 5:02 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>>>
>>>> On 8/13/24 03:14, Mateusz Guzik wrote:
>>>>> would you mind benchmarking the change which merely force-inlines _compund_page?
>>>>>
>>>>> https://lore.kernel.org/linux-mm/66c4fcc5-47f6-438c-a73a-3af6e19c3200@redhat.com/
>>>> This change can resolve the regression also:
>>>
>>> Great, thanks.
>>>
>>> David, I guess this means it would be fine to inline the entire thing
>>> at least from this bench standpoint. Given that this is your idea I
>>> guess you should do the needful(tm)? :)
>>
>> Testing
>>
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index 5769fe6e4950..25e25b34f4a0 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -235,7 +235,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
>> return page_fixed_fake_head(page) != page;
>> }
>>
>> -static inline unsigned long _compound_head(const struct page *page)
>> +static __always_inline unsigned long _compound_head(const struct page *page)
>> {
>> unsigned long head = READ_ONCE(page->compound_head);
>>
>>
>> With a kernel-config based on something derived from Fedora
>> config-6.8.9-100.fc38.x86_64 for convenience with
>>
>> CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
>>
>> add/remove: 15/14 grow/shrink: 79/87 up/down: 12836/-13917 (-1081)
> [snip]
>> Total: Before=32786363, After=32785282, chg -0.00%
>
> I guess there should be no opposition then?
>
> Given that this is your patch I presume you are going to see this through.
I was hoping that you could send an official patch; after all, you did
most of the work here.
>
> I don't want any mention or cc on the patch, thanks for understanding :)
If I have to send it, I will respect that wish.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 22+ messages in thread
Thread overview: 22+ messages
2024-07-30 5:00 [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression kernel test robot
2024-07-30 8:11 ` David Hildenbrand
2024-08-01 6:39 ` Yin, Fengwei
2024-08-01 6:49 ` David Hildenbrand
2024-08-01 7:44 ` Yin, Fengwei
2024-08-01 7:54 ` David Hildenbrand
2024-08-01 13:30 ` Mateusz Guzik
2024-08-01 13:34 ` David Hildenbrand
2024-08-01 13:37 ` Mateusz Guzik
2024-08-01 13:44 ` David Hildenbrand
2024-08-12 4:43 ` Yin Fengwei
2024-08-12 4:49 ` Mateusz Guzik
2024-08-12 8:12 ` David Hildenbrand
2024-08-12 8:18 ` Mateusz Guzik
2024-08-12 8:23 ` David Hildenbrand
2024-08-13 7:09 ` Yin Fengwei
2024-08-13 7:14 ` Mateusz Guzik
2024-08-14 3:02 ` Yin Fengwei
2024-08-14 4:10 ` Mateusz Guzik
2024-08-14 9:45 ` David Hildenbrand
2024-08-14 11:06 ` Mateusz Guzik
2024-08-14 12:02 ` David Hildenbrand