* [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
@ 2024-07-30 5:00 kernel test robot
2024-07-30 8:11 ` David Hildenbrand
0 siblings, 1 reply; 22+ messages in thread
From: kernel test robot @ 2024-07-30 5:00 UTC (permalink / raw)
To: Peter Xu
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, David Hildenbrand,
Huacai Chen, Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor,
Ryan Roberts, WANG Xuerui, linux-mm, ying.huang, feng.tang,
fengwei.yin, oliver.sang
Hello,
kernel test robot noticed a -2.9% regression of stress-ng.clone.ops_per_sec on:
commit: c0bff412e67b781d761e330ff9578aa9ed2be79e ("mm: allow anon exclusive check over hugetlb tail pages")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
testcase: stress-ng
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:
nr_threads: 100%
testtime: 60s
test: clone
cpufreq_governor: performance
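
For context on why fork/exit-path changes show up so strongly here: the clone
stressor is essentially a tight spawn-and-reap loop, so every extra instruction
in copy_page_range()/exit_mmap() is paid once per operation. A rough user-space
approximation of that loop (my own sketch, not the actual stress-ng source,
which uses clone() with various flags rather than fork()) looks like this:

/*
 * Minimal sketch of a clone-style stressor: spawn a child that exits
 * immediately, reap it, and report operations per second.  Illustration
 * only; names and structure are not taken from stress-ng.
 */
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static double now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	const double runtime = 60.0;	/* matches testtime: 60s */
	double start = now(), elapsed;
	unsigned long ops = 0;

	do {
		pid_t pid = fork();	/* stand-in for clone() */

		if (pid == 0)
			_exit(0);		/* child: exit immediately */
		if (pid > 0)
			waitpid(pid, NULL, 0);	/* parent: reap the child */
		ops++;
		elapsed = now() - start;
	} while (elapsed < runtime);

	printf("ops: %lu, ops/sec: %.1f\n", ops, ops / elapsed);
	return 0;
}

With one such worker per CPU (nr_threads: 100%), almost all time is spent in
the kernel fork and teardown paths, which is why the profile below reacts to
small changes in those code paths.
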
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202407301049.5051dc19-oliver.sang@intel.com
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240730/202407301049.5051dc19-oliver.sang@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
gcc-13/performance/x86_64-rhel-8.3/100%/debian-12-x86_64-20240206.cgz/lkp-icl-2sp8/clone/stress-ng/60s
commit:
9cb28da546 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
c0bff412e6 ("mm: allow anon exclusive check over hugetlb tail pages")
9cb28da54643ad46 c0bff412e67b781d761e330ff95
---------------- ---------------------------
%stddev %change %stddev
\ | \
37842 -3.4% 36554 vmstat.system.cs
0.00 ± 17% -86.4% 0.00 ±223% sched_debug.rt_rq:.rt_time.avg
0.19 ± 17% -86.4% 0.03 ±223% sched_debug.rt_rq:.rt_time.max
0.02 ± 17% -86.4% 0.00 ±223% sched_debug.rt_rq:.rt_time.stddev
24081 -3.7% 23200 proc-vmstat.nr_page_table_pages
399380 -2.3% 390288 proc-vmstat.nr_slab_reclaimable
1625589 -2.4% 1585989 proc-vmstat.nr_slab_unreclaimable
1.019e+08 -3.8% 98035999 proc-vmstat.numa_hit
1.018e+08 -3.9% 97870705 proc-vmstat.numa_local
1.092e+08 -3.8% 1.05e+08 proc-vmstat.pgalloc_normal
1.06e+08 -3.8% 1.019e+08 proc-vmstat.pgfree
2659199 -2.3% 2597978 proc-vmstat.pgreuse
2910 +3.4% 3010 stress-ng.clone.microsecs_per_clone
562874 -2.9% 546587 stress-ng.clone.ops
9298 -2.9% 9031 stress-ng.clone.ops_per_sec
686858 -2.8% 667416 stress-ng.time.involuntary_context_switches
9091031 -3.9% 8734352 stress-ng.time.minor_page_faults
4200 +2.4% 4299 stress-ng.time.percent_of_cpu_this_job_got
2543 +2.4% 2603 stress-ng.time.system_time
342849 -2.8% 333189 stress-ng.time.voluntary_context_switches
6.67 -6.1% 6.26 perf-stat.i.MPKI
6.388e+08 -5.4% 6.045e+08 perf-stat.i.cache-misses
1.558e+09 -4.6% 1.487e+09 perf-stat.i.cache-references
40791 -3.6% 39330 perf-stat.i.context-switches
353.55 +5.4% 372.76 perf-stat.i.cycles-between-cache-misses
7.95 ± 3% -6.5% 7.43 ± 3% perf-stat.i.metric.K/sec
251389 ± 3% -6.5% 235029 ± 3% perf-stat.i.minor-faults
251423 ± 3% -6.5% 235064 ± 3% perf-stat.i.page-faults
6.75 -6.1% 6.33 perf-stat.overall.MPKI
0.38 -0.0 0.37 perf-stat.overall.branch-miss-rate%
350.09 +5.8% 370.24 perf-stat.overall.cycles-between-cache-misses
68503488 -1.2% 67660585 perf-stat.ps.branch-misses
6.33e+08 -5.4% 5.987e+08 perf-stat.ps.cache-misses
1.518e+09 -4.6% 1.449e+09 perf-stat.ps.cache-references
38819 -3.3% 37542 perf-stat.ps.context-switches
3637 +1.2% 3680 perf-stat.ps.cpu-migrations
235473 ± 3% -6.3% 220601 ± 3% perf-stat.ps.minor-faults
235504 ± 3% -6.3% 220632 ± 3% perf-stat.ps.page-faults
45.55 -2.5 43.04 perf-profile.calltrace.cycles-pp.zap_pmd_range.unmap_page_range.unmap_vmas.exit_mmap.__mmput
44.86 -2.5 42.37 perf-profile.calltrace.cycles-pp.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas.exit_mmap
44.42 -2.1 42.37 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
44.42 -2.1 42.37 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
44.41 -2.1 42.36 perf-profile.calltrace.cycles-pp.__x64_sys_exit.do_syscall_64.entry_SYSCALL_64_after_hwframe
44.41 -2.1 42.36 perf-profile.calltrace.cycles-pp.do_exit.__x64_sys_exit.do_syscall_64.entry_SYSCALL_64_after_hwframe
39.08 -1.7 37.34 perf-profile.calltrace.cycles-pp.exit_mm.do_exit.__x64_sys_exit.do_syscall_64.entry_SYSCALL_64_after_hwframe
38.96 -1.7 37.22 perf-profile.calltrace.cycles-pp.exit_mmap.__mmput.exit_mm.do_exit.__x64_sys_exit
38.97 -1.7 37.24 perf-profile.calltrace.cycles-pp.__mmput.exit_mm.do_exit.__x64_sys_exit.do_syscall_64
36.16 -1.6 34.57 perf-profile.calltrace.cycles-pp.unmap_vmas.exit_mmap.__mmput.exit_mm.do_exit
35.99 -1.6 34.40 perf-profile.calltrace.cycles-pp.unmap_page_range.unmap_vmas.exit_mmap.__mmput.exit_mm
32.17 -1.5 30.62 perf-profile.calltrace.cycles-pp.zap_present_ptes.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas
12.49 -1.0 11.52 perf-profile.calltrace.cycles-pp._compound_head.zap_present_ptes.zap_pte_range.zap_pmd_range.unmap_page_range
9.66 -0.9 8.74 perf-profile.calltrace.cycles-pp.unmap_vmas.exit_mmap.__mmput.copy_process.kernel_clone
9.61 -0.9 8.69 perf-profile.calltrace.cycles-pp.unmap_page_range.unmap_vmas.exit_mmap.__mmput.copy_process
10.71 -0.9 9.84 perf-profile.calltrace.cycles-pp.__mmput.copy_process.kernel_clone.__do_sys_clone3.do_syscall_64
10.70 -0.9 9.84 perf-profile.calltrace.cycles-pp.exit_mmap.__mmput.copy_process.kernel_clone.__do_sys_clone3
10.41 -0.8 9.58 perf-profile.calltrace.cycles-pp.__tlb_batch_free_encoded_pages.tlb_flush_mmu.zap_pte_range.zap_pmd_range.unmap_page_range
10.42 -0.8 9.59 perf-profile.calltrace.cycles-pp.tlb_flush_mmu.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas
10.21 -0.8 9.40 perf-profile.calltrace.cycles-pp.free_pages_and_swap_cache.__tlb_batch_free_encoded_pages.tlb_flush_mmu.zap_pte_range.zap_pmd_range
5.47 -0.4 5.04 perf-profile.calltrace.cycles-pp.folios_put_refs.free_pages_and_swap_cache.__tlb_batch_free_encoded_pages.tlb_flush_mmu.zap_pte_range
1.11 -0.3 0.79 ± 33% perf-profile.calltrace.cycles-pp.anon_vma_interval_tree_insert.anon_vma_clone.anon_vma_fork.dup_mmap.dup_mm
14.18 -0.3 13.87 perf-profile.calltrace.cycles-pp.folio_remove_rmap_ptes.zap_present_ptes.zap_pte_range.zap_pmd_range.unmap_page_range
5.17 -0.3 4.86 perf-profile.calltrace.cycles-pp.put_files_struct.do_exit.__x64_sys_exit.do_syscall_64.entry_SYSCALL_64_after_hwframe
4.80 -0.3 4.53 perf-profile.calltrace.cycles-pp.filp_close.put_files_struct.do_exit.__x64_sys_exit.do_syscall_64
4.40 -0.3 4.14 perf-profile.calltrace.cycles-pp.filp_flush.filp_close.put_files_struct.do_exit.__x64_sys_exit
2.74 -0.2 2.58 perf-profile.calltrace.cycles-pp.anon_vma_fork.dup_mmap.dup_mm.copy_process.kernel_clone
2.25 -0.1 2.11 perf-profile.calltrace.cycles-pp.anon_vma_clone.anon_vma_fork.dup_mmap.dup_mm.copy_process
1.47 -0.1 1.34 perf-profile.calltrace.cycles-pp.put_files_struct.copy_process.kernel_clone.__do_sys_clone3.do_syscall_64
1.87 -0.1 1.76 perf-profile.calltrace.cycles-pp.dnotify_flush.filp_flush.filp_close.put_files_struct.do_exit
1.98 -0.1 1.88 perf-profile.calltrace.cycles-pp.free_pgtables.exit_mmap.__mmput.exit_mm.do_exit
1.28 -0.1 1.18 perf-profile.calltrace.cycles-pp.filp_close.put_files_struct.copy_process.kernel_clone.__do_sys_clone3
1.19 -0.1 1.09 perf-profile.calltrace.cycles-pp.filp_flush.filp_close.put_files_struct.copy_process.kernel_clone
1.31 ± 2% -0.1 1.25 perf-profile.calltrace.cycles-pp.unlink_anon_vmas.free_pgtables.exit_mmap.__mmput.exit_mm
0.58 -0.0 0.55 perf-profile.calltrace.cycles-pp.vm_normal_page.zap_present_ptes.zap_pte_range.zap_pmd_range.unmap_page_range
33.54 +0.6 34.10 perf-profile.calltrace.cycles-pp.syscall
33.45 +0.6 34.01 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
33.45 +0.6 34.01 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.syscall
33.35 +0.6 33.90 perf-profile.calltrace.cycles-pp.__do_sys_clone3.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
33.34 +0.6 33.90 perf-profile.calltrace.cycles-pp.kernel_clone.__do_sys_clone3.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
33.30 +0.6 33.86 perf-profile.calltrace.cycles-pp.copy_process.kernel_clone.__do_sys_clone3.do_syscall_64.entry_SYSCALL_64_after_hwframe
20.63 +1.6 22.21 perf-profile.calltrace.cycles-pp.dup_mm.copy_process.kernel_clone.__do_sys_clone3.do_syscall_64
20.55 +1.6 22.14 perf-profile.calltrace.cycles-pp.dup_mmap.dup_mm.copy_process.kernel_clone.__do_sys_clone3
19.40 +1.8 21.19 perf-profile.calltrace.cycles-pp.__clone
19.24 +1.8 21.04 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__clone
19.24 +1.8 21.04 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__clone
19.14 +1.8 20.94 perf-profile.calltrace.cycles-pp.__do_sys_clone.do_syscall_64.entry_SYSCALL_64_after_hwframe.__clone
19.14 +1.8 20.94 perf-profile.calltrace.cycles-pp.kernel_clone.__do_sys_clone.do_syscall_64.entry_SYSCALL_64_after_hwframe.__clone
19.05 +1.8 20.85 perf-profile.calltrace.cycles-pp.copy_process.kernel_clone.__do_sys_clone.do_syscall_64.entry_SYSCALL_64_after_hwframe
18.74 +1.8 20.56 perf-profile.calltrace.cycles-pp.dup_mm.copy_process.kernel_clone.__do_sys_clone.do_syscall_64
18.67 +1.8 20.49 perf-profile.calltrace.cycles-pp.dup_mmap.dup_mm.copy_process.kernel_clone.__do_sys_clone
12.24 +3.1 15.35 perf-profile.calltrace.cycles-pp._compound_head.copy_present_ptes.copy_pte_range.copy_p4d_range.copy_page_range
34.37 +3.7 38.02 perf-profile.calltrace.cycles-pp.copy_page_range.dup_mmap.dup_mm.copy_process.kernel_clone
34.34 +3.7 38.00 perf-profile.calltrace.cycles-pp.copy_p4d_range.copy_page_range.dup_mmap.dup_mm.copy_process
30.99 +3.7 34.69 perf-profile.calltrace.cycles-pp.copy_present_ptes.copy_pte_range.copy_p4d_range.copy_page_range.dup_mmap
33.16 +3.7 36.88 perf-profile.calltrace.cycles-pp.copy_pte_range.copy_p4d_range.copy_page_range.dup_mmap.dup_mm
0.00 +3.9 3.90 perf-profile.calltrace.cycles-pp.folio_try_dup_anon_rmap_ptes.copy_present_ptes.copy_pte_range.copy_p4d_range.copy_page_range
49.67 -2.6 47.07 perf-profile.children.cycles-pp.exit_mmap
49.69 -2.6 47.08 perf-profile.children.cycles-pp.__mmput
45.84 -2.5 43.32 perf-profile.children.cycles-pp.unmap_vmas
45.56 -2.5 43.05 perf-profile.children.cycles-pp.zap_pmd_range
45.61 -2.5 43.10 perf-profile.children.cycles-pp.unmap_page_range
44.98 -2.5 42.48 perf-profile.children.cycles-pp.zap_pte_range
44.53 -2.1 42.48 perf-profile.children.cycles-pp.__x64_sys_exit
44.54 -2.1 42.48 perf-profile.children.cycles-pp.do_exit
39.10 -1.7 37.36 perf-profile.children.cycles-pp.exit_mm
32.99 -1.6 31.41 perf-profile.children.cycles-pp.zap_present_ptes
10.53 -0.8 9.71 perf-profile.children.cycles-pp.tlb_flush_mmu
10.91 -0.7 10.19 perf-profile.children.cycles-pp.__tlb_batch_free_encoded_pages
10.88 -0.7 10.16 perf-profile.children.cycles-pp.free_pages_and_swap_cache
6.64 -0.4 6.22 perf-profile.children.cycles-pp.put_files_struct
5.76 -0.4 5.38 perf-profile.children.cycles-pp.folios_put_refs
6.11 -0.4 5.73 perf-profile.children.cycles-pp.filp_close
5.62 -0.4 5.25 perf-profile.children.cycles-pp.filp_flush
14.28 -0.3 13.97 perf-profile.children.cycles-pp.folio_remove_rmap_ptes
2.75 -0.2 2.58 perf-profile.children.cycles-pp.anon_vma_fork
2.38 -0.2 2.22 perf-profile.children.cycles-pp.dnotify_flush
2.50 -0.1 2.36 perf-profile.children.cycles-pp.free_pgtables
2.25 -0.1 2.11 perf-profile.children.cycles-pp.anon_vma_clone
0.20 ± 33% -0.1 0.08 ± 58% perf-profile.children.cycles-pp.ordered_events__queue
0.20 ± 33% -0.1 0.08 ± 58% perf-profile.children.cycles-pp.queue_event
1.24 ± 4% -0.1 1.14 perf-profile.children.cycles-pp.down_write
1.67 ± 2% -0.1 1.58 perf-profile.children.cycles-pp.unlink_anon_vmas
1.59 -0.1 1.50 perf-profile.children.cycles-pp.__alloc_pages_noprof
1.55 -0.1 1.46 perf-profile.children.cycles-pp.alloc_pages_mpol_noprof
1.58 -0.1 1.50 perf-profile.children.cycles-pp.vm_normal_page
1.11 -0.1 1.04 perf-profile.children.cycles-pp.anon_vma_interval_tree_insert
1.33 -0.1 1.26 ± 2% perf-profile.children.cycles-pp.pte_alloc_one
0.47 ± 11% -0.1 0.40 ± 4% perf-profile.children.cycles-pp.rwsem_down_write_slowpath
0.45 ± 11% -0.1 0.38 ± 4% perf-profile.children.cycles-pp.rwsem_optimistic_spin
1.00 -0.1 0.94 ± 2% perf-profile.children.cycles-pp.get_page_from_freelist
1.36 -0.1 1.31 perf-profile.children.cycles-pp.kmem_cache_free
1.08 -0.0 1.04 perf-profile.children.cycles-pp.kmem_cache_alloc_noprof
0.62 -0.0 0.58 ± 2% perf-profile.children.cycles-pp.dup_fd
0.63 -0.0 0.59 ± 3% perf-profile.children.cycles-pp.__pte_alloc
0.73 -0.0 0.69 perf-profile.children.cycles-pp.__tlb_remove_folio_pages_size
0.58 -0.0 0.54 perf-profile.children.cycles-pp.locks_remove_posix
0.90 -0.0 0.86 perf-profile.children.cycles-pp.copy_huge_pmd
0.54 -0.0 0.51 perf-profile.children.cycles-pp.__memcg_kmem_charge_page
0.76 -0.0 0.72 perf-profile.children.cycles-pp.vm_area_dup
0.31 ± 2% -0.0 0.28 ± 3% perf-profile.children.cycles-pp.rwsem_spin_on_owner
0.50 -0.0 0.47 perf-profile.children.cycles-pp.__anon_vma_interval_tree_remove
0.53 -0.0 0.50 perf-profile.children.cycles-pp.clear_page_erms
0.49 -0.0 0.46 perf-profile.children.cycles-pp.free_swap_cache
0.72 -0.0 0.69 perf-profile.children.cycles-pp.__memcg_slab_post_alloc_hook
0.37 ± 2% -0.0 0.34 ± 2% perf-profile.children.cycles-pp.unlink_file_vma
0.62 -0.0 0.60 perf-profile.children.cycles-pp.__memcg_slab_free_hook
0.42 -0.0 0.40 ± 2% perf-profile.children.cycles-pp.rmqueue
0.37 -0.0 0.35 ± 2% perf-profile.children.cycles-pp.__rmqueue_pcplist
0.28 -0.0 0.25 perf-profile.children.cycles-pp.__rb_insert_augmented
0.35 -0.0 0.33 ± 2% perf-profile.children.cycles-pp.rmqueue_bulk
0.56 -0.0 0.54 perf-profile.children.cycles-pp.fput
0.48 -0.0 0.46 perf-profile.children.cycles-pp._raw_spin_lock
0.51 -0.0 0.50 perf-profile.children.cycles-pp.free_unref_page
0.45 -0.0 0.43 perf-profile.children.cycles-pp.__x64_sys_unshare
0.44 -0.0 0.42 perf-profile.children.cycles-pp.free_unref_page_commit
0.45 -0.0 0.43 perf-profile.children.cycles-pp.ksys_unshare
0.31 -0.0 0.30 perf-profile.children.cycles-pp.memcg_account_kmem
0.27 -0.0 0.26 perf-profile.children.cycles-pp.__mod_memcg_state
0.44 -0.0 0.43 perf-profile.children.cycles-pp.__slab_free
0.28 -0.0 0.26 perf-profile.children.cycles-pp.__vm_area_free
0.22 ± 2% -0.0 0.21 perf-profile.children.cycles-pp.___slab_alloc
0.21 -0.0 0.20 ± 2% perf-profile.children.cycles-pp.__tlb_remove_folio_pages
0.13 -0.0 0.12 perf-profile.children.cycles-pp.__rb_erase_color
0.07 -0.0 0.06 perf-profile.children.cycles-pp.find_idlest_cpu
0.09 -0.0 0.08 perf-profile.children.cycles-pp.wake_up_new_task
0.06 -0.0 0.05 perf-profile.children.cycles-pp.kfree
0.06 -0.0 0.05 perf-profile.children.cycles-pp.update_sg_wakeup_stats
0.11 -0.0 0.10 perf-profile.children.cycles-pp.allocate_slab
0.44 ± 2% +0.1 0.53 ± 2% perf-profile.children.cycles-pp.tlb_finish_mmu
98.24 +0.2 98.46 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
98.24 +0.2 98.46 perf-profile.children.cycles-pp.do_syscall_64
33.55 +0.6 34.10 perf-profile.children.cycles-pp.syscall
33.35 +0.6 33.90 perf-profile.children.cycles-pp.__do_sys_clone3
19.41 +1.8 21.20 perf-profile.children.cycles-pp.__clone
19.14 +1.8 20.94 perf-profile.children.cycles-pp.__do_sys_clone
24.94 +2.1 27.07 perf-profile.children.cycles-pp._compound_head
52.48 +2.4 54.84 perf-profile.children.cycles-pp.kernel_clone
52.36 +2.4 54.72 perf-profile.children.cycles-pp.copy_process
39.38 +3.4 42.77 perf-profile.children.cycles-pp.dup_mm
39.24 +3.4 42.64 perf-profile.children.cycles-pp.dup_mmap
34.34 +3.7 38.00 perf-profile.children.cycles-pp.copy_p4d_range
34.37 +3.7 38.03 perf-profile.children.cycles-pp.copy_page_range
33.28 +3.7 36.98 perf-profile.children.cycles-pp.copy_pte_range
31.41 +3.8 35.18 perf-profile.children.cycles-pp.copy_present_ptes
0.00 +4.0 4.01 perf-profile.children.cycles-pp.folio_try_dup_anon_rmap_ptes
18.44 -3.2 15.24 perf-profile.self.cycles-pp.copy_present_ptes
5.66 -0.4 5.28 perf-profile.self.cycles-pp.folios_put_refs
4.78 -0.3 4.46 perf-profile.self.cycles-pp.free_pages_and_swap_cache
14.11 -0.3 13.80 perf-profile.self.cycles-pp.folio_remove_rmap_ptes
4.82 -0.2 4.59 perf-profile.self.cycles-pp.zap_present_ptes
2.66 -0.2 2.49 perf-profile.self.cycles-pp.filp_flush
2.36 -0.2 2.20 perf-profile.self.cycles-pp.dnotify_flush
0.20 ± 32% -0.1 0.08 ± 58% perf-profile.self.cycles-pp.queue_event
1.44 -0.1 1.36 perf-profile.self.cycles-pp.zap_pte_range
1.11 -0.1 1.03 perf-profile.self.cycles-pp.anon_vma_interval_tree_insert
1.26 -0.1 1.20 perf-profile.self.cycles-pp.vm_normal_page
0.56 -0.0 0.52 ± 2% perf-profile.self.cycles-pp.dup_fd
0.56 -0.0 0.53 perf-profile.self.cycles-pp.locks_remove_posix
0.31 -0.0 0.28 perf-profile.self.cycles-pp.put_files_struct
0.58 -0.0 0.55 perf-profile.self.cycles-pp.__tlb_remove_folio_pages_size
0.49 -0.0 0.46 ± 2% perf-profile.self.cycles-pp.__anon_vma_interval_tree_remove
0.30 ± 3% -0.0 0.28 ± 3% perf-profile.self.cycles-pp.rwsem_spin_on_owner
0.52 -0.0 0.49 ± 2% perf-profile.self.cycles-pp.clear_page_erms
0.31 -0.0 0.29 perf-profile.self.cycles-pp.free_swap_cache
0.33 -0.0 0.31 perf-profile.self.cycles-pp.__memcg_slab_free_hook
0.45 -0.0 0.43 perf-profile.self.cycles-pp._raw_spin_lock
0.55 -0.0 0.53 perf-profile.self.cycles-pp.fput
0.38 -0.0 0.36 perf-profile.self.cycles-pp.__memcg_slab_post_alloc_hook
0.47 -0.0 0.45 perf-profile.self.cycles-pp.up_write
0.26 -0.0 0.24 perf-profile.self.cycles-pp.__rb_insert_augmented
0.33 -0.0 0.32 perf-profile.self.cycles-pp.mod_objcg_state
0.31 -0.0 0.30 perf-profile.self.cycles-pp.__free_one_page
0.09 -0.0 0.08 perf-profile.self.cycles-pp.___slab_alloc
24.40 +2.1 26.55 perf-profile.self.cycles-pp._compound_head
0.00 +3.9 3.89 perf-profile.self.cycles-pp.folio_try_dup_anon_rmap_ptes
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-07-30 5:00 [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression kernel test robot
@ 2024-07-30 8:11 ` David Hildenbrand
  2024-08-01 6:39   ` Yin, Fengwei
  0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2024-07-30 8:11 UTC (permalink / raw)
To: kernel test robot, Peter Xu
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Huacai Chen,
Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor, Ryan Roberts,
WANG Xuerui, linux-mm, ying.huang, feng.tang, fengwei.yin

On 30.07.24 07:00, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed a -2.9% regression of stress-ng.clone.ops_per_sec on:

Is that test even using hugetlb? Anyhow, this pretty much sounds like
noise and can be ignored.

--
Cheers,

David / dhildenb

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-07-30 8:11 ` David Hildenbrand
@ 2024-08-01 6:39   ` Yin, Fengwei
  2024-08-01 6:49     ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: Yin, Fengwei @ 2024-08-01 6:39 UTC (permalink / raw)
To: David Hildenbrand, kernel test robot, Peter Xu
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Huacai Chen,
Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor, Ryan Roberts,
WANG Xuerui, linux-mm, ying.huang, feng.tang

Hi David,

On 7/30/2024 4:11 PM, David Hildenbrand wrote:
> On 30.07.24 07:00, kernel test robot wrote:
>> Hello,
>>
>> kernel test robot noticed a -2.9% regression of
>> stress-ng.clone.ops_per_sec on:
>
> Is that test even using hugetlb? Anyhow, this pretty much sounds like
> noise and can be ignored.
>
It's not about hugetlb. It looks like it is related to this change:

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 888353c209c03..7577fe7debafc 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -1095,7 +1095,12 @@ PAGEFLAG(Isolated, isolated, PF_ANY);
 static __always_inline int PageAnonExclusive(const struct page *page)
 {
 	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
-	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+	/*
+	 * HugeTLB stores this information on the head page; THP keeps it per
+	 * page
+	 */
+	if (PageHuge(page))
+		page = compound_head(page);
 	return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);

The PageAnonExclusive() function was changed, and the profiling data
shows it:

  0.00  +3.9  3.90
  perf-profile.calltrace.cycles-pp.folio_try_dup_anon_rmap_ptes.copy_present_ptes.copy_pte_range.copy_p4d_range.copy_page_range

According to
https://download.01.org/0day-ci/archive/20240730/202407301049.5051dc19-oliver.sang@intel.com/config-6.9.0-rc4-00197-gc0bff412e67b:
  # CONFIG_DEBUG_VM is not set
so maybe such a code change could make a difference?

And yes, a 2.9% regression can be within the noise range. Thanks.


Regards
Yin, Fengwei

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-01 6:39 ` Yin, Fengwei
@ 2024-08-01 6:49   ` David Hildenbrand
  2024-08-01 7:44     ` Yin, Fengwei
  2024-08-01 13:30     ` Mateusz Guzik
  0 siblings, 2 replies; 22+ messages in thread
From: David Hildenbrand @ 2024-08-01 6:49 UTC (permalink / raw)
To: Yin, Fengwei, kernel test robot, Peter Xu
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Huacai Chen,
Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor, Ryan Roberts,
WANG Xuerui, linux-mm, ying.huang, feng.tang

On 01.08.24 08:39, Yin, Fengwei wrote:
> Hi David,
>
> On 7/30/2024 4:11 PM, David Hildenbrand wrote:
>> On 30.07.24 07:00, kernel test robot wrote:
>>> Hello,
>>>
>>> kernel test robot noticed a -2.9% regression of
>>> stress-ng.clone.ops_per_sec on:
>>
>> Is that test even using hugetlb? Anyhow, this pretty much sounds like
>> noise and can be ignored.
>>
> It's not about hugetlb. It looks like it is related to this change:

Ah, that makes sense!

> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> [...]
> +	if (PageHuge(page))
> +		page = compound_head(page);
>  	return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
>
> The PageAnonExclusive() function was changed, and the profiling data
> shows it:
>
>   0.00  +3.9  3.90
>   perf-profile.calltrace.cycles-pp.folio_try_dup_anon_rmap_ptes.copy_present_ptes.copy_pte_range.copy_p4d_range.copy_page_range
>
> According to
> https://download.01.org/0day-ci/archive/20240730/202407301049.5051dc19-oliver.sang@intel.com/config-6.9.0-rc4-00197-gc0bff412e67b:
>   # CONFIG_DEBUG_VM is not set
> so maybe such a code change could make a difference?

Yes indeed. fork() can be extremely sensitive to each added instruction.

I even pointed out to Peter why I didn't add the PageHuge check in there
originally [1].

"Well, and I didn't want to have runtime-hugetlb checks in
PageAnonExclusive code called on certainly-not-hugetlb code paths."


We now have to do a page_folio(page) and then test for hugetlb.

	return folio_test_hugetlb(page_folio(page));

Nowadays, folio_test_hugetlb() will be faster than it was at the time of
c0bff412e6, so maybe at least part of the overhead is gone.

[1] https://lore.kernel.org/r/all/8b0b24bb-3c38-4f27-a2c9-f7d7adc4a115@redhat.com/

--
Cheers,

David / dhildenb

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-01 6:49 ` David Hildenbrand
@ 2024-08-01 7:44   ` Yin, Fengwei
  2024-08-01 7:54     ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: Yin, Fengwei @ 2024-08-01 7:44 UTC (permalink / raw)
To: David Hildenbrand, kernel test robot, Peter Xu
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Huacai Chen,
Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor, Ryan Roberts,
WANG Xuerui, linux-mm, ying.huang, feng.tang

Hi David,

On 8/1/2024 2:49 PM, David Hildenbrand wrote:
> We now have to do a page_folio(page) and then test for hugetlb.
>
> 	return folio_test_hugetlb(page_folio(page));
>
> Nowadays, folio_test_hugetlb() will be faster than it was at the time of
> c0bff412e6, so maybe at least part of the overhead is gone.
This is great. We will check the trend to see whether it recovers to
some extent.


Regards
Yin, Fengwei

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-01 7:44 ` Yin, Fengwei
@ 2024-08-01 7:54   ` David Hildenbrand
  0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2024-08-01 7:54 UTC (permalink / raw)
To: Yin, Fengwei, kernel test robot, Peter Xu
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Huacai Chen,
Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor, Ryan Roberts,
WANG Xuerui, linux-mm, ying.huang, feng.tang

On 01.08.24 09:44, Yin, Fengwei wrote:
> Hi David,
>
> On 8/1/2024 2:49 PM, David Hildenbrand wrote:
>> We now have to do a page_folio(page) and then test for hugetlb.
>>
>> 	return folio_test_hugetlb(page_folio(page));
>>
>> Nowadays, folio_test_hugetlb() will be faster than it was at the time of
>> c0bff412e6, so maybe at least part of the overhead is gone.
> This is great. We will check the trend to see whether it recovers to
> some extent.

Oh, I think d99e3140a4d33e26066183ff727d8f02f56bec64 went upstream
before c0bff412e67b781d761e330ff9578aa9ed2be79e, so at the time of
c0bff412e6 we already should have had the faster check!

--
Cheers,

David / dhildenb

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-01 6:49 ` David Hildenbrand
  2024-08-01 7:44   ` Yin, Fengwei
@ 2024-08-01 13:30   ` Mateusz Guzik
  2024-08-01 13:34     ` David Hildenbrand
  1 sibling, 1 reply; 22+ messages in thread
From: Mateusz Guzik @ 2024-08-01 13:30 UTC (permalink / raw)
To: David Hildenbrand
Cc: Yin, Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
linux-mm, ying.huang, feng.tang

On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote:
> Yes indeed. fork() can be extremely sensitive to each added instruction.
>
> I even pointed out to Peter why I didn't add the PageHuge check in there
> originally [1].
>
> "Well, and I didn't want to have runtime-hugetlb checks in
> PageAnonExclusive code called on certainly-not-hugetlb code paths."
>
>
> We now have to do a page_folio(page) and then test for hugetlb.
>
> 	return folio_test_hugetlb(page_folio(page));
>
> Nowadays, folio_test_hugetlb() will be faster than it was at the time of
> c0bff412e6, so maybe at least part of the overhead is gone.
>

I'll note page_folio expands to a call to _compound_head.

While _compound_head is declared as an inline, it ends up being big
enough that the compiler decides to emit a real function instead, and
real func calls are not particularly cheap.

I had a brief look with a profiler myself and for single-threaded usage
the func is quite high up there, even though it manages to get out via
the first branch -- that is to say there is definitely performance lost
for having a func call instead of an inlined branch.

The routine is deinlined because of a call to page_fixed_fake_head,
which itself is annotated with always_inline.

This is of course patchable with minor shoveling.

I did not go for it because stress-ng results were too unstable for me
to confidently state win/loss.

But should you want to whack the regression, this is what I would look
into.

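To get a feel for the cost being described here, a stand-alone user-space
micro-benchmark along these lines (my own construction, not from the thread;
"fake_page" and both helpers are invented for the sketch) contrasts the same
head-lookup branch emitted inline at the call site with identical logic
behind a forced out-of-line call:

/*
 * Illustration only: an always-inlined branch vs. the same logic behind a
 * noinline call, loosely mimicking the _compound_head() situation above.
 */
#include <stdio.h>
#include <time.h>

#define ITERS 200000000UL

struct fake_page { unsigned long compound_head; };

/* Out-of-line variant: forces a real call per lookup. */
__attribute__((noinline))
static unsigned long head_call(const struct fake_page *p)
{
	unsigned long head = p->compound_head;

	return (head & 1) ? head - 1 : (unsigned long)p;
}

/* Inlined variant: the branch is emitted directly at the call site. */
static inline __attribute__((always_inline))
unsigned long head_inline(const struct fake_page *p)
{
	unsigned long head = p->compound_head;

	return (head & 1) ? head - 1 : (unsigned long)p;
}

static double now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	struct fake_page pages[2] = { { 0 }, { 0 } };
	volatile unsigned long sink = 0;
	double t;

	t = now();
	for (unsigned long i = 0; i < ITERS; i++)
		sink += head_call(&pages[i & 1]);
	printf("out-of-line call: %.3f s\n", now() - t);

	t = now();
	for (unsigned long i = 0; i < ITERS; i++)
		sink += head_inline(&pages[i & 1]);
	printf("inlined branch:   %.3f s\n", now() - t);

	(void)sink;
	return 0;
}

The absolute numbers are not meaningful; the gap between the two loops is the
kind of per-call overhead at stake when _compound_head() stops being inlined,
and the caveat above about noisy stress-ng runs applies equally here.
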
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-01 13:30 ` Mateusz Guzik
@ 2024-08-01 13:34   ` David Hildenbrand
  2024-08-01 13:37     ` Mateusz Guzik
  0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2024-08-01 13:34 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Yin, Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
linux-mm, ying.huang, feng.tang

On 01.08.24 15:30, Mateusz Guzik wrote:
> I'll note page_folio expands to a call to _compound_head.
>
> While _compound_head is declared as an inline, it ends up being big
> enough that the compiler decides to emit a real function instead, and
> real func calls are not particularly cheap.
> [...]
> The routine is deinlined because of a call to page_fixed_fake_head,
> which itself is annotated with always_inline.
>
> This is of course patchable with minor shoveling.
> [...]
> But should you want to whack the regression, this is what I would look
> into.

This might improve it, at least for small folios I guess:

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5769fe6e4950..7796ae116018 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -1086,7 +1086,7 @@ PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
  */
 static inline bool PageHuge(const struct page *page)
 {
-	return folio_test_hugetlb(page_folio(page));
+	return PageCompound(page) && folio_test_hugetlb(page_folio(page));
 }
 
 /*


We would avoid the function call for small folios.

--
Cheers,

David / dhildenb

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-01 13:34 ` David Hildenbrand
@ 2024-08-01 13:37   ` Mateusz Guzik
  2024-08-01 13:44     ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: Mateusz Guzik @ 2024-08-01 13:37 UTC (permalink / raw)
To: David Hildenbrand
Cc: Yin, Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
linux-mm, ying.huang, feng.tang

On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@redhat.com> wrote:
> This might improve it, at least for small folios I guess:
>
> [...]
>  static inline bool PageHuge(const struct page *page)
>  {
> -	return folio_test_hugetlb(page_folio(page));
> +	return PageCompound(page) && folio_test_hugetlb(page_folio(page));
>  }
>
> We would avoid the function call for small folios.
>

Why not massage _compound_head back to an inlineable form instead? For
all I know you may even register a small win in total.

--
Mateusz Guzik <mjguzik gmail.com>

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-01 13:37 ` Mateusz Guzik
@ 2024-08-01 13:44   ` David Hildenbrand
  2024-08-12 4:43     ` Yin Fengwei
  0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2024-08-01 13:44 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Yin, Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
linux-mm, ying.huang, feng.tang

On 01.08.24 15:37, Mateusz Guzik wrote:
> On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@redhat.com> wrote:
>> This might improve it, at least for small folios I guess:
>> [...]
>> We would avoid the function call for small folios.
>
> Why not massage _compound_head back to an inlineable form instead? For
> all I know you may even register a small win in total.

Agreed; likely it will increase code size a bit, which is why the
compiler decides not to inline. We could force it with __always_inline.

Finding ways to shrink page_fixed_fake_head() might be even better.

--
Cheers,

David / dhildenb

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-01 13:44 ` David Hildenbrand
@ 2024-08-12 4:43   ` Yin Fengwei
  2024-08-12 4:49     ` Mateusz Guzik
  0 siblings, 1 reply; 22+ messages in thread
From: Yin Fengwei @ 2024-08-12 4:43 UTC (permalink / raw)
To: David Hildenbrand, Mateusz Guzik
Cc: kernel test robot, Peter Xu, oe-lkp, lkp, linux-kernel,
Andrew Morton, Huacai Chen, Jason Gunthorpe, Matthew Wilcox,
Nathan Chancellor, Ryan Roberts, WANG Xuerui, linux-mm,
ying.huang, feng.tang

Hi David,

On 8/1/24 09:44, David Hildenbrand wrote:
> On 01.08.24 15:37, Mateusz Guzik wrote:
>> On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@redhat.com>
>> wrote:
>>> [...]
>>> This might improve it, at least for small folios I guess:
Do you want us to test this change? Or do you have further optimization
ongoing? Thanks.


Regards
Yin, Fengwei

>>> [...]
>>> We would avoid the function call for small folios.
>>
>> Why not massage _compound_head back to an inlineable form instead? For
>> all I know you may even register a small win in total.
>
> Agreed; likely it will increase code size a bit, which is why the
> compiler decides not to inline. We could force it with __always_inline.
>
> Finding ways to shrink page_fixed_fake_head() might be even better.

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-12 4:43 ` Yin Fengwei
@ 2024-08-12 4:49   ` Mateusz Guzik
  2024-08-12 8:12     ` David Hildenbrand
  2024-08-13 7:09     ` Yin Fengwei
  0 siblings, 2 replies; 22+ messages in thread
From: Mateusz Guzik @ 2024-08-12 4:49 UTC (permalink / raw)
To: Yin Fengwei
Cc: David Hildenbrand, kernel test robot, Peter Xu, oe-lkp, lkp,
linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
linux-mm, ying.huang, feng.tang

On Mon, Aug 12, 2024 at 12:43:08PM +0800, Yin Fengwei wrote:
> Hi David,
> [...]
> Do you want us to test this change? Or do you have further optimization
> ongoing? Thanks.

I verified the thing below boots; I have no idea about performance. If
it helps, it can be massaged later from a style perspective.

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5769fe6e4950..2d5d61ab385b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -194,34 +194,13 @@ enum pageflags {
 #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 DECLARE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
 
-/*
- * Return the real head page struct iff the @page is a fake head page, otherwise
- * return the @page itself. See Documentation/mm/vmemmap_dedup.rst.
- */
+const struct page *_page_fixed_fake_head(const struct page *page);
+
 static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
 {
 	if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
 		return page;
-
-	/*
-	 * Only addresses aligned with PAGE_SIZE of struct page may be fake head
-	 * struct page. The alignment check aims to avoid access the fields (
-	 * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly)
-	 * cold cacheline in some cases.
-	 */
-	if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
-	    test_bit(PG_head, &page->flags)) {
-		/*
-		 * We can safely access the field of the @page[1] with PG_head
-		 * because the @page is a compound page composed with at least
-		 * two contiguous pages.
-		 */
-		unsigned long head = READ_ONCE(page[1].compound_head);
-
-		if (likely(head & 1))
-			return (const struct page *)(head - 1);
-	}
-	return page;
+	return _page_fixed_fake_head(page);
 }
 #else
 static inline const struct page *page_fixed_fake_head(const struct page *page)
@@ -235,7 +214,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
 	return page_fixed_fake_head(page) != page;
 }
 
-static inline unsigned long _compound_head(const struct page *page)
+static __always_inline unsigned long _compound_head(const struct page *page)
 {
 	unsigned long head = READ_ONCE(page->compound_head);
 
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 829112b0a914..3fbc00db607a 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -19,6 +19,33 @@
 #include <asm/tlbflush.h>
 #include "hugetlb_vmemmap.h"
 
+/*
+ * Return the real head page struct iff the @page is a fake head page, otherwise
+ * return the @page itself. See Documentation/mm/vmemmap_dedup.rst.
+ */
+const struct page *_page_fixed_fake_head(const struct page *page)
+{
+	/*
+	 * Only addresses aligned with PAGE_SIZE of struct page may be fake head
+	 * struct page. The alignment check aims to avoid access the fields (
+	 * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly)
+	 * cold cacheline in some cases.
+	 */
+	if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
+	    test_bit(PG_head, &page->flags)) {
+		/*
+		 * We can safely access the field of the @page[1] with PG_head
+		 * because the @page is a compound page composed with at least
+		 * two contiguous pages.
+		 */
+		unsigned long head = READ_ONCE(page[1].compound_head);
+
+		if (likely(head & 1))
+			return (const struct page *)(head - 1);
+	}
+	return page;
+}
+
 /**
  * struct vmemmap_remap_walk - walk vmemmap page table
  *

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-12 4:49 ` Mateusz Guzik
@ 2024-08-12 8:12   ` David Hildenbrand
  2024-08-12 8:18     ` Mateusz Guzik
  0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2024-08-12 8:12 UTC (permalink / raw)
To: Mateusz Guzik, Yin Fengwei
Cc: kernel test robot, Peter Xu, oe-lkp, lkp, linux-kernel,
Andrew Morton, Huacai Chen, Jason Gunthorpe, Matthew Wilcox,
Nathan Chancellor, Ryan Roberts, WANG Xuerui, linux-mm,
ying.huang, feng.tang

On 12.08.24 06:49, Mateusz Guzik wrote:
> On Mon, Aug 12, 2024 at 12:43:08PM +0800, Yin Fengwei wrote:
>> Do you want us to test this change? Or do you have further optimization
>> ongoing? Thanks.
>
> I verified the thing below boots; I have no idea about performance. If
> it helps, it can be massaged later from a style perspective.
> [...]

As quite a lot of setups already run with the vmemmap optimization enabled, I
wonder how effective this would be (might need more fine tuning, did not look
at the generated code):

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 085dd8dcbea2..7ddcdbd712ec 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -233,7 +233,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
 	return page_fixed_fake_head(page) != page;
 }
 
-static inline unsigned long _compound_head(const struct page *page)
+static __always_inline unsigned long _compound_head(const struct page *page)
 {
 	unsigned long head = READ_ONCE(page->compound_head);

--
Cheers,

David / dhildenb

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-12 8:12 ` David Hildenbrand
@ 2024-08-12 8:18   ` Mateusz Guzik
  2024-08-12 8:23     ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: Mateusz Guzik @ 2024-08-12 8:18 UTC (permalink / raw)
To: David Hildenbrand
Cc: Yin Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
linux-mm, ying.huang, feng.tang

On Mon, Aug 12, 2024 at 10:12 AM David Hildenbrand <david@redhat.com> wrote:
> As quite a lot of setups already run with the vmemmap optimization enabled, I
> wonder how effective this would be (might need more fine tuning, did not look
> at the generated code):
>
> [...]
> -static inline unsigned long _compound_head(const struct page *page)
> +static __always_inline unsigned long _compound_head(const struct page *page)
>  {
>  	unsigned long head = READ_ONCE(page->compound_head);
>

Well, one may need to justify it with bloat-o-meter, which is why I did
not just straight up inline the entire thing.

But if you are down to fight opposition of that sort, I agree this is
the patch to benchmark. :)

--
Mateusz Guzik <mjguzik gmail.com>

* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression 2024-08-12 8:18 ` Mateusz Guzik @ 2024-08-12 8:23 ` David Hildenbrand 0 siblings, 0 replies; 22+ messages in thread From: David Hildenbrand @ 2024-08-12 8:23 UTC (permalink / raw) To: Mateusz Guzik Cc: Yin Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp, linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui, linux-mm, ying.huang, feng.tang On 12.08.24 10:18, Mateusz Guzik wrote: > On Mon, Aug 12, 2024 at 10:12 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 12.08.24 06:49, Mateusz Guzik wrote: >>> On Mon, Aug 12, 2024 at 12:43:08PM +0800, Yin Fengwei wrote: >>>> Hi David, >>>> >>>> On 8/1/24 09:44, David Hildenbrand wrote: >>>>> On 01.08.24 15:37, Mateusz Guzik wrote: >>>>>> On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@redhat.com> >>>>>> wrote: >>>>>>> >>>>>>> On 01.08.24 15:30, Mateusz Guzik wrote: >>>>>>>> On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote: >>>>>>>>> Yes indeed. fork() can be extremely sensitive to each >>>>>>>>> added instruction. >>>>>>>>> >>>>>>>>> I even pointed out to Peter why I didn't add the >>>>>>>>> PageHuge check in there >>>>>>>>> originally [1]. >>>>>>>>> >>>>>>>>> "Well, and I didn't want to have runtime-hugetlb checks in >>>>>>>>> PageAnonExclusive code called on certainly-not-hugetlb code paths." >>>>>>>>> >>>>>>>>> >>>>>>>>> We now have to do a page_folio(page) and then test for hugetlb. >>>>>>>>> >>>>>>>>> return folio_test_hugetlb(page_folio(page)); >>>>>>>>> >>>>>>>>> Nowadays, folio_test_hugetlb() will be faster than at >>>>>>>>> c0bff412e6 times, so >>>>>>>>> maybe at least part of the overhead is gone. >>>>>>>>> >>>>>>>> >>>>>>>> I'll note page_folio expands to a call to _compound_head. >>>>>>>> >>>>>>>> While _compound_head is declared as an inline, it ends up being big >>>>>>>> enough that the compiler decides to emit a real function instead and >>>>>>>> real func calls are not particularly cheap. >>>>>>>> >>>>>>>> I had a brief look with a profiler myself and for single-threaded usage >>>>>>>> the func is quite high up there, while it manages to get out with the >>>>>>>> first branch -- that is to say there is definitely performance lost for >>>>>>>> having a func call instead of an inlined branch. >>>>>>>> >>>>>>>> The routine is deinlined because of a call to page_fixed_fake_head, >>>>>>>> which itself is annotated with always_inline. >>>>>>>> >>>>>>>> This is of course patchable with minor shoveling. >>>>>>>> >>>>>>>> I did not go for it because stress-ng results were too unstable for me >>>>>>>> to confidently state win/loss. >>>>>>>> >>>>>>>> But should you want to whack the regression, this is what I would look >>>>>>>> into. >>>>>>>> >>>>>>> >>>>>>> This might improve it, at least for small folios I guess: >>>> Do you want us to test this change? Or you have further optimization >>>> ongoing? Thanks. >>> >>> I verified the thing below boots, I have no idea about performance. If >>> it helps it can be massaged later from style perspective. 
>> >> As quite a lot of setups already run with the vmemmap optimization enabled, I >> wonder how effective this would be (might need more fine tuning, did not look >> at the generated code): >> >> >> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h >> index 085dd8dcbea2..7ddcdbd712ec 100644 >> --- a/include/linux/page-flags.h >> +++ b/include/linux/page-flags.h >> @@ -233,7 +233,7 @@ static __always_inline int page_is_fake_head(const struct page *page) >> return page_fixed_fake_head(page) != page; >> } >> >> -static inline unsigned long _compound_head(const struct page *page) >> +static __always_inline unsigned long _compound_head(const struct page *page) >> { >> unsigned long head = READ_ONCE(page->compound_head); >> >> > > Well one may need to justify it with bloat-o-meter which is why I did > not just straight up inline the entire thing. > > But if you are down to fight opposition of the sort I agree this is > the patch to benchmark. :) I spotted that we already to that for PageHead()/PageTail()/page_is_fake_head(). So we effectively force-inline it everywhere except into _compound_head() I think. But yeah, measuring the bloat would be a necessary exercise. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 22+ messages in thread
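To make the cost argument concrete: once page_fixed_fake_head() collapses to an identity mapping (static key off), _compound_head() is a single load plus a tag-bit test, which is exactly what inlining preserves at the call site. A stand-alone model of just that encoding — the types and names here are simplified assumptions, not the kernel's struct page:

  #include <stdio.h>
  #include <stdint.h>

  /* Model only: bit 0 of compound_head tags "this is a tail page" and the
   * remaining bits encode the address of the head page. */
  struct fake_page { uintptr_t compound_head; };

  static const struct fake_page *model_compound_head(const struct fake_page *page)
  {
  	uintptr_t head = page->compound_head;

  	if (head & 1)	/* tail page: strip the tag to recover the head */
  		return (const struct fake_page *)(head - 1);
  	return page;	/* head or order-0 page: identity, one load + one test */
  }

  int main(void)
  {
  	struct fake_page pages[2];

  	pages[0].compound_head = 0;				/* head page */
  	pages[1].compound_head = (uintptr_t)&pages[0] | 1;	/* tail page */

  	printf("tail resolves to head: %d\n",
  	       model_compound_head(&pages[1]) == &pages[0]);
  	return 0;
  }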
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression 2024-08-12 4:49 ` Mateusz Guzik 2024-08-12 8:12 ` David Hildenbrand @ 2024-08-13 7:09 ` Yin Fengwei 2024-08-13 7:14 ` Mateusz Guzik 1 sibling, 1 reply; 22+ messages in thread From: Yin Fengwei @ 2024-08-13 7:09 UTC (permalink / raw) To: Mateusz Guzik Cc: David Hildenbrand, kernel test robot, Peter Xu, oe-lkp, lkp, linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui, linux-mm, ying.huang, feng.tang On 8/12/24 00:49, Mateusz Guzik wrote: > On Mon, Aug 12, 2024 at 12:43:08PM +0800, Yin Fengwei wrote: >> Hi David, >> >> On 8/1/24 09:44, David Hildenbrand wrote: >>> On 01.08.24 15:37, Mateusz Guzik wrote: >>>> On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@redhat.com> >>>> wrote: >>>>> >>>>> On 01.08.24 15:30, Mateusz Guzik wrote: >>>>>> On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote: >>>>>>> Yes indeed. fork() can be extremely sensitive to each >>>>>>> added instruction. >>>>>>> >>>>>>> I even pointed out to Peter why I didn't add the >>>>>>> PageHuge check in there >>>>>>> originally [1]. >>>>>>> >>>>>>> "Well, and I didn't want to have runtime-hugetlb checks in >>>>>>> PageAnonExclusive code called on certainly-not-hugetlb code paths." >>>>>>> >>>>>>> >>>>>>> We now have to do a page_folio(page) and then test for hugetlb. >>>>>>> >>>>>>> return folio_test_hugetlb(page_folio(page)); >>>>>>> >>>>>>> Nowadays, folio_test_hugetlb() will be faster than at >>>>>>> c0bff412e6 times, so >>>>>>> maybe at least part of the overhead is gone. >>>>>>> >>>>>> >>>>>> I'll note page_folio expands to a call to _compound_head. >>>>>> >>>>>> While _compound_head is declared as an inline, it ends up being big >>>>>> enough that the compiler decides to emit a real function instead and >>>>>> real func calls are not particularly cheap. >>>>>> >>>>>> I had a brief look with a profiler myself and for single-threaded usage >>>>>> the func is quite high up there, while it manages to get out with the >>>>>> first branch -- that is to say there is definitely performance lost for >>>>>> having a func call instead of an inlined branch. >>>>>> >>>>>> The routine is deinlined because of a call to page_fixed_fake_head, >>>>>> which itself is annotated with always_inline. >>>>>> >>>>>> This is of course patchable with minor shoveling. >>>>>> >>>>>> I did not go for it because stress-ng results were too unstable for me >>>>>> to confidently state win/loss. >>>>>> >>>>>> But should you want to whack the regression, this is what I would look >>>>>> into. >>>>>> >>>>> >>>>> This might improve it, at least for small folios I guess: >> Do you want us to test this change? Or you have further optimization >> ongoing? Thanks. > > I verified the thing below boots, I have no idea about performance. If > it helps it can be massaged later from style perspective. > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 5769fe6e4950..2d5d61ab385b 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -194,34 +194,13 @@ enum pageflags { > #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP > DECLARE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key); > > -/* > - * Return the real head page struct iff the @page is a fake head page, otherwise > - * return the @page itself. See Documentation/mm/vmemmap_dedup.rst. 
> - */ > +const struct page *_page_fixed_fake_head(const struct page *page); > + > static __always_inline const struct page *page_fixed_fake_head(const struct page *page) > { > if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key)) > return page; > - > - /* > - * Only addresses aligned with PAGE_SIZE of struct page may be fake head > - * struct page. The alignment check aims to avoid access the fields ( > - * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly) > - * cold cacheline in some cases. > - */ > - if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) && > - test_bit(PG_head, &page->flags)) { > - /* > - * We can safely access the field of the @page[1] with PG_head > - * because the @page is a compound page composed with at least > - * two contiguous pages. > - */ > - unsigned long head = READ_ONCE(page[1].compound_head); > - > - if (likely(head & 1)) > - return (const struct page *)(head - 1); > - } > - return page; > + return _page_fixed_fake_head(page); > } > #else > static inline const struct page *page_fixed_fake_head(const struct page *page) > @@ -235,7 +214,7 @@ static __always_inline int page_is_fake_head(const struct page *page) > return page_fixed_fake_head(page) != page; > } > > -static inline unsigned long _compound_head(const struct page *page) > +static __always_inline unsigned long _compound_head(const struct page *page) > { > unsigned long head = READ_ONCE(page->compound_head); > > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c > index 829112b0a914..3fbc00db607a 100644 > --- a/mm/hugetlb_vmemmap.c > +++ b/mm/hugetlb_vmemmap.c > @@ -19,6 +19,33 @@ > #include <asm/tlbflush.h> > #include "hugetlb_vmemmap.h" > > +/* > + * Return the real head page struct iff the @page is a fake head page, otherwise > + * return the @page itself. See Documentation/mm/vmemmap_dedup.rst. > + */ > +const struct page *_page_fixed_fake_head(const struct page *page) > +{ > + /* > + * Only addresses aligned with PAGE_SIZE of struct page may be fake head > + * struct page. The alignment check aims to avoid access the fields ( > + * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly) > + * cold cacheline in some cases. > + */ > + if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) && > + test_bit(PG_head, &page->flags)) { > + /* > + * We can safely access the field of the @page[1] with PG_head > + * because the @page is a compound page composed with at least > + * two contiguous pages. 
> +	 */
> +	unsigned long head = READ_ONCE(page[1].compound_head);
> +
> +	if (likely(head & 1))
> +		return (const struct page *)(head - 1);
> +	}
> +	return page;
> +}
> +
>  /**
>   * struct vmemmap_remap_walk - walk vmemmap page table
>   *
>
The change can resolve the regression (from -3% to 0.5%):

Please note:
  9cb28da54643ad464c47585cd5866c30b0218e67 is the parent commit
  3f16e4b516ef02d9461b7e0b6c50e05ba0811886 is the commit with above patch
  c0bff412e67b781d761e330ff9578aa9ed2be79e is the commit which introduced regression

=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_threads/testtime/test/cpufreq_governor/debug-setup:
  lkp-icl-2sp8/stress-ng/debian-12-x86_64-20240206.cgz/x86_64-rhel-8.3/gcc-12/100%/60s/clone/performance/yfw_test2

commit:
  9cb28da54643ad464c47585cd5866c30b0218e67
  3f16e4b516ef02d9461b7e0b6c50e05ba0811886
  c0bff412e67b781d761e330ff9578aa9ed2be79e

9cb28da54643ad46 3f16e4b516ef02d9461b7e0b6c5 c0bff412e67b781d761e330ff95
---------------- --------------------------- ---------------------------
       fail:runs  %reproduction     fail:runs  %reproduction     fail:runs
           |            |               |            |               |
          3:3           0%             3:3           0%             3:3    stress-ng.clone.microsecs_per_clone.pass
          3:3           0%             3:3           0%             3:3    stress-ng.clone.pass
        %stddev      %change         %stddev      %change         %stddev
            \           |                \           |                \
          2904         -0.6%           2886         +3.7%           3011   stress-ng.clone.microsecs_per_clone
        563520         +0.5%         566296         -3.1%         546122   stress-ng.clone.ops
          9306         +0.5%           9356         -3.0%           9024   stress-ng.clone.ops_per_sec

BTW, the change needs to export symbol _page_fixed_fake_head otherwise
some modules hit build error.


Regards
Yin, Fengwei

^ permalink raw reply	[flat|nested] 22+ messages in thread
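On the build error Fengwei mentions: the out-of-line variant would additionally need the new helper exported next to its definition so that modular users of compound_head() keep linking. A sketch only; whether plain EXPORT_SYMBOL or EXPORT_SYMBOL_GPL is the right choice is an assumption, not something settled in this thread:

  /* mm/hugetlb_vmemmap.c, after the _page_fixed_fake_head() definition */
  EXPORT_SYMBOL_GPL(_page_fixed_fake_head);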
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression 2024-08-13 7:09 ` Yin Fengwei @ 2024-08-13 7:14 ` Mateusz Guzik 2024-08-14 3:02 ` Yin Fengwei 0 siblings, 1 reply; 22+ messages in thread From: Mateusz Guzik @ 2024-08-13 7:14 UTC (permalink / raw) To: Yin Fengwei Cc: David Hildenbrand, kernel test robot, Peter Xu, oe-lkp, lkp, linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui, linux-mm, ying.huang, feng.tang On Tue, Aug 13, 2024 at 9:09 AM Yin Fengwei <fengwei.yin@intel.com> wrote: > > On 8/12/24 00:49, Mateusz Guzik wrote: > > On Mon, Aug 12, 2024 at 12:43:08PM +0800, Yin Fengwei wrote: > >> Hi David, > >> > >> On 8/1/24 09:44, David Hildenbrand wrote: > >>> On 01.08.24 15:37, Mateusz Guzik wrote: > >>>> On Thu, Aug 1, 2024 at 3:34 PM David Hildenbrand <david@redhat.com> > >>>> wrote: > >>>>> > >>>>> On 01.08.24 15:30, Mateusz Guzik wrote: > >>>>>> On Thu, Aug 01, 2024 at 08:49:27AM +0200, David Hildenbrand wrote: > >>>>>>> Yes indeed. fork() can be extremely sensitive to each > >>>>>>> added instruction. > >>>>>>> > >>>>>>> I even pointed out to Peter why I didn't add the > >>>>>>> PageHuge check in there > >>>>>>> originally [1]. > >>>>>>> > >>>>>>> "Well, and I didn't want to have runtime-hugetlb checks in > >>>>>>> PageAnonExclusive code called on certainly-not-hugetlb code paths." > >>>>>>> > >>>>>>> > >>>>>>> We now have to do a page_folio(page) and then test for hugetlb. > >>>>>>> > >>>>>>> return folio_test_hugetlb(page_folio(page)); > >>>>>>> > >>>>>>> Nowadays, folio_test_hugetlb() will be faster than at > >>>>>>> c0bff412e6 times, so > >>>>>>> maybe at least part of the overhead is gone. > >>>>>>> > >>>>>> > >>>>>> I'll note page_folio expands to a call to _compound_head. > >>>>>> > >>>>>> While _compound_head is declared as an inline, it ends up being big > >>>>>> enough that the compiler decides to emit a real function instead and > >>>>>> real func calls are not particularly cheap. > >>>>>> > >>>>>> I had a brief look with a profiler myself and for single-threaded usage > >>>>>> the func is quite high up there, while it manages to get out with the > >>>>>> first branch -- that is to say there is definitely performance lost for > >>>>>> having a func call instead of an inlined branch. > >>>>>> > >>>>>> The routine is deinlined because of a call to page_fixed_fake_head, > >>>>>> which itself is annotated with always_inline. > >>>>>> > >>>>>> This is of course patchable with minor shoveling. > >>>>>> > >>>>>> I did not go for it because stress-ng results were too unstable for me > >>>>>> to confidently state win/loss. > >>>>>> > >>>>>> But should you want to whack the regression, this is what I would look > >>>>>> into. > >>>>>> > >>>>> > >>>>> This might improve it, at least for small folios I guess: > >> Do you want us to test this change? Or you have further optimization > >> ongoing? Thanks. > > > > I verified the thing below boots, I have no idea about performance. If > > it helps it can be massaged later from style perspective. 
> > > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > > index 5769fe6e4950..2d5d61ab385b 100644 > > --- a/include/linux/page-flags.h > > +++ b/include/linux/page-flags.h > > @@ -194,34 +194,13 @@ enum pageflags { > > #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP > > DECLARE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key); > > > > -/* > > - * Return the real head page struct iff the @page is a fake head page, otherwise > > - * return the @page itself. See Documentation/mm/vmemmap_dedup.rst. > > - */ > > +const struct page *_page_fixed_fake_head(const struct page *page); > > + > > static __always_inline const struct page *page_fixed_fake_head(const struct page *page) > > { > > if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key)) > > return page; > > - > > - /* > > - * Only addresses aligned with PAGE_SIZE of struct page may be fake head > > - * struct page. The alignment check aims to avoid access the fields ( > > - * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly) > > - * cold cacheline in some cases. > > - */ > > - if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) && > > - test_bit(PG_head, &page->flags)) { > > - /* > > - * We can safely access the field of the @page[1] with PG_head > > - * because the @page is a compound page composed with at least > > - * two contiguous pages. > > - */ > > - unsigned long head = READ_ONCE(page[1].compound_head); > > - > > - if (likely(head & 1)) > > - return (const struct page *)(head - 1); > > - } > > - return page; > > + return _page_fixed_fake_head(page); > > } > > #else > > static inline const struct page *page_fixed_fake_head(const struct page *page) > > @@ -235,7 +214,7 @@ static __always_inline int page_is_fake_head(const struct page *page) > > return page_fixed_fake_head(page) != page; > > } > > > > -static inline unsigned long _compound_head(const struct page *page) > > +static __always_inline unsigned long _compound_head(const struct page *page) > > { > > unsigned long head = READ_ONCE(page->compound_head); > > > > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c > > index 829112b0a914..3fbc00db607a 100644 > > --- a/mm/hugetlb_vmemmap.c > > +++ b/mm/hugetlb_vmemmap.c > > @@ -19,6 +19,33 @@ > > #include <asm/tlbflush.h> > > #include "hugetlb_vmemmap.h" > > > > +/* > > + * Return the real head page struct iff the @page is a fake head page, otherwise > > + * return the @page itself. See Documentation/mm/vmemmap_dedup.rst. > > + */ > > +const struct page *_page_fixed_fake_head(const struct page *page) > > +{ > > + /* > > + * Only addresses aligned with PAGE_SIZE of struct page may be fake head > > + * struct page. The alignment check aims to avoid access the fields ( > > + * e.g. compound_head) of the @page[1]. It can avoid touch a (possibly) > > + * cold cacheline in some cases. > > + */ > > + if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) && > > + test_bit(PG_head, &page->flags)) { > > + /* > > + * We can safely access the field of the @page[1] with PG_head > > + * because the @page is a compound page composed with at least > > + * two contiguous pages. > > + */ > > + unsigned long head = READ_ONCE(page[1].compound_head); > > + > > + if (likely(head & 1)) > > + return (const struct page *)(head - 1); > > + } > > + return page; > > +} > > + > > /** > > * struct vmemmap_remap_walk - walk vmemmap page table > > * > > > > The change can resolve the regression (from -3% to 0.5%): > thanks for testing would you mind benchmarking the change which merely force-inlines _compund_page? 
https://lore.kernel.org/linux-mm/66c4fcc5-47f6-438c-a73a-3af6e19c3200@redhat.com/

> Please note:
> 9cb28da54643ad464c47585cd5866c30b0218e67 is the parent commit
> 3f16e4b516ef02d9461b7e0b6c50e05ba0811886 is the commit with above
> patch
> c0bff412e67b781d761e330ff9578aa9ed2be79e is the commit which
> introduced regression
>
>
> =========================================================================================
> tbox_group/testcase/rootfs/kconfig/compiler/nr_threads/testtime/test/cpufreq_governor/debug-setup:
>
>   lkp-icl-2sp8/stress-ng/debian-12-x86_64-20240206.cgz/x86_64-rhel-8.3/gcc-12/100%/60s/clone/performance/yfw_test2
>
> commit:
>   9cb28da54643ad464c47585cd5866c30b0218e67
>   3f16e4b516ef02d9461b7e0b6c50e05ba0811886
>   c0bff412e67b781d761e330ff9578aa9ed2be79e
>
> 9cb28da54643ad46 3f16e4b516ef02d9461b7e0b6c5 c0bff412e67b781d761e330ff95
> ---------------- --------------------------- ---------------------------
>        fail:runs  %reproduction     fail:runs  %reproduction     fail:runs
>            |            |               |            |               |
>           3:3           0%             3:3           0%             3:3    stress-ng.clone.microsecs_per_clone.pass
>           3:3           0%             3:3           0%             3:3    stress-ng.clone.pass
>         %stddev      %change         %stddev      %change         %stddev
>             \           |                \           |                \
>           2904         -0.6%           2886         +3.7%           3011   stress-ng.clone.microsecs_per_clone
>         563520         +0.5%         566296         -3.1%         546122   stress-ng.clone.ops
>           9306         +0.5%           9356         -3.0%           9024   stress-ng.clone.ops_per_sec
>
>
> BTW, the change needs to export symbol _page_fixed_fake_head otherwise
> some modules hit build error.
>

ok, I'll patch that up if this approach will be the thing to do

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-13  7:14           ` Mateusz Guzik
@ 2024-08-14  3:02             ` Yin Fengwei
  2024-08-14  4:10               ` Mateusz Guzik
  0 siblings, 1 reply; 22+ messages in thread
From: Yin Fengwei @ 2024-08-14  3:02 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: David Hildenbrand, kernel test robot, Peter Xu, oe-lkp, lkp,
	linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
	Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
	linux-mm, ying.huang, feng.tang

On 8/13/24 03:14, Mateusz Guzik wrote:
> thanks for testing
>
> would you mind benchmarking the change which merely force-inlines _compund_page?
>
> https://lore.kernel.org/linux-mm/66c4fcc5-47f6-438c-a73a-3af6e19c3200@redhat.com/
This change can resolve the regression also:

=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_threads/testtime/test/cpufreq_governor/debug-setup:
  lkp-icl-2sp8/stress-ng/debian-12-x86_64-20240206.cgz/x86_64-rhel-8.3/gcc-12/100%/60s/clone/performance/yfw_test2

commit:
  9cb28da54643ad464c47585cd5866c30b0218e67    parent commit
  c0bff412e67b781d761e330ff9578aa9ed2be79e    commit introduced regression
  450b96d2c4f740152e03c6b79b484a10347b3ea9    the change proposed by David in above link

9cb28da54643ad46 c0bff412e67b781d761e330ff95 450b96d2c4f740152e03c6b79b4
---------------- --------------------------- ---------------------------
        %stddev      %change         %stddev      %change         %stddev
            \           |                \           |                \
          2906         +3.5%           3007         +0.4%           2919   stress-ng.clone.microsecs_per_clone
        562884         -2.9%         546575         -0.6%         559718   stress-ng.clone.ops
          9295         -2.9%           9028         -0.5%           9248   stress-ng.clone.ops_per_sec


Regards
Yin, Fengwei

^ permalink raw reply	[flat|nested] 22+ messages in thread
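For anyone wanting to sanity-check numbers like these outside the 0-day harness, the job parameters above (clone stressor, all CPUs, 60 s, performance governor) roughly correspond to the following invocation; the exact lkp wrapper and environment differ, so treat it as an approximation:

  $ stress-ng --clone "$(nproc)" --timeout 60s --metrics-brief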
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression 2024-08-14 3:02 ` Yin Fengwei @ 2024-08-14 4:10 ` Mateusz Guzik 2024-08-14 9:45 ` David Hildenbrand 0 siblings, 1 reply; 22+ messages in thread From: Mateusz Guzik @ 2024-08-14 4:10 UTC (permalink / raw) To: Yin Fengwei Cc: David Hildenbrand, kernel test robot, Peter Xu, oe-lkp, lkp, linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui, linux-mm, ying.huang, feng.tang On Wed, Aug 14, 2024 at 5:02 AM Yin Fengwei <fengwei.yin@intel.com> wrote: > > On 8/13/24 03:14, Mateusz Guzik wrote: > > would you mind benchmarking the change which merely force-inlines _compund_page? > > > > https://lore.kernel.org/linux-mm/66c4fcc5-47f6-438c-a73a-3af6e19c3200@redhat.com/ > This change can resolve the regression also: Great, thanks. David, I guess this means it would be fine to inline the entire thing at least from this bench standpoint. Given that this is your idea I guess you should do the needful(tm)? :) > ========================================================================================= > tbox_group/testcase/rootfs/kconfig/compiler/nr_threads/testtime/test/cpufreq_governor/debug-setup: > > lkp-icl-2sp8/stress-ng/debian-12-x86_64-20240206.cgz/x86_64-rhel-8.3/gcc-12/100%/60s/clone/performance/yfw_test2 > > commit: > 9cb28da54643ad464c47585cd5866c30b0218e67 parent commit > c0bff412e67b781d761e330ff9578aa9ed2be79e commit introduced regression > 450b96d2c4f740152e03c6b79b484a10347b3ea9 the change proposed by David > in above link > > 9cb28da54643ad46 c0bff412e67b781d761e330ff95 450b96d2c4f740152e03c6b79b4 > ---------------- --------------------------- --------------------------- > %stddev %change %stddev %change %stddev > \ | \ | \ > 2906 +3.5% 3007 +0.4% 2919 > stress-ng.clone.microsecs_per_clone > 562884 -2.9% 546575 -0.6% 559718 > stress-ng.clone.ops > 9295 -2.9% 9028 -0.5% 9248 > stress-ng.clone.ops_per_sec > > > > Regards > Yin, Fengwei > -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression 2024-08-14 4:10 ` Mateusz Guzik @ 2024-08-14 9:45 ` David Hildenbrand 2024-08-14 11:06 ` Mateusz Guzik 0 siblings, 1 reply; 22+ messages in thread From: David Hildenbrand @ 2024-08-14 9:45 UTC (permalink / raw) To: Mateusz Guzik, Yin Fengwei Cc: kernel test robot, Peter Xu, oe-lkp, lkp, linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe, Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui, linux-mm, ying.huang, feng.tang On 14.08.24 06:10, Mateusz Guzik wrote: > On Wed, Aug 14, 2024 at 5:02 AM Yin Fengwei <fengwei.yin@intel.com> wrote: >> >> On 8/13/24 03:14, Mateusz Guzik wrote: >>> would you mind benchmarking the change which merely force-inlines _compund_page? >>> >>> https://lore.kernel.org/linux-mm/66c4fcc5-47f6-438c-a73a-3af6e19c3200@redhat.com/ >> This change can resolve the regression also: > > Great, thanks. > > David, I guess this means it would be fine to inline the entire thing > at least from this bench standpoint. Given that this is your idea I > guess you should do the needful(tm)? :) Testing diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 5769fe6e4950..25e25b34f4a0 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -235,7 +235,7 @@ static __always_inline int page_is_fake_head(const struct page *page) return page_fixed_fake_head(page) != page; } -static inline unsigned long _compound_head(const struct page *page) +static __always_inline unsigned long _compound_head(const struct page *page) { unsigned long head = READ_ONCE(page->compound_head); With a kernel-config based on something derived from Fedora config-6.8.9-100.fc38.x86_64 for convenience with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y add/remove: 15/14 grow/shrink: 79/87 up/down: 12836/-13917 (-1081) Function old new delta change_pte_range - 2308 +2308 iommu_put_dma_cookie 454 1276 +822 get_hwpoison_page 2007 2580 +573 end_bbio_data_read 1171 1626 +455 end_bbio_meta_read 492 934 +442 ext4_finish_bio 773 1208 +435 fq_ring_free_locked 128 541 +413 end_bbio_meta_write 493 872 +379 gup_fast_fallback 4207 4568 +361 v1_free_pgtable 166 519 +353 iommu_v1_map_pages 2747 3098 +351 end_bbio_data_write 609 960 +351 fsverity_verify_bio 334 656 +322 follow_page_mask 3399 3719 +320 __read_end_io 316 635 +319 btrfs_end_super_write 494 789 +295 iommu_alloc_pages_node.constprop 286 572 +286 free_buffers.part - 271 +271 gup_must_unshare - 268 +268 smaps_pte_range 1285 1513 +228 pagemap_pmd_range 2189 2393 +204 iommu_alloc_pages_node - 193 +193 smaps_hugetlb_range 705 897 +192 follow_page_pte 1584 1758 +174 __migrate_device_pages 2435 2595 +160 unpin_user_pages_dirty_lock 205 362 +157 _compound_head - 150 +150 unpin_user_pages 143 282 +139 put_ref_page.part - 126 +126 iomap_finish_ioend 866 972 +106 iomap_read_end_io 673 763 +90 end_bbio_meta_read.cold 42 131 +89 btrfs_do_readpage 1759 1845 +86 extent_write_cache_pages 2133 2212 +79 end_bbio_data_write.cold 32 108 +76 end_bbio_meta_write.cold 40 108 +68 __read_end_io.cold 25 91 +66 btrfs_end_super_write.cold 25 89 +64 ext4_finish_bio.cold 118 178 +60 fsverity_verify_bio.cold 25 84 +59 block_write_begin 217 274 +57 end_bbio_data_read.cold 378 426 +48 __pfx__compound_head - 48 +48 copy_hugetlb_page_range 3050 3097 +47 lruvec_stat_mod_folio.constprop 585 630 +45 iomap_finish_ioend.cold 163 202 +39 md_bitmap_file_unmap 150 187 +37 free_pgd_range 1949 1985 +36 prep_move_freepages_block 319 349 +30 iommu_alloc_pages_node.cold - 25 +25 
iomap_read_end_io.cold 65 89 +24 zap_huge_pmd 874 897 +23 cont_write_begin.cold 108 130 +22 skb_splice_from_iter 822 843 +21 set_pmd_migration_entry 1037 1058 +21 zerocopy_fill_skb_from_iter 1321 1340 +19 pagemap_scan_pmd_entry 3261 3279 +18 try_grab_folio_fast 452 469 +17 change_huge_pmd 1174 1191 +17 folio_put 48 64 +16 __pfx_set_p4d - 16 +16 __pfx_put_ref_page.part - 16 +16 __pfx_lruvec_stat_mod_folio.constprop 208 224 +16 __pfx_iommu_alloc_pages_node.constprop 16 32 +16 __pfx_iommu_alloc_pages_node - 16 +16 __pfx_gup_must_unshare - 16 +16 __pfx_free_buffers.part - 16 +16 __pfx_folio_put 48 64 +16 __pfx_change_pte_range - 16 +16 __pfx___pte 32 48 +16 offline_pages 1962 1975 +13 memfd_pin_folios 1284 1297 +13 uprobe_write_opcode 2062 2073 +11 set_p4d - 11 +11 __pte 22 33 +11 copy_page_from_iter_atomic 1714 1724 +10 __migrate_device_pages.cold 60 70 +10 try_to_unmap_one 3355 3364 +9 try_to_migrate_one 3310 3319 +9 stable_page_flags 1034 1043 +9 io_sqe_buffer_register 1404 1413 +9 dio_zero_block 644 652 +8 add_ra_bio_pages.constprop.isra 1542 1550 +8 __add_to_kill 969 977 +8 btrfs_writepage_fixup_worker 1199 1206 +7 write_protect_page 1186 1192 +6 iommu_v2_map_pages.cold 145 151 +6 gup_fast_fallback.cold 112 117 +5 try_to_merge_one_page 1857 1860 +3 __apply_to_page_range 2235 2238 +3 wbc_account_cgroup_owner 217 219 +2 change_protection.cold 105 107 +2 can_change_pte_writable 354 356 +2 vmf_insert_pfn_pud 699 700 +1 split_huge_page_to_list_to_order.cold 152 151 -1 pte_pfn 40 39 -1 move_pages 5270 5269 -1 isolate_single_pageblock 1056 1055 -1 __apply_to_page_range.cold 92 91 -1 unmap_page_range.cold 88 86 -2 do_huge_pmd_numa_page 1175 1173 -2 free_pgd_range.cold 162 159 -3 copy_page_to_iter 329 326 -3 copy_page_range.cold 149 146 -3 copy_page_from_iter 307 304 -3 can_finish_ordered_extent 551 548 -3 __replace_page 1133 1130 -3 __reset_isolation_pfn 645 641 -4 dio_send_cur_page 1113 1108 -5 __access_remote_vm 1010 1005 -5 pagemap_hugetlb_category 468 459 -9 extent_write_locked_range 1148 1139 -9 unuse_pte_range 1834 1821 -13 do_migrate_range 1935 1922 -13 __get_user_pages 1952 1938 -14 migrate_vma_collect_pmd 2817 2802 -15 copy_page_to_iter_nofault 2373 2358 -15 hugetlb_fault 4054 4038 -16 __pfx_shake_page 16 - -16 __pfx_put_page 16 - -16 __pfx_pfn_swap_entry_to_page 32 16 -16 __pfx_gup_must_unshare.part 16 - -16 __pfx_gup_folio_next 16 - -16 __pfx_free_buffers 16 - -16 __pfx___get_unpoison_page 16 - -16 btrfs_cleanup_ordered_extents 622 604 -18 read_rdev 694 673 -21 isolate_migratepages_block.cold 222 197 -25 hugetlb_mfill_atomic_pte 1869 1844 -25 folio_pte_batch.constprop 1020 995 -25 hugetlb_reserve_pages 1468 1441 -27 __alloc_fresh_hugetlb_folio 676 649 -27 intel_pasid_alloc_table.cold 83 52 -31 __pfx_iommu_put_pages_list 48 16 -32 __pfx_PageHuge 32 - -32 __blockdev_direct_IO.cold 952 920 -32 io_ctl_prepare_pages 832 794 -38 __handle_mm_fault 4237 4195 -42 finish_fault 1007 962 -45 __pfx_pfn_swap_entry_folio 64 16 -48 vm_normal_folio_pmd 84 34 -50 vm_normal_folio 84 34 -50 set_migratetype_isolate 1429 1375 -54 do_set_pmd 618 561 -57 can_change_pmd_writable 293 229 -64 __unmap_hugepage_range 2389 2325 -64 do_fault 1187 1121 -66 fault_dirty_shared_page 425 358 -67 madvise_free_huge_pmd 863 792 -71 insert_page_into_pte_locked.isra 502 429 -73 restore_exclusive_pte 539 463 -76 isolate_migratepages_block 5436 5355 -81 __do_fault 366 276 -90 set_pte_range 593 502 -91 follow_devmap_pmd 559 468 -91 __pfx_bio_first_folio 144 48 -96 shake_page 105 - -105 hugetlb_change_protection 2314 2204 -110 
hugetlb_wp 2134 2017 -117 __blockdev_direct_IO 5063 4946 -117 skb_tx_error 272 149 -123 put_page 123 - -123 gup_must_unshare.part 135 - -135 PageHuge 136 - -136 ksm_scan_thread 9172 9032 -140 intel_pasid_alloc_table 596 447 -149 copy_huge_pmd 1539 1385 -154 skb_split 1534 1376 -158 split_huge_pmd_locked 4024 3865 -159 skb_append_pagefrags 663 504 -159 memory_failure 2784 2624 -160 unpoison_memory 1328 1167 -161 cont_write_begin 959 793 -166 pfn_swap_entry_to_page 250 82 -168 skb_pp_cow_data 1539 1367 -172 gup_folio_next 180 - -180 intel_pasid_get_entry.isra 607 425 -182 v2_alloc_pgtable 309 126 -183 do_huge_pmd_wp_page 1173 988 -185 bio_first_folio.cold 315 105 -210 unmap_page_range 6091 5873 -218 split_huge_page_to_list_to_order 4141 3905 -236 move_pages_huge_pmd 2053 1813 -240 free_buffers 286 - -286 iommu_v2_map_pages 1722 1428 -294 soft_offline_page 2149 1843 -306 do_wp_page 3340 2993 -347 do_swap_page 4619 4265 -354 md_import_device 1002 635 -367 copy_page_range 7436 7040 -396 __get_unpoison_page 415 - -415 pfn_swap_entry_folio 596 149 -447 iommu_put_pages_list 1071 344 -727 bio_first_folio 2322 774 -1548 change_protection 5008 2790 -2218 Total: Before=32786363, After=32785282, chg -0.00% -- Cheers, David / dhildenb ^ permalink raw reply related [flat|nested] 22+ messages in thread
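One way to confirm that a given caller really absorbed the inlined fast path (rather than still branching to the out-of-line helper) is to disassemble it; a sketch using copy_page_range, one of the symbols in the list above — the exact name the compiler emits for the deinlined helper may vary:

  $ objdump -dr vmlinux | awk '/<copy_page_range>:/,/^$/' | grep -c 'call.*compound_head'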
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-14  9:45             ` David Hildenbrand
@ 2024-08-14 11:06               ` Mateusz Guzik
  2024-08-14 12:02                 ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: Mateusz Guzik @ 2024-08-14 11:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Yin Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
	linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
	Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
	linux-mm, ying.huang, feng.tang

On Wed, Aug 14, 2024 at 11:45 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 14.08.24 06:10, Mateusz Guzik wrote:
> > On Wed, Aug 14, 2024 at 5:02 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >>
> >> On 8/13/24 03:14, Mateusz Guzik wrote:
> >>> would you mind benchmarking the change which merely force-inlines _compund_page?
> >>>
> >>> https://lore.kernel.org/linux-mm/66c4fcc5-47f6-438c-a73a-3af6e19c3200@redhat.com/
> >> This change can resolve the regression also:
> >
> > Great, thanks.
> >
> > David, I guess this means it would be fine to inline the entire thing
> > at least from this bench standpoint. Given that this is your idea I
> > guess you should do the needful(tm)? :)
>
> Testing
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 5769fe6e4950..25e25b34f4a0 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -235,7 +235,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
>  	return page_fixed_fake_head(page) != page;
>  }
>
> -static inline unsigned long _compound_head(const struct page *page)
> +static __always_inline unsigned long _compound_head(const struct page *page)
>  {
>  	unsigned long head = READ_ONCE(page->compound_head);
>
>
> With a kernel-config based on something derived from Fedora
> config-6.8.9-100.fc38.x86_64 for convenience with
>
> CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
>
> add/remove: 15/14 grow/shrink: 79/87 up/down: 12836/-13917 (-1081)
[snip]
> Total: Before=32786363, After=32785282, chg -0.00%

I guess there should be no opposition then?

Given that this is your patch I presume you are going to see this through.

I don't want any mention or cc on the patch, thanks for understanding :)

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [linus:master] [mm] c0bff412e6: stress-ng.clone.ops_per_sec -2.9% regression
  2024-08-14 11:06               ` Mateusz Guzik
@ 2024-08-14 12:02                 ` David Hildenbrand
  0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2024-08-14 12:02 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Yin Fengwei, kernel test robot, Peter Xu, oe-lkp, lkp,
	linux-kernel, Andrew Morton, Huacai Chen, Jason Gunthorpe,
	Matthew Wilcox, Nathan Chancellor, Ryan Roberts, WANG Xuerui,
	linux-mm, ying.huang, feng.tang

On 14.08.24 13:06, Mateusz Guzik wrote:
> On Wed, Aug 14, 2024 at 11:45 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 14.08.24 06:10, Mateusz Guzik wrote:
>>> On Wed, Aug 14, 2024 at 5:02 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>>>
>>>> On 8/13/24 03:14, Mateusz Guzik wrote:
>>>>> would you mind benchmarking the change which merely force-inlines _compund_page?
>>>>>
>>>>> https://lore.kernel.org/linux-mm/66c4fcc5-47f6-438c-a73a-3af6e19c3200@redhat.com/
>>>> This change can resolve the regression also:
>>>
>>> Great, thanks.
>>>
>>> David, I guess this means it would be fine to inline the entire thing
>>> at least from this bench standpoint. Given that this is your idea I
>>> guess you should do the needful(tm)? :)
>>
>> Testing
>>
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index 5769fe6e4950..25e25b34f4a0 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -235,7 +235,7 @@ static __always_inline int page_is_fake_head(const struct page *page)
>>  	return page_fixed_fake_head(page) != page;
>>  }
>>
>> -static inline unsigned long _compound_head(const struct page *page)
>> +static __always_inline unsigned long _compound_head(const struct page *page)
>>  {
>>  	unsigned long head = READ_ONCE(page->compound_head);
>>
>>
>> With a kernel-config based on something derived from Fedora
>> config-6.8.9-100.fc38.x86_64 for convenience with
>>
>> CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
>>
>> add/remove: 15/14 grow/shrink: 79/87 up/down: 12836/-13917 (-1081)
> [snip]
>> Total: Before=32786363, After=32785282, chg -0.00%
>
> I guess there should be no opposition then?
>
> Given that this is your patch I presume you are going to see this through.

I was hoping that you could send an official patch, after all you did
most of the work here.

>
> I don't want any mention or cc on the patch, thanks for understanding :)

If I have to send it I will respect it.

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 22+ messages in thread