* [linus:master] [fortify] 239d87327d: vm-scalability.throughput 17.3% improvement
From: kernel test robot @ 2025-01-09 6:57 UTC
To: Kees Cook
Cc: oe-lkp, lkp, linux-kernel, Thomas Weißschuh, Nilay Shroff,
Yury Norov, Greg Kroah-Hartman, linux-hardening, oliver.sang
Hello,
kernel test robot noticed a 17.3% improvement of vm-scalability.throughput on:
commit: 239d87327dcd361b0098038995f8908f3296864f ("fortify: Hide run-time copy size from value range tracking")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
testcase: vm-scalability
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz (Cooper Lake) with 192G memory
parameters:
runtime: 300s
size: 256G
test: msync
cpufreq_governor: performance
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250109/202501091405.a1fcb1ed-lkp@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-9.4/debian-12-x86_64-20240206.cgz/300s/256G/lkp-cpl-4sp2/msync/vm-scalability
commit:
f06e108a3d ("Compiler Attributes: disable __counted_by for clang < 19.1.3")
239d87327d ("fortify: Hide run-time copy size from value range tracking")
f06e108a3dc53c0f            239d87327dcd361b0098038995f
----------------            ---------------------------
         %stddev      %change          %stddev
             \            |                \
654.00 ± 13% +62.7% 1063 ± 41% perf-c2c.HITM.local
74.03 ± 49% +113.3% 157.89 ± 40% sched_debug.cfs_rq:/.removed.runnable_avg.max
74.03 ± 49% +113.3% 157.89 ± 40% sched_debug.cfs_rq:/.removed.util_avg.max
9843704 ± 12% -31.4% 6748836 ± 24% numa-meminfo.node0.Active(file)
81609 ± 13% -18.3% 66698 ± 10% numa-meminfo.node1.Writeback
3197765 ± 12% +34.7% 4307440 ± 12% numa-meminfo.node3.MemFree
0.07 ± 2% +0.0 0.07 ± 2% mpstat.cpu.all.irq%
0.05 ± 2% +0.0 0.06 mpstat.cpu.all.soft%
2.17 ± 3% +0.4 2.58 ± 3% mpstat.cpu.all.sys%
0.42 ± 3% +0.1 0.49 ± 2% mpstat.cpu.all.usr%
2462818 +24.2% 3060034 vmstat.io.bo
8.76 ± 2% +14.7% 10.06 ± 3% vmstat.procs.r
12294 +13.6% 13967 ± 4% vmstat.system.cs
40339 ± 2% +6.8% 43096 ± 4% vmstat.system.in
6203763 ± 14% +67.5% 10389382 ± 23% numa-numastat.node0.local_node
6274485 ± 14% +66.8% 10464891 ± 23% numa-numastat.node0.numa_hit
6773452 ± 13% -59.5% 2743979 ± 68% numa-numastat.node0.numa_miss
6842787 ± 12% -58.8% 2819949 ± 66% numa-numastat.node0.other_node
7434683 ± 19% +36.7% 10159657 ± 26% numa-numastat.node1.local_node
7522237 ± 19% +36.4% 10257654 ± 26% numa-numastat.node1.numa_hit
16256 ± 2% +26.1% 20495 vm-scalability.median
5.43 ± 6% -2.5 2.92 ± 26% vm-scalability.median_stddev%
9.99 ± 10% -3.0 6.95 ± 9% vm-scalability.stddev%
5678018 ± 3% +17.3% 6661631 ± 2% vm-scalability.throughput
1.573e+09 +25.0% 1.966e+09 vm-scalability.time.file_system_outputs
16615 ± 3% +27.0% 21107 vm-scalability.time.involuntary_context_switches
2.099e+08 +25.0% 2.624e+08 vm-scalability.time.minor_page_faults
561.00 +21.8% 683.33 ± 3% vm-scalability.time.percent_of_cpu_this_job_got
1358 ± 3% +23.6% 1679 ± 3% vm-scalability.time.system_time
418.15 ± 2% +19.1% 497.88 vm-scalability.time.user_time
1135302 +11.7% 1268430 vm-scalability.time.voluntary_context_switches
8.846e+08 +25.0% 1.106e+09 vm-scalability.workload
2478521 ± 12% -33.1% 1658879 ± 24% numa-vmstat.node0.nr_active_file
45774950 ± 9% +28.8% 58943198 ± 5% numa-vmstat.node0.nr_dirtied
45774950 ± 9% +28.8% 58943198 ± 5% numa-vmstat.node0.nr_written
2476252 ± 12% -33.1% 1657048 ± 24% numa-vmstat.node0.nr_zone_active_file
6274222 ± 14% +66.8% 10464563 ± 23% numa-vmstat.node0.numa_hit
6203500 ± 14% +67.5% 10389054 ± 23% numa-vmstat.node0.numa_local
6773452 ± 13% -59.5% 2743979 ± 68% numa-vmstat.node0.numa_miss
6842787 ± 12% -58.8% 2819949 ± 66% numa-vmstat.node0.numa_other
49693812 ± 8% +20.0% 59611215 ± 8% numa-vmstat.node1.nr_dirtied
49693812 ± 8% +20.0% 59611215 ± 8% numa-vmstat.node1.nr_written
7521777 ± 19% +36.4% 10257607 ± 26% numa-vmstat.node1.numa_hit
7434223 ± 19% +36.7% 10159609 ± 26% numa-vmstat.node1.numa_local
2660800 ± 8% +22.1% 3250098 ± 5% numa-vmstat.node1.workingset_activate_file
3153899 ± 8% +19.5% 3769627 ± 5% numa-vmstat.node1.workingset_refault_file
2660800 ± 8% +22.1% 3250098 ± 5% numa-vmstat.node1.workingset_restore_file
53368316 ± 9% +20.2% 64130806 ± 8% numa-vmstat.node2.nr_dirtied
53368316 ± 9% +20.2% 64130806 ± 8% numa-vmstat.node2.nr_written
7683 ± 8% -20.2% 6129 ± 4% numa-vmstat.node2.workingset_nodes
47788357 ± 10% +32.1% 63105437 ± 10% numa-vmstat.node3.nr_dirtied
803731 ± 13% +34.0% 1076708 ± 12% numa-vmstat.node3.nr_free_pages
47788357 ± 10% +32.1% 63105437 ± 10% numa-vmstat.node3.nr_written
30030 ± 15% +75.3% 52638 ± 23% proc-vmstat.allocstall_movable
27837 ± 13% +58.8% 44214 ± 22% proc-vmstat.compact_fail
45835 ± 10% +88.6% 86440 ± 23% proc-vmstat.compact_stall
17998 ± 21% +134.6% 42225 ± 25% proc-vmstat.compact_success
22633426 +1.2% 22911084 proc-vmstat.nr_active_anon
11444651 -10.8% 10211517 ± 6% proc-vmstat.nr_active_file
1.966e+08 +25.0% 2.458e+08 proc-vmstat.nr_dirtied
3658433 -2.6% 3563342 proc-vmstat.nr_dirty
9170138 +12.1% 10276853 ± 6% proc-vmstat.nr_inactive_file
22567898 +1.2% 22846647 proc-vmstat.nr_shmem
1.966e+08 +25.0% 2.458e+08 proc-vmstat.nr_written
22633454 +1.2% 22911113 proc-vmstat.nr_zone_active_anon
11444767 -10.8% 10211682 ± 6% proc-vmstat.nr_zone_active_file
9170083 +12.1% 10276805 ± 6% proc-vmstat.nr_zone_inactive_file
3740131 -2.7% 3639414 proc-vmstat.nr_zone_write_pending
22011951 ± 15% +33.7% 29430963 ± 10% proc-vmstat.pgactivate
2824 +16.2% 3280 ± 22% proc-vmstat.pgalloc_dma
2.856e+08 +19.6% 3.416e+08 ± 3% proc-vmstat.pgalloc_normal
2.112e+08 +24.9% 2.637e+08 proc-vmstat.pgfault
2.886e+08 +19.3% 3.444e+08 ± 3% proc-vmstat.pgfree
6020 ± 9% +88.5% 11348 ± 44% proc-vmstat.pgmajfault
7.865e+08 +25.0% 9.832e+08 proc-vmstat.pgpgout
124025 +16.5% 144503 proc-vmstat.pgreuse
3641011 ± 15% +48.1% 5392566 ± 14% proc-vmstat.pgsteal_direct
2499 +26.9% 3171 proc-vmstat.unevictable_pgs_culled
29425 -4.0% 28243 proc-vmstat.workingset_nodes
9.93 +6.5% 10.58 perf-stat.i.MPKI
4.61e+09 +25.7% 5.793e+09 perf-stat.i.branch-instructions
0.32 ± 3% -0.0 0.29 perf-stat.i.branch-miss-rate%
12693622 +13.8% 14449439 perf-stat.i.branch-misses
83.47 +2.3 85.75 perf-stat.i.cache-miss-rate%
1.591e+08 +39.5% 2.221e+08 perf-stat.i.cache-misses
1.891e+08 +36.6% 2.584e+08 perf-stat.i.cache-references
12325 +13.6% 13999 ± 4% perf-stat.i.context-switches
1.28 -11.7% 1.13 ± 2% perf-stat.i.cpi
2.864e+10 +18.9% 3.405e+10 ± 2% perf-stat.i.cpu-cycles
343.31 +5.4% 361.81 perf-stat.i.cpu-migrations
141.92 -15.8% 119.51 perf-stat.i.cycles-between-cache-misses
1.792e+10 +29.5% 2.32e+10 perf-stat.i.instructions
1.01 +13.0% 1.14 perf-stat.i.ipc
5.54 +24.5% 6.90 perf-stat.i.metric.K/sec
624456 +24.6% 778107 perf-stat.i.minor-faults
624469 +24.6% 778135 perf-stat.i.page-faults
8.90 +7.8% 9.59 perf-stat.overall.MPKI
0.28 -0.0 0.25 perf-stat.overall.branch-miss-rate%
84.14 +1.8 85.91 perf-stat.overall.cache-miss-rate%
1.62 -8.3% 1.49 ± 2% perf-stat.overall.cpi
182.46 -14.9% 155.29 ± 2% perf-stat.overall.cycles-between-cache-misses
0.62 +9.0% 0.67 ± 2% perf-stat.overall.ipc
6475 +3.7% 6715 perf-stat.overall.path-length
4.639e+09 +25.0% 5.8e+09 perf-stat.ps.branch-instructions
12777070 +13.1% 14448212 perf-stat.ps.branch-misses
1.605e+08 +38.8% 2.229e+08 perf-stat.ps.cache-misses
1.908e+08 +36.0% 2.594e+08 perf-stat.ps.cache-references
12289 +13.6% 13955 ± 4% perf-stat.ps.context-switches
2.929e+10 +18.2% 3.461e+10 ± 2% perf-stat.ps.cpu-cycles
344.20 +5.3% 362.39 perf-stat.ps.cpu-migrations
1.805e+10 +28.8% 2.324e+10 perf-stat.ps.instructions
626335 +24.0% 776865 perf-stat.ps.minor-faults
626348 +24.0% 776893 perf-stat.ps.page-faults
5.728e+12 +29.6% 7.425e+12 perf-stat.total.instructions
34.75 ± 2% -17.3 17.48 ± 87% perf-profile.calltrace.cycles-pp.read_pages.page_cache_ra_order.filemap_fault.__do_fault.do_read_fault
34.74 ± 2% -16.4 18.29 ± 79% perf-profile.calltrace.cycles-pp.iomap_readahead.read_pages.page_cache_ra_order.filemap_fault.__do_fault
34.68 ± 2% -16.4 18.25 ± 79% perf-profile.calltrace.cycles-pp.iomap_readpage_iter.iomap_readahead.read_pages.page_cache_ra_order.filemap_fault
34.48 ± 2% -16.4 18.07 ± 80% perf-profile.calltrace.cycles-pp.zero_user_segments.iomap_readpage_iter.iomap_readahead.read_pages.page_cache_ra_order
34.28 ± 2% -16.3 17.97 ± 80% perf-profile.calltrace.cycles-pp.memset_orig.zero_user_segments.iomap_readpage_iter.iomap_readahead.read_pages
7.38 ± 7% +1.8 9.17 ± 13% perf-profile.calltrace.cycles-pp.asm_exc_page_fault.do_access
0.00 +6.5 6.54 ± 66% perf-profile.calltrace.cycles-pp.memcpy_orig.copy_page_from_iter_atomic.generic_perform_write.shmem_file_write_iter.do_iter_readv_writev
34.90 ± 2% -16.5 18.41 ± 79% perf-profile.children.cycles-pp.read_pages
34.89 ± 2% -16.5 18.41 ± 79% perf-profile.children.cycles-pp.iomap_readahead
34.83 ± 2% -16.5 18.36 ± 79% perf-profile.children.cycles-pp.iomap_readpage_iter
34.62 ± 2% -16.4 18.18 ± 80% perf-profile.children.cycles-pp.zero_user_segments
34.57 ± 2% -16.4 18.15 ± 80% perf-profile.children.cycles-pp.memset_orig
0.33 ± 7% -0.2 0.16 ± 87% perf-profile.children.cycles-pp.prep_compound_page
0.24 ± 18% -0.1 0.10 ± 83% perf-profile.children.cycles-pp.page_counter_try_charge
0.25 ± 8% -0.1 0.19 ± 16% perf-profile.children.cycles-pp.__mod_node_page_state
0.08 ± 13% -0.0 0.05 ± 47% perf-profile.children.cycles-pp.__mod_lruvec_state
0.08 ± 5% +0.0 0.10 ± 8% perf-profile.children.cycles-pp.___perf_sw_event
0.03 ±123% +0.1 0.10 ± 33% perf-profile.children.cycles-pp.on_each_cpu_cond_mask
0.03 ±123% +0.1 0.10 ± 33% perf-profile.children.cycles-pp.smp_call_function_many_cond
0.02 ±123% +0.1 0.10 ± 45% perf-profile.children.cycles-pp.up_write
0.07 ± 22% +0.1 0.19 ± 56% perf-profile.children.cycles-pp.free_tail_page_prepare
0.20 ± 19% +0.4 0.58 ± 61% perf-profile.children.cycles-pp.shmem_get_folio_gfp
0.20 ± 18% +0.4 0.61 ± 61% perf-profile.children.cycles-pp.shmem_write_begin
0.24 ± 20% +0.5 0.73 ± 60% perf-profile.children.cycles-pp.flush_tlb_mm_range
0.07 ± 12% +0.6 0.62 ± 63% perf-profile.children.cycles-pp.folio_unlock
0.29 ± 18% +0.6 0.85 ± 60% perf-profile.children.cycles-pp.ptep_clear_flush
0.04 ± 83% +0.6 0.64 ± 65% perf-profile.children.cycles-pp.shmem_write_end
0.33 ± 27% +0.8 1.12 ± 65% perf-profile.children.cycles-pp.page_vma_mkclean_one
0.33 ± 27% +0.8 1.12 ± 64% perf-profile.children.cycles-pp.page_mkclean_one
0.53 ± 2% +0.8 1.33 ± 57% perf-profile.children.cycles-pp.rmap_walk_file
0.35 ± 28% +0.8 1.18 ± 65% perf-profile.children.cycles-pp.folio_mkclean
0.00 +6.6 6.58 ± 66% perf-profile.children.cycles-pp.memcpy_orig
34.07 ± 2% -16.2 17.90 ± 80% perf-profile.self.cycles-pp.memset_orig
2.63 ± 19% -2.6 0.05 ±101% perf-profile.self.cycles-pp.copy_page_from_iter_atomic
0.25 ± 3% -0.1 0.12 ± 83% perf-profile.self.cycles-pp.folio_alloc_noprof
0.19 ± 14% -0.1 0.08 ± 80% perf-profile.self.cycles-pp.page_counter_try_charge
0.25 ± 9% -0.1 0.18 ± 17% perf-profile.self.cycles-pp.__mod_node_page_state
0.06 ± 7% +0.0 0.09 ± 17% perf-profile.self.cycles-pp.xfs_buffered_write_iomap_begin
0.00 +0.1 0.08 ± 29% perf-profile.self.cycles-pp.__cond_resched
1.94 ± 8% +0.5 2.47 ± 16% perf-profile.self.cycles-pp.do_access
0.07 ± 12% +0.5 0.62 ± 64% perf-profile.self.cycles-pp.folio_unlock
0.00 +6.5 6.50 ± 66% perf-profile.self.cycles-pp.memcpy_orig
0.00 ±200% +483.3% 0.01 ± 11% perf-sched.sch_delay.avg.ms.__cond_resched.__alloc_pages_noprof.alloc_pages_mpol_noprof.folio_alloc_noprof.page_cache_ra_order
0.02 ± 51% +269.3% 0.06 ± 44% perf-sched.sch_delay.avg.ms.__cond_resched.__wait_for_common.affine_move_task.__set_cpus_allowed_ptr.__sched_setaffinity
0.01 ±121% +788.9% 0.09 ± 51% perf-sched.sch_delay.avg.ms.__cond_resched.shmem_get_folio_gfp.shmem_write_begin.generic_perform_write.shmem_file_write_iter
0.00 ±200% +566.7% 0.01 ± 14% perf-sched.sch_delay.avg.ms.__cond_resched.shrink_folio_list.evict_folios.try_to_shrink_lruvec.shrink_one
0.01 ± 17% +174.5% 0.02 ± 76% perf-sched.sch_delay.avg.ms.__cond_resched.writeback_get_folio.writeback_iter.iomap_writepages.xfs_vm_writepages
0.01 +20.0% 0.01 perf-sched.sch_delay.avg.ms.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.01 ± 17% -69.3% 0.00 ± 20% perf-sched.sch_delay.avg.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
0.01 ± 9% +1197.6% 0.09 ±128% perf-sched.sch_delay.avg.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
0.04 ± 25% -40.3% 0.02 ± 30% perf-sched.sch_delay.avg.ms.schedule_timeout.__wait_for_common.__flush_workqueue.xlog_cil_push_now.isra
0.01 ± 6% -100.0% 0.00 perf-sched.sch_delay.avg.ms.schedule_timeout.io_schedule_timeout.__wait_for_common.submit_bio_wait
0.08 ± 68% +450.4% 0.45 ± 22% perf-sched.sch_delay.avg.ms.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
0.15 ± 34% +78.2% 0.26 ± 22% perf-sched.sch_delay.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
0.00 ± 50% -100.0% 0.00 perf-sched.sch_delay.avg.ms.xlog_force_lsn.xfs_log_force_seq.xfs_file_fsync.__do_sys_msync
0.01 ± 5% -100.0% 0.00 perf-sched.sch_delay.avg.ms.xlog_wait_on_iclog.xfs_file_fsync.__do_sys_msync.do_syscall_64
0.00 ±200% +636.1% 0.01 ± 17% perf-sched.sch_delay.max.ms.__cond_resched.__alloc_pages_noprof.alloc_pages_mpol_noprof.folio_alloc_noprof.page_cache_ra_order
6.00 ± 95% +186.8% 17.22 ± 16% perf-sched.sch_delay.max.ms.__cond_resched.__wait_for_common.affine_move_task.__set_cpus_allowed_ptr.__sched_setaffinity
0.03 ±162% +320.5% 0.13 ± 48% perf-sched.sch_delay.max.ms.__cond_resched.shmem_get_folio_gfp.shmem_write_begin.generic_perform_write.shmem_file_write_iter
0.00 ±200% +876.2% 0.01 ± 21% perf-sched.sch_delay.max.ms.__cond_resched.shrink_folio_list.evict_folios.try_to_shrink_lruvec.shrink_one
0.01 ± 52% +221.1% 0.02 ± 59% perf-sched.sch_delay.max.ms.__cond_resched.xfs_write_fault.do_page_mkwrite.do_shared_fault.do_pte_missing
0.12 ±153% -92.3% 0.01 ± 21% perf-sched.sch_delay.max.ms.__cond_resched.zap_pmd_range.isra.0.unmap_page_range
0.35 ±155% +263.8% 1.28 ± 44% perf-sched.sch_delay.max.ms.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt.[unknown]
2.22 ± 44% -50.9% 1.09 ± 23% perf-sched.sch_delay.max.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
0.32 ± 25% -65.3% 0.11 ± 12% perf-sched.sch_delay.max.ms.schedule_timeout.__wait_for_common.__flush_workqueue.xlog_cil_push_now.isra
0.01 ± 10% +105.6% 0.01 ± 61% perf-sched.sch_delay.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
0.02 ± 62% -100.0% 0.00 perf-sched.sch_delay.max.ms.schedule_timeout.io_schedule_timeout.__wait_for_common.submit_bio_wait
0.01 ± 35% +140.7% 0.03 ± 25% perf-sched.sch_delay.max.ms.schedule_timeout.kswapd_try_to_sleep.kswapd.kthread
5.35 ± 13% +120.5% 11.80 ± 30% perf-sched.sch_delay.max.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
0.01 ± 27% +105.6% 0.01 ± 29% perf-sched.sch_delay.max.ms.wait_for_partner.fifo_open.do_dentry_open.vfs_open
0.04 ±144% -100.0% 0.00 perf-sched.sch_delay.max.ms.xlog_force_lsn.xfs_log_force_seq.xfs_file_fsync.__do_sys_msync
0.02 ± 49% -100.0% 0.00 perf-sched.sch_delay.max.ms.xlog_wait_on_iclog.xfs_file_fsync.__do_sys_msync.do_syscall_64
34409 ± 4% +31.4% 45208 ± 19% perf-sched.total_wait_and_delay.count.ms
1.05 ± 66% -97.2% 0.03 ± 59% perf-sched.wait_and_delay.avg.ms.__cond_resched.loop_process_work.process_one_work.worker_thread.kthread
533.92 ±140% -97.3% 14.58 ±223% perf-sched.wait_and_delay.avg.ms.__cond_resched.ww_mutex_lock.drm_gem_vunmap_unlocked.drm_gem_fb_vunmap.drm_atomic_helper_commit_planes
13.99 ± 14% -100.0% 0.00 perf-sched.wait_and_delay.avg.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.x64_sys_call
4.37 ± 61% -80.6% 0.85 ± 49% perf-sched.wait_and_delay.avg.ms.io_schedule.folio_wait_bit_common.folio_wait_writeback.__filemap_fdatawait_range
26.95 ± 25% -40.1% 16.14 ± 44% perf-sched.wait_and_delay.avg.ms.io_schedule.rq_qos_wait.wbt_wait.__rq_qos_throttle
128.59 ± 17% +229.1% 423.13 ± 16% perf-sched.wait_and_delay.avg.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
40.40 ± 6% +8.7% 43.93 perf-sched.wait_and_delay.avg.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
70.57 ±130% +604.7% 497.32 ± 31% perf-sched.wait_and_delay.avg.ms.xlog_wait_on_iclog.xlog_cil_push_work.process_one_work.worker_thread
328.60 ± 12% -100.0% 0.00 perf-sched.wait_and_delay.count.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.x64_sys_call
10179 ± 15% -97.7% 237.17 ± 45% perf-sched.wait_and_delay.count.io_schedule.rq_qos_wait.wbt_wait.__rq_qos_throttle
4488 +9.4% 4911 perf-sched.wait_and_delay.count.pipe_read.vfs_read.ksys_read.do_syscall_64
214.60 ± 24% -70.4% 63.50 ± 8% perf-sched.wait_and_delay.count.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
2937 ± 65% +699.5% 23480 perf-sched.wait_and_delay.count.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
803.09 ± 73% -99.6% 3.53 ±183% perf-sched.wait_and_delay.max.ms.__cond_resched.loop_process_work.process_one_work.worker_thread.kthread
533.92 ±140% -97.3% 14.58 ±223% perf-sched.wait_and_delay.max.ms.__cond_resched.ww_mutex_lock.drm_gem_vunmap_unlocked.drm_gem_fb_vunmap.drm_atomic_helper_commit_planes
462.45 ± 20% -100.0% 0.00 perf-sched.wait_and_delay.max.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.x64_sys_call
49.56 ± 39% -57.7% 20.98 ± 47% perf-sched.wait_and_delay.max.ms.io_schedule.folio_wait_bit_common.folio_wait_writeback.__filemap_fdatawait_range
110.68 ± 24% -65.4% 38.31 ± 44% perf-sched.wait_and_delay.max.ms.io_schedule.rq_qos_wait.wbt_wait.__rq_qos_throttle
49.88 ± 7% +167.8% 133.58 ± 54% perf-sched.wait_and_delay.max.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
261.23 ±122% +620.9% 1883 ± 41% perf-sched.wait_and_delay.max.ms.xlog_wait_on_iclog.xlog_cil_push_work.process_one_work.worker_thread
16.91 ±122% +139.9% 40.56 ± 31% perf-sched.wait_time.avg.ms.__cond_resched.down_write.xfs_ilock_for_iomap.xfs_buffered_write_iomap_begin.iomap_iter
14.02 ± 65% -94.3% 0.80 ±200% perf-sched.wait_time.avg.ms.__cond_resched.generic_perform_write.shmem_file_write_iter.do_iter_readv_writev.vfs_iter_write
1.04 ± 67% -98.2% 0.02 ± 34% perf-sched.wait_time.avg.ms.__cond_resched.loop_process_work.process_one_work.worker_thread.kthread
17.79 ±200% +1124.8% 217.93 ± 42% perf-sched.wait_time.avg.ms.__cond_resched.shrink_folio_list.evict_folios.try_to_shrink_lruvec.shrink_one
27.14 ± 32% -38.6% 16.67 ± 9% perf-sched.wait_time.avg.ms.__cond_resched.writeback_get_folio.writeback_iter.iomap_writepages.xfs_vm_writepages
531.34 ±141% -94.0% 31.76 ±108% perf-sched.wait_time.avg.ms.__cond_resched.ww_mutex_lock.drm_gem_vunmap_unlocked.drm_gem_fb_vunmap.drm_atomic_helper_commit_planes
23.22 ± 49% -78.0% 5.10 ±107% perf-sched.wait_time.avg.ms.__cond_resched.zap_pmd_range.isra.0.unmap_page_range
13.99 ± 14% +72.1% 24.08 ± 4% perf-sched.wait_time.avg.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.x64_sys_call
0.40 +15.0% 0.46 ± 4% perf-sched.wait_time.avg.ms.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
4.36 ± 62% -72.2% 1.21 ± 39% perf-sched.wait_time.avg.ms.io_schedule.folio_wait_bit_common.folio_wait_writeback.__filemap_fdatawait_range
26.94 ± 25% -28.1% 19.37 ± 3% perf-sched.wait_time.avg.ms.io_schedule.rq_qos_wait.wbt_wait.__rq_qos_throttle
4.12 ± 2% -25.7% 3.06 ± 13% perf-sched.wait_time.avg.ms.rcu_gp_kthread.kthread.ret_from_fork.ret_from_fork_asm
128.58 ± 17% +229.0% 423.04 ± 16% perf-sched.wait_time.avg.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
14.18 ± 30% -71.0% 4.12 ± 57% perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.__flush_workqueue.xlog_cil_push_now.isra
16.09 ± 62% -100.0% 0.00 perf-sched.wait_time.avg.ms.schedule_timeout.io_schedule_timeout.__wait_for_common.submit_bio_wait
40.15 ± 7% +9.3% 43.90 perf-sched.wait_time.avg.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
53.98 ± 17% +554.9% 353.50 ± 25% perf-sched.wait_time.avg.ms.sigsuspend.__x64_sys_rt_sigsuspend.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.09 ±127% -100.0% 0.00 perf-sched.wait_time.avg.ms.xlog_force_lsn.xfs_log_force_seq.xfs_file_fsync.__do_sys_msync
36.56 ± 50% -100.0% 0.00 perf-sched.wait_time.avg.ms.xlog_wait_on_iclog.xfs_file_fsync.__do_sys_msync.do_syscall_64
79.98 ±107% +521.8% 497.31 ± 31% perf-sched.wait_time.avg.ms.xlog_wait_on_iclog.xlog_cil_push_work.process_one_work.worker_thread
35.31 ± 50% +40.6% 49.65 ± 10% perf-sched.wait_time.max.ms.__cond_resched.__kmalloc_noprof.ifs_alloc.isra.0
16.91 ±122% +185.9% 48.33 ± 7% perf-sched.wait_time.max.ms.__cond_resched.down_write.xfs_ilock_for_iomap.xfs_buffered_write_iomap_begin.iomap_iter
560.10 ± 61% -94.7% 29.57 ±221% perf-sched.wait_time.max.ms.__cond_resched.generic_perform_write.shmem_file_write_iter.do_iter_readv_writev.vfs_iter_write
803.06 ± 73% -99.8% 1.82 ±176% perf-sched.wait_time.max.ms.__cond_resched.loop_process_work.process_one_work.worker_thread.kthread
35.27 ± 38% -56.9% 15.19 ± 62% perf-sched.wait_time.max.ms.__cond_resched.rmap_walk_file.folio_mkclean.folio_clear_dirty_for_io.writeback_get_folio
17.79 ±200% +1874.3% 351.31 ± 41% perf-sched.wait_time.max.ms.__cond_resched.shrink_folio_list.evict_folios.try_to_shrink_lruvec.shrink_one
53.77 ± 20% -45.7% 29.19 ± 32% perf-sched.wait_time.max.ms.__cond_resched.writeback_get_folio.writeback_iter.iomap_writepages.xfs_vm_writepages
531.34 ±141% -93.9% 32.25 ±107% perf-sched.wait_time.max.ms.__cond_resched.ww_mutex_lock.drm_gem_vunmap_unlocked.drm_gem_fb_vunmap.drm_atomic_helper_commit_planes
36.60 ± 50% +51.7% 55.51 ± 9% perf-sched.wait_time.max.ms.__cond_resched.xfs_write_fault.do_page_mkwrite.do_shared_fault.do_pte_missing
27.41 +19.5% 32.74 ± 5% perf-sched.wait_time.max.ms.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
49.56 ± 39% -49.0% 25.26 ± 12% perf-sched.wait_time.max.ms.io_schedule.folio_wait_bit_common.folio_wait_writeback.__filemap_fdatawait_range
110.67 ± 24% -59.0% 45.33 ± 2% perf-sched.wait_time.max.ms.io_schedule.rq_qos_wait.wbt_wait.__rq_qos_throttle
39.78 ± 33% +428.9% 210.38 ±167% perf-sched.wait_time.max.ms.irqentry_exit_to_user_mode.asm_exc_page_fault.[unknown]
2.53 ± 3% +9.8% 2.78 ± 4% perf-sched.wait_time.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
41.46 ± 63% -100.0% 0.00 perf-sched.wait_time.max.ms.schedule_timeout.io_schedule_timeout.__wait_for_common.submit_bio_wait
49.87 ± 7% +167.8% 133.58 ± 54% perf-sched.wait_time.max.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
5.77 ±175% -100.0% 0.00 perf-sched.wait_time.max.ms.xlog_force_lsn.xfs_log_force_seq.xfs_file_fsync.__do_sys_msync
92.08 ± 21% -100.0% 0.00 perf-sched.wait_time.max.ms.xlog_wait_on_iclog.xfs_file_fsync.__do_sys_msync.do_syscall_64
290.66 ±102% +547.9% 1883 ± 41% perf-sched.wait_time.max.ms.xlog_wait_on_iclog.xlog_cil_push_work.process_one_work.worker_thread
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [linus:master] [fortify] 239d87327d: vm-scalability.throughput 17.3% improvement
From: Kees Cook @ 2025-01-09 16:51 UTC
To: kernel test robot
Cc: oe-lkp, lkp, linux-kernel, Thomas Weißschuh, Nilay Shroff,
Yury Norov, Greg Kroah-Hartman, linux-hardening
On Thu, Jan 09, 2025 at 02:57:58PM +0800, kernel test robot wrote:
> kernel test robot noticed a 17.3% improvement of vm-scalability.throughput on:
>
> commit: 239d87327dcd361b0098038995f8908f3296864f ("fortify: Hide run-time copy size from value range tracking")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
Well that is unexpected. There should be no binary output difference
with that patch. I will investigate...
--
Kees Cook
* Re: [linus:master] [fortify] 239d87327d: vm-scalability.throughput 17.3% improvement
From: Kees Cook @ 2025-01-09 20:38 UTC
To: kernel test robot
Cc: oe-lkp, lkp, linux-kernel, Thomas Weißschuh, Nilay Shroff,
Yury Norov, Greg Kroah-Hartman, linux-hardening
On Thu, Jan 09, 2025 at 08:51:44AM -0800, Kees Cook wrote:
> On Thu, Jan 09, 2025 at 02:57:58PM +0800, kernel test robot wrote:
> > kernel test robot noticed a 17.3% improvement of vm-scalability.throughput on:
> >
> > commit: 239d87327dcd361b0098038995f8908f3296864f ("fortify: Hide run-time copy size from value range tracking")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> Well that is unexpected. There should be no binary output difference
> with that patch. I will investigate...
It looks like hiding the size value from GCC has the side-effect of
breaking memcpy inlining in many places. I would expect this to make
things _slower_, though. O_o
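
For reference, the mechanism in question is roughly the following (a
paraphrase of the change, not the exact diff): the fortified memcpy
wrapper launders the size through an empty asm statement, so GCC's
value range tracker loses sight of it.

/* OPTIMIZER_HIDE_VAR(), as defined in include/linux/compiler.h: */
#define OPTIMIZER_HIDE_VAR(var) \
	__asm__ ("" : "=r" (var) : "0" (var))

	/* ...and, roughly, in the fortify memcpy checking path: */
	size_t __fortify_size = (size_t)(size);
	OPTIMIZER_HIDE_VAR(__fortify_size);	/* now unbounded as far as GCC knows */
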
--
Kees Cook
* Re: [linus:master] [fortify] 239d87327d: vm-scalability.throughput 17.3% improvement
From: Mateusz Guzik @ 2025-01-09 20:52 UTC
To: Kees Cook
Cc: kernel test robot, oe-lkp, lkp, linux-kernel,
Thomas Weißschuh, Nilay Shroff, Yury Norov,
Greg Kroah-Hartman, linux-hardening
On Thu, Jan 09, 2025 at 12:38:04PM -0800, Kees Cook wrote:
> On Thu, Jan 09, 2025 at 08:51:44AM -0800, Kees Cook wrote:
> > On Thu, Jan 09, 2025 at 02:57:58PM +0800, kernel test robot wrote:
> > > kernel test robot noticed a 17.3% improvement of vm-scalability.throughput on:
> > >
> > > commit: 239d87327dcd361b0098038995f8908f3296864f ("fortify: Hide run-time copy size from value range tracking")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> > Well that is unexpected. There should be no binary output difference
> > with that patch. I will investigate...
>
> It looks like hiding the size value from GCC has the side-effect of
> breaking memcpy inlining in many places. I would expect this to make
> things _slower_, though. O_o
>
This depends on what was emitted in place and what CPU is executing it.
Notably if gcc elected to emit rep movs{q,b}, the CPU at hand does
not have FSRM and the size is low enough, then such code can indeed be
slower than suffering a call to memcpy (which does not issue rep mov).
I had seen gcc go to great pains to align a buffer for rep movsq even
when it was guaranteed to not be necessary for example.
Can you disasm an example affected spot?
Gcc has a bunch of magic switches to tell it what to emit inline; the
thing to do is to convince it to roll with a bunch of mov instructions
(not rep mov) for sizes small enough(tm). What constitutes small enough
depends on the uarch.
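
A minimal standalone sketch (mine, not taken from the kernel tree)
makes the difference easy to see; compare the codegen of these two
helpers with gcc -O2 -S:

#include <string.h>

/* Size known at compile time: typically inlined as a few plain movs. */
void copy_const(void *dst, const void *src)
{
	memcpy(dst, src, 16);
}

/* Size only range-known ([0..255] here): depending on the uarch tuning
 * and memcpy strategy, gcc may emit rep movs, an unrolled loop, or a
 * plain call. */
void copy_var(void *dst, const void *src, unsigned char len)
{
	memcpy(dst, src, len);
}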
* Re: [linus:master] [fortify] 239d87327d: vm-scalability.throughput 17.3% improvement
From: Kees Cook @ 2025-01-09 21:12 UTC
To: Mateusz Guzik
Cc: kernel test robot, oe-lkp, lkp, linux-kernel,
Thomas Weißschuh, Nilay Shroff, Yury Norov,
Greg Kroah-Hartman, linux-hardening
On Thu, Jan 09, 2025 at 09:52:31PM +0100, Mateusz Guzik wrote:
> On Thu, Jan 09, 2025 at 12:38:04PM -0800, Kees Cook wrote:
> > On Thu, Jan 09, 2025 at 08:51:44AM -0800, Kees Cook wrote:
> > > On Thu, Jan 09, 2025 at 02:57:58PM +0800, kernel test robot wrote:
> > > > kernel test robot noticed a 17.3% improvement of vm-scalability.throughput on:
> > > >
> > > > commit: 239d87327dcd361b0098038995f8908f3296864f ("fortify: Hide run-time copy size from value range tracking")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >
> > > Well that is unexpected. There should be no binary output difference
> > > with that patch. I will investigate...
> >
> > It looks like hiding the size value from GCC has the side-effect of
> > breaking memcpy inlining in many places. I would expect this to make
> > things _slower_, though. O_o
I think it's disabling value-range-based inlining; I'm trying to
construct some tests...
> This depends on what was emitted in place and what CPU is executing it.
>
> Notably if gcc elected to emit rep movs{q,b}, the CPU at hand does
> not have FSRM and the size is low enough, then such code can indeed be
> slower than suffering a call to memcpy (which does not issue rep mov).
>
> I had seen gcc go to great pains to align a buffer for rep movsq even
> when it was guaranteed to not be necessary for example.
>
> Can you disasm an example affected spot?
I tried to find the most self-contained example I could, and I ended up
with:
static void ipv6_rpl_addr_decompress(struct in6_addr *dst,
				     const struct in6_addr *daddr,
				     const void *post, unsigned char pfx)
{
	memcpy(dst, daddr, pfx);
	memcpy(&dst->s6_addr[pfx], post, IPV6_PFXTAIL_LEN(pfx));
}
Before 239d87327dcd ("fortify: Hide run-time copy size from value range
tracking"), the generated assembly is:
ffffffff8209f0e0 <ipv6_rpl_addr_decompress>:
ffffffff8209f0e0: e8 5b 62 fe fe call ffffffff81085340 <__fentry__>
ffffffff8209f0e5: 0f b6 c1 movzbl %cl,%eax
ffffffff8209f0e8: 49 89 d0 mov %rdx,%r8
ffffffff8209f0eb: 83 f8 08 cmp $0x8,%eax
ffffffff8209f0ee: 73 24 jae ffffffff8209f114 <ipv6_rpl_addr_decompress+0x34>
ffffffff8209f0f0: a8 04 test $0x4,%al
ffffffff8209f0f2: 75 64 jne ffffffff8209f158 <ipv6_rpl_addr_decompress+0x78>
ffffffff8209f0f4: 85 c0 test %eax,%eax
ffffffff8209f0f6: 74 09 je ffffffff8209f101 <ipv6_rpl_addr_decompress+0x21>
ffffffff8209f0f8: 0f b6 16 movzbl (%rsi),%edx
ffffffff8209f0fb: 88 17 mov %dl,(%rdi)
ffffffff8209f0fd: a8 02 test $0x2,%al
ffffffff8209f0ff: 75 65 jne ffffffff8209f166 <ipv6_rpl_addr_decompress+0x86>
ffffffff8209f101: ba 10 00 00 00 mov $0x10,%edx
ffffffff8209f106: 48 01 c7 add %rax,%rdi
ffffffff8209f109: 4c 89 c6 mov %r8,%rsi
ffffffff8209f10c: 48 29 c2 sub %rax,%rdx
ffffffff8209f10f: e9 bc 33 21 00 jmp ffffffff822b24d0 <__memcpy>
ffffffff8209f114: 48 8b 16 mov (%rsi),%rdx
ffffffff8209f117: 4c 8d 4f 08 lea 0x8(%rdi),%r9
ffffffff8209f11b: 49 83 e1 f8 and $0xfffffffffffffff8,%r9
ffffffff8209f11f: 48 89 17 mov %rdx,(%rdi)
ffffffff8209f122: 48 8b 54 06 f8 mov -0x8(%rsi,%rax,1),%rdx
ffffffff8209f127: 48 89 54 07 f8 mov %rdx,-0x8(%rdi,%rax,1)
ffffffff8209f12c: 48 89 fa mov %rdi,%rdx
ffffffff8209f12f: 4c 29 ca sub %r9,%rdx
ffffffff8209f132: 48 29 d6 sub %rdx,%rsi
ffffffff8209f135: 01 c2 add %eax,%edx
ffffffff8209f137: 83 e2 f8 and $0xfffffff8,%edx
ffffffff8209f13a: 83 fa 08 cmp $0x8,%edx
ffffffff8209f13d: 72 c2 jb ffffffff8209f101 <ipv6_rpl_addr_decompress+0x21>
ffffffff8209f13f: 83 e2 f8 and $0xfffffff8,%edx
ffffffff8209f142: 31 c9 xor %ecx,%ecx
ffffffff8209f144: 41 89 ca mov %ecx,%r10d
ffffffff8209f147: 83 c1 08 add $0x8,%ecx
ffffffff8209f14a: 4e 8b 1c 16 mov (%rsi,%r10,1),%r11
ffffffff8209f14e: 4f 89 1c 11 mov %r11,(%r9,%r10,1)
ffffffff8209f152: 39 d1 cmp %edx,%ecx
ffffffff8209f154: 72 ee jb ffffffff8209f144 <ipv6_rpl_addr_decompress+0x64>
ffffffff8209f156: eb a9 jmp ffffffff8209f101 <ipv6_rpl_addr_decompress+0x21>
ffffffff8209f158: 8b 16 mov (%rsi),%edx
ffffffff8209f15a: 89 17 mov %edx,(%rdi)
ffffffff8209f15c: 8b 54 06 fc mov -0x4(%rsi,%rax,1),%edx
ffffffff8209f160: 89 54 07 fc mov %edx,-0x4(%rdi,%rax,1)
ffffffff8209f164: eb 9b jmp ffffffff8209f101 <ipv6_rpl_addr_decompress+0x21>
ffffffff8209f166: 0f b7 54 06 fe movzwl -0x2(%rsi,%rax,1),%edx
ffffffff8209f16b: 66 89 54 07 fe mov %dx,-0x2(%rdi,%rax,1)
ffffffff8209f170: eb 8f jmp ffffffff8209f101 <ipv6_rpl_addr_decompress+0x21>
With the size hidden, it becomes:
ffffffff82096260 <ipv6_rpl_addr_decompress>:
ffffffff82096260: e8 db e0 fe fe call ffffffff81084340 <__fentry__>
ffffffff82096265: 55 push %rbp
ffffffff82096266: 0f b6 e9 movzbl %cl,%ebp
ffffffff82096269: 53 push %rbx
ffffffff8209626a: 48 89 d3 mov %rdx,%rbx
ffffffff8209626d: 48 89 ea mov %rbp,%rdx
ffffffff82096270: e8 9b 0a 21 00 call ffffffff822a6d10 <__memcpy>
ffffffff82096275: ba 10 00 00 00 mov $0x10,%edx
ffffffff8209627a: 48 89 de mov %rbx,%rsi
ffffffff8209627d: 5b pop %rbx
ffffffff8209627e: 48 89 c7 mov %rax,%rdi
ffffffff82096281: 48 29 ea sub %rbp,%rdx
ffffffff82096284: 48 01 ef add %rbp,%rdi
ffffffff82096287: 5d pop %rbp
ffffffff82096288: e9 83 0a 21 00 jmp ffffffff822a6d10 <__memcpy>
ffffffff8209628d: 0f 1f 00 nopl (%rax)
In the former, it looks like it is calculating how many 8-, 4-, and
single-byte copy loops to perform for the first memcpy, since it knows
the value range must be [0..255]. In both cases the second memcpy is
tail-called.
> Gcc has a bunch of magic switches to tell it what to emit inline; the
> thing to do is to convince it to roll with a bunch of mov instructions
> (not rep mov) for sizes small enough(tm). What constitutes small enough
> depends on the uarch.
I found -mmemcpy-strategy:
https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-mmemcpy-strategy_003dstrategy
But I don't see where to find the algs, and it seems like the above asm
is being produced when GCC thinks a value is in a certain range (rather
than compile-time known), which would make sense given what the commit
did: it removed visibility into value ranges.
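
For what it's worth, the alg names appear to be the ones documented
under -mstringop-strategy (rep_byte, rep_4byte, rep_8byte, byte_loop,
loop, unrolled_loop, vector_loop, libcall), and -mmemcpy-strategy takes
alg:max_size:dest_align triplets. So something like the following
(untested, and worth checking against the gcc-12 docs) should keep
small variable-size copies inline as an unrolled loop and send
everything else out of line:

	-mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
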
-Kees
--
Kees Cook
* Re: [linus:master] [fortify] 239d87327d: vm-scalability.throughput 17.3% improvement
From: Mateusz Guzik @ 2025-01-09 22:01 UTC
To: Kees Cook
Cc: kernel test robot, oe-lkp, lkp, linux-kernel,
Thomas Weißschuh, Nilay Shroff, Yury Norov,
Greg Kroah-Hartman, linux-hardening
On Thu, Jan 9, 2025 at 10:12 PM Kees Cook <kees@kernel.org> wrote:
>
> On Thu, Jan 09, 2025 at 09:52:31PM +0100, Mateusz Guzik wrote:
> > On Thu, Jan 09, 2025 at 12:38:04PM -0800, Kees Cook wrote:
> > > On Thu, Jan 09, 2025 at 08:51:44AM -0800, Kees Cook wrote:
> > > > On Thu, Jan 09, 2025 at 02:57:58PM +0800, kernel test robot wrote:
> > > > > kernel test robot noticed a 17.3% improvement of vm-scalability.throughput on:
> > > > >
> > > > > commit: 239d87327dcd361b0098038995f8908f3296864f ("fortify: Hide run-time copy size from value range tracking")
> > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > >
> > > > Well that is unexpected. There should be no binary output difference
> > > > with that patch. I will investigate...
> > >
> > > It looks like hiding the size value from GCC has the side-effect of
> > > breaking memcpy inlining in many places. I would expect this to make
> > > things _slower_, though. O_o
>
> > I think it's disabling value-range-based inlining; I'm trying to
> > construct some tests...
>
> > This depends on what was emitted in place and what CPU is executing it.
> >
> > Notably if gcc elected to emit rep movs{q,b}, the CPU at hand does
> > not have FSRM and the size is low enough, then such code can indeed be
> > slower than suffering a call to memcpy (which does not issue rep mov).
> >
> > I had seen gcc go to great pains to align a buffer for rep movsq even
> > when it was guaranteed to not be necessary for example.
> >
> > Can you disasm an example affected spot?
>
> I tried to find the most self-contained example I could, and I ended up
> with:
>
> static void ipv6_rpl_addr_decompress(struct in6_addr *dst,
>                                      const struct in6_addr *daddr,
>                                      const void *post, unsigned char pfx)
> {
> 	memcpy(dst, daddr, pfx);
> 	memcpy(&dst->s6_addr[pfx], post, IPV6_PFXTAIL_LEN(pfx));
> }
>
Well, I did what I should have done from the get-go and took the
liberty of looking at the profile.
         %stddev      %change          %stddev
             \            |                \
[snip]
0.00 +6.5 6.54 ± 66%
perf-profile.calltrace.cycles-pp.memcpy_orig.copy_page_from_iter_atomic.generic_perform_write.shmem_file_write_iter.do_iter_readv_writev
Disassembling copy_page_from_iter_atomic *prior* to the change indeed
reveals rep movsq as I suspected (second-to-last instruction):
<+919>: mov (%rax),%rdx
<+922>: lea 0x8(%rsi),%rdi
<+926>: and $0xfffffffffffffff8,%rdi
<+930>: mov %rdx,(%rsi)
<+933>: mov %r8d,%edx
<+936>: mov -0x8(%rax,%rdx,1),%rcx
<+941>: mov %rcx,-0x8(%rsi,%rdx,1)
<+946>: sub %rdi,%rsi
<+949>: mov %rsi,%rdx
<+952>: sub %rsi,%rax
<+955>: lea (%r8,%rdx,1),%ecx
<+959>: mov %rax,%rsi
<+962>: shr $0x3,%ecx
<+965>: rep movsq %ds:(%rsi),%es:(%rdi)
<+968>: jmp 0xffffffff819157c5 <copy_page_from_iter_atomic+869>
With the reported patch this is a call to memcpy.
This is the guy:
static __always_inline
size_t memcpy_from_iter(void *iter_from, size_t progress,
			size_t len, void *to, void *priv2)
{
	memcpy(to + progress, iter_from, len);
	return 0;
}
I don't know what the specific bench is doing; I'm assuming the passed
values were low enough that the overhead of spinning up rep movsq took
over.
gcc should retain the ability to optimize this, except it needs to be
convinced to not emit rep movsq for variable sizes (and instead call
memcpy).
For user memory access there is a bunch of hackery to inline rep mov
for CPUs where it does not suck for small sizes (see
rep_movs_alternative). Someone(tm) should port it over to memcpy
handling as well.
The expected state would be that for sizes known at compilation time
it rolls with movs as needed (no rep), otherwise emits the magic rep
movs/memcpy invocation, except for when it would be tail-called.
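
Roughly this shape, in other words (a hypothetical helper for
illustration, not an existing kernel interface):

static __always_inline void *copy_dispatch(void *dst, const void *src,
					   size_t len)
{
	/* Compile-time-constant size: let the compiler emit plain movs. */
	if (__builtin_constant_p(len))
		return __builtin_memcpy(dst, src, len);
	/*
	 * Variable size: go out of line, or -- like rep_movs_alternative
	 * does for user copies -- patch in inline rep movs only on CPUs
	 * where that is actually fast.
	 */
	return memcpy(dst, src, len);
}
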
In your ipv6_rpl_addr_decompress example gcc went a little crazy,
which, as I mentioned, does happen. However, most of the time it is
doing a good job, and a now-generated call to memcpy should make things
slower. I presume these spots are merely not being benchmarked here.
Note that going from inline movs (no rep) to a call to memcpy which
does movs (again no rep) comes with a "mere" function call overhead,
which is a different beast than spinning up rep movs on CPUs without
FSRM.
That is to say, contrary to the report above, I believe the change is
in fact a regression which just so happened to make things faster for
a specific case. The unintended speed up can be achieved without
regressing anything else by taming the craziness.
Reading the commit log, I don't know what the way out is; perhaps you
could rope in some gcc folks to ask? Screwing with optimization just to
avoid seeing a warning is definitely not the best option.
--
Mateusz Guzik <mjguzik gmail.com>
* Re: [linus:master] [fortify] 239d87327d: vm-scalability.throughput 17.3% improvement
From: Kees Cook @ 2025-01-10 16:58 UTC
To: Mateusz Guzik
Cc: kernel test robot, oe-lkp, lkp, linux-kernel,
Thomas Weißschuh, Nilay Shroff, Yury Norov,
Greg Kroah-Hartman, linux-hardening
On Thu, Jan 09, 2025 at 11:01:47PM +0100, Mateusz Guzik wrote:
> On Thu, Jan 9, 2025 at 10:12 PM Kees Cook <kees@kernel.org> wrote:
> >
> > On Thu, Jan 09, 2025 at 09:52:31PM +0100, Mateusz Guzik wrote:
> > > On Thu, Jan 09, 2025 at 12:38:04PM -0800, Kees Cook wrote:
> > > > On Thu, Jan 09, 2025 at 08:51:44AM -0800, Kees Cook wrote:
> > > > > On Thu, Jan 09, 2025 at 02:57:58PM +0800, kernel test robot wrote:
> > > > > > kernel test robot noticed a 17.3% improvement of vm-scalability.throughput on:
> > > > > >
> > > > > > commit: 239d87327dcd361b0098038995f8908f3296864f ("fortify: Hide run-time copy size from value range tracking")
> > > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > > >
> > > > > Well that is unexpected. There should be no binary output difference
> > > > > with that patch. I will investigate...
> > > >
> > > > It looks like hiding the size value from GCC has the side-effect of
> > > > breaking memcpy inlining in many places. I would expect this to make
> > > > things _slower_, though. O_o
> >
> > I think it's disabling value-range-based inlining; I'm trying to
> > construct some tests...
> >
> > > This depends on what was emitted in place and what CPU is executing it.
> > >
> > > Notably if gcc elected to emit rep movs{q,b}, the CPU at hand does
> > > not have FSRM and the size is low enough, then such code can indeed be
> > > slower than suffering a call to memcpy (which does not issue rep mov).
> > >
> > > I had seen gcc go to great pains to align a buffer for rep movsq even
> > > when it was guaranteed to not be necessary for example.
> > >
> > > Can you disasm an example affected spot?
> >
> > I tried to find the most self-contained example I could, and I ended up
> > with:
> >
> > static void ipv6_rpl_addr_decompress(struct in6_addr *dst,
> >                                      const struct in6_addr *daddr,
> >                                      const void *post, unsigned char pfx)
> > {
> > 	memcpy(dst, daddr, pfx);
> > 	memcpy(&dst->s6_addr[pfx], post, IPV6_PFXTAIL_LEN(pfx));
> > }
> >
>
> Well, I did what I should have done from the get-go and took the
> liberty of looking at the profile.
>
>          %stddev      %change          %stddev
>              \            |                \
> [snip]
> 0.00 +6.5 6.54 ± 66%
> perf-profile.calltrace.cycles-pp.memcpy_orig.copy_page_from_iter_atomic.generic_perform_write.shmem_file_write_iter.do_iter_readv_writev
>
> Disassembling copy_page_from_iter_atomic *prior* to the change indeed
> reveals rep movsq as I suspected (second-to-last instruction):
>
> <+919>: mov (%rax),%rdx
> <+922>: lea 0x8(%rsi),%rdi
> <+926>: and $0xfffffffffffffff8,%rdi
> <+930>: mov %rdx,(%rsi)
> <+933>: mov %r8d,%edx
> <+936>: mov -0x8(%rax,%rdx,1),%rcx
> <+941>: mov %rcx,-0x8(%rsi,%rdx,1)
> <+946>: sub %rdi,%rsi
> <+949>: mov %rsi,%rdx
> <+952>: sub %rsi,%rax
> <+955>: lea (%r8,%rdx,1),%ecx
> <+959>: mov %rax,%rsi
> <+962>: shr $0x3,%ecx
> <+965>: rep movsq %ds:(%rsi),%es:(%rdi)
> <+968>: jmp 0xffffffff819157c5 <copy_page_from_iter_atomic+869>
>
> With the reported patch this is a call to memcpy.
>
> This is the guy:
> static __always_inline
> size_t memcpy_from_iter(void *iter_from, size_t progress,
> 			size_t len, void *to, void *priv2)
> {
> 	memcpy(to + progress, iter_from, len);
> 	return 0;
> }
Thanks for looking at this case!
>
> I don't know what the specific bench is doing; I'm assuming the passed
> values were low enough that the overhead of spinning up rep movsq took
> over.
>
> gcc should retain the ability to optimize this, except it needs to be
> convinced to not emit rep movsq for variable sizes (and instead call
> memcpy).
>
> For user memory access there is a bunch of hackery to inline rep mov
> for CPUs where it does not suck for small sizes (see
> rep_movs_alternative). Someone(tm) should port it over to memcpy
> handling as well.
>
> The expected state would be that for sizes known at compilation time
> it rolls with movs as needed (no rep), otherwise emits the magic rep
> movs/memcpy invocation, except for when it would be tail-called.
>
> In your ipv6_rpl_addr_decompress example gcc went a little crazy,
> which, as I mentioned, does happen. However, most of the time it is
> doing a good job, and a now-generated call to memcpy should make things
> slower. I presume these spots are merely not being benchmarked here.
> Note that going from inline movs (no rep) to a call to memcpy which
> does movs (again no rep) comes with a "mere" function call overhead,
> which is a different beast than spinning up rep movs on CPUs without
> FSRM.
>
> That is to say, contrary to the report above, I believe the change is
> in fact a regression which just so happened to make things faster for
> a specific case. The unintended speed up can be achieved without
> regressing anything else by taming the craziness.
How do we best make sense of the perf report? Even in the iter case
above, it looks like a perf improvement?
The fortify change lets GCC still inline compile-time-constant sizes, so
that's good. But it seems to force all the "in a given range" cases into
calls.
> Reading the commit log, I don't know what the way out is; perhaps you
> could rope in some gcc folks to ask? Screwing with optimization just to
> avoid seeing a warning is definitely not the best option.
Yeah, if we do need to revert this, I'm going to need another way to
silence the GCC value-range checker for memcpy...
--
Kees Cook
* Re: [linus:master] [fortify] 239d87327d: vm-scalability.throughput 17.3% improvement
From: Mateusz Guzik @ 2025-01-10 19:14 UTC
To: Kees Cook
Cc: kernel test robot, oe-lkp, lkp, linux-kernel,
Thomas Weißschuh, Nilay Shroff, Yury Norov,
Greg Kroah-Hartman, linux-hardening
On Fri, Jan 10, 2025 at 5:58 PM Kees Cook <kees@kernel.org> wrote:
>
> On Thu, Jan 09, 2025 at 11:01:47PM +0100, Mateusz Guzik wrote:
> > That is to say, contrary to the report above, I believe the change is
> > in fact a regression which just so happened to make things faster for
> > a specific case. The unintended speed up can be achieved without
> > regressing anything else by taming the craziness.
>
> How do we best make sense of the perf report? Even in the iter case
> above, it looks like a perf improvement?
>
The kernel without your change, compiled with gcc, is leaving
performance on the table in select cases, namely where gcc elects to
use rep movsq for sizes below a magic threshold (which depends on the
uarch).
Your change has the unintended side effect of changing
copy_page_from_iter_atomic to use plain memcpy, which just happens to
be the right thing to do for this particular consumer.
However, it also has the side effect of forcing a memcpy call in places
which were optimized just fine -- for example, if there is a spot with
a variable number of bytes to copy, but the range is small and the
upper limit is also small, gcc will elect to emit a few movs and be
done with it, which is faster than calling memcpy. That is to say, for
spots like that this is a regression.
In terms of optimizing all of this, the thing to do is to convince gcc
not to emit rep movsq for the known problematic cases, while not
messing with places which are already optimized fine.
--
Mateusz Guzik <mjguzik gmail.com>