public inbox for linux-kernel@vger.kernel.org
* [linus:master] [fortify]  239d87327d:  vm-scalability.throughput 17.3% improvement
@ 2025-01-09  6:57 kernel test robot
  2025-01-09 16:51 ` Kees Cook
  0 siblings, 1 reply; 8+ messages in thread
From: kernel test robot @ 2025-01-09  6:57 UTC (permalink / raw)
  To: Kees Cook
  Cc: oe-lkp, lkp, linux-kernel, Thomas Weißschuh, Nilay Shroff,
	Yury Norov, Greg Kroah-Hartman, linux-hardening, oliver.sang




Hello,

kernel test robot noticed a 17.3% improvement of vm-scalability.throughput on:


commit: 239d87327dcd361b0098038995f8908f3296864f ("fortify: Hide run-time copy size from value range tracking")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master


testcase: vm-scalability
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz (Cooper Lake) with 192G memory
parameters:

	runtime: 300s
	size: 256G
	test: msync
	cpufreq_governor: performance






Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250109/202501091405.a1fcb1ed-lkp@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-9.4/debian-12-x86_64-20240206.cgz/300s/256G/lkp-cpl-4sp2/msync/vm-scalability

commit: 
  f06e108a3d ("Compiler Attributes: disable __counted_by for clang < 19.1.3")
  239d87327d ("fortify: Hide run-time copy size from value range tracking")

f06e108a3dc53c0f 239d87327dcd361b0098038995f 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
    654.00 ± 13%     +62.7%       1063 ± 41%  perf-c2c.HITM.local
     74.03 ± 49%    +113.3%     157.89 ± 40%  sched_debug.cfs_rq:/.removed.runnable_avg.max
     74.03 ± 49%    +113.3%     157.89 ± 40%  sched_debug.cfs_rq:/.removed.util_avg.max
   9843704 ± 12%     -31.4%    6748836 ± 24%  numa-meminfo.node0.Active(file)
     81609 ± 13%     -18.3%      66698 ± 10%  numa-meminfo.node1.Writeback
   3197765 ± 12%     +34.7%    4307440 ± 12%  numa-meminfo.node3.MemFree
      0.07 ±  2%      +0.0        0.07 ±  2%  mpstat.cpu.all.irq%
      0.05 ±  2%      +0.0        0.06        mpstat.cpu.all.soft%
      2.17 ±  3%      +0.4        2.58 ±  3%  mpstat.cpu.all.sys%
      0.42 ±  3%      +0.1        0.49 ±  2%  mpstat.cpu.all.usr%
   2462818           +24.2%    3060034        vmstat.io.bo
      8.76 ±  2%     +14.7%      10.06 ±  3%  vmstat.procs.r
     12294           +13.6%      13967 ±  4%  vmstat.system.cs
     40339 ±  2%      +6.8%      43096 ±  4%  vmstat.system.in
   6203763 ± 14%     +67.5%   10389382 ± 23%  numa-numastat.node0.local_node
   6274485 ± 14%     +66.8%   10464891 ± 23%  numa-numastat.node0.numa_hit
   6773452 ± 13%     -59.5%    2743979 ± 68%  numa-numastat.node0.numa_miss
   6842787 ± 12%     -58.8%    2819949 ± 66%  numa-numastat.node0.other_node
   7434683 ± 19%     +36.7%   10159657 ± 26%  numa-numastat.node1.local_node
   7522237 ± 19%     +36.4%   10257654 ± 26%  numa-numastat.node1.numa_hit
     16256 ±  2%     +26.1%      20495        vm-scalability.median
      5.43 ±  6%      -2.5        2.92 ± 26%  vm-scalability.median_stddev%
      9.99 ± 10%      -3.0        6.95 ±  9%  vm-scalability.stddev%
   5678018 ±  3%     +17.3%    6661631 ±  2%  vm-scalability.throughput
 1.573e+09           +25.0%  1.966e+09        vm-scalability.time.file_system_outputs
     16615 ±  3%     +27.0%      21107        vm-scalability.time.involuntary_context_switches
 2.099e+08           +25.0%  2.624e+08        vm-scalability.time.minor_page_faults
    561.00           +21.8%     683.33 ±  3%  vm-scalability.time.percent_of_cpu_this_job_got
      1358 ±  3%     +23.6%       1679 ±  3%  vm-scalability.time.system_time
    418.15 ±  2%     +19.1%     497.88        vm-scalability.time.user_time
   1135302           +11.7%    1268430        vm-scalability.time.voluntary_context_switches
 8.846e+08           +25.0%  1.106e+09        vm-scalability.workload
   2478521 ± 12%     -33.1%    1658879 ± 24%  numa-vmstat.node0.nr_active_file
  45774950 ±  9%     +28.8%   58943198 ±  5%  numa-vmstat.node0.nr_dirtied
  45774950 ±  9%     +28.8%   58943198 ±  5%  numa-vmstat.node0.nr_written
   2476252 ± 12%     -33.1%    1657048 ± 24%  numa-vmstat.node0.nr_zone_active_file
   6274222 ± 14%     +66.8%   10464563 ± 23%  numa-vmstat.node0.numa_hit
   6203500 ± 14%     +67.5%   10389054 ± 23%  numa-vmstat.node0.numa_local
   6773452 ± 13%     -59.5%    2743979 ± 68%  numa-vmstat.node0.numa_miss
   6842787 ± 12%     -58.8%    2819949 ± 66%  numa-vmstat.node0.numa_other
  49693812 ±  8%     +20.0%   59611215 ±  8%  numa-vmstat.node1.nr_dirtied
  49693812 ±  8%     +20.0%   59611215 ±  8%  numa-vmstat.node1.nr_written
   7521777 ± 19%     +36.4%   10257607 ± 26%  numa-vmstat.node1.numa_hit
   7434223 ± 19%     +36.7%   10159609 ± 26%  numa-vmstat.node1.numa_local
   2660800 ±  8%     +22.1%    3250098 ±  5%  numa-vmstat.node1.workingset_activate_file
   3153899 ±  8%     +19.5%    3769627 ±  5%  numa-vmstat.node1.workingset_refault_file
   2660800 ±  8%     +22.1%    3250098 ±  5%  numa-vmstat.node1.workingset_restore_file
  53368316 ±  9%     +20.2%   64130806 ±  8%  numa-vmstat.node2.nr_dirtied
  53368316 ±  9%     +20.2%   64130806 ±  8%  numa-vmstat.node2.nr_written
      7683 ±  8%     -20.2%       6129 ±  4%  numa-vmstat.node2.workingset_nodes
  47788357 ± 10%     +32.1%   63105437 ± 10%  numa-vmstat.node3.nr_dirtied
    803731 ± 13%     +34.0%    1076708 ± 12%  numa-vmstat.node3.nr_free_pages
  47788357 ± 10%     +32.1%   63105437 ± 10%  numa-vmstat.node3.nr_written
     30030 ± 15%     +75.3%      52638 ± 23%  proc-vmstat.allocstall_movable
     27837 ± 13%     +58.8%      44214 ± 22%  proc-vmstat.compact_fail
     45835 ± 10%     +88.6%      86440 ± 23%  proc-vmstat.compact_stall
     17998 ± 21%    +134.6%      42225 ± 25%  proc-vmstat.compact_success
  22633426            +1.2%   22911084        proc-vmstat.nr_active_anon
  11444651           -10.8%   10211517 ±  6%  proc-vmstat.nr_active_file
 1.966e+08           +25.0%  2.458e+08        proc-vmstat.nr_dirtied
   3658433            -2.6%    3563342        proc-vmstat.nr_dirty
   9170138           +12.1%   10276853 ±  6%  proc-vmstat.nr_inactive_file
  22567898            +1.2%   22846647        proc-vmstat.nr_shmem
 1.966e+08           +25.0%  2.458e+08        proc-vmstat.nr_written
  22633454            +1.2%   22911113        proc-vmstat.nr_zone_active_anon
  11444767           -10.8%   10211682 ±  6%  proc-vmstat.nr_zone_active_file
   9170083           +12.1%   10276805 ±  6%  proc-vmstat.nr_zone_inactive_file
   3740131            -2.7%    3639414        proc-vmstat.nr_zone_write_pending
  22011951 ± 15%     +33.7%   29430963 ± 10%  proc-vmstat.pgactivate
      2824           +16.2%       3280 ± 22%  proc-vmstat.pgalloc_dma
 2.856e+08           +19.6%  3.416e+08 ±  3%  proc-vmstat.pgalloc_normal
 2.112e+08           +24.9%  2.637e+08        proc-vmstat.pgfault
 2.886e+08           +19.3%  3.444e+08 ±  3%  proc-vmstat.pgfree
      6020 ±  9%     +88.5%      11348 ± 44%  proc-vmstat.pgmajfault
 7.865e+08           +25.0%  9.832e+08        proc-vmstat.pgpgout
    124025           +16.5%     144503        proc-vmstat.pgreuse
   3641011 ± 15%     +48.1%    5392566 ± 14%  proc-vmstat.pgsteal_direct
      2499           +26.9%       3171        proc-vmstat.unevictable_pgs_culled
     29425            -4.0%      28243        proc-vmstat.workingset_nodes
      9.93            +6.5%      10.58        perf-stat.i.MPKI
  4.61e+09           +25.7%  5.793e+09        perf-stat.i.branch-instructions
      0.32 ±  3%      -0.0        0.29        perf-stat.i.branch-miss-rate%
  12693622           +13.8%   14449439        perf-stat.i.branch-misses
     83.47            +2.3       85.75        perf-stat.i.cache-miss-rate%
 1.591e+08           +39.5%  2.221e+08        perf-stat.i.cache-misses
 1.891e+08           +36.6%  2.584e+08        perf-stat.i.cache-references
     12325           +13.6%      13999 ±  4%  perf-stat.i.context-switches
      1.28           -11.7%       1.13 ±  2%  perf-stat.i.cpi
 2.864e+10           +18.9%  3.405e+10 ±  2%  perf-stat.i.cpu-cycles
    343.31            +5.4%     361.81        perf-stat.i.cpu-migrations
    141.92           -15.8%     119.51        perf-stat.i.cycles-between-cache-misses
 1.792e+10           +29.5%   2.32e+10        perf-stat.i.instructions
      1.01           +13.0%       1.14        perf-stat.i.ipc
      5.54           +24.5%       6.90        perf-stat.i.metric.K/sec
    624456           +24.6%     778107        perf-stat.i.minor-faults
    624469           +24.6%     778135        perf-stat.i.page-faults
      8.90            +7.8%       9.59        perf-stat.overall.MPKI
      0.28            -0.0        0.25        perf-stat.overall.branch-miss-rate%
     84.14            +1.8       85.91        perf-stat.overall.cache-miss-rate%
      1.62            -8.3%       1.49 ±  2%  perf-stat.overall.cpi
    182.46           -14.9%     155.29 ±  2%  perf-stat.overall.cycles-between-cache-misses
      0.62            +9.0%       0.67 ±  2%  perf-stat.overall.ipc
      6475            +3.7%       6715        perf-stat.overall.path-length
 4.639e+09           +25.0%    5.8e+09        perf-stat.ps.branch-instructions
  12777070           +13.1%   14448212        perf-stat.ps.branch-misses
 1.605e+08           +38.8%  2.229e+08        perf-stat.ps.cache-misses
 1.908e+08           +36.0%  2.594e+08        perf-stat.ps.cache-references
     12289           +13.6%      13955 ±  4%  perf-stat.ps.context-switches
 2.929e+10           +18.2%  3.461e+10 ±  2%  perf-stat.ps.cpu-cycles
    344.20            +5.3%     362.39        perf-stat.ps.cpu-migrations
 1.805e+10           +28.8%  2.324e+10        perf-stat.ps.instructions
    626335           +24.0%     776865        perf-stat.ps.minor-faults
    626348           +24.0%     776893        perf-stat.ps.page-faults
 5.728e+12           +29.6%  7.425e+12        perf-stat.total.instructions
     34.75 ±  2%     -17.3       17.48 ± 87%  perf-profile.calltrace.cycles-pp.read_pages.page_cache_ra_order.filemap_fault.__do_fault.do_read_fault
     34.74 ±  2%     -16.4       18.29 ± 79%  perf-profile.calltrace.cycles-pp.iomap_readahead.read_pages.page_cache_ra_order.filemap_fault.__do_fault
     34.68 ±  2%     -16.4       18.25 ± 79%  perf-profile.calltrace.cycles-pp.iomap_readpage_iter.iomap_readahead.read_pages.page_cache_ra_order.filemap_fault
     34.48 ±  2%     -16.4       18.07 ± 80%  perf-profile.calltrace.cycles-pp.zero_user_segments.iomap_readpage_iter.iomap_readahead.read_pages.page_cache_ra_order
     34.28 ±  2%     -16.3       17.97 ± 80%  perf-profile.calltrace.cycles-pp.memset_orig.zero_user_segments.iomap_readpage_iter.iomap_readahead.read_pages
      7.38 ±  7%      +1.8        9.17 ± 13%  perf-profile.calltrace.cycles-pp.asm_exc_page_fault.do_access
      0.00            +6.5        6.54 ± 66%  perf-profile.calltrace.cycles-pp.memcpy_orig.copy_page_from_iter_atomic.generic_perform_write.shmem_file_write_iter.do_iter_readv_writev
     34.90 ±  2%     -16.5       18.41 ± 79%  perf-profile.children.cycles-pp.read_pages
     34.89 ±  2%     -16.5       18.41 ± 79%  perf-profile.children.cycles-pp.iomap_readahead
     34.83 ±  2%     -16.5       18.36 ± 79%  perf-profile.children.cycles-pp.iomap_readpage_iter
     34.62 ±  2%     -16.4       18.18 ± 80%  perf-profile.children.cycles-pp.zero_user_segments
     34.57 ±  2%     -16.4       18.15 ± 80%  perf-profile.children.cycles-pp.memset_orig
      0.33 ±  7%      -0.2        0.16 ± 87%  perf-profile.children.cycles-pp.prep_compound_page
      0.24 ± 18%      -0.1        0.10 ± 83%  perf-profile.children.cycles-pp.page_counter_try_charge
      0.25 ±  8%      -0.1        0.19 ± 16%  perf-profile.children.cycles-pp.__mod_node_page_state
      0.08 ± 13%      -0.0        0.05 ± 47%  perf-profile.children.cycles-pp.__mod_lruvec_state
      0.08 ±  5%      +0.0        0.10 ±  8%  perf-profile.children.cycles-pp.___perf_sw_event
      0.03 ±123%      +0.1        0.10 ± 33%  perf-profile.children.cycles-pp.on_each_cpu_cond_mask
      0.03 ±123%      +0.1        0.10 ± 33%  perf-profile.children.cycles-pp.smp_call_function_many_cond
      0.02 ±123%      +0.1        0.10 ± 45%  perf-profile.children.cycles-pp.up_write
      0.07 ± 22%      +0.1        0.19 ± 56%  perf-profile.children.cycles-pp.free_tail_page_prepare
      0.20 ± 19%      +0.4        0.58 ± 61%  perf-profile.children.cycles-pp.shmem_get_folio_gfp
      0.20 ± 18%      +0.4        0.61 ± 61%  perf-profile.children.cycles-pp.shmem_write_begin
      0.24 ± 20%      +0.5        0.73 ± 60%  perf-profile.children.cycles-pp.flush_tlb_mm_range
      0.07 ± 12%      +0.6        0.62 ± 63%  perf-profile.children.cycles-pp.folio_unlock
      0.29 ± 18%      +0.6        0.85 ± 60%  perf-profile.children.cycles-pp.ptep_clear_flush
      0.04 ± 83%      +0.6        0.64 ± 65%  perf-profile.children.cycles-pp.shmem_write_end
      0.33 ± 27%      +0.8        1.12 ± 65%  perf-profile.children.cycles-pp.page_vma_mkclean_one
      0.33 ± 27%      +0.8        1.12 ± 64%  perf-profile.children.cycles-pp.page_mkclean_one
      0.53 ±  2%      +0.8        1.33 ± 57%  perf-profile.children.cycles-pp.rmap_walk_file
      0.35 ± 28%      +0.8        1.18 ± 65%  perf-profile.children.cycles-pp.folio_mkclean
      0.00            +6.6        6.58 ± 66%  perf-profile.children.cycles-pp.memcpy_orig
     34.07 ±  2%     -16.2       17.90 ± 80%  perf-profile.self.cycles-pp.memset_orig
      2.63 ± 19%      -2.6        0.05 ±101%  perf-profile.self.cycles-pp.copy_page_from_iter_atomic
      0.25 ±  3%      -0.1        0.12 ± 83%  perf-profile.self.cycles-pp.folio_alloc_noprof
      0.19 ± 14%      -0.1        0.08 ± 80%  perf-profile.self.cycles-pp.page_counter_try_charge
      0.25 ±  9%      -0.1        0.18 ± 17%  perf-profile.self.cycles-pp.__mod_node_page_state
      0.06 ±  7%      +0.0        0.09 ± 17%  perf-profile.self.cycles-pp.xfs_buffered_write_iomap_begin
      0.00            +0.1        0.08 ± 29%  perf-profile.self.cycles-pp.__cond_resched
      1.94 ±  8%      +0.5        2.47 ± 16%  perf-profile.self.cycles-pp.do_access
      0.07 ± 12%      +0.5        0.62 ± 64%  perf-profile.self.cycles-pp.folio_unlock
      0.00            +6.5        6.50 ± 66%  perf-profile.self.cycles-pp.memcpy_orig
      0.00 ±200%    +483.3%       0.01 ± 11%  perf-sched.sch_delay.avg.ms.__cond_resched.__alloc_pages_noprof.alloc_pages_mpol_noprof.folio_alloc_noprof.page_cache_ra_order
      0.02 ± 51%    +269.3%       0.06 ± 44%  perf-sched.sch_delay.avg.ms.__cond_resched.__wait_for_common.affine_move_task.__set_cpus_allowed_ptr.__sched_setaffinity
      0.01 ±121%    +788.9%       0.09 ± 51%  perf-sched.sch_delay.avg.ms.__cond_resched.shmem_get_folio_gfp.shmem_write_begin.generic_perform_write.shmem_file_write_iter
      0.00 ±200%    +566.7%       0.01 ± 14%  perf-sched.sch_delay.avg.ms.__cond_resched.shrink_folio_list.evict_folios.try_to_shrink_lruvec.shrink_one
      0.01 ± 17%    +174.5%       0.02 ± 76%  perf-sched.sch_delay.avg.ms.__cond_resched.writeback_get_folio.writeback_iter.iomap_writepages.xfs_vm_writepages
      0.01           +20.0%       0.01        perf-sched.sch_delay.avg.ms.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.01 ± 17%     -69.3%       0.00 ± 20%  perf-sched.sch_delay.avg.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
      0.01 ±  9%   +1197.6%       0.09 ±128%  perf-sched.sch_delay.avg.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
      0.04 ± 25%     -40.3%       0.02 ± 30%  perf-sched.sch_delay.avg.ms.schedule_timeout.__wait_for_common.__flush_workqueue.xlog_cil_push_now.isra
      0.01 ±  6%    -100.0%       0.00        perf-sched.sch_delay.avg.ms.schedule_timeout.io_schedule_timeout.__wait_for_common.submit_bio_wait
      0.08 ± 68%    +450.4%       0.45 ± 22%  perf-sched.sch_delay.avg.ms.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
      0.15 ± 34%     +78.2%       0.26 ± 22%  perf-sched.sch_delay.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
      0.00 ± 50%    -100.0%       0.00        perf-sched.sch_delay.avg.ms.xlog_force_lsn.xfs_log_force_seq.xfs_file_fsync.__do_sys_msync
      0.01 ±  5%    -100.0%       0.00        perf-sched.sch_delay.avg.ms.xlog_wait_on_iclog.xfs_file_fsync.__do_sys_msync.do_syscall_64
      0.00 ±200%    +636.1%       0.01 ± 17%  perf-sched.sch_delay.max.ms.__cond_resched.__alloc_pages_noprof.alloc_pages_mpol_noprof.folio_alloc_noprof.page_cache_ra_order
      6.00 ± 95%    +186.8%      17.22 ± 16%  perf-sched.sch_delay.max.ms.__cond_resched.__wait_for_common.affine_move_task.__set_cpus_allowed_ptr.__sched_setaffinity
      0.03 ±162%    +320.5%       0.13 ± 48%  perf-sched.sch_delay.max.ms.__cond_resched.shmem_get_folio_gfp.shmem_write_begin.generic_perform_write.shmem_file_write_iter
      0.00 ±200%    +876.2%       0.01 ± 21%  perf-sched.sch_delay.max.ms.__cond_resched.shrink_folio_list.evict_folios.try_to_shrink_lruvec.shrink_one
      0.01 ± 52%    +221.1%       0.02 ± 59%  perf-sched.sch_delay.max.ms.__cond_resched.xfs_write_fault.do_page_mkwrite.do_shared_fault.do_pte_missing
      0.12 ±153%     -92.3%       0.01 ± 21%  perf-sched.sch_delay.max.ms.__cond_resched.zap_pmd_range.isra.0.unmap_page_range
      0.35 ±155%    +263.8%       1.28 ± 44%  perf-sched.sch_delay.max.ms.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt.[unknown]
      2.22 ± 44%     -50.9%       1.09 ± 23%  perf-sched.sch_delay.max.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
      0.32 ± 25%     -65.3%       0.11 ± 12%  perf-sched.sch_delay.max.ms.schedule_timeout.__wait_for_common.__flush_workqueue.xlog_cil_push_now.isra
      0.01 ± 10%    +105.6%       0.01 ± 61%  perf-sched.sch_delay.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
      0.02 ± 62%    -100.0%       0.00        perf-sched.sch_delay.max.ms.schedule_timeout.io_schedule_timeout.__wait_for_common.submit_bio_wait
      0.01 ± 35%    +140.7%       0.03 ± 25%  perf-sched.sch_delay.max.ms.schedule_timeout.kswapd_try_to_sleep.kswapd.kthread
      5.35 ± 13%    +120.5%      11.80 ± 30%  perf-sched.sch_delay.max.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      0.01 ± 27%    +105.6%       0.01 ± 29%  perf-sched.sch_delay.max.ms.wait_for_partner.fifo_open.do_dentry_open.vfs_open
      0.04 ±144%    -100.0%       0.00        perf-sched.sch_delay.max.ms.xlog_force_lsn.xfs_log_force_seq.xfs_file_fsync.__do_sys_msync
      0.02 ± 49%    -100.0%       0.00        perf-sched.sch_delay.max.ms.xlog_wait_on_iclog.xfs_file_fsync.__do_sys_msync.do_syscall_64
     34409 ±  4%     +31.4%      45208 ± 19%  perf-sched.total_wait_and_delay.count.ms
      1.05 ± 66%     -97.2%       0.03 ± 59%  perf-sched.wait_and_delay.avg.ms.__cond_resched.loop_process_work.process_one_work.worker_thread.kthread
    533.92 ±140%     -97.3%      14.58 ±223%  perf-sched.wait_and_delay.avg.ms.__cond_resched.ww_mutex_lock.drm_gem_vunmap_unlocked.drm_gem_fb_vunmap.drm_atomic_helper_commit_planes
     13.99 ± 14%    -100.0%       0.00        perf-sched.wait_and_delay.avg.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.x64_sys_call
      4.37 ± 61%     -80.6%       0.85 ± 49%  perf-sched.wait_and_delay.avg.ms.io_schedule.folio_wait_bit_common.folio_wait_writeback.__filemap_fdatawait_range
     26.95 ± 25%     -40.1%      16.14 ± 44%  perf-sched.wait_and_delay.avg.ms.io_schedule.rq_qos_wait.wbt_wait.__rq_qos_throttle
    128.59 ± 17%    +229.1%     423.13 ± 16%  perf-sched.wait_and_delay.avg.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
     40.40 ±  6%      +8.7%      43.93        perf-sched.wait_and_delay.avg.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
     70.57 ±130%    +604.7%     497.32 ± 31%  perf-sched.wait_and_delay.avg.ms.xlog_wait_on_iclog.xlog_cil_push_work.process_one_work.worker_thread
    328.60 ± 12%    -100.0%       0.00        perf-sched.wait_and_delay.count.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.x64_sys_call
     10179 ± 15%     -97.7%     237.17 ± 45%  perf-sched.wait_and_delay.count.io_schedule.rq_qos_wait.wbt_wait.__rq_qos_throttle
      4488            +9.4%       4911        perf-sched.wait_and_delay.count.pipe_read.vfs_read.ksys_read.do_syscall_64
    214.60 ± 24%     -70.4%      63.50 ±  8%  perf-sched.wait_and_delay.count.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
      2937 ± 65%    +699.5%      23480        perf-sched.wait_and_delay.count.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
    803.09 ± 73%     -99.6%       3.53 ±183%  perf-sched.wait_and_delay.max.ms.__cond_resched.loop_process_work.process_one_work.worker_thread.kthread
    533.92 ±140%     -97.3%      14.58 ±223%  perf-sched.wait_and_delay.max.ms.__cond_resched.ww_mutex_lock.drm_gem_vunmap_unlocked.drm_gem_fb_vunmap.drm_atomic_helper_commit_planes
    462.45 ± 20%    -100.0%       0.00        perf-sched.wait_and_delay.max.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.x64_sys_call
     49.56 ± 39%     -57.7%      20.98 ± 47%  perf-sched.wait_and_delay.max.ms.io_schedule.folio_wait_bit_common.folio_wait_writeback.__filemap_fdatawait_range
    110.68 ± 24%     -65.4%      38.31 ± 44%  perf-sched.wait_and_delay.max.ms.io_schedule.rq_qos_wait.wbt_wait.__rq_qos_throttle
     49.88 ±  7%    +167.8%     133.58 ± 54%  perf-sched.wait_and_delay.max.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
    261.23 ±122%    +620.9%       1883 ± 41%  perf-sched.wait_and_delay.max.ms.xlog_wait_on_iclog.xlog_cil_push_work.process_one_work.worker_thread
     16.91 ±122%    +139.9%      40.56 ± 31%  perf-sched.wait_time.avg.ms.__cond_resched.down_write.xfs_ilock_for_iomap.xfs_buffered_write_iomap_begin.iomap_iter
     14.02 ± 65%     -94.3%       0.80 ±200%  perf-sched.wait_time.avg.ms.__cond_resched.generic_perform_write.shmem_file_write_iter.do_iter_readv_writev.vfs_iter_write
      1.04 ± 67%     -98.2%       0.02 ± 34%  perf-sched.wait_time.avg.ms.__cond_resched.loop_process_work.process_one_work.worker_thread.kthread
     17.79 ±200%   +1124.8%     217.93 ± 42%  perf-sched.wait_time.avg.ms.__cond_resched.shrink_folio_list.evict_folios.try_to_shrink_lruvec.shrink_one
     27.14 ± 32%     -38.6%      16.67 ±  9%  perf-sched.wait_time.avg.ms.__cond_resched.writeback_get_folio.writeback_iter.iomap_writepages.xfs_vm_writepages
    531.34 ±141%     -94.0%      31.76 ±108%  perf-sched.wait_time.avg.ms.__cond_resched.ww_mutex_lock.drm_gem_vunmap_unlocked.drm_gem_fb_vunmap.drm_atomic_helper_commit_planes
     23.22 ± 49%     -78.0%       5.10 ±107%  perf-sched.wait_time.avg.ms.__cond_resched.zap_pmd_range.isra.0.unmap_page_range
     13.99 ± 14%     +72.1%      24.08 ±  4%  perf-sched.wait_time.avg.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.x64_sys_call
      0.40           +15.0%       0.46 ±  4%  perf-sched.wait_time.avg.ms.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
      4.36 ± 62%     -72.2%       1.21 ± 39%  perf-sched.wait_time.avg.ms.io_schedule.folio_wait_bit_common.folio_wait_writeback.__filemap_fdatawait_range
     26.94 ± 25%     -28.1%      19.37 ±  3%  perf-sched.wait_time.avg.ms.io_schedule.rq_qos_wait.wbt_wait.__rq_qos_throttle
      4.12 ±  2%     -25.7%       3.06 ± 13%  perf-sched.wait_time.avg.ms.rcu_gp_kthread.kthread.ret_from_fork.ret_from_fork_asm
    128.58 ± 17%    +229.0%     423.04 ± 16%  perf-sched.wait_time.avg.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
     14.18 ± 30%     -71.0%       4.12 ± 57%  perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.__flush_workqueue.xlog_cil_push_now.isra
     16.09 ± 62%    -100.0%       0.00        perf-sched.wait_time.avg.ms.schedule_timeout.io_schedule_timeout.__wait_for_common.submit_bio_wait
     40.15 ±  7%      +9.3%      43.90        perf-sched.wait_time.avg.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
     53.98 ± 17%    +554.9%     353.50 ± 25%  perf-sched.wait_time.avg.ms.sigsuspend.__x64_sys_rt_sigsuspend.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.09 ±127%    -100.0%       0.00        perf-sched.wait_time.avg.ms.xlog_force_lsn.xfs_log_force_seq.xfs_file_fsync.__do_sys_msync
     36.56 ± 50%    -100.0%       0.00        perf-sched.wait_time.avg.ms.xlog_wait_on_iclog.xfs_file_fsync.__do_sys_msync.do_syscall_64
     79.98 ±107%    +521.8%     497.31 ± 31%  perf-sched.wait_time.avg.ms.xlog_wait_on_iclog.xlog_cil_push_work.process_one_work.worker_thread
     35.31 ± 50%     +40.6%      49.65 ± 10%  perf-sched.wait_time.max.ms.__cond_resched.__kmalloc_noprof.ifs_alloc.isra.0
     16.91 ±122%    +185.9%      48.33 ±  7%  perf-sched.wait_time.max.ms.__cond_resched.down_write.xfs_ilock_for_iomap.xfs_buffered_write_iomap_begin.iomap_iter
    560.10 ± 61%     -94.7%      29.57 ±221%  perf-sched.wait_time.max.ms.__cond_resched.generic_perform_write.shmem_file_write_iter.do_iter_readv_writev.vfs_iter_write
    803.06 ± 73%     -99.8%       1.82 ±176%  perf-sched.wait_time.max.ms.__cond_resched.loop_process_work.process_one_work.worker_thread.kthread
     35.27 ± 38%     -56.9%      15.19 ± 62%  perf-sched.wait_time.max.ms.__cond_resched.rmap_walk_file.folio_mkclean.folio_clear_dirty_for_io.writeback_get_folio
     17.79 ±200%   +1874.3%     351.31 ± 41%  perf-sched.wait_time.max.ms.__cond_resched.shrink_folio_list.evict_folios.try_to_shrink_lruvec.shrink_one
     53.77 ± 20%     -45.7%      29.19 ± 32%  perf-sched.wait_time.max.ms.__cond_resched.writeback_get_folio.writeback_iter.iomap_writepages.xfs_vm_writepages
    531.34 ±141%     -93.9%      32.25 ±107%  perf-sched.wait_time.max.ms.__cond_resched.ww_mutex_lock.drm_gem_vunmap_unlocked.drm_gem_fb_vunmap.drm_atomic_helper_commit_planes
     36.60 ± 50%     +51.7%      55.51 ±  9%  perf-sched.wait_time.max.ms.__cond_resched.xfs_write_fault.do_page_mkwrite.do_shared_fault.do_pte_missing
     27.41           +19.5%      32.74 ±  5%  perf-sched.wait_time.max.ms.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
     49.56 ± 39%     -49.0%      25.26 ± 12%  perf-sched.wait_time.max.ms.io_schedule.folio_wait_bit_common.folio_wait_writeback.__filemap_fdatawait_range
    110.67 ± 24%     -59.0%      45.33 ±  2%  perf-sched.wait_time.max.ms.io_schedule.rq_qos_wait.wbt_wait.__rq_qos_throttle
     39.78 ± 33%    +428.9%     210.38 ±167%  perf-sched.wait_time.max.ms.irqentry_exit_to_user_mode.asm_exc_page_fault.[unknown]
      2.53 ±  3%      +9.8%       2.78 ±  4%  perf-sched.wait_time.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
     41.46 ± 63%    -100.0%       0.00        perf-sched.wait_time.max.ms.schedule_timeout.io_schedule_timeout.__wait_for_common.submit_bio_wait
     49.87 ±  7%    +167.8%     133.58 ± 54%  perf-sched.wait_time.max.ms.schedule_timeout.io_schedule_timeout.balance_dirty_pages.balance_dirty_pages_ratelimited_flags
      5.77 ±175%    -100.0%       0.00        perf-sched.wait_time.max.ms.xlog_force_lsn.xfs_log_force_seq.xfs_file_fsync.__do_sys_msync
     92.08 ± 21%    -100.0%       0.00        perf-sched.wait_time.max.ms.xlog_wait_on_iclog.xfs_file_fsync.__do_sys_msync.do_syscall_64
    290.66 ±102%    +547.9%       1883 ± 41%  perf-sched.wait_time.max.ms.xlog_wait_on_iclog.xlog_cil_push_work.process_one_work.worker_thread




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [linus:master] [fortify]  239d87327d:  vm-scalability.throughput 17.3% improvement
  2025-01-09  6:57 [linus:master] [fortify] 239d87327d: vm-scalability.throughput 17.3% improvement kernel test robot
@ 2025-01-09 16:51 ` Kees Cook
  2025-01-09 20:38   ` Kees Cook
  0 siblings, 1 reply; 8+ messages in thread
From: Kees Cook @ 2025-01-09 16:51 UTC (permalink / raw)
  To: kernel test robot
  Cc: oe-lkp, lkp, linux-kernel, Thomas Weißschuh, Nilay Shroff,
	Yury Norov, Greg Kroah-Hartman, linux-hardening

On Thu, Jan 09, 2025 at 02:57:58PM +0800, kernel test robot wrote:
> kernel test robot noticed a 17.3% improvement of vm-scalability.throughput on:
> 
> commit: 239d87327dcd361b0098038995f8908f3296864f ("fortify: Hide run-time copy size from value range tracking")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

Well that is unexpected. There should be no binary output difference
with that patch. I will investigate...

-- 
Kees Cook


* Re: [linus:master] [fortify]  239d87327d:  vm-scalability.throughput 17.3% improvement
  2025-01-09 16:51 ` Kees Cook
@ 2025-01-09 20:38   ` Kees Cook
  2025-01-09 20:52     ` Mateusz Guzik
  0 siblings, 1 reply; 8+ messages in thread
From: Kees Cook @ 2025-01-09 20:38 UTC (permalink / raw)
  To: kernel test robot
  Cc: oe-lkp, lkp, linux-kernel, Thomas Weißschuh, Nilay Shroff,
	Yury Norov, Greg Kroah-Hartman, linux-hardening

On Thu, Jan 09, 2025 at 08:51:44AM -0800, Kees Cook wrote:
> On Thu, Jan 09, 2025 at 02:57:58PM +0800, kernel test robot wrote:
> > kernel test robot noticed a 17.3% improvement of vm-scalability.throughput on:
> > 
> > commit: 239d87327dcd361b0098038995f8908f3296864f ("fortify: Hide run-time copy size from value range tracking")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> 
> Well that is unexpected. There should be no binary output difference
> with that patch. I will investigate...

It looks like hiding the size value from GCC has the side-effect of
breaking memcpy inlining in many places. I would expect this to make
things _slower_, though. O_o

-- 
Kees Cook


* Re: [linus:master] [fortify]  239d87327d:  vm-scalability.throughput 17.3% improvement
  2025-01-09 20:38   ` Kees Cook
@ 2025-01-09 20:52     ` Mateusz Guzik
  2025-01-09 21:12       ` Kees Cook
  0 siblings, 1 reply; 8+ messages in thread
From: Mateusz Guzik @ 2025-01-09 20:52 UTC (permalink / raw)
  To: Kees Cook
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel,
	Thomas Weißschuh, Nilay Shroff, Yury Norov,
	Greg Kroah-Hartman, linux-hardening

On Thu, Jan 09, 2025 at 12:38:04PM -0800, Kees Cook wrote:
> On Thu, Jan 09, 2025 at 08:51:44AM -0800, Kees Cook wrote:
> > On Thu, Jan 09, 2025 at 02:57:58PM +0800, kernel test robot wrote:
> > > kernel test robot noticed a 17.3% improvement of vm-scalability.throughput on:
> > > 
> > > commit: 239d87327dcd361b0098038995f8908f3296864f ("fortify: Hide run-time copy size from value range tracking")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > 
> > Well that is unexpected. There should be no binary output difference
> > with that patch. I will investigate...
> 
> It looks like hiding the size value from GCC has the side-effect of
> breaking memcpy inlining in many places. I would expect this to make
> things _slower_, though. O_o
> 

That depends on what was emitted in its place and what CPU is executing it.

Notably, if gcc elected to emit rep movs{q,b}, the CPU at hand lacks
FSRM, and the size is small enough, then such code can indeed be slower
than suffering a call to memcpy (which does not issue rep mov).

I have seen gcc go to great pains to align a buffer for rep movsq even
when that was guaranteed to be unnecessary, for example.

Can you disasm an example affected spot?

Gcc has a bunch of magic switches to tell it what to emit inline; the
thing to do is to convince it to roll with a bunch of movs (not rep mov)
for sizes small enough(tm). What constitutes small enough depends on the
uarch.


* Re: [linus:master] [fortify]  239d87327d:  vm-scalability.throughput 17.3% improvement
  2025-01-09 20:52     ` Mateusz Guzik
@ 2025-01-09 21:12       ` Kees Cook
  2025-01-09 22:01         ` Mateusz Guzik
  0 siblings, 1 reply; 8+ messages in thread
From: Kees Cook @ 2025-01-09 21:12 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel,
	Thomas Weißschuh, Nilay Shroff, Yury Norov,
	Greg Kroah-Hartman, linux-hardening

On Thu, Jan 09, 2025 at 09:52:31PM +0100, Mateusz Guzik wrote:
> On Thu, Jan 09, 2025 at 12:38:04PM -0800, Kees Cook wrote:
> > On Thu, Jan 09, 2025 at 08:51:44AM -0800, Kees Cook wrote:
> > > On Thu, Jan 09, 2025 at 02:57:58PM +0800, kernel test robot wrote:
> > > > kernel test robot noticed a 17.3% improvement of vm-scalability.throughput on:
> > > > 
> > > > commit: 239d87327dcd361b0098038995f8908f3296864f ("fortify: Hide run-time copy size from value range tracking")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > 
> > > Well that is unexpected. There should be no binary output difference
> > > with that patch. I will investigate...
> > 
> > It looks like hiding the size value from GCC has the side-effect of
> > breaking memcpy inlining in many places. I would expect this to make
> > things _slower_, though. O_o

I think it's disabling value-range-based inlining; I'm trying to
construct some tests...

> This depends on what was emitted in place and what CPU is executing it.
> 
> Notably if gcc elected to emit rep movs{q,b}, the CPU at hand does
> not have FSRM and the size is low enough, then such code can indeed be
> slower than suffering a call to memcpy (which does not issue rep mov).
> 
> I had seen gcc go to great pains to align a buffer for rep movsq even
> when it was guaranteed to not be necessary for example.
> 
> Can you disasm an example affected spot?

I tried to find the most self-contained example I could, and I ended up
with:

static void ipv6_rpl_addr_decompress(struct in6_addr *dst,
                                     const struct in6_addr *daddr,
                                     const void *post, unsigned char pfx)
{
        memcpy(dst, daddr, pfx);
        memcpy(&dst->s6_addr[pfx], post, IPV6_PFXTAIL_LEN(pfx));
}

Before 239d87327dcd ("fortify: Hide run-time copy size from value range
tracking"), the assembler is:

ffffffff8209f0e0 <ipv6_rpl_addr_decompress>:
ffffffff8209f0e0:       e8 5b 62 fe fe          call   ffffffff81085340 <__fentry__>
ffffffff8209f0e5:       0f b6 c1                movzbl %cl,%eax
ffffffff8209f0e8:       49 89 d0                mov    %rdx,%r8
ffffffff8209f0eb:       83 f8 08                cmp    $0x8,%eax
ffffffff8209f0ee:       73 24                   jae    ffffffff8209f114 <ipv6_rpl_addr_decompress+0x34>
ffffffff8209f0f0:       a8 04                   test   $0x4,%al
ffffffff8209f0f2:       75 64                   jne    ffffffff8209f158 <ipv6_rpl_addr_decompress+0x78>
ffffffff8209f0f4:       85 c0                   test   %eax,%eax
ffffffff8209f0f6:       74 09                   je     ffffffff8209f101 <ipv6_rpl_addr_decompress+0x21>
ffffffff8209f0f8:       0f b6 16                movzbl (%rsi),%edx
ffffffff8209f0fb:       88 17                   mov    %dl,(%rdi)
ffffffff8209f0fd:       a8 02                   test   $0x2,%al
ffffffff8209f0ff:       75 65                   jne    ffffffff8209f166 <ipv6_rpl_addr_decompress+0x86>
ffffffff8209f101:       ba 10 00 00 00          mov    $0x10,%edx
ffffffff8209f106:       48 01 c7                add    %rax,%rdi
ffffffff8209f109:       4c 89 c6                mov    %r8,%rsi
ffffffff8209f10c:       48 29 c2                sub    %rax,%rdx
ffffffff8209f10f:       e9 bc 33 21 00          jmp    ffffffff822b24d0 <__memcpy>
ffffffff8209f114:       48 8b 16                mov    (%rsi),%rdx
ffffffff8209f117:       4c 8d 4f 08             lea    0x8(%rdi),%r9
ffffffff8209f11b:       49 83 e1 f8             and    $0xfffffffffffffff8,%r9
ffffffff8209f11f:       48 89 17                mov    %rdx,(%rdi)
ffffffff8209f122:       48 8b 54 06 f8          mov    -0x8(%rsi,%rax,1),%rdx
ffffffff8209f127:       48 89 54 07 f8          mov    %rdx,-0x8(%rdi,%rax,1)
ffffffff8209f12c:       48 89 fa                mov    %rdi,%rdx
ffffffff8209f12f:       4c 29 ca                sub    %r9,%rdx
ffffffff8209f132:       48 29 d6                sub    %rdx,%rsi
ffffffff8209f135:       01 c2                   add    %eax,%edx
ffffffff8209f137:       83 e2 f8                and    $0xfffffff8,%edx
ffffffff8209f13a:       83 fa 08                cmp    $0x8,%edx
ffffffff8209f13d:       72 c2                   jb     ffffffff8209f101 <ipv6_rpl_addr_decompress+0x21>
ffffffff8209f13f:       83 e2 f8                and    $0xfffffff8,%edx
ffffffff8209f142:       31 c9                   xor    %ecx,%ecx
ffffffff8209f144:       41 89 ca                mov    %ecx,%r10d
ffffffff8209f147:       83 c1 08                add    $0x8,%ecx
ffffffff8209f14a:       4e 8b 1c 16             mov    (%rsi,%r10,1),%r11
ffffffff8209f14e:       4f 89 1c 11             mov    %r11,(%r9,%r10,1)
ffffffff8209f152:       39 d1                   cmp    %edx,%ecx
ffffffff8209f154:       72 ee                   jb     ffffffff8209f144 <ipv6_rpl_addr_decompress+0x64>
ffffffff8209f156:       eb a9                   jmp    ffffffff8209f101 <ipv6_rpl_addr_decompress+0x21>
ffffffff8209f158:       8b 16                   mov    (%rsi),%edx
ffffffff8209f15a:       89 17                   mov    %edx,(%rdi)
ffffffff8209f15c:       8b 54 06 fc             mov    -0x4(%rsi,%rax,1),%edx
ffffffff8209f160:       89 54 07 fc             mov    %edx,-0x4(%rdi,%rax,1)
ffffffff8209f164:       eb 9b                   jmp    ffffffff8209f101 <ipv6_rpl_addr_decompress+0x21>
ffffffff8209f166:       0f b7 54 06 fe          movzwl -0x2(%rsi,%rax,1),%edx
ffffffff8209f16b:       66 89 54 07 fe          mov    %dx,-0x2(%rdi,%rax,1)
ffffffff8209f170:       eb 8f                   jmp    ffffffff8209f101 <ipv6_rpl_addr_decompress+0x21>

With the size hidden, it becomes:

ffffffff82096260 <ipv6_rpl_addr_decompress>:
ffffffff82096260:       e8 db e0 fe fe          call   ffffffff81084340 <__fentry__>
ffffffff82096265:       55                      push   %rbp
ffffffff82096266:       0f b6 e9                movzbl %cl,%ebp
ffffffff82096269:       53                      push   %rbx
ffffffff8209626a:       48 89 d3                mov    %rdx,%rbx
ffffffff8209626d:       48 89 ea                mov    %rbp,%rdx
ffffffff82096270:       e8 9b 0a 21 00          call   ffffffff822a6d10 <__memcpy>
ffffffff82096275:       ba 10 00 00 00          mov    $0x10,%edx
ffffffff8209627a:       48 89 de                mov    %rbx,%rsi
ffffffff8209627d:       5b                      pop    %rbx
ffffffff8209627e:       48 89 c7                mov    %rax,%rdi
ffffffff82096281:       48 29 ea                sub    %rbp,%rdx
ffffffff82096284:       48 01 ef                add    %rbp,%rdi
ffffffff82096287:       5d                      pop    %rbp
ffffffff82096288:       e9 83 0a 21 00          jmp    ffffffff822a6d10 <__memcpy>
ffffffff8209628d:       0f 1f 00                nopl   (%rax)

In the former, it looks like GCC is calculating how many 8-byte, 4-byte,
and single-byte copies to perform for the first memcpy, since it knows
the value range must be [0..255]. In both cases the second memcpy is
tail-called.

> Gcc has a bunch of magic switches to tell it what to emit in line, the
> thing to do is to convince it to roll with a bunch of mov (not rep mov)
> for sizes small enough(tm). What constitutes small enough depends on the
> uarch.

I found -mmemcpy-strategy:
https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-mmemcpy-strategy_003dstrategy

But I don't see where to find the algs, and it seems like the above asm
is being produced when GCC thinks a value is in a certain range (rather
than compile-time known), which would make sense given what the commit
did: it removed visibility into value ranges.

-Kees

-- 
Kees Cook


* Re: [linus:master] [fortify] 239d87327d: vm-scalability.throughput 17.3% improvement
  2025-01-09 21:12       ` Kees Cook
@ 2025-01-09 22:01         ` Mateusz Guzik
  2025-01-10 16:58           ` Kees Cook
  0 siblings, 1 reply; 8+ messages in thread
From: Mateusz Guzik @ 2025-01-09 22:01 UTC (permalink / raw)
  To: Kees Cook
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel,
	Thomas Weißschuh, Nilay Shroff, Yury Norov,
	Greg Kroah-Hartman, linux-hardening

On Thu, Jan 9, 2025 at 10:12 PM Kees Cook <kees@kernel.org> wrote:
>
> On Thu, Jan 09, 2025 at 09:52:31PM +0100, Mateusz Guzik wrote:
> > On Thu, Jan 09, 2025 at 12:38:04PM -0800, Kees Cook wrote:
> > > On Thu, Jan 09, 2025 at 08:51:44AM -0800, Kees Cook wrote:
> > > > On Thu, Jan 09, 2025 at 02:57:58PM +0800, kernel test robot wrote:
> > > > > kernel test robot noticed a 17.3% improvement of vm-scalability.throughput on:
> > > > >
> > > > > commit: 239d87327dcd361b0098038995f8908f3296864f ("fortify: Hide run-time copy size from value range tracking")
> > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > >
> > > > Well that is unexpected. There should be no binary output difference
> > > > with that patch. I will investigate...
> > >
> > > It looks like hiding the size value from GCC has the side-effect of
> > > breaking memcpy inlining in many places. I would expect this to make
> > > things _slower_, though. O_o
>
> I think it's disabling value-range-based inlining, I'm trying to
> construct some tests...
>
> > This depends on what was emitted in place and what CPU is executing it.
> >
> > Notably if gcc elected to emit rep movs{q,b}, the CPU at hand does
> > not have FSRM and the size is low enough, then such code can indeed be
> > slower than suffering a call to memcpy (which does not issue rep mov).
> >
> > I had seen gcc go to great pains to align a buffer for rep movsq even
> > when it was guaranteed to not be necessary for example.
> >
> > Can you disasm an example affected spot?
>
> I tried to find the most self-contained example I could, and I ended up
> with:
>
> static void ipv6_rpl_addr_decompress(struct in6_addr *dst,
>                                      const struct in6_addr *daddr,
>                                      const void *post, unsigned char pfx)
> {
>         memcpy(dst, daddr, pfx);
>         memcpy(&dst->s6_addr[pfx], post, IPV6_PFXTAIL_LEN(pfx));
> }
>

Well, I did what I should have done from the get-go and took the liberty
of looking at the profile.

         %stddev     %change         %stddev
             \          |                \
[snip]
      0.00            +6.5        6.54 ± 66%
perf-profile.calltrace.cycles-pp.memcpy_orig.copy_page_from_iter_atomic.generic_perform_write.shmem_file_write_iter.do_iter_readv_writev

Disassembling copy_page_from_iter_atomic *prior* to the change indeed
reveals rep movsq as I suspected (second to last instruction):

<+919>:   mov    (%rax),%rdx
<+922>:   lea    0x8(%rsi),%rdi
<+926>:   and    $0xfffffffffffffff8,%rdi
<+930>:   mov    %rdx,(%rsi)
<+933>:   mov    %r8d,%edx
<+936>:   mov    -0x8(%rax,%rdx,1),%rcx
<+941>:   mov    %rcx,-0x8(%rsi,%rdx,1)
<+946>:   sub    %rdi,%rsi
<+949>:   mov    %rsi,%rdx
<+952>:   sub    %rsi,%rax
<+955>:   lea    (%r8,%rdx,1),%ecx
<+959>:   mov    %rax,%rsi
<+962>:   shr    $0x3,%ecx
<+965>:   rep movsq %ds:(%rsi),%es:(%rdi)
<+968>:   jmp    0xffffffff819157c5 <copy_page_from_iter_atomic+869>

With the reported patch this is a call to memcpy.

This is the guy:
static __always_inline
size_t memcpy_from_iter(void *iter_from, size_t progress,
                        size_t len, void *to, void *priv2)
{
        memcpy(to + progress, iter_from, len);
        return 0;
}

I don't know what the specific bench is doing; I'm assuming the passed
values were low enough that the overhead of spinning up rep movsq took
over.

gcc should retain the ability to optimize this, except it needs to be
convinced to not emit rep movsq for variable sizes (and instead call
memcpy).

For user memory access there is a bunch of hackery to inline rep mov
for CPUs where it does not suck for small sizes (see
rep_movs_alternative). Someone(tm) should port it over to memcpy
handling as well.

The expected state would be that for sizes known at compilation time
it rolls with movs as needed (no rep), otherwise emits the magic rep
movs/memcpy invocation, except for when it would be tail-called.

In your ipv6_rpl_addr_decompress example gcc went a little crazy, which
I mentioned does happen. However, most of the time it is doing a good
job, and a newly generated call to memcpy should make those spots
slower. I presume they are merely not being benchmarked here. Note that
going from inline movs (no rep) to a call to memcpy which does movs
(again no rep) comes with a "mere" function call overhead, which is a
different beast than spinning up rep movs on CPUs without FSRM.

That is to say, contrary to the report above, I believe the change is
in fact a regression which just so happened to make things faster for
a specific case. The unintended speed up can be achieved without
regressing anything else by taming the craziness.

Reading the commit log, I don't know what the way out is; perhaps you
could rope in some gcc folks to ask? Screwing with optimization just to
avoid a warning is definitely not the best option.
-- 
Mateusz Guzik <mjguzik gmail.com>


* Re: [linus:master] [fortify] 239d87327d: vm-scalability.throughput 17.3% improvement
  2025-01-09 22:01         ` Mateusz Guzik
@ 2025-01-10 16:58           ` Kees Cook
  2025-01-10 19:14             ` Mateusz Guzik
  0 siblings, 1 reply; 8+ messages in thread
From: Kees Cook @ 2025-01-10 16:58 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel,
	Thomas Weißschuh, Nilay Shroff, Yury Norov,
	Greg Kroah-Hartman, linux-hardening

On Thu, Jan 09, 2025 at 11:01:47PM +0100, Mateusz Guzik wrote:
> On Thu, Jan 9, 2025 at 10:12 PM Kees Cook <kees@kernel.org> wrote:
> >
> > On Thu, Jan 09, 2025 at 09:52:31PM +0100, Mateusz Guzik wrote:
> > > On Thu, Jan 09, 2025 at 12:38:04PM -0800, Kees Cook wrote:
> > > > On Thu, Jan 09, 2025 at 08:51:44AM -0800, Kees Cook wrote:
> > > > > On Thu, Jan 09, 2025 at 02:57:58PM +0800, kernel test robot wrote:
> > > > > > kernel test robot noticed a 17.3% improvement of vm-scalability.throughput on:
> > > > > >
> > > > > > commit: 239d87327dcd361b0098038995f8908f3296864f ("fortify: Hide run-time copy size from value range tracking")
> > > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > > >
> > > > > Well that is unexpected. There should be no binary output difference
> > > > > with that patch. I will investigate...
> > > >
> > > > It looks like hiding the size value from GCC has the side-effect of
> > > > breaking memcpy inlining in many places. I would expect this to make
> > > > things _slower_, though. O_o
> >
> > I think it's disabling value-range-based inlining, I'm trying to
> > construct some tests...
> >
> > > This depends on what was emitted in place and what CPU is executing it.
> > >
> > > Notably if gcc elected to emit rep movs{q,b}, the CPU at hand does
> > > not have FSRM and the size is low enough, then such code can indeed be
> > > slower than suffering a call to memcpy (which does not issue rep mov).
> > >
> > > I had seen gcc go to great pains to align a buffer for rep movsq even
> > > when it was guaranteed to not be necessary for example.
> > >
> > > Can you disasm an example affected spot?
> >
> > I tried to find the most self-contained example I could, and I ended up
> > with:
> >
> > static void ipv6_rpl_addr_decompress(struct in6_addr *dst,
> >                                      const struct in6_addr *daddr,
> >                                      const void *post, unsigned char pfx)
> > {
> >         memcpy(dst, daddr, pfx);
> >         memcpy(&dst->s6_addr[pfx], post, IPV6_PFXTAIL_LEN(pfx));
> > }
> >
> 
> Well I did what I should have from the get go and took the liberty of
> looking at the profile.
> 
>          %stddev     %change         %stddev
>              \          |                \
> [snip]
>       0.00            +6.5        6.54 ± 66%
> perf-profile.calltrace.cycles-pp.memcpy_orig.copy_page_from_iter_atomic.generic_perform_write.shmem_file_write_iter.do_iter_readv_writev
> 
> Disassembling copy_page_from_iter_atomic *prior* to the change indeed
> reveals rep movsq as I suspected (second to last instruction):
> 
> <+919>:   mov    (%rax),%rdx
> <+922>:   lea    0x8(%rsi),%rdi
> <+926>:   and    $0xfffffffffffffff8,%rdi
> <+930>:   mov    %rdx,(%rsi)
> <+933>:   mov    %r8d,%edx
> <+936>:   mov    -0x8(%rax,%rdx,1),%rcx
> <+941>:   mov    %rcx,-0x8(%rsi,%rdx,1)
> <+946>:   sub    %rdi,%rsi
> <+949>:   mov    %rsi,%rdx
> <+952>:   sub    %rsi,%rax
> <+955>:   lea    (%r8,%rdx,1),%ecx
> <+959>:   mov    %rax,%rsi
> <+962>:   shr    $0x3,%ecx
> <+965>:   rep movsq %ds:(%rsi),%es:(%rdi)
> <+968>:   jmp    0xffffffff819157c5 <copy_page_from_iter_atomic+869>
> 
> With the reported patch this is a call to memcpy.
> 
> This is the guy:
> static __always_inline
> size_t memcpy_from_iter(void *iter_from, size_t progress,
>                         size_t len, void *to, void *priv2)
> {
>         memcpy(to + progress, iter_from, len);
>         return 0;
> }

Thanks for looking at this case!

> 
> I don't know what the specific bench is doing, I'm assuming passed
> values were low enough that the overhead of spinning up rep movsq took
> over.
> 
> gcc should retain the ability to optimize this, except it needs to be
> convinced to not emit rep movsq for variable sizes (and instead call
> memcpy).
> 
> For user memory access there is a bunch of hackery to inline rep mov
> for CPUs where it does not suck for small sizes (see
> rep_movs_alternative). Someone(tm) should port it over to memcpy
> handling as well.
> 
> The expected state would be that for sizes known at compilation time
> it rolls with movs as needed (no rep), otherwise emits the magic rep
> movs/memcpy invocation, except for when it would be tail-called.
> 
> In your ipv6_rpl_addr_decompress example gcc went a little crazy,
> which I mentioned does happen. However, most of the time it is doing a
> good job instead and a now generated call to memcpy should make things
> slower. I presume these spots are merely not being benchmarked here.
> Note that going from inline movs (no rep) to a call to memcpy which
> does movs (again no rep) comes with a "mere" function call overhead,
> which is a different beast than spinning up rep movs on CPUs without
> FSRM.
> 
> That is to say, contrary to the report above, I believe the change is
> in fact a regression which just so happened to make things faster for
> a specific case. The unintended speed up can be achieved without
> regressing anything else by taming the craziness.

How do we best make sense of the perf report? Even in the iter case
above, it looks like a perf improvement?

The fortify change lets GCC still inline compile-time-constant sizes, so
that's good. But it seems to force all the "in a given range" cases into
calls.

> Reading the commit log I don't know what the way out is, perhaps you
> could rope in some gcc folk to ask? Screwing with optimization to not
> see a warning is definitely not the best option.

Yeah, if we do need to revert this, I'm going to need another way to
silence the GCC value-range checker for memcpy...

-- 
Kees Cook


* Re: [linus:master] [fortify] 239d87327d: vm-scalability.throughput 17.3% improvement
  2025-01-10 16:58           ` Kees Cook
@ 2025-01-10 19:14             ` Mateusz Guzik
  0 siblings, 0 replies; 8+ messages in thread
From: Mateusz Guzik @ 2025-01-10 19:14 UTC (permalink / raw)
  To: Kees Cook
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel,
	Thomas Weißschuh, Nilay Shroff, Yury Norov,
	Greg Kroah-Hartman, linux-hardening

On Fri, Jan 10, 2025 at 5:58 PM Kees Cook <kees@kernel.org> wrote:
>
> On Thu, Jan 09, 2025 at 11:01:47PM +0100, Mateusz Guzik wrote:
> > That is to say, contrary to the report above, I believe the change is
> > in fact a regression which just so happened to make things faster for
> > a specific case. The unintended speed up can be achieved without
> > regressing anything else by taming the craziness.
>
> How do we best make sense of the perf report? Even in the iter case
> above, it looks like a perf improvement?
>

The kernel without your change compiled with gcc is leaving
performance on the table in select cases, namely when it elects to use
rep movsq for sizes below a magic threshold (depends on uarch).

Your change has the unintended side effect of making
copy_page_from_iter_atomic use a plain memcpy call, which just happens
to be the right thing to do for this particular consumer.

However, it also has the side effect of forcing a memcpy call in places
which were optimized just fine -- for example, where there is a variable
number of bytes to copy but the range is small and the upper limit is
also small, gcc will elect to emit a few movs and be done with it, which
is faster than calling memcpy. That is to say, for spots like that this
is a regression.

In terms of optimizing all of this, the thing to do is to convince gcc
not to emit rep movsq for the known problematic cases, while not messing
with places which are already optimized fine.

-- 
Mateusz Guzik <mjguzik gmail.com>


end of thread, other threads:[~2025-01-10 19:15 UTC | newest]

Thread overview: 8+ messages
2025-01-09  6:57 [linus:master] [fortify] 239d87327d: vm-scalability.throughput 17.3% improvement kernel test robot
2025-01-09 16:51 ` Kees Cook
2025-01-09 20:38   ` Kees Cook
2025-01-09 20:52     ` Mateusz Guzik
2025-01-09 21:12       ` Kees Cook
2025-01-09 22:01         ` Mateusz Guzik
2025-01-10 16:58           ` Kees Cook
2025-01-10 19:14             ` Mateusz Guzik
