Re: [pipe] 759c01142a: -51.5% hackbench.throughput

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Huang, Ying <ying.huang@linux.intel.com>
To: lkp@lists.01.org
Subject: Re: [pipe] 759c01142a: -51.5% hackbench.throughput
Date: Fri, 05 Feb 2016 16:43:54 +0800	[thread overview]
Message-ID: <87d1sb8s39.fsf@yhuang-dev.intel.com> (raw)
In-Reply-To: <87h9hxm5rf.fsf@yhuang-dev.intel.com>

[-- Attachment #1: Type: text/plain, Size: 12329 bytes --]

kernel test robot <ying.huang@linux.intel.com> writes:

> FYI, we noticed the below changes on
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> commit 759c01142a5d0f364a462346168a56de28a80f52 ("pipe: limit the per-user amount of pages allocated in pipes")
>
>
> =========================================================================================
> compiler/cpufreq_governor/ipc/kconfig/mode/nr_threads/rootfs/tbox_group/testcase:
>   gcc-4.9/performance/pipe/x86_64-rhel/threads/1600%/debian-x86_64-2015-02-07.cgz/lkp-wsx02/hackbench
>
> commit: 
>   558041d8d21b48287224dd0e32cf19563c77607c
>   759c01142a5d0f364a462346168a56de28a80f52
>
> 558041d8d21b4828 759c01142a5d0f364a46234616 
> ---------------- -------------------------- 
>          %stddev     %change         %stddev
>              \          |                \  
>     219967 .  4%     -51.5%     106583 .  1%  hackbench.throughput
>  5.319e+08 . 11%     -19.7%  4.273e+08 .  0%  hackbench.time.involuntary_context_switches
>    2422668 .  9%     -36.4%    1540731 .  2%  hackbench.time.minor_page_faults
>       7127 . 11%      +9.4%       7800 .  0%  hackbench.time.percent_of_cpu_this_job_got
>      42924 .  9%     +11.9%      48049 .  1%  hackbench.time.system_time
>       1665 . 11%     -49.3%     844.05 .  1%  hackbench.time.user_time
>     534.25 . 55%    +298.5%       2129 . 67%  numa-meminfo.node3.AnonHugePages
>       8581 . 57%     -48.4%       4427 .  1%  uptime.idle
>    5705345 . 13%     +58.8%    9061490 .  5%  softirqs.RCU
>     279868 .  8%     +20.4%     336916 .  1%  softirqs.SCHED
>      26.75 . 23%    +245.8%      92.50 . 39%  vmstat.procs.b
>       9809 . 13%     -64.7%       3463 .  7%  vmstat.procs.r
>     284320 . 13%     -45.9%     153686 .  1%  vmstat.system.in
>     114.00 . 10%     -50.9%      56.00 .  0%  time.file_system_outputs
>  5.319e+08 . 11%     -19.7%  4.273e+08 .  0%  time.involuntary_context_switches
>    2422668 .  9%     -36.4%    1540731 .  2%  time.minor_page_faults
>       1665 . 11%     -49.3%     844.05 .  1%  time.user_time
>      89.86 . 10%      +8.9%      97.84 .  0%  turbostat.%Busy
>       2270 . 10%      +8.9%       2471 .  0%  turbostat.Avg_MHz
>       1.45 .  8%     -52.6%       0.69 .  9%  turbostat.CPU%c1
>       0.17 . 13%     -52.2%       0.08 . 34%  turbostat.CPU%c3
>       8.53 .116%     -83.7%       1.39 .  3%  turbostat.CPU%c6
>   63583716 . 16%     -98.1%    1183913 . 53%  numa-numastat.node0.local_node
>   63587769 . 16%     -98.1%    1189061 . 53%  numa-numastat.node0.numa_hit
>   69989955 .  6%     -96.8%    2239323 .  8%  numa-numastat.node1.local_node
>   69995282 .  6%     -96.8%    2244474 .  8%  numa-numastat.node1.numa_hit
>   56620071 . 22%     -97.3%    1520012 . 32%  numa-numastat.node2.local_node
>   56625301 . 22%     -97.3%    1522594 . 31%  numa-numastat.node2.numa_hit
>   63753303 . 22%     -96.1%    2508634 . 14%  numa-numastat.node3.local_node
>   63754612 . 22%     -96.1%    2511213 . 14%  numa-numastat.node3.numa_hit
>   32233121 . 15%     -97.4%     830659 . 60%  numa-vmstat.node0.numa_hit
>   32193958 . 15%     -97.5%     790185 . 63%  numa-vmstat.node0.numa_local
>   35020615 . 10%     -96.4%    1253289 .  8%  numa-vmstat.node1.numa_hit
>   34968563 . 10%     -96.6%    1201078 .  9%  numa-vmstat.node1.numa_local
>   30394713 . 14%     -97.2%     843134 . 31%  numa-vmstat.node2.numa_hit
>   30373607 . 14%     -97.3%     824676 . 31%  numa-vmstat.node2.numa_local
>   32334081 . 21%     -95.5%    1469250 . 16%  numa-vmstat.node3.numa_hit
>   32286373 . 21%     -95.6%    1420300 . 16%  numa-vmstat.node3.numa_local
>  1.868e+08 . 11%     -51.0%   91569694 . 13%  cpuidle.C1-NHM.time
>    1743049 . 10%     -34.6%    1139878 . 11%  cpuidle.C1-NHM.usage
>  1.918e+08 . 12%     -49.7%   96433830 . 11%  cpuidle.C1E-NHM.time
>    1243978 . 16%     -40.7%     738168 . 12%  cpuidle.C1E-NHM.usage
>   70404853 .  8%     -47.4%   37040591 . 19%  cpuidle.C3-NHM.time
>  4.723e+09 .108%     -81.6%  8.672e+08 .  3%  cpuidle.C6-NHM.time
>     311666 . 18%     -47.9%     162252 . 15%  cpuidle.C6-NHM.usage
>  2.456e+08 . 18%     -53.2%   1.15e+08 .  9%  cpuidle.POLL.time
>     260576 . 21%     -58.6%     107765 . 17%  cpuidle.POLL.usage
>     906770 .  8%     -19.4%     730530 .  5%  proc-vmstat.numa_hint_faults
>   2.54e+08 . 11%     -97.1%    7464740 .  4%  proc-vmstat.numa_hit
>  2.539e+08 . 11%     -97.1%    7449281 .  4%  proc-vmstat.numa_local
>     662291 . 10%     -33.4%     441016 .  5%  proc-vmstat.numa_pages_migrated
>    1203460 .  6%     -16.7%    1002984 .  5%  proc-vmstat.numa_pte_updates
>    5716982 . 17%     -97.2%     158482 . 40%  proc-vmstat.pgalloc_dma32
>  2.521e+08 . 11%     -96.3%    9338972 .  3%  proc-vmstat.pgalloc_normal
>    3627983 .  4%     -26.1%    2682860 .  1%  proc-vmstat.pgfault
>  2.576e+08 . 11%     -96.4%    9352801 .  3%  proc-vmstat.pgfree
>     662291 . 10%     -33.4%     441016 .  5%  proc-vmstat.pgmigrate_success
>      27845 . 11%     -79.8%       5624 .  1%  slabinfo.kmalloc-1024.active_objs
>     879.00 . 11%     -79.2%     183.00 .  1%  slabinfo.kmalloc-1024.active_slabs
>      28142 . 11%     -79.1%       5879 .  0%  slabinfo.kmalloc-1024.num_objs
>     879.00 . 11%     -79.2%     183.00 .  1%  slabinfo.kmalloc-1024.num_slabs
>      20288 .  0%    +129.9%      46639 .  0%  slabinfo.kmalloc-64.active_objs
>     317.00 .  0%    +131.4%     733.50 .  0%  slabinfo.kmalloc-64.active_slabs
>      20288 .  0%    +131.6%      46980 .  0%  slabinfo.kmalloc-64.num_objs
>     317.00 .  0%    +131.4%     733.50 .  0%  slabinfo.kmalloc-64.num_slabs
>     335.50 .  9%     -44.8%     185.25 .  4%  slabinfo.taskstats.active_objs
>     335.50 .  9%     -44.8%     185.25 .  4%  slabinfo.taskstats.num_objs
>      38.97 . 29%     +51.7%      59.11 .  7%  perf-profile.cycles-pp.__account_scheduler_latency.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate
>      39.95 . 28%     +53.0%      61.10 .  7%  perf-profile.cycles-pp.ttwu_do_activate.constprop.86.try_to_wake_up.default_wake_function.autoremove_wake_function.__wake_up_common

>From above perf profile output,

Most time is spent for __account_scheduler_latency, which is part of
latency-stats (latencytop) utility to collect system latency statistics.
It uses a global spinlock (latency_lock) to protect its data structure,
the latency_lock becomes the bottleneck of the test.  The increased
context switches after the commit increases the contention of
latency_lock, so cause the regression.  After disable latency-stats, the
performance improved instead of regressed.  The compare result is as
follow:

=========================================================================================
compiler/cpufreq_governor/debug-setup/ipc/kconfig/mode/nr_threads/rootfs/tbox_group/testcase:
  gcc-4.9/performance/no-latency_stats/pipe/x86_64-rhel/threads/1600%/debian-x86_64-2015-02-07.cgz/lkp-wsx02/hackbench

commit: 
  558041d8d21b48287224dd0e32cf19563c77607c
  759c01142a5d0f364a462346168a56de28a80f52

558041d8d21b4828 759c01142a5d0f364a46234616 
---------------- -------------------------- 
       fail:runs  %reproduction    fail:runs
           |             |             |    
          1:6          -17%            :4     last_state.is_incomplete_run
         %stddev     %change         %stddev
             \          |                \  
    251067 ±  4%     +53.2%     384708 ±  0%  hackbench.throughput
 1.468e+09 ±  6%     -51.6%  7.099e+08 ± 40%  hackbench.time.involuntary_context_switches
      7299 ±  4%     -49.0%       3725 ± 57%  hackbench.time.percent_of_cpu_this_job_got
     42840 ±  3%     -33.3%      28582 ± 38%  hackbench.time.system_time

I think the performance improvement comes from the decreased the cache
miss rate, because of the decreased working set, which comes from the
decreased pipe size, as follow,

7.478e+09 ±  4%     -54.8%   3.38e+09 ± 36%  perf-stat.LLC-load-misses
 2.087e+11 ±  3%      -3.6%  2.012e+11 ± 37%  perf-stat.LLC-loads
 1.605e+10 ±  3%     -76.5%  3.771e+09 ± 37%  perf-stat.LLC-store-misses
 6.751e+10 ±  2%     -11.7%  5.963e+10 ± 36%  perf-stat.LLC-stores
 
The store miss rate decreased from about 23.77% to 6.32%.

Best Regards,
Huang, Ying

>       3.89 . 60%    +334.7%      16.93 . 13%  perf-profile.cycles-pp.__vfs_read.vfs_read.sys_read.entry_SYSCALL_64_fastpath
>       0.70 . 30%   +1969.3%      14.49 . 11%  perf-profile.cycles-pp.__wake_up_common.__wake_up_sync_key.pipe_read.__vfs_read.vfs_read
>       0.71 . 30%   +1971.0%      14.65 . 11%  perf-profile.cycles-pp.__wake_up_sync_key.pipe_read.__vfs_read.vfs_read.sys_read
>       0.69 . 84%     -92.3%       0.05 . 63%  perf-profile.cycles-pp._raw_spin_lock.__schedule.schedule.exit_to_usermode_loop.syscall_return_slowpath
>      34.58 . 33%     +57.1%      54.33 .  7%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__account_scheduler_latency.enqueue_entity.enqueue_task_fair.activate_task
>      39.61 . 28%     +53.7%      60.89 .  7%  perf-profile.cycles-pp.activate_task.ttwu_do_activate.try_to_wake_up.default_wake_function.autoremove_wake_function
>       0.70 . 30%   +1967.9%      14.47 . 11%  perf-profile.cycles-pp.autoremove_wake_function.__wake_up_common.__wake_up_sync_key.pipe_read.__vfs_read
>       0.70 . 30%   +1965.0%      14.46 . 11%  perf-profile.cycles-pp.default_wake_function.autoremove_wake_function.__wake_up_common.__wake_up_sync_key.pipe_read
>      38.99 . 28%     +54.5%      60.25 .  7%  perf-profile.cycles-pp.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate.try_to_wake_up
>      39.45 . 28%     +53.9%      60.72 .  7%  perf-profile.cycles-pp.enqueue_task_fair.activate_task.ttwu_do_activate.try_to_wake_up.default_wake_function
>      57.97 . 20%     +40.4%      81.35 .  8%  perf-profile.cycles-pp.entry_SYSCALL_64_fastpath
>       1.31 . 60%     -92.0%       0.10 . 67%  perf-profile.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.__schedule.schedule.exit_to_usermode_loop
>       2.54 . 68%     -81.8%       0.46 . 60%  perf-profile.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.try_to_wake_up.default_wake_function.autoremove_wake_function
>      34.38 . 33%     +55.7%      53.54 .  6%  perf-profile.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.__account_scheduler_latency.enqueue_entity.enqueue_task_fair
>       3.67 . 65%    +354.5%      16.69 . 11%  perf-profile.cycles-pp.pipe_read.__vfs_read.vfs_read.sys_read.entry_SYSCALL_64_fastpath
>       5.22 . 42%    +250.6%      18.30 . 21%  perf-profile.cycles-pp.sys_read.entry_SYSCALL_64_fastpath
>      39.95 . 28%     +53.0%      61.10 .  7%  perf-profile.cycles-pp.ttwu_do_activate.constprop.86.try_to_wake_up.default_wake_function.autoremove_wake_function.__wake_up_common
>       4.70 . 48%    +282.5%      17.96 . 19%  perf-profile.cycles-pp.vfs_read.sys_read.entry_SYSCALL_64_fastpath
>   16323355 . 13%    -100.0%       0.00 . -1%  latency_stats.avg.nfs_wait_on_request.nfs_updatepage.nfs_write_end.generic_perform_write.__generic_file_write_iter.generic_file_write_iter.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
>    1702613 . 26%    +288.1%    6607185 . 61%  latency_stats.avg.pipe_read.__vfs_read.vfs_read.SyS_read.entry_SYSCALL_64_fastpath
>      31.67 . 74%  +63999.7%      20298 .149%  latency_stats.avg.stop_two_cpus.migrate_swap.task_numa_migrate.numa_migrate_preferred.task_numa_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
>    3607008 .163%    -100.0%     384.25 . 14%  latency_stats.avg.wait_on_page_bit.__filemap_fdatawait_range.filemap_fdatawait_range.filemap_write_and_wait_range.nfs4_file_fsync.[nfsv4].vfs_fsync_range.vfs_fsync.nfs4_file_flush.[nfsv4].filp_close.do_dup2.SyS_dup2.entry_SYSCALL_64_fastpath
>    1042512 . 15%    +145.7%    2561649 . 35%  latency_stats.avg.wait_on_page_bit.__migration_entry_wait.migration_entry_wait.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
>     384229 .  2%     -76.9%      88748 . 24%  latency_stats.hits.call_rwsem_down_read_failed.do_exit.SyS_exit.entry_SYSCALL_64_fastpath

[snip]

WARNING: multiple messages have this Message-ID (diff)

From: "Huang\, Ying" <ying.huang@linux.intel.com>
To: Willy Tarreau <w@1wt.eu>
Cc: lkp@01.org, LKML <linux-kernel@vger.kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
	Al Viro <viro@zeniv.linux.org.uk>
Subject: Re: [lkp] [pipe] 759c01142a: -51.5% hackbench.throughput
Date: Fri, 05 Feb 2016 16:43:54 +0800	[thread overview]
Message-ID: <87d1sb8s39.fsf@yhuang-dev.intel.com> (raw)
In-Reply-To: <87h9hxm5rf.fsf@yhuang-dev.intel.com> (kernel test robot's message of "Fri, 29 Jan 2016 11:19:48 +0800")

kernel test robot <ying.huang@linux.intel.com> writes:

> FYI, we noticed the below changes on
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> commit 759c01142a5d0f364a462346168a56de28a80f52 ("pipe: limit the per-user amount of pages allocated in pipes")
>
>
> =========================================================================================
> compiler/cpufreq_governor/ipc/kconfig/mode/nr_threads/rootfs/tbox_group/testcase:
>   gcc-4.9/performance/pipe/x86_64-rhel/threads/1600%/debian-x86_64-2015-02-07.cgz/lkp-wsx02/hackbench
>
> commit: 
>   558041d8d21b48287224dd0e32cf19563c77607c
>   759c01142a5d0f364a462346168a56de28a80f52
>
> 558041d8d21b4828 759c01142a5d0f364a46234616 
> ---------------- -------------------------- 
>          %stddev     %change         %stddev
>              \          |                \  
>     219967 .  4%     -51.5%     106583 .  1%  hackbench.throughput
>  5.319e+08 . 11%     -19.7%  4.273e+08 .  0%  hackbench.time.involuntary_context_switches
>    2422668 .  9%     -36.4%    1540731 .  2%  hackbench.time.minor_page_faults
>       7127 . 11%      +9.4%       7800 .  0%  hackbench.time.percent_of_cpu_this_job_got
>      42924 .  9%     +11.9%      48049 .  1%  hackbench.time.system_time
>       1665 . 11%     -49.3%     844.05 .  1%  hackbench.time.user_time
>     534.25 . 55%    +298.5%       2129 . 67%  numa-meminfo.node3.AnonHugePages
>       8581 . 57%     -48.4%       4427 .  1%  uptime.idle
>    5705345 . 13%     +58.8%    9061490 .  5%  softirqs.RCU
>     279868 .  8%     +20.4%     336916 .  1%  softirqs.SCHED
>      26.75 . 23%    +245.8%      92.50 . 39%  vmstat.procs.b
>       9809 . 13%     -64.7%       3463 .  7%  vmstat.procs.r
>     284320 . 13%     -45.9%     153686 .  1%  vmstat.system.in
>     114.00 . 10%     -50.9%      56.00 .  0%  time.file_system_outputs
>  5.319e+08 . 11%     -19.7%  4.273e+08 .  0%  time.involuntary_context_switches
>    2422668 .  9%     -36.4%    1540731 .  2%  time.minor_page_faults
>       1665 . 11%     -49.3%     844.05 .  1%  time.user_time
>      89.86 . 10%      +8.9%      97.84 .  0%  turbostat.%Busy
>       2270 . 10%      +8.9%       2471 .  0%  turbostat.Avg_MHz
>       1.45 .  8%     -52.6%       0.69 .  9%  turbostat.CPU%c1
>       0.17 . 13%     -52.2%       0.08 . 34%  turbostat.CPU%c3
>       8.53 .116%     -83.7%       1.39 .  3%  turbostat.CPU%c6
>   63583716 . 16%     -98.1%    1183913 . 53%  numa-numastat.node0.local_node
>   63587769 . 16%     -98.1%    1189061 . 53%  numa-numastat.node0.numa_hit
>   69989955 .  6%     -96.8%    2239323 .  8%  numa-numastat.node1.local_node
>   69995282 .  6%     -96.8%    2244474 .  8%  numa-numastat.node1.numa_hit
>   56620071 . 22%     -97.3%    1520012 . 32%  numa-numastat.node2.local_node
>   56625301 . 22%     -97.3%    1522594 . 31%  numa-numastat.node2.numa_hit
>   63753303 . 22%     -96.1%    2508634 . 14%  numa-numastat.node3.local_node
>   63754612 . 22%     -96.1%    2511213 . 14%  numa-numastat.node3.numa_hit
>   32233121 . 15%     -97.4%     830659 . 60%  numa-vmstat.node0.numa_hit
>   32193958 . 15%     -97.5%     790185 . 63%  numa-vmstat.node0.numa_local
>   35020615 . 10%     -96.4%    1253289 .  8%  numa-vmstat.node1.numa_hit
>   34968563 . 10%     -96.6%    1201078 .  9%  numa-vmstat.node1.numa_local
>   30394713 . 14%     -97.2%     843134 . 31%  numa-vmstat.node2.numa_hit
>   30373607 . 14%     -97.3%     824676 . 31%  numa-vmstat.node2.numa_local
>   32334081 . 21%     -95.5%    1469250 . 16%  numa-vmstat.node3.numa_hit
>   32286373 . 21%     -95.6%    1420300 . 16%  numa-vmstat.node3.numa_local
>  1.868e+08 . 11%     -51.0%   91569694 . 13%  cpuidle.C1-NHM.time
>    1743049 . 10%     -34.6%    1139878 . 11%  cpuidle.C1-NHM.usage
>  1.918e+08 . 12%     -49.7%   96433830 . 11%  cpuidle.C1E-NHM.time
>    1243978 . 16%     -40.7%     738168 . 12%  cpuidle.C1E-NHM.usage
>   70404853 .  8%     -47.4%   37040591 . 19%  cpuidle.C3-NHM.time
>  4.723e+09 .108%     -81.6%  8.672e+08 .  3%  cpuidle.C6-NHM.time
>     311666 . 18%     -47.9%     162252 . 15%  cpuidle.C6-NHM.usage
>  2.456e+08 . 18%     -53.2%   1.15e+08 .  9%  cpuidle.POLL.time
>     260576 . 21%     -58.6%     107765 . 17%  cpuidle.POLL.usage
>     906770 .  8%     -19.4%     730530 .  5%  proc-vmstat.numa_hint_faults
>   2.54e+08 . 11%     -97.1%    7464740 .  4%  proc-vmstat.numa_hit
>  2.539e+08 . 11%     -97.1%    7449281 .  4%  proc-vmstat.numa_local
>     662291 . 10%     -33.4%     441016 .  5%  proc-vmstat.numa_pages_migrated
>    1203460 .  6%     -16.7%    1002984 .  5%  proc-vmstat.numa_pte_updates
>    5716982 . 17%     -97.2%     158482 . 40%  proc-vmstat.pgalloc_dma32
>  2.521e+08 . 11%     -96.3%    9338972 .  3%  proc-vmstat.pgalloc_normal
>    3627983 .  4%     -26.1%    2682860 .  1%  proc-vmstat.pgfault
>  2.576e+08 . 11%     -96.4%    9352801 .  3%  proc-vmstat.pgfree
>     662291 . 10%     -33.4%     441016 .  5%  proc-vmstat.pgmigrate_success
>      27845 . 11%     -79.8%       5624 .  1%  slabinfo.kmalloc-1024.active_objs
>     879.00 . 11%     -79.2%     183.00 .  1%  slabinfo.kmalloc-1024.active_slabs
>      28142 . 11%     -79.1%       5879 .  0%  slabinfo.kmalloc-1024.num_objs
>     879.00 . 11%     -79.2%     183.00 .  1%  slabinfo.kmalloc-1024.num_slabs
>      20288 .  0%    +129.9%      46639 .  0%  slabinfo.kmalloc-64.active_objs
>     317.00 .  0%    +131.4%     733.50 .  0%  slabinfo.kmalloc-64.active_slabs
>      20288 .  0%    +131.6%      46980 .  0%  slabinfo.kmalloc-64.num_objs
>     317.00 .  0%    +131.4%     733.50 .  0%  slabinfo.kmalloc-64.num_slabs
>     335.50 .  9%     -44.8%     185.25 .  4%  slabinfo.taskstats.active_objs
>     335.50 .  9%     -44.8%     185.25 .  4%  slabinfo.taskstats.num_objs
>      38.97 . 29%     +51.7%      59.11 .  7%  perf-profile.cycles-pp.__account_scheduler_latency.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate
>      39.95 . 28%     +53.0%      61.10 .  7%  perf-profile.cycles-pp.ttwu_do_activate.constprop.86.try_to_wake_up.default_wake_function.autoremove_wake_function.__wake_up_common

>From above perf profile output,

Most time is spent for __account_scheduler_latency, which is part of
latency-stats (latencytop) utility to collect system latency statistics.
It uses a global spinlock (latency_lock) to protect its data structure,
the latency_lock becomes the bottleneck of the test.  The increased
context switches after the commit increases the contention of
latency_lock, so cause the regression.  After disable latency-stats, the
performance improved instead of regressed.  The compare result is as
follow:

=========================================================================================
compiler/cpufreq_governor/debug-setup/ipc/kconfig/mode/nr_threads/rootfs/tbox_group/testcase:
  gcc-4.9/performance/no-latency_stats/pipe/x86_64-rhel/threads/1600%/debian-x86_64-2015-02-07.cgz/lkp-wsx02/hackbench

commit: 
  558041d8d21b48287224dd0e32cf19563c77607c
  759c01142a5d0f364a462346168a56de28a80f52

558041d8d21b4828 759c01142a5d0f364a46234616 
---------------- -------------------------- 
       fail:runs  %reproduction    fail:runs
           |             |             |    
          1:6          -17%            :4     last_state.is_incomplete_run
         %stddev     %change         %stddev
             \          |                \  
    251067 ±  4%     +53.2%     384708 ±  0%  hackbench.throughput
 1.468e+09 ±  6%     -51.6%  7.099e+08 ± 40%  hackbench.time.involuntary_context_switches
      7299 ±  4%     -49.0%       3725 ± 57%  hackbench.time.percent_of_cpu_this_job_got
     42840 ±  3%     -33.3%      28582 ± 38%  hackbench.time.system_time

I think the performance improvement comes from the decreased the cache
miss rate, because of the decreased working set, which comes from the
decreased pipe size, as follow,

7.478e+09 ±  4%     -54.8%   3.38e+09 ± 36%  perf-stat.LLC-load-misses
 2.087e+11 ±  3%      -3.6%  2.012e+11 ± 37%  perf-stat.LLC-loads
 1.605e+10 ±  3%     -76.5%  3.771e+09 ± 37%  perf-stat.LLC-store-misses
 6.751e+10 ±  2%     -11.7%  5.963e+10 ± 36%  perf-stat.LLC-stores
 
The store miss rate decreased from about 23.77% to 6.32%.

Best Regards,
Huang, Ying

>       3.89 . 60%    +334.7%      16.93 . 13%  perf-profile.cycles-pp.__vfs_read.vfs_read.sys_read.entry_SYSCALL_64_fastpath
>       0.70 . 30%   +1969.3%      14.49 . 11%  perf-profile.cycles-pp.__wake_up_common.__wake_up_sync_key.pipe_read.__vfs_read.vfs_read
>       0.71 . 30%   +1971.0%      14.65 . 11%  perf-profile.cycles-pp.__wake_up_sync_key.pipe_read.__vfs_read.vfs_read.sys_read
>       0.69 . 84%     -92.3%       0.05 . 63%  perf-profile.cycles-pp._raw_spin_lock.__schedule.schedule.exit_to_usermode_loop.syscall_return_slowpath
>      34.58 . 33%     +57.1%      54.33 .  7%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__account_scheduler_latency.enqueue_entity.enqueue_task_fair.activate_task
>      39.61 . 28%     +53.7%      60.89 .  7%  perf-profile.cycles-pp.activate_task.ttwu_do_activate.try_to_wake_up.default_wake_function.autoremove_wake_function
>       0.70 . 30%   +1967.9%      14.47 . 11%  perf-profile.cycles-pp.autoremove_wake_function.__wake_up_common.__wake_up_sync_key.pipe_read.__vfs_read
>       0.70 . 30%   +1965.0%      14.46 . 11%  perf-profile.cycles-pp.default_wake_function.autoremove_wake_function.__wake_up_common.__wake_up_sync_key.pipe_read
>      38.99 . 28%     +54.5%      60.25 .  7%  perf-profile.cycles-pp.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate.try_to_wake_up
>      39.45 . 28%     +53.9%      60.72 .  7%  perf-profile.cycles-pp.enqueue_task_fair.activate_task.ttwu_do_activate.try_to_wake_up.default_wake_function
>      57.97 . 20%     +40.4%      81.35 .  8%  perf-profile.cycles-pp.entry_SYSCALL_64_fastpath
>       1.31 . 60%     -92.0%       0.10 . 67%  perf-profile.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.__schedule.schedule.exit_to_usermode_loop
>       2.54 . 68%     -81.8%       0.46 . 60%  perf-profile.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.try_to_wake_up.default_wake_function.autoremove_wake_function
>      34.38 . 33%     +55.7%      53.54 .  6%  perf-profile.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.__account_scheduler_latency.enqueue_entity.enqueue_task_fair
>       3.67 . 65%    +354.5%      16.69 . 11%  perf-profile.cycles-pp.pipe_read.__vfs_read.vfs_read.sys_read.entry_SYSCALL_64_fastpath
>       5.22 . 42%    +250.6%      18.30 . 21%  perf-profile.cycles-pp.sys_read.entry_SYSCALL_64_fastpath
>      39.95 . 28%     +53.0%      61.10 .  7%  perf-profile.cycles-pp.ttwu_do_activate.constprop.86.try_to_wake_up.default_wake_function.autoremove_wake_function.__wake_up_common
>       4.70 . 48%    +282.5%      17.96 . 19%  perf-profile.cycles-pp.vfs_read.sys_read.entry_SYSCALL_64_fastpath
>   16323355 . 13%    -100.0%       0.00 . -1%  latency_stats.avg.nfs_wait_on_request.nfs_updatepage.nfs_write_end.generic_perform_write.__generic_file_write_iter.generic_file_write_iter.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
>    1702613 . 26%    +288.1%    6607185 . 61%  latency_stats.avg.pipe_read.__vfs_read.vfs_read.SyS_read.entry_SYSCALL_64_fastpath
>      31.67 . 74%  +63999.7%      20298 .149%  latency_stats.avg.stop_two_cpus.migrate_swap.task_numa_migrate.numa_migrate_preferred.task_numa_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
>    3607008 .163%    -100.0%     384.25 . 14%  latency_stats.avg.wait_on_page_bit.__filemap_fdatawait_range.filemap_fdatawait_range.filemap_write_and_wait_range.nfs4_file_fsync.[nfsv4].vfs_fsync_range.vfs_fsync.nfs4_file_flush.[nfsv4].filp_close.do_dup2.SyS_dup2.entry_SYSCALL_64_fastpath
>    1042512 . 15%    +145.7%    2561649 . 35%  latency_stats.avg.wait_on_page_bit.__migration_entry_wait.migration_entry_wait.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
>     384229 .  2%     -76.9%      88748 . 24%  latency_stats.hits.call_rwsem_down_read_failed.do_exit.SyS_exit.entry_SYSCALL_64_fastpath

[snip]

next prev parent reply	other threads:[~2016-02-05  8:43 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-01-29  3:19 [pipe] 759c01142a: -51.5% hackbench.throughput kernel test robot
2016-01-29  3:19 ` [lkp] " kernel test robot
2016-02-05  8:43 ` Huang, Ying [this message]
2016-02-05  8:43   ` Huang, Ying

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87d1sb8s39.fsf@yhuang-dev.intel.com \
    --to=ying.huang@linux.intel.com \
    --cc=lkp@lists.01.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.