Re: hackbench regression due to commit 9dfc6e68bfe6e

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: hackbench regression due to commit 9dfc6e68bfe6e
       [not found]                     ` <alpine.DEB.2.00.1004061552500.19151@router.home>
@ 2010-04-06 22:10                       ` Eric Dumazet
  2010-04-07  2:34                         ` Zhang, Yanmin
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2010-04-06 22:10 UTC (permalink / raw)
  To: Christoph Lameter, netdev
  Cc: Zhang, Yanmin, Tejun Heo, Pekka Enberg, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

Le mardi 06 avril 2010 à 15:55 -0500, Christoph Lameter a écrit :
> We cannot reproduce the issue here. Our tests here (dual quad dell) show a
> performance increase in hackbench instead.
> 
> Linux 2.6.33.2 #2 SMP Mon Apr 5 11:30:56 CDT 2010 x86_64 GNU/Linux
> ./hackbench 100 process 200000
> Running with 100*40 (== 4000) tasks.
> Time: 3102.142
> ./hackbench 100 process 20000
> Running with 100*40 (== 4000) tasks.
> Time: 308.731
> ./hackbench 100 process 20000
> Running with 100*40 (== 4000) tasks.
> Time: 311.591
> ./hackbench 100 process 20000
> Running with 100*40 (== 4000) tasks.
> Time: 310.200
> ./hackbench 10 process 20000
> Running with 10*40 (== 400) tasks.
> Time: 38.048
> ./hackbench 10 process 20000
> Running with 10*40 (== 400) tasks.
> Time: 44.711
> ./hackbench 10 process 20000
> Running with 10*40 (== 400) tasks.
> Time: 39.407
> ./hackbench 1 process 20000
> Running with 1*40 (== 40) tasks.
> Time: 9.411
> ./hackbench 1 process 20000
> Running with 1*40 (== 40) tasks.
> Time: 8.765
> ./hackbench 1 process 20000
> Running with 1*40 (== 40) tasks.
> Time: 8.822
> 
> Linux 2.6.34-rc3 #1 SMP Tue Apr 6 13:30:34 CDT 2010 x86_64 GNU/Linux
> ./hackbench 100 process 200000
> Running with 100*40 (== 4000) tasks.
> Time: 3003.578
> ./hackbench 100 process 20000
> Running with 100*40 (== 4000) tasks.
> Time: 300.289
> ./hackbench 100 process 20000
> Running with 100*40 (== 4000) tasks.
> Time: 301.462
> ./hackbench 100 process 20000
> Running with 100*40 (== 4000) tasks.
> Time: 301.173
> ./hackbench 10 process 20000
> Running with 10*40 (== 400) tasks.
> Time: 41.191
> ./hackbench 10 process 20000
> Running with 10*40 (== 400) tasks.
> Time: 41.964
> ./hackbench 10 process 20000
> Running with 10*40 (== 400) tasks.
> Time: 41.470
> ./hackbench 1 process 20000
> Running with 1*40 (== 40) tasks.
> Time: 8.829
> ./hackbench 1 process 20000
> Running with 1*40 (== 40) tasks.
> Time: 9.166
> ./hackbench 1 process 20000
> Running with 1*40 (== 40) tasks.
> Time: 8.681
> 
> 


Well, your config might be very different... and hackbench results can
vary by 10% on same machine, same kernel.

This is not a reliable bench, because af_unix is not prepared to get
such a lazy workload.

We really should warn people about this.



# hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 12.922
# hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 12.696
# hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 13.060
# hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 14.108
# hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 13.165
# hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 13.310
# hackbench 25 process 3000 
Running with 25*40 (== 1000) tasks.
Time: 12.530


booting with slub_min_order=3 do change hackbench results for example ;)

All writers can compete on spinlock for a target UNIX socket, we spend _lot_ of time spinning.

If we _really_ want to speedup hackbench, we would have to change unix_state_lock()
to use a non spinning locking primitive (aka lock_sock()), and slowdown normal path.


# perf record -f hackbench 25 process 3000 
Running with 25*40 (== 1000) tasks.
Time: 13.330
[ perf record: Woken up 289 times to write data ]
[ perf record: Captured and wrote 54.312 MB perf.data (~2372928 samples) ]
# perf report
# Samples: 2370135
#
# Overhead    Command                 Shared Object  Symbol
# ........  .........  ............................  ......
#
     9.68%  hackbench  [kernel]                      [k] do_raw_spin_lock
     6.50%  hackbench  [kernel]                      [k] schedule
     4.38%  hackbench  [kernel]                      [k] __kmalloc_track_caller
     3.95%  hackbench  [kernel]                      [k] copy_to_user
     3.86%  hackbench  [kernel]                      [k] __alloc_skb
     3.77%  hackbench  [kernel]                      [k] unix_stream_recvmsg
     3.12%  hackbench  [kernel]                      [k] sock_alloc_send_pskb
     2.75%  hackbench  [vdso]                        [.] 0x000000ffffe425
     2.28%  hackbench  [kernel]                      [k] sysenter_past_esp
     2.03%  hackbench  [kernel]                      [k] __mutex_lock_common
     2.00%  hackbench  [kernel]                      [k] kfree
     2.00%  hackbench  [kernel]                      [k] delay_tsc
     1.75%  hackbench  [kernel]                      [k] update_curr
     1.70%  hackbench  [kernel]                      [k] kmem_cache_alloc
     1.69%  hackbench  [kernel]                      [k] do_raw_spin_unlock
     1.60%  hackbench  [kernel]                      [k] unix_stream_sendmsg
     1.54%  hackbench  [kernel]                      [k] sched_clock_local
     1.46%  hackbench  [kernel]                      [k] __slab_free
     1.37%  hackbench  [kernel]                      [k] do_raw_read_lock
     1.34%  hackbench  [kernel]                      [k] __switch_to
     1.24%  hackbench  [kernel]                      [k] select_task_rq_fair
     1.23%  hackbench  [kernel]                      [k] sock_wfree
     1.21%  hackbench  [kernel]                      [k] _raw_spin_unlock_irqrestore
     1.19%  hackbench  [kernel]                      [k] __mutex_unlock_slowpath
     1.05%  hackbench  [kernel]                      [k] trace_hardirqs_off
     0.99%  hackbench  [kernel]                      [k] __might_sleep
     0.93%  hackbench  [kernel]                      [k] do_raw_read_unlock
     0.93%  hackbench  [kernel]                      [k] _raw_spin_lock
     0.91%  hackbench  [kernel]                      [k] try_to_wake_up
     0.81%  hackbench  [kernel]                      [k] sched_clock
     0.80%  hackbench  [kernel]                      [k] trace_hardirqs_on



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-06 22:10                       ` hackbench regression due to commit 9dfc6e68bfe6e Eric Dumazet
@ 2010-04-07  2:34                         ` Zhang, Yanmin
  2010-04-07  6:39                           ` Eric Dumazet
                                             ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Zhang, Yanmin @ 2010-04-07  2:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, netdev, Tejun Heo, Pekka Enberg, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

On Wed, 2010-04-07 at 00:10 +0200, Eric Dumazet wrote:
> Le mardi 06 avril 2010 à 15:55 -0500, Christoph Lameter a écrit :
> > We cannot reproduce the issue here. Our tests here (dual quad dell) show a
> > performance increase in hackbench instead.
> > 
> > Linux 2.6.33.2 #2 SMP Mon Apr 5 11:30:56 CDT 2010 x86_64 GNU/Linux
> > ./hackbench 100 process 200000
> > Running with 100*40 (== 4000) tasks.
> > Time: 3102.142
> > ./hackbench 100 process 20000
> > Running with 100*40 (== 4000) tasks.
> > Time: 308.731
> > ./hackbench 100 process 20000
> > Running with 100*40 (== 4000) tasks.
> > Time: 311.591
> > ./hackbench 100 process 20000
> > Running with 100*40 (== 4000) tasks.
> > Time: 310.200
> > ./hackbench 10 process 20000
> > Running with 10*40 (== 400) tasks.
> > Time: 38.048
> > ./hackbench 10 process 20000
> > Running with 10*40 (== 400) tasks.
> > Time: 44.711
> > ./hackbench 10 process 20000
> > Running with 10*40 (== 400) tasks.
> > Time: 39.407
> > ./hackbench 1 process 20000
> > Running with 1*40 (== 40) tasks.
> > Time: 9.411
> > ./hackbench 1 process 20000
> > Running with 1*40 (== 40) tasks.
> > Time: 8.765
> > ./hackbench 1 process 20000
> > Running with 1*40 (== 40) tasks.
> > Time: 8.822
> > 
> > Linux 2.6.34-rc3 #1 SMP Tue Apr 6 13:30:34 CDT 2010 x86_64 GNU/Linux
> > ./hackbench 100 process 200000
> > Running with 100*40 (== 4000) tasks.
> > Time: 3003.578
> > ./hackbench 100 process 20000
> > Running with 100*40 (== 4000) tasks.
> > Time: 300.289
> > ./hackbench 100 process 20000
> > Running with 100*40 (== 4000) tasks.
> > Time: 301.462
> > ./hackbench 100 process 20000
> > Running with 100*40 (== 4000) tasks.
> > Time: 301.173
> > ./hackbench 10 process 20000
> > Running with 10*40 (== 400) tasks.
> > Time: 41.191
> > ./hackbench 10 process 20000
> > Running with 10*40 (== 400) tasks.
> > Time: 41.964
> > ./hackbench 10 process 20000
> > Running with 10*40 (== 400) tasks.
> > Time: 41.470
> > ./hackbench 1 process 20000
> > Running with 1*40 (== 40) tasks.
> > Time: 8.829
> > ./hackbench 1 process 20000
> > Running with 1*40 (== 40) tasks.
> > Time: 9.166
> > ./hackbench 1 process 20000
> > Running with 1*40 (== 40) tasks.
> > Time: 8.681
> > 
> > 
> 
> 
> Well, your config might be very different... and hackbench results can
> vary by 10% on same machine, same kernel.
> 
> This is not a reliable bench, because af_unix is not prepared to get
> such a lazy workload.
Thanks. I also found that. Normally, my script runs hackbench for 3 times and
gets an average value. To decrease the variation, I use 
'./hackbench 100 process 200000' to get a more stable result.


> 
> We really should warn people about this.
> 
> 
> 
> # hackbench 25 process 3000
> Running with 25*40 (== 1000) tasks.
> Time: 12.922
> # hackbench 25 process 3000
> Running with 25*40 (== 1000) tasks.
> Time: 12.696
> # hackbench 25 process 3000
> Running with 25*40 (== 1000) tasks.
> Time: 13.060
> # hackbench 25 process 3000
> Running with 25*40 (== 1000) tasks.
> Time: 14.108
> # hackbench 25 process 3000
> Running with 25*40 (== 1000) tasks.
> Time: 13.165
> # hackbench 25 process 3000
> Running with 25*40 (== 1000) tasks.
> Time: 13.310
> # hackbench 25 process 3000 
> Running with 25*40 (== 1000) tasks.
> Time: 12.530
> 
> 
> booting with slub_min_order=3 do change hackbench results for example ;)
By default, slub_min_order=3 on my Nehalem machines. I also tried different
larger slub_min_order and didn't find help.


> 
> All writers can compete on spinlock for a target UNIX socket, we spend _lot_ of time spinning.
> 
> If we _really_ want to speedup hackbench, we would have to change unix_state_lock()
> to use a non spinning locking primitive (aka lock_sock()), and slowdown normal path.
> 
> 
> # perf record -f hackbench 25 process 3000 
> Running with 25*40 (== 1000) tasks.
> Time: 13.330
> [ perf record: Woken up 289 times to write data ]
> [ perf record: Captured and wrote 54.312 MB perf.data (~2372928 samples) ]
> # perf report
> # Samples: 2370135
> #
> # Overhead    Command                 Shared Object  Symbol
> # ........  .........  ............................  ......
> #
>      9.68%  hackbench  [kernel]                      [k] do_raw_spin_lock
>      6.50%  hackbench  [kernel]                      [k] schedule
>      4.38%  hackbench  [kernel]                      [k] __kmalloc_track_caller
>      3.95%  hackbench  [kernel]                      [k] copy_to_user
>      3.86%  hackbench  [kernel]                      [k] __alloc_skb
>      3.77%  hackbench  [kernel]                      [k] unix_stream_recvmsg
>      3.12%  hackbench  [kernel]                      [k] sock_alloc_send_pskb
>      2.75%  hackbench  [vdso]                        [.] 0x000000ffffe425
>      2.28%  hackbench  [kernel]                      [k] sysenter_past_esp
>      2.03%  hackbench  [kernel]                      [k] __mutex_lock_common
>      2.00%  hackbench  [kernel]                      [k] kfree
>      2.00%  hackbench  [kernel]                      [k] delay_tsc
>      1.75%  hackbench  [kernel]                      [k] update_curr
>      1.70%  hackbench  [kernel]                      [k] kmem_cache_alloc
>      1.69%  hackbench  [kernel]                      [k] do_raw_spin_unlock
>      1.60%  hackbench  [kernel]                      [k] unix_stream_sendmsg
>      1.54%  hackbench  [kernel]                      [k] sched_clock_local
>      1.46%  hackbench  [kernel]                      [k] __slab_free
>      1.37%  hackbench  [kernel]                      [k] do_raw_read_lock
>      1.34%  hackbench  [kernel]                      [k] __switch_to
>      1.24%  hackbench  [kernel]                      [k] select_task_rq_fair
>      1.23%  hackbench  [kernel]                      [k] sock_wfree
>      1.21%  hackbench  [kernel]                      [k] _raw_spin_unlock_irqrestore
>      1.19%  hackbench  [kernel]                      [k] __mutex_unlock_slowpath
>      1.05%  hackbench  [kernel]                      [k] trace_hardirqs_off
>      0.99%  hackbench  [kernel]                      [k] __might_sleep
>      0.93%  hackbench  [kernel]                      [k] do_raw_read_unlock
>      0.93%  hackbench  [kernel]                      [k] _raw_spin_lock
>      0.91%  hackbench  [kernel]                      [k] try_to_wake_up
>      0.81%  hackbench  [kernel]                      [k] sched_clock
>      0.80%  hackbench  [kernel]                      [k] trace_hardirqs_on

I collected retired instruction, dtlb miss and LLC miss.
Below is data of LLC miss.

Kernel 2.6.33:
# Samples: 11639436896 LLC-load-misses
#
# Overhead          Command                                           Shared Object  Symbol
# ........  ...............  ......................................................  ......
#
    20.94%        hackbench  [kernel.kallsyms]                                       [k] copy_user_generic_string
    14.56%        hackbench  [kernel.kallsyms]                                       [k] unix_stream_recvmsg
    12.88%        hackbench  [kernel.kallsyms]                                       [k] kfree
     7.37%        hackbench  [kernel.kallsyms]                                       [k] kmem_cache_free
     7.18%        hackbench  [kernel.kallsyms]                                       [k] kmem_cache_alloc_node
     6.78%        hackbench  [kernel.kallsyms]                                       [k] kfree_skb
     6.27%        hackbench  [kernel.kallsyms]                                       [k] __kmalloc_node_track_caller
     2.73%        hackbench  [kernel.kallsyms]                                       [k] __slab_free
     2.21%        hackbench  [kernel.kallsyms]                                       [k] get_partial_node
     2.01%        hackbench  [kernel.kallsyms]                                       [k] _raw_spin_lock
     1.59%        hackbench  [kernel.kallsyms]                                       [k] schedule
     1.27%        hackbench  hackbench                                               [.] receiver
     0.99%        hackbench  libpthread-2.9.so                                       [.] __read
     0.87%        hackbench  [kernel.kallsyms]                                       [k] unix_stream_sendmsg




Kernel 2.6.34-rc3:
# Samples: 13079611308 LLC-load-misses
#
# Overhead          Command                                                         Shared Object  Symbol
# ........  ...............  ....................................................................  ......
#
    18.55%        hackbench  [kernel.kallsyms]                                                     [k] copy_user_generic_str
ing
    13.19%        hackbench  [kernel.kallsyms]                                                     [k] unix_stream_recvmsg
    11.62%        hackbench  [kernel.kallsyms]                                                     [k] kfree
     8.54%        hackbench  [kernel.kallsyms]                                                     [k] kmem_cache_free
     7.88%        hackbench  [kernel.kallsyms]                                                     [k] __kmalloc_node_track_
caller
     6.54%        hackbench  [kernel.kallsyms]                                                     [k] kmem_cache_alloc_node
     5.94%        hackbench  [kernel.kallsyms]                                                     [k] kfree_skb
     3.48%        hackbench  [kernel.kallsyms]                                                     [k] __slab_free
     2.15%        hackbench  [kernel.kallsyms]                                                     [k] _raw_spin_lock
     1.83%        hackbench  [kernel.kallsyms]                                                     [k] schedule
     1.82%        hackbench  [kernel.kallsyms]                                                     [k] get_partial_node
     1.59%        hackbench  hackbench                                                             [.] receiver
     1.37%        hackbench  libpthread-2.9.so                                                     [.] __read



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-07  2:34                         ` Zhang, Yanmin
@ 2010-04-07  6:39                           ` Eric Dumazet
  2010-04-07  9:07                             ` Zhang, Yanmin
  2010-04-07 10:47                           ` Pekka Enberg
                                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2010-04-07  6:39 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Christoph Lameter, netdev, Tejun Heo, Pekka Enberg, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

Le mercredi 07 avril 2010 à 10:34 +0800, Zhang, Yanmin a écrit :

> I collected retired instruction, dtlb miss and LLC miss.
> Below is data of LLC miss.
> 
> Kernel 2.6.33:
> # Samples: 11639436896 LLC-load-misses
> #
> # Overhead          Command                                           Shared Object  Symbol
> # ........  ...............  ......................................................  ......
> #
>     20.94%        hackbench  [kernel.kallsyms]                                       [k] copy_user_generic_string
>     14.56%        hackbench  [kernel.kallsyms]                                       [k] unix_stream_recvmsg
>     12.88%        hackbench  [kernel.kallsyms]                                       [k] kfree
>      7.37%        hackbench  [kernel.kallsyms]                                       [k] kmem_cache_free
>      7.18%        hackbench  [kernel.kallsyms]                                       [k] kmem_cache_alloc_node
>      6.78%        hackbench  [kernel.kallsyms]                                       [k] kfree_skb
>      6.27%        hackbench  [kernel.kallsyms]                                       [k] __kmalloc_node_track_caller
>      2.73%        hackbench  [kernel.kallsyms]                                       [k] __slab_free
>      2.21%        hackbench  [kernel.kallsyms]                                       [k] get_partial_node
>      2.01%        hackbench  [kernel.kallsyms]                                       [k] _raw_spin_lock
>      1.59%        hackbench  [kernel.kallsyms]                                       [k] schedule
>      1.27%        hackbench  hackbench                                               [.] receiver
>      0.99%        hackbench  libpthread-2.9.so                                       [.] __read
>      0.87%        hackbench  [kernel.kallsyms]                                       [k] unix_stream_sendmsg
> 
> 
> 
> 
> Kernel 2.6.34-rc3:
> # Samples: 13079611308 LLC-load-misses
> #
> # Overhead          Command                                                         Shared Object  Symbol
> # ........  ...............  ....................................................................  ......
> #
>     18.55%        hackbench  [kernel.kallsyms]                                                     [k] copy_user_generic_str
> ing
>     13.19%        hackbench  [kernel.kallsyms]                                                     [k] unix_stream_recvmsg
>     11.62%        hackbench  [kernel.kallsyms]                                                     [k] kfree
>      8.54%        hackbench  [kernel.kallsyms]                                                     [k] kmem_cache_free
>      7.88%        hackbench  [kernel.kallsyms]                                                     [k] __kmalloc_node_track_
> caller
>      6.54%        hackbench  [kernel.kallsyms]                                                     [k] kmem_cache_alloc_node
>      5.94%        hackbench  [kernel.kallsyms]                                                     [k] kfree_skb
>      3.48%        hackbench  [kernel.kallsyms]                                                     [k] __slab_free
>      2.15%        hackbench  [kernel.kallsyms]                                                     [k] _raw_spin_lock
>      1.83%        hackbench  [kernel.kallsyms]                                                     [k] schedule
>      1.82%        hackbench  [kernel.kallsyms]                                                     [k] get_partial_node
>      1.59%        hackbench  hackbench                                                             [.] receiver
>      1.37%        hackbench  libpthread-2.9.so                                                     [.] __read
> 
> 

Please check values of /proc/sys/net/core/rmem_default
and /proc/sys/net/core/wmem_default on your machines.

Their values can also change hackbench results, because increasing
wmem_default allows af_unix senders to consume much more skbs and stress
slab allocators (__slab_free), way beyond slub_min_order can tune them.

When 2000 senders are running (and 2000 receivers), we might consume
something like 2000 * 100.000 bytes of kernel memory for skbs. TLB
trashing is expected, because all these skbs can span many 2MB pages.
Maybe some node imbalance happens too.



You could try to boot your machine with less ram per node and check :

# cat /proc/buddyinfo 
Node 0, zone      DMA      2      1      2      2      1      1      1      0      1      1      3 
Node 0, zone    DMA32    219    298    143    584    145     57     44     41     31     26    517 
Node 1, zone    DMA32      4      1     17      1      0      3      2      2      2      2    123 
Node 1, zone   Normal    126    169     83      8      7      5     59     59     49     28    459 


One experiment on your Nehalem machine would be to change hackbench so
that each group (20 senders/ 20 receivers) run on a particular NUMA
node.

x86info -c ->

CPU #1
EFamily: 0 EModel: 1 Family: 6 Model: 26 Stepping: 5
CPU Model: Core i7 (Nehalem)
Processor name string: Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
Type: 0 (Original OEM)	Brand: 0 (Unsupported)
Number of cores per physical package=8
Number of logical processors per socket=16
Number of logical processors per core=2
APIC ID: 0x10	Package: 0  Core: 1   SMT ID 0
Cache info
 L1 Instruction cache: 32KB, 4-way associative. 64 byte line size.
 L1 Data cache: 32KB, 8-way associative. 64 byte line size.
 L2 (MLC): 256KB, 8-way associative. 64 byte line size.
TLB info
 Data TLB: 4KB pages, 4-way associative, 64 entries
 64 byte prefetching.
Found unknown cache descriptors: 55 5a b2 ca e4 



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-07  6:39                           ` Eric Dumazet
@ 2010-04-07  9:07                             ` Zhang, Yanmin
  2010-04-07  9:20                               ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Zhang, Yanmin @ 2010-04-07  9:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, netdev, Tejun Heo, Pekka Enberg, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

On Wed, 2010-04-07 at 08:39 +0200, Eric Dumazet wrote:
> Le mercredi 07 avril 2010 à 10:34 +0800, Zhang, Yanmin a écrit :
> 
> > I collected retired instruction, dtlb miss and LLC miss.
> > Below is data of LLC miss.
> > 
> > Kernel 2.6.33:
> > # Samples: 11639436896 LLC-load-misses
> > #
> > # Overhead          Command                                           Shared Object  Symbol
> > # ........  ...............  ......................................................  ......
> > #
> >     20.94%        hackbench  [kernel.kallsyms]                                       [k] copy_user_generic_string
> >     14.56%        hackbench  [kernel.kallsyms]                                       [k] unix_stream_recvmsg
> >     12.88%        hackbench  [kernel.kallsyms]                                       [k] kfree
> >      7.37%        hackbench  [kernel.kallsyms]                                       [k] kmem_cache_free
> >      7.18%        hackbench  [kernel.kallsyms]                                       [k] kmem_cache_alloc_node
> >      6.78%        hackbench  [kernel.kallsyms]                                       [k] kfree_skb
> >      6.27%        hackbench  [kernel.kallsyms]                                       [k] __kmalloc_node_track_caller
> >      2.73%        hackbench  [kernel.kallsyms]                                       [k] __slab_free
> >      2.21%        hackbench  [kernel.kallsyms]                                       [k] get_partial_node
> >      2.01%        hackbench  [kernel.kallsyms]                                       [k] _raw_spin_lock
> >      1.59%        hackbench  [kernel.kallsyms]                                       [k] schedule
> >      1.27%        hackbench  hackbench                                               [.] receiver
> >      0.99%        hackbench  libpthread-2.9.so                                       [.] __read
> >      0.87%        hackbench  [kernel.kallsyms]                                       [k] unix_stream_sendmsg
> > 
> > 
> > 
> > 
> > Kernel 2.6.34-rc3:
> > # Samples: 13079611308 LLC-load-misses
> > #
> > # Overhead          Command                                                         Shared Object  Symbol
> > # ........  ...............  ....................................................................  ......
> > #
> >     18.55%        hackbench  [kernel.kallsyms]                                                     [k] copy_user_generic_str
> > ing
> >     13.19%        hackbench  [kernel.kallsyms]                                                     [k] unix_stream_recvmsg
> >     11.62%        hackbench  [kernel.kallsyms]                                                     [k] kfree
> >      8.54%        hackbench  [kernel.kallsyms]                                                     [k] kmem_cache_free
> >      7.88%        hackbench  [kernel.kallsyms]                                                     [k] __kmalloc_node_track_
> > caller
> >      6.54%        hackbench  [kernel.kallsyms]                                                     [k] kmem_cache_alloc_node
> >      5.94%        hackbench  [kernel.kallsyms]                                                     [k] kfree_skb
> >      3.48%        hackbench  [kernel.kallsyms]                                                     [k] __slab_free
> >      2.15%        hackbench  [kernel.kallsyms]                                                     [k] _raw_spin_lock
> >      1.83%        hackbench  [kernel.kallsyms]                                                     [k] schedule
> >      1.82%        hackbench  [kernel.kallsyms]                                                     [k] get_partial_node
> >      1.59%        hackbench  hackbench                                                             [.] receiver
> >      1.37%        hackbench  libpthread-2.9.so                                                     [.] __read
> > 
> > 
> 
> Please check values of /proc/sys/net/core/rmem_default
> and /proc/sys/net/core/wmem_default on your machines.
> 
> Their values can also change hackbench results, because increasing
> wmem_default allows af_unix senders to consume much more skbs and stress
> slab allocators (__slab_free), way beyond slub_min_order can tune them.
> 
> When 2000 senders are running (and 2000 receivers), we might consume
> something like 2000 * 100.000 bytes of kernel memory for skbs. TLB
> trashing is expected, because all these skbs can span many 2MB pages.
> Maybe some node imbalance happens too.
It's a good pointer. rmem_default and wmem_default are about 116k on my machine.
I changed them to 52K and it seems there is no improvement.

> 
> 
> 
> You could try to boot your machine with less ram per node and check :
> 
> # cat /proc/buddyinfo 
> Node 0, zone      DMA      2      1      2      2      1      1      1      0      1      1      3 
> Node 0, zone    DMA32    219    298    143    584    145     57     44     41     31     26    517 
> Node 1, zone    DMA32      4      1     17      1      0      3      2      2      2      2    123 
> Node 1, zone   Normal    126    169     83      8      7      5     59     59     49     28    459 
> 
> 
> One experiment on your Nehalem machine would be to change hackbench so
> that each group (20 senders/ 20 receivers) run on a particular NUMA
> node.
I expect process scheduler to work well in scheduling different groups
to different nodes.

I suspected dynamic percpu data didn't take care of NUMA, but kernel dump shows
it does take care of NUMA.

> 
> x86info -c ->
> 
> CPU #1
> EFamily: 0 EModel: 1 Family: 6 Model: 26 Stepping: 5
> CPU Model: Core i7 (Nehalem)
> Processor name string: Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
> Type: 0 (Original OEM)	Brand: 0 (Unsupported)
> Number of cores per physical package=8
> Number of logical processors per socket=16
> Number of logical processors per core=2
> APIC ID: 0x10	Package: 0  Core: 1   SMT ID 0
> Cache info
>  L1 Instruction cache: 32KB, 4-way associative. 64 byte line size.
>  L1 Data cache: 32KB, 8-way associative. 64 byte line size.
>  L2 (MLC): 256KB, 8-way associative. 64 byte line size.
> TLB info
>  Data TLB: 4KB pages, 4-way associative, 64 entries
>  64 byte prefetching.
> Found unknown cache descriptors: 55 5a b2 ca e4 
> 
> 



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-07  9:07                             ` Zhang, Yanmin
@ 2010-04-07  9:20                               ` Eric Dumazet
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2010-04-07  9:20 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Christoph Lameter, netdev, Tejun Heo, Pekka Enberg, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

Le mercredi 07 avril 2010 à 17:07 +0800, Zhang, Yanmin a écrit :
> > 
> > One experiment on your Nehalem machine would be to change hackbench so
> > that each group (20 senders/ 20 receivers) run on a particular NUMA
> > node.
> I expect process scheduler to work well in scheduling different groups
> to different nodes.
> 
> I suspected dynamic percpu data didn't take care of NUMA, but kernel dump shows
> it does take care of NUMA.
> 

hackbench allocates all unix sockets on one single node, then
forks/spans its children.

Thats huge node imbalance.

You can see this with lsof on a running hackbench :


# lsof -p 14802
COMMAND     PID USER   FD   TYPE             DEVICE    SIZE     NODE NAME
hackbench 14802 root  cwd    DIR              104,7    4096 12927240 /data/src/linux-2.6
hackbench 14802 root  rtd    DIR              104,2    4096        2 /
hackbench 14802 root  txt    REG              104,2   17524   697317 /usr/bin/hackbench
hackbench 14802 root  mem    REG              104,2  112212   558042 /lib/ld-2.3.4.so
hackbench 14802 root  mem    REG              104,2 1547588   558043 /lib/tls/libc-2.3.4.so
hackbench 14802 root  mem    REG              104,2  107928   557058 /lib/tls/libpthread-2.3.4.so
hackbench 14802 root  mem    REG                0,0                0 [heap] (stat: No such file or directory)
hackbench 14802 root    0u   CHR              136,0                3 /dev/pts/0
hackbench 14802 root    1u   CHR              136,0                3 /dev/pts/0
hackbench 14802 root    2u   CHR              136,0                3 /dev/pts/0
hackbench 14802 root    3u  unix 0xffff8800ac0da100            28939 socket
hackbench 14802 root    4u  unix 0xffff8800ac0da400            28940 socket
hackbench 14802 root    5u  unix 0xffff8800ac0da700            28941 socket
hackbench 14802 root    6u  unix 0xffff8800ac0daa00            28942 socket
hackbench 14802 root    8u  unix 0xffff8800aeac1800            28984 socket
hackbench 14802 root    9u  unix 0xffff8800aeac1e00            28986 socket
hackbench 14802 root   10u  unix 0xffff8800aeac2400            28988 socket
hackbench 14802 root   11u  unix 0xffff8800aeac2a00            28990 socket
hackbench 14802 root   12u  unix 0xffff8800aeac3000            28992 socket
hackbench 14802 root   13u  unix 0xffff8800aeac3600            28994 socket
hackbench 14802 root   14u  unix 0xffff8800aeac3c00            28996 socket
hackbench 14802 root   15u  unix 0xffff8800aeac4200            28998 socket
hackbench 14802 root   16u  unix 0xffff8800aeac4800            29000 socket
hackbench 14802 root   17u  unix 0xffff8800aeac4e00            29002 socket
hackbench 14802 root   18u  unix 0xffff8800aeac5400            29004 socket
hackbench 14802 root   19u  unix 0xffff8800aeac5a00            29006 socket
hackbench 14802 root   20u  unix 0xffff8800aeac6000            29008 socket
hackbench 14802 root   21u  unix 0xffff8800aeac6600            29010 socket
hackbench 14802 root   22u  unix 0xffff8800aeac6c00            29012 socket
hackbench 14802 root   23u  unix 0xffff8800aeac7200            29014 socket
hackbench 14802 root   24u  unix 0xffff8800aeac0f00            29016 socket
hackbench 14802 root   25u  unix 0xffff8800aeac0900            29018 socket
hackbench 14802 root   26u  unix 0xffff8800aeac7b00            29020 socket
hackbench 14802 root   27u  unix 0xffff8800aeac7500            29022 socket

All sockets structures (where all _hot_ locks reside) are on a single node.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-07  2:34                         ` Zhang, Yanmin
  2010-04-07  6:39                           ` Eric Dumazet
@ 2010-04-07 10:47                           ` Pekka Enberg
  2010-04-07 16:30                           ` Christoph Lameter
  2010-04-07 16:43                           ` Christoph Lameter
  3 siblings, 0 replies; 28+ messages in thread
From: Pekka Enberg @ 2010-04-07 10:47 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Eric Dumazet, Christoph Lameter, netdev, Tejun Heo, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton, mingo

Zhang, Yanmin kirjoitti:
> Kernel 2.6.34-rc3:
> # Samples: 13079611308 LLC-load-misses
> #
> # Overhead          Command                                                         Shared Object  Symbol
> # ........  ...............  ....................................................................  ......
> #
>     18.55%        hackbench  [kernel.kallsyms]                                                     [k] copy_user_generic_str
> ing
>     13.19%        hackbench  [kernel.kallsyms]                                                     [k] unix_stream_recvmsg
>     11.62%        hackbench  [kernel.kallsyms]                                                     [k] kfree
>      8.54%        hackbench  [kernel.kallsyms]                                                     [k] kmem_cache_free
>      7.88%        hackbench  [kernel.kallsyms]                                                     [k] __kmalloc_node_track_
> caller
>      6.54%        hackbench  [kernel.kallsyms]                                                     [k] kmem_cache_alloc_node
>      5.94%        hackbench  [kernel.kallsyms]                                                     [k] kfree_skb
>      3.48%        hackbench  [kernel.kallsyms]                                                     [k] __slab_free
>      2.15%        hackbench  [kernel.kallsyms]                                                     [k] _raw_spin_lock
>      1.83%        hackbench  [kernel.kallsyms]                                                     [k] schedule
>      1.82%        hackbench  [kernel.kallsyms]                                                     [k] get_partial_node
>      1.59%        hackbench  hackbench                                                             [.] receiver
>      1.37%        hackbench  libpthread-2.9.so                                                     [.] __read

Btw, you might want to try out "perf record -g" and "perf report 
--callchain fractal,5" to get a better view of where we're spending 
time. Perhaps you can spot the difference with that more easily.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-07  2:34                         ` Zhang, Yanmin
  2010-04-07  6:39                           ` Eric Dumazet
  2010-04-07 10:47                           ` Pekka Enberg
@ 2010-04-07 16:30                           ` Christoph Lameter
  2010-04-07 16:43                           ` Christoph Lameter
  3 siblings, 0 replies; 28+ messages in thread
From: Christoph Lameter @ 2010-04-07 16:30 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Eric Dumazet, netdev, Tejun Heo, Pekka Enberg, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

On Wed, 7 Apr 2010, Zhang, Yanmin wrote:

> > booting with slub_min_order=3 do change hackbench results for example ;)
> By default, slub_min_order=3 on my Nehalem machines. I also tried different
> larger slub_min_order and didn't find help.

Lets stop fiddling with kernel command line parameters for these test.
Leave as default. That is how I tested.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-07  2:34                         ` Zhang, Yanmin
                                             ` (2 preceding siblings ...)
  2010-04-07 16:30                           ` Christoph Lameter
@ 2010-04-07 16:43                           ` Christoph Lameter
  2010-04-07 16:49                             ` Pekka Enberg
  2010-04-08  7:18                             ` Zhang, Yanmin
  3 siblings, 2 replies; 28+ messages in thread
From: Christoph Lameter @ 2010-04-07 16:43 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Eric Dumazet, netdev, Tejun Heo, Pekka Enberg, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

On Wed, 7 Apr 2010, Zhang, Yanmin wrote:

> I collected retired instruction, dtlb miss and LLC miss.
> Below is data of LLC miss.
>
> Kernel 2.6.33:
>     20.94%        hackbench  [kernel.kallsyms]                                       [k] copy_user_generic_string
>     14.56%        hackbench  [kernel.kallsyms]                                       [k] unix_stream_recvmsg
>     12.88%        hackbench  [kernel.kallsyms]                                       [k] kfree
>      7.37%        hackbench  [kernel.kallsyms]                                       [k] kmem_cache_free
>      7.18%        hackbench  [kernel.kallsyms]                                       [k] kmem_cache_alloc_node
>      6.78%        hackbench  [kernel.kallsyms]                                       [k] kfree_skb
>      6.27%        hackbench  [kernel.kallsyms]                                       [k] __kmalloc_node_track_caller
>      2.73%        hackbench  [kernel.kallsyms]                                       [k] __slab_free
>      2.21%        hackbench  [kernel.kallsyms]                                       [k] get_partial_node
>      2.01%        hackbench  [kernel.kallsyms]                                       [k] _raw_spin_lock
>      1.59%        hackbench  [kernel.kallsyms]                                       [k] schedule
>      1.27%        hackbench  hackbench                                               [.] receiver
>      0.99%        hackbench  libpthread-2.9.so                                       [.] __read
>      0.87%        hackbench  [kernel.kallsyms]                                       [k] unix_stream_sendmsg
>
> Kernel 2.6.34-rc3:
>     18.55%        hackbench  [kernel.kallsyms]                                                     [k] copy_user_generic_str
> ing
>     13.19%        hackbench  [kernel.kallsyms]                                                     [k] unix_stream_recvmsg
>     11.62%        hackbench  [kernel.kallsyms]                                                     [k] kfree
>      8.54%        hackbench  [kernel.kallsyms]                                                     [k] kmem_cache_free
>      7.88%        hackbench  [kernel.kallsyms]                                                     [k] __kmalloc_node_track_
> caller

Seems that the overhead of __kmalloc_node_track_caller was increased. The
function inlines slab_alloc().

>      6.54%        hackbench  [kernel.kallsyms]                                                     [k] kmem_cache_alloc_node
>      5.94%        hackbench  [kernel.kallsyms]                                                     [k] kfree_skb
>      3.48%        hackbench  [kernel.kallsyms]                                                     [k] __slab_free
>      2.15%        hackbench  [kernel.kallsyms]                                                     [k] _raw_spin_lock
>      1.83%        hackbench  [kernel.kallsyms]                                                     [k] schedule
>      1.82%        hackbench  [kernel.kallsyms]                                                     [k] get_partial_node
>      1.59%        hackbench  hackbench                                                             [.] receiver
>      1.37%        hackbench  libpthread-2.9.so                                                     [.] __read

I wonder if this is not related to the kmem_cache_cpu structure straggling
cache line boundaries under some conditions. On 2.6.33 the kmem_cache_cpu
structure was larger and therefore tight packing resulted in different
alignment.

Could you see how the following patch affects the results. It attempts to
increase the size of kmem_cache_cpu to a power of 2 bytes. There is also
the potential that other per cpu fetches to neighboring objects affect the
situation. We could cacheline align the whole thing.

---
 include/linux/slub_def.h |    5 +++++
 1 file changed, 5 insertions(+)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2010-04-07 11:33:50.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2010-04-07 11:35:18.000000000 -0500
@@ -38,6 +38,11 @@ struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to first free per cpu object */
 	struct page *page;	/* The slab from which we are allocating */
 	int node;		/* The node of the page (or -1 for debug) */
+#ifndef CONFIG_64BIT
+	int dummy1;
+#endif
+	unsigned long dummy2;
+
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-07 16:43                           ` Christoph Lameter
@ 2010-04-07 16:49                             ` Pekka Enberg
  2010-04-07 16:52                               ` Pekka Enberg
  2010-04-07 18:18                               ` Christoph Lameter
  2010-04-08  7:18                             ` Zhang, Yanmin
  1 sibling, 2 replies; 28+ messages in thread
From: Pekka Enberg @ 2010-04-07 16:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Zhang, Yanmin, Eric Dumazet, netdev, Tejun Heo, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

Christoph Lameter wrote:
> I wonder if this is not related to the kmem_cache_cpu structure straggling
> cache line boundaries under some conditions. On 2.6.33 the kmem_cache_cpu
> structure was larger and therefore tight packing resulted in different
> alignment.
> 
> Could you see how the following patch affects the results. It attempts to
> increase the size of kmem_cache_cpu to a power of 2 bytes. There is also
> the potential that other per cpu fetches to neighboring objects affect the
> situation. We could cacheline align the whole thing.
> 
> ---
>  include/linux/slub_def.h |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> Index: linux-2.6/include/linux/slub_def.h
> ===================================================================
> --- linux-2.6.orig/include/linux/slub_def.h	2010-04-07 11:33:50.000000000 -0500
> +++ linux-2.6/include/linux/slub_def.h	2010-04-07 11:35:18.000000000 -0500
> @@ -38,6 +38,11 @@ struct kmem_cache_cpu {
>  	void **freelist;	/* Pointer to first free per cpu object */
>  	struct page *page;	/* The slab from which we are allocating */
>  	int node;		/* The node of the page (or -1 for debug) */
> +#ifndef CONFIG_64BIT
> +	int dummy1;
> +#endif
> +	unsigned long dummy2;
> +
>  #ifdef CONFIG_SLUB_STATS
>  	unsigned stat[NR_SLUB_STAT_ITEMS];
>  #endif

Would __cacheline_aligned_in_smp do the trick here?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-07 16:49                             ` Pekka Enberg
@ 2010-04-07 16:52                               ` Pekka Enberg
  2010-04-07 18:20                                 ` Christoph Lameter
  2010-04-07 18:18                               ` Christoph Lameter
  1 sibling, 1 reply; 28+ messages in thread
From: Pekka Enberg @ 2010-04-07 16:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Zhang, Yanmin, Eric Dumazet, netdev, Tejun Heo, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

Pekka Enberg wrote:
> Christoph Lameter wrote:
>> I wonder if this is not related to the kmem_cache_cpu structure 
>> straggling
>> cache line boundaries under some conditions. On 2.6.33 the kmem_cache_cpu
>> structure was larger and therefore tight packing resulted in different
>> alignment.
>>
>> Could you see how the following patch affects the results. It attempts to
>> increase the size of kmem_cache_cpu to a power of 2 bytes. There is also
>> the potential that other per cpu fetches to neighboring objects affect 
>> the
>> situation. We could cacheline align the whole thing.
>>
>> ---
>>  include/linux/slub_def.h |    5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> Index: linux-2.6/include/linux/slub_def.h
>> ===================================================================
>> --- linux-2.6.orig/include/linux/slub_def.h    2010-04-07 
>> 11:33:50.000000000 -0500
>> +++ linux-2.6/include/linux/slub_def.h    2010-04-07 
>> 11:35:18.000000000 -0500
>> @@ -38,6 +38,11 @@ struct kmem_cache_cpu {
>>      void **freelist;    /* Pointer to first free per cpu object */
>>      struct page *page;    /* The slab from which we are allocating */
>>      int node;        /* The node of the page (or -1 for debug) */
>> +#ifndef CONFIG_64BIT
>> +    int dummy1;
>> +#endif
>> +    unsigned long dummy2;
>> +
>>  #ifdef CONFIG_SLUB_STATS
>>      unsigned stat[NR_SLUB_STAT_ITEMS];
>>  #endif
> 
> Would __cacheline_aligned_in_smp do the trick here?

Oh, sorry, I think it's actually '____cacheline_aligned_in_smp' (with 
four underscores) for per-cpu data. Confusing...

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-07 16:52                               ` Pekka Enberg
@ 2010-04-07 18:20                                 ` Christoph Lameter
  2010-04-07 18:25                                   ` Pekka Enberg
  2010-04-07 18:38                                   ` Eric Dumazet
  0 siblings, 2 replies; 28+ messages in thread
From: Christoph Lameter @ 2010-04-07 18:20 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Zhang, Yanmin, Eric Dumazet, netdev, Tejun Heo, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

On Wed, 7 Apr 2010, Pekka Enberg wrote:

> Oh, sorry, I think it's actually '____cacheline_aligned_in_smp' (with four
> underscores) for per-cpu data. Confusing...

This does not particulary help to clarify the situation since we are
dealing with data that can either be allocated via the percpu allocator or
be statically present (kmalloc bootstrap situation).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-07 18:20                                 ` Christoph Lameter
@ 2010-04-07 18:25                                   ` Pekka Enberg
  2010-04-07 19:30                                     ` Christoph Lameter
  2010-04-07 18:38                                   ` Eric Dumazet
  1 sibling, 1 reply; 28+ messages in thread
From: Pekka Enberg @ 2010-04-07 18:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Zhang, Yanmin, Eric Dumazet, netdev, Tejun Heo, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

Christoph Lameter wrote:
> On Wed, 7 Apr 2010, Pekka Enberg wrote:
> 
>> Oh, sorry, I think it's actually '____cacheline_aligned_in_smp' (with four
>> underscores) for per-cpu data. Confusing...
> 
> This does not particulary help to clarify the situation since we are
> dealing with data that can either be allocated via the percpu allocator or
> be statically present (kmalloc bootstrap situation).

Yes, I am an idiot. :-)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-07 18:25                                   ` Pekka Enberg
@ 2010-04-07 19:30                                     ` Christoph Lameter
  0 siblings, 0 replies; 28+ messages in thread
From: Christoph Lameter @ 2010-04-07 19:30 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Zhang, Yanmin, Eric Dumazet, netdev, Tejun Heo, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

On Wed, 7 Apr 2010, Pekka Enberg wrote:

> Yes, I am an idiot. :-)

Plato said it in another way:

"As for me, all I know is that I know nothing."




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-07 18:20                                 ` Christoph Lameter
  2010-04-07 18:25                                   ` Pekka Enberg
@ 2010-04-07 18:38                                   ` Eric Dumazet
  2010-04-08  1:05                                     ` Zhang, Yanmin
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2010-04-07 18:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Zhang, Yanmin, netdev, Tejun Heo, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

Le mercredi 07 avril 2010 à 13:20 -0500, Christoph Lameter a écrit :
> On Wed, 7 Apr 2010, Pekka Enberg wrote:
> 
> > Oh, sorry, I think it's actually '____cacheline_aligned_in_smp' (with four
> > underscores) for per-cpu data. Confusing...
> 
> This does not particulary help to clarify the situation since we are
> dealing with data that can either be allocated via the percpu allocator or
> be statically present (kmalloc bootstrap situation).
> 
> --

Do we have a user program to check actual L1 cache size of a machine ?

I remember my HP blades have many BIOS options, I would like to make
sure they are properly set.




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-07 18:38                                   ` Eric Dumazet
@ 2010-04-08  1:05                                     ` Zhang, Yanmin
  2010-04-08  4:59                                       ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Zhang, Yanmin @ 2010-04-08  1:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, Pekka Enberg, netdev, Tejun Heo, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

On Wed, 2010-04-07 at 20:38 +0200, Eric Dumazet wrote:
> Le mercredi 07 avril 2010 à 13:20 -0500, Christoph Lameter a écrit :
> > On Wed, 7 Apr 2010, Pekka Enberg wrote:
> > 
> > > Oh, sorry, I think it's actually '____cacheline_aligned_in_smp' (with four
> > > underscores) for per-cpu data. Confusing...
> > 
> > This does not particulary help to clarify the situation since we are
> > dealing with data that can either be allocated via the percpu allocator or
> > be statically present (kmalloc bootstrap situation).
> > 
> > --
> 
> Do we have a user program to check actual L1 cache size of a machine ?
If there is no, it's easy to write it as kernel exports the cache stat by
/sys/devices/system/cpu/cpuXXX/cache/indexXXX/

> 
> I remember my HP blades have many BIOS options, I would like to make
> sure they are properly set.
> 
> 
> 



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-08  1:05                                     ` Zhang, Yanmin
@ 2010-04-08  4:59                                       ` Eric Dumazet
  2010-04-08  5:39                                         ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2010-04-08  4:59 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Christoph Lameter, Pekka Enberg, netdev, Tejun Heo, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

Le jeudi 08 avril 2010 à 09:05 +0800, Zhang, Yanmin a écrit :

> > Do we have a user program to check actual L1 cache size of a machine ?
> If there is no, it's easy to write it as kernel exports the cache stat by
> /sys/devices/system/cpu/cpuXXX/cache/indexXXX/

Yes, this is what advertizes my L1 cache having 64bytes lines, but I
would like to check that in practice, this is not 128bytes...

./index0/type:Data
./index0/level:1
./index0/coherency_line_size:64
./index0/physical_line_partition:1
./index0/ways_of_associativity:8
./index0/number_of_sets:64
./index0/size:32K
./index0/shared_cpu_map:00000101
./index0/shared_cpu_list:0,8
./index1/type:Instruction
./index1/level:1
./index1/coherency_line_size:64
./index1/physical_line_partition:1
./index1/ways_of_associativity:4
./index1/number_of_sets:128
./index1/size:32K
./index1/shared_cpu_map:00000101
./index1/shared_cpu_list:0,8



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-08  4:59                                       ` Eric Dumazet
@ 2010-04-08  5:39                                         ` Eric Dumazet
  2010-04-08  7:00                                           ` Eric Dumazet
  2010-04-08 15:34                                           ` Christoph Lameter
  0 siblings, 2 replies; 28+ messages in thread
From: Eric Dumazet @ 2010-04-08  5:39 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Christoph Lameter, Pekka Enberg, netdev, Tejun Heo, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton


I suspect NUMA is completely out of order on current kernel, or my
Nehalem machine NUMA support is a joke

# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 3071 MB
node 0 free: 2637 MB
node 1 size: 3062 MB
node 1 free: 2909 MB


# cat try.sh
hackbench 50 process 5000
numactl --cpubind=0 --membind=0 hackbench 25 process 5000 >RES0 &
numactl --cpubind=1 --membind=1 hackbench 25 process 5000 >RES1 &
wait
echo node0 results
cat RES0
echo node1 results
cat RES1

numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1 &
numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0 &
wait
echo node0 on mem1 results
cat RES0_1
echo node1 on mem0 results
cat RES1_0

# ./try.sh
Running with 50*40 (== 2000) tasks.
Time: 16.865
node0 results
Running with 25*40 (== 1000) tasks.
Time: 16.767
node1 results
Running with 25*40 (== 1000) tasks.
Time: 16.564
node0 on mem1 results
Running with 25*40 (== 1000) tasks.
Time: 16.814
node1 on mem0 results
Running with 25*40 (== 1000) tasks.
Time: 16.896



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-08  5:39                                         ` Eric Dumazet
@ 2010-04-08  7:00                                           ` Eric Dumazet
  2010-04-08  7:05                                             ` David Miller
  2010-04-08  7:54                                             ` Zhang, Yanmin
  2010-04-08 15:34                                           ` Christoph Lameter
  1 sibling, 2 replies; 28+ messages in thread
From: Eric Dumazet @ 2010-04-08  7:00 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Christoph Lameter, Pekka Enberg, netdev, Tejun Heo, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

Le jeudi 08 avril 2010 à 07:39 +0200, Eric Dumazet a écrit :
> I suspect NUMA is completely out of order on current kernel, or my
> Nehalem machine NUMA support is a joke
> 
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 size: 3071 MB
> node 0 free: 2637 MB
> node 1 size: 3062 MB
> node 1 free: 2909 MB
> 
> 
> # cat try.sh
> hackbench 50 process 5000
> numactl --cpubind=0 --membind=0 hackbench 25 process 5000 >RES0 &
> numactl --cpubind=1 --membind=1 hackbench 25 process 5000 >RES1 &
> wait
> echo node0 results
> cat RES0
> echo node1 results
> cat RES1
> 
> numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1 &
> numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0 &
> wait
> echo node0 on mem1 results
> cat RES0_1
> echo node1 on mem0 results
> cat RES1_0
> 
> # ./try.sh
> Running with 50*40 (== 2000) tasks.
> Time: 16.865
> node0 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.767
> node1 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.564
> node0 on mem1 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.814
> node1 on mem0 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.896

If run individually, the tests results are more what we would expect
(slow), but if machine runs the two set of process concurrently, each
group runs much faster...


# numactl --cpubind=0 --membind=1 hackbench 25 process 5000
Running with 25*40 (== 1000) tasks.
Time: 21.810

# numactl --cpubind=1 --membind=0 hackbench 25 process 5000
Running with 25*40 (== 1000) tasks.
Time: 20.679

# numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1 &
[1] 9177
# numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0 &
[2] 9196
# wait
[1]-  Done                    numactl --cpubind=0 --membind=1 hackbench
25 process 5000 >RES0_1
[2]+  Done                    numactl --cpubind=1 --membind=0 hackbench
25 process 5000 >RES1_0
# echo node0 on mem1 results
node0 on mem1 results
# cat RES0_1
Running with 25*40 (== 1000) tasks.
Time: 13.818
# echo node1 on mem0 results
node1 on mem0 results
# cat RES1_0
Running with 25*40 (== 1000) tasks.
Time: 11.633

Oh well...



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-08  7:00                                           ` Eric Dumazet
@ 2010-04-08  7:05                                             ` David Miller
  2010-04-08  7:20                                               ` David Miller
  2010-04-08  7:25                                               ` Eric Dumazet
  2010-04-08  7:54                                             ` Zhang, Yanmin
  1 sibling, 2 replies; 28+ messages in thread
From: David Miller @ 2010-04-08  7:05 UTC (permalink / raw)
  To: eric.dumazet
  Cc: yanmin_zhang, cl, penberg, netdev, tj, alex.shi, linux-kernel,
	ling.ma, tim.c.chen, akpm

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 08 Apr 2010 09:00:19 +0200

> If run individually, the tests results are more what we would expect
> (slow), but if machine runs the two set of process concurrently, each
> group runs much faster...

BTW, I just discovered (thanks to the function graph tracer, woo hoo!)
that loopback TCP packets get fully checksum validated on receive.

I'm trying to figure out why skb->ip_summed ends up being
CHECKSUM_NONE in tcp_v4_rcv() even though it gets set to
CHECKSUM_PARTIAL in tcp_sendmsg().

I wonder how much this accounts for some of the hackbench
oddities... and other regressions in loopback tests we've seen.
:-)

Just FYI...

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-08  7:05                                             ` David Miller
@ 2010-04-08  7:20                                               ` David Miller
  2010-04-08  7:25                                               ` Eric Dumazet
  1 sibling, 0 replies; 28+ messages in thread
From: David Miller @ 2010-04-08  7:20 UTC (permalink / raw)
  To: eric.dumazet
  Cc: yanmin_zhang, cl, penberg, netdev, tj, alex.shi, linux-kernel,
	ling.ma, tim.c.chen, akpm

From: David Miller <davem@davemloft.net>
Date: Thu, 08 Apr 2010 00:05:57 -0700 (PDT)

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 08 Apr 2010 09:00:19 +0200
> 
>> If run individually, the tests results are more what we would expect
>> (slow), but if machine runs the two set of process concurrently, each
>> group runs much faster...
> 
> BTW, I just discovered (thanks to the function graph tracer, woo hoo!)
> that loopback TCP packets get fully checksum validated on receive.
> 
> I'm trying to figure out why skb->ip_summed ends up being
> CHECKSUM_NONE in tcp_v4_rcv() even though it gets set to
> CHECKSUM_PARTIAL in tcp_sendmsg().

Ok, it looks like it's only ACK packets that have this problem,
but still :-)

It's weird that we have a special ip_dev_loopback_xmit() for for
ip_mc_output() NF_HOOK()s, which forces skb->ip_summed to
CHECKSUM_UNNECESSARY, but the actual normal loopback xmit doesn't
do that...

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-08  7:05                                             ` David Miller
  2010-04-08  7:20                                               ` David Miller
@ 2010-04-08  7:25                                               ` Eric Dumazet
  1 sibling, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2010-04-08  7:25 UTC (permalink / raw)
  To: David Miller
  Cc: yanmin_zhang, cl, penberg, netdev, tj, alex.shi, linux-kernel,
	ling.ma, tim.c.chen, akpm

Le jeudi 08 avril 2010 à 00:05 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 08 Apr 2010 09:00:19 +0200
> 
> > If run individually, the tests results are more what we would expect
> > (slow), but if machine runs the two set of process concurrently, each
> > group runs much faster...
> 
> BTW, I just discovered (thanks to the function graph tracer, woo hoo!)
> that loopback TCP packets get fully checksum validated on receive.
> 
> I'm trying to figure out why skb->ip_summed ends up being
> CHECKSUM_NONE in tcp_v4_rcv() even though it gets set to
> CHECKSUM_PARTIAL in tcp_sendmsg().
> 
> I wonder how much this accounts for some of the hackbench
> oddities... and other regressions in loopback tests we've seen.
> :-)
> 
> Just FYI...

Thanks !

But hackbench is a af_unix benchmark, so loopback stuff is not used that
much :)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-08  7:00                                           ` Eric Dumazet
  2010-04-08  7:05                                             ` David Miller
@ 2010-04-08  7:54                                             ` Zhang, Yanmin
  2010-04-08  7:54                                               ` Eric Dumazet
  1 sibling, 1 reply; 28+ messages in thread
From: Zhang, Yanmin @ 2010-04-08  7:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, Pekka Enberg, netdev, Tejun Heo, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

On Thu, 2010-04-08 at 09:00 +0200, Eric Dumazet wrote:
> Le jeudi 08 avril 2010 à 07:39 +0200, Eric Dumazet a écrit :
> > I suspect NUMA is completely out of order on current kernel, or my
> > Nehalem machine NUMA support is a joke
> > 
> > # numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 size: 3071 MB
> > node 0 free: 2637 MB
> > node 1 size: 3062 MB
> > node 1 free: 2909 MB
> > 
> > 
> > # cat try.sh
> > hackbench 50 process 5000
> > numactl --cpubind=0 --membind=0 hackbench 25 process 5000 >RES0 &
> > numactl --cpubind=1 --membind=1 hackbench 25 process 5000 >RES1 &
> > wait
> > echo node0 results
> > cat RES0
> > echo node1 results
> > cat RES1
> > 
> > numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1 &
> > numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0 &
> > wait
> > echo node0 on mem1 results
> > cat RES0_1
> > echo node1 on mem0 results
> > cat RES1_0
> > 
> > # ./try.sh
> > Running with 50*40 (== 2000) tasks.
> > Time: 16.865
> > node0 results
> > Running with 25*40 (== 1000) tasks.
> > Time: 16.767
> > node1 results
> > Running with 25*40 (== 1000) tasks.
> > Time: 16.564
> > node0 on mem1 results
> > Running with 25*40 (== 1000) tasks.
> > Time: 16.814
> > node1 on mem0 results
> > Running with 25*40 (== 1000) tasks.
> > Time: 16.896
> 
> If run individually, the tests results are more what we would expect
> (slow), but if machine runs the two set of process concurrently, each
> group runs much faster...
If there are 2 nodes in the machine, processes on node 0 will contact MCH of
node 1 to access memory of node 1. I suspect the MCH of node 1 might enter
a power-saving mode when all the cpus of node 1 are free. So the transactions
from MCH 1 to MCH 0 has a larger latency.

> 
> 
> # numactl --cpubind=0 --membind=1 hackbench 25 process 5000
> Running with 25*40 (== 1000) tasks.
> Time: 21.810
> 
> # numactl --cpubind=1 --membind=0 hackbench 25 process 5000
> Running with 25*40 (== 1000) tasks.
> Time: 20.679
> 
> # numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1 &
> [1] 9177
> # numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0 &
> [2] 9196
> # wait
> [1]-  Done                    numactl --cpubind=0 --membind=1 hackbench
> 25 process 5000 >RES0_1
> [2]+  Done                    numactl --cpubind=1 --membind=0 hackbench
> 25 process 5000 >RES1_0
> # echo node0 on mem1 results
> node0 on mem1 results
> # cat RES0_1
> Running with 25*40 (== 1000) tasks.
> Time: 13.818
> # echo node1 on mem0 results
> node1 on mem0 results
> # cat RES1_0
> Running with 25*40 (== 1000) tasks.
> Time: 11.633
> 
> Oh well...
> 
> 



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-08  7:54                                             ` Zhang, Yanmin
@ 2010-04-08  7:54                                               ` Eric Dumazet
  2010-04-08  8:09                                                 ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2010-04-08  7:54 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Christoph Lameter, Pekka Enberg, netdev, Tejun Heo, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

Le jeudi 08 avril 2010 à 15:54 +0800, Zhang, Yanmin a écrit :

> If there are 2 nodes in the machine, processes on node 0 will contact MCH of
> node 1 to access memory of node 1. I suspect the MCH of node 1 might enter
> a power-saving mode when all the cpus of node 1 are free. So the transactions
> from MCH 1 to MCH 0 has a larger latency.
> 

Hmm, thanks for the hint, I will investigate this.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-08  7:54                                               ` Eric Dumazet
@ 2010-04-08  8:09                                                 ` Eric Dumazet
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2010-04-08  8:09 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Christoph Lameter, Pekka Enberg, netdev, Tejun Heo, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton, Ingo Molnar

Le jeudi 08 avril 2010 à 09:54 +0200, Eric Dumazet a écrit :
> Le jeudi 08 avril 2010 à 15:54 +0800, Zhang, Yanmin a écrit :
> 
> > If there are 2 nodes in the machine, processes on node 0 will contact MCH of
> > node 1 to access memory of node 1. I suspect the MCH of node 1 might enter
> > a power-saving mode when all the cpus of node 1 are free. So the transactions
> > from MCH 1 to MCH 0 has a larger latency.
> > 
> 
> Hmm, thanks for the hint, I will investigate this.

Oh well, 

perf timechart record &

Instant crash

Call Trace:
 perf_trace_sched_switch+0xd5/0x120
 schedule+0x6b5/0x860
 retint_careful+0xd/0x21
 
RIP ffffffff81010955 perf_arch_fetch_caller_regs+0x15/0x40
CR2: 00000000d21f1422

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-08  5:39                                         ` Eric Dumazet
  2010-04-08  7:00                                           ` Eric Dumazet
@ 2010-04-08 15:34                                           ` Christoph Lameter
  2010-04-08 15:52                                             ` Eric Dumazet
  1 sibling, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2010-04-08 15:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Zhang, Yanmin, Pekka Enberg, netdev, Tejun Heo, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

On Thu, 8 Apr 2010, Eric Dumazet wrote:

> I suspect NUMA is completely out of order on current kernel, or my
> Nehalem machine NUMA support is a joke
>
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 size: 3071 MB
> node 0 free: 2637 MB
> node 1 size: 3062 MB
> node 1 free: 2909 MB

How do the cpus map to the nodes? cpu 0 and 1 both on the same node?

> # ./try.sh
> Running with 50*40 (== 2000) tasks.
> Time: 16.865
> node0 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.767
> node1 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.564
> node0 on mem1 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.814
> node1 on mem0 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.896
>
>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-08 15:34                                           ` Christoph Lameter
@ 2010-04-08 15:52                                             ` Eric Dumazet
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2010-04-08 15:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Zhang, Yanmin, Pekka Enberg, netdev, Tejun Heo, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

Le jeudi 08 avril 2010 à 10:34 -0500, Christoph Lameter a écrit :
> On Thu, 8 Apr 2010, Eric Dumazet wrote:
> 
> > I suspect NUMA is completely out of order on current kernel, or my
> > Nehalem machine NUMA support is a joke
> >
> > # numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 size: 3071 MB
> > node 0 free: 2637 MB
> > node 1 size: 3062 MB
> > node 1 free: 2909 MB
> 
> How do the cpus map to the nodes? cpu 0 and 1 both on the same node?

one socket maps to 0 2 4 6 8 10 12 14 (Node 0)
one socket maps to 1 3 5 7 9 11 13 15 (Node 1)

# numactl --cpubind=0 --membind=0 numactl --show
policy: bind
preferred node: 0
interleavemask: 
interleavenode: 0
nodebind: 0 
membind: 0 
cpubind: 1 3 5 7 9 11 13 15 1024 

(strange 1024 report...)

# numactl --cpubind=1 --membind=1 numactl --show
policy: bind
preferred node: 1
interleavemask: 
interleavenode: 0
nodebind: 
membind: 1 
cpubind: 0 2 4 6 8 10 12 14 



[    0.161170] Booting Node   0, Processors  #1
[    0.248995] CPU 1 MCA banks CMCI:2 CMCI:3 CMCI:5 CMCI:6 SHD:8
[    0.269177]  Ok.
[    0.269453] Booting Node   1, Processors  #2
[    0.356965] CPU 2 MCA banks CMCI:2 CMCI:3 CMCI:5 SHD:6 SHD:8
[    0.377207]  Ok.
[    0.377485] Booting Node   0, Processors  #3
[    0.464935] CPU 3 MCA banks CMCI:2 CMCI:3 CMCI:5 SHD:6 SHD:8
[    0.485065]  Ok.
[    0.485217] Booting Node   1, Processors  #4
[    0.572906] CPU 4 MCA banks CMCI:2 CMCI:3 CMCI:5 SHD:6 SHD:8
[    0.593044]  Ok.
...
grep "physical id" /proc/cpuinfo 
physical id	: 1
physical id	: 0
physical id	: 1
physical id	: 0
physical id	: 1
physical id	: 0
physical id	: 1
physical id	: 0
physical id	: 1
physical id	: 0
physical id	: 1
physical id	: 0
physical id	: 1
physical id	: 0
physical id	: 1
physical id	: 0



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-07 16:49                             ` Pekka Enberg
  2010-04-07 16:52                               ` Pekka Enberg
@ 2010-04-07 18:18                               ` Christoph Lameter
  1 sibling, 0 replies; 28+ messages in thread
From: Christoph Lameter @ 2010-04-07 18:18 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Zhang, Yanmin, Eric Dumazet, netdev, Tejun Heo, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

On Wed, 7 Apr 2010, Pekka Enberg wrote:

> Christoph Lameter wrote:
> > I wonder if this is not related to the kmem_cache_cpu structure straggling
> > cache line boundaries under some conditions. On 2.6.33 the kmem_cache_cpu
> > structure was larger and therefore tight packing resulted in different
> > alignment.
> >
> > Could you see how the following patch affects the results. It attempts to
> > increase the size of kmem_cache_cpu to a power of 2 bytes. There is also
> > the potential that other per cpu fetches to neighboring objects affect the
> > situation. We could cacheline align the whole thing.
> >
> > ---
> >  include/linux/slub_def.h |    5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > Index: linux-2.6/include/linux/slub_def.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/slub_def.h	2010-04-07 11:33:50.000000000
> > -0500
> > +++ linux-2.6/include/linux/slub_def.h	2010-04-07 11:35:18.000000000
> > -0500
> > @@ -38,6 +38,11 @@ struct kmem_cache_cpu {
> >  	void **freelist;	/* Pointer to first free per cpu object */
> >  	struct page *page;	/* The slab from which we are allocating */
> >  	int node;		/* The node of the page (or -1 for debug) */
> > +#ifndef CONFIG_64BIT
> > +	int dummy1;
> > +#endif
> > +	unsigned long dummy2;
> > +
> >  #ifdef CONFIG_SLUB_STATS
> >  	unsigned stat[NR_SLUB_STAT_ITEMS];
> >  #endif
>
> Would __cacheline_aligned_in_smp do the trick here?

This is allocated via the percpu allocator. We could specify cacheline
alignment there but that would reduce the density. You basically need 4
words for a kmem_cache_cpu structure. A number of those fit into one 64
byte cacheline.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: hackbench regression due to commit 9dfc6e68bfe6e
  2010-04-07 16:43                           ` Christoph Lameter
  2010-04-07 16:49                             ` Pekka Enberg
@ 2010-04-08  7:18                             ` Zhang, Yanmin
  1 sibling, 0 replies; 28+ messages in thread
From: Zhang, Yanmin @ 2010-04-08  7:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Eric Dumazet, netdev, Tejun Heo, Pekka Enberg, alex.shi,
	linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
	Andrew Morton

On Wed, 2010-04-07 at 11:43 -0500, Christoph Lameter wrote:
> On Wed, 7 Apr 2010, Zhang, Yanmin wrote:
> 
> > I collected retired instruction, dtlb miss and LLC miss.
> > Below is data of LLC miss.
> >
> > Kernel 2.6.33:
> >     20.94%        hackbench  [kernel.kallsyms]                                       [k] copy_user_generic_string
> >     14.56%        hackbench  [kernel.kallsyms]                                       [k] unix_stream_recvmsg
> >     12.88%        hackbench  [kernel.kallsyms]                                       [k] kfree
> >      7.37%        hackbench  [kernel.kallsyms]                                       [k] kmem_cache_free
> >      7.18%        hackbench  [kernel.kallsyms]                                       [k] kmem_cache_alloc_node
> >      6.78%        hackbench  [kernel.kallsyms]                                       [k] kfree_skb
> >      6.27%        hackbench  [kernel.kallsyms]                                       [k] __kmalloc_node_track_caller
> >      2.73%        hackbench  [kernel.kallsyms]                                       [k] __slab_free
> >      2.21%        hackbench  [kernel.kallsyms]                                       [k] get_partial_node
> >      2.01%        hackbench  [kernel.kallsyms]                                       [k] _raw_spin_lock
> >      1.59%        hackbench  [kernel.kallsyms]                                       [k] schedule
> >      1.27%        hackbench  hackbench                                               [.] receiver
> >      0.99%        hackbench  libpthread-2.9.so                                       [.] __read
> >      0.87%        hackbench  [kernel.kallsyms]                                       [k] unix_stream_sendmsg
> >
> > Kernel 2.6.34-rc3:
> >     18.55%        hackbench  [kernel.kallsyms]                                                     [k] copy_user_generic_str
> > ing
> >     13.19%        hackbench  [kernel.kallsyms]                                                     [k] unix_stream_recvmsg
> >     11.62%        hackbench  [kernel.kallsyms]                                                     [k] kfree
> >      8.54%        hackbench  [kernel.kallsyms]                                                     [k] kmem_cache_free
> >      7.88%        hackbench  [kernel.kallsyms]                                                     [k] __kmalloc_node_track_
> > caller
> 
> Seems that the overhead of __kmalloc_node_track_caller was increased. The
> function inlines slab_alloc().
> 
> >      6.54%        hackbench  [kernel.kallsyms]                                                     [k] kmem_cache_alloc_node
> >      5.94%        hackbench  [kernel.kallsyms]                                                     [k] kfree_skb
> >      3.48%        hackbench  [kernel.kallsyms]                                                     [k] __slab_free
> >      2.15%        hackbench  [kernel.kallsyms]                                                     [k] _raw_spin_lock
> >      1.83%        hackbench  [kernel.kallsyms]                                                     [k] schedule
> >      1.82%        hackbench  [kernel.kallsyms]                                                     [k] get_partial_node
> >      1.59%        hackbench  hackbench                                                             [.] receiver
> >      1.37%        hackbench  libpthread-2.9.so                                                     [.] __read
> 
> I wonder if this is not related to the kmem_cache_cpu structure straggling
> cache line boundaries under some conditions. On 2.6.33 the kmem_cache_cpu
> structure was larger and therefore tight packing resulted in different
> alignment.
> 
> Could you see how the following patch affects the results. It attempts to
> increase the size of kmem_cache_cpu to a power of 2 bytes. There is also
> the potential that other per cpu fetches to neighboring objects affect the
> situation. We could cacheline align the whole thing.
I tested the patch against 2.6.33+9dfc6e68bfe6e and it seems it doesn't help.

I dumped percpu allocation info when booting kernel and didn't find clear sign.

> 
> ---
>  include/linux/slub_def.h |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> Index: linux-2.6/include/linux/slub_def.h
> ===================================================================
> --- linux-2.6.orig/include/linux/slub_def.h	2010-04-07 11:33:50.000000000 -0500
> +++ linux-2.6/include/linux/slub_def.h	2010-04-07 11:35:18.000000000 -0500
> @@ -38,6 +38,11 @@ struct kmem_cache_cpu {
>  	void **freelist;	/* Pointer to first free per cpu object */
>  	struct page *page;	/* The slab from which we are allocating */
>  	int node;		/* The node of the page (or -1 for debug) */
> +#ifndef CONFIG_64BIT
> +	int dummy1;
> +#endif
> +	unsigned long dummy2;
> +
>  #ifdef CONFIG_SLUB_STATS
>  	unsigned stat[NR_SLUB_STAT_ITEMS];
>  #endif



^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2010-04-08 15:52 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1269506457.4513.141.camel@alexs-hp.sh.intel.com>
     [not found] ` <alpine.DEB.2.00.1003250942080.2670@router.home>
     [not found]   ` <1269570902.9614.92.camel@alexs-hp.sh.intel.com>
     [not found]     ` <1270114166.2078.107.camel@ymzhang.sh.intel.com>
     [not found]       ` <alpine.DEB.2.00.1004011050340.16531@router.home>
     [not found]         ` <1270195589.2078.116.camel@ymzhang.sh.intel.com>
     [not found]           ` <alpine.DEB.2.00.1004050853300.23149@router.home>
     [not found]             ` <i2z84144f021004051030k7ff5190cyc083aa12c552dfac@mail.gmail.com>
     [not found]               ` <4BBA8DF9.8010409@kernel.org>
     [not found]                 ` <1270542497.2078.123.camel@ymzhang.sh.intel.com>
     [not found]                   ` <alpine.DEB.2.00.1004061033330.18750@router.home>
     [not found]                     ` <alpine.DEB.2.00.1004061552500.19151@router.home>
2010-04-06 22:10                       ` hackbench regression due to commit 9dfc6e68bfe6e Eric Dumazet
2010-04-07  2:34                         ` Zhang, Yanmin
2010-04-07  6:39                           ` Eric Dumazet
2010-04-07  9:07                             ` Zhang, Yanmin
2010-04-07  9:20                               ` Eric Dumazet
2010-04-07 10:47                           ` Pekka Enberg
2010-04-07 16:30                           ` Christoph Lameter
2010-04-07 16:43                           ` Christoph Lameter
2010-04-07 16:49                             ` Pekka Enberg
2010-04-07 16:52                               ` Pekka Enberg
2010-04-07 18:20                                 ` Christoph Lameter
2010-04-07 18:25                                   ` Pekka Enberg
2010-04-07 19:30                                     ` Christoph Lameter
2010-04-07 18:38                                   ` Eric Dumazet
2010-04-08  1:05                                     ` Zhang, Yanmin
2010-04-08  4:59                                       ` Eric Dumazet
2010-04-08  5:39                                         ` Eric Dumazet
2010-04-08  7:00                                           ` Eric Dumazet
2010-04-08  7:05                                             ` David Miller
2010-04-08  7:20                                               ` David Miller
2010-04-08  7:25                                               ` Eric Dumazet
2010-04-08  7:54                                             ` Zhang, Yanmin
2010-04-08  7:54                                               ` Eric Dumazet
2010-04-08  8:09                                                 ` Eric Dumazet
2010-04-08 15:34                                           ` Christoph Lameter
2010-04-08 15:52                                             ` Eric Dumazet
2010-04-07 18:18                               ` Christoph Lameter
2010-04-08  7:18                             ` Zhang, Yanmin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).