netdev.vger.kernel.org archive mirror
* Re: Loopback performance from kernel 2.6.12 to 2.6.37
       [not found]     ` <1289214289.2820.188.camel@edumazet-laptop>
@ 2010-11-08 15:06       ` Eric Dumazet
  2010-11-09  0:05         ` Andrew Hendry
  2010-11-09 13:59         ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 19+ messages in thread
From: Eric Dumazet @ 2010-11-08 15:06 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev

On Monday 08 November 2010 at 12:04 +0100, Eric Dumazet wrote:
> On Monday 08 November 2010 at 11:58 +0100, Jesper Dangaard Brouer
> wrote:
> > On Fri, 2010-11-05 at 21:29 +0100, Eric Dumazet wrote:
> > > On Friday 05 November 2010 at 11:49 +0100, Jesper Dangaard Brouer
> > > wrote:
> > > > Hi Eric,
> > > > 
> > > > A colleague sent me a link to someone who has done some quite extensive
> > > > performance measurements across different kernel versions.
> > > > 
> > > > I noticed that the loopback performance has gotten quite bad:
> > > > 
> > > > http://www.phoronix.com/scan.php?page=article&item=linux_2612_2637&num=6
> > > > 
> > > > I thought you might be interested in the link.
> > > > 
> > > > See you around :-)
> > > 
> > > Hi !
> > > 
> > > Problem is: I have no idea exactly which test they use,
> > > do you have info about it?
> > 
> > Its called the Phoronix test-suite, their website is:
> > http://www.phoronix-test-suite.com/?k=home
> > 
> > On my Ubuntu workstation their software comes as a software package:
> >  sudo aptitude install phoronix-test-suite
> > 
> > They seem to be related to the test/review site:
> > http://www.phoronix.com/
> > 
> > 
> > 
> > > This probably can be explained very fast.
> > 
> > The loopback test seems to be the only real networking test they do.
> > It looks like they just copy a very big file via loopback, and record the
> > time it took... quite simple.
> > 
> > Their tests seem to be focused on CPU util/speed, graphics/games.
> > 
> > 
> > The thing that caught my attention was that they seemed interested in
> > doing performance regression testing on all kernel versions...
> > 
> > So, I thought, it would be great if someone else would do automated
> > performance regression testing for us :-).  Too bad they only have a
> > very simple network test.
> > 
> > 
> 

> 

CC netdev, if you don't mind.


Their network test is basically :

netcat -l 9999 >/dev/null &
time dd if=/dev/zero bs=1M count=10000 | netcat  127.0.0.1 9999

They say it takes 38 seconds on their "super fast" processor.

On my dev machine, not super fast (E5540 @2.53GHz), I get 8 or 9
seconds, even if only one CPU is online, all others offline.

Go figure... maybe an artifact of the virtualization they use.

I suspect some problem with the ticket spinlocks and a call to the
hypervisor to say 'I am spinning on a spinlock, see if you need to do
something useful', or maybe an ACPI problem (going to/from idle).
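
A quick way to check whether their box is really virtualized would be to
look for the CPUID hypervisor bit, something like (x86 guest assumed, the
"hypervisor" cpuinfo flag is what recent kernels expose):

grep -m1 -o hypervisor /proc/cpuinfo
dmesg | grep -i -e hypervisor -e paravirt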






* Re: Loopback performance from kernel 2.6.12 to 2.6.37
  2010-11-08 15:06       ` Loopback performance from kernel 2.6.12 to 2.6.37 Eric Dumazet
@ 2010-11-09  0:05         ` Andrew Hendry
  2010-11-09  5:22           ` Eric Dumazet
  2010-11-09 13:59         ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 19+ messages in thread
From: Andrew Hendry @ 2010-11-09  0:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jesper Dangaard Brouer, netdev

Results on an i7 860 @ 2.80GHz machine, no virtualization involved, 2.6.37-rc1+:

# time dd if=/dev/zero bs=1M count=10000 | netcat  127.0.0.1 9999
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 50.2022 s, 209 MB/s

real	0m50.210s
user	0m1.094s
sys	0m57.589s



On Tue, Nov 9, 2010 at 2:06 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Monday 08 November 2010 at 12:04 +0100, Eric Dumazet wrote:
>> On Monday 08 November 2010 at 11:58 +0100, Jesper Dangaard Brouer
>> wrote:
>> > On Fri, 2010-11-05 at 21:29 +0100, Eric Dumazet wrote:
>> > > On Friday 05 November 2010 at 11:49 +0100, Jesper Dangaard Brouer
>> > > wrote:
>> > > > Hi Eric,
>> > > >
>> > > > A colleague send me a link to someone who has done some quite extensive
>> > > > performance measurements across different kernel versions.
>> > > >
>> > > > I noticed that the loopback performance has gotten quite bad:
>> > > >
>> > > > http://www.phoronix.com/scan.php?page=article&item=linux_2612_2637&num=6
>> > > >
>> > > > I though you might be interested in the link.
>> > > >
>> > > > See you around :-)
>> > >
>> > > Hi !
>> > >
>> > > Problem is : I have no idea what test they exactly use,
>> > > do you have info about it ?
>> >
>> > Its called the Phoronix test-suite, their website is:
>> > http://www.phoronix-test-suite.com/?k=home
>> >
>> > On my Ubuntu workstation their software comes as a software package:
>> >  sudo aptitude install phoronix-test-suite
>> >
>> > They seem to be related to the test/review site:
>> > http://www.phoronix.com/
>> >
>> >
>> >
>> > > This probably can be explained very fast.
>> >
>> > The loopback test seems to be the only real networking test they do.
>> > It looks like they just copy a very big fil via loopback, and record the
>> > time it took... quite simple.
>> >
>> > Their tests seems to be focused on CPU util/speed, graphics/games.
>> >
>> >
>> > The thing that caught my attention, was that they seemed interested in
>> > doing performance regression testing on all kernel versions...
>> >
>> > So, I though, it would be great if someone else would do automated
>> > performance regression testing for us :-),  Too bad they only have a
>> > very simple network test.
>> >
>> >
>>
>
>>
>
> CC netdev, if you dont mind.
>
>
> Their network test is basically :
>
> netcat -l 9999 >/dev/null &
> time dd if=/dev/zero bs=1M count=10000 | netcat  127.0.0.1 9999
>
> They say it takes 38 seconds on their "super fast" processor
>
> On my dev machine, not super fast (E5540 @2.53GHz), I get 8 or 9
> seconds, even if only one CPU is online, all others offline.
>
> Go figure... maybe an artifact of the virtualization they use.
>
> I suspect some problem with the ticket spinlocks and a call to
> hypervisor to say 'I am spinning on a spinlock, see if you need to do
> something useful', or maybe ACPI problem (going to/from idle)
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


* Re: Loopback performance from kernel 2.6.12 to 2.6.37
  2010-11-09  0:05         ` Andrew Hendry
@ 2010-11-09  5:22           ` Eric Dumazet
  2010-11-09  6:23             ` Eric Dumazet
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2010-11-09  5:22 UTC (permalink / raw)
  To: Andrew Hendry; +Cc: Jesper Dangaard Brouer, netdev

On Tuesday 09 November 2010 at 11:05 +1100, Andrew Hendry wrote:
> results on an i7 860 @ 2.80Ghz machine, no virtualization involved. 2.6.37-rc1+
> 
> # time dd if=/dev/zero bs=1M count=10000 | netcat  127.0.0.1 9999
> 10000+0 records in
> 10000+0 records out
> 10485760000 bytes (10 GB) copied, 50.2022 s, 209 MB/s
> 
> real	0m50.210s
> user	0m1.094s
> sys	0m57.589s

Thanks !

Could you take a perf snapshot during the test?

# perf record -a -g sleep 10
# perf report





* Re: Loopback performance from kernel 2.6.12 to 2.6.37
  2010-11-09  5:22           ` Eric Dumazet
@ 2010-11-09  6:23             ` Eric Dumazet
  2010-11-09  6:30               ` Andrew Hendry
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2010-11-09  6:23 UTC (permalink / raw)
  To: Andrew Hendry; +Cc: Jesper Dangaard Brouer, netdev

On Tuesday 09 November 2010 at 06:22 +0100, Eric Dumazet wrote:
> On Tuesday 09 November 2010 at 11:05 +1100, Andrew Hendry wrote:
> > results on an i7 860 @ 2.80Ghz machine, no virtualization involved. 2.6.37-rc1+
> > 
> > # time dd if=/dev/zero bs=1M count=10000 | netcat  127.0.0.1 9999
> > 10000+0 records in
> > 10000+0 records out
> > 10485760000 bytes (10 GB) copied, 50.2022 s, 209 MB/s
> > 
> > real	0m50.210s
> > user	0m1.094s
> > sys	0m57.589s
> 
> Thanks !
> 
> Could you take a pef snapshot during the test ?
> 
> # perf record -a -g sleep 10
> # perf report
> 
> 

On my laptop 
Intel(R) Core(TM)2 Duo CPU     T8300  @ 2.40GHz
(2.6.35-22-generic #35-Ubuntu SMP Sat Oct 16 20:45:36 UTC 2010 x86_64
GNU/Linux) :

time dd if=/dev/zero bs=1M count=10000|netcat 127.0.0.1 9999
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 38,2691 s, 274 MB/s

real	0m38.274s
user	0m1.870s
sys	0m38.370s


perf top result :

-------------------------------------------------------------------------------------------------
   PerfTop:    1948 irqs/sec  kernel:90.7%  exact:  0.0% [1000Hz cycles],  (all, 2 CPUs)
-------------------------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ___________________

             1867.00 12.4% copy_user_generic_string    [kernel.kallsyms]  
             1166.00  7.7% __ticket_spin_lock          [kernel.kallsyms]  
              744.00  4.9% __clear_user                [kernel.kallsyms]  
              667.00  4.4% system_call                 [kernel.kallsyms]  
              329.00  2.2% tcp_sendmsg                 [kernel.kallsyms]  
              304.00  2.0% schedule                    [kernel.kallsyms]  
              257.00  1.7% _raw_spin_unlock_irqrestore [kernel.kallsyms]  
              231.00  1.5% fget_light                  [kernel.kallsyms]  
              216.00  1.4% do_poll                     [kernel.kallsyms]  
              203.00  1.3% __read_chk                  /lib/libc-2.12.1.so
              202.00  1.3% __pollwait                  [kernel.kallsyms]  
              201.00  1.3% __poll                      /lib/libc-2.12.1.so
              187.00  1.2% system_call_after_swapgs    [kernel.kallsyms]  
              176.00  1.2% __write                     /lib/libc-2.12.1.so
              173.00  1.1% _raw_spin_lock_irqsave      [kernel.kallsyms]  
              163.00  1.1% tcp_recvmsg                 [kernel.kallsyms]  
              158.00  1.0% do_sys_poll                 [kernel.kallsyms]  
              153.00  1.0% vfs_write                   [kernel.kallsyms]  
              143.00  0.9% pipe_read                   [kernel.kallsyms]  
              141.00  0.9% fput                        [kernel.kallsyms]  
              121.00  0.8% common_file_perm            [kernel.kallsyms]  
              120.00  0.8% _cond_resched               [kernel.kallsyms]  


# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0   1456 120056  51572 2606876    0    0   158    41  254  190  9  2 88  0
 2  0   1456 120140  51580 2606868    0    0    12     0  758 158309 11 76 13  0
 2  0   1456 119520  51588 2606896    0    0     0   176  778 160749  8 80 12  0
 2  0   1456 120388  51588 2606896    0    0     0     0  730 158201  9 76 16  0
 3  0   1456 120388  51588 2606896    0    0     0     0  745 158490  8 76 16  0
 2  0   1456 120520  51588 2606896    0    0     0     0  991 159120  9 78 13  0
 2  0   1456 120024  51588 2606896    0    0     0     0  653 160023 10 79 11  0
 3  0   1456 120520  51588 2606896    0    0     0     0  659 160614  8 78 14  0
 2  0   1456 120272  51596 2606896    0    0     0    80  695 159922 10 75 14  0
 4  0   1456 120272  51596 2606896    0    0     0     0  675 158010  7 79 14  0


# powertop
     PowerTOP version 1.13      (C) 2007 Intel Corporation

< Detailed C-state information is not P-states (frequencies)
                                      Turbo Mode    43.1%
                                        2.40 Ghz    48.0%
                                        2.00 Ghz     8.2%
                                        1.60 Ghz     0.7%
                                        1200 Mhz     0.1%

Wakeups-from-idle per second : 542.9    interval: 10.0s
no ACPI power usage estimate available

Top causes for wakeups:
  21.9% (196.5)   [kernel scheduler] Load balancing tick
  21.2% (190.7)   [Rescheduling interrupts] <kernel IPI>
  12.7% (114.0)   PS/2 keyboard/mouse/touchpad interrupt
  12.0% (107.9)   plugin-containe
  11.1% ( 99.3)   alsa-sink
   6.0% ( 53.8)   firefox-bin
   4.4% ( 39.7)   fping
   3.9% ( 35.2)   Xorg
   1.3% ( 11.3)   [b43] <interrupt>
   1.1% ( 10.0)   ksoftirqd/0
   0.4% (  4.0)D  nagios3
   0.2% (  1.9)D  gnome-terminal
   0.7% (  6.4)   [Thermal event interrupts] <kernel IPI>





* Re: Loopback performance from kernel 2.6.12 to 2.6.37
  2010-11-09  6:23             ` Eric Dumazet
@ 2010-11-09  6:30               ` Andrew Hendry
  2010-11-09  6:38                 ` Eric Dumazet
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Hendry @ 2010-11-09  6:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jesper Dangaard Brouer, netdev

Most of my slowdown was kmemleak being left on.

After fixing that, it is still a lot slower than your dev system.
# time dd if=/dev/zero bs=1M count=10000 | netcat  127.0.0.1 9999
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 25.8182 s, 406 MB/s

real	0m25.821s
user	0m1.502s
sys	0m33.463s

------------------------------------------------------------------------------------------------------------------
   PerfTop:     241 irqs/sec  kernel:56.8%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ______________________________________

             1255.00  8.7% hpet_msi_next_event         /lib/modules/2.6.37-rc1+/build/vmlinux
             1081.00  7.5% copy_user_generic_string    /lib/modules/2.6.37-rc1+/build/vmlinux
              863.00  6.0% __ticket_spin_lock          /lib/modules/2.6.37-rc1+/build/vmlinux
              498.00  3.5% do_sys_poll                 /lib/modules/2.6.37-rc1+/build/vmlinux
              455.00  3.2% system_call                 /lib/modules/2.6.37-rc1+/build/vmlinux
              409.00  2.8% fget_light                  /lib/modules/2.6.37-rc1+/build/vmlinux
              348.00  2.4% tcp_sendmsg                 /lib/modules/2.6.37-rc1+/build/vmlinux
              269.00  1.9% fsnotify                    /lib/modules/2.6.37-rc1+/build/vmlinux
              258.00  1.8% _raw_spin_unlock_irqrestore /lib/modules/2.6.37-rc1+/build/vmlinux
              223.00  1.6% _raw_spin_lock_irqsave      /lib/modules/2.6.37-rc1+/build/vmlinux
              203.00  1.4% __clear_user                /lib/modules/2.6.37-rc1+/build/vmlinux
              184.00  1.3% tcp_poll                    /lib/modules/2.6.37-rc1+/build/vmlinux
              178.00  1.2% vfs_write                   /lib/modules/2.6.37-rc1+/build/vmlinux
              165.00  1.1% tcp_recvmsg                 /lib/modules/2.6.37-rc1+/build/vmlinux
              152.00  1.1% pipe_read                   /lib/modules/2.6.37-rc1+/build/vmlinux
              149.00  1.0% schedule                    /lib/modules/2.6.37-rc1+/build/vmlinux
              135.00  0.9% rw_verify_area              /lib/modules/2.6.37-rc1+/build/vmlinux
              135.00  0.9% __pollwait                  /lib/modules/2.6.37-rc1+/build/vmlinux
              130.00  0.9% __write                     /lib/libc-2.12.1.so
              127.00  0.9% __ticket_spin_unlock        /lib/modules/2.6.37-rc1+/build/vmlinux
              126.00  0.9% __poll                      /lib/libc-2.12.1.so


On Tue, Nov 9, 2010 at 5:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tuesday 09 November 2010 at 06:22 +0100, Eric Dumazet wrote:
>> On Tuesday 09 November 2010 at 11:05 +1100, Andrew Hendry wrote:
>> > results on an i7 860 @ 2.80Ghz machine, no virtualization involved. 2.6.37-rc1+
>> >
>> > # time dd if=/dev/zero bs=1M count=10000 | netcat  127.0.0.1 9999
>> > 10000+0 records in
>> > 10000+0 records out
>> > 10485760000 bytes (10 GB) copied, 50.2022 s, 209 MB/s
>> >
>> > real        0m50.210s
>> > user        0m1.094s
>> > sys 0m57.589s
>>
>> Thanks !
>>
>> Could you take a pef snapshot during the test ?
>>
>> # perf record -a -g sleep 10
>> # perf report
>>
>>
>
> On my laptop
> Intel(R) Core(TM)2 Duo CPU     T8300  @ 2.40GHz
> (2.6.35-22-generic #35-Ubuntu SMP Sat Oct 16 20:45:36 UTC 2010 x86_64
> GNU/Linux) :
>
> time dd if=/dev/zero bs=1M count=10000|netcat 127.0.0.1 9999
> 10000+0 records in
> 10000+0 records out
> 10485760000 bytes (10 GB) copied, 38,2691 s, 274 MB/s
>
> real    0m38.274s
> user    0m1.870s
> sys     0m38.370s
>
>
> perf top result :
>
> -------------------------------------------------------------------------------------------------
>   PerfTop:    1948 irqs/sec  kernel:90.7%  exact:  0.0% [1000Hz cycles],  (all, 2 CPUs)
> -------------------------------------------------------------------------------------------------
>
>             samples  pcnt function                    DSO
>             _______ _____ ___________________________ ___________________
>
>             1867.00 12.4% copy_user_generic_string    [kernel.kallsyms]
>             1166.00  7.7% __ticket_spin_lock          [kernel.kallsyms]
>              744.00  4.9% __clear_user                [kernel.kallsyms]
>              667.00  4.4% system_call                 [kernel.kallsyms]
>              329.00  2.2% tcp_sendmsg                 [kernel.kallsyms]
>              304.00  2.0% schedule                    [kernel.kallsyms]
>              257.00  1.7% _raw_spin_unlock_irqrestore [kernel.kallsyms]
>              231.00  1.5% fget_light                  [kernel.kallsyms]
>              216.00  1.4% do_poll                     [kernel.kallsyms]
>              203.00  1.3% __read_chk                  /lib/libc-2.12.1.so
>              202.00  1.3% __pollwait                  [kernel.kallsyms]
>              201.00  1.3% __poll                      /lib/libc-2.12.1.so
>              187.00  1.2% system_call_after_swapgs    [kernel.kallsyms]
>              176.00  1.2% __write                     /lib/libc-2.12.1.so
>              173.00  1.1% _raw_spin_lock_irqsave      [kernel.kallsyms]
>              163.00  1.1% tcp_recvmsg                 [kernel.kallsyms]
>              158.00  1.0% do_sys_poll                 [kernel.kallsyms]
>              153.00  1.0% vfs_write                   [kernel.kallsyms]
>              143.00  0.9% pipe_read                   [kernel.kallsyms]
>              141.00  0.9% fput                        [kernel.kallsyms]
>              121.00  0.8% common_file_perm            [kernel.kallsyms]
>              120.00  0.8% _cond_resched               [kernel.kallsyms]
>
>
> # vmstat 1
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>  2  0   1456 120056  51572 2606876    0    0   158    41  254  190  9  2 88  0
>  2  0   1456 120140  51580 2606868    0    0    12     0  758 158309 11 76 13  0
>  2  0   1456 119520  51588 2606896    0    0     0   176  778 160749  8 80 12  0
>  2  0   1456 120388  51588 2606896    0    0     0     0  730 158201  9 76 16  0
>  3  0   1456 120388  51588 2606896    0    0     0     0  745 158490  8 76 16  0
>  2  0   1456 120520  51588 2606896    0    0     0     0  991 159120  9 78 13  0
>  2  0   1456 120024  51588 2606896    0    0     0     0  653 160023 10 79 11  0
>  3  0   1456 120520  51588 2606896    0    0     0     0  659 160614  8 78 14  0
>  2  0   1456 120272  51596 2606896    0    0     0    80  695 159922 10 75 14  0
>  4  0   1456 120272  51596 2606896    0    0     0     0  675 158010  7 79 14  0
>
>
> # powertop
>     PowerTOP version 1.13      (C) 2007 Intel Corporation
>
> < Detailed C-state information is not P-states (frequencies)
>                                      Turbo Mode    43.1%
>                                        2.40 Ghz    48.0%
>                                        2.00 Ghz     8.2%
>                                        1.60 Ghz     0.7%
>                                        1200 Mhz     0.1%
>
> Wakeups-from-idle per second : 542.9    interval: 10.0s
> no ACPI power usage estimate available
>
> Top causes for wakeups:
>  21.9% (196.5)   [kernel scheduler] Load balancing tick
>  21.2% (190.7)   [Rescheduling interrupts] <kernel IPI>
>  12.7% (114.0)   PS/2 keyboard/mouse/touchpad interrupt
>  12.0% (107.9)   plugin-containe
>  11.1% ( 99.3)   alsa-sink
>   6.0% ( 53.8)   firefox-bin
>   4.4% ( 39.7)   fping
>   3.9% ( 35.2)   Xorg
>   1.3% ( 11.3)   [b43] <interrupt>
>   1.1% ( 10.0)   ksoftirqd/0
>   0.4% (  4.0)D  nagios3
>   0.2% (  1.9)D  gnome-terminal
>   0.7% (  6.4)   [Thermal event interrupts] <kernel IPI>
>
>
>
>


* Re: Loopback performance from kernel 2.6.12 to 2.6.37
  2010-11-09  6:30               ` Andrew Hendry
@ 2010-11-09  6:38                 ` Eric Dumazet
  2010-11-09  6:42                   ` Eric Dumazet
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2010-11-09  6:38 UTC (permalink / raw)
  To: Andrew Hendry; +Cc: Jesper Dangaard Brouer, netdev

On Tuesday 09 November 2010 at 17:30 +1100, Andrew Hendry wrote:
> most my slowdown was kmemleak left on.
> 
> After fixing its is still a lot slower than your dev system
> .
> # time dd if=/dev/zero bs=1M count=10000 | netcat  127.0.0.1 9999
> 10000+0 records in
> 10000+0 records out
> 10485760000 bytes (10 GB) copied, 25.8182 s, 406 MB/s
> 
> real	0m25.821s
> user	0m1.502s
> sys	0m33.463s
> 
> ------------------------------------------------------------------------------------------------------------------
>    PerfTop:     241 irqs/sec  kernel:56.8%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)
> ------------------------------------------------------------------------------------------------------------------
> 
>              samples  pcnt function                    DSO
>              _______ _____ ___________________________ ______________________________________
> 
>              1255.00  8.7% hpet_msi_next_event         /lib/modules/2.6.37-rc1+/build/vmlinux
>              1081.00  7.5% copy_user_generic_string    /lib/modules/2.6.37-rc1+/build/vmlinux
>               863.00  6.0% __ticket_spin_lock          /lib/modules/2.6.37-rc1+/build/vmlinux
>               498.00  3.5% do_sys_poll                 /lib/modules/2.6.37-rc1+/build/vmlinux
>               455.00  3.2% system_call                 /lib/modules/2.6.37-rc1+/build/vmlinux
>               409.00  2.8% fget_light                  /lib/modules/2.6.37-rc1+/build/vmlinux
>               348.00  2.4% tcp_sendmsg                 /lib/modules/2.6.37-rc1+/build/vmlinux
>               269.00  1.9% fsnotify                    /lib/modules/2.6.37-rc1+/build/vmlinux
>               258.00  1.8% _raw_spin_unlock_irqrestore /lib/modules/2.6.37-rc1+/build/vmlinux
>               223.00  1.6% _raw_spin_lock_irqsave      /lib/modules/2.6.37-rc1+/build/vmlinux
>               203.00  1.4% __clear_user                /lib/modules/2.6.37-rc1+/build/vmlinux
>               184.00  1.3% tcp_poll                    /lib/modules/2.6.37-rc1+/build/vmlinux
>               178.00  1.2% vfs_write                   /lib/modules/2.6.37-rc1+/build/vmlinux
>               165.00  1.1% tcp_recvmsg                 /lib/modules/2.6.37-rc1+/build/vmlinux
>               152.00  1.1% pipe_read                   /lib/modules/2.6.37-rc1+/build/vmlinux
>               149.00  1.0% schedule                    /lib/modules/2.6.37-rc1+/build/vmlinux
>               135.00  0.9% rw_verify_area              /lib/modules/2.6.37-rc1+/build/vmlinux
>               135.00  0.9% __pollwait                  /lib/modules/2.6.37-rc1+/build/vmlinux
>               130.00  0.9% __write                     /lib/libc-2.12.1.so
>               127.00  0.9% __ticket_spin_unlock        /lib/modules/2.6.37-rc1+/build/vmlinux
>               126.00  0.9% __poll                      /lib/libc-2.12.1.so
> 
> 


Hmm, your clock source is HPET, that might explain the problem on a
scheduler intensive workload.

My HP dev machine
# grep . /sys/devices/system/clocksource/clocksource0/*
/sys/devices/system/clocksource/clocksource0/available_clocksource:tsc hpet acpi_pm 
/sys/devices/system/clocksource/clocksource0/current_clocksource:tsc

My laptop:
$ grep . /sys/devices/system/clocksource/clocksource0/*
/sys/devices/system/clocksource/clocksource0/available_clocksource:tsc hpet acpi_pm 
/sys/devices/system/clocksource/clocksource0/current_clocksource:tsc
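
If the corei7 box is on hpet, it could be worth forcing tsc for one run
to compare, e.g. (same sysfs file as above, assuming tsc is listed as
available):

echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource

or booting with clocksource=tsc.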




* Re: Loopback performance from kernel 2.6.12 to 2.6.37
  2010-11-09  6:38                 ` Eric Dumazet
@ 2010-11-09  6:42                   ` Eric Dumazet
  0 siblings, 0 replies; 19+ messages in thread
From: Eric Dumazet @ 2010-11-09  6:42 UTC (permalink / raw)
  To: Andrew Hendry; +Cc: Jesper Dangaard Brouer, netdev

On Tuesday 09 November 2010 at 07:38 +0100, Eric Dumazet wrote:

> Hmm, your clock source is HPET, that might explain the problem on a
> scheduler intensive workload.
> 

And if a packet sniffer (dhclient for example) makes all packets get
timestamped, that can also explain a slowdown, even if there are no
scheduler artifacts.

cat /proc/net/packet

> My HP dev machine
> # grep . /sys/devices/system/clocksource/clocksource0/*
> /sys/devices/system/clocksource/clocksource0/available_clocksource:tsc hpet acpi_pm 
> /sys/devices/system/clocksource/clocksource0/current_clocksource:tsc
> 
> My laptop:
> $ grep . /sys/devices/system/clocksource/clocksource0/*
> /sys/devices/system/clocksource/clocksource0/available_clocksource:tsc hpet acpi_pm 
> /sys/devices/system/clocksource/clocksource0/current_clocksource:tsc
> 





* Re: Loopback performance from kernel 2.6.12 to 2.6.37
  2010-11-08 15:06       ` Loopback performance from kernel 2.6.12 to 2.6.37 Eric Dumazet
  2010-11-09  0:05         ` Andrew Hendry
@ 2010-11-09 13:59         ` Jesper Dangaard Brouer
  2010-11-09 14:06           ` Eric Dumazet
  2010-11-09 14:16           ` Jesper Dangaard Brouer
  1 sibling, 2 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2010-11-09 13:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Mon, 2010-11-08 at 16:06 +0100, Eric Dumazet wrote:
...
> > > > > I noticed that the loopback performance has gotten quite bad:
> > > > > 
> > > > > http://www.phoronix.com/scan.php?page=article&item=linux_2612_2637&num=6
> >

> CC netdev, if you dont mind.

No problem :-)

> Their network test is basically :
> 
> netcat -l 9999 >/dev/null &
> time dd if=/dev/zero bs=1M count=10000 | netcat  127.0.0.1 9999

Should it not be:
 netcat -l -p 9999 >/dev/null &

When I run the commands "dd | netcat", netcat never finishes/exits, and I
have to press Ctrl-C to stop it.  What am I doing wrong? Any tricks?

"dd" reports 17.54 sec, but the "time" measurement cannot be used, as
netcat just hangs/waits for more data...

time dd if=/dev/zero bs=1M count=10000 | netcat 127.0.0.1 9999
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 17,5419 s, 598 MB/s

When I run the commands, I see 261682 context switches per sec...

This quick test was on a Core i7 920 using kernel
2.6.32-rc3-net-next-dev-mp2t.

-- 
Jesper Dangaard Brouer
ComX Networks A/S



* Re: Loopback performance from kernel 2.6.12 to 2.6.37
  2010-11-09 13:59         ` Jesper Dangaard Brouer
@ 2010-11-09 14:06           ` Eric Dumazet
  2010-11-09 14:16           ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 19+ messages in thread
From: Eric Dumazet @ 2010-11-09 14:06 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev

On Tuesday 09 November 2010 at 14:59 +0100, Jesper Dangaard Brouer
wrote:
> On Mon, 2010-11-08 at 16:06 +0100, Eric Dumazet wrote:
> ...
> > > > > > I noticed that the loopback performance has gotten quite bad:
> > > > > > 
> > > > > > http://www.phoronix.com/scan.php?page=article&item=linux_2612_2637&num=6
> > >
> 
> > CC netdev, if you dont mind.
> 
> No problem :-)
> 
> > Their network test is basically :
> > 
> > netcat -l 9999 >/dev/null &
> > time dd if=/dev/zero bs=1M count=10000 | netcat  127.0.0.1 9999
> 
> Should it not be:
>  netcat -l -p 9999 >/dev/null &
> 

It depends on the netcat version. On yours, yes, you need the "-p".

> When I run the commands "dd | netcat", netcat never finish/exits, I have
> to press Ctrl-C to stop it.  What am I doing wrong? Any tricks?
> 
> "dd" reports 17.54 sec, and but the "time" measurement cannot be uses as
> the netcat just hangs/waits for more data...
> 

Your netcat version needs a "-c" switch

time dd if=/dev/zero bs=1M count=10000 | netcat -c 127.0.0.1 9999


> time dd if=/dev/zero bs=1M count=10000 | netcat 127.0.0.1 9999
> 10000+0 records in
> 10000+0 records out
> 10485760000 bytes (10 GB) copied, 17,5419 s, 598 MB/s
> 
> When I run the commands, I see 261682 context switches per sec...
> 
> This quick test were on a Core i7 920 using kernel
> 2.6.32-rc3-net-next-dev-mp2t.
> 

32 or 64bit kernel ?

Thanks !





* Re: Loopback performance from kernel 2.6.12 to 2.6.37
  2010-11-09 13:59         ` Jesper Dangaard Brouer
  2010-11-09 14:06           ` Eric Dumazet
@ 2010-11-09 14:16           ` Jesper Dangaard Brouer
  2010-11-09 14:25             ` Eric Dumazet
  2010-11-09 14:38             ` Jesper Dangaard Brouer
  1 sibling, 2 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2010-11-09 14:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Tue, 2010-11-09 at 14:59 +0100, Jesper Dangaard Brouer wrote:
> On Mon, 2010-11-08 at 16:06 +0100, Eric Dumazet wrote:
> ...
> > > > > > I noticed that the loopback performance has gotten quite bad:
> > > > > > 
> > > > > > http://www.phoronix.com/scan.php?page=article&item=linux_2612_2637&num=6

> > Their network test is basically :
> > 
> > netcat -l 9999 >/dev/null &
> > time dd if=/dev/zero bs=1M count=10000 | netcat  127.0.0.1 9999
> 
> Should it not be:
>  netcat -l -p 9999 >/dev/null &
> 
> When I run the commands "dd | netcat", netcat never finish/exits, I have
> to press Ctrl-C to stop it.  What am I doing wrong? Any tricks?

To fix this I added "-q 0" to netcat.  Thus my working commands are:

 netcat -l -p 9999 >/dev/null &
 time dd if=/dev/zero bs=1M count=10000 | netcat -q0 127.0.0.1 9999

Running this on my "big" 10G testlab system, Dual Xeon 5550 2.67GHz,
kernel version 2.6.32-5-amd64 (which I usually don't use).
The result is 7.487 sec:

time dd if=/dev/zero bs=1M count=10000 | netcat -q0 127.0.0.1 9999
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 7,48562 s, 1,4 GB/s

real    0m7.487s
user    0m0.224s
sys     0m9.785s

Using vmstat I see approx 400000 context switches per sec.

Perf top says:
            samples  pcnt function                  DSO
             _______ _____ _________________________ ___________

             6442.00 16.3% copy_user_generic_string  [kernel]   
             2226.00  5.6% __clear_user              [kernel]   
              912.00  2.3% _spin_lock_irqsave        [kernel]   
              773.00  2.0% _spin_lock_bh             [kernel]   
              736.00  1.9% schedule                  [kernel]   
              582.00  1.5% ipt_do_table              [ip_tables]
              569.00  1.4% _spin_lock                [kernel]   
              505.00  1.3% get_page_from_freelist    [kernel]   
              451.00  1.1% _spin_unlock_irqrestore   [kernel]   
              434.00  1.1% do_select                 [kernel]   
              354.00  0.9% tcp_sendmsg               [kernel]   
              348.00  0.9% tick_nohz_stop_sched_tick [kernel]   
              347.00  0.9% tcp_transmit_skb          [kernel]   
              345.00  0.9% zero_fd_set               [kernel]   

Perf also complains that it cannot resolve netcat debug symbols.


-- 
Jesper Dangaard Brouer
ComX Networks A/S



* Re: Loopback performance from kernel 2.6.12 to 2.6.37
  2010-11-09 14:16           ` Jesper Dangaard Brouer
@ 2010-11-09 14:25             ` Eric Dumazet
  2010-11-18 13:52               ` Eric Dumazet
  2010-11-09 14:38             ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2010-11-09 14:25 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev

On Tuesday 09 November 2010 at 15:16 +0100, Jesper Dangaard Brouer
wrote:
> On Tue, 2010-11-09 at 14:59 +0100, Jesper Dangaard Brouer wrote:
> > On Mon, 2010-11-08 at 16:06 +0100, Eric Dumazet wrote:
> > ...
> > > > > > > I noticed that the loopback performance has gotten quite bad:
> > > > > > > 
> > > > > > > http://www.phoronix.com/scan.php?page=article&item=linux_2612_2637&num=6
> 
> > > Their network test is basically :
> > > 
> > > netcat -l 9999 >/dev/null &
> > > time dd if=/dev/zero bs=1M count=10000 | netcat  127.0.0.1 9999
> > 
> > Should it not be:
> >  netcat -l -p 9999 >/dev/null &
> > 
> > When I run the commands "dd | netcat", netcat never finish/exits, I have
> > to press Ctrl-C to stop it.  What am I doing wrong? Any tricks?
> 
> To fix this I added "-q 0" to netcat.  Thus my working commands are:
> 
>  netcat -l -p 9999 >/dev/null &
>  time dd if=/dev/zero bs=1M count=10000 | netcat -q0 127.0.0.1 9999
> 
> Running this on my "big" 10G testlab system, Dual Xeon 5550 2.67GHz,
> kernel version 2.6.32-5-amd64 (which I usually don't use)
> The results are 7.487 sec:
> 
> time dd if=/dev/zero bs=1M count=10000 | netcat -q0 127.0.0.1 9999
> 10000+0 records in
> 10000+0 records out
> 10485760000 bytes (10 GB) copied, 7,48562 s, 1,4 GB/s
> 
> real    0m7.487s
> user    0m0.224s
> sys     0m9.785s

So far, so good. These are the expected numbers. Now we have to
understand why the Core i7 gets 38 seconds instead of 8 :)





* Re: Loopback performance from kernel 2.6.12 to 2.6.37
  2010-11-09 14:16           ` Jesper Dangaard Brouer
  2010-11-09 14:25             ` Eric Dumazet
@ 2010-11-09 14:38             ` Jesper Dangaard Brouer
  2010-11-10 11:24               ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2010-11-09 14:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Tue, 2010-11-09 at 15:16 +0100, Jesper Dangaard Brouer wrote:
> On Tue, 2010-11-09 at 14:59 +0100, Jesper Dangaard Brouer wrote:
> > On Mon, 2010-11-08 at 16:06 +0100, Eric Dumazet wrote:
> > ...
> > > > > > > I noticed that the loopback performance has gotten quite bad:
> > > > > > > 
> > > > > > > http://www.phoronix.com/scan.php?page=article&item=linux_2612_2637&num=6
> 
> > > Their network test is basically :
> > > 
> > > netcat -l 9999 >/dev/null &
> > > time dd if=/dev/zero bs=1M count=10000 | netcat  127.0.0.1 9999
> > 
> > Should it not be:
> >  netcat -l -p 9999 >/dev/null &
> > 
> > When I run the commands "dd | netcat", netcat never finish/exits, I have
> > to press Ctrl-C to stop it.  What am I doing wrong? Any tricks?
> 
> To fix this I added "-q 0" to netcat.  Thus my working commands are:
> 
>  netcat -l -p 9999 >/dev/null &
>  time dd if=/dev/zero bs=1M count=10000 | netcat -q0 127.0.0.1 9999
> 
> Running this on my "big" 10G testlab system, Dual Xeon 5550 2.67GHz,
> kernel version 2.6.32-5-amd64 (which I usually don't use)
> The results are 7.487 sec

Using kernel 2.6.35.8-comx01+ (which is 35-stable with some minor
patches of my own) on the same type of hardware (our preprod server).
The result is 12 sec.

time dd if=/dev/zero bs=1M count=10000 | netcat -q0 127.0.0.1 9999
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 12,0805 s, 868 MB/s

real    0m12.082s
user    0m0.311s
sys     0m15.896s

BUT perf top reveals that it's probably related to the function
'find_busiest_group' ... any kernel config hints on how to get rid of that?


              samples  pcnt function                    DSO
             _______ _____ ___________________________ ______________

             4152.00 12.8% copy_user_generic_string    [kernel]      
             1802.00  5.6% find_busiest_group          [kernel]      
              852.00  2.6% __clear_user                [kernel]      
              836.00  2.6% _raw_spin_lock_bh           [kernel]      
              819.00  2.5% ipt_do_table                [ip_tables]   
              628.00  1.9% rebalance_domains           [kernel]      
              564.00  1.7% _raw_spin_lock              [kernel]      
              562.00  1.7% _raw_spin_lock_irqsave      [kernel]      
              522.00  1.6% schedule                    [kernel]      
              441.00  1.4% find_next_bit               [kernel]      
              413.00  1.3% _raw_spin_unlock_irqrestore [kernel]      
              394.00  1.2% tcp_sendmsg                 [kernel]      
              391.00  1.2% tcp_packet                  [nf_conntrack]
              368.00  1.1% do_select                   [kernel]      



> Using vmstat I see approx 400000 context switches per sec.
>
Previous:
> Perf top says:
>             samples  pcnt function                  DSO
>              _______ _____ _________________________ ___________
> 
>              6442.00 16.3% copy_user_generic_string  [kernel]   
>              2226.00  5.6% __clear_user              [kernel]   
>               912.00  2.3% _spin_lock_irqsave        [kernel]   
>               773.00  2.0% _spin_lock_bh             [kernel]   
>               736.00  1.9% schedule                  [kernel]   
>               582.00  1.5% ipt_do_table              [ip_tables]
>               569.00  1.4% _spin_lock                [kernel]   
>               505.00  1.3% get_page_from_freelist    [kernel]   
>               451.00  1.1% _spin_unlock_irqrestore   [kernel]   
>               434.00  1.1% do_select                 [kernel]   
>               354.00  0.9% tcp_sendmsg               [kernel]   
>               348.00  0.9% tick_nohz_stop_sched_tick [kernel]   
>               347.00  0.9% tcp_transmit_skb          [kernel]   
>               345.00  0.9% zero_fd_set               [kernel]   


-- 
Jesper Dangaard Brouer
ComX Networks A/S



* Re: Loopback performance from kernel 2.6.12 to 2.6.37
@ 2010-11-09 21:35 Xose Vazquez Perez
  2010-11-10  8:49 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 19+ messages in thread
From: Xose Vazquez Perez @ 2010-11-09 21:35 UTC (permalink / raw)
  To: netdev, jdb

Jesper Dangaard Brouer wrote:

> To fix this I added "-q 0" to netcat.  Thus my working commands are:
> 
>  netcat -l -p 9999 >/dev/null &
>  time dd if=/dev/zero bs=1M count=10000 | netcat -q0 127.0.0.1 9999
> 
> Running this on my "big" 10G testlab system, Dual Xeon 5550 2.67GHz,
> kernel version 2.6.32-5-amd64 (which I usually don't use)
> The results are 7.487 sec:

netcat flavor ?

http://nc110.sourceforge.net/
http://nmap.org/ncat/
http://www.dest-unreach.org/socat/
http://cryptcat.sourceforge.net/
http://netcat.sourceforge.net/
http://www.openbsd.org/cgi-bin/cvsweb/src/usr.bin/nc/

-- 
«Let blind kings wage fierce war over there for one more span of land;
for here I hold as mine all that the wild sea embraces, upon which no one
ever imposed laws. And there is no beach, whichever it may be, nor flag
of splendor, that does not feel my right and pay tribute to my valor.»


* Re: Loopback performance from kernel 2.6.12 to 2.6.37
  2010-11-09 21:35 Xose Vazquez Perez
@ 2010-11-10  8:49 ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2010-11-10  8:49 UTC (permalink / raw)
  To: Xose Vazquez Perez; +Cc: netdev

On Tue, 2010-11-09 at 22:35 +0100, Xose Vazquez Perez wrote:
> Jesper Dangaard Brouer wrote:
> 
> > To fix this I added "-q 0" to netcat.  Thus my working commands are:
> > 
> >  netcat -l -p 9999 >/dev/null &
> >  time dd if=/dev/zero bs=1M count=10000 | netcat -q0 127.0.0.1 9999
> > 
> > Running this on my "big" 10G testlab system, Dual Xeon 5550 2.67GHz,
> > kernel version 2.6.32-5-amd64 (which I usually don't use)
> > The results are 7.487 sec:
> 
> netcat flavor ?

Debian package netcat-traditional
 netcat version [v1.10-38]

>From "aptitude show netcat-traditional":
 This is the "classic" netcat, written by *Hobbit*. It lacks many
features found in netcat-openbsd.

Didn't know there were that many flavors...

> http://nc110.sourceforge.net/
> http://nmap.org/ncat/
> http://www.dest-unreach.org/socat/
> http://cryptcat.sourceforge.net/
> http://netcat.sourceforge.net/
> http://www.openbsd.org/cgi-bin/cvsweb/src/usr.bin/nc/
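
If in doubt about which flavor a given box actually runs, something like
this works on a Debian-style system (commands are just a sketch):

 readlink -f $(which netcat)
 dpkg -S $(readlink -f $(which netcat))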


-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network Kernel Developer
  Cand. Scient Datalog / MSc.CS
  Author of http://adsl-optimizer.dk
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: Loopback performance from kernel 2.6.12 to 2.6.37
  2010-11-09 14:38             ` Jesper Dangaard Brouer
@ 2010-11-10 11:24               ` Jesper Dangaard Brouer
  2010-12-12 15:48                 ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2010-11-10 11:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, acme

On Tue, 2010-11-09 at 15:38 +0100, Jesper Dangaard Brouer wrote:
> On Tue, 2010-11-09 at 15:16 +0100, Jesper Dangaard Brouer wrote:
> > On Tue, 2010-11-09 at 14:59 +0100, Jesper Dangaard Brouer wrote:
> > > On Mon, 2010-11-08 at 16:06 +0100, Eric Dumazet wrote:
> > > ...
> > 
> > To fix this I added "-q 0" to netcat.  Thus my working commands are:
> > 
> >  netcat -l -p 9999 >/dev/null &
> >  time dd if=/dev/zero bs=1M count=10000 | netcat -q0 127.0.0.1 9999
> > 
> > Running this on my "big" 10G testlab system, Dual Xeon 5550 2.67GHz,
> > kernel version 2.6.32-5-amd64 (which I usually don't use)
> > The results are 7.487 sec
> 
> Using kernel 2.6.35.8-comx01+ (which is 35-stable with some minor
> patches of my own) on the same type of hardware (our preprod server).
> The result is 12 sec.
> 
> time dd if=/dev/zero bs=1M count=10000 | netcat -q0 127.0.0.1 9999
> 10000+0 records in
> 10000+0 records out
> 10485760000 bytes (10 GB) copied, 12,0805 s, 868 MB/s
> 
> real    0m12.082s
> user    0m0.311s
> sys     0m15.896s

On the same system I can get better performance IF I pin the processes on
different CPUs. BUT the trick here is that I choose CPUs with different "core
id", thus avoiding the HT sibling CPUs in the system (hint: look in
/proc/cpuinfo when choosing the CPUs).

Commands:
 taskset 16 netcat -lv -p 9999 >/dev/null &
 time taskset 1 dd if=/dev/zero bs=1M count=10000 | taskset 4 netcat -q0 127.0.0.1 9999

Result:
 10485760000 bytes (10 GB) copied, 8,74021 s, 1,2 GB/s
 real    0m8.742s
 user    0m0.208s
 sys     0m11.426s

So, perhaps the Core i7 has a problem with the HT CPUs with this
workload?

Forcing dd and netcat on the same HT CPU gives a result of approx 18
sec!

Commands:
 taskset 16 netcat -lv -p 9999 >/dev/null
 time taskset 1 dd if=/dev/zero bs=1M count=10000 | taskset 2 netcat -q0 127.0.0.1 9999

Result:
 10485760000 bytes (10 GB) copied, 18,6575 s, 562 MB/s
 real    0m18.659s
 user    0m0.341s
 sys     0m18.969s
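
(To pick the taskset masks, the sibling pairs can also be read directly
from sysfs; a quick sketch, standard topology paths assumed:

for c in /sys/devices/system/cpu/cpu[0-9]*; do
  echo "$c core=$(cat $c/topology/core_id) siblings=$(cat $c/topology/thread_siblings_list)"
done
)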


> BUT perf top reveals that its probably related to the function
> 'find_busiest_group' ... any kernel config hints how I get rid of that?

The 'find_busiest_group' seems to be an artifact of "perf top"; if I use
"perf record" then the 'find_busiest_group' function disappears.  Which
is kind of strange, as 'find_busiest_group' seems to be related to
sched_fair.c.

perf --version
perf version 2.6.35.7.1.g60d9c

-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network Kernel Developer
  Cand. Scient Datalog / MSc.CS
  Author of http://adsl-optimizer.dk
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: Loopback performance from kernel 2.6.12 to 2.6.37
  2010-11-09 14:25             ` Eric Dumazet
@ 2010-11-18 13:52               ` Eric Dumazet
  2010-11-18 17:41                 ` Eric Dumazet
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2010-11-18 13:52 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev, David Miller

On Tuesday 09 November 2010 at 15:25 +0100, Eric Dumazet wrote:

> So far, so good. These are the expected numbers. Now we have to
> understand why corei7 gets 38 seconds instead of 8 :)
> 
> 

My tests show a problem with backlog processing, and too big TCP
windows (at least on loopback and with wild senders).

Basically, with the huge TCP windows we have now (default 4 Mbytes),
the reader process can have to process up to 4 Mbytes of backlogged data
in __release_sock() before returning from its 'small' read(fd, buffer,
1024) done by netcat.

While it processes this backlog, it sends TCP ACKs, allowing the sender to
send new frames that might be dropped because of sk_rcvqueues_full(), or to
continue filling the receive queue up to the receiver window, feeding the
task in __release_sock() [loop].


This blows CPU caches completely [data is queued, and the dequeue is
done long after], and the latency of a single read() can be very high. This
blocks the pipeline of user processing eventually.


<digress>
I also understand why UDP latencies are so impacted. If we receive a
burst of frames on the same socket, the user process reading the first frame
might be forced to process the backlog before returning to userland.

Really we must zap lock_sock() from the UDP input path.

commit 95766fff6b9a78d1 ([UDP]: Add memory accounting) was a big error.
</digress>



On my server machine with 6 Mbytes of L2 cache, you don't see the problem,
while on my laptop with 3 Mbytes of L2 cache, you can see it.

I caught this because of the new SNMP counter added in 2.6.34
(TCPBacklogDrop), which could easily take 1000 increments during the
test.
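
Another quick way to confirm the window size is the culprit is to clamp
TCP autotuning for one run, for instance (sysctl names as in current
kernels, values only an example):

sysctl -w net.ipv4.tcp_rmem='4096 87380 262144'
sysctl -w net.ipv4.tcp_wmem='4096 65536 262144'

and watch the timings and the TCPBacklogDrop counter.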


I built a test program, maybe easier to use than the various netcat flavors.
It also uses only two tasks, which is better if you have a Core 2 Duo like
me on my laptop ;)

To reproduce the problem, run it with option -l 4M

$ netstat -s|grep TCPBacklogDrop
    TCPBacklogDrop: 788
$ time ./loopback_transfert -l 1k;netstat -s|grep TCPBacklogDrop

real	0m14.013s
user	0m0.630s
sys	0m13.250s
    TCPBacklogDrop: 788
$ time ./loopback_transfert -l 128k;netstat -s|grep TCPBacklogDrop

real	0m7.447s
user	0m0.030s
sys	0m5.490s
    TCPBacklogDrop: 789
$ time ./loopback_transfert -l 1M;netstat -s|grep TCPBacklogDrop

real	0m11.206s
user	0m0.020s
sys	0m7.150s
    TCPBacklogDrop: 793
$ time ./loopback_transfert -l 4M;netstat -s|grep TCPBacklogDrop

real	0m10.347s
user	0m0.000s
sys	0m6.120s
    TCPBacklogDrop: 1510
$ time ./loopback_transfert -l 16k;netstat -s|grep TCPBacklogDrop

real	0m6.810s
user	0m0.040s
sys	0m6.670s
    TCPBacklogDrop: 1511
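
For reference, the program below is self-contained C; building and running
it could look like this (file name and compiler flags are just an example):

gcc -O2 -Wall -o loopback_transfert loopback_transfert.c
time ./loopback_transfert -l 4M
time ./loopback_transfert -l 4M -s 64k

The -s run caps SO_SNDBUF/SO_RCVBUF on both sockets, which is another way
to test the large-window theory.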


/*
 * Very simple program to test TCP loopback speed
 * This came from the phoronix benchmark using following :
 *
 * netcat -d -l 9999 >/dev/null &
 * time dd if=/dev/zero bs=1M count=10000 | netcat 127.0.0.1 9999
 *
 * Problem is the benchmark also use pipe subsystem, and three tasks,
 * while the following program uses only TCP subsystem and two tasks.
 * I still use small blocksize (netcat apparently use 1Kbyte blocks)
 *
 * Options :
 *            -l  blocksize   (in bytes, default : 1024)
 *            -s  socket SNDBUF/RCVBUF (default : system defaults (too big))
 */

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

long long amount_to_transfert = 10*1024*1024*1024LL; /* 10 Go */
unsigned int blocksize = 1024; /* to mimic netcat very pessimistic behavior */
unsigned int socket_bufsize = 0;

static void Server(int fdlisten)
{
	int newfd;
	struct sockaddr_in sockaddr;
	socklen_t len = sizeof(sockaddr);
	char *buffer;
	long total = 0;
	int ret;

	buffer = malloc(blocksize);
	newfd = accept(fdlisten, (struct sockaddr *)&sockaddr, &len);

	if (newfd == -1) {
		perror("accept");
		exit(1);
	}
	close(fdlisten);
	if (socket_bufsize)
		setsockopt(newfd, SOL_SOCKET, SO_RCVBUF, &socket_bufsize, 4);
	while (1) {
		ret = read(newfd, buffer, blocksize);
		if (ret <= 0) break;
		total += ret;
		}
	close(newfd);
	_exit(0);
}

static void usage(int code)
{
	exit(code);
}

static long scansize(char *str)
{
	char *end;
	long res = strtol(str, &end, 0);
	if (end) {
		if (*end == 'k') res <<= 10;
		if (*end == 'M') res <<= 20;
	}
	return res;
}

int main(int argc, char *argv[])
{
	int i;
	struct sockaddr_in sockaddr;
	socklen_t slen = sizeof(sockaddr);
	int fdlisten, fd;
	int port;
	char *buffer;
	long long total = 0;
	int ret = 0;

	while ((i = getopt(argc, argv, "l:s:")) != EOF) {
		if (i == 'l')
			blocksize = scansize(optarg);
		else if (i == 's')
			socket_bufsize = scansize(optarg);
		else usage(1);
	}
	buffer = calloc(blocksize, 1);
	fdlisten = socket(AF_INET, SOCK_STREAM, 0);
	if (fdlisten == -1) {
		perror("socket");
		return 1;
	}
	memset(&sockaddr, 0, sizeof(sockaddr));
	sockaddr.sin_family = AF_INET;
	sockaddr.sin_port = 0;
	sockaddr.sin_addr.s_addr = htonl(0x7f000001);
	if (bind(fdlisten, (struct sockaddr *)&sockaddr, sizeof(sockaddr))== -1) {
		perror("bind()");
		return 1;
	}
	if (listen(fdlisten, 10)== -1) {
		perror("listen");
		return 1;
	}
	getsockname(fdlisten, (struct sockaddr *)&sockaddr, &slen);
	port = ntohs(sockaddr.sin_port);

	if (fork() == 0)
		Server(fdlisten);

	close(fdlisten);
	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd == -1) {
		perror("socket");
		return -1;
	}
	memset(&sockaddr, 0, sizeof(sockaddr));
	sockaddr.sin_family = AF_INET;
	sockaddr.sin_port = htons(port);
	sockaddr.sin_addr.s_addr = htonl(0x7f000001);
	if (socket_bufsize)
		setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &socket_bufsize, 4);
	connect(fd, (struct sockaddr *)&sockaddr, sizeof(sockaddr));
	while (total < amount_to_transfert) {
		ret = write(fd, buffer, blocksize);
		if (ret <= 0) break;
		total += ret;
	}
	close(fd);
	return 0;
}




* Re: Loopback performance from kernel 2.6.12 to 2.6.37
  2010-11-18 13:52               ` Eric Dumazet
@ 2010-11-18 17:41                 ` Eric Dumazet
  2010-11-18 17:48                   ` David Miller
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2010-11-18 17:41 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev, David Miller

On Thursday 18 November 2010 at 14:52 +0100, Eric Dumazet wrote:
> On Tuesday 09 November 2010 at 15:25 +0100, Eric Dumazet wrote:
> 
> > So far, so good. These are the expected numbers. Now we have to
> > understand why corei7 gets 38 seconds instead of 8 :)
> > 
> > 
> 
> My tests show a problem with backlog processing, and too big TCP
> windows. (at least on loopback and wild senders)
> 
> Basically, with huge tcp windows we have now (default 4 Mbytes),
> the reader process can have to process up to 4Mbytes of backlogged data
> in __release_sock() before returning from its 'small' read(fd, buffer,
> 1024) done by netcat.
> 
> While it processes this backlog, it sends tcp ACKS, allowing sender to
> send new frames that might be dropped because of sk_rcvqueues_full(), or
> continue to fill receive queue up to the receiver window, feeding the
> task in __release_sock() [loop]
> 
> 
> This blows cpu caches completely [data is queued, and the dequeue is
> done long after], and latency of a single read() can be very high. This
> blocks the pipeline of user processing eventually.
> 
> 
> <disgress>
> I also understand why UDP latencies are so impacted. If we receive a
> burst of frames on same socket, the user process reading first frame
> might be forced to process the backlog before returning to userland.
> 
> Really we must zap lock_sock() from UDP input path.
> 
> commit 95766fff6b9a78d1 ([UDP]: Add memory accounting) was a big error.
> </disgress>
> 
> 
> 
> On my server machine with 6Mbytes of L2 cache, you dont see the problem,
> while on my laptop with 3Mbytes of L2 cache, you can see the problem.
> 
> I caught this because of new SNMP counter added in 2.6.34
> (TCPBacklogDrop), that could easily take 1000 increments during the
> test.
> 
> 
> I built a test program, maybe easier to use than various netcat flavors
> It also use two tasks only, thats better if you have a core 2 Duo like
> me on my laptop ;)
> 
> To reproduce the problem, run it with option -l 4M
> 
> $ netstat -s|grep TCPBacklogDrop
>     TCPBacklogDrop: 788
> $ time ./loopback_transfert -l 1k;netstat -s|grep TCPBacklogDrop
> 
> real	0m14.013s
> user	0m0.630s
> sys	0m13.250s
>     TCPBacklogDrop: 788
> $ time ./loopback_transfert -l 128k;netstat -s|grep TCPBacklogDrop
> 
> real	0m7.447s
> user	0m0.030s
> sys	0m5.490s
>     TCPBacklogDrop: 789
> $ time ./loopback_transfert -l 1M;netstat -s|grep TCPBacklogDrop
> 
> real	0m11.206s
> user	0m0.020s
> sys	0m7.150s
>     TCPBacklogDrop: 793
> $ time ./loopback_transfert -l 4M;netstat -s|grep TCPBacklogDrop
> 
> real	0m10.347s
> user	0m0.000s
> sys	0m6.120s
>     TCPBacklogDrop: 1510
> $ time ./loopback_transfert -l 16k;netstat -s|grep TCPBacklogDrop
> 
> real	0m6.810s
> user	0m0.040s
> sys	0m6.670s
>     TCPBacklogDrop: 1511
> 

I forgot to include test results for my dev machine (server class
machine, 8 Mbytes of L2 cache) NUMA 
2 sockets : Intel(R) Xeon(R) CPU           E5540  @ 2.53GHz

# netstat -s|grep TCPBacklogDrop
    TCPBacklogDrop: 8891
# time ./loopback_transfert -l 16k;netstat -s|grep TCPBacklogDrop

real	0m7.033s
user	0m0.010s
sys	0m4.580s
    TCPBacklogDrop: 9239
# time ./loopback_transfert -l 1M;netstat -s|grep TCPBacklogDrop

real	0m5.408s
user	0m0.000s
sys	0m2.880s
    TCPBacklogDrop: 9243
# time ./loopback_transfert -l 4M;netstat -s|grep TCPBacklogDrop

real	0m2.965s
user	0m0.000s
sys	0m2.070s
    TCPBacklogDrop: 10485
# time ./loopback_transfert -l 6M;netstat -s|grep TCPBacklogDrop

real	0m7.711s
user	0m0.000s
sys	0m3.180s
    TCPBacklogDrop: 13537
# time ./loopback_transfert -l 8M;netstat -s|grep TCPBacklogDrop

real	0m11.497s
user	0m0.020s
sys	0m3.830s
    TCPBacklogDrop: 17108


As soon as our working set is larger than L2 cache, this is very slow.

for the -l 8M bench :

# Events: 7K cycles
#
# Overhead  Command      Shared Object                               Symbol
# ........  .......  .................  ...................................
#
    40.97%  loopback_transf  [kernel.kallsyms]  [k] copy_user_generic_string
    18.57%    :8968  [kernel.kallsyms]  [k] copy_user_generic_string
     3.54%    :8968  [kernel.kallsyms]  [k] get_page_from_freelist
     3.36%    :8968  [kernel.kallsyms]  [k] tcp_sendmsg
     1.17%    :8968  [kernel.kallsyms]  [k] put_page
     0.99%    :8968  [kernel.kallsyms]  [k] free_hot_cold_page
     0.99%    :8968  [kernel.kallsyms]  [k] __might_sleep
     0.88%    :8968  [kernel.kallsyms]  [k] __ticket_spin_lock
     0.81%  loopback_transf  [kernel.kallsyms]  [k] free_pcppages_bulk
     0.79%    :8968  [kernel.kallsyms]  [k] __alloc_pages_nodemask
     0.63%  loopback_transf  [kernel.kallsyms]  [k] put_page
     0.63%  loopback_transf  [kernel.kallsyms]  [k] __might_sleep
     0.63%  loopback_transf  [kernel.kallsyms]  [k] tcp_transmit_skb
     0.57%    :8968  [kernel.kallsyms]  [k] skb_release_data
     0.55%  loopback_transf  [kernel.kallsyms]  [k] free_hot_cold_page
     0.53%    :8968  [kernel.kallsyms]  [k] tcp_ack
     0.50%  loopback_transf  [kernel.kallsyms]  [k] __inet_lookup_established
     0.49%  loopback_transf  [kernel.kallsyms]  [k] skb_copy_datagram_iovec
     0.47%    :8968  [kernel.kallsyms]  [k] __rmqueue
     0.45%    :8968  [kernel.kallsyms]  [k] get_pageblock_flags_group
     0.41%    :8968  [kernel.kallsyms]  [k] zone_watermark_ok
     0.41%    :8968  [kernel.kallsyms]  [k] __inc_zone_state
     0.40%  loopback_transf  [kernel.kallsyms]  [k] skb_release_data
     0.39%    :8968  [kernel.kallsyms]  [k] tcp_transmit_skb
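
(The cache sizes for a given box can be read from sysfs, e.g.:

grep . /sys/devices/system/cpu/cpu0/cache/index*/size

which makes it easy to relate the knee point above to the L2/L3 size.)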




* Re: Loopback performance from kernel 2.6.12 to 2.6.37
  2010-11-18 17:41                 ` Eric Dumazet
@ 2010-11-18 17:48                   ` David Miller
  0 siblings, 0 replies; 19+ messages in thread
From: David Miller @ 2010-11-18 17:48 UTC (permalink / raw)
  To: eric.dumazet; +Cc: jdb, netdev, lawrence

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 18 Nov 2010 18:41:53 +0100

> On Thursday 18 November 2010 at 14:52 +0100, Eric Dumazet wrote:
>> On Tuesday 09 November 2010 at 15:25 +0100, Eric Dumazet wrote:
>> 
>> My tests show a problem with backlog processing, and too big TCP
>> windows. (at least on loopback and wild senders)
>> 
>> Basically, with huge tcp windows we have now (default 4 Mbytes),
>> the reader process can have to process up to 4Mbytes of backlogged data
>> in __release_sock() before returning from its 'small' read(fd, buffer,
>> 1024) done by netcat.
>> 
>> While it processes this backlog, it sends tcp ACKS, allowing sender to
>> send new frames that might be dropped because of sk_rcvqueues_full(), or
>> continue to fill receive queue up to the receiver window, feeding the
>> task in __release_sock() [loop]
>> 
>> 
>> This blows cpu caches completely [data is queued, and the dequeue is
>> done long after], and latency of a single read() can be very high. This
>> blocks the pipeline of user processing eventually.

Thanks for looking into this Eric.

We definitely need some kind of choke point so that TCP never
significantly exceeds the congestion window point at which throughput
stops increasing (and only latency does).

One idea is that when we integrate Lawrence Brakmo's TCP-NV congestion
control algorithm, we can try enabling it by default over loopback.

Loopback is kind of an interesting case of the problem scenario
Lawrence is trying to solve.



* Re: Loopback performance from kernel 2.6.12 to 2.6.37
  2010-11-10 11:24               ` Jesper Dangaard Brouer
@ 2010-12-12 15:48                 ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 19+ messages in thread
From: Arnaldo Carvalho de Melo @ 2010-12-12 15:48 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Eric Dumazet, netdev

On Wed, Nov 10, 2010 at 12:24:16PM +0100, Jesper Dangaard Brouer wrote:
> > BUT perf top reveals that its probably related to the function
> > 'find_busiest_group' ... any kernel config hints how I get rid of that?
> 
> The 'find_busiest_group' seems to be an artifact of "perf top", if I use
> "perf record" then the 'find_busiest_group' function disappears.  Which
> is kind of strange, as 'find_busiest_group' seem the be related to
> sched_fair.c.
> 
> perf --version
> perf version 2.6.35.7.1.g60d9c

perf top does systemwide sampling, while when you use 'perf record
./workload' it is not systemwide.

Take a look at 'perf top --help' to see how to limit this to an existing
pid, tid, cpu list, etc.
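
For example, something like (exact option spellings may differ between
perf versions, see --help):

perf top -p `pidof netcat`
perf top -C 2

to restrict the sampling to one existing pid or to one CPU.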

- Arnaldo


Thread overview: 19+ messages
     [not found] <1288954189.28003.178.camel@firesoul.comx.local>
     [not found] ` <1288988955.2665.297.camel@edumazet-laptop>
     [not found]   ` <1289213926.15004.19.camel@firesoul.comx.local>
     [not found]     ` <1289214289.2820.188.camel@edumazet-laptop>
2010-11-08 15:06       ` Loopback performance from kernel 2.6.12 to 2.6.37 Eric Dumazet
2010-11-09  0:05         ` Andrew Hendry
2010-11-09  5:22           ` Eric Dumazet
2010-11-09  6:23             ` Eric Dumazet
2010-11-09  6:30               ` Andrew Hendry
2010-11-09  6:38                 ` Eric Dumazet
2010-11-09  6:42                   ` Eric Dumazet
2010-11-09 13:59         ` Jesper Dangaard Brouer
2010-11-09 14:06           ` Eric Dumazet
2010-11-09 14:16           ` Jesper Dangaard Brouer
2010-11-09 14:25             ` Eric Dumazet
2010-11-18 13:52               ` Eric Dumazet
2010-11-18 17:41                 ` Eric Dumazet
2010-11-18 17:48                   ` David Miller
2010-11-09 14:38             ` Jesper Dangaard Brouer
2010-11-10 11:24               ` Jesper Dangaard Brouer
2010-12-12 15:48                 ` Arnaldo Carvalho de Melo
2010-11-09 21:35 Xose Vazquez Perez
2010-11-10  8:49 ` Jesper Dangaard Brouer
