* Kernel forwarding performance test regressions
@ 2009-08-19 18:00 Stephen Hemminger
From: Stephen Hemminger @ 2009-08-19 18:00 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Vyatta regularly runs RFC2544 performance tests as part of
the QA release regression tests. These tests are run using
a Spirent analyzer that sends packets at maximum rate and
measures the number of packets received.

The interesting (worst-case) number is the forwarding percentage for
minimum-size Ethernet packets.  For packets of 1K and above, all the packets
get through, but at smaller sizes the system can't keep up.

The hardware is Dell-based:
CPU is an Intel Dual Core E2220 @ 2.40GHz (or 2.2GHz),
NICs are internal Broadcom (tg3).

Size	2.6.23	2.6.24	2.6.26	2.6.29	2.6.30
64	 14%	 20%	 21%	 17%	 19%
128	 22	 33	 34	 28	 32
256	 37	 52	 58	 49	 54
512	 67	 85	 83	 85	 85
1024	100	100	100	100	100
1280	100	100	100	100	100
1518	100	100	100	100	100


Some other details: 
  * Hardware changed between the 2.6.24 and 2.6.26 runs:
    the CPU went from 2.2 to 2.4GHz

  * no SMP affinity (or irqbalance) is set up;
    numbers are significantly better if IRQs are pinned:
    2.6.26 goes from 20% to 32%

  * unidirectional numbers are 2X the bidirectional numbers:
    2.6.26 goes from 20% to 40%

  * this is single stream (doesn't help/use multiqueue)

  * system loads iptables but does not use it, so each packet
    sees the overhead of null rules.

So kernel 2.6.29 had an observable dip in performance
which seems to be mostly recovered in 2.6.30.

These are from our QA, not me, so please don't ask me to
"rerun with XX enabled"; go run the same test
yourself with pktgen.



* Re: Kernel forwarding performance test regressions
From: Eric Dumazet @ 2009-08-25  9:47 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, netdev, Robert Olsson

Stephen Hemminger wrote:
> Vyatta regularly runs RFC2544 performance tests as part of
> the QA release regression tests. These tests are run using
> a Spirent analyzer that sends packets at maximum rate and
> measures the number of packets received.
> 
> The interesting (worst-case) number is the forwarding percentage for
> minimum-size Ethernet packets.  For packets of 1K and above, all the packets
> get through, but at smaller sizes the system can't keep up.
> 
> The hardware is Dell-based:
> CPU is an Intel Dual Core E2220 @ 2.40GHz (or 2.2GHz),
> NICs are internal Broadcom (tg3).
> 
> Size	2.6.23	2.6.24	2.6.26	2.6.29	2.6.30
> 64	 14%	 20%	 21%	 17%	 19%
> 128	 22	 33	 34	 28	 32
> 256	 37	 52	 58	 49	 54
> 512	 67	 85	 83	 85	 85
> 1024	100	100	100	100	100
> 1280	100	100	100	100	100
> 1518	100	100	100	100	100
> 
> 
> Some other details: 
>   * Hardware changed between the 2.6.24 and 2.6.26 runs:
>     the CPU went from 2.2 to 2.4GHz
> 
>   * no SMP affinity (or irqbalance) is set up;
>     numbers are significantly better if IRQs are pinned:
>     2.6.26 goes from 20% to 32%

That's strange, because at gigabit flood level we should be in NAPI mode,
with ksoftirqd using 100% of one CPU. SMP affinities should not matter at all...
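
For illustration, a generic NAPI poll skeleton (a sketch only;
my_rx_process() and my_enable_rx_irq() are hypothetical helpers, not from
any real driver):

	static int my_poll(struct napi_struct *napi, int budget)
	{
		/* under sustained flood the budget is always exhausted, so
		 * the device stays in polled mode: rx runs entirely in
		 * ksoftirqd and the rx interrupt stays masked */
		int work = my_rx_process(napi, budget);

		if (work < budget) {
			napi_complete(napi);	/* leave polled mode */
			my_enable_rx_irq(napi);	/* re-enable rx interrupts */
		}
		return work;
	}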

> 
>   * unidirectional numbers are 2X the bidirectional numbers:
>     2.6.26 goes from 20% to 40%
> 
>   * this is single stream (doesn't help/use multiqueue)
> 
>   * system loads iptables but does not use it, so each packet
>     sees the overhead of null rules.
> 
> So kernel 2.6.29 had an observable dip in performance
> which seems to be mostly recovered in 2.6.30.
> 
> These are from our QA, not me, so please don't ask me to
> "rerun with XX enabled"; go run the same test
> yourself with pktgen.
> 

Unfortunately I cannot reach line rate with pktgen and small packets.
(Limit ~1012333pps 485Mb/sec on my test machine, 3GHz E5450 CPU)

It seems timestamping is too expensive in pktgen, even with "delay 0"
and a single-device setup (where next_to_run() doesn't have to select the 'best' device).
We could probably improve pktgen a little, or use faster timestamping...

oprofile results on pktgen machine (linux 2.6.30.5):
CPU: Core 2, speed 3000.08 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
58137    58137         27.9549  27.9549    read_tsc
51487    109624        24.7573  52.7122    pktgen_thread_worker
33079    142703        15.9059  68.6181    getnstimeofday
15694    158397         7.5464  76.1645    getCurUs
11806    170203         5.6769  81.8413    do_gettimeofday
5852     176055         2.8139  84.6553    kthread_should_stop
5244     181299         2.5216  87.1768    kthread
4181     185480         2.0104  89.1872    mwait_idle
3837     189317         1.8450  91.0322    consume_skb
2217     191534         1.0660  92.0983    skb_dma_unmap
1599     193133         0.7689  92.8671    skb_dma_map
1389     194522         0.6679  93.5350    local_bh_enable_ip
1350     195872         0.6491  94.1842    nommu_map_page
1086     196958         0.5222  94.7064    mix_pool_bytes_extract
835      197793         0.4015  95.1079    apic_timer_interrupt
774      198567         0.3722  95.4801    irq_entries_start
450      199017         0.2164  95.6964    timer_stats_update_stats
404      199421         0.1943  95.8907    scheduler_tick
403      199824         0.1938  96.0845    find_busiest_group
336      200160         0.1616  96.2460    local_bh_disable
332      200492         0.1596  96.4057    rb_get_reader_page
329      200821         0.1582  96.5639    ring_buffer_consume
267      201088         0.1284  96.6923    add_timer_randomness




I see 0.1% drops around 635085pps 284Mb/sec on my dev machine
(using VLAN and bonding, bidirectional, output device = input device).

Some notes:

- Small packets hit the copybreak (mis)feature (that tg3 and other drivers use),
and we know this slows down forwarding. There is no real difference on small
packets anyway, since we need to read the packet to process it (one cache line).
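
For reference, the pattern is roughly this (an illustrative sketch, not
tg3's actual code; rx_copybreak() and RX_COPYBREAK are assumed names):

	/* Frames below the threshold are copied into a small fresh skb so
	 * the large rx buffer can be recycled: a memcpy per packet is
	 * traded for cheaper buffer management. */
	static struct sk_buff *rx_copybreak(struct net_device *dev,
					    struct sk_buff *skb,
					    unsigned int len)
	{
		struct sk_buff *copy;

		if (len >= RX_COPYBREAK)
			return skb;	/* large frame: hand up rx buffer */

		copy = netdev_alloc_skb(dev, len + NET_IP_ALIGN);
		if (!copy)
			return skb;

		skb_reserve(copy, NET_IP_ALIGN);
		skb_copy_from_linear_data(skb, copy->data, len);
		skb_put(copy, len);
		/* caller recycles the original rx buffer */
		return copy;
	}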


- neigh_resolve_output() has a cost because of the atomic ops
in read_lock_bh(&neigh->lock)/read_unlock_bh(&neigh->lock).
This might be a candidate for RCU conversion?
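
A rough sketch of the idea (the current hot path is simplified from
2.6.30; the RCU variant is hypothetical and assumes neigh->ha updates
are first made safe for lockless readers):

	/* today: two atomic ops per packet just to read neigh->ha */
	read_lock_bh(&neigh->lock);
	err = dev_hard_header(skb, dev, ntohs(skb->protocol),
			      neigh->ha, NULL, skb->len);
	read_unlock_bh(&neigh->lock);

	/* hypothetical RCU reader: no shared cache line is dirtied */
	rcu_read_lock();
	err = dev_hard_header(skb, dev, ntohs(skb->protocol),
			      neigh->ha, NULL, skb->len);
	rcu_read_unlock();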

- ip_rt_send_redirect() is quite expensive, even if send_redirect is set to 0, because
of in_dev_get()/in_dev_put() (two atomic ops that could be avoided: I submitted a patch).
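
The idea of the patch, roughly (a sketch; the submitted patch may differ
in detail):

	/* replace the refcounted in_dev_get()/in_dev_put() lookup with the
	 * RCU variant, so the send_redirects == 0 case touches no shared
	 * counters at all */
	static void ip_rt_send_redirect(struct sk_buff *skb)
	{
		struct rtable *rt = skb->rtable;
		struct in_device *in_dev;

		rcu_read_lock();
		in_dev = __in_dev_get_rcu(rt->u.dst.dev);
		if (!in_dev || !IN_DEV_TX_REDIRECTS(in_dev)) {
			rcu_read_unlock();
			return;	/* early exit with no atomic ops */
		}
		rcu_read_unlock();
		/* ... rest of the redirect logic ... */
	}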



* Re: Kernel forwarding performance test regressions
From: Stephen Hemminger @ 2009-08-25 16:04 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Robert Olsson

On Tue, 25 Aug 2009 11:47:58 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Stephen Hemminger wrote:
> > Vyatta regularly runs RFC2544 performance tests as part of
> > the QA release regression tests. These tests are run using
> > a Spirent analyzer that sends packets at maximum rate and
> > measures the number of packets received.
> > 
> > The interesting (worst-case) number is the forwarding percentage for
> > minimum-size Ethernet packets.  For packets of 1K and above, all the packets
> > get through, but at smaller sizes the system can't keep up.
> > 
> > The hardware is Dell-based:
> > CPU is an Intel Dual Core E2220 @ 2.40GHz (or 2.2GHz),
> > NICs are internal Broadcom (tg3).
> > 
> > Size	2.6.23	2.6.24	2.6.26	2.6.29	2.6.30
> > 64	 14%	 20%	 21%	 17%	 19%
> > 128	 22	 33	 34	 28	 32
> > 256	 37	 52	 58	 49	 54
> > 512	 67	 85	 83	 85	 85
> > 1024	100	100	100	100	100
> > 1280	100	100	100	100	100
> > 1518	100	100	100	100	100
> > 
> > 
> > Some other details: 
> >   * Hardware changed between the 2.6.24 and 2.6.26 runs:
> >     the CPU went from 2.2 to 2.4GHz
> > 
> >   * no SMP affinity (or irqbalance) is set up;
> >     numbers are significantly better if IRQs are pinned:
> >     2.6.26 goes from 20% to 32%
> 
> That's strange, because at gigabit flood level we should be in NAPI mode,
> with ksoftirqd using 100% of one CPU. SMP affinities should not matter at all...

The transmit completions are still kicking off some interrupts.

> > 
> >   * unidirectional numbers are 2X the bidirectional numbers:
> >     2.6.26 goes from 20% to 40%
> > 
> >   * this is single stream (doesn't help/use multiqueue)
> > 
> >   * system loads iptables but does not use it, so each packet
> >     sees the overhead of null rules.
> > 
> > So kernel 2.6.29 had an observable dip in performance
> > which seems to be mostly recovered in 2.6.30.
> > 
> > These are from our QA, not me, so please don't ask me to
> > "rerun with XX enabled"; go run the same test
> > yourself with pktgen.
> > 
> 
> Unfortunately I cannot reach line rate with pktgen and small packets.
> (Limit ~1012333pps 485Mb/sec on my test machine, 3GHz E5450 CPU)

Things that help:
  * make sure flow control is off
  * increase the transmit ring size
  * sometimes TX IRQ coalescing
I'm using an old SMP Opteron box for pktgen right now.

> It seems timestamping is too expensive in pktgen, even with "delay 0"
> and a single-device setup (where next_to_run() doesn't have to select the 'best' device).
> We could probably improve pktgen a little, or use faster timestamping...

I have a patch that might help, though I haven't tested or used it yet.
It converts the pktgen calls from gettimeofday to sched_clock(), which
saves the math overhead since pktgen only cares about comparisons and
deltas. It also avoids problems with the kernel deciding the clock
source is not stable.  I still need to test and review this to make
sure pktgen only compares values from the same CPU.
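
The core of the idea (an untested sketch; the helper names here are
hypothetical, not from the actual patch):

	#include <linux/sched.h>	/* sched_clock() */

	/* sched_clock() returns cheap, monotonic nanoseconds; values are
	 * only compared and subtracted, never converted to wall time, and
	 * only on the CPU that produced them. */
	static inline u64 pktgen_now_ns(void)
	{
		return sched_clock();
	}

	/* pacing loop: spin until the next scheduled transmit time */
	static void pktgen_spin_until(u64 deadline_ns)
	{
		while (pktgen_now_ns() < deadline_ns)
			cpu_relax();
	}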

> oprofile results on pktgen machine (linux 2.6.30.5):
> CPU: Core 2, speed 3000.08 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> samples  cum. samples  %        cum. %     symbol name
> 58137    58137         27.9549  27.9549    read_tsc
> 51487    109624        24.7573  52.7122    pktgen_thread_worker
> 33079    142703        15.9059  68.6181    getnstimeofday
> 15694    158397         7.5464  76.1645    getCurUs
> 11806    170203         5.6769  81.8413    do_gettimeofday
> 5852     176055         2.8139  84.6553    kthread_should_stop
> 5244     181299         2.5216  87.1768    kthread
> 4181     185480         2.0104  89.1872    mwait_idle
> 3837     189317         1.8450  91.0322    consume_skb
> 2217     191534         1.0660  92.0983    skb_dma_unmap
> 1599     193133         0.7689  92.8671    skb_dma_map
> 1389     194522         0.6679  93.5350    local_bh_enable_ip
> 1350     195872         0.6491  94.1842    nommu_map_page
> 1086     196958         0.5222  94.7064    mix_pool_bytes_extract
> 835      197793         0.4015  95.1079    apic_timer_interrupt
> 774      198567         0.3722  95.4801    irq_entries_start
> 450      199017         0.2164  95.6964    timer_stats_update_stats
> 404      199421         0.1943  95.8907    scheduler_tick
> 403      199824         0.1938  96.0845    find_busiest_group
> 336      200160         0.1616  96.2460    local_bh_disable
> 332      200492         0.1596  96.4057    rb_get_reader_page
> 329      200821         0.1582  96.5639    ring_buffer_consume
> 267      201088         0.1284  96.6923    add_timer_randomness

The pktgen profile will favor the TSC because it spins and reads the
TSC during the spin. Not sure why the tg3 driver overhead isn't showing up.


> I see 0.1% drops around 635085pps 284Mb/sec on my dev machine
> (using VLAN and bonding, bidirectional, output device = input device).
> 
> Some notes:
> 
> - Small packets hit the copybreak (mis)feature (that tg3 and other drivers use),
> and we know this slows down forwarding. There is no real difference on small
> packets anyway, since we need to read the packet to process it (one cache line).

Good point: we disable copybreak on some devices (with modprobe options) in
the Vyatta distro.
 
> - neigh_resolve_output() has a cost because of the atomic ops
> in read_lock_bh(&neigh->lock)/read_unlock_bh(&neigh->lock).
> This might be a candidate for RCU conversion?

yes

> - ip_rt_send_redirect() is quite expensive, even if send_redirect is set to 0, because
> of in_dev_get()/in_dev_put() (two atomic ops that could be avoided: I submitted a patch).
> 



* Re: Kernel forwarding performance test regressions
From: Eric Dumazet @ 2009-08-25 16:25 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, netdev, Robert Olsson

Stephen Hemminger wrote:
> On Tue, 25 Aug 2009 11:47:58 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> That's strange, because at gigabit flood level we should be in NAPI mode,
>> with ksoftirqd using 100% of one CPU. SMP affinities should not matter at all...
> 
> The transmit completions are still kicking off some interrupts.

Ah, yes: in my case, as I use the same device for transmit, I had no additional interrupts.

> 
>>>   * unidirectional numbers are 2X the bidirectional numbers:
>>>     2.6.26 goes from 20% to 40%
>>>
>>>   * this is single stream (doesn't help/use multiqueue)
>>>
>>>   * system loads iptables but does not use it, so each packet
>>>     sees the overhead of null rules.
>>>
>>> So kernel 2.6.29 had an observable dip in performance
>>> which seems to be mostly recovered in 2.6.30.
>>>
>>> These are from our QA, not me, so please don't ask me to
>>> "rerun with XX enabled"; go run the same test
>>> yourself with pktgen.
>>>
>> Unfortunately I cannot reach line rate with pktgen and small packets.
>> (Limit ~1012333pps 485Mb/sec on my test machine, 3GHz E5450 CPU)
> 
> Things that help:
>   * make sure flow control is off
it is
>   * increase transmit ring size
already at the max value of 511
>   * sometimes tx IRQ coalescing
yep
> Using an old SMP Opteron box for pktgen right now.
> 
>> It seems timestamping is too expensive in pktgen, even with "delay 0"
>> and a single-device setup (where next_to_run() doesn't have to select the 'best' device).
>> We could probably improve pktgen a little, or use faster timestamping...
> 
> I have a patch that might help, though I haven't tested or used it yet.
> It converts the pktgen calls from gettimeofday to sched_clock(), which
> saves the math overhead since pktgen only cares about comparisons and
> deltas. It also avoids problems with the kernel deciding the clock
> source is not stable.  I still need to test and review this to make
> sure pktgen only compares values from the same CPU.

Well, I tried using two adapters and got more bandwidth from the same CPU0,
so it seems tg3 on my machine is not able to go past 1012333pps (and BTW,
bnx2 is much slower, I don't know why...)

Configuring /proc/net/pktgen/eth3  (tg3)
Configuring /proc/net/pktgen/eth1  (bnx2)
Running... ctrl^C to stop
Done
Params: count 100000  min_pkt_size: 56  max_pkt_size: 56
     frags: 0  delay: 0  clone_skb: 1000  ifname: eth3
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 192.168.20.120  dst_max: 192.168.20.121
     src_min:   src_max:
     src_mac: 00:1e:0b:92:78:51 dst_mac: 00:1f:29:6b:86:15
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 100000  errors: 0
     started: 1251217024743446us  stopped: 1251217024842450us idle: 253us
     seq_num: 100001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x200a8c0  cur_daddr: 0x7814a8c0
     cur_udp_dst: 9  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 99004(c98751+d253) usec, 100000 (56byte,0frags)
  1010060pps 452Mb/sec (452506880bps) errors: 0
Params: count 100000  min_pkt_size: 56  max_pkt_size: 56
     frags: 0  delay: 0  clone_skb: 1000  ifname: eth1
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 192.168.20.120  dst_max: 192.168.20.121
     src_min:   src_max:
     src_mac: 00:1e:0b:ec:d3:d2 dst_mac: 00:1f:29:6b:86:15
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 100000  errors: 0
     started: 1251217024743445us  stopped: 1251217024888749us idle: 329us
     seq_num: 100001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x0  cur_daddr: 0x7814a8c0
     cur_udp_dst: 9  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 145304(c144975+d329) usec, 100000 (56byte,0frags)
  688212pps 308Mb/sec (308318976bps) errors: 0

07:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708S Gigabit Ethernet (rev 12)
        Subsystem: Hewlett-Packard Company NC373i Integrated Multifunction Gigabit Server Adapter
        Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 34
        Memory at fa000000 (64-bit, non-prefetchable) [size=32M]
        [virtual] Expansion ROM at d0000000 [disabled] [size=16K]
        Capabilities: [40] PCI-X non-bridge device
        Capabilities: [48] Power Management version 2
        Capabilities: [50] Vital Product Data
        Capabilities: [58] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Kernel driver in use: bnx2 (eth1)

14:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S Gigabit Ethernet (rev a3)
        Subsystem: Hewlett-Packard Company NC326m PCIe Dual Port Adapter
        Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 35
        Memory at fdff0000 (64-bit, non-prefetchable) [size=64K]
        Memory at fdfe0000 (64-bit, non-prefetchable) [size=64K]
        [virtual] Expansion ROM at d0200000 [disabled] [size=128K]
        Capabilities: [40] PCI-X non-bridge device
        Capabilities: [48] Power Management version 2
        Capabilities: [50] Vital Product Data
        Capabilities: [58] MSI: Enable+ Count=1/8 Maskable- 64bit+
        Kernel driver in use: tg3
        Kernel modules: tg3 (eth2, not used in my pktgen setup)



14:04.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S Gigabit Ethernet (rev a3)
        Subsystem: Hewlett-Packard Company NC326m PCIe Dual Port Adapter
        Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 37
        Memory at fdfd0000 (64-bit, non-prefetchable) [size=64K]
        Memory at fdfc0000 (64-bit, non-prefetchable) [size=64K]
        [virtual] Expansion ROM at d0220000 [disabled] [size=128K]
        Capabilities: [40] PCI-X non-bridge device
        Capabilities: [48] Power Management version 2
        Capabilities: [50] Vital Product Data
        Capabilities: [58] MSI: Enable+ Count=1/8 Maskable- 64bit+
        Kernel driver in use: tg3
        Kernel modules: tg3 (eth3)

> 
>> oprofile results on pktgen machine (linux 2.6.30.5):
>> CPU: Core 2, speed 3000.08 MHz (estimated)
>> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
>> samples  cum. samples  %        cum. %     symbol name
>> 58137    58137         27.9549  27.9549    read_tsc
>> 51487    109624        24.7573  52.7122    pktgen_thread_worker
>> 33079    142703        15.9059  68.6181    getnstimeofday
>> 15694    158397         7.5464  76.1645    getCurUs
>> 11806    170203         5.6769  81.8413    do_gettimeofday
>> 5852     176055         2.8139  84.6553    kthread_should_stop
>> 5244     181299         2.5216  87.1768    kthread
>> 4181     185480         2.0104  89.1872    mwait_idle
>> 3837     189317         1.8450  91.0322    consume_skb
>> 2217     191534         1.0660  92.0983    skb_dma_unmap
>> 1599     193133         0.7689  92.8671    skb_dma_map
>> 1389     194522         0.6679  93.5350    local_bh_enable_ip
>> 1350     195872         0.6491  94.1842    nommu_map_page
>> 1086     196958         0.5222  94.7064    mix_pool_bytes_extract
>> 835      197793         0.4015  95.1079    apic_timer_interrupt
>> 774      198567         0.3722  95.4801    irq_entries_start
>> 450      199017         0.2164  95.6964    timer_stats_update_stats
>> 404      199421         0.1943  95.8907    scheduler_tick
>> 403      199824         0.1938  96.0845    find_busiest_group
>> 336      200160         0.1616  96.2460    local_bh_disable
>> 332      200492         0.1596  96.4057    rb_get_reader_page
>> 329      200821         0.1582  96.5639    ring_buffer_consume
>> 267      201088         0.1284  96.6923    add_timer_randomness
> 
> The pktgen profile will favor the TSC because it spins and reads the
> TSC during the spin. Not sure why the tg3 driver overhead isn't showing up.

Sorry, for some strange reason I have to load tg3 as a module (everything else is built statically into vmlinux).

