From: Eric Dumazet
Subject: Re: Kernel forwarding performance test regressions
Date: Tue, 25 Aug 2009 11:47:58 +0200
Message-ID: <4A93B34E.1040100@gmail.com>
References: <20090819110010.53b630cd@nehalam>
In-Reply-To: <20090819110010.53b630cd@nehalam>
To: Stephen Hemminger
Cc: David Miller, netdev@vger.kernel.org, Robert Olsson

Stephen Hemminger wrote:
> Vyatta regularly runs RFC2544 performance tests as part of
> the QA release regression tests. These tests are run using
> a Spirent analyzer that sends packets at maximum rate and
> measures the number of packets received.
>
> The interesting (worst case) number is the forwarding percentage for
> minimum size Ethernet packets. For packets 1K and above all the packets
> get through, but for smaller sizes the system can't keep up.
>
> The hardware is Dell based:
> CPU is an Intel Dual Core E2220 @ 2.40GHz (or 2.2GHz),
> NICs are internal Broadcom (tg3).
>
> Size   2.6.23  2.6.24  2.6.26  2.6.29  2.6.30
>   64    14%     20%     21%     17%     19%
>  128    22      33      34      28      32
>  256    37      52      58      49      54
>  512    67      85      83      85      85
> 1024   100     100     100     100     100
> 1280   100     100     100     100     100
> 1518   100     100     100     100     100
>
> Some other details:
> * Hardware changed between the 2.6.24 -> 2.6.26 numbers:
>   the CPU went from 2.2 to 2.4GHz.
>
> * No SMP affinity (or irqbalance) is done;
>   numbers are significantly better if IRQs are pinned.
>   2.6.26 goes from 20% to 32%.

That's strange, because at Gigabit flood level we should be in NAPI mode,
with ksoftirqd using 100% of one cpu. SMP affinities should not matter at all...

> * Unidirectional numbers are 2X the bidirectional numbers:
>   2.6.26 goes from 20% to 40%.
>
> * This is a single stream (doesn't help/use multiqueue).
>
> * The system loads iptables but does not use it, so each packet
>   sees the overhead of null rules.
>
> So kernel 2.6.29 had an observable dip in performance
> which seems to be mostly recovered in 2.6.30.
>
> These are from our QA, not me, so please don't ask me for
> "please rerun with XX enabled"; go run the same test
> yourself with pktgen.

Unfortunately I cannot reach line rate with pktgen and small packets
(limit ~1012333 pps, 485 Mb/sec on my test machine, 3GHz E5450 cpu).

It seems timestamping is too expensive in pktgen, even with "delay 0"
and a single-device setup (where next_to_run() doesn't have to select
the 'best' device).

We could probably improve pktgen a little bit, or use faster timestamping...

oprofile results on the pktgen machine (linux 2.6.30.5):

CPU: Core 2, speed 3000.08 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000

samples  cum. samples  %        cum. %   symbol name
58137    58137         27.9549  27.9549  read_tsc
51487    109624        24.7573  52.7122  pktgen_thread_worker
33079    142703        15.9059  68.6181  getnstimeofday
15694    158397         7.5464  76.1645  getCurUs
11806    170203         5.6769  81.8413  do_gettimeofday
 5852    176055         2.8139  84.6553  kthread_should_stop
 5244    181299         2.5216  87.1768  kthread
 4181    185480         2.0104  89.1872  mwait_idle
 3837    189317         1.8450  91.0322  consume_skb
 2217    191534         1.0660  92.0983  skb_dma_unmap
 1599    193133         0.7689  92.8671  skb_dma_map
 1389    194522         0.6679  93.5350  local_bh_enable_ip
 1350    195872         0.6491  94.1842  nommu_map_page
 1086    196958         0.5222  94.7064  mix_pool_bytes_extract
  835    197793         0.4015  95.1079  apic_timer_interrupt
  774    198567         0.3722  95.4801  irq_entries_start
  450    199017         0.2164  95.6964  timer_stats_update_stats
  404    199421         0.1943  95.8907  scheduler_tick
  403    199824         0.1938  96.0845  find_busiest_group
  336    200160         0.1616  96.2460  local_bh_disable
  332    200492         0.1596  96.4057  rb_get_reader_page
  329    200821         0.1582  96.5639  ring_buffer_consume
  267    201088         0.1284  96.6923  add_timer_randomness
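Most of the cycles above go into reading the clock for every single packet.
Purely as an illustration (the pkts_since_ts/cached_us fields and the
PKTGEN_TS_REFRESH constant below do not exist in pktgen), something like
this could amortize that cost by refreshing a cached microsecond timestamp
only every few packets instead of calling getCurUs() per packet:

/*
 * Hypothetical sketch only: amortize the per-packet clock read by
 * refreshing a cached microsecond timestamp every PKTGEN_TS_REFRESH
 * packets.  getCurUs() is the existing pktgen helper built on
 * do_gettimeofday(); the two new pktgen_dev fields are invented here.
 */
#define PKTGEN_TS_REFRESH 16

static inline __u64 pktgen_cached_us(struct pktgen_dev *pkt_dev)
{
	if (++pkt_dev->pkts_since_ts >= PKTGEN_TS_REFRESH) {
		pkt_dev->pkts_since_ts = 0;
		pkt_dev->cached_us = getCurUs(); /* one gettimeofday per batch */
	}
	return pkt_dev->cached_us;
}

That trades a little delay accuracy for far fewer read_tsc/getnstimeofday
calls, which is where the profile says the time goes.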
I observe about 0.1% drops at around 635085 pps / 284 Mb/sec on my dev machine
(using vlan and bonding, bidirectional, output device = input device).

Some notes:

- Small packets hit the copybreak (mis)feature (that tg3 and other drivers
  use), and we know this slows down forwarding. It makes no real difference
  for small packets anyway, since we need to read the packet to process it
  (one cache line). A rough sketch of what copybreak does follows below.

- neigh_resolve_output() has a cost because of the atomic ops in
  read_lock_bh(&neigh->lock)/read_unlock_bh(&neigh->lock).
  This might be a candidate for RCU conversion? (See the second sketch below.)

- ip_rt_send_redirect() is quite expensive, even if send_redirects is set
  to 0, because of in_dev_get()/in_dev_put() (two atomic ops that could be
  avoided: I submitted a patch; the last sketch below shows the idea).
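For reference, here is a rough illustration of the rx copybreak idea
(names below are placeholders, not the actual tg3 code): below a threshold
the driver copies the frame into a freshly allocated small skb so the
original DMA-mapped rx buffer can be recycled.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

static int rx_copybreak = 256;	/* threshold is driver specific */

/*
 * Returns a small copied skb for short frames, or NULL to tell the
 * caller to hand the original (large) rx buffer up unchanged.
 */
static struct sk_buff *rx_copy_small(struct net_device *dev,
				     const u8 *data, unsigned int len)
{
	struct sk_buff *skb;

	if (len >= rx_copybreak)
		return NULL;

	skb = netdev_alloc_skb(dev, len + NET_IP_ALIGN);
	if (!skb)
		return NULL;

	skb_reserve(skb, NET_IP_ALIGN);
	/* this copy is why copybreak hardly matters for forwarding:
	 * we touch the packet's cache line anyway to process it */
	skb_copy_to_linear_data(skb, data, len);
	skb_put(skb, len);
	return skb;
}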
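On the neigh_resolve_output() point: the read side currently takes
read_lock_bh(&neigh->lock) around the dev_hard_header() call. A very rough
sketch of what an RCU read side could look like, assuming writers were
converted to publish neigh->ha updates safely (the hard part, not shown):

#include <linux/netdevice.h>
#include <linux/rcupdate.h>
#include <net/neighbour.h>

/* Sketch only, not a patch: replaces the rwlock read side with RCU. */
static int neigh_fill_header_rcu(struct neighbour *neigh, struct sk_buff *skb)
{
	struct net_device *dev = neigh->dev;
	int err;

	rcu_read_lock();
	err = dev_hard_header(skb, dev, ntohs(skb->protocol),
			      neigh->ha, NULL, skb->len);
	rcu_read_unlock();

	return err;
}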
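And for ip_rt_send_redirect(), the idea behind the patch I mentioned is to
avoid in_dev_get()/in_dev_put() on the fast path by doing the sysctl check
under rcu_read_lock(); roughly (a sketch of the idea, not the exact
submission):

#include <linux/inetdevice.h>
#include <linux/rcupdate.h>

/*
 * Sketch: check the per-device sysctl under RCU and bail out without
 * any atomic refcount when redirects are disabled (the common case on
 * a pure router).
 */
static bool tx_redirects_enabled(struct net_device *dev)
{
	struct in_device *in_dev;
	bool ret = false;

	rcu_read_lock();
	in_dev = __in_dev_get_rcu(dev);
	if (in_dev && IN_DEV_TX_REDIRECTS(in_dev))
		ret = true;
	rcu_read_unlock();

	return ret;
}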