From mboxrd@z Thu Jan 1 00:00:00 1970 From: Stephen Hemminger Subject: Re: [PATCH] pktgen: Avoid dirtying skb->users when txq is full Date: Wed, 30 Sep 2009 17:25:32 -0700 Message-ID: <20090930172532.2c2d1d42@s6510> References: <20090922224902.17ed6cc4@nehalam> <20090923174141.1d350103@s6510> <4AC3E3C5.1090108@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: Jesper Dangaard Brouer , Robert Olsson , netdev@vger.kernel.org, "David S. Miller" To: Eric Dumazet Return-path: Received: from mail.vyatta.com ([76.74.103.46]:40167 "EHLO mail.vyatta.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755190AbZJAAZb convert rfc822-to-8bit (ORCPT ); Wed, 30 Sep 2009 20:25:31 -0400 In-Reply-To: <4AC3E3C5.1090108@gmail.com> Sender: netdev-owner@vger.kernel.org List-ID: On Thu, 01 Oct 2009 01:03:33 +0200 Eric Dumazet wrote: > Stephen Hemminger a =C3=A9crit : > > On Tue, 22 Sep 2009 22:49:02 -0700 > > Stephen Hemminger wrote: > >=20 > >> I thought others want to know how to get maximum speed of pktgen. > >> > >> 1. Run nothing else (even X11), just a command line > >> 2. Make sure ethernet flow control is disabled > >> ethtool -A eth0 autoneg off rx off tx off > >> 3. Make sure clocksource is TSC. On my old SMP Opteron's > >> needed to get patch since in 2.6.30 or later, the clock guru's > >> decided to remove it on all non Intel machines. Look for patch > >> than enables "tsc=3Dreliable" > >> 4. Compile Ethernet drivers in, the overhead of the indirect > >> function call required for modules (or cache footprint), > >> slows things down. > >> 5. Increase transmit ring size to 1000 > >> ethtool -G eth0 tx 1000 > >> >=20 > Thanks a lot Stephen. >=20 > I did some pktgen session tonight and found one contention on skb->us= ers field > that following patch avoids. >=20 >=20 > Before patch : > ---------------------------------------------------------------------= --------- > PerfTop: 5187 irqs/sec kernel:100.0% [100000 cycles], (all, c= pu: 0) > ---------------------------------------------------------------------= --------- >=20 > samples pcnt kernel function > _______ _____ _______________ >=20 > 16688.00 - 50.9% : consume_skb > 6541.00 - 20.0% : skb_dma_unmap > 3277.00 - 10.0% : tg3_poll > 1968.00 - 6.0% : mwait_idle > 651.00 - 2.0% : irq_entries_start > 466.00 - 1.4% : _spin_lock > 442.00 - 1.3% : mix_pool_bytes_extract > 373.00 - 1.1% : tg3_msi > 353.00 - 1.1% : read_tsc > 177.00 - 0.5% : sched_clock_local > 176.00 - 0.5% : sched_clock > 137.00 - 0.4% : tick_nohz_stop_sched_tick >=20 > After patch: > ---------------------------------------------------------------------= --------- > PerfTop: 3530 irqs/sec kernel:99.9% [100000 cycles], (all, cp= u: 0) > ---------------------------------------------------------------------= --------- >=20 > samples pcnt kernel function > _______ _____ _______________ >=20 > 17127.00 - 34.0% : tg3_poll > 12691.00 - 25.2% : consume_skb > 5299.00 - 10.5% : skb_dma_unmap > 4179.00 - 8.3% : mwait_idle > 1583.00 - 3.1% : irq_entries_start > 1288.00 - 2.6% : mix_pool_bytes_extract > 1239.00 - 2.5% : tg3_msi > 1062.00 - 2.1% : read_tsc > 583.00 - 1.2% : _spin_lock > 432.00 - 0.9% : sched_clock > 416.00 - 0.8% : sched_clock_local > 360.00 - 0.7% : tick_nohz_stop_sched_tick > 329.00 - 0.7% : ktime_get > 263.00 - 0.5% : _spin_lock_irqsave >=20 > I believe we could go further, batching the atomic_inc(&skb->users) w= e do all the > time, competing with the atomic_dec() done by tx completion handler (= possibly run > on other cpu): Reserve XXXXXXX units to the skb->users, and decrement= a pktgen > local variable and refill the reserve if necessary, once in a while..= =2E When buffer is allocated we know the number of times it will be cloned.= So why not set it there, would need to cleanup on interrupt but that should be= possible. Alternatively, just change skb->destructor on last packet and use a pro= per completion mechanism. > [PATCH] pktgen: Avoid dirtying skb->users when txq is full >=20 > We can avoid two atomic ops on skb->users if packet is not going to b= e > sent to the device (because hardware txqueue is full) >=20 > Signed-off-by: Eric Dumazet > --- >=20 > diff --git a/net/core/pktgen.c b/net/core/pktgen.c > index 4d11c28..6a9ab28 100644 > --- a/net/core/pktgen.c > +++ b/net/core/pktgen.c > @@ -3439,12 +3439,14 @@ static void pktgen_xmit(struct pktgen_dev *pk= t_dev) > txq =3D netdev_get_tx_queue(odev, queue_map); > =20 > __netif_tx_lock_bh(txq); > - atomic_inc(&(pkt_dev->skb->users)); > =20 > - if (unlikely(netif_tx_queue_stopped(txq) || netif_tx_queue_frozen(t= xq))) > + if (unlikely(netif_tx_queue_stopped(txq) || netif_tx_queue_frozen(t= xq))) { > ret =3D NETDEV_TX_BUSY; > - else > - ret =3D (*xmit)(pkt_dev->skb, odev); > + pkt_dev->last_ok =3D 0; > + goto unlock; > + } > + atomic_inc(&(pkt_dev->skb->users)); > + ret =3D (*xmit)(pkt_dev->skb, odev); > =20 > switch (ret) { > case NETDEV_TX_OK: > @@ -3466,6 +3468,7 @@ static void pktgen_xmit(struct pktgen_dev *pkt_= dev) > atomic_dec(&(pkt_dev->skb->users)); > pkt_dev->last_ok =3D 0; > } > +unlock: > __netif_tx_unlock_bh(txq); > =20 > /* If pkt_dev->count is zero, then run forever */ Acked-by: Stephen Hemminger