* Re: [RFC] New driver API to speed up small packets xmits
       [not found] <OF0CAD6D87.DBE62968-ON872572DC.0073646A-882572DC.0073BEC2@us.ibm.com>
@ 2007-05-15 21:17 ` Roland Dreier
       [not found]   ` <OFF5654BB8.74EC8DCB-ON872572DC.00752079-882572DC.00756B23@us.ibm.com>
  0 siblings, 1 reply; 19+ messages in thread
From: Roland Dreier @ 2007-05-15 21:17 UTC (permalink / raw)
  To: Shirley Ma; +Cc: David Miller, ak, krkumar2, netdev, netdev-owner

 >       I thought that to enable GSO, the device driver actually does nothing
 > other than enable the flag. GSO moved TCP offloading to the interface layer
 > before the device xmit. It's a different idea from multiple packets per
 > xmit: GSO still queues the packets one by one in the QDISC and xmits them
 > one by one, whereas multiple packets per xmit will xmit N packets when N
 > packets are in the QDISC queue. Please correct me if I'm wrong.

Current use of GSO does segmentation just above the netdevice driver.
However, there's nothing that prevents it from being used to push
segmentation into the driver -- as I described, just set NETIF_F_GSO_SOFTWARE
and do skb_gso_segment() within the driver.
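
A minimal sketch of that approach, assuming hypothetical driver helpers
(my_hard_start_xmit() and my_post_one_segment() are placeholders, not real
IPoIB code); it roughly mirrors the structure of the tg3 workaround
mentioned in the next paragraph:

#include <linux/err.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* at init time: dev->features |= NETIF_F_GSO_SOFTWARE, so the stack
 * hands us unsegmented GSO superpackets */

static int my_post_one_segment(struct sk_buff *skb, struct net_device *dev);

static int my_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	if (skb_is_gso(skb)) {
		struct sk_buff *segs;

		/* segment in software inside the driver */
		segs = skb_gso_segment(skb, 0);
		if (IS_ERR(segs)) {
			dev_kfree_skb_any(skb);
			return NETDEV_TX_OK;
		}

		do {
			struct sk_buff *seg = segs;

			segs = segs->next;
			seg->next = NULL;
			my_post_one_segment(seg, dev);	/* one send WR each */
		} while (segs);

		dev_kfree_skb_any(skb);	/* segments carry their own refs */
		return NETDEV_TX_OK;
	}
	return my_post_one_segment(skb, dev);
}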

As Michael Chan pointed out, tg3.c already does this to work around a
rare bug in the HW TSO implementation.  For IPoIB we could do it all
the time to get multiple send work requests from one call to the xmit
method.

 - R.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
       [not found]   ` <OFF5654BB8.74EC8DCB-ON872572DC.00752079-882572DC.00756B23@us.ibm.com>
@ 2007-05-15 21:25     ` Roland Dreier
       [not found]       ` <OF21D475A2.5E5C88DE-ON872572DC.00763DE4-882572DC.00768A7E@us.ibm.com>
  2007-05-15 21:32     ` David Miller
  1 sibling, 1 reply; 19+ messages in thread
From: Roland Dreier @ 2007-05-15 21:25 UTC (permalink / raw)
  To: Shirley Ma; +Cc: ak, David Miller, krkumar2, netdev, netdev-owner

    Shirley>       I just wonder, without TSO support in HW, how much
    Shirley> benefit we can get by pushing GSO from the interface layer to
    Shirley> the device layer, besides being able to do multiple packets
    Shirley> in IPoIB.

The entire benefit comes from having multiple packets to queue in one
call to the xmit method.

 - R.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
       [not found]   ` <OFF5654BB8.74EC8DCB-ON872572DC.00752079-882572DC.00756B23@us.ibm.com>
  2007-05-15 21:25     ` Roland Dreier
@ 2007-05-15 21:32     ` David Miller
  2007-05-15 22:17       ` [WIP] [PATCH] WAS " jamal
                         ` (2 more replies)
  1 sibling, 3 replies; 19+ messages in thread
From: David Miller @ 2007-05-15 21:32 UTC (permalink / raw)
  To: xma; +Cc: rdreier, ak, krkumar2, netdev, netdev-owner

From: Shirley Ma <xma@us.ibm.com>
Date: Tue, 15 May 2007 14:22:57 -0700

>       I just wonder, without TSO support in HW, how much benefit we
> can get by pushing GSO from the interface layer to the device layer,
> besides being able to do multiple packets in IPoIB.

I bet the gain is non-trivial.

I'd say about half of the gain from TSO comes from only calling down
into the driver from TCP one time as opposed to N times.  That's
the majority of the "CPU work" involved in TCP sending.

The rest of the gain comes from only transmitting the packet headers
once rather than N times, which conserves I/O bus bandwidth.

GSO will not help the case of lots of UDP applications sending
small packets, or something like that.  An efficient qdisc-->driver
transfer during netif_wake_queue() could help solve some of that,
as is being discussed here.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
       [not found]       ` <OF21D475A2.5E5C88DE-ON872572DC.00763DE4-882572DC.00768A7E@us.ibm.com>
@ 2007-05-15 21:38         ` David Miller
  0 siblings, 0 replies; 19+ messages in thread
From: David Miller @ 2007-05-15 21:38 UTC (permalink / raw)
  To: xma; +Cc: rdreier, ak, krkumar2, netdev, netdev-owner

From: Shirley Ma <xma@us.ibm.com>
Date: Tue, 15 May 2007 14:35:13 -0700

 >       I would think it might be better to implement dequeueN in the
 > interface xmit, which will benefit multiple streams as well, not just
 > large packets.

I think it is beneficial to have both; each of them helps with
different aspects of packet-sending overhead.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits
  2007-05-15 21:32     ` David Miller
@ 2007-05-15 22:17       ` jamal
  2007-05-15 22:48         ` jamal
  2007-05-16 22:12         ` Sridhar Samudrala
       [not found]       ` <OF6757F56D.EE5984FD-ON872572DC.0081026C-882572DC.00814B8F@us.ibm.com>
  2007-05-21  7:56       ` Herbert Xu
  2 siblings, 2 replies; 19+ messages in thread
From: jamal @ 2007-05-15 22:17 UTC (permalink / raw)
  To: David Miller
  Cc: xma, rdreier, ak, krkumar2, netdev, netdev-owner, ashwin.chaugule,
	Evgeniy Polyakov, Gagan Arneja

[-- Attachment #1: Type: text/plain, Size: 2196 bytes --]

On Tue, 2007-15-05 at 14:32 -0700, David Miller wrote:
>  An efficient qdisc-->driver
> transfer during netif_wake_queue() could help solve some of that,
> as is being discussed here.

Ok, here's the approach I discussed at netconf.
It needs net-2.6 and the patch I posted earlier to clean up
qdisc_restart() [1].
I haven't ported over all the bits from 2.6.18, but this works.
Krishna and I have colluded privately on working together. I just need
to reproduce the patches, so here is the core.
A lot of the code in the core could be aggregated later - right now I am
worried about correctness.
I will post a patch for the tun device in a few minutes
that I use to test on my laptop (I need to remove some debugs) to show
an example.
I also plan to post a patch for e1000 - but that will take more
than a few minutes.
The e1000 driver has changed quite a bit since 2.6.18, so it is
time-consuming.

What does a driver need to do to get batched-to?

1) On initialization (probably in probe):
 a) set NETIF_F_BTX in its dev->features at startup,
    i.e. dev->features |= NETIF_F_BTX
 b) initialize the batch queue, i.e. something like
    skb_queue_head_init(&dev->blist);
 c) set dev->xmit_win to something reasonable, like
    maybe half the DMA ring size or tx_queue_len.

2) Create a new method for batch txmit.
This loops on dev->blist and stashes onto hardware.
All return codes like NETDEV_TX_OK etc. still apply.

3) Set dev->xmit_win, which provides hints on how much
data to send from the core to the driver. Some suggestions:
 a) on doing a netif_stop, set it to 1
 b) on netif_wake_queue, set it to the max available space

Of course, for all this to work, the driver needs a threshold for waking
up the tx path, as drivers such as e1000 or tg3 do in order to invoke
netif_wake_queue (for example, look at TX_WAKE_THRESHOLD usage in e1000).
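
A minimal sketch of the driver side, written against the API proposed in
the patch below (the mydrv_* names and the ring-space helpers are
placeholders / assumptions, not a real driver):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* placeholder hardware helpers -- assumptions, not part of this patch */
static int mydrv_ring_full(struct net_device *dev);
static int mydrv_ring_space(struct net_device *dev);
static void mydrv_queue_to_hw(struct sk_buff *skb, struct net_device *dev);

/* 2) the batch txmit method: loop on the list, stash onto hardware */
static int mydrv_batch_xmit(struct sk_buff_head *list, struct net_device *dev)
{
	struct sk_buff *skb;

	while ((skb = __skb_dequeue(list)) != NULL) {
		if (mydrv_ring_full(dev)) {
			__skb_queue_head(list, skb);	/* put it back */
			dev->xmit_win = 1;		/* 3a) path is closing */
			netif_stop_queue(dev);
			return NETDEV_TX_BUSY;
		}
		mydrv_queue_to_hw(skb, dev);
	}
	dev->xmit_win = mydrv_ring_space(dev);		/* hint for the core */
	return NETDEV_TX_OK;
}

/* 1) initialization, e.g. in probe */
static void mydrv_setup_batching(struct net_device *dev)
{
	dev->features |= NETIF_F_BTX;			/* 1a */
	skb_queue_head_init(&dev->blist);		/* 1b */
	dev->hard_batch_xmit = mydrv_batch_xmit;
	dev->xmit_win = dev->tx_queue_len >> 1;		/* 1c: handwave */
}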


Feedback welcome (preferably in the form of patches).

Anyone with a really nice tool to measure CPU improvement would help
a great deal in quantifying things. As I have said earlier, I never saw
any throughput improvement. But like TSO/GSO it may just be CPU savings
(as was suggested at netconf).

cheers,
jamal
[1] http://marc.info/?l=linux-netdev&m=117914954911959&w=2


[-- Attachment #2: batch0 --]
[-- Type: text/x-patch, Size: 4512 bytes --]

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f671cd2..7205748 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -325,6 +325,7 @@ struct net_device
 #define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
 #define NETIF_F_GSO		2048	/* Enable software GSO. */
 #define NETIF_F_LLTX		4096	/* LockLess TX */
+#define NETIF_F_BTX		8192	/* Capable of batch tx */
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -450,6 +451,11 @@ struct net_device
 	void			*priv;	/* pointer to private data	*/
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 						    struct net_device *dev);
+	int			(*hard_batch_xmit) (struct sk_buff_head *list,
+						    struct net_device *dev);
+	int			(*hard_prep_xmit) (struct sk_buff *skb,
+						    struct net_device *dev);
+	int			xmit_win;
 	/* These may be needed for future network-power-down code. */
 	unsigned long		trans_start;	/* Time (in jiffies) of last Tx	*/
 
@@ -466,6 +472,10 @@ struct net_device
 	struct list_head	todo_list;
 	/* device index hash chain */
 	struct hlist_node	index_hlist;
+	/*XXX: Fix eventually to not allocate if device not
+	 *batch capable
+	*/
+	struct sk_buff_head	blist;
 
 	struct net_device	*link_watch_next;
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ed80054..61fa301 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -85,10 +85,12 @@ static inline int
 do_dev_requeue(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q)
 {
 
-	if (unlikely(skb->next))
-		dev->gso_skb = skb;
-	else
-		q->ops->requeue(skb, q);
+	if (skb) {
+		if (unlikely(skb->next))
+			dev->gso_skb = skb;
+		else
+			q->ops->requeue(skb, q);
+	}
 	/* XXX: Could netif_schedule fail? Or is that fact we are
 	 * requeueing imply the hardware path is closed
 	 * and even if we fail, some interupt will wake us
@@ -116,7 +118,10 @@ tx_islocked(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q)
 	int ret = handle_dev_cpu_collision(dev);
 
 	if (ret == SCHED_TX_DROP) {
-		kfree_skb(skb);
+		if (skb) /* we are not batching */
+			kfree_skb(skb);
+		else if (!skb_queue_empty(&dev->blist))
+			skb_queue_purge(&dev->blist);
 		return qdisc_qlen(q);
 	}
 
@@ -195,10 +200,99 @@ static inline int qdisc_restart(struct net_device *dev)
 	return do_dev_requeue(skb, dev, q);
 }
 
+static int try_get_tx_pkts(struct net_device *dev, struct Qdisc *q, int count)
+{
+	struct sk_buff *skb;
+	struct sk_buff_head *skbs = &dev->blist;
+	int tdq = count;
+
+	/* 
+	 * very unlikely, but who knows ..
+	 * If this happens we dont try to grab more pkts
+	 */
+	if (!skb_queue_empty(&dev->blist))
+		return skb_queue_len(&dev->blist);
+
+	if (dev->gso_skb) {
+		count--;
+		__skb_queue_head(skbs, dev->gso_skb);
+		dev->gso_skb = NULL;
+	}
+
+	while (count) {
+		skb = q->dequeue(q);
+		if (!skb)
+			break;
+		count--;
+		__skb_queue_head(skbs, skb);
+	}
+
+	return tdq - count;
+}
+
+static inline int try_tx_pkts(struct net_device *dev)
+{
+
+	return dev->hard_batch_xmit(&dev->blist, dev);
+
+}
+
+/* same comments as in qdisc_restart apply;
+ * at some point use shared code with qdisc_restart*/
+int batch_qdisc_restart(struct net_device *dev)
+{
+	struct Qdisc *q = dev->qdisc;
+	unsigned lockless = (dev->features & NETIF_F_LLTX);
+	int count = dev->xmit_win;
+	int ret = 0;
+
+	ret = try_get_tx_pkts(dev, q, count);
+
+	if (ret == 0)
+		return qdisc_qlen(q);
+
+	/* we have packets to send! */
+	if (!lockless) {
+		if (!netif_tx_trylock(dev))
+			return tx_islocked(NULL, dev, q);
+	}
+
+	/* all clear .. */
+	spin_unlock(&dev->queue_lock);
+
+	ret = NETDEV_TX_BUSY;
+	if (!netif_queue_stopped(dev))
+		ret = try_tx_pkts(dev);
+
+	if (!lockless)
+		netif_tx_unlock(dev);
+
+	spin_lock(&dev->queue_lock);
+
+	q = dev->qdisc;
+
+	/* most likely result, packet went ok */
+	if (ret == NETDEV_TX_OK)
+		return qdisc_qlen(q);
+	/* only for lockless drivers .. */
+	if (ret == NETDEV_TX_LOCKED && lockless)
+		return tx_islocked(NULL, dev, q);
+
+	if (unlikely(ret != NETDEV_TX_BUSY && net_ratelimit()))
+		printk(KERN_WARNING " BUG %s code %d qlen %d\n",
+		       dev->name, ret, q->q.qlen);
+
+	return do_dev_requeue(NULL, dev, q);
+}
+
 void __qdisc_run(struct net_device *dev)
 {
+	unsigned batching = (dev->features & NETIF_F_BTX);
+
 	do {
-		if (!qdisc_restart(dev))
+		if (!batching && !qdisc_restart(dev))
+			break;
+		else if (!batch_qdisc_restart(dev))
 			break;
 	} while (!netif_queue_stopped(dev));
 

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits
  2007-05-15 22:17       ` [WIP] [PATCH] WAS " jamal
@ 2007-05-15 22:48         ` jamal
  2007-05-16  0:50           ` jamal
  2007-05-16 22:12         ` Sridhar Samudrala
  1 sibling, 1 reply; 19+ messages in thread
From: jamal @ 2007-05-15 22:48 UTC (permalink / raw)
  To: David Miller
  Cc: xma, rdreier, ak, krkumar2, netdev, ashwin.chaugule,
	Evgeniy Polyakov, Max Krasnyansky, Gagan Arneja

[-- Attachment #1: Type: text/plain, Size: 681 bytes --]

On Tue, 2007-15-05 at 18:17 -0400, jamal wrote:

> I will post a patch for the tun device in a few minutes
> that I use to test on my laptop (I need to remove some debugs) to show
> an example.

Ok, here it is.
The way I test is to point packets at a tun device. [One way I do it
is to attach an ingress qdisc on lo, attach a u32 filter to match all,
and on a match redirect to the tun device.]
The user-space program reading sleeps for about a second every 20
packets or so. This forces things to accumulate in the driver's queue.
Backpressure builds up, and the throttling effect is really nice to see
working.

I will try to post the e1000 patch tonight or tomorrow morning.

cheers,
jamal



[-- Attachment #2: tun-b --]
[-- Type: text/x-patch, Size: 3734 bytes --]

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a2c6caa..076f794 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -70,6 +70,7 @@
 static int debug;
 #endif
 
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
 /* Network device part of the driver */
 
 static LIST_HEAD(tun_dev_list);
@@ -86,9 +87,53 @@ static int tun_net_open(struct net_device *dev)
 static int tun_net_close(struct net_device *dev)
 {
 	netif_stop_queue(dev);
+	//skb_queue_purge(&dev->blist);
 	return 0;
 }
 
+/* Batch Net device start xmit
+ * combine with non-batching version
+ * */
+static int tun_net_bxmit(struct sk_buff_head *skbs, struct net_device *dev)
+{
+	struct sk_buff *skb;
+	struct tun_struct *tun = netdev_priv(dev);
+	u32 qlen = skb_queue_len(&tun->readq);
+
+	/* Drop packet if interface is not attached */
+	if (!tun->attached) {
+		tun->stats.tx_dropped+=skb_queue_len(&dev->blist);
+		skb_queue_purge(&dev->blist);
+		return NETDEV_TX_OK;
+	}
+
+	while (skb_queue_len(&dev->blist)) {
+		skb = __skb_dequeue(skbs);
+		if (!skb)
+			break;
+		skb_queue_tail(&tun->readq, skb);
+	}
+
+	qlen = skb_queue_len(&tun->readq);
+	if (qlen >= dev->tx_queue_len) {
+		netif_stop_queue(dev);
+		tun->stats.tx_fifo_errors++;
+		dev->xmit_win = 1;
+	} else {
+		dev->xmit_win = dev->tx_queue_len - qlen;
+	}
+
+	/* Queue packet */
+	dev->trans_start = jiffies;
+
+	/* Notify and wake up reader process */
+	if (tun->flags & TUN_FASYNC)
+		kill_fasync(&tun->fasync, SIGIO, POLL_IN);
+	wake_up_interruptible(&tun->read_wait);
+
+	return NETDEV_TX_OK;
+}
+
 /* Net device start xmit */
 static int tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 {
@@ -207,6 +252,7 @@ static void tun_net_init(struct net_device *dev)
 		dev->tx_queue_len = TUN_READQ_SIZE;  /* We prefer our own queue length */
 		break;
 	}
+	dev->xmit_win = dev->tx_queue_len>>1; /* handwave, handwave */
 }
 
 /* Character device part */
@@ -382,7 +428,13 @@ static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec *iv,
 			schedule();
 			continue;
 		}
-		netif_wake_queue(tun->dev);
+		{
+			u32 t = skb_queue_len(&tun->readq);
+			if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
+				tun->dev->xmit_win = tun->dev->tx_queue_len;
+				netif_wake_queue(tun->dev);
+			}
+		}
 
 		/** Decide whether to accept this packet. This code is designed to
 		 * behave identically to an Ethernet interface. Accept the packet if
@@ -429,6 +481,7 @@ static void tun_setup(struct net_device *dev)
 	struct tun_struct *tun = netdev_priv(dev);
 
 	skb_queue_head_init(&tun->readq);
+	skb_queue_head_init(&dev->blist);
 	init_waitqueue_head(&tun->read_wait);
 
 	tun->owner = -1;
@@ -436,6 +489,8 @@ static void tun_setup(struct net_device *dev)
 	SET_MODULE_OWNER(dev);
 	dev->open = tun_net_open;
 	dev->hard_start_xmit = tun_net_xmit;
+	dev->hard_prep_xmit = NULL;
+	dev->hard_batch_xmit = tun_net_bxmit;
 	dev->stop = tun_net_close;
 	dev->get_stats = tun_net_stats;
 	dev->ethtool_ops = &tun_ethtool_ops;
@@ -458,7 +513,7 @@ static struct tun_struct *tun_get_by_name(const char *name)
 static int tun_set_iff(struct file *file, struct ifreq *ifr)
 {
 	struct tun_struct *tun;
-	struct net_device *dev;
+	struct net_device *dev = NULL;
 	int err;
 
 	tun = tun_get_by_name(ifr->ifr_name);
@@ -528,12 +583,15 @@ static int tun_set_iff(struct file *file, struct ifreq *ifr)
 	}
 
 	DBG(KERN_INFO "%s: tun_set_iff\n", tun->dev->name);
+	dev->features |= NETIF_F_BTX;
 
 	if (ifr->ifr_flags & IFF_NO_PI)
 		tun->flags |= TUN_NO_PI;
 
-	if (ifr->ifr_flags & IFF_ONE_QUEUE)
+	if (ifr->ifr_flags & IFF_ONE_QUEUE) {
 		tun->flags |= TUN_ONE_QUEUE;
+		dev->features &= ~NETIF_F_BTX;
+	}
 
 	file->private_data = tun;
 	tun->attached = 1;

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
       [not found]       ` <OF6757F56D.EE5984FD-ON872572DC.0081026C-882572DC.00814B8F@us.ibm.com>
@ 2007-05-15 23:36         ` David Miller
  0 siblings, 0 replies; 19+ messages in thread
From: David Miller @ 2007-05-15 23:36 UTC (permalink / raw)
  To: xma; +Cc: ak, krkumar2, netdev, netdev-owner, rdreier

From: Shirley Ma <xma@us.ibm.com>
Date: Tue, 15 May 2007 16:33:22 -0700

 >       That's interesting. So a generic LRO in the interface layer would
 > benefit performance more, right? Going through TCP N times on the receive
 > path is more expensive than on the send path, I think.

If you look at some of the drivers doing LRO, the bulk of the
implementation is in software, so yes :-)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits
  2007-05-15 22:48         ` jamal
@ 2007-05-16  0:50           ` jamal
  0 siblings, 0 replies; 19+ messages in thread
From: jamal @ 2007-05-16  0:50 UTC (permalink / raw)
  To: David Miller
  Cc: xma, rdreier, ak, krkumar2, netdev, ashwin.chaugule,
	Evgeniy Polyakov, Max Krasnyansky, Gagan Arneja

On Tue, 2007-15-05 at 18:48 -0400, jamal wrote:

> I will try to post the e1000 patch tonight or tomorrow morning.

I have the e1000 patch done; a few features from 2.6.18 are missing
(mainly the one mucking with tx ring pruning on the tx path).
While it compiles and looks right - I haven't tested it and won't have
time for another day or so. However, if anyone wants it in its current
form, let me know and I will email you privately.

cheers,
jamal 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits
  2007-05-17  4:03           ` Krishna Kumar2
@ 2007-05-16 21:44             ` Sridhar Samudrala
  2007-05-17  5:01               ` Krishna Kumar2
  0 siblings, 1 reply; 19+ messages in thread
From: Sridhar Samudrala @ 2007-05-16 21:44 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: ak, ashwin.chaugule, David Miller, Gagan Arneja, hadi,
	Evgeniy Polyakov, netdev, netdev-owner, rdreier, xma

Krishna Kumar2 wrote:
> Hi Sridhar,
> 
> Sridhar Samudrala <sri@us.ibm.com> wrote on 05/17/2007 03:42:03 AM:
> 
>> AFAIK, gso_skb can be a list of skb's. Can we add a list
>> to another list using __skb_queue_head()?
>> Also, if gso_skb is a list of multiple skb's, I think the
>> count needs to be decremented by the number of segments in
>> gso_skb.
> 
> gso_skb is the last GSO skb that failed to be sent. This already
> segmented skb is kept "cached" and whenever the next xmit happens,
> this skb is first sent before any packets from the queue are taken
> out (otherwise out-of-order packets). So there can be at most one
> gso_skb per device.

Yes. There can be only one gso_skb per device. But it can have more
than one segment linked together via skb->next.

Thanks
Sridhar


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits
  2007-05-15 22:17       ` [WIP] [PATCH] WAS " jamal
  2007-05-15 22:48         ` jamal
@ 2007-05-16 22:12         ` Sridhar Samudrala
  2007-05-16 22:52           ` jamal
  2007-05-17  4:03           ` Krishna Kumar2
  1 sibling, 2 replies; 19+ messages in thread
From: Sridhar Samudrala @ 2007-05-16 22:12 UTC (permalink / raw)
  To: hadi
  Cc: David Miller, xma, rdreier, ak, krkumar2, netdev, netdev-owner,
	ashwin.chaugule, Evgeniy Polyakov, Gagan Arneja

Jamal,

Here are some comments I have on your patch.
See them inline.

Thanks
Sridhar

+static int try_get_tx_pkts(struct net_device *dev, struct Qdisc *q, int count)
+{
+       struct sk_buff *skb;
+       struct sk_buff_head *skbs = &dev->blist;
+       int tdq = count;
+
+       /* 
+        * very unlikely, but who knows ..
+        * If this happens we dont try to grab more pkts
+        */
+       if (!skb_queue_empty(&dev->blist))
+               return skb_queue_len(&dev->blist);
+
+       if (dev->gso_skb) {
+               count--;
+               __skb_queue_head(skbs, dev->gso_skb);
+               dev->gso_skb = NULL;
+       }

AFAIK, gso_skb can be a list of skb's. Can we add a list
to another list using __skb_queue_head()?
Also, if gso_skb is a list of multiple skb's, I think the
count needs to be decremented by the number of segments in
gso_skb.

+
+       while (count) {
+               skb = q->dequeue(q);
+               if (!skb)
+                       break;
+               count--;
+               __skb_queue_head(skbs, skb);
+       }
+
+       return tdq - count;
+}
+
+static inline int try_tx_pkts(struct net_device *dev)
+{
+
+       return dev->hard_batch_xmit(&dev->blist, dev);
+
+}
+
+/* same comments as in qdisc_restart apply;
+ * at some point use shared code with qdisc_restart*/
+int batch_qdisc_restart(struct net_device *dev)
+{
+       struct Qdisc *q = dev->qdisc;
+       unsigned lockless = (dev->features & NETIF_F_LLTX);
+       int count = dev->xmit_win;
+       int ret = 0;
+
+       ret = try_get_tx_pkts(dev, q, count);
+
+       if (ret == 0)
+               return qdisc_qlen(q);
+
+       /* we have packets to send! */
+       if (!lockless) {
+               if (!netif_tx_trylock(dev))
+                       return tx_islocked(NULL, dev, q);
+       }
+
+       /* all clear .. */
+       spin_unlock(&dev->queue_lock);
+
+       ret = NETDEV_TX_BUSY;
+       if (!netif_queue_stopped(dev))
+               ret = try_tx_pkts(dev);

try_tx_pkts() is directly calling the device's batch xmit routine.
Don't we need to call dev_hard_start_xmit() to handle dev_queue_xmit_nit
and GSO segmentation?

+
+       if (!lockless)
+               netif_tx_unlock(dev);
+
+       spin_lock(&dev->queue_lock);
+
+       q = dev->qdisc;
+
+       /* most likely result, packet went ok */
+       if (ret == NETDEV_TX_OK)
+               return qdisc_qlen(q);
+       /* only for lockless drivers .. */
+       if (ret == NETDEV_TX_LOCKED && lockless)
+               return tx_islocked(NULL, dev, q);
+
+       if (unlikely(ret != NETDEV_TX_BUSY && net_ratelimit()))
+               printk(KERN_WARNING " BUG %s code %d qlen %d\n",
+                      dev->name, ret, q->q.qlen);
+
+       return do_dev_requeue(NULL, dev, q);
+}




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits
  2007-05-16 22:12         ` Sridhar Samudrala
@ 2007-05-16 22:52           ` jamal
  2007-05-17  3:25             ` jamal
  2007-05-17  4:03           ` Krishna Kumar2
  1 sibling, 1 reply; 19+ messages in thread
From: jamal @ 2007-05-16 22:52 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: David Miller, xma, rdreier, ak, krkumar2, netdev, netdev-owner,
	ashwin.chaugule, Evgeniy Polyakov, Gagan Arneja

On Wed, 2007-16-05 at 15:12 -0700, Sridhar Samudrala wrote:
> Jamal,
> 
> Here are some comments i have on your patch.
> See them inline.
> 

Thanks for taking the time, Sridhar.

> try_tx_pkts() is directly calling the device's batch xmit routine.
> Don't we need to call dev_hard_start_xmit() to handle dev_queue_xmit_nit
> and GSO segmentation?
> 

I think this is the core of all your other comments.
Good catch - that piece of code has changed since 2.6.18, so the patch
I posted won't work with GSO.

You actually caught a bug in my other patch as well that nobody else
did ;->

I will have to think a bit about this; I may end up coalescing when
grabbing the packets but calling the nit from the driver using a helper.

cheers,
jamal


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits
  2007-05-16 22:52           ` jamal
@ 2007-05-17  3:25             ` jamal
  2007-05-18 12:07               ` jamal
  0 siblings, 1 reply; 19+ messages in thread
From: jamal @ 2007-05-17  3:25 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: David Miller, xma, rdreier, ak, krkumar2, netdev, ashwin.chaugule,
	Evgeniy Polyakov, Auke, Gagan Arneja

[-- Attachment #1: Type: text/plain, Size: 808 bytes --]

On Wed, 2007-16-05 at 18:52 -0400, jamal wrote:
> On Wed, 2007-16-05 at 15:12 -0700, Sridhar Samudrala wrote:

> 
> I will have to think a bit about this; I may end up coalescing when
> grabbing the packets but calling the nit from the driver using a helper.
> 

That's what I did. This would hopefully work with GSO now (in fact nit
works now with GSO when it didn't before).

This patch now includes two changed drivers (tun and e1000). I have
tested tun with this patch. I tested e1000 earlier and I couldn't find
any issues - although, as the title says, it's a WIP.

As before you need net-2.6. You also need the qdisc restart cleanup
patch.

Please comment.

If all is good, I think my next effort will be to convert pktgen to be
aware of the API so we can have some serious traffic generation.

cheers,
jamal

[-- Attachment #2: combined-batch-p --]
[-- Type: text/x-patch, Size: 25942 bytes --]

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 637ae8f..b4c900e 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -154,6 +154,8 @@ static void e1000_update_phy_info(unsigned long data);
 static void e1000_watchdog(unsigned long data);
 static void e1000_82547_tx_fifo_stall(unsigned long data);
 static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
+static int e1000_prep_queue_frame(struct sk_buff *skb, struct net_device *dev);
+static int e1000_xmit_frames(struct sk_buff_head *list, struct net_device *dev);
 static struct net_device_stats * e1000_get_stats(struct net_device *netdev);
 static int e1000_change_mtu(struct net_device *netdev, int new_mtu);
 static int e1000_set_mac(struct net_device *netdev, void *p);
@@ -932,6 +934,8 @@ e1000_probe(struct pci_dev *pdev,
 	netdev->open = &e1000_open;
 	netdev->stop = &e1000_close;
 	netdev->hard_start_xmit = &e1000_xmit_frame;
+	netdev->hard_prep_xmit = &e1000_prep_queue_frame;
+	netdev->hard_batch_xmit = &e1000_xmit_frames;
 	netdev->get_stats = &e1000_get_stats;
 	netdev->set_multicast_list = &e1000_set_multi;
 	netdev->set_mac_address = &e1000_set_mac;
@@ -940,6 +944,7 @@ e1000_probe(struct pci_dev *pdev,
 	e1000_set_ethtool_ops(netdev);
 	netdev->tx_timeout = &e1000_tx_timeout;
 	netdev->watchdog_timeo = 5 * HZ;
+	skb_queue_head_init(&netdev->blist);
 #ifdef CONFIG_E1000_NAPI
 	netdev->poll = &e1000_clean;
 	netdev->weight = 64;
@@ -998,6 +1003,7 @@ e1000_probe(struct pci_dev *pdev,
 		netdev->features |= NETIF_F_HIGHDMA;
 
 	netdev->features |= NETIF_F_LLTX;
+	netdev->features |= NETIF_F_BTX;
 
 	adapter->en_mng_pt = e1000_enable_mng_pass_thru(&adapter->hw);
 
@@ -1155,6 +1161,7 @@ e1000_probe(struct pci_dev *pdev,
 	if ((err = register_netdev(netdev)))
 		goto err_register;
 
+	netdev->xmit_win = adapter->tx_ring->count>>1;
 	/* tell the stack to leave us alone until e1000_open() is called */
 	netif_carrier_off(netdev);
 	netif_stop_queue(netdev);
@@ -1449,6 +1456,7 @@ e1000_open(struct net_device *netdev)
 	/* fire a link status change interrupt to start the watchdog */
 	E1000_WRITE_REG(&adapter->hw, ICS, E1000_ICS_LSC);
 
+	printk("%s Batch window is %d\n",netdev->name, netdev->xmit_win);
 	return E1000_SUCCESS;
 
 err_req_irq:
@@ -1503,6 +1511,7 @@ e1000_close(struct net_device *netdev)
 	    e1000_check_mng_mode(&adapter->hw))
 		e1000_release_hw_control(adapter);
 
+	skb_queue_purge(&netdev->blist);
 	return 0;
 }
 
@@ -3098,6 +3107,18 @@ e1000_tx_map(struct e1000_adapter *adapter, struct e1000_tx_ring *tx_ring,
 }
 
 static void
+e1000_kick_DMA(struct e1000_adapter *adapter,
+               struct e1000_tx_ring *tx_ring, int i)
+{
+	wmb();
+	writel(i, adapter->hw.hw_addr + tx_ring->tdt);
+	/* we need this if more than one processor can write to our tail
+	** at a time, it syncronizes IO on IA64/Altix systems */
+	mmiowb();
+}
+
+
+static void
 e1000_tx_queue(struct e1000_adapter *adapter, struct e1000_tx_ring *tx_ring,
                int tx_flags, int count)
 {
@@ -3139,17 +3160,7 @@ e1000_tx_queue(struct e1000_adapter *adapter, struct e1000_tx_ring *tx_ring,
 
 	tx_desc->lower.data |= cpu_to_le32(adapter->txd_cmd);
 
-	/* Force memory writes to complete before letting h/w
-	 * know there are new descriptors to fetch.  (Only
-	 * applicable for weak-ordered memory model archs,
-	 * such as IA-64). */
-	wmb();
-
 	tx_ring->next_to_use = i;
-	writel(i, adapter->hw.hw_addr + tx_ring->tdt);
-	/* we need this if more than one processor can write to our tail
-	 * at a time, it syncronizes IO on IA64/Altix systems */
-	mmiowb();
 }
 
 /**
@@ -3256,54 +3267,60 @@ static int e1000_maybe_stop_tx(struct net_device *netdev,
 }
 
 #define TXD_USE_COUNT(S, X) (((S) >> (X)) + 1 )
+struct e1000_tx_cbdata {
+	int count;
+	unsigned int max_per_txd;
+	unsigned int nr_frags;
+	unsigned int mss;
+};
+
+#define E1000_SKB_CB(__skb)       ((struct e1000_tx_cbdata *)&((__skb)->cb[0]))
+#define NETDEV_TX_DROPPED	-5
+
 static int
-e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
+e1000_prep_queue_frame(struct sk_buff *skb, struct net_device *netdev)
 {
-	struct e1000_adapter *adapter = netdev_priv(netdev);
 	struct e1000_tx_ring *tx_ring;
-	unsigned int first, max_per_txd = E1000_MAX_DATA_PER_TXD;
+	unsigned int f;
+	struct e1000_adapter *adapter = netdev_priv(netdev);
 	unsigned int max_txd_pwr = E1000_MAX_TXD_PWR;
-	unsigned int tx_flags = 0;
 	unsigned int len = skb->len;
-	unsigned long flags;
-	unsigned int nr_frags = 0;
-	unsigned int mss = 0;
-	int count = 0;
-	int tso;
-	unsigned int f;
+
+	struct e1000_tx_cbdata *cb = E1000_SKB_CB(skb);
+	cb->mss = 0;
+	cb->nr_frags = 0;
+	cb->max_per_txd = E1000_MAX_DATA_PER_TXD;
+	cb->count = 0;
+
 	len -= skb->data_len;
 
-	/* This goes back to the question of how to logically map a tx queue
-	 * to a flow.  Right now, performance is impacted slightly negatively
-	 * if using multiple tx queues.  If the stack breaks away from a
-	 * single qdisc implementation, we can look at this again. */
 	tx_ring = adapter->tx_ring;
 
 	if (unlikely(skb->len <= 0)) {
 		dev_kfree_skb_any(skb);
-		return NETDEV_TX_OK;
+		return NETDEV_TX_DROPPED;
 	}
 
-	/* 82571 and newer doesn't need the workaround that limited descriptor
-	 * length to 4kB */
+	/* 82571 and newer doesn't need the workaround that limited 
+	   descriptor length to 4kB */
 	if (adapter->hw.mac_type >= e1000_82571)
-		max_per_txd = 8192;
+		cb->max_per_txd = 8192;
 
-	mss = skb_shinfo(skb)->gso_size;
+	cb->mss = skb_shinfo(skb)->gso_size;
 	/* The controller does a simple calculation to
 	 * make sure there is enough room in the FIFO before
 	 * initiating the DMA for each buffer.  The calc is:
 	 * 4 = ceil(buffer len/mss).  To make sure we don't
 	 * overrun the FIFO, adjust the max buffer len if mss
 	 * drops. */
-	if (mss) {
+	if (cb->mss) {
 		uint8_t hdr_len;
-		max_per_txd = min(mss << 2, max_per_txd);
-		max_txd_pwr = fls(max_per_txd) - 1;
+		cb->max_per_txd = min(cb->mss << 2, cb->max_per_txd);
+		max_txd_pwr = fls(cb->max_per_txd) - 1;
 
 		/* TSO Workaround for 82571/2/3 Controllers -- if skb->data
-		* points to just header, pull a few bytes of payload from
-		* frags into skb->data */
+		 * points to just header, pull a few bytes of payload from
+		 * frags into skb->data */
 		hdr_len = skb_transport_offset(skb) + tcp_hdrlen(skb);
 		if (skb->data_len && (hdr_len == (skb->len - skb->data_len))) {
 			switch (adapter->hw.mac_type) {
@@ -3315,7 +3332,8 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 				 * NOTE: this is a TSO only workaround
 				 * if end byte alignment not correct move us
 				 * into the next dword */
-				if ((unsigned long)(skb_tail_pointer(skb) - 1) & 4)
+				if ((unsigned long)(skb_tail_pointer(skb) -
+						    1) & 4)
 					break;
 				/* fall through */
 			case e1000_82571:
@@ -3327,7 +3345,7 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 					DPRINTK(DRV, ERR,
 						"__pskb_pull_tail failed.\n");
 					dev_kfree_skb_any(skb);
-					return NETDEV_TX_OK;
+					return NETDEV_TX_DROPPED;
 				}
 				len = skb->len - skb->data_len;
 				break;
@@ -3339,46 +3357,56 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 	}
 
 	/* reserve a descriptor for the offload context */
-	if ((mss) || (skb->ip_summed == CHECKSUM_PARTIAL))
-		count++;
-	count++;
+	if ((cb->mss) || (skb->ip_summed == CHECKSUM_PARTIAL))
+		cb->count++;
+	cb->count++;
 
 	/* Controller Erratum workaround */
 	if (!skb->data_len && tx_ring->last_tx_tso && !skb_is_gso(skb))
-		count++;
+		cb->count++;
 
-	count += TXD_USE_COUNT(len, max_txd_pwr);
+	cb->count += TXD_USE_COUNT(len, max_txd_pwr);
 
 	if (adapter->pcix_82544)
-		count++;
+		cb->count++;
 
 	/* work-around for errata 10 and it applies to all controllers
 	 * in PCI-X mode, so add one more descriptor to the count
 	 */
 	if (unlikely((adapter->hw.bus_type == e1000_bus_type_pcix) &&
-			(len > 2015)))
-		count++;
+		     (len > 2015)))
+		cb->count++;
 
-	nr_frags = skb_shinfo(skb)->nr_frags;
-	for (f = 0; f < nr_frags; f++)
-		count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size,
-				       max_txd_pwr);
+	cb->nr_frags = skb_shinfo(skb)->nr_frags;
+	for (f = 0; f < cb->nr_frags; f++)
+		cb->count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size,
+					   max_txd_pwr);
 	if (adapter->pcix_82544)
-		count += nr_frags;
-
+		cb->count += cb->nr_frags;
 
 	if (adapter->hw.tx_pkt_filtering &&
 	    (adapter->hw.mac_type == e1000_82573))
 		e1000_transfer_dhcp_info(adapter, skb);
 
-	if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags))
-		/* Collision - tell upper layer to requeue */
-		return NETDEV_TX_LOCKED;
+	return NETDEV_TX_OK;
+}
+
+/* invoked under tx_ring->lock */
+static int e1000_queue_frame(struct sk_buff *skb, struct net_device *netdev)
+{
+	struct e1000_tx_ring *tx_ring;
+	int tso;
+	unsigned int first;
+	struct e1000_adapter *adapter = netdev_priv(netdev);
+	unsigned int tx_flags = 0;
+
+	struct e1000_tx_cbdata *cb = E1000_SKB_CB(skb);
+	tx_ring = adapter->tx_ring;
 
 	/* need: count + 2 desc gap to keep tail from touching
 	 * head, otherwise try next time */
-	if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, count + 2))) {
-		spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
+	if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, cb->count + 2))) {
+		netif_stop_queue(netdev);
 		return NETDEV_TX_BUSY;
 	}
 
@@ -3386,7 +3414,6 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 		if (unlikely(e1000_82547_fifo_workaround(adapter, skb))) {
 			netif_stop_queue(netdev);
 			mod_timer(&adapter->tx_fifo_stall_timer, jiffies + 1);
-			spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
 			return NETDEV_TX_BUSY;
 		}
 	}
@@ -3401,8 +3428,7 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 	tso = e1000_tso(adapter, tx_ring, skb);
 	if (tso < 0) {
 		dev_kfree_skb_any(skb);
-		spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
-		return NETDEV_TX_OK;
+		return NETDEV_TX_DROPPED;
 	}
 
 	if (likely(tso)) {
@@ -3418,16 +3444,157 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 		tx_flags |= E1000_TX_FLAGS_IPV4;
 
 	e1000_tx_queue(adapter, tx_ring, tx_flags,
-	               e1000_tx_map(adapter, tx_ring, skb, first,
-	                            max_per_txd, nr_frags, mss));
+		       e1000_tx_map(adapter, tx_ring, skb, first,
+				    cb->max_per_txd, cb->nr_frags, cb->mss));
 
-	netdev->trans_start = jiffies;
+	return NETDEV_TX_OK;
+}
+
+/* called with tx_ring->lock held */
+static int real_e1000_xmit_frame(struct sk_buff *skb, struct net_device *dev)
+{
+	struct e1000_adapter *adapter = netdev_priv(dev);
+	int ret = NETDEV_TX_OK;
+	struct e1000_tx_ring *tx_ring = adapter->tx_ring;
+
+	ret = e1000_queue_frame(skb, dev);
+
+	if (ret == NETDEV_TX_OK) {
+		e1000_kick_DMA(adapter, tx_ring, adapter->tx_ring->next_to_use);
+		dev->trans_start = jiffies;
+	}
+	if (ret == NETDEV_TX_DROPPED)
+		ret = NETDEV_TX_OK;
 
-	/* Make sure there is space in the ring for the next send. */
-	e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2);
+	/* XXX: This seems so unnecessary, because if we are 
+	 * NETDEV_TX_BUSY already, we are already 
+	 * netif_queue_stopped(dev)
+	 * but its what the driver does, resolve later */
 
+	if (unlikely(e1000_maybe_stop_tx(dev, tx_ring, MAX_SKB_FRAGS + 2))) {
+		dev->xmit_win = 1;
+		netif_stop_queue(dev);
+		ret = NETDEV_TX_BUSY;
+	} else {
+		int rspace = E1000_DESC_UNUSED(tx_ring) - (MAX_SKB_FRAGS + 2);
+		dev->xmit_win = rspace;
+	}
+
+	if (ret == NETDEV_TX_BUSY)
+		printk("Single: %s stopped with win of %d\n",
+			dev->name,dev->xmit_win);
+	return ret;
+}
+
+/* single frame transmitter */
+static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
+{
+	int ret = NETDEV_TX_OK;
+	struct e1000_adapter *adapter = netdev_priv(netdev);
+	struct e1000_tx_ring *tx_ring = adapter->tx_ring;
+	unsigned long flags;
+	struct e1000_tx_cbdata *cb; 
+
+	/* hopefully we will never cb data > 48 bytes .. */
+	memset(skb->cb, 0, sizeof(skb->cb));
+	ret = netdev->hard_prep_xmit(skb, netdev);
+	if (ret != NETDEV_TX_OK)
+		return NETDEV_TX_OK;
+
+	if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) {
+		/* Collision - tell upper layer to requeue */
+		return NETDEV_TX_LOCKED;
+	}
+
+	cb = E1000_SKB_CB(skb);
+	/* need: count + 2 desc gap to keep tail from touching
+	 * head, otherwise try next time */
+	/* XXX: This seems so unnecessary, because if we are 
+	 * NETDEV_TX_BUSY already, we are already 
+	 * netif_queue_stopped(dev)
+	 * but its what the driver does, resolve later */
+	if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, cb->count + 2))) {
+		spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
+		return NETDEV_TX_BUSY;
+	}
+
+	ret = real_e1000_xmit_frame(skb, netdev);
 	spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
-	return NETDEV_TX_OK;
+	return ret;
+}
+
+/*
+ * Batch transmit
+*/
+static int
+e1000_xmit_frames(struct sk_buff_head *list, struct net_device *netdev)
+{
+	struct e1000_adapter *adapter = netdev->priv;
+	struct e1000_tx_ring *tx_ring = adapter->tx_ring;
+	int ret = NETDEV_TX_OK;
+	int didq = 0;
+	struct sk_buff *skb = NULL;
+	unsigned long flags;
+
+	/* 
+	 * we should probably wait for this lock!
+	 */
+	if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) {
+		/* Collision - tell upper layer to requeue */
+		return NETDEV_TX_LOCKED;
+	}
+
+	while ((skb = __skb_dequeue(list)) != NULL) {
+		memset(skb->cb, 0, sizeof(skb->cb)); /* remove? */
+		ret = netdev->hard_prep_xmit(skb, netdev);
+		if (ret != NETDEV_TX_OK)
+			continue;
+
+		/*XXX: This may be an opportunity to not give nit
+		 * the packet if the dev ix TX BUSY ;-> */
+		dev_do_xmit_nit(skb, netdev);
+		ret = e1000_queue_frame(skb, netdev);
+		if (ret == NETDEV_TX_OK) {
+			didq++;
+		} else {
+			/* should never happen, but murphy is around */
+			if (ret == NETDEV_TX_BUSY) {
+				__skb_queue_head(list, skb);
+				    break;
+			}
+		}
+	}
+
+	/* we tried to send as many as we could .. */
+	if (didq) {
+		e1000_kick_DMA(adapter, tx_ring, adapter->tx_ring->next_to_use);
+		netdev->trans_start = jiffies;
+	}
+
+	if (ret == NETDEV_TX_DROPPED)
+		ret = NETDEV_TX_OK;
+
+	/* XXX: This seems so unnecessary, because if we are 
+	 * NETDEV_TX_BUSY already, we are already 
+	 * netif_queue_stopped(dev)
+	 * but its what the driver does, resolve later */
+	/* need: MAX_SKB_FRAGS + 2 desc gap to keep tail from touching
+	 * head, otherwise try next time */
+	if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2))) {
+		netdev->xmit_win = 1;
+		netif_stop_queue(netdev);
+		ret = NETDEV_TX_BUSY;
+	} else {
+		int rspace = E1000_DESC_UNUSED(tx_ring) - (MAX_SKB_FRAGS + 2);
+		netdev->xmit_win = rspace;
+		printk("batch %s still awake with win of %d\n",
+			netdev->name, netdev->xmit_win);
+	}
+	spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
+	if (ret == NETDEV_TX_BUSY)
+		printk("Batch: %s stopped with win of %d\n",
+			netdev->name, netdev->xmit_win);
+	return ret;
 }
 
 /**
@@ -4032,7 +4199,10 @@ e1000_clean_tx_irq(struct e1000_adapter *adapter,
 		 */
 		smp_mb();
 		if (netif_queue_stopped(netdev)) {
+			netdev->xmit_win = E1000_DESC_UNUSED(tx_ring);
 			netif_wake_queue(netdev);
+			printk(" %s woken with win of %d\n",
+				netdev->name,netdev->xmit_win);
 			++adapter->restart_queue;
 		}
 	}
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a2c6caa..e128ae3 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -70,6 +70,7 @@
 static int debug;
 #endif
 
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
 /* Network device part of the driver */
 
 static LIST_HEAD(tun_dev_list);
@@ -86,9 +87,56 @@ static int tun_net_open(struct net_device *dev)
 static int tun_net_close(struct net_device *dev)
 {
 	netif_stop_queue(dev);
+	skb_queue_purge(&dev->blist);
 	return 0;
 }
 
+/* Batch Net device start xmit
+ * combine with non-batching version
+ * */
+static int tun_net_bxmit(struct sk_buff_head *skbs, struct net_device *dev)
+{
+	struct sk_buff *skb;
+	int didq = 0;
+	struct tun_struct *tun = netdev_priv(dev);
+	u32 qlen = skb_queue_len(&tun->readq);
+
+	/* Drop packet if interface is not attached */
+	if (!tun->attached) {
+		tun->stats.tx_dropped+=skb_queue_len(&dev->blist);
+		skb_queue_purge(&dev->blist);
+		return NETDEV_TX_OK;
+	}
+
+	while (skb_queue_len(&dev->blist)) {
+		skb = __skb_dequeue(skbs);
+		if (!skb)
+			break;
+		dev_do_xmit_nit(skb, dev);
+		skb_queue_tail(&tun->readq, skb);
+		didq++;
+	}
+
+	qlen = skb_queue_len(&tun->readq);
+	if (qlen >= dev->tx_queue_len) {
+		netif_stop_queue(dev);
+		tun->stats.tx_fifo_errors++;
+		dev->xmit_win = 1;
+	} else {
+		dev->xmit_win = dev->tx_queue_len - qlen;
+	}
+
+	if (didq)
+		dev->trans_start = jiffies;
+
+	/* Notify and wake up reader process */
+	if (tun->flags & TUN_FASYNC)
+		kill_fasync(&tun->fasync, SIGIO, POLL_IN);
+	wake_up_interruptible(&tun->read_wait);
+
+	return NETDEV_TX_OK;
+}
+
 /* Net device start xmit */
 static int tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 {
@@ -207,6 +255,7 @@ static void tun_net_init(struct net_device *dev)
 		dev->tx_queue_len = TUN_READQ_SIZE;  /* We prefer our own queue length */
 		break;
 	}
+	dev->xmit_win = dev->tx_queue_len>>1; /* handwave, handwave */
 }
 
 /* Character device part */
@@ -382,7 +431,13 @@ static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec *iv,
 			schedule();
 			continue;
 		}
-		netif_wake_queue(tun->dev);
+		{
+			u32 t = skb_queue_len(&tun->readq);
+			if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
+				tun->dev->xmit_win = tun->dev->tx_queue_len;
+				netif_wake_queue(tun->dev);
+			}
+		}
 
 		/** Decide whether to accept this packet. This code is designed to
 		 * behave identically to an Ethernet interface. Accept the packet if
@@ -429,6 +484,7 @@ static void tun_setup(struct net_device *dev)
 	struct tun_struct *tun = netdev_priv(dev);
 
 	skb_queue_head_init(&tun->readq);
+	skb_queue_head_init(&dev->blist);
 	init_waitqueue_head(&tun->read_wait);
 
 	tun->owner = -1;
@@ -436,6 +492,8 @@ static void tun_setup(struct net_device *dev)
 	SET_MODULE_OWNER(dev);
 	dev->open = tun_net_open;
 	dev->hard_start_xmit = tun_net_xmit;
+	dev->hard_prep_xmit = NULL;
+	dev->hard_batch_xmit = tun_net_bxmit;
 	dev->stop = tun_net_close;
 	dev->get_stats = tun_net_stats;
 	dev->ethtool_ops = &tun_ethtool_ops;
@@ -458,7 +516,7 @@ static struct tun_struct *tun_get_by_name(const char *name)
 static int tun_set_iff(struct file *file, struct ifreq *ifr)
 {
 	struct tun_struct *tun;
-	struct net_device *dev;
+	struct net_device *dev = NULL;
 	int err;
 
 	tun = tun_get_by_name(ifr->ifr_name);
@@ -528,12 +586,15 @@ static int tun_set_iff(struct file *file, struct ifreq *ifr)
 	}
 
 	DBG(KERN_INFO "%s: tun_set_iff\n", tun->dev->name);
+	dev->features |= NETIF_F_BTX;
 
 	if (ifr->ifr_flags & IFF_NO_PI)
 		tun->flags |= TUN_NO_PI;
 
-	if (ifr->ifr_flags & IFF_ONE_QUEUE)
+	if (ifr->ifr_flags & IFF_ONE_QUEUE) {
 		tun->flags |= TUN_ONE_QUEUE;
+		dev->features &= ~NETIF_F_BTX;
+	}
 
 	file->private_data = tun;
 	tun->attached = 1;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f671cd2..85a1baf 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -325,6 +325,7 @@ struct net_device
 #define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
 #define NETIF_F_GSO		2048	/* Enable software GSO. */
 #define NETIF_F_LLTX		4096	/* LockLess TX */
+#define NETIF_F_BTX		8192	/* Capable of batch tx */
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -450,6 +451,11 @@ struct net_device
 	void			*priv;	/* pointer to private data	*/
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 						    struct net_device *dev);
+	int			(*hard_batch_xmit) (struct sk_buff_head *list,
+						    struct net_device *dev);
+	int			(*hard_prep_xmit) (struct sk_buff *skb,
+						    struct net_device *dev);
+	int			xmit_win;
 	/* These may be needed for future network-power-down code. */
 	unsigned long		trans_start;	/* Time (in jiffies) of last Tx	*/
 
@@ -466,6 +472,10 @@ struct net_device
 	struct list_head	todo_list;
 	/* device index hash chain */
 	struct hlist_node	index_hlist;
+	/*XXX: Fix eventually to not allocate if device not
+	 *batch capable
+	*/
+	struct sk_buff_head	blist;
 
 	struct net_device	*link_watch_next;
 
@@ -742,7 +752,12 @@ extern int		dev_set_mac_address(struct net_device *,
 					    struct sockaddr *);
 extern int		dev_hard_start_xmit(struct sk_buff *skb,
 					    struct net_device *dev);
-
+extern int		do_gso_skb(struct sk_buff *skb,
+			           struct sk_buff_head *skbs);
+extern int		do_possible_gso_skb(struct sk_buff *skb,
+					    struct net_device *dev);
+extern void 		dev_do_xmit_nit(struct sk_buff *skb, 
+					struct net_device *dev);
 extern void		dev_init(void);
 
 extern int		netdev_budget;
diff --git a/net/core/dev.c b/net/core/dev.c
index 8301e2a..0d728cd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1372,6 +1372,47 @@ out_kfree_skb:
 	return 0;
 }
 
+int do_gso_skb(struct sk_buff *skb, struct sk_buff_head *skbs)
+{
+	int tdq = 0;
+	do {
+		struct sk_buff *nskb = skb->next;
+
+		skb->next = nskb->next;
+		nskb->next = NULL;
+		tdq++;
+		__skb_queue_head(skbs, skb);
+	} while (skb->next);
+	skb->destructor = DEV_GSO_CB(skb)->destructor;
+
+	return tdq;
+}
+
+int do_possible_gso_skb(struct sk_buff *skb, struct net_device *dev)
+{
+	struct sk_buff_head *skbs = &dev->blist;
+
+	if (netif_needs_gso(dev, skb)) {
+		if (unlikely(dev_gso_segment(skb))) {
+			kfree_skb(skb);
+			return 0;
+		}
+		if (skb->next)
+			return do_gso_skb(skb, skbs);
+	}
+
+	__skb_queue_head(skbs, skb);
+	return 1;
+}
+
+/* invoked by driver when batching once figured skb is sane*/
+void  dev_do_xmit_nit(struct sk_buff *skb, struct net_device *dev)
+{
+	if (!list_empty(&ptype_all))
+		dev_queue_xmit_nit(skb, dev);
+}
+
+
 #define HARD_TX_LOCK(dev, cpu) {			\
 	if ((dev->features & NETIF_F_LLTX) == 0) {	\
 		netif_tx_lock(dev);			\
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 9cd3a1c..530de14 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3217,9 +3217,7 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
 			pkt_dev->next_tx_us++;
 			pkt_dev->next_tx_ns -= 1000;
 		}
-	}
-
-	else {			/* Retry it next time */
+	} else { /* netif_queue_stopped -- Retry it next time */
 		pkt_dev->last_ok = 0;
 		pkt_dev->next_tx_us = getCurUs();	/* TODO */
 		pkt_dev->next_tx_ns = 0;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ed80054..4fe5a9b 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -85,10 +85,12 @@ static inline int
 do_dev_requeue(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q)
 {
 
-	if (unlikely(skb->next))
-		dev->gso_skb = skb;
-	else
-		q->ops->requeue(skb, q);
+	if (skb) {
+		if (unlikely(skb->next))
+			dev->gso_skb = skb;
+		else
+			q->ops->requeue(skb, q);
+	}
 	/* XXX: Could netif_schedule fail? Or is that fact we are
 	 * requeueing imply the hardware path is closed
 	 * and even if we fail, some interupt will wake us
@@ -116,7 +118,10 @@ tx_islocked(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q)
 	int ret = handle_dev_cpu_collision(dev);
 
 	if (ret == SCHED_TX_DROP) {
-		kfree_skb(skb);
+		if (skb) /* we are not batching */
+			kfree_skb(skb);
+		else if (!skb_queue_empty(&dev->blist))
+			skb_queue_purge(&dev->blist);
 		return qdisc_qlen(q);
 	}
 
@@ -195,10 +200,104 @@ static inline int qdisc_restart(struct net_device *dev)
 	return do_dev_requeue(skb, dev, q);
 }
 
+static int try_get_tx_pkts(struct net_device *dev, struct Qdisc *q, int count)
+{
+	struct sk_buff *skb;
+	struct sk_buff_head *skbs = &dev->blist;
+	int tdq = 0;
+
+	/* 
+	 * very unlikely, but who knows ..
+	 * If this happens we dont try to grab more pkts
+	 */
+	if (!skb_queue_empty(&dev->blist))
+		return skb_queue_len(&dev->blist);
+
+	if (unlikely(dev->gso_skb)) {
+		skb = dev->gso_skb;
+		dev->gso_skb = NULL;
+		tdq = do_gso_skb(skb, skbs);
+	}
+
+	if (tdq > count)
+		return tdq; /*we will stop here */
+
+	count -= tdq;
+	while (count > 0) {
+		skb = q->dequeue(q);
+		if (!skb)
+			break;
+		
+		tdq += do_possible_gso_skb(skb, dev);
+		count -= tdq;
+	}
+
+	return tdq;
+}
+
+static inline int try_tx_pkts(struct net_device *dev)
+{
+
+	return dev->hard_batch_xmit(&dev->blist, dev);
+
+}
+
+/* same comments as in qdisc_restart apply;
+ * at some point use shared code with qdisc_restart*/
+int batch_qdisc_restart(struct net_device *dev)
+{
+	struct Qdisc *q = dev->qdisc;
+	unsigned lockless = (dev->features & NETIF_F_LLTX);
+	int count = dev->xmit_win;
+	int ret = 0;
+
+	ret = try_get_tx_pkts(dev, q, count);
+
+	if (ret == 0)
+		return qdisc_qlen(q);
+
+	/* we have packets to send! */
+	if (!lockless) {
+		if (!netif_tx_trylock(dev))
+			return tx_islocked(NULL, dev, q);
+	}
+
+	/* all clear .. */
+	spin_unlock(&dev->queue_lock);
+
+	ret = NETDEV_TX_BUSY;
+	if (!netif_queue_stopped(dev))
+		ret = try_tx_pkts(dev);
+
+	if (!lockless)
+		netif_tx_unlock(dev);
+
+	spin_lock(&dev->queue_lock);
+
+	q = dev->qdisc;
+
+	/* most likely result, packet went ok */
+	if (ret == NETDEV_TX_OK)
+		return qdisc_qlen(q);
+	/* only for lockless drivers .. */
+	if (ret == NETDEV_TX_LOCKED && lockless)
+		return tx_islocked(NULL, dev, q);
+
+	if (unlikely(ret != NETDEV_TX_BUSY && net_ratelimit()))
+		printk(KERN_WARNING " BUG %s code %d qlen %d\n",
+		       dev->name, ret, q->q.qlen);
+
+	return do_dev_requeue(NULL, dev, q);
+}
+
 void __qdisc_run(struct net_device *dev)
 {
+	unsigned batching = (dev->features & NETIF_F_BTX);
+
 	do {
-		if (!qdisc_restart(dev))
+		if (!batching && !qdisc_restart(dev))
+			break;
+		else if (!batch_qdisc_restart(dev))
 			break;
 	} while (!netif_queue_stopped(dev));
 

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits
  2007-05-16 22:12         ` Sridhar Samudrala
  2007-05-16 22:52           ` jamal
@ 2007-05-17  4:03           ` Krishna Kumar2
  2007-05-16 21:44             ` Sridhar Samudrala
  1 sibling, 1 reply; 19+ messages in thread
From: Krishna Kumar2 @ 2007-05-17  4:03 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: ak, ashwin.chaugule, David Miller, Gagan Arneja, hadi,
	Evgeniy Polyakov, netdev, netdev-owner, rdreier, xma

Hi Sridhar,

Sridhar Samudrala <sri@us.ibm.com> wrote on 05/17/2007 03:42:03 AM:

> AFAIK, gso_skb can be a list of skb's. Can we add a list
> to another list using __skb_queue_head()?
> Also, if gso_skb is a list of multiple skb's, i think the
> count needs to be decremented by the number of segments in
> gso_skb.

gso_skb is the last GSO skb that failed to be sent. This already
segmented skb is kept "cached" and whenever the next xmit happens,
this skb is first sent before any packets from the queue are taken
out (otherwise out-of-order packets). So there can be at most one
gso_skb per device.

- KK


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits
  2007-05-16 21:44             ` Sridhar Samudrala
@ 2007-05-17  5:01               ` Krishna Kumar2
  0 siblings, 0 replies; 19+ messages in thread
From: Krishna Kumar2 @ 2007-05-17  5:01 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: ak, ashwin.chaugule, David Miller, Gagan Arneja, hadi,
	Evgeniy Polyakov, netdev, netdev-owner, rdreier, xma

Sridhar Samudrala <sri@us.ibm.com> wrote on 05/17/2007 03:14:41 AM:

> Krishna Kumar2 wrote:
> > Hi Sridhar,
> >
> > Sridhar Samudrala <sri@us.ibm.com> wrote on 05/17/2007 03:42:03 AM:
> >
> >> AFAIK, gso_skb can be a list of skb's. Can we add a list
> >> to another list using __skb_queue_head()?
> >> Also, if gso_skb is a list of multiple skb's, I think the
> >> count needs to be decremented by the number of segments in
> >> gso_skb.
> >
> > gso_skb is the last GSO skb that failed to be sent. This already
> > segmented skb is kept "cached" and whenever the next xmit happens,
> > this skb is first sent before any packets from the queue are taken
> > out (otherwise out-of-order packets). So there can be at most one
> > gso_skb per device.
>
> Yes. There can be only one gso_skb per device. But it can have more
> than one segment linked together via skb->next.

It will definitely have more than one segment (that's the criterion for
being gso_skb). But I think you are right: the segmented skb will "lose"
its 1-n segments (newsk->next = next) when it is added to the list.
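
One way to splice such a chain onto the batch list without losing the
trailing segments (a hypothetical helper, not part of the posted patch;
the caller would decrement its budget by the returned count, not by one):

#include <linux/skbuff.h>

static int queue_gso_segments(struct sk_buff *gso_skb,
			      struct sk_buff_head *blist)
{
	struct sk_buff *seg = gso_skb;
	int n = 0;

	while (seg) {
		struct sk_buff *next = seg->next;

		seg->next = NULL;		/* detach before queueing */
		__skb_queue_tail(blist, seg);	/* preserve segment order */
		seg = next;
		n++;
	}
	return n;
}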

Thanks,

- KK


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits
  2007-05-17  3:25             ` jamal
@ 2007-05-18 12:07               ` jamal
  0 siblings, 0 replies; 19+ messages in thread
From: jamal @ 2007-05-18 12:07 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: David Miller, xma, rdreier, ak, krkumar2, netdev, ashwin.chaugule,
	Evgeniy Polyakov, Auke, Gagan Arneja

On Wed, 2007-16-05 at 23:25 -0400, jamal wrote:

> 
> This patch now includes two changed drivers (tun and e1000). I have
> tested tun with this patch. I tested e1000 earlier and I couldn't find
> any issues - although, as the title says, it's a WIP.
> 
> As before you need net-2.6. You also need the qdisc restart cleanup
> patch.
> 

I have found a bug in e1000 which manifests under a lot of pounding with
traffic.
I am not going to post the patch anymore - but I will continue to fix
and add things. If you want the patches, email me privately.

cheers,
jamal


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-15 21:32     ` David Miller
  2007-05-15 22:17       ` [WIP] [PATCH] WAS " jamal
       [not found]       ` <OF6757F56D.EE5984FD-ON872572DC.0081026C-882572DC.00814B8F@us.ibm.com>
@ 2007-05-21  7:56       ` Herbert Xu
       [not found]         ` <OF9ABCD08D.2CD1B193-ON872572E3.007A6FC1-882572E3.007ACE1A@us.ibm.com>
  2 siblings, 1 reply; 19+ messages in thread
From: Herbert Xu @ 2007-05-21  7:56 UTC (permalink / raw)
  To: David Miller; +Cc: xma, rdreier, ak, krkumar2, netdev, netdev-owner

David Miller <davem@davemloft.net> wrote:
> From: Shirley Ma <xma@us.ibm.com>
> Date: Tue, 15 May 2007 14:22:57 -0700
> 
>>       I just wonder, without TSO support in HW, how much benefit we
>> can get by pushing GSO from the interface layer to the device layer,
>> besides being able to do multiple packets in IPoIB.
> 
> I bet the gain is non-trivial.

Yep, for any NIC that supports SG but not TSO then software GSO will
be a big win.  When the NIC doesn't support SG then the win is mostly
offset by the need to copy the packet again.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
       [not found]         ` <OF9ABCD08D.2CD1B193-ON872572E3.007A6FC1-882572E3.007ACE1A@us.ibm.com>
@ 2007-05-22 22:36           ` David Miller
       [not found]             ` <OFCF3EB7F8.9740C0C7-ON872572E3.007DADF6-882572E3.007E0E7B@us.ibm.com>
  2007-05-22 23:12             ` Herbert Xu
  0 siblings, 2 replies; 19+ messages in thread
From: David Miller @ 2007-05-22 22:36 UTC (permalink / raw)
  To: xma; +Cc: herbert, ak, krkumar2, netdev, netdev-owner, rdreier

From: Shirley Ma <xma@us.ibm.com>
Date: Tue, 22 May 2007 15:22:35 -0700

> > Yep, for any NIC that supports SG but not TSO then software GSO will
> > be a big win.  When the NIC doesn't support SG then the win is mostly
> > offset by the need to copy the packet again.
> >
> > Cheers,
> > --
> 
>       We could avoid the packet copy by allocating a number of 64K/MTU skb
> buffers instead of one big 64K buffer. Is that a possible approach?

I think you misunderstand the problem.

If the device does not support scatter gather, it is impossible to
avoid the copy.

SKBs from TSO are composed of discontiguous page chunks represented
by the skb_shared_info() array.  These can come either from sendmsg()
user data (which the kernel fills into a per-socket data page that is
reallocated once filled) or from sendfile() page cache pages.

Therefore, there has to be a copy somewhere to be able to give
that packet to a non-scatter-gather device.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
       [not found]             ` <OFCF3EB7F8.9740C0C7-ON872572E3.007DADF6-882572E3.007E0E7B@us.ibm.com>
@ 2007-05-22 23:04               ` David Miller
  0 siblings, 0 replies; 19+ messages in thread
From: David Miller @ 2007-05-22 23:04 UTC (permalink / raw)
  To: xma; +Cc: ak, herbert, krkumar2, netdev, netdev-owner, rdreier

From: Shirley Ma <xma@us.ibm.com>
Date: Tue, 22 May 2007 15:58:05 -0700

> Sorry for the confusion. I am thinking of avoiding the copy in skb_segment()
> for GSO. One way could be for tcp_sendmsg() to allocate small discontiguous
> buffers (equal to the MTU) instead of allocating pages.

The SKB splitting algorithm in TCP's transmit engine depends upon the
skb_shared_info() array being splittable at arbitrary points with only
page counts to manage.  This is the only way I found to make SKB
splitting at transmit time extremely inexpensive.

SACK block processing needs to perform these kinds of splits
as well, so it really, really has to be cheap.

The invariant is that every TCP TSO packet must have its header
at skb->data and all of its data in the paged skb_shared_info().

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-22 22:36           ` David Miller
       [not found]             ` <OFCF3EB7F8.9740C0C7-ON872572E3.007DADF6-882572E3.007E0E7B@us.ibm.com>
@ 2007-05-22 23:12             ` Herbert Xu
  1 sibling, 0 replies; 19+ messages in thread
From: Herbert Xu @ 2007-05-22 23:12 UTC (permalink / raw)
  To: David Miller; +Cc: xma, ak, krkumar2, netdev, netdev-owner, rdreier

On Tue, May 22, 2007 at 03:36:36PM -0700, David Miller wrote:
> 
> > > Yep, for any NIC that supports SG but not TSO then software GSO will
> > > be a big win.  When the NIC doesn't support SG then the win is mostly
> > > offset by the need to copy the packet again.

...
 
> SKB's from TSO are composed of discontiguous page chunks represented
> by the skb_shared_info() array.  These can come either from sendmsg()
> user data (which the kernel fills into a per-socket data page which is
> reallocated once filled), or from sendfile() page cache pages.

Yes, sendmsg() is the case where it's almost even whether GSO is
turned on or off, because GSO does an extra copy which offsets the
win in reduced per-packet cost.

For sendfile() it's still a win though since we have to copy it
anyway and doing it in GSO avoids the per-packet cost.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread (newest: 2007-05-22 23:12 UTC)

Thread overview: 19+ messages
     [not found] <OF0CAD6D87.DBE62968-ON872572DC.0073646A-882572DC.0073BEC2@us.ibm.com>
2007-05-15 21:17 ` [RFC] New driver API to speed up small packets xmits Roland Dreier
     [not found]   ` <OFF5654BB8.74EC8DCB-ON872572DC.00752079-882572DC.00756B23@us.ibm.com>
2007-05-15 21:25     ` Roland Dreier
     [not found]       ` <OF21D475A2.5E5C88DE-ON872572DC.00763DE4-882572DC.00768A7E@us.ibm.com>
2007-05-15 21:38         ` David Miller
2007-05-15 21:32     ` David Miller
2007-05-15 22:17       ` [WIP] [PATCH] WAS " jamal
2007-05-15 22:48         ` jamal
2007-05-16  0:50           ` jamal
2007-05-16 22:12         ` Sridhar Samudrala
2007-05-16 22:52           ` jamal
2007-05-17  3:25             ` jamal
2007-05-18 12:07               ` jamal
2007-05-17  4:03           ` Krishna Kumar2
2007-05-16 21:44             ` Sridhar Samudrala
2007-05-17  5:01               ` Krishna Kumar2
     [not found]       ` <OF6757F56D.EE5984FD-ON872572DC.0081026C-882572DC.00814B8F@us.ibm.com>
2007-05-15 23:36         ` David Miller
2007-05-21  7:56       ` Herbert Xu
     [not found]         ` <OF9ABCD08D.2CD1B193-ON872572E3.007A6FC1-882572E3.007ACE1A@us.ibm.com>
2007-05-22 22:36           ` David Miller
     [not found]             ` <OFCF3EB7F8.9740C0C7-ON872572E3.007DADF6-882572E3.007E0E7B@us.ibm.com>
2007-05-22 23:04               ` David Miller
2007-05-22 23:12             ` Herbert Xu
