* [ofa-general] [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: Krishna Kumar @ 2007-08-08  9:31 UTC (permalink / raw)
  To: johnpol, sri, shemminger, davem, kaber
  Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
	kumarkr, rdreier, mcarlson, jeff, general, mchan, tgraf, hadi,
	netdev

This set of patches implements the batching API, and adds support for this
API in IPoIB.

List of changes from original submission:
-----------------------------------------
1.  [Patrick] Suggestion to remove tx_queue_len check for enabling batching.
2.  [Patrick] Move queue purging to dev_deactivate so that references are
	freed when the device goes down.
3.  [Patrick] Remove changelog & unrelated changes from sch_generic.c
4.  [Patrick] Free skb_blist in unregister_netdev (also suggested to put in
	free_netdev, but it is not required as unregister_netdev will not fail
	at this location).
5.  [Stephen/Patrick] Remove sysfs support.
6.  [Stephen] Add ethtool support.
7.  [Evgeniy] Stop interrupts while changing tx_batch_skb value.
8.  [Michael Tsirkin] Remove misleading comment in ipoib_send().
9.  [KK] Remove NETIF_F_BATCH_SKBS (device supports batching if API present).
10. [KK] Remove xmit_slots from netdev.
11. [KK] [IPoIB]: Use unsigned instead of int for indices; handle the race
	between multiple WCs executing on different CPUs by adding a new
	lock (the lock might need to be held for the entire duration of a
	WC - some optimization is possible here); changed the multiple-skb
	algorithm to not use xmit_slots; simplified code; minor performance
	changes wrt slot counters; etc.

List of changes implemented, tested and dropped:
------------------------------------------------
1. [Patrick] Suggestion to use skb_blist statically in netdevice. This
	reduces performance (~1%), possibly due to the extra check for the
	dev->hard_start_xmit_batch API.
2. [Patrick] Suggestion to check if hard_start_xmit_batch can be removed:
	This reduces performance, since it adds a call to a non-inline
	function plus an extra check in the driver for whether the skb is
	NULL.
3. [Sridhar] Suggestion to always use batching for the regular xmit case
	too: while testing, the tests virtually hang and transfer almost no
	data for higher numbers of processes (64 and above), for reasons not
	yet understood.

Patches are described as:
		 Mail 0/9: This mail
		 Mail 1/9: HOWTO documentation
		 Mail 2/9: Introduce skb_blist and hard_start_xmit_batch API
		 Mail 3/9: Modify qdisc_run() to support batching
		 Mail 4/9: Add ethtool support to enable/disable batching
		 Mail 5/9: IPoIB header file changes to use batching
		 Mail 6/9: IPoIB CM & Multicast changes
		 Mail 7/9: IPoIB verb changes to use batching
		 Mail 8/9: IPoIB internal post and work completion handler
		 Mail 9/9: Implement the new batching API

RESULTS: The performance improvement for TCP No Delay is in the range of -8%
	to 320% (with -8% being the sole negative), with many individual tests
	giving 50% or more improvement (I think this is due to the hw slots
	filling up faster, resulting in more batching when the queue gets
	woken). The results for TCP are in the range of -11% to 93%, with most
	of the tests (8/12) giving improvements.

ISSUES: I am getting a huge number of retransmissions for both the TCP and
	TCP No Delay cases on IPoIB (which explains the slight degradation in
	some test cases mentioned above). After a full test run, the regular
	code resulted in 74 retransmissions, while there were 1365716 retrans
	with the batching code - roughly 18500 retransmissions for every one
	in the regular code. Yet even with this huge number of retransmissions
	there is a 20.7% overall improvement in BW (which implies batching
	will improve results even more once this problem is fixed). I suspect
	this is some issue in the driver/firmware, since:
		a. I see similarly low retransmission numbers for E1000 (so
		   there is no bug in the core changes).
		b. Even with batching capped at 2 skbs, I get almost the
		   same number of retransmissions (which implies the receiver
		   is probably not dropping skbs). ifconfig/netstat on the
		   receiver gives no clue (no drops/errors, etc).
	This issue delayed submitting the patches for the last 2 weeks while I
	was trying to debug it; any help from the OpenIB community is
	appreciated.

Please review and provide feedback, and consider for inclusion.

Thanks,

- KK

---------------------------------------------------------------
Test Case                 ORG         NEW          % Change
---------------------------------------------------------------
                                 TCP
                                 ---
Size:32 Procs:1           2709        4217           55.66
Size:128 Procs:1          10950       15853          44.77
Size:512 Procs:1          35313       68224          93.19
Size:4096 Procs:1         118144      119935         1.51

Size:32 Procs:8           18976       22432          18.21
Size:128 Procs:8          66351       86072          29.72
Size:512 Procs:8          246546      234373         -4.93
Size:4096 Procs:8         268861      251540         -6.44

Size:32 Procs:16          35009       45861          30.99
Size:128 Procs:16         150979      164961         9.26
Size:512 Procs:16         259443      230730         -11.06
Size:4096 Procs:16        265313      246794         -6.98

                               TCP No Delay
                               ------------
Size:32 Procs:1           1930        1944           0.72
Size:128 Procs:1          8573        7831           -8.65
Size:512 Procs:1          28536       29347          2.84
Size:4096 Procs:1         98916       104236         5.37

Size:32 Procs:8           4173        17560          320.80
Size:128 Procs:8          17350       66205          281.58
Size:512 Procs:8          69777       211467         203.06
Size:4096 Procs:8         201096      242578         20.62

Size:32 Procs:16          20570       37778          83.65
Size:128 Procs:16         95005       154464         62.58
Size:512 Procs:16         111677      221570         98.40
Size:4096 Procs:16        204765      240368         17.38
---------------------------------------------------------------
Overall:                  2340962     2826340        20.73%
                        [Summary: 19 Better cases, 5 worse]

Testing environment (on the client; the server uses a 4096-entry sendq):
	echo "Using 512 size sendq"
	modprobe ib_ipoib send_queue_size=512 recv_queue_size=512
	echo "4096 524288 4194304" > /proc/sys/net/ipv4/tcp_wmem
	echo "4096 1048576 4194304" > /proc/sys/net/ipv4/tcp_rmem
	echo 4194304 > /proc/sys/net/core/rmem_max
	echo 4194304 > /proc/sys/net/core/wmem_max
	echo 120000 > /proc/sys/net/core/netdev_max_backlog


* [PATCH 1/9 Rev3] [Doc] HOWTO Documentation for batching
From: Krishna Kumar @ 2007-08-08  9:31 UTC (permalink / raw)
  To: johnpol, kaber, shemminger, davem, sri
  Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
	kumarkr, rdreier, rick.jones2, mcarlson, jeff, general, mchan,
	tgraf, hadi, netdev, Krishna Kumar, xma

Add documentation describing the batching API.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 Batching_skb_API.txt |   82 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 82 insertions(+)

diff -ruNp ORG/Documentation/networking/Batching_skb_API.txt NEW/Documentation/networking/Batching_skb_API.txt
--- ORG/Documentation/networking/Batching_skb_API.txt	1970-01-01 05:30:00.000000000 +0530
+++ NEW/Documentation/networking/Batching_skb_API.txt	2007-08-07 22:41:55.000000000 +0530
@@ -0,0 +1,82 @@
+		 HOWTO for batching skb API support
+		 -----------------------------------
+
+Section 1: What is the batching skb API?
+Section 2: How does the batching API work vs the original API?
+Section 3: How can drivers support this API?
+Section 4: How can users work with this API?
+
+
+Introduction: Kernel support for batching skb
+----------------------------------------------
+
+A new xmit API - hard_start_xmit_batch() - is provided in the netdevice
+layer, similar to the existing hard_start_xmit() API. Drivers which export
+this API can implement it along the lines of their hard_start_xmit handler.
+The new API should process multiple skbs (possibly just one) in a single
+call, while the existing hard_start_xmit processes one skb per call. The
+driver writer can reuse most of the code from the existing API in the new
+API without duplicating code.
+
+
+Section 1: What is the batching skb API?
+----------------------------------------
+
+	This API is optionally exported by a driver. The prerequisite for a
+	driver to use this API is that it should have a reasonably sized
+	hardware queue that can process multiple skbs.
+
+
+Section 2: How does the batching API work vs the original API?
+--------------------------------------------------------------
+
+	The networking stack gets called from upper layer protocols with a
+	single skb to transmit. This skb is first enqueued and an attempt is
+	made to transmit it immediately (via qdisc_run). However, events like
+	tx lock contention, the tx queue being stopped, etc, can result in
+	the skb not getting sent out, and it remains in the queue. When the
+	next xmit is called or when the queue is re-enabled, qdisc_run could
+	potentially find multiple packets in the queue, and iteratively send
+	them all out one by one.
+
+	The batching skb API was added to exploit this situation, where all
+	the skbs can be passed to the device in one shot. This reduces driver
+	processing; locking at the driver (or in the stack for non-LLTX
+	drivers) gets amortized over multiple skbs; and for drivers where
+	every xmit results in completion processing (like IPoIB), the driver
+	can be optimized to request a completion only for the last skb that
+	was sent, which saves an interrupt for every skb but the last one
+	sent in the same batch.
+
+	Batching can result in significant performance gains for systems that
+	have multiple data stream paths over the same network interface card.
+
+
+Section 3: How can drivers support this API?
+--------------------------------------------
+
+	The new API - dev->hard_start_xmit_batch(struct net_device *dev) -
+	can, simplistically, be written almost identically to the regular
+	xmit API, except that the driver processes multiple skbs instead of
+	one. The new API doesn't get an skb as an argument; instead it picks
+	up all the skbs from dev->skb_blist, where they were added by the
+	core stack, and tries to send them out.
+
+	Batching requires the driver to set dev->hard_start_xmit_batch to the
+	new API implemented for that driver.
+
+
+Section 4: How can users work with this API?
+--------------------------------------------
+
+	Batching could be disabled for a particular device, e.g. on desktop
+	systems where only one stream of network activity takes place on
+	that device, since performance could be slightly affected due to the
+	extra processing that batching adds (unless packets are getting sent
+	fast enough to stop the queue). Batching can be enabled when more
+	than one stream of network activity per device is expected, e.g. on
+	servers, or even for desktop usage with multiple browser, chat and
+	file transfer sessions, etc.
+
+	Per device batching can be enabled/disabled using ethtool, where
+	passing 1 enables batching and passing 0 disables batching.

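
As an illustration of Sections 2 and 3 above, here is a minimal sketch of
what a driver's batch handler could look like. "mydrv" and mydrv_xmit_one()
are hypothetical names; the only pieces taken from this series are
dev->skb_blist (filled by the core stack) and the handler's signature and
return codes:

static int mydrv_hard_start_xmit_batch(struct net_device *dev)
{
	struct sk_buff *skb;

	/*
	 * Drain the batch list the core stack filled for us. The list is
	 * private to this call (the QDISC_RUNNING bit guarantees a single
	 * dequeuer), so the unlocked __skb_dequeue() variant is fine here.
	 */
	while ((skb = __skb_dequeue(dev->skb_blist)) != NULL) {
		/*
		 * Reuse the single-skb path per skb; a smarter driver would
		 * post all skbs and request one completion for the last, as
		 * the IPoIB patches later in this series do.
		 */
		if (mydrv_xmit_one(dev, skb) != NETDEV_TX_OK) {
			/* Out of hw slots: put it back and ask for a retry */
			__skb_queue_head(dev->skb_blist, skb);
			return NETDEV_TX_BUSY;
		}
	}
	return NETDEV_TX_OK;
}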

* [PATCH 2/9 Rev3] [core] Add skb_blist & hard_start_xmit_batch
From: Krishna Kumar @ 2007-08-08  9:31 UTC (permalink / raw)
  To: johnpol, sri, shemminger, kaber, davem
  Cc: jagana, Robert.Olsson, rick.jones2, herbert, gaagaan, kumarkr,
	rdreier, peter.p.waskiewicz.jr, mcarlson, jeff, general, mchan,
	tgraf, hadi, netdev, Krishna Kumar, xma

Introduce the skb_blist and hard_start_xmit_batch API, handle a driver's
usage of the new API, and add support routines.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/linux/netdevice.h |    8 +++
 net/core/dev.c            |   98 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 106 insertions(+)

diff -ruNp ORG/include/linux/netdevice.h NEW/include/linux/netdevice.h
--- ORG/include/linux/netdevice.h	2007-08-06 08:25:37.000000000 +0530
+++ NEW/include/linux/netdevice.h	2007-08-07 13:11:19.000000000 +0530
@@ -456,6 +456,9 @@ struct net_device
 	/* Partially transmitted GSO packet. */
 	struct sk_buff		*gso_skb;
 
+	/* List of batch skbs (optional, used if driver supports batching API) */
+	struct sk_buff_head	*skb_blist;
+
 	/* ingress path synchronizer */
 	spinlock_t		ingress_lock;
 	struct Qdisc		*qdisc_ingress;
@@ -472,6 +475,9 @@ struct net_device
 	void			*priv;	/* pointer to private data	*/
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 						    struct net_device *dev);
+	int			(*hard_start_xmit_batch) (struct net_device
+							  *dev);
+
 	/* These may be needed for future network-power-down code. */
 	unsigned long		trans_start;	/* Time (in jiffies) of last Tx	*/
 
@@ -832,6 +838,8 @@ extern int		dev_set_mac_address(struct n
 					    struct sockaddr *);
 extern int		dev_hard_start_xmit(struct sk_buff *skb,
 					    struct net_device *dev);
+extern int		dev_add_skb_to_blist(struct sk_buff *skb,
+					     struct net_device *dev);
 
 extern void		dev_init(void);
 
diff -ruNp ORG/net/core/dev.c NEW/net/core/dev.c
--- ORG/net/core/dev.c	2007-08-06 08:25:40.000000000 +0530
+++ NEW/net/core/dev.c	2007-08-07 13:11:19.000000000 +0530
@@ -897,6 +897,55 @@ void netdev_state_change(struct net_devi
 	}
 }
 
+static void free_batching(struct net_device *dev)
+{
+	if (dev->skb_blist) {
+		if (!skb_queue_empty(dev->skb_blist))
+			skb_queue_purge(dev->skb_blist);
+		kfree(dev->skb_blist);
+		dev->skb_blist = NULL;
+	}
+}
+
+int dev_change_tx_batch_skb(struct net_device *dev, unsigned long new_batch_skb)
+{
+	int ret = 0;
+	struct sk_buff_head *blist;
+
+	if (!dev->hard_start_xmit_batch) {
+		/* Driver doesn't support batching skb API */
+		ret = -ENOTSUPP;
+		goto out;
+	}
+
+	/* Handle invalid argument */
+	if ((long)new_batch_skb < 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* Check if new value is same as the current */
+	if (!!dev->skb_blist == !!new_batch_skb)
+		goto out;
+
+	if (new_batch_skb &&
+	    (blist = kmalloc(sizeof *blist, GFP_KERNEL)) == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	spin_lock(&dev->queue_lock);
+	if (new_batch_skb) {
+		skb_queue_head_init(blist);
+		dev->skb_blist = blist;
+	} else
+		free_batching(dev);
+	spin_unlock(&dev->queue_lock);
+
+out:
+	return ret;
+}
+
 /**
  *	dev_load 	- load a network module
  *	@name: name of interface
@@ -1459,6 +1508,45 @@ static int dev_gso_segment(struct sk_buf
 	return 0;
 }
 
+/*
+ * Add skb (skbs in case segmentation is required) to dev->skb_blist. We are
+ * holding QDISC RUNNING bit, so no one else can add to this list. Also, skbs
+ * are dequeued from this list when we call the driver, so the list is safe
+ * from simultaneous deletes too.
+ *
+ * Returns count of successful skb(s) added to skb_blist.
+ */
+int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev)
+{
+	if (!list_empty(&ptype_all))
+		dev_queue_xmit_nit(skb, dev);
+
+	if (netif_needs_gso(dev, skb)) {
+		if (unlikely(dev_gso_segment(skb))) {
+			kfree_skb(skb);
+			return 0;
+		}
+
+		if (skb->next) {
+			int count = 0;
+
+			do {
+				struct sk_buff *nskb = skb->next;
+
+				skb->next = nskb->next;
+				__skb_queue_tail(dev->skb_blist, nskb);
+				count++;
+			} while (skb->next);
+
+			skb->destructor = DEV_GSO_CB(skb)->destructor;
+			kfree_skb(skb);
+			return count;
+		}
+	}
+	__skb_queue_tail(dev->skb_blist, skb);
+	return 1;
+}
+
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	if (likely(!skb->next)) {
@@ -3446,6 +3535,13 @@ int register_netdevice(struct net_device
 		}
 	}
 
+	if (dev->hard_start_xmit_batch) {
+		/* Driver supports batching skb API */
+		dev->skb_blist = kmalloc(sizeof *dev->skb_blist, GFP_KERNEL);
+		if (dev->skb_blist)
+			skb_queue_head_init(dev->skb_blist);
+	}
+
 	/*
 	 *	nil rebuild_header routine,
 	 *	that should be never called and used as just bug trap.
@@ -3787,6 +3882,9 @@ void unregister_netdevice(struct net_dev
 
 	synchronize_net();
 
+	/* Deallocate batching structure */
+	free_batching(dev);
+
 	/* Shutdown queueing discipline. */
 	dev_shutdown(dev);
 

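
For driver authors, opting in is just a matter of setting the new hook
before registration; per the register_netdevice() hunk above, the core then
allocates dev->skb_blist itself. A small sketch, with all mydrv_* names
hypothetical:

static int mydrv_register(struct net_device *dev)
{
	dev->hard_start_xmit = mydrv_xmit;

	/*
	 * The mere presence of this hook is what enables batching (mail
	 * 0/9, change #9: the NETIF_F_BATCH_SKBS flag was dropped);
	 * register_netdevice() sees it and allocates dev->skb_blist.
	 */
	dev->hard_start_xmit_batch = mydrv_xmit_batch;

	return register_netdev(dev);
}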

* [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching
From: Krishna Kumar @ 2007-08-08  9:31 UTC (permalink / raw)
  To: johnpol, davem, shemminger, kaber, sri
  Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
	kumarkr, rdreier, rick.jones2, mcarlson, jeff, general, mchan,
	tgraf, hadi, netdev, Krishna Kumar, xma

Modify qdisc_run() to support batching: modify callers of qdisc_run to use
batching, and modify qdisc_restart to implement it.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/net/pkt_sched.h |    6 +--
 net/core/dev.c          |    5 +--
 net/sched/sch_generic.c |   77 +++++++++++++++++++++++++++++++++++++++---------
 3 files changed, 69 insertions(+), 19 deletions(-)

diff -ruNp ORG/include/net/pkt_sched.h NEW/include/net/pkt_sched.h
--- ORG/include/net/pkt_sched.h	2007-07-17 08:48:37.000000000 +0530
+++ NEW/include/net/pkt_sched.h	2007-08-07 13:11:19.000000000 +0530
@@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge
 		struct rtattr *tab);
 extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
 
-extern void __qdisc_run(struct net_device *dev);
+extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist);
 
-static inline void qdisc_run(struct net_device *dev)
+static inline void qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
 {
 	if (!netif_queue_stopped(dev) &&
 	    !test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
-		__qdisc_run(dev);
+		__qdisc_run(dev, blist);
 }
 
 extern int tc_classify_compat(struct sk_buff *skb, struct tcf_proto *tp,
diff -ruNp ORG/net/sched/sch_generic.c NEW/net/sched/sch_generic.c
--- ORG/net/sched/sch_generic.c	2007-07-12 08:55:20.000000000 +0530
+++ NEW/net/sched/sch_generic.c	2007-08-07 13:11:19.000000000 +0530
@@ -59,10 +59,12 @@ static inline int qdisc_qlen(struct Qdis
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
 				  struct Qdisc *q)
 {
-	if (unlikely(skb->next))
-		dev->gso_skb = skb;
-	else
-		q->ops->requeue(skb, q);
+	if (likely(skb)) {
+		if (unlikely(skb->next))
+			dev->gso_skb = skb;
+		else
+			q->ops->requeue(skb, q);
+	}
 
 	netif_schedule(dev);
 	return 0;
@@ -91,17 +93,22 @@ static inline int handle_dev_cpu_collisi
 		/*
 		 * Same CPU holding the lock. It may be a transient
 		 * configuration error, when hard_start_xmit() recurses. We
-		 * detect it by checking xmit owner and drop the packet when
-		 * deadloop is detected. Return OK to try the next skb.
+		 * detect it by checking xmit owner and drop the packet (or
+		 * all packets in batching case) when deadloop is detected.
+		 * Return OK to try the next skb.
 		 */
-		kfree_skb(skb);
+		if (likely(skb))
+			kfree_skb(skb);
+		else if (!skb_queue_empty(dev->skb_blist))
+			skb_queue_purge(dev->skb_blist);
+
 		if (net_ratelimit())
 			printk(KERN_WARNING "Dead loop on netdevice %s, "
 			       "fix it urgently!\n", dev->name);
 		ret = qdisc_qlen(q);
 	} else {
 		/*
-		 * Another cpu is holding lock, requeue & delay xmits for
+		 * Another cpu is holding lock, requeue skb & delay xmits for
 		 * some time.
 		 */
 		__get_cpu_var(netdev_rx_stat).cpu_collision++;
@@ -112,6 +119,38 @@ static inline int handle_dev_cpu_collisi
 }
 
 /*
+ * Algorithm to get skb(s) is:
+ *	- Non-batching drivers, or if the batch list is empty and there is
+ *	  1 skb in the queue - dequeue the skb and put it in *skbp to tell
+ *	  the caller to use the single xmit API.
+ *	- Batching drivers where the batch list already contains at least
+ *	  one skb, or if there are multiple skbs in the queue: keep
+ *	  dequeueing skbs up to a limit and set *skbp to NULL to tell the
+ *	  caller to use the multiple xmit API.
+ *
+ * Returns:
+ *	1 - at least one skb is to be sent out, *skbp contains the skb or
+ *	    NULL (in case >1 skbs are present in blist for batching)
+ *	0 - no skbs to be sent.
+ */
+static inline int get_skb(struct net_device *dev, struct Qdisc *q,
+			  struct sk_buff_head *blist, struct sk_buff **skbp)
+{
+	if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1))) {
+		return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
+	} else {
+		int max = dev->tx_queue_len - skb_queue_len(blist);
+		struct sk_buff *skb;
+
+		while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
+			max -= dev_add_skb_to_blist(skb, dev);
+
+		*skbp = NULL;
+		return 1;	/* we have at least one skb in blist */
+	}
+}
+
+/*
  * NOTE: Called under dev->queue_lock with locally disabled BH.
  *
  * __LINK_STATE_QDISC_RUNNING guarantees only one CPU can process this
@@ -130,7 +169,8 @@ static inline int handle_dev_cpu_collisi
  *				>0 - queue is not empty.
  *
  */
-static inline int qdisc_restart(struct net_device *dev)
+static inline int qdisc_restart(struct net_device *dev,
+				struct sk_buff_head *blist)
 {
 	struct Qdisc *q = dev->qdisc;
 	struct sk_buff *skb;
@@ -138,7 +178,7 @@ static inline int qdisc_restart(struct n
 	int ret;
 
 	/* Dequeue packet */
-	if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
+	if (unlikely(!get_skb(dev, q, blist, &skb)))
 		return 0;
 
 	/*
@@ -158,7 +198,10 @@ static inline int qdisc_restart(struct n
 	/* And release queue */
 	spin_unlock(&dev->queue_lock);
 
-	ret = dev_hard_start_xmit(skb, dev);
+	if (likely(skb))
+		ret = dev_hard_start_xmit(skb, dev);
+	else
+		ret = dev->hard_start_xmit_batch(dev);
 
 	if (!lockless)
 		netif_tx_unlock(dev);
@@ -168,7 +211,7 @@ static inline int qdisc_restart(struct n
 
 	switch (ret) {
 	case NETDEV_TX_OK:
-		/* Driver sent out skb successfully */
+		/* Driver sent out skb (or entire skb_blist) successfully */
 		ret = qdisc_qlen(q);
 		break;
 
@@ -190,10 +233,10 @@ static inline int qdisc_restart(struct n
 	return ret;
 }
 
-void __qdisc_run(struct net_device *dev)
+void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
 {
 	do {
-		if (!qdisc_restart(dev))
+		if (!qdisc_restart(dev, blist))
 			break;
 	} while (!netif_queue_stopped(dev));
 
@@ -563,6 +606,12 @@ void dev_deactivate(struct net_device *d
 	qdisc = dev->qdisc;
 	dev->qdisc = &noop_qdisc;
 
+	if (dev->skb_blist) {
+		/* Release skbs on batch list */
+		if (!skb_queue_empty(dev->skb_blist))
+			skb_queue_purge(dev->skb_blist);
+	}
+
 	qdisc_reset(qdisc);
 
 	skb = dev->gso_skb;
diff -ruNp ORG/net/core/dev.c NEW/net/core/dev.c
--- ORG/net/core/dev.c	2007-08-06 08:25:40.000000000 +0530
+++ NEW/net/core/dev.c	2007-08-07 13:11:19.000000000 +0530
@@ -1699,7 +1699,7 @@ gso:
 			/* reset queue_mapping to zero */
 			skb->queue_mapping = 0;
 			rc = q->enqueue(skb, q);
-			qdisc_run(dev);
+			qdisc_run(dev, NULL);
 			spin_unlock(&dev->queue_lock);
 
 			rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc;
@@ -1896,7 +1896,8 @@ static void net_tx_action(struct softirq
 			clear_bit(__LINK_STATE_SCHED, &dev->state);
 
 			if (spin_trylock(&dev->queue_lock)) {
-				qdisc_run(dev);
+				/* Send all skbs if driver supports batching */
+				qdisc_run(dev, dev->skb_blist);
 				spin_unlock(&dev->queue_lock);
 			} else {
 				netif_schedule(dev);

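
To make the get_skb() rule above concrete, here is a small user-space model
of the decision (illustration only, not kernel code; names are made up):

#include <stdio.h>

/* Mirrors get_skb(): single-skb path iff there is no blist, or the blist
 * is empty and at most 1 skb is queued; otherwise everything moves to the
 * blist and the batch API is used. */
static const char *xmit_path(int has_blist, int blist_len, int qdisc_len)
{
	if (!has_blist || (blist_len == 0 && qdisc_len <= 1))
		return "single: dev_hard_start_xmit()";
	return "batch: dev->hard_start_xmit_batch()";
}

int main(void)
{
	printf("%s\n", xmit_path(0, 0, 5)); /* non-batching driver */
	printf("%s\n", xmit_path(1, 0, 1)); /* 1 skb, empty blist */
	printf("%s\n", xmit_path(1, 0, 5)); /* queue built up under load */
	printf("%s\n", xmit_path(1, 2, 1)); /* leftovers already on blist */
	return 0;
}

So a lightly loaded batching driver still takes the old single-skb path;
batching only kicks in once packets actually accumulate (lock contention,
stopped queue), which is exactly the situation Section 2 of the HOWTO
describes.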

* [PATCH 4/9 Rev3] [ethtool] Add ethtool support
From: Krishna Kumar @ 2007-08-08  9:31 UTC (permalink / raw)
  To: johnpol, sri, shemminger, davem, kaber
  Cc: jagana, Robert.Olsson, rick.jones2, herbert, gaagaan, kumarkr,
	rdreier, peter.p.waskiewicz.jr, mcarlson, jeff, general, mchan,
	tgraf, hadi, netdev, Krishna Kumar, xma

Add ethtool support to enable/disable batching.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/linux/ethtool.h   |    2 ++
 include/linux/netdevice.h |    2 ++
 net/core/ethtool.c        |   26 ++++++++++++++++++++++++++
 3 files changed, 30 insertions(+)

diff -ruNp ORG/include/linux/ethtool.h NEW/include/linux/ethtool.h
--- ORG/include/linux/ethtool.h	2007-08-06 08:25:37.000000000 +0530
+++ NEW/include/linux/ethtool.h	2007-08-07 13:11:19.000000000 +0530
@@ -410,6 +410,8 @@ struct ethtool_ops {
 #define ETHTOOL_SUFO		0x00000022 /* Set UFO enable (ethtool_value) */
 #define ETHTOOL_GGSO		0x00000023 /* Get GSO enable (ethtool_value) */
 #define ETHTOOL_SGSO		0x00000024 /* Set GSO enable (ethtool_value) */
+#define ETHTOOL_GBATCH		0x00000025 /* Get Batch (ethtool_value) */
+#define ETHTOOL_SBATCH		0x00000026 /* Set Batch (ethtool_value) */
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET		ETHTOOL_GSET
diff -ruNp ORG/include/linux/netdevice.h NEW/include/linux/netdevice.h
--- ORG/include/linux/netdevice.h	2007-08-06 08:25:37.000000000 +0530
+++ NEW/include/linux/netdevice.h	2007-08-07 13:11:19.000000000 +0530
@@ -1112,6 +1112,8 @@ extern void		dev_set_promiscuity(struct 
 extern void		dev_set_allmulti(struct net_device *dev, int inc);
 extern void		netdev_state_change(struct net_device *dev);
 extern void		netdev_features_change(struct net_device *dev);
+extern int		dev_change_tx_batch_skb(struct net_device *dev,
+						unsigned long new_batch_skb);
 /* Load a device via the kmod */
 extern void		dev_load(const char *name);
 extern void		dev_mcast_init(void);
diff -ruNp ORG/net/core/ethtool.c NEW/net/core/ethtool.c
--- ORG/net/core/ethtool.c	2007-08-06 08:25:40.000000000 +0530
+++ NEW/net/core/ethtool.c	2007-08-07 13:11:19.000000000 +0530
@@ -638,6 +638,25 @@ static int ethtool_set_gso(struct net_de
 	return 0;
 }
 
+static int ethtool_get_batch(struct net_device *dev, char __user *useraddr)
+{
+	struct ethtool_value edata = { ETHTOOL_GBATCH };
+
+	edata.data = dev->skb_blist != NULL;
+	if (copy_to_user(useraddr, &edata, sizeof(edata)))
+		 return -EFAULT;
+	return 0;
+}
+
+static int ethtool_set_batch(struct net_device *dev, char __user *useraddr)
+{
+	struct ethtool_value edata;
+
+	if (copy_from_user(&edata, useraddr, sizeof(edata)))
+		return -EFAULT;
+	return dev_change_tx_batch_skb(dev, edata.data);
+}
+
 static int ethtool_self_test(struct net_device *dev, char __user *useraddr)
 {
 	struct ethtool_test test;
@@ -817,6 +836,7 @@ int dev_ethtool(struct ifreq *ifr)
 	case ETHTOOL_GPERMADDR:
 	case ETHTOOL_GUFO:
 	case ETHTOOL_GGSO:
+	case ETHTOOL_GBATCH:
 		break;
 	default:
 		if (!capable(CAP_NET_ADMIN))
@@ -935,6 +955,12 @@ int dev_ethtool(struct ifreq *ifr)
 	case ETHTOOL_SGSO:
 		rc = ethtool_set_gso(dev, useraddr);
 		break;
+	case ETHTOOL_GBATCH:
+		rc = ethtool_get_batch(dev, useraddr);
+		break;
+	case ETHTOOL_SBATCH:
+		rc = ethtool_set_batch(dev, useraddr);
+		break;
 	default:
 		rc = -EOPNOTSUPP;
 	}

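
Since the stock ethtool binary knows nothing about the two new commands, a
user-space sketch of toggling batching via the raw ioctl (the #define
matches the header change above; error handling trimmed for brevity):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

#ifndef ETHTOOL_SBATCH
#define ETHTOOL_SBATCH	0x00000026	/* from the patch above */
#endif

int main(int argc, char **argv)
{
	struct ethtool_value edata = { .cmd = ETHTOOL_SBATCH, .data = 1 };
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, argc > 1 ? argv[1] : "ib0", IFNAMSIZ - 1);
	ifr.ifr_data = (char *)&edata;

	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
		perror("ETHTOOL_SBATCH");
	close(fd);
	return 0;
}

Passing .data = 1 enables batching and 0 disables it, exactly as
dev_change_tx_batch_skb() expects.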

* [ofa-general] [PATCH 5/9 Rev3] [IPoIB] Header file changes
From: Krishna Kumar @ 2007-08-08  9:32 UTC (permalink / raw)
  To: johnpol, kaber, shemminger, davem, sri
  Cc: jagana, Robert.Olsson, herbert, gaagaan, kumarkr, rdreier,
	peter.p.waskiewicz.jr, mcarlson, jeff, general, mchan, tgraf,
	hadi, netdev

IPoIB header file changes to use batching.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib.h |   11 ++++++++---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff -ruNp ORG/drivers/infiniband/ulp/ipoib/ipoib.h NEW/drivers/infiniband/ulp/ipoib/ipoib.h
--- ORG/drivers/infiniband/ulp/ipoib/ipoib.h	2007-07-12 08:55:06.000000000 +0530
+++ NEW/drivers/infiniband/ulp/ipoib/ipoib.h	2007-08-07 13:11:19.000000000 +0530
@@ -266,11 +266,13 @@ struct ipoib_dev_priv {
 	struct ipoib_rx_buf *rx_ring;
 
 	spinlock_t           tx_lock;
+	spinlock_t           comp_lock;		/* to handle parallel WCs */
 	struct ipoib_tx_buf *tx_ring;
 	unsigned             tx_head;
 	unsigned             tx_tail;
-	struct ib_sge        tx_sge;
-	struct ib_send_wr    tx_wr;
+	unsigned             tx_prev_tail;	/* to handle parallel WCs */
+	struct ib_sge        *tx_sge;
+	struct ib_send_wr    *tx_wr;
 
 	struct ib_wc ibwc[IPOIB_NUM_WC];
 
@@ -365,8 +367,11 @@ static inline void ipoib_put_ah(struct i
 int ipoib_open(struct net_device *dev);
 int ipoib_add_pkey_attr(struct net_device *dev);
 
+int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb,
+		      struct ipoib_dev_priv *priv, struct ipoib_ah *address,
+		      u32 qpn, int wr_num);
 void ipoib_send(struct net_device *dev, struct sk_buff *skb,
-		struct ipoib_ah *address, u32 qpn);
+		struct ipoib_ah *address, u32 qpn, int num_skbs);
 void ipoib_reap_ah(struct work_struct *work);
 
 void ipoib_flush_paths(struct net_device *dev);


* [ofa-general] [PATCH 6/9 Rev3] [IPoIB] CM & Multicast changes
From: Krishna Kumar @ 2007-08-08  9:32 UTC (permalink / raw)
  To: johnpol, sri, shemminger, kaber, davem
  Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
	kumarkr, rdreier, mcarlson, jeff, general, mchan, tgraf, hadi,
	netdev

IPoIB CM & Multicast changes based on header file changes.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib_cm.c        |   13 +++++++++----
 ipoib_multicast.c |    4 ++--
 2 files changed, 11 insertions(+), 6 deletions(-)

diff -ruNp ORG/drivers/infiniband/ulp/ipoib/ipoib_cm.c NEW/drivers/infiniband/ulp/ipoib/ipoib_cm.c
--- ORG/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-17 08:48:35.000000000 +0530
+++ NEW/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-08-07 13:11:19.000000000 +0530
@@ -493,14 +493,19 @@ static inline int post_send(struct ipoib
 			    unsigned int wr_id,
 			    u64 addr, int len)
 {
+	int ret;
 	struct ib_send_wr *bad_wr;
 
-	priv->tx_sge.addr             = addr;
-	priv->tx_sge.length           = len;
+	priv->tx_sge[0].addr          = addr;
+	priv->tx_sge[0].length        = len;
+
+	priv->tx_wr[0].wr_id 	      = wr_id;
 
-	priv->tx_wr.wr_id 	      = wr_id;
+	priv->tx_wr[0].next = NULL;
+	ret = ib_post_send(tx->qp, priv->tx_wr, &bad_wr);
+	priv->tx_wr[0].next = &priv->tx_wr[1];
 
-	return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr);
+	return ret;
 }
 
 void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
diff -ruNp ORG/drivers/infiniband/ulp/ipoib/ipoib_multicast.c NEW/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
--- ORG/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-07-12 08:55:06.000000000 +0530
+++ NEW/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-08-07 13:11:19.000000000 +0530
@@ -217,7 +217,7 @@ static int ipoib_mcast_join_finish(struc
 	if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		    sizeof (union ib_gid))) {
 		priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
-		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
+		priv->tx_wr[0].wr.ud.remote_qkey = priv->qkey;
 	}
 
 	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -736,7 +736,7 @@ out:
 			}
 		}
 
-		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN);
+		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN, 1);
 	}
 
 unlock:


* [ofa-general] [PATCH 7/9 Rev3] [IPoIB] Verb changes
From: Krishna Kumar @ 2007-08-08  9:32 UTC (permalink / raw)
  To: johnpol, davem, shemminger, kaber, sri
  Cc: jagana, Robert.Olsson, herbert, gaagaan, kumarkr, rdreier,
	peter.p.waskiewicz.jr, mcarlson, jeff, general, mchan, tgraf,
	hadi, netdev

IPoIB verb changes to use batching.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib_verbs.c |   23 ++++++++++++++---------
 1 files changed, 14 insertions(+), 9 deletions(-)

diff -ruNp ORG/drivers/infiniband/ulp/ipoib/ipoib_verbs.c NEW/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
--- ORG/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-07-12 08:55:06.000000000 +0530
+++ NEW/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-08-07 13:11:19.000000000 +0530
@@ -152,11 +152,11 @@ int ipoib_transport_dev_init(struct net_
 			.max_send_sge = 1,
 			.max_recv_sge = 1
 		},
-		.sq_sig_type = IB_SIGNAL_ALL_WR,
+		.sq_sig_type = IB_SIGNAL_REQ_WR,	/* see IB spec 11.2.4.1 */
 		.qp_type     = IB_QPT_UD
 	};
-
-	int ret, size;
+	struct ib_send_wr *next_wr = NULL;
+	int i, ret, size;
 
 	priv->pd = ib_alloc_pd(priv->ca);
 	if (IS_ERR(priv->pd)) {
@@ -197,12 +197,17 @@ int ipoib_transport_dev_init(struct net_
 	priv->dev->dev_addr[2] = (priv->qp->qp_num >>  8) & 0xff;
 	priv->dev->dev_addr[3] = (priv->qp->qp_num      ) & 0xff;
 
-	priv->tx_sge.lkey 	= priv->mr->lkey;
-
-	priv->tx_wr.opcode 	= IB_WR_SEND;
-	priv->tx_wr.sg_list 	= &priv->tx_sge;
-	priv->tx_wr.num_sge 	= 1;
-	priv->tx_wr.send_flags 	= IB_SEND_SIGNALED;
+	for (i = ipoib_sendq_size - 1; i >= 0; i--) {
+		priv->tx_sge[i].lkey		= priv->mr->lkey;
+		priv->tx_wr[i].opcode		= IB_WR_SEND;
+		priv->tx_wr[i].sg_list		= &priv->tx_sge[i];
+		priv->tx_wr[i].num_sge		= 1;
+		priv->tx_wr[i].send_flags	= 0;
+
+		/* Link the list properly for provider to use */
+		priv->tx_wr[i].next		= next_wr;
+		next_wr				= &priv->tx_wr[i];
+	}
 
 	return 0;
 


* [PATCH 8/9 Rev3] [IPoIB] Post and work completion handler changes
From: Krishna Kumar @ 2007-08-08  9:32 UTC (permalink / raw)
  To: johnpol, sri, shemminger, davem, kaber
  Cc: jagana, Robert.Olsson, rick.jones2, herbert, gaagaan, kumarkr,
	rdreier, peter.p.waskiewicz.jr, mcarlson, jeff, general, mchan,
	tgraf, hadi, netdev, Krishna Kumar, xma

IPoIB internal post and work completion handler changes.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib_ib.c |  217 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 173 insertions(+), 44 deletions(-)

diff -ruNp ORG/drivers/infiniband/ulp/ipoib/ipoib_ib.c NEW/drivers/infiniband/ulp/ipoib/ipoib_ib.c
--- ORG/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-17 08:48:35.000000000 +0530
+++ NEW/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-08-07 13:11:19.000000000 +0530
@@ -242,6 +242,8 @@ repost:
 static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int i, num_completions;
+	unsigned int tx_ring_index;
 	unsigned int wr_id = wc->wr_id;
 	struct ipoib_tx_buf *tx_req;
 	unsigned long flags;
@@ -255,18 +257,56 @@ static void ipoib_ib_handle_tx_wc(struct
 		return;
 	}
 
-	tx_req = &priv->tx_ring[wr_id];
+	/*
+	 * Handle skb completions from tx_tail to wr_id. Two issues:
+	 *	- Need to stop other WCs from mangling the same skbs if they
+	 *	  run at the same time. Use tx_prev_tail to demarcate WCs.
+	 *	- Handle WCs from earlier (possibly multiple) post_sends in
+	 *	  this iteration as we move from tx_prev_tail to wr_id, since
+	 *	  if the last WR (the one which requested completion
+	 *	  notification) of any of those earlier requests failed to be
+	 *	  sent, no completion notification is generated for the
+	 *	  successful WRs of those earlier requests.
+	 */
+	spin_lock_irqsave(&priv->comp_lock, flags);
+
+	/* Get start index */
+	tx_ring_index = priv->tx_prev_tail & (ipoib_sendq_size - 1);
+
+	/* Find the number of WCs */
+	num_completions = wr_id - tx_ring_index + 1;
+	if (unlikely(num_completions <= 0))
+		num_completions += ipoib_sendq_size;
+
+	/* Save new start index for any parallel WC's */
+	priv->tx_prev_tail += num_completions;
+
+	spin_unlock_irqrestore(&priv->comp_lock, flags);
 
-	ib_dma_unmap_single(priv->ca, tx_req->mapping,
-			    tx_req->skb->len, DMA_TO_DEVICE);
+	tx_req = &priv->tx_ring[tx_ring_index];
+	for (i = 0; i < num_completions; i++) {
+		if (likely(tx_req->skb)) {
+			ib_dma_unmap_single(priv->ca, tx_req->mapping,
+					    tx_req->skb->len, DMA_TO_DEVICE);
 
-	++priv->stats.tx_packets;
-	priv->stats.tx_bytes += tx_req->skb->len;
+			++priv->stats.tx_packets;
+			priv->stats.tx_bytes += tx_req->skb->len;
 
-	dev_kfree_skb_any(tx_req->skb);
+			dev_kfree_skb_any(tx_req->skb);
+		}
+		/*
+		 * else this skb failed synchronously when posted and was
+		 * freed immediately.
+		 */
+
+		if (likely(++tx_ring_index != ipoib_sendq_size))
+			tx_req++;
+		else
+			tx_req = &priv->tx_ring[0];
+	}
 
 	spin_lock_irqsave(&priv->tx_lock, flags);
-	++priv->tx_tail;
+	priv->tx_tail += num_completions;
 	if (unlikely(test_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags)) &&
 	    priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) {
 		clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
@@ -340,29 +380,57 @@ void ipoib_ib_completion(struct ib_cq *c
 	netif_rx_schedule(dev_ptr);
 }
 
-static inline int post_send(struct ipoib_dev_priv *priv,
-			    unsigned int wr_id,
-			    struct ib_ah *address, u32 qpn,
-			    u64 addr, int len)
+/*
+ * post_send : Post WR(s) to the device.
+ *
+ * num_skbs is the number of WRs; first_wr is the first slot in tx_wr[] (or
+ * tx_sge[]). first_wr is normally zero, unless a previous post_send returned
+ * an error and we are trying to post the untried WRs, in which case first_wr
+ * is the index of the first untried WR.
+ *
+ * Break the WR link before posting so that the provider knows how many WRs
+ * to process; the link is restored after the post.
+ */
+static inline int post_send(struct ipoib_dev_priv *priv, u32 qpn,
+			    int first_wr, int num_skbs,
+			    struct ib_send_wr **bad_wr)
 {
-	struct ib_send_wr *bad_wr;
+	int ret;
+	struct ib_send_wr *last_wr, *next_wr;
+
+	last_wr = &priv->tx_wr[first_wr + num_skbs - 1];
+
+	/* Set Completion Notification for last WR */
+	last_wr->send_flags = IB_SEND_SIGNALED;
 
-	priv->tx_sge.addr             = addr;
-	priv->tx_sge.length           = len;
+	/* Terminate the last WR */
+	next_wr = last_wr->next;
+	last_wr->next = NULL;
 
-	priv->tx_wr.wr_id 	      = wr_id;
-	priv->tx_wr.wr.ud.remote_qpn  = qpn;
-	priv->tx_wr.wr.ud.ah 	      = address;
+	/* Send all the WR's in one doorbell */
+	ret = ib_post_send(priv->qp, &priv->tx_wr[first_wr], bad_wr);
 
-	return ib_post_send(priv->qp, &priv->tx_wr, &bad_wr);
+	/* Restore send_flags & WR chain */
+	last_wr->send_flags = 0;
+	last_wr->next = next_wr;
+
+	return ret;
 }
 
-void ipoib_send(struct net_device *dev, struct sk_buff *skb,
-		struct ipoib_ah *address, u32 qpn)
+/*
+ * Map the skb and store the skb/mapping in tx_ring, plus the details of
+ * the WR in tx_wr, to pass to the provider.
+ *
+ * Returns:
+ *	1 - error, and the skb is freed.
+ *	0 - skb processed successfully.
+ */
+int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb,
+		      struct ipoib_dev_priv *priv, struct ipoib_ah *address,
+		      u32 qpn, int wr_num)
 {
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ipoib_tx_buf *tx_req;
 	u64 addr;
+	unsigned int tx_ring_index;
 
 	if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) {
 		ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
@@ -370,7 +438,7 @@ void ipoib_send(struct net_device *dev, 
 		++priv->stats.tx_dropped;
 		++priv->stats.tx_errors;
 		ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu);
-		return;
+		return 1;
 	}
 
 	ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n",
@@ -383,35 +451,96 @@ void ipoib_send(struct net_device *dev, 
 	 * means we have to make sure everything is properly recorded and
 	 * our state is consistent before we call post_send().
 	 */
-	tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)];
-	tx_req->skb = skb;
-	addr = ib_dma_map_single(priv->ca, skb->data, skb->len,
-				 DMA_TO_DEVICE);
+	addr = ib_dma_map_single(priv->ca, skb->data, skb->len, DMA_TO_DEVICE);
 	if (unlikely(ib_dma_mapping_error(priv->ca, addr))) {
 		++priv->stats.tx_errors;
 		dev_kfree_skb_any(skb);
-		return;
+		return 1;
 	}
-	tx_req->mapping = addr;
 
-	if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1),
-			       address->ah, qpn, addr, skb->len))) {
-		ipoib_warn(priv, "post_send failed\n");
-		++priv->stats.tx_errors;
-		ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE);
-		dev_kfree_skb_any(skb);
-	} else {
-		dev->trans_start = jiffies;
+	tx_ring_index = priv->tx_head & (ipoib_sendq_size - 1);
+
+	/* Save till completion handler executes */
+	priv->tx_ring[tx_ring_index].skb = skb;
+	priv->tx_ring[tx_ring_index].mapping = addr;
+
+	/* Set WR values for the provider to use */
+	priv->tx_sge[wr_num].addr = addr;
+	priv->tx_sge[wr_num].length = skb->len;
+
+	priv->tx_wr[wr_num].wr_id = tx_ring_index;
+	priv->tx_wr[wr_num].wr.ud.remote_qpn = qpn;
+	priv->tx_wr[wr_num].wr.ud.ah = address->ah;
+
+	priv->tx_head++;
+
+	if (unlikely(priv->tx_head - priv->tx_tail == ipoib_sendq_size)) {
+		ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
+		netif_stop_queue(dev);
+		set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
+	}
 
-		address->last_send = priv->tx_head;
-		++priv->tx_head;
+	return 0;
+}
 
-		if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) {
-			ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
-			netif_stop_queue(dev);
-			set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
+/*
+ * Send num_skbs to the device. If an skb is passed to this function, it is
+ * the single, unprocessed skb case; otherwise all skbs have already been
+ * processed and placed on priv->tx_wr, tx_sge, tx_ring, etc.
+ */
+void ipoib_send(struct net_device *dev, struct sk_buff *skb,
+		struct ipoib_ah *address, u32 qpn, int num_skbs)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int first_wr = 0;
+
+	if (skb && ipoib_process_skb(dev, skb, priv, address, qpn, 0))
+		return;
+
+	/* Send all skb's in one post */
+	do {
+		struct ib_send_wr *bad_wr;
+
+		if (unlikely((post_send(priv, qpn, first_wr, num_skbs,
+					&bad_wr)))) {
+			int done;
+
+			ipoib_warn(priv, "post_send failed\n");
+
+			/* Get number of WR's that finished successfully */
+			done = bad_wr - &priv->tx_wr[first_wr];
+
+			/* Handle 1 error */
+			priv->stats.tx_errors++;
+			ib_dma_unmap_single(priv->ca,
+				priv->tx_sge[first_wr + done].addr,
+				priv->tx_sge[first_wr + done].length,
+				DMA_TO_DEVICE);
+
+			/* Free failed WR & reset for WC handler to recognize */
+			dev_kfree_skb_any(priv->tx_ring[bad_wr->wr_id].skb);
+			priv->tx_ring[bad_wr->wr_id].skb = NULL;
+
+			/* Handle 'n' successes */
+			if (done) {
+				dev->trans_start = jiffies;
+				address->last_send = priv->tx_head - (num_skbs -
+								      done) - 1;
+			}
+
+			/* Get count of skbs that were not tried */
+			num_skbs -= (done + 1);
+				/* + 1 for WR that was tried & failed */
+
+			/* Get start index for next iteration */
+			first_wr += (done + 1);
+		} else {
+			dev->trans_start = jiffies;
+
+			address->last_send = priv->tx_head - 1;
+			num_skbs = 0;
 		}
-	}
+	} while (num_skbs);
 }
 
 static void __ipoib_reap_ah(struct net_device *dev)

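
The index arithmetic in the completion handler above is the subtle part: a
single signaled WC at ring slot wr_id must reap every WR posted since the
previous tail, wrapping around the ring. A small user-space model of just
that computation (SENDQ_SIZE and the helper name are illustrative only):

#include <stdio.h>

#define SENDQ_SIZE 8	/* stands in for ipoib_sendq_size, a power of 2 */

static int num_completions(unsigned int tx_prev_tail, unsigned int wr_id)
{
	unsigned int start = tx_prev_tail & (SENDQ_SIZE - 1);
	int n = (int)(wr_id - start) + 1;

	if (n <= 0)		/* wr_id wrapped past the end of the ring */
		n += SENDQ_SIZE;
	return n;
}

int main(void)
{
	/* No wrap: previous tail at slot 2, WC for slot 5 -> reaps 4 WRs */
	printf("%d\n", num_completions(2, 5));
	/* Wrap: previous tail at slot 6, WC for slot 1 -> reaps 4 WRs
	 * (slots 6, 7, 0, 1) */
	printf("%d\n", num_completions(6, 1));
	return 0;
}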

* [PATCH 9/9 Rev3] [IPoIB] Implement the new batching API
From: Krishna Kumar @ 2007-08-08  9:32 UTC (permalink / raw)
  To: johnpol, kaber, shemminger, davem, sri
  Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
	kumarkr, rdreier, rick.jones2, mcarlson, jeff, general, mchan,
	tgraf, hadi, netdev, Krishna Kumar, xma

IPoIB: implement the new batching API.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib_main.c |  189 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 184 insertions(+), 5 deletions(-)

diff -ruNp ORG/drivers/infiniband/ulp/ipoib/ipoib_main.c NEW/drivers/infiniband/ulp/ipoib/ipoib_main.c
--- ORG/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-07-12 08:55:06.000000000 +0530
+++ NEW/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-08-07 13:11:19.000000000 +0530
@@ -558,7 +558,8 @@ static void neigh_add_path(struct sk_buf
 				goto err_drop;
 			}
 		} else
-			ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+			ipoib_send(dev, skb, path->ah,
+				   IPOIB_QPN(skb->dst->neighbour->ha), 1);
 	} else {
 		neigh->ah  = NULL;
 
@@ -638,7 +639,7 @@ static void unicast_arp_send(struct sk_b
 		ipoib_dbg(priv, "Send unicast ARP to %04x\n",
 			  be16_to_cpu(path->pathrec.dlid));
 
-		ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr));
+		ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr), 1);
 	} else if ((path->query || !path_rec_start(dev, path)) &&
 		   skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
 		/* put pseudoheader back on for next time */
@@ -704,7 +705,8 @@ static int ipoib_start_xmit(struct sk_bu
 				goto out;
 			}
 
-			ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+			ipoib_send(dev, skb, neigh->ah,
+				   IPOIB_QPN(skb->dst->neighbour->ha), 1);
 			goto out;
 		}
 
@@ -753,6 +755,153 @@ out:
 	return NETDEV_TX_OK;
 }
 
+#define	XMIT_QUEUED_SKBS()						\
+	do {								\
+		if (wr_num) {						\
+			ipoib_send(dev, NULL, old_neigh->ah, old_qpn,	\
+				   wr_num);				\
+			wr_num = 0;					\
+		}							\
+	} while (0)
+
+/*
+ * TODO: Merge with ipoib_start_xmit so both use the same code, via a
+ * transparent wrapper around the xmit routines. Status: done, needs testing.
+ */
+static int ipoib_start_xmit_frames(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct sk_buff *skb;
+	struct sk_buff_head *blist = dev->skb_blist;
+	int max_skbs, wr_num = 0;
+	u32 qpn, old_qpn = 0;
+	struct ipoib_neigh *neigh, *old_neigh = NULL;
+	unsigned long flags;
+
+	if (unlikely(!spin_trylock_irqsave(&priv->tx_lock, flags)))
+		return NETDEV_TX_LOCKED;
+
+	/*
+	 * Figure out how many skbs can be sent. This prevents the device
+	 * getting full and avoids checking for queue stopped after each
+	 * iteration.
+	 */
+	max_skbs = ipoib_sendq_size - (priv->tx_head - priv->tx_tail);
+	while (max_skbs-- > 0 && (skb = __skb_dequeue(blist)) != NULL) {
+		if (likely(skb->dst && skb->dst->neighbour)) {
+			if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) {
+				XMIT_QUEUED_SKBS();
+				ipoib_path_lookup(skb, dev);
+				continue;
+			}
+
+			neigh = *to_ipoib_neigh(skb->dst->neighbour);
+
+			if (ipoib_cm_get(neigh)) {
+				if (ipoib_cm_up(neigh)) {
+					XMIT_QUEUED_SKBS();
+					ipoib_cm_send(dev, skb,
+						      ipoib_cm_get(neigh));
+					continue;
+				}
+			} else if (neigh->ah) {
+				if (unlikely(memcmp(&neigh->dgid.raw,
+						    skb->dst->neighbour->ha + 4,
+						    sizeof(union ib_gid)))) {
+					spin_lock(&priv->lock);
+					/*
+					 * It's safe to call ipoib_put_ah()
+					 * inside priv->lock here, because we
+					 * know that path->ah will always hold
+					 * one more reference, so ipoib_put_ah()
+					 * will never do more than decrement
+					 * the ref count.
+					 */
+					ipoib_put_ah(neigh->ah);
+					list_del(&neigh->list);
+					ipoib_neigh_free(dev, neigh);
+					spin_unlock(&priv->lock);
+					XMIT_QUEUED_SKBS();
+					ipoib_path_lookup(skb, dev);
+					continue;
+				}
+
+				qpn = IPOIB_QPN(skb->dst->neighbour->ha);
+				if (neigh != old_neigh || qpn != old_qpn) {
+					/*
+					 * Sending to a different destination
+					 * from earlier skb's - send all
+					 * existing skbs (if any), and restart.
+					 */
+					XMIT_QUEUED_SKBS();
+					old_neigh = neigh;
+					old_qpn = qpn;
+				}
+
+				if (likely(!ipoib_process_skb(dev, skb, priv,
+							      neigh->ah, qpn,
+							      wr_num)))
+					wr_num++;
+
+				continue;
+			}
+
+			if (skb_queue_len(&neigh->queue) <
+			    IPOIB_MAX_PATH_REC_QUEUE) {
+				spin_lock(&priv->lock);
+				__skb_queue_tail(&neigh->queue, skb);
+				spin_unlock(&priv->lock);
+			} else {
+				dev_kfree_skb_any(skb);
+				++priv->stats.tx_dropped;
+				++max_skbs;
+			}
+		} else {
+			struct ipoib_pseudoheader *phdr =
+				(struct ipoib_pseudoheader *) skb->data;
+			skb_pull(skb, sizeof *phdr);
+
+			if (phdr->hwaddr[4] == 0xff) {
+				/* Add in the P_Key for multicast */
+				phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff;
+				phdr->hwaddr[9] = priv->pkey & 0xff;
+
+				XMIT_QUEUED_SKBS();
+				ipoib_mcast_send(dev, phdr->hwaddr + 4, skb);
+			} else {
+				/* unicast GID -- should be ARP or RARP reply */
+
+				if ((be16_to_cpup((__be16 *) skb->data) !=
+				    ETH_P_ARP) &&
+				    (be16_to_cpup((__be16 *) skb->data) !=
+				    ETH_P_RARP)) {
+					ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x "
+						IPOIB_GID_FMT "\n",
+						skb->dst ? "neigh" : "dst",
+						be16_to_cpup((__be16 *)
+						skb->data),
+						IPOIB_QPN(phdr->hwaddr),
+						IPOIB_GID_RAW_ARG(phdr->hwaddr
+								  + 4));
+					dev_kfree_skb_any(skb);
+					++priv->stats.tx_dropped;
+					++max_skbs;
+					continue;
+				}
+				XMIT_QUEUED_SKBS();
+				unicast_arp_send(skb, dev, phdr);
+			}
+		}
+	}
+
+	/* Send out last packets (if any) */
+	XMIT_QUEUED_SKBS();
+
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
+
+	return skb_queue_empty(blist) ? NETDEV_TX_OK : NETDEV_TX_BUSY;
+}
+
 static struct net_device_stats *ipoib_get_stats(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -896,13 +1045,37 @@ int ipoib_dev_init(struct net_device *de
 		goto out_rx_ring_cleanup;
 	}
 
-	/* priv->tx_head & tx_tail are already 0 */
+	/* priv->tx_head, tx_tail & tx_prev_tail are already 0 */
 
-	if (ipoib_ib_dev_init(dev, ca, port))
+	/* Allocate tx_sge */
+	priv->tx_sge = kmalloc(ipoib_sendq_size * sizeof *priv->tx_sge,
+			       GFP_KERNEL);
+	if (!priv->tx_sge) {
+		printk(KERN_WARNING "%s: failed to allocate TX sge (%d entries)\n",
+		       ca->name, ipoib_sendq_size);
 		goto out_tx_ring_cleanup;
+	}
+
+	/* Allocate tx_wr */
+	priv->tx_wr = kmalloc(ipoib_sendq_size * sizeof *priv->tx_wr,
+			      GFP_KERNEL);
+	if (!priv->tx_wr) {
+		printk(KERN_WARNING "%s: failed to allocate TX wr (%d entries)\n",
+		       ca->name, ipoib_sendq_size);
+		goto out_tx_sge_cleanup;
+	}
+
+	if (ipoib_ib_dev_init(dev, ca, port))
+		goto out_tx_wr_cleanup;
 
 	return 0;
 
+out_tx_wr_cleanup:
+	kfree(priv->tx_wr);
+
+out_tx_sge_cleanup:
+	kfree(priv->tx_sge);
+
 out_tx_ring_cleanup:
 	kfree(priv->tx_ring);
 
@@ -930,9 +1103,13 @@ void ipoib_dev_cleanup(struct net_device
 
 	kfree(priv->rx_ring);
 	kfree(priv->tx_ring);
+	kfree(priv->tx_sge);
+	kfree(priv->tx_wr);
 
 	priv->rx_ring = NULL;
 	priv->tx_ring = NULL;
+	priv->tx_sge = NULL;
+	priv->tx_wr = NULL;
 }
 
 static void ipoib_setup(struct net_device *dev)
@@ -943,6 +1120,7 @@ static void ipoib_setup(struct net_devic
 	dev->stop 		 = ipoib_stop;
 	dev->change_mtu 	 = ipoib_change_mtu;
 	dev->hard_start_xmit 	 = ipoib_start_xmit;
+	dev->hard_start_xmit_batch = ipoib_start_xmit_frames;
 	dev->get_stats 		 = ipoib_get_stats;
 	dev->tx_timeout 	 = ipoib_timeout;
 	dev->hard_header 	 = ipoib_hard_header;
@@ -979,6 +1157,7 @@ static void ipoib_setup(struct net_devic
 
 	spin_lock_init(&priv->lock);
 	spin_lock_init(&priv->tx_lock);
+	spin_lock_init(&priv->comp_lock);
 
 	mutex_init(&priv->mcast_mutex);
 	mutex_init(&priv->vlan_mutex);

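
The loop above only batches runs of consecutive skbs going to the same
destination (same neigh/qpn), flushing via XMIT_QUEUED_SKBS() whenever the
destination changes. A tiny user-space model of that grouping rule
(illustration only; destinations reduced to plain integers):

#include <stdio.h>

int main(void)
{
	/* Destinations of queued skbs, in dequeue order */
	int dst[] = { 7, 7, 7, 3, 3, 7 };
	int n = sizeof(dst) / sizeof(dst[0]);
	int i, start = 0;

	for (i = 1; i <= n; i++) {
		if (i == n || dst[i] != dst[start]) {
			/* One ipoib_send() call covers this whole run */
			printf("send %d skb(s) to dst %d in one post\n",
			       i - start, dst[start]);
			start = i;
		}
	}
	return 0;
}

So the 6 skbs above go out in three posts (3 to dst 7, 2 to dst 3, 1 to
dst 7); traffic dominated by a single destination gets the full benefit of
the single-doorbell post_send() from patch 8/9.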

* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: David Miller @ 2007-08-08 10:49 UTC (permalink / raw)
  To: krkumar2
  Cc: johnpol, jagana, peter.p.waskiewicz.jr, herbert, gaagaan,
	Robert.Olsson, kumarkr, rdreier, mcarlson, jeff, hadi, general,
	mchan, tgraf, netdev, shemminger, kaber, sri

From: Krishna Kumar <krkumar2@in.ibm.com>
Date: Wed, 08 Aug 2007 15:01:14 +0530

> RESULTS: The performance improvement for TCP No Delay is in the range of -8%
> 	to 320% (with -8% being the sole negative), with many individual tests
> 	giving 50% or more improvement (I think this is due to the hw slots
> 	filling up faster, resulting in more batching when the queue gets
> 	woken). The results for TCP are in the range of -11% to 93%, with most
> 	of the tests (8/12) giving improvements.

Not because I think it obviates your work, but rather because I'm
curious, could you test a TSO-in-hardware driver converted to
batching and see how TSO alone compares to batching for a pure
TCP workload?

I personally don't think it will help for that case at all as
TSO likely does better job of coalescing the work _and_ reducing
bus traffic as well as work in the TCP stack.


* [ofa-general] Re: [PATCH 2/9 Rev3] [core] Add skb_blist & hard_start_xmit_batch
  2007-08-08  9:31 ` [PATCH 2/9 Rev3] [core] Add skb_blist & hard_start_xmit_batch Krishna Kumar
@ 2007-08-08 10:59   ` Stephen Hemminger
  2007-08-08 11:24     ` Krishna Kumar2
  2007-08-08 12:01   ` Evgeniy Polyakov
  1 sibling, 1 reply; 52+ messages in thread
From: Stephen Hemminger @ 2007-08-08 10:59 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: johnpol, jagana, peter.p.waskiewicz.jr, herbert, gaagaan,
	Robert.Olsson, kumarkr, rdreier, mcarlson, kaber, jeff, general,
	mchan, tgraf, netdev, hadi, davem, sri

How do qdiscs that do rate control work now? Did you add logic to
break up batches for token bucket, etc.?

* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-08 10:49 ` [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB David Miller
@ 2007-08-08 11:09   ` Krishna Kumar2
  2007-08-08 22:01     ` [ofa-general] " David Miller
  2007-08-08 13:42   ` [ofa-general] " Herbert Xu
  2007-08-14  9:02   ` Krishna Kumar2
  2 siblings, 1 reply; 52+ messages in thread
From: Krishna Kumar2 @ 2007-08-08 11:09 UTC (permalink / raw)
  To: David Miller
  Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
	kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
	rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma

David Miller <davem@davemloft.net> wrote on 08/08/2007 04:19:00 PM:

> From: Krishna Kumar <krkumar2@in.ibm.com>
> Date: Wed, 08 Aug 2007 15:01:14 +0530
>
> > RESULTS: The performance improvement for TCP No Delay is in the range of -8%
> >    to 320% (with -8% being the sole negative), with many individual tests
> >    giving 50% or more improvement (I think it is to do with the hw slots
> >    getting full quicker resulting in more batching when the queue gets
> >    woken). The results for TCP is in the range of -11% to 93%, with most
> >    of the tests (8/12) giving improvements.
>
> Not because I think it obviates your work, but rather because I'm
> curious, could you test a TSO-in-hardware driver converted to
> batching and see how TSO alone compares to batching for a pure
> TCP workload?
>
> I personally don't think it will help for that case at all as
> TSO likely does better job of coalescing the work _and_ reducing
> bus traffic as well as work in the TCP stack.

Definitely, I will try to do this.

What do you generally think of the patch/implementation? :)

thanks,

- KK



* [ofa-general] Re: [PATCH 2/9 Rev3] [core] Add skb_blist & hard_start_xmit_batch
  2007-08-08 10:59   ` [ofa-general] " Stephen Hemminger
@ 2007-08-08 11:24     ` Krishna Kumar2
  0 siblings, 0 replies; 52+ messages in thread
From: Krishna Kumar2 @ 2007-08-08 11:24 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	mcarlson, peter.p.waskiewicz.jr, hadi, kaber, jeff, general,
	mchan, tgraf, netdev, sri, davem, rdreier

Stephen Hemminger <shemminger@linux-foundation.org> wrote on 08/08/2007
04:29:46 PM:

> How do qdiscs that do rate control work now? Did you add logic to
> break up batches for token bucket, etc.?

My thought was that the code works identically to the current code except
in requeue cases. For requeue, the existing scheduler code decides which
skb to pass next (first) to the device; while in the batching case it is
assumed that skbs already in the batch list are part of the hardware queue
and are sent first.

If that is a problem, it is possible at the time of enabling batching to
check which qdisc_ops is being used (or to re-check whenever the qdisc
changes), or maybe to add a flag to each qdisc_ops to indicate whether the
qdisc supports batching, etc. Batching could definitely be enabled when
pfifo_fast_ops is in use.

What is your suggestion?

thanks,

- KK


* [ofa-general] Re: [PATCH 2/9 Rev3] [core] Add skb_blist & hard_start_xmit_batch
  2007-08-08  9:31 ` [PATCH 2/9 Rev3] [core] Add skb_blist & hard_start_xmit_batch Krishna Kumar
  2007-08-08 10:59   ` [ofa-general] " Stephen Hemminger
@ 2007-08-08 12:01   ` Evgeniy Polyakov
  2007-08-09  3:09     ` Krishna Kumar2
  1 sibling, 1 reply; 52+ messages in thread
From: Evgeniy Polyakov @ 2007-08-08 12:01 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
	kumarkr, rdreier, mcarlson, kaber, jeff, hadi, general, mchan,
	tgraf, netdev, shemminger, davem, sri

Hi Krishna.

On Wed, Aug 08, 2007 at 03:01:35PM +0530, Krishna Kumar (krkumar2@in.ibm.com) wrote:
> +int dev_change_tx_batch_skb(struct net_device *dev, unsigned long new_batch_skb)
> +{
> +	int ret = 0;
> +	struct sk_buff_head *blist;
> +
> +	if (!dev->hard_start_xmit_batch) {
> +		/* Driver doesn't support batching skb API */
> +		ret = -ENOTSUPP;
> +		goto out;
> +	}
> +
> +	/* Handle invalid argument */
> +	if (new_batch_skb < 0) {
> +		ret = -EINVAL;
> +		goto out;
> +	}

It is unsigned, how can it be less than zero?

And actually you use it just like a binary flag (cast to/from u32 in
the code, btw), so why not use ethtool_value directly here?
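
A minimal sketch of what I mean (the handler name and exact plumbing
here are guesses, not code from the patch):

	static int ethtool_set_batch(struct net_device *dev, void __user *useraddr)
	{
		struct ethtool_value edata;	/* { u32 cmd; u32 data; } */

		if (copy_from_user(&edata, useraddr, sizeof(edata)))
			return -EFAULT;

		/* edata.data is a u32, so no "< 0" check is needed */
		return dev_change_tx_batch_skb(dev, edata.data);
	}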

> +	/* Check if new value is same as the current */
> +	if (!!dev->skb_blist == !!new_batch_skb)
> +		goto out;
> +
> +	if (new_batch_skb &&
> +	    (blist = kmalloc(sizeof *blist, GFP_KERNEL)) == NULL) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	spin_lock(&dev->queue_lock);
> +	if (new_batch_skb) {
> +		skb_queue_head_init(blist);
> +		dev->skb_blist = blist;
> +	} else
> +		free_batching(dev);
> +	spin_unlock(&dev->queue_lock);

This needs bh lock too, since blist is accessed by qdisc_restart.
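
I.e., sketched (the same code as in the patch, just the BH-safe lock
flavor, so qdisc_restart cannot run on this CPU while skb_blist is
swapped):

	spin_lock_bh(&dev->queue_lock);
	if (new_batch_skb) {
		skb_queue_head_init(blist);
		dev->skb_blist = blist;
	} else
		free_batching(dev);
	spin_unlock_bh(&dev->queue_lock);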

> +int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev)
> +{
> +	if (!list_empty(&ptype_all))
> +		dev_queue_xmit_nit(skb, dev);
> +
> +	if (netif_needs_gso(dev, skb)) {
> +		if (unlikely(dev_gso_segment(skb))) {
> +			kfree(skb);
> +			return 0;
> +		}
> +
> +		if (skb->next) {
> +			int count = 0;
> +
> +			do {
> +				struct sk_buff *nskb = skb->next;
> +
> +				skb->next = nskb->next;
> +				__skb_queue_tail(dev->skb_blist, nskb);
> +				count++;
> +			} while (skb->next);

Is it possible to move the list without iterating over each entry?


-- 
	Evgeniy Polyakov


* [ofa-general] Re: [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching
  2007-08-08  9:31 ` [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching Krishna Kumar
@ 2007-08-08 12:14   ` Evgeniy Polyakov
  2007-08-09  3:13     ` Krishna Kumar2
  2007-08-08 14:05   ` Patrick McHardy
  1 sibling, 1 reply; 52+ messages in thread
From: Evgeniy Polyakov @ 2007-08-08 12:14 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: jagana, Robert.Olsson, herbert, gaagaan, kumarkr, rdreier,
	peter.p.waskiewicz.jr, mcarlson, kaber, jeff, hadi, general,
	mchan, tgraf, netdev, shemminger, davem, sri

On Wed, Aug 08, 2007 at 03:01:45PM +0530, Krishna Kumar (krkumar2@in.ibm.com) wrote:
> +static inline int get_skb(struct net_device *dev, struct Qdisc *q,
> +			  struct sk_buff_head *blist, struct sk_buff **skbp)
> +{
> +	if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1))) {
> +		return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
> +	} else {
> +		int max = dev->tx_queue_len - skb_queue_len(blist);
> +		struct sk_buff *skb;
> +
> +		while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
> +			max -= dev_add_skb_to_blist(skb, dev);
> +
> +		*skbp = NULL;
> +		return 1;	/* we have atleast one skb in blist */
> +	}
> +}

Same here - is it possible to get a list in one go instead of pulling
one-by-one, since it forces quite a few additional unneeded lock
gets/releases? What about dev_dequeue_number_skb(dev, q, num), which will
grab the lock and move a list of skbs from one queue to the provided head.
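
A rough sketch of that helper (assuming the caller does not already
hold queue_lock, and ignoring the gso/requeue handling that
dev_dequeue_skb does):

	static int dev_dequeue_number_skb(struct net_device *dev, struct Qdisc *q,
					  struct sk_buff_head *blist, int num)
	{
		struct sk_buff *skb;
		int count = 0;

		spin_lock(&dev->queue_lock);	/* one lock round trip per burst */
		while (count < num && (skb = q->dequeue(q)) != NULL) {
			__skb_queue_tail(blist, skb);
			count++;
		}
		spin_unlock(&dev->queue_lock);

		return count;
	}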

> @@ -158,7 +198,10 @@ static inline int qdisc_restart(struct n
>  	/* And release queue */
>  	spin_unlock(&dev->queue_lock);
>  
> -	ret = dev_hard_start_xmit(skb, dev);
> +	if (likely(skb))
> +		ret = dev_hard_start_xmit(skb, dev);
> +	else
> +		ret = dev->hard_start_xmit_batch(dev);

Perfectionism says that having an array of two functions and calling one of
them via array_func_pointer[!!skb] will be much faster. Just a thought.
It is actually much faster than an if/else on x86 at least.
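
A sketch of the dispatch; a small wrapper would be needed to unify the
two signatures, since the batch call takes no skb:

	typedef int (*xmit_fn_t)(struct sk_buff *, struct net_device *);

	static int batch_xmit_wrapper(struct sk_buff *skb, struct net_device *dev)
	{
		return dev->hard_start_xmit_batch(dev);	/* skb is NULL here */
	}

	static const xmit_fn_t xmit_fn[2] = {
		batch_xmit_wrapper,	/* [0]: skb == NULL */
		dev_hard_start_xmit,	/* [1]: skb != NULL */
	};
	...
	ret = xmit_fn[!!skb](skb, dev);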

-- 
	Evgeniy Polyakov


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-08 10:49 ` [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB David Miller
  2007-08-08 11:09   ` Krishna Kumar2
@ 2007-08-08 13:42   ` Herbert Xu
  2007-08-08 15:14     ` jamal
                       ` (2 more replies)
  2007-08-14  9:02   ` Krishna Kumar2
  2 siblings, 3 replies; 52+ messages in thread
From: Herbert Xu @ 2007-08-08 13:42 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, peter.p.waskiewicz.jr, jeff, gaagaan, Robert.Olsson,
	kumarkr, rdreier, mcarlson, jagana, hadi, general, mchan, tgraf,
	netdev, shemminger, kaber, sri

On Wed, Aug 08, 2007 at 03:49:00AM -0700, David Miller wrote:
> 
> Not because I think it obviates your work, but rather because I'm
> curious, could you test a TSO-in-hardware driver converted to
> batching and see how TSO alone compares to batching for a pure
> TCP workload?

You could even lower the bar by disabling TSO and enabling
software GSO.

> I personally don't think it will help for that case at all as
> TSO likely does better job of coalescing the work _and_ reducing
> bus traffic as well as work in the TCP stack.

I agree.  I suspect the bulk of the effort is in getting
these skb's created and processed by the stack so that by
the time that they're exiting the qdisc there's not much
to be saved anymore.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching
  2007-08-08  9:31 ` [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching Krishna Kumar
  2007-08-08 12:14   ` [ofa-general] " Evgeniy Polyakov
@ 2007-08-08 14:05   ` Patrick McHardy
  2007-08-08 15:26     ` [ofa-general] " jamal
  2007-08-09  4:06     ` Krishna Kumar2
  1 sibling, 2 replies; 52+ messages in thread
From: Patrick McHardy @ 2007-08-08 14:05 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: johnpol, davem, shemminger, sri, jagana, Robert.Olsson,
	peter.p.waskiewicz.jr, herbert, gaagaan, kumarkr, rdreier,
	rick.jones2, mcarlson, jeff, general, mchan, tgraf, hadi, netdev,
	xma

Krishna Kumar wrote:
> + * Algorithm to get skb(s) is:
> + *	- Non batching drivers, or if the batch list is empty and there is
> + *	  1 skb in the queue - dequeue skb and put it in *skbp to tell the
> + *	  caller to use the single xmit API.
> + *	- Batching drivers where the batch list already contains atleast one
> + *	  skb, or if there are multiple skbs in the queue: keep dequeue'ing
> + *	  skb's upto a limit and set *skbp to NULL to tell the caller to use
> + *	  the multiple xmit API.
> + *
> + * Returns:
> + *	1 - atleast one skb is to be sent out, *skbp contains skb or NULL
> + *	    (in case >1 skbs present in blist for batching)
> + *	0 - no skbs to be sent.
> + */
> +static inline int get_skb(struct net_device *dev, struct Qdisc *q,
> +			  struct sk_buff_head *blist, struct sk_buff **skbp)
> +{
> +	if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1))) {
> +		return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
> +	} else {
> +		int max = dev->tx_queue_len - skb_queue_len(blist);
> +		struct sk_buff *skb;
> +
> +		while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
> +			max -= dev_add_skb_to_blist(skb, dev);
> +
> +		*skbp = NULL;
> +		return 1;	/* we have atleast one skb in blist */
> +	}
> +}


I think you missed my previous reply to this in the flood of
responses (or I missed your reply), so I'm copying it below:

The entire idea of a single queue after the qdiscs that is refilled
independently of transmission times etc. makes me worry a bit.
By changing the timing you're effectively changing the qdiscs'
behaviour, at least in some cases. SFQ is a good example, but I
believe it affects most work-conserving qdiscs. Think of this
situation:

100 packets of flow 1 arrive
50 packets of flow 1 are sent
100 packets for flow 2 arrive
remaining packets are sent

On the wire you'll first see 50 packets of flow 1, then 100 packets
alternating between flow 1 and 2, then 50 packets of flow 2.

With your additional queue all packets of flow 1 are pulled out of
the qdisc immediately and put in the fifo. When the 100 packets of
the second flow arrive they will also get pulled out immediately
and are put in the fifo behind the remaining 50 packets of flow 1.
So what you get on the wire is:

100 packets of flow 1
100 packets of flow 2

So SFQ is without any effect. This is not completely avoidable of
course, but you can and should limit the damage by only pulling
out as many packets as the driver can take and having the driver
stop the queue afterwards.



* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-08 13:42   ` [ofa-general] " Herbert Xu
@ 2007-08-08 15:14     ` jamal
  2007-08-08 20:55       ` Stephen Hemminger
  2007-08-08 22:22       ` David Miller
  2007-08-09  0:06     ` Shirley Ma
  2007-08-09  3:19     ` [ofa-general] " Krishna Kumar2
  2 siblings, 2 replies; 52+ messages in thread
From: jamal @ 2007-08-08 15:14 UTC (permalink / raw)
  To: Herbert Xu
  Cc: johnpol, peter.p.waskiewicz.jr, jeff, gaagaan, Robert.Olsson,
	kumarkr, rdreier, mcarlson, kaber, jagana, general, mchan, tgraf,
	netdev, shemminger, David Miller, sri

On Wed, 2007-08-08 at 21:42 +0800, Herbert Xu wrote:
> On Wed, Aug 08, 2007 at 03:49:00AM -0700, David Miller wrote:
> > 
> > Not because I think it obviates your work, but rather because I'm
> > curious, could you test a TSO-in-hardware driver converted to
> > batching and see how TSO alone compares to batching for a pure
> > TCP workload?
> 
> You could even lower the bar by disabling TSO and enabling
> software GSO.

From my observation, for TCP packets slightly above the MTU (up to 2K), GSO
gives worse performance than non-GSO throughput-wise. Actually this has
nothing to do with batching; rather, the behavior is consistent with or
without the batching changes.

> > I personally don't think it will help for that case at all as
> > TSO likely does better job of coalescing the work _and_ reducing
> > bus traffic as well as work in the TCP stack.
> 
> I agree.
> I suspect the bulk of the effort is in getting
> these skb's created and processed by the stack so that by
> the time that they're exiting the qdisc there's not much
> to be saved anymore.

pktgen shows a clear win if you test the driver path - which is what you
should test, because that's where the batching changes are.
Using TCP or UDP adds other variables[1] that need to be isolated first
in order to quantify the effect of batching.
For throughput and CPU utilization, the benefit will be clear when there
are a lot more flows.

cheers,
jamal

[1] I think there are too many other variables in play, unfortunately,
when you are dealing with a path that starts above the driver and one
that covers the end-to-end effect: traffic/app source, system clock sources
as per my recent discovery, congestion control algorithms used, tuning
of the receiver, etc.


* [ofa-general] Re: [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching
  2007-08-08 14:05   ` Patrick McHardy
@ 2007-08-08 15:26     ` jamal
  2007-08-09  4:06     ` Krishna Kumar2
  1 sibling, 0 replies; 52+ messages in thread
From: jamal @ 2007-08-08 15:26 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: johnpol, herbert, gaagaan, Robert.Olsson, kumarkr, rdreier,
	peter.p.waskiewicz.jr, mcarlson, netdev, jagana, general, mchan,
	tgraf, jeff, shemminger, davem, sri

On Wed, 2007-08-08 at 16:05 +0200, Patrick McHardy wrote:

> This is not completely avoidable of
> course, but you can and should limit the damage by only pulling
> out as much packets as the driver can take and have the driver
> stop the queue afterwards.

This is why dev->xmit_win exists in my patches. The driver says how
much space it has using this variable - and only that many packets get
pulled off; so there is no conflict with any scheduling algorithm, be it
work- or non-work-conserving.
The ideal case is never to carry over anything in dev->blist. However,
there is a challenge with GSO/TSO: if, say, the free descriptors in the
driver are fewer than the list of skbs, you are faced with the dilemma of
sending a fraction of the gso list first.
My current approach is to transmit as much as the driver will allow.
On the next opportunity to transmit, you immediately send anything
remaining first, before you ask the qdisc for more packets. An
alternative approach I had entertained was not to send anything from the
skb list when hitting such a case, but it hasn't seemed needed so far
(I haven't seen any leftovers in my experiments).
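
Roughly, in code (a sketch; xmit_win and blist are from my patches,
dev_dequeue_skb/dev_add_skb_to_blist from this series):

	/* pull no more than the driver says it can take right now */
	while (dev->xmit_win > skb_queue_len(dev->blist) &&
	       (skb = dev_dequeue_skb(dev, q)) != NULL)
		dev_add_skb_to_blist(skb, dev);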

cheers,
jamal


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-08 15:14     ` jamal
@ 2007-08-08 20:55       ` Stephen Hemminger
  2007-08-08 22:40         ` jamal
  2007-08-08 22:22       ` David Miller
  1 sibling, 1 reply; 52+ messages in thread
From: Stephen Hemminger @ 2007-08-08 20:55 UTC (permalink / raw)
  To: hadi
  Cc: johnpol, peter.p.waskiewicz.jr, Herbert Xu, gaagaan,
	Robert.Olsson, kumarkr, rdreier, mcarlson, David Miller, jagana,
	general, mchan, tgraf, jeff, netdev, kaber, sri

On Wed, 08 Aug 2007 11:14:35 -0400
jamal <hadi@cyberus.ca> wrote:

> On Wed, 2007-08-08 at 21:42 +0800, Herbert Xu wrote:
> > On Wed, Aug 08, 2007 at 03:49:00AM -0700, David Miller wrote:
> > > 
> > > Not because I think it obviates your work, but rather because I'm
> > > curious, could you test a TSO-in-hardware driver converted to
> > > batching and see how TSO alone compares to batching for a pure
> > > TCP workload?
> > 
> > You could even lower the bar by disabling TSO and enabling
> > software GSO.
> 
> From my observation, for TCP packets slightly above the MTU (up to 2K), GSO
> gives worse performance than non-GSO throughput-wise. Actually this has
> nothing to do with batching; rather, the behavior is consistent with or
> without the batching changes.
> 
> > > I personally don't think it will help for that case at all as
> > > TSO likely does better job of coalescing the work _and_ reducing
> > > bus traffic as well as work in the TCP stack.
> > 
> > I agree.
> > I suspect the bulk of the effort is in getting
> > these skb's created and processed by the stack so that by
> > the time that they're exiting the qdisc there's not much
> > to be saved anymore.
> 
> pktgen shows a clear win if you test the driver path - which is what you
> should test, because that's where the batching changes are.
> Using TCP or UDP adds other variables[1] that need to be isolated first
> in order to quantify the effect of batching.
> For throughput and CPU utilization, the benefit will be clear when there
> are a lot more flows.

Optimizing for pktgen is a mistake for most users. Please show something
useful like router forwarding, TCP (single and multi-flow) and/or, better
yet, application benchmark improvement.


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-08 11:09   ` Krishna Kumar2
@ 2007-08-08 22:01     ` David Miller
  2007-08-09  4:19       ` Krishna Kumar2
  0 siblings, 1 reply; 52+ messages in thread
From: David Miller @ 2007-08-08 22:01 UTC (permalink / raw)
  To: krkumar2
  Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
	peter.p.waskiewicz.jr, hadi, mcarlson, netdev, general, mchan,
	tgraf, sri, shemminger, kaber, herbert

From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Wed, 8 Aug 2007 16:39:47 +0530

> What do you generally think of the patch/implementation? :)

We have two driver implementation paths on receive and now
we'll have two on send, and that's not a good trend.

In an ideal world all the drivers would be NAPI and netif_rx()
would only be used by tunneling drivers and similar in the
protocol layers.  And likewise all sends would go through
->hard_start_xmit().

If you can come up with a long term strategy that gets rid of
the special transmit method, that'd be great.

We should make Linux network drivers easy to write, not more difficult
by constantly adding more interfaces than we consolidate.


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-08 15:14     ` jamal
  2007-08-08 20:55       ` Stephen Hemminger
@ 2007-08-08 22:22       ` David Miller
  2007-08-08 22:53         ` jamal
  1 sibling, 1 reply; 52+ messages in thread
From: David Miller @ 2007-08-08 22:22 UTC (permalink / raw)
  To: hadi
  Cc: johnpol, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
	kumarkr, rdreier, mcarlson, netdev, jagana, general, mchan, tgraf,
	jeff, shemminger, kaber, sri

From: jamal <hadi@cyberus.ca>
Date: Wed, 08 Aug 2007 11:14:35 -0400

> pktgen shows a clear win if you test the driver path - which is what
> you should test, because that's where the batching changes are.

The driver path, however, does not exist on an island and what
we care about is the final result with the changes running
inside the full system.

So, to be honest, besides initial internal development
feedback, the isolated tests have only minimal merit, and
it's the full protocol tests that are really interesting.


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-08 20:55       ` Stephen Hemminger
@ 2007-08-08 22:40         ` jamal
  0 siblings, 0 replies; 52+ messages in thread
From: jamal @ 2007-08-08 22:40 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: johnpol, peter.p.waskiewicz.jr, Herbert Xu, gaagaan,
	Robert.Olsson, kumarkr, rdreier, mcarlson, David Miller, jagana,
	general, mchan, tgraf, jeff, netdev, kaber, sri

On Wed, 2007-08-08 at 21:55 +0100, Stephen Hemminger wrote:

> > pktgen shows a clear win if you test the driver path - which is what you
> > should test, because that's where the batching changes are.
> > Using TCP or UDP adds other variables[1] that need to be isolated first
> > in order to quantify the effect of batching.
> > For throughput and CPU utilization, the benefit will be clear when there
> > are a lot more flows.
> 
> Optimizing for pktgen is a mistake for most users. 

There is no "optimization for pktgen".

If you are improving the tx path, the first step is to test that you can
show the tx path improved. pktgen happens to be the best test suite for
that because it talks to the driver and exercises the changes.
I.e., if one can't show that exercising the direct path demonstrates
improvements, you probably won't be able to show batching improves TCP -
but I don't even want to swear by that.
Does that make sense?

> Please show something
> useful like router forwarding, TCP (single and multi-flow) and/or, better
> yet, application benchmark improvement.

Absolutely, but first things first. Analysis of why something improves
is extremely important; just saying "TCP throughput improved" is
uninteresting and lazy.
To be scientific, it is important to isolate variables first in order to
come up with meaningful results that can be analysed.
To make a point, I have noticed extremely different results between TCP
BIC and Reno with batching. So congestion control as a variable is
important.

cheers,
jamal


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-08 22:22       ` David Miller
@ 2007-08-08 22:53         ` jamal
  0 siblings, 0 replies; 52+ messages in thread
From: jamal @ 2007-08-08 22:53 UTC (permalink / raw)
  To: David Miller
  Cc: johnpol, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
	kumarkr, rdreier, mcarlson, netdev, jagana, general, mchan, tgraf,
	jeff, shemminger, kaber, sri

On Wed, 2007-08-08 at 15:22 -0700, David Miller wrote:

> The driver path, however, does not exist on an island and what
> we care about is the final result with the changes running
> inside the full system.
> 
> So, to be honest, besides initial internal development
> feedback, the isolated tests have only minimal merit, and
> it's the full protocol tests that are really interesting.

But you can't go there if you can't show the path which is supposedly improved
has indeed improved;-> I would certainly agree with you that if it
doesn't prove consistently useful with protocols it has no value
(remember, that's why I never submitted these patches all this time).
We just need better analysis of the results - I can't ignore that the
selection of the clock source, for example, gives me different results
and that when I boot I can't be guaranteed the same clock source. I can't
ignore the fact that I get different results when I use a different
congestion control algorithm. And none of this has to do with the
batching patches.

I am using UDP at the moment because it is simpler to analyze. And yes,
it would be "an interesting idea that gets shelved" if we can't achieve
any of the expected goals. We've shelved ideas before. BTW, read the
little doc I wrote on dev->prep_xmit(); you may find it interesting.

cheers,
jamal


* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-08 13:42   ` [ofa-general] " Herbert Xu
  2007-08-08 15:14     ` jamal
@ 2007-08-09  0:06     ` Shirley Ma
  2007-08-09  3:19     ` [ofa-general] " Krishna Kumar2
  2 siblings, 0 replies; 52+ messages in thread
From: Shirley Ma @ 2007-08-09  0:06 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, gaagaan, general, hadi, jeff, johnpol, kaber,
	krkumar2, kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr,
	rdreier, rick.jones2, Robert.Olsson, shemminger, sri, tgraf,
	Venkata Jagana

Hello Herbert,

> > Not because I think it obviates your work, but rather because I'm
> > curious, could you test a TSO-in-hardware driver converted to
> > batching and see how TSO alone compares to batching for a pure
> > TCP workload?
> 
> You could even lower the bar by disabling TSO and enabling
> software GSO.

We had a discussion before. GSO doesn't benefit a device which has no HW
checksum (like IPoIB), since it induces an extra copy. And GSO benefits one
stream; batching benefits multiple streams.

Thanks
Shirley


* [ofa-general] Re: [PATCH 2/9 Rev3] [core] Add skb_blist & hard_start_xmit_batch
  2007-08-08 12:01   ` Evgeniy Polyakov
@ 2007-08-09  3:09     ` Krishna Kumar2
  0 siblings, 0 replies; 52+ messages in thread
From: Krishna Kumar2 @ 2007-08-09  3:09 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: jagana, mcarlson, herbert, gaagaan, Robert.Olsson, kumarkr,
	rdreier, peter.p.waskiewicz.jr, hadi, kaber, jeff, general, mchan,
	tgraf, netdev, shemminger, davem, sri

Hi Evgeniy,

Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote on 08/08/2007 05:31:43 PM:

> > +int dev_change_tx_batch_skb(struct net_device *dev, unsigned long new_batch_skb)
> > +{
> > +   int ret = 0;
> > +   struct sk_buff_head *blist;
> > +
> > +   if (!dev->hard_start_xmit_batch) {
> > +      /* Driver doesn't support batching skb API */
> > +      ret = -ENOTSUPP;
> > +      goto out;
> > +   }
> > +
> > +   /* Handle invalid argument */
> > +   if (new_batch_skb < 0) {
> > +      ret = -EINVAL;
> > +      goto out;
> > +   }
> It is unsigned, how can it be less than zero?

Yuck, originally I had it as an int, changed it to a ulong, and forgot to
remove this check.

> And actually you use it just like a binary flag (cast to/from u32 in
> the code, btw), so why not use ethtool_value directly here?

I still need to check if the value is changing, so the one check is needed.
Later I am using it as a value directly.

> > +   /* Check if new value is same as the current */
> > +   if (!!dev->skb_blist == !!new_batch_skb)
> > +      goto out;
> > +
> > +   if (new_batch_skb &&
> > +       (blist = kmalloc(sizeof *blist, GFP_KERNEL)) == NULL) {
> > +      ret = -ENOMEM;
> > +      goto out;
> > +   }
> > +
> > +   spin_lock(&dev->queue_lock);
> > +   if (new_batch_skb) {
> > +      skb_queue_head_init(blist);
> > +      dev->skb_blist = blist;
> > +   } else
> > +      free_batching(dev);
> > +   spin_unlock(&dev->queue_lock);
>
> This needs bh lock too, since blist is accessed by qdisc_restart.

Yes, I had it in the code and put it in the list of changes, but missed it
for some reason :(

> > +int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev)
> > +{
> > +   if (!list_empty(&ptype_all))
> > +      dev_queue_xmit_nit(skb, dev);
> > +
> > +   if (netif_needs_gso(dev, skb)) {
> > +      if (unlikely(dev_gso_segment(skb))) {
> > +         kfree(skb);
> > +         return 0;
> > +      }
> > +
> > +      if (skb->next) {
> > +         int count = 0;
> > +
> > +         do {
> > +            struct sk_buff *nskb = skb->next;
> > +
> > +            skb->next = nskb->next;
> > +            __skb_queue_tail(dev->skb_blist, nskb);
> > +            count++;
> > +         } while (skb->next);
>
> Is it possible to move the list without iterating over each entry?

Though I cannot see an obvious way to do that, let me see if
something is possible, as it would make a good difference.

thanks for your suggestions,

- KK


* Re: [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching
  2007-08-08 12:14   ` [ofa-general] " Evgeniy Polyakov
@ 2007-08-09  3:13     ` Krishna Kumar2
  0 siblings, 0 replies; 52+ messages in thread
From: Krishna Kumar2 @ 2007-08-09  3:13 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: davem, gaagaan, general, hadi, herbert, jagana, jeff, kaber,
	kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
	rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma

Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote on 08/08/2007 05:44:02 PM:

> On Wed, Aug 08, 2007 at 03:01:45PM +0530, Krishna Kumar (krkumar2@in.ibm.com) wrote:
> > +static inline int get_skb(struct net_device *dev, struct Qdisc *q,
> > +           struct sk_buff_head *blist, struct sk_buff **skbp)
> > +{
> > +   if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1))) {
> > +      return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
> > +   } else {
> > +      int max = dev->tx_queue_len - skb_queue_len(blist);
> > +      struct sk_buff *skb;
> > +
> > +      while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
> > +         max -= dev_add_skb_to_blist(skb, dev);
> > +
> > +      *skbp = NULL;
> > +      return 1;   /* we have atleast one skb in blist */
> > +   }
> > +}
>
> Same here - is it possible to get a list in one go instead of pulling
> one-by-one, since it forces quite a few additional unneeded lock
> gets/releases? What about dev_dequeue_number_skb(dev, q, num), which will
> grab the lock and move a list of skbs from one queue to the provided head.

OK, I will try this out.

> > @@ -158,7 +198,10 @@ static inline int qdisc_restart(struct n
> >     /* And release queue */
> >     spin_unlock(&dev->queue_lock);
> >
> > -   ret = dev_hard_start_xmit(skb, dev);
> > +   if (likely(skb))
> > +      ret = dev_hard_start_xmit(skb, dev);
> > +   else
> > +      ret = dev->hard_start_xmit_batch(dev);
>
> Perfectionism says that having an array of two functions and calling one of
> them via array_func_pointer[!!skb] will be much faster. Just a thought.
> It is actually much faster than an if/else on x86 at least.

Thinking about this - I will have to store the two-pointer array in dev
itself, wasting some space, and the function pointers will have different
signatures, as one takes an extra argument. Will ponder some more :)

thanks,

- KK



* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-08 13:42   ` [ofa-general] " Herbert Xu
  2007-08-08 15:14     ` jamal
  2007-08-09  0:06     ` Shirley Ma
@ 2007-08-09  3:19     ` Krishna Kumar2
  2 siblings, 0 replies; 52+ messages in thread
From: Krishna Kumar2 @ 2007-08-09  3:19 UTC (permalink / raw)
  To: Herbert Xu
  Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, mcarlson,
	peter.p.waskiewicz.jr, hadi, kaber, netdev, general, mchan, tgraf,
	sri, shemminger, David Miller, rdreier

Herbert Xu <herbert@gondor.apana.org.au> wrote on 08/08/2007 07:12:47 PM:

> On Wed, Aug 08, 2007 at 03:49:00AM -0700, David Miller wrote:
> >
> > Not because I think it obviates your work, but rather because I'm
> > curious, could you test a TSO-in-hardware driver converted to
> > batching and see how TSO alone compares to batching for a pure
> > TCP workload?
>
> You could even lower the bar by disabling TSO and enabling
> software GSO.

I will try with E1000 (though I didn't see an improvement when I tested a
long time back). The difference I expect is that TSO would help with large
packets but not necessarily small/medium packets, and definitely not in
the case of multiple different skbs (as opposed to a single large skb)
getting queued. I think these are two different workloads.

> > I personally don't think it will help for that case at all as
> > TSO likely does better job of coalescing the work _and_ reducing
> > bus traffic as well as work in the TCP stack.
>
> I agree.  I suspect the bulk of the effort is in getting
> these skb's created and processed by the stack so that by
> the time that they're exiting the qdisc there's not much
> to be saved anymore.

However, I am getting a large improvement for IPoIB specifically for this
same case. The reason: batching helps only when the queue gets full and
stopped (and, to a lesser extent, when the tx lock was not acquired, which
results in a smaller amount of batching being possible).

thanks,

- KK


* [ofa-general] Re: [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching
  2007-08-08 14:05   ` Patrick McHardy
  2007-08-08 15:26     ` [ofa-general] " jamal
@ 2007-08-09  4:06     ` Krishna Kumar2
  1 sibling, 0 replies; 52+ messages in thread
From: Krishna Kumar2 @ 2007-08-09  4:06 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
	rdreier, peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general,
	mchan, tgraf, netdev, shemminger, davem, sri

Hi Patrick,

Patrick McHardy <kaber@trash.net> wrote on 08/08/2007 07:35:17 PM:

> Krishna Kumar wrote:
<snip>
> > +static inline int get_skb(struct net_device *dev, struct Qdisc *q,
> > +           struct sk_buff_head *blist, struct sk_buff **skbp)
> > +{
> > +   if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1))) {
> > +      return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
> > +   } else {
> > +      int max = dev->tx_queue_len - skb_queue_len(blist);
> > +      struct sk_buff *skb;
> > +
> > +      while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
> > +         max -= dev_add_skb_to_blist(skb, dev);
> > +
> > +      *skbp = NULL;
> > +      return 1;   /* we have atleast one skb in blist */
> > +   }
> > +}
>
> I think you missed my previous reply to this in the flood of
> responses (or I missed your reply), so I'm copying it below:

Sorry, but I didn't get your post on this point earlier (thanks for
posting again).

> The entire idea of a single queue after the qdiscs that is refilled
> independently of transmission times etc. makes me worry a bit.
> By changing the timing you're effectively changing the qdiscs'
> behaviour, at least in some cases. SFQ is a good example, but I
> believe it affects most work-conserving qdiscs. Think of this
> situation:
>
> 100 packets of flow 1 arrive
> 50 packets of flow 1 are sent
> 100 packets for flow 2 arrive
> remaining packets are sent
>
> On the wire you'll first see 50 packets of flow 1, then 100 packets
> alternating between flow 1 and 2, then 50 packets of flow 2.
>
> With your additional queue all packets of flow 1 are pulled out of
> the qdisc immediately and put in the fifo. When the 100 packets of
> the second flow arrive they will also get pulled out immediately
> and are put in the fifo behind the remaining 50 packets of flow 1.
> So what you get on the wire is:
>
> 100 packets of flow 1
> 100 packets of flow 2

In the normal case (qdisc run from xmit and not from tx_action), the code
executing is the same as the regular code, with no difference in wire
behavior, due to the check:
      if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)))
            return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
(I always pass blist as NULL in that case.)

With SFQ for the above scenario, my code will first send out 50 F1 skbs
iteratively (blist==NULL), get blocked so that 50 skbs accumulate on F1
and 100 on F2; then, when it is woken up, it will batch, but it will pick
up 50 F1 and 50 F2 skbs in alternating order and put them in the queue, and
finally pick up the remaining 50 F2 skbs and put those too in the queue.
Since I am calling dev_dequeue_skb, I am assured of RR when using SFQ.
The only time I would have 100 skbs from F1 in my queue (and hence sent
out first) is if there are NO F2 skbs.

> So SFQ is without any effect. This is not completely avoidable of
> course, but you can and should limit the damage by only pulling
> out as many packets as the driver can take and having the driver
> stop the queue afterwards.

I feel this cannot happen. Please correct me if I am wrong.

Thanks,

- KK


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-08 22:01     ` [ofa-general] " David Miller
@ 2007-08-09  4:19       ` Krishna Kumar2
  2007-08-09  4:27         ` David Miller
  0 siblings, 1 reply; 52+ messages in thread
From: Krishna Kumar2 @ 2007-08-09  4:19 UTC (permalink / raw)
  To: David Miller
  Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
	peter.p.waskiewicz.jr, hadi, mcarlson, netdev, general, mchan,
	tgraf, sri, shemminger, kaber, herbert

Hi Dave,

David Miller <davem@davemloft.net> wrote on 08/09/2007 03:31:37 AM:

> > What do you generally think of the patch/implementation? :)
>
> We have two driver implementation paths on receive and now
> we'll have two on send, and that's not a good trend.

Correct.

> In an ideal world all the drivers would be NAPI and netif_rx()
> would only be used by tunneling drivers and similar in the
> protocol layers.  And likewise all sends would go through
> ->hard_start_xmit().
>
> If you can come up with a long term strategy that gets rid of
> the special transmit method, that'd be great.
>
> We should make Linux network drivers easy to write, not more difficult
> by constantly adding more interfaces than we consolidate.

I think that is a good top-level view, and I agree with it.

Patrick had suggested calling dev_hard_start_xmit() instead of
conditionally calling the new API, and removing the new API
entirely. The driver determines whether batching is required or
not depending on whether skb is NULL. Would that approach be fine
with this "single interface" goal?

Thanks,

- KK


* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-09  4:19       ` Krishna Kumar2
@ 2007-08-09  4:27         ` David Miller
  2007-08-09  6:26           ` Krishna Kumar2
  0 siblings, 1 reply; 52+ messages in thread
From: David Miller @ 2007-08-09  4:27 UTC (permalink / raw)
  To: krkumar2
  Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
	kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
	rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma

From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Thu, 9 Aug 2007 09:49:57 +0530

> Patrick had suggested calling dev_hard_start_xmit() instead of
> conditionally calling the new API, and removing the new API
> entirely. The driver determines whether batching is required or
> not depending on whether skb is NULL. Would that approach be fine
> with this "single interface" goal?

It is a valid possibility.

Note that this is similar to how we handle TSO, the driver
sets the feature bit and in its ->hard_start_xmit() it checks
the SKB for the given offload property.
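
For example, a sketch of such a merged entry point for IPoIB (both
helper names are made up for illustration):

	static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
	{
		/* A NULL skb asks the driver to drain dev->skb_blist,
		 * much like a TSO driver checking the skb's offload
		 * bits in its xmit routine. */
		if (!skb)
			return ipoib_xmit_frames(dev);	/* batch path */

		return ipoib_xmit_one(skb, dev);	/* single-skb path */
	}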


* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-09  4:27         ` David Miller
@ 2007-08-09  6:26           ` Krishna Kumar2
  0 siblings, 0 replies; 52+ messages in thread
From: Krishna Kumar2 @ 2007-08-09  6:26 UTC (permalink / raw)
  To: David Miller
  Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
	kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
	rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma

David Miller <davem@davemloft.net> wrote on 08/09/2007 09:57:27 AM:
>
> > Patrick had suggested calling dev_hard_start_xmit() instead of
> > conditionally calling the new API, and removing the new API
> > entirely. The driver determines whether batching is required or
> > not depending on whether skb is NULL. Would that approach be fine
> > with this "single interface" goal?
>
> It is a valid possibility.
>
> Note that this is similar to how we handle TSO, the driver
> sets the feature bit and in its ->hard_start_xmit() it checks
> the SKB for the given offload property.

Great, I will try to get rid of the two paths entirely, and see how to
re-arrange the code cleanly.

thanks,

- KK



* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-08 10:49 ` [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB David Miller
  2007-08-08 11:09   ` Krishna Kumar2
  2007-08-08 13:42   ` [ofa-general] " Herbert Xu
@ 2007-08-14  9:02   ` Krishna Kumar2
  2 siblings, 0 replies; 52+ messages in thread
From: Krishna Kumar2 @ 2007-08-14  9:02 UTC (permalink / raw)
  To: David Miller
  Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
	peter.p.waskiewicz.jr, hadi, mcarlson, netdev, general, mchan,
	tgraf, sri, shemminger, kaber, herbert

Hi Dave,

David Miller <davem@davemloft.net> wrote on 08/08/2007 04:19:00 PM:

> From: Krishna Kumar <krkumar2@in.ibm.com>
> Date: Wed, 08 Aug 2007 15:01:14 +0530
>
> > RESULTS: The performance improvement for TCP No Delay is in the range of -8%
> >    to 320% (with -8% being the sole negative), with many individual tests
> >    giving 50% or more improvement (I think it is to do with the hw slots
> >    getting full quicker resulting in more batching when the queue gets
> >    woken). The results for TCP is in the range of -11% to 93%, with most
> >    of the tests (8/12) giving improvements.
>
> Not because I think it obviates your work, but rather because I'm
> curious, could you test a TSO-in-hardware driver converted to
> batching and see how TSO alone compares to batching for a pure
> TCP workload?
>
> I personally don't think it will help for that case at all as
> TSO likely does better job of coalescing the work _and_ reducing
> bus traffic as well as work in the TCP stack.

I used E1000 (I guess the choice is OK, as e1000_tso returns TRUE; my
hw is 82547GI).

You are right, it doesn't help the TSO case at all (in fact it degrades).
Two things to note though:
      - E1000 may not be suitable for adding batching (which is no
        longer a new API, as I have changed it already).
      - Small skbs, where TSO doesn't come into the picture, still seem to
        improve. A couple of cases for large skbs did result in some
        improvement (like 4K, TCP No Delay, 32 procs).

[Total segment retransmissions for the original code test run: 2220, and
for the new code test run: 1620. So the retransmission problem that I was
getting seems to be an IPoIB bug, though I did have to fix one bug
in my networking component, where I was calling qdisc_run(NULL) for
the regular xmit path, and change it to always use batching. The problem is
that skb1 - skb10 may be present in the queue after each of them
failed to be sent out; then net_tx_action fires, which batches all of
these into the blist and tries to send them out again, which also
fails (e.g. tx lock fail or queue full); then the next single-skb xmit
will send the latest skb, ignoring the 10 skbs that are already waiting
in the batching list. These 10 skbs are sent out only the next time
net_tx_action is called, so out-of-order skbs result. This fix reduced
retransmissions from 180,000 to 55,000 or so. When I changed the IPoIB
driver to use iterative sends of each skb instead of creating multiple
Work Requests, that number went down to 15].
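
In code terms, the ordering fix amounts to something like this sketch
(not the exact change) - never let a fresh skb jump ahead of leftovers:

	if (dev->skb_blist && !skb_queue_empty(dev->skb_blist)) {
		/* leftovers exist: queue behind them, send via batch path */
		__skb_queue_tail(dev->skb_blist, skb);
		ret = dev->hard_start_xmit_batch(dev);
	} else {
		ret = dev_hard_start_xmit(skb, dev);
	}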

I ran 3 iterations of 45-sec tests (1 hour 16 min total, but I will
run a longer one tonight). The results (in KB/s and %) are:

Test Case                      Org BW       New BW      % Change

                              TCP
                            --------
Size:32 Procs:1                 1848        3918        112.01
Size:32 Procs:8                 21888       21555       -1.52
Size:32 Procs:32                19317       22433       16.13

Size:256 Procs:1                15584       25991       66.78
Size:256 Procs:8                110937      74565       -32.78
Size:256 Procs:32               105767      98967       -6.42

Size:4096 Procs:1               81910       96073       17.29
Size:4096 Procs:8               113302      94040       -17.00
Size:4096 Procs:32              109664      105522      -3.77

                            TCP No Delay:
                            --------------
Size:32 Procs:1                 2688        3177        18.19
Size:32 Procs:8                 6568        10588       61.20
Size:32 Procs:32                6573        7838        19.24

Size:256 Procs:1                7869        12724       61.69
Size:256 Procs:8                65652       45652       -30.46
Size:256 Procs:32               95114       112279      18.04

Size:4096 Procs:1               95302       84664       -11.16
Size:4096 Procs:8               111119      89111       -19.80
Size:4096 Procs:32              109249      113919      4.27

I will submit Rev4 with the suggested changes (including the single merged
API) on Thursday after some more testing.

Thanks,

- KK


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-21  7:18 ` David Miller
@ 2007-08-21 12:30   ` jamal
  2007-08-21 18:51     ` David Miller
  0 siblings, 1 reply; 52+ messages in thread
From: jamal @ 2007-08-21 12:30 UTC (permalink / raw)
  To: David Miller
  Cc: jagana, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
	peter.p.waskiewicz.jr, mcarlson, netdev, sri, general, mchan,
	tgraf, johnpol, shemminger, kaber, herbert

On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:

> Using 16K buffer size really isn't going to keep the pipe full enough
> for TSO. 

Why the comparison with TSO (or GSO for that matter)?
Seems to me that is only valid/fair if you have a single flow.
Batching is multi-flow focused (or I should say flow-unaware).

cheers,
jamal


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-21 12:30   ` [ofa-general] " jamal
@ 2007-08-21 18:51     ` David Miller
  2007-08-21 21:09       ` jamal
  2007-08-22  4:11       ` Krishna Kumar2
  0 siblings, 2 replies; 52+ messages in thread
From: David Miller @ 2007-08-21 18:51 UTC (permalink / raw)
  To: hadi
  Cc: jagana, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
	peter.p.waskiewicz.jr, mcarlson, netdev, sri, general, mchan,
	tgraf, johnpol, shemminger, kaber, herbert

From: jamal <hadi@cyberus.ca>
Date: Tue, 21 Aug 2007 08:30:22 -0400

> On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:
> 
> > Using 16K buffer size really isn't going to keep the pipe full enough
> > for TSO. 
> 
> Why the comparison with TSO (or GSO for that matter)?

Because TSO does batching already, so it's a very good
"tit for tat" comparison of the new batching scheme
vs. an existing one.


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-21 18:51     ` David Miller
@ 2007-08-21 21:09       ` jamal
  2007-08-22  4:11       ` Krishna Kumar2
  1 sibling, 0 replies; 52+ messages in thread
From: jamal @ 2007-08-21 21:09 UTC (permalink / raw)
  To: David Miller
  Cc: jagana, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
	peter.p.waskiewicz.jr, mcarlson, netdev, sri, general, mchan,
	tgraf, johnpol, shemminger, kaber, herbert

On Tue, 2007-21-08 at 11:51 -0700, David Miller wrote:

> Because TSO does batching already, so it's a very good
> "tit for tat" comparison of the new batching scheme
> vs. an existing one.

Fair enough - I may have read too much into your email then;->

For bulk types of apps (where TSO will make a difference) this is a fair
test. Hence I agree the 16KB buffer size is not sensible if the goal is
to simulate such an app.
However (and this is where I read too much into what you were saying),
the test by itself is an insufficient comparison. You've got to look at
the other side of the coin, i.e. at apps where TSO won't buy much.
Examples: a busy ssh or irc server; and you could go as far as looking at
the most predominant app in the wild west, http (the average page size from
a few years back was in the range of 10-20K and can be simulated with
good ole netperf/iperf).

cheers,
jamal


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-21 18:51     ` David Miller
  2007-08-21 21:09       ` jamal
@ 2007-08-22  4:11       ` Krishna Kumar2
  2007-08-22  4:22         ` David Miller
  1 sibling, 1 reply; 52+ messages in thread
From: Krishna Kumar2 @ 2007-08-22  4:11 UTC (permalink / raw)
  To: David Miller
  Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
	peter.p.waskiewicz.jr, hadi, mcarlson, netdev, general, mchan,
	tgraf, sri, shemminger, kaber, herbert

David Miller <davem@davemloft.net> wrote on 08/22/2007 12:21:43 AM:

> From: jamal <hadi@cyberus.ca>
> Date: Tue, 21 Aug 2007 08:30:22 -0400
>
> > On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:
> >
> > > Using 16K buffer size really isn't going to keep the pipe full enough
> > > for TSO.
> >
> > Why the comparison with TSO (or GSO for that matter)?
>
> Because TSO does batching already, so it's a very good
> "tit for tat" comparison of the new batching scheme
> vs. an existing one.

I am planning to do more testing on your suggestion over the
weekend, but I had a comment. Are you saying that TSO and
batching should be mutually exclusive, so that only hardware that
doesn't support TSO (like IB) would benefit?

But even if they can co-exist, aren't cases like sending
multiple small skbs better handled with batching?

Thanks,

- KK


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-22  4:11       ` Krishna Kumar2
@ 2007-08-22  4:22         ` David Miller
  2007-08-22 17:09           ` Rick Jones
  0 siblings, 1 reply; 52+ messages in thread
From: David Miller @ 2007-08-22  4:22 UTC (permalink / raw)
  To: krkumar2
  Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
	peter.p.waskiewicz.jr, hadi, mcarlson, netdev, general, mchan,
	tgraf, sri, shemminger, kaber, herbert

From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Wed, 22 Aug 2007 09:41:52 +0530

> David Miller <davem@davemloft.net> wrote on 08/22/2007 12:21:43 AM:
> 
> > From: jamal <hadi@cyberus.ca>
> > Date: Tue, 21 Aug 2007 08:30:22 -0400
> >
> > > On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:
> > >
> > > > Using 16K buffer size really isn't going to keep the pipe full enough
> > > > for TSO.
> > >
> > > Why the comparison with TSO (or GSO for that matter)?
> >
> > Because TSO does batching already, so it's a very good
> > "tit for tat" comparison of the new batching scheme
> > vs. an existing one.
> 
> I am planning to do more testing on your suggestion over the
> weekend, but I had a comment. Are you saying that TSO and
> batching should be mutually exclusive, so that only hardware that
> doesn't support TSO (like IB) would benefit?
> 
> But even if they can co-exist, aren't cases like sending
> multiple small skbs better handled with batching?

I'm not making any suggestions, so don't read that into anything I've
said :-)

I think the jury is still out, but seeing TSO perform even slightly
worse with the batching changes in place would be very worrisome.
This applies to both throughput and CPU utilization.


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-22  4:22         ` David Miller
@ 2007-08-22 17:09           ` Rick Jones
  2007-08-22 20:21             ` David Miller
  0 siblings, 1 reply; 52+ messages in thread
From: Rick Jones @ 2007-08-22 17:09 UTC (permalink / raw)
  To: David Miller
  Cc: jagana, herbert, gaagaan, Robert.Olsson, kumarkr, rdreier,
	peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general, mchan,
	tgraf, netdev, johnpol, shemminger, kaber, sri

David Miller wrote:
> I think the jury is still out, but seeing TSO perform even slightly
> worse with the batching changes in place would be very worrisome.
> This applies to both throughput and CPU utilization.

Should it be any more or less worrisome than small-packet performance (e.g. the
TCP_RR stuff I posted recently) being rather worse with TSO enabled than with it
disabled?

rick jones


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-22 20:21             ` David Miller
@ 2007-08-23 22:04               ` jamal
  2007-08-23 22:25                 ` jamal
  2007-08-23 22:30                 ` David Miller
  0 siblings, 2 replies; 52+ messages in thread
From: jamal @ 2007-08-23 22:04 UTC (permalink / raw)
  To: David Miller
  Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
	kumarkr, rdreier, mcarlson, jeff, general, mchan, tgraf, netdev,
	johnpol, shemminger, kaber, sri

On Wed, 2007-22-08 at 13:21 -0700, David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Wed, 22 Aug 2007 10:09:37 -0700
> 
> > Should it be any more or less worrisome than small-packet
> > performance (e.g. the TCP_RR stuff I posted recently) being rather
> > worse with TSO enabled than with it disabled?
> 
> That, like any such thing shown by the batching changes, is a bug
> to fix.

Possibly a bug - but you really should turn off TSO if you are doing
huge interactive transactions (which is fair because there is a clear
demarcation).
The litmus test is the same as for any change that is supposed to improve
net performance - it has to demonstrate that it is not intrusive and that
it improves performance (consistently). The standard metrics are
{throughput, cpu-utilization, latency}, i.e. as long as one improves and
the others see zero change, it would make sense. Yes, I am religious
about batching after all the invested sweat (and I continue to work on it,
hoping to demystify it) - the theory makes a lot of sense.
 
cheers,
jamal


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-23 22:25                 ` jamal
@ 2007-08-23 22:35                   ` Rick Jones
  2007-08-23 22:41                     ` jamal
  2007-08-24  3:18                     ` Bill Fink
  0 siblings, 2 replies; 52+ messages in thread
From: Rick Jones @ 2007-08-23 22:35 UTC (permalink / raw)
  To: hadi
  Cc: jagana, herbert, gaagaan, Robert.Olsson, netdev, rdreier,
	peter.p.waskiewicz.jr, mcarlson, kaber, jeff, general, mchan,
	tgraf, johnpol, shemminger, David Miller, sri

jamal wrote:
> [TSO already passed - iirc, it has been
> demonstrated to really not add much to throughput (can't improve much
> when already close to wire speed) but to improve CPU utilization].

In the one-gig space, sure, but in the 10-Gig space, TSO on/off does make a
difference for throughput.

rick jones


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-23 22:30                 ` David Miller
@ 2007-08-23 22:38                   ` jamal
  2007-08-24  3:34                     ` Stephen Hemminger
  0 siblings, 1 reply; 52+ messages in thread
From: jamal @ 2007-08-23 22:38 UTC (permalink / raw)
  To: David Miller
  Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
	netdev, rdreier, mcarlson, jeff, general, mchan, tgraf, johnpol,
	shemminger, kaber, sri

On Thu, 2007-23-08 at 15:30 -0700, David Miller wrote:
> From: jamal <hadi@cyberus.ca>
> Date: Thu, 23 Aug 2007 18:04:10 -0400
> 
> > Possibly a bug - but you really should turn off TSO if you are doing
> > huge interactive transactions (which is fair because there is a clear
> > demarcation).
> 
> I don't see how this can matter.
> 
> TSO only ever does anything if you accumulate more than one MSS
> worth of data.

I stand corrected then.

cheers,
jamal


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-23 22:35                   ` [ofa-general] " Rick Jones
@ 2007-08-23 22:41                     ` jamal
  2007-08-24  3:18                     ` Bill Fink
  1 sibling, 0 replies; 52+ messages in thread
From: jamal @ 2007-08-23 22:41 UTC (permalink / raw)
  To: Rick Jones
  Cc: jagana, herbert, gaagaan, Robert.Olsson, netdev, rdreier,
	peter.p.waskiewicz.jr, mcarlson, kaber, jeff, general, mchan,
	tgraf, johnpol, shemminger, David Miller, sri

On Thu, 2007-23-08 at 15:35 -0700, Rick Jones wrote:
> jamal wrote:
> > [TSO already passed - iirc, it has been
> > demonstrated to really not add much to throughput (can't improve much
> > when already close to wire speed) but to improve CPU utilization].
> 
> In the one-gig space, sure, but in the 10-Gig space, TSO on/off does make a
> difference for throughput.

I am still so 1GigE ;-> I stand corrected again ;->

cheers,
jamal


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-23 22:35                   ` [ofa-general] " Rick Jones
  2007-08-23 22:41                     ` jamal
@ 2007-08-24  3:18                     ` Bill Fink
  2007-08-24 12:14                       ` jamal
                                         ` (2 more replies)
  1 sibling, 3 replies; 52+ messages in thread
From: Bill Fink @ 2007-08-24  3:18 UTC (permalink / raw)
  To: Rick Jones
  Cc: jagana, herbert, gaagaan, Robert.Olsson, mcarlson, rdreier,
	peter.p.waskiewicz.jr, hadi, kaber, jeff, general, mchan, tgraf,
	netdev, johnpol, shemminger, David Miller, sri

On Thu, 23 Aug 2007, Rick Jones wrote:

> jamal wrote:
> > [TSO already passed - iirc, it has been
> > demonstrated to really not add much to throughput (can't improve much
> > when already close to wire speed) but to improve CPU utilization].
> 
> In the one-gig space, sure, but in the 10-Gig space, TSO on/off does make a
> difference for throughput.

Not too much.

TSO enabled:

[root@lang2 ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11813.4375 MB /  10.00 sec = 9906.1644 Mbps 99 %TX 80 %RX

TSO disabled:

[root@lang2 ~]# ethtool -K eth2 tso off
[root@lang2 ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.2500 MB /  10.00 sec = 9910.0176 Mbps 100 %TX 78 %RX

Pretty negligible difference it seems.

This is with a 2.6.20.7 kernel, Myricom 10-GigE NICs, and 9000 byte
jumbo frames, in a LAN environment.

For grins, I also did a couple of tests with an MSS of 1460 to
emulate a standard 1500 byte Ethernet MTU.
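
For readers less familiar with nuttcp: as I understand its options,
-w10m requests a 10 MB TCP window and -M1460 caps the MSS at 1460
bytes, so the two kinds of runs here amount to:

  nuttcp -w10m 192.168.88.16          # MSS follows the 9000 byte MTU
  nuttcp -M1460 -w10m 192.168.88.16   # MSS clamped to emulate a 1500 byte MTU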

TSO enabled:

[root@lang2 ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5102.8503 MB /  10.06 sec = 4253.9124 Mbps 39 %TX 99 %RX

TSO disabled:

[root@lang2 ~]# ethtool -K eth2 tso off
[root@lang2 ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5399.5625 MB /  10.00 sec = 4527.9070 Mbps 99 %TX 76 %RX

Here you can see there is a major difference in the TX CPU utilization
(99 % with TSO disabled versus only 39 % with TSO enabled), although
the TSO disabled case was able to squeeze out a little extra performance
from its extra CPU utilization.  Interestingly, with TSO enabled, the
receiver actually consumed more CPU than with TSO disabled, so I guess
the receiver CPU saturation in that case (99 %) was what restricted
its performance somewhat (this was consistent across a few test runs).

						-Bill


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-23 22:38                   ` [ofa-general] " jamal
@ 2007-08-24  3:34                     ` Stephen Hemminger
  0 siblings, 0 replies; 52+ messages in thread
From: Stephen Hemminger @ 2007-08-24  3:34 UTC (permalink / raw)
  To: hadi
  Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
	netdev, rdreier, mcarlson, kaber, jeff, general, mchan, tgraf,
	johnpol, David Miller, sri

On Thu, 23 Aug 2007 18:38:22 -0400
jamal <hadi@cyberus.ca> wrote:

> On Thu, 2007-23-08 at 15:30 -0700, David Miller wrote:
> > From: jamal <hadi@cyberus.ca>
> > Date: Thu, 23 Aug 2007 18:04:10 -0400
> > 
> > > Possibly a bug - but you really should turn off TSO if you are doing
> > > huge interactive transactions (which is fair because there is a clear
> > > demarcation).
> > 
> > I don't see how this can matter.
> > 
> > TSO only ever does anything if you accumulate more than one MSS
> > worth of data.
> 
> I stand corrected then.
> 
> cheers,
> jamal
> 

For most normal Internet TCP connections, you will see only 2 or 3 packets
per TSO burst because of ACK clocking. If you turn off delayed ACK on the
receiver, it will be even fewer.

A current hot topic of research is reducing the number of ACKs to make TCP
work better over asymmetric links like 3G.

-- 
Stephen Hemminger <shemminger@linux-foundation.org>


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-24  3:18                     ` Bill Fink
@ 2007-08-24 12:14                       ` jamal
  2007-08-24 18:08                         ` Bill Fink
  2007-08-24 21:25                         ` David Miller
  2007-08-24 18:46                       ` Rick Jones
  2007-08-25  0:42                       ` John Heffner
  2 siblings, 2 replies; 52+ messages in thread
From: jamal @ 2007-08-24 12:14 UTC (permalink / raw)
  To: Bill Fink
  Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
	netdev, rdreier, mcarlson, kaber, jeff, general, mchan, tgraf,
	johnpol, shemminger, David Miller, sri

On Thu, 2007-23-08 at 23:18 -0400, Bill Fink wrote:

[..]
> Here you can see there is a major difference in the TX CPU utilization
> (99 % with TSO disabled versus only 39 % with TSO enabled), although
> the TSO disabled case was able to squeeze out a little extra performance
> from its extra CPU utilization.  

Good stuff. What kind of machine? SMP?
Seems the receive side of the sender is also consuming a lot more cpu -
i suspect because the receiver is generating a lot more ACKs with TSO.
Does the choice of the tcp congestion control algorithm affect results?
It would be interesting to see both MTUs with TCP BIC vs good old
reno on the sender (probably without changing what the receiver does). BIC
seems to be the default lately.

> Interestingly, with TSO enabled, the
> receiver actually consumed more CPU than with TSO disabled, 

I suspect the fact that a lot more packets make it into the
receiver with TSO contributes.

> so I guess
> the receiver CPU saturation in that case (99 %) was what restricted
> its performance somewhat (this was consistent across a few test runs).

Unfortunately the receiver plays a big role in such tests - if it is
bottlenecked then you are not really testing the limits of the
transmitter. 

cheers,
jamal


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-24 12:14                       ` jamal
@ 2007-08-24 18:08                         ` Bill Fink
  2007-08-24 21:25                         ` David Miller
  1 sibling, 0 replies; 52+ messages in thread
From: Bill Fink @ 2007-08-24 18:08 UTC (permalink / raw)
  To: hadi
  Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
	netdev, rdreier, mcarlson, kaber, jeff, general, mchan, tgraf,
	johnpol, shemminger, David Miller, sri

On Fri, 24 Aug 2007, jamal wrote:

> On Thu, 2007-23-08 at 23:18 -0400, Bill Fink wrote:
> 
> [..]
> > Here you can see there is a major difference in the TX CPU utilization
> > (99 % with TSO disabled versus only 39 % with TSO enabled), although
> > the TSO disabled case was able to squeeze out a little extra performance
> > from its extra CPU utilization.  
> 
> Good stuff. What kind of machine? SMP?

Tyan Thunder K8WE S2895ANRF motherboard with Nvidia nForce
Professional 2200+2050 chipset, 2 AMD Opteron 254 2.8 GHz CPUs,
4 GB PC3200 ECC REG-DDR 400 memory, and 2 PCI-Express x16 slots
(2 buses).

It is SMP but both the NIC interrupts and nuttcp are bound to
CPU 0.  And all other non-kernel system processes are bound to
CPU 1.
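
A sketch of that kind of pinning, with a hypothetical IRQ number
(the real one comes from /proc/interrupts):

  # route the NIC interrupt to CPU 0 (smp_affinity takes a hex cpu mask)
  echo 1 > /proc/irq/90/smp_affinity
  # run the benchmark on CPU 0 as well
  taskset -c 0 nuttcp -w10m 192.168.88.16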

> Seems the receive side of the sender is also consuming a lot more cpu -
> i suspect because the receiver is generating a lot more ACKs with TSO.

Odd.  I just reran the TCP CUBIC "-M1460" tests, and with TSO enabled
on the transmitter, there were about 153709 eth2 interrupts on the
receiver, while with TSO disabled there was actually a somewhat higher
number (164988) of receiver side eth2 interrupts, although the receive
side CPU utilization was actually lower in that case.

On the transmit side (different test run), the TSO enabled case had
about 161773 eth2 interrupts whereas the TSO disabled case had about
165179 eth2 interrupts.

> Does the choice of the tcp congestion control algorithm affect results?
> It would be interesting to see both MTUs with TCP BIC vs good old
> reno on the sender (probably without changing what the receiver does). BIC
> seems to be the default lately.

These tests were with the default TCP CUBIC (with initial_ssthresh
set to 0).
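
For anyone reproducing these runs, the algorithm switch itself is just
a sysctl; the module parameter path is my best recollection of what
2.6 kernels expose, so treat this as a sketch:

  # select the congestion control algorithm (bic, cubic, reno, ...)
  sysctl -w net.ipv4.tcp_congestion_control=bic
  # cubic's initial_ssthresh knob (path may vary by kernel version)
  echo 0 > /sys/module/tcp_cubic/parameters/initial_ssthresh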

With TCP BIC (and initial_ssthresh set to 0):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11751.3750 MB /  10.00 sec = 9853.9839 Mbps 100 %TX 83 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4999.3321 MB /  10.06 sec = 4167.7872 Mbps 38 %TX 100 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.1875 MB /  10.00 sec = 9910.0682 Mbps 99 %TX 81 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5502.6250 MB /  10.00 sec = 4614.3297 Mbps 100 %TX 84 %RX

And with TCP Reno:

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11782.6250 MB /  10.00 sec = 9880.2613 Mbps 100 %TX 77 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5024.6649 MB /  10.06 sec = 4191.6574 Mbps 38 %TX 99 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.2500 MB /  10.00 sec = 9910.0860 Mbps 99 %TX 77 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5284.0000 MB /  10.00 sec = 4430.9604 Mbps 99 %TX 79 %RX

Very similar results to the original TCP CUBIC tests.

> > Interestingly, with TSO enabled, the
> > receiver actually consumed more CPU than with TSO disabled, 
> 
> I suspect the fact that a lot more packets make it into the
> receiver with TSO contributes.
> 
> > so I guess
> > the receiver CPU saturation in that case (99 %) was what restricted
> > its performance somewhat (this was consistent across a few test runs).
> 
> Unfortunately the receiver plays a big role in such tests - if it is
> bottlenecked then you are not really testing the limits of the
> transmitter. 

It might be interesting to see what effect the LRO changes would have
on this.  Once they are in a stable released kernel, I might try that
out, or maybe even before if I get some spare time (but that's in very
short supply right now).

						-Thanks

						-Bill


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-24  3:18                     ` Bill Fink
  2007-08-24 12:14                       ` jamal
@ 2007-08-24 18:46                       ` Rick Jones
  2007-08-25  0:42                       ` John Heffner
  2 siblings, 0 replies; 52+ messages in thread
From: Rick Jones @ 2007-08-24 18:46 UTC (permalink / raw)
  To: Bill Fink
  Cc: jagana, herbert, gaagaan, Robert.Olsson, mcarlson, rdreier,
	peter.p.waskiewicz.jr, hadi, kaber, jeff, general, mchan, tgraf,
	netdev, johnpol, shemminger, David Miller, sri

Bill Fink wrote:
> On Thu, 23 Aug 2007, Rick Jones wrote:
> 
> 
>>jamal wrote:
>>
>>>[TSO already passed - iirc, it has been
>>>demonstrated to really not add much to throughput (can't improve much
>>>when already close to wire speed) but to improve CPU utilization].
>>
>>In the one-gig space, sure, but in the 10-Gig space, TSO on/off does make a
>>difference for throughput.
> 
> 
> Not too much.
> 
> TSO enabled:
> 
> [root@lang2 ~]# ethtool -k eth2
> Offload parameters for eth2:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp segmentation offload: on
> 
> [root@lang2 ~]# nuttcp -w10m 192.168.88.16
> 11813.4375 MB /  10.00 sec = 9906.1644 Mbps 99 %TX 80 %RX
> 
> TSO disabled:
> 
> [root@lang2 ~]# ethtool -K eth2 tso off
> [root@lang2 ~]# ethtool -k eth2
> Offload parameters for eth2:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp segmentation offload: off
> 
> [root@lang2 ~]# nuttcp -w10m 192.168.88.16
> 11818.2500 MB /  10.00 sec = 9910.0176 Mbps 100 %TX 78 %RX
> 
> Pretty negligible difference it seems.

Leaves one wondering how often more than one segment was sent to the card in the 
9000 byte case :)

rick jones


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-24 12:14                       ` jamal
  2007-08-24 18:08                         ` Bill Fink
@ 2007-08-24 21:25                         ` David Miller
  1 sibling, 0 replies; 52+ messages in thread
From: David Miller @ 2007-08-24 21:25 UTC (permalink / raw)
  To: hadi
  Cc: jagana, billfink, peter.p.waskiewicz.jr, herbert, gaagaan,
	Robert.Olsson, netdev, rdreier, mcarlson, jeff, general, mchan,
	tgraf, johnpol, shemminger, kaber, sri

From: jamal <hadi@cyberus.ca>
Date: Fri, 24 Aug 2007 08:14:16 -0400

> Seems the receive side of the sender is also consuming a lot more cpu -
> i suspect because the receiver is generating a lot more ACKs with TSO.

I've seen this behavior before on a receiver with little cpu power, and
the issue is that batching too much actually hurts the receiver.

If the data packets were better spaced out, the receiver would handle
the load better.

This is the thing the TOE guys keep talking about overcoming with
their packet pacing algorithms in their on-card TOE stack.

My hunch is that even if in the non-TSO case the TX packets were all
back to back in the card's TX ring, TSO still spits them out faster on
the wire.


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-25  0:42                       ` John Heffner
@ 2007-08-26  8:41                         ` Bill Fink
  2007-08-27  1:32                           ` John Heffner
  0 siblings, 1 reply; 52+ messages in thread
From: Bill Fink @ 2007-08-26  8:41 UTC (permalink / raw)
  To: John Heffner
  Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
	mcarlson, rdreier, hadi, kaber, jeff, general, mchan, tgraf,
	netdev, johnpol, shemminger, David Miller, sri

On Fri, 24 Aug 2007, John Heffner wrote:

> Bill Fink wrote:
> > Here you can see there is a major difference in the TX CPU utilization
> > (99 % with TSO disabled versus only 39 % with TSO enabled), although
> > the TSO disabled case was able to squeeze out a little extra performance
> > from its extra CPU utilization.  Interestingly, with TSO enabled, the
> > receiver actually consumed more CPU than with TSO disabled, so I guess
> > the receiver CPU saturation in that case (99 %) was what restricted
> > its performance somewhat (this was consistent across a few test runs).
> 
> One possibility: I think the receive-side processing tends to do
> better when receiving into an empty queue.  When the (non-TSO) sender is
> the flow's bottleneck, this is going to be the case.  But when you
> switch to TSO, the receiver becomes the bottleneck and you're always
> going to have to put the packets at the back of the receive queue.  This
> might help account for why you see both lower throughput and
> higher CPU utilization -- there's a point of instability right where the
> receiver becomes the bottleneck and you end up pushing it over to the
> bad side. :)
> 
> Just a theory.  I'm honestly surprised this effect would be so 
> significant.  What do the numbers from netstat -s look like in the two 
> cases?

Well, I was going to check this out, but I happened to reboot the
system and now I get somewhat different results.

Here are the new results, which should hopefully be more accurate
since they are on a freshly booted system.

TSO enabled and GSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11610.6875 MB /  10.00 sec = 9735.9526 Mbps 100 %TX 75 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5029.6875 MB /  10.06 sec = 4194.6931 Mbps 36 %TX 100 %RX

TSO disabled and GSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11817.9375 MB /  10.00 sec = 9909.7773 Mbps 99 %TX 77 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5823.3125 MB /  10.00 sec = 4883.2429 Mbps 100 %TX 82 %RX

The TSO disabled case got a little better performance even for
9000 byte jumbo frames.  For the "-M1460" case emulating a
standard 1500 byte Ethernet MTU, the performance was significantly
better and used less CPU on the receiver (82 % versus 100 %)
although it did use significantly more CPU on the transmitter
(100 % versus 36 %).

TSO disabled and GSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11609.5625 MB /  10.00 sec = 9734.9859 Mbps 99 %TX 75 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5001.4375 MB /  10.06 sec = 4170.6739 Mbps 52 %TX 100 %RX

The GSO enabled case is very similar to the TSO enabled case,
except that for the "-M1460" test the transmitter used more
CPU (52 % versus 36 %), which is to be expected since TSO has
hardware assist.
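
For reference, the offload toggling above is done with ethtool -K;
assuming an ethtool recent enough to know the gso keyword, the three
configurations correspond roughly to:

  ethtool -K eth2 tso on gso off    # TSO enabled, GSO disabled
  ethtool -K eth2 tso off gso off   # both disabled
  ethtool -K eth2 tso off gso on    # software GSO only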

Here's the beforeafter delta of the receiver's "netstat -s"
statistics for the TSO enabled case:

Ip:
    3659898 total packets received
    3659898 incoming packets delivered
    80050 requests sent out
Tcp:
    2 passive connection openings
    3659897 segments received
    80050 segments send out
TcpExt:
    33 packets directly queued to recvmsg prequeue.
    104956 packets directly received from backlog
    705528 packets directly received from prequeue
    3654842 packets header predicted
    193 packets header predicted and directly queued to user
    4 acknowledgments not containing data received
    6 predicted acknowledgments

And here it is for the TSO disabled case (GSO also disabled):

Ip:
    4107083 total packets received
    4107083 incoming packets delivered
    1401376 requests sent out
Tcp:
    2 passive connection openings
    4107083 segments received
    1401376 segments send out
TcpExt:
    2 TCP sockets finished time wait in fast timer
    48486 packets directly queued to recvmsg prequeue.
    1056111048 packets directly received from backlog
    2273357712 packets directly received from prequeue
    1819317 packets header predicted
    2287497 packets header predicted and directly queued to user
    4 acknowledgments not containing data received
    10 predicted acknowledgments

For the TSO disabled case, there are far more TCP segments
sent out (1401376 versus 80050), which I assume are ACKs, and which
could possibly contribute to the higher throughput for the TSO disabled
case due to faster feedback, but not explain the lower CPU utilization.
There are many more packets directly queued to recvmsg prequeue
(48486 versus 33).  The numbers for packets directly received from
backlog and prequeue in the TSO disabled case seem bogus to me, so
I don't know how to interpret them.  There are only about half as
many packets header predicted (1819317 versus 3654842), but there
are many more packets header predicted and directly queued to user
(2287497 versus 193).  I'll leave the analysis of all this to those
who might actually know what it all means.
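
Method note: the delta listings above come from subtracting two netstat
snapshots taken around a run, along these lines (beforeafter being the
small counter-subtraction utility referred to above; a plain diff of
the two files is a cruder substitute):

  netstat -s > stats.before
  nuttcp -M1460 -w10m 192.168.88.16
  netstat -s > stats.after
  beforeafter stats.before stats.after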

I also ran another set of tests that may be of interest.  I changed
the rx-usecs/tx-usecs interrupt coalescing parameter from the
recommended optimum value of 75 usecs to 0 (no coalescing), but
only on the transmitter.  The comparison discussions below are
relative to the previous tests where rx-usecs/tx-usecs were set
to 75 usecs.
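
For concreteness, such values are set through ethtool's standard -C
coalescing interface, along the lines of:

  ethtool -C eth2 rx-usecs 0 tx-usecs 0    # transmitter: no coalescing
  ethtool -C eth2 rx-usecs 75 tx-usecs 75  # recommended optimum

though the exact parameter names a driver honors can vary.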

TSO enabled and GSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11812.8125 MB /  10.00 sec = 9905.6640 Mbps 100 %TX 75 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 7701.8750 MB /  10.00 sec = 6458.5541 Mbps 100 %TX 56 %RX

For 9000 byte jumbo frames it now gets a little better performance
and almost matches the 10-GigE line rate performance of the TSO
disabled case.  For the "-M1460" test, it gets substantially better
performance (6458.5541 Mbps versus 4194.6931 Mbps) at the expense
of much higher transmitter CPU utilization (100 % versus 36 %),
although the receiver CPU utilization is much less (56 % versus 100 %).

TSO disabled and GSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11817.3125 MB /  10.00 sec = 9909.4058 Mbps 100 %TX 76 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4081.2500 MB /  10.00 sec = 3422.3994 Mbps 99 %TX 41 %RX

For 9000 byte jumbo frames the results are essentially the same.
For the "-M1460" test, the performance is significantly worse
(3422.3994 Mbps versus 4883.2429 Mbps) even though the transmitter
CPU utilization is saturated in both cases, but the receiver CPU
utilization is about half (41 % versus 82 %).

TSO disabled and GSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11813.3750 MB /  10.00 sec = 9906.1090 Mbps 99 %TX 77 %RX

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 3939.1875 MB /  10.00 sec = 3303.2814 Mbps 100 %TX 41 %RX

For 9000 byte jumbo frames the performance is a little better,
again approaching 10-GigE line rate.  But for the "-M1460" test,
the performance is significantly worse (3303.2814 Mbps versus
4170.6739 Mbps) even though the transmitter consumes much more
CPU (100 % versus 52 %).  In this case though the receiver has
a much lower CPU utilization (41 % versus 100 %).

						-Bill


* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
  2007-08-27 23:23                               ` jamal
@ 2007-09-14  7:20                                 ` Bill Fink
  0 siblings, 0 replies; 52+ messages in thread
From: Bill Fink @ 2007-09-14  7:20 UTC (permalink / raw)
  To: hadi
  Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
	netdev, rdreier, mcarlson, kaber, jeff, general, mchan, tgraf,
	johnpol, shemminger, David Miller, jheffner, sri

On Mon, 27 Aug 2007, jamal wrote:

> On Sun, 2007-26-08 at 19:04 -0700, David Miller wrote:
> 
> > The transfer is much better behaved if we ACK every two full sized
> > frames we copy into the receiver, and therefore don't stretch ACK, but
> > at the cost of cpu utilization.
> 
> The rx coalescing in theory should help by accumulating more ACKs on the
> rx side of the sender. But it doesn't seem to do that, i.e. for the 9K MTU,
> you are better off turning off the coalescing if you want higher
> numbers. Also some of the TOE vendors (chelsio?) claim to have fixed
> this by reducing bursts of outgoing packets.
>  
> Bill:
> who suggested (as per your email) the 75 usec value, and what was it
> based on measurement-wise?

Belatedly getting back to this thread.  There was a recent myri10ge
patch that changed the default value for tx/rx interrupt coalescing
to 75 usec, claiming it is the optimum value for maximum throughput
(this is also mentioned in their external README documentation).

I also did some empirical testing to determine the effect of different
values of TX/RX interrupt coalescing on 10-GigE network performance,
both with TSO enabled and with TSO disabled.  The actual test runs
are attached at the end of this message, but the results are summarized
in the following table (network performance in Mbps).

		        TX/RX interrupt coalescing in usec (both sides)
		   0	  15	  30	  45	  60	  75	  90	 105

TSO enabled	8909	9682	9716	9725	9739	9745	9688	9648
TSO disabled	9113	9910	9910	9910	9910	9910	9910	9910

TSO disabled performance is always better than equivalent TSO enabled
performance.  With TSO enabled, the optimum performance is indeed at
a TX/RX interrupt coalescing value of 75 usec.  With TSO disabled,
performance is the full 10-GigE line rate of 9910 Mbps for any value
of TX/RX interrupt coalescing from 15 usec to 105 usec.

> BTW, thanks for the finding the energy to run those tests and a very
> refreshing perspective. I dont mean to add more work, but i had some
> queries;
> On your earlier tests, i think that Reno showed some significant
> differences on the lower MTU case over BIC. I wonder if this is
> consistent? 

Here's a retest (5 tests each):

TSO enabled:

TCP Cubic (initial_ssthresh set to 0):

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5007.6295 MB /  10.06 sec = 4176.1807 Mbps 36 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4950.9279 MB /  10.06 sec = 4130.2528 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4917.1742 MB /  10.05 sec = 4102.5772 Mbps 35 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4948.7920 MB /  10.05 sec = 4128.7990 Mbps 36 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4937.5765 MB /  10.05 sec = 4120.6460 Mbps 35 %TX 99 %RX

TCP Bic (initial_ssthresh set to 0):

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5005.5335 MB /  10.06 sec = 4172.9571 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5001.0625 MB /  10.06 sec = 4169.2960 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4957.7500 MB /  10.06 sec = 4135.7355 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4957.3777 MB /  10.06 sec = 4135.6252 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5059.1815 MB /  10.05 sec = 4221.3546 Mbps 37 %TX 99 %RX

TCP Reno:

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4973.3532 MB /  10.06 sec = 4147.3589 Mbps 36 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4984.4375 MB /  10.06 sec = 4155.2131 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4995.6841 MB /  10.06 sec = 4166.2734 Mbps 36 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4982.2500 MB /  10.05 sec = 4156.7586 Mbps 36 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4989.9796 MB /  10.05 sec = 4163.0949 Mbps 36 %TX 99 %RX

TSO disabled:

TCP Cubic (initial_ssthresh set to 0):

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5075.8125 MB /  10.02 sec = 4247.3408 Mbps 99 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5056.0000 MB /  10.03 sec = 4229.9621 Mbps 100 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5047.4375 MB /  10.03 sec = 4223.1203 Mbps 99 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5066.1875 MB /  10.03 sec = 4239.1659 Mbps 100 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4986.3750 MB /  10.03 sec = 4171.9906 Mbps 99 %TX 100 %RX

TCP Bic (initial_ssthresh set to 0):

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5040.5625 MB /  10.03 sec = 4217.3521 Mbps 100 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5049.7500 MB /  10.03 sec = 4225.4585 Mbps 99 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5076.5000 MB /  10.03 sec = 4247.6632 Mbps 100 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5017.2500 MB /  10.03 sec = 4197.4990 Mbps 100 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5013.3125 MB /  10.03 sec = 4194.8851 Mbps 100 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5036.0625 MB /  10.03 sec = 4213.9195 Mbps 100 %TX 100 %RX

TCP Reno:

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5006.8750 MB /  10.02 sec = 4189.6051 Mbps 99 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5028.1250 MB /  10.02 sec = 4207.4553 Mbps 100 %TX 99 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5021.9375 MB /  10.02 sec = 4202.2668 Mbps 99 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5000.5625 MB /  10.03 sec = 4184.3109 Mbps 99 %TX 100 %RX
[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5025.1250 MB /  10.03 sec = 4204.7378 Mbps 99 %TX 100 %RX

Not too much variation here, and the results are not quite as high
as previously.  Some further testing reveals that while this
time I mainly get results like (here for TCP Bic with TSO
disabled):

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4958.0625 MB /  10.02 sec = 4148.9361 Mbps 100 %TX 99 %RX

I also sometimes get results like:

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5882.1875 MB /  10.00 sec = 4932.5549 Mbps 100 %TX 90 %RX

The higher-performing results seem to correspond to a somewhat lower
receiver CPU utilization.  I'm not sure, but there could also have
been an effect from running the "-M1460" test after the 9000 byte
jumbo frame test (no jumbo tests were done at all prior to running
the above sets of 5 tests, although I did always discard an initial
"warmup" test, and now that I think about it, some of those initial
discarded "warmup" tests did have somewhat anomalously high results).

> A side note: Although the experimentation reduces the variables (e.g.
> tying all to CPU0), it would be more exciting to see the effect of a
> multi-cpu, multi-flow sender (which IMO is more real-world).

These systems are intended as test systems for 10-GigE networks,
and as such it's important to get as consistently close to full
10-GigE line rate as possible, and that's why the interrupts and
nuttcp application are tied to CPU0, with almost all other system
applications tied to CPU1.

Another system is intended as a 10-GigE firewall system; it has
2 Myricom 10-GigE NICs, with the interrupts for eth2 tied to CPU0
and the interrupts for the other NIC tied to CPU1.  In IP forwarding
tests of this system, I have basically achieved full bidirectional
10-GigE line rate IP forwarding with 9000 byte jumbo frames.
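
The forwarding role itself needs nothing exotic beyond the routes and
the usual sysctl:

  sysctl -w net.ipv4.ip_forward=1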

chance4 -> chance6 -> chance9 4.85 Gbps rate limited TCP stream
chance5 -> chance6 -> chance9 4.85 Gbps rate limited TCP stream
chance7 <- chance6 <- chance8 10.0 Gbps non-rate limited TCP stream

[root@chance7 ~]# nuttcp -Ic4tc9 -Ri4.85g -w10m 192.168.88.8 192.168.89.16 & \
    nuttcp -Ic5tc9 -Ri4.85g -w10m -P5100 -p5101 192.168.88.9 192.168.89.16 & \
    nuttcp -Ic7rc8 -r -w10m 192.168.89.15
c4tc9:  5778.6875 MB /  10.01 sec = 4842.7158 Mbps 100 %TX 42 %RX
c5tc9:  5778.9375 MB /  10.01 sec = 4843.1595 Mbps 100 %TX 40 %RX
c7rc8: 11509.1875 MB /  10.00 sec = 9650.8009 Mbps 99 %TX 74 %RX
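
(Decoding those flags, as best I recall nuttcp's option set: -I labels
a stream in the output, -Ri4.85g rate-limits the transmit side to
4.85 Gbps, -P and -p select separate control and data ports so the
parallel streams don't collide, and -r runs the transfer in the
receive direction.)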

If there's some other specific test you'd like to see, and it's not
too difficult to set up and I have some spare time, I'll see what I
can do.

						-Bill



Testing of effect of RX/TX interrupt coalescing on 10-GigE network performance
(both with TSO enabled and with TSO disabled):
--------------------------------------------------------------------------------

No RX/TX interrupt coalescing (either side):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
10649.8750 MB /  10.03 sec = 8908.9806 Mbps 97 %TX 100 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
10879.5000 MB /  10.02 sec = 9112.5141 Mbps 99 %TX 99 %RX

RX/TX interrupt coalescing set to 15 usec (both sides):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11546.7500 MB /  10.00 sec = 9682.0785 Mbps 99 %TX 90 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.9375 MB /  10.00 sec = 9910.3702 Mbps 100 %TX 92 %RX

RX/TX interrupt coalescing set to 30 usec (both sides):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11587.1250 MB /  10.00 sec = 9715.9489 Mbps 99 %TX 81 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.8125 MB /  10.00 sec = 9910.3040 Mbps 100 %TX 81 %RX

RX/TX interrupt coalescing set to 45 usec (both sides):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11597.8750 MB /  10.00 sec = 9724.9902 Mbps 99 %TX 76 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.6250 MB /  10.00 sec = 9910.0933 Mbps 100 %TX 77 %RX

RX/TX interrupt coalescing set to 60 usec (both sides):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11614.7500 MB /  10.00 sec = 9739.1323 Mbps 100 %TX 74 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.4375 MB /  10.00 sec = 9909.9995 Mbps 100 %TX 76 %RX

RX/TX interrupt coalescing set to 75 usec (both sides):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11621.7500 MB /  10.00 sec = 9745.0993 Mbps 100 %TX 72 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.0625 MB /  10.00 sec = 9909.7881 Mbps 100 %TX 75 %RX

RX/TX interrupt coalescing set to 90 usec (both sides):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11553.1250 MB /  10.00 sec = 9687.6458 Mbps 100 %TX 71 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.4375 MB /  10.00 sec = 9910.0837 Mbps 100 %TX 73 %RX

RX/TX interrupt coalescing set to 105 usec (both sides):

TSO enabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11505.7500 MB /  10.00 sec = 9647.8558 Mbps 99 %TX 69 %RX

TSO disabled:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.4375 MB /  10.00 sec = 9910.0530 Mbps 100 %TX 74 %RX


Thread overview: 52+ messages
2007-08-08  9:31 [ofa-general] [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB Krishna Kumar
2007-08-08  9:31 ` [PATCH 1/9 Rev3] [Doc] HOWTO Documentation for batching Krishna Kumar
2007-08-08  9:31 ` [PATCH 2/9 Rev3] [core] Add skb_blist & hard_start_xmit_batch Krishna Kumar
2007-08-08 10:59   ` [ofa-general] " Stephen Hemminger
2007-08-08 11:24     ` Krishna Kumar2
2007-08-08 12:01   ` Evgeniy Polyakov
2007-08-09  3:09     ` Krishna Kumar2
2007-08-08  9:31 ` [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching Krishna Kumar
2007-08-08 12:14   ` [ofa-general] " Evgeniy Polyakov
2007-08-09  3:13     ` Krishna Kumar2
2007-08-08 14:05   ` Patrick McHardy
2007-08-08 15:26     ` [ofa-general] " jamal
2007-08-09  4:06     ` Krishna Kumar2
2007-08-08  9:31 ` [PATCH 4/9 Rev3] [ethtool] Add ethtool support Krishna Kumar
2007-08-08  9:32 ` [ofa-general] [PATCH 5/9 Rev3] [IPoIB] Header file changes Krishna Kumar
2007-08-08  9:32 ` [ofa-general] [PATCH 6/9 Rev3] [IPoIB] CM & Multicast changes Krishna Kumar
2007-08-08  9:32 ` [ofa-general] [PATCH 7/9 Rev3] [IPoIB] Verb changes Krishna Kumar
2007-08-08  9:32 ` [PATCH 8/9 Rev3] [IPoIB] Post and work completion handler changes Krishna Kumar
2007-08-08  9:32 ` [PATCH 9/9 Rev3] [IPoIB] Implement the new batching API Krishna Kumar
2007-08-08 10:49 ` [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB David Miller
2007-08-08 11:09   ` Krishna Kumar2
2007-08-08 22:01     ` [ofa-general] " David Miller
2007-08-09  4:19       ` Krishna Kumar2
2007-08-09  4:27         ` David Miller
2007-08-09  6:26           ` Krishna Kumar2
2007-08-08 13:42   ` [ofa-general] " Herbert Xu
2007-08-08 15:14     ` jamal
2007-08-08 20:55       ` Stephen Hemminger
2007-08-08 22:40         ` jamal
2007-08-08 22:22       ` David Miller
2007-08-08 22:53         ` jamal
2007-08-09  0:06     ` Shirley Ma
2007-08-09  3:19     ` [ofa-general] " Krishna Kumar2
2007-08-14  9:02   ` Krishna Kumar2
  -- strict thread matches above, loose matches on Subject: below --
2007-08-17  6:06 Krishna Kumar2
2007-08-21  7:18 ` David Miller
2007-08-21 12:30   ` [ofa-general] " jamal
2007-08-21 18:51     ` David Miller
2007-08-21 21:09       ` jamal
2007-08-22  4:11       ` Krishna Kumar2
2007-08-22  4:22         ` David Miller
2007-08-22 17:09           ` Rick Jones
2007-08-22 20:21             ` David Miller
2007-08-23 22:04               ` [ofa-general] " jamal
2007-08-23 22:25                 ` jamal
2007-08-23 22:35                   ` [ofa-general] " Rick Jones
2007-08-23 22:41                     ` jamal
2007-08-24  3:18                     ` Bill Fink
2007-08-24 12:14                       ` jamal
2007-08-24 18:08                         ` Bill Fink
2007-08-24 21:25                         ` David Miller
2007-08-24 18:46                       ` Rick Jones
2007-08-25  0:42                       ` John Heffner
2007-08-26  8:41                         ` [ofa-general] " Bill Fink
2007-08-27  1:32                           ` John Heffner
2007-08-27  2:04                             ` David Miller
2007-08-27 23:23                               ` jamal
2007-09-14  7:20                                 ` [ofa-general] " Bill Fink
2007-08-23 22:30                 ` David Miller
2007-08-23 22:38                   ` [ofa-general] " jamal
2007-08-24  3:34                     ` Stephen Hemminger
