* [ofa-general] [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
@ 2007-08-08 9:31 Krishna Kumar
2007-08-08 9:31 ` [PATCH 1/9 Rev3] [Doc] HOWTO Documentation for batching Krishna Kumar
` (9 more replies)
0 siblings, 10 replies; 34+ messages in thread
From: Krishna Kumar @ 2007-08-08 9:31 UTC (permalink / raw)
To: johnpol, sri, shemminger, davem, kaber
Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
kumarkr, rdreier, mcarlson, jeff, general, mchan, tgraf, hadi,
netdev
This set of patches implements the batching API, and adds support for this
API in IPoIB.
List of changes from original submission:
-----------------------------------------
1. [Patrick] Suggestion to remove tx_queue_len check for enabling batching.
2. [Patrick] Move queue purging to dev_deactivate to free references on
device going down.
3. [Patrick] Remove changelog & unrelated changes from sch_generic.c
4. [Patrick] Free skb_blist in unregister_netdev (also suggested to put in
free_netdev, but it is not required as unregister_netdev will not fail
at this location).
5. [Stephen/Patrick] Remove /sysfs support.
6. [Stephen] Add ethtool support.
7. [Evgeniy] Stop interrupts while changing tx_batch_skb value.
8. [Michael Tsirkin] Remove misleading comment in ipoib_send().
9. [KK] Remove NETIF_F_BATCH_SKBS (device supports batching if API present).
10. [KK] Remove xmit_slots from netdev.
11. [KK] [IPoIB]: Use unsigned instead of int for index's, handle race
between multiple WC's executing on different CPU's by having a new
lock (or might need to hold lock for entire duration of WC - some
optimization is possible here), changed multiple skb algo to not use
xmit_slots, simplify code, minor performance changes wrt slot
counters, etc.
List of changes implemented, tested and dropped:
------------------------------------------------
1. [Patrick] Suggestion to use skb_blist statically in netdevice. This
reduces performance (~ 1%) (possibly due to having an extra check for
dev->hard_start_xmit_batch API).
2. [Patrick] Suggestion to check if hard_start_xmit_batch can be removed:
This reduces performance as a call to a non inline function is made,
and an extra check in driver to see if skb is NULL.
3. [Sridhar] Suggestion to always use batching for regular xmit case too:
While testing, for some reason the tests virtually hangs and
transfers almost no data for higher number of proceses (like 64 and
above).
Patches are described as:
Mail 0/9: This mail
Mail 1/9: HOWTO documentation
Mail 2/9: Introduce skb_blist and hard_start_xmit_batch API
Mail 3/9: Modify qdisc_run() to support batching
Mail 4/9: Add ethtool support to enable/disable batching
Mail 5/9: IPoIB header file changes to use batching
Mail 6/9: IPoIB CM & Multicast changes
Mail 7/9: IPoIB verb changes to use batching
Mail 8/9: IPoIB internal post and work completion handler
Mail 9/9: Implement the new batching API
RESULTS: The performance improvement for TCP No Delay is in the range of -8%
to 320% (with -8% being the sole negative), with many individual tests
giving 50% or more improvement (I think it is to do with the hw slots
getting full quicker resulting in more batching when the queue gets
woken). The results for TCP is in the range of -11% to 93%, with most
of the tests (8/12) giving improvements.
ISSUES: I am getting a huge amount of retransmissions for both TCP and TCP No
Delay cases for IPoIB (which explains the slight degradation for some
test cases mentioned above). After a full test run, the regular code
resulted in 74 retransmissions, while there were 1365716 retrans with
batching code - or 18500 retransmissions for every 1 in regular code.
But with this huge amount of retransmissions there is 20.7% overall
improvement in BW (which implies batching will improve results even
more if this problem is fixed). I suspect this is some issue in the
driver/firmware since:
a. I see similar low retransmissions numbers for E1000 (so
no bug in core changes).
b. Even with batching set to maximum 2 skbs, I get almost the
same number of retransmissions (implies receiver is
probably not dropping skbs). ifconfig/netstat on receiver
gives no clue (drop/errors, etc).
This issue delayed submitting patches for the last 2 weeks, as I was
trying to debug this; any help from openIB community is appreciated.
Please review and provide feedback; and consider for inclusion.
Thanks,
- KK
---------------------------------------------------------------
Test Case ORG NEW % Change
---------------------------------------------------------------
TCP
---
Size:32 Procs:1 2709 4217 55.66
Size:128 Procs:1 10950 15853 44.77
Size:512 Procs:1 35313 68224 93.19
Size:4096 Procs:1 118144 119935 1.51
Size:32 Procs:8 18976 22432 18.21
Size:128 Procs:8 66351 86072 29.72
Size:512 Procs:8 246546 234373 -4.93
Size:4096 Procs:8 268861 251540 -6.44
Size:32 Procs:16 35009 45861 30.99
Size:128 Procs:16 150979 164961 9.26
Size:512 Procs:16 259443 230730 -11.06
Size:4096 Procs:16 265313 246794 -6.98
TCP No Delay
------------
Size:32 Procs:1 1930 1944 .72
Size:128 Procs:1 8573 7831 -8.65
Size:512 Procs:1 28536 29347 2.84
Size:4096 Procs:1 98916 104236 5.37
Size:32 Procs:8 4173 17560 320.80
Size:128 Procs:8 17350 66205 281.58
Size:512 Procs:8 69777 211467 203.06
Size:4096 Procs:8 201096 242578 20.62
Size:32 Procs:16 20570 37778 83.65
Size:128 Procs:16 95005 154464 62.58
Size:512 Procs:16 111677 221570 98.40
Size:4096 Procs:16 204765 240368 17.38
---------------------------------------------------------------
Overall: 2340962 2826340 20.73%
[Summary: 19 Better cases, 5 worse]
Testing environment (on client, server uses 4096 sendq size):
echo "Using 512 size sendq"
modprobe ib_ipoib send_queue_size=512 recv_queue_size=512
echo "4096 524288 4194304" > /proc/sys/net/ipv4/tcp_wmem
echo "4096 1048576 4194304" > /proc/sys/net/ipv4/tcp_rmem
echo 4194304 > /proc/sys/net/core/rmem_max
echo 4194304 > /proc/sys/net/core/wmem_max
echo 120000 > /proc/sys/net/core/netdev_max_backlog
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH 1/9 Rev3] [Doc] HOWTO Documentation for batching
2007-08-08 9:31 [ofa-general] [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB Krishna Kumar
@ 2007-08-08 9:31 ` Krishna Kumar
2007-08-08 9:31 ` [PATCH 2/9 Rev3] [core] Add skb_blist & hard_start_xmit_batch Krishna Kumar
` (8 subsequent siblings)
9 siblings, 0 replies; 34+ messages in thread
From: Krishna Kumar @ 2007-08-08 9:31 UTC (permalink / raw)
To: johnpol, kaber, shemminger, davem, sri
Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
kumarkr, rdreier, rick.jones2, mcarlson, jeff, general, mchan,
tgraf, hadi, netdev, Krishna Kumar, xma
Add Documentation describing batching API.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
Batching_skb_API.txt | 82 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 82 insertions(+)
diff -ruNp ORG/Documentation/networking/Batching_skb_API.txt NEW/Documentation/networking/Batching_skb_API.txt
--- ORG/Documentation/networking/Batching_skb_API.txt 1970-01-01 05:30:00.000000000 +0530
+++ NEW/Documentation/networking/Batching_skb_API.txt 2007-08-07 22:41:55.000000000 +0530
@@ -0,0 +1,82 @@
+ HOWTO for batching skb API support
+ -----------------------------------
+
+Section 1: What is batching skb API ?
+Section 2: How batching API works vs the original API ?
+Section 3: How drivers can support this API ?
+Section 4: How users can work with this API ?
+
+
+Introduction: Kernel support for batching skb
+----------------------------------------------
+
+A new xmit API - hard_start_xmit_batch() is provided in the netdevice layer
+similar to the existing hard_start_xmit() API. Drivers which export this
+API can implement it similar to the hard_start_xmit handler. The new API
+should process multiple skbs (or even one) in a single call while the
+existing hard_start_xmit processes one skb. It is possible for the driver
+writer to re-use most of the code from the existing API in the new API
+without having code duplication.
+
+
+Section 1: What is batching skb API ?
+-------------------------------------
+
+ This API is optionally exported by a driver. The pre-requisite for a
+ driver to use this API is that it should have a reasonably sized
+ hardware queue that can process multiple skbs.
+
+
+Section 2: How batching API works vs the original API ?
+-------------------------------------------------------
+
+ The networking stack gets called from upper layer protocols with a
+ single skb to transmit. This skb is first enqueue'd and an attempt is
+ made to transmit it immediately (via qdisc_run). However, events like
+ tx lock contention, tx queue stopped, etc, can result in the skb not
+ getting sent out and it remains in the queue. When the next xmit is
+ called or when the queue is re-enabled, qdisc_run could potentially
+ find multiple packets in the queue, and iteratively send them all out
+ one-by-one.
+
+ The batching skb API was added to exploit this situation where all
+ skbs can be passed in one shot to the device. This reduces driver
+ processing, locking at the driver (or in stack for ~LLTX drivers)
+ gets amortized over multiple skbs, and in case of specific drivers
+ where every xmit results in a completion processing (like IPoIB),
+ optimizations can be made in the driver to request a completion for
+ only the last skb that was sent which results in saving interrupts
+ for every (but the last) skb that was sent in the same batch.
+
+ Batching can result in significant performance gains for systems that
+ have multiple data stream paths over the same network interface card.
+
+
+Section 3: How drivers can support this API ?
+---------------------------------------------
+
+ The new API - dev->hard_start_xmit_batch(struct net_device *dev),
+ simplistically, can be written almost identically to the regular
+ xmit API except that multiple skbs should be processed by the driver
+ instead of one skb. The new API doesn't get a skb as an argument,
+ instead it picks up all the skbs from dev->skb_blist, where it was
+ added by the core stack, and tries to send them out.
+
+ Batching requires the driver to set dev->hard_start_xmit_batch to the
+ new API implemented for that driver.
+
+
+Section 4: How users can work with this API ?
+---------------------------------------------
+
+ Batching could be disabled for a particular device, e.g. on desktop
+ systems if only one stream of network activity for that device is
+ taking place, since performance could be slightly affected due to
+ extra processing that batching adds (unless packets are getting
+ sent fast resulting in stopped queue's). Batching can be enabled if
+ more than one stream of network activity per device is being done,
+ e.g. on servers; or even desktop usage with multiple browser, chat,
+ file transfer sessions, etc.
+
+ Per device batching can be enabled/disabled using ethtool, where
+ passing 1 enables batching and passing 0 disables batching.
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH 2/9 Rev3] [core] Add skb_blist & hard_start_xmit_batch
2007-08-08 9:31 [ofa-general] [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB Krishna Kumar
2007-08-08 9:31 ` [PATCH 1/9 Rev3] [Doc] HOWTO Documentation for batching Krishna Kumar
@ 2007-08-08 9:31 ` Krishna Kumar
2007-08-08 10:59 ` [ofa-general] " Stephen Hemminger
2007-08-08 12:01 ` Evgeniy Polyakov
2007-08-08 9:31 ` [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching Krishna Kumar
` (7 subsequent siblings)
9 siblings, 2 replies; 34+ messages in thread
From: Krishna Kumar @ 2007-08-08 9:31 UTC (permalink / raw)
To: johnpol, sri, shemminger, kaber, davem
Cc: jagana, Robert.Olsson, rick.jones2, herbert, gaagaan, kumarkr,
rdreier, peter.p.waskiewicz.jr, mcarlson, jeff, general, mchan,
tgraf, hadi, netdev, Krishna Kumar, xma
Introduce skb_blist and hard_start_xmit_batch API, handle driver's usage
of the new API, and add support routines.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
include/linux/netdevice.h | 8 +++
net/core/dev.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 106 insertions(+)
diff -ruNp ORG/include/linux/netdevice.h NEW/include/linux/netdevice.h
--- ORG/include/linux/netdevice.h 2007-08-06 08:25:37.000000000 +0530
+++ NEW/include/linux/netdevice.h 2007-08-07 13:11:19.000000000 +0530
@@ -456,6 +456,9 @@ struct net_device
/* Partially transmitted GSO packet. */
struct sk_buff *gso_skb;
+ /* List of batch skbs (optional, used if driver supports batching API */
+ struct sk_buff_head *skb_blist;
+
/* ingress path synchronizer */
spinlock_t ingress_lock;
struct Qdisc *qdisc_ingress;
@@ -472,6 +475,9 @@ struct net_device
void *priv; /* pointer to private data */
int (*hard_start_xmit) (struct sk_buff *skb,
struct net_device *dev);
+ int (*hard_start_xmit_batch) (struct net_device
+ *dev);
+
/* These may be needed for future network-power-down code. */
unsigned long trans_start; /* Time (in jiffies) of last Tx */
@@ -832,6 +838,8 @@ extern int dev_set_mac_address(struct n
struct sockaddr *);
extern int dev_hard_start_xmit(struct sk_buff *skb,
struct net_device *dev);
+extern int dev_add_skb_to_blist(struct sk_buff *skb,
+ struct net_device *dev);
extern void dev_init(void);
diff -ruNp ORG/net/core/dev.c NEW/net/core/dev.c
--- ORG/net/core/dev.c 2007-08-06 08:25:40.000000000 +0530
+++ NEW/net/core/dev.c 2007-08-07 13:11:19.000000000 +0530
@@ -897,6 +897,55 @@ void netdev_state_change(struct net_devi
}
}
+static void free_batching(struct net_device *dev)
+{
+ if (dev->skb_blist) {
+ if (!skb_queue_empty(dev->skb_blist))
+ skb_queue_purge(dev->skb_blist);
+ kfree(dev->skb_blist);
+ dev->skb_blist = NULL;
+ }
+}
+
+int dev_change_tx_batch_skb(struct net_device *dev, unsigned long new_batch_skb)
+{
+ int ret = 0;
+ struct sk_buff_head *blist;
+
+ if (!dev->hard_start_xmit_batch) {
+ /* Driver doesn't support batching skb API */
+ ret = -ENOTSUPP;
+ goto out;
+ }
+
+ /* Handle invalid argument */
+ if (new_batch_skb < 0) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* Check if new value is same as the current */
+ if (!!dev->skb_blist == !!new_batch_skb)
+ goto out;
+
+ if (new_batch_skb &&
+ (blist = kmalloc(sizeof *blist, GFP_KERNEL)) == NULL) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ spin_lock(&dev->queue_lock);
+ if (new_batch_skb) {
+ skb_queue_head_init(blist);
+ dev->skb_blist = blist;
+ } else
+ free_batching(dev);
+ spin_unlock(&dev->queue_lock);
+
+out:
+ return ret;
+}
+
/**
* dev_load - load a network module
* @name: name of interface
@@ -1459,6 +1508,45 @@ static int dev_gso_segment(struct sk_buf
return 0;
}
+/*
+ * Add skb (skbs in case segmentation is required) to dev->skb_blist. We are
+ * holding QDISC RUNNING bit, so no one else can add to this list. Also, skbs
+ * are dequeued from this list when we call the driver, so the list is safe
+ * from simultaneous deletes too.
+ *
+ * Returns count of successful skb(s) added to skb_blist.
+ */
+int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev)
+{
+ if (!list_empty(&ptype_all))
+ dev_queue_xmit_nit(skb, dev);
+
+ if (netif_needs_gso(dev, skb)) {
+ if (unlikely(dev_gso_segment(skb))) {
+ kfree(skb);
+ return 0;
+ }
+
+ if (skb->next) {
+ int count = 0;
+
+ do {
+ struct sk_buff *nskb = skb->next;
+
+ skb->next = nskb->next;
+ __skb_queue_tail(dev->skb_blist, nskb);
+ count++;
+ } while (skb->next);
+
+ skb->destructor = DEV_GSO_CB(skb)->destructor;
+ kfree_skb(skb);
+ return count;
+ }
+ }
+ __skb_queue_tail(dev->skb_blist, skb);
+ return 1;
+}
+
int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
if (likely(!skb->next)) {
@@ -3446,6 +3535,13 @@ int register_netdevice(struct net_device
}
}
+ if (dev->hard_start_xmit_batch) {
+ /* Driver supports batching skb API */
+ dev->skb_blist = kmalloc(sizeof *dev->skb_blist, GFP_KERNEL);
+ if (dev->skb_blist)
+ skb_queue_head_init(dev->skb_blist);
+ }
+
/*
* nil rebuild_header routine,
* that should be never called and used as just bug trap.
@@ -3787,6 +3882,9 @@ void unregister_netdevice(struct net_dev
synchronize_net();
+ /* Deallocate batching structure */
+ free_batching(dev);
+
/* Shutdown queueing discipline. */
dev_shutdown(dev);
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching
2007-08-08 9:31 [ofa-general] [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB Krishna Kumar
2007-08-08 9:31 ` [PATCH 1/9 Rev3] [Doc] HOWTO Documentation for batching Krishna Kumar
2007-08-08 9:31 ` [PATCH 2/9 Rev3] [core] Add skb_blist & hard_start_xmit_batch Krishna Kumar
@ 2007-08-08 9:31 ` Krishna Kumar
2007-08-08 12:14 ` [ofa-general] " Evgeniy Polyakov
2007-08-08 14:05 ` Patrick McHardy
2007-08-08 9:31 ` [PATCH 4/9 Rev3] [ethtool] Add ethtool support Krishna Kumar
` (6 subsequent siblings)
9 siblings, 2 replies; 34+ messages in thread
From: Krishna Kumar @ 2007-08-08 9:31 UTC (permalink / raw)
To: johnpol, davem, shemminger, kaber, sri
Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
kumarkr, rdreier, rick.jones2, mcarlson, jeff, general, mchan,
tgraf, hadi, netdev, Krishna Kumar, xma
Modify qdisc_run() to support batching. Modify callers of qdisc_run to
use batching, modify qdisc_restart to implement batching.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
include/net/pkt_sched.h | 6 +--
net/core/dev.c | 5 +--
net/sched/sch_generic.c | 77 +++++++++++++++++++++++++++++++++++++++---------
3 files changed, 69 insertions(+), 19 deletions(-)
diff -ruNp ORG/include/net/pkt_sched.h NEW/include/net/pkt_sched.h
--- ORG/include/net/pkt_sched.h 2007-07-17 08:48:37.000000000 +0530
+++ NEW/include/net/pkt_sched.h 2007-08-07 13:11:19.000000000 +0530
@@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge
struct rtattr *tab);
extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
-extern void __qdisc_run(struct net_device *dev);
+extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist);
-static inline void qdisc_run(struct net_device *dev)
+static inline void qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
{
if (!netif_queue_stopped(dev) &&
!test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
- __qdisc_run(dev);
+ __qdisc_run(dev, blist);
}
extern int tc_classify_compat(struct sk_buff *skb, struct tcf_proto *tp,
diff -ruNp ORG/net/sched/sch_generic.c NEW/net/sched/sch_generic.c
--- ORG/net/sched/sch_generic.c 2007-07-12 08:55:20.000000000 +0530
+++ NEW/net/sched/sch_generic.c 2007-08-07 13:11:19.000000000 +0530
@@ -59,10 +59,12 @@ static inline int qdisc_qlen(struct Qdis
static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
struct Qdisc *q)
{
- if (unlikely(skb->next))
- dev->gso_skb = skb;
- else
- q->ops->requeue(skb, q);
+ if (likely(skb)) {
+ if (unlikely(skb->next))
+ dev->gso_skb = skb;
+ else
+ q->ops->requeue(skb, q);
+ }
netif_schedule(dev);
return 0;
@@ -91,17 +93,22 @@ static inline int handle_dev_cpu_collisi
/*
* Same CPU holding the lock. It may be a transient
* configuration error, when hard_start_xmit() recurses. We
- * detect it by checking xmit owner and drop the packet when
- * deadloop is detected. Return OK to try the next skb.
+ * detect it by checking xmit owner and drop the packet (or
+ * all packets in batching case) when deadloop is detected.
+ * Return OK to try the next skb.
*/
- kfree_skb(skb);
+ if (likely(skb))
+ kfree_skb(skb);
+ else if (!skb_queue_empty(dev->skb_blist))
+ skb_queue_purge(dev->skb_blist);
+
if (net_ratelimit())
printk(KERN_WARNING "Dead loop on netdevice %s, "
"fix it urgently!\n", dev->name);
ret = qdisc_qlen(q);
} else {
/*
- * Another cpu is holding lock, requeue & delay xmits for
+ * Another cpu is holding lock, requeue skb & delay xmits for
* some time.
*/
__get_cpu_var(netdev_rx_stat).cpu_collision++;
@@ -112,6 +119,38 @@ static inline int handle_dev_cpu_collisi
}
/*
+ * Algorithm to get skb(s) is:
+ * - Non batching drivers, or if the batch list is empty and there is
+ * 1 skb in the queue - dequeue skb and put it in *skbp to tell the
+ * caller to use the single xmit API.
+ * - Batching drivers where the batch list already contains atleast one
+ * skb, or if there are multiple skbs in the queue: keep dequeue'ing
+ * skb's upto a limit and set *skbp to NULL to tell the caller to use
+ * the multiple xmit API.
+ *
+ * Returns:
+ * 1 - atleast one skb is to be sent out, *skbp contains skb or NULL
+ * (in case >1 skbs present in blist for batching)
+ * 0 - no skbs to be sent.
+ */
+static inline int get_skb(struct net_device *dev, struct Qdisc *q,
+ struct sk_buff_head *blist, struct sk_buff **skbp)
+{
+ if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1))) {
+ return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
+ } else {
+ int max = dev->tx_queue_len - skb_queue_len(blist);
+ struct sk_buff *skb;
+
+ while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
+ max -= dev_add_skb_to_blist(skb, dev);
+
+ *skbp = NULL;
+ return 1; /* we have atleast one skb in blist */
+ }
+}
+
+/*
* NOTE: Called under dev->queue_lock with locally disabled BH.
*
* __LINK_STATE_QDISC_RUNNING guarantees only one CPU can process this
@@ -130,7 +169,8 @@ static inline int handle_dev_cpu_collisi
* >0 - queue is not empty.
*
*/
-static inline int qdisc_restart(struct net_device *dev)
+static inline int qdisc_restart(struct net_device *dev,
+ struct sk_buff_head *blist)
{
struct Qdisc *q = dev->qdisc;
struct sk_buff *skb;
@@ -138,7 +178,7 @@ static inline int qdisc_restart(struct n
int ret;
/* Dequeue packet */
- if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
+ if (unlikely(!get_skb(dev, q, blist, &skb)))
return 0;
/*
@@ -158,7 +198,10 @@ static inline int qdisc_restart(struct n
/* And release queue */
spin_unlock(&dev->queue_lock);
- ret = dev_hard_start_xmit(skb, dev);
+ if (likely(skb))
+ ret = dev_hard_start_xmit(skb, dev);
+ else
+ ret = dev->hard_start_xmit_batch(dev);
if (!lockless)
netif_tx_unlock(dev);
@@ -168,7 +211,7 @@ static inline int qdisc_restart(struct n
switch (ret) {
case NETDEV_TX_OK:
- /* Driver sent out skb successfully */
+ /* Driver sent out skb (or entire skb_blist) successfully */
ret = qdisc_qlen(q);
break;
@@ -190,10 +233,10 @@ static inline int qdisc_restart(struct n
return ret;
}
-void __qdisc_run(struct net_device *dev)
+void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
{
do {
- if (!qdisc_restart(dev))
+ if (!qdisc_restart(dev, blist))
break;
} while (!netif_queue_stopped(dev));
@@ -563,6 +606,12 @@ void dev_deactivate(struct net_device *d
qdisc = dev->qdisc;
dev->qdisc = &noop_qdisc;
+ if (dev->skb_blist) {
+ /* Release skbs on batch list */
+ if (!skb_queue_empty(dev->skb_blist))
+ skb_queue_purge(dev->skb_blist);
+ }
+
qdisc_reset(qdisc);
skb = dev->gso_skb;
diff -ruNp ORG/net/core/dev.c NEW/net/core/dev.c
--- ORG/net/core/dev.c 2007-08-06 08:25:40.000000000 +0530
+++ NEW/net/core/dev.c 2007-08-07 13:11:19.000000000 +0530
@@ -1699,7 +1699,7 @@ gso:
/* reset queue_mapping to zero */
skb->queue_mapping = 0;
rc = q->enqueue(skb, q);
- qdisc_run(dev);
+ qdisc_run(dev, NULL);
spin_unlock(&dev->queue_lock);
rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc;
@@ -1896,7 +1896,8 @@ static void net_tx_action(struct softirq
clear_bit(__LINK_STATE_SCHED, &dev->state);
if (spin_trylock(&dev->queue_lock)) {
- qdisc_run(dev);
+ /* Send all skbs if driver supports batching */
+ qdisc_run(dev, dev->skb_blist);
spin_unlock(&dev->queue_lock);
} else {
netif_schedule(dev);
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH 4/9 Rev3] [ethtool] Add ethtool support
2007-08-08 9:31 [ofa-general] [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB Krishna Kumar
` (2 preceding siblings ...)
2007-08-08 9:31 ` [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching Krishna Kumar
@ 2007-08-08 9:31 ` Krishna Kumar
2007-08-08 9:32 ` [ofa-general] [PATCH 5/9 Rev3] [IPoIB] Header file changes Krishna Kumar
` (5 subsequent siblings)
9 siblings, 0 replies; 34+ messages in thread
From: Krishna Kumar @ 2007-08-08 9:31 UTC (permalink / raw)
To: johnpol, sri, shemminger, davem, kaber
Cc: jagana, Robert.Olsson, rick.jones2, herbert, gaagaan, kumarkr,
rdreier, peter.p.waskiewicz.jr, mcarlson, jeff, general, mchan,
tgraf, hadi, netdev, Krishna Kumar, xma
Add ethtool support to enable/disable batching.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
include/linux/ethtool.h | 2 ++
include/linux/netdevice.h | 2 ++
net/core/ethtool.c | 26 ++++++++++++++++++++++++++
3 files changed, 30 insertions(+)
diff -ruNp ORG/include/linux/ethtool.h NEW/include/linux/ethtool.h
--- ORG/include/linux/ethtool.h 2007-08-06 08:25:37.000000000 +0530
+++ NEW/include/linux/ethtool.h 2007-08-07 13:11:19.000000000 +0530
@@ -410,6 +410,8 @@ struct ethtool_ops {
#define ETHTOOL_SUFO 0x00000022 /* Set UFO enable (ethtool_value) */
#define ETHTOOL_GGSO 0x00000023 /* Get GSO enable (ethtool_value) */
#define ETHTOOL_SGSO 0x00000024 /* Set GSO enable (ethtool_value) */
+#define ETHTOOL_GBATCH 0x00000024 /* Get Batch (ethtool_value) */
+#define ETHTOOL_SBATCH 0x00000025 /* Set Batch (ethtool_value) */
/* compatibility with older code */
#define SPARC_ETH_GSET ETHTOOL_GSET
diff -ruNp ORG/include/linux/netdevice.h NEW/include/linux/netdevice.h
--- ORG/include/linux/netdevice.h 2007-08-06 08:25:37.000000000 +0530
+++ NEW/include/linux/netdevice.h 2007-08-07 13:11:19.000000000 +0530
@@ -1112,6 +1112,8 @@ extern void dev_set_promiscuity(struct
extern void dev_set_allmulti(struct net_device *dev, int inc);
extern void netdev_state_change(struct net_device *dev);
extern void netdev_features_change(struct net_device *dev);
+extern int dev_change_tx_batch_skb(struct net_device *dev,
+ unsigned long new_batch_skb);
/* Load a device via the kmod */
extern void dev_load(const char *name);
extern void dev_mcast_init(void);
diff -ruNp ORG/net/core/ethtool.c NEW/net/core/ethtool.c
--- ORG/net/core/ethtool.c 2007-08-06 08:25:40.000000000 +0530
+++ NEW/net/core/ethtool.c 2007-08-07 13:11:19.000000000 +0530
@@ -638,6 +638,25 @@ static int ethtool_set_gso(struct net_de
return 0;
}
+static int ethtool_get_batch(struct net_device *dev, char __user *useraddr)
+{
+ struct ethtool_value edata = { ETHTOOL_GBATCH };
+
+ edata.data = dev->skb_blist != NULL;
+ if (copy_to_user(useraddr, &edata, sizeof(edata)))
+ return -EFAULT;
+ return 0;
+}
+
+static int ethtool_set_gso(struct net_device *dev, char __user *useraddr)
+{
+ struct ethtool_value edata;
+
+ if (copy_from_user(&edata, useraddr, sizeof(edata)))
+ return -EFAULT;
+ return dev_change_tx_batch_skb(dev, edata.data);
+}
+
static int ethtool_self_test(struct net_device *dev, char __user *useraddr)
{
struct ethtool_test test;
@@ -817,6 +836,7 @@ int dev_ethtool(struct ifreq *ifr)
case ETHTOOL_GPERMADDR:
case ETHTOOL_GUFO:
case ETHTOOL_GGSO:
+ case ETHTOOL_GBATCH:
break;
default:
if (!capable(CAP_NET_ADMIN))
@@ -935,6 +955,12 @@ int dev_ethtool(struct ifreq *ifr)
case ETHTOOL_SGSO:
rc = ethtool_set_gso(dev, useraddr);
break;
+ case ETHTOOL_GBATCH:
+ rc = ethtool_get_batch(dev, useraddr);
+ break;
+ case ETHTOOL_SBATCH:
+ rc = ethtool_set_batch(dev, useraddr);
+ break;
default:
rc = -EOPNOTSUPP;
}
^ permalink raw reply [flat|nested] 34+ messages in thread
* [ofa-general] [PATCH 5/9 Rev3] [IPoIB] Header file changes
2007-08-08 9:31 [ofa-general] [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB Krishna Kumar
` (3 preceding siblings ...)
2007-08-08 9:31 ` [PATCH 4/9 Rev3] [ethtool] Add ethtool support Krishna Kumar
@ 2007-08-08 9:32 ` Krishna Kumar
2007-08-08 9:32 ` [ofa-general] [PATCH 6/9 Rev3] [IPoIB] CM & Multicast changes Krishna Kumar
` (4 subsequent siblings)
9 siblings, 0 replies; 34+ messages in thread
From: Krishna Kumar @ 2007-08-08 9:32 UTC (permalink / raw)
To: johnpol, kaber, shemminger, davem, sri
Cc: jagana, Robert.Olsson, herbert, gaagaan, kumarkr, rdreier,
peter.p.waskiewicz.jr, mcarlson, jeff, general, mchan, tgraf,
hadi, netdev
IPoIB header file changes to use batching.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
ipoib.h | 11 ++++++++---
1 files changed, 8 insertions(+), 3 deletions(-)
diff -ruNp ORG/drivers/infiniband/ulp/ipoib/ipoib.h NEW/drivers/infiniband/ulp/ipoib/ipoib.h
--- ORG/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-12 08:55:06.000000000 +0530
+++ NEW/drivers/infiniband/ulp/ipoib/ipoib.h 2007-08-07 13:11:19.000000000 +0530
@@ -266,11 +266,13 @@ struct ipoib_dev_priv {
struct ipoib_rx_buf *rx_ring;
spinlock_t tx_lock;
+ spinlock_t comp_lock; /* to handle parallel WC's */
struct ipoib_tx_buf *tx_ring;
unsigned tx_head;
unsigned tx_tail;
- struct ib_sge tx_sge;
- struct ib_send_wr tx_wr;
+ unsigned tx_prev_tail; /* to handle parallel WC's */
+ struct ib_sge *tx_sge;
+ struct ib_send_wr *tx_wr;
struct ib_wc ibwc[IPOIB_NUM_WC];
@@ -365,8 +367,11 @@ static inline void ipoib_put_ah(struct i
int ipoib_open(struct net_device *dev);
int ipoib_add_pkey_attr(struct net_device *dev);
+int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb,
+ struct ipoib_dev_priv *priv, struct ipoib_ah *address,
+ u32 qpn, int wr_num);
void ipoib_send(struct net_device *dev, struct sk_buff *skb,
- struct ipoib_ah *address, u32 qpn);
+ struct ipoib_ah *address, u32 qpn, int num_skbs);
void ipoib_reap_ah(struct work_struct *work);
void ipoib_flush_paths(struct net_device *dev);
^ permalink raw reply [flat|nested] 34+ messages in thread
* [ofa-general] [PATCH 6/9 Rev3] [IPoIB] CM & Multicast changes
2007-08-08 9:31 [ofa-general] [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB Krishna Kumar
` (4 preceding siblings ...)
2007-08-08 9:32 ` [ofa-general] [PATCH 5/9 Rev3] [IPoIB] Header file changes Krishna Kumar
@ 2007-08-08 9:32 ` Krishna Kumar
2007-08-08 9:32 ` [ofa-general] [PATCH 7/9 Rev3] [IPoIB] Verb changes Krishna Kumar
` (3 subsequent siblings)
9 siblings, 0 replies; 34+ messages in thread
From: Krishna Kumar @ 2007-08-08 9:32 UTC (permalink / raw)
To: johnpol, sri, shemminger, kaber, davem
Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
kumarkr, rdreier, mcarlson, jeff, general, mchan, tgraf, hadi,
netdev
IPoIB CM & Multicast changes based on header file changes.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
ipoib_cm.c | 13 +++++++++----
ipoib_multicast.c | 4 ++--
2 files changed, 11 insertions(+), 6 deletions(-)
diff -ruNp ORG/drivers/infiniband/ulp/ipoib/ipoib_cm.c NEW/drivers/infiniband/ulp/ipoib/ipoib_cm.c
--- ORG/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-17 08:48:35.000000000 +0530
+++ NEW/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-08-07 13:11:19.000000000 +0530
@@ -493,14 +493,19 @@ static inline int post_send(struct ipoib
unsigned int wr_id,
u64 addr, int len)
{
+ int ret;
struct ib_send_wr *bad_wr;
- priv->tx_sge.addr = addr;
- priv->tx_sge.length = len;
+ priv->tx_sge[0].addr = addr;
+ priv->tx_sge[0].length = len;
+
+ priv->tx_wr[0].wr_id = wr_id;
- priv->tx_wr.wr_id = wr_id;
+ priv->tx_wr[0].next = NULL;
+ ret = ib_post_send(tx->qp, priv->tx_wr, &bad_wr);
+ priv->tx_wr[0].next = &priv->tx_wr[1];
- return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr);
+ return ret;
}
void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
diff -ruNp ORG/drivers/infiniband/ulp/ipoib/ipoib_multicast.c NEW/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
--- ORG/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-07-12 08:55:06.000000000 +0530
+++ NEW/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-08-07 13:11:19.000000000 +0530
@@ -217,7 +217,7 @@ static int ipoib_mcast_join_finish(struc
if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
sizeof (union ib_gid))) {
priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
- priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
+ priv->tx_wr[0].wr.ud.remote_qkey = priv->qkey;
}
if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -736,7 +736,7 @@ out:
}
}
- ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN);
+ ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN, 1);
}
unlock:
^ permalink raw reply [flat|nested] 34+ messages in thread
* [ofa-general] [PATCH 7/9 Rev3] [IPoIB] Verb changes
2007-08-08 9:31 [ofa-general] [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB Krishna Kumar
` (5 preceding siblings ...)
2007-08-08 9:32 ` [ofa-general] [PATCH 6/9 Rev3] [IPoIB] CM & Multicast changes Krishna Kumar
@ 2007-08-08 9:32 ` Krishna Kumar
2007-08-08 9:32 ` [PATCH 8/9 Rev3] [IPoIB] Post and work completion handler changes Krishna Kumar
` (2 subsequent siblings)
9 siblings, 0 replies; 34+ messages in thread
From: Krishna Kumar @ 2007-08-08 9:32 UTC (permalink / raw)
To: johnpol, davem, shemminger, kaber, sri
Cc: jagana, Robert.Olsson, herbert, gaagaan, kumarkr, rdreier,
peter.p.waskiewicz.jr, mcarlson, jeff, general, mchan, tgraf,
hadi, netdev
IPoIB verb changes to use batching.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
ipoib_verbs.c | 23 ++++++++++++++---------
1 files changed, 14 insertions(+), 9 deletions(-)
diff -ruNp ORG/drivers/infiniband/ulp/ipoib/ipoib_verbs.c NEW/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
--- ORG/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-07-12 08:55:06.000000000 +0530
+++ NEW/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-08-07 13:11:19.000000000 +0530
@@ -152,11 +152,11 @@ int ipoib_transport_dev_init(struct net_
.max_send_sge = 1,
.max_recv_sge = 1
},
- .sq_sig_type = IB_SIGNAL_ALL_WR,
+ .sq_sig_type = IB_SIGNAL_REQ_WR, /* 11.2.4.1 */
.qp_type = IB_QPT_UD
};
-
- int ret, size;
+ struct ib_send_wr *next_wr = NULL;
+ int i, ret, size;
priv->pd = ib_alloc_pd(priv->ca);
if (IS_ERR(priv->pd)) {
@@ -197,12 +197,17 @@ int ipoib_transport_dev_init(struct net_
priv->dev->dev_addr[2] = (priv->qp->qp_num >> 8) & 0xff;
priv->dev->dev_addr[3] = (priv->qp->qp_num ) & 0xff;
- priv->tx_sge.lkey = priv->mr->lkey;
-
- priv->tx_wr.opcode = IB_WR_SEND;
- priv->tx_wr.sg_list = &priv->tx_sge;
- priv->tx_wr.num_sge = 1;
- priv->tx_wr.send_flags = IB_SEND_SIGNALED;
+ for (i = ipoib_sendq_size - 1; i >= 0; i--) {
+ priv->tx_sge[i].lkey = priv->mr->lkey;
+ priv->tx_wr[i].opcode = IB_WR_SEND;
+ priv->tx_wr[i].sg_list = &priv->tx_sge[i];
+ priv->tx_wr[i].num_sge = 1;
+ priv->tx_wr[i].send_flags = 0;
+
+ /* Link the list properly for provider to use */
+ priv->tx_wr[i].next = next_wr;
+ next_wr = &priv->tx_wr[i];
+ }
return 0;
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH 8/9 Rev3] [IPoIB] Post and work completion handler changes
2007-08-08 9:31 [ofa-general] [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB Krishna Kumar
` (6 preceding siblings ...)
2007-08-08 9:32 ` [ofa-general] [PATCH 7/9 Rev3] [IPoIB] Verb changes Krishna Kumar
@ 2007-08-08 9:32 ` Krishna Kumar
2007-08-08 9:32 ` [PATCH 9/9 Rev3] [IPoIB] Implement the new batching API Krishna Kumar
2007-08-08 10:49 ` [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB David Miller
9 siblings, 0 replies; 34+ messages in thread
From: Krishna Kumar @ 2007-08-08 9:32 UTC (permalink / raw)
To: johnpol, sri, shemminger, davem, kaber
Cc: jagana, Robert.Olsson, rick.jones2, herbert, gaagaan, kumarkr,
rdreier, peter.p.waskiewicz.jr, mcarlson, jeff, general, mchan,
tgraf, hadi, netdev, Krishna Kumar, xma
IPoIB internal post and work completion handler changes.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
ipoib_ib.c | 217 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
1 files changed, 173 insertions(+), 44 deletions(-)
diff -ruNp ORG/drivers/infiniband/ulp/ipoib/ipoib_ib.c NEW/drivers/infiniband/ulp/ipoib/ipoib_ib.c
--- ORG/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-07-17 08:48:35.000000000 +0530
+++ NEW/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-08-07 13:11:19.000000000 +0530
@@ -242,6 +242,8 @@ repost:
static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
{
struct ipoib_dev_priv *priv = netdev_priv(dev);
+ int i, num_completions;
+ unsigned int tx_ring_index;
unsigned int wr_id = wc->wr_id;
struct ipoib_tx_buf *tx_req;
unsigned long flags;
@@ -255,18 +257,56 @@ static void ipoib_ib_handle_tx_wc(struct
return;
}
- tx_req = &priv->tx_ring[wr_id];
+ /*
+ * Handle skbs completion from tx_tail to wr_id. Two issues :
+ * - Need to stop other WC's from mangling same skb's if they
+ * run at the same time. Use tx_prev_tail to demarcate WC's.
+ * - Handle WC's from earlier (possibly multiple) post_sends in
+ * this iteration as we move from tx_prev_tail to wr_id, since
+ * if the last WR (which is the one which requested completion
+ * notification) failed to be sent for any of those earlier
+ * request(s), no completion notification is generated for
+ * successful WR's of those earlier request(s).
+ */
+ spin_lock_irqsave(&priv->comp_lock, flags);
+
+ /* Get start index */
+ tx_ring_index = priv->tx_prev_tail & (ipoib_sendq_size - 1);
+
+ /*Find number of WC's */
+ num_completions = wr_id - tx_ring_index + 1;
+ if (unlikely(num_completions <= 0))
+ num_completions += ipoib_sendq_size;
+
+ /* Save new start index for any parallel WC's */
+ priv->tx_prev_tail += num_completions;
+
+ spin_unlock_irqrestore(&priv->comp_lock, flags);
- ib_dma_unmap_single(priv->ca, tx_req->mapping,
- tx_req->skb->len, DMA_TO_DEVICE);
+ tx_req = &priv->tx_ring[tx_ring_index];
+ for (i = 0; i < num_completions; i++) {
+ if (likely(tx_req->skb)) {
+ ib_dma_unmap_single(priv->ca, tx_req->mapping,
+ tx_req->skb->len, DMA_TO_DEVICE);
- ++priv->stats.tx_packets;
- priv->stats.tx_bytes += tx_req->skb->len;
+ ++priv->stats.tx_packets;
+ priv->stats.tx_bytes += tx_req->skb->len;
- dev_kfree_skb_any(tx_req->skb);
+ dev_kfree_skb_any(tx_req->skb);
+ }
+ /*
+ * else this skb failed synchronously when posted and was
+ * freed immediately.
+ */
+
+ if (likely(++tx_ring_index != ipoib_sendq_size))
+ tx_req++;
+ else
+ tx_req = &priv->tx_ring[0];
+ }
spin_lock_irqsave(&priv->tx_lock, flags);
- ++priv->tx_tail;
+ priv->tx_tail += num_completions;
if (unlikely(test_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags)) &&
priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) {
clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
@@ -340,29 +380,57 @@ void ipoib_ib_completion(struct ib_cq *c
netif_rx_schedule(dev_ptr);
}
-static inline int post_send(struct ipoib_dev_priv *priv,
- unsigned int wr_id,
- struct ib_ah *address, u32 qpn,
- u64 addr, int len)
+/*
+ * post_send : Post WR(s) to the device.
+ *
+ * num_skbs is the number of WR's, first_wr is the first slot in tx_wr[] (or
+ * tx_sge[]). first_wr is normally zero unless a previous post_send returned
+ * error and we are trying to post the untried WR's, in which case first_wr
+ * is the index to the first untried WR.
+ *
+ * Break the WR link before posting so that provider knows how many WR's to
+ * process, and this is set back after the post.
+ */
+static inline int post_send(struct ipoib_dev_priv *priv, u32 qpn,
+ int first_wr, int num_skbs,
+ struct ib_send_wr **bad_wr)
{
- struct ib_send_wr *bad_wr;
+ int ret;
+ struct ib_send_wr *last_wr, *next_wr;
+
+ last_wr = &priv->tx_wr[first_wr + num_skbs - 1];
+
+ /* Set Completion Notification for last WR */
+ last_wr->send_flags = IB_SEND_SIGNALED;
- priv->tx_sge.addr = addr;
- priv->tx_sge.length = len;
+ /* Terminate the last WR */
+ next_wr = last_wr->next;
+ last_wr->next = NULL;
- priv->tx_wr.wr_id = wr_id;
- priv->tx_wr.wr.ud.remote_qpn = qpn;
- priv->tx_wr.wr.ud.ah = address;
+ /* Send all the WR's in one doorbell */
+ ret = ib_post_send(priv->qp, &priv->tx_wr[first_wr], bad_wr);
- return ib_post_send(priv->qp, &priv->tx_wr, &bad_wr);
+ /* Restore send_flags & WR chain */
+ last_wr->send_flags = 0;
+ last_wr->next = next_wr;
+
+ return ret;
}
-void ipoib_send(struct net_device *dev, struct sk_buff *skb,
- struct ipoib_ah *address, u32 qpn)
+/*
+ * Map skb & store skb/mapping in tx_ring; and details of the WR in tx_wr
+ * to pass to the provider.
+ *
+ * Returns:
+ * 1: Error and the skb is freed.
+ * 0 skb processed successfully.
+ */
+int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb,
+ struct ipoib_dev_priv *priv, struct ipoib_ah *address,
+ u32 qpn, int wr_num)
{
- struct ipoib_dev_priv *priv = netdev_priv(dev);
- struct ipoib_tx_buf *tx_req;
u64 addr;
+ unsigned int tx_ring_index;
if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) {
ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
@@ -370,7 +438,7 @@ void ipoib_send(struct net_device *dev,
++priv->stats.tx_dropped;
++priv->stats.tx_errors;
ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu);
- return;
+ return 1;
}
ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n",
@@ -383,35 +451,96 @@ void ipoib_send(struct net_device *dev,
* means we have to make sure everything is properly recorded and
* our state is consistent before we call post_send().
*/
- tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)];
- tx_req->skb = skb;
- addr = ib_dma_map_single(priv->ca, skb->data, skb->len,
- DMA_TO_DEVICE);
+ addr = ib_dma_map_single(priv->ca, skb->data, skb->len, DMA_TO_DEVICE);
if (unlikely(ib_dma_mapping_error(priv->ca, addr))) {
++priv->stats.tx_errors;
dev_kfree_skb_any(skb);
- return;
+ return 1;
}
- tx_req->mapping = addr;
- if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1),
- address->ah, qpn, addr, skb->len))) {
- ipoib_warn(priv, "post_send failed\n");
- ++priv->stats.tx_errors;
- ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE);
- dev_kfree_skb_any(skb);
- } else {
- dev->trans_start = jiffies;
+ tx_ring_index = priv->tx_head & (ipoib_sendq_size - 1);
+
+ /* Save till completion handler executes */
+ priv->tx_ring[tx_ring_index].skb = skb;
+ priv->tx_ring[tx_ring_index].mapping = addr;
+
+ /* Set WR values for the provider to use */
+ priv->tx_sge[wr_num].addr = addr;
+ priv->tx_sge[wr_num].length = skb->len;
+
+ priv->tx_wr[wr_num].wr_id = tx_ring_index;
+ priv->tx_wr[wr_num].wr.ud.remote_qpn = qpn;
+ priv->tx_wr[wr_num].wr.ud.ah = address->ah;
+
+ priv->tx_head++;
+
+ if (unlikely(priv->tx_head - priv->tx_tail == ipoib_sendq_size)) {
+ ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
+ netif_stop_queue(dev);
+ set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
+ }
- address->last_send = priv->tx_head;
- ++priv->tx_head;
+ return 0;
+}
- if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) {
- ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
- netif_stop_queue(dev);
- set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
+/*
+ * Send num_skbs to the device. If an skb is passed to this function, it is
+ * single, unprocessed skb send case; otherwise it means that all skbs are
+ * already processed and put on priv->tx_wr,tx_sge,tx_ring, etc.
+ */
+void ipoib_send(struct net_device *dev, struct sk_buff *skb,
+ struct ipoib_ah *address, u32 qpn, int num_skbs)
+{
+ struct ipoib_dev_priv *priv = netdev_priv(dev);
+ int first_wr = 0;
+
+ if (skb && ipoib_process_skb(dev, skb, priv, address, qpn, 0))
+ return;
+
+ /* Send all skb's in one post */
+ do {
+ struct ib_send_wr *bad_wr;
+
+ if (unlikely((post_send(priv, qpn, first_wr, num_skbs,
+ &bad_wr)))) {
+ int done;
+
+ ipoib_warn(priv, "post_send failed\n");
+
+ /* Get number of WR's that finished successfully */
+ done = bad_wr - &priv->tx_wr[first_wr];
+
+ /* Handle 1 error */
+ priv->stats.tx_errors++;
+ ib_dma_unmap_single(priv->ca,
+ priv->tx_sge[first_wr + done].addr,
+ priv->tx_sge[first_wr + done].length,
+ DMA_TO_DEVICE);
+
+ /* Free failed WR & reset for WC handler to recognize */
+ dev_kfree_skb_any(priv->tx_ring[bad_wr->wr_id].skb);
+ priv->tx_ring[bad_wr->wr_id].skb = NULL;
+
+ /* Handle 'n' successes */
+ if (done) {
+ dev->trans_start = jiffies;
+ address->last_send = priv->tx_head - (num_skbs -
+ done) - 1;
+ }
+
+ /* Get count of skbs that were not tried */
+ num_skbs -= (done + 1);
+ /* + 1 for WR that was tried & failed */
+
+ /* Get start index for next iteration */
+ first_wr += (done + 1);
+ } else {
+ dev->trans_start = jiffies;
+
+ address->last_send = priv->tx_head - 1;
+ num_skbs = 0;
}
- }
+ } while (num_skbs);
}
static void __ipoib_reap_ah(struct net_device *dev)
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH 9/9 Rev3] [IPoIB] Implement the new batching API
2007-08-08 9:31 [ofa-general] [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB Krishna Kumar
` (7 preceding siblings ...)
2007-08-08 9:32 ` [PATCH 8/9 Rev3] [IPoIB] Post and work completion handler changes Krishna Kumar
@ 2007-08-08 9:32 ` Krishna Kumar
2007-08-08 10:49 ` [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB David Miller
9 siblings, 0 replies; 34+ messages in thread
From: Krishna Kumar @ 2007-08-08 9:32 UTC (permalink / raw)
To: johnpol, kaber, shemminger, davem, sri
Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
kumarkr, rdreier, rick.jones2, mcarlson, jeff, general, mchan,
tgraf, hadi, netdev, Krishna Kumar, xma
IPoIB: implement the new batching API.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
ipoib_main.c | 189 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 files changed, 184 insertions(+), 5 deletions(-)
diff -ruNp ORG/drivers/infiniband/ulp/ipoib/ipoib_main.c NEW/drivers/infiniband/ulp/ipoib/ipoib_main.c
--- ORG/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-07-12 08:55:06.000000000 +0530
+++ NEW/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-08-07 13:11:19.000000000 +0530
@@ -558,7 +558,8 @@ static void neigh_add_path(struct sk_buf
goto err_drop;
}
} else
- ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+ ipoib_send(dev, skb, path->ah,
+ IPOIB_QPN(skb->dst->neighbour->ha), 1);
} else {
neigh->ah = NULL;
@@ -638,7 +639,7 @@ static void unicast_arp_send(struct sk_b
ipoib_dbg(priv, "Send unicast ARP to %04x\n",
be16_to_cpu(path->pathrec.dlid));
- ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr));
+ ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr), 1);
} else if ((path->query || !path_rec_start(dev, path)) &&
skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
/* put pseudoheader back on for next time */
@@ -704,7 +705,8 @@ static int ipoib_start_xmit(struct sk_bu
goto out;
}
- ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+ ipoib_send(dev, skb, neigh->ah,
+ IPOIB_QPN(skb->dst->neighbour->ha), 1);
goto out;
}
@@ -753,6 +755,153 @@ out:
return NETDEV_TX_OK;
}
+#define XMIT_QUEUED_SKBS() \
+ do { \
+ if (wr_num) { \
+ ipoib_send(dev, NULL, old_neigh->ah, old_qpn, \
+ wr_num); \
+ wr_num = 0; \
+ } \
+ } while (0)
+
+/*
+ * TODO: Merge with ipoib_start_xmit to use the same code and have a
+ * transparent wrapper caller to xmit's, etc. Status: Done, needs testing.
+ */
+static int ipoib_start_xmit_frames(struct net_device *dev)
+{
+ struct ipoib_dev_priv *priv = netdev_priv(dev);
+ struct sk_buff *skb;
+ struct sk_buff_head *blist = dev->skb_blist;
+ int max_skbs, wr_num = 0;
+ u32 qpn, old_qpn = 0;
+ struct ipoib_neigh *neigh, *old_neigh = NULL;
+ unsigned long flags;
+
+ if (unlikely(!spin_trylock_irqsave(&priv->tx_lock, flags)))
+ return NETDEV_TX_LOCKED;
+
+ /*
+ * Figure out how many skbs can be sent. This prevents the device
+ * getting full and avoids checking for queue stopped after each
+ * iteration.
+ */
+ max_skbs = ipoib_sendq_size - (priv->tx_head - priv->tx_tail);
+ while (max_skbs-- > 0 && (skb = __skb_dequeue(blist)) != NULL) {
+ if (likely(skb->dst && skb->dst->neighbour)) {
+ if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) {
+ XMIT_QUEUED_SKBS();
+ ipoib_path_lookup(skb, dev);
+ continue;
+ }
+
+ neigh = *to_ipoib_neigh(skb->dst->neighbour);
+
+ if (ipoib_cm_get(neigh)) {
+ if (ipoib_cm_up(neigh)) {
+ XMIT_QUEUED_SKBS();
+ ipoib_cm_send(dev, skb,
+ ipoib_cm_get(neigh));
+ continue;
+ }
+ } else if (neigh->ah) {
+ if (unlikely(memcmp(&neigh->dgid.raw,
+ skb->dst->neighbour->ha + 4,
+ sizeof(union ib_gid)))) {
+ spin_lock(&priv->lock);
+ /*
+ * It's safe to call ipoib_put_ah()
+ * inside priv->lock here, because we
+ * know that path->ah will always hold
+ * one more reference, so ipoib_put_ah()
+ * will never do more than decrement
+ * the ref count.
+ */
+ ipoib_put_ah(neigh->ah);
+ list_del(&neigh->list);
+ ipoib_neigh_free(dev, neigh);
+ spin_unlock(&priv->lock);
+ XMIT_QUEUED_SKBS();
+ ipoib_path_lookup(skb, dev);
+ continue;
+ }
+
+ qpn = IPOIB_QPN(skb->dst->neighbour->ha);
+ if (neigh != old_neigh || qpn != old_qpn) {
+ /*
+ * Sending to a different destination
+ * from earlier skb's - send all
+ * existing skbs (if any), and restart.
+ */
+ XMIT_QUEUED_SKBS();
+ old_neigh = neigh;
+ old_qpn = qpn;
+ }
+
+ if (likely(!ipoib_process_skb(dev, skb, priv,
+ neigh->ah, qpn,
+ wr_num)))
+ wr_num++;
+
+ continue;
+ }
+
+ if (skb_queue_len(&neigh->queue) <
+ IPOIB_MAX_PATH_REC_QUEUE) {
+ spin_lock(&priv->lock);
+ __skb_queue_tail(&neigh->queue, skb);
+ spin_unlock(&priv->lock);
+ } else {
+ dev_kfree_skb_any(skb);
+ ++priv->stats.tx_dropped;
+ ++max_skbs;
+ }
+ } else {
+ struct ipoib_pseudoheader *phdr =
+ (struct ipoib_pseudoheader *) skb->data;
+ skb_pull(skb, sizeof *phdr);
+
+ if (phdr->hwaddr[4] == 0xff) {
+ /* Add in the P_Key for multicast*/
+ phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff;
+ phdr->hwaddr[9] = priv->pkey & 0xff;
+
+ XMIT_QUEUED_SKBS();
+ ipoib_mcast_send(dev, phdr->hwaddr + 4, skb);
+ } else {
+ /* unicast GID -- should be ARP or RARP reply */
+
+ if ((be16_to_cpup((__be16 *) skb->data) !=
+ ETH_P_ARP) &&
+ (be16_to_cpup((__be16 *) skb->data) !=
+ ETH_P_RARP)) {
+ ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x "
+ IPOIB_GID_FMT "\n",
+ skb->dst ? "neigh" : "dst",
+ be16_to_cpup((__be16 *)
+ skb->data),
+ IPOIB_QPN(phdr->hwaddr),
+ IPOIB_GID_RAW_ARG(phdr->hwaddr
+ + 4));
+ dev_kfree_skb_any(skb);
+ ++priv->stats.tx_dropped;
+ ++max_skbs;
+ continue;
+ }
+ XMIT_QUEUED_SKBS();
+ unicast_arp_send(skb, dev, phdr);
+ }
+ }
+ }
+
+ /* Send out last packets (if any) */
+ XMIT_QUEUED_SKBS();
+
+ spin_unlock_irqrestore(&priv->tx_lock, flags);
+
+ return skb_queue_empty(blist) ? NETDEV_TX_OK : NETDEV_TX_BUSY;
+}
+
static struct net_device_stats *ipoib_get_stats(struct net_device *dev)
{
struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -896,13 +1045,37 @@ int ipoib_dev_init(struct net_device *de
goto out_rx_ring_cleanup;
}
- /* priv->tx_head & tx_tail are already 0 */
+ /* priv->tx_head & tx_tail & tx_priv_tail are already 0 */
- if (ipoib_ib_dev_init(dev, ca, port))
+ /* Allocate tx_sge */
+ priv->tx_sge = kmalloc(ipoib_sendq_size * sizeof *priv->tx_sge,
+ GFP_KERNEL);
+ if (!priv->tx_sge) {
+ printk(KERN_WARNING "%s: failed to allocate TX sge (%d entries)\n",
+ ca->name, ipoib_sendq_size);
goto out_tx_ring_cleanup;
+ }
+
+ /* Allocate tx_wr */
+ priv->tx_wr = kmalloc(ipoib_sendq_size * sizeof *priv->tx_wr,
+ GFP_KERNEL);
+ if (!priv->tx_wr) {
+ printk(KERN_WARNING "%s: failed to allocate TX wr (%d entries)\n",
+ ca->name, ipoib_sendq_size);
+ goto out_tx_sge_cleanup;
+ }
+
+ if (ipoib_ib_dev_init(dev, ca, port))
+ goto out_tx_wr_cleanup;
return 0;
+out_tx_wr_cleanup:
+ kfree(priv->tx_wr);
+
+out_tx_sge_cleanup:
+ kfree(priv->tx_sge);
+
out_tx_ring_cleanup:
kfree(priv->tx_ring);
@@ -930,9 +1103,13 @@ void ipoib_dev_cleanup(struct net_device
kfree(priv->rx_ring);
kfree(priv->tx_ring);
+ kfree(priv->tx_sge);
+ kfree(priv->tx_wr);
priv->rx_ring = NULL;
priv->tx_ring = NULL;
+ priv->tx_sge = NULL;
+ priv->tx_wr = NULL;
}
static void ipoib_setup(struct net_device *dev)
@@ -943,6 +1120,7 @@ static void ipoib_setup(struct net_devic
dev->stop = ipoib_stop;
dev->change_mtu = ipoib_change_mtu;
dev->hard_start_xmit = ipoib_start_xmit;
+ dev->hard_start_xmit_batch = ipoib_start_xmit_frames;
dev->get_stats = ipoib_get_stats;
dev->tx_timeout = ipoib_timeout;
dev->hard_header = ipoib_hard_header;
@@ -979,6 +1157,7 @@ static void ipoib_setup(struct net_devic
spin_lock_init(&priv->lock);
spin_lock_init(&priv->tx_lock);
+ spin_lock_init(&priv->comp_lock);
mutex_init(&priv->mcast_mutex);
mutex_init(&priv->vlan_mutex);
^ permalink raw reply [flat|nested] 34+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 9:31 [ofa-general] [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB Krishna Kumar
` (8 preceding siblings ...)
2007-08-08 9:32 ` [PATCH 9/9 Rev3] [IPoIB] Implement the new batching API Krishna Kumar
@ 2007-08-08 10:49 ` David Miller
2007-08-08 11:09 ` Krishna Kumar2
` (2 more replies)
9 siblings, 3 replies; 34+ messages in thread
From: David Miller @ 2007-08-08 10:49 UTC (permalink / raw)
To: krkumar2
Cc: johnpol, jagana, peter.p.waskiewicz.jr, herbert, gaagaan,
Robert.Olsson, kumarkr, rdreier, mcarlson, jeff, hadi, general,
mchan, tgraf, netdev, shemminger, kaber, sri
From: Krishna Kumar <krkumar2@in.ibm.com>
Date: Wed, 08 Aug 2007 15:01:14 +0530
> RESULTS: The performance improvement for TCP No Delay is in the range of -8%
> to 320% (with -8% being the sole negative), with many individual tests
> giving 50% or more improvement (I think it is to do with the hw slots
> getting full quicker resulting in more batching when the queue gets
> woken). The results for TCP is in the range of -11% to 93%, with most
> of the tests (8/12) giving improvements.
Not because I think it obviates your work, but rather because I'm
curious, could you test a TSO-in-hardware driver converted to
batching and see how TSO alone compares to batching for a pure
TCP workload?
I personally don't think it will help for that case at all as
TSO likely does better job of coalescing the work _and_ reducing
bus traffic as well as work in the TCP stack.
^ permalink raw reply [flat|nested] 34+ messages in thread
* [ofa-general] Re: [PATCH 2/9 Rev3] [core] Add skb_blist & hard_start_xmit_batch
2007-08-08 9:31 ` [PATCH 2/9 Rev3] [core] Add skb_blist & hard_start_xmit_batch Krishna Kumar
@ 2007-08-08 10:59 ` Stephen Hemminger
2007-08-08 11:24 ` Krishna Kumar2
2007-08-08 12:01 ` Evgeniy Polyakov
1 sibling, 1 reply; 34+ messages in thread
From: Stephen Hemminger @ 2007-08-08 10:59 UTC (permalink / raw)
To: Krishna Kumar
Cc: johnpol, jagana, peter.p.waskiewicz.jr, herbert, gaagaan,
Robert.Olsson, kumarkr, rdreier, mcarlson, kaber, jeff, general,
mchan, tgraf, netdev, hadi, davem, sri
How do qdisc's that do rate control work now? Did you add logic to
breakup batches for token bucket, etc.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 10:49 ` [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB David Miller
@ 2007-08-08 11:09 ` Krishna Kumar2
2007-08-08 22:01 ` [ofa-general] " David Miller
2007-08-08 13:42 ` [ofa-general] " Herbert Xu
2007-08-14 9:02 ` Krishna Kumar2
2 siblings, 1 reply; 34+ messages in thread
From: Krishna Kumar2 @ 2007-08-08 11:09 UTC (permalink / raw)
To: David Miller
Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma
David Miller <davem@davemloft.net> wrote on 08/08/2007 04:19:00 PM:
> From: Krishna Kumar <krkumar2@in.ibm.com>
> Date: Wed, 08 Aug 2007 15:01:14 +0530
>
> > RESULTS: The performance improvement for TCP No Delay is in the range
of -8%
> > to 320% (with -8% being the sole negative), with many individual
tests
> > giving 50% or more improvement (I think it is to do with the hw
slots
> > getting full quicker resulting in more batching when the queue gets
> > woken). The results for TCP is in the range of -11% to 93%, with
most
> > of the tests (8/12) giving improvements.
>
> Not because I think it obviates your work, but rather because I'm
> curious, could you test a TSO-in-hardware driver converted to
> batching and see how TSO alone compares to batching for a pure
> TCP workload?
>
> I personally don't think it will help for that case at all as
> TSO likely does better job of coalescing the work _and_ reducing
> bus traffic as well as work in the TCP stack.
Definitely, I will try to do this.
What do you generally think of the patch/implementation ? :)
thanks,
- KK
^ permalink raw reply [flat|nested] 34+ messages in thread
* [ofa-general] Re: [PATCH 2/9 Rev3] [core] Add skb_blist & hard_start_xmit_batch
2007-08-08 10:59 ` [ofa-general] " Stephen Hemminger
@ 2007-08-08 11:24 ` Krishna Kumar2
0 siblings, 0 replies; 34+ messages in thread
From: Krishna Kumar2 @ 2007-08-08 11:24 UTC (permalink / raw)
To: Stephen Hemminger
Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
mcarlson, peter.p.waskiewicz.jr, hadi, kaber, jeff, general,
mchan, tgraf, netdev, sri, davem, rdreier
Stephen Hemminger <shemminger@linux-foundation.org> wrote on 08/08/2007
04:29:46 PM:
> How do qdisc's that do rate control work now? Did you add logic to
> breakup batches for token bucket, etc.
My thought was that the code works identically to the current code except
in requeue cases. For requeue the existing code sched decides which skb
to pass next (first) to device; while in batching case it is assumed that
skbs already in batch list are part of the hardware queue and sent first.
If that is a problem - at the time of enabling batching it is possible
to check what qdisc_ops is being used (or whenever the qdisc's change),
or maybe add a flag to each qdisc_ops to indicate whether qdisc supports
batching, etc ? Like definitely enable batching if using pfifo_fast_ops..
What is your suggestion ?
thanks,
- KK
^ permalink raw reply [flat|nested] 34+ messages in thread
* [ofa-general] Re: [PATCH 2/9 Rev3] [core] Add skb_blist & hard_start_xmit_batch
2007-08-08 9:31 ` [PATCH 2/9 Rev3] [core] Add skb_blist & hard_start_xmit_batch Krishna Kumar
2007-08-08 10:59 ` [ofa-general] " Stephen Hemminger
@ 2007-08-08 12:01 ` Evgeniy Polyakov
2007-08-09 3:09 ` Krishna Kumar2
1 sibling, 1 reply; 34+ messages in thread
From: Evgeniy Polyakov @ 2007-08-08 12:01 UTC (permalink / raw)
To: Krishna Kumar
Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
kumarkr, rdreier, mcarlson, kaber, jeff, hadi, general, mchan,
tgraf, netdev, shemminger, davem, sri
Hi Krishna.
On Wed, Aug 08, 2007 at 03:01:35PM +0530, Krishna Kumar (krkumar2@in.ibm.com) wrote:
> +int dev_change_tx_batch_skb(struct net_device *dev, unsigned long new_batch_skb)
> +{
> + int ret = 0;
> + struct sk_buff_head *blist;
> +
> + if (!dev->hard_start_xmit_batch) {
> + /* Driver doesn't support batching skb API */
> + ret = -ENOTSUPP;
> + goto out;
> + }
> +
> + /* Handle invalid argument */
> + if (new_batch_skb < 0) {
> + ret = -EINVAL;
> + goto out;
> + }
It is unsigned, how can it be less than zero?
And actually you use it just like a binary flag (casted to/from u32 in
the code, btw), so why not using ethtool_value directly here?
> + /* Check if new value is same as the current */
> + if (!!dev->skb_blist == !!new_batch_skb)
> + goto out;
> +
> + if (new_batch_skb &&
> + (blist = kmalloc(sizeof *blist, GFP_KERNEL)) == NULL) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + spin_lock(&dev->queue_lock);
> + if (new_batch_skb) {
> + skb_queue_head_init(blist);
> + dev->skb_blist = blist;
> + } else
> + free_batching(dev);
> + spin_unlock(&dev->queue_lock);
This needs bh lock too, since blist is accessed by qdisc_restart.
> +int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev)
> +{
> + if (!list_empty(&ptype_all))
> + dev_queue_xmit_nit(skb, dev);
> +
> + if (netif_needs_gso(dev, skb)) {
> + if (unlikely(dev_gso_segment(skb))) {
> + kfree(skb);
> + return 0;
> + }
> +
> + if (skb->next) {
> + int count = 0;
> +
> + do {
> + struct sk_buff *nskb = skb->next;
> +
> + skb->next = nskb->next;
> + __skb_queue_tail(dev->skb_blist, nskb);
> + count++;
> + } while (skb->next);
Is it possible to move list without iterating over each entry?
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 34+ messages in thread
* [ofa-general] Re: [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching
2007-08-08 9:31 ` [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching Krishna Kumar
@ 2007-08-08 12:14 ` Evgeniy Polyakov
2007-08-09 3:13 ` Krishna Kumar2
2007-08-08 14:05 ` Patrick McHardy
1 sibling, 1 reply; 34+ messages in thread
From: Evgeniy Polyakov @ 2007-08-08 12:14 UTC (permalink / raw)
To: Krishna Kumar
Cc: jagana, Robert.Olsson, herbert, gaagaan, kumarkr, rdreier,
peter.p.waskiewicz.jr, mcarlson, kaber, jeff, hadi, general,
mchan, tgraf, netdev, shemminger, davem, sri
On Wed, Aug 08, 2007 at 03:01:45PM +0530, Krishna Kumar (krkumar2@in.ibm.com) wrote:
> +static inline int get_skb(struct net_device *dev, struct Qdisc *q,
> + struct sk_buff_head *blist, struct sk_buff **skbp)
> +{
> + if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1))) {
> + return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
> + } else {
> + int max = dev->tx_queue_len - skb_queue_len(blist);
> + struct sk_buff *skb;
> +
> + while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
> + max -= dev_add_skb_to_blist(skb, dev);
> +
> + *skbp = NULL;
> + return 1; /* we have atleast one skb in blist */
> + }
> +}
Same here - is it possible to get a list in one go instead of pulling
one-by-one, since it forces quite a few additional unneded lock
get/releases. What about dev_dequeue_number_skb(dev, q, num), which will
grab the lock and move a list of skbs from one queue to provided head.
> @@ -158,7 +198,10 @@ static inline int qdisc_restart(struct n
> /* And release queue */
> spin_unlock(&dev->queue_lock);
>
> - ret = dev_hard_start_xmit(skb, dev);
> + if (likely(skb))
> + ret = dev_hard_start_xmit(skb, dev);
> + else
> + ret = dev->hard_start_xmit_batch(dev);
Perfectionism says that having array of two functions and calling one of
them via array_func_pointer[!!skb] will be much faster. Just a though.
It is actually much faster than if/else on x86 at least.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 34+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 10:49 ` [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB David Miller
2007-08-08 11:09 ` Krishna Kumar2
@ 2007-08-08 13:42 ` Herbert Xu
2007-08-08 15:14 ` jamal
` (2 more replies)
2007-08-14 9:02 ` Krishna Kumar2
2 siblings, 3 replies; 34+ messages in thread
From: Herbert Xu @ 2007-08-08 13:42 UTC (permalink / raw)
To: David Miller
Cc: johnpol, peter.p.waskiewicz.jr, jeff, gaagaan, Robert.Olsson,
kumarkr, rdreier, mcarlson, jagana, hadi, general, mchan, tgraf,
netdev, shemminger, kaber, sri
On Wed, Aug 08, 2007 at 03:49:00AM -0700, David Miller wrote:
>
> Not because I think it obviates your work, but rather because I'm
> curious, could you test a TSO-in-hardware driver converted to
> batching and see how TSO alone compares to batching for a pure
> TCP workload?
You could even lower the bar by disabling TSO and enabling
software GSO.
> I personally don't think it will help for that case at all as
> TSO likely does better job of coalescing the work _and_ reducing
> bus traffic as well as work in the TCP stack.
I agree. I suspect the bulk of the effort is in getting
these skb's created and processed by the stack so that by
the time that they're exiting the qdisc there's not much
to be saved anymore.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching
2007-08-08 9:31 ` [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching Krishna Kumar
2007-08-08 12:14 ` [ofa-general] " Evgeniy Polyakov
@ 2007-08-08 14:05 ` Patrick McHardy
2007-08-08 15:26 ` [ofa-general] " jamal
2007-08-09 4:06 ` Krishna Kumar2
1 sibling, 2 replies; 34+ messages in thread
From: Patrick McHardy @ 2007-08-08 14:05 UTC (permalink / raw)
To: Krishna Kumar
Cc: johnpol, davem, shemminger, sri, jagana, Robert.Olsson,
peter.p.waskiewicz.jr, herbert, gaagaan, kumarkr, rdreier,
rick.jones2, mcarlson, jeff, general, mchan, tgraf, hadi, netdev,
xma
Krishna Kumar wrote:
> + * Algorithm to get skb(s) is:
> + * - Non batching drivers, or if the batch list is empty and there is
> + * 1 skb in the queue - dequeue skb and put it in *skbp to tell the
> + * caller to use the single xmit API.
> + * - Batching drivers where the batch list already contains atleast one
> + * skb, or if there are multiple skbs in the queue: keep dequeue'ing
> + * skb's upto a limit and set *skbp to NULL to tell the caller to use
> + * the multiple xmit API.
> + *
> + * Returns:
> + * 1 - atleast one skb is to be sent out, *skbp contains skb or NULL
> + * (in case >1 skbs present in blist for batching)
> + * 0 - no skbs to be sent.
> + */
> +static inline int get_skb(struct net_device *dev, struct Qdisc *q,
> + struct sk_buff_head *blist, struct sk_buff **skbp)
> +{
> + if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1))) {
> + return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
> + } else {
> + int max = dev->tx_queue_len - skb_queue_len(blist);
> + struct sk_buff *skb;
> +
> + while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
> + max -= dev_add_skb_to_blist(skb, dev);
> +
> + *skbp = NULL;
> + return 1; /* we have atleast one skb in blist */
> + }
> +}
I think you missed my previous reply to this in the flood of
responses (or I missed your reply), so I'm copying it below:
The entire idea of a single queue after qdiscs that is refilled
independantly of transmissions times etc. make be worry a bit.
By changing the timing you're effectively changing the qdiscs
behaviour, at least in some cases. SFQ is a good example, but I
believe it affects most work-conserving qdiscs. Think of this
situation:
100 packets of flow 1 arrive
50 packets of flow 1 are sent
100 packets for flow 2 arrive
remaining packets are sent
On the wire you'll first see 50 packets of flow 1, than 100 packets
alternate of flow 1 and 2, then 50 packets flow 2.
With your additional queue all packets of flow 1 are pulled out of
the qdisc immediately and put in the fifo. When the 100 packets of
the second flow arrive they will also get pulled out immediately
and are put in the fifo behind the remaining 50 packets of flow 1.
So what you get on the wire is:
100 packets of flow 1
100 packets of flow 1
So SFQ is without any effect. This is not completely avoidable of
course, but you can and should limit the damage by only pulling
out as much packets as the driver can take and have the driver
stop the queue afterwards.
^ permalink raw reply [flat|nested] 34+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 13:42 ` [ofa-general] " Herbert Xu
@ 2007-08-08 15:14 ` jamal
2007-08-08 20:55 ` Stephen Hemminger
2007-08-08 22:22 ` David Miller
2007-08-09 0:06 ` Shirley Ma
2007-08-09 3:19 ` [ofa-general] " Krishna Kumar2
2 siblings, 2 replies; 34+ messages in thread
From: jamal @ 2007-08-08 15:14 UTC (permalink / raw)
To: Herbert Xu
Cc: johnpol, peter.p.waskiewicz.jr, jeff, gaagaan, Robert.Olsson,
kumarkr, rdreier, mcarlson, kaber, jagana, general, mchan, tgraf,
netdev, shemminger, David Miller, sri
On Wed, 2007-08-08 at 21:42 +0800, Herbert Xu wrote:
> On Wed, Aug 08, 2007 at 03:49:00AM -0700, David Miller wrote:
> >
> > Not because I think it obviates your work, but rather because I'm
> > curious, could you test a TSO-in-hardware driver converted to
> > batching and see how TSO alone compares to batching for a pure
> > TCP workload?
>
> You could even lower the bar by disabling TSO and enabling
> software GSO.
>From my observation for TCP packets slightly above MDU (upto 2K), GSO
gives worse performance than non-GSO throughput-wise. Actually this has
nothing to do with batching, rather the behavior is consistent with or
without batching changes.
> > I personally don't think it will help for that case at all as
> > TSO likely does better job of coalescing the work _and_ reducing
> > bus traffic as well as work in the TCP stack.
>
> I agree.
> I suspect the bulk of the effort is in getting
> these skb's created and processed by the stack so that by
> the time that they're exiting the qdisc there's not much
> to be saved anymore.
pktgen shows a clear win if you test the driver path - which is what you
should test because thats where the batching changes are.
Using TCP or UDP adds other variables[1] that need to be isolated first
in order to quantify the effect of batching.
For throughput and CPU utilization, the benefit will be clear when there
are a lot more flows.
cheers,
jamal
[1] I think there are too many other variables in play unfortunately
when you are dealing with a path that starts above the driver and one
that covers end to end effect: traffic/app source, system clock sources
as per my recent discovery, congestion control algorithms used, tuning
of recevier etc.
^ permalink raw reply [flat|nested] 34+ messages in thread
* [ofa-general] Re: [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching
2007-08-08 14:05 ` Patrick McHardy
@ 2007-08-08 15:26 ` jamal
2007-08-09 4:06 ` Krishna Kumar2
1 sibling, 0 replies; 34+ messages in thread
From: jamal @ 2007-08-08 15:26 UTC (permalink / raw)
To: Patrick McHardy
Cc: johnpol, herbert, gaagaan, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, mcarlson, netdev, jagana, general, mchan,
tgraf, jeff, shemminger, davem, sri
On Wed, 2007-08-08 at 16:05 +0200, Patrick McHardy wrote:
> This is not completely avoidable of
> course, but you can and should limit the damage by only pulling
> out as much packets as the driver can take and have the driver
> stop the queue afterwards.
This why the dev->xmit_win exists in my patches. The driver says how
much space it has using this variable - and only those packets get
pulled off; so there is no conflict with any scheduling algorithm be it
work/non-work conserving.
The ideal case is never to carry over anything in dev->blist. However,
there is a challenge with GSO/TSO: if say the descriptors in the driver
are less than the list of skbs, you are faced with the dilema of sending
a fraction of the gso list first.
My current approach is to transmit as much as the driver would allow.
On the next opportunity to transmit, you do immediately send anything
remaining first before you ask the qdisc for more packets. An
alternative approach i had entertained was not to send anything from the
skb list when hitting such a case, but it hasnt seem needed so far
(havent seen any leftovers from my experiments).
cheers,
jamal
^ permalink raw reply [flat|nested] 34+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 15:14 ` jamal
@ 2007-08-08 20:55 ` Stephen Hemminger
2007-08-08 22:40 ` jamal
2007-08-08 22:22 ` David Miller
1 sibling, 1 reply; 34+ messages in thread
From: Stephen Hemminger @ 2007-08-08 20:55 UTC (permalink / raw)
To: hadi
Cc: johnpol, peter.p.waskiewicz.jr, Herbert Xu, gaagaan,
Robert.Olsson, kumarkr, rdreier, mcarlson, David Miller, jagana,
general, mchan, tgraf, jeff, netdev, kaber, sri
On Wed, 08 Aug 2007 11:14:35 -0400
jamal <hadi@cyberus.ca> wrote:
> On Wed, 2007-08-08 at 21:42 +0800, Herbert Xu wrote:
> > On Wed, Aug 08, 2007 at 03:49:00AM -0700, David Miller wrote:
> > >
> > > Not because I think it obviates your work, but rather because I'm
> > > curious, could you test a TSO-in-hardware driver converted to
> > > batching and see how TSO alone compares to batching for a pure
> > > TCP workload?
> >
> > You could even lower the bar by disabling TSO and enabling
> > software GSO.
>
> >From my observation for TCP packets slightly above MDU (upto 2K), GSO
> gives worse performance than non-GSO throughput-wise. Actually this has
> nothing to do with batching, rather the behavior is consistent with or
> without batching changes.
>
> > > I personally don't think it will help for that case at all as
> > > TSO likely does better job of coalescing the work _and_ reducing
> > > bus traffic as well as work in the TCP stack.
> >
> > I agree.
> > I suspect the bulk of the effort is in getting
> > these skb's created and processed by the stack so that by
> > the time that they're exiting the qdisc there's not much
> > to be saved anymore.
>
> pktgen shows a clear win if you test the driver path - which is what you
> should test because thats where the batching changes are.
> Using TCP or UDP adds other variables[1] that need to be isolated first
> in order to quantify the effect of batching.
> For throughput and CPU utilization, the benefit will be clear when there
> are a lot more flows.
Optimizing for pktgen is a mistake for most users. Please show something
useful like router forwarding, TCP (single and multi flow) and/or better
yet application benchmark improvement.
^ permalink raw reply [flat|nested] 34+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 11:09 ` Krishna Kumar2
@ 2007-08-08 22:01 ` David Miller
2007-08-09 4:19 ` Krishna Kumar2
0 siblings, 1 reply; 34+ messages in thread
From: David Miller @ 2007-08-08 22:01 UTC (permalink / raw)
To: krkumar2
Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, hadi, mcarlson, netdev, general, mchan,
tgraf, sri, shemminger, kaber, herbert
From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Wed, 8 Aug 2007 16:39:47 +0530
> What do you generally think of the patch/implementation ? :)
We have two driver implementation paths on recieve and now
we'll have two on send, and that's not a good trend.
In an ideal world all the drivers would be NAPI and netif_rx()
would only be used by tunneling drivers and similar in the
protocol layers. And likewise all sends would go through
->hard_start_xmit().
If you can come up with a long term strategy that gets rid of
the special transmit method, that'd be great.
We should make Linux network drivers easy to write, not more difficult
by constantly adding most interfaces than we consolidate.
^ permalink raw reply [flat|nested] 34+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 15:14 ` jamal
2007-08-08 20:55 ` Stephen Hemminger
@ 2007-08-08 22:22 ` David Miller
2007-08-08 22:53 ` jamal
1 sibling, 1 reply; 34+ messages in thread
From: David Miller @ 2007-08-08 22:22 UTC (permalink / raw)
To: hadi
Cc: johnpol, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
kumarkr, rdreier, mcarlson, netdev, jagana, general, mchan, tgraf,
jeff, shemminger, kaber, sri
From: jamal <hadi@cyberus.ca>
Date: Wed, 08 Aug 2007 11:14:35 -0400
> pktgen shows a clear win if you test the driver path - which is what
> you should test because thats where the batching changes are.
The driver path, however, does not exist on an island and what
we care about is the final result with the changes running
inside the full system.
So, to be honest, besides for initial internal development
feedback, the isolated tests only have minimal merit and
it's the full protocol tests that are really interesting.
^ permalink raw reply [flat|nested] 34+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 20:55 ` Stephen Hemminger
@ 2007-08-08 22:40 ` jamal
0 siblings, 0 replies; 34+ messages in thread
From: jamal @ 2007-08-08 22:40 UTC (permalink / raw)
To: Stephen Hemminger
Cc: johnpol, peter.p.waskiewicz.jr, Herbert Xu, gaagaan,
Robert.Olsson, kumarkr, rdreier, mcarlson, David Miller, jagana,
general, mchan, tgraf, jeff, netdev, kaber, sri
On Wed, 2007-08-08 at 21:55 +0100, Stephen Hemminger wrote:
> > pktgen shows a clear win if you test the driver path - which is what you
> > should test because thats where the batching changes are.
> > Using TCP or UDP adds other variables[1] that need to be isolated first
> > in order to quantify the effect of batching.
> > For throughput and CPU utilization, the benefit will be clear when there
> > are a lot more flows.
>
> Optimizing for pktgen is a mistake for most users.
There is no "optimization for pktgen".
If you are improving the tx path, the first step is to test that you can
show the tx path improved. pktgen happens to be the best test suite for
that because it talks to the driver and exercises the changes.
i.e if one cant show that exercising the direct path demonstrates
improvements you probably wont be able to show batching improves TCP -
but i dont even wanna swear by that.
Does that make sense?
> Please show something
> useful like router forwarding, TCP (single and multi flow) and/or better
> yet application benchmark improvement.
Absolutely, but first things first. Analysis of why something improves
is extremely important, just saying "TCP throughput improved" is not
interesting and lazy.
To be scientific, it is important to isolate variables first in order to
come up with meaningful results that can be analysed.
To make a point, I have noticed extremely different results between TCP
BIC vs reno with batching. So congestion control as a variable is
important.
cheers,
jamal
^ permalink raw reply [flat|nested] 34+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 22:22 ` David Miller
@ 2007-08-08 22:53 ` jamal
0 siblings, 0 replies; 34+ messages in thread
From: jamal @ 2007-08-08 22:53 UTC (permalink / raw)
To: David Miller
Cc: johnpol, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
kumarkr, rdreier, mcarlson, netdev, jagana, general, mchan, tgraf,
jeff, shemminger, kaber, sri
On Wed, 2007-08-08 at 15:22 -0700, David Miller wrote:
> The driver path, however, does not exist on an island and what
> we care about is the final result with the changes running
> inside the full system.
>
> So, to be honest, besides for initial internal development
> feedback, the isolated tests only have minimal merit and
> it's the full protocol tests that are really interesting.
But you cant go there if cant show the path which is supposedly improved
has indeed improved;-> I would certainly agree with you that if it
doesnt prove consistently useful with protocols it has no value
(remember thats why i never submitted these patches all this time).
We just need better analysis of the results - i cant ignore that the
selection of the clock sources for example gives me different results
and that when i boot i cant be guaranteed the same clock source. I cant
ignore the fact that i get different results when i use a different
congestion control algorithm. And none of this has to do with the
batching patches.
I am using UDP at the moment because it is simpler to analyze. And yes,
it would be "an interesting idea that gets shelved" if we cant achieve
any of the expected goals. We've shelved ideas before. BTW, read the
little doc i wrote on the dev->prep_xmit() you may find it interesting.
cheers,
jamal
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 13:42 ` [ofa-general] " Herbert Xu
2007-08-08 15:14 ` jamal
@ 2007-08-09 0:06 ` Shirley Ma
2007-08-09 3:19 ` [ofa-general] " Krishna Kumar2
2 siblings, 0 replies; 34+ messages in thread
From: Shirley Ma @ 2007-08-09 0:06 UTC (permalink / raw)
To: Herbert Xu
Cc: David Miller, gaagaan, general, hadi, jeff, johnpol, kaber,
krkumar2, kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr,
rdreier, rick.jones2, Robert.Olsson, shemminger, sri, tgraf,
Venkata Jagana
Hello Herbert,
> > Not because I think it obviates your work, but rather because I'm
> > curious, could you test a TSO-in-hardware driver converted to
> > batching and see how TSO alone compares to batching for a pure
> > TCP workload?
>
> You could even lower the bar by disabling TSO and enabling
> software GSO.
We had a discuss before. GSO doesn't benefit device which has no HW
checksum (like IPoIB) by inducing an extra copy. And GSO benefits one
stream, batching benefits multiple streams.
Thanks
Shirley
^ permalink raw reply [flat|nested] 34+ messages in thread
* [ofa-general] Re: [PATCH 2/9 Rev3] [core] Add skb_blist & hard_start_xmit_batch
2007-08-08 12:01 ` Evgeniy Polyakov
@ 2007-08-09 3:09 ` Krishna Kumar2
0 siblings, 0 replies; 34+ messages in thread
From: Krishna Kumar2 @ 2007-08-09 3:09 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: jagana, mcarlson, herbert, gaagaan, Robert.Olsson, kumarkr,
rdreier, peter.p.waskiewicz.jr, hadi, kaber, jeff, general, mchan,
tgraf, netdev, shemminger, davem, sri
Hi Evgeniy,
Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote on 08/08/2007 05:31:43 PM:
> > +int dev_change_tx_batch_skb(struct net_device *dev, unsigned long
new_batch_skb)
> > +{
> > + int ret = 0;
> > + struct sk_buff_head *blist;
> > +
> > + if (!dev->hard_start_xmit_batch) {
> > + /* Driver doesn't support batching skb API */
> > + ret = -ENOTSUPP;
> > + goto out;
> > + }
> > +
> > + /* Handle invalid argument */
> > + if (new_batch_skb < 0) {
> > + ret = -EINVAL;
> > + goto out;
> > + }
> It is unsigned, how can it be less than zero?
Yuck, originally I had it as int and changed to ulong and forgot to
remove this check.
> And actually you use it just like a binary flag (casted to/from u32 in
> the code, btw), so why not using ethtool_value directly here?
I still need to check if the value is changing, so the one check is needed.
Later I am using it as a value directly.
> > + /* Check if new value is same as the current */
> > + if (!!dev->skb_blist == !!new_batch_skb)
> > + goto out;
> > +
> > + if (new_batch_skb &&
> > + (blist = kmalloc(sizeof *blist, GFP_KERNEL)) == NULL) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > +
> > + spin_lock(&dev->queue_lock);
> > + if (new_batch_skb) {
> > + skb_queue_head_init(blist);
> > + dev->skb_blist = blist;
> > + } else
> > + free_batching(dev);
> > + spin_unlock(&dev->queue_lock);
>
> This needs bh lock too, since blist is accessed by qdisc_restart.
Yes, had it in the code, put it in the list of changes, but missed it for
some reason :(
> > +int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev)
> > +{
> > + if (!list_empty(&ptype_all))
> > + dev_queue_xmit_nit(skb, dev);
> > +
> > + if (netif_needs_gso(dev, skb)) {
> > + if (unlikely(dev_gso_segment(skb))) {
> > + kfree(skb);
> > + return 0;
> > + }
> > +
> > + if (skb->next) {
> > + int count = 0;
> > +
> > + do {
> > + struct sk_buff *nskb = skb->next;
> > +
> > + skb->next = nskb->next;
> > + __skb_queue_tail(dev->skb_blist, nskb);
> > + count++;
> > + } while (skb->next);
>
> Is it possible to move list without iterating over each entry?
Though I cannot see something obvious to do that, let me see if
something is possible as it will make a good difference.
thanks for your suggestions,
- KK
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching
2007-08-08 12:14 ` [ofa-general] " Evgeniy Polyakov
@ 2007-08-09 3:13 ` Krishna Kumar2
0 siblings, 0 replies; 34+ messages in thread
From: Krishna Kumar2 @ 2007-08-09 3:13 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: davem, gaagaan, general, hadi, herbert, jagana, jeff, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma
Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote on 08/08/2007 05:44:02 PM:
> On Wed, Aug 08, 2007 at 03:01:45PM +0530, Krishna Kumar
(krkumar2@in.ibm.com) wrote:
> > +static inline int get_skb(struct net_device *dev, struct Qdisc *q,
> > + struct sk_buff_head *blist, struct sk_buff **skbp)
> > +{
> > + if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <=
1))) {
> > + return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
> > + } else {
> > + int max = dev->tx_queue_len - skb_queue_len(blist);
> > + struct sk_buff *skb;
> > +
> > + while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
> > + max -= dev_add_skb_to_blist(skb, dev);
> > +
> > + *skbp = NULL;
> > + return 1; /* we have atleast one skb in blist */
> > + }
> > +}
>
> Same here - is it possible to get a list in one go instead of pulling
> one-by-one, since it forces quite a few additional unneded lock
> get/releases. What about dev_dequeue_number_skb(dev, q, num), which will
> grab the lock and move a list of skbs from one queue to provided head.
OK, I will try this out.
> > @@ -158,7 +198,10 @@ static inline int qdisc_restart(struct n
> > /* And release queue */
> > spin_unlock(&dev->queue_lock);
> >
> > - ret = dev_hard_start_xmit(skb, dev);
> > + if (likely(skb))
> > + ret = dev_hard_start_xmit(skb, dev);
> > + else
> > + ret = dev->hard_start_xmit_batch(dev);
>
> Perfectionism says that having array of two functions and calling one of
> them via array_func_pointer[!!skb] will be much faster. Just a though.
> It is actually much faster than if/else on x86 at least.
Thinking about this - I will have to store the 2 pointer array in dev
itself wasting some space, and also fn pointer will have wrong signature
as one takes an extra argument. Will ponder some more :)
thanks,
- KK
^ permalink raw reply [flat|nested] 34+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 13:42 ` [ofa-general] " Herbert Xu
2007-08-08 15:14 ` jamal
2007-08-09 0:06 ` Shirley Ma
@ 2007-08-09 3:19 ` Krishna Kumar2
2 siblings, 0 replies; 34+ messages in thread
From: Krishna Kumar2 @ 2007-08-09 3:19 UTC (permalink / raw)
To: Herbert Xu
Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, mcarlson,
peter.p.waskiewicz.jr, hadi, kaber, netdev, general, mchan, tgraf,
sri, shemminger, David Miller, rdreier
Herbert Xu <herbert@gondor.apana.org.au> wrote on 08/08/2007 07:12:47 PM:
> On Wed, Aug 08, 2007 at 03:49:00AM -0700, David Miller wrote:
> >
> > Not because I think it obviates your work, but rather because I'm
> > curious, could you test a TSO-in-hardware driver converted to
> > batching and see how TSO alone compares to batching for a pure
> > TCP workload?
>
> You could even lower the bar by disabling TSO and enabling
> software GSO.
I will try with E1000 (though I didn't see improvement when I tested a long
time back). The difference I expect is that TSO would help with large
packets and not necessarily small/medium packets and not definitely in
the case of multiple different skbs (as opposed to single large skb)
getting
queue'd. I think these are two different workloads.
> > I personally don't think it will help for that case at all as
> > TSO likely does better job of coalescing the work _and_ reducing
> > bus traffic as well as work in the TCP stack.
>
> I agree. I suspect the bulk of the effort is in getting
> these skb's created and processed by the stack so that by
> the time that they're exiting the qdisc there's not much
> to be saved anymore.
However, I am getting a large improvement for IPoIB specifically for this
same case. The reason - batching will help only when queue gets full and
stopped (and to a lesser extent if tx lock was not got, which results
in fewer amount of batching that can be done).
thanks,
- KK
^ permalink raw reply [flat|nested] 34+ messages in thread
* [ofa-general] Re: [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching
2007-08-08 14:05 ` Patrick McHardy
2007-08-08 15:26 ` [ofa-general] " jamal
@ 2007-08-09 4:06 ` Krishna Kumar2
1 sibling, 0 replies; 34+ messages in thread
From: Krishna Kumar2 @ 2007-08-09 4:06 UTC (permalink / raw)
To: Patrick McHardy
Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
rdreier, peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general,
mchan, tgraf, netdev, shemminger, davem, sri
Hi Patrick,
Patrick McHardy <kaber@trash.net> wrote on 08/08/2007 07:35:17 PM:
> Krishna Kumar wrote:
<snip>
> > +static inline int get_skb(struct net_device *dev, struct Qdisc *q,
> > + struct sk_buff_head *blist, struct sk_buff **skbp)
> > +{
> > + if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <=
1))) {
> > + return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
> > + } else {
> > + int max = dev->tx_queue_len - skb_queue_len(blist);
> > + struct sk_buff *skb;
> > +
> > + while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
> > + max -= dev_add_skb_to_blist(skb, dev);
> > +
> > + *skbp = NULL;
> > + return 1; /* we have atleast one skb in blist */
> > + }
> > +}
>
> I think you missed my previous reply to this in the flood of
> responses (or I missed your reply), so I'm copying it below:
Sorry, but I didn't get your post on this point earlier (thanks for
posting again).
> The entire idea of a single queue after qdiscs that is refilled
> independantly of transmissions times etc. make be worry a bit.
> By changing the timing you're effectively changing the qdiscs
> behaviour, at least in some cases. SFQ is a good example, but I
> believe it affects most work-conserving qdiscs. Think of this
> situation:
>
> 100 packets of flow 1 arrive
> 50 packets of flow 1 are sent
> 100 packets for flow 2 arrive
> remaining packets are sent
>
> On the wire you'll first see 50 packets of flow 1, than 100 packets
> alternate of flow 1 and 2, then 50 packets flow 2.
>
> With your additional queue all packets of flow 1 are pulled out of
> the qdisc immediately and put in the fifo. When the 100 packets of
> the second flow arrive they will also get pulled out immediately
> and are put in the fifo behind the remaining 50 packets of flow 1.
> So what you get on the wire is:
>
> 100 packets of flow 1
> 100 packets of flow 1
In normal case (qdisc run from xmit and not from tx_action), the code
executing is the same as regular code without any difference in wire
behavior due to the check:
if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)))
{
return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
(I always pass blist as NULL).
With SFQ for the above scenario, my code will first send out 50 F1 skbs
iteratively (blist==NULL), get blocked so that 50 skbs accumulate on F1
and 100 on F2, then when it is woken up, it will batch but it will pick
up 50 F1 and 50 F2 skbs in alternate order and put in the queue, and
finally pick up the remaining 50 F2 skbs and put those too in the queue.
Since I am calling dev_dequeue_skb, I am assured of RR when using SFQ.
The only time I would have 100 skbs from F1 in my queue (and hence sent
out first) is if there are NO F2 skbs.
> So SFQ is without any effect. This is not completely avoidable of
> course, but you can and should limit the damage by only pulling
> out as much packets as the driver can take and have the driver
> stop the queue afterwards.
I feel this cannot happen. Please correct me if I am wrong.
Thanks,
- KK
^ permalink raw reply [flat|nested] 34+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 22:01 ` [ofa-general] " David Miller
@ 2007-08-09 4:19 ` Krishna Kumar2
2007-08-09 4:27 ` David Miller
0 siblings, 1 reply; 34+ messages in thread
From: Krishna Kumar2 @ 2007-08-09 4:19 UTC (permalink / raw)
To: David Miller
Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, hadi, mcarlson, netdev, general, mchan,
tgraf, sri, shemminger, kaber, herbert
Hi Dave,
David Miller <davem@davemloft.net> wrote on 08/09/2007 03:31:37 AM:
> > What do you generally think of the patch/implementation ? :)
>
> We have two driver implementation paths on recieve and now
> we'll have two on send, and that's not a good trend.
Correct.
> In an ideal world all the drivers would be NAPI and netif_rx()
> would only be used by tunneling drivers and similar in the
> protocol layers. And likewise all sends would go through
> ->hard_start_xmit().
>
> If you can come up with a long term strategy that gets rid of
> the special transmit method, that'd be great.
>
> We should make Linux network drivers easy to write, not more difficult
> by constantly adding most interfaces than we consolidate.
I think that is a good top level view, and I agree with that.
Patrick had suggested calling dev_hard_start_xmit() instead of
conditionally calling the new API and to remove the new API
entirely. The driver determines whether batching is required or
not depending on (skb==NULL) or not. Would that approach be fine
with this "single interface" goal ?
Thanks,
- KK
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-09 4:19 ` Krishna Kumar2
@ 2007-08-09 4:27 ` David Miller
2007-08-09 6:26 ` Krishna Kumar2
0 siblings, 1 reply; 34+ messages in thread
From: David Miller @ 2007-08-09 4:27 UTC (permalink / raw)
To: krkumar2
Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma
From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Thu, 9 Aug 2007 09:49:57 +0530
> Patrick had suggested calling dev_hard_start_xmit() instead of
> conditionally calling the new API and to remove the new API
> entirely. The driver determines whether batching is required or
> not depending on (skb==NULL) or not. Would that approach be fine
> with this "single interface" goal ?
It is a valid posibility.
Note that this is similar to how we handle TSO, the driver
sets the feature bit and in its ->hard_start_xmit() it checks
the SKB for the given offload property.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-09 4:27 ` David Miller
@ 2007-08-09 6:26 ` Krishna Kumar2
0 siblings, 0 replies; 34+ messages in thread
From: Krishna Kumar2 @ 2007-08-09 6:26 UTC (permalink / raw)
To: David Miller
Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma
David Miller <davem@davemloft.net> wrote on 08/09/2007 09:57:27 AM:
>
> > Patrick had suggested calling dev_hard_start_xmit() instead of
> > conditionally calling the new API and to remove the new API
> > entirely. The driver determines whether batching is required or
> > not depending on (skb==NULL) or not. Would that approach be fine
> > with this "single interface" goal ?
>
> It is a valid posibility.
>
> Note that this is similar to how we handle TSO, the driver
> sets the feature bit and in its ->hard_start_xmit() it checks
> the SKB for the given offload property.
Great, I will try to get rid of two paths entirely, and see how to
re-arrange the code cleanly.
thanks,
- KK
^ permalink raw reply [flat|nested] 34+ messages in thread
* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
2007-08-08 10:49 ` [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB David Miller
2007-08-08 11:09 ` Krishna Kumar2
2007-08-08 13:42 ` [ofa-general] " Herbert Xu
@ 2007-08-14 9:02 ` Krishna Kumar2
2 siblings, 0 replies; 34+ messages in thread
From: Krishna Kumar2 @ 2007-08-14 9:02 UTC (permalink / raw)
To: David Miller
Cc: jagana, johnpol, gaagaan, jeff, Robert.Olsson, kumarkr, rdreier,
peter.p.waskiewicz.jr, hadi, mcarlson, netdev, general, mchan,
tgraf, sri, shemminger, kaber, herbert
Hi Dave,
David Miller <davem@davemloft.net> wrote on 08/08/2007 04:19:00 PM:
> From: Krishna Kumar <krkumar2@in.ibm.com>
> Date: Wed, 08 Aug 2007 15:01:14 +0530
>
> > RESULTS: The performance improvement for TCP No Delay is in the range
of -8%
> > to 320% (with -8% being the sole negative), with many individual
tests
> > giving 50% or more improvement (I think it is to do with the hw
slots
> > getting full quicker resulting in more batching when the queue gets
> > woken). The results for TCP is in the range of -11% to 93%, with
most
> > of the tests (8/12) giving improvements.
>
> Not because I think it obviates your work, but rather because I'm
> curious, could you test a TSO-in-hardware driver converted to
> batching and see how TSO alone compares to batching for a pure
> TCP workload?
>
> I personally don't think it will help for that case at all as
> TSO likely does better job of coalescing the work _and_ reducing
> bus traffic as well as work in the TCP stack.
I used E1000 (guess the choice is OK as e1000_tso returns TRUE. My
hw is 82547GI).
You are right, it doesn't help TSO case at all (infact degrades). Two
things to note though:
- E1000 may not be suitable for adding batching (which is no
longer a new API, as I have changed it already).
- Small skbs where TSO doesn't come into picture still seems to
improve. A couple of cases for large skbs did result in some
improved (like 4K, TCP No Delay, 32 procs).
[Total segments retransmission for original code test run: 2220 & for
new code test run: 1620. So the retransmission problem that I was
getting seems to be an IPoIB bug, though I did have to fix one bug
in my networking component where I was calling qdisc_run(NULL) for
regular xmit path and change to always use batching. The problem is
that skb1 - skb10 may be present in the queue after each of them
failed to be sent out, then net_tx_action fires which batches all of
these into the blist and tries to send them out again, which also
fails (eg tx lock fail or queue full), then the next single skb xmit
will send the latest skb ignoring the 10 skbs that are already waiting
in the batching list. These 10 skbs are sent out only the next time
net_tx_action is called, so out of order skbs result. This fix reduced
retransmissions from 180,000 to 55,000 or so. When I changed IPoIB
driver to use iterative sends of each skb instead of creating multiple
Work Request's, that number went down to 15].
I ran 3 iterations of 45 sec tests (total 1 hour 16 min, but I will
run a longer one tonight). The results are (results in KB/s, and %):
Test Case Org BW New BW % Change
TCP
--------
Size:32 Procs:1 1848 3918 112.01
Size:32 Procs:8 21888 21555 -1.52
Size:32 Procs:32 19317 22433 16.13
Size:256 Procs:1 15584 25991 66.78
Size:256 Procs:8 110937 74565 -32.78
Size:256 Procs:32 105767 98967 -6.42
Size:4096 Procs:1 81910 96073 17.29
Size:4096 Procs:8 113302 94040 -17.00
Size:4096 Procs:32 109664 105522 -3.77
TCP No Delay:
--------------
Size:32 Procs:1 2688 3177 18.19
Size:32 Procs:8 6568 10588 61.20
Size:32 Procs:32 6573 7838 19.24
Size:256 Procs:1 7869 12724 61.69
Size:256 Procs:8 65652 45652 -30.46
Size:256 Procs:32 95114 112279 18.04
Size:4096 Procs:1 95302 84664 -11.16
Size:4096 Procs:8 111119 89111 -19.80
Size:4096 Procs:32 109249 113919 4.27
I will submit Rev4 with suggested changes (including single merged
API) on Thursday after some more testing.
Thanks,
- KK
^ permalink raw reply [flat|nested] 34+ messages in thread
end of thread, other threads:[~2007-08-14 9:02 UTC | newest]
Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-08-08 9:31 [ofa-general] [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB Krishna Kumar
2007-08-08 9:31 ` [PATCH 1/9 Rev3] [Doc] HOWTO Documentation for batching Krishna Kumar
2007-08-08 9:31 ` [PATCH 2/9 Rev3] [core] Add skb_blist & hard_start_xmit_batch Krishna Kumar
2007-08-08 10:59 ` [ofa-general] " Stephen Hemminger
2007-08-08 11:24 ` Krishna Kumar2
2007-08-08 12:01 ` Evgeniy Polyakov
2007-08-09 3:09 ` Krishna Kumar2
2007-08-08 9:31 ` [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching Krishna Kumar
2007-08-08 12:14 ` [ofa-general] " Evgeniy Polyakov
2007-08-09 3:13 ` Krishna Kumar2
2007-08-08 14:05 ` Patrick McHardy
2007-08-08 15:26 ` [ofa-general] " jamal
2007-08-09 4:06 ` Krishna Kumar2
2007-08-08 9:31 ` [PATCH 4/9 Rev3] [ethtool] Add ethtool support Krishna Kumar
2007-08-08 9:32 ` [ofa-general] [PATCH 5/9 Rev3] [IPoIB] Header file changes Krishna Kumar
2007-08-08 9:32 ` [ofa-general] [PATCH 6/9 Rev3] [IPoIB] CM & Multicast changes Krishna Kumar
2007-08-08 9:32 ` [ofa-general] [PATCH 7/9 Rev3] [IPoIB] Verb changes Krishna Kumar
2007-08-08 9:32 ` [PATCH 8/9 Rev3] [IPoIB] Post and work completion handler changes Krishna Kumar
2007-08-08 9:32 ` [PATCH 9/9 Rev3] [IPoIB] Implement the new batching API Krishna Kumar
2007-08-08 10:49 ` [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB David Miller
2007-08-08 11:09 ` Krishna Kumar2
2007-08-08 22:01 ` [ofa-general] " David Miller
2007-08-09 4:19 ` Krishna Kumar2
2007-08-09 4:27 ` David Miller
2007-08-09 6:26 ` Krishna Kumar2
2007-08-08 13:42 ` [ofa-general] " Herbert Xu
2007-08-08 15:14 ` jamal
2007-08-08 20:55 ` Stephen Hemminger
2007-08-08 22:40 ` jamal
2007-08-08 22:22 ` David Miller
2007-08-08 22:53 ` jamal
2007-08-09 0:06 ` Shirley Ma
2007-08-09 3:19 ` [ofa-general] " Krishna Kumar2
2007-08-14 9:02 ` Krishna Kumar2
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).