* [PATCH 00/12 -Rev2] Implement batching skb API
@ 2007-07-22 9:04 Krishna Kumar
2007-07-22 9:05 ` [ofa-general] [PATCH 01/12 -Rev2] HOWTO documentation for Batching SKB Krishna Kumar
` (12 more replies)
0 siblings, 13 replies; 24+ messages in thread
From: Krishna Kumar @ 2007-07-22 9:04 UTC (permalink / raw)
To: davem, rdreier
Cc: johnpol, Robert.Olsson, peter.p.waskiewicz.jr, kumarkr, herbert,
gaagaan, mcarlson, xma, rick.jones2, hadi, jeff, general, mchan,
tgraf, netdev, jagana, kaber, Krishna Kumar, sri
This set of patches implements the batching API, and makes the following
changes resulting from the review of the first set:
Changes :
---------
1. Changed skb_blist from a pointer to an embedded (static) member, as
the pointer saved only 12 bytes (i386) while bloating the code.
2. Removed the requirement for a driver to set "features & NETIF_F_BATCH_SKBS"
in register_netdev to enable batching, as it was redundant. Renamed this
flag to NETIF_F_BATCH_ON; it is now set by register_netdev, and other
user-changeable calls can modify this bit to enable/disable batching.
3. Added ethtool support to enable/disable batching (not tested).
4. Added rtnetlink support to enable/disable batching (not tested).
5. Removed MIN_QUEUE_LEN_BATCH for batching, since high-performance drivers
should not have a small queue anyway (it was adding bloat).
6. skbs are purged from dev_deactivate instead of from unregister_netdev
to drop all references to the device.
7. Removed the changelog from the source code of sch_generic.c, along with
unrelated renames in that file (lockless, comments).
8. Removed xmit_slots entirely, as it added bloat (code and header)
without adding value: it was calculated and set twice (in the internal
send routine and in work-completion handling) but referenced only once
(in batch xmit), and can instead be calculated once in xmit.
Issues :
--------
1. Should /sysfs support be removed completely?
2. Is rtnetlink support required, given that GSO has only ethtool?
Patches are described as:
Mail 0/12 : This mail.
Mail 1/12 : HOWTO documentation.
Mail 2/12 : Changes to netdevice.h
Mail 3/12 : dev.c changes.
Mail 4/12 : Ethtool changes.
Mail 5/12 : sysfs changes.
Mail 6/12 : rtnetlink changes.
Mail 7/12 : Change in qdisc_run & qdisc_restart API, modify callers
to use this API.
Mail 8/12 : IPoIB include file changes.
Mail 9/12 : IPoIB verbs changes
Mail 10/12 : IPoIB multicast, CM changes
Mail 11/12 : IPoIB xmit API addition
Mail 12/12 : IPoIB xmit internals changes (ipoib_ib.c)
I have started a 10-run test for various buffer sizes and process counts,
and will post the results on Monday.
Please review and provide feedback/ideas, and consider for inclusion.
Thanks,
- KK
^ permalink raw reply [flat|nested] 24+ messages in thread
* [ofa-general] [PATCH 01/12 -Rev2] HOWTO documentation for Batching SKB.
2007-07-22 9:04 [PATCH 00/12 -Rev2] Implement batching skb API Krishna Kumar
@ 2007-07-22 9:05 ` Krishna Kumar
2007-07-22 9:05 ` [PATCH 02/12 -Rev2] Changes to netdevice.h Krishna Kumar
` (11 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: Krishna Kumar @ 2007-07-22 9:05 UTC (permalink / raw)
To: davem, rdreier
Cc: johnpol, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
kumarkr, mcarlson, netdev, jagana, general, mchan, tgraf, jeff,
hadi, kaber, sri
diff -ruNp org/Documentation/networking/Batching_skb_API.txt rev2/Documentation/networking/Batching_skb_API.txt
--- org/Documentation/networking/Batching_skb_API.txt 1970-01-01 05:30:00.000000000 +0530
+++ rev2/Documentation/networking/Batching_skb_API.txt 2007-07-20 16:09:45.000000000 +0530
@@ -0,0 +1,91 @@
+ HOWTO for batching skb API support
+ -----------------------------------
+
+Section 1: What is the batching skb API?
+Section 2: How the batching API works vs. the original API
+Section 3: How drivers can support this API
+Section 4: How users can work with this API
+
+
+Introduction: Kernel support for batching skb
+-----------------------------------------------
+
+The netdevice layer supports an extended API that is very similar to the
+existing hard_start_xmit() API. Drivers wishing to take advantage of this
+new API should implement the new routine much as they would write a
+hard_start_xmit handler. The difference between the two APIs is that
+while the existing hard_start_xmit processes one skb per call, the new
+API can process multiple skbs (or even one) in a single call. A driver
+writer can also re-use most of the code from the existing API in the new
+API without duplicating code.
+
+
+Section 1: What is the batching skb API?
+----------------------------------------
+
+ This is a new API that a driver may optionally export. The
+ prerequisite for a driver to use this API is a reasonably sized
+ hardware queue that can process multiple skbs.
+
+
+Section 2: How the batching API works vs. the original API
+----------------------------------------------------------
+
+ The networking stack is normally called from upper-layer protocols
+ with a single skb to transmit. This skb is first enqueued, and an
+ attempt is then made to transmit it immediately (via qdisc_run).
+ However, events like driver lock contention or a stopped queue can
+ result in the skb not being sent out, so it remains in the queue.
+ When a new xmit is called or the queue is re-enabled, qdisc_run
+ could potentially find multiple packets in the queue and would have
+ to send them out one by one.
+
+ The batching skb API was added to exploit this situation: if there
+ are multiple skbs, all of them can be sent to the device in one
+ shot. This reduces driver processing; locking at the driver (or
+ in the stack, for non-LLTX drivers) gets amortized over multiple
+ skbs; and for drivers where every xmit results in completion
+ processing (like IPoIB), the driver can be optimized to request a
+ completion only for the last skb sent, saving an interrupt for
+ every skb in the batch but the last.
+
+ This batching can result in significant performance gains for
+ systems that have multiple data stream paths over the same network
+ interface card.
+
+
+Section 3: How drivers can support this API
+-------------------------------------------
+
+ The new API - dev->hard_start_xmit_batch(struct net_device *dev) -
+ can be written almost identically to the regular xmit API
+ (hard_start_xmit), except that the driver should process all skbs
+ on dev->skb_blist instead of just one. The new API receives no skb
+ argument; instead it picks up all the skbs from dev->skb_blist,
+ where the stack added them, and tries to send them out.
+
+ To support batching, a driver only needs to point
+ dev->hard_start_xmit_batch at its implementation of the new API;
+ register_netdev then sets the NETIF_F_BATCH_ON bit in dev->features.
+
+
+Section 4: How users can work with this API
+-------------------------------------------
+
+ Batching can be disabled for a particular device, e.g. on desktop
+ systems where only one stream of network activity uses that device,
+ since the extra processing that batching adds could slightly affect
+ performance. Batching can be enabled when more than one stream of
+ network activity per device is expected, e.g. on servers, or even
+ for desktop usage with multiple browser, chat, and file transfer
+ sessions running concurrently.
+
+ Per device batching can be enabled/disabled using:
+
+ echo 1 > /sys/class/net/<device-name>/tx_batch_skbs (enable)
+ echo 0 > /sys/class/net/<device-name>/tx_batch_skbs (disable)
+
+ E.g. to enable batching on eth0, run:
+ echo 1 > /sys/class/net/eth0/tx_batch_skbs
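To make the single-vs-batched contrast concrete, here is a small userspace C sketch. The types and names (pkt, pkt_queue, driver_xmit_batch) are hypothetical stand-ins for sk_buff, sk_buff_head, and the new driver hook, not the kernel API itself:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sk_buff */
struct pkt {
	struct pkt *next;
	int len;
};

/* Singly linked FIFO, analogous to struct sk_buff_head */
struct pkt_queue {
	struct pkt *head, *tail;
	int qlen;
};

static void pkt_enqueue(struct pkt_queue *q, struct pkt *p)
{
	p->next = NULL;
	if (q->tail)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
	q->qlen++;
}

/*
 * Batched xmit: one "driver call" consumes the whole queue, so
 * per-call costs (driver lock, device doorbell) are paid once per
 * batch instead of once per packet.
 */
static int driver_xmit_batch(struct pkt_queue *q)
{
	int sent = 0;

	while (q->head) {
		struct pkt *p = q->head;

		q->head = p->next;
		sent++;		/* a real driver would post p here */
	}
	q->tail = NULL;
	q->qlen = 0;
	return sent;
}
```

With batching, the costs that matter are paid once per drain rather than once per packet, which is where the gain for multiple concurrent streams comes from.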
* [PATCH 02/12 -Rev2] Changes to netdevice.h
2007-07-22 9:04 [PATCH 00/12 -Rev2] Implement batching skb API Krishna Kumar
2007-07-22 9:05 ` [ofa-general] [PATCH 01/12 -Rev2] HOWTO documentation for Batching SKB Krishna Kumar
@ 2007-07-22 9:05 ` Krishna Kumar
2007-07-22 17:06 ` [ofa-general] " Patrick McHardy
2007-07-22 9:05 ` [ofa-general] [PATCH 03/12 -Rev2] dev.c changes Krishna Kumar
` (10 subsequent siblings)
12 siblings, 1 reply; 24+ messages in thread
From: Krishna Kumar @ 2007-07-22 9:05 UTC (permalink / raw)
To: davem, rdreier
Cc: johnpol, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
kumarkr, xma, rick.jones2, mcarlson, netdev, jagana, general,
mchan, tgraf, jeff, hadi, kaber, Krishna Kumar, sri
diff -ruNp org/include/linux/netdevice.h rev2/include/linux/netdevice.h
--- org/include/linux/netdevice.h 2007-07-20 07:49:28.000000000 +0530
+++ rev2/include/linux/netdevice.h 2007-07-22 13:20:16.000000000 +0530
@@ -340,6 +340,7 @@ struct net_device
#define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */
#define NETIF_F_GSO 2048 /* Enable software GSO. */
#define NETIF_F_LLTX 4096 /* LockLess TX */
+#define NETIF_F_BATCH_ON 8192 /* Batching skbs xmit API is enabled */
#define NETIF_F_MULTI_QUEUE 16384 /* Has multiple TX/RX queues */
/* Segmentation offload features */
@@ -452,6 +453,7 @@ struct net_device
struct Qdisc *qdisc_sleeping;
struct list_head qdisc_list;
unsigned long tx_queue_len; /* Max frames per queue allowed */
+ struct sk_buff_head skb_blist; /* List of batch skbs */
/* Partially transmitted GSO packet. */
struct sk_buff *gso_skb;
@@ -472,6 +474,9 @@ struct net_device
void *priv; /* pointer to private data */
int (*hard_start_xmit) (struct sk_buff *skb,
struct net_device *dev);
+ int (*hard_start_xmit_batch) (struct net_device
+ *dev);
+
/* These may be needed for future network-power-down code. */
unsigned long trans_start; /* Time (in jiffies) of last Tx */
@@ -582,6 +587,8 @@ struct net_device
#define NETDEV_ALIGN 32
#define NETDEV_ALIGN_CONST (NETDEV_ALIGN - 1)
+#define BATCHING_ON(dev) ((dev->features & NETIF_F_BATCH_ON) != 0)
+
static inline void *netdev_priv(const struct net_device *dev)
{
return dev->priv;
@@ -832,6 +839,8 @@ extern int dev_set_mac_address(struct n
struct sockaddr *);
extern int dev_hard_start_xmit(struct sk_buff *skb,
struct net_device *dev);
+extern int dev_add_skb_to_blist(struct sk_buff *skb,
+ struct net_device *dev);
extern void dev_init(void);
@@ -1104,6 +1113,8 @@ extern void dev_set_promiscuity(struct
extern void dev_set_allmulti(struct net_device *dev, int inc);
extern void netdev_state_change(struct net_device *dev);
extern void netdev_features_change(struct net_device *dev);
+extern int dev_change_tx_batching(struct net_device *dev,
+ unsigned long new_batch_skb);
/* Load a device via the kmod */
extern void dev_load(const char *name);
extern void dev_mcast_init(void);
* [ofa-general] [PATCH 03/12 -Rev2] dev.c changes.
2007-07-22 9:04 [PATCH 00/12 -Rev2] Implement batching skb API Krishna Kumar
2007-07-22 9:05 ` [ofa-general] [PATCH 01/12 -Rev2] HOWTO documentation for Batching SKB Krishna Kumar
2007-07-22 9:05 ` [PATCH 02/12 -Rev2] Changes to netdevice.h Krishna Kumar
@ 2007-07-22 9:05 ` Krishna Kumar
2007-07-23 10:44 ` [ofa-general] " Evgeniy Polyakov
2007-07-22 9:05 ` [ofa-general] [PATCH 04/12 -Rev2] Ethtool changes Krishna Kumar
` (9 subsequent siblings)
12 siblings, 1 reply; 24+ messages in thread
From: Krishna Kumar @ 2007-07-22 9:05 UTC (permalink / raw)
To: davem, rdreier
Cc: johnpol, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
kumarkr, mcarlson, jagana, general, netdev, tgraf, jeff, hadi,
kaber, mchan, sri
diff -ruNp org/net/core/dev.c rev2/net/core/dev.c
--- org/net/core/dev.c 2007-07-20 07:49:28.000000000 +0530
+++ rev2/net/core/dev.c 2007-07-21 23:08:33.000000000 +0530
@@ -875,6 +875,48 @@ void netdev_state_change(struct net_devi
}
}
+/*
+ * dev_change_tx_batching - Enable or disable batching for a driver that
+ * supports batching.
+ */
+int dev_change_tx_batching(struct net_device *dev, unsigned long new_batch_skb)
+{
+ int ret;
+
+ if (!dev->hard_start_xmit_batch) {
+ /* Driver doesn't support skb batching */
+ ret = -ENOTSUPP;
+ goto out;
+ }
+
+ /* Handle invalid argument */
+ if (new_batch_skb < 0) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ ret = 0;
+
+ /* Check if new value is same as the current */
+ if (!!(dev->features & NETIF_F_BATCH_ON) == !!new_batch_skb)
+ goto out;
+
+ spin_lock(&dev->queue_lock);
+ if (new_batch_skb) {
+ dev->features |= NETIF_F_BATCH_ON;
+ dev->tx_queue_len >>= 1;
+ } else {
+ if (!skb_queue_empty(&dev->skb_blist))
+ skb_queue_purge(&dev->skb_blist);
+ dev->features &= ~NETIF_F_BATCH_ON;
+ dev->tx_queue_len <<= 1;
+ }
+ spin_unlock(&dev->queue_lock);
+
+out:
+ return ret;
+}
+
/**
* dev_load - load a network module
* @name: name of interface
@@ -1414,6 +1456,45 @@ static int dev_gso_segment(struct sk_buf
return 0;
}
+/*
+ * Add skb (skbs in case segmentation is required) to dev->skb_blist. We are
+ * holding QDISC RUNNING bit, so no one else can add to this list. Also, skbs
+ * are dequeued from this list when we call the driver, so the list is safe
+ * from simultaneous deletes too.
+ *
+ * Returns count of successful skb(s) added to skb_blist.
+ */
+int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev)
+{
+ if (!list_empty(&ptype_all))
+ dev_queue_xmit_nit(skb, dev);
+
+ if (netif_needs_gso(dev, skb)) {
+ if (unlikely(dev_gso_segment(skb))) {
+ kfree_skb(skb);
+ return 0;
+ }
+
+ if (skb->next) {
+ int count = 0;
+
+ do {
+ struct sk_buff *nskb = skb->next;
+
+ skb->next = nskb->next;
+ __skb_queue_tail(&dev->skb_blist, nskb);
+ count++;
+ } while (skb->next);
+
+ skb->destructor = DEV_GSO_CB(skb)->destructor;
+ kfree_skb(skb);
+ return count;
+ }
+ }
+ __skb_queue_tail(&dev->skb_blist, skb);
+ return 1;
+}
+
int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
if (likely(!skb->next)) {
@@ -3397,6 +3483,12 @@ int register_netdevice(struct net_device
}
}
+ if (dev->hard_start_xmit_batch) {
+ dev->features |= NETIF_F_BATCH_ON;
+ skb_queue_head_init(&dev->skb_blist);
+ dev->tx_queue_len >>= 1;
+ }
+
/*
* nil rebuild_header routine,
* that should be never called and used as just bug trap.
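The GSO branch of dev_add_skb_to_blist() above unlinks each segment from the skb->next chain and appends it to the batch list. The unlinking loop in isolation, as a userspace sketch (struct seg and chain_to_list are hypothetical names; the kernel version additionally transfers the destructor and frees the header skb):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the ->next-chained GSO segments */
struct seg {
	struct seg *next;
};

/*
 * Unlink every segment chained off head->next and append it to a
 * NULL-terminated list, mirroring the __skb_queue_tail() loop in
 * dev_add_skb_to_blist().  Returns the number of segments moved.
 */
static int chain_to_list(struct seg *head, struct seg **list_tail)
{
	int count = 0;

	while (head->next) {
		struct seg *n = head->next;

		head->next = n->next;	/* unlink from the GSO chain */
		n->next = NULL;
		*list_tail = n;		/* append to the batch list */
		list_tail = &n->next;
		count++;
	}
	return count;
}
```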
* [ofa-general] [PATCH 04/12 -Rev2] Ethtool changes
2007-07-22 9:04 [PATCH 00/12 -Rev2] Implement batching skb API Krishna Kumar
` (2 preceding siblings ...)
2007-07-22 9:05 ` [ofa-general] [PATCH 03/12 -Rev2] dev.c changes Krishna Kumar
@ 2007-07-22 9:05 ` Krishna Kumar
2007-07-22 9:05 ` [ofa-general] [PATCH 05/12 -Rev2] sysfs changes Krishna Kumar
` (8 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: Krishna Kumar @ 2007-07-22 9:05 UTC (permalink / raw)
To: davem, rdreier
Cc: johnpol, Robert.Olsson, herbert, gaagaan, kumarkr,
peter.p.waskiewicz.jr, mcarlson, kaber, jagana, general, mchan,
tgraf, jeff, sri, hadi, netdev
diff -ruNp org/include/linux/ethtool.h rev2/include/linux/ethtool.h
--- org/include/linux/ethtool.h 2007-07-21 13:39:50.000000000 +0530
+++ rev2/include/linux/ethtool.h 2007-07-21 13:40:57.000000000 +0530
@@ -414,6 +414,8 @@ struct ethtool_ops {
#define ETHTOOL_SUFO 0x00000022 /* Set UFO enable (ethtool_value) */
#define ETHTOOL_GGSO 0x00000023 /* Get GSO enable (ethtool_value) */
#define ETHTOOL_SGSO 0x00000024 /* Set GSO enable (ethtool_value) */
+#define ETHTOOL_GBTX 0x00000025 /* Get Batching (ethtool_value) */
+#define ETHTOOL_SBTX 0x00000026 /* Set Batching (ethtool_value) */
/* compatibility with older code */
#define SPARC_ETH_GSET ETHTOOL_GSET
diff -ruNp org/net/core/ethtool.c rev2/net/core/ethtool.c
--- org/net/core/ethtool.c 2007-07-21 13:37:17.000000000 +0530
+++ rev2/net/core/ethtool.c 2007-07-21 22:55:38.000000000 +0530
@@ -648,6 +648,26 @@ static int ethtool_set_gso(struct net_de
return 0;
}
+static int ethtool_get_batch(struct net_device *dev, char __user *useraddr)
+{
+ struct ethtool_value edata = { ETHTOOL_GBTX };
+
+ edata.data = BATCHING_ON(dev);
+ if (copy_to_user(useraddr, &edata, sizeof(edata)))
+ return -EFAULT;
+ return 0;
+}
+
+static int ethtool_set_batch(struct net_device *dev, char __user *useraddr)
+{
+ struct ethtool_value edata;
+
+ if (copy_from_user(&edata, useraddr, sizeof(edata)))
+ return -EFAULT;
+
+ return dev_change_tx_batching(dev, edata.data);
+}
+
static int ethtool_self_test(struct net_device *dev, char __user *useraddr)
{
struct ethtool_test test;
@@ -959,6 +979,12 @@ int dev_ethtool(struct ifreq *ifr)
case ETHTOOL_SGSO:
rc = ethtool_set_gso(dev, useraddr);
break;
+ case ETHTOOL_GBTX:
+ rc = ethtool_get_batch(dev, useraddr);
+ break;
+ case ETHTOOL_SBTX:
+ rc = ethtool_set_batch(dev, useraddr);
+ break;
default:
rc = -EOPNOTSUPP;
}
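ethtool_set_batch() above passes the raw user value to dev_change_tx_batching(), which compares it against the feature flag using double negation so any nonzero value means "on" regardless of bit positions. The idiom in isolation (F_BATCH_ON here is a hypothetical stand-in value, not the real NETIF_F_BATCH_ON):

```c
#include <assert.h>

#define F_BATCH_ON 0x2000UL	/* hypothetical stand-in for NETIF_F_BATCH_ON */

/*
 * Nonzero when the requested on/off state already matches the feature
 * bit.  The double negation collapses both sides to 0 or 1, so an
 * arbitrary nonzero user value compares equal to a set feature bit.
 */
static int state_unchanged(unsigned long features, unsigned long requested)
{
	return !!(features & F_BATCH_ON) == !!requested;
}
```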
* [ofa-general] [PATCH 05/12 -Rev2] sysfs changes.
2007-07-22 9:04 [PATCH 00/12 -Rev2] Implement batching skb API Krishna Kumar
` (3 preceding siblings ...)
2007-07-22 9:05 ` [ofa-general] [PATCH 04/12 -Rev2] Ethtool changes Krishna Kumar
@ 2007-07-22 9:05 ` Krishna Kumar
2007-07-22 9:05 ` [ofa-general] [PATCH 06/12 -Rev2] rtnetlink changes Krishna Kumar
` (7 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: Krishna Kumar @ 2007-07-22 9:05 UTC (permalink / raw)
To: davem, rdreier
Cc: johnpol, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
kumarkr, mcarlson, netdev, jagana, general, mchan, tgraf, jeff,
hadi, kaber, sri
diff -ruNp org/net/core/net-sysfs.c rev2/net/core/net-sysfs.c
--- org/net/core/net-sysfs.c 2007-07-20 07:49:28.000000000 +0530
+++ rev2/net/core/net-sysfs.c 2007-07-21 22:56:32.000000000 +0530
@@ -230,6 +230,21 @@ static ssize_t store_weight(struct devic
return netdev_store(dev, attr, buf, len, change_weight);
}
+static ssize_t show_tx_batch_skb(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct net_device *netdev = to_net_dev(dev);
+
+ return sprintf(buf, fmt_dec, BATCHING_ON(netdev));
+}
+
+static ssize_t store_tx_batch_skb(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ return netdev_store(dev, attr, buf, len, dev_change_tx_batching);
+}
+
static struct device_attribute net_class_attributes[] = {
__ATTR(addr_len, S_IRUGO, show_addr_len, NULL),
__ATTR(iflink, S_IRUGO, show_iflink, NULL),
@@ -246,6 +261,8 @@ static struct device_attribute net_class
__ATTR(flags, S_IRUGO | S_IWUSR, show_flags, store_flags),
__ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len,
store_tx_queue_len),
+ __ATTR(tx_batch_skbs, S_IRUGO | S_IWUSR, show_tx_batch_skb,
+ store_tx_batch_skb),
__ATTR(weight, S_IRUGO | S_IWUSR, show_weight, store_weight),
{}
};
* [ofa-general] [PATCH 06/12 -Rev2] rtnetlink changes.
2007-07-22 9:04 [PATCH 00/12 -Rev2] Implement batching skb API Krishna Kumar
` (4 preceding siblings ...)
2007-07-22 9:05 ` [ofa-general] [PATCH 05/12 -Rev2] sysfs changes Krishna Kumar
@ 2007-07-22 9:05 ` Krishna Kumar
2007-07-22 17:10 ` Patrick McHardy
2007-07-22 9:06 ` [ofa-general] [PATCH 07/12 -Rev2] Change qdisc_run & qdisc_restart API, callers Krishna Kumar
` (6 subsequent siblings)
12 siblings, 1 reply; 24+ messages in thread
From: Krishna Kumar @ 2007-07-22 9:05 UTC (permalink / raw)
To: davem, rdreier
Cc: johnpol, Robert.Olsson, herbert, gaagaan, kumarkr,
peter.p.waskiewicz.jr, mcarlson, jagana, general, netdev, tgraf,
jeff, sri, hadi, kaber, mchan
diff -ruNp org/include/linux/if_link.h rev2/include/linux/if_link.h
--- org/include/linux/if_link.h 2007-07-20 16:33:35.000000000 +0530
+++ rev2/include/linux/if_link.h 2007-07-20 16:35:08.000000000 +0530
@@ -78,6 +78,8 @@ enum
IFLA_LINKMODE,
IFLA_LINKINFO,
#define IFLA_LINKINFO IFLA_LINKINFO
+ IFLA_TXBTHSKB, /* Driver support for batched skbs */
+#define IFLA_TXBTHSKB IFLA_TXBTHSKB
__IFLA_MAX
};
diff -ruNp org/net/core/rtnetlink.c rev2/net/core/rtnetlink.c
--- org/net/core/rtnetlink.c 2007-07-20 16:31:59.000000000 +0530
+++ rev2/net/core/rtnetlink.c 2007-07-21 22:27:10.000000000 +0530
@@ -634,6 +634,7 @@ static int rtnl_fill_ifinfo(struct sk_bu
NLA_PUT_STRING(skb, IFLA_IFNAME, dev->name);
NLA_PUT_U32(skb, IFLA_TXQLEN, dev->tx_queue_len);
+ NLA_PUT_U32(skb, IFLA_TXBTHSKB, BATCHING_ON(dev));
NLA_PUT_U32(skb, IFLA_WEIGHT, dev->weight);
NLA_PUT_U8(skb, IFLA_OPERSTATE,
netif_running(dev) ? dev->operstate : IF_OPER_DOWN);
@@ -833,7 +834,8 @@ static int do_setlink(struct net_device
if (tb[IFLA_TXQLEN])
dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]);
-
+ if (tb[IFLA_TXBTHSKB])
+ dev_change_tx_batching(dev, nla_get_u32(tb[IFLA_TXBTHSKB]));
if (tb[IFLA_WEIGHT])
dev->weight = nla_get_u32(tb[IFLA_WEIGHT]);
@@ -1072,6 +1074,9 @@ replay:
nla_len(tb[IFLA_BROADCAST]));
if (tb[IFLA_TXQLEN])
dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]);
+ if (tb[IFLA_TXBTHSKB])
+ dev_change_tx_batching(dev,
+ nla_get_u32(tb[IFLA_TXBTHSKB]));
if (tb[IFLA_WEIGHT])
dev->weight = nla_get_u32(tb[IFLA_WEIGHT]);
if (tb[IFLA_OPERSTATE])
* [ofa-general] [PATCH 07/12 -Rev2] Change qdisc_run & qdisc_restart API, callers
2007-07-22 9:04 [PATCH 00/12 -Rev2] Implement batching skb API Krishna Kumar
` (5 preceding siblings ...)
2007-07-22 9:05 ` [ofa-general] [PATCH 06/12 -Rev2] rtnetlink changes Krishna Kumar
@ 2007-07-22 9:06 ` Krishna Kumar
2007-07-22 9:06 ` [ofa-general] [PATCH 08/12 -Rev2] IPoIB include file changes Krishna Kumar
` (5 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: Krishna Kumar @ 2007-07-22 9:06 UTC (permalink / raw)
To: davem, rdreier
Cc: johnpol, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
kumarkr, mcarlson, jagana, general, netdev, tgraf, jeff, hadi,
kaber, mchan, sri
diff -ruNp org/include/net/pkt_sched.h rev2/include/net/pkt_sched.h
--- org/include/net/pkt_sched.h 2007-07-20 07:49:28.000000000 +0530
+++ rev2/include/net/pkt_sched.h 2007-07-20 16:09:45.000000000 +0530
@@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge
struct rtattr *tab);
extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
-extern void __qdisc_run(struct net_device *dev);
+extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist);
-static inline void qdisc_run(struct net_device *dev)
+static inline void qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
{
if (!netif_queue_stopped(dev) &&
!test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
- __qdisc_run(dev);
+ __qdisc_run(dev, blist);
}
extern int tc_classify_compat(struct sk_buff *skb, struct tcf_proto *tp,
diff -ruNp org/net/sched/sch_generic.c rev2/net/sched/sch_generic.c
--- org/net/sched/sch_generic.c 2007-07-20 07:49:28.000000000 +0530
+++ rev2/net/sched/sch_generic.c 2007-07-22 12:11:10.000000000 +0530
@@ -59,10 +59,12 @@ static inline int qdisc_qlen(struct Qdis
static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
struct Qdisc *q)
{
- if (unlikely(skb->next))
- dev->gso_skb = skb;
- else
- q->ops->requeue(skb, q);
+ if (likely(skb)) {
+ if (unlikely(skb->next))
+ dev->gso_skb = skb;
+ else
+ q->ops->requeue(skb, q);
+ }
netif_schedule(dev);
return 0;
@@ -91,18 +93,23 @@ static inline int handle_dev_cpu_collisi
/*
* Same CPU holding the lock. It may be a transient
* configuration error, when hard_start_xmit() recurses. We
- * detect it by checking xmit owner and drop the packet when
- * deadloop is detected. Return OK to try the next skb.
+ * detect it by checking xmit owner and drop skb (or all
+ * skbs in batching case) when deadloop is detected. Return
+ * OK to try the next skb.
*/
- kfree_skb(skb);
+ if (likely(skb))
+ kfree_skb(skb);
+ else if (!skb_queue_empty(&dev->skb_blist))
+ skb_queue_purge(&dev->skb_blist);
+
if (net_ratelimit())
printk(KERN_WARNING "Dead loop on netdevice %s, "
"fix it urgently!\n", dev->name);
ret = qdisc_qlen(q);
} else {
/*
- * Another cpu is holding lock, requeue & delay xmits for
- * some time.
+ * Another cpu is holding lock. Requeue skb and delay xmits
+ * for some time.
*/
__get_cpu_var(netdev_rx_stat).cpu_collision++;
ret = dev_requeue_skb(skb, dev, q);
@@ -112,6 +119,38 @@ static inline int handle_dev_cpu_collisi
}
/*
+ * Algorithm to get skb(s) is:
+ * - Non-batching drivers, or if the batch list is empty and there is
+ * at most one skb in the queue: dequeue the skb and put it in *skbp
+ * to tell the caller to use the single-skb xmit API.
+ * - Batching drivers where the batch list already contains at least
+ * one skb, or there are multiple skbs in the queue: keep dequeuing
+ * skbs up to a limit and set *skbp to NULL to tell the caller to
+ * use the multiple-skb xmit API.
+ *
+ * Returns:
+ * 1 - at least one skb is to be sent out; *skbp contains the skb, or
+ * NULL if >1 skbs are present in blist for batching.
+ * 0 - no skbs to be sent.
+ */
+static inline int get_skb(struct net_device *dev, struct Qdisc *q,
+ struct sk_buff_head *blist, struct sk_buff **skbp)
+{
+ if (likely(!blist) || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)) {
+ return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
+ } else {
+ int max = dev->tx_queue_len - skb_queue_len(blist);
+ struct sk_buff *skb;
+
+ while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
+ max -= dev_add_skb_to_blist(skb, dev);
+
+ *skbp = NULL;
+ return 1; /* we have at least one skb in blist */
+ }
+}
+
+/*
* NOTE: Called under dev->queue_lock with locally disabled BH.
*
* __LINK_STATE_QDISC_RUNNING guarantees only one CPU can process this
@@ -130,7 +169,8 @@ static inline int handle_dev_cpu_collisi
* >0 - queue is not empty.
*
*/
-static inline int qdisc_restart(struct net_device *dev)
+static inline int qdisc_restart(struct net_device *dev,
+ struct sk_buff_head *blist)
{
struct Qdisc *q = dev->qdisc;
struct sk_buff *skb;
@@ -138,7 +178,7 @@ static inline int qdisc_restart(struct n
int ret;
/* Dequeue packet */
- if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
+ if (unlikely(!get_skb(dev, q, blist, &skb)))
return 0;
/*
@@ -158,7 +198,10 @@ static inline int qdisc_restart(struct n
/* And release queue */
spin_unlock(&dev->queue_lock);
- ret = dev_hard_start_xmit(skb, dev);
+ if (likely(skb))
+ ret = dev_hard_start_xmit(skb, dev);
+ else
+ ret = dev->hard_start_xmit_batch(dev);
if (!lockless)
netif_tx_unlock(dev);
@@ -168,7 +211,7 @@ static inline int qdisc_restart(struct n
switch (ret) {
case NETDEV_TX_OK:
- /* Driver sent out skb successfully */
+ /* Driver sent out skb (or entire skb_blist) successfully */
ret = qdisc_qlen(q);
break;
@@ -179,8 +222,8 @@ static inline int qdisc_restart(struct n
default:
/* Driver returned NETDEV_TX_BUSY - requeue skb */
- if (unlikely (ret != NETDEV_TX_BUSY && net_ratelimit()))
- printk(KERN_WARNING "BUG %s code %d qlen %d\n",
+ if (unlikely(ret != NETDEV_TX_BUSY) && net_ratelimit())
+ printk(KERN_WARNING " %s: BUG. code %d qlen %d\n",
dev->name, ret, q->q.qlen);
ret = dev_requeue_skb(skb, dev, q);
@@ -190,10 +233,10 @@ static inline int qdisc_restart(struct n
return ret;
}
-void __qdisc_run(struct net_device *dev)
+void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
{
do {
- if (!qdisc_restart(dev))
+ if (!qdisc_restart(dev, blist))
break;
} while (!netif_queue_stopped(dev));
@@ -567,6 +610,13 @@ void dev_deactivate(struct net_device *d
skb = dev->gso_skb;
dev->gso_skb = NULL;
+
+ if (BATCHING_ON(dev)) {
+ /* Free skbs on batch list */
+ if (!skb_queue_empty(&dev->skb_blist))
+ skb_queue_purge(&dev->skb_blist);
+ }
+
spin_unlock_bh(&dev->queue_lock);
kfree_skb(skb);
diff -ruNp org/net/core/dev.c rev2/net/core/dev.c
--- org/net/core/dev.c 2007-07-20 07:49:28.000000000 +0530
+++ rev2/net/core/dev.c 2007-07-21 23:08:33.000000000 +0530
@@ -1647,7 +1647,7 @@ gso:
/* reset queue_mapping to zero */
skb->queue_mapping = 0;
rc = q->enqueue(skb, q);
- qdisc_run(dev);
+ qdisc_run(dev, NULL);
spin_unlock(&dev->queue_lock);
rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc;
@@ -1844,7 +1844,12 @@ static void net_tx_action(struct softirq
clear_bit(__LINK_STATE_SCHED, &dev->state);
if (spin_trylock(&dev->queue_lock)) {
- qdisc_run(dev);
+ /*
+ * Try to send out all skbs if batching is
+ * enabled.
+ */
+ qdisc_run(dev, BATCHING_ON(dev) ?
+ &dev->skb_blist : NULL);
spin_unlock(&dev->queue_lock);
} else {
netif_schedule(dev);
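The dispatch rule that get_skb() implements above can be isolated into a tiny predicate (use_single_path is a hypothetical helper for illustration only; the kernel function also performs the actual dequeue):

```c
#include <assert.h>

/*
 * Mirror of get_skb()'s dispatch rule on plain integers: returns 1
 * when the caller should take the single-skb path, 0 for the batch
 * path.
 */
static int use_single_path(int have_blist, int blist_len, int qdisc_len)
{
	return !have_blist || (blist_len == 0 && qdisc_len <= 1);
}
```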
* [ofa-general] [PATCH 08/12 -Rev2] IPoIB include file changes.
2007-07-22 9:04 [PATCH 00/12 -Rev2] Implement batching skb API Krishna Kumar
` (6 preceding siblings ...)
2007-07-22 9:06 ` [ofa-general] [PATCH 07/12 -Rev2] Change qdisc_run & qdisc_restart API, callers Krishna Kumar
@ 2007-07-22 9:06 ` Krishna Kumar
2007-07-22 9:06 ` [ofa-general] [PATCH 09/12 -Rev2] IPoIB verbs changes Krishna Kumar
` (4 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: Krishna Kumar @ 2007-07-22 9:06 UTC (permalink / raw)
To: davem, rdreier
Cc: johnpol, Robert.Olsson, herbert, gaagaan, kumarkr,
peter.p.waskiewicz.jr, mcarlson, kaber, jagana, general, mchan,
tgraf, jeff, sri, hadi, netdev
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib.h rev2/drivers/infiniband/ulp/ipoib/ipoib.h
--- org/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-20 07:49:28.000000000 +0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib.h 2007-07-20 16:09:45.000000000 +0530
@@ -269,8 +269,8 @@ struct ipoib_dev_priv {
struct ipoib_tx_buf *tx_ring;
unsigned tx_head;
unsigned tx_tail;
- struct ib_sge tx_sge;
- struct ib_send_wr tx_wr;
+ struct ib_sge *tx_sge;
+ struct ib_send_wr *tx_wr;
struct ib_wc ibwc[IPOIB_NUM_WC];
@@ -365,8 +365,11 @@ static inline void ipoib_put_ah(struct i
int ipoib_open(struct net_device *dev);
int ipoib_add_pkey_attr(struct net_device *dev);
+int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb,
+ struct ipoib_dev_priv *priv, int snum, int tx_index,
+ struct ipoib_ah *address, u32 qpn);
void ipoib_send(struct net_device *dev, struct sk_buff *skb,
- struct ipoib_ah *address, u32 qpn);
+ struct ipoib_ah *address, u32 qpn, int num_skbs);
void ipoib_reap_ah(struct work_struct *work);
void ipoib_flush_paths(struct net_device *dev);
* [ofa-general] [PATCH 09/12 -Rev2] IPoIB verbs changes
2007-07-22 9:04 [PATCH 00/12 -Rev2] Implement batching skb API Krishna Kumar
` (7 preceding siblings ...)
2007-07-22 9:06 ` [ofa-general] [PATCH 08/12 -Rev2] IPoIB include file changes Krishna Kumar
@ 2007-07-22 9:06 ` Krishna Kumar
2007-07-22 9:06 ` [ofa-general] [PATCH 10/12 -Rev2] IPoIB multicast, CM changes Krishna Kumar
` (3 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: Krishna Kumar @ 2007-07-22 9:06 UTC (permalink / raw)
To: davem, rdreier
Cc: johnpol, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
kumarkr, mcarlson, netdev, jagana, general, mchan, tgraf, jeff,
hadi, kaber, sri
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c rev2/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-07-20 07:49:28.000000000 +0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-07-20 16:09:45.000000000 +0530
@@ -152,11 +152,11 @@ int ipoib_transport_dev_init(struct net_
.max_send_sge = 1,
.max_recv_sge = 1
},
- .sq_sig_type = IB_SIGNAL_ALL_WR,
+ .sq_sig_type = IB_SIGNAL_REQ_WR, /* 11.2.4.1 */
.qp_type = IB_QPT_UD
};
-
- int ret, size;
+ struct ib_send_wr *next_wr = NULL;
+ int i, ret, size;
priv->pd = ib_alloc_pd(priv->ca);
if (IS_ERR(priv->pd)) {
@@ -197,12 +197,17 @@ int ipoib_transport_dev_init(struct net_
priv->dev->dev_addr[2] = (priv->qp->qp_num >> 8) & 0xff;
priv->dev->dev_addr[3] = (priv->qp->qp_num ) & 0xff;
- priv->tx_sge.lkey = priv->mr->lkey;
-
- priv->tx_wr.opcode = IB_WR_SEND;
- priv->tx_wr.sg_list = &priv->tx_sge;
- priv->tx_wr.num_sge = 1;
- priv->tx_wr.send_flags = IB_SEND_SIGNALED;
+ for (i = ipoib_sendq_size - 1; i >= 0; i--) {
+ priv->tx_sge[i].lkey = priv->mr->lkey;
+ priv->tx_wr[i].opcode = IB_WR_SEND;
+ priv->tx_wr[i].sg_list = &priv->tx_sge[i];
+ priv->tx_wr[i].num_sge = 1;
+ priv->tx_wr[i].send_flags = 0;
+
+ /* Link the list properly for provider to use */
+ priv->tx_wr[i].next = next_wr;
+ next_wr = &priv->tx_wr[i];
+ }
return 0;
* [ofa-general] [PATCH 10/12 -Rev2] IPoIB multicast, CM changes
2007-07-22 9:04 [PATCH 00/12 -Rev2] Implement batching skb API Krishna Kumar
` (8 preceding siblings ...)
2007-07-22 9:06 ` [ofa-general] [PATCH 09/12 -Rev2] IPoIB verbs changes Krishna Kumar
@ 2007-07-22 9:06 ` Krishna Kumar
2007-07-22 9:06 ` [ofa-general] [PATCH 11/12 -Rev2] IPoIB xmit API addition Krishna Kumar
` (2 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: Krishna Kumar @ 2007-07-22 9:06 UTC (permalink / raw)
To: davem, rdreier
Cc: johnpol, Robert.Olsson, herbert, gaagaan, kumarkr,
peter.p.waskiewicz.jr, mcarlson, jagana, general, netdev, tgraf,
jeff, sri, hadi, kaber, mchan
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_cm.c rev2/drivers/infiniband/ulp/ipoib/ipoib_cm.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-20 07:49:28.000000000 +0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-20 16:09:45.000000000 +0530
@@ -493,14 +493,19 @@ static inline int post_send(struct ipoib
unsigned int wr_id,
u64 addr, int len)
{
+ int ret;
struct ib_send_wr *bad_wr;
- priv->tx_sge.addr = addr;
- priv->tx_sge.length = len;
+ priv->tx_sge[0].addr = addr;
+ priv->tx_sge[0].length = len;
+
+ priv->tx_wr[0].wr_id = wr_id;
- priv->tx_wr.wr_id = wr_id;
+ priv->tx_wr[0].next = NULL;
+ ret = ib_post_send(tx->qp, priv->tx_wr, &bad_wr);
+ priv->tx_wr[0].next = &priv->tx_wr[1];
- return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr);
+ return ret;
}
void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c rev2/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-07-20 07:49:28.000000000 +0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-07-20 16:09:45.000000000 +0530
@@ -217,7 +217,7 @@ static int ipoib_mcast_join_finish(struc
if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
sizeof (union ib_gid))) {
priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
- priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
+ priv->tx_wr[0].wr.ud.remote_qkey = priv->qkey;
}
if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -736,7 +736,7 @@ out:
}
}
- ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN);
+ ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN, 1);
}
unlock:
* [ofa-general] [PATCH 11/12 -Rev2] IPoIB xmit API addition
2007-07-22 9:04 [PATCH 00/12 -Rev2] Implement batching skb API Krishna Kumar
` (9 preceding siblings ...)
2007-07-22 9:06 ` [ofa-general] [PATCH 10/12 -Rev2] IPoIB multicast, CM changes Krishna Kumar
@ 2007-07-22 9:06 ` Krishna Kumar
2007-07-22 9:41 ` Michael S. Tsirkin
2007-07-23 10:48 ` Evgeniy Polyakov
2007-07-22 9:06 ` [ofa-general] [PATCH 12/12 -Rev2] IPoIB xmit internals changes (ipoib_ib.c) Krishna Kumar
2007-07-23 9:53 ` [PATCH 00/12 -Rev2] Implement batching skb API Krishna Kumar2
12 siblings, 2 replies; 24+ messages in thread
From: Krishna Kumar @ 2007-07-22 9:06 UTC (permalink / raw)
To: davem, rdreier
Cc: johnpol, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
kumarkr, mcarlson, jagana, general, netdev, tgraf, jeff, hadi,
kaber, mchan, sri
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_ib.c rev2/drivers/infiniband/ulp/ipoib/ipoib_ib.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-07-20 07:49:28.000000000 +0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-07-22 00:08:37.000000000 +0530
@@ -242,8 +242,9 @@ repost:
static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
{
struct ipoib_dev_priv *priv = netdev_priv(dev);
+ int i = 0, num_completions;
+ int tx_ring_index = priv->tx_tail & (ipoib_sendq_size - 1);
unsigned int wr_id = wc->wr_id;
- struct ipoib_tx_buf *tx_req;
unsigned long flags;
ipoib_dbg_data(priv, "send completion: id %d, status: %d\n",
@@ -255,23 +256,57 @@ static void ipoib_ib_handle_tx_wc(struct
return;
}
- tx_req = &priv->tx_ring[wr_id];
+ num_completions = wr_id - tx_ring_index + 1;
+ if (num_completions <= 0)
+ num_completions += ipoib_sendq_size;
- ib_dma_unmap_single(priv->ca, tx_req->mapping,
- tx_req->skb->len, DMA_TO_DEVICE);
+ /*
+ * Handle skbs completion from tx_tail to wr_id. It is possible to
+ * handle WC's from earlier post_sends (possible multiple) in this
+ * iteration as we move from tx_tail to wr_id, since if the last
+ * WR (which is the one which had a completion request) failed to be
+ * sent for any of those earlier request(s), no completion
+ * notification is generated for successful WR's of those earlier
+ * request(s).
+ */
+ while (1) {
+ /*
+ * Could use while (i < num_completions), but it is costly
+ * since in most cases there is 1 completion, and we end up
+ * doing an extra "index = (index+1) & (ipoib_sendq_size-1)"
+ */
+ struct ipoib_tx_buf *tx_req = &priv->tx_ring[tx_ring_index];
+
+ if (likely(tx_req->skb)) {
+ ib_dma_unmap_single(priv->ca, tx_req->mapping,
+ tx_req->skb->len, DMA_TO_DEVICE);
- ++priv->stats.tx_packets;
- priv->stats.tx_bytes += tx_req->skb->len;
+ ++priv->stats.tx_packets;
+ priv->stats.tx_bytes += tx_req->skb->len;
- dev_kfree_skb_any(tx_req->skb);
+ dev_kfree_skb_any(tx_req->skb);
+ }
+ /*
+ * else this skb failed synchronously when posted and was
+ * freed immediately.
+ */
+
+ if (++i == num_completions)
+ break;
+
+ /* More WC's to handle */
+ tx_ring_index = (tx_ring_index + 1) & (ipoib_sendq_size - 1);
+ }
spin_lock_irqsave(&priv->tx_lock, flags);
- ++priv->tx_tail;
+
+ priv->tx_tail += num_completions;
if (unlikely(test_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags)) &&
priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) {
clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
netif_wake_queue(dev);
}
+
spin_unlock_irqrestore(&priv->tx_lock, flags);
if (wc->status != IB_WC_SUCCESS &&
@@ -340,78 +375,178 @@ void ipoib_ib_completion(struct ib_cq *c
netif_rx_schedule(dev_ptr);
}
-static inline int post_send(struct ipoib_dev_priv *priv,
- unsigned int wr_id,
- struct ib_ah *address, u32 qpn,
- u64 addr, int len)
+/*
+ * post_send : Post WR(s) to the device.
+ *
+ * num_skbs is the number of WR's, 'start_index' is the first slot in
+ * tx_wr[] or tx_sge[]. Note: 'start_index' is normally zero, unless a
+ * previous post_send returned error and we are trying to send the untried
+ * WR's, in which case start_index will point to the first untried WR.
+ *
+ * We also break the WR link before posting so that the driver knows how
+ * many WR's to process, and this is set back after the post.
+ */
+static inline int post_send(struct ipoib_dev_priv *priv, u32 qpn,
+ int start_index, int num_skbs,
+ struct ib_send_wr **bad_wr)
{
- struct ib_send_wr *bad_wr;
+ int ret;
+ struct ib_send_wr *last_wr, *next_wr;
+
+ last_wr = &priv->tx_wr[start_index + num_skbs - 1];
+
+ /* Set Completion Notification for last WR */
+ last_wr->send_flags = IB_SEND_SIGNALED;
- priv->tx_sge.addr = addr;
- priv->tx_sge.length = len;
+ /* Terminate the last WR */
+ next_wr = last_wr->next;
+ last_wr->next = NULL;
- priv->tx_wr.wr_id = wr_id;
- priv->tx_wr.wr.ud.remote_qpn = qpn;
- priv->tx_wr.wr.ud.ah = address;
+ /* Send all the WR's in one doorbell */
+ ret = ib_post_send(priv->qp, &priv->tx_wr[start_index], bad_wr);
- return ib_post_send(priv->qp, &priv->tx_wr, &bad_wr);
+ /* Restore send_flags & WR chain */
+ last_wr->send_flags = 0;
+ last_wr->next = next_wr;
+
+ return ret;
}
-void ipoib_send(struct net_device *dev, struct sk_buff *skb,
- struct ipoib_ah *address, u32 qpn)
+/*
+ * Map skb & store skb/mapping in tx_req; and details of the WR in tx_wr
+ * to pass to the driver.
+ *
+ * Returns :
+ * - 0 on successful processing of the skb
+ * - 1 if the skb was freed.
+ */
+int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb,
+ struct ipoib_dev_priv *priv, int wr_num,
+ int tx_ring_index, struct ipoib_ah *address, u32 qpn)
{
- struct ipoib_dev_priv *priv = netdev_priv(dev);
- struct ipoib_tx_buf *tx_req;
u64 addr;
+ struct ipoib_tx_buf *tx_req;
if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) {
- ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
+ ipoib_warn(priv, "packet len %d (> %d) too long to "
+ "send, dropping\n",
skb->len, priv->mcast_mtu + IPOIB_ENCAP_LEN);
++priv->stats.tx_dropped;
++priv->stats.tx_errors;
ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu);
- return;
+ return 1;
}
- ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n",
+ ipoib_dbg_data(priv, "sending packet, length=%d address=%p "
+ "qpn=0x%06x\n",
skb->len, address, qpn);
/*
* We put the skb into the tx_ring _before_ we call post_send()
* because it's entirely possible that the completion handler will
- * run before we execute anything after the post_send(). That
+ * run before we execute anything after the post_send(). That
* means we have to make sure everything is properly recorded and
* our state is consistent before we call post_send().
*/
- tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)];
- tx_req->skb = skb;
- addr = ib_dma_map_single(priv->ca, skb->data, skb->len,
- DMA_TO_DEVICE);
+ addr = ib_dma_map_single(priv->ca, skb->data, skb->len, DMA_TO_DEVICE);
if (unlikely(ib_dma_mapping_error(priv->ca, addr))) {
++priv->stats.tx_errors;
dev_kfree_skb_any(skb);
- return;
+ return 1;
}
+
+ tx_req = &priv->tx_ring[tx_ring_index];
+ tx_req->skb = skb;
tx_req->mapping = addr;
+ priv->tx_sge[wr_num].addr = addr;
+ priv->tx_sge[wr_num].length = skb->len;
+ priv->tx_wr[wr_num].wr_id = tx_ring_index;
+ priv->tx_wr[wr_num].wr.ud.remote_qpn = qpn;
+ priv->tx_wr[wr_num].wr.ud.ah = address->ah;
- if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1),
- address->ah, qpn, addr, skb->len))) {
- ipoib_warn(priv, "post_send failed\n");
- ++priv->stats.tx_errors;
- ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE);
- dev_kfree_skb_any(skb);
- } else {
- dev->trans_start = jiffies;
+ return 0;
+}
+
+/*
+ * If an skb is passed to this function, it is the single, unprocessed skb
+ * send case. Otherwise if skb is NULL, it means that all skbs are already
+ * processed and put on the priv->tx_wr,tx_sge,tx_ring, etc.
+ */
+void ipoib_send(struct net_device *dev, struct sk_buff *skb,
+ struct ipoib_ah *address, u32 qpn, int num_skbs)
+{
+ struct ipoib_dev_priv *priv = netdev_priv(dev);
+ int start_index = 0;
- address->last_send = priv->tx_head;
- ++priv->tx_head;
+ if (skb && ipoib_process_skb(dev, skb, priv, 0, priv->tx_head &
+ (ipoib_sendq_size - 1), address, qpn))
+ return;
- if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) {
- ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
- netif_stop_queue(dev);
- set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
+ /* Send out all the skb's in one post */
+ while (num_skbs) {
+ struct ib_send_wr *bad_wr;
+
+ if (unlikely((post_send(priv, qpn, start_index, num_skbs,
+ &bad_wr)))) {
+ int done;
+
+ /*
+ * Better error handling can be done here, like free
+ * all untried skbs if err == -ENOMEM. However at this
+ * time, we re-try all the skbs, all of which will
+ * likely fail anyway (unless device finished sending
+ * some out in the meantime). This is not a regression
+ * since the earlier code is not doing this either.
+ */
+ ipoib_warn(priv, "post_send failed\n");
+
+ /* Get #WR's that finished successfully */
+ done = bad_wr - &priv->tx_wr[start_index];
+
+ /* Handle 1 error */
+ priv->stats.tx_errors++;
+ ib_dma_unmap_single(priv->ca,
+ priv->tx_sge[start_index + done].addr,
+ priv->tx_sge[start_index + done].length,
+ DMA_TO_DEVICE);
+
+ /* Handle 'n' successes */
+ if (done) {
+ dev->trans_start = jiffies;
+ address->last_send = priv->tx_head;
+ }
+
+ /* Free failed WR & reset for WC handler to recognize */
+ dev_kfree_skb_any(priv->tx_ring[bad_wr->wr_id].skb);
+ priv->tx_ring[bad_wr->wr_id].skb = NULL;
+
+ /* Move head to first untried WR */
+ priv->tx_head += (done + 1);
+ /* + 1 for WR that was tried & failed */
+
+ /* Get count of skbs that were not tried */
+ num_skbs -= (done + 1);
+
+ /* Get start index for next iteration */
+ start_index += (done + 1);
+ } else {
+ dev->trans_start = jiffies;
+
+ address->last_send = priv->tx_head;
+ priv->tx_head += num_skbs;
+ num_skbs = 0;
}
}
+
+ if (unlikely(priv->tx_head - priv->tx_tail == ipoib_sendq_size)) {
+ /*
+ * Not accurate as some intermediate slots could have been
+ * freed on error, but no harm - only queue stopped earlier.
+ */
+ ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
+ netif_stop_queue(dev);
+ set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
+ }
}
static void __ipoib_reap_ah(struct net_device *dev)
* [ofa-general] [PATCH 12/12 -Rev2] IPoIB xmit internals changes (ipoib_ib.c)
2007-07-22 9:04 [PATCH 00/12 -Rev2] Implement batching skb API Krishna Kumar
` (10 preceding siblings ...)
2007-07-22 9:06 ` [ofa-general] [PATCH 11/12 -Rev2] IPoIB xmit API addition Krishna Kumar
@ 2007-07-22 9:06 ` Krishna Kumar
2007-07-23 9:53 ` [PATCH 00/12 -Rev2] Implement batching skb API Krishna Kumar2
12 siblings, 0 replies; 24+ messages in thread
From: Krishna Kumar @ 2007-07-22 9:06 UTC (permalink / raw)
To: davem, rdreier
Cc: johnpol, Robert.Olsson, herbert, gaagaan, kumarkr,
peter.p.waskiewicz.jr, mcarlson, kaber, jagana, general, mchan,
tgraf, jeff, sri, hadi, netdev
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_main.c rev2/drivers/infiniband/ulp/ipoib/ipoib_main.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-07-20 07:49:28.000000000 +0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-07-22 00:08:28.000000000 +0530
@@ -558,7 +558,8 @@ static void neigh_add_path(struct sk_buf
goto err_drop;
}
} else
- ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+ ipoib_send(dev, skb, path->ah,
+ IPOIB_QPN(skb->dst->neighbour->ha), 1);
} else {
neigh->ah = NULL;
@@ -638,7 +639,7 @@ static void unicast_arp_send(struct sk_b
ipoib_dbg(priv, "Send unicast ARP to %04x\n",
be16_to_cpu(path->pathrec.dlid));
- ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr));
+ ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr), 1);
} else if ((path->query || !path_rec_start(dev, path)) &&
skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
/* put pseudoheader back on for next time */
@@ -704,7 +705,8 @@ static int ipoib_start_xmit(struct sk_bu
goto out;
}
- ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+ ipoib_send(dev, skb, neigh->ah,
+ IPOIB_QPN(skb->dst->neighbour->ha), 1);
goto out;
}
@@ -753,6 +755,175 @@ out:
return NETDEV_TX_OK;
}
+#define XMIT_QUEUED_SKBS() \
+ do { \
+ if (num_skbs) { \
+ ipoib_send(dev, NULL, old_neigh->ah, old_qpn, \
+ num_skbs); \
+ num_skbs = 0; \
+ } \
+ } while (0)
+
+/*
+ * TODO: Merge with ipoib_start_xmit to use the same code and have a
+ * transparent wrapper caller to xmit's, etc.
+ */
+static int ipoib_start_xmit_frames(struct net_device *dev)
+{
+ struct ipoib_dev_priv *priv = netdev_priv(dev);
+ struct sk_buff *skb;
+ struct sk_buff_head *blist;
+ int max_skbs, num_skbs = 0, tx_ring_index = -1;
+ u32 qpn, old_qpn = 0;
+ struct ipoib_neigh *neigh, *old_neigh = NULL;
+ unsigned long flags;
+
+ if (unlikely(!spin_trylock_irqsave(&priv->tx_lock, flags)))
+ return NETDEV_TX_LOCKED;
+
+ blist = &dev->skb_blist;
+
+ /*
+ * Send atmost 'max_skbs' skbs. This also prevents the device getting
+ * full.
+ */
+ max_skbs = ipoib_sendq_size - (priv->tx_head - priv->tx_tail);
+ while (max_skbs-- > 0 && (skb = __skb_dequeue(blist)) != NULL) {
+ /*
+ * From here on, ipoib_send() cannot stop the queue as it
+ * uses the same initialization as 'max_skbs'. So we can
+ * optimize to not check for queue stopped for every skb.
+ */
+ if (likely(skb->dst && skb->dst->neighbour)) {
+ if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) {
+ XMIT_QUEUED_SKBS();
+ ipoib_path_lookup(skb, dev);
+ continue;
+ }
+
+ neigh = *to_ipoib_neigh(skb->dst->neighbour);
+
+ if (ipoib_cm_get(neigh)) {
+ if (ipoib_cm_up(neigh)) {
+ XMIT_QUEUED_SKBS();
+ ipoib_cm_send(dev, skb,
+ ipoib_cm_get(neigh));
+ continue;
+ }
+ } else if (neigh->ah) {
+ if (unlikely(memcmp(&neigh->dgid.raw,
+ skb->dst->neighbour->ha + 4,
+ sizeof(union ib_gid)))) {
+ spin_lock(&priv->lock);
+ /*
+ * It's safe to call ipoib_put_ah()
+ * inside priv->lock here, because we
+ * know that path->ah will always hold
+ * one more reference, so ipoib_put_ah()
+ * will never do more than decrement
+ * the ref count.
+ */
+ ipoib_put_ah(neigh->ah);
+ list_del(&neigh->list);
+ ipoib_neigh_free(dev, neigh);
+ spin_unlock(&priv->lock);
+ XMIT_QUEUED_SKBS();
+ ipoib_path_lookup(skb, dev);
+ continue;
+ }
+
+ qpn = IPOIB_QPN(skb->dst->neighbour->ha);
+ if (neigh != old_neigh || qpn != old_qpn) {
+ /*
+ * Sending to a different destination
+ * from earlier skb's - send all
+ * existing skbs (if any).
+ */
+ if (tx_ring_index == -1) {
+ /*
+ * First time, find where to
+ * store skb.
+ */
+ tx_ring_index = priv->tx_head &
+ (ipoib_sendq_size - 1);
+ } else {
+ /* Some skbs to send */
+ XMIT_QUEUED_SKBS();
+ }
+ old_neigh = neigh;
+ old_qpn = IPOIB_QPN(skb->dst->neighbour->ha);
+ }
+
+ if (ipoib_process_skb(dev, skb, priv, num_skbs,
+ tx_ring_index, neigh->ah,
+ qpn))
+ continue;
+
+ num_skbs++;
+
+ /* Queue'd one skb, get index for next skb */
+ if (max_skbs)
+ tx_ring_index = (tx_ring_index + 1) &
+ (ipoib_sendq_size - 1);
+ continue;
+ }
+
+ if (skb_queue_len(&neigh->queue) <
+ IPOIB_MAX_PATH_REC_QUEUE) {
+ spin_lock(&priv->lock);
+ __skb_queue_tail(&neigh->queue, skb);
+ spin_unlock(&priv->lock);
+ } else {
+ dev_kfree_skb_any(skb);
+ ++priv->stats.tx_dropped;
+ ++max_skbs;
+ }
+ } else {
+ struct ipoib_pseudoheader *phdr =
+ (struct ipoib_pseudoheader *) skb->data;
+ skb_pull(skb, sizeof *phdr);
+
+ if (phdr->hwaddr[4] == 0xff) {
+ /* Add in the P_Key for multicast*/
+ phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff;
+ phdr->hwaddr[9] = priv->pkey & 0xff;
+
+ XMIT_QUEUED_SKBS();
+ ipoib_mcast_send(dev, phdr->hwaddr + 4, skb);
+ } else {
+ /* unicast GID -- should be ARP or RARP reply */
+
+ if ((be16_to_cpup((__be16 *) skb->data) !=
+ ETH_P_ARP) &&
+ (be16_to_cpup((__be16 *) skb->data) !=
+ ETH_P_RARP)) {
+ ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x "
+ IPOIB_GID_FMT "\n",
+ skb->dst ? "neigh" : "dst",
+ be16_to_cpup((__be16 *)
+ skb->data),
+ IPOIB_QPN(phdr->hwaddr),
+ IPOIB_GID_RAW_ARG(phdr->hwaddr
+ + 4));
+ dev_kfree_skb_any(skb);
+ ++priv->stats.tx_dropped;
+ ++max_skbs;
+ continue;
+ }
+ XMIT_QUEUED_SKBS();
+ unicast_arp_send(skb, dev, phdr);
+ }
+ }
+ }
+
+ /* Send out last packets (if any) */
+ XMIT_QUEUED_SKBS();
+
+ spin_unlock_irqrestore(&priv->tx_lock, flags);
+
+ return skb_queue_empty(blist) ? NETDEV_TX_OK : NETDEV_TX_BUSY;
+}
+
static struct net_device_stats *ipoib_get_stats(struct net_device *dev)
{
struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -898,11 +1069,35 @@ int ipoib_dev_init(struct net_device *de
/* priv->tx_head & tx_tail are already 0 */
- if (ipoib_ib_dev_init(dev, ca, port))
+ /* Allocate tx_sge */
+ priv->tx_sge = kmalloc(ipoib_sendq_size * sizeof *priv->tx_sge,
+ GFP_KERNEL);
+ if (!priv->tx_sge) {
+ printk(KERN_WARNING "%s: failed to allocate TX sge (%d entries)\n",
+ ca->name, ipoib_sendq_size);
goto out_tx_ring_cleanup;
+ }
+
+ /* Allocate tx_wr */
+ priv->tx_wr = kmalloc(ipoib_sendq_size * sizeof *priv->tx_wr,
+ GFP_KERNEL);
+ if (!priv->tx_wr) {
+ printk(KERN_WARNING "%s: failed to allocate TX wr (%d entries)\n",
+ ca->name, ipoib_sendq_size);
+ goto out_tx_sge_cleanup;
+ }
+
+ if (ipoib_ib_dev_init(dev, ca, port))
+ goto out_tx_wr_cleanup;
return 0;
+out_tx_wr_cleanup:
+ kfree(priv->tx_wr);
+
+out_tx_sge_cleanup:
+ kfree(priv->tx_sge);
+
out_tx_ring_cleanup:
kfree(priv->tx_ring);
@@ -930,9 +1125,13 @@ void ipoib_dev_cleanup(struct net_device
kfree(priv->rx_ring);
kfree(priv->tx_ring);
+ kfree(priv->tx_sge);
+ kfree(priv->tx_wr);
priv->rx_ring = NULL;
priv->tx_ring = NULL;
+ priv->tx_sge = NULL;
+ priv->tx_wr = NULL;
}
static void ipoib_setup(struct net_device *dev)
@@ -943,6 +1142,7 @@ static void ipoib_setup(struct net_devic
dev->stop = ipoib_stop;
dev->change_mtu = ipoib_change_mtu;
dev->hard_start_xmit = ipoib_start_xmit;
+ dev->hard_start_xmit_batch = ipoib_start_xmit_frames;
dev->get_stats = ipoib_get_stats;
dev->tx_timeout = ipoib_timeout;
dev->hard_header = ipoib_hard_header;
* Re: [PATCH 11/12 -Rev2] IPoIB xmit API addition
2007-07-22 9:06 ` [ofa-general] [PATCH 11/12 -Rev2] IPoIB xmit API addition Krishna Kumar
@ 2007-07-22 9:41 ` Michael S. Tsirkin
2007-07-23 2:53 ` [ofa-general] " Krishna Kumar2
2007-07-23 10:48 ` Evgeniy Polyakov
1 sibling, 1 reply; 24+ messages in thread
From: Michael S. Tsirkin @ 2007-07-22 9:41 UTC (permalink / raw)
To: Krishna Kumar
Cc: davem, rdreier, johnpol, Robert.Olsson, peter.p.waskiewicz.jr,
herbert, gaagaan, kumarkr, mcarlson, jagana, general, netdev,
tgraf, jeff, hadi, kaber, mchan, sri
> + /*
> + * Handle skbs completion from tx_tail to wr_id. It is possible to
> + * handle WC's from earlier post_sends (possible multiple) in this
> + * iteration as we move from tx_tail to wr_id, since if the last
> + * WR (which is the one which had a completion request) failed to be
> + * sent for any of those earlier request(s), no completion
> + * notification is generated for successful WR's of those earlier
> + * request(s).
> + */
AFAIK a signalled WR will always generate a completion.
What am I missing?
>
> + /*
> + * Better error handling can be done here, like free
> + * all untried skbs if err == -ENOMEM. However at this
> + * time, we re-try all the skbs, all of which will
> + * likely fail anyway (unless device finished sending
> + * some out in the meantime). This is not a regression
> + * since the earlier code is not doing this either.
> + */
Are you retrying posting skbs? Why is this a good idea?
AFAIK, earlier code did not retry posting WRs at all.
The comment seems to imply that post send fails as a result of SQ overflow -
do you see SQ overflow errors in your testing?
AFAIK, IPoIB should never overflow the SQ.
--
MST
* [ofa-general] Re: [PATCH 02/12 -Rev2] Changes to netdevice.h
2007-07-22 9:05 ` [PATCH 02/12 -Rev2] Changes to netdevice.h Krishna Kumar
@ 2007-07-22 17:06 ` Patrick McHardy
2007-07-23 2:57 ` Krishna Kumar2
0 siblings, 1 reply; 24+ messages in thread
From: Patrick McHardy @ 2007-07-22 17:06 UTC (permalink / raw)
To: Krishna Kumar
Cc: johnpol, Robert.Olsson, herbert, gaagaan, kumarkr, rdreier,
peter.p.waskiewicz.jr, mcarlson, jagana, general, netdev, tgraf,
jeff, sri, hadi, davem, mchan
Krishna Kumar wrote:
> @@ -472,6 +474,9 @@ struct net_device
> void *priv; /* pointer to private data */
> int (*hard_start_xmit) (struct sk_buff *skb,
> struct net_device *dev);
> + int (*hard_start_xmit_batch) (struct net_device
> + *dev);
> +
Is this function really needed? Can't you just call hard_start_xmit with
a NULL skb and have the driver use dev->blist?
> /* These may be needed for future network-power-down code. */
> unsigned long trans_start; /* Time (in jiffies) of last Tx */
>
> @@ -582,6 +587,8 @@ struct net_device
> #define NETDEV_ALIGN 32
> #define NETDEV_ALIGN_CONST (NETDEV_ALIGN - 1)
>
> +#define BATCHING_ON(dev) ((dev->features & NETIF_F_BATCH_ON) != 0)
> +
> static inline void *netdev_priv(const struct net_device *dev)
> {
> return dev->priv;
> @@ -832,6 +839,8 @@ extern int dev_set_mac_address(struct n
> struct sockaddr *);
> extern int dev_hard_start_xmit(struct sk_buff *skb,
> struct net_device *dev);
> +extern int dev_add_skb_to_blist(struct sk_buff *skb,
> + struct net_device *dev);
Again, function signatures should be introduced in the same patch
that contains the function. Splitting by file doesn't make sense.
* Re: [PATCH 06/12 -Rev2] rtnetlink changes.
2007-07-22 9:05 ` [ofa-general] [PATCH 06/12 -Rev2] rtnetlink changes Krishna Kumar
@ 2007-07-22 17:10 ` Patrick McHardy
2007-07-23 2:54 ` [ofa-general] " Krishna Kumar2
0 siblings, 1 reply; 24+ messages in thread
From: Patrick McHardy @ 2007-07-22 17:10 UTC (permalink / raw)
To: Krishna Kumar
Cc: davem, rdreier, johnpol, Robert.Olsson, peter.p.waskiewicz.jr,
herbert, gaagaan, kumarkr, xma, rick.jones2, mcarlson, netdev,
jagana, general, mchan, tgraf, jeff, hadi, sri
Krishna Kumar wrote:
> diff -ruNp org/include/linux/if_link.h rev2/include/linux/if_link.h
> --- org/include/linux/if_link.h 2007-07-20 16:33:35.000000000 +0530
> +++ rev2/include/linux/if_link.h 2007-07-20 16:35:08.000000000 +0530
> @@ -78,6 +78,8 @@ enum
> IFLA_LINKMODE,
> IFLA_LINKINFO,
> #define IFLA_LINKINFO IFLA_LINKINFO
> + IFLA_TXBTHSKB, /* Driver support for Batch'd skbs */
> +#define IFLA_TXBTHSKB IFLA_TXBTHSKB
Ughh what a name :) I prefer pronounceable names since they are
much easier to remember and don't need comments explaining
what they mean.
But I actually think offering just an ethtool interface would
be better, at least for now.
* [ofa-general] Re: [PATCH 11/12 -Rev2] IPoIB xmit API addition
2007-07-22 9:41 ` Michael S. Tsirkin
@ 2007-07-23 2:53 ` Krishna Kumar2
0 siblings, 0 replies; 24+ messages in thread
From: Krishna Kumar2 @ 2007-07-23 2:53 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
mcarlson, peter.p.waskiewicz.jr, hadi, kaber, jeff, general,
mchan, tgraf, netdev, sri, davem, rdreier
Hi Michael,
"Michael S. Tsirkin" <mst@dev.mellanox.co.il> wrote on 07/22/2007 03:11:36
PM:
> > + /*
> > + * Handle skbs completion from tx_tail to wr_id. It is possible to
> > + * handle WC's from earlier post_sends (possible multiple) in this
> > + * iteration as we move from tx_tail to wr_id, since if the last
> > + * WR (which is the one which had a completion request) failed to be
> > + * sent for any of those earlier request(s), no completion
> > + * notification is generated for successful WR's of those earlier
> > + * request(s).
> > + */
>
> AFAIK a signalled WR will always generate a completion.
> What am I missing?
Yes, a signalled WR will always generate a completion. I am trying to catch
the case where, say, I send 64 skbs and set signalling for only the last
skb, with the others set to no signalling. Now if the driver found the last
WR was bad for some reason, it will synchronously fail the send for that WR
(which happens to be the only one that is signalled). So after skbs 1 to 63
have finished, no completion will be generated. That was my understanding of
how this works, and I coded it that way so that the next post cleans up the
previous post's completions.
> >
> > + /*
> > + * Better error handling can be done here, like free
> > + * all untried skbs if err == -ENOMEM. However at this
> > + * time, we re-try all the skbs, all of which will
> > + * likely fail anyway (unless device finished sending
> > + * some out in the meantime). This is not a regression
> > + * since the earlier code is not doing this either.
> > + */
>
> Are you retrying posting skbs? Why is this a good idea?
> AFAIK, earlier code did not retry posting WRs at all.
Not exactly. If I send 64 skbs to the device and the provider returned a
bad WR at skb #50, then I have to try skbs #51-64 again, since the
provider has not attempted to send those out - it bails out at the first
failure. The provider of course has already sent out skbs #1-49 before
returning failure at skb #50. So it is not strictly a retry, just xmit of
the next skbs, which is what the current code also does. I tested this part
by simulating errors in mthca_post_send and verified that the next
iteration clears up the remaining skbs.
> The comment seems to imply that post send fails as a result of SQ overflow -
Correct.
> do you see SQ overflow errors in your testing?
No.
> AFAIK, IPoIB should never overflow the SQ.
Correct. It should never happen unless IPoIB has a bug :) I guess the
comment should be removed?
Thanks,
- KK
* [ofa-general] Re: [PATCH 06/12 -Rev2] rtnetlink changes.
2007-07-22 17:10 ` Patrick McHardy
@ 2007-07-23 2:54 ` Krishna Kumar2
0 siblings, 0 replies; 24+ messages in thread
From: Krishna Kumar2 @ 2007-07-23 2:54 UTC (permalink / raw)
To: Patrick McHardy
Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
rdreier, peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general,
mchan, tgraf, netdev, davem, sri
Hi Patrick,
Patrick McHardy <kaber@trash.net> wrote on 07/22/2007 10:40:37 PM:
> Krishna Kumar wrote:
> > diff -ruNp org/include/linux/if_link.h rev2/include/linux/if_link.h
> > --- org/include/linux/if_link.h 2007-07-20 16:33:35.000000000 +0530
> > +++ rev2/include/linux/if_link.h 2007-07-20 16:35:08.000000000 +0530
> > @@ -78,6 +78,8 @@ enum
> > IFLA_LINKMODE,
> > IFLA_LINKINFO,
> > #define IFLA_LINKINFO IFLA_LINKINFO
> > + IFLA_TXBTHSKB, /* Driver support for Batch'd skbs */
> > +#define IFLA_TXBTHSKB IFLA_TXBTHSKB
>
>
> Ughh what a name :) I prefer pronouncable names since they are
> much easier to remember and don't need comments explaining
> what they mean.
>
> But I actually think offering just an ethtool interface would
> be better, at least for now.
Great, I will remove /sys and rtnetlink and keep the Ethtool i/f.
Thanks,
- KK
* [ofa-general] Re: [PATCH 02/12 -Rev2] Changes to netdevice.h
2007-07-22 17:06 ` [ofa-general] " Patrick McHardy
@ 2007-07-23 2:57 ` Krishna Kumar2
0 siblings, 0 replies; 24+ messages in thread
From: Krishna Kumar2 @ 2007-07-23 2:57 UTC (permalink / raw)
To: Patrick McHardy
Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
rdreier, peter.p.waskiewicz.jr, hadi, mcarlson, jeff, general,
mchan, tgraf, netdev, davem, sri
Hi Patrick,
Patrick McHardy <kaber@trash.net> wrote on 07/22/2007 10:36:51 PM:
> Krishna Kumar wrote:
> > @@ -472,6 +474,9 @@ struct net_device
> > void *priv; /* pointer to private data */
> > int (*hard_start_xmit) (struct sk_buff *skb,
> > struct net_device *dev);
> > + int (*hard_start_xmit_batch) (struct net_device
> > + *dev);
> > +
>
>
> Is this function really needed? Can't you just call hard_start_xmit with
> a NULL skb and have the driver use dev->blist?
Probably not. I will see how to do it this way and get back to you.
> > /* These may be needed for future network-power-down code. */
> > unsigned long trans_start; /* Time (in jiffies) of last Tx */
> >
> > @@ -582,6 +587,8 @@ struct net_device
> > #define NETDEV_ALIGN 32
> > #define NETDEV_ALIGN_CONST (NETDEV_ALIGN - 1)
> >
> > +#define BATCHING_ON(dev) ((dev->features & NETIF_F_BATCH_ON) != 0)
> > +
> > static inline void *netdev_priv(const struct net_device *dev)
> > {
> > return dev->priv;
> > @@ -832,6 +839,8 @@ extern int dev_set_mac_address(struct n
> > struct sockaddr *);
> > extern int dev_hard_start_xmit(struct sk_buff *skb,
> > struct net_device *dev);
> > +extern int dev_add_skb_to_blist(struct sk_buff *skb,
> > + struct net_device *dev);
>
>
> Again, function signatures should be introduced in the same patch
> that contains the function. Splitting by file doesn't make sense.
Right. I did it for some but missed this. Sorry, will redo.
thanks,
- KK
* Re: [PATCH 00/12 -Rev2] Implement batching skb API
2007-07-22 9:04 [PATCH 00/12 -Rev2] Implement batching skb API Krishna Kumar
` (11 preceding siblings ...)
2007-07-22 9:06 ` [ofa-general] [PATCH 12/12 -Rev2] IPoIB xmit internals changes (ipoib_ib.c) Krishna Kumar
@ 2007-07-23 9:53 ` Krishna Kumar2
12 siblings, 0 replies; 24+ messages in thread
From: Krishna Kumar2 @ 2007-07-23 9:53 UTC (permalink / raw)
To: Krishna Kumar2
Cc: davem, gaagaan, general, hadi, herbert, jagana, jeff, johnpol,
kaber, kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr,
rdreier, rick.jones2, Robert.Olsson, sri, tgraf, xma
> I have started a 10 run test for various buffer sizes and processes, and
> will post the results on Monday.
The 10 iteration run results for Rev2 are (average):
------------------------------------------------------------
Test Case                    Org        New      %Change
------------------------------------------------------------
TCP 1 Process
    Size:32                 2703       3063        13.31
    Size:128               12948      12217        -5.64
    Size:512               48108      55384        15.12
    Size:4096             129089     132586         2.70
    Average:              192848     203250         5.39
TCP 4 Processes
    Size:32                10389      10768         3.64
    Size:128               39694      42265         6.47
    Size:512              159563     156373        -1.99
    Size:4096             268094     256008        -4.50
    Average:              477740     465414        -2.58
TCP No Delay 1 Process
    Size:32                 2606       2950        13.20
    Size:128                8115      11864        46.19
    Size:512               39113      42608         8.93
    Size:4096             103966     105333         1.31
    Average:              153800     162755         5.82
TCP No Delay 4 Processes
    Size:32                 4213       8727       107.14
    Size:128               17579      35143        99.91
    Size:512               70803     123936        75.04
    Size:4096             203541     225259        10.67
    Average:              296136     393065        32.73
------------------------------------------------------------
Overall Average:         1120524    1224484         9.28%
There are three cases that degrade a little (up to -5.6%), but 13 cases
improve, many of them in the 13% to over 100% range (7 cases).
Thanks,
- KK
Krishna Kumar2/India/IBM@IBMIN wrote on 07/22/2007 02:34:57 PM:
> This set of patches implements the batching API, and makes the following
> changes resulting from the review of the first set:
>
> Changes :
> ---------
> 1. Changed skb_blist from pointer to static as it saves only 12 bytes
> (i386), but bloats the code.
> 2. Removed requirement for driver to set "features & NETIF_F_BATCH_SKBS"
> in register_netdev to enable batching as it is redundant. Changed this
> flag to NETIF_F_BATCH_ON and it is set by register_netdev, and other
> user-changeable calls can modify this bit to enable/disable batching.
> 3. Added ethtool support to enable/disable batching (not tested).
> 4. Added rtnetlink support to enable/disable batching (not tested).
> 5. Removed MIN_QUEUE_LEN_BATCH for batching as high performance drivers
> should not have a small queue anyway (adding bloat).
> 6. skbs are purged from dev_deactivate instead of from unregister_netdev
> to drop all references to the device.
> 7. Removed changelog in source code in sch_generic.c, and unrelated renames
> from sch_generic.c (lockless, comments).
> 8. Removed xmit_slots entirely, as it was adding bloat (code and header)
> and not adding value (it is calculated and set twice in internal send
> routine and handle work completion, and referenced once in batch xmit;
> and can instead be calculated once in xmit).
>
> Issues :
> --------
> 1. Remove /sysfs support completely ?
> 2. Whether rtnetlink support is required as GSO has only ethtool ?
>
> Patches are described as:
> Mail 0/12 : This mail.
> Mail 1/12 : HOWTO documentation.
> Mail 2/12 : Changes to netdevice.h
> Mail 3/12 : dev.c changes.
> Mail 4/12 : Ethtool changes.
> Mail 5/12 : sysfs changes.
> Mail 6/12 : rtnetlink changes.
> Mail 7/12 : Change in qdisc_run & qdisc_restart API, modify callers
> to use this API.
> Mail 8/12 : IPoIB include file changes.
> Mail 9/12 : IPoIB verbs changes
> Mail 10/12 : IPoIB multicast, CM changes
> Mail 11/12 : IPoIB xmit API addition
> Mail 12/12 : IPoIB xmit internals changes (ipoib_ib.c)
>
> I have started a 10 run test for various buffer sizes and processes, and
> will post the results on Monday.
>
> Please review and provide feedback/ideas; and consider for inclusion.
>
> Thanks,
>
> - KK
* [ofa-general] Re: [PATCH 03/12 -Rev2] dev.c changes.
2007-07-22 9:05 ` [ofa-general] [PATCH 03/12 -Rev2] dev.c changes Krishna Kumar
@ 2007-07-23 10:44 ` Evgeniy Polyakov
2007-07-23 11:17 ` Krishna Kumar2
0 siblings, 1 reply; 24+ messages in thread
From: Evgeniy Polyakov @ 2007-07-23 10:44 UTC (permalink / raw)
To: Krishna Kumar
Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
kumarkr, rdreier, mcarlson, kaber, jeff, general, netdev, tgraf,
hadi, davem, mchan, sri
Hi Krishna.
On Sun, Jul 22, 2007 at 02:35:25PM +0530, Krishna Kumar (krkumar2@in.ibm.com) wrote:
> diff -ruNp org/net/core/dev.c rev2/net/core/dev.c
> --- org/net/core/dev.c 2007-07-20 07:49:28.000000000 +0530
> +++ rev2/net/core/dev.c 2007-07-21 23:08:33.000000000 +0530
> @@ -875,6 +875,48 @@ void netdev_state_change(struct net_devi
> }
> }
>
> +/*
> + * dev_change_tx_batching - Enable or disable batching for a driver that
> + * supports batching.
> + */
> +int dev_change_tx_batching(struct net_device *dev, unsigned long new_batch_skb)
> +{
> + int ret;
> +
> + if (!dev->hard_start_xmit_batch) {
> + /* Driver doesn't support skb batching */
> + ret = -ENOTSUPP;
> + goto out;
> + }
> +
> + /* Handle invalid argument */
> + if (new_batch_skb < 0) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + ret = 0;
> +
> + /* Check if new value is same as the current */
> + if (!!(dev->features & NETIF_F_BATCH_ON) == !!new_batch_skb)
> + goto out;
o_O
Scratched head for too long before understood what it means :)
> + spin_lock(&dev->queue_lock);
> + if (new_batch_skb) {
> + dev->features |= NETIF_F_BATCH_ON;
> + dev->tx_queue_len >>= 1;
> + } else {
> + if (!skb_queue_empty(&dev->skb_blist))
> + skb_queue_purge(&dev->skb_blist);
> + dev->features &= ~NETIF_F_BATCH_ON;
> + dev->tx_queue_len <<= 1;
> + }
> + spin_unlock(&dev->queue_lock);
Hmm, should this also stop interrupts?
--
Evgeniy Polyakov
* [ofa-general] Re: [PATCH 11/12 -Rev2] IPoIB xmit API addition
2007-07-22 9:06 ` [ofa-general] [PATCH 11/12 -Rev2] IPoIB xmit API addition Krishna Kumar
2007-07-22 9:41 ` Michael S. Tsirkin
@ 2007-07-23 10:48 ` Evgeniy Polyakov
2007-07-23 11:17 ` Krishna Kumar2
1 sibling, 1 reply; 24+ messages in thread
From: Evgeniy Polyakov @ 2007-07-23 10:48 UTC (permalink / raw)
To: Krishna Kumar
Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, herbert, gaagaan,
kumarkr, rdreier, mcarlson, kaber, jeff, general, netdev, tgraf,
hadi, davem, mchan, sri
On Sun, Jul 22, 2007 at 02:36:49PM +0530, Krishna Kumar (krkumar2@in.ibm.com) wrote:
> diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_ib.c rev2/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> --- org/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-07-20 07:49:28.000000000 +0530
> +++ rev2/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-07-22 00:08:37.000000000 +0530
> @@ -242,8 +242,9 @@ repost:
> static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
> {
> struct ipoib_dev_priv *priv = netdev_priv(dev);
> + int i = 0, num_completions;
> + int tx_ring_index = priv->tx_tail & (ipoib_sendq_size - 1);
> unsigned int wr_id = wc->wr_id;
> - struct ipoib_tx_buf *tx_req;
> unsigned long flags;
>
> ipoib_dbg_data(priv, "send completion: id %d, status: %d\n",
> @@ -255,23 +256,57 @@ static void ipoib_ib_handle_tx_wc(struct
> return;
> }
>
> - tx_req = &priv->tx_ring[wr_id];
> + num_completions = wr_id - tx_ring_index + 1;
> + if (num_completions <= 0)
> + num_completions += ipoib_sendq_size;
Can this still be less than zero?
--
Evgeniy Polyakov
* [ofa-general] Re: [PATCH 03/12 -Rev2] dev.c changes.
2007-07-23 10:44 ` [ofa-general] " Evgeniy Polyakov
@ 2007-07-23 11:17 ` Krishna Kumar2
0 siblings, 0 replies; 24+ messages in thread
From: Krishna Kumar2 @ 2007-07-23 11:17 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: jagana, mcarlson, herbert, gaagaan, Robert.Olsson, kumarkr,
rdreier, peter.p.waskiewicz.jr, hadi, kaber, jeff, general, mchan,
tgraf, netdev, davem, sri
Hi Evgeniy,
Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote on 07/23/2007 04:14:28 PM:
> > +/*
> > + * dev_change_tx_batching - Enable or disable batching for a driver that
> > + * supports batching.
> > + /* Check if new value is same as the current */
> > + if (!!(dev->features & NETIF_F_BATCH_ON) == !!new_batch_skb)
> > + goto out;
>
> o_O
>
> Scratched head for too long before understood what it means :)
Is there an easy way to do this?
> > + spin_lock(&dev->queue_lock);
> > + if (new_batch_skb) {
> > + dev->features |= NETIF_F_BATCH_ON;
> > + dev->tx_queue_len >>= 1;
> > + } else {
> > + if (!skb_queue_empty(&dev->skb_blist))
> > + skb_queue_purge(&dev->skb_blist);
> > + dev->features &= ~NETIF_F_BATCH_ON;
> > + dev->tx_queue_len <<= 1;
> > + }
> > + spin_unlock(&dev->queue_lock);
>
> Hmm, should this also stop interrupts?
That is a good question, and I am not sure. I thought it was not
required, though adding it doesn't affect the code either. Can someone
tell whether disabling bh is required here, and why? (I couldn't figure
out the intention of the bh disable in dev_queue_xmit either; is it to
disable preemption?)
Thanks,
- KK
* [ofa-general] Re: [PATCH 11/12 -Rev2] IPoIB xmit API addition
2007-07-23 10:48 ` Evgeniy Polyakov
@ 2007-07-23 11:17 ` Krishna Kumar2
0 siblings, 0 replies; 24+ messages in thread
From: Krishna Kumar2 @ 2007-07-23 11:17 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: jagana, mcarlson, herbert, gaagaan, Robert.Olsson, kumarkr,
rdreier, peter.p.waskiewicz.jr, hadi, kaber, jeff, general, mchan,
tgraf, netdev, davem, sri
Hi Evgeniy,
Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote on 07/23/2007 04:18:26 PM:
> > static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
> > {
> > struct ipoib_dev_priv *priv = netdev_priv(dev);
> > + int i = 0, num_completions;
> > + int tx_ring_index = priv->tx_tail & (ipoib_sendq_size - 1);
> > unsigned int wr_id = wc->wr_id;
> > - struct ipoib_tx_buf *tx_req;
> > unsigned long flags;
> >
> > ipoib_dbg_data(priv, "send completion: id %d, status: %d\n",
> > @@ -255,23 +256,57 @@ static void ipoib_ib_handle_tx_wc(struct
> > return;
> > }
> >
> > - tx_req = &priv->tx_ring[wr_id];
> > + num_completions = wr_id - tx_ring_index + 1;
> > + if (num_completions <= 0)
> > + num_completions += ipoib_sendq_size;
>
> Can this still be less than zero?
It should never happen; otherwise the TX code would have written to
bad/unallocated memory and crashed first.
Thanks,
- KK
end of thread, other threads:[~2007-07-23 11:17 UTC | newest]
Thread overview: 24+ messages
2007-07-22 9:04 [PATCH 00/12 -Rev2] Implement batching skb API Krishna Kumar
2007-07-22 9:05 ` [ofa-general] [PATCH 01/12 -Rev2] HOWTO documentation for Batching SKB Krishna Kumar
2007-07-22 9:05 ` [PATCH 02/12 -Rev2] Changes to netdevice.h Krishna Kumar
2007-07-22 17:06 ` [ofa-general] " Patrick McHardy
2007-07-23 2:57 ` Krishna Kumar2
2007-07-22 9:05 ` [ofa-general] [PATCH 03/12 -Rev2] dev.c changes Krishna Kumar
2007-07-23 10:44 ` [ofa-general] " Evgeniy Polyakov
2007-07-23 11:17 ` Krishna Kumar2
2007-07-22 9:05 ` [ofa-general] [PATCH 04/12 -Rev2] Ethtool changes Krishna Kumar
2007-07-22 9:05 ` [ofa-general] [PATCH 05/12 -Rev2] sysfs changes Krishna Kumar
2007-07-22 9:05 ` [ofa-general] [PATCH 06/12 -Rev2] rtnetlink changes Krishna Kumar
2007-07-22 17:10 ` Patrick McHardy
2007-07-23 2:54 ` [ofa-general] " Krishna Kumar2
2007-07-22 9:06 ` [ofa-general] [PATCH 07/12 -Rev2] Change qdisc_run & qdisc_restart API, callers Krishna Kumar
2007-07-22 9:06 ` [ofa-general] [PATCH 08/12 -Rev2] IPoIB include file changes Krishna Kumar
2007-07-22 9:06 ` [ofa-general] [PATCH 09/12 -Rev2] IPoIB verbs changes Krishna Kumar
2007-07-22 9:06 ` [ofa-general] [PATCH 10/12 -Rev2] IPoIB multicast, CM changes Krishna Kumar
2007-07-22 9:06 ` [ofa-general] [PATCH 11/12 -Rev2] IPoIB xmit API addition Krishna Kumar
2007-07-22 9:41 ` Michael S. Tsirkin
2007-07-23 2:53 ` [ofa-general] " Krishna Kumar2
2007-07-23 10:48 ` Evgeniy Polyakov
2007-07-23 11:17 ` Krishna Kumar2
2007-07-22 9:06 ` [ofa-general] [PATCH 12/12 -Rev2] IPoIB xmit internals changes (ipoib_ib.c) Krishna Kumar
2007-07-23 9:53 ` [PATCH 00/12 -Rev2] Implement batching skb API Krishna Kumar2