* [PATCH 1/10 Rev4] [Doc] HOWTO Documentation for batching
2007-08-22 8:28 [PATCH 0/10 Rev4] Implement skb batching and support in IPoIB Krishna Kumar
@ 2007-08-22 8:28 ` Krishna Kumar
2007-08-22 15:50 ` [ofa-general] " Randy Dunlap
2007-08-22 8:29 ` [PATCH 2/10 Rev4] [core] Add skb_blist & support " Krishna Kumar
` (8 subsequent siblings)
9 siblings, 1 reply; 15+ messages in thread
From: Krishna Kumar @ 2007-08-22 8:28 UTC (permalink / raw)
To: johnpol, herbert, hadi, kaber, shemminger, davem
Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, kumarkr, xma,
gaagaan, netdev, rdreier, rick.jones2, mcarlson, jeff, general,
mchan, tgraf, Krishna Kumar, sri
Add Documentation describing batching skb xmit capability.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
batching_skb_xmit.txt | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 78 insertions(+)
diff -ruNp org/Documentation/networking/batching_skb_xmit.txt new/Documentation/networking/batching_skb_xmit.txt
--- org/Documentation/networking/batching_skb_xmit.txt 1970-01-01 05:30:00.000000000 +0530
+++ new/Documentation/networking/batching_skb_xmit.txt 2007-08-22 10:21:19.000000000 +0530
@@ -0,0 +1,78 @@
+ HOWTO for batching skb xmit support
+ -----------------------------------
+
+Section 1: What is batching skb xmit
+Section 2: How batching xmit works vs the regular xmit
+Section 3: How drivers can support batching
+Section 4: How users can work with batching
+
+
+Introduction: Kernel support for batching skb
+----------------------------------------------
+
+A new capability to support xmit of multiple skbs is provided in the netdevice
+layer. Drivers which enable this capability should be able to process multiple
+skbs in a single call to their xmit handler.
+
+
+Section 1: What is batching skb xmit
+-------------------------------------
+
+ This capability is optionally enabled by a driver by setting the
+ NETIF_F_BATCH_SKBS bit in dev->features. The pre-requisite for a
+ driver to use this capability is that it should have a reasonably
+ sized hardware queue that can process multiple skbs.
+
+
+Section 2: How batching xmit works vs the regular xmit
+-------------------------------------------------------
+
+ The network stack gets called from upper layer protocols with a single
+ skb to transmit. This skb is first enqueue'd and an attempt is made to
+ transmit it immediately (via qdisc_run). However, events like tx lock
+ contention, tx queue stopped, etc, can result in the skb not getting
+ sent out and it remains in the queue. When the next xmit is called or
+ when the queue is re-enabled, qdisc_run could potentially find
+ multiple packets in the queue, and iteratively send them all out
+ one-by-one.
+
+ Batching skb xmit is a mechanism to exploit this situation where all
+ skbs can be passed in one shot to the device. This reduces driver
+ processing, locking at the driver (or in stack for ~LLTX drivers)
+ gets amortized over multiple skbs, and in case of specific drivers
+ where every xmit results in a completion processing (like IPoIB) -
+ optimizations can be made in the driver to request a completion for
+ only the last skb that was sent which results in saving interrupts
+ for every (but the last) skb that was sent in the same batch.
+
+ Batching can result in significant performance gains for systems that
+ have multiple data stream paths over the same network interface card.
+
+
+Section 3: How drivers can support batching
+---------------------------------------------
+
+ Batching requires the driver to set the NETIF_F_BATCH_SKBS bit in
+ dev->features.
+
+ The driver's xmit handler should be modified to process multiple skbs
+ instead of one skb. The driver's xmit handler is called either with a
+ skb to transmit or NULL skb, where the latter case should be handled
+ as a call to xmit multiple skbs. This is done by sending out all skbs
+ in the dev->skb_blist list (where it was added by the core stack).
+
+
+Section 4: How users can work with batching
+---------------------------------------------
+
+ Batching can be disabled for a particular device, e.g. on desktop
+ systems if only one stream of network activity for that device is
+ taking place, since performance could be slightly affected due to
+ extra processing that batching adds (unless packets are getting
+ sent fast resulting in stopped queue's). Batching can be enabled if
+ more than one stream of network activity per device is being done,
+ e.g. on servers; or even desktop usage with multiple browser, chat,
+ file transfer sessions, etc.
+
+ Per device batching can be enabled/disabled by passing 'on' or 'off'
+ respectively to ethtool.
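A minimal sketch of the driver side that Section 3 describes (illustrative
only, not part of this patch; drv_queue_one() and drv_ring_full() are
made-up helpers, while NETIF_F_BATCH_SKBS and dev->skb_blist come from the
later patches in this series):

static int drv_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	if (skb)			/* regular single-skb xmit */
		return drv_queue_one(skb, dev);

	/* skb == NULL: batching xmit, drain the core's dev->skb_blist */
	while ((skb = __skb_dequeue(dev->skb_blist)) != NULL) {
		if (drv_ring_full(dev)) {
			/* no descriptors left: keep the rest for a retry */
			__skb_queue_head(dev->skb_blist, skb);
			netif_stop_queue(dev);
			return NETDEV_TX_BUSY;
		}
		drv_queue_one(skb, dev);
	}
	return NETDEV_TX_OK;
}

The driver advertises the capability by setting NETIF_F_BATCH_SKBS in
dev->features at probe time; plain __skb_dequeue() is enough here because
only the xmit path consumes skb_blist while the qdisc-running bit is held
(see the comment above dev_add_skb_to_blist() later in the series).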
* [ofa-general] Re: [PATCH 1/10 Rev4] [Doc] HOWTO Documentation for batching
2007-08-22 8:28 ` [PATCH 1/10 Rev4] [Doc] HOWTO Documentation for batching Krishna Kumar
@ 2007-08-22 15:50 ` Randy Dunlap
2007-08-23 2:48 ` Krishna Kumar2
0 siblings, 1 reply; 15+ messages in thread
From: Randy Dunlap @ 2007-08-22 15:50 UTC (permalink / raw)
To: Krishna Kumar
Cc: johnpol, jagana, herbert, gaagaan, Robert.Olsson, kumarkr,
rdreier, peter.p.waskiewicz.jr, hadi, kaber, jeff, general,
netdev, tgraf, mcarlson, sri, shemminger, davem, mchan
On Wed, 22 Aug 2007 13:58:58 +0530 Krishna Kumar wrote:
> Add Documentation describing batching skb xmit capability.
>
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> ---
> batching_skb_xmit.txt | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 78 insertions(+)
>
> diff -ruNp org/Documentation/networking/batching_skb_xmit.txt new/Documentation/networking/batching_skb_xmit.txt
> --- org/Documentation/networking/batching_skb_xmit.txt 1970-01-01 05:30:00.000000000 +0530
> +++ new/Documentation/networking/batching_skb_xmit.txt 2007-08-22 10:21:19.000000000 +0530
> @@ -0,0 +1,78 @@
> + HOWTO for batching skb xmit support
> + -----------------------------------
> +
> +Section 1: What is batching skb xmit
> +Section 2: How batching xmit works vs the regular xmit
> +Section 3: How drivers can support batching
> +Section 4: How users can work with batching
> +
> +
> +Introduction: Kernel support for batching skb
> +----------------------------------------------
> +
> +A new capability to support xmit of multiple skbs is provided in the netdevice
> +layer. Drivers which enable this capability should be able to process multiple
> +skbs in a single call to their xmit handler.
> +
> +
> +Section 1: What is batching skb xmit
> +-------------------------------------
> +
> + This capability is optionally enabled by a driver by setting the
> + NETIF_F_BATCH_SKBS bit in dev->features. The pre-requisite for a
prerequisite
> + driver to use this capability is that it should have a reasonably
I would say "reasonably-sized".
> + sized hardware queue that can process multiple skbs.
> +
> +
> +Section 2: How batching xmit works vs the regular xmit
> +-------------------------------------------------------
> +
> + The network stack gets called from upper layer protocols with a single
> + skb to transmit. This skb is first enqueue'd and an attempt is made to
enqueued
> + transmit it immediately (via qdisc_run). However, events like tx lock
> + contention, tx queue stopped, etc, can result in the skb not getting
etc.,
> + sent out and it remains in the queue. When the next xmit is called or
> + when the queue is re-enabled, qdisc_run could potentially find
> + multiple packets in the queue, and iteratively send them all out
> + one-by-one.
> +
> + Batching skb xmit is a mechanism to exploit this situation where all
> + skbs can be passed in one shot to the device. This reduces driver
> + processing, locking at the driver (or in stack for ~LLTX drivers)
> + gets amortized over multiple skbs, and in case of specific drivers
> + where every xmit results in a completion processing (like IPoIB) -
> + optimizations can be made in the driver to request a completion for
> + only the last skb that was sent which results in saving interrupts
> + for every (but the last) skb that was sent in the same batch.
> +
> + Batching can result in significant performance gains for systems that
> + have multiple data stream paths over the same network interface card.
> +
> +
> +Section 3: How drivers can support batching
> +---------------------------------------------
> +
> + Batching requires the driver to set the NETIF_F_BATCH_SKBS bit in
> + dev->features.
> +
> + The driver's xmit handler should be modified to process multiple skbs
> + instead of one skb. The driver's xmit handler is called either with a
an
> + skb to transmit or NULL skb, where the latter case should be handled
> + as a call to xmit multiple skbs. This is done by sending out all skbs
> + in the dev->skb_blist list (where it was added by the core stack).
> +
> +
> +Section 4: How users can work with batching
> +---------------------------------------------
> +
> + Batching can be disabled for a particular device, e.g. on desktop
> + systems if only one stream of network activity for that device is
> + taking place, since performance could be slightly affected due to
> + extra processing that batching adds (unless packets are getting
> + sent fast resulting in stopped queue's). Batching can be enabled if
queues).
> + more than one stream of network activity per device is being done,
> + e.g. on servers; or even desktop usage with multiple browser, chat,
> + file transfer sessions, etc.
> +
> + Per device batching can be enabled/disabled by passing 'on' or 'off'
> + respectively to ethtool.
with what other parameter(s), e.g.,
ethtool <dev> batching on/off ?
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
* Re: [PATCH 1/10 Rev4] [Doc] HOWTO Documentation for batching
2007-08-22 15:50 ` [ofa-general] " Randy Dunlap
@ 2007-08-23 2:48 ` Krishna Kumar2
0 siblings, 0 replies; 15+ messages in thread
From: Krishna Kumar2 @ 2007-08-23 2:48 UTC (permalink / raw)
To: Randy Dunlap
Cc: davem, gaagaan, general, hadi, herbert, jagana, jeff, johnpol,
kaber, kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr,
rdreier, rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma
Hi Randy,
Thanks for your suggestions. Will clean up those changes.
- KK
* [PATCH 2/10 Rev4] [core] Add skb_blist & support for batching
2007-08-22 8:28 [PATCH 0/10 Rev4] Implement skb batching and support in IPoIB Krishna Kumar
2007-08-22 8:28 ` [PATCH 1/10 Rev4] [Doc] HOWTO Documentation for batching Krishna Kumar
@ 2007-08-22 8:29 ` Krishna Kumar
2007-08-22 8:29 ` [PATCH 3/10 Rev4] [sched] Modify qdisc_run to support batching Krishna Kumar
` (7 subsequent siblings)
9 siblings, 0 replies; 15+ messages in thread
From: Krishna Kumar @ 2007-08-22 8:29 UTC (permalink / raw)
To: johnpol, herbert, hadi, kaber, shemminger, davem
Cc: jagana, Robert.Olsson, rick.jones2, xma, gaagaan, kumarkr,
rdreier, peter.p.waskiewicz.jr, mcarlson, jeff, Krishna Kumar,
general, netdev, tgraf, mchan, sri
Introduce skb_blist, NETIF_F_BATCH_SKBS, use single API for
batching/no-batching, etc.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
include/linux/netdevice.h | 4 ++++
net/core/dev.c | 21 ++++++++++++++++++---
2 files changed, 22 insertions(+), 3 deletions(-)
diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
--- org/include/linux/netdevice.h 2007-08-20 14:26:36.000000000 +0530
+++ new/include/linux/netdevice.h 2007-08-22 08:42:10.000000000 +0530
@@ -399,6 +399,7 @@ struct net_device
#define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */
#define NETIF_F_GSO 2048 /* Enable software GSO. */
#define NETIF_F_LLTX 4096 /* LockLess TX */
+#define NETIF_F_BATCH_SKBS 8192 /* Driver supports multiple skbs/xmit */
#define NETIF_F_MULTI_QUEUE 16384 /* Has multiple TX/RX queues */
#define NETIF_F_LRO 32768 /* large receive offload */
@@ -510,6 +511,9 @@ struct net_device
/* Partially transmitted GSO packet. */
struct sk_buff *gso_skb;
+ /* List of batch skbs (optional, used if driver supports skb batching) */
+ struct sk_buff_head *skb_blist;
+
/* ingress path synchronizer */
spinlock_t ingress_lock;
struct Qdisc *qdisc_ingress;
diff -ruNp org/net/core/dev.c new/net/core/dev.c
--- org/net/core/dev.c 2007-08-20 14:26:37.000000000 +0530
+++ new/net/core/dev.c 2007-08-22 10:49:22.000000000 +0530
@@ -898,6 +898,16 @@ void netdev_state_change(struct net_devi
}
}
+static void free_batching(struct net_device *dev)
+{
+ if (dev->skb_blist) {
+ if (!skb_queue_empty(dev->skb_blist))
+ skb_queue_purge(dev->skb_blist);
+ kfree(dev->skb_blist);
+ dev->skb_blist = NULL;
+ }
+}
+
/**
* dev_load - load a network module
* @name: name of interface
@@ -1458,7 +1468,9 @@ static int dev_gso_segment(struct sk_buf
int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
- if (likely(!skb->next)) {
+ if (likely(skb)) {
+ if (unlikely(skb->next))
+ goto gso;
if (!list_empty(&ptype_all))
dev_queue_xmit_nit(skb, dev);
@@ -1468,10 +1480,10 @@ int dev_hard_start_xmit(struct sk_buff *
if (skb->next)
goto gso;
}
-
- return dev->hard_start_xmit(skb, dev);
}
+ return dev->hard_start_xmit(skb, dev);
+
gso:
do {
struct sk_buff *nskb = skb->next;
@@ -3791,6 +3803,9 @@ void unregister_netdevice(struct net_dev
synchronize_net();
+ /* Deallocate batching structure */
+ free_batching(dev);
+
/* Shutdown queueing discipline. */
dev_shutdown(dev);
* [PATCH 3/10 Rev4] [sched] Modify qdisc_run to support batching
2007-08-22 8:28 [PATCH 0/10 Rev4] Implement skb batching and support in IPoIB Krishna Kumar
2007-08-22 8:28 ` [PATCH 1/10 Rev4] [Doc] HOWTO Documentation for batching Krishna Kumar
2007-08-22 8:29 ` [PATCH 2/10 Rev4] [core] Add skb_blist & support " Krishna Kumar
@ 2007-08-22 8:29 ` Krishna Kumar
2007-08-22 8:29 ` [PATCH 4/10 Rev4] [ethtool] Add ethtool support Krishna Kumar
` (6 subsequent siblings)
9 siblings, 0 replies; 15+ messages in thread
From: Krishna Kumar @ 2007-08-22 8:29 UTC (permalink / raw)
To: johnpol, herbert, hadi, kaber, shemminger, davem
Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, xma, gaagaan,
kumarkr, rdreier, rick.jones2, mcarlson, jeff, mchan, general,
netdev, tgraf, Krishna Kumar, sri
Modify qdisc_run() to support batching. Modify callers of qdisc_run to
use batching, modify qdisc_restart to implement batching.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
include/linux/netdevice.h | 2 +
include/net/pkt_sched.h | 6 +--
net/core/dev.c | 44 +++++++++++++++++++++++++++-
net/sched/sch_generic.c | 70 ++++++++++++++++++++++++++++++++++++++--------
4 files changed, 105 insertions(+), 17 deletions(-)
diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h
--- org/include/net/pkt_sched.h 2007-08-20 14:26:36.000000000 +0530
+++ new/include/net/pkt_sched.h 2007-08-22 09:23:57.000000000 +0530
@@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge
struct rtattr *tab);
extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
-extern void __qdisc_run(struct net_device *dev);
+extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist);
-static inline void qdisc_run(struct net_device *dev)
+static inline void qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
{
if (!netif_queue_stopped(dev) &&
!test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
- __qdisc_run(dev);
+ __qdisc_run(dev, blist);
}
extern int tc_classify_compat(struct sk_buff *skb, struct tcf_proto *tp,
diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
--- org/include/linux/netdevice.h 2007-08-20 14:26:36.000000000 +0530
+++ new/include/linux/netdevice.h 2007-08-22 08:42:10.000000000 +0530
@@ -892,6 +896,8 @@ extern int dev_set_mac_address(struct n
struct sockaddr *);
extern int dev_hard_start_xmit(struct sk_buff *skb,
struct net_device *dev);
+extern int dev_add_skb_to_blist(struct sk_buff *skb,
+ struct net_device *dev);
extern void dev_init(void);
diff -ruNp org/net/sched/sch_generic.c new/net/sched/sch_generic.c
--- org/net/sched/sch_generic.c 2007-08-20 14:26:37.000000000 +0530
+++ new/net/sched/sch_generic.c 2007-08-22 08:49:55.000000000 +0530
@@ -59,10 +59,12 @@ static inline int qdisc_qlen(struct Qdis
static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
struct Qdisc *q)
{
- if (unlikely(skb->next))
- dev->gso_skb = skb;
- else
- q->ops->requeue(skb, q);
+ if (likely(skb)) {
+ if (unlikely(skb->next))
+ dev->gso_skb = skb;
+ else
+ q->ops->requeue(skb, q);
+ }
netif_schedule(dev);
return 0;
@@ -91,10 +93,15 @@ static inline int handle_dev_cpu_collisi
/*
* Same CPU holding the lock. It may be a transient
* configuration error, when hard_start_xmit() recurses. We
- * detect it by checking xmit owner and drop the packet when
- * deadloop is detected. Return OK to try the next skb.
+ * detect it by checking xmit owner and drop the packet (or
+ * all packets in batching case) when deadloop is detected.
+ * Return OK to try the next skb.
*/
- kfree_skb(skb);
+ if (likely(skb))
+ kfree_skb(skb);
+ else if (!skb_queue_empty(dev->skb_blist))
+ skb_queue_purge(dev->skb_blist);
+
if (net_ratelimit())
printk(KERN_WARNING "Dead loop on netdevice %s, "
"fix it urgently!\n", dev->name);
@@ -112,6 +119,38 @@ static inline int handle_dev_cpu_collisi
}
/*
+ * Algorithm to get skb(s) is:
+ * - Non batching drivers, or if the batch list is empty and there is
+ * 1 skb in the queue - dequeue skb and put it in *skbp to tell the
+ * caller to use the single xmit API.
+ * - Batching drivers where the batch list already contains at least one
+ * skb, or if there are multiple skbs in the queue: keep dequeueing
+ * skbs up to a limit and set *skbp to NULL to tell the caller to use
+ * the multiple xmit API.
+ *
+ * Returns:
+ * 1 - at least one skb is to be sent out, *skbp contains skb or NULL
+ * (in case >1 skbs present in blist for batching)
+ * 0 - no skbs to be sent.
+ */
+static inline int get_skb(struct net_device *dev, struct Qdisc *q,
+ struct sk_buff_head *blist, struct sk_buff **skbp)
+{
+ if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1))) {
+ return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
+ } else {
+ struct sk_buff *skb;
+ int max = dev->tx_queue_len - skb_queue_len(blist);
+
+ while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
+ max -= dev_add_skb_to_blist(skb, dev);
+
+ *skbp = NULL;
+ return 1; /* there is at least one skb in skb_blist */
+ }
+}
+
+/*
* NOTE: Called under dev->queue_lock with locally disabled BH.
*
* __LINK_STATE_QDISC_RUNNING guarantees only one CPU can process this
@@ -130,7 +169,8 @@ static inline int handle_dev_cpu_collisi
* >0 - queue is not empty.
*
*/
-static inline int qdisc_restart(struct net_device *dev)
+static inline int qdisc_restart(struct net_device *dev,
+ struct sk_buff_head *blist)
{
struct Qdisc *q = dev->qdisc;
struct sk_buff *skb;
@@ -138,7 +178,7 @@ static inline int qdisc_restart(struct n
int ret;
/* Dequeue packet */
- if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
+ if (unlikely(!get_skb(dev, q, blist, &skb)))
return 0;
/*
@@ -168,7 +208,7 @@ static inline int qdisc_restart(struct n
switch (ret) {
case NETDEV_TX_OK:
- /* Driver sent out skb successfully */
+ /* Driver sent out skb (or entire skb_blist) successfully */
ret = qdisc_qlen(q);
break;
@@ -190,10 +230,10 @@ static inline int qdisc_restart(struct n
return ret;
}
-void __qdisc_run(struct net_device *dev)
+void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
{
do {
- if (!qdisc_restart(dev))
+ if (!qdisc_restart(dev, blist))
break;
} while (!netif_queue_stopped(dev));
@@ -563,6 +603,12 @@ void dev_deactivate(struct net_device *d
qdisc = dev->qdisc;
dev->qdisc = &noop_qdisc;
+ if (dev->skb_blist) {
+ /* Release skbs on batch list */
+ if (!skb_queue_empty(dev->skb_blist))
+ skb_queue_purge(dev->skb_blist);
+ }
+
qdisc_reset(qdisc);
skb = dev->gso_skb;
diff -ruNp org/net/core/dev.c new/net/core/dev.c
--- org/net/core/dev.c 2007-08-20 14:26:37.000000000 +0530
+++ new/net/core/dev.c 2007-08-22 10:49:22.000000000 +0530
@@ -1466,6 +1466,45 @@ static int dev_gso_segment(struct sk_buf
return 0;
}
+/*
+ * Add skb (skbs in case segmentation is required) to dev->skb_blist. No one
+ * can add to this list simultaneously since we are holding QDISC RUNNING
+ * bit. The list is also safe from simultaneous deletes since skbs are
+ * dequeued only when the driver is invoked.
+ *
+ * Returns count of successful skb(s) added to skb_blist.
+ */
+int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev)
+{
+ if (!list_empty(&ptype_all))
+ dev_queue_xmit_nit(skb, dev);
+
+ if (netif_needs_gso(dev, skb)) {
+ if (unlikely(dev_gso_segment(skb))) {
+ kfree_skb(skb);
+ return 0;
+ }
+
+ if (skb->next) {
+ int count = 0;
+
+ do {
+ struct sk_buff *nskb = skb->next;
+
+ skb->next = nskb->next;
+ __skb_queue_tail(dev->skb_blist, nskb);
+ count++;
+ } while (skb->next);
+
+ skb->destructor = DEV_GSO_CB(skb)->destructor;
+ kfree_skb(skb);
+ return count;
+ }
+ }
+ __skb_queue_tail(dev->skb_blist, skb);
+ return 1;
+}
+
int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
if (likely(skb)) {
@@ -1620,7 +1659,7 @@ gso:
/* reset queue_mapping to zero */
skb->queue_mapping = 0;
rc = q->enqueue(skb, q);
- qdisc_run(dev);
+ qdisc_run(dev, NULL);
spin_unlock(&dev->queue_lock);
rc = rc == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : rc;
@@ -1818,7 +1857,8 @@ static void net_tx_action(struct softirq
clear_bit(__LINK_STATE_SCHED, &dev->state);
if (spin_trylock(&dev->queue_lock)) {
- qdisc_run(dev);
+ /* Send all skbs if driver supports batching */
+ qdisc_run(dev, dev->skb_blist);
spin_unlock(&dev->queue_lock);
} else {
netif_schedule(dev);
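A quick worked reading of get_skb() above; the numbers are illustrative
only and assume a batching driver (blist != NULL) with dev->tx_queue_len
set to 1000:

	/*
	 * backlog = 1 skb, blist empty:  *skbp = that skb, the driver is
	 *                                called exactly as before.
	 * backlog = 40, blist empty:     all 40 skbs are moved onto
	 *                                dev->skb_blist, *skbp = NULL, and
	 *                                one dev_hard_start_xmit(NULL, dev)
	 *                                sends the whole batch.
	 * blist already holds 5 skbs:    batch path again, but at most
	 *                                1000 - 5 further skbs are pulled
	 *                                from the qdisc before the driver
	 *                                is asked to drain the list.
	 */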
* [PATCH 4/10 Rev4] [ethtool] Add ethtool support
2007-08-22 8:28 [PATCH 0/10 Rev4] Implement skb batching and support in IPoIB Krishna Kumar
` (2 preceding siblings ...)
2007-08-22 8:29 ` [PATCH 3/10 Rev4] [sched] Modify qdisc_run to support batching Krishna Kumar
@ 2007-08-22 8:29 ` Krishna Kumar
2007-08-22 8:30 ` [PATCH 5/10 Rev4] [IPoIB] Header file changes Krishna Kumar
` (5 subsequent siblings)
9 siblings, 0 replies; 15+ messages in thread
From: Krishna Kumar @ 2007-08-22 8:29 UTC (permalink / raw)
To: johnpol, herbert, hadi, kaber, shemminger, davem
Cc: jagana, Robert.Olsson, rick.jones2, xma, gaagaan, kumarkr,
rdreier, peter.p.waskiewicz.jr, mcarlson, jeff, general, mchan,
tgraf, netdev, Krishna Kumar, sri
Add ethtool support to enable/disable batching.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
include/linux/ethtool.h | 2 ++
include/linux/netdevice.h | 2 ++
net/core/dev.c | 36 ++++++++++++++++++++++++++++++++++++
net/core/ethtool.c | 27 +++++++++++++++++++++++++++
4 files changed, 67 insertions(+)
diff -ruNp org/include/linux/ethtool.h new/include/linux/ethtool.h
--- org/include/linux/ethtool.h 2007-08-20 14:26:35.000000000 +0530
+++ new/include/linux/ethtool.h 2007-08-22 08:37:35.000000000 +0530
@@ -440,6 +440,8 @@ struct ethtool_ops {
#define ETHTOOL_SFLAGS 0x00000026 /* Set flags bitmap(ethtool_value) */
#define ETHTOOL_GPFLAGS 0x00000027 /* Get driver-private flags bitmap */
#define ETHTOOL_SPFLAGS 0x00000028 /* Set driver-private flags bitmap */
+#define ETHTOOL_GBATCH 0x00000029 /* Get Batching (ethtool_value) */
+#define ETHTOOL_SBATCH 0x00000030 /* Set Batching (ethtool_value) */
/* compatibility with older code */
#define SPARC_ETH_GSET ETHTOOL_GSET
diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
--- org/include/linux/netdevice.h 2007-08-20 14:26:36.000000000 +0530
+++ new/include/linux/netdevice.h 2007-08-22 08:42:10.000000000 +0530
@@ -1152,6 +1152,8 @@ extern void dev_set_promiscuity(struct
extern void dev_set_allmulti(struct net_device *dev, int inc);
extern void netdev_state_change(struct net_device *dev);
extern void netdev_features_change(struct net_device *dev);
+extern int dev_change_tx_batch_skb(struct net_device *dev,
+ unsigned long new_batch_skb);
/* Load a device via the kmod */
extern void dev_load(const char *name);
extern void dev_mcast_init(void);
diff -ruNp org/net/core/dev.c new/net/core/dev.c
--- org/net/core/dev.c 2007-08-20 14:26:37.000000000 +0530
+++ new/net/core/dev.c 2007-08-22 10:49:22.000000000 +0530
@@ -908,6 +908,42 @@ static void free_batching(struct net_dev
}
}
+int dev_change_tx_batch_skb(struct net_device *dev, unsigned long new_batch_skb)
+{
+ int ret = 0;
+ struct sk_buff_head *blist;
+
+ if (!(dev->features & NETIF_F_BATCH_SKBS)) {
+ /* Driver doesn't support batching skb API */
+ ret = -ENOTSUPP;
+ goto out;
+ }
+
+ /*
+ * Check if the new value is the same as the current one (the '!!' on
+ * new_batch_skb is paranoia, as it should already be boolean).
+ */
+ if (!!dev->skb_blist == !!new_batch_skb)
+ goto out;
+
+ if (new_batch_skb &&
+ (blist = kmalloc(sizeof *blist, GFP_KERNEL)) == NULL) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ spin_lock_bh(&dev->queue_lock);
+ if (new_batch_skb) {
+ skb_queue_head_init(blist);
+ dev->skb_blist = blist;
+ } else
+ free_batching(dev);
+ spin_unlock_bh(&dev->queue_lock);
+
+out:
+ return ret;
+}
+
/**
* dev_load - load a network module
* @name: name of interface
diff -ruNp org/net/core/ethtool.c new/net/core/ethtool.c
--- org/net/core/ethtool.c 2007-08-20 14:26:37.000000000 +0530
+++ new/net/core/ethtool.c 2007-08-22 08:36:07.000000000 +0530
@@ -556,6 +556,26 @@ static int ethtool_set_gso(struct net_de
return 0;
}
+static int ethtool_get_batch(struct net_device *dev, char __user *useraddr)
+{
+ struct ethtool_value edata = { ETHTOOL_GBATCH };
+
+ edata.data = dev->skb_blist != NULL;
+ if (copy_to_user(useraddr, &edata, sizeof(edata)))
+ return -EFAULT;
+ return 0;
+}
+
+static int ethtool_set_batch(struct net_device *dev, char __user *useraddr)
+{
+ struct ethtool_value edata;
+
+ if (copy_from_user(&edata, useraddr, sizeof(edata)))
+ return -EFAULT;
+
+ return dev_change_tx_batch_skb(dev, edata.data);
+}
+
static int ethtool_self_test(struct net_device *dev, char __user *useraddr)
{
struct ethtool_test test;
@@ -813,6 +833,7 @@ int dev_ethtool(struct ifreq *ifr)
case ETHTOOL_GGSO:
case ETHTOOL_GFLAGS:
case ETHTOOL_GPFLAGS:
+ case ETHTOOL_GBATCH:
break;
default:
if (!capable(CAP_NET_ADMIN))
@@ -956,6 +977,12 @@ int dev_ethtool(struct ifreq *ifr)
rc = ethtool_set_value(dev, useraddr,
dev->ethtool_ops->set_priv_flags);
break;
+ case ETHTOOL_GBATCH:
+ rc = ethtool_get_batch(dev, useraddr);
+ break;
+ case ETHTOOL_SBATCH:
+ rc = ethtool_set_batch(dev, useraddr);
+ break;
default:
rc = -EOPNOTSUPP;
}
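Until the ethtool utility grows a matching option (Randy's question about
"ethtool <dev> batching on/off" in the documentation thread), the new ioctl
can be exercised directly through SIOCETHTOOL. A rough userspace sketch,
assuming only the ETHTOOL_SBATCH value defined by this patch:

#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/types.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

#ifndef ETHTOOL_SBATCH
#define ETHTOOL_SBATCH	0x00000030	/* value used by this patch */
#endif

/* Enable (on != 0) or disable batching on one interface. */
static int set_batching(const char *ifname, __u32 on)
{
	struct ethtool_value eval = { .cmd = ETHTOOL_SBATCH, .data = on };
	struct ifreq ifr;
	int fd, rc;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
	ifr.ifr_data = (char *)&eval;

	/* Kernel side returns -ENOTSUPP if the driver lacks NETIF_F_BATCH_SKBS */
	rc = ioctl(fd, SIOCETHTOOL, &ifr);
	close(fd);
	return rc;
}

int main(int argc, char **argv)
{
	/* usage: ./setbatch <ifname> <0|1> */
	if (argc < 3)
		return 1;
	return set_batching(argv[1], argv[2][0] == '1') ? 1 : 0;
}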
* [PATCH 5/10 Rev4] [IPoIB] Header file changes
2007-08-22 8:28 [PATCH 0/10 Rev4] Implement skb batching and support in IPoIB Krishna Kumar
` (3 preceding siblings ...)
2007-08-22 8:29 ` [PATCH 4/10 Rev4] [ethtool] Add ethtool support Krishna Kumar
@ 2007-08-22 8:30 ` Krishna Kumar
2007-08-22 8:30 ` [PATCH 6/10 Rev4] [IPoIB] CM & Multicast changes Krishna Kumar
` (4 subsequent siblings)
9 siblings, 0 replies; 15+ messages in thread
From: Krishna Kumar @ 2007-08-22 8:30 UTC (permalink / raw)
To: johnpol, herbert, hadi, kaber, shemminger, davem
Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, xma, gaagaan,
kumarkr, rdreier, rick.jones2, mcarlson, jeff, general, mchan,
tgraf, netdev, Krishna Kumar, sri
IPoIB header file changes to use batching.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
ipoib.h | 9 ++++++---
1 files changed, 6 insertions(+), 3 deletions(-)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib.h new/drivers/infiniband/ulp/ipoib/ipoib.h
--- org/drivers/infiniband/ulp/ipoib/ipoib.h 2007-08-20 14:26:26.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib.h 2007-08-22 08:33:51.000000000 +0530
@@ -271,8 +271,8 @@ struct ipoib_dev_priv {
struct ipoib_tx_buf *tx_ring;
unsigned tx_head;
unsigned tx_tail;
- struct ib_sge tx_sge;
- struct ib_send_wr tx_wr;
+ struct ib_sge *tx_sge;
+ struct ib_send_wr *tx_wr;
struct ib_wc ibwc[IPOIB_NUM_WC];
@@ -367,8 +367,11 @@ static inline void ipoib_put_ah(struct i
int ipoib_open(struct net_device *dev);
int ipoib_add_pkey_attr(struct net_device *dev);
+int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb,
+ struct ipoib_dev_priv *priv, struct ipoib_ah *address,
+ u32 qpn, int wr_num);
void ipoib_send(struct net_device *dev, struct sk_buff *skb,
- struct ipoib_ah *address, u32 qpn);
+ struct ipoib_ah *address, u32 qpn, int num_skbs);
void ipoib_reap_ah(struct work_struct *work);
void ipoib_flush_paths(struct net_device *dev);
* [PATCH 6/10 Rev4] [IPoIB] CM & Multicast changes
2007-08-22 8:28 [PATCH 0/10 Rev4] Implement skb batching and support in IPoIB Krishna Kumar
` (4 preceding siblings ...)
2007-08-22 8:30 ` [PATCH 5/10 Rev4] [IPoIB] Header file changes Krishna Kumar
@ 2007-08-22 8:30 ` Krishna Kumar
2007-08-22 8:30 ` [PATCH 7/10 Rev4] [IPoIB] Verbs changes Krishna Kumar
` (3 subsequent siblings)
9 siblings, 0 replies; 15+ messages in thread
From: Krishna Kumar @ 2007-08-22 8:30 UTC (permalink / raw)
To: johnpol, herbert, hadi, kaber, shemminger, davem
Cc: jagana, Robert.Olsson, rick.jones2, xma, gaagaan, kumarkr,
rdreier, peter.p.waskiewicz.jr, mcarlson, jeff, general, mchan,
tgraf, netdev, Krishna Kumar, sri
IPoIB CM & Multicast changes based on header file changes.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
ipoib_cm.c | 13 +++++++++----
ipoib_multicast.c | 4 ++--
2 files changed, 11 insertions(+), 6 deletions(-)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_cm.c new/drivers/infiniband/ulp/ipoib/ipoib_cm.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-08-20 14:26:26.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-08-22 08:33:51.000000000 +0530
@@ -493,14 +493,19 @@ static inline int post_send(struct ipoib
unsigned int wr_id,
u64 addr, int len)
{
+ int ret;
struct ib_send_wr *bad_wr;
- priv->tx_sge.addr = addr;
- priv->tx_sge.length = len;
+ priv->tx_sge[0].addr = addr;
+ priv->tx_sge[0].length = len;
+
+ priv->tx_wr[0].wr_id = wr_id;
- priv->tx_wr.wr_id = wr_id;
+ priv->tx_wr[0].next = NULL;
+ ret = ib_post_send(tx->qp, priv->tx_wr, &bad_wr);
+ priv->tx_wr[0].next = &priv->tx_wr[1];
- return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr);
+ return ret;
}
void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c new/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-08-20 14:26:26.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-08-22 08:33:51.000000000 +0530
@@ -217,7 +217,7 @@ static int ipoib_mcast_join_finish(struc
if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
sizeof (union ib_gid))) {
priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
- priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
+ priv->tx_wr[0].wr.ud.remote_qkey = priv->qkey;
}
if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -736,7 +736,7 @@ out:
}
}
- ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN);
+ ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN, 1);
}
unlock:
* [PATCH 7/10 Rev4] [IPoIB] Verbs changes
2007-08-22 8:28 [PATCH 0/10 Rev4] Implement skb batching and support in IPoIB Krishna Kumar
` (5 preceding siblings ...)
2007-08-22 8:30 ` [PATCH 6/10 Rev4] [IPoIB] CM & Multicast changes Krishna Kumar
@ 2007-08-22 8:30 ` Krishna Kumar
2007-08-22 8:31 ` [PATCH 8/10 Rev4] [IPoIB] Post and work completion handler changes Krishna Kumar
` (2 subsequent siblings)
9 siblings, 0 replies; 15+ messages in thread
From: Krishna Kumar @ 2007-08-22 8:30 UTC (permalink / raw)
To: johnpol, herbert, hadi, kaber, shemminger, davem
Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, xma, gaagaan,
kumarkr, rdreier, rick.jones2, mcarlson, jeff, general, mchan,
tgraf, netdev, Krishna Kumar, sri
IPoIB verb changes to use batching.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
ipoib_verbs.c | 23 ++++++++++++++---------
1 files changed, 14 insertions(+), 9 deletions(-)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c new/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-08-20 14:26:26.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-08-22 08:33:51.000000000 +0530
@@ -152,11 +152,11 @@ int ipoib_transport_dev_init(struct net_
.max_send_sge = 1,
.max_recv_sge = 1
},
- .sq_sig_type = IB_SIGNAL_ALL_WR,
+ .sq_sig_type = IB_SIGNAL_REQ_WR, /* 11.2.4.1 */
.qp_type = IB_QPT_UD
};
-
- int ret, size;
+ struct ib_send_wr *next_wr = NULL;
+ int i, ret, size;
priv->pd = ib_alloc_pd(priv->ca);
if (IS_ERR(priv->pd)) {
@@ -197,12 +197,17 @@ int ipoib_transport_dev_init(struct net_
priv->dev->dev_addr[2] = (priv->qp->qp_num >> 8) & 0xff;
priv->dev->dev_addr[3] = (priv->qp->qp_num ) & 0xff;
- priv->tx_sge.lkey = priv->mr->lkey;
-
- priv->tx_wr.opcode = IB_WR_SEND;
- priv->tx_wr.sg_list = &priv->tx_sge;
- priv->tx_wr.num_sge = 1;
- priv->tx_wr.send_flags = IB_SEND_SIGNALED;
+ for (i = ipoib_sendq_size - 1; i >= 0; i--) {
+ priv->tx_sge[i].lkey = priv->mr->lkey;
+ priv->tx_wr[i].opcode = IB_WR_SEND;
+ priv->tx_wr[i].sg_list = &priv->tx_sge[i];
+ priv->tx_wr[i].num_sge = 1;
+ priv->tx_wr[i].send_flags = 0;
+
+ /* Link the list properly for provider to use */
+ priv->tx_wr[i].next = next_wr;
+ next_wr = &priv->tx_wr[i];
+ }
return 0;
* [PATCH 8/10 Rev4] [IPoIB] Post and work completion handler changes
2007-08-22 8:28 [PATCH 0/10 Rev4] Implement skb batching and support in IPoIB Krishna Kumar
` (6 preceding siblings ...)
2007-08-22 8:30 ` [PATCH 7/10 Rev4] [IPoIB] Verbs changes Krishna Kumar
@ 2007-08-22 8:31 ` Krishna Kumar
2007-08-22 8:31 ` [PATCH 9/10 Rev4] [IPoIB] Implement batching Krishna Kumar
2007-08-22 8:31 ` [PATCH 10/10 Rev4] [E1000] " Krishna Kumar
9 siblings, 0 replies; 15+ messages in thread
From: Krishna Kumar @ 2007-08-22 8:31 UTC (permalink / raw)
To: johnpol, herbert, hadi, kaber, shemminger, davem
Cc: jagana, Robert.Olsson, rick.jones2, xma, gaagaan, kumarkr,
rdreier, peter.p.waskiewicz.jr, mcarlson, jeff, general, mchan,
tgraf, netdev, Krishna Kumar, sri
IPoIB internal post and work completion handler changes.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
ipoib_ib.c | 207 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
1 files changed, 163 insertions(+), 44 deletions(-)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_ib.c new/drivers/infiniband/ulp/ipoib/ipoib_ib.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-08-20 14:26:26.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-08-22 08:33:51.000000000 +0530
@@ -242,6 +242,8 @@ repost:
static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
{
struct ipoib_dev_priv *priv = netdev_priv(dev);
+ int i, num_completions;
+ unsigned int tx_ring_index;
unsigned int wr_id = wc->wr_id;
struct ipoib_tx_buf *tx_req;
unsigned long flags;
@@ -255,18 +257,46 @@ static void ipoib_ib_handle_tx_wc(struct
return;
}
- tx_req = &priv->tx_ring[wr_id];
+ /* Get first WC to process (no one can update tx_tail at this time) */
+ tx_ring_index = priv->tx_tail & (ipoib_sendq_size - 1);
- ib_dma_unmap_single(priv->ca, tx_req->mapping,
- tx_req->skb->len, DMA_TO_DEVICE);
+ /* Find number of WC's */
+ num_completions = wr_id - tx_ring_index + 1;
+ if (unlikely(num_completions <= 0))
+ num_completions += ipoib_sendq_size;
- ++priv->stats.tx_packets;
- priv->stats.tx_bytes += tx_req->skb->len;
+ /*
+ * Handle WC's from earlier (possibly multiple) post_sends in this
+ * iteration as we move from tx_tail to wr_id, since if the last WR
+ * (which is the one which requested completion notification) failed
+ * to be sent for any of those earlier request(s), no completion
+ * notification is generated for successful WR's of those earlier
+ * request(s).
+ */
+ tx_req = &priv->tx_ring[tx_ring_index];
+ for (i = 0; i < num_completions; i++) {
+ if (likely(tx_req->skb)) {
+ ib_dma_unmap_single(priv->ca, tx_req->mapping,
+ tx_req->skb->len, DMA_TO_DEVICE);
+
+ ++priv->stats.tx_packets;
+ priv->stats.tx_bytes += tx_req->skb->len;
- dev_kfree_skb_any(tx_req->skb);
+ dev_kfree_skb_any(tx_req->skb);
+ }
+ /*
+ * else this skb failed synchronously when posted and was
+ * freed immediately.
+ */
+
+ if (likely(++tx_ring_index != ipoib_sendq_size))
+ tx_req++;
+ else
+ tx_req = &priv->tx_ring[0];
+ }
spin_lock_irqsave(&priv->tx_lock, flags);
- ++priv->tx_tail;
+ priv->tx_tail += num_completions;
if (unlikely(test_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags)) &&
priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) {
clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
@@ -335,29 +365,57 @@ void ipoib_ib_completion(struct ib_cq *c
netif_rx_schedule(dev, &priv->napi);
}
-static inline int post_send(struct ipoib_dev_priv *priv,
- unsigned int wr_id,
- struct ib_ah *address, u32 qpn,
- u64 addr, int len)
+/*
+ * post_send : Post WR(s) to the device.
+ *
+ * num_skbs is the number of WR's, first_wr is the first slot in tx_wr[] (or
+ * tx_sge[]). first_wr is normally zero unless a previous post_send returned
+ * error and we are trying to post the untried WR's, in which case first_wr
+ * is the index to the first untried WR.
+ *
+ * Break the WR link before posting so that provider knows how many WR's to
+ * process, and this is set back after the post.
+ */
+static inline int post_send(struct ipoib_dev_priv *priv, u32 qpn,
+ int first_wr, int num_skbs,
+ struct ib_send_wr **bad_wr)
{
- struct ib_send_wr *bad_wr;
+ int ret;
+ struct ib_send_wr *last_wr, *next_wr;
+
+ last_wr = &priv->tx_wr[first_wr + num_skbs - 1];
- priv->tx_sge.addr = addr;
- priv->tx_sge.length = len;
+ /* Set Completion Notification for last WR */
+ last_wr->send_flags = IB_SEND_SIGNALED;
- priv->tx_wr.wr_id = wr_id;
- priv->tx_wr.wr.ud.remote_qpn = qpn;
- priv->tx_wr.wr.ud.ah = address;
+ /* Terminate the last WR */
+ next_wr = last_wr->next;
+ last_wr->next = NULL;
- return ib_post_send(priv->qp, &priv->tx_wr, &bad_wr);
+ /* Send all the WR's in one doorbell */
+ ret = ib_post_send(priv->qp, &priv->tx_wr[first_wr], bad_wr);
+
+ /* Restore send_flags & WR chain */
+ last_wr->send_flags = 0;
+ last_wr->next = next_wr;
+
+ return ret;
}
-void ipoib_send(struct net_device *dev, struct sk_buff *skb,
- struct ipoib_ah *address, u32 qpn)
+/*
+ * Map skb & store skb/mapping in tx_ring; and details of the WR in tx_wr
+ * to pass to the provider.
+ *
+ * Returns:
+ * 1: Error and the skb is freed.
+ * 0 skb processed successfully.
+ */
+int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb,
+ struct ipoib_dev_priv *priv, struct ipoib_ah *address,
+ u32 qpn, int wr_num)
{
- struct ipoib_dev_priv *priv = netdev_priv(dev);
- struct ipoib_tx_buf *tx_req;
u64 addr;
+ unsigned int tx_ring_index;
if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) {
ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
@@ -365,7 +423,7 @@ void ipoib_send(struct net_device *dev,
++priv->stats.tx_dropped;
++priv->stats.tx_errors;
ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu);
- return;
+ return 1;
}
ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n",
@@ -378,35 +436,96 @@ void ipoib_send(struct net_device *dev,
* means we have to make sure everything is properly recorded and
* our state is consistent before we call post_send().
*/
- tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)];
- tx_req->skb = skb;
- addr = ib_dma_map_single(priv->ca, skb->data, skb->len,
- DMA_TO_DEVICE);
+ addr = ib_dma_map_single(priv->ca, skb->data, skb->len, DMA_TO_DEVICE);
if (unlikely(ib_dma_mapping_error(priv->ca, addr))) {
++priv->stats.tx_errors;
dev_kfree_skb_any(skb);
- return;
+ return 1;
}
- tx_req->mapping = addr;
- if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1),
- address->ah, qpn, addr, skb->len))) {
- ipoib_warn(priv, "post_send failed\n");
- ++priv->stats.tx_errors;
- ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE);
- dev_kfree_skb_any(skb);
- } else {
- dev->trans_start = jiffies;
+ tx_ring_index = priv->tx_head & (ipoib_sendq_size - 1);
- address->last_send = priv->tx_head;
- ++priv->tx_head;
+ /* Save till completion handler executes */
+ priv->tx_ring[tx_ring_index].skb = skb;
+ priv->tx_ring[tx_ring_index].mapping = addr;
+
+ /* Set WR values for the provider to use */
+ priv->tx_sge[wr_num].addr = addr;
+ priv->tx_sge[wr_num].length = skb->len;
+
+ priv->tx_wr[wr_num].wr_id = tx_ring_index;
+ priv->tx_wr[wr_num].wr.ud.remote_qpn = qpn;
+ priv->tx_wr[wr_num].wr.ud.ah = address->ah;
+
+ priv->tx_head++;
+
+ if (unlikely(priv->tx_head - priv->tx_tail == ipoib_sendq_size)) {
+ ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
+ netif_stop_queue(dev);
+ set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
+ }
- if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) {
- ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
- netif_stop_queue(dev);
- set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
+ return 0;
+}
+
+/*
+ * Send num_skbs to the device. If an skb is passed to this function, it is
+ * the single, unprocessed-skb send case; otherwise all skbs have already
+ * been processed and put on priv->tx_wr, tx_sge, tx_ring, etc.
+ */
+void ipoib_send(struct net_device *dev, struct sk_buff *skb,
+ struct ipoib_ah *address, u32 qpn, int num_skbs)
+{
+ struct ipoib_dev_priv *priv = netdev_priv(dev);
+ int first_wr = 0;
+
+ if (skb && ipoib_process_skb(dev, skb, priv, address, qpn, 0))
+ return;
+
+ /* Send all skb's in one post */
+ do {
+ struct ib_send_wr *bad_wr;
+
+ if (unlikely((post_send(priv, qpn, first_wr, num_skbs,
+ &bad_wr)))) {
+ int done;
+
+ ipoib_warn(priv, "post_send failed\n");
+
+ /* Get number of WR's that finished successfully */
+ done = bad_wr - &priv->tx_wr[first_wr];
+
+ /* Handle 1 error */
+ priv->stats.tx_errors++;
+ ib_dma_unmap_single(priv->ca,
+ priv->tx_sge[first_wr + done].addr,
+ priv->tx_sge[first_wr + done].length,
+ DMA_TO_DEVICE);
+
+ /* Free failed WR & reset for WC handler to recognize */
+ dev_kfree_skb_any(priv->tx_ring[bad_wr->wr_id].skb);
+ priv->tx_ring[bad_wr->wr_id].skb = NULL;
+
+ /* Handle 'n' successes */
+ if (done) {
+ dev->trans_start = jiffies;
+ address->last_send = priv->tx_head - (num_skbs -
+ done) - 1;
+ }
+
+ /* Get count of skbs that were not tried */
+ num_skbs -= (done + 1);
+ /* + 1 for WR that was tried & failed */
+
+ /* Get start index for next iteration */
+ first_wr += (done + 1);
+ } else {
+ dev->trans_start = jiffies;
+
+ address->last_send = priv->tx_head - 1;
+ num_skbs = 0;
}
- }
+ } while (num_skbs);
}
static void __ipoib_reap_ah(struct net_device *dev)
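A small worked example of the completion accounting in
ipoib_ib_handle_tx_wc() above; the numbers are illustrative and assume
ipoib_sendq_size = 64 (a power of two, as the ring masking requires):

	/*
	 * tx_tail = 62  ->  tx_ring_index = 62 & (64 - 1) = 62
	 * The signaled WC that arrives carries wr_id = 1 (ring wrapped):
	 *
	 *   num_completions = wr_id - tx_ring_index + 1 = 1 - 62 + 1 = -60
	 *   num_completions <= 0, so num_completions += 64  ->  4
	 *
	 * Slots 62, 63, 0 and 1 are reaped in this single pass, even though
	 * only the last WR of the batch asked for a completion, and tx_tail
	 * advances by 4.
	 */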
* [PATCH 9/10 Rev4] [IPoIB] Implement batching
2007-08-22 8:28 [PATCH 0/10 Rev4] Implement skb batching and support in IPoIB Krishna Kumar
` (7 preceding siblings ...)
2007-08-22 8:31 ` [PATCH 8/10 Rev4] [IPoIB] Post and work completion handler changes Krishna Kumar
@ 2007-08-22 8:31 ` Krishna Kumar
2007-08-22 8:31 ` [PATCH 10/10 Rev4] [E1000] " Krishna Kumar
9 siblings, 0 replies; 15+ messages in thread
From: Krishna Kumar @ 2007-08-22 8:31 UTC (permalink / raw)
To: johnpol, herbert, hadi, kaber, shemminger, davem
Cc: jagana, Robert.Olsson, peter.p.waskiewicz.jr, xma, gaagaan,
kumarkr, rdreier, rick.jones2, mcarlson, jeff, general, mchan,
tgraf, netdev, Krishna Kumar, sri
IPoIB: implement the new batching API.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
ipoib_main.c | 251 ++++++++++++++++++++++++++++++++++++++++-------------------
1 files changed, 171 insertions(+), 80 deletions(-)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_main.c new/drivers/infiniband/ulp/ipoib/ipoib_main.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-08-20 14:26:26.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-08-22 08:33:51.000000000 +0530
@@ -560,7 +560,8 @@ static void neigh_add_path(struct sk_buf
goto err_drop;
}
} else
- ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+ ipoib_send(dev, skb, path->ah,
+ IPOIB_QPN(skb->dst->neighbour->ha), 1);
} else {
neigh->ah = NULL;
@@ -640,7 +641,7 @@ static void unicast_arp_send(struct sk_b
ipoib_dbg(priv, "Send unicast ARP to %04x\n",
be16_to_cpu(path->pathrec.dlid));
- ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr));
+ ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr), 1);
} else if ((path->query || !path_rec_start(dev, path)) &&
skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
/* put pseudoheader back on for next time */
@@ -654,105 +655,166 @@ static void unicast_arp_send(struct sk_b
spin_unlock(&priv->lock);
}
+#define XMIT_PROCESSED_SKBS() \
+ do { \
+ if (wr_num) { \
+ ipoib_send(dev, NULL, old_neigh->ah, old_qpn, \
+ wr_num); \
+ wr_num = 0; \
+ } \
+ } while (0)
+
static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct ipoib_dev_priv *priv = netdev_priv(dev);
- struct ipoib_neigh *neigh;
+ struct sk_buff_head *blist;
+ int max_skbs, wr_num = 0;
+ u32 qpn, old_qpn = 0;
+ struct ipoib_neigh *neigh, *old_neigh = NULL;
unsigned long flags;
if (unlikely(!spin_trylock_irqsave(&priv->tx_lock, flags)))
return NETDEV_TX_LOCKED;
- /*
- * Check if our queue is stopped. Since we have the LLTX bit
- * set, we can't rely on netif_stop_queue() preventing our
- * xmit function from being called with a full queue.
- */
- if (unlikely(netif_queue_stopped(dev))) {
- spin_unlock_irqrestore(&priv->tx_lock, flags);
- return NETDEV_TX_BUSY;
- }
-
- if (likely(skb->dst && skb->dst->neighbour)) {
- if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) {
- ipoib_path_lookup(skb, dev);
- goto out;
- }
+ blist = dev->skb_blist;
- neigh = *to_ipoib_neigh(skb->dst->neighbour);
+ if (!skb || (blist && skb_queue_len(blist))) {
+ /*
+ * Either batching xmit call, or single skb case but there are
+ * skbs already in the batch list from previous failure to
+ * xmit - send the earlier skbs first to avoid out of order.
+ */
+
+ if (skb)
+ __skb_queue_tail(blist, skb);
+
+ /*
+ * Figure out how many skbs can be sent. This prevents the
+ * device getting full and avoids checking for a stopped queue
+ * after each iteration. Now the queue can get stopped only
+ * after xmit of the last skb.
+ */
+ max_skbs = ipoib_sendq_size - (priv->tx_head - priv->tx_tail);
+ skb = __skb_dequeue(blist);
+ } else {
+ blist = NULL;
+ max_skbs = 1;
+ }
- if (ipoib_cm_get(neigh)) {
- if (ipoib_cm_up(neigh)) {
- ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
- goto out;
- }
- } else if (neigh->ah) {
- if (unlikely(memcmp(&neigh->dgid.raw,
- skb->dst->neighbour->ha + 4,
- sizeof(union ib_gid)))) {
- spin_lock(&priv->lock);
- /*
- * It's safe to call ipoib_put_ah() inside
- * priv->lock here, because we know that
- * path->ah will always hold one more reference,
- * so ipoib_put_ah() will never do more than
- * decrement the ref count.
- */
- ipoib_put_ah(neigh->ah);
- list_del(&neigh->list);
- ipoib_neigh_free(dev, neigh);
- spin_unlock(&priv->lock);
+ do {
+ if (likely(skb->dst && skb->dst->neighbour)) {
+ if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) {
+ XMIT_PROCESSED_SKBS();
ipoib_path_lookup(skb, dev);
- goto out;
+ continue;
}
- ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(skb->dst->neighbour->ha));
- goto out;
- }
-
- if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
- spin_lock(&priv->lock);
- __skb_queue_tail(&neigh->queue, skb);
- spin_unlock(&priv->lock);
- } else {
- ++priv->stats.tx_dropped;
- dev_kfree_skb_any(skb);
- }
- } else {
- struct ipoib_pseudoheader *phdr =
- (struct ipoib_pseudoheader *) skb->data;
- skb_pull(skb, sizeof *phdr);
+ neigh = *to_ipoib_neigh(skb->dst->neighbour);
- if (phdr->hwaddr[4] == 0xff) {
- /* Add in the P_Key for multicast*/
- phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff;
- phdr->hwaddr[9] = priv->pkey & 0xff;
+ if (ipoib_cm_get(neigh)) {
+ if (ipoib_cm_up(neigh)) {
+ XMIT_PROCESSED_SKBS();
+ ipoib_cm_send(dev, skb,
+ ipoib_cm_get(neigh));
+ continue;
+ }
+ } else if (neigh->ah) {
+ if (unlikely(memcmp(&neigh->dgid.raw,
+ skb->dst->neighbour->ha + 4,
+ sizeof(union ib_gid)))) {
+ spin_lock(&priv->lock);
+ /*
+ * It's safe to call ipoib_put_ah()
+ * inside priv->lock here, because we
+ * know that path->ah will always hold
+ * one more reference, so ipoib_put_ah()
+ * will never do more than decrement
+ * the ref count.
+ */
+ ipoib_put_ah(neigh->ah);
+ list_del(&neigh->list);
+ ipoib_neigh_free(dev, neigh);
+ spin_unlock(&priv->lock);
+ XMIT_PROCESSED_SKBS();
+ ipoib_path_lookup(skb, dev);
+ continue;
+ }
+
+ qpn = IPOIB_QPN(skb->dst->neighbour->ha);
+ if (neigh != old_neigh || qpn != old_qpn) {
+ /*
+ * Sending to a different destination
+ * from earlier skb's (or this is the
+ * first skb) - send all existing skbs.
+ */
+ XMIT_PROCESSED_SKBS();
+ old_neigh = neigh;
+ old_qpn = qpn;
+ }
+
+ if (likely(!ipoib_process_skb(dev, skb, priv,
+ neigh->ah, qpn,
+ wr_num)))
+ wr_num++;
- ipoib_mcast_send(dev, phdr->hwaddr + 4, skb);
- } else {
- /* unicast GID -- should be ARP or RARP reply */
+ continue;
+ }
- if ((be16_to_cpup((__be16 *) skb->data) != ETH_P_ARP) &&
- (be16_to_cpup((__be16 *) skb->data) != ETH_P_RARP)) {
- ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x "
- IPOIB_GID_FMT "\n",
- skb->dst ? "neigh" : "dst",
- be16_to_cpup((__be16 *) skb->data),
- IPOIB_QPN(phdr->hwaddr),
- IPOIB_GID_RAW_ARG(phdr->hwaddr + 4));
+ if (skb_queue_len(&neigh->queue) <
+ IPOIB_MAX_PATH_REC_QUEUE) {
+ spin_lock(&priv->lock);
+ __skb_queue_tail(&neigh->queue, skb);
+ spin_unlock(&priv->lock);
+ } else {
dev_kfree_skb_any(skb);
++priv->stats.tx_dropped;
- goto out;
+ ++max_skbs;
+ }
+ } else {
+ struct ipoib_pseudoheader *phdr =
+ (struct ipoib_pseudoheader *) skb->data;
+ skb_pull(skb, sizeof *phdr);
+
+ if (phdr->hwaddr[4] == 0xff) {
+ /* Add in the P_Key for multicast*/
+ phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff;
+ phdr->hwaddr[9] = priv->pkey & 0xff;
+
+ XMIT_PROCESSED_SKBS();
+ ipoib_mcast_send(dev, phdr->hwaddr + 4, skb);
+ } else {
+ /* unicast GID -- should be ARP or RARP reply */
+
+ if ((be16_to_cpup((__be16 *) skb->data) !=
+ ETH_P_ARP) &&
+ (be16_to_cpup((__be16 *) skb->data) !=
+ ETH_P_RARP)) {
+ ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x "
+ IPOIB_GID_FMT "\n",
+ skb->dst ? "neigh" : "dst",
+ be16_to_cpup((__be16 *)
+ skb->data),
+ IPOIB_QPN(phdr->hwaddr),
+ IPOIB_GID_RAW_ARG(phdr->hwaddr
+ + 4));
+ dev_kfree_skb_any(skb);
+ ++priv->stats.tx_dropped;
+ ++max_skbs;
+ continue;
+ }
+ XMIT_PROCESSED_SKBS();
+ unicast_arp_send(skb, dev, phdr);
}
-
- unicast_arp_send(skb, dev, phdr);
}
- }
+ } while (--max_skbs && (skb = __skb_dequeue(blist)) != NULL);
+
+ /* Send out last packets (if any) */
+ XMIT_PROCESSED_SKBS();
-out:
spin_unlock_irqrestore(&priv->tx_lock, flags);
- return NETDEV_TX_OK;
+ return (!blist || !skb_queue_len(blist)) ? NETDEV_TX_OK :
+ NETDEV_TX_BUSY;
}
static struct net_device_stats *ipoib_get_stats(struct net_device *dev)
@@ -900,11 +962,35 @@ int ipoib_dev_init(struct net_device *de
/* priv->tx_head & tx_tail are already 0 */
- if (ipoib_ib_dev_init(dev, ca, port))
+ /* Allocate tx_sge */
+ priv->tx_sge = kmalloc(ipoib_sendq_size * sizeof *priv->tx_sge,
+ GFP_KERNEL);
+ if (!priv->tx_sge) {
+ printk(KERN_WARNING "%s: failed to allocate TX sge (%d entries)\n",
+ ca->name, ipoib_sendq_size);
goto out_tx_ring_cleanup;
+ }
+
+ /* Allocate tx_wr */
+ priv->tx_wr = kmalloc(ipoib_sendq_size * sizeof *priv->tx_wr,
+ GFP_KERNEL);
+ if (!priv->tx_wr) {
+ printk(KERN_WARNING "%s: failed to allocate TX wr (%d entries)\n",
+ ca->name, ipoib_sendq_size);
+ goto out_tx_sge_cleanup;
+ }
+
+ if (ipoib_ib_dev_init(dev, ca, port))
+ goto out_tx_wr_cleanup;
return 0;
+out_tx_wr_cleanup:
+ kfree(priv->tx_wr);
+
+out_tx_sge_cleanup:
+ kfree(priv->tx_sge);
+
out_tx_ring_cleanup:
kfree(priv->tx_ring);
@@ -932,9 +1018,13 @@ void ipoib_dev_cleanup(struct net_device
kfree(priv->rx_ring);
kfree(priv->tx_ring);
+ kfree(priv->tx_sge);
+ kfree(priv->tx_wr);
priv->rx_ring = NULL;
priv->tx_ring = NULL;
+ priv->tx_sge = NULL;
+ priv->tx_wr = NULL;
}
static void ipoib_setup(struct net_device *dev)
@@ -965,7 +1055,8 @@ static void ipoib_setup(struct net_devic
dev->addr_len = INFINIBAND_ALEN;
dev->type = ARPHRD_INFINIBAND;
dev->tx_queue_len = ipoib_sendq_size * 2;
- dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX;
+ dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX |
+ NETIF_F_BATCH_SKBS;
/* MTU will be reset when mcast join happens */
dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN;
* [PATCH 10/10 Rev4] [E1000] Implement batching
2007-08-22 8:28 [PATCH 0/10 Rev4] Implement skb batching and support in IPoIB Krishna Kumar
` (8 preceding siblings ...)
2007-08-22 8:31 ` [PATCH 9/10 Rev4] [IPoIB] Implement batching Krishna Kumar
@ 2007-08-22 8:31 ` Krishna Kumar
2007-08-22 14:39 ` [ofa-general] " Kok, Auke
9 siblings, 1 reply; 15+ messages in thread
From: Krishna Kumar @ 2007-08-22 8:31 UTC (permalink / raw)
To: johnpol, herbert, hadi, kaber, shemminger, davem
Cc: jagana, Robert.Olsson, rick.jones2, xma, gaagaan, kumarkr,
rdreier, peter.p.waskiewicz.jr, mcarlson, jeff, general, mchan,
tgraf, netdev, Krishna Kumar, sri
E1000: Implement batching capability (ported thanks to changes taken from
Jamal). Not all of the changes made in IPoIB are present here, e.g. handling
of out-of-order skbs (see XXX in the first mail).
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
e1000_main.c | 150 +++++++++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 121 insertions(+), 29 deletions(-)
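For reference, here is a minimal sketch (not part of the patch below) of the
shape a batching-aware xmit handler takes under this series' convention,
where the stack calls the handler with a NULL skb to ask the driver to drain
dev->skb_blist. The helpers my_queue_one() and my_kick_hw() are hypothetical
placeholders for the driver's per-skb queueing and doorbell code:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical driver-specific helpers (declarations only) */
static int my_queue_one(struct sk_buff *skb, struct net_device *dev);
static void my_kick_hw(struct net_device *dev);

static int my_xmit_frame(struct sk_buff *skb, struct net_device *dev)
{
	int done = 0;

	if (skb) {
		/* Regular path: queue a single skb and kick the hardware */
		if (my_queue_one(skb, dev))
			return NETDEV_TX_BUSY;
		my_kick_hw(dev);
		return NETDEV_TX_OK;
	}

	/* Batch path: NULL skb means "drain the batch list" */
	while ((skb = __skb_dequeue(dev->skb_blist)) != NULL) {
		if (my_queue_one(skb, dev)) {
			/* No ring space - put it back and stop */
			__skb_queue_head(dev->skb_blist, skb);
			break;
		}
		done++;
	}

	/* Ring the doorbell once for the whole batch */
	if (done)
		my_kick_hw(dev);

	return skb_queue_len(dev->skb_blist) ? NETDEV_TX_BUSY : NETDEV_TX_OK;
}

The e1000 changes below follow the same split: e1000_prep_queue_frame() and
e1000_queue_frame() do the per-skb work, and e1000_kick_DMA() rings the
doorbell once after the whole batch is queued instead of from inside
e1000_tx_queue().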
diff -ruNp org/drivers/net/e1000/e1000_main.c new/drivers/net/e1000/e1000_main.c
--- org/drivers/net/e1000/e1000_main.c 2007-08-20 14:26:29.000000000 +0530
+++ new/drivers/net/e1000/e1000_main.c 2007-08-22 08:33:51.000000000 +0530
@@ -157,6 +157,7 @@ static void e1000_update_phy_info(unsign
static void e1000_watchdog(unsigned long data);
static void e1000_82547_tx_fifo_stall(unsigned long data);
static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
+static int e1000_xmit_frames(struct net_device *dev);
static struct net_device_stats * e1000_get_stats(struct net_device *netdev);
static int e1000_change_mtu(struct net_device *netdev, int new_mtu);
static int e1000_set_mac(struct net_device *netdev, void *p);
@@ -990,7 +991,7 @@ e1000_probe(struct pci_dev *pdev,
if (pci_using_dac)
netdev->features |= NETIF_F_HIGHDMA;
- netdev->features |= NETIF_F_LLTX;
+ netdev->features |= NETIF_F_LLTX | NETIF_F_BATCH_SKBS;
adapter->en_mng_pt = e1000_enable_mng_pass_thru(&adapter->hw);
@@ -3098,6 +3099,18 @@ e1000_tx_map(struct e1000_adapter *adapt
return count;
}
+static void e1000_kick_DMA(struct e1000_adapter *adapter,
+ struct e1000_tx_ring *tx_ring, int i)
+{
+ wmb();
+
+ writel(i, adapter->hw.hw_addr + tx_ring->tdt);
+ /* we need this if more than one processor can write to our tail
+ * at a time, it synchronizes IO on IA64/Altix systems */
+ mmiowb();
+}
+
static void
e1000_tx_queue(struct e1000_adapter *adapter, struct e1000_tx_ring *tx_ring,
int tx_flags, int count)
@@ -3144,13 +3157,7 @@ e1000_tx_queue(struct e1000_adapter *ada
* know there are new descriptors to fetch. (Only
* applicable for weak-ordered memory model archs,
* such as IA-64). */
- wmb();
-
tx_ring->next_to_use = i;
- writel(i, adapter->hw.hw_addr + tx_ring->tdt);
- /* we need this if more than one processor can write to our tail
- * at a time, it syncronizes IO on IA64/Altix systems */
- mmiowb();
}
/**
@@ -3257,21 +3264,31 @@ static int e1000_maybe_stop_tx(struct ne
}
#define TXD_USE_COUNT(S, X) (((S) >> (X)) + 1 )
+
+struct e1000_tx_cbdata {
+ int count;
+ unsigned int max_per_txd;
+ unsigned int nr_frags;
+ unsigned int mss;
+};
+
+#define E1000_SKB_CB(__skb) ((struct e1000_tx_cbdata *)&((__skb)->cb[0]))
+#define NETDEV_TX_DROPPED -5
+
static int
-e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
+e1000_prep_queue_frame(struct sk_buff *skb, struct net_device *netdev)
{
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_tx_ring *tx_ring;
- unsigned int first, max_per_txd = E1000_MAX_DATA_PER_TXD;
+ unsigned int max_per_txd = E1000_MAX_DATA_PER_TXD;
unsigned int max_txd_pwr = E1000_MAX_TXD_PWR;
- unsigned int tx_flags = 0;
unsigned int len = skb->len;
- unsigned long flags;
- unsigned int nr_frags = 0;
- unsigned int mss = 0;
+ unsigned int nr_frags;
+ unsigned int mss;
int count = 0;
- int tso;
unsigned int f;
+ struct e1000_tx_cbdata *cb = E1000_SKB_CB(skb);
+
len -= skb->data_len;
/* This goes back to the question of how to logically map a tx queue
@@ -3282,7 +3299,7 @@ e1000_xmit_frame(struct sk_buff *skb, st
if (unlikely(skb->len <= 0)) {
dev_kfree_skb_any(skb);
- return NETDEV_TX_OK;
+ return NETDEV_TX_DROPPED;
}
/* 82571 and newer doesn't need the workaround that limited descriptor
@@ -3328,7 +3345,7 @@ e1000_xmit_frame(struct sk_buff *skb, st
DPRINTK(DRV, ERR,
"__pskb_pull_tail failed.\n");
dev_kfree_skb_any(skb);
- return NETDEV_TX_OK;
+ return NETDEV_TX_DROPPED;
}
len = skb->len - skb->data_len;
break;
@@ -3372,22 +3389,32 @@ e1000_xmit_frame(struct sk_buff *skb, st
(adapter->hw.mac_type == e1000_82573))
e1000_transfer_dhcp_info(adapter, skb);
- if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags))
- /* Collision - tell upper layer to requeue */
- return NETDEV_TX_LOCKED;
+ cb->count = count;
+ cb->max_per_txd = max_per_txd;
+ cb->nr_frags = nr_frags;
+ cb->mss = mss;
+
+ return NETDEV_TX_OK;
+}
+
+static int e1000_queue_frame(struct sk_buff *skb, struct net_device *netdev)
+{
+ struct e1000_adapter *adapter = netdev_priv(netdev);
+ struct e1000_tx_ring *tx_ring = adapter->tx_ring;
+ int tso;
+ unsigned int first;
+ unsigned int tx_flags = 0;
+ struct e1000_tx_cbdata *cb = E1000_SKB_CB(skb);
/* need: count + 2 desc gap to keep tail from touching
* head, otherwise try next time */
- if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, count + 2))) {
- spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
+ if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, cb->count + 2)))
return NETDEV_TX_BUSY;
- }
if (unlikely(adapter->hw.mac_type == e1000_82547)) {
if (unlikely(e1000_82547_fifo_workaround(adapter, skb))) {
netif_stop_queue(netdev);
mod_timer(&adapter->tx_fifo_stall_timer, jiffies + 1);
- spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
return NETDEV_TX_BUSY;
}
}
@@ -3402,8 +3429,7 @@ e1000_xmit_frame(struct sk_buff *skb, st
tso = e1000_tso(adapter, tx_ring, skb);
if (tso < 0) {
dev_kfree_skb_any(skb);
- spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
- return NETDEV_TX_OK;
+ return NETDEV_TX_DROPPED;
}
if (likely(tso)) {
@@ -3420,12 +3446,78 @@ e1000_xmit_frame(struct sk_buff *skb, st
e1000_tx_queue(adapter, tx_ring, tx_flags,
e1000_tx_map(adapter, tx_ring, skb, first,
- max_per_txd, nr_frags, mss));
+ cb->max_per_txd, cb->nr_frags, cb->mss));
+
+ return NETDEV_TX_OK;
+}
- netdev->trans_start = jiffies;
+static inline int e1000_xmit_frames(struct net_device *netdev)
+{
+ struct e1000_adapter *adapter = netdev_priv(netdev);
+ struct e1000_tx_ring *tx_ring = adapter->tx_ring;
+ int ret = NETDEV_TX_OK;
+ int skbs_done = 0;
+ struct sk_buff *skb;
+ unsigned long flags;
+
+ if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) {
+ /* Collision - tell upper layer to requeue */
+ return NETDEV_TX_LOCKED;
+ }
- /* Make sure there is space in the ring for the next send. */
- e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2);
+ while ((skb = __skb_dequeue(netdev->skb_blist)) != NULL) {
+ if ((ret = e1000_prep_queue_frame(skb, netdev)) != NETDEV_TX_OK)
+ continue;
+
+ ret = e1000_queue_frame(skb, netdev);
+ if (ret == NETDEV_TX_OK) {
+ skbs_done++;
+ } else {
+ if (ret == NETDEV_TX_BUSY)
+ __skb_queue_head(netdev->skb_blist, skb);
+ break;
+ }
+ }
+
+ if (skbs_done) {
+ e1000_kick_DMA(adapter, tx_ring, adapter->tx_ring->next_to_use);
+ netdev->trans_start = jiffies;
+ e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2);
+ }
+
+ spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
+
+ if (ret == NETDEV_TX_DROPPED)
+ ret = NETDEV_TX_OK;
+
+ return ret;
+}
+
+static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
+{
+ struct e1000_adapter *adapter = netdev_priv(netdev);
+ struct e1000_tx_ring *tx_ring = adapter->tx_ring;
+ unsigned long flags;
+
+ if (!skb)
+ return e1000_xmit_frames(netdev);
+
+ if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) {
+ /* Collision - tell upper layer to requeue */
+ return NETDEV_TX_LOCKED;
+ }
+
+ if (unlikely(e1000_prep_queue_frame(skb, netdev) != NETDEV_TX_OK)) {
+ spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
+ return NETDEV_TX_OK;
+ }
+
+ if (e1000_queue_frame(skb, netdev) == NETDEV_TX_OK) {
+ e1000_kick_DMA(adapter, tx_ring, adapter->tx_ring->next_to_use);
+ netdev->trans_start = jiffies;
+ /* Make sure there is space in the ring for the next send. */
+ e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2);
+ }
spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
return NETDEV_TX_OK;
* [ofa-general] Re: [PATCH 10/10 Rev4] [E1000] Implement batching
2007-08-22 8:31 ` [PATCH 10/10 Rev4] [E1000] " Krishna Kumar
@ 2007-08-22 14:39 ` Kok, Auke
2007-08-23 2:44 ` Krishna Kumar2
0 siblings, 1 reply; 15+ messages in thread
From: Kok, Auke @ 2007-08-22 14:39 UTC (permalink / raw)
To: Krishna Kumar
Cc: johnpol, jagana, peter.p.waskiewicz.jr, herbert, gaagaan,
Robert.Olsson, kumarkr, rdreier, hadi, netdev, kaber, jeff,
general, mchan, tgraf, mcarlson, sri, shemminger, davem
Krishna Kumar wrote:
> E1000: Implement batching capability (ported thanks to changes taken from
> Jamal). Not all of the changes made in IPoIB are present here, e.g. handling
> of out-of-order skbs (see XXX in the first mail).
>
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> ---
> e1000_main.c | 150 +++++++++++++++++++++++++++++++++++++++++++++++------------
> 1 files changed, 121 insertions(+), 29 deletions(-)
Krishna,
While I appreciate the patch, I would have preferred a patch against e1000e. Not only
does the e1000e driver remove a lot of the workarounds for old silicon, it is
also a good way for us to move the current e1000 driver into a more stable
maintenance mode.
Do you think you can write this patch for e1000e instead? Code-wise a lot of
things are still the same, so your patch should be relatively easy to generate.
The e1000e driver currently lives in a branch of Jeff Garzik's netdev-2.6 tree.
Thanks,
Auke
* [ofa-general] Re: [PATCH 10/10 Rev4] [E1000] Implement batching
2007-08-22 14:39 ` [ofa-general] " Kok, Auke
@ 2007-08-23 2:44 ` Krishna Kumar2
0 siblings, 0 replies; 15+ messages in thread
From: Krishna Kumar2 @ 2007-08-23 2:44 UTC (permalink / raw)
To: Kok, Auke
Cc: jagana, johnpol, herbert, gaagaan, Robert.Olsson, kumarkr,
mcarlson, peter.p.waskiewicz.jr, hadi, kaber, jeff, general,
mchan, tgraf, netdev, sri, shemminger, davem, rdreier
Hi Auke,
"Kok, Auke" <auke-jan.h.kok@intel.com> wrote on 08/22/2007 08:09:31 PM:
> Krishna,
>
> While I appreciate the patch, I would have preferred a patch against
> e1000e. Not only does the e1000e driver remove a lot of the workarounds
> for old silicon, it is also a good way for us to move the current e1000
> driver into a more stable maintenance mode.
>
> Do you think you can write this patch for e1000e instead? Code-wise a
> lot of things are still the same, so your patch should be relatively
> easy to generate.
>
> The e1000e driver currently lives in a branch of Jeff Garzik's
> netdev-2.6 tree.
Definitely, I will pick it up and generate a patch.
Thanks,
- KK