* Re: [RFC] New driver API to speed up small packets xmits [not found] <OF0CAD6D87.DBE62968-ON872572DC.0073646A-882572DC.0073BEC2@us.ibm.com> @ 2007-05-15 21:17 ` Roland Dreier [not found] ` <OFF5654BB8.74EC8DCB-ON872572DC.00752079-882572DC.00756B23@us.ibm.com> 0 siblings, 1 reply; 96+ messages in thread From: Roland Dreier @ 2007-05-15 21:17 UTC (permalink / raw) To: Shirley Ma; +Cc: David Miller, ak, krkumar2, netdev, netdev-owner > I thought to enable GSO, device driver actually does nothing rather > than enabling the flag. GSO moved TCP offloading to interface layer before > device xmit. It's a different idea with multiple packets per xmit. GSO > still queues the packets one by one in QDISC and xmits them one by one. The > multiple packets per xmit will xmit N packets when N packets in QDISC > queue. Please correct me if wrong. Current use of GSO does segmentation just above the netdevice driver. However there's nothing that prevents it from being used to push segmentation into the driver -- as I described, just set NETIF_F_GSO_SOFTWARE and do skb_gso_segment() within the driver. As Michael Chan pointed out, tg3.c already does this to work around a rare bug in the HW TSO implementation. For IPoIB we could do it all the time to get multiple send work requests from one call to the xmit method. - R. ^ permalink raw reply [flat|nested] 96+ messages in thread
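For illustration, a minimal sketch of the driver-side software GSO Roland describes above (the approach tg3_tso_bug() takes). It assumes the 2007-era skb_gso_segment() API; foo_queue_one() is a hypothetical helper that posts a single segment (e.g. one IPoIB send work request) to the hardware, and error handling is reduced to dropping the packet.

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/err.h>

/* hypothetical: post one already-segmented skb to the hardware */
static int foo_queue_one(struct sk_buff *skb, struct net_device *dev);

static int foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	if (skb_is_gso(skb)) {
		/* 0 => assume no offloads, resolve checksums in software */
		struct sk_buff *segs = skb_gso_segment(skb, 0);

		if (IS_ERR(segs)) {
			dev_kfree_skb_any(skb);
			return NETDEV_TX_OK;
		}
		dev_kfree_skb_any(skb);	/* the super-packet is no longer needed */

		while (segs) {		/* one send per segment */
			struct sk_buff *nskb = segs;

			segs = segs->next;
			nskb->next = NULL;
			foo_queue_one(nskb, dev);
		}
		return NETDEV_TX_OK;
	}
	return foo_queue_one(skb, dev);
}

At probe time the driver would also advertise NETIF_F_SG and NETIF_F_GSO_SOFTWARE (as Roland notes) so the stack hands whole GSO skbs down to it instead of segmenting them itself.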
* Re: [RFC] New driver API to speed up small packets xmits [not found] ` <OFF5654BB8.74EC8DCB-ON872572DC.00752079-882572DC.00756B23@us.ibm.com> @ 2007-05-15 21:25 ` Roland Dreier [not found] ` <OF21D475A2.5E5C88DE-ON872572DC.00763DE4-882572DC.00768A7E@us.ibm.com> 2007-05-15 21:32 ` David Miller 1 sibling, 1 reply; 96+ messages in thread From: Roland Dreier @ 2007-05-15 21:25 UTC (permalink / raw) To: Shirley Ma; +Cc: ak, David Miller, krkumar2, netdev, netdev-owner Shirley> I just wonder without TSO support in HW, how much Shirley> benefit we can get by pushing GSO from interface layer to Shirley> device layer besides we can do multiple packets in IPoIB. The entire benefit comes from having multiple packets to queue in one call to the xmit method. - R. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits [not found] ` <OF21D475A2.5E5C88DE-ON872572DC.00763DE4-882572DC.00768A7E@us.ibm.com> @ 2007-05-15 21:38 ` David Miller 0 siblings, 0 replies; 96+ messages in thread From: David Miller @ 2007-05-15 21:38 UTC (permalink / raw) To: xma; +Cc: rdreier, ak, krkumar2, netdev, netdev-owner From: Shirley Ma <xma@us.ibm.com> Date: Tue, 15 May 2007 14:35:13 -0700 > I would think it might be better to implement dequeueN in > interface xmit which will benefit multiple streams as well not just > for large packet. I think it is beneficial to have both, each of them helps with different aspects of packet sending overhead. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits [not found] ` <OFF5654BB8.74EC8DCB-ON872572DC.00752079-882572DC.00756B23@us.ibm.com> 2007-05-15 21:25 ` Roland Dreier @ 2007-05-15 21:32 ` David Miller 2007-05-15 22:17 ` [WIP] [PATCH] WAS " jamal ` (2 more replies) 1 sibling, 3 replies; 96+ messages in thread From: David Miller @ 2007-05-15 21:32 UTC (permalink / raw) To: xma; +Cc: rdreier, ak, krkumar2, netdev, netdev-owner From: Shirley Ma <xma@us.ibm.com> Date: Tue, 15 May 2007 14:22:57 -0700 > I just wonder without TSO support in HW, how much benefit we > can get by pushing GSO from interface layer to device layer besides > we can do multiple packets in IPoIB. I bet the gain is non-trivial. I'd say about half of the gain from TSO comes from only calling down into the driver from TCP one time as opposed to N times. That's the majority of the "CPU work" involved in TCP sending. The rest of the gain comes from only transmitting the packet headers once rather than N times, which conserves I/O bus bandwidth. GSO will not help the case of lots of UDP applications sending small packets, or something like that. An efficient qdisc-->driver transfer during netif_wake_queue() could help solve some of that, as is being discussed here. ^ permalink raw reply [flat|nested] 96+ messages in thread
* [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits 2007-05-15 21:32 ` David Miller @ 2007-05-15 22:17 ` jamal 2007-05-15 22:48 ` jamal 2007-05-16 22:12 ` Sridhar Samudrala [not found] ` <OF6757F56D.EE5984FD-ON872572DC.0081026C-882572DC.00814B8F@us.ibm.com> 2007-05-21 7:56 ` Herbert Xu 2 siblings, 2 replies; 96+ messages in thread From: jamal @ 2007-05-15 22:17 UTC (permalink / raw) To: David Miller Cc: xma, rdreier, ak, krkumar2, netdev, netdev-owner, ashwin.chaugule, Evgeniy Polyakov, Gagan Arneja [-- Attachment #1: Type: text/plain, Size: 2196 bytes --] On Tue, 2007-15-05 at 14:32 -0700, David Miller wrote: > An efficient qdisc-->driver > transfer during netif_wake_queue() could help solve some of that, > as is being discussed here. Ok, heres the approach i discussed at netconf. It needs net-2.6 and the patch i posted earlier to clean up qdisc_restart() [1]. I havent ported over all the bits from 2.6.18, but this works. Krishna and i have colluded privately on working together. I just need to reproduce the patches, so here is the core. A lot of the code in the core could be aggragated later - right now i am worried about correctness. I will post a patch for tun device in a few minutes that i use to test on my laptop (i need to remove some debugs) to show an example. I also plan to post a patch for e1000 - but that will take more than a few minutes. the e1000 driver has changed quiet a bit since 2.6.18, so it is consuming. What does a driver need to do to get batched-to? 1) On initialization (probe probably) a) set NETIF_F_BTX in its dev->features at startup i.e dev->features |= NETIF_F_BTX b) initialize the batch queue i.e something like skb_queue_head_init(&dev->blist); c) set dev->xmit_win to something reasonable like maybe half the DMA ring size or tx_queuelen 2) create a new method for batch txmit. This loops on dev->blist and stashes onto hardware. All return codes like NETDEV_TX_OK etc still apply. 3) set the dev->xmit_win which provides hints on how much data to send from the core to the driver. Some suggestions: a)on doing a netif_stop, set it to 1 b)on netif_wake_queue set it to the max available space Of course, to work, all this requires that the driver to have a threshold for waking up tx path; like drivers such as e1000 or tg3 do in order to invoke netif_wake_queue (example look at TX_WAKE_THRESHOLD usage in e1000). feedback welcome (preferably in the form of patches). Anyone with a really nice tool to measure CPU improvement will help a great deal in quantifying things. As i have said earlier, I never saw any throughput improvement. But like T/GSO it may be just CPU savings (as was suggested at netconf). cheers, jamal [1] http://marc.info/?l=linux-netdev&m=117914954911959&w=2 [-- Attachment #2: batch0 --] [-- Type: text/x-patch, Size: 4512 bytes --] diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index f671cd2..7205748 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -325,6 +325,7 @@ struct net_device #define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */ #define NETIF_F_GSO 2048 /* Enable software GSO. 
*/ #define NETIF_F_LLTX 4096 /* LockLess TX */ +#define NETIF_F_BTX 8192 /* Capable of batch tx */ /* Segmentation offload features */ #define NETIF_F_GSO_SHIFT 16 @@ -450,6 +451,11 @@ struct net_device void *priv; /* pointer to private data */ int (*hard_start_xmit) (struct sk_buff *skb, struct net_device *dev); + int (*hard_batch_xmit) (struct sk_buff_head *list, + struct net_device *dev); + int (*hard_prep_xmit) (struct sk_buff *skb, + struct net_device *dev); + int xmit_win; /* These may be needed for future network-power-down code. */ unsigned long trans_start; /* Time (in jiffies) of last Tx */ @@ -466,6 +472,10 @@ struct net_device struct list_head todo_list; /* device index hash chain */ struct hlist_node index_hlist; + /*XXX: Fix eventually to not allocate if device not + *batch capable + */ + struct sk_buff_head blist; struct net_device *link_watch_next; diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index ed80054..61fa301 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -85,10 +85,12 @@ static inline int do_dev_requeue(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q) { - if (unlikely(skb->next)) - dev->gso_skb = skb; - else - q->ops->requeue(skb, q); + if (skb) { + if (unlikely(skb->next)) + dev->gso_skb = skb; + else + q->ops->requeue(skb, q); + } /* XXX: Could netif_schedule fail? Or is that fact we are * requeueing imply the hardware path is closed * and even if we fail, some interupt will wake us @@ -116,7 +118,10 @@ tx_islocked(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q) int ret = handle_dev_cpu_collision(dev); if (ret == SCHED_TX_DROP) { - kfree_skb(skb); + if (skb) /* we are not batching */ + kfree_skb(skb); + else if (!skb_queue_empty(&dev->blist)) + skb_queue_purge(&dev->blist); return qdisc_qlen(q); } @@ -195,10 +200,99 @@ static inline int qdisc_restart(struct net_device *dev) return do_dev_requeue(skb, dev, q); } +static int try_get_tx_pkts(struct net_device *dev, struct Qdisc *q, int count) +{ + struct sk_buff *skb; + struct sk_buff_head *skbs = &dev->blist; + int tdq = count; + + /* + * very unlikely, but who knows .. + * If this happens we dont try to grab more pkts + */ + if (!skb_queue_empty(&dev->blist)) + return skb_queue_len(&dev->blist); + + if (dev->gso_skb) { + count--; + __skb_queue_head(skbs, dev->gso_skb); + dev->gso_skb = NULL; + } + + while (count) { + skb = q->dequeue(q); + if (!skb) + break; + count--; + __skb_queue_head(skbs, skb); + } + + return tdq - count; +} + +static inline int try_tx_pkts(struct net_device *dev) +{ + + return dev->hard_batch_xmit(&dev->blist, dev); + +} + +/* same comments as in qdisc_restart apply; + * at some point use shared code with qdisc_restart*/ +int batch_qdisc_restart(struct net_device *dev) +{ + struct Qdisc *q = dev->qdisc; + unsigned lockless = (dev->features & NETIF_F_LLTX); + int count = dev->xmit_win; + int ret = 0; + + ret = try_get_tx_pkts(dev, q, count); + + if (ret == 0) + return qdisc_qlen(q); + + /* we have packets to send! */ + if (!lockless) { + if (!netif_tx_trylock(dev)) + return tx_islocked(NULL, dev, q); + } + + /* all clear .. */ + spin_unlock(&dev->queue_lock); + + ret = NETDEV_TX_BUSY; + if (!netif_queue_stopped(dev)) + ret = try_tx_pkts(dev); + + if (!lockless) + netif_tx_unlock(dev); + + spin_lock(&dev->queue_lock); + + q = dev->qdisc; + + /* most likely result, packet went ok */ + if (ret == NETDEV_TX_OK) + return qdisc_qlen(q); + /* only for lockless drivers .. 
*/ + if (ret == NETDEV_TX_LOCKED && lockless) + return tx_islocked(NULL, dev, q); + + if (unlikely(ret != NETDEV_TX_BUSY && net_ratelimit())) + printk(KERN_WARNING " BUG %s code %d qlen %d\n", + dev->name, ret, q->q.qlen); + + return do_dev_requeue(NULL, dev, q); +} + void __qdisc_run(struct net_device *dev) { + unsigned batching = (dev->features & NETIF_F_BTX); + do { - if (!qdisc_restart(dev)) + if (!batching && !qdisc_restart(dev)) + break; + else if (!batch_qdisc_restart(dev)) break; } while (!netif_queue_stopped(dev)); ^ permalink raw reply related [flat|nested] 96+ messages in thread
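To make jamal's three driver-side steps concrete, here is a rough skeleton against the API proposed in the patch above. Only NETIF_F_BTX, dev->blist, dev->xmit_win and dev->hard_batch_xmit come from the patch (none of them exist in mainline); foo_queue_one(), foo_tx_ring_space() and FOO_TX_RING_SIZE are invented placeholders for the driver's own ring handling.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

#define FOO_TX_RING_SIZE 256			/* made-up ring size */

/* hypothetical driver internals */
static int foo_queue_one(struct sk_buff *skb, struct net_device *dev);
static int foo_tx_ring_space(struct net_device *dev);

/* step 2: loop over the batch list and stash packets onto the hardware */
static int foo_batch_xmit(struct sk_buff_head *skbs, struct net_device *dev)
{
	struct sk_buff *skb;
	int ret = NETDEV_TX_OK;

	while ((skb = __skb_dequeue(skbs)) != NULL) {
		ret = foo_queue_one(skb, dev);
		if (ret == NETDEV_TX_BUSY) {
			__skb_queue_head(skbs, skb);	/* retry later */
			break;
		}
	}

	/* step 3: update the window hint for the core */
	if (foo_tx_ring_space(dev) < MAX_SKB_FRAGS + 2) {
		dev->xmit_win = 1;
		netif_stop_queue(dev);
	} else {
		dev->xmit_win = foo_tx_ring_space(dev);
	}
	return ret;
}

static void foo_init_batching(struct net_device *dev)
{
	dev->features |= NETIF_F_BTX;		/* step 1a: advertise batching */
	skb_queue_head_init(&dev->blist);	/* step 1b: init the batch queue */
	dev->xmit_win = FOO_TX_RING_SIZE / 2;	/* step 1c: initial window hint */
	dev->hard_batch_xmit = foo_batch_xmit;	/* step 2: batch xmit method */
}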
* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits 2007-05-15 22:17 ` [WIP] [PATCH] WAS " jamal @ 2007-05-15 22:48 ` jamal 2007-05-16 0:50 ` jamal 2007-05-16 22:12 ` Sridhar Samudrala 1 sibling, 1 reply; 96+ messages in thread From: jamal @ 2007-05-15 22:48 UTC (permalink / raw) To: David Miller Cc: xma, rdreier, ak, krkumar2, netdev, ashwin.chaugule, Evgeniy Polyakov, Max Krasnyansky, Gagan Arneja [-- Attachment #1: Type: text/plain, Size: 681 bytes --] On Tue, 2007-15-05 at 18:17 -0400, jamal wrote: > I will post a patch for tun device in a few minutes > that i use to test on my laptop (i need to remove some debugs) to show > an example. Ok, here it is. The way i test is to point packets at a tun device. [One way i do it is attach an ingress qdisc on lo; attach a u32 filter to match all; on match redirect to the tun device]. The user space program reading sleeps for about a second every 20 packets or so. This forces things to accumulate in the drivers queue. Backpressure builds up and the throttling effect is really nice to see working. I will try to post the e1000 patch tonight or tommorow morning. cheers, jamal [-- Attachment #2: tun-b --] [-- Type: text/x-patch, Size: 3734 bytes --] diff --git a/drivers/net/tun.c b/drivers/net/tun.c index a2c6caa..076f794 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -70,6 +70,7 @@ static int debug; #endif +#define NETDEV_LTT 4 /* the low threshold to open up the tx path */ /* Network device part of the driver */ static LIST_HEAD(tun_dev_list); @@ -86,9 +87,53 @@ static int tun_net_open(struct net_device *dev) static int tun_net_close(struct net_device *dev) { netif_stop_queue(dev); + //skb_queue_purge(&dev->blist); return 0; } +/* Batch Net device start xmit + * combine with non-batching version + * */ +static int tun_net_bxmit(struct sk_buff_head *skbs, struct net_device *dev) +{ + struct sk_buff *skb; + struct tun_struct *tun = netdev_priv(dev); + u32 qlen = skb_queue_len(&tun->readq); + + /* Drop packet if interface is not attached */ + if (!tun->attached) { + tun->stats.tx_dropped+=skb_queue_len(&dev->blist); + skb_queue_purge(&dev->blist); + return NETDEV_TX_OK; + } + + while (skb_queue_len(&dev->blist)) { + skb = __skb_dequeue(skbs); + if (!skb) + break; + skb_queue_tail(&tun->readq, skb); + } + + qlen = skb_queue_len(&tun->readq); + if (qlen >= dev->tx_queue_len) { + netif_stop_queue(dev); + tun->stats.tx_fifo_errors++; + dev->xmit_win = 1; + } else { + dev->xmit_win = dev->tx_queue_len - qlen; + } + + /* Queue packet */ + dev->trans_start = jiffies; + + /* Notify and wake up reader process */ + if (tun->flags & TUN_FASYNC) + kill_fasync(&tun->fasync, SIGIO, POLL_IN); + wake_up_interruptible(&tun->read_wait); + + return NETDEV_TX_OK; +} + /* Net device start xmit */ static int tun_net_xmit(struct sk_buff *skb, struct net_device *dev) { @@ -207,6 +252,7 @@ static void tun_net_init(struct net_device *dev) dev->tx_queue_len = TUN_READQ_SIZE; /* We prefer our own queue length */ break; } + dev->xmit_win = dev->tx_queue_len>>1; /* handwave, handwave */ } /* Character device part */ @@ -382,7 +428,13 @@ static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec *iv, schedule(); continue; } - netif_wake_queue(tun->dev); + { + u32 t = skb_queue_len(&tun->readq); + if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) { + tun->dev->xmit_win = tun->dev->tx_queue_len; + netif_wake_queue(tun->dev); + } + } /** Decide whether to accept this packet. 
This code is designed to * behave identically to an Ethernet interface. Accept the packet if @@ -429,6 +481,7 @@ static void tun_setup(struct net_device *dev) struct tun_struct *tun = netdev_priv(dev); skb_queue_head_init(&tun->readq); + skb_queue_head_init(&dev->blist); init_waitqueue_head(&tun->read_wait); tun->owner = -1; @@ -436,6 +489,8 @@ static void tun_setup(struct net_device *dev) SET_MODULE_OWNER(dev); dev->open = tun_net_open; dev->hard_start_xmit = tun_net_xmit; + dev->hard_prep_xmit = NULL; + dev->hard_batch_xmit = tun_net_bxmit; dev->stop = tun_net_close; dev->get_stats = tun_net_stats; dev->ethtool_ops = &tun_ethtool_ops; @@ -458,7 +513,7 @@ static struct tun_struct *tun_get_by_name(const char *name) static int tun_set_iff(struct file *file, struct ifreq *ifr) { struct tun_struct *tun; - struct net_device *dev; + struct net_device *dev = NULL; int err; tun = tun_get_by_name(ifr->ifr_name); @@ -528,12 +583,15 @@ static int tun_set_iff(struct file *file, struct ifreq *ifr) } DBG(KERN_INFO "%s: tun_set_iff\n", tun->dev->name); + dev->features |= NETIF_F_BTX; if (ifr->ifr_flags & IFF_NO_PI) tun->flags |= TUN_NO_PI; - if (ifr->ifr_flags & IFF_ONE_QUEUE) + if (ifr->ifr_flags & IFF_ONE_QUEUE) { tun->flags |= TUN_ONE_QUEUE; + dev->features &= ~NETIF_F_BTX; + } file->private_data = tun; tun->attached = 1; ^ permalink raw reply related [flat|nested] 96+ messages in thread
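For reference, the slow userspace reader jamal describes could be as small as the sketch below. It is illustrative only (not the actual test program) and assumes the tun fd has already been created and attached by other code.

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* drain a tun fd, but stall every ~20 packets so the driver queue
 * backs up and the batching / backpressure path gets exercised */
static int slow_drain(int tun_fd)
{
	char buf[4096];
	unsigned long count = 0;

	for (;;) {
		ssize_t n = read(tun_fd, buf, sizeof(buf));

		if (n < 0) {
			perror("read");
			return -1;
		}
		if (++count % 20 == 0)
			sleep(1);	/* let packets pile up in the driver */
	}
	return 0;
}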
* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits 2007-05-15 22:48 ` jamal @ 2007-05-16 0:50 ` jamal 0 siblings, 0 replies; 96+ messages in thread From: jamal @ 2007-05-16 0:50 UTC (permalink / raw) To: David Miller Cc: xma, rdreier, ak, krkumar2, netdev, ashwin.chaugule, Evgeniy Polyakov, Max Krasnyansky, Gagan Arneja On Tue, 2007-15-05 at 18:48 -0400, jamal wrote: > I will try to post the e1000 patch tonight or tommorow morning. I have the e1000 path done; a few features from the 2.6.18 missing (mainly the one mucking with tx ring pruning on the tx path). While it compiles and looks right - i havent tested it and wont have time for another day or so. However, if anyone wants it in its current form -let me know and i will email you privately. cheers, jamal ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits 2007-05-15 22:17 ` [WIP] [PATCH] WAS " jamal 2007-05-15 22:48 ` jamal @ 2007-05-16 22:12 ` Sridhar Samudrala 2007-05-16 22:52 ` jamal 2007-05-17 4:03 ` Krishna Kumar2 1 sibling, 2 replies; 96+ messages in thread From: Sridhar Samudrala @ 2007-05-16 22:12 UTC (permalink / raw) To: hadi Cc: David Miller, xma, rdreier, ak, krkumar2, netdev, netdev-owner, ashwin.chaugule, Evgeniy Polyakov, Gagan Arneja Jamal, Here are some comments i have on your patch. See them inline. Thanks Sridhar +static int try_get_tx_pkts(struct net_device *dev, struct Qdisc *q, int count) +{ + struct sk_buff *skb; + struct sk_buff_head *skbs = &dev->blist; + int tdq = count; + + /* + * very unlikely, but who knows .. + * If this happens we dont try to grab more pkts + */ + if (!skb_queue_empty(&dev->blist)) + return skb_queue_len(&dev->blist); + + if (dev->gso_skb) { + count--; + __skb_queue_head(skbs, dev->gso_skb); + dev->gso_skb = NULL; + } AFAIK, gso_skb can be a list of skb's. Can we add a list to another list using __skb_queue_head()? Also, if gso_skb is a list of multiple skb's, i think the count needs to be decremented by the number of segments in gso_skb. + + while (count) { + skb = q->dequeue(q); + if (!skb) + break; + count--; + __skb_queue_head(skbs, skb); + } + + return tdq - count; +} + +static inline int try_tx_pkts(struct net_device *dev) +{ + + return dev->hard_batch_xmit(&dev->blist, dev); + +} + +/* same comments as in qdisc_restart apply; + * at some point use shared code with qdisc_restart*/ +int batch_qdisc_restart(struct net_device *dev) +{ + struct Qdisc *q = dev->qdisc; + unsigned lockless = (dev->features & NETIF_F_LLTX); + int count = dev->xmit_win; + int ret = 0; + + ret = try_get_tx_pkts(dev, q, count); + + if (ret == 0) + return qdisc_qlen(q); + + /* we have packets to send! */ + if (!lockless) { + if (!netif_tx_trylock(dev)) + return tx_islocked(NULL, dev, q); + } + + /* all clear .. */ + spin_unlock(&dev->queue_lock); + + ret = NETDEV_TX_BUSY; + if (!netif_queue_stopped(dev)) + ret = try_tx_pkts(dev); try_tx_pkts() is directly calling the device's batch xmit routine. Don't we need to call dev_hard_start_xmit() to handle dev_queue_xmit_nit and GSO segmentation? + + if (!lockless) + netif_tx_unlock(dev); + + spin_lock(&dev->queue_lock); + + q = dev->qdisc; + + /* most likely result, packet went ok */ + if (ret == NETDEV_TX_OK) + return qdisc_qlen(q); + /* only for lockless drivers .. */ + if (ret == NETDEV_TX_LOCKED && lockless) + return tx_islocked(NULL, dev, q); + + if (unlikely(ret != NETDEV_TX_BUSY && net_ratelimit())) + printk(KERN_WARNING " BUG %s code %d qlen %d\n", + dev->name, ret, q->q.qlen); + + return do_dev_requeue(NULL, dev, q); +} ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits 2007-05-16 22:12 ` Sridhar Samudrala @ 2007-05-16 22:52 ` jamal 2007-05-17 3:25 ` jamal 2007-05-17 4:03 ` Krishna Kumar2 1 sibling, 1 reply; 96+ messages in thread From: jamal @ 2007-05-16 22:52 UTC (permalink / raw) To: Sridhar Samudrala Cc: David Miller, xma, rdreier, ak, krkumar2, netdev, netdev-owner, ashwin.chaugule, Evgeniy Polyakov, Gagan Arneja On Wed, 2007-16-05 at 15:12 -0700, Sridhar Samudrala wrote: > Jamal, > > Here are some comments i have on your patch. > See them inline. > Thanks for taking the time Sridhar. > try_tx_pkts() is directly calling the device's batch xmit routine. > Don't we need to call dev_hard_start_xmit() to handle dev_queue_xmit_nit > and GSO segmentation? > I think this is the core of all your other comments. Good catch - that piece of code has changed since 2.6.18; so the patch i posted wont work with GSO. You actually caught a bug in my other patch as well that nobody else did;-> I will have to think a bit about this; i may end up coalescing when grabbing the packets but call the nit from the driver using a helper. cheers, jamal ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits 2007-05-16 22:52 ` jamal @ 2007-05-17 3:25 ` jamal 2007-05-18 12:07 ` jamal 0 siblings, 1 reply; 96+ messages in thread From: jamal @ 2007-05-17 3:25 UTC (permalink / raw) To: Sridhar Samudrala Cc: David Miller, xma, rdreier, ak, krkumar2, netdev, ashwin.chaugule, Evgeniy Polyakov, Auke, Gagan Arneja [-- Attachment #1: Type: text/plain, Size: 808 bytes --] On Wed, 2007-16-05 at 18:52 -0400, jamal wrote: > On Wed, 2007-16-05 at 15:12 -0700, Sridhar Samudrala wrote: > > I will have to think a bit about this; i may end up coalescing when > grabbing the packets but call the nit from the driver using a helper. > Thats what i did. This would hopefully work with GSO now (infact nit works now with GSO when it didnt before). This patch now includes two changed drivers (tun and e1000). I have tested tun with this patch. I tested e1000 earlier and i couldnt find any issues - although as the tittle says its a WIP. As before you need net-2.6. You also need the qdisc restart cleanup patch. Please comment. If all is good, I think my next efforts will be to convert pktgen to be aware of the api so we can have some serious traffic generation. cheers, jamal [-- Attachment #2: combined-batch-p --] [-- Type: text/x-patch, Size: 25942 bytes --] diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c index 637ae8f..b4c900e 100644 --- a/drivers/net/e1000/e1000_main.c +++ b/drivers/net/e1000/e1000_main.c @@ -154,6 +154,8 @@ static void e1000_update_phy_info(unsigned long data); static void e1000_watchdog(unsigned long data); static void e1000_82547_tx_fifo_stall(unsigned long data); static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev); +static int e1000_prep_queue_frame(struct sk_buff *skb, struct net_device *dev); +static int e1000_xmit_frames(struct sk_buff_head *list, struct net_device *dev); static struct net_device_stats * e1000_get_stats(struct net_device *netdev); static int e1000_change_mtu(struct net_device *netdev, int new_mtu); static int e1000_set_mac(struct net_device *netdev, void *p); @@ -932,6 +934,8 @@ e1000_probe(struct pci_dev *pdev, netdev->open = &e1000_open; netdev->stop = &e1000_close; netdev->hard_start_xmit = &e1000_xmit_frame; + netdev->hard_prep_xmit = &e1000_prep_queue_frame; + netdev->hard_batch_xmit = &e1000_xmit_frames; netdev->get_stats = &e1000_get_stats; netdev->set_multicast_list = &e1000_set_multi; netdev->set_mac_address = &e1000_set_mac; @@ -940,6 +944,7 @@ e1000_probe(struct pci_dev *pdev, e1000_set_ethtool_ops(netdev); netdev->tx_timeout = &e1000_tx_timeout; netdev->watchdog_timeo = 5 * HZ; + skb_queue_head_init(&netdev->blist); #ifdef CONFIG_E1000_NAPI netdev->poll = &e1000_clean; netdev->weight = 64; @@ -998,6 +1003,7 @@ e1000_probe(struct pci_dev *pdev, netdev->features |= NETIF_F_HIGHDMA; netdev->features |= NETIF_F_LLTX; + netdev->features |= NETIF_F_BTX; adapter->en_mng_pt = e1000_enable_mng_pass_thru(&adapter->hw); @@ -1155,6 +1161,7 @@ e1000_probe(struct pci_dev *pdev, if ((err = register_netdev(netdev))) goto err_register; + netdev->xmit_win = adapter->tx_ring->count>>1; /* tell the stack to leave us alone until e1000_open() is called */ netif_carrier_off(netdev); netif_stop_queue(netdev); @@ -1449,6 +1456,7 @@ e1000_open(struct net_device *netdev) /* fire a link status change interrupt to start the watchdog */ E1000_WRITE_REG(&adapter->hw, ICS, E1000_ICS_LSC); + printk("%s Batch window is %d\n",netdev->name, netdev->xmit_win); return 
E1000_SUCCESS; err_req_irq: @@ -1503,6 +1511,7 @@ e1000_close(struct net_device *netdev) e1000_check_mng_mode(&adapter->hw)) e1000_release_hw_control(adapter); + skb_queue_purge(&netdev->blist); return 0; } @@ -3098,6 +3107,18 @@ e1000_tx_map(struct e1000_adapter *adapter, struct e1000_tx_ring *tx_ring, } static void +e1000_kick_DMA(struct e1000_adapter *adapter, + struct e1000_tx_ring *tx_ring, int i) +{ + wmb(); + writel(i, adapter->hw.hw_addr + tx_ring->tdt); + /* we need this if more than one processor can write to our tail + ** at a time, it syncronizes IO on IA64/Altix systems */ + mmiowb(); +} + + +static void e1000_tx_queue(struct e1000_adapter *adapter, struct e1000_tx_ring *tx_ring, int tx_flags, int count) { @@ -3139,17 +3160,7 @@ e1000_tx_queue(struct e1000_adapter *adapter, struct e1000_tx_ring *tx_ring, tx_desc->lower.data |= cpu_to_le32(adapter->txd_cmd); - /* Force memory writes to complete before letting h/w - * know there are new descriptors to fetch. (Only - * applicable for weak-ordered memory model archs, - * such as IA-64). */ - wmb(); - tx_ring->next_to_use = i; - writel(i, adapter->hw.hw_addr + tx_ring->tdt); - /* we need this if more than one processor can write to our tail - * at a time, it syncronizes IO on IA64/Altix systems */ - mmiowb(); } /** @@ -3256,54 +3267,60 @@ static int e1000_maybe_stop_tx(struct net_device *netdev, } #define TXD_USE_COUNT(S, X) (((S) >> (X)) + 1 ) +struct e1000_tx_cbdata { + int count; + unsigned int max_per_txd; + unsigned int nr_frags; + unsigned int mss; +}; + +#define E1000_SKB_CB(__skb) ((struct e1000_tx_cbdata *)&((__skb)->cb[0])) +#define NETDEV_TX_DROPPED -5 + static int -e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev) +e1000_prep_queue_frame(struct sk_buff *skb, struct net_device *netdev) { - struct e1000_adapter *adapter = netdev_priv(netdev); struct e1000_tx_ring *tx_ring; - unsigned int first, max_per_txd = E1000_MAX_DATA_PER_TXD; + unsigned int f; + struct e1000_adapter *adapter = netdev_priv(netdev); unsigned int max_txd_pwr = E1000_MAX_TXD_PWR; - unsigned int tx_flags = 0; unsigned int len = skb->len; - unsigned long flags; - unsigned int nr_frags = 0; - unsigned int mss = 0; - int count = 0; - int tso; - unsigned int f; + + struct e1000_tx_cbdata *cb = E1000_SKB_CB(skb); + cb->mss = 0; + cb->nr_frags = 0; + cb->max_per_txd = E1000_MAX_DATA_PER_TXD; + cb->count = 0; + len -= skb->data_len; - /* This goes back to the question of how to logically map a tx queue - * to a flow. Right now, performance is impacted slightly negatively - * if using multiple tx queues. If the stack breaks away from a - * single qdisc implementation, we can look at this again. */ tx_ring = adapter->tx_ring; if (unlikely(skb->len <= 0)) { dev_kfree_skb_any(skb); - return NETDEV_TX_OK; + return NETDEV_TX_DROPPED; } - /* 82571 and newer doesn't need the workaround that limited descriptor - * length to 4kB */ + /* 82571 and newer doesn't need the workaround that limited + descriptor length to 4kB */ if (adapter->hw.mac_type >= e1000_82571) - max_per_txd = 8192; + cb->max_per_txd = 8192; - mss = skb_shinfo(skb)->gso_size; + cb->mss = skb_shinfo(skb)->gso_size; /* The controller does a simple calculation to * make sure there is enough room in the FIFO before * initiating the DMA for each buffer. The calc is: * 4 = ceil(buffer len/mss). To make sure we don't * overrun the FIFO, adjust the max buffer len if mss * drops. 
*/ - if (mss) { + if (cb->mss) { uint8_t hdr_len; - max_per_txd = min(mss << 2, max_per_txd); - max_txd_pwr = fls(max_per_txd) - 1; + cb->max_per_txd = min(cb->mss << 2, cb->max_per_txd); + max_txd_pwr = fls(cb->max_per_txd) - 1; /* TSO Workaround for 82571/2/3 Controllers -- if skb->data - * points to just header, pull a few bytes of payload from - * frags into skb->data */ + * points to just header, pull a few bytes of payload from + * frags into skb->data */ hdr_len = skb_transport_offset(skb) + tcp_hdrlen(skb); if (skb->data_len && (hdr_len == (skb->len - skb->data_len))) { switch (adapter->hw.mac_type) { @@ -3315,7 +3332,8 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev) * NOTE: this is a TSO only workaround * if end byte alignment not correct move us * into the next dword */ - if ((unsigned long)(skb_tail_pointer(skb) - 1) & 4) + if ((unsigned long)(skb_tail_pointer(skb) - + 1) & 4) break; /* fall through */ case e1000_82571: @@ -3327,7 +3345,7 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev) DPRINTK(DRV, ERR, "__pskb_pull_tail failed.\n"); dev_kfree_skb_any(skb); - return NETDEV_TX_OK; + return NETDEV_TX_DROPPED; } len = skb->len - skb->data_len; break; @@ -3339,46 +3357,56 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev) } /* reserve a descriptor for the offload context */ - if ((mss) || (skb->ip_summed == CHECKSUM_PARTIAL)) - count++; - count++; + if ((cb->mss) || (skb->ip_summed == CHECKSUM_PARTIAL)) + cb->count++; + cb->count++; /* Controller Erratum workaround */ if (!skb->data_len && tx_ring->last_tx_tso && !skb_is_gso(skb)) - count++; + cb->count++; - count += TXD_USE_COUNT(len, max_txd_pwr); + cb->count += TXD_USE_COUNT(len, max_txd_pwr); if (adapter->pcix_82544) - count++; + cb->count++; /* work-around for errata 10 and it applies to all controllers * in PCI-X mode, so add one more descriptor to the count */ if (unlikely((adapter->hw.bus_type == e1000_bus_type_pcix) && - (len > 2015))) - count++; + (len > 2015))) + cb->count++; - nr_frags = skb_shinfo(skb)->nr_frags; - for (f = 0; f < nr_frags; f++) - count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size, - max_txd_pwr); + cb->nr_frags = skb_shinfo(skb)->nr_frags; + for (f = 0; f < cb->nr_frags; f++) + cb->count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size, + max_txd_pwr); if (adapter->pcix_82544) - count += nr_frags; - + cb->count += cb->nr_frags; if (adapter->hw.tx_pkt_filtering && (adapter->hw.mac_type == e1000_82573)) e1000_transfer_dhcp_info(adapter, skb); - if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) - /* Collision - tell upper layer to requeue */ - return NETDEV_TX_LOCKED; + return NETDEV_TX_OK; +} + +/* invoked under tx_ring->lock */ +static int e1000_queue_frame(struct sk_buff *skb, struct net_device *netdev) +{ + struct e1000_tx_ring *tx_ring; + int tso; + unsigned int first; + struct e1000_adapter *adapter = netdev_priv(netdev); + unsigned int tx_flags = 0; + + struct e1000_tx_cbdata *cb = E1000_SKB_CB(skb); + tx_ring = adapter->tx_ring; /* need: count + 2 desc gap to keep tail from touching * head, otherwise try next time */ - if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, count + 2))) { - spin_unlock_irqrestore(&tx_ring->tx_lock, flags); + if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, cb->count + 2))) { + netif_stop_queue(netdev); return NETDEV_TX_BUSY; } @@ -3386,7 +3414,6 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev) if (unlikely(e1000_82547_fifo_workaround(adapter, skb))) { netif_stop_queue(netdev); 
mod_timer(&adapter->tx_fifo_stall_timer, jiffies + 1); - spin_unlock_irqrestore(&tx_ring->tx_lock, flags); return NETDEV_TX_BUSY; } } @@ -3401,8 +3428,7 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev) tso = e1000_tso(adapter, tx_ring, skb); if (tso < 0) { dev_kfree_skb_any(skb); - spin_unlock_irqrestore(&tx_ring->tx_lock, flags); - return NETDEV_TX_OK; + return NETDEV_TX_DROPPED; } if (likely(tso)) { @@ -3418,16 +3444,157 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev) tx_flags |= E1000_TX_FLAGS_IPV4; e1000_tx_queue(adapter, tx_ring, tx_flags, - e1000_tx_map(adapter, tx_ring, skb, first, - max_per_txd, nr_frags, mss)); + e1000_tx_map(adapter, tx_ring, skb, first, + cb->max_per_txd, cb->nr_frags, cb->mss)); - netdev->trans_start = jiffies; + return NETDEV_TX_OK; +} + +/* called with tx_ring->lock held */ +static int real_e1000_xmit_frame(struct sk_buff *skb, struct net_device *dev) +{ + struct e1000_adapter *adapter = netdev_priv(dev); + int ret = NETDEV_TX_OK; + struct e1000_tx_ring *tx_ring = adapter->tx_ring; + + ret = e1000_queue_frame(skb, dev); + + if (ret == NETDEV_TX_OK) { + e1000_kick_DMA(adapter, tx_ring, adapter->tx_ring->next_to_use); + dev->trans_start = jiffies; + } + if (ret == NETDEV_TX_DROPPED) + ret = NETDEV_TX_OK; - /* Make sure there is space in the ring for the next send. */ - e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2); + /* XXX: This seems so unnecessary, because if we are + * NETDEV_TX_BUSY already, we are already + * netif_queue_stopped(dev) + * but its what the driver does, resolve later */ + if (unlikely(e1000_maybe_stop_tx(dev, tx_ring, MAX_SKB_FRAGS + 2))) { + dev->xmit_win = 1; + netif_stop_queue(dev); + ret = NETDEV_TX_BUSY; + } else { + int rspace = E1000_DESC_UNUSED(tx_ring) - (MAX_SKB_FRAGS + 2); + dev->xmit_win = rspace; + } + + if (ret == NETDEV_TX_BUSY) + printk("Single: %s stopped with win of %d\n", + dev->name,dev->xmit_win); + return ret; +} + +/* single frame transmitter */ +static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev) +{ + int ret = NETDEV_TX_OK; + struct e1000_adapter *adapter = netdev_priv(netdev); + struct e1000_tx_ring *tx_ring = adapter->tx_ring; + unsigned long flags; + struct e1000_tx_cbdata *cb; + + /* hopefully we will never cb data > 48 bytes .. */ + memset(skb->cb, 0, sizeof(skb->cb)); + ret = netdev->hard_prep_xmit(skb, netdev); + if (ret != NETDEV_TX_OK) + return NETDEV_TX_OK; + + if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) { + /* Collision - tell upper layer to requeue */ + return NETDEV_TX_LOCKED; + } + + cb = E1000_SKB_CB(skb); + /* need: count + 2 desc gap to keep tail from touching + * head, otherwise try next time */ + /* XXX: This seems so unnecessary, because if we are + * NETDEV_TX_BUSY already, we are already + * netif_queue_stopped(dev) + * but its what the driver does, resolve later */ + if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, cb->count + 2))) { + spin_unlock_irqrestore(&tx_ring->tx_lock, flags); + return NETDEV_TX_BUSY; + } + + ret = real_e1000_xmit_frame(skb, netdev); spin_unlock_irqrestore(&tx_ring->tx_lock, flags); - return NETDEV_TX_OK; + return ret; +} + +/* + * Batch transmit +*/ +static int +e1000_xmit_frames(struct sk_buff_head *list, struct net_device *netdev) +{ + struct e1000_adapter *adapter = netdev->priv; + struct e1000_tx_ring *tx_ring = adapter->tx_ring; + int ret = NETDEV_TX_OK; + int didq = 0; + struct sk_buff *skb = NULL; + unsigned long flags; + + /* + * we should probably wait for this lock! 
+ */ + if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) { + /* Collision - tell upper layer to requeue */ + return NETDEV_TX_LOCKED; + } + + while ((skb = __skb_dequeue(list)) != NULL) { + memset(skb->cb, 0, sizeof(skb->cb)); /* remove? */ + ret = netdev->hard_prep_xmit(skb, netdev); + if (ret != NETDEV_TX_OK) + continue; + + /*XXX: This may be an opportunity to not give nit + * the packet if the dev ix TX BUSY ;-> */ + dev_do_xmit_nit(skb, netdev); + ret = e1000_queue_frame(skb, netdev); + if (ret == NETDEV_TX_OK) { + didq++; + } else { + /* should never happen, but murphy is around */ + if (ret == NETDEV_TX_BUSY) { + __skb_queue_head(list, skb); + break; + } + } + } + + /* we tried to send as many as we could .. */ + if (didq) { + e1000_kick_DMA(adapter, tx_ring, adapter->tx_ring->next_to_use); + netdev->trans_start = jiffies; + } + + if (ret == NETDEV_TX_DROPPED) + ret = NETDEV_TX_OK; + + /* XXX: This seems so unnecessary, because if we are + * NETDEV_TX_BUSY already, we are already + * netif_queue_stopped(dev) + * but its what the driver does, resolve later */ + /* need: MAX_SKB_FRAGS + 2 desc gap to keep tail from touching + * head, otherwise try next time */ + if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2))) { + netdev->xmit_win = 1; + netif_stop_queue(netdev); + ret = NETDEV_TX_BUSY; + } else { + int rspace = E1000_DESC_UNUSED(tx_ring) - (MAX_SKB_FRAGS + 2); + netdev->xmit_win = rspace; + printk("batch %s still awake with win of %d\n", + netdev->name, netdev->xmit_win); + } + spin_unlock_irqrestore(&tx_ring->tx_lock, flags); + if (ret == NETDEV_TX_BUSY) + printk("Batch: %s stopped with win of %d\n", + netdev->name, netdev->xmit_win); + return ret; } /** @@ -4032,7 +4199,10 @@ e1000_clean_tx_irq(struct e1000_adapter *adapter, */ smp_mb(); if (netif_queue_stopped(netdev)) { + netdev->xmit_win = E1000_DESC_UNUSED(tx_ring); netif_wake_queue(netdev); + printk(" %s woken with win of %d\n", + netdev->name,netdev->xmit_win); ++adapter->restart_queue; } } diff --git a/drivers/net/tun.c b/drivers/net/tun.c index a2c6caa..e128ae3 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -70,6 +70,7 @@ static int debug; #endif +#define NETDEV_LTT 4 /* the low threshold to open up the tx path */ /* Network device part of the driver */ static LIST_HEAD(tun_dev_list); @@ -86,9 +87,56 @@ static int tun_net_open(struct net_device *dev) static int tun_net_close(struct net_device *dev) { netif_stop_queue(dev); + skb_queue_purge(&dev->blist); return 0; } +/* Batch Net device start xmit + * combine with non-batching version + * */ +static int tun_net_bxmit(struct sk_buff_head *skbs, struct net_device *dev) +{ + struct sk_buff *skb; + int didq = 0; + struct tun_struct *tun = netdev_priv(dev); + u32 qlen = skb_queue_len(&tun->readq); + + /* Drop packet if interface is not attached */ + if (!tun->attached) { + tun->stats.tx_dropped+=skb_queue_len(&dev->blist); + skb_queue_purge(&dev->blist); + return NETDEV_TX_OK; + } + + while (skb_queue_len(&dev->blist)) { + skb = __skb_dequeue(skbs); + if (!skb) + break; + dev_do_xmit_nit(skb, dev); + skb_queue_tail(&tun->readq, skb); + didq++; + } + + qlen = skb_queue_len(&tun->readq); + if (qlen >= dev->tx_queue_len) { + netif_stop_queue(dev); + tun->stats.tx_fifo_errors++; + dev->xmit_win = 1; + } else { + dev->xmit_win = dev->tx_queue_len - qlen; + } + + if (didq) + dev->trans_start = jiffies; + + /* Notify and wake up reader process */ + if (tun->flags & TUN_FASYNC) + kill_fasync(&tun->fasync, SIGIO, POLL_IN); + 
wake_up_interruptible(&tun->read_wait); + + return NETDEV_TX_OK; +} + /* Net device start xmit */ static int tun_net_xmit(struct sk_buff *skb, struct net_device *dev) { @@ -207,6 +255,7 @@ static void tun_net_init(struct net_device *dev) dev->tx_queue_len = TUN_READQ_SIZE; /* We prefer our own queue length */ break; } + dev->xmit_win = dev->tx_queue_len>>1; /* handwave, handwave */ } /* Character device part */ @@ -382,7 +431,13 @@ static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec *iv, schedule(); continue; } - netif_wake_queue(tun->dev); + { + u32 t = skb_queue_len(&tun->readq); + if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) { + tun->dev->xmit_win = tun->dev->tx_queue_len; + netif_wake_queue(tun->dev); + } + } /** Decide whether to accept this packet. This code is designed to * behave identically to an Ethernet interface. Accept the packet if @@ -429,6 +484,7 @@ static void tun_setup(struct net_device *dev) struct tun_struct *tun = netdev_priv(dev); skb_queue_head_init(&tun->readq); + skb_queue_head_init(&dev->blist); init_waitqueue_head(&tun->read_wait); tun->owner = -1; @@ -436,6 +492,8 @@ static void tun_setup(struct net_device *dev) SET_MODULE_OWNER(dev); dev->open = tun_net_open; dev->hard_start_xmit = tun_net_xmit; + dev->hard_prep_xmit = NULL; + dev->hard_batch_xmit = tun_net_bxmit; dev->stop = tun_net_close; dev->get_stats = tun_net_stats; dev->ethtool_ops = &tun_ethtool_ops; @@ -458,7 +516,7 @@ static struct tun_struct *tun_get_by_name(const char *name) static int tun_set_iff(struct file *file, struct ifreq *ifr) { struct tun_struct *tun; - struct net_device *dev; + struct net_device *dev = NULL; int err; tun = tun_get_by_name(ifr->ifr_name); @@ -528,12 +586,15 @@ static int tun_set_iff(struct file *file, struct ifreq *ifr) } DBG(KERN_INFO "%s: tun_set_iff\n", tun->dev->name); + dev->features |= NETIF_F_BTX; if (ifr->ifr_flags & IFF_NO_PI) tun->flags |= TUN_NO_PI; - if (ifr->ifr_flags & IFF_ONE_QUEUE) + if (ifr->ifr_flags & IFF_ONE_QUEUE) { tun->flags |= TUN_ONE_QUEUE; + dev->features &= ~NETIF_F_BTX; + } file->private_data = tun; tun->attached = 1; diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index f671cd2..85a1baf 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -325,6 +325,7 @@ struct net_device #define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */ #define NETIF_F_GSO 2048 /* Enable software GSO. */ #define NETIF_F_LLTX 4096 /* LockLess TX */ +#define NETIF_F_BTX 8192 /* Capable of batch tx */ /* Segmentation offload features */ #define NETIF_F_GSO_SHIFT 16 @@ -450,6 +451,11 @@ struct net_device void *priv; /* pointer to private data */ int (*hard_start_xmit) (struct sk_buff *skb, struct net_device *dev); + int (*hard_batch_xmit) (struct sk_buff_head *list, + struct net_device *dev); + int (*hard_prep_xmit) (struct sk_buff *skb, + struct net_device *dev); + int xmit_win; /* These may be needed for future network-power-down code. 
*/ unsigned long trans_start; /* Time (in jiffies) of last Tx */ @@ -466,6 +472,10 @@ struct net_device struct list_head todo_list; /* device index hash chain */ struct hlist_node index_hlist; + /*XXX: Fix eventually to not allocate if device not + *batch capable + */ + struct sk_buff_head blist; struct net_device *link_watch_next; @@ -742,7 +752,12 @@ extern int dev_set_mac_address(struct net_device *, struct sockaddr *); extern int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev); - +extern int do_gso_skb(struct sk_buff *skb, + struct sk_buff_head *skbs); +extern int do_possible_gso_skb(struct sk_buff *skb, + struct net_device *dev); +extern void dev_do_xmit_nit(struct sk_buff *skb, + struct net_device *dev); extern void dev_init(void); extern int netdev_budget; diff --git a/net/core/dev.c b/net/core/dev.c index 8301e2a..0d728cd 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1372,6 +1372,47 @@ out_kfree_skb: return 0; } +int do_gso_skb(struct sk_buff *skb, struct sk_buff_head *skbs) +{ + int tdq = 0; + do { + struct sk_buff *nskb = skb->next; + + skb->next = nskb->next; + nskb->next = NULL; + tdq++; + __skb_queue_head(skbs, skb); + } while (skb->next); + skb->destructor = DEV_GSO_CB(skb)->destructor; + + return tdq; +} + +int do_possible_gso_skb(struct sk_buff *skb, struct net_device *dev) +{ + struct sk_buff_head *skbs = &dev->blist; + + if (netif_needs_gso(dev, skb)) { + if (unlikely(dev_gso_segment(skb))) { + kfree_skb(skb); + return 0; + } + if (skb->next) + return do_gso_skb(skb, skbs); + } + + __skb_queue_head(skbs, skb); + return 1; +} + +/* invoked by driver when batching once figured skb is sane*/ +void dev_do_xmit_nit(struct sk_buff *skb, struct net_device *dev) +{ + if (!list_empty(&ptype_all)) + dev_queue_xmit_nit(skb, dev); +} + + #define HARD_TX_LOCK(dev, cpu) { \ if ((dev->features & NETIF_F_LLTX) == 0) { \ netif_tx_lock(dev); \ diff --git a/net/core/pktgen.c b/net/core/pktgen.c index 9cd3a1c..530de14 100644 --- a/net/core/pktgen.c +++ b/net/core/pktgen.c @@ -3217,9 +3217,7 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev) pkt_dev->next_tx_us++; pkt_dev->next_tx_ns -= 1000; } - } - - else { /* Retry it next time */ + } else { /* netif_queue_stopped -- Retry it next time */ pkt_dev->last_ok = 0; pkt_dev->next_tx_us = getCurUs(); /* TODO */ pkt_dev->next_tx_ns = 0; diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index ed80054..4fe5a9b 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -85,10 +85,12 @@ static inline int do_dev_requeue(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q) { - if (unlikely(skb->next)) - dev->gso_skb = skb; - else - q->ops->requeue(skb, q); + if (skb) { + if (unlikely(skb->next)) + dev->gso_skb = skb; + else + q->ops->requeue(skb, q); + } /* XXX: Could netif_schedule fail? 
Or is that fact we are * requeueing imply the hardware path is closed * and even if we fail, some interupt will wake us @@ -116,7 +118,10 @@ tx_islocked(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q) int ret = handle_dev_cpu_collision(dev); if (ret == SCHED_TX_DROP) { - kfree_skb(skb); + if (skb) /* we are not batching */ + kfree_skb(skb); + else if (!skb_queue_empty(&dev->blist)) + skb_queue_purge(&dev->blist); return qdisc_qlen(q); } @@ -195,10 +200,104 @@ static inline int qdisc_restart(struct net_device *dev) return do_dev_requeue(skb, dev, q); } +static int try_get_tx_pkts(struct net_device *dev, struct Qdisc *q, int count) +{ + struct sk_buff *skb; + struct sk_buff_head *skbs = &dev->blist; + int tdq = 0; + + /* + * very unlikely, but who knows .. + * If this happens we dont try to grab more pkts + */ + if (!skb_queue_empty(&dev->blist)) + return skb_queue_len(&dev->blist); + + if (unlikely(dev->gso_skb)) { + skb = dev->gso_skb; + dev->gso_skb = NULL; + tdq = do_gso_skb(skb, skbs); + } + + if (tdq > count) + return tdq; /*we will stop here */ + + count -= tdq; + while (count > 0) { + skb = q->dequeue(q); + if (!skb) + break; + + tdq += do_possible_gso_skb(skb, dev); + count -= tdq; + } + + return tdq; +} + +static inline int try_tx_pkts(struct net_device *dev) +{ + + return dev->hard_batch_xmit(&dev->blist, dev); + +} + +/* same comments as in qdisc_restart apply; + * at some point use shared code with qdisc_restart*/ +int batch_qdisc_restart(struct net_device *dev) +{ + struct Qdisc *q = dev->qdisc; + unsigned lockless = (dev->features & NETIF_F_LLTX); + int count = dev->xmit_win; + int ret = 0; + + ret = try_get_tx_pkts(dev, q, count); + + if (ret == 0) + return qdisc_qlen(q); + + /* we have packets to send! */ + if (!lockless) { + if (!netif_tx_trylock(dev)) + return tx_islocked(NULL, dev, q); + } + + /* all clear .. */ + spin_unlock(&dev->queue_lock); + + ret = NETDEV_TX_BUSY; + if (!netif_queue_stopped(dev)) + ret = try_tx_pkts(dev); + + if (!lockless) + netif_tx_unlock(dev); + + spin_lock(&dev->queue_lock); + + q = dev->qdisc; + + /* most likely result, packet went ok */ + if (ret == NETDEV_TX_OK) + return qdisc_qlen(q); + /* only for lockless drivers .. */ + if (ret == NETDEV_TX_LOCKED && lockless) + return tx_islocked(NULL, dev, q); + + if (unlikely(ret != NETDEV_TX_BUSY && net_ratelimit())) + printk(KERN_WARNING " BUG %s code %d qlen %d\n", + dev->name, ret, q->q.qlen); + + return do_dev_requeue(NULL, dev, q); +} + void __qdisc_run(struct net_device *dev) { + unsigned batching = (dev->features & NETIF_F_BTX); + do { - if (!qdisc_restart(dev)) + if (!batching && !qdisc_restart(dev)) + break; + else if (!batch_qdisc_restart(dev)) break; } while (!netif_queue_stopped(dev)); ^ permalink raw reply related [flat|nested] 96+ messages in thread
* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits 2007-05-17 3:25 ` jamal @ 2007-05-18 12:07 ` jamal 0 siblings, 0 replies; 96+ messages in thread From: jamal @ 2007-05-18 12:07 UTC (permalink / raw) To: Sridhar Samudrala Cc: David Miller, xma, rdreier, ak, krkumar2, netdev, ashwin.chaugule, Evgeniy Polyakov, Auke, Gagan Arneja On Wed, 2007-16-05 at 23:25 -0400, jamal wrote: > > This patch now includes two changed drivers (tun and e1000). I have > tested tun with this patch. I tested e1000 earlier and i couldnt find > any issues - although as the tittle says its a WIP. > > As before you need net-2.6. You also need the qdisc restart cleanup > patch. > I have found a bug on e1000 which manifests under a lot of pounding with traffic. I am not gonna be posting the patch anymore - but will continue to fix and add things. If you want the patches email me privately. cheers, jamal ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits 2007-05-16 22:12 ` Sridhar Samudrala 2007-05-16 22:52 ` jamal @ 2007-05-17 4:03 ` Krishna Kumar2 2007-05-16 21:44 ` Sridhar Samudrala 1 sibling, 1 reply; 96+ messages in thread From: Krishna Kumar2 @ 2007-05-17 4:03 UTC (permalink / raw) To: Sridhar Samudrala Cc: ak, ashwin.chaugule, David Miller, Gagan Arneja, hadi, Evgeniy Polyakov, netdev, netdev-owner, rdreier, xma Hi Sridhar, Sridhar Samudrala <sri@us.ibm.com> wrote on 05/17/2007 03:42:03 AM: > AFAIK, gso_skb can be a list of skb's. Can we add a list > to another list using __skb_queue_head()? > Also, if gso_skb is a list of multiple skb's, i think the > count needs to be decremented by the number of segments in > gso_skb. gso_skb is the last GSO skb that failed to be sent. This already segmented skb is kept "cached" and whenever the next xmit happens, this skb is first sent before any packets from the queue are taken out (otherwise out-of-order packets). So there can atmost be one gso_skb per device. - KK ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits 2007-05-17 4:03 ` Krishna Kumar2 @ 2007-05-16 21:44 ` Sridhar Samudrala 2007-05-17 5:01 ` Krishna Kumar2 0 siblings, 1 reply; 96+ messages in thread From: Sridhar Samudrala @ 2007-05-16 21:44 UTC (permalink / raw) To: Krishna Kumar2 Cc: ak, ashwin.chaugule, David Miller, Gagan Arneja, hadi, Evgeniy Polyakov, netdev, netdev-owner, rdreier, xma Krishna Kumar2 wrote: > Hi Sridhar, > > Sridhar Samudrala <sri@us.ibm.com> wrote on 05/17/2007 03:42:03 AM: > >> AFAIK, gso_skb can be a list of skb's. Can we add a list >> to another list using __skb_queue_head()? >> Also, if gso_skb is a list of multiple skb's, i think the >> count needs to be decremented by the number of segments in >> gso_skb. > > gso_skb is the last GSO skb that failed to be sent. This already > segmented skb is kept "cached" and whenever the next xmit happens, > this skb is first sent before any packets from the queue are taken > out (otherwise out-of-order packets). So there can atmost be one > gso_skb per device. Yes. There can be only one gso_skb per device. But it can have more than one segments linked together via skb->next. Thanks Sridhar ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits 2007-05-16 21:44 ` Sridhar Samudrala @ 2007-05-17 5:01 ` Krishna Kumar2 0 siblings, 0 replies; 96+ messages in thread From: Krishna Kumar2 @ 2007-05-17 5:01 UTC (permalink / raw) To: Sridhar Samudrala Cc: ak, ashwin.chaugule, David Miller, Gagan Arneja, hadi, Evgeniy Polyakov, netdev, netdev-owner, rdreier, xma Sridhar Samudrala <sri@us.ibm.com> wrote on 05/17/2007 03:14:41 AM: > Krishna Kumar2 wrote: > > Hi Sridhar, > > > > Sridhar Samudrala <sri@us.ibm.com> wrote on 05/17/2007 03:42:03 AM: > > > >> AFAIK, gso_skb can be a list of skb's. Can we add a list > >> to another list using __skb_queue_head()? > >> Also, if gso_skb is a list of multiple skb's, i think the > >> count needs to be decremented by the number of segments in > >> gso_skb. > > > > gso_skb is the last GSO skb that failed to be sent. This already > > segmented skb is kept "cached" and whenever the next xmit happens, > > this skb is first sent before any packets from the queue are taken > > out (otherwise out-of-order packets). So there can atmost be one > > gso_skb per device. > > Yes. There can be only one gso_skb per device. But it can have more > than one segments linked together via skb->next. It will definitely have more than one segment (that is the criterion for being the gso_skb). But I think you are right, the segmented skb will "lose" its 1-n segments (newsk->next = next) when it is added to the list. Thanks, - KK ^ permalink raw reply [flat|nested] 96+ messages in thread
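To illustrate the point being discussed, a sketch of pulling a segmented gso_skb onto the batch list one segment at a time, so the skb->next chain is not dropped. This is only an illustration of the issue (it uses __skb_queue_tail() to keep segments in order), not a drop-in replacement for the patch's do_gso_skb().

#include <linux/skbuff.h>

/* splice a segmented gso_skb onto the batch list segment by segment;
 * returns how many entries were consumed from the caller's budget */
static int queue_gso_segments(struct sk_buff *gso_skb,
			      struct sk_buff_head *blist)
{
	struct sk_buff *seg = gso_skb;
	int n = 0;

	while (seg) {
		struct sk_buff *next = seg->next;

		seg->next = NULL;		/* detach from the GSO chain */
		__skb_queue_tail(blist, seg);	/* keep segments in order */
		seg = next;
		n++;
	}
	return n;
}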
* Re: [RFC] New driver API to speed up small packets xmits [not found] ` <OF6757F56D.EE5984FD-ON872572DC.0081026C-882572DC.00814B8F@us.ibm.com> @ 2007-05-15 23:36 ` David Miller 0 siblings, 0 replies; 96+ messages in thread From: David Miller @ 2007-05-15 23:36 UTC (permalink / raw) To: xma; +Cc: ak, krkumar2, netdev, netdev-owner, rdreier From: Shirley Ma <xma@us.ibm.com> Date: Tue, 15 May 2007 16:33:22 -0700 > That's interesting. So a generic LRO in interface layer will benefit > the preformance more, right? Receiving path TCP N times is more expensive > than sending, I think. If you look at some of the drivers doing LRO, the bulk of the implementation is in software, so yes :-) ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-15 21:32 ` David Miller 2007-05-15 22:17 ` [WIP] [PATCH] WAS " jamal [not found] ` <OF6757F56D.EE5984FD-ON872572DC.0081026C-882572DC.00814B8F@us.ibm.com> @ 2007-05-21 7:56 ` Herbert Xu [not found] ` <OF9ABCD08D.2CD1B193-ON872572E3.007A6FC1-882572E3.007ACE1A@us.ibm.com> 2 siblings, 1 reply; 96+ messages in thread From: Herbert Xu @ 2007-05-21 7:56 UTC (permalink / raw) To: David Miller; +Cc: xma, rdreier, ak, krkumar2, netdev, netdev-owner David Miller <davem@davemloft.net> wrote: > From: Shirley Ma <xma@us.ibm.com> > Date: Tue, 15 May 2007 14:22:57 -0700 > >> I just wonder without TSO support in HW, how much benefit we >> can get by pushing GSO from interface layer to device layer besides >> we can do multiple packets in IPoIB. > > I bet the gain is non-trivial. Yep, for any NIC that supports SG but not TSO then software GSO will be a big win. When the NIC doesn't support SG then the win is mostly offset by the need to copy the packet again. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits [not found] ` <OF9ABCD08D.2CD1B193-ON872572E3.007A6FC1-882572E3.007ACE1A@us.ibm.com> @ 2007-05-22 22:36 ` David Miller [not found] ` <OFCF3EB7F8.9740C0C7-ON872572E3.007DADF6-882572E3.007E0E7B@us.ibm.com> 2007-05-22 23:12 ` Herbert Xu 0 siblings, 2 replies; 96+ messages in thread From: David Miller @ 2007-05-22 22:36 UTC (permalink / raw) To: xma; +Cc: herbert, ak, krkumar2, netdev, netdev-owner, rdreier From: Shirley Ma <xma@us.ibm.com> Date: Tue, 22 May 2007 15:22:35 -0700 > > Yep, for any NIC that supports SG but not TSO then software GSO will > > be a big win. When the NIC doesn't support SG then the win is mostly > > offset by the need to copy the packet again. > > > > Cheers, > > -- > > We could avoid packet copy by allocating number of 64K/MTU skb > buffers instead of one big 64K buffer size. Is it a possible approach? I think you misunderstand the problem. If the device does not support scatter gather, it is impossible to avoid the copy. SKB's from TSO are composed of discontiguous page chunks represented by the skb_shared_info() array. These can come either from sendmsg() user data (which the kernel fills into a per-socket data page which is reallocated once filled), or from sendfile() page cache pages. Therefore, there has to be a copy somewhere to be able to give that packet to a non-scatter-gather device. ^ permalink raw reply [flat|nested] 96+ messages in thread
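To make the scatter-gather point concrete: with SG the driver can hand the page fragments in skb_shinfo(skb)->frags[] straight to its descriptors, while without SG it has to pull them into one contiguous buffer first, e.g. via skb_linearize(). The sketch below uses the 2007-era frag layout; foo_map_linear() and foo_map_page() are made-up stand-ins for the driver's real descriptor setup.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* hypothetical descriptor-setup helpers */
static void foo_map_linear(struct net_device *dev, void *data,
			   unsigned int len);
static void foo_map_page(struct net_device *dev, struct page *page,
			 unsigned int offset, unsigned int len);

static int foo_xmit(struct sk_buff *skb, struct net_device *dev)
{
	int i;

	if (!(dev->features & NETIF_F_SG) && skb_shinfo(skb)->nr_frags) {
		/* no scatter-gather: copy the paged data into the linear
		 * area -- this is the extra copy being discussed */
		if (skb_linearize(skb)) {
			dev_kfree_skb_any(skb);
			return NETDEV_TX_OK;
		}
	}

	foo_map_linear(dev, skb->data, skb_headlen(skb));
	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
		skb_frag_t *f = &skb_shinfo(skb)->frags[i];

		foo_map_page(dev, f->page, f->page_offset, f->size);
	}
	return NETDEV_TX_OK;
}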
* Re: [RFC] New driver API to speed up small packets xmits [not found] ` <OFCF3EB7F8.9740C0C7-ON872572E3.007DADF6-882572E3.007E0E7B@us.ibm.com> @ 2007-05-22 23:04 ` David Miller 0 siblings, 0 replies; 96+ messages in thread From: David Miller @ 2007-05-22 23:04 UTC (permalink / raw) To: xma; +Cc: ak, herbert, krkumar2, netdev, netdev-owner, rdreier From: Shirley Ma <xma@us.ibm.com> Date: Tue, 22 May 2007 15:58:05 -0700 > Sorry for the confusion. I am thinking to avoid copy in skb_segment() for > GSO. The way could be in tcp_sendmsg() to allocate small discontiguous > buffers (equal = MTU) instead of allocating pages. The SKB splitting algorithm in TCP's transmit engine depends upon the skb_shared_info() array being splittable at arbitrary points with only page counts to manage. This is the only way I found to make SKB splitting at transmit time extremely inexpensive. SACK block processing needs to perform these kinds of splits at well, so it really really has to be cheap. The invariant is that every TCP TSO packet must have it's header at skb->data and all of it's data in the paged skb_shared_info(). ^ permalink raw reply [flat|nested] 96+ messages in thread
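For illustration, the cheap split David describes is roughly what skb_split() provides: data beyond the split point in the paged area moves to the new skb by shuffling frag pointers and page offsets, with no copy. A very reduced sketch follows (the real tcp_fragment() also allocates through the socket, rebuilds the new segment's headers and fixes sequence numbers, flags and accounting, all omitted here); it assumes len >= skb_headlen(skb) so the header stays with the original skb.

#include <linux/skbuff.h>
#include <net/tcp.h>

static struct sk_buff *split_paged_skb(struct sk_buff *skb, unsigned int len)
{
	/* the new skb only needs header room; its payload stays in pages */
	struct sk_buff *skb1 = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);

	if (!skb1)
		return NULL;
	skb_reserve(skb1, MAX_TCP_HEADER);

	/* everything beyond 'len' moves to skb1: for the paged area this
	 * is only frag/page-offset bookkeeping, no data copy */
	skb_split(skb, skb1, len);
	return skb1;
}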
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-22 22:36 ` David Miller [not found] ` <OFCF3EB7F8.9740C0C7-ON872572E3.007DADF6-882572E3.007E0E7B@us.ibm.com> @ 2007-05-22 23:12 ` Herbert Xu 1 sibling, 0 replies; 96+ messages in thread From: Herbert Xu @ 2007-05-22 23:12 UTC (permalink / raw) To: David Miller; +Cc: xma, ak, krkumar2, netdev, netdev-owner, rdreier On Tue, May 22, 2007 at 03:36:36PM -0700, David Miller wrote: > > > > Yep, for any NIC that supports SG but not TSO then software GSO will > > > be a big win. When the NIC doesn't support SG then the win is mostly > > > offset by the need to copy the packet again. ... > SKB's from TSO are composed of discontiguous page chunks represented > by the skb_shared_info() array. These can come either from sendmsg() > user data (which the kernel fills into a per-socket data page which is > reallocated once filled), or from sendfile() page cache pages. Yes sendmsg() is the case where it's almost even whether GSO is turned on or off because GSO does an extra copy which offsets the win in reduced per-packet cost. For sendfile() it's still a win though since we have to copy it anyway and doing it in GSO avoids the per-packet cost. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits @ 2007-05-11 7:14 Krishna Kumar2 0 siblings, 0 replies; 96+ messages in thread From: Krishna Kumar2 @ 2007-05-11 7:14 UTC (permalink / raw) To: David Stevens Cc: Evgeniy Polyakov, Krishna Kumar2, netdev, netdev-owner, Rick Jones (Mistaken didn't reply-all previous time) Hi Dave, David Stevens <dlstevens@us.ibm.com> wrote on 05/11/2007 02:57:56 AM: > The word "small" is coming up a lot in this discussion, and > I think packet size really has nothing to do with it. Multiple > streams generating packets of any size would benefit; the > key ingredient is a queue length greater than 1. Correct. Re-thinking about this - for larger packets the bandwidth may not improve as it reaches line-speed, but I guess CPU util could reduce (which could reduce in small packet case too). Using "small" was not covering this case. Thanks, - KK > I think the intent is to remove queue lock cycles by taking > the whole list (at least up to the count of free ring buffers) > when the queue is greater than one packet, thus effectively > removing the lock expense for n-1 packets. ^ permalink raw reply [flat|nested] 96+ messages in thread
* [RFC] New driver API to speed up small packets xmits
@ 2007-05-10 14:53 Krishna Kumar
2007-05-10 15:08 ` Evgeniy Polyakov
` (2 more replies)
0 siblings, 3 replies; 96+ messages in thread
From: Krishna Kumar @ 2007-05-10 14:53 UTC (permalink / raw)
To: netdev; +Cc: krkumar2, Krishna Kumar
Hi all,
While looking at common packet sizes on xmits, I found that most of
the packets are small. On my personal system, the packet statistics
after about 6 hours of use (browsing, mail, ftp'ing two Linux kernels
from www.kernel.org) are:
-----------------------------------------------------------
Packet Size      #packets (Total: 60720)      Percentage
-----------------------------------------------------------
     32                     0                      0
     64                  7716                     12.70
     80                 40193                     66.19
            sub-total:  47909                     78.90 %
     96                  2007                      3.30
    128                  1917                      3.15
            sub-total:   3924                      6.46 %
    256                  1822                      3.00
    384                   863                      1.42
    512                   459                      0.75
            sub-total:   3144                      5.18 %
    640                   763                      1.25
    768                  2329                      3.83
   1024                  1700                      2.79
   1500                   461                      0.75
            sub-total:   5253                      8.65 %
   2048                   312                      0.51
   4096                    53                      0.08
   8192                    84                      0.13
  16384                    41                      0.06
 32768+                     0                      0
            sub-total:    490                      0.81 %
-----------------------------------------------------------
Doing some measurements, I found that for small packets like 128 bytes,
the bandwidth is approximately 60% of the line speed. To possibly speed
up small packet xmits, a method of "linking" skbs is proposed, where
two pointers (skb_flink/blink) are added to the skb. It is expected (no
data yet) that drivers will get better results when a larger number of
"linked" skbs is sent to them in one shot, rather than sending each skb
independently (where each skb costs an extra call into the driver, and
the driver also needs to get/drop its lock, etc.). The method is to
send as many packets as possible from the qdisc (e.g. multiple packets
can accumulate while the driver is stopped or a trylock failed) if the
device supports the new API. The steps to enable the API for a driver
are:
- driver needs to set NETIF_F_LINKED_SKB before calling register_netdev
- register_netdev sets a new tx_link_skbs tunable parameter in
dev to 1, indicating that the driver supports linked skbs.
- driver implements the new API, hard_start_xmit_link, to
handle linked skbs, which is mostly a simple task. E.g.,
support for the e1000 driver can be added without
duplicating existing code:
e1000_xmit_frame_link(skb, ..., dev)
{
top:
	next_skb = skb->skb_flink;	/* save the link before this skb is consumed */

	(original e1000_xmit_frame() code here)

	skb = next_skb;
	if (skb)
		goto top;
	...
}

e1000_xmit_frame(skb, dev)
{
	return e1000_xmit_frame_link(skb, NULL, dev);
}
Drivers can take other approaches, e.g. take the lock once at the top
and handle all the packets in one shot, or get/drop the lock for each
skb; but those choices are internal to the driver. In any case, the
driver changes needed to support this (optional) API are minimal.
The main change is in the core/sched code. The qdisc links packets if
the device supports it and multiple skbs are present, and calls
dev_hard_start_xmit, which invokes one of the two APIs depending on
whether the passed skb is linked or not. A sysfs interface can set or
clear the device's tx_link_skbs parameter to select the old or the new
driver API.
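[Editorial illustration] For illustration only, a user-space toy of the dequeue-link-and-hand-off step described above (this is not the actual patch; pkt, fake_xmit_link and the queue helpers are invented names): everything that has accumulated in the queue is chained through a forward link and handed to the "driver" in a single call.

	#include <stdio.h>
	#include <stdlib.h>

	/* Toy packet: a forward link for the xmit chain plus FIFO linkage that
	 * stands in for Qdisc->q.  Every name here is invented. */
	struct pkt {
		int id;
		struct pkt *flink;              /* proposed skb_flink equivalent */
		struct pkt *qnext;              /* position in the "qdisc" FIFO  */
	};

	static struct pkt *q_head, *q_tail;

	static void q_enqueue(struct pkt *p)
	{
		p->qnext = NULL;
		if (q_tail)
			q_tail->qnext = p;
		else
			q_head = p;
		q_tail = p;
	}

	static struct pkt *q_dequeue(void)
	{
		struct pkt *p = q_head;
		if (p) {
			q_head = p->qnext;
			if (!q_head)
				q_tail = NULL;
		}
		return p;
	}

	/* Mock hard_start_xmit_link(): consumes the whole chain in one call. */
	static int fake_xmit_link(struct pkt *chain)
	{
		int n = 0;
		for (; chain; chain = chain->flink, n++)
			printf("xmit pkt %d\n", chain->id);
		return n;
	}

	int main(void)
	{
		for (int i = 1; i <= 5; i++) {
			struct pkt *p = calloc(1, sizeof(*p));
			p->id = i;
			q_enqueue(p);
		}

		/* link everything that has accumulated in the queue ... */
		struct pkt *chain = q_dequeue(), *last = chain;
		while (last && q_head) {
			last->flink = q_dequeue();
			last = last->flink;
		}

		/* ... and give the driver one call instead of five */
		printf("driver sent %d packets in one call\n", fake_xmit_link(chain));
		return 0;
	}

In the real patch the chain would be built under the queue lock; the point here is only the single driver call per batch.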
The motivation for this work was to speed up the IPoIB driver. As a
proof of concept, though, the E1000/AMSO drivers were tried first
(most of the code is generic) before implementing it for IPoIB. I do
not have test results at this time but I am working on them.
Please let me know whether this approach is acceptable, and send any
suggestions.
Thanks,
- KK
^ permalink raw reply [flat|nested] 96+ messages in thread* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 14:53 Krishna Kumar @ 2007-05-10 15:08 ` Evgeniy Polyakov 2007-05-10 15:22 ` Krishna Kumar2 2007-05-10 20:21 ` Roland Dreier 2007-05-11 9:05 ` Andi Kleen 2 siblings, 1 reply; 96+ messages in thread From: Evgeniy Polyakov @ 2007-05-10 15:08 UTC (permalink / raw) To: Krishna Kumar; +Cc: netdev On Thu, May 10, 2007 at 08:23:51PM +0530, Krishna Kumar (krkumar2@in.ibm.com) wrote: > The reason to implement the same was to speed up IPoIB driver. But > before doing that, a proof of concept for E1000/AMSO drivers was > considered (as most of the code is generic) before implementing for > IPoIB. I do not have test results at this time but I am working on it. > > Please let me know if this approach is acceptable, or any suggestions. Doesn't it looks exactly like GSO/TSO/LRO stuff implemented already? Btw, main CPU limiting factor here is syscall overhead (userspace protocol processing with 1500 MTU allows to reduce CPU usage and increase performance for 128 bytes packets sending/receiving total of 10 times). > Thanks, > > - KK -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 15:08 ` Evgeniy Polyakov @ 2007-05-10 15:22 ` Krishna Kumar2 2007-05-10 15:48 ` Evgeniy Polyakov 2007-05-10 17:19 ` Rick Jones 0 siblings, 2 replies; 96+ messages in thread From: Krishna Kumar2 @ 2007-05-10 15:22 UTC (permalink / raw) To: Evgeniy Polyakov; +Cc: netdev Hi Evgeniy, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote on 05/10/2007 08:38:33 PM: > On Thu, May 10, 2007 at 08:23:51PM +0530, Krishna Kumar (krkumar2@in.ibm.com) wrote: > > The reason to implement the same was to speed up IPoIB driver. But > > before doing that, a proof of concept for E1000/AMSO drivers was > > considered (as most of the code is generic) before implementing for > > IPoIB. I do not have test results at this time but I am working on it. > > > > Please let me know if this approach is acceptable, or any suggestions. > > Doesn't it looks exactly like GSO/TSO/LRO stuff implemented already? It is the reverse - GSO will segment one super-packet just before calling the driver so that the stack is traversed only once. In my case, I am trying to send out multiple skbs, possibly small packets, in one shot. GSO will not help for small packets. > Btw, main CPU limiting factor here is syscall overhead (userspace protocol > processing with 1500 MTU allows to reduce CPU usage and increase > performance for 128 bytes packets sending/receiving total of 10 times). I will test this also. But I was curious to see if without any changes to applications, I can get better performance by linking packets and sending it once to the driver. What is your opinion ? thanks, - KK > > > Thanks, > > > > - KK > > -- > Evgeniy Polyakov ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 15:22 ` Krishna Kumar2 @ 2007-05-10 15:48 ` Evgeniy Polyakov 2007-05-10 16:08 ` jamal 2007-05-10 17:19 ` Rick Jones 1 sibling, 1 reply; 96+ messages in thread From: Evgeniy Polyakov @ 2007-05-10 15:48 UTC (permalink / raw) To: Krishna Kumar2; +Cc: netdev On Thu, May 10, 2007 at 08:52:12PM +0530, Krishna Kumar2 (krkumar2@in.ibm.com) wrote: > > > The reason to implement the same was to speed up IPoIB driver. But > > > before doing that, a proof of concept for E1000/AMSO drivers was > > > considered (as most of the code is generic) before implementing for > > > IPoIB. I do not have test results at this time but I am working on it. > > > > > > Please let me know if this approach is acceptable, or any suggestions. > > > > Doesn't it looks exactly like GSO/TSO/LRO stuff implemented already? > > It is the reverse - GSO will segment one super-packet just before calling > the driver so that the stack is traversed only once. In my case, I am > trying to send out multiple skbs, possibly small packets, in one shot. > GSO will not help for small packets. And you can create that packet yourself, if it not enough to combine into MSS sized chunk. > > Btw, main CPU limiting factor here is syscall overhead (userspace > protocol > > processing with 1500 MTU allows to reduce CPU usage and increase > > performance for 128 bytes packets sending/receiving total of 10 times). > > I will test this also. But I was curious to see if without any changes to > applications, I can get better performance by linking packets and sending > it once to the driver. Without changes it will unlikely to be visible. > What is your opinion ? IMHO if you do not see in profile anything related to driver's xmit function, it does not require to be fixed. But as starting point, why just not increase skb allocation size (say twice) in tcp_sendmsg() and see results? It of course can endup with worse behaviour (if mtu is small), but it is noticebly simpler to implement and test first. > thanks, > > - KK > > > > > > Thanks, > > > > > > - KK > > > > -- > > Evgeniy Polyakov -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 15:48 ` Evgeniy Polyakov @ 2007-05-10 16:08 ` jamal 0 siblings, 0 replies; 96+ messages in thread From: jamal @ 2007-05-10 16:08 UTC (permalink / raw) To: Evgeniy Polyakov; +Cc: Krishna Kumar2, netdev On Thu, 2007-10-05 at 19:48 +0400, Evgeniy Polyakov wrote: > IMHO if you do not see in profile anything related to driver's xmit > function, it does not require to be fixed. True, but i think there may be value in amortizing the cost towards the driver. i.e If you grab a lock and send X packets vs a single packet, depending on the cost of the lock you may observe a reduction in CPU. I have tried similar experiments in the past and observed zero improvements in sane machines, but huge improvements on insane machines. Slides on my work here (just jump to about slide 93): http://vger.kernel.org/jamal_netconf2006.sxi One thought i had at the time was perhaps the e1000 DMA setup/teardown was very inefficient. As the slides indicate i was trying to get throughput improvements but failed to seen any on sane machines. A suggestion made to me was to look at CPU utilization as a metric. Unfortunately i havent had any chance yet. cheers, jamal ^ permalink raw reply [flat|nested] 96+ messages in thread
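[Editorial illustration] Jamal's amortization point can be approximated in user space. The sketch below times one uncontended lock acquisition per packet against one per batch of 16; the absolute numbers mean little, but the ratio shows the cost that batching removes. Everything here is an illustration, not driver code.

	#include <stdatomic.h>
	#include <stdio.h>
	#include <time.h>

	#define NPKTS  5000000L
	#define BATCH  16

	static atomic_flag txlock = ATOMIC_FLAG_INIT;
	static volatile long ring_slots;

	static void lock(void)   { while (atomic_flag_test_and_set(&txlock)) ; }
	static void unlock(void) { atomic_flag_clear(&txlock); }

	int main(void)
	{
		clock_t t0 = clock();
		for (long i = 0; i < NPKTS; i++) {
			lock();                 /* one lock per packet */
			ring_slots--;
			unlock();
		}
		clock_t t1 = clock();
		for (long i = 0; i < NPKTS; i += BATCH) {
			lock();                 /* one lock per BATCH packets */
			for (int j = 0; j < BATCH; j++)
				ring_slots--;
			unlock();
		}
		clock_t t2 = clock();

		printf("per-packet locking: %.0f ms, per-batch locking: %.0f ms\n",
		       (t1 - t0) * 1000.0 / CLOCKS_PER_SEC,
		       (t2 - t1) * 1000.0 / CLOCKS_PER_SEC);
		return 0;
	}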
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 15:22 ` Krishna Kumar2 2007-05-10 15:48 ` Evgeniy Polyakov @ 2007-05-10 17:19 ` Rick Jones 2007-05-10 18:07 ` Sridhar Samudrala ` (2 more replies) 1 sibling, 3 replies; 96+ messages in thread From: Rick Jones @ 2007-05-10 17:19 UTC (permalink / raw) To: Krishna Kumar2; +Cc: Evgeniy Polyakov, netdev > It is the reverse - GSO will segment one super-packet just before calling > the driver so that the stack is traversed only once. In my case, I am > trying to send out multiple skbs, possibly small packets, in one shot. > GSO will not help for small packets. If there are small packets that implies small sends, which suggests that they would be coalesced either implicitly by the Nagle algorithm or explicitly with TCP_CORK no? rick jones ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 17:19 ` Rick Jones @ 2007-05-10 18:07 ` Sridhar Samudrala 2007-05-10 19:43 ` Gagan Arneja 2007-05-10 18:13 ` Vlad Yasevich 2007-05-10 21:27 ` David Stevens 2 siblings, 1 reply; 96+ messages in thread From: Sridhar Samudrala @ 2007-05-10 18:07 UTC (permalink / raw) To: Rick Jones; +Cc: Krishna Kumar2, Evgeniy Polyakov, netdev On Thu, 2007-05-10 at 10:19 -0700, Rick Jones wrote: > > It is the reverse - GSO will segment one super-packet just before calling > > the driver so that the stack is traversed only once. In my case, I am > > trying to send out multiple skbs, possibly small packets, in one shot. > > GSO will not help for small packets. > > If there are small packets that implies small sends, which suggests that they > would be coalesced either implicitly by the Nagle algorithm or explicitly with > TCP_CORK no? small packets belonging to the same connection could be coalesced by TCP, but this may help the case where multiple parallel connections are sending small packets. Thanks Sridhar ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 18:07 ` Sridhar Samudrala @ 2007-05-10 19:43 ` Gagan Arneja 2007-05-10 20:11 ` jamal ` (2 more replies) 0 siblings, 3 replies; 96+ messages in thread From: Gagan Arneja @ 2007-05-10 19:43 UTC (permalink / raw) To: Sridhar Samudrala; +Cc: Rick Jones, Krishna Kumar2, Evgeniy Polyakov, netdev > small packets belonging to the same connection could be coalesced by > TCP, but this may help the case where multiple parallel connections are > sending small packets. It's not just small packets. The cost of calling hard_start_xmit/byte was rather high on your particular device. I've seen PCI read transaction in hard_start_xmit taking ~10,000 cycles on one particular device. Count the cycles your brand of NIC is taking in it's xmit_routine. The worse it is, the stronger your case for cluster transmits. Also, I think, you don't have to chain skbs, they're already chained in Qdisc->q. All you have to do is take the whole q and try to shove it at the device hoping for better results. But then, if you have rather big backlog, you run the risk of reordering packets if you have to requeue. > > Thanks > Sridhar -- Gagan > > - > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 19:43 ` Gagan Arneja @ 2007-05-10 20:11 ` jamal 2007-05-10 20:14 ` Rick Jones ` (2 more replies) 2007-05-10 20:37 ` David Miller 2007-05-11 5:20 ` Krishna Kumar2 2 siblings, 3 replies; 96+ messages in thread From: jamal @ 2007-05-10 20:11 UTC (permalink / raw) To: Gagan Arneja Cc: Sridhar Samudrala, Rick Jones, Krishna Kumar2, Evgeniy Polyakov, netdev The discussion seems to have steered into protocol coalescing. My tests for example were related to forwarding and not specific to any protocol. On Thu, 2007-10-05 at 12:43 -0700, Gagan Arneja wrote: > It's not just small packets. The cost of calling hard_start_xmit/byte > was rather high on your particular device. I've seen PCI read > transaction in hard_start_xmit taking ~10,000 cycles on one particular > device. Count the cycles your brand of NIC is taking in it's > xmit_routine. The worse it is, the stronger your case for cluster > transmits. You would need to almost re-write the driver to make sure it does IO which is taking advantage of the batching. > Also, I think, you don't have to chain skbs, they're already chained in > Qdisc->q. > All you have to do is take the whole q and try to shove it > at the device hoping for better results. You are onto something with desire to use the qdisc; however, note you need to leave the qdisc above so other CPUs can continue enqueueing. In my approach i used the qdisc to populate a list that is attached to the dev. > But then, if you have rather > big backlog, you run the risk of reordering packets if you have to requeue. You could avoid totaly requeueing by asking the driver how much space it has. Once you shove down to it a number of packets, you are then guaranteed they will never be requeued (refer to the slides i pointed to earlier). cheers, jamal ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 20:11 ` jamal @ 2007-05-10 20:14 ` Rick Jones 2007-05-10 20:15 ` jamal 2007-05-10 20:15 ` Gagan Arneja 2007-05-11 5:22 ` Krishna Kumar2 2 siblings, 1 reply; 96+ messages in thread From: Rick Jones @ 2007-05-10 20:14 UTC (permalink / raw) To: hadi Cc: Gagan Arneja, Sridhar Samudrala, Krishna Kumar2, Evgeniy Polyakov, netdev jamal wrote: > The discussion seems to have steered into protocol coalescing. > My tests for example were related to forwarding and not specific > to any protocol. Just the natural tendency of end-system types to think of end-system things rather than router things. rick jones ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 20:14 ` Rick Jones @ 2007-05-10 20:15 ` jamal 0 siblings, 0 replies; 96+ messages in thread From: jamal @ 2007-05-10 20:15 UTC (permalink / raw) To: Rick Jones Cc: Gagan Arneja, Sridhar Samudrala, Krishna Kumar2, Evgeniy Polyakov, netdev On Thu, 2007-10-05 at 13:14 -0700, Rick Jones wrote: > Just the natural tendency of end-system types to think of end-system things > rather than router things. Well router types felt they were being left out ;-> cheers, jamal ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 20:11 ` jamal 2007-05-10 20:14 ` Rick Jones @ 2007-05-10 20:15 ` Gagan Arneja 2007-05-10 20:21 ` jamal 2007-05-11 5:22 ` Krishna Kumar2 2 siblings, 1 reply; 96+ messages in thread From: Gagan Arneja @ 2007-05-10 20:15 UTC (permalink / raw) To: hadi Cc: Gagan Arneja, Sridhar Samudrala, Rick Jones, Krishna Kumar2, Evgeniy Polyakov, netdev jamal wrote: > You would need to almost re-write the driver to make sure it does IO > which is taking advantage of the batching. Really! It's just the transmit routine. How radical can that be? -- Gagan ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 20:15 ` Gagan Arneja @ 2007-05-10 20:21 ` jamal 2007-05-10 20:25 ` Gagan Arneja 0 siblings, 1 reply; 96+ messages in thread From: jamal @ 2007-05-10 20:21 UTC (permalink / raw) To: Gagan Arneja Cc: Sridhar Samudrala, Rick Jones, Krishna Kumar2, Evgeniy Polyakov, netdev On Thu, 2007-10-05 at 13:15 -0700, Gagan Arneja wrote: > Really! It's just the transmit routine. How radical can that be? > Ok, you have a point there, but it could be challenging with many tunables: For example: my biggest challenge with the e1000 was just hacking up the DMA setup path - i seem to get better numbers if i dont kick the DMA until i stash all the packets on the ring first etc. It seemed counter-intuitive. Now if you can ammortize PCI operations as well (the 10K cycles you pointed out), i think thats where you get the best bang for the buck. cheers, jamal ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 20:21 ` jamal @ 2007-05-10 20:25 ` Gagan Arneja 0 siblings, 0 replies; 96+ messages in thread From: Gagan Arneja @ 2007-05-10 20:25 UTC (permalink / raw) To: hadi Cc: Gagan Arneja, Sridhar Samudrala, Rick Jones, Krishna Kumar2, Evgeniy Polyakov, netdev > For example: > my biggest challenge with the e1000 was just hacking up the DMA setup > path - i seem to get better numbers if i dont kick the DMA until i stash > all the packets on the ring first etc. It seemed counter-intuitive. That seems to make sense. The rings are(?) in system memory and you can write to them faster. Kicking the DMA needs a transaction on the PCI bus, which can be painfully slow... -- Gagan ^ permalink raw reply [flat|nested] 96+ messages in thread
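[Editorial illustration] The doorbell point can be shown with a toy ring where descriptor writes are cheap host-memory stores and the tail/doorbell write carries an artificial MMIO cost. All names and costs below are invented; the point is only that one kick per batch amortizes the expensive PCI transaction.

	#include <stdio.h>

	#define RING_SIZE 256
	#define MMIO_COST 1000          /* pretend cycles per doorbell write   */
	#define DESC_COST 10            /* pretend cycles per descriptor write */

	struct desc { unsigned long addr; int len; };

	static struct desc ring[RING_SIZE];
	static int tail;
	static long cycles;

	static void post_descriptor(unsigned long addr, int len)
	{
		ring[tail % RING_SIZE] = (struct desc){ addr, len };
		tail++;
		cycles += DESC_COST;    /* lands in host memory: cheap */
	}

	static void kick_doorbell(void)
	{
		/* in a real driver this would be an MMIO write of the tail index */
		cycles += MMIO_COST;
	}

	int main(void)
	{
		int batch = 32;

		/* one doorbell per packet */
		cycles = 0;
		for (int i = 0; i < batch; i++) {
			post_descriptor(0x1000 + i, 128);
			kick_doorbell();
		}
		printf("per-packet kick: %ld pretend cycles\n", cycles);

		/* stash all descriptors first, kick once at the end */
		cycles = 0;
		for (int i = 0; i < batch; i++)
			post_descriptor(0x1000 + i, 128);
		kick_doorbell();
		printf("single kick:     %ld pretend cycles\n", cycles);
		return 0;
	}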
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 20:11 ` jamal 2007-05-10 20:14 ` Rick Jones 2007-05-10 20:15 ` Gagan Arneja @ 2007-05-11 5:22 ` Krishna Kumar2 2007-05-11 11:27 ` jamal 2 siblings, 1 reply; 96+ messages in thread From: Krishna Kumar2 @ 2007-05-11 5:22 UTC (permalink / raw) To: hadi; +Cc: Gagan Arneja, Evgeniy Polyakov, netdev, Rick Jones, Sridhar Samudrala J Hadi Salim <j.hadi123@gmail.com> wrote on 05/11/2007 01:41:27 AM: > > It's not just small packets. The cost of calling hard_start_xmit/byte > > was rather high on your particular device. I've seen PCI read > > transaction in hard_start_xmit taking ~10,000 cycles on one particular > > device. Count the cycles your brand of NIC is taking in it's > > xmit_routine. The worse it is, the stronger your case for cluster > > transmits. > > You would need to almost re-write the driver to make sure it does IO > which is taking advantage of the batching. I didn't try to optimize the driver to take any real advantage, I coded it as simply as : top: next = skb->skb_flink; Original driver code here, or another option is to remove the locking and put it before the "top:" above, no other changes. skb = next; if (skb) goto top; This way, the diffs are minimal. However, driver experts might know of some tricks to optimize this, since most drivers have a hundred checks before sending one skb, and whether some optimization to avoid that for each skb is possible. > You could avoid totaly requeueing by asking the driver how much space it > has. Once you shove down to it a number of packets, you are then > guaranteed they will never be requeued (refer to the slides i pointed to > earlier). This is a good idea, I will look at your slides to see what this exactly is. However, even with this check, drivers will return error, eg trylock failed or, like the e1000 has - fifo_workaround failures. But I still think it will significantly reduce the failures. Right now, I get a lot of requeue's. thanks, - KK ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-11 5:22 ` Krishna Kumar2 @ 2007-05-11 11:27 ` jamal 0 siblings, 0 replies; 96+ messages in thread From: jamal @ 2007-05-11 11:27 UTC (permalink / raw) To: Krishna Kumar2 Cc: Gagan Arneja, Evgeniy Polyakov, netdev, Rick Jones, Sridhar Samudrala On Fri, 2007-11-05 at 10:52 +0530, Krishna Kumar2 wrote: > I didn't try to optimize the driver to take any real advantage, I coded it > as simply as : > > top: > next = skb->skb_flink; > Original driver code here, or another option is to remove the locking > and put it before the "top:" above, no other changes. > skb = next; > if (skb) > goto top; > This way, the diffs are minimal. However, driver experts might know of some > tricks to optimize this, since most drivers have a hundred checks before > sending one skb, and whether some optimization to avoid that for each skb > is possible. > You will need to at some point, no big deal in order to get things in place first- the simple example of DMA i talked about yesterday is a simple example. i.e you avoid kicking the DMA until you have queued all in the ring (as Gagan pointed out, kicking the DMA is a PCI IO write) I am sure there are gazillion others. > This is a good idea, I will look at your slides to see what this exactly > is. I will try to post my patches updated to the latest net-2.6 kernel before you come back on monday. > However, even with this check, drivers will return error, eg trylock failed > or, like the e1000 has - fifo_workaround failures. But I still think it > will significantly reduce the failures. Yes, this is what i refered to as the "many tunables" and led to the claim that you may have to re-write the whole tx path in my posting. > Right now, I get a lot of requeue's. requeues are a good sign ;-> cheers, jamal ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 19:43 ` Gagan Arneja 2007-05-10 20:11 ` jamal @ 2007-05-10 20:37 ` David Miller 2007-05-10 20:40 ` Gagan Arneja 2007-05-11 5:21 ` Krishna Kumar2 2007-05-11 5:20 ` Krishna Kumar2 2 siblings, 2 replies; 96+ messages in thread From: David Miller @ 2007-05-10 20:37 UTC (permalink / raw) To: gaagaan; +Cc: sri, rick.jones2, krkumar2, johnpol, netdev From: Gagan Arneja <gaagaan@gmail.com> Date: Thu, 10 May 2007 12:43:53 -0700 > It's not just small packets. The cost of calling hard_start_xmit/byte > was rather high on your particular device. I've seen PCI read > transaction in hard_start_xmit taking ~10,000 cycles on one particular > device. Count the cycles your brand of NIC is taking in it's > xmit_routine. The worse it is, the stronger your case for cluster > transmits. > > Also, I think, you don't have to chain skbs, they're already chained in > Qdisc->q. All you have to do is take the whole q and try to shove it > at the device hoping for better results. But then, if you have rather > big backlog, you run the risk of reordering packets if you have to requeue. If the qdisc is packed with packets and we would just loop sending them to the device, yes it might make sense. But if that isn't the case, which frankly is the usual case, you add a non-trivial amount of latency by batching and that's bad exactly for the kind of applications that send small frames. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 20:37 ` David Miller @ 2007-05-10 20:40 ` Gagan Arneja 2007-05-10 20:57 ` David Miller 2007-05-11 5:21 ` Krishna Kumar2 1 sibling, 1 reply; 96+ messages in thread From: Gagan Arneja @ 2007-05-10 20:40 UTC (permalink / raw) To: David Miller; +Cc: sri, rick.jones2, krkumar2, johnpol, netdev David Miller wrote: > > If the qdisc is packed with packets and we would just loop sending > them to the device, yes it might make sense. > > But if that isn't the case, which frankly is the usual case, you add a > non-trivial amount of latency by batching and that's bad exactly for > the kind of applications that send small frames. > I don't understand how transmitting already batched up packets in one go introduce latency. -- Gagan ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 20:40 ` Gagan Arneja @ 2007-05-10 20:57 ` David Miller 2007-05-11 6:07 ` Krishna Kumar2 0 siblings, 1 reply; 96+ messages in thread From: David Miller @ 2007-05-10 20:57 UTC (permalink / raw) To: gaagaan; +Cc: sri, rick.jones2, krkumar2, johnpol, netdev From: Gagan Arneja <gaagaan@gmail.com> Date: Thu, 10 May 2007 13:40:22 -0700 > David Miller wrote: > > > > If the qdisc is packed with packets and we would just loop sending > > them to the device, yes it might make sense. > > > > But if that isn't the case, which frankly is the usual case, you add a > > non-trivial amount of latency by batching and that's bad exactly for > > the kind of applications that send small frames. > > > > I don't understand how transmitting already batched up packets in one go > introduce latency. Keep thinking :-) The only case where these ideas can be seriously considered is during netif_wake_queue(). In all other cases we go straight to the device from sendmsg(), since there is space in the TX ring, and we should not try to batch as that will add latency. So maybe find a solution in that area, some new driver interface that can suck N packets out of a qdisc when netif_wake_queue() occurs. If you look at the contexts in which netif_wake_queue() is called, it's perfect, it already has the TX state of the driver locked, it has the TX descriptor head/tail pointers handy, etc. and it's already taken all the cache misses needed in order to work with this state. ^ permalink raw reply [flat|nested] 96+ messages in thread
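[Editorial illustration] A toy model of that idea, with entirely invented names, is sketched below: the TX-completion path itself pulls as many backlogged packets as it has free ring slots for, instead of waking the stack and taking one packet per call.

	#include <stdio.h>

	#define RING_SIZE 8

	static int free_slots = RING_SIZE;

	static int backlog[64];                 /* stands in for the qdisc queue */
	static int backlog_head, backlog_tail;

	static int backlog_pop(void)
	{
		if (backlog_head == backlog_tail)
			return -1;
		return backlog[backlog_head++];
	}

	static void post_to_ring(int pkt)
	{
		free_slots--;
		printf("posted pkt %d (%d slots left)\n", pkt, free_slots);
	}

	/* Called from the (mock) TX-completion interrupt. */
	static void tx_complete(int reclaimed)
	{
		free_slots += reclaimed;
		/* the "suck N packets out of the qdisc" step */
		while (free_slots > 0) {
			int pkt = backlog_pop();
			if (pkt < 0)
				break;
			post_to_ring(pkt);
		}
	}

	int main(void)
	{
		for (int i = 1; i <= 12; i++)
			backlog[backlog_tail++] = i;

		free_slots = 0;                 /* ring is currently full       */
		tx_complete(4);                 /* 4 descriptors were just freed */
		tx_complete(8);
		printf("%d packets still backlogged\n", backlog_tail - backlog_head);
		return 0;
	}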
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 20:57 ` David Miller @ 2007-05-11 6:07 ` Krishna Kumar2 0 siblings, 0 replies; 96+ messages in thread From: Krishna Kumar2 @ 2007-05-11 6:07 UTC (permalink / raw) To: David Miller; +Cc: gaagaan, johnpol, netdev, rick.jones2, sri Hi Dave, David Miller <davem@davemloft.net> wrote on 05/11/2007 02:27:07 AM: > > I don't understand how transmitting already batched up packets in one go > > introduce latency. > > Keep thinking :-) > > The only case where these ideas can be seriously considered is during > netif_wake_queue(). In all other cases we go straight to the device > from sendmsg(), since there is space in the TX ring, and we should > not try to batch as that will add latency. I am not very clear about this, you are saying latency will increase due to old packets not being cache-warm ? But the existing code will do exactly the same thing as what I am proposing, except it will iterate over all the packets one by one. So will batching them make latency worse ? In the sendmsg case you wrote above (is it tcp_transmit_skb), it will still come via IP (through dst_output) ? I would like to try your idea if you are sure it will reduce latency. Thanks, - KK > So maybe find a solution in that area, some new driver interface that > can suck N packets out of a qdisc when netif_wake_queue() occurs. > > If you look at the contexts in which netif_wake_queue() is called, > it's perfect, it already has the TX state of the driver locked, it has > the TX descriptor head/tail pointers handy, etc. and it's already > taken all the cache misses needed in order to work with this state. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 20:37 ` David Miller 2007-05-10 20:40 ` Gagan Arneja @ 2007-05-11 5:21 ` Krishna Kumar2 1 sibling, 0 replies; 96+ messages in thread From: Krishna Kumar2 @ 2007-05-11 5:21 UTC (permalink / raw) To: David Miller; +Cc: gaagaan, johnpol, netdev, rick.jones2, sri David Miller <davem@davemloft.net> wrote on 05/11/2007 02:07:10 AM: > From: Gagan Arneja <gaagaan@gmail.com> > Date: Thu, 10 May 2007 12:43:53 -0700 > > > > Also, I think, you don't have to chain skbs, they're already chained in > > Qdisc->q. All you have to do is take the whole q and try to shove it > > at the device hoping for better results. But then, if you have rather > > big backlog, you run the risk of reordering packets if you have to requeue. > > If the qdisc is packed with packets and we would just loop sending > them to the device, yes it might make sense. > > But if that isn't the case, which frankly is the usual case, you add a > non-trivial amount of latency by batching and that's bad exactly for > the kind of applications that send small frames. Without batching, the latency should be the same since qdisc_run would send each in a tight loop ? I will also run TCP/UDP RR tests. thanks, - KK ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 19:43 ` Gagan Arneja 2007-05-10 20:11 ` jamal 2007-05-10 20:37 ` David Miller @ 2007-05-11 5:20 ` Krishna Kumar2 2007-05-11 5:35 ` Gagan Arneja 2 siblings, 1 reply; 96+ messages in thread From: Krishna Kumar2 @ 2007-05-11 5:20 UTC (permalink / raw) To: Gagan Arneja; +Cc: Evgeniy Polyakov, netdev, Rick Jones, Sridhar Samudrala Gagan Arneja <gaagaan@gmail.com> wrote on 05/11/2007 01:13:53 AM: > Also, I think, you don't have to chain skbs, they're already chained in > Qdisc->q. All you have to do is take the whole q and try to shove it > at the device hoping for better results. But then, if you have rather > big backlog, you run the risk of reordering packets if you have to requeue. I haven't seen reordering packets (I did once when I was having a bug in the requeue code, some TCP messages on receiver indicating packets out of order). When a send fails, the packet are requeued in reverse (go to end of the failed skb and traverse back to the failed skb and requeue each skb). Since new inserts go to the end, the queue is guaranteed to be in order. Thanks, - KK ^ permalink raw reply [flat|nested] 96+ messages in thread
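[Editorial illustration] The ordering argument is easy to check in user space. In the toy below, the unsent tail of a batch (packets 4..6) is pushed back onto the head of the queue in reverse order while newer packets (7, 8) already sit at the tail; the queue comes out as 4 5 6 7 8, i.e. in the original order. Names are invented for this sketch.

	#include <stdio.h>

	#define MAXQ 32

	static int q[MAXQ];
	static int qlen;

	static void push_head(int pkt)
	{
		for (int i = qlen; i > 0; i--)
			q[i] = q[i - 1];
		q[0] = pkt;
		qlen++;
	}

	static void push_tail(int pkt)
	{
		q[qlen++] = pkt;
	}

	int main(void)
	{
		int failed[] = { 4, 5, 6 };     /* packets 4..6 could not be sent */
		int nfailed = 3;

		push_tail(7);                   /* meanwhile, other CPUs enqueued */
		push_tail(8);                   /* 7 and 8 at the tail            */

		for (int i = nfailed - 1; i >= 0; i--)
			push_head(failed[i]);   /* requeue in reverse, at the head */

		for (int i = 0; i < qlen; i++)
			printf("%d ", q[i]);
		printf("\n");                   /* prints: 4 5 6 7 8 */
		return 0;
	}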
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-11 5:20 ` Krishna Kumar2 @ 2007-05-11 5:35 ` Gagan Arneja 2007-05-11 5:43 ` Krishna Kumar2 0 siblings, 1 reply; 96+ messages in thread From: Gagan Arneja @ 2007-05-11 5:35 UTC (permalink / raw) To: Krishna Kumar2; +Cc: Evgeniy Polyakov, netdev, Rick Jones, Sridhar Samudrala Krishna Kumar2 wrote: > I haven't seen reordering packets (I did once when I was having a bug in > the requeue code, some TCP messages on receiver indicating packets out of > order). When a send fails, the packet are requeued in reverse (go to end of > the failed skb and traverse back to the failed skb and requeue each skb). > Since new inserts go to the end, the queue is guaranteed to be in order. queue_lock is dropped when you're in xmit. There's no guarantee packets won't be queued up while you're trying a transmit. > > Thanks, > > - KK > ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-11 5:35 ` Gagan Arneja @ 2007-05-11 5:43 ` Krishna Kumar2 2007-05-11 5:57 ` Gagan Arneja 0 siblings, 1 reply; 96+ messages in thread From: Krishna Kumar2 @ 2007-05-11 5:43 UTC (permalink / raw) To: Gagan Arneja; +Cc: Evgeniy Polyakov, netdev, Rick Jones, Sridhar Samudrala Gagan Arneja <gaagaan@gmail.com> wrote on 05/11/2007 11:05:47 AM: > Krishna Kumar2 wrote: > > I haven't seen reordering packets (I did once when I was having a bug in > > the requeue code, some TCP messages on receiver indicating packets out of > > order). When a send fails, the packet are requeued in reverse (go to end of > > the failed skb and traverse back to the failed skb and requeue each skb). > > Since new inserts go to the end, the queue is guaranteed to be in order. > > queue_lock is dropped when you're in xmit. There's no guarantee packets > won't be queued up while you're trying a transmit. Right, but I am the sole dequeue'r, and on failure, I requeue those packets to the beginning of the queue (just as it would happen in the regular case of one packet xmit/failure/requeue). - KK ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-11 5:43 ` Krishna Kumar2 @ 2007-05-11 5:57 ` Gagan Arneja 2007-05-11 6:06 ` Krishna Kumar2 0 siblings, 1 reply; 96+ messages in thread From: Gagan Arneja @ 2007-05-11 5:57 UTC (permalink / raw) To: Krishna Kumar2; +Cc: Evgeniy Polyakov, netdev, Rick Jones, Sridhar Samudrala > > Right, but I am the sole dequeue'r, and on failure, I requeue those packets > to > the beginning of the queue (just as it would happen in the regular case of > one > packet xmit/failure/requeue). What about a race between trying to reacquire queue_lock and another failed transmit? -- Gagan > > - KK > ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-11 5:57 ` Gagan Arneja @ 2007-05-11 6:06 ` Krishna Kumar2 2007-05-11 6:29 ` Gagan Arneja 0 siblings, 1 reply; 96+ messages in thread From: Krishna Kumar2 @ 2007-05-11 6:06 UTC (permalink / raw) To: Gagan Arneja; +Cc: Evgeniy Polyakov, netdev, Rick Jones, Sridhar Samudrala Hi Gagan, Gagan Arneja <gaagaan@gmail.com> wrote on 05/11/2007 11:27:54 AM: > > > > Right, but I am the sole dequeue'r, and on failure, I requeue those packets > > to > > the beginning of the queue (just as it would happen in the regular case of > > one > > packet xmit/failure/requeue). > > What about a race between trying to reacquire queue_lock and another > failed transmit? That is not possible too. I hold the QDISC_RUNNING bit in dev->state and am the only sender for this device, so there is no other failed transmit. Also, on failure of dev_hard_start_xmit, qdisc_restart will spin for the queue lock (which could be held quickly by enqueuer's), and q->requeue() skbs to the head of the list. Thanks, - KK ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-11 6:06 ` Krishna Kumar2 @ 2007-05-11 6:29 ` Gagan Arneja 2007-05-11 6:52 ` Krishna Kumar2 0 siblings, 1 reply; 96+ messages in thread From: Gagan Arneja @ 2007-05-11 6:29 UTC (permalink / raw) To: Krishna Kumar2; +Cc: Evgeniy Polyakov, netdev, Rick Jones, Sridhar Samudrala Krishna Kumar2 wrote: >> What about a race between trying to reacquire queue_lock and another >> failed transmit? > > That is not possible too. I hold the QDISC_RUNNING bit in dev->state and > am the only sender for this device, so there is no other failed transmit. > Also, on failure of dev_hard_start_xmit, qdisc_restart will spin for the > queue lock (which could be held quickly by enqueuer's), and q->requeue() > skbs to the head of the list. I have to claim incomplete familiarity for the code. But still, if you're out there running with no locks for a period, there's no assumption you can make. The "lock could be held quickly" assertion is a fallacy. > > Thanks, > > - KK > ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-11 6:29 ` Gagan Arneja @ 2007-05-11 6:52 ` Krishna Kumar2 0 siblings, 0 replies; 96+ messages in thread From: Krishna Kumar2 @ 2007-05-11 6:52 UTC (permalink / raw) To: Gagan Arneja Cc: Evgeniy Polyakov, netdev, netdev-owner, Rick Jones, Sridhar Samudrala Hi Gagan, > I have to claim incomplete familiarity for the code. But still, if > you're out there running with no locks for a period, there's no > assumption you can make. The "lock could be held quickly" assertion is a > fallacy. I will try to explain since the code is pretty complicated. Packets coming to IP (in dev_queue_xmit) are first added to a FIFO queue associated with the device (queue_lock is first held). Then qdisc_run is called and if it can set the QDISC_RUNNING bit (making it the "sole dequeue'r"), it calls qdisc_restart (which *could* get the tx lock) and drops the queue_lock to call the actual driver xmit routine. Meanwhile 'n' cpus can race and add more skbs to the same queue, each getting this lock for a short time and dropping, and all these skbs are added to the end of the queue. These are "pure enqueuer's" (since their attempt to set the same bit fails and they return from qdisc_run). In the original thread, the dequeue'r gets an error from driver xmit. It first re-gets the queue_lock and calls requeue, which adds the skb to the head of the queue. Since no one else has been able to dequeue packets while the RUNNING bit is set, packet re-ordering cannot happen. After returning to __qdisc_run, the RUNNING bit is cleared. Hope that clarifies the code. thanks, - KK ^ permalink raw reply [flat|nested] 96+ messages in thread
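[Editorial illustration] The "sole dequeuer" protocol KK describes can be sketched with a single atomic flag standing in for the QDISC_RUNNING bit: every sender may enqueue, but only the caller that wins the flag drains the queue, so concurrent dequeues (and hence reordering at dequeue time) cannot happen. This is a user-space illustration, not the kernel code.

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>

	static atomic_flag qdisc_running = ATOMIC_FLAG_INIT;

	/* Returns true if this caller became the dequeuer. */
	static bool qdisc_run(void)
	{
		if (atomic_flag_test_and_set(&qdisc_running))
			return false;           /* someone else is already draining */

		/* ... the dequeue/transmit/requeue loop would run here, with the
		 * guarantee that no other CPU is dequeueing concurrently ... */

		atomic_flag_clear(&qdisc_running);
		return true;
	}

	int main(void)
	{
		printf("caller A drains: %s\n", qdisc_run() ? "yes" : "no");

		/* simulate caller B arriving while another dequeuer holds the bit */
		atomic_flag_test_and_set(&qdisc_running);
		printf("caller B drains: %s\n", qdisc_run() ? "yes" : "no");
		atomic_flag_clear(&qdisc_running);
		return 0;
	}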
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 17:19 ` Rick Jones 2007-05-10 18:07 ` Sridhar Samudrala @ 2007-05-10 18:13 ` Vlad Yasevich 2007-05-10 18:20 ` Rick Jones 2007-05-10 21:27 ` David Stevens 2 siblings, 1 reply; 96+ messages in thread From: Vlad Yasevich @ 2007-05-10 18:13 UTC (permalink / raw) To: Rick Jones; +Cc: Krishna Kumar2, Evgeniy Polyakov, netdev Rick Jones wrote: >> It is the reverse - GSO will segment one super-packet just before calling >> the driver so that the stack is traversed only once. In my case, I am >> trying to send out multiple skbs, possibly small packets, in one shot. >> GSO will not help for small packets. > > If there are small packets that implies small sends, which suggests that > they would be coalesced either implicitly by the Nagle algorithm or > explicitly with TCP_CORK no? > > rick jones > - May be for TCP? What about other protocols? -vlad ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 18:13 ` Vlad Yasevich @ 2007-05-10 18:20 ` Rick Jones 2007-05-10 18:32 ` Vlad Yasevich 0 siblings, 1 reply; 96+ messages in thread From: Rick Jones @ 2007-05-10 18:20 UTC (permalink / raw) To: Vlad Yasevich; +Cc: Krishna Kumar2, Evgeniy Polyakov, netdev Vlad Yasevich wrote: > Rick Jones wrote: > >>>It is the reverse - GSO will segment one super-packet just before calling >>>the driver so that the stack is traversed only once. In my case, I am >>>trying to send out multiple skbs, possibly small packets, in one shot. >>>GSO will not help for small packets. >> >>If there are small packets that implies small sends, which suggests that >>they would be coalesced either implicitly by the Nagle algorithm or >>explicitly with TCP_CORK no? >> >>rick jones >>- > > > May be for TCP? What about other protocols? There are other protocols?-) True, UDP, and I suppose certain modes of SCTP might be sending streams of small packets, as might TCP with TCP_NODELAY set. Do they often queue-up outside the driver? rick jones ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 18:20 ` Rick Jones @ 2007-05-10 18:32 ` Vlad Yasevich 2007-05-10 18:40 ` Rick Jones 2007-05-10 18:59 ` Ian McDonald 0 siblings, 2 replies; 96+ messages in thread From: Vlad Yasevich @ 2007-05-10 18:32 UTC (permalink / raw) To: Rick Jones; +Cc: Krishna Kumar2, Evgeniy Polyakov, netdev Rick Jones wrote: > Vlad Yasevich wrote: >> Rick Jones wrote: >> >>>> It is the reverse - GSO will segment one super-packet just before >>>> calling >>>> the driver so that the stack is traversed only once. In my case, I am >>>> trying to send out multiple skbs, possibly small packets, in one shot. >>>> GSO will not help for small packets. >>> >>> If there are small packets that implies small sends, which suggests that >>> they would be coalesced either implicitly by the Nagle algorithm or >>> explicitly with TCP_CORK no? >>> >>> rick jones >>> - >> >> >> May be for TCP? What about other protocols? > > There are other protocols?-) True, UDP, and I suppose certain modes of > SCTP might be sending streams of small packets, as might TCP with > TCP_NODELAY set. > > Do they often queue-up outside the driver? Not sure if DCCP might fall into this category as well... I think the idea of this patch is gather some number of these small packets and shove them at the driver in one go instead of each small packet at a time. I might be helpful, but reserve judgment till I see more numbers. -vlad ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 18:32 ` Vlad Yasevich @ 2007-05-10 18:40 ` Rick Jones 2007-05-10 18:59 ` Ian McDonald 1 sibling, 0 replies; 96+ messages in thread From: Rick Jones @ 2007-05-10 18:40 UTC (permalink / raw) To: Vlad Yasevich; +Cc: Krishna Kumar2, Evgeniy Polyakov, netdev > > Not sure if DCCP might fall into this category as well... > > I think the idea of this patch is gather some number of these small packets and > shove them at the driver in one go instead of each small packet at a time. This reminds me... (rick starts waxing rhapshodic about old HP-UX behviour :) The HP-UX drivers have a "packet train" mode whereby IP will give them all the fragments of an IP datagram in one shot. The idea was that the driver would take all of them, or none of them. This was to deal with issues around filling a driver's transmit queue with some of the fragments and then dropping the others, thus transmitting IP datagrams which had no hope of being reassembled into anything other than frankengrams. rick jones ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 18:32 ` Vlad Yasevich 2007-05-10 18:40 ` Rick Jones @ 2007-05-10 18:59 ` Ian McDonald 2007-05-10 19:21 ` Vlad Yasevich 2007-05-11 5:04 ` Krishna Kumar2 1 sibling, 2 replies; 96+ messages in thread From: Ian McDonald @ 2007-05-10 18:59 UTC (permalink / raw) To: Vlad Yasevich; +Cc: Rick Jones, Krishna Kumar2, Evgeniy Polyakov, netdev On 5/11/07, Vlad Yasevich <vladislav.yasevich@hp.com> wrote: > >> May be for TCP? What about other protocols? > > > > There are other protocols?-) True, UDP, and I suppose certain modes of > > SCTP might be sending streams of small packets, as might TCP with > > TCP_NODELAY set. > > > > Do they often queue-up outside the driver? > > Not sure if DCCP might fall into this category as well... > Yes DCCP definitely can queue outside the driver. > I think the idea of this patch is gather some number of these small packets and > shove them at the driver in one go instead of each small packet at a time. > > I might be helpful, but reserve judgment till I see more numbers. > > -vlad As I see this proposed patch it is about reducing the number of "task switches" between the driver and the protocol. I use task switch in speech marks as it isn't really as is in the kernel. So in other words we are hoping that spending more time in each area would keep the cache hot and work to be done if locks held. This of course requires that the complexity added doesn't outweigh the gains - otherwise you could end up in a worse scenario where the driver doesn't send packets because the protocol is busy linking them. As far as I can tell you're not combining packets?? This would definitely break UDP/DCCP which are datagram based. Will be interesting to see results but I'm a little sceptical. -- Web: http://wand.net.nz/~iam4/ Blog: http://iansblog.jandi.co.nz WAND Network Research Group ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 18:59 ` Ian McDonald @ 2007-05-10 19:21 ` Vlad Yasevich 2007-05-10 19:26 ` Ian McDonald 2007-05-10 20:32 ` David Miller 2007-05-11 5:04 ` Krishna Kumar2 1 sibling, 2 replies; 96+ messages in thread From: Vlad Yasevich @ 2007-05-10 19:21 UTC (permalink / raw) To: Ian McDonald; +Cc: Rick Jones, Krishna Kumar2, Evgeniy Polyakov, netdev Ian McDonald wrote: > On 5/11/07, Vlad Yasevich <vladislav.yasevich@hp.com> wrote: >> >> May be for TCP? What about other protocols? >> > >> > There are other protocols?-) True, UDP, and I suppose certain modes of >> > SCTP might be sending streams of small packets, as might TCP with >> > TCP_NODELAY set. >> > >> > Do they often queue-up outside the driver? >> >> Not sure if DCCP might fall into this category as well... >> > Yes DCCP definitely can queue outside the driver. > >> I think the idea of this patch is gather some number of these small >> packets and >> shove them at the driver in one go instead of each small packet at a >> time. >> >> I might be helpful, but reserve judgment till I see more numbers. >> >> -vlad > > As I see this proposed patch it is about reducing the number of "task > switches" between the driver and the protocol. I use task switch in > speech marks as it isn't really as is in the kernel. So in other words > we are hoping that spending more time in each area would keep the > cache hot and work to be done if locks held. This of course requires > that the complexity added doesn't outweigh the gains - otherwise you > could end up in a worse scenario where the driver doesn't send packets > because the protocol is busy linking them. > > As far as I can tell you're not combining packets?? This would > definitely break UDP/DCCP which are datagram based. If it's combining packets, it would break a lot of other things as well. The proposed usage seems to be to link packets together into a chain and give the whole chain to the driver, instead of handing down one packet at a time. The win might be biggest on a system were a lot of applications send a lot of small packets. Some number will aggregate in the prio queue and then get shoved into a driver in one go. But... this is all conjecture until we see the code. -vlad ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 19:21 ` Vlad Yasevich @ 2007-05-10 19:26 ` Ian McDonald 2007-05-10 20:32 ` David Miller 1 sibling, 0 replies; 96+ messages in thread From: Ian McDonald @ 2007-05-10 19:26 UTC (permalink / raw) To: Vlad Yasevich; +Cc: Rick Jones, Krishna Kumar2, Evgeniy Polyakov, netdev On 5/11/07, Vlad Yasevich <vladislav.yasevich@hp.com> wrote: > The win might be biggest on a system were a lot of applications send a lot of > small packets. Some number will aggregate in the prio queue and then get shoved > into a driver in one go. That's assuming that the device doesn't run out of things to send first.... > > But... this is all conjecture until we see the code. > Agree -- Web: http://wand.net.nz/~iam4/ Blog: http://iansblog.jandi.co.nz WAND Network Research Group ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 19:21 ` Vlad Yasevich 2007-05-10 19:26 ` Ian McDonald @ 2007-05-10 20:32 ` David Miller 2007-05-10 20:49 ` Rick Jones 1 sibling, 1 reply; 96+ messages in thread From: David Miller @ 2007-05-10 20:32 UTC (permalink / raw) To: vladislav.yasevich; +Cc: ian.mcdonald, rick.jones2, krkumar2, johnpol, netdev From: Vlad Yasevich <vladislav.yasevich@hp.com> Date: Thu, 10 May 2007 15:21:30 -0400 > The win might be biggest on a system were a lot of applications send > a lot of small packets. Some number will aggregate in the prio > queue and then get shoved into a driver in one go. > > But... this is all conjecture until we see the code. Also, whatever you gain in cpu usage you'll lose in latency. And things sending tiny frames are exactly the ones that care about latency. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 20:32 ` David Miller @ 2007-05-10 20:49 ` Rick Jones 2007-05-10 21:02 ` David Miller 0 siblings, 1 reply; 96+ messages in thread From: Rick Jones @ 2007-05-10 20:49 UTC (permalink / raw) To: David Miller; +Cc: vladislav.yasevich, ian.mcdonald, krkumar2, johnpol, netdev David Miller wrote: > From: Vlad Yasevich <vladislav.yasevich@hp.com> > Date: Thu, 10 May 2007 15:21:30 -0400 > > >>The win might be biggest on a system were a lot of applications send >>a lot of small packets. Some number will aggregate in the prio >>queue and then get shoved into a driver in one go. >> >>But... this is all conjecture until we see the code. > > > Also, whatever you gain in cpu usage you'll lose in latency. > > And things sending tiny frames are exactly the ones that care about > latency. I'd think one would only do this in those situations/places where a natural "out of driver" queue develops in the first place wouldn't one? rick jones ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 20:49 ` Rick Jones @ 2007-05-10 21:02 ` David Miller 2007-05-10 21:14 ` Gagan Arneja 0 siblings, 1 reply; 96+ messages in thread From: David Miller @ 2007-05-10 21:02 UTC (permalink / raw) To: rick.jones2; +Cc: vladislav.yasevich, ian.mcdonald, krkumar2, johnpol, netdev From: Rick Jones <rick.jones2@hp.com> Date: Thu, 10 May 2007 13:49:44 -0700 > I'd think one would only do this in those situations/places where a > natural "out of driver" queue develops in the first place wouldn't > one? Indeed. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 21:02 ` David Miller @ 2007-05-10 21:14 ` Gagan Arneja 2007-05-11 2:28 ` Stephen Hemminger 0 siblings, 1 reply; 96+ messages in thread From: Gagan Arneja @ 2007-05-10 21:14 UTC (permalink / raw) To: David Miller Cc: rick.jones2, vladislav.yasevich, ian.mcdonald, krkumar2, johnpol, netdev David Miller wrote: > From: Rick Jones <rick.jones2@hp.com> > Date: Thu, 10 May 2007 13:49:44 -0700 > >> I'd think one would only do this in those situations/places where a >> natural "out of driver" queue develops in the first place wouldn't >> one? > > Indeed. And one builds in qdisc because your device sink is slow. There's nothing inherently there in the kernel encouraging Qdisc->q to grow. It's precisely these broken devices that can take advantage of cluster transmits. -- Gagan ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 21:14 ` Gagan Arneja @ 2007-05-11 2:28 ` Stephen Hemminger 2007-05-11 5:01 ` Gagan Arneja 0 siblings, 1 reply; 96+ messages in thread From: Stephen Hemminger @ 2007-05-11 2:28 UTC (permalink / raw) To: Gagan Arneja Cc: David Miller, rick.jones2, vladislav.yasevich, ian.mcdonald, krkumar2, johnpol, netdev On Thu, 10 May 2007 14:14:05 -0700 Gagan Arneja <gaagaan@gmail.com> wrote: > David Miller wrote: > > From: Rick Jones <rick.jones2@hp.com> > > Date: Thu, 10 May 2007 13:49:44 -0700 > > > >> I'd think one would only do this in those situations/places where a > >> natural "out of driver" queue develops in the first place wouldn't > >> one? > > > > Indeed. > > And one builds in qdisc because your device sink is slow. There's > nothing inherently there in the kernel encouraging Qdisc->q to grow. > It's precisely these broken devices that can take advantage of cluster > transmits. If you have braindead slow hardware, there is nothing that says your start_xmit routine can't do its own coalescing. The cost of calling the transmit routine is the responsibility of the driver, not the core network code. -- Stephen Hemminger <shemminger@linux-foundation.org> ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-11 2:28 ` Stephen Hemminger @ 2007-05-11 5:01 ` Gagan Arneja 0 siblings, 0 replies; 96+ messages in thread From: Gagan Arneja @ 2007-05-11 5:01 UTC (permalink / raw) To: Stephen Hemminger Cc: Gagan Arneja, David Miller, rick.jones2, vladislav.yasevich, ian.mcdonald, krkumar2, johnpol, netdev > If you have braindead slow hardware, > there is nothing that says your start_xmit routine can't do its own > coalescing. The cost of calling the transmit routine is the responsibility > of the driver, not the core network code. Yes, except you very likely run the risk of the driver introducing latency. The problem is prevalent enough that one can argue for a fix (a better one at that) higher up. And it's not all that high up that we need to get completely paranoid about it. > -- > Stephen Hemminger <shemminger@linux-foundation.org> > -- Gagan ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 18:59 ` Ian McDonald 2007-05-10 19:21 ` Vlad Yasevich @ 2007-05-11 5:04 ` Krishna Kumar2 2007-05-11 9:01 ` Evgeniy Polyakov 1 sibling, 1 reply; 96+ messages in thread From: Krishna Kumar2 @ 2007-05-11 5:04 UTC (permalink / raw) To: Ian McDonald; +Cc: Evgeniy Polyakov, netdev, Rick Jones, Vlad Yasevich "Ian McDonald" <ian.mcdonald@jandi.co.nz> wrote on 05/11/2007 12:29:08 AM: > > As I see this proposed patch it is about reducing the number of "task > switches" between the driver and the protocol. I use task switch in > speech marks as it isn't really as is in the kernel. So in other words > we are hoping that spending more time in each area would keep the > cache hot and work to be done if locks held. This of course requires > that the complexity added doesn't outweigh the gains - otherwise you > could end up in a worse scenario where the driver doesn't send packets > because the protocol is busy linking them. This is a right summary. > As far as I can tell you're not combining packets?? This would > definitely break UDP/DCCP which are datagram based. Not combining packets, I am sending them out in the same sequence it was queued. If the xmit failed, the driver's new API returns the skb which failed to be sent. This skb and all other linked skbs are requeue'd in the reverse order (fofi?) till the next time it is tried again. I see that sometimes I can send tx_queue_len packets in one shot and all succeeds. But the downside is that in failure case, the packets have to be requeue'd. > Will be interesting to see results but I'm a little sceptical. I was just going to test, and Rick helpfully pointed out netperf4. Thanks, - KK > > -- > Web: http://wand.net.nz/~iam4/ > Blog: http://iansblog.jandi.co.nz > WAND Network Research Group ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11 5:04 ` Krishna Kumar2 @ 2007-05-11 9:01 ` Evgeniy Polyakov 2007-05-11 9:18 ` Krishna Kumar2
  0 siblings, 1 reply; 96+ messages in thread
From: Evgeniy Polyakov @ 2007-05-11 9:01 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: Ian McDonald, netdev, Rick Jones, Vlad Yasevich

On Fri, May 11, 2007 at 10:34:22AM +0530, Krishna Kumar2 (krkumar2@in.ibm.com) wrote:
> I am not combining packets; I am sending them out in the same sequence in
> which they were queued. If the xmit fails, the driver's new API returns
> the skb which failed to be sent. This skb and all the other linked skbs
> are requeued in reverse order (so that FIFO order is preserved) until the
> next time xmit is tried. I see that sometimes I can send tx_queue_len
> packets in one shot and all succeed. But the downside is that in the
> failure case the packets have to be requeued.

And what if you have thousands of packets queued and the first one has
failed? Requeueing all the rest one by one is not a solution. If it is
done under a heavily contended lock (especially with irqs disabled) it
becomes a disaster.

I thought of a slightly different approach: the driver maintains its own
queue (or has access to the stack's one) and only takes the lock to
dequeue a packet. If a transmit fails, nothing is requeued; the same
packet is retried until the transmit completes. If a number of transmits
fail in a row, the driver declares itself broken and/or stops the queue.
Thus one can set up several descriptors in one go and do it without any
locks. The stack calls the driver's xmit function, which essentially just
sets a bit saying that new packets are available, but the driver must
have access to the qdisc.

If the e1000 driver were not so... so 'uncommon' compared to other small
gige drivers I would try to cook up a patch, but I will not with e1000 :)

-- 
	Evgeniy Polyakov

^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11 9:01 ` Evgeniy Polyakov @ 2007-05-11 9:18 ` Krishna Kumar2 2007-05-11 9:32 ` Evgeniy Polyakov
  0 siblings, 1 reply; 96+ messages in thread
From: Krishna Kumar2 @ 2007-05-11 9:18 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ian McDonald, netdev, Rick Jones, Vlad Yasevich

Hi Evgeniy,

Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote on 05/11/2007 02:31:38 PM:

> On Fri, May 11, 2007 at 10:34:22AM +0530, Krishna Kumar2 (krkumar2@in.ibm.com) wrote:
> > I am not combining packets; I am sending them out in the same sequence
> > in which they were queued. If the xmit fails, the driver's new API
> > returns the skb which failed to be sent. This skb and all the other
> > linked skbs are requeued in reverse order (so that FIFO order is
> > preserved) until the next time xmit is tried. I see that sometimes I
> > can send tx_queue_len packets in one shot and all succeed. But the
> > downside is that in the failure case the packets have to be requeued.
>
> And what if you have thousands of packets queued and the first one has
> failed? Requeueing all the rest one by one is not a solution. If it is
> done under a heavily contended lock (especially with irqs disabled) it
> becomes a disaster.

If the first one failed for a reason other than those described below, it
is freed up and the next one is attempted. There are three cases where we
cannot continue: no slots, device blocked, no lock.

Jamal had suggested getting the number of available slots from the driver.
queue_stopped is checked before linking packets, so the only other error
case is not getting the lock, and that is true only in the ~LLTX case,
which could optionally be a condition for enabling this linking. Also,
there could be a limit on how many packets are queued; I tried
tx_queue_len as well as tx_queue_len/2, but there could be other options.

> I thought of a slightly different approach: the driver maintains its own
> queue (or has access to the stack's one) and only takes the lock to
> dequeue a packet. If a transmit fails, nothing is requeued; the same
> packet is retried until the transmit completes. If a number of transmits
> fail in a row, the driver declares itself broken and/or stops the queue.
> Thus one can set up several descriptors in one go and do it without any
> locks. The stack calls the driver's xmit function, which essentially just
> sets a bit saying that new packets are available, but the driver must
> have access to the qdisc.

Bugs will increase if drivers are also allowed to access the qdisc, since
that is difficult code as it is. Right now it is easy to see how the qdisc
is manipulated.

thanks,

- KK

> If the e1000 driver were not so... so 'uncommon' compared to other small
> gige drivers I would try to cook up a patch, but I will not with e1000 :)
>
> --
> 	Evgeniy Polyakov

^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11 9:18 ` Krishna Kumar2 @ 2007-05-11 9:32 ` Evgeniy Polyakov 2007-05-11 9:52 ` Krishna Kumar2
  0 siblings, 1 reply; 96+ messages in thread
From: Evgeniy Polyakov @ 2007-05-11 9:32 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: Ian McDonald, netdev, Rick Jones, Vlad Yasevich

On Fri, May 11, 2007 at 02:48:14PM +0530, Krishna Kumar2 (krkumar2@in.ibm.com) wrote:
> > And what if you have thousands of packets queued and the first one has
> > failed? Requeueing all the rest one by one is not a solution. If it is
> > done under a heavily contended lock (especially with irqs disabled) it
> > becomes a disaster.
>
> If the first one failed for a reason other than those described below, it
> is freed up and the next one is attempted. There are three cases where we
> cannot continue: no slots, device blocked, no lock.
>
> Jamal had suggested getting the number of available slots from the driver.
> queue_stopped is checked before linking packets, so the only other error
> case is not getting the lock, and that is true only in the ~LLTX case,
> which could optionally be a condition for enabling this linking. Also,
> there could be a limit on how many packets are queued; I tried
> tx_queue_len as well as tx_queue_len/2, but there could be other options.

Why should the stack ever know what hardware is being used?
If you are going to break things, break them with a scary sound :)

Design a new interface where the driver has access to the queue (say,
only through one helper, dequeue_first_packet()), the stack can only
queue packets to the tail, and the driver can also stop that by setting
a bit (which was probably there before we even knew the word Linux).

That is all: the driver does not access the qdisc, it only gets the
first skb. The transmit function is completely lockless (unless
something is shared with the receive path/interrupts, but that is the
driver's private part), and the stack will not be changed at all (except
for a new helper to dequeue the first packet from the qdisc; actually it
is already there, it just needs to be exported). The driver's xmit
function sets a flag that there are packets (or it can even be empty,
since the presence of packets can be detected via the same qdisc
interface). You read the qdisc length, set up several descriptors in one
go, dequeue that set of skbs and commit them. If something has failed,
it is not requeued but resent (or even dropped after a fair number of
attempts).

No locks, no requeues? Seems simple imho.

-- 
	Evgeniy Polyakov

^ permalink raw reply [flat|nested] 96+ messages in thread
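As a rough illustration of the drain loop this scheme implies (and only
that): struct my_priv, hw_can_queue(), hw_post(), hw_kick() and
MY_TX_RETRY_LIMIT are invented stand-ins for driver internals, and qdisc
access is shown through the dev->qdisc / dev->queue_lock fields of this
kernel series. Only the dequeue itself takes the lock, and nothing is ever
requeued.

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <net/sch_generic.h>

static void my_drain_queue(struct net_device *dev, struct my_priv *priv)
{
        struct Qdisc *q = dev->qdisc;
        int posted = 0;

        while (hw_can_queue(priv)) {
                struct sk_buff *skb;
                int tries = 0;

                spin_lock(&dev->queue_lock);
                skb = q->dequeue(q);            /* the only locked step */
                spin_unlock(&dev->queue_lock);
                if (!skb)
                        break;

                /* never requeued: retry the same packet, then drop it
                 * after a fair number of attempts */
                while (hw_post(priv, skb) < 0) {
                        if (++tries > MY_TX_RETRY_LIMIT) {
                                dev_kfree_skb_any(skb);
                                skb = NULL;
                                break;
                        }
                }
                if (skb)
                        posted++;
        }

        if (posted)
                hw_kick(priv);          /* one doorbell for the whole batch */
}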
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-11 9:32 ` Evgeniy Polyakov @ 2007-05-11 9:52 ` Krishna Kumar2 2007-05-11 9:56 ` Evgeniy Polyakov 0 siblings, 1 reply; 96+ messages in thread From: Krishna Kumar2 @ 2007-05-11 9:52 UTC (permalink / raw) To: Evgeniy Polyakov; +Cc: Ian McDonald, netdev, Rick Jones, Vlad Yasevich Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote on 05/11/2007 03:02:02 PM: > On Fri, May 11, 2007 at 02:48:14PM +0530, Krishna Kumar2 (krkumar2@in.ibm.com) wrote: > > > And what if you have thousand(s) of packets queued and first one has > > > failed, requeing all the rest one-by-one is not a solution. If it is > > > being done under heavy lock (with disabled irqs especially) it becomes a > > > disaster. > > > > If first one failed for other reasons from described below, it is freed up > > and the next one attempted. There are three cases where we cannot continue > > : > > no slots, device blocked, no lock. > > > Why stack should ever know what hardware is being used? > If you are going to break things, break it with scary sound :) Actually it could just be an API exported by drivers which implement this new xmit API. > Design new interface, where driver has access to the queue (say only > using one helper dequeue_first_packet()), stack can only queue packets > to the tail, driver can also stop that by setting a bit (which was there > likely before we had knew word Linux). > > That is all, driver does not access qdisk, it only gets the first skb. > Transmit function is completely lockless (until something is shared with > receive path/interrupts, but it is private driver's part), stack will > not be changed at all (except new helper to dequeue first packet from > the qdisk, actually it is already there, it just needs to be exported), > driver's xit function sets a flag that there are packets (or it can be > even empty, since packet's presense can be detected via the same qdisk > interface). You read qdisk len, setup several descriptors in one go, > dequeue set of skbs and commit them. If something has failed, it is not > requeued, but resent (or even dropped after fair amount of attempts). > No locks, no requeues? Seems simple imho. I will analyze this in more detail when I return (leaving just now, so got really no time). The only issue that I see quickly is "No locks", since to get things off the queue you need to hold the queue_lock. Thanks, - KK ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11 9:52 ` Krishna Kumar2 @ 2007-05-11 9:56 ` Evgeniy Polyakov 2007-05-11 11:30 ` jamal
  0 siblings, 1 reply; 96+ messages in thread
From: Evgeniy Polyakov @ 2007-05-11 9:56 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: Ian McDonald, netdev, Rick Jones, Vlad Yasevich

On Fri, May 11, 2007 at 03:22:13PM +0530, Krishna Kumar2 (krkumar2@in.ibm.com) wrote:
> > No locks, no requeues? Seems simple imho.
>
> I will analyze this in more detail when I return (leaving just now, so got
> really no time). The only issue that I see quickly is "No locks", since to
> get things off the queue you need to hold the queue_lock.

I meant no locks during the processing of the packets (pci read/write, dma
setup and so on); of course a lock is needed to dequeue a packet, but only
for that operation.

-- 
	Evgeniy Polyakov

^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11 9:56 ` Evgeniy Polyakov @ 2007-05-11 11:30 ` jamal 2007-05-11 11:53 ` Evgeniy Polyakov
  0 siblings, 1 reply; 96+ messages in thread
From: jamal @ 2007-05-11 11:30 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Krishna Kumar2, Ian McDonald, netdev, Rick Jones, Vlad Yasevich

On Fri, 2007-11-05 at 13:56 +0400, Evgeniy Polyakov wrote:

> I meant no locks during the processing of the packets (pci read/write, dma
> setup and so on); of course a lock is needed to dequeue a packet, but only
> for that operation.

I don't think you can avoid the lock, Evgeniy. You need to protect against
the tx ring having other things happening to it from the napi poll
or netdev interrupts (most of the drivers touch the tx ring in the napi
poll).

cheers,
jamal

^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11 11:30 ` jamal @ 2007-05-11 11:53 ` Evgeniy Polyakov 2007-05-11 12:15 ` jamal
  0 siblings, 1 reply; 96+ messages in thread
From: Evgeniy Polyakov @ 2007-05-11 11:53 UTC (permalink / raw)
  To: jamal; +Cc: Krishna Kumar2, Ian McDonald, netdev, Rick Jones, Vlad Yasevich

On Fri, May 11, 2007 at 07:30:02AM -0400, jamal (hadi@cyberus.ca) wrote:
> > I meant no locks during the processing of the packets (pci read/write, dma
> > setup and so on); of course a lock is needed to dequeue a packet, but only
> > for that operation.
>
> I don't think you can avoid the lock, Evgeniy. You need to protect against
> the tx ring having other things happening to it from the napi poll
> or netdev interrupts (most of the drivers touch the tx ring in the napi
> poll).

As I said, there might be another lock if the interrupt handler is shared
or registers are accessed, but that is the driver's private business,
which has nothing in common with the stack itself. The stack just queues
an skb which, after being detached from the tx queue by the driver, simply
does not exist for the stack anymore. It could even be dequeued with RCU
protection.

> cheers,
> jamal

-- 
	Evgeniy Polyakov

^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-11 11:53 ` Evgeniy Polyakov @ 2007-05-11 12:15 ` jamal 0 siblings, 0 replies; 96+ messages in thread From: jamal @ 2007-05-11 12:15 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Krishna Kumar2, Ian McDonald, netdev, Rick Jones, Vlad Yasevich On Fri, 2007-11-05 at 15:53 +0400, Evgeniy Polyakov wrote: > As I said there might be another lock, if interrupt handler is shared, > or registers are accessed, but it is privite driver's business, which > has nothing in common with stack itself. Ok, we are saying the same thing then. eg in e1000 that would be tx_ring->lock or something along those lines. > Stack just queues an skb, > which, after detached from the tx queue by driver, just does not exist > for stack anymore. It can be dequeed with rcu protection even. in my case i had a dev->blist where the stack queued. You really want to avoid requeueing altogether. The trick i used is to have the driver tell you how many packets you can pull off the qdisc. cheers, jamal ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 17:19 ` Rick Jones 2007-05-10 18:07 ` Sridhar Samudrala 2007-05-10 18:13 ` Vlad Yasevich @ 2007-05-10 21:27 ` David Stevens 2007-05-10 21:40 ` David Miller ` (2 more replies) 2 siblings, 3 replies; 96+ messages in thread From: David Stevens @ 2007-05-10 21:27 UTC (permalink / raw) To: Rick Jones; +Cc: Evgeniy Polyakov, Krishna Kumar2, netdev, netdev-owner The word "small" is coming up a lot in this discussion, and I think packet size really has nothing to do with it. Multiple streams generating packets of any size would benefit; the key ingredient is a queue length greater than 1. I think the intent is to remove queue lock cycles by taking the whole list (at least up to the count of free ring buffers) when the queue is greater than one packet, thus effectively removing the lock expense for n-1 packets. +-DLS ^ permalink raw reply [flat|nested] 96+ messages in thread
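A hedged sketch of the "take the whole list under one lock acquisition"
idea, written against the qdisc layout of this kernel series;
my_tx_slots_free() and my_xmit_list() are hypothetical driver-supplied
helpers, not existing kernel API.

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <net/sch_generic.h>

/* Pull up to the number of free ring slots off the qdisc while holding
 * queue_lock once, then hand the whole batch to the driver. */
static int my_batch_xmit(struct net_device *dev)
{
        struct Qdisc *q = dev->qdisc;
        struct sk_buff_head batch;
        struct sk_buff *skb;
        int room = my_tx_slots_free(dev);       /* assumption: driver hook */

        skb_queue_head_init(&batch);

        spin_lock(&dev->queue_lock);
        while (room-- > 0 && (skb = q->dequeue(q)) != NULL)
                __skb_queue_tail(&batch, skb);
        spin_unlock(&dev->queue_lock);

        if (skb_queue_empty(&batch))
                return 0;

        /* the lock expense is paid once for the whole batch; the driver
         * now owns these skbs and is expected not to hand them back */
        return my_xmit_list(dev, &batch);
}

Capping the batch at the free-slot count is what keeps requeueing out of
the picture, which is the point made in the reply that follows.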
* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 21:27 ` David Stevens @ 2007-05-10 21:40 ` David Miller 2007-05-10 21:50 ` Gagan Arneja 2007-05-10 21:41 ` Eric Dumazet 2007-05-10 21:45 ` Rick Jones
  2 siblings, 1 reply; 96+ messages in thread
From: David Miller @ 2007-05-10 21:40 UTC (permalink / raw)
  To: dlstevens; +Cc: rick.jones2, johnpol, krkumar2, netdev, netdev-owner

From: David Stevens <dlstevens@us.ibm.com>
Date: Thu, 10 May 2007 14:27:56 -0700

> The word "small" is coming up a lot in this discussion, and
> I think packet size really has nothing to do with it. Multiple
> streams generating packets of any size would benefit; the
> key ingredient is a queue length greater than 1.
>
> I think the intent is to remove queue lock cycles by taking
> the whole list (at least up to the count of free ring buffers)
> when the queue is greater than one packet, thus effectively
> removing the lock expense for n-1 packets.

Right.

But I think it's critical to do two things:

1) Do this when netif_wake_queue() triggers, and thus the
   TX lock is held already.

2) Have some way for the driver to say how many free TX slots
   there are, in order to minimize if not eliminate requeueing
   during this batching thing.

If you drop the TX lock, the number of free slots can change
as another cpu gets in there queuing packets.

I know there are some hardware workarounds that require using
more TX ring buffer slots and are usually necessary, which makes
a 100% accurate indication of free slots impossible.

^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 21:40 ` David Miller @ 2007-05-10 21:50 ` Gagan Arneja 2007-05-10 22:06 ` David Miller 0 siblings, 1 reply; 96+ messages in thread From: Gagan Arneja @ 2007-05-10 21:50 UTC (permalink / raw) To: David Miller Cc: dlstevens, rick.jones2, johnpol, krkumar2, netdev, netdev-owner David Miller wrote: > Right. > > But I think it's critical to do two things: > > 1) Do this when netif_wake_queue() is triggers and thus the > TX is locked already. > > 2) Have some way for the driver to say how many free TX slots > there are in order to minimize if not eliminate requeueing > during this batching thing. I don't think you can reliably do this. Jumbograms may end up taking more than one slot. > > If you drop the TX lock, the number of free slots can change > as another cpu gets in there queuing packets. Can you ever have more than one thread inside the driver? Isn't xmit_lock held while we're in there? -- Gagan ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 21:50 ` Gagan Arneja @ 2007-05-10 22:06 ` David Miller 2007-05-11 9:46 ` Krishna Kumar2
  0 siblings, 1 reply; 96+ messages in thread
From: David Miller @ 2007-05-10 22:06 UTC (permalink / raw)
  To: gaagaan; +Cc: dlstevens, rick.jones2, johnpol, krkumar2, netdev, netdev-owner

From: Gagan Arneja <gaagaan@gmail.com>
Date: Thu, 10 May 2007 14:50:19 -0700

> David Miller wrote:
>
> > If you drop the TX lock, the number of free slots can change
> > as another cpu gets in there queuing packets.
>
> Can you ever have more than one thread inside the driver? Isn't
> xmit_lock held while we're in there?

There are restrictions wrt. when the xmit_lock and the
queue lock can be held at the same time.

The devil is definitely in the details if you try to
implement this. It definitely lends support to Eric D.'s
assertion that this change will only add bugs, and that doing
something simple like prefetches is probably a safer
route to go down.

^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 22:06 ` David Miller @ 2007-05-11 9:46 ` Krishna Kumar2
  0 siblings, 0 replies; 96+ messages in thread
From: Krishna Kumar2 @ 2007-05-11 9:46 UTC (permalink / raw)
  To: David Miller
  Cc: dlstevens, gaagaan, johnpol, netdev, netdev-owner, rick.jones2

Hi all,

Very preliminary testing with 20 procs on the E1000 driver gives me the
following results:

  skbsz    Org BW    New BW       %    Org demand   New demand        %
     32    315.98    347.48    9.97%        21090        20958    0.62%
     96    833.67    882.92    5.91%         7939         9107  -14.71%

But this is a test run of just 30 secs (too short) with netperf2 (not
netperf4, which I am going to use later), using a single E1000 card,
cross-cabled, on a 2.6.21.1 kernel on 2-CPU 2.8GHz xSeries systems. I will
have a more detailed report next week, especially once I run netperf4.

I am taking off for the next two days, so will reply later on this thread.

Thanks,

- KK

David Miller <davem@davemloft.net> wrote on 05/11/2007 03:36:05 AM:

> From: Gagan Arneja <gaagaan@gmail.com>
> Date: Thu, 10 May 2007 14:50:19 -0700
>
> > David Miller wrote:
> >
> > > If you drop the TX lock, the number of free slots can change
> > > as another cpu gets in there queuing packets.
> >
> > Can you ever have more than one thread inside the driver? Isn't
> > xmit_lock held while we're in there?
>
> There are restrictions wrt. when the xmit_lock and the
> queue lock can be held at the same time.
>
> The devil is definitely in the details if you try to
> implement this. It definitely lends support to Eric D.'s
> assertion that this change will only add bugs, and that doing
> something simple like prefetches is probably a safer
> route to go down.

^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 21:27 ` David Stevens 2007-05-10 21:40 ` David Miller @ 2007-05-10 21:41 ` Eric Dumazet 2007-05-10 22:09 ` Rick Jones
  0 siblings, 1 reply; 96+ messages in thread
From: Eric Dumazet @ 2007-05-10 21:41 UTC (permalink / raw)
  To: David Stevens
  Cc: Rick Jones, Evgeniy Polyakov, Krishna Kumar2, netdev, netdev-owner

David Stevens wrote:
> The word "small" is coming up a lot in this discussion, and
> I think packet size really has nothing to do with it. Multiple
> streams generating packets of any size would benefit; the
> key ingredient is a queue length greater than 1.
>
> I think the intent is to remove queue lock cycles by taking
> the whole list (at least up to the count of free ring buffers)
> when the queue is greater than one packet, thus effectively
> removing the lock expense for n-1 packets.
>

Yes, but on modern cpus, locked operations are basically free once the
CPU already has the cache line in exclusive access in its L1 cache.

I am not sure adding yet another driver API will help very much.
It will for sure add some bugs and pain.

A less expensive (and less prone to bugs) optimization would be to
prefetch one cache line for the next qdisc skb, as a cache line miss is
far more expensive than a locked operation (if the lock is already in
the L1 cache, of course).

^ permalink raw reply [flat|nested] 96+ messages in thread
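A minimal sketch of the prefetch suggestion, assuming a simple fifo qdisc
whose head sits in the embedded sk_buff_head and a caller that already
holds dev->queue_lock; where exactly this would be called from
qdisc_restart() is left open.

#include <linux/prefetch.h>
#include <linux/skbuff.h>
#include <net/sch_generic.h>

/* While the current skb is being handed to the driver, start pulling
 * the next queued skb into the cache. */
static inline void prefetch_next_qdisc_skb(struct Qdisc *q)
{
        struct sk_buff *next = skb_peek(&q->q);

        if (next)
                prefetch(next); /* warm one cache line of the next skb */
}

A real patch would need a qdisc-agnostic peek, since not every qdisc keeps
its packets in the embedded list.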
* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 21:41 ` Eric Dumazet @ 2007-05-10 22:09 ` Rick Jones
  0 siblings, 0 replies; 96+ messages in thread
From: Rick Jones @ 2007-05-10 22:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Stevens, Evgeniy Polyakov, Krishna Kumar2, netdev, netdev-owner

Eric Dumazet wrote:
> David Stevens wrote:
>
>> The word "small" is coming up a lot in this discussion, and
>> I think packet size really has nothing to do with it. Multiple
>> streams generating packets of any size would benefit; the
>> key ingredient is a queue length greater than 1.
>>
>> I think the intent is to remove queue lock cycles by taking
>> the whole list (at least up to the count of free ring buffers)
>> when the queue is greater than one packet, thus effectively
>> removing the lock expense for n-1 packets.
>>
>
> Yes, but on modern cpus, locked operations are basically free once the
> CPU already has the cache line in exclusive access in its L1 cache.

But will it be here? Any of the CPUs may be trying to add things to the
qdisc, but only one CPU is pulling from it, right? Even if the "pulling
from it" is happening in a loop, there can be scores or more other cores
trying to add things to the queue, which would cause that cache line to
migrate.

> I am not sure adding yet another driver API will help very much.
> It will for sure add some bugs and pain.

That could very well be.

> A less expensive (and less prone to bugs) optimization would be to
> prefetch one cache line for the next qdisc skb, as a cache line miss is
> far more expensive than a locked operation (if the lock is already in
> the L1 cache, of course).

Might they not build one on top of the other?

rick jones

^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 21:27 ` David Stevens 2007-05-10 21:40 ` David Miller 2007-05-10 21:41 ` Eric Dumazet @ 2007-05-10 21:45 ` Rick Jones 2007-05-10 21:53 ` David Stevens
  2 siblings, 1 reply; 96+ messages in thread
From: Rick Jones @ 2007-05-10 21:45 UTC (permalink / raw)
  To: David Stevens; +Cc: Evgeniy Polyakov, Krishna Kumar2, netdev, netdev-owner

David Stevens wrote:
> The word "small" is coming up a lot in this discussion, and
> I think packet size really has nothing to do with it. Multiple
> streams generating packets of any size would benefit; the
> key ingredient is a queue length greater than 1.
>
> I think the intent is to remove queue lock cycles by taking
> the whole list (at least up to the count of free ring buffers)
> when the queue is greater than one packet, thus effectively
> removing the lock expense for n-1 packets.

Which worked _very_ well (the whole list) going in the other direction for
the netisr queue(s) in HP-UX 10.20. OK, I promise no more old HP-UX stories
for the balance of the week :)

Taking some count of the list might be a trifle too complicated.

rick jones

^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 21:45 ` Rick Jones @ 2007-05-10 21:53 ` David Stevens 0 siblings, 0 replies; 96+ messages in thread From: David Stevens @ 2007-05-10 21:53 UTC (permalink / raw) To: Rick Jones; +Cc: Evgeniy Polyakov, Krishna Kumar2, netdev, netdev-owner > Which worked _very_ well (the whole list) going in the other direction for the > netisr queue(s) in HP-UX 10.20. OK, I promise no more old HP-UX stories for the > balance of the week :) Yes, OSes I worked on in other lives usually took the whole queue and then took responsibility for handling them outside the lock. A dequeue_n() under 1 lock should still be better than n locks, I'd think, since you may bounce to different CPUs during the n individual cycles. +-DLS ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-10 14:53 Krishna Kumar 2007-05-10 15:08 ` Evgeniy Polyakov @ 2007-05-10 20:21 ` Roland Dreier 2007-05-11 7:30 ` Krishna Kumar2 2007-05-11 9:05 ` Andi Kleen 2 siblings, 1 reply; 96+ messages in thread From: Roland Dreier @ 2007-05-10 20:21 UTC (permalink / raw) To: Krishna Kumar; +Cc: netdev, Krishna Kumar This is pretty interesting to me for IP-over-InfiniBand, for a couple of reasons. First of all, I can push multiple send requests to the underlying adapter in one go, which saves taking and dropping the same lock and also probably allows fewer MMIO writes for doorbells. However the second reason this is useful is actually more interesting. Right now we request a send completion interrupt for every packet we send because most current IB adapters don't really have very good interrupt mitigation features. If we don't signal a send completion for every packet, then we might run into a situation where we don't get a completion for the last packet that gets sent and end up stalling the interface because the stack never gives us another packet to send. However, if I have a whole queue of packets passed to my xmit routine, then I only have to request a completion for the last packet in the list, which could potentially cut down on interrupts and completion processing a lot. - R. ^ permalink raw reply [flat|nested] 96+ messages in thread
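A hedged sketch of the "signal only the last send" idea, not IPoIB's actual
code: fill_send_wr() is a placeholder for mapping an skb into a work
request, the wr/sge arrays are assumed to live in the driver's private
per-queue state, and the QP is assumed to be created with IB_SIGNAL_REQ_WR
signalling so that unsignalled sends generate no completion.

#include <linux/skbuff.h>
#include <rdma/ib_verbs.h>

static int post_send_batch(struct ib_qp *qp, struct ib_send_wr *wr,
                           struct ib_sge *sge, struct sk_buff **skbs, int n)
{
        struct ib_send_wr *bad_wr;
        int i;

        for (i = 0; i < n; i++) {
                fill_send_wr(&wr[i], &sge[i], skbs[i]);  /* placeholder */
                wr[i].next = (i == n - 1) ? NULL : &wr[i + 1];
                /* only the tail of the batch asks for a completion */
                wr[i].send_flags = (i == n - 1) ? IB_SEND_SIGNALED : 0;
        }

        /* one verbs call, and typically one doorbell, for n packets */
        return ib_post_send(qp, wr, &bad_wr);
}

The flip side, already hinted at above, is that the earlier skbs cannot be
reclaimed until the single signalled completion for the tail arrives.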
* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 20:21 ` Roland Dreier @ 2007-05-11 7:30 ` Krishna Kumar2 2007-05-11 11:21 ` Roland Dreier
  0 siblings, 1 reply; 96+ messages in thread
From: Krishna Kumar2 @ 2007-05-11 7:30 UTC (permalink / raw)
  To: Roland Dreier; +Cc: netdev

Hi Roland,

Roland Dreier <rdreier@cisco.com> wrote on 05/11/2007 01:51:50 AM:

> This is pretty interesting to me for IP-over-InfiniBand, for a couple
> of reasons. First of all, I can push multiple send requests to the
> underlying adapter in one go, which saves taking and dropping the same
> lock and also probably allows fewer MMIO writes for doorbells.
>
> However the second reason this is useful is actually more
> interesting. Right now we request a send completion interrupt for
> every packet we send because most current IB adapters don't really
> have very good interrupt mitigation features. If we don't signal a
> send completion for every packet, then we might run into a situation
> where we don't get a completion for the last packet that gets sent and
> end up stalling the interface because the stack never gives us another
> packet to send. However, if I have a whole queue of packets passed to
> my xmit routine, then I only have to request a completion for the last
> packet in the list, which could potentially cut down on interrupts and
> completion processing a lot.

Sounds like a good idea. I had a question on error handling. What happens
if the driver asynchronously returns an error for this WR (a single WR
containing multiple skbs)? Does it mean all the skbs failed to be sent?
Requeueing all of them is a bad idea since it leads to doing the same thing
over and over, so should every skb be freed? Isn't that harsh (if the first
skb is bad and the others are fine), and how do we avoid that? In fact we
don't know which skb was transmitted and which failed. If we requeue skbs,
there will also be out-of-order skbs as the callback is asynchronous.

Thanks,

- KK

^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-11 7:30 ` Krishna Kumar2 @ 2007-05-11 11:21 ` Roland Dreier 0 siblings, 0 replies; 96+ messages in thread From: Roland Dreier @ 2007-05-11 11:21 UTC (permalink / raw) To: Krishna Kumar2; +Cc: netdev > Sounds a good idea. I had a question on error handling. What happens if > the driver asynchronously returns an error for this WR (single WR > containing multiple skbs) ? Does it mean all the skbs failed to be sent ? > Requeuing all of them is a bad idea since it leads to infinitely doing the > same thing, so should every skb be free'd ? Isn't that harsh (if first skb > is bad and others are fine) and how do we avoid that ? Infact we don't > know which skb was transmitted and which failed. If we requeue skbs, > there will also be out-of-order skbs as the callback is asynchronous. Each packet will still get a separate work request. However if one work request fails, then the QP will transition to the error state and all later work requests will fail too. But the completion handling doesn't change at all -- IPoIB already queues work requests with the hardware and then collects completions asynchronously. If we get a failed completion then we free the skb and bump tx errors -- there's nothing else really sensible to do. However errors on UD QPs can never really happen except for driver bugs or hardware failures, and for connected QPs (used in connected mode), errors probably mean that a connection broke, so dropping packets is quite sensible. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 14:53 Krishna Kumar 2007-05-10 15:08 ` Evgeniy Polyakov 2007-05-10 20:21 ` Roland Dreier @ 2007-05-11 9:05 ` Andi Kleen 2007-05-11 8:32 ` Krishna Kumar2
  2 siblings, 1 reply; 96+ messages in thread
From: Andi Kleen @ 2007-05-11 9:05 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: netdev, Krishna Kumar

Krishna Kumar <krkumar2@in.ibm.com> writes:

> Doing some measurements, I found that for small packets like 128 bytes,
> the bandwidth is approximately 60% of the line speed. To possibly speed
> up performance of small packet xmits, a method of "linking" skbs was
> thought of - where two pointers (skb_flink/blink) is added to the skb.

You don't need that. You can just use the normal next/prev pointers.
In general it's a good idea to lower lock overhead etc.; the VM has
used similar tricks very successfully in the past.

There were some thoughts about this earlier, but in high-end NICs the
direction instead seems to go towards LRO (large receive offloading).
LRO is basically like TSO, just for receiving. The NIC aggregates
multiple packets into a single larger one that is then processed by the
stack as one skb. This typically doesn't use linked lists, but an array
of pages.

Your scheme would help old NICs that don't have this optimization.
Might be a worthwhile goal, although people often seem to be more
interested in modern hardware.

Another problem is that this setup typically requires the aggregated
packets to be from the same connection. Otherwise you will only save a
short trip into the stack until the linked packets need to be split
again to be passed to multiple sockets. With that the scheme probably
helps much less.

The hardware schemes typically use at least some kind of hash to
aggregate connections. You might need to implement something similar
too if it doesn't save enough time otherwise; I don't know whether it
would be very efficient in software. Or you could do this only if
multiple packets belong to the same single connection (basically with a
one-hit cache); but then it would smell a bit like a benchmark hack.

-Andi

^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-11 9:05 ` Andi Kleen @ 2007-05-11 8:32 ` Krishna Kumar2 2007-05-11 9:37 ` Andi Kleen 0 siblings, 1 reply; 96+ messages in thread From: Krishna Kumar2 @ 2007-05-11 8:32 UTC (permalink / raw) To: Andi Kleen; +Cc: ak, netdev Hi Andy, ak@suse.de wrote on 05/11/2007 02:35:05 PM: > You don't need that. You can just use the normal next/prev pointers. > In general it's a good idea to lower lock overhead etc., the VM has > used similar tricks very successfully in the past. Does this mean each skb should be for the same connection if next/prev is used ? > Another problem is that this setup typically requires the aggregate > packets to be from the same connection. Otherwise you will only > safe a short trip into the stack until the linked packet would need > to be split again to pass to multiple sockets. With that the scheme > probably helps much less. I guess you meant this for receives only. On the send side, packets for different sockets can be linked and sent together, right ? > Or you could do this only if multiple packets belong to the same > single connection (basically with a one hit cache); but then it would But for sends, why does same or different connection matter ? There is no aggregating of skbs. Thanks, - KK ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-11 8:32 ` Krishna Kumar2 @ 2007-05-11 9:37 ` Andi Kleen 2007-05-11 8:50 ` Krishna Kumar2 2007-05-11 11:16 ` Roland Dreier 0 siblings, 2 replies; 96+ messages in thread From: Andi Kleen @ 2007-05-11 9:37 UTC (permalink / raw) To: Krishna Kumar2; +Cc: Andi Kleen, ak, netdev On Fri, May 11, 2007 at 02:02:55PM +0530, Krishna Kumar2 wrote: > Hi Andy, > > ak@suse.de wrote on 05/11/2007 02:35:05 PM: > > > You don't need that. You can just use the normal next/prev pointers. > > In general it's a good idea to lower lock overhead etc., the VM has > > used similar tricks very successfully in the past. > > Does this mean each skb should be for the same connection if next/prev > is used ? No; but see next paragraph. But without it aggregation on RX is much less useful because the packets cannot be kept together after socket demux which happens relatively early in the packet processing path. > > > Another problem is that this setup typically requires the aggregate > > packets to be from the same connection. Otherwise you will only > > safe a short trip into the stack until the linked packet would need > > to be split again to pass to multiple sockets. With that the scheme > > probably helps much less. > > I guess you meant this for receives only. On the send side, packets Yes. > for different sockets can be linked and sent together, right ? Not implemented and as DaveM pointed out such batching has some problems. There is just TSO/GSO for single sockets > > > Or you could do this only if multiple packets belong to the same > > single connection (basically with a one hit cache); but then it would > > But for sends, why does same or different connection matter ? There is > no aggregating of skbs. I wasn't talking about sending. But there actually is :- TSO/GSO. -Andi > > Thanks, > > - KK > ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-11 9:37 ` Andi Kleen @ 2007-05-11 8:50 ` Krishna Kumar2 2007-05-11 11:16 ` Roland Dreier 1 sibling, 0 replies; 96+ messages in thread From: Krishna Kumar2 @ 2007-05-11 8:50 UTC (permalink / raw) To: Andi Kleen; +Cc: ak, Andi Kleen, netdev Hi Andy, Andi Kleen <ak@suse.de> wrote on 05/11/2007 03:07:14 PM: > But without it aggregation on RX is much less useful because the packets > cannot be kept together after socket demux which happens relatively early > in the packet processing path. Then I misunderstood you, my proposal is only for TX. GSO will not help as it is for large packets to get segmented at the latest point in time to avoid going through the stack multiple times. My proposal is to link multiple packets to send to the device in one shot rather than calling qdisc_restart multiple times. > > But for sends, why does same or different connection matter ? There is > > no aggregating of skbs. > > I wasn't talking about sending. > > But there actually is :- TSO/GSO. Sorry to mis-write what I meant : in my proposal there is no aggregating of skbs so single/multiple connections should not matter. Thanks, - KK ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-11 9:37 ` Andi Kleen 2007-05-11 8:50 ` Krishna Kumar2 @ 2007-05-11 11:16 ` Roland Dreier 2007-05-13 6:00 ` Andi Kleen 1 sibling, 1 reply; 96+ messages in thread From: Roland Dreier @ 2007-05-11 11:16 UTC (permalink / raw) To: Andi Kleen; +Cc: Krishna Kumar2, Andi Kleen, netdev > I wasn't talking about sending. > > But there actually is :- TSO/GSO. As I said before, getting multiple packets in one call to xmit would be nice for amortizing per-xmit overhead in IPoIB. So it would be nice if the cases where the stack does GSO ended up passing all the segments into the driver in one go. - R. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11 11:16 ` Roland Dreier @ 2007-05-13 6:00 ` Andi Kleen 2007-05-15 16:25 ` Roland Dreier
  0 siblings, 1 reply; 96+ messages in thread
From: Andi Kleen @ 2007-05-13 6:00 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Krishna Kumar2, netdev

On Friday 11 May 2007 13:16:44 Roland Dreier wrote:
> > I wasn't talking about sending.
> >
> > But there actually is :- TSO/GSO.
>
> As I said before, getting multiple packets in one call to xmit would
> be nice for amortizing per-xmit overhead in IPoIB. So it would be
> nice if the cases where the stack does GSO ended up passing all the
> segments into the driver in one go.

Well, TCP does up to 64k -- that is what GSO is about.

If you want it for unrelated connections, I'm not sure it would actually
work, because if you already have a deep TX queue it is unlikely the qdisc
will buffer that much. Most packet injection should come from the
individual sockets, which is obviously only a single connection.

-Andi

^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-13 6:00 ` Andi Kleen @ 2007-05-15 16:25 ` Roland Dreier 2007-05-15 20:18 ` David Miller 0 siblings, 1 reply; 96+ messages in thread From: Roland Dreier @ 2007-05-15 16:25 UTC (permalink / raw) To: Andi Kleen; +Cc: Krishna Kumar2, netdev > > As I said before, getting multiple packets in one call to xmit would > > be nice for amortizing per-xmit overhead in IPoIB. So it would be > > nice if the cases where the stack does GSO ended up passing all the > > segments into the driver in one go. > > Well TCP does upto 64k -- that is what GSO is about. I see... the plan would be to add NETIF_F_GSO_SOFTWARE to the device features and use skb_gso_segment() in the netdevice driver? (I just studied GSO more carefully -- I hadn't realized that was possible) I'll have to think about implementing that for IPoIB. One issue I see is if I have, say, 4 free entries in my send queue and skb_gso_segment() gives me back 5 packets to send. It's not clear I can recover at that point -- I guess I have to check against gso_segs in the xmit routine before actually doing the segmentation. - R. ^ permalink raw reply [flat|nested] 96+ messages in thread
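A sketch of what driver-side GSO could look like, loosely modelled on the
tg3 workaround mentioned later in the thread; my_ring_space() and
my_xmit_one() are invented driver internals, and the "* 3" bound is only a
guess here (the estimate is discussed in the messages that follow).

#include <linux/err.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static int my_gso_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct sk_buff *segs, *nskb;

        /* make sure the worst case fits before committing to segment */
        if (my_ring_space(dev) < skb_shinfo(skb)->gso_segs * 3) {
                netif_stop_queue(dev);
                return NETDEV_TX_BUSY;
        }

        segs = skb_gso_segment(skb, dev->features & ~NETIF_F_TSO);
        if (IS_ERR(segs))
                goto out_free;

        do {
                nskb = segs;
                segs = segs->next;
                nskb->next = NULL;
                my_xmit_one(nskb, dev);         /* existing one-packet path */
        } while (segs);

out_free:
        dev_kfree_skb_any(skb);                 /* the original gso skb */
        return NETDEV_TX_OK;
}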
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-15 16:25 ` Roland Dreier @ 2007-05-15 20:18 ` David Miller 2007-05-15 20:52 ` Roland Dreier 0 siblings, 1 reply; 96+ messages in thread From: David Miller @ 2007-05-15 20:18 UTC (permalink / raw) To: rdreier; +Cc: ak, krkumar2, netdev From: Roland Dreier <rdreier@cisco.com> Date: Tue, 15 May 2007 09:25:28 -0700 > I'll have to think about implementing that for IPoIB. One issue I see > is if I have, say, 4 free entries in my send queue and skb_gso_segment() > gives me back 5 packets to send. It's not clear I can recover at that > point -- I guess I have to check against gso_segs in the xmit routine > before actually doing the segmentation. I'd suggest adding a fudge factor to your free TX space, which is advisable anyways so that when TX is woken up, more of the transfer from queue to device can happen in a batch-like fashion. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-15 20:18 ` David Miller @ 2007-05-15 20:52 ` Roland Dreier 2007-05-15 21:48 ` Michael Chan 0 siblings, 1 reply; 96+ messages in thread From: Roland Dreier @ 2007-05-15 20:52 UTC (permalink / raw) To: David Miller; +Cc: ak, krkumar2, netdev > > I'll have to think about implementing that for IPoIB. One issue I see > > is if I have, say, 4 free entries in my send queue and skb_gso_segment() > > gives me back 5 packets to send. It's not clear I can recover at that > > point -- I guess I have to check against gso_segs in the xmit routine > > before actually doing the segmentation. > > I'd suggest adding a fudge factor to your free TX space, which > is advisable anyways so that when TX is woken up, more of > the transfer from queue to device can happen in a batch-like > fashion. Well, IPoIB doesn't do netif_wake_queue() until half the device's TX queue is free, so we should get batching. However, I'm not sure that I can count on a fudge factor ensuring that there's enough space to handle everything skb_gso_segment() gives me -- is there any reliable way to get an upper bound on how many segments a given gso skb will use when it's segmented? - R. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-15 20:52 ` Roland Dreier @ 2007-05-15 21:48 ` Michael Chan 2007-05-15 21:08 ` Roland Dreier 0 siblings, 1 reply; 96+ messages in thread From: Michael Chan @ 2007-05-15 21:48 UTC (permalink / raw) To: Roland Dreier; +Cc: David Miller, ak, krkumar2, netdev On Tue, 2007-05-15 at 13:52 -0700, Roland Dreier wrote: > Well, IPoIB doesn't do netif_wake_queue() until half the device's TX > queue is free, so we should get batching. However, I'm not sure that > I can count on a fudge factor ensuring that there's enough space to > handle everything skb_gso_segment() gives me -- is there any reliable > way to get an upper bound on how many segments a given gso skb will > use when it's segmented? Take a look at tg3.c. I use (gso_segs * 3) as the upper bound. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-15 21:48 ` Michael Chan @ 2007-05-15 21:08 ` Roland Dreier 2007-05-15 22:05 ` Michael Chan 0 siblings, 1 reply; 96+ messages in thread From: Roland Dreier @ 2007-05-15 21:08 UTC (permalink / raw) To: Michael Chan; +Cc: David Miller, ak, krkumar2, netdev > > Well, IPoIB doesn't do netif_wake_queue() until half the device's TX > > queue is free, so we should get batching. However, I'm not sure that > > I can count on a fudge factor ensuring that there's enough space to > > handle everything skb_gso_segment() gives me -- is there any reliable > > way to get an upper bound on how many segments a given gso skb will > > use when it's segmented? > > Take a look at tg3.c. I use (gso_segs * 3) as the upper bound. Thanks for the pointer... I noticed that code, but could you tell me where the "* 3" comes from? - R. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-15 21:08 ` Roland Dreier @ 2007-05-15 22:05 ` Michael Chan 2007-05-15 21:28 ` David Miller 0 siblings, 1 reply; 96+ messages in thread From: Michael Chan @ 2007-05-15 22:05 UTC (permalink / raw) To: Roland Dreier; +Cc: David Miller, ak, krkumar2, netdev On Tue, 2007-05-15 at 14:08 -0700, Roland Dreier wrote: > > > Well, IPoIB doesn't do netif_wake_queue() until half the device's TX > > > queue is free, so we should get batching. However, I'm not sure that > > > I can count on a fudge factor ensuring that there's enough space to > > > handle everything skb_gso_segment() gives me -- is there any reliable > > > way to get an upper bound on how many segments a given gso skb will > > > use when it's segmented? > > > > Take a look at tg3.c. I use (gso_segs * 3) as the upper bound. > > Thanks for the pointer... I noticed that code, but could you tell me > where the "* 3" comes from? > For each gso_seg, there will be a header and the payload may span 2 pages for 1500-byte packets. We always assume 1500-byte packets because the buggy chips do not support jumbo frames. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-15 22:05 ` Michael Chan @ 2007-05-15 21:28 ` David Miller 2007-05-18 7:04 ` Michael Chan 0 siblings, 1 reply; 96+ messages in thread From: David Miller @ 2007-05-15 21:28 UTC (permalink / raw) To: mchan; +Cc: rdreier, ak, krkumar2, netdev From: "Michael Chan" <mchan@broadcom.com> Date: Tue, 15 May 2007 15:05:28 -0700 > On Tue, 2007-05-15 at 14:08 -0700, Roland Dreier wrote: > > > > Well, IPoIB doesn't do netif_wake_queue() until half the device's TX > > > > queue is free, so we should get batching. However, I'm not sure that > > > > I can count on a fudge factor ensuring that there's enough space to > > > > handle everything skb_gso_segment() gives me -- is there any reliable > > > > way to get an upper bound on how many segments a given gso skb will > > > > use when it's segmented? > > > > > > Take a look at tg3.c. I use (gso_segs * 3) as the upper bound. > > > > Thanks for the pointer... I noticed that code, but could you tell me > > where the "* 3" comes from? > > > For each gso_seg, there will be a header and the payload may span 2 > pages for 1500-byte packets. We always assume 1500-byte packets because > the buggy chips do not support jumbo frames. Correct. I think there may be a case where you could see up to 4 segments. If the user corks the TCP socket, does a sendmsg() (which puts the data in the per-socket page) then does a sendfile() you'll see something like: skb->data IP, TCP, ethernet headers, etc. page0 sendmsg() data page1 sendfile page2 sendfile Ie. this can happen if the sendfile() part starts near the end of a page, so it would get split even for a 1500 MTU frame. Even more complex variants are possible if the user does tiny sendfile() requests to different pages within the file. So in fact it can span up to N pages. But there is an upper limit defined by the original GSO frame, and that is controlled by MAX_SKB_FRAGS, so at most you would see MAX_SKB_FRAGS plus some small constant. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits 2007-05-15 21:28 ` David Miller @ 2007-05-18 7:04 ` Michael Chan 0 siblings, 0 replies; 96+ messages in thread From: Michael Chan @ 2007-05-18 7:04 UTC (permalink / raw) To: David Miller; +Cc: rdreier, ak, krkumar2, netdev David Miller wrote: > From: "Michael Chan" <mchan@broadcom.com> > Date: Tue, 15 May 2007 15:05:28 -0700 > > > For each gso_seg, there will be a header and the payload may span 2 > > pages for 1500-byte packets. We always assume 1500-byte packets > > because the buggy chips do not support jumbo frames. > > Correct. > > I think there may be a case where you could see up to 4 segments. > If the user corks the TCP socket, does a sendmsg() (which puts > the data in the per-socket page) then does a sendfile() you'll > see something like: > > skb->data IP, TCP, ethernet headers, etc. > page0 sendmsg() data > page1 sendfile > page2 sendfile > > Ie. this can happen if the sendfile() part starts near the > end of a page, so it would get split even for a 1500 MTU > frame. > > Even more complex variants are possible if the user does > tiny sendfile() requests to different pages within the file. > > So in fact it can span up to N pages. > > But there is an upper limit defined by the original GSO > frame, and that is controlled by MAX_SKB_FRAGS, so at most > you would see MAX_SKB_FRAGS plus some small constant. > > I see. Is the following a better estimate? (gso_segs * 2) + skb_shinfo(skb)->nr_frags For each gso_seg, you get at least a header segment and a segment from a page for 1500-byte packets. You can get more segments if packets cross page boundaries or if there are tiny sendfile pages or some sendmsg pages mixed in. But these additional segments will never be more than nr_frags. ^ permalink raw reply [flat|nested] 96+ messages in thread
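Written out as code, the estimate being asked about might look like the
helper below; my_ring_space() is a stand-in for however a driver counts
free descriptors, and whether the bound is actually tight enough is exactly
the question left open above.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* header + data descriptor per segment, plus at most nr_frags extra for
 * page-boundary splits and small sendmsg/sendfile chunks */
static inline unsigned int gso_desc_estimate(struct sk_buff *skb)
{
        return skb_shinfo(skb)->gso_segs * 2 + skb_shinfo(skb)->nr_frags;
}

static int my_room_for_gso(struct net_device *dev, struct sk_buff *skb)
{
        return my_ring_space(dev) >= gso_desc_estimate(skb);
}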