netdev.vger.kernel.org archive mirror
* [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
@ 2009-12-08 22:50 Sridhar Samudrala
  2009-12-13 12:25 ` Herbert Xu
  0 siblings, 1 reply; 58+ messages in thread
From: Sridhar Samudrala @ 2009-12-08 22:50 UTC (permalink / raw)
  To: rusty, mst; +Cc: netdev

When testing vhost-net patches with a guest running Linux 2.6.32, I am
running into "Unexpected full queue" warning messages from start_xmit() in
virtio_net.c, causing a lot of requeues and a drastic reduction in throughput.

With a guest running 2.6.31, I get guest-to-host throughput of around 7000Mb/s,
but it drops to around 3200Mb/s with 2.6.32.

I don't see this problem with usermode virtio in qemu, but there I get a
throughput of only 2700Mb/s with both 2.6.31 and 2.6.32.

The following patch fixes this problem by dropping the skb, rather than
requeuing it to the qdisc, when it cannot be added to the ring buffer. With
this patch, I see performance similar to 2.6.31 with vhost-net.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b9e002f..307cfd6 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -528,7 +528,11 @@ again:
 			netif_start_queue(dev);
 			goto again;
 		}
-		return NETDEV_TX_BUSY;
+
+		/* drop the skb under stress. */ 
+ 		vi->dev->stats.tx_dropped++;
+		kfree_skb(skb);
+		return NETDEV_TX_OK;
 	}
 	vi->svq->vq_ops->kick(vi->svq);
 

Thanks
Sridhar

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-08 22:50 Sridhar Samudrala
@ 2009-12-13 12:25 ` Herbert Xu
  2009-12-13 23:40   ` Michael S. Tsirkin
  2009-12-16  2:41   ` Rusty Russell
  0 siblings, 2 replies; 58+ messages in thread
From: Herbert Xu @ 2009-12-13 12:25 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: rusty, mst, netdev

Sridhar Samudrala <sri@us.ibm.com> wrote:
> When testing vhost-net patches with a guest running Linux 2.6.32, I am
> running into "Unexpected full queue" warning messages from start_xmit() in
> virtio_net.c, causing a lot of requeues and a drastic reduction in throughput.
> 
> With a guest running 2.6.31, I get guest-to-host throughput of around 7000Mb/s,
> but it drops to around 3200Mb/s with 2.6.32.
> 
> I don't see this problem with usermode virtio in qemu, but there I get a
> throughput of only 2700Mb/s with both 2.6.31 and 2.6.32.
> 
> The following patch fixes this problem by dropping the skb, rather than
> requeuing it to the qdisc, when it cannot be added to the ring buffer. With
> this patch, I see performance similar to 2.6.31 with vhost-net.
> 
> Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index b9e002f..307cfd6 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -528,7 +528,11 @@ again:
>                        netif_start_queue(dev);
>                        goto again;
>                }
> -               return NETDEV_TX_BUSY;
> +
> +               /* drop the skb under stress. */ 
> +               vi->dev->stats.tx_dropped++;
> +               kfree_skb(skb);
> +               return NETDEV_TX_OK;
>        }

This patch only hides the real problem.  The issue is that we
should never start the queue unless we can accommodate a full
64KB packet.

In your case, whenever we have the space for a single slot we'd
start the queue.  However, if the head of the queue required more
than one slot it would immediately return BUSY and stop the queue
again.

Your patch simply drops the packet (likely to be TSO) which would
end up upsetting the TCP transmitter.

Please see what a real Ethernet driver (e.g., tg3 or ixgbe) does
for queue management.
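The stop/wake discipline Herbert is pointing at can be sketched as a toy model.
This is an illustration of the pattern only, with made-up names and a simulated
free-slot counter, not the real driver or virtio API:

```c
/* Toy model of the "stop early, wake late" queue discipline used by
 * real NIC drivers: stop the queue while it still has room, as soon as
 * a worst-case packet might no longer fit, and wake it only once a
 * worst-case packet is guaranteed to fit again.  All names here are
 * illustrative assumptions, not kernel APIs. */
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_SKB_FRAGS 18               /* ~64KB / 4K pages + 2 */
#define TOY_WAKE_THRESHOLD (2 + TOY_MAX_SKB_FRAGS)

struct toy_ring {
	int free_slots;
	bool queue_stopped;
};

/* After queueing a packet: stop if the next worst-case packet might
 * not fit, instead of returning TX_BUSY later. */
static void toy_after_xmit(struct toy_ring *r, int slots_used)
{
	r->free_slots -= slots_used;
	if (r->free_slots < TOY_WAKE_THRESHOLD)
		r->queue_stopped = true;
}

/* On a completion interrupt: reclaim slots, and wake only once a full
 * worst-case packet fits, avoiding the stop/wake ping-pong that causes
 * the regression discussed above. */
static void toy_on_complete(struct toy_ring *r, int slots_reclaimed)
{
	r->free_slots += slots_reclaimed;
	if (r->queue_stopped && r->free_slots >= TOY_WAKE_THRESHOLD)
		r->queue_stopped = false;
}
```

Waking on a single free slot, as the 2.6.32 code effectively did, makes a
multi-slot TSO packet at the head of the queue bounce straight back to a
stopped queue; the threshold above is what prevents that.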

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-13 12:25 ` Herbert Xu
@ 2009-12-13 23:40   ` Michael S. Tsirkin
  2009-12-15 14:42     ` Herbert Xu
  2009-12-16  2:41   ` Rusty Russell
  1 sibling, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2009-12-13 23:40 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Sridhar Samudrala, rusty, netdev

On Sun, Dec 13, 2009 at 08:25:12PM +0800, Herbert Xu wrote:
> Sridhar Samudrala <sri@us.ibm.com> wrote:
> > When testing vhost-net patches with a guest running Linux 2.6.32, I am
> > running into "Unexpected full queue" warning messages from start_xmit() in
> > virtio_net.c, causing a lot of requeues and a drastic reduction in throughput.
> > 
> > With a guest running 2.6.31, I get guest-to-host throughput of around 7000Mb/s,
> > but it drops to around 3200Mb/s with 2.6.32.
> > 
> > I don't see this problem with usermode virtio in qemu, but there I get a
> > throughput of only 2700Mb/s with both 2.6.31 and 2.6.32.
> > 
> > The following patch fixes this problem by dropping the skb, rather than
> > requeuing it to the qdisc, when it cannot be added to the ring buffer. With
> > this patch, I see performance similar to 2.6.31 with vhost-net.
> > 
> > Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
> > 
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index b9e002f..307cfd6 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -528,7 +528,11 @@ again:
> >                        netif_start_queue(dev);
> >                        goto again;
> >                }
> > -               return NETDEV_TX_BUSY;
> > +
> > +               /* drop the skb under stress. */ 
> > +               vi->dev->stats.tx_dropped++;
> > +               kfree_skb(skb);
> > +               return NETDEV_TX_OK;
> >        }
> 
> This patch only hides the real problem.  The issue is that we
> should never start the queue unless we can accommodate a full
> 64KB packet.
> 
> In your case, whenever we have the space for a single slot we'd
> start the queue.  However, if the head of the queue required more
> than one slot it would immediately return BUSY and stop the queue
> again.
> 
> Your patch simply drops the packet (likely to be TSO) which would
> end up upsetting the TCP transmitter.
> 
> Please see what a real Ethernet driver (e.g., tg3 or ixgbe) does
> for queue management.
> 
> Cheers,

An interesting thing is that
48925e372f04f5e35fec6269127c62b2c71ab794
was supposed to do this already.



* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-13 23:40   ` Michael S. Tsirkin
@ 2009-12-15 14:42     ` Herbert Xu
  2009-12-15 16:26       ` Sridhar Samudrala
  2009-12-15 23:32       ` Michael S. Tsirkin
  0 siblings, 2 replies; 58+ messages in thread
From: Herbert Xu @ 2009-12-15 14:42 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Sridhar Samudrala, rusty, netdev

On Mon, Dec 14, 2009 at 01:40:55AM +0200, Michael S. Tsirkin wrote:
>
> An interesting thing is that
> 48925e372f04f5e35fec6269127c62b2c71ab794
> was supposed to do this already.

But Rusty missed the queue wake call.  That too needs to check
that the required amount of space exists.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-15 14:42     ` Herbert Xu
@ 2009-12-15 16:26       ` Sridhar Samudrala
  2009-12-16  1:21         ` Herbert Xu
  2009-12-15 23:32       ` Michael S. Tsirkin
  1 sibling, 1 reply; 58+ messages in thread
From: Sridhar Samudrala @ 2009-12-15 16:26 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Michael S. Tsirkin, rusty, netdev

On Tue, 2009-12-15 at 22:42 +0800, Herbert Xu wrote:
> On Mon, Dec 14, 2009 at 01:40:55AM +0200, Michael S. Tsirkin wrote:
> >
> > An interesting thing is that
> > 48925e372f04f5e35fec6269127c62b2c71ab794
> > was supposed to do this already.
> 
> But Rusty missed the queue wake call.  That too needs to check
> that the required amount of space exists.

Yes, I noticed that. skb_xmit_done() doesn't check for MAX_SKB_FRAGS+2
descriptors before doing a netif_wake_queue(). We don't have any easy way
to get the number of free descriptors in the virtqueue, and we may need a
vring_get_num_free() interface.

Also, can we call free_old_xmit_skbs() to free up any used descriptors in
the callback (skb_xmit_done) context? I tried this, but hit a panic in
free_old_xmit_skbs() (__skb_unlink() trying to remove the skb from the
vi->send queue). It looks like we need to add the skb to the vi->send
queue before doing a kick.
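A vring_get_num_free() of the sort suggested above might look like the
following sketch. The struct and its num_free field are assumptions for
illustration only; the real vring_virtqueue keeps this count as internal
state, and this is not the interface that any kernel actually shipped:

```c
/* Sketch of a hypothetical vring_get_num_free() interface and how the
 * TX-done path could use it.  toy_vring and num_free are illustrative
 * stand-ins for the real vring bookkeeping. */
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_SKB_FRAGS 18    /* ~64KB / 4K pages + 2 */

struct toy_vring {
	unsigned int num_free;  /* maintained by add_buf()/get_buf() */
};

static unsigned int vring_get_num_free(const struct toy_vring *vq)
{
	return vq->num_free;
}

/* What skb_xmit_done() could then check before netif_wake_queue():
 * only wake once a worst-case packet (header + all frags) fits. */
static bool toy_ok_to_wake(const struct toy_vring *vq)
{
	return vring_get_num_free(vq) >= 2 + TOY_MAX_SKB_FRAGS;
}
```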

Thanks
Sridhar



* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-15 14:42     ` Herbert Xu
  2009-12-15 16:26       ` Sridhar Samudrala
@ 2009-12-15 23:32       ` Michael S. Tsirkin
  2009-12-16  1:58         ` Herbert Xu
  2009-12-16  4:37         ` Rusty Russell
  1 sibling, 2 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2009-12-15 23:32 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Sridhar Samudrala, rusty, netdev

On Tue, Dec 15, 2009 at 10:42:52PM +0800, Herbert Xu wrote:
> On Mon, Dec 14, 2009 at 01:40:55AM +0200, Michael S. Tsirkin wrote:
> >
> > An interesting thing is that
> > 48925e372f04f5e35fec6269127c62b2c71ab794
> > was supposed to do this already.
> 
> But Rusty missed the queue wake call.  That too needs to check
> that the required amount of space exists.

Right. Hmm. So for this to work we'll need to
1. Free skb upon interrupt instead of
   waiting for the next xmit call
2. Add API to query free ring capacity

Rusty, sounds like a good plan?

We could also extend host to delay interrupt
until there is sufficient TX capacity
but of course we also need to support
old hosts as well.




* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-15 16:26       ` Sridhar Samudrala
@ 2009-12-16  1:21         ` Herbert Xu
  0 siblings, 0 replies; 58+ messages in thread
From: Herbert Xu @ 2009-12-16  1:21 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: Michael S. Tsirkin, rusty, netdev

On Tue, Dec 15, 2009 at 08:26:20AM -0800, Sridhar Samudrala wrote:
> 
> Also, can we call free_old_xmit_skbs() to free up any used descriptors in
> the callback (skb_xmit_done) context? I tried this, but hit a panic in
> free_old_xmit_skbs() (__skb_unlink() trying to remove the skb from the
> vi->send queue). It looks like we need to add the skb to the vi->send
> queue before doing a kick.

No, you can't free skbs from hard IRQ context.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-15 23:32       ` Michael S. Tsirkin
@ 2009-12-16  1:58         ` Herbert Xu
  2009-12-16  4:37         ` Rusty Russell
  1 sibling, 0 replies; 58+ messages in thread
From: Herbert Xu @ 2009-12-16  1:58 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Sridhar Samudrala, rusty, netdev

On Wed, Dec 16, 2009 at 01:32:27AM +0200, Michael S. Tsirkin wrote:
> 
> Right. Hmm. So for this to work we'll need to
> 1. Free skb upon interrupt instead of
>    waiting for the next xmit call

No the next xmit call can free the skbs.

> 2. Add API to query free ring capacity

This is all you need.

> We could also extend host to delay interrupt
> until there is sufficient TX capacity
> but of course we also need to support
> old hosts as well.

Yes, that's what real hardware does too.  They are able to delay
the interrupt until the number of free entries exceeds a minimum
threshold.
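The effect of such a threshold on interrupt rate can be shown with a small
count. This is a toy model of the coalescing idea only, not vhost's or any
NIC's actual signalling logic:

```c
/* Toy model: the host raises a TX-complete interrupt only once enough
 * completions have accumulated to hand the guest `thresh` free slots,
 * instead of signalling once per completion. */
#include <assert.h>

static unsigned int toy_irqs_for(unsigned int completions,
				 unsigned int thresh)
{
	unsigned int irqs = 0, pending = 0;
	unsigned int i;

	for (i = 0; i < completions; i++) {
		pending++;
		if (pending >= thresh) {  /* threshold reached: one IRQ */
			irqs++;
			pending = 0;
		}
	}
	return irqs;
}
```

With per-completion signalling (threshold 1), 100 completions cost 100
interrupts; with a threshold of 2+MAX_SKB_FRAGS slots they cost a handful,
which is the overhead reduction being discussed.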

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-13 12:25 ` Herbert Xu
  2009-12-13 23:40   ` Michael S. Tsirkin
@ 2009-12-16  2:41   ` Rusty Russell
  2009-12-16  2:53     ` Herbert Xu
  2009-12-16 17:42     ` Sridhar Samudrala
  1 sibling, 2 replies; 58+ messages in thread
From: Rusty Russell @ 2009-12-16  2:41 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Sridhar Samudrala, mst, netdev

On Sun, 13 Dec 2009 10:55:12 pm Herbert Xu wrote:
> Please see what a real Ethernet driver (e.g., tg3 or ixgbe) does
> for queue management.

Hi Herbert,

   Thanks for the hint.  They seem to use NAPI for xmit cleanup, so that's
what we should do?  I'll try, but such a rewrite doesn't belong in 2.6.32.

Sridhar, if you simply remove the dev_warn() does that restore some
performance?  I'll work on a proper fix for 2.6.33 now.

Thanks,
Rusty.


* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-16  2:41   ` Rusty Russell
@ 2009-12-16  2:53     ` Herbert Xu
  2009-12-16 12:45       ` Rusty Russell
  2009-12-16 17:42     ` Sridhar Samudrala
  1 sibling, 1 reply; 58+ messages in thread
From: Herbert Xu @ 2009-12-16  2:53 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Sridhar Samudrala, mst, netdev

On Wed, Dec 16, 2009 at 01:11:40PM +1030, Rusty Russell wrote:
> 
>    Thanks for the hint.  They seem to use NAPI for xmit cleanup, so that's
> what we should do?  I'll try, but such a rewrite doesn't belong in 2.6.32.

Well, it depends.  Real drivers can't change the hardware, so they're
stuck with whatever it does.  For virtio we do have the
flexibility of modifying the backend.

Having said that, for existing backends that will signal when there
is just a single free entry on the queue, something like NAPI could
reduce the overhead associated with the IRQs.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-15 23:32       ` Michael S. Tsirkin
  2009-12-16  1:58         ` Herbert Xu
@ 2009-12-16  4:37         ` Rusty Russell
  2009-12-16 10:37           ` Michael S. Tsirkin
  1 sibling, 1 reply; 58+ messages in thread
From: Rusty Russell @ 2009-12-16  4:37 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Herbert Xu, Sridhar Samudrala, netdev

On Wed, 16 Dec 2009 10:02:27 am Michael S. Tsirkin wrote:
> Right. Hmm. So for this to work we'll need to
> 1. Free skb upon interrupt instead of
>    waiting for the next xmit call
> 2. Add API to query free ring capacity
> 
> Rusty, sounds like a good plan?

Well, the query stuff is not too bad, but I can't completely convince myself
it's race-free.  We don't want to do locking.

A NAPI-style solution seems cleaner, and I'm testing that now.

> We could also extend host to delay interrupt
> until there is sufficient TX capacity
> but of course we also need to support
> old hosts as well.

Xen does this, and I rejected it in favor of simple enable/disable
flags in the original virtio design.  It was probably wrong: while the
guest can enable on a "few remaining" heuristic, it's going to have
latency.  The host can do a more timely decision.

There's nothing stopping the Host from doing this heuristic today, of
course: the DISABLE flag is advisory only.  But let's check the limitations
of the guest-enable approach first?

Thanks,
Rusty.


* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-16  4:37         ` Rusty Russell
@ 2009-12-16 10:37           ` Michael S. Tsirkin
  0 siblings, 0 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2009-12-16 10:37 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Herbert Xu, Sridhar Samudrala, netdev

On Wed, Dec 16, 2009 at 03:07:53PM +1030, Rusty Russell wrote:
> On Wed, 16 Dec 2009 10:02:27 am Michael S. Tsirkin wrote:
> > Right. Hmm. So for this to work we'll need to
> > 1. Free skb upon interrupt instead of
> >    waiting for the next xmit call
> > 2. Add API to query free ring capacity
> > 
> > Rusty, sounds like a good plan?
> 
> Well, the query stuff is not too bad, but I can't completely convince myself
> it's race-free.  We don't want to do locking.

We do not want to lock TX? For performance?
It might not be a problem: interrupts only start
to be enabled when we are running out of space on TX,
so this is a kind of slow path anyway now.

If necessary, we can also only do this range check
if netif_tx_queue_stopped.

> A NAPI-style solution seems cleaner, and I'm testing that now.

Hmm, as you say separately, this might not be 2.6.32 material though.
Maybe a simple capacity check will be safe enough for 2.6.32.

> > We could also extend host to delay interrupt
> > until there is sufficient TX capacity
> > but of course we also need to support
> > old hosts as well.
> 
> Xen does this, and I rejected it in favor of simple enable/disable
> flags in the original virtio design.  It was probably wrong: while the
> guest can enable on a "few remaining" heuristic, it's going to have
> latency.  The host can do a more timely decision.
> 
> There's nothing stopping the Host from doing this heuristic today, of
> course: the DISABLE flag is advisory only.

Heh, but this might hurt performance on guests that do
assume it's used correctly. A new feature might be cleaner?

>  But let's check the limitations
> of the guest-enable approach first?

Sure.

> Thanks,
> Rusty.


* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-16  2:53     ` Herbert Xu
@ 2009-12-16 12:45       ` Rusty Russell
  2009-12-16 13:22         ` Michael S. Tsirkin
                           ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Rusty Russell @ 2009-12-16 12:45 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Sridhar Samudrala, mst, netdev

On Wed, 16 Dec 2009 01:23:31 pm Herbert Xu wrote:
> On Wed, Dec 16, 2009 at 01:11:40PM +1030, Rusty Russell wrote:
> > 
> >    Thanks for the hint.  They seem to use NAPI for xmit cleanup, so that's
> > what we should do?  I'll try, but such a rewrite doesn't belong in 2.6.32.
> 
> Well, it depends.  Real drivers can't change the hardware, so they're
> stuck with whatever it does.  For virtio we do have the
> flexibility of modifying the backend.
> 
> Having said that, for existing backends that will signal when there
> is just a single free entry on the queue, something like NAPI could
> reduce the overhead associated with the IRQs.

OK, this is unfortunately untested, but I wanted to send it out tonight:

virtio_net: use NAPI for xmit (UNTESTED)

This is closer to the way tg3 and ixgbe do it: use the NAPI framework to
free transmitted packets.  It neatens things a little as well.

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -47,6 +47,9 @@ struct virtnet_info
 	struct napi_struct napi;
 	unsigned int status;
 
+	/* We free packets and decide whether to restart xmit here. */
+	struct napi_struct xmit_napi;
+
 	/* Number of input buffers, and max we've ever had. */
 	unsigned int num, max;
 
@@ -60,6 +63,9 @@ struct virtnet_info
 	struct sk_buff_head recv;
 	struct sk_buff_head send;
 
+	/* Capacity left in xmit queue. */
+	unsigned int capacity;
+
 	/* Work struct for refilling if we run low on memory. */
 	struct delayed_work refill;
 
@@ -111,11 +117,8 @@ static void skb_xmit_done(struct virtque
 {
 	struct virtnet_info *vi = svq->vdev->priv;
 
-	/* Suppress further interrupts. */
-	svq->vq_ops->disable_cb(svq);
-
 	/* We were probably waiting for more output buffers. */
-	netif_wake_queue(vi->dev);
+	napi_schedule(&vi->xmit_napi);
 }
 
 static void receive_skb(struct net_device *dev, struct sk_buff *skb,
@@ -446,7 +449,7 @@ static unsigned int free_old_xmit_skbs(s
 
 	while ((skb = vi->svq->vq_ops->get_buf(vi->svq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
-		__skb_unlink(skb, &vi->send);
+		skb_unlink(skb, &vi->send);
 		vi->dev->stats.tx_bytes += skb->len;
 		vi->dev->stats.tx_packets++;
 		tot_sgs += skb_vnet_hdr(skb)->num_sg;
@@ -455,6 +458,23 @@ static unsigned int free_old_xmit_skbs(s
 	return tot_sgs;
 }
 
+static int virtnet_xmit_poll(struct napi_struct *xmit_napi, int budget)
+{
+	struct virtnet_info *vi =
+		container_of(xmit_napi, struct virtnet_info, xmit_napi);
+
+	if (netif_queue_stopped(vi->dev)) {
+		vi->capacity += free_old_xmit_skbs(vi);
+		if (vi->capacity >= 2 + MAX_SKB_FRAGS) {
+			/* Suppress further xmit interrupts. */
+			vi->svq->vq_ops->disable_cb(vi->svq);
+			napi_complete(xmit_napi);
+			netif_wake_queue(vi->dev);
+		}
+	}
+	return 1;
+}
+
 static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
 {
 	struct scatterlist sg[2+MAX_SKB_FRAGS];
@@ -509,34 +529,22 @@ static netdev_tx_t start_xmit(struct sk_
 	struct virtnet_info *vi = netdev_priv(dev);
 	int capacity;
 
-again:
-	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(vi);
-
 	/* Try to transmit */
+	skb_queue_head(&vi->send, skb);
 	capacity = xmit_skb(vi, skb);
 
 	/* This can happen with OOM and indirect buffers. */
 	if (unlikely(capacity < 0)) {
+		skb_unlink(skb, &vi->send);
 		netif_stop_queue(dev);
 		dev_warn(&dev->dev, "Unexpected full queue\n");
-		if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq))) {
-			vi->svq->vq_ops->disable_cb(vi->svq);
-			netif_start_queue(dev);
-			goto again;
-		}
+		/* If we missed an interrupt, we let virtnet_xmit_poll deal. */
+		if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq)))
+			napi_schedule(&vi->xmit_napi);
 		return NETDEV_TX_BUSY;
 	}
 	vi->svq->vq_ops->kick(vi->svq);
-
-	/*
-	 * Put new one in send queue.  You'd expect we'd need this before
-	 * xmit_skb calls add_buf(), since the callback can be triggered
-	 * immediately after that.  But since the callback just triggers
-	 * another call back here, normal network xmit locking prevents the
-	 * race.
-	 */
-	__skb_queue_head(&vi->send, skb);
+	vi->capacity = capacity;
 
 	/* Don't wait up for transmitted skbs to be freed. */
 	skb_orphan(skb);
@@ -544,15 +552,16 @@ again:
 
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
-	if (capacity < 2+MAX_SKB_FRAGS) {
-		netif_stop_queue(dev);
-		if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq))) {
-			/* More just got used, free them then recheck. */
-			capacity += free_old_xmit_skbs(vi);
-			if (capacity >= 2+MAX_SKB_FRAGS) {
-				netif_start_queue(dev);
-				vi->svq->vq_ops->disable_cb(vi->svq);
-			}
+	if (unlikely(capacity < 2+MAX_SKB_FRAGS)) {
+		/* Free old skbs; might make more capacity. */
+		vi->capacity = capacity + free_old_xmit_skbs(vi);
+		if (unlikely(vi->capacity < 2+MAX_SKB_FRAGS)) {
+			/* Make sure virtnet_xmit_poll sees updated capacity */
+			wmb();
+			netif_stop_queue(dev);
+			/* Missed xmit irq? virtnet_xmit_poll will deal. */
+			if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq)))
+				napi_schedule(&vi->xmit_napi);
 		}
 	}
 
@@ -590,6 +599,7 @@ static int virtnet_open(struct net_devic
 	struct virtnet_info *vi = netdev_priv(dev);
 
 	napi_enable(&vi->napi);
+	napi_enable(&vi->xmit_napi);
 
 	/* If all buffers were filled by other side before we napi_enabled, we
 	 * won't get another interrupt, so process any outstanding packets
@@ -652,6 +662,7 @@ static int virtnet_close(struct net_devi
 	struct virtnet_info *vi = netdev_priv(dev);
 
 	napi_disable(&vi->napi);
+	napi_disable(&vi->xmit_napi);
 
 	return 0;
 }
@@ -883,6 +894,7 @@ static int virtnet_probe(struct virtio_d
 	/* Set up our device-specific information */
 	vi = netdev_priv(dev);
 	netif_napi_add(dev, &vi->napi, virtnet_poll, napi_weight);
+	netif_napi_add(dev, &vi->xmit_napi, virtnet_xmit_poll, 64);
 	vi->dev = dev;
 	vi->vdev = vdev;
 	vdev->priv = vi;


* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-16 12:45       ` Rusty Russell
@ 2009-12-16 13:22         ` Michael S. Tsirkin
  2009-12-16 13:35           ` Herbert Xu
  2009-12-17  2:02           ` Rusty Russell
  2009-12-16 13:30         ` Herbert Xu
  2009-12-17  1:43         ` Sridhar Samudrala
  2 siblings, 2 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2009-12-16 13:22 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Herbert Xu, Sridhar Samudrala, netdev

On Wed, Dec 16, 2009 at 11:15:38PM +1030, Rusty Russell wrote:
> On Wed, 16 Dec 2009 01:23:31 pm Herbert Xu wrote:
> > On Wed, Dec 16, 2009 at 01:11:40PM +1030, Rusty Russell wrote:
> > > 
> > >    Thanks for the hint.  They seem to use NAPI for xmit cleanup, so that's
> > > what we should do?  I'll try, but such a rewrite doesn't belong in 2.6.32.
> > 
> > Well, it depends.  Real drivers can't change the hardware, so they're
> > stuck with whatever it does.  For virtio we do have the
> > flexibility of modifying the backend.
> > 
> > Having said that, for existing backends that will signal when there
> > is just a single free entry on the queue, something like NAPI could
> > reduce the overhead associated with the IRQs.
> 
> OK, this is unfortunately untested, but I wanted to send it out tonight:
> 
> virtio_net: use NAPI for xmit (UNTESTED)
> 
> This is closer to the way tg3 and ixgbe do it: use the NAPI framework to
> free transmitted packets.  It neatens things a little as well.

Looks very neat.

> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -47,6 +47,9 @@ struct virtnet_info
>  	struct napi_struct napi;
>  	unsigned int status;
>  
> +	/* We free packets and decide whether to restart xmit here. */
> +	struct napi_struct xmit_napi;
> +
>  	/* Number of input buffers, and max we've ever had. */
>  	unsigned int num, max;
>  
> @@ -60,6 +63,9 @@ struct virtnet_info
>  	struct sk_buff_head recv;
>  	struct sk_buff_head send;
>  
> +	/* Capacity left in xmit queue. */
> +	unsigned int capacity;
> +
>  	/* Work struct for refilling if we run low on memory. */
>  	struct delayed_work refill;
>  
> @@ -111,11 +117,8 @@ static void skb_xmit_done(struct virtque
>  {
>  	struct virtnet_info *vi = svq->vdev->priv;
>  
> -	/* Suppress further interrupts. */
> -	svq->vq_ops->disable_cb(svq);
> -
>  	/* We were probably waiting for more output buffers. */
> -	netif_wake_queue(vi->dev);
> +	napi_schedule(&vi->xmit_napi);
>  }
>  
>  static void receive_skb(struct net_device *dev, struct sk_buff *skb,
> @@ -446,7 +449,7 @@ static unsigned int free_old_xmit_skbs(s
>  
>  	while ((skb = vi->svq->vq_ops->get_buf(vi->svq, &len)) != NULL) {
>  		pr_debug("Sent skb %p\n", skb);
> -		__skb_unlink(skb, &vi->send);
> +		skb_unlink(skb, &vi->send);
>  		vi->dev->stats.tx_bytes += skb->len;
>  		vi->dev->stats.tx_packets++;
>  		tot_sgs += skb_vnet_hdr(skb)->num_sg;
> @@ -455,6 +458,23 @@ static unsigned int free_old_xmit_skbs(s
>  	return tot_sgs;
>  }
>  
> +static int virtnet_xmit_poll(struct napi_struct *xmit_napi, int budget)
> +{
> +	struct virtnet_info *vi =
> +		container_of(xmit_napi, struct virtnet_info, xmit_napi);
> +
> +	if (netif_queue_stopped(vi->dev)) {

I am a bit concerned here: for example, on link down
you do netif_stop_queue, and start on link up.
So is it enough to check netif_queue_stopped
to verify that tx is not running, and that it is stopped because
it was out of capacity?

It would be very bad if this ran in parallel with TX ...

If this is all safe, maybe add a comment explaining why ...

> +		vi->capacity += free_old_xmit_skbs(vi);
> +		if (vi->capacity >= 2 + MAX_SKB_FRAGS) {
> +			/* Suppress further xmit interrupts. */
> +			vi->svq->vq_ops->disable_cb(vi->svq);
> +			napi_complete(xmit_napi);
> +			netif_wake_queue(vi->dev);
> +		}
> +	}
> +	return 1;
> +}
> +

One concern here:
we are ignoring budget; is this an issue?


>  static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
>  {
>  	struct scatterlist sg[2+MAX_SKB_FRAGS];
> @@ -509,34 +529,22 @@ static netdev_tx_t start_xmit(struct sk_
>  	struct virtnet_info *vi = netdev_priv(dev);
>  	int capacity;
>  
> -again:
> -	/* Free up any pending old buffers before queueing new ones. */
> -	free_old_xmit_skbs(vi);
> -
>  	/* Try to transmit */
> +	skb_queue_head(&vi->send, skb);
>  	capacity = xmit_skb(vi, skb);
>  
>  	/* This can happen with OOM and indirect buffers. */
>  	if (unlikely(capacity < 0)) {
> +		skb_unlink(skb, &vi->send);
>  		netif_stop_queue(dev);
>  		dev_warn(&dev->dev, "Unexpected full queue\n");
> -		if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq))) {
> -			vi->svq->vq_ops->disable_cb(vi->svq);
> -			netif_start_queue(dev);
> -			goto again;
> -		}
> +		/* If we missed an interrupt, we let virtnet_xmit_poll deal. */
> +		if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq)))
> +			napi_schedule(&vi->xmit_napi);
>  		return NETDEV_TX_BUSY;
>  	}
>  	vi->svq->vq_ops->kick(vi->svq);
> -
> -	/*
> -	 * Put new one in send queue.  You'd expect we'd need this before
> -	 * xmit_skb calls add_buf(), since the callback can be triggered
> -	 * immediately after that.  But since the callback just triggers
> -	 * another call back here, normal network xmit locking prevents the
> -	 * race.
> -	 */
> -	__skb_queue_head(&vi->send, skb);
> +	vi->capacity = capacity;
>  
>  	/* Don't wait up for transmitted skbs to be freed. */
>  	skb_orphan(skb);
> @@ -544,15 +552,16 @@ again:
>  
>  	/* Apparently nice girls don't return TX_BUSY; stop the queue
>  	 * before it gets out of hand.  Naturally, this wastes entries. */
> -	if (capacity < 2+MAX_SKB_FRAGS) {
> -		netif_stop_queue(dev);
> -		if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq))) {
> -			/* More just got used, free them then recheck. */
> -			capacity += free_old_xmit_skbs(vi);
> -			if (capacity >= 2+MAX_SKB_FRAGS) {
> -				netif_start_queue(dev);
> -				vi->svq->vq_ops->disable_cb(vi->svq);
> -			}
> +	if (unlikely(capacity < 2+MAX_SKB_FRAGS)) {
> +		/* Free old skbs; might make more capacity. */
> +		vi->capacity = capacity + free_old_xmit_skbs(vi);
> +		if (unlikely(vi->capacity < 2+MAX_SKB_FRAGS)) {
> +			/* Make sure virtnet_xmit_poll sees updated capacity */
> +			wmb();
> +			netif_stop_queue(dev);

Doesn't netif_stop_queue include an atomic op?
If so, doesn't an atomic op imply a wmb already?

> +			/* Missed xmit irq? virtnet_xmit_poll will deal. */
> +			if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq)))
> +				napi_schedule(&vi->xmit_napi);
>  		}
>  	}
>  
> @@ -590,6 +599,7 @@ static int virtnet_open(struct net_devic
>  	struct virtnet_info *vi = netdev_priv(dev);
>  
>  	napi_enable(&vi->napi);
> +	napi_enable(&vi->xmit_napi);
>  
>  	/* If all buffers were filled by other side before we napi_enabled, we
>  	 * won't get another interrupt, so process any outstanding packets
> @@ -652,6 +662,7 @@ static int virtnet_close(struct net_devi
>  	struct virtnet_info *vi = netdev_priv(dev);
>  
>  	napi_disable(&vi->napi);
> +	napi_disable(&vi->xmit_napi);
>  
>  	return 0;
>  }
> @@ -883,6 +894,7 @@ static int virtnet_probe(struct virtio_d
>  	/* Set up our device-specific information */
>  	vi = netdev_priv(dev);
>  	netif_napi_add(dev, &vi->napi, virtnet_poll, napi_weight);
> +	netif_napi_add(dev, &vi->xmit_napi, virtnet_xmit_poll, 64);
>  	vi->dev = dev;
>  	vi->vdev = vdev;
>  	vdev->priv = vi;

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-16 12:45       ` Rusty Russell
  2009-12-16 13:22         ` Michael S. Tsirkin
@ 2009-12-16 13:30         ` Herbert Xu
  2009-12-17  1:43         ` Sridhar Samudrala
  2 siblings, 0 replies; 58+ messages in thread
From: Herbert Xu @ 2009-12-16 13:30 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Sridhar Samudrala, mst, netdev

On Wed, Dec 16, 2009 at 11:15:38PM +1030, Rusty Russell wrote:
> 
> virtio_net: use NAPI for xmit (UNTESTED)
> 
> This is closer to the way tg3 and ixgbe do it: use the NAPI framework to
> free transmitted packets.  It neatens things a little as well.

Looks good to me.

> @@ -544,15 +552,16 @@ again:
>  
>  	/* Apparently nice girls don't return TX_BUSY; stop the queue
>  	 * before it gets out of hand.  Naturally, this wastes entries. */
> -	if (capacity < 2+MAX_SKB_FRAGS) {
> -		netif_stop_queue(dev);
> -		if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq))) {
> -			/* More just got used, free them then recheck. */
> -			capacity += free_old_xmit_skbs(vi);
> -			if (capacity >= 2+MAX_SKB_FRAGS) {
> -				netif_start_queue(dev);
> -				vi->svq->vq_ops->disable_cb(vi->svq);
> -			}
> +	if (unlikely(capacity < 2+MAX_SKB_FRAGS)) {
> +		/* Free old skbs; might make more capacity. */
> +		vi->capacity = capacity + free_old_xmit_skbs(vi);
> +		if (unlikely(vi->capacity < 2+MAX_SKB_FRAGS)) {
> +			/* Make sure virtnet_xmit_poll sees updated capacity */
> +			wmb();

You only need smp_wmb() since this is protecting against the
guest only as opposed to the host.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-16 13:22         ` Michael S. Tsirkin
@ 2009-12-16 13:35           ` Herbert Xu
  2009-12-16 13:38             ` Michael S. Tsirkin
  2009-12-17  2:02           ` Rusty Russell
  1 sibling, 1 reply; 58+ messages in thread
From: Herbert Xu @ 2009-12-16 13:35 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Rusty Russell, Sridhar Samudrala, netdev

On Wed, Dec 16, 2009 at 03:22:18PM +0200, Michael S. Tsirkin wrote:
>
> Doesn't netif_stop_queue include an atomic op?
> If yes, does not atomic imply wmb already?

Unfortunately it doesn't.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-16 13:35           ` Herbert Xu
@ 2009-12-16 13:38             ` Michael S. Tsirkin
  2009-12-16 13:48               ` Herbert Xu
  0 siblings, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2009-12-16 13:38 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Rusty Russell, Sridhar Samudrala, netdev

On Wed, Dec 16, 2009 at 09:35:12PM +0800, Herbert Xu wrote:
> On Wed, Dec 16, 2009 at 03:22:18PM +0200, Michael S. Tsirkin wrote:
> >
> > Doesn't netif_stop_queue include an atomic op?
> > If yes, does not atomic imply wmb already?
> 
> Unfortunately it doesn't.

So do we need an rmb after testing it as well?
Lock would be simpler :)

> Cheers,
> -- 
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-16 13:38             ` Michael S. Tsirkin
@ 2009-12-16 13:48               ` Herbert Xu
  0 siblings, 0 replies; 58+ messages in thread
From: Herbert Xu @ 2009-12-16 13:48 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Rusty Russell, Sridhar Samudrala, netdev

On Wed, Dec 16, 2009 at 03:38:07PM +0200, Michael S. Tsirkin wrote:
> On Wed, Dec 16, 2009 at 09:35:12PM +0800, Herbert Xu wrote:
> > On Wed, Dec 16, 2009 at 03:22:18PM +0200, Michael S. Tsirkin wrote:
> > >
> > > Doesn't netif_stop_queue include an atomic op?
> > > If yes, does not atomic imply wmb already?
> > 
> > Unfortunately it doesn't.
> 
> So do we need an rmb after testing it as well?

You're right.  We should have an smp_rmb on the other path.
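For reference, the store/load pairing being discussed can be sketched in userspace C (a model only, not the driver code: C11 fences stand in for smp_wmb()/smp_rmb(), and the variable names are illustrative):

```c
#include <assert.h>
#include <stdatomic.h>

/* The xmit path publishes vi->capacity before setting the "queue
 * stopped" flag; the poll path must observe the flag before trusting
 * capacity.  The release fence models the wmb()/smp_wmb() before
 * netif_stop_queue(), the acquire fence the smp_rmb() on the other path. */
static int capacity;             /* models vi->capacity */
static atomic_int queue_stopped; /* models the queue-stopped bit */

/* start_xmit side: runs under the tx lock */
static void stop_queue_with_capacity(int cap)
{
	capacity = cap;
	atomic_thread_fence(memory_order_release); /* the smp_wmb() */
	atomic_store_explicit(&queue_stopped, 1, memory_order_relaxed);
}

/* virtnet_xmit_poll side: only reads capacity after seeing the flag */
static int poll_sees_capacity(int *cap_out)
{
	if (!atomic_load_explicit(&queue_stopped, memory_order_relaxed))
		return 0;
	atomic_thread_fence(memory_order_acquire); /* the smp_rmb() */
	*cap_out = capacity;
	return 1;
}
```

Without the acquire-side fence, the flag load and the capacity load may be reordered, which is exactly the hole Michael points at.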

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-16  2:41   ` Rusty Russell
  2009-12-16  2:53     ` Herbert Xu
@ 2009-12-16 17:42     ` Sridhar Samudrala
  1 sibling, 0 replies; 58+ messages in thread
From: Sridhar Samudrala @ 2009-12-16 17:42 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Herbert Xu, mst, netdev

On Wed, 2009-12-16 at 13:11 +1030, Rusty Russell wrote:
> Sridhar, if you simply remove the dev_warn() does that restore some
> performance?  I'll work on a proper fix for 2.6.33 now.

Here is the comparison of a 60sec TCP STREAM test from 2.6.31.6 and 2.6.32 guests
to a 2.6.32 host with vhost-net patches.
Removing the dev_warn() in 2.6.32 doesn't make any significant difference
in throughput. I think the requeues are the main cause for the regression.
Will try your xmit NAPI patch.

linux-2.6.32
------------
$ ./netperf -c -C -H 192.168.122.1 -t TCP_STREAM  -l60
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.122.1 (192.168.122.1) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    60.03      3279.98   81.48    77.79    2.035   3.885  
$ tc -s qdisc show dev eth0
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 24708777531 bytes 1469482 pkt (dropped 0, overlimits 0 requeues 283575) 
 rate 0bit 0pps backlog 0b 0p requeues 283575 
$ ip -s link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 54:52:00:35:e3:74 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast   
    97000940   1469585  0       0       0       0      
    TX: bytes  packets  errors  dropped carrier collsns 
    3233946817 1469533  0       0       0       0      

linux-2.6.31.6
--------------
$ ./netperf -c -C -H 192.168.122.1 -t TCP_STREAM -l60
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.122.1 (192.168.122.1) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    60.03      8038.48   69.44    71.02    0.708   1.448  
$ tc -s qdisc show dev eth0
qdisc pfifo_fast 0: root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 60378698117 bytes 937388 pkt (dropped 0, overlimits 0 requeues 61) 
 rate 0bit 0pps backlog 0b 0p requeues 61 
$ ip -s link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 54:52:00:35:e3:73 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast   
    54702188   828655   0       0       0       0      
    TX: bytes  packets  errors  dropped carrier collsns 
    248753779  937399   0       7       0       0     



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-16 12:45       ` Rusty Russell
  2009-12-16 13:22         ` Michael S. Tsirkin
  2009-12-16 13:30         ` Herbert Xu
@ 2009-12-17  1:43         ` Sridhar Samudrala
  2009-12-17  3:12           ` Herbert Xu
  2009-12-17  3:15           ` Herbert Xu
  2 siblings, 2 replies; 58+ messages in thread
From: Sridhar Samudrala @ 2009-12-17  1:43 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Herbert Xu, mst, netdev

On Wed, 2009-12-16 at 23:15 +1030, Rusty Russell wrote:
> On Wed, 16 Dec 2009 01:23:31 pm Herbert Xu wrote:
> > On Wed, Dec 16, 2009 at 01:11:40PM +1030, Rusty Russell wrote:
> > > 
> > >    Thanks for the hint.  They seem to use NAPI for xmit cleanup, so that's
> > > what we should do?  I'll try, but such a rewrite doesn't belong in 2.6.32.
> > 
> > Well it depends.  Real drivers can't touch the hardware so they're
> > stuck with whatever the hardware does.  For virtio we do have the
> > flexibility of modifying the backend.
> > 
> > Having said that, for existing backends that will signal when there
> > is just a single free entry on the queue something like NAPI could
> > reduce the overhead associated with the IRQs.
> 
> OK, this is unfortunately untested, but wanted to send it out tonight:
> 
> virtio_net: use NAPI for xmit (UNTESTED)
> 
> This is closer to the way tg3 and ixgbe do it: use the NAPI framework to
> free transmitted packets.  It neatens things a little as well.

I had to change virtnet_xmit_poll() to get it working. See below.
With this change, i don't see any 'queue full' warnings, but requeues
are still happening at the qdisc level (sch_direct_xmit() finds that
tx queue is stopped and does requeues).

Not sure why the host is not able to keep up with the guest.

Thanks
Sridhar

> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -47,6 +47,9 @@ struct virtnet_info
>  	struct napi_struct napi;
>  	unsigned int status;
> 
> +	/* We free packets and decide whether to restart xmit here. */
> +	struct napi_struct xmit_napi;
> +
>  	/* Number of input buffers, and max we've ever had. */
>  	unsigned int num, max;
> 
> @@ -60,6 +63,9 @@ struct virtnet_info
>  	struct sk_buff_head recv;
>  	struct sk_buff_head send;
> 
> +	/* Capacity left in xmit queue. */
> +	unsigned int capacity;
> +
>  	/* Work struct for refilling if we run low on memory. */
>  	struct delayed_work refill;
> 
> @@ -111,11 +117,8 @@ static void skb_xmit_done(struct virtque
>  {
>  	struct virtnet_info *vi = svq->vdev->priv;
> 
> -	/* Suppress further interrupts. */
> -	svq->vq_ops->disable_cb(svq);
> -
>  	/* We were probably waiting for more output buffers. */
> -	netif_wake_queue(vi->dev);
> +	napi_schedule(&vi->xmit_napi);
>  }
> 
>  static void receive_skb(struct net_device *dev, struct sk_buff *skb,
> @@ -446,7 +449,7 @@ static unsigned int free_old_xmit_skbs(s
> 
>  	while ((skb = vi->svq->vq_ops->get_buf(vi->svq, &len)) != NULL) {
>  		pr_debug("Sent skb %p\n", skb);
> -		__skb_unlink(skb, &vi->send);
> +		skb_unlink(skb, &vi->send);
>  		vi->dev->stats.tx_bytes += skb->len;
>  		vi->dev->stats.tx_packets++;
>  		tot_sgs += skb_vnet_hdr(skb)->num_sg;
> @@ -455,6 +458,23 @@ static unsigned int free_old_xmit_skbs(s
>  	return tot_sgs;
>  }
> 
> +static int virtnet_xmit_poll(struct napi_struct *xmit_napi, int budget)
> +{
> +	struct virtnet_info *vi =
> +		container_of(xmit_napi, struct virtnet_info, xmit_napi);
> +
> +	if (netif_queue_stopped(vi->dev)) {
> +		vi->capacity += free_old_xmit_skbs(vi);
> +		if (vi->capacity >= 2 + MAX_SKB_FRAGS) {
> +			/* Suppress further xmit interrupts. */
> +			vi->svq->vq_ops->disable_cb(vi->svq);
> +			napi_complete(xmit_napi);
> +			netif_wake_queue(vi->dev);
> +		}
> +	}
> +	return 1;
> +}


+static int virtnet_xmit_poll(struct napi_struct *xmit_napi, int budget)
+{
+	struct virtnet_info *vi =
+		container_of(xmit_napi, struct virtnet_info, xmit_napi);
+
+	vi->capacity += free_old_xmit_skbs(vi);
+
+	if (vi->capacity >= 2 + MAX_SKB_FRAGS) {
+		/* Suppress further xmit interrupts. */
+		vi->svq->vq_ops->disable_cb(vi->svq);
+		napi_complete(xmit_napi);
+		if (netif_queue_stopped(vi->dev))
+			netif_wake_queue(vi->dev);
+	}
+	return 1;
+}



> +
>  static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
>  {
>  	struct scatterlist sg[2+MAX_SKB_FRAGS];
> @@ -509,34 +529,22 @@ static netdev_tx_t start_xmit(struct sk_
>  	struct virtnet_info *vi = netdev_priv(dev);
>  	int capacity;
> 
> -again:
> -	/* Free up any pending old buffers before queueing new ones. */
> -	free_old_xmit_skbs(vi);
> -
>  	/* Try to transmit */
> +	skb_queue_head(&vi->send, skb);
>  	capacity = xmit_skb(vi, skb);
> 
>  	/* This can happen with OOM and indirect buffers. */
>  	if (unlikely(capacity < 0)) {
> +		skb_unlink(skb, &vi->send);
>  		netif_stop_queue(dev);
>  		dev_warn(&dev->dev, "Unexpected full queue\n");
> -		if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq))) {
> -			vi->svq->vq_ops->disable_cb(vi->svq);
> -			netif_start_queue(dev);
> -			goto again;
> -		}
> +		/* If we missed an interrupt, we let virtnet_xmit_poll deal. */
> +		if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq)))
> +			napi_schedule(&vi->xmit_napi);
>  		return NETDEV_TX_BUSY;
>  	}
>  	vi->svq->vq_ops->kick(vi->svq);
> -
> -	/*
> -	 * Put new one in send queue.  You'd expect we'd need this before
> -	 * xmit_skb calls add_buf(), since the callback can be triggered
> -	 * immediately after that.  But since the callback just triggers
> -	 * another call back here, normal network xmit locking prevents the
> -	 * race.
> -	 */
> -	__skb_queue_head(&vi->send, skb);
> +	vi->capacity = capacity;
> 
>  	/* Don't wait up for transmitted skbs to be freed. */
>  	skb_orphan(skb);
> @@ -544,15 +552,16 @@ again:
> 
>  	/* Apparently nice girls don't return TX_BUSY; stop the queue
>  	 * before it gets out of hand.  Naturally, this wastes entries. */
> -	if (capacity < 2+MAX_SKB_FRAGS) {
> -		netif_stop_queue(dev);
> -		if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq))) {
> -			/* More just got used, free them then recheck. */
> -			capacity += free_old_xmit_skbs(vi);
> -			if (capacity >= 2+MAX_SKB_FRAGS) {
> -				netif_start_queue(dev);
> -				vi->svq->vq_ops->disable_cb(vi->svq);
> -			}
> +	if (unlikely(capacity < 2+MAX_SKB_FRAGS)) {
> +		/* Free old skbs; might make more capacity. */
> +		vi->capacity = capacity + free_old_xmit_skbs(vi);
> +		if (unlikely(vi->capacity < 2+MAX_SKB_FRAGS)) {
> +			/* Make sure virtnet_xmit_poll sees updated capacity */
> +			wmb();
> +			netif_stop_queue(dev);
> +			/* Missed xmit irq? virtnet_xmit_poll will deal. */
> +			if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq)))
> +				napi_schedule(&vi->xmit_napi);
>  		}
>  	}
> 
> @@ -590,6 +599,7 @@ static int virtnet_open(struct net_devic
>  	struct virtnet_info *vi = netdev_priv(dev);
> 
>  	napi_enable(&vi->napi);
> +	napi_enable(&vi->xmit_napi);
> 
>  	/* If all buffers were filled by other side before we napi_enabled, we
>  	 * won't get another interrupt, so process any outstanding packets
> @@ -652,6 +662,7 @@ static int virtnet_close(struct net_devi
>  	struct virtnet_info *vi = netdev_priv(dev);
> 
>  	napi_disable(&vi->napi);
> +	napi_disable(&vi->xmit_napi);
> 
>  	return 0;
>  }
> @@ -883,6 +894,7 @@ static int virtnet_probe(struct virtio_d
>  	/* Set up our device-specific information */
>  	vi = netdev_priv(dev);
>  	netif_napi_add(dev, &vi->napi, virtnet_poll, napi_weight);
> +	netif_napi_add(dev, &vi->xmit_napi, virtnet_xmit_poll, 64);
>  	vi->dev = dev;
>  	vi->vdev = vdev;
>  	vdev->priv = vi;


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-16 13:22         ` Michael S. Tsirkin
  2009-12-16 13:35           ` Herbert Xu
@ 2009-12-17  2:02           ` Rusty Russell
  2009-12-17  9:25             ` Michael S. Tsirkin
  1 sibling, 1 reply; 58+ messages in thread
From: Rusty Russell @ 2009-12-17  2:02 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Herbert Xu, Sridhar Samudrala, netdev

On Wed, 16 Dec 2009 11:52:18 pm Michael S. Tsirkin wrote:
> On Wed, Dec 16, 2009 at 11:15:38PM +1030, Rusty Russell wrote:
> > +	struct virtnet_info *vi =
> > +		container_of(xmit_napi, struct virtnet_info, xmit_napi);
> > +
> > +	if (netif_queue_stopped(vi->dev)) {
> 
> I am a bit concerned here: for example, on link down
> you do netif_stop_queue, and start on link up.
> So is it enough to check netif_queue_stopped
> to verify that tx is not running and that this is because
> it was out of capacity?
> 
> It would be very bad if this run in parallel with TX ...

Yeah, I wasn't happy.  This version uses the tx lock (we're single-queued,
so I used the __ version)

virtio_net: use NAPI for xmit (UNTESTED)

This is closer to the way tg3 and ixgbe do it: use the NAPI framework to
free transmitted packets.  It neatens things a little as well.

Changes since last version:

1) Use the tx lock for the xmit_poll to synchronize against
   start_xmit; it might be overkill, but it's simple.
2) Don't wake queue if the carrier is gone.

(Note: a side effect of this is that we are lazier in freeing old xmit skbs.
 This might be a slight win).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -47,6 +47,9 @@ struct virtnet_info
 	struct napi_struct napi;
 	unsigned int status;
 
+	/* We free packets and decide whether to restart xmit here. */
+	struct napi_struct xmit_napi;
+
 	/* Number of input buffers, and max we've ever had. */
 	unsigned int num, max;
 
@@ -60,6 +63,9 @@ struct virtnet_info
 	struct sk_buff_head recv;
 	struct sk_buff_head send;
 
+	/* Capacity left in xmit queue. */
+	unsigned int capacity;
+
 	/* Work struct for refilling if we run low on memory. */
 	struct delayed_work refill;
 
@@ -111,11 +117,8 @@ static void skb_xmit_done(struct virtque
 {
 	struct virtnet_info *vi = svq->vdev->priv;
 
-	/* Suppress further interrupts. */
-	svq->vq_ops->disable_cb(svq);
-
 	/* We were probably waiting for more output buffers. */
-	netif_wake_queue(vi->dev);
+	napi_schedule(&vi->xmit_napi);
 }
 
 static void receive_skb(struct net_device *dev, struct sk_buff *skb,
@@ -455,6 +458,29 @@ static unsigned int free_old_xmit_skbs(s
 	return tot_sgs;
 }
 
+static int virtnet_xmit_poll(struct napi_struct *xmit_napi, int budget)
+{
+	struct virtnet_info *vi =
+		container_of(xmit_napi, struct virtnet_info, xmit_napi);
+
+	/* Don't access vq/capacity at same time as start_xmit. */
+	__netif_tx_lock(netdev_get_tx_queue(vi->dev, 0), smp_processor_id());
+
+	vi->capacity += free_old_xmit_skbs(vi);
+	if (vi->capacity >= 2 + MAX_SKB_FRAGS) {
+		/* Suppress further xmit interrupts. */
+		vi->svq->vq_ops->disable_cb(vi->svq);
+		napi_complete(xmit_napi);
+
+		/* Don't wake it if link is down. */
+		if (likely(netif_carrier_ok(vi->dev)))
+			netif_wake_queue(vi->dev);
+	}
+
+	__netif_tx_unlock(netdev_get_tx_queue(vi->dev, 0));
+	return 1;
+}
+
 static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
 {
 	struct scatterlist sg[2+MAX_SKB_FRAGS];
@@ -509,10 +535,6 @@ static netdev_tx_t start_xmit(struct sk_
 	struct virtnet_info *vi = netdev_priv(dev);
 	int capacity;
 
-again:
-	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(vi);
-
 	/* Try to transmit */
 	capacity = xmit_skb(vi, skb);
 
@@ -520,14 +542,13 @@ again:
 	if (unlikely(capacity < 0)) {
 		netif_stop_queue(dev);
 		dev_warn(&dev->dev, "Unexpected full queue\n");
-		if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq))) {
-			vi->svq->vq_ops->disable_cb(vi->svq);
-			netif_start_queue(dev);
-			goto again;
-		}
+		/* If we missed an interrupt, we let virtnet_xmit_poll deal. */
+		if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq)))
+			napi_schedule(&vi->xmit_napi);
 		return NETDEV_TX_BUSY;
 	}
 	vi->svq->vq_ops->kick(vi->svq);
+	vi->capacity = capacity;
 
 	/*
 	 * Put new one in send queue.  You'd expect we'd need this before
@@ -545,14 +566,13 @@ again:
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (capacity < 2+MAX_SKB_FRAGS) {
-		netif_stop_queue(dev);
-		if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq))) {
-			/* More just got used, free them then recheck. */
-			capacity += free_old_xmit_skbs(vi);
-			if (capacity >= 2+MAX_SKB_FRAGS) {
-				netif_start_queue(dev);
-				vi->svq->vq_ops->disable_cb(vi->svq);
-			}
+		/* Free old skbs; might make more capacity. */
+		vi->capacity = capacity + free_old_xmit_skbs(vi);
+		if (unlikely(vi->capacity < 2+MAX_SKB_FRAGS)) {
+			netif_stop_queue(dev);
+			/* Missed xmit irq? virtnet_xmit_poll will deal. */
+			if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq)))
+				napi_schedule(&vi->xmit_napi);
 		}
 	}
 
@@ -590,6 +610,7 @@ static int virtnet_open(struct net_devic
 	struct virtnet_info *vi = netdev_priv(dev);
 
 	napi_enable(&vi->napi);
+	napi_enable(&vi->xmit_napi);
 
 	/* If all buffers were filled by other side before we napi_enabled, we
 	 * won't get another interrupt, so process any outstanding packets
@@ -652,6 +673,7 @@ static int virtnet_close(struct net_devi
 	struct virtnet_info *vi = netdev_priv(dev);
 
 	napi_disable(&vi->napi);
+	napi_disable(&vi->xmit_napi);
 
 	return 0;
 }
@@ -818,9 +840,13 @@ static void virtnet_update_status(struct
 
 	if (vi->status & VIRTIO_NET_S_LINK_UP) {
 		netif_carrier_on(vi->dev);
-		netif_wake_queue(vi->dev);
+		/* Make sure virtnet_xmit_poll sees carrier enabled. */
+		wmb();
+		napi_schedule(&vi->xmit_napi);
 	} else {
 		netif_carrier_off(vi->dev);
+		/* Make sure virtnet_xmit_poll sees carrier disabled. */
+		wmb();
 		netif_stop_queue(vi->dev);
 	}
 }
@@ -883,6 +909,7 @@ static int virtnet_probe(struct virtio_d
 	/* Set up our device-specific information */
 	vi = netdev_priv(dev);
 	netif_napi_add(dev, &vi->napi, virtnet_poll, napi_weight);
+	netif_napi_add(dev, &vi->xmit_napi, virtnet_xmit_poll, 64);
 	vi->dev = dev;
 	vi->vdev = vdev;
 	vdev->priv = vi;

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17  1:43         ` Sridhar Samudrala
@ 2009-12-17  3:12           ` Herbert Xu
  2009-12-17  5:02             ` Sridhar Samudrala
  2009-12-17  3:15           ` Herbert Xu
  1 sibling, 1 reply; 58+ messages in thread
From: Herbert Xu @ 2009-12-17  3:12 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: Rusty Russell, mst, netdev

On Wed, Dec 16, 2009 at 05:43:00PM -0800, Sridhar Samudrala wrote:
>
> I had to change virtnet_xmit_poll() to get it working. See below.
> With this change, i don't see any 'queue full' warnings, but requeues
> are still happening at the qdisc level (sch_direct_xmit() finds that
> tx queue is stopped and does requeues).

More importantly how does the performance look?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17  1:43         ` Sridhar Samudrala
  2009-12-17  3:12           ` Herbert Xu
@ 2009-12-17  3:15           ` Herbert Xu
  2009-12-17  5:05             ` Sridhar Samudrala
  1 sibling, 1 reply; 58+ messages in thread
From: Herbert Xu @ 2009-12-17  3:15 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: Rusty Russell, mst, netdev

On Wed, Dec 16, 2009 at 05:43:00PM -0800, Sridhar Samudrala wrote:
>
> I had to change virtnet_xmit_poll() to get it working. See below.
> With this change, i don't see any 'queue full' warnings, but requeues
> are still happening at the qdisc level (sch_direct_xmit() finds that
> tx queue is stopped and does requeues).

Actually this makes no sense.  The queue should only be stopped
at the end of a xmit run, in which case sch_direct_xmit should
return 0 so we should never see a requeue.

So if we're still seeing requeues then it hasn't been fixed
properly.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17  3:12           ` Herbert Xu
@ 2009-12-17  5:02             ` Sridhar Samudrala
  0 siblings, 0 replies; 58+ messages in thread
From: Sridhar Samudrala @ 2009-12-17  5:02 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Rusty Russell, mst, netdev

Herbert Xu wrote:
> On Wed, Dec 16, 2009 at 05:43:00PM -0800, Sridhar Samudrala wrote:
>   
>> I had to change virtnet_xmit_poll() to get it working. See below.
>> With this change, i don't see any 'queue full' warnings, but requeues
>> are still happening at the qdisc level (sch_direct_xmit() finds that
>> tx queue is stopped and does requeues).
>>     
>
> More importantly how does the performance look?
>   
TCP stream throughput is around the same: 3200Mb/sec

Thanks
Sridhar


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17  3:15           ` Herbert Xu
@ 2009-12-17  5:05             ` Sridhar Samudrala
  2009-12-17  6:28               ` Herbert Xu
  0 siblings, 1 reply; 58+ messages in thread
From: Sridhar Samudrala @ 2009-12-17  5:05 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Rusty Russell, mst, netdev

Herbert Xu wrote:
> On Wed, Dec 16, 2009 at 05:43:00PM -0800, Sridhar Samudrala wrote:
>   
>> I had to change virtnet_xmit_poll() to get it working. See below.
>> With this change, i don't see any 'queue full' warnings, but requeues
>> are still happening at the qdisc level (sch_direct_xmit() finds that
>> tx queue is stopped and does requeues).
>>     
>
> Actually this makes no sense.  The queue should only be stopped
> at the end of a xmit run, in which case sch_direct_xmit should
> return 0 so we should never see a requeue.
>   
I think sch_direct_xmit() is not even calling dev_hard_start_xmit() as
the tx queue is stopped; it does a dev_requeue_skb() and returns
NETDEV_TX_BUSY.

> So if we're still seeing requeues then it hasn't been fixed
> properly.
>
> Cheers,
>   



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17  5:05             ` Sridhar Samudrala
@ 2009-12-17  6:28               ` Herbert Xu
  2009-12-17  6:45                 ` Sridhar Samudrala
  0 siblings, 1 reply; 58+ messages in thread
From: Herbert Xu @ 2009-12-17  6:28 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: Rusty Russell, mst, netdev

On Wed, Dec 16, 2009 at 09:05:32PM -0800, Sridhar Samudrala wrote:
>
> I think sch_direct_xmit() is not even calling dev_hard_start_xmit() as  
> the tx queue is stopped
> and does a dev_requeue_skb() and returns NETDEV_TX_BUSY.

Yes but if the queue was stopped then we shouldn't even get into
sch_direct_xmit.  While inside sch_direct_xmit, the only way for
the queue to be stopped is through dev_hard_start_xmit, and when
that happens sch_direct_xmit will exit so we should never see a
requeue if the driver is working properly.
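Herbert's invariant can be modeled in a few lines (a simplified userspace sketch of the 2.6.32-era qdisc transmit flow; the names mirror but are not the real kernel functions):

```c
#include <assert.h>
#include <stdbool.h>

/* A driver's xmit handler may stop the queue *after* accepting a packet;
 * the invariant is that sch_direct_xmit() then ends the xmit run (returns
 * 0), so finding the queue already stopped on entry -- a requeue -- should
 * never happen with a well-behaved driver. */
struct txq { bool stopped; };

static int requeues;

/* models dev_hard_start_xmit(): accepts the skb, may stop the queue */
static void hard_start_xmit(struct txq *q, bool ring_now_full)
{
	if (ring_now_full)
		q->stopped = true;  /* netif_stop_queue() at end of xmit */
}

/* models sch_direct_xmit(): returns nonzero if more may be dequeued */
static int direct_xmit(struct txq *q, bool ring_now_full)
{
	if (q->stopped) {
		requeues++;         /* dev_requeue_skb(): should not happen */
		return 0;
	}
	hard_start_xmit(q, ring_now_full);
	return !q->stopped;         /* 0 => caller ends the xmit run */
}
```

In this model a queue stopped inside hard_start_xmit() ends the run with return value 0 and requeues stays 0; the requeues Sridhar observes mean the queue was found stopped on entry, i.e. the xmit path was entered after the driver stopped the queue.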

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17  6:28               ` Herbert Xu
@ 2009-12-17  6:45                 ` Sridhar Samudrala
  2009-12-17 10:03                   ` Krishna Kumar2
  0 siblings, 1 reply; 58+ messages in thread
From: Sridhar Samudrala @ 2009-12-17  6:45 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Rusty Russell, mst, netdev

Herbert Xu wrote:
> On Wed, Dec 16, 2009 at 09:05:32PM -0800, Sridhar Samudrala wrote:
>   
>> I think sch_direct_xmit() is not even calling dev_hard_start_xmit() as  
>> the tx queue is stopped
>> and does a dev_requeue_skb() and returns NETDEV_TX_BUSY.
>>     
>
> Yes but if the queue was stopped then we shouldn't even get into
> sch_direct_xmit.  
I don't see any checks for txq_stopped in the callers of sch_direct_xmit():
__dev_xmit_skb() and qdisc_restart().  Both these routines get the txq and
call sch_direct_xmit(), which checks if the tx queue is stopped or frozen.

Am i missing something?

Thanks
Sridhar
> While inside sch_direct_xmit, the only way for
> the queue to be stopped is through dev_hard_start_xmit, and when
> that happens sch_direct_xmit will exit so we should never see a
> requeue if the driver is working properly.
>
> Cheers,
>   



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17  2:02           ` Rusty Russell
@ 2009-12-17  9:25             ` Michael S. Tsirkin
  2009-12-18  1:55               ` Rusty Russell
  0 siblings, 1 reply; 58+ messages in thread
From: Michael S. Tsirkin @ 2009-12-17  9:25 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Herbert Xu, Sridhar Samudrala, netdev

On Thu, Dec 17, 2009 at 12:32:26PM +1030, Rusty Russell wrote:
> On Wed, 16 Dec 2009 11:52:18 pm Michael S. Tsirkin wrote:
> > On Wed, Dec 16, 2009 at 11:15:38PM +1030, Rusty Russell wrote:
> > > +	struct virtnet_info *vi =
> > > +		container_of(xmit_napi, struct virtnet_info, xmit_napi);
> > > +
> > > +	if (netif_queue_stopped(vi->dev)) {
> > 
> > I am a bit concerned here: for example, on link down
> > you do netif_stop_queue, and start on link up.
> > So is it enough to check netif_queue_stopped
> > to verify that tx is not running and that this is because
> > it was out of capacity?
> > 
> > It would be very bad if this run in parallel with TX ...
> 
> Yeah, I wasn't happy.  This version uses the tx lock (we're single-queued,
> so I used the __ version)
> 
> virtio_net: use NAPI for xmit (UNTESTED)
> 
> This is closer to the way tg3 and ixgbe do it: use the NAPI framework to
> free transmitted packets.  It neatens things a little as well.
> 
> Changes since last version:
> 
> 1) Use the tx lock for the xmit_poll to synchronize against
>    start_xmit; it might be overkill, but it's simple.
> 2) Don't wake queue if the carrier is gone.
> 
> (Note: a side effect of this is that we are lazier in freeing old xmit skbs.
>  This might be a slight win).
> 
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

That's very clean. Some questions below:

> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -47,6 +47,9 @@ struct virtnet_info
>  	struct napi_struct napi;
>  	unsigned int status;
>  
> +	/* We free packets and decide whether to restart xmit here. */
> +	struct napi_struct xmit_napi;
> +
>  	/* Number of input buffers, and max we've ever had. */
>  	unsigned int num, max;
>  
> @@ -60,6 +63,9 @@ struct virtnet_info
>  	struct sk_buff_head recv;
>  	struct sk_buff_head send;
>  
> +	/* Capacity left in xmit queue. */
> +	unsigned int capacity;
> +
>  	/* Work struct for refilling if we run low on memory. */
>  	struct delayed_work refill;
>  
> @@ -111,11 +117,8 @@ static void skb_xmit_done(struct virtque
>  {
>  	struct virtnet_info *vi = svq->vdev->priv;
>  
> -	/* Suppress further interrupts. */
> -	svq->vq_ops->disable_cb(svq);
> -
>  	/* We were probably waiting for more output buffers. */
> -	netif_wake_queue(vi->dev);
> +	napi_schedule(&vi->xmit_napi);
>  }
>  
>  static void receive_skb(struct net_device *dev, struct sk_buff *skb,
> @@ -455,6 +458,29 @@ static unsigned int free_old_xmit_skbs(s
>  	return tot_sgs;
>  }
>  
> +static int virtnet_xmit_poll(struct napi_struct *xmit_napi, int budget)
> +{
> +	struct virtnet_info *vi =
> +		container_of(xmit_napi, struct virtnet_info, xmit_napi);
> +
> +	/* Don't access vq/capacity at same time as start_xmit. */
> +	__netif_tx_lock(netdev_get_tx_queue(vi->dev, 0), smp_processor_id());

So now that we are locking, we could build a variant of this
without NAPI (maybe with trylock: we can't spin on xmit lock from
from hard irq context, can we?)? Possibly, if we do, that would be
a small enough change to be applicable in 2.6.32.

> +
> +	vi->capacity += free_old_xmit_skbs(vi);

Should we build a variant of free_old_xmit_skbs
that gets budget, to avoid starving others
while we poll the vq?

> +	if (vi->capacity >= 2 + MAX_SKB_FRAGS) {
> +		/* Suppress further xmit interrupts. */
> +		vi->svq->vq_ops->disable_cb(vi->svq);
> +		napi_complete(xmit_napi);
> +
> +		/* Don't wake it if link is down. */
> +		if (likely(netif_carrier_ok(vi->vdev)))
> +			netif_wake_queue(vi->dev);
> +	}
> +
> +	__netif_tx_unlock(netdev_get_tx_queue(vi->dev, 0));
> +	return 1;
> +}
> +
>  static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
>  {
>  	struct scatterlist sg[2+MAX_SKB_FRAGS];
> @@ -509,10 +535,6 @@ static netdev_tx_t start_xmit(struct sk_
>  	struct virtnet_info *vi = netdev_priv(dev);
>  	int capacity;
>  
> -again:
> -	/* Free up any pending old buffers before queueing new ones. */
> -	free_old_xmit_skbs(vi);
> -
>  	/* Try to transmit */
>  	capacity = xmit_skb(vi, skb);
>  
> @@ -520,14 +542,13 @@ again:
>  	if (unlikely(capacity < 0)) {
>  		netif_stop_queue(dev);
>  		dev_warn(&dev->dev, "Unexpected full queue\n");
> -		if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq))) {
> -			vi->svq->vq_ops->disable_cb(vi->svq);
> -			netif_start_queue(dev);
> -			goto again;
> -		}
> +		/* If we missed an interrupt, we let virtnet_xmit_poll deal. */
> +		if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq)))
> +			napi_schedule(&vi->xmit_napi);
>  		return NETDEV_TX_BUSY;
>  	}
>  	vi->svq->vq_ops->kick(vi->svq);
> +	vi->capacity = capacity;
>  
>  	/*
>  	 * Put new one in send queue.  You'd expect we'd need this before
> @@ -545,14 +566,13 @@ again:
>  	/* Apparently nice girls don't return TX_BUSY; stop the queue
>  	 * before it gets out of hand.  Naturally, this wastes entries. */
>  	if (capacity < 2+MAX_SKB_FRAGS) {
> -		netif_stop_queue(dev);
> -		if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq))) {
> -			/* More just got used, free them then recheck. */
> -			capacity += free_old_xmit_skbs(vi);
> -			if (capacity >= 2+MAX_SKB_FRAGS) {
> -				netif_start_queue(dev);
> -				vi->svq->vq_ops->disable_cb(vi->svq);
> -			}
> +		/* Free old skbs; might make more capacity. */
> +		vi->capacity = capacity + free_old_xmit_skbs(vi);
> +		if (unlikely(vi->capacity < 2+MAX_SKB_FRAGS)) {
> +			netif_stop_queue(dev);
> +			/* Missed xmit irq? virtnet_xmit_poll will deal. */
> +			if (unlikely(!vi->svq->vq_ops->enable_cb(vi->svq)))
> +				napi_schedule(&vi->xmit_napi);
>  		}
>  	}
>  
> @@ -590,6 +610,7 @@ static int virtnet_open(struct net_devic
>  	struct virtnet_info *vi = netdev_priv(dev);
>  
>  	napi_enable(&vi->napi);
> +	napi_enable(&vi->xmit_napi);
>  
>  	/* If all buffers were filled by other side before we napi_enabled, we
>  	 * won't get another interrupt, so process any outstanding packets
> @@ -652,6 +673,7 @@ static int virtnet_close(struct net_devi
>  	struct virtnet_info *vi = netdev_priv(dev);
>  
>  	napi_disable(&vi->napi);
> +	napi_disable(&vi->xmit_napi);
>  
>  	return 0;
>  }
> @@ -818,9 +840,13 @@ static void virtnet_update_status(struct
>  
>  	if (vi->status & VIRTIO_NET_S_LINK_UP) {
>  		netif_carrier_on(vi->dev);
> -		netif_wake_queue(vi->dev);
> +		/* Make sure virtnet_xmit_poll sees carrier enabled. */
> +		wmb();

I think this should be smp_wmb, we are not synchronising with hardware
here. Right?

> +		napi_schedule(&vi->xmit_napi);
>  	} else {
>  		netif_carrier_off(vi->dev);
> +		/* Make sure virtnet_xmit_poll sees carrier disabled. */
> +		wmb();

And here.

>  		netif_stop_queue(vi->dev);
>  	}
>  }
> @@ -883,6 +909,7 @@ static int virtnet_probe(struct virtio_d
>  	/* Set up our device-specific information */
>  	vi = netdev_priv(dev);
>  	netif_napi_add(dev, &vi->napi, virtnet_poll, napi_weight);
> +	netif_napi_add(dev, &vi->xmit_napi, virtnet_xmit_poll, 64);
>  	vi->dev = dev;
>  	vi->vdev = vdev;
>  	vdev->priv = vi;


* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17  6:45                 ` Sridhar Samudrala
@ 2009-12-17 10:03                   ` Krishna Kumar2
  2009-12-17 11:27                     ` Jarek Poplawski
  0 siblings, 1 reply; 58+ messages in thread
From: Krishna Kumar2 @ 2009-12-17 10:03 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: Herbert Xu, mst, netdev, Rusty Russell

> Sridhar Samudrala <sri@us.ibm.com> wrote:
>
> Herbert Xu wrote:
> > On Wed, Dec 16, 2009 at 09:05:32PM -0800, Sridhar Samudrala wrote:
> >
> >> I think sch_direct_xmit() is not even calling dev_hard_start_xmit() as
> >> the tx queue is stopped
> >> and does a dev_requeue_skb() and returns NETDEV_TX_BUSY.
> >>
> >
> > Yes but if the queue was stopped then we shouldn't even get into
> > sch_direct_xmit.
> I don't see any checks for txq_stopped in the callers of
> sch_direct_xmit():
> __dev_xmit_skb() and qdisc_restart().  Both these routines get the txq
> and call
> sch_direct_xmit() which checks if tx queue is stopped or frozen.
>
> Am i missing something?

Yes - dequeue_skb.

The final skb, before the queue was stopped, is transmitted by
the driver. The next time sch_direct_xmit is called, it gets a
skb and finds the device is stopped, so it requeues the skb. For
all subsequent xmits, dequeue_skb returns NULL (and the other
caller - __dev_xmit_skb - can never be called since qdisc_qlen is
true), and thus requeues will not happen. This also means that
the number of requeues you see (e.g. 283K in one run) is the number
of times the queue was stopped and restarted. So it looks like
the driver either:

1. didn't stop the queue when xmiting a packet successfully (the
      condition being that it would not be possible to xmit the
      next skb). But this doesn't seem to be the case.
2. wrongly restarted the queue. Possible - since a few places
      use both the start & wake queue api's.

Thanks,

- KK



* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
@ 2009-12-17 11:20 Krishna Kumar
  2009-12-17 19:28 ` Jarek Poplawski
  0 siblings, 1 reply; 58+ messages in thread
From: Krishna Kumar @ 2009-12-17 11:20 UTC (permalink / raw)
  To: davem; +Cc: herbert, mst, netdev, rusty, Krishna Kumar, sri

> On Thu, Dec 17, 2009, Krishna Kumar2/India/IBM@IBMIN wrote:
>
> > >> I think sch_direct_xmit() is not even calling dev_hard_start_xmit() as
> > >> the tx queue is stopped
> > >> and does a dev_requeue_skb() and returns NETDEV_TX_BUSY.
> > >
> > > Yes but if the queue was stopped then we shouldn't even get into
> > > sch_direct_xmit.
> > I don't see any checks for txq_stopped in the callers of
> > sch_direct_xmit():
> > __dev_xmit_skb() and qdisc_restart().  Both these routines get the txq
> > and call
> > sch_direct_xmit() which checks if tx queue is stopped or frozen.
> >
> > Am i missing something?
> 
> Yes - dequeue_skb.
> 
> The final skb, before the queue was stopped, is transmitted by
> the driver. The next time sch_direct_xmit is called, it gets a
> skb and finds the device is stopped and requeue's the skb. For
> all subsequent xmits, dequeue_skb returns NULL (and the other
> caller - __dev_xmit_skb can never be called since qdisc_qlen is
> true) and thus requeue's will not happen. This also means that
> the number of requeues you see (eg 283K in one run) is the number
> of times the queue was stopped and restarted. So it looks like
> driver either:
> 
> 1. didn't stop the queue when xmiting a packet successfully (the
>       condition being that it would not be possible to xmit the
>       next skb). But this doesn't seem to be the case.
> 2. wrongly restarted the queue. Possible - since a few places
>       use both the start & wake queue api's.

[ Resending, since I sent to wrong id last time - sorry for
some duplicates ]

On a (slightly) related note, qdisc_restart() has this code:
	/* Dequeue packet */
	skb = dequeue_skb(q);
	if (unlikely(!skb))
		return 0;

When a txq is stopped, all subsequent dev_queue_xmits will
execute this path, pass the "unlikely" code, and return. Is
it reasonable to remove the "unlikely" in both dequeue_skb and
qdisc_restart? If so, patch inlined below:

thanks,

- KK

From: Krishna Kumar <krkumar2@in.ibm.com>

1. Remove two unlikely checks, since a stopped queue will result
	in getting a NULL skb.
2. Remove an extra space after an unlikely check.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 net/sched/sch_generic.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff -ruNp org/net/sched/sch_generic.c new/net/sched/sch_generic.c
--- org/net/sched/sch_generic.c	2009-12-17 16:29:59.000000000 +0530
+++ new/net/sched/sch_generic.c	2009-12-17 16:30:55.000000000 +0530
@@ -51,7 +51,7 @@ static inline struct sk_buff *dequeue_sk
 {
 	struct sk_buff *skb = q->gso_skb;
 
-	if (unlikely(skb)) {
+	if (skb) {
 		struct net_device *dev = qdisc_dev(q);
 		struct netdev_queue *txq;
 
@@ -134,7 +134,7 @@ int sch_direct_xmit(struct sk_buff *skb,
 		ret = handle_dev_cpu_collision(skb, txq, q);
 	} else {
 		/* Driver returned NETDEV_TX_BUSY - requeue skb */
-		if (unlikely (ret != NETDEV_TX_BUSY && net_ratelimit()))
+		if (unlikely(ret != NETDEV_TX_BUSY && net_ratelimit()))
 			printk(KERN_WARNING "BUG %s code %d qlen %d\n",
 			       dev->name, ret, q->q.qlen);
 
@@ -176,7 +176,7 @@ static inline int qdisc_restart(struct Q
 
 	/* Dequeue packet */
 	skb = dequeue_skb(q);
-	if (unlikely(!skb))
+	if (!skb)
 		return 0;
 
 	root_lock = qdisc_lock(q);


* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 10:03                   ` Krishna Kumar2
@ 2009-12-17 11:27                     ` Jarek Poplawski
  2009-12-17 11:45                       ` Herbert Xu
  2009-12-17 11:56                       ` Krishna Kumar2
  0 siblings, 2 replies; 58+ messages in thread
From: Jarek Poplawski @ 2009-12-17 11:27 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: Sridhar Samudrala, Herbert Xu, mst, netdev, Rusty Russell

On 17-12-2009 11:03, Krishna Kumar2 wrote:
>> Sridhar Samudrala <sri@us.ibm.com>
>>
>> Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
>>
>> Herbert Xu wrote:
>>> On Wed, Dec 16, 2009 at 09:05:32PM -0800, Sridhar Samudrala wrote:
>>>
>>>> I think sch_direct_xmit() is not even calling dev_hard_start_xmit() as
>>>> the tx queue is stopped
>>>> and does a dev_requeue_skb() and returns NETDEV_TX_BUSY.
>>>>
>>> Yes but if the queue was stopped then we shouldn't even get into
>>> sch_direct_xmit.
>> I don't see any checks for txq_stopped in the callers of
>> sch_direct_xmit():
>> __dev_xmit_skb() and qdisc_restart().  Both these routines get the txq
>> and call
>> sch_direct_xmit() which checks if tx queue is stopped or frozen.
>>
>> Am i missing something?
> 
> Yes - dequeue_skb.
> 
> The final skb, before the queue was stopped, is transmitted by
> the driver. The next time sch_direct_xmit is called, it gets a
> skb and finds the device is stopped and requeue's the skb.

So we _should_ get into sch_direct_xmit when the queue was stopped...
I guess Herbert might have forgotten the multiqueue change, and Sridhar
isn't missing much. ;-)

Thanks,
Jarek P.


* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 11:27                     ` Jarek Poplawski
@ 2009-12-17 11:45                       ` Herbert Xu
  2009-12-17 11:49                         ` Herbert Xu
                                           ` (2 more replies)
  2009-12-17 11:56                       ` Krishna Kumar2
  1 sibling, 3 replies; 58+ messages in thread
From: Herbert Xu @ 2009-12-17 11:45 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Krishna Kumar2, Sridhar Samudrala, mst, netdev, Rusty Russell,
	David S. Miller

On Thu, Dec 17, 2009 at 11:27:55AM +0000, Jarek Poplawski wrote:
>
> So we _should_ get into sch_direct_xmit when the queue was stopped...
> I guess Herbert might forget the multiqueue change, and Sridhar isn't
> missing much. ;-)

Hmm, if that's the case then this new multiqueue code is seriously
broken.  It means that once the queue is stopped, every new tx packet
will cause an unnecessary dequeue/requeue, up to 1000.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 11:45                       ` Herbert Xu
@ 2009-12-17 11:49                         ` Herbert Xu
  2009-12-17 12:08                           ` Herbert Xu
  2009-12-17 11:59                         ` Krishna Kumar2
  2009-12-17 12:19                         ` Jarek Poplawski
  2 siblings, 1 reply; 58+ messages in thread
From: Herbert Xu @ 2009-12-17 11:49 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Krishna Kumar2, Sridhar Samudrala, mst, netdev, Rusty Russell,
	David S. Miller

On Thu, Dec 17, 2009 at 07:45:35PM +0800, Herbert Xu wrote:
> On Thu, Dec 17, 2009 at 11:27:55AM +0000, Jarek Poplawski wrote:
> >
> > So we _should_ get into sch_direct_xmit when the queue was stopped...
> > I guess Herbert might forget the multiqueue change, and Sridhar isn't
> > missing much. ;-)
> 
> Hmm, if that's the case then this new multiqueue code is seriously
> broken.  It means that once the queue is stopped, every new tx packet
> will cause an unnecessary dequeue/requeue, up to 1000.

And indeed, this new requeue behaviour was introduced in 2.6.32.

Sridhar, please build the 2.6.32 virtio-net driver under a 2.6.31
kernel and see if the regression persists.  Alternatively, build
the virtio-net driver under 2.6.32 and see whether it still performs
as well.

Thanks,


* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 11:27                     ` Jarek Poplawski
  2009-12-17 11:45                       ` Herbert Xu
@ 2009-12-17 11:56                       ` Krishna Kumar2
  2009-12-17 13:17                         ` Jarek Poplawski
  1 sibling, 1 reply; 58+ messages in thread
From: Krishna Kumar2 @ 2009-12-17 11:56 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Herbert Xu, mst, netdev, Rusty Russell, Sridhar Samudrala

> Jarek Poplawski <jarkao2@gmail.com> wrote:
>
> >>> On Wed, Dec 16, 2009 at 09:05:32PM -0800, Sridhar Samudrala wrote:
> >>>
> >>>> I think sch_direct_xmit() is not even calling dev_hard_start_xmit() as
> >>>> the tx queue is stopped
> >>>> and does a dev_requeue_skb() and returns NETDEV_TX_BUSY.
> >>>>
> >>> Yes but if the queue was stopped then we shouldn't even get into
> >>> sch_direct_xmit.
> >> I don't see any checks for txq_stopped in the callers of
> >> sch_direct_xmit():
> >> __dev_xmit_skb() and qdisc_restart().  Both these routines get the txq
> >> and call
> >> sch_direct_xmit() which checks if tx queue is stopped or frozen.
> >>
> >> Am i missing something?
> >
> > Yes - dequeue_skb.
> >
> > The final skb, before the queue was stopped, is transmitted by
> > the driver. The next time sch_direct_xmit is called, it gets a
> > skb and finds the device is stopped and requeue's the skb.
>
> So we _should_ get into sch_direct_xmit when the queue was stopped...
> I guess Herbert might forget the multiqueue change, and Sridhar isn't
> missing much. ;-)

I meant his question about who checks whether the tx queue is stopped
before calling the driver xmit. In the stopped-queue case, qdisc_restart
makes sure sch_direct_xmit is not called for subsequent skbs.

Sridhar is seeing 280K requeues, and that probably implies the device
was stopped and wrongly restarted immediately. So the next xmit in
the kernel found the txq not stopped and called the xmit handler,
got a BUSY, requeued, and so on. That would also explain why his BW
drops so much - all false starts (besides 19% of all skbs being
requeued). I assume that each time we check:

      if (!netif_tx_queue_stopped(txq) && !netif_tx_queue_frozen(txq))
            ret = dev_hard_start_xmit(skb, dev, txq);
it passes the check and dev_hard_start_xmit is called wrongly.

#Requeues: 283575
#total skbs: 1469482
Percentage requeued: 19.29%

Thanks,

- KK



* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 11:45                       ` Herbert Xu
  2009-12-17 11:49                         ` Herbert Xu
@ 2009-12-17 11:59                         ` Krishna Kumar2
  2009-12-17 12:19                         ` Jarek Poplawski
  2 siblings, 0 replies; 58+ messages in thread
From: Krishna Kumar2 @ 2009-12-17 11:59 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David S. Miller, Jarek Poplawski, mst, netdev, netdev-owner,
	Rusty Russell, Sridhar Samudrala

> Herbert Xu <herbert@gondor.apana.org.au> wrote on 12/17/2009 05:15 PM:
>
> On Thu, Dec 17, 2009 at 11:27:55AM +0000, Jarek Poplawski wrote:
> >
> > So we _should_ get into sch_direct_xmit when the queue was stopped...
> > I guess Herbert might forget the multiqueue change, and Sridhar isn't
> > missing much. ;-)
>
> Hmm, if that's the case then this new multiqueue code is seriously
> broken.  It means that once the queue is stopped, every new tx packet
> will cause an unnecessary dequeue/requeue, up to 1000.

I think it is fine. The first skb after the stop is requeued. All
subsequent new skbs get screened out at qdisc_restart() -> dequeue_skb
-> NULL -> no work to do. Hope I didn't miss something now :-)

Thanks,

- KK



* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 11:49                         ` Herbert Xu
@ 2009-12-17 12:08                           ` Herbert Xu
  2009-12-17 12:27                             ` Krishna Kumar2
  0 siblings, 1 reply; 58+ messages in thread
From: Herbert Xu @ 2009-12-17 12:08 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Krishna Kumar2, Sridhar Samudrala, mst, netdev, Rusty Russell,
	David S. Miller

On Thu, Dec 17, 2009 at 07:49:37PM +0800, Herbert Xu wrote:
>
> And indeed, this new requeue behaviour was introduced in 2.6.32.

Actually no, it was merely rearranged.  It would appear that this
behaviour has been around for over a year.

It is even worse than I thought.  Not only would new tx packets
trigger this unnecessary queue run, but once it is triggered, it
would consume 100% CPU as dev_requeue_skb unconditionally reschedules
the queue!

Tell me this is not true please...

Thanks,


* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 11:45                       ` Herbert Xu
  2009-12-17 11:49                         ` Herbert Xu
  2009-12-17 11:59                         ` Krishna Kumar2
@ 2009-12-17 12:19                         ` Jarek Poplawski
  2 siblings, 0 replies; 58+ messages in thread
From: Jarek Poplawski @ 2009-12-17 12:19 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Krishna Kumar2, Sridhar Samudrala, mst, netdev, Rusty Russell,
	David S. Miller

On Thu, Dec 17, 2009 at 07:45:35PM +0800, Herbert Xu wrote:
> On Thu, Dec 17, 2009 at 11:27:55AM +0000, Jarek Poplawski wrote:
> >
> > So we _should_ get into sch_direct_xmit when the queue was stopped...
> > I guess Herbert might forget the multiqueue change, and Sridhar isn't
> > missing much. ;-)
> 
> Hmm, if that's the case then this new multiqueue code is seriously
> broken.  It means that once the queue is stopped, every new tx packet
> will cause an unnecessary dequeue/requeue, up to 1000.

Hmm, I can even remember who inspired this change ;-) But I'm not sure
this is really broken: it seems it's "optimized" for the non-stopped case.

Btw, IIRC, Krishna worked on changing this for uniqueues - I wonder if
there could be much difference in testing?

Cheers,
Jarek P.


* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 12:08                           ` Herbert Xu
@ 2009-12-17 12:27                             ` Krishna Kumar2
  2009-12-17 12:42                               ` Jarek Poplawski
  2009-12-17 13:44                               ` Herbert Xu
  0 siblings, 2 replies; 58+ messages in thread
From: Krishna Kumar2 @ 2009-12-17 12:27 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David S. Miller, Jarek Poplawski, mst, netdev, Rusty Russell,
	Sridhar Samudrala

Herbert Xu <herbert@gondor.apana.org.au> wrote on 12/17/2009 05:38:56 PM:

> > And indeed, this new requeue behaviour was introduced in 2.6.32.
>
> Actually no, it was merely rearranged.  It would appear that this
> behaviour has been around for over a year.
>
> It is even worse than I thought.  Not only would new tx packets
> trigger this unnecessary queue run, but once it is triggered, it
> would consume 100% CPU as dev_requeue_skb unconditionally reschedules
> the queue!
>
> Tell me this is not true please...

Hi Herbert,

I am confused. Isn't dequeue_skb returning NULL for the 2nd - nth skbs
till the queue is restarted, so how is it broken?

__dev_xmit_skb calls sch_direct_xmit only once after the device is
stopped, and this call results in the skb being requeued. The next call
to __dev_xmit_skb calls qdisc_restart, which will always bail out since
dev_dequeue returns NULL. We also avoid rescheduling the queue after
the first resched. I also think the resched in requeue is probably
not required, since the device will call netif_tx_wake_queue anyway.

dev_dequeue()
{
      if (unlikely(skb)) {
            struct net_device *dev = qdisc_dev(q);
            struct netdev_queue *txq;

            /* check the reason of requeuing without tx lock first */
            txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));
            if (!netif_tx_queue_stopped(txq) &&
                !netif_tx_queue_frozen(txq)) {
                        q->gso_skb = NULL;
                        q->q.qlen--;
            } else
                  skb = NULL;
      }
}

Thanks,

- KK



* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 12:27                             ` Krishna Kumar2
@ 2009-12-17 12:42                               ` Jarek Poplawski
  2009-12-17 12:56                                 ` Herbert Xu
  2009-12-17 13:04                                 ` Krishna Kumar2
  2009-12-17 13:44                               ` Herbert Xu
  1 sibling, 2 replies; 58+ messages in thread
From: Jarek Poplawski @ 2009-12-17 12:42 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: Herbert Xu, David S. Miller, mst, netdev, Rusty Russell,
	Sridhar Samudrala

On Thu, Dec 17, 2009 at 05:57:15PM +0530, Krishna Kumar2 wrote:
> the first resched. I also think the resched in requeue is probably
> not required, since the device will call netif_tx_wake_queue anyway.

IIRC some possibly broken drivers were mentioned around these
changes. So it would require a little bit of reviewing...

Thanks,
Jarek P.


* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
       [not found] ` <20091217123153.GA31131@gondor.apana.org.au>
@ 2009-12-17 12:56   ` Krishna Kumar2
  2009-12-17 13:40     ` Herbert Xu
  0 siblings, 1 reply; 58+ messages in thread
From: Krishna Kumar2 @ 2009-12-17 12:56 UTC (permalink / raw)
  To: Herbert Xu; +Cc: davem, mst, netdev, Rusty Russell, Sridhar Samudrala

Herbert Xu <herbert@gondor.apana.org.au> wrote on 12/17/2009 06:01:53 PM:
>
> > On a (slightly) related note, qdisc_restart() has this code:
> >    /* Dequeue packet */
> >    skb = dequeue_skb(q);
> >    if (unlikely(!skb))
> >       return 0;
> >
> > When a txq is stopped, all subsequent dev_queue_xmits will
> > execute this path, pass the "unlikely" code, and return. Is
> > it reasonable to remove "unlikely" in both dequeue_skb and
> > qdisc_restart, if so patch inlined below:
>
> Why would it return skb == NULL? If the queue is stopped, then
> chances are packets will pile up in the qdisc so dequeue_skb
> should return a non-NULL skb.

dequeue_skb finds the first skb that was cached in gso_skb,
handles the stopped txq, and returns NULL for all subsequent skbs,
so qdisc_restart bails out immediately. Only the first skb will
get requeued, and no more skbs are taken out of the queue -
either via dequeue or gso_skb - till the driver restarts the queue.
That's why I think this code is not the issue, and it is probably as
optimized as possible for all cases.

I still feel the device is wrongly waking up the txq; the next
tx fails, the driver stops the queue, the core does another
requeue/resched, and so on.

thanks,

- KK



* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 12:42                               ` Jarek Poplawski
@ 2009-12-17 12:56                                 ` Herbert Xu
  2009-12-17 13:22                                   ` Krishna Kumar2
  2009-12-17 13:04                                 ` Krishna Kumar2
  1 sibling, 1 reply; 58+ messages in thread
From: Herbert Xu @ 2009-12-17 12:56 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Krishna Kumar2, David S. Miller, mst, netdev, Rusty Russell,
	Sridhar Samudrala

On Thu, Dec 17, 2009 at 12:42:54PM +0000, Jarek Poplawski wrote:
> On Thu, Dec 17, 2009 at 05:57:15PM +0530, Krishna Kumar2 wrote:
> > the first resched. I also think the resched in requeue is probably
> > not required, since the device will call netif_tx_wake_queue anyway.

Actually, the requeue is there for the case where we hit lock
contention on the device TX lock, so it is needed for that at
least.

However, for the case where the device queue is stopped you're
right we don't need to reschedule.

Cheers,


* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 12:42                               ` Jarek Poplawski
  2009-12-17 12:56                                 ` Herbert Xu
@ 2009-12-17 13:04                                 ` Krishna Kumar2
  1 sibling, 0 replies; 58+ messages in thread
From: Krishna Kumar2 @ 2009-12-17 13:04 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David S. Miller, Herbert Xu, mst, netdev, Rusty Russell,
	Sridhar Samudrala

Jarek Poplawski <jarkao2@gmail.com> wrote on 12/17/2009 06:12:54 PM:
>
> On Thu, Dec 17, 2009 at 05:57:15PM +0530, Krishna Kumar2 wrote:
> > the first resched. I also think the resched in requeue is probably
> > not required, since the device will call netif_tx_wake_queue anyway.
>
> IIRC there were mentioned some possibly broken drivers around these
> changes. So it would require a little bit of reviewing...

Yes, that was the reason I dropped that patch before submitting. I
was not sure of that optimization since I didn't check that all
drivers, especially antique ones, would resched correctly, but it
was only a minor optimization (avoiding one call of resched per stop).

However, the qdisc_restart/dev_dequeue path is clean, since the stopped
txq case is handled very early and no actual dequeue/requeue/resched
happens till the device restarts the txq.

Thanks,

- KK



* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 11:56                       ` Krishna Kumar2
@ 2009-12-17 13:17                         ` Jarek Poplawski
  2009-12-17 14:10                           ` Krishna Kumar2
  0 siblings, 1 reply; 58+ messages in thread
From: Jarek Poplawski @ 2009-12-17 13:17 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: Herbert Xu, mst, netdev, Rusty Russell, Sridhar Samudrala

On Thu, Dec 17, 2009 at 05:26:37PM +0530, Krishna Kumar2 wrote:
> Sridhar is seeing 280K requeue's, and that probably implies device
> was stopped and wrongly restarted immediately. So the next xmit in
> the kernel found the txq is not stopped and called the xmit handler,
> get a BUSY, requeue, and so on. That would also explain why his BW
> drops so much - all false starts (besides 19% of all skbs being
> requeued). I assume that each time when we check:
> 
>       if (!netif_tx_queue_stopped(txq) && !netif_tx_queue_frozen(txq))
>             ret = dev_hard_start_xmit(skb, dev, txq);
> it passes the check and dev_hard_start_xmit is called wrongly.
> 
> #Requeues: 283575
> #total skbs: 1469482
> Percentage requeued: 19.29%

I haven't followed this thread, so I'm not sure what you are looking
for, but couldn't these requeues/drops mean some hardware limit was
reached? I also wonder why linux-2.6.32 and 2.6.31.6 are compared
with different test conditions (avg. packet sizes: 16800 vs. 64400)?

Jarek P.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 12:56                                 ` Herbert Xu
@ 2009-12-17 13:22                                   ` Krishna Kumar2
  0 siblings, 0 replies; 58+ messages in thread
From: Krishna Kumar2 @ 2009-12-17 13:22 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David S. Miller, Jarek Poplawski, mst, netdev, Rusty Russell,
	Sridhar Samudrala

Herbert Xu <herbert@gondor.apana.org.au> wrote on 12/17/2009 06:26:54 PM:

>
> > On Thu, Dec 17, 2009 at 05:57:15PM +0530, Krishna Kumar2 wrote:
> > > the first resched. I also think the resched in requeue is probably
> > > not required, since the device will call netif_tx_wake_queue anyway.
>
> Actually, the requeue is there for the case where we hit lock
> contention on the device TX lock so it is needed for that at
> least.
>
> However, for the case where the device queue is stopped you're
> right we don't need to reschedule.

Yes, my patch had a handle for both cases by adding another arg
to dev_requeue_skb, something like:

static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q,
                                  int reason)
{
      q->gso_skb = skb;
      q->qstats.requeues++;
      q->q.qlen++;    /* it's still part of the queue */

      if (reason != NETDEV_TX_BUSY)
            __netif_schedule(q);

      return 0;
}

and pass the reason in the two places that call requeue. I didn't
submit it since it felt a small optimization - avoiding one resched
per stopped txq (not for every skb), plus I was not sure if every
driver would properly call netif_tx_wake_queue.

thanks,

- KK


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 12:56   ` Krishna Kumar2
@ 2009-12-17 13:40     ` Herbert Xu
  2009-12-17 13:56       ` Krishna Kumar2
  0 siblings, 1 reply; 58+ messages in thread
From: Herbert Xu @ 2009-12-17 13:40 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: davem, mst, netdev, Rusty Russell, Sridhar Samudrala

On Thu, Dec 17, 2009 at 06:26:33PM +0530, Krishna Kumar2 wrote:
>
> dequeue_skb finds the first skb that was cached in gso_skb, handles
> the stopped txq, and returns NULL for all subsequent skbs, so
> qdisc_restart bails out immediately. Only the first skb will get
> requeued, and no more skbs are taken out of the queue - either via
> dequeue or gso_skb - till the driver restarts the queue.

But dev_requeue_skb immediately reschedules a new TX softirq
so this will keep going until the hardware wakes up the queue.

That is, we have an infinite loop consuming CPU that could otherwise
be spent doing useful work, such as generating new packets.

Note that this is not new in 2.6.32.  It has been this way since
2008 at least.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 12:27                             ` Krishna Kumar2
  2009-12-17 12:42                               ` Jarek Poplawski
@ 2009-12-17 13:44                               ` Herbert Xu
  2009-12-17 14:35                                 ` Krishna Kumar2
  1 sibling, 1 reply; 58+ messages in thread
From: Herbert Xu @ 2009-12-17 13:44 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: David S. Miller, Jarek Poplawski, mst, netdev, Rusty Russell,
	Sridhar Samudrala

On Thu, Dec 17, 2009 at 05:57:15PM +0530, Krishna Kumar2 wrote:
> 
> I am confused. Isn't dequeue_skb returning NULL for 2nd - nth skbs
> till the queue is restarted, so how is it broken?

Sorry I didn't read dev_dequeue carefully enough.  Indeed it
correctly checks the queue status so the loop that I thought
was there doesn't exist.

The requeues are probably caused by the driver still.  Was Sridhar
testing Rusty's latest patch?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 13:40     ` Herbert Xu
@ 2009-12-17 13:56       ` Krishna Kumar2
  0 siblings, 0 replies; 58+ messages in thread
From: Krishna Kumar2 @ 2009-12-17 13:56 UTC (permalink / raw)
  To: Herbert Xu; +Cc: davem, mst, netdev, Rusty Russell, Sridhar Samudrala

Herbert Xu <herbert@gondor.apana.org.au> wrote on 12/17/2009 07:10:53 PM:

> > Since dequeue_skb finds the first skb that was cached in gso_skb,
> > handles the stopped txq and returns NULL for all subsequent skbs.
> > And qdisc_restart bails out immediately. Only the first skb will
> > get requeue'd, and no more skbs are taken out of the queue -
> > either dequeue or gso_skb, till the driver restarts the queue.
>
> But dev_requeue_skb immediately reschedules a new TX softirq
> so this will keep going until the hardware wakes up the queue.
>
> That is, we have an infinite loop consuming CPU that could otherwise
> be spent doing useful work, such as generating new packets.

Except that dev_requeue_skb is never called again till the device
queue is restarted and stopped again. It is called once per stop, and
after this, dequeue_skb never returns an skb, hence no more requeues.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 13:17                         ` Jarek Poplawski
@ 2009-12-17 14:10                           ` Krishna Kumar2
  2009-12-17 14:16                             ` Herbert Xu
  0 siblings, 1 reply; 58+ messages in thread
From: Krishna Kumar2 @ 2009-12-17 14:10 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Herbert Xu, mst, netdev, Rusty Russell, Sridhar Samudrala

Jarek Poplawski <jarkao2@gmail.com> wrote on 12/17/2009 06:47:09 PM:

> On Thu, Dec 17, 2009 at 05:26:37PM +0530, Krishna Kumar2 wrote:
> > Sridhar is seeing 280K requeue's, and that probably implies device
> > was stopped and wrongly restarted immediately. So the next xmit in
> > the kernel found the txq is not stopped and called the xmit handler,
> > get a BUSY, requeue, and so on. That would also explain why his BW
> > drops so much - all false starts (besides 19% of all skbs being
> > requeued). I assume that each time when we check:
> >
> >       if (!netif_tx_queue_stopped(txq) && !netif_tx_queue_frozen(txq))
> >             ret = dev_hard_start_xmit(skb, dev, txq);
> > it passes the check and dev_hard_start_xmit is called wrongly.
> >
> > #Requeues: 283575
> > #total skbs: 1469482
> > Percentage requeued: 19.29%
>
> I haven't followed this thread, so I'm not sure what are you looking
> for, but can't these requeues/drops mean some hardware limits were
> reached? I wonder why there are compared linux-2.6.32 vs. 2.6.31.6
> with different test conditions (avg. packet sizes: 16800 vs. 64400)?

Hi Jarek,

That is a good point. I am not sure why the avg packet sizes are
so different in the bstats. Did GSO change in these two versions?

I took the numbers from Sridhar's mail before the NAPI patch. I think
having 280K requeues in 1 min means that the driver is waking up the
queue when it should not. The NAPI patch fixes that, but he still
reported seeing requeues.

thanks,

- KK


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 14:10                           ` Krishna Kumar2
@ 2009-12-17 14:16                             ` Herbert Xu
  0 siblings, 0 replies; 58+ messages in thread
From: Herbert Xu @ 2009-12-17 14:16 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: Jarek Poplawski, mst, netdev, Rusty Russell, Sridhar Samudrala

On Thu, Dec 17, 2009 at 07:40:32PM +0530, Krishna Kumar2 wrote:
> 
> I took the numbers from Sridhar's mail before the NAPI patch. I think
> having 280K requeue's in 1 min means that the driver is waking up the
> queue when it should not. The NAPI patch fixes that, but he still
> reported seeing requeue's.

Since we're still in the dark as to why this is happening, it
would be really helpful if we could test 2.6.31's virtio-net
under 2.6.32 or vice versa to determine whether it is a driver
change or a core change that's creating this regression.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 13:44                               ` Herbert Xu
@ 2009-12-17 14:35                                 ` Krishna Kumar2
  2009-12-17 14:36                                   ` Herbert Xu
  0 siblings, 1 reply; 58+ messages in thread
From: Krishna Kumar2 @ 2009-12-17 14:35 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David S. Miller, Jarek Poplawski, mst, netdev, Rusty Russell,
	Sridhar Samudrala

Herbert Xu <herbert@gondor.apana.org.au> wrote on 12/17/2009 07:14:08 PM:

> > I am confused. Isn't dequeue_skb returning NULL for 2nd - nth skbs
> > till the queue is restarted, so how is it broken?
>
> Sorry I didn't read dev_dequeue carefully enough.  Indeed it
> correctly checks the queue status so the loop that I thought
> was there doesn't exist.
>
> The requeues are probably caused by the driver still.  Was Sridhar
> testing Rusty's latest patch?

I was using numbers from his first test run. He has not posted
the requeue numbers for the NAPI run. But he ran with NAPI, and
said:

"I had to change virtnet_xmit_poll() to get it working. See below.
With this change, i don't see any 'queue full' warnings, but requeues
are still happening at the qdisc level (sch_direct_xmit() finds that
tx queue is stopped and does requeues)".

I think the bug is in this check:

+     if (vi->capacity >= 2 + MAX_SKB_FRAGS) {
+           /* Suppress further xmit interrupts. */
+           vi->svq->vq_ops->disable_cb(vi->svq);
+           napi_complete(xmit_napi);
+
+           /* Don't wake it if link is down. */
+           if (likely(netif_carrier_ok(vi->vdev)))
+                 netif_wake_queue(vi->dev);
+     }

We wake up too fast, with just enough space for one more skb to be
sent before the queue is stopped again. Hence no more messages about
the queue being full, but lots of requeues. The qdisc code is doing
the correct thing, but we need to increase the limit here.

Can we try with some big number, 64, 128?

Thanks,

- KK


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 14:35                                 ` Krishna Kumar2
@ 2009-12-17 14:36                                   ` Herbert Xu
  2009-12-17 21:50                                     ` Sridhar Samudrala
  0 siblings, 1 reply; 58+ messages in thread
From: Herbert Xu @ 2009-12-17 14:36 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: David S. Miller, Jarek Poplawski, mst, netdev, Rusty Russell,
	Sridhar Samudrala

On Thu, Dec 17, 2009 at 08:05:02PM +0530, Krishna Kumar2 wrote:
>
> I think the bug is in this check:
> 
> +     if (vi->capacity >= 2 + MAX_SKB_FRAGS) {
> +           /* Suppress further xmit interrupts. */
> +           vi->svq->vq_ops->disable_cb(vi->svq);
> +           napi_complete(xmit_napi);
> +
> +           /* Don't wake it if link is down. */
> +           if (likely(netif_carrier_ok(vi->vdev)))
> +                 netif_wake_queue(vi->dev);
> +     }
> 
> We wake up too fast, just enough space for one more skb to be sent
> before the queue is stopped again. And hence no more messages about
> queue full, but lot of requeues. The qdisc code is doing the correct
> thing, but we need to increase the limit here.
> 
> Can we try with some big number, 64, 128?

Good point.  Sridhar, could you please test doubling or quadrupling
the threshold? 128 is probably a bit too much though as the ring
only has 256 entries.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 11:20 [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net Krishna Kumar
@ 2009-12-17 19:28 ` Jarek Poplawski
  0 siblings, 0 replies; 58+ messages in thread
From: Jarek Poplawski @ 2009-12-17 19:28 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: davem, herbert, mst, netdev, rusty, sri

Krishna Kumar wrote, On 12/17/2009 12:20 PM:

...
> [ Resending, since I sent to wrong id last time - sorry for
> some duplicates ]
> 
> On a (slightly) related note, qdisc_restart() has this code:
> 	/* Dequeue packet */
> 	skb = dequeue_skb(q);
> 	if (unlikely(!skb))
> 		return 0;
> 
> When a txq is stopped, all subsequent dev_queue_xmits will
> execute this path, pass the "unlikely" code, and return. Is
> it reasonable to remove "unlikely" in both dequeue_skb and
> qdisc_restart, if so patch inlined below:


IMHO we should rather optimize for the "non-stopped" case even if
the two cases were equally probable, but I doubt they are,
especially wrt. modern super-fast NICs, so it would be nice to
back such a change with some test, I guess.

Thanks,
Jarek P.

> 
> thanks,
> 
> - KK
> 
> From: Krishna Kumar <krkumar2@in.ibm.com>
> 
> 1. Remove two unlikely checks, since stopped queues will result
> 	in getting a NULL skb.
> 2. Remove an extra space after unlikely check.
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> ---
>  net/sched/sch_generic.c |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff -ruNp org/net/sched/sch_generic.c new/net/sched/sch_generic.c
> --- org/net/sched/sch_generic.c	2009-12-17 16:29:59.000000000 +0530
> +++ new/net/sched/sch_generic.c	2009-12-17 16:30:55.000000000 +0530
> @@ -51,7 +51,7 @@ static inline struct sk_buff *dequeue_sk
>  {
>  	struct sk_buff *skb = q->gso_skb;
>  
> -	if (unlikely(skb)) {
> +	if (skb) {
>  		struct net_device *dev = qdisc_dev(q);
>  		struct netdev_queue *txq;
>  
> @@ -134,7 +134,7 @@ int sch_direct_xmit(struct sk_buff *skb,
>  		ret = handle_dev_cpu_collision(skb, txq, q);
>  	} else {
>  		/* Driver returned NETDEV_TX_BUSY - requeue skb */
> -		if (unlikely (ret != NETDEV_TX_BUSY && net_ratelimit()))
> +		if (unlikely(ret != NETDEV_TX_BUSY && net_ratelimit()))
>  			printk(KERN_WARNING "BUG %s code %d qlen %d\n",
>  			       dev->name, ret, q->q.qlen);
>  
> @@ -176,7 +176,7 @@ static inline int qdisc_restart(struct Q
>  
>  	/* Dequeue packet */
>  	skb = dequeue_skb(q);
> -	if (unlikely(!skb))
> +	if (!skb)
>  		return 0;
>  
>  	root_lock = qdisc_lock(q);
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 14:36                                   ` Herbert Xu
@ 2009-12-17 21:50                                     ` Sridhar Samudrala
  2009-12-17 22:28                                       ` Sridhar Samudrala
                                                         ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Sridhar Samudrala @ 2009-12-17 21:50 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Krishna Kumar2, David S. Miller, Jarek Poplawski, mst, netdev,
	Rusty Russell

On Thu, 2009-12-17 at 22:36 +0800, Herbert Xu wrote:
> On Thu, Dec 17, 2009 at 08:05:02PM +0530, Krishna Kumar2 wrote:
> >
> > I think the bug is in this check:
> > 
> > +     if (vi->capacity >= 2 + MAX_SKB_FRAGS) {
> > +           /* Suppress further xmit interrupts. */
> > +           vi->svq->vq_ops->disable_cb(vi->svq);
> > +           napi_complete(xmit_napi);
> > +
> > +           /* Don't wake it if link is down. */
> > +           if (likely(netif_carrier_ok(vi->vdev)))
> > +                 netif_wake_queue(vi->dev);
> > +     }
> > 
> > We wake up too fast, just enough space for one more skb to be sent
> > before the queue is stopped again. And hence no more messages about
> > queue full, but lot of requeues. The qdisc code is doing the correct
> > thing, but we need to increase the limit here.
> > 
> > Can we try with some big number, 64, 128?
> 
> Good point.  Sridhar, could you please test doubling or quadrupling
> the threshold? 128 is probably a bit too much though as the ring
> only has 256 entries.

Increasing the wakeup threshold value reduced the number of requeues, but
didn't eliminate them. The throughput improved a little, but the CPU utilization
went up.
I don't see any 'queue full' warning messages from the driver and hence the driver
is not returning NETDEV_TX_BUSY. The requeues are happening in sch_direct_xmit()
as it is finding that the tx queue is stopped.

I could not get 2.6.31 virtio-net driver to work with 2.6.32 kernel by simply
replacing virtio-net.c. The compile and build went through fine, but the guest
is not seeing the virtio-net device when it comes up. 
I think it is a driver issue, not a core issue as I am able to get good results
by not stopping the queue early in start_xmit() and dropping the skb when xmit_skb()
fails even with 2.6.32 kernel. I think this behavior is somewhat similar to 2.6.31
virtio-net driver as it caches 1 skb and drops any further skbs when ring is full
in its start_xmit routine. 

2.6.32 + Rusty's xmit_napi v2 patch
-----------------------------------
$ ./netperf -c -C -H 192.168.122.1 -t TCP_STREAM  -l60
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.122.1 (192.168.122.1) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    60.03      3255.22   87.16    82.57    2.193   4.156  
$ tc -s qdisc show dev eth0 
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 24524210050 bytes 1482737 pkt (dropped 0, overlimits 0 requeues 339101) 
 rate 0bit 0pps backlog 0b 0p requeues 339101 

2.6.32 + Rusty's xmit_napi v2 patch + wakeup threshold=64
---------------------------------------------------------
$ ./netperf -c -C -H 192.168.122.1 -t TCP_STREAM  -l60
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.122.1 (192.168.122.1) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    60.03      3356.71   95.41    89.56    2.329   4.372  
[sridhar@localhost ~]$ tc -s qdisc show dev eth0 
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 25290227186 bytes 1555119 pkt (dropped 0, overlimits 0 requeues 78179) 
 rate 0bit 0pps backlog 0b 0p requeues 78179 

2.6.32 + Rusty's xmit_napi v2 patch + wakeup threshold=128
----------------------------------------------------------
$./netperf -c -C -H 192.168.122.1 -t TCP_STREAM  -l60
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.122.1 (192.168.122.1) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    60.03      3413.79   96.30    89.79    2.311   4.309  
[sridhar@localhost ~]$ tc -s qdisc show dev eth0 
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 25719585472 bytes 1579448 pkt (dropped 0, overlimits 0 requeues 40299) 
 rate 0bit 0pps backlog 0b 0p requeues 40299 

2.6.32 + Rusty's xmit_napi v2 patch + don't stop early & drop skb on fail patch
-------------------------------------------------------------------------------
$./netperf -c -C -H 192.168.122.1 -t TCP_STREAM  -l60
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.122.1 (192.168.122.1) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    60.03      7741.65   70.09    72.84    0.742   1.542  
[sridhar@localhost ~]$ tc -s qdisc show dev eth0 
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 58149531018 bytes 897991 pkt (dropped 0, overlimits 0 requeues 1) 
 rate 0bit 0pps backlog 0b 0p requeues 1




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 21:50                                     ` Sridhar Samudrala
@ 2009-12-17 22:28                                       ` Sridhar Samudrala
  2009-12-17 22:41                                         ` Jarek Poplawski
  2009-12-18 13:46                                       ` Krishna Kumar2
  2009-12-18 19:13                                       ` Sridhar Samudrala
  2 siblings, 1 reply; 58+ messages in thread
From: Sridhar Samudrala @ 2009-12-17 22:28 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Krishna Kumar2, David S. Miller, Jarek Poplawski, mst, netdev,
	Rusty Russell

On Thu, 2009-12-17 at 13:50 -0800, Sridhar Samudrala wrote:
> On Thu, 2009-12-17 at 22:36 +0800, Herbert Xu wrote:
> > On Thu, Dec 17, 2009 at 08:05:02PM +0530, Krishna Kumar2 wrote:
> > >
> > > I think the bug is in this check:
> > > 
> > > +     if (vi->capacity >= 2 + MAX_SKB_FRAGS) {
> > > +           /* Suppress further xmit interrupts. */
> > > +           vi->svq->vq_ops->disable_cb(vi->svq);
> > > +           napi_complete(xmit_napi);
> > > +
> > > +           /* Don't wake it if link is down. */
> > > +           if (likely(netif_carrier_ok(vi->vdev)))
> > > +                 netif_wake_queue(vi->dev);
> > > +     }
> > > 
> > > We wake up too fast, just enough space for one more skb to be sent
> > > before the queue is stopped again. And hence no more messages about
> > > queue full, but lot of requeues. The qdisc code is doing the correct
> > > thing, but we need to increase the limit here.
> > > 
> > > Can we try with some big number, 64, 128?
> > 
> > Good point.  Sridhar, could you please test doubling or quadrupling
> > the threshold? 128 is probably a bit too much though as the ring
> > only has 256 entries.
> 
> Increasing the wakeup threshold value reduced the number of requeues, but
> didn't eliminate them. The throughput improved a little, but the CPU utilization
> went up.
> I don't see any 'queue full' warning messages from the driver and hence the driver
> is not returning NETDEV_TX_BUSY. The requeues are happening in sch_direct_xmit()
> as it is finding that the tx queue is stopped.
> 
> I could not get 2.6.31 virtio-net driver to work with 2.6.32 kernel by simply
> replacing virtio-net.c. The compile and build went through fine, but the guest
> is not seeing the virtio-net device when it comes up. 
> I think it is a driver issue, not a core issue as I am able to get good results
> by not stopping the queue early in start_xmit() and dropping the skb when xmit_skb()
> fails even with 2.6.32 kernel. I think this behavior is somewhat similar to 2.6.31
> virtio-net driver as it caches 1 skb and drops any further skbs when ring is full
> in its start_xmit routine. 

Another interesting observation that Jarek pointed out is that if we drop skbs, I guess
it causes TCP to back off, enabling the TCP GSO code to send much larger packets (64K
compared to 16K). So although the throughput is higher, packets/sec is lower in that
case.

> 
> 2.6.32 + Rusty's xmit_napi v2 patch
> -----------------------------------
> $ ./netperf -c -C -H 192.168.122.1 -t TCP_STREAM  -l60
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.122.1 (192.168.122.1) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> 
>  87380  16384  16384    60.03      3255.22   87.16    82.57    2.193   4.156  
> $ tc -s qdisc show dev eth0 
> qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 24524210050 bytes 1482737 pkt (dropped 0, overlimits 0 requeues 339101) 
>  rate 0bit 0pps backlog 0b 0p requeues 339101 
> 
> 2.6.32 + Rusty's xmit_napi v2 patch + wakeup threshold=64
> ---------------------------------------------------------
> $ ./netperf -c -C -H 192.168.122.1 -t TCP_STREAM  -l60
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.122.1 (192.168.122.1) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> 
>  87380  16384  16384    60.03      3356.71   95.41    89.56    2.329   4.372  
> [sridhar@localhost ~]$ tc -s qdisc show dev eth0 
> qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 25290227186 bytes 1555119 pkt (dropped 0, overlimits 0 requeues 78179) 
>  rate 0bit 0pps backlog 0b 0p requeues 78179 
> 
> 2.6.32 + Rusty's xmit_napi v2 patch + wakeup threshold=128
> ----------------------------------------------------------
> $./netperf -c -C -H 192.168.122.1 -t TCP_STREAM  -l60
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.122.1 (192.168.122.1) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> 
>  87380  16384  16384    60.03      3413.79   96.30    89.79    2.311   4.309  
> [sridhar@localhost ~]$ tc -s qdisc show dev eth0 
> qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 25719585472 bytes 1579448 pkt (dropped 0, overlimits 0 requeues 40299) 
>  rate 0bit 0pps backlog 0b 0p requeues 40299 
> 
> 2.6.32 + Rusty's xmit_napi v2 patch + don't stop early & drop skb on fail patch
> -------------------------------------------------------------------------------
> $./netperf -c -C -H 192.168.122.1 -t TCP_STREAM  -l60
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.122.1 (192.168.122.1) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> 
>  87380  16384  16384    60.03      7741.65   70.09    72.84    0.742   1.542  
> [sridhar@localhost ~]$ tc -s qdisc show dev eth0 
> qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 58149531018 bytes 897991 pkt (dropped 0, overlimits 0 requeues 1) 
>  rate 0bit 0pps backlog 0b 0p requeues 1
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 22:28                                       ` Sridhar Samudrala
@ 2009-12-17 22:41                                         ` Jarek Poplawski
  0 siblings, 0 replies; 58+ messages in thread
From: Jarek Poplawski @ 2009-12-17 22:41 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: Herbert Xu, Krishna Kumar2, David S. Miller, mst, netdev,
	Rusty Russell

On Thu, Dec 17, 2009 at 02:28:10PM -0800, Sridhar Samudrala wrote:
> Another interesting observation that Jarek pointed out is that if we drop skb's, i guess
> it is causing TCP to backoff and enabling TCP GSO code to send much larger packets(64K 
> compared to 16K). So although the throughput is higher, packets/sec is less in that
> case.

Maybe you're right, but there would be useful to have TCP-independent
tests to compare, or at least without such impact, so with GSO disabled.

Jarek P.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17  9:25             ` Michael S. Tsirkin
@ 2009-12-18  1:55               ` Rusty Russell
  0 siblings, 0 replies; 58+ messages in thread
From: Rusty Russell @ 2009-12-18  1:55 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Herbert Xu, Sridhar Samudrala, netdev

On Thu, 17 Dec 2009 07:55:31 pm Michael S. Tsirkin wrote:
> > +static int virtnet_xmit_poll(struct napi_struct *xmit_napi, int budget)
> > +{
> > +	struct virtnet_info *vi =
> > +		container_of(xmit_napi, struct virtnet_info, xmit_napi);
> > +
> > +	/* Don't access vq/capacity at same time as start_xmit. */
> > +	__netif_tx_lock(netdev_get_tx_queue(vi->dev, 0), smp_processor_id());
> 
> So now that we are locking, we could build a variant of this
> without NAPI (maybe with trylock: we can't spin on the xmit lock
> from hard irq context, can we?)? Possibly, if we do, that would be
> a small enough change to be applicable in 2.6.32.

We'd need a separate lock and irq disable; it's not obvious to me that
trylock would never cause us to miss a required wakeup.

I'd rather get this right and backport it, than introduce YA random change.
The NAPI change is no more complex than this.

From what Sridhar has said, this doesn't even fix the issue.  I'm confused...

Rusty.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 21:50                                     ` Sridhar Samudrala
  2009-12-17 22:28                                       ` Sridhar Samudrala
@ 2009-12-18 13:46                                       ` Krishna Kumar2
  2009-12-18 19:13                                       ` Sridhar Samudrala
  2 siblings, 0 replies; 58+ messages in thread
From: Krishna Kumar2 @ 2009-12-18 13:46 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: David S. Miller, Herbert Xu, Jarek Poplawski, mst, netdev,
	Rusty Russell

Sridhar Samudrala <sri@us.ibm.com> wrote on 12/18/2009 03:20:08 AM:

> Increasing the wakeup threshold value reduced the number of requeues,
> but didn't eliminate them. The throughput improved a little, but the
> CPU utilization went up.
> I don't see any 'queue full' warning messages from the driver and
> hence the driver
> is not returning NETDEV_TX_BUSY. The requeues are happening in
> sch_direct_xmit()
> as it is finding that the tx queue is stopped.
>
> I could not get the 2.6.31 virtio-net driver to work with the 2.6.32
> kernel by simply replacing virtio-net.c. The compile and build went
> through fine, but the guest is not seeing the virtio-net device when
> it comes up.
> I think it is a driver issue, not a core issue, as I am able to get
> good results by not stopping the queue early in start_xmit() and
> dropping the skb when xmit_skb() fails, even with the 2.6.32 kernel.
> I think this behavior is somewhat similar to the 2.6.31 virtio-net
> driver, as it caches 1 skb and drops any further skbs when the ring
> is full in its start_xmit routine.
>
> 2.6.32 + Rusty's xmit_napi v2 patch
> -----------------------------------
> $ ./netperf -c -C -H 192.168.122.1 -t TCP_STREAM  -l60
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.
> 122.1 (192.168.122.1) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
>
>  87380  16384  16384    60.03      3255.22   87.16    82.57    2.193   4.156
> $ tc -s qdisc show dev eth0
> qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 24524210050 bytes 1482737 pkt (dropped 0, overlimits 0 requeues 339101)
>  rate 0bit 0pps backlog 0b 0p requeues 339101
>
> 2.6.32 + Rusty's xmit_napi v2 patch + wakeup threshold=64
> ---------------------------------------------------------
> $ ./netperf -c -C -H 192.168.122.1 -t TCP_STREAM  -l60
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.
> 122.1 (192.168.122.1) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
>
>  87380  16384  16384    60.03      3356.71   95.41    89.56    2.329   4.372
> [sridhar@localhost ~]$ tc -s qdisc show dev eth0
> qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 25290227186 bytes 1555119 pkt (dropped 0, overlimits 0 requeues 78179)
>  rate 0bit 0pps backlog 0b 0p requeues 78179
>
> 2.6.32 + Rusty's xmit_napi v2 patch + wakeup threshold=128
> ----------------------------------------------------------
> $./netperf -c -C -H 192.168.122.1 -t TCP_STREAM  -l60
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.
> 122.1 (192.168.122.1) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
>
>  87380  16384  16384    60.03      3413.79   96.30    89.79    2.311   4.309
> [sridhar@localhost ~]$ tc -s qdisc show dev eth0
> qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 25719585472 bytes 1579448 pkt (dropped 0, overlimits 0 requeues 40299)
>  rate 0bit 0pps backlog 0b 0p requeues 40299
>
> 2.6.32 + Rusty's xmit_napi v2 patch + don't stop early & drop skb on fail patch
> -------------------------------------------------------------------------------

> $./netperf -c -C -H 192.168.122.1 -t TCP_STREAM  -l60
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.
> 122.1 (192.168.122.1) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
>
>  87380  16384  16384    60.03      7741.65   70.09    72.84    0.742   1.542
> [sridhar@localhost ~]$ tc -s qdisc show dev eth0
> qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 58149531018 bytes 897991 pkt (dropped 0, overlimits 0 requeues 1)
>  rate 0bit 0pps backlog 0b 0p requeues 1

Is the "drop skb" patch doing:
-                 return NETDEV_TX_BUSY;
+
+                 /* drop the skb under stress. */
+                 vi->dev->stats.tx_dropped++;
+                 kfree_skb(skb);
+                 return NETDEV_TX_OK;

Why is dropped count zero in the last test case?

sch_direct_xmit is called from two places, and if it finds
the txq stopped, it was called from __dev_xmit_skb (where
the previous successful xmit had stopped the queue). This
means the device is still stopping and restarting 1000's
of times a minute, and each restart fills up the device h/w
queue with the backlogged skbs, resulting in another stop.
Isn't the txqlen set to 1000 in ether_setup? Can you
increase the restart limit to a really high value, like
requiring 1/2 or 3/4 of the queue to be empty? Another
thing to test is to simultaneously set txqueuelen to a
big value.

Requeue does not seem to be the reason for the BW drop, since it
barely improved when requeues were reduced from 340K to 40K.
So, as Jarek suggested, GSO could be the reason. You could try
testing with a 64K I/O size (with GSO enabled) to get
comparable results.

thanks,

- KK


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net
  2009-12-17 21:50                                     ` Sridhar Samudrala
  2009-12-17 22:28                                       ` Sridhar Samudrala
  2009-12-18 13:46                                       ` Krishna Kumar2
@ 2009-12-18 19:13                                       ` Sridhar Samudrala
  2 siblings, 0 replies; 58+ messages in thread
From: Sridhar Samudrala @ 2009-12-18 19:13 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: David S. Miller, Herbert Xu, Jarek Poplawski, mst, netdev,
	Rusty Russell

On Fri, 2009-12-18 at 19:16 +0530, Krishna Kumar2 wrote:

> >
> > 2.6.32 + Rusty's xmit_napi v2 patch + don't stop early & drop skb on fail patch
> > -------------------------------------------------------------------------------
> 
> > $./netperf -c -C -H 192.168.122.1 -t TCP_STREAM  -l60
> > TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.
> > 122.1 (192.168.122.1) port 0 AF_INET
> > Recv   Send    Send                          Utilization       Service Demand
> > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> > Size   Size    Size     Time     Throughput  local    remote   local   remote
> > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> >
> >  87380  16384  16384    60.03      7741.65   70.09    72.84    0.742   1.542
> > [sridhar@localhost ~]$ tc -s qdisc show dev eth0
> > qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
> >  Sent 58149531018 bytes 897991 pkt (dropped 0, overlimits 0 requeues 1)
> >  rate 0bit 0pps backlog 0b 0p requeues 1
> 
> Is the "drop skb" patch doing:
> -                 return NETDEV_TX_BUSY;
> +
> +                 /* drop the skb under stress. */
> +                 vi->dev->stats.tx_dropped++;
> +                 kfree_skb(skb);
> +                 return NETDEV_TX_OK;

Yes. This is the patch I used with plain 2.6.32. But with Rusty's patch,
I also commented out the if condition that stops the queue early in
start_xmit().
> 
> Why is dropped count zero in the last test case?

The dropped count reported by 'tc' covers drops at the qdisc level,
counted via qdisc_drop(). Drops at the driver level are counted in the
net_device stats and are reported by the 'ip -s link' command. I see a
few drops (5-6) in a 60-sec run with the 2.6.31 kernel.
> 
> sch_direct_xmit is called from two places, and if it finds
> the txq stopped, it was called from __dev_xmit_skb (where
> the previous successful xmit had stopped the queue). This
> means the device is still stopping and restarting 1000's
> of times a min, and each restart fills up the device h/w
> queue with the backlogged skbs resulting in another stop.
> Isn't the txqlen set to 1000 in ether_setup? Can you
> increase the restart limit to a really high value, like
> 1/2 or 3/4th of the queue should be empty? Another thing
> to test is to simultaneously set txqueuelen to a big value.

txqueuelen limits the qdisc queue, not the device transmit queue.
The device tx queue length is set by qemu and defaults to 256 for
virtio-net. So a reasonable wakeup threshold could be 64/128, and
it does reduce the number of requeues.

> 
> Requeue does not seem to be the reason for BW drop since
> it barely improved when requeue's reduced from 340K to 40K.
> So, as Jarek suggested, GSO could be reason. You could try
> testing with 64K I/O size (with GSO enabled) to get
> comparable results.

Yes. With 64K messages, I am getting comparable throughput, in fact
slightly better, although CPU utilization is higher. So it looks like
the better throughput with the 2.6.31 kernel at a 16K message size is a
side-effect of the drops.

I think Rusty's patch with 1/4 of the tx ring as the wakeup threshold
is the first step to address the queue-full warnings in 2.6.32. With
further tuning it may be possible to eliminate the requeues.

2.6.32 + Rusty's xmit_napi_v2 patch

$ ./netperf -c -C -H 192.168.122.1 -t TCP_STREAM  -l60 -- -m 65536
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.122.1 (192.168.122.1) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  65536    60.03      8200.80   92.52    91.63    0.924   1.831  
$ tc -s qdisc show dev eth0 
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 61613336514 bytes 1208233 pkt (dropped 0, overlimits 0 requeues 237750) 
 rate 0bit 0pps backlog 0b 0p requeues 237750 
$ ip -s link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 54:52:00:35:e3:74 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast   
    59348763   899170   0       0       0       0      
    TX: bytes  packets  errors  dropped carrier collsns 
    1483793932 1208230  0       0       0       0      

Thanks
Sridhar


^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2009-12-18 19:13 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-12-17 11:20 [RFC PATCH] Regression in linux 2.6.32 virtio_net seen with vhost-net Krishna Kumar
2009-12-17 19:28 ` Jarek Poplawski
     [not found] <20091217111219.9809.27432.sendpatchset@krkumar2.in.ibm.com>
     [not found] ` <20091217123153.GA31131@gondor.apana.org.au>
2009-12-17 12:56   ` Krishna Kumar2
2009-12-17 13:40     ` Herbert Xu
2009-12-17 13:56       ` Krishna Kumar2
  -- strict thread matches above, loose matches on Subject: below --
2009-12-08 22:50 Sridhar Samudrala
2009-12-13 12:25 ` Herbert Xu
2009-12-13 23:40   ` Michael S. Tsirkin
2009-12-15 14:42     ` Herbert Xu
2009-12-15 16:26       ` Sridhar Samudrala
2009-12-16  1:21         ` Herbert Xu
2009-12-15 23:32       ` Michael S. Tsirkin
2009-12-16  1:58         ` Herbert Xu
2009-12-16  4:37         ` Rusty Russell
2009-12-16 10:37           ` Michael S. Tsirkin
2009-12-16  2:41   ` Rusty Russell
2009-12-16  2:53     ` Herbert Xu
2009-12-16 12:45       ` Rusty Russell
2009-12-16 13:22         ` Michael S. Tsirkin
2009-12-16 13:35           ` Herbert Xu
2009-12-16 13:38             ` Michael S. Tsirkin
2009-12-16 13:48               ` Herbert Xu
2009-12-17  2:02           ` Rusty Russell
2009-12-17  9:25             ` Michael S. Tsirkin
2009-12-18  1:55               ` Rusty Russell
2009-12-16 13:30         ` Herbert Xu
2009-12-17  1:43         ` Sridhar Samudrala
2009-12-17  3:12           ` Herbert Xu
2009-12-17  5:02             ` Sridhar Samudrala
2009-12-17  3:15           ` Herbert Xu
2009-12-17  5:05             ` Sridhar Samudrala
2009-12-17  6:28               ` Herbert Xu
2009-12-17  6:45                 ` Sridhar Samudrala
2009-12-17 10:03                   ` Krishna Kumar2
2009-12-17 11:27                     ` Jarek Poplawski
2009-12-17 11:45                       ` Herbert Xu
2009-12-17 11:49                         ` Herbert Xu
2009-12-17 12:08                           ` Herbert Xu
2009-12-17 12:27                             ` Krishna Kumar2
2009-12-17 12:42                               ` Jarek Poplawski
2009-12-17 12:56                                 ` Herbert Xu
2009-12-17 13:22                                   ` Krishna Kumar2
2009-12-17 13:04                                 ` Krishna Kumar2
2009-12-17 13:44                               ` Herbert Xu
2009-12-17 14:35                                 ` Krishna Kumar2
2009-12-17 14:36                                   ` Herbert Xu
2009-12-17 21:50                                     ` Sridhar Samudrala
2009-12-17 22:28                                       ` Sridhar Samudrala
2009-12-17 22:41                                         ` Jarek Poplawski
2009-12-18 13:46                                       ` Krishna Kumar2
2009-12-18 19:13                                       ` Sridhar Samudrala
2009-12-17 11:59                         ` Krishna Kumar2
2009-12-17 12:19                         ` Jarek Poplawski
2009-12-17 11:56                       ` Krishna Kumar2
2009-12-17 13:17                         ` Jarek Poplawski
2009-12-17 14:10                           ` Krishna Kumar2
2009-12-17 14:16                             ` Herbert Xu
2009-12-16 17:42     ` Sridhar Samudrala

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).