netdev.vger.kernel.org archive mirror
* Re: Network performance with small packets - continued
       [not found] <201103071631.41964.tahm@linux.vnet.ibm.com>
@ 2011-03-09  7:15 ` Michael S. Tsirkin
       [not found] ` <20110309071558.GA25757@redhat.com>
  1 sibling, 0 replies; 25+ messages in thread
From: Michael S. Tsirkin @ 2011-03-09  7:15 UTC (permalink / raw)
  To: Shirley Ma, Rusty Russell, Michael S. Tsirkin, Krishna Kumar2,
	David Miller

On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote:
> We've been doing some more experimenting with the small packet network 
> performance problem in KVM.  I have a different setup than what Steve D. was 
> using so I re-baselined things on the kvm.git kernel on both the host and 
> guest with a 10GbE adapter.  I also made use of the virtio-stats patch.
> 
> The virtual machine has 2 vCPUs, 8GB of memory and two virtio network adapters 
> (the first connected to a 1GbE adapter and a LAN, the second connected to a 
> 10GbE adapter that is direct connected to another system with the same 10GbE 
> adapter) running the kvm.git kernel.  The test was a TCP_RR test with 100 
> connections from a baremetal client to the KVM guest using a 256 byte message 
> size in both directions.
> 
> I used the uperf tool to do this after verifying the results against netperf.  
> Uperf allows the specification of the number of connections as a parameter in 
> an XML file as opposed to launching, in this case, 100 separate instances of 
> netperf.
> 
> Here is the baseline for baremetal using 2 physical CPUs:
>   Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec
>   TxCPU: 7.88%  RxCPU: 99.41%
> 
> To be sure to get consistent results with KVM I disabled the hyperthreads, 
> pinned the qemu-kvm process, vCPUs, vhost thread and ethernet adapter 
> interrupts (this resulted in runs that differed by only about 2% from lowest 
> to highest).  The fact that pinning is required to get consistent results is a 
> different problem that we'll have to look into later...
> 
> Here is the KVM baseline (average of six runs):
>   Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
>   Exits: 148,444.58 Exits/Sec
>   TxCPU: 2.40%  RxCPU: 99.35%
> About 42% of baremetal.
> 

Can you add interrupt stats as well please?

> empty.  So I coded a quick patch to delay freeing of the used Tx buffers until 
> more than half the ring was used (I did not test this under a stream condition 
> so I don't know if this would have a negative impact).  Here are the results 
> from delaying the freeing of used Tx buffers (average of six runs):
>   Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
>   Exits: 142,681.67 Exits/Sec
>   TxCPU: 2.78%  RxCPU: 99.36%
> About a 4% increase over baseline and about 44% of baremetal.

Hmm, I am not sure what you mean by delaying freeing.
I think we do have a problem that free_old_xmit_skbs
tries to flush out the ring aggressively:
it always polls until the ring is empty,
so there could be bursts of activity where
we spend a lot of time flushing the old entries
before e.g. sending an ack, resulting in
latency bursts.

Generally we'll need some smarter logic,
but with indirect at the moment we can just poll
a single packet after we post a new one, and be done with it.
Is your patch something like the patch below?
Could you try mine as well please?


> This spread out the kick_notify but still resulted in a lot of them.  I decided 
> to build on the delayed Tx buffer freeing and code up an "ethtool" like 
> coalescing patch in order to delay the kick_notify until there were at least 5 
> packets on the ring or 2000 usecs, whichever occurred first.  Here are the 
> results of delaying the kick_notify (average of six runs):
>   Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
>   Exits: 102,587.28 Exits/Sec
>   TxCPU: 3.03%  RxCPU: 99.33%
> About a 23% increase over baseline and about 52% of baremetal.
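For reference, kick coalescing along those lines might look roughly like this in
start_xmit() -- a sketch only: the thresholds match the description above, but the
vi->coalesce_* fields are hypothetical and the actual patch was not posted:

#define VNET_KICK_PKTS		5
#define VNET_KICK_USECS		2000

	/* Sketch: kick the host only after VNET_KICK_PKTS packets have been
	 * queued or the coalescing deadline has passed.  vi->coalesce_pkts
	 * and vi->coalesce_deadline are hypothetical fields. */
	vi->coalesce_pkts++;
	if (vi->coalesce_pkts >= VNET_KICK_PKTS ||
	    time_after(jiffies, vi->coalesce_deadline)) {
		virtqueue_kick(vi->svq);
		vi->coalesce_pkts = 0;
		vi->coalesce_deadline = jiffies + usecs_to_jiffies(VNET_KICK_USECS);
	}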
> 
> Running the perf command against the guest I noticed almost 19% of the time 
> being spent in _raw_spin_lock.  Enabling lockstat in the guest showed a lot of 
> contention in the "irq_desc_lock_class". Pinning the virtio1-input interrupt 
> to a single cpu in the guest and re-running the last test resulted in 
> tremendous gains (average of six runs):
>   Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
>   Exits: 62,603.37 Exits/Sec
>   TxCPU: 3.73%  RxCPU: 98.52%
> About a 77% increase over baseline and about 74% of baremetal.
> 
> Vhost is receiving a lot of notifications for packets that are to be 
> transmitted (over 60% of the packets generate a kick_notify).  Also, it looks 
> like vhost is sending a lot of notifications for packets it has received 
> before the guest can get scheduled to disable notifications and begin 
> processing the packets

Hmm, is this really what happens to you?  The effect would be that guest
gets an interrupt while notifications are disabled in guest, right? Could
you add a counter and check this please?

Another possible thing to try would be these old patches to publish used index
from guest to make sure this double interrupt does not happen:
 [PATCHv2] virtio: put last seen used index into ring itself
 [PATCHv2] vhost-net: utilize PUBLISH_USED_IDX feature
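Roughly, the idea there is that the guest stores the last used index it has
processed in the ring, so the host can skip redundant interrupts.  The host-side
check would be something like the sketch below (illustrative only, not the actual
PUBLISH_USED_IDX code; the helper name is made up):

	/* Sketch: only signal the guest when it had already caught up with
	 * the used ring before this entry was added; if it is still behind,
	 * it will see the new entry on its current pass anyway. */
	u16 guest_seen = vhost_read_published_used_idx(vq);	/* hypothetical */

	if (guest_seen == (u16)(vq->last_used_idx - 1) && vq->call_ctx)
		eventfd_signal(vq->call_ctx, 1);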

> resulting in some lock contention in the guest (and 
> high interrupt rates).
> 
> Some thoughts for the transmit path...  can vhost be enhanced to do some 
> adaptive polling so that the number of kick_notify events is reduced and 
> replaced by kick_no_notify events?

Worth a try.
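A first stab at that on the vhost side might look like the sketch below: when
handle_tx() finds the ring empty, spin for a short while before falling back to
notifications, so a packet arriving right behind the previous one does not cost a
kick/exit.  Purely illustrative; the avail-empty test and the spin budget are
invented here:

	/* Sketch: brief busy-poll of an empty TX vq before re-enabling
	 * guest notifications.  tx_poll_budget and tx_avail_ring_empty()
	 * are placeholders, not existing vhost code. */
	int spins = 0;

	while (tx_avail_ring_empty(vq) && spins++ < tx_poll_budget)
		cpu_relax();
	if (spins >= tx_poll_budget) {
		vhost_enable_notify(vq);	/* give up and wait for a kick */
		break;
	}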

> 
> Comparing the transmit path to the receive path, the guest disables 
> notifications after the first kick and vhost re-enables notifications after 
> completing processing of the tx ring.

Is this really what happens? I thought the host disables notifications
after the first kick.

>  Can a similar thing be done for the 
> receive path?  Once vhost sends the first notification for a received packet 
> it can disable notifications and let the guest re-enable notifications when it 
> has finished processing the receive ring.  Also, can the virtio-net driver do 
> some adaptive polling (or does napi take care of that for the guest)?

Worth a try. I don't think napi does anything like this.

> Running the same workload on the same configuration with a different 
> hypervisor results in performance that is almost equivalent to baremetal 
> without doing any pinning.
> 
> Thanks,
> Tom Lendacky


There's no need to flush out all used buffers
before we post more for transmit: with indirect,
just a single one is enough. Without indirect we'll
need more possibly, but just for testing this should
be enough.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

---

Note: untested.

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 82dba5a..ebe3337 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -514,11 +514,11 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
 	struct sk_buff *skb;
 	unsigned int len, tot_sgs = 0;
 
-	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+	if ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 		vi->dev->stats.tx_bytes += skb->len;
 		vi->dev->stats.tx_packets++;
-		tot_sgs += skb_vnet_hdr(skb)->num_sg;
+		tot_sgs = 2+MAX_SKB_FRAGS;
 		dev_kfree_skb_any(skb);
 	}
 	return tot_sgs;
@@ -576,9 +576,6 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct virtnet_info *vi = netdev_priv(dev);
 	int capacity;
 
-	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(vi);
-
 	/* Try to transmit */
 	capacity = xmit_skb(vi, skb);
 
@@ -605,6 +602,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	skb_orphan(skb);
 	nf_reset(skb);
 
+	/* Free up any old buffers so we can queue new ones. */
+	if (capacity < 2+MAX_SKB_FRAGS)
+		capacity += free_old_xmit_skbs(vi);
+
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (capacity < 2+MAX_SKB_FRAGS) {

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
       [not found] ` <20110309071558.GA25757@redhat.com>
@ 2011-03-09 15:45   ` Shirley Ma
  2011-03-09 16:10     ` Michael S. Tsirkin
  2011-03-09 16:09   ` Tom Lendacky
  2011-03-09 20:11   ` Tom Lendacky
  2 siblings, 1 reply; 25+ messages in thread
From: Shirley Ma @ 2011-03-09 15:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rusty Russell, Krishna Kumar2, David Miller, kvm, netdev, steved,
	Tom Lendacky

On Wed, 2011-03-09 at 09:15 +0200, Michael S. Tsirkin wrote:
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 82dba5a..ebe3337 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -514,11 +514,11 @@ static unsigned int free_old_xmit_skbs(struct
> virtnet_info *vi)
>         struct sk_buff *skb;
>         unsigned int len, tot_sgs = 0;
> 
> -       while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> +       if ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
>                 pr_debug("Sent skb %p\n", skb);
>                 vi->dev->stats.tx_bytes += skb->len;
>                 vi->dev->stats.tx_packets++;
> -               tot_sgs += skb_vnet_hdr(skb)->num_sg;
> +               tot_sgs = 2+MAX_SKB_FRAGS;
>                 dev_kfree_skb_any(skb);
>         }
>         return tot_sgs;

Return value should be different based on indirect or direct buffers
here?

> @@ -576,9 +576,6 @@ static netdev_tx_t start_xmit(struct sk_buff *skb,
> struct net_device *dev)
>         struct virtnet_info *vi = netdev_priv(dev);
>         int capacity;
> 
> -       /* Free up any pending old buffers before queueing new ones.
> */
> -       free_old_xmit_skbs(vi);
> -
>         /* Try to transmit */
>         capacity = xmit_skb(vi, skb);
> 
> @@ -605,6 +602,10 @@ static netdev_tx_t start_xmit(struct sk_buff
> *skb, struct net_device *dev)
>         skb_orphan(skb);
>         nf_reset(skb);
> 
> +       /* Free up any old buffers so we can queue new ones. */
> +       if (capacity < 2+MAX_SKB_FRAGS)
> +               capacity += free_old_xmit_skbs(vi);
> +
>         /* Apparently nice girls don't return TX_BUSY; stop the queue
>          * before it gets out of hand.  Naturally, this wastes
> entries. */
>         if (capacity < 2+MAX_SKB_FRAGS) { 

I tried a similar patch before; it didn't help much on TCP stream
performance. But I didn't try multiple-stream TCP_RR.

Shirley


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
       [not found] ` <20110309071558.GA25757@redhat.com>
  2011-03-09 15:45   ` Shirley Ma
@ 2011-03-09 16:09   ` Tom Lendacky
  2011-03-09 16:21     ` Shirley Ma
                       ` (3 more replies)
  2011-03-09 20:11   ` Tom Lendacky
  2 siblings, 4 replies; 25+ messages in thread
From: Tom Lendacky @ 2011-03-09 16:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Shirley Ma, Rusty Russell, Krishna Kumar2, David Miller, kvm,
	netdev, steved

On Wednesday, March 09, 2011 01:15:58 am Michael S. Tsirkin wrote:
> On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote:
> > We've been doing some more experimenting with the small packet network
> > performance problem in KVM.  I have a different setup than what Steve D.
> > was using so I re-baselined things on the kvm.git kernel on both the
> > host and guest with a 10GbE adapter.  I also made use of the
> > virtio-stats patch.
> > 
> > The virtual machine has 2 vCPUs, 8GB of memory and two virtio network
> > adapters (the first connected to a 1GbE adapter and a LAN, the second
> > connected to a 10GbE adapter that is direct connected to another system
> > with the same 10GbE adapter) running the kvm.git kernel.  The test was a
> > TCP_RR test with 100 connections from a baremetal client to the KVM
> > guest using a 256 byte message size in both directions.
> > 
> > I used the uperf tool to do this after verifying the results against
> > netperf. Uperf allows the specification of the number of connections as
> > a parameter in an XML file as opposed to launching, in this case, 100
> > separate instances of netperf.
> > 
> > Here is the baseline for baremetal using 2 physical CPUs:
> >   Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec
> >   TxCPU: 7.88%  RxCPU: 99.41%
> > 
> > To be sure to get consistent results with KVM I disabled the
> > hyperthreads, pinned the qemu-kvm process, vCPUs, vhost thread and
> > ethernet adapter interrupts (this resulted in runs that differed by only
> > about 2% from lowest to highest).  The fact that pinning is required to
> > get consistent results is a different problem that we'll have to look
> > into later...
> > 
> > Here is the KVM baseline (average of six runs):
> >   Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
> >   Exits: 148,444.58 Exits/Sec
> >   TxCPU: 2.40%  RxCPU: 99.35%
> > 
> > About 42% of baremetal.
> 
> Can you add interrupt stats as well please?

Yes I can.  Just the guest interrupts for the virtio device?

> 
> > empty.  So I coded a quick patch to delay freeing of the used Tx buffers
> > until more than half the ring was used (I did not test this under a
> > stream condition so I don't know if this would have a negative impact). 
> > Here are the results
> > 
> > from delaying the freeing of used Tx buffers (average of six runs):
> >   Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
> >   Exits: 142,681.67 Exits/Sec
> >   TxCPU: 2.78%  RxCPU: 99.36%
> > 
> > About a 4% increase over baseline and about 44% of baremetal.
> 
> Hmm, I am not sure what you mean by delaying freeing.

In the start_xmit function of virtio_net.c the first thing done is to free any 
used entries from the ring.  I patched the code to track the number of used tx 
ring entries and only free the used entries when they are greater than half 
the capacity of the ring (similar to the way the rx ring is re-filled).
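A sketch of that idea, using hypothetical vi->tx_used and vi->tx_ring_size
counters that would be maintained as buffers are added to and reaped from the
ring:

	/* Sketch: skip reaping on most transmits and only call
	 * free_old_xmit_skbs() once more than half the TX ring is in use
	 * (mirroring how the RX ring is refilled in batches). */
	static void maybe_free_old_xmit_skbs(struct virtnet_info *vi)
	{
		if (vi->tx_used > vi->tx_ring_size / 2)
			free_old_xmit_skbs(vi);
	}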

> I think we do have a problem that free_old_xmit_skbs
> tries to flush out the ring aggressively:
> it always polls until the ring is empty,
> so there could be bursts of activity where
> we spend a lot of time flushing the old entries
> before e.g. sending an ack, resulting in
> latency bursts.
> 
> Generally we'll need some smarter logic,
> but with indirect at the moment we can just poll
> a single packet after we post a new one, and be done with it.
> Is your patch something like the patch below?
> Could you try mine as well please?

Yes, I'll try the patch and post the results.

> 
> > > This spread out the kick_notify but still resulted in a lot of them.  I
> > decided to build on the delayed Tx buffer freeing and code up an
> > "ethtool" like coalescing patch in order to delay the kick_notify until
> > there were at least 5 packets on the ring or 2000 usecs, whichever
> > occurred first.  Here are the
> > 
> > results of delaying the kick_notify (average of six runs):
> >   Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
> >   Exits: 102,587.28 Exits/Sec
> >   TxCPU: 3.03%  RxCPU: 99.33%
> > 
> > About a 23% increase over baseline and about 52% of baremetal.
> > 
> > Running the perf command against the guest I noticed almost 19% of the
> > time being spent in _raw_spin_lock.  Enabling lockstat in the guest
> > showed a lot of contention in the "irq_desc_lock_class". Pinning the
> > virtio1-input interrupt to a single cpu in the guest and re-running the
> > last test resulted in
> > 
> > tremendous gains (average of six runs):
> >   Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
> >   Exits: 62,603.37 Exits/Sec
> >   TxCPU: 3.73%  RxCPU: 98.52%
> > 
> > About a 77% increase over baseline and about 74% of baremetal.
> > 
> > Vhost is receiving a lot of notifications for packets that are to be
> > transmitted (over 60% of the packets generate a kick_notify).  Also, it
> > looks like vhost is sending a lot of notifications for packets it has
> > received before the guest can get scheduled to disable notifications and
> > begin processing the packets
> 
> Hmm, is this really what happens to you?  The effect would be that guest
> gets an interrupt while notifications are disabled in guest, right? Could
> you add a counter and check this please?

The disabling of the interrupt/notifications is done by the guest.  So the 
guest has to get scheduled and handle the notification before it disables 
them.  The vhost_signal routine will keep injecting an interrupt until this 
happens causing the contention in the guest.  I'll try the patches you specify 
below and post the results.  They look like they should take care of this 
issue.

> 
> Another possible thing to try would be these old patches to publish used
> index from guest to make sure this double interrupt does not happen:
>  [PATCHv2] virtio: put last seen used index into ring itself
>  [PATCHv2] vhost-net: utilize PUBLISH_USED_IDX feature
> 
> > resulting in some lock contention in the guest (and
> > high interrupt rates).
> > 
> > Some thoughts for the transmit path...  can vhost be enhanced to do some
> > > adaptive polling so that the number of kick_notify events is reduced and
> > replaced by kick_no_notify events?
> 
> Worth a try.
> 
> > Comparing the transmit path to the receive path, the guest disables
> > notifications after the first kick and vhost re-enables notifications
> > after completing processing of the tx ring.
> 
> Is this really what happens? I thought the host disables notifications
> after the first kick.

Yup, sorry for the confusion.  The kick is done by the guest and then vhost 
disables notifications.  Maybe a similar approach to the above patches of 
checking the used index in the virtio_net driver could also help here?

> 
> >  Can a similar thing be done for the
> > 
> > receive path?  Once vhost sends the first notification for a received
> > packet it can disable notifications and let the guest re-enable
> > notifications when it has finished processing the receive ring.  Also,
> > can the virtio-net driver do some adaptive polling (or does napi take
> > care of that for the guest)?
> 
> Worth a try. I don't think napi does anything like this.
> 
> > Running the same workload on the same configuration with a different
> > hypervisor results in performance that is almost equivalent to baremetal
> > without doing any pinning.
> > 
> > Thanks,
> > Tom Lendacky
> 
> There's no need to flush out all used buffers
> before we post more for transmit: with indirect,
> just a single one is enough. Without indirect we'll
> need more possibly, but just for testing this should
> be enough.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> ---
> 
> Note: untested.
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 82dba5a..ebe3337 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -514,11 +514,11 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
>  	struct sk_buff *skb;
>  	unsigned int len, tot_sgs = 0;
> 
> -	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> +	if ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
>  		pr_debug("Sent skb %p\n", skb);
>  		vi->dev->stats.tx_bytes += skb->len;
>  		vi->dev->stats.tx_packets++;
> -		tot_sgs += skb_vnet_hdr(skb)->num_sg;
> +		tot_sgs = 2+MAX_SKB_FRAGS;
>  		dev_kfree_skb_any(skb);
>  	}
>  	return tot_sgs;
> @@ -576,9 +576,6 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	struct virtnet_info *vi = netdev_priv(dev);
>  	int capacity;
> 
> -	/* Free up any pending old buffers before queueing new ones. */
> -	free_old_xmit_skbs(vi);
> -
>  	/* Try to transmit */
>  	capacity = xmit_skb(vi, skb);
> 
> @@ -605,6 +602,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	skb_orphan(skb);
>  	nf_reset(skb);
> 
> +	/* Free up any old buffers so we can queue new ones. */
> +	if (capacity < 2+MAX_SKB_FRAGS)
> +		capacity += free_old_xmit_skbs(vi);
> +
>  	/* Apparently nice girls don't return TX_BUSY; stop the queue
>  	 * before it gets out of hand.  Naturally, this wastes entries. */
>  	if (capacity < 2+MAX_SKB_FRAGS) {
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
  2011-03-09 15:45   ` Shirley Ma
@ 2011-03-09 16:10     ` Michael S. Tsirkin
  2011-03-09 16:25       ` Shirley Ma
  0 siblings, 1 reply; 25+ messages in thread
From: Michael S. Tsirkin @ 2011-03-09 16:10 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Rusty Russell, Krishna Kumar2, David Miller, kvm, netdev, steved,
	Tom Lendacky

On Wed, Mar 09, 2011 at 07:45:43AM -0800, Shirley Ma wrote:
> On Wed, 2011-03-09 at 09:15 +0200, Michael S. Tsirkin wrote:
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 82dba5a..ebe3337 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -514,11 +514,11 @@ static unsigned int free_old_xmit_skbs(struct
> > virtnet_info *vi)
> >         struct sk_buff *skb;
> >         unsigned int len, tot_sgs = 0;
> > 
> > -       while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> > +       if ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> >                 pr_debug("Sent skb %p\n", skb);
> >                 vi->dev->stats.tx_bytes += skb->len;
> >                 vi->dev->stats.tx_packets++;
> > -               tot_sgs += skb_vnet_hdr(skb)->num_sg;
> > +               tot_sgs = 2+MAX_SKB_FRAGS;
> >                 dev_kfree_skb_any(skb);
> >         }
> >         return tot_sgs;
> 
> Return value should be different based on indirect or direct buffers
> here?

Something like that. Or we can assume no indirect, worst-case.
But just for testing, I think it should work as an estimation.

> > @@ -576,9 +576,6 @@ static netdev_tx_t start_xmit(struct sk_buff *skb,
> > struct net_device *dev)
> >         struct virtnet_info *vi = netdev_priv(dev);
> >         int capacity;
> > 
> > -       /* Free up any pending old buffers before queueing new ones.
> > */
> > -       free_old_xmit_skbs(vi);
> > -
> >         /* Try to transmit */
> >         capacity = xmit_skb(vi, skb);
> > 
> > @@ -605,6 +602,10 @@ static netdev_tx_t start_xmit(struct sk_buff
> > *skb, struct net_device *dev)
> >         skb_orphan(skb);
> >         nf_reset(skb);
> > 
> > +       /* Free up any old buffers so we can queue new ones. */
> > +       if (capacity < 2+MAX_SKB_FRAGS)
> > +               capacity += free_old_xmit_skbs(vi);
> > +
> >         /* Apparently nice girls don't return TX_BUSY; stop the queue
> >          * before it gets out of hand.  Naturally, this wastes
> > entries. */
> >         if (capacity < 2+MAX_SKB_FRAGS) { 
> 
> I tried a similar patch before; it didn't help much on TCP stream
> performance. But I didn't try multiple stream TCP_RR.
> 
> Shirley

There's a bug in my patch by the way. Please try the following
instead (still untested).

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 82dba5a..4477b9a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -514,11 +514,11 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
 	struct sk_buff *skb;
 	unsigned int len, tot_sgs = 0;
 
-	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+	if ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 		vi->dev->stats.tx_bytes += skb->len;
 		vi->dev->stats.tx_packets++;
-		tot_sgs += skb_vnet_hdr(skb)->num_sg;
+		tot_sgs = 2+MAX_SKB_FRAGS;
 		dev_kfree_skb_any(skb);
 	}
 	return tot_sgs;
@@ -576,7 +576,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct virtnet_info *vi = netdev_priv(dev);
 	int capacity;
 
-	/* Free up any pending old buffers before queueing new ones. */
+	/* Free up any old buffers so we can queue new ones. */
 	free_old_xmit_skbs(vi);
 
 	/* Try to transmit */
@@ -605,6 +605,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	skb_orphan(skb);
 	nf_reset(skb);
 
+	/* Free up any old buffers so we can queue new ones. */
+	if (capacity < 2+MAX_SKB_FRAGS)
+		capacity += free_old_xmit_skbs(vi);
+
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (capacity < 2+MAX_SKB_FRAGS) {

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
  2011-03-09 16:09   ` Tom Lendacky
@ 2011-03-09 16:21     ` Shirley Ma
  2011-03-09 16:28     ` Michael S. Tsirkin
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 25+ messages in thread
From: Shirley Ma @ 2011-03-09 16:21 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Michael S. Tsirkin, Rusty Russell, Krishna Kumar2, David Miller,
	kvm, netdev, steved

On Wed, 2011-03-09 at 10:09 -0600, Tom Lendacky wrote:
> > 
> > > This spread out the kick_notify but still resulted in a lot of
> them.  I
> > > decided to build on the delayed Tx buffer freeing and code up an
> > > "ethtool" like coalescing patch in order to delay the kick_notify
> until
> > > there were at least 5 packets on the ring or 2000 usecs, whichever
> > > occurred first.  Here are the
> > > 
> > > results of delaying the kick_notify (average of six runs):
> > >   Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
> > >   Exits: 102,587.28 Exits/Sec
> > >   TxCPU: 3.03%  RxCPU: 99.33%
> > > 
> > > About a 23% increase over baseline and about 52% of baremetal.
> > > 
> > > Running the perf command against the guest I noticed almost 19% of
> the
> > > time being spent in _raw_spin_lock.  Enabling lockstat in the
> guest
> > > showed a lot of contention in the "irq_desc_lock_class". Pinning
> the
> > > virtio1-input interrupt to a single cpu in the guest and
> re-running the
> > > last test resulted in
> > > 
> > > tremendous gains (average of six runs):
> > >   Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
> > >   Exits: 62,603.37 Exits/Sec
> > >   TxCPU: 3.73%  RxCPU: 98.52%
> > > 
> > > About a 77% increase over baseline and about 74% of baremetal.
> > > 
> > > Vhost is receiving a lot of notifications for packets that are to
> be
> > > transmitted (over 60% of the packets generate a kick_notify).
> Also, it
> > > looks like vhost is sending a lot of notifications for packets it
> has
> > > received before the guest can get scheduled to disable
> notifications and
> > > begin processing the packets
> > 
> > Hmm, is this really what happens to you?  The effect would be that
> guest
> > gets an interrupt while notifications are disabled in guest, right?
> Could
> > you add a counter and check this please?
> 
> The disabling of the interrupt/notifications is done by the guest.  So
> the 
> guest has to get scheduled and handle the notification before it
> disables 
> them.  The vhost_signal routine will keep injecting an interrupt until
> this 
> happens causing the contention in the guest.  I'll try the patches you
> specify 
> below and post the results.  They look like they should take care of
> this 
> issue.

In the guest TX path, the interrupt should be disabled from the start:
since start_xmit calls free_old_xmit_skbs itself, no send-completion
interrupts are needed to free the old skbs. The interrupt is only
re-enabled when the netif queue is full. For the multiple-stream TCP_RR
test we never hit the netif-queue-full situation, so the send-completion
interrupt rate in /proc/interrupts should be 0, right?
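For reference, the queue-full path in start_xmit() that re-enables the callback
looks roughly like this (paraphrased, not a verbatim quote of the driver):

	/* Callbacks stay disabled until the ring is nearly full; only then do
	 * we stop the queue and re-enable them, so a TX completion interrupt
	 * can restart the queue later. */
	if (capacity < 2+MAX_SKB_FRAGS) {
		netif_stop_queue(dev);
		if (unlikely(!virtqueue_enable_cb(vi->svq))) {
			/* More buffers were consumed meanwhile: reap and recheck. */
			capacity += free_old_xmit_skbs(vi);
			if (capacity >= 2+MAX_SKB_FRAGS) {
				netif_start_queue(dev);
				virtqueue_disable_cb(vi->svq);
			}
		}
	}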

Shirley


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
  2011-03-09 16:10     ` Michael S. Tsirkin
@ 2011-03-09 16:25       ` Shirley Ma
  2011-03-09 16:32         ` Michael S. Tsirkin
  0 siblings, 1 reply; 25+ messages in thread
From: Shirley Ma @ 2011-03-09 16:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rusty Russell, Krishna Kumar2, David Miller, kvm, netdev, steved,
	Tom Lendacky

On Wed, 2011-03-09 at 18:10 +0200, Michael S. Tsirkin wrote:
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 82dba5a..4477b9a 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -514,11 +514,11 @@ static unsigned int free_old_xmit_skbs(struct
> virtnet_info *vi)
>         struct sk_buff *skb;
>         unsigned int len, tot_sgs = 0;
> 
> -       while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> +       if ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
>                 pr_debug("Sent skb %p\n", skb);
>                 vi->dev->stats.tx_bytes += skb->len;
>                 vi->dev->stats.tx_packets++;
> -               tot_sgs += skb_vnet_hdr(skb)->num_sg;
> +               tot_sgs = 2+MAX_SKB_FRAGS;
>                 dev_kfree_skb_any(skb);
>         }
>         return tot_sgs;
> @@ -576,7 +576,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb,
> struct net_device *dev)
>         struct virtnet_info *vi = netdev_priv(dev);
>         int capacity;
> 
> -       /* Free up any pending old buffers before queueing new ones.
> */
> +       /* Free up any old buffers so we can queue new ones. */
>         free_old_xmit_skbs(vi);
> 
>         /* Try to transmit */
> @@ -605,6 +605,10 @@ static netdev_tx_t start_xmit(struct sk_buff
> *skb, struct net_device *dev)
>         skb_orphan(skb);
>         nf_reset(skb);
> 
> +       /* Free up any old buffers so we can queue new ones. */
> +       if (capacity < 2+MAX_SKB_FRAGS)
> +               capacity += free_old_xmit_skbs(vi);
> +
>         /* Apparently nice girls don't return TX_BUSY; stop the queue
>          * before it gets out of hand.  Naturally, this wastes
> entries. */
>         if (capacity < 2+MAX_SKB_FRAGS) {
> -- 

I tried this one as well. It might improve TCP_RR performance but not
TCP_STREAM. :) Let's wait for Tom's TCP_RR results.

Thanks
Shirley


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
  2011-03-09 16:09   ` Tom Lendacky
  2011-03-09 16:21     ` Shirley Ma
@ 2011-03-09 16:28     ` Michael S. Tsirkin
  2011-03-09 16:51     ` Shirley Ma
  2011-03-09 22:51     ` Tom Lendacky
  3 siblings, 0 replies; 25+ messages in thread
From: Michael S. Tsirkin @ 2011-03-09 16:28 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Shirley Ma, Rusty Russell, Krishna Kumar2, David Miller, kvm,
	netdev, steved

On Wed, Mar 09, 2011 at 10:09:26AM -0600, Tom Lendacky wrote:
> On Wednesday, March 09, 2011 01:15:58 am Michael S. Tsirkin wrote:
> > On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote:
> > > We've been doing some more experimenting with the small packet network
> > > performance problem in KVM.  I have a different setup than what Steve D.
> > > was using so I re-baselined things on the kvm.git kernel on both the
> > > host and guest with a 10GbE adapter.  I also made use of the
> > > virtio-stats patch.
> > > 
> > > The virtual machine has 2 vCPUs, 8GB of memory and two virtio network
> > > adapters (the first connected to a 1GbE adapter and a LAN, the second
> > > connected to a 10GbE adapter that is direct connected to another system
> > > with the same 10GbE adapter) running the kvm.git kernel.  The test was a
> > > TCP_RR test with 100 connections from a baremetal client to the KVM
> > > guest using a 256 byte message size in both directions.

One thing that might be happening is that we are running out of the
atomic memory pool in the guest, so indirect allocations
start failing, and that is a slow path.
Could you check this please?


> > > I used the uperf tool to do this after verifying the results against
> > > netperf. Uperf allows the specification of the number of connections as
> > > a parameter in an XML file as opposed to launching, in this case, 100
> > > separate instances of netperf.
> > > 
> > > Here is the baseline for baremetal using 2 physical CPUs:
> > >   Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec
> > >   TxCPU: 7.88%  RxCPU: 99.41%
> > > 
> > > To be sure to get consistent results with KVM I disabled the
> > > hyperthreads, pinned the qemu-kvm process, vCPUs, vhost thread and
> > > ethernet adapter interrupts (this resulted in runs that differed by only
> > > about 2% from lowest to highest).  The fact that pinning is required to
> > > get consistent results is a different problem that we'll have to look
> > > into later...
> > > 
> > > Here is the KVM baseline (average of six runs):
> > >   Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
> > >   Exits: 148,444.58 Exits/Sec
> > >   TxCPU: 2.40%  RxCPU: 99.35%
> > > 
> > > About 42% of baremetal.
> > 
> > Can you add interrupt stats as well please?
> 
> Yes I can.  Just the guest interrupts for the virtio device?

Guess so: tx and rx.

> > 
> > > empty.  So I coded a quick patch to delay freeing of the used Tx buffers
> > > until more than half the ring was used (I did not test this under a
> > > stream condition so I don't know if this would have a negative impact). 
> > > Here are the results
> > > 
> > > from delaying the freeing of used Tx buffers (average of six runs):
> > >   Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
> > >   Exits: 142,681.67 Exits/Sec
> > >   TxCPU: 2.78%  RxCPU: 99.36%
> > > 
> > > About a 4% increase over baseline and about 44% of baremetal.
> > 
> > Hmm, I am not sure what you mean by delaying freeing.
> 
> In the start_xmit function of virtio_net.c the first thing done is to free any 
> used entries from the ring.  I patched the code to track the number of used tx 
> ring entries and only free the used entries when they are greater than half 
> the capacity of the ring (similar to the way the rx ring is re-filled).

We don't even need that: just max skb frags + 2.
Also we don't need to free them all: just enough to make
room for max skb frags + 2 entries.
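In other words, something along these lines (sketch only, reusing the helpers
from the patch above):

	/* Sketch: reap used TX buffers only until there is room for one
	 * worst-case skb (2 + MAX_SKB_FRAGS descriptors), then stop. */
	static int free_enough_xmit_skbs(struct virtnet_info *vi, int capacity)
	{
		struct sk_buff *skb;
		unsigned int len;

		while (capacity < 2 + MAX_SKB_FRAGS &&
		       (skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
			vi->dev->stats.tx_bytes += skb->len;
			vi->dev->stats.tx_packets++;
			capacity += skb_vnet_hdr(skb)->num_sg;
			dev_kfree_skb_any(skb);
		}
		return capacity;
	}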

> > I think we do have a problem that free_old_xmit_skbs
> > tries to flush out the ring aggressively:
> > it always polls until the ring is empty,
> > so there could be bursts of activity where
> > we spend a lot of time flushing the old entries
> > before e.g. sending an ack, resulting in
> > latency bursts.
> > 
> > Generally we'll need some smarter logic,
> > but with indirect at the moment we can just poll
> > a single packet after we post a new one, and be done with it.
> > Is your patch something like the patch below?
> > Could you try mine as well please?
> 
> Yes, I'll try the patch and post the results.
> 
> > 
> > > This spread out the kick_notify but still resulted in a lot of them.  I
> > > decided to build on the delayed Tx buffer freeing and code up an
> > > "ethtool" like coalescing patch in order to delay the kick_notify until
> > > there were at least 5 packets on the ring or 2000 usecs, whichever
> > > occurred first.  Here are the
> > > 
> > > results of delaying the kick_notify (average of six runs):
> > >   Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
> > >   Exits: 102,587.28 Exits/Sec
> > >   TxCPU: 3.03%  RxCPU: 99.33%
> > > 
> > > About a 23% increase over baseline and about 52% of baremetal.
> > > 
> > > Running the perf command against the guest I noticed almost 19% of the
> > > time being spent in _raw_spin_lock.  Enabling lockstat in the guest
> > > showed a lot of contention in the "irq_desc_lock_class". Pinning the
> > > virtio1-input interrupt to a single cpu in the guest and re-running the
> > > last test resulted in
> > > 
> > > tremendous gains (average of six runs):
> > >   Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
> > >   Exits: 62,603.37 Exits/Sec
> > >   TxCPU: 3.73%  RxCPU: 98.52%
> > > 
> > > About a 77% increase over baseline and about 74% of baremetal.
> > > 
> > > Vhost is receiving a lot of notifications for packets that are to be
> > > transmitted (over 60% of the packets generate a kick_notify).  Also, it
> > > looks like vhost is sending a lot of notifications for packets it has
> > > received before the guest can get scheduled to disable notifications and
> > > begin processing the packets
> > 
> > Hmm, is this really what happens to you?  The effect would be that guest
> > gets an interrupt while notifications are disabled in guest, right? Could
> > you add a counter and check this please?
> 
> The disabling of the interrupt/notifications is done by the guest.  So the 
> guest has to get scheduled and handle the notification before it disables 
> them.  The vhost_signal routine will keep injecting an interrupt until this 
> happens causing the contention in the guest.  I'll try the patches you specify 
> below and post the results.  They look like they should take care of this 
> issue.
> 
> > 
> > Another possible thing to try would be these old patches to publish used
> > index from guest to make sure this double interrupt does not happen:
> >  [PATCHv2] virtio: put last seen used index into ring itself
> >  [PATCHv2] vhost-net: utilize PUBLISH_USED_IDX feature
> > 
> > > resulting in some lock contention in the guest (and
> > > high interrupt rates).
> > > 
> > > Some thoughts for the transmit path...  can vhost be enhanced to do some
> > > adaptive polling so that the number of kick_notify events is reduced and
> > > replaced by kick_no_notify events?
> > 
> > Worth a try.
> > 
> > > Comparing the transmit path to the receive path, the guest disables
> > > notifications after the first kick and vhost re-enables notifications
> > > after completing processing of the tx ring.
> > 
> > Is this really what happens? I thought the host disables notifications
> > after the first kick.
> 
> Yup, sorry for the confusion.  The kick is done by the guest and then vhost 
> disables notifications.  Maybe a similar approach to the above patches of 
> checking the used index in the virtio_net driver could also help here?

If this happens there will be many kicks that find an empty ring.
Could you add a counter in vhost and check please?
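A counter for that could be as simple as the sketch below, bumped in handle_tx()
when a wakeup finds no descriptors in the avail ring (how the value is exported --
printk, debugfs, etc. -- is left open):

	static atomic64_t empty_tx_kicks;	/* illustrative counter */

	/* In handle_tx(), on the first vhost_get_vq_desc() after a wakeup: */
	if (head == vq->num)
		atomic64_inc(&empty_tx_kicks);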

-- 
MST

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
  2011-03-09 16:25       ` Shirley Ma
@ 2011-03-09 16:32         ` Michael S. Tsirkin
  2011-03-09 16:38           ` Shirley Ma
  0 siblings, 1 reply; 25+ messages in thread
From: Michael S. Tsirkin @ 2011-03-09 16:32 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Rusty Russell, Krishna Kumar2, David Miller, kvm, netdev, steved,
	Tom Lendacky

On Wed, Mar 09, 2011 at 08:25:34AM -0800, Shirley Ma wrote:
> On Wed, 2011-03-09 at 18:10 +0200, Michael S. Tsirkin wrote:
> > 
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 82dba5a..4477b9a 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -514,11 +514,11 @@ static unsigned int free_old_xmit_skbs(struct
> > virtnet_info *vi)
> >         struct sk_buff *skb;
> >         unsigned int len, tot_sgs = 0;
> > 
> > -       while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> > +       if ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> >                 pr_debug("Sent skb %p\n", skb);
> >                 vi->dev->stats.tx_bytes += skb->len;
> >                 vi->dev->stats.tx_packets++;
> > -               tot_sgs += skb_vnet_hdr(skb)->num_sg;
> > +               tot_sgs = 2+MAX_SKB_FRAGS;
> >                 dev_kfree_skb_any(skb);
> >         }
> >         return tot_sgs;
> > @@ -576,7 +576,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb,
> > struct net_device *dev)
> >         struct virtnet_info *vi = netdev_priv(dev);
> >         int capacity;
> > 
> > -       /* Free up any pending old buffers before queueing new ones.
> > */
> > +       /* Free up any old buffers so we can queue new ones. */
> >         free_old_xmit_skbs(vi);
> > 
> >         /* Try to transmit */
> > @@ -605,6 +605,10 @@ static netdev_tx_t start_xmit(struct sk_buff
> > *skb, struct net_device *dev)
> >         skb_orphan(skb);
> >         nf_reset(skb);
> > 
> > +       /* Free up any old buffers so we can queue new ones. */
> > +       if (capacity < 2+MAX_SKB_FRAGS)
> > +               capacity += free_old_xmit_skbs(vi);
> > +
> >         /* Apparently nice girls don't return TX_BUSY; stop the queue
> >          * before it gets out of hand.  Naturally, this wastes
> > entries. */
> >         if (capacity < 2+MAX_SKB_FRAGS) {
> > -- 
> 
> I tried this one as well. It might improve TCP_RR performance but not
> TCP_STREAM. :) Let's wait for Tom's TCP_RR resutls.
> 
> Thanks
> Shirley

I think your issues are with TX overrun.
Besides delaying IRQ on TX, I don't have many ideas.

The one interesting thing is that you see better speed
if you drop packets. The netdev crowd says this should not happen,
so it could be an indicator of a problem somewhere.


-- 
MST

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
  2011-03-09 16:32         ` Michael S. Tsirkin
@ 2011-03-09 16:38           ` Shirley Ma
  0 siblings, 0 replies; 25+ messages in thread
From: Shirley Ma @ 2011-03-09 16:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rusty Russell, Krishna Kumar2, David Miller, kvm, netdev, steved,
	Tom Lendacky

On Wed, 2011-03-09 at 18:32 +0200, Michael S. Tsirkin wrote:
> I think your issues are with TX overrun.
> Besides delaying IRQ on TX, I don't have many ideas.
> 
> The one interesting thing is that you see better speed
> if you drop packets. netdev crowd says this should not happen,
> so could be an indicator of a problem somewhere.

Yes, I am looking at why the guest didn't see the used buffers in time after
vhost sent the TX completion. I am trying to collect some data on vhost.

I also wonder whether it's a scheduler issue.

Thanks
Shirley


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
  2011-03-09 16:09   ` Tom Lendacky
  2011-03-09 16:21     ` Shirley Ma
  2011-03-09 16:28     ` Michael S. Tsirkin
@ 2011-03-09 16:51     ` Shirley Ma
  2011-03-09 17:16       ` Michael S. Tsirkin
  2011-03-09 22:51     ` Tom Lendacky
  3 siblings, 1 reply; 25+ messages in thread
From: Shirley Ma @ 2011-03-09 16:51 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Michael S. Tsirkin, Rusty Russell, Krishna Kumar2, David Miller,
	kvm, netdev, steved

On Wed, 2011-03-09 at 10:09 -0600, Tom Lendacky wrote:
> > > Vhost is receiving a lot of notifications for packets that are to
> be
> > > transmitted (over 60% of the packets generate a kick_notify). 

This is guest TX send notification when vhost enables notification.

In TCP_STREAM test, vhost exits from reaching NAPI WEIGHT, it rarely
enables the notification, vhost re-enters handle_tx from NAPI poll, so
guest doesn't do much kick_notify.

In multiple TCP_RR test, seems vhost exits from nothing to send in TX vq
very often, so it enables notification most of the time.

Shirley


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
  2011-03-09 16:51     ` Shirley Ma
@ 2011-03-09 17:16       ` Michael S. Tsirkin
  2011-03-09 18:16         ` Shirley Ma
  0 siblings, 1 reply; 25+ messages in thread
From: Michael S. Tsirkin @ 2011-03-09 17:16 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Tom Lendacky, Rusty Russell, Krishna Kumar2, David Miller, kvm,
	netdev, steved

On Wed, Mar 09, 2011 at 08:51:33AM -0800, Shirley Ma wrote:
> On Wed, 2011-03-09 at 10:09 -0600, Tom Lendacky wrote:
> > > > Vhost is receiving a lot of notifications for packets that are to
> > be
> > > > transmitted (over 60% of the packets generate a kick_notify). 
> 
> This is guest TX send notification when vhost enables notification.
> 
> In TCP_STREAM test, vhost exits from reaching NAPI WEIGHT,


You mean virtio?

> it rarely
> enables the notification, vhost re-enters handle_tx from NAPI poll,

Does NAPI really call handle_tx? Not rx?

> so
> guest doesn't do much kick_notify.
> 
> In multiple TCP_RR test, seems vhost exits from nothing to send in TX vq
> very often, so it enables notification most of the time.
> 
> Shirley

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
  2011-03-09 17:16       ` Michael S. Tsirkin
@ 2011-03-09 18:16         ` Shirley Ma
  0 siblings, 0 replies; 25+ messages in thread
From: Shirley Ma @ 2011-03-09 18:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Tom Lendacky, Rusty Russell, Krishna Kumar2, David Miller, kvm,
	netdev, steved

On Wed, 2011-03-09 at 19:16 +0200, Michael S. Tsirkin wrote:
> On Wed, Mar 09, 2011 at 08:51:33AM -0800, Shirley Ma wrote:
> > On Wed, 2011-03-09 at 10:09 -0600, Tom Lendacky wrote:
> > > > > Vhost is receiving a lot of notifications for packets that are
> to
> > > be
> > > > > transmitted (over 60% of the packets generate a kick_notify). 
> > 
> > This is guest TX send notification when vhost enables notification.
> > 
> > In TCP_STREAM test, vhost exits from reaching NAPI WEIGHT,
> 
> 
> You mean virtio?

Sorry, I messed up NAPI WEIGHT and VHOST NET WEIGHT.

I meant VHOST_NET_WEIGHT: vhost exits handle_tx() after reaching
VHOST_NET_WEIGHT without enabling notification.

> 
> > it rarely
> > enables the notification, vhost re-enters handle_tx from NAPI poll,
> 
> Does NAPI really call handle_tx? Not rx? 

I meant for TX/RX, vhost re-enters handle_tx from vhost_poll_queue(), not
from kick_notify.
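For readers following along, the two handle_tx() exits being discussed look
roughly like this (paraphrased from the vhost-net code of that time, not a
verbatim quote):

	/* Exit 1: avail ring is empty -- re-enable guest notifications and
	 * stop; the guest's next kick_notify will wake us again. */
	if (head == vq->num) {
		if (unlikely(vhost_enable_notify(vq))) {
			/* Raced with the guest adding a buffer: keep going. */
			vhost_disable_notify(vq);
			continue;
		}
		break;
	}
	...
	/* Exit 2: we have handled VHOST_NET_WEIGHT bytes -- requeue ourselves
	 * with vhost_poll_queue() and return without enabling notifications. */
	total_len += len;
	if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
		vhost_poll_queue(&vq->poll);
		break;
	}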

Shirley


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
       [not found] ` <20110309071558.GA25757@redhat.com>
  2011-03-09 15:45   ` Shirley Ma
  2011-03-09 16:09   ` Tom Lendacky
@ 2011-03-09 20:11   ` Tom Lendacky
  2011-03-09 21:56     ` Michael S. Tsirkin
  2011-03-09 22:45     ` Shirley Ma
  2 siblings, 2 replies; 25+ messages in thread
From: Tom Lendacky @ 2011-03-09 20:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Shirley Ma, Rusty Russell, Krishna Kumar2, David Miller, kvm,
	netdev, steved

Here are the results again with the addition of the interrupt rate that 
occurred on the guest virtio_net device:

Here is the KVM baseline (average of six runs):
  Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
  Exits: 148,444.58 Exits/Sec
  TxCPU: 2.40%  RxCPU: 99.35%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 5,154/5,222
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0

About 42% of baremetal.

Delayed freeing of TX buffers (average of six runs):
  Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
  Exits: 142,681.67 Exits/Sec
  TxCPU: 2.78%  RxCPU: 99.36%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 4,796/4,908
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0

About a 4% increase over baseline and about 44% of baremetal.

Delaying kick_notify (kick every 5 packets -average of six runs):
  Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
  Exits: 102,587.28 Exits/Sec
  TxCPU: 3.03%  RxCPU: 99.33%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 4,200/4,293
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0

About a 23% increase over baseline and about 52% of baremetal.

Delaying kick_notify and pinning virtio1-input to CPU0 (average of six runs):
  Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
  Exits: 62,603.37 Exits/Sec
  TxCPU: 3.73%  RxCPU: 98.52%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 11,564/0
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0

About a 77% increase over baseline and about 74% of baremetal.


On Wednesday, March 09, 2011 01:15:58 am Michael S. Tsirkin wrote:
> On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote:
> > We've been doing some more experimenting with the small packet network
> > performance problem in KVM.  I have a different setup than what Steve D.
> > was using so I re-baselined things on the kvm.git kernel on both the
> > host and guest with a 10GbE adapter.  I also made use of the
> > virtio-stats patch.
> > 
> > The virtual machine has 2 vCPUs, 8GB of memory and two virtio network
> > adapters (the first connected to a 1GbE adapter and a LAN, the second
> > connected to a 10GbE adapter that is direct connected to another system
> > with the same 10GbE adapter) running the kvm.git kernel.  The test was a
> > TCP_RR test with 100 connections from a baremetal client to the KVM
> > guest using a 256 byte message size in both directions.
> > 
> > I used the uperf tool to do this after verifying the results against
> > netperf. Uperf allows the specification of the number of connections as
> > a parameter in an XML file as opposed to launching, in this case, 100
> > separate instances of netperf.
> > 
> > Here is the baseline for baremetal using 2 physical CPUs:
> >   Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec
> >   TxCPU: 7.88%  RxCPU: 99.41%
> > 
> > To be sure to get consistent results with KVM I disabled the
> > hyperthreads, pinned the qemu-kvm process, vCPUs, vhost thread and
> > ethernet adapter interrupts (this resulted in runs that differed by only
> > about 2% from lowest to highest).  The fact that pinning is required to
> > get consistent results is a different problem that we'll have to look
> > into later...
> > 
> > Here is the KVM baseline (average of six runs):
> >   Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
> >   Exits: 148,444.58 Exits/Sec
> >   TxCPU: 2.40%  RxCPU: 99.35%
> > 
> > About 42% of baremetal.
> 
> Can you add interrupt stats as well please?
> 
> > empty.  So I coded a quick patch to delay freeing of the used Tx buffers
> > until more than half the ring was used (I did not test this under a
> > stream condition so I don't know if this would have a negative impact). 
> > Here are the results
> > 
> > from delaying the freeing of used Tx buffers (average of six runs):
> >   Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
> >   Exits: 142,681.67 Exits/Sec
> >   TxCPU: 2.78%  RxCPU: 99.36%
> > 
> > About a 4% increase over baseline and about 44% of baremetal.
> 
> Hmm, I am not sure what you mean by delaying freeing.
> I think we do have a problem that free_old_xmit_skbs
> tries to flush out the ring aggressively:
> it always polls until the ring is empty,
> so there could be bursts of activity where
> we spend a lot of time flushing the old entries
> before e.g. sending an ack, resulting in
> latency bursts.
> 
> Generally we'll need some smarter logic,
> but with indirect at the moment we can just poll
> a single packet after we post a new one, and be done with it.
> Is your patch something like the patch below?
> Could you try mine as well please?
> 
> > > This spread out the kick_notify but still resulted in a lot of them.  I
> > decided to build on the delayed Tx buffer freeing and code up an
> > "ethtool" like coalescing patch in order to delay the kick_notify until
> > there were at least 5 packets on the ring or 2000 usecs, whichever
> > occurred first.  Here are the
> > 
> > results of delaying the kick_notify (average of six runs):
> >   Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
> >   Exits: 102,587.28 Exits/Sec
> >   TxCPU: 3.03%  RxCPU: 99.33%
> > 
> > About a 23% increase over baseline and about 52% of baremetal.
> > 
> > Running the perf command against the guest I noticed almost 19% of the
> > time being spent in _raw_spin_lock.  Enabling lockstat in the guest
> > > showed a lot of contention in the "irq_desc_lock_class". Pinning the
> > virtio1-input interrupt to a single cpu in the guest and re-running the
> > last test resulted in
> > 
> > tremendous gains (average of six runs):
> >   Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
> >   Exits: 62,603.37 Exits/Sec
> >   TxCPU: 3.73%  RxCPU: 98.52%
> > 
> > About a 77% increase over baseline and about 74% of baremetal.
> > 
> > Vhost is receiving a lot of notifications for packets that are to be
> > transmitted (over 60% of the packets generate a kick_notify).  Also, it
> > looks like vhost is sending a lot of notifications for packets it has
> > received before the guest can get scheduled to disable notifications and
> > begin processing the packets
> 
> Hmm, is this really what happens to you?  The effect would be that guest
> gets an interrupt while notifications are disabled in guest, right? Could
> you add a counter and check this please?
> 
> Another possible thing to try would be these old patches to publish used
> index from guest to make sure this double interrupt does not happen:
>  [PATCHv2] virtio: put last seen used index into ring itself
>  [PATCHv2] vhost-net: utilize PUBLISH_USED_IDX feature
> 
> > resulting in some lock contention in the guest (and
> > high interrupt rates).
> > 
> > Some thoughts for the transmit path...  can vhost be enhanced to do some
> > adaptive polling so that the number of kick_notify events is reduced and
> > replaced by kick_no_notify events?
> 
> Worth a try.
> 
> > Comparing the transmit path to the receive path, the guest disables
> > notifications after the first kick and vhost re-enables notifications
> > after completing processing of the tx ring.
> 
> Is this really what happens? I thought the host disables notifications
> after the first kick.
> 
> >  Can a similar thing be done for the
> > 
> > receive path?  Once vhost sends the first notification for a received
> > packet it can disable notifications and let the guest re-enable
> > notifications when it has finished processing the receive ring.  Also,
> > can the virtio-net driver do some adaptive polling (or does napi take
> > care of that for the guest)?
> 
> Worth a try. I don't think napi does anything like this.
> 
> > Running the same workload on the same configuration with a different
> > hypervisor results in performance that is almost equivalent to baremetal
> > without doing any pinning.
> > 
> > Thanks,
> > Tom Lendacky
> 
> There's no need to flush out all used buffers
> before we post more for transmit: with indirect,
> just a single one is enough. Without indirect we'll
> need more possibly, but just for testing this should
> be enough.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> ---
> 
> Note: untested.
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 82dba5a..ebe3337 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -514,11 +514,11 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
>  	struct sk_buff *skb;
>  	unsigned int len, tot_sgs = 0;
> 
> -	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> +	if ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
>  		pr_debug("Sent skb %p\n", skb);
>  		vi->dev->stats.tx_bytes += skb->len;
>  		vi->dev->stats.tx_packets++;
> -		tot_sgs += skb_vnet_hdr(skb)->num_sg;
> +		tot_sgs = 2+MAX_SKB_FRAGS;
>  		dev_kfree_skb_any(skb);
>  	}
>  	return tot_sgs;
> @@ -576,9 +576,6 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	struct virtnet_info *vi = netdev_priv(dev);
>  	int capacity;
> 
> -	/* Free up any pending old buffers before queueing new ones. */
> -	free_old_xmit_skbs(vi);
> -
>  	/* Try to transmit */
>  	capacity = xmit_skb(vi, skb);
> 
> @@ -605,6 +602,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	skb_orphan(skb);
>  	nf_reset(skb);
> 
> +	/* Free up any old buffers so we can queue new ones. */
> +	if (capacity < 2+MAX_SKB_FRAGS)
> +		capacity += free_old_xmit_skbs(vi);
> +
>  	/* Apparently nice girls don't return TX_BUSY; stop the queue
>  	 * before it gets out of hand.  Naturally, this wastes entries. */
>  	if (capacity < 2+MAX_SKB_FRAGS) {
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
  2011-03-09 20:11   ` Tom Lendacky
@ 2011-03-09 21:56     ` Michael S. Tsirkin
  2011-03-09 23:25       ` Tom Lendacky
  2011-03-10  0:59       ` Shirley Ma
  2011-03-09 22:45     ` Shirley Ma
  1 sibling, 2 replies; 25+ messages in thread
From: Michael S. Tsirkin @ 2011-03-09 21:56 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Shirley Ma, Rusty Russell, Krishna Kumar2, David Miller, kvm,
	netdev, steved

On Wed, Mar 09, 2011 at 02:11:07PM -0600, Tom Lendacky wrote:
> Here are the results again with the addition of the interrupt rate that 
> occurred on the guest virtio_net device:
> 
> Here is the KVM baseline (average of six runs):
>   Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
>   Exits: 148,444.58 Exits/Sec
>   TxCPU: 2.40%  RxCPU: 99.35%
>   Virtio1-input  Interrupts/Sec (CPU0/CPU1): 5,154/5,222
>   Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
> 
> About 42% of baremetal.
> 
> Delayed freeing of TX buffers (average of six runs):
>   Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
>   Exits: 142,681.67 Exits/Sec
>   TxCPU: 2.78%  RxCPU: 99.36%
>   Virtio1-input  Interrupts/Sec (CPU0/CPU1): 4,796/4,908
>   Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
> 
> About a 4% increase over baseline and about 44% of baremetal.

Looks like delayed freeing is a good idea generally.
Is this my patch? Yours?



> Delaying kick_notify (kick every 5 packets -average of six runs):
>   Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
>   Exits: 102,587.28 Exits/Sec
>   TxCPU: 3.03%  RxCPU: 99.33%
>   Virtio1-input  Interrupts/Sec (CPU0/CPU1): 4,200/4,293
>   Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
> 
> About a 23% increase over baseline and about 52% of baremetal.
> 
> Delaying kick_notify and pinning virtio1-input to CPU0 (average of six runs):

What exactly moves the interrupt handler between CPUs?
irqbalancer?  Does it matter which CPU you pin it to?
If yes, do you have any idea why?

Also, what happens without delaying kick_notify
but with pinning?

>   Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
>   Exits: 62,603.37 Exits/Sec
>   TxCPU: 3.73%  RxCPU: 98.52%
>   Virtio1-input  Interrupts/Sec (CPU0/CPU1): 11,564/0
>   Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
> 
> About a 77% increase over baseline and about 74% of baremetal.

Hmm we get about 20 packets per interrupt on average.
That's pretty decent. The problem is with exits.
Let's try something adaptive in the host?

-- 
MST
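
One way to read "something adaptive in the host" above is a bounded busy-poll
in vhost's TX handler before it re-enables guest notifications, so a guest that
keeps transmitting does not have to kick (and exit) per packet.  A minimal
sketch under those assumptions -- drain_tx_ring(), ring_is_empty() and the
tx_poll_usecs tunable are hypothetical stand-ins, and the vhost notify helpers
are simplified:

static void handle_tx_adaptive(struct vhost_net *net,
                               struct vhost_virtqueue *vq,
                               unsigned int tx_poll_usecs)
{
        ktime_t stop;

        mutex_lock(&vq->mutex);
        vhost_disable_notify(vq);

        for (;;) {
                drain_tx_ring(net, vq); /* the existing per-buffer sendmsg loop */

                /* Adaptive part: instead of re-enabling notifications right
                 * away, spin for a short bounded window looking for new
                 * descriptors from the guest. */
                stop = ktime_add_us(ktime_get(), tx_poll_usecs);
                while (ring_is_empty(vq) && ktime_before(ktime_get(), stop))
                        cpu_relax();

                if (!ring_is_empty(vq))
                        continue;       /* more work arrived while polling */

                if (unlikely(vhost_enable_notify(vq))) {
                        /* Guest queued more while we re-enabled: keep going. */
                        vhost_disable_notify(vq);
                        continue;
                }
                break;                  /* really idle: wait for the next kick */
        }
        mutex_unlock(&vq->mutex);
}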

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
  2011-03-09 20:11   ` Tom Lendacky
  2011-03-09 21:56     ` Michael S. Tsirkin
@ 2011-03-09 22:45     ` Shirley Ma
  2011-03-09 22:57       ` Tom Lendacky
  1 sibling, 1 reply; 25+ messages in thread
From: Shirley Ma @ 2011-03-09 22:45 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Michael S. Tsirkin, Rusty Russell, Krishna Kumar2, David Miller,
	kvm, netdev, steved

Hello Tom,

Do you also have Rusty's virtio stat patch results for both send queue
and recv queue to share here?

Thanks
Shirley


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
  2011-03-09 16:09   ` Tom Lendacky
                       ` (2 preceding siblings ...)
  2011-03-09 16:51     ` Shirley Ma
@ 2011-03-09 22:51     ` Tom Lendacky
  3 siblings, 0 replies; 25+ messages in thread
From: Tom Lendacky @ 2011-03-09 22:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Shirley Ma, Rusty Russell, Krishna Kumar2, David Miller, kvm,
	netdev, steved

On Wednesday, March 09, 2011 10:09:26 am Tom Lendacky wrote:
> On Wednesday, March 09, 2011 01:15:58 am Michael S. Tsirkin wrote:
> > On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote:
> > > We've been doing some more experimenting with the small packet network
> > > performance problem in KVM.  I have a different setup than what Steve
> > > D. was using so I re-baselined things on the kvm.git kernel on both
> > > the host and guest with a 10GbE adapter.  I also made use of the
> > > virtio-stats patch.
> > > 
> > > The virtual machine has 2 vCPUs, 8GB of memory and two virtio network
> > > adapters (the first connected to a 1GbE adapter and a LAN, the second
> > > connected to a 10GbE adapter that is direct connected to another system
> > > with the same 10GbE adapter) running the kvm.git kernel.  The test was
> > > a TCP_RR test with 100 connections from a baremetal client to the KVM
> > > guest using a 256 byte message size in both directions.
> > > 
> > > I used the uperf tool to do this after verifying the results against
> > > netperf. Uperf allows the specification of the number of connections as
> > > a parameter in an XML file as opposed to launching, in this case, 100
> > > separate instances of netperf.
> > > 
> > > Here is the baseline for baremetal using 2 physical CPUs:
> > >   Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec
> > >   TxCPU: 7.88%  RxCPU: 99.41%
> > > 
> > > To be sure to get consistent results with KVM I disabled the
> > > hyperthreads, pinned the qemu-kvm process, vCPUs, vhost thread and
> > > ethernet adapter interrupts (this resulted in runs that differed by
> > > only about 2% from lowest to highest).  The fact that pinning is
> > > required to get consistent results is a different problem that we'll
> > > have to look into later...
> > > 
> > > Here is the KVM baseline (average of six runs):
> > >   Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
> > >   Exits: 148,444.58 Exits/Sec
> > >   TxCPU: 2.40%  RxCPU: 99.35%
> > > 
> > > About 42% of baremetal.
> > 
> > Can you add interrupt stats as well please?
> 
> Yes I can.  Just the guest interrupts for the virtio device?
> 
> > > empty.  So I coded a quick patch to delay freeing of the used Tx
> > > buffers until more than half the ring was used (I did not test this
> > > under a stream condition so I don't know if this would have a negative
> > > impact). Here are the results
> > > 
> > > from delaying the freeing of used Tx buffers (average of six runs):
> > >   Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
> > >   Exits: 142,681.67 Exits/Sec
> > >   TxCPU: 2.78%  RxCPU: 99.36%
> > > 
> > > About a 4% increase over baseline and about 44% of baremetal.
> > 
> > Hmm, I am not sure what you mean by delaying freeing.
> 
> In the start_xmit function of virtio_net.c the first thing done is to free
> any used entries from the ring.  I patched the code to track the number of
> used tx ring entries and only free the used entries when they are greater
> than half the capacity of the ring (similar to the way the rx ring is
> re-filled).
> 
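As an illustration only, that delayed-freeing idea might look roughly like the
sketch below; the tx_outstanding/half-ring bookkeeping is hypothetical,
standing in for whatever the unposted patch actually tracks:

static unsigned int free_old_xmit_skbs_delayed(struct virtnet_info *vi)
{
        struct sk_buff *skb;
        unsigned int len, tot_sgs = 0;

        /* Hypothetical bookkeeping: vi->tx_outstanding counts buffers queued
         * by start_xmit but not yet reclaimed here.  Skip reaping until more
         * than half the ring may be tied up, mirroring how the RX ring is
         * only refilled in batches. */
        if (vi->tx_outstanding <= vi->tx_ring_size / 2)
                return 0;

        while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
                vi->dev->stats.tx_bytes += skb->len;
                vi->dev->stats.tx_packets++;
                tot_sgs += skb_vnet_hdr(skb)->num_sg;
                dev_kfree_skb_any(skb);
                vi->tx_outstanding--;
        }
        return tot_sgs;
}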
> > I think we do have a problem that free_old_xmit_skbs
> > tries to flush out the ring aggressively:
> > it always polls until the ring is empty,
> > so there could be bursts of activity where
> > we spend a lot of time flushing the old entries
> > before e.g. sending an ack, resulting in
> > latency bursts.
> > 
> > Generally we'll need some smarter logic,
> > but with indirect at the moment we can just poll
> > a single packet after we post a new one, and be done with it.
> > Is your patch something like the patch below?
> > Could you try mine as well please?
> 
> Yes, I'll try the patch and post the results.
> 
> > > This spread out the kick_notify but still resulted in a lot of them.  I
> > > decided to build on the delayed Tx buffer freeing and code up an
> > > "ethtool" like coalescing patch in order to delay the kick_notify until
> > > there were at least 5 packets on the ring or 2000 usecs, whichever
> > > occurred first.  Here are the
> > > 
> > > results of delaying the kick_notify (average of six runs):
> > >   Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
> > >   Exits: 102,587.28 Exits/Sec
> > >   TxCPU: 3.03%  RxCPU: 99.33%
> > > 
> > > About a 23% increase over baseline and about 52% of baremetal.
> > > 
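A rough sketch of what that ethtool-style coalescing could look like in the
guest TX path (the tx_max_coalesced_frames / tx_coalesce_usecs fields and this
helper are illustrative, not the actual patch; a flush timer would also be
needed so a trailing short burst is not left unkicked):

static void virtnet_kick_coalesced(struct virtnet_info *vi)
{
        vi->tx_pending++;

        /* Kick when enough packets have accumulated or the time budget for
         * the oldest unsent packet has expired. */
        if (vi->tx_pending >= vi->tx_max_coalesced_frames ||
            time_after(jiffies, vi->tx_last_kick +
                               usecs_to_jiffies(vi->tx_coalesce_usecs))) {
                virtqueue_kick(vi->svq);
                vi->tx_last_kick = jiffies;
                vi->tx_pending = 0;
        }
}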
> > > Running the perf command against the guest I noticed almost 19% of the
> > > time being spent in _raw_spin_lock.  Enabling lockstat in the guest
> > > showed a lot of contention in the "irq_desc_lock_class". Pinning the
> > > virtio1-input interrupt to a single cpu in the guest and re-running the
> > > last test resulted in
> > > 
> > > tremendous gains (average of six runs):
> > >   Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
> > >   Exits: 62,603.37 Exits/Sec
> > >   TxCPU: 3.73%  RxCPU: 98.52%
> > > 
> > > About a 77% increase over baseline and about 74% of baremetal.
> > > 
> > > Vhost is receiving a lot of notifications for packets that are to be
> > > transmitted (over 60% of the packets generate a kick_notify).  Also, it
> > > looks like vhost is sending a lot of notifications for packets it has
> > > received before the guest can get scheduled to disable notifications
> > > and begin processing the packets
> > 
> > Hmm, is this really what happens to you?  The effect would be that guest
> > gets an interrupt while notifications are disabled in guest, right? Could
> > you add a counter and check this please?
> 
> The disabling of the interrupt/notifications is done by the guest.  So the
> guest has to get scheduled and handle the notification before it disables
> them.  The vhost_signal routine will keep injecting an interrupt until this
> happens, causing the contention in the guest.  I'll try the patches you
> specify below and post the results.  They look like they should take care
> of this issue.
> 
> > Another possible thing to try would be these old patches to publish used
> > 
> > index from guest to make sure this double interrupt does not happen:
> >  [PATCHv2] virtio: put last seen used index into ring itself
> >  [PATCHv2] vhost-net: utilize PUBLISH_USED_IDX feature

I was able to apply these patches with a little work, but unfortunately the 
guest oopses during boot in virtqueue_add_buf_gfp.  It happens in the 
virtio_blk driver.  Any chance you can re-work these patches against the 
kvm.git tree?

> >  
> > > resulting in some lock contention in the guest (and
> > > high interrupt rates).
> > > 
> > > Some thoughts for the transmit path...  can vhost be enhanced to do
> > > some adaptive polling so that the number of kick_notify events are
> > > reduced and replaced by kick_no_notify events?
> > 
> > Worth a try.
> > 
> > > Comparing the transmit path to the receive path, the guest disables
> > > notifications after the first kick and vhost re-enables notifications
> > > after completing processing of the tx ring.
> > 
> > Is this really what happens? I thought the host disables notifications
> > after the first kick.
> 
> Yup, sorry for the confusion.  The kick is done by the guest and then vhost
> disables notifications.  Maybe a similar approach to the above patches of
> checking the used index in the virtio_net driver could also help here?
> 
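Loosely, checking a host-published index before kicking could look like the
sketch below; the host_seen_avail field is purely illustrative (the idea later
became the virtio event-index mechanism), and a real version needs memory
barriers and careful handling of races with the host:

static bool virtqueue_need_kick(struct vring_virtqueue *vq)
{
        /* Hypothetical: the host writes back the avail index it has
         * processed up to. */
        u16 host_seen = *vq->host_seen_avail;

        /* If the host is still behind on earlier descriptors it is clearly
         * still running and will pick up the new one on its own; only kick
         * when it had caught up with everything queued before this buffer. */
        return host_seen == (u16)(vq->vring.avail->idx - 1);
}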
> > >  Can a similar thing be done for the
> > > 
> > > receive path?  Once vhost sends the first notification for a received
> > > packet it can disable notifications and let the guest re-enable
> > > notifications when it has finished processing the receive ring.  Also,
> > > can the virtio-net driver do some adaptive polling (or does napi take
> > > care of that for the guest)?
> > 
> > Worth a try. I don't think napi does anything like this.
> > 
> > > Running the same workload on the same configuration with a different
> > > hypervisor results in performance that is almost equivalent to
> > > baremetal without doing any pinning.
> > > 
> > > Thanks,
> > > Tom Lendacky
> > 
> > There's no need to flush out all used buffers
> > before we post more for transmit: with indirect,
> > just a single one is enough. Without indirect we'll
> > need more possibly, but just for testing this should
> > be enough.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > ---
> > 
> > Note: untested.
> > 
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 82dba5a..ebe3337 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -514,11 +514,11 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
> >  	struct sk_buff *skb;
> >  	unsigned int len, tot_sgs = 0;
> > 
> > -	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> > +	if ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> >  		pr_debug("Sent skb %p\n", skb);
> >  		vi->dev->stats.tx_bytes += skb->len;
> >  		vi->dev->stats.tx_packets++;
> > -		tot_sgs += skb_vnet_hdr(skb)->num_sg;
> > +		tot_sgs = 2+MAX_SKB_FRAGS;
> >  		dev_kfree_skb_any(skb);
> >  	}
> >  	return tot_sgs;
> > @@ -576,9 +576,6 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> >  	struct virtnet_info *vi = netdev_priv(dev);
> >  	int capacity;
> > 
> > -	/* Free up any pending old buffers before queueing new ones. */
> > -	free_old_xmit_skbs(vi);
> > -
> >  	/* Try to transmit */
> >  	capacity = xmit_skb(vi, skb);
> > 
> > @@ -605,6 +602,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> >  	skb_orphan(skb);
> >  	nf_reset(skb);
> > 
> > +	/* Free up any old buffers so we can queue new ones. */
> > +	if (capacity < 2+MAX_SKB_FRAGS)
> > +		capacity += free_old_xmit_skbs(vi);
> > +
> >  	/* Apparently nice girls don't return TX_BUSY; stop the queue
> >  	 * before it gets out of hand.  Naturally, this wastes entries. */
> >  	if (capacity < 2+MAX_SKB_FRAGS) {
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
  2011-03-09 22:45     ` Shirley Ma
@ 2011-03-09 22:57       ` Tom Lendacky
  0 siblings, 0 replies; 25+ messages in thread
From: Tom Lendacky @ 2011-03-09 22:57 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michael S. Tsirkin, Rusty Russell, Krishna Kumar2, David Miller,
	kvm, netdev, steved

On Wednesday, March 09, 2011 04:45:12 pm Shirley Ma wrote:
> Hello Tom,
> 
> Do you also have Rusty's virtio stat patch results for both send queue
> and recv queue to share here?

Let me see what I can do about getting the data extracted, averaged and in a 
form that I can put in an email.

> 
> Thanks
> Shirley
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
  2011-03-09 21:56     ` Michael S. Tsirkin
@ 2011-03-09 23:25       ` Tom Lendacky
  2011-03-10  6:54         ` Michael S. Tsirkin
  2011-03-10  0:59       ` Shirley Ma
  1 sibling, 1 reply; 25+ messages in thread
From: Tom Lendacky @ 2011-03-09 23:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Shirley Ma, Rusty Russell, Krishna Kumar2, David Miller, kvm,
	netdev, steved

On Wednesday, March 09, 2011 03:56:15 pm Michael S. Tsirkin wrote:
> On Wed, Mar 09, 2011 at 02:11:07PM -0600, Tom Lendacky wrote:
> > Here are the results again with the addition of the interrupt rate that
> > occurred on the guest virtio_net device:
> > 
> > Here is the KVM baseline (average of six runs):
> >   Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
> >   Exits: 148,444.58 Exits/Sec
> >   TxCPU: 2.40%  RxCPU: 99.35%
> >   Virtio1-input  Interrupts/Sec (CPU0/CPU1): 5,154/5,222
> >   Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
> > 
> > About 42% of baremetal.
> > 
> > Delayed freeing of TX buffers (average of six runs):
> >   Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
> >   Exits: 142,681.67 Exits/Sec
> >   TxCPU: 2.78%  RxCPU: 99.36%
> >   Virtio1-input  Interrupts/Sec (CPU0/CPU1): 4,796/4,908
> >   Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
> > 
> > About a 4% increase over baseline and about 44% of baremetal.
> 
> Looks like delayed freeing is a good idea generally.
> Is this my patch? Yours?

These results are for my patch, I haven't had a chance to run your patch yet.

> 
> > Delaying kick_notify (kick every 5 packets -average of six runs):
> >   Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
> >   Exits: 102,587.28 Exits/Sec
> >   TxCPU: 3.03%  RxCPU: 99.33%
> >   Virtio1-input  Interrupts/Sec (CPU0/CPU1): 4,200/4,293
> >   Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
> > 
> > About a 23% increase over baseline and about 52% of baremetal.
> 
> > Delaying kick_notify and pinning virtio1-input to CPU0 (average of six runs):
> What exactly moves the interrupt handler between CPUs?
> irqbalancer?  Does it matter which CPU you pin it to?
> If yes, do you have any idea why?

Looking at the guest, irqbalance isn't running and the smp_affinity for the 
irq is set to 3 (both CPUs).  It could be that irqbalance would help in this 
situation since it would probably change the smp_affinity mask to a single CPU 
and remove the irq lock contention (I think the last used index patch would be 
best though since it will avoid the extra irq injections).  I'll kick off a 
run with irqbalance running.

As for which CPU the interrupt gets pinned to, that doesn't matter - see 
below.

> 
> Also, what happens without delaying kick_notify
> but with pinning?

Here are the results of a single "baseline" run with the IRQ pinned to CPU0:

  Txn Rate: 108,212.12 Txn/Sec, Pkt Rate: 214,994 Pkts/Sec
  Exits: 119,310.21 Exits/Sec
  TxCPU: 9.63%  RxCPU: 99.47%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 
  Virtio1-output Interrupts/Sec (CPU0/CPU1):

and CPU1:

  Txn Rate: 108,053.02 Txn/Sec, Pkt Rate: 214,678 Pkts/Sec
  Exits: 119,320.12 Exits/Sec
  TxCPU: 9.64%  RxCPU: 99.42%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 13,608/0
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/13,830

About a 24% increase over baseline.

> 
> >   Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
> >   Exits: 62,603.37 Exits/Sec
> >   TxCPU: 3.73%  RxCPU: 98.52%
> >   Virtio1-input  Interrupts/Sec (CPU0/CPU1): 11,564/0
> >   Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
> > 
> > About a 77% increase over baseline and about 74% of baremetal.
> 
> Hmm we get about 20 packets per interrupt on average.
> That's pretty decent. The problem is with exits.
> Let's try something adaptive in the host?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
  2011-03-09 21:56     ` Michael S. Tsirkin
  2011-03-09 23:25       ` Tom Lendacky
@ 2011-03-10  0:59       ` Shirley Ma
  2011-03-10  2:30         ` Rick Jones
  1 sibling, 1 reply; 25+ messages in thread
From: Shirley Ma @ 2011-03-10  0:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Tom Lendacky, Rusty Russell, Krishna Kumar2, David Miller, kvm,
	netdev, steved

On Wed, 2011-03-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> >   Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
> >   Exits: 62,603.37 Exits/Sec
> >   TxCPU: 3.73%  RxCPU: 98.52%
> >   Virtio1-input  Interrupts/Sec (CPU0/CPU1): 11,564/0
> >   Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
> > 
> > About a 77% increase over baseline and about 74% of baremetal.
> 
> Hmm we get about 20 packets per interrupt on average.
> That's pretty decent. The problem is with exits.
> Let's try something adaptive in the host? 

I did some hacking on this before: for 32-64 stream TCP_RR cases, either
queuing multiple skbs per kick or delaying the vhost exit from handle_tx
improved aggregate TCP_RR performance, but single-stream TCP_RR latency
increased.

Here, the test is about 100 TCP_RR streams from a bare metal client to the
KVM guest, so the kick_notify rate from the guest RX path should be small
(it only kicks after refilling half the ring, and even that kick may be
suppressed because vhost has already disabled notifications).

The kick_notify from the guest TX path seems to be the main cause of the
huge number of guest exits: the guest kicks for every sent skb, and on that
kick vhost will most likely stop because the ring runs empty long before
VHOST_NET_WEIGHT is reached. Indirect buffers are in use, so I wonder how
many packets are processed per handle_tx here?

In theory, with lots of TCP_RR streams the guest should be able to keep
feeding xmit skbs to the send vq, so vhost should be able to keep
notifications disabled most of the time and the number of guest exits
should drop significantly. Why do we still see so many guest exits here?
Is it worth trying 256 TCP_RR streams (the send queue size)?

Tom's kick_notify data from Rusty's patch would help us understand
what's going on here.

Thanks
Shirley




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
  2011-03-10  0:59       ` Shirley Ma
@ 2011-03-10  2:30         ` Rick Jones
  0 siblings, 0 replies; 25+ messages in thread
From: Rick Jones @ 2011-03-10  2:30 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michael S. Tsirkin, Tom Lendacky, Rusty Russell, Krishna Kumar2,
	David Miller, kvm, netdev, steved

On Wed, 2011-03-09 at 16:59 -0800, Shirley Ma wrote:
> In theory, for lots of TCP_RR streams, the guest should be able to keep
> sending xmit skbs to send vq, so vhost should be able to disable
> notification most of the time, then number of guest exits should be
> significantly reduced? Why we saw lots of guest exits here still? Is it
> worth to try 256 (send queue size) TCP_RRs?

If these are single-transaction-at-a-time TCP_RRs rather than "burst
mode" then the number may be something other than send queue size to
keep it constantly active given the RTTs.  In the "bare iron" world at
least, that is one of the reasons I added the "burst mode" to the _RR
test - because it could take a Very Large Number of concurrent netperfs
to take a link to saturation, at which point it might have been just as
much a context switching benchmark as anything else :)

happy benchmarking,

rick jones


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
  2011-03-09 23:25       ` Tom Lendacky
@ 2011-03-10  6:54         ` Michael S. Tsirkin
  2011-03-10 15:23           ` Tom Lendacky
  0 siblings, 1 reply; 25+ messages in thread
From: Michael S. Tsirkin @ 2011-03-10  6:54 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Shirley Ma, Rusty Russell, Krishna Kumar2, David Miller, kvm,
	netdev, steved

On Wed, Mar 09, 2011 at 05:25:11PM -0600, Tom Lendacky wrote:
> As for which CPU the interrupt gets pinned to, that doesn't matter - see 
> below.

So what hurts us the most is that the IRQ jumps between the VCPUs?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
  2011-03-10  6:54         ` Michael S. Tsirkin
@ 2011-03-10 15:23           ` Tom Lendacky
  2011-03-10 15:34             ` Michael S. Tsirkin
  0 siblings, 1 reply; 25+ messages in thread
From: Tom Lendacky @ 2011-03-10 15:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Shirley Ma, Rusty Russell, Krishna Kumar2, David Miller, kvm,
	netdev, steved

On Thursday, March 10, 2011 12:54:58 am Michael S. Tsirkin wrote:
> On Wed, Mar 09, 2011 at 05:25:11PM -0600, Tom Lendacky wrote:
> > As for which CPU the interrupt gets pinned to, that doesn't matter - see
> > below.
> 
> So what hurts us the most is that the IRQ jumps between the VCPUs?

Yes, it appears that allowing the IRQ to run on more than one vCPU hurts.  
Without the publish last used index patch, vhost keeps injecting an irq for 
every received packet until the guest eventually turns off notifications. 
Because the irq injections end up overlapping we get contention on the 
irq_desc_lock_class lock. Here are some results using the "baseline" setup 
with irqbalance running.

  Txn Rate: 107,714.53 Txn/Sec, Pkt Rate: 214,006 Pkts/Sec
  Exits: 121,050.45 Exits/Sec
  TxCPU: 9.61%  RxCPU: 99.45%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 13,975/0
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0

About a 24% increase over baseline.  Irqbalance essentially pinned the virtio 
irq to CPU0 preventing the irq lock contention and resulting in nice gains.

> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
  2011-03-10 15:23           ` Tom Lendacky
@ 2011-03-10 15:34             ` Michael S. Tsirkin
  2011-03-10 17:16               ` Tom Lendacky
  0 siblings, 1 reply; 25+ messages in thread
From: Michael S. Tsirkin @ 2011-03-10 15:34 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Shirley Ma, Rusty Russell, Krishna Kumar2, David Miller, kvm,
	netdev, steved

On Thu, Mar 10, 2011 at 09:23:42AM -0600, Tom Lendacky wrote:
> On Thursday, March 10, 2011 12:54:58 am Michael S. Tsirkin wrote:
> > On Wed, Mar 09, 2011 at 05:25:11PM -0600, Tom Lendacky wrote:
> > > As for which CPU the interrupt gets pinned to, that doesn't matter - see
> > > below.
> > 
> > So what hurts us the most is that the IRQ jumps between the VCPUs?
> 
> Yes, it appears that allowing the IRQ to run on more than one vCPU hurts.  
> Without the publish last used index patch, vhost keeps injecting an irq for 
> every received packet until the guest eventually turns off notifications. 

Are you sure you see that? If yes publish used should help a lot.

> Because the irq injections end up overlapping we get contention on the 
> irq_desc_lock_class lock. Here are some results using the "baseline" setup 
> with irqbalance running.
> 
>   Txn Rate: 107,714.53 Txn/Sec, Pkt Rate: 214,006 Pkts/Sec
>   Exits: 121,050.45 Exits/Sec
>   TxCPU: 9.61%  RxCPU: 99.45%
>   Virtio1-input  Interrupts/Sec (CPU0/CPU1): 13,975/0
>   Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
> 
> About a 24% increase over baseline.  Irqbalance essentially pinned the virtio 
> irq to CPU0 preventing the irq lock contention and resulting in nice gains.

OK, so we probably want some form of delayed free for TX
on top, and that should get us nice results already.

> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
  2011-03-10 15:34             ` Michael S. Tsirkin
@ 2011-03-10 17:16               ` Tom Lendacky
  2011-03-18 15:38                 ` Tom Lendacky
  0 siblings, 1 reply; 25+ messages in thread
From: Tom Lendacky @ 2011-03-10 17:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Shirley Ma, Rusty Russell, Krishna Kumar2, David Miller, kvm,
	netdev, steved

On Thursday, March 10, 2011 09:34:22 am Michael S. Tsirkin wrote:
> On Thu, Mar 10, 2011 at 09:23:42AM -0600, Tom Lendacky wrote:
> > On Thursday, March 10, 2011 12:54:58 am Michael S. Tsirkin wrote:
> > > On Wed, Mar 09, 2011 at 05:25:11PM -0600, Tom Lendacky wrote:
> > > > As for which CPU the interrupt gets pinned to, that doesn't matter -
> > > > see below.
> > > 
> > > So what hurts us the most is that the IRQ jumps between the VCPUs?
> > 
> > Yes, it appears that allowing the IRQ to run on more than one vCPU hurts.
> > Without the publish last used index patch, vhost keeps injecting an irq
> > for every received packet until the guest eventually turns off
> > notifications.
> 
> Are you sure you see that? If yes publish used should help a lot.

I definitely see that.  I ran lockstat in the guest and saw the contention on 
the lock when the irq was able to run on either vCPU.  Once the irq was pinned 
the contention disappeared.  The publish used index patch should eliminate the 
extra irq injections and then the pinning or use of irqbalance shouldn't be 
required.  I'm getting a kernel oops during boot with the publish last used 
patches that I pulled from the mailing list - I had to make some changes in 
order to get them to apply and compile and might not have done the right 
things.  Can you re-spin that patchset against kvm.git?

> 
> > Because the irq injections end up overlapping we get contention on the
> > irq_desc_lock_class lock. Here are some results using the "baseline"
> > setup with irqbalance running.
> > 
> >   Txn Rate: 107,714.53 Txn/Sec, Pkt Rate: 214,006 Pkts/Sec
> >   Exits: 121,050.45 Exits/Sec
> >   TxCPU: 9.61%  RxCPU: 99.45%
> >   Virtio1-input  Interrupts/Sec (CPU0/CPU1): 13,975/0
> >   Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
> > 
> > About a 24% increase over baseline.  Irqbalance essentially pinned the
> > virtio irq to CPU0 preventing the irq lock contention and resulting in
> > nice gains.
> 
> OK, so we probably want some form of delayed free for TX
> on top, and that should get us nice results already.
> 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Network performance with small packets - continued
  2011-03-10 17:16               ` Tom Lendacky
@ 2011-03-18 15:38                 ` Tom Lendacky
  0 siblings, 0 replies; 25+ messages in thread
From: Tom Lendacky @ 2011-03-18 15:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Shirley Ma, Rusty Russell, Krishna Kumar2, David Miller, kvm,
	netdev, steved

On Thursday, March 10, 2011 11:16:11 am Tom Lendacky wrote:
> On Thursday, March 10, 2011 09:34:22 am Michael S. Tsirkin wrote:
> > On Thu, Mar 10, 2011 at 09:23:42AM -0600, Tom Lendacky wrote:
> > > On Thursday, March 10, 2011 12:54:58 am Michael S. Tsirkin wrote:
> > > > On Wed, Mar 09, 2011 at 05:25:11PM -0600, Tom Lendacky wrote:
> > > > > As for which CPU the interrupt gets pinned to, that doesn't matter
> > > > > - see below.
> > > > 
> > > > So what hurts us the most is that the IRQ jumps between the VCPUs?
> > > 
> > > Yes, it appears that allowing the IRQ to run on more than one vCPU
> > > hurts. Without the publish last used index patch, vhost keeps
> > > injecting an irq for every received packet until the guest eventually
> > > turns off notifications.
> > 
> > Are you sure you see that? If yes publish used should help a lot.
> 
> I definitely see that.  I ran lockstat in the guest and saw the contention
> on the lock when the irq was able to run on either vCPU.  Once the irq was
> pinned the contention disappeared.  The publish used index patch should
> eliminate the extra irq injections and then the pinning or use of
> irqbalance shouldn't be required.  I'm getting a kernel oops during boot
> with the publish last used patches that I pulled from the mailing list - I
> had to make some changes in order to get them to apply and compile and
> might not have done the right things.  Can you re-spin that patchset
> against kvm.git?
> 

Here are the results for the publish last used index patch (with the baseline 
provided again for reference).

Here is the KVM baseline (average of six runs):
  Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
  Exits: 148,444.58 Exits/Sec
  TxCPU: 2.40%  RxCPU: 99.35%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 5,154/5,222
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0

Using the publish last used index w/o irqbalance (average of six runs):
  Txn Rate: 112,180.10 Txn/Sec, Pkt Rate: 222,878.33 Pkts/Sec
  Exits: 96,280.11 Exits/Sec
  TxCPU: 1.14%  RxCPU: 99.33%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 3,400/3,400
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0

About a 29% increase over baseline.

Using the publish last used index w/  irqbalance (average of six runs):
  Txn Rate: 110,891.12 Txn/Sec, Pkt Rate: 220,315.67 Pkts/Sec
  Exits: 97,190.68 Exits/Sec
  TxCPU: 1.10%  RxCPU: 99.38%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 7,040/0
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0

About a 27% increase over baseline.

Here is data from running without the publish last used index patch but with 
irqbalance running (pinning results were near identical):
  Txn Rate: 107,714.53 Txn/Sec, Pkt Rate: 214,006 Pkts/Sec
  Exits: 121,050.45 Exits/Sec
  TxCPU: 9.61%  RxCPU: 99.45%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 13,975/0
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0

The publish last used index patch provides a 3%-4% improvement while reducing 
the exit rate and interrupt rate in the guest as well as reducing the 
transmitting system CPU% quite dramatically.
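
For context, the mechanism being measured here can be pictured roughly as
follows; the names are an illustrative sketch of the idea, not the posted
patches, and a real implementation needs proper memory barriers:

/* Guest side: after draining used entries, advertise how far we got. */
static void vring_publish_progress(struct vring_virtqueue *vq)
{
        *vq->last_used_publish = vq->last_used_idx;
}

/* Host (vhost) side: only inject an interrupt when the guest has already
 * consumed everything it was previously told about, so it may be idle.
 * Otherwise it is still processing, and another irq would just add to the
 * irq_desc lock contention seen in the earlier runs. */
static bool vhost_needs_signal(u16 guest_last_seen, u16 used_idx)
{
        return guest_last_seen == (u16)(used_idx - 1);
}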

> > > Because the irq injections end up overlapping we get contention on the
> > > irq_desc_lock_class lock. Here are some results using the "baseline"
> > > setup with irqbalance running.
> > > 
> > >   Txn Rate: 107,714.53 Txn/Sec, Pkt Rate: 214,006 Pkts/Sec
> > >   Exits: 121,050.45 Exits/Sec
> > >   TxCPU: 9.61%  RxCPU: 99.45%
> > >   Virtio1-input  Interrupts/Sec (CPU0/CPU1): 13,975/0
> > >   Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
> > > 
> > > About a 24% increase over baseline.  Irqbalance essentially pinned the
> > > virtio irq to CPU0 preventing the irq lock contention and resulting in
> > > nice gains.
> > 
> > OK, so we probably want some form of delayed free for TX
> > on top, and that should get us nice results already.
> > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2011-03-18 15:38 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <201103071631.41964.tahm@linux.vnet.ibm.com>
2011-03-09  7:15 ` Network performance with small packets - continued Michael S. Tsirkin
     [not found] ` <20110309071558.GA25757@redhat.com>
2011-03-09 15:45   ` Shirley Ma
2011-03-09 16:10     ` Michael S. Tsirkin
2011-03-09 16:25       ` Shirley Ma
2011-03-09 16:32         ` Michael S. Tsirkin
2011-03-09 16:38           ` Shirley Ma
2011-03-09 16:09   ` Tom Lendacky
2011-03-09 16:21     ` Shirley Ma
2011-03-09 16:28     ` Michael S. Tsirkin
2011-03-09 16:51     ` Shirley Ma
2011-03-09 17:16       ` Michael S. Tsirkin
2011-03-09 18:16         ` Shirley Ma
2011-03-09 22:51     ` Tom Lendacky
2011-03-09 20:11   ` Tom Lendacky
2011-03-09 21:56     ` Michael S. Tsirkin
2011-03-09 23:25       ` Tom Lendacky
2011-03-10  6:54         ` Michael S. Tsirkin
2011-03-10 15:23           ` Tom Lendacky
2011-03-10 15:34             ` Michael S. Tsirkin
2011-03-10 17:16               ` Tom Lendacky
2011-03-18 15:38                 ` Tom Lendacky
2011-03-10  0:59       ` Shirley Ma
2011-03-10  2:30         ` Rick Jones
2011-03-09 22:45     ` Shirley Ma
2011-03-09 22:57       ` Tom Lendacky
