Linux virtualization list
* [RFC V3 1/5] option: introduce qemu_get_opt_all()
From: Jason Wang @ 2012-07-06  9:31 UTC (permalink / raw)
  To: krkumar2, habanero, aliguori, rusty, mst, mashirle, qemu-devel,
	virtualization, tahm, jwhan, akong
  Cc: kvm
In-Reply-To: <1341567070-14136-1-git-send-email-jasowang@redhat.com>

Sometimes we need to pass options like -netdev tap,fd=100,fd=101,fd=102, which
cannot be parsed properly by qemu_opt_find() because it returns only the first
match it finds. So introduce qemu_opt_get_all(), which fills an array of
pointers with the values of all matching options.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 qemu-option.c |   19 +++++++++++++++++++
 qemu-option.h |    2 ++
 2 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/qemu-option.c b/qemu-option.c
index bb3886c..9263125 100644
--- a/qemu-option.c
+++ b/qemu-option.c
@@ -545,6 +545,25 @@ static QemuOpt *qemu_opt_find(QemuOpts *opts, const char *name)
     return NULL;
 }
 
+int qemu_opt_get_all(QemuOpts *opts, const char *name, const char **optp,
+                     int max)
+{
+    QemuOpt *opt;
+    int index = 0;
+
+    QTAILQ_FOREACH_REVERSE(opt, &opts->head, QemuOptHead, next) {
+        if (strcmp(opt->name, name) == 0) {
+            if (index < max) {
+                optp[index++] = opt->str;
+            }
+            if (index == max) {
+                break;
+            }
+        }
+    }
+    return index;
+}
+
 const char *qemu_opt_get(QemuOpts *opts, const char *name)
 {
     QemuOpt *opt = qemu_opt_find(opts, name);
diff --git a/qemu-option.h b/qemu-option.h
index 951dec3..3c9a273 100644
--- a/qemu-option.h
+++ b/qemu-option.h
@@ -106,6 +106,8 @@ struct QemuOptsList {
     QemuOptDesc desc[];
 };
 
+int qemu_opt_get_all(QemuOpts *opts, const char *name, const char **optp,
+                     int max);
 const char *qemu_opt_get(QemuOpts *opts, const char *name);
 bool qemu_opt_get_bool(QemuOpts *opts, const char *name, bool defval);
 uint64_t qemu_opt_get_number(QemuOpts *opts, const char *name, uint64_t defval);
-- 
1.7.1
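
For illustration only, here is a minimal standalone sketch of the "collect
every matching option" pattern described in the commit message. The
struct opt_entry type, the collect_all() name, and the plain array standing
in for the option list are assumptions made up for this example; they are not
QEMU's QemuOpts/QTAILQ types. Only the (name, optp, max) calling convention
and the last-to-first walk mirror the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for QemuOpt: one parsed "name=value" pair. */
struct opt_entry {
    const char *name;
    const char *str;
};

/*
 * Store up to max values whose name matches into optp, walking the
 * entries from last to first (the patch walks its list in reverse too).
 * Returns the number of values stored.
 */
int collect_all(const struct opt_entry *opts, size_t nopts,
                const char *name, const char **optp, int max)
{
    int index = 0;
    size_t i;

    assert(opts != NULL && name != NULL && optp != NULL);

    for (i = nopts; i > 0 && index < max; i--) {
        if (strcmp(opts[i - 1].name, name) == 0) {
            optp[index++] = opts[i - 1].str;
        }
    }
    return index;
}
```

With the entries id=h0, queues=2, fd=10, fd=11 and name "fd", this sketch
stores "11" and then "10" and returns 2, so a caller that cares about
command-line order would have to reverse the result.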


* [RFC V3 0/5] Multiqueue support for tap and virtio-net/vhost
From: Jason Wang @ 2012-07-06  9:31 UTC (permalink / raw)
  To: krkumar2, habanero, aliguori, rusty, mst, mashirle, qemu-devel,
	virtualization, tahm, jwhan, akong
  Cc: kvm

Hello all:

This series updates the previous version of the multiqueue support, adding
multiqueue capability to both tap and virtio-net.

Some tap backends already support multiqueue (macvtap in Linux) or will
(tap). In such a backend, each file descriptor of a tap is a queue, and
ioctls are provided to attach an existing tap file descriptor to the
tun/tap device. This series lets qemu use this kind of backend and
transmit and receive packets through multiple file descriptors.

Patch 1 introduces a new helper to get all matching options. After this patch,
we can pass multiple file descriptors to a single netdev with:

      qemu -netdev tap,id=h0,queues=2,fd=10,fd=11 ...

Patch 2 introduces generic helpers in tap to attach a file descriptor to, or
detach it from, a tap device; emulated NICs can use these helpers to
enable/disable queues.

Patch 3 modifies NICState to allow multiple VLANClientState to be stored in
it. With this patch, qemu has basic support for multiqueue-capable tap backends.

Patch 4 implements a 1:1 mapping of tx/rx virtqueue pairs with the vhost_net
backend.

Patch 5 converts virtio-net to a multiqueue device. After this patch, a
multiqueue virtio-net device can be specified with:

      qemu -netdev tap,id=h0,queues=2 -device virtio-net-pci,netdev=h0,queues=2

Performance numbers:

I posted them in the thread of the multiqueue virtio-net driver RFC:
http://www.spinics.net/lists/kvm/msg75386.html

Multiqueue with vhost shows an improvement in TCP_RR and a degradation for
small-packet transmission.

Changes from V2:
- split vhost patch from virtio-net
- add the support of queue number negotiation through control virtqueue
- hotplug, set_link and migration support
- bug fixes

Changes from V1:

- rebase to the latest
- fix memory leak in parse_netdev
- fix guest notifiers assignment/de-assignment
- changes the command lines to:
   qemu -netdev tap,queues=2 -device virtio-net-pci,queues=2

References:
- V2 http://www.spinics.net/lists/kvm/msg74588.html
- V1 http://comments.gmane.org/gmane.comp.emulators.qemu/100481

Jason Wang (5):
  option: introduce qemu_get_opt_all()
  tap: multiqueue support
  net: multiqueue support
  vhost: multiqueue support
  virtio-net: add multiqueue support

 hw/dp8393x.c         |    2 +-
 hw/mcf_fec.c         |    2 +-
 hw/qdev-properties.c |   34 +++-
 hw/qdev.h            |    3 +-
 hw/vhost.c           |   53 ++++--
 hw/vhost.h           |    2 +
 hw/vhost_net.c       |    7 +-
 hw/vhost_net.h       |    2 +-
 hw/virtio-net.c      |  505 ++++++++++++++++++++++++++++++++++----------------
 hw/virtio-net.h      |   12 ++
 net.c                |   83 +++++++--
 net.h                |   16 ++-
 net/tap-aix.c        |   13 ++-
 net/tap-bsd.c        |   13 ++-
 net/tap-haiku.c      |   13 ++-
 net/tap-linux.c      |   56 ++++++-
 net/tap-linux.h      |    4 +
 net/tap-solaris.c    |   13 ++-
 net/tap-win32.c      |   11 +
 net/tap.c            |  199 +++++++++++++-------
 net/tap.h            |    7 +-
 qemu-option.c        |   19 ++
 qemu-option.h        |    2 +
 23 files changed, 787 insertions(+), 284 deletions(-)


* Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
From: Jason Wang @ 2012-07-06  9:26 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: krkumar2, habanero, kvm, mst, netdev, mashirle, linux-kernel,
	virtualization, edumazet, Sasha Levin, jwhan, sri, davem, tahm
In-Reply-To: <20120705233816.3ec0b827@nehalam.linuxnetplumber.net>

On 07/06/2012 02:38 PM, Stephen Hemminger wrote:
> On Fri, 06 Jul 2012 11:20:06 +0800
> Jason Wang<jasowang@redhat.com>  wrote:
>
>> On 07/05/2012 08:51 PM, Sasha Levin wrote:
>>> On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
>>>> @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
>>>>           if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>>>                   vi->has_cvq = true;
>>>>
>>>> +       /* Use single tx/rx queue pair as default */
>>>> +       vi->num_queue_pairs = 1;
>>>> +       vi->total_queue_pairs = num_queue_pairs;
>>> The code is using this "default" even if the amount of queue pairs it
>>> wants was specified during initialization. This basically limits any
>>> device to use 1 pair when starting up.
>>>
>> Yes, currently the virtio-net driver would use 1 txq/txq by default
>> since multiqueue may not outperform in all kinds of workload. So it's
>> better to keep it as default and let user enable multiqueue by ethtool -L.
>>
> I would prefer that the driver sized number of queues based on number
> of online CPU's. That is what real hardware does. What kind of workload
> are you doing? If it is some DBMS benchmark then maybe the issue is that
> some CPU's need to be reserved.

I ran the rr and stream tests of netperf; multiqueue shows an improvement in
the rr test and a regression for small-packet transmission in the stream
test. For small-packet transmission, multiqueue tends to send many more small
packets, which also increases the CPU utilization. I suspect multiqueue is
faster and TCP does not merge big enough packets to send, but this needs
more thought.


* Re: SCSI Performance regression [was Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6]
From: Nicholas A. Bellinger @ 2012-07-06  9:13 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Anthony Liguori, kvm-devel, linux-scsi,
	Michael S. Tsirkin, lf-virt, Anthony Liguori, target-devel,
	ksummit-2012-discuss, Paolo Bonzini, Zhi Yong Wu,
	Christoph Hellwig, Stefan Hajnoczi
In-Reply-To: <1341553397.3023.16.camel@dabdike.hilton.com>

On Fri, 2012-07-06 at 09:43 +0400, James Bottomley wrote:
> On Thu, 2012-07-05 at 20:01 -0700, Nicholas A. Bellinger wrote:
> 
> > So I'm pretty sure this discrepancy is attributed to the small block
> > random I/O bottleneck currently present for all Linux/SCSI core LLDs
> > regardless of physical or virtual storage fabric.
> > 
> > The SCSI wide host-lock less conversion that happened in .38 code back
> > in 2010, and subsequently having LLDs like virtio-scsi convert to run in
> > host-lock-less mode have helped to some extent..  But it's still not
> > enough..
> > 
> > Another example where we've been able to prove this bottleneck recently
> > is with the following target setup:
> > 
> > *) Intel Romley production machines with 128 GB of DDR-3 memory
> > *) 4x FusionIO ioDrive 2 (1.5 TB @ PCI-e Gen2 x2)
> > *) Mellanox PCI-exress Gen3 HCA running at 56 gb/sec 
> > *) Infiniband SRP Target backported to RHEL 6.2 + latest OFED
> > 
> > In this setup using ib_srpt + IBLOCK w/ emulate_write_cache=1 +
> > iomemory_vsl export we end up avoiding SCSI core bottleneck on the
> > target machine, just as with the tcm_vhost example here for host kernel
> > side processing with vhost.
> > 
> > Using Linux IB SRP initiator + Windows Server 2008 R2 SCSI-miniport SRP
> > (OFED) Initiator connected to four ib_srpt LUNs, we've observed that
> > MSFT SCSI is currently outperforming RHEL 6.2 on the order of ~285K vs.
> > ~215K with heavy random 4k WRITE iometer / fio tests.  Note this with an
> > optimized queue_depth ib_srp client w/ noop I/O schedulering, but is
> > still lacking the host_lock-less patches on RHEL 6.2 OFED..
> > 
> > This bottleneck has been mentioned by various people (including myself)
> > on linux-scsi the last 18 months, and I've proposed that that it be
> > discussed at KS-2012 so we can start making some forward progress:
> 
> Well, no, it hasn't.  You randomly drop things like this into unrelated
> email (I suppose that is a mention in strict English construction) but
> it's not really enough to get anyone to pay attention since they mostly
> stopped reading at the top, if they got that far: most people just go by
> subject when wading through threads initially.
> 

It most certainly has been made clear to me, numerous times from many
people in the Linux/SCSI community that there is a bottleneck for small
block random I/O in SCSI core vs. raw Linux/Block, as well as vs. non
Linux based SCSI subsystems.

My apologies if mentioning this issue last year at LC 2011 to you
privately did not take a tone of a more serious nature, or that
proposing a topic for LSF-2012 this year was not a clear enough
indication of a problem with SCSI small block random I/O performance.

> But even if anyone noticed, a statement that RHEL6.2 (on a 2.6.32
> kernel, which is now nearly three years old) is 25% slower than W2k8R2
> on infiniband isn't really going to get anyone excited either
> (particularly when you mention OFED, which usually means a stack
> replacement on Linux anyway).
> 

The specific issue was first raised for .38 where we were able to get
most of the interesting high performance LLDs converted to using
internal locking methods so that host_lock did not have to be obtained
during each ->queuecommand() I/O dispatch, right..?

This has helped a good deal for large multi-lun scsi_host configs that
are now running in host-lock less mode, but there is still a large
discrepancy between single LUN and raw struct block_device access, even
with LLD host_lock less mode enabled.

Now I think the virtio-blk client performance is demonstrating this
issue pretty vividly, along with this week's tcm_vhost IBLOCK raw block
flash benchmarks that demonstrate some other yet-to-be determined
limitations for virtio-scsi-raw vs. tcm_vhost for this particular fio
randrw workload.

> What people might pay attention to is evidence that there's a problem in
> 3.5-rc6 (without any OFED crap).  If you're not going to bother
> investigating, it has to be in an environment they can reproduce (so
> ordinary hardware, not infiniband) otherwise it gets ignored as an
> esoteric hardware issue.
> 

It's really quite simple for anyone to demonstrate the bottleneck
locally on any machine using tcm_loop with raw block flash.  Take a
struct block_device backend (like a Fusion IO /dev/fio*) and, using
IBLOCK, export locally accessible SCSI LUNs via tcm_loop..

Using FIO there is a significant drop for randrw 4k performance between
tcm_loop <-> IBLOCK vs. raw struct block device backends.  And no, it's
not some type of target IBLOCK or tcm_loop bottleneck, it's a per SCSI
LUN limitation for small block random I/Os on the order of ~75K for each
SCSI LUN.

If anyone has actually gone faster than this with any single SCSI
LUN on any storage fabric, I would be interested in hearing about your
setup.

Thanks,

--nab


* Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
From: Sasha Levin @ 2012-07-06  8:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, habanero, mashirle, kvm, mst, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <4FF65966.9040600@redhat.com>

On Fri, 2012-07-06 at 11:20 +0800, Jason Wang wrote:
> On 07/05/2012 08:51 PM, Sasha Levin wrote:
> > On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
> >> @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
> >>          if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
> >>                  vi->has_cvq = true;
> >>
> >> +       /* Use single tx/rx queue pair as default */
> >> +       vi->num_queue_pairs = 1;
> >> +       vi->total_queue_pairs = num_queue_pairs;
> > The code is using this "default" even if the amount of queue pairs it
> > wants was specified during initialization. This basically limits any
> > device to use 1 pair when starting up.
> >
> 
> Yes, currently the virtio-net driver would use 1 txq/txq by default 
> since multiqueue may not outperform in all kinds of workload. So it's 
> better to keep it as default and let user enable multiqueue by ethtool -L.

I think it makes sense to set it to 1 if the amount of initial queue
pairs wasn't specified.

On the other hand, if a virtio-net driver was probed to provide
VIRTIO_NET_F_MULTIQUEUE and has set something reasonable in
virtio_net_config.num_queues, then that setting shouldn't be quietly
ignored and reset back to 1.

What I'm basically saying is that I agree that the *default* should be 1
- but if the user has explicitly asked for something else during
initialization, then the default should be overridden.
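
To make the suggestion concrete, here is a small sketch of that selection
logic as a standalone function. Everything in it is an assumption for
illustration: initial_queue_pairs(), its parameters, and the convention that
0 means "not specified" are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Pick how many queue pairs to enable at probe time (hypothetical helper).
 *   has_mq:    whether the multiqueue feature was negotiated
 *   cfg_pairs: pairs explicitly requested at initialization, 0 if none
 *   max_pairs: upper bound the driver is willing to use
 */
uint16_t initial_queue_pairs(bool has_mq, uint16_t cfg_pairs,
                             uint16_t max_pairs)
{
    assert(max_pairs >= 1);

    /* Nothing was requested: keep the single-pair default. */
    if (!has_mq || cfg_pairs == 0)
        return 1;

    /* An explicit request overrides the default, within the limit. */
    return cfg_pairs < max_pairs ? cfg_pairs : max_pairs;
}
```

The point is only that the default applies when no value was supplied; an
explicit value is honoured up to the supported maximum.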


* Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
From: Jason Wang @ 2012-07-06  7:46 UTC (permalink / raw)
  To: Amos Kong
  Cc: krkumar2, habanero, mashirle, kvm, mst, netdev, linux-kernel,
	virtualization, edumazet, Sasha Levin, jwhan, sri, davem, tahm
In-Reply-To: <4FF5F3F7.8090307@redhat.com>

On 07/06/2012 04:07 AM, Amos Kong wrote:
> On 07/05/2012 08:51 PM, Sasha Levin wrote:
>> On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
>>> @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
>>>          if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>>                  vi->has_cvq = true;
>>>
>
>>> +       /* Use single tx/rx queue pair as default */
>>> +       vi->num_queue_pairs = 1;
>>> +       vi->total_queue_pairs = num_queue_pairs;
> vi->total_queue_pairs also should be set to 1
>
>             vi->total_queue_pairs = 1;

Hi Amos:

total_queue_pairs is the maximum number of queue pairs that the device can
provide, so it's ok here.
>> The code is using this "default" even if the amount of queue pairs it
>> wants was specified during initialization. This basically limits any
>> device to use 1 pair when starting up.
>>
>


* Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
From: Jason Wang @ 2012-07-06  7:45 UTC (permalink / raw)
  To: Amos Kong
  Cc: krkumar2, habanero, mashirle, kvm, mst, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <4FF5F2F2.9050307@redhat.com>

On 07/06/2012 04:02 AM, Amos Kong wrote:
> On 07/05/2012 06:29 PM, Jason Wang wrote:
>> This patch converts virtio_net to a multi queue device. After negotiated
>> VIRTIO_NET_F_MULTIQUEUE feature, the virtio device has many tx/rx queue pairs,
>> and driver could read the number from config space.
>>
>> The driver expects the number of rx/tx queue paris is equal to the number of
>> vcpus. To maximize the performance under this per-cpu rx/tx queue pairs, some
>> optimization were introduced:
>>
>> - Txq selection is based on the processor id in order to avoid contending a lock
>>    whose owner may exits to host.
>> - Since the txq/txq were per-cpu, affinity hint were set to the cpu that owns
>>    the queue pairs.
>>
>> Signed-off-by: Krishna Kumar<krkumar2@in.ibm.com>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>> ---
> ...
>
>>
>>   static int virtnet_probe(struct virtio_device *vdev)
>>   {
>> -	int err;
>> +	int i, err;
>>   	struct net_device *dev;
>>   	struct virtnet_info *vi;
>> +	u16 num_queues, num_queue_pairs;
>> +
>> +	/* Find if host supports multiqueue virtio_net device */
>> +	err = virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
>> +				offsetof(struct virtio_net_config,
>> +				num_queues),&num_queues);
>> +
>> +	/* We need atleast 2 queue's */
>
> s/atleast/at least/
>
>
>> +	if (err || num_queues<  2)
>> +		num_queues = 2;
>> +	if (num_queues>  MAX_QUEUES * 2)
>> +		num_queues = MAX_QUEUES;
>                  num_queues = MAX_QUEUES * 2;
>
> MAX_QUEUES is the limitation of RX or TX.

Right, it's a typo, thanks.
>> +
>> +	num_queue_pairs = num_queues / 2;
> ...
>
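
For reference, a minimal standalone sketch of the clamp with the correction
above applied. The MAX_QUEUES value of 256 and both function names are
assumptions for this example; the real value and the surrounding probe code
are not shown in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-direction (rx or tx) limit, chosen for illustration. */
#define MAX_QUEUES 256

/*
 * Clamp the total virtqueue count (rx + tx) read from config space.
 * config_err is non-zero when the field could not be read.
 */
uint16_t clamp_num_queues(int config_err, uint16_t num_queues)
{
    /* We need at least one rx and one tx queue. */
    if (config_err || num_queues < 2)
        num_queues = 2;
    /* MAX_QUEUES limits each direction, so the total cap is twice that. */
    if (num_queues > MAX_QUEUES * 2)
        num_queues = MAX_QUEUES * 2;
    return num_queues;
}

/* Each pair consists of one rx queue and one tx queue. */
uint16_t num_queue_pairs(uint16_t num_queues)
{
    assert(num_queues >= 2);
    return num_queues / 2;
}
```

With the original assignment (num_queues = MAX_QUEUES), an oversized value
would have been cut down to MAX_QUEUES / 2 pairs, half of what the limit
allows.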


* Re: [net-next RFC V5 0/5] Multiqueue virtio-net
From: Jason Wang @ 2012-07-06  7:42 UTC (permalink / raw)
  To: Rick Jones
  Cc: krkumar2, habanero, mashirle, kvm, mst, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <4FF5D2B7.6080602@hp.com>

On 07/06/2012 01:45 AM, Rick Jones wrote:
> On 07/05/2012 03:29 AM, Jason Wang wrote:
>
>>
>> Test result:
>>
>> 1) 1 vm 2 vcpu 1q vs 2q, 1 - 1q, 2 - 2q, no pinning
>>
>> - Guest to External Host TCP STREAM
>> sessions size throughput1 throughput2   norm1 norm2
>> 1 64 650.55 655.61 100% 24.88 24.86 99%
>> 2 64 1446.81 1309.44 90% 30.49 27.16 89%
>> 4 64 1430.52 1305.59 91% 30.78 26.80 87%
>> 8 64 1450.89 1270.82 87% 30.83 25.95 84%
>
> Was the -D test-specific option used to set TCP_NODELAY?  I'm guessing 
> from your description of how packet sizes were smaller with multiqueue 
> and your need to hack tcp_write_xmit() it wasn't but since we don't 
> have the specific netperf command lines (hint hint :) I wanted to make 
> certain.
Hi Rick:

I didn't specify -D to disable Nagle. I also collected rx packets and
average packet size:

Guest to External Host ( 2vcpu 1q vs 2q )
sessions size tput-sq tput-mq    % norm-sq norm-mq    % #tx-pkts-sq #tx-pkts-mq    % avg-sz-sq avg-sz-mq    %
       1   64  668.85  671.13 100%   25.80   26.86 104%      629038      627126  99%      1395      1403 100%
       2   64 1421.29 1345.40  94%   32.06   27.57  85%     1318498     1246721  94%      1413      1414 100%
       4   64 1469.96 1365.42  92%   32.44   27.04  83%     1362542     1277848  93%      1414      1401  99%
       8   64 1131.00 1361.58 120%   24.81   26.76 107%     1223700     1280970 104%      1395      1394  99%
       1  256 1883.98 1649.87  87%   60.67   58.48  96%     1542775     1465836  95%      1592      1472  92%
       2  256 4847.09 3539.74  73%   98.35   64.05  65%     2683346     3074046 114%      2323      1505  64%
       4  256 5197.33 3283.48  63%  109.14   62.39  57%     1819814     2929486 160%      3636      1467  40%
       8  256 5953.53 3359.22  56%  122.75   64.21  52%      906071     2924148 322%      8282      1502  18%
       1  512 3019.70 2646.07  87%   93.89   86.78  92%     2003780     2256077 112%      1949      1532  78%
       2  512 7455.83 5861.03  78%  173.79  104.43  60%     1200322     3577142 298%      7831      2114  26%
       4  512 8962.28 7062.20  78%  213.08  127.82  59%      468142     2594812 554%     24030      3468  14%
       8  512 7849.82 8523.85 108%  175.41  154.19  87%      304923     1662023 545%     38640      6479  16%

When multiqueue is enabled, it does achieve a higher packet rate, but
with a much smaller packet size. It looks to me that multiqueue is
faster and guest TCP has less opportunity to build larger skbs to send,
so lots of small packets have to be sent, which leads to many more exits
and more vhost work. One interesting thing is that if I run tcpdump on
the host where the guest runs, I get an obvious throughput increase. To
verify the assumption, I hacked tcp_write_xmit() with the following patch
and set tcp_tso_win_divisor=1; multiqueue can then outperform, or at
least match, the throughput of singlequeue, though it could introduce
latency, which I haven't measured.

I'm not a TCP expert, but the changes look reasonable:
- we can do the full-sized TSO check in tcp_tso_should_defer() only for
  westwood, according to tcp westwood
- run tcp_tso_should_defer() for tso_segs = 1 when TSO is enabled.

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c465d3e..166a888 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1567,7 +1567,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)

         in_flight = tcp_packets_in_flight(tp);

-       BUG_ON(tcp_skb_pcount(skb) <= 1 || (tp->snd_cwnd <= in_flight));
+       BUG_ON(tp->snd_cwnd <= in_flight);

         send_win = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq;

@@ -1576,9 +1576,11 @@ static bool tcp_tso_should_defer(struct sock *sk, 
struct sk_buff *skb)

         limit = min(send_win, cong_win);

+#if 0
         /* If a full-sized TSO skb can be sent, do it. */
         if (limit >= sk->sk_gso_max_size)
                 goto send_now;
+#endif

         /* Middle in queue won't get any more data, full sendable already? */
         if ((skb != tcp_write_queue_tail(sk)) && (limit >= skb->len))
@@ -1795,10 +1797,9 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
                                                      (tcp_skb_is_last(sk, skb) ?
                                                       nonagle : TCP_NAGLE_PUSH))))
                                 break;
-               } else {
-                       if (!push_one && tcp_tso_should_defer(sk, skb))
-                               break;
                 }
+               if (!push_one && tcp_tso_should_defer(sk, skb))
+                       break;

                 limit = mss_now;
                 if (tso_segs > 1 && !tcp_urg_mode(tp))




>
> Instead of calling them throughput1 and throughput2, it might be more 
> clear in future to identify them as singlequeue and multiqueue.
>

Sure.
> Also, how are you combining the concurrent netperf results?  Are you 
> taking sums of what netperf reports, or are you gathering statistics 
> outside of netperf?
>

The throughput was simply summed from the netperf results, as the netperf
manual suggests. The CPU utilization was measured with mpstat.
>> - TCP RR
>> sessions size throughput1 throughput2   norm1 norm2
>> 50 1 54695.41 84164.98 153% 1957.33 1901.31 97%
>
> A single instance TCP_RR test would help confirm/refute any 
> non-trivial change in (effective) path length between the two cases.
>

Yes, I will test this, thanks.
> happy benchmarking,
>
> rick jones


* Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
From: Stephen Hemminger @ 2012-07-06  6:38 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, habanero, kvm, mst, netdev, mashirle, linux-kernel,
	virtualization, edumazet, Sasha Levin, jwhan, sri, davem, tahm
In-Reply-To: <4FF65966.9040600@redhat.com>

On Fri, 06 Jul 2012 11:20:06 +0800
Jason Wang <jasowang@redhat.com> wrote:

> On 07/05/2012 08:51 PM, Sasha Levin wrote:
> > On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
> >> @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
> >>          if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
> >>                  vi->has_cvq = true;
> >>
> >> +       /* Use single tx/rx queue pair as default */
> >> +       vi->num_queue_pairs = 1;
> >> +       vi->total_queue_pairs = num_queue_pairs;
> > The code is using this "default" even if the amount of queue pairs it
> > wants was specified during initialization. This basically limits any
> > device to use 1 pair when starting up.
> >
> 
> Yes, currently the virtio-net driver would use 1 txq/txq by default 
> since multiqueue may not outperform in all kinds of workload. So it's 
> better to keep it as default and let user enable multiqueue by ethtool -L.
> 

I would prefer that the driver sized number of queues based on number
of online CPU's. That is what real hardware does. What kind of workload
are you doing? If it is some DBMS benchmark then maybe the issue is that
some CPU's need to be reserved.


* Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6
From: Paolo Bonzini @ 2012-07-06  5:39 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: Jens Axboe, Anthony Liguori, linux-scsi, kvm-devel,
	Michael S. Tsirkin, lf-virt, Anthony Liguori, target-devel,
	Zhi Yong Wu, Christoph Hellwig, Stefan Hajnoczi
In-Reply-To: <1341545893.23954.325.camel@haakon2.linux-iscsi.org>

Il 06/07/2012 05:38, Nicholas A. Bellinger ha scritto:
> So I imagine that setting inquiry/vpd/mode via configfs attribs to match
> whatever the guest wants to see (or expects to see) can be enabled
> via /sys/kernel/config/target/core/$HBA/$DEV/[wwn,attrib]/ easily to
> whatever is required.
> 
> However, beyond basic SCSI WWN related bits, I would avoid trying to
> match complex SCSI target state between the in-kernel patch and QEMU
> SCSI.

Agreed.  It should just be the bare minimum to make stable /dev/disk
paths, well, stable between the two backends.

>>> so that it is not possible to migrate one to the other.
>>
>> Migration between different backend types does not seem all that useful.
>> The general rule is you need identical flags on both sides to allow
>> migration, and it is not clear how valuable it is to relax this
>> somewhat.
> 
> I really need to learn more about how QEMU Live migration works wrt to
> storage before saying how this may (or may not) work.

vhost-scsi live migration should be easy to fix.  You need some ioctl or
eventfd mechanism to communicate to userspace that there is no pending
I/O, but you need that anyway also for other operations (as simple as
stopping the VM: QEMU guarantees that the "stop" monitor command returns
only when there is no outstanding I/O).

What worries me most is: 1) the amount of functionality that is lost
with vhost-scsi, especially the new live operations that we're adding to
QEMU; 2) whether any hook we introduce in the QEMU block layer will
cause problems down the road when we set to fix the existing
virtio-blk/virtio-scsi-qemu performance problems.  This is the reason
why I'm reluctant to merge the QEMU bits.  The kernel bits are
self-contained and are much less problematic.

It may well be that _the same_ (or very similar) hooks will be needed by
both tcm_vhost and high-performance userspace virtio backends.  This
would of course remove the objection.

Paolo


* Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6
From: Nicholas A. Bellinger @ 2012-07-06  3:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jens Axboe, Anthony Liguori, linux-scsi, kvm-devel, lf-virt,
	Anthony Liguori, target-devel, Paolo Bonzini, Zhi Yong Wu,
	Christoph Hellwig, Stefan Hajnoczi
In-Reply-To: <20120705135318.GG30572@redhat.com>

On Thu, 2012-07-05 at 16:53 +0300, Michael S. Tsirkin wrote:
> On Thu, Jul 05, 2012 at 12:22:33PM +0200, Paolo Bonzini wrote:
> > Il 05/07/2012 03:52, Nicholas A. Bellinger ha scritto:
> > > 
> > > fio randrw workload | virtio-scsi-raw | virtio-scsi+tcm_vhost | bare-metal raw block
> > > ------------------------------------------------------------------------------------
> > > 25 Write / 75 Read  |      ~15K       |         ~45K          |         ~70K
> > > 75 Write / 25 Read  |      ~20K       |         ~55K          |         ~60K
> > 
> > This is impressive, but I think it's still not enough to justify the
> > inclusion of tcm_vhost.  In my opinion, vhost-blk/vhost-scsi are mostly
> > worthwhile as drivers for improvements to QEMU performance.  We want to
> > add more fast paths to QEMU that let us move SCSI and virtio processing
> > to separate threads, we have proof of concepts that this can be done,
> > and we can use vhost-blk/vhost-scsi to find bottlenecks more effectively.
> 
> A general rant below:
> 
> OTOH if it works, and adds value, we really should consider including code.
> To me, it does not make sense to reject code just because in theory
> someone could write even better code. Code walks. Time to marker matters too.
> Yes I realize more options increases support. But downstreams can make
> their own decisions on whether to support some configurations:
> add a configure option to disable it and that's enough.
> 

+1 for mst here.

I think that type of sentiment deserves a toast at KS/LC in August.  ;)

> > In fact, virtio-scsi-qemu and virtio-scsi-vhost are effectively two
> > completely different devices that happen to speak the same SCSI
> > transport.  Not only virtio-scsi-vhost must be configured outside QEMU
> 
> configuration outside QEMU is OK I think - real users use
> management anyway. But maybe we can have helper scripts
> like we have for tun?
> 
> > and doesn't support -device;
> 
> This needs to be fixed I think.
> 
> > it (obviously) presents different
> > inquiry/vpd/mode data than virtio-scsi-qemu,
> 
> Why is this obvious and can't be fixed? Userspace virtio-scsi
> is pretty flexible - can't it supply matching inquiry/vpd/mode data
> so that switching is transparent to the guest?
> 

So I imagine that setting inquiry/vpd/mode via configfs attribs to match
whatever the guest wants to see (or expects to see) can be enabled
via /sys/kernel/config/target/core/$HBA/$DEV/[wwn,attrib]/ easily to
whatever is required.

However, beyond basic SCSI WWN related bits, I would avoid trying to
match complex SCSI target state between the in-kernel patch and QEMU
SCSI.  We've had this topic come up numerous times over the years for
other fabric modules (namely iscsi-target) and usually it ends up with
a long email thread re-hashing history of failures until Linus starts
yelling at the person who is pushing complex kernel <-> user split.

The part where I start to get nervous is where you get into the cluster
+ multipath features.  We have methods in TCM core that rebuild the
exact state of these bits based upon external file metadata, based upon
the running configfs layout.  This is used by physical node failover +
re-takeover to ensure the SCSI client sees exactly the same SCSI state.

Trying to propagate up this type of complexity is where I think you go
overboard.  KISS and let's let fabric independent configfs (leaning on
vfs) do the hard work for tracking these types of SCSI relationships.

> > so that it is not possible to migrate one to the other.
> 
> Migration between different backend types does not seem all that useful.
> The general rule is you need identical flags on both sides to allow
> migration, and it is not clear how valuable it is to relax this
> somewhat.
> 

I really need to learn more about how QEMU Live migration works wrt
storage before saying how this may (or may not) work.

We certainly have no problems doing physical machine failover with
target_core_mod for iscsi-target, and ATM I don't see why the QEMU
userspace process driving the real-time configfs configuration of the
storage fabric would not work..

> > I don't think vhost-scsi is particularly useful for virtualization,
> > honestly.  However, if it is useful for development, testing or
> > benchmarking of lio itself (does this make any sense? :)) that could be
> > by itself a good reason to include it.
> > 
> > Paolo
> 


* Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
From: Jason Wang @ 2012-07-06  3:20 UTC (permalink / raw)
  To: Sasha Levin
  Cc: krkumar2, habanero, mashirle, kvm, mst, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <1341492679.18786.18.camel@lappy>

On 07/05/2012 08:51 PM, Sasha Levin wrote:
> On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
>> @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
>>          if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>                  vi->has_cvq = true;
>>
>> +       /* Use single tx/rx queue pair as default */
>> +       vi->num_queue_pairs = 1;
>> +       vi->total_queue_pairs = num_queue_pairs;
> The code is using this "default" even if the amount of queue pairs it
> wants was specified during initialization. This basically limits any
> device to use 1 pair when starting up.
>

Yes, currently the virtio-net driver would use 1 txq/rxq pair by default,
since multiqueue may not outperform single queue for all kinds of
workloads. So it's better to keep it as the default and let the user
enable multiqueue with ethtool -L.
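
A minimal userspace sketch of that policy, not the driver code: the struct and both function names are hypothetical, and only the two field names come from the quoted hunk. It assumes total_queue_pairs records what the device offers while num_queue_pairs is what is currently in use.

```c
#include <errno.h>

/* Simplified stand-in for the driver's private state; only the two
 * fields from the quoted hunk are modelled (assumption). */
struct virtnet_info_sketch {
	unsigned short num_queue_pairs;   /* pairs currently in use */
	unsigned short total_queue_pairs; /* pairs the device offers */
};

/* Probe-time default: remember what the device offers, use one pair. */
void sketch_init(struct virtnet_info_sketch *vi, unsigned short offered)
{
	vi->total_queue_pairs = offered ? offered : 1;
	vi->num_queue_pairs = 1;
}

/* Hypothetical handler for a later "ethtool -L" style request:
 * accept anything from 1 up to the number of pairs on offer. */
int sketch_set_queue_pairs(struct virtnet_info_sketch *vi,
			   unsigned short wanted)
{
	if (wanted == 0 || wanted > vi->total_queue_pairs)
		return -EINVAL;
	vi->num_queue_pairs = wanted;
	return 0;
}
```

With this split the device still starts on one pair, but a later request can raise the count up to what was negotiated.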


* Re: [net-next RFC V5 2/5] virtio_ring: move queue_index to vring_virtqueue
From: Jason Wang @ 2012-07-06  3:17 UTC (permalink / raw)
  To: Sasha Levin
  Cc: krkumar2, habanero, mashirle, kvm, mst, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <1341488454.18786.15.camel@lappy>

On 07/05/2012 07:40 PM, Sasha Levin wrote:
> On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
>> Instead of storing the queue index in virtio infos, this patch moves them to
>> vring_virtqueue and introduces helpers to set and get the value. This would
>> simplify the management and tracing.
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
> This patch actually fails to compile:
>
> drivers/virtio/virtio_mmio.c: In function ‘vm_notify’:
> drivers/virtio/virtio_mmio.c:229:13: error: ‘struct virtio_mmio_vq_info’ has no member named ‘queue_index’
> drivers/virtio/virtio_mmio.c: In function ‘vm_del_vq’:
> drivers/virtio/virtio_mmio.c:278:13: error: ‘struct virtio_mmio_vq_info’ has no member named ‘queue_index’
> make[2]: *** [drivers/virtio/virtio_mmio.o] Error 1
>
> It probably misses the following hunks:
>
> diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
> index f5432b6..12b6180 100644
> --- a/drivers/virtio/virtio_mmio.c
> +++ b/drivers/virtio/virtio_mmio.c
> @@ -222,11 +222,10 @@ static void vm_reset(struct virtio_device *vdev)
>   static void vm_notify(struct virtqueue *vq)
>   {
>          struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vq->vdev);
> -       struct virtio_mmio_vq_info *info = vq->priv;
>
>          /* We write the queue's selector into the notification register to
>           * signal the other end */
> -       writel(info->queue_index, vm_dev->base + VIRTIO_MMIO_QUEUE_NOTIFY);
> +       writel(virtqueue_get_queue_index(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_NOTIFY);
>   }
>
>   /* Notify all virtqueues on an interrupt. */
> @@ -275,7 +274,7 @@ static void vm_del_vq(struct virtqueue *vq)
>          vring_del_virtqueue(vq);
>
>          /* Select and deactivate the queue */
> -       writel(info->queue_index, vm_dev->base + VIRTIO_MMIO_QUEUE_SEL);
> +       writel(virtqueue_get_queue_index(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_SEL);
>          writel(0, vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
>
>          size = PAGE_ALIGN(vring_size(info->num, VIRTIO_MMIO_VRING_ALIGN));
>
Oops, I missed the virtio-mmio part, thanks for pointing this out.
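
For reference, a rough userspace sketch of the idea behind the helpers, not the patch itself. virtqueue_get_queue_index() is the name used in the hunks above; the setter name, the container_of() stand-in, and the struct layouts here are assumptions made for illustration.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Public handle that the transports (PCI, MMIO, ...) pass around. */
struct virtqueue {
	void *priv; /* transport-private data, no longer holds the index */
};

/* Ring-private wrapper; the queue index now lives here. */
struct vring_virtqueue {
	struct virtqueue vq;
	unsigned int queue_index;
};

#define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)

/* Setter name is assumed; the patch only says helpers are introduced. */
void virtqueue_set_queue_index(struct virtqueue *vq, unsigned int index)
{
	to_vvq(vq)->queue_index = index;
}

unsigned int virtqueue_get_queue_index(struct virtqueue *vq)
{
	return to_vvq(vq)->queue_index;
}
```

Each transport then asks the ring code for the index instead of keeping its own copy in its per-queue info struct.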
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


* Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6
From: Nicholas A. Bellinger @ 2012-07-06  3:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jens Axboe, Stefan Hajnoczi, kvm-devel, lf-virt, Anthony Liguori,
	target-devel, linux-scsi, Paolo Bonzini, Zhi Yong Wu,
	Christoph Hellwig
In-Reply-To: <20120705093113.GC29373@redhat.com>

On Thu, 2012-07-05 at 12:31 +0300, Michael S. Tsirkin wrote:
> On Wed, Jul 04, 2012 at 07:01:05PM -0700, Nicholas A. Bellinger wrote:
> > On Wed, 2012-07-04 at 18:05 +0300, Michael S. Tsirkin wrote:

<SNIP>

> > > I was talking about 4/6 first of all.
> > 
> > So yeah, this code is still considered RFC at this point for-3.6, but
> > I'd like to get this into target-pending/for-next in next week for more
> > feedback and start collecting signoffs for the necessary pieces that
> > effect existing vhost code.
> > 
> > By that time the cmwq conversion of tcm_vhost should be in place as
> > well..
> 
> I'll try to give some feedback but I think we do need
> to see the qemu patches - they weren't posted yet, were they?
> This driver has some userspace interface and once
> that is merged it has to be supported.
> So I think we need the buy-in from the qemu side at the principal level.
> 

<nod>

Stefan posted the QEMU vhost-scsi patches a few times, but I think it's
been a while since the last round of review.  For the recent
developments with tcm_vhost, I've been using Zhi's QEMU tree here:

https://github.com/wuzhy/qemu/tree/vhost-scsi

Other than a few printfs I added to help me understand how it works, no
functional changes have been made to work with target-pending/tcm_vhost.

Thank you,

--nab


* Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6
From: Nicholas A. Bellinger @ 2012-07-06  3:01 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Jens Axboe, linux-scsi, kvm-devel, Michael S. Tsirkin, lf-virt,
	Anthony Liguori, target-devel, ksummit-2012-discuss,
	Paolo Bonzini, Zhi Yong Wu, Christoph Hellwig, Stefan Hajnoczi
In-Reply-To: <4FF59F6B.2000101@us.ibm.com>

On Thu, 2012-07-05 at 09:06 -0500, Anthony Liguori wrote:
> On 07/05/2012 08:53 AM, Michael S. Tsirkin wrote:
> > On Thu, Jul 05, 2012 at 12:22:33PM +0200, Paolo Bonzini wrote:
> >> Il 05/07/2012 03:52, Nicholas A. Bellinger ha scritto:
> >>>
> >>> fio randrw workload | virtio-scsi-raw | virtio-scsi+tcm_vhost | bare-metal raw block
> >>> ------------------------------------------------------------------------------------
> >>> 25 Write / 75 Read  |      ~15K       |         ~45K          |         ~70K
> >>> 75 Write / 25 Read  |      ~20K       |         ~55K          |         ~60K
> >>
> >> This is impressive, but I think it's still not enough to justify the
> >> inclusion of tcm_vhost.
> 
> We have demonstrated better results at much higher IOP rates with virtio-blk in 
> userspace so while these results are nice, there's no reason to believe we can't 
> do this in userspace.
> 

So I'm pretty sure this discrepancy is attributable to the small block
random I/O bottleneck currently present for all Linux/SCSI core LLDs
regardless of physical or virtual storage fabric.

The SCSI-wide host-lock-less conversion that happened in the .38 code
back in 2010, and subsequently having LLDs like virtio-scsi convert to
run in host-lock-less mode, have helped to some extent..  But it's still
not enough..

Another example where we've been able to prove this bottleneck recently
is with the following target setup:

*) Intel Romley production machines with 128 GB of DDR-3 memory
*) 4x FusionIO ioDrive 2 (1.5 TB @ PCI-e Gen2 x2)
*) Mellanox PCI-express Gen3 HCA running at 56 Gb/sec
*) Infiniband SRP Target backported to RHEL 6.2 + latest OFED

In this setup using ib_srpt + IBLOCK w/ emulate_write_cache=1 +
iomemory_vsl export we end up avoiding SCSI core bottleneck on the
target machine, just as with the tcm_vhost example here for host kernel
side processing with vhost.

Using Linux IB SRP initiator + Windows Server 2008 R2 SCSI-miniport SRP
(OFED) Initiator connected to four ib_srpt LUNs, we've observed that
MSFT SCSI is currently outperforming RHEL 6.2 on the order of ~285K vs.
~215K with heavy random 4k WRITE iometer / fio tests.  Note this is with
an optimized queue_depth ib_srp client w/ noop I/O scheduling, but is
still lacking the host_lock-less patches on RHEL 6.2 OFED..

This bottleneck has been mentioned by various people (including myself)
on linux-scsi over the last 18 months, and I've proposed that it be
discussed at KS-2012 so we can start making some forward progress:

http://lists.linux-foundation.org/pipermail/ksummit-2012-discuss/2012-June/000098.html

> >> In my opinion, vhost-blk/vhost-scsi are mostly
> >> worthwhile as drivers for improvements to QEMU performance.  We want to
> >> add more fast paths to QEMU that let us move SCSI and virtio processing
> >> to separate threads, we have proof of concepts that this can be done,
> >> and we can use vhost-blk/vhost-scsi to find bottlenecks more effectively.
> >
> > A general rant below:
> >
> > OTOH if it works, and adds value, we really should consider including code.
> 
> Users want something that has lots of features and performs really, really well. 
>   They want everything.
> 
> Having one device type that is "fast" but has no features and another that is 
> "not fast" but has a lot of features forces the user to make a bad choice.  No 
> one wins in the end.
> 
> virtio-scsi is brand new.  It's not as if we've had any significant time to make 
> virtio-scsi-qemu faster.  In fact, tcm_vhost existed before virtio-scsi-qemu did 
> if I understand correctly.
> 

So based upon the data above, I'm going to make a prediction that MSFT
guests connected with SCSI miniport <-> tcm_vhost will outperform Linux
guests with virtio-scsi (w/ <= 3.5 host-lock-less) <-> tcm_vhost
connected to the same raw block flash iomemory_vsl backends.

Of course that depends upon how fast virtio-scsi drivers get written for
MSFT guests vs. us fixing the long-term performance bottleneck in our
SCSI subsystem.  ;)

(Ksummit-2012 discuss CC'ed for the latter)

> > To me, it does not make sense to reject code just because in theory
> > someone could write even better code.
> 
> There is no theory.  We have proof points with virtio-blk.
> 
> > Code walks. Time to market matters too.
> 
> But guest/user facing decisions cannot be easily unmade and making the wrong 
> technical choices because of premature concerns of "time to market" just result 
> in a long term mess.
> 
> There is no technical reason why tcm_vhost is going to be faster than doing it 
> in userspace.  We can demonstrate this with virtio-blk.  This isn't a 
> theoretical argument.
> 
> > Yes I realize more options increases support. But downstreams can make
> > their own decisions on whether to support some configurations:
> > add a configure option to disable it and that's enough.
> >
> >> In fact, virtio-scsi-qemu and virtio-scsi-vhost are effectively two
> >> completely different devices that happen to speak the same SCSI
> >> transport.  Not only virtio-scsi-vhost must be configured outside QEMU
> >
> > configuration outside QEMU is OK I think - real users use
> > management anyway. But maybe we can have helper scripts
> > like we have for tun?
> 
> Asking a user to write a helper script is pretty awful...
> 

It's easy for anyone with basic Python knowledge to use the rtslib
packages in the downstream distros to connect to tcm_vhost endpoint LUNs
right now.

All you need is the following vhost.spec, and tcm_vhost works out of the
box for rtslib and targetcli/rtsadmin without any modification to
existing userspace packages:

root@tifa:~# cat /var/target/fabric/vhost.spec 
# WARNING: This is a draft specfile supplied for testing only.

# The fabric module feature set
features = nexus

# Use naa WWNs.
wwn_type = naa

# Non-standard module naming scheme
kernel_module = tcm_vhost

# The configfs group
configfs_group = vhost


* Re: [PATCH 3/3] virtio-blk: Add bio-based IO path for virtio-blk
From: Asias He @ 2012-07-06  1:03 UTC (permalink / raw)
  To: Rusty Russell
  Cc: kvm, Michael S. Tsirkin, linux-kernel, virtualization,
	Sasha Levin, Christoph Hellwig
In-Reply-To: <87lij0w8np.fsf@rustcorp.com.au>

On 07/04/2012 10:40 AM, Rusty Russell wrote:
> On Tue, 03 Jul 2012 08:39:39 +0800, Asias He <asias@redhat.com> wrote:
>> On 07/02/2012 02:41 PM, Rusty Russell wrote:
>>> Sure, our guest merging might save us 100x as many exits as no merging.
>>> But since we're not doing many requests, does it matter?
>>
>> We can still have many requests with slow devices. The number of
>> requests depends on the workload in guest. E.g. 512 IO threads in guest
>> keeping doing IO.
>
> You can have many requests outstanding.  But if the device is slow, the
> rate of requests being serviced must be low.

Yes.

> Am I misunderstanding something?  I thought if you could have a high
> rate of requests, it's not a slow device.

Sure.

-- 
Asias


* Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6
From: Michael S. Tsirkin @ 2012-07-05 21:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jens Axboe, Anthony Liguori, linux-scsi, kvm-devel, lf-virt,
	Anthony Liguori, target-devel, Zhi Yong Wu, Christoph Hellwig,
	Stefan Hajnoczi
In-Reply-To: <4FF5A57F.2000504@redhat.com>

On Thu, Jul 05, 2012 at 04:32:31PM +0200, Paolo Bonzini wrote:
> Il 05/07/2012 15:53, Michael S. Tsirkin ha scritto:
> > On Thu, Jul 05, 2012 at 12:22:33PM +0200, Paolo Bonzini wrote:
> >> Il 05/07/2012 03:52, Nicholas A. Bellinger ha scritto:
> >>>
> >>> fio randrw workload | virtio-scsi-raw | virtio-scsi+tcm_vhost | bare-metal raw block
> >>> ------------------------------------------------------------------------------------
> >>> 25 Write / 75 Read  |      ~15K       |         ~45K          |         ~70K
> >>> 75 Write / 25 Read  |      ~20K       |         ~55K          |         ~60K
> >>
> >> This is impressive, but I think it's still not enough to justify the
> >> inclusion of tcm_vhost.  In my opinion, vhost-blk/vhost-scsi are mostly
> >> worthwhile as drivers for improvements to QEMU performance.  We want to
> >> add more fast paths to QEMU that let us move SCSI and virtio processing
> >> to separate threads, we have proof of concepts that this can be done,
> >> and we can use vhost-blk/vhost-scsi to find bottlenecks more effectively.
> > 
> > A general rant below:
> > 
> > OTOH if it works, and adds value, we really should consider including code.
> > To me, it does not make sense to reject code just because in theory
> > someone could write even better code.
> 
> It's not about writing better code.  It's about having two completely
> separate SCSI/block layers with completely different feature sets.

You mean qemu one versus kernel one? Both exist anyway :)

> > Code walks. Time to market matters too.
> > Yes I realize more options increases support. But downstreams can make
> > their own decisions on whether to support some configurations:
> > add a configure option to disable it and that's enough.
> > 
> >> In fact, virtio-scsi-qemu and virtio-scsi-vhost are effectively two
> >> completely different devices that happen to speak the same SCSI
> >> transport.  Not only virtio-scsi-vhost must be configured outside QEMU
> > 
> > configuration outside QEMU is OK I think - real users use
> > management anyway. But maybe we can have helper scripts
> > like we have for tun?
> 
> We could add hooks for vhost-scsi in the SCSI devices and let them
> configure themselves.  I'm not sure it is a good idea.

This is exactly what virtio-net does.

> >> and doesn't support -device;
> > 
> > This needs to be fixed I think.
> 
> To be clear, it supports -device for the virtio-scsi HBA itself; it
> doesn't support using -drive/-device to set up the disks hanging off it.

Fixable, isn't it?

> >> it (obviously) presents different
> >> inquiry/vpd/mode data than virtio-scsi-qemu,
> > 
> > Why is this obvious and can't be fixed? Userspace virtio-scsi
> > is pretty flexible - can't it supply matching inquiry/vpd/mode data
> > so that switching is transparent to the guest?
> 
> It cannot support the whole feature set anyway unless you want to port
> thousands of lines from the kernel to QEMU (well, perhaps we'll get
> there but it's far).  And dually, the in-kernel target of course does not
> support qcow2 and friends, though perhaps you could imagine some hack
> based on NBD.
> 
> Paolo

Exactly. Kernel also gains functionality all the time.

-- 
MST


* Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
From: Amos Kong @ 2012-07-05 20:07 UTC (permalink / raw)
  To: Sasha Levin
  Cc: krkumar2, habanero, kvm, mst, netdev, mashirle, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <1341492679.18786.18.camel@lappy>

On 07/05/2012 08:51 PM, Sasha Levin wrote:
> On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
>> @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
>>         if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>                 vi->has_cvq = true;
>>  


>> +       /* Use single tx/rx queue pair as default */
>> +       vi->num_queue_pairs = 1;
>> +       vi->total_queue_pairs = num_queue_pairs; 

vi->total_queue_pairs also should be set to 1

           vi->total_queue_pairs = 1;

> 
> The code is using this "default" even if the amount of queue pairs it
> wants was specified during initialization. This basically limits any
> device to use 1 pair when starting up.
> 


-- 
			Amos.


* Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
From: Amos Kong @ 2012-07-05 20:02 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, habanero, mashirle, kvm, mst, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <1341484194-8108-5-git-send-email-jasowang@redhat.com>

On 07/05/2012 06:29 PM, Jason Wang wrote:
> This patch converts virtio_net to a multi queue device. After negotiated
> VIRTIO_NET_F_MULTIQUEUE feature, the virtio device has many tx/rx queue pairs,
> and driver could read the number from config space.
> 
> The driver expects the number of rx/tx queue pairs is equal to the number of
> vcpus. To maximize the performance under this per-cpu rx/tx queue pairs, some
> optimization were introduced:
> 
> - Txq selection is based on the processor id in order to avoid contending a lock
>   whose owner may exits to host.
> - Since the txq/txq were per-cpu, affinity hint were set to the cpu that owns
>   the queue pairs.
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---

...

>  
>  static int virtnet_probe(struct virtio_device *vdev)
>  {
> -	int err;
> +	int i, err;
>  	struct net_device *dev;
>  	struct virtnet_info *vi;
> +	u16 num_queues, num_queue_pairs;
> +
> +	/* Find if host supports multiqueue virtio_net device */
> +	err = virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
> +				offsetof(struct virtio_net_config,
> +				num_queues), &num_queues);
> +
> +	/* We need atleast 2 queue's */


s/atleast/at least/


> +	if (err || num_queues < 2)
> +		num_queues = 2;
> +	if (num_queues > MAX_QUEUES * 2)
> +		num_queues = MAX_QUEUES;

                num_queues = MAX_QUEUES * 2;

MAX_QUEUES is the limit for RX queues or TX queues individually, so the
total should be clamped to MAX_QUEUES * 2.

> +
> +	num_queue_pairs = num_queues / 2;

...
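
To make the effect concrete, here is a small standalone sketch of the clamping with the suggested fix applied. The function name is hypothetical and the MAX_QUEUES value is only illustrative; the real definition is in the patch.

```c
/* Illustrative value only; the real MAX_QUEUES comes from the patch. */
#define MAX_QUEUES 256

/*
 * err != 0 models a failed or absent config-space read.
 * Returns the number of rx/tx queue pairs the driver would use.
 */
unsigned short sketch_queue_pairs(int err, unsigned short num_queues)
{
	/* We need at least 2 queues (one rx, one tx). */
	if (err || num_queues < 2)
		num_queues = 2;
	/* MAX_QUEUES bounds rx and tx separately, hence the factor of 2. */
	if (num_queues > MAX_QUEUES * 2)
		num_queues = MAX_QUEUES * 2;
	return num_queues / 2;
}
```

With the original clamp (num_queues = MAX_QUEUES), an oversized value would end up as MAX_QUEUES / 2 pairs, i.e. half of what the limit allows.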

-- 
			Amos.


* Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6
From: Bart Van Assche @ 2012-07-05 19:57 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: Jens Axboe, Anthony Liguori, kvm-devel, linux-scsi,
	Michael S. Tsirkin, lf-virt, Anthony Liguori, target-devel,
	Paolo Bonzini, Zhi Yong Wu, Christoph Hellwig, Stefan Hajnoczi
In-Reply-To: <4FF5D494.2090707@acm.org>

On 07/05/12 17:53, Bart Van Assche wrote:

> On 07/05/12 01:52, Nicholas A. Bellinger wrote:
>> fio randrw workload | virtio-scsi-raw | virtio-scsi+tcm_vhost | bare-metal raw block
>> ------------------------------------------------------------------------------------
>> 25 Write / 75 Read  |      ~15K       |         ~45K          |         ~70K
>> 75 Write / 25 Read  |      ~20K       |         ~55K          |         ~60K
> 
> These numbers are interesting. To me these numbers mean that there is a
> huge performance bottleneck in the virtio-scsi-raw storage path. Why is
> the virtio-scsi-raw bandwidth only one third of the bare-metal raw block
> bandwidth ?


(replying to my own e-mail)

Or maybe the above numbers mean that in the virtio-scsi-raw test I/O was
serialized (I/O depth 1) while the other two tests use a large I/O depth
(64) ? It can't be a coincidence that the virtio-scsi-raw results are
close to the bare-metal results for I/O depth 1.
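
As a rough way to reason about this, Little's law says the sustained request rate is bounded by the number of requests in flight divided by the mean per-request latency. A minimal sketch, with a hypothetical function name and no measured latencies behind it:

```c
/*
 * Little's law sketch: with `depth` requests kept in flight and a mean
 * per-request latency of `latency_us` microseconds, the sustained rate
 * is at most depth / latency. This is an upper bound; a real device
 * stops scaling once it saturates.
 */
double iops_at_depth(unsigned int depth, double latency_us)
{
	return (double)depth * 1e6 / latency_us;
}
```

For example, a purely hypothetical 66 us round trip would cap a depth-1 workload at roughly 15K IOPS no matter how fast the backend is, which is why the I/O depth used in each column matters.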

Another question: which functionality does tcm_vhost provide that is not
yet provided by the SCSI emulation code in qemu + tcm_loop ?

Bart.


* Re: [PATCH 4/6] tcm_vhost: Initial merge for vhost level target fabric driver
From: Bart Van Assche @ 2012-07-05 17:59 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: Jens Axboe, Stefan Hajnoczi, kvm-devel, Michael S. Tsirkin,
	Zhi Yong Wu, Anthony Liguori, target-devel, linux-scsi,
	Paolo Bonzini, lf-virt, Christoph Hellwig
In-Reply-To: <4FF5D31F.90404@acm.org>

On 07/05/12 17:47, Bart Van Assche wrote:

> On 07/04/12 04:24, Nicholas A. Bellinger wrote:
>> +/* Fill in status and signal that we are done processing this command
>> + *
>> + * This is scheduled in the vhost work queue so we are called with the owner
>> + * process mm and can access the vring.
>> + */
>> +static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
>> +{
> 
> As far as I can see vhost_scsi_complete_cmd_work() runs in the context
> of a work queue kernel thread and hence doesn't have an mm context. Did
> I misunderstand something?


Please ignore the above - I've found the answer in vhost_dev_ioctl() and
vhost_dev_set_owner().

Bart.


* Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6
From: Bart Van Assche @ 2012-07-05 17:53 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: Jens Axboe, Anthony Liguori, kvm-devel, linux-scsi,
	Michael S. Tsirkin, lf-virt, Anthony Liguori, target-devel,
	Paolo Bonzini, Zhi Yong Wu, Christoph Hellwig, Stefan Hajnoczi
In-Reply-To: <1341453135.23954.214.camel@haakon2.linux-iscsi.org>

On 07/05/12 01:52, Nicholas A. Bellinger wrote:

> fio randrw workload | virtio-scsi-raw | virtio-scsi+tcm_vhost | bare-metal raw block
> ------------------------------------------------------------------------------------
> 25 Write / 75 Read  |      ~15K       |         ~45K          |         ~70K
> 75 Write / 25 Read  |      ~20K       |         ~55K          |         ~60K


These numbers are interesting. To me these numbers mean that there is a
huge performance bottleneck in the virtio-scsi-raw storage path. Why is
the virtio-scsi-raw bandwidth only one third of the bare-metal raw block
bandwidth ?

Bart.


* Re: [PATCH 4/6] tcm_vhost: Initial merge for vhost level target fabric driver
From: Bart Van Assche @ 2012-07-05 17:47 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: Jens Axboe, Stefan Hajnoczi, kvm-devel, Michael S. Tsirkin,
	Zhi Yong Wu, Anthony Liguori, target-devel, linux-scsi,
	Paolo Bonzini, lf-virt, Christoph Hellwig
In-Reply-To: <1341375846-27882-5-git-send-email-nab@linux-iscsi.org>

On 07/04/12 04:24, Nicholas A. Bellinger wrote:

> +/* Fill in status and signal that we are done processing this command
> + *
> + * This is scheduled in the vhost work queue so we are called with the owner
> + * process mm and can access the vring.
> + */
> +static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
> +{


As far as I can see vhost_scsi_complete_cmd_work() runs in the context
of a work queue kernel thread and hence doesn't have an mm context. Did
I misunderstand something?

Bart.


* Re: [net-next RFC V5 0/5] Multiqueue virtio-net
From: Rick Jones @ 2012-07-05 17:45 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, habanero, mashirle, kvm, mst, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <1341484194-8108-1-git-send-email-jasowang@redhat.com>

On 07/05/2012 03:29 AM, Jason Wang wrote:

>
> Test result:
>
> 1) 1 vm 2 vcpu 1q vs 2q, 1 - 1q, 2 - 2q, no pinning
>
> - Guest to External Host TCP STREAM
> sessions size throughput1 throughput2   norm1 norm2
> 1 64 650.55 655.61 100% 24.88 24.86 99%
> 2 64 1446.81 1309.44 90% 30.49 27.16 89%
> 4 64 1430.52 1305.59 91% 30.78 26.80 87%
> 8 64 1450.89 1270.82 87% 30.83 25.95 84%

Was the -D test-specific option used to set TCP_NODELAY?  I'm guessing 
from your description of how packet sizes were smaller with multiqueue 
and your need to hack tcp_write_xmit() it wasn't but since we don't have 
the specific netperf command lines (hint hint :) I wanted to make certain.

Instead of calling them throughput1 and throughput2, it might be more 
clear in future to identify them as singlequeue and multiqueue.

Also, how are you combining the concurrent netperf results?  Are you 
taking sums of what netperf reports, or are you gathering statistics 
outside of netperf?

> - TCP RR
> sessions size throughput1 throughput2   norm1 norm2
> 50 1 54695.41 84164.98 153% 1957.33 1901.31 97%

A single instance TCP_RR test would help confirm/refute any non-trivial 
change in (effective) path length between the two cases.
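
For anyone re-deriving the percentage columns in the quoted tables, a small sketch follows. It assumes each percentage is the multiqueue figure divided by the singlequeue figure, truncated to a whole number; that matches the quoted rows but is an inference, and the function name is hypothetical.

```c
/*
 * Multiqueue throughput as a whole percentage of singlequeue
 * throughput, truncated toward zero (assumed convention).
 */
int relative_percent(double singlequeue, double multiqueue)
{
	return (int)(multiqueue / singlequeue * 100.0);
}
```

Fed the quoted 64-byte TCP STREAM row for 2 sessions (1446.81 vs 1309.44) this gives 90, and the TCP RR row (54695.41 vs 84164.98) gives 153.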

happy benchmarking,

rick jones


* Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6
From: Michael S. Tsirkin @ 2012-07-05 17:26 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jens Axboe, Anthony Liguori, linux-scsi, kvm-devel, lf-virt,
	Anthony Liguori, target-devel, Zhi Yong Wu, Christoph Hellwig,
	Stefan Hajnoczi
In-Reply-To: <4FF5A90F.5050902@redhat.com>

On Thu, Jul 05, 2012 at 04:47:43PM +0200, Paolo Bonzini wrote:
> Il 05/07/2012 16:40, Michael S. Tsirkin ha scritto:
> >> virtio-scsi is brand new.  It's not as if we've had any significant
> >> time to make virtio-scsi-qemu faster.  In fact, tcm_vhost existed
> >> before virtio-scsi-qemu did if I understand correctly.
> 
> Yes.
> 
> > Can't same can be said about virtio scsi - it seems to be
> > slower so we force a bad choice between blk and scsi at the user?
> 
> virtio-scsi supports multiple devices per PCI slot (or even function),
> can talk to tapes, has better passthrough support for disks, and does a
> bunch of other things that virtio-blk by design doesn't do.  This
> applies to both tcm_vhost and virtio-scsi-qemu.
> 
> So far, all that virtio-scsi vs. virtio-blk benchmarks say is that more
> benchmarking is needed.  Some people see it faster, some people see it
> slower.  In some sense, it's consistent with the expectation that the
> two should roughly be the same. :)

Anyway, all I was saying is that new technology often lacks some features
of the old one. We are not forcing the new, inferior one on anyone, so we
can let it mature in tree.

> >> But guest/user facing decisions cannot be easily unmade and making
> >> the wrong technical choices because of premature concerns of "time
> >> to market" just result in a long term mess.
> >>
> >> There is no technical reason why tcm_vhost is going to be faster
> >> than doing it in userspace.
> > 
> > But doing what in userspace exactly?
> 
> Processing virtqueues in separate threads, switching the block and SCSI
> layer to fine-grained locking, adding some more fast paths.
> 
> >> Basically, the issue is that the kernel has more complete SCSI
> >> emulation that QEMU does right now.
> >>
> >> There are lots of ways to try to solve this--like try to reuse the
> >> kernel code in userspace or just improving the userspace code.  If
> >> we were able to make the two paths identical, then I strongly
> >> suspect there'd be no point in having tcm_vhost anyway.
> > 
> > However, a question we should ask ourselves is whether this will happen
> > in practice, and when.
> 
> It's already happening, but it takes a substantial amount of preparatory
> work before you can actually see results.
> 
> Paolo


