Linux virtualization list

Linux virtualization list
 help / color / mirror / Atom feed

* [RFC-v2 0/4] tcm_vhost+cmwq fabric driver code for-3.6
From: Nicholas A. Bellinger @ 2012-07-11 21:15 UTC (permalink / raw)
  To: target-devel
  Cc: Jens Axboe, Stefan Hajnoczi, kvm-devel, Michael S. Tsirkin,
	Zhi Yong Wu, Anthony Liguori, linux-scsi, Paolo Bonzini, lf-virt,
	Christoph Hellwig

From: Nicholas Bellinger <nab@linux-iscsi.org>

Hi folks,

The following is a RFC-v2 series of tcm_vhost target fabric driver code
currently in-flight for-3.6 mainline code.

After last week's developments along with the help of some new folks, the
changelog v1 -> v2 so far looks like:

*) Fix drivers/vhost/test.c to use VHOST_NET_FEATURES in patch #1 (Asias He)
*) Fix tv_cmd completion -> release SGL memory leak (nab)
*) Fix sparse warnings for static variable usage (Fengguang Wu)
*) Fix sparse warnings for min() typing + printk format specs (Fengguang Wu)
*) Convert to cmwq submission for I/O dispatch (nab + hch)

Also following Paolo's request, a patch for hw/virtio-scsi.c that sets
scsi_host->max_target=0 that removes the need for virtio-scsi LLD to hardcode
VirtIOSCSIConfig->max_id=1 in order to function with tcm_vhost.

Note this series has been pushed into target-pending.git/for-next-merge, and
should be getting picked up for tomorrow's linux-next build.

Please let us know if you have any concerns and/or additional review feedback.

Thank you!

Nicholas Bellinger (2):
  vhost: Add vhost_scsi specific defines
  tcm_vhost: Initial merge for vhost level target fabric driver

Stefan Hajnoczi (2):
  vhost: Separate vhost-net features from vhost features
  vhost: make vhost work queue visible

 drivers/vhost/Kconfig     |    6 +
 drivers/vhost/Makefile    |    1 +
 drivers/vhost/net.c       |    4 +-
 drivers/vhost/tcm_vhost.c | 1609 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/vhost/tcm_vhost.h |   74 ++
 drivers/vhost/test.c      |    4 +-
 drivers/vhost/vhost.c     |    5 +-
 drivers/vhost/vhost.h     |    6 +-
 include/linux/vhost.h     |    9 +
 9 files changed, 1710 insertions(+), 8 deletions(-)
 create mode 100644 drivers/vhost/tcm_vhost.c
 create mode 100644 drivers/vhost/tcm_vhost.h

-- 
1.7.2.5

^ permalink raw reply

* [PATCH] hw/virtio-scsi: Set max_target=0 during vhost-scsi operation
From: Nicholas A. Bellinger @ 2012-07-11 20:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Stefan Hajnoczi, kvm-devel, Michael S. Tsirkin, Zhi Yong Wu,
	target-devel, Paolo Bonzini, lf-virt

From: Nicholas Bellinger <nab@linux-iscsi.org>

This QEMU patch sets VirtIOSCSIConfig->max_target=0 for vhost-scsi operation
to restrict virtio-scsi LLD guest scanning to max_id=0 (a single target ID
instance) when connected to individual tcm_vhost endpoints as requested by
Paolo.

This ensures that virtio-scsi LLD only attempts to scan target IDs up to
VIRTIO_SCSI_MAX_TARGET when connected via virtio-scsi-raw.

It's currently cut against Zhi's qemu vhost-scsi tree here:

   https://github.com/wuzhy/qemu/tree/vhost-scsi

Cc: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Cc: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
---
 hw/virtio-scsi.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/hw/virtio-scsi.c b/hw/virtio-scsi.c
index e38cdd0..71276b6 100644
--- a/hw/virtio-scsi.c
+++ b/hw/virtio-scsi.c
@@ -523,7 +523,11 @@ static void virtio_scsi_get_config(VirtIODevice *vdev,
     stl_raw(&scsiconf->sense_size, s->sense_size);
     stl_raw(&scsiconf->cdb_size, s->cdb_size);
     stl_raw(&scsiconf->max_channel, VIRTIO_SCSI_MAX_CHANNEL);
-    stl_raw(&scsiconf->max_target, VIRTIO_SCSI_MAX_TARGET);
+    if (s->vhost_scsi) {
+        stl_raw(&scsiconf->max_target, 0);
+    } else {
+        stl_raw(&scsiconf->max_target, VIRTIO_SCSI_MAX_TARGET);
+    }
     stl_raw(&scsiconf->max_lun, VIRTIO_SCSI_MAX_LUN);
 }
 
-- 
1.7.2.5

^ permalink raw reply related

* Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6
From: Nicholas A. Bellinger @ 2012-07-10  0:29 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Jens Axboe, linux-scsi, kvm-devel, Michael S. Tsirkin, lf-virt,
	Anthony Liguori, target-devel, Paolo Bonzini, Zhi Yong Wu,
	Christoph Hellwig, Stefan Hajnoczi
In-Reply-To: <1341453135.23954.214.camel@haakon2.linux-iscsi.org>

Hi folks,

On Wed, 2012-07-04 at 18:52 -0700, Nicholas A. Bellinger wrote:
> 
> To give an idea of how things are looking on the performance side, here
> some initial numbers for small block (4k) mixed random IOPs using the
> following fio test setup:

<SNIP>

> fio randrw workload | virtio-scsi-raw | virtio-scsi+tcm_vhost | bare-metal raw block
> ------------------------------------------------------------------------------------
> 25 Write / 75 Read  |      ~15K       |         ~45K          |         ~70K
> 75 Write / 25 Read  |      ~20K       |         ~55K          |         ~60K
> 
> 

After checking the original benchmarks here again, I realized that for
virtio-scsi+tcm_vhost the results where actually switched..

So this should have been: heavier READ case (25 / 75) == 55K, and
heavier WRITE case (75 / 25) == 45K.

> In the first case, virtio-scsi+tcm_vhost is out performing by 3x
> compared to virtio-scsi-raw using QEMU SCSI emulation with the same raw
> flash backend device.  For the second case heavier WRITE case, tcm_vhost
> is nearing full bare-metal utilization (~55K vs. ~60K).
> 
> Also converting tcm_vhost to use proper cmwq process context I/O
> submission will help to get even closer to bare metal speeds for both
> work-loads.
> 

Here are initial follow-up virtio-scsi randrw 4k benchmarks with
tcm_vhost recently converted to run backend I/O dispatch via modern cmwq
primitives (kworkerd).

fio randrw 4k workload | virtio-scsi+tcm_vhost+cmwq
---------------------------------------------------
  25 Write / 75 Read   |          ~60K
  75 Write / 25 Read   |	  ~45K

So aside from the minor performance improvement for the 25 / 75
workload, the other main improvement is lower CPU usage using the
iomemory_vsl backends.  This is attributed to cmwq providing process
context on the same core as the vhost thread pulling items off vq, which
ends up being on the order of 1/3 less host CPU usage (for both
workloads) primarly from positive cache effects.

This patch is now available in target-pending/tcm_vhost, and I'll be
respinning the initial merge series into for-next-merge over the next
days + another round of list review.

Please let us know if you have any concerns.

Thanks!

--nab

^ permalink raw reply

* Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
From: Ben Hutchings @ 2012-07-09 20:13 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, habanero, mashirle, kvm, mst, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <1341484194-8108-6-git-send-email-jasowang@redhat.com>

On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
> This patch let the virtio_net driver can negotiate the number of queues it
> wishes to use through control virtqueue and export an ethtool interface to let
> use tweak it.
> 
> As current multiqueue virtio-net implementation has optimizations on per-cpu
> virtuqueues, so only two modes were support:
> 
> - single queue pair mode
> - multiple queue paris mode, the number of queues matches the number of vcpus
> 
> The single queue mode were used by default currently due to regression of
> multiqueue mode in some test (especially in stream test).
> 
> Since virtio core does not support paritially deleting virtqueues, so during
> mode switching the whole virtqueue were deleted and the driver would re-create
> the virtqueues it would used.
> 
> btw. The queue number negotiating were defered to .ndo_open(), this is because
> only after feature negotitaion could we send the command to control virtqueue
> (as it may also use event index).
[...]
> +static int virtnet_set_channels(struct net_device *dev,
> +				struct ethtool_channels *channels)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	u16 queues = channels->rx_count;
> +	unsigned status = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
> +
> +	if (channels->rx_count != channels->tx_count)
> +		return -EINVAL;
[...]
> +static void virtnet_get_channels(struct net_device *dev,
> +				 struct ethtool_channels *channels)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +
> +	channels->max_rx = vi->total_queue_pairs;
> +	channels->max_tx = vi->total_queue_pairs;
> +	channels->max_other = 0;
> +	channels->max_combined = 0;
> +	channels->rx_count = vi->num_queue_pairs;
> +	channels->tx_count = vi->num_queue_pairs;
> +	channels->other_count = 0;
> +	channels->combined_count = 0;
> +}
[...]

It looks like the queue-pairs should be treated as 'combined channels',
not separate RX and TX channels.  Also you don't need to clear the other
members; you can assume that the ethtool core will zero-initialise
structures for 'get' operations.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [net-next RFC V5 0/5] Multiqueue virtio-net
From: Rick Jones @ 2012-07-09 16:46 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, habanero, mashirle, kvm, mst, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <4FFA4EAD.7000707@redhat.com>

On 07/08/2012 08:23 PM, Jason Wang wrote:
> On 07/07/2012 12:23 AM, Rick Jones wrote:
>> On 07/06/2012 12:42 AM, Jason Wang wrote:
>> Which mechanism to address skew error?  The netperf manual describes
>> more than one:
>
> This mechanism is missed in my test, I would add them to my test scripts.
>>
>> http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance
>>
>>
>> Personally, my preference these days is to use the "demo mode" method
>> of aggregate results as it can be rather faster than (ab)using the
>> confidence intervals mechanism, which I suspect may not really scale
>> all that well to large numbers of concurrent netperfs.
>
> During my test, the confidence interval would even hard to achieved in
> RR test when I pin vhost/vcpus in the processors, so I didn't use it.

When running aggregate netperfs, *something* has to be done to address 
the prospect of skew error.  Otherwise the results are suspect.

happy benchmarking,

rick jones

^ permalink raw reply

* [PATCH] tcm_vhost: Convert to cmwq submission for I/O dispatch
From: Nicholas A. Bellinger @ 2012-07-09  6:10 UTC (permalink / raw)
  To: target-devel
  Cc: Jens Axboe, Stefan Hajnoczi, kvm-devel, Michael S. Tsirkin,
	Zhi Yong Wu, Anthony Liguori, linux-scsi, Paolo Bonzini, lf-virt,
	Christoph Hellwig

From: Nicholas Bellinger <nab@linux-iscsi.org>

This patch converts tcm_vhost to use modern concurrency managed workqueues to
offload setup of tcm_vhost_cmd descriptors to a kworker CPU thread that is
running on the same core as the vhost thread pulling elements off the virtqueue
from within vhost_scsi_handle_vq().

This includes the addition of tcm_vhost_submission_work() to perform the
LUN lookup, target_setup_cmd_from_cdb(), transport_generic_map_mem_to_cmd()
and transport_handle_cdb_direct() calls for setup -> memory map -> backend I/O
execution.

Also, now remove the legacy tcm_vhost_new_cmd_map() code originally used to
perform memory map -> backend I/O execution from transport_processing_thread()
process context.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Cc: Zhi Yong Wu <wuzhy@cn.ibm.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
---
 drivers/vhost/tcm_vhost.c |  168 ++++++++++++++++++++++++---------------------
 drivers/vhost/tcm_vhost.h |    6 ++-
 2 files changed, 96 insertions(+), 78 deletions(-)

diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
index 81b77f3..da0b8ac 100644
--- a/drivers/vhost/tcm_vhost.c
+++ b/drivers/vhost/tcm_vhost.c
@@ -67,6 +67,8 @@ struct vhost_scsi {
 /* Local pointer to allocated TCM configfs fabric module */
 static struct target_fabric_configfs *tcm_vhost_fabric_configfs;
 
+static struct workqueue_struct *tcm_vhost_workqueue;
+
 /* Global spinlock to protect tcm_vhost TPG list for vhost IOCTL access */
 static DEFINE_MUTEX(tcm_vhost_mutex);
 static LIST_HEAD(tcm_vhost_list);
@@ -247,55 +249,6 @@ static u32 tcm_vhost_tpg_get_inst_index(struct se_portal_group *se_tpg)
 	return 1;
 }
 
-/*
- * Called by struct target_core_fabric_ops->new_cmd_map()
- *
- * Always called in process context.  A non zero return value
- * here will signal to handle an exception based on the return code.
- */
-static int tcm_vhost_new_cmd_map(struct se_cmd *se_cmd)
-{
-	struct tcm_vhost_cmd *tv_cmd = container_of(se_cmd,
-				struct tcm_vhost_cmd, tvc_se_cmd);
-	struct scatterlist *sg_ptr, *sg_bidi_ptr = NULL;
-	u32 sg_no_bidi = 0;
-	int ret;
-	/*
-	 * Allocate the necessary tasks to complete the received CDB+data
-	 */
-	ret = target_setup_cmd_from_cdb(se_cmd, tv_cmd->tvc_cdb);
-	if (ret != 0)
-		return ret;
-	/*
-	 * Setup the struct scatterlist memory from the received
-	 * struct tcm_vhost_cmd..
-	 */
-	if (tv_cmd->tvc_sgl_count) {
-		sg_ptr = tv_cmd->tvc_sgl;
-		/*
-		 * For BIDI commands, pass in the extra READ buffer
-		 * to transport_generic_map_mem_to_cmd() below..
-		 */
-/* FIXME: Fix BIDI operation in tcm_vhost_new_cmd_map() */
-#if 0
-		if (se_cmd->se_cmd_flags & SCF_BIDI) {
-			mem_bidi_ptr = NULL;
-			sg_no_bidi = 0;
-		}
-#endif
-	} else {
-		/*
-		 * Used for DMA_NONE
-		 */
-		sg_ptr = NULL;
-	}
-
-	/* Tell the core about our preallocated memory */
-	return transport_generic_map_mem_to_cmd(se_cmd, sg_ptr,
-				tv_cmd->tvc_sgl_count, sg_bidi_ptr,
-				sg_no_bidi);
-}
-
 static void tcm_vhost_release_cmd(struct se_cmd *se_cmd)
 {
 	return;
@@ -509,12 +462,6 @@ static struct tcm_vhost_cmd *vhost_scsi_allocate_cmd(
 	if (bidi)
 		se_cmd->se_cmd_flags |= SCF_BIDI;
 #endif
-	/*
-	 * From here the rest of the se_cmd will be setup and dispatched
-	 * via tcm_vhost_new_cmd_map() from TCM backend thread context
-	 * after transport_generic_handle_cdb_map() has been called from
-	 * vhost_scsi_handle_vq() below..
-	 */
 	return tv_cmd;
 }
 
@@ -611,6 +558,71 @@ static int vhost_scsi_map_iov_to_sgl(struct tcm_vhost_cmd *tv_cmd,
 	return 0;
 }
 
+static void tcm_vhost_submission_work(struct work_struct *work)
+{
+	struct tcm_vhost_cmd *tv_cmd =
+		container_of(work, struct tcm_vhost_cmd, work);
+	struct se_cmd *se_cmd = &tv_cmd->tvc_se_cmd;
+	struct scatterlist *sg_ptr, *sg_bidi_ptr = NULL;
+	int rc, sg_no_bidi = 0;
+	/*
+	 * Locate the struct se_lun pointer based on v_req->lun, and
+	 * attach it to struct se_cmd
+	 */
+	rc = transport_lookup_cmd_lun(&tv_cmd->tvc_se_cmd, tv_cmd->tvc_lun);
+	if (rc < 0) {
+		pr_err("Failed to look up lun: %d\n", tv_cmd->tvc_lun);
+		transport_send_check_condition_and_sense(&tv_cmd->tvc_se_cmd,
+			tv_cmd->tvc_se_cmd.scsi_sense_reason, 0);
+		transport_generic_free_cmd(se_cmd, 0);
+		return;
+	}
+
+	rc = target_setup_cmd_from_cdb(se_cmd, tv_cmd->tvc_cdb);
+	if (rc == -ENOMEM) {
+		transport_send_check_condition_and_sense(se_cmd,
+				TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE, 0);
+		transport_generic_free_cmd(se_cmd, 0);
+		return;
+	} else if (rc < 0) {
+		if (se_cmd->se_cmd_flags & SCF_SCSI_RESERVATION_CONFLICT)
+			tcm_vhost_queue_status(se_cmd);
+		else
+			transport_send_check_condition_and_sense(se_cmd,
+					se_cmd->scsi_sense_reason, 0);
+		transport_generic_free_cmd(se_cmd, 0);
+		return;
+	}
+
+	if (tv_cmd->tvc_sgl_count) {
+		sg_ptr = tv_cmd->tvc_sgl;
+		/*
+		 * For BIDI commands, pass in the extra READ buffer
+		 * to transport_generic_map_mem_to_cmd() below..
+		 */
+/* FIXME: Fix BIDI operation in tcm_vhost_submission_work() */
+#if 0
+		if (se_cmd->se_cmd_flags & SCF_BIDI) {
+			sg_bidi_ptr = NULL;
+			sg_no_bidi = 0;
+		}
+#endif
+	} else {
+		sg_ptr = NULL;
+	}
+
+	rc = transport_generic_map_mem_to_cmd(se_cmd, sg_ptr,
+				tv_cmd->tvc_sgl_count, sg_bidi_ptr,
+				sg_no_bidi);
+	if (rc < 0) {
+		transport_send_check_condition_and_sense(se_cmd,
+				se_cmd->scsi_sense_reason, 0);
+		transport_generic_free_cmd(se_cmd, 0);
+		return;
+	}
+	transport_handle_cdb_direct(se_cmd);
+}
+
 static void vhost_scsi_handle_vq(struct vhost_scsi *vs)
 {
 	struct vhost_virtqueue *vq = &vs->vqs[2];
@@ -619,7 +631,7 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs)
 	struct tcm_vhost_cmd *tv_cmd;
 	u32 exp_data_len, data_first, data_num, data_direction;
 	unsigned out, in, i;
-	int head, ret, lun;
+	int head, ret;
 
 	/* Must use ioctl VHOST_SCSI_SET_ENDPOINT */
 	tv_tpg = vs->vs_tpg;
@@ -732,10 +744,10 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs)
 				scsi_command_size(tv_cmd->tvc_cdb), TCM_VHOST_MAX_CDB_SIZE);
 			break; /* TODO */
 		}
-		lun = ((v_req.lun[2] << 8) | v_req.lun[3]) & 0x3FFF;
+		tv_cmd->tvc_lun = ((v_req.lun[2] << 8) | v_req.lun[3]) & 0x3FFF;
 
 		pr_debug("vhost_scsi got command opcode: %#02x, lun: %d\n",
-			tv_cmd->tvc_cdb[0], lun);
+			tv_cmd->tvc_cdb[0], tv_cmd->tvc_lun);
 
 		if (data_direction != DMA_NONE) {
 			ret = vhost_scsi_map_iov_to_sgl(tv_cmd, &vq->iov[data_first],
@@ -753,22 +765,13 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs)
 		 */
 		tv_cmd->tvc_vq_desc = head;
 		/*
-		 * Locate the struct se_lun pointer based on v_req->lun, and
-		 * attach it to struct se_cmd
-		 */
-		if (transport_lookup_cmd_lun(&tv_cmd->tvc_se_cmd, lun) < 0) {
-			pr_err("Failed to look up lun: %d\n", lun);
-			/* NON_EXISTENT_LUN */
-			transport_send_check_condition_and_sense(&tv_cmd->tvc_se_cmd,
-					tv_cmd->tvc_se_cmd.scsi_sense_reason, 0);
-			continue;
-		}
-		/*
-		 * Now queue up the newly allocated se_cmd to be processed
-		 * within TCM thread context to finish the setup and dispatched
-		 * into a TCM backend struct se_device.
+		 * Dispatch tv_cmd descriptor for cmwq execution in process
+		 * context provided by tcm_vhost_workqueue.  This also ensures
+		 * tv_cmd is executed on the same kworker CPU as this vhost
+		 * thread to gain positive L2 cache locality effects..
 		 */
-		transport_generic_handle_cdb_map(&tv_cmd->tvc_se_cmd);
+		INIT_WORK(&tv_cmd->work, tcm_vhost_submission_work);
+		queue_work(tcm_vhost_workqueue, &tv_cmd->work);
 	}
 
 	mutex_unlock(&vq->mutex);
@@ -1478,7 +1481,6 @@ static struct target_core_fabric_ops tcm_vhost_ops = {
 	.tpg_alloc_fabric_acl		= tcm_vhost_alloc_fabric_acl,
 	.tpg_release_fabric_acl		= tcm_vhost_release_fabric_acl,
 	.tpg_get_inst_index		= tcm_vhost_tpg_get_inst_index,
-	.new_cmd_map			= tcm_vhost_new_cmd_map,
 	.release_cmd			= tcm_vhost_release_cmd,
 	.shutdown_session		= tcm_vhost_shutdown_session,
 	.close_session			= tcm_vhost_close_session,
@@ -1570,23 +1572,35 @@ static void tcm_vhost_deregister_configfs(void)
 
 static int __init tcm_vhost_init(void)
 {
-	int ret;
+	int ret = -ENOMEM;
+
+	tcm_vhost_workqueue = alloc_workqueue("tcm_vhost", 0, 0);
+	if (!tcm_vhost_workqueue)
+		goto out;
 
 	ret = vhost_scsi_register();
 	if (ret < 0)
-		return ret;
+		goto out_destroy_workqueue;
 
 	ret = tcm_vhost_register_configfs();
 	if (ret < 0)
-		return ret;
+		goto out_vhost_scsi_deregister;
 
 	return 0;
+
+out_vhost_scsi_deregister:
+	vhost_scsi_deregister();
+out_destroy_workqueue:
+	destroy_workqueue(tcm_vhost_workqueue);
+out:
+	return ret;
 };
 
 static void tcm_vhost_exit(void)
 {
 	tcm_vhost_deregister_configfs();
 	vhost_scsi_deregister();
+	destroy_workqueue(tcm_vhost_workqueue);
 };
 
 MODULE_DESCRIPTION("TCM_VHOST series fabric driver");
diff --git a/drivers/vhost/tcm_vhost.h b/drivers/vhost/tcm_vhost.h
index 0e8951b..9d6cace 100644
--- a/drivers/vhost/tcm_vhost.h
+++ b/drivers/vhost/tcm_vhost.h
@@ -9,14 +9,18 @@ struct tcm_vhost_cmd {
 	u64 tvc_tag;
 	/* The number of scatterlists associated with this cmd */
 	u32 tvc_sgl_count;
+	/* Saved unpacked SCSI LUN for tcm_vhost_submission_work() */
+	u32 tvc_lun;
 	/* Pointer to the SGL formatted memory from virtio-scsi */
 	struct scatterlist *tvc_sgl;
 	/* Pointer to response */
 	struct virtio_scsi_cmd_resp __user *tvc_resp;
 	/* Pointer to vhost_scsi for our device */
 	struct vhost_scsi *tvc_vhost;
-	 /* The TCM I/O descriptor that is accessed via container_of() */
+	/* The TCM I/O descriptor that is accessed via container_of() */
 	struct se_cmd tvc_se_cmd;
+	/* work item used for cmwq dispatch to tcm_vhost_submission_work() */
+	struct work_struct work;
 	/* Copy of the incoming SCSI command descriptor block (CDB) */
 	unsigned char tvc_cdb[TCM_VHOST_MAX_CDB_SIZE];
 	/* Sense buffer that will be mapped into outgoing status */
-- 
1.7.2.5

^ permalink raw reply related

* Re: [net-next RFC V5 0/5] Multiqueue virtio-net
From: Jason Wang @ 2012-07-09  5:35 UTC (permalink / raw)
  To: Ronen Hod
  Cc: krkumar2, habanero, mashirle, kvm, mst, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <4FF9429A.8020508@redhat.com>

On 07/08/2012 04:19 PM, Ronen Hod wrote:
> On 07/05/2012 01:29 PM, Jason Wang wrote:
>> Hello All:
>>
>> This series is an update version of multiqueue virtio-net driver 
>> based on
>> Krishna Kumar's work to let virtio-net use multiple rx/tx queues to 
>> do the
>> packets reception and transmission. Please review and comments.
>>
>> Test Environment:
>> - Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 8 cores 2 numa nodes
>> - Two directed connected 82599
>>
>> Test Summary:
>>
>> - Highlights: huge improvements on TCP_RR test
>
> Hi Jason,
>
> It might be that the good TCP_RR results are due to the large number 
> of sessions (50-250). Can you test it also with small number of sessions?

Sure, I would test them.
>
>> - Lowlights: regression on small packet transmission, higher cpu 
>> utilization
>>               than single queue, need further optimization
>>
>> Analysis of the performance result:
>>
>> - I count the number of packets sending/receiving during the test, and
>>    multiqueue show much more ability in terms of packets per second.
>>
>> - For the tx regression, multiqueue send about 1-2 times of more packets
>>    compared to single queue, and the packets size were much smaller 
>> than single
>>    queue does. I suspect tcp does less batching in multiqueue, so I 
>> hack the
>>    tcp_write_xmit() to forece more batching, multiqueue works as well as
>>    singlequeue for both small transmission and throughput
>
> Could it be that since the CPUs are not busy they are available for 
> immediate handling of the packets (little batching)? In such scenario 
> the CPU utilization is not really interesting. What will happen on a 
> busy machine?
>

The regression happnes when test guest transmission in stream test, the 
cpu utilization is 100% in this situation.
> Ronen.
>
>>
>> - I didn't pack the accelerate RFS with virtio-net in this sereis as 
>> it still
>>    need further shaping, for the one that interested in this please see:
>>    http://www.mail-archive.com/kvm@vger.kernel.org/msg64111.html
>>
>> Changes from V4:
>> - Add ability to negotiate the number of queues through control 
>> virtqueue
>> - Ethtool -{L|l} support and default the tx/rx queue number to 1
>> - Expose the API to set irq affinity instead of irq itself
>>
>> Changes from V3:
>>
>> - Rebase to the net-next
>> - Let queue 2 to be the control virtqueue to obey the spec
>> - Prodives irq affinity
>> - Choose txq based on processor id
>>
>> References:
>>
>> - V4: https://lkml.org/lkml/2012/6/25/120
>> - V3: http://lwn.net/Articles/467283/
>>
>> Test result:
>>
>> 1) 1 vm 2 vcpu 1q vs 2q, 1 - 1q, 2 - 2q, no pinning
>>
>> - Guest to External Host TCP STREAM
>> sessions size throughput1 throughput2   norm1 norm2
>> 1 64 650.55 655.61 100% 24.88 24.86 99%
>> 2 64 1446.81 1309.44 90% 30.49 27.16 89%
>> 4 64 1430.52 1305.59 91% 30.78 26.80 87%
>> 8 64 1450.89 1270.82 87% 30.83 25.95 84%
>> 1 256 1699.45 1779.58 104% 56.75 59.08 104%
>> 2 256 4902.71 3446.59 70% 98.53 62.78 63%
>> 4 256 4803.76 2980.76 62% 97.44 54.68 56%
>> 8 256 5128.88 3158.74 61% 104.68 58.61 55%
>> 1 512 2837.98 2838.42 100% 89.76 90.41 100%
>> 2 512 6742.59 5495.83 81% 155.03 99.07 63%
>> 4 512 9193.70 5900.17 64% 202.84 106.44 52%
>> 8 512 9287.51 7107.79 76% 202.18 129.08 63%
>> 1 1024 4166.42 4224.98 101% 128.55 129.86 101%
>> 2 1024 6196.94 7823.08 126% 181.80 168.81 92%
>> 4 1024 9113.62 9219.49 101% 235.15 190.93 81%
>> 8 1024 9324.25 9402.66 100% 239.10 179.99 75%
>> 1 2048 7441.63 6534.04 87% 248.01 215.63 86%
>> 2 2048 7024.61 7414.90 105% 225.79 219.62 97%
>> 4 2048 8971.49 9269.00 103% 278.94 220.84 79%
>> 8 2048 9314.20 9359.96 100% 268.36 192.23 71%
>> 1 4096 8282.60 8990.08 108% 277.45 320.05 115%
>> 2 4096 9194.80 9293.78 101% 317.02 248.76 78%
>> 4 4096 9340.73 9313.19 99% 300.34 230.35 76%
>> 8 4096 9148.23 9347.95 102% 279.49 199.43 71%
>> 1 16384 8787.89 8766.31 99% 312.38 316.53 101%
>> 2 16384 9306.35 9156.14 98% 319.53 279.83 87%
>> 4 16384 9177.81 9307.50 101% 312.69 230.07 73%
>> 8 16384 9035.82 9188.00 101% 298.32 199.17 66%
>> - TCP RR
>> sessions size throughput1 throughput2   norm1 norm2
>> 50 1 54695.41 84164.98 153% 1957.33 1901.31 97%
>> 100 1 60141.88 88598.94 147% 2157.90 2000.45 92%
>> 250 1 74763.56 135584.22 181% 2541.94 2628.59 103%
>> 50 64 51628.38 82867.50 160% 1872.55 1812.16 96%
>> 100 64 60367.73 84080.60 139% 2215.69 1867.69 84%
>> 250 64 68502.70 124910.59 182% 2321.43 2495.76 107%
>> 50 128 53477.08 77625.07 145% 1905.10 1870.99 98%
>> 100 128 59697.56 74902.37 125% 2230.66 1751.03 78%
>> 250 128 71248.74 133963.55 188% 2453.12 2711.72 110%
>> 50 256 47663.86 67742.63 142% 1880.45 1735.30 92%
>> 100 256 54051.84 68738.57 127% 2123.03 1778.59 83%
>> 250 256 68250.06 124487.90 182% 2321.89 2598.60 111%
>> - External Host to Guest TCP STRAM
>> sessions size throughput1 throughput2   norm1 norm2
>> 1 64 847.71 864.83 102% 57.99 57.93 99%
>> 2 64 1690.82 1544.94 91% 80.13 55.09 68%
>> 4 64 3434.98 3455.53 100% 127.17 89.00 69%
>> 8 64 5890.19 6557.35 111% 194.70 146.52 75%
>> 1 256 2094.04 2109.14 100% 130.73 127.14 97%
>> 2 256 5218.13 3731.97 71% 219.15 114.02 52%
>> 4 256 6734.51 9213.47 136% 227.87 208.31 91%
>> 8 256 6452.86 9402.78 145% 224.83 207.77 92%
>> 1 512 3945.07 4203.68 106% 279.72 273.30 97%
>> 2 512 7878.96 8122.55 103% 278.25 231.71 83%
>> 4 512 7645.89 9402.13 122% 252.10 217.42 86%
>> 8 512 6657.06 9403.71 141% 239.81 214.89 89%
>> 1 1024 5729.06 5111.21 89% 289.38 303.09 104%
>> 2 1024 8097.27 8159.67 100% 269.29 242.97 90%
>> 4 1024 7778.93 8919.02 114% 261.28 205.50 78%
>> 8 1024 6458.02 9360.02 144% 221.26 208.09 94%
>> 1 2048 6426.94 5195.59 80% 292.52 307.47 105%
>> 2 2048 8221.90 9025.66 109% 283.80 242.25 85%
>> 4 2048 7364.72 8527.79 115% 248.10 198.36 79%
>> 8 2048 6760.63 9161.07 135% 230.53 205.12 88%
>> 1 4096 7247.02 6874.21 94% 276.23 287.68 104%
>> 2 4096 8346.04 8818.65 105% 281.49 254.81 90%
>> 4 4096 6710.00 9354.59 139% 216.41 210.13 97%
>> 8 4096 6265.69 9406.87 150% 206.69 210.92 102%
>> 1 16384 8159.50 8048.79 98% 266.94 283.11 106%
>> 2 16384 8525.66 8552.41 100% 294.36 239.27 81%
>> 4 16384 6042.24 8447.86 139% 200.21 196.40 98%
>> 8 16384 6432.63 9403.49 146% 211.48 206.13 97%
>>
>> 2) 1 vm 4 vcpu 1q vs 4q, 1 - 1q, 2 - 4q, no pinning
>>
>> - Guest to External Host TCP STREAM
>> sessions size throughput1 throughput2   norm1 norm2
>> 1 64 636.93 657.69 103% 23.55 24.42 103%
>> 2 64 1457.46 1268.78 87% 30.97 26.02 84%
>> 4 64 3062.86 2302.43 75% 41.00 29.64 72%
>> 8 64 3107.68 2308.32 74% 41.62 29.07 69%
>> 1 256 1743.50 1750.11 100% 59.00 56.63 95%
>> 2 256 4582.61 2870.31 62% 92.47 51.97 56%
>> 4 256 8440.96 4795.37 56% 135.10 56.39 41%
>> 8 256 9240.31 6654.82 72% 144.76 74.89 51%
>> 1 512 2918.25 2735.26 93% 91.08 86.47 94%
>> 2 512 8978.32 5107.95 56% 200.00 94.97 47%
>> 4 512 8850.39 6864.37 77% 190.32 101.09 53%
>> 8 512 9270.30 8483.01 91% 193.44 118.73 61%
>> 1 1024 4416.10 3679.70 83% 135.54 110.63 81%
>> 2 1024 9085.20 8770.48 96% 242.23 175.59 72%
>> 4 1024 9158.57 9011.56 98% 234.39 159.17 67%
>> 8 1024 9345.89 9067.43 97% 233.35 138.73 59%
>> 1 2048 8455.19 6077.94 71% 338.52 190.16 56%
>> 2 2048 9223.32 8237.73 89% 270.00 198.27 73%
>> 4 2048 9080.75 9257.63 101% 261.30 172.80 66%
>> 8 2048 9177.39 8977.10 97% 256.89 147.50 57%
>> 1 4096 8665.35 8394.78 96% 289.63 289.85 100%
>> 2 4096 7850.73 8857.86 112% 253.33 252.62 99%
>> 4 4096 9332.55 8508.37 91% 289.19 151.29 52%
>> 8 4096 8482.30 9146.80 107% 255.41 156.02 61%
>> 1 16384 8825.72 8778.26 99% 314.60 308.89 98%
>> 2 16384 9283.85 8927.40 96% 316.48 246.98 78%
>> 4 16384 7766.95 8708.06 112% 265.25 155.59 58%
>> 8 16384 8945.55 8940.23 99% 298.45 151.32 50%
>> - TCP_RR
>> sessions size throughput1 throughput2   norm1 norm2
>> 50 1 60848.70 81719.39 134% 2196.86 1551.05 70%
>> 100 1 61886.19 81425.02 131% 2215.76 1517.52 68%
>> 250 1 72058.41 162597.84 225% 2441.84 2278.14 93%
>> 50 64 51646.93 74160.10 143% 1861.07 1322.22 71%
>> 100 64 57574.86 83488.26 145% 2076.54 1479.79 71%
>> 250 64 67583.35 138482.15 204% 2314.46 2022.83 87%
>> 50 128 59931.51 71633.03 119% 2244.60 1309.18 58%
>> 100 128 58329.80 73104.90 125% 2202.98 1329.52 60%
>> 250 128 71021.55 161067.73 226% 2469.11 2205.28 89%
>> 50 256 47509.24 64330.24 135% 1915.75 1269.90 66%
>> 100 256 49293.03 68507.94 138% 1939.75 1263.64 65%
>> 250 256 63169.07 138390.68 219% 2255.47 2098.13 93%
>> - External Host to Guest TCP STREAM
>> sessions size throughput1 throughput2   norm1 norm2
>> 1 64 850.18 854.96 100% 56.94 58.25 102%
>> 2 64 1659.12 1730.25 104% 81.65 67.57 82%
>> 4 64 3254.70 3397.17 104% 118.57 76.21 64%
>> 8 64 6251.97 6389.29 102% 207.68 104.21 50%
>> 1 256 2029.14 2105.18 103% 116.45 119.69 102%
>> 2 256 5412.02 4260.32 78% 240.87 139.73 58%
>> 4 256 7777.28 8743.12 112% 263.20 174.65 66%
>> 8 256 6459.51 9388.93 145% 218.94 158.37 72%
>> 1 512 4566.31 4269.30 93% 274.74 289.83 105%
>> 2 512 7444.52 8240.64 110% 286.24 243.74 85%
>> 4 512 7722.29 9391.16 121% 261.96 180.36 68%
>> 8 512 6228.50 9134.52 146% 209.17 161.00 76%
>> 1 1024 4965.50 4953.68 99% 307.64 280.48 91%
>> 2 1024 8270.08 7733.71 93% 288.32 197.04 68%
>> 4 1024 7551.04 9394.58 124% 268.41 206.62 76%
>> 8 1024 6307.78 9179.03 145% 216.67 159.63 73%
>> 1 2048 5741.12 5948.80 103% 290.34 268.66 92%
>> 2 2048 7932.79 8766.05 110% 262.96 215.90 82%
>> 4 2048 6907.55 9255.97 133% 233.56 203.96 87%
>> 8 2048 6037.22 9399.41 155% 197.14 164.09 83%
>> 1 4096 7131.70 7535.10 105% 279.43 275.12 98%
>> 2 4096 8109.17 9348.04 115% 274.29 211.49 77%
>> 4 4096 6878.92 9319.13 135% 244.21 192.06 78%
>> 8 4096 6265.92 9408.35 150% 211.85 159.26 75%
>> 1 16384 8288.01 8596.39 103% 272.85 290.22 106%
>> 2 16384 8166.29 9280.12 113% 277.04 236.61 85%
>> 4 16384 6446.97 9382.22 145% 222.91 187.24 83%
>> 8 16384 6066.98 9405.51 155% 198.98 157.09 78%
>>
>> 3) 2 vms each with 2 vcpus, 1q vs 2q - pin vhost/vcpu in the same node
>>
>> - 2 Guests to External Hosts TCP STREAM
>> sessions size throughput1 throughput2   norm1 norm2
>> 1 64 1442.07 1475.11 102% 30.82 31.21 101%
>> 2 64 3124.87 2900.93 92% 40.29 35.95 89%
>> 4 64 3166.52 2864.04 90% 40.70 35.47 87%
>> 8 64 3141.45 2848.94 90% 40.38 35.34 87%
>> 1 256 3628.54 3711.73 102% 68.47 70.22 102%
>> 2 256 7806.95 7586.69 97% 111.23 84.38 75%
>> 4 256 8823.65 7612.74 86% 132.92 85.04 63%
>> 8 256 9194.89 9373.41 101% 135.98 119.62 87%
>> 1 512 7106.67 7128.00 100% 124.79 124.30 99%
>> 2 512 9190.22 9397.33 102% 180.84 149.34 82%
>> 4 512 9401.01 9376.67 99% 173.00 140.15 81%
>> 8 512 8572.84 9032.90 105% 150.49 127.58 84%
>> 1 1024 9361.93 9379.24 100% 205.81 202.94 98%
>> 2 1024 9386.69 9389.04 100% 201.78 165.75 82%
>> 4 1024 9403.43 9378.54 99% 195.33 152.06 77%
>> 8 1024 9213.63 9180.64 99% 178.99 141.51 79%
>> 1 2048 9338.95 9384.67 100% 223.22 227.86 102%
>> 2 2048 9389.28 9389.45 100% 202.37 170.08 84%
>> 4 2048 9405.86 9388.71 99% 193.76 161.54 83%
>> 8 2048 9352.40 9384.06 100% 189.16 157.06 83%
>> 1 4096 9380.74 9384.90 100% 239.37 241.56 100%
>> 2 4096 9393.47 9376.74 99% 213.84 195.61 91%
>> 4 4096 9393.85 9381.50 99% 198.06 170.18 85%
>> 8 4096 9400.41 9232.31 98% 192.87 163.56 84%
>> 1 16384 9348.18 9335.55 99% 253.02 254.86 100%
>> 2 16384 9384.97 9359.53 99% 218.56 208.59 95%
>> 4 16384 9326.60 9382.15 100% 206.24 179.72 87%
>> 8 16384 9355.82 9392.85 100% 198.22 172.89 87%
>> - TCP RR
>> sessions size throughput1 throughput2   norm1 norm2
>> 50 1 200340.33 261750.19 130% 2935.27 3018.59 102%
>> 100 1 236141.58 266304.49 112% 3452.16 3071.74 88%
>> 250 1 361574.59 320825.08 88% 4972.98 3705.70 74%
>> 50 64 225748.53 242671.12 107% 3011.48 2869.07 95%
>> 100 64 249885.37 260453.72 104% 3240.21 3063.67 94%
>> 250 64 360341.12 310775.60 86% 4682.42 3657.91 78%
>> 50 128 227995.27 289320.38 126% 2950.92 3479.37 117%
>> 100 128 239491.11 291135.77 121% 3099.55 3508.75 113%
>> 250 128 390390.68 362484.35 92% 5042.30 4368.52 86%
>> 50 256 222604.51 317140.97 142% 3058.08 3839.39 125%
>> 100 256 254770.92 335606.03 131% 3326.16 4046.65 121%
>> 250 256 400584.52 436749.22 109% 5220.79 5278.86 101%
>> - External Host to 2 Guests
>> sessions size throughput1 throughput2   norm1 norm2
>> 1 64 1667.99 1684.50 100% 59.66 60.77 101%
>> 2 64 3338.83 3379.97 101% 83.61 64.82 77%
>> 4 64 6613.65 6619.11 100% 131.00 97.19 74%
>> 8 64 6553.07 6418.31 97% 141.35 98.27 69%
>> 1 256 3938.40 4068.52 103% 125.21 123.76 98%
>> 2 256 9215.57 9210.88 99% 185.31 154.27 83%
>> 4 256 9407.29 9008.13 95% 186.72 150.01 80%
>> 8 256 9377.17 9385.57 100% 190.28 137.59 72%
>> 1 512 7360.19 6984.80 94% 214.09 211.66 98%
>> 2 512 9392.91 9401.88 100% 193.92 173.11 89%
>> 4 512 9382.64 9394.34 100% 189.27 145.80 77%
>> 8 512 9308.60 9094.08 97% 189.70 141.26 74%
>> 1 1024 9153.26 9066.06 99% 223.07 219.95 98%
>> 2 1024 9393.38 9398.43 100% 194.02 173.82 89%
>> 4 1024 9395.92 8960.73 95% 192.61 145.82 75%
>> 8 1024 9388.92 9399.08 100% 191.18 143.87 75%
>> 1 2048 9355.32 9240.63 98% 221.50 223.03 100%
>> 2 2048 9395.68 9399.62 100% 193.31 177.21 91%
>> 4 2048 9397.67 9399.56 100% 195.25 157.53 80%
>> 8 2048 9397.89 9401.70 100% 197.57 146.96 74%
>> 1 4096 9375.84 9381.72 100% 223.06 225.06 100%
>> 2 4096 9389.47 9396.00 100% 193.91 197.13 101%
>> 4 4096 9397.45 9400.11 100% 192.33 163.60 85%
>> 8 4096 9105.40 9415.76 103% 192.71 140.41 72%
>> 1 16384 9381.53 9381.40 99% 223.53 225.66 100%
>> 2 16384 9387.90 9395.44 100% 193.34 177.03 91%
>> 4 16384 9397.92 9410.98 100% 195.04 151.14 77%
>> 8 16384 9259.00 9419.48 101% 194.91 153.48 78%
>>
>> 4) Local vm to vm 2 vcpu 1q vs 2q - pin vcpu/thread in the same numa 
>> node
>>
>> - VM to VM TCP STREAM
>> sessions size throughput1 throughput2   norm1 norm2
>> 1 64 576.05 576.14 100% 12.25 12.32 100%
>> 2 64 1266.75 1160.04 91% 19.10 16.05 84%
>> 4 64 1267.34 1123.70 88% 19.08 15.51 81%
>> 8 64 1230.88 1174.70 95% 18.53 15.58 84%
>> 1 256 1311.00 1303.02 99% 25.34 25.35 100%
>> 2 256 5400.26 2794.00 51% 75.92 36.43 47%
>> 4 256 5200.67 2818.88 54% 72.81 33.92 46%
>> 8 256 5234.55 2893.74 55% 73.10 34.97 47%
>> 1 512 3244.09 3263.72 100% 56.48 56.65 100%
>> 2 512 8172.16 4661.15 57% 119.05 67.89 57%
>> 4 512 10567.44 7063.25 66% 147.76 77.27 52%
>> 8 512 10477.87 8471.33 80% 145.94 102.91 70%
>> 1 1024 5432.54 5333.99 98% 93.69 92.38 98%
>> 2 1024 12590.24 9259.97 73% 185.37 135.28 72%
>> 4 1024 15600.53 10731.93 68% 222.20 123.60 55%
>> 8 1024 16222.87 10704.85 65% 227.05 113.81 50%
>> 1 2048 6667.61 7484.37 112% 116.75 129.72 111%
>> 2 2048 8180.43 11500.88 140% 137.84 156.64 113%
>> 4 2048 15127.93 14416.16 95% 227.60 154.59 67%
>> 8 2048 16381.79 14794.10 90% 244.29 158.45 64%
>> 1 4096 7375.63 8948.90 121% 131.97 156.57 118%
>> 2 4096 9321.16 14443.21 154% 161.24 163.74 101%
>> 4 4096 13028.45 15984.94 122% 212.78 171.26 80%
>> 8 4096 15611.28 18810.54 120% 245.15 198.65 81%
>> 1 16384 15304.38 14202.08 92% 259.94 244.04 93%
>> 2 16384 15508.97 15913.09 102% 261.30 244.26 93%
>> 4 16384 14859.98 20164.34 135% 248.29 214.26 86%
>> 8 16384 15594.59 19960.99 127% 253.79 211.27 83%
>> - TCP RR
>> sessions size throughput1 throughput2   norm1 norm2
>> 50 1 54972.51 69820.99 127% 1133.58 1063.58 93%
>> 100 1 55847.16 72407.93 129% 1155.73 1024.35 88%
>> 250 1 60066.23 108266.50 180% 1114.30 1323.55 118%
>> 50 64 48727.63 62378.32 128% 1014.29 888.78 87%
>> 100 64 51804.65 69250.51 133% 1077.78 986.97 91%
>> 250 64 61278.68 100015.78 163% 1076.93 1243.18 115%
>> 50 256 51593.29 62046.22 120% 1069.14 871.08 81%
>> 100 256 51647.00 68197.43 132% 1071.66 958.51 89%
>> 250 256 60433.88 99072.59 163% 1072.41 1199.10 111%
>> 50 512 52177.79 66483.77 127% 1082.65 960.82 88%
>> 100 512 50351.67 62537.63 124% 1041.61 876.41 84%
>> 250 512 60510.14 103856.79 171% 1055.21 1245.17 118%
>>
>>
>> Jason Wang (4):
>>    virtio_ring: move queue_index to vring_virtqueue
>>    virtio: intorduce an API to set affinity for a virtqueue
>>    virtio_net: multiqueue support
>>    virtio_net: support negotiating the number of queues through ctrl vq
>>
>> Krishna Kumar (1):
>>    virtio_net: Introduce VIRTIO_NET_F_MULTIQUEUE
>>
>>   drivers/net/virtio_net.c      |  792 
>> +++++++++++++++++++++++++++++------------
>>   drivers/virtio/virtio_mmio.c  |    5 +-
>>   drivers/virtio/virtio_pci.c   |   58 +++-
>>   drivers/virtio/virtio_ring.c  |   17 +
>>   include/linux/virtio.h        |    4 +
>>   include/linux/virtio_config.h |   21 ++
>>   include/linux/virtio_net.h    |   10 +
>>   7 files changed, 677 insertions(+), 230 deletions(-)
>>
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>

^ permalink raw reply

* Re: [net-next RFC V5 0/5] Multiqueue virtio-net
From: Jason Wang @ 2012-07-09  3:23 UTC (permalink / raw)
  To: Rick Jones
  Cc: krkumar2, habanero, mashirle, kvm, mst, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <4FF710FD.2090100@hp.com>

On 07/07/2012 12:23 AM, Rick Jones wrote:
> On 07/06/2012 12:42 AM, Jason Wang wrote:
>> I'm not expert of tcp, but looks like the changes are reasonable:
>> - we can do full-sized TSO check in tcp_tso_should_defer() only for
>> westwood, according to tcp westwood
>> - run tcp_tso_should_defer for tso_segs = 1 when tso is enabled.
>
> I'm sure Eric and David will weigh-in on the TCP change.  My initial 
> inclination would have been to say "well, if multiqueue is draining 
> faster, that means ACKs come-back faster, which means the "race" 
> between more data being queued by netperf and ACKs will go more to the 
> ACKs which means the segments being sent will be smaller - as 
> TCP_NODELAY is not set, the Nagle algorithm is in force, which means 
> once there is data outstanding on the connection, no more will be sent 
> until either the outstanding data is ACKed, or there is an 
> accumulation of > MSS worth of data to send.
>
>>> Also, how are you combining the concurrent netperf results?  Are you
>>> taking sums of what netperf reports, or are you gathering statistics
>>> outside of netperf?
>>>
>>
>> The throughput were just sumed from netperf result like what netperf
>> manual suggests. The cpu utilization were measured by mpstat.
>
> Which mechanism to address skew error?  The netperf manual describes 
> more than one:

This mechanism is missed in my test, I would add them to my test scripts.
>
> http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance 
>
>
> Personally, my preference these days is to use the "demo mode" method 
> of aggregate results as it can be rather faster than (ab)using the 
> confidence intervals mechanism, which I suspect may not really scale 
> all that well to large numbers of concurrent netperfs.

During my test, the confidence interval would even hard to achieved in 
RR test when I pin vhost/vcpus in the processors, so I didn't use it.
>
> I also tend to use the --enable-burst configure option to allow me to 
> minimize the number of concurrent netperfs in the first place.  Set 
> TCP_NODELAY (the test-specific -D option) and then have several 
> transactions outstanding at one time (test-specific -b option with a 
> number of additional in-flight transactions).
>
> This is expressed in the runemomniaggdemo.sh script:
>
> http://www.netperf.org/svn/netperf2/trunk/doc/examples/runemomniaggdemo.sh 
>
>
> which uses the find_max_burst.sh script:
>
> http://www.netperf.org/svn/netperf2/trunk/doc/examples/find_max_burst.sh
>
> to pick the burst size to use in the concurrent netperfs, the results 
> of which can be post-processed with:
>
> http://www.netperf.org/svn/netperf2/trunk/doc/examples/post_proc.py
>
> The nice feature of using the "demo mode" mechanism is when it is 
> coupled with systems with reasonably synchronized clocks (eg NTP) it 
> can be used for many-to-many testing in addition to one-to-many 
> testing (which cannot be dealt with by the confidence interval method 
> of dealing with skew error)
>

Yes, looks "demo mode" is helpful. I would have a look at these scripts, 
Thanks.
>>> A single instance TCP_RR test would help confirm/refute any
>>> non-trivial change in (effective) path length between the two cases.
>>>
>>
>> Yes, I would test this thanks.
>
> Excellent.
>
> happy benchmarking,
>
> rick jones
>
> -- 
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] MAINTAINERS: add kvm list for virtio components
From: Rusty Russell @ 2012-07-09  0:37 UTC (permalink / raw)
  To: Michael S. Tsirkin, Paolo Bonzini
  Cc: amit.shah, avi, linux-kernel, kvm, virtualization
In-Reply-To: <20120705122225.GA30572@redhat.com>

On Thu, 5 Jul 2012 15:22:25 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Thu, Jul 05, 2012 at 12:07:07PM +0200, Paolo Bonzini wrote:
> > The KVM list is followed by more people than the generic
> > virtualization@lists.linux-foundation.org mailing list, and is
> > already "de facto" the place where virtio patches are posted.
> 
> I have no data on the first statement (do you?) and I disagree with the
> last statement, but have no objection to people adding kvm list as well.

Well, the kvm list is far more active.  It's annoying to have two lists,
but it is how things are so best to document it.

I'm OK with the original commit message, but MST is smarter than me, so
let's do what he suggests :)

Cheers,
Rusty.

^ permalink raw reply

* Re: [PATCH] virtio-blk: add back VIRTIO_BLK_F_FLUSH
From: Rusty Russell @ 2012-07-09  0:11 UTC (permalink / raw)
  To: Paolo Bonzini, linux-kernel, virtualization, kvm
  Cc: levinsasha928, Michael S. Tsirkin
In-Reply-To: <1341482902-13318-1-git-send-email-pbonzini@redhat.com>

On Thu,  5 Jul 2012 12:08:22 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote:
> The old name is part of the userspace API, add it back for compatibility.
> 
> Reported-by: Sasha Levin <levinsasha928@gmail.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  include/linux/virtio_blk.h |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/virtio_blk.h b/include/linux/virtio_blk.h
> index 18a1027..83a3116 100644
> --- a/include/linux/virtio_blk.h
> +++ b/include/linux/virtio_blk.h
> @@ -41,6 +41,9 @@
>  #define VIRTIO_BLK_F_TOPOLOGY	10	/* Topology information is available */
>  #define VIRTIO_BLK_F_CONFIG_WCE	11	/* Writeback mode available in config */
>  
> +/* Backwards-compatibility #defines for renamed features.  */
> +#define VIRTIO_BLK_F_FLUSH	VIRTIO_BLK_F_WCE

Please #ifndef __KERNEL__?

Thanks,
Rusty.

^ permalink raw reply

* Re: [PATCH] virtio-blk spec: document topology fields
From: Rusty Russell @ 2012-07-08 23:42 UTC (permalink / raw)
  To: Paolo Bonzini, virtualization, kvm
In-Reply-To: <1341410556-5791-1-git-send-email-pbonzini@redhat.com>

On Wed,  4 Jul 2012 16:02:36 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote:
> This completes the changes from yesterday.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Thanks, applied!

Cheers,
Rusty.

^ permalink raw reply

* [PULL] Minor virtio-balloon fix.
From: Rusty Russell @ 2012-07-08 23:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Michael S. Tsirkin, virtualization

The following changes since commit 02edf6abe01610a5fb379df442de3c837ad99467:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac (2012-06-18 13:34:25 -0700)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus.git tags/virtio-for-linus

for you to fetch changes up to 9c378abc5c0c6fc8e3acf5968924d274503819b3:

  virtio-balloon: fix add/get API use (2012-07-09 09:07:22 +0930)

----------------------------------------------------------------
Theoretical fix, which greatly simplifies upcoming balloon patches which will go in via some vm tree.

----------------------------------------------------------------
Michael S. Tsirkin (1):
      virtio-balloon: fix add/get API use

 drivers/virtio/virtio_balloon.c |   24 ++++++++++--------------
 1 file changed, 10 insertions(+), 14 deletions(-)

^ permalink raw reply

* Re: RFD: virtio balloon API use (was Re: [PATCH 5 of 5] virtio: expose added descriptors immediately)
From: Rusty Russell @ 2012-07-08 23:39 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtualization, Rafael Aquini, kvm, linux-kernel
In-Reply-To: <20120704105533.GB21704@redhat.com>

On Wed, 4 Jul 2012 13:55:33 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Tue, Jul 03, 2012 at 10:17:46AM +0930, Rusty Russell wrote:
> > On Mon, 2 Jul 2012 13:08:19 -0300, Rafael Aquini <aquini@redhat.com> wrote:
> > > As 'locking in balloon', may I assume the approach I took for the compaction case
> > > is OK and aligned to address these concerns of yours? If not, do not hesitate in
> > > giving me your thoughts, please. I'm respinning a V3 series to address a couple
> > > of extra nitpicks from the compaction standpoint, and I'd love to be able to
> > > address any extra concern you might have on the balloon side of that work.
> > 
> > It's orthogonal, though looks like they clash textually :(
> > 
> > I'll re-spin MST's patch on top of yours, and include both in my tree,
> > otherwise linux-next will have to do the merge.  But I'll await your
> > push before pushing to Linus next merge window.
> > 
> > Thanks,
> > Rusty.
> 
> While theoretical mine is a bugfix so could be 3.5 material, no?

It's a little thin, but it's a good idea.  I'll try.

Cheers,
Rusty.

^ permalink raw reply

* Re: [net-next RFC V5 0/5] Multiqueue virtio-net
From: Ronen Hod @ 2012-07-08  8:19 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, habanero, mashirle, kvm, mst, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <1341484194-8108-1-git-send-email-jasowang@redhat.com>

On 07/05/2012 01:29 PM, Jason Wang wrote:
> Hello All:
>
> This series is an update version of multiqueue virtio-net driver based on
> Krishna Kumar's work to let virtio-net use multiple rx/tx queues to do the
> packets reception and transmission. Please review and comments.
>
> Test Environment:
> - Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 8 cores 2 numa nodes
> - Two directed connected 82599
>
> Test Summary:
>
> - Highlights: huge improvements on TCP_RR test

Hi Jason,

It might be that the good TCP_RR results are due to the large number of sessions (50-250). Can you test it also with small number of sessions?

> - Lowlights: regression on small packet transmission, higher cpu utilization
>               than single queue, need further optimization
>
> Analysis of the performance result:
>
> - I count the number of packets sending/receiving during the test, and
>    multiqueue show much more ability in terms of packets per second.
>
> - For the tx regression, multiqueue send about 1-2 times of more packets
>    compared to single queue, and the packets size were much smaller than single
>    queue does. I suspect tcp does less batching in multiqueue, so I hack the
>    tcp_write_xmit() to forece more batching, multiqueue works as well as
>    singlequeue for both small transmission and throughput

Could it be that since the CPUs are not busy they are available for immediate handling of the packets (little batching)? In such scenario the CPU utilization is not really interesting. What will happen on a busy machine?

Ronen.

>
> - I didn't pack the accelerate RFS with virtio-net in this sereis as it still
>    need further shaping, for the one that interested in this please see:
>    http://www.mail-archive.com/kvm@vger.kernel.org/msg64111.html
>
> Changes from V4:
> - Add ability to negotiate the number of queues through control virtqueue
> - Ethtool -{L|l} support and default the tx/rx queue number to 1
> - Expose the API to set irq affinity instead of irq itself
>
> Changes from V3:
>
> - Rebase to the net-next
> - Let queue 2 to be the control virtqueue to obey the spec
> - Prodives irq affinity
> - Choose txq based on processor id
>
> References:
>
> - V4: https://lkml.org/lkml/2012/6/25/120
> - V3: http://lwn.net/Articles/467283/
>
> Test result:
>
> 1) 1 vm 2 vcpu 1q vs 2q, 1 - 1q, 2 - 2q, no pinning
>
> - Guest to External Host TCP STREAM
> sessions size throughput1 throughput2   norm1 norm2
> 1 64 650.55 655.61 100% 24.88 24.86 99%
> 2 64 1446.81 1309.44 90% 30.49 27.16 89%
> 4 64 1430.52 1305.59 91% 30.78 26.80 87%
> 8 64 1450.89 1270.82 87% 30.83 25.95 84%
> 1 256 1699.45 1779.58 104% 56.75 59.08 104%
> 2 256 4902.71 3446.59 70% 98.53 62.78 63%
> 4 256 4803.76 2980.76 62% 97.44 54.68 56%
> 8 256 5128.88 3158.74 61% 104.68 58.61 55%
> 1 512 2837.98 2838.42 100% 89.76 90.41 100%
> 2 512 6742.59 5495.83 81% 155.03 99.07 63%
> 4 512 9193.70 5900.17 64% 202.84 106.44 52%
> 8 512 9287.51 7107.79 76% 202.18 129.08 63%
> 1 1024 4166.42 4224.98 101% 128.55 129.86 101%
> 2 1024 6196.94 7823.08 126% 181.80 168.81 92%
> 4 1024 9113.62 9219.49 101% 235.15 190.93 81%
> 8 1024 9324.25 9402.66 100% 239.10 179.99 75%
> 1 2048 7441.63 6534.04 87% 248.01 215.63 86%
> 2 2048 7024.61 7414.90 105% 225.79 219.62 97%
> 4 2048 8971.49 9269.00 103% 278.94 220.84 79%
> 8 2048 9314.20 9359.96 100% 268.36 192.23 71%
> 1 4096 8282.60 8990.08 108% 277.45 320.05 115%
> 2 4096 9194.80 9293.78 101% 317.02 248.76 78%
> 4 4096 9340.73 9313.19 99% 300.34 230.35 76%
> 8 4096 9148.23 9347.95 102% 279.49 199.43 71%
> 1 16384 8787.89 8766.31 99% 312.38 316.53 101%
> 2 16384 9306.35 9156.14 98% 319.53 279.83 87%
> 4 16384 9177.81 9307.50 101% 312.69 230.07 73%
> 8 16384 9035.82 9188.00 101% 298.32 199.17 66%
> - TCP RR
> sessions size throughput1 throughput2   norm1 norm2
> 50 1 54695.41 84164.98 153% 1957.33 1901.31 97%
> 100 1 60141.88 88598.94 147% 2157.90 2000.45 92%
> 250 1 74763.56 135584.22 181% 2541.94 2628.59 103%
> 50 64 51628.38 82867.50 160% 1872.55 1812.16 96%
> 100 64 60367.73 84080.60 139% 2215.69 1867.69 84%
> 250 64 68502.70 124910.59 182% 2321.43 2495.76 107%
> 50 128 53477.08 77625.07 145% 1905.10 1870.99 98%
> 100 128 59697.56 74902.37 125% 2230.66 1751.03 78%
> 250 128 71248.74 133963.55 188% 2453.12 2711.72 110%
> 50 256 47663.86 67742.63 142% 1880.45 1735.30 92%
> 100 256 54051.84 68738.57 127% 2123.03 1778.59 83%
> 250 256 68250.06 124487.90 182% 2321.89 2598.60 111%
> - External Host to Guest TCP STRAM
> sessions size throughput1 throughput2   norm1 norm2
> 1 64 847.71 864.83 102% 57.99 57.93 99%
> 2 64 1690.82 1544.94 91% 80.13 55.09 68%
> 4 64 3434.98 3455.53 100% 127.17 89.00 69%
> 8 64 5890.19 6557.35 111% 194.70 146.52 75%
> 1 256 2094.04 2109.14 100% 130.73 127.14 97%
> 2 256 5218.13 3731.97 71% 219.15 114.02 52%
> 4 256 6734.51 9213.47 136% 227.87 208.31 91%
> 8 256 6452.86 9402.78 145% 224.83 207.77 92%
> 1 512 3945.07 4203.68 106% 279.72 273.30 97%
> 2 512 7878.96 8122.55 103% 278.25 231.71 83%
> 4 512 7645.89 9402.13 122% 252.10 217.42 86%
> 8 512 6657.06 9403.71 141% 239.81 214.89 89%
> 1 1024 5729.06 5111.21 89% 289.38 303.09 104%
> 2 1024 8097.27 8159.67 100% 269.29 242.97 90%
> 4 1024 7778.93 8919.02 114% 261.28 205.50 78%
> 8 1024 6458.02 9360.02 144% 221.26 208.09 94%
> 1 2048 6426.94 5195.59 80% 292.52 307.47 105%
> 2 2048 8221.90 9025.66 109% 283.80 242.25 85%
> 4 2048 7364.72 8527.79 115% 248.10 198.36 79%
> 8 2048 6760.63 9161.07 135% 230.53 205.12 88%
> 1 4096 7247.02 6874.21 94% 276.23 287.68 104%
> 2 4096 8346.04 8818.65 105% 281.49 254.81 90%
> 4 4096 6710.00 9354.59 139% 216.41 210.13 97%
> 8 4096 6265.69 9406.87 150% 206.69 210.92 102%
> 1 16384 8159.50 8048.79 98% 266.94 283.11 106%
> 2 16384 8525.66 8552.41 100% 294.36 239.27 81%
> 4 16384 6042.24 8447.86 139% 200.21 196.40 98%
> 8 16384 6432.63 9403.49 146% 211.48 206.13 97%
>
> 2) 1 vm 4 vcpu 1q vs 4q, 1 - 1q, 2 - 4q, no pinning
>
> - Guest to External Host TCP STREAM
> sessions size throughput1 throughput2   norm1 norm2
> 1 64 636.93 657.69 103% 23.55 24.42 103%
> 2 64 1457.46 1268.78 87% 30.97 26.02 84%
> 4 64 3062.86 2302.43 75% 41.00 29.64 72%
> 8 64 3107.68 2308.32 74% 41.62 29.07 69%
> 1 256 1743.50 1750.11 100% 59.00 56.63 95%
> 2 256 4582.61 2870.31 62% 92.47 51.97 56%
> 4 256 8440.96 4795.37 56% 135.10 56.39 41%
> 8 256 9240.31 6654.82 72% 144.76 74.89 51%
> 1 512 2918.25 2735.26 93% 91.08 86.47 94%
> 2 512 8978.32 5107.95 56% 200.00 94.97 47%
> 4 512 8850.39 6864.37 77% 190.32 101.09 53%
> 8 512 9270.30 8483.01 91% 193.44 118.73 61%
> 1 1024 4416.10 3679.70 83% 135.54 110.63 81%
> 2 1024 9085.20 8770.48 96% 242.23 175.59 72%
> 4 1024 9158.57 9011.56 98% 234.39 159.17 67%
> 8 1024 9345.89 9067.43 97% 233.35 138.73 59%
> 1 2048 8455.19 6077.94 71% 338.52 190.16 56%
> 2 2048 9223.32 8237.73 89% 270.00 198.27 73%
> 4 2048 9080.75 9257.63 101% 261.30 172.80 66%
> 8 2048 9177.39 8977.10 97% 256.89 147.50 57%
> 1 4096 8665.35 8394.78 96% 289.63 289.85 100%
> 2 4096 7850.73 8857.86 112% 253.33 252.62 99%
> 4 4096 9332.55 8508.37 91% 289.19 151.29 52%
> 8 4096 8482.30 9146.80 107% 255.41 156.02 61%
> 1 16384 8825.72 8778.26 99% 314.60 308.89 98%
> 2 16384 9283.85 8927.40 96% 316.48 246.98 78%
> 4 16384 7766.95 8708.06 112% 265.25 155.59 58%
> 8 16384 8945.55 8940.23 99% 298.45 151.32 50%
> - TCP_RR
> sessions size throughput1 throughput2   norm1 norm2
> 50 1 60848.70 81719.39 134% 2196.86 1551.05 70%
> 100 1 61886.19 81425.02 131% 2215.76 1517.52 68%
> 250 1 72058.41 162597.84 225% 2441.84 2278.14 93%
> 50 64 51646.93 74160.10 143% 1861.07 1322.22 71%
> 100 64 57574.86 83488.26 145% 2076.54 1479.79 71%
> 250 64 67583.35 138482.15 204% 2314.46 2022.83 87%
> 50 128 59931.51 71633.03 119% 2244.60 1309.18 58%
> 100 128 58329.80 73104.90 125% 2202.98 1329.52 60%
> 250 128 71021.55 161067.73 226% 2469.11 2205.28 89%
> 50 256 47509.24 64330.24 135% 1915.75 1269.90 66%
> 100 256 49293.03 68507.94 138% 1939.75 1263.64 65%
> 250 256 63169.07 138390.68 219% 2255.47 2098.13 93%
> - External Host to Guest TCP STREAM
> sessions size throughput1 throughput2   norm1 norm2
> 1 64 850.18 854.96 100% 56.94 58.25 102%
> 2 64 1659.12 1730.25 104% 81.65 67.57 82%
> 4 64 3254.70 3397.17 104% 118.57 76.21 64%
> 8 64 6251.97 6389.29 102% 207.68 104.21 50%
> 1 256 2029.14 2105.18 103% 116.45 119.69 102%
> 2 256 5412.02 4260.32 78% 240.87 139.73 58%
> 4 256 7777.28 8743.12 112% 263.20 174.65 66%
> 8 256 6459.51 9388.93 145% 218.94 158.37 72%
> 1 512 4566.31 4269.30 93% 274.74 289.83 105%
> 2 512 7444.52 8240.64 110% 286.24 243.74 85%
> 4 512 7722.29 9391.16 121% 261.96 180.36 68%
> 8 512 6228.50 9134.52 146% 209.17 161.00 76%
> 1 1024 4965.50 4953.68 99% 307.64 280.48 91%
> 2 1024 8270.08 7733.71 93% 288.32 197.04 68%
> 4 1024 7551.04 9394.58 124% 268.41 206.62 76%
> 8 1024 6307.78 9179.03 145% 216.67 159.63 73%
> 1 2048 5741.12 5948.80 103% 290.34 268.66 92%
> 2 2048 7932.79 8766.05 110% 262.96 215.90 82%
> 4 2048 6907.55 9255.97 133% 233.56 203.96 87%
> 8 2048 6037.22 9399.41 155% 197.14 164.09 83%
> 1 4096 7131.70 7535.10 105% 279.43 275.12 98%
> 2 4096 8109.17 9348.04 115% 274.29 211.49 77%
> 4 4096 6878.92 9319.13 135% 244.21 192.06 78%
> 8 4096 6265.92 9408.35 150% 211.85 159.26 75%
> 1 16384 8288.01 8596.39 103% 272.85 290.22 106%
> 2 16384 8166.29 9280.12 113% 277.04 236.61 85%
> 4 16384 6446.97 9382.22 145% 222.91 187.24 83%
> 8 16384 6066.98 9405.51 155% 198.98 157.09 78%
>
> 3) 2 vms each with 2 vcpus, 1q vs 2q - pin vhost/vcpu in the same node
>
> - 2 Guests to External Hosts TCP STREAM
> sessions size throughput1 throughput2   norm1 norm2
> 1 64 1442.07 1475.11 102% 30.82 31.21 101%
> 2 64 3124.87 2900.93 92% 40.29 35.95 89%
> 4 64 3166.52 2864.04 90% 40.70 35.47 87%
> 8 64 3141.45 2848.94 90% 40.38 35.34 87%
> 1 256 3628.54 3711.73 102% 68.47 70.22 102%
> 2 256 7806.95 7586.69 97% 111.23 84.38 75%
> 4 256 8823.65 7612.74 86% 132.92 85.04 63%
> 8 256 9194.89 9373.41 101% 135.98 119.62 87%
> 1 512 7106.67 7128.00 100% 124.79 124.30 99%
> 2 512 9190.22 9397.33 102% 180.84 149.34 82%
> 4 512 9401.01 9376.67 99% 173.00 140.15 81%
> 8 512 8572.84 9032.90 105% 150.49 127.58 84%
> 1 1024 9361.93 9379.24 100% 205.81 202.94 98%
> 2 1024 9386.69 9389.04 100% 201.78 165.75 82%
> 4 1024 9403.43 9378.54 99% 195.33 152.06 77%
> 8 1024 9213.63 9180.64 99% 178.99 141.51 79%
> 1 2048 9338.95 9384.67 100% 223.22 227.86 102%
> 2 2048 9389.28 9389.45 100% 202.37 170.08 84%
> 4 2048 9405.86 9388.71 99% 193.76 161.54 83%
> 8 2048 9352.40 9384.06 100% 189.16 157.06 83%
> 1 4096 9380.74 9384.90 100% 239.37 241.56 100%
> 2 4096 9393.47 9376.74 99% 213.84 195.61 91%
> 4 4096 9393.85 9381.50 99% 198.06 170.18 85%
> 8 4096 9400.41 9232.31 98% 192.87 163.56 84%
> 1 16384 9348.18 9335.55 99% 253.02 254.86 100%
> 2 16384 9384.97 9359.53 99% 218.56 208.59 95%
> 4 16384 9326.60 9382.15 100% 206.24 179.72 87%
> 8 16384 9355.82 9392.85 100% 198.22 172.89 87%
> - TCP RR
> sessions size throughput1 throughput2   norm1 norm2
> 50 1 200340.33 261750.19 130% 2935.27 3018.59 102%
> 100 1 236141.58 266304.49 112% 3452.16 3071.74 88%
> 250 1 361574.59 320825.08 88% 4972.98 3705.70 74%
> 50 64 225748.53 242671.12 107% 3011.48 2869.07 95%
> 100 64 249885.37 260453.72 104% 3240.21 3063.67 94%
> 250 64 360341.12 310775.60 86% 4682.42 3657.91 78%
> 50 128 227995.27 289320.38 126% 2950.92 3479.37 117%
> 100 128 239491.11 291135.77 121% 3099.55 3508.75 113%
> 250 128 390390.68 362484.35 92% 5042.30 4368.52 86%
> 50 256 222604.51 317140.97 142% 3058.08 3839.39 125%
> 100 256 254770.92 335606.03 131% 3326.16 4046.65 121%
> 250 256 400584.52 436749.22 109% 5220.79 5278.86 101%
> - External Host to 2 Guests
> sessions size throughput1 throughput2   norm1 norm2
> 1 64 1667.99 1684.50 100% 59.66 60.77 101%
> 2 64 3338.83 3379.97 101% 83.61 64.82 77%
> 4 64 6613.65 6619.11 100% 131.00 97.19 74%
> 8 64 6553.07 6418.31 97% 141.35 98.27 69%
> 1 256 3938.40 4068.52 103% 125.21 123.76 98%
> 2 256 9215.57 9210.88 99% 185.31 154.27 83%
> 4 256 9407.29 9008.13 95% 186.72 150.01 80%
> 8 256 9377.17 9385.57 100% 190.28 137.59 72%
> 1 512 7360.19 6984.80 94% 214.09 211.66 98%
> 2 512 9392.91 9401.88 100% 193.92 173.11 89%
> 4 512 9382.64 9394.34 100% 189.27 145.80 77%
> 8 512 9308.60 9094.08 97% 189.70 141.26 74%
> 1 1024 9153.26 9066.06 99% 223.07 219.95 98%
> 2 1024 9393.38 9398.43 100% 194.02 173.82 89%
> 4 1024 9395.92 8960.73 95% 192.61 145.82 75%
> 8 1024 9388.92 9399.08 100% 191.18 143.87 75%
> 1 2048 9355.32 9240.63 98% 221.50 223.03 100%
> 2 2048 9395.68 9399.62 100% 193.31 177.21 91%
> 4 2048 9397.67 9399.56 100% 195.25 157.53 80%
> 8 2048 9397.89 9401.70 100% 197.57 146.96 74%
> 1 4096 9375.84 9381.72 100% 223.06 225.06 100%
> 2 4096 9389.47 9396.00 100% 193.91 197.13 101%
> 4 4096 9397.45 9400.11 100% 192.33 163.60 85%
> 8 4096 9105.40 9415.76 103% 192.71 140.41 72%
> 1 16384 9381.53 9381.40 99% 223.53 225.66 100%
> 2 16384 9387.90 9395.44 100% 193.34 177.03 91%
> 4 16384 9397.92 9410.98 100% 195.04 151.14 77%
> 8 16384 9259.00 9419.48 101% 194.91 153.48 78%
>
> 4) Local vm to vm 2 vcpu 1q vs 2q - pin vcpu/thread in the same numa node
>
> - VM to VM TCP STREAM
> sessions size throughput1 throughput2   norm1 norm2
> 1 64 576.05 576.14 100% 12.25 12.32 100%
> 2 64 1266.75 1160.04 91% 19.10 16.05 84%
> 4 64 1267.34 1123.70 88% 19.08 15.51 81%
> 8 64 1230.88 1174.70 95% 18.53 15.58 84%
> 1 256 1311.00 1303.02 99% 25.34 25.35 100%
> 2 256 5400.26 2794.00 51% 75.92 36.43 47%
> 4 256 5200.67 2818.88 54% 72.81 33.92 46%
> 8 256 5234.55 2893.74 55% 73.10 34.97 47%
> 1 512 3244.09 3263.72 100% 56.48 56.65 100%
> 2 512 8172.16 4661.15 57% 119.05 67.89 57%
> 4 512 10567.44 7063.25 66% 147.76 77.27 52%
> 8 512 10477.87 8471.33 80% 145.94 102.91 70%
> 1 1024 5432.54 5333.99 98% 93.69 92.38 98%
> 2 1024 12590.24 9259.97 73% 185.37 135.28 72%
> 4 1024 15600.53 10731.93 68% 222.20 123.60 55%
> 8 1024 16222.87 10704.85 65% 227.05 113.81 50%
> 1 2048 6667.61 7484.37 112% 116.75 129.72 111%
> 2 2048 8180.43 11500.88 140% 137.84 156.64 113%
> 4 2048 15127.93 14416.16 95% 227.60 154.59 67%
> 8 2048 16381.79 14794.10 90% 244.29 158.45 64%
> 1 4096 7375.63 8948.90 121% 131.97 156.57 118%
> 2 4096 9321.16 14443.21 154% 161.24 163.74 101%
> 4 4096 13028.45 15984.94 122% 212.78 171.26 80%
> 8 4096 15611.28 18810.54 120% 245.15 198.65 81%
> 1 16384 15304.38 14202.08 92% 259.94 244.04 93%
> 2 16384 15508.97 15913.09 102% 261.30 244.26 93%
> 4 16384 14859.98 20164.34 135% 248.29 214.26 86%
> 8 16384 15594.59 19960.99 127% 253.79 211.27 83%
> - TCP RR
> sessions size throughput1 throughput2   norm1 norm2
> 50 1 54972.51 69820.99 127% 1133.58 1063.58 93%
> 100 1 55847.16 72407.93 129% 1155.73 1024.35 88%
> 250 1 60066.23 108266.50 180% 1114.30 1323.55 118%
> 50 64 48727.63 62378.32 128% 1014.29 888.78 87%
> 100 64 51804.65 69250.51 133% 1077.78 986.97 91%
> 250 64 61278.68 100015.78 163% 1076.93 1243.18 115%
> 50 256 51593.29 62046.22 120% 1069.14 871.08 81%
> 100 256 51647.00 68197.43 132% 1071.66 958.51 89%
> 250 256 60433.88 99072.59 163% 1072.41 1199.10 111%
> 50 512 52177.79 66483.77 127% 1082.65 960.82 88%
> 100 512 50351.67 62537.63 124% 1041.61 876.41 84%
> 250 512 60510.14 103856.79 171% 1055.21 1245.17 118%
>
>
> Jason Wang (4):
>    virtio_ring: move queue_index to vring_virtqueue
>    virtio: intorduce an API to set affinity for a virtqueue
>    virtio_net: multiqueue support
>    virtio_net: support negotiating the number of queues through ctrl vq
>
> Krishna Kumar (1):
>    virtio_net: Introduce VIRTIO_NET_F_MULTIQUEUE
>
>   drivers/net/virtio_net.c      |  792 +++++++++++++++++++++++++++++------------
>   drivers/virtio/virtio_mmio.c  |    5 +-
>   drivers/virtio/virtio_pci.c   |   58 +++-
>   drivers/virtio/virtio_ring.c  |   17 +
>   include/linux/virtio.h        |    4 +
>   include/linux/virtio_config.h |   21 ++
>   include/linux/virtio_net.h    |   10 +
>   7 files changed, 677 insertions(+), 230 deletions(-)
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [Ksummit-2012-discuss] SCSI Performance regression [was Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6]
From: Nicholas A. Bellinger @ 2012-07-06 22:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Jens Axboe, Anthony Liguori, kvm-devel, linux-scsi, lf-virt,
	James Bottomley, Anthony Liguori, target-devel,
	ksummit-2012-discuss, Paolo Bonzini, Zhi Yong Wu,
	Christoph Hellwig, Stefan Hajnoczi
In-Reply-To: <alpine.DEB.2.00.1207061528520.2656@router.home>

On Fri, 2012-07-06 at 15:30 -0500, Christoph Lameter wrote:
> On Fri, 6 Jul 2012, James Bottomley wrote:
> 
> > What people might pay attention to is evidence that there's a problem in
> > 3.5-rc6 (without any OFED crap).  If you're not going to bother
> > investigating, it has to be in an environment they can reproduce (so
> > ordinary hardware, not infiniband) otherwise it gets ignored as an
> > esoteric hardware issue.
> 
> The OFED stuff in the meantime is part of 3.5-rc6. Infiniband has been
> supported for a long time and its a very important technology given the
> problematic nature of ethernet at high network speeds.
> 
> OFED crap exists for those running RHEL5/6. The new enterprise distros are
> based on the 3.2 kernel which has pretty good Infiniband support
> out of the box.
> 

So I don't think the HCAs or Infiniband fabric was the limiting factor
for small block random I/O in the RHEL 6.2 w/ OFED vs. Windows Server
2008 R2 w/ OFED setup mentioned earlier.

I've seen both FC and iSCSI fabrics demonstrate the same type of random
small block I/O performance anomalies with Linux/SCSI clients too.  The
v3.x Linux/SCSI clients are certainly better in the multi-lun per host
small block random I/O case, but single LUN performance is (still)
lacking compared to everything else.

Also RHEL 6.2 does have the scsi-host-lock less bits in place now, but
it's been more a matter of converting OFED ib_srp code to run in
host-lock less mode to realize extra gains for multi-lun per host.

^ permalink raw reply

* Re: [Ksummit-2012-discuss] SCSI Performance regression [was Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6]
From: Christoph Lameter @ 2012-07-06 20:30 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Anthony Liguori, linux-scsi, kvm-devel, lf-virt,
	Anthony Liguori, target-devel, ksummit-2012-discuss,
	Paolo Bonzini, Zhi Yong Wu, Christoph Hellwig, Stefan Hajnoczi
In-Reply-To: <1341553397.3023.16.camel@dabdike.hilton.com>

On Fri, 6 Jul 2012, James Bottomley wrote:

> What people might pay attention to is evidence that there's a problem in
> 3.5-rc6 (without any OFED crap).  If you're not going to bother
> investigating, it has to be in an environment they can reproduce (so
> ordinary hardware, not infiniband) otherwise it gets ignored as an
> esoteric hardware issue.

The OFED stuff in the meantime is part of 3.5-rc6. Infiniband has been
supported for a long time and its a very important technology given the
problematic nature of ethernet at high network speeds.

OFED crap exists for those running RHEL5/6. The new enterprise distros are
based on the 3.2 kernel which has pretty good Infiniband support
out of the box.

^ permalink raw reply

* Re: [PATCH] virtio-scsi: Add vdrv->scan for post VIRTIO_CONFIG_S_DRIVER_OK LUN scanning
From: Nicholas A. Bellinger @ 2012-07-06 20:19 UTC (permalink / raw)
  To: target-devel
  Cc: Stefan Hajnoczi, kvm-devel, Michael S. Tsirkin, stable,
	Zhi Yong Wu, James Bottomley, linux-scsi, Paolo Bonzini, lf-virt,
	Christoph Hellwig
In-Reply-To: <1341605705-13114-1-git-send-email-nab@linux-iscsi.org>

Hi James,

Please consider picking this one up for your next scsi-rc-fixes PULL,
and it's CC'ed to stable following Paolo's request.

Thank you,

--nab

On Fri, 2012-07-06 at 20:15 +0000, Nicholas A. Bellinger wrote:
> From: Nicholas Bellinger <nab@linux-iscsi.org>
> 
> This patch changes virtio-scsi to use a new virtio_driver->scan() callback
> so that scsi_scan_host() can be properly invoked once virtio_dev_probe() has
> set add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK) to signal active virtio-ring
> operation, instead of from within virtscsi_probe().
> 
> This fixes a bug where SCSI LUN scanning for both virtio-scsi-raw and
> virtio-scsi/tcm_vhost setups was happening before VIRTIO_CONFIG_S_DRIVER_OK
> had been set, causing VIRTIO_SCSI_S_BAD_TARGET to occur.  This fixes a bug
> with virtio-scsi/tcm_vhost where LUN scan was not detecting LUNs.
> 
> Tested with virtio-scsi-raw + virtio-scsi/tcm_vhost w/ IBLOCK on 3.5-rc2 code.
> 
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> Cc: Zhi Yong Wu <wuzhy@cn.ibm.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
> ---
>  drivers/scsi/virtio_scsi.c |   15 ++++++++++++---
>  drivers/virtio/virtio.c    |    5 ++++-
>  include/linux/virtio.h     |    1 +
>  3 files changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index 1b38431..391b30d 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -481,9 +481,10 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev)
>  	err = scsi_add_host(shost, &vdev->dev);
>  	if (err)
>  		goto scsi_add_host_failed;
> -
> -	scsi_scan_host(shost);
> -
> +	/*
> +	 * scsi_scan_host() happens in virtscsi_scan() via virtio_driver->scan()
> +	 * after VIRTIO_CONFIG_S_DRIVER_OK has been set..
> +	 */
>  	return 0;
>  
>  scsi_add_host_failed:
> @@ -493,6 +494,13 @@ virtscsi_init_failed:
>  	return err;
>  }
>  
> +static void virtscsi_scan(struct virtio_device *vdev)
> +{
> +	struct Scsi_Host *shost = (struct Scsi_Host *)vdev->priv;
> +
> +	scsi_scan_host(shost);
> +}
> +
>  static void virtscsi_remove_vqs(struct virtio_device *vdev)
>  {
>  	/* Stop all the virtqueues. */
> @@ -537,6 +545,7 @@ static struct virtio_driver virtio_scsi_driver = {
>  	.driver.owner = THIS_MODULE,
>  	.id_table = id_table,
>  	.probe = virtscsi_probe,
> +	.scan = virtscsi_scan,
>  #ifdef CONFIG_PM
>  	.freeze = virtscsi_freeze,
>  	.restore = virtscsi_restore,
> diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
> index f355807..c3b3f7f 100644
> --- a/drivers/virtio/virtio.c
> +++ b/drivers/virtio/virtio.c
> @@ -141,8 +141,11 @@ static int virtio_dev_probe(struct device *_d)
>  	err = drv->probe(dev);
>  	if (err)
>  		add_status(dev, VIRTIO_CONFIG_S_FAILED);
> -	else
> +	else {
>  		add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
> +		if (drv->scan)
> +			drv->scan(dev);
> +	}
>  
>  	return err;
>  }
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index 8efd28a..a1ba8bb 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -92,6 +92,7 @@ struct virtio_driver {
>  	const unsigned int *feature_table;
>  	unsigned int feature_table_size;
>  	int (*probe)(struct virtio_device *dev);
> +	void (*scan)(struct virtio_device *dev);
>  	void (*remove)(struct virtio_device *dev);
>  	void (*config_changed)(struct virtio_device *dev);
>  #ifdef CONFIG_PM

^ permalink raw reply

* [PATCH] virtio-scsi: Add vdrv->scan for post VIRTIO_CONFIG_S_DRIVER_OK LUN scanning
From: Nicholas A. Bellinger @ 2012-07-06 20:15 UTC (permalink / raw)
  To: target-devel
  Cc: Stefan Hajnoczi, kvm-devel, Michael S. Tsirkin, Zhi Yong Wu,
	linux-scsi, Paolo Bonzini, stable, lf-virt, Christoph Hellwig

From: Nicholas Bellinger <nab@linux-iscsi.org>

This patch changes virtio-scsi to use a new virtio_driver->scan() callback
so that scsi_scan_host() can be properly invoked once virtio_dev_probe() has
set add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK) to signal active virtio-ring
operation, instead of from within virtscsi_probe().

This fixes a bug where SCSI LUN scanning for both virtio-scsi-raw and
virtio-scsi/tcm_vhost setups was happening before VIRTIO_CONFIG_S_DRIVER_OK
had been set, causing VIRTIO_SCSI_S_BAD_TARGET to occur.  This fixes a bug
with virtio-scsi/tcm_vhost where LUN scan was not detecting LUNs.

Tested with virtio-scsi-raw + virtio-scsi/tcm_vhost w/ IBLOCK on 3.5-rc2 code.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Cc: Zhi Yong Wu <wuzhy@cn.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
---
 drivers/scsi/virtio_scsi.c |   15 ++++++++++++---
 drivers/virtio/virtio.c    |    5 ++++-
 include/linux/virtio.h     |    1 +
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 1b38431..391b30d 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -481,9 +481,10 @@ static int __devinit virtscsi_probe(struct virtio_device *vdev)
 	err = scsi_add_host(shost, &vdev->dev);
 	if (err)
 		goto scsi_add_host_failed;
-
-	scsi_scan_host(shost);
-
+	/*
+	 * scsi_scan_host() happens in virtscsi_scan() via virtio_driver->scan()
+	 * after VIRTIO_CONFIG_S_DRIVER_OK has been set..
+	 */
 	return 0;
 
 scsi_add_host_failed:
@@ -493,6 +494,13 @@ virtscsi_init_failed:
 	return err;
 }
 
+static void virtscsi_scan(struct virtio_device *vdev)
+{
+	struct Scsi_Host *shost = (struct Scsi_Host *)vdev->priv;
+
+	scsi_scan_host(shost);
+}
+
 static void virtscsi_remove_vqs(struct virtio_device *vdev)
 {
 	/* Stop all the virtqueues. */
@@ -537,6 +545,7 @@ static struct virtio_driver virtio_scsi_driver = {
 	.driver.owner = THIS_MODULE,
 	.id_table = id_table,
 	.probe = virtscsi_probe,
+	.scan = virtscsi_scan,
 #ifdef CONFIG_PM
 	.freeze = virtscsi_freeze,
 	.restore = virtscsi_restore,
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index f355807..c3b3f7f 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -141,8 +141,11 @@ static int virtio_dev_probe(struct device *_d)
 	err = drv->probe(dev);
 	if (err)
 		add_status(dev, VIRTIO_CONFIG_S_FAILED);
-	else
+	else {
 		add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
+		if (drv->scan)
+			drv->scan(dev);
+	}
 
 	return err;
 }
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 8efd28a..a1ba8bb 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -92,6 +92,7 @@ struct virtio_driver {
 	const unsigned int *feature_table;
 	unsigned int feature_table_size;
 	int (*probe)(struct virtio_device *dev);
+	void (*scan)(struct virtio_device *dev);
 	void (*remove)(struct virtio_device *dev);
 	void (*config_changed)(struct virtio_device *dev);
 #ifdef CONFIG_PM
-- 
1.7.2.5

^ permalink raw reply related

* Re: SCSI Performance regression [was Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6]
From: Nicholas A. Bellinger @ 2012-07-06 18:21 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Anthony Liguori, kvm-devel, linux-scsi,
	Michael S. Tsirkin, lf-virt, Anthony Liguori, target-devel,
	ksummit-2012-discuss, Paolo Bonzini, Zhi Yong Wu,
	Christoph Hellwig, Stefan Hajnoczi
In-Reply-To: <1341582550.2941.10.camel@dabdike>

On Fri, 2012-07-06 at 17:49 +0400, James Bottomley wrote:
> On Fri, 2012-07-06 at 02:13 -0700, Nicholas A. Bellinger wrote:
> > On Fri, 2012-07-06 at 09:43 +0400, James Bottomley wrote:
> > > On Thu, 2012-07-05 at 20:01 -0700, Nicholas A. Bellinger wrote:
> > > 

<SNIP>

> > > > This bottleneck has been mentioned by various people (including myself)
> > > > on linux-scsi the last 18 months, and I've proposed that that it be
> > > > discussed at KS-2012 so we can start making some forward progress:
> > > 
> > > Well, no, it hasn't.  You randomly drop things like this into unrelated
> > > email (I suppose that is a mention in strict English construction) but
> > > it's not really enough to get anyone to pay attention since they mostly
> > > stopped reading at the top, if they got that far: most people just go by
> > > subject when wading through threads initially.
> > > 
> > 
> > It most certainly has been made clear to me, numerous times from many
> > people in the Linux/SCSI community that there is a bottleneck for small
> > block random I/O in SCSI core vs. raw Linux/Block, as well as vs. non
> > Linux based SCSI subsystems.
> > 
> > My apologies if mentioning this issue last year at LC 2011 to you
> > privately did not take a tone of a more serious nature, or that
> > proposing a topic for LSF-2012 this year was not a clear enough
> > indication of a problem with SCSI small block random I/O performance.
> > 
> > > But even if anyone noticed, a statement that RHEL6.2 (on a 2.6.32
> > > kernel, which is now nearly three years old) is 25% slower than W2k8R2
> > > on infiniband isn't really going to get anyone excited either
> > > (particularly when you mention OFED, which usually means a stack
> > > replacement on Linux anyway).
> > > 
> > 
> > The specific issue was first raised for .38 where we where able to get
> > most of the interesting high performance LLDs converted to using
> > internal locking methods so that host_lock did not have to be obtained
> > during each ->queuecommand() I/O dispatch, right..?
> > 
> > This has helped a good deal for large multi-lun scsi_host configs that
> > are now running in host-lock less mode, but there is still a large
> > discrepancy single LUN vs. raw struct block_device access even with LLD
> > host_lock less mode enabled.
> > 
> > Now I think the virtio-blk client performance is demonstrating this
> > issue pretty vividly, along with this week's tcm_vhost IBLOCK raw block
> > flash benchmarks that is demonstrate some other yet-to-be determined
> > limitations for virtio-scsi-raw vs. tcm_vhost for this particular fio
> > randrw workload.
> > 
> > > What people might pay attention to is evidence that there's a problem in
> > > 3.5-rc6 (without any OFED crap).  If you're not going to bother
> > > investigating, it has to be in an environment they can reproduce (so
> > > ordinary hardware, not infiniband) otherwise it gets ignored as an
> > > esoteric hardware issue.
> > > 
> > 
> > It's really quite simple for anyone to demonstrate the bottleneck
> > locally on any machine using tcm_loop with raw block flash.  Take a
> > struct block_device backend (like a Fusion IO /dev/fio*) and using
> > IBLOCK and export locally accessible SCSI LUNs via tcm_loop..
> > 
> > Using FIO there is a significant drop for randrw 4k performance between
> > tcm_loop <-> IBLOCK vs. raw struct block device backends.  And no, it's
> > not some type of target IBLOCK or tcm_loop bottleneck, it's a per SCSI
> > LUN limitation for small block random I/Os on the order of ~75K for each
> > SCSI LUN.
> 
> Here, you're saying here that the end to end SCSI stack tops out at
> around 75k iops, which is reasonably respectable if you don't employ any
> mitigation like queue steering and interrupt polling ... what were the
> mitigation techniques in the test you employed by the way?
> 

~75K per SCSI LUN in a multi-lun per host setup is being optimistic btw.
On the other side of the coin, the same pure block device can easily go
~200K per backend.-

For the simplest case with tcm_loop, a struct scsi_cmnd is queued via
cmwq to execute in process context -> submit the backend I/O.  Once
completed from IBLOCK, the I/O is run though a target completion wq, and
completed back to SCSI.

There is no fancy queue steering or interrupt polling going on (at least
not in tcm_loop) because it's a simple virtual SCSI LLD similar to
scsi_debug.

> But previously, you ascribed a performance drop of around 75% on
> virtio-scsi (topping out around 15-20k iops) to this same problem ...
> that doesn't really seem likely.
> 

No.  I ascribed the performance difference between virtio-scsi+tcm_vhost
vs. bare-metal raw block flash to this bottleneck in Linux/SCSI.

It's obvious that virtio-scsi-raw going through QEMU SCSI / block is
having some other shortcomings.

> Here's the rough ranges of concern:
> 
> 10K iops: standard arrays
> 100K iops: modern expensive fast flash drives on 6Gb links
> 1M iops: PCIe NVMexpress like devices
> 
> SCSI should do arrays with no problem at all, so I'd be really concerned
> that it can't make 0-20k iops.  If you push the system and fine tune it,
> SCSI can just about get to 100k iops.  1M iops is still a stretch goal
> for pure block drivers.
> 

1M iops is not a stretch for pure block drivers anymore on commodity
hardwrae.  5 Fusion-IO HBAs + Romley HW can easily go 1M random 4k IOPs
using a pure block driver.

The point is that it would currently take at least 2x the amount of SCSI
LUNs in order to even get close to 1M IOPs with an single LLD driver.
And from the feedback from everyone I've talked to, no one has been able
to make Linux/SCSI go 1M IOPs with any kernel.

--nab

^ permalink raw reply

* Re: [net-next RFC V5 0/5] Multiqueue virtio-net
From: Rick Jones @ 2012-07-06 16:23 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, habanero, mashirle, kvm, mst, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <4FF696C9.5070907@redhat.com>

On 07/06/2012 12:42 AM, Jason Wang wrote:
> I'm not expert of tcp, but looks like the changes are reasonable:
> - we can do full-sized TSO check in tcp_tso_should_defer() only for
> westwood, according to tcp westwood
> - run tcp_tso_should_defer for tso_segs = 1 when tso is enabled.

I'm sure Eric and David will weigh-in on the TCP change.  My initial 
inclination would have been to say "well, if multiqueue is draining 
faster, that means ACKs come-back faster, which means the "race" between 
more data being queued by netperf and ACKs will go more to the ACKs 
which means the segments being sent will be smaller - as TCP_NODELAY is 
not set, the Nagle algorithm is in force, which means once there is data 
outstanding on the connection, no more will be sent until either the 
outstanding data is ACKed, or there is an accumulation of > MSS worth of 
data to send.

>> Also, how are you combining the concurrent netperf results?  Are you
>> taking sums of what netperf reports, or are you gathering statistics
>> outside of netperf?
>>
>
> The throughput were just sumed from netperf result like what netperf
> manual suggests. The cpu utilization were measured by mpstat.

Which mechanism to address skew error?  The netperf manual describes 
more than one:

http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance

Personally, my preference these days is to use the "demo mode" method of 
aggregate results as it can be rather faster than (ab)using the 
confidence intervals mechanism, which I suspect may not really scale all 
that well to large numbers of concurrent netperfs.

I also tend to use the --enable-burst configure option to allow me to 
minimize the number of concurrent netperfs in the first place.  Set 
TCP_NODELAY (the test-specific -D option) and then have several 
transactions outstanding at one time (test-specific -b option with a 
number of additional in-flight transactions).

This is expressed in the runemomniaggdemo.sh script:

http://www.netperf.org/svn/netperf2/trunk/doc/examples/runemomniaggdemo.sh

which uses the find_max_burst.sh script:

http://www.netperf.org/svn/netperf2/trunk/doc/examples/find_max_burst.sh

to pick the burst size to use in the concurrent netperfs, the results of 
which can be post-processed with:

http://www.netperf.org/svn/netperf2/trunk/doc/examples/post_proc.py

The nice feature of using the "demo mode" mechanism is when it is 
coupled with systems with reasonably synchronized clocks (eg NTP) it can 
be used for many-to-many testing in addition to one-to-many testing 
(which cannot be dealt with by the confidence interval method of dealing 
with skew error)

>> A single instance TCP_RR test would help confirm/refute any
>> non-trivial change in (effective) path length between the two cases.
>>
>
> Yes, I would test this thanks.

Excellent.

happy benchmarking,

rick jones

^ permalink raw reply

* [RFC V3 5/5] virtio-net: add multiqueue support
From: Jason Wang @ 2012-07-06  9:31 UTC (permalink / raw)
  To: krkumar2, habanero, aliguori, rusty, mst, mashirle, qemu-devel,
	virtualization, tahm, jwhan, akong
  Cc: kvm
In-Reply-To: <1341567070-14136-1-git-send-email-jasowang@redhat.com>

Based on the multiqueue support for taps and NICState, this patch add the
capability of multiqueue for virtio-net. For userspace virtio-net emulation,
each pair of VLANClientState peers were abstracted as a tx/rx queue. For vhost,
the vhost net devices were created per virtio-net tx/rx queue pairs, so when
multiqueue is enabled, N vhost devices/threads were created for a N queues
virtio-net devices.

Since guest may not want to use all queues that qemu provided ( one example is
the old guest w/o multiqueue support). The files were attached/detached on
demand when guest set status for virtio_net.

This feature was negotiated through VIRTIO_NET_F_MULTIQUEUE. A new property
"queues" were added to virtio-net device to specify the number of queues it
supported. With this patch a virtio-net backend with N queues could be created
by:

qemu -netdev tap,id=hn0,queues=2 -device virtio-net-pci,netdev=hn0,queues=2

To let user tweak the performance, guest could negotiate the num of queues it
wishes to use through control virtqueue.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 hw/virtio-net.c |  505 ++++++++++++++++++++++++++++++++++++++-----------------
 hw/virtio-net.h |   12 ++
 2 files changed, 361 insertions(+), 156 deletions(-)

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 30eb4f4..afe69e8 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -21,39 +21,48 @@
 #include "virtio-net.h"
 #include "vhost_net.h"
 
-#define VIRTIO_NET_VM_VERSION    11
+#define VIRTIO_NET_VM_VERSION    12
 
 #define MAC_TABLE_ENTRIES    64
 #define MAX_VLAN    (1 << 12)   /* Per 802.1Q definition */
 
-typedef struct VirtIONet
+struct VirtIONet;
+
+typedef struct VirtIONetQueue
 {
-    VirtIODevice vdev;
-    uint8_t mac[ETH_ALEN];
-    uint16_t status;
     VirtQueue *rx_vq;
     VirtQueue *tx_vq;
-    VirtQueue *ctrl_vq;
-    NICState *nic;
     QEMUTimer *tx_timer;
     QEMUBH *tx_bh;
     uint32_t tx_timeout;
-    int32_t tx_burst;
     int tx_waiting;
-    uint32_t has_vnet_hdr;
-    uint8_t has_ufo;
     struct {
         VirtQueueElement elem;
         ssize_t len;
     } async_tx;
+    struct VirtIONet *n;
+    uint8_t vhost_started;
+} VirtIONetQueue;
+
+typedef struct VirtIONet
+{
+    VirtIODevice vdev;
+    uint8_t mac[ETH_ALEN];
+    uint16_t status;
+    VirtIONetQueue vqs[MAX_QUEUE_NUM];
+    VirtQueue *ctrl_vq;
+    NICState *nic;
+    int32_t tx_burst;
+    uint32_t has_vnet_hdr;
+    uint8_t has_ufo;
     int mergeable_rx_bufs;
+    int multiqueue;
     uint8_t promisc;
     uint8_t allmulti;
     uint8_t alluni;
     uint8_t nomulti;
     uint8_t nouni;
     uint8_t nobcast;
-    uint8_t vhost_started;
     struct {
         int in_use;
         int first_multi;
@@ -63,6 +72,8 @@ typedef struct VirtIONet
     } mac_table;
     uint32_t *vlans;
     DeviceState *qdev;
+    uint16_t queues;
+    uint16_t real_queues;
 } VirtIONet;
 
 /* TODO
@@ -74,12 +85,25 @@ static VirtIONet *to_virtio_net(VirtIODevice *vdev)
     return (VirtIONet *)vdev;
 }
 
+static int vq_get_pair_index(VirtIONet *n, VirtQueue *vq)
+{
+    int i;
+    for (i = 0; i < n->queues; i++) {
+        if (n->vqs[i].tx_vq == vq || n->vqs[i].rx_vq == vq) {
+            return i;
+        }
+    }
+    assert(1);
+    return -1;
+}
+
 static void virtio_net_get_config(VirtIODevice *vdev, uint8_t *config)
 {
     VirtIONet *n = to_virtio_net(vdev);
     struct virtio_net_config netcfg;
 
     stw_p(&netcfg.status, n->status);
+    netcfg.queues = n->queues * 2;
     memcpy(netcfg.mac, n->mac, ETH_ALEN);
     memcpy(config, &netcfg, sizeof(netcfg));
 }
@@ -103,78 +127,146 @@ static bool virtio_net_started(VirtIONet *n, uint8_t status)
         (n->status & VIRTIO_NET_S_LINK_UP) && n->vdev.vm_running;
 }
 
-static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
+static void virtio_net_vhost_status(VLANClientState *nc, VirtIONet *n,
+                                    uint8_t status)
 {
-    if (!n->nic->nc.peer) {
+    int queue_index = nc->queue_index;
+    VLANClientState *peer = nc->peer;
+    VirtIONetQueue *netq = &n->vqs[nc->queue_index];
+
+    if (!peer) {
         return;
     }
-    if (n->nic->nc.peer->info->type != NET_CLIENT_TYPE_TAP) {
+    if (peer->info->type != NET_CLIENT_TYPE_TAP) {
         return;
     }
 
-    if (!tap_get_vhost_net(n->nic->nc.peer)) {
+    if (!tap_get_vhost_net(peer)) {
         return;
     }
-    if (!!n->vhost_started == virtio_net_started(n, status) &&
-                              !n->nic->nc.peer->link_down) {
+    if (!!netq->vhost_started == virtio_net_started(n, status) &&
+                                 !peer->link_down) {
         return;
     }
-    if (!n->vhost_started) {
-        int r;
-        if (!vhost_net_query(tap_get_vhost_net(n->nic->nc.peer), &n->vdev)) {
+    if (!netq->vhost_started) {
+	int r;
+        if (!vhost_net_query(tap_get_vhost_net(peer), &n->vdev)) {
             return;
         }
-        r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), &n->vdev, 0);
+
+        r = vhost_net_start(tap_get_vhost_net(peer), &n->vdev,
+                            queue_index == 0 ? 0 : queue_index * 2 + 1);
         if (r < 0) {
             error_report("unable to start vhost net: %d: "
                          "falling back on userspace virtio", -r);
         } else {
-            n->vhost_started = 1;
+            netq->vhost_started = 1;
         }
     } else {
-        vhost_net_stop(tap_get_vhost_net(n->nic->nc.peer), &n->vdev);
-        n->vhost_started = 0;
+        vhost_net_stop(tap_get_vhost_net(peer), &n->vdev);
+        netq->vhost_started = 0;
     }
 }
 
-static void virtio_net_set_status(struct VirtIODevice *vdev, uint8_t status)
+static int peer_attach(VirtIONet *n, int index)
 {
-    VirtIONet *n = to_virtio_net(vdev);
+    if (!n->nic->ncs[index]->peer) {
+	return -1;
+    }
 
-    virtio_net_vhost_status(n, status);
+    if (n->nic->ncs[index]->peer->info->type != NET_CLIENT_TYPE_TAP) {
+	return -1;
+    }
 
-    if (!n->tx_waiting) {
-        return;
+    return tap_attach(n->nic->ncs[index]->peer);
+}
+
+static int peer_detach(VirtIONet *n, int index)
+{
+    if (!n->nic->ncs[index]->peer) {
+	return -1;
     }
 
-    if (virtio_net_started(n, status) && !n->vhost_started) {
-        if (n->tx_timer) {
-            qemu_mod_timer(n->tx_timer,
-                           qemu_get_clock_ns(vm_clock) + n->tx_timeout);
+    if (n->nic->ncs[index]->peer->info->type != NET_CLIENT_TYPE_TAP) {
+	return -1;
+    }
+
+    return tap_detach(n->nic->ncs[index]->peer);
+}
+
+static void virtio_net_set_queues(VirtIONet *n)
+{
+    int i;
+    for (i = 0; i < n->queues; i++) {
+        if ((!n->multiqueue && i != 0) || i >= n->real_queues) {
+            assert(peer_detach(n, i) == 0);
         } else {
-            qemu_bh_schedule(n->tx_bh);
+            assert(peer_attach(n, i) == 0);
         }
-    } else {
-        if (n->tx_timer) {
-            qemu_del_timer(n->tx_timer);
+    }
+}
+
+static void virtio_net_set_status(struct VirtIODevice *vdev, uint8_t status)
+{
+    VirtIONet *n = to_virtio_net(vdev);
+    int i;
+
+    virtio_net_set_queues(n);
+
+    for (i = 0; i < n->queues; i++) {
+        VirtIONetQueue *netq = &n->vqs[i];
+
+        if ((!n->multiqueue && i != 0) || i >= n->real_queues)
+            status = 0;
+
+        virtio_net_vhost_status(n->nic->ncs[i], n, status);
+
+        if (!netq->tx_waiting) {
+            continue;
+        }
+
+        if (virtio_net_started(n, status) && !netq->vhost_started) {
+            if (netq->tx_timer) {
+                qemu_mod_timer(netq->tx_timer,
+                               qemu_get_clock_ns(vm_clock) + netq->tx_timeout);
+            } else {
+                qemu_bh_schedule(netq->tx_bh);
+            }
         } else {
-            qemu_bh_cancel(n->tx_bh);
+            if (netq->tx_timer) {
+                qemu_del_timer(netq->tx_timer);
+            } else {
+                qemu_bh_cancel(netq->tx_bh);
+            }
+        }
+    }
+}
+
+static bool virtio_net_is_link_up(VirtIONet *n)
+{
+    int i;
+    for (i = 0; i < n->queues; i++) {
+        if (n->nic->ncs[i]->link_down) {
+            return false;
         }
     }
+    return true;
 }
 
 static void virtio_net_set_link_status(VLANClientState *nc)
 {
-    VirtIONet *n = DO_UPCAST(NICState, nc, nc)->opaque;
+    VirtIONet *n = ((NICState *)(nc->opaque))->opaque;
     uint16_t old_status = n->status;
 
-    if (nc->link_down)
-        n->status &= ~VIRTIO_NET_S_LINK_UP;
-    else
+    if (virtio_net_is_link_up(n)) {
         n->status |= VIRTIO_NET_S_LINK_UP;
+    } else {
+        n->status &= ~VIRTIO_NET_S_LINK_UP;
+    }
 
-    if (n->status != old_status)
+    if (n->status != old_status) {
         virtio_notify_config(&n->vdev);
+    }
 
     virtio_net_set_status(&n->vdev, n->vdev.status);
 }
@@ -190,6 +282,7 @@ static void virtio_net_reset(VirtIODevice *vdev)
     n->nomulti = 0;
     n->nouni = 0;
     n->nobcast = 0;
+    n->real_queues = n->queues;
 
     /* Flush any MAC and VLAN filter table state */
     n->mac_table.in_use = 0;
@@ -202,13 +295,15 @@ static void virtio_net_reset(VirtIODevice *vdev)
 
 static int peer_has_vnet_hdr(VirtIONet *n)
 {
-    if (!n->nic->nc.peer)
+    if (!n->nic->ncs[0]->peer) {
         return 0;
+    }
 
-    if (n->nic->nc.peer->info->type != NET_CLIENT_TYPE_TAP)
+    if (n->nic->ncs[0]->peer->info->type != NET_CLIENT_TYPE_TAP) {
         return 0;
+    }
 
-    n->has_vnet_hdr = tap_has_vnet_hdr(n->nic->nc.peer);
+    n->has_vnet_hdr = tap_has_vnet_hdr(n->nic->ncs[0]->peer);
 
     return n->has_vnet_hdr;
 }
@@ -218,7 +313,7 @@ static int peer_has_ufo(VirtIONet *n)
     if (!peer_has_vnet_hdr(n))
         return 0;
 
-    n->has_ufo = tap_has_ufo(n->nic->nc.peer);
+    n->has_ufo = tap_has_ufo(n->nic->ncs[0]->peer);
 
     return n->has_ufo;
 }
@@ -228,9 +323,13 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev, uint32_t features)
     VirtIONet *n = to_virtio_net(vdev);
 
     features |= (1 << VIRTIO_NET_F_MAC);
+    features |= (1 << VIRTIO_NET_F_MULTIQUEUE);
 
     if (peer_has_vnet_hdr(n)) {
-        tap_using_vnet_hdr(n->nic->nc.peer, 1);
+        int i;
+        for (i = 0; i < n->queues; i++) {
+            tap_using_vnet_hdr(n->nic->ncs[i]->peer, 1);
+        }
     } else {
         features &= ~(0x1 << VIRTIO_NET_F_CSUM);
         features &= ~(0x1 << VIRTIO_NET_F_HOST_TSO4);
@@ -248,14 +347,15 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev, uint32_t features)
         features &= ~(0x1 << VIRTIO_NET_F_HOST_UFO);
     }
 
-    if (!n->nic->nc.peer ||
-        n->nic->nc.peer->info->type != NET_CLIENT_TYPE_TAP) {
+    if (!n->nic->ncs[0]->peer ||
+        n->nic->ncs[0]->peer->info->type != NET_CLIENT_TYPE_TAP) {
         return features;
     }
-    if (!tap_get_vhost_net(n->nic->nc.peer)) {
+    if (!tap_get_vhost_net(n->nic->ncs[0]->peer)) {
         return features;
     }
-    return vhost_net_get_features(tap_get_vhost_net(n->nic->nc.peer), features);
+    return vhost_net_get_features(tap_get_vhost_net(n->nic->ncs[0]->peer),
+                                  features);
 }
 
 static uint32_t virtio_net_bad_features(VirtIODevice *vdev)
@@ -276,25 +376,36 @@ static uint32_t virtio_net_bad_features(VirtIODevice *vdev)
 static void virtio_net_set_features(VirtIODevice *vdev, uint32_t features)
 {
     VirtIONet *n = to_virtio_net(vdev);
+    int i;
 
     n->mergeable_rx_bufs = !!(features & (1 << VIRTIO_NET_F_MRG_RXBUF));
+    n->multiqueue = !!(features & (1 << VIRTIO_NET_F_MULTIQUEUE));
 
-    if (n->has_vnet_hdr) {
-        tap_set_offload(n->nic->nc.peer,
-                        (features >> VIRTIO_NET_F_GUEST_CSUM) & 1,
-                        (features >> VIRTIO_NET_F_GUEST_TSO4) & 1,
-                        (features >> VIRTIO_NET_F_GUEST_TSO6) & 1,
-                        (features >> VIRTIO_NET_F_GUEST_ECN)  & 1,
-                        (features >> VIRTIO_NET_F_GUEST_UFO)  & 1);
-    }
-    if (!n->nic->nc.peer ||
-        n->nic->nc.peer->info->type != NET_CLIENT_TYPE_TAP) {
-        return;
-    }
-    if (!tap_get_vhost_net(n->nic->nc.peer)) {
-        return;
+    if (!n->multiqueue)
+	    n->real_queues = 1;
+
+    /* attach the files for tap_set_offload */
+    virtio_net_set_queues(n);
+
+    for (i = 0; i < n->real_queues; i++) {
+        if (n->has_vnet_hdr) {
+            tap_set_offload(n->nic->ncs[i]->peer,
+                            (features >> VIRTIO_NET_F_GUEST_CSUM) & 1,
+                            (features >> VIRTIO_NET_F_GUEST_TSO4) & 1,
+                            (features >> VIRTIO_NET_F_GUEST_TSO6) & 1,
+                            (features >> VIRTIO_NET_F_GUEST_ECN)  & 1,
+                            (features >> VIRTIO_NET_F_GUEST_UFO)  & 1);
+        }
+        if (!n->nic->ncs[i]->peer ||
+            n->nic->ncs[i]->peer->info->type != NET_CLIENT_TYPE_TAP) {
+            continue;
+        }
+        if (!tap_get_vhost_net(n->nic->ncs[i]->peer)) {
+            continue;
+        }
+        vhost_net_ack_features(tap_get_vhost_net(n->nic->ncs[i]->peer),
+                               features);
     }
-    vhost_net_ack_features(tap_get_vhost_net(n->nic->nc.peer), features);
 }
 
 static int virtio_net_handle_rx_mode(VirtIONet *n, uint8_t cmd,
@@ -404,6 +515,26 @@ static int virtio_net_handle_vlan_table(VirtIONet *n, uint8_t cmd,
     return VIRTIO_NET_OK;
 }
 
+static int virtio_net_handle_multiqueue(VirtIONet *n, uint8_t cmd,
+                                        VirtQueueElement *elem)
+{
+    if (elem->out_num != 2 ||
+        elem->out_sg[1].iov_len != sizeof(n->real_queues)) {
+        error_report("virtio-net ctrl invalid multiqueue command");
+        return VIRTIO_NET_ERR;
+    }
+
+    n->real_queues = lduw_p(elem->out_sg[1].iov_base);
+    if (n->real_queues > n->queues) {
+	    return VIRTIO_NET_ERR;
+    }
+
+    virtio_net_set_status(&n->vdev, n->vdev.status);
+
+    return VIRTIO_NET_OK;
+}
+
+
 static void virtio_net_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIONet *n = to_virtio_net(vdev);
@@ -432,6 +563,8 @@ static void virtio_net_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
             status = virtio_net_handle_mac(n, ctrl.cmd, &elem);
         else if (ctrl.class == VIRTIO_NET_CTRL_VLAN)
             status = virtio_net_handle_vlan_table(n, ctrl.cmd, &elem);
+        else if (ctrl.class == VIRTIO_NET_CTRL_MULTIQUEUE)
+            status = virtio_net_handle_multiqueue(n, ctrl.cmd, &elem);
 
         stb_p(elem.in_sg[elem.in_num - 1].iov_base, status);
 
@@ -446,7 +579,7 @@ static void virtio_net_handle_rx(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIONet *n = to_virtio_net(vdev);
 
-    qemu_flush_queued_packets(&n->nic->nc);
+    qemu_flush_queued_packets(n->nic->ncs[vq_get_pair_index(n, vq)]);
 
     /* We now have RX buffers, signal to the IO thread to break out of the
      * select to re-poll the tap file descriptor */
@@ -455,36 +588,37 @@ static void virtio_net_handle_rx(VirtIODevice *vdev, VirtQueue *vq)
 
 static int virtio_net_can_receive(VLANClientState *nc)
 {
-    VirtIONet *n = DO_UPCAST(NICState, nc, nc)->opaque;
+    int queue_index = nc->queue_index;
+    VirtIONet *n = ((NICState *)nc->opaque)->opaque;
+
     if (!n->vdev.vm_running) {
         return 0;
     }
 
-    if (!virtio_queue_ready(n->rx_vq) ||
+    if (!virtio_queue_ready(n->vqs[queue_index].rx_vq) ||
         !(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
         return 0;
 
     return 1;
 }
 
-static int virtio_net_has_buffers(VirtIONet *n, int bufsize)
+static int virtio_net_has_buffers(VirtIONet *n, int bufsize, VirtQueue *vq)
 {
-    if (virtio_queue_empty(n->rx_vq) ||
-        (n->mergeable_rx_bufs &&
-         !virtqueue_avail_bytes(n->rx_vq, bufsize, 0))) {
-        virtio_queue_set_notification(n->rx_vq, 1);
+    if (virtio_queue_empty(vq) || (n->mergeable_rx_bufs &&
+        !virtqueue_avail_bytes(vq, bufsize, 0))) {
+        virtio_queue_set_notification(vq, 1);
 
         /* To avoid a race condition where the guest has made some buffers
          * available after the above check but before notification was
          * enabled, check for available buffers again.
          */
-        if (virtio_queue_empty(n->rx_vq) ||
-            (n->mergeable_rx_bufs &&
-             !virtqueue_avail_bytes(n->rx_vq, bufsize, 0)))
+        if (virtio_queue_empty(vq) || (n->mergeable_rx_bufs &&
+            !virtqueue_avail_bytes(vq, bufsize, 0))) {
             return 0;
+        }
     }
 
-    virtio_queue_set_notification(n->rx_vq, 0);
+    virtio_queue_set_notification(vq, 0);
     return 1;
 }
 
@@ -595,12 +729,15 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
 
 static ssize_t virtio_net_receive(VLANClientState *nc, const uint8_t *buf, size_t size)
 {
-    VirtIONet *n = DO_UPCAST(NICState, nc, nc)->opaque;
+    int queue_index = nc->queue_index;
+    VirtIONet *n = ((NICState *)(nc->opaque))->opaque;
+    VirtQueue *vq = n->vqs[queue_index].rx_vq;
     struct virtio_net_hdr_mrg_rxbuf *mhdr = NULL;
     size_t guest_hdr_len, offset, i, host_hdr_len;
 
-    if (!virtio_net_can_receive(&n->nic->nc))
+    if (!virtio_net_can_receive(n->nic->ncs[queue_index])) {
         return -1;
+    }
 
     /* hdr_len refers to the header we supply to the guest */
     guest_hdr_len = n->mergeable_rx_bufs ?
@@ -608,7 +745,7 @@ static ssize_t virtio_net_receive(VLANClientState *nc, const uint8_t *buf, size_
 
 
     host_hdr_len = n->has_vnet_hdr ? sizeof(struct virtio_net_hdr) : 0;
-    if (!virtio_net_has_buffers(n, size + guest_hdr_len - host_hdr_len))
+    if (!virtio_net_has_buffers(n, size + guest_hdr_len - host_hdr_len, vq))
         return 0;
 
     if (!receive_filter(n, buf, size))
@@ -623,7 +760,7 @@ static ssize_t virtio_net_receive(VLANClientState *nc, const uint8_t *buf, size_
 
         total = 0;
 
-        if (virtqueue_pop(n->rx_vq, &elem) == 0) {
+        if (virtqueue_pop(vq, &elem) == 0) {
             if (i == 0)
                 return -1;
             error_report("virtio-net unexpected empty queue: "
@@ -675,47 +812,50 @@ static ssize_t virtio_net_receive(VLANClientState *nc, const uint8_t *buf, size_
         }
 
         /* signal other side */
-        virtqueue_fill(n->rx_vq, &elem, total, i++);
+        virtqueue_fill(vq, &elem, total, i++);
     }
 
     if (mhdr) {
         stw_p(&mhdr->num_buffers, i);
     }
 
-    virtqueue_flush(n->rx_vq, i);
-    virtio_notify(&n->vdev, n->rx_vq);
+    virtqueue_flush(vq, i);
+    virtio_notify(&n->vdev, vq);
 
     return size;
 }
 
-static int32_t virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq);
+static int32_t virtio_net_flush_tx(VirtIONet *n, VirtIONetQueue *tvq);
 
 static void virtio_net_tx_complete(VLANClientState *nc, ssize_t len)
 {
-    VirtIONet *n = DO_UPCAST(NICState, nc, nc)->opaque;
+    VirtIONet *n = ((NICState *)nc->opaque)->opaque;
+    VirtIONetQueue *netq = &n->vqs[nc->queue_index];
 
-    virtqueue_push(n->tx_vq, &n->async_tx.elem, n->async_tx.len);
-    virtio_notify(&n->vdev, n->tx_vq);
+    virtqueue_push(netq->tx_vq, &netq->async_tx.elem, netq->async_tx.len);
+    virtio_notify(&n->vdev, netq->tx_vq);
 
-    n->async_tx.elem.out_num = n->async_tx.len = 0;
+    netq->async_tx.elem.out_num = netq->async_tx.len;
 
-    virtio_queue_set_notification(n->tx_vq, 1);
-    virtio_net_flush_tx(n, n->tx_vq);
+    virtio_queue_set_notification(netq->tx_vq, 1);
+    virtio_net_flush_tx(n, netq);
 }
 
 /* TX */
-static int32_t virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
+static int32_t virtio_net_flush_tx(VirtIONet *n, VirtIONetQueue *netq)
 {
     VirtQueueElement elem;
     int32_t num_packets = 0;
+    VirtQueue *vq = netq->tx_vq;
+
     if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK)) {
         return num_packets;
     }
 
     assert(n->vdev.vm_running);
 
-    if (n->async_tx.elem.out_num) {
-        virtio_queue_set_notification(n->tx_vq, 0);
+    if (netq->async_tx.elem.out_num) {
+        virtio_queue_set_notification(vq, 0);
         return num_packets;
     }
 
@@ -747,12 +887,12 @@ static int32_t virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
             len += hdr_len;
         }
 
-        ret = qemu_sendv_packet_async(&n->nic->nc, out_sg, out_num,
-                                      virtio_net_tx_complete);
+        ret = qemu_sendv_packet_async(n->nic->ncs[vq_get_pair_index(n, vq)],
+                                      out_sg, out_num, virtio_net_tx_complete);
         if (ret == 0) {
-            virtio_queue_set_notification(n->tx_vq, 0);
-            n->async_tx.elem = elem;
-            n->async_tx.len  = len;
+            virtio_queue_set_notification(vq, 0);
+            netq->async_tx.elem = elem;
+            netq->async_tx.len  = len;
             return -EBUSY;
         }
 
@@ -771,22 +911,23 @@ static int32_t virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 static void virtio_net_handle_tx_timer(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIONet *n = to_virtio_net(vdev);
+    VirtIONetQueue *netq = &n->vqs[vq_get_pair_index(n, vq)];
 
     /* This happens when device was stopped but VCPU wasn't. */
     if (!n->vdev.vm_running) {
-        n->tx_waiting = 1;
+        netq->tx_waiting = 1;
         return;
     }
 
-    if (n->tx_waiting) {
+    if (netq->tx_waiting) {
         virtio_queue_set_notification(vq, 1);
-        qemu_del_timer(n->tx_timer);
-        n->tx_waiting = 0;
-        virtio_net_flush_tx(n, vq);
+        qemu_del_timer(netq->tx_timer);
+        netq->tx_waiting = 0;
+        virtio_net_flush_tx(n, netq);
     } else {
-        qemu_mod_timer(n->tx_timer,
-                       qemu_get_clock_ns(vm_clock) + n->tx_timeout);
-        n->tx_waiting = 1;
+        qemu_mod_timer(netq->tx_timer,
+                       qemu_get_clock_ns(vm_clock) + netq->tx_timeout);
+        netq->tx_waiting = 1;
         virtio_queue_set_notification(vq, 0);
     }
 }
@@ -794,48 +935,53 @@ static void virtio_net_handle_tx_timer(VirtIODevice *vdev, VirtQueue *vq)
 static void virtio_net_handle_tx_bh(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIONet *n = to_virtio_net(vdev);
+    VirtIONetQueue *netq = &n->vqs[vq_get_pair_index(n, vq)];
 
-    if (unlikely(n->tx_waiting)) {
+    if (unlikely(netq->tx_waiting)) {
         return;
     }
-    n->tx_waiting = 1;
+    netq->tx_waiting = 1;
     /* This happens when device was stopped but VCPU wasn't. */
     if (!n->vdev.vm_running) {
         return;
     }
     virtio_queue_set_notification(vq, 0);
-    qemu_bh_schedule(n->tx_bh);
+    qemu_bh_schedule(netq->tx_bh);
 }
 
 static void virtio_net_tx_timer(void *opaque)
 {
-    VirtIONet *n = opaque;
+    VirtIONetQueue *netq = opaque;
+    VirtIONet *n = netq->n;
+
     assert(n->vdev.vm_running);
 
-    n->tx_waiting = 0;
+    netq->tx_waiting = 0;
 
     /* Just in case the driver is not ready on more */
     if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
         return;
 
-    virtio_queue_set_notification(n->tx_vq, 1);
-    virtio_net_flush_tx(n, n->tx_vq);
+    virtio_queue_set_notification(netq->tx_vq, 1);
+    virtio_net_flush_tx(n, netq);
 }
 
 static void virtio_net_tx_bh(void *opaque)
 {
-    VirtIONet *n = opaque;
+    VirtIONetQueue *netq = opaque;
+    VirtQueue *vq = netq->tx_vq;
+    VirtIONet *n = netq->n;
     int32_t ret;
 
     assert(n->vdev.vm_running);
 
-    n->tx_waiting = 0;
+    netq->tx_waiting = 0;
 
     /* Just in case the driver is not ready on more */
     if (unlikely(!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK)))
         return;
 
-    ret = virtio_net_flush_tx(n, n->tx_vq);
+    ret = virtio_net_flush_tx(n, netq);
     if (ret == -EBUSY) {
         return; /* Notification re-enable handled by tx_complete */
     }
@@ -843,33 +989,39 @@ static void virtio_net_tx_bh(void *opaque)
     /* If we flush a full burst of packets, assume there are
      * more coming and immediately reschedule */
     if (ret >= n->tx_burst) {
-        qemu_bh_schedule(n->tx_bh);
-        n->tx_waiting = 1;
+        qemu_bh_schedule(netq->tx_bh);
+        netq->tx_waiting = 1;
         return;
     }
 
     /* If less than a full burst, re-enable notification and flush
      * anything that may have come in while we weren't looking.  If
      * we find something, assume the guest is still active and reschedule */
-    virtio_queue_set_notification(n->tx_vq, 1);
-    if (virtio_net_flush_tx(n, n->tx_vq) > 0) {
-        virtio_queue_set_notification(n->tx_vq, 0);
-        qemu_bh_schedule(n->tx_bh);
-        n->tx_waiting = 1;
+    virtio_queue_set_notification(vq, 1);
+    if (virtio_net_flush_tx(n, netq) > 0) {
+        virtio_queue_set_notification(vq, 0);
+        qemu_bh_schedule(netq->tx_bh);
+        netq->tx_waiting = 1;
     }
 }
 
 static void virtio_net_save(QEMUFile *f, void *opaque)
 {
     VirtIONet *n = opaque;
+    int i;
 
     /* At this point, backend must be stopped, otherwise
      * it might keep writing to memory. */
-    assert(!n->vhost_started);
+    for (i = 0; i < n->queues; i++) {
+        assert(!n->vqs[i].vhost_started);
+    }
     virtio_save(&n->vdev, f);
 
     qemu_put_buffer(f, n->mac, ETH_ALEN);
-    qemu_put_be32(f, n->tx_waiting);
+    qemu_put_be32(f, n->queues);
+    for (i = 0; i < n->queues; i++) {
+        qemu_put_be32(f, n->vqs[i].tx_waiting);
+    }
     qemu_put_be32(f, n->mergeable_rx_bufs);
     qemu_put_be16(f, n->status);
     qemu_put_byte(f, n->promisc);
@@ -885,6 +1037,8 @@ static void virtio_net_save(QEMUFile *f, void *opaque)
     qemu_put_byte(f, n->nouni);
     qemu_put_byte(f, n->nobcast);
     qemu_put_byte(f, n->has_ufo);
+    qemu_put_be16(f, n->queues);
+    qemu_put_be16(f, n->real_queues);
 }
 
 static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
@@ -902,7 +1056,10 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
     }
 
     qemu_get_buffer(f, n->mac, ETH_ALEN);
-    n->tx_waiting = qemu_get_be32(f);
+    n->queues = qemu_get_be32(f);
+    for (i = 0; i < n->queues; i++) {
+        n->vqs[i].tx_waiting = qemu_get_be32(f);
+    }
     n->mergeable_rx_bufs = qemu_get_be32(f);
 
     if (version_id >= 3)
@@ -930,7 +1087,7 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
             n->mac_table.in_use = 0;
         }
     }
- 
+
     if (version_id >= 6)
         qemu_get_buffer(f, (uint8_t *)n->vlans, MAX_VLAN >> 3);
 
@@ -941,13 +1098,16 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
         }
 
         if (n->has_vnet_hdr) {
-            tap_using_vnet_hdr(n->nic->nc.peer, 1);
-            tap_set_offload(n->nic->nc.peer,
-                    (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_CSUM) & 1,
-                    (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_TSO4) & 1,
-                    (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_TSO6) & 1,
-                    (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_ECN)  & 1,
-                    (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_UFO)  & 1);
+            for(i = 0; i < n->queues; i++) {
+                tap_using_vnet_hdr(n->nic->ncs[i]->peer, 1);
+                tap_set_offload(n->nic->ncs[i]->peer,
+                        (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_CSUM) & 1,
+                        (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_TSO4) & 1,
+                        (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_TSO6) & 1,
+                        (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_ECN)  & 1,
+                        (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_UFO)  &
+                        1);
+           }
         }
     }
 
@@ -970,6 +1130,13 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
         }
     }
 
+    if (version_id >= 12) {
+        if (n->queues != qemu_get_be16(f)) {
+            error_report("virtio-net: the number of queues does not match");
+        }
+        n->real_queues = qemu_get_be16(f);
+    }
+
     /* Find the first multicast entry in the saved MAC filter */
     for (i = 0; i < n->mac_table.in_use; i++) {
         if (n->mac_table.macs[i * ETH_ALEN] & 1) {
@@ -982,7 +1149,7 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
 
 static void virtio_net_cleanup(VLANClientState *nc)
 {
-    VirtIONet *n = DO_UPCAST(NICState, nc, nc)->opaque;
+    VirtIONet *n = ((NICState *)nc->opaque)->opaque;
 
     n->nic = NULL;
 }
@@ -1000,6 +1167,7 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf,
                               virtio_net_conf *net)
 {
     VirtIONet *n;
+    int i;
 
     n = (VirtIONet *)virtio_common_init("virtio-net", VIRTIO_ID_NET,
                                         sizeof(struct virtio_net_config),
@@ -1012,7 +1180,6 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf,
     n->vdev.bad_features = virtio_net_bad_features;
     n->vdev.reset = virtio_net_reset;
     n->vdev.set_status = virtio_net_set_status;
-    n->rx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_rx);
 
     if (net->tx && strcmp(net->tx, "timer") && strcmp(net->tx, "bh")) {
         error_report("virtio-net: "
@@ -1021,15 +1188,6 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf,
         error_report("Defaulting to \"bh\"");
     }
 
-    if (net->tx && !strcmp(net->tx, "timer")) {
-        n->tx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx_timer);
-        n->tx_timer = qemu_new_timer_ns(vm_clock, virtio_net_tx_timer, n);
-        n->tx_timeout = net->txtimer;
-    } else {
-        n->tx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx_bh);
-        n->tx_bh = qemu_bh_new(virtio_net_tx_bh, n);
-    }
-    n->ctrl_vq = virtio_add_queue(&n->vdev, 64, virtio_net_handle_ctrl);
     qemu_macaddr_default_if_unset(&conf->macaddr);
     memcpy(&n->mac[0], &conf->macaddr, sizeof(n->mac));
     n->status = VIRTIO_NET_S_LINK_UP;
@@ -1038,7 +1196,6 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf,
 
     qemu_format_nic_info_str(&n->nic->nc, conf->macaddr.a);
 
-    n->tx_waiting = 0;
     n->tx_burst = net->txburst;
     n->mergeable_rx_bufs = 0;
     n->promisc = 1; /* for compatibility */
@@ -1046,6 +1203,33 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf,
     n->mac_table.macs = g_malloc0(MAC_TABLE_ENTRIES * ETH_ALEN);
 
     n->vlans = g_malloc0(MAX_VLAN >> 3);
+    n->queues = conf->queues;
+    n->real_queues = n->queues;
+
+    /* Allocate per rx/tx vq's */
+    for (i = 0; i < n->queues; i++) {
+        n->vqs[i].rx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_rx);
+        if (net->tx && !strcmp(net->tx, "timer")) {
+            n->vqs[i].tx_vq = virtio_add_queue(&n->vdev, 256,
+                                               virtio_net_handle_tx_timer);
+            n->vqs[i].tx_timer = qemu_new_timer_ns(vm_clock,
+                                                   virtio_net_tx_timer,
+                                                   &n->vqs[i]);
+            n->vqs[i].tx_timeout = net->txtimer;
+        } else {
+            n->vqs[i].tx_vq = virtio_add_queue(&n->vdev, 256,
+                                               virtio_net_handle_tx_bh);
+            n->vqs[i].tx_bh = qemu_bh_new(virtio_net_tx_bh, &n->vqs[i]);
+        }
+
+        n->vqs[i].tx_waiting = 0;
+        n->vqs[i].n = n;
+
+        if (i == 0) {
+            /* keep compatiable with spec and old guest */
+            n->ctrl_vq = virtio_add_queue(&n->vdev, 64, virtio_net_handle_ctrl);
+        }
+    }
 
     n->qdev = dev;
     register_savevm(dev, "virtio-net", -1, VIRTIO_NET_VM_VERSION,
@@ -1059,24 +1243,33 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf,
 void virtio_net_exit(VirtIODevice *vdev)
 {
     VirtIONet *n = DO_UPCAST(VirtIONet, vdev, vdev);
+    int i;
 
     /* This will stop vhost backend if appropriate. */
     virtio_net_set_status(vdev, 0);
 
-    qemu_purge_queued_packets(&n->nic->nc);
+    for (i = 0; i < n->queues; i++) {
+        qemu_purge_queued_packets(n->nic->ncs[i]);
+    }
 
     unregister_savevm(n->qdev, "virtio-net", n);
 
     g_free(n->mac_table.macs);
     g_free(n->vlans);
 
-    if (n->tx_timer) {
-        qemu_del_timer(n->tx_timer);
-        qemu_free_timer(n->tx_timer);
-    } else {
-        qemu_bh_delete(n->tx_bh);
+    for (i = 0; i < n->queues; i++) {
+        VirtIONetQueue *netq = &n->vqs[i];
+        if (netq->tx_timer) {
+            qemu_del_timer(netq->tx_timer);
+            qemu_free_timer(netq->tx_timer);
+        } else {
+            qemu_bh_delete(netq->tx_bh);
+        }
     }
 
-    qemu_del_vlan_client(&n->nic->nc);
     virtio_cleanup(&n->vdev);
+
+    for (i = 0; i < n->queues; i++) {
+        qemu_del_vlan_client(n->nic->ncs[i]);
+    }
 }
diff --git a/hw/virtio-net.h b/hw/virtio-net.h
index 36aa463..987b5dd 100644
--- a/hw/virtio-net.h
+++ b/hw/virtio-net.h
@@ -44,6 +44,7 @@
 #define VIRTIO_NET_F_CTRL_RX    18      /* Control channel RX mode support */
 #define VIRTIO_NET_F_CTRL_VLAN  19      /* Control channel VLAN filtering */
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20   /* Extra RX mode control support */
+#define VIRTIO_NET_F_MULTIQUEUE   22
 
 #define VIRTIO_NET_S_LINK_UP    1       /* Link is up */
 
@@ -72,6 +73,8 @@ struct virtio_net_config
     uint8_t mac[ETH_ALEN];
     /* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
     uint16_t status;
+
+    uint16_t queues;
 } QEMU_PACKED;
 
 /* This is the first element of the scatter-gather list.  If you don't
@@ -168,6 +171,15 @@ struct virtio_net_ctrl_mac {
  #define VIRTIO_NET_CTRL_VLAN_ADD             0
  #define VIRTIO_NET_CTRL_VLAN_DEL             1
 
+/* Control Multiqueue
+ *
+ */
+struct virtio_net_ctrl_multiqueue {
+    uint16_t num_queue_pairs;
+};
+#define VIRTIO_NET_CTRL_MULTIQUEUE    4
+ #define VIRTIO_NET_CTRL_MULTIQUEUE_QNUM        0
+
 #define DEFINE_VIRTIO_NET_FEATURES(_state, _field) \
         DEFINE_VIRTIO_COMMON_FEATURES(_state, _field), \
         DEFINE_PROP_BIT("csum", _state, _field, VIRTIO_NET_F_CSUM, true), \
-- 
1.7.1

^ permalink raw reply related

* [RFC V3 4/5] vhost: multiqueue support
From: Jason Wang @ 2012-07-06  9:31 UTC (permalink / raw)
  To: krkumar2, habanero, aliguori, rusty, mst, mashirle, qemu-devel,
	virtualization, tahm, jwhan, akong
  Cc: kvm
In-Reply-To: <1341567070-14136-1-git-send-email-jasowang@redhat.com>

This patch converts the vhost to support multiqueue queues. It implement a 1:1
mapping of vhost devs and tap fds. That it to say, the patch creates and uses
N vhost devs as the backend of the N queues virtio-net deivce.

The main work is to convert the virtqueue index into vhost queue index, this is
done by introducing an vq_index filed in vhost_dev struct to record the index of
first virtuque that is used by the vhost devs. Then vhost could simply convert
it to the local vhost queue index and issue ioctls.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 hw/vhost.c      |   53 ++++++++++++++++++++++++++++++++++-------------------
 hw/vhost.h      |    2 ++
 hw/vhost_net.c  |    7 +++++--
 hw/vhost_net.h  |    2 +-
 hw/virtio-net.c |    2 +-
 5 files changed, 43 insertions(+), 23 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 43664e7..3eb6037 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -620,11 +620,12 @@ static int vhost_virtqueue_init(struct vhost_dev *dev,
 {
     target_phys_addr_t s, l, a;
     int r;
+    int vhost_vq_index = (idx > 2 ? idx - 1 : idx) % dev->nvqs;
     struct vhost_vring_file file = {
-        .index = idx,
+        .index = vhost_vq_index
     };
     struct vhost_vring_state state = {
-        .index = idx,
+        .index = vhost_vq_index
     };
     struct VirtQueue *vvq = virtio_get_queue(vdev, idx);
 
@@ -670,11 +671,12 @@ static int vhost_virtqueue_init(struct vhost_dev *dev,
         goto fail_alloc_ring;
     }
 
-    r = vhost_virtqueue_set_addr(dev, vq, idx, dev->log_enabled);
+    r = vhost_virtqueue_set_addr(dev, vq, vhost_vq_index, dev->log_enabled);
     if (r < 0) {
         r = -errno;
         goto fail_alloc;
     }
+
     file.fd = event_notifier_get_fd(virtio_queue_get_host_notifier(vvq));
     r = ioctl(dev->control, VHOST_SET_VRING_KICK, &file);
     if (r) {
@@ -715,7 +717,7 @@ static void vhost_virtqueue_cleanup(struct vhost_dev *dev,
                                     unsigned idx)
 {
     struct vhost_vring_state state = {
-        .index = idx,
+        .index = (idx > 2 ? idx - 1 : idx) % dev->nvqs,
     };
     int r;
     r = ioctl(dev->control, VHOST_GET_VRING_BASE, &state);
@@ -829,7 +831,9 @@ int vhost_dev_enable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
     }
 
     for (i = 0; i < hdev->nvqs; ++i) {
-        r = vdev->binding->set_host_notifier(vdev->binding_opaque, i, true);
+        r = vdev->binding->set_host_notifier(vdev->binding_opaque,
+					     hdev->vq_index + i,
+					     true);
         if (r < 0) {
             fprintf(stderr, "vhost VQ %d notifier binding failed: %d\n", i, -r);
             goto fail_vq;
@@ -839,7 +843,9 @@ int vhost_dev_enable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
     return 0;
 fail_vq:
     while (--i >= 0) {
-        r = vdev->binding->set_host_notifier(vdev->binding_opaque, i, false);
+        r = vdev->binding->set_host_notifier(vdev->binding_opaque,
+					     hdev->vq_index + i,
+					     false);
         if (r < 0) {
             fprintf(stderr, "vhost VQ %d notifier cleanup error: %d\n", i, -r);
             fflush(stderr);
@@ -860,7 +866,9 @@ void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
     int i, r;
 
     for (i = 0; i < hdev->nvqs; ++i) {
-        r = vdev->binding->set_host_notifier(vdev->binding_opaque, i, false);
+        r = vdev->binding->set_host_notifier(vdev->binding_opaque,
+                                             hdev->vq_index + i,
+                                             false);
         if (r < 0) {
             fprintf(stderr, "vhost VQ %d notifier cleanup failed: %d\n", i, -r);
             fflush(stderr);
@@ -879,10 +887,12 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
         goto fail;
     }
 
-    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
-    if (r < 0) {
-        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
-        goto fail_notifiers;
+    if (hdev->vq_index == 0) {
+        r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
+        if (r < 0) {
+            fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+            goto fail_notifiers;
+        }
     }
 
     r = vhost_dev_set_features(hdev, hdev->log_enabled);
@@ -898,7 +908,7 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
         r = vhost_virtqueue_init(hdev,
                                  vdev,
                                  hdev->vqs + i,
-                                 i);
+                                 hdev->vq_index + i);
         if (r < 0) {
             goto fail_vq;
         }
@@ -925,8 +935,9 @@ fail_vq:
         vhost_virtqueue_cleanup(hdev,
                                 vdev,
                                 hdev->vqs + i,
-                                i);
+                                hdev->vq_index + i);
     }
+    i = hdev->nvqs;
 fail_mem:
 fail_features:
     vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
@@ -944,18 +955,22 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
         vhost_virtqueue_cleanup(hdev,
                                 vdev,
                                 hdev->vqs + i,
-                                i);
+                                hdev->vq_index + i);
     }
+
     for (i = 0; i < hdev->n_mem_sections; ++i) {
         vhost_sync_dirty_bitmap(hdev, &hdev->mem_sections[i],
                                 0, (target_phys_addr_t)~0x0ull);
     }
-    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
-    if (r < 0) {
-        fprintf(stderr, "vhost guest notifier cleanup failed: %d\n", r);
-        fflush(stderr);
+
+    if (hdev->vq_index == 0) {
+	r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
+	if (r < 0) {
+	    fprintf(stderr, "vhost guest notifier cleanup failed: %d\n", r);
+	    fflush(stderr);
+	}
+	assert (r >= 0);
     }
-    assert (r >= 0);
 
     hdev->started = false;
     g_free(hdev->log);
diff --git a/hw/vhost.h b/hw/vhost.h
index 80e64df..edfe9d2 100644
--- a/hw/vhost.h
+++ b/hw/vhost.h
@@ -34,6 +34,8 @@ struct vhost_dev {
     MemoryRegionSection *mem_sections;
     struct vhost_virtqueue *vqs;
     int nvqs;
+    /* the first virtuque which would be used by this vhost dev */
+    int vq_index;
     unsigned long long features;
     unsigned long long acked_features;
     unsigned long long backend_features;
diff --git a/hw/vhost_net.c b/hw/vhost_net.c
index f672e9d..7cc3339 100644
--- a/hw/vhost_net.c
+++ b/hw/vhost_net.c
@@ -138,13 +138,15 @@ bool vhost_net_query(VHostNetState *net, VirtIODevice *dev)
 }
 
 int vhost_net_start(struct vhost_net *net,
-                    VirtIODevice *dev)
+                    VirtIODevice *dev,
+                    int vq_index)
 {
     struct vhost_vring_file file = { };
     int r;
 
     net->dev.nvqs = 2;
     net->dev.vqs = net->vqs;
+    net->dev.vq_index = vq_index;
 
     r = vhost_dev_enable_notifiers(&net->dev, dev);
     if (r < 0) {
@@ -227,7 +229,8 @@ bool vhost_net_query(VHostNetState *net, VirtIODevice *dev)
 }
 
 int vhost_net_start(struct vhost_net *net,
-		    VirtIODevice *dev)
+                    VirtIODevice *dev,
+		int vq_index)
 {
     return -ENOSYS;
 }
diff --git a/hw/vhost_net.h b/hw/vhost_net.h
index 91e40b1..6282b2c 100644
--- a/hw/vhost_net.h
+++ b/hw/vhost_net.h
@@ -9,7 +9,7 @@ typedef struct vhost_net VHostNetState;
 VHostNetState *vhost_net_init(VLANClientState *backend, int devfd, bool force);
 
 bool vhost_net_query(VHostNetState *net, VirtIODevice *dev);
-int vhost_net_start(VHostNetState *net, VirtIODevice *dev);
+int vhost_net_start(VHostNetState *net, VirtIODevice *dev, int vq_index);
 void vhost_net_stop(VHostNetState *net, VirtIODevice *dev);
 
 void vhost_net_cleanup(VHostNetState *net);
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 3f190d4..30eb4f4 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -124,7 +124,7 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
         if (!vhost_net_query(tap_get_vhost_net(n->nic->nc.peer), &n->vdev)) {
             return;
         }
-        r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), &n->vdev);
+        r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), &n->vdev, 0);
         if (r < 0) {
             error_report("unable to start vhost net: %d: "
                          "falling back on userspace virtio", -r);
-- 
1.7.1

^ permalink raw reply related

* [RFC V3 3/5] net: multiqueue support
From: Jason Wang @ 2012-07-06  9:31 UTC (permalink / raw)
  To: krkumar2, habanero, aliguori, rusty, mst, mashirle, qemu-devel,
	virtualization, tahm, jwhan, akong
  Cc: kvm
In-Reply-To: <1341567070-14136-1-git-send-email-jasowang@redhat.com>

This patch adds the multiqueues support for emulated nics. Each VLANClientState
pairs are now abstract as a queue instead of a nic, and multiple VLANClientState
pointers were stored in the NICState. A queue_index were also introduced to let
the emulated nics know which queue the packet were came from or sent
out. Virtio-net would be the first user.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 hw/dp8393x.c         |    2 +-
 hw/mcf_fec.c         |    2 +-
 hw/qdev-properties.c |   34 +++++++++++++++++----
 hw/qdev.h            |    3 +-
 net.c                |   79 +++++++++++++++++++++++++++++++++++++++++---------
 net.h                |   16 +++++++--
 net/tap.c            |    2 +-
 7 files changed, 110 insertions(+), 28 deletions(-)

diff --git a/hw/dp8393x.c b/hw/dp8393x.c
index 017d074..483a868 100644
--- a/hw/dp8393x.c
+++ b/hw/dp8393x.c
@@ -900,7 +900,7 @@ void dp83932_init(NICInfo *nd, target_phys_addr_t base, int it_shift,
 
     s->conf.macaddr = nd->macaddr;
     s->conf.vlan = nd->vlan;
-    s->conf.peer = nd->netdev;
+    s->conf.peers[0] = nd->netdev;
 
     s->nic = qemu_new_nic(&net_dp83932_info, &s->conf, nd->model, nd->name, s);
 
diff --git a/hw/mcf_fec.c b/hw/mcf_fec.c
index ae37bef..69f508d 100644
--- a/hw/mcf_fec.c
+++ b/hw/mcf_fec.c
@@ -473,7 +473,7 @@ void mcf_fec_init(MemoryRegion *sysmem, NICInfo *nd,
 
     s->conf.macaddr = nd->macaddr;
     s->conf.vlan = nd->vlan;
-    s->conf.peer = nd->netdev;
+    s->conf.peers[0] = nd->netdev;
 
     s->nic = qemu_new_nic(&net_mcf_fec_info, &s->conf, nd->model, nd->name, s);
 
diff --git a/hw/qdev-properties.c b/hw/qdev-properties.c
index 9ae3187..88e97e9 100644
--- a/hw/qdev-properties.c
+++ b/hw/qdev-properties.c
@@ -554,16 +554,38 @@ PropertyInfo qdev_prop_chr = {
 
 static int parse_netdev(DeviceState *dev, const char *str, void **ptr)
 {
-    VLANClientState *netdev = qemu_find_netdev(str);
+    VLANClientState ***nc = (VLANClientState ***)ptr;
+    VLANClientState *vcs[MAX_QUEUE_NUM];
+    int queues, i = 0;
+    int ret;
 
-    if (netdev == NULL) {
-        return -ENOENT;
+    *nc = g_malloc(MAX_QUEUE_NUM * sizeof(VLANClientState *));
+    queues = qemu_find_netdev_all(str, vcs, MAX_QUEUE_NUM);
+    if (queues == 0) {
+        ret = -ENOENT;
+        goto err;
     }
-    if (netdev->peer) {
-        return -EEXIST;
+
+    for (i = 0; i < queues; i++) {
+        if (vcs[i] == NULL) {
+            ret = -ENOENT;
+            goto err;
+        }
+
+        if (vcs[i]->peer) {
+            ret = -EEXIST;
+            goto err;
+        }
+
+        (*nc)[i] = vcs[i];
+	vcs[i]->queue_index = i;
     }
-    *ptr = netdev;
+
     return 0;
+
+err:
+    g_free(*nc);
+    return ret;
 }
 
 static const char *print_netdev(void *ptr)
diff --git a/hw/qdev.h b/hw/qdev.h
index 5386b16..1c023b4 100644
--- a/hw/qdev.h
+++ b/hw/qdev.h
@@ -248,6 +248,7 @@ extern PropertyInfo qdev_prop_blocksize;
         .defval    = (bool)_defval,                              \
         }
 
+
 #define DEFINE_PROP_UINT8(_n, _s, _f, _d)                       \
     DEFINE_PROP_DEFAULT(_n, _s, _f, _d, qdev_prop_uint8, uint8_t)
 #define DEFINE_PROP_UINT16(_n, _s, _f, _d)                      \
@@ -274,7 +275,7 @@ extern PropertyInfo qdev_prop_blocksize;
 #define DEFINE_PROP_STRING(_n, _s, _f)             \
     DEFINE_PROP(_n, _s, _f, qdev_prop_string, char*)
 #define DEFINE_PROP_NETDEV(_n, _s, _f)             \
-    DEFINE_PROP(_n, _s, _f, qdev_prop_netdev, VLANClientState*)
+    DEFINE_PROP(_n, _s, _f, qdev_prop_netdev, VLANClientState**)
 #define DEFINE_PROP_VLAN(_n, _s, _f)             \
     DEFINE_PROP(_n, _s, _f, qdev_prop_vlan, VLANState*)
 #define DEFINE_PROP_DRIVE(_n, _s, _f) \
diff --git a/net.c b/net.c
index eabe830..f5db537 100644
--- a/net.c
+++ b/net.c
@@ -238,16 +238,33 @@ NICState *qemu_new_nic(NetClientInfo *info,
 {
     VLANClientState *nc;
     NICState *nic;
+    int i;
 
     assert(info->type == NET_CLIENT_TYPE_NIC);
     assert(info->size >= sizeof(NICState));
 
-    nc = qemu_new_net_client(info, conf->vlan, conf->peer, model, name);
+    if (conf->peers) {
+        nc = qemu_new_net_client(info, NULL, conf->peers[0], model, name);
+    } else {
+        nc = qemu_new_net_client(info, conf->vlan, NULL, model, name);
+    }
 
     nic = DO_UPCAST(NICState, nc, nc);
     nic->conf = conf;
     nic->opaque = opaque;
 
+    /* For compatiablity with single queue nic */
+    nic->ncs[0] = nc;
+    nc->opaque = nic;
+
+    for (i = 1 ; i < conf->queues; i++) {
+        VLANClientState *vc = qemu_new_net_client(info, NULL, conf->peers[i],
+                                                  model, name);
+        vc->opaque = nic;
+        nic->ncs[i] = vc;
+        vc->queue_index = i;
+    }
+
     return nic;
 }
 
@@ -283,11 +300,10 @@ void qemu_del_vlan_client(VLANClientState *vc)
 {
     /* If there is a peer NIC, delete and cleanup client, but do not free. */
     if (!vc->vlan && vc->peer && vc->peer->info->type == NET_CLIENT_TYPE_NIC) {
-        NICState *nic = DO_UPCAST(NICState, nc, vc->peer);
-        if (nic->peer_deleted) {
+        if (vc->peer_deleted) {
             return;
         }
-        nic->peer_deleted = true;
+        vc->peer_deleted = true;
         /* Let NIC know peer is gone. */
         vc->peer->link_down = true;
         if (vc->peer->info->link_status_changed) {
@@ -299,8 +315,7 @@ void qemu_del_vlan_client(VLANClientState *vc)
 
     /* If this is a peer NIC and peer has already been deleted, free it now. */
     if (!vc->vlan && vc->peer && vc->info->type == NET_CLIENT_TYPE_NIC) {
-        NICState *nic = DO_UPCAST(NICState, nc, vc);
-        if (nic->peer_deleted) {
+        if (vc->peer_deleted) {
             qemu_free_vlan_client(vc->peer);
         }
     }
@@ -342,14 +357,14 @@ void qemu_foreach_nic(qemu_nic_foreach func, void *opaque)
 
     QTAILQ_FOREACH(nc, &non_vlan_clients, next) {
         if (nc->info->type == NET_CLIENT_TYPE_NIC) {
-            func(DO_UPCAST(NICState, nc, nc), opaque);
+            func((NICState *)nc->opaque, opaque);
         }
     }
 
     QTAILQ_FOREACH(vlan, &vlans, next) {
         QTAILQ_FOREACH(nc, &vlan->clients, next) {
             if (nc->info->type == NET_CLIENT_TYPE_NIC) {
-                func(DO_UPCAST(NICState, nc, nc), opaque);
+                func((NICState *)nc->opaque, opaque);
             }
         }
     }
@@ -674,6 +689,26 @@ VLANClientState *qemu_find_netdev(const char *id)
     return NULL;
 }
 
+int qemu_find_netdev_all(const char *id, VLANClientState **vcs, int max)
+{
+    VLANClientState *vc;
+    int ret = 0;
+
+    QTAILQ_FOREACH(vc, &non_vlan_clients, next) {
+        if (vc->info->type == NET_CLIENT_TYPE_NIC) {
+            continue;
+        }
+        if (!strcmp(vc->name, id) && ret < max) {
+            vcs[ret++] = vc;
+        }
+        if (ret >= max) {
+            break;
+        }
+    }
+
+    return ret;
+}
+
 static int nic_get_free_idx(void)
 {
     int index;
@@ -1275,22 +1310,27 @@ exit_err:
 
 void qmp_netdev_del(const char *id, Error **errp)
 {
-    VLANClientState *vc;
+    VLANClientState *vcs[MAX_QUEUE_NUM];
+    int queues, i;
 
-    vc = qemu_find_netdev(id);
-    if (!vc) {
+    queues = qemu_find_netdev_all(id, vcs, MAX_QUEUE_NUM);
+    if (queues == 0) {
         error_set(errp, QERR_DEVICE_NOT_FOUND, id);
         return;
     }
 
-    qemu_del_vlan_client(vc);
+    for (i = 0; i < queues; i++) {
+        /* FIXME: pointer check? */
+        qemu_del_vlan_client(vcs[i]);
+    }
     qemu_opts_del(qemu_opts_find(qemu_find_opts_err("netdev", errp), id));
 }
 
 static void print_net_client(Monitor *mon, VLANClientState *vc)
 {
-    monitor_printf(mon, "%s: type=%s,%s\n", vc->name,
-                   net_client_types[vc->info->type].type, vc->info_str);
+    monitor_printf(mon, "%s: type=%s,%s, queue=%d\n", vc->name,
+                   net_client_types[vc->info->type].type, vc->info_str,
+                   vc->queue_index);
 }
 
 void do_info_network(Monitor *mon)
@@ -1326,6 +1366,17 @@ void qmp_set_link(const char *name, bool up, Error **errp)
 {
     VLANState *vlan;
     VLANClientState *vc = NULL;
+    VLANClientState *vcs[MAX_QUEUE_NUM];
+    int queues, i;
+
+    queues = qemu_find_netdev_all(name, vcs, MAX_QUEUE_NUM);
+    if (queues) {
+	    for (i = 1; i < queues; i++) {
+		    vcs[i]->link_down = !up;
+	    }
+	    vc = vcs[0];
+	    goto done;
+    }
 
     QTAILQ_FOREACH(vlan, &vlans, next) {
         QTAILQ_FOREACH(vc, &vlan->clients, next) {
diff --git a/net.h b/net.h
index bdc2a06..40378ce 100644
--- a/net.h
+++ b/net.h
@@ -12,20 +12,24 @@ struct MACAddr {
     uint8_t a[6];
 };
 
+#define MAX_QUEUE_NUM 32
+
 /* qdev nic properties */
 
 typedef struct NICConf {
     MACAddr macaddr;
     VLANState *vlan;
-    VLANClientState *peer;
+    VLANClientState **peers;
     int32_t bootindex;
+    int32_t queues;
 } NICConf;
 
 #define DEFINE_NIC_PROPERTIES(_state, _conf)                            \
     DEFINE_PROP_MACADDR("mac",   _state, _conf.macaddr),                \
     DEFINE_PROP_VLAN("vlan",     _state, _conf.vlan),                   \
-    DEFINE_PROP_NETDEV("netdev", _state, _conf.peer),                   \
-    DEFINE_PROP_INT32("bootindex", _state, _conf.bootindex, -1)
+    DEFINE_PROP_NETDEV("netdev", _state, _conf.peers),                   \
+    DEFINE_PROP_INT32("bootindex", _state, _conf.bootindex, -1),        \
+    DEFINE_PROP_INT32("queues", _state, _conf.queues, 1)
 
 /* VLANs support */
 
@@ -72,13 +76,16 @@ struct VLANClientState {
     char *name;
     char info_str[256];
     unsigned receive_disabled : 1;
+    unsigned int queue_index;
+    bool peer_deleted;
+    void *opaque;
 };
 
 typedef struct NICState {
     VLANClientState nc;
+    VLANClientState *ncs[MAX_QUEUE_NUM];
     NICConf *conf;
     void *opaque;
-    bool peer_deleted;
 } NICState;
 
 struct VLANState {
@@ -90,6 +97,7 @@ struct VLANState {
 
 VLANState *qemu_find_vlan(int id, int allocate);
 VLANClientState *qemu_find_netdev(const char *id);
+int qemu_find_netdev_all(const char *id, VLANClientState **vcs, int max);
 VLANClientState *qemu_new_net_client(NetClientInfo *info,
                                      VLANState *vlan,
                                      VLANClientState *peer,
diff --git a/net/tap.c b/net/tap.c
index 65457e5..4ce83b8 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -290,7 +290,7 @@ static void tap_cleanup(VLANClientState *nc)
 
     qemu_purge_queued_packets(nc);
 
-    if (s->down_script[0])
+    if (s->down_script[0] && nc->queue_index == 0)
         launch_script(s->down_script, s->down_script_arg, s->fd);
 
     tap_read_poll(s, 0);
-- 
1.7.1

^ permalink raw reply related

* [RFC V3 2/5] tap: multiqueue support
From: Jason Wang @ 2012-07-06  9:31 UTC (permalink / raw)
  To: krkumar2, habanero, aliguori, rusty, mst, mashirle, qemu-devel,
	virtualization, tahm, jwhan, akong
  Cc: kvm
In-Reply-To: <1341567070-14136-1-git-send-email-jasowang@redhat.com>

Some operating system ( such as Linux ) supports multiqueue tap, this is done
through attaching multiple sockets to the net device and expose multiple file
descriptors.

This patch let qemu utilizes this kind of backend, and introduces helpter for:

- creating a multiple capable tap device
- increase and decrease the number of queues by introducing helpter to attach or
  detach a file descriptor to the device

Then qemu can use this as the infrastructures of emulating a multiple queue
capable network cards.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 net.c             |    4 +
 net/tap-aix.c     |   13 +++-
 net/tap-bsd.c     |   13 +++-
 net/tap-haiku.c   |   13 +++-
 net/tap-linux.c   |   56 +++++++++++++++-
 net/tap-linux.h   |    4 +
 net/tap-solaris.c |   13 +++-
 net/tap-win32.c   |   11 +++
 net/tap.c         |  197 ++++++++++++++++++++++++++++++++++-------------------
 net/tap.h         |    7 ++-
 10 files changed, 253 insertions(+), 78 deletions(-)

diff --git a/net.c b/net.c
index 4aa416c..eabe830 100644
--- a/net.c
+++ b/net.c
@@ -978,6 +978,10 @@ static const struct {
                 .name = "vhostforce",
                 .type = QEMU_OPT_BOOL,
                 .help = "force vhost on for non-MSIX virtio guests",
+            }, {
+                .name = "queues",
+                .type = QEMU_OPT_NUMBER,
+                .help = "number of queues the backend can provides",
         },
 #endif /* _WIN32 */
             { /* end of list */ }
diff --git a/net/tap-aix.c b/net/tap-aix.c
index e19aaba..f111e0f 100644
--- a/net/tap-aix.c
+++ b/net/tap-aix.c
@@ -25,7 +25,8 @@
 #include "net/tap.h"
 #include <stdio.h>
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int attach)
 {
     fprintf(stderr, "no tap on AIX\n");
     return -1;
@@ -59,3 +60,13 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
                         int tso6, int ecn, int ufo)
 {
 }
+
+int tap_fd_attach(int fd, const char *ifname)
+{
+    return -1;
+}
+
+int tap_fd_detach(int fd, const char *ifname)
+{
+    return -1;
+}
diff --git a/net/tap-bsd.c b/net/tap-bsd.c
index 937a94b..44f3421 100644
--- a/net/tap-bsd.c
+++ b/net/tap-bsd.c
@@ -33,7 +33,8 @@
 #include <net/if_tap.h>
 #endif
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int attach)
 {
     int fd;
 #ifdef TAPGIFNAME
@@ -145,3 +146,13 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
                         int tso6, int ecn, int ufo)
 {
 }
+
+int tap_fd_attach(int fd, const char *ifname)
+{
+    return -1;
+}
+
+int tap_fd_detach(int fd, const char *ifname)
+{
+    return -1;
+}
diff --git a/net/tap-haiku.c b/net/tap-haiku.c
index 91dda8e..6fb6719 100644
--- a/net/tap-haiku.c
+++ b/net/tap-haiku.c
@@ -25,7 +25,8 @@
 #include "net/tap.h"
 #include <stdio.h>
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int attach)
 {
     fprintf(stderr, "no tap on Haiku\n");
     return -1;
@@ -59,3 +60,13 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
                         int tso6, int ecn, int ufo)
 {
 }
+
+int tap_fd_attach(int fd, const char *ifname)
+{
+    return -1;
+}
+
+int tap_fd_detach(int fd, const char *ifname)
+{
+    return -1;
+}
diff --git a/net/tap-linux.c b/net/tap-linux.c
index 41d581b..ed0afe9 100644
--- a/net/tap-linux.c
+++ b/net/tap-linux.c
@@ -35,7 +35,8 @@
 
 #define PATH_NET_TUN "/dev/net/tun"
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int attach)
 {
     struct ifreq ifr;
     int fd, ret;
@@ -47,6 +48,8 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required
     }
     memset(&ifr, 0, sizeof(ifr));
     ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
+    if (!attach)
+        ifr.ifr_flags |= IFF_MULTI_QUEUE;
 
     if (*vnet_hdr) {
         unsigned int features;
@@ -71,7 +74,11 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required
         pstrcpy(ifr.ifr_name, IFNAMSIZ, ifname);
     else
         pstrcpy(ifr.ifr_name, IFNAMSIZ, "tap%d");
-    ret = ioctl(fd, TUNSETIFF, (void *) &ifr);
+    if (attach) {
+        ifr.ifr_flags |= IFF_ATTACH_QUEUE;
+        ret = ioctl(fd, TUNSETQUEUE, (void *) &ifr);
+    } else
+        ret = ioctl(fd, TUNSETIFF, (void *) &ifr);
     if (ret != 0) {
         if (ifname[0] != '\0') {
             error_report("could not configure %s (%s): %m", PATH_NET_TUN, ifr.ifr_name);
@@ -197,3 +204,48 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
         }
     }
 }
+
+/* Attach a file descriptor to a TUN/TAP device. This descriptor should be
+ * detached before.
+ */
+int tap_fd_attach(int fd, const char *ifname)
+{
+    struct ifreq ifr;
+    int ret;
+
+    memset(&ifr, 0, sizeof(ifr));
+
+    ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR | IFF_ATTACH_QUEUE;
+    pstrcpy(ifr.ifr_name, IFNAMSIZ, ifname);
+
+    ret = ioctl(fd, TUNSETQUEUE, (void *) &ifr);
+
+    if (ret != 0) {
+        error_report("could not attach to %s", ifname);
+    }
+
+    return ret;
+}
+
+/* Detach a file descriptor to a TUN/TAP device. This file descriptor must have
+ * been attach to a device.
+ */
+int tap_fd_detach(int fd, const char *ifname)
+{
+    struct ifreq ifr;
+    int ret;
+
+    memset(&ifr, 0, sizeof(ifr));
+
+    ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR | IFF_DETACH_QUEUE;
+    pstrcpy(ifr.ifr_name, IFNAMSIZ, ifname);
+
+    ret = ioctl(fd, TUNSETQUEUE, (void *) &ifr);
+
+    if (ret != 0) {
+        error_report("could not detach to %s", ifname);
+    }
+
+    return ret;
+}
+
diff --git a/net/tap-linux.h b/net/tap-linux.h
index 659e981..648d29f 100644
--- a/net/tap-linux.h
+++ b/net/tap-linux.h
@@ -29,6 +29,7 @@
 #define TUNSETSNDBUF   _IOW('T', 212, int)
 #define TUNGETVNETHDRSZ _IOR('T', 215, int)
 #define TUNSETVNETHDRSZ _IOW('T', 216, int)
+#define TUNSETQUEUE  _IOW('T', 217, int)
 
 #endif
 
@@ -36,6 +37,9 @@
 #define IFF_TAP		0x0002
 #define IFF_NO_PI	0x1000
 #define IFF_VNET_HDR	0x4000
+#define IFF_MULTI_QUEUE 0x0100
+#define IFF_ATTACH_QUEUE 0x0200
+#define IFF_DETACH_QUEUE 0x0400
 
 /* Features for GSO (TUNSETOFFLOAD). */
 #define TUN_F_CSUM	0x01	/* You can hand me unchecksummed packets. */
diff --git a/net/tap-solaris.c b/net/tap-solaris.c
index cf76463..f7c8e8d 100644
--- a/net/tap-solaris.c
+++ b/net/tap-solaris.c
@@ -173,7 +173,8 @@ static int tap_alloc(char *dev, size_t dev_size)
     return tap_fd;
 }
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int attach)
 {
     char  dev[10]="";
     int fd;
@@ -225,3 +226,13 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
                         int tso6, int ecn, int ufo)
 {
 }
+
+int tap_fd_attach(int fd, const char *ifname)
+{
+    return -1;
+}
+
+int tap_fd_detach(int fd, const char *ifname)
+{
+    return -1;
+}
diff --git a/net/tap-win32.c b/net/tap-win32.c
index a801a55..dae1c00 100644
--- a/net/tap-win32.c
+++ b/net/tap-win32.c
@@ -749,3 +749,14 @@ struct vhost_net *tap_get_vhost_net(VLANClientState *nc)
 {
     return NULL;
 }
+
+int tap_attach(VLANClientState *nc)
+{
+    return -1;
+}
+
+int tap_detach(VLANClientState *nc)
+{
+    return -1;
+}
+
diff --git a/net/tap.c b/net/tap.c
index 5ac4ba3..65457e5 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -53,11 +53,13 @@ typedef struct TAPState {
     int fd;
     char down_script[1024];
     char down_script_arg[128];
+    char ifname[128];
     uint8_t buf[TAP_BUFSIZE];
     unsigned int read_poll : 1;
     unsigned int write_poll : 1;
     unsigned int using_vnet_hdr : 1;
     unsigned int has_ufo: 1;
+    unsigned int enabled:1;
     VHostNetState *vhost_net;
     unsigned host_vnet_hdr_len;
 } TAPState;
@@ -248,8 +250,10 @@ void tap_set_vnet_hdr_len(VLANClientState *nc, int len)
     assert(len == sizeof(struct virtio_net_hdr_mrg_rxbuf) ||
            len == sizeof(struct virtio_net_hdr));
 
-    tap_fd_set_vnet_hdr_len(s->fd, len);
-    s->host_vnet_hdr_len = len;
+    if (s->enabled) {
+        tap_fd_set_vnet_hdr_len(s->fd, len);
+        s->host_vnet_hdr_len = len;
+    }
 }
 
 void tap_using_vnet_hdr(VLANClientState *nc, int using_vnet_hdr)
@@ -546,7 +550,7 @@ int net_init_bridge(QemuOpts *opts, const char *name, VLANState *vlan)
     return 0;
 }
 
-static int net_tap_init(QemuOpts *opts, int *vnet_hdr)
+static int net_tap_init(QemuOpts *opts, int *vnet_hdr, int attach)
 {
     int fd, vnet_hdr_required;
     char ifname[128] = {0,};
@@ -563,7 +567,9 @@ static int net_tap_init(QemuOpts *opts, int *vnet_hdr)
         vnet_hdr_required = 0;
     }
 
-    TFR(fd = tap_open(ifname, sizeof(ifname), vnet_hdr, vnet_hdr_required));
+    TFR(fd = tap_open(ifname, sizeof(ifname), vnet_hdr, vnet_hdr_required,
+                      attach));
+
     if (fd < 0) {
         return -1;
     }
@@ -572,7 +578,7 @@ static int net_tap_init(QemuOpts *opts, int *vnet_hdr)
     if (setup_script &&
         setup_script[0] != '\0' &&
         strcmp(setup_script, "no") != 0 &&
-        launch_script(setup_script, ifname, fd)) {
+        (!attach && launch_script(setup_script, ifname, fd))) {
         close(fd);
         return -1;
     }
@@ -582,74 +588,11 @@ static int net_tap_init(QemuOpts *opts, int *vnet_hdr)
     return fd;
 }
 
-int net_init_tap(QemuOpts *opts, const char *name, VLANState *vlan)
+static int __net_init_tap(QemuOpts *opts, Monitor *mon, const char *name,
+                          VLANState *vlan, int fd, int vnet_hdr)
 {
-    TAPState *s;
-    int fd, vnet_hdr = 0;
-    const char *model;
-
-    if (qemu_opt_get(opts, "fd")) {
-        if (qemu_opt_get(opts, "ifname") ||
-            qemu_opt_get(opts, "script") ||
-            qemu_opt_get(opts, "downscript") ||
-            qemu_opt_get(opts, "vnet_hdr") ||
-            qemu_opt_get(opts, "helper")) {
-            error_report("ifname=, script=, downscript=, vnet_hdr=, "
-                         "and helper= are invalid with fd=");
-            return -1;
-        }
-
-        fd = net_handle_fd_param(cur_mon, qemu_opt_get(opts, "fd"));
-        if (fd == -1) {
-            return -1;
-        }
+    TAPState *s = net_tap_fd_init(vlan, "tap", name, fd, vnet_hdr);
 
-        fcntl(fd, F_SETFL, O_NONBLOCK);
-
-        vnet_hdr = tap_probe_vnet_hdr(fd);
-
-        model = "tap";
-
-    } else if (qemu_opt_get(opts, "helper")) {
-        if (qemu_opt_get(opts, "ifname") ||
-            qemu_opt_get(opts, "script") ||
-            qemu_opt_get(opts, "downscript") ||
-            qemu_opt_get(opts, "vnet_hdr")) {
-            error_report("ifname=, script=, downscript=, and vnet_hdr= "
-                         "are invalid with helper=");
-            return -1;
-        }
-
-        fd = net_bridge_run_helper(qemu_opt_get(opts, "helper"),
-                                   DEFAULT_BRIDGE_INTERFACE);
-        if (fd == -1) {
-            return -1;
-        }
-
-        fcntl(fd, F_SETFL, O_NONBLOCK);
-
-        vnet_hdr = tap_probe_vnet_hdr(fd);
-
-        model = "bridge";
-
-    } else {
-        if (!qemu_opt_get(opts, "script")) {
-            qemu_opt_set(opts, "script", DEFAULT_NETWORK_SCRIPT);
-        }
-
-        if (!qemu_opt_get(opts, "downscript")) {
-            qemu_opt_set(opts, "downscript", DEFAULT_NETWORK_DOWN_SCRIPT);
-        }
-
-        fd = net_tap_init(opts, &vnet_hdr);
-        if (fd == -1) {
-            return -1;
-        }
-
-        model = "tap";
-    }
-
-    s = net_tap_fd_init(vlan, model, name, fd, vnet_hdr);
     if (!s) {
         close(fd);
         return -1;
@@ -671,6 +614,7 @@ int net_init_tap(QemuOpts *opts, const char *name, VLANState *vlan)
         script     = qemu_opt_get(opts, "script");
         downscript = qemu_opt_get(opts, "downscript");
 
+        pstrcpy(s->ifname, sizeof(s->ifname), ifname);
         snprintf(s->nc.info_str, sizeof(s->nc.info_str),
                  "ifname=%s,script=%s,downscript=%s",
                  ifname, script, downscript);
@@ -704,6 +648,84 @@ int net_init_tap(QemuOpts *opts, const char *name, VLANState *vlan)
         return -1;
     }
 
+    s->enabled = 1;
+    return 0;
+}
+
+int net_init_tap(QemuOpts *opts, const char *name, VLANState *vlan)
+{
+    int i, fd, vnet_hdr = 0;
+    int numqueues = qemu_opt_get_number(opts, "queues", 1);
+
+    if (qemu_opt_get(opts, "fd")) {
+        const char *fdp[16];
+        if (qemu_opt_get(opts, "ifname") ||
+            qemu_opt_get(opts, "script") ||
+            qemu_opt_get(opts, "downscript") ||
+            qemu_opt_get(opts, "vnet_hdr") ||
+            qemu_opt_get(opts, "helper")) {
+            error_report("ifname=, script=, downscript=, vnet_hdr=, "
+                         "and helper= are invalid with fd=");
+            return -1;
+        }
+
+        if (numqueues != qemu_opt_get_all(opts, "fd", fdp, 16)) {
+            error_report("the number of queue does not match the"
+                         "number of fd passed");
+            return -1;
+        }
+
+        for (i = 0; i < numqueues; i++) {
+            fd = net_handle_fd_param(cur_mon, fdp[i]);
+            if (fd == -1) {
+                return -1;
+            }
+
+            fcntl(fd, F_SETFL, O_NONBLOCK);
+
+            vnet_hdr = tap_probe_vnet_hdr(fd);
+
+            if (__net_init_tap(opts, cur_mon, name, vlan, fd, vnet_hdr))
+                return -1;
+        }
+    } else if (qemu_opt_get(opts, "helper")) {
+        if (qemu_opt_get(opts, "ifname") ||
+            qemu_opt_get(opts, "script") ||
+            qemu_opt_get(opts, "downscript") ||
+            qemu_opt_get(opts, "vnet_hdr") ||
+            numqueues > 1) {
+            error_report("ifname=, script=, downscript=, and vnet_hdr=, "
+                         "and queues > 1 are invalid with helper=");
+            return -1;
+        }
+
+        fd = net_bridge_run_helper(qemu_opt_get(opts, "helper"),
+                                   DEFAULT_BRIDGE_INTERFACE);
+        if (fd == -1) {
+            return -1;
+        }
+
+        fcntl(fd, F_SETFL, O_NONBLOCK);
+
+        vnet_hdr = tap_probe_vnet_hdr(fd);
+    } else {
+        if (!qemu_opt_get(opts, "script")) {
+            qemu_opt_set(opts, "script", DEFAULT_NETWORK_SCRIPT);
+        }
+
+        if (!qemu_opt_get(opts, "downscript")) {
+            qemu_opt_set(opts, "downscript", DEFAULT_NETWORK_DOWN_SCRIPT);
+        }
+
+        for (i = 0; i < numqueues; i++) {
+            fd = net_tap_init(opts, &vnet_hdr, i != 0);
+            if (fd == -1) {
+                return -1;
+            }
+            if(__net_init_tap(opts, cur_mon, name, vlan, fd, vnet_hdr))
+                return -1;
+        }
+    }
     return 0;
 }
 
@@ -713,3 +735,36 @@ VHostNetState *tap_get_vhost_net(VLANClientState *nc)
     assert(nc->info->type == NET_CLIENT_TYPE_TAP);
     return s->vhost_net;
 }
+
+int tap_attach(VLANClientState *nc)
+{
+    TAPState *s = DO_UPCAST(TAPState, nc, nc);
+    int ret;
+
+    if (s->enabled) {
+        return 0;
+    } else {
+        ret = tap_fd_attach(s->fd, s->ifname);
+        if (ret == 0) {
+            s->enabled = 1;
+        }
+        return ret;
+    }
+}
+
+int tap_detach(VLANClientState *nc)
+{
+    TAPState *s = DO_UPCAST(TAPState, nc, nc);
+    int ret;
+
+    if (s->enabled == 0) {
+        return 0;
+    } else {
+        ret = tap_fd_detach(s->fd, s->ifname);
+        if (ret == 0) {
+            s->enabled = 0;
+        }
+        return ret;
+    }
+}
+
diff --git a/net/tap.h b/net/tap.h
index b2a9450..cead7ca 100644
--- a/net/tap.h
+++ b/net/tap.h
@@ -34,7 +34,8 @@
 
 int net_init_tap(QemuOpts *opts, const char *name, VLANState *vlan);
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required);
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int attach);
 
 ssize_t tap_read_packet(int tapfd, uint8_t *buf, int maxlen);
 
@@ -51,6 +52,10 @@ int tap_probe_vnet_hdr_len(int fd, int len);
 int tap_probe_has_ufo(int fd);
 void tap_fd_set_offload(int fd, int csum, int tso4, int tso6, int ecn, int ufo);
 void tap_fd_set_vnet_hdr_len(int fd, int len);
+int tap_attach(VLANClientState *vc);
+int tap_detach(VLANClientState *vc);
+int tap_fd_attach(int fd, const char *ifname);
+int tap_fd_detach(int fd, const char *ifname);
 
 int tap_get_fd(VLANClientState *vc);
 
-- 
1.7.1

^ permalink raw reply related

* [RFC V3 1/5] option: introduce qemu_get_opt_all()
From: Jason Wang @ 2012-07-06  9:31 UTC (permalink / raw)
  To: krkumar2, habanero, aliguori, rusty, mst, mashirle, qemu-devel,
	virtualization, tahm, jwhan, akong
  Cc: kvm
In-Reply-To: <1341567070-14136-1-git-send-email-jasowang@redhat.com>

Sometimes, we need to pass option like -netdev tap,fd=100,fd=101,fd=102 which
can not be properly parsed by qemu_find_opt() because it only returns the first
matched option. So qemu_get_opt_all() were introduced to return an array of
pointers which contains all matched option.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 qemu-option.c |   19 +++++++++++++++++++
 qemu-option.h |    2 ++
 2 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/qemu-option.c b/qemu-option.c
index bb3886c..9263125 100644
--- a/qemu-option.c
+++ b/qemu-option.c
@@ -545,6 +545,25 @@ static QemuOpt *qemu_opt_find(QemuOpts *opts, const char *name)
     return NULL;
 }
 
+int qemu_opt_get_all(QemuOpts *opts, const char *name, const char **optp,
+                     int max)
+{
+    QemuOpt *opt;
+    int index = 0;
+
+    QTAILQ_FOREACH_REVERSE(opt, &opts->head, QemuOptHead, next) {
+        if (strcmp(opt->name, name) == 0) {
+            if (index < max) {
+                optp[index++] = opt->str;
+            }
+            if (index == max) {
+                break;
+            }
+        }
+    }
+    return index;
+}
+
 const char *qemu_opt_get(QemuOpts *opts, const char *name)
 {
     QemuOpt *opt = qemu_opt_find(opts, name);
diff --git a/qemu-option.h b/qemu-option.h
index 951dec3..3c9a273 100644
--- a/qemu-option.h
+++ b/qemu-option.h
@@ -106,6 +106,8 @@ struct QemuOptsList {
     QemuOptDesc desc[];
 };
 
+int qemu_opt_get_all(QemuOpts *opts, const char *name, const char **optp,
+                     int max);
 const char *qemu_opt_get(QemuOpts *opts, const char *name);
 bool qemu_opt_get_bool(QemuOpts *opts, const char *name, bool defval);
 uint64_t qemu_opt_get_number(QemuOpts *opts, const char *name, uint64_t defval);
-- 
1.7.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox