Linux virtualization list
 help / color / mirror / Atom feed
* Re: [PATCH RFC V3 2/4] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks
From: Avi Kivity @ 2011-12-07 12:47 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Raghavendra K T, x86, Peter Zijlstra, Virtualization,
	H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang, KVM,
	Raghavendra K T, Ingo Molnar, Rik van Riel, Konrad Rzeszutek Wilk,
	Srivatsa Vaddagiri, Jeremy Fitzhardinge, Sasha Levin, Sedat Dilek,
	Thomas Gleixner, Yinghai Lu, Greg Kroah-Hartman, LKML,
	Dave Hansen, Suzuk
In-Reply-To: <20111207123330.GA32212@amt.cnet>

On 12/07/2011 02:33 PM, Marcelo Tosatti wrote:
> > 
> > Also Avi pointed that, logically kvm_arch_vcpu_ioctl_set_mpstate should
> > be called only in vcpu thread, so after further debugging, I noticed
> > that,  setting vcpuN->mp_state = KVM_MP_STATE_RUNNABLE; is not
> > necessary.
> > I 'll remove that in the next patch. Thanks for pointing.
>
> In fact you don't need kvm_arch_vcpu_ioctl_set_mpstate either, only the
> new "kicked" flag.

If we have a kicked flag, it becomes necessary to live migrate it.

Maybe we can change KVM_GET_MP_STATE to fold the kicked flag into the mp
state (converting HALTED into RUNNABLE).

Also I think we can keep the kicked flag in vcpu->requests, no need for
new storage.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Avi Kivity @ 2011-12-07 13:37 UTC (permalink / raw)
  To: Rusty Russell
  Cc: markmc, kvm, Michael S. Tsirkin, linux-kernel, virtualization,
	Sasha Levin
In-Reply-To: <87pqg1kiuu.fsf@rustcorp.com.au>

On 12/06/2011 02:03 PM, Rusty Russell wrote:
> On Tue, 06 Dec 2011 11:58:21 +0200, Avi Kivity <avi@redhat.com> wrote:
> > On 12/06/2011 07:07 AM, Rusty Russell wrote:
> > > Yes, but the hypervisor/trusted party would simply have to do the copy;
> > > the rings themselves would be shared A would say "copy this to/from B's
> > > ring entry N" and you know that A can't have changed B's entry.
> > 
> > Sorry, I don't follow.  How can the rings be shared?  If A puts a gpa in
> > A's address space into the ring, there's no way B can do anything with
> > it, it's an opaque number.  Xen solves this with an extra layer of
> > indirection (grant table handles) that cost extra hypercalls to map or
> > copy.
>
> It's not symmetric.  B can see the desc and avail pages R/O, and the
> used page R/W.  It needs to ask the something to copy in/out of
> descriptors, though, because they're an opaque number, and it doesn't
> have access.  ie. the existence of the descriptor in the ring *implies*
> a grant.
>
> Perhaps this could be generalized further into a "connect these two
> rings", but I'm not sure.  Descriptors with both read and write parts
> are tricky.

Okay, I was using a wrong mental model of how this works.  B must be
aware of the translation from A's address space into B.  Both qemu and
the kernel can do this on their own, but if B is another guest, then it
cannot do this except by calling H.

vhost-copy cannot work fully transparently, because you need some memory
to copy into.  Maybe we can have a pci device with a large BAR that
contains buffers for copying, and also a translation from A addresses
into B addresses.  It would work something like this:

  A prepares a request with both out and in buffers
  vhost-copy allocates memory in B's virtio-copy BAR, copies (using a
DMA engine) the out buffers into it, and rewrites the out descriptors to
contain B addresses
  B services the request, and updates the in addresses in the
descriptors to point at B memory
  vhost-copy copies (using a DMA engine) the in buffers into A memory

> > > I'm just not sure how the host would even know to hint.
> > 
> > For JBOD storage, a good rule of thumb is (number of spindles) x 3. 
> > With less, you might leave an idle spindle; with more, you're just
> > adding latency.  This assumes you're using indirects so ring entry ==
> > request.  The picture is muddier with massive battery-backed RAID
> > controllers or flash.
> > 
> > For networking, you want (ring size) * min(expected packet size, page
> > size) / (link bandwidth) to be something that doesn't get the
> > bufferbloat people after your blood.
>
> OK, so while neither side knows, the host knows slightly more.
>
> Now I think about it, from a spec POV, saying it's a "hint" is useless,
> as it doesn't tell the driver what to do with it.  I'll say it's a
> maximum, which keeps it simple.
>

Those rules of thumb always have exceptions, I'd say it's the default
that the guest can override.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply

* Re: [PATCH RFC V3 2/4] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks
From: Marcelo Tosatti @ 2011-12-07 13:39 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Raghavendra K T, x86, Peter Zijlstra, Virtualization,
	H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang, KVM,
	Raghavendra K T, Ingo Molnar, Rik van Riel, Konrad Rzeszutek Wilk,
	Srivatsa Vaddagiri, Jeremy Fitzhardinge, Sasha Levin, Sedat Dilek,
	Thomas Gleixner, Yinghai Lu, Greg Kroah-Hartman, LKML,
	Dave Hansen, Suzuk
In-Reply-To: <4EDF6049.2030204@redhat.com>

On Wed, Dec 07, 2011 at 02:47:05PM +0200, Avi Kivity wrote:
> On 12/07/2011 02:33 PM, Marcelo Tosatti wrote:
> > > 
> > > Also Avi pointed that, logically kvm_arch_vcpu_ioctl_set_mpstate should
> > > be called only in vcpu thread, so after further debugging, I noticed
> > > that,  setting vcpuN->mp_state = KVM_MP_STATE_RUNNABLE; is not
> > > necessary.
> > > I 'll remove that in the next patch. Thanks for pointing.
> >
> > In fact you don't need kvm_arch_vcpu_ioctl_set_mpstate either, only the
> > new "kicked" flag.
> 
> If we have a kicked flag, it becomes necessary to live migrate it.
> 
> Maybe we can change KVM_GET_MP_STATE to fold the kicked flag into the mp
> state (converting HALTED into RUNNABLE).

Yep, that works.

> Also I think we can keep the kicked flag in vcpu->requests, no need for
> new storage.

Was going to suggest it but it violates the currently organized
processing of entries at the beginning of vcpu_enter_guest.

That is, this "kicked" flag is different enough from vcpu->requests
processing that a separate variable seems worthwhile (even more
different with convertion to MP_STATE at KVM_GET_MP_STATE).

^ permalink raw reply

* Re: [PATCH v4 00/12] virtio: s4 support
From: Dor Laor @ 2011-12-07 13:43 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Michael S. Tsirkin, linux-kernel, Virtualization List,
	levinsasha928, Amit Shah
In-Reply-To: <20111207104459.GR12507@redhat.com>

On 12/07/2011 12:44 PM, Gleb Natapov wrote:
> On Wed, Dec 07, 2011 at 05:54:29PM +1030, Rusty Russell wrote:
>> On Wed,  7 Dec 2011 01:18:38 +0530, Amit Shah<amit.shah@redhat.com>  wrote:
>>> Hi,
>>>
>>> These patches add support for S4 to virtio (pci) and all drivers.
>>
>> Dumb meta-question: why do we want to hibernate virtual machines?
>>
> For instance you want to hibernate your laptop while it is running a virtual
> machine. You can pause them of course, but then time will be incorrect
> inside a guest after resume.

Exactly. Today with do that through live migration into a file, we call 
it external hibernation. The problem that it is transparent from the 
guest and thus it loses time keeping. Even NTP can't handle it since if 
it has more than 0.5% delay it stops syncing.

The solution is to do it through s4 where the guest gets a notification 
when it resumes.


>
>> I figure there's a reason, but it seems a bit weird :)
>>
>> Thanks,
>> Rusty.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>
> --
> 			Gleb.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Sasha Levin @ 2011-12-07 14:02 UTC (permalink / raw)
  To: Avi Kivity; +Cc: markmc, kvm, Michael S. Tsirkin, linux-kernel, virtualization
In-Reply-To: <1323023039.3256.7.camel@lappy>

On Sun, 2011-12-04 at 20:23 +0200, Sasha Levin wrote:

[snip]

Rusty, Michael, does the below looks a reasonable optimization for you?
should I send it as a patch?

> Something like the following patch:
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index c7a2c20..3166ca0 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -82,6 +82,7 @@ struct vring_virtqueue
>  
>  	/* Host supports indirect buffers */
>  	bool indirect;
> +	struct kmem_cache *indirect_cache;
>  
>  	/* Host publishes avail event idx */
>  	bool event;
> @@ -110,6 +111,9 @@ struct vring_virtqueue
>  
>  #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
>  
> +static unsigned int ind_alloc_thresh = 0;
> +module_param(ind_alloc_thresh, uint, S_IRUGO);
> +
>  /* Set up an indirect table of descriptors and add it to the queue. */
>  static int vring_add_indirect(struct vring_virtqueue *vq,
>  			      struct scatterlist sg[],
> @@ -121,7 +125,10 @@ static int vring_add_indirect(struct vring_virtqueue *vq,
>  	unsigned head;
>  	int i;
>  
> -	desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp);
> +	if ((out + in) <= ind_alloc_thresh)
> +		desc = kmem_cache_alloc(vq->indirect_cache, gfp);
> +	else
> +		desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp);
>  	if (!desc)
>  		return -ENOMEM;
>  
> @@ -479,6 +486,9 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
>  	vq->broken = false;
>  	vq->last_used_idx = 0;
>  	vq->num_added = 0;
> +	if (ind_alloc_thresh)
> +		vq->indirect_cache = KMEM_CACHE(vring_desc[ind_alloc_thresh], 0);
>  	list_add_tail(&vq->vq.list, &vdev->vqs);
>  #ifdef DEBUG
>  	vq->in_use = false;
> 

-- 

Sasha.

^ permalink raw reply

* Re: [PATCH RFC V3 2/4] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks
From: Avi Kivity @ 2011-12-07 14:52 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Raghavendra K T, x86, Peter Zijlstra, Virtualization,
	H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang, KVM,
	Raghavendra K T, Ingo Molnar, Rik van Riel, Konrad Rzeszutek Wilk,
	Srivatsa Vaddagiri, Jeremy Fitzhardinge, Sasha Levin, Sedat Dilek,
	Thomas Gleixner, Yinghai Lu, Greg Kroah-Hartman, LKML,
	Dave Hansen, Suzuk
In-Reply-To: <20111207133947.GA1708@amt.cnet>

On 12/07/2011 03:39 PM, Marcelo Tosatti wrote:
> > Also I think we can keep the kicked flag in vcpu->requests, no need for
> > new storage.
>
> Was going to suggest it but it violates the currently organized
> processing of entries at the beginning of vcpu_enter_guest.
>
> That is, this "kicked" flag is different enough from vcpu->requests
> processing that a separate variable seems worthwhile (even more
> different with convertion to MP_STATE at KVM_GET_MP_STATE).

IMO, it's similar to KVM_REQ_EVENT (which can also cause mpstate to
change due to apic re-evaluation).

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply

* Re: [net-next RFC PATCH 5/5] virtio-net: flow director support
From: Stefan Hajnoczi @ 2011-12-07 15:04 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928,
	bhutchings
In-Reply-To: <4EDF57C7.90406@redhat.com>

On Wed, Dec 7, 2011 at 12:10 PM, Jason Wang <jasowang@redhat.com> wrote:
> On 12/07/2011 05:08 PM, Stefan Hajnoczi wrote:
> [...]
>
>>> >  Consider the complexity of the host nic each with their own steering
>>> >  features,  this series make the first step with minimal effort to try
>>> > to let
>>> >  guest driver and host tap/macvtap co-operate like what physical nic
>>> > does.
>>> >  There may be other method, but performance numbers is also needed to
>>> > give
>>> >  the answer.
>>
>> I agree that performance results for this need to be shown.
>>
>> My original point is really that it's not a good idea to take
>> individual steps without a good big picture because this will change
>> the virtio-net device specification.  If this turns out to be a dead
>> end then hosts will need to continue to support the interface forever
>> (legacy guests could still try to use it).  So please first explain
>> what the full stack picture is going to look like and how you think it
>> will lead to better performance.  You don't need to have all the code
>> or evidence, but just enough explanation so we see where this is all
>> going.
>
> I think I mention too little in the cover message.
>
> There's no much changes with Krishna's series except the method that
> choosing a rx virtqueue. Since original series use different hash methods in
> host (rxhash) and guest (txhash), a different virtqueue were chose for a
> flow which could lead packets of a flow to be handled by different vhost
> thread and vcpu. This may damage the performance.
>
> This series tries to let one vhost thread to process the packets of a flow
> and also let the packets to be sent directly to a vcpu local to the thread
> process the data. This is done by letting guest tell the desired queue form
> which it want to receive the pakcet of a dedicated flow.
>
> So passing the hash from host to guest is needed to get the same hash in the
> two sides. Then a guest programmable hash to queue table were introduced and
> guest co-operate with the host through accelerate RFS in guest.

Thanks for the context, that helps me understand it a little better.

Stefan

^ permalink raw reply

* [PATCH 1/1] Staging: hv: storvsc: Fix a bug in create_bounce_buffer()
From: K. Y. Srinivasan @ 2011-12-07 15:15 UTC (permalink / raw)
  To: gregkh, linux-kernel, devel, virtualization, ohering,
	James.Bottomley, hch, linux-scsi
  Cc: K. Y. Srinivasan, Haiyang Zhang

Set the length field of the scatter gather elements correctly when we create
the bounce buffer. When we use the bounce buffer for a "write" operation,
the act of copying to the bounce buffer, correctly deals with this issue.
However, on the "read" side, the current code was not correctly setting
the buffer length. Fix this bug. Note that when we copy from the bounce
buffer (for the read case), the amount we copy is controlled by the original
scatter gather list given to the driver.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reported-by: Dadok Milan <dadok@kvados.cz> 
---
 drivers/staging/hv/storvsc_drv.c |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/hv/storvsc_drv.c b/drivers/staging/hv/storvsc_drv.c
index 18f8771..eb853f7 100644
--- a/drivers/staging/hv/storvsc_drv.c
+++ b/drivers/staging/hv/storvsc_drv.c
@@ -893,12 +893,14 @@ static int do_bounce_buffer(struct scatterlist *sgl, unsigned int sg_count)
 
 static struct scatterlist *create_bounce_buffer(struct scatterlist *sgl,
 						unsigned int sg_count,
-						unsigned int len)
+						unsigned int len,
+						int write)
 {
 	int i;
 	int num_pages;
 	struct scatterlist *bounce_sgl;
 	struct page *page_buf;
+	unsigned int buf_len = ((write == WRITE_TYPE) ? 0 : PAGE_SIZE);
 
 	num_pages = ALIGN(len, PAGE_SIZE) >> PAGE_SHIFT;
 
@@ -910,7 +912,7 @@ static struct scatterlist *create_bounce_buffer(struct scatterlist *sgl,
 		page_buf = alloc_page(GFP_ATOMIC);
 		if (!page_buf)
 			goto cleanup;
-		sg_set_page(&bounce_sgl[i], page_buf, 0, 0);
+		sg_set_page(&bounce_sgl[i], page_buf, buf_len, 0);
 	}
 
 	return bounce_sgl;
@@ -1348,7 +1350,8 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
 		if (do_bounce_buffer(sgl, scsi_sg_count(scmnd)) != -1) {
 			cmd_request->bounce_sgl =
 				create_bounce_buffer(sgl, scsi_sg_count(scmnd),
-						     scsi_bufflen(scmnd));
+						     scsi_bufflen(scmnd),
+						     vm_srb->data_in);
 			if (!cmd_request->bounce_sgl) {
 				scmnd->host_scribble = NULL;
 				mempool_free(cmd_request,
-- 
1.7.4.1


^ permalink raw reply related

* [PATCH RFC] virtio_net: fix refill related races
From: Michael S. Tsirkin @ 2011-12-07 15:21 UTC (permalink / raw)
  To: Amit Shah; +Cc: netdev, virtualization, linux-kernel, Michael S. Tsirkin

Fix theoretical races related to refill work:
1. After napi is disabled by ndo_stop, refill work
   can run and re-enable it.
2. Refill can reschedule itself, if this happens
   it can run after cancel_delayed_work_sync,
   and will access device after it is destroyed.

As a solution, add flags to track napi state and
to disable refill, and toggle them on start, stop
and remove; check these flags on refill.

refill work structure and new fields aren't used
on data path, so put them together near the end of
struct virtnet_info.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

RFC only since it's untested at this point.
A bugfix so 3.2 material?
Comments?


 drivers/net/virtio_net.c |   57 +++++++++++++++++++++++++++++++++++++---------
 1 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f5b3f19..39eb24c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -21,6 +21,7 @@
 #include <linux/etherdevice.h>
 #include <linux/ethtool.h>
 #include <linux/module.h>
+#include <linux/mutex.h>
 #include <linux/virtio.h>
 #include <linux/virtio_net.h>
 #include <linux/scatterlist.h>
@@ -68,15 +69,24 @@ struct virtnet_info {
 	/* Active statistics */
 	struct virtnet_stats __percpu *stats;
 
-	/* Work struct for refilling if we run low on memory. */
-	struct delayed_work refill;
-
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
 	/* fragments + linear part + virtio header */
 	struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
 	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
+
+	/* Work struct for refilling if we run low on memory. */
+	struct delayed_work refill;
+
+	/* Set flag to allow delayed refill work, protected by a refill_lock. */
+	bool refill_enable;
+
+	/* Whether napi is enabled, protected by a refill_lock. */
+	bool napi_enable;
+
+	/* Lock to protect refill and napi enable/disable operations. */
+	struct mutex refill_lock;
 };
 
 struct skb_vnet_hdr {
@@ -477,20 +487,35 @@ static void virtnet_napi_enable(struct virtnet_info *vi)
 	}
 }
 
+static void virtnet_refill_enable(struct virtnet_info *vi, bool enable)
+{
+	mutex_lock(&vi->refill_lock);
+	vi->refill_enable = enable;
+	mutex_unlock(&vi->refill_lock);
+}
+
 static void refill_work(struct work_struct *work)
 {
 	struct virtnet_info *vi;
 	bool still_empty;
 
 	vi = container_of(work, struct virtnet_info, refill.work);
-	napi_disable(&vi->napi);
-	still_empty = !try_fill_recv(vi, GFP_KERNEL);
-	virtnet_napi_enable(vi);
 
-	/* In theory, this can happen: if we don't get any buffers in
-	 * we will *never* try to fill again. */
-	if (still_empty)
-		schedule_delayed_work(&vi->refill, HZ/2);
+	mutex_lock(&vi->refill_lock);
+	if (vi->refill_enable) {
+		if (vi->napi_enable)
+			napi_disable(&vi->napi);
+		still_empty = !try_fill_recv(vi, GFP_KERNEL);
+		if (vi->napi_enable)
+			virtnet_napi_enable(vi);
+
+		/* In theory, this can happen: if we don't get any buffers in
+		 * we will *never* try to fill again. */
+		if (still_empty)
+			schedule_delayed_work(&vi->refill, HZ/2);
+
+	}
+	mutex_unlock(&vi->refill_lock);
 }
 
 static int virtnet_poll(struct napi_struct *napi, int budget)
@@ -708,7 +733,10 @@ static int virtnet_open(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 
+	mutex_lock(&vi->refill_lock);
+	vi->napi_enable = true;
 	virtnet_napi_enable(vi);
+	mutex_unlock(&vi->refill_lock);
 	return 0;
 }
 
@@ -761,7 +789,10 @@ static int virtnet_close(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 
+	mutex_lock(&vi->refill_lock);
+	vi->napi_enable = false;
 	napi_disable(&vi->napi);
+	mutex_unlock(&vi->refill_lock);
 
 	return 0;
 }
@@ -1010,6 +1041,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (vi->stats == NULL)
 		goto free;
 
+	mutex_init(&vi->refill_lock);
 	INIT_DELAYED_WORK(&vi->refill, refill_work);
 	sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
 	sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
@@ -1047,6 +1079,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 		goto free_vqs;
 	}
 
+	virtnet_refill_enable(vi, true);
+
 	/* Last of all, set up some receive buffers. */
 	try_fill_recv(vi, GFP_KERNEL);
 
@@ -1107,10 +1141,11 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
 {
 	struct virtnet_info *vi = vdev->priv;
 
+	virtnet_refill_enable(vi, false);
+
 	/* Stop all the virtqueues. */
 	vdev->config->reset(vdev);
 
-
 	unregister_netdev(vi->dev);
 	cancel_delayed_work_sync(&vi->refill);
 
-- 
1.7.5.53.gc233e

^ permalink raw reply related

* [PATCH RFC] virtio_blk: fix config handler race
From: Michael S. Tsirkin @ 2011-12-07 15:39 UTC (permalink / raw)
  To: Amit Shah; +Cc: virtualization, linux-kernel, Michael S. Tsirkin

Fix a theoretical race related to config work
handler: a config interrupt might happen
after we flush config work but before we
reset the device. It will then cause the
config work to run during or after reset.

Two problems with this:
- if this runs after device is gone we will get use after free
- access of config while reset is in progress is racy
(as layout is changing).

As a solution
1. flush after reset when we know there will be no more interrupts
2. add a flag to disable config access before reset

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

RFC only as it's untested.
Bugfix so 3.2 material? Comments?

 drivers/block/virtio_blk.c |   22 +++++++++++++++++++++-
 1 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4d0b70a..34633f3 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -4,6 +4,7 @@
 #include <linux/blkdev.h>
 #include <linux/hdreg.h>
 #include <linux/module.h>
+#include <linux/mutex.h>
 #include <linux/virtio.h>
 #include <linux/virtio_blk.h>
 #include <linux/scatterlist.h>
@@ -36,6 +37,12 @@ struct virtio_blk
 	/* Process context for config space updates */
 	struct work_struct config_work;
 
+	/* Lock for config space updates */
+	struct mutex config_lock;
+
+	/* enable config space updates */
+	bool config_enable;
+
 	/* What host tells us, plus 2 for header & tailer. */
 	unsigned int sg_elems;
 
@@ -318,6 +325,10 @@ static void virtblk_config_changed_work(struct work_struct *work)
 	char cap_str_2[10], cap_str_10[10];
 	u64 capacity, size;
 
+	mutex_lock(&vblk->config_lock);
+	if (!vblk->config_enable)
+		goto done;
+
 	/* Host must always specify the capacity. */
 	vdev->config->get(vdev, offsetof(struct virtio_blk_config, capacity),
 			  &capacity, sizeof(capacity));
@@ -340,6 +351,8 @@ static void virtblk_config_changed_work(struct work_struct *work)
 		  cap_str_10, cap_str_2);
 
 	set_capacity(vblk->disk, capacity);
+done:
+	mutex_lock(&vblk->config_lock);
 }
 
 static void virtblk_config_changed(struct virtio_device *vdev)
@@ -388,7 +401,9 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 	vblk->vdev = vdev;
 	vblk->sg_elems = sg_elems;
 	sg_init_table(vblk->sg, vblk->sg_elems);
+	mutex_init(&vblk->config_lock);
 	INIT_WORK(&vblk->config_work, virtblk_config_changed_work);
+	vblk->config_enable = true;
 
 	/* We expect one virtqueue, for output. */
 	vblk->vq = virtio_find_single_vq(vdev, blk_done, "requests");
@@ -542,7 +557,10 @@ static void __devexit virtblk_remove(struct virtio_device *vdev)
 	struct virtio_blk *vblk = vdev->priv;
 	int index = vblk->index;
 
-	flush_work(&vblk->config_work);
+	/* Prevent config work handler from accessing the device. */
+	mutex_lock(&vblk->config_lock);
+	vblk->config_enable = false;
+	mutex_unlock(&vblk->config_lock);
 
 	/* Nothing should be pending. */
 	BUG_ON(!list_empty(&vblk->reqs));
@@ -550,6 +568,8 @@ static void __devexit virtblk_remove(struct virtio_device *vdev)
 	/* Stop all the virtqueues. */
 	vdev->config->reset(vdev);
 
+	flush_work(&vblk->config_work);
+
 	del_gendisk(vblk->disk);
 	blk_cleanup_queue(vblk->disk->queue);
 	put_disk(vblk->disk);
-- 
1.7.5.53.gc233e

^ permalink raw reply related

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Michael S. Tsirkin @ 2011-12-07 15:48 UTC (permalink / raw)
  To: Sasha Levin; +Cc: markmc, kvm, linux-kernel, virtualization, Avi Kivity
In-Reply-To: <1323266565.4009.10.camel@lappy>

On Wed, Dec 07, 2011 at 04:02:45PM +0200, Sasha Levin wrote:
> On Sun, 2011-12-04 at 20:23 +0200, Sasha Levin wrote:
> 
> [snip]
> 
> Rusty, Michael, does the below looks a reasonable optimization for you?

OK overall but a bit hard to say for sure as it looks pretty incomplete ...

> should I send it as a patch?

What's the performance gain?

> > Something like the following patch:
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index c7a2c20..3166ca0 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -82,6 +82,7 @@ struct vring_virtqueue
> >  
> >  	/* Host supports indirect buffers */
> >  	bool indirect;
> > +	struct kmem_cache *indirect_cache;
> >  
> >  	/* Host publishes avail event idx */
> >  	bool event;
> > @@ -110,6 +111,9 @@ struct vring_virtqueue
> >  
> >  #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
> >  
> > +static unsigned int ind_alloc_thresh = 0;
> > +module_param(ind_alloc_thresh, uint, S_IRUGO);

0 will have no effect?

> > +
> >  /* Set up an indirect table of descriptors and add it to the queue. */
> >  static int vring_add_indirect(struct vring_virtqueue *vq,
> >  			      struct scatterlist sg[],

A global parameter is OK for testing but likely not what we want
in real life. This needs to be different per device.



> > @@ -121,7 +125,10 @@ static int vring_add_indirect(struct vring_virtqueue *vq,
> >  	unsigned head;
> >  	int i;
> >  
> > -	desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp);
> > +	if ((out + in) <= ind_alloc_thresh)
> > +		desc = kmem_cache_alloc(vq->indirect_cache, gfp);
> > +	else
> > +		desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp);
> >  	if (!desc)
> >  		return -ENOMEM;
> >  

free unaffected?

> > @@ -479,6 +486,9 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
> >  	vq->broken = false;
> >  	vq->last_used_idx = 0;
> >  	vq->num_added = 0;
> > +	if (ind_alloc_thresh)
> > +		vq->indirect_cache = KMEM_CACHE(vring_desc[ind_alloc_thresh], 0);

and need to cleanup too?

> >  	list_add_tail(&vq->vq.list, &vdev->vqs);
> >  #ifdef DEBUG
> >  	vq->in_use = false;
> > 
> 
> -- 
> 
> Sasha.

^ permalink raw reply

* Re: [PATCH v4 04/12] virtio: console: Add freeze and restore handlers to support S4
From: Michael S. Tsirkin @ 2011-12-07 15:54 UTC (permalink / raw)
  To: Amit Shah; +Cc: linux-kernel, levinsasha928, Virtualization List
In-Reply-To: <20111207105955.GJ4651@amit-x200.redhat.com>

On Wed, Dec 07, 2011 at 04:29:55PM +0530, Amit Shah wrote:
> On (Wed) 07 Dec 2011 [12:43:24], Michael S. Tsirkin wrote:
> 
> > > diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
> > > index e14f5aa..fd2fd6f 100644
> > > --- a/drivers/char/virtio_console.c
> > > +++ b/drivers/char/virtio_console.c
> > > @@ -1844,6 +1844,60 @@ static unsigned int features[] = {
> > >  	VIRTIO_CONSOLE_F_MULTIPORT,
> > >  };
> > >  
> > > +#ifdef CONFIG_PM
> > > +static int virtcons_freeze(struct virtio_device *vdev)
> > > +{
> > > +	struct ports_device *portdev;
> > > +	struct port *port;
> > > +
> > > +	portdev = vdev->priv;
> > > +
> > > +	vdev->config->reset(vdev);
> > 
> > 
> > So here, cancel_work_sync might still be running.
> > If it does run, might it try to access the device
> > config?  Could not determine this quickly, if yes it's a problem.
> 
> Similar to the other comment: I don't see why just resetting device
> can cause config queue access to go bad.
> 
> 		Amit

Hmm, not sure anymore. Sorry.

^ permalink raw reply

* Re: [PATCH v4 06/12] virtio: blk: Add freeze, restore handlers to support S4
From: Michael S. Tsirkin @ 2011-12-07 15:58 UTC (permalink / raw)
  To: Amit Shah; +Cc: linux-kernel, levinsasha928, Virtualization List
In-Reply-To: <20111207105647.GI4651@amit-x200.redhat.com>

On Wed, Dec 07, 2011 at 04:26:47PM +0530, Amit Shah wrote:
> On (Wed) 07 Dec 2011 [12:37:02], Michael S. Tsirkin wrote:
> > On Wed, Dec 07, 2011 at 01:18:44AM +0530, Amit Shah wrote:
> > > Delete the vq and flush any pending requests from the block queue on the
> > > freeze callback to prepare for hibernation.
> > > 
> > > Re-create the vq in the restore callback to resume normal function.
> > > 
> > > Signed-off-by: Amit Shah <amit.shah@redhat.com>
> > > ---
> > >  drivers/block/virtio_blk.c |   38 ++++++++++++++++++++++++++++++++++++++
> > >  1 files changed, 38 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > > index 467f218..a9147a6 100644
> > > --- a/drivers/block/virtio_blk.c
> > > +++ b/drivers/block/virtio_blk.c
> > > @@ -568,6 +568,40 @@ static void __devexit virtblk_remove(struct virtio_device *vdev)
> > >  	ida_simple_remove(&vd_index_ida, index);
> > >  }
> > >  
> > > +#ifdef CONFIG_PM
> > > +static int virtblk_freeze(struct virtio_device *vdev)
> > > +{
> > > +	struct virtio_blk *vblk = vdev->priv;
> > > +
> > > +	/* Ensure we don't receive any more interrupts */
> > > +	vdev->config->reset(vdev);
> > > +
> > > +	flush_work(&vblk->config_work);
> > 
> > It bothers me that config work can be running
> > after reset here. If it does, it will not get sane
> > values from reading config.
> 
> Why so?
> 
> The reset only ensures the host doesn't write anything more, isn't it?
> Why would the values be affected?

Generally, not only that. Reset also clears configuration to the
reset value :) As since accesses are done byte
by byte you might get a value that is different from
*both* old and new one as a result.

But that is a general comment, specifically for block,
I don't know if there is a problem with this.

Same for console.

> > Also, can there be stuff in the reqs list?
> > If yes is this a problem?
> 
> Should be all cleared by the two commands below.  At least that's my
> expectation.  If not, let me know!
> 
> > > +	spin_lock_irq(vblk->disk->queue->queue_lock);
> > > +	blk_stop_queue(vblk->disk->queue);
> > > +	spin_unlock_irq(vblk->disk->queue->queue_lock);
> > > +	blk_sync_queue(vblk->disk->queue);
> > > +
> > > +	vdev->config->del_vqs(vdev);
> > > +	return 0;
> > > +}
> > > +
> > 
> > Thinking about it, looks like there's a bug in
> > virtblk_remove: if we get a config change after
> > flush_work we schedule another work.
> > That's a problem for sure as structure is removed.
> 
> Yep, it is a potential issue.
> 
> 		Amit

Sent a patch.

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Michael S. Tsirkin @ 2011-12-07 16:10 UTC (permalink / raw)
  To: David Miller; +Cc: krkumar2, arnd, netdev, virtualization, levinsasha928
In-Reply-To: <20111125.013552.1613051198566054931.davem@davemloft.net>

On Fri, Nov 25, 2011 at 01:35:52AM -0500, David Miller wrote:
> From: Krishna Kumar2 <krkumar2@in.ibm.com>
> Date: Fri, 25 Nov 2011 09:39:11 +0530
> 
> > Jason Wang <jasowang@redhat.com> wrote on 11/25/2011 08:51:57 AM:
> >>
> >> My description is not clear again :(
> >> I mean the same vhost thead:
> >>
> >> vhost thread #0 transmits packets of flow A on processor M
> >> ...
> >> vhost thread #0 move to another process N and start to transmit packets
> >> of flow A
> > 
> > Thanks for clarifying. Yes, binding vhosts to CPU's
> > makes the incoming packet go to the same vhost each
> > time. BTW, are you doing any binding and/or irqbalance
> > when you run your tests? I am not running either at
> > this time, but thought both might be useful.
> 
> So are we going with this patch or are we saying that vhost binding
> is a requirement?

OK we didn't come to a conclusion so I would be inclined
to merge this patch as is for 3.2, and revisit later.
One question though: do these changes affect userspace
in any way? For example, will this commit us to
ensure that a single flow gets a unique hash even
for strange configurations that transmit the same flow
from multiple cpus?

-- 
MST

^ permalink raw reply

* [PATCH 1/1] Staging: hv: storvsc: Move the storage driver out of the staging area
From: K. Y. Srinivasan @ 2011-12-07 16:16 UTC (permalink / raw)
  To: gregkh, linux-kernel, devel, virtualization, ohering,
	James.Bottomley, hch, linux-scsi
  Cc: K. Y. Srinivasan

The storage driver (storvsc_drv.c) handles all block storage devices
assigned to Linux guests hosted on Hyper-V. This driver has been in the
staging tree for a while and this patch moves it out of the staging area.
As per Greg's recommendation, this patch makes no changes to the staging/hv
directory. Once the driver moves out of staging, we will cleanup the
staging/hv directory.

This patch includes all the patches that I have sent against the staging/hv
tree to address the comments I have gotten to date on this storage driver.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
---
 drivers/scsi/Kconfig       |    7 +
 drivers/scsi/Makefile      |    3 +
 drivers/scsi/storvsc_drv.c | 1586 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 1596 insertions(+), 0 deletions(-)
 create mode 100644 drivers/scsi/storvsc_drv.c

diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
index 06ea3bc..4910269 100644
--- a/drivers/scsi/Kconfig
+++ b/drivers/scsi/Kconfig
@@ -662,6 +662,13 @@ config VMWARE_PVSCSI
 	  To compile this driver as a module, choose M here: the
 	  module will be called vmw_pvscsi.
 
+config HYPERV_STORAGE
+	tristate "Microsoft Hyper-V virtual storage driver"
+	depends on SCSI && HYPERV
+	default HYPERV
+	help
+	  Select this option to enable the Hyper-V virtual storage driver.
+
 config LIBFC
 	tristate "LibFC module"
 	select SCSI_FC_ATTRS
diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
index 2b88749..e4c1a69 100644
--- a/drivers/scsi/Makefile
+++ b/drivers/scsi/Makefile
@@ -142,6 +142,7 @@ obj-$(CONFIG_SCSI_BNX2_ISCSI)	+= libiscsi.o bnx2i/
 obj-$(CONFIG_BE2ISCSI)		+= libiscsi.o be2iscsi/
 obj-$(CONFIG_SCSI_PMCRAID)	+= pmcraid.o
 obj-$(CONFIG_VMWARE_PVSCSI)	+= vmw_pvscsi.o
+obj-$(CONFIG_HYPERV_STORAGE)	+= hv_storvsc.o
 
 obj-$(CONFIG_ARM)		+= arm/
 
@@ -170,6 +171,8 @@ scsi_mod-$(CONFIG_SCSI_PROC_FS)	+= scsi_proc.o
 scsi_mod-y			+= scsi_trace.o
 scsi_mod-$(CONFIG_PM)		+= scsi_pm.o
 
+hv_storvsc-y			:= storvsc_drv.o
+
 scsi_tgt-y			+= scsi_tgt_lib.o scsi_tgt_if.o
 
 sd_mod-objs	:= sd.o
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
new file mode 100644
index 0000000..eb853f7
--- /dev/null
+++ b/drivers/scsi/storvsc_drv.c
@@ -0,0 +1,1586 @@
+/*
+ * Copyright (c) 2009, Microsoft Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * Authors:
+ *   Haiyang Zhang <haiyangz@microsoft.com>
+ *   Hank Janssen  <hjanssen@microsoft.com>
+ *   K. Y. Srinivasan <kys@microsoft.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/wait.h>
+#include <linux/sched.h>
+#include <linux/completion.h>
+#include <linux/string.h>
+#include <linux/mm.h>
+#include <linux/delay.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/hyperv.h>
+#include <linux/mempool.h>
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_host.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_tcq.h>
+#include <scsi/scsi_eh.h>
+#include <scsi/scsi_devinfo.h>
+#include <scsi/scsi_dbg.h>
+
+
+#define STORVSC_MIN_BUF_NR				64
+#define STORVSC_RING_BUFFER_SIZE			(20*PAGE_SIZE)
+static int storvsc_ringbuffer_size = STORVSC_RING_BUFFER_SIZE;
+
+module_param(storvsc_ringbuffer_size, int, S_IRUGO);
+MODULE_PARM_DESC(storvsc_ringbuffer_size, "Ring buffer size (bytes)");
+
+/* to alert the user that structure sizes may be mismatched even though the */
+/* protocol versions match. */
+
+
+#define REVISION_STRING(REVISION_) #REVISION_
+#define FILL_VMSTOR_REVISION(RESULT_LVALUE_)				\
+	do {								\
+		char *revision_string					\
+			= REVISION_STRING($Rev : 6 $) + 6;		\
+		RESULT_LVALUE_ = 0;					\
+		while (*revision_string >= '0'				\
+			&& *revision_string <= '9') {			\
+			RESULT_LVALUE_ *= 10;				\
+			RESULT_LVALUE_ += *revision_string - '0';	\
+			revision_string++;				\
+		}							\
+	} while (0)
+
+/* Major/minor macros.  Minor version is in LSB, meaning that earlier flat */
+/* version numbers will be interpreted as "0.x" (i.e., 1 becomes 0.1). */
+#define VMSTOR_PROTOCOL_MAJOR(VERSION_)		(((VERSION_) >> 8) & 0xff)
+#define VMSTOR_PROTOCOL_MINOR(VERSION_)		(((VERSION_))      & 0xff)
+#define VMSTOR_PROTOCOL_VERSION(MAJOR_, MINOR_)	((((MAJOR_) & 0xff) << 8) | \
+						 (((MINOR_) & 0xff)))
+#define VMSTOR_INVALID_PROTOCOL_VERSION		(-1)
+
+/* Version history: */
+/* V1 Beta                    0.1 */
+/* V1 RC < 2008/1/31          1.0 */
+/* V1 RC > 2008/1/31          2.0 */
+#define VMSTOR_PROTOCOL_VERSION_CURRENT VMSTOR_PROTOCOL_VERSION(4, 2)
+
+
+
+
+/*  This will get replaced with the max transfer length that is possible on */
+/*  the host adapter. */
+/*  The max transfer length will be published when we offer a vmbus channel. */
+#define MAX_TRANSFER_LENGTH	0x40000
+#define DEFAULT_PACKET_SIZE (sizeof(struct vmdata_gpa_direct) +	\
+			sizeof(struct vstor_packet) +		\
+			sizesizeof(u64) * (MAX_TRANSFER_LENGTH / PAGE_SIZE)))
+
+
+/*  Packet structure describing virtual storage requests. */
+enum vstor_packet_operation {
+	VSTOR_OPERATION_COMPLETE_IO		= 1,
+	VSTOR_OPERATION_REMOVE_DEVICE		= 2,
+	VSTOR_OPERATION_EXECUTE_SRB		= 3,
+	VSTOR_OPERATION_RESET_LUN		= 4,
+	VSTOR_OPERATION_RESET_ADAPTER		= 5,
+	VSTOR_OPERATION_RESET_BUS		= 6,
+	VSTOR_OPERATION_BEGIN_INITIALIZATION	= 7,
+	VSTOR_OPERATION_END_INITIALIZATION	= 8,
+	VSTOR_OPERATION_QUERY_PROTOCOL_VERSION	= 9,
+	VSTOR_OPERATION_QUERY_PROPERTIES	= 10,
+	VSTOR_OPERATION_ENUMERATE_BUS		= 11,
+	VSTOR_OPERATION_MAXIMUM			= 11
+};
+
+/*
+ * Platform neutral description of a scsi request -
+ * this remains the same across the write regardless of 32/64 bit
+ * note: it's patterned off the SCSI_PASS_THROUGH structure
+ */
+#define CDB16GENERIC_LENGTH			0x10
+
+#ifndef SENSE_BUFFER_SIZE
+#define SENSE_BUFFER_SIZE			0x12
+#endif
+
+#define MAX_DATA_BUF_LEN_WITH_PADDING		0x14
+
+struct vmscsi_request {
+	unsigned short length;
+	unsigned char srb_status;
+	unsigned char scsi_status;
+
+	unsigned char port_number;
+	unsigned char path_id;
+	unsigned char target_id;
+	unsigned char lun;
+
+	unsigned char cdb_length;
+	unsigned char sense_info_length;
+	unsigned char data_in;
+	unsigned char reserved;
+
+	unsigned int data_transfer_length;
+
+	union {
+		unsigned char cdb[CDB16GENERIC_LENGTH];
+		unsigned char sense_data[SENSE_BUFFER_SIZE];
+		unsigned char reserved_array[MAX_DATA_BUF_LEN_WITH_PADDING];
+	};
+} __attribute((packed));
+
+
+/*
+ * This structure is sent during the intialization phase to get the different
+ * properties of the channel.
+ */
+struct vmstorage_channel_properties {
+	unsigned short protocol_version;
+	unsigned char path_id;
+	unsigned char target_id;
+
+	/* Note: port number is only really known on the client side */
+	unsigned int port_number;
+	unsigned int flags;
+	unsigned int max_transfer_bytes;
+
+	/*  This id is unique for each channel and will correspond with */
+	/*  vendor specific data in the inquirydata */
+	unsigned long long unique_id;
+} __packed;
+
+/*  This structure is sent during the storage protocol negotiations. */
+struct vmstorage_protocol_version {
+	/* Major (MSW) and minor (LSW) version numbers. */
+	unsigned short major_minor;
+
+	/*
+	 * Revision number is auto-incremented whenever this file is changed
+	 * (See FILL_VMSTOR_REVISION macro above).  Mismatch does not
+	 * definitely indicate incompatibility--but it does indicate mismatched
+	 * builds.
+	 */
+	unsigned short revision;
+} __packed;
+
+/* Channel Property Flags */
+#define STORAGE_CHANNEL_REMOVABLE_FLAG		0x1
+#define STORAGE_CHANNEL_EMULATED_IDE_FLAG	0x2
+
+struct vstor_packet {
+	/* Requested operation type */
+	enum vstor_packet_operation operation;
+
+	/*  Flags - see below for values */
+	unsigned int flags;
+
+	/* Status of the request returned from the server side. */
+	unsigned int status;
+
+	/* Data payload area */
+	union {
+		/*
+		 * Structure used to forward SCSI commands from the
+		 * client to the server.
+		 */
+		struct vmscsi_request vm_srb;
+
+		/* Structure used to query channel properties. */
+		struct vmstorage_channel_properties storage_channel_properties;
+
+		/* Used during version negotiations. */
+		struct vmstorage_protocol_version version;
+	};
+} __packed;
+
+/* Packet flags */
+/*
+ * This flag indicates that the server should send back a completion for this
+ * packet.
+ */
+#define REQUEST_COMPLETION_FLAG	0x1
+
+/*  This is the set of flags that the vsc can set in any packets it sends */
+#define VSC_LEGAL_FLAGS		(REQUEST_COMPLETION_FLAG)
+
+
+/* Defines */
+
+#define STORVSC_MAX_IO_REQUESTS				128
+
+/*
+ * In Hyper-V, each port/path/target maps to 1 scsi host adapter.  In
+ * reality, the path/target is not used (ie always set to 0) so our
+ * scsi host adapter essentially has 1 bus with 1 target that contains
+ * up to 256 luns.
+ */
+#define STORVSC_MAX_LUNS_PER_TARGET			64
+#define STORVSC_MAX_TARGETS				1
+#define STORVSC_MAX_CHANNELS				1
+#define STORVSC_MAX_CMD_LEN				16
+
+/* Matches Windows-end */
+enum storvsc_request_type {
+	WRITE_TYPE,
+	READ_TYPE,
+	UNKNOWN_TYPE,
+};
+
+
+struct hv_storvsc_request {
+	struct hv_device *device;
+
+	/* Synchronize the request/response if needed */
+	struct completion wait_event;
+
+	unsigned char *sense_buffer;
+	void *context;
+	void (*on_io_completion)(struct hv_storvsc_request *request);
+	struct hv_multipage_buffer data_buffer;
+
+	struct vstor_packet vstor_packet;
+};
+
+
+/* A storvsc device is a device object that contains a vmbus channel */
+struct storvsc_device {
+	struct hv_device *device;
+
+	bool	 destroy;
+	bool	 drain_notify;
+	atomic_t num_outstanding_req;
+	struct Scsi_Host *host;
+
+	wait_queue_head_t waiting_to_drain;
+
+	/*
+	 * Each unique Port/Path/Target represents 1 channel ie scsi
+	 * controller. In reality, the pathid, targetid is always 0
+	 * and the port is set by us
+	 */
+	unsigned int port_number;
+	unsigned char path_id;
+	unsigned char target_id;
+
+	/* Used for vsc/vsp channel reset process */
+	struct hv_storvsc_request init_request;
+	struct hv_storvsc_request reset_request;
+};
+
+struct stor_mem_pools {
+	struct kmem_cache *request_pool;
+	mempool_t *request_mempool;
+};
+
+struct hv_host_device {
+	struct hv_device *dev;
+	unsigned int port;
+	unsigned char path;
+	unsigned char target;
+};
+
+struct storvsc_cmd_request {
+	struct list_head entry;
+	struct scsi_cmnd *cmd;
+
+	unsigned int bounce_sgl_count;
+	struct scatterlist *bounce_sgl;
+
+	struct hv_storvsc_request request;
+};
+
+struct storvsc_scan_work {
+	struct work_struct work;
+	struct Scsi_Host *host;
+	uint lun;
+};
+
+static void storvsc_bus_scan(struct work_struct *work)
+{
+	struct storvsc_scan_work *wrk;
+	int id, order_id;
+
+	wrk = container_of(work, struct storvsc_scan_work, work);
+	for (id = 0; id < wrk->host->max_id; ++id) {
+		if (wrk->host->reverse_ordering)
+			order_id = wrk->host->max_id - id - 1;
+		else
+			order_id = id;
+
+		scsi_scan_target(&wrk->host->shost_gendev, 0,
+				order_id, SCAN_WILD_CARD, 1);
+	}
+	kfree(wrk);
+}
+
+static void storvsc_remove_lun(struct work_struct *work)
+{
+	struct storvsc_scan_work *wrk;
+	struct scsi_device *sdev;
+
+	wrk = container_of(work, struct storvsc_scan_work, work);
+	if (!scsi_host_get(wrk->host))
+		goto done;
+
+	sdev = scsi_device_lookup(wrk->host, 0, 0, wrk->lun);
+
+	if (sdev) {
+		scsi_remove_device(sdev);
+		scsi_device_put(sdev);
+	}
+	scsi_host_put(wrk->host);
+
+done:
+	kfree(wrk);
+}
+
+static inline struct storvsc_device *get_out_stor_device(
+					struct hv_device *device)
+{
+	struct storvsc_device *stor_device;
+
+	stor_device = hv_get_drvdata(device);
+
+	if (stor_device && stor_device->destroy)
+		stor_device = NULL;
+
+	return stor_device;
+}
+
+
+static inline void storvsc_wait_to_drain(struct storvsc_device *dev)
+{
+	dev->drain_notify = true;
+	wait_event(dev->waiting_to_drain,
+		   atomic_read(&dev->num_outstanding_req) == 0);
+	dev->drain_notify = false;
+}
+
+static inline struct storvsc_device *get_in_stor_device(
+					struct hv_device *device)
+{
+	struct storvsc_device *stor_device;
+
+	stor_device = hv_get_drvdata(device);
+
+	if (!stor_device)
+		goto get_in_err;
+
+	/*
+	 * If the device is being destroyed; allow incoming
+	 * traffic only to cleanup outstanding requests.
+	 */
+
+	if (stor_device->destroy  &&
+		(atomic_read(&stor_device->num_outstanding_req) == 0))
+		stor_device = NULL;
+
+get_in_err:
+	return stor_device;
+
+}
+
+static int storvsc_channel_init(struct hv_device *device)
+{
+	struct storvsc_device *stor_device;
+	struct hv_storvsc_request *request;
+	struct vstor_packet *vstor_packet;
+	int ret, t;
+
+	stor_device = get_out_stor_device(device);
+	if (!stor_device)
+		return -ENODEV;
+
+	request = &stor_device->init_request;
+	vstor_packet = &request->vstor_packet;
+
+	/*
+	 * Now, initiate the vsc/vsp initialization protocol on the open
+	 * channel
+	 */
+	memset(request, 0, sizeof(struct hv_storvsc_request));
+	init_completion(&request->wait_event);
+	vstor_packet->operation = VSTOR_OPERATION_BEGIN_INITIALIZATION;
+	vstor_packet->flags = REQUEST_COMPLETION_FLAG;
+
+	ret = vmbus_sendpacket(device->channel, vstor_packet,
+			       sizeof(struct vstor_packet),
+			       (unsigned long)request,
+			       VM_PKT_DATA_INBAND,
+			       VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
+	if (ret != 0)
+		goto cleanup;
+
+	t = wait_for_completion_timeout(&request->wait_event, 5*HZ);
+	if (t == 0) {
+		ret = -ETIMEDOUT;
+		goto cleanup;
+	}
+
+	if (vstor_packet->operation != VSTOR_OPERATION_COMPLETE_IO ||
+	    vstor_packet->status != 0)
+		goto cleanup;
+
+
+	/* reuse the packet for version range supported */
+	memset(vstor_packet, 0, sizeof(struct vstor_packet));
+	vstor_packet->operation = VSTOR_OPERATION_QUERY_PROTOCOL_VERSION;
+	vstor_packet->flags = REQUEST_COMPLETION_FLAG;
+
+	vstor_packet->version.major_minor = VMSTOR_PROTOCOL_VERSION_CURRENT;
+	FILL_VMSTOR_REVISION(vstor_packet->version.revision);
+
+	ret = vmbus_sendpacket(device->channel, vstor_packet,
+			       sizeof(struct vstor_packet),
+			       (unsigned long)request,
+			       VM_PKT_DATA_INBAND,
+			       VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
+	if (ret != 0)
+		goto cleanup;
+
+	t = wait_for_completion_timeout(&request->wait_event, 5*HZ);
+	if (t == 0) {
+		ret = -ETIMEDOUT;
+		goto cleanup;
+	}
+
+	if (vstor_packet->operation != VSTOR_OPERATION_COMPLETE_IO ||
+	    vstor_packet->status != 0)
+		goto cleanup;
+
+
+	memset(vstor_packet, 0, sizeof(struct vstor_packet));
+	vstor_packet->operation = VSTOR_OPERATION_QUERY_PROPERTIES;
+	vstor_packet->flags = REQUEST_COMPLETION_FLAG;
+	vstor_packet->storage_channel_properties.port_number =
+					stor_device->port_number;
+
+	ret = vmbus_sendpacket(device->channel, vstor_packet,
+			       sizeof(struct vstor_packet),
+			       (unsigned long)request,
+			       VM_PKT_DATA_INBAND,
+			       VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
+
+	if (ret != 0)
+		goto cleanup;
+
+	t = wait_for_completion_timeout(&request->wait_event, 5*HZ);
+	if (t == 0) {
+		ret = -ETIMEDOUT;
+		goto cleanup;
+	}
+
+	if (vstor_packet->operation != VSTOR_OPERATION_COMPLETE_IO ||
+	    vstor_packet->status != 0)
+		goto cleanup;
+
+	stor_device->path_id = vstor_packet->storage_channel_properties.path_id;
+	stor_device->target_id
+		= vstor_packet->storage_channel_properties.target_id;
+
+	memset(vstor_packet, 0, sizeof(struct vstor_packet));
+	vstor_packet->operation = VSTOR_OPERATION_END_INITIALIZATION;
+	vstor_packet->flags = REQUEST_COMPLETION_FLAG;
+
+	ret = vmbus_sendpacket(device->channel, vstor_packet,
+			       sizeof(struct vstor_packet),
+			       (unsigned long)request,
+			       VM_PKT_DATA_INBAND,
+			       VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
+
+	if (ret != 0)
+		goto cleanup;
+
+	t = wait_for_completion_timeout(&request->wait_event, 5*HZ);
+	if (t == 0) {
+		ret = -ETIMEDOUT;
+		goto cleanup;
+	}
+
+	if (vstor_packet->operation != VSTOR_OPERATION_COMPLETE_IO ||
+	    vstor_packet->status != 0)
+		goto cleanup;
+
+
+cleanup:
+	return ret;
+}
+
+static void storvsc_on_io_completion(struct hv_device *device,
+				  struct vstor_packet *vstor_packet,
+				  struct hv_storvsc_request *request)
+{
+	struct storvsc_device *stor_device;
+	struct vstor_packet *stor_pkt;
+
+	stor_device = hv_get_drvdata(device);
+	stor_pkt = &request->vstor_packet;
+
+	/*
+	 * The current SCSI handling on the host side does
+	 * not correctly handle:
+	 * INQUIRY command with page code parameter set to 0x80
+	 * MODE_SENSE command with cmd[2] == 0x1c
+	 *
+	 * Setup srb and scsi status so this won't be fatal.
+	 * We do this so we can distinguish truly fatal failues
+	 * (srb status == 0x4) and off-line the device in that case.
+	 */
+
+	if ((stor_pkt->vm_srb.cdb[0] == INQUIRY) ||
+		(stor_pkt->vm_srb.cdb[0] == MODE_SENSE)) {
+		vstor_packet->vm_srb.scsi_status = 0;
+		vstor_packet->vm_srb.srb_status = 0x1;
+	}
+
+
+	/* Copy over the status...etc */
+	stor_pkt->vm_srb.scsi_status = vstor_packet->vm_srb.scsi_status;
+	stor_pkt->vm_srb.srb_status = vstor_packet->vm_srb.srb_status;
+	stor_pkt->vm_srb.sense_info_length =
+	vstor_packet->vm_srb.sense_info_length;
+
+	if (vstor_packet->vm_srb.scsi_status != 0 ||
+		vstor_packet->vm_srb.srb_status != 1){
+		dev_warn(&device->device,
+			 "cmd 0x%x scsi status 0x%x srb status 0x%x\n",
+			 stor_pkt->vm_srb.cdb[0],
+			 vstor_packet->vm_srb.scsi_status,
+			 vstor_packet->vm_srb.srb_status);
+	}
+
+	if ((vstor_packet->vm_srb.scsi_status & 0xFF) == 0x02) {
+		/* CHECK_CONDITION */
+		if (vstor_packet->vm_srb.srb_status & 0x80) {
+			/* autosense data available */
+			dev_warn(&device->device,
+				 "stor pkt %p autosense data valid - len %d\n",
+				 request,
+				 vstor_packet->vm_srb.sense_info_length);
+
+			memcpy(request->sense_buffer,
+			       vstor_packet->vm_srb.sense_data,
+			       vstor_packet->vm_srb.sense_info_length);
+
+		}
+	}
+
+	stor_pkt->vm_srb.data_transfer_length =
+	vstor_packet->vm_srb.data_transfer_length;
+
+	request->on_io_completion(request);
+
+	if (atomic_dec_and_test(&stor_device->num_outstanding_req) &&
+		stor_device->drain_notify)
+		wake_up(&stor_device->waiting_to_drain);
+
+
+}
+
+static void storvsc_on_receive(struct hv_device *device,
+			     struct vstor_packet *vstor_packet,
+			     struct hv_storvsc_request *request)
+{
+	struct storvsc_scan_work *work;
+	struct storvsc_device *stor_device;
+
+	switch (vstor_packet->operation) {
+	case VSTOR_OPERATION_COMPLETE_IO:
+		storvsc_on_io_completion(device, vstor_packet, request);
+		break;
+
+	case VSTOR_OPERATION_REMOVE_DEVICE:
+	case VSTOR_OPERATION_ENUMERATE_BUS:
+		stor_device = get_in_stor_device(device);
+		work = kmalloc(sizeof(struct storvsc_scan_work), GFP_ATOMIC);
+		if (!work)
+			return;
+
+		INIT_WORK(&work->work, storvsc_bus_scan);
+		work->host = stor_device->host;
+		schedule_work(&work->work);
+		break;
+
+	default:
+		break;
+	}
+}
+
+static void storvsc_on_channel_callback(void *context)
+{
+	struct hv_device *device = (struct hv_device *)context;
+	struct storvsc_device *stor_device;
+	u32 bytes_recvd;
+	u64 request_id;
+	unsigned char packet[ALIGN(sizeof(struct vstor_packet), 8)];
+	struct hv_storvsc_request *request;
+	int ret;
+
+
+	stor_device = get_in_stor_device(device);
+	if (!stor_device)
+		return;
+
+	do {
+		ret = vmbus_recvpacket(device->channel, packet,
+				       ALIGN(sizeof(struct vstor_packet), 8),
+				       &bytes_recvd, &request_id);
+		if (ret == 0 && bytes_recvd > 0) {
+
+			request = (struct hv_storvsc_request *)
+					(unsigned long)request_id;
+
+			if ((request == &stor_device->init_request) ||
+			    (request == &stor_device->reset_request)) {
+
+				memcpy(&request->vstor_packet, packet,
+				       sizeof(struct vstor_packet));
+				complete(&request->wait_event);
+			} else {
+				storvsc_on_receive(device,
+						(struct vstor_packet *)packet,
+						request);
+			}
+		} else {
+			break;
+		}
+	} while (1);
+
+	return;
+}
+
+static int storvsc_connect_to_vsp(struct hv_device *device, u32 ring_size)
+{
+	struct vmstorage_channel_properties props;
+	int ret;
+
+	memset(&props, 0, sizeof(struct vmstorage_channel_properties));
+
+	/* Open the channel */
+	ret = vmbus_open(device->channel,
+			 ring_size,
+			 ring_size,
+			 (void *)&props,
+			 sizeof(struct vmstorage_channel_properties),
+			 storvsc_on_channel_callback, device);
+
+	if (ret != 0)
+		return ret;
+
+	ret = storvsc_channel_init(device);
+
+	return ret;
+}
+
+static int storvsc_dev_remove(struct hv_device *device)
+{
+	struct storvsc_device *stor_device;
+	unsigned long flags;
+
+	stor_device = hv_get_drvdata(device);
+
+	spin_lock_irqsave(&device->channel->inbound_lock, flags);
+	stor_device->destroy = true;
+	spin_unlock_irqrestore(&device->channel->inbound_lock, flags);
+
+	/*
+	 * At this point, all outbound traffic should be disable. We
+	 * only allow inbound traffic (responses) to proceed so that
+	 * outstanding requests can be completed.
+	 */
+
+	storvsc_wait_to_drain(stor_device);
+
+	/*
+	 * Since we have already drained, we don't need to busy wait
+	 * as was done in final_release_stor_device()
+	 * Note that we cannot set the ext pointer to NULL until
+	 * we have drained - to drain the outgoing packets, we need to
+	 * allow incoming packets.
+	 */
+	spin_lock_irqsave(&device->channel->inbound_lock, flags);
+	hv_set_drvdata(device, NULL);
+	spin_unlock_irqrestore(&device->channel->inbound_lock, flags);
+
+	/* Close the channel */
+	vmbus_close(device->channel);
+
+	kfree(stor_device);
+	return 0;
+}
+
+static int storvsc_do_io(struct hv_device *device,
+			      struct hv_storvsc_request *request)
+{
+	struct storvsc_device *stor_device;
+	struct vstor_packet *vstor_packet;
+	int ret = 0;
+
+	vstor_packet = &request->vstor_packet;
+	stor_device = get_out_stor_device(device);
+
+	if (!stor_device)
+		return -ENODEV;
+
+
+	request->device  = device;
+
+
+	vstor_packet->flags |= REQUEST_COMPLETION_FLAG;
+
+	vstor_packet->vm_srb.length = sizeof(struct vmscsi_request);
+
+
+	vstor_packet->vm_srb.sense_info_length = SENSE_BUFFER_SIZE;
+
+
+	vstor_packet->vm_srb.data_transfer_length =
+	request->data_buffer.len;
+
+	vstor_packet->operation = VSTOR_OPERATION_EXECUTE_SRB;
+
+	if (request->data_buffer.len) {
+		ret = vmbus_sendpacket_multipagebuffer(device->channel,
+				&request->data_buffer,
+				vstor_packet,
+				sizeof(struct vstor_packet),
+				(unsigned long)request);
+	} else {
+		ret = vmbus_sendpacket(device->channel, vstor_packet,
+			       sizeof(struct vstor_packet),
+			       (unsigned long)request,
+			       VM_PKT_DATA_INBAND,
+			       VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
+	}
+
+	if (ret != 0)
+		return ret;
+
+	atomic_inc(&stor_device->num_outstanding_req);
+
+	return ret;
+}
+
+static void storvsc_get_ide_info(struct hv_device *dev, int *target, int *path)
+{
+	*target =
+		dev->dev_instance.b[5] << 8 | dev->dev_instance.b[4];
+
+	*path =
+		dev->dev_instance.b[3] << 24 |
+		dev->dev_instance.b[2] << 16 |
+		dev->dev_instance.b[1] << 8  | dev->dev_instance.b[0];
+}
+
+
+static int storvsc_device_alloc(struct scsi_device *sdevice)
+{
+	struct stor_mem_pools *memp;
+	int number = STORVSC_MIN_BUF_NR;
+
+	memp = kzalloc(sizeof(struct stor_mem_pools), GFP_KERNEL);
+	if (!memp)
+		return -ENOMEM;
+
+	memp->request_pool =
+		kmem_cache_create(dev_name(&sdevice->sdev_dev),
+				sizeof(struct storvsc_cmd_request), 0,
+				SLAB_HWCACHE_ALIGN, NULL);
+
+	if (!memp->request_pool)
+		goto err0;
+
+	memp->request_mempool = mempool_create(number, mempool_alloc_slab,
+						mempool_free_slab,
+						memp->request_pool);
+
+	if (!memp->request_mempool)
+		goto err1;
+
+	sdevice->hostdata = memp;
+
+	return 0;
+
+err1:
+	kmem_cache_destroy(memp->request_pool);
+
+err0:
+	kfree(memp);
+	return -ENOMEM;
+}
+
+static void storvsc_device_destroy(struct scsi_device *sdevice)
+{
+	struct stor_mem_pools *memp = sdevice->hostdata;
+
+	mempool_destroy(memp->request_mempool);
+	kmem_cache_destroy(memp->request_pool);
+	kfree(memp);
+	sdevice->hostdata = NULL;
+}
+
+static int storvsc_device_configure(struct scsi_device *sdevice)
+{
+	scsi_adjust_queue_depth(sdevice, MSG_SIMPLE_TAG,
+				STORVSC_MAX_IO_REQUESTS);
+
+	blk_queue_max_segment_size(sdevice->request_queue, PAGE_SIZE);
+
+	blk_queue_bounce_limit(sdevice->request_queue, BLK_BOUNCE_ANY);
+
+	return 0;
+}
+
+static void destroy_bounce_buffer(struct scatterlist *sgl,
+				  unsigned int sg_count)
+{
+	int i;
+	struct page *page_buf;
+
+	for (i = 0; i < sg_count; i++) {
+		page_buf = sg_page((&sgl[i]));
+		if (page_buf != NULL)
+			__free_page(page_buf);
+	}
+
+	kfree(sgl);
+}
+
+static int do_bounce_buffer(struct scatterlist *sgl, unsigned int sg_count)
+{
+	int i;
+
+	/* No need to check */
+	if (sg_count < 2)
+		return -1;
+
+	/* We have at least 2 sg entries */
+	for (i = 0; i < sg_count; i++) {
+		if (i == 0) {
+			/* make sure 1st one does not have hole */
+			if (sgl[i].offset + sgl[i].length != PAGE_SIZE)
+				return i;
+		} else if (i == sg_count - 1) {
+			/* make sure last one does not have hole */
+			if (sgl[i].offset != 0)
+				return i;
+		} else {
+			/* make sure no hole in the middle */
+			if (sgl[i].length != PAGE_SIZE || sgl[i].offset != 0)
+				return i;
+		}
+	}
+	return -1;
+}
+
+static struct scatterlist *create_bounce_buffer(struct scatterlist *sgl,
+						unsigned int sg_count,
+						unsigned int len,
+						int write)
+{
+	int i;
+	int num_pages;
+	struct scatterlist *bounce_sgl;
+	struct page *page_buf;
+	unsigned int buf_len = ((write == WRITE_TYPE) ? 0 : PAGE_SIZE);
+
+	num_pages = ALIGN(len, PAGE_SIZE) >> PAGE_SHIFT;
+
+	bounce_sgl = kcalloc(num_pages, sizeof(struct scatterlist), GFP_ATOMIC);
+	if (!bounce_sgl)
+		return NULL;
+
+	for (i = 0; i < num_pages; i++) {
+		page_buf = alloc_page(GFP_ATOMIC);
+		if (!page_buf)
+			goto cleanup;
+		sg_set_page(&bounce_sgl[i], page_buf, buf_len, 0);
+	}
+
+	return bounce_sgl;
+
+cleanup:
+	destroy_bounce_buffer(bounce_sgl, num_pages);
+	return NULL;
+}
+
+
+/* Assume the original sgl has enough room */
+static unsigned int copy_from_bounce_buffer(struct scatterlist *orig_sgl,
+					    struct scatterlist *bounce_sgl,
+					    unsigned int orig_sgl_count,
+					    unsigned int bounce_sgl_count)
+{
+	int i;
+	int j = 0;
+	unsigned long src, dest;
+	unsigned int srclen, destlen, copylen;
+	unsigned int total_copied = 0;
+	unsigned long bounce_addr = 0;
+	unsigned long dest_addr = 0;
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	for (i = 0; i < orig_sgl_count; i++) {
+		dest_addr = (unsigned long)kmap_atomic(sg_page((&orig_sgl[i])),
+					KM_IRQ0) + orig_sgl[i].offset;
+		dest = dest_addr;
+		destlen = orig_sgl[i].length;
+
+		if (bounce_addr == 0)
+			bounce_addr =
+			(unsigned long)kmap_atomic(sg_page((&bounce_sgl[j])),
+							KM_IRQ0);
+
+		while (destlen) {
+			src = bounce_addr + bounce_sgl[j].offset;
+			srclen = bounce_sgl[j].length - bounce_sgl[j].offset;
+
+			copylen = min(srclen, destlen);
+			memcpy((void *)dest, (void *)src, copylen);
+
+			total_copied += copylen;
+			bounce_sgl[j].offset += copylen;
+			destlen -= copylen;
+			dest += copylen;
+
+			if (bounce_sgl[j].offset == bounce_sgl[j].length) {
+				/* full */
+				kunmap_atomic((void *)bounce_addr, KM_IRQ0);
+				j++;
+
+				/*
+				 * It is possible that the number of elements
+				 * in the bounce buffer may not be equal to
+				 * the number of elements in the original
+				 * scatter list. Handle this correctly.
+				 */
+
+				if (j == bounce_sgl_count) {
+					/*
+					 * We are done; cleanup and return.
+					 */
+					kunmap_atomic((void *)(dest_addr -
+							orig_sgl[i].offset),
+							KM_IRQ0);
+					local_irq_restore(flags);
+					return total_copied;
+				}
+
+				/* if we need to use another bounce buffer */
+				if (destlen || i != orig_sgl_count - 1)
+					bounce_addr =
+					(unsigned long)kmap_atomic(
+					sg_page((&bounce_sgl[j])), KM_IRQ0);
+			} else if (destlen == 0 && i == orig_sgl_count - 1) {
+				/* unmap the last bounce that is < PAGE_SIZE */
+				kunmap_atomic((void *)bounce_addr, KM_IRQ0);
+			}
+		}
+
+		kunmap_atomic((void *)(dest_addr - orig_sgl[i].offset),
+			      KM_IRQ0);
+	}
+
+	local_irq_restore(flags);
+
+	return total_copied;
+}
+
+
+/* Assume the bounce_sgl has enough room ie using the create_bounce_buffer() */
+static unsigned int copy_to_bounce_buffer(struct scatterlist *orig_sgl,
+					  struct scatterlist *bounce_sgl,
+					  unsigned int orig_sgl_count)
+{
+	int i;
+	int j = 0;
+	unsigned long src, dest;
+	unsigned int srclen, destlen, copylen;
+	unsigned int total_copied = 0;
+	unsigned long bounce_addr = 0;
+	unsigned long src_addr = 0;
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	for (i = 0; i < orig_sgl_count; i++) {
+		src_addr = (unsigned long)kmap_atomic(sg_page((&orig_sgl[i])),
+				KM_IRQ0) + orig_sgl[i].offset;
+		src = src_addr;
+		srclen = orig_sgl[i].length;
+
+		if (bounce_addr == 0)
+			bounce_addr =
+			(unsigned long)kmap_atomic(sg_page((&bounce_sgl[j])),
+						KM_IRQ0);
+
+		while (srclen) {
+			/* assume bounce offset always == 0 */
+			dest = bounce_addr + bounce_sgl[j].length;
+			destlen = PAGE_SIZE - bounce_sgl[j].length;
+
+			copylen = min(srclen, destlen);
+			memcpy((void *)dest, (void *)src, copylen);
+
+			total_copied += copylen;
+			bounce_sgl[j].length += copylen;
+			srclen -= copylen;
+			src += copylen;
+
+			if (bounce_sgl[j].length == PAGE_SIZE) {
+				/* full..move to next entry */
+				kunmap_atomic((void *)bounce_addr, KM_IRQ0);
+				j++;
+
+				/* if we need to use another bounce buffer */
+				if (srclen || i != orig_sgl_count - 1)
+					bounce_addr =
+					(unsigned long)kmap_atomic(
+					sg_page((&bounce_sgl[j])), KM_IRQ0);
+
+			} else if (srclen == 0 && i == orig_sgl_count - 1) {
+				/* unmap the last bounce that is < PAGE_SIZE */
+				kunmap_atomic((void *)bounce_addr, KM_IRQ0);
+			}
+		}
+
+		kunmap_atomic((void *)(src_addr - orig_sgl[i].offset), KM_IRQ0);
+	}
+
+	local_irq_restore(flags);
+
+	return total_copied;
+}
+
+
+static int storvsc_remove(struct hv_device *dev)
+{
+	struct storvsc_device *stor_device = hv_get_drvdata(dev);
+	struct Scsi_Host *host = stor_device->host;
+
+	scsi_remove_host(host);
+
+	scsi_host_put(host);
+
+	storvsc_dev_remove(dev);
+
+	return 0;
+}
+
+
+static int storvsc_get_chs(struct scsi_device *sdev, struct block_device * bdev,
+			   sector_t capacity, int *info)
+{
+	sector_t nsect = capacity;
+	sector_t cylinders = nsect;
+	int heads, sectors_pt;
+
+	/*
+	 * We are making up these values; let us keep it simple.
+	 */
+	heads = 0xff;
+	sectors_pt = 0x3f;      /* Sectors per track */
+	sector_div(cylinders, heads * sectors_pt);
+	if ((sector_t)(cylinders + 1) * heads * sectors_pt < nsect)
+		cylinders = 0xffff;
+
+	info[0] = heads;
+	info[1] = sectors_pt;
+	info[2] = (int)cylinders;
+
+	return 0;
+}
+
+static int storvsc_host_reset(struct hv_device *device)
+{
+	struct storvsc_device *stor_device;
+	struct hv_storvsc_request *request;
+	struct vstor_packet *vstor_packet;
+	int ret, t;
+
+
+	stor_device = get_out_stor_device(device);
+	if (!stor_device)
+		return FAILED;
+
+	request = &stor_device->reset_request;
+	vstor_packet = &request->vstor_packet;
+
+	init_completion(&request->wait_event);
+
+	vstor_packet->operation = VSTOR_OPERATION_RESET_BUS;
+	vstor_packet->flags = REQUEST_COMPLETION_FLAG;
+	vstor_packet->vm_srb.path_id = stor_device->path_id;
+
+	ret = vmbus_sendpacket(device->channel, vstor_packet,
+			       sizeof(struct vstor_packet),
+			       (unsigned long)&stor_device->reset_request,
+			       VM_PKT_DATA_INBAND,
+			       VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
+	if (ret != 0)
+		return FAILED;
+
+	t = wait_for_completion_timeout(&request->wait_event, 5*HZ);
+	if (t == 0)
+		return TIMEOUT_ERROR;
+
+
+	/*
+	 * At this point, all outstanding requests in the adapter
+	 * should have been flushed out and return to us
+	 */
+
+	return SUCCESS;
+}
+
+
+/*
+ * storvsc_host_reset_handler - Reset the scsi HBA
+ */
+static int storvsc_host_reset_handler(struct scsi_cmnd *scmnd)
+{
+	struct hv_host_device *host_dev = shost_priv(scmnd->device->host);
+	struct hv_device *dev = host_dev->dev;
+
+	return storvsc_host_reset(dev);
+}
+
+
+/*
+ * storvsc_command_completion - Command completion processing
+ */
+static void storvsc_command_completion(struct hv_storvsc_request *request)
+{
+	struct storvsc_cmd_request *cmd_request =
+		(struct storvsc_cmd_request *)request->context;
+	struct scsi_cmnd *scmnd = cmd_request->cmd;
+	struct hv_host_device *host_dev = shost_priv(scmnd->device->host);
+	void (*scsi_done_fn)(struct scsi_cmnd *);
+	struct scsi_sense_hdr sense_hdr;
+	struct vmscsi_request *vm_srb;
+	struct storvsc_scan_work *wrk;
+	struct stor_mem_pools *memp = scmnd->device->hostdata;
+
+	vm_srb = &request->vstor_packet.vm_srb;
+	if (cmd_request->bounce_sgl_count) {
+		if (vm_srb->data_in == READ_TYPE)
+			copy_from_bounce_buffer(scsi_sglist(scmnd),
+					cmd_request->bounce_sgl,
+					scsi_sg_count(scmnd),
+					cmd_request->bounce_sgl_count);
+		destroy_bounce_buffer(cmd_request->bounce_sgl,
+					cmd_request->bounce_sgl_count);
+	}
+
+	/*
+	 * If there is an error; offline the device since all
+	 * error recovery strategies would have already been
+	 * deployed on the host side.
+	 */
+	if (vm_srb->srb_status == 0x4)
+		scmnd->result = DID_TARGET_FAILURE << 16;
+	else
+		scmnd->result = vm_srb->scsi_status;
+
+	/*
+	 * If the LUN is invalid; remove the device.
+	 */
+	if (vm_srb->srb_status == 0x20) {
+		struct storvsc_device *stor_dev;
+		struct hv_device *dev = host_dev->dev;
+		struct Scsi_Host *host;
+
+		stor_dev = get_in_stor_device(dev);
+		host = stor_dev->host;
+
+		wrk = kmalloc(sizeof(struct storvsc_scan_work),
+				GFP_ATOMIC);
+		if (!wrk) {
+			scmnd->result = DID_TARGET_FAILURE << 16;
+		} else {
+			wrk->host = host;
+			wrk->lun = vm_srb->lun;
+			INIT_WORK(&wrk->work, storvsc_remove_lun);
+			schedule_work(&wrk->work);
+		}
+	}
+
+	if (scmnd->result) {
+		if (scsi_normalize_sense(scmnd->sense_buffer,
+				SCSI_SENSE_BUFFERSIZE, &sense_hdr))
+			scsi_print_sense_hdr("storvsc", &sense_hdr);
+	}
+
+	scsi_set_resid(scmnd,
+		request->data_buffer.len -
+		vm_srb->data_transfer_length);
+
+	scsi_done_fn = scmnd->scsi_done;
+
+	scmnd->host_scribble = NULL;
+	scmnd->scsi_done = NULL;
+
+	scsi_done_fn(scmnd);
+
+	mempool_free(cmd_request, memp->request_mempool);
+}
+
+static bool storvsc_check_scsi_cmd(struct scsi_cmnd *scmnd)
+{
+	bool allowed = true;
+	u8 scsi_op = scmnd->cmnd[0];
+
+	switch (scsi_op) {
+	/* smartd sends this command, which will offline the device */
+	case SET_WINDOW:
+		scmnd->result = ILLEGAL_REQUEST << 16;
+		allowed = false;
+		break;
+	default:
+		break;
+	}
+	return allowed;
+}
+
+/*
+ * storvsc_queuecommand - Initiate command processing
+ */
+static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
+{
+	int ret;
+	struct hv_host_device *host_dev = shost_priv(host);
+	struct hv_device *dev = host_dev->dev;
+	struct hv_storvsc_request *request;
+	struct storvsc_cmd_request *cmd_request;
+	unsigned int request_size = 0;
+	int i;
+	struct scatterlist *sgl;
+	unsigned int sg_count = 0;
+	struct vmscsi_request *vm_srb;
+	struct stor_mem_pools *memp = scmnd->device->hostdata;
+
+	if (storvsc_check_scsi_cmd(scmnd) == false) {
+		scmnd->scsi_done(scmnd);
+		return 0;
+	}
+
+	/* If retrying, no need to prep the cmd */
+	if (scmnd->host_scribble) {
+
+		cmd_request =
+			(struct storvsc_cmd_request *)scmnd->host_scribble;
+
+		goto retry_request;
+	}
+
+	request_size = sizeof(struct storvsc_cmd_request);
+
+	cmd_request = mempool_alloc(memp->request_mempool,
+				       GFP_ATOMIC);
+	if (!cmd_request)
+		return SCSI_MLQUEUE_DEVICE_BUSY;
+
+	memset(cmd_request, 0, sizeof(struct storvsc_cmd_request));
+
+	/* Setup the cmd request */
+	cmd_request->bounce_sgl_count = 0;
+	cmd_request->bounce_sgl = NULL;
+	cmd_request->cmd = scmnd;
+
+	scmnd->host_scribble = (unsigned char *)cmd_request;
+
+	request = &cmd_request->request;
+	vm_srb = &request->vstor_packet.vm_srb;
+
+
+	/* Build the SRB */
+	switch (scmnd->sc_data_direction) {
+	case DMA_TO_DEVICE:
+		vm_srb->data_in = WRITE_TYPE;
+		break;
+	case DMA_FROM_DEVICE:
+		vm_srb->data_in = READ_TYPE;
+		break;
+	default:
+		vm_srb->data_in = UNKNOWN_TYPE;
+		break;
+	}
+
+	request->on_io_completion = storvsc_command_completion;
+	request->context = cmd_request;/* scmnd; */
+
+	vm_srb->port_number = host_dev->port;
+	vm_srb->path_id = scmnd->device->channel;
+	vm_srb->target_id = scmnd->device->id;
+	vm_srb->lun = scmnd->device->lun;
+
+	vm_srb->cdb_length = scmnd->cmd_len;
+
+	memcpy(vm_srb->cdb, scmnd->cmnd, vm_srb->cdb_length);
+
+	request->sense_buffer = scmnd->sense_buffer;
+
+
+	request->data_buffer.len = scsi_bufflen(scmnd);
+	if (scsi_sg_count(scmnd)) {
+		sgl = (struct scatterlist *)scsi_sglist(scmnd);
+		sg_count = scsi_sg_count(scmnd);
+
+		/* check if we need to bounce the sgl */
+		if (do_bounce_buffer(sgl, scsi_sg_count(scmnd)) != -1) {
+			cmd_request->bounce_sgl =
+				create_bounce_buffer(sgl, scsi_sg_count(scmnd),
+						     scsi_bufflen(scmnd),
+						     vm_srb->data_in);
+			if (!cmd_request->bounce_sgl) {
+				scmnd->host_scribble = NULL;
+				mempool_free(cmd_request,
+						memp->request_mempool);
+
+				return SCSI_MLQUEUE_HOST_BUSY;
+			}
+
+			cmd_request->bounce_sgl_count =
+				ALIGN(scsi_bufflen(scmnd), PAGE_SIZE) >>
+					PAGE_SHIFT;
+
+			if (vm_srb->data_in == WRITE_TYPE)
+				copy_to_bounce_buffer(sgl,
+					cmd_request->bounce_sgl,
+					scsi_sg_count(scmnd));
+
+			sgl = cmd_request->bounce_sgl;
+			sg_count = cmd_request->bounce_sgl_count;
+		}
+
+		request->data_buffer.offset = sgl[0].offset;
+
+		for (i = 0; i < sg_count; i++)
+			request->data_buffer.pfn_array[i] =
+				page_to_pfn(sg_page((&sgl[i])));
+
+	} else if (scsi_sglist(scmnd)) {
+		request->data_buffer.offset =
+			virt_to_phys(scsi_sglist(scmnd)) & (PAGE_SIZE-1);
+		request->data_buffer.pfn_array[0] =
+			virt_to_phys(scsi_sglist(scmnd)) >> PAGE_SHIFT;
+	}
+
+retry_request:
+	/* Invokes the vsc to start an IO */
+	ret = storvsc_do_io(dev, &cmd_request->request);
+
+	if (ret == -EAGAIN) {
+		/* no more space */
+
+		if (cmd_request->bounce_sgl_count)
+			destroy_bounce_buffer(cmd_request->bounce_sgl,
+					cmd_request->bounce_sgl_count);
+
+		mempool_free(cmd_request, memp->request_mempool);
+
+		scmnd->host_scribble = NULL;
+
+		ret = SCSI_MLQUEUE_DEVICE_BUSY;
+	}
+
+	return ret;
+}
+
+/* Scsi driver */
+static struct scsi_host_template scsi_driver = {
+	.module	=		THIS_MODULE,
+	.name =			"storvsc_host_t",
+	.bios_param =		storvsc_get_chs,
+	.queuecommand =		storvsc_queuecommand,
+	.eh_host_reset_handler =	storvsc_host_reset_handler,
+	.slave_alloc =		storvsc_device_alloc,
+	.slave_destroy =	storvsc_device_destroy,
+	.slave_configure =	storvsc_device_configure,
+	.cmd_per_lun =		1,
+	/* 64 max_queue * 1 target */
+	.can_queue =		STORVSC_MAX_IO_REQUESTS*STORVSC_MAX_TARGETS,
+	.this_id =		-1,
+	/* no use setting to 0 since ll_blk_rw reset it to 1 */
+	/* currently 32 */
+	.sg_tablesize =		MAX_MULTIPAGE_BUFFER_COUNT,
+	.use_clustering =	DISABLE_CLUSTERING,
+	/* Make sure we dont get a sg segment crosses a page boundary */
+	.dma_boundary =		PAGE_SIZE-1,
+};
+
+enum {
+	SCSI_GUID,
+	IDE_GUID,
+};
+
+static const struct hv_vmbus_device_id id_table[] = {
+	/* SCSI guid */
+	{ VMBUS_DEVICE(0xd9, 0x63, 0x61, 0xba, 0xa1, 0x04, 0x29, 0x4d,
+		       0xb6, 0x05, 0x72, 0xe2, 0xff, 0xb1, 0xdc, 0x7f)
+	  .driver_data = SCSI_GUID },
+	/* IDE guid */
+	{ VMBUS_DEVICE(0x32, 0x26, 0x41, 0x32, 0xcb, 0x86, 0xa2, 0x44,
+		       0x9b, 0x5c, 0x50, 0xd1, 0x41, 0x73, 0x54, 0xf5)
+	  .driver_data = IDE_GUID },
+	{ },
+};
+
+MODULE_DEVICE_TABLE(vmbus, id_table);
+
+
+/*
+ * storvsc_probe - Add a new device for this driver
+ */
+
+static int storvsc_probe(struct hv_device *device,
+			const struct hv_vmbus_device_id *dev_id)
+{
+	int ret;
+	struct Scsi_Host *host;
+	struct hv_host_device *host_dev;
+	bool dev_is_ide = ((dev_id->driver_data == IDE_GUID) ? true : false);
+	int path = 0;
+	int target = 0;
+	struct storvsc_device *stor_device;
+
+	host = scsi_host_alloc(&scsi_driver,
+			       sizeof(struct hv_host_device));
+	if (!host)
+		return -ENOMEM;
+
+	host_dev = shost_priv(host);
+	memset(host_dev, 0, sizeof(struct hv_host_device));
+
+	host_dev->port = host->host_no;
+	host_dev->dev = device;
+
+
+	stor_device = kzalloc(sizeof(struct storvsc_device), GFP_KERNEL);
+	if (!stor_device) {
+		ret = -ENOMEM;
+		goto err_out0;
+	}
+
+	stor_device->destroy = false;
+	init_waitqueue_head(&stor_device->waiting_to_drain);
+	stor_device->device = device;
+	stor_device->host = host;
+	hv_set_drvdata(device, stor_device);
+
+	stor_device->port_number = host->host_no;
+	ret = storvsc_connect_to_vsp(device, storvsc_ringbuffer_size);
+	if (ret)
+		goto err_out1;
+
+	if (dev_is_ide)
+		storvsc_get_ide_info(device, &target, &path);
+
+	host_dev->path = stor_device->path_id;
+	host_dev->target = stor_device->target_id;
+
+	/* max # of devices per target */
+	host->max_lun = STORVSC_MAX_LUNS_PER_TARGET;
+	/* max # of targets per channel */
+	host->max_id = STORVSC_MAX_TARGETS;
+	/* max # of channels */
+	host->max_channel = STORVSC_MAX_CHANNELS - 1;
+	/* max cmd length */
+	host->max_cmd_len = STORVSC_MAX_CMD_LEN;
+
+	/* Register the HBA and start the scsi bus scan */
+	ret = scsi_add_host(host, &device->device);
+	if (ret != 0)
+		goto err_out2;
+
+	if (!dev_is_ide) {
+		scsi_scan_host(host);
+		return 0;
+	}
+	ret = scsi_add_device(host, 0, target, 0);
+	if (ret) {
+		scsi_remove_host(host);
+		goto err_out2;
+	}
+	return 0;
+
+err_out2:
+	/*
+	 * Once we have connected with the host, we would need to
+	 * to invoke storvsc_dev_remove() to rollback this state and
+	 * this call also frees up the stor_device; hence the jump around
+	 * err_out1 label.
+	 */
+	storvsc_dev_remove(device);
+	goto err_out0;
+
+err_out1:
+	kfree(stor_device);
+
+err_out0:
+	scsi_host_put(host);
+	return ret;
+}
+
+/* The one and only one */
+
+static struct hv_driver storvsc_drv = {
+	.name = KBUILD_MODNAME,
+	.id_table = id_table,
+	.probe = storvsc_probe,
+	.remove = storvsc_remove,
+};
+
+static int __init storvsc_drv_init(void)
+{
+	u32 max_outstanding_req_per_channel;
+
+	/*
+	 * Divide the ring buffer data size (which is 1 page less
+	 * than the ring buffer size since that page is reserved for
+	 * the ring buffer indices) by the max request size (which is
+	 * vmbus_channel_packet_multipage_buffer + struct vstor_packet + u64)
+	 */
+	max_outstanding_req_per_channel =
+		((storvsc_ringbuffer_size - PAGE_SIZE) /
+		ALIGN(MAX_MULTIPAGE_BUFFER_PACKET +
+		sizeof(struct vstor_packet) + sizeof(u64),
+		sizeof(u64)));
+
+	if (max_outstanding_req_per_channel <
+	    STORVSC_MAX_IO_REQUESTS)
+		return -EINVAL;
+
+	return vmbus_driver_register(&storvsc_drv);
+}
+
+static void __exit storvsc_drv_exit(void)
+{
+	vmbus_driver_unregister(&storvsc_drv);
+}
+
+MODULE_LICENSE("GPL");
+MODULE_VERSION(HV_DRV_VERSION);
+MODULE_DESCRIPTION("Microsoft Hyper-V virtual storage driver");
+module_init(storvsc_drv_init);
+module_exit(storvsc_drv_exit);
-- 
1.7.4.1


^ permalink raw reply related

* Re: [PATCH RFC V3 2/4] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks
From: Raghavendra K T @ 2011-12-07 16:31 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Jeremy Fitzhardinge, x86, Peter Zijlstra, Virtualization,
	H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang, KVM,
	Raghavendra K T, Ingo Molnar, Avi Kivity, Rik van Riel,
	Konrad Rzeszutek Wilk, Srivatsa Vaddagiri, Jeremy Fitzhardinge,
	Sasha Levin, Sedat Dilek, Thomas Gleixner, Yinghai Lu,
	Greg Kroah-Hartman, LKML, Dave Hansen
In-Reply-To: <20111207123330.GA32212@amt.cnet>

On 12/07/2011 06:03 PM, Marcelo Tosatti wrote:
> On Wed, Dec 07, 2011 at 05:24:59PM +0530, Raghavendra K T wrote:
>> On 12/07/2011 04:18 PM, Marcelo Tosatti wrote:
>> Yes you are right. It was potentially racy and it was harmful too!.
>> I had observed that it was stalling the CPU before I introduced
>> kicked flag.
>>
>> But now,
>>
>> vcpu->kicked = 1  ==>  kvm_make_request(KVM_REQ_UNHALT, vcpu); ==>
>
> Ok, please use a more descriptive name, such as "pvlock_kicked" or
> something.
>

Yes, pvlock_kicked seems good. 'll use same unless something else
flashes.

>>
>> __vcpu_run() ==>  kvm_check_request(KVM_REQ_UNHALT, vcpu) ==>
>>
>> vcpuN->mp_state = KVM_MP_STATE_RUNNABLE; so eventually we will end up
>> in RUNNABLE.
>>
>> Also Avi pointed that, logically kvm_arch_vcpu_ioctl_set_mpstate should
>> be called only in vcpu thread, so after further debugging, I noticed
>> that,  setting vcpuN->mp_state = KVM_MP_STATE_RUNNABLE; is not
>> necessary.
>> I 'll remove that in the next patch. Thanks for pointing.
>
> In fact you don't need kvm_arch_vcpu_ioctl_set_mpstate either, only the
> new "kicked" flag.
>

True indeed, I meant the same.

^ permalink raw reply

* Re: [PATCH RFC V3 2/4] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks
From: Raghavendra K T @ 2011-12-07 16:46 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jeremy Fitzhardinge, x86, Peter Zijlstra, Virtualization,
	H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang, KVM,
	Raghavendra K T, Ingo Molnar, Rik van Riel, Konrad Rzeszutek Wilk,
	Srivatsa Vaddagiri, Jeremy Fitzhardinge, Sasha Levin, Sedat Dilek,
	Thomas Gleixner, Yinghai Lu, Greg Kroah-Hartman, LKML,
	Dave Hansen, Suzuki
In-Reply-To: <4EDF7DC5.5010504@redhat.com>

On 12/07/2011 08:22 PM, Avi Kivity wrote:
> On 12/07/2011 03:39 PM, Marcelo Tosatti wrote:
>>> Also I think we can keep the kicked flag in vcpu->requests, no need for
>>> new storage.
>>
>> Was going to suggest it but it violates the currently organized
>> processing of entries at the beginning of vcpu_enter_guest.
>>
>> That is, this "kicked" flag is different enough from vcpu->requests
>> processing that a separate variable seems worthwhile (even more
>> different with convertion to MP_STATE at KVM_GET_MP_STATE).
>
> IMO, it's similar to KVM_REQ_EVENT (which can also cause mpstate to
> change due to apic re-evaluation).
>

Ok, So what I understand is we have to either :
1. retain current kick flag AS-IS but would have to make it migration 
friendly. [I still have to get more familiar with migration side]
or
2. introduce notion similar to KVM_REQ_PVLOCK_KICK(??) to be part of
vcpu->requests.

So what would be better? Please let me know.

^ permalink raw reply

* Re: [net-next RFC PATCH 0/5] Series short description
From: Ben Hutchings @ 2011-12-07 17:02 UTC (permalink / raw)
  To: Jason Wang, Rusty Russell
  Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928
In-Reply-To: <4EDF4E82.7050201@redhat.com>

On Wed, 2011-12-07 at 19:31 +0800, Jason Wang wrote:
> On 12/07/2011 03:30 PM, Rusty Russell wrote:
> > On Mon, 05 Dec 2011 16:58:37 +0800, Jason Wang<jasowang@redhat.com>  wrote:
> >> multiple queue virtio-net: flow steering through host/guest cooperation
> >>
> >> Hello all:
> >>
> >> This is a rough series adds the guest/host cooperation of flow
> >> steering support based on Krish Kumar's multiple queue virtio-net
> >> driver patch 3/3 (http://lwn.net/Articles/467283/).
> > Is there a real (physical) device which does this kind of thing?  How do
> > they do it?  Can we copy them?
> >
> > Cheers,
> > Rusty.
> As far as I see, ixgbe and sfc have similar but much more sophisticated 
> mechanism.
> 
> The idea was originally suggested by Ben and it was just borrowed form 
> those real physical nic cards who can dispatch packets based on their 
> hash. All of theses cards can filter the flow based on the hash of 
> L2/L3/L4 header and the stack would tell the card which queue should 
> this flow goes.

Solarflare controllers (sfc driver) have 8192 perfect filters for
TCP/IPv4 and UDP/IPv4 which can be used for flow steering.  (The filters
are organised as a hash table, but matched based on 5-tuples.)  I
implemented the 'accelerated RFS' interface in this driver.

I believe the Intel 82599 controllers (ixgbe driver) have both
hash-based and perfect filter modes and the driver can be configured to
use one or the other.  The driver has its own independent mechanism for
steering RX and TX flows which predates RFS; I don't know whether it
uses hash-based or perfect filters.

Most multi-queue controllers could support a kind of hash-based
filtering for TCP/IP by adjusting the RSS indirection table.  However,
this table is usually quite small (64-256 entries).  This means that
hash collisions will be quite common and this can result in reordering.
The same applies to the small table Jason has proposed for virtio-net.

> So in host, a simple hash to queue table were introduced in tap/macvtap 
> and in guest, the guest driver would tell the desired queue of a flow 
> through changing this table.

I don't think accelerated RFS can work well without the use of perfect
filtering or hash-based filtering with a very low rate of collisions.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: David Miller @ 2011-12-07 18:52 UTC (permalink / raw)
  To: mst; +Cc: krkumar2, arnd, netdev, virtualization, levinsasha928
In-Reply-To: <20111207161001.GD23845@redhat.com>

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Wed, 7 Dec 2011 18:10:02 +0200

> On Fri, Nov 25, 2011 at 01:35:52AM -0500, David Miller wrote:
>> From: Krishna Kumar2 <krkumar2@in.ibm.com>
>> Date: Fri, 25 Nov 2011 09:39:11 +0530
>> 
>> > Jason Wang <jasowang@redhat.com> wrote on 11/25/2011 08:51:57 AM:
>> >>
>> >> My description is not clear again :(
>> >> I mean the same vhost thead:
>> >>
>> >> vhost thread #0 transmits packets of flow A on processor M
>> >> ...
>> >> vhost thread #0 move to another process N and start to transmit packets
>> >> of flow A
>> > 
>> > Thanks for clarifying. Yes, binding vhosts to CPU's
>> > makes the incoming packet go to the same vhost each
>> > time. BTW, are you doing any binding and/or irqbalance
>> > when you run your tests? I am not running either at
>> > this time, but thought both might be useful.
>> 
>> So are we going with this patch or are we saying that vhost binding
>> is a requirement?
> 
> OK we didn't come to a conclusion so I would be inclined
> to merge this patch as is for 3.2, and revisit later.
> One question though: do these changes affect userspace
> in any way? For example, will this commit us to
> ensure that a single flow gets a unique hash even
> for strange configurations that transmit the same flow
> from multiple cpus?

Once you sort this out, reply with an Acked-by: for me, thanks.

^ permalink raw reply

* Re: [PATCH RFC] virtio_net: fix refill related races
From: Rusty Russell @ 2011-12-08  4:37 UTC (permalink / raw)
  To: Amit Shah; +Cc: netdev, virtualization, linux-kernel, Michael S. Tsirkin
In-Reply-To: <20111207152120.GA23417@redhat.com>

On Wed, 7 Dec 2011 17:21:22 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> Fix theoretical races related to refill work:
> 1. After napi is disabled by ndo_stop, refill work
>    can run and re-enable it.
> 2. Refill can reschedule itself, if this happens
>    it can run after cancel_delayed_work_sync,
>    and will access device after it is destroyed.
> 
> As a solution, add flags to track napi state and
> to disable refill, and toggle them on start, stop
> and remove; check these flags on refill.

Why isn't a "dont-readd" flag sufficient?

Cheers,
Rusty.

^ permalink raw reply

* Re: [PATCH RFC V3 2/4] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks
From: Avi Kivity @ 2011-12-08  9:40 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Jeremy Fitzhardinge, x86, Peter Zijlstra, Virtualization,
	H. Peter Anvin, Stefano Stabellini, Xen, Dave Jiang, KVM,
	Raghavendra K T, Ingo Molnar, Rik van Riel, Konrad Rzeszutek Wilk,
	Srivatsa Vaddagiri, Jeremy Fitzhardinge, Sasha Levin, Sedat Dilek,
	Thomas Gleixner, Yinghai Lu, Greg Kroah-Hartman, LKML,
	Dave Hansen, Suzuki
In-Reply-To: <4EDF987B.8040900@linux.vnet.ibm.com>

On 12/07/2011 06:46 PM, Raghavendra K T wrote:
> On 12/07/2011 08:22 PM, Avi Kivity wrote:
>> On 12/07/2011 03:39 PM, Marcelo Tosatti wrote:
>>>> Also I think we can keep the kicked flag in vcpu->requests, no need
>>>> for
>>>> new storage.
>>>
>>> Was going to suggest it but it violates the currently organized
>>> processing of entries at the beginning of vcpu_enter_guest.
>>>
>>> That is, this "kicked" flag is different enough from vcpu->requests
>>> processing that a separate variable seems worthwhile (even more
>>> different with convertion to MP_STATE at KVM_GET_MP_STATE).
>>
>> IMO, it's similar to KVM_REQ_EVENT (which can also cause mpstate to
>> change due to apic re-evaluation).
>>
>
> Ok, So what I understand is we have to either :
> 1. retain current kick flag AS-IS but would have to make it migration
> friendly. [I still have to get more familiar with migration side]
> or
> 2. introduce notion similar to KVM_REQ_PVLOCK_KICK(??) to be part of
> vcpu->requests.
>
> So what would be better? Please let me know.
>

IMO, KVM_REQ.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply

* Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors
From: Rusty Russell @ 2011-12-08  9:44 UTC (permalink / raw)
  To: Michael S. Tsirkin, Sasha Levin
  Cc: markmc, virtualization, Avi Kivity, kvm, linux-kernel
In-Reply-To: <20111207154816.GA23845@redhat.com>

On Wed, 7 Dec 2011 17:48:17 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Wed, Dec 07, 2011 at 04:02:45PM +0200, Sasha Levin wrote:
> > On Sun, 2011-12-04 at 20:23 +0200, Sasha Levin wrote:
> > 
> > [snip]
> > 
> > Rusty, Michael, does the below looks a reasonable optimization for you?
> 
> OK overall but a bit hard to say for sure as it looks pretty incomplete ...

A static threshold is very hackish; we need to either initialize it to
a proven-good value (since noone will ever change it) or be cleverer.

Thanks,
Rusty.

^ permalink raw reply

* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Jason Wang @ 2011-12-08  9:46 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: krkumar2, arnd, netdev, virtualization, levinsasha928,
	David Miller
In-Reply-To: <20111207161001.GD23845@redhat.com>

On 12/08/2011 12:10 AM, Michael S. Tsirkin wrote:
> On Fri, Nov 25, 2011 at 01:35:52AM -0500, David Miller wrote:
>> From: Krishna Kumar2<krkumar2@in.ibm.com>
>> Date: Fri, 25 Nov 2011 09:39:11 +0530
>>
>>> Jason Wang<jasowang@redhat.com>  wrote on 11/25/2011 08:51:57 AM:
>>>> My description is not clear again :(
>>>> I mean the same vhost thead:
>>>>
>>>> vhost thread #0 transmits packets of flow A on processor M
>>>> ...
>>>> vhost thread #0 move to another process N and start to transmit packets
>>>> of flow A
>>> Thanks for clarifying. Yes, binding vhosts to CPU's
>>> makes the incoming packet go to the same vhost each
>>> time. BTW, are you doing any binding and/or irqbalance
>>> when you run your tests? I am not running either at
>>> this time, but thought both might be useful.
>> So are we going with this patch or are we saying that vhost binding
>> is a requirement?
> OK we didn't come to a conclusion so I would be inclined
> to merge this patch as is for 3.2, and revisit later.
> One question though: do these changes affect userspace
> in any way? For example, will this commit us to
> ensure that a single flow gets a unique hash even
> for strange configurations that transmit the same flow
> from multiple cpus?
>
The hash were generated by either host kernel or host nic, so I think 
it's unique except for the nic that would provide different hashes for a 
flow. I wonder whether there's a such nic.

^ permalink raw reply

* Re: [net-next RFC PATCH 0/5] Series short description
From: Jason Wang @ 2011-12-08 10:06 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: krkumar2, kvm, mst, netdev, virtualization, levinsasha928
In-Reply-To: <1323277324.2728.34.camel@bwh-desktop>

On 12/08/2011 01:02 AM, Ben Hutchings wrote:
> On Wed, 2011-12-07 at 19:31 +0800, Jason Wang wrote:
>> On 12/07/2011 03:30 PM, Rusty Russell wrote:
>>> On Mon, 05 Dec 2011 16:58:37 +0800, Jason Wang<jasowang@redhat.com>   wrote:
>>>> multiple queue virtio-net: flow steering through host/guest cooperation
>>>>
>>>> Hello all:
>>>>
>>>> This is a rough series adds the guest/host cooperation of flow
>>>> steering support based on Krish Kumar's multiple queue virtio-net
>>>> driver patch 3/3 (http://lwn.net/Articles/467283/).
>>> Is there a real (physical) device which does this kind of thing?  How do
>>> they do it?  Can we copy them?
>>>
>>> Cheers,
>>> Rusty.
>> As far as I see, ixgbe and sfc have similar but much more sophisticated
>> mechanism.
>>
>> The idea was originally suggested by Ben and it was just borrowed form
>> those real physical nic cards who can dispatch packets based on their
>> hash. All of theses cards can filter the flow based on the hash of
>> L2/L3/L4 header and the stack would tell the card which queue should
>> this flow goes.
> Solarflare controllers (sfc driver) have 8192 perfect filters for
> TCP/IPv4 and UDP/IPv4 which can be used for flow steering.  (The filters
> are organised as a hash table, but matched based on 5-tuples.)  I
> implemented the 'accelerated RFS' interface in this driver.
>
> I believe the Intel 82599 controllers (ixgbe driver) have both
> hash-based and perfect filter modes and the driver can be configured to
> use one or the other.  The driver has its own independent mechanism for
> steering RX and TX flows which predates RFS; I don't know whether it
> uses hash-based or perfect filters.

As far as I see, their driver predates RFS by binding the TX queue and 
RX queue to the same CPU and adding hash based filter during packet 
transmission.

> Most multi-queue controllers could support a kind of hash-based
> filtering for TCP/IP by adjusting the RSS indirection table.  However,
> this table is usually quite small (64-256 entries).  This means that
> hash collisions will be quite common and this can result in reordering.
> The same applies to the small table Jason has proposed for virtio-net.
>

Thanks for the clarification. Consider the hash were provided by host 
nic or host kernel, the collision rate is not fixed. Perfect filter is 
more suitable then.
>> So in host, a simple hash to queue table were introduced in tap/macvtap
>> and in guest, the guest driver would tell the desired queue of a flow
>> through changing this table.
> I don't think accelerated RFS can work well without the use of perfect
> filtering or hash-based filtering with a very low rate of collisions.
>
> Ben.
>

^ permalink raw reply

* [PATCH 0/11] RFC: PCI using capabilitities
From: Rusty Russell @ 2011-12-08 10:22 UTC (permalink / raw)
  To: virtualization; +Cc: Avi Kivity, Sasha Levin, kvm, Michael S. Tsirkin

Here's the patch series I ended up with.  I haven't coded up the QEMU
side yet, so no idea if the new driver works.

Questions:
(1) Do we win from separating ISR, NOTIFY and COMMON?
(2) I used a "u8 bar"; should I use a bir and pack it instead?  BIR
    seems a little obscure (noone else in the kernel source seems to
    refer to it).

Cheers,
Rusty.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox