* [RFC PATCH v1 1/2] virtio_net: Pass gfp flags when allocating rx buffers.
From: Mike Waychison @ 2012-01-04 22:52 UTC (permalink / raw)
To: Rusty Russell, Michael S. Tsirkin; +Cc: netdev, linux-kernel, virtualization
In-Reply-To: <20120104225223.18184.1537.stgit@mike2.sea.corp.google.com>
Currently, the refill path for RX buffers will always allocate the
buffers as GFP_ATOMIC, even if we are in process context. This will
fail to apply memory pressure as the worker thread will not contribute
to the freeing of memory.
Fix this by changing add_recvbuf_small to use the gfp variant allocator,
__netdev_alloc_skb_ip_align().
Signed-off-by: Mike Waychison <mikew@google.com>
---
drivers/net/virtio_net.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 2055386..76fe14e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -156,6 +156,7 @@ static void set_skb_frag(struct sk_buff *skb, struct page *page,
*len -= size;
}
+/* Called from bottom half context */
static struct sk_buff *page_to_skb(struct virtnet_info *vi,
struct page *page, unsigned int len)
{
@@ -358,7 +359,7 @@ static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
struct skb_vnet_hdr *hdr;
int err;
- skb = netdev_alloc_skb_ip_align(vi->dev, MAX_PACKET_LEN);
+ skb = __netdev_alloc_skb_ip_align(vi->dev, MAX_PACKET_LEN, gfp);
if (unlikely(!skb))
return -ENOMEM;
^ permalink raw reply related
* [RFC PATCH v1 0/2] virtio_net: Better low memory handling.
From: Mike Waychison @ 2012-01-04 22:52 UTC (permalink / raw)
To: Rusty Russell, Michael S. Tsirkin; +Cc: netdev, linux-kernel, virtualization
The following series applies to net-next.
The following series changes the low memory paths in virtio_net to allow
the driver to contribute to reclaim when memory is tight.
It attempts to rectify some performance problems we've seen where the
network performance drops significantly when memory is low. The working
theory is that while the driver contributes to memory pressure when
throughput is high, it does not contribute to reclaim when memory is
low.
The observed situation on an unmodified kernel is that a system with low
memory will in turn quickly stop being being able to refill the rx queue
with buffers, in turn causing packets to be dropped by the host OS while
the driver waits for somebody else to free up memory. The way the
process-context refill loop works, the memory subsystem is effectively
polled every half second when things look bleak, leading to significant
packet loss and congestion window collapse.
The first patch rectifies the "not contributing to reclaim" by letting
the driver try to allocate as GFP_KERNEL when run from process context.
The second patch is a bit more complicated. It essentially removes the
serialization that is currently in place built around enabling and
disabling napi polling, and replaces it by protecting the underlying
virtqueue accesses with a bottom-half spinlock. As well, in order to
continue allocating as GFP_KERNEL in the slow path allocation of buffers
and adding them to the virtqueue for device use is split apart. To try
and amortize the locking, batching is used.
This patchset seems to help the test setups that I'm playing with.
Example tests include using several bit-torrent clients within a VM and
watching the network throughputs on the host. Other similar workloads
(high network traffic, reasonably high disk IO causing memory pressure)
also appear to have throughput problems alleviated by these changes.
As a warning, I've only really tested this using the "small buffers"
paths. The devices I'm using do not yet support "mergeable" or "big"
receive buffers.
---
Mike Waychison (2):
virtio_net: Pass gfp flags when allocating rx buffers.
virtio_net: Don't disable napi on low memory.
drivers/net/virtio_net.c | 213 +++++++++++++++++++++++++++++++++++-----------
1 files changed, 162 insertions(+), 51 deletions(-)
^ permalink raw reply
* Re: [Xen-devel] [PATCH 2/2] xen/xenbus: don't reimplement kvasprintf via a fixed size buffer
From: Konrad Rzeszutek Wilk @ 2012-01-04 15:12 UTC (permalink / raw)
To: Jan Beulich
Cc: Jeremy Fitzhardinge, xen-devel, Ian Campbell, linux-kernel,
virtualization, Haogang Chen
In-Reply-To: <4F044D10020000780006A5B4@nat28.tlf.novell.com>
On Wed, Jan 04, 2012 at 11:58:56AM +0000, Jan Beulich wrote:
> >>> On 04.01.12 at 12:39, Ian Campbell <ian.campbell@citrix.com> wrote:
> > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
>
> Acked-by: Jan Beulich <jbeulich@suse.com>
OK, took:
Ian Campbell (3):
xen/xenbus: Reject replies with payload > XENSTORE_PAYLOAD_MAX.
xenbus: maximum buffer size is XENSTORE_PAYLOAD_MAX
xen/xenbus: don't reimplement kvasprintf via a fixed size buffer
And put the ' xen/xenbus: Reject replies with payload > XENSTORE_PAYLOAD_MAX'
on the stable@kernel.org as well.
Thanks!
^ permalink raw reply
* [PATCH 2/2] xen/xenbus: don't reimplement kvasprintf via a fixed size buffer
From: Ian Campbell @ 2012-01-04 11:39 UTC (permalink / raw)
Cc: Jeremy Fitzhardinge, xen-devel, Ian Campbell,
Konrad Rzeszutek Wilk, linux-kernel, virtualization, Haogang Chen
In-Reply-To: <1325677158.25206.244.camel@zakaz.uk.xensource.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Haogang Chen <haogangchen@gmail.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: xen-devel@lists.xensource.com
Cc: virtualization@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
---
drivers/xen/xenbus/xenbus_xs.c | 17 +++++++----------
1 files changed, 7 insertions(+), 10 deletions(-)
diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
index 6f0121e..226d1ac 100644
--- a/drivers/xen/xenbus/xenbus_xs.c
+++ b/drivers/xen/xenbus/xenbus_xs.c
@@ -532,21 +532,18 @@ int xenbus_printf(struct xenbus_transaction t,
{
va_list ap;
int ret;
-#define PRINTF_BUFFER_SIZE 4096
- char *printf_buffer;
-
- printf_buffer = kmalloc(PRINTF_BUFFER_SIZE, GFP_NOIO | __GFP_HIGH);
- if (printf_buffer == NULL)
- return -ENOMEM;
+ char *buf;
va_start(ap, fmt);
- ret = vsnprintf(printf_buffer, PRINTF_BUFFER_SIZE, fmt, ap);
+ buf = kvasprintf(GFP_NOIO | __GFP_HIGH, fmt, ap);
va_end(ap);
- BUG_ON(ret > PRINTF_BUFFER_SIZE-1);
- ret = xenbus_write(t, dir, node, printf_buffer);
+ if (!buf)
+ return -ENOMEM;
+
+ ret = xenbus_write(t, dir, node, buf);
- kfree(printf_buffer);
+ kfree(buf);
return ret;
}
--
1.7.2.5
^ permalink raw reply related
* [PATCH 1/2] xenbus: maximum buffer size is XENSTORE_PAYLOAD_MAX
From: Ian Campbell @ 2012-01-04 11:39 UTC (permalink / raw)
Cc: Jeremy Fitzhardinge, xen-devel, Ian Campbell,
Konrad Rzeszutek Wilk, linux-kernel, virtualization, Haogang Chen
In-Reply-To: <1325677158.25206.244.camel@zakaz.uk.xensource.com>
Use this now that it is defined even though it happens to be == PAGE_SIZE.
The code which takes requests from userspace already validates against the size
of this buffer so no further checks are required to ensure that userspace
requests comply with the protocol in this respect.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Haogang Chen <haogangchen@gmail.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: xen-devel@lists.xensource.com
Cc: virtualization@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
---
drivers/xen/xenbus/xenbus_dev_frontend.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/xen/xenbus/xenbus_dev_frontend.c b/drivers/xen/xenbus/xenbus_dev_frontend.c
index fb30cff..1fe4324 100644
--- a/drivers/xen/xenbus/xenbus_dev_frontend.c
+++ b/drivers/xen/xenbus/xenbus_dev_frontend.c
@@ -104,7 +104,7 @@ struct xenbus_file_priv {
unsigned int len;
union {
struct xsd_sockmsg msg;
- char buffer[PAGE_SIZE];
+ char buffer[XENSTORE_PAYLOAD_MAX];
} u;
/* Response queue. */
--
1.7.2.5
^ permalink raw reply related
* [PATCH 0/2] xen: Miscelaneous xenbus cleanups
From: Ian Campbell @ 2012-01-04 11:39 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk
Cc: Jeremy Fitzhardinge, Haogang Chen,
virtualization@lists.linux-foundation.org,
xen-devel@lists.xensource.com, linux-kernel@vger.kernel.org
In-Reply-To: <1325669689.25206.181.camel@zakaz.uk.xensource.com>
Just a couple of things I noticed while reviewing Haogang's patch.
Applies on top of my suggested replacement for that patch (in
<1325669689.25206.181.camel@zakaz.uk.xensource.com>).
Not extensively tested but I did run it in dom0 and start both and HVM
and PV guest.
Ian.
^ permalink raw reply
* Re: [PATCH] XEN: xenbus: integer overflow in process_msg()
From: Ian Campbell @ 2012-01-04 9:34 UTC (permalink / raw)
To: Haogang Chen
Cc: Jeremy Fitzhardinge, xen-devel@lists.xensource.com,
linux-kernel@vger.kernel.org, Konrad Rzeszutek Wilk,
virtualization@lists.linux-foundation.org
In-Reply-To: <1325669095.25206.176.camel@zakaz.uk.xensource.com>
On Wed, 2012-01-04 at 09:24 +0000, Ian Campbell wrote:
> On Tue, 2012-01-03 at 19:42 +0000, Haogang Chen wrote:
> > There is a potential integer overflow in process_msg() that could result
> > in cross-domain attack.
> >
> > body = kmalloc(msg->hdr.len + 1, GFP_NOIO | __GFP_HIGH);
> >
> > When a malicious guest passes 0xffffffff in msg->hdr.len, the subsequent
> > call to xb_read() would write to a zero-length buffer.
>
> The other end of this connection is always the xenstore backend daemon
> so there is no guest (malicious or otherwise) which can do this. The
> xenstore daemon is a trusted component in the system.
>
> However this seem like a reasonable robustness improvement so we should
> have it.
Actually, rereading docs/misc/xenstore.txt I see:
The payload length (len field of the header) is limited to 4096
(XENSTORE_PAYLOAD_MAX) in both directions. If a client exceeds the
limit, its xenstored connection will be immediately killed by
xenstored, which is usually catastrophic from the client's point of
view. Clients (particularly domains, which cannot just reconnect)
should avoid this.
So probably we actually want (untested, but seems obvious enough):
8<---------------------------------------------------------
From a010c48fe67ef15495884c265ec6af8a688795f8 Mon Sep 17 00:00:00 2001
From: Ian Campbell <ian.campbell@citrix.com>
Date: Wed, 4 Jan 2012 09:29:41 +0000
Subject: [PATCH] xen/xenbus: Reject replies with payload > XENSTORE_PAYLOAD_MAX.
This also avoids a potential integer overflow pointed out by Haogang Chen.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Haogang Chen <haogangchen@gmail.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: xen-devel@lists.xensource.com
Cc: virtualization@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
---
drivers/xen/xenbus/xenbus_xs.c | 6 ++++++
include/xen/interface/io/xs_wire.h | 3 +++
2 files changed, 9 insertions(+), 0 deletions(-)
diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
index b3b8f2f..6f0121e 100644
--- a/drivers/xen/xenbus/xenbus_xs.c
+++ b/drivers/xen/xenbus/xenbus_xs.c
@@ -810,6 +810,12 @@ static int process_msg(void)
goto out;
}
+ if (msg->hdr.len > XENSTORE_PAYLOAD_MAX) {
+ kfree(msg);
+ err = -EINVAL;
+ goto out;
+ }
+
body = kmalloc(msg->hdr.len + 1, GFP_NOIO | __GFP_HIGH);
if (body == NULL) {
kfree(msg);
diff --git a/include/xen/interface/io/xs_wire.h b/include/xen/interface/io/xs_wire.h
index f0b6890..3c1877c 100644
--- a/include/xen/interface/io/xs_wire.h
+++ b/include/xen/interface/io/xs_wire.h
@@ -88,4 +88,7 @@ struct xenstore_domain_interface {
XENSTORE_RING_IDX rsp_cons, rsp_prod;
};
+/* Violating this is very bad. See docs/misc/xenstore.txt. */
+#define XENSTORE_PAYLOAD_MAX 4096
+
#endif /* _XS_WIRE_H */
--
1.7.2.5
^ permalink raw reply related
* Re: [PATCH] XEN: xenbus: integer overflow in process_msg()
From: Ian Campbell @ 2012-01-04 9:24 UTC (permalink / raw)
To: Haogang Chen
Cc: Jeremy Fitzhardinge, xen-devel@lists.xensource.com,
virtualization@lists.linux-foundation.org,
linux-kernel@vger.kernel.org, Konrad Rzeszutek Wilk
In-Reply-To: <1325619731-13936-1-git-send-email-haogangchen@gmail.com>
On Tue, 2012-01-03 at 19:42 +0000, Haogang Chen wrote:
> There is a potential integer overflow in process_msg() that could result
> in cross-domain attack.
>
> body = kmalloc(msg->hdr.len + 1, GFP_NOIO | __GFP_HIGH);
>
> When a malicious guest passes 0xffffffff in msg->hdr.len, the subsequent
> call to xb_read() would write to a zero-length buffer.
The other end of this connection is always the xenstore backend daemon
so there is no guest (malicious or otherwise) which can do this. The
xenstore daemon is a trusted component in the system.
However this seem like a reasonable robustness improvement so we should
have it.
> This causes
> kernel oops in the receiving guest and hangs its xenbus kernel thread.
> The patch returns -EINVAL in that case.
>
> Signed-off-by: Haogang Chen <haogangchen@gmail.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
> ---
> drivers/xen/xenbus/xenbus_xs.c | 6 ++++++
> 1 files changed, 6 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
> index ede860f..e32aefb 100644
> --- a/drivers/xen/xenbus/xenbus_xs.c
> +++ b/drivers/xen/xenbus/xenbus_xs.c
> @@ -801,6 +801,12 @@ static int process_msg(void)
> goto out;
> }
>
> + if (msg->hdr.len == UINT_MAX) {
> + kfree(msg);
> + err = -EINVAL;
> + goto out;
> + }
> +
> body = kmalloc(msg->hdr.len + 1, GFP_NOIO | __GFP_HIGH);
> if (body == NULL) {
> kfree(msg);
^ permalink raw reply
* [PATCH] (please ack, don't apply!) virtio: net: Add freeze, restore handlers to support S4
From: Rusty Russell @ 2012-01-04 3:00 UTC (permalink / raw)
To: netdev; +Cc: Amit Shah, Michael S. Tsirkin, Virtualization List
In-Reply-To: <87obukcgzi.fsf@rustcorp.com.au>
(This needs PM changes to virtio_pci which are in my tree, so best done
via that rather than net-next. Thanks!)
From: Amit Shah <amit.shah@redhat.com>
Remove all the vqs, disable napi and detach from the netdev on
hibernation.
Re-create vqs after restoring from a hibernated image, re-enable napi
and re-attach the netdev. This keeps networking working across
hibernation.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
drivers/net/virtio_net.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 46 insertions(+)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1159,6 +1159,48 @@ static void __devexit virtnet_remove(str
free_netdev(vi->dev);
}
+#ifdef CONFIG_PM
+static int virtnet_freeze(struct virtio_device *vdev)
+{
+ struct virtnet_info *vi = vdev->priv;
+
+ virtqueue_disable_cb(vi->rvq);
+ virtqueue_disable_cb(vi->svq);
+ if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ))
+ virtqueue_disable_cb(vi->cvq);
+
+ netif_device_detach(vi->dev);
+ cancel_delayed_work_sync(&vi->refill);
+
+ if (netif_running(vi->dev))
+ napi_disable(&vi->napi);
+
+ remove_vq_common(vi);
+
+ return 0;
+}
+
+static int virtnet_restore(struct virtio_device *vdev)
+{
+ struct virtnet_info *vi = vdev->priv;
+ int err;
+
+ err = init_vqs(vi);
+ if (err)
+ return err;
+
+ if (netif_running(vi->dev))
+ virtnet_napi_enable(vi);
+
+ netif_device_attach(vi->dev);
+
+ if (!try_fill_recv(vi, GFP_KERNEL))
+ queue_delayed_work(system_nrt_wq, &vi->refill, 0);
+
+ return 0;
+}
+#endif
+
static struct virtio_device_id id_table[] = {
{ VIRTIO_ID_NET, VIRTIO_DEV_ANY_ID },
{ 0 },
@@ -1183,6 +1225,10 @@ static struct virtio_driver virtio_net_d
.probe = virtnet_probe,
.remove = __devexit_p(virtnet_remove),
.config_changed = virtnet_config_changed,
+#ifdef CONFIG_PM
+ .freeze = virtnet_freeze,
+ .restore = virtnet_restore,
+#endif
};
static int __init init(void)
^ permalink raw reply
* [PATCH 2/2] virtio: net: Move vq and vq buf removal into separate function
From: Rusty Russell @ 2012-01-04 2:58 UTC (permalink / raw)
To: netdev; +Cc: Amit Shah, Michael S. Tsirkin, Virtualization List
In-Reply-To: <87obukcgzi.fsf@rustcorp.com.au>
From: Amit Shah <amit.shah@redhat.com>
The remove and PM freeze functions will share this code.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
drivers/net/virtio_net.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ec8c400..7a2a5bf 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1134,22 +1134,26 @@ static void free_unused_bufs(struct virtnet_info *vi)
BUG_ON(vi->num != 0);
}
-static void __devexit virtnet_remove(struct virtio_device *vdev)
+static void remove_vq_common(struct virtnet_info *vi)
{
- struct virtnet_info *vi = vdev->priv;
-
- /* Stop all the virtqueues. */
- vdev->config->reset(vdev);
-
- unregister_netdev(vi->dev);
+ vi->vdev->config->reset(vi->vdev);
/* Free unused buffers in both send and recv, if any. */
free_unused_bufs(vi);
- vdev->config->del_vqs(vi->vdev);
+ vi->vdev->config->del_vqs(vi->vdev);
while (vi->pages)
__free_pages(get_a_page(vi, GFP_KERNEL), 0);
+}
+
+static void __devexit virtnet_remove(struct virtio_device *vdev)
+{
+ struct virtnet_info *vi = vdev->priv;
+
+ unregister_netdev(vi->dev);
+
+ remove_vq_common(vi);
free_percpu(vi->stats);
free_netdev(vi->dev);
^ permalink raw reply related
* [PATCH 1/2] virtio: net: Move vq initialization into separate function
From: Rusty Russell @ 2012-01-04 2:57 UTC (permalink / raw)
To: netdev; +Cc: Amit Shah, Michael S. Tsirkin, Virtualization List
From: Amit Shah <amit.shah@redhat.com>
The probe and PM restore functions will share this code.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
drivers/net/virtio_net.c | 47 ++++++++++++++++++++++++++++-------------------
1 file changed, 28 insertions(+), 19 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 1b67e8a..ec8c400 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -966,15 +966,38 @@ static void virtnet_config_changed(struct virtio_device *vdev)
virtnet_update_status(vi);
}
+static int init_vqs(struct virtnet_info *vi)
+{
+ struct virtqueue *vqs[3];
+ vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
+ const char *names[] = { "input", "output", "control" };
+ int nvqs, err;
+
+ /* We expect two virtqueues, receive then send,
+ * and optionally control. */
+ nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ? 3 : 2;
+
+ err = vi->vdev->config->find_vqs(vi->vdev, nvqs, vqs, callbacks, names);
+ if (err)
+ return err;
+
+ vi->rvq = vqs[0];
+ vi->svq = vqs[1];
+
+ if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
+ vi->cvq = vqs[2];
+
+ if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
+ vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
+ }
+ return 0;
+}
+
static int virtnet_probe(struct virtio_device *vdev)
{
int err;
struct net_device *dev;
struct virtnet_info *vi;
- struct virtqueue *vqs[3];
- vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
- const char *names[] = { "input", "output", "control" };
- int nvqs;
/* Allocate ourselves a network device with room for our info */
dev = alloc_etherdev(sizeof(struct virtnet_info));
@@ -1046,24 +1069,10 @@ static int virtnet_probe(struct virtio_device *vdev)
if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
vi->mergeable_rx_bufs = true;
- /* We expect two virtqueues, receive then send,
- * and optionally control. */
- nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ? 3 : 2;
-
- err = vdev->config->find_vqs(vdev, nvqs, vqs, callbacks, names);
+ err = init_vqs(vi);
if (err)
goto free_stats;
- vi->rvq = vqs[0];
- vi->svq = vqs[1];
-
- if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
- vi->cvq = vqs[2];
-
- if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
- dev->features |= NETIF_F_HW_VLAN_FILTER;
- }
-
err = register_netdev(dev);
if (err) {
pr_debug("virtio_net: registering device failed\n");
^ permalink raw reply related
* Re: [PATCH v6 09/11] virtio: net: Add freeze, restore handlers to support S4
From: Rusty Russell @ 2012-01-04 2:49 UTC (permalink / raw)
To: Amit Shah, Michael S. Tsirkin
Cc: netdev, linux-pm, linux-kernel, Virtualization List
In-Reply-To: <20120103170748.GE4536@amit.redhat.com>
On Tue, 3 Jan 2012 22:37:48 +0530, Amit Shah <amit.shah@redhat.com> wrote:
> On (Sun) 01 Jan 2012 [13:47:27], Michael S. Tsirkin wrote:
> > On Fri, Dec 30, 2011 at 03:08:19PM +0530, Amit Shah wrote:
> > > >
> > > > > + if (!try_fill_recv(vi, GFP_KERNEL))
> > > > > + schedule_delayed_work(&vi->refill, 0);
> > > >
> > > > This needs to be switched to non reentrant wq too?
> > >
> > > Subsequent invocations could create problems? Note that this will be
> > > the first item to be queued in the workqueue for the refill work.
> > >
> > > Amit
> >
> > See recent patches on net-next.
> > We switched other calls to a non reentrant wq so I think same
> > reasoning applies there.
>
> There's a patch ordering issue; so this series should land after the
> -net series is merged.
>
> Rusty, do you want me to re-spin this patch on top of yours, or is
> that something that you'll handle?
Indeed, this can't go via -net, since it needs the virtio-pci patches
which change power management.
I've changed that one line, and will post those patches for DaveM's ack.
Thanks,
Rusty.
^ permalink raw reply
* Re: [patch] virtio_blk: lock() => unlock() typo in virtblk_config_changed_work()
From: Rusty Russell @ 2012-01-04 2:32 UTC (permalink / raw)
To: Michael S. Tsirkin, Dan Carpenter; +Cc: kernel-janitors, virtualization
In-Reply-To: <20120103153127.GA8440@redhat.com>
On Tue, 3 Jan 2012 17:31:28 +0200, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Tue, Jan 03, 2012 at 05:45:13PM +0300, Dan Carpenter wrote:
> > This will deadlock.
> >
> > Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> >
> > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > index f3d77b3..ba73661 100644
> > --- a/drivers/block/virtio_blk.c
> > +++ b/drivers/block/virtio_blk.c
> > @@ -352,7 +352,7 @@ static void virtblk_config_changed_work(struct work_struct *work)
> >
> > set_capacity(vblk->disk, capacity);
> > done:
> > - mutex_lock(&vblk->config_lock);
> > + mutex_unlock(&vblk->config_lock);
> > }
> >
> > static void virtblk_config_changed(struct virtio_device *vdev)
>
> Good catch.
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
Indeed, I folded it into that patch.
Thanks!
Rusty.
^ permalink raw reply
* [PATCH 1/1][RESEND] Staging: hv: storvsc: Move the storage driver out of the staging area
From: K. Y. Srinivasan @ 2012-01-03 20:27 UTC (permalink / raw)
To: gregkh, linux-kernel, devel, virtualization, ohering,
James.Bottomley, hch, linux-scsi
Cc: K. Y. Srinivasan
The storage driver (storvsc_drv.c) handles all block storage devices
assigned to Linux guests hosted on Hyper-V. This driver has been in the
staging tree for a while and this patch moves it out of the staging area.
As per Greg's recommendation, this patch makes no changes to the staging/hv
directory. Once the driver moves out of staging, we will cleanup the
staging/hv directory.
This patch includes all the patches that I have sent against the staging/hv
tree to address the comments I have gotten to date on this storage driver.
Greg has already applied all these patches to the staging tree.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
---
drivers/scsi/Kconfig | 7 +
drivers/scsi/Makefile | 3 +
drivers/scsi/storvsc_drv.c | 1586 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 1596 insertions(+), 0 deletions(-)
create mode 100644 drivers/scsi/storvsc_drv.c
diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
index 06ea3bc..4910269 100644
--- a/drivers/scsi/Kconfig
+++ b/drivers/scsi/Kconfig
@@ -662,6 +662,13 @@ config VMWARE_PVSCSI
To compile this driver as a module, choose M here: the
module will be called vmw_pvscsi.
+config HYPERV_STORAGE
+ tristate "Microsoft Hyper-V virtual storage driver"
+ depends on SCSI && HYPERV
+ default HYPERV
+ help
+ Select this option to enable the Hyper-V virtual storage driver.
+
config LIBFC
tristate "LibFC module"
select SCSI_FC_ATTRS
diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
index 2b88749..e4c1a69 100644
--- a/drivers/scsi/Makefile
+++ b/drivers/scsi/Makefile
@@ -142,6 +142,7 @@ obj-$(CONFIG_SCSI_BNX2_ISCSI) += libiscsi.o bnx2i/
obj-$(CONFIG_BE2ISCSI) += libiscsi.o be2iscsi/
obj-$(CONFIG_SCSI_PMCRAID) += pmcraid.o
obj-$(CONFIG_VMWARE_PVSCSI) += vmw_pvscsi.o
+obj-$(CONFIG_HYPERV_STORAGE) += hv_storvsc.o
obj-$(CONFIG_ARM) += arm/
@@ -170,6 +171,8 @@ scsi_mod-$(CONFIG_SCSI_PROC_FS) += scsi_proc.o
scsi_mod-y += scsi_trace.o
scsi_mod-$(CONFIG_PM) += scsi_pm.o
+hv_storvsc-y := storvsc_drv.o
+
scsi_tgt-y += scsi_tgt_lib.o scsi_tgt_if.o
sd_mod-objs := sd.o
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
new file mode 100644
index 0000000..eb853f7
--- /dev/null
+++ b/drivers/scsi/storvsc_drv.c
@@ -0,0 +1,1586 @@
+/*
+ * Copyright (c) 2009, Microsoft Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * Authors:
+ * Haiyang Zhang <haiyangz@microsoft.com>
+ * Hank Janssen <hjanssen@microsoft.com>
+ * K. Y. Srinivasan <kys@microsoft.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/wait.h>
+#include <linux/sched.h>
+#include <linux/completion.h>
+#include <linux/string.h>
+#include <linux/mm.h>
+#include <linux/delay.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/hyperv.h>
+#include <linux/mempool.h>
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_host.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_tcq.h>
+#include <scsi/scsi_eh.h>
+#include <scsi/scsi_devinfo.h>
+#include <scsi/scsi_dbg.h>
+
+
+#define STORVSC_MIN_BUF_NR 64
+#define STORVSC_RING_BUFFER_SIZE (20*PAGE_SIZE)
+static int storvsc_ringbuffer_size = STORVSC_RING_BUFFER_SIZE;
+
+module_param(storvsc_ringbuffer_size, int, S_IRUGO);
+MODULE_PARM_DESC(storvsc_ringbuffer_size, "Ring buffer size (bytes)");
+
+/* to alert the user that structure sizes may be mismatched even though the */
+/* protocol versions match. */
+
+
+#define REVISION_STRING(REVISION_) #REVISION_
+#define FILL_VMSTOR_REVISION(RESULT_LVALUE_) \
+ do { \
+ char *revision_string \
+ = REVISION_STRING($Rev : 6 $) + 6; \
+ RESULT_LVALUE_ = 0; \
+ while (*revision_string >= '0' \
+ && *revision_string <= '9') { \
+ RESULT_LVALUE_ *= 10; \
+ RESULT_LVALUE_ += *revision_string - '0'; \
+ revision_string++; \
+ } \
+ } while (0)
+
+/* Major/minor macros. Minor version is in LSB, meaning that earlier flat */
+/* version numbers will be interpreted as "0.x" (i.e., 1 becomes 0.1). */
+#define VMSTOR_PROTOCOL_MAJOR(VERSION_) (((VERSION_) >> 8) & 0xff)
+#define VMSTOR_PROTOCOL_MINOR(VERSION_) (((VERSION_)) & 0xff)
+#define VMSTOR_PROTOCOL_VERSION(MAJOR_, MINOR_) ((((MAJOR_) & 0xff) << 8) | \
+ (((MINOR_) & 0xff)))
+#define VMSTOR_INVALID_PROTOCOL_VERSION (-1)
+
+/* Version history: */
+/* V1 Beta 0.1 */
+/* V1 RC < 2008/1/31 1.0 */
+/* V1 RC > 2008/1/31 2.0 */
+#define VMSTOR_PROTOCOL_VERSION_CURRENT VMSTOR_PROTOCOL_VERSION(4, 2)
+
+
+
+
+/* This will get replaced with the max transfer length that is possible on */
+/* the host adapter. */
+/* The max transfer length will be published when we offer a vmbus channel. */
+#define MAX_TRANSFER_LENGTH 0x40000
+#define DEFAULT_PACKET_SIZE (sizeof(struct vmdata_gpa_direct) + \
+ sizeof(struct vstor_packet) + \
+ sizesizeof(u64) * (MAX_TRANSFER_LENGTH / PAGE_SIZE)))
+
+
+/* Packet structure describing virtual storage requests. */
+enum vstor_packet_operation {
+ VSTOR_OPERATION_COMPLETE_IO = 1,
+ VSTOR_OPERATION_REMOVE_DEVICE = 2,
+ VSTOR_OPERATION_EXECUTE_SRB = 3,
+ VSTOR_OPERATION_RESET_LUN = 4,
+ VSTOR_OPERATION_RESET_ADAPTER = 5,
+ VSTOR_OPERATION_RESET_BUS = 6,
+ VSTOR_OPERATION_BEGIN_INITIALIZATION = 7,
+ VSTOR_OPERATION_END_INITIALIZATION = 8,
+ VSTOR_OPERATION_QUERY_PROTOCOL_VERSION = 9,
+ VSTOR_OPERATION_QUERY_PROPERTIES = 10,
+ VSTOR_OPERATION_ENUMERATE_BUS = 11,
+ VSTOR_OPERATION_MAXIMUM = 11
+};
+
+/*
+ * Platform neutral description of a scsi request -
+ * this remains the same across the write regardless of 32/64 bit
+ * note: it's patterned off the SCSI_PASS_THROUGH structure
+ */
+#define CDB16GENERIC_LENGTH 0x10
+
+#ifndef SENSE_BUFFER_SIZE
+#define SENSE_BUFFER_SIZE 0x12
+#endif
+
+#define MAX_DATA_BUF_LEN_WITH_PADDING 0x14
+
+struct vmscsi_request {
+ unsigned short length;
+ unsigned char srb_status;
+ unsigned char scsi_status;
+
+ unsigned char port_number;
+ unsigned char path_id;
+ unsigned char target_id;
+ unsigned char lun;
+
+ unsigned char cdb_length;
+ unsigned char sense_info_length;
+ unsigned char data_in;
+ unsigned char reserved;
+
+ unsigned int data_transfer_length;
+
+ union {
+ unsigned char cdb[CDB16GENERIC_LENGTH];
+ unsigned char sense_data[SENSE_BUFFER_SIZE];
+ unsigned char reserved_array[MAX_DATA_BUF_LEN_WITH_PADDING];
+ };
+} __attribute((packed));
+
+
+/*
+ * This structure is sent during the intialization phase to get the different
+ * properties of the channel.
+ */
+struct vmstorage_channel_properties {
+ unsigned short protocol_version;
+ unsigned char path_id;
+ unsigned char target_id;
+
+ /* Note: port number is only really known on the client side */
+ unsigned int port_number;
+ unsigned int flags;
+ unsigned int max_transfer_bytes;
+
+ /* This id is unique for each channel and will correspond with */
+ /* vendor specific data in the inquirydata */
+ unsigned long long unique_id;
+} __packed;
+
+/* This structure is sent during the storage protocol negotiations. */
+struct vmstorage_protocol_version {
+ /* Major (MSW) and minor (LSW) version numbers. */
+ unsigned short major_minor;
+
+ /*
+ * Revision number is auto-incremented whenever this file is changed
+ * (See FILL_VMSTOR_REVISION macro above). Mismatch does not
+ * definitely indicate incompatibility--but it does indicate mismatched
+ * builds.
+ */
+ unsigned short revision;
+} __packed;
+
+/* Channel Property Flags */
+#define STORAGE_CHANNEL_REMOVABLE_FLAG 0x1
+#define STORAGE_CHANNEL_EMULATED_IDE_FLAG 0x2
+
+struct vstor_packet {
+ /* Requested operation type */
+ enum vstor_packet_operation operation;
+
+ /* Flags - see below for values */
+ unsigned int flags;
+
+ /* Status of the request returned from the server side. */
+ unsigned int status;
+
+ /* Data payload area */
+ union {
+ /*
+ * Structure used to forward SCSI commands from the
+ * client to the server.
+ */
+ struct vmscsi_request vm_srb;
+
+ /* Structure used to query channel properties. */
+ struct vmstorage_channel_properties storage_channel_properties;
+
+ /* Used during version negotiations. */
+ struct vmstorage_protocol_version version;
+ };
+} __packed;
+
+/* Packet flags */
+/*
+ * This flag indicates that the server should send back a completion for this
+ * packet.
+ */
+#define REQUEST_COMPLETION_FLAG 0x1
+
+/* This is the set of flags that the vsc can set in any packets it sends */
+#define VSC_LEGAL_FLAGS (REQUEST_COMPLETION_FLAG)
+
+
+/* Defines */
+
+#define STORVSC_MAX_IO_REQUESTS 128
+
+/*
+ * In Hyper-V, each port/path/target maps to 1 scsi host adapter. In
+ * reality, the path/target is not used (ie always set to 0) so our
+ * scsi host adapter essentially has 1 bus with 1 target that contains
+ * up to 256 luns.
+ */
+#define STORVSC_MAX_LUNS_PER_TARGET 64
+#define STORVSC_MAX_TARGETS 1
+#define STORVSC_MAX_CHANNELS 1
+#define STORVSC_MAX_CMD_LEN 16
+
+/* Matches Windows-end */
+enum storvsc_request_type {
+ WRITE_TYPE,
+ READ_TYPE,
+ UNKNOWN_TYPE,
+};
+
+
+struct hv_storvsc_request {
+ struct hv_device *device;
+
+ /* Synchronize the request/response if needed */
+ struct completion wait_event;
+
+ unsigned char *sense_buffer;
+ void *context;
+ void (*on_io_completion)(struct hv_storvsc_request *request);
+ struct hv_multipage_buffer data_buffer;
+
+ struct vstor_packet vstor_packet;
+};
+
+
+/* A storvsc device is a device object that contains a vmbus channel */
+struct storvsc_device {
+ struct hv_device *device;
+
+ bool destroy;
+ bool drain_notify;
+ atomic_t num_outstanding_req;
+ struct Scsi_Host *host;
+
+ wait_queue_head_t waiting_to_drain;
+
+ /*
+ * Each unique Port/Path/Target represents 1 channel ie scsi
+ * controller. In reality, the pathid, targetid is always 0
+ * and the port is set by us
+ */
+ unsigned int port_number;
+ unsigned char path_id;
+ unsigned char target_id;
+
+ /* Used for vsc/vsp channel reset process */
+ struct hv_storvsc_request init_request;
+ struct hv_storvsc_request reset_request;
+};
+
+struct stor_mem_pools {
+ struct kmem_cache *request_pool;
+ mempool_t *request_mempool;
+};
+
+struct hv_host_device {
+ struct hv_device *dev;
+ unsigned int port;
+ unsigned char path;
+ unsigned char target;
+};
+
+struct storvsc_cmd_request {
+ struct list_head entry;
+ struct scsi_cmnd *cmd;
+
+ unsigned int bounce_sgl_count;
+ struct scatterlist *bounce_sgl;
+
+ struct hv_storvsc_request request;
+};
+
+struct storvsc_scan_work {
+ struct work_struct work;
+ struct Scsi_Host *host;
+ uint lun;
+};
+
+static void storvsc_bus_scan(struct work_struct *work)
+{
+ struct storvsc_scan_work *wrk;
+ int id, order_id;
+
+ wrk = container_of(work, struct storvsc_scan_work, work);
+ for (id = 0; id < wrk->host->max_id; ++id) {
+ if (wrk->host->reverse_ordering)
+ order_id = wrk->host->max_id - id - 1;
+ else
+ order_id = id;
+
+ scsi_scan_target(&wrk->host->shost_gendev, 0,
+ order_id, SCAN_WILD_CARD, 1);
+ }
+ kfree(wrk);
+}
+
+static void storvsc_remove_lun(struct work_struct *work)
+{
+ struct storvsc_scan_work *wrk;
+ struct scsi_device *sdev;
+
+ wrk = container_of(work, struct storvsc_scan_work, work);
+ if (!scsi_host_get(wrk->host))
+ goto done;
+
+ sdev = scsi_device_lookup(wrk->host, 0, 0, wrk->lun);
+
+ if (sdev) {
+ scsi_remove_device(sdev);
+ scsi_device_put(sdev);
+ }
+ scsi_host_put(wrk->host);
+
+done:
+ kfree(wrk);
+}
+
+static inline struct storvsc_device *get_out_stor_device(
+ struct hv_device *device)
+{
+ struct storvsc_device *stor_device;
+
+ stor_device = hv_get_drvdata(device);
+
+ if (stor_device && stor_device->destroy)
+ stor_device = NULL;
+
+ return stor_device;
+}
+
+
+static inline void storvsc_wait_to_drain(struct storvsc_device *dev)
+{
+ dev->drain_notify = true;
+ wait_event(dev->waiting_to_drain,
+ atomic_read(&dev->num_outstanding_req) == 0);
+ dev->drain_notify = false;
+}
+
+static inline struct storvsc_device *get_in_stor_device(
+ struct hv_device *device)
+{
+ struct storvsc_device *stor_device;
+
+ stor_device = hv_get_drvdata(device);
+
+ if (!stor_device)
+ goto get_in_err;
+
+ /*
+ * If the device is being destroyed; allow incoming
+ * traffic only to cleanup outstanding requests.
+ */
+
+ if (stor_device->destroy &&
+ (atomic_read(&stor_device->num_outstanding_req) == 0))
+ stor_device = NULL;
+
+get_in_err:
+ return stor_device;
+
+}
+
+static int storvsc_channel_init(struct hv_device *device)
+{
+ struct storvsc_device *stor_device;
+ struct hv_storvsc_request *request;
+ struct vstor_packet *vstor_packet;
+ int ret, t;
+
+ stor_device = get_out_stor_device(device);
+ if (!stor_device)
+ return -ENODEV;
+
+ request = &stor_device->init_request;
+ vstor_packet = &request->vstor_packet;
+
+ /*
+ * Now, initiate the vsc/vsp initialization protocol on the open
+ * channel
+ */
+ memset(request, 0, sizeof(struct hv_storvsc_request));
+ init_completion(&request->wait_event);
+ vstor_packet->operation = VSTOR_OPERATION_BEGIN_INITIALIZATION;
+ vstor_packet->flags = REQUEST_COMPLETION_FLAG;
+
+ ret = vmbus_sendpacket(device->channel, vstor_packet,
+ sizeof(struct vstor_packet),
+ (unsigned long)request,
+ VM_PKT_DATA_INBAND,
+ VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
+ if (ret != 0)
+ goto cleanup;
+
+ t = wait_for_completion_timeout(&request->wait_event, 5*HZ);
+ if (t == 0) {
+ ret = -ETIMEDOUT;
+ goto cleanup;
+ }
+
+ if (vstor_packet->operation != VSTOR_OPERATION_COMPLETE_IO ||
+ vstor_packet->status != 0)
+ goto cleanup;
+
+
+ /* reuse the packet for version range supported */
+ memset(vstor_packet, 0, sizeof(struct vstor_packet));
+ vstor_packet->operation = VSTOR_OPERATION_QUERY_PROTOCOL_VERSION;
+ vstor_packet->flags = REQUEST_COMPLETION_FLAG;
+
+ vstor_packet->version.major_minor = VMSTOR_PROTOCOL_VERSION_CURRENT;
+ FILL_VMSTOR_REVISION(vstor_packet->version.revision);
+
+ ret = vmbus_sendpacket(device->channel, vstor_packet,
+ sizeof(struct vstor_packet),
+ (unsigned long)request,
+ VM_PKT_DATA_INBAND,
+ VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
+ if (ret != 0)
+ goto cleanup;
+
+ t = wait_for_completion_timeout(&request->wait_event, 5*HZ);
+ if (t == 0) {
+ ret = -ETIMEDOUT;
+ goto cleanup;
+ }
+
+ if (vstor_packet->operation != VSTOR_OPERATION_COMPLETE_IO ||
+ vstor_packet->status != 0)
+ goto cleanup;
+
+
+ memset(vstor_packet, 0, sizeof(struct vstor_packet));
+ vstor_packet->operation = VSTOR_OPERATION_QUERY_PROPERTIES;
+ vstor_packet->flags = REQUEST_COMPLETION_FLAG;
+ vstor_packet->storage_channel_properties.port_number =
+ stor_device->port_number;
+
+ ret = vmbus_sendpacket(device->channel, vstor_packet,
+ sizeof(struct vstor_packet),
+ (unsigned long)request,
+ VM_PKT_DATA_INBAND,
+ VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
+
+ if (ret != 0)
+ goto cleanup;
+
+ t = wait_for_completion_timeout(&request->wait_event, 5*HZ);
+ if (t == 0) {
+ ret = -ETIMEDOUT;
+ goto cleanup;
+ }
+
+ if (vstor_packet->operation != VSTOR_OPERATION_COMPLETE_IO ||
+ vstor_packet->status != 0)
+ goto cleanup;
+
+ stor_device->path_id = vstor_packet->storage_channel_properties.path_id;
+ stor_device->target_id
+ = vstor_packet->storage_channel_properties.target_id;
+
+ memset(vstor_packet, 0, sizeof(struct vstor_packet));
+ vstor_packet->operation = VSTOR_OPERATION_END_INITIALIZATION;
+ vstor_packet->flags = REQUEST_COMPLETION_FLAG;
+
+ ret = vmbus_sendpacket(device->channel, vstor_packet,
+ sizeof(struct vstor_packet),
+ (unsigned long)request,
+ VM_PKT_DATA_INBAND,
+ VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
+
+ if (ret != 0)
+ goto cleanup;
+
+ t = wait_for_completion_timeout(&request->wait_event, 5*HZ);
+ if (t == 0) {
+ ret = -ETIMEDOUT;
+ goto cleanup;
+ }
+
+ if (vstor_packet->operation != VSTOR_OPERATION_COMPLETE_IO ||
+ vstor_packet->status != 0)
+ goto cleanup;
+
+
+cleanup:
+ return ret;
+}
+
+static void storvsc_on_io_completion(struct hv_device *device,
+ struct vstor_packet *vstor_packet,
+ struct hv_storvsc_request *request)
+{
+ struct storvsc_device *stor_device;
+ struct vstor_packet *stor_pkt;
+
+ stor_device = hv_get_drvdata(device);
+ stor_pkt = &request->vstor_packet;
+
+ /*
+ * The current SCSI handling on the host side does
+ * not correctly handle:
+ * INQUIRY command with page code parameter set to 0x80
+ * MODE_SENSE command with cmd[2] == 0x1c
+ *
+ * Setup srb and scsi status so this won't be fatal.
+ * We do this so we can distinguish truly fatal failues
+ * (srb status == 0x4) and off-line the device in that case.
+ */
+
+ if ((stor_pkt->vm_srb.cdb[0] == INQUIRY) ||
+ (stor_pkt->vm_srb.cdb[0] == MODE_SENSE)) {
+ vstor_packet->vm_srb.scsi_status = 0;
+ vstor_packet->vm_srb.srb_status = 0x1;
+ }
+
+
+ /* Copy over the status...etc */
+ stor_pkt->vm_srb.scsi_status = vstor_packet->vm_srb.scsi_status;
+ stor_pkt->vm_srb.srb_status = vstor_packet->vm_srb.srb_status;
+ stor_pkt->vm_srb.sense_info_length =
+ vstor_packet->vm_srb.sense_info_length;
+
+ if (vstor_packet->vm_srb.scsi_status != 0 ||
+ vstor_packet->vm_srb.srb_status != 1){
+ dev_warn(&device->device,
+ "cmd 0x%x scsi status 0x%x srb status 0x%x\n",
+ stor_pkt->vm_srb.cdb[0],
+ vstor_packet->vm_srb.scsi_status,
+ vstor_packet->vm_srb.srb_status);
+ }
+
+ if ((vstor_packet->vm_srb.scsi_status & 0xFF) == 0x02) {
+ /* CHECK_CONDITION */
+ if (vstor_packet->vm_srb.srb_status & 0x80) {
+ /* autosense data available */
+ dev_warn(&device->device,
+ "stor pkt %p autosense data valid - len %d\n",
+ request,
+ vstor_packet->vm_srb.sense_info_length);
+
+ memcpy(request->sense_buffer,
+ vstor_packet->vm_srb.sense_data,
+ vstor_packet->vm_srb.sense_info_length);
+
+ }
+ }
+
+ stor_pkt->vm_srb.data_transfer_length =
+ vstor_packet->vm_srb.data_transfer_length;
+
+ request->on_io_completion(request);
+
+ if (atomic_dec_and_test(&stor_device->num_outstanding_req) &&
+ stor_device->drain_notify)
+ wake_up(&stor_device->waiting_to_drain);
+
+
+}
+
+static void storvsc_on_receive(struct hv_device *device,
+ struct vstor_packet *vstor_packet,
+ struct hv_storvsc_request *request)
+{
+ struct storvsc_scan_work *work;
+ struct storvsc_device *stor_device;
+
+ switch (vstor_packet->operation) {
+ case VSTOR_OPERATION_COMPLETE_IO:
+ storvsc_on_io_completion(device, vstor_packet, request);
+ break;
+
+ case VSTOR_OPERATION_REMOVE_DEVICE:
+ case VSTOR_OPERATION_ENUMERATE_BUS:
+ stor_device = get_in_stor_device(device);
+ work = kmalloc(sizeof(struct storvsc_scan_work), GFP_ATOMIC);
+ if (!work)
+ return;
+
+ INIT_WORK(&work->work, storvsc_bus_scan);
+ work->host = stor_device->host;
+ schedule_work(&work->work);
+ break;
+
+ default:
+ break;
+ }
+}
+
+static void storvsc_on_channel_callback(void *context)
+{
+ struct hv_device *device = (struct hv_device *)context;
+ struct storvsc_device *stor_device;
+ u32 bytes_recvd;
+ u64 request_id;
+ unsigned char packet[ALIGN(sizeof(struct vstor_packet), 8)];
+ struct hv_storvsc_request *request;
+ int ret;
+
+
+ stor_device = get_in_stor_device(device);
+ if (!stor_device)
+ return;
+
+ do {
+ ret = vmbus_recvpacket(device->channel, packet,
+ ALIGN(sizeof(struct vstor_packet), 8),
+ &bytes_recvd, &request_id);
+ if (ret == 0 && bytes_recvd > 0) {
+
+ request = (struct hv_storvsc_request *)
+ (unsigned long)request_id;
+
+ if ((request == &stor_device->init_request) ||
+ (request == &stor_device->reset_request)) {
+
+ memcpy(&request->vstor_packet, packet,
+ sizeof(struct vstor_packet));
+ complete(&request->wait_event);
+ } else {
+ storvsc_on_receive(device,
+ (struct vstor_packet *)packet,
+ request);
+ }
+ } else {
+ break;
+ }
+ } while (1);
+
+ return;
+}
+
+static int storvsc_connect_to_vsp(struct hv_device *device, u32 ring_size)
+{
+ struct vmstorage_channel_properties props;
+ int ret;
+
+ memset(&props, 0, sizeof(struct vmstorage_channel_properties));
+
+ /* Open the channel */
+ ret = vmbus_open(device->channel,
+ ring_size,
+ ring_size,
+ (void *)&props,
+ sizeof(struct vmstorage_channel_properties),
+ storvsc_on_channel_callback, device);
+
+ if (ret != 0)
+ return ret;
+
+ ret = storvsc_channel_init(device);
+
+ return ret;
+}
+
+static int storvsc_dev_remove(struct hv_device *device)
+{
+ struct storvsc_device *stor_device;
+ unsigned long flags;
+
+ stor_device = hv_get_drvdata(device);
+
+ spin_lock_irqsave(&device->channel->inbound_lock, flags);
+ stor_device->destroy = true;
+ spin_unlock_irqrestore(&device->channel->inbound_lock, flags);
+
+ /*
+ * At this point, all outbound traffic should be disable. We
+ * only allow inbound traffic (responses) to proceed so that
+ * outstanding requests can be completed.
+ */
+
+ storvsc_wait_to_drain(stor_device);
+
+ /*
+ * Since we have already drained, we don't need to busy wait
+ * as was done in final_release_stor_device()
+ * Note that we cannot set the ext pointer to NULL until
+ * we have drained - to drain the outgoing packets, we need to
+ * allow incoming packets.
+ */
+ spin_lock_irqsave(&device->channel->inbound_lock, flags);
+ hv_set_drvdata(device, NULL);
+ spin_unlock_irqrestore(&device->channel->inbound_lock, flags);
+
+ /* Close the channel */
+ vmbus_close(device->channel);
+
+ kfree(stor_device);
+ return 0;
+}
+
+static int storvsc_do_io(struct hv_device *device,
+ struct hv_storvsc_request *request)
+{
+ struct storvsc_device *stor_device;
+ struct vstor_packet *vstor_packet;
+ int ret = 0;
+
+ vstor_packet = &request->vstor_packet;
+ stor_device = get_out_stor_device(device);
+
+ if (!stor_device)
+ return -ENODEV;
+
+
+ request->device = device;
+
+
+ vstor_packet->flags |= REQUEST_COMPLETION_FLAG;
+
+ vstor_packet->vm_srb.length = sizeof(struct vmscsi_request);
+
+
+ vstor_packet->vm_srb.sense_info_length = SENSE_BUFFER_SIZE;
+
+
+ vstor_packet->vm_srb.data_transfer_length =
+ request->data_buffer.len;
+
+ vstor_packet->operation = VSTOR_OPERATION_EXECUTE_SRB;
+
+ if (request->data_buffer.len) {
+ ret = vmbus_sendpacket_multipagebuffer(device->channel,
+ &request->data_buffer,
+ vstor_packet,
+ sizeof(struct vstor_packet),
+ (unsigned long)request);
+ } else {
+ ret = vmbus_sendpacket(device->channel, vstor_packet,
+ sizeof(struct vstor_packet),
+ (unsigned long)request,
+ VM_PKT_DATA_INBAND,
+ VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
+ }
+
+ if (ret != 0)
+ return ret;
+
+ atomic_inc(&stor_device->num_outstanding_req);
+
+ return ret;
+}
+
+static void storvsc_get_ide_info(struct hv_device *dev, int *target, int *path)
+{
+ *target =
+ dev->dev_instance.b[5] << 8 | dev->dev_instance.b[4];
+
+ *path =
+ dev->dev_instance.b[3] << 24 |
+ dev->dev_instance.b[2] << 16 |
+ dev->dev_instance.b[1] << 8 | dev->dev_instance.b[0];
+}
+
+
+static int storvsc_device_alloc(struct scsi_device *sdevice)
+{
+ struct stor_mem_pools *memp;
+ int number = STORVSC_MIN_BUF_NR;
+
+ memp = kzalloc(sizeof(struct stor_mem_pools), GFP_KERNEL);
+ if (!memp)
+ return -ENOMEM;
+
+ memp->request_pool =
+ kmem_cache_create(dev_name(&sdevice->sdev_dev),
+ sizeof(struct storvsc_cmd_request), 0,
+ SLAB_HWCACHE_ALIGN, NULL);
+
+ if (!memp->request_pool)
+ goto err0;
+
+ memp->request_mempool = mempool_create(number, mempool_alloc_slab,
+ mempool_free_slab,
+ memp->request_pool);
+
+ if (!memp->request_mempool)
+ goto err1;
+
+ sdevice->hostdata = memp;
+
+ return 0;
+
+err1:
+ kmem_cache_destroy(memp->request_pool);
+
+err0:
+ kfree(memp);
+ return -ENOMEM;
+}
+
+static void storvsc_device_destroy(struct scsi_device *sdevice)
+{
+ struct stor_mem_pools *memp = sdevice->hostdata;
+
+ mempool_destroy(memp->request_mempool);
+ kmem_cache_destroy(memp->request_pool);
+ kfree(memp);
+ sdevice->hostdata = NULL;
+}
+
+static int storvsc_device_configure(struct scsi_device *sdevice)
+{
+ scsi_adjust_queue_depth(sdevice, MSG_SIMPLE_TAG,
+ STORVSC_MAX_IO_REQUESTS);
+
+ blk_queue_max_segment_size(sdevice->request_queue, PAGE_SIZE);
+
+ blk_queue_bounce_limit(sdevice->request_queue, BLK_BOUNCE_ANY);
+
+ return 0;
+}
+
+static void destroy_bounce_buffer(struct scatterlist *sgl,
+ unsigned int sg_count)
+{
+ int i;
+ struct page *page_buf;
+
+ for (i = 0; i < sg_count; i++) {
+ page_buf = sg_page((&sgl[i]));
+ if (page_buf != NULL)
+ __free_page(page_buf);
+ }
+
+ kfree(sgl);
+}
+
+static int do_bounce_buffer(struct scatterlist *sgl, unsigned int sg_count)
+{
+ int i;
+
+ /* No need to check */
+ if (sg_count < 2)
+ return -1;
+
+ /* We have at least 2 sg entries */
+ for (i = 0; i < sg_count; i++) {
+ if (i == 0) {
+ /* make sure 1st one does not have hole */
+ if (sgl[i].offset + sgl[i].length != PAGE_SIZE)
+ return i;
+ } else if (i == sg_count - 1) {
+ /* make sure last one does not have hole */
+ if (sgl[i].offset != 0)
+ return i;
+ } else {
+ /* make sure no hole in the middle */
+ if (sgl[i].length != PAGE_SIZE || sgl[i].offset != 0)
+ return i;
+ }
+ }
+ return -1;
+}
+
+static struct scatterlist *create_bounce_buffer(struct scatterlist *sgl,
+ unsigned int sg_count,
+ unsigned int len,
+ int write)
+{
+ int i;
+ int num_pages;
+ struct scatterlist *bounce_sgl;
+ struct page *page_buf;
+ unsigned int buf_len = ((write == WRITE_TYPE) ? 0 : PAGE_SIZE);
+
+ num_pages = ALIGN(len, PAGE_SIZE) >> PAGE_SHIFT;
+
+ bounce_sgl = kcalloc(num_pages, sizeof(struct scatterlist), GFP_ATOMIC);
+ if (!bounce_sgl)
+ return NULL;
+
+ for (i = 0; i < num_pages; i++) {
+ page_buf = alloc_page(GFP_ATOMIC);
+ if (!page_buf)
+ goto cleanup;
+ sg_set_page(&bounce_sgl[i], page_buf, buf_len, 0);
+ }
+
+ return bounce_sgl;
+
+cleanup:
+ destroy_bounce_buffer(bounce_sgl, num_pages);
+ return NULL;
+}
+
+
+/* Assume the original sgl has enough room */
+static unsigned int copy_from_bounce_buffer(struct scatterlist *orig_sgl,
+ struct scatterlist *bounce_sgl,
+ unsigned int orig_sgl_count,
+ unsigned int bounce_sgl_count)
+{
+ int i;
+ int j = 0;
+ unsigned long src, dest;
+ unsigned int srclen, destlen, copylen;
+ unsigned int total_copied = 0;
+ unsigned long bounce_addr = 0;
+ unsigned long dest_addr = 0;
+ unsigned long flags;
+
+ local_irq_save(flags);
+
+ for (i = 0; i < orig_sgl_count; i++) {
+ dest_addr = (unsigned long)kmap_atomic(sg_page((&orig_sgl[i])),
+ KM_IRQ0) + orig_sgl[i].offset;
+ dest = dest_addr;
+ destlen = orig_sgl[i].length;
+
+ if (bounce_addr == 0)
+ bounce_addr =
+ (unsigned long)kmap_atomic(sg_page((&bounce_sgl[j])),
+ KM_IRQ0);
+
+ while (destlen) {
+ src = bounce_addr + bounce_sgl[j].offset;
+ srclen = bounce_sgl[j].length - bounce_sgl[j].offset;
+
+ copylen = min(srclen, destlen);
+ memcpy((void *)dest, (void *)src, copylen);
+
+ total_copied += copylen;
+ bounce_sgl[j].offset += copylen;
+ destlen -= copylen;
+ dest += copylen;
+
+ if (bounce_sgl[j].offset == bounce_sgl[j].length) {
+ /* full */
+ kunmap_atomic((void *)bounce_addr, KM_IRQ0);
+ j++;
+
+ /*
+ * It is possible that the number of elements
+ * in the bounce buffer may not be equal to
+ * the number of elements in the original
+ * scatter list. Handle this correctly.
+ */
+
+ if (j == bounce_sgl_count) {
+ /*
+ * We are done; cleanup and return.
+ */
+ kunmap_atomic((void *)(dest_addr -
+ orig_sgl[i].offset),
+ KM_IRQ0);
+ local_irq_restore(flags);
+ return total_copied;
+ }
+
+ /* if we need to use another bounce buffer */
+ if (destlen || i != orig_sgl_count - 1)
+ bounce_addr =
+ (unsigned long)kmap_atomic(
+ sg_page((&bounce_sgl[j])), KM_IRQ0);
+ } else if (destlen == 0 && i == orig_sgl_count - 1) {
+ /* unmap the last bounce that is < PAGE_SIZE */
+ kunmap_atomic((void *)bounce_addr, KM_IRQ0);
+ }
+ }
+
+ kunmap_atomic((void *)(dest_addr - orig_sgl[i].offset),
+ KM_IRQ0);
+ }
+
+ local_irq_restore(flags);
+
+ return total_copied;
+}
+
+
+/* Assume the bounce_sgl has enough room ie using the create_bounce_buffer() */
+static unsigned int copy_to_bounce_buffer(struct scatterlist *orig_sgl,
+ struct scatterlist *bounce_sgl,
+ unsigned int orig_sgl_count)
+{
+ int i;
+ int j = 0;
+ unsigned long src, dest;
+ unsigned int srclen, destlen, copylen;
+ unsigned int total_copied = 0;
+ unsigned long bounce_addr = 0;
+ unsigned long src_addr = 0;
+ unsigned long flags;
+
+ local_irq_save(flags);
+
+ for (i = 0; i < orig_sgl_count; i++) {
+ src_addr = (unsigned long)kmap_atomic(sg_page((&orig_sgl[i])),
+ KM_IRQ0) + orig_sgl[i].offset;
+ src = src_addr;
+ srclen = orig_sgl[i].length;
+
+ if (bounce_addr == 0)
+ bounce_addr =
+ (unsigned long)kmap_atomic(sg_page((&bounce_sgl[j])),
+ KM_IRQ0);
+
+ while (srclen) {
+ /* assume bounce offset always == 0 */
+ dest = bounce_addr + bounce_sgl[j].length;
+ destlen = PAGE_SIZE - bounce_sgl[j].length;
+
+ copylen = min(srclen, destlen);
+ memcpy((void *)dest, (void *)src, copylen);
+
+ total_copied += copylen;
+ bounce_sgl[j].length += copylen;
+ srclen -= copylen;
+ src += copylen;
+
+ if (bounce_sgl[j].length == PAGE_SIZE) {
+ /* full..move to next entry */
+ kunmap_atomic((void *)bounce_addr, KM_IRQ0);
+ j++;
+
+ /* if we need to use another bounce buffer */
+ if (srclen || i != orig_sgl_count - 1)
+ bounce_addr =
+ (unsigned long)kmap_atomic(
+ sg_page((&bounce_sgl[j])), KM_IRQ0);
+
+ } else if (srclen == 0 && i == orig_sgl_count - 1) {
+ /* unmap the last bounce that is < PAGE_SIZE */
+ kunmap_atomic((void *)bounce_addr, KM_IRQ0);
+ }
+ }
+
+ kunmap_atomic((void *)(src_addr - orig_sgl[i].offset), KM_IRQ0);
+ }
+
+ local_irq_restore(flags);
+
+ return total_copied;
+}
+
+
+static int storvsc_remove(struct hv_device *dev)
+{
+ struct storvsc_device *stor_device = hv_get_drvdata(dev);
+ struct Scsi_Host *host = stor_device->host;
+
+ scsi_remove_host(host);
+
+ scsi_host_put(host);
+
+ storvsc_dev_remove(dev);
+
+ return 0;
+}
+
+
+static int storvsc_get_chs(struct scsi_device *sdev, struct block_device * bdev,
+ sector_t capacity, int *info)
+{
+ sector_t nsect = capacity;
+ sector_t cylinders = nsect;
+ int heads, sectors_pt;
+
+ /*
+ * We are making up these values; let us keep it simple.
+ */
+ heads = 0xff;
+ sectors_pt = 0x3f; /* Sectors per track */
+ sector_div(cylinders, heads * sectors_pt);
+ if ((sector_t)(cylinders + 1) * heads * sectors_pt < nsect)
+ cylinders = 0xffff;
+
+ info[0] = heads;
+ info[1] = sectors_pt;
+ info[2] = (int)cylinders;
+
+ return 0;
+}
+
+static int storvsc_host_reset(struct hv_device *device)
+{
+ struct storvsc_device *stor_device;
+ struct hv_storvsc_request *request;
+ struct vstor_packet *vstor_packet;
+ int ret, t;
+
+
+ stor_device = get_out_stor_device(device);
+ if (!stor_device)
+ return FAILED;
+
+ request = &stor_device->reset_request;
+ vstor_packet = &request->vstor_packet;
+
+ init_completion(&request->wait_event);
+
+ vstor_packet->operation = VSTOR_OPERATION_RESET_BUS;
+ vstor_packet->flags = REQUEST_COMPLETION_FLAG;
+ vstor_packet->vm_srb.path_id = stor_device->path_id;
+
+ ret = vmbus_sendpacket(device->channel, vstor_packet,
+ sizeof(struct vstor_packet),
+ (unsigned long)&stor_device->reset_request,
+ VM_PKT_DATA_INBAND,
+ VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
+ if (ret != 0)
+ return FAILED;
+
+ t = wait_for_completion_timeout(&request->wait_event, 5*HZ);
+ if (t == 0)
+ return TIMEOUT_ERROR;
+
+
+ /*
+ * At this point, all outstanding requests in the adapter
+ * should have been flushed out and return to us
+ */
+
+ return SUCCESS;
+}
+
+
+/*
+ * storvsc_host_reset_handler - Reset the scsi HBA
+ */
+static int storvsc_host_reset_handler(struct scsi_cmnd *scmnd)
+{
+ struct hv_host_device *host_dev = shost_priv(scmnd->device->host);
+ struct hv_device *dev = host_dev->dev;
+
+ return storvsc_host_reset(dev);
+}
+
+
+/*
+ * storvsc_command_completion - Command completion processing
+ */
+static void storvsc_command_completion(struct hv_storvsc_request *request)
+{
+ struct storvsc_cmd_request *cmd_request =
+ (struct storvsc_cmd_request *)request->context;
+ struct scsi_cmnd *scmnd = cmd_request->cmd;
+ struct hv_host_device *host_dev = shost_priv(scmnd->device->host);
+ void (*scsi_done_fn)(struct scsi_cmnd *);
+ struct scsi_sense_hdr sense_hdr;
+ struct vmscsi_request *vm_srb;
+ struct storvsc_scan_work *wrk;
+ struct stor_mem_pools *memp = scmnd->device->hostdata;
+
+ vm_srb = &request->vstor_packet.vm_srb;
+ if (cmd_request->bounce_sgl_count) {
+ if (vm_srb->data_in == READ_TYPE)
+ copy_from_bounce_buffer(scsi_sglist(scmnd),
+ cmd_request->bounce_sgl,
+ scsi_sg_count(scmnd),
+ cmd_request->bounce_sgl_count);
+ destroy_bounce_buffer(cmd_request->bounce_sgl,
+ cmd_request->bounce_sgl_count);
+ }
+
+ /*
+ * If there is an error; offline the device since all
+ * error recovery strategies would have already been
+ * deployed on the host side.
+ */
+ if (vm_srb->srb_status == 0x4)
+ scmnd->result = DID_TARGET_FAILURE << 16;
+ else
+ scmnd->result = vm_srb->scsi_status;
+
+ /*
+ * If the LUN is invalid; remove the device.
+ */
+ if (vm_srb->srb_status == 0x20) {
+ struct storvsc_device *stor_dev;
+ struct hv_device *dev = host_dev->dev;
+ struct Scsi_Host *host;
+
+ stor_dev = get_in_stor_device(dev);
+ host = stor_dev->host;
+
+ wrk = kmalloc(sizeof(struct storvsc_scan_work),
+ GFP_ATOMIC);
+ if (!wrk) {
+ scmnd->result = DID_TARGET_FAILURE << 16;
+ } else {
+ wrk->host = host;
+ wrk->lun = vm_srb->lun;
+ INIT_WORK(&wrk->work, storvsc_remove_lun);
+ schedule_work(&wrk->work);
+ }
+ }
+
+ if (scmnd->result) {
+ if (scsi_normalize_sense(scmnd->sense_buffer,
+ SCSI_SENSE_BUFFERSIZE, &sense_hdr))
+ scsi_print_sense_hdr("storvsc", &sense_hdr);
+ }
+
+ scsi_set_resid(scmnd,
+ request->data_buffer.len -
+ vm_srb->data_transfer_length);
+
+ scsi_done_fn = scmnd->scsi_done;
+
+ scmnd->host_scribble = NULL;
+ scmnd->scsi_done = NULL;
+
+ scsi_done_fn(scmnd);
+
+ mempool_free(cmd_request, memp->request_mempool);
+}
+
+static bool storvsc_check_scsi_cmd(struct scsi_cmnd *scmnd)
+{
+ bool allowed = true;
+ u8 scsi_op = scmnd->cmnd[0];
+
+ switch (scsi_op) {
+ /* smartd sends this command, which will offline the device */
+ case SET_WINDOW:
+ scmnd->result = ILLEGAL_REQUEST << 16;
+ allowed = false;
+ break;
+ default:
+ break;
+ }
+ return allowed;
+}
+
+/*
+ * storvsc_queuecommand - Initiate command processing
+ */
+static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
+{
+ int ret;
+ struct hv_host_device *host_dev = shost_priv(host);
+ struct hv_device *dev = host_dev->dev;
+ struct hv_storvsc_request *request;
+ struct storvsc_cmd_request *cmd_request;
+ unsigned int request_size = 0;
+ int i;
+ struct scatterlist *sgl;
+ unsigned int sg_count = 0;
+ struct vmscsi_request *vm_srb;
+ struct stor_mem_pools *memp = scmnd->device->hostdata;
+
+ if (storvsc_check_scsi_cmd(scmnd) == false) {
+ scmnd->scsi_done(scmnd);
+ return 0;
+ }
+
+ /* If retrying, no need to prep the cmd */
+ if (scmnd->host_scribble) {
+
+ cmd_request =
+ (struct storvsc_cmd_request *)scmnd->host_scribble;
+
+ goto retry_request;
+ }
+
+ request_size = sizeof(struct storvsc_cmd_request);
+
+ cmd_request = mempool_alloc(memp->request_mempool,
+ GFP_ATOMIC);
+ if (!cmd_request)
+ return SCSI_MLQUEUE_DEVICE_BUSY;
+
+ memset(cmd_request, 0, sizeof(struct storvsc_cmd_request));
+
+ /* Setup the cmd request */
+ cmd_request->bounce_sgl_count = 0;
+ cmd_request->bounce_sgl = NULL;
+ cmd_request->cmd = scmnd;
+
+ scmnd->host_scribble = (unsigned char *)cmd_request;
+
+ request = &cmd_request->request;
+ vm_srb = &request->vstor_packet.vm_srb;
+
+
+ /* Build the SRB */
+ switch (scmnd->sc_data_direction) {
+ case DMA_TO_DEVICE:
+ vm_srb->data_in = WRITE_TYPE;
+ break;
+ case DMA_FROM_DEVICE:
+ vm_srb->data_in = READ_TYPE;
+ break;
+ default:
+ vm_srb->data_in = UNKNOWN_TYPE;
+ break;
+ }
+
+ request->on_io_completion = storvsc_command_completion;
+ request->context = cmd_request;/* scmnd; */
+
+ vm_srb->port_number = host_dev->port;
+ vm_srb->path_id = scmnd->device->channel;
+ vm_srb->target_id = scmnd->device->id;
+ vm_srb->lun = scmnd->device->lun;
+
+ vm_srb->cdb_length = scmnd->cmd_len;
+
+ memcpy(vm_srb->cdb, scmnd->cmnd, vm_srb->cdb_length);
+
+ request->sense_buffer = scmnd->sense_buffer;
+
+
+ request->data_buffer.len = scsi_bufflen(scmnd);
+ if (scsi_sg_count(scmnd)) {
+ sgl = (struct scatterlist *)scsi_sglist(scmnd);
+ sg_count = scsi_sg_count(scmnd);
+
+ /* check if we need to bounce the sgl */
+ if (do_bounce_buffer(sgl, scsi_sg_count(scmnd)) != -1) {
+ cmd_request->bounce_sgl =
+ create_bounce_buffer(sgl, scsi_sg_count(scmnd),
+ scsi_bufflen(scmnd),
+ vm_srb->data_in);
+ if (!cmd_request->bounce_sgl) {
+ scmnd->host_scribble = NULL;
+ mempool_free(cmd_request,
+ memp->request_mempool);
+
+ return SCSI_MLQUEUE_HOST_BUSY;
+ }
+
+ cmd_request->bounce_sgl_count =
+ ALIGN(scsi_bufflen(scmnd), PAGE_SIZE) >>
+ PAGE_SHIFT;
+
+ if (vm_srb->data_in == WRITE_TYPE)
+ copy_to_bounce_buffer(sgl,
+ cmd_request->bounce_sgl,
+ scsi_sg_count(scmnd));
+
+ sgl = cmd_request->bounce_sgl;
+ sg_count = cmd_request->bounce_sgl_count;
+ }
+
+ request->data_buffer.offset = sgl[0].offset;
+
+ for (i = 0; i < sg_count; i++)
+ request->data_buffer.pfn_array[i] =
+ page_to_pfn(sg_page((&sgl[i])));
+
+ } else if (scsi_sglist(scmnd)) {
+ request->data_buffer.offset =
+ virt_to_phys(scsi_sglist(scmnd)) & (PAGE_SIZE-1);
+ request->data_buffer.pfn_array[0] =
+ virt_to_phys(scsi_sglist(scmnd)) >> PAGE_SHIFT;
+ }
+
+retry_request:
+ /* Invokes the vsc to start an IO */
+ ret = storvsc_do_io(dev, &cmd_request->request);
+
+ if (ret == -EAGAIN) {
+ /* no more space */
+
+ if (cmd_request->bounce_sgl_count)
+ destroy_bounce_buffer(cmd_request->bounce_sgl,
+ cmd_request->bounce_sgl_count);
+
+ mempool_free(cmd_request, memp->request_mempool);
+
+ scmnd->host_scribble = NULL;
+
+ ret = SCSI_MLQUEUE_DEVICE_BUSY;
+ }
+
+ return ret;
+}
+
+/* Scsi driver */
+static struct scsi_host_template scsi_driver = {
+ .module = THIS_MODULE,
+ .name = "storvsc_host_t",
+ .bios_param = storvsc_get_chs,
+ .queuecommand = storvsc_queuecommand,
+ .eh_host_reset_handler = storvsc_host_reset_handler,
+ .slave_alloc = storvsc_device_alloc,
+ .slave_destroy = storvsc_device_destroy,
+ .slave_configure = storvsc_device_configure,
+ .cmd_per_lun = 1,
+ /* 64 max_queue * 1 target */
+ .can_queue = STORVSC_MAX_IO_REQUESTS*STORVSC_MAX_TARGETS,
+ .this_id = -1,
+ /* no use setting to 0 since ll_blk_rw reset it to 1 */
+ /* currently 32 */
+ .sg_tablesize = MAX_MULTIPAGE_BUFFER_COUNT,
+ .use_clustering = DISABLE_CLUSTERING,
+ /* Make sure we dont get a sg segment crosses a page boundary */
+ .dma_boundary = PAGE_SIZE-1,
+};
+
+enum {
+ SCSI_GUID,
+ IDE_GUID,
+};
+
+static const struct hv_vmbus_device_id id_table[] = {
+ /* SCSI guid */
+ { VMBUS_DEVICE(0xd9, 0x63, 0x61, 0xba, 0xa1, 0x04, 0x29, 0x4d,
+ 0xb6, 0x05, 0x72, 0xe2, 0xff, 0xb1, 0xdc, 0x7f)
+ .driver_data = SCSI_GUID },
+ /* IDE guid */
+ { VMBUS_DEVICE(0x32, 0x26, 0x41, 0x32, 0xcb, 0x86, 0xa2, 0x44,
+ 0x9b, 0x5c, 0x50, 0xd1, 0x41, 0x73, 0x54, 0xf5)
+ .driver_data = IDE_GUID },
+ { },
+};
+
+MODULE_DEVICE_TABLE(vmbus, id_table);
+
+
+/*
+ * storvsc_probe - Add a new device for this driver
+ */
+
+static int storvsc_probe(struct hv_device *device,
+ const struct hv_vmbus_device_id *dev_id)
+{
+ int ret;
+ struct Scsi_Host *host;
+ struct hv_host_device *host_dev;
+ bool dev_is_ide = ((dev_id->driver_data == IDE_GUID) ? true : false);
+ int path = 0;
+ int target = 0;
+ struct storvsc_device *stor_device;
+
+ host = scsi_host_alloc(&scsi_driver,
+ sizeof(struct hv_host_device));
+ if (!host)
+ return -ENOMEM;
+
+ host_dev = shost_priv(host);
+ memset(host_dev, 0, sizeof(struct hv_host_device));
+
+ host_dev->port = host->host_no;
+ host_dev->dev = device;
+
+
+ stor_device = kzalloc(sizeof(struct storvsc_device), GFP_KERNEL);
+ if (!stor_device) {
+ ret = -ENOMEM;
+ goto err_out0;
+ }
+
+ stor_device->destroy = false;
+ init_waitqueue_head(&stor_device->waiting_to_drain);
+ stor_device->device = device;
+ stor_device->host = host;
+ hv_set_drvdata(device, stor_device);
+
+ stor_device->port_number = host->host_no;
+ ret = storvsc_connect_to_vsp(device, storvsc_ringbuffer_size);
+ if (ret)
+ goto err_out1;
+
+ if (dev_is_ide)
+ storvsc_get_ide_info(device, &target, &path);
+
+ host_dev->path = stor_device->path_id;
+ host_dev->target = stor_device->target_id;
+
+ /* max # of devices per target */
+ host->max_lun = STORVSC_MAX_LUNS_PER_TARGET;
+ /* max # of targets per channel */
+ host->max_id = STORVSC_MAX_TARGETS;
+ /* max # of channels */
+ host->max_channel = STORVSC_MAX_CHANNELS - 1;
+ /* max cmd length */
+ host->max_cmd_len = STORVSC_MAX_CMD_LEN;
+
+ /* Register the HBA and start the scsi bus scan */
+ ret = scsi_add_host(host, &device->device);
+ if (ret != 0)
+ goto err_out2;
+
+ if (!dev_is_ide) {
+ scsi_scan_host(host);
+ return 0;
+ }
+ ret = scsi_add_device(host, 0, target, 0);
+ if (ret) {
+ scsi_remove_host(host);
+ goto err_out2;
+ }
+ return 0;
+
+err_out2:
+ /*
+ * Once we have connected with the host, we would need to
+ * to invoke storvsc_dev_remove() to rollback this state and
+ * this call also frees up the stor_device; hence the jump around
+ * err_out1 label.
+ */
+ storvsc_dev_remove(device);
+ goto err_out0;
+
+err_out1:
+ kfree(stor_device);
+
+err_out0:
+ scsi_host_put(host);
+ return ret;
+}
+
+/* The one and only one */
+
+static struct hv_driver storvsc_drv = {
+ .name = KBUILD_MODNAME,
+ .id_table = id_table,
+ .probe = storvsc_probe,
+ .remove = storvsc_remove,
+};
+
+static int __init storvsc_drv_init(void)
+{
+ u32 max_outstanding_req_per_channel;
+
+ /*
+ * Divide the ring buffer data size (which is 1 page less
+ * than the ring buffer size since that page is reserved for
+ * the ring buffer indices) by the max request size (which is
+ * vmbus_channel_packet_multipage_buffer + struct vstor_packet + u64)
+ */
+ max_outstanding_req_per_channel =
+ ((storvsc_ringbuffer_size - PAGE_SIZE) /
+ ALIGN(MAX_MULTIPAGE_BUFFER_PACKET +
+ sizeof(struct vstor_packet) + sizeof(u64),
+ sizeof(u64)));
+
+ if (max_outstanding_req_per_channel <
+ STORVSC_MAX_IO_REQUESTS)
+ return -EINVAL;
+
+ return vmbus_driver_register(&storvsc_drv);
+}
+
+static void __exit storvsc_drv_exit(void)
+{
+ vmbus_driver_unregister(&storvsc_drv);
+}
+
+MODULE_LICENSE("GPL");
+MODULE_VERSION(HV_DRV_VERSION);
+MODULE_DESCRIPTION("Microsoft Hyper-V virtual storage driver");
+module_init(storvsc_drv_init);
+module_exit(storvsc_drv_exit);
--
1.7.4.1
^ permalink raw reply related
* [PATCH] XEN: xenbus: integer overflow in process_msg()
From: Haogang Chen @ 2012-01-03 19:42 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk, Jeremy Fitzhardinge, Ian Campbell
Cc: Haogang Chen, xen-devel, linux-kernel, virtualization
There is a potential integer overflow in process_msg() that could result
in cross-domain attack.
body = kmalloc(msg->hdr.len + 1, GFP_NOIO | __GFP_HIGH);
When a malicious guest passes 0xffffffff in msg->hdr.len, the subsequent
call to xb_read() would write to a zero-length buffer. This causes
kernel oops in the receiving guest and hangs its xenbus kernel thread.
The patch returns -EINVAL in that case.
Signed-off-by: Haogang Chen <haogangchen@gmail.com>
---
drivers/xen/xenbus/xenbus_xs.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
index ede860f..e32aefb 100644
--- a/drivers/xen/xenbus/xenbus_xs.c
+++ b/drivers/xen/xenbus/xenbus_xs.c
@@ -801,6 +801,12 @@ static int process_msg(void)
goto out;
}
+ if (msg->hdr.len == UINT_MAX) {
+ kfree(msg);
+ err = -EINVAL;
+ goto out;
+ }
+
body = kmalloc(msg->hdr.len + 1, GFP_NOIO | __GFP_HIGH);
if (body == NULL) {
kfree(msg);
--
1.7.5.4
^ permalink raw reply related
* Re: [PATCH 4/9] drivers/xen/gntalloc.c: introduce missing kfree
From: Konrad Rzeszutek Wilk @ 2012-01-03 18:42 UTC (permalink / raw)
To: Julia Lawall
Cc: Jeremy Fitzhardinge, xen-devel, kernel-janitors, linux-kernel,
virtualization
In-Reply-To: <1324661974-17281-4-git-send-email-julia@diku.dk>
On Fri, Dec 23, 2011 at 06:39:29PM +0100, Julia Lawall wrote:
> Error handling code following a kmalloc should free the allocated data.
> Out_unlock is used on both success and failure, so free vm_priv before
> jumping to that label.
Applied. Thanks!
^ permalink raw reply
* Re: [PATCH v6 09/11] virtio: net: Add freeze, restore handlers to support S4
From: Amit Shah @ 2012-01-03 17:07 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: linux-pm, linux-kernel, Virtualization List
In-Reply-To: <20120101114727.GC25350@redhat.com>
On (Sun) 01 Jan 2012 [13:47:27], Michael S. Tsirkin wrote:
> On Fri, Dec 30, 2011 at 03:08:19PM +0530, Amit Shah wrote:
> > >
> > > > + if (!try_fill_recv(vi, GFP_KERNEL))
> > > > + schedule_delayed_work(&vi->refill, 0);
> > >
> > > This needs to be switched to non reentrant wq too?
> >
> > Subsequent invocations could create problems? Note that this will be
> > the first item to be queued in the workqueue for the refill work.
> >
> > Amit
>
> See recent patches on net-next.
> We switched other calls to a non reentrant wq so I think same
> reasoning applies there.
There's a patch ordering issue; so this series should land after the
-net series is merged.
Rusty, do you want me to re-spin this patch on top of yours, or is
that something that you'll handle?
Amit
^ permalink raw reply
* Re: [patch] virtio_blk: lock() => unlock() typo in virtblk_config_changed_work()
From: Michael S. Tsirkin @ 2012-01-03 15:31 UTC (permalink / raw)
To: Dan Carpenter; +Cc: kernel-janitors, virtualization
In-Reply-To: <20120103144513.GB21712@elgon.mountain>
On Tue, Jan 03, 2012 at 05:45:13PM +0300, Dan Carpenter wrote:
> This will deadlock.
>
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
>
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index f3d77b3..ba73661 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -352,7 +352,7 @@ static void virtblk_config_changed_work(struct work_struct *work)
>
> set_capacity(vblk->disk, capacity);
> done:
> - mutex_lock(&vblk->config_lock);
> + mutex_unlock(&vblk->config_lock);
> }
>
> static void virtblk_config_changed(struct virtio_device *vdev)
Good catch.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
^ permalink raw reply
* [patch] virtio_blk: lock() => unlock() typo in virtblk_config_changed_work()
From: Dan Carpenter @ 2012-01-03 14:45 UTC (permalink / raw)
To: Rusty Russell; +Cc: virtualization, kernel-janitors, Michael S. Tsirkin
This will deadlock.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index f3d77b3..ba73661 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -352,7 +352,7 @@ static void virtblk_config_changed_work(struct work_struct *work)
set_capacity(vblk->disk, capacity);
done:
- mutex_lock(&vblk->config_lock);
+ mutex_unlock(&vblk->config_lock);
}
static void virtblk_config_changed(struct virtio_device *vdev)
^ permalink raw reply related
* Re: [PATCH v6 09/11] virtio: net: Add freeze, restore handlers to support S4
From: Michael S. Tsirkin @ 2012-01-01 11:47 UTC (permalink / raw)
To: Amit Shah; +Cc: linux-pm, linux-kernel, Virtualization List
In-Reply-To: <20111230085205.GG4576@amit.redhat.com>
On Fri, Dec 30, 2011 at 03:08:19PM +0530, Amit Shah wrote:
> >
> > > + if (!try_fill_recv(vi, GFP_KERNEL))
> > > + schedule_delayed_work(&vi->refill, 0);
> >
> > This needs to be switched to non reentrant wq too?
>
> Subsequent invocations could create problems? Note that this will be
> the first item to be queued in the workqueue for the refill work.
>
> Amit
See recent patches on net-next.
We switched other calls to a non reentrant wq so I think same
reasoning applies there.
^ permalink raw reply
* Re: [PATCH v6 09/11] virtio: net: Add freeze, restore handlers to support S4
From: Amit Shah @ 2011-12-30 9:38 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: linux-pm, linux-kernel, Virtualization List
In-Reply-To: <20111229161523.GB2300@redhat.com>
On (Thu) 29 Dec 2011 [18:15:23], Michael S. Tsirkin wrote:
> On Thu, Dec 22, 2011 at 04:58:33PM +0530, Amit Shah wrote:
> > Remove all the vqs, disable napi and detach from the netdev on
> > hibernation.
> >
> > Re-create vqs after restoring from a hibernated image, re-enable napi
> > and re-attach the netdev. This keeps networking working across
> > hibernation.
> >
> > Signed-off-by: Amit Shah <amit.shah@redhat.com>
> > ---
> > drivers/net/virtio_net.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> > 1 files changed, 46 insertions(+), 0 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 7a2a5bf..b31670f 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -1159,6 +1159,48 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
> > free_netdev(vi->dev);
> > }
> >
> > +#ifdef CONFIG_PM
> > +static int virtnet_freeze(struct virtio_device *vdev)
> > +{
> > + struct virtnet_info *vi = vdev->priv;
> > +
> > + virtqueue_disable_cb(vi->rvq);
> > + virtqueue_disable_cb(vi->svq);
> > + if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ))
> > + virtqueue_disable_cb(vi->cvq);
> > +
> > + netif_device_detach(vi->dev);
> > + cancel_delayed_work_sync(&vi->refill);
> > +
> > + if (netif_running(vi->dev))
> > + napi_disable(&vi->napi);
> > +
> > + remove_vq_common(vi);
> > +
> > + return 0;
> > +}
> > +
> > +static int virtnet_restore(struct virtio_device *vdev)
> > +{
> > + struct virtnet_info *vi = vdev->priv;
> > + int err;
> > +
> > + err = init_vqs(vi);
> > + if (err)
> > + return err;
> > +
> > + if (netif_running(vi->dev))
>
> Can virtnet close run at this point?
> If yes it's a bug.
Userspace is suspended, and kthreads are frozen. If virtnet_close was
called before suspend, it would have already succeeded in disabling
napi.
> > + virtnet_napi_enable(vi);
> > +
> > + netif_device_attach(vi->dev);
> > +
>
> Not that open/close start/stop refill,
> I wonder whether this is a problem.
> For example, can virtnet_close run at this point,
> and cancel refill?
Same as above
>
> > + if (!try_fill_recv(vi, GFP_KERNEL))
> > + schedule_delayed_work(&vi->refill, 0);
>
> This needs to be switched to non reentrant wq too?
Subsequent invocations could create problems? Note that this will be
the first item to be queued in the workqueue for the refill work.
Amit
^ permalink raw reply
* Re: [PATCH v6 09/11] virtio: net: Add freeze, restore handlers to support S4
From: Michael S. Tsirkin @ 2011-12-29 16:15 UTC (permalink / raw)
To: Amit Shah; +Cc: linux-pm, linux-kernel, Virtualization List
In-Reply-To: <f84fa9fdc9f5629e711cbb89a2e7efbb289d72fa.1324553223.git.amit.shah@redhat.com>
On Thu, Dec 22, 2011 at 04:58:33PM +0530, Amit Shah wrote:
> Remove all the vqs, disable napi and detach from the netdev on
> hibernation.
>
> Re-create vqs after restoring from a hibernated image, re-enable napi
> and re-attach the netdev. This keeps networking working across
> hibernation.
>
> Signed-off-by: Amit Shah <amit.shah@redhat.com>
> ---
> drivers/net/virtio_net.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 46 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 7a2a5bf..b31670f 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1159,6 +1159,48 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
> free_netdev(vi->dev);
> }
>
> +#ifdef CONFIG_PM
> +static int virtnet_freeze(struct virtio_device *vdev)
> +{
> + struct virtnet_info *vi = vdev->priv;
> +
> + virtqueue_disable_cb(vi->rvq);
> + virtqueue_disable_cb(vi->svq);
> + if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ))
> + virtqueue_disable_cb(vi->cvq);
> +
> + netif_device_detach(vi->dev);
> + cancel_delayed_work_sync(&vi->refill);
> +
> + if (netif_running(vi->dev))
> + napi_disable(&vi->napi);
> +
> + remove_vq_common(vi);
> +
> + return 0;
> +}
> +
> +static int virtnet_restore(struct virtio_device *vdev)
> +{
> + struct virtnet_info *vi = vdev->priv;
> + int err;
> +
> + err = init_vqs(vi);
> + if (err)
> + return err;
> +
> + if (netif_running(vi->dev))
Can virtnet close run at this point?
If yes it's a bug.
> + virtnet_napi_enable(vi);
> +
> + netif_device_attach(vi->dev);
> +
Not that open/close start/stop refill,
I wonder whether this is a problem.
For example, can virtnet_close run at this point,
and cancel refill?
> + if (!try_fill_recv(vi, GFP_KERNEL))
> + schedule_delayed_work(&vi->refill, 0);
This needs to be switched to non reentrant wq too?
> +
> + return 0;
> +}
> +#endif
> +
> static struct virtio_device_id id_table[] = {
> { VIRTIO_ID_NET, VIRTIO_DEV_ANY_ID },
> { 0 },
> @@ -1183,6 +1225,10 @@ static struct virtio_driver virtio_net_driver = {
> .probe = virtnet_probe,
> .remove = __devexit_p(virtnet_remove),
> .config_changed = virtnet_config_changed,
> +#ifdef CONFIG_PM
> + .freeze = virtnet_freeze,
> + .restore = virtnet_restore,
> +#endif
> };
>
> static int __init init(void)
> --
> 1.7.7.4
^ permalink raw reply
* Re: [RESENDx2] [PULL] virtio: fix barriers for virtio-mmio
From: Ohad Ben-Cohen @ 2011-12-29 6:53 UTC (permalink / raw)
To: Rusty Russell; +Cc: linux-kernel, linux-arm, virtualization
In-Reply-To: <87y5twf16x.fsf@rustcorp.com.au>
On Thu, Dec 29, 2011 at 6:31 AM, Rusty Russell <rusty@rustcorp.com.au> wrote:
> OK, I've applied this patch; you might want to carry it in your tree
> too. You'll need to fix up your call to vring_new_virtqueue()....
Will do. Thanks !
^ permalink raw reply
* Re: [PATCH v6 00/11] virtio: s4 support
From: Rusty Russell @ 2011-12-29 4:32 UTC (permalink / raw)
To: Virtualization List; +Cc: Amit Shah, linux-pm, linux-kernel, Michael S. Tsirkin
In-Reply-To: <cover.1324553223.git.amit.shah@redhat.com>
On Thu, 22 Dec 2011 16:58:24 +0530, Amit Shah <amit.shah@redhat.com> wrote:
> Hi,
>
> These patches add support for S4 to virtio (pci) and all drivers.
Thanks, they're neatly divided, and all seem sane.
Applied,
Rusty.
^ permalink raw reply
* Re: [RESENDx2] [PULL] virtio: fix barriers for virtio-mmio
From: Rusty Russell @ 2011-12-29 4:31 UTC (permalink / raw)
To: Ohad Ben-Cohen; +Cc: linux-kernel, linux-arm, virtualization
In-Reply-To: <CAK=WgbacH1tXHJiHNKLNNsFA7-vnFT_J6vYOANh0wzGyUj8Zpg@mail.gmail.com>
On Fri, 23 Dec 2011 13:35:26 +0200, Ohad Ben-Cohen <ohad@wizery.com> wrote:
> On Fri, Dec 23, 2011 at 11:35 AM, Linus Torvalds
> And rpmsg is using virtio to avoid implementing another shared memory
> "wire" protocol. And of course to be able to reuse all the existing
> virtio drivers (e.g. net, block, console) with a remote core backend.
OK, I've applied this patch; you might want to carry it in your tree
too. You'll need to fix up your call to vring_new_virtqueue()....
From: Rusty Russell <rusty@rustcorp.com.au>
Subject: virtio: harsher barriers for rpmsg.
We were cheating with our barriers; using the smp ones rather than the
real device ones. That was fine, until rpmsg came along, which is
used to talk to a real device (a non-SMP CPU).
Unfortunately, just putting back the real barriers (reverting
d57ed95d) causes a performance regression on virtio-pci. In
particular, Amos reports netbench's TCP_RR over virtio_net CPU
utilization increased up to 35% while throughput went down by up to
14%.
By comparison, this branch is in the noise.
Reference: https://lkml.org/lkml/2011/12/11/22
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
drivers/lguest/lguest_device.c | 8 +++++---
drivers/s390/kvm/kvm_virtio.c | 2 +-
drivers/virtio/virtio_mmio.c | 4 ++--
drivers/virtio/virtio_pci.c | 4 ++--
drivers/virtio/virtio_ring.c | 34 +++++++++++++++++++++-------------
include/linux/virtio_ring.h | 1 +
tools/virtio/linux/virtio.h | 1 +
tools/virtio/virtio_test.c | 3 ++-
8 files changed, 35 insertions(+), 22 deletions(-)
diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
--- a/drivers/lguest/lguest_device.c
+++ b/drivers/lguest/lguest_device.c
@@ -292,10 +292,12 @@ static struct virtqueue *lg_find_vq(stru
/*
* OK, tell virtio_ring.c to set up a virtqueue now we know its size
- * and we've got a pointer to its pages.
+ * and we've got a pointer to its pages. Note that we set weak_barriers
+ * to 'true': the host just a(nother) SMP CPU, so we only need inter-cpu
+ * barriers.
*/
- vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN,
- vdev, lvq->pages, lg_notify, callback, name);
+ vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN, vdev,
+ true, lvq->pages, lg_notify, callback, name);
if (!vq) {
err = -ENOMEM;
goto unmap;
diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -198,7 +198,7 @@ static struct virtqueue *kvm_find_vq(str
goto out;
vq = vring_new_virtqueue(config->num, KVM_S390_VIRTIO_RING_ALIGN,
- vdev, (void *) config->address,
+ vdev, true, (void *) config->address,
kvm_notify, callback, name);
if (!vq) {
err = -ENOMEM;
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -310,8 +310,8 @@ static struct virtqueue *vm_setup_vq(str
vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
/* Create the vring */
- vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN,
- vdev, info->queue, vm_notify, callback, name);
+ vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN, vdev,
+ true, info->queue, vm_notify, callback, name);
if (!vq) {
err = -ENOMEM;
goto error_new_virtqueue;
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -414,8 +414,8 @@ static struct virtqueue *setup_vq(struct
vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
/* create the vring */
- vq = vring_new_virtqueue(info->num, VIRTIO_PCI_VRING_ALIGN,
- vdev, info->queue, vp_notify, callback, name);
+ vq = vring_new_virtqueue(info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
+ true, info->queue, vp_notify, callback, name);
if (!vq) {
err = -ENOMEM;
goto out_activate_queue;
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -28,17 +28,20 @@
#ifdef CONFIG_SMP
/* Where possible, use SMP barriers which are more lightweight than mandatory
* barriers, because mandatory barriers control MMIO effects on accesses
- * through relaxed memory I/O windows (which virtio does not use). */
-#define virtio_mb() smp_mb()
-#define virtio_rmb() smp_rmb()
-#define virtio_wmb() smp_wmb()
+ * through relaxed memory I/O windows (which virtio-pci does not use). */
+#define virtio_mb(vq) \
+ do { if ((vq)->weak_barriers) smp_mb(); else mb(); } while(0)
+#define virtio_rmb(vq) \
+ do { if ((vq)->weak_barriers) smp_rmb(); else rmb(); } while(0)
+#define virtio_wmb(vq) \
+ do { if ((vq)->weak_barriers) smp_rmb(); else rmb(); } while(0)
#else
/* We must force memory ordering even if guest is UP since host could be
* running on another CPU, but SMP barriers are defined to barrier() in that
* configuration. So fall back to mandatory barriers instead. */
-#define virtio_mb() mb()
-#define virtio_rmb() rmb()
-#define virtio_wmb() wmb()
+#define virtio_mb(vq) mb()
+#define virtio_rmb(vq) rmb()
+#define virtio_wmb(vq) wmb()
#endif
#ifdef DEBUG
@@ -77,6 +80,9 @@ struct vring_virtqueue
/* Actual memory layout for this queue */
struct vring vring;
+ /* Can we use weak barriers? */
+ bool weak_barriers;
+
/* Other side has made a mess, don't try any more. */
bool broken;
@@ -245,14 +251,14 @@ void virtqueue_kick(struct virtqueue *_v
START_USE(vq);
/* Descriptors and available array need to be set before we expose the
* new available array entries. */
- virtio_wmb();
+ virtio_wmb(vq);
old = vq->vring.avail->idx;
new = vq->vring.avail->idx = old + vq->num_added;
vq->num_added = 0;
/* Need to update avail index before checking if we should notify */
- virtio_mb();
+ virtio_mb(vq);
if (vq->event ?
vring_need_event(vring_avail_event(&vq->vring), new, old) :
@@ -314,7 +320,7 @@ void *virtqueue_get_buf(struct virtqueue
}
/* Only get used array entries after they have been exposed by host. */
- virtio_rmb();
+ virtio_rmb(vq);
i = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].id;
*len = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].len;
@@ -337,7 +343,7 @@ void *virtqueue_get_buf(struct virtqueue
* the read in the next get_buf call. */
if (!(vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT)) {
vring_used_event(&vq->vring) = vq->last_used_idx;
- virtio_mb();
+ virtio_mb(vq);
}
END_USE(vq);
@@ -366,7 +372,7 @@ bool virtqueue_enable_cb(struct virtqueu
* entry. Always do both to keep code simple. */
vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_INTERRUPT;
vring_used_event(&vq->vring) = vq->last_used_idx;
- virtio_mb();
+ virtio_mb(vq);
if (unlikely(more_used(vq))) {
END_USE(vq);
return false;
@@ -393,7 +399,7 @@ bool virtqueue_enable_cb_delayed(struct
/* TODO: tune this threshold */
bufs = (u16)(vq->vring.avail->idx - vq->last_used_idx) * 3 / 4;
vring_used_event(&vq->vring) = vq->last_used_idx + bufs;
- virtio_mb();
+ virtio_mb(vq);
if (unlikely((u16)(vq->vring.used->idx - vq->last_used_idx) > bufs)) {
END_USE(vq);
return false;
@@ -453,6 +459,7 @@ EXPORT_SYMBOL_GPL(vring_interrupt);
struct virtqueue *vring_new_virtqueue(unsigned int num,
unsigned int vring_align,
struct virtio_device *vdev,
+ bool weak_barriers,
void *pages,
void (*notify)(struct virtqueue *),
void (*callback)(struct virtqueue *),
@@ -476,6 +483,7 @@ struct virtqueue *vring_new_virtqueue(un
vq->vq.vdev = vdev;
vq->vq.name = name;
vq->notify = notify;
+ vq->weak_barriers = weak_barriers;
vq->broken = false;
vq->last_used_idx = 0;
vq->num_added = 0;
diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -168,6 +168,7 @@ struct virtqueue;
struct virtqueue *vring_new_virtqueue(unsigned int num,
unsigned int vring_align,
struct virtio_device *vdev,
+ bool weak_barriers,
void *pages,
void (*notify)(struct virtqueue *vq),
void (*callback)(struct virtqueue *vq),
diff --git a/tools/virtio/linux/virtio.h b/tools/virtio/linux/virtio.h
--- a/tools/virtio/linux/virtio.h
+++ b/tools/virtio/linux/virtio.h
@@ -214,6 +214,7 @@ void *virtqueue_detach_unused_buf(struct
struct virtqueue *vring_new_virtqueue(unsigned int num,
unsigned int vring_align,
struct virtio_device *vdev,
+ bool weak_barriers,
void *pages,
void (*notify)(struct virtqueue *vq),
void (*callback)(struct virtqueue *vq),
diff --git a/tools/virtio/virtio_test.c b/tools/virtio/virtio_test.c
--- a/tools/virtio/virtio_test.c
+++ b/tools/virtio/virtio_test.c
@@ -92,7 +92,8 @@ static void vq_info_add(struct vdev_info
assert(r >= 0);
memset(info->ring, 0, vring_size(num, 4096));
vring_init(&info->vring, num, info->ring, 4096);
- info->vq = vring_new_virtqueue(info->vring.num, 4096, &dev->vdev, info->ring,
+ info->vq = vring_new_virtqueue(info->vring.num, 4096, &dev->vdev,
+ true, info->ring,
vq_notify, vq_callback, "test");
assert(info->vq);
info->vq->priv = info;
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox