Linux virtualization list
 help / color / mirror / Atom feed
* [RFC v6 6/6] VSOCK: Add Makefile and Kconfig
From: Stefan Hajnoczi @ 2016-07-28 14:36 UTC (permalink / raw)
  To: kvm
  Cc: marius vlad, Stefan Hajnoczi, Michael S. Tsirkin, netdev,
	Ian Campbell, Claudio Imbrenda, ggarcia, Matt Benjamin, Asias He,
	Greg Kurz, virtualization, Christoffer Dall
In-Reply-To: <1469716595-13591-1-git-send-email-stefanha@redhat.com>

From: Asias He <asias@redhat.com>

Enable virtio-vsock and vhost-vsock.

Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
v6:
 * Rename to virtio-vsock kernel modules to vmw_vsock_virtio_transport*
   instead of just virtio_transport to make the name clearer [Ian
   Campbell]
v4:
 * Make checkpatch.pl happy with longer option description
 * Clarify dependency on virtio rather than QEMU as suggested by Alex
   Bennee
v3:
 * Don't put vhost vsock driver into staging
 * Add missing Kconfig dependencies (Arnd Bergmann <arnd@arndb.de>)
---
 drivers/vhost/Kconfig  | 15 +++++++++++++++
 drivers/vhost/Makefile |  4 ++++
 net/vmw_vsock/Kconfig  | 20 ++++++++++++++++++++
 net/vmw_vsock/Makefile |  6 ++++++
 4 files changed, 45 insertions(+)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 533eaf0..d7aae9e 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -21,6 +21,21 @@ config VHOST_SCSI
 	Say M here to enable the vhost_scsi TCM fabric module
 	for use with virtio-scsi guests
 
+config VHOST_VSOCK
+	tristate "vhost virtio-vsock driver"
+	depends on VSOCKETS && EVENTFD
+	select VIRTIO_VSOCKETS_COMMON
+	select VHOST
+	select VHOST_RING
+	default n
+	---help---
+	This kernel module can be loaded in the host kernel to provide AF_VSOCK
+	sockets for communicating with guests.  The guests must have the
+	virtio_transport.ko driver loaded to use the virtio-vsock device.
+
+	To compile this driver as a module, choose M here: the module will be called
+	vhost_vsock.
+
 config VHOST_RING
 	tristate
 	---help---
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index e0441c3..6b012b9 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -4,5 +4,9 @@ vhost_net-y := net.o
 obj-$(CONFIG_VHOST_SCSI) += vhost_scsi.o
 vhost_scsi-y := scsi.o
 
+obj-$(CONFIG_VHOST_VSOCK) += vhost_vsock.o
+vhost_vsock-y := vsock.o
+
 obj-$(CONFIG_VHOST_RING) += vringh.o
+
 obj-$(CONFIG_VHOST)	+= vhost.o
diff --git a/net/vmw_vsock/Kconfig b/net/vmw_vsock/Kconfig
index 14810ab..8831e7c 100644
--- a/net/vmw_vsock/Kconfig
+++ b/net/vmw_vsock/Kconfig
@@ -26,3 +26,23 @@ config VMWARE_VMCI_VSOCKETS
 
 	  To compile this driver as a module, choose M here: the module
 	  will be called vmw_vsock_vmci_transport. If unsure, say N.
+
+config VIRTIO_VSOCKETS
+	tristate "virtio transport for Virtual Sockets"
+	depends on VSOCKETS && VIRTIO
+	select VIRTIO_VSOCKETS_COMMON
+	help
+	  This module implements a virtio transport for Virtual Sockets.
+
+	  Enable this transport if your Virtual Machine host supports Virtual
+	  Sockets over virtio.
+
+	  To compile this driver as a module, choose M here: the module will be
+	  called vmw_vsock_virtio_transport. If unsure, say N.
+
+config VIRTIO_VSOCKETS_COMMON
+	tristate
+	help
+	  This option is selected by any driver which needs to access
+	  the virtio_vsock.  The module will be called
+	  vmw_vsock_virtio_transport_common.
diff --git a/net/vmw_vsock/Makefile b/net/vmw_vsock/Makefile
index 2ce52d7..bc27c70 100644
--- a/net/vmw_vsock/Makefile
+++ b/net/vmw_vsock/Makefile
@@ -1,7 +1,13 @@
 obj-$(CONFIG_VSOCKETS) += vsock.o
 obj-$(CONFIG_VMWARE_VMCI_VSOCKETS) += vmw_vsock_vmci_transport.o
+obj-$(CONFIG_VIRTIO_VSOCKETS) += vmw_vsock_virtio_transport.o
+obj-$(CONFIG_VIRTIO_VSOCKETS_COMMON) += vmw_vsock_virtio_transport_common.o
 
 vsock-y += af_vsock.o vsock_addr.o
 
 vmw_vsock_vmci_transport-y += vmci_transport.o vmci_transport_notify.o \
 	vmci_transport_notify_qstate.o
+
+vmw_vsock_virtio_transport-y += virtio_transport.o
+
+vmw_vsock_virtio_transport_common-y += virtio_transport_common.o
-- 
2.7.4

^ permalink raw reply related

* [RFC v6 5/6] VSOCK: Introduce vhost_vsock.ko
From: Stefan Hajnoczi @ 2016-07-28 14:36 UTC (permalink / raw)
  To: kvm
  Cc: marius vlad, Stefan Hajnoczi, Michael S. Tsirkin, netdev,
	Ian Campbell, Claudio Imbrenda, ggarcia, Matt Benjamin, Asias He,
	Greg Kurz, virtualization, Christoffer Dall
In-Reply-To: <1469716595-13591-1-git-send-email-stefanha@redhat.com>

From: Asias He <asias@redhat.com>

VM sockets vhost transport implementation.  This driver runs on the
host.

Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
v6:
 * Fall back to non-contiguous pages if vsock struct is too large (idea
   stolen from vhost_net)
 * 64-bit CIDs in packet header and ioctl to match virtio-vsock
   specification
 * Add VHOST_VSOCK_SET_RUNNING ioctl to start/stop vhost cleanly
 * Remove total_tx_buf accounting, it is ineffective because control
   packets are not included
 * Start/stop rx depending on reply packet accounting to bound memory
   allocation if the guest is not processing rx packets
 * Turn vhost_vsock_mutex into a spinlock so packets can be sent in
   atomic context without lockdep errors
v5:
 * Only take rx/tx virtqueues, userspace handles the other virtqueues
 * Explicitly skip instances without a CID when transferring packets
 * Add VHOST_VSOCK_START ioctl to being vhost virtqueue processing
 * Reset established connections when device is closed
v4:
 * Add MAINTAINERS file entry
 * virtqueue used len is now sizeof(pkt->hdr) + pkt->len instead of just
   pkt->len
 * checkpatch.pl cleanups
 * Clarify struct vhost_vsock locking
 * Add comments about optimization that disables virtqueue notify
 * Drop unused vhost_vsock_handle_ctl_kick()
 * Call wake_up() after decrementing total_tx_buf to prevent deadlock
v3:
 * Remove unneeded variable used to store return value
   (Fengguang Wu <fengguang.wu@intel.com> and Julia Lawall
   <julia.lawall@lip6.fr>)
v2:
 * Add missing total_tx_buf decrement
 * Support flexible rx/tx descriptor layout
 * Refuse to assign reserved CIDs
 * Refuse guest CID if already in use
 * Only accept correctly addressed packets
---
 MAINTAINERS                |   2 +
 drivers/vhost/vsock.c      | 722 +++++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/vhost.h |   5 +
 3 files changed, 729 insertions(+)
 create mode 100644 drivers/vhost/vsock.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 7302663..12c79e5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12148,6 +12148,8 @@ F:	include/linux/virtio_vsock.h
 F:	include/uapi/linux/virtio_vsock.h
 F:	net/vmw_vsock/virtio_transport_common.c
 F:	net/vmw_vsock/virtio_transport.c
+F:	drivers/vhost/vsock.c
+F:	drivers/vhost/vsock.h
 
 VIRTUAL SERIO DEVICE DRIVER
 M:	Stephen Chandler Paul <thatslyude@gmail.com>
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
new file mode 100644
index 0000000..028ca16
--- /dev/null
+++ b/drivers/vhost/vsock.c
@@ -0,0 +1,722 @@
+/*
+ * vhost transport for vsock
+ *
+ * Copyright (C) 2013-2015 Red Hat, Inc.
+ * Author: Asias He <asias@redhat.com>
+ *         Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ */
+#include <linux/miscdevice.h>
+#include <linux/atomic.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/vmalloc.h>
+#include <net/sock.h>
+#include <linux/virtio_vsock.h>
+#include <linux/vhost.h>
+
+#include <net/af_vsock.h>
+#include "vhost.h"
+
+#define VHOST_VSOCK_DEFAULT_HOST_CID	2
+
+enum {
+	VHOST_VSOCK_FEATURES = VHOST_FEATURES,
+};
+
+/* Used to track all the vhost_vsock instances on the system. */
+static DEFINE_SPINLOCK(vhost_vsock_lock);
+static LIST_HEAD(vhost_vsock_list);
+
+struct vhost_vsock {
+	struct vhost_dev dev;
+	struct vhost_virtqueue vqs[2];
+
+	/* Link to global vhost_vsock_list, protected by vhost_vsock_lock */
+	struct list_head list;
+
+	struct vhost_work send_pkt_work;
+	spinlock_t send_pkt_list_lock;
+	struct list_head send_pkt_list;	/* host->guest pending packets */
+
+	atomic_t queued_replies;
+
+	u32 guest_cid;
+};
+
+static u32 vhost_transport_get_local_cid(void)
+{
+	return VHOST_VSOCK_DEFAULT_HOST_CID;
+}
+
+static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
+{
+	struct vhost_vsock *vsock;
+
+	spin_lock_bh(&vhost_vsock_lock);
+	list_for_each_entry(vsock, &vhost_vsock_list, list) {
+		u32 other_cid = vsock->guest_cid;
+
+		/* Skip instances that have no CID yet */
+		if (other_cid == 0)
+			continue;
+
+		if (other_cid == guest_cid) {
+			spin_unlock_bh(&vhost_vsock_lock);
+			return vsock;
+		}
+	}
+	spin_unlock_bh(&vhost_vsock_lock);
+
+	return NULL;
+}
+
+static void
+vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
+			    struct vhost_virtqueue *vq)
+{
+	struct vhost_virtqueue *tx_vq = &vsock->vqs[VSOCK_VQ_TX];
+	bool added = false;
+	bool restart_tx = false;
+
+	mutex_lock(&vq->mutex);
+
+	if (!vq->private_data)
+		goto out;
+
+	/* Avoid further vmexits, we're already processing the virtqueue */
+	vhost_disable_notify(&vsock->dev, vq);
+
+	for (;;) {
+		struct virtio_vsock_pkt *pkt;
+		struct iov_iter iov_iter;
+		unsigned out, in;
+		size_t nbytes;
+		size_t len;
+		int head;
+
+		spin_lock_bh(&vsock->send_pkt_list_lock);
+		if (list_empty(&vsock->send_pkt_list)) {
+			spin_unlock_bh(&vsock->send_pkt_list_lock);
+			vhost_enable_notify(&vsock->dev, vq);
+			break;
+		}
+
+		pkt = list_first_entry(&vsock->send_pkt_list,
+				       struct virtio_vsock_pkt, list);
+		list_del_init(&pkt->list);
+		spin_unlock_bh(&vsock->send_pkt_list_lock);
+
+		head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+					 &out, &in, NULL, NULL);
+		if (head < 0) {
+			spin_lock_bh(&vsock->send_pkt_list_lock);
+			list_add(&pkt->list, &vsock->send_pkt_list);
+			spin_unlock_bh(&vsock->send_pkt_list_lock);
+			break;
+		}
+
+		if (head == vq->num) {
+			spin_lock_bh(&vsock->send_pkt_list_lock);
+			list_add(&pkt->list, &vsock->send_pkt_list);
+			spin_unlock_bh(&vsock->send_pkt_list_lock);
+
+			/* We cannot finish yet if more buffers snuck in while
+			 * re-enabling notify.
+			 */
+			if (unlikely(vhost_enable_notify(&vsock->dev, vq))) {
+				vhost_disable_notify(&vsock->dev, vq);
+				continue;
+			}
+			break;
+		}
+
+		if (out) {
+			virtio_transport_free_pkt(pkt);
+			vq_err(vq, "Expected 0 output buffers, got %u\n", out);
+			break;
+		}
+
+		len = iov_length(&vq->iov[out], in);
+		iov_iter_init(&iov_iter, READ, &vq->iov[out], in, len);
+
+		nbytes = copy_to_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
+		if (nbytes != sizeof(pkt->hdr)) {
+			virtio_transport_free_pkt(pkt);
+			vq_err(vq, "Faulted on copying pkt hdr\n");
+			break;
+		}
+
+		nbytes = copy_to_iter(pkt->buf, pkt->len, &iov_iter);
+		if (nbytes != pkt->len) {
+			virtio_transport_free_pkt(pkt);
+			vq_err(vq, "Faulted on copying pkt buf\n");
+			break;
+		}
+
+		vhost_add_used(vq, head, sizeof(pkt->hdr) + pkt->len);
+		added = true;
+
+		if (pkt->reply) {
+			int val;
+
+			val = atomic_dec_return(&vsock->queued_replies);
+
+			/* Do we have resources to resume tx processing? */
+			if (val + 1 == tx_vq->num)
+				restart_tx = true;
+		}
+
+		virtio_transport_free_pkt(pkt);
+	}
+	if (added)
+		vhost_signal(&vsock->dev, vq);
+
+out:
+	mutex_unlock(&vq->mutex);
+
+	if (restart_tx)
+		vhost_poll_queue(&tx_vq->poll);
+}
+
+static void vhost_transport_send_pkt_work(struct vhost_work *work)
+{
+	struct vhost_virtqueue *vq;
+	struct vhost_vsock *vsock;
+
+	vsock = container_of(work, struct vhost_vsock, send_pkt_work);
+	vq = &vsock->vqs[VSOCK_VQ_RX];
+
+	vhost_transport_do_send_pkt(vsock, vq);
+}
+
+static int
+vhost_transport_send_pkt(struct virtio_vsock_pkt *pkt)
+{
+	struct vhost_vsock *vsock;
+	struct vhost_virtqueue *vq;
+	int len = pkt->len;
+
+	/* Find the vhost_vsock according to guest context id  */
+	vsock = vhost_vsock_get(le64_to_cpu(pkt->hdr.dst_cid));
+	if (!vsock) {
+		virtio_transport_free_pkt(pkt);
+		return -ENODEV;
+	}
+
+	vq = &vsock->vqs[VSOCK_VQ_RX];
+
+	if (pkt->reply)
+		atomic_inc(&vsock->queued_replies);
+
+	spin_lock_bh(&vsock->send_pkt_list_lock);
+	list_add_tail(&pkt->list, &vsock->send_pkt_list);
+	spin_unlock_bh(&vsock->send_pkt_list_lock);
+
+	vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
+	return len;
+}
+
+static struct virtio_vsock_pkt *
+vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
+		      unsigned int out, unsigned int in)
+{
+	struct virtio_vsock_pkt *pkt;
+	struct iov_iter iov_iter;
+	size_t nbytes;
+	size_t len;
+
+	if (in != 0) {
+		vq_err(vq, "Expected 0 input buffers, got %u\n", in);
+		return NULL;
+	}
+
+	pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
+	if (!pkt)
+		return NULL;
+
+	len = iov_length(vq->iov, out);
+	iov_iter_init(&iov_iter, WRITE, vq->iov, out, len);
+
+	nbytes = copy_from_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
+	if (nbytes != sizeof(pkt->hdr)) {
+		vq_err(vq, "Expected %zu bytes for pkt->hdr, got %zu bytes\n",
+		       sizeof(pkt->hdr), nbytes);
+		kfree(pkt);
+		return NULL;
+	}
+
+	if (le16_to_cpu(pkt->hdr.type) == VIRTIO_VSOCK_TYPE_STREAM)
+		pkt->len = le32_to_cpu(pkt->hdr.len);
+
+	/* No payload */
+	if (!pkt->len)
+		return pkt;
+
+	/* The pkt is too big */
+	if (pkt->len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
+		kfree(pkt);
+		return NULL;
+	}
+
+	pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
+	if (!pkt->buf) {
+		kfree(pkt);
+		return NULL;
+	}
+
+	nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
+	if (nbytes != pkt->len) {
+		vq_err(vq, "Expected %u byte payload, got %zu bytes\n",
+		       pkt->len, nbytes);
+		virtio_transport_free_pkt(pkt);
+		return NULL;
+	}
+
+	return pkt;
+}
+
+/* Is there space left for replies to rx packets? */
+static bool vhost_vsock_more_replies(struct vhost_vsock *vsock)
+{
+	struct vhost_virtqueue *vq = &vsock->vqs[VSOCK_VQ_TX];
+	int val;
+
+	smp_rmb(); /* paired with atomic_inc() and atomic_dec_return() */
+	val = atomic_read(&vsock->queued_replies);
+
+	return val < vq->num;
+}
+
+static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
+{
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
+						  poll.work);
+	struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
+						 dev);
+	struct virtio_vsock_pkt *pkt;
+	int head;
+	unsigned int out, in;
+	bool added = false;
+
+	mutex_lock(&vq->mutex);
+
+	if (!vq->private_data)
+		goto out;
+
+	vhost_disable_notify(&vsock->dev, vq);
+	for (;;) {
+		if (!vhost_vsock_more_replies(vsock)) {
+			/* Stop tx until the device processes already
+			 * pending replies.  Leave tx virtqueue
+			 * callbacks disabled.
+			 */
+			goto no_more_replies;
+		}
+
+		head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+					 &out, &in, NULL, NULL);
+		if (head < 0)
+			break;
+
+		if (head == vq->num) {
+			if (unlikely(vhost_enable_notify(&vsock->dev, vq))) {
+				vhost_disable_notify(&vsock->dev, vq);
+				continue;
+			}
+			break;
+		}
+
+		pkt = vhost_vsock_alloc_pkt(vq, out, in);
+		if (!pkt) {
+			vq_err(vq, "Faulted on pkt\n");
+			continue;
+		}
+
+		/* Only accept correctly addressed packets */
+		if (le64_to_cpu(pkt->hdr.src_cid) == vsock->guest_cid)
+			virtio_transport_recv_pkt(pkt);
+		else
+			virtio_transport_free_pkt(pkt);
+
+		vhost_add_used(vq, head, sizeof(pkt->hdr) + pkt->len);
+		added = true;
+	}
+
+no_more_replies:
+	if (added)
+		vhost_signal(&vsock->dev, vq);
+
+out:
+	mutex_unlock(&vq->mutex);
+}
+
+static void vhost_vsock_handle_rx_kick(struct vhost_work *work)
+{
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
+						poll.work);
+	struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
+						 dev);
+
+	vhost_transport_do_send_pkt(vsock, vq);
+}
+
+static int vhost_vsock_start(struct vhost_vsock *vsock)
+{
+	size_t i;
+	int ret;
+
+	mutex_lock(&vsock->dev.mutex);
+
+	ret = vhost_dev_check_owner(&vsock->dev);
+	if (ret)
+		goto err;
+
+	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
+		struct vhost_virtqueue *vq = &vsock->vqs[i];
+
+		mutex_lock(&vq->mutex);
+
+		if (!vhost_vq_access_ok(vq)) {
+			ret = -EFAULT;
+			mutex_unlock(&vq->mutex);
+			goto err_vq;
+		}
+
+		if (!vq->private_data) {
+			vq->private_data = vsock;
+			vhost_vq_init_access(vq);
+		}
+
+		mutex_unlock(&vq->mutex);
+	}
+
+	mutex_unlock(&vsock->dev.mutex);
+	return 0;
+
+err_vq:
+	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
+		struct vhost_virtqueue *vq = &vsock->vqs[i];
+
+		mutex_lock(&vq->mutex);
+		vq->private_data = NULL;
+		mutex_unlock(&vq->mutex);
+	}
+err:
+	mutex_unlock(&vsock->dev.mutex);
+	return ret;
+}
+
+static int vhost_vsock_stop(struct vhost_vsock *vsock)
+{
+	size_t i;
+	int ret;
+
+	mutex_lock(&vsock->dev.mutex);
+
+	ret = vhost_dev_check_owner(&vsock->dev);
+	if (ret)
+		goto err;
+
+	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
+		struct vhost_virtqueue *vq = &vsock->vqs[i];
+
+		mutex_lock(&vq->mutex);
+		vq->private_data = NULL;
+		mutex_unlock(&vq->mutex);
+	}
+
+err:
+	mutex_unlock(&vsock->dev.mutex);
+	return ret;
+}
+
+static void vhost_vsock_free(struct vhost_vsock *vsock)
+{
+	if (is_vmalloc_addr(vsock))
+		vfree(vsock);
+	else
+		kfree(vsock);
+}
+
+static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
+{
+	struct vhost_virtqueue **vqs;
+	struct vhost_vsock *vsock;
+	int ret;
+
+	/* This struct is large and allocation could fail, fall back to vmalloc
+	 * if there is no other way.
+	 */
+	vsock = kzalloc(sizeof(*vsock), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	if (!vsock) {
+		vsock = vmalloc(sizeof(*vsock));
+		if (!vsock)
+			return -ENOMEM;
+	}
+
+	vqs = kmalloc_array(ARRAY_SIZE(vsock->vqs), sizeof(*vqs), GFP_KERNEL);
+	if (!vqs) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	atomic_set(&vsock->queued_replies, 0);
+
+	vqs[VSOCK_VQ_TX] = &vsock->vqs[VSOCK_VQ_TX];
+	vqs[VSOCK_VQ_RX] = &vsock->vqs[VSOCK_VQ_RX];
+	vsock->vqs[VSOCK_VQ_TX].handle_kick = vhost_vsock_handle_tx_kick;
+	vsock->vqs[VSOCK_VQ_RX].handle_kick = vhost_vsock_handle_rx_kick;
+
+	vhost_dev_init(&vsock->dev, vqs, ARRAY_SIZE(vsock->vqs));
+
+	file->private_data = vsock;
+	spin_lock_init(&vsock->send_pkt_list_lock);
+	INIT_LIST_HEAD(&vsock->send_pkt_list);
+	vhost_work_init(&vsock->send_pkt_work, vhost_transport_send_pkt_work);
+
+	spin_lock_bh(&vhost_vsock_lock);
+	list_add_tail(&vsock->list, &vhost_vsock_list);
+	spin_unlock_bh(&vhost_vsock_lock);
+	return 0;
+
+out:
+	vhost_vsock_free(vsock);
+	return ret;
+}
+
+static void vhost_vsock_flush(struct vhost_vsock *vsock)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++)
+		if (vsock->vqs[i].handle_kick)
+			vhost_poll_flush(&vsock->vqs[i].poll);
+	vhost_work_flush(&vsock->dev, &vsock->send_pkt_work);
+}
+
+static void vhost_vsock_reset_orphans(struct sock *sk)
+{
+	struct vsock_sock *vsk = vsock_sk(sk);
+
+	/* vmci_transport.c doesn't take sk_lock here either.  At least we're
+	 * under vsock_table_lock so the sock cannot disappear while we're
+	 * executing.
+	 */
+
+	if (!vhost_vsock_get(vsk->local_addr.svm_cid)) {
+		sock_set_flag(sk, SOCK_DONE);
+		vsk->peer_shutdown = SHUTDOWN_MASK;
+		sk->sk_state = SS_UNCONNECTED;
+		sk->sk_err = ECONNRESET;
+		sk->sk_error_report(sk);
+	}
+}
+
+static int vhost_vsock_dev_release(struct inode *inode, struct file *file)
+{
+	struct vhost_vsock *vsock = file->private_data;
+
+	spin_lock_bh(&vhost_vsock_lock);
+	list_del(&vsock->list);
+	spin_unlock_bh(&vhost_vsock_lock);
+
+	/* Iterating over all connections for all CIDs to find orphans is
+	 * inefficient.  Room for improvement here. */
+	vsock_for_each_connected_socket(vhost_vsock_reset_orphans);
+
+	vhost_vsock_stop(vsock);
+	vhost_vsock_flush(vsock);
+	vhost_dev_stop(&vsock->dev);
+
+	spin_lock_bh(&vsock->send_pkt_list_lock);
+	while (!list_empty(&vsock->send_pkt_list)) {
+		struct virtio_vsock_pkt *pkt;
+
+		pkt = list_first_entry(&vsock->send_pkt_list,
+				struct virtio_vsock_pkt, list);
+		list_del_init(&pkt->list);
+		virtio_transport_free_pkt(pkt);
+	}
+	spin_unlock_bh(&vsock->send_pkt_list_lock);
+
+	vhost_dev_cleanup(&vsock->dev, false);
+	kfree(vsock->dev.vqs);
+	vhost_vsock_free(vsock);
+	return 0;
+}
+
+static int vhost_vsock_set_cid(struct vhost_vsock *vsock, u64 guest_cid)
+{
+	struct vhost_vsock *other;
+
+	/* Refuse reserved CIDs */
+	if (guest_cid <= VMADDR_CID_HOST ||
+	    guest_cid == U32_MAX)
+		return -EINVAL;
+
+	/* 64-bit CIDs are not yet supported */
+	if (guest_cid > U32_MAX)
+		return -EINVAL;
+
+	/* Refuse if CID is already in use */
+	other = vhost_vsock_get(guest_cid);
+	if (other && other != vsock)
+		return -EADDRINUSE;
+
+	spin_lock_bh(&vhost_vsock_lock);
+	vsock->guest_cid = guest_cid;
+	spin_unlock_bh(&vhost_vsock_lock);
+
+	return 0;
+}
+
+static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features)
+{
+	struct vhost_virtqueue *vq;
+	int i;
+
+	if (features & ~VHOST_VSOCK_FEATURES)
+		return -EOPNOTSUPP;
+
+	mutex_lock(&vsock->dev.mutex);
+	if ((features & (1 << VHOST_F_LOG_ALL)) &&
+	    !vhost_log_access_ok(&vsock->dev)) {
+		mutex_unlock(&vsock->dev.mutex);
+		return -EFAULT;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
+		vq = &vsock->vqs[i];
+		mutex_lock(&vq->mutex);
+		vq->acked_features = features;
+		mutex_unlock(&vq->mutex);
+	}
+	mutex_unlock(&vsock->dev.mutex);
+	return 0;
+}
+
+static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
+				  unsigned long arg)
+{
+	struct vhost_vsock *vsock = f->private_data;
+	void __user *argp = (void __user *)arg;
+	u64 guest_cid;
+	u64 features;
+	int start;
+	int r;
+
+	switch (ioctl) {
+	case VHOST_VSOCK_SET_GUEST_CID:
+		if (copy_from_user(&guest_cid, argp, sizeof(guest_cid)))
+			return -EFAULT;
+		return vhost_vsock_set_cid(vsock, guest_cid);
+	case VHOST_VSOCK_SET_RUNNING:
+		if (copy_from_user(&start, argp, sizeof(start)))
+			return -EFAULT;
+		if (start)
+			return vhost_vsock_start(vsock);
+		else
+			return vhost_vsock_stop(vsock);
+	case VHOST_GET_FEATURES:
+		features = VHOST_VSOCK_FEATURES;
+		if (copy_to_user(argp, &features, sizeof(features)))
+			return -EFAULT;
+		return 0;
+	case VHOST_SET_FEATURES:
+		if (copy_from_user(&features, argp, sizeof(features)))
+			return -EFAULT;
+		return vhost_vsock_set_features(vsock, features);
+	default:
+		mutex_lock(&vsock->dev.mutex);
+		r = vhost_dev_ioctl(&vsock->dev, ioctl, argp);
+		if (r == -ENOIOCTLCMD)
+			r = vhost_vring_ioctl(&vsock->dev, ioctl, argp);
+		else
+			vhost_vsock_flush(vsock);
+		mutex_unlock(&vsock->dev.mutex);
+		return r;
+	}
+}
+
+static const struct file_operations vhost_vsock_fops = {
+	.owner          = THIS_MODULE,
+	.open           = vhost_vsock_dev_open,
+	.release        = vhost_vsock_dev_release,
+	.llseek		= noop_llseek,
+	.unlocked_ioctl = vhost_vsock_dev_ioctl,
+};
+
+static struct miscdevice vhost_vsock_misc = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "vhost-vsock",
+	.fops = &vhost_vsock_fops,
+};
+
+static struct virtio_transport vhost_transport = {
+	.transport = {
+		.get_local_cid            = vhost_transport_get_local_cid,
+
+		.init                     = virtio_transport_do_socket_init,
+		.destruct                 = virtio_transport_destruct,
+		.release                  = virtio_transport_release,
+		.connect                  = virtio_transport_connect,
+		.shutdown                 = virtio_transport_shutdown,
+
+		.dgram_enqueue            = virtio_transport_dgram_enqueue,
+		.dgram_dequeue            = virtio_transport_dgram_dequeue,
+		.dgram_bind               = virtio_transport_dgram_bind,
+		.dgram_allow              = virtio_transport_dgram_allow,
+
+		.stream_enqueue           = virtio_transport_stream_enqueue,
+		.stream_dequeue           = virtio_transport_stream_dequeue,
+		.stream_has_data          = virtio_transport_stream_has_data,
+		.stream_has_space         = virtio_transport_stream_has_space,
+		.stream_rcvhiwat          = virtio_transport_stream_rcvhiwat,
+		.stream_is_active         = virtio_transport_stream_is_active,
+		.stream_allow             = virtio_transport_stream_allow,
+
+		.notify_poll_in           = virtio_transport_notify_poll_in,
+		.notify_poll_out          = virtio_transport_notify_poll_out,
+		.notify_recv_init         = virtio_transport_notify_recv_init,
+		.notify_recv_pre_block    = virtio_transport_notify_recv_pre_block,
+		.notify_recv_pre_dequeue  = virtio_transport_notify_recv_pre_dequeue,
+		.notify_recv_post_dequeue = virtio_transport_notify_recv_post_dequeue,
+		.notify_send_init         = virtio_transport_notify_send_init,
+		.notify_send_pre_block    = virtio_transport_notify_send_pre_block,
+		.notify_send_pre_enqueue  = virtio_transport_notify_send_pre_enqueue,
+		.notify_send_post_enqueue = virtio_transport_notify_send_post_enqueue,
+
+		.set_buffer_size          = virtio_transport_set_buffer_size,
+		.set_min_buffer_size      = virtio_transport_set_min_buffer_size,
+		.set_max_buffer_size      = virtio_transport_set_max_buffer_size,
+		.get_buffer_size          = virtio_transport_get_buffer_size,
+		.get_min_buffer_size      = virtio_transport_get_min_buffer_size,
+		.get_max_buffer_size      = virtio_transport_get_max_buffer_size,
+	},
+
+	.send_pkt = vhost_transport_send_pkt,
+};
+
+static int __init vhost_vsock_init(void)
+{
+	int ret;
+
+	ret = vsock_core_init(&vhost_transport.transport);
+	if (ret < 0)
+		return ret;
+	return misc_register(&vhost_vsock_misc);
+};
+
+static void __exit vhost_vsock_exit(void)
+{
+	misc_deregister(&vhost_vsock_misc);
+	vsock_core_exit();
+};
+
+module_init(vhost_vsock_init);
+module_exit(vhost_vsock_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Asias He");
+MODULE_DESCRIPTION("vhost transport for vsock ");
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 61a8777..c4400b2 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -175,4 +175,9 @@ struct vhost_scsi_target {
 #define VHOST_SCSI_SET_EVENTS_MISSED _IOW(VHOST_VIRTIO, 0x43, __u32)
 #define VHOST_SCSI_GET_EVENTS_MISSED _IOW(VHOST_VIRTIO, 0x44, __u32)
 
+/* VHOST_VSOCK specific defines */
+
+#define VHOST_VSOCK_SET_GUEST_CID	_IOW(VHOST_VIRTIO, 0x60, __u64)
+#define VHOST_VSOCK_SET_RUNNING		_IOW(VHOST_VIRTIO, 0x61, int)
+
 #endif
-- 
2.7.4

^ permalink raw reply related

* [RFC v6 4/6] VSOCK: Introduce virtio_transport.ko
From: Stefan Hajnoczi @ 2016-07-28 14:36 UTC (permalink / raw)
  To: kvm
  Cc: marius vlad, Stefan Hajnoczi, Michael S. Tsirkin, netdev,
	Ian Campbell, Claudio Imbrenda, ggarcia, Matt Benjamin, Asias He,
	Greg Kurz, virtualization, Christoffer Dall
In-Reply-To: <1469716595-13591-1-git-send-email-stefanha@redhat.com>

From: Asias He <asias@redhat.com>

VM sockets virtio transport implementation.  This driver runs in the
guest.

Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
v6:
 * Start/stop rx depending on reply packet accounting to bound memory
   allocation if the host is not processing rx packets
 * 64-bit CID in packet header to match virtio-vsock specification
 * Use send_pkt_list to defer transmission
 * Avoid leaking virtqueue buffers and pkt lists on shutdown
 * Remove total_tx_buf accounting, it is ineffective because control
   packets are not included
v5:
 * Add transport reset event handling
 * Drop ctrl virtqueue
v4:
 * Add MAINTAINERS file entry
 * Drop short/long rx packets
 * checkpatch.pl cleanups
 * Clarify locking in struct virtio_vsock
 * Narrow local variable scopes as suggested by Alex Bennee
 * Call wake_up() after decrementing total_tx_buf to avoid deadlock
v2:
 * Fix total_tx_buf accounting
 * Add virtio_transport global mutex to prevent races
---
 MAINTAINERS                      |   1 +
 net/vmw_vsock/virtio_transport.c | 624 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 625 insertions(+)
 create mode 100644 net/vmw_vsock/virtio_transport.c

diff --git a/MAINTAINERS b/MAINTAINERS
index b49ffb8..7302663 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12147,6 +12147,7 @@ S:	Maintained
 F:	include/linux/virtio_vsock.h
 F:	include/uapi/linux/virtio_vsock.h
 F:	net/vmw_vsock/virtio_transport_common.c
+F:	net/vmw_vsock/virtio_transport.c
 
 VIRTUAL SERIO DEVICE DRIVER
 M:	Stephen Chandler Paul <thatslyude@gmail.com>
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
new file mode 100644
index 0000000..699dfab
--- /dev/null
+++ b/net/vmw_vsock/virtio_transport.c
@@ -0,0 +1,624 @@
+/*
+ * virtio transport for vsock
+ *
+ * Copyright (C) 2013-2015 Red Hat, Inc.
+ * Author: Asias He <asias@redhat.com>
+ *         Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * Some of the code is take from Gerd Hoffmann <kraxel@redhat.com>'s
+ * early virtio-vsock proof-of-concept bits.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ */
+#include <linux/spinlock.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/atomic.h>
+#include <linux/virtio.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_vsock.h>
+#include <net/sock.h>
+#include <linux/mutex.h>
+#include <net/af_vsock.h>
+
+static struct workqueue_struct *virtio_vsock_workqueue;
+static struct virtio_vsock *the_virtio_vsock;
+static DEFINE_MUTEX(the_virtio_vsock_mutex); /* protects the_virtio_vsock */
+
+struct virtio_vsock {
+	struct virtio_device *vdev;
+	struct virtqueue *vqs[VSOCK_VQ_MAX];
+
+	/* Virtqueue processing is deferred to a workqueue */
+	struct work_struct tx_work;
+	struct work_struct rx_work;
+	struct work_struct event_work;
+
+	/* The following fields are protected by tx_lock.  vqs[VSOCK_VQ_TX]
+	 * must be accessed with tx_lock held.
+	 */
+	struct mutex tx_lock;
+
+	struct work_struct send_pkt_work;
+	spinlock_t send_pkt_list_lock;
+	struct list_head send_pkt_list;
+
+	atomic_t queued_replies;
+
+	/* The following fields are protected by rx_lock.  vqs[VSOCK_VQ_RX]
+	 * must be accessed with rx_lock held.
+	 */
+	struct mutex rx_lock;
+	int rx_buf_nr;
+	int rx_buf_max_nr;
+
+	/* The following fields are protected by event_lock.
+	 * vqs[VSOCK_VQ_EVENT] must be accessed with event_lock held.
+	 */
+	struct mutex event_lock;
+	struct virtio_vsock_event event_list[8];
+
+	u32 guest_cid;
+};
+
+static struct virtio_vsock *virtio_vsock_get(void)
+{
+	return the_virtio_vsock;
+}
+
+static u32 virtio_transport_get_local_cid(void)
+{
+	struct virtio_vsock *vsock = virtio_vsock_get();
+
+	return vsock->guest_cid;
+}
+
+static void
+virtio_transport_send_pkt_work(struct work_struct *work)
+{
+	struct virtio_vsock *vsock =
+		container_of(work, struct virtio_vsock, send_pkt_work);
+	struct virtqueue *vq;
+	bool added = false;
+	bool restart_rx = false;
+
+	mutex_lock(&vsock->tx_lock);
+
+	vq = vsock->vqs[VSOCK_VQ_TX];
+
+	/* Avoid unnecessary interrupts while we're processing the ring */
+	virtqueue_disable_cb(vq);
+
+	for (;;) {
+		struct virtio_vsock_pkt *pkt;
+		struct scatterlist hdr, buf, *sgs[2];
+		int ret, in_sg = 0, out_sg = 0;
+		bool reply;
+
+		spin_lock_bh(&vsock->send_pkt_list_lock);
+		if (list_empty(&vsock->send_pkt_list)) {
+			spin_unlock_bh(&vsock->send_pkt_list_lock);
+			virtqueue_enable_cb(vq);
+			break;
+		}
+
+		pkt = list_first_entry(&vsock->send_pkt_list,
+				       struct virtio_vsock_pkt, list);
+		list_del_init(&pkt->list);
+		spin_unlock_bh(&vsock->send_pkt_list_lock);
+
+		reply = pkt->reply;
+
+		sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
+		sgs[out_sg++] = &hdr;
+		if (pkt->buf) {
+			sg_init_one(&buf, pkt->buf, pkt->len);
+			sgs[out_sg++] = &buf;
+		}
+
+		ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, pkt, GFP_KERNEL);
+		if (ret < 0) {
+			spin_lock_bh(&vsock->send_pkt_list_lock);
+			list_add(&pkt->list, &vsock->send_pkt_list);
+			spin_unlock_bh(&vsock->send_pkt_list_lock);
+
+			if (!virtqueue_enable_cb(vq) && ret == -ENOSPC)
+				continue; /* retry now that we have more space */
+			break;
+		}
+
+		if (reply) {
+			struct virtqueue *rx_vq = vsock->vqs[VSOCK_VQ_RX];
+			int val;
+
+			val = atomic_dec_return(&vsock->queued_replies);
+
+			/* Do we now have resources to resume rx processing? */
+			if (val + 1 == virtqueue_get_vring_size(rx_vq))
+				restart_rx = true;
+		}
+
+		added = true;
+	}
+
+	if (added)
+		virtqueue_kick(vq);
+
+	mutex_unlock(&vsock->tx_lock);
+
+	if (restart_rx)
+		queue_work(virtio_vsock_workqueue, &vsock->rx_work);
+}
+
+static int
+virtio_transport_send_pkt(struct virtio_vsock_pkt *pkt)
+{
+	struct virtio_vsock *vsock;
+	int len = pkt->len;
+
+	vsock = virtio_vsock_get();
+	if (!vsock) {
+		virtio_transport_free_pkt(pkt);
+		return -ENODEV;
+	}
+
+	if (pkt->reply)
+		atomic_inc(&vsock->queued_replies);
+
+	spin_lock_bh(&vsock->send_pkt_list_lock);
+	list_add_tail(&pkt->list, &vsock->send_pkt_list);
+	spin_unlock_bh(&vsock->send_pkt_list_lock);
+
+	queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
+	return len;
+}
+
+static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
+{
+	int buf_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
+	struct virtio_vsock_pkt *pkt;
+	struct scatterlist hdr, buf, *sgs[2];
+	struct virtqueue *vq;
+	int ret;
+
+	vq = vsock->vqs[VSOCK_VQ_RX];
+
+	do {
+		pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
+		if (!pkt)
+			break;
+
+		pkt->buf = kmalloc(buf_len, GFP_KERNEL);
+		if (!pkt->buf) {
+			virtio_transport_free_pkt(pkt);
+			break;
+		}
+
+		pkt->len = buf_len;
+
+		sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
+		sgs[0] = &hdr;
+
+		sg_init_one(&buf, pkt->buf, buf_len);
+		sgs[1] = &buf;
+		ret = virtqueue_add_sgs(vq, sgs, 0, 2, pkt, GFP_KERNEL);
+		if (ret) {
+			virtio_transport_free_pkt(pkt);
+			break;
+		}
+		vsock->rx_buf_nr++;
+	} while (vq->num_free);
+	if (vsock->rx_buf_nr > vsock->rx_buf_max_nr)
+		vsock->rx_buf_max_nr = vsock->rx_buf_nr;
+	virtqueue_kick(vq);
+}
+
+static void virtio_transport_tx_work(struct work_struct *work)
+{
+	struct virtio_vsock *vsock =
+		container_of(work, struct virtio_vsock, tx_work);
+	struct virtqueue *vq;
+	bool added = false;
+
+	vq = vsock->vqs[VSOCK_VQ_TX];
+	mutex_lock(&vsock->tx_lock);
+	do {
+		struct virtio_vsock_pkt *pkt;
+		unsigned int len;
+
+		virtqueue_disable_cb(vq);
+		while ((pkt = virtqueue_get_buf(vq, &len)) != NULL) {
+			virtio_transport_free_pkt(pkt);
+			added = true;
+		}
+	} while (!virtqueue_enable_cb(vq));
+	mutex_unlock(&vsock->tx_lock);
+
+	if (added)
+		queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
+}
+
+/* Is there space left for replies to rx packets? */
+static bool virtio_transport_more_replies(struct virtio_vsock *vsock)
+{
+	struct virtqueue *vq = vsock->vqs[VSOCK_VQ_RX];
+	int val;
+
+	smp_rmb(); /* paired with atomic_inc() and atomic_dec_return() */
+	val = atomic_read(&vsock->queued_replies);
+
+	return val < virtqueue_get_vring_size(vq);
+}
+
+static void virtio_transport_rx_work(struct work_struct *work)
+{
+	struct virtio_vsock *vsock =
+		container_of(work, struct virtio_vsock, rx_work);
+	struct virtqueue *vq;
+
+	vq = vsock->vqs[VSOCK_VQ_RX];
+
+	mutex_lock(&vsock->rx_lock);
+
+	do {
+		virtqueue_disable_cb(vq);
+		for (;;) {
+			struct virtio_vsock_pkt *pkt;
+			unsigned int len;
+
+			if (!virtio_transport_more_replies(vsock)) {
+				/* Stop rx until the device processes already
+				 * pending replies.  Leave rx virtqueue
+				 * callbacks disabled.
+				 */
+				goto out;
+			}
+
+			pkt = virtqueue_get_buf(vq, &len);
+			if (!pkt) {
+				break;
+			}
+
+			vsock->rx_buf_nr--;
+
+			/* Drop short/long packets */
+			if (unlikely(len < sizeof(pkt->hdr) ||
+				     len > sizeof(pkt->hdr) + pkt->len)) {
+				virtio_transport_free_pkt(pkt);
+				continue;
+			}
+
+			pkt->len = len - sizeof(pkt->hdr);
+			virtio_transport_recv_pkt(pkt);
+		}
+	} while (!virtqueue_enable_cb(vq));
+
+out:
+	if (vsock->rx_buf_nr < vsock->rx_buf_max_nr / 2)
+		virtio_vsock_rx_fill(vsock);
+	mutex_unlock(&vsock->rx_lock);
+}
+
+/* event_lock must be held */
+static int virtio_vsock_event_fill_one(struct virtio_vsock *vsock,
+				       struct virtio_vsock_event *event)
+{
+	struct scatterlist sg;
+	struct virtqueue *vq;
+
+	vq = vsock->vqs[VSOCK_VQ_EVENT];
+
+	sg_init_one(&sg, event, sizeof(*event));
+
+	return virtqueue_add_inbuf(vq, &sg, 1, event, GFP_KERNEL);
+}
+
+/* event_lock must be held */
+static void virtio_vsock_event_fill(struct virtio_vsock *vsock)
+{
+	size_t i;
+
+	for (i = 0; i < ARRAY_SIZE(vsock->event_list); i++) {
+		struct virtio_vsock_event *event = &vsock->event_list[i];
+
+		virtio_vsock_event_fill_one(vsock, event);
+	}
+
+	virtqueue_kick(vsock->vqs[VSOCK_VQ_EVENT]);
+}
+
+static void virtio_vsock_reset_sock(struct sock *sk)
+{
+	lock_sock(sk);
+	sk->sk_state = SS_UNCONNECTED;
+	sk->sk_err = ECONNRESET;
+	sk->sk_error_report(sk);
+	release_sock(sk);
+}
+
+static void virtio_vsock_update_guest_cid(struct virtio_vsock *vsock)
+{
+	struct virtio_device *vdev = vsock->vdev;
+	u64 guest_cid;
+
+	vdev->config->get(vdev, offsetof(struct virtio_vsock_config, guest_cid),
+			  &guest_cid, sizeof(guest_cid));
+	vsock->guest_cid = le64_to_cpu(guest_cid);
+}
+
+/* event_lock must be held */
+static void virtio_vsock_event_handle(struct virtio_vsock *vsock,
+				      struct virtio_vsock_event *event)
+{
+	switch (le32_to_cpu(event->id)) {
+	case VIRTIO_VSOCK_EVENT_TRANSPORT_RESET:
+		virtio_vsock_update_guest_cid(vsock);
+		vsock_for_each_connected_socket(virtio_vsock_reset_sock);
+		break;
+	}
+}
+
+static void virtio_transport_event_work(struct work_struct *work)
+{
+	struct virtio_vsock *vsock =
+		container_of(work, struct virtio_vsock, event_work);
+	struct virtqueue *vq;
+
+	vq = vsock->vqs[VSOCK_VQ_EVENT];
+
+	mutex_lock(&vsock->event_lock);
+
+	do {
+		struct virtio_vsock_event *event;
+		unsigned int len;
+
+		virtqueue_disable_cb(vq);
+		while ((event = virtqueue_get_buf(vq, &len)) != NULL) {
+			if (len == sizeof(*event))
+				virtio_vsock_event_handle(vsock, event);
+
+			virtio_vsock_event_fill_one(vsock, event);
+		}
+	} while (!virtqueue_enable_cb(vq));
+
+	virtqueue_kick(vsock->vqs[VSOCK_VQ_EVENT]);
+
+	mutex_unlock(&vsock->event_lock);
+}
+
+static void virtio_vsock_event_done(struct virtqueue *vq)
+{
+	struct virtio_vsock *vsock = vq->vdev->priv;
+
+	if (!vsock)
+		return;
+	queue_work(virtio_vsock_workqueue, &vsock->event_work);
+}
+
+static void virtio_vsock_tx_done(struct virtqueue *vq)
+{
+	struct virtio_vsock *vsock = vq->vdev->priv;
+
+	if (!vsock)
+		return;
+	queue_work(virtio_vsock_workqueue, &vsock->tx_work);
+}
+
+static void virtio_vsock_rx_done(struct virtqueue *vq)
+{
+	struct virtio_vsock *vsock = vq->vdev->priv;
+
+	if (!vsock)
+		return;
+	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
+}
+
+static struct virtio_transport virtio_transport = {
+	.transport = {
+		.get_local_cid            = virtio_transport_get_local_cid,
+
+		.init                     = virtio_transport_do_socket_init,
+		.destruct                 = virtio_transport_destruct,
+		.release                  = virtio_transport_release,
+		.connect                  = virtio_transport_connect,
+		.shutdown                 = virtio_transport_shutdown,
+
+		.dgram_bind               = virtio_transport_dgram_bind,
+		.dgram_dequeue            = virtio_transport_dgram_dequeue,
+		.dgram_enqueue            = virtio_transport_dgram_enqueue,
+		.dgram_allow              = virtio_transport_dgram_allow,
+
+		.stream_dequeue           = virtio_transport_stream_dequeue,
+		.stream_enqueue           = virtio_transport_stream_enqueue,
+		.stream_has_data          = virtio_transport_stream_has_data,
+		.stream_has_space         = virtio_transport_stream_has_space,
+		.stream_rcvhiwat          = virtio_transport_stream_rcvhiwat,
+		.stream_is_active         = virtio_transport_stream_is_active,
+		.stream_allow             = virtio_transport_stream_allow,
+
+		.notify_poll_in           = virtio_transport_notify_poll_in,
+		.notify_poll_out          = virtio_transport_notify_poll_out,
+		.notify_recv_init         = virtio_transport_notify_recv_init,
+		.notify_recv_pre_block    = virtio_transport_notify_recv_pre_block,
+		.notify_recv_pre_dequeue  = virtio_transport_notify_recv_pre_dequeue,
+		.notify_recv_post_dequeue = virtio_transport_notify_recv_post_dequeue,
+		.notify_send_init         = virtio_transport_notify_send_init,
+		.notify_send_pre_block    = virtio_transport_notify_send_pre_block,
+		.notify_send_pre_enqueue  = virtio_transport_notify_send_pre_enqueue,
+		.notify_send_post_enqueue = virtio_transport_notify_send_post_enqueue,
+
+		.set_buffer_size          = virtio_transport_set_buffer_size,
+		.set_min_buffer_size      = virtio_transport_set_min_buffer_size,
+		.set_max_buffer_size      = virtio_transport_set_max_buffer_size,
+		.get_buffer_size          = virtio_transport_get_buffer_size,
+		.get_min_buffer_size      = virtio_transport_get_min_buffer_size,
+		.get_max_buffer_size      = virtio_transport_get_max_buffer_size,
+	},
+
+	.send_pkt = virtio_transport_send_pkt,
+};
+
+static int virtio_vsock_probe(struct virtio_device *vdev)
+{
+	vq_callback_t *callbacks[] = {
+		virtio_vsock_rx_done,
+		virtio_vsock_tx_done,
+		virtio_vsock_event_done,
+	};
+	static const char * const names[] = {
+		"rx",
+		"tx",
+		"event",
+	};
+	struct virtio_vsock *vsock = NULL;
+	int ret;
+
+	ret = mutex_lock_interruptible(&the_virtio_vsock_mutex);
+	if (ret)
+		return ret;
+
+	/* Only one virtio-vsock device per guest is supported */
+	if (the_virtio_vsock) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	vsock = kzalloc(sizeof(*vsock), GFP_KERNEL);
+	if (!vsock) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	vsock->vdev = vdev;
+
+	ret = vsock->vdev->config->find_vqs(vsock->vdev, VSOCK_VQ_MAX,
+					    vsock->vqs, callbacks, names);
+	if (ret < 0)
+		goto out;
+
+	virtio_vsock_update_guest_cid(vsock);
+
+	ret = vsock_core_init(&virtio_transport.transport);
+	if (ret < 0)
+		goto out_vqs;
+
+	vsock->rx_buf_nr = 0;
+	vsock->rx_buf_max_nr = 0;
+	atomic_set(&vsock->queued_replies, 0);
+
+	vdev->priv = vsock;
+	the_virtio_vsock = vsock;
+	mutex_init(&vsock->tx_lock);
+	mutex_init(&vsock->rx_lock);
+	mutex_init(&vsock->event_lock);
+	spin_lock_init(&vsock->send_pkt_list_lock);
+	INIT_LIST_HEAD(&vsock->send_pkt_list);
+	INIT_WORK(&vsock->rx_work, virtio_transport_rx_work);
+	INIT_WORK(&vsock->tx_work, virtio_transport_tx_work);
+	INIT_WORK(&vsock->event_work, virtio_transport_event_work);
+	INIT_WORK(&vsock->send_pkt_work, virtio_transport_send_pkt_work);
+
+	mutex_lock(&vsock->rx_lock);
+	virtio_vsock_rx_fill(vsock);
+	mutex_unlock(&vsock->rx_lock);
+
+	mutex_lock(&vsock->event_lock);
+	virtio_vsock_event_fill(vsock);
+	mutex_unlock(&vsock->event_lock);
+
+	mutex_unlock(&the_virtio_vsock_mutex);
+	return 0;
+
+out_vqs:
+	vsock->vdev->config->del_vqs(vsock->vdev);
+out:
+	kfree(vsock);
+	mutex_unlock(&the_virtio_vsock_mutex);
+	return ret;
+}
+
+static void virtio_vsock_remove(struct virtio_device *vdev)
+{
+	struct virtio_vsock *vsock = vdev->priv;
+	struct virtio_vsock_pkt *pkt;
+
+	flush_work(&vsock->rx_work);
+	flush_work(&vsock->tx_work);
+	flush_work(&vsock->event_work);
+	flush_work(&vsock->send_pkt_work);
+
+	vdev->config->reset(vdev);
+
+	mutex_lock(&vsock->rx_lock);
+	while ((pkt = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_RX])))
+		virtio_transport_free_pkt(pkt);
+	mutex_unlock(&vsock->rx_lock);
+
+	mutex_lock(&vsock->tx_lock);
+	while ((pkt = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_TX])))
+		virtio_transport_free_pkt(pkt);
+	mutex_unlock(&vsock->tx_lock);
+
+	spin_lock_bh(&vsock->send_pkt_list_lock);
+	while (!list_empty(&vsock->send_pkt_list)) {
+		pkt = list_first_entry(&vsock->send_pkt_list,
+				       struct virtio_vsock_pkt, list);
+		list_del(&pkt->list);
+		virtio_transport_free_pkt(pkt);
+	}
+	spin_unlock_bh(&vsock->send_pkt_list_lock);
+
+	mutex_lock(&the_virtio_vsock_mutex);
+	the_virtio_vsock = NULL;
+	vsock_core_exit();
+	mutex_unlock(&the_virtio_vsock_mutex);
+
+	vdev->config->del_vqs(vdev);
+
+	kfree(vsock);
+}
+
+static struct virtio_device_id id_table[] = {
+	{ VIRTIO_ID_VSOCK, VIRTIO_DEV_ANY_ID },
+	{ 0 },
+};
+
+static unsigned int features[] = {
+};
+
+static struct virtio_driver virtio_vsock_driver = {
+	.feature_table = features,
+	.feature_table_size = ARRAY_SIZE(features),
+	.driver.name = KBUILD_MODNAME,
+	.driver.owner = THIS_MODULE,
+	.id_table = id_table,
+	.probe = virtio_vsock_probe,
+	.remove = virtio_vsock_remove,
+};
+
+static int __init virtio_vsock_init(void)
+{
+	int ret;
+
+	virtio_vsock_workqueue = alloc_workqueue("virtio_vsock", 0, 0);
+	if (!virtio_vsock_workqueue)
+		return -ENOMEM;
+	ret = register_virtio_driver(&virtio_vsock_driver);
+	if (ret)
+		destroy_workqueue(virtio_vsock_workqueue);
+	return ret;
+}
+
+static void __exit virtio_vsock_exit(void)
+{
+	unregister_virtio_driver(&virtio_vsock_driver);
+	destroy_workqueue(virtio_vsock_workqueue);
+}
+
+module_init(virtio_vsock_init);
+module_exit(virtio_vsock_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Asias He");
+MODULE_DESCRIPTION("virtio transport for vsock");
+MODULE_DEVICE_TABLE(virtio, id_table);
-- 
2.7.4

^ permalink raw reply related

* [RFC v6 3/6] VSOCK: Introduce virtio_vsock_common.ko
From: Stefan Hajnoczi @ 2016-07-28 14:36 UTC (permalink / raw)
  To: kvm
  Cc: marius vlad, Stefan Hajnoczi, Michael S. Tsirkin, netdev,
	Ian Campbell, Claudio Imbrenda, ggarcia, Matt Benjamin, Asias He,
	Greg Kurz, virtualization, Christoffer Dall
In-Reply-To: <1469716595-13591-1-git-send-email-stefanha@redhat.com>

From: Asias He <asias@redhat.com>

This module contains the common code and header files for the following
virtio_transporto and vhost_vsock kernel modules.

Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
v6:
 * Add graceful shutdown to avoid port reuse while peer is still closing
   socket [Ian Campbell]
 * Use spinlocks instead of mutexes for tx_lock/rx_lock because they are
   used in sections that are not allowed to sleep. [Claudio]
 * Used nested lock notation in cases where child socket is held
   [Claudio]
 * 64-bit CIDs in virtio_vsock.h to match virtio-vsock specification
 * Avoid memcpy_to_msg() in atomic region
 * Call sk_write_space() with sk_lock held
 * Move duplicated send_pkt() logic from vhost and virtio transports
   into virtio_transport_common.ko
 * Add __attribute__((packed)) to avoid struct padding for ABI structs
   [Gerard Garcia]
v5:
 * Add event virtqueue, struct virtio_vsock_event, and transport reset
   event
 * Reorder virtqueue indices: rx, tx, event
 * Drop unused virtqueue_pairs config field
 * Drop unused ctrl virtqueue
 * Switch to a free virtio device ID, the previous one was reserved
v4:
 * Add MAINTAINERS file entry
 * checkpatch.pl cleanups
 * linux_vsock.h: drop wrong copy-pasted license header
 * Move tx sock refcounting to virtio_transport_alloc/free_pkt() to fix
   leaks in error paths
 * Add send_pkt_no_sock() to send RST packets with no listen socket
 * Rename per-socket state from virtio_transport to virtio_vsock_sock
 * Move send_pkt_ops to new virtio_transport struct
 * Drop dumppkt function, packet capture will be added in the future
 * Drop empty virtio_transport_dec_tx_pkt()
 * Allow guest->host connections again
 * Use trace events instead of pr_debug()
v3:
 * Remove unnecessary 3-way handshake, just do REQUEST/RESPONSE instead
   of REQUEST/RESPONSE/ACK
 * Remove SOCK_DGRAM support and focus on SOCK_STREAM first
 * Only allow host->guest connections (same security model as latest
   VMware)
v2:
 * Fix peer_buf_alloc inheritance on child socket
 * Notify other side of SOCK_STREAM disconnect (fixes shutdown
   semantics)
 * Avoid recursive mutex_lock(tx_lock) for write_space (fixes deadlock)
 * Define VIRTIO_VSOCK_TYPE_STREAM/DGRAM hardware interface constants
 * Define VIRTIO_VSOCK_SHUTDOWN_RCV/SEND hardware interface constants
---
 MAINTAINERS                                        |  10 +
 include/linux/virtio_vsock.h                       | 154 ++++
 include/net/af_vsock.h                             |   2 +
 .../trace/events/vsock_virtio_transport_common.h   | 144 +++
 include/uapi/linux/Kbuild                          |   1 +
 include/uapi/linux/virtio_ids.h                    |   1 +
 include/uapi/linux/virtio_vsock.h                  |  94 ++
 net/vmw_vsock/virtio_transport_common.c            | 992 +++++++++++++++++++++
 8 files changed, 1398 insertions(+)
 create mode 100644 include/linux/virtio_vsock.h
 create mode 100644 include/trace/events/vsock_virtio_transport_common.h
 create mode 100644 include/uapi/linux/virtio_vsock.h
 create mode 100644 net/vmw_vsock/virtio_transport_common.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 8c20323..b49ffb8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12138,6 +12138,16 @@ S:	Maintained
 F:	drivers/media/v4l2-core/videobuf2-*
 F:	include/media/videobuf2-*
 
+VIRTIO AND VHOST VSOCK DRIVER
+M:	Stefan Hajnoczi <stefanha@redhat.com>
+L:	kvm@vger.kernel.org
+L:	virtualization@lists.linux-foundation.org
+L:	netdev@vger.kernel.org
+S:	Maintained
+F:	include/linux/virtio_vsock.h
+F:	include/uapi/linux/virtio_vsock.h
+F:	net/vmw_vsock/virtio_transport_common.c
+
 VIRTUAL SERIO DEVICE DRIVER
 M:	Stephen Chandler Paul <thatslyude@gmail.com>
 S:	Maintained
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
new file mode 100644
index 0000000..9638bfe
--- /dev/null
+++ b/include/linux/virtio_vsock.h
@@ -0,0 +1,154 @@
+#ifndef _LINUX_VIRTIO_VSOCK_H
+#define _LINUX_VIRTIO_VSOCK_H
+
+#include <uapi/linux/virtio_vsock.h>
+#include <linux/socket.h>
+#include <net/sock.h>
+#include <net/af_vsock.h>
+
+#define VIRTIO_VSOCK_DEFAULT_MIN_BUF_SIZE	128
+#define VIRTIO_VSOCK_DEFAULT_BUF_SIZE		(1024 * 256)
+#define VIRTIO_VSOCK_DEFAULT_MAX_BUF_SIZE	(1024 * 256)
+#define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE	(1024 * 4)
+#define VIRTIO_VSOCK_MAX_BUF_SIZE		0xFFFFFFFFUL
+#define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE		(1024 * 64)
+
+enum {
+	VSOCK_VQ_RX     = 0, /* for host to guest data */
+	VSOCK_VQ_TX     = 1, /* for guest to host data */
+	VSOCK_VQ_EVENT  = 2,
+	VSOCK_VQ_MAX    = 3,
+};
+
+/* Per-socket state (accessed via vsk->trans) */
+struct virtio_vsock_sock {
+	struct vsock_sock *vsk;
+
+	/* Protected by lock_sock(sk_vsock(trans->vsk)) */
+	u32 buf_size;
+	u32 buf_size_min;
+	u32 buf_size_max;
+
+	spinlock_t tx_lock;
+	spinlock_t rx_lock;
+
+	/* Protected by tx_lock */
+	u32 tx_cnt;
+	u32 buf_alloc;
+	u32 peer_fwd_cnt;
+	u32 peer_buf_alloc;
+
+	/* Protected by rx_lock */
+	u32 fwd_cnt;
+	u32 rx_bytes;
+	struct list_head rx_queue;
+};
+
+struct virtio_vsock_pkt {
+	struct virtio_vsock_hdr	hdr;
+	struct work_struct work;
+	struct list_head list;
+	void *buf;
+	u32 len;
+	u32 off;
+	bool reply;
+};
+
+struct virtio_vsock_pkt_info {
+	u32 remote_cid, remote_port;
+	struct msghdr *msg;
+	u32 pkt_len;
+	u16 type;
+	u16 op;
+	u32 flags;
+	bool reply;
+};
+
+struct virtio_transport {
+	/* This must be the first field */
+	struct vsock_transport transport;
+
+	/* Takes ownership of the packet */
+	int (*send_pkt)(struct virtio_vsock_pkt *pkt);
+};
+
+ssize_t
+virtio_transport_stream_dequeue(struct vsock_sock *vsk,
+				struct msghdr *msg,
+				size_t len,
+				int type);
+int
+virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
+			       struct msghdr *msg,
+			       size_t len, int flags);
+
+s64 virtio_transport_stream_has_data(struct vsock_sock *vsk);
+s64 virtio_transport_stream_has_space(struct vsock_sock *vsk);
+
+int virtio_transport_do_socket_init(struct vsock_sock *vsk,
+				 struct vsock_sock *psk);
+u64 virtio_transport_get_buffer_size(struct vsock_sock *vsk);
+u64 virtio_transport_get_min_buffer_size(struct vsock_sock *vsk);
+u64 virtio_transport_get_max_buffer_size(struct vsock_sock *vsk);
+void virtio_transport_set_buffer_size(struct vsock_sock *vsk, u64 val);
+void virtio_transport_set_min_buffer_size(struct vsock_sock *vsk, u64 val);
+void virtio_transport_set_max_buffer_size(struct vsock_sock *vs, u64 val);
+int
+virtio_transport_notify_poll_in(struct vsock_sock *vsk,
+				size_t target,
+				bool *data_ready_now);
+int
+virtio_transport_notify_poll_out(struct vsock_sock *vsk,
+				 size_t target,
+				 bool *space_available_now);
+
+int virtio_transport_notify_recv_init(struct vsock_sock *vsk,
+	size_t target, struct vsock_transport_recv_notify_data *data);
+int virtio_transport_notify_recv_pre_block(struct vsock_sock *vsk,
+	size_t target, struct vsock_transport_recv_notify_data *data);
+int virtio_transport_notify_recv_pre_dequeue(struct vsock_sock *vsk,
+	size_t target, struct vsock_transport_recv_notify_data *data);
+int virtio_transport_notify_recv_post_dequeue(struct vsock_sock *vsk,
+	size_t target, ssize_t copied, bool data_read,
+	struct vsock_transport_recv_notify_data *data);
+int virtio_transport_notify_send_init(struct vsock_sock *vsk,
+	struct vsock_transport_send_notify_data *data);
+int virtio_transport_notify_send_pre_block(struct vsock_sock *vsk,
+	struct vsock_transport_send_notify_data *data);
+int virtio_transport_notify_send_pre_enqueue(struct vsock_sock *vsk,
+	struct vsock_transport_send_notify_data *data);
+int virtio_transport_notify_send_post_enqueue(struct vsock_sock *vsk,
+	ssize_t written, struct vsock_transport_send_notify_data *data);
+
+u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk);
+bool virtio_transport_stream_is_active(struct vsock_sock *vsk);
+bool virtio_transport_stream_allow(u32 cid, u32 port);
+int virtio_transport_dgram_bind(struct vsock_sock *vsk,
+				struct sockaddr_vm *addr);
+bool virtio_transport_dgram_allow(u32 cid, u32 port);
+
+int virtio_transport_connect(struct vsock_sock *vsk);
+
+int virtio_transport_shutdown(struct vsock_sock *vsk, int mode);
+
+void virtio_transport_release(struct vsock_sock *vsk);
+
+ssize_t
+virtio_transport_stream_enqueue(struct vsock_sock *vsk,
+				struct msghdr *msg,
+				size_t len);
+int
+virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
+			       struct sockaddr_vm *remote_addr,
+			       struct msghdr *msg,
+			       size_t len);
+
+void virtio_transport_destruct(struct vsock_sock *vsk);
+
+void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt);
+void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt);
+void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct virtio_vsock_pkt *pkt);
+u32 virtio_transport_get_credit(struct virtio_vsock_sock *vvs, u32 wanted);
+void virtio_transport_put_credit(struct virtio_vsock_sock *vvs, u32 credit);
+
+#endif /* _LINUX_VIRTIO_VSOCK_H */
diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 3af0b22..f275896 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -63,6 +63,8 @@ struct vsock_sock {
 	struct list_head accept_queue;
 	bool rejected;
 	struct delayed_work dwork;
+	struct delayed_work close_work;
+	bool close_work_scheduled;
 	u32 peer_shutdown;
 	bool sent_request;
 	bool ignore_connecting_rst;
diff --git a/include/trace/events/vsock_virtio_transport_common.h b/include/trace/events/vsock_virtio_transport_common.h
new file mode 100644
index 0000000..b7f1d62
--- /dev/null
+++ b/include/trace/events/vsock_virtio_transport_common.h
@@ -0,0 +1,144 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vsock
+
+#if !defined(_TRACE_VSOCK_VIRTIO_TRANSPORT_COMMON_H) || \
+    defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VSOCK_VIRTIO_TRANSPORT_COMMON_H
+
+#include <linux/tracepoint.h>
+
+TRACE_DEFINE_ENUM(VIRTIO_VSOCK_TYPE_STREAM);
+
+#define show_type(val) \
+	__print_symbolic(val, { VIRTIO_VSOCK_TYPE_STREAM, "STREAM" })
+
+TRACE_DEFINE_ENUM(VIRTIO_VSOCK_OP_INVALID);
+TRACE_DEFINE_ENUM(VIRTIO_VSOCK_OP_REQUEST);
+TRACE_DEFINE_ENUM(VIRTIO_VSOCK_OP_RESPONSE);
+TRACE_DEFINE_ENUM(VIRTIO_VSOCK_OP_RST);
+TRACE_DEFINE_ENUM(VIRTIO_VSOCK_OP_SHUTDOWN);
+TRACE_DEFINE_ENUM(VIRTIO_VSOCK_OP_RW);
+TRACE_DEFINE_ENUM(VIRTIO_VSOCK_OP_CREDIT_UPDATE);
+TRACE_DEFINE_ENUM(VIRTIO_VSOCK_OP_CREDIT_REQUEST);
+
+#define show_op(val) \
+	__print_symbolic(val, \
+			 { VIRTIO_VSOCK_OP_INVALID, "INVALID" }, \
+			 { VIRTIO_VSOCK_OP_REQUEST, "REQUEST" }, \
+			 { VIRTIO_VSOCK_OP_RESPONSE, "RESPONSE" }, \
+			 { VIRTIO_VSOCK_OP_RST, "RST" }, \
+			 { VIRTIO_VSOCK_OP_SHUTDOWN, "SHUTDOWN" }, \
+			 { VIRTIO_VSOCK_OP_RW, "RW" }, \
+			 { VIRTIO_VSOCK_OP_CREDIT_UPDATE, "CREDIT_UPDATE" }, \
+			 { VIRTIO_VSOCK_OP_CREDIT_REQUEST, "CREDIT_REQUEST" })
+
+TRACE_EVENT(virtio_transport_alloc_pkt,
+	TP_PROTO(
+		 __u32 src_cid, __u32 src_port,
+		 __u32 dst_cid, __u32 dst_port,
+		 __u32 len,
+		 __u16 type,
+		 __u16 op,
+		 __u32 flags
+	),
+	TP_ARGS(
+		src_cid, src_port,
+		dst_cid, dst_port,
+		len,
+		type,
+		op,
+		flags
+	),
+	TP_STRUCT__entry(
+		__field(__u32, src_cid)
+		__field(__u32, src_port)
+		__field(__u32, dst_cid)
+		__field(__u32, dst_port)
+		__field(__u32, len)
+		__field(__u16, type)
+		__field(__u16, op)
+		__field(__u32, flags)
+	),
+	TP_fast_assign(
+		__entry->src_cid = src_cid;
+		__entry->src_port = src_port;
+		__entry->dst_cid = dst_cid;
+		__entry->dst_port = dst_port;
+		__entry->len = len;
+		__entry->type = type;
+		__entry->op = op;
+		__entry->flags = flags;
+	),
+	TP_printk("%u:%u -> %u:%u len=%u type=%s op=%s flags=%#x",
+		  __entry->src_cid, __entry->src_port,
+		  __entry->dst_cid, __entry->dst_port,
+		  __entry->len,
+		  show_type(__entry->type),
+		  show_op(__entry->op),
+		  __entry->flags)
+);
+
+TRACE_EVENT(virtio_transport_recv_pkt,
+	TP_PROTO(
+		 __u32 src_cid, __u32 src_port,
+		 __u32 dst_cid, __u32 dst_port,
+		 __u32 len,
+		 __u16 type,
+		 __u16 op,
+		 __u32 flags,
+		 __u32 buf_alloc,
+		 __u32 fwd_cnt
+	),
+	TP_ARGS(
+		src_cid, src_port,
+		dst_cid, dst_port,
+		len,
+		type,
+		op,
+		flags,
+		buf_alloc,
+		fwd_cnt
+	),
+	TP_STRUCT__entry(
+		__field(__u32, src_cid)
+		__field(__u32, src_port)
+		__field(__u32, dst_cid)
+		__field(__u32, dst_port)
+		__field(__u32, len)
+		__field(__u16, type)
+		__field(__u16, op)
+		__field(__u32, flags)
+		__field(__u32, buf_alloc)
+		__field(__u32, fwd_cnt)
+	),
+	TP_fast_assign(
+		__entry->src_cid = src_cid;
+		__entry->src_port = src_port;
+		__entry->dst_cid = dst_cid;
+		__entry->dst_port = dst_port;
+		__entry->len = len;
+		__entry->type = type;
+		__entry->op = op;
+		__entry->flags = flags;
+		__entry->buf_alloc = buf_alloc;
+		__entry->fwd_cnt = fwd_cnt;
+	),
+	TP_printk("%u:%u -> %u:%u len=%u type=%s op=%s flags=%#x "
+		  "buf_alloc=%u fwd_cnt=%u",
+		  __entry->src_cid, __entry->src_port,
+		  __entry->dst_cid, __entry->dst_port,
+		  __entry->len,
+		  show_type(__entry->type),
+		  show_op(__entry->op),
+		  __entry->flags,
+		  __entry->buf_alloc,
+		  __entry->fwd_cnt)
+);
+
+#endif /* _TRACE_VSOCK_VIRTIO_TRANSPORT_COMMON_H */
+
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE vsock_virtio_transport_common
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index ec10cfe..3cf0116 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -453,6 +453,7 @@ header-y += virtio_ring.h
 header-y += virtio_rng.h
 header-y += virtio_scsi.h
 header-y += virtio_types.h
+header-y += virtio_vsock.h
 header-y += vm_sockets.h
 header-y += vt.h
 header-y += wait.h
diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
index 77925f5..3228d58 100644
--- a/include/uapi/linux/virtio_ids.h
+++ b/include/uapi/linux/virtio_ids.h
@@ -41,5 +41,6 @@
 #define VIRTIO_ID_CAIF	       12 /* Virtio caif */
 #define VIRTIO_ID_GPU          16 /* virtio GPU */
 #define VIRTIO_ID_INPUT        18 /* virtio input */
+#define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
 
 #endif /* _LINUX_VIRTIO_IDS_H */
diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
new file mode 100644
index 0000000..6b011c1
--- /dev/null
+++ b/include/uapi/linux/virtio_vsock.h
@@ -0,0 +1,94 @@
+/*
+ * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
+ * anyone can use the definitions to implement compatible drivers/servers:
+ *
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of IBM nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS''
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * Copyright (C) Red Hat, Inc., 2013-2015
+ * Copyright (C) Asias He <asias@redhat.com>, 2013
+ * Copyright (C) Stefan Hajnoczi <stefanha@redhat.com>, 2015
+ */
+
+#ifndef _UAPI_LINUX_VIRTIO_VSOCK_H
+#define _UAPI_LINUX_VIRTIO_VOSCK_H
+
+#include <linux/types.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_config.h>
+
+struct virtio_vsock_config {
+	__le64 guest_cid;
+} __attribute__((packed));
+
+enum virtio_vsock_event_id {
+	VIRTIO_VSOCK_EVENT_TRANSPORT_RESET = 0,
+};
+
+struct virtio_vsock_event {
+	__le32 id;
+} __attribute__((packed));
+
+struct virtio_vsock_hdr {
+	__le64	src_cid;
+	__le64	dst_cid;
+	__le32	src_port;
+	__le32	dst_port;
+	__le32	len;
+	__le16	type;		/* enum virtio_vsock_type */
+	__le16	op;		/* enum virtio_vsock_op */
+	__le32	flags;
+	__le32	buf_alloc;
+	__le32	fwd_cnt;
+} __attribute__((packed));
+
+enum virtio_vsock_type {
+	VIRTIO_VSOCK_TYPE_STREAM = 1,
+};
+
+enum virtio_vsock_op {
+	VIRTIO_VSOCK_OP_INVALID = 0,
+
+	/* Connect operations */
+	VIRTIO_VSOCK_OP_REQUEST = 1,
+	VIRTIO_VSOCK_OP_RESPONSE = 2,
+	VIRTIO_VSOCK_OP_RST = 3,
+	VIRTIO_VSOCK_OP_SHUTDOWN = 4,
+
+	/* To send payload */
+	VIRTIO_VSOCK_OP_RW = 5,
+
+	/* Tell the peer our credit info */
+	VIRTIO_VSOCK_OP_CREDIT_UPDATE = 6,
+	/* Request the peer to send the credit info to us */
+	VIRTIO_VSOCK_OP_CREDIT_REQUEST = 7,
+};
+
+/* VIRTIO_VSOCK_OP_SHUTDOWN flags values */
+enum virtio_vsock_shutdown {
+	VIRTIO_VSOCK_SHUTDOWN_RCV = 1,
+	VIRTIO_VSOCK_SHUTDOWN_SEND = 2,
+};
+
+#endif /* _UAPI_LINUX_VIRTIO_VSOCK_H */
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
new file mode 100644
index 0000000..a53b3a1
--- /dev/null
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -0,0 +1,992 @@
+/*
+ * common code for virtio vsock
+ *
+ * Copyright (C) 2013-2015 Red Hat, Inc.
+ * Author: Asias He <asias@redhat.com>
+ *         Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ */
+#include <linux/spinlock.h>
+#include <linux/module.h>
+#include <linux/ctype.h>
+#include <linux/list.h>
+#include <linux/virtio.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_vsock.h>
+
+#include <net/sock.h>
+#include <net/af_vsock.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/vsock_virtio_transport_common.h>
+
+/* How long to wait for graceful shutdown of a connection */
+#define VSOCK_CLOSE_TIMEOUT (8 * HZ)
+
+static const struct virtio_transport *virtio_transport_get_ops(void)
+{
+	const struct vsock_transport *t = vsock_core_get_transport();
+
+	return container_of(t, struct virtio_transport, transport);
+}
+
+struct virtio_vsock_pkt *
+virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
+			   size_t len,
+			   u32 src_cid,
+			   u32 src_port,
+			   u32 dst_cid,
+			   u32 dst_port)
+{
+	struct virtio_vsock_pkt *pkt;
+	int err;
+
+	pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
+	if (!pkt)
+		return NULL;
+
+	pkt->hdr.type		= cpu_to_le16(info->type);
+	pkt->hdr.op		= cpu_to_le16(info->op);
+	pkt->hdr.src_cid	= cpu_to_le64(src_cid);
+	pkt->hdr.dst_cid	= cpu_to_le64(dst_cid);
+	pkt->hdr.src_port	= cpu_to_le32(src_port);
+	pkt->hdr.dst_port	= cpu_to_le32(dst_port);
+	pkt->hdr.flags		= cpu_to_le32(info->flags);
+	pkt->len		= len;
+	pkt->hdr.len		= cpu_to_le32(len);
+	pkt->reply		= info->reply;
+
+	if (info->msg && len > 0) {
+		pkt->buf = kmalloc(len, GFP_KERNEL);
+		if (!pkt->buf)
+			goto out_pkt;
+		err = memcpy_from_msg(pkt->buf, info->msg, len);
+		if (err)
+			goto out;
+	}
+
+	trace_virtio_transport_alloc_pkt(src_cid, src_port,
+					 dst_cid, dst_port,
+					 len,
+					 info->type,
+					 info->op,
+					 info->flags);
+
+	return pkt;
+
+out:
+	kfree(pkt->buf);
+out_pkt:
+	kfree(pkt);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_alloc_pkt);
+
+static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
+					  struct virtio_vsock_pkt_info *info)
+{
+	u32 src_cid, src_port, dst_cid, dst_port;
+	struct virtio_vsock_sock *vvs;
+	struct virtio_vsock_pkt *pkt;
+	u32 pkt_len = info->pkt_len;
+
+	src_cid = vm_sockets_get_local_cid();
+	src_port = vsk->local_addr.svm_port;
+	if (!info->remote_cid) {
+		dst_cid	= vsk->remote_addr.svm_cid;
+		dst_port = vsk->remote_addr.svm_port;
+	} else {
+		dst_cid = info->remote_cid;
+		dst_port = info->remote_port;
+	}
+
+	vvs = vsk->trans;
+
+	/* we can send less than pkt_len bytes */
+	if (pkt_len > VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE)
+		pkt_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
+
+	/* virtio_transport_get_credit might return less than pkt_len credit */
+	pkt_len = virtio_transport_get_credit(vvs, pkt_len);
+
+	/* Do not send zero length OP_RW pkt */
+	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
+		return pkt_len;
+
+	pkt = virtio_transport_alloc_pkt(info, pkt_len,
+					 src_cid, src_port,
+					 dst_cid, dst_port);
+	if (!pkt) {
+		virtio_transport_put_credit(vvs, pkt_len);
+		return -ENOMEM;
+	}
+
+	virtio_transport_inc_tx_pkt(vvs, pkt);
+
+	return virtio_transport_get_ops()->send_pkt(pkt);
+}
+
+static void virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
+					struct virtio_vsock_pkt *pkt)
+{
+	vvs->rx_bytes += pkt->len;
+}
+
+static void virtio_transport_dec_rx_pkt(struct virtio_vsock_sock *vvs,
+					struct virtio_vsock_pkt *pkt)
+{
+	vvs->rx_bytes -= pkt->len;
+	vvs->fwd_cnt += pkt->len;
+}
+
+void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct virtio_vsock_pkt *pkt)
+{
+	spin_lock_bh(&vvs->tx_lock);
+	pkt->hdr.fwd_cnt = cpu_to_le32(vvs->fwd_cnt);
+	pkt->hdr.buf_alloc = cpu_to_le32(vvs->buf_alloc);
+	spin_unlock_bh(&vvs->tx_lock);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_inc_tx_pkt);
+
+u32 virtio_transport_get_credit(struct virtio_vsock_sock *vvs, u32 credit)
+{
+	u32 ret;
+
+	spin_lock_bh(&vvs->tx_lock);
+	ret = vvs->peer_buf_alloc - (vvs->tx_cnt - vvs->peer_fwd_cnt);
+	if (ret > credit)
+		ret = credit;
+	vvs->tx_cnt += ret;
+	spin_unlock_bh(&vvs->tx_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_get_credit);
+
+void virtio_transport_put_credit(struct virtio_vsock_sock *vvs, u32 credit)
+{
+	spin_lock_bh(&vvs->tx_lock);
+	vvs->tx_cnt -= credit;
+	spin_unlock_bh(&vvs->tx_lock);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_put_credit);
+
+static int virtio_transport_send_credit_update(struct vsock_sock *vsk,
+					       int type,
+					       struct virtio_vsock_hdr *hdr)
+{
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_CREDIT_UPDATE,
+		.type = type,
+	};
+
+	return virtio_transport_send_pkt_info(vsk, &info);
+}
+
+static ssize_t
+virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
+				   struct msghdr *msg,
+				   size_t len)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+	struct virtio_vsock_pkt *pkt;
+	size_t bytes, total = 0;
+	int err = -EFAULT;
+
+	spin_lock_bh(&vvs->rx_lock);
+	while (total < len && !list_empty(&vvs->rx_queue)) {
+		pkt = list_first_entry(&vvs->rx_queue,
+				       struct virtio_vsock_pkt, list);
+
+		bytes = len - total;
+		if (bytes > pkt->len - pkt->off)
+			bytes = pkt->len - pkt->off;
+
+		/* sk_lock is held by caller so no one else can dequeue.
+		 * Unlock rx_lock since memcpy_to_msg() may sleep.
+		 */
+		spin_unlock_bh(&vvs->rx_lock);
+
+		err = memcpy_to_msg(msg, pkt->buf + pkt->off, bytes);
+		if (err)
+			goto out;
+
+		spin_lock_bh(&vvs->rx_lock);
+
+		total += bytes;
+		pkt->off += bytes;
+		if (pkt->off == pkt->len) {
+			virtio_transport_dec_rx_pkt(vvs, pkt);
+			list_del(&pkt->list);
+			virtio_transport_free_pkt(pkt);
+		}
+	}
+	spin_unlock_bh(&vvs->rx_lock);
+
+	/* Send a credit pkt to peer */
+	virtio_transport_send_credit_update(vsk, VIRTIO_VSOCK_TYPE_STREAM,
+					    NULL);
+
+	return total;
+
+out:
+	if (total)
+		err = total;
+	return err;
+}
+
+ssize_t
+virtio_transport_stream_dequeue(struct vsock_sock *vsk,
+				struct msghdr *msg,
+				size_t len, int flags)
+{
+	if (flags & MSG_PEEK)
+		return -EOPNOTSUPP;
+
+	return virtio_transport_stream_do_dequeue(vsk, msg, len);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_dequeue);
+
+int
+virtio_transport_dgram_dequeue(struct vsock_sock *vsk,
+			       struct msghdr *msg,
+			       size_t len, int flags)
+{
+	return -EOPNOTSUPP;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
+
+s64 virtio_transport_stream_has_data(struct vsock_sock *vsk)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+	s64 bytes;
+
+	spin_lock_bh(&vvs->rx_lock);
+	bytes = vvs->rx_bytes;
+	spin_unlock_bh(&vvs->rx_lock);
+
+	return bytes;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_has_data);
+
+static s64 virtio_transport_has_space(struct vsock_sock *vsk)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+	s64 bytes;
+
+	bytes = vvs->peer_buf_alloc - (vvs->tx_cnt - vvs->peer_fwd_cnt);
+	if (bytes < 0)
+		bytes = 0;
+
+	return bytes;
+}
+
+s64 virtio_transport_stream_has_space(struct vsock_sock *vsk)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+	s64 bytes;
+
+	spin_lock_bh(&vvs->tx_lock);
+	bytes = virtio_transport_has_space(vsk);
+	spin_unlock_bh(&vvs->tx_lock);
+
+	return bytes;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_has_space);
+
+int virtio_transport_do_socket_init(struct vsock_sock *vsk,
+				    struct vsock_sock *psk)
+{
+	struct virtio_vsock_sock *vvs;
+
+	vvs = kzalloc(sizeof(*vvs), GFP_KERNEL);
+	if (!vvs)
+		return -ENOMEM;
+
+	vsk->trans = vvs;
+	vvs->vsk = vsk;
+	if (psk) {
+		struct virtio_vsock_sock *ptrans = psk->trans;
+
+		vvs->buf_size	= ptrans->buf_size;
+		vvs->buf_size_min = ptrans->buf_size_min;
+		vvs->buf_size_max = ptrans->buf_size_max;
+		vvs->peer_buf_alloc = ptrans->peer_buf_alloc;
+	} else {
+		vvs->buf_size = VIRTIO_VSOCK_DEFAULT_BUF_SIZE;
+		vvs->buf_size_min = VIRTIO_VSOCK_DEFAULT_MIN_BUF_SIZE;
+		vvs->buf_size_max = VIRTIO_VSOCK_DEFAULT_MAX_BUF_SIZE;
+	}
+
+	vvs->buf_alloc = vvs->buf_size;
+
+	spin_lock_init(&vvs->rx_lock);
+	spin_lock_init(&vvs->tx_lock);
+	INIT_LIST_HEAD(&vvs->rx_queue);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_do_socket_init);
+
+u64 virtio_transport_get_buffer_size(struct vsock_sock *vsk)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+
+	return vvs->buf_size;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_get_buffer_size);
+
+u64 virtio_transport_get_min_buffer_size(struct vsock_sock *vsk)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+
+	return vvs->buf_size_min;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_get_min_buffer_size);
+
+u64 virtio_transport_get_max_buffer_size(struct vsock_sock *vsk)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+
+	return vvs->buf_size_max;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_get_max_buffer_size);
+
+void virtio_transport_set_buffer_size(struct vsock_sock *vsk, u64 val)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+
+	if (val > VIRTIO_VSOCK_MAX_BUF_SIZE)
+		val = VIRTIO_VSOCK_MAX_BUF_SIZE;
+	if (val < vvs->buf_size_min)
+		vvs->buf_size_min = val;
+	if (val > vvs->buf_size_max)
+		vvs->buf_size_max = val;
+	vvs->buf_size = val;
+	vvs->buf_alloc = val;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_set_buffer_size);
+
+void virtio_transport_set_min_buffer_size(struct vsock_sock *vsk, u64 val)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+
+	if (val > VIRTIO_VSOCK_MAX_BUF_SIZE)
+		val = VIRTIO_VSOCK_MAX_BUF_SIZE;
+	if (val > vvs->buf_size)
+		vvs->buf_size = val;
+	vvs->buf_size_min = val;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_set_min_buffer_size);
+
+void virtio_transport_set_max_buffer_size(struct vsock_sock *vsk, u64 val)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+
+	if (val > VIRTIO_VSOCK_MAX_BUF_SIZE)
+		val = VIRTIO_VSOCK_MAX_BUF_SIZE;
+	if (val < vvs->buf_size)
+		vvs->buf_size = val;
+	vvs->buf_size_max = val;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_set_max_buffer_size);
+
+int
+virtio_transport_notify_poll_in(struct vsock_sock *vsk,
+				size_t target,
+				bool *data_ready_now)
+{
+	if (vsock_stream_has_data(vsk))
+		*data_ready_now = true;
+	else
+		*data_ready_now = false;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_poll_in);
+
+int
+virtio_transport_notify_poll_out(struct vsock_sock *vsk,
+				 size_t target,
+				 bool *space_avail_now)
+{
+	s64 free_space;
+
+	free_space = vsock_stream_has_space(vsk);
+	if (free_space > 0)
+		*space_avail_now = true;
+	else if (free_space == 0)
+		*space_avail_now = false;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_poll_out);
+
+int virtio_transport_notify_recv_init(struct vsock_sock *vsk,
+	size_t target, struct vsock_transport_recv_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_recv_init);
+
+int virtio_transport_notify_recv_pre_block(struct vsock_sock *vsk,
+	size_t target, struct vsock_transport_recv_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_recv_pre_block);
+
+int virtio_transport_notify_recv_pre_dequeue(struct vsock_sock *vsk,
+	size_t target, struct vsock_transport_recv_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_recv_pre_dequeue);
+
+int virtio_transport_notify_recv_post_dequeue(struct vsock_sock *vsk,
+	size_t target, ssize_t copied, bool data_read,
+	struct vsock_transport_recv_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_recv_post_dequeue);
+
+int virtio_transport_notify_send_init(struct vsock_sock *vsk,
+	struct vsock_transport_send_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_send_init);
+
+int virtio_transport_notify_send_pre_block(struct vsock_sock *vsk,
+	struct vsock_transport_send_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_send_pre_block);
+
+int virtio_transport_notify_send_pre_enqueue(struct vsock_sock *vsk,
+	struct vsock_transport_send_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_send_pre_enqueue);
+
+int virtio_transport_notify_send_post_enqueue(struct vsock_sock *vsk,
+	ssize_t written, struct vsock_transport_send_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_send_post_enqueue);
+
+u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+
+	return vvs->buf_size;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_rcvhiwat);
+
+bool virtio_transport_stream_is_active(struct vsock_sock *vsk)
+{
+	return true;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_is_active);
+
+bool virtio_transport_stream_allow(u32 cid, u32 port)
+{
+	return true;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
+
+int virtio_transport_dgram_bind(struct vsock_sock *vsk,
+				struct sockaddr_vm *addr)
+{
+	return -EOPNOTSUPP;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
+
+bool virtio_transport_dgram_allow(u32 cid, u32 port)
+{
+	return false;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
+
+int virtio_transport_connect(struct vsock_sock *vsk)
+{
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_REQUEST,
+		.type = VIRTIO_VSOCK_TYPE_STREAM,
+	};
+
+	return virtio_transport_send_pkt_info(vsk, &info);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_connect);
+
+int virtio_transport_shutdown(struct vsock_sock *vsk, int mode)
+{
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_SHUTDOWN,
+		.type = VIRTIO_VSOCK_TYPE_STREAM,
+		.flags = (mode & RCV_SHUTDOWN ?
+			  VIRTIO_VSOCK_SHUTDOWN_RCV : 0) |
+			 (mode & SEND_SHUTDOWN ?
+			  VIRTIO_VSOCK_SHUTDOWN_SEND : 0),
+	};
+
+	return virtio_transport_send_pkt_info(vsk, &info);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_shutdown);
+
+int
+virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
+			       struct sockaddr_vm *remote_addr,
+			       struct msghdr *msg,
+			       size_t dgram_len)
+{
+	return -EOPNOTSUPP;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
+
+ssize_t
+virtio_transport_stream_enqueue(struct vsock_sock *vsk,
+				struct msghdr *msg,
+				size_t len)
+{
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_RW,
+		.type = VIRTIO_VSOCK_TYPE_STREAM,
+		.msg = msg,
+		.pkt_len = len,
+	};
+
+	return virtio_transport_send_pkt_info(vsk, &info);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_enqueue);
+
+void virtio_transport_destruct(struct vsock_sock *vsk)
+{
+	struct virtio_vsock_sock *vvs = vsk->trans;
+
+	kfree(vvs);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_destruct);
+
+static int virtio_transport_reset(struct vsock_sock *vsk,
+				  struct virtio_vsock_pkt *pkt)
+{
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_RST,
+		.type = VIRTIO_VSOCK_TYPE_STREAM,
+		.reply = !!pkt,
+	};
+
+	/* Send RST only if the original pkt is not a RST pkt */
+	if (pkt && le16_to_cpu(pkt->hdr.op) == VIRTIO_VSOCK_OP_RST)
+		return 0;
+
+	return virtio_transport_send_pkt_info(vsk, &info);
+}
+
+/* Normally packets are associated with a socket.  There may be no socket if an
+ * attempt was made to connect to a socket that does not exist.
+ */
+static int virtio_transport_reset_no_sock(struct virtio_vsock_pkt *pkt)
+{
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_RST,
+		.type = le16_to_cpu(pkt->hdr.type),
+		.reply = true,
+	};
+
+	/* Send RST only if the original pkt is not a RST pkt */
+	if (le16_to_cpu(pkt->hdr.op) == VIRTIO_VSOCK_OP_RST)
+		return 0;
+
+	pkt = virtio_transport_alloc_pkt(&info, 0,
+					 le32_to_cpu(pkt->hdr.dst_cid),
+					 le32_to_cpu(pkt->hdr.dst_port),
+					 le32_to_cpu(pkt->hdr.src_cid),
+					 le32_to_cpu(pkt->hdr.src_port));
+	if (!pkt)
+		return -ENOMEM;
+
+	return virtio_transport_get_ops()->send_pkt(pkt);
+}
+
+static void virtio_transport_wait_close(struct sock *sk, long timeout)
+{
+	if (timeout) {
+		DEFINE_WAIT(wait);
+
+		do {
+			prepare_to_wait(sk_sleep(sk), &wait,
+					TASK_INTERRUPTIBLE);
+			if (sk_wait_event(sk, &timeout,
+					  sock_flag(sk, SOCK_DONE)))
+				break;
+		} while (!signal_pending(current) && timeout);
+
+		finish_wait(sk_sleep(sk), &wait);
+	}
+}
+
+static void virtio_transport_do_close(struct vsock_sock *vsk,
+				      bool cancel_timeout)
+{
+	struct sock *sk = sk_vsock(vsk);
+
+	sock_set_flag(sk, SOCK_DONE);
+	vsk->peer_shutdown = SHUTDOWN_MASK;
+	if (vsock_stream_has_data(vsk) <= 0)
+		sk->sk_state = SS_DISCONNECTING;
+	sk->sk_state_change(sk);
+
+	if (vsk->close_work_scheduled &&
+	    (!cancel_timeout || cancel_delayed_work(&vsk->close_work))) {
+		vsk->close_work_scheduled = false;
+
+		vsock_remove_sock(vsk);
+
+		/* Release refcnt obtained when we scheduled the timeout */
+		sock_put(sk);
+	}
+}
+
+static void virtio_transport_close_timeout(struct work_struct *work)
+{
+	struct vsock_sock *vsk =
+		container_of(work, struct vsock_sock, close_work.work);
+	struct sock *sk = sk_vsock(vsk);
+
+	sock_hold(sk);
+	lock_sock(sk);
+
+	if (!sock_flag(sk, SOCK_DONE)) {
+		(void)virtio_transport_reset(vsk, NULL);
+
+		virtio_transport_do_close(vsk, false);
+	}
+
+	vsk->close_work_scheduled = false;
+
+	release_sock(sk);
+	sock_put(sk);
+}
+
+/* User context, vsk->sk is locked */
+static bool virtio_transport_close(struct vsock_sock *vsk)
+{
+	struct sock *sk = &vsk->sk;
+
+	if (!(sk->sk_state == SS_CONNECTED ||
+	      sk->sk_state == SS_DISCONNECTING))
+		return true;
+
+	/* Already received SHUTDOWN from peer, reply with RST */
+	if ((vsk->peer_shutdown & SHUTDOWN_MASK) == SHUTDOWN_MASK) {
+		(void)virtio_transport_reset(vsk, NULL);
+		return true;
+	}
+
+	if ((sk->sk_shutdown & SHUTDOWN_MASK) != SHUTDOWN_MASK)
+		(void)virtio_transport_shutdown(vsk, SHUTDOWN_MASK);
+
+	if (sock_flag(sk, SOCK_LINGER) && !(current->flags & PF_EXITING))
+		virtio_transport_wait_close(sk, sk->sk_lingertime);
+
+	if (sock_flag(sk, SOCK_DONE)) {
+		return true;
+	}
+
+	sock_hold(sk);
+	INIT_DELAYED_WORK(&vsk->close_work,
+			  virtio_transport_close_timeout);
+	vsk->close_work_scheduled = true;
+	schedule_delayed_work(&vsk->close_work, VSOCK_CLOSE_TIMEOUT);
+	return false;
+}
+
+void virtio_transport_release(struct vsock_sock *vsk)
+{
+	struct sock *sk = &vsk->sk;
+	bool remove_sock = true;
+
+	lock_sock(sk);
+	if (sk->sk_type == SOCK_STREAM)
+		remove_sock = virtio_transport_close(vsk);
+	release_sock(sk);
+
+	if (remove_sock)
+		vsock_remove_sock(vsk);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_release);
+
+static int
+virtio_transport_recv_connecting(struct sock *sk,
+				 struct virtio_vsock_pkt *pkt)
+{
+	struct vsock_sock *vsk = vsock_sk(sk);
+	int err;
+	int skerr;
+
+	switch (le16_to_cpu(pkt->hdr.op)) {
+	case VIRTIO_VSOCK_OP_RESPONSE:
+		sk->sk_state = SS_CONNECTED;
+		sk->sk_socket->state = SS_CONNECTED;
+		vsock_insert_connected(vsk);
+		sk->sk_state_change(sk);
+		break;
+	case VIRTIO_VSOCK_OP_INVALID:
+		break;
+	case VIRTIO_VSOCK_OP_RST:
+		skerr = ECONNRESET;
+		err = 0;
+		goto destroy;
+	default:
+		skerr = EPROTO;
+		err = -EINVAL;
+		goto destroy;
+	}
+	return 0;
+
+destroy:
+	virtio_transport_reset(vsk, pkt);
+	sk->sk_state = SS_UNCONNECTED;
+	sk->sk_err = skerr;
+	sk->sk_error_report(sk);
+	return err;
+}
+
+static int
+virtio_transport_recv_connected(struct sock *sk,
+				struct virtio_vsock_pkt *pkt)
+{
+	struct vsock_sock *vsk = vsock_sk(sk);
+	struct virtio_vsock_sock *vvs = vsk->trans;
+	int err = 0;
+
+	switch (le16_to_cpu(pkt->hdr.op)) {
+	case VIRTIO_VSOCK_OP_RW:
+		pkt->len = le32_to_cpu(pkt->hdr.len);
+		pkt->off = 0;
+
+		spin_lock_bh(&vvs->rx_lock);
+		virtio_transport_inc_rx_pkt(vvs, pkt);
+		list_add_tail(&pkt->list, &vvs->rx_queue);
+		spin_unlock_bh(&vvs->rx_lock);
+
+		sk->sk_data_ready(sk);
+		return err;
+	case VIRTIO_VSOCK_OP_CREDIT_UPDATE:
+		sk->sk_write_space(sk);
+		break;
+	case VIRTIO_VSOCK_OP_SHUTDOWN:
+		if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SHUTDOWN_RCV)
+			vsk->peer_shutdown |= RCV_SHUTDOWN;
+		if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SHUTDOWN_SEND)
+			vsk->peer_shutdown |= SEND_SHUTDOWN;
+		if (vsk->peer_shutdown == SHUTDOWN_MASK &&
+		    vsock_stream_has_data(vsk) <= 0)
+			sk->sk_state = SS_DISCONNECTING;
+		if (le32_to_cpu(pkt->hdr.flags))
+			sk->sk_state_change(sk);
+		break;
+	case VIRTIO_VSOCK_OP_RST:
+		virtio_transport_do_close(vsk, true);
+		break;
+	default:
+		err = -EINVAL;
+		break;
+	}
+
+	virtio_transport_free_pkt(pkt);
+	return err;
+}
+
+static void
+virtio_transport_recv_disconnecting(struct sock *sk,
+				    struct virtio_vsock_pkt *pkt)
+{
+	struct vsock_sock *vsk = vsock_sk(sk);
+
+	if (le16_to_cpu(pkt->hdr.op) == VIRTIO_VSOCK_OP_RST)
+		virtio_transport_do_close(vsk, true);
+}
+
+static int
+virtio_transport_send_response(struct vsock_sock *vsk,
+			       struct virtio_vsock_pkt *pkt)
+{
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_RESPONSE,
+		.type = VIRTIO_VSOCK_TYPE_STREAM,
+		.remote_cid = le32_to_cpu(pkt->hdr.src_cid),
+		.remote_port = le32_to_cpu(pkt->hdr.src_port),
+		.reply = true,
+	};
+
+	return virtio_transport_send_pkt_info(vsk, &info);
+}
+
+/* Handle server socket */
+static int
+virtio_transport_recv_listen(struct sock *sk, struct virtio_vsock_pkt *pkt)
+{
+	struct vsock_sock *vsk = vsock_sk(sk);
+	struct vsock_sock *vchild;
+	struct sock *child;
+
+	if (le16_to_cpu(pkt->hdr.op) != VIRTIO_VSOCK_OP_REQUEST) {
+		virtio_transport_reset(vsk, pkt);
+		return -EINVAL;
+	}
+
+	if (sk_acceptq_is_full(sk)) {
+		virtio_transport_reset(vsk, pkt);
+		return -ENOMEM;
+	}
+
+	child = __vsock_create(sock_net(sk), NULL, sk, GFP_KERNEL,
+			       sk->sk_type, 0);
+	if (!child) {
+		virtio_transport_reset(vsk, pkt);
+		return -ENOMEM;
+	}
+
+	sk->sk_ack_backlog++;
+
+	lock_sock_nested(child, SINGLE_DEPTH_NESTING);
+
+	child->sk_state = SS_CONNECTED;
+
+	vchild = vsock_sk(child);
+	vsock_addr_init(&vchild->local_addr, le32_to_cpu(pkt->hdr.dst_cid),
+			le32_to_cpu(pkt->hdr.dst_port));
+	vsock_addr_init(&vchild->remote_addr, le32_to_cpu(pkt->hdr.src_cid),
+			le32_to_cpu(pkt->hdr.src_port));
+
+	vsock_insert_connected(vchild);
+	vsock_enqueue_accept(sk, child);
+	virtio_transport_send_response(vchild, pkt);
+
+	release_sock(child);
+
+	sk->sk_data_ready(sk);
+	return 0;
+}
+
+static bool virtio_transport_space_update(struct sock *sk,
+					  struct virtio_vsock_pkt *pkt)
+{
+	struct vsock_sock *vsk = vsock_sk(sk);
+	struct virtio_vsock_sock *vvs = vsk->trans;
+	bool space_available;
+
+	/* buf_alloc and fwd_cnt is always included in the hdr */
+	spin_lock_bh(&vvs->tx_lock);
+	vvs->peer_buf_alloc = le32_to_cpu(pkt->hdr.buf_alloc);
+	vvs->peer_fwd_cnt = le32_to_cpu(pkt->hdr.fwd_cnt);
+	space_available = virtio_transport_has_space(vsk);
+	spin_unlock_bh(&vvs->tx_lock);
+	return space_available;
+}
+
+/* We are under the virtio-vsock's vsock->rx_lock or vhost-vsock's vq->mutex
+ * lock.
+ */
+void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt)
+{
+	struct sockaddr_vm src, dst;
+	struct vsock_sock *vsk;
+	struct sock *sk;
+	bool space_available;
+
+	vsock_addr_init(&src, le32_to_cpu(pkt->hdr.src_cid),
+			le32_to_cpu(pkt->hdr.src_port));
+	vsock_addr_init(&dst, le32_to_cpu(pkt->hdr.dst_cid),
+			le32_to_cpu(pkt->hdr.dst_port));
+
+	trace_virtio_transport_recv_pkt(src.svm_cid, src.svm_port,
+					dst.svm_cid, dst.svm_port,
+					le32_to_cpu(pkt->hdr.len),
+					le16_to_cpu(pkt->hdr.type),
+					le16_to_cpu(pkt->hdr.op),
+					le32_to_cpu(pkt->hdr.flags),
+					le32_to_cpu(pkt->hdr.buf_alloc),
+					le32_to_cpu(pkt->hdr.fwd_cnt));
+
+	if (le16_to_cpu(pkt->hdr.type) != VIRTIO_VSOCK_TYPE_STREAM) {
+		(void)virtio_transport_reset_no_sock(pkt);
+		goto free_pkt;
+	}
+
+	/* The socket must be in connected or bound table
+	 * otherwise send reset back
+	 */
+	sk = vsock_find_connected_socket(&src, &dst);
+	if (!sk) {
+		sk = vsock_find_bound_socket(&dst);
+		if (!sk) {
+			(void)virtio_transport_reset_no_sock(pkt);
+			goto free_pkt;
+		}
+	}
+
+	vsk = vsock_sk(sk);
+
+	space_available = virtio_transport_space_update(sk, pkt);
+
+	lock_sock(sk);
+
+	/* Update CID in case it has changed after a transport reset event */
+	vsk->local_addr.svm_cid = dst.svm_cid;
+
+	if (space_available)
+		sk->sk_write_space(sk);
+
+	switch (sk->sk_state) {
+	case VSOCK_SS_LISTEN:
+		virtio_transport_recv_listen(sk, pkt);
+		virtio_transport_free_pkt(pkt);
+		break;
+	case SS_CONNECTING:
+		virtio_transport_recv_connecting(sk, pkt);
+		virtio_transport_free_pkt(pkt);
+		break;
+	case SS_CONNECTED:
+		virtio_transport_recv_connected(sk, pkt);
+		break;
+	case SS_DISCONNECTING:
+		virtio_transport_recv_disconnecting(sk, pkt);
+		virtio_transport_free_pkt(pkt);
+		break;
+	default:
+		virtio_transport_free_pkt(pkt);
+		break;
+	}
+	release_sock(sk);
+
+	/* Release refcnt obtained when we fetched this socket out of the
+	 * bound or connected list.
+	 */
+	sock_put(sk);
+	return;
+
+free_pkt:
+	virtio_transport_free_pkt(pkt);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
+
+void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
+{
+	kfree(pkt->buf);
+	kfree(pkt);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Asias He");
+MODULE_DESCRIPTION("common code for virtio vsock");
-- 
2.7.4

^ permalink raw reply related

* [RFC v6 2/6] VSOCK: defer sock removal to transports
From: Stefan Hajnoczi @ 2016-07-28 14:36 UTC (permalink / raw)
  To: kvm
  Cc: marius vlad, Stefan Hajnoczi, Michael S. Tsirkin, netdev,
	Ian Campbell, Claudio Imbrenda, ggarcia, Matt Benjamin, Greg Kurz,
	virtualization, Christoffer Dall
In-Reply-To: <1469716595-13591-1-git-send-email-stefanha@redhat.com>

The virtio transport will implement graceful shutdown and the related
SO_LINGER socket option.  This requires orphaning the sock but keeping
it in the table of connections after .release().

This patch adds the vsock_remove_sock() function and leaves it up to the
transport when to remove the sock.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/net/af_vsock.h         |  1 +
 net/vmw_vsock/af_vsock.c       | 16 ++++++++++------
 net/vmw_vsock/vmci_transport.c |  2 ++
 3 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 23f5525..3af0b22 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -180,6 +180,7 @@ void vsock_remove_connected(struct vsock_sock *vsk);
 struct sock *vsock_find_bound_socket(struct sockaddr_vm *addr);
 struct sock *vsock_find_connected_socket(struct sockaddr_vm *src,
 					 struct sockaddr_vm *dst);
+void vsock_remove_sock(struct vsock_sock *vsk);
 void vsock_for_each_connected_socket(void (*fn)(struct sock *sk));
 
 #endif /* __AF_VSOCK_H__ */
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index e34d96f..17dbbe6 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -344,6 +344,16 @@ static bool vsock_in_connected_table(struct vsock_sock *vsk)
 	return ret;
 }
 
+void vsock_remove_sock(struct vsock_sock *vsk)
+{
+	if (vsock_in_bound_table(vsk))
+		vsock_remove_bound(vsk);
+
+	if (vsock_in_connected_table(vsk))
+		vsock_remove_connected(vsk);
+}
+EXPORT_SYMBOL_GPL(vsock_remove_sock);
+
 void vsock_for_each_connected_socket(void (*fn)(struct sock *sk))
 {
 	int i;
@@ -660,12 +670,6 @@ static void __vsock_release(struct sock *sk)
 		vsk = vsock_sk(sk);
 		pending = NULL;	/* Compiler warning. */
 
-		if (vsock_in_bound_table(vsk))
-			vsock_remove_bound(vsk);
-
-		if (vsock_in_connected_table(vsk))
-			vsock_remove_connected(vsk);
-
 		transport->release(vsk);
 
 		lock_sock(sk);
diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index 4120b7a..4be4fbb 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -1644,6 +1644,8 @@ static void vmci_transport_destruct(struct vsock_sock *vsk)
 
 static void vmci_transport_release(struct vsock_sock *vsk)
 {
+	vsock_remove_sock(vsk);
+
 	if (!vmci_handle_is_invalid(vmci_trans(vsk)->dg_handle)) {
 		vmci_datagram_destroy_handle(vmci_trans(vsk)->dg_handle);
 		vmci_trans(vsk)->dg_handle = VMCI_INVALID_HANDLE;
-- 
2.7.4

^ permalink raw reply related

* [RFC v6 1/6] VSOCK: transport-specific vsock_transport functions
From: Stefan Hajnoczi @ 2016-07-28 14:36 UTC (permalink / raw)
  To: kvm
  Cc: marius vlad, Stefan Hajnoczi, Michael S. Tsirkin, netdev,
	Ian Campbell, Claudio Imbrenda, ggarcia, Matt Benjamin, Greg Kurz,
	virtualization, Christoffer Dall
In-Reply-To: <1469716595-13591-1-git-send-email-stefanha@redhat.com>

struct vsock_transport contains function pointers called by AF_VSOCK
core code.  The transport may want its own transport-specific function
pointers and they can be added after struct vsock_transport.

Allow the transport to fetch vsock_transport.  It can downcast it to
access transport-specific function pointers.

The virtio transport will use this.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/net/af_vsock.h   | 3 +++
 net/vmw_vsock/af_vsock.c | 9 +++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index e9eb2d6..23f5525 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -165,6 +165,9 @@ static inline int vsock_core_init(const struct vsock_transport *t)
 }
 void vsock_core_exit(void);
 
+/* The transport may downcast this to access transport-specific functions */
+const struct vsock_transport *vsock_core_get_transport(void);
+
 /**** UTILS ****/
 
 void vsock_release_pending(struct sock *pending);
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index b96ac91..e34d96f 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1995,6 +1995,15 @@ void vsock_core_exit(void)
 }
 EXPORT_SYMBOL_GPL(vsock_core_exit);
 
+const struct vsock_transport *vsock_core_get_transport(void)
+{
+	/* vsock_register_mutex not taken since only the transport uses this
+	 * function and only while registered.
+	 */
+	return transport;
+}
+EXPORT_SYMBOL_GPL(vsock_core_get_transport);
+
 MODULE_AUTHOR("VMware, Inc.");
 MODULE_DESCRIPTION("VMware Virtual Socket Family");
 MODULE_VERSION("1.0.1.0-k");
-- 
2.7.4

^ permalink raw reply related

* [RFC v6 0/6] Add virtio transport for AF_VSOCK
From: Stefan Hajnoczi @ 2016-07-28 14:36 UTC (permalink / raw)
  To: kvm
  Cc: marius vlad, Stefan Hajnoczi, Michael S. Tsirkin, netdev,
	Ian Campbell, Claudio Imbrenda, ggarcia, Matt Benjamin, Greg Kurz,
	virtualization, Christoffer Dall

This series is based on v4.7.

This RFC is the implementation for the new VIRTIO Socket device.  It is
developed in parallel with the VIRTIO device specification and proves the
design.  Once the specification has been accepted I will send a non-RFC version
of this patch series.

v6:
 * Add VHOST_VSOCK_SET_RUNNING ioctl to start/stop vhost cleanly
 * Add graceful shutdown to avoid port reuse while peer is still closing
   socket [Ian Campbell]
 * Start/stop rx depending on reply packet accounting to bound memory
   allocation if the host is not processing rx packets
 * Use send_pkt_list to defer transmission
 * Use spinlocks instead of mutexes for tx_lock/rx_lock because they are
   used in sections that are not allowed to sleep. [Claudio]
 * 64-bit CIDs in virtio_vsock.h to match virtio-vsock specification
 * Move duplicated send_pkt() logic from vhost and virtio transports
   into virtio_transport_common.ko
 * ...and more, see individual patch changelogs

v5:
 * Transport reset event for live migration support
 * Reorder virtqueues, drop unused ctrl virtqueue
 * Switch to a free virtio device ID
 * More small changes, see patches for individual items

v4:
 * Addressed code review comments from Alex Bennee
 * MAINTAINERS file entries for new files
 * Trace events instead of pr_debug()
 * RST packet is sent when there is no listen socket
 * Allow guest->host connections again (began discussing netfilter support with
   Matt Benjamin instead of hard-coding security policy in virtio-vsock code)
 * Many checkpatch.pl cleanups (will be 100% clean in v5)

v3:
 * Remove unnecessary 3-way handshake, just do REQUEST/RESPONSE instead
   of REQUEST/RESPONSE/ACK
 * Remove SOCK_DGRAM support and focus on SOCK_STREAM first
   (also drop v2 Patch 1, it's only needed for SOCK_DGRAM)
 * Only allow host->guest connections (same security model as latest
   VMware)
 * Don't put vhost vsock driver into staging
 * Add missing Kconfig dependencies (Arnd Bergmann <arnd@arndb.de>)
 * Remove unneeded variable used to store return value
   (Fengguang Wu <fengguang.wu@intel.com> and Julia Lawall
   <julia.lawall@lip6.fr>)

v2:
 * Rebased onto Linux v4.4-rc2
 * vhost: Refuse to assign reserved CIDs
 * vhost: Refuse guest CID if already in use
 * vhost: Only accept correctly addressed packets (no spoofing!)
 * vhost: Support flexible rx/tx descriptor layout
 * vhost: Add missing total_tx_buf decrement
 * virtio_transport: Fix total_tx_buf accounting
 * virtio_transport: Add virtio_transport global mutex to prevent races
 * common: Notify other side of SOCK_STREAM disconnect (fixes shutdown
   semantics)
 * common: Avoid recursive mutex_lock(tx_lock) for write_space (fixes deadlock)
 * common: Define VIRTIO_VSOCK_TYPE_STREAM/DGRAM hardware interface constants
 * common: Define VIRTIO_VSOCK_SHUTDOWN_RCV/SEND hardware interface constants
 * common: Fix peer_buf_alloc inheritance on child socket

This patch series adds a virtio transport for AF_VSOCK (net/vmw_vsock/).
AF_VSOCK is designed for communication between virtual machines and
hypervisors.  It is currently only implemented for VMware's VMCI transport.

Much of the work was done by Asias He and Gerd Hoffmann a while back.  I have
picked up the series again.

The QEMU userspace changes are here:
https://github.com/stefanha/qemu/commits/vsock

Why virtio-vsock?
-----------------
Guest<->host communication is currently done over the virtio-serial device.
This makes it hard to port sockets API-based applications and is limited to
static ports.

virtio-vsock uses the sockets API so that applications can rely on familiar
SOCK_STREAM semantics.  Applications on the host can easily connect to guest
agents because the sockets API allows multiple connections to a listen socket
(unlike virtio-serial).  This simplifies the guest<->host communication and
eliminates the need for extra processes on the host to arbitrate virtio-serial
ports.

Overview
--------
This series adds 3 pieces:

1. virtio_transport_common.ko - core virtio vsock code that uses vsock.ko

2. virtio_transport.ko - guest driver

3. drivers/vhost/vsock.ko - host driver

Howto
-----
The following kernel options are needed:
  CONFIG_VSOCKETS=y
  CONFIG_VIRTIO_VSOCKETS=y
  CONFIG_VIRTIO_VSOCKETS_COMMON=y
  CONFIG_VHOST_VSOCK=m

Launch QEMU as follows:
  # qemu ... -device vhost-vsock-pci,id=vhost-vsock-pci0,guest-cid=3

Guest and host can communicate via AF_VSOCK sockets.  The host's CID (address)
is 2 and the guest must be assigned a CID (3 in the example above).

See http://qemu-project.org/Features/VirtioVsock for more info.

Asias He (4):
  VSOCK: Introduce virtio_vsock_common.ko
  VSOCK: Introduce virtio_transport.ko
  VSOCK: Introduce vhost_vsock.ko
  VSOCK: Add Makefile and Kconfig

Stefan Hajnoczi (2):
  VSOCK: transport-specific vsock_transport functions
  VSOCK: defer sock removal to transports

 MAINTAINERS                                        |  13 +
 drivers/vhost/Kconfig                              |  15 +
 drivers/vhost/Makefile                             |   4 +
 drivers/vhost/vsock.c                              | 722 +++++++++++++++
 include/linux/virtio_vsock.h                       | 154 ++++
 include/net/af_vsock.h                             |   6 +
 .../trace/events/vsock_virtio_transport_common.h   | 144 +++
 include/uapi/linux/Kbuild                          |   1 +
 include/uapi/linux/vhost.h                         |   5 +
 include/uapi/linux/virtio_ids.h                    |   1 +
 include/uapi/linux/virtio_vsock.h                  |  94 ++
 net/vmw_vsock/Kconfig                              |  20 +
 net/vmw_vsock/Makefile                             |   6 +
 net/vmw_vsock/af_vsock.c                           |  25 +-
 net/vmw_vsock/virtio_transport.c                   | 624 +++++++++++++
 net/vmw_vsock/virtio_transport_common.c            | 992 +++++++++++++++++++++
 net/vmw_vsock/vmci_transport.c                     |   2 +
 17 files changed, 2822 insertions(+), 6 deletions(-)
 create mode 100644 drivers/vhost/vsock.c
 create mode 100644 include/linux/virtio_vsock.h
 create mode 100644 include/trace/events/vsock_virtio_transport_common.h
 create mode 100644 include/uapi/linux/virtio_vsock.h
 create mode 100644 net/vmw_vsock/virtio_transport.c
 create mode 100644 net/vmw_vsock/virtio_transport_common.c

-- 
2.7.4

^ permalink raw reply

* Re: [Qemu-devel] [PATCH 6/7] qemu: Implement virtio-pstore device
From: Daniel P. Berrange @ 2016-07-28 13:22 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Tony Luck, Kees Cook, kvm, Radim Krčmář, LKML,
	Michael S. Tsirkin, qemu-devel, Steven Rostedt, virtualization,
	Minchan Kim, Anton Vorontsov, Anthony Liguori, Colin Cross,
	Paolo Bonzini, Ingo Molnar
In-Reply-To: <1469632111-23260-7-git-send-email-namhyung@kernel.org>

On Thu, Jul 28, 2016 at 12:08:30AM +0900, Namhyung Kim wrote:
> Add virtio pstore device to allow kernel log files saved on the host.
> It will save the log files on the directory given by pstore device
> option.
> 
>   $ qemu-system-x86_64 -device virtio-pstore,directory=dir-xx ...
> 
>   (guest) # echo c > /proc/sysrq-trigger
> 
>   $ ls dir-xx
>   dmesg-1.enc.z  dmesg-2.enc.z
> 
> The log files are usually compressed using zlib.  Users can see the log
> messages directly on the host or on the guest (using pstore filesystem).
> 
> The 'directory' property is required for virtio-pstore device to work.
> It also adds 'bufsize' and 'console' (boolean) properties.
> 
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Anthony Liguori <aliguori@amazon.com>
> Cc: Anton Vorontsov <anton@enomsg.org>
> Cc: Colin Cross <ccross@android.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: kvm@vger.kernel.org
> Cc: qemu-devel@nongnu.org
> Cc: virtualization@lists.linux-foundation.org
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> ---
>  hw/virtio/Makefile.objs                        |   2 +-
>  hw/virtio/virtio-pci.c                         |  54 +++
>  hw/virtio/virtio-pci.h                         |  14 +
>  hw/virtio/virtio-pstore.c                      | 477 +++++++++++++++++++++++++
>  include/hw/pci/pci.h                           |   1 +
>  include/hw/virtio/virtio-pstore.h              |  34 ++
>  include/standard-headers/linux/virtio_ids.h    |   1 +
>  include/standard-headers/linux/virtio_pstore.h |  80 +++++
>  qdev-monitor.c                                 |   1 +
>  9 files changed, 663 insertions(+), 1 deletion(-)
>  create mode 100644 hw/virtio/virtio-pstore.c
>  create mode 100644 include/hw/virtio/virtio-pstore.h
>  create mode 100644 include/standard-headers/linux/virtio_pstore.h
> 
> diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
> index 3e2b175..aae7082 100644
> --- a/hw/virtio/Makefile.objs
> +++ b/hw/virtio/Makefile.objs
> @@ -4,4 +4,4 @@ common-obj-y += virtio-bus.o
>  common-obj-y += virtio-mmio.o
>  
>  obj-y += virtio.o virtio-balloon.o 
> -obj-$(CONFIG_LINUX) += vhost.o vhost-backend.o vhost-user.o
> +obj-$(CONFIG_LINUX) += vhost.o vhost-backend.o vhost-user.o virtio-pstore.o
> diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
> index f0677b7..d99a405 100644
> --- a/hw/virtio/virtio-pci.c
> +++ b/hw/virtio/virtio-pci.c
> @@ -2414,6 +2414,59 @@ static const TypeInfo virtio_host_pci_info = {
>  };
>  #endif
>  
> +/* virtio-pstore-pci */
> +
> +static void virtio_pstore_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
> +{
> +    VirtIOPstorePCI *vps = VIRTIO_PSTORE_PCI(vpci_dev);
> +    DeviceState *vdev = DEVICE(&vps->vdev);
> +    Error *err = NULL;
> +
> +    qdev_set_parent_bus(vdev, BUS(&vpci_dev->bus));
> +    object_property_set_bool(OBJECT(vdev), true, "realized", &err);
> +    if (err) {
> +        error_propagate(errp, err);
> +        return;
> +    }
> +}
> +
> +static void virtio_pstore_pci_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    VirtioPCIClass *k = VIRTIO_PCI_CLASS(klass);
> +    PCIDeviceClass *pcidev_k = PCI_DEVICE_CLASS(klass);
> +
> +    k->realize = virtio_pstore_pci_realize;
> +    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
> +
> +    pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
> +    pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_PSTORE;
> +    pcidev_k->revision = VIRTIO_PCI_ABI_VERSION;
> +    pcidev_k->class_id = PCI_CLASS_OTHERS;
> +}
> +
> +static void virtio_pstore_pci_instance_init(Object *obj)
> +{
> +    VirtIOPstorePCI *dev = VIRTIO_PSTORE_PCI(obj);
> +
> +    virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev),
> +                                TYPE_VIRTIO_PSTORE);
> +    object_property_add_alias(obj, "directory", OBJECT(&dev->vdev),
> +                              "directory", &error_abort);
> +    object_property_add_alias(obj, "bufsize", OBJECT(&dev->vdev),
> +                              "bufsize", &error_abort);
> +    object_property_add_alias(obj, "console", OBJECT(&dev->vdev),
> +                              "console", &error_abort);
> +}
> +
> +static const TypeInfo virtio_pstore_pci_info = {
> +    .name          = TYPE_VIRTIO_PSTORE_PCI,
> +    .parent        = TYPE_VIRTIO_PCI,
> +    .instance_size = sizeof(VirtIOPstorePCI),
> +    .instance_init = virtio_pstore_pci_instance_init,
> +    .class_init    = virtio_pstore_pci_class_init,
> +};
> +
>  /* virtio-pci-bus */
>  
>  static void virtio_pci_bus_new(VirtioBusState *bus, size_t bus_size,
> @@ -2483,6 +2536,7 @@ static void virtio_pci_register_types(void)
>  #ifdef CONFIG_VHOST_SCSI
>      type_register_static(&vhost_scsi_pci_info);
>  #endif
> +    type_register_static(&virtio_pstore_pci_info);
>  }
>  
>  type_init(virtio_pci_register_types)
> diff --git a/hw/virtio/virtio-pci.h b/hw/virtio/virtio-pci.h
> index e4548c2..b4c039f 100644
> --- a/hw/virtio/virtio-pci.h
> +++ b/hw/virtio/virtio-pci.h
> @@ -31,6 +31,7 @@
>  #ifdef CONFIG_VHOST_SCSI
>  #include "hw/virtio/vhost-scsi.h"
>  #endif
> +#include "hw/virtio/virtio-pstore.h"
>  
>  typedef struct VirtIOPCIProxy VirtIOPCIProxy;
>  typedef struct VirtIOBlkPCI VirtIOBlkPCI;
> @@ -44,6 +45,7 @@ typedef struct VirtIOInputPCI VirtIOInputPCI;
>  typedef struct VirtIOInputHIDPCI VirtIOInputHIDPCI;
>  typedef struct VirtIOInputHostPCI VirtIOInputHostPCI;
>  typedef struct VirtIOGPUPCI VirtIOGPUPCI;
> +typedef struct VirtIOPstorePCI VirtIOPstorePCI;
>  
>  /* virtio-pci-bus */
>  
> @@ -311,6 +313,18 @@ struct VirtIOGPUPCI {
>      VirtIOGPU vdev;
>  };
>  
> +/*
> + * virtio-pstore-pci: This extends VirtioPCIProxy.
> + */
> +#define TYPE_VIRTIO_PSTORE_PCI "virtio-pstore-pci"
> +#define VIRTIO_PSTORE_PCI(obj) \
> +        OBJECT_CHECK(VirtIOPstorePCI, (obj), TYPE_VIRTIO_PSTORE_PCI)
> +
> +struct VirtIOPstorePCI {
> +    VirtIOPCIProxy parent_obj;
> +    VirtIOPstore vdev;
> +};
> +
>  /* Virtio ABI version, if we increment this, we break the guest driver. */
>  #define VIRTIO_PCI_ABI_VERSION          0
>  
> diff --git a/hw/virtio/virtio-pstore.c b/hw/virtio/virtio-pstore.c
> new file mode 100644
> index 0000000..2ca7786
> --- /dev/null
> +++ b/hw/virtio/virtio-pstore.c
> @@ -0,0 +1,477 @@
> +/*
> + * Virtio Pstore Device
> + *
> + * Copyright (C) 2016  LG Electronics
> + *
> + * Authors:
> + *  Namhyung Kim  <namhyung@gmail.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include <stdio.h>
> +
> +#include "qemu/osdep.h"
> +#include "qemu/iov.h"
> +#include "qemu-common.h"
> +#include "qemu/cutils.h"
> +#include "qemu/error-report.h"
> +#include "sysemu/kvm.h"
> +#include "qapi/visitor.h"
> +#include "qapi-event.h"
> +#include "trace.h"
> +
> +#include "hw/virtio/virtio.h"
> +#include "hw/virtio/virtio-bus.h"
> +#include "hw/virtio/virtio-access.h"
> +#include "hw/virtio/virtio-pstore.h"
> +
> +
> +static void virtio_pstore_to_filename(VirtIOPstore *s, char *buf, size_t sz,
> +                                      struct virtio_pstore_req *req)
> +{
> +    const char *basename;
> +    unsigned long long id = 0;
> +    unsigned int flags = le32_to_cpu(req->flags);
> +
> +    switch (le16_to_cpu(req->type)) {
> +    case VIRTIO_PSTORE_TYPE_DMESG:
> +        basename = "dmesg";
> +        id = s->id++;
> +        break;
> +    case VIRTIO_PSTORE_TYPE_CONSOLE:
> +        basename = "console";
> +        if (s->console_id) {
> +            id = s->console_id;
> +        } else {
> +            id = s->console_id = s->id++;
> +        }
> +        break;
> +    default:
> +        basename = "unknown";
> +        break;
> +    }
> +
> +    snprintf(buf, sz, "%s/%s-%llu%s", s->directory, basename, id,
> +             flags & VIRTIO_PSTORE_FL_COMPRESSED ? ".enc.z" : "");

Please use g_strdup_printf() instead of splattering into a pre-allocated
buffer than may or may not be large enough.

> +}
> +
> +static void virtio_pstore_from_filename(VirtIOPstore *s, char *name,
> +                                        char *buf, size_t sz,
> +                                        struct virtio_pstore_fileinfo *info)
> +{
> +    snprintf(buf, sz, "%s/%s", s->directory, name);
> +
> +    if (g_str_has_prefix(name, "dmesg-")) {
> +        info->type = VIRTIO_PSTORE_TYPE_DMESG;
> +        name += strlen("dmesg-");
> +    } else if (g_str_has_prefix(name, "console-")) {
> +        info->type = VIRTIO_PSTORE_TYPE_CONSOLE;
> +        name += strlen("console-");
> +    } else if (g_str_has_prefix(name, "unknown-")) {
> +        info->type = VIRTIO_PSTORE_TYPE_UNKNOWN;
> +        name += strlen("unknown-");
> +    }
> +
> +    qemu_strtoull(name, NULL, 0, &info->id);
> +
> +    info->flags = 0;
> +    if (g_str_has_suffix(name, ".enc.z")) {
> +        info->flags |= VIRTIO_PSTORE_FL_COMPRESSED;
> +    }
> +}
> +
> +static ssize_t virtio_pstore_do_open(VirtIOPstore *s)
> +{
> +    s->dirp = opendir(s->directory);
> +    if (s->dirp == NULL) {
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +static ssize_t virtio_pstore_do_read(VirtIOPstore *s, struct iovec *in_sg,
> +                                     unsigned int in_num,
> +                                     struct virtio_pstore_res *res)
> +{
> +    char path[PATH_MAX];

Don't declare PATH_MAX sized variables

> +    int fd;
> +    ssize_t len;
> +    struct stat stbuf;
> +    struct dirent *dent;
> +    int sg_num = in_num;
> +    struct iovec sg[sg_num];

'sg_num' is initialized from 'in_num' which comes from the
guest, and I'm not seeing anything which is bounds-checking
the 'in_num' value. So you've possibly got a security flaw here
I think, if the guest can cause QEMU to allocate arbitrary stack
memory & thus overflow by setting arbitrarily large in_num.

> +    struct virtio_pstore_fileinfo info;
> +    size_t offset = sizeof(*res) + sizeof(info);
> +
> +    if (s->dirp == NULL) {
> +        return -1;
> +    }
> +
> +    dent = readdir(s->dirp);
> +    while (dent) {
> +        if (dent->d_name[0] != '.') {
> +            break;
> +        }
> +        dent = readdir(s->dirp);
> +    }
> +
> +    if (dent == NULL) {
> +        return 0;
> +    }

So this seems to just be picking the first filename reported by
readdir that isn't starting with '.'. Surely this can't the right
logic when your corresponding do_write method can pick several
different filenames, its potluck which do_read will give back.

> +
> +    /* skip res and fileinfo */
> +    sg_num = iov_copy(sg, sg_num, in_sg, in_num, offset,
> +                      iov_size(in_sg, in_num) - offset);
> +
> +    virtio_pstore_from_filename(s, dent->d_name, path, sizeof(path), &info);
> +    fd = open(path, O_RDONLY);
> +    if (fd < 0) {
> +        error_report("cannot open %s", path);
> +        return -1;
> +    }
> +
> +    if (fstat(fd, &stbuf) < 0) {
> +        len = -1;
> +        goto out;
> +    }
> +
> +    len = readv(fd, sg, sg_num);
> +    if (len < 0) {
> +        if (errno == EAGAIN) {
> +            len = 0;
> +        }
> +        goto out;
> +    }
> +
> +    info.id        = cpu_to_le64(info.id);
> +    info.type      = cpu_to_le16(info.type);
> +    info.flags     = cpu_to_le32(info.flags);
> +    info.len       = cpu_to_le32(len);
> +    info.time_sec  = cpu_to_le64(stbuf.st_ctim.tv_sec);
> +    info.time_nsec = cpu_to_le32(stbuf.st_ctim.tv_nsec);
> +
> +    iov_from_buf(in_sg, in_num, sizeof(*res), &info, sizeof(info));
> +    len += sizeof(info);
> +
> + out:
> +    close(fd);
> +    return len;
> +}
> +
> +static ssize_t virtio_pstore_do_write(VirtIOPstore *s, struct iovec *out_sg,
> +                                      unsigned int out_num,
> +                                      struct virtio_pstore_req *req)
> +{
> +    char path[PATH_MAX];
> +    int fd;
> +    ssize_t len;
> +    unsigned short type;
> +    int flags = O_WRONLY | O_CREAT;
> +
> +    /* we already consume the req */
> +    iov_discard_front(&out_sg, &out_num, sizeof(*req));
> +
> +    virtio_pstore_to_filename(s, path, sizeof(path), req);
> +
> +    type = le16_to_cpu(req->type);
> +
> +    if (type == VIRTIO_PSTORE_TYPE_DMESG) {
> +        flags |= O_TRUNC;
> +    } else if (type == VIRTIO_PSTORE_TYPE_CONSOLE) {
> +        flags |= O_APPEND;

Using O_APPEND will cause the file to grow without bound on the
host, which is highly undesirable, aka a security flaw.

> +    }
> +
> +    fd = open(path, flags, 0644);
> +    if (fd < 0) {
> +        error_report("cannot open %s", path);
> +        return -1;
> +    }
> +    len = writev(fd, out_sg, out_num);
> +    close(fd);
> +
> +    return len;
> +}


Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [Qemu-devel] [PATCH 6/7] qemu: Implement virtio-pstore device
From: Daniel P. Berrange @ 2016-07-28 13:08 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Tony Luck, Kees Cook, kvm, Radim Krčmář,
	Anton Vorontsov, LKML, Steven Rostedt, qemu-devel, Minchan Kim,
	Paolo Bonzini, Michael S. Tsirkin, Anthony Liguori, Colin Cross,
	Namhyung Kim, virtualization, Ingo Molnar
In-Reply-To: <20160728125607.GA31002@stefanha-x1.localdomain>

On Thu, Jul 28, 2016 at 01:56:07PM +0100, Stefan Hajnoczi wrote:
> On Thu, Jul 28, 2016 at 02:39:53PM +0900, Namhyung Kim wrote:
> > On Thu, Jul 28, 2016 at 03:02:54AM +0300, Michael S. Tsirkin wrote:
> > > On Thu, Jul 28, 2016 at 12:08:30AM +0900, Namhyung Kim wrote:
> > > > +static ssize_t virtio_pstore_do_write(VirtIOPstore *s, struct iovec *out_sg,
> > > > +                                      unsigned int out_num,
> > > > +                                      struct virtio_pstore_req *req)
> > > > +{
> > > > +    char path[PATH_MAX];
> > > > +    int fd;
> > > > +    ssize_t len;
> > > > +    unsigned short type;
> > > > +    int flags = O_WRONLY | O_CREAT;
> > > > +
> > > > +    /* we already consume the req */
> > > > +    iov_discard_front(&out_sg, &out_num, sizeof(*req));
> > > > +
> > > > +    virtio_pstore_to_filename(s, path, sizeof(path), req);
> > > > +
> > > > +    type = le16_to_cpu(req->type);
> > > > +
> > > > +    if (type == VIRTIO_PSTORE_TYPE_DMESG) {
> > > > +        flags |= O_TRUNC;
> > > > +    } else if (type == VIRTIO_PSTORE_TYPE_CONSOLE) {
> > > > +        flags |= O_APPEND;
> > > > +    }
> > > > +
> > > > +    fd = open(path, flags, 0644);
> > > > +    if (fd < 0) {
> > > > +        error_report("cannot open %s", path);
> > > > +        return -1;
> > > > +    }
> > > > +    len = writev(fd, out_sg, out_num);
> > > > +    close(fd);
> > > > +
> > > > +    return len;
> > > 
> > > All this is blocking VM until host io completes.
> > 
> > Hmm.. I don't know about the internals of qemu.  So does it make guest
> > stop?  If so, that's what I want to do for _DMESG. :)  As it's called
> > only on kernel oops I think it's admittable.  But for _CONSOLE, it
> > needs to do asynchronously.  Maybe I can add a thread to do the work.
> 
> Please look at include/io/channel.h.  QEMU is event-driven and tends to
> use asynchronous I/O instead of spawning threads.  The include/io/ APIs
> allow you to do asynchronous I/O in the event loop.

That is true, except for I/O to/from plain files - the QIOChannelFile
impl doesn't do anything special (yet) to make that work correctly in
non-blocking mode. Of course that could be fixed...

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply

* Re: [PATCH 6/7] qemu: Implement virtio-pstore device
From: Stefan Hajnoczi @ 2016-07-28 12:56 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Tony Luck, Kees Cook, kvm, Radim Krčmář,
	Anton Vorontsov, LKML, Steven Rostedt, qemu-devel, Minchan Kim,
	Michael S. Tsirkin, Anthony Liguori, Colin Cross, Paolo Bonzini,
	virtualization, Ingo Molnar
In-Reply-To: <20160728053953.GB4371@sejong>


[-- Attachment #1.1: Type: text/plain, Size: 1887 bytes --]

On Thu, Jul 28, 2016 at 02:39:53PM +0900, Namhyung Kim wrote:
> On Thu, Jul 28, 2016 at 03:02:54AM +0300, Michael S. Tsirkin wrote:
> > On Thu, Jul 28, 2016 at 12:08:30AM +0900, Namhyung Kim wrote:
> > > +static ssize_t virtio_pstore_do_write(VirtIOPstore *s, struct iovec *out_sg,
> > > +                                      unsigned int out_num,
> > > +                                      struct virtio_pstore_req *req)
> > > +{
> > > +    char path[PATH_MAX];
> > > +    int fd;
> > > +    ssize_t len;
> > > +    unsigned short type;
> > > +    int flags = O_WRONLY | O_CREAT;
> > > +
> > > +    /* we already consume the req */
> > > +    iov_discard_front(&out_sg, &out_num, sizeof(*req));
> > > +
> > > +    virtio_pstore_to_filename(s, path, sizeof(path), req);
> > > +
> > > +    type = le16_to_cpu(req->type);
> > > +
> > > +    if (type == VIRTIO_PSTORE_TYPE_DMESG) {
> > > +        flags |= O_TRUNC;
> > > +    } else if (type == VIRTIO_PSTORE_TYPE_CONSOLE) {
> > > +        flags |= O_APPEND;
> > > +    }
> > > +
> > > +    fd = open(path, flags, 0644);
> > > +    if (fd < 0) {
> > > +        error_report("cannot open %s", path);
> > > +        return -1;
> > > +    }
> > > +    len = writev(fd, out_sg, out_num);
> > > +    close(fd);
> > > +
> > > +    return len;
> > 
> > All this is blocking VM until host io completes.
> 
> Hmm.. I don't know about the internals of qemu.  So does it make guest
> stop?  If so, that's what I want to do for _DMESG. :)  As it's called
> only on kernel oops I think it's admittable.  But for _CONSOLE, it
> needs to do asynchronously.  Maybe I can add a thread to do the work.

Please look at include/io/channel.h.  QEMU is event-driven and tends to
use asynchronous I/O instead of spawning threads.  The include/io/ APIs
allow you to do asynchronous I/O in the event loop.

Stefan

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH] drivers: virtio_blk: notify blk-core when hw-queue number changes
From: Bob Liu @ 2016-07-28  8:18 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: mst, linux-kernel, virtualization
In-Reply-To: <3cd56d3a-85f0-9cce-d42f-f4eace84667b@redhat.com>


On 06/19/2016 06:10 AM, Paolo Bonzini wrote:
> 
> 
> On 13/06/2016 11:58, Bob Liu wrote:
>> A guest might be migrated to other hosts with different num_queues, the
>> blk-core should aware of that else the reference of &vblk->vqs[qid] may be wrong.
>>
>> Signed-off-by: Bob Liu <bob.liu@oracle.com>
>> ---
>>  drivers/block/virtio_blk.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
>> index 42758b5..c169238 100644
>> --- a/drivers/block/virtio_blk.c
>> +++ b/drivers/block/virtio_blk.c
>> @@ -819,6 +819,9 @@ static int virtblk_restore(struct virtio_device *vdev)
>>  	if (ret)
>>  		return ret;
>>  
>> +	if (vblk->num_vqs != vblk->tag_set.nr_hw_queues)
>> +		blk_mq_update_nr_hw_queues(&vblk->tag_set, vblk->num_vqs);
>> +
>>  	virtio_device_ready(vdev);
>>  
>>  	blk_mq_start_stopped_hw_queues(vblk->disk->queue, true);
>>
> 
> This should never happen; it'd be a configuration problem.
> 

Do you mean all hosts have to be configured with the same number of ->num_vqs?
What about cases like migrating a guest from HostA to HostB while HostB is much more powerful
and would like to run more hardware queues to get better performance.

Thanks,
Bob Liu

^ permalink raw reply

* RE: [PATCH v2 repost 7/7] virtio-balloon: tell host vm's free page info
From: Li, Liang Z @ 2016-07-28  7:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org, Amit Shah,
	qemu-devel@nongnu.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Vlastimil Babka, Paolo Bonzini, Andrew Morton,
	virtualization@lists.linux-foundation.org, Mel Gorman,
	dgilbert@redhat.com
In-Reply-To: <20160728004606-mutt-send-email-mst@kernel.org>

> >  }
> >
> > +static void update_free_pages_stats(struct virtio_balloon *vb,
> 
> why _stats?

Will change.

> > +	max_pfn = get_max_pfn();
> > +	mutex_lock(&vb->balloon_lock);
> > +	while (pfn < max_pfn) {
> > +		memset(vb->page_bitmap, 0, vb->bmap_len);
> > +		ret = get_free_pages(pfn, pfn + vb->pfn_limit,
> > +			vb->page_bitmap, vb->bmap_len * BITS_PER_BYTE);
> > +		hdr->cmd = cpu_to_virtio16(vb->vdev,
> BALLOON_GET_FREE_PAGES);
> > +		hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
> > +		hdr->req_id = cpu_to_virtio64(vb->vdev, req_id);
> > +		hdr->start_pfn = cpu_to_virtio64(vb->vdev, pfn);
> > +		bmap_len = vb->pfn_limit / BITS_PER_BYTE;
> > +		if (!ret) {
> > +			hdr->flag = cpu_to_virtio16(vb->vdev,
> > +
> 	BALLOON_FLAG_DONE);
> > +			if (pfn + vb->pfn_limit > max_pfn)
> > +				bmap_len = (max_pfn - pfn) /
> BITS_PER_BYTE;
> > +		} else
> > +			hdr->flag = cpu_to_virtio16(vb->vdev,
> > +
> 	BALLOON_FLAG_CONT);
> > +		hdr->bmap_len = cpu_to_virtio64(vb->vdev, bmap_len);
> > +		sg_init_one(&sg_out, hdr,
> > +			 sizeof(struct balloon_bmap_hdr) + bmap_len);
> 
> Wait a second. This adds the same buffer multiple times in a loop.
> We will overwrite the buffer without waiting for hypervisor to process it.
> What did I miss?

I am no quite sure about this part, I though the virtqueue_kick(vq) will prevent
the buffer from overwrite, I realized it's wrong.

> > +
> > +		virtqueue_add_outbuf(vq, &sg_out, 1, vb, GFP_KERNEL);
> 
> this can fail. you want to maybe make sure vq has enough space before you
> use it or check error and wait.
> 
> > +		virtqueue_kick(vq);
> 
> why kick here within loop? wait until done. in fact kick outside lock is better
> for smp.

I will change this part in v3.

> 
> > +		pfn += vb->pfn_limit;
> > +	static const char * const names[] = { "inflate", "deflate", "stats",
> > +						 "misc" };
> >  	int err, nvqs;
> >
> >  	/*
> >  	 * We expect two virtqueues: inflate and deflate, and
> >  	 * optionally stat.
> >  	 */
> > -	nvqs = virtio_has_feature(vb->vdev,
> VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> > +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ))
> > +		nvqs = 4;
> 
> Does misc vq depend on stats vq feature then? if yes please validate that.

Yes, what's you mean by 'validate' that?

> 
> 
> > +	else
> > +		nvqs = virtio_has_feature(vb->vdev,
> > +					  VIRTIO_BALLOON_F_STATS_VQ) ? 3 :
> 2;
> 
> Replace that ? with else too pls.

Will change.

Thanks!
Liang

^ permalink raw reply

* Re: [PATCH v4] virtio: new feature to detect IOMMU device quirk
From: Christoph Hellwig @ 2016-07-28  6:59 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Christoph Hellwig, linux-kernel, virtualization
In-Reply-To: <1469659378-15957-1-git-send-email-mst@redhat.com>

Again, this is still the wrong way around.  A "noiommu" feature is a
quirk and should not be the default.

^ permalink raw reply

* RE: [virtio-dev] Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process
From: Li, Liang Z @ 2016-07-28  6:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org,
	Hansen, Dave, qemu-devel@nongnu.org, Amit Shah,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, linux-mm@kvack.org,
	Vlastimil Babka, Paolo Bonzini, Andrew Morton, Mel Gorman,
	dgilbert@redhat.com
In-Reply-To: <20160728044000-mutt-send-email-mst@kernel.org>

> > > This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.
> > > How big was the pfn buffer before?
> >
> > Yes, it is if the max pfn is more than 32GB.
> > The size of the pfn buffer use before is 256*4 = 1024 Bytes, it's too
> > small, and it's the main reason for bad performance.
> > Use the max 1MB kmalloc is a balance between performance and
> > flexibility, a large page bitmap covers the range of all the memory is
> > no good for a system with huge amount of memory. If the bitmap is too
> > small, it means we have to traverse a long list for many times, and it's bad
> for performance.
> >
> > Thanks!
> > Liang
> 
> There are all your implementation decisions though.
> 
> If guest memory is so fragmented that you only have order 0 4k pages, then
> allocating a huge 1M contigious chunk is very problematic in and of itself.
> 

The memory is allocated in the probe stage. This will not happen if the driver is
 loaded when booting the guest.

> Most people rarely migrate and do not care how fast that happens.
> Wasting a large chunk of memory (and it's zeroed for no good reason, so you
> actually request host memory for it) for everyone to speed it up when it
> does happen is not really an option.
> 
If people don't plan to do inflating/deflating, they should not enable the virtio-balloon
at the beginning, once they decide to use it, the driver should provide better performance
as much as possible.

1MB is a very small portion for a VM with more than 32GB memory and it's the *worst case*, 
for VM with less than 32GB memory, the amount of RAM depends on VM's memory size
and will be less than 1MB.

If 1MB is too big, how about 512K, or 256K?  32K seems too small.

Liang

> --
> MST
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

^ permalink raw reply

* Re: [PATCH 6/7] qemu: Implement virtio-pstore device
From: Namhyung Kim @ 2016-07-28  5:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Tony Luck, Kees Cook, kvm, Radim Krčmář,
	Anton Vorontsov, LKML, Steven Rostedt, qemu-devel, Minchan Kim,
	Anthony Liguori, Colin Cross, Paolo Bonzini, virtualization,
	Ingo Molnar
In-Reply-To: <20160728023254-mutt-send-email-mst@kernel.org>

On Thu, Jul 28, 2016 at 03:02:54AM +0300, Michael S. Tsirkin wrote:
> On Thu, Jul 28, 2016 at 12:08:30AM +0900, Namhyung Kim wrote:
> > Add virtio pstore device to allow kernel log files saved on the host.
> > It will save the log files on the directory given by pstore device
> > option.
> > 
> >   $ qemu-system-x86_64 -device virtio-pstore,directory=dir-xx ...
> > 
> >   (guest) # echo c > /proc/sysrq-trigger
> 
> So if the point is handling system crashes, I suspect
> virtio is the wrong protocol to use. ATM it's rather
> elaborate for performance, it's likely not to work when
> linux is crashing. I think you want something very very
> simple that will still work when you crash. Like maybe
> a serial device?

Well, I dont' know.  As you know, the kernel oops dump is already sent
to serial device but it's rather slow.  As I wrote in the cover
letter, enabling ftrace_dump_on_oops makes it even worse..  Also
pstore saves the (compressed) binary data so I thought it'd be better
to have a dedicated IO channel.

I know that we can't rely on anything in kernel when the kernel is
crashing.  In the virtualization environment, I think it'd be great if
it can dump the crash data in the host directly so I chose the virtio.
I thought the virtio is a relatively thin layer and independent from
other kernel features.  Even if it doesn't guarantee to work 100%,
it's worth trying IMHO.


> 
> >   $ ls dir-xx
> >   dmesg-1.enc.z  dmesg-2.enc.z
> > 
> > The log files are usually compressed using zlib.  Users can see the log
> > messages directly on the host or on the guest (using pstore filesystem).
> 
> So this lacks all management tools that a regular storage device
> has, and these are necessary if people are to use this
> in production.
> 
> For example, some kind of provision for limiting the amount of host disk
> this can consume seems called for. Rate-limiting disk writes
> on host also seems necessary. Handling host disk errors always
> propagates error to guest but an option to e.g. stop vm might
> be useful to avoid corruption.

Yes, it needs that kind of management.  I'd like to look at the
existing implementation.  But basically I thought it as a debugging
feature and it won't produce much data for the default setting.  Maybe
I can limit the file size not to exceed the buffer size and keep up to
pre-configured number of files for each type.  Now host disk error
will be propagated to guest.


> 
> > 
> > The 'directory' property is required for virtio-pstore device to work.
> > It also adds 'bufsize' and 'console' (boolean) properties.
> 
> No idea what these do. Seem to be RW by guest but have no effect
> otherwise.

The 'directory' is a pathname which it saves the data files.  The
'bufsize' sets the size of buffer used by pstore.  The 'console'
enables saving pstore console type data.


> 
> 
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Radim Krčmář <rkrcmar@redhat.com>
> > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > Cc: Anthony Liguori <aliguori@amazon.com>
> > Cc: Anton Vorontsov <anton@enomsg.org>
> > Cc: Colin Cross <ccross@android.com>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Tony Luck <tony.luck@intel.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: kvm@vger.kernel.org
> > Cc: qemu-devel@nongnu.org
> > Cc: virtualization@lists.linux-foundation.org
> > Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> > ---

[SNIP]
> > diff --git a/hw/virtio/virtio-pstore.c b/hw/virtio/virtio-pstore.c
> > new file mode 100644
> > index 0000000..2ca7786
> > --- /dev/null
> > +++ b/hw/virtio/virtio-pstore.c
> > @@ -0,0 +1,477 @@
> > +/*
> > + * Virtio Pstore Device
> > + *
> > + * Copyright (C) 2016  LG Electronics
> > + *
> > + * Authors:
> > + *  Namhyung Kim  <namhyung@gmail.com>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2.
> 
> We generally ask new code to be v2 or later.

Ok, will update.


> 
> >  See
> > + * the COPYING file in the top-level directory.
> > + *
> > + */
> > +
> > +#include <stdio.h>
> > +
> > +#include "qemu/osdep.h"
> > +#include "qemu/iov.h"
> > +#include "qemu-common.h"
> > +#include "qemu/cutils.h"
> > +#include "qemu/error-report.h"
> > +#include "sysemu/kvm.h"
> > +#include "qapi/visitor.h"
> > +#include "qapi-event.h"
> > +#include "trace.h"
> > +
> > +#include "hw/virtio/virtio.h"
> > +#include "hw/virtio/virtio-bus.h"
> > +#include "hw/virtio/virtio-access.h"
> > +#include "hw/virtio/virtio-pstore.h"
> > +
> > +
> > +static void virtio_pstore_to_filename(VirtIOPstore *s, char *buf, size_t sz,
> > +                                      struct virtio_pstore_req *req)
> > +{
> > +    const char *basename;
> > +    unsigned long long id = 0;
> > +    unsigned int flags = le32_to_cpu(req->flags);
> > +
> > +    switch (le16_to_cpu(req->type)) {
> > +    case VIRTIO_PSTORE_TYPE_DMESG:
> > +        basename = "dmesg";
> > +        id = s->id++;
> > +        break;
> > +    case VIRTIO_PSTORE_TYPE_CONSOLE:
> > +        basename = "console";
> > +        if (s->console_id) {
> > +            id = s->console_id;
> > +        } else {
> > +            id = s->console_id = s->id++;
> > +        }
> > +        break;
> > +    default:
> > +        basename = "unknown";
> > +        break;
> > +    }
> > +
> > +    snprintf(buf, sz, "%s/%s-%llu%s", s->directory, basename, id,
> > +             flags & VIRTIO_PSTORE_FL_COMPRESSED ? ".enc.z" : "");
> > +}
> > +
> > +static void virtio_pstore_from_filename(VirtIOPstore *s, char *name,
> > +                                        char *buf, size_t sz,
> > +                                        struct virtio_pstore_fileinfo *info)
> > +{
> > +    snprintf(buf, sz, "%s/%s", s->directory, name);
> 
> if this does not fit, buf will not be 0 terminated and can
> generally be corrupted.

Will change it to use malloc instead.

> 
> 
> > +
> > +    if (g_str_has_prefix(name, "dmesg-")) {
> > +        info->type = VIRTIO_PSTORE_TYPE_DMESG;
> > +        name += strlen("dmesg-");
> > +    } else if (g_str_has_prefix(name, "console-")) {
> > +        info->type = VIRTIO_PSTORE_TYPE_CONSOLE;
> > +        name += strlen("console-");
> > +    } else if (g_str_has_prefix(name, "unknown-")) {
> > +        info->type = VIRTIO_PSTORE_TYPE_UNKNOWN;
> > +        name += strlen("unknown-");
> > +    }
> > +
> > +    qemu_strtoull(name, NULL, 0, &info->id);
> > +
> > +    info->flags = 0;
> > +    if (g_str_has_suffix(name, ".enc.z")) {
> > +        info->flags |= VIRTIO_PSTORE_FL_COMPRESSED;
> > +    }
> > +}

[SNIP]
> > +static ssize_t virtio_pstore_do_write(VirtIOPstore *s, struct iovec *out_sg,
> > +                                      unsigned int out_num,
> > +                                      struct virtio_pstore_req *req)
> > +{
> > +    char path[PATH_MAX];
> > +    int fd;
> > +    ssize_t len;
> > +    unsigned short type;
> > +    int flags = O_WRONLY | O_CREAT;
> > +
> > +    /* we already consume the req */
> > +    iov_discard_front(&out_sg, &out_num, sizeof(*req));
> > +
> > +    virtio_pstore_to_filename(s, path, sizeof(path), req);
> > +
> > +    type = le16_to_cpu(req->type);
> > +
> > +    if (type == VIRTIO_PSTORE_TYPE_DMESG) {
> > +        flags |= O_TRUNC;
> > +    } else if (type == VIRTIO_PSTORE_TYPE_CONSOLE) {
> > +        flags |= O_APPEND;
> > +    }
> > +
> > +    fd = open(path, flags, 0644);
> > +    if (fd < 0) {
> > +        error_report("cannot open %s", path);
> > +        return -1;
> > +    }
> > +    len = writev(fd, out_sg, out_num);
> > +    close(fd);
> > +
> > +    return len;
> 
> All this is blocking VM until host io completes.

Hmm.. I don't know about the internals of qemu.  So does it make guest
stop?  If so, that's what I want to do for _DMESG. :)  As it's called
only on kernel oops I think it's admittable.  But for _CONSOLE, it
needs to do asynchronously.  Maybe I can add a thread to do the work.


> 
> 
> > +}
> > +
> > +static ssize_t virtio_pstore_do_close(VirtIOPstore *s)
> > +{
> > +    if (s->dirp == NULL) {
> > +        return 0;
> > +    }
> > +
> > +    closedir(s->dirp);
> > +    s->dirp = NULL;
> > +
> > +    return 0;
> > +}
> > +
> > +static ssize_t virtio_pstore_do_erase(VirtIOPstore *s,
> > +                                      struct virtio_pstore_req *req)
> > +{
> > +    char path[PATH_MAX];
> > +
> > +    virtio_pstore_to_filename(s, path, sizeof(path), req);
> > +
> > +    return unlink(path);
> > +}
> > +
> > +static void virtio_pstore_handle_io(VirtIODevice *vdev, VirtQueue *vq)
> > +{
> > +    VirtIOPstore *s = VIRTIO_PSTORE(vdev);
> > +    VirtQueueElement *elem;
> > +    struct virtio_pstore_req req;
> > +    struct virtio_pstore_res res;
> > +    ssize_t len = 0;
> > +    int ret;
> > +
> > +    for (;;) {
> > +        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
> > +        if (!elem) {
> > +            return;
> > +        }
> > +
> > +        if (elem->out_num < 1 || elem->in_num < 1) {
> > +            error_report("request or response buffer is missing");
> > +            exit(1);
> > +        }
> > +
> > +        len = iov_to_buf(elem->out_sg, elem->out_num, 0, &req, sizeof(req));
> > +        if (len != (ssize_t)sizeof(req)) {
> > +            error_report("invalid request size: %ld", (long)len);
> > +            exit(1);
> > +        }
> > +        res.cmd  = req.cmd;
> > +        res.type = req.type;
> > +
> > +        switch (le16_to_cpu(req.cmd)) {
> > +        case VIRTIO_PSTORE_CMD_OPEN:
> > +            ret = virtio_pstore_do_open(s);
> > +            break;
> > +        case VIRTIO_PSTORE_CMD_READ:
> > +            ret = virtio_pstore_do_read(s, elem->in_sg, elem->in_num, &res);
> > +            if (ret > 0) {
> > +                len = ret;
> > +                ret = 0;
> > +            }
> > +            break;
> > +        case VIRTIO_PSTORE_CMD_WRITE:
> > +            ret = virtio_pstore_do_write(s, elem->out_sg, elem->out_num, &req);
> > +            break;
> > +        case VIRTIO_PSTORE_CMD_CLOSE:
> > +            ret = virtio_pstore_do_close(s);
> > +            break;
> > +        case VIRTIO_PSTORE_CMD_ERASE:
> > +            ret = virtio_pstore_do_erase(s, &req);
> > +            break;
> > +        default:
> > +            ret = -1;
> > +            break;
> > +        }
> > +
> > +        res.ret  = ret;
> > +
> > +        iov_from_buf(elem->in_sg, elem->in_num, 0, &res, sizeof(res));
> > +        virtqueue_push(vq, elem, sizeof(res) + len);
> 
> this is wrong - len should be # of bytes written into guest memory.

???

All command except READ only write the virtio_pstore_ret so len is 0.
For READ, virtio_pstore_do_read() returns the actual bytes it wrote
(minus sizeof(res) of course), so I think it's fine, no?

Anyway, thanks for your review!

Thanks,
Namhyung


> 
> > +
> > +        virtio_notify(vdev, vq);
> > +        g_free(elem);
> > +
> > +        if (ret < 0) {
> > +            return;
> > +        }
> > +    }
> > +}
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* RE: [virtio-dev] Re: [PATCH v2 repost 6/7] mm: add the related functions to get free page info
From: Li, Liang Z @ 2016-07-28  5:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org, Amit Shah,
	qemu-devel@nongnu.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Vlastimil Babka, Paolo Bonzini, Andrew Morton,
	virtualization@lists.linux-foundation.org, Mel Gorman,
	dgilbert@redhat.com
In-Reply-To: <20160728010921-mutt-send-email-mst@kernel.org>

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7da61ad..3ad8b10
> > 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4523,6 +4523,52 @@ unsigned long get_max_pfn(void)  }
> > EXPORT_SYMBOL(get_max_pfn);
> >
> > +static void mark_free_pages_bitmap(struct zone *zone, unsigned long
> start_pfn,
> > +	unsigned long end_pfn, unsigned long *bitmap, unsigned long len) {
> > +	unsigned long pfn, flags, page_num;
> > +	unsigned int order, t;
> > +	struct list_head *curr;
> > +
> > +	if (zone_is_empty(zone))
> > +		return;
> > +	end_pfn = min(start_pfn + len, end_pfn);
> > +	spin_lock_irqsave(&zone->lock, flags);
> > +
> > +	for_each_migratetype_order(order, t) {
> 
> Why not do each order separately? This way you can use a single bit to pass a
> huge page to host.
> 

I thought about that before, and did not that because of complexity and small benefits.
Use separated page bitmaps for each order won't help to reduce the data traffic, except
ignoring the pages with small order. 

> Not a requirement but hey.
> 
> Alternatively (and maybe that is a better idea0 if you wanted to, you could
> just skip lone 4K pages.
> It's not clear that they are worth bothering with.
> Add a flag to start with some reasonably large order and go from there.
> 
One of the main reason of this patch is to reduce the network traffic as much as possible,
it looks strange to skip the lone 4K pages. Skipping these pages can made live migration
faster? I think it depends on the amount of lone 4K pages.

In the other hand, it's faster to send one bit through virtio than that send 4K bytes 
through even 10Gps network, is that true?

Liang

^ permalink raw reply

* RE: [PATCH v2 repost 6/7] mm: add the related functions to get free page info
From: Li, Liang Z @ 2016-07-28  4:36 UTC (permalink / raw)
  To: Hansen, Dave, Michael S. Tsirkin
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org, Amit Shah,
	qemu-devel@nongnu.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Vlastimil Babka, Paolo Bonzini, Andrew Morton,
	virtualization@lists.linux-foundation.org, Mel Gorman,
	dgilbert@redhat.com
In-Reply-To: <579932D9.6000106@intel.com>

> On 07/27/2016 03:05 PM, Michael S. Tsirkin wrote:
> > On Wed, Jul 27, 2016 at 09:40:56AM -0700, Dave Hansen wrote:
> >> On 07/26/2016 06:23 PM, Liang Li wrote:
> >>> +	for_each_migratetype_order(order, t) {
> >>> +		list_for_each(curr, &zone->free_area[order].free_list[t]) {
> >>> +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
> >>> +			if (pfn >= start_pfn && pfn <= end_pfn) {
> >>> +				page_num = 1UL << order;
> >>> +				if (pfn + page_num > end_pfn)
> >>> +					page_num = end_pfn - pfn;
> >>> +				bitmap_set(bitmap, pfn - start_pfn,
> page_num);
> >>> +			}
> >>> +		}
> >>> +	}
> >>
> >> Nit:  The 'page_num' nomenclature really confused me here.  It is the
> >> number of bits being set in the bitmap.  Seems like calling it
> >> nr_pages or num_pages would be more appropriate.
> >>
> >> Isn't this bitmap out of date by the time it's send up to the
> >> hypervisor?  Is there something that makes the inaccuracy OK here?
> >
> > Yes. Calling these free pages is unfortunate. It's likely to confuse
> > people thinking they can just discard these pages.
> >
> > Hypervisor sends a request. We respond with this list of pages, and
> > the guarantee hypervisor needs is that these were free sometime
> > between request and response, so they are safe to free if they are
> > unmodified since the request. hypervisor can detect modifications so
> > it can detect modifications itself and does not need guest help.
> 
> Ahh, that makes sense.
> 
> So the hypervisor is trying to figure out: "Which pages do I move?".  It wants
> to know which pages the guest thinks have good data and need to move.
> But, the list of free pages is (likely) smaller than the list of pages with good
> data, so it asks for that instead.
> 
> A write to a page means that it has valuable data, regardless of whether it
> was in the free list or not.
> 
> The hypervisor only skips moving pages that were free *and* were never
> written to.  So we never lose data, even if this "get free page info"
> stuff is totally out of date.
> 
> The patch description and code comments are, um, a _bit_ light for this level
> of subtlety. :)

I will add more description about this in v3.

Thanks!
Liang

^ permalink raw reply

* RE: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process
From: Li, Liang Z @ 2016-07-28  3:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org, Amit Shah,
	qemu-devel@nongnu.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Vlastimil Babka, Paolo Bonzini, Andrew Morton,
	virtualization@lists.linux-foundation.org, Mel Gorman,
	dgilbert@redhat.com
In-Reply-To: <20160728010616-mutt-send-email-mst@kernel.org>

> > +/*
> > + * VIRTIO_BALLOON_PFNS_LIMIT is used to limit the size of page bitmap
> > + * to prevent a very large page bitmap, there are two reasons for this:
> > + * 1) to save memory.
> > + * 2) allocate a large bitmap may fail.
> > + *
> > + * The actual limit of pfn is determined by:
> > + * pfn_limit = min(max_pfn, VIRTIO_BALLOON_PFNS_LIMIT);
> > + *
> > + * If system has more pages than VIRTIO_BALLOON_PFNS_LIMIT, we will
> > +scan
> > + * the page list and send the PFNs with several times. To reduce the
> > + * overhead of scanning the page list. VIRTIO_BALLOON_PFNS_LIMIT
> > +should
> > + * be set with a value which can cover most cases.
> > + */
> > +#define VIRTIO_BALLOON_PFNS_LIMIT ((32 * (1ULL << 30)) >>
> PAGE_SHIFT)
> > +/* 32GB */
> > +
> >  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> >
> > +extern unsigned long get_max_pfn(void);
> > +
> 
> Please just include the correct header. No need for this hackery.
> 

Will change. Thanks!

Liang

^ permalink raw reply

* RE: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process
From: Li, Liang Z @ 2016-07-28  3:30 UTC (permalink / raw)
  To: Michael S. Tsirkin, Hansen, Dave
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org, Amit Shah,
	qemu-devel@nongnu.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Vlastimil Babka, Paolo Bonzini, Andrew Morton,
	virtualization@lists.linux-foundation.org, Mel Gorman,
	dgilbert@redhat.com
In-Reply-To: <20160728003644-mutt-send-email-mst@kernel.org>

> Subject: Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate
> process
> 
> On Wed, Jul 27, 2016 at 09:03:21AM -0700, Dave Hansen wrote:
> > On 07/26/2016 06:23 PM, Liang Li wrote:
> > > +	vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > > +	vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > > +	vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > > +		 BITS_PER_BYTE + 2 * sizeof(unsigned long);
> > > +	hdr_len = sizeof(struct balloon_bmap_hdr);
> > > +	vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
> >
> > This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.
> > How big was the pfn buffer before?
> 
> 
> Yes I would limit this to 1G memory in a go, will result in a 32KByte bitmap.
> 
> --
> MST

Limit to 1G is bad for the performance, I sent you the test result several weeks ago.

Paste it bellow:
------------------------------------------------------------------------------------------------------------------------
About the size of page bitmap, I have test the performance of filling the balloon to 15GB with a
 16GB RAM VM.

===============================
32K Byte (cover 1GB of RAM)

Time spends on inflating: 2031ms
---------------------------------------------
64K Byte (cover 2GB of RAM)

Time spends on inflating: 1507ms
--------------------------------------------
512K Byte (cover 16GB of RAM)

Time spends on inflating: 1237ms
================================

If possible, a big bitmap is better for performance.

Liang

^ permalink raw reply

* RE: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process
From: Li, Liang Z @ 2016-07-28  3:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org, Amit Shah,
	qemu-devel@nongnu.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Vlastimil Babka, Paolo Bonzini, Andrew Morton,
	virtualization@lists.linux-foundation.org, Mel Gorman,
	dgilbert@redhat.com
In-Reply-To: <20160728002243-mutt-send-email-mst@kernel.org>

> > + * VIRTIO_BALLOON_PFNS_LIMIT is used to limit the size of page bitmap
> > + * to prevent a very large page bitmap, there are two reasons for this:
> > + * 1) to save memory.
> > + * 2) allocate a large bitmap may fail.
> > + *
> > + * The actual limit of pfn is determined by:
> > + * pfn_limit = min(max_pfn, VIRTIO_BALLOON_PFNS_LIMIT);
> > + *
> > + * If system has more pages than VIRTIO_BALLOON_PFNS_LIMIT, we will
> > +scan
> > + * the page list and send the PFNs with several times. To reduce the
> > + * overhead of scanning the page list. VIRTIO_BALLOON_PFNS_LIMIT
> > +should
> > + * be set with a value which can cover most cases.
> 
> So what if it covers 1/32 of the memory? We'll do 32 exits and not 1, still not a
> big deal for a big guest.
> 

The issue here is the overhead is too high for scanning the page list for 32 times.
Limit the page bitmap size to a fixed value is better for a big guest?

> > + */
> > +#define VIRTIO_BALLOON_PFNS_LIMIT ((32 * (1ULL << 30)) >>
> PAGE_SHIFT)
> > +/* 32GB */
> 
> I already said this with a smaller limit.
> 
> 	2<< 30  is 2G but that is not a useful comment.
> 	pls explain what is the reason for this selection.
> 
> Still applies here.
> 

I will add the comment for this.

> > -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > +	if (virtio_has_feature(vb->vdev,
> VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> > +		struct balloon_bmap_hdr *hdr = vb->bmap_hdr;
> > +		unsigned long bmap_len;
> > +
> > +		/* cmd and req_id are not used here, set them to 0 */
> > +		hdr->cmd = cpu_to_virtio16(vb->vdev, 0);
> > +		hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
> > +		hdr->reserved = cpu_to_virtio16(vb->vdev, 0);
> > +		hdr->req_id = cpu_to_virtio64(vb->vdev, 0);
> 
> no need to swap 0, just fill it in. in fact you allocated all 0s so no need to touch
> these fields at all.
> 

Will change in v3.

> > @@ -489,7 +612,7 @@ static int virtballoon_migratepage(struct
> > balloon_dev_info *vb_dev_info,  static int virtballoon_probe(struct
> > virtio_device *vdev)  {
> >  	struct virtio_balloon *vb;
> > -	int err;
> > +	int err, hdr_len;
> >
> >  	if (!vdev->config->get) {
> >  		dev_err(&vdev->dev, "%s failure: config access disabled\n",
> @@
> > -508,6 +631,18 @@ static int virtballoon_probe(struct virtio_device *vdev)
> >  	spin_lock_init(&vb->stop_update_lock);
> >  	vb->stop_update = false;
> >  	vb->num_pages = 0;
> > +	vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > +	vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > +	vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > +		 BITS_PER_BYTE + 2 * sizeof(unsigned long);
> 
> What are these 2 longs in aid of?
> 
The rounddown(vb->start_pfn,  BITS_PER_LONG) and roundup(vb->end_pfn, BITS_PER_LONG) 
may cause (vb->end_pfn - vb->start_pfn) > vb->pfn_limit, so we need extra space to save the
bitmap for this case. 2 longs are enough.

> > +	hdr_len = sizeof(struct balloon_bmap_hdr);
> > +	vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
> 
> So it can go up to 1MByte but adding header size etc you need a higher order
> allocation. This is a waste, there is no need to have a power of two allocation.
> Start from the other side. Say "I want to allocate 32KBytes for the bitmap".
> Subtract the header and you get bitmap size.
> Calculate the pfn limit from there.
> 

Indeed, will change. Thanks a lot!

Liang

^ permalink raw reply

* Re: [RFC/PATCHSET 0/7] virtio: Implement virtio pstore device (v2)
From: Namhyung Kim @ 2016-07-28  2:46 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Tony Luck, Kees Cook, kvm, Radim Krčmář, LKML,
	Anton Vorontsov, Will Deacon, qemu-devel, Steven Rostedt,
	virtualization, Minchan Kim, Anthony Liguori, Colin Cross,
	Paolo Bonzini, Ingo Molnar
In-Reply-To: <20160728011602-mutt-send-email-mst@kernel.org>

Hello,

On Thu, Jul 28, 2016 at 01:18:42AM +0300, Michael S. Tsirkin wrote:
> On Thu, Jul 28, 2016 at 12:08:24AM +0900, Namhyung Kim wrote:
> > Hello,
> > 
> > This is v2 of the virtio-pstore work.  In this patchset I addressed
> > most of feedbacks from previous version.  Limiting disk size is not
> > implemented yet.
> 
> For some reason, only parts of the patchset were received.
> Pls post all patches to all lists.
> 
> If you are changing the virtio interface with host,
> like a new device, they you must copy the virtio TC
> so make sure there are no objections from there.
> 
> https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=virtio#feedback

Hmm.. I think I've CC-ed all the kvm, qemu and virtualization lists
but missed the virtio list, sorry.  Will add next time.

Thanks,
Namhyung

^ permalink raw reply

* Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process
From: Michael S. Tsirkin @ 2016-07-28  1:45 UTC (permalink / raw)
  To: Li, Liang Z
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org,
	Hansen, Dave, qemu-devel@nongnu.org, Amit Shah,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, linux-mm@kvack.org,
	Vlastimil Babka, Paolo Bonzini, Andrew Morton, Mel Gorman,
	dgilbert@redhat.com
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E04213CCB@shsmsx102.ccr.corp.intel.com>

On Thu, Jul 28, 2016 at 01:13:35AM +0000, Li, Liang Z wrote:
> > Subject: Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate
> > process
> > 
> > On 07/26/2016 06:23 PM, Liang Li wrote:
> > > +	vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > > +	vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > > +	vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > > +		 BITS_PER_BYTE + 2 * sizeof(unsigned long);
> > > +	hdr_len = sizeof(struct balloon_bmap_hdr);
> > > +	vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
> > 
> > This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.  How big
> > was the pfn buffer before?
> 
> Yes, it is if the max pfn is more than 32GB.
> The size of the pfn buffer use before is 256*4 = 1024 Bytes, it's too small, 
> and it's the main reason for bad performance.
> Use the max 1MB kmalloc is a balance between performance and flexibility,
> a large page bitmap covers the range of all the memory is no good for a system
> with huge amount of memory. If the bitmap is too small, it means we have
> to traverse a long list for many times, and it's bad for performance.
> 
> Thanks!
> Liang   

There are all your implementation decisions though.

If guest memory is so fragmented that you only have order 0 4k pages,
then allocating a huge 1M contigious chunk is very problematic
in and of itself.

Most people rarely migrate and do not care how fast that happens.
Wasting a large chunk of memory (and it's zeroed for no good reason, so you
actually request host memory for it) for everyone to speed it up
when it does happen is not really an option.

-- 
MST

^ permalink raw reply

* Re: ext4 error when testing virtio-scsi & vhost-scsi
From: Zhangfei Gao @ 2016-07-28  1:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Ts'o, kvm, Michael S. Tsirkin, qemu-devel,
	virtualization, target-devel, linux-ext4
In-Reply-To: <20160727155608.GC1659@quack2.suse.cz>

Hi, Jan

On Wed, Jul 27, 2016 at 11:56 PM, Jan Kara <jack@suse.cz> wrote:
> Hi!
>
> On Wed 27-07-16 15:58:55, Zhangfei Gao wrote:
>> Hi, Michael
>>
>> I have met ext4 error when using vhost_scsi on arm64 platform, and
>> suspect it is vhost_scsi issue.
>>
>> Ext4 error when testing virtio_scsi & vhost_scsi
>>
>>
>> No issue:
>> 1. virtio_scsi, ext4
>> 2. vhost_scsi & virtio_scsi, ext2
>> 3.  Instead of vhost, also tried loopback and no problem.
>> Using loopback, host can use the new block device, while vhost is used
>> by guest (qemu).
>> http://www.linux-iscsi.org/wiki/Tcm_loop
>> Test directly in host, not find ext4 error.
>>
>>
>>
>> Have issue:
>> 1. vhost_scsi & virtio_scsi, ext4
>> a. iblock
>> b, fileio, file located in /tmp (ram), no device based.
>>
>> 2, Have tried 4.7-r2 and 4.5-rc1 on D02 board, both have issue.
>> Since I need kvm specific patch for D02, so it may not freely to switch
>> to older version.
>>
>> 3. Also test with ext4, disabling journal
>> mkfs.ext4 -O ^has_journal /dev/sda
>>
>>
>> Do you have any suggestion?
>
> So can you mount the filesystem with errors=remount-ro to avoid clobbering
> the fs after the problem happens? And then run e2fsck on the problematic
> filesystem and send the output here?
>

Tested twice, log pasted.
Both using fileio, located in host ramfs /tmp
Before e2fsck, umount /dev/sda

1.
root@(none)$ mount -o errors=remount-ro /dev/sda /mnt
[   22.812053] EXT4-fs (sda): mounted filesystem with ordered data
mode. Opts: errors=remount-ro
$ rm /mnt/test
[  108.388905] EXT4-fs error (device sda) in
ext4_reserve_inode_write:5362: Corrupt filesystem
[  108.406930] Aborting journal on device sda-8.
[  108.414120] EXT4-fs (sda): Remounting filesystem read-only
[  108.414847] EXT4-fs error (device sda) in ext4_dirty_inode:5487: IO failure
[  108.423571] EXT4-fs error (device sda) in ext4_free_blocks:4904:
Journal has aborted
[  108.431919] EXT4-fs error (device sda) in
ext4_reserve_inode_write:5362: Corrupt filesystem
[  108.440269] EXT4-fs error (device sda) in
ext4_reserve_inode_write:5362: Corrupt filesystem
[  108.448568] EXT4-fs error (device sda) in
ext4_ext_remove_space:3058: IO failure
[  108.456917] EXT4-fs error (device sda) in ext4_ext_truncate:4657:
Corrupt filesystem
[  108.465267] EXT4-fs error (device sda) in
ext4_reserve_inode_write:5362: Corrupt filesystem
[  108.473567] EXT4-fs error (device sda) in ext4_truncate:4150: IO failure
[  108.481917] EXT4-fs error (device sda) in
ext4_reserve_inode_write:5362: Corrupt filesystem
root@(none)$ e2fsck /dev/sda
e2fsck 1.42.9 (28-Dec-2013)
/dev/sda is mounted.
e2fsck: Cannot continue, aborting.


root@(none)$ umount /mnt
[  260.756250] EXT4-fs error (device sda): ext4_put_super:837:
Couldn't clean up the journal
root@(none)$ umount /mnt           e2fsck /dev/sda
e2fsck 1.42.9 (28-Dec-2013)
ext2fs_open2: Bad magic number in super-block
e2fsck: Superblock invalid, trying backup blocks...
Superblock needs_recovery flag is clear, but journal has data.
Recovery flag not set in backup superblock, so running journal anyway.
/dev/sda: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #1 (32703, counted=8127).
Fix<y>? yes
Free blocks count wrong for group #2 (32768, counted=31744).
Fix<y>? yes
Free blocks count wrong (249509, counted=223909).
Fix<y>? yes
Free inodes count wrong for group #0 (8181, counted=8180).
Fix<y>? yes
Free inodes count wrong (65525, counted=65524).
Fix<y>? yes

/dev/sda: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda: 12/65536 files (8.3% non-contiguous), 38235/262144 blocks
root@(none)$

2.

 root@(none)$ rm /mnt/test
[   71.021484] EXT4-fs error (device sda) in
ext4_reserve_inode_write:5362: Corrupt filesystem
[   71.044959] Aborting journal on device sda-8.
[   71.052152] EXT4-fs (sda): Remounting filesystem read-only
[   71.052833] EXT4-fs error (device sda) in ext4_dirty_inode:5487: IO failure
[   71.061600] EXT4-fs error (device sda) in ext4_free_blocks:4904:
Journal has aborted
[   71.069948] EXT4-fs error (device sda) in
ext4_reserve_inode_write:5362: Corrupt filesystem
[   71.078296] EXT4-fs error (device sda) in
ext4_reserve_inode_write:5362: Corrupt filesystem
[   71.086597] EXT4-fs error (device sda) in
ext4_ext_remove_space:3058: IO failure
[   71.094946] EXT4-fs error (device sda) in ext4_ext_truncate:4657:
Corrupt filesystem
[   71.103296] EXT4-fs error (device sda) in
ext4_reserve_inode_write:5362: Corrupt filesystem
[   71.111595] EXT4-fs error (device sda) in ext4_truncate:4150: IO failure
[   71.119946] EXT4-fs error (device sda) in
ext4_reserve_inode_write:5362: Corrupt filesystem
root@(none)$ e2fsck /dev/sda
e2fsck 1.42.9 (28-Dec-2013)
/dev/sda is mounted.
e2fsck: Cannot continue, aborting.


root@(none)$ umou nt /mnt/
[   92.103221] EXT4-fs error (device sda): ext4_put_super:837:
Couldn't clean up the journal
root@(none)$ umount /mnt/            e2fsck /dev/sda
e2fsck 1.42.9 (28-Dec-2013)
ext2fs_open2: Bad magic number in super-block
e2fsck: Superblock invalid, trying backup blocks...
Superblock needs_recovery flag is clear, but journal has data.
Recovery flag not set in backup superblock, so running journal anyway.
/dev/sda: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #1 (32703, counted=8127).
Fix<y>? yes
Free blocks count wrong for group #2 (32768, counted=31744).
Fix<y>? yes
Free blocks count wrong (249509, counted=223909).
Fix<y>? yes
Free inodes count wrong for group #0 (8181, counted=8180).
Fix<y>? yes
Free inodes count wrong (65525, counted=65524).
Fix<y>? yes

/dev/sda: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda: 12/65536 files (8.3% non-contiguous), 38235/262144 blocks
root@(none)$


Thanks

^ permalink raw reply

* RE: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process
From: Li, Liang Z @ 2016-07-28  1:13 UTC (permalink / raw)
  To: Hansen, Dave, linux-kernel@vger.kernel.org
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org, Amit Shah,
	Michael S. Tsirkin, qemu-devel@nongnu.org, dgilbert@redhat.com,
	linux-mm@kvack.org, Paolo Bonzini, Andrew Morton,
	virtualization@lists.linux-foundation.org, Mel Gorman,
	Vlastimil Babka
In-Reply-To: <5798DB49.7030803@intel.com>

> Subject: Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate
> process
> 
> On 07/26/2016 06:23 PM, Liang Li wrote:
> > +	vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > +	vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > +	vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > +		 BITS_PER_BYTE + 2 * sizeof(unsigned long);
> > +	hdr_len = sizeof(struct balloon_bmap_hdr);
> > +	vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
> 
> This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.  How big
> was the pfn buffer before?

Yes, it is if the max pfn is more than 32GB.
The size of the pfn buffer use before is 256*4 = 1024 Bytes, it's too small, 
and it's the main reason for bad performance.
Use the max 1MB kmalloc is a balance between performance and flexibility,
a large page bitmap covers the range of all the memory is no good for a system
with huge amount of memory. If the bitmap is too small, it means we have
to traverse a long list for many times, and it's bad for performance.

Thanks!
Liang   

^ permalink raw reply

* Re: [PATCH v2 repost 6/7] mm: add the related functions to get free page info
From: Michael S. Tsirkin @ 2016-07-28  0:17 UTC (permalink / raw)
  To: Li, Liang Z
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org,
	Hansen, Dave, qemu-devel@nongnu.org, Amit Shah,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, linux-mm@kvack.org,
	Vlastimil Babka, Paolo Bonzini, Andrew Morton, Mel Gorman,
	dgilbert@redhat.com
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E04213C27@shsmsx102.ccr.corp.intel.com>

On Thu, Jul 28, 2016 at 12:10:16AM +0000, Li, Liang Z wrote:
> > Subject: Re: [PATCH v2 repost 6/7] mm: add the related functions to get free
> > page info
> > 
> > On 07/26/2016 06:23 PM, Liang Li wrote:
> > > +	for_each_migratetype_order(order, t) {
> > > +		list_for_each(curr, &zone->free_area[order].free_list[t]) {
> > > +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
> > > +			if (pfn >= start_pfn && pfn <= end_pfn) {
> > > +				page_num = 1UL << order;
> > > +				if (pfn + page_num > end_pfn)
> > > +					page_num = end_pfn - pfn;
> > > +				bitmap_set(bitmap, pfn - start_pfn,
> > page_num);
> > > +			}
> > > +		}
> > > +	}
> > 
> > Nit:  The 'page_num' nomenclature really confused me here.  It is the
> > number of bits being set in the bitmap.  Seems like calling it nr_pages or
> > num_pages would be more appropriate.
> > 
> 
> You are right,  will change.
> 
> > Isn't this bitmap out of date by the time it's send up to the hypervisor?  Is
> > there something that makes the inaccuracy OK here?
> 
> Yes. The dirty page logging will be used to correct the inaccuracy.
> The dirty page logging should be started before getting the free page bitmap, then if some of the free pages become no free for writing, these pages will be tracked by the dirty page logging mechanism.
> 
> Thanks!
> Liang

Right but this should be clear from code and naming.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox