Linux virtualization list

Linux virtualization list
 help / color / mirror / Atom feed

* Re: [PATCH net-next V2] tun: introduce tx skb ring
From: Jason Wang @ 2016-06-23  5:14 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, eric.dumazet, netdev, linux-kernel, virtualization, brouer,
	davem
In-Reply-To: <20160622211620-mutt-send-email-mst@redhat.com>



On 2016年06月23日 02:18, Michael S. Tsirkin wrote:
> On Fri, Jun 17, 2016 at 03:41:20AM +0300, Michael S. Tsirkin wrote:
>> >Would it help to have ptr_ring_resize that gets an array of
>> >rings and resizes them both to same length?
> OK, here it is. Untested so far, and no skb wrapper.
> Pls let me know whether this is what you had in mind.

Exactly what I want.

Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net-next V2] tun: introduce tx skb ring
From: Michael S. Tsirkin @ 2016-06-22 18:18 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, eric.dumazet, netdev, linux-kernel, virtualization, brouer,
	davem
In-Reply-To: <20160617033218-mutt-send-email-mst@redhat.com>

On Fri, Jun 17, 2016 at 03:41:20AM +0300, Michael S. Tsirkin wrote:
> Would it help to have ptr_ring_resize that gets an array of
> rings and resizes them both to same length?

OK, here it is. Untested so far, and no skb wrapper.
Pls let me know whether this is what you had in mind.

-->

ptr_ring: support resizing multiple queues

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

---

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index a29b023..e576801 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -354,20 +354,14 @@ static inline int ptr_ring_init(struct ptr_ring *r, int size, int pad, gfp_t gfp
 	return 0;
 }
 
-static inline int ptr_ring_resize(struct ptr_ring *r, int size, gfp_t gfp,
-				  void (*destroy)(void *))
+static inline void **__ptr_ring_swap_queue(struct ptr_ring *r, void **queue,
+					   int size, gfp_t gfp,
+					   void (*destroy)(void *))
 {
-	unsigned long flags;
 	int producer = 0;
-	void **queue = __ptr_ring_init_queue_alloc(size, gfp);
 	void **old;
 	void *ptr;
 
-	if (!queue)
-		return -ENOMEM;
-
-	spin_lock_irqsave(&(r)->producer_lock, flags);
-
 	while ((ptr = ptr_ring_consume(r)))
 		if (producer < size)
 			queue[producer++] = ptr;
@@ -380,6 +374,23 @@ static inline int ptr_ring_resize(struct ptr_ring *r, int size, gfp_t gfp,
 	old = r->queue;
 	r->queue = queue;
 
+	return old;
+}
+
+static inline int ptr_ring_resize(struct ptr_ring *r, int size, gfp_t gfp,
+				  void (*destroy)(void *))
+{
+	unsigned long flags;
+	void **queue = __ptr_ring_init_queue_alloc(size, gfp);
+	void **old;
+
+	if (!queue)
+		return -ENOMEM;
+
+	spin_lock_irqsave(&(r)->producer_lock, flags);
+
+	old = __ptr_ring_swap_queue(r, queue, size, gfp, destroy);
+
 	spin_unlock_irqrestore(&(r)->producer_lock, flags);
 
 	kfree(old);
@@ -387,6 +398,49 @@ static inline int ptr_ring_resize(struct ptr_ring *r, int size, gfp_t gfp,
 	return 0;
 }
 
+static inline int ptr_ring_resize_multiple(struct ptr_ring **rings, int nrings,
+					   int size,
+					   gfp_t gfp, void (*destroy)(void *))
+{
+	unsigned long flags;
+	void ***queues;
+	int i;
+
+	queues = kmalloc(nrings * sizeof *queues, gfp);
+	if (!queues)
+		goto noqueues;
+
+	for (i = 0; i < nrings; ++i) {
+		queues[i] = __ptr_ring_init_queue_alloc(size, gfp);
+		if (!queues[i])
+			goto nomem;
+	}
+
+	spin_lock_irqsave(&(rings[i])->producer_lock, flags);
+
+	for (i = 0; i < nrings; ++i)
+		queues[i] = __ptr_ring_swap_queue(rings[i], queues[i],
+						  size, gfp, destroy);
+
+	spin_unlock_irqrestore(&(rings[i])->producer_lock, flags);
+
+	for (i = 0; i < nrings; ++i)
+		kfree(queues[i]);
+
+	kfree(queues);
+
+	return 0;
+
+nomem:
+	while (--i >= 0)
+		kfree(queues[i]);
+
+	kfree(queues);
+
+noqueues:
+	return -ENOMEM;
+}
+
 static inline void ptr_ring_cleanup(struct ptr_ring *r, void (*destroy)(void *))
 {
 	void *ptr;
diff --git a/tools/virtio/ringtest/ptr_ring.c b/tools/virtio/ringtest/ptr_ring.c
index 26dc1d2..deb36af 100644
--- a/tools/virtio/ringtest/ptr_ring.c
+++ b/tools/virtio/ringtest/ptr_ring.c
@@ -17,6 +17,11 @@
 typedef pthread_spinlock_t  spinlock_t;
 
 typedef int gfp_t;
+static void *kmalloc(unsigned size, gfp_t gfp)
+{
+	return memalign(64, size);
+}
+
 static void *kzalloc(unsigned size, gfp_t gfp)
 {
 	void *p = memalign(64, size);

^ permalink raw reply related

* Re: [PATCH 3/3] vhost: device IOTLB API
From: kbuild test robot @ 2016-06-22 14:55 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, mst, netdev, linux-kernel, peterx, virtualization,
	kbuild-all, vkaplans, wexu
In-Reply-To: <1466589047-2271-4-git-send-email-jasowang@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 7855 bytes --]

Hi,

[auto build test ERROR on next-20160622]
[also build test ERROR on v4.7-rc4]
[cannot apply to vhost/linux-next v4.7-rc4 v4.7-rc3 v4.7-rc2]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jason-Wang/basic-device-IOTLB-support/20160622-175522
config: i386-allmodconfig (attached as .config)
compiler: gcc-6 (Debian 6.1.1-1) 6.1.1 20160430
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   gcc-6: warning: '-mcpu=' is deprecated; use '-mtune=' or '-march=' instead
   drivers/vhost/vhost.c: In function 'vhost_copy_to_user':
>> drivers/vhost/vhost.c:742:28: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
      ret = translate_desc(vq, (u64)to, size, vq->iotlb_iov,
                               ^
   drivers/vhost/vhost.c: In function 'vhost_copy_from_user':
   drivers/vhost/vhost.c:771:28: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
      ret = translate_desc(vq, (u64)from, size, vq->iotlb_iov,
                               ^
   In file included from include/linux/printk.h:289:0,
                    from include/linux/kernel.h:13,
                    from include/linux/list.h:8,
                    from include/linux/wait.h:6,
                    from include/linux/eventfd.h:12,
                    from drivers/vhost/vhost.c:14:
   drivers/vhost/vhost.c:777:11: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
              (unsigned long long) from,
              ^
   include/linux/dynamic_debug.h:127:10: note: in definition of macro 'dynamic_pr_debug'
           ##__VA_ARGS__);  \
             ^~~~~~~~~~~
>> drivers/vhost/vhost.h:220:3: note: in expansion of macro 'pr_debug'
      pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
      ^~~~~~~~
>> drivers/vhost/vhost.c:775:4: note: in expansion of macro 'vq_err'
       vq_err(vq, "IOTLB translation failure: uaddr "
       ^~~~~~
   drivers/vhost/vhost.c: In function '__vhost_get_user':
   drivers/vhost/vhost.c:802:27: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
     ret = translate_desc(vq, (u64)addr, size, vq->iotlb_iov,
                              ^
   In file included from include/linux/printk.h:289:0,
                    from include/linux/kernel.h:13,
                    from include/linux/list.h:8,
                    from include/linux/wait.h:6,
                    from include/linux/eventfd.h:12,
                    from drivers/vhost/vhost.c:14:
   drivers/vhost/vhost.c:808:4: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
       (unsigned long long) addr,
       ^
   include/linux/dynamic_debug.h:127:10: note: in definition of macro 'dynamic_pr_debug'
           ##__VA_ARGS__);  \
             ^~~~~~~~~~~
>> drivers/vhost/vhost.h:220:3: note: in expansion of macro 'pr_debug'
      pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
      ^~~~~~~~
   drivers/vhost/vhost.c:806:3: note: in expansion of macro 'vq_err'
      vq_err(vq, "IOTLB translation failure: uaddr "
      ^~~~~~
   drivers/vhost/vhost.c:816:4: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
       (unsigned long long) addr,
       ^
   include/linux/dynamic_debug.h:127:10: note: in definition of macro 'dynamic_pr_debug'
           ##__VA_ARGS__);  \
             ^~~~~~~~~~~
>> drivers/vhost/vhost.h:220:3: note: in expansion of macro 'pr_debug'
      pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
      ^~~~~~~~
   drivers/vhost/vhost.c:814:3: note: in expansion of macro 'vq_err'
      vq_err(vq, "Non atomic userspace memory access: uaddr "
      ^~~~~~
   drivers/vhost/vhost.c: In function 'vq_iotlb_prefetch':
   drivers/vhost/vhost.c:1144:46: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
     return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)vq->desc,
                                                 ^
   drivers/vhost/vhost.c:1146:46: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
            iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)vq->avail,
                                                 ^
   drivers/vhost/vhost.c:1149:46: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
            iotlb_access_ok(vq, VHOST_ACCESS_WO, (u64)vq->used,
                                                 ^
   In file included from include/linux/printk.h:289:0,
                    from include/linux/kernel.h:13,
                    from include/linux/list.h:8,
                    from include/linux/wait.h:6,
                    from include/linux/eventfd.h:12,
                    from drivers/vhost/vhost.c:14:
   drivers/vhost/vhost.c: In function 'vhost_vq_init_access':
   drivers/vhost/vhost.c:1737:4: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
       (unsigned long long) &vq->used->idx);
       ^
   include/linux/dynamic_debug.h:127:10: note: in definition of macro 'dynamic_pr_debug'
           ##__VA_ARGS__);  \
             ^~~~~~~~~~~
>> drivers/vhost/vhost.h:220:3: note: in expansion of macro 'pr_debug'
      pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
      ^~~~~~~~
   drivers/vhost/vhost.c:1736:3: note: in expansion of macro 'vq_err'
      vq_err(vq, "Can't access used idx at 0x%llx\n",
      ^~~~~~
   gcc-6: warning: '-mcpu=' is deprecated; use '-mtune=' or '-march=' instead
--
   gcc-6: warning: '-mcpu=' is deprecated; use '-mtune=' or '-march=' instead
   drivers/vhost/scsi.c: In function 'vhost_scsi_do_evt_work':
>> drivers/vhost/scsi.c:460:9: error: too few arguments to function 'vhost_get_vq_desc'
     head = vhost_get_vq_desc(vq, vq->iov,
            ^~~~~~~~~~~~~~~~~
   In file included from drivers/vhost/scsi.c:51:0:
   drivers/vhost/vhost.h:181:5: note: declared here
    int vhost_get_vq_desc(struct vhost_virtqueue *,
        ^~~~~~~~~~~~~~~~~
   drivers/vhost/scsi.c: In function 'vhost_scsi_handle_vq':
   drivers/vhost/scsi.c:867:10: error: too few arguments to function 'vhost_get_vq_desc'
      head = vhost_get_vq_desc(vq, vq->iov,
             ^~~~~~~~~~~~~~~~~
   In file included from drivers/vhost/scsi.c:51:0:
   drivers/vhost/vhost.h:181:5: note: declared here
    int vhost_get_vq_desc(struct vhost_virtqueue *,
        ^~~~~~~~~~~~~~~~~

vim +/vhost_get_vq_desc +460 drivers/vhost/scsi.c

a6c9af87 drivers/vhost/tcm_vhost.c Asias He           2013-04-25  454  		vs->vs_events_missed = true;
a6c9af87 drivers/vhost/tcm_vhost.c Asias He           2013-04-25  455  		return;
a6c9af87 drivers/vhost/tcm_vhost.c Asias He           2013-04-25  456  	}
a6c9af87 drivers/vhost/tcm_vhost.c Asias He           2013-04-25  457  
a6c9af87 drivers/vhost/tcm_vhost.c Asias He           2013-04-25  458  again:
a6c9af87 drivers/vhost/tcm_vhost.c Asias He           2013-04-25  459  	vhost_disable_notify(&vs->dev, vq);
47283bef drivers/vhost/scsi.c      Michael S. Tsirkin 2014-06-05 @460  	head = vhost_get_vq_desc(vq, vq->iov,
a6c9af87 drivers/vhost/tcm_vhost.c Asias He           2013-04-25  461  			ARRAY_SIZE(vq->iov), &out, &in,
a6c9af87 drivers/vhost/tcm_vhost.c Asias He           2013-04-25  462  			NULL, NULL);
a6c9af87 drivers/vhost/tcm_vhost.c Asias He           2013-04-25  463  	if (head < 0) {

:::::: The code at line 460 was first introduced by commit
:::::: 47283bef7ed356629467d1fac61687756e48f254 vhost: move memory pointer to VQs

:::::: TO: Michael S. Tsirkin <mst@redhat.com>
:::::: CC: Michael S. Tsirkin <mst@redhat.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 55347 bytes --]

[-- Attachment #3: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC PATCH V3 0/3] basic device IOTLB support
From: Jason Wang @ 2016-06-22  9:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, netdev, linux-kernel, peterx, virtualization, wexu, vkaplans
In-Reply-To: <20160621164449.GC30949@redhat.com>



On 2016年06月22日 00:44, Michael S. Tsirkin wrote:
> On Tue, May 24, 2016 at 05:36:22PM +0800, Jason Wang wrote:
>> This patch tries to implement an device IOTLB for vhost. This could be
>> used with for co-operation with userspace IOMMU implementation (qemu)
>> for a secure DMA environment (DMAR) in guest.
>>
>> The idea is simple. When vhost meets an IOTLB miss, it will request
>> the assistance of userspace to do the translation, this is done
>> through:
>>
>> - when there's a IOTLB miss, it will notify userspace through
>>    vhost_net fd and then userspace read the fault address, size and
>>    access from vhost fd.
>> - userspace write the translation result back to vhost fd, vhost can
>>    then update its IOTLB.
>>
>> The codes were optimized for fixed mapping users e.g dpdk in guest. It
>> will be slow if dynamic mappings were used in guest. We could do
>> optimizations on top.
>>
>> The codes were designed to be architecture independent. It should be
>> easily ported to any architecture.
>>
>> Stress tested with l2fwd/vfio in guest with 4K/2M/1G page size. On 1G
>> hugepage case, 100% TLB hit rate were noticed.
>>
>> Changes from V2:
>> - introduce memory accessors for vhost
>> - switch from ioctls to oridinary file read/write for iotlb miss and
>>    updating
>> - do not assume virtqueue were virtually mapped contiguously, all
>>    virtqueue access were done throug IOTLB
>> - verify memory access during IOTLB update and fail early
>> - introduce a module parameter for the size of IOTLB
>>
>> Changes from V1:
>> - support any size/range of updating and invalidation through
>>    introducing the interval tree.
>> - convert from per device iotlb request to per virtqueue iotlb
>>    request, this solves the possible deadlock in V1.
>> - read/write permission check support.
>>
>> Please review.
> Nice, this looks good to me. Can you post a non-rfc please?
>
>

Posted, thanks.

>> Jason Wang (3):
>>    vhost: introduce vhost memory accessors
>>    vhost: convert pre sorted vhost memory array to interval tree
>>    vhost: device IOTLB API
>>
>>   drivers/vhost/net.c        |  63 +++-
>>   drivers/vhost/vhost.c      | 760 ++++++++++++++++++++++++++++++++++++++-------
>>   drivers/vhost/vhost.h      |  60 +++-
>>   include/uapi/linux/vhost.h |  28 ++
>>   4 files changed, 790 insertions(+), 121 deletions(-)
>>
>> -- 
>> 2.7.4
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* [PATCH 3/3] vhost: device IOTLB API
From: Jason Wang @ 2016-06-22  9:50 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: vkaplans, wexu, peterx
In-Reply-To: <1466589047-2271-1-git-send-email-jasowang@redhat.com>

This patch tries to implement an device IOTLB for vhost. This could be
used with for co-operation with userspace(qemu) implementation of DMA
remapping.

The idea is simple, cache the translation in a software device IOTLB
(which was implemented as interval tree) in vhost and use vhost_net
file descriptor for reporting IOTLB miss and IOTLB
update/invalidation. When vhost meets an IOTLB miss, the fault
address, size and access could be read from the file. After userspace
finishes the translation, it write the translated address to the
vhost_net file to update the device IOTLB.

When device IOTLB (VHOST_F_DEVICE_IOTLB) is enabled all vq address
set by ioctl were treated as iova instead of virtual address and the
accessing could only be done through IOTLB instead of direct
userspace memory access. Before each rounds or vq processing, all vq
metadata were prefetched in device IOTLB to make sure no translation
fault happens during vq processing.

In most cases, virtqueue were mapped contiguous even in virtual
address. So the IOTLB translation for virtqueue itself maybe a little
bit slower. We can add fast path on top of this patch.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c        |  54 +++-
 drivers/vhost/vhost.c      | 627 ++++++++++++++++++++++++++++++++++++++++++---
 drivers/vhost/vhost.h      |  35 ++-
 include/uapi/linux/vhost.h |  28 ++
 4 files changed, 696 insertions(+), 48 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 7641543..7ceea39 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -61,7 +61,8 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
 enum {
 	VHOST_NET_FEATURES = VHOST_FEATURES |
 			 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
-			 (1ULL << VIRTIO_NET_F_MRG_RXBUF)
+			 (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+			 (1ULL << VHOST_F_DEVICE_IOTLB)
 };
 
 enum {
@@ -334,7 +335,8 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 {
 	unsigned long uninitialized_var(endtime);
 	int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
-				    out_num, in_num, NULL, NULL);
+				  out_num, in_num, NULL, NULL,
+				  VHOST_ACCESS_RO);
 
 	if (r == vq->num && vq->busyloop_timeout) {
 		preempt_disable();
@@ -344,7 +346,8 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 			cpu_relax_lowlatency();
 		preempt_enable();
 		r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
-					out_num, in_num, NULL, NULL);
+				      out_num, in_num, NULL, NULL,
+				      VHOST_ACCESS_RO);
 	}
 
 	return r;
@@ -377,6 +380,9 @@ static void handle_tx(struct vhost_net *net)
 	if (!sock)
 		goto out;
 
+	if (!vq_iotlb_prefetch(vq))
+		goto out;
+
 	vhost_disable_notify(&net->dev, vq);
 
 	hdr_size = nvq->vhost_hlen;
@@ -564,7 +570,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
 		}
 		r = vhost_get_vq_desc(vq, vq->iov + seg,
 				      ARRAY_SIZE(vq->iov) - seg, &out,
-				      &in, log, log_num);
+				      &in, log, log_num, VHOST_ACCESS_WO);
 		if (unlikely(r < 0))
 			goto err;
 
@@ -638,6 +644,10 @@ static void handle_rx(struct vhost_net *net)
 	sock = vq->private_data;
 	if (!sock)
 		goto out;
+
+	if (!vq_iotlb_prefetch(vq))
+		goto out;
+
 	vhost_disable_notify(&net->dev, vq);
 	vhost_net_disable_vq(net, vq);
 
@@ -1087,6 +1097,11 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
 		mutex_unlock(&n->dev.mutex);
 		return -EFAULT;
 	}
+	if ((features & (1ULL << VHOST_F_DEVICE_IOTLB))) {
+		if (vhost_init_device_iotlb(&n->dev, true))
+			return -EFAULT;
+	}
+
 	for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
 		mutex_lock(&n->vqs[i].vq.mutex);
 		n->vqs[i].vq.acked_features = features;
@@ -1169,9 +1184,40 @@ static long vhost_net_compat_ioctl(struct file *f, unsigned int ioctl,
 }
 #endif
 
+static ssize_t vhost_net_chr_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct file *file = iocb->ki_filp;
+	struct vhost_net *n = file->private_data;
+	struct vhost_dev *dev = &n->dev;
+	int noblock = file->f_flags & O_NONBLOCK;
+
+	return vhost_chr_read_iter(dev, to, noblock);
+}
+
+static ssize_t vhost_net_chr_write_iter(struct kiocb *iocb,
+					struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct vhost_net *n = file->private_data;
+	struct vhost_dev *dev = &n->dev;
+
+	return vhost_chr_write_iter(dev, from);
+}
+
+static unsigned int vhost_net_chr_poll(struct file *file, poll_table *wait)
+{
+	struct vhost_net *n = file->private_data;
+	struct vhost_dev *dev = &n->dev;
+
+	return vhost_chr_poll(file, dev, wait);
+}
+
 static const struct file_operations vhost_net_fops = {
 	.owner          = THIS_MODULE,
 	.release        = vhost_net_release,
+	.read_iter      = vhost_net_chr_read_iter,
+	.write_iter     = vhost_net_chr_write_iter,
+	.poll           = vhost_net_chr_poll,
 	.unlocked_ioctl = vhost_net_ioctl,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl   = vhost_net_compat_ioctl,
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 166e779..46569fb 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -35,6 +35,10 @@ static ushort max_mem_regions = 64;
 module_param(max_mem_regions, ushort, 0444);
 MODULE_PARM_DESC(max_mem_regions,
 	"Maximum number of memory regions in memory map. (default: 64)");
+static int max_iotlb_entries = 2048;
+module_param(max_iotlb_entries, int, 0444);
+MODULE_PARM_DESC(max_iotlb_entries,
+	"Maximum number of iotlb entries. (default: 2048)");
 
 enum {
 	VHOST_MEMORY_F_LOG = 0x1,
@@ -309,6 +313,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vhost_disable_cross_endian(vq);
 	vq->busyloop_timeout = 0;
 	vq->umem = NULL;
+	vq->iotlb = NULL;
 }
 
 static int vhost_worker(void *data)
@@ -413,9 +418,14 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->log_ctx = NULL;
 	dev->log_file = NULL;
 	dev->umem = NULL;
+	dev->iotlb = NULL;
 	dev->mm = NULL;
 	spin_lock_init(&dev->work_lock);
 	INIT_LIST_HEAD(&dev->work_list);
+	init_waitqueue_head(&dev->wait);
+	INIT_LIST_HEAD(&dev->read_list);
+	INIT_LIST_HEAD(&dev->pending_list);
+	spin_lock_init(&dev->iotlb_lock);
 	dev->worker = NULL;
 
 	for (i = 0; i < dev->nvqs; ++i) {
@@ -563,6 +573,15 @@ void vhost_dev_stop(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_stop);
 
+static void vhost_umem_free(struct vhost_umem *umem,
+			    struct vhost_umem_node *node)
+{
+	vhost_umem_interval_tree_remove(node, &umem->umem_tree);
+	list_del(&node->link);
+	kfree(node);
+	umem->numem--;
+}
+
 static void vhost_umem_clean(struct vhost_umem *umem)
 {
 	struct vhost_umem_node *node, *tmp;
@@ -570,14 +589,31 @@ static void vhost_umem_clean(struct vhost_umem *umem)
 	if (!umem)
 		return;
 
-	list_for_each_entry_safe(node, tmp, &umem->umem_list, link) {
-		vhost_umem_interval_tree_remove(node, &umem->umem_tree);
-		list_del(&node->link);
-		kvfree(node);
-	}
+	list_for_each_entry_safe(node, tmp, &umem->umem_list, link)
+		vhost_umem_free(umem, node);
+
 	kvfree(umem);
 }
 
+static void vhost_clear_msg(struct vhost_dev *dev)
+{
+	struct vhost_msg_node *node, *n;
+
+	spin_lock(&dev->iotlb_lock);
+
+	list_for_each_entry_safe(node, n, &dev->read_list, node) {
+		list_del(&node->node);
+		kfree(node);
+	}
+
+	list_for_each_entry_safe(node, n, &dev->pending_list, node) {
+		list_del(&node->node);
+		kfree(node);
+	}
+
+	spin_unlock(&dev->iotlb_lock);
+}
+
 /* Caller should have device mutex if and only if locked is set */
 void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 {
@@ -606,6 +642,10 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 	/* No one will access memory at this point */
 	vhost_umem_clean(dev->umem);
 	dev->umem = NULL;
+	vhost_umem_clean(dev->iotlb);
+	dev->iotlb = NULL;
+	vhost_clear_msg(dev);
+	wake_up_interruptible_poll(&dev->wait, POLLIN | POLLRDNORM);
 	WARN_ON(!list_empty(&dev->work_list));
 	if (dev->worker) {
 		kthread_stop(dev->worker);
@@ -681,28 +721,382 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
 	return 1;
 }
 
-#define vhost_put_user(vq, x, ptr)  __put_user(x, ptr)
+static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
+			  struct iovec iov[], int iov_size, int access);
 
 static int vhost_copy_to_user(struct vhost_virtqueue *vq, void *to,
 			      const void *from, unsigned size)
 {
-	return __copy_to_user(to, from, size);
-}
+	int ret;
 
-#define vhost_get_user(vq, x, ptr) __get_user(x, ptr)
+	if (!vq->iotlb)
+		return __copy_to_user(to, from, size);
+	else {
+		/* This function should be called after iotlb
+		 * prefetch, which means we're sure that all vq
+		 * could be access through iotlb. So -EAGAIN should
+		 * not happen in this case.
+		 */
+		/* TODO: more fast path */
+		struct iov_iter t;
+		ret = translate_desc(vq, (u64)to, size, vq->iotlb_iov,
+				     ARRAY_SIZE(vq->iotlb_iov),
+				     VHOST_ACCESS_WO);
+		if (ret < 0)
+			goto out;
+		iov_iter_init(&t, WRITE, vq->iotlb_iov, ret, size);
+		ret = copy_to_iter(from, size, &t);
+		if (ret == size)
+			ret = 0;
+	}
+out:
+	return ret;
+}
 
 static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,
 				void *from, unsigned size)
 {
-	return __copy_from_user(to, from, size);
+	int ret;
+
+	if (!vq->iotlb)
+		return __copy_from_user(to, from, size);
+	else {
+		/* This function should be called after iotlb
+		 * prefetch, which means we're sure that vq
+		 * could be access through iotlb. So -EAGAIN should
+		 * not happen in this case.
+		 */
+		/* TODO: more fast path */
+		struct iov_iter f;
+		ret = translate_desc(vq, (u64)from, size, vq->iotlb_iov,
+				     ARRAY_SIZE(vq->iotlb_iov),
+				     VHOST_ACCESS_RO);
+		if (ret < 0) {
+			vq_err(vq, "IOTLB translation failure: uaddr "
+			       "0x%llx size 0x%llx\n",
+			       (unsigned long long) from,
+			       (unsigned long long) size);
+			goto out;
+		}
+		iov_iter_init(&f, READ, vq->iotlb_iov, ret, size);
+		ret = copy_from_iter(to, size, &f);
+		if (ret == size)
+			ret = 0;
+	}
+
+out:
+	return ret;
+}
+
+static void __user *__vhost_get_user(struct vhost_virtqueue *vq,
+				     void *addr, unsigned size)
+{
+	int ret;
+
+	/* This function should be called after iotlb
+	 * prefetch, which means we're sure that vq
+	 * could be access through iotlb. So -EAGAIN should
+	 * not happen in this case.
+	 */
+	/* TODO: more fast path */
+	ret = translate_desc(vq, (u64)addr, size, vq->iotlb_iov,
+			     ARRAY_SIZE(vq->iotlb_iov),
+			     VHOST_ACCESS_RO);
+	if (ret < 0) {
+		vq_err(vq, "IOTLB translation failure: uaddr "
+			"0x%llx size 0x%llx\n",
+			(unsigned long long) addr,
+			(unsigned long long) size);
+		return NULL;
+	}
+
+	if (ret != 1 || vq->iotlb_iov[0].iov_len != size) {
+		vq_err(vq, "Non atomic userspace memory access: uaddr "
+			"0x%llx size 0x%llx\n",
+			(unsigned long long) addr,
+			(unsigned long long) size);
+		return NULL;
+	}
+
+	return vq->iotlb_iov[0].iov_base;
+}
+
+#define vhost_put_user(vq, x, ptr) \
+({ \
+	int ret = -EFAULT; \
+	if (!vq->iotlb) { \
+		ret = __put_user(x, ptr); \
+	} else { \
+		__typeof__(ptr) to = \
+			(__typeof__(ptr)) __vhost_get_user(vq, ptr, sizeof(*ptr)); \
+		if (to != NULL) \
+			ret = __put_user(x, to); \
+		else \
+			ret = -EFAULT;	\
+	} \
+	ret; \
+})
+
+#define vhost_get_user(vq, x, ptr) \
+({ \
+	int ret; \
+	if (!vq->iotlb) { \
+		ret = __get_user(x, ptr); \
+	} else { \
+		__typeof__(ptr) from = \
+			(__typeof__(ptr)) __vhost_get_user(vq, ptr, sizeof(*ptr)); \
+		if (from != NULL) \
+			ret = __get_user(x, from); \
+		else \
+			ret = -EFAULT; \
+	} \
+	ret; \
+})
+
+static void vhost_dev_lock_vqs(struct vhost_dev *d)
+{
+	int i = 0;
+	for (i = 0; i < d->nvqs; ++i)
+		mutex_lock(&d->vqs[i]->mutex);
+}
+
+static void vhost_dev_unlock_vqs(struct vhost_dev *d)
+{
+	int i = 0;
+	for (i = 0; i < d->nvqs; ++i)
+		mutex_unlock(&d->vqs[i]->mutex);
+}
+
+static int vhost_new_umem_range(struct vhost_umem *umem,
+				u64 start, u64 size, u64 end,
+				u64 userspace_addr, int perm)
+{
+	struct vhost_umem_node *tmp, *node = kmalloc(sizeof(*node), GFP_ATOMIC);
+
+	if (!node)
+		return -ENOMEM;
+
+	if (umem->numem == max_iotlb_entries) {
+		tmp = list_first_entry(&umem->umem_list, typeof(*tmp), link);
+		vhost_umem_free(umem, tmp);
+	}
+
+	node->start = start;
+	node->size = size;
+	node->last = end;
+	node->userspace_addr = userspace_addr;
+	node->perm = perm;
+	INIT_LIST_HEAD(&node->link);
+	list_add_tail(&node->link, &umem->umem_list);
+	vhost_umem_interval_tree_insert(node, &umem->umem_tree);
+	umem->numem++;
+
+	return 0;
+}
+
+static void vhost_del_umem_range(struct vhost_umem *umem,
+				 u64 start, u64 end)
+{
+	struct vhost_umem_node *node;
+
+	while ((node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
+							   start, end)))
+		vhost_umem_free(umem, node);
+}
+
+static void vhost_iotlb_notify_vq(struct vhost_dev *d,
+				  struct vhost_iotlb_msg *msg)
+{
+	struct vhost_msg_node *node, *n;
+
+	spin_lock(&d->iotlb_lock);
+
+	list_for_each_entry_safe(node, n, &d->pending_list, node) {
+		struct vhost_iotlb_msg *vq_msg = &node->msg.iotlb;
+		if (msg->iova <= vq_msg->iova &&
+		    msg->iova + msg->size - 1 > vq_msg->iova &&
+		    vq_msg->type == VHOST_IOTLB_MISS) {
+			vhost_poll_queue(&node->vq->poll);
+			list_del(&node->node);
+			kfree(node);
+		}
+	}
+
+	spin_unlock(&d->iotlb_lock);
+}
+
+static int umem_access_ok(u64 uaddr, u64 size, int access)
+{
+	if ((access & VHOST_ACCESS_RO) &&
+	    !access_ok(VERIFY_READ, uaddr, size))
+		return -EFAULT;
+	if ((access & VHOST_ACCESS_WO) &&
+	    !access_ok(VERIFY_WRITE, uaddr, size))
+		return -EFAULT;
+	return 0;
+}
+
+int vhost_process_iotlb_msg(struct vhost_dev *dev,
+			    struct vhost_iotlb_msg *msg)
+{
+	int ret = 0;
+
+	vhost_dev_lock_vqs(dev);
+	switch (msg->type) {
+	case VHOST_IOTLB_UPDATE:
+		if (!dev->iotlb) {
+			ret = -EFAULT;
+			break;
+		}
+		if (umem_access_ok(msg->uaddr, msg->size, msg->perm)) {
+			ret = -EFAULT;
+			break;
+		}
+		if (vhost_new_umem_range(dev->iotlb, msg->iova, msg->size,
+					 msg->iova + msg->size - 1,
+					 msg->uaddr, msg->perm)) {
+			ret = -ENOMEM;
+			break;
+		}
+		vhost_iotlb_notify_vq(dev, msg);
+		break;
+	case VHOST_IOTLB_INVALIDATE:
+		vhost_del_umem_range(dev->iotlb, msg->iova,
+				     msg->iova + msg->size - 1);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	vhost_dev_unlock_vqs(dev);
+	return ret;
+}
+ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
+			     struct iov_iter *from)
+{
+	struct vhost_msg_node node;
+	unsigned size = sizeof(struct vhost_msg);
+	size_t ret;
+	int err;
+
+	if (iov_iter_count(from) < size)
+		return 0;
+	ret = copy_from_iter(&node.msg, size, from);
+	if (ret != size)
+		goto done;
+
+	switch (node.msg.type) {
+	case VHOST_IOTLB_MSG:
+		err = vhost_process_iotlb_msg(dev, &node.msg.iotlb);
+		if (err)
+			ret = err;
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+done:
+	return ret;
+}
+EXPORT_SYMBOL(vhost_chr_write_iter);
+
+unsigned int vhost_chr_poll(struct file *file, struct vhost_dev *dev,
+			    poll_table *wait)
+{
+	unsigned int mask = 0;
+
+	poll_wait(file, &dev->wait, wait);
+
+	if (!list_empty(&dev->read_list))
+		mask |= POLLIN | POLLRDNORM;
+
+	return mask;
+}
+EXPORT_SYMBOL(vhost_chr_poll);
+
+ssize_t vhost_chr_read_iter(struct vhost_dev *dev, struct iov_iter *to,
+			    int noblock)
+{
+	DEFINE_WAIT(wait);
+	struct vhost_msg_node *node;
+	ssize_t ret = 0;
+	unsigned size = sizeof(struct vhost_msg);
+
+	if (iov_iter_count(to) < size)
+		return 0;
+
+	while (1) {
+		if (!noblock)
+			prepare_to_wait(&dev->wait, &wait,
+					TASK_INTERRUPTIBLE);
+
+		node = vhost_dequeue_msg(dev, &dev->read_list);
+		if (node)
+			break;
+		if (noblock) {
+			ret = -EAGAIN;
+			break;
+		}
+		if (signal_pending(current)) {
+			ret = -ERESTARTSYS;
+			break;
+		}
+		if (!dev->iotlb) {
+			ret = -EBADFD;
+			break;
+		}
+
+		schedule();
+	}
+
+	if (!noblock)
+		finish_wait(&dev->wait, &wait);
+
+	if (node) {
+		ret = copy_to_iter(&node->msg, size, to);
+
+		if (ret != size || node->msg.type != VHOST_IOTLB_MISS) {
+			kfree(node);
+			return ret;
+		}
+
+		vhost_enqueue_msg(dev, &dev->pending_list, node);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vhost_chr_read_iter);
+
+static int vhost_iotlb_miss(struct vhost_virtqueue *vq, u64 iova, int access)
+{
+	struct vhost_dev *dev = vq->dev;
+	struct vhost_msg_node *node;
+	struct vhost_iotlb_msg *msg;
+
+	node = vhost_new_msg(vq, VHOST_IOTLB_MISS);
+	if (!node)
+		return -ENOMEM;
+
+	msg = &node->msg.iotlb;
+	msg->type = VHOST_IOTLB_MISS;
+	msg->iova = iova;
+	msg->perm = access;
+
+	vhost_enqueue_msg(dev, &dev->read_list, node);
+
+	return 0;
 }
 
 static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
 			struct vring_desc __user *desc,
 			struct vring_avail __user *avail,
 			struct vring_used __user *used)
+
 {
 	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
+
 	return access_ok(VERIFY_READ, desc, num * sizeof *desc) &&
 	       access_ok(VERIFY_READ, avail,
 			 sizeof *avail + num * sizeof *avail->ring + s) &&
@@ -710,6 +1104,54 @@ static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
 			sizeof *used + num * sizeof *used->ring + s);
 }
 
+static int iotlb_access_ok(struct vhost_virtqueue *vq,
+			   int access, u64 addr, u64 len)
+{
+	const struct vhost_umem_node *node;
+	struct vhost_umem *umem = vq->iotlb;
+	u64 s = 0, size;
+
+	while (len > s) {
+		node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
+							   addr,
+							   addr + len - 1);
+		if (node == NULL || node->start > addr) {
+			vhost_iotlb_miss(vq, addr, access);
+			return false;
+		} else if (!(node->perm & access)) {
+			/* Report the possible access violation by
+			 * request another translation from userspace.
+			 */
+			return false;
+		}
+
+		size = node->size - addr + node->start;
+		s += size;
+		addr += size;
+	}
+
+	return true;
+}
+
+int vq_iotlb_prefetch(struct vhost_virtqueue *vq)
+{
+	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
+	unsigned int num = vq->num;
+
+	if (!vq->iotlb)
+		return 1;
+
+	return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)vq->desc,
+			       num * sizeof *vq->desc) &&
+	       iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)vq->avail,
+			       sizeof *vq->avail +
+			       num * sizeof *vq->avail->ring + s) &&
+	       iotlb_access_ok(vq, VHOST_ACCESS_WO, (u64)vq->used,
+			       sizeof *vq->used +
+			       num * sizeof *vq->used->ring + s);
+}
+EXPORT_SYMBOL_GPL(vq_iotlb_prefetch);
+
 /* Can we log writes? */
 /* Caller should have device mutex but not vq mutex */
 int vhost_log_access_ok(struct vhost_dev *dev)
@@ -736,16 +1178,35 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
 /* Caller should have vq mutex and device mutex */
 int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 {
+	if (vq->iotlb) {
+		/* When device IOTLB was used, the access validation
+		 * will be validated during prefetching.
+		 */
+		return 1;
+	}
 	return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used) &&
 		vq_log_access_ok(vq, vq->log_base);
 }
 EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
 
+static struct vhost_umem *vhost_umem_alloc(void)
+{
+	struct vhost_umem *umem = vhost_kvzalloc(sizeof(*umem));
+
+	if (!umem)
+		return NULL;
+
+	umem->umem_tree = RB_ROOT;
+	umem->numem = 0;
+	INIT_LIST_HEAD(&umem->umem_list);
+
+	return umem;
+}
+
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 {
 	struct vhost_memory mem, *newmem;
 	struct vhost_memory_region *region;
-	struct vhost_umem_node *node;
 	struct vhost_umem *newumem, *oldumem;
 	unsigned long size = offsetof(struct vhost_memory, regions);
 	int i;
@@ -767,28 +1228,23 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 		return -EFAULT;
 	}
 
-	newumem = vhost_kvzalloc(sizeof(*newumem));
+	newumem = vhost_umem_alloc();
 	if (!newumem) {
 		kvfree(newmem);
 		return -ENOMEM;
 	}
 
-	newumem->umem_tree = RB_ROOT;
-	INIT_LIST_HEAD(&newumem->umem_list);
-
 	for (region = newmem->regions;
 	     region < newmem->regions + mem.nregions;
 	     region++) {
-		node = vhost_kvzalloc(sizeof(*node));
-		if (!node)
+		if (vhost_new_umem_range(newumem,
+					 region->guest_phys_addr,
+					 region->memory_size,
+					 region->guest_phys_addr +
+					 region->memory_size - 1,
+					 region->userspace_addr,
+					 VHOST_ACCESS_RW))
 			goto err;
-		node->start = region->guest_phys_addr;
-		node->size = region->memory_size;
-		node->last = node->start + node->size - 1;
-		node->userspace_addr = region->userspace_addr;
-		INIT_LIST_HEAD(&node->link);
-		list_add_tail(&node->link, &newumem->umem_list);
-		vhost_umem_interval_tree_insert(node, &newumem->umem_tree);
 	}
 
 	if (!memory_access_ok(d, newumem, 0))
@@ -1032,6 +1488,30 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
 }
 EXPORT_SYMBOL_GPL(vhost_vring_ioctl);
 
+int vhost_init_device_iotlb(struct vhost_dev *d, bool enabled)
+{
+	struct vhost_umem *niotlb, *oiotlb;
+	int i;
+
+	niotlb = vhost_umem_alloc();
+	if (!niotlb)
+		return -ENOMEM;
+
+	oiotlb = d->iotlb;
+	d->iotlb = niotlb;
+
+	for (i = 0; i < d->nvqs; ++i) {
+		mutex_lock(&d->vqs[i]->mutex);
+		d->vqs[i]->iotlb = niotlb;
+		mutex_unlock(&d->vqs[i]->mutex);
+	}
+
+	vhost_umem_clean(oiotlb);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vhost_init_device_iotlb);
+
 /* Caller must have device mutex */
 long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
 {
@@ -1246,15 +1726,20 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
 	if (r)
 		goto err;
 	vq->signalled_used_valid = false;
-	if (!access_ok(VERIFY_READ, &vq->used->idx, sizeof vq->used->idx)) {
+	if (!vq->iotlb &&
+	    !access_ok(VERIFY_READ, &vq->used->idx, sizeof vq->used->idx)) {
 		r = -EFAULT;
 		goto err;
 	}
 	r = vhost_get_user(vq, last_used_idx, &vq->used->idx);
-	if (r)
+	if (r) {
+		vq_err(vq, "Can't access used idx at 0x%llx\n",
+			(unsigned long long) &vq->used->idx);
 		goto err;
+	}
 	vq->last_used_idx = vhost16_to_cpu(vq, last_used_idx);
 	return 0;
+
 err:
 	vq->is_le = is_le;
 	return r;
@@ -1262,10 +1747,11 @@ err:
 EXPORT_SYMBOL_GPL(vhost_vq_init_access);
 
 static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
-			  struct iovec iov[], int iov_size)
+			  struct iovec iov[], int iov_size, int access)
 {
 	const struct vhost_umem_node *node;
-	struct vhost_umem *umem = vq->umem;
+	struct vhost_dev *dev = vq->dev;
+	struct vhost_umem *umem = dev->iotlb ? dev->iotlb : dev->umem;
 	struct iovec *_iov;
 	u64 s = 0;
 	int ret = 0;
@@ -1276,12 +1762,21 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
 			ret = -ENOBUFS;
 			break;
 		}
+
 		node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
 							addr, addr + len - 1);
 		if (node == NULL || node->start > addr) {
-			ret = -EFAULT;
+			if (umem != dev->iotlb) {
+				ret = -EFAULT;
+				break;
+			}
+			ret = -EAGAIN;
+			break;
+		} else if (!(node->perm & access)) {
+			ret = -EPERM;
 			break;
 		}
+
 		_iov = iov + ret;
 		size = node->size - addr + node->start;
 		_iov->iov_len = min((u64)len - s, size);
@@ -1292,6 +1787,8 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
 		++ret;
 	}
 
+	if (ret == -EAGAIN)
+		vhost_iotlb_miss(vq, addr, access);
 	return ret;
 }
 
@@ -1320,7 +1817,7 @@ static int get_indirect(struct vhost_virtqueue *vq,
 			struct iovec iov[], unsigned int iov_size,
 			unsigned int *out_num, unsigned int *in_num,
 			struct vhost_log *log, unsigned int *log_num,
-			struct vring_desc *indirect)
+			struct vring_desc *indirect, int access)
 {
 	struct vring_desc desc;
 	unsigned int i = 0, count, found = 0;
@@ -1338,9 +1835,10 @@ static int get_indirect(struct vhost_virtqueue *vq,
 	}
 
 	ret = translate_desc(vq, vhost64_to_cpu(vq, indirect->addr), len, vq->indirect,
-			     UIO_MAXIOV);
+			     UIO_MAXIOV, VHOST_ACCESS_RO);
 	if (unlikely(ret < 0)) {
-		vq_err(vq, "Translation failure %d in indirect.\n", ret);
+		if (ret != -EAGAIN)
+			vq_err(vq, "Translation failure %d in indirect.\n", ret);
 		return ret;
 	}
 	iov_iter_init(&from, READ, vq->indirect, ret, len);
@@ -1380,10 +1878,11 @@ static int get_indirect(struct vhost_virtqueue *vq,
 
 		ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
 				     vhost32_to_cpu(vq, desc.len), iov + iov_count,
-				     iov_size - iov_count);
+				     iov_size - iov_count, access);
 		if (unlikely(ret < 0)) {
-			vq_err(vq, "Translation failure %d indirect idx %d\n",
-			       ret, i);
+			if (ret != -EAGAIN)
+				vq_err(vq, "Translation failure %d indirect idx %d\n",
+					ret, i);
 			return ret;
 		}
 		/* If this is an input descriptor, increment that count. */
@@ -1419,7 +1918,8 @@ static int get_indirect(struct vhost_virtqueue *vq,
 int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 		      struct iovec iov[], unsigned int iov_size,
 		      unsigned int *out_num, unsigned int *in_num,
-		      struct vhost_log *log, unsigned int *log_num)
+		      struct vhost_log *log, unsigned int *log_num,
+		      int access)
 {
 	struct vring_desc desc;
 	unsigned int i, head, found = 0;
@@ -1498,10 +1998,11 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 		if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT)) {
 			ret = get_indirect(vq, iov, iov_size,
 					   out_num, in_num,
-					   log, log_num, &desc);
+					   log, log_num, &desc, access);
 			if (unlikely(ret < 0)) {
-				vq_err(vq, "Failure detected "
-				       "in indirect descriptor at idx %d\n", i);
+				if (ret != -EAGAIN)
+					vq_err(vq, "Failure detected "
+						"in indirect descriptor at idx %d\n", i);
 				return ret;
 			}
 			continue;
@@ -1509,10 +2010,11 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
 		ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
 				     vhost32_to_cpu(vq, desc.len), iov + iov_count,
-				     iov_size - iov_count);
+				     iov_size - iov_count, access);
 		if (unlikely(ret < 0)) {
-			vq_err(vq, "Translation failure %d descriptor idx %d\n",
-			       ret, i);
+			if (ret != -EAGAIN)
+				vq_err(vq, "Translation failure %d descriptor idx %d\n",
+					ret, i);
 			return ret;
 		}
 		if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_WRITE)) {
@@ -1781,6 +2283,47 @@ void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 }
 EXPORT_SYMBOL_GPL(vhost_disable_notify);
 
+/* Create a new message. */
+struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type)
+{
+	struct vhost_msg_node *node = kmalloc(sizeof *node, GFP_KERNEL);
+	if (!node)
+		return NULL;
+	node->vq = vq;
+	node->msg.type = type;
+	return node;
+}
+EXPORT_SYMBOL_GPL(vhost_new_msg);
+
+void vhost_enqueue_msg(struct vhost_dev *dev, struct list_head *head,
+		       struct vhost_msg_node *node)
+{
+	spin_lock(&dev->iotlb_lock);
+	list_add_tail(&node->node, head);
+	spin_unlock(&dev->iotlb_lock);
+
+	wake_up_interruptible_poll(&dev->wait, POLLIN | POLLRDNORM);
+}
+EXPORT_SYMBOL_GPL(vhost_enqueue_msg);
+
+struct vhost_msg_node *vhost_dequeue_msg(struct vhost_dev *dev,
+					 struct list_head *head)
+{
+	struct vhost_msg_node *node = NULL;
+
+	spin_lock(&dev->iotlb_lock);
+	if (!list_empty(head)) {
+		node = list_first_entry(head, struct vhost_msg_node,
+					node);
+		list_del(&node->node);
+	}
+	spin_unlock(&dev->iotlb_lock);
+
+	return node;
+}
+EXPORT_SYMBOL_GPL(vhost_dequeue_msg);
+
+
 static int __init vhost_init(void)
 {
 	return 0;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b93b6a0..85c1d78 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -63,13 +63,15 @@ struct vhost_umem_node {
 	__u64 last;
 	__u64 size;
 	__u64 userspace_addr;
-	__u64 flags_padding;
+	__u32 perm;
+	__u32 flags_padding;
 	__u64 __subtree_last;
 };
 
 struct vhost_umem {
 	struct rb_root umem_tree;
 	struct list_head umem_list;
+	int numem;
 };
 
 /* The virtqueue structure describes a queue attached to a device. */
@@ -117,10 +119,12 @@ struct vhost_virtqueue {
 	u64 log_addr;
 
 	struct iovec iov[UIO_MAXIOV];
+	struct iovec iotlb_iov[64];
 	struct iovec *indirect;
 	struct vring_used_elem *heads;
 	/* Protected by virtqueue mutex. */
 	struct vhost_umem *umem;
+	struct vhost_umem *iotlb;
 	void *private_data;
 	u64 acked_features;
 	/* Log write descriptors */
@@ -137,6 +141,12 @@ struct vhost_virtqueue {
 	u32 busyloop_timeout;
 };
 
+struct vhost_msg_node {
+  struct vhost_msg msg;
+  struct vhost_virtqueue *vq;
+  struct list_head node;
+};
+
 struct vhost_dev {
 	struct mm_struct *mm;
 	struct mutex mutex;
@@ -148,6 +158,11 @@ struct vhost_dev {
 	struct list_head work_list;
 	struct task_struct *worker;
 	struct vhost_umem *umem;
+	struct vhost_umem *iotlb;
+	spinlock_t iotlb_lock;
+	struct list_head read_list;
+	struct list_head pending_list;
+	wait_queue_head_t wait;
 };
 
 void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
@@ -166,7 +181,8 @@ int vhost_log_access_ok(struct vhost_dev *);
 int vhost_get_vq_desc(struct vhost_virtqueue *,
 		      struct iovec iov[], unsigned int iov_count,
 		      unsigned int *out_num, unsigned int *in_num,
-		      struct vhost_log *log, unsigned int *log_num);
+		      struct vhost_log *log, unsigned int *log_num,
+		      int access);
 void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
 
 int vhost_vq_init_access(struct vhost_virtqueue *);
@@ -184,6 +200,21 @@ bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 		    unsigned int log_num, u64 len);
+int vq_iotlb_prefetch(struct vhost_virtqueue *vq);
+
+struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type);
+void vhost_enqueue_msg(struct vhost_dev *dev,
+		       struct list_head *head,
+		       struct vhost_msg_node *node);
+struct vhost_msg_node *vhost_dequeue_msg(struct vhost_dev *dev,
+					 struct list_head *head);
+unsigned int vhost_chr_poll(struct file *file, struct vhost_dev *dev,
+			    poll_table *wait);
+ssize_t vhost_chr_read_iter(struct vhost_dev *dev, struct iov_iter *to,
+			    int noblock);
+ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
+			     struct iov_iter *from);
+int vhost_init_device_iotlb(struct vhost_dev *d, bool enabled);
 
 #define vq_err(vq, fmt, ...) do {                                  \
 		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 61a8777..8cb0a65 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -47,6 +47,32 @@ struct vhost_vring_addr {
 	__u64 log_guest_addr;
 };
 
+/* no alignment requirement */
+struct vhost_iotlb_msg {
+	__u64 iova;
+	__u64 size;
+	__u64 uaddr;
+#define VHOST_ACCESS_RO      0x1
+#define VHOST_ACCESS_WO      0x2
+#define VHOST_ACCESS_RW      0x3
+	__u8 perm;
+#define VHOST_IOTLB_MISS           1
+#define VHOST_IOTLB_UPDATE         2
+#define VHOST_IOTLB_INVALIDATE     3
+#define VHOST_IOTLB_ACCESS_FAIL    4
+	__u8 type;
+};
+
+#define VHOST_IOTLB_MSG 0x1
+
+struct vhost_msg {
+	int type;
+	union {
+		struct vhost_iotlb_msg iotlb;
+		__u8 padding[64];
+	};
+};
+
 struct vhost_memory_region {
 	__u64 guest_phys_addr;
 	__u64 memory_size; /* bytes */
@@ -146,6 +172,8 @@ struct vhost_memory {
 #define VHOST_F_LOG_ALL 26
 /* vhost-net should add virtio_net_hdr for RX, and strip for TX packets. */
 #define VHOST_NET_F_VIRTIO_NET_HDR 27
+/* Vhost have device IOTLB */
+#define VHOST_F_DEVICE_IOTLB 63
 
 /* VHOST_SCSI specific definitions */
 
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH 2/3] vhost: convert pre sorted vhost memory array to interval tree
From: Jason Wang @ 2016-06-22  9:50 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: vkaplans, wexu, peterx
In-Reply-To: <1466589047-2271-1-git-send-email-jasowang@redhat.com>

Current pre-sorted memory region array has some limitations for future
device IOTLB conversion:

1) need extra work for adding and removing a single region, and it's
   expected to be slow because of sorting or memory re-allocation.
2) need extra work of removing a large range which may intersect
   several regions with different size.
3) need trick for a replacement policy like LRU

To overcome the above shortcomings, this patch convert it to interval
tree which can easily address the above issue with almost no extra
work.

The patch could be used for:

- Extend the current API and only let the userspace to send diffs of
  memory table.
- Simplify Device IOTLB implementation.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c   |   8 +--
 drivers/vhost/vhost.c | 182 ++++++++++++++++++++++++++++----------------------
 drivers/vhost/vhost.h |  27 ++++++--
 3 files changed, 128 insertions(+), 89 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 3893ad9..7641543 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1039,20 +1039,20 @@ static long vhost_net_reset_owner(struct vhost_net *n)
 	struct socket *tx_sock = NULL;
 	struct socket *rx_sock = NULL;
 	long err;
-	struct vhost_memory *memory;
+	struct vhost_umem *umem;
 
 	mutex_lock(&n->dev.mutex);
 	err = vhost_dev_check_owner(&n->dev);
 	if (err)
 		goto done;
-	memory = vhost_dev_reset_owner_prepare();
-	if (!memory) {
+	umem = vhost_dev_reset_owner_prepare();
+	if (!umem) {
 		err = -ENOMEM;
 		goto done;
 	}
 	vhost_net_stop(n, &tx_sock, &rx_sock);
 	vhost_net_flush(n);
-	vhost_dev_reset_owner(&n->dev, memory);
+	vhost_dev_reset_owner(&n->dev, umem);
 	vhost_net_vq_reset(n);
 done:
 	mutex_unlock(&n->dev.mutex);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 9f2a63a..166e779 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -27,6 +27,7 @@
 #include <linux/cgroup.h>
 #include <linux/module.h>
 #include <linux/sort.h>
+#include <linux/interval_tree_generic.h>
 
 #include "vhost.h"
 
@@ -42,6 +43,10 @@ enum {
 #define vhost_used_event(vq) ((__virtio16 __user *)&vq->avail->ring[vq->num])
 #define vhost_avail_event(vq) ((__virtio16 __user *)&vq->used->ring[vq->num])
 
+INTERVAL_TREE_DEFINE(struct vhost_umem_node,
+		     rb, __u64, __subtree_last,
+		     START, LAST, , vhost_umem_interval_tree);
+
 #ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY
 static void vhost_disable_cross_endian(struct vhost_virtqueue *vq)
 {
@@ -300,10 +305,10 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->call_ctx = NULL;
 	vq->call = NULL;
 	vq->log_ctx = NULL;
-	vq->memory = NULL;
 	vhost_reset_is_le(vq);
 	vhost_disable_cross_endian(vq);
 	vq->busyloop_timeout = 0;
+	vq->umem = NULL;
 }
 
 static int vhost_worker(void *data)
@@ -407,7 +412,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 	mutex_init(&dev->mutex);
 	dev->log_ctx = NULL;
 	dev->log_file = NULL;
-	dev->memory = NULL;
+	dev->umem = NULL;
 	dev->mm = NULL;
 	spin_lock_init(&dev->work_lock);
 	INIT_LIST_HEAD(&dev->work_list);
@@ -512,27 +517,36 @@ err_mm:
 }
 EXPORT_SYMBOL_GPL(vhost_dev_set_owner);
 
-struct vhost_memory *vhost_dev_reset_owner_prepare(void)
+static void *vhost_kvzalloc(unsigned long size)
 {
-	return kmalloc(offsetof(struct vhost_memory, regions), GFP_KERNEL);
+	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+
+	if (!n)
+		n = vzalloc(size);
+	return n;
+}
+
+struct vhost_umem *vhost_dev_reset_owner_prepare(void)
+{
+	return vhost_kvzalloc(sizeof(struct vhost_umem));
 }
 EXPORT_SYMBOL_GPL(vhost_dev_reset_owner_prepare);
 
 /* Caller should have device mutex */
-void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_memory *memory)
+void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_umem *umem)
 {
 	int i;
 
 	vhost_dev_cleanup(dev, true);
 
 	/* Restore memory to default empty mapping. */
-	memory->nregions = 0;
-	dev->memory = memory;
+	INIT_LIST_HEAD(&umem->umem_list);
+	dev->umem = umem;
 	/* We don't need VQ locks below since vhost_dev_cleanup makes sure
 	 * VQs aren't running.
 	 */
 	for (i = 0; i < dev->nvqs; ++i)
-		dev->vqs[i]->memory = memory;
+		dev->vqs[i]->umem = umem;
 }
 EXPORT_SYMBOL_GPL(vhost_dev_reset_owner);
 
@@ -549,6 +563,21 @@ void vhost_dev_stop(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_stop);
 
+static void vhost_umem_clean(struct vhost_umem *umem)
+{
+	struct vhost_umem_node *node, *tmp;
+
+	if (!umem)
+		return;
+
+	list_for_each_entry_safe(node, tmp, &umem->umem_list, link) {
+		vhost_umem_interval_tree_remove(node, &umem->umem_tree);
+		list_del(&node->link);
+		kvfree(node);
+	}
+	kvfree(umem);
+}
+
 /* Caller should have device mutex if and only if locked is set */
 void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 {
@@ -575,8 +604,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 		fput(dev->log_file);
 	dev->log_file = NULL;
 	/* No one will access memory at this point */
-	kvfree(dev->memory);
-	dev->memory = NULL;
+	vhost_umem_clean(dev->umem);
+	dev->umem = NULL;
 	WARN_ON(!list_empty(&dev->work_list));
 	if (dev->worker) {
 		kthread_stop(dev->worker);
@@ -602,25 +631,25 @@ static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
 }
 
 /* Caller should have vq mutex and device mutex. */
-static int vq_memory_access_ok(void __user *log_base, struct vhost_memory *mem,
+static int vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
 			       int log_all)
 {
-	int i;
+	struct vhost_umem_node *node;
 
-	if (!mem)
+	if (!umem)
 		return 0;
 
-	for (i = 0; i < mem->nregions; ++i) {
-		struct vhost_memory_region *m = mem->regions + i;
-		unsigned long a = m->userspace_addr;
-		if (m->memory_size > ULONG_MAX)
+	list_for_each_entry(node, &umem->umem_list, link) {
+		unsigned long a = node->userspace_addr;
+
+		if (node->size > ULONG_MAX)
 			return 0;
 		else if (!access_ok(VERIFY_WRITE, (void __user *)a,
-				    m->memory_size))
+				    node->size))
 			return 0;
 		else if (log_all && !log_access_ok(log_base,
-						   m->guest_phys_addr,
-						   m->memory_size))
+						   node->start,
+						   node->size))
 			return 0;
 	}
 	return 1;
@@ -628,7 +657,7 @@ static int vq_memory_access_ok(void __user *log_base, struct vhost_memory *mem,
 
 /* Can we switch to this memory table? */
 /* Caller should have device mutex but not vq mutex */
-static int memory_access_ok(struct vhost_dev *d, struct vhost_memory *mem,
+static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
 			    int log_all)
 {
 	int i;
@@ -641,7 +670,8 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_memory *mem,
 		log = log_all || vhost_has_feature(d->vqs[i], VHOST_F_LOG_ALL);
 		/* If ring is inactive, will check when it's enabled. */
 		if (d->vqs[i]->private_data)
-			ok = vq_memory_access_ok(d->vqs[i]->log_base, mem, log);
+			ok = vq_memory_access_ok(d->vqs[i]->log_base,
+						 umem, log);
 		else
 			ok = 1;
 		mutex_unlock(&d->vqs[i]->mutex);
@@ -684,7 +714,7 @@ static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
 /* Caller should have device mutex but not vq mutex */
 int vhost_log_access_ok(struct vhost_dev *dev)
 {
-	return memory_access_ok(dev, dev->memory, 1);
+	return memory_access_ok(dev, dev->umem, 1);
 }
 EXPORT_SYMBOL_GPL(vhost_log_access_ok);
 
@@ -695,7 +725,7 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
 {
 	size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
 
-	return vq_memory_access_ok(log_base, vq->memory,
+	return vq_memory_access_ok(log_base, vq->umem,
 				   vhost_has_feature(vq, VHOST_F_LOG_ALL)) &&
 		(!vq->log_used || log_access_ok(log_base, vq->log_addr,
 					sizeof *vq->used +
@@ -711,28 +741,12 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 }
 EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
 
-static int vhost_memory_reg_sort_cmp(const void *p1, const void *p2)
-{
-	const struct vhost_memory_region *r1 = p1, *r2 = p2;
-	if (r1->guest_phys_addr < r2->guest_phys_addr)
-		return 1;
-	if (r1->guest_phys_addr > r2->guest_phys_addr)
-		return -1;
-	return 0;
-}
-
-static void *vhost_kvzalloc(unsigned long size)
-{
-	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
-
-	if (!n)
-		n = vzalloc(size);
-	return n;
-}
-
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 {
-	struct vhost_memory mem, *newmem, *oldmem;
+	struct vhost_memory mem, *newmem;
+	struct vhost_memory_region *region;
+	struct vhost_umem_node *node;
+	struct vhost_umem *newumem, *oldumem;
 	unsigned long size = offsetof(struct vhost_memory, regions);
 	int i;
 
@@ -752,24 +766,52 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 		kvfree(newmem);
 		return -EFAULT;
 	}
-	sort(newmem->regions, newmem->nregions, sizeof(*newmem->regions),
-		vhost_memory_reg_sort_cmp, NULL);
 
-	if (!memory_access_ok(d, newmem, 0)) {
+	newumem = vhost_kvzalloc(sizeof(*newumem));
+	if (!newumem) {
 		kvfree(newmem);
-		return -EFAULT;
+		return -ENOMEM;
 	}
-	oldmem = d->memory;
-	d->memory = newmem;
+
+	newumem->umem_tree = RB_ROOT;
+	INIT_LIST_HEAD(&newumem->umem_list);
+
+	for (region = newmem->regions;
+	     region < newmem->regions + mem.nregions;
+	     region++) {
+		node = vhost_kvzalloc(sizeof(*node));
+		if (!node)
+			goto err;
+		node->start = region->guest_phys_addr;
+		node->size = region->memory_size;
+		node->last = node->start + node->size - 1;
+		node->userspace_addr = region->userspace_addr;
+		INIT_LIST_HEAD(&node->link);
+		list_add_tail(&node->link, &newumem->umem_list);
+		vhost_umem_interval_tree_insert(node, &newumem->umem_tree);
+	}
+
+	if (!memory_access_ok(d, newumem, 0))
+		goto err;
+
+	oldumem = d->umem;
+	d->umem = newumem;
 
 	/* All memory accesses are done under some VQ mutex. */
 	for (i = 0; i < d->nvqs; ++i) {
 		mutex_lock(&d->vqs[i]->mutex);
-		d->vqs[i]->memory = newmem;
+		d->vqs[i]->umem = newumem;
 		mutex_unlock(&d->vqs[i]->mutex);
 	}
-	kvfree(oldmem);
+
+	kvfree(newmem);
+	vhost_umem_clean(oldumem);
 	return 0;
+
+err:
+	vhost_umem_clean(newumem);
+	kvfree(newmem);
+	return -EFAULT;
 }
 
 long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
@@ -1072,28 +1114,6 @@ done:
 }
 EXPORT_SYMBOL_GPL(vhost_dev_ioctl);
 
-static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
-						     __u64 addr, __u32 len)
-{
-	const struct vhost_memory_region *reg;
-	int start = 0, end = mem->nregions;
-
-	while (start < end) {
-		int slot = start + (end - start) / 2;
-		reg = mem->regions + slot;
-		if (addr >= reg->guest_phys_addr)
-			end = slot;
-		else
-			start = slot + 1;
-	}
-
-	reg = mem->regions + start;
-	if (addr >= reg->guest_phys_addr &&
-		reg->guest_phys_addr + reg->memory_size > addr)
-		return reg;
-	return NULL;
-}
-
 /* TODO: This is really inefficient.  We need something like get_user()
  * (instruction directly accesses the data, with an exception table entry
  * returning -EFAULT). See Documentation/x86/exception-tables.txt.
@@ -1244,29 +1264,29 @@ EXPORT_SYMBOL_GPL(vhost_vq_init_access);
 static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
 			  struct iovec iov[], int iov_size)
 {
-	const struct vhost_memory_region *reg;
-	struct vhost_memory *mem;
+	const struct vhost_umem_node *node;
+	struct vhost_umem *umem = vq->umem;
 	struct iovec *_iov;
 	u64 s = 0;
 	int ret = 0;
 
-	mem = vq->memory;
 	while ((u64)len > s) {
 		u64 size;
 		if (unlikely(ret >= iov_size)) {
 			ret = -ENOBUFS;
 			break;
 		}
-		reg = find_region(mem, addr, len);
-		if (unlikely(!reg)) {
+		node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
+							addr, addr + len - 1);
+		if (node == NULL || node->start > addr) {
 			ret = -EFAULT;
 			break;
 		}
 		_iov = iov + ret;
-		size = reg->memory_size - addr + reg->guest_phys_addr;
+		size = node->size - addr + node->start;
 		_iov->iov_len = min((u64)len - s, size);
 		_iov->iov_base = (void __user *)(unsigned long)
-			(reg->userspace_addr + addr - reg->guest_phys_addr);
+			(node->userspace_addr + addr - node->start);
 		s += size;
 		addr += size;
 		++ret;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d36d8be..b93b6a0 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -53,6 +53,25 @@ struct vhost_log {
 	u64 len;
 };
 
+#define START(node) ((node)->start)
+#define LAST(node) ((node)->last)
+
+struct vhost_umem_node {
+	struct rb_node rb;
+	struct list_head link;
+	__u64 start;
+	__u64 last;
+	__u64 size;
+	__u64 userspace_addr;
+	__u64 flags_padding;
+	__u64 __subtree_last;
+};
+
+struct vhost_umem {
+	struct rb_root umem_tree;
+	struct list_head umem_list;
+};
+
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -101,7 +120,7 @@ struct vhost_virtqueue {
 	struct iovec *indirect;
 	struct vring_used_elem *heads;
 	/* Protected by virtqueue mutex. */
-	struct vhost_memory *memory;
+	struct vhost_umem *umem;
 	void *private_data;
 	u64 acked_features;
 	/* Log write descriptors */
@@ -119,7 +138,6 @@ struct vhost_virtqueue {
 };
 
 struct vhost_dev {
-	struct vhost_memory *memory;
 	struct mm_struct *mm;
 	struct mutex mutex;
 	struct vhost_virtqueue **vqs;
@@ -129,14 +147,15 @@ struct vhost_dev {
 	spinlock_t work_lock;
 	struct list_head work_list;
 	struct task_struct *worker;
+	struct vhost_umem *umem;
 };
 
 void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
 long vhost_dev_set_owner(struct vhost_dev *dev);
 bool vhost_dev_has_owner(struct vhost_dev *dev);
 long vhost_dev_check_owner(struct vhost_dev *);
-struct vhost_memory *vhost_dev_reset_owner_prepare(void);
-void vhost_dev_reset_owner(struct vhost_dev *, struct vhost_memory *);
+struct vhost_umem *vhost_dev_reset_owner_prepare(void);
+void vhost_dev_reset_owner(struct vhost_dev *, struct vhost_umem *);
 void vhost_dev_cleanup(struct vhost_dev *, bool locked);
 void vhost_dev_stop(struct vhost_dev *);
 long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user *argp);
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH 1/3] vhost: introduce vhost memory accessors
From: Jason Wang @ 2016-06-22  9:50 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: vkaplans, wexu, peterx
In-Reply-To: <1466589047-2271-1-git-send-email-jasowang@redhat.com>

This patch introduces vhost memory accessors which were just wrappers
for userspace address access helpers. This is a requirement for vhost
device iotlb implementation which will add iotlb translations in those
accessors.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c   |  1 +
 drivers/vhost/vhost.c | 50 +++++++++++++++++++++++++++++++++++---------------
 2 files changed, 36 insertions(+), 15 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 1d3e45f..3893ad9 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -987,6 +987,7 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 
 		vhost_net_disable_vq(n, vq);
 		vq->private_data = sock;
+		/* FIXME: iotlb prefetch here? */
 		r = vhost_vq_init_access(vq);
 		if (r)
 			goto err_used;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 669fef1..9f2a63a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -651,6 +651,22 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_memory *mem,
 	return 1;
 }
 
+#define vhost_put_user(vq, x, ptr)  __put_user(x, ptr)
+
+static int vhost_copy_to_user(struct vhost_virtqueue *vq, void *to,
+			      const void *from, unsigned size)
+{
+	return __copy_to_user(to, from, size);
+}
+
+#define vhost_get_user(vq, x, ptr) __get_user(x, ptr)
+
+static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,
+				void *from, unsigned size)
+{
+	return __copy_from_user(to, from, size);
+}
+
 static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
 			struct vring_desc __user *desc,
 			struct vring_avail __user *avail,
@@ -1156,7 +1172,8 @@ EXPORT_SYMBOL_GPL(vhost_log_write);
 static int vhost_update_used_flags(struct vhost_virtqueue *vq)
 {
 	void __user *used;
-	if (__put_user(cpu_to_vhost16(vq, vq->used_flags), &vq->used->flags) < 0)
+	if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
+			   &vq->used->flags) < 0)
 		return -EFAULT;
 	if (unlikely(vq->log_used)) {
 		/* Make sure the flag is seen before log. */
@@ -1174,7 +1191,8 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq)
 
 static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 avail_event)
 {
-	if (__put_user(cpu_to_vhost16(vq, vq->avail_idx), vhost_avail_event(vq)))
+	if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
+			   vhost_avail_event(vq)))
 		return -EFAULT;
 	if (unlikely(vq->log_used)) {
 		void __user *used;
@@ -1212,7 +1230,7 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
 		r = -EFAULT;
 		goto err;
 	}
-	r = __get_user(last_used_idx, &vq->used->idx);
+	r = vhost_get_user(vq, last_used_idx, &vq->used->idx);
 	if (r)
 		goto err;
 	vq->last_used_idx = vhost16_to_cpu(vq, last_used_idx);
@@ -1392,7 +1410,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
 	/* Check it isn't doing very strange things with descriptor numbers. */
 	last_avail_idx = vq->last_avail_idx;
-	if (unlikely(__get_user(avail_idx, &vq->avail->idx))) {
+	if (unlikely(vhost_get_user(vq, avail_idx, &vq->avail->idx))) {
 		vq_err(vq, "Failed to access avail idx at %p\n",
 		       &vq->avail->idx);
 		return -EFAULT;
@@ -1414,8 +1432,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
 	/* Grab the next descriptor number they're advertising, and increment
 	 * the index we've seen. */
-	if (unlikely(__get_user(ring_head,
-				&vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
+	if (unlikely(vhost_get_user(vq, ring_head,
+		     &vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
 		vq_err(vq, "Failed to read head: idx %d address %p\n",
 		       last_avail_idx,
 		       &vq->avail->ring[last_avail_idx % vq->num]);
@@ -1450,7 +1468,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 			       i, vq->num, head);
 			return -EINVAL;
 		}
-		ret = __copy_from_user(&desc, vq->desc + i, sizeof desc);
+		ret = vhost_copy_from_user(vq, &desc, vq->desc + i,
+					   sizeof desc);
 		if (unlikely(ret)) {
 			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
 			       i, vq->desc + i);
@@ -1538,15 +1557,15 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 	start = vq->last_used_idx & (vq->num - 1);
 	used = vq->used->ring + start;
 	if (count == 1) {
-		if (__put_user(heads[0].id, &used->id)) {
+		if (vhost_put_user(vq, heads[0].id, &used->id)) {
 			vq_err(vq, "Failed to write used id");
 			return -EFAULT;
 		}
-		if (__put_user(heads[0].len, &used->len)) {
+		if (vhost_put_user(vq, heads[0].len, &used->len)) {
 			vq_err(vq, "Failed to write used len");
 			return -EFAULT;
 		}
-	} else if (__copy_to_user(used, heads, count * sizeof *used)) {
+	} else if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
 		vq_err(vq, "Failed to write used");
 		return -EFAULT;
 	}
@@ -1590,7 +1609,8 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
 
 	/* Make sure buffer is written before we update index. */
 	smp_wmb();
-	if (__put_user(cpu_to_vhost16(vq, vq->last_used_idx), &vq->used->idx)) {
+	if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
+			   &vq->used->idx)) {
 		vq_err(vq, "Failed to increment used idx");
 		return -EFAULT;
 	}
@@ -1622,7 +1642,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 
 	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
 		__virtio16 flags;
-		if (__get_user(flags, &vq->avail->flags)) {
+		if (vhost_get_user(vq, flags, &vq->avail->flags)) {
 			vq_err(vq, "Failed to get flags");
 			return true;
 		}
@@ -1636,7 +1656,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	if (unlikely(!v))
 		return true;
 
-	if (__get_user(event, vhost_used_event(vq))) {
+	if (vhost_get_user(vq, event, vhost_used_event(vq))) {
 		vq_err(vq, "Failed to get used event idx");
 		return true;
 	}
@@ -1678,7 +1698,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	__virtio16 avail_idx;
 	int r;
 
-	r = __get_user(avail_idx, &vq->avail->idx);
+	r = vhost_get_user(vq, avail_idx, &vq->avail->idx);
 	if (r)
 		return false;
 
@@ -1713,7 +1733,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	/* They could have slipped one in as we were doing that: make
 	 * sure it's written, then check again. */
 	smp_mb();
-	r = __get_user(avail_idx, &vq->avail->idx);
+	r = vhost_get_user(vq, avail_idx, &vq->avail->idx);
 	if (r) {
 		vq_err(vq, "Failed to check avail idx at %p: %d\n",
 		       &vq->avail->idx, r);
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH 0/3] basic device IOTLB support
From: Jason Wang @ 2016-06-22  9:50 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: vkaplans, wexu, peterx

This patch tries to implement an device IOTLB for vhost. This could be
used with for co-operation with userspace IOMMU implementation (qemu)
for a secure DMA environment (DMAR) in guest.

The idea is simple. When vhost meets an IOTLB miss, it will request
the assistance of userspace to do the translation, this is done
through:

- when there's a IOTLB miss, it will notify userspace through
  vhost_net fd and then userspace read the fault address, size and
  access from vhost fd.
- userspace write the translation result back to vhost fd, vhost can
  then update its IOTLB.

The codes were optimized for fixed mapping users e.g dpdk in guest. It
will be slow if dynamic mappings were used in guest. We could do
optimizations on top.

The codes were designed to be architecture independent. It should be
easily ported to any architecture.

Stress tested with l2fwd/vfio in guest with 4K/2M/1G page size. On 1G
hugepage case, 100% TLB hit rate were noticed.

Have a benchmark on this. Test was done with l2fwd in guest.For 2MB
page, no difference in 64B performance and I notice a 4%-5% drop for
1500B performance compare to UIO in guest. We can add some shortcut to
bypass the IOTLB for virtqueue accessing, but I think it's better to
do this on top.

Changes from RFC V3:
- rebase on latest
- minor tweak on commit log
- use VHOST_ACCESS_RO instead of VHOST_ACCESS_WO in vhost_copy_from_user()
- switch to use atomic userspace access helper in vhost_get/put_user()
- remove debug codes in vhost_iotlb_miss()
- use FIFO instead of FILO when doing TLB replacement
- fix unbalanced lock in vhost_process_iotlb_msg()

Changes from RFC V2:
- introduce memory accessors for vhost
- switch from ioctls to oridinary file read/write for iotlb miss and
  updating
- do not assume virtqueue were virtually mapped contiguously, all
  virtqueue access were done throug IOTLB
- verify memory access during IOTLB update and fail early
- introduce a module parameter for the size of IOTLB

Changes from RFC V1:
- support any size/range of updating and invalidation through
  introducing the interval tree.
- convert from per device iotlb request to per virtqueue iotlb
  request, this solves the possible deadlock in V1.
- read/write permission check support.

Please review.

Jason Wang (3):
  vhost: introduce vhost memory accessors
  vhost: convert pre sorted vhost memory array to interval tree
  vhost: device IOTLB API

 drivers/vhost/net.c        |  63 +++-
 drivers/vhost/vhost.c      | 799 +++++++++++++++++++++++++++++++++++++++------
 drivers/vhost/vhost.h      |  60 +++-
 include/uapi/linux/vhost.h |  28 ++
 4 files changed, 829 insertions(+), 121 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* Re: [RFC PATCH V3 0/3] basic device IOTLB support
From: Michael S. Tsirkin @ 2016-06-21 16:44 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, netdev, linux-kernel, peterx, virtualization, wexu, vkaplans
In-Reply-To: <1464082585-13049-1-git-send-email-jasowang@redhat.com>

On Tue, May 24, 2016 at 05:36:22PM +0800, Jason Wang wrote:
> This patch tries to implement an device IOTLB for vhost. This could be
> used with for co-operation with userspace IOMMU implementation (qemu)
> for a secure DMA environment (DMAR) in guest.
> 
> The idea is simple. When vhost meets an IOTLB miss, it will request
> the assistance of userspace to do the translation, this is done
> through:
> 
> - when there's a IOTLB miss, it will notify userspace through
>   vhost_net fd and then userspace read the fault address, size and
>   access from vhost fd.
> - userspace write the translation result back to vhost fd, vhost can
>   then update its IOTLB.
> 
> The codes were optimized for fixed mapping users e.g dpdk in guest. It
> will be slow if dynamic mappings were used in guest. We could do
> optimizations on top.
> 
> The codes were designed to be architecture independent. It should be
> easily ported to any architecture.
> 
> Stress tested with l2fwd/vfio in guest with 4K/2M/1G page size. On 1G
> hugepage case, 100% TLB hit rate were noticed.
> 
> Changes from V2:
> - introduce memory accessors for vhost
> - switch from ioctls to oridinary file read/write for iotlb miss and
>   updating
> - do not assume virtqueue were virtually mapped contiguously, all
>   virtqueue access were done throug IOTLB
> - verify memory access during IOTLB update and fail early
> - introduce a module parameter for the size of IOTLB
> 
> Changes from V1:
> - support any size/range of updating and invalidation through
>   introducing the interval tree.
> - convert from per device iotlb request to per virtqueue iotlb
>   request, this solves the possible deadlock in V1.
> - read/write permission check support.
> 
> Please review.

Nice, this looks good to me. Can you post a non-rfc please?


> Jason Wang (3):
>   vhost: introduce vhost memory accessors
>   vhost: convert pre sorted vhost memory array to interval tree
>   vhost: device IOTLB API
> 
>  drivers/vhost/net.c        |  63 +++-
>  drivers/vhost/vhost.c      | 760 ++++++++++++++++++++++++++++++++++++++-------
>  drivers/vhost/vhost.h      |  60 +++-
>  include/uapi/linux/vhost.h |  28 ++
>  4 files changed, 790 insertions(+), 121 deletions(-)
> 
> -- 
> 2.7.4

^ permalink raw reply

* Re: [PATCH v5 1/6] qspinlock: powerpc support qspinlock
From: xinhui @ 2016-06-21 12:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Peter Zijlstra
  Cc: mpe, linux-kernel, waiman.long, virtualization, mingo, paulus,
	paulmck, linuxppc-dev
In-Reply-To: <1465249297.4274.72.camel@kernel.crashing.org>



On 2016年06月07日 05:41, Benjamin Herrenschmidt wrote:
> On Mon, 2016-06-06 at 17:59 +0200, Peter Zijlstra wrote:
>> On Fri, Jun 03, 2016 at 02:33:47PM +1000, Benjamin Herrenschmidt wrote:
>>>
>>>   - For the above, can you show (or describe) where the qspinlock
>>>     improves things compared to our current locks.
>> So currently PPC has a fairly straight forward test-and-set spinlock
>> IIRC. You have this because LPAR/virt muck and lock holder preemption
>> issues etc..
>> qspinlock is 1) a fair lock (like ticket locks) and 2) provides
>> out-of-word spinning, reducing cacheline pressure.
>
> Thanks Peter. I think I understand the theory, but I'd like see it
> translate into real numbers.
>
>> Esp. on multi-socket x86 we saw the out-of-word spinning being a big win
>> over our ticket locks.
>>
>> And fairness, brought to us by the ticket locks a long time ago,
>> eliminated starvation issues we had, where a spinner local to the holder
>> would 'always' win from a spinner further away. So under heavy enough
>> local contention, the spinners on 'remote' CPUs would 'never' get to own
>> the lock.
>
> I think our HW has tweaks to avoid that from happening with the simple
> locks in the underlying ll/sc implementation. In any case, what I'm
> asking is actual tests to verify it works as expected for us.
>
IF HW has such tweaks then there mush be performance drop when total cpu's number grows up.
And I got such clues

one simple benchmark test:
it tests how many spin_lock/spin_unlock pairs can be done within 15 seconds on all cpus.
say,
while(!done) {
	spin_lock()
	this_cpu_inc(loops)
	spin_unlock()
}

I do the test on two machines, one is using powerKVM, and the other is using pHyp.
the result below shows what the sum of loops is in the end, with xxxxK form.

cpu count	| pv-qspinlock	| test-set spinlock|
----------------------------------------------------
8 (powerKVM)	|	62830K	|	67340K	|
------------------------------------------------
8 (pHyp)	|	49800K	|	59330K	|
------------------------------------------------
32 (pHyp)	|	87580K	|	20990K	|
-------------------------------------------------

while cpu count grows up, the lock/unlock pairs ops of test-set spinlock drops very much.
this is because the cache bouncing in different physical cpus.

So to verify how both spinlock impact the data-cache,
another simple benchmark test.
code looks like:

struct _x {
	spinlock_t lk;
	unsigned long x;
} x;

while(!this_cpu_read(stop)) {
	int i = 0xff
	spin_lock(x.lk)
	this_cpu_inc(loops)
	while(i--)
		READ_ONCE(x.x);
	spin_unlock(x.lk)
}

the result below shows what the sum of loops is in the end, with xxxxK form.

cpu count	| pv-qspinlock	| test-set spinlock|
------------------------------------------------
8 (pHyp)	|	13240K	|	9780K	|
------------------------------------------------
32 (pHyp)	|	25790K	|	9700K	|
------------------------------------------------

obviously pv-qspinlock is more cache-friendly, and has better performance than test-set spinlock.

More test is going on, I will send out new patch set with the result.
HOPE *within* this week. unixbench really takes a long time.

thanks
xinhui
>> pv-qspinlock tries to preserve the fairness while allowing limited lock
>> stealing and explicitly managing which vcpus to wake.
>
> Right.
>>>
>>> 	While there's
>>>     theory and to some extent practice on x86, it would be nice to
>>>     validate the effects on POWER.
>> Right; so that will have to be from benchmarks which I cannot help you
>> with ;-)
>
> Precisely :-) This is what I was asking for ;-)
>
> Cheers,
> Ben.
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Heavy clock drifts inside of KVM guests (qemu-kvm 2.1.2)
From: Frank Steinborn @ 2016-06-21  9:50 UTC (permalink / raw)
  To: virtualization, kvm, qemu-discuss

[-- Attachment #1.1: Type: text/plain, Size: 2260 bytes --]

Dear list(s),

we are running a cluster of virtual machines on ganeti 2.15.2 on top of
qemu-kvm 2.1.2. Hosts are Debian jessie (Kernel 3.16). We have big trouble
with timekeeping on guests, no matter if these guests are lenny, squeeze,
wheezy or jessie. Clock drifts heavily on some, others are fine. There is
no obvious pattern on which guest this happens and on which not, but it
seems the clock drifts heavier on guests that are idle and have nothing to
do. Some more facts:

* Clocksource on all hosts are set to tsc
* Clocksource on all guests are set to kvm-clock
* CPUs on the hosts are Intel Xeons, all have these related flags: tsc,
constant_tsc, nonstop_tsc. One host has tsc_deadline_timer as an extra,
because it's a newer Xeon, but it doesn't seem to make a difference in
drifting.
* Most guests complain about "Clocksource tsc unstable" in dmesg after boot
(even when set to clocksource=kvm-clock)
* The hosts don't drift at all, we only see the issue in guests
* We don't use CPU pinning for KVM guests
* We run OpenNTPD 5.7p4 on both the hosts and the guests (we build this for
ourselves, so the version is the same in every guest across every Debian
version)

We tried different things to debug the cause of this issue but we're out if
ideas now. This is what we have tested so far:

* Newer kernel from jessie-backports
* Newer qemu-kvm from jessie-backports
* Set clocksource=hpet on the hosts, this made drifts even worse (not
tested yet: acpi_pm)
* Tried to run guests without kvm_steal_time setting
* Tried without running OpenNTPD on the guests

We have some hints along our investigations. Novell for example says that
CPU pinning may help on drifting guests:
https://www.novell.com/support/kb/doc.php?id=7008698

Others suggested using an "failover clocksource" on the guests:
https://blog.laimbock.com/tag/clocksource_failover/

However, before we just continue to poke around in the dark, we wanted to
hear input from virtualization experts that may have stumbled across
something like this before. Any input would be greatly appreciated!

Please reply directly, as I'm not subscribed to the list. And sorry for
cross-posting this to several lists at once, we just hope to get as much
feedback as possible.

Thanks in advance,
Frank

[-- Attachment #1.2: Type: text/html, Size: 2556 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC PATCH] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
From: Michal Hocko @ 2016-06-20  9:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Vladimir Davydov, Tetsuo Handa, virtualization, LKML, linux-mm
In-Reply-To: <20160619213543.GA32752@dhcp22.suse.cz>

On Sun 19-06-16 23:35:43, Michal Hocko wrote:
> On Sat 18-06-16 03:09:02, Michael S. Tsirkin wrote:
> > On Fri, Jun 17, 2016 at 11:00:17AM +0200, Michal Hocko wrote:
[...]
> > > diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> > > index 349557825428..b1f314fca3c8 100644
> > > --- a/include/linux/uaccess.h
> > > +++ b/include/linux/uaccess.h
> > > @@ -76,6 +76,28 @@ static inline unsigned long __copy_from_user_nocache(void *to,
> > >  #endif		/* ARCH_HAS_NOCACHE_UACCESS */
> > >  
> > >  /*
> > > + * A safe variant of __get_user for for use_mm() users to have a
> > > + * gurantee that the address space wasn't reaped in the background
> > > + */
> > > +#define __get_user_mm(mm, x, ptr)				\
> > > +({								\
> > > +	int ___gu_err = __get_user(x, ptr);			\
> > > +	if (!___gu_err && test_bit(MMF_UNSTABLE, &mm->flags))	\
> > 
> > test_bit is somewhat expensive. See my old mail
> > 	x86/bitops: implement __test_bit
> 
> Do you have a msg_id?
> 
> > I dropped it as virtio just switched to simple &/| for features,
> > but we might need something like this now.
> 
> Is this such a hot path that something like this would make a visible
> difference? 

OK, so I've tried to apply your patch [1] and updated both __get_user_mm
and __copy_from_user_mm and the result is a code size reduction:
   text    data     bss     dec     hex filename
  12835       2      32   12869    3245 drivers/vhost/vhost.o
  12882       2      32   12916    3274 drivers/vhost/vhost.o.before

This is really tiny and I cannot tell anything about the performance. Should
I resurrect your patch and push it together with this change or this can happen
later?

[1] http://lkml.kernel.org/r/1440776707-22016-1-git-send-email-mst@redhat.com
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v5 0/6] Support calling functions on dedicated physical cpu
From: Juergen Gross @ 2016-06-20  8:40 UTC (permalink / raw)
  To: mingo, peterz
  Cc: x86, jeremy, jdelvare, hpa, akataria, linux-kernel,
	virtualization, chrisw, david.vrabel, Douglas_Warzecha,
	pali.rohar, xen-devel, boris.ostrovsky, tglx, linux
In-Reply-To: <1459952266-3687-1-git-send-email-jgross@suse.com>

On 06/04/16 16:17, Juergen Gross wrote:
> Some hardware (e.g. Dell Studio laptops) require special functions to
> be called on physical cpu 0 in order to avoid occasional hangs. When
> running as dom0 under Xen this could be achieved only via special boot
> parameters (vcpu pinning) limiting the hypervisor in it's scheduling
> decisions.
> 
> This patch series is adding a generic function to be able to temporarily
> pin a (virtual) cpu to a dedicated physical cpu for executing above
> mentioned functions on that specific cpu. The drivers (dcdbas and i8k)
> requiring this functionality are modified accordingly.

Peter, Ingo,

is one of you considering to take this series for 4.8? I know not all
patches got an Ack, OTOH maintainers had 2.5 months time to object.

I'd really appreciate this series would make it into 4.8.


Juergen

> 
> Changes in V5:
> - patch 3: rename and reshuffle parameters of smp_call_on_cpu() as requested
>   by Peter Zijlstra
> - patch 3: test target cpu to be online as requested by Peter Zijlstra
> - patch 4: less wordy messages as requested by David Vrabel
> 
> Changes in V4:
> - move patches 5 and 6 further up in the series
> - patch 2 (was 5): WARN_ONCE in case platform doesn't support pinning
>   as requested by Peter Zijlstra
> - patch 3 (was 2): change return value in case of illegal cpu as
>   requested by Peter Zijlstra
> - patch 3 (was 2): make pinning of vcpu an option as suggested by
>   Peter Zijlstra
> - patches 5 and 6 (were 3 and 4): add call to get_online_cpus()
> 
> Changes in V3:
> - use get_cpu()/put_cpu() as suggested by David Vrabel
> 
> Changes in V2:
> - instead of manipulating the allowed set of cpus use cpu specific
>   workqueue as requested by Peter Zijlstra
> - add include/linux/hypervisor.h to hide architecture specific stuff
>   from generic kernel code
> 
> Juergen Gross (6):
>   xen: sync xen header
>   virt, sched: add generic vcpu pinning support
>   smp: add function to execute a function synchronously on a cpu
>   xen: add xen_pin_vcpu() to support calling functions on a dedicated
>     pcpu
>   dcdbas: make use of smp_call_on_cpu()
>   hwmon: use smp_call_on_cpu() for dell-smm i8k
> 
>  MAINTAINERS                       |   1 +
>  arch/x86/include/asm/hypervisor.h |   4 ++
>  arch/x86/kernel/cpu/hypervisor.c  |  11 +++++
>  arch/x86/xen/enlighten.c          |  40 +++++++++++++++
>  drivers/firmware/dcdbas.c         |  51 +++++++++----------
>  drivers/hwmon/dell-smm-hwmon.c    |  35 +++++++------
>  include/linux/hypervisor.h        |  17 +++++++
>  include/linux/smp.h               |   3 ++
>  include/xen/interface/sched.h     | 100 +++++++++++++++++++++++++++++++-------
>  kernel/smp.c                      |  51 +++++++++++++++++++
>  kernel/up.c                       |  18 +++++++
>  11 files changed, 273 insertions(+), 58 deletions(-)
>  create mode 100644 include/linux/hypervisor.h
> 

^ permalink raw reply

* Re: [RFC PATCH] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
From: Michal Hocko @ 2016-06-19 21:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Vladimir Davydov, Tetsuo Handa, virtualization, LKML, linux-mm
In-Reply-To: <20160619213543.GA32752@dhcp22.suse.cz>

On Sun 19-06-16 23:35:43, Michal Hocko wrote:
> On Sat 18-06-16 03:09:02, Michael S. Tsirkin wrote:
> > On Fri, Jun 17, 2016 at 11:00:17AM +0200, Michal Hocko wrote:
[...]
> > >  /*
> > > + * A safe variant of __get_user for for use_mm() users to have a
> > > + * gurantee that the address space wasn't reaped in the background
> > > + */
> > > +#define __get_user_mm(mm, x, ptr)				\
> > > +({								\
> > > +	int ___gu_err = __get_user(x, ptr);			\
> > > +	if (!___gu_err && test_bit(MMF_UNSTABLE, &mm->flags))	\
> > 
> > test_bit is somewhat expensive. See my old mail
> > 	x86/bitops: implement __test_bit
> 
> Do you have a msg_id?

Found it
http://lkml.kernel.org/r/1440776707-22016-1-git-send-email-mst@redhat.com
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [RFC PATCH] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
From: Michal Hocko @ 2016-06-19 21:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Vladimir Davydov, Tetsuo Handa, virtualization, LKML, linux-mm
In-Reply-To: <20160618025904-mutt-send-email-mst@redhat.com>

On Sat 18-06-16 03:09:02, Michael S. Tsirkin wrote:
> On Fri, Jun 17, 2016 at 11:00:17AM +0200, Michal Hocko wrote:
[...]
> > It seems that vhost usage would suffer from this problem because
> > it reads from the userspace to get (status) flags and makes some
> > decisions based on the read value.  I do not understand the code so I
> > couldn't evaluate whether that would lead to some real problems so I
> > conservatively assumed it wouldn't handle that gracefully.
> 
> Getting an error from __get_user and friends is handled gracefully.
> Getting zero instead of a real value will cause userspace
> memory corruption.

OK, thanks for the confirmation! I will add this to the changelog. I
assume that the memory corruption could "leak out" of the mm we just
read from, right? I am asking because the mm and all its users will die
by SIGKILL so they will not "see" the corruption. I am not familiar with the
vhost transfer model but I guess it wouldn't be uncommon if the target
memory could be a shared object (e.g. tmpfs or a regular file) so it
would outlive the mm.

[...]

> > @@ -1713,7 +1713,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> >  	/* They could have slipped one in as we were doing that: make
> >  	 * sure it's written, then check again. */
> >  	smp_mb();
> > -	r = __get_user(avail_idx, &vq->avail->idx);
> > +	r = __get_user_mm(dev->mm,avail_idx, &vq->avail->idx);
> 
> space after , pls

sure

> 
> >  	if (r) {
> >  		vq_err(vq, "Failed to check avail idx at %p: %d\n",
> >  		       &vq->avail->idx, r);
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 6d81a1eb974a..2b00ac7faa18 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -513,6 +513,7 @@ static inline int get_dumpable(struct mm_struct *mm)
> >  #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
> >  #define MMF_OOM_REAPED		21	/* mm has been already reaped */
> >  #define MMF_OOM_NOT_REAPABLE	22	/* mm couldn't be reaped */
> > +#define MMF_UNSTABLE		23	/* mm is unstable for copy_from_user */
> >  
> >  #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
> >  
> > diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> > index 349557825428..b1f314fca3c8 100644
> > --- a/include/linux/uaccess.h
> > +++ b/include/linux/uaccess.h
> > @@ -76,6 +76,28 @@ static inline unsigned long __copy_from_user_nocache(void *to,
> >  #endif		/* ARCH_HAS_NOCACHE_UACCESS */
> >  
> >  /*
> > + * A safe variant of __get_user for for use_mm() users to have a
> > + * gurantee that the address space wasn't reaped in the background
> > + */
> > +#define __get_user_mm(mm, x, ptr)				\
> > +({								\
> > +	int ___gu_err = __get_user(x, ptr);			\
> > +	if (!___gu_err && test_bit(MMF_UNSTABLE, &mm->flags))	\
> 
> test_bit is somewhat expensive. See my old mail
> 	x86/bitops: implement __test_bit

Do you have a msg_id?

> I dropped it as virtio just switched to simple &/| for features,
> but we might need something like this now.

Is this such a hot path that something like this would make a visible
difference? 

> 
> 
> 
> > +		___gu_err = -EFAULT;				\
> > +	___gu_err;						\
> > +})
> > +
> > +/* similar to __get_user_mm */
> > +static inline __must_check long __copy_from_user_mm(struct mm_struct *mm,
> > +		void *to, const void __user * from, unsigned long n)
> > +{
> > +	long ret = __copy_from_user(to, from, n);
> > +	if (!ret && test_bit(MMF_UNSTABLE, &mm->flags))
> > +		return -EFAULT;
> > +	return ret;

And I've just noticed that this is not correct. We need 
	if ((ret >= 0) && test_bit(MMF_UNSTABLE, &mm->flags))

[...]

> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 6303bc7caeda..3fa43e96a59b 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -506,6 +506,12 @@ static bool __oom_reap_task(struct task_struct *tsk)
> >  		goto mm_drop;
> >  	}
> >  
> > +	/*
> > +	 * Tell all users of get_user_mm/copy_from_user_mm that the content
> > +	 * is no longer stable.
> > +	 */
> > +	set_bit(MMF_UNSTABLE, &mm->flags);
> > +
> 
> do we need some kind of barrier after this?

Well I believe we don't because unmapping the memory will likely
imply memory barriers on the way.

> 
> and if yes - does flag read need a barrier before it too?

A good question. I was basically assuming the same as above. If we didn't fault
then the oom reaper wouldn't touch that memory and so we are safe even when
we see the outdated mm flags, if the memory was reaped then we have to page
fault and that should imply memory barrier AFAIU.

Does that make sense?

> 
> >  	tlb_gather_mmu(&tlb, mm, 0, -1);
> >  	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> >  		if (is_vm_hugetlb_page(vma))
> > -- 
> > 2.8.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH] drivers: virtio_blk: notify blk-core when hw-queue number changes
From: Paolo Bonzini @ 2016-06-18 22:10 UTC (permalink / raw)
  To: Bob Liu, linux-kernel; +Cc: mst, virtualization
In-Reply-To: <1465811906-24510-1-git-send-email-bob.liu__9616.24590349874$1465832216$gmane$org@oracle.com>



On 13/06/2016 11:58, Bob Liu wrote:
> A guest might be migrated to other hosts with different num_queues, the
> blk-core should aware of that else the reference of &vblk->vqs[qid] may be wrong.
> 
> Signed-off-by: Bob Liu <bob.liu@oracle.com>
> ---
>  drivers/block/virtio_blk.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 42758b5..c169238 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -819,6 +819,9 @@ static int virtblk_restore(struct virtio_device *vdev)
>  	if (ret)
>  		return ret;
>  
> +	if (vblk->num_vqs != vblk->tag_set.nr_hw_queues)
> +		blk_mq_update_nr_hw_queues(&vblk->tag_set, vblk->num_vqs);
> +
>  	virtio_device_ready(vdev);
>  
>  	blk_mq_start_stopped_hw_queues(vblk->disk->queue, true);
> 

This should never happen; it'd be a configuration problem.

Paolo

^ permalink raw reply

* Re: [RFC PATCH] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
From: Michael S. Tsirkin @ 2016-06-18  0:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vladimir Davydov, Michal Hocko, Tetsuo Handa, LKML,
	virtualization, linux-mm
In-Reply-To: <1466154017-2222-1-git-send-email-mhocko@kernel.org>

On Fri, Jun 17, 2016 at 11:00:17AM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> vhost driver relies on copy_from_user/get_user from a kernel thread.
> This makes it impossible to reap the memory of an oom victim which
> shares mm with the vhost kernel thread because it could see a zero
> page unexpectedly and theoretically make an incorrect decision visible
> outside of the killed task context.
> 
> Make sure that each place which can read from userspace is annotated
> properly and it uses copy_from_user_mm, __get_user_mm resp.
> copy_from_iter_mm. Each will get the target mm as an argument and it
> performs a pessimistic check to rule out that the oom_reaper could
> possibly unmap the particular page. __oom_reap_task then just needs to
> mark the mm as unstable before it unmaps any page.
> 
> This is a preparatory patch without any functional changes because
> the oom reaper doesn't touch mm shared with kthreads yet.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Will review. Answer to question below:

> ---
> Hi Michael,
> we have discussed [1] that vhost_worker pins the mm of a potential
> oom victim for too long which result into an OOM storm when other
> processes have to be killed. One way to address this issue would
> be to pin mm_count rather than mm_users and revalidate it before
> actually doing the copy (mmget_not_zero). You had concerns about
> more atomic operations in the data path. Another way would be to
> postpone exit_mm_victim to after exit_task_work but as it turned
> out other task might have the device open and pin the mm indirectly
> [2].
> 
> Now I would like to attack the issue from another side which would
> be more generic. I would like to make mm's which are shared with
> kthreads oom reapable in general. This is currently not allowed
> because we do not want to risk that a kthread would see an already
> unmapped page - aka see a newly allocated zero page. At the same
> time this is really desirable because it helps to guarantee a forward
> progress on the OOM.
> 
> It seems that vhost usage would suffer from this problem because
> it reads from the userspace to get (status) flags and makes some
> decisions based on the read value.  I do not understand the code so I
> couldn't evaluate whether that would lead to some real problems so I
> conservatively assumed it wouldn't handle that gracefully.

Getting an error from __get_user and friends is handled gracefully.
Getting zero instead of a real value will cause userspace
memory corruption.

> If this is
> incorrect and all the paths can just cope with seeing zeros unexpectedly
> then great and I will drop the patch and move over to the oom specific
> further steps.
> 
> Therefore I am proposing a kthread safe API which allows to read from
> userspace and also makes sure to do a proper exclusion with the oom
> reaper. A race would be reported by EFAULT which is already handled.
> Performance wise it would add two tests to the copy from user
> paths. Does the following change makes sense to you and would be
> acceptable? If yes I will follow up with another patch which will allow
> oom reaper for mm shared with kthread.
> 
> Thanks!
> 
> [1] http://lkml.kernel.org/r/1456765329-14890-1-git-send-email-vdavydov@virtuozzo.com
> [2] http://lkml.kernel.org/r/20160301181136-mutt-send-email-mst@redhat.com
> 
>  drivers/vhost/scsi.c    |  2 +-
>  drivers/vhost/vhost.c   | 18 +++++++++---------
>  include/linux/sched.h   |  1 +
>  include/linux/uaccess.h | 22 ++++++++++++++++++++++
>  include/linux/uio.h     | 10 ++++++++++
>  mm/oom_kill.c           |  6 ++++++
>  6 files changed, 49 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index 0e6fd556c982..2c8dc0b9a21f 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -932,7 +932,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
>  		 */
>  		iov_iter_init(&out_iter, WRITE, vq->iov, out, out_size);
>  
> -		ret = copy_from_iter(req, req_size, &out_iter);
> +		ret = copy_from_iter_mm(vq->dev->mm, req, req_size, &out_iter);
>  		if (unlikely(ret != req_size)) {
>  			vq_err(vq, "Faulted on copy_from_iter\n");
>  			vhost_scsi_send_bad_target(vs, vq, head, out);
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 669fef1e2bb6..14959ba43cb4 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1212,7 +1212,7 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
>  		r = -EFAULT;
>  		goto err;
>  	}
> -	r = __get_user(last_used_idx, &vq->used->idx);
> +	r = __get_user_mm(vq->dev->mm, last_used_idx, &vq->used->idx);
>  	if (r)
>  		goto err;
>  	vq->last_used_idx = vhost16_to_cpu(vq, last_used_idx);
> @@ -1328,7 +1328,7 @@ static int get_indirect(struct vhost_virtqueue *vq,
>  			       i, count);
>  			return -EINVAL;
>  		}
> -		if (unlikely(copy_from_iter(&desc, sizeof(desc), &from) !=
> +		if (unlikely(copy_from_iter_mm(vq->dev->mm, &desc, sizeof(desc), &from) !=
>  			     sizeof(desc))) {
>  			vq_err(vq, "Failed indirect descriptor: idx %d, %zx\n",
>  			       i, (size_t)vhost64_to_cpu(vq, indirect->addr) + i * sizeof desc);
> @@ -1392,7 +1392,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  
>  	/* Check it isn't doing very strange things with descriptor numbers. */
>  	last_avail_idx = vq->last_avail_idx;
> -	if (unlikely(__get_user(avail_idx, &vq->avail->idx))) {
> +	if (unlikely(__get_user_mm(vq->dev->mm, avail_idx, &vq->avail->idx))) {
>  		vq_err(vq, "Failed to access avail idx at %p\n",
>  		       &vq->avail->idx);
>  		return -EFAULT;
> @@ -1414,7 +1414,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  
>  	/* Grab the next descriptor number they're advertising, and increment
>  	 * the index we've seen. */
> -	if (unlikely(__get_user(ring_head,
> +	if (unlikely(__get_user_mm(vq->dev->mm, ring_head,
>  				&vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
>  		vq_err(vq, "Failed to read head: idx %d address %p\n",
>  		       last_avail_idx,
> @@ -1450,7 +1450,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  			       i, vq->num, head);
>  			return -EINVAL;
>  		}
> -		ret = __copy_from_user(&desc, vq->desc + i, sizeof desc);
> +		ret = __copy_from_user_mm(vq->dev->mm, &desc, vq->desc + i, sizeof desc);
>  		if (unlikely(ret)) {
>  			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
>  			       i, vq->desc + i);
> @@ -1622,7 +1622,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  
>  	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
>  		__virtio16 flags;
> -		if (__get_user(flags, &vq->avail->flags)) {
> +		if (__get_user_mm(dev->mm, flags, &vq->avail->flags)) {
>  			vq_err(vq, "Failed to get flags");
>  			return true;
>  		}
> @@ -1636,7 +1636,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  	if (unlikely(!v))
>  		return true;
>  
> -	if (__get_user(event, vhost_used_event(vq))) {
> +	if (__get_user_mm(dev->mm, event, vhost_used_event(vq))) {
>  		vq_err(vq, "Failed to get used event idx");
>  		return true;
>  	}
> @@ -1678,7 +1678,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  	__virtio16 avail_idx;
>  	int r;
>  
> -	r = __get_user(avail_idx, &vq->avail->idx);
> +	r = __get_user_mm(dev->mm, avail_idx, &vq->avail->idx);
>  	if (r)
>  		return false;
>  
> @@ -1713,7 +1713,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  	/* They could have slipped one in as we were doing that: make
>  	 * sure it's written, then check again. */
>  	smp_mb();
> -	r = __get_user(avail_idx, &vq->avail->idx);
> +	r = __get_user_mm(dev->mm,avail_idx, &vq->avail->idx);

space after , pls

>  	if (r) {
>  		vq_err(vq, "Failed to check avail idx at %p: %d\n",
>  		       &vq->avail->idx, r);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 6d81a1eb974a..2b00ac7faa18 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -513,6 +513,7 @@ static inline int get_dumpable(struct mm_struct *mm)
>  #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
>  #define MMF_OOM_REAPED		21	/* mm has been already reaped */
>  #define MMF_OOM_NOT_REAPABLE	22	/* mm couldn't be reaped */
> +#define MMF_UNSTABLE		23	/* mm is unstable for copy_from_user */
>  
>  #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
>  
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 349557825428..b1f314fca3c8 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -76,6 +76,28 @@ static inline unsigned long __copy_from_user_nocache(void *to,
>  #endif		/* ARCH_HAS_NOCACHE_UACCESS */
>  
>  /*
> + * A safe variant of __get_user for for use_mm() users to have a
> + * gurantee that the address space wasn't reaped in the background
> + */
> +#define __get_user_mm(mm, x, ptr)				\
> +({								\
> +	int ___gu_err = __get_user(x, ptr);			\
> +	if (!___gu_err && test_bit(MMF_UNSTABLE, &mm->flags))	\

test_bit is somewhat expensive. See my old mail
	x86/bitops: implement __test_bit

I dropped it as virtio just switched to simple &/| for features,
but we might need something like this now.



> +		___gu_err = -EFAULT;				\
> +	___gu_err;						\
> +})
> +
> +/* similar to __get_user_mm */
> +static inline __must_check long __copy_from_user_mm(struct mm_struct *mm,
> +		void *to, const void __user * from, unsigned long n)
> +{
> +	long ret = __copy_from_user(to, from, n);
> +	if (!ret && test_bit(MMF_UNSTABLE, &mm->flags))
> +		return -EFAULT;
> +	return ret;
> +}
> +
> +/*
>   * probe_kernel_read(): safely attempt to read from a location
>   * @dst: pointer to the buffer that shall take the data
>   * @src: address to read from
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 1b5d1cd796e2..4be6b24003d8 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -9,6 +9,7 @@
>  #ifndef __LINUX_UIO_H
>  #define __LINUX_UIO_H
>  
> +#include <linux/sched.h>
>  #include <linux/kernel.h>
>  #include <uapi/linux/uio.h>
>  
> @@ -84,6 +85,15 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
>  			 struct iov_iter *i);
>  size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
>  size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
> +
> +static inline size_t copy_from_iter_mm(struct mm_struct *mm, void *addr,
> +		size_t bytes, struct iov_iter *i)
> +{
> +	size_t ret = copy_from_iter(addr, bytes, i);
> +	if (!IS_ERR_VALUE(ret) && test_bit(MMF_UNSTABLE, &mm->flags))
> +		return -EFAULT;
> +	return ret;
> +}
>  size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i);
>  size_t iov_iter_zero(size_t bytes, struct iov_iter *);
>  unsigned long iov_iter_alignment(const struct iov_iter *i);
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 6303bc7caeda..3fa43e96a59b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -506,6 +506,12 @@ static bool __oom_reap_task(struct task_struct *tsk)
>  		goto mm_drop;
>  	}
>  
> +	/*
> +	 * Tell all users of get_user_mm/copy_from_user_mm that the content
> +	 * is no longer stable.
> +	 */
> +	set_bit(MMF_UNSTABLE, &mm->flags);
> +

do we need some kind of barrier after this?

and if yes - does flag read need a barrier before it too?

>  	tlb_gather_mmu(&tlb, mm, 0, -1);
>  	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
>  		if (is_vm_hugetlb_page(vma))
> -- 
> 2.8.1

^ permalink raw reply

* [RFC PATCH] vhost, mm: make sure that oom_reaper doesn't reap memory read by vhost
From: Michal Hocko @ 2016-06-17  9:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Vladimir Davydov, Michal Hocko, Tetsuo Handa, LKML,
	virtualization, linux-mm

From: Michal Hocko <mhocko@suse.com>

vhost driver relies on copy_from_user/get_user from a kernel thread.
This makes it impossible to reap the memory of an oom victim which
shares mm with the vhost kernel thread because it could see a zero
page unexpectedly and theoretically make an incorrect decision visible
outside of the killed task context.

Make sure that each place which can read from userspace is annotated
properly and it uses copy_from_user_mm, __get_user_mm resp.
copy_from_iter_mm. Each will get the target mm as an argument and it
performs a pessimistic check to rule out that the oom_reaper could
possibly unmap the particular page. __oom_reap_task then just needs to
mark the mm as unstable before it unmaps any page.

This is a preparatory patch without any functional changes because
the oom reaper doesn't touch mm shared with kthreads yet.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Hi Michael,
we have discussed [1] that vhost_worker pins the mm of a potential
oom victim for too long which result into an OOM storm when other
processes have to be killed. One way to address this issue would
be to pin mm_count rather than mm_users and revalidate it before
actually doing the copy (mmget_not_zero). You had concerns about
more atomic operations in the data path. Another way would be to
postpone exit_mm_victim to after exit_task_work but as it turned
out other task might have the device open and pin the mm indirectly
[2].

Now I would like to attack the issue from another side which would
be more generic. I would like to make mm's which are shared with
kthreads oom reapable in general. This is currently not allowed
because we do not want to risk that a kthread would see an already
unmapped page - aka see a newly allocated zero page. At the same
time this is really desirable because it helps to guarantee a forward
progress on the OOM.

It seems that vhost usage would suffer from this problem because
it reads from the userspace to get (status) flags and makes some
decisions based on the read value.  I do not understand the code so I
couldn't evaluate whether that would lead to some real problems so I
conservatively assumed it wouldn't handle that gracefully. If this is
incorrect and all the paths can just cope with seeing zeros unexpectedly
then great and I will drop the patch and move over to the oom specific
further steps.

Therefore I am proposing a kthread safe API which allows to read from
userspace and also makes sure to do a proper exclusion with the oom
reaper. A race would be reported by EFAULT which is already handled.
Performance wise it would add two tests to the copy from user
paths. Does the following change makes sense to you and would be
acceptable? If yes I will follow up with another patch which will allow
oom reaper for mm shared with kthread.

Thanks!

[1] http://lkml.kernel.org/r/1456765329-14890-1-git-send-email-vdavydov@virtuozzo.com
[2] http://lkml.kernel.org/r/20160301181136-mutt-send-email-mst@redhat.com

 drivers/vhost/scsi.c    |  2 +-
 drivers/vhost/vhost.c   | 18 +++++++++---------
 include/linux/sched.h   |  1 +
 include/linux/uaccess.h | 22 ++++++++++++++++++++++
 include/linux/uio.h     | 10 ++++++++++
 mm/oom_kill.c           |  6 ++++++
 6 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 0e6fd556c982..2c8dc0b9a21f 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -932,7 +932,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq)
 		 */
 		iov_iter_init(&out_iter, WRITE, vq->iov, out, out_size);
 
-		ret = copy_from_iter(req, req_size, &out_iter);
+		ret = copy_from_iter_mm(vq->dev->mm, req, req_size, &out_iter);
 		if (unlikely(ret != req_size)) {
 			vq_err(vq, "Faulted on copy_from_iter\n");
 			vhost_scsi_send_bad_target(vs, vq, head, out);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 669fef1e2bb6..14959ba43cb4 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1212,7 +1212,7 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
 		r = -EFAULT;
 		goto err;
 	}
-	r = __get_user(last_used_idx, &vq->used->idx);
+	r = __get_user_mm(vq->dev->mm, last_used_idx, &vq->used->idx);
 	if (r)
 		goto err;
 	vq->last_used_idx = vhost16_to_cpu(vq, last_used_idx);
@@ -1328,7 +1328,7 @@ static int get_indirect(struct vhost_virtqueue *vq,
 			       i, count);
 			return -EINVAL;
 		}
-		if (unlikely(copy_from_iter(&desc, sizeof(desc), &from) !=
+		if (unlikely(copy_from_iter_mm(vq->dev->mm, &desc, sizeof(desc), &from) !=
 			     sizeof(desc))) {
 			vq_err(vq, "Failed indirect descriptor: idx %d, %zx\n",
 			       i, (size_t)vhost64_to_cpu(vq, indirect->addr) + i * sizeof desc);
@@ -1392,7 +1392,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
 	/* Check it isn't doing very strange things with descriptor numbers. */
 	last_avail_idx = vq->last_avail_idx;
-	if (unlikely(__get_user(avail_idx, &vq->avail->idx))) {
+	if (unlikely(__get_user_mm(vq->dev->mm, avail_idx, &vq->avail->idx))) {
 		vq_err(vq, "Failed to access avail idx at %p\n",
 		       &vq->avail->idx);
 		return -EFAULT;
@@ -1414,7 +1414,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
 	/* Grab the next descriptor number they're advertising, and increment
 	 * the index we've seen. */
-	if (unlikely(__get_user(ring_head,
+	if (unlikely(__get_user_mm(vq->dev->mm, ring_head,
 				&vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
 		vq_err(vq, "Failed to read head: idx %d address %p\n",
 		       last_avail_idx,
@@ -1450,7 +1450,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 			       i, vq->num, head);
 			return -EINVAL;
 		}
-		ret = __copy_from_user(&desc, vq->desc + i, sizeof desc);
+		ret = __copy_from_user_mm(vq->dev->mm, &desc, vq->desc + i, sizeof desc);
 		if (unlikely(ret)) {
 			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
 			       i, vq->desc + i);
@@ -1622,7 +1622,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 
 	if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
 		__virtio16 flags;
-		if (__get_user(flags, &vq->avail->flags)) {
+		if (__get_user_mm(dev->mm, flags, &vq->avail->flags)) {
 			vq_err(vq, "Failed to get flags");
 			return true;
 		}
@@ -1636,7 +1636,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	if (unlikely(!v))
 		return true;
 
-	if (__get_user(event, vhost_used_event(vq))) {
+	if (__get_user_mm(dev->mm, event, vhost_used_event(vq))) {
 		vq_err(vq, "Failed to get used event idx");
 		return true;
 	}
@@ -1678,7 +1678,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	__virtio16 avail_idx;
 	int r;
 
-	r = __get_user(avail_idx, &vq->avail->idx);
+	r = __get_user_mm(dev->mm, avail_idx, &vq->avail->idx);
 	if (r)
 		return false;
 
@@ -1713,7 +1713,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	/* They could have slipped one in as we were doing that: make
 	 * sure it's written, then check again. */
 	smp_mb();
-	r = __get_user(avail_idx, &vq->avail->idx);
+	r = __get_user_mm(dev->mm,avail_idx, &vq->avail->idx);
 	if (r) {
 		vq_err(vq, "Failed to check avail idx at %p: %d\n",
 		       &vq->avail->idx, r);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d81a1eb974a..2b00ac7faa18 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -513,6 +513,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_REAPED		21	/* mm has been already reaped */
 #define MMF_OOM_NOT_REAPABLE	22	/* mm couldn't be reaped */
+#define MMF_UNSTABLE		23	/* mm is unstable for copy_from_user */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 349557825428..b1f314fca3c8 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -76,6 +76,28 @@ static inline unsigned long __copy_from_user_nocache(void *to,
 #endif		/* ARCH_HAS_NOCACHE_UACCESS */
 
 /*
+ * A safe variant of __get_user for for use_mm() users to have a
+ * gurantee that the address space wasn't reaped in the background
+ */
+#define __get_user_mm(mm, x, ptr)				\
+({								\
+	int ___gu_err = __get_user(x, ptr);			\
+	if (!___gu_err && test_bit(MMF_UNSTABLE, &mm->flags))	\
+		___gu_err = -EFAULT;				\
+	___gu_err;						\
+})
+
+/* similar to __get_user_mm */
+static inline __must_check long __copy_from_user_mm(struct mm_struct *mm,
+		void *to, const void __user * from, unsigned long n)
+{
+	long ret = __copy_from_user(to, from, n);
+	if (!ret && test_bit(MMF_UNSTABLE, &mm->flags))
+		return -EFAULT;
+	return ret;
+}
+
+/*
  * probe_kernel_read(): safely attempt to read from a location
  * @dst: pointer to the buffer that shall take the data
  * @src: address to read from
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 1b5d1cd796e2..4be6b24003d8 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -9,6 +9,7 @@
 #ifndef __LINUX_UIO_H
 #define __LINUX_UIO_H
 
+#include <linux/sched.h>
 #include <linux/kernel.h>
 #include <uapi/linux/uio.h>
 
@@ -84,6 +85,15 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
 size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
 size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
+
+static inline size_t copy_from_iter_mm(struct mm_struct *mm, void *addr,
+		size_t bytes, struct iov_iter *i)
+{
+	size_t ret = copy_from_iter(addr, bytes, i);
+	if (!IS_ERR_VALUE(ret) && test_bit(MMF_UNSTABLE, &mm->flags))
+		return -EFAULT;
+	return ret;
+}
 size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i);
 size_t iov_iter_zero(size_t bytes, struct iov_iter *);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6303bc7caeda..3fa43e96a59b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -506,6 +506,12 @@ static bool __oom_reap_task(struct task_struct *tsk)
 		goto mm_drop;
 	}
 
+	/*
+	 * Tell all users of get_user_mm/copy_from_user_mm that the content
+	 * is no longer stable.
+	 */
+	set_bit(MMF_UNSTABLE, &mm->flags);
+
 	tlb_gather_mmu(&tlb, mm, 0, -1);
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
 		if (is_vm_hugetlb_page(vma))
-- 
2.8.1

^ permalink raw reply related

* Re: [PATCH v7 00/12] Support non-lru page migration
From: Joonsoo Kim @ 2016-06-17  7:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, Sergey Senozhatsky, Sergey Senozhatsky,
	Jonathan Corbet, Chan Gyun Jeong, Rafael Aquini, Hugh Dickins,
	linux-kernel, dri-devel, virtualization, John Einar Reitan,
	linux-mm, Chulmin Kim, Gioh Kim, Konstantin Khlebnikov,
	Sangseok Lee, Andrew Morton, Kyeongdon Kim, Naoya Horiguchi,
	Vlastimil Babka, Mel Gorman
In-Reply-To: <20160616100932.GS17127@bbox>

On Thu, Jun 16, 2016 at 07:09:32PM +0900, Minchan Kim wrote:
> On Thu, Jun 16, 2016 at 05:42:11PM +0900, Sergey Senozhatsky wrote:
> > On (06/16/16 15:47), Minchan Kim wrote:
> > > > [..]
> > > > > > this is what I'm getting with the [zsmalloc: keep first object offset in struct page]
> > > > > > applied:  "count:0 mapcount:-127". which may be not related to zsmalloc at this point.
> > > > > > 
> > > > > > kernel: BUG: Bad page state in process khugepaged  pfn:101db8
> > > > > > kernel: page:ffffea0004076e00 count:0 mapcount:-127 mapping:          (null) index:0x1
> > > > > 
> > > > > Hm, it seems double free.
> > > > > 
> > > > > It doen't happen if you disable zram? IOW, it seems to be related
> > > > > zsmalloc migration?
> > > > 
> > > > need to test more, can't confidently answer now.
> > > > 
> > > > > How easy can you reprodcue it? Could you bisect it?
> > > > 
> > > > it takes some (um.. random) time to trigger the bug.
> > > > I'll try to come up with more details.
> > > 
> > > Could you revert [1] and retest?
> > > 
> > > [1] mm/compaction: split freepages without holding the zone lock
> > 
> > ok, so this is not related to zsmalloc. finally manged to reproduce
> > it. will fork a separate thread.
> 
> The reason I mentioned [1] is that it seems to have a bug.
> 
> isolate_freepages_block
>   __isolate_free_page
>     if(!zone_watermark_ok())
>       return 0;
>   list_add_tail(&page->lru, freelist);
> 
> However, the page is not isolated.
> Joonsoo?

Good job!
I will fix it soon.

Thanks.

^ permalink raw reply

* Re: [PATCH net-next V2] tun: introduce tx skb ring
From: Jason Wang @ 2016-06-17  7:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, eric.dumazet, netdev, linux-kernel, virtualization, brouer,
	davem
In-Reply-To: <20160617033218-mutt-send-email-mst@redhat.com>



On 2016年06月17日 08:41, Michael S. Tsirkin wrote:
> On Wed, Jun 15, 2016 at 04:38:17PM +0800, Jason Wang wrote:
>> >We used to queue tx packets in sk_receive_queue, this is less
>> >efficient since it requires spinlocks to synchronize between producer
>> >and consumer.
>> >
>> >This patch tries to address this by:
>> >
>> >- introduce a new mode which will be only enabled with IFF_TX_ARRAY
>> >   set and switch from sk_receive_queue to a fixed size of skb
>> >   array with 256 entries in this mode.
>> >- introduce a new proto_ops peek_len which was used for peeking the
>> >   skb length.
>> >- implement a tun version of peek_len for vhost_net to use and convert
>> >   vhost_net to use peek_len if possible.
>> >
>> >Pktgen test shows about 18% improvement on guest receiving pps for small
>> >buffers:
>> >
>> >Before: ~1220000pps
>> >After : ~1440000pps
>> >
>> >The reason why I stick to new mode is because:
>> >
>> >- though resize is supported by skb array, in multiqueue mode, it's
>> >   not easy to recover from a partial success of queue resizing.
>> >- tx_queue_len is a user visible feature.
>> >
>> >Signed-off-by: Jason Wang<jasowang@redhat.com>
> I still think it's wrong to add a new feature for this.
> For example, why 256 entries?

It's the value of virtqueue size supported by qemu.

> Queue len is user visible but it's there precisely for this
> reason so people can tune queue for workload.

Right.

>
> Would it help to have ptr_ring_resize that gets an array of
> rings and resizes them both to same length?

Yes, that would be very helpful.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net-next V2] tun: introduce tx skb ring
From: Michael S. Tsirkin @ 2016-06-17  0:41 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, eric.dumazet, netdev, linux-kernel, virtualization, brouer,
	davem
In-Reply-To: <1465979897-4445-1-git-send-email-jasowang@redhat.com>

On Wed, Jun 15, 2016 at 04:38:17PM +0800, Jason Wang wrote:
> We used to queue tx packets in sk_receive_queue, this is less
> efficient since it requires spinlocks to synchronize between producer
> and consumer.
> 
> This patch tries to address this by:
> 
> - introduce a new mode which will be only enabled with IFF_TX_ARRAY
>   set and switch from sk_receive_queue to a fixed size of skb
>   array with 256 entries in this mode.
> - introduce a new proto_ops peek_len which was used for peeking the
>   skb length.
> - implement a tun version of peek_len for vhost_net to use and convert
>   vhost_net to use peek_len if possible.
> 
> Pktgen test shows about 18% improvement on guest receiving pps for small
> buffers:
> 
> Before: ~1220000pps
> After : ~1440000pps
> 
> The reason why I stick to new mode is because:
> 
> - though resize is supported by skb array, in multiqueue mode, it's
>   not easy to recover from a partial success of queue resizing.
> - tx_queue_len is a user visible feature.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>

I still think it's wrong to add a new feature for this.
For example, why 256 entries?
Queue len is user visible but it's there precisely for this
reason so people can tune queue for workload.

Would it help to have ptr_ring_resize that gets an array of
rings and resizes them both to same length?

> ---
> - The patch is based on [PATCH v8 0/5] skb_array: array based FIFO for skbs
> 
> Changes from V1:
> - switch to use skb array instead of a customized circular buffer
> - add non-blocking support
> - rename .peek to .peek_len
> - drop lockless peeking since test show very minor improvement
> ---
>  drivers/net/tun.c           | 138 ++++++++++++++++++++++++++++++++++++++++----
>  drivers/vhost/net.c         |  16 ++++-
>  include/linux/net.h         |   1 +
>  include/uapi/linux/if_tun.h |   1 +
>  4 files changed, 143 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index e16487c..b22e475 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -71,6 +71,7 @@
>  #include <net/sock.h>
>  #include <linux/seq_file.h>
>  #include <linux/uio.h>
> +#include <linux/skb_array.h>
>  
>  #include <asm/uaccess.h>
>  
> @@ -130,6 +131,7 @@ struct tap_filter {
>  #define MAX_TAP_FLOWS  4096
>  
>  #define TUN_FLOW_EXPIRE (3 * HZ)
> +#define TUN_RING_SIZE 256
>  
>  struct tun_pcpu_stats {
>  	u64 rx_packets;
> @@ -167,6 +169,7 @@ struct tun_file {
>  	};
>  	struct list_head next;
>  	struct tun_struct *detached;
> +	struct skb_array tx_array;
>  };
>  
>  struct tun_flow_entry {
> @@ -513,8 +516,15 @@ static struct tun_struct *tun_enable_queue(struct tun_file *tfile)
>  	return tun;
>  }
>  
> -static void tun_queue_purge(struct tun_file *tfile)
> +static void tun_queue_purge(struct tun_struct *tun, struct tun_file *tfile)
>  {
> +	struct sk_buff *skb;
> +
> +	if (tun->flags & IFF_TX_ARRAY) {
> +		while ((skb = skb_array_consume(&tfile->tx_array)) != NULL)
> +			kfree_skb(skb);
> +	}
> +
>  	skb_queue_purge(&tfile->sk.sk_receive_queue);
>  	skb_queue_purge(&tfile->sk.sk_error_queue);
>  }
> @@ -545,7 +555,7 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
>  		synchronize_net();
>  		tun_flow_delete_by_queue(tun, tun->numqueues + 1);
>  		/* Drop read queue */
> -		tun_queue_purge(tfile);
> +		tun_queue_purge(tun, tfile);
>  		tun_set_real_num_queues(tun);
>  	} else if (tfile->detached && clean) {
>  		tun = tun_enable_queue(tfile);
> @@ -560,6 +570,8 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
>  			    tun->dev->reg_state == NETREG_REGISTERED)
>  				unregister_netdevice(tun->dev);
>  		}
> +		if (tun && tun->flags & IFF_TX_ARRAY)
> +			skb_array_cleanup(&tfile->tx_array);
>  		sock_put(&tfile->sk);
>  	}
>  }
> @@ -596,12 +608,12 @@ static void tun_detach_all(struct net_device *dev)
>  	for (i = 0; i < n; i++) {
>  		tfile = rtnl_dereference(tun->tfiles[i]);
>  		/* Drop read queue */
> -		tun_queue_purge(tfile);
> +		tun_queue_purge(tun, tfile);
>  		sock_put(&tfile->sk);
>  	}
>  	list_for_each_entry_safe(tfile, tmp, &tun->disabled, next) {
>  		tun_enable_queue(tfile);
> -		tun_queue_purge(tfile);
> +		tun_queue_purge(tun, tfile);
>  		sock_put(&tfile->sk);
>  	}
>  	BUG_ON(tun->numdisabled != 0);
> @@ -642,6 +654,13 @@ static int tun_attach(struct tun_struct *tun, struct file *file, bool skip_filte
>  		if (!err)
>  			goto out;
>  	}
> +
> +	if (!tfile->detached && tun->flags & IFF_TX_ARRAY &&
> +	    skb_array_init(&tfile->tx_array, TUN_RING_SIZE, GFP_KERNEL)) {
> +		err = -ENOMEM;
> +		goto out;
> +	}
> +
>  	tfile->queue_index = tun->numqueues;
>  	tfile->socket.sk->sk_shutdown &= ~RCV_SHUTDOWN;
>  	rcu_assign_pointer(tfile->tun, tun);
> @@ -891,8 +910,13 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>  
>  	nf_reset(skb);
>  
> -	/* Enqueue packet */
> -	skb_queue_tail(&tfile->socket.sk->sk_receive_queue, skb);
> +	if (tun->flags & IFF_TX_ARRAY) {
> +		if (skb_array_produce(&tfile->tx_array, skb))
> +			goto drop;
> +	} else {
> +		/* Enqueue packet */
> +		skb_queue_tail(&tfile->socket.sk->sk_receive_queue, skb);
> +	}
>  
>  	/* Notify and wake up reader process */
>  	if (tfile->flags & TUN_FASYNC)
> @@ -1088,6 +1112,17 @@ static void tun_net_init(struct net_device *dev)
>  	}
>  }
>  
> +static int tun_queue_not_empty(struct tun_struct *tun,
> +			       struct tun_file *tfile)
> +{
> +	struct sock *sk = tfile->socket.sk;
> +
> +	if (tun->flags & IFF_TX_ARRAY)
> +		return !skb_array_empty(&tfile->tx_array);
> +	else
> +		return !skb_queue_empty(&sk->sk_receive_queue);
> +}
> +
>  /* Character device part */
>  
>  /* Poll */
> @@ -1107,7 +1142,7 @@ static unsigned int tun_chr_poll(struct file *file, poll_table *wait)
>  
>  	poll_wait(file, sk_sleep(sk), wait);
>  
> -	if (!skb_queue_empty(&sk->sk_receive_queue))
> +	if (tun_queue_not_empty(tun, tfile))
>  		mask |= POLLIN | POLLRDNORM;
>  
>  	if (sock_writeable(sk) ||
> @@ -1481,6 +1516,46 @@ done:
>  	return total;
>  }
>  
> +static struct sk_buff *tun_ring_recv(struct tun_file *tfile, int noblock,
> +				     int *err)
> +{
> +	DECLARE_WAITQUEUE(wait, current);
> +	struct sk_buff *skb = NULL;
> +
> +	skb = skb_array_consume(&tfile->tx_array);
> +	if (skb)
> +		goto out;
> +	if (noblock) {
> +		*err = -EAGAIN;
> +		goto out;
> +	}
> +
> +	add_wait_queue(&tfile->wq.wait, &wait);
> +	current->state = TASK_INTERRUPTIBLE;
> +
> +	while (1) {
> +		skb = skb_array_consume(&tfile->tx_array);
> +		if (skb)
> +			break;
> +		if (signal_pending(current)) {
> +			*err = -ERESTARTSYS;
> +			break;
> +		}
> +		if (tfile->socket.sk->sk_shutdown & RCV_SHUTDOWN) {
> +			*err = -EFAULT;
> +			break;
> +		}
> +
> +		schedule();
> +	};
> +
> +	current->state = TASK_RUNNING;
> +	remove_wait_queue(&tfile->wq.wait, &wait);
> +
> +out:
> +	return skb;
> +}
> +
>  static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
>  			   struct iov_iter *to,
>  			   int noblock)
> @@ -1494,9 +1569,13 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
>  	if (!iov_iter_count(to))
>  		return 0;
>  
> -	/* Read frames from queue */
> -	skb = __skb_recv_datagram(tfile->socket.sk, noblock ? MSG_DONTWAIT : 0,
> -				  &peeked, &off, &err);
> +	if (tun->flags & IFF_TX_ARRAY)
> +		skb = tun_ring_recv(tfile, noblock, &err);
> +	else
> +		/* Read frames from queue */
> +		skb = __skb_recv_datagram(tfile->socket.sk,
> +					  noblock ? MSG_DONTWAIT : 0,
> +					  &peeked, &off, &err);
>  	if (!skb)
>  		return err;
>  
> @@ -1629,8 +1708,39 @@ out:
>  	return ret;
>  }
>  
> +static int tun_peek_len(struct socket *sock)
> +{
> +	struct tun_file *tfile = container_of(sock, struct tun_file, socket);
> +	struct sock *sk = sock->sk;
> +	struct tun_struct *tun;
> +	int ret = 0;
> +
> +	tun = __tun_get(tfile);
> +	if (!tun)
> +		return 0;
> +
> +	if (tun->flags & IFF_TX_ARRAY) {
> +		ret = skb_array_peek_len(&tfile->tx_array);
> +	} else {
> +		struct sk_buff *head;
> +
> +		spin_lock_bh(&sk->sk_receive_queue.lock);
> +		head = skb_peek(&sk->sk_receive_queue);
> +		if (likely(head)) {
> +			ret = head->len;
> +			if (skb_vlan_tag_present(head))
> +				ret += VLAN_HLEN;
> +		}
> +		spin_unlock_bh(&sk->sk_receive_queue.lock);
> +	}
> +
> +	tun_put(tun);
> +	return ret;
> +}
> +
>  /* Ops structure to mimic raw sockets with tun */
>  static const struct proto_ops tun_socket_ops = {
> +	.peek_len = tun_peek_len,
>  	.sendmsg = tun_sendmsg,
>  	.recvmsg = tun_recvmsg,
>  };
> @@ -1643,7 +1753,8 @@ static struct proto tun_proto = {
>  
>  static int tun_flags(struct tun_struct *tun)
>  {
> -	return tun->flags & (TUN_FEATURES | IFF_PERSIST | IFF_TUN | IFF_TAP);
> +	return tun->flags & (TUN_FEATURES | IFF_PERSIST | IFF_TUN |
> +			     IFF_TAP | IFF_TX_ARRAY);
>  }
>  
>  static ssize_t tun_show_flags(struct device *dev, struct device_attribute *attr,
> @@ -1755,6 +1866,9 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>  		} else
>  			return -EINVAL;
>  
> +		if (ifr->ifr_flags & IFF_TX_ARRAY)
> +			flags |= IFF_TX_ARRAY;
> +
>  		if (*ifr->ifr_name)
>  			name = ifr->ifr_name;
>  
> @@ -1995,7 +2109,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
>  		 * This is needed because we never checked for invalid flags on
>  		 * TUNSETIFF.
>  		 */
> -		return put_user(IFF_TUN | IFF_TAP | TUN_FEATURES,
> +		return put_user(IFF_TUN | IFF_TAP | IFF_TX_ARRAY | TUN_FEATURES,
>  				(unsigned int __user*)argp);
>  	} else if (cmd == TUNSETQUEUE)
>  		return tun_set_queue(file, &ifr);
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index f744eeb..236ba52 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -455,10 +455,14 @@ out:
>  
>  static int peek_head_len(struct sock *sk)
>  {
> +	struct socket *sock = sk->sk_socket;
>  	struct sk_buff *head;
>  	int len = 0;
>  	unsigned long flags;
>  
> +	if (sock->ops->peek_len)
> +		return sock->ops->peek_len(sock);
> +
>  	spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
>  	head = skb_peek(&sk->sk_receive_queue);
>  	if (likely(head)) {
> @@ -471,6 +475,16 @@ static int peek_head_len(struct sock *sk)
>  	return len;
>  }
>  
> +static int sk_has_rx_data(struct sock *sk)
> +{
> +	struct socket *sock = sk->sk_socket;
> +
> +	if (sock->ops->peek_len)
> +		return sock->ops->peek_len(sock);
> +
> +	return skb_queue_empty(&sk->sk_receive_queue);
> +}
> +
>  static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
>  {
>  	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
> @@ -487,7 +501,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
>  		endtime = busy_clock() + vq->busyloop_timeout;
>  
>  		while (vhost_can_busy_poll(&net->dev, endtime) &&
> -		       skb_queue_empty(&sk->sk_receive_queue) &&
> +		       !sk_has_rx_data(sk) &&
>  		       vhost_vq_avail_empty(&net->dev, vq))
>  			cpu_relax_lowlatency();
>  
> diff --git a/include/linux/net.h b/include/linux/net.h
> index 9aa49a0..b6b3843 100644
> --- a/include/linux/net.h
> +++ b/include/linux/net.h
> @@ -185,6 +185,7 @@ struct proto_ops {
>  	ssize_t 	(*splice_read)(struct socket *sock,  loff_t *ppos,
>  				       struct pipe_inode_info *pipe, size_t len, unsigned int flags);
>  	int		(*set_peek_off)(struct sock *sk, int val);
> +	int		(*peek_len)(struct socket *sock);
>  };
>  
>  #define DECLARE_SOCKADDR(type, dst, src)	\
> diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
> index 3cb5e1d..080003c 100644
> --- a/include/uapi/linux/if_tun.h
> +++ b/include/uapi/linux/if_tun.h
> @@ -61,6 +61,7 @@
>  #define IFF_TUN		0x0001
>  #define IFF_TAP		0x0002
>  #define IFF_NO_PI	0x1000
> +#define IFF_TX_ARRAY	0x0010
>  /* This flag has no real effect */
>  #define IFF_ONE_QUEUE	0x2000
>  #define IFF_VNET_HDR	0x4000
> -- 
> 2.7.4

^ permalink raw reply

* Re: [PATCH net-next V2] tun: introduce tx skb ring
From: David Miller @ 2016-06-17  0:01 UTC (permalink / raw)
  To: jasowang
  Cc: kvm, eric.dumazet, mst, netdev, linux-kernel, virtualization,
	brouer
In-Reply-To: <1465979897-4445-1-git-send-email-jasowang@redhat.com>

From: Jason Wang <jasowang@redhat.com>
Date: Wed, 15 Jun 2016 16:38:17 +0800

> We used to queue tx packets in sk_receive_queue, this is less
> efficient since it requires spinlocks to synchronize between producer
> and consumer.
> 
> This patch tries to address this by:
> 
> - introduce a new mode which will be only enabled with IFF_TX_ARRAY
>   set and switch from sk_receive_queue to a fixed size of skb
>   array with 256 entries in this mode.
> - introduce a new proto_ops peek_len which was used for peeking the
>   skb length.
> - implement a tun version of peek_len for vhost_net to use and convert
>   vhost_net to use peek_len if possible.
> 
> Pktgen test shows about 18% improvement on guest receiving pps for small
> buffers:
> 
> Before: ~1220000pps
> After : ~1440000pps
> 
> The reason why I stick to new mode is because:
> 
> - though resize is supported by skb array, in multiqueue mode, it's
>   not easy to recover from a partial success of queue resizing.
> - tx_queue_len is a user visible feature.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Michael, can you please review this, especially as this is the first
user of your new infrastructure :-)

^ permalink raw reply

* Re: [PATCH v7 00/12] Support non-lru page migration
From: Minchan Kim @ 2016-06-16 10:09 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Rik van Riel, Sergey Senozhatsky, Naoya Horiguchi,
	Jonathan Corbet, Chan Gyun Jeong, Rafael Aquini, Hugh Dickins,
	linux-kernel, dri-devel, virtualization, John Einar Reitan,
	linux-mm, Chulmin Kim, Gioh Kim, Konstantin Khlebnikov,
	Sangseok Lee, Andrew Morton, Kyeongdon Kim, Joonsoo Kim,
	Vlastimil Babka, Mel Gorman
In-Reply-To: <20160616084211.GA432@swordfish>

On Thu, Jun 16, 2016 at 05:42:11PM +0900, Sergey Senozhatsky wrote:
> On (06/16/16 15:47), Minchan Kim wrote:
> > > [..]
> > > > > this is what I'm getting with the [zsmalloc: keep first object offset in struct page]
> > > > > applied:  "count:0 mapcount:-127". which may be not related to zsmalloc at this point.
> > > > > 
> > > > > kernel: BUG: Bad page state in process khugepaged  pfn:101db8
> > > > > kernel: page:ffffea0004076e00 count:0 mapcount:-127 mapping:          (null) index:0x1
> > > > 
> > > > Hm, it seems double free.
> > > > 
> > > > It doen't happen if you disable zram? IOW, it seems to be related
> > > > zsmalloc migration?
> > > 
> > > need to test more, can't confidently answer now.
> > > 
> > > > How easy can you reprodcue it? Could you bisect it?
> > > 
> > > it takes some (um.. random) time to trigger the bug.
> > > I'll try to come up with more details.
> > 
> > Could you revert [1] and retest?
> > 
> > [1] mm/compaction: split freepages without holding the zone lock
> 
> ok, so this is not related to zsmalloc. finally manged to reproduce
> it. will fork a separate thread.

The reason I mentioned [1] is that it seems to have a bug.

isolate_freepages_block
  __isolate_free_page
    if(!zone_watermark_ok())
      return 0;
  list_add_tail(&page->lru, freelist);

However, the page is not isolated.
Joonsoo?

^ permalink raw reply

* Re: [PATCH v7 00/12] Support non-lru page migration
From: Sergey Senozhatsky @ 2016-06-16  8:42 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, dri-devel, virtualization, linux-mm,
	Chulmin Kim, Sangseok Lee, Konstantin Khlebnikov, Rafael Aquini,
	Jonathan Corbet, Hugh Dickins, Gioh Kim, Joonsoo Kim, Mel Gorman,
	Rik van Riel, Vlastimil Babka, Naoya Horiguchi, Chan Gyun Jeong,
	linux-kernel, John Einar Reitan, Sergey Senozhatsky,
	Andrew Morton, Kyeongdon Kim
In-Reply-To: <20160616064753.GR17127@bbox>

On (06/16/16 15:47), Minchan Kim wrote:
> > [..]
> > > > this is what I'm getting with the [zsmalloc: keep first object offset in struct page]
> > > > applied:  "count:0 mapcount:-127". which may be not related to zsmalloc at this point.
> > > > 
> > > > kernel: BUG: Bad page state in process khugepaged  pfn:101db8
> > > > kernel: page:ffffea0004076e00 count:0 mapcount:-127 mapping:          (null) index:0x1
> > > 
> > > Hm, it seems double free.
> > > 
> > > It doen't happen if you disable zram? IOW, it seems to be related
> > > zsmalloc migration?
> > 
> > need to test more, can't confidently answer now.
> > 
> > > How easy can you reprodcue it? Could you bisect it?
> > 
> > it takes some (um.. random) time to trigger the bug.
> > I'll try to come up with more details.
> 
> Could you revert [1] and retest?
> 
> [1] mm/compaction: split freepages without holding the zone lock

ok, so this is not related to zsmalloc. finally manged to reproduce
it. will fork a separate thread.

	-ss

^ permalink raw reply

* Re: [PATCH net-next V2] tun: introduce tx skb ring
From: Jason Wang @ 2016-06-16  7:08 UTC (permalink / raw)
  To: Jamal Hadi Salim, mst, netdev, linux-kernel, kvm, virtualization,
	davem
  Cc: eric.dumazet, brouer
In-Reply-To: <5761422F.3010303@mojatatu.com>



On 2016年06月15日 19:55, Jamal Hadi Salim wrote:
> On 16-06-15 07:52 AM, Jamal Hadi Salim wrote:
>> On 16-06-15 04:38 AM, Jason Wang wrote:
>>> We used to queue tx packets in sk_receive_queue, this is less
>>> efficient since it requires spinlocks to synchronize between producer
>> So this is more exercising the skb array improvements. For tun
>> it would be useful to see general performance numbers on user/kernel
>> crossing (i.e tun read/write).
>> If you have the cycles can you run such tests?
>>
> Ignore my message - you are running pktgen from a VM towards the host.

Actually reversed, test were done from an external host to VM.

Thanks

> So the numbers you posted are what i was interested in.
> Thanks for the good work.
>
> cheers,
> jamal
>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox